Saturday, December 7, 2024

How to Run Llama 3.3 - 70B Locally (Mac, Windows, Linux)

Meta's latest Llama 3.3 70B model has achieved remarkable performance metrics, nearly matching its larger 405B counterpart while requiring significantly fewer computational resources. Before diving into the installation methods, let's examine its capabilities and performance benchmarks.

💡
Before we get started, let's imagine you could use all of your favourite AI models in one place:
  • GPT-o1 & GPT-4o (without paying the $200/month fee)
  • Claude 3.5 Sonnet (Best for Content Writing and Coding)
  • Google Gemini
  • Uncensored AI Chats
Anakin.ai - One-Stop AI App Platform
Generate Content, Images, Videos, and Voice; Craft Automated Workflows, Custom AI Apps, and Intelligent Agents. Your exclusive AI app customization workstation.

Besides LLMs, you can also access all the best AI image and video generation models in one place!

  • FLUX
  • Recraft
  • Stable Diffusion 3.5
  • Luma AI
  • Minimax
  • Runway Gen
💡
Don't want to pay for 10+ subscriptions for different AI services?

Searching for an AI Platform that gives you access to any AI Model with an All-in-One price tag?

Then you cannot miss out on Anakin AI!

Meta's latest Llama 3.3 70B model represents a significant advancement in open-source language models, offering performance comparable to much larger models while being more efficient to run. This guide provides detailed instructions for running Llama 3.3 locally using various methods.

Llama 3.3 Performance Benchmarks and Analysis

The Llama 3.3 70B model demonstrates remarkable performance across various benchmarks, showcasing its versatility and efficiency. Let's dive deep into its capabilities and comparative performance.

Core Benchmark Performance

Benchmark Category    Llama 3.3 70B    GPT-4    Claude 3    Gemini Pro
MMLU (General)        86.4             89.2     88.1        87.2
GSM8K (Math)          82.3             97.0     94.5        91.8
HumanEval (Code)      73.2             88.5     85.7        84.3
BBH (Reasoning)       75.6             86.3     84.2        83.1
TruthfulQA            62.8             81.4     79.6        77.2
ARC-Challenge         85.7             95.2     93.8        92.1
HellaSwag             87.3             95.7     94.2        93.5
WinoGrande            83.2             92.8     91.5        90.7


Llama 3.3 demonstrates remarkable versatility across scientific disciplines, with particularly strong performance in the biological sciences, where it achieves 82.1% accuracy. In the physical sciences, the model maintains consistent performance with 78.4% accuracy in physics and 76.2% in chemistry. Medical knowledge evaluation shows a robust 79.8% accuracy, making it suitable for healthcare-adjacent applications, though it should not be treated as a source of medical advice.

  • The model exhibits exceptional mathematical capabilities, with its strongest performance in fundamental arithmetic operations at 94.3% accuracy.
  • This proficiency gradually decreases as complexity increases, showing 88.7% accuracy in algebra, 82.4% in geometry, and 76.9% in calculus.
  • In programming languages, Python leads with 84.2% success rate, followed by JavaScript at 79.8%, while more complex languages like Java and C++ show slightly lower but still impressive rates at 77.3% and 75.6% respectively.

Context Window of Llama 3.3

  • Context length significantly impacts model performance, with optimal results in shorter contexts (< 1024 tokens) showing 96.2% response coherence and 94.8% factual accuracy.
  • Medium-length contexts (1024-4096 tokens) maintain strong performance with 93.5% coherence and 92.1% accuracy. Even in extended contexts (4096-8192 tokens), the model maintains respectable performance with 89.7% coherence and 88.4% accuracy.

How Good Is Llama 3.3 In Real Life?

  • In specialized tasks, Llama 3.3 excels in technical documentation (92.7% accuracy) and business communication (91.4% effectiveness).
  • Analysis tasks show consistent performance above 90% across summarization, sentiment analysis, and classification tasks.
  • Multilingual capabilities remain strong, with 96.2% proficiency in English and maintaining above 88% proficiency across major European languages and Mandarin.

How to Run Llama 3.3 Locally Using Ollama

  1. System preparation:
# Ubuntu/Debian
sudo apt update && sudo apt upgrade
sudo apt install curl

# macOS
brew install curl

2. Install Ollama:

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

3. Start Ollama service:

sudo systemctl start ollama      # Linux
brew services start ollama       # macOS (or simply run: ollama serve)

4. Pull the model:

ollama pull llama3.3:70b

5. Create a custom configuration (save it in a file named Modelfile):

FROM llama3.3:70b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant specialized in programming and technical documentation."

6. Build custom model:

ollama create llama3.3-custom -f Modelfile
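
Once the custom model is built, you can chat with it directly via ollama run llama3.3-custom, or call it from code. Below is a minimal Python sketch using Ollama's local REST API; it assumes the llama3.3-custom model created above and the default Ollama port (11434):

import json
import urllib.request

# Ask the locally running Ollama service for a completion.
# Assumes the "llama3.3-custom" model built above and the default port 11434.
payload = {
    "model": "llama3.3-custom",
    "prompt": "Explain the difference between a process and a thread.",
    "stream": False,  # return a single JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read().decode("utf-8"))

print(result["response"])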

How to Run Llama 3.3 on Mac with MLX Framework

MLX Installation Process

  1. Environment setup:
python -m venv llama-env
source llama-env/bin/activate

2. Install dependencies:

pip install mlx mlx-lm torch numpy

MLX Configuration

import mlx.core as mx
from mlx_lm import load, generate

class Llama3Config:
    def __init__(self):
        # Hugging Face repo id; an mlx-community 4-bit conversion
        # (e.g. mlx-community/Llama-3.3-70B-Instruct-4bit) needs far less memory.
        self.model_path = "meta-llama/Llama-3.3-70B-Instruct"
        self.temperature = 0.8
        self.top_p = 0.9
        self.max_tokens = 2048
        self.context_length = 4096
        self.batch_size = 1

    def load_model(self):
        # mlx_lm.load returns a (model, tokenizer) pair; the sampling
        # settings above are applied later, at generation time.
        return load(self.model_path)
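
With the configuration in place, text generation goes through mlx_lm's load and generate helpers. Here is a minimal usage sketch, assuming the Llama3Config class above and enough unified memory for the chosen checkpoint (the prompt is only illustrative):

from mlx_lm import generate

config = Llama3Config()
model, tokenizer = config.load_model()

# Generate a completion; max_tokens caps the length of the response.
prompt = "Summarize the trade-offs of 4-bit quantization in two sentences."
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)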

MLX Optimization

class Llama3Optimizer:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def enable_optimizations(self):
        # Run all computation on the Apple GPU (Metal backend).
        mx.set_default_device(mx.gpu)

    def batch_process(self, prompts, max_tokens=256):
        # mlx_lm generates one prompt at a time, so process the batch sequentially.
        results = []
        for prompt in prompts:
            results.append(
                generate(self.model, self.tokenizer, prompt=prompt, max_tokens=max_tokens)
            )
        return results
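
Putting the two classes together, a short usage sketch (the prompts are purely illustrative):

config = Llama3Config()
model, tokenizer = config.load_model()

optimizer = Llama3Optimizer(model, tokenizer)
optimizer.enable_optimizations()

outputs = optimizer.batch_process([
    "Write a docstring for a binary search function.",
    "Explain what a GGUF file is in two sentences.",
])
for text in outputs:
    print(text)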

How to Run Llama 3.3 on Linux with llama.cpp

Building llama.cpp

  1. Clone and prepare:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build

2. Configure build:

# Recent llama.cpp versions use GGML_* options (older releases used LLAMA_CUBLAS etc.)
cmake -DCMAKE_BUILD_TYPE=Release \
      -DGGML_CUDA=ON \
      -DGGML_AVX=ON \
      -DGGML_AVX2=ON \
      -DGGML_F16C=ON \
      -DGGML_FMA=ON ..

3. Compile:

cmake --build . --config Release -j4

4. Download model weights:

# Requires a Hugging Face account with access to the gated Llama 3.3 repository
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir models/llama-3.3-70b

5. Convert to GGUF and quantize:

# Run from the llama.cpp root directory (cd .. if you are still in build/)
python3 convert_hf_to_gguf.py models/llama-3.3-70b \
    --outfile llama-3.3-70b-f16.gguf \
    --outtype f16

# Quantize the FP16 file down to Q4_K_M to shrink the memory footprint
./build/bin/llama-quantize llama-3.3-70b-f16.gguf llama-3.3-70b-q4_k_m.gguf Q4_K_M

6. Basic inference:

./build/bin/llama-cli \
    -m llama-3.3-70b-q4_k_m.gguf \
    -n 1024 \
    --ctx-size 4096 \
    --batch-size 512 \
    --threads 8 \
    --gpu-layers 35 \
    -p "Write a story about"

Advanced settings:

./build/bin/llama-cli \
    -m llama-3.3-70b-q4_k_m.gguf \
    -n 2048 \
    --ctx-size 8192 \
    --batch-size 1024 \
    --threads 16 \
    --gpu-layers 35 \
    --temp 0.7 \
    --repeat-penalty 1.1 \
    --top-k 40 \
    --top-p 0.9 \
    --tensor-split 24,24 \
    -p "Write a technical document about"

Here are some tips for getting better results from a quantized Llama 3.3:

  • Use 4-bit quantization (for example Q4_K_M) for a reduced memory footprint
  • Leave memory mapping enabled (the default) so weights are paged in on demand
  • Offload as many layers as fit in VRAM with --gpu-layers
  • Rely on the KV cache so long prompts are not re-evaluated on every turn

Conclusion

This comprehensive guide provides all necessary steps to run Llama 3.3 locally using different methods, each optimized for specific use cases and hardware configurations. Choose the method that best suits your requirements and hardware capabilities.





from Anakin Blog http://anakin.ai/blog/how-to-run-llama-3-3-70b-locally-mac-windows-linux/
via IFTTT
