Friday, July 19, 2024

How to Run DeepSeek Coder V2 Locally on Mac

DeepSeek Coder has emerged as a powerful contender, challenging the dominance of well-established models like GPT-4 and Claude. But what exactly is DeepSeek Coder, and why is it causing such a stir in the developer community?

DeepSeek Coder is an innovative series of code language models developed by DeepSeek-AI, an AI research company. These models are designed to push the boundaries of what's possible in AI-assisted coding, aiming to compete with - and in some cases surpass - leading commercial models in code generation capabilities.

💡
Want to try out DeepSeek LLM online? Having trouble switching between multiple AI Subscriptions?

Anakin AI is your all-in-one AI platform that connects everything in one place!
Use GPT-4o Mini without Rate Limits at Anakin AI!

The Evolution of DeepSeek Coder

The journey of DeepSeek Coder is a testament to rapid progress in AI. The latest iteration, DeepSeek Coder V2, builds upon its predecessor with significant improvements. It was further pre-trained with an additional 6 trillion tokens drawn from a high-quality, multi-source corpus. This extensive training has expanded its capabilities dramatically: it now supports 338 programming languages, up from the previous version's 86.

But it's not just about quantity. DeepSeek Coder V2 has also made leaps in terms of context processing. It can now handle contexts of up to 128,000 tokens, a substantial increase from the previous 16,000 token limit. This expanded context window allows the model to understand and generate code for much larger and more complex projects.

The Secret Sauce of DeepSeek Coder V2: Training Data and Architecture

What sets DeepSeek Coder V2 apart is its carefully curated training dataset. The model's knowledge base consists of:

  • 60% source code
  • 10% mathematical data
  • 30% natural language

This diverse mix ensures that the model isn't just good at coding, but also excels in understanding mathematical concepts and natural language instructions - crucial skills for a well-rounded coding assistant.

The code portion of the training data is particularly impressive, containing 1.17 trillion tokens sourced from GitHub and CommonCrawl. This vast repository of real-world code examples gives DeepSeek Coder V2 a robust understanding of various programming paradigms, best practices, and common coding patterns.

Architecturally, DeepSeek Coder V2 employs a Mixture-of-Experts (MoE) approach. This sophisticated design allows for two variants:

  1. A 16-billion-parameter model with only 2.4 billion active parameters
  2. A 236-billion-parameter model with 21 billion active parameters

Both versions have been trained on a staggering 10.2 trillion tokens, giving them a wealth of knowledge to draw from.
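
To get a sense of which variant is realistic on a Mac, here is a rough back-of-envelope sketch of the memory needed just to hold the weights at different precisions (a sketch only; actual usage also depends on activations, the KV cache, and the inference framework):

# Rough estimate of weight memory for each DeepSeek Coder V2 variant.
# Parameter counts are from the release; bytes-per-parameter values are the
# standard sizes for each precision.
variants = {
    "DeepSeek-Coder-V2-Lite (16B total parameters)": 16e9,
    "DeepSeek-Coder-V2 (236B total parameters)": 236e9,
}
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for name, params in variants.items():
    print(name)
    for precision, nbytes in bytes_per_param.items():
        gib = params * nbytes / 1024**3
        print(f"  {precision:>9}: ~{gib:,.0f} GiB of weights")

Even at 4-bit precision, the 236B model needs over 100 GiB for its weights alone, which is why the rest of this guide uses the Lite variant.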

DeepSeek Coder Benchmarks vs GPT-4 and Claude Opus

Now, let's talk about the elephant in the room - how does DeepSeek Coder V2 stack up against industry giants like GPT-4 and Claude? The results are, quite frankly, impressive.

DeepSeek Coder vs. GPT-4 and Claude

In standard coding benchmarks like HumanEval and MBPP, DeepSeek Coder V2 doesn't just keep up with commercial models - it gives them a run for their money. The 236-billion parameter version of DeepSeek Coder V2 achieved an average score of 75.3% across these benchmarks. To put this in perspective:

  • It slightly trails GPT-4o's 76.4%
  • But it outperforms both GPT-4-Turbo and Claude 3 Opus

These results are nothing short of remarkable for an open-source model, especially considering the resources and backing behind models like GPT-4 and Claude.

Mathematical Prowess of DeepSeek Coder

It's not just in pure coding tasks where DeepSeek Coder V2 shines. In mathematical benchmarks such as GSM8K, MATH, and AIME, the model holds its own against leading commercial models. This mathematical aptitude is crucial for tasks involving algorithmic problem-solving and computational thinking.

Language Tasks: Maintaining the Edge

While primarily focused on coding, DeepSeek Coder V2 hasn't neglected natural language processing. In language tasks, it performs similarly to its predecessor, DeepSeek-V2, ensuring that it can understand and respond to complex prompts and instructions effectively.

You can read more about DeepSeek Coder V2 on the official GitHub page:

DeepSeek Coder

How to Set Up DeepSeek Coder V2 on Mac

Now that we've explored what makes DeepSeek Coder V2 special, let's dive into the practical side of things. How can you, as a Mac user, harness the power of this impressive model? Let's walk through the process step by step.

Prerequisites for DeepSeek Coder

Before we begin, make sure you have the following installed on your Mac:

  • Python 3.8 or higher
  • pip (Python package installer)
  • Git
  • Homebrew (optional, but recommended)

Step 1: Setting Up the Environment for DeepSeek Coder

  1. Open Terminal on your Mac.
  2. Create a new directory for your DeepSeek Coder V2 project:
mkdir deepseek_coder_v2
cd deepseek_coder_v2
  3. Create a virtual environment to isolate the project dependencies:
python3 -m venv venv
  4. Activate the virtual environment:
source venv/bin/activate

Step 2: Installing Required Libraries for DeepSeek Coder

  1. Upgrade pip to ensure you have the latest version:
pip install --upgrade pip
  2. Install the required libraries:
pip install torch transformers accelerate
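
Optionally, you can verify the installation and check whether PyTorch can see Apple's Metal (MPS) backend with a quick sanity check like this (not part of the original steps):

# quick_check.py - optional sanity check
import torch

print("PyTorch version:", torch.__version__)
print("MPS (Apple GPU) backend available:", torch.backends.mps.is_available())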

Step 3: Downloading the DeepSeek Coder V2 Model

For this guide, we'll use the DeepSeek-Coder-V2-Lite-Instruct model, which is more suitable for running on a Mac.

  1. Create a new Python script file named download_model.py:
touch download_model.py
  2. Open the file in your preferred text editor and add the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"

print("Downloading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.save_pretrained("./deepseek_coder_v2_tokenizer")

print("Downloading model...")
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model.save_pretrained("./deepseek_coder_v2_model")

print("Model and tokenizer downloaded successfully!")
  3. Run the script to download the model and tokenizer:
python download_model.py

This download is large (tens of gigabytes), so it may take some time depending on your internet connection.
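
Once the script finishes, you can optionally confirm that both directories exist and see how much disk space they occupy with a small helper like this (an optional check, not part of the original guide):

# verify_download.py - optional helper to confirm the saved files are in place
import os

def dir_size_gb(path):
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024**3

for path in ("./deepseek_coder_v2_tokenizer", "./deepseek_coder_v2_model"):
    if os.path.isdir(path):
        print(f"{path}: {dir_size_gb(path):.1f} GB")
    else:
        print(f"{path}: missing - re-run download_model.py")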

Step 4: Creating a Script to Run DeepSeek Coder V2

  1. Create a new Python script file named run_deepseek_coder.py:
touch run_deepseek_coder.py
  2. Open the file in your text editor and add the following code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def generate_code(prompt, max_length=512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Load the model and tokenizer
print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("./deepseek_coder_v2_tokenizer", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("./deepseek_coder_v2_model", trust_remote_code=True, torch_dtype=torch.float32)

# Move the model to Apple's Metal (MPS) GPU if available, otherwise use the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)

print(f"Model loaded and running on {device}")

# Main interaction loop
while True:
    user_input = input("Enter your coding prompt (or 'quit' to exit): ")
    if user_input.lower() == 'quit':
        break
    
    generated_code = generate_code(user_input)
    print("\nGenerated Code:")
    print(generated_code)
    print("\n" + "="*50 + "\n")

print("Thank you for using DeepSeek Coder V2!")

Step 5: Running DeepSeek Coder V2

  1. Execute the script:
python run_deepseek_coder.py

Wait for the model to load. You'll see a message indicating whether the model is running on the CPU or on the Apple GPU (MPS), if available.

Enter your coding prompts when prompted. The model will generate code based on your input.

To exit the program, type 'quit' when prompted for input.

Optimizing DeepSeek Coder Performance on Mac

Running large language models like DeepSeek Coder V2 on a Mac can be resource-intensive. Here are some tips to optimize performance:

Use GPU acceleration: Macs do not support CUDA. On Apple Silicon, PyTorch can instead use the Metal Performance Shaders (MPS) backend; make sure you are on a recent PyTorch release and that torch.backends.mps.is_available() returns True.

Adjust model size: If you're experiencing slow performance, consider using a smaller model such as the first-generation deepseek-ai/deepseek-coder-1.3b-instruct.

Limit generation length: Modify the max_length parameter in the generate_code function to reduce the maximum output length.

Use quantization: Implement quantization techniques to reduce model size and improve inference speed.
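
Putting the first three tips together, here is a hedged sketch of a more Mac-friendly loading configuration (it assumes an Apple Silicon machine and the same local model directory used earlier; adjust the dtype if you run into numerical issues):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Prefer Apple's MPS backend when it is available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("./deepseek_coder_v2_tokenizer", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "./deepseek_coder_v2_model",
    trust_remote_code=True,
    # float16 halves the weight memory compared with float32; switch back to
    # torch.float32 if you hit numerical issues on your machine.
    torch_dtype=torch.float16 if device.type == "mps" else torch.float32,
)
model.to(device)

# Keeping max_new_tokens modest also shortens generation time.
def generate_code(prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)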

Advanced Usage: Code Completion with DeepSeek Coder

Let's enhance our script to support code completion using the Fill-in-the-Middle (FIM) technique. Update the run_deepseek_coder.py script with the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def generate_code(prompt, max_length=512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def complete_code(code, max_length=512):
    # Split the snippet at the <FILL> marker into prefix and suffix, then wrap
    # them in DeepSeek Coder's fill-in-the-middle (FIM) prompt format.
    prefix, _, suffix = code.partition("<FILL>")
    fim_prompt = f"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"
    inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=max_length)
    # Decode only the newly generated tokens - the code that fills the hole.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Load the model and tokenizer
print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("./deepseek_coder_v2_tokenizer", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("./deepseek_coder_v2_model", trust_remote_code=True, torch_dtype=torch.float32)

# Move the model to Apple's Metal (MPS) GPU if available, otherwise use the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)

print(f"Model loaded and running on {device}")

# Main interaction loop
while True:
    print("\nChoose an option:")
    print("1. Generate code from prompt")
    print("2. Complete code")
    print("3. Quit")
    
    choice = input("Enter your choice (1/2/3): ")
    
    if choice == '1':
        user_input = input("Enter your coding prompt: ")
        generated_code = generate_code(user_input)
        print("\nGenerated Code:")
        print(generated_code)
    elif choice == '2':
        user_code = input("Enter the code to complete (use <FILL> where you want completion): ")
        completed_code = complete_code(user_code)
        print("\nCompleted Code:")
        print(completed_code)
    elif choice == '3':
        break
    else:
        print("Invalid choice. Please try again.")
    
    print("\n" + "="*50 + "\n")

print("Thank you for using DeepSeek Coder V2!")

This updated script adds a code completion feature using the Fill-in-the-Middle technique. Users can now choose between generating code from a prompt or completing existing code.
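
For example, when choosing option 2 you could enter a one-line snippet such as the following (the <FILL> marker is just this script's convention; it is swapped for the model's fill-in-the-middle hole token before generation):

def factorial(n): return 1 if n <= 1 else <FILL>

The text the model returns is the code that belongs in that gap.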

Handling Long Context with DeepSeek Coder

DeepSeek Coder V2 supports a context length of up to 128K tokens. To take advantage of this for handling larger codebases or more complex prompts, we can modify our script to process longer inputs.

  1. Update the run_deepseek_coder.py script with the following modifications:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def generate_code(prompt, max_length=1024, max_new_tokens=512):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=False, padding=False).to(model.device)
    outputs = model.generate(**inputs, max_length=max_length, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def complete_code(code, max_length=1024, max_new_tokens=512):
    # Split the snippet at the <FILL> marker and wrap it in the FIM prompt.
    prefix, _, suffix = code.partition("<FILL>")
    fim_prompt = f"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"
    inputs = tokenizer(fim_prompt, return_tensors="pt", truncation=False, padding=False).to(model.device)
    outputs = model.generate(**inputs, max_length=max_length, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens - the code that fills the hole.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Load the model and tokenizer
print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("./deepseek_coder_v2_tokenizer", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "./deepseek_coder_v2_model",
    trust_remote_code=True,
    torch_dtype=torch.float32,
    max_position_embeddings=128000  # Set maximum context length
)

# Move the model to Apple's Metal (MPS) GPU if available, otherwise use the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)

print(f"Model loaded and running on {device}")

# Main interaction loop
while True:
    print("\nChoose an option:")
    print("1. Generate code from prompt")
    print("2. Complete code")
    print("3. Process large codebase")
    print("4. Quit")
    
    choice = input("Enter your choice (1/2/3/4): ")
    
    if choice == '1':
        user_input = input("Enter your coding prompt: ")
        generated_code = generate_code(user_input)
        print("\nGenerated Code:")
        print(generated_code)
    elif choice == '2':
        user_code = input("Enter the code to complete (use <FILL> where you want completion): ")
        completed_code = complete_code(user_code)
        print("\nCompleted Code:")
        print(completed_code)
    elif choice == '3':
        file_path = input("Enter the path to your large codebase file: ")
        try:
            with open(file_path, 'r') as file:
                large_codebase = file.read()
            prompt = f"Analyze and improve the following large codebase:\n\n{large_codebase}\n\nProvide suggestions for improvements, optimizations, and best practices:"
            analysis = generate_code(prompt, max_length=128000, max_new_tokens=1024)
            print("\nCodebase Analysis and Suggestions:")
            print(analysis)
        except FileNotFoundError:
            print(f"Error: File not found at {file_path}")
    elif choice == '4':
        break
    else:
        print("Invalid choice. Please try again.")
    
    print("\n" + "="*50 + "\n")

print("Thank you for using DeepSeek Coder V2!")

This updated script includes the following enhancements:

  • Increased max_length and added max_new_tokens parameters to handle longer inputs and outputs.
  • Set max_position_embeddings to 128000 when loading the model to utilize the full context length.
  • Added a new option to process large codebases by reading from a file.

Implementing Streaming Generation

To improve the user experience, especially for longer generations, we can implement streaming generation. This allows the model to output tokens as they are generated, providing a more interactive experience.

  1. Update the run_deepseek_coder.py script with the following modifications:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread
import os

def generate_code_stream(prompt, max_length=1024, max_new_tokens=512):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=False, padding=False).to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)  # stream only the newly generated text, not the prompt
    
    generation_kwargs = dict(
        **inputs,
        max_length=max_length,
        max_new_tokens=max_new_tokens,
        streamer=streamer
    )
    
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    
    generated_text = ""
    for new_text in streamer:
        generated_text += new_text
        print(new_text, end='', flush=True)
    
    print("\n")
    return generated_text

def complete_code_stream(code, max_length=1024, max_new_tokens=512):
    # Split the snippet at the <FILL> marker and wrap it in the FIM prompt.
    prefix, _, suffix = code.partition("<FILL>")
    fim_prompt = f"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"
    return generate_code_stream(fim_prompt, max_length, max_new_tokens)

def process_large_codebase(file_path, max_length=128000, max_new_tokens=1024):
    with open(file_path, 'r') as file:
        large_codebase = file.read()
    prompt = f"Analyze and improve the following large codebase:\n\n{large_codebase}\n\nProvide suggestions for improvements:"
    return generate_code_stream(prompt, max_length, max_new_tokens)

# Load the model and tokenizer
print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("./deepseek_coder_v2_tokenizer", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "./deepseek_coder_v2_model",
    trust_remote_code=True,
    torch_dtype=torch.float32,
    max_position_embeddings=128000
)

# Move the model to Apple's Metal (MPS) GPU if available, otherwise use the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)

print(f"Model loaded and running on {device}")

# Main interaction loop
while True:
    print("\nChoose an option:")
    print("1. Generate code from prompt (streaming)")
    print("2. Complete code (streaming)")
    print("3. Process large codebase (streaming)")
    print("4. Quit")
    
    choice = input("Enter your choice (1/2/3/4): ")
    
    if choice == '1':
        user_input = input("Enter your coding prompt: ")
        print("\nGenerated Code:")
        generate_code_stream(user_input)
    elif choice == '2':
        user_code = input("Enter the code to complete (use <FILL> where you want completion): ")
        print("\nCompleted Code:")
        complete_code_stream(user_code)
    elif choice == '3':
        file_path = input("Enter the path to your large codebase file: ")
        if os.path.exists(file_path):
            print("\nAnalyzing and improving the codebase:")
            process_large_codebase(file_path)
        else:
            print(f"Error: File not found at {file_path}")
    elif choice == '4':
        print("Thank you for using DeepSeek Coder V2!")
        break
    else:
        print("Invalid choice. Please try again.")

This complete script includes the following components:

  1. Necessary imports, including os for file path handling.
  2. The generate_code_stream function for streaming code generation.
  3. The complete_code_stream function for code completion using the Fill-in-the-Middle technique.
  4. A new process_large_codebase function to handle large codebases.
  5. Model and tokenizer loading with the specified parameters.
  6. The main interaction loop with four options:
  • Generate code from a prompt
  • Complete code
  • Process a large codebase
  • Quit the program

The script now properly handles file path checking for the large codebase option and includes error handling for invalid file paths. It also provides a clean exit option when the user chooses to quit.

To use this script, make sure you have the DeepSeek Coder V2 model and tokenizer files in the same directory as this script, or update the paths in the from_pretrained calls accordingly.

Advanced Features and Use Cases

Now that we have implemented streaming generation, let's explore some advanced features and use cases of DeepSeek Coder V2 that showcase its capabilities.

Multi-Language Support

DeepSeek Coder V2 supports an impressive 338 programming languages, a significant increase from its predecessor. To leverage this feature, we can modify our script to allow users to specify the target programming language:

def generate_code_stream(prompt, language, max_length=1024, max_new_tokens=512):
    full_prompt = f"Generate code in {language}:\n\n{prompt}"
    # ... (rest of the function is the same as before, except that full_prompt
    #      is passed to the tokenizer instead of prompt)

# In the main loop:
if choice == '1':
    user_input = input("Enter your coding prompt: ")
    language = input("Enter the target programming language: ")
    print(f"\nGenerating {language} code:")
    generate_code_stream(user_input, language)

Mathematical Reasoning

DeepSeek Coder V2 excels in mathematical reasoning tasks. We can add a new option to our script for solving mathematical problems:

def solve_math_problem(problem, max_length=1024, max_new_tokens=512):
    prompt = f"Solve the following mathematical problem step by step:\n\n{problem}"
    return generate_code_stream(prompt, "LaTeX", max_length, max_new_tokens)

# Add this option to the main loop:
elif choice == '5':
    math_problem = input("Enter a mathematical problem: ")
    print("\nSolution:")
    solve_math_problem(math_problem)

Code Refactoring

DeepSeek Coder V2 can assist in code refactoring tasks. Let's add a feature to suggest improvements for existing code:

def refactor_code(code, max_length=1024, max_new_tokens=512):
    prompt = f"Refactor and improve the following code:\n\n{code}\n\nProvide explanations for the changes:"
    return generate_code_stream(prompt, "markdown", max_length, max_new_tokens)

# Add this option to the main loop:
elif choice == '6':
    code_to_refactor = input("Enter the code to refactor: ")
    print("\nRefactored code and explanations:")
    refactor_code(code_to_refactor)

Project-Level Code Completion

DeepSeek Coder V2's 128K context window allows for project-level code completion. We can enhance our script to handle multiple files:

import os

def complete_project(project_dir, max_length=128000, max_new_tokens=1024):
    project_files = []
    for root, _, files in os.walk(project_dir):
        for file in files:
            if file.endswith(('.py', '.js', '.java', '.cpp')):  # Add more extensions as needed
                with open(os.path.join(root, file), 'r') as f:
                    project_files.append(f"File: {file}\n\n{f.read()}\n\n")
    
    project_content = "".join(project_files)
    prompt = f"Analyze the following project and suggest completions or improvements:\n\n{project_content}"
    return generate_code_stream(prompt, "markdown", max_length, max_new_tokens)

# Add this option to the main loop:
elif choice == '7':
    project_path = input("Enter the path to your project directory: ")
    print("\nAnalyzing project and generating suggestions:")
    complete_project(project_path)
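
Because even a 128K-token window can be exhausted by a large project, it can be worth checking the token count before sending everything to the model. Here is a minimal check reusing the already-loaded tokenizer (a sketch, not part of the original script):

def count_tokens(text):
    # Number of tokens as seen by the DeepSeek Coder V2 tokenizer.
    return len(tokenizer(text)["input_ids"])

# Example usage inside complete_project, before calling generate_code_stream:
# n_tokens = count_tokens(project_content)
# if n_tokens > 128000:
#     print(f"Warning: project is {n_tokens} tokens, above the 128K context window")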

Performance Optimization

To further optimize the performance of DeepSeek Coder V2 on Mac, consider the following techniques:

  1. Quantization: 8-bit quantization can reduce memory usage and increase inference speed. Note that the bitsandbytes approach below requires an NVIDIA GPU with CUDA, so it will not run on Apple Silicon Macs; it is included for cases where you also deploy the model on a CUDA machine:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "./deepseek_coder_v2_model",
    trust_remote_code=True,
    device_map="auto",
    quantization_config=quantization_config
)

  2. Caching: Implement a caching mechanism for frequently used prompts or code snippets to reduce redundant computations (a minimal sketch follows the batch-processing example below).

  3. Batch Processing: For tasks involving multiple inputs, implement batch processing to leverage parallel computation:

def batch_generate_code(prompts, max_length=1024, max_new_tokens=512):
    # Padding requires a pad token; reuse the end-of-sequence token if none is defined.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(model.device)
    outputs = model.generate(**inputs, max_length=max_length, max_new_tokens=max_new_tokens)
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
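
As a minimal sketch of the caching idea from item 2 (assuming the default greedy decoding, which makes generation deterministic for a given prompt), you can memoize the generation function:

from functools import lru_cache

@lru_cache(maxsize=128)
def cached_generate_code(prompt, max_new_tokens=512):
    # Identical prompts return the cached result instead of re-running the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)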

Conclusion

DeepSeek Coder V2 represents a significant advancement in open-source code intelligence models. Its impressive performance across various coding tasks, mathematical reasoning, and general language understanding makes it a versatile tool for developers and researchers alike.

By following this guide, you've learned how to set up and run DeepSeek Coder V2 on your Mac, implement advanced features like streaming generation and project-level code completion, and optimize its performance. The model's ability to handle a wide range of programming languages, coupled with its large context window, positions it as a powerful assistant for complex coding tasks and project-wide analysis.

As you continue to explore DeepSeek Coder V2, remember to stay updated with the latest releases and community contributions. The open-source nature of this model encourages collaboration and continuous improvement, potentially leading to even more impressive capabilities in the future.

💡
Want to create your own Agentic AI Workflow with No Code?

You can easily create AI workflows with Anakin AI without any coding knowledge. Connect to LLM APIs such as: GPT-4, Claude 3.5 Sonnet, Uncensored Dolphin-Mixtral, Stable Diffusion, DALLE, Web Scraping.... into One Workflow!

Forget about complicated coding, automate your mundane work with Anakin AI!

For a limited time, you can also use Google Gemini 1.5 and Stable Diffusion for Free!
Easily Build AI Agentic Workflows with Anakin AI


from Anakin Blog http://anakin.ai/blog/how-to-run-deepseek-coder-v2-locally-on-mac/
via IFTTT
