Implementing DeepSeek-R1 Locally through Llama.cpp

DeepSeek’s R1 model revolutionizes AI reasoning, balancing reinforcement learning with structured training techniques.

The release of DeepSeek-R1 on January 20, 2025 sent shockwaves through the stock market, particularly impacting technology companies that are heavily invested in AI. Major US tech companies like NVIDIA, Broadcom, and AMD saw significant drops in their stock prices, with NVIDIA experiencing a record one-day market-capitalization loss of over $500 billion. This disruption was driven by DeepSeek-R1’s competitive performance and significantly lower cost, which challenged the dominance of existing players like OpenAI and NVIDIA. While the initial market reaction was dramatic, the long-term implications of DeepSeek-R1 are still unfolding and include accelerated AI adoption and a shift in market dynamics. This article explores the DeepSeek-R1 LLM and its local implementation using Llama.cpp. 

Table of Contents

  1. Understanding DeepSeek-R1
  2. Core Features of DeepSeek-R1
  3. DeepSeek-R1 Evaluation & Benchmarks
  4. Hands-on Implementation of DeepSeek-R1 through Llama.cpp

Understanding DeepSeek-R1

DeepSeek-R1-Zero and DeepSeek-R1 are the two first-generation reasoning models introduced by DeepSeek on January 20, 2025. DeepSeek-R1-Zero is a model trained using large-scale reinforcement learning without supervised fine-tuning as a preliminary step, and it demonstrates remarkable reasoning capabilities. Through reinforcement learning, DeepSeek-R1-Zero naturally develops numerous powerful and interesting reasoning behaviors. However, it faces challenges such as poor readability and language mixing.

To address the limitations of DeepSeek-R1-Zero and further enhance reasoning performance, DeepSeek-R1 was introduced. It incorporates multi-stage training and cold-start data before reinforcement learning. Cold-start data refers to a small set of high-quality curated data used to initially fine-tune the model before fully transitioning to reinforcement learning. This helps stabilize the early training process and lays a foundation for strong reasoning abilities by supplying clear examples of detailed, step-by-step explanations for complex problems (Chain of Thought).

DeepSeek-R1’s performance is comparable to OpenAI-o1-1217 on reasoning tasks. Six dense models distilled from DeepSeek-R1 (1.5B, 7B, 8B, 14B, 32B, 70B) based on Qwen and Llama were also released along with DeepSeek-R1-Zero and DeepSeek-R1. 

DeepSeek-R1 Benchmarks

DeepSeek-R1 uses thousands of cold-start examples to fine-tune the DeepSeek-V3-Base model. It then applies the same reasoning-oriented reinforcement learning as DeepSeek-R1-Zero. Upon nearing convergence in the reinforcement learning process, new supervised fine-tuning data is created through rejection sampling on the reinforcement-learning checkpoint, combined with supervised data from DeepSeek-V3 in domains such as factual QA, self-cognition, and writing, and the DeepSeek-V3-Base model is then retrained. After fine-tuning with this new data, the checkpoint undergoes additional reinforcement learning, taking into account prompts from all scenarios. After these steps, a checkpoint referred to as DeepSeek-R1 is obtained, which achieves performance on par with the OpenAI-o1-1217 model. 

Using Qwen2.5-32B as the base model, direct distillation from DeepSeek-R1 outperforms applying reinforcement learning to it directly. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. Unlike DeepSeek-R1-Zero, to avoid the unstable early cold-start phase of reinforcement learning from the base model, DeepSeek-R1 uses a small amount of long Chain-of-Thought data to fine-tune the model as the initial reinforcement-learning actor. To collect this data, several techniques were explored: few-shot prompting with a long Chain-of-Thought as an example, directly prompting the model to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators. 

DeepSeek-R1’s reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, logical reasoning, and science (which involve well-defined problems with clear solutions), are enhanced via the same large-scale reinforcement-learning process applied in DeepSeek-R1-Zero, after DeepSeek-V3-Base has been fine-tuned on the cold-start data. To mitigate language mixing during this training, a language-consistency reward is introduced, calculated as the proportion of target-language words in the Chain-of-Thought. The accuracy reward for reasoning tasks and the language-consistency reward are then summed to form the final reward, and reinforcement learning is applied to the fine-tuned model until it converges on reasoning problems. 
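
In our own notation (the paper does not spell out an explicit formula for this), the combined reward described above can be written as

\[ r_{\text{final}} = r_{\text{accuracy}} + r_{\text{language}}, \qquad r_{\text{language}} = \frac{\text{number of target-language words in the CoT}}{\text{total number of words in the CoT}} \]

so a response is rewarded both for reaching the correct answer and for keeping its Chain-of-Thought in a single target language.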

When the reasoning-oriented reinforcement learning converges, the resulting checkpoint is used to collect supervised fine-tuning data for the subsequent round. This stage focuses on data from other domains to enhance the model's capabilities in writing and other general-purpose tasks. 

To further align the model with human preferences, a secondary reinforcement-learning stage is implemented, focused on improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. The model is trained using a combination of reward signals and diverse prompt distributions. For reasoning data, rule-based rewards guide the process in the math, code, and logical-reasoning domains. For general data, reward models are used to capture human preferences in complex scenarios. 

For helpfulness, the prime focus is on the final summary, ensuring that the evaluation highlights the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, the model's entire response, including both the reasoning process and the summary, is evaluated to identify and mitigate any potential risks, biases, or harmful content that may arise during generation.  

DeepSeek-R1 Training Process

Core Features of DeepSeek-R1

DeepSeek-R1 Evaluation & Benchmarks

DeepSeek-R1 is evaluated on the following benchmarks – 

  1. Massive Multitask Language Understanding (MMLU), which covers 57 diverse subject areas ranging from elementary math and history to more advanced topics such as law, philosophy, and medicine, with roughly 16,000 multiple-choice questions (MCQs). 
  2. MMLU-Redux, a re-annotated version of MMLU that provides a more accurate and reliable benchmark for evaluating LLM performance.
  3. C-Eval, a comprehensive Chinese evaluation suite for foundation models. It comprises 13,948 MCQs covering 52 diverse disciplines across four difficulty levels. 
  4. CMMLU is another comprehensive benchmark, Chinese Massive Multitask Language Understanding, designed to assess the knowledge and reasoning abilities of LLMs specifically within the context of the Chinese language and culture. 
  5. IFEval, which stands for Instruction Following Evaluation. It is a benchmark specifically designed to assess how well LLMs can follow explicit instructions. 
  6. FRAMES is an evaluation dataset designed to test the capabilities of RAG systems across factuality, retrieval accuracy, and reasoning. It stands for Factuality, Retrieval, and Reasoning Measurement Set. 
  7. GPQA, which stands for Graduate-Level Google-Proof Q&A Benchmark, is used to test an LLM’s ability to answer difficult graduate-level science questions. 
  8. SimpleQA is used for testing LLM’s ability to provide factual answers to simple questions. 
  9. C-SimpleQA is the Chinese counterpart of SimpleQA.
  10. SWE-Bench Verified is a human-validated subset of SWE-Bench that more reliably evaluates an LLM’s ability to solve real-world software engineering tasks and issues. 
  11. Aider is an AI-powered coding assistant and benchmark designed to evaluate how well LLMs can help with real-world coding tasks.  
  12. LiveCodeBench is a benchmark designed to evaluate the coding capabilities of LLMs in a more dynamic and realistic setting compared to traditional static code generation tests. 
  13. Codeforces is a popular platform for competitive programming, using it as a benchmark means evaluating how well LLMs can perform in coding competitions. 
  14. CNMO 2024 is the Chinese National Mathematical Olympiad 2024 which evaluates an LLM’s ability to solve complex mathematical problems, similar to other math-focused benchmarks like MATH-500 or AIME. 
  15. AIME 2024 stands for the American Invitational Mathematics Examination 2024, which tests how well LLMs can solve problems involving mathematics and logical reasoning. 

The evaluation setup sets the maximum generation length to 32,768 tokens for the models. It was found that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates and significant variability across different checkpoints. Therefore, pass@k evaluation was used by default and pass@1 was reported using a non-zero temperature. A sampling temperature of 0.6 and a top-p value of 0.95 were used to generate k responses (typically between 4 and 64, depending on the test-set size) for each question. Pass@1 is then calculated as

\[ \text{pass@1} = \frac{1}{k} \sum_{i=1}^{k} p_i, \]

where p_i denotes the correctness of the i-th response. 
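
As a quick illustration of this metric, here is a minimal Python sketch (the function and variable names are ours, not from the paper): pass@1 for a question is simply the average correctness over its k sampled responses.

# minimal sketch of pass@1 for a single question (names are illustrative)
def pass_at_1(correct):
    # correct: list of booleans, one per sampled response (k = 4 to 64 in the paper's setup)
    k = len(correct)
    return sum(correct) / k  # average correctness p_i over the k samples

# example: 4 samples, 3 judged correct -> pass@1 = 0.75
print(pass_at_1([True, True, False, True]))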

DeepSeek-R1 and Other Model Comparison 

DeepSeek-R1 Distilled Models and Other Models Comparison

Hands-on Implementation of DeepSeek-R1 through Llama.cpp

In this hands-on tutorial, we will implement DeepSeek-R1 locally using Llama.cpp.

Step 1: Set up a virtual environment and install the required libraries – 

python3 -m venv deepseek_llamacpp
source deepseek_llamacpp/bin/activate

pip install langchain langchain-core langchain-community
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
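
The -DGGML_METAL=on flag builds llama-cpp-python with Apple Metal acceleration and is only relevant on macOS. As a rough sketch (assuming a recent llama-cpp-python release that exposes the GGML_* CMake options), a CUDA or CPU-only install would look like this instead:

# NVIDIA GPU (CUDA) build
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# plain CPU-only build
pip install llama-cpp-python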

Step 2: Downloading the DeepSeek-R1-Distill-Qwen-1.5B model using git lfs and converting it into gguf format – 
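
The exact commands for this step are not shown here; the following is a minimal sketch assuming the official deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B repository on Hugging Face and llama.cpp's convert_hf_to_gguf.py script (local paths and output names are illustrative):

# clone the model weights with git lfs (several GB)
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

# get llama.cpp and the conversion script's dependencies
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt

# convert the Hugging Face checkpoint to GGUF (FP16 here; quantize afterwards with llama.cpp's llama-quantize tool if a smaller file is needed)
python llama.cpp/convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B --outfile DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf --outtype f16

Alternatively, pre-quantized GGUF files (such as the Q6_K file referenced in Step 3) can be downloaded directly from the Unsloth DeepSeek-R1-Distill-Qwen-1.5B-GGUF repository on Hugging Face, which skips the conversion step entirely.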

Step 3: Initializing llama.cpp with LangChain and building a Q&A AI Assistant with memory locally – 

from langchain_core.prompts import PromptTemplate
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp
from langchain.memory import ConversationBufferMemory

# defining the prompt template and callback for streaming
template = """Question: {question}

Answer: Let's work this out in a step-by-step way to ensure we get the right answer."""

prompt = PromptTemplate.from_template(template)
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# setting parameters for the llama.cpp (GGUF) model
memory = ConversationBufferMemory()
n_gpu_layers = -1  # offload all layers to the GPU (Metal on Apple Silicon)
n_batch = 512      # number of tokens processed in parallel per batch

llm = LlamaCpp(
    model_path="/Users/sachintripathi/Documents/Py_files/Deepseek /DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # half-precision key/value cache
    callback_manager=callback_manager,
    verbose=True,
)

# chaining prompt and llm
llm_chain = prompt | llm

def main_with_memory():
    memory.clear()
    while True:
        user_question = input("Enter your question (or type 'quit' to exit): ")
        if user_question.lower() == 'quit':
            break

        # past conversation retrieval from memory (returned as a single history string)
        context = memory.load_memory_variables({}).get("history", "")
        combined_prompt = f"{context}\nQuestion: {user_question}"

        generated_text = llm_chain.invoke({"question": combined_prompt})
        print("Answer:", generated_text)

        # store the latest exchange so it is available as context for the next turn
        memory.save_context({"input": user_question}, {"output": generated_text})

if __name__ == "__main__":
    main_with_memory()
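
To try the assistant, save the script (the filename below is just an example) and run it from the activated virtual environment; questions typed at the prompt are answered with tokens streamed to stdout, and typing 'quit' exits the loop.

python deepseek_r1_assistant.py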

Final Words

DeepSeek-R1 is a powerful LLM that leverages cold-start data along with iterative reinforcement-learning fine-tuning, achieving formidable performance in the LLM landscape. Its open-source reasoning capabilities rival those of OpenAI and other closed-source models in math, code, and general reasoning, at far greater cost efficiency. By offering competitive performance at a fraction of the cost, DeepSeek-R1 has made advanced AI capabilities more accessible to a wider range of users. Its open-source nature has challenged the traditional dominance of closed-source models, demonstrating that high-performing LLMs can be developed and shared openly and putting pressure on established players in the AI industry. 

References

  1. Link to Code
  2. DeepSeek-R1 Technical Report
  3. DeepSeek-R1 Documentation
  4. Unsloth DeepSeek-R1 Model HuggingFace
  5. Llama.cpp GitHub Repository


Sachin Tripathi

Sachin Tripathi is the Manager of AI Research at AIM, with over a decade of experience in AI and Machine Learning. An expert in generative AI and large language models (LLMs), Sachin excels in education, delivering effective training programs. His expertise also includes programming, big data analytics, and cybersecurity. Known for simplifying complex concepts, Sachin is a leading figure in AI education and professional development.
