Paper + OSS Models
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
Training
Cold Start (Fine-Tune)
- What: Fine-tuning of DeepSeek-V3-Base
- Why: Give the model basic reasoning capabilities/patterns, a stable foundation for RL, and a readable output format
- Data: Thousands of long CoT examples
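A minimal sketch of what one cold-start SFT example might look like; the field names and the |special_token| delimiters follow the readable format the paper describes, but the exact schema is an assumption:

```python
# Hypothetical shape of a single cold-start SFT example (field names are assumptions).
# The paper describes outputs as: |special_token|<reasoning_process>|special_token|<summary>.
cold_start_example = {
    "prompt": "What is the sum of the first 10 positive integers?",
    "completion": (
        "|special_token|"
        "The sum 1 + 2 + ... + n equals n*(n+1)/2; with n = 10 that is 10*11/2 = 55."
        "|special_token|"
        "The answer is 55."
    ),
}
```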
RL Stage 1
- What: GRPO reinforcement learning
- https://arxiv.org/abs/2402.03300
- 64 samples per training example
- Score each one w/ a rule-based reward (e.g., correct/incorrect for math, coding)
- Compare each sample's reward to the mean of the group (normalize by the group std to get an advantage)
- For samples with normalized reward above (below) the group mean:
- Increase (decrease) the probability of the model generating all the tokens in that sequence.
- Each token in the sequence gets a positive (negative) gradient update.
- Intuition: make all the choices that led to a good (bad) answer more (less) likely in the future. (See the sketch after this block.)
- Why: "Discover" good reasoning patterns, makes the model very strong at reasoning
- Lost some general capabilities
- Had potential language mixing issues.
- Data: ~144K CoT-format GSM8K and MATH questions (reasoning-intensive tasks)
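A minimal Python sketch of the group-relative advantage and per-token weighting described above; the group size, reward values, and function names are illustrative, and the full GRPO objective also adds PPO-style clipping and a KL penalty to a reference policy (see the GRPO paper linked above):

```python
import numpy as np

def grpo_advantages(rewards):
    # Group-relative advantage: each sample's reward minus the group mean,
    # normalized by the group std.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_token_weights(advantages, seq_lens):
    # Every token in a sampled sequence shares its sequence's advantage,
    # so the whole sequence gets a positive or negative gradient update.
    return [np.full(n, a) for a, n in zip(advantages, seq_lens)]

# Toy group of 8 samples for one prompt (the notes above use a group of 64).
# Rule-based reward: 1.0 if the final answer is correct, 0.0 otherwise.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
adv = grpo_advantages(rewards)
weights = grpo_token_weights(adv, seq_lens=[12, 9, 15, 11, 10, 14, 8, 13])
print(adv)             # positive for correct samples, negative for incorrect ones
print(weights[0][:3])  # all tokens of sample 0 share the same positive weight
```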
Rejection Sampling
- What: Generate new training data by sampling from the RL stage 1 checkpoint and keeping only correct / high-quality outputs (see the sketch after this block)
- Why: Turn the narrow-but-strong reasoning model into a large, cleaner SFT dataset; yields ~600k reasoning traces
- Data: ~600k reasoning traces + ~200k non-reasoning samples (writing, factual QA, etc.) from DeepSeek-V3's SFT dataset, ~800k total
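A minimal sketch of the rejection-sampling filter; generate() and is_correct() are hypothetical stand-ins for the RL stage 1 checkpoint's sampling call and the rule-based / LLM-judge verifier, and the sample count is illustrative:

```python
def rejection_sample(prompts, generate, is_correct, n_samples=16):
    # For each prompt, sample several completions from the RL stage 1 checkpoint
    # and keep only the ones that pass the correctness / quality check.
    kept = []
    for prompt in prompts:
        for completion in generate(prompt, n=n_samples):
            if is_correct(prompt, completion):
                kept.append({"prompt": prompt, "completion": completion})
    return kept
```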
Fine-Tune
- What: Fine-tuning on the ~800k curated samples above; per the paper this SFT restarts from DeepSeek-V3-Base rather than the RL stage 1 checkpoint (loss sketched below)
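A minimal PyTorch sketch of the supervised fine-tuning objective used in this stage (and in the cold start): next-token cross-entropy, masked so only completion tokens contribute; tensor names and the masking convention are assumptions:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, completion_mask):
    # logits: [batch, seq, vocab]; labels: [batch, seq] token ids;
    # completion_mask: [batch, seq], 1 on completion tokens, 0 on prompt tokens.
    # Shift so each position predicts the next token.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = completion_mask[:, 1:].float()
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)
    # Average the loss over completion tokens only.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```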