Paper + OSS Models
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
Training
Cold Start (Fine-Tune)
- What: Fine-tuning of DeepSeek-V3-Base
- Why: Give the model basic reasoning capabilities/patterns, a stable foundation for RL, and a readable output format
- Data: Thousands of long CoT examples
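A minimal sketch of what one cold-start SFT example might look like; the field names and the |special_token| delimiters follow the readable format the paper describes, but the exact schema is an assumption:

```python
# Hypothetical shape of a single cold-start SFT example (field names are assumptions).
# The paper describes outputs as: |special_token|<reasoning_process>|special_token|<summary>.
cold_start_example = {
    "prompt": "What is the sum of the first 10 positive integers?",
    "completion": (
        "|special_token|"
        "The sum 1 + 2 + ... + n equals n*(n+1)/2; with n = 10 that is 10*11/2 = 55."
        "|special_token|"
        "The answer is 55."
    ),
}
```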
RL Stage 1
- What: GRPO reinforcement learning
- https://arxiv.org/abs/2402.03300
- 64 samples per training example
- Score each one w/ a rule-based reward (e.g., correct/incorrect for math, coding)
- Compare each sample's reward to the mean of the group (normalize by the group std to get an advantage)
- For samples with normalized reward above (below) the group mean:
- Increase (decrease) the probability of the model generating all the tokens in that sequence.
- Each token in the sequence gets a positive (negative) gradient update.
- Intuition: make all the choices that led to a good (bad) answer more (less) likely in the future. (See the sketch after this block.)
- Why: "Discover" good reasoning patterns, makes the model very strong at reasoning
- Lost some general capabilities
- Had potential language mixing issues.
- Data: ~144K CoT-format GSM8K and MATH questions (reasoning-intensive tasks)
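A minimal Python sketch of the group-relative advantage and per-token weighting described above; the group size, reward values, and function names are illustrative, and the full GRPO objective also adds PPO-style clipping and a KL penalty to a reference policy (see the GRPO paper linked above):

```python
import numpy as np

def grpo_advantages(rewards):
    # Group-relative advantage: each sample's reward minus the group mean,
    # normalized by the group std.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_token_weights(advantages, seq_lens):
    # Every token in a sampled sequence shares its sequence's advantage,
    # so the whole sequence gets a positive or negative gradient update.
    return [np.full(n, a) for a, n in zip(advantages, seq_lens)]

# Toy group of 8 samples for one prompt (the notes above use a group of 64).
# Rule-based reward: 1.0 if the final answer is correct, 0.0 otherwise.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
adv = grpo_advantages(rewards)
weights = grpo_token_weights(adv, seq_lens=[12, 9, 15, 11, 10, 14, 8, 13])
print(adv)             # positive for correct samples, negative for incorrect ones
print(weights[0][:3])  # all tokens of sample 0 share the same positive weight
```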
Rejection Sampling
- What: Generate new training data by sampling from the RL stage 1 checkpoint and keeping only correct / high-quality outputs (see the sketch after this block)
- Why: Turn the narrow-but-strong reasoning model into a large, cleaner SFT dataset; yields ~600k reasoning traces
- Data: ~600k reasoning traces + ~200k non-reasoning samples (writing, factual QA, etc.) from DeepSeek-V3's SFT dataset, ~800k total
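A minimal sketch of the rejection-sampling filter; generate() and is_correct() are hypothetical stand-ins for the RL stage 1 checkpoint's sampling call and the rule-based / LLM-judge verifier, and the sample count is illustrative:

```python
def rejection_sample(prompts, generate, is_correct, n_samples=16):
    # For each prompt, sample several completions from the RL stage 1 checkpoint
    # and keep only the ones that pass the correctness / quality check.
    kept = []
    for prompt in prompts:
        for completion in generate(prompt, n=n_samples):
            if is_correct(prompt, completion):
                kept.append({"prompt": prompt, "completion": completion})
    return kept
```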
Fine-Tune
- What: Fine-tuning on the ~800k curated samples above; per the paper this SFT restarts from DeepSeek-V3-Base rather than the RL stage 1 checkpoint (loss sketched below)
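A minimal PyTorch sketch of the supervised fine-tuning objective used in this stage (and in the cold start): next-token cross-entropy, masked so only completion tokens contribute; tensor names and the masking convention are assumptions:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, completion_mask):
    # logits: [batch, seq, vocab]; labels: [batch, seq] token ids;
    # completion_mask: [batch, seq], 1 on completion tokens, 0 on prompt tokens.
    # Shift so each position predicts the next token.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = completion_mask[:, 1:].float()
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)
    # Average the loss over completion tokens only.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```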