Scaling LLMs through next-token prediction is limited by the availability of high-quality data. Kimi K1.5 addresses this with an optimized reinforcement learning framework built around long-context reasoning, streamlined policy optimization, and multimodal integration. It matches OpenAI's o1 on key reasoning benchmarks and substantially outperforms GPT-4o and Claude Sonnet 3.5. This article explores its architecture, training methods, and RL advancements.
Table of Contents
- Understanding Kimi K1.5
- Kimi K1.5’s Architecture
- RL Framework & Policy Optimization
- Overview of Long2Short Compression for Short-CoT Models
Let’s start by understanding what Kimi K1.5 is.
Understanding Kimi K1.5
Traditional LLM scaling, guided by the Kaplan scaling laws, is ultimately limited by the finite supply of curated data. Kimi K1.5 bypasses this by scaling reinforcement learning, enabling self-improvement beyond static datasets. Through RL training on long-context Chain-of-Thought reasoning, it achieves emergent problem-solving in mathematics, code generation, and multimodal understanding, showcasing strong performance across challenging benchmarks.
Kimi K1.5’s Architecture
Kimi K1.5 adopts a multi-modal transformer architecture, optimized for long-context reasoning and vision-text integration. Unlike previous reinforcement learning-based LLMs that rely on Monte Carlo tree search (MCTS) or value functions, Kimi K1.5 simplifies the RL training pipeline while achieving superior results.
Overview of Kimi K1.5’s Architecture
Long-Context Scaling
Kimi K1.5 scales context length to 128K tokens, enabling it to maintain long-term dependencies and enhance reasoning over extended sequences. This is achieved through partial rollouts, which store and reuse prior reasoning steps to cut computational costs, and memory-efficient self-attention to optimize transformer operations for lengthy inputs.
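The report does not detail the exact attention implementation, so the snippet below is only a generic illustration of one memory-saving idea: computing attention in query chunks so the full sequence-by-sequence score matrix is never materialized at once. It omits causal masking and is not K1.5 code; all names here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def chunked_attention(q, k, v, chunk_size=1024):
    """Compute softmax attention in query chunks.

    Processing queries in blocks bounds peak memory at a
    (chunk_size x seq_len) score matrix instead of (seq_len x seq_len).
    Shapes: q, k, v are (batch, heads, seq_len, head_dim). Causal masking
    is omitted for brevity.
    """
    scale = q.shape[-1] ** -0.5
    outputs = []
    for start in range(0, q.shape[2], chunk_size):
        q_blk = q[:, :, start:start + chunk_size]                 # (B, H, C, D)
        scores = torch.matmul(q_blk, k.transpose(-2, -1)) * scale  # (B, H, C, L)
        probs = F.softmax(scores, dim=-1)
        outputs.append(torch.matmul(probs, v))                     # (B, H, C, D)
    return torch.cat(outputs, dim=2)

# Toy usage: a 4K-token sequence processed in 1K-query chunks.
q = k = v = torch.randn(1, 4, 4096, 64)
out = chunked_attention(q, k, v)  # (1, 4, 4096, 64)
```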
Policy Optimization
Kimi K1.5 uses Online Mirror Descent RL instead of traditional advantage-based methods to refine reasoning strategies. This approach ensures stable training by avoiding variance explosion, boosts convergence efficiency, and employs adaptive sampling to focus on weaker reasoning paths, enhancing generalization across diverse tasks and scenarios.
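Concretely, each mirror descent iteration can be written as a relative-entropy-regularized objective. The formulation below is the standard statement of that objective, with temperature $\tau$ and the previous-iteration policy $\pi_{\mathrm{old}}$, together with its well-known closed-form solution; the notation here is generic rather than copied from the K1.5 report.

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\big[r(x, y)\big] \;-\; \tau\, \mathrm{KL}\!\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{old}}(\cdot \mid x)\big)
$$

$$
\pi^{*}(y \mid x) \;=\; \frac{\pi_{\mathrm{old}}(y \mid x)\, \exp\!\big(r(x, y)/\tau\big)}{Z(x)}, \qquad Z(x) = \sum_{y'} \pi_{\mathrm{old}}(y' \mid x)\, \exp\!\big(r(x, y')/\tau\big)
$$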
Unified Multimodal Integration
Kimi K1.5 is trained on a mix of text and vision data, allowing cross-modal reasoning. Its training corpus includes STEM problem sets for math skills, coding datasets for programming expertise, and vision-language datasets for visual comprehension, ensuring robust performance across mathematical, logical, and visual understanding tasks.
These architectural improvements collectively enable Kimi K1.5 to achieve state-of-the-art reasoning capabilities across diverse benchmarks.
RL Framework & Policy Optimization
Kimi K1.5’s RL pipeline consists of four stages:
RL Prompt Set Curation
A high-quality prompt dataset is essential for stable RL training. Curation of this set relies on:
- Diverse coverage across STEM, coding, and general reasoning tasks.
- Difficulty balancing, ensuring a mix of easy and hard problems.
- Automated filtering, removing prompts prone to reward hacking (see the pass-rate sketch after this list).
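Kimi’s exact curation code is not public, but difficulty balancing and easy/hackable-prompt removal are commonly implemented by estimating an empirical pass rate with a few model samples per prompt. The sketch below is a hypothetical illustration of that idea; sample_fn and verify_fn are assumed callables supplied by the caller, not part of any released API.

```python
from typing import Callable, Iterable, List, Tuple

def estimate_pass_rate(prompt: str, reference: str,
                       sample_fn: Callable[[str], str],
                       verify_fn: Callable[[str, str], bool],
                       n_samples: int = 8) -> float:
    """Fraction of sampled answers that the verifier accepts."""
    hits = sum(verify_fn(sample_fn(prompt), reference) for _ in range(n_samples))
    return hits / n_samples

def curate_prompts(candidates: Iterable[Tuple[str, str]],
                   sample_fn: Callable[[str], str],
                   verify_fn: Callable[[str, str], bool],
                   low: float = 0.1, high: float = 0.9) -> List[Tuple[str, str]]:
    """Keep prompts whose empirical pass rate is neither ~0 nor ~1.

    Prompts solved almost every time add little RL signal, while prompts that
    pass without genuine reasoning (e.g. guessable answers) also tend to sit
    at the extremes, which makes a pass-rate filter a cheap reward-hacking check.
    """
    kept = []
    for prompt, reference in candidates:
        rate = estimate_pass_rate(prompt, reference, sample_fn, verify_fn)
        if low <= rate <= high:
            kept.append((prompt, reference))
    return kept
```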
Long-CoT Supervised Fine-Tuning
Before RL training, the model undergoes supervised fine-tuning (SFT) on long-CoT datasets. This phase:
- Teaches the model structured reasoning through multi-step problem-solving.
- Encourages planning and self-reflection, mimicking human cognitive processes (an illustrative training record follows this list).
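The exact schema of the long-CoT SFT data is not public; the record below is a purely hypothetical illustration of the kind of planning-and-reflection structure such an example might encode.

```python
# Hypothetical long-CoT SFT record; field names and format are illustrative only.
sft_example = {
    "prompt": "How many positive integers n <= 100 are divisible by 6 or 15?",
    "response": (
        "Plan: count multiples of 6, count multiples of 15, subtract multiples of 30.\n"
        "Step 1: floor(100/6) = 16.  Step 2: floor(100/15) = 6.\n"
        "Step 3: floor(100/30) = 3 numbers are counted twice.\n"
        "Reflection: by inclusion-exclusion, 16 + 6 - 3 = 19; the count is consistent.\n"
        "Answer: 19"
    ),
}
```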
Reinforcement Learning with Online Mirror Descent
During RL training, the model optimizes its reasoning policies using Online Mirror Descent instead of traditional policy gradient methods. Key optimizations, brought together in the single-step sketch after this list, include:
- Relative entropy regularization, preventing mode collapse during training.
- Reward models for math and code, ensuring verifiable correctness.
- Token length penalties, reducing overthinking and unnecessary verbosity.
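To see how the relative-entropy term, verifiable rewards, and length penalty interact, here is a single-step sketch of a squared-error surrogate for the KL-regularized update shown earlier: it drives τ·log(π_new/π_old) toward the baseline-centered reward implied by the closed-form solution. It is a simplified illustration under assumed tensor names (logp_new, logp_old, rewards, lengths), not the exact loss used to train Kimi K1.5.

```python
import torch

def mirror_descent_loss(logp_new, logp_old, rewards, lengths,
                        tau=0.5, len_coef=0.01):
    """Squared-error surrogate for one KL-regularized policy update.

    logp_new : summed log-probs of each sampled response under the current policy
    logp_old : same quantity under the previous-iteration policy (treated as constant)
    rewards  : verifiable reward per response (e.g. 1.0 if the answer checks out)
    lengths  : response length in tokens, used as a soft overthinking penalty
    """
    shaped = rewards - len_coef * lengths.float()   # fold the length penalty into the reward
    baseline = shaped.mean()                        # batch-mean baseline in place of tau * log Z
    log_ratio = logp_new - logp_old.detach()        # relative entropy (log pi_new / pi_old)
    # Regression form of pi* proportional to pi_old * exp(r / tau).
    return ((shaped - baseline - tau * log_ratio) ** 2).mean()

# Toy usage with fake statistics for a batch of 4 sampled responses.
logp_new = torch.tensor([-35.2, -40.1, -28.7, -50.3], requires_grad=True)
logp_old = torch.tensor([-34.9, -41.0, -29.5, -49.8])
rewards  = torch.tensor([1.0, 0.0, 1.0, 0.0])
lengths  = torch.tensor([420, 510, 350, 780])
loss = mirror_descent_loss(logp_new, logp_old, rewards, lengths)
loss.backward()
```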
Partial Rollouts for Long-Context RL
Partial rollouts improve efficiency by reusing past reasoning steps, reducing the need for redundant rollouts. This significantly lowers training compute costs while maintaining performance.
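The resumption bookkeeping is not spelled out in code in the report; the queue below is a hypothetical sketch of the idea, advancing each unfinished trajectory by a fixed token budget per iteration instead of regenerating it from scratch. The generate_fn interface is an assumption for illustration.

```python
from collections import deque

class PartialRolloutQueue:
    """Resume unfinished long generations across RL iterations.

    Each entry stores the prompt plus the tokens produced so far, so a long
    chain of thought is generated in budgeted slices rather than as one huge
    rollout in a single iteration.
    """
    def __init__(self):
        self.pending = deque()

    def add(self, prompt):
        self.pending.append({"prompt": prompt, "tokens": []})

    def step(self, generate_fn, token_budget=2048):
        """Advance every pending rollout by at most token_budget tokens.

        generate_fn(prompt, prefix_tokens, max_new_tokens) -> (new_tokens, done)
        is an assumed interface. Returns the rollouts finished this iteration.
        """
        finished = []
        for _ in range(len(self.pending)):
            item = self.pending.popleft()
            new_tokens, done = generate_fn(item["prompt"], item["tokens"],
                                           max_new_tokens=token_budget)
            item["tokens"].extend(new_tokens)
            (finished if done else self.pending).append(item)
        return finished
```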
Large-Scale RL Training Architecture for LLMs
Overview of Long2Short Compression for Short-CoT Models
While long-CoT reasoning achieves superior accuracy, it is computationally expensive. Kimi K1.5 introduces Long2Short RL techniques to compress long-CoT strategies into shorter, more efficient representations.
This is achieved through the following techniques (two of which are sketched after the list):
- Model merging, combining long-CoT and short-CoT models via weight averaging.
- Shortest rejection sampling, selecting the shortest correct response for fine-tuning.
- Preference-based RL, training models to optimize for brevity without sacrificing correctness.
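Two of these techniques are simple enough to sketch. Below, merge_checkpoints weight-averages a long-CoT and a short-CoT checkpoint with identical architectures, and shortest_rejection_sample keeps the shortest verifier-accepted response out of k samples for later fine-tuning. Both are illustrative sketches with assumed interfaces (matching state dicts, caller-supplied sample_fn and verify_fn), not the released training code.

```python
def merge_checkpoints(long_cot_state, short_cot_state, alpha=0.5):
    """Weight-average two checkpoints with identical parameter names and shapes."""
    return {name: alpha * long_cot_state[name] + (1 - alpha) * short_cot_state[name]
            for name in long_cot_state}

def shortest_rejection_sample(prompt, reference, sample_fn, verify_fn, k=8):
    """Sample k responses and keep the shortest one the verifier accepts."""
    correct = [resp for resp in (sample_fn(prompt) for _ in range(k))
               if verify_fn(resp, reference)]
    return min(correct, key=len) if correct else None
```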
Experiments show that Long2Short models retain long-CoT accuracy while reducing inference cost, making them ideal for real-world deployment.
Final Words
Kimi K1.5 establishes reinforcement learning as a viable strategy for LLM scaling, demonstrating state-of-the-art performance across math, code, and vision-language tasks. Its key innovations (long-context scaling, mirror descent RL, and Long2Short compression) enable efficient reasoning without requiring excessive computation.