Kimi K1.5: Advancing LLMs by Scaling Reinforcement Learning

Kimi K1.5 advances LLM scaling by leveraging reinforcement learning for long-context reasoning, policy optimization, and multimodal integration.

Scaling LLMs through next-token prediction is limited by the availability of high-quality training data. Kimi K1.5 addresses this with a streamlined reinforcement learning framework that strengthens long-context reasoning, policy optimization, and multimodal integration. On reasoning benchmarks it matches OpenAI’s o1 and outperforms GPT-4o and Claude 3.5 Sonnet by wide margins on short-CoT tasks. This article explores its architecture, training methods, and RL advancements.

Table of Contents

  1. Understanding Kimi K1.5
  2. Kimi K1.5’s Architecture
  3. RL Framework & Policy Optimization
  4. Overview of Long2Short Compression for Short-CoT Models

Let’s start by understanding what Kimi K1.5 is.

Understanding Kimi K1.5

Traditional LLM scaling, based on Kaplan et al.’s scaling laws, is limited by the finite supply of curated data. Kimi K1.5 bypasses this by scaling reinforcement learning, enabling self-improvement beyond static datasets. Through RL training on long-context chain-of-thought reasoning, it achieves strong problem-solving in mathematics, code generation, and multimodal understanding, with competitive performance across challenging benchmarks.

Kimi K1.5’s Architecture

Kimi K1.5 adopts a multimodal transformer architecture optimized for long-context reasoning and vision-text integration. Unlike earlier RL-based reasoning approaches that rely on Monte Carlo tree search (MCTS), value functions, or process reward models, Kimi K1.5 simplifies the RL training pipeline while achieving strong results.

Overview of Kimi K1.5’s Architecture

Long-Context Scaling

Kimi K1.5 scales context length to 128K tokens, enabling it to maintain long-term dependencies and enhance reasoning over extended sequences. This is achieved through partial rollouts, which store and reuse prior reasoning steps to cut computational costs, and memory-efficient self-attention to optimize transformer operations for lengthy inputs.
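
The paper does not spell out its exact attention implementation, so the following is only a generic illustration of the memory-efficient idea: computing attention over key/value chunks with an online softmax, so the full quadratic score matrix for a 128K-token sequence is never materialized. All names here are illustrative.

```python
import numpy as np

def chunked_attention(q, k, v, chunk_size=1024):
    """Single-head attention computed chunk by chunk over the keys/values,
    with a running (online) softmax, so the full score matrix for a very
    long sequence is never materialized."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full(q.shape[0], -np.inf)   # running max of scores per query
    l = np.zeros(q.shape[0])           # running softmax denominator
    o = np.zeros_like(q)               # running weighted sum of values
    for start in range(0, k.shape[0], chunk_size):
        k_c = k[start:start + chunk_size]
        v_c = v[start:start + chunk_size]
        s = (q @ k_c.T) * scale                 # scores for this chunk only
        m_new = np.maximum(m, s.max(axis=-1))
        alpha = np.exp(m - m_new)               # rescale earlier statistics
        p = np.exp(s - m_new[:, None])
        l = l * alpha + p.sum(axis=-1)
        o = o * alpha[:, None] + p @ v_c
        m = m_new
    return o / l[:, None]

# Sanity check against the naive O(L^2)-memory implementation.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
s = (q @ k.T) / np.sqrt(64)
p = np.exp(s - s.max(axis=-1, keepdims=True))
naive = (p / p.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(chunked_attention(q, k, v, chunk_size=64), naive)
```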

Policy Optimization

Kimi K1.5 refines its reasoning policy with a variant of online mirror descent rather than conventional policy-gradient methods. Relative-entropy regularization keeps each update close to the current policy, which stabilizes training without a learned value function, while curriculum and prioritized sampling focus training on problems the model still gets wrong, improving generalization across diverse tasks.
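
Concretely, each mirror-descent step maximizes expected reward while staying close, in relative entropy, to the current policy. Below is a reconstruction of the objective in the paper’s notation (treat details as approximate): x is a problem, y* its reference answer, z a sampled chain of thought, y the final answer, and π_θi the policy at iteration i.

```latex
% One policy-improvement step: maximize expected reward while staying close
% (in relative entropy) to the current policy \pi_{\theta_i}; \tau > 0
% controls the regularization strength.
\max_{\theta}\;
  \mathbb{E}_{(x,\,y^{*}) \sim \mathcal{D}}\,
  \mathbb{E}_{(y,\,z) \sim \pi_{\theta}}\!\bigl[ r(x, y, y^{*}) \bigr]
  \;-\; \tau\,\mathrm{KL}\!\bigl( \pi_{\theta}(x) \,\big\|\, \pi_{\theta_i}(x) \bigr)

% The step admits a closed-form optimum: the current policy reweighted by
% the exponentiated reward, normalized by a constant Z.
\pi^{*}(y, z \mid x)
  = \frac{\pi_{\theta_i}(y, z \mid x)\,
          \exp\!\bigl( r(x, y, y^{*}) / \tau \bigr)}{Z}
```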

Unified Multimodal Integration

Kimi K1.5 is trained on a mix of text and vision data, allowing cross-modal reasoning. Its training corpus includes STEM problem sets for math skills, coding datasets for programming expertise, and vision-language datasets for visual comprehension, ensuring robust performance across mathematical, logical, and visual understanding tasks.

These architectural improvements collectively enable Kimi K1.5 to achieve state-of-the-art reasoning capabilities across diverse benchmarks.

RL Framework & Policy Optimization

Kimi K1.5’s RL pipeline consists of four stages:

RL Prompt Set Curation

A high-quality prompt dataset is essential for stable RL training. The curation process employs:

  • Diverse coverage across STEM, coding, and general reasoning tasks.
  • Difficulty balancing, ensuring a mix of easy and hard problems.
  • Automated filtering, removing prompts prone to reward hacking (a sketch of this curation logic follows the list).
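
A minimal Python sketch of the curation logic, assuming hypothetical `sample_fn` and `check_fn` utilities for sampling the SFT model and verifying answers. The paper estimates difficulty from the model’s empirical pass rate and drops prompts the model can answer correctly without reasoning; everything else here is illustrative.

```python
def estimate_difficulty(prompt, answer, sample_fn, check_fn, n=10, temperature=1.0):
    """Proxy for difficulty: 1 minus the empirical pass rate of the SFT
    model when sampled n times at a relatively high temperature."""
    passes = sum(check_fn(sample_fn(prompt, temperature), answer) for _ in range(n))
    return 1.0 - passes / n  # higher means harder

def prone_to_reward_hacking(prompt, answer, sample_fn, check_fn, n=8):
    """Flag prompts the model can answer correctly without any chain of
    thought: their reward signal is too easy to game."""
    direct = prompt + "\nAnswer directly, without showing any reasoning:"
    return any(check_fn(sample_fn(direct, 1.0), answer) for _ in range(n))

def curate(prompts, sample_fn, check_fn):
    """Drop reward-hackable prompts and tag the rest with a difficulty
    score so easy and hard problems can be balanced downstream."""
    curated = []
    for prompt, answer in prompts:
        if prone_to_reward_hacking(prompt, answer, sample_fn, check_fn):
            continue
        curated.append((prompt, answer,
                        estimate_difficulty(prompt, answer, sample_fn, check_fn)))
    return curated
```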

Long-CoT Supervised Fine-Tuning

Before RL training, the model undergoes supervised fine-tuning (SFT) on long-CoT datasets. This phase:

  • Teaches the model structured reasoning through multi-step problem-solving.
  • Encourages planning and self-reflection, mimicking human cognitive processes (an illustrative example follows).
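
For illustration only, a long-CoT SFT example might pair a prompt with a response that interleaves planning, step-by-step solving, and self-checking. The template and markers below are invented, not the paper’s actual format.

```python
# Hypothetical shape of a long-CoT SFT training example; the "Plan" and
# "Reflection" markers are illustrative, not the paper's actual template.
long_cot_example = {
    "prompt": "If 3x + 7 = 22, what is x?",
    "response": (
        "Plan: isolate x by undoing the addition, then the multiplication.\n"
        "Step 1: subtract 7 from both sides, giving 3x = 15.\n"
        "Step 2: divide both sides by 3, giving x = 5.\n"
        "Reflection: substituting back, 3*5 + 7 = 22, which matches.\n"
        "Final answer: 5"
    ),
}
```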

Kimi k1.5 long-CoT results

Reinforcement Learning with Online Mirror Descent

During RL training, the model optimizes its reasoning policy using online mirror descent rather than traditional policy-gradient methods. Key optimizations include:

  • Relative entropy regularization, preventing mode collapse during training.
  • Reward models for math and code, ensuring verifiable correctness.
  • Token length penalties, reducing overthinking and unnecessary verbosity (sketched below).
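
The length penalty is easy to make concrete: the paper scores each sampled response relative to the shortest and longest samples for the same problem. Here is a reconstruction in Python; treat the exact constants as approximate.

```python
def length_reward(lengths, correct):
    """Length penalty, reconstructed from the Kimi K1.5 paper: among the
    responses sampled for one problem, shorter correct answers earn up to
    +0.5, longer ones down to -0.5, and incorrect answers are never
    rewarded for brevity (their bonus is capped at 0)."""
    min_len, max_len = min(lengths), max(lengths)
    span = max(max_len - min_len, 1)  # avoid division by zero
    rewards = []
    for length, ok in zip(lengths, correct):
        lam = 0.5 - (length - min_len) / span
        rewards.append(lam if ok else min(0.0, lam))
    return rewards

# Example: three sampled answers of 100, 400, and 700 tokens.
print(length_reward([100, 400, 700], [True, True, False]))
# -> [0.5, 0.0, -0.5]; the shortest correct answer gets the biggest bonus.
```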

Partial Rollouts for Long-Context RL

Partial rollouts improve efficiency by saving unfinished long trajectories and resuming them in later iterations instead of regenerating them from scratch. This significantly lowers training compute while maintaining performance.
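
Conceptually, each RL iteration caps generation at a fixed token budget, stores any unfinished trajectory, and resumes it later, so one very long response never stalls the whole batch. A minimal sketch, with `generate_fn` standing in for the actual inference engine:

```python
from dataclasses import dataclass, field

@dataclass
class Rollout:
    prompt: str
    tokens: list = field(default_factory=list)  # tokens generated so far
    done: bool = False

def partial_rollout(rollout, generate_fn, budget=512):
    """Advance one trajectory by at most `budget` tokens this iteration.
    `generate_fn(prompt, tokens_so_far, budget)` is a placeholder that
    returns (new_tokens, finished)."""
    new_tokens, finished = generate_fn(rollout.prompt, rollout.tokens, budget)
    rollout.tokens.extend(new_tokens)
    rollout.done = finished
    return rollout

def run_iteration(queue, generate_fn, budget=512):
    """Process the rollout queue: completed trajectories go to training,
    truncated ones are re-queued and resumed in the next iteration."""
    completed, carried_over = [], []
    for rollout in queue:
        rollout = partial_rollout(rollout, generate_fn, budget)
        (completed if rollout.done else carried_over).append(rollout)
    return completed, carried_over
```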

Large-Scale RL Training Architecture for LLMs

Overview of Long2Short Compression for Short-CoT Models

While long-CoT reasoning achieves superior accuracy, it is computationally expensive. Kimi K1.5 introduces Long2Short RL techniques to compress long-CoT strategies into shorter, more efficient representations.

This is achieved through:

  • Model merging, combining long-CoT and short-CoT models via weight averaging.
  • Shortest rejection sampling, selecting the shortest correct response for fine-tuning (this and model merging are sketched after this list).
  • Preference-based RL, training models to optimize for brevity without sacrificing correctness.
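
The first two techniques are straightforward to sketch in Python; `sample_fn` and `check_fn` are hypothetical stand-ins for the real sampling and verification utilities, and the preference-based RL step is omitted here.

```python
def merge_models(long_sd, short_sd, alpha=0.5):
    """Model merging by simple weight averaging of two state dicts sharing
    one architecture (values can be PyTorch tensors or plain floats);
    alpha is the weight on the long-CoT model."""
    return {name: alpha * long_sd[name] + (1 - alpha) * short_sd[name]
            for name in long_sd}

def shortest_rejection_sampling(prompt, answer, sample_fn, check_fn, k=8):
    """Sample k responses and keep the shortest correct one as the
    fine-tuning target; returns None if no sample is correct."""
    candidates = [sample_fn(prompt) for _ in range(k)]
    correct = [c for c in candidates if check_fn(c, answer)]
    return min(correct, key=len) if correct else None
```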

Kimi k1.5 short-CoT results

Experiments show that Long2Short models retain much of the long-CoT accuracy while substantially reducing inference cost, making them well suited to real-world deployment.

Final Words

Kimi K1.5 establishes reinforcement learning as a viable strategy for LLM scaling, demonstrating state-of-the-art performance across math, code, and vision-language tasks. Its key innovations (long-context scaling, mirror-descent RL, and Long2Short compression) enable efficient reasoning without requiring excessive computation.

References

Kimi Team, “Kimi k1.5: Scaling Reinforcement Learning with LLMs” (research paper).

Aniruddha Shrikhande

Aniruddha Shrikhande is an AI enthusiast and technical writer with a strong focus on Large Language Models (LLMs) and generative AI. Committed to demystifying complex AI concepts, he specializes in creating clear, accessible content that bridges the gap between technical innovation and practical application. Aniruddha's work explores cutting-edge AI solutions across various industries. Through his writing, Aniruddha aims to inspire and educate, contributing to the dynamic and rapidly expanding field of artificial intelligence.
