As AI capabilities grow, so does the need for reliable evaluation. Traditional reward models often fail to account for the nuances of reasoning, especially in subjective or multi-step tasks. Enter J1, Meta AI’s novel framework that trains large language models (LLMs) to judge responses with better reasoning and less bias. Leveraging Reinforcement Learning (RL) with verifiable rewards, J1 sets a new standard for LLM-as-a-Judge systems, outperforming even larger baselines such as DeepSeek-R1 on complex evaluation benchmarks.
Table of Contents
- What is J1?
- How J1 Works
- Key Features
- J1 Architecture Explained
- Experimental Evaluation
- Use Cases for J1
- Final Thoughts
Let’s start by understanding what J1 is.
What is J1?
J1 is a generalist LLM-as-a-Judge, trained to assess the quality of model-generated responses. Unlike scalar reward models that produce a single score, it uses chain-of-thought reasoning and delivers its judgments as verdicts or scores. It is trained on both verifiable (math) and non-verifiable (chat) tasks using purely synthetic preference data, a crucial innovation that bypasses the need for costly human annotations.
It promotes a more thoughtful evaluation process by incorporating chain-of-thought supervision, encouraging the model to think before it judges. This approach also offers a significant advantage by mitigating position bias, a well-known weakness of pairwise scoring methodologies.
How J1 Works
J1 evaluates model responses through structured reasoning. Instead of scoring directly, it learns judgment from synthetic preference pairs, each containing one good and one poor response. By generating a detailed thought process before deciding, it mimics human-like evaluation, making it suitable for both objective tasks like math and subjective ones like conversation or writing.
Trained with verifiable rewards based on verdict correctness and positional consistency, it optimizes both its reasoning and its final verdicts. Its design includes pairwise judges (comparing two responses) and pointwise judges (evaluating one response at a time). J1 mitigates position bias, scales effectively at test time via self-consistency, and achieves state-of-the-art accuracy across diverse benchmarks, outperforming even larger models on non-verifiable tasks.
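To make the training signal concrete, here is a minimal sketch of how such a synthetic preference pair might be represented. The field names and the example pair are illustrative assumptions, not the paper’s exact data schema.

```python
# Illustrative sketch of a synthetic preference pair (field names are
# assumptions, not the exact schema used in the J1 paper).
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str    # user question (e.g., a MATH problem or a WildChat message)
    chosen: str    # higher-quality response
    rejected: str  # deliberately weaker or incorrect response
    source: str    # "math" (verifiable) or "chat" (non-verifiable)

# For verifiable tasks the rejected response can simply carry a wrong final
# answer; for chat, a weaker generation serves the same role. Either way,
# no human annotation is needed to know which response should win.
pair = PreferencePair(
    prompt="What is 17 * 24?",
    chosen="17 * 24 = 408.",
    rejected="17 * 24 = 398.",
    source="math",
)
```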
Key Features
The paper explores two main variants of J1: Pairwise and Pointwise. Pairwise-J1 compares two responses and determines which is better, while Pointwise-J1 evaluates a single response and assigns it a score. Pointwise-J1 is shown to be effective in mitigating positional bias, a common problem where the order of responses influences the judgment.
Thinking patterns of Pairwise-J1 and Pointwise-J1
Chain-of-Thought Reasoning
It generates intermediate reasoning before delivering a judgment, improving transparency and decision quality. Reasoning steps include defining evaluation criteria, generating a reference answer, and comparing the responses against it.
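As a rough illustration, a pairwise judge prompt can ask the model to think before it answers. The wording and tags below are a hypothetical template, not the exact prompt used in the paper.

```python
# Hypothetical pairwise judge prompt with an explicit thinking phase.
# The tags and wording are illustrative; the paper's templates may differ.
JUDGE_TEMPLATE = """You are an impartial judge.

Question:
{question}

Response A:
{response_a}

Response B:
{response_b}

First, think step by step inside <think>...</think>: define the evaluation
criteria, sketch a reference answer, and compare both responses against it.
Then output your final decision as <verdict>A</verdict> or <verdict>B</verdict>."""

prompt = JUDGE_TEMPLATE.format(
    question="What is 17 * 24?",
    response_a="17 * 24 = 408.",
    response_b="17 * 24 = 398.",
)
```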
Synthetic Verifiable Training
Trained on synthetic preference pairs across both verifiable (e.g., math) and non-verifiable (e.g., chat) tasks—eliminating the need for costly human annotations.
Reinforcement Learning with Verifiable Rewards
Uses online RL (GRPO) with rewards based on verdict correctness and positional consistency—training models to think and judge accurately.
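Because the preference pairs are synthetic, the correct verdict is always known, so the correctness part of the reward can be checked automatically. A minimal sketch of that check is shown below; the `<verdict>` tag and the function name are assumptions for illustration.

```python
import re

def verdict_correctness_reward(generation: str, gold_label: str) -> float:
    """Reward 1.0 if the judge's parsed verdict matches the known preference.

    `gold_label` is "A" or "B", known because the preference pair is synthetic.
    The <verdict>...</verdict> output tag is an assumed format.
    """
    match = re.search(r"<verdict>\s*([AB])\s*</verdict>", generation)
    if match is None:
        return 0.0  # malformed output earns no reward
    return 1.0 if match.group(1) == gold_label else 0.0
```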

Bias Mitigation
Mitigates position bias by training on both response orders and applying verdict consistency rewards. Also includes a pointwise judge that’s consistent by design.
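One way to picture the consistency signal: judge the same pair in both presentation orders and only count the judgment as consistent when the verdicts agree after the swap. The sketch below follows that assumption; `judge` stands in for a call to the trained model.

```python
def is_position_consistent(judge, question, resp_a, resp_b) -> bool:
    """Check that the judge prefers the same underlying response in both orders.

    `judge(question, first, second)` is a hypothetical callable that returns
    "A" or "B" for whichever of `first`/`second` it prefers.
    """
    verdict_original = judge(question, resp_a, resp_b)  # resp_a shown first
    verdict_swapped = judge(question, resp_b, resp_a)   # resp_b shown first

    # Consistent iff the same underlying response wins in both presentations:
    # "A" in the original order corresponds to "B" after the swap.
    return (verdict_original == "A" and verdict_swapped == "B") or \
           (verdict_original == "B" and verdict_swapped == "A")
```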
Multiple Judgment Modes
Supports multiple evaluation strategies:
- Pairwise Verdict: selects the better of two responses
- Pairwise Scores: assigns quality scores to both responses
- Scores + Verdict: a hybrid of the two
- Pointwise Score: evaluates a single response on its own
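Purely for illustration, the four modes can be thought of as different output contracts. The shapes below are assumptions about what each mode produces, not the paper’s exact formats.

```python
# Assumed output shapes for each judgment mode (illustrative only).
JUDGMENT_MODES = {
    "pairwise_verdict": {"verdict": "A"},                    # better of two
    "pairwise_scores": {"score_a": 8.5, "score_b": 6.0},     # higher score wins
    "scores_and_verdict": {"score_a": 8.5, "score_b": 6.0,
                           "verdict": "A"},                  # hybrid
    "pointwise_score": {"score": 8.5},                       # single response
}
```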
Scalable Inference via Self-Consistency
Boosts evaluation accuracy by sampling multiple reasoning traces at test time and aggregating judgments (e.g., SC@32).
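A minimal sketch of self-consistency aggregation, assuming a hypothetical `sample_verdict` callable that draws one reasoning trace and returns its verdict:

```python
from collections import Counter

def self_consistency_verdict(sample_verdict, k: int = 32) -> str:
    """Sample k independent reasoning traces and return the majority verdict (SC@k).

    `sample_verdict()` is a hypothetical function that runs the judge once
    with sampling enabled and returns "A" or "B".
    """
    votes = Counter(sample_verdict() for _ in range(k))
    return votes.most_common(1)[0][0]
```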
J1 Architecture Explained
J1 is implemented using the verl framework. The models are trained on a dataset of 22K synthetic preference pairs drawn from both WildChat and MATH prompts. The training process involves generating multiple “rollouts” per prompt and optimizing the model with the GRPO algorithm, with hyperparameters such as the learning rate and KL coefficient carefully tuned to achieve optimal performance. The models are then evaluated on a range of benchmarks, demonstrating strong performance compared to state-of-the-art methods.
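In GRPO, advantages are typically computed relative to the group of rollouts sampled for the same prompt. The snippet below sketches one common formulation (reward minus the group mean, divided by the group standard deviation); it is a conceptual illustration, not the verl implementation.

```python
import numpy as np

def grpo_advantages(group_rewards, eps: float = 1e-6):
    """Group-relative advantages in the spirit of GRPO.

    `group_rewards` holds the scalar rewards of all rollouts generated for the
    same prompt; each rollout's advantage is its reward standardised against
    the group, so no separate value network is needed.
    """
    rewards = np.asarray(group_rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts for one prompt, two of which produced a correct verdict.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # approximately [ 1., -1.,  1., -1.]
```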
Architecture of Pairwise-J1 and Pointwise-J1
The J1 framework explores several LLM-as-a-Judge architectures.
Pairwise LLM-as-a-Judge with Verdict (PaV):
This primary architecture takes a user question and a response pair as input and generates thought tokens and a final verdict indicating the preferred response.
Pairwise LLM-as-a-Judge with Scores (PaS):
This variant generates real-valued scores for each response, and the response with the higher score is selected as the verdict.
Pairwise LLM-as-a-Judge with Scores & Verdict (PaVS):
This architecture generates both per-response scores and a final verdict.
Pointwise LLM-as-a-Judge (PoS):
This architecture takes a user question and a single response as input and outputs a score reflecting the response quality. Pointwise judges are inherently consistent and are trained via distant supervision from pairwise data.
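The distant-supervision idea can be sketched as follows: only pairwise preferences are known, so the pointwise judge is rewarded whenever the scores it assigns to the two responses independently rank them in the preferred order. The helper below is an illustrative assumption, not the paper’s exact reward.

```python
def pointwise_distant_reward(score_chosen: float, score_rejected: float) -> float:
    """Reward a pointwise judge using only pairwise preference labels.

    The judge scores each response independently (never seeing the other one),
    so it cannot exhibit position bias; it is rewarded when the chosen
    response receives the strictly higher score.
    """
    return 1.0 if score_chosen > score_rejected else 0.0
```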
Experimental Evaluation
J1 models are evaluated on five pairwise judgment benchmarks: PPE, RewardBench, JudgeBench, RM-Bench, and FollowBenchEval. These benchmarks cover both verifiable and non-verifiable tasks and include multilingual instructions and responses from various LLMs. Baselines for comparison include zero-shot LLMs, scalar reward models, generative reward models, and general Reasoning/Thinking-LLMs.
Results on five reward modeling benchmarks
Use Cases for J1
J1 has many uses in the development and assessment of LLMs. By offering richer rewards, it can enhance training and improve alignment with desired behaviours. It can also serve as a highly effective evaluation tool, providing a more thorough review of LLM performance than conventional metrics. Its capacity to accurately assess non-verifiable tasks makes it especially useful in real-world applications where subjective attributes like helpfulness and safety are crucial.
Final Thoughts
J1 marks a major step forward in the development of LLM-as-a-Judge models. By utilising reinforcement learning with verifiable rewards, it achieves state-of-the-art performance in assessing both verifiable and non-verifiable LLM outputs. The study also offers valuable insights into minimising judgement bias and improving training methods. J1 paves the way for more effective LLM evaluation and better alignment, ultimately contributing to the progress of AI.