Short Review
Overview: Advancing LLM Critiquing with Two-Stage Reinforcement Learning
This article introduces Critique-RL, a two-stage reinforcement learning (RL) approach for training critiquing language models without strong external supervision or an oracle verifier. It addresses a key limitation of existing methods, which typically depend on more powerful supervisor models to annotate critique data. Critique-RL follows a two-player paradigm: an actor generates a response, a critic provides feedback, and the actor refines its output accordingly. The authors observe that optimizing the critic solely with indirect reward signals (i.e., whether the actor's refinement improves) tends to produce critics that are helpful but poorly discriminative. Critique-RL therefore splits training into two stages: Stage I reinforces the critic's discriminability with direct, rule-based reward signals, and Stage II improves its helpfulness through indirect rewards based on actor refinement, while preserving discriminability via regularization. Experiments across tasks and models show consistent gains, including a 9.02% improvement on in-domain tasks and a 5.70% improvement on out-of-domain tasks for Qwen2.5-7B.
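To make the two-stage reward design concrete, the sketch below shows one way the two signals could be combined on a verifiable task such as GSM8K. It is a minimal illustration based on the description above, not the authors' implementation: the helper arguments (`critique_verdict`, `response_is_correct`, `refined_is_correct`) and the regularization weight `beta` are assumptions introduced here for clarity.

```python
# Minimal sketch of the two-stage reward scheme described in the review.
# All names below are illustrative assumptions, not the paper's code.

def stage1_reward(critique_verdict: bool, response_is_correct: bool) -> float:
    """Stage I: direct, rule-based reward for discriminability.
    The critic is rewarded only when its correct/incorrect verdict on the
    actor's response matches the ground-truth check."""
    return 1.0 if critique_verdict == response_is_correct else 0.0

def stage2_reward(refined_is_correct: bool,
                  critique_verdict: bool,
                  response_is_correct: bool,
                  beta: float = 0.5) -> float:
    """Stage II: indirect reward for helpfulness (did the actor's refinement
    become correct after seeing the critique?), regularized by the Stage I
    signal so that discriminability is not traded away."""
    helpfulness = 1.0 if refined_is_correct else 0.0
    discriminability = stage1_reward(critique_verdict, response_is_correct)
    return helpfulness + beta * discriminability
```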
Critical Evaluation: Assessing Critique-RL's Impact on LLM Performance
Strengths: Robustness and Novelty in LLM Training
The primary strength of this work is its two-stage optimization strategy, which addresses the central challenge of training effective critiquing LLMs without expensive or unavailable strong supervision. By decoupling discriminability and helpfulness and optimizing them sequentially, Critique-RL avoids the trade-off observed when critics are trained on indirect rewards alone. The empirical evidence is convincing: Critique-RL consistently outperforms supervised fine-tuning (SFT) and other RL baselines on mathematical reasoning tasks, and the ablation studies confirm that both optimization stages are necessary. The reported gains in compute efficiency and the generalization to out-of-domain tasks further support the method's practical utility.
Weaknesses: Potential Limitations and Future Directions
While Critique-RL is a significant advance, some aspects warrant further consideration. The direct rule-based reward signals in Stage I, though effective, make the method dependent on the quality and coverage of those rules, which may limit its adaptability to nuanced or subjective domains where explicit rules are hard to define. The experiments also focus primarily on mathematical reasoning tasks (e.g., GSM8K), so whether the gains transfer to more open-ended or creative tasks beyond structured problem-solving remains an open question. Studying how the balance between discriminability and helpfulness shifts in such settings would clarify the method's broader applicability and limits.
Conclusion: The Future of Self-Improving Language Models
Critique-RL is a meaningful step toward more autonomous, self-improving large language models. By making it practical to train effective critiquing models without the cost of strong supervision, the work offers a clear path to improving LLM performance on complex reasoning tasks. The methodology is careful and the experimental results are compelling, marking a substantial contribution to RL for LLMs. Beyond its immediate practical value, the approach opens promising directions for future research on unsupervised self-correction mechanisms.