Short Review
Overview
The article presents BigCodeArena, a platform for real-time, execution-grounded evaluation of code generated by large language models (LLMs). It addresses a core difficulty in judging LLM-generated code: manual inspection of raw code is complex and time-consuming. By executing submitted code and surfacing the results, the platform lets users interact with running programs and base their preference judgments on observed behavior rather than static reading. The study collected over 14,000 conversation sessions spanning ten programming languages and a variety of execution environments, yielding insights into user preferences and model performance. These findings informed two derived benchmarks, BigCodeReward and AutoCodeArena, for systematically evaluating the coding capabilities of LLMs.
Critical Evaluation
Strengths
A primary strength of the article is its interactive, execution-based approach to evaluating LLM-generated code. By incorporating real-time execution feedback, BigCodeArena addresses the limitations of static evaluation methods that judge code without running it. The dataset of over 14,000 sessions provides a robust foundation for analyzing user preferences and model performance across diverse programming languages and environments. The derived benchmarks, BigCodeReward and AutoCodeArena, further support consistent, repeatable evaluation and offer concrete tools for future research.
Weaknesses
Despite these strengths, the article has limitations. Reliance on user-generated preference data may introduce bias into the judgments, potentially skewing evaluation outcomes. Additionally, while the findings indicate that incorporating execution results improves accuracy, some models exhibited instability, which raises questions about the generalizability of the results. The article would benefit from a more detailed discussion of these biases and of how the evaluation methodology might be refined to mitigate them.
Implications
The implications of this research are significant for code generation and its evaluation. By integrating user interaction and execution feedback into a single platform, BigCodeArena sets a new standard for assessing LLM coding performance. The finding that proprietary models such as GPT-5 outperform their open-source counterparts highlights the ongoing competition in developing advanced coding capabilities. The work also opens avenues for future studies of LLM performance across diverse coding tasks and environments.
Conclusion
In summary, the article presents a valuable contribution to the evaluation of LLM-generated code through the introduction of BigCodeArena. Its innovative approach, supported by extensive data collection and the development of new benchmarks, enhances our understanding of model performance and user preferences. As the field continues to evolve, the insights gained from this research will be instrumental in shaping future methodologies for assessing coding quality in LLMs.
Readability
The article is well-structured and accessible, making it suitable for a professional audience. The clear presentation of findings and methodology supports reader engagement, and the emphasis on key terms makes important concepts easy to locate. Overall, the narrative flows smoothly and encourages readers to explore the implications of the research in greater depth.