Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Shuangshuang Ying, Yunwen Li, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Xeron Du, Tianyu Zheng, Yichi Zhang, Letian Ni, Yuyang Cheng, Qiguang Chen, Jingzhe Ding, Shengda Long, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Ge Zhang, Wenhao Huang, Wanxiang Che, Chenghua Lin

18 Oct 2025

Quick Insight

AI Struggles to Pick the Best Story: What a New Study Uncovered

Ever wondered whether a computer can tell which story is more exciting? Researchers found that today’s AI systems, even the most advanced ones, often miss the subtle charm that makes a tale sparkle. They built a test called WritingPreferenceBench, gathering 1,800 paired stories in English and Chinese, all matched for factual accuracy and length. When standard AI judges were asked to choose the more engaging piece, they picked correctly only about half the time, barely better than a coin flip. Imagine asking a friend to pick the tastier of two cake slices that look identical: without a real sense of taste, all they can do is guess. Surprisingly, a newer kind of AI that explains its reasoning before choosing (for example, “this line feels more vivid because…”) got the right answer more than 80% of the time. This suggests that explicit reasoning matters more than raw model size, and that machines still have a long way to go before they truly understand human taste. The takeaway? Even as AI gets smarter, the magic of creativity and emotion remains a uniquely human treasure, waiting for the day machines can truly feel it. Stay curious and keep sharing stories!

Short Review

Advancing Subjective Preference Learning in Creative Writing

This article addresses a critical gap in preference learning: accurately assessing subjective writing quality when objective signals are absent. It introduces WritingPreferenceBench, a novel cross-lingual dataset designed to neutralize factors like factual accuracy and length. The research reveals that while standard sequence-based reward models and zero-shot LLM judges perform poorly, generative reward models incorporating explicit reasoning chains achieve substantially higher accuracy.

Critical Evaluation

Strengths: Pioneering Subjective Quality Assessment

A major strength is the study's innovative approach to isolating subjective writing quality through the meticulously constructed WritingPreferenceBench dataset. This dataset, featuring 1,800 human-annotated preference pairs spanning diverse genres in English and Chinese, effectively controls for objective factors, providing a robust foundation for evaluating nuanced aspects like creativity. The paper compellingly demonstrates the critical role of intermediate reasoning representations, showing how generative reward models with explicit reasoning chains significantly outperform traditional sequence-based reward models, offering a clear direction for future Reinforcement Learning from Human Feedback (RLHF) research.

Weaknesses: Unmasking Model Limitations

The research effectively uncovers significant weaknesses in current preference learning methods, particularly their inability to capture subjective quality without relying on objective error detection. Standard sequence-based reward models achieve only 52.7% accuracy, with zero-shot language model judges performing similarly at 53.9%, both barely above the 50% expected from random guessing on paired comparisons. Another notable weakness is severe genre instability: a given model's accuracy varies widely from one genre to another, and this instability persists even as model scale increases. These findings challenge the prevalent "LLM-as-judge" paradigm, suggesting inherent limitations in reliably assessing subjective creative quality.
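To make the two quantities in this paragraph concrete, here is a minimal sketch of how pairwise accuracy and per-genre spread could be computed. The record layout (`genre`, `score_chosen`, `score_rejected`), the toy scores, and the use of standard deviation as the instability measure are illustrative assumptions, not the benchmark's actual schema or the authors' metric.

```python
from collections import defaultdict
from statistics import mean, pstdev

def pairwise_accuracy(records):
    """Fraction of pairs where the model scored the human-preferred text higher.

    Ties are counted as errors (an assumption made for this sketch).
    """
    return mean(
        [1.0 if r["score_chosen"] > r["score_rejected"] else 0.0 for r in records]
    )

def accuracy_by_genre(records):
    """Group the pairs by genre and compute pairwise accuracy within each group."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["genre"]].append(r)
    return {genre: pairwise_accuracy(rs) for genre, rs in grouped.items()}

# Toy data, invented purely for illustration.
records = [
    {"genre": "poetry", "score_chosen": 0.9, "score_rejected": 0.4},
    {"genre": "poetry", "score_chosen": 0.2, "score_rejected": 0.7},
    {"genre": "fiction", "score_chosen": 0.8, "score_rejected": 0.1},
    {"genre": "fiction", "score_chosen": 0.6, "score_rejected": 0.3},
]

overall = pairwise_accuracy(records)
per_genre = accuracy_by_genre(records)
# One possible instability measure: spread of accuracy across genres.
genre_spread = pstdev(list(per_genre.values()))

print(overall, per_genre, genre_spread)
```

Because every item is a two-way choice, 50% is the chance baseline, which is why accuracies of 52.7% and 53.9% read as near-random. A model can also post a respectable overall number while swinging widely across genres, which is the instability described above.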

Implications: Reshaping AI for Creative Domains

The implications of this study are profound for developing AI in creative domains. It strongly suggests that successful preference modeling for subjective tasks requires a fundamental shift from direct classification to methods incorporating explicit reasoning. This necessitates exploring hybrid architectures and novel training objectives beyond current direct preference optimization (DPO) and LLM scaling approaches. The research provides a compelling argument for integrating more sophisticated cognitive processes into AI systems designed to understand and generate creative content, paving the way for more nuanced and human-aligned AI.
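The contrast between direct classification and reasoning-based judging can be sketched at the interface level. This is a hypothetical illustration, not the paper's implementation: `score` and `generate` are stand-in callables for a scalar reward model and a text-generating model, and the prompt wording and the `Verdict:` output convention are assumptions made for this example.

```python
import re
from typing import Callable, Tuple

def scalar_choice(score: Callable[[str], float], a: str, b: str) -> str:
    """Direct scoring: reduce each text to one number and pick the higher."""
    return "A" if score(a) >= score(b) else "B"

# Hypothetical prompt; the wording is an assumption, not taken from the paper.
PROMPT = (
    "Compare the two texts on creativity, style, and emotional resonance.\n"
    "Explain your reasoning first, then end with 'Verdict: A' or 'Verdict: B'.\n\n"
    "Text A:\n{a}\n\nText B:\n{b}\n"
)

def reasoned_choice(generate: Callable[[str], str], a: str, b: str) -> Tuple[str, str]:
    """Reasoning-based judging: the model writes a rationale, then a verdict.

    Returns the parsed verdict together with the full rationale text.
    """
    output = generate(PROMPT.format(a=a, b=b))
    verdicts = re.findall(r"Verdict:\s*([AB])\b", output)
    if not verdicts:
        raise ValueError("model output contained no verdict")
    return verdicts[-1], output  # the last verdict wins if several appear

# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return "Text B uses a concrete, surprising image.\nVerdict: B"

text_a = "It rained for a long time and everything got very wet."
text_b = "Rain stitched the sky to the street."

direct = scalar_choice(len, text_a, text_b)  # length as a shallow scoring proxy
verdict, rationale = reasoned_choice(toy_generate, text_a, text_b)
print(direct, verdict)
```

The design difference is that the second interface exposes an intermediate rationale that can be inspected or trained on, whereas the first collapses each text into a single score. The toy length-based scorer is there only to show how a shallow proxy can prefer the blander text.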

Conclusion: A New Path for AI in Creative Expression

This article makes a significant contribution to AI and creative writing by rigorously exposing the limitations of current preference learning methods in capturing subjective quality. The introduction of the WritingPreferenceBench dataset and the compelling evidence for the necessity of reasoning chains in generative reward models mark a crucial step forward. By challenging existing paradigms and proposing new directions, the study offers invaluable insights for developing more sophisticated and human-centric AI systems that truly understand and evaluate creative expression.