Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Perapard Ngokpol, Kun Kerdthaisong, Pasin Buakhaw, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot

17 Oct 2025

Quick Insight

Superhero AI: Testing Language Models on Marvel & DC Multiverses

Ever wondered if a chatbot could *truly* become your favorite hero? Researchers have built a new test called “Beyond One World” that puts AI models in the shoes of 30 iconic superheroes, from the classic caped crusader to the latest cinematic savior. The challenge isn’t just to recite famous catchphrases; the AI must remember the backstory of each specific version of a hero and make choices that match that version's moral compass. Think of it like a trivia night where the right answers change depending on which version of the character you’re playing. The researchers found that while some models can spin a convincing story, they often stumble on the exact details that fans cherish. The study also introduced a “Think‑Act Matching” score, which measures how well an AI’s reasoning lines up with its final actions, an important step toward trustworthy digital storytellers. Benchmarks like this could help virtual assistants, games, and educational tools feel more authentic, letting us interact with our beloved heroes in ways that feel genuinely personal.

Short Review

Evaluating Large Language Models in Multiversal Character Role-Playing

The article introduces "Beyond One World," a novel benchmark designed to rigorously evaluate the capacity of Large Language Models (LLMs) for version-specific character role-playing. Focusing on superhero canons, this research explores how LLMs portray distinct character incarnations across different universes, assessing both factual recall and moral decision-making. The methodology separates a model's internal "thinking" from its outward "acting" to gauge response fidelity. Key findings reveal significant challenges in cross-version generalization and a notable "thinking-acting" gap, where models struggle to align their reasoning with their actions. This work highlights critical limitations in the ability of current LLMs to achieve multiversal consistency.

Critical Evaluation of LLM Role-Playing Fidelity

Strengths of Multiversal LLM Evaluation

This study makes a substantial contribution by introducing a novel benchmark that addresses an underexplored yet crucial aspect of LLM performance: faithful, version-specific character portrayal. Utilizing the rich, complex narratives of superhero canons provides an ideal and comprehensive evaluation environment. The innovative framework, which distinguishes between a model's internal "thinking" and its external "acting," alongside the "Think-Act Matching" metric, offers a sophisticated approach to assessing reasoning alignment and trustworthiness in role-playing scenarios.
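To make the idea of comparing "thinking" with "acting" concrete, here is a minimal sketch. It does not reproduce the benchmark's Think-Act Matching metric, which this review does not define. It assumes a simple lexical stand-in (Jaccard word overlap) between a reasoning trace and the final in-character response. The function names and example sentences are hypothetical.

```python
from __future__ import annotations

import re


def _tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for a semantic representation."""
    return set(re.findall(r"[a-z']+", text.lower()))


def think_act_overlap(thinking: str, acting: str) -> float:
    """Jaccard overlap between a reasoning trace and the final response.

    Hypothetical proxy only: NOT the benchmark's Think-Act Matching metric.
    Returns a value in [0, 1], and 0.0 when both texts are empty.
    """
    t, a = _tokens(thinking), _tokens(acting)
    union = t | a
    return len(t & a) / len(union) if union else 0.0


# Hypothetical example: one reasoning trace, two candidate responses.
thinking = "I must protect the civilians first"
aligned = "I protect the civilians first"
misaligned = "Let the villain escape"

print(think_act_overlap(thinking, aligned) > think_act_overlap(thinking, misaligned))
```

A lexical overlap like this would miss paraphrases and contradictions phrased with the same words, so a real evaluation would likely need a semantic comparison or a judge model; the sketch only illustrates the shape of a score that pairs reasoning with action.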

Challenges in LLM Persona Fidelity

Despite its strengths, the research uncovers significant limitations in current LLMs. A primary concern is the persistent difficulty with cross-version generalization, indicating that models struggle to adapt character traits across different canonical iterations. Furthermore, the study reveals a "thinking-acting" gap, where models often excel at one but not both, suggesting a fundamental disconnect in their ability to consistently integrate reasoning with persona-faithful responses. The mixed impact of Chain-of-Thought prompting also highlights that current reasoning strategies are not universally effective, sometimes even reducing canonical accuracy in stronger models.

Future Directions for Role-Playing AI

The findings from "Beyond One World" provide a critical diagnostic for advancing role-playing AI. By exposing gaps in multiversal consistency and reasoning alignment, this benchmark guides future research towards developing more robust and trustworthy AI agents. It underscores the necessity for new architectural designs and training methodologies that can bridge the identified "thinking-acting" gap and enhance persona fidelity across complex, evolving character narratives, ultimately contributing to more sophisticated and reliable LLM applications.

Conclusion

The "Beyond One World" benchmark offers a foundational evaluation of how well LLMs handle complex, version-specific role-playing. It establishes a new standard for assessing character-grounded performance and clearly delineates the areas that need improvement. The research should be valuable in guiding the development of LLMs that achieve multiversal consistency and reasoning alignment, paving the way for more coherent and trustworthy AI interactions.