Short Review
Overview
This paper introduces RAPO++, a three-stage framework that improves Text-to-Video (T2V) generation quality by optimizing user-provided prompts rather than the models themselves. Recognizing that user prompts are often short, unstructured, and misaligned with the prompt distributions T2V models were trained on, RAPO++ refines them without altering the underlying generative backbone. The framework combines Retrieval-Augmented Prompt Optimization (RAPO), Sample-Specific Prompt Optimization (SSPO), and fine-tuning of the rewriter Large Language Model (LLM) to refine prompts iteratively, with the goal of improving semantic alignment, compositional reasoning, temporal stability, and physical plausibility in generated videos. Experiments across five state-of-the-art T2V models and multiple benchmarks show consistent gains, positioning RAPO++ as a model-agnostic, cost-efficient, and scalable solution for prompt optimization.
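The three stages described above can be sketched as a simple pipeline. This is a toy illustration of the data flow only: the function names (`rapo_stage`, `sspo_stage`, `finetune_stage`) and their internals are hypothetical placeholders, not the paper's actual components.

```python
# Toy skeleton of the three-stage RAPO++ data flow, with stub logic.
# Real implementations would call a retriever, a T2V model, VLM verifiers,
# and an LLM rewriter; here each stage is simulated with string operations.

def rapo_stage(user_prompt: str, corpus: list[str]) -> str:
    # Stage 1 (RAPO): retrieve training-style prompts that share words with
    # the user prompt and merge their structure in as style hints.
    related = [p for p in corpus if any(w in p for w in user_prompt.split())]
    hints = "; ".join(related[:2])
    return f"{user_prompt}. Style hints: {hints}" if hints else user_prompt

def sspo_stage(prompt: str, rounds: int = 3) -> list[str]:
    # Stage 2 (SSPO): iteratively refine the prompt; each round stands in
    # for one generate -> verify -> rewrite cycle on a specific sample.
    trajectory = [prompt]
    for i in range(rounds):
        trajectory.append(f"{trajectory[-1]} [refinement {i + 1}]")
    return trajectory

def finetune_stage(trajectories: list[list[str]]) -> list[tuple[str, str]]:
    # Stage 3: pair each initial prompt with its best refinement to build
    # fine-tuning data for the rewriter LLM.
    return [(t[0], t[-1]) for t in trajectories]
```

Under this reading, stage 3 is what lets the rewriter eventually produce strong prompts in one pass, without rerunning the stage-2 loop per prompt.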
Critical Evaluation
Strengths
RAPO++ offers a well-designed solution to a genuine bottleneck in T2V generation. Its multi-stage architecture, combining retrieval augmentation, iterative refinement, and LLM fine-tuning, addresses prompt quality at several levels at once. Because the framework never modifies the generative backbone, it is model-agnostic and straightforward to integrate with diverse T2V models. The reported gains in semantic alignment, compositional reasoning, and temporal stability across multiple benchmarks support its effectiveness. The closed-loop feedback mechanism of SSPO, in which Vision-Language Models (VLMs) and verifiers score each generated video and guide the next rewrite, yields context-aware prompt refinement and progressively better generations. Finally, fine-tuning the rewriter LLM internalizes these optimization patterns, so high-quality prompts can be produced in a single pass before inference, a substantial advantage for scalability and computational efficiency.
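The closed-loop SSPO mechanism praised above can be made concrete with a minimal runnable sketch. All three components here (`generate_video`, `verifier_score`, `rewrite_prompt`) are stand-in stubs of my own construction, assuming only the generate-score-rewrite cycle the review describes; they do not reproduce the paper's models or scoring.

```python
# Minimal sketch of an SSPO-style closed loop: generate, score with a
# verifier, rewrite the prompt, and repeat until the score plateaus.

def generate_video(prompt: str) -> str:
    # Stand-in for a T2V backbone: returns a token representing the video.
    return f"video<{prompt}>"

def verifier_score(prompt: str, video: str) -> float:
    # Stand-in for a VLM/verifier: in this toy model, prompts that cover
    # more structural fields score higher.
    detail = sum(field in prompt for field in ("subject:", "motion:", "scene:"))
    return min(1.0, 0.4 + 0.2 * detail)

def rewrite_prompt(prompt: str, score: float) -> str:
    # Stand-in for the rewriter LLM: conditioned (trivially) on feedback,
    # it adds one missing structural field per round.
    for field in ("subject:", "motion:", "scene:"):
        if field not in prompt:
            return f"{prompt} | {field} <refined detail>"
    return prompt

def sspo_loop(prompt: str, max_iters: int = 5, target: float = 0.95):
    # Returns the (prompt, score) trajectory of the refinement loop.
    history = [(prompt, verifier_score(prompt, generate_video(prompt)))]
    for _ in range(max_iters):
        prompt = rewrite_prompt(prompt, history[-1][1])
        score = verifier_score(prompt, generate_video(prompt))
        history.append((prompt, score))
        if score >= target:
            break
    return history
```

Even in this toy form, the loop illustrates the cost concern raised below in Weaknesses: each refinement round pays for one full generation plus one verification.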
Weaknesses
While RAPO++ marks a clear advance, several aspects warrant further consideration. The paper acknowledges limitations on numeracy tasks: the framework may struggle with precise counting or quantity-related instructions, which the authors propose to address with count-aware feedback mechanisms. The complexity of a three-stage pipeline involving multiple LLMs, VLMs, and iterative feedback loops, while powerful, could complicate implementation and debugging, particularly for researchers new to this domain. The framework's reliance on external LLMs and VLMs also ties its performance to the capabilities and potential biases of those underlying models. Finally, although the paper emphasizes cost-efficiency, the iterative nature of SSPO may incur notable per-prompt computational overhead, especially before the fine-tuned LLM generalizes well enough to make the loop unnecessary.
Conclusion
RAPO++ is a substantial contribution to generative AI, and to Text-to-Video synthesis in particular. By addressing the fundamental problem of suboptimal user prompts, it measurably improves the quality and fidelity of generated videos, and its model-agnostic, cost-efficient design makes it a practical, scalable tool for researchers and practitioners. The demonstrated improvements in compositional understanding and physical plausibility set a strong reference point for prompt optimization in T2V. Future work on current limitations, such as numeracy, should further solidify RAPO++'s impact and broaden its applicability across diverse T2V generation scenarios.