Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

20 Oct 2025


AI-generated image, based on the article abstract

Quick Insight

How AI Learned to Edit Videos Like a Pro—Without Real Footage

Ever wondered how a computer could follow your exact video‑editing instructions? Researchers built a clever system called Ditto that teaches AI to cut, add, and transform clips just by reading a simple command. The trick? Instead of hunting for endless real videos, they let a powerful image editor imagine scenes and then stitched them together with a fast video generator, creating a massive library of one million synthetic examples. Think of it like a chef who practices recipes in a virtual kitchen before cooking for real guests. This synthetic “cookbook” lets the AI learn the art of editing without the huge cost of filming everything. The result is Editto, a model that follows instructions with surprising accuracy, setting a new benchmark for AI‑driven video creation. This breakthrough means anyone could soon turn a text prompt into a polished clip, opening doors for creators, teachers, and marketers alike. Imagine the possibilities when video editing becomes as easy as sending a message—your story, your way, in seconds.


Short Review

Advancing Instruction-Based Video Editing with the Ditto Framework

The field of instruction-based video editing has long faced a significant hurdle: the scarcity of large-scale, high-quality training data. This challenge limits the development of robust models capable of democratizing content creation. A recent article introduces Ditto, a comprehensive framework designed to overcome this fundamental data limitation. At its core, Ditto features an innovative data generation pipeline that synergistically combines a leading image editor with an in-context video generator, enabling a range of edits beyond what existing models support. The framework also addresses the prohibitive cost-quality trade-off through an efficient, distilled model architecture, augmented with a temporal enhancer that reduces computational overhead while improving temporal coherence. The entire process is driven by an intelligent agent that crafts diverse instructions and rigorously filters outputs, ensuring quality control at scale. Utilizing this framework, the researchers invested over 12,000 GPU-days to construct Ditto-1M, a dataset comprising one million high-fidelity video editing examples. Training their model, Editto, on Ditto-1M with a curriculum learning strategy yielded superior instruction-following capabilities and established a new state-of-the-art in this rapidly evolving domain.
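The pipeline described above — edit a keyframe with an image editor, propagate the edit through an in-context video generator, then have an agent filter the result — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual code: `edit_keyframe`, `propagate_edit`, and `vlm_quality_score` are hypothetical stand-ins for the real image editor, video generator, and VLM agent.

```python
# Illustrative sketch of a Ditto-style data generation pass.
# All function names below are placeholders, not the paper's API.

def edit_keyframe(frame, instruction):
    # Stand-in for a powerful image editor applying the instruction
    # to a single representative frame.
    return f"edited({frame}|{instruction})"

def propagate_edit(source_frames, edited_keyframe):
    # Stand-in for an in-context video generator: spreads the keyframe
    # edit across every frame while preserving the source motion.
    return [f"{edited_keyframe}@t{t}" for t in range(len(source_frames))]

def vlm_quality_score(edited_frames, instruction):
    # Stand-in for the VLM agent scoring edit fidelity and quality;
    # a real system would query a vision-language model here.
    return 0.9

def generate_example(source_frames, instruction, threshold=0.8):
    """One pipeline pass: edit a keyframe, propagate, then filter."""
    edited = edit_keyframe(source_frames[0], instruction)
    video = propagate_edit(source_frames, edited)
    if vlm_quality_score(video, instruction) < threshold:
        return None  # agent rejects low-quality outputs
    return {"source": source_frames,
            "instruction": instruction,
            "edited": video}

example = generate_example(["f0", "f1", "f2"], "make it snow")
```

The filtering step is what makes the pipeline scalable: low-scoring outputs are dropped automatically rather than reviewed by hand.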

Critical Evaluation of the Ditto Framework

Strengths of the Ditto Framework

The Ditto framework presents several compelling strengths that significantly advance the landscape of AI-driven video editing. Foremost is its innovative solution to the pervasive problem of data scarcity, delivering the massive Ditto-1M dataset. This synthetic data generation pipeline, which fuses an image editor, a video generator, and a Vision-Language Model (VLM) agent, is a robust approach to creating diverse and high-quality training examples. The methodology prioritizes both aesthetic and motion quality, employing sophisticated techniques like source video filtering and a two-step VLM prompting strategy for contextually grounded edits. Furthermore, the framework's efficiency is notable, utilizing a distilled model architecture and a temporal enhancer to manage computational costs while boosting performance. The intelligent agent's role in automating instruction generation and rigorous output filtering ensures scalability and maintains high data quality, which is crucial for training advanced models like Editto. The quantitative and qualitative results, including superior CLIP-T, CLIP-F, VLM scores, and positive user study feedback, alongside crucial ablation studies, firmly establish Editto's state-of-the-art performance and validate the importance of data scale and the Modality Curriculum Learning (MCL) strategy.
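The CLIP-T and CLIP-F scores cited above are standard CLIP-embedding metrics: CLIP-T measures how well edited frames match the text instruction, while CLIP-F measures frame-to-frame consistency as a proxy for temporal coherence. A minimal sketch, assuming frames and text have already been embedded by a CLIP encoder (the embedding step itself is omitted):

```python
# Hedged sketch of CLIP-T / CLIP-F style metrics over precomputed
# embeddings; a real pipeline would obtain these from a CLIP encoder.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_t(frame_embs, text_emb):
    """Mean frame-text similarity: does the video match the instruction?"""
    return sum(cosine(f, text_emb) for f in frame_embs) / len(frame_embs)

def clip_f(frame_embs):
    """Mean consecutive-frame similarity: a temporal consistency proxy."""
    sims = [cosine(a, b) for a, b in zip(frame_embs[:-1], frame_embs[1:])]
    return sum(sims) / len(sims)
```

Both scores lie in [-1, 1], with higher values better; a perfectly static video scores CLIP-F of 1.0, which is why CLIP-F is read alongside motion-quality measures rather than alone.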

Potential Weaknesses and Future Directions

While the Ditto framework offers substantial advancements, certain aspects warrant consideration. The investment of over 12,000 GPU-days to build Ditto-1M, while yielding an impressive dataset, highlights the substantial computational resources required for such large-scale data generation. This could present a barrier for research groups with more limited computational infrastructure. Additionally, the reliance on a Vision-Language Model (VLM) for instruction generation and output curation, while effective, means the quality and diversity of the generated data are inherently tied to the VLM's capabilities and potential biases. Future research could explore methods to further diversify instruction generation or incorporate human-in-the-loop validation for critical scenarios to mitigate potential VLM-induced limitations. Evaluating Editto on more diverse, real-world, uncurated video content could also provide valuable insight into its robustness beyond the synthetic Ditto-1M dataset.

Conclusion

The Ditto framework represents a pivotal contribution to the field of instruction-based video editing, effectively addressing the long-standing challenge of data scarcity. By introducing a novel, scalable data generation pipeline and the extensive Ditto-1M dataset, the research provides an invaluable resource for the community. The resulting Editto model, trained with a sophisticated Modality Curriculum Learning strategy, demonstrates exceptional instruction-following ability and sets a new benchmark for performance. This work not only pushes the boundaries of AI-driven video content creation but also lays a strong foundation for future research into more efficient, diverse, and accessible video editing technologies, ultimately moving closer to the vision of democratized content creation.

Keywords

  • Instruction-based video editing
  • AI video generation
  • video editing training data
  • large-scale video datasets
  • Ditto framework
  • data generation pipeline
  • in-context video generator
  • distilled model architecture
  • temporal coherence improvement
  • intelligent agent data curation
  • Ditto-1M dataset
  • curriculum learning AI
  • state-of-the-art video editing AI
  • computational efficiency in AI
  • democratizing content creation

🤖 This analysis and review was primarily generated and structured by an AI. The content is provided for informational and quick-review purposes.
