LongCat-Video Technical Report

Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang

29 Oct 2025

Quick Insight

AI Breakthrough: Creating Minutes‑Long Videos in Minutes

Ever imagined a computer that could produce a minutes‑long video clip in about the time it takes to brew coffee? Researchers at Meituan have unveiled an AI model called LongCat‑Video that aims to do exactly that. Built on a “diffusion transformer” engine, the system can turn a short text prompt, a single picture, or a brief clip into a smooth, high‑definition video that lasts for minutes. Think of it like a painter who first sketches the outline and then fills in the details: the model produces a rough version first and then refines it, which keeps the process fast. LongCat‑Video keeps motion consistent from frame to frame, so the result feels natural, and it delivers 720p video at 30 frames per second in just a few minutes. This opens doors for creators, educators, and anyone who dreams of bringing stories to life without waiting hours for rendering. The future of video may soon be as instant as a click. 🌟

Short Review

Advancing Long Video Generation with LongCat-Video: A Scientific Review

This review examines LongCat-Video, a 13.6-billion-parameter foundation model designed for efficient, high-quality long video generation, which its authors present as a step toward world models. Using a unified Diffusion Transformer (DiT) architecture, LongCat-Video handles Text-to-Video, Image-to-Video, and Video-Continuation tasks within a single framework. To make inference efficient for minutes-long, 720p, 30fps videos, it combines a coarse-to-fine generation strategy along the temporal and spatial axes with Block Sparse Attention. Output quality is further improved by a multi-reward Reinforcement Learning from Human Feedback (RLHF) stage built on Group Relative Policy Optimization (GRPO), which is also intended to keep training stable.
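To make the idea of block sparse attention concrete, here is a minimal pure-Python sketch. It is not the report's implementation: the function names, the mean-pooling score, and the top-k selection rule are assumptions chosen for illustration. Tokens are grouped into blocks, each query block picks the few key blocks it scores highest against, and softmax attention is computed only over those blocks.

```python
import math

def block_sparse_mask(q, k, block_size, keep_blocks):
    """For each query block, return the sorted indices of the key blocks kept.

    q, k: lists of equal-length float vectors, one per token.
    Blocks are scored by the dot product of mean-pooled queries and keys
    (an illustrative choice, not taken from the report).
    """
    def pool(x):
        blocks = [x[i:i + block_size] for i in range(0, len(x), block_size)]
        return [[sum(col) / len(b) for col in zip(*b)] for b in blocks]

    q_blocks, k_blocks = pool(q), pool(k)
    selected = []
    for qb in q_blocks:
        scores = [sum(a * b for a, b in zip(qb, kb)) for kb in k_blocks]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(ranked[:keep_blocks]))
    return selected

def sparse_attention(q, k, v, block_size, keep_blocks):
    """Softmax attention restricted to the key blocks chosen per query block."""
    selected = block_sparse_mask(q, k, block_size, keep_blocks)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Token indices covered by the key blocks kept for this query's block.
        cols = [j for blk in selected[i // block_size]
                for j in range(blk * block_size, min((blk + 1) * block_size, len(k)))]
        logits = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in cols]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        out.append([sum(w * v[j][d] for w, j in zip(weights, cols)) / total
                    for d in range(len(v[0]))])
    return out
```

In this sketch the work per query scales with `keep_blocks * block_size` rather than with the full sequence length, which is the general reason block sparsity helps at high resolution.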

Critical Evaluation

Strengths

LongCat-Video has several notable strengths. Its unified architecture covers Text-to-Video, Image-to-Video, and Video-Continuation in one model, which simplifies both development and deployment. It generates long, temporally coherent videos efficiently, relying on coarse-to-fine generation and Block Sparse Attention to accelerate high-resolution inference. A comprehensive data curation pipeline, combined with multi-stage training that includes Supervised Fine-Tuning (SFT) and multi-reward RLHF with GRPO, supports both output quality and training stability. The evaluation protocol spans human and automatic benchmarks and shows performance competitive with state-of-the-art models. Finally, the public release of code and model weights improves transparency and makes follow-up research easier.
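As a rough illustration of how a GRPO-style update can use several rewards, the sketch below normalizes each reward within a group of samples generated for the same prompt and then combines the results with a weighted sum. The function names, reward names, and weights are hypothetical, and normalizing each reward separately before combining is an assumption of this sketch rather than a detail confirmed by the review.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

def multi_reward_advantages(reward_table, weights):
    """Weighted sum of per-reward group-relative advantages.

    reward_table: dict of reward name -> list of scores, one per sample.
    weights: dict of reward name -> float (hypothetical values).
    """
    n = len(next(iter(reward_table.values())))
    combined = [0.0] * n
    for name, scores in reward_table.items():
        for i, a in enumerate(group_relative_advantages(scores)):
            combined[i] += weights[name] * a
    return combined

# Hypothetical scores for four videos sampled from one prompt.
scores = {
    "visual_quality": [0.2, 0.5, 0.9, 0.4],
    "motion_quality": [0.7, 0.1, 0.6, 0.6],
}
advantages = multi_reward_advantages(
    scores, {"visual_quality": 1.0, "motion_quality": 1.0}
)
```

Because each advantage is measured relative to the group mean, samples that beat their siblings are reinforced and the rest are discouraged, without requiring a separate learned value function.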

Weaknesses

While highly innovative, LongCat-Video does present areas for potential refinement. The complexity of its multi-stage training pipeline, involving advanced concepts like Flow Matching, LoRA, and fixed stochastic timestep SDE sampling, could pose a considerable barrier for researchers with limited computational resources or specialized expertise, despite the open-source release. Additionally, the evaluation identified specific areas for improvement, particularly concerning image and motion alignment, suggesting that while overall quality is high, subtle inconsistencies might still occur. Although efficient, generating "minutes-long videos within minutes" still implies substantial computational demands for large-scale or continuous video production, highlighting the ongoing challenge of scaling generative models.
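A back-of-the-envelope calculation shows why long, high-resolution video remains expensive. The sketch below estimates how many tokens a transformer would see if a whole clip were processed as one sequence. The 1280x720 frame size and the compression factors `t_stride` and `s_stride` are illustrative assumptions, not figures from the report.

```python
def latent_tokens(seconds, fps=30, height=720, width=1280,
                  t_stride=4, s_stride=16):
    """Estimate the sequence length for one clip.

    t_stride and s_stride are hypothetical temporal and spatial
    compression factors (autoencoder plus patchification).
    """
    frames = seconds * fps
    t = -(-frames // t_stride)   # ceiling division
    h = -(-height // s_stride)
    w = -(-width // s_stride)
    return t * h * w

def dense_attention_pairs(tokens):
    """Query-key pairs that full self-attention scores per layer and head."""
    return tokens * tokens

one_minute = latent_tokens(60)                 # 450 * 45 * 80 = 1,620,000 tokens
pairs = dense_attention_pairs(one_minute)      # about 2.6e12 pairs
```

Under these assumptions, doubling the clip length quadruples the number of dense attention pairs, which is why techniques that avoid full attention matter for minutes-long output.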

Implications

The introduction of LongCat-Video carries significant implications for the future of generative AI and its applications. Its ability to produce high-quality, temporally coherent long videos efficiently opens new avenues for content creation, virtual reality, and advanced simulation environments. By laying a strong foundation for world models, LongCat-Video pushes the boundaries of AI's capacity to understand and generate dynamic, complex sequences. The sophisticated training methodologies, particularly the multi-reward RLHF and efficiency techniques, offer valuable insights that could inspire advancements across various generative model architectures. Its open-source nature is poised to democratize access to cutting-edge video generation technology, fostering collaborative innovation and accelerating the pace of research globally.

Conclusion

LongCat-Video represents a notable advance in video generation, addressing the challenge of producing high-quality long videos efficiently. Its architectural design, training methodology, and demonstrated performance support its position as a foundation model. Despite some areas for refinement, such as image and motion alignment, its contributions to efficiency, temporal coherence, and the pursuit of world models are substantial, making it a valuable resource for the research community and a meaningful step forward in generative AI.