Short Review
Advancing Robot Autonomy with Spatially Guided Vision-Language-Action Frameworks
This analysis focuses on InternVLA-M1, a vision-language-action (VLA) framework aimed at moving instruction-following robots toward scalable, general-purpose capability. Its core contribution is spatially guided training, which links human instructions to robot actions through explicit spatial grounding. The method uses a two-stage pipeline: spatial grounding pre-training first learns "where to act" by aligning instructions with visual positions, and spatially guided action post-training then learns "how to act" through embodiment-aware action generation. The authors report consistent performance gains across diverse robotic tasks and environments, with improved spatial reasoning and generalization.
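To make the two-stage structure concrete, the following is a minimal Python sketch of the training schedule described above. It is an illustrative assumption, not the paper's implementation: the names (`Sample`, `stage1_spatial_grounding`, `stage2_action_posttraining`) are invented, and simple lookup tables stand in for the learned grounding and action models.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    instruction: str
    target_xy: tuple   # grounded pixel location: "where to act"
    action: list       # low-level action parameters: "how to act"

def stage1_spatial_grounding(samples):
    """Pre-training stand-in: align each instruction with a visual position."""
    # A real system would update VLM weights; here we just record the mapping.
    return {s.instruction: s.target_xy for s in samples}

def stage2_action_posttraining(samples, grounding):
    """Post-training stand-in: condition action generation on grounded locations."""
    policy = {}
    for s in samples:
        loc = grounding.get(s.instruction)
        policy[(s.instruction, loc)] = s.action
    return policy

# Toy data illustrating the two-stage flow on a pick-and-place instruction.
samples = [
    Sample("pick up the red block", (120, 88), [0.1, -0.2, 0.05]),
    Sample("place it in the bowl", (200, 140), [0.0, 0.3, -0.1]),
]
grounding = stage1_spatial_grounding(samples)
policy = stage2_action_posttraining(samples, grounding)
```

The point of the sketch is the ordering: spatial grounding is learned first and the action stage consumes its output, rather than both objectives being mixed from the start.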
Critical Evaluation
Strengths
The framework's primary strength is its spatially guided training, which bridges high-level instructions and low-level robot actions. The two-stage pipeline, pairing spatial grounding pre-training on over 2.3 million examples with subsequent action post-training, is a well-motivated design. A scalable synthetic data engine that generates 244K generalizable pick-and-place episodes further improves the model's ability to generalize across varied scenarios. InternVLA-M1 consistently outperforms existing baselines, with reported gains of +14.6% to +20.6% on benchmarks such as SimplerEnv, WidowX, and LIBERO, and stronger real-world performance on complex, long-horizon tasks.
Weaknesses
The results are compelling, but several questions remain open. The sim-to-real gap is only partially mitigated by synthetic co-training and deserves closer analysis. The large training corpus and dual-system architecture also imply substantial computational cost, which may limit accessibility and deployment in resource-constrained settings. Finally, the framework's adaptability to entirely novel, unstructured environments beyond the evaluated benchmarks has yet to be demonstrated.
Implications
InternVLA-M1 is a notable step toward scalable generalist robots that can understand and execute complex instructions. Its spatially guided approach offers a unifying principle for building more resilient and adaptable robotic systems, advancing both instruction following and autonomous manipulation. The work is relevant to industrial automation, service robotics, and human-robot collaboration, where more capable instruction-following agents are needed in real-world settings.
Conclusion
This article introduces a framework that meaningfully advances robot autonomy through spatially guided training. InternVLA-M1's strong benchmark performance, generalization, and spatial reasoning make it a significant development for the field. The methodology provides a solid foundation for building more capable, adaptable, and scalable robotic systems, and a substantive contribution toward general-purpose robots.