Short Review
Advancing GUI Grounding with Instruction-as-Reasoning
This paper introduces the Instruction-as-Reasoning (IAR) paradigm to enhance Graphical User Interface (GUI) grounding, a core capability for intelligent agents. It addresses limitations of static instructions and poor data quality, revealing a substantial 23.3% flaw rate in existing datasets. The proposed two-stage framework combines Supervised Fine-Tuning (SFT) with Reinforcement Learning (RL) using Group Relative Policy Optimization (GRPO). This enables models to dynamically select optimal analytical pathways from diverse instructions, yielding up to a 76% relative performance improvement. The resulting UI-Ins models achieve state-of-the-art results on five benchmarks, demonstrating emergent reasoning and strong agentic potential.
Critical Evaluation
Strengths of the Instruction-as-Reasoning Paradigm
A significant strength is the novel Instruction-as-Reasoning (IAR) paradigm, which fundamentally shifts how natural language instructions are utilized in GUI grounding. The authors identify a substantial 23.3% flaw rate in existing datasets and propose a robust two-stage framework combining Supervised Fine-Tuning (SFT) with Reinforcement Learning (RL) via Group Relative Policy Optimization (GRPO), supported by a carefully designed data pipeline. This method achieves state-of-the-art performance across multiple challenging benchmarks, demonstrating the UI-Ins models' superior accuracy and emergent reasoning capabilities. Furthermore, the framework mitigates policy collapse and shows strong agentic potential, making it a valuable contribution to the field.
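To make the RL stage concrete: GRPO dispenses with a learned value network and instead scores each sampled response against the statistics of its own sampling group. The sketch below is a minimal, hypothetical illustration of that group-relative advantage computation (the function name, reward values, and setup are illustrative assumptions, not the paper's implementation):

```python
# Hypothetical sketch of GRPO's group-relative advantage, not the authors' code.
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampled group's mean and std.

    GRPO replaces a learned value baseline with group statistics:
        advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    mu = statistics.fmean(rewards)       # group mean as the baseline
    sigma = statistics.pstdev(rewards)   # group spread for normalization
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: binary grounding rewards for a group of 4 sampled responses
# (e.g., 1.0 if the predicted click lands inside the target element, else 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # positive for correct clicks, negative otherwise
```

Because advantages are centered within each group, correct responses are reinforced only relative to their sampled peers, which is part of why the method is comparatively lightweight to implement.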
Weaknesses and Future Directions
While innovative, the methodology's reliance on GPT-4.1 for generating instructions introduces a dependency on that model's capabilities and biases. The two-stage SFT+RL framework, particularly with GRPO, could also impose significant computational demands and implementation challenges. Although the UI-Ins models achieve impressive results on GUI grounding, the generalizability of the Instruction-as-Reasoning paradigm to broader multimodal tasks warrants further investigation. A deeper qualitative analysis of the root causes of the identified MLLM errors could also provide richer insights for future model development.
Conclusion
This paper presents an impactful advancement in GUI grounding, fundamentally rethinking how natural language instructions are leveraged. By introducing the Instruction-as-Reasoning paradigm and a robust two-stage training framework, the authors achieve state-of-the-art performance and provide critical insights into instruction quality and diversity. The UI-Ins models demonstrate impressive emergent reasoning and strong agentic potential, setting a new benchmark for intelligent GUI agents. This work offers a valuable blueprint for developing more capable multimodal models, addressing key challenges while opening new research avenues.