Short Review
Advancing Multimodal LLMs with Latent Sketchpad for Visual Thinking
This research introduces Latent Sketchpad, a framework designed to overcome a significant limitation of Multimodal Large Language Models (MLLMs): their difficulty with complex visual planning and imagination. Inspired by human cognitive processes, the framework equips MLLMs with an internal visual scratchpad, enabling generative visual thought without compromising core reasoning abilities. The methodology integrates visual generation directly into the MLLM's native autoregressive reasoning, allowing textual and visual processing to interleave seamlessly. Key components are a Context-Aware Vision Head, which generates visual latents autoregressively, and a pretrained Sketch Decoder, which translates those internal latents into interpretable sketch images. Evaluated on a new dataset, MazePlanning, Latent Sketchpad delivers comparable or superior reasoning performance across various frontier MLLMs while significantly improving their interpretability and robustness.
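To make the described architecture concrete, below is a minimal, hypothetical sketch of how interleaved visual latent generation and sketch decoding might be wired together. Only the component names (Context-Aware Vision Head, Sketch Decoder) come from the paper; the module internals, tensor shapes, and the stand-in backbone are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Latent Sketchpad-style interleaved decoding.
# All internals here are assumptions; the real Vision Head and the
# pretrained Sketch Decoder are not specified in this review.
import torch
import torch.nn as nn

class ContextAwareVisionHead(nn.Module):
    """Maps an MLLM hidden state to the next visual latent (assumed MLP)."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_latent),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)

class SketchDecoder(nn.Module):
    """Decodes a sequence of visual latents into one grayscale sketch (assumed)."""
    def __init__(self, d_latent: int, image_hw: int = 64):
        super().__init__()
        self.to_pixels = nn.Linear(d_latent, image_hw * image_hw)
        self.image_hw = image_hw

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # Pool the latent sequence, then render; the paper uses a pretrained decoder.
        pixels = torch.sigmoid(self.to_pixels(latents.mean(dim=0)))
        return pixels.view(self.image_hw, self.image_hw)

d_model, d_latent = 512, 256
vision_head = ContextAwareVisionHead(d_model, d_latent)
sketch_decoder = SketchDecoder(d_latent)

# Stand-in for the frozen MLLM backbone: one hidden state per reasoning step.
hidden_states = torch.randn(8, d_model)

# Autoregressively emit visual latents, then render them as a sketch.
latents = torch.stack([vision_head(h) for h in hidden_states])
sketch = sketch_decoder(latents)
print(sketch.shape)  # torch.Size([64, 64])
```

In the paper's framing, such latents would be fed back into the backbone's reasoning sequence alongside text tokens; this sketch shows only the generation-and-decode path.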
Critical Evaluation of Latent Sketchpad
Strengths
The Latent Sketchpad framework presents several compelling strengths. Its primary contribution is addressing a critical gap in MLLM capabilities: enabling generative visual thinking rather than mere perceptual understanding. Integrating visual generation directly into the autoregressive reasoning process is a sophisticated design choice that fosters a more human-like cognitive flow. Empirical evaluations show that the framework matches or exceeds the reasoning performance of its underlying MLLM backbones, including frontier models such as GPT-4o, Gemma3, and Qwen2.5-VL, demonstrating strong generalization and plug-and-play applicability across diverse architectures. The ability to translate internal latents into interpretable sketch images also enhances transparency, offering a window into the model's internal thought process. Finally, the framework improves out-of-distribution (OOD) robustness, visualization quality, and structural stability, as evidenced by high Layout Consistency Rate (LCR) and Visual Success Rate (VSR) scores.
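For readers unfamiliar with these metrics, the sketch below shows one plausible way such rates could be computed, assuming LCR is the fraction of generated sketches whose layout matches a reference and VSR is the fraction of episodes whose decoded path ends at the goal. The paper's exact definitions may differ.

```python
# Hypothetical metric sketches; the paper's actual LCR/VSR definitions
# are not reproduced in this review and may differ.
def layout_consistency_rate(pred_layouts, ref_layouts):
    """Fraction of generated sketches whose maze layout matches the reference."""
    matches = sum(p == r for p, r in zip(pred_layouts, ref_layouts))
    return matches / len(ref_layouts)

def visual_success_rate(traced_paths, goals):
    """Fraction of episodes whose decoded path ends at the goal cell."""
    successes = sum(path[-1] == goal for path, goal in zip(traced_paths, goals))
    return successes / len(goals)

# Toy usage with two layouts and one traced path.
print(layout_consistency_rate(["####", "#..#"], ["####", "#.##"]))  # 0.5
print(visual_success_rate([[(0, 0), (1, 1)]], [(1, 1)]))            # 1.0
```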
Weaknesses
While highly innovative, the Latent Sketchpad framework leaves several open questions. The reliance on a newly introduced dataset, MazePlanning, while valuable for targeted evaluation, limits how far the results can be assumed to generalize to a broader spectrum of visual planning tasks. The computational overhead of the additional generative components, the Context-Aware Vision Head and the Sketch Decoder, may also affect inference speed and resource requirements in constrained deployment environments. Although the decoded sketches aid interpretability, the raw visual latents themselves remain opaque prior to decoding and warrant deeper study. Finally, the complexity of training and fine-tuning the integrated components, including the specific loss functions and connector adaptation, could pose a barrier for researchers without specialized expertise.
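As a rough illustration of what such a training objective could look like, the following sketch combines standard next-token cross-entropy on text with a regression loss on predicted visual latents. The loss choices and the weighting term lambda_vis are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical combined objective: text cross-entropy plus visual latent
# regression. Both the MSE choice and lambda_vis are illustrative assumptions.
import torch
import torch.nn.functional as F

def sketchpad_loss(text_logits, text_targets, pred_latents, target_latents,
                   lambda_vis: float = 1.0):
    text_loss = F.cross_entropy(text_logits, text_targets)
    vis_loss = F.mse_loss(pred_latents, target_latents)
    return text_loss + lambda_vis * vis_loss

# Toy usage: 4 text tokens over a 100-token vocab, 3 visual latents of dim 256.
logits = torch.randn(4, 100)
targets = torch.randint(0, 100, (4,))
pred = torch.randn(3, 256)
ref = torch.randn(3, 256)
print(sketchpad_loss(logits, targets, pred, ref).item())
```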
Implications
Latent Sketchpad has notable implications for MLLM capabilities. By equipping models with an internal visual scratchpad, the framework enables MLLMs to tackle complex, multi-step visual reasoning tasks that demand planning and imagination, mirroring human cognition. This opens opportunities for richer human-computer interaction, with MLLMs assisting users in creative design, complex problem-solving, and interactive planning scenarios. The interpretability afforded by sketch generation also fosters greater trust in, and understanding of, AI decision-making. Ultimately, Latent Sketchpad represents a significant step toward more versatile and intelligent AI systems capable of genuinely multimodal thinking, broadening their applicability across scientific and industrial domains.
Conclusion
Latent Sketchpad stands as a notable advance in the field of Multimodal Large Language Models, effectively bridging textual reasoning and generative visual thought. Its internal visual scratchpad significantly enhances MLLMs' capacity for visual planning and imagination, a critical step toward more sophisticated AI. The framework's strong empirical performance, generalization across diverse models, and improved interpretability underscore its substantial value. This research pushes the boundaries of AI capabilities and opens promising avenues for future work in human-computer interaction and genuinely multimodal intelligence.