Short Review
Overview
This article explores an innovative approach to input compression for multimodal large language models (MLLMs): rendering long text inputs as images. The central question is whether this substitution can substantially reduce the number of decoder tokens required while preserving task performance. Through experiments on two benchmarks, RULER and CNN/DailyMail, the authors show that the text-as-image technique achieves nearly 50% token savings without compromising task accuracy, improving the input efficiency of MLLMs.
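The intuition behind the savings can be sketched with a back-of-envelope comparison: text fed directly costs roughly one decoder token per few characters, while a rendered page costs a fixed number of vision tokens per image patch, regardless of how many characters it contains. The numbers below (chars-per-token ratio, patch size, page capacity) are illustrative assumptions for the sketch, not figures taken from the article.

```python
# Back-of-envelope sketch of text-as-image token savings.
# All constants here are illustrative assumptions, not the paper's values.

def text_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Rough decoder-token count if the text is fed in directly."""
    return max(1, round(len(text) / chars_per_token))

def image_token_estimate(width_px: int, height_px: int, patch_px: int = 28) -> int:
    """Vision-token count for a rendered page: one token per ViT-style patch."""
    cols = -(-width_px // patch_px)  # ceiling division
    rows = -(-height_px // patch_px)
    return cols * rows

# Suppose a dense 896x896 rendering can hold ~8000 characters of text.
direct = text_token_estimate("x" * 8000)    # ~2000 text tokens
as_image = image_token_estimate(896, 896)   # 32 x 32 = 1024 vision tokens
savings = 1 - as_image / direct             # roughly half
```

Under these assumed numbers the image route costs roughly half the tokens, consistent in spirit with the savings the article reports; the real ratio of course depends on the tokenizer, the vision encoder's patching scheme, and how densely text is rendered.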
Critical Evaluation
Strengths
The article presents a compelling case for the text-as-image compression method, demonstrating substantial token savings and faster inference. Evaluating on two distinct benchmarks, RULER and CNN/DailyMail, adds robustness to the findings and shows the method's applicability across different task types. Notably, the approach not only reduces computational load but also improves output quality on document summarization, where it outperforms traditional token-pruning methods.
Weaknesses
Despite these strengths, the study has limitations. Its reliance on two specific models, GPT-4.1-mini and Qwen2.5-VL-72B-Instruct, may restrict how well the findings generalize to other architectures or configurations. While the results are promising, the broader implications of image-based inputs across varied contexts remain unexplored, and the rendering step could lose nuanced information from the original text, which may pose challenges in certain applications and warrants further investigation.
Implications
The implications of this research are significant for the field of natural language processing and machine learning. By demonstrating that visual representations can serve as an effective means of input compression, the study opens new avenues for optimizing LLMs, particularly in scenarios where computational resources are limited. This could lead to faster processing times and broader accessibility of advanced language models in real-world applications.
Conclusion
In summary, this article makes a valuable contribution to the ongoing discourse on input efficiency in large language models. The innovative text-as-image method not only reduces token usage but also preserves performance, suggesting a promising direction for future research. As the demand for more efficient and capable LLMs continues to grow, this approach could play a crucial role in shaping the next generation of multimodal AI systems.