Short Review
Advancing Universal Audio Generation with UniMoE-Audio
This article introduces UniMoE-Audio, a unified model for both speech and music generation. It directly addresses two long-standing obstacles to universal audio synthesis: task conflict between domains and severe data imbalance. The authors propose a Dynamic-Capacity Mixture-of-Experts (MoE) framework, coupled with a three-stage training curriculum, to overcome these obstacles. In extensive experiments, UniMoE-Audio achieves state-of-the-art performance across major speech and music generation benchmarks, exhibits synergistic cross-domain learning, and avoids the performance degradation commonly seen in naive joint training.
Critical Evaluation of UniMoE-Audio
Strengths
The core strength of UniMoE-Audio lies in its sophisticated architectural design. The Dynamic-Capacity MoE framework, featuring a Top-P routing strategy and hybrid experts (routed, shared, and null), intelligently allocates computational resources and captures both domain-specific and domain-agnostic features. This adaptive approach effectively resolves the inherent task conflicts between speech and music generation. Furthermore, the meticulously crafted three-stage training curriculum—comprising Independent Specialist Training, MoE Integration and Warmup, and Synergistic Joint Training—is highly effective in leveraging imbalanced datasets, preventing catastrophic forgetting, and fostering enhanced cross-domain synergy.
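The Top-P routing idea described above can be made concrete with a minimal sketch. This is an illustrative assumption, not the paper's implementation: the function name, threshold value, and softmax formulation are hypothetical, and the shared/null experts are omitted for brevity. The key property shown is dynamic capacity: experts are ranked by router probability and selected until their cumulative mass exceeds a threshold p, so each token activates a variable number of experts.

```python
import numpy as np

def top_p_route(router_logits, p=0.7):
    """Hypothetical sketch of Top-P (nucleus-style) expert routing.

    Experts are sorted by router probability and chosen until their
    cumulative probability exceeds p, so a confident (peaked) router
    activates fewer experts than an uncertain (flat) one.
    """
    # Numerically stable softmax over experts.
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # experts by descending probability
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, p)) + 1     # smallest set with mass > p
    chosen = order[:k]
    weights = probs[chosen] / probs[chosen].sum()  # renormalize gate weights
    return chosen, weights

# A peaked router distribution activates fewer experts than a flat one.
peaked = np.array([4.0, 1.0, 0.5, 0.2])
flat = np.array([1.0, 0.9, 0.8, 0.7])
print(len(top_p_route(peaked)[0]))  # prints 1
print(len(top_p_route(flat)[0]))    # prints 3
```

Under this scheme, a "null" expert would simply be one routing target whose output is the zero function, letting the router spend no real computation on easy tokens, which matches the dynamic-capacity motivation given in the review.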
Experimental results consistently highlight UniMoE-Audio's superior performance, achieving state-of-the-art results in both speech synthesis (measured by metrics such as UTMOS) and music generation (evaluated via CLAP score and Fréchet Audio Distance). The model's ability to maintain high quality across diverse tasks, including Text-to-Music and Video-to-Music, underscores its robustness and the efficacy of its specialized MoE architecture in mitigating task conflict relative to dense baselines.
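For readers unfamiliar with Fréchet Audio Distance: it compares the Gaussian statistics of embedding distributions from real and generated audio, with lower values indicating closer distributions. The sketch below is a simplified illustration assuming diagonal covariances (the full metric uses a matrix square root of the covariance product); the function name and inputs are illustrative, not from the paper.

```python
import numpy as np

def frechet_audio_distance(mu1, var1, mu2, var2):
    """Simplified Fréchet distance between two Gaussians with
    diagonal covariances (variance vectors var1, var2).

    FAD = ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

# Identical distributions yield a distance of 0; shifting one mean
# by a unit vector adds exactly 1.0 to the distance.
mu, var = np.zeros(3), np.ones(3)
print(frechet_audio_distance(mu, var, mu, var))                    # prints 0.0
print(frechet_audio_distance(mu, var, mu + np.eye(3)[0], var))     # prints 1.0
```

In practice the statistics are estimated from embeddings of a pretrained audio classifier over large sample sets, which is why FAD captures distributional quality rather than per-sample fidelity.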
Considerations and Future Scope
While UniMoE-Audio presents a robust and highly effective solution, the inherent complexity of its MoE architecture and multi-stage training curriculum might pose challenges for broader implementation or require significant computational resources. Future research could explore optimizing these aspects for greater efficiency or extending its capabilities to an even wider array of audio tasks beyond speech and music, such as environmental sounds or sound effects, further pushing the boundaries of universal audio generation.
Conclusion
UniMoE-Audio represents a significant step forward in the pursuit of universal audio generation. By tackling task conflict and data imbalance through its specialized MoE architecture and a carefully designed training strategy, the model not only achieves state-of-the-art performance but also shows that joint speech and music training can be mutually beneficial rather than antagonistic. This work provides a compelling blueprint for future research on unified multimodal models, paving the way for more comprehensive and high-fidelity AI-driven audio content creation.