Publications

Journal Articles

VIBEVOICE-ASR Technical Report

Published as arXiv preprint arXiv:2601.18184, 2026

VibeVoice-ASR is a general-purpose speech understanding framework that supports single-pass processing for up to 60 minutes of audio, unifying ASR, Speaker Diarization, and Timestamping into a single end-to-end generation task. It supports over 50 languages and natively handles code-switching.

Recommended citation: Z. Peng, J. Yu, Y. Chang, Z. Wang, L. Dong, Y. Hao, Y. Tu, C. Yang, W. Wang, et al. (2026). "VIBEVOICE-ASR Technical Report." arXiv preprint arXiv:2601.18184.

Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

Published as arXiv preprint arXiv:2601.13802, 2026

Habibi is a unified-dialectal Arabic TTS framework covering 12+ regional dialects. Our unified model matches or surpasses per-dialect specialized models and is highly competitive with ElevenLabs Eleven v3 (alpha).

Recommended citation: Y. Chen, J. Liu, Y. Tu, Z. Niu, Y. Liang, C. Qiang, C. Zhang, K. Yu, X. Chen. (2026). "Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis." arXiv preprint arXiv:2601.13802.

Yujie Tu
