Realtime Voice - Doubao Team

January 20, 2025

Doubao Realtime Voice Model

We adopt a Speech2Speech end-to-end framework, a native approach that deeply integrates the speech and text modalities to deliver a truly end-to-end model for both understanding and generation. Compared with the traditional cascaded approach, it offers strong comprehension, an unprecedented level of speech expressiveness and controllability, and an excellent ability to pick up on and respond to the speaker's emotions. In real-time interaction it achieves ultra-low latency (about 700 ms for the bare model) and supports smooth interruption.