January 20, 2025
Doubao Realtime Voice Model
We adopt a Speech2Speech end-to-end framework, a native approach that deeply integrates the speech and text modalities so that understanding and generation happen within a single end-to-end model. Compared with the traditional cascaded approach, it offers strong comprehension, unprecedented expressiveness and controllability in speech, and an excellent ability to pick up on and respond to the speaker's emotions. In real-time interaction it delivers ultra-low latency (about 700 ms for the model alone) and smooth interruption handling.