A Peek into the Technologies behind Seed-TTS - Generated Speech That Sounds Too "Real" to Be True


Date: 2024-06-25
Category: Team Story

Seed-TTS is a speech-generation large language model (LLM) recently released by the ByteDance Doubao Team.


The generated speech is remarkably human-like, reproducing even minor pronunciation flaws. Its greatest strength lies in imitating human speech with exceptional accuracy and naturalness.


For instance, given a short audio prompt, Seed-TTS can generate new speech from text while retaining the voice characteristics of the original speaker.


Original material (Prompt):



The Chinese speech generated by Seed-TTS:



Out of nowhere, a wave of laughter erupted next to me. I turned to face the source, stood up tall, flexed my slightly plump arms, and quipped, "I only gained a few pounds to conceal my irresistible charm. Otherwise, wouldn't it be too much for you all to handle?"


English speech can be generated while still retaining the distinct characteristics of the Chinese speaker.


The English speech generated by Seed-TTS:



Suddenly, there was a burst of laughter beside me. I looked at them, stood up straight with high spirit, shook the slightly fleshy arms, and smiled lightly, saying, "The flesh on my body is to hide my bursting charm. Otherwise, wouldn't it scare you?"


Seed-TTS also allows you to customize tone, enabling you to express a character's "flirtatious nature" through the voice.



Are you also seeking a delightful romance? Look no further than A Smile Is Beautiful. The protagonists are both stunning and popular in their school, having first connected through a game before meeting in real life. Their relationship is characterized by a complete absence of misunderstandings, resulting in an overwhelmingly sweet experience. The mere thought of it is enough to bring a smile to my face!



Little fool, um... It's a very cute and amiable name, a bit "unconventional". I can't help but wonder, what inspired you to choose this nickname for me?


Seed-TTS can reproduce not just an individual's voice, but also the narration and emotional nuances of different characters in a novel, tailored to each character's traits and the plot of the story.



"This pill... Couldn't it be something like a sedative drug or an aphrodisiac? Why am I smelling a scent that's so similar to what the two elder sisters described? Um, you wouldn't... be plotting something bad against me, would you?" Upon hearing this, Han Li was left stunned for a considerable amount of time. He was now experiencing an overwhelming sensation of shock as if he could vomit blood three times over. This girl's thoughts are just too hard to understand. She could actually associate the incense pill with an aphrodisiac. He doesn't know whether he should admire her caution or protest for his being wrongly accused. "It seems that what you said is true. However, I still have to take it to my second elder sister for inspection before using it. After all, we girls have to be careful!" Han Li was at a loss for words and could only cough awkwardly to conceal his embarrassment. "Uh, do as you like," he finally managed to say. Now, he felt that he'd better stay away from this mischievous, bewitching girl. Otherwise, he feared he might succumb to depression in her hands. "Humph, but if this medicine is really as useful as you said, then you pass my test! In the future, if you need any help in the Mo Mansion, you can come to Caihuan for help. With a small reward, I can guarantee complete resolution of any issue." "OK. If I need something, I will definitely come to you for help." Han Li regained his composure at this time and responded to this with a forced smile while thinking viciously in his heart: "Yeah, like I ever will".


For more demonstrations and technical details, please refer to the original paper and the demo page:



Before the technical report was released, some Seed-TTS technologies had already been integrated into customer-facing products for some time, earning genuine praise from users. These technologies are also offered externally as commercial services, including the Doubao Speech Synthesis Model and the Doubao Voice Replication Model.


Learn more about the technical breakthroughs, research significance, and challenges along the way in the interview below.



Q: Seed-TTS has garnered attention from industry insiders. Which piece of positive feedback left a lasting impression on you?

A: A professor who specialized in speech recognition and later joined a company is an industry expert I admire very much. Not long ago, at an academic conference, we gave a demo of Seed-TTS. He told us afterwards that he had recently been looking into what remained to be done in speech generation, but after seeing the demo, he felt there was hardly anything left to do in the field. I think there is still room for improvement, but I was very happy to hear that.


Q: What made you feel happy?

A: When people compliment you, most often they're just being polite. But this professor was actively looking for research topics at the time. When he saw our demo, he gave us very positive feedback, felt that what we had done could hardly be surpassed, and decided he had to look elsewhere for research topics. That is really high recognition for us.


Q: What sets Seed-TTS apart from previous accomplishments?

A: It is a foundation model for speech generation, which is slightly different from most other speech generation models. While traditional TTS models are single-task oriented, Seed-TTS is designed to handle any task and produce a wide range of sounds. Additionally, it offers greater control over various dimensions, including dialects, speech habits of real individuals, and even speech imperfections like word-swallowing.

Seed-TTS aims to cover every way humans speak, whether English, Japanese, or Chinese, including regional dialects such as those of Shaanxi and Henan. It can also generate voices expressing different emotions, such as happiness, sadness, crying, shouting, and anger. The possibilities are endless with Seed-TTS.
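
To make the contrast with single-task TTS concrete, here is a hypothetical request schema sketching how a single foundation model can expose all of these control dimensions at once. Every field name below is invented for illustration; this is not the actual Doubao or Seed-TTS API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SynthesisRequest:
    """Hypothetical control surface for a foundation TTS model.
    All field names are illustrative, not a real API."""
    text: str
    language: str = "zh"                    # e.g. "zh", "en", "ja"
    dialect: Optional[str] = None           # e.g. "shaanxi", "henan"
    emotion: Optional[str] = None           # e.g. "happy", "sad", "angry"
    speaker_prompt: Optional[bytes] = None  # short reference audio for voice cloning
    keep_disfluencies: bool = False         # reproduce word-swallowing, hesitations

# One model serves every combination, rather than one model per task:
request = SynthesisRequest(text="...", dialect="shaanxi", emotion="happy")
```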


Q: Have all of the above goals been achieved?

A: A large part has been achieved. Of course, some goals are still out of reach for now, but the technology keeps moving forward. Language models now have deep text understanding; we hope to turn our model into a true "foundation" model for speech in the same way.


Q: What are the challenges in creating a "foundation model"?

A: First, detailed modeling needs to be excellent. In the past, TTS systems were mainly used for announcements and were relatively easy to build, but they often sounded "robotic." A foundation model that sounds human requires a great deal of detail. Humans are highly sensitive to human voices: a slightly unnatural dog or cat sound might go unnoticed, but the slightest flaw in human speech makes it sound "mechanical."

Second, high naturalness and stability are required. A few years ago, mainstream TTS systems were mostly built on prior knowledge and duration models, with each phoneme's duration predetermined, which limited expressiveness from the ground up. Removing these restrictions introduces new problems with stability and naturalness, which is another challenge.

Third, data coverage needs to be extensive. We aim to replicate any person's voice and various language dialects, including imperfect human pronunciations like swallowing words or non-standard articulation. To reconstruct these features and replicate imperfections, data coverage must be wide. Previously, industry data usage was in the hundreds or thousands of hours, with some models using tens of thousands of hours. The data scale used by Seed-TTS is much larger than before. Such a large volume of data also brings challenges in balancing quality and quantity.

Fourth, model design. With this enormous data volume, designing a model that performs well in all aspects is a significant challenge.

Finally, there are engineering challenges. As mentioned, our data scale is large, and model complexity is high, naturally leading to engineering issues that have rarely been addressed before.


Q: From a technical perspective, what value is there in solving these challenges?

A: The main value lies in the research process, where we attempted to answer many previously unresolved questions:

(1) There are two major families of generative models, language models and diffusion models, oriented toward text and images respectively. Since speech has attributes of both text and images, determining which model is better suited to speech modeling is a question we aim to answer.

(2) Speech and text share many similarities. How to design speech representations that are better suited for language modeling is also a problem that needs to be addressed.

(3) How to use reinforcement learning to integrate various subjective and objective preferences into the generative system. There are many other highlights as well, including the stability of autoregressive speech generation models. Through this research, we also tried to look at TTS problems from a perspective outside the TTS field.
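
On point (1), a minimal sketch may help: the language-model route treats speech generation as next-token prediction over discrete audio tokens, exactly like text, whereas a diffusion model would denoise the whole utterance in parallel. The decoding loop below assumes a generic decoder-only transformer `lm` that maps a token sequence to next-token logits; it illustrates the general technique, not the Seed-TTS implementation.

```python
import torch

@torch.no_grad()
def generate_speech_tokens(lm, text_ids, prompt_tokens, eos_id,
                           max_len=500, temperature=0.8):
    """Autoregressive decoding: condition on text plus a short audio prompt
    (which carries the target voice), then sample speech tokens one by one."""
    seq = torch.cat([text_ids, prompt_tokens])
    for _ in range(max_len):
        logits = lm(seq.unsqueeze(0))[0, -1]               # next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)                  # sample, not argmax: prosody needs diversity
        seq = torch.cat([seq, nxt])
        if nxt.item() == eos_id:                           # stop at end-of-speech
            break
    return seq[text_ids.numel() + prompt_tokens.numel():]  # newly generated speech tokens
```

Because tokens are emitted strictly left to right, this style of model lends itself to streaming, a point the team returns to below.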


Q: You mentioned research on language models and diffusion models; what conclusions have we drawn from this?

A: Seed-TTS not only provides a solution based on language models but also offers a diffusion-based solution that is completely independent of duration models, a first in the industry. After extensive comparison of the two systems, we found that language models are friendlier to streaming, while diffusion models are better suited to editing. I believe the two will continue to merge in the future.


Q: What technical challenges did Seed-TTS specifically address for these two systems?

A: For the language modeling system, the main issues addressed were the speech tokenizer and stability.

For language modeling, tokenizing speech is a core component. There are both continuous and discrete tokenizers on the market, and our team has explored them extensively. We found that the information carried by the tokens has a critical impact on the overall performance and stability of the model: what each token encodes, the frame rate, and how speech is tokenized and then converted back into sound. These aspects have seen little exploration in the industry so far.
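
As a deliberately simplified picture of that design space, here is a toy discrete tokenizer. Codebook size and frame rate are precisely the knobs mentioned above: they trade off how much information each token carries against how long the token sequence becomes (at a 50 Hz frame rate, one second of audio is already 50 tokens). A real system learns the codebook jointly with a neural decoder or vocoder; nothing here reflects the actual Seed-TTS tokenizer.

```python
import torch

class ToySpeechTokenizer(torch.nn.Module):
    """Toy vector-quantization tokenizer: each acoustic frame maps to the
    index of its nearest codebook vector. Purely illustrative."""
    def __init__(self, feat_dim=80, codebook_size=1024):
        super().__init__()
        self.codebook = torch.nn.Parameter(torch.randn(codebook_size, feat_dim))

    def encode(self, frames):
        # frames: (T, feat_dim), e.g. mel-spectrogram frames
        dists = torch.cdist(frames, self.codebook)  # (T, codebook_size)
        return dists.argmin(dim=-1)                 # one discrete token per frame

    def decode(self, tokens):
        # look up approximate frames; a vocoder would render them to waveform
        return self.codebook[tokens]
```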

In terms of the stability of the language model, we have explored tokens, model design, decoding strategies, and data preparation to truly meet industrial application requirements.

For the pure diffusion system, since the extra duration model has been removed, its challenges also center on stability. After many attempts, we achieved very good performance here as well.


Q: Regarding the statement "there are many similarities between speech and text models," what insights does this provide us?

A: From the perspective of large text-based language models, speech generation models can be viewed in three stages: pre-training, instruction fine-tuning, and post-training.

Pre-training can enhance the model's foundational abilities, specifically reflected in in-context learning capabilities, such as voice continuation and voice cloning.

Instruction fine-tuning mainly involves making the speech generation process more controllable through instructions, much like a director giving guidance to an actor on speaking faster, slower, or more persuasively. These aspects have been integrated into our models.

Finally, we have found that reinforcement learning can enhance models across many dimensions by incorporating various subjective and objective preferences into the generation system, including stability, control, expressiveness, and naturalness. There has been relatively little exploration in this area within the industry.
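
As one concrete illustration of folding paired preferences back into the generator, the sketch below implements a DPO-style loss over (preferred, rejected) renditions of the same text. This is a stand-in chosen for illustration; the interview does not specify which reinforcement learning method the team actually uses.

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_logps, ref_logps, beta=0.1):
    """DPO-style preference loss. Each input has shape (batch, 2):
    column 0 is the sequence log-prob of the preferred speech rendition,
    column 1 that of the rejected one, under the trainable policy and a
    frozen reference model respectively."""
    policy_margin = policy_logps[:, 0] - policy_logps[:, 1]
    ref_margin = ref_logps[:, 0] - ref_logps[:, 1]
    # widen the policy's preferred-vs-rejected margin relative to the reference
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```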

Building on these foundational capabilities, we have also explored using synthetic data for self-distillation, which has yielded very positive results. This method is relatively more common in text LLMs but has been less explored in the speech industry.
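
A minimal sketch of one self-distillation round, under an assumed interface (`tts`, `finetune`, and the filter function are all hypothetical): the model synthesizes speech for a batch of texts, automatic checks (for example transcription accuracy and speaker similarity) keep only the good outputs, and the model is then fine-tuned on its own filtered data.

```python
def self_distillation_round(model, texts, passes_checks):
    """One round of self-distillation on synthetic data (illustrative only)."""
    candidates = [(text, model.tts(text)) for text in texts]
    kept = [(text, audio) for text, audio in candidates if passes_checks(text, audio)]
    model.finetune(kept)  # the model learns from its own filtered outputs
    return model
```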


Q: You've noted three times that "some issues receive less attention in the industry." What leads to this oversight?

A: On one hand, previous research on voice generation was relatively self-contained, and many traditional industry practices may no longer apply in the current wave of AIGC. Viewed more broadly, voice generation shares many commonalities with text and image generation, and the rapid progress of text and image models has given us new insights. Since new ideas take time to spread, industry exploration is still relatively limited.

On the other hand, many researchers work in academia and lack the necessary resources. There is a great deal of systems engineering involved. We have not only managed to do it but have also explored it in detail, discovering models that balance stability, expressiveness, and computational load. Does that mean we have achieved the best possible results? Not necessarily; continued exploration may still be needed.


Q: Are there any milestone moments throughout the entire research process?

A: The basic results were achieved last year, and we have iterated extensively on real cases since then: collecting real cases, running various post-training processes, and solving deployment issues such as stability across scenarios, first-packet latency, concurrency, and computational load. Performance has improved significantly since then.




Q: Looking back now, what's the value of the entire research?

A: As for Seed-TTS's inherent value: voice is not merely a tool; it is the most direct way humans interact.

For example, the transition from silent films to sound films was a small change, but it represented a huge leap for the industry. Emotional connections between people often rely more on voice: when a child calls out "Dad", the emotional connection it creates is vastly different from reading the word as text. If we are to move toward true AI, the naturalness of speech is a key factor. In the past, we imagined machines with robotic voices, like MOSS in The Wandering Earth. But if AI is truly going to act as your assistant or partner, the emotional connection that voice brings is essential. One reason J.A.R.V.I.S. from Iron Man is remembered by so many is that it was voiced by a real person.

Furthermore, in terms of application, there are many scenarios where voice technology can be implemented, such as audiobooks, character design, video translation, virtual characters, broadcasting, and acting. It can even help people with speech impediments or those who are unable to speak to express themselves. As long as the scenario involves more than just conveying pure information through voice, there is space for application. This is also the motivation behind our efforts to refine the foundational model.


Q: Scaling laws have been regarded as a "belief" by some practitioners. When it comes to voice generation models, what happens when we scale up the data and models?

A: Even at a very large scale, we consistently see benefits as we continue to scale up. We've been pleasantly surprised to watch the model continually acquire new capabilities as the scale grows.


Q: Based on your observations, where is the limit?

A: For now, we still see benefits with each increase, so further exploration is definitely needed. However, we have already proven that with the right model design, we can break away from traditional TTS thinking. In the past, we relied on small amounts of high-quality data, but now, as we continue to scale up, we are able to achieve even greater benefits.


Q: What can we learn from GPT-4o?

A: Built on unified generation and understanding, GPT-4o integrates the abilities to listen, speak, and think simultaneously, setting a higher bar for audio technology. This advancement opens up numerous new challenges for us to address in our projects.


Q: What are your main focuses regarding audio in large language models?

A: Firstly, we aim for models to match the performance quality of professional actors and actresses. Most of the time, model-generated audio closely resembles real human speech. However, in scenarios involving intense emotion and complex information, such as film and TV, model-generated audio still lags behind. Our goal is to bridge this gap, even in these corner cases.

Secondly, we're focusing on refining the processing of details, specifically improving how we handle and optimize bad cases and addressing uncommon, long-tail issues.




Q: Why did launching Seed-TTS require collaboration from numerous colleagues worldwide?

A: The evolution of the industry calls for broad collaboration. Achieving breakthroughs with LLMs and scaling them industrially requires the collective effort of many people, not just a handful of ideas. Expertise in each area is also crucial: we've enlisted professional data analysts for data processing and specialized QA and engineering teams for implementation, and their contributions have been invaluable. Leading AI organizations show that success depends on extensive collaboration, with each project phase handled by dedicated experts. Teamwork this intricate also demands strong organizational skills.


Q: What keywords would you use to describe our team's atmosphere?

A: I'd say "highly motivated" and "meticulous". "Highly motivated" reflects our self-driven way of working: we are spurred by curiosity and a desire to innovate within the industry. Such a proactive mindset is typical of startups but less common in larger companies.


Q: You mentioned the team pays close attention to details. Can you elaborate?

A: It means we delve into the specifics of real-life scenarios. Demos are easy to make and easy to make impressive, but real-world applications raise a myriad of product-detail challenges. To ensure the model consistently delivers high-quality outputs that satisfy user demands, we impose rigorous standards on the system's stability and robustness, refining meticulously so that every detail meets a high bar. Meanwhile, we have deliberately not placed much emphasis on polishing demos.


Q: Are there any different opinions on our decision to not focus on enhancing demos?

A: Indeed, especially among our younger team members, as it's natural for everyone to showcase their best. Nonetheless, we strive to bridge any significant gap between the product and its demo by ensuring practical results in real-world usage. Our ultimate goal is to truly revolutionize the industry.


Q: Are any of the technologies applied to the Doubao App?

A: Several technologies we've developed are already utilized by users in real-world settings, and they were introduced to the public only after we could validate their value. Meanwhile, we are finalizing some technologies to prepare for their launch.


Q: What key phrases define our team?

A: The first is professionalism. This manifests across various domains such as data analysis, infrastructure, and model design. We delve into the intricacies of each detail with a high level of expertise, aiming for peak performance from an industrial application standpoint. The second is focus and momentum. To achieve our objectives, we need to demonstrate strong commitment and energy. Realizing our goals brings a deep sense of accomplishment and boosts confidence for everyone involved. The third is cohesion. The team collaboration is very smooth and free of turf wars, which fosters a comfortable working environment that's uncommon in large corporations.


Q: What qualities do we consistently expect from new team members?

A: Firstly, it's essential that new members appreciate our values. Though skills matter, it's more important to find partners who can collaborate effectively as a team so that everyone can realize their potential. When individuals align on core values, collaboration becomes inherently more seamless. Secondly, we aim to attract individuals from diverse backgrounds. With methods in different AI fields growing increasingly similar and converging, expertise in reinforcement learning, visual recognition, and audio recognition becomes crucial for generation tasks. We value diversity in professional experience; for example, I shifted my focus from speech understanding to TTS. Lastly, a proactive attitude, a keenness to learn, and a strong commitment to excellence are essential. As different generation tasks present distinct characteristics, we hope to onboard candidates who can creatively deliver these tasks by drawing on their personal experiences. To achieve that, it's essential for the team members to proactively learn and absorb new knowledge. Moreover, as we aim to create the best technology and products in the industry, we expect our team members to carry this vision and continuously strive ahead.


image


The content above comes from the Seed-TTS team members. The team is actively seeking exceptional talent.


If you share a passion for AI foundational models and appreciate the culture of the Doubao (Seed) team, we invite you to explore more about our technological innovations, team stories, and job opportunities. Visit our official website at team.doubao.com or follow our official WeChat account.


ByteDance's Top Seed Talent Program is hiring. We aim to continuously attract and recruit top talent with grand ambitions and a desire to change the world through technology. Join us, and you will work with outstanding scientists and engineers to tackle the industry's toughest technical challenges and breakthroughs. Click the link below to access the original article and submit your resume.


Click here to apply