Doubao LLM's Visual Model and Voice Model Upgraded! Text-to-Image has a better understanding of "Chinese style", and TTS can "capture" emotions
Date
2024-08-01
Category
Tech
Volcano Engine held its 2024 AI Innovation Tour in Chengdu. At the event, Doubao introduced its Image-to-Image model, as well as the upgraded Text-to-Image model, speech synthesis model, and voice cloning model.
This article introduces the upgraded capabilities of the Text-to-Image, speech synthesis, and voice cloning models, including a deeper understanding of subject-object relationships and spatial composition in image generation, and the ability in speech synthesis to express emotions accurately and retain voice traits such as word-swallowing and accents. The Doubao (Seed) team members working on vision and speech also discussed future trends in Text-to-Image and speech synthesis.
The daily token usage of the Doubao large model has exceeded 500 billion. Volcano Engine unveiled the model's latest capabilities at the Chengdu stop of its 2024 AI Innovation Tour, introducing a new Image-to-Image model alongside upgrades to the Text-to-Image, speech synthesis, and voice cloning models.
In May this year, ByteDance launched its self-developed Doubao large model family. In the rankings released by FlagEval, a third-party large model evaluation platform, Doubao-Pro-4k scored 75.96 in the "objective evaluation" of closed-source models, second only to GPT-4 and the highest among large language models from China. It also ranked second in the "subjective evaluation".
Two months on, enterprise customers' average daily token usage has grown 22-fold since launch, explosive growth that reflects the market's recognition of the Doubao model's capabilities and performance.
The Doubao (Seed) team provided the technical support behind the capabilities released this time. This article walks through those capabilities and the technology behind them.
1. A Text-to-Image Model that better understands "Chinese Style"
The Text-to-Image model upgrades are reflected in three aspects:
Firstly, the new generation of models can deeply understand complex prompts, including multi-subject, counterfactual, and subject-object relationships, achieving more accurate image-text matching.
Prompt: A hyperrealistic photograph with a cinematic feel, featuring a gigantic, super cute cat in Lujiazui. The cat, as tall as the skyscrapers and as wide as the road, lies sprawled across the street, blocking traffic. Cars, the size of the cat's paws, navigate the busy road.
Secondly, the model also improves image quality in three respects: lighting and shadow, color and atmosphere, and the aesthetic rendering of human subjects.
Prompt: Statue of David, standing on grass, in a shot put pose, plaster material, inside a modern Olympic venue, epic composition, super detailed, perfect lighting
Thirdly, the team has strengthened Chinese content, accurately understanding Chinese cultural elements, including Chinese figures, objects, dynasties, geography, cuisine, and festivals.
The team believes the biggest highlight of this release is its ability to generate content with a "Chinese style": a natively bilingual LLM, combined with curated data, enables precise generation of Chinese elements.
Prompt: A girl in traditional Chinese style, wearing Qing dynasty costume, with lively eyes and a naturally beautiful nose. She wears a gold headdress with intricate textures. She is an empress. Her red robe is embroidered with dragons and phoenixes in complex patterns. She's also wearing a pearl necklace. It's snowing. She has on golden finger cuffs. There's a red gate and wall.
Prompt: A female martial arts hero from ancient China points forward, her body and face in profile. With a serious expression, she stands in the middle ground amidst a windswept, sandy landscape. (Many swords fly behind her toward where she is pointing: 1.4). The composition is epic, with elements of Chinese fantasy. Featuring realistic skin, depth of field, and meticulous detail, the image is rendered in a photorealistic style with film grain and low saturation.
Prompt: Cinematic, photography, Hasselblad, minimalism, artistic composition, negative space, rime, a Suzhou garden with rime-covered branches, ultra high quality, super fine, best quality, Zen, oriental mood
Prompt: Classic red and white, thin lines, ink wash painting style. A woman in a Qing Dynasty cloak leans against a giant plum tree with snow falling on the budding flowers. The weather is extremely cold. She plays a flute with a melancholic air, worried her music might disturb the delicate blossoms.
To enhance the model's capabilities, the team has made comprehensive preparations.
The team continues to strengthen its data re-captioning capability and applies precise data labeling to tighten quality control. It has also improved the reliability of the training cluster used to manage and process large datasets.
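To make re-captioning concrete, here is a minimal sketch using an off-the-shelf open-source captioner (BLIP, via the Hugging Face transformers library). The model choice and record format are illustrative assumptions, not the team's actual pipeline.

```python
# Minimal re-captioning sketch with an open-source captioner (BLIP).
# Illustrative only: Doubao's actual captioning models and data format
# are not public, so the model choice and record schema are assumptions.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def recaption(image_paths):
    """Replace noisy web alt-text with model-generated captions."""
    records = []
    for path in image_paths:
        # Each result looks like [{"generated_text": "..."}].
        caption = captioner(path)[0]["generated_text"]
        records.append({"image": path, "caption": caption})
    return records

print(recaption(["example.jpg"]))  # hypothetical local image
```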
For the text understanding module, the team uses a native bilingual large language model as the text encoder, which significantly improves the model's ability to understand Chinese. It possesses a broader knowledge of the world and has established a foundational understanding of different languages. In other words, whether facing unique Chinese phrases or English slang, the language model provides more accurate text embedding, enabling it to learn the original cultural elements precisely.
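To illustrate the idea of an LLM as text encoder, the sketch below pulls per-token hidden states from a placeholder bilingual checkpoint with Hugging Face transformers. The checkpoint name is hypothetical, and how these embeddings condition the image generator (typically via cross-attention) is simplified away.

```python
# Sketch: per-token hidden states from a bilingual LLM as the prompt
# embedding for a text-to-image model. The checkpoint name is a
# placeholder; Doubao's actual encoder is not public.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "some-bilingual-llm"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

@torch.no_grad()
def embed_prompt(prompt: str) -> torch.Tensor:
    """Return (1, seq_len, hidden) embeddings for the image generator's
    cross-attention layers, in place of a CLIP text encoder."""
    inputs = tokenizer(prompt, return_tensors="pt")
    return encoder(**inputs).last_hidden_state

# The same encoder handles both languages, so culture-specific phrases
# in either language map into one shared embedding space.
zh = embed_prompt("清朝服饰的少女，雪中红墙")
en = embed_prompt("A girl in Qing-dynasty dress, red walls in snow")
print(zh.shape, en.shape)
```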
In deployment and inference, the team uses distillation to attack latency, aiming for high-quality image generation in lower-end deployment environments. By cutting the sampling step count of the original model, they reduced generation time to 40% of the original.
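Since diffusion latency scales roughly with the number of denoising steps, a distilled student that needs fewer steps is proportionally faster. The sketch below illustrates that relationship with the open-source diffusers library and Stable Diffusion 1.5, not Doubao's actual model or distillation recipe.

```python
# Illustration with open-source Stable Diffusion via diffusers (not
# Doubao's model or distillation recipe): fewer denoising steps mean
# roughly proportionally less wall-clock time per image.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def timed_generate(prompt: str, steps: int):
    start = time.perf_counter()
    image = pipe(prompt, num_inference_steps=steps).images[0]
    return image, time.perf_counter() - start

# Teacher-like step count vs. a distilled-student-like step count.
_, t_full = timed_generate("a giant cute cat in Lujiazui", steps=50)
_, t_few = timed_generate("a giant cute cat in Lujiazui", steps=20)
print(f"50 steps: {t_full:.1f}s, 20 steps: {t_few:.1f}s "
      f"({t_few / t_full:.0%} of the original time)")
```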
Finally, the team defined more comprehensive and precise dimensions for evaluating generation quality: structural accuracy, picture quality, aesthetics, image-text consistency, content creation, and adaptability to complexity. Within a single dimension, results are judged further on sub-dimensions such as subject accuracy, multi-subject accuracy, and the number of actions.
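One way such a rubric could be operationalized is as a structured score sheet per generated image. The schema below is a hypothetical illustration built only from the dimensions named above, not the team's actual evaluation format.

```python
# Hypothetical per-image score sheet mirroring the dimensions above;
# field names and the 1-5 scale are illustrative, not the team's schema.
from dataclasses import dataclass, field

@dataclass
class ImageEval:
    prompt: str
    structural_accuracy: int       # anatomy / geometry correctness, 1-5
    picture_quality: int
    aesthetics: int
    image_text_consistency: int
    content_creation: int
    complexity_adaptability: int
    # Sub-dimensions judged within image-text consistency:
    consistency_detail: dict = field(default_factory=lambda: {
        "subject_accuracy": None,
        "multi_subject_accuracy": None,
        "action_count": None,
    })
```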
In addition to the Text-to-Image model, this release includes an Image-to-Image model that preserves multidimensional features of the source image, such as character contours, expressions, and spatial structure. It supports more than 50 styles and enables image extension, partial redrawing, and erasing, allowing creative expansion of an image. It is already used in apps such as Douyin, Jianying, Doubao, and Xinghui, and has served companies such as Samsung and Nubia across phone photography, assistant tools, e-commerce marketing, and advertising.
2. A speech foundation model that lets data speak for itself
Speech has also been a focus of this release, including the upgraded Doubao Speech Synthesis Model and the Doubao Voice Cloning Model.
Among them, the speech synthesis model can deeply understand a storyline and its characters, express emotions correctly, and retain pronunciation habits such as word-swallowing and accents, making its output more natural and difficult to distinguish from a human voice. The team has carefully refined 26 premium voices to cover professional scenarios such as hosting, broadcasting, and livestreaming.
The Doubao voice cloning model, in turn, can replicate a voice with high fidelity from just 5 seconds of audio, faithfully reproducing the speaker's vocal characteristics and accent. It supports transfer across 6 major languages, bringing pronunciation closer to that of a native speaker. The model is designed to "learn the voice of any character," with stronger replication ability and even the potential to pick up a speaker's habitual phrases.
Note: Demonstration of the "Taibai Jinxing" voice replication effect
The underlying technology of the two models mentioned above is associated with Seed-TTS.
This is a foundation model for speech synthesis. Unlike traditional TTS, which focuses on a single task, Seed-TTS can model a variety of voices and allows simultaneous manipulation across multiple dimensions, such as dialects, real-person speaking habits, and even speech imperfections like word omission.
As for how a large model learns "word-swallowing," "accents," and other speaking habits, the team points out that traditional TTS injects human priors through explicit, purpose-built models of duration, energy distribution, and pitch distribution, priors that reflect the data poorly. Large language models, by contrast, "let the data speak for itself."
A large language model can model and extract features from massive data on its own, which is what allows these speech characteristics to be preserved. On top of that, RL, data augmentation, and improved text labeling and representation raise performance at specific levels.
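As a toy illustration of this "data speaks for itself" pattern, the sketch below shows the generic LM-based TTS setup: one autoregressive transformer predicts the next discrete speech token given the text and the audio history, so duration, pitch, and habits like word-swallowing are absorbed implicitly from data rather than modeled by hand. This is the general pattern only, not Seed-TTS's actual architecture, and all sizes are placeholders.

```python
# Toy LM-based TTS sketch (the generic pattern, not Seed-TTS itself):
# text tokens and discrete speech tokens share one autoregressive
# transformer, so duration, pitch, and speaking habits are learned
# from data instead of being modeled explicitly. Sizes are placeholders.
import torch
import torch.nn as nn

TEXT_VOCAB, SPEECH_VOCAB, DIM = 32_000, 4_096, 512

class TokenTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, DIM)
        self.speech_emb = nn.Embedding(SPEECH_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(DIM, SPEECH_VOCAB)

    def forward(self, text_ids, speech_ids):
        # One sequence: text condition followed by the speech history.
        x = torch.cat([self.text_emb(text_ids),
                       self.speech_emb(speech_ids)], dim=1)
        # Causal mask -> plain next-token prediction, no duration model.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        # Logits over the next speech token at each speech position
        # (the one-position teacher-forcing shift is omitted for brevity).
        return self.head(h[:, text_ids.size(1):])

model = TokenTTS()
logits = model(torch.randint(0, TEXT_VOCAB, (1, 12)),
               torch.randint(0, SPEECH_VOCAB, (1, 50)))
print(logits.shape)  # (1, 50, SPEECH_VOCAB)
```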
For example, the phrase "哈哈" can have drastically different meanings and expressions depending on the context. Seed-TTS can understand the meanings in different scenarios and learn the corresponding expressions through context. Similarly, the TTS model can also deeply understand the storyline and characters, and accurately express emotions.
Note: More emotional expressions shown in speech synthesis
In terms of implementation, Seed-TTS mainly tackles speech tokenization and the stability issues of LLM-based systems.
Both continuous and discrete tokenizers currently exist in the field. The team's research found that the design of the information carried within tokens has a crucial impact on the model's overall performance and stability: what information each token encodes, the frame rate, how tokenization is performed, and how speech is reconstructed from the tokens.
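For the discrete case, a minimal sketch of the core operation follows: continuous acoustic frames are vector-quantized against a codebook, and the codebook size and frame rate are exactly the design levers just mentioned. All numbers are placeholders; a real tokenizer learns the codebook and pairs it with a neural decoder.

```python
# Minimal discrete speech tokenizer sketch: vector-quantize continuous
# acoustic frames against a codebook. Codebook size and frame rate are
# the design levers discussed above; all numbers are placeholders, and
# a real tokenizer learns the codebook and reconstructs speech with a
# neural decoder rather than raw codebook lookup.
import torch

CODEBOOK_SIZE, FEATURE_DIM, FRAME_RATE_HZ = 1024, 256, 50

codebook = torch.randn(CODEBOOK_SIZE, FEATURE_DIM)  # learned in practice

def tokenize(frames: torch.Tensor) -> torch.Tensor:
    """Map (num_frames, FEATURE_DIM) features to discrete token ids."""
    dists = torch.cdist(frames, codebook)  # (num_frames, CODEBOOK_SIZE)
    return dists.argmin(dim=-1)            # nearest code id per frame

def detokenize(tokens: torch.Tensor) -> torch.Tensor:
    """Recover approximate features; a decoder would synthesize audio."""
    return codebook[tokens]

one_second = torch.randn(FRAME_RATE_HZ, FEATURE_DIM)  # 50 tokens/second
ids = tokenize(one_second)
print(ids.shape, detokenize(ids).shape)
```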
The team has explored multiple aspects of the large language model's stability, including tokens, model design, decoding strategies, and data preparation, to truly meet industrial and application-related requirements.
For pure diffusion systems, where the separate duration model is removed, the main challenge is likewise stability. After many attempts, the team has achieved strong metrics in this direction as well.
Beyond research, the Doubao speech team also iterated on its algorithms to support this release, improving controllability, performance, and stability. On the engineering side, they reduced the computational workload and worked with the engineering team on debugging to ensure the production results matched the demos.
3. Continuing to focus on the fundamental problems of large models
Reflecting on the development of large language models for speech, the team believes that traditional research on tasks such as TTS and ASR has been conducted in isolation. As a result, adaptations and adjustments have been necessary when applied to different scenarios and fields. However, with the advent of large language models, the convergence of various tasks at the foundational level has become inevitable.
Past research has shown that the human brain learns language and pronunciation through experience and constant imitation, with both "listening" and "speaking" equally important. The same applies to machines.
If the TTS model is the "mouth" of a machine, then the ASR model is the "ears," one controlling voice and the other responsible for hearing and understanding. However, the core of both relies on the extraction of sound and text features.
Correspondingly, the Doubao (Seed) team has successively released two speech models: Seed-TTS and Seed-ASR. Notably, the recently published Seed-ASR technical report demonstrates how the extensive knowledge of LLMs can be leveraged to improve recognition accuracy. On comprehensive evaluation sets spanning multiple domains, languages, dialects, and accents, Seed-ASR significantly outperforms other end-to-end models. The relevant technology has already been integrated into the Doubao speech recognition model.
The team is also working on unifying the TTS and ASR models.
On the prospects of text-to-image technology, the Doubao vision team notes that two years have passed since the release of Stable Diffusion, and many new technologies and plugins have emerged in the industry, such as LoRA, ControlNet, and adapters, along with the DiT architecture and more capable large language models. The team revealed that Text-to-Image 2.0, built on the DiT architecture, is about to launch, with a 40% improvement in generation quality over the current model and significant gains in image-text consistency and aesthetics.
Meanwhile, some underlying problems in the text-to-image field are yet to be fully solved, which will be the focus of the team's future efforts.
On the one hand, the model's ability to understand the events described in a prompt needs further improvement; image-text matching remains the key to the development of text-to-image technology.
On the other hand, text-to-image needs better controllable editing and generation. Even with tools like ControlNet and adapters, flaws remain, and solving this would broaden the possibilities for real-world applications.
Finally, regarding social responsibility, text-to-image models need further improvement in the areas of fairness, safety, and bias elimination to be more accountable to the public.
From the DiT architecture upgrade for text-to-image to the "All-in-One" speech model, the team hopes to keep attracting outstanding people with grand ambitions and a desire to "change the world with technology" to join, contribute innovative ideas, and work together on foundational problems and breakthroughs.
The Doubao (Seed) team is hiring! Click "Read more" to learn about job openings.