2024 is coming to an end. No matter how turbulent the AI wave becomes,

true believers keep accelerating toward their goal of AGI.

Since their debut on May 15, we have witnessed

230 days of birth and rapid growth for the Doubao large models.

From learning to speak like a child, to understanding the world,

to drawing the fantasies creators imagine, they are still in their infancy.

But every step counts.

We want to share with you 8 key moments of Doubao large models in 2024.

1. Learning to Listen and Speak Like a Human

In July, Doubao large models learned to

understand conversations in more than 20 languages mixed with dialects, and to think while listening.

Doubao large models have also learned how to express emotion in speech.

They can be interrupted at any time and can "cut in" during an interaction.

They can also retain human pronunciation habits such as omitted sounds and accents.

*Available within Doubao, Jianying, Ola Friend, etc.

This is made possible by the new Doubao automatic speech recognition model "Seed-ASR"

and the foundation model for speech generation "Seed-TTS".

Unlike traditional small speech models,

the Doubao speech large model draws on more diverse and extensive data,

incorporates chain-of-thought reasoning, and exhibits exceptional generalization capabilities.

*Overviews of Seed-ASR framework and Seed-TTS inference pipeline

2. Mastering Music and Singing

In September, Doubao large models brought to life the idea that

"An AI can be a band."

From songwriting and performance generalization to vocal singing,

Doubao large models have learned more than 10 musical skills.

They can provide unexpected inspiration for music creation.

Underpinning this achievement is the music generative model "Seed-Music".

By combining the advantages of language models and diffusion models,

Seed-Music offers a universal framework for music generation,

with highly controllable editing.

*Overview of Seed-Music framework

3. Doubao Large Models Can Make Videos!

In September, Doubao large models also learned

to follow complex prompts

to accurately generate HD video with multiple interactive subjects

and flexibly control camera angles,

offering creators a visual experience that fuses reality with fantasy.

*Dreamina and Doubao can be used to assist in creating short fantasy films.

Behind the scenes are two Doubao video generation models launched at the same time:

PixelDance and Seaweed.

A new diffusion model training method delivers consistency across multiple shots.

An optimized Transformer structure maximizes the generalization of video generation.

The technology for simultaneous video and sound effect generation sparks even more creative inspiration.

*Doubao team's research toward long narrative video generation with synchronized foley

4. Advanced Painting and Image Editing Master

Chinese, cinematic, or surreal aesthetics,

Doubao large models master them all.

In November, they also learned "Image Editing with One Phrase" and "Poster Generation with One Click",

allowing for image editing and accurate text generation based on any prompt.

*Available within Dreamina and Doubao.

Behind it is the constantly iterating Doubao text-to-image model.

It can achieve precise image-text alignment in complex scenes

and build high-quality text rendering capabilities.

The universal image-editing model SeedEdit

allows users to edit any image through natural language.

*Overview of SeedEdit framework and its optimization pipeline

5. Professional-Level Programming

In early December,

the coding abilities of Doubao large models increased significantly.

They can serve as both an AI programmer and a data analyst.

They support free on-canvas code preview, human-machine collaborative programming,

and one-click data processing and visual analysis.

*Available on DoubaoMarsCode and will soon be available on Doubao.

These features are powered by the Doubao code LLM "Doubao-coder",

which was built from massive real-world programming data and intensive training carried out by experts in related fields.

The model provides in-depth support for over 16 programming languages and 11 real-world application scenarios,

addressing full-stack programming needs such as front-end and back-end development and machine learning.

6. Pushing the Limits of Language Understanding

At the same time, the context window of Doubao large models

increased to 3 million words, an industry-leading milestone,

which allows them to read hundreds of academic reports at once.

Latency is only 15 seconds per million tokens.

*Use Doubao to test its ultra-long text processing abilities

This achievement is built on a variety of breakthroughs in data algorithms and model-acceleration optimizations,

including contextual correlation data algorithms such as STRING.

The outcome is a significant improvement in the LLM's ability to leverage massive external knowledge,

while sparse and distributed schemes reduce latency to the ten-second level.

*Detailed pseudocode of STRING incorporating FlashAttention

7. Learning to "Open its Eyes" to See the World

In mid-December,

Doubao large models learned to perceive the world visually,

incorporating multiple senses for deep thought and creation.

Take a picture of a calculus problem,

and Doubao can not only understand it accurately but also solve it in no time.

*These model features can be experienced through the Volcano Ark.

It is powered by the new Doubao Visual Language Model,

which fuses visual language understanding with text generation in a single model architecture,

exhibiting exceptional content recognition,

strong reasoning, and refined expressive capability.

*Doubao-vision's performance on different benchmarks

8. General Model Capabilities Comparable to GPT-4o

Also in mid-December,

the Doubao General Model "Doubao-pro" underwent a full upgrade,

aligning its capabilities with GPT-4o.

With its reasoning ability enhanced,

it has also learned to "self-reflect" when providing responses.

*Doubao General Model Pro has extensively upgraded all of its capabilities.

This is enabled by extensive data optimization and model architecture innovation,

including increased model sparsity and the introduction of reinforcement learning.

Doubao-pro's comprehension accuracy and output quality took a giant leap,

making it a true all-rounder that balances performance and efficiency.

*Doubao-pro's performance on different benchmarks

This year,

the Doubao (Seed) team delved deeply into fundamental AI research.

57 of its papers were selected for ICLR, CVPR, NeurIPS, and other top conferences.

There were also open-source projects with over 1 million downloads and repositories receiving 10,000 stars on GitHub.

The Doubao (Seed) team also cooperated in depth with nearly 20 universities

and established joint laboratories with Tsinghua University and Peking University.

The Doubao Large Model Fund has supported more than 40 top scholars,

helping them research groundbreaking AI technologies.

In 2024, Doubao large models supported more than 50 application scenarios.

Doubao has become the most popular AI product in China.

Through the Volcano Engine, Doubao large models have served more than 30 industries.

Their average daily token usage has exceeded 4 trillion,

a 33-fold increase since the release in May.
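
As a rough check on the 33-fold figure, the two numbers quoted here imply a daily usage at the May release that can be worked out directly. The sketch below is only back-of-the-envelope arithmetic on those two figures: the variable names are illustrative, "exceeded 4 trillion" is treated as exactly 4 trillion, and the result is an estimate rather than a number reported in this article.

```python
# Figures quoted in the text (4 trillion is a lower bound, used here as-is).
current_daily_tokens = 4e12   # average daily token usage at the end of 2024
growth_factor = 33            # growth since the release in May

# Implied average daily usage at the May release (an estimate, not a reported figure).
implied_may_daily_tokens = current_daily_tokens / growth_factor

print(f"Implied daily tokens at release: about {implied_may_daily_tokens / 1e9:.0f} billion")
```

Under these assumptions the implied starting point is on the order of 120 billion tokens per day.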

After 230 days, the adventure of the Doubao large models has just begun.

The distant shores of universal intelligence belong to those who keep pressing forward.

To find the most promising researchers,

the team launched the Top Seed Talent Program this year,

recruiting top PhD graduates globally

to collaborate on researching world-class AI topics.

In 2025, the Doubao (Seed) team

will continue to explore fundamental model topics,

committed to changing the world through technology.

We invite top talent who share the same vision to join us!

If you are interested in agent collaboration, data science, or applying large models to solve complex problems, and are eager to explore frontier research topics, please visit our careers page for more information about open positions.

8 Key Moments of Doubao Large Models in 2024