Speech
Main areas of focus
AI system technology
We develop AI training and inference systems built on GPU technology, and we are dedicated to advancing the state of the art in AI systems to accelerate large language models for audio and music.
AI foundation models for speech/multimodal audio/music
The team is also responsible for developing AI foundation models for speech, multimodal audio, and music across the full engineering cycle, including data preparation and processing, model training, evaluation, and deployment.
Research topics
Work alongside leading experts in the speech industry on the most challenging research topics. Pursue innovation, aim for excellence in your work, and embark on a journey of exceptional personal and professional growth.
AI foundation models for audio understanding and generation
Develop AI foundation models for audio understanding and generation, exploring unified modeling approaches for speech recognition, synthesis, and conversion, as well as music creation and sound-effect production.
Multimodal model design and optimization
Design and optimize multimodal model network architectures, including the design and refinement of diffusion models.
Reinforcement learning for speech and audio
Apply reinforcement learning within multimodal large-model environments for speech and audio, and design and improve RL system solutions.
Large-scale distributed training and inference systems
Explore the development of efficient, large-scale distributed training and inference systems.
Machine learning platforms for speech
Develop robust, scalable, distributed machine learning platforms that support the production and rapid iteration of speech- and audio-related algorithms.
Technical capability demonstration
Seed-TTS
Doubao · Speech Generation Model shows superior context comprehension and naturalness, capable of deeply understanding narratives and character roles, expressing emotions accurately, and maintaining specific pronunciation habits such as swallowing sounds and accents, closely mimicking human vocal quality.
Seed-ASR
Doubao · Speech Recognition Model leverages more advanced context-aware capabilities to produce more precise recognition outcomes and support the recognition of multiple Chinese dialects, including Mandarin, Cantonese, Shanghainese, Sichuanese, Xi'an dialect, and Minnan, within a single model.
Seed-Music
Seed-Music is a suite of music generation models with flexible control capabilities, offering four core functions: controllable music generation, score-to-music conversion, lyric and music editing, and zero-shot vocal cloning. By combining the strengths of language models and diffusion models in the composition process, Seed-Music makes crafting songs accessible to everyone.