Speech
The mission of the Doubao large model speech team is to enrich interaction and creation through multimodal speech technology. The team focuses on frontier research and product innovation in speech and audio, music, natural language understanding, and multimodal deep learning.

Research Areas
AI Systems Technology
We build AI training and inference systems on GPUs and push the state of the art in AI systems technology to accelerate large audio and music language models.
Large Speech/Audio Multimodal and Music Models
The team is also responsible for the full engineering lifecycle of large speech/audio multimodal and music models, including data preparation and processing, and model training, evaluation, and deployment.
Exploratory Topics
Work alongside renowned experts in the speech community to explore the most challenging topics, upholding high standards and innovation in daily work and achieving high-quality growth.
Research Directions

Foundation Models for Audio and Music Understanding and Generation
Foundation models for audio understanding and generation, exploring a unified modeling approach to speech recognition, synthesis, and conversion, music generation, and sound-effect generation; a minimal sketch of such unification follows below.
AI foundation
Audio
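To make the unification concrete, here is a purely hypothetical sketch: one decoder-only language model over a joint vocabulary of text tokens, neural-codec audio tokens, and task tags, so that recognition, synthesis, conversion, and music or sound-effect generation all reduce to next-token prediction. The names, vocabulary sizes, and task-tag scheme are illustrative assumptions, not the team's actual design.

```python
# Hypothetical sketch: ASR, TTS, voice conversion, and music/SFX
# generation framed as one next-token-prediction problem over a joint
# vocabulary of text tokens, audio-codec tokens, and task tags.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000              # assumed text sub-word vocabulary
AUDIO_VOCAB = 8_192              # assumed neural-codec token vocabulary
TASKS = ["<asr>", "<tts>", "<vc>", "<music>", "<sfx>"]
VOCAB = TEXT_VOCAB + AUDIO_VOCAB + len(TASKS)

class UnifiedAudioLM(nn.Module):
    def __init__(self, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        # tokens: (batch, seq) = [task tag, source tokens, target so far]
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)      # next-token logits over the joint vocab

# TTS as one instance of the unified interface: tag + text in, audio out.
model = UnifiedAudioLM()
text = torch.randint(0, TEXT_VOCAB, (1, 20))          # stand-in text tokens
tts_tag = TEXT_VOCAB + AUDIO_VOCAB + TASKS.index("<tts>")
logits = model(torch.cat([torch.tensor([[tts_tag]]), text], dim=1))
print(logits.shape)              # torch.Size([1, 21, 40197])
```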

Multimodal Model Design and Optimization
Network architecture design and optimization for multimodal models, and the design and optimization of diffusion models; a toy diffusion training step is sketched below.
Multimodal
Optimization
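As a concrete instance of diffusion-model work, below is a toy epsilon-prediction DDPM training step on mel-spectrogram-shaped tensors. The noise schedule, denoiser, and shapes are deliberately minimal placeholders, not a production architecture.

```python
# Toy DDPM training step (epsilon prediction) on mel-spectrogram-shaped
# tensors; the schedule and denoiser are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)                 # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class TinyDenoiser(nn.Module):
    """Predicts the noise added to a (batch, mel_bins, frames) input."""
    def __init__(self, mel_bins=80, width=256):
        super().__init__()
        self.t_embed = nn.Embedding(T, 1)             # crude scalar t-conditioning
        self.net = nn.Sequential(
            nn.Conv1d(mel_bins + 1, width, 3, padding=1), nn.SiLU(),
            nn.Conv1d(width, mel_bins, 3, padding=1))

    def forward(self, x_t, t):
        # Broadcast the timestep feature over every frame as an extra channel.
        t_map = self.t_embed(t).unsqueeze(-1).expand(-1, 1, x_t.size(-1))
        return self.net(torch.cat([x_t, t_map], dim=1))

def diffusion_loss(model, x0):
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(b, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise      # forward noising process
    return F.mse_loss(model(x_t, t), noise)           # epsilon-prediction loss

loss = diffusion_loss(TinyDenoiser(), torch.randn(4, 80, 120))
loss.backward()
```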

Reinforcement Learning for Audio
Applications of reinforcement learning to large speech/audio multimodal models, together with the design and optimization of RL systems; a minimal policy-gradient sketch follows below.
Reinforcement learning
Application

Large-Scale Distributed Training and Inference Systems
Exploring efficient large-scale distributed training and inference systems; a minimal data-parallel skeleton follows below.
Large-scale
System
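For the systems direction, here is a minimal data-parallel skeleton with PyTorch DistributedDataParallel, launched via `torchrun --nproc_per_node=8 train.py`. Real large-model systems layer tensor/pipeline parallelism and optimizer-state sharding on top of this baseline, which the sketch omits.

```python
# Minimal DDP training loop; each rank trains on its own data shard and
# gradients are all-reduced automatically during backward().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                  # env set up by torchrun
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)   # stand-in for an audio LM
    model = DDP(model, device_ids=[rank])            # sync grads across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(32, 1024, device=rank)       # each rank gets its shard
        loss = model(x).pow(2).mean()                # toy objective
        opt.zero_grad()
        loss.backward()                              # all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```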

Machine Learning Platforms for Speech
Building highly available, scalable, distributed machine learning platforms that support the production and rapid iteration of speech and audio algorithms.
Machine learning
Audio
Selected Papers

2024.09.13
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio.
We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music.
Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, Yuping Wang, Yuxuan Wang, Ling Xu, Yifeng Yang, Chao Yao, Shuo Zhang, Yang Zhang, Yilin Zhang, Hang Zhao, Ziyi Zhao, Dejian Zhong, Shicen Zhou, Pei Zou
Speech&Audio
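The abstract above combines autoregressive language modeling with diffusion. As a hedged illustration of that general two-stage pattern — an LM proposes intermediate music tokens from multimodal controls, and a diffusion stage renders them to audio — here is a skeleton whose functions are all hypothetical placeholders, not the actual Seed-Music pipeline.

```python
# Skeleton of a generic "LM proposes tokens, diffusion renders audio"
# pipeline; both stages below are empty placeholders for illustration.
import torch

def lm_stage(style_ids, score_ids, voice_prompt_tokens, steps=512):
    """Stand-in AR stage: would sample intermediate music tokens."""
    controls = torch.cat([style_ids, score_ids, voice_prompt_tokens])
    # ...autoregressive sampling conditioned on `controls` would go here...
    return torch.randint(0, 8_192, (steps,))      # placeholder music tokens

def diffusion_stage(music_tokens, sample_rate=44_100, seconds=10):
    """Stand-in diffusion renderer: would iteratively denoise to audio."""
    # ...iterative denoising conditioned on `music_tokens` would go here...
    return torch.zeros(sample_rate * seconds)     # placeholder waveform

tokens = lm_stage(torch.randint(0, 1000, (8,)),    # style description ids
                  torch.randint(0, 1000, (64,)),   # symbolic score ids
                  torch.randint(0, 8_192, (75,)))  # voice-prompt codec tokens
audio = diffusion_stage(tokens)
print(audio.shape)   # torch.Size([441000]) -> 10 s of audio at 44.1 kHz
```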

2024.07.10
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data-matched scenarios, and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in the LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects, and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.
Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, Xiaoyang Li, Zeyang Li, Zehua Lin, Rui Liu, Shouda Liu, Lu Lu, Yizhou Lu, Jingting Ma, Shengtao Ma, Yulin Pei, Chen Shen, Tian Tan, Xiaogang Tian, Ming Tu, Bo Wang, Hao Wang, Yuping Wang, Yuxuan Wang, Hanzhang Xia, Rui Xia, Shuangyi Xie, Hongmin Xu, Meng Yang, Bihong Zhang, Jun Zhang, Wanyi Zhang, Yang Zhang, Yawei Zhang, Yijie Zheng, Ming Zou
Speech&Audio
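A minimal sketch of the audio-conditioned LLM (AcLLM) framing the abstract describes: continuous speech representations from an audio encoder are projected into the LLM's embedding space and consumed alongside contextual text and the transcript tokens. The module choices and sizes below are illustrative stand-ins, not Seed-ASR internals.

```python
# Sketch of an audio-conditioned LLM: project encoder features into the
# LLM embedding space and concatenate with context and text embeddings.
import torch
import torch.nn as nn

class AudioConditionedLM(nn.Module):
    def __init__(self, llm_dim=1024, audio_dim=512, vocab=32_000):
        super().__init__()
        self.audio_encoder = nn.GRU(80, audio_dim, batch_first=True)  # stand-in
        self.project = nn.Linear(audio_dim, llm_dim)    # audio -> LLM space
        self.text_embed = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, 8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, 4)      # stand-in for the LLM
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, mel, context_ids, text_ids):
        audio_h, _ = self.audio_encoder(mel)            # (B, frames, audio_dim)
        audio_emb = self.project(audio_h)               # continuous, not quantized
        ctx_emb = self.text_embed(context_ids)          # e.g. domain hints, history
        txt_emb = self.text_embed(text_ids)             # transcript tokens
        h = self.llm(torch.cat([audio_emb, ctx_emb, txt_emb], dim=1))
        return self.lm_head(h[:, -text_ids.size(1):])   # logits for transcript

model = AudioConditionedLM()
logits = model(torch.randn(2, 300, 80),                 # 3 s of mel frames
               torch.randint(0, 32_000, (2, 16)),       # contextual tokens
               torch.randint(0, 32_000, (2, 24)))       # target text tokens
print(logits.shape)  # torch.Size([2, 24, 32000])
```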

2024.06.04
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground-truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named Seed-TTS_DiT, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, Seed-TTS_DiT does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report/
Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, Junjie Pan, Xin Wang, Yuping Wang, Yuxuan Wang, Zhen Wei, Jian Wu, Chao Yao, Yifeng Yang, Yuanhao Yi, Junteng Zhang, Qidi Zhang, Shuo Zhang, Wenjie Zhang, Yang Zhang, Zilin Zhao, Dejian Zhong, Xiaobin Zhuang
Speech&Audio
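The "speech in-context learning" highlighted in the abstract amounts to prompt continuation: a speech LM sees the voice prompt's transcript, the target text, and the voice prompt's codec tokens, then continues with codec tokens for the target text in the same voice. The toy model and greedy loop below are hypothetical stand-ins for that layout, not Seed-TTS's API.

```python
# Toy stand-in for an AR speech LM over a joint text+codec vocabulary;
# real systems use a large transformer and a neural codec/vocoder.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB = 32_000, 8_192
speech_lm = nn.Sequential(nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, 256),
                          nn.Linear(256, TEXT_VOCAB + AUDIO_VOCAB))

@torch.no_grad()
def synthesize(prompt_text, target_text, prompt_speech, max_new=200):
    # In-context conditioning: prompt transcript + target text + prompt
    # codec tokens, then continue with new codec tokens in the same voice.
    seq = torch.cat([prompt_text, target_text, prompt_speech + TEXT_VOCAB])
    out = []
    for _ in range(max_new):
        logits = speech_lm(seq[-1:])                # toy: condition on last token
        nxt = logits[0, TEXT_VOCAB:].argmax() + TEXT_VOCAB  # audio ids only
        out.append(nxt - TEXT_VOCAB)                # store as codec token id
        seq = torch.cat([seq, nxt.unsqueeze(0)])
    return torch.stack(out)                         # a vocoder renders audio

codec_tokens = synthesize(
    torch.randint(0, TEXT_VOCAB, (12,)),            # transcript of voice prompt
    torch.randint(0, TEXT_VOCAB, (30,)),            # text to synthesize
    torch.randint(0, AUDIO_VOCAB, (150,)))          # codec tokens of voice prompt
print(codec_tokens.shape)                           # torch.Size([200])
```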
Technology Showcase

Seed-TTS
The Doubao speech generation model offers outstanding in-context learning ability and naturalness. It deeply understands storylines and characters, expresses emotion correctly, and preserves pronunciation habits such as elision and accent, producing voices that rival real human speech.
Speech
Generation

Seed-ASR
The Doubao speech recognition model leverages stronger context awareness to infer more accurate transcriptions, and a single model can recognize Mandarin together with Cantonese, Shanghainese, Sichuanese, Xi'an dialect, Hokkien, and other Chinese dialects.
Speech
Recognition

Seed-Music
Seed-Music is a family of music generation models with flexible control. It offers four core capabilities: controllable music generation, score-to-music conversion, lyric and melody editing, and zero-shot voice cloning. It combines the strengths of language models and diffusion models and integrates into music composition workflows.
Music
Generation