Multimodal Interaction & World Model
The Doubao Foundation Model Multimodal Interaction & World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities, and to driving the exploration and development of multimodal assistant products.
Research Directions
Multimodal Understanding Foundation Models and Applications
Build understanding models that fuse vision, audio, and language; improve fundamental capabilities for images and video such as text recognition, layout understanding, grounding, and spatial relationships; and strengthen multimodal reasoning. Improve training and inference efficiency, enable long-term user memory, and optimize the model experience across all kinds of end devices.
Multimodal · Foundation
Multimodal Agents and Reasoning
Advance frontier multimodal capabilities including multimodal RAG, visual CoT, and agents, and build general-purpose multimodal agents for virtual worlds such as GUIs and games.
Multimodal · Foundation · Agent
Unified Models for Generation and Understanding
Explore unified representation and training methods for continuous and discrete signals, and build models that interleave generation and understanding.
Multimodal · World Model
World Models
Model diverse virtual and real-world environments with techniques such as pre-training and simulation, providing the foundational capabilities for multimodal interactive exploration.
Multimodal · World Model

Selected Papers

2025.01.21
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA results on 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, on the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). On AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception, which leverages a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflective thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
Computer Vision
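
To make the perceive-reason-act loop from the abstract concrete, here is a minimal sketch of a screenshot-in, action-out agent with a unified action space. It illustrates the idea only: the Action schema, VLMPolicy, and environment interface are hypothetical stand-ins, not the UI-TARS implementation.

# Minimal sketch of a screenshot-in, action-out GUI agent loop in the
# spirit of UI-TARS. All names (Action, VLMPolicy, run_episode) are
# hypothetical stand-ins, not the actual UI-TARS API.
from dataclasses import dataclass, field


@dataclass
class Action:
    """One action in a platform-agnostic (unified) action space."""
    kind: str                                  # e.g. "click", "type", "scroll", "finish"
    args: dict = field(default_factory=dict)   # e.g. {"x": 0.42, "y": 0.17} in normalized coords


class VLMPolicy:
    """Placeholder for a vision-language model mapping
    (screenshot, task, history) -> (reasoning, Action)."""

    def step(self, screenshot: bytes, task: str,
             history: list[Action]) -> tuple[str, Action]:
        raise NotImplementedError  # the model call would go here


def run_episode(env, policy: VLMPolicy, task: str,
                max_steps: int = 50) -> list[Action]:
    """Perceive-reason-act loop: the agent sees only screenshots and emits
    human-like keyboard/mouse actions until it finishes or the step budget
    (cf. the 50- and 15-step OSWorld settings above) runs out."""
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = env.screenshot()                             # perception: pixels only
        thought, action = policy.step(screenshot, task, history)  # deliberate reasoning
        history.append(action)
        if action.kind == "finish":
            break
        env.execute(action)                                       # grounded interaction
    return history
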
2024.11.05
Classification Done Right for Vision-Language Pre-Training
We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts image features against the output of a text encoder, SuperClass directly uses tokenized raw text as supervised classification labels, without the need for additional text filtering or selection. Because there is no text encoding to serve as a contrastive target, SuperClass requires no text encoder and does not need to maintain a large batch size as CLIP does. SuperClass demonstrates superior performance on various downstream tasks, including classic computer vision benchmarks and vision-language tasks. We further explore the scaling behavior of SuperClass with respect to model size, training length, and data size, and report encouraging results and comparisons to CLIP. https://github.com/x-cls/superclass
Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan
Computer Vision
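
The core idea, classification over tokenized raw text instead of contrastive alignment, fits in a few lines. Below is a minimal PyTorch-style sketch under one possible reading of the setup (multi-hot token targets with binary cross-entropy); vision_encoder, tokenize, and vocab_size are placeholders, not the released SuperClass code.

# SuperClass-style supervision sketch: tokenize the raw caption and treat
# the token IDs as multi-label classification targets over the tokenizer
# vocabulary. No text encoder and no contrastive pairs, hence no
# large-batch requirement. All names are placeholders for illustration.
import torch
import torch.nn.functional as F


def superclass_loss(images: torch.Tensor,
                    captions: list[str],
                    vision_encoder: torch.nn.Module,
                    classifier: torch.nn.Linear,
                    tokenize,                 # str -> list[int] of token IDs
                    vocab_size: int) -> torch.Tensor:
    """Binary cross-entropy of image features against the multi-hot set of
    caption token IDs (one possible reading of the paper's setup)."""
    feats = vision_encoder(images)            # (B, D) global image features
    logits = classifier(feats)                # (B, V) one logit per vocab token

    # Multi-hot targets: 1 for every token that appears in the caption.
    targets = torch.zeros(len(captions), vocab_size, device=logits.device)
    for i, cap in enumerate(captions):
        targets[i, tokenize(cap)] = 1.0

    return F.binary_cross_entropy_with_logits(logits, targets)
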
2024.10.03
Video Instruction Tuning With Synthetic Data
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
Computer Vision
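
The three task types named in the abstract (detailed captioning, open-ended QA, multiple-choice QA) suggest a simple record schema for such a dataset. The field names below are illustrative guesses for exposition, not the released LLaVA-Video-178K format.

# Illustrative schema for a synthetic video-instruction record covering the
# three task types named in the abstract. Field names are guesses, not the
# released LLaVA-Video-178K format.
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class VideoInstructionSample:
    video_path: str                      # source clip
    task: Literal["caption", "open_qa", "mc_qa"]
    question: Optional[str]              # None for detailed captioning
    choices: Optional[list[str]]         # only for multiple-choice QA
    answer: str                          # caption text, free-form answer,
                                         # or the correct choice letter


samples = [
    VideoInstructionSample("clip_001.mp4", "caption", None, None,
                           "A person slices vegetables, then ..."),
    VideoInstructionSample("clip_001.mp4", "open_qa",
                           "What does the person do after slicing?", None,
                           "They put the vegetables in a pan."),
    VideoInstructionSample("clip_001.mp4", "mc_qa",
                           "What tool is used?",
                           ["A. knife", "B. spoon", "C. whisk"], "A"),
]
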
View More Papers
Applications
Seed-VLM
Seed-VLM is a frontier visual assistant built for Doubao scenarios. Post-training aligned with user preferences ensures highly reliable responses, and visual chain-of-thought (Visual CoT) delivers a richer feature experience.
Visual-Language Model
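
Visual CoT here means the model lays out intermediate visual evidence before committing to an answer. A minimal prompting sketch of that pattern follows; vlm_generate is a hypothetical inference function, not the Seed-VLM API.

# Minimal sketch of a visual chain-of-thought (Visual CoT) call: ask the
# model to reason over the image step by step before answering.
# vlm_generate is a hypothetical inference function, not the Seed-VLM API.

COT_TEMPLATE = (
    "Look at the image and answer the question.\n"
    "First list the relevant visual evidence step by step, "
    "then give the final answer on a line starting with 'Answer:'.\n"
    "Question: {question}"
)


def visual_cot_answer(image_bytes: bytes, question: str, vlm_generate) -> str:
    """Run one Visual-CoT-style query and extract the final answer."""
    reply = vlm_generate(image=image_bytes,
                         prompt=COT_TEMPLATE.format(question=question))
    # The reasoning steps can be kept for logging; return just the answer.
    for line in reply.splitlines():
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return reply.strip()  # fall back to the raw reply
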

Open Positions

Multimodal World Model Research Scientist/Expert - Doubao Foundation Model
Beijing/Shanghai/Hangzhou/Shenzhen
Experienced Hire
Apply Now
Multimodal World Model Algorithm Engineer/Expert - Doubao Foundation Model
Beijing/Shanghai/Hangzhou/Shenzhen
Experienced Hire
Apply Now
Vision Foundation Model Algorithm Expert - Doubao Foundation Model (Top Seed)
Beijing/Shanghai/Hangzhou/Shenzhen
Campus Recruitment
Apply Now
Multimodal World Model Algorithm Intern - Doubao Foundation Model
Beijing/Shanghai/Hangzhou/Shenzhen
Internship
Apply Now