Infrastructures
The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.
Research topics
Ultra-large-scale training clusters
Study methods to improve the stability and Model FLOPs Utilization (MFU) of large-scale training clusters, including cross-cluster, low-precision, fault-tolerant, and elastic training techniques.
Large-scale
Stability
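MFU compares the FLOPs a training run actually sustains against the hardware's theoretical peak. A minimal back-of-the-envelope sketch, using the common ~6 × parameters FLOPs-per-token approximation for a dense decoder's forward and backward pass (an illustrative approximation, not this team's exact accounting; the example numbers are hypothetical):

```python
def model_flops_utilization(tokens_per_sec, params, n_gpus, peak_flops_per_gpu):
    """Estimate MFU: achieved training FLOPs over aggregate peak FLOPs.

    Uses the rough rule of thumb that one training step costs about
    6 * params FLOPs per token (2x forward + 4x backward) for a dense
    transformer. Real accounting also counts attention and recomputation.
    """
    achieved_flops_per_sec = 6.0 * params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical example: a 7B-parameter model training at 1.2M tokens/s
# on 256 GPUs with 312 TFLOP/s BF16 peak each.
mfu = model_flops_utilization(1.2e6, 7e9, 256, 312e12)
```

Improving MFU means closing the gap between the numerator (throughput) and denominator (idle or stalled accelerators), which is where fault tolerance and elasticity matter at cluster scale.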
Reinforcement learning systems
Research on end-to-end reinforcement learning systems for large models, designing next-generation RL systems for dynamic loads, complex agent/environment interactions, heterogeneous resources, and multimodal scenarios.
Reinforcement learning
Agent
Optimization
Inference parallelization solutions
Research on overcoming compute and memory access bottlenecks during inference, including multi-node inference and parallel inference strategies on heterogeneous hardware.
Inference
Parallel
Next-Generation Model and Hardware Co-Optimization
Research on advanced model architectures, training and inference paradigms by co-designing next-generation hardware systems with next-generation generative and understanding model architectures.
Systems-algorithm co-design
Model architecture
Compiler Optimization for Heterogeneous Hardware
Research on high-performance operator compilation and joint optimization of computation and communication for emerging hardware architectures.
Heterogeneous systems
Compiler

Selected Papers

Mar 20, 2025
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the optimal approximation rate.
Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
LLM
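The core idea of a polynomial composition activation can be sketched as a weighted sum of powers of a base activation such as ReLU. This is a minimal illustrative sketch, not the paper's exact parameterization: the coefficient values are placeholders (learnable in practice), and the composition order is an assumption.

```python
import numpy as np

def poly_relu(x, coeffs=(0.5, 0.25, 0.25)):
    """Sketch of a polynomial-composition activation.

    Computes sum_i c_i * ReLU(x)^(i+1): the higher powers inject extra
    nonlinearity beyond plain ReLU. Coefficients are fixed here for
    illustration; in a trained network they would be learnable.
    """
    r = np.maximum(x, 0.0)
    return sum(c * r ** (i + 1) for i, c in enumerate(coeffs))
```

Note that for negative inputs the output is zero, matching ReLU, while positive inputs pass through a degree-3 polynomial rather than the identity.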
Mar 18, 2025
Hyper-Connections
We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.
Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou
LLM
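Hyper-connections replace the single residual stream with several parallel streams whose mixing weights are learnable. The sketch below is a simplified static variant under stated assumptions: the paper also describes dynamic, input-dependent weights, and the function and parameter names here are illustrative, not the paper's.

```python
import numpy as np

def hyper_connection_step(streams, layer_fn, alpha, w_in, w_out):
    """One layer applied under (static) hyper-connections.

    streams: (n, d) parallel hidden streams replacing the single residual
             stream; alpha: (n, n) stream-mixing weights (width connections);
    w_in / w_out: (n,) weights that read the layer's input from the streams
    and write its output back to them (depth connections).
    """
    layer_out = layer_fn(w_in @ streams)       # layer sees a stream mixture
    mixed = alpha @ streams                    # width connections
    return mixed + np.outer(w_out, layer_out)  # depth connections

# With n = 1 and unit weights this reduces to an ordinary residual
# connection: y = x + f(x).
x = np.ones((1, 4))
f = lambda v: 2.0 * v
y = hyper_connection_step(x, f, np.array([[1.0]]), np.array([1.0]), np.array([1.0]))
```

Because the standard residual connection is a special case, the extra expressivity comes from letting the network learn how strongly features at different depths connect.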
Mar 01, 2025
TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice
The Mixture of Experts (MoE) architecture has emerged as a promising solution to reduce computational overhead by selectively activating subsets of model parameters. The effectiveness of MoE models depends primarily on their routing mechanisms, with the widely adopted Top-K routing scheme used for activating experts. However, the Top-K scheme has notable limitations, including unnecessary activations and underutilization of experts. In this work, rather than modifying the routing mechanism as done in previous studies, we propose the Ternary Choice MoE (TC-MoE), a novel approach that expands the expert space by applying the ternary set {-1, 0, 1} to each expert. This expansion allows more efficient and effective expert activations without incurring significant computational costs. Additionally, given the unique characteristics of the expanded expert space, we introduce a new load balance loss and reward loss to ensure workload balance and achieve a flexible trade-off between effectiveness and efficiency. Extensive experiments demonstrate that TC-MoE achieves an average improvement of over 1.1% compared with traditional approaches, while reducing the average number of activated experts by up to 9%. These results confirm that TC-MoE effectively addresses the inefficiencies of conventional routing schemes, offering a more efficient and scalable solution for MoE-based large language models. Code and models are available at https://github.com/stiger1000/TC-MoE.
Shen Yan, Xingyan Bin, Sijun Zhang, Yisen Wang, Zhouchen Lin
LLM
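The ternary-choice idea can be sketched as routing over an expanded expert set in which every expert appears with scales {-1, 0, +1}, where selecting the 0-scaled copy activates no compute. This is a hedged illustration: the softmax gating, function names, and index layout are assumptions, not the released TC-MoE implementation.

```python
import numpy as np

def tc_moe_layer(x, experts, router_logits, top_k=2):
    """Sketch of Ternary Choice MoE routing.

    Each of the n experts is tripled into candidates (-e, 0, +e), so
    router_logits has length 3 * n over the expanded expert space.
    Choosing a 0-scaled candidate skips that expert's computation.
    """
    scales = np.array([-1.0, 0.0, 1.0])
    top = np.argsort(router_logits)[-top_k:]   # indices into expanded space
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                       # softmax over selected logits
    out = np.zeros_like(x)
    for g, idx in zip(gates, top):
        e, s = divmod(int(idx), 3)             # recover (expert, scale) pair
        if scales[s] != 0.0:                   # zero choice: no expert compute
            out += g * scales[s] * experts[e](x)
    return out
```

The zero-scaled copies are what let the router spend less compute on easy tokens, which is consistent with the reported reduction in average activated experts.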

Featured Jobs

Research Scientist in ML Systems
Seattle / San Jose
Experienced Hiring
Software Engineer, ML System Architecture
Seattle / San Jose
Experienced Hiring
Research Scientist, Applied Machine Learning
Seattle / San Jose
Campus Recruitment
Software Engineer in Machine Learning Systems
Seattle / San Jose
Campus Recruitment
Software Engineer Intern (Doubao (Seed) - Machine Learning System)
Seattle / San Jose
Internship
Research Scientist Intern (Doubao (Seed) - Machine Learning System)
Seattle / San Jose
Internship