Infrastructures
The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.
Research topics
Ultra-large-scale training clusters
Study methods to improve the stability and Model FLOPs Utilization (MFU) of large-scale training clusters, including cross-cluster, low-precision, fault-tolerant, and elastic training techniques.
Large-scale
Stability
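MFU compares the FLOPs a training run actually sustains against the hardware's theoretical peak. A minimal back-of-the-envelope sketch, using the common ~6 × parameters FLOPs-per-token approximation for a dense decoder's forward and backward pass (an illustrative approximation, not this team's exact accounting; the example numbers are hypothetical):

```python
def model_flops_utilization(tokens_per_sec, params, n_gpus, peak_flops_per_gpu):
    """Estimate MFU: achieved training FLOPs over aggregate peak FLOPs.

    Uses the rough rule of thumb that one training step costs about
    6 * params FLOPs per token (2x forward + 4x backward) for a dense
    transformer. Real accounting also counts attention and recomputation.
    """
    achieved_flops_per_sec = 6.0 * params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical example: a 7B-parameter model training at 1.2M tokens/s
# on 256 GPUs with 312 TFLOP/s BF16 peak each.
mfu = model_flops_utilization(1.2e6, 7e9, 256, 312e12)
```

Improving MFU means closing the gap between the numerator (throughput) and denominator (idle or stalled accelerators), which is where fault tolerance and elasticity matter at cluster scale.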
Reinforcement learning systems
Research on end-to-end reinforcement learning systems for large models, designing next-generation RL systems for dynamic loads, complex agent/environment interactions, heterogeneous resources, and multimodal scenarios.
Reinforcement learning
Agent
Optimization
Inference parallelization solutions
Research on overcoming compute and memory access bottlenecks during inference, including multi-node inference and parallel inference strategies on heterogeneous hardware.
Inference
Parallel
Next-Generation Model and Hardware Co-Optimization
Research on advanced model architectures, training and inference paradigms by co-designing next-generation hardware systems with next-generation generative and understanding model architectures.
Systems-algorithm co-design
Model architecture
Compiler Optimization for Heterogeneous Hardware
Research on high-performance operator compilation and joint optimization of computation and communication for emerging hardware architectures.
Heterogeneous systems
Compiler

Selected Papers

Mar 20, 2025
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the optimal approximation rate.
Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
LLM
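The core idea of a polynomial composition activation can be sketched as a weighted sum of powers of a base activation such as ReLU. This is a minimal illustrative sketch, not the paper's exact parameterization: the coefficient values are placeholders (learnable in practice), and the composition order is an assumption.

```python
import numpy as np

def poly_relu(x, coeffs=(0.5, 0.25, 0.25)):
    """Sketch of a polynomial-composition activation.

    Computes sum_i c_i * ReLU(x)^(i+1): the higher powers inject extra
    nonlinearity beyond plain ReLU. Coefficients are fixed here for
    illustration; in a trained network they would be learnable.
    """
    r = np.maximum(x, 0.0)
    return sum(c * r ** (i + 1) for i, c in enumerate(coeffs))
```

Note that for negative inputs the output is zero, matching ReLU, while positive inputs pass through a degree-3 polynomial rather than the identity.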
Mar 18, 2025
Hyper-Connections
We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.
Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou
LLM
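Hyper-connections replace the single residual stream with several parallel streams whose mixing weights are learnable. The sketch below is a simplified static variant under stated assumptions: the paper also describes dynamic, input-dependent weights, and the function and parameter names here are illustrative, not the paper's.

```python
import numpy as np

def hyper_connection_step(streams, layer_fn, alpha, w_in, w_out):
    """One layer applied under (static) hyper-connections.

    streams: (n, d) parallel hidden streams replacing the single residual
             stream; alpha: (n, n) stream-mixing weights (width connections);
    w_in / w_out: (n,) weights that read the layer's input from the streams
    and write its output back to them (depth connections).
    """
    layer_out = layer_fn(w_in @ streams)       # layer sees a stream mixture
    mixed = alpha @ streams                    # width connections
    return mixed + np.outer(w_out, layer_out)  # depth connections

# With n = 1 and unit weights this reduces to an ordinary residual
# connection: y = x + f(x).
x = np.ones((1, 4))
f = lambda v: 2.0 * v
y = hyper_connection_step(x, f, np.array([[1.0]]), np.array([1.0]), np.array([1.0]))
```

Because the standard residual connection is a special case, the extra expressivity comes from letting the network learn how strongly features at different depths connect.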
Mar 01, 2025
TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice
The Mixture of Experts (MoE) architecture has emerged as a promising solution to reduce computational overhead by selectively activating subsets of model parameters. The effectiveness of MoE models depends primarily on their routing mechanisms, with the widely adopted Top-K routing scheme used for activating experts. However, the Top-K scheme has notable limitations, including unnecessary activations and underutilization of experts. In this work, rather than modifying the routing mechanism as done in previous studies, we propose the Ternary Choice MoE (TC-MoE), a novel approach that expands the expert space by applying the ternary set {-1, 0, 1} to each expert. This expansion allows more efficient and effective expert activations without incurring significant computational costs. Additionally, given the unique characteristics of the expanded expert space, we introduce a new load balance loss and reward loss to ensure workload balance and achieve a flexible trade-off between effectiveness and efficiency. Extensive experiments demonstrate that TC-MoE achieves an average improvement of over 1.1% compared with traditional approaches, while reducing the average number of activated experts by up to 9%. These results confirm that TC-MoE effectively addresses the inefficiencies of conventional routing schemes, offering a more efficient and scalable solution for MoE-based large language models. Code and models are available at https://github.com/stiger1000/TC-MoE.
Shen Yan, Xingyan Bin, Sijun Zhang, Yisen Wang, Zhouchen Lin
LLM
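The ternary-choice idea can be sketched as routing over an expanded expert set in which every expert appears with scales {-1, 0, +1}, where selecting the 0-scaled copy activates no compute. This is a hedged illustration: the softmax gating, function names, and index layout are assumptions, not the released TC-MoE implementation.

```python
import numpy as np

def tc_moe_layer(x, experts, router_logits, top_k=2):
    """Sketch of Ternary Choice MoE routing.

    Each of the n experts is tripled into candidates (-e, 0, +e), so
    router_logits has length 3 * n over the expanded expert space.
    Choosing a 0-scaled candidate skips that expert's computation.
    """
    scales = np.array([-1.0, 0.0, 1.0])
    top = np.argsort(router_logits)[-top_k:]   # indices into expanded space
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                       # softmax over selected logits
    out = np.zeros_like(x)
    for g, idx in zip(gates, top):
        e, s = divmod(int(idx), 3)             # recover (expert, scale) pair
        if scales[s] != 0.0:                   # zero choice: no expert compute
            out += g * scales[s] * experts[e](x)
    return out
```

The zero-scaled copies are what let the router spend less compute on easy tokens, which is consistent with the reported reduction in average activated experts.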

Featured Jobs

Research Scientist in ML Systems
Seattle / San Jose
Experienced Hiring
Software Engineer, ML System Architecture
Seattle / San Jose
Experienced Hiring
Research Scientist, Applied Machine Learning
Seattle / San Jose
Campus Recruitment
Software Engineer in Machine Learning Systems
Seattle / San Jose
Campus Recruitment
Software Engineer Intern (Doubao (Seed) - Machine Learning System)
Seattle / San Jose
Internship
Research Scientist Intern (Doubao (Seed) - Machine Learning System)
Seattle / San Jose
Internship