Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
Authors: Chong Wang, Nan Du, Tom Gunter, Tao Lei, Kulin Seth, Senyu Tong, Jianyu Wang, Guoli Yin, Xiyou Zhou, Kelvin Zou, Ruoming Pang
Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, frequently requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor parallelism decomposes matrix operations across devices but introduces substantial inter-GPU synchronization, leading to communication bottlenecks and degraded scalability. We propose the Parallel Track (PT) Transformer, a novel architectural paradigm that restructures computation to minimize cross-device dependencies. PT achieves up to a 16x reduction in synchronization operations relative to standard tensor parallelism, while maintaining competitive model quality in our experiments. We integrate PT into two widely adopted LLM serving stacks, TensorRT-LLM and vLLM, and report consistent improvements in serving efficiency in both settings: up to 15-30% lower time to first token, 2-12% lower time per output token, and up to 31.9% higher throughput.
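To make the synchronization cost concrete, the following is a minimal single-process simulation of a tensor-parallel MLP layer: the first weight matrix is split by columns and the second by rows, so each "device" computes a partial output independently, and one all-reduce (here simulated as a sum across shards) is needed per layer. The shard count and dimensions are illustrative and not taken from the paper; this sketches the baseline sync pattern that PT is designed to reduce, not the PT architecture itself.

```python
import numpy as np

# Toy simulation of tensor-parallel MLP inference across 4 "devices".
# Column-parallel first matmul, row-parallel second matmul; the partial
# outputs must then be summed with an all-reduce -- the per-layer sync
# point whose count PT reduces. All shapes here are illustrative.
np.random.seed(0)
n_devices, d_model, d_ff = 4, 8, 16

x = np.random.randn(1, d_model)
W1 = np.random.randn(d_model, d_ff)   # split by columns across devices
W2 = np.random.randn(d_ff, d_model)   # split by rows across devices

W1_shards = np.split(W1, n_devices, axis=1)
W2_shards = np.split(W2, n_devices, axis=0)

# Each device computes its partial result with no communication...
partials = [np.maximum(x @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]

# ...then a single all-reduce (simulated as a sum) synchronizes devices.
y_parallel = sum(partials)

# Reference: the same MLP computed on one device.
y_single = np.maximum(x @ W1, 0) @ W2
assert np.allclose(y_parallel, y_single)
```

The sharding is exact because the elementwise ReLU commutes with the column split: each device's ReLU output is exactly its slice of the full activation, so the partial matmuls sum to the unsharded result.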
SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models
May 22, 2025 · research areas: Methods and Algorithms; Speech and Natural Language Processing · conference: ICML
With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as tensor parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce…
Towards Low-Bit Communication for Tensor Parallel LLM Inference
November 19, 2024 · research areas: Methods and Algorithms; Speech and Natural Language Processing · Workshop at NeurIPS
This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP) Workshop at NeurIPS 2024.
Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency, albeit at an additional communication cost. However, as server LLMs continue to scale in size, they will need to be distributed across more devices, magnifying the communication cost. One way to approach this problem…
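One generic way to shrink this communication cost, sketched below, is to quantize the partial activations each device sends to low precision before the all-reduce and dequantize on receipt. This is a hedged illustration of the general idea, not the paper's specific method: the symmetric per-tensor int8 scheme, shapes, and error tolerance are all assumptions chosen for the example.

```python
import numpy as np

# Hedged sketch (not the paper's method): symmetric per-tensor int8
# quantization of tensor-parallel partial outputs before the all-reduce,
# cutting bytes on the wire by 4x versus float32.
def quantize_int8(t):
    # Per-tensor scale mapping the max magnitude to 127 (guard against 0).
    scale = max(np.abs(t).max() / 127.0, 1e-8)
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

np.random.seed(0)
# Partial results from 4 "devices", as would feed an all-reduce.
partials = [np.random.randn(1, 8).astype(np.float32) for _ in range(4)]

# Each device quantizes its partial result before communicating it...
wire = [quantize_int8(p) for p in partials]
# ...and the reduction is performed on the dequantized values.
recovered = [dequantize_int8(q, s) for q, s in wire]

y_lowbit = sum(recovered)
y_full = sum(partials)
# Per-element rounding error is at most scale/2 per shard, so the summed
# error stays small relative to the full-precision result.
assert np.max(np.abs(y_lowbit - y_full)) < 0.1
```

In a real multi-GPU setting the reduction itself would need either dequantize-then-reduce at each hop or a quantization-aware collective; the sketch only shows the endpoint arithmetic.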