
Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, frequently requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor parallelism decomposes matrix operations across devices but introduces substantial inter-GPU synchronization, leading to communication bottlenecks and degraded scalability. We propose the Parallel Track (PT) Transformer, a novel architectural paradigm that restructures computation to minimize cross-device dependencies. PT achieves up to a 16x reduction in synchronization operations relative to standard tensor parallelism, while maintaining competitive model quality in our experiments. We integrate PT into two widely adopted LLM serving stacks, TensorRT-LLM and vLLM, and report consistent improvements in serving efficiency, including 15-30% reductions in time to first token, 2-12% reductions in time per output token, and up to 31.90% higher throughput in both settings.
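
To make the synchronization argument concrete, the following is a minimal, illustrative sketch, not the paper's implementation. It assumes a simplified reading of the abstract in which standard tensor parallelism performs one all-reduce per layer over row-parallel partial outputs, while a "parallel track" layout runs independent per-device tracks of layers and only combines activations at track boundaries. Names such as `num_devices`, `track_depth`, and `simulated_all_reduce` are hypothetical choices made for this example.

```python
# Hedged sketch: contrast sync counts of Megatron-style tensor parallelism
# (one all-reduce per MLP layer) with a hypothetical parallel-track layout
# that only synchronizes every `track_depth` layers. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
num_devices, num_layers, d_model, d_ff = 4, 32, 64, 256

def simulated_all_reduce(shards):
    """Stand-in for an inter-GPU all-reduce: sum partial results from all shards."""
    return sum(shards)

def tensor_parallel_forward(x):
    """Each layer's FFN is column/row sharded, so every layer ends with a sync."""
    syncs = 0
    for _ in range(num_layers):
        w1 = rng.standard_normal((d_model, d_ff))   # sharded by column across devices
        w2 = rng.standard_normal((d_ff, d_model))   # sharded by row across devices
        cols = np.array_split(w1, num_devices, axis=1)
        rows = np.array_split(w2, num_devices, axis=0)
        partials = [np.maximum(x @ c, 0) @ r for c, r in zip(cols, rows)]
        x = simulated_all_reduce(partials)          # cross-device sync every layer
        syncs += 1
    return x, syncs

def parallel_track_forward(x, track_depth=8):
    """Each device runs its own track of layers; devices sync only at track ends."""
    syncs = 0
    for _ in range(num_layers // track_depth):
        track_outputs = []
        for _ in range(num_devices):                # independent per-device track
            h = x
            for _ in range(track_depth):
                w1 = rng.standard_normal((d_model, d_ff))
                w2 = rng.standard_normal((d_ff, d_model))
                h = np.maximum(h @ w1, 0) @ w2      # no cross-device traffic here
            track_outputs.append(h)
        x = simulated_all_reduce(track_outputs)     # sync only at track boundaries
        syncs += 1
    return x, syncs

x = rng.standard_normal((1, d_model))
_, tp_syncs = tensor_parallel_forward(x)
_, pt_syncs = parallel_track_forward(x)
print(f"tensor-parallel syncs: {tp_syncs}, parallel-track syncs: {pt_syncs}")
```

In this toy configuration (32 layers, tracks of 8), the track layout issues 4 synchronizations instead of 32, an 8x reduction; the actual PT architecture and the configurations behind the reported 16x figure are described in the paper itself.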

Related readings and updates.

With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce…


This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP) Workshop at NeurIPS 2024.

Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency, despite the additional communication cost it introduces. However, as server LLMs continue to scale in size, they will need to be distributed across more devices, magnifying the communication cost. One way to approach this problem…
