Accelerating LLM Inference on NVIDIA GPUs with ReDrafter
Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year, we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state-of-the-art performance. ReDrafter uses an RNN draft model and combines beam search with dynamic tree attention to generate up to 3.5 tokens per generation step for open source models, surpassing prior speculative decoding techniques.
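To give a feel for the dynamic tree attention idea, here is a simplified sketch (not the actual ReDrafter or TensorRT-LLM implementation): beam search produces candidate continuations that often share prefixes, and packing them into a prefix tree lets the target model verify each shared prefix once rather than once per candidate. The token values below are made up for illustration.

```python
# Illustrative sketch: deduplicating shared prefixes among beam-search
# draft candidates with a prefix tree (trie), so each unique prefix is
# verified only once by the target model.

def build_prefix_tree(candidates):
    """Insert each candidate token sequence into a trie; return it with its node count."""
    root = {}
    nodes = 0
    for seq in candidates:
        node = root
        for tok in seq:
            if tok not in node:
                node[tok] = {}
                nodes += 1  # each trie node is one token position to verify
            node = node[tok]
    return root, nodes

# Hypothetical beam-search output: three drafts sharing the prefix (5, 9)
beams = [(5, 9, 2, 7), (5, 9, 2, 4), (5, 9, 8, 1)]
_, unique_tokens = build_prefix_tree(beams)
flat_tokens = sum(len(b) for b in beams)  # 12 positions if verified independently
print(unique_tokens, flat_tokens)  # → 7 12: the tree covers all drafts in 7 positions
```

In ReDrafter, an attention mask built over this tree lets a single target-model forward pass score every candidate branch in parallel.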
Figure 1: Tokens-per-second speed-up with ReDrafter.
Productionizing ReDrafter to Speed up NVIDIA TensorRT-LLM
This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference. To make this advancement production-ready for NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
Although TensorRT-LLM supports numerous open source LLMs and the Medusa speculative decoding method, ReDrafter’s beam search and tree attention algorithms rely on operators that had never been used in previous applications. To enable the integration of ReDrafter, NVIDIA added new operators or exposed existing ones, which considerably improved TensorRT-LLM’s capability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter’s accelerated token generation for their production LLM applications with TensorRT-LLM.
In benchmarking a tens-of-billions-parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we saw a 2.7x speed-up in generated tokens per second for greedy decoding (see Figure 1). These benchmark results indicate that this technique could significantly reduce the latency users experience, while also using fewer GPUs and consuming less power.
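The greedy-decoding gain comes from the target model accepting several drafted tokens per generation step. A minimal sketch of that draft-then-verify loop, with hypothetical stand-in models (this is not the TensorRT-LLM API):

```python
# Illustrative speculative decoding step with greedy verification.
# draft_model(ctx, k) is a hypothetical cheap drafter returning k candidate
# tokens; target_model(seq) is a hypothetical target returning its greedy
# next-token prediction after every prefix of seq.

def speculative_step(target_model, draft_model, context, k=4):
    draft = draft_model(context, k)
    # One target forward pass scores all drafted positions at once;
    # the last k+1 predictions correspond to the draft region.
    verified = target_model(context + draft)[-(k + 1):]
    accepted = []
    for i, tok in enumerate(draft):
        if verified[i] == tok:
            accepted.append(tok)          # target agrees with the draft
        else:
            accepted.append(verified[i])  # target's correction; stop here
            return accepted
    accepted.append(verified[k])          # bonus token: whole draft accepted
    return accepted

# Toy models: the target greedily predicts previous token + 1.
target = lambda seq: [t + 1 for t in seq]
drafter = lambda ctx, k: [1, 2, 9, 9]  # only the first two tokens will match
print(speculative_step(target, drafter, [0]))  # → [1, 2, 3]: 3 tokens in one step
```

Because every accepted token is exactly what greedy decoding would have produced, the output is unchanged; only the number of target-model steps shrinks.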
For additional detail, see this post on the NVIDIA developer blog.
Conclusion
LLMs are increasingly being used to power production applications, and improving inference efficiency can both impact computational costs and reduce latency for users. With ReDrafter’s novel approach to speculative decoding integrated into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.
Acknowledgements
Many people contributed to this project including: Aonan Zhang, Xuanyu Zhang, Yunfei Cheng, Chong Wang, Yi Wang, Abhishek Udupa, Dhaval Doshi, and our collaborators at NVIDIA.