Accelerating LLM inference is an important ML research problem: auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year, we published and open-sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state-of-the-art performance. ReDrafter uses an RNN draft model and combines beam search with dynamic tree attention to accept up to 3.5 tokens per generation step on open-source models, surpassing prior speculative decoding techniques.
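To make the draft-then-verify idea concrete, below is a minimal sketch of greedy speculative decoding in Python. It is not the ReDrafter implementation: ReDrafter drafts with an RNN conditioned on the target model's hidden state and verifies a whole tree of beam-search candidates with dynamic tree attention, whereas this sketch drafts and verifies a single greedy path. The `draft_next` and `target_argmax` callables are hypothetical stand-ins for real model calls.

```python
def speculative_generate(prompt, draft_next, target_argmax, k=4, max_len=32):
    """Greedy speculative decoding sketch: draft k tokens cheaply, then
    keep the prefix the target model agrees with.

    draft_next(seq)    -> cheap draft model's next greedy token for seq
    target_argmax(seq) -> target model's next greedy token for seq
    """
    tokens = list(prompt)
    while len(tokens) < max_len:
        # 1. Draft: the small model proposes k tokens auto-regressively.
        drafted, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            drafted.append(t)
            ctx.append(t)

        # 2. Verify: accept drafted tokens while the target model, run on
        #    each growing prefix, would have produced the same token.
        #    (A real implementation scores all positions in one batched
        #    forward pass rather than one call per position.)
        accepted = 0
        for i, t in enumerate(drafted):
            if target_argmax(tokens + drafted[:i]) == t:
                accepted += 1
            else:
                break

        # 3. Keep the accepted prefix, plus one token from the target model,
        #    so every iteration emits at least one verified token.
        tokens += drafted[:accepted]
        tokens.append(target_argmax(tokens))
    return tokens
```

Verifying a tree of beam-search candidates instead of this single path raises the expected number of tokens accepted per target-model step, which is where ReDrafter's gain of up to 3.5 tokens per generation step comes from.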
Productionizing ReDrafter to Speed up NVIDIA TensorRT-LLM
This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference. To make this advancement production-ready for NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
Although TensorRT-LLM supports numerous open-source LLMs and the Medusa speculative decoding method, ReDrafter's beam search and dynamic tree attention algorithms rely on operators that had not been used in previous applications. To enable the integration, NVIDIA added new operators or exposed existing ones, considerably improving TensorRT-LLM's ability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter's accelerated token generation in their production LLM applications with TensorRT-LLM.
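As a rough illustration of what this looks like from the developer's side, the sketch below runs generation through TensorRT-LLM's high-level Python LLM API. The engine path is a placeholder, and the exact workflow for building a ReDrafter-enabled engine (converting the base-model and drafter checkpoints, then building with speculative decoding enabled) varies by release; see the examples in the TensorRT-LLM repository for the supported steps.

```python
# Hypothetical sketch: generation via TensorRT-LLM's Python LLM API.
# Assumes an engine directory that was already built with ReDrafter
# speculative decoding enabled; the path below is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="path/to/redrafter_engine")  # hypothetical engine dir

# Greedy decoding (temperature 0), matching the benchmark setting below.
params = SamplingParams(max_tokens=64, temperature=0.0)

for output in llm.generate(["Speculative decoding works by"], params):
    print(output.outputs[0].text)
```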
In benchmarking a tens-of-billions-parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we measured a 2.7x speed-up in generated tokens per second for greedy decoding (see Figure 1). These results indicate the technique could significantly reduce the latency users experience, while also using fewer GPUs and consuming less power.
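For context on how such a figure is computed: throughput is generated tokens divided by wall-clock time, and the speed-up is the ratio against the non-speculative baseline. The numbers below are invented for illustration and are not from the benchmark.

```python
# Illustrative arithmetic only, not the actual benchmark harness.
def tokens_per_second(num_generated_tokens: int, wall_time_s: float) -> float:
    return num_generated_tokens / wall_time_s

baseline = tokens_per_second(1024, 40.0)   # hypothetical baseline run
redrafter = tokens_per_second(1024, 14.8)  # hypothetical ReDrafter run
print(f"speed-up: {redrafter / baseline:.1f}x")  # -> speed-up: 2.7x
```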
For additional detail, see this post on the NVIDIA developer blog.
Conclusion
LLMs are increasingly being used to power production applications, and improving inference efficiency can both reduce computational costs and lower latency for users. With ReDrafter's novel approach to speculative decoding integrated into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.
Acknowledgements
Many people contributed to this project including: Aonan Zhang, Xuanyu Zhang, Yunfei Cheng, Chong Wang, Yi Wang, Abhishek Udupa, Dhaval Doshi, and our collaborators at NVIDIA.