CommVQ: Commutative Vector Quantization for KV Cache Compression
Authors: Junyan Li†, Tianle Cai‡, Yang Zhang§, Muhammad Yusuf Hassan†, Talha Chafekar†, Colorado Reed, Zhile Ren, Pengsheng Guo, Binazir Karimzadeh, Chong Wang, Chuang Gan†
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context lengths grow. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. First, we leverage additive quantization by introducing a lightweight encoder and codebook to compress the KV cache, which can then be decoded with a simple matrix multiplication. Second, to tackle the high computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and use an Expectation-Maximization (EM) algorithm to learn the codebook. This enables efficient integration of decoding into the self-attention mechanism, significantly reducing computational overhead. Our approach achieves superior accuracy through additive quantization while lowering computational costs with our RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K demonstrate that our method reduces the FP16 KV cache size by 87.5% with 2-bit quantization, while maintaining higher accuracy than state-of-the-art KV cache quantization methods. Remarkably, it enables 1-bit quantization of the KV cache with minimal accuracy degradation, making it possible to run a LLaMA-3.1 8B model with a maximum 128K context length on a single RTX 4090 GPU.
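The decode-by-matrix-multiplication idea can be illustrated with a minimal additive vector quantization sketch. The code below is an illustrative approximation only, not the paper's method: the codebook shapes, the number of additive codebooks, and the greedy nearest-neighbor encoder are assumptions for demonstration, and the learned lightweight encoder, EM-trained codebook, and RoPE-commutative structure described in the abstract are not shown.

# Sketch of additive vector quantization for one cached head vector (illustrative only;
# not CommVQ's implementation).
import numpy as np

d = 128          # head dimension (assumed)
m = 4            # number of additive codebooks (assumed)
k = 16           # entries per codebook -> 4 bits per codebook, 16 bits per vector total

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(m, k, d))    # stands in for a codebook learned offline

def encode(x):
    """Greedy residual encoding: pick the nearest entry in each codebook in turn."""
    residual = x.copy()
    codes = np.empty(m, dtype=np.int64)
    for i in range(m):
        dists = np.linalg.norm(codebooks[i] - residual, axis=1)
        codes[i] = np.argmin(dists)
        residual = residual - codebooks[i, codes[i]]
    return codes                           # m small integers instead of d FP16 values

def decode(codes):
    """Decoding is a sum of selected codebook rows, i.e. a one-hot matrix multiplication."""
    one_hot = np.zeros((m, k))
    one_hot[np.arange(m), codes] = 1.0
    return np.einsum('mk,mkd->d', one_hot, codebooks)

x = rng.normal(size=d)
x_hat = decode(encode(x))                  # lossy reconstruction of the cached vector

Because decoding reduces to a matrix product, it can in principle be folded into the attention computation; per the abstract, the RoPE-commutative codebook is what allows the position-dependent rotation to be reordered with this decoding step so the merge stays cheap.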
EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
September 23, 2025 | Research areas: Methods and Algorithms, Speech and Natural Language Processing
Recent advances in large language models (LLMs) have extended context lengths, enabling assistants to sustain long histories for coherent, personalized responses. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly dominates under strict resource constraints. An active line of research for reducing this overhead is KV cache compression, which seeks to limit cache size while…
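To see why the cache dominates memory as dialogue length grows, a back-of-the-envelope estimate helps. The snippet below assumes a LLaMA-3.1-8B-style configuration (32 layers, 8 grouped-query KV heads, head dimension 128, FP16); the numbers are illustrative, not measurements from the paper.

# Rough KV cache size for a LLaMA-3.1-8B-like model (assumed configuration).
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # keys and values
print(per_token)                          # 131072 bytes, i.e. 128 KiB per cached token
print(per_token * 128 * 1024 / 2**30)     # ~16 GiB for a 128K-token history

At roughly 128 KiB per token, a long conversation's cache alone can approach the capacity of a single consumer GPU, which is why the cache-management and quantization approaches listed on this page target the KV cache specifically.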
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
July 11, 2025 | Research area: Speech and Natural Language Processing | Conference: ICML
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding,…
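For readers unfamiliar with the baseline technique the abstract refers to, the sketch below shows a generic draft-then-verify loop for speculative decoding with greedy acceptance. It is a textbook-style illustration, not QuantSpec's self-speculative or hierarchically quantized variant, and the two callables are hypothetical placeholders for real model wrappers.

# Generic speculative decoding with greedy verification (illustrative sketch only).
def speculative_step(draft_next, target_next_batch, prefix, k=4):
    """draft_next(tokens) -> one greedy token from a cheap draft model (hypothetical).
    target_next_batch(prefix, drafted) -> the target model's greedy token at every
    drafted position plus one bonus token, from a single forward pass (hypothetical)."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):                      # cheap model proposes k tokens sequentially
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    verified = target_next_batch(prefix, drafted)   # one expensive target pass
    out = list(prefix)
    for d, v in zip(drafted, verified):
        if d == v:
            out.append(d)                   # agreement: the drafted token is kept for free
        else:
            out.append(v)                   # first mismatch: take the target token and stop
            return out
    out.append(verified[-1])                # all drafts accepted: also keep the bonus token
    return out

Each accepted draft token saves one full decoding step of the large model, which is why methods in this space focus on making the draft pass (and, in QuantSpec's setting, the KV cache it reads) as cheap as possible.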