Researchers propose Kara, a sliding-window KV cache compression method to improve reasoning LLM serving efficiency
Kara introduces a decoding-time compression technique that reduces KV cache memory usage and improves output throughput for long chain-of-thought models by leveraging bidirectional attention and a Token2Chunk module.
- Reasoning LLMs generate long chain-of-thought sequences that accumulate large KV caches, increasing decoding latency and limiting throughput.
- Existing KV cache compression methods have limitations, including rigid boundaries and potential information loss from threshold-triggered policies.
- Kara addresses these by performing decoding-time compression on recent context using bidirectional attention and a Token2Chunk module.
- The authors adapt Kara to PagedAttention and introduce KvLLM, an inference framework built on vLLM, to reduce memory usage and improve throughput.
- Extensive experiments show consistent performance improvements for Kara and KvLLM.
Long chain-of-thought (CoT) generation in reasoning language models accumulates large KV caches during decoding, which increases memory overhead and decoding latency and limits throughput. Existing KV cache compression techniques attempt to mitigate this by selectively removing unimportant KV pairs, but often introduce new challenges.
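To give a sense of scale, the following back-of-the-envelope sketch estimates KV cache size as a function of sequence length. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for illustration and is not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes.

    Each token stores one key and one value vector (hence the factor 2)
    per KV head, per layer.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model shape (an assumption, not from the paper).
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
print(f"{size / 1024**3:.1f} GiB for a single 32768-token sequence")
```

Because the estimate is linear in `seq_len` and `batch_size`, long CoT outputs directly reduce how many sequences fit in memory at once, which is the throughput pressure the article describes.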
The authors identify two key limitations in prior methods: threshold-triggered compression policies may yield limited throughput gains or even reduce throughput, and they frequently eliminate entire blocks of KV pairs, exacerbating information loss. Additionally, these methods typically preserve either isolated KV pairs or fixed-size chunks with rigid boundaries, failing to retain important semantic information at arbitrary token positions.
To overcome these limitations, the researchers propose Kara, a sliding-window KV cache compression method that operates on recently generated context during decoding. Kara uses bidirectional attention to score and select informative KV pairs within the window, enabling more precise retention of relevant information.
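The summary does not specify Kara's scoring function, so the following is only a minimal sketch of the general idea, assuming a single attention head and a score defined as the mean attention each key receives from every query in the window. The names `score_window` and `select_top_k` are hypothetical. The point it illustrates is the absence of a causal mask: tokens are also scored by the queries that come after them within the window.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def score_window(queries, keys):
    """Score each key by the mean attention it receives from all queries.

    No causal mask is applied, so every query in the window attends to
    every key in the window (bidirectional attention). The aggregation
    by mean is an assumption made for this sketch.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [0.0] * len(keys)
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        for j, weight in enumerate(softmax(logits)):
            scores[j] += weight / len(queries)
    return scores


def select_top_k(scores, k):
    """Return the positions of the k highest-scoring keys, in order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy window of three tokens with 2-dimensional keys and queries.
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0]]
queries = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
scores = score_window(queries, keys)
kept = select_top_k(scores, k=1)
```

In a real serving stack the scores would be computed per head on the accelerator; the pure-Python loops here only make the data flow explicit.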
The authors further design a Token2Chunk module to expand selected KV pairs into flexible-sized chunks, preserving important semantic content more effectively than rigid chunking strategies. They adapt Kara to PagedAttention and introduce KvLLM, an inference framework built on vLLM, which reduces KV cache memory usage and improves output throughput.
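How Token2Chunk decides chunk boundaries is not described in this summary. As a purely illustrative sketch, the code below grows each selected position outward while neighboring scores stay above a fraction of the seed's score, then merges chunks that touch. The function `expand_to_chunks` and its `ratio` and `max_radius` parameters are hypothetical and should not be read as the authors' algorithm; the sketch only shows how chunk sizes can vary with the data instead of following fixed boundaries.

```python
def expand_to_chunks(scores, selected, ratio=0.5, max_radius=4):
    """Expand selected positions into variable-sized (start, end) chunks.

    A chunk grows left and right from its seed while the neighbor's score
    is at least `ratio` times the seed's score, up to `max_radius` tokens
    per side. Overlapping or adjacent chunks are merged. Both bounds are
    inclusive. The growth rule is an assumption made for illustration.
    """
    chunks = []
    for idx in sorted(selected):
        floor = ratio * scores[idx]
        left = idx
        while left > 0 and idx - left < max_radius and scores[left - 1] >= floor:
            left -= 1
        right = idx
        while (right < len(scores) - 1 and right - idx < max_radius
               and scores[right + 1] >= floor):
            right += 1
        if chunks and left <= chunks[-1][1] + 1:
            # Merge with the previous chunk when they overlap or touch.
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], right))
        else:
            chunks.append((left, right))
    return chunks


# Toy importance scores for an 8-token window, with seeds at positions 2 and 6.
scores = [0.05, 0.3, 0.4, 0.25, 0.02, 0.01, 0.5, 0.03]
chunks = expand_to_chunks(scores, selected=[2, 6])
print(chunks)  # [(1, 3), (6, 6)]
```

Here the seed at position 2 grows into a three-token chunk because its neighbors also score well, while the isolated seed at position 6 stays a single token.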
Extensive experiments demonstrate consistent performance improvements for both Kara and KvLLM, indicating the method’s effectiveness in addressing the trade-offs between memory efficiency and model performance in reasoning LLM serving.