ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction
Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, and Yun Liang
In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 2024
Large Language Models (LLMs) are widely used in today's natural language processing tasks. To support applications like multi-turn chat, document understanding, and content generation, models with long context lengths are growing in importance. However, managing long contexts brings substantial challenges due to the expansion of the key-value cache (KV cache). A longer KV cache requires more memory, limiting the batch size and thus decreasing throughput. Computing attention over a long KV cache also incurs more memory accesses, hurting end-to-end latency. Prior works find that using only the recent and high-impact tokens for attention computation is sufficient, allowing less vital tokens to be evicted to reduce the memory footprint. Nonetheless, we observe a dynamic shift in token importance across decoding steps: tokens initially evicted might regain importance after certain decoding steps. To address this, we propose ArkVale, a page-based KV cache manager that can recognize and recall important tokens evicted before. It asynchronously copies each filled page into external memory (e.g., CPU memory) as a backup and summarizes/compresses it into a much smaller digest by constructing the bounding volume of the keys in the KV page. Before attention computation, it measures all pages' importance based on their digests, recalls the important ones, evicts the unimportant ones, and selects the top-ranked pages for attention computation. Experimental results show that ArkVale performs well on various long-context tasks with negligible accuracy loss under a 2k~4k cache budget, and it can improve decoding latency by up to 2.2× (1.7× on average) and batching throughput by up to 4.6× (3.5× on average). Our code is available at https://github.com/pku-liang/ArkVale.
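The digest-and-rank idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_digest`, `page_score`, `select_pages`), the axis-aligned bounding box as the bounding volume, and the upper-bound scoring rule are all assumptions chosen to show how a per-page summary can bound a query's attention score without touching the full keys.

```python
import numpy as np

def make_digest(page_keys):
    # Bounding volume (here: axis-aligned box) over all key vectors in one
    # KV page: the elementwise min and max across the page's keys.
    return page_keys.min(axis=0), page_keys.max(axis=0)

def page_score(query, digest):
    # Upper bound on q·k for any key k inside the box: per dimension,
    # pick whichever box corner maximizes q_i * k_i, then sum.
    lo, hi = digest
    return np.maximum(query * lo, query * hi).sum()

def select_pages(query, digests, budget):
    # Rank pages by their digest scores and keep the top `budget` pages
    # for attention; the rest are candidates for eviction.
    scores = np.array([page_score(query, d) for d in digests])
    return np.argsort(scores)[::-1][:budget]
```

By construction, `page_score` never underestimates the true maximum dot product between the query and any key in the page, so a page ranked low by its digest genuinely contains no high-impact token for this query; this is what makes it safe to evict such pages and recall them later if their digest score rises.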