ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction
Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, and Yun Liang
In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 2024
Large Language Models (LLMs) are widely used in today's natural language processing tasks. To support applications like multi-turn chat, document understanding, and content generation, models with long context lengths are growing in importance. However, managing long contexts brings substantial challenges due to the expansion of the key-value cache (KV cache). A longer KV cache requires more memory, limiting the batch size and thus decreasing throughput. Computing attention over a long KV cache also incurs more memory accesses, hurting end-to-end latency. Prior works find that using only the recent and high-impact tokens for attention computation is sufficient, allowing less vital tokens to be evicted to reduce the memory footprint. Nonetheless, we observe a dynamic shift in token importance across decoding steps: tokens initially evicted might regain importance after certain decoding steps. To address this, we propose ArkVale, a page-based KV cache manager that can recognize and recall important tokens evicted before. It asynchronously copies each filled page into external memory (e.g., CPU memory) as a backup and summarizes/compresses it into a much smaller digest by constructing the bounding volume of the keys in the KV page. Before attention computation, it measures all pages' importance based on their digests, recalls the important ones, evicts the unimportant ones, and selects the top-ranked pages for attention computation. Experimental results show that ArkVale performs well on various long-context tasks with negligible accuracy loss under a 2k~4k cache budget, and it can improve decoding latency by up to 2.2× (1.7× on average) and batching throughput by up to 4.6× (3.5× on average). Our code is available at https://github.com/pku-liang/ArkVale.
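The digest-and-rank idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_digest`, `page_score`, `select_pages`), the axis-aligned bounding box as the bounding volume, and the upper-bound scoring rule are all assumptions chosen to show how a per-page summary can bound a query's attention score without touching the full keys.

```python
import numpy as np

def make_digest(page_keys):
    # Bounding volume (here: axis-aligned box) over all key vectors in one
    # KV page: the elementwise min and max across the page's keys.
    return page_keys.min(axis=0), page_keys.max(axis=0)

def page_score(query, digest):
    # Upper bound on q·k for any key k inside the box: per dimension,
    # pick whichever box corner maximizes q_i * k_i, then sum.
    lo, hi = digest
    return np.maximum(query * lo, query * hi).sum()

def select_pages(query, digests, budget):
    # Rank pages by their digest scores and keep the top `budget` pages
    # for attention; the rest are candidates for eviction.
    scores = np.array([page_score(query, d) for d in digests])
    return np.argsort(scores)[::-1][:budget]
```

By construction, `page_score` never underestimates the true maximum dot product between the query and any key in the page, so a page ranked low by its digest genuinely contains no high-impact token for this query; this is what makes it safe to evict such pages and recall them later if their digest score rises.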