[Paper Note] IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference
Motivation
Existing works show that not all KVs are created equal: a small number of important KVs with high attention scores contribute most of the Transformer's inference result. H2O (Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models) demonstrates that feeding only these vital tokens to the LLM achieves almost the same model accuracy.
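As a rough illustration of the heavy-hitter idea, here is a minimal sketch of score-based token selection (assuming softmax-normalized attention weights; `select_heavy_hitters` and the `keep_ratio` knob are hypothetical names for illustration, not from the paper):

```python
import numpy as np

def select_heavy_hitters(attn_scores: np.ndarray, keep_ratio: float = 0.2) -> np.ndarray:
    """Return the indices of the tokens with the highest accumulated attention.

    attn_scores: (num_queries, num_tokens) softmax-normalized attention weights
    keep_ratio:  fraction of tokens to keep (hypothetical knob)
    """
    # Accumulate each token's attention mass across all query positions,
    # the same signal H2O uses to identify "heavy-hitter" tokens.
    token_importance = attn_scores.sum(axis=0)        # (num_tokens,)
    k = max(1, int(keep_ratio * token_importance.size))
    # Indices of the k most important tokens, restored to original order.
    top_k = np.argsort(token_importance)[-k:]
    return np.sort(top_k)

# Usage: only the selected tokens' KV entries would need to be fetched.
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(128), size=32)           # fake (32 queries, 128 tokens)
important = select_heavy_hitters(attn, keep_ratio=0.2)
# kv_cache[:, important] would be the only slice loaded from storage.
```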
Since disk access is expensive and cannot easily be hidden, the only way to reduce I/O overhead is to decrease the amount of data transferred, i.e., to load only the important tokens' KVs. This paper leverages this insight and proposes an importance-informed multi-tier KV cache storage system. Three major issues need to be solved:
![IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference](/posts/impress-an-importance-informed-multi-tier-prefix-kv-storage-system-for-large-language-model-inference/images/pasted-image-20250907153049.png)