[Paper Note] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Motivation
Due to limited GPU memory capacity, it is impossible to hold the entire KV cache in GPU HBM, necessitating a cache eviction algorithm that maximizes the cache hit rate. However, existing approaches either require a large cache size or entail heavy eviction costs. This paper aims to tackle this problem.
This paper finds that not all tokens are created equal: a small portion of tokens contributes most of the value in the attention computation. More specifically, attention score matrices are sparse for most models even though the models are trained densely. Further analysis shows that attention score matrices are over 95% sparse, where a token's attention score is deemed negligible if it is smaller than 1% of the highest attention score in the token sequence.
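As a concrete illustration, below is a minimal PyTorch sketch of this sparsity measurement under the 1%-of-max criterion described above; the function name, tensor shapes, and threshold parameter are my own choices for illustration, not taken from the paper's code.

```python
import torch

def attention_score_sparsity(attn: torch.Tensor, rel_threshold: float = 0.01) -> float:
    """Fraction of attention scores deemed negligible, i.e. smaller than
    `rel_threshold` times the largest score for the same query position.

    Assumes `attn` is a post-softmax attention matrix of shape
    (num_queries, num_keys). Hypothetical helper, not the paper's code.
    """
    # Highest attention score per query (row-wise max).
    row_max = attn.max(dim=-1, keepdim=True).values
    # A score is "negligible" if below 1% of its row's maximum.
    negligible = attn < rel_threshold * row_max
    return negligible.float().mean().item()

# Toy example: a near-one-hot attention row is almost entirely sparse.
attn = torch.softmax(torch.tensor([[8.0, 0.1, 0.2, 0.0, 0.1]]), dim=-1)
print(f"sparsity: {attention_score_sparsity(attn):.2%}")  # 80.00% for this row
```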