
The obscurity of themes, the brevity of life

[Paper Note] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Motivation

Due to limited GPU memory capacity, it is impossible to hold the entire KV cache in GPU HBM, necessitating a cache eviction algorithm that maximizes the cache hit rate. However, existing works either require a large cache size or entail heavy eviction cost. This paper aims to tackle this problem.

This paper finds that not all tokens are created equal: a small portion of tokens contributes most of the value in the attention calculation. More specifically, attention score matrices are sparse for most models even though the models are trained densely. Further analysis shows that 95% of attention score matrices are sparse, where a token is deemed sparse if its attention score is smaller than 1% of the highest attention score in the token sequence.
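
To make the sparsity criterion concrete, here is a minimal sketch (my own toy code, not the paper's): it measures how many entries of one causal attention score matrix fall below 1% of the per-query maximum score. The per-row normalization and the toy dimensions are assumptions.

```python
import numpy as np

def sparsity_ratio(attn: np.ndarray, threshold: float = 0.01) -> float:
    """Fraction of causal attention entries below `threshold` * the per-query max score."""
    valid = np.tril(np.ones_like(attn, dtype=bool))   # positions a query is allowed to attend to
    row_max = attn.max(axis=-1, keepdims=True)        # highest score for each query token
    negligible = attn < threshold * row_max           # entries treated as "sparse"
    return negligible[valid].mean()

# Toy example: softmax over random causal logits for a 16-token sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 16))
logits[~np.tril(np.ones((16, 16), dtype=bool))] = -np.inf
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
print(f"sparsity: {sparsity_ratio(attn):.2%}")
```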

[Paper Note] Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

Motivation

If the data deduplication system stores every deduplicated chunk in its index, the index size will soon exceed the memory capacity due to the very large dataset size. Consider, for example, a store that contains 10 TB of unique data and uses 4 KB chunks. Then there are 2.7 × 10⁹ unique chunks. Assuming that every hash entry in the index consumes 40 bytes, we need 100 GB of storage for the full index.

To overcome the memory capacity limitation, the index has to be offloaded to disk. However, with an on-disk index, every chunk lookup requires one disk IO. If an IO takes 4 ms, the offloaded index can only achieve 250 lookups per second.
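
A quick back-of-the-envelope check of these numbers, using the constants quoted above:

```python
TB, GB, KB = 2**40, 2**30, 2**10

unique_data  = 10 * TB     # unique data in the store
chunk_size   = 4 * KB      # average chunk size
entry_size   = 40          # bytes per hash entry in the index
io_latency_s = 0.004       # one random disk IO

chunks      = unique_data / chunk_size   # ~2.7e9 unique chunks
index_bytes = chunks * entry_size        # ~100 GB for the full index
lookups_ps  = 1 / io_latency_s           # 250 lookups per second with one IO per lookup

print(f"{chunks:.1e} chunks, {index_bytes / GB:.0f} GB index, {lookups_ps:.0f} lookups/s")
```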

[Paper Note] IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference

Motivation

Existing works show that not all KVs are created equal: a few important KVs with higher attention scores contribute most of the Transformer inference result. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models demonstrates that feeding only the vital tokens to the LLM can achieve almost the same model accuracy.

Since disk access is expensive and cannot be hidden easily, we can load only the important tokens, reducing IO overhead by decreasing the amount of data to be transferred. This paper leverages this insight and proposes an importance-informed multi-tier KV cache storage system. There are three major issues to be solved:

[Paper Note] AttentionStore: Cost-Effective Attention Reuse Across Multi-Turn Conversations in Large Language Model Serving

Core Ideas

Offloading KV caches to a larger-but-slower medium can significantly improve cache hit rates, thereby improving TTFT. Latency issues caused by the slower medium can be mitigated by the following techniques:

  1. Asynchronous IO can be deployed to hide IO delay, e.g., layer-wise pre-loading and asynchronous saving (see the sketch after this list).
  2. Scheduler-aware caching is feasible in LLM serving because future jobs are knowable in advance.
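
A minimal sketch of the layer-wise pre-loading idea (hypothetical `load_kv` and `compute_layer` callbacks, not AttentionStore's actual interface): while the GPU computes layer i, the KV cache of layer i + 1 is fetched from the slower medium on a background thread.

```python
from concurrent.futures import ThreadPoolExecutor

def prefill_with_preloading(num_layers, load_kv, compute_layer):
    """Overlap loading the KV cache of layer i + 1 with computing layer i.

    load_kv(i)            -> blocking read of layer i's KV cache from the slow medium
    compute_layer(i, kv)  -> runs layer i on the GPU with the loaded KV cache
    """
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_kv, 0)               # kick off the first load eagerly
        for i in range(num_layers):
            kv = pending.result()                     # blocks only if IO is slower than compute
            if i + 1 < num_layers:
                pending = io.submit(load_kv, i + 1)   # prefetch the next layer's KV cache
            compute_layer(i, kv)
```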

Motivation

Existing systems like SGLang leverage prefix sharing to improve prefill cache reuse in multi-turn conversation and long-context scenarios. However, existing prefix sharing methods only store KV caches in VRAM and DRAM, which are highly restricted in capacity and constrain cache reuse. Moreover, existing systems can only retain user sessions (prefix caches) for a short time due to the capacity limitation, which is not suitable for multi-turn conversation scenarios.

[Paper Note] Orca: A Distributed Serving System for Transformer-Based Generative Models

Core Ideas

  1. More fine-grained scheduling: shifting from coarse-grained request-level scheduling to fine-grained iteration-level scheduling (see the sketch after this list).
  2. Selective batching: maximize batching of “batchable” matrix operations, leaving “nonbatchable” attention operations alone.
  3. Control message transfer optimizations
    1. Separated control message channel and data message channel.
    2. Control message pre-sending, which is analogous to preloading.
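
The contrast between request-level and iteration-level scheduling can be sketched as follows (hypothetical `engine` and `select_batch` names, not Orca's real interface): the batch is re-selected every iteration, so finished requests leave immediately and new arrivals do not wait for the current batch to drain.

```python
def serve(engine, max_batch):
    """Iteration-level scheduling loop (simplified sketch)."""
    pool = set()                                       # in-flight requests
    while True:
        pool |= engine.poll_new_requests()             # newcomers join between iterations
        batch = select_batch(pool, max_batch)          # batch is re-chosen every iteration
        if not batch:
            continue
        finished = engine.step(batch)                  # generate ONE token per selected request
        pool -= finished                               # completed requests return right away

def select_batch(pool, max_batch):
    # FCFS among in-flight requests; a real scheduler also respects the KV memory budget.
    return set(sorted(pool, key=lambda r: r.arrival_time)[:max_batch])
```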

Intra-Layer and Inter-Layer Parallelism

Orca leverages both intra-layer parallelism, also known as tensor parallelism, and inter-layer parallelism, commonly known as pipeline parallelism. In this architecture, layers are partitioned into different pipeline stages, and each layer within a stage is further split across multiple GPUs. This parallelism scheme is depicted as follows.
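
To make the two-dimensional layout concrete, here is a small sketch with toy numbers (the layer, stage, and GPU counts are assumptions, not Orca's configuration): layers are split contiguously across pipeline stages, and every layer inside a stage is sharded across that stage's tensor-parallel GPUs.

```python
def device_map(num_layers=12, pp_stages=3, tp_degree=2):
    """Assign each layer to a pipeline stage and the GPUs of that stage's tensor-parallel group."""
    layers_per_stage = num_layers // pp_stages
    mapping = {}
    for layer in range(num_layers):
        stage = layer // layers_per_stage
        gpus = [stage * tp_degree + rank for rank in range(tp_degree)]
        mapping[layer] = {"stage": stage, "gpus": gpus}
    return mapping

# 12 layers, 3 pipeline stages, tensor-parallel degree 2 => 6 GPUs in total.
for layer, placement in device_map().items():
    print(layer, placement)
```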

[Paper Note] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Core Ideas

  1. Optimizing KV cache network transfer size and cost is valuable.
  2. Traditional video compression techniques can be applied.

Abstract

In the long-context scenario, the context is too long for its KV cache to fit into GPU memory, necessitating a KV cache storage system that stores KV caches remotely and transmits them over the network. However, the cost of fetching a KV cache over the network can be non-trivial: the size of a KV cache grows with both model size and context length and can easily reach tens of GB. In particular, although most GPUs are attached to high-speed RDMA networks, with the prevalence of long contexts we may end up building KV cache storage systems where the GPUs are fast but the network is comparatively slow.
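
For a sense of scale, a back-of-the-envelope KV cache size calculation; the model configuration below (a 70B-class dense model without grouped-query attention, fp16, 16K-token context) is an assumption for illustration, not a number taken from the paper.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, dtype_bytes=2):
    """KV cache size for one sequence: 2 (K and V) * layers * hidden * tokens * bytes per element."""
    return 2 * num_layers * hidden_size * seq_len * dtype_bytes

size = kv_cache_bytes(num_layers=80, hidden_size=8192, seq_len=16_384)
print(f"{size / 2**30:.0f} GiB")   # ~40 GiB for a single long-context request
```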
