
The obscurity of subjects, the brevity of life

[Paper Note] IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference

Motivation

Existing works show that not all KVs are created equal: a few important KVs with higher attention scores contribute the most to the Transformer's inference results. H2O (Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models) demonstrates that feeding only these vital tokens to the LLM achieves almost the same model accuracy.
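For intuition, here is a minimal sketch of the heavy-hitter idea (the tensor layouts and the `budget` parameter are assumptions for illustration, not H2O's or IMPRESS's exact algorithm): score each cached token by the attention mass it receives and keep only the top-scoring KV entries.

```python
import torch

def select_heavy_hitters(attn_weights: torch.Tensor, k_cache: torch.Tensor,
                         v_cache: torch.Tensor, budget: int):
    """Keep only the KV entries of the most-attended tokens.

    attn_weights: [num_heads, q_len, kv_len] attention probabilities
    k_cache/v_cache: [kv_len, num_heads, head_dim]
    budget: number of token positions to keep (hypothetical knob)
    """
    # Accumulate the attention mass each cached token receives across heads and queries.
    scores = attn_weights.sum(dim=(0, 1))                       # [kv_len]
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices.sort().values
    # Only these rows would need to be fetched from the slower storage tier.
    return k_cache[keep], v_cache[keep], keep
```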

Since disk access is expensive and cannot be hidden easily, we can load only the important tokens' KVs, reducing I/O overhead by decreasing the amount of data to be transferred. This paper leverages this insight and proposes an importance-informed multi-tier KV cache storage system. Three major issues need to be solved:

[Paper Note] AttentionStore: Cost-Effective Attention Reuse Across Multi-Turn Conversations in Large Language Model Serving

Core Ideas

Offloading KV caches to a larger but slower medium can significantly improve cache hit rates, thereby improving TTFT. The latency introduced by the slower medium can be reduced with the following techniques:

  1. Asynchronous I/O can be used to hide I/O latency, e.g., layer-wise pre-loading and asynchronous saving (see the sketch after this list).
  2. Scheduler-aware caching is feasible in LLM serving because future jobs are known in advance.
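To illustrate the first technique, here is a minimal sketch of layer-wise pre-loading (the `load_kv` and `compute_layer` callbacks are hypothetical, not AttentionStore's actual interface): while layer i is being computed, layer i+1's KV is fetched from the slow tier in the background.

```python
from concurrent.futures import ThreadPoolExecutor

def prefill_with_layer_prefetch(num_layers, load_kv, compute_layer):
    """Overlap loading of layer i+1's KV with the computation of layer i.

    load_kv(i) fetches layer i's cached KV from the slow tier;
    compute_layer(i, kv) runs layer i's attention with that KV.
    """
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_kv, 0)              # kick off layer 0's load
        for i in range(num_layers):
            kv = pending.result()                    # blocks only if I/O lags behind compute
            if i + 1 < num_layers:
                pending = io.submit(load_kv, i + 1)  # prefetch the next layer's KV
            compute_layer(i, kv)
```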

Motivation

Existing systems like SGLang leverage prefix sharing to improve prefill cache reuse in multi-turn conversation and long-context scenarios. However, existing prefix-sharing methods store KV caches only in VRAM and DRAM, whose limited capacity constrains cache reuse. Moreover, due to this capacity limitation, existing systems can retain user sessions (prefix caches) only for a short time, which is unsuitable for multi-turn conversation scenarios.

[Paper Note] Orca: A Distributed Serving System for Transformer-Based Generative Models

Core Ideas

  1. More fine-grained scheduling: shifting from coarse-grained request-level scheduling to fine-grained iteration-level scheduling.
  2. Selective batching: maximize batching of “batchable” matrix operations, leaving “non-batchable” attention operations alone (see the sketch after this list).
  3. Control message transfer optimizations
    1. Separated control message channel and data message channel.
    2. Control message pre-sending, which is analogous to pre-loading.
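To make selective batching concrete, here is a minimal PyTorch-style sketch (the `qkv_proj`, `out_proj`, and `attention` callables are hypothetical stand-ins, not Orca's actual implementation): the linear projections run once over all requests' tokens concatenated together, while attention runs separately per request because each request has its own length and KV state.

```python
import torch

def selective_batching_layer(requests, qkv_proj, out_proj, attention):
    """One Transformer layer with selective batching (sketch).

    requests: list of per-request hidden states, each of shape [seq_len_i, hidden]
    qkv_proj/out_proj: shared linear layers; attention(q, k, v) is a
    per-request attention kernel.
    """
    lengths = [h.shape[0] for h in requests]
    flat = torch.cat(requests, dim=0)            # "batchable" ops see all tokens at once
    q, k, v = qkv_proj(flat).chunk(3, dim=-1)
    outs = []
    # "non-batchable" attention runs per request
    for qi, ki, vi in zip(q.split(lengths), k.split(lengths), v.split(lengths)):
        outs.append(attention(qi, ki, vi))
    return out_proj(torch.cat(outs, dim=0))
```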

Intra-Layer and Inter-Layer Parallelism

Orca leverages both intra-layer parallelism, also known as tensor parallelism, and inter-layer parallelism, commonly known as pipeline parallelism. In this architecture, layers are partitioned into different pipeline stages, and each layer within a stage is further split across multiple GPUs. This parallelism scheme is depicted as follows.
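A minimal sketch of this layer-to-GPU mapping (the layer counts and the stage-major GPU numbering below are assumptions for illustration, not Orca's actual assignment):

```python
def build_device_map(num_layers, pp_stages, tp_degree):
    """Map each layer to (pipeline stage, list of tensor-parallel GPU ids).

    Hypothetical layout: stage s owns GPUs [s * tp_degree, (s + 1) * tp_degree).
    """
    layers_per_stage = (num_layers + pp_stages - 1) // pp_stages
    device_map = {}
    for layer in range(num_layers):
        stage = layer // layers_per_stage                          # inter-layer (pipeline) split
        gpus = [stage * tp_degree + r for r in range(tp_degree)]   # intra-layer (tensor) split
        device_map[layer] = (stage, gpus)
    return device_map

# e.g. 24 layers, 3 pipeline stages, 2-way tensor parallelism -> 6 GPUs
print(build_device_map(24, 3, 2)[0])   # (0, [0, 1])
print(build_device_map(24, 3, 2)[23])  # (2, [4, 5])
```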

[Paper Note] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Core Ideas

  1. Reducing the size and cost of transferring KV caches over the network is valuable.
  2. Traditional video compression techniques can be applied (see the sketch after this list).
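For intuition on the second idea, here is a simplified sketch of the delta-then-quantize pattern borrowed from video coding (an illustration only, not CacheGen's actual codec; the tensor layout is assumed): adjacent tokens' K/V vectors tend to be similar, so their deltas are small and quantize well.

```python
import torch

def delta_quantize(kv: torch.Tensor, bits: int = 4):
    """Delta-encode a KV slice along the token axis, then uniformly quantize the deltas.

    kv: [num_tokens, hidden] slice of a KV cache (illustrative layout).
    Returns the quantized deltas, the scale, and the first row needed to decode.
    """
    deltas = torch.diff(kv, dim=0, prepend=kv[:1])                 # token-to-token differences
    scale = (deltas.abs().max() / (2 ** (bits - 1) - 1)).clamp(min=1e-8)
    q = torch.round(deltas / scale).to(torch.int8)                 # small deltas -> few symbols
    return q, scale, kv[0]

def delta_dequantize(q, scale, first_row):
    # Reconstruct by cumulative sum of the dequantized deltas (the first delta is ~0).
    return first_row + torch.cumsum(q.float() * scale, dim=0)
```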

Abstract

In long-context scenarios, the context is too long to fit into GPU memory, necessitating a KV cache storage system that stores KV caches remotely and transmits them over the network. However, the cost of fetching a KV cache over the network can be non-trivial: the size of a KV cache grows with both model size and context length and can easily reach tens of GB. In particular, although most GPUs are attached to high-speed RDMA networks, with the prevalence of long contexts we may end up building KV cache storage systems that pair high-speed GPUs with relatively low-speed networks.
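As a back-of-the-envelope check on that size claim, the KV cache footprint is roughly 2 (K and V) × layers × KV heads × head dimension × tokens × bytes per element; the configuration below is a LLaMA-7B-like assumption, not a number taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Rough KV cache size: K and V, per layer, per head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# e.g. a LLaMA-7B-like model (32 layers, 32 KV heads, head_dim 128, fp16)
# with a 32K-token context:
size = kv_cache_bytes(32, 32, 128, 32 * 1024)
print(f"{size / 2**30:.1f} GiB")   # -> 16.0 GiB
```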

[Paper Note] SGLang: Efficient Execution of Structured Language Model Programs

SGLang's goal is to provide a complete and efficient framework for Language Model (LM) programs, covering both a frontend programming language and a backend runtime, rather than focusing solely on LLM inference; the co-design of frontend and backend gives SGLang more room for optimization.

Frontend

An LM program is a program that interacts with a language model programmatically. Due to the non-deterministic nature of language models, an LM program has to do a lot of work, such as complex string processing, in order to interact with the LM.
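For example, a multi-turn program in SGLang's frontend DSL looks roughly like the following (the API names follow the paper's examples and may differ across versions; the endpoint URL and questions are placeholders):

```python
import sglang as sgl

@sgl.function
def multi_turn_qa(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer1", max_tokens=128))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer2", max_tokens=128))

# Point the frontend at a running SGLang server (placeholder address).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = multi_turn_qa.run(question1="What is a KV cache?",
                          question2="Why does prefix sharing reduce prefill cost?")
print(state["answer2"])
```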

KV Cache on SSD: Taking Twitter's Fatcache as an Example.

High-performance in-memory key-value caches are indispensable components of large-scale web architectures. However, the limited capacity and high power consumption of memory motivate researchers and developers to build key-value caches on SSD, where the SSD is treated as an extension of the limited memory.

In this post, I will talk about the general ideas behind KV caching on SSD, using Twitter’s fatcache as an example, and then discuss the issues with this traditional approach.

Background

Twitter’s fatcache, and many other modern memory allocators such as Google’s tcmalloc and the Linux slab allocator, are based on the idea of the slab allocator. You can find comprehensive details in the paper titled The Slab Allocator: An Object-Caching Kernel Memory Allocator. I will not delve into too many details here; overall, the slab allocator is a kind of segregated-list allocator. A slab is a contiguous memory area that serves as the basic management unit of the allocator. Each slab is further divided into slots of the same size, which are used to store objects and other metadata. Besides, a slab uses a freelist to keep track of the allocation status of its slots, which is the key to allocation and deallocation. All slabs with the same slot (object) size are grouped together, and these groups are organized into an array sorted by slot size. By doing so, the allocator can use binary search to allocate objects from the best-fitting slab.
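Here is a minimal Python sketch of these ideas (a toy model, not fatcache's actual C implementation): slab classes keyed by slot size, a freelist per class, and binary search over the sorted sizes to find the best fit.

```python
import bisect

class SlabClass:
    """All slabs whose slots have the same size; a freelist tracks free slots."""
    def __init__(self, slot_size, slots_per_slab):
        self.slot_size = slot_size
        self.slots_per_slab = slots_per_slab
        self.freelist = []          # (slab_id, slot_index) of free slots
        self.num_slabs = 0

    def alloc(self):
        if not self.freelist:       # grow by carving a new slab into fixed-size slots
            self.freelist = [(self.num_slabs, i) for i in range(self.slots_per_slab)]
            self.num_slabs += 1
        return self.freelist.pop()

    def free(self, slot):
        self.freelist.append(slot)

class SlabAllocator:
    """Slab classes sorted by slot size; binary search picks the best fit."""
    def __init__(self, slot_sizes, slots_per_slab=64):
        self.sizes = sorted(slot_sizes)
        self.classes = [SlabClass(s, slots_per_slab) for s in self.sizes]

    def alloc(self, obj_size):
        i = bisect.bisect_left(self.sizes, obj_size)   # smallest slot size that fits
        if i == len(self.sizes):
            raise ValueError("object larger than the largest slot size")
        return self.classes[i], self.classes[i].alloc()

# usage: allocate a 100-byte object from classes of 64/128/256/512-byte slots
allocator = SlabAllocator([64, 128, 256, 512])
slab_class, slot = allocator.alloc(100)
print(slab_class.slot_size, slot)   # -> 128 (0, 63)
```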
