[Paper Note] Strata Hierarchical Context Caching for Long Context Language Model Serving

2025-09-13 About 500 words 3 minutes

Contents

Existing works assume the GPU-CPU bandwidth is not a bottleneck. However, this paper finds the GPU-CPU bandwidth is extremely under-utilized due to the data fragmentation introduced by PagedAttention and the existing scheduler doesn’t treat GPU-CPU bandwidth as the first class resource, leading to severe data loading latency and thereby stalling the CPU.

Experiments show 70% time spends in KV cache loading from the lower storage hierarchy, demonstrating GPU-CPU bandwidth is under-utilized. Even with an I/O mechanism achieving 75% of theoretical PCIe bandwidth, stalls still account for up to 24% of prefill execution time, calling for more sophisticated scheduling strategies.

The solution is as follows:

Improve IO throughput
- GPU-assisted IO increases concurrency: Thousands of GPU threads load data concurrently. See GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture.
- Page-first Layout increases unit data size: Manage the token’s pages of layers into a contiguous layout in the low-hierarchical storage, such as GPU memory and SSD.
IO-aware scheduling that treats GPU-CPU bandwidth as the first class resources.
- Identify and solve the delayed hit problem: Avoid scheduling multiple requests which encounter the same cache miss into a batch.
- Balanced batch: Mix compute-bound requests with IO-bound requests (e.g., requests require loading data from the CPU memory) to maximize IO overlapping.
- Hide loading stall with bubble filling: If an IO-bound request’s latency cannot be hidden, schedule a compute-intensive requests to fill the pipeline bubble.

Delayed Hit

Suppose A0 and A1 sharing the prefix are scheduled together in a batch and suffer from cache misses, the serving system may recompute A0 and A1 redundantly. The solution is to detect delayed hits and re-arrange requests sharing the same context to different batches.

More specifically, built atop SGLang, Strata proposes a new radix tree structure named HiRadixtree, where each node represents a prefix. Even if the cache required by the request is missed, HiRadixtree constructs a transit node denoting the cache that is being computed, such that Strata can rearrange requests.

Evaluation

The existing designs cannot fully utilize the theoretical GPU-CPU bandwidth.
With the aforementioned techniques, H200 machine can achieve high GPU-CPU bandwidth than GH200 Grace Hopper with higher theoretical bandwidth.
Most performance improvement is from GPU-assisted IO and solved delayed hit problem.

Insights

Long context is the future direction and recompute is not feasible in long context situations due to limited GPU HBM capacity, necessitating multi-tier KV cache offloading.
KV cache loading from SSD should be interrupted if the loading doesn’t finish before the request is scheduled.
IO overlaying makes sense because computing and loading presents different IO patterns, one is HBM bandwidth bound and the other one is GPU-CPU bandwidth bound.

Thoughts

This paper only solves the GPU-CPU bottleneck and doesn’t solve the SSD bottleneck.
To some senses, this work centers the multi-tier KV cache storage around the CPU memory.