
The obscurity of themes, the brevity of life

[Paper Note] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Motivation

Due to limited GPU memory capacity, it is impossible to hold the entire KV cache in GPU HBM, necessitating a cache eviction algorithm that maximizes the cache hit rate. However, existing works either require a large cache size or entail heavy eviction cost. This paper aims to tackle this problem.

This paper finds that not all tokens are created equal: a small portion of tokens contributes most of the value in the attention calculation. More specifically, attention score matrices are sparse for most models even though the models are trained densely. Further analysis shows that 95% of attention score matrices are sparse, where a token is deemed sparse if its attention score is smaller than 1% of the highest attention score in the token sequence.
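
To make the sparsity criterion concrete, here is a minimal sketch (my own toy code, not the paper's): it measures how many entries of one causal attention score matrix fall below 1% of the per-query maximum score. The per-row normalization and the toy dimensions are assumptions.

```python
import numpy as np

def sparsity_ratio(attn: np.ndarray, threshold: float = 0.01) -> float:
    """Fraction of causal attention entries below `threshold` * the per-query max score."""
    valid = np.tril(np.ones_like(attn, dtype=bool))   # positions a query is allowed to attend to
    row_max = attn.max(axis=-1, keepdims=True)        # highest score for each query token
    negligible = attn < threshold * row_max           # entries treated as "sparse"
    return negligible[valid].mean()

# Toy example: softmax over random causal logits for a 16-token sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 16))
logits[~np.tril(np.ones((16, 16), dtype=bool))] = -np.inf
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
print(f"sparsity: {sparsity_ratio(attn):.2%}")
```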

[Paper Note] Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

Motivation

If the data deduplication system stores every deduplicated chunk in its index, the index size will soon exceed the memory capacity due to the very large dataset size. Consider, for example, a store that contains 10 TB of unique data and uses 4 KB chunks. Then there are 2.7 × 10⁹ unique chunks. Assuming that every hash entry in the index consumes 40 bytes, we need 100 GB of storage for the full index.

To overcome the memory capacity limitation, the index has to be offloaded to disk. However, with an on-disk index, every chunk lookup requires one disk IO. If an IO takes 4 ms, the offloaded index can only achieve 250 lookups per second.
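
A quick back-of-the-envelope check of these numbers, using the constants quoted above:

```python
TB, GB, KB = 2**40, 2**30, 2**10

unique_data  = 10 * TB     # unique data in the store
chunk_size   = 4 * KB      # average chunk size
entry_size   = 40          # bytes per hash entry in the index
io_latency_s = 0.004       # one random disk IO

chunks      = unique_data / chunk_size   # ~2.7e9 unique chunks
index_bytes = chunks * entry_size        # ~100 GB for the full index
lookups_ps  = 1 / io_latency_s           # 250 lookups per second with one IO per lookup

print(f"{chunks:.1e} chunks, {index_bytes / GB:.0f} GB index, {lookups_ps:.0f} lookups/s")
```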

[Paper Note] IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference

Motivation

Existing works show that not all KVs are created equal: a few important KVs with higher attention scores contribute most of the Transformer inference result. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models demonstrates that feeding only the vital tokens to the LLM can achieve almost the same model accuracy.

Since disk access is expensive and cannot be hidden easily, we can load only the important tokens, reducing IO overhead by decreasing the amount of data to be transferred. This paper leverages this insight and proposes an importance-informed multi-tier KV cache storage system. There are three major issues to be solved:

[Paper Note] AttentionStore: Cost-Effective Attention Reuse Across Multi-Turn Conversations in Large Language Model Serving

Core Ideas

Offloading KV caches to a larger-but-slower medium can significantly improve cache hit rates, thereby improving TTFT. Latency issues caused by the slower medium can be mitigated by the following techniques:

  1. Asynchronous IO can be deployed to hide IO delay, e.g., layer-wise pre-loading and asynchronous saving (see the sketch after this list).
  2. Scheduler-aware caching is feasible in LLM serving because future jobs are knowable in advance.
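
A minimal sketch of the layer-wise pre-loading idea (hypothetical `load_kv` and `compute_layer` callbacks, not AttentionStore's actual interface): while the GPU computes layer i, the KV cache of layer i + 1 is fetched from the slower medium on a background thread.

```python
from concurrent.futures import ThreadPoolExecutor

def prefill_with_preloading(num_layers, load_kv, compute_layer):
    """Overlap loading the KV cache of layer i + 1 with computing layer i.

    load_kv(i)            -> blocking read of layer i's KV cache from the slow medium
    compute_layer(i, kv)  -> runs layer i on the GPU with the loaded KV cache
    """
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_kv, 0)               # kick off the first load eagerly
        for i in range(num_layers):
            kv = pending.result()                     # blocks only if IO is slower than compute
            if i + 1 < num_layers:
                pending = io.submit(load_kv, i + 1)   # prefetch the next layer's KV cache
            compute_layer(i, kv)
```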

Motivation

Existing systems like SGLang leverage prefix sharing to improve prefill cache reuse in multi-turn conversation and long-context scenarios. However, existing prefix sharing methods only store KV caches in VRAM and DRAM, which are highly restricted in capacity and constrain cache reuse. Moreover, existing systems can only retain user sessions (prefix caches) for a short time due to the capacity limitation, which is not suitable for multi-turn conversation scenarios.

[Paper Note] Orca: A Distributed Serving System for Transformer-Based Generative Models

Core Ideas

  1. More fine-grained scheduling: shifting from coarse-grained request-level scheduling to fine-grained iteration-level scheduling (see the sketch after this list).
  2. Selective batching: maximize batching of “batchable” matrix operations, leaving “nonbatchable” attention operations alone.
  3. Control message transfer optimizations
    1. Separated control message channel and data message channel.
    2. Control message pre-sending, which is analogous to preloading.
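
The contrast between request-level and iteration-level scheduling can be sketched as follows (hypothetical `engine` and `select_batch` names, not Orca's real interface): the batch is re-selected every iteration, so finished requests leave immediately and new arrivals do not wait for the current batch to drain.

```python
def serve(engine, max_batch):
    """Iteration-level scheduling loop (simplified sketch)."""
    pool = set()                                       # in-flight requests
    while True:
        pool |= engine.poll_new_requests()             # newcomers join between iterations
        batch = select_batch(pool, max_batch)          # batch is re-chosen every iteration
        if not batch:
            continue
        finished = engine.step(batch)                  # generate ONE token per selected request
        pool -= finished                               # completed requests return right away

def select_batch(pool, max_batch):
    # FCFS among in-flight requests; a real scheduler also respects the KV memory budget.
    return set(sorted(pool, key=lambda r: r.arrival_time)[:max_batch])
```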

Intra-Layer and Inter-Layer Parallelism

Orca leverages both intra-layer parallelism, also known as tensor parallelism, and inter-layer parallelism, commonly known as pipeline parallelism. In this architecture, layers are partitioned into different pipeline stages, and each layer within a stage is further split across multiple GPUs. This parallelism scheme is depicted as follows.
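
To make the two-dimensional layout concrete, here is a small sketch with toy numbers (the layer, stage, and GPU counts are assumptions, not Orca's configuration): layers are split contiguously across pipeline stages, and every layer inside a stage is sharded across that stage's tensor-parallel GPUs.

```python
def device_map(num_layers=12, pp_stages=3, tp_degree=2):
    """Assign each layer to a pipeline stage and the GPUs of that stage's tensor-parallel group."""
    layers_per_stage = num_layers // pp_stages
    mapping = {}
    for layer in range(num_layers):
        stage = layer // layers_per_stage
        gpus = [stage * tp_degree + rank for rank in range(tp_degree)]
        mapping[layer] = {"stage": stage, "gpus": gpus}
    return mapping

# 12 layers, 3 pipeline stages, tensor-parallel degree 2 => 6 GPUs in total.
for layer, placement in device_map().items():
    print(layer, placement)
```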

[Paper Note] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Core Ideas

  1. Optimizing KV cache network transfer size and cost is valuable.
  2. Traditional video compression techniques can be applied.

Abstract

In the long-context scenario, the context is too long for its KV cache to fit into GPU memory, necessitating a KV cache storage system that stores KV caches remotely and transmits them over the network. However, the cost of fetching a KV cache over the network can be non-trivial: the size of a KV cache grows with both model size and context length and can easily reach tens of GB. In particular, although most GPUs are attached to high-speed RDMA networks, with the prevalence of long contexts we may end up building KV cache storage systems where the GPUs are fast but the network is comparatively slow.
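
For a sense of scale, a back-of-the-envelope KV cache size calculation; the model configuration below (a 70B-class dense model without grouped-query attention, fp16, 16K-token context) is an assumption for illustration, not a number taken from the paper.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, dtype_bytes=2):
    """KV cache size for one sequence: 2 (K and V) * layers * hidden * tokens * bytes per element."""
    return 2 * num_layers * hidden_size * seq_len * dtype_bytes

size = kv_cache_bytes(num_layers=80, hidden_size=8192, seq_len=16_384)
print(f"{size / 2**30:.0f} GiB")   # ~40 GiB for a single long-context request
```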
