
The obscurity of themes, the brevity of life

[Paper Note] Supporting Our AI Overlords: Redesigning Data Systems to Be Agent-First

Motivation

As modern LLMs become both cheap and capable, AI agents will increasingly act on users' behalf, querying, analyzing, and processing data. In other words, the digital world will become AI-native, and present-day database systems should follow this trend.

This paper analyzes the query patterns of AI agents and proposes an agent-first architecture to shed light on what an AI-native database could look like.

Observations

This paper finds that the workflow of AI agents generally consists of two sequential stages: metadata exploration and solution formulation. In the first, exploration stage, agents issue coarse-grained, speculative queries called probes to discover patterns in the data; they then formulate a solution. Analysis shows that exploration is necessary and improves task success rates tremendously.

[Paper Note] Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains

Motivation

Hyperledger Fabric aims to implement a permissioned blockchain featuring:

  • General Programming Language
  • High Throughput
  • No Cryptocurrency
  • Confidentiality

General programming languages and high throughput are not supported by existing public blockchains because of their order-execute model. Essentially, the order-execute model is a form of state machine replication (e.g., Raft): all nodes start from the same initial state and apply the transaction log in a global total order, ultimately reaching consensus on the same state. This model requires deterministic execution; otherwise, different nodes executing the same transaction would get different results and end up with inconsistent states. To rule out non-deterministic behavior, existing blockchains restrict smart contracts to domain-specific languages.
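To make the determinism requirement concrete, here is a minimal sketch (not Fabric code; all names are illustrative) of order-execute replication as deterministic log replay, plus the kind of non-deterministic operation that would make replicas diverge:

```python
# Minimal sketch (not Fabric code): order-execute replication as deterministic
# state machine replay. All names here are illustrative.
import random


def apply(state: dict, txn: tuple) -> dict:
    """Apply one transaction deterministically: same state + same txn -> same result."""
    op, key, value = txn
    new_state = dict(state)
    if op == "put":
        new_state[key] = value
    elif op == "add":
        new_state[key] = new_state.get(key, 0) + value
    return new_state


def replay(log: list) -> dict:
    """Every replica starts from the same initial state and replays the same ordered log."""
    state = {}
    for txn in log:
        state = apply(state, txn)
    return state


log = [("put", "a", 1), ("add", "a", 2), ("put", "b", 7)]
assert replay(log) == replay(log)  # deterministic: all replicas converge to {'a': 3, 'b': 7}


def apply_nondeterministic(state: dict, txn: tuple) -> dict:
    """A contract like this (local randomness, wall-clock time, etc.) breaks convergence,
    which is why order-execute chains restrict the contract language."""
    new_state = dict(state)
    new_state[txn[1]] = random.randint(0, 100)  # replicas would diverge here
    return new_state
```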

[Paper Note] Fast State Restoration in LLM Serving With HCache

Motivation

In stateful LLM serving workloads (e.g., RAG and multi-turn conversation), existing systems use either token recomputation or KV cache offloading to restore LLM contexts, incurring substantial compute or I/O overheads. In the transformer architecture, the KV cache is derived from the activations (a.k.a. hidden states), so the KV cache can be restored from the activations.

The activation input of a layer is the output of its preceding layer. A transformer layer receives this activation and projects it into Q, K, and V vectors for attention computation. In other words, the KV cache can be recomputed from the activation. Quantitatively, restoring from the activation incurs lower compute and I/O overheads: activation recomputation costs about 6x less than token recomputation, and its I/O volume is about half that of transferring the offloaded KV cache.
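A minimal sketch of this idea for a single layer, with assumed shapes and weight names (not HCache's actual code): restoring K and V from the saved layer-input activation is just two matrix multiplications.

```python
# Minimal sketch (assumed shapes/names, not HCache's code): restoring a layer's
# KV cache from its saved input activation (hidden states).
import torch

hidden_size, seq_len = 4096, 1024
W_k = torch.randn(hidden_size, hidden_size)  # key projection weight (illustrative)
W_v = torch.randn(hidden_size, hidden_size)  # value projection weight (illustrative)

# The activation that would be persisted: the input of this transformer layer,
# i.e. the output of the previous layer.
activation = torch.randn(seq_len, hidden_size)

# Restoring K and V is two GEMMs over the saved activation, far cheaper than
# re-running the whole prefill. The activation is also roughly half the size of
# the K plus V tensors it regenerates, hence the ~0.5x I/O volume.
K = activation @ W_k
V = activation @ W_v
```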

[Paper Note] Strata: Hierarchical Context Caching for Long Context Language Model Serving

Existing works assume that GPU-CPU bandwidth is not a bottleneck. However, this paper finds that GPU-CPU bandwidth is severely under-utilized, both because of the data fragmentation introduced by PagedAttention and because existing schedulers do not treat GPU-CPU bandwidth as a first-class resource. This leads to severe data-loading latency and thereby stalls GPU computation.

Experiments show that 70% of the time is spent loading KV cache from the lower storage hierarchy, demonstrating that GPU-CPU bandwidth is under-utilized. Even with an I/O mechanism achieving 75% of the theoretical PCIe bandwidth, stalls still account for up to 24% of prefill execution time, calling for more sophisticated scheduling strategies.

The solution is as follows:

  • Improve IO throughput
    • GPU-assisted IO increases concurrency: Thousands of GPU threads load data concurrently. See GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture.
  • Page-first layout increases unit data size: Store all layers of a token page contiguously in the lower tiers of the storage hierarchy, such as CPU memory and SSD.
  • IO-aware scheduling that treats GPU-CPU bandwidth as a first-class resource (a rough sketch follows this list).
    • Identify and solve the delayed-hit problem: Avoid scheduling multiple requests that encounter the same cache miss into one batch.
    • Balanced batch: Mix compute-bound requests with IO-bound requests (e.g., requests that require loading data from CPU memory) to maximize IO overlapping.
    • Hide loading stalls with bubble filling: If an IO-bound request's latency cannot be hidden, schedule a compute-intensive request to fill the pipeline bubble.
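As a rough illustration of the IO-aware scheduling ideas above, the sketch below forms batches that mix IO-bound requests with compute-bound ones so KV loading overlaps with computation. The request fields and the per-batch I/O budget are hypothetical, not Strata's actual implementation.

```python
# Rough sketch of IO-aware batch formation (hypothetical fields/thresholds, not Strata's code):
# admit IO-bound requests only within a bandwidth budget, then fill the batch with
# compute-bound requests so loading overlaps with computation.
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    cached_on_gpu: bool   # True -> compute-bound; False -> needs KV loading (IO-bound)
    load_bytes: int = 0   # bytes to fetch from CPU memory / SSD if not cached on GPU


def form_batch(pending: list[Request], max_batch: int, io_budget_bytes: int) -> list[Request]:
    io_bound = [r for r in pending if not r.cached_on_gpu]
    compute_bound = [r for r in pending if r.cached_on_gpu]

    batch, io_used = [], 0
    # Admit IO-bound requests only up to the per-batch bandwidth budget,
    # so their loading time stays hidden behind compute.
    for r in io_bound:
        if len(batch) < max_batch and io_used + r.load_bytes <= io_budget_bytes:
            batch.append(r)
            io_used += r.load_bytes
    # Fill the rest of the batch with compute-bound requests ("bubble filling").
    for r in compute_bound:
        if len(batch) >= max_batch:
            break
        batch.append(r)
    return batch
```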

Delayed Hit

Suppose A0 and A1, which share a prefix, are scheduled into the same batch and both miss the cache; the serving system may then recompute the shared prefix redundantly. The solution is to detect delayed hits and place requests sharing the same context into different batches.
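A minimal sketch of this idea, with hypothetical names (not Strata's code): requests whose prefix is already being loaded or computed by an in-flight request, or by another request in the current batch, are deferred so they later hit the cache instead of recomputing it.

```python
# Minimal sketch (hypothetical, not Strata's code): defer requests whose shared prefix
# is already being fetched/computed, so they hit the cache later instead of
# redundantly recomputing the same prefix in the same batch.
def split_delayed_hits(pending, in_flight_prefixes):
    """pending: list of (req_id, prefix_hash); in_flight_prefixes: set of prefix hashes
    whose KV cache is currently being loaded or computed."""
    schedulable, deferred = [], []
    seen_this_batch = set()
    for req_id, prefix in pending:
        if prefix in in_flight_prefixes or prefix in seen_this_batch:
            deferred.append((req_id, prefix))      # would be a delayed hit: wait for the cache
        else:
            schedulable.append((req_id, prefix))   # first request for this prefix proceeds
            seen_this_batch.add(prefix)
    return schedulable, deferred


# Example: A0 and A1 share a prefix; only A0 is scheduled now, A1 reuses A0's KV cache later.
ready, waiting = split_delayed_hits([("A0", "p1"), ("A1", "p1")], in_flight_prefixes=set())
assert [r[0] for r in ready] == ["A0"] and [r[0] for r in waiting] == ["A1"]
```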

[Paper Note] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Motivation

Due to limited GPU memory capacity, it is impossible to hold the entire KV cache in GPU HBM, necessitating a cache eviction algorithm that maximizes the cache hit rate. However, existing works either need a large cache size or entail heavy eviction cost. This paper aims to tackle this problem.

This paper finds that not all tokens are created equal: a small portion of tokens contribute most of the value in attention computation. More specifically, attention score matrices are sparse for most models, even though the models are trained densely. Further analysis shows that 95% of attention score matrices are sparse, where a token is deemed sparse if its attention score is smaller than 1% of the highest attention score in the sequence.
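A simplified sketch of the heavy-hitter idea follows: score each cached token by its accumulated attention score and, when the cache budget is exceeded, keep the highest-scoring tokens plus the most recent ones. This ignores per-head and per-layer bookkeeping and is not the paper's exact algorithm.

```python
# Simplified sketch of heavy-hitter KV eviction (ignores per-head/per-layer bookkeeping;
# not the paper's exact algorithm).
import numpy as np


def evict_kv(acc_scores: np.ndarray, budget: int, recent_window: int) -> np.ndarray:
    """acc_scores[i] = attention score token i has accumulated over decoding steps.
    Return indices of tokens to keep: the most recent ones plus the heavy hitters."""
    n = len(acc_scores)
    recent = set(range(max(0, n - recent_window), n))          # always keep local context
    candidates = [i for i in range(n) if i not in recent]
    n_heavy = max(0, budget - len(recent))
    heavy = sorted(candidates, key=lambda i: acc_scores[i], reverse=True)[:n_heavy]
    return np.array(sorted(recent | set(heavy)))


# During decoding, acc_scores is updated by adding each new query's attention
# distribution over cached tokens; keep_idx selects which KV entries survive.
acc_scores = np.array([5.2, 0.1, 0.05, 3.9, 0.2, 0.1, 0.3, 0.4])
keep_idx = evict_kv(acc_scores, budget=5, recent_window=2)  # heavy hitters + last 2 tokens
```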

[Paper Note] Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

Motivation

If a data deduplication system keeps an index entry for every deduplicated chunk, the index size will soon exceed memory capacity for large datasets. Consider, for example, a store that contains 10 TB of unique data and uses 4 KB chunks. Then there are 2.7 × 10⁹ unique chunks. Assuming that every hash entry in the index consumes 40 bytes, we need about 100 GB of storage for the full index.

To overcome the memory capacity limitation, the index has to be offloaded to disk. However, with an on-disk index, each chunk lookup needs one disk I/O. If an I/O takes 4 ms, the offloaded index can only sustain 250 lookups per second.
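The arithmetic behind those numbers, under the example's assumptions (4 KB chunks, 40-byte index entries, 4 ms per random disk I/O):

```python
# Back-of-the-envelope numbers from the example above (4 KB chunks, 40-byte index
# entries, 4 ms per random disk I/O).
unique_data = 10 * 2**40          # 10 TB of unique data, in bytes
chunk_size = 4 * 2**10            # 4 KB average chunk size
entry_size = 40                   # bytes per hash-index entry

num_chunks = unique_data // chunk_size    # ~2.7e9 unique chunks
index_size = num_chunks * entry_size      # ~100 GB full index, far beyond RAM

lookups_per_second = 1 / 0.004            # one 4 ms disk I/O per lookup -> 250/s
print(num_chunks, index_size / 2**30, lookups_per_second)
```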
