Jun's website

The obscurity of the subject and the brevity of human life

KV Cache on SSD: Taking Twitter's Fatcache as an Example.

Jun published on 2025-03-10 included in Storage Cache

High-performance in-memory key-value caches are indispensable components of large-scale web architectures. However, the limited capacity and high power consumption of memory motivate researchers and developers to build key-value caches on SSDs, where the SSD is treated as an extension of limited memory.

In this post, I will cover the general ideas behind KV caching on SSD, using Twitter’s fatcache as the example, and then discuss the issues with this traditional approach.

Background

Twitter’s fatcache, like many modern memory allocators such as Google's tcmalloc and the Linux slab allocator, is based on the idea of the slab allocator. The full details are in the paper The slab allocator: An object-caching kernel memory allocator. I will not go through every detail here, but overall, a slab allocator is a kind of segregated-list allocator. A slab is a contiguous memory area and is the basic management unit of the allocator. Each slab is further divided into slots of the same size, which store objects and other metadata. A slab also uses a freelist to track the allocation status of its slots, which is the key to allocation and deallocation. All slabs with the same slot (object) size are grouped together, and the groups are organized into an array sorted by slot size. This lets the allocator use binary search to allocate an object from the best-fitting slab.
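To make the structure above concrete, here is a minimal sketch in Python. It is not fatcache's code: the names `Slab`, `SlabAllocator`, and `class_for`, the 4096-byte default slab size, and the handle format are all assumptions made for illustration. It only shows the three ideas from the paragraph: equal-size slots, a per-slab freelist, and binary search over sorted slot sizes.

```python
import bisect


class Slab:
    """One contiguous region carved into equal-size slots (illustrative)."""

    def __init__(self, slot_size, slab_size):
        self.slot_size = slot_size
        self.buf = bytearray(slab_size)
        # Freelist: indices of slots that are currently unallocated.
        self.free = list(range(slab_size // slot_size))

    def alloc(self):
        # Return a free slot index, or None if the slab is full.
        return self.free.pop() if self.free else None

    def release(self, slot):
        self.free.append(slot)


class SlabAllocator:
    """Slabs grouped by slot size; groups kept in slot-size order."""

    def __init__(self, slot_sizes, slab_size=4096):  # sizes are assumptions
        self.slab_size = slab_size
        self.sizes = sorted(slot_sizes)
        self.classes = [[] for _ in self.sizes]  # list of slabs per class

    def class_for(self, nbytes):
        # Binary search for the smallest slot size that fits the object.
        i = bisect.bisect_left(self.sizes, nbytes)
        if i == len(self.sizes):
            raise ValueError("object larger than the largest slot size")
        return i

    def alloc(self, nbytes):
        cid = self.class_for(nbytes)
        # Reuse a slab of this class that still has a free slot.
        for sid, slab in enumerate(self.classes[cid]):
            slot = slab.alloc()
            if slot is not None:
                return cid, sid, slot
        # Otherwise carve a new slab for this class.
        slab = Slab(self.sizes[cid], self.slab_size)
        self.classes[cid].append(slab)
        return cid, len(self.classes[cid]) - 1, slab.alloc()

    def free(self, handle):
        cid, sid, slot = handle
        self.classes[cid][sid].release(slot)
```

For example, with slot sizes 64, 128, and 256, a 100-byte object lands in the 128-byte class. The cost of this design is internal fragmentation: the 28 unused bytes in that slot are wasted.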

[Paper Note] ALPS an Adaptive Learning, Priority OS Scheduler for Serverless Functions

Jun published on 2024-09-03 included in Paper

Motivation

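FaaS environments run a large number of short-lived functions, which are scheduled onto the OS as processes. Creating thousands of functions at the same time is routine. FaaS functions are usually short-lived: studies show that 99% of Azure Functions finish within 224 s. As a result, the OS scheduling policy has a major impact on the turnaround time of FaaS functions. However, Linux CFS does not perform well under FaaS workloads made up of many short-lived tasks.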

[Paper Note] Demystifying and Checking Silent Semantic Violations in Large Distributed Systems

Jun published on 2024-08-19 included in Paper

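This work is amazing. While reading Understanding, detecting and localizing partial failures in large system software, I was wondering how to detect silent semantic violations. That paper says one difficulty is not knowing what the correct semantics are, and it occurred to me that an LLM might be able to infer them. I never expected they could be inferred in a way as concise as the one in this paper.

The paper's idea is simple: start from the system's regression tests. Although these tests usually target specific bugs, they still encode the system's semantics. The paper derives these semantics from the regression tests and checks at runtime whether the system violates them.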

[Paper Note] Understanding the Performance Implications of the Design Principles in Storage-Disaggregated Databases

Jun published on 2024-08-15 included in Paper

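Starting from the most basic monolithic database, this paper derives today's mainstream architectural designs step by step and analyzes the performance of each design in detail. For a newcomer like me, following the authors' line of thought felt like an intellectual journey that opened a door.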

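What type of database does the paper target?
storage-disaggregated OLTP database
Why don't storage-disaggregated OLAP databases use the log-as-the-database and shared-storage designs?
OLAP usually serves read-intensive workloads, while these two designs address the pain points of write-intensive workloads.
How does log-as-the-database work, and what is its effect?
The compute node sends only the xlog, and the storage node obtains the data by replaying the xlog. This reduces network load and puts the storage node's CPU to use.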
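
The log-as-the-database idea in the last answer can be illustrated with a small sketch. This is not the paper's system or any real database's log format: `LogRecord`, `StorageNode`, the byte-range record layout, and the tiny page size are hypothetical, chosen only to show a storage node materializing pages by replaying log records instead of receiving full pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical redo record: write `data` at `offset` within a page."""
    lsn: int
    page_id: int
    offset: int
    data: bytes


class StorageNode:
    """Rebuilds pages locally by replaying log records in LSN order."""

    def __init__(self, page_size=16):  # page size is an assumption
        self.page_size = page_size
        self.pages = {}
        self.applied_lsn = 0

    def receive(self, records):
        for rec in sorted(records, key=lambda r: r.lsn):
            if rec.lsn <= self.applied_lsn:
                continue  # already replayed, so skip the duplicate
            if rec.offset + len(rec.data) > self.page_size:
                raise ValueError("record does not fit in the page")
            page = self.pages.setdefault(
                rec.page_id, bytearray(self.page_size))
            # Replay: apply the logged change using the storage node's CPU.
            page[rec.offset:rec.offset + len(rec.data)] = rec.data
            self.applied_lsn = rec.lsn

    def read_page(self, page_id):
        return bytes(self.pages.get(page_id, bytearray(self.page_size)))
```

In this sketch the compute node ships only the few bytes of each `LogRecord` over the network, rather than whole dirty pages, which is where the reduction in network load comes from.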

[Paper Note] Understanding, Detecting and Localizing Partial Failures in Large System Software

Jun published on 2024-08-05 included in Paper

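Background

Partial failure is a failure mode distinct from the full failure of the fail-stop model. In short, a partial failure means that part of the system fails, but the system does not fail so completely that it cannot serve requests. The paper defines partial failure as follows:

Consider a server process P that contains many components and provides a set of services R. If a fault occurs in P that does not crash P but causes a safety, liveness, or performance violation for some $R_f \subsetneq R$, that fault is a partial failure.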

[Paper Note] Efficient Exposure of Partial Failure Bugs in Distributed Systems With Inferred Abstract States

Jun published on 2024-07-31 included in Paper

Motivation

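Fault injection in distributed systems is currently not efficient enough, mainly for the following reasons:

Many failures are partial failures. When the failure occurs, the system keeps running, and only one of its components or services is affected.
The trigger conditions for failures are very rare. For example, some failures occur only under a specific network fault at a specific moment and affect only one component.
Mature distributed systems have fairly good fault tolerance. Many chaos engineering practices inject faults at random, but a mature distributed system can tolerate most of them, so a large share of the injections are ineffective and cannot expose new bugs.

A major difficulty of fault injection is how to find, among countless possible fault injection points, the ones most likely to expose bugs. The paper argues that most bugs occur under special conditions, so a fault injection framework should find these special conditions and inject faults at these error-prone places. The paper proposes a new fault injection framework, Legolas. Legolas uses static analysis to find the program's state machine and, at its state transitions, injects faults through inserted hooks according to the bsrr (budgeted-state-round-robin) policy.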
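
The excerpt names the bsrr policy without spelling out its rules, so the following is only one plausible reading of "budgeted state round robin", not Legolas's actual algorithm. The class `BudgetedRoundRobin`, its method `should_inject`, and the exact turn-taking rule are assumptions: each state gets a fixed injection budget, and states take turns so that no single frequently visited state consumes all the injections.

```python
from collections import deque


class BudgetedRoundRobin:
    """Illustrative policy: per-state budget plus round-robin turns."""

    def __init__(self, states, budget):
        self.remaining = {s: budget for s in states}
        self.queue = deque(states)  # front of the queue holds the turn

    def should_inject(self, state):
        """Called from a hook when the system enters `state`."""
        # Skip over states whose budget is already used up.
        while self.queue and self.remaining[self.queue[0]] == 0:
            self.queue.popleft()
        # Inject only if it is this state's turn.
        if not self.queue or self.queue[0] != state:
            return False
        self.remaining[state] -= 1
        self.queue.rotate(-1)  # pass the turn to the next state
        return True
```

A hook inserted at each state transition would call `should_inject(new_state)` and raise the fault only when it returns `True`. One limitation of this simplified version is that injection stalls if the state holding the turn is never entered again.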