
2 posts tagged with "Storage"

Storage-related content


Serving Hybrid Models at Scale in llm-d

· 13 min read
Kfir Toledo
Research Staff Member, IBM
Or Ozeri
Research Staff Member, IBM
Danny Harnik
Senior Technical Staff Member, IBM
Itay Etelis
Research Staff Member, IBM
Rachel Brill
Senior Technical Staff Member, IBM
Maroon Ayoub
Research Scientist & Architect, IBM

For most of the transformer era, the KV cache rested on a quiet assumption: one model, one uniform cache. Every layer attended the same way, every block was the same size, and everything built on top of the cache (allocators, offload connectors, schedulers) could treat it as a single pool.

Hybrid models broke this assumption. Most of today's cutting-edge models mix attention types within a single model (full attention alongside sliding-window, linear, or Mamba layers), which makes the cache heterogeneous: different layers hold different amounts of state, in different shapes, with different reuse rules. A cache block that used to be allocated as one uniform unit now consists of several distinct parts.

To serve a hybrid model efficiently, an AI inference platform has to handle that heterogeneity in at least three aspects of the stack:

  • GPU Memory Allocation: How the cache is laid out and allocated on the GPU. vLLM solved this with its Hybrid Memory Allocator (HMA), rebuilt around a unified allocator (see Hybrid Models as First-Class Citizens in vLLM).
  • KV Offloading: Extending the KV cache to CPU and storage. An offloading connector that is not HMA-aware forces the HMA off, forfeiting the GPU memory improvements and the potential data-movement savings.
  • KV-Aware Routing: Sending each request to the right model-server replica. A router that ignores the hybrid memory structure may wrongly report a node as having (or lacking) the required KV data, because it relies on information from only some of the layers.
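The routing pitfall in the last bullet can be made concrete with a small sketch. It is not llm-d's or vLLM's implementation: `LayerGroup`, `group_covers`, and `usable_prefix` are hypothetical names, and the model of the cache is an assumption (full-attention groups need every prefix block, sliding-window groups need only the last few, and Mamba or linear-attention state is not modeled at all). Under those assumptions, a replica can reuse a prefix only if every layer group holds the blocks that group needs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class LayerGroup:
    """Hypothetical description of one group of layers sharing a cache layout."""
    name: str
    # None means full attention (needs every prefix block);
    # an integer means sliding-window attention over that many trailing blocks.
    window_blocks: Optional[int]


def group_covers(group: LayerGroup, cached: Set[int], prefix_len: int) -> bool:
    """True if `cached` holds every block this group needs to resume at prefix_len."""
    if group.window_blocks is None:
        start = 0
    else:
        start = max(0, prefix_len - group.window_blocks)
    return all(i in cached for i in range(start, prefix_len))


def usable_prefix(groups: List[LayerGroup],
                  cache: Dict[str, Set[int]],
                  num_blocks: int) -> int:
    """Longest prefix (in blocks) that is covered by every layer group."""
    for p in range(num_blocks, 0, -1):
        if all(group_covers(g, cache.get(g.name, set()), p) for g in groups):
            return p
    return 0


groups = [LayerGroup("full", None), LayerGroup("swa", 2)]

# Replica A kept all full-attention blocks and the last two sliding-window blocks.
replica_a = {"full": {0, 1, 2, 3}, "swa": {2, 3}}
# Replica B kept the full-attention blocks but evicted its sliding-window blocks.
replica_b = {"full": {0, 1, 2, 3}, "swa": set()}

print(usable_prefix(groups, replica_a, 4))  # 4
print(usable_prefix(groups, replica_b, 4))  # 0
```

A router that inspected only the full-attention group would score both replicas as a four-block hit, while one that demanded every sliding-window block from the start of the prompt would score replica A as a miss. Checking each group against its own reuse rule avoids both errors.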

This post describes how we extended vLLM's solution for GPU memory handling to KV offloading and routing as well. As a result, llm-d's tiered KV cache management significantly improves throughput and latency at high request rates.

Native KV Cache Offloading to Any Filesystem with llm-d

· 11 min read
Kfir Toledo
Research Staff Member, IBM
Danny Harnik
Senior Technical Staff Member, IBM
Effi Ofer
Research Staff Member, IBM
Or Ozeri
Research Staff Member, IBM
Guy Margalit
Senior Technical Staff Member, IBM Storage CTO Office

llm-d is a distributed inference platform spanning multiple vLLM instances. KV cache hits are critical to achieving high inference throughput. Yet in a distributed environment, cache hits do not occur across nodes, because the KV cache is local to each vLLM instance. This local cache is also limited in size, which further limits KV data reuse. This blog presents a new way to offload KV cache to storage that tackles both challenges: KV cache sharing and KV cache scale.

llm-d's filesystem (FS) backend is a KV cache storage connector for vLLM that offloads KV blocks to shared storage, built on vLLM's native Offloading Connector. While the llm-d FS backend can speed up the serving of single requests (improving TTFT), its main goal is to preserve stable throughput and low latency at scale, as concurrency and context lengths grow. It does so by significantly enlarging the cache space and enabling KV reuse across multiple replicas and nodes in llm-d.
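To illustrate why a shared filesystem lets replicas reuse each other's KV data, here is a minimal sketch of a content-addressed block store. It is an assumption-laden illustration and not the llm-d FS backend's actual layout or API: `block_path`, `store_block`, and `load_block` are hypothetical helpers, the key is simply a hash of the model name and the token prefix, and the payload is opaque bytes. It also assumes the shared filesystem gives an atomic rename within a directory, as POSIX filesystems do.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional, Sequence


def block_path(root: Path, model: str, prefix_token_ids: Sequence[int]) -> Path:
    """Derive a file path from the model and the full token prefix up to this block."""
    key = model + ":" + ",".join(str(t) for t in prefix_token_ids)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Fan out into subdirectories so no single directory grows too large.
    return root / digest[:2] / (digest + ".kv")


def store_block(root: Path, model: str, prefix_token_ids: Sequence[int],
                payload: bytes) -> Path:
    """Write a block to a temp file, then rename it so readers never see partial data."""
    path = block_path(root, model, prefix_token_ids)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return path


def load_block(root: Path, model: str,
               prefix_token_ids: Sequence[int]) -> Optional[bytes]:
    """Return the stored block, or None if no replica has written it yet."""
    try:
        return block_path(root, model, prefix_token_ids).read_bytes()
    except FileNotFoundError:
        return None
```

Because the path depends only on the model and the token prefix, any replica that mounts the same directory computes the same path for the same prefix, so a block written by one replica is a cache hit for another without any coordination between them.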

While there are a number of existing solutions for offloading KV cache to storage (e.g. LMCache or Dynamo KVBM), the new connector is simple, runs with llm-d and vLLM as its only dependencies, and exhibits improved performance over state-of-the-art shared storage connectors.