Agentic Serving
The agentic-serving guide is a horizontal, workload-centric umbrella guide for serving agentic programs on llm-d. Rather than enabling a single feature, it provides the recommended, cohesive deployment for the workload by composing llm-d's well-lit paths into one stack. For the workload model, canonical shapes, and the direction this guide is driving toward, see the Agentic Serving workload page. This guide is its operational counterpart and the canonical guide of the llm-d Agentic Inference SIG.
The reference workload this guide optimizes for is long-horizon loops (agentic code generation): deep multi-turn sessions over large, repository-scale contexts with tool-call pauses between turns. Three behaviors drive every choice below: prefill-heavy/decode-light requests (a 160K-token context dominates time to first token, TTFT), high reusable locality (cache hit rate, not FLOPs, sets throughput), and bursty/stateful arrivals (tool pauses leave sessions idle, and they then resume in bursts).
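To make the first two behaviors concrete, here is a back-of-envelope sketch of how the cached fraction of a context changes the prefill portion of TTFT. The prefill rate of 10,000 tokens/s is a purely hypothetical number chosen for illustration, not a measured llm-d or accelerator figure, and the model ignores queueing, scheduling, and KV-restore costs.

```python
def estimated_prefill_seconds(context_tokens: int,
                              cached_fraction: float,
                              prefill_tokens_per_second: float) -> float:
    """Rough prefill-time estimate: only uncached tokens must be recomputed."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be in [0, 1]")
    if prefill_tokens_per_second <= 0:
        raise ValueError("prefill_tokens_per_second must be positive")
    uncached_tokens = context_tokens * (1.0 - cached_fraction)
    return uncached_tokens / prefill_tokens_per_second


if __name__ == "__main__":
    CONTEXT_TOKENS = 160_000        # context size from the reference workload
    ASSUMED_PREFILL_RATE = 10_000   # tokens/s, hypothetical illustration value

    for hit_rate in (0.0, 0.5, 0.9):
        seconds = estimated_prefill_seconds(
            CONTEXT_TOKENS, hit_rate, ASSUMED_PREFILL_RATE)
        print(f"cache hit rate {hit_rate:.0%}: ~{seconds:.1f} s of prefill")
```

Under these assumed numbers, a fully cold 160K-token turn spends about 16 s in prefill, while a 90% prefix hit reduces that to roughly 1.6 s. This is why the stack below prioritizes keeping and finding cached prefixes over adding raw compute.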
The Optimization Stack
This guide's deployment composes llm-d's capability paths into one stack, each layer relieving a specific pressure of the agentic workload:
| Layer | What it does for the workload |
|---|---|
| Optimized baseline — routing foundation | Prefix-cache scorer routes a turn to the replica already holding its prefix; load-aware scorers keep bursts off hot replicas. The foundation every deployment builds on. |
| Tiered KV offloading | Offload KV cache beyond accelerator memory across tiers, so idle sessions restore on resume instead of recomputing prefill. |
| Precise prefix-cache routing — advanced | An exact, global view of cache state, enabling session-centric orchestration and non-naive (beyond-LRU) KV-cache offloading & retention. |
| P/D disaggregation — large models / interactivity | Separate prefill and decode pools so heavy prefill never stalls token generation, stabilizing inter-token latency (ITL). |
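The routing foundation in the first row can be pictured as a weighted blend of a prefix-cache score and a load score. The sketch below is illustrative only: `Replica`, `pick_replica`, the weights, and the queue-depth cap are hypothetical names and values, not llm-d's scorer implementation or configuration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    cached_prefix_tokens: int  # tokens of this prompt's prefix already cached
    queue_depth: int           # requests currently waiting on the replica


def pick_replica(replicas, prompt_tokens,
                 max_queue_depth=16, prefix_weight=2.0, load_weight=1.0):
    """Pick the replica with the best blend of prefix reuse and spare capacity.

    All weights and the queue cap are hypothetical illustration values.
    """
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    if not replicas:
        raise ValueError("no replicas to choose from")

    def score(replica):
        # Fraction of the prompt that would not need to be prefilled again.
        prefix_score = min(replica.cached_prefix_tokens, prompt_tokens) / prompt_tokens
        # 1.0 for an idle replica, 0.0 for one at or beyond the queue cap.
        load_score = 1.0 - min(replica.queue_depth, max_queue_depth) / max_queue_depth
        return prefix_weight * prefix_score + load_weight * load_score

    return max(replicas, key=score)


if __name__ == "__main__":
    replicas = [
        Replica("warm-and-light", cached_prefix_tokens=150_000, queue_depth=4),
        Replica("cold-and-idle", cached_prefix_tokens=0, queue_depth=0),
        Replica("warm-but-hot", cached_prefix_tokens=150_000, queue_depth=16),
    ]
    print(pick_replica(replicas, prompt_tokens=160_000).name)
```

In this toy setup the turn goes to the replica that already holds most of the prefix and still has headroom. The equally warm but saturated replica loses to it, which is the "keep bursts off hot replicas" behavior described in the table.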
The Agentic Inference SIG northstar sets the broader direction: session-graph orchestration, program-aware scheduling, zero-recompute state reuse with typed retention, and proactive state placement ahead of fan-out. See the workload page for the full direction and further reading.
Deployments
The layers above compose into deployments spanning a range of capabilities and operational costs, from a routing-and-offloading baseline up to disaggregated serving, with layers added incrementally as a workload's scale and latency targets grow. Concrete workloads with benchmarked options are provided as sub-guides below.
Benchmarking
Deployments are compared against the same realistic agentic workload (large reused contexts, deep multi-turn sessions, and tool-call stalls), replayed with inference-perf via the llm-d-benchmark harness rather than a single-turn shared-prefix stream, so that cross-turn reuse, session persistence, and bursty resumption are actually exercised. They are compared on program-level metrics: whole-session completion time and task throughput, alongside TTFT and ITL. Replaying real agentic traces (program structure and tool-call timing from OpenTelemetry) is the direction for program-level evaluation.
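As a rough illustration of how program-level metrics differ from per-request ones, the sketch below derives session completion time and task throughput from per-turn timing records, alongside mean TTFT and ITL. The `TurnRecord` fields and the `summarize` helper are hypothetical; they are not the output schema of inference-perf or llm-d-benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnRecord:
    session_id: str
    start_s: float        # time the turn's request was sent
    first_token_s: float  # time the first output token arrived
    end_s: float          # time the last output token arrived
    output_tokens: int


def summarize(turns):
    """Compute program-level and request-level metrics from turn records."""
    if not turns:
        raise ValueError("no turns to summarize")

    by_session = defaultdict(list)
    for turn in turns:
        by_session[turn.session_id].append(turn)

    # Whole-session completion time is wall clock, so tool pauses are included.
    session_seconds = {
        sid: max(t.end_s for t in ts) - min(t.start_s for t in ts)
        for sid, ts in by_session.items()
    }
    wall_clock = max(t.end_s for t in turns) - min(t.start_s for t in turns)

    ttfts = [t.first_token_s - t.start_s for t in turns]
    # ITL: decode time spread over the gaps between consecutive output tokens.
    itls = [(t.end_s - t.first_token_s) / (t.output_tokens - 1)
            for t in turns if t.output_tokens > 1]

    return {
        "session_seconds": session_seconds,
        "sessions_per_second": len(by_session) / wall_clock if wall_clock > 0 else None,
        "mean_ttft_s": mean(ttfts),
        "mean_itl_s": mean(itls) if itls else None,
    }


if __name__ == "__main__":
    turns = [
        TurnRecord("s1", 0.0, 2.0, 4.0, 21),
        TurnRecord("s1", 10.0, 10.5, 12.5, 11),  # resumes after a tool pause
        TurnRecord("s2", 1.0, 3.0, 6.0, 31),
    ]
    print(summarize(turns))
```

The second turn of session `s1` has a much lower TTFT than the first, as a cache-warm resume would, yet the session's completion time still spans the tool pause. A per-request TTFT average alone would hide that, which is the reason for reporting both views.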