NVIDIA Dynamo Gets Agentic AI Overhaul With 97% Cache Hit Rates


Lawrence Jengar Apr 17, 2026 23:22

NVIDIA unveils major Dynamo updates targeting AI coding agents, achieving up to 97% KV cache hit rates and 4x latency improvements for enterprise deployments.

NVIDIA has released a comprehensive update to its Dynamo inference framework specifically optimized for AI coding agents, addressing a critical bottleneck as enterprise adoption of automated code generation accelerates. The company reports achieving up to 97.2% cache hit rates for multi-agent workflows—a metric that directly translates to reduced compute costs and faster response times.

The timing isn't accidental. Stripe's internal agents now generate over 1,300 pull requests weekly. Ramp attributes 30% of its merged PRs to AI agents. Spotify reports 650+ agent-generated PRs monthly. Behind each of these workflows sits an inference stack under intense pressure from repeated context processing.

The Cache Problem Nobody Talks About

Here's what makes agentic AI different from chatbots: a coding agent like Claude Code or Codex makes hundreds of API calls per session, each carrying the full conversation history. After the first call writes the conversation prefix to KV cache, subsequent calls can hit 85-97% of their tokens in cache, provided they land on the same worker. NVIDIA measured an 11.7x read/write ratio: the system reads from cache nearly 12 times for every token written.

Without cache-aware routing, turn 2 of a conversation has roughly a 1/N chance of landing on the same worker as turn 1. Every miss forces complete prefix recomputation. For a 200K context window, that's expensive.
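The 1/N claim is easy to check with a toy simulation. This is illustrative only; the worker count and hash-based sticky policy are assumptions for the sketch, not Dynamo's actual routing:

```python
import random

def simulate(num_workers: int, conversations: int, turns: int, sticky: bool) -> float:
    """Fraction of follow-up turns that land on a worker already
    holding the conversation's prefix in KV cache."""
    rng = random.Random(0)
    hits, total = 0, 0
    for conv in range(conversations):
        cached_on = set()  # workers that have this conversation's prefix
        for turn in range(turns):
            if sticky:
                worker = hash(conv) % num_workers   # route by conversation id
            else:
                worker = rng.randrange(num_workers) # naive load balancing
            if turn > 0:
                total += 1
                hits += worker in cached_on
            cached_on.add(worker)  # prefix is now cached on this worker
    return hits / total
```

With 8 workers, random routing gives turn 2 roughly a 12.5% chance of a prefix hit, while conversation-sticky routing hits every time; that gap is exactly what cache-aware routing closes.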

Three-Layer Architecture

Dynamo's update attacks the problem at three levels. The frontend now supports multiple API protocols—v1/responses, v1/messages, and v1/chat/completions—through a common internal representation. This matters because newer APIs use typed content blocks, letting the orchestrator see boundaries between thinking, tool calls, and text to apply different cache policies per block type.
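As a sketch of the idea (the real internal representation is not public, so the type names and mapping below are invented), normalizing typed content blocks from a v1/messages-style payload so the orchestrator can apply per-block cache policies might look like:

```python
from dataclasses import dataclass

# Hypothetical internal representation: each protocol frontend parses
# into the same typed blocks so routing/caching sees block boundaries.
@dataclass
class Block:
    kind: str      # "thinking" | "tool_call" | "text"
    content: str

def normalize_messages_api(content: list) -> list:
    """Map v1/messages-style typed content into internal blocks."""
    kinds = {"thinking": "thinking", "tool_use": "tool_call", "text": "text"}
    return [Block(kinds[c["type"]], c.get("text") or str(c.get("input", "")))
            for c in content if c["type"] in kinds]

def cache_policy(block: Block) -> str:
    """Per-block policy: retain prompt/text blocks, let reasoning go first."""
    return "evict_first" if block.kind == "thinking" else "retain"
```

The point of the common representation is visible here: once thinking, tool-call, and text spans are distinct objects, the eviction policy can treat them differently instead of seeing one opaque token stream.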

The new "agent hints" extension allows harnesses to attach structured metadata to requests: priority levels, estimated output length, and speculative prefill flags. A harness can signal "warm this cache ahead of time" when it knows a tool call is about to return.
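A hedged sketch of attaching such hints to a request; the field names (`priority`, `estimated_output_tokens`, `speculative_prefill`) follow the article's description, not a published schema:

```python
def with_agent_hints(request: dict, *, priority: int = 0,
                     estimated_output_tokens=None,
                     speculative_prefill: bool = False) -> dict:
    """Attach illustrative agent-hint metadata to a request payload.
    Field names are assumptions based on the hints the article lists."""
    hints = {"priority": priority, "speculative_prefill": speculative_prefill}
    if estimated_output_tokens is not None:
        hints["estimated_output_tokens"] = estimated_output_tokens
    return {**request, "agent_hints": hints}

# A harness that knows a tool call is about to return can ask the
# serving layer to warm the cache before the next turn arrives:
warm = with_agent_hints({"model": "m", "messages": []},
                        priority=1, speculative_prefill=True)
```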

At the routing layer, NVIDIA's Flash Indexer now handles 170 million operations per second for KV-aware placement decisions. The NeMo Agent Toolkit team built a custom router using these APIs and measured 4x reduction in p50 time-to-first-token and up to 63% latency improvement for priority-tagged requests under memory pressure.
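KV-aware placement boils down to scoring workers by how much of the incoming prompt they already hold, balanced against load. A toy scorer under the assumption that the router can see each worker's cached prefix and queue depth (this is not NVIDIA's actual Flash Indexer logic):

```python
def pick_worker(prompt_tokens: list, cached_prefixes: dict,
                load: dict, load_penalty: float = 0.5) -> str:
    """Pick the worker with the best trade-off between cached prefix
    length and current load. Simplified stand-in for KV-aware routing."""
    def prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n
    return max(cached_prefixes,
               key=lambda w: prefix_len(prompt_tokens, cached_prefixes[w])
                             - load_penalty * load[w])
```

The load penalty matters: routing purely on cache affinity would pile every turn of a hot conversation onto one worker, so real routers trade hit rate against queueing delay.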

Rethinking Cache Eviction

Standard LRU eviction treats all cached data identically—a fundamental mismatch with how agents actually work. System prompts get reused every turn. Reasoning tokens inside <think> blocks? Typically zero reuse after the loop closes, yet they account for roughly 40% of generated tokens.

The update introduces selective retention with per-region control. Teams can specify that system prompt blocks evict last, conversation context survives 30-second tool call gaps, and decode tokens go first. TensorRT-LLM's new TokenRangeRetentionConfig enables this granularity within single requests.
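The eviction ordering described above can be sketched as a per-region priority scheme. The `Region` type below is an illustrative stand-in, not TensorRT-LLM's actual `TokenRangeRetentionConfig` signature:

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    start: int      # token offset where the region begins
    end: int        # token offset where it ends
    priority: int   # higher = evict later

def eviction_order(regions: list) -> list:
    """Reclaim lowest-priority regions first under memory pressure,
    instead of LRU's one-size-fits-all recency ordering."""
    return [r.name for r in sorted(regions, key=lambda r: r.priority)]

order = eviction_order([
    Region("system_prompt", 0, 512, priority=100),    # evict last
    Region("conversation", 512, 4096, priority=50),   # survives tool-call gaps
    Region("decode_tokens", 4096, 4608, priority=0),  # go first
])
```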

NVIDIA is also building toward a four-tier memory hierarchy—GPU, CPU, local NVMe, and remote storage—where blocks flow automatically via write-through. When one worker computes KV for a prefix, any other worker can load those blocks via RDMA instead of recomputing. Four redundant prefill computations become one compute and three loads.
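A minimal model of that write-through flow, with tier names taken from the article and deliberately simplified logic (real transfers would move KV blocks over RDMA and NVMe, not between Python dicts):

```python
class TieredKVCache:
    """Toy four-tier KV block store: search fastest tier first,
    promote on hit, write through all tiers on compute."""
    TIERS = ["gpu", "cpu", "nvme", "remote"]

    def __init__(self):
        self.store = {t: {} for t in self.TIERS}

    def put(self, prefix_hash: str, blocks: bytes) -> None:
        # Write-through: the block lands in every tier, so any other
        # worker can load it instead of recomputing the prefix.
        for t in self.TIERS:
            self.store[t][prefix_hash] = blocks

    def get(self, prefix_hash: str):
        for t in self.TIERS:  # GPU, then CPU, then NVMe, then remote
            if prefix_hash in self.store[t]:
                blocks = self.store[t][prefix_hash]
                self.store["gpu"][prefix_hash] = blocks  # promote on hit
                return blocks
        return None  # full miss: caller must recompute the prefix
```

This is where the "four prefills become one compute and three loads" arithmetic comes from: one worker's `put` makes the block visible to every tier, and the other three workers resolve their lookups with a `get`.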

What This Means for Deployment

The company has been running internal Dynamo deployments of GLM-5 and MiniMax2.5 to power Codex and Claude Code harnesses, benchmarking against closed-source inference. They're targeting parity on cache reuse performance with optimized recipes coming in the next few weeks.

For teams already running open-source models on their own GPUs, the gap with managed API providers just got smaller. The cache_control API mirrors Anthropic's prompt caching semantics, so migration paths exist for teams familiar with that interface.
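In Anthropic's prompt-caching interface, a `cache_control` marker on a content block tells the server to cache everything up to that breakpoint. A request using those semantics might look like the following sketch (the model name is a placeholder):

```python
def mark_cache_breakpoint(block: dict) -> dict:
    """Add an Anthropic-style ephemeral cache breakpoint to a content block."""
    return {**block, "cache_control": {"type": "ephemeral"}}

request = {
    "model": "some-open-model",  # placeholder, not a real model id
    "system": [
        # The long, stable system prompt is the natural breakpoint:
        # it is identical on every turn, so caching it pays off most.
        mark_cache_breakpoint(
            {"type": "text", "text": "You are a coding agent..."}),
    ],
    "messages": [{"role": "user",
                  "content": [{"type": "text", "text": "Fix the bug."}]}],
}
```

Because the marker rides on ordinary content blocks, a harness already written against Anthropic's API can point at a Dynamo-served open model with minimal changes, which is the migration path the article describes.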

The agent hints specification remains v1, and NVIDIA is actively soliciting feedback from teams building agent harnesses on which signals prove most useful. Given that Dynamo 1.0 launched just last month with major cloud provider adoption, expect rapid iteration as enterprise agentic workloads scale.

Image source: Shutterstock
