Prompt Caching

Prompt caching is an optimization technique that reduces API latency and input-token cost by storing the computed state for frequently reused prompt segments and reusing it across requests.

When you send a large amount of text to a Large Language Model (LLM)—such as an extensive system prompt, a massive document repository, or a long conversation history—the model must process and compute mathematical representations (Key-Value states) for every single token. Prompt caching allows the LLM provider to save this pre-computed state for the static parts of your prompt. When subsequent requests share the exact same starting text, the model reuses the cached state instead of calculating it from scratch.

How It Works: The Prefix Rule

Prompt caching works primarily on exact prefix matching. For a cache to trigger, the identical sequence of tokens must appear at the absolute beginning of the prompt.

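If even a single character or whitespace changes inside the cached prefix, the cache no longer matches from that point on, and the text after the change must be reprocessed. A change at the very start of the prompt therefore invalidates the entire cache.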
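
The prefix rule can be illustrated with a small sketch. It assumes whitespace splitting as a stand-in for a real tokenizer, and `shared_prefix_len` is a hypothetical helper, not any provider's API. It counts how many leading tokens two prompts have in common, which is the most a prefix cache could reuse.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Return the number of leading tokens that are identical in both lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Whitespace splitting stands in for a real tokenizer (an assumption).
policy = "Employees accrue two days of leave per month ."
first = (policy + " What is the holiday policy ?").split()
second = (policy + " How do I claim travel expenses ?").split()
edited = ("Note : " + policy + " How do I claim travel expenses ?").split()

reusable = shared_prefix_len(first, second)  # the whole policy text matches
broken = shared_prefix_len(first, edited)    # the very first token differs
```

Here `reusable` covers all nine policy tokens, while `broken` is zero because the inserted text shifts everything that follows.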

Initial Request (Cache Creation)

The application sends a long document along with a question. The system processes the whole text, answers the user, and caches the computed state for the document's tokens.

[ START OF PROMPT ]
---------------------------------------------------------
| STATIC CONTEXT: (10,000 tokens of company policy)     | <-- Cached for future use
---------------------------------------------------------
| DYNAMIC INPUT: "What is the holiday policy?"          |
---------------------------------------------------------

Subsequent Request (Cache Hit)

Another user asks a different question using the exact same company policy. The system matches the prefix, skips processing the first 10,000 tokens entirely, and only computes the new question.

[ START OF PROMPT ]
---------------------------------------------------------
| STATIC CONTEXT: (10,000 tokens of company policy)     | <-- CACHE HIT (Fast & Cheap)
---------------------------------------------------------
| DYNAMIC INPUT: "How do I claim travel expenses?"      |
---------------------------------------------------------
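
The two requests above can be mimicked with a toy in-memory cache. This is only a sketch of the bookkeeping: the `ToyPrefixCache` class is hypothetical, a dictionary keyed by a hash of the static text stands in for the provider's stored Key-Value state, and word counts stand in for token counts.

```python
import hashlib


class ToyPrefixCache:
    """Tracks which static prefixes have been 'processed' before."""

    def __init__(self) -> None:
        self._seen: dict[str, int] = {}  # prefix hash -> prefix length in words

    def process(self, static_context: str, dynamic_input: str) -> dict:
        key = hashlib.sha256(static_context.encode("utf-8")).hexdigest()
        static_len = len(static_context.split())
        dynamic_len = len(dynamic_input.split())
        hit = key in self._seen
        if not hit:
            self._seen[key] = static_len  # cache creation on first sight
        # On a hit, only the dynamic part needs fresh computation.
        computed = dynamic_len if hit else static_len + dynamic_len
        return {"cache_hit": hit, "words_computed": computed}


cache = ToyPrefixCache()
policy = "policy " * 10_000  # stand-in for the 10,000-token company policy
r1 = cache.process(policy, "What is the holiday policy?")
r2 = cache.process(policy, "How do I claim travel expenses?")
```

The first call is a miss and pays for the full prompt. The second call is a hit and pays only for the new question.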

Implementation Approaches: Automatic vs. Explicit

Different LLM providers implement prompt caching in one of two ways:

1. Automatic (Implicit) Caching

Providers like OpenAI and DeepSeek handle prompt caching automatically under the hood. If a prompt exceeds a specific token threshold (e.g., 1,024 tokens) and matches a recently processed prefix, the cache is applied with no code changes required from the developer.
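
With automatic caching there is nothing to configure, but it is worth checking whether hits actually occur. As a sketch, OpenAI's Chat Completions responses report cached prompt tokens under `usage.prompt_tokens_details.cached_tokens`. The helper below reads that field from a response that has already been converted to a dictionary. The `cached_fraction` name and the sample numbers are made up for illustration, and other providers report cache usage under different field names.

```python
def cached_fraction(usage: dict) -> float:
    """Fraction of prompt tokens served from cache, given an OpenAI-style usage dict."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


# Invented sample values, shaped like the `usage` object of a response.
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
fraction = cached_fraction(sample_usage)
```

A fraction that stays at zero across repeated requests usually means the prompts do not share a long enough identical prefix.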

2. Explicit Caching

Providers like Anthropic (Claude) and Google (Gemini) give developers explicit control. With Anthropic, you flag breakpoints in your prompt using the cache_control attribute to tell the API exactly where the static text ends and which prefix should be cached. Google's Gemini API instead has you create a separate cached-content object and reference it in later requests.

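Here is an example request body with explicit cache control, in the shape of Anthropic's Messages API (the model name is illustrative; substitute a current model ID):

{
  "model": "claude-3-5-sonnet",
  "max_tokens": 1024,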
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "This is a massive 20,000-token legal document that I will query multiple times...",
          "cache_control": {"type": "ephemeral"}
        },
        {
          "type": "text",
          "text": "Analyze the liability clause in this document."
        }
      ]
    }
  ]
}
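
When the same document is queried repeatedly, it helps to build the request body in one place so the cached block stays byte-for-byte identical and only the question changes. The sketch below mirrors the JSON structure above as a Python dictionary. `build_cached_request` is a hypothetical helper, the model name is copied from the example, and the `max_tokens` value is an arbitrary choice.

```python
def build_cached_request(document: str, question: str) -> dict:
    """Build a request body whose first content block is marked for caching."""
    return {
        "model": "claude-3-5-sonnet",  # copied from the example; substitute a current model ID
        "max_tokens": 1024,            # arbitrary illustrative value
        "messages": [
            {
                "role": "user",
                "content": [
                    # Static block: identical across requests, so it can be cached.
                    {
                        "type": "text",
                        "text": document,
                        "cache_control": {"type": "ephemeral"},
                    },
                    # Dynamic block: placed after the cache breakpoint.
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


doc = "Full text of the legal document goes here."
req_a = build_cached_request(doc, "Analyze the liability clause in this document.")
req_b = build_cached_request(doc, "Summarize the termination terms.")
```

Both requests carry the same first content block, so the second one can reuse whatever the first one cached.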

Architectural Best Practices

To maximize cache hits and benefit from prompt caching, you must intentionally restructure your prompts.

  • Put Static Context First: Always order your prompt elements from most static to most dynamic. System prompts, reference documentation, and core examples should sit at the very top. User questions, timestamps, and randomized variables must go at the absolute bottom.
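  • Keep Conversation History Append-Only: In a chat interface, appending new messages to the end of the history preserves the prefix rule. The earlier conversation remains a cacheable prefix, so only the latest turn is billed at the full input rate. Editing, reordering, or truncating earlier messages changes the prefix and forces reprocessing from the point of the change.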
  • Avoid Unnecessary Variables at the Top: Never place dynamic variables—such as Current Time: {{time}} or User ID: {{id}}—at the beginning of your system prompt. Doing so will completely invalidate the cache for every single request.
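
The ordering advice can be checked with a quick comparison. In this sketch both prompt builders are hypothetical, and `os.path.commonprefix` (which compares strings character by character) measures how much of two consecutive prompts is shared. Placing the timestamp first leaves almost nothing to reuse, while placing it last keeps the whole static block as a common prefix.

```python
import os

SYSTEM = "You are a support assistant. Follow the company policy below.\n"
POLICY = "Policy text. " * 50  # stand-in for a long static document


def prompt_dynamic_first(time: str, question: str) -> str:
    # Anti-pattern: the changing value sits in front of the static text.
    return f"Current Time: {time}\n{SYSTEM}{POLICY}\n{question}"


def prompt_static_first(time: str, question: str) -> str:
    # Static content first, dynamic values at the end.
    return f"{SYSTEM}{POLICY}\nCurrent Time: {time}\n{question}"


def shared_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


bad = shared_chars(prompt_dynamic_first("09:00", "Q1"),
                   prompt_dynamic_first("17:30", "Q2"))
good = shared_chars(prompt_static_first("09:00", "Q1"),
                    prompt_static_first("17:30", "Q2"))
```

In the first layout the prompts diverge right after the "Current Time: " label. In the second they share the full system prompt and policy text.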

Core Benefits

Cost Savings: Cached tokens are billed at a fraction of the cost of standard input tokens. Depending on the provider, cache hits can reduce your input token expenses by 50% to 90%.
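
As a worked example of the arithmetic, the sketch below compares input cost with and without caching. The price and the discount are hypothetical placeholders chosen from within the range quoted above, not any provider's actual rates, and the calculation ignores any surcharge a provider may apply when the cache is first written.

```python
def input_cost(static_tokens: int, dynamic_tokens: int, requests: int,
               price_per_mtok: float, cached_discount: float) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) in the price's currency.

    cached_discount is the fraction saved on cached tokens, e.g. 0.9 for 90% off.
    The first request is assumed to be a cache miss, all later ones hits.
    """
    per_token = price_per_mtok / 1_000_000
    uncached = requests * (static_tokens + dynamic_tokens) * per_token
    first = (static_tokens + dynamic_tokens) * per_token
    later = (requests - 1) * (
        static_tokens * (1 - cached_discount) + dynamic_tokens
    ) * per_token
    return uncached, first + later


# Hypothetical numbers: 10,000 static tokens, 50 dynamic tokens, 1,000 requests,
# 3.00 per million input tokens, 90% discount on cached tokens.
without, with_cache = input_cost(10_000, 50, 1_000, 3.00, 0.9)
```

Under these assumptions the input bill falls from about 30.15 to about 3.18, a saving of roughly 89%. The saving stays just below the 90% discount because the dynamic tokens and the first request are still billed at the full rate.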

Speed Optimization: Because the model skips recomputing the Key-Value state for the cached prefix, prompt processing time drops sharply. For large contexts, time to first token can decrease by up to 80%. The time needed to generate the output tokens themselves is unaffected.

Ideal Use Cases

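  • Retrieval-Augmented Generation (RAG): When a chat application places the same knowledge base or product documentation at the start of the prompt across many user sessions. If the retrieved passages differ on every query, only the shared part in front of them (such as the system prompt) can be cached.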
  • Coding Assistants: When an AI tool needs to maintain awareness of a large, unchanging code repository while a developer asks rapid, successive debugging questions.
  • Complex Agent Workflows: Multi-step AI agents that resend the same extensive tool definitions, system constraints, and accumulated step history on every iteration of a long execution run.