Anthropic Context Engineering: Official Guidance vs Industry Practice¶
Last Updated: 2026-03-23
Sources:

- Effective context engineering for AI agents (Anthropic engineering blog)
- Effective harnesses for long-running agents (Anthropic engineering blog)
- How Claude Code works (Claude Code official docs)
- Managing context on the Claude Developer Platform (Anthropic news)
- Context Engineering from Claude - Bojie Li analysis (community deep dive)
- Compaction API docs
- Context windows docs
Compression + Retrieval: How Official Docs Map to Two Fundamental Operations¶
All of Anthropic's context management guidance maps cleanly onto two operations: compression (reduce information) and retrieval (select relevant information).
Compression Side¶
From Claude Code official docs:
"Claude Code manages context automatically as you approach the limit. It clears older tool outputs first, then summarizes the conversation if needed."
Two layers of compression, escalating in aggressiveness:
| Layer | Mechanism | Cost | Information loss |
|---|---|---|---|
| 1st | Tool result clearing | Minimal — removes raw output, keeps the fact that the tool was called | Low — results can be re-fetched |
| 2nd | Conversation summarization | LLM call — generates structured summary | Higher — nuance and detail may be lost |
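A minimal sketch of this escalation, assuming a crude chars/4 token heuristic and a placeholder `llmSummarize()` where the real model call would go; this illustrates the layering, not Claude Code's implementation:

```typescript
// Sketch of the two-layer escalation; not Claude Code's actual code.
// countTokens() is a crude chars/4 heuristic; llmSummarize() is a
// placeholder where the real summarization model call would go.

interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
  toolName?: string; // set on tool-result messages
}

const LIMIT = 180_000; // hypothetical context budget, in tokens

const countTokens = (msgs: Message[]): number =>
  Math.ceil(msgs.reduce((n, m) => n + m.content.length, 0) / 4);

async function llmSummarize(msgs: Message[]): Promise<Message> {
  return { role: "assistant", content: `[summary of ${msgs.length} messages]` };
}

async function compress(history: Message[]): Promise<Message[]> {
  if (countTokens(history) < LIMIT) return history;

  // Layer 1: clear tool results, oldest first. The fact that the tool
  // was called survives, and the raw output can always be re-fetched.
  const cleared = history.map((m) => ({ ...m }));
  for (const m of cleared) {
    if (countTokens(cleared) < LIMIT) break;
    if (m.role === "tool") m.content = `[output of ${m.toolName} cleared]`;
  }
  if (countTokens(cleared) < LIMIT) return cleared;

  // Layer 2: summarize the conversation. Lossier: nuance may not survive.
  const summary = await llmSummarize(cleared);
  return [summary, ...cleared.slice(-10)]; // keep the newest turns verbatim
}
```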
Users can direct compression focus:
"Add a 'Compact Instructions' section to CLAUDE.md or run
/compact with a focus (like /compact focus on the API changes)"
Retrieval Side¶
Three retrieval modes, from static to dynamic:
| Mode | What | When loaded | Token cost |
|---|---|---|---|
| Pre-load | CLAUDE.md, auto memory, system instructions | Session start | Fixed per session |
| On-demand | Skills (descriptions pre-loaded, full content loaded on use) | When skill is triggered | Variable |
| JIT exploration | Files via glob/grep/Read, web via WebSearch/WebFetch | During agentic loop | Variable |
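The on-demand row is the mechanism worth spelling out: only skill descriptions are paid for at session start, and a skill's full body is loaded when it triggers. A sketch under assumed names (`Skill` and `loadSkillBody` are illustrative, not any agent's real API):

```typescript
// Sketch of on-demand skill loading: descriptions are pre-loaded at
// session start, full bodies only when a skill triggers. Skill and
// loadSkillBody are illustrative names, not any agent's real API.

import { readFileSync } from "node:fs";

interface Skill {
  name: string;
  description: string; // small; always in context
  path: string;        // lightweight identifier for the full body
}

// Fixed per-session cost: one line per skill.
function preloadBlock(skills: Skill[]): string {
  return skills.map((s) => `- ${s.name}: ${s.description}`).join("\n");
}

// Variable cost, paid only when the model actually invokes the skill.
function loadSkillBody(skill: Skill): string {
  return readFileSync(skill.path, "utf8");
}
```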
Sub-agents as a retrieval strategy:
"Subagents get their own fresh context, completely separate from your main conversation. Their work doesn't bloat your context. When done, they return a summary."
This is retrieval + compression in one operation: the sub-agent retrieves information (explores codebase), then compresses it (returns summary).
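A sketch of that boundary, with `runAgentLoop` as a placeholder for a full agent loop: the sub-agent works against a fresh history, and only its final text crosses back to the parent.

```typescript
// Sketch: sub-agent as retrieval + compression in one operation.
// runAgentLoop is a placeholder standing in for a full agent loop.

interface Msg { role: "system" | "user" | "assistant"; content: string }

async function runAgentLoop(history: Msg[]): Promise<Msg[]> {
  // A real loop would call tools and the model here.
  return [...history, { role: "assistant", content: "[summary of findings]" }];
}

async function exploreViaSubAgent(task: string): Promise<string> {
  // Fresh context: the sub-agent never sees the parent's history, and
  // its exploration (file reads, greps) never bloats that history.
  const finished = await runAgentLoop([
    { role: "system", content: "Explore the codebase and answer concisely." },
    { role: "user", content: task },
  ]);
  // Only the compressed conclusion crosses the boundary.
  return finished.at(-1)?.content ?? "";
}
```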
The Agentic Loop as Retrieval → Action → Retrieval¶
Claude Code's three-phase loop (gather context, take action, verify work) maps directly onto these operations:
The first phase is retrieval (glob, grep, Read to understand the codebase). The third phase is retrieval again (run tests, check output). Compression happens implicitly throughout, as tool results accumulate until they are cleared or summarized.
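Schematically (the three function names are illustrative stand-ins, not Claude Code's internals):

```typescript
// Sketch of the loop's retrieval → action → retrieval shape.
// All three functions are illustrative, not Claude Code's internals.

const gatherContext = async (goal: string) => `[files relevant to: ${goal}]`;
const takeAction = async (_goal: string, _ctx: string) => { /* edits, commands */ };
const verifyWork = async () => ({ passed: true, output: "[test output]" });

async function agentTurn(goal: string): Promise<void> {
  let passed = false;
  while (!passed) {
    const ctx = await gatherContext(goal); // phase 1, retrieval: glob, grep, Read
    await takeAction(goal, ctx);           // phase 2, action: apply edits
    ({ passed } = await verifyWork());     // phase 3, retrieval: tests, output
    // Tool results from every phase accumulate until cleared or summarized.
  }
}
```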
What the Official Docs Don't Say¶
The official docs describe what happens but not why these specific tradeoffs were chosen:

- Why clear tool results before summarizing? (Because re-fetching a file is cheap; re-deriving a decision is expensive)
- Why pre-load CLAUDE.md but not skills? (Because CLAUDE.md is always relevant; skills are conditionally relevant)
- Why do sub-agents return summaries, not full transcripts? (Because the parent needs the conclusion, not the journey)
These are compression/retrieval tradeoff decisions that become clear through the two-operation lens.
Core Thesis¶
Anthropic frames context engineering as the evolution beyond prompt engineering:
"What configuration of context is most likely to generate our model's desired behavior?"
The guiding principle: find the smallest set of high-signal tokens that maximize the likelihood of the desired outcome. Context is a finite resource with diminishing returns as token count increases.
The Four Types of Context Rot¶
Anthropic identifies four mechanisms by which context quality degrades:
| Type | Description | Example |
|---|---|---|
| Context Poisoning | Incorrect/outdated information corrupts reasoning | Stale tool result from a file that's since been modified |
| Context Distraction | Irrelevant information reduces focus | 50 unrelated tool outputs from earlier exploration |
| Context Confusion | Similar but distinct information causes misassociation | Two files with similar names but different purposes |
| Context Clash | Contradictory information creates uncertainty | Old and new versions of the same config in context |
How studied agents handle these:
| Rot type | Pi | OpenClaw | Gemini CLI | Claude Code | Codex | OpenCode |
|---|---|---|---|---|---|---|
| Poisoning | No protection | No protection | No protection | `system-reminder-file-modified-by-user-or-linter` detects external changes | No protection | No protection |
| Distraction | No mitigation (full context) | `limitHistoryTurns` removes old turns | Tool output pre-summarization | Sub-agents isolate exploration context | Per-item truncation removes large outputs | `prune()` erases old tool outputs |
| Confusion | No mitigation | Provider-specific turn validation | No mitigation | 20+ system reminders provide clarifying context | No mitigation | No mitigation |
| Clash | No mitigation | No mitigation | No mitigation | No mitigation | No mitigation | Fork/revert lets the user branch away from conflicting state |
Finding: Only Claude Code actively addresses context poisoning (via file modification detection). Most agents have no defense against confusion or clash. Distraction is the most commonly addressed type, through various truncation and pruning strategies.
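The one deployed poisoning defense is simple in principle: record a file's mtime when it enters context, and inject a reminder when the file changes externally before the model acts on the stale copy. A sketch of that idea (the reminder markup is illustrative, not Claude Code's actual format):

```typescript
// Sketch of stale-read detection: record mtime when a file enters
// context, and emit a reminder when it later changes externally.
// The reminder markup is illustrative, not Claude Code's actual format.

import { readFileSync, statSync } from "node:fs";

const readTimes = new Map<string, number>(); // path -> mtime at last read

function trackedRead(path: string): string {
  readTimes.set(path, statSync(path).mtimeMs);
  return readFileSync(path, "utf8");
}

function stalenessReminders(): string[] {
  const reminders: string[] = [];
  for (const [path, seenAt] of readTimes) {
    if (statSync(path).mtimeMs > seenAt) {
      reminders.push(
        `<system-reminder>${path} was modified externally; the copy in ` +
          `context may be stale. Re-read it before editing.</system-reminder>`,
      );
    }
  }
  return reminders; // injected ahead of the next model turn
}
```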
Three Strategies for Long-Horizon Tasks¶
Anthropic recommends three approaches, in order of lightweight → heavyweight:
Strategy 1: Tool Result Clearing¶
"Once a tool has been called deep in the message history, why would the agent need to see the raw result again?"
This is described as "the safest, lightest touch form of compaction," and it is available as a Claude Developer Platform feature (`context_editing`).
Who does this:
| Agent | Approach |
|---|---|
| Codex | Per-item truncation at record time (10KB default) — most aggressive |
| Gemini CLI | Pre-summarization of large tool outputs + reverse token budget (protect recent, truncate old) |
| OpenCode | `prune()` erases tool outputs that fall outside the most recent 40K tokens of output |
| Claude Code | Server-side `context_editing` API |
| Pi | Nothing — full tool results stay in context forever |
| OpenClaw | Nothing at agent level (inherited Pi behavior) |
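The OpenCode and Gemini CLI entries share a shape: walk the history newest-to-oldest, spend a fixed budget on recent tool output, and erase whatever falls outside it. A sketch with an assumed chars/4 token heuristic, not any agent's actual code:

```typescript
// Sketch of reverse-budget pruning: protect the most recent ~40K tokens
// of tool output, erase everything older. Not any agent's actual code;
// token counting is a crude chars/4 heuristic.

interface HistoryItem { role: string; content: string }

const approxTokens = (s: string) => Math.ceil(s.length / 4);

function pruneOldToolOutputs(history: HistoryItem[], budget = 40_000): void {
  let spent = 0;
  // Walk newest-to-oldest so recent results are protected first.
  for (let i = history.length - 1; i >= 0; i--) {
    const item = history[i];
    if (item.role !== "tool") continue;
    spent += approxTokens(item.content);
    if (spent > budget) {
      item.content = "[tool output pruned; re-run the tool if needed]";
    }
  }
}
```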
Strategy 2: Compaction (Summarization)¶
"Distills the contents of a context window in a high-fidelity manner, enabling the agent to continue with minimal performance degradation."
Anthropic emphasizes: start by maximizing recall (preserve everything), then iterate to eliminate superfluous content.
How each agent implements this:
| Agent | Compaction quality approach | Sections in summary |
|---|---|---|
| Pi | Single LLM call, no verification | 6 (Goal, Constraints, Progress, Decisions, Next Steps, Critical Context) |
| Codex (local) | Single LLM call | 4 (Progress, Constraints, Remaining, Data) |
| Codex (OpenAI) | Server-side encrypted opaque state | N/A (model-internal) |
| Gemini CLI | Two-pass: generate + verification probe | <state_snapshot> (free-form) |
| Claude Code | Server-side API, 3 analysis variants | 9 sections (most detailed — includes "All user messages" and verbatim quotes) |
| OpenCode | Single LLM call, plugin-extensible | 5 (Goal, Instructions, Discoveries, Accomplished, Files) |
Gap identified: Anthropic says "start by maximizing recall" but only Gemini CLI actually verifies recall with a second LLM call. All others generate once and hope for the best.
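What recall verification can look like, sketched with an assumed generic `llm()` completion helper (this mirrors the two-pass idea attributed to Gemini CLI above, not its actual code): generate, probe whether key facts survive using only the summary, and regenerate on failure.

```typescript
// Sketch of two-pass compaction: generate a summary, then probe whether
// critical facts survive. llm() is an assumed generic completion helper.

async function llm(prompt: string): Promise<string> {
  return "[model response]"; // placeholder for a real model call
}

async function compactWithVerification(transcript: string): Promise<string> {
  let summary = await llm(
    `Summarize this session. Maximize recall: keep goals, constraints, ` +
      `decisions, and open tasks.\n\n${transcript}`,
  );
  for (let attempt = 0; attempt < 2; attempt++) {
    // The probe may use ONLY the summary; the transcript is withheld.
    const probe = await llm(
      `Using only this summary, state the session's goal, constraints, ` +
        `and next steps. Reply UNKNOWN for anything missing.\n\n${summary}`,
    );
    if (!probe.includes("UNKNOWN")) break; // recall check passed
    summary = await llm(
      `The previous summary dropped required facts. Regenerate it with ` +
        `higher recall.\n\n${transcript}`,
    );
  }
  return summary;
}
```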
Strategy 3: Sub-Agent Architectures¶
"Specialized agents handle focused tasks with clean context windows, returning condensed summaries (1,000-2,000 tokens)."
Anthropic explicitly recommends 1,000-2,000 token summaries from sub-agents. Let's check what actually happens:
| Agent | Sub-agent return size | Matches recommendation? |
|---|---|---|
| Pi | `getFinalOutput()` — last assistant text, unbounded | No limit |
| Gemini CLI | `ToolResult.llmContent` — unbounded | No limit |
| Claude Code | Agent tool result — unbounded text | No limit |
| OpenCode | Last text part wrapped in `<task_result>` — unbounded | No limit |
| OpenClaw | Completion event text — unbounded | No limit |
Finding: No agent enforces the 1,000-2,000 token recommendation. Sub-agent returns are all unbounded text. This is a gap between Anthropic's advice and industry practice.
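Enforcing the recommendation would be a small gate at the boundary; the key design choice is to re-summarize on overflow rather than truncate, since the conclusion often sits at the end. A hedged sketch of what no studied agent currently does (`llm()` is an assumed helper):

```typescript
// Sketch of a return-size gate no studied agent currently has: cap
// sub-agent returns near the recommended budget, re-summarizing rather
// than truncating. llm() is assumed; chars/4 approximates tokens.

const approxTokens = (s: string) => Math.ceil(s.length / 4);

async function llm(prompt: string): Promise<string> {
  return "[condensed summary]"; // placeholder for a real model call
}

async function capSubAgentReturn(raw: string, cap = 2_000): Promise<string> {
  if (approxTokens(raw) <= cap) return raw;
  // Re-summarize instead of truncating: truncation drops the tail,
  // which often holds the conclusion the parent actually needs.
  return llm(
    `Condense the following to under ${cap} tokens, keeping conclusions ` +
      `and file paths:\n\n${raw}`,
  );
}
```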
Just-In-Time Context Retrieval¶
Anthropic recommends against pre-loading all data:
"Maintain lightweight identifiers (file paths, URLs, queries) and dynamically retrieve information during execution."
Claude Code demonstrates this: CLAUDE.md is pre-loaded, but files are retrieved via glob, grep, Read on demand.
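The recommendation reduces to carrying references instead of contents. A sketch of the two postures (names are illustrative):

```typescript
// Sketch: carry lightweight identifiers in context and dereference them
// on demand, instead of pre-loading contents. Names are illustrative.

import { readFileSync } from "node:fs";

interface ContextRef {
  path: string;   // cheap to carry: a few tokens
  reason: string; // why the agent might need it later
}

// Pre-loading pays for every byte up front, relevant or not:
const preloadAll = (paths: string[]): string =>
  paths.map((p) => readFileSync(p, "utf8")).join("\n");

// JIT carries only refs and pays for content at the moment of use:
const fetchOnDemand = (refs: ContextRef[], needed: string): string | undefined => {
  const ref = refs.find((r) => r.path === needed);
  return ref && readFileSync(ref.path, "utf8");
};
```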
How studied agents compare:
| Pattern | Agents | Anthropic alignment |
|---|---|---|
| Pre-load everything | Pi (full context), ChatGPT Memory (all 33 facts) | Against recommendation |
| Hybrid (pre-load + JIT) | Claude Code (CLAUDE.md + tools), Gemini CLI (GEMINI.md + tools) | Matches recommendation |
| JIT only | None of the studied agents | Most aggressive form |
| Per-node filtering | Self-developed agent (context_filter per capability) | Beyond recommendation (more granular) |
System Prompt Design¶
Anthropic's guidance:
"Find the right altitude — avoid both brittle hardcoded logic and vague guidance. Specific enough to guide behavior effectively, yet flexible enough to provide strong heuristics."
Reality check across agents:
| Agent | System prompt approach | Anthropic alignment |
|---|---|---|
| Pi | ~300 words, dynamic tool guidelines | Good altitude — minimal, adaptive |
| Codex | Single comprehensive file, clear structure | Good — organized sections |
| Gemini CLI | Toggleable sections, model-aware variants | Good — adapts to model capabilities |
| Claude Code | 65+ files, 20+ dynamic reminders | Most prescriptive — risks being "too low altitude" (too specific) |
| OpenClaw | 15+ sections, 3 modes (full/minimal/none) | Good — adapts to agent role |
| OpenCode | Provider-specific prompts | Good — adapts to model family |
Tool Design¶
Anthropic's guidance:
"Tools should be self-contained, unambiguous, and extremely clear with respect to their intended use. If humans cannot definitively determine which tool to use, agents will not perform better."
| Agent | Tool count | Clarity approach |
|---|---|---|
| Pi | 4 core (read/write/edit/bash) + optional (grep/find/ls) | Most focused |
| Codex | ~5 (shell, apply_patch, file ops) | Very focused |
| Claude Code | 18+ built-in tools | Largest set, but each has detailed description |
| OpenClaw | 20+ tools (including messaging, sessions, cron) | Most tools — risk of "decision paralysis" per Anthropic |
| OpenCode | Standard set + custom via plugins/MCP | Extensible |
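The clarity bar is easiest to see in a single tool definition: the description should say when to use the tool and when to reach for a sibling instead. A generic sketch, not any particular agent's schema:

```typescript
// Sketch of an unambiguous tool definition: the description says when
// to use the tool AND when to reach for a sibling instead, so tools
// don't overlap. Generic shape, not any particular agent's schema.

const grepTool = {
  name: "grep",
  description:
    "Search file CONTENTS for a regex pattern. Use this to find where " +
    "a symbol is defined or referenced. Do NOT use it to find files by " +
    "name (use glob) or to read a file you already know (use read).",
  parameters: {
    pattern: { type: "string", description: "Regex to search for" },
    path: { type: "string", description: "Directory to search; defaults to cwd" },
  },
} as const;
```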
Evaluation¶
Anthropic emphasizes:

- Baseline establishment before changes
- Negative examples defining boundaries
- LLM-as-judge with rubrics
- "Nothing perfectly replaces human evaluation"
Finding: None of the studied agents have built-in context management evaluation. No agent measures whether its compaction lost critical information, whether its context filtering improved task completion, or whether its sub-agent summaries were sufficient. This is a universal gap.
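Closing this gap would not take much machinery; even a rubric-scored LLM-as-judge pass over transcript/summary pairs would catch lossy compaction. A sketch with an assumed `llm()` helper:

```typescript
// Sketch of a compaction check no studied agent ships: an LLM judge
// scores the summary against the transcript on a fixed rubric.
// llm() is an assumed helper returning the judge's JSON verdict.

async function llm(prompt: string): Promise<string> {
  return '{"goal": 5, "constraints": 4, "decisions": 5}'; // placeholder
}

async function compactionHeldUp(transcript: string, summary: string): Promise<boolean> {
  const rubric =
    'Score 1-5 how well the SUMMARY preserves each item from the ' +
    'TRANSCRIPT. Reply as JSON: {"goal": n, "constraints": n, "decisions": n}';
  const verdict = JSON.parse(
    await llm(`${rubric}\n\nTRANSCRIPT:\n${transcript}\n\nSUMMARY:\n${summary}`),
  ) as Record<string, number>;
  // Escalate to a human when any dimension drops below threshold;
  // per the guidance, nothing perfectly replaces human evaluation.
  return Object.values(verdict).every((score) => score >= 4);
}
```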
Multi-Session / Long-Running Agent Pattern¶
From the "effective harnesses" article, Anthropic recommends:
1. Progress file (claude-progress.txt) documenting completed work
2. Init script for reproducible environment setup
3. Git commits after each feature for rollback capability
4. Two-agent pattern: initializer + coding agent
Who does this:

- OpenCode's fork/revert system is the closest to this pattern (filesystem snapshots + rollback)
- Claude Code's memory system (CLAUDE.md + auto memory) partially addresses cross-session continuity
- No other agent has structured multi-session state management
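A sketch of the harness side of the pattern listed above, with `claude-progress.txt` and one commit per feature (illustrative, not Anthropic's reference harness):

```typescript
// Sketch of the harness side: append finished work to a progress file
// and commit after each feature so any later session can resume or
// roll back. Illustrative; not Anthropic's reference harness.

import { appendFileSync } from "node:fs";
import { execSync } from "node:child_process";

function recordFeatureDone(feature: string, notes: string): void {
  // 1. Progress file: the next session pre-loads this instead of
  //    re-deriving state from the codebase.
  appendFileSync(
    "claude-progress.txt",
    `\n## ${new Date().toISOString()} ${feature}\n${notes}\n`,
  );
  // 2. Commit per feature: rollback granularity for the harness.
  execSync("git add -A");
  execSync(`git commit -m ${JSON.stringify(`feat: ${feature}`)}`);
}
```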
Summary: Recommendation Compliance¶
| Anthropic Recommendation | Fully compliant | Partially | Not at all |
|---|---|---|---|
| Tool result clearing | Codex, OpenCode, Claude Code (server-side `context_editing`) | Gemini CLI | Pi, OpenClaw |
| High-fidelity compaction | Claude Code (9 sections) | Gemini CLI (with verification), OpenCode (5 sections) | Pi, Codex local (minimal) |
| Sub-agent 1-2K token returns | None | | All (unbounded returns) |
| JIT context retrieval | Claude Code, Gemini CLI | OpenCode | Pi |
| Context rot awareness | None | Claude Code (three of four types), OpenClaw, Gemini CLI, Codex, OpenCode | Pi |
| System prompt "right altitude" | Pi, Codex, OpenCode | Gemini CLI, OpenClaw | Claude Code (possibly too prescriptive) |
| Evaluation of context quality | None | | All |