
Anthropic Context Engineering: Official Guidance vs Industry Practice

Last Updated: 2026-03-23

Sources:
- Effective context engineering for AI agents (Anthropic engineering blog)
- Effective harnesses for long-running agents (Anthropic engineering blog)
- How Claude Code works (Claude Code official docs)
- Managing context on the Claude Developer Platform (Anthropic news)
- Context Engineering from Claude - Bojie Li analysis (community deep dive)
- Compaction API docs
- Context windows docs


Compression + Retrieval: How Official Docs Map to Two Fundamental Operations

All of Anthropic's context management guidance maps cleanly onto two operations: compression (reduce information) and retrieval (select relevant information).

Compression Side

From Claude Code official docs:

"Claude Code manages context automatically as you approach the limit. It clears older tool outputs first, then summarizes the conversation if needed."

Two layers of compression, escalating in aggressiveness:

| Layer | Mechanism | Cost | Information loss |
| --- | --- | --- | --- |
| 1st | Tool result clearing | Minimal — removes raw output, keeps the fact that the tool was called | Low — results can be re-fetched |
| 2nd | Conversation summarization | An LLM call — generates a structured summary | Higher — nuance and detail may be lost |

Users can direct compression focus:

"Add a 'Compact Instructions' section to CLAUDE.md or run /compact with a focus (like /compact focus on the API changes)"

Retrieval Side

Three retrieval modes, from static to dynamic:

| Mode | What is loaded | When loaded | Token cost |
| --- | --- | --- | --- |
| Pre-load | CLAUDE.md, auto memory, system instructions | Session start | Fixed per session |
| On-demand | Skills (descriptions pre-loaded, full content loaded on use) | When a skill is triggered | Variable |
| JIT exploration | Files via glob/grep/Read, web via WebSearch/WebFetch | During the agentic loop | Variable |

Sub-agents as a retrieval strategy:

"Subagents get their own fresh context, completely separate from your main conversation. Their work doesn't bloat your context. When done, they return a summary."

This is retrieval + compression in one operation: the sub-agent retrieves information (explores codebase), then compresses it (returns summary).

The Agentic Loop as Retrieval → Action → Retrieval

Claude Code's three-phase loop maps directly:

gather context  →  take action  →  verify results
(retrieval)        (execution)     (retrieval again)

The first phase is retrieval (glob, grep, Read to understand the codebase). The third phase is also retrieval (run tests, check output). Compression happens implicitly throughout (tool results accumulate until cleared or summarized).
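A minimal sketch of that loop, assuming a hypothetical call_model/run_tool interface (the names and message shapes are illustrative, not Claude Code's internals):

```python
# Sketch of a gather -> act -> verify loop.
# call_model() and run_tool() are hypothetical stand-ins, not Claude Code APIs.
def agentic_loop(task: str, call_model, run_tool, max_steps: int = 20):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)            # model decides: explore, edit, or verify
        messages.append({"role": "assistant", "content": reply["text"]})
        if not reply.get("tool_calls"):         # no more tools requested -> done
            return messages
        for call in reply["tool_calls"]:
            # Phase 1 and 3 tools (glob/grep/Read, tests) are retrieval;
            # phase 2 tools (edit/bash) are action. All results land in context.
            result = run_tool(call["name"], call["args"])
            messages.append({"role": "tool", "name": call["name"], "content": result})
    return messages
```

Note how every tool result is appended to the transcript; this is exactly the accumulation that the clearing and summarization layers later undo.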

What the Official Docs Don't Say

The official docs describe what happens but not why these particular tradeoffs were chosen:
- Why clear tool results before summarizing? Because re-fetching a file is cheap, while re-deriving a decision is expensive.
- Why pre-load CLAUDE.md but not skills? Because CLAUDE.md is always relevant, while skills are only conditionally relevant.
- Why do sub-agents return summaries rather than full transcripts? Because the parent needs the conclusion, not the journey.

These are compression/retrieval tradeoff decisions that become clear through the two-operation lens.


Core Thesis

Anthropic frames context engineering as the evolution beyond prompt engineering:

"What configuration of context is most likely to generate our model's desired behavior?"

The guiding principle: find the smallest set of high-signal tokens that maximizes the likelihood of the desired outcome. Context is a finite resource with diminishing returns as token count increases.

The Four Types of Context Rot

Anthropic identifies four mechanisms by which context quality degrades:

| Type | Description | Example |
| --- | --- | --- |
| Context Poisoning | Incorrect or outdated information corrupts reasoning | A stale tool result from a file that has since been modified |
| Context Distraction | Irrelevant information reduces focus | 50 unrelated tool outputs from earlier exploration |
| Context Confusion | Similar but distinct information causes misassociation | Two files with similar names but different purposes |
| Context Clash | Contradictory information creates uncertainty | Old and new versions of the same config in context |

How studied agents handle these:

| Rot type | Pi | OpenClaw | Gemini CLI | Claude Code | Codex | OpenCode |
| --- | --- | --- | --- | --- | --- | --- |
| Poisoning | No protection | No protection | No protection | system-reminder-file-modified-by-user-or-linter detects external changes | No protection | No protection |
| Distraction | No mitigation (full context) | limitHistoryTurns removes old turns | Tool output pre-summarization | Sub-agents isolate exploration context | Per-item truncation removes large outputs | prune() erases old tool outputs |
| Confusion | No mitigation | Provider-specific turn validation | No mitigation | 20+ system reminders provide clarifying context | No mitigation | No mitigation |
| Clash | No mitigation | No mitigation | No mitigation | No mitigation | No mitigation | Fork/revert lets the user branch away from conflicting state |

Finding: Only Claude Code actively addresses context poisoning (via file modification detection). Most agents have no defense against confusion or clash. Distraction is the most commonly addressed type, through various truncation and pruning strategies.
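The poisoning defense amounts to noticing that a file read earlier has changed and telling the model so. A rough sketch of the idea follows; the bookkeeping and reminder text here are illustrative, not Claude Code's actual implementation:

```python
import os

# Track mtimes of files the agent has read; emit a reminder when one changes.
# Illustrative only - Claude Code's real mechanism is an injected system-reminder message.
_read_mtimes: dict[str, float] = {}

def record_read(path: str) -> None:
    _read_mtimes[path] = os.path.getmtime(path)

def stale_file_reminders() -> list[str]:
    reminders = []
    for path, seen in _read_mtimes.items():
        if os.path.exists(path) and os.path.getmtime(path) != seen:
            reminders.append(
                f"<system-reminder>{path} was modified outside this session; "
                "its earlier contents in context may be stale.</system-reminder>"
            )
    return reminders
```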

Three Strategies for Long-Horizon Tasks

Anthropic recommends three approaches, ordered from lightest to heaviest:

Strategy 1: Tool Result Clearing

"Once a tool has been called deep in the message history, why would the agent need to see the raw result again?"

This is described as "the safest, lightest touch form of compaction." Available as a Claude Developer Platform feature (context_editing).

Who does this:

| Agent | Approach |
| --- | --- |
| Codex | Per-item truncation at record time (10KB default) — most aggressive |
| Gemini CLI | Pre-summarization of large tool outputs + reverse token budget (protect recent, truncate old) |
| OpenCode | prune() erases tool outputs outside the most recent 40K tokens of outputs |
| Claude Code | Server-side context_editing API |
| Pi | Nothing — full tool results stay in context forever |
| OpenClaw | Nothing at the agent level (inherited Pi behavior) |
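A local approximation of the technique is to walk the transcript and replace old tool results with a stub, keeping the fact that the call happened. This is a sketch of the idea, not the Developer Platform's context_editing feature, which does the equivalent server-side:

```python
def clear_old_tool_results(messages: list[dict], keep_last: int = 5) -> list[dict]:
    """Replace all but the most recent tool results with a short placeholder."""
    tool_indices = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    to_clear = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    cleared = []
    for i, m in enumerate(messages):
        if i in to_clear:
            # Keep the message slot so the model still sees that the tool ran.
            m = {**m, "content": "[tool result cleared to save context; re-run the tool if needed]"}
        cleared.append(m)
    return cleared
```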

Strategy 2: Compaction (Summarization)

"Distills the contents of a context window in a high-fidelity manner, enabling the agent to continue with minimal performance degradation."

Anthropic emphasizes: start by maximizing recall (preserve everything), then iterate to eliminate superfluous content.

How each agent implements this:

| Agent | Compaction approach | Sections in summary |
| --- | --- | --- |
| Pi | Single LLM call, no verification | 6 (Goal, Constraints, Progress, Decisions, Next Steps, Critical Context) |
| Codex (local) | Single LLM call | 4 (Progress, Constraints, Remaining, Data) |
| Codex (OpenAI) | Server-side encrypted opaque state | N/A (model-internal) |
| Gemini CLI | Two-pass: generate + verification probe | <state_snapshot> (free-form) |
| Claude Code | Server-side API, 3 analysis variants | 9 (most detailed — includes "All user messages" and verbatim quotes) |
| OpenCode | Single LLM call, plugin-extensible | 5 (Goal, Instructions, Discoveries, Accomplished, Files) |

Gap identified: Anthropic says "start by maximizing recall" but only Gemini CLI actually verifies recall with a second LLM call. All others generate once and hope for the best.
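Gemini CLI's two-pass shape — generate a summary, then probe it for recall — can be sketched as below. The prompts and the passed-in llm callable are illustrative, not any agent's actual wording:

```python
SUMMARY_PROMPT = """Summarize the conversation so far into these sections:
Goal, Constraints, Progress, Decisions, Next Steps, Critical Context.
Maximize recall: prefer including a detail over dropping it."""

VERIFY_PROMPT = """Given a SUMMARY and the original TRANSCRIPT, list any decisions,
constraints, or file names present in the transcript but missing from the summary.
Reply 'OK' if nothing is missing."""

def compact_with_verification(transcript: str, llm) -> str:
    # Pass 1: generate the structured summary.
    summary = llm(SUMMARY_PROMPT + "\n\n" + transcript)
    # Pass 2: verification probe - check recall against the transcript.
    gaps = llm(VERIFY_PROMPT + "\n\nSUMMARY:\n" + summary + "\n\nTRANSCRIPT:\n" + transcript)
    if gaps.strip() != "OK":
        # Regenerate once, feeding back what the probe says was missed.
        summary = llm(SUMMARY_PROMPT + "\n\nBe sure to include:\n" + gaps + "\n\n" + transcript)
    return summary
```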

Strategy 3: Sub-Agent Architectures

"Specialized agents handle focused tasks with clean context windows, returning condensed summaries (1,000-2,000 tokens)."

Anthropic explicitly recommends 1,000-2,000 token summaries from sub-agents. Let's check what actually happens:

| Agent | Sub-agent return size | Matches recommendation? |
| --- | --- | --- |
| Pi | getFinalOutput() — last assistant text, unbounded | No limit |
| Gemini CLI | ToolResult.llmContent — unbounded | No limit |
| Claude Code | Agent tool result — unbounded text | No limit |
| OpenCode | Last text part wrapped in <task_result> — unbounded | No limit |
| OpenClaw | Completion event text — unbounded | No limit |

Finding: No agent enforces the 1,000-2,000 token recommendation. Sub-agent returns are all unbounded text. This is a gap between Anthropic's advice and industry practice.
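Enforcing the 1,000-2,000 token recommendation would be a small wrapper around the sub-agent's return path — something none of the studied agents currently do. A sketch, assuming a crude 4-characters-per-token heuristic and a generic llm callable:

```python
def bounded_subagent_return(raw_result: str, llm, max_tokens: int = 2000) -> str:
    """Cap what a sub-agent hands back to its parent at roughly max_tokens."""
    approx_tokens = len(raw_result) // 4          # rough heuristic, not a real tokenizer
    if approx_tokens <= max_tokens:
        return raw_result
    return llm(
        f"Condense the following sub-agent output to under {max_tokens} tokens. "
        "Keep conclusions, file paths, and unresolved questions; drop the exploration steps.\n\n"
        + raw_result
    )
```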

Just-In-Time Context Retrieval

Anthropic recommends against pre-loading all data:

"Maintain lightweight identifiers (file paths, URLs, queries) and dynamically retrieve information during execution."

Claude Code demonstrates this: CLAUDE.md is pre-loaded, but files are retrieved via glob, grep, Read on demand.
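The same idea in miniature: keep only lightweight identifiers in context, and fetch content when the model asks for it. The helper names below are illustrative:

```python
from pathlib import Path

def list_identifiers(root: str, pattern: str = "**/*.py") -> list[str]:
    """Pre-load only lightweight identifiers (paths), not file contents."""
    return [str(p) for p in Path(root).glob(pattern)]

def retrieve(path: str, max_chars: int = 20_000) -> str:
    """Fetch a file's contents on demand, truncated defensively."""
    return Path(path).read_text(errors="replace")[:max_chars]
```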

How studied agents compare:

| Pattern | Agents | Anthropic alignment |
| --- | --- | --- |
| Pre-load everything | Pi (full context), ChatGPT Memory (all 33 facts) | Against the recommendation |
| Hybrid (pre-load + JIT) | Claude Code (CLAUDE.md + tools), Gemini CLI (GEMINI.md + tools) | Matches the recommendation |
| JIT only | None of the studied agents | The most aggressive form |
| Per-node filtering | Self-developed agent (context_filter per capability) | Beyond the recommendation (more granular) |

System Prompt Design

Anthropic's guidance:

"Find the right altitude — avoid both brittle hardcoded logic and vague guidance. Specific enough to guide behavior effectively, yet flexible enough to provide strong heuristics."

Reality check across agents:

| Agent | System prompt approach | Anthropic alignment |
| --- | --- | --- |
| Pi | ~300 words, dynamic tool guidelines | Good altitude — minimal, adaptive |
| Codex | Single comprehensive file, clear structure | Good — organized sections |
| Gemini CLI | Toggleable sections, model-aware variants | Good — adapts to model capabilities |
| Claude Code | 65+ files, 20+ dynamic reminders | Most prescriptive — risks being "too low altitude" (too specific) |
| OpenClaw | 15+ sections, 3 modes (full/minimal/none) | Good — adapts to agent role |
| OpenCode | Provider-specific prompts | Good — adapts to model family |

Tool Design

Anthropic's guidance:

"Tools should be self-contained, unambiguous, and extremely clear with respect to their intended use. If humans cannot definitively determine which tool to use, agents will not perform better."

| Agent | Tool count | Clarity approach |
| --- | --- | --- |
| Pi | 4 core (read/write/edit/bash) + optional (grep/find/ls) | Most focused |
| Codex | ~5 (shell, apply_patch, file ops) | Very focused |
| Claude Code | 18+ built-in tools | Largest set, but each has a detailed description |
| OpenClaw | 20+ tools (including messaging, sessions, cron) | Most tools — risk of "decision paralysis" per Anthropic |
| OpenCode | Standard set + custom via plugins/MCP | Extensible |
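A tool definition written to that standard might look like the following. The tool itself (search_changelog) is a made-up example; the schema shape follows the Anthropic Messages API tool format (name, description, input_schema):

```python
# A deliberately unambiguous tool definition. The tool is hypothetical;
# the dict layout follows the Anthropic Messages API tool format.
search_changelog_tool = {
    "name": "search_changelog",
    "description": (
        "Search CHANGELOG.md for entries matching a keyword. "
        "Use this ONLY for questions about released versions; "
        "for unreleased work, read the git log instead."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "keyword": {"type": "string", "description": "Exact keyword to search for"},
            "max_results": {"type": "integer", "description": "Cap on entries returned", "default": 5},
        },
        "required": ["keyword"],
    },
}
```

The description states both when to use the tool and when not to, which is the "humans can definitively determine which tool to use" bar Anthropic sets.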

Evaluation

Anthropic emphasizes:
- Baseline establishment before changes
- Negative examples defining boundaries
- LLM-as-judge with rubrics
- "Nothing perfectly replaces human evaluation"

Finding: None of the studied agents have built-in context management evaluation. No agent measures whether its compaction lost critical information, whether its context filtering improved task completion, or whether its sub-agent summaries were sufficient. This is a universal gap.
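A minimal version of the missing evaluation — an LLM-as-judge rubric that asks whether a compacted summary preserved what mattered. The rubric wording and scoring scale are illustrative:

```python
JUDGE_RUBRIC = """You are grading a context compaction.
Given the ORIGINAL transcript and the SUMMARY that replaced it, score 1-5:
5 = every decision, constraint, and open task is recoverable from the summary
1 = key decisions or constraints were lost.
Reply with the score on the first line, then the most important omission (or 'none')."""

def judge_compaction(original: str, summary: str, llm) -> tuple[int, str]:
    reply = llm(f"{JUDGE_RUBRIC}\n\nORIGINAL:\n{original}\n\nSUMMARY:\n{summary}")
    first, _, rest = reply.partition("\n")
    return int(first.strip()), rest.strip()
```

Per Anthropic's own caveat, a judge like this complements rather than replaces human spot-checks.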

Multi-Session / Long-Running Agent Pattern

From the "effective harnesses" article, Anthropic recommends:
1. A progress file (claude-progress.txt) documenting completed work
2. An init script for reproducible environment setup
3. Git commits after each feature for rollback capability
4. A two-agent pattern: initializer + coding agent

Who does this:
- OpenCode's fork/revert system is the closest to this pattern (filesystem snapshots + rollback)
- Claude Code's memory system (CLAUDE.md + auto memory) partially addresses cross-session continuity
- No other agent has structured multi-session state management
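Most of the harness pattern is plumbing: append to a progress file and commit after each feature so the next session can re-orient cheaply. A sketch (the file name follows the article; the helper itself is illustrative):

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

PROGRESS_FILE = Path("claude-progress.txt")   # name from the Anthropic article

def record_feature(description: str) -> None:
    """Append completed work to the progress file and commit it alongside the feature."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    with PROGRESS_FILE.open("a") as f:
        f.write(f"[{stamp}] DONE: {description}\n")
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"feat: {description}"], check=True)
```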

Summary: Recommendation Compliance

| Anthropic recommendation | Fully compliant | Partially | Not at all |
| --- | --- | --- | --- |
| Tool result clearing | Codex, OpenCode | Gemini CLI | Pi, OpenClaw |
| High-fidelity compaction | Claude Code (9 sections) | Gemini CLI (with verification) | Pi, Codex local (minimal) |
| Sub-agent 1-2K token returns | None | — | All (unbounded returns) |
| JIT context retrieval | Claude Code, Gemini CLI | OpenCode | Pi |
| Context rot awareness | — | Claude Code (poisoning only) | All others |
| System prompt "right altitude" | Pi, Codex, OpenCode | Gemini CLI, OpenClaw | Claude Code (possibly too prescriptive) |
| Evaluation of context quality | None | — | All |