# Context Management in LLM Agents: Research Summary
Last Updated: 2026-03-23
## Studied Agents
| Agent | Type | Language | Source |
|---|---|---|---|
| Pi | Open source | TypeScript | pi.research.md |
| OpenClaw | Open source | TypeScript | openclaw.research.md |
| Gemini CLI | Open source | TypeScript | gemini-cli.research.md |
| Claude Code | Closed source (prompts extracted) | TypeScript (Bun binary) | claude-code-context.research.md |
| Codex | Open source | Rust | codex-context.research.md |
| OpenCode | Open source | TypeScript/Bun | opencode.research.md |
## Additional References
| Source | Type | File |
|---|---|---|
| Anthropic official guidance | Best practices + compliance analysis | anthropic-context-engineering.research.md |
## Universal Pattern
All agents share the same basic model of context management:
Specifically:

- A single array (or equivalent) stores the conversation history
- Every LLM call sends the full accumulated history
- When the context limit approaches, an LLM-generated summary replaces older content
- The summary is injected as a user-role message so the conversation can continue
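A minimal sketch of this shared loop, assuming hypothetical `callLLM` and `countTokens` helpers (real agents differ in trigger threshold, summary prompt, and how much recent history is kept verbatim):

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical stand-ins for the agent's LLM client and tokenizer.
declare function callLLM(history: Message[]): Promise<Message>;
declare function countTokens(history: Message[]): number;

const CONTEXT_LIMIT = 200_000; // model context window, in tokens
const COMPACT_AT = 0.8;        // compact at 80% full (studied agents range from 50% to "nearly full")

async function step(history: Message[], userInput: string): Promise<Message[]> {
  history.push({ role: "user", content: userInput });

  // Reactive compaction: when the accumulated history nears the limit,
  // ask the LLM for a summary, then replace older content with it.
  if (countTokens(history) > CONTEXT_LIMIT * COMPACT_AT) {
    const summary = await callLLM([
      ...history,
      { role: "user", content: "Summarize this conversation, preserving key facts, decisions, and open tasks." },
    ]);
    history = [
      history[0],                                 // keep the system prompt
      { role: "user", content: summary.content }, // summary enters as a user-role message
      ...history.slice(-4),                       // keep the most recent turns verbatim
    ];
  }

  // Every call sends the full (possibly compacted) history.
  history.push(await callLLM(history));
  return history;
}
```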
## Architecture Spectrum
```
Single-loop agents                                          Multi-node workflow
(one context, one LLM)                           (multiple contexts, multiple LLMs)

Pi ── Codex ───── Gemini CLI ── Claude Code ─── OpenClaw ──── Self-developed agent
│     │           │             │               │             │
│     per-item    2-pass        server-side     multi-stage   dual-channel
│     truncation  verify        compaction      pipeline      (Ports + Context)
│                 + tool        + context       + pluggable   + per-node filter
│                 pre-summary   awareness       engine        + proactive summary
│
simple ──────────────────────────────────────────────────────────────── complex
```
## Key Dimensions Comparison
### Context Accumulation
| Agent | What enters context | Pre-processing |
|---|---|---|
| Pi | Full tool results, all messages | None |
| Codex | Truncated tool results (per-item, 10KB default) | Per-item truncation at record time |
| Gemini CLI | Pre-summarized large tool outputs | LLM summarization before entry + reverse token budget |
| Claude Code | Full tool results, all messages | None (API handles compaction) |
| OpenClaw | Full tool results | Multi-stage: sanitize → validate → truncate → assemble |
| OpenCode | Full tool results, pruned after 40K token budget | Two-phase: prune old tool outputs + LLM summarization |
| Self-developed agent | Summary exchange only (full exchange disabled) | Per-node context_filter |
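As an illustration of the proactive end of this column, Codex-style per-item truncation can be approximated as below. This is a hedged TypeScript reconstruction (Codex itself is Rust), and the head-only truncation shape is an assumption:

```typescript
const MAX_TOOL_OUTPUT_BYTES = 10 * 1024; // Codex's reported per-item default: 10KB

// Truncate each tool result once, at record time, so oversized outputs
// never enter the conversation history in the first place.
function recordToolResult(raw: string): string {
  const bytes = new TextEncoder().encode(raw);
  if (bytes.length <= MAX_TOOL_OUTPUT_BYTES) return raw;
  const head = new TextDecoder().decode(bytes.slice(0, MAX_TOOL_OUTPUT_BYTES));
  return `${head}\n[... truncated ${bytes.length - MAX_TOOL_OUTPUT_BYTES} bytes]`;
}
```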
### Compaction Strategy
| Agent | Location | Trigger | Method | Verification |
|---|---|---|---|---|
| Pi | Client | contextWindow - 16K reserve | Single LLM call, 6-section summary | None |
| Codex (OpenAI) | Server | Configurable threshold | Encrypted opaque compaction block | N/A (server-side) |
| Codex (other) | Client | Same as Codex (OpenAI) | Single LLM call, 4-section summary | None |
| Gemini CLI | Client | 50% of token limit | LLM summary + probe verification | 2nd LLM call verifies completeness |
| Claude Code | Server (API) | ~80% of context window | 9-section structured summary | None (but 3 analysis variants) |
| OpenClaw | Client (Pi inherited) | Same as Pi | Same as Pi, or custom ContextEngine | Depends on engine |
| OpenCode | Client | context >= usable input limit | Two-phase: prune tool outputs + LLM 5-section summary | None; plugin hook for custom compaction |
| Self-developed agent | N/A | Per-node (proactive) | summary_exchange templates | None; no reactive compaction fallback |
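OpenCode's two-phase method is the most involved client-side entry in the table. A minimal sketch, under two assumptions (pruning walks oldest-first, and the 40K figure bounds total tool-output tokens):

```typescript
type Message = { role: string; content: string; isToolResult?: boolean };

// Hypothetical helpers standing in for OpenCode's tokenizer and summarizer.
declare function countTokens(msgs: Message[]): number;
declare function llmSummarize(msgs: Message[]): Promise<string>;

const TOOL_OUTPUT_BUDGET = 40_000; // token budget before old tool outputs are pruned

async function compact(history: Message[], usableLimit: number): Promise<Message[]> {
  // Phase 1: replace the oldest tool outputs with placeholders until under budget.
  const pruned = [...history];
  for (let i = 0; i < pruned.length && countTokens(pruned) > TOOL_OUTPUT_BUDGET; i++) {
    if (pruned[i].isToolResult) {
      pruned[i] = { ...pruned[i], content: "[tool output pruned]" };
    }
  }
  // Phase 2: if still over the usable input limit, fall back to an LLM summary.
  if (countTokens(pruned) >= usableLimit) {
    return [pruned[0], { role: "user", content: await llmSummarize(pruned) }];
  }
  return pruned;
}
```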
### Sub-Agent Context Model
| Agent | Sub-agent type | Context isolation | Return to parent |
|---|---|---|---|
| Pi | Extension (OS process spawn) | Full isolation | Final text only |
| Codex | None | N/A | N/A |
| Gemini CLI | In-process (new GeminiChat) | Fresh chat instance | Final text only |
| Claude Code | 6+ types (Explore, Plan, Fork...) | Fresh context (except Fork: inherits parent) | Final text only |
| OpenClaw | Gateway RPC (sessions_spawn) | Session-level isolation | Text + bidirectional steering |
| OpenCode | Session-based (Task tool) | Separate SQLite session, resumable | Final text in `<task_result>` tags |
| Self-developed agent | Capability nodes | Per-node context_filter (3 tiers) | summary_exchange + port_values |
### System Prompt
| Agent | Size | Dynamic injection |
|---|---|---|
| Pi | ~300 words, single template | None |
| Codex | Single comprehensive file (prompt.md) | None |
| Gemini CLI | Section-based, toggleable, model-aware | GEMINI.md loading |
| Claude Code | 65+ modular files, ~8K tokens | 20+ system-reminder templates, per-event |
| OpenClaw | 15+ sections, 3 modes (full/minimal/none) | Minimal |
| OpenCode | Provider-specific prompts (Anthropic/GPT/Gemini/default) | AGENTS.md + CLAUDE.md + CONTEXT.md hierarchy |
| Self-developed agent | YAML profile templates per capability | Per-node prompt rendering with variables |
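OpenCode's AGENTS.md/CLAUDE.md/CONTEXT.md hierarchy implies collecting context files up the directory tree. The sketch below shows that generic pattern, not OpenCode's actual loader; the per-directory precedence order is an assumption:

```typescript
import { existsSync, readFileSync } from "node:fs";
import { dirname, join } from "node:path";

const CONTEXT_FILES = ["AGENTS.md", "CLAUDE.md", "CONTEXT.md"]; // assumed priority order

// Walk from the working directory to the filesystem root, taking the
// highest-priority context file present at each level.
function loadContextFiles(startDir: string): string[] {
  const found: string[] = [];
  let dir = startDir;
  while (true) {
    for (const name of CONTEXT_FILES) {
      const path = join(dir, name);
      if (existsSync(path)) {
        found.push(readFileSync(path, "utf8"));
        break; // only one context file per directory
      }
    }
    const parent = dirname(dir);
    if (parent === dir) break; // reached the filesystem root
    dir = parent;
  }
  return found;
}
```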
## Design Patterns Identified
### Pattern 1: Reactive vs Proactive Compression
Most agents compress reactively: they wait until the context is nearly full, then compact. The exceptions below compress earlier; a sketch contrasting the two approaches follows the list.
Exceptions:

- Codex: per-item truncation at entry time (proactive for tool outputs)
- Gemini CLI: tool output pre-summarization (proactive for large results)
- Self-developed agent: summary_exchange at node completion (proactive for all node outputs)
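A minimal sketch of the contrast, with hypothetical `tokens` and `summarize` helpers:

```typescript
declare function tokens(text: string): number;
declare function summarize(text: string): Promise<string>;

// Reactive: inspect the whole context on every turn; compress only when nearly full.
async function maybeCompact(context: string, limit: number): Promise<string> {
  return tokens(context) > limit * 0.9 ? summarize(context) : context;
}

// Proactive: compress each item as it enters, so the reactive trigger rarely fires.
async function admitItem(item: string, perItemLimit: number): Promise<string> {
  return tokens(item) > perItemLimit ? summarize(item) : item;
}
```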
### Pattern 2: Client-Side → Server-Side Migration
Context compaction is moving server-side:
- 2025: Pi, Gemini CLI, OpenClaw — all client-side
- 2026: Claude Code (compact-2026-01-12), Codex (/responses/compact) — server-side API
- Server-side enables: encrypted state preservation (Codex), mid-stream compaction (Codex), simpler clients
### Pattern 3: Single Channel vs Dual Channel
All mainstream agents use a single channel — everything (user messages, tool results, system reminders, summaries) goes into one conversation array.
Self-developed agent's dual-channel design (Ports for structured data, ContextMessages for semantic memory) is the only exception studied. This prevents structured data from inflating the conversation context.
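A sketch of the dual-channel state; the type names are hypothetical:

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Dual-channel state: structured data and semantic memory never share a channel.
type AgentState = {
  ports: Map<string, unknown>; // channel 1: structured data (JSON, tables, file lists)
  contextMessages: Message[];  // channel 2: semantic memory, the only channel the LLM sees
};

// A node writes its bulky structured output to a port; only a short summary
// enters the conversation. Downstream nodes read the port directly instead
// of re-parsing the data out of the LLM context.
function recordNodeOutput(state: AgentState, port: string, data: unknown, summary: string): void {
  state.ports.set(port, data);
  state.contextMessages.push({ role: "user", content: summary });
}
```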
### Pattern 4: Context Awareness as a Model Feature
Claude Code's `<budget:token_budget>` and `<system_warning>` tags make the model itself aware of remaining context capacity; no other agent studied does this. Combined with server-side compaction, the model can self-manage context without client-side heuristics.
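The tag names come from the extracted prompts; the injection mechanics below are an assumption about how a client might surface them, including the payload format and the 10% warning threshold:

```typescript
type Message = { role: string; content: string };

// Append a budget reminder so the model can see its own remaining capacity.
function withBudgetAwareness(history: Message[], usedTokens: number, limit: number): Message[] {
  const remaining = limit - usedTokens;
  let content = `<budget:token_budget>${remaining}</budget:token_budget>`;
  if (remaining < limit * 0.1) {
    content += `\n<system_warning>Context is nearly full; wrap up or be concise.</system_warning>`;
  }
  return [...history, { role: "user", content }];
}
```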
### Pattern 5: Sub-Agents as Context Management
Using sub-agents is fundamentally a context management strategy: give a focused task its own clean context window, get back a compressed summary. This pattern appears in Claude Code (Explore/Plan agents), OpenClaw (sessions_spawn), Gemini CLI (LocalAgentExecutor), and Self-developed agent (capability nodes with context_filter).
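Whatever the spawn mechanism (process, in-process chat, RPC session), the pattern reduces to a few lines; the names here are hypothetical:

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical stand-in for a full agent loop (LLM calls + tool use until done).
declare function runAgentLoop(history: Message[]): Promise<string>;

// Sub-agent as context management: the focused task gets a fresh, clean
// context window; its exploration and tool noise stay there, and only the
// final text re-enters the parent's context.
async function spawnSubAgent(task: string, systemPrompt: string): Promise<string> {
  const freshContext: Message[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: task },
  ];
  return runAgentLoop(freshContext); // the returned text is all the parent ever sees
}
```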
### Pattern 6: Context Rot Awareness (from Anthropic)
Anthropic identifies four types of context degradation (anthropic-context-engineering.research.md):
| Type | Description | Agents that address it |
|---|---|---|
| Poisoning (incorrect info) | Stale tool results from modified files | Only Claude Code (file modification detection) |
| Distraction (irrelevant info) | Old tool outputs consuming attention | Codex, Gemini CLI, OpenCode (truncation/pruning) |
| Confusion (similar info) | Two similar files causing misassociation | No agent addresses this systematically |
| Clash (contradictory info) | Old and new versions of same data | OpenCode fork/revert (lets user branch away) |
Most agents only address distraction. Poisoning, confusion, and clash are largely unmitigated.
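The one observed poisoning mitigation, Claude Code's file modification detection, reduces to tracking read times. The sketch below is a generic reconstruction of the idea, not Claude Code's implementation:

```typescript
import { statSync } from "node:fs";

// Remember each file's mtime at the moment a tool reads it.
const readTimes = new Map<string, number>();

function recordRead(path: string): void {
  readTimes.set(path, statSync(path).mtimeMs);
}

// A tool result is a poisoning candidate if its file changed after the read;
// the agent can then warn the model or force a re-read.
function isStale(path: string): boolean {
  const readAt = readTimes.get(path);
  return readAt !== undefined && statSync(path).mtimeMs > readAt;
}
```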
### Pattern 7: Anthropic Recommendations vs Practice
Key gaps between what Anthropic recommends and what agents actually do (full analysis in anthropic-context-engineering.research.md):
- Sub-agent returns should be 1-2K tokens → No agent enforces this (all unbounded)
- Compaction should maximize recall → Only Gemini CLI verifies with a second LLM call
- Context quality should be evaluated → No agent measures compression information loss
- Tool result clearing is the safest first step → Only Codex, OpenCode, and Claude Code do this; Pi and OpenClaw skip it entirely
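Enforcing the first recommendation would be a small wrapper at the sub-agent boundary. Since no studied agent does this, the sketch below is purely illustrative, with hypothetical helpers:

```typescript
// Hypothetical helpers: a tokenizer and a summarizer with a token ceiling.
declare function countTokens(text: string): number;
declare function llmSummarize(text: string, maxTokens: number): Promise<string>;

const MAX_RETURN_TOKENS = 2_000; // Anthropic's recommended ceiling for sub-agent returns

// Compress an oversized sub-agent result once more before it re-enters
// the parent context; within-limit results pass through untouched.
async function boundedReturn(finalText: string): Promise<string> {
  if (countTokens(finalText) <= MAX_RETURN_TOKENS) return finalText;
  return llmSummarize(finalText, MAX_RETURN_TOKENS);
}
```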
## Open Questions
- Graph-based context: Memory research found knowledge graphs (Graphiti) to be a breakthrough, yet no studied agent uses graph structures for context management. Could tracking causal relationships between tool calls improve compression quality?
- Optimal compression threshold: Pi compresses near the limit; Gemini CLI compresses at 50%. What is the optimal point? Compressing earlier loses less information per compression event but compresses more often.
- Verification cost: Gemini CLI's two-pass verification catches lost information but doubles the compression cost. Is it worth it? No other agent does it.
- Encrypted vs readable compaction: Codex's server returns opaque encrypted state, which preserves the model-internal representation but is unauditable. Claude Code's 9-section text summary is readable but may lose latent semantics. Which is better?
- When to filter vs when to send all: Pi's "send everything" works with 1M-token context windows, but context rot (accuracy degradation with length) suggests filtering may pay off even when the context fits. Where is the crossover point?