Context Management in LLM Agents: Research Summary

Last Updated: 2026-03-23


Studied Agents

| Agent | Type | Language | Source |
| --- | --- | --- | --- |
| Pi | Open source | TypeScript | pi.research.md |
| OpenClaw | Open source | TypeScript | openclaw.research.md |
| Gemini CLI | Open source | TypeScript | gemini-cli.research.md |
| Claude Code | Closed source (prompts extracted) | TypeScript (Bun binary) | claude-code-context.research.md |
| Codex | Open source | Rust | codex-context.research.md |
| OpenCode | Open source | TypeScript/Bun | opencode.research.md |

Additional References

| Source | Type | File |
| --- | --- | --- |
| Anthropic official guidance | Best practices + compliance analysis | anthropic-context-engineering.research.md |

Universal Pattern

All agents share the same underlying model:

Messages accumulate → Threshold reached → Compress/summarize → Continue with summary

Specifically:

  • A single array (or equivalent) stores the conversation history
  • Every LLM call sends the full accumulated history
  • An LLM-generated summary replaces older content when approaching limits
  • The summary is injected as a user-role message to continue the conversation
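A minimal TypeScript sketch of this shared loop. All names here (CONTEXT_WINDOW, RESERVE, KEEP_RECENT, the stubs) are illustrative, not any specific agent's API; the 16K reserve echoes Pi's trigger from the comparison below:

```typescript
type Message = {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
};

// Stubs standing in for a real tokenizer and model client.
declare function countTokens(msgs: Message[]): number;
declare function llmComplete(msgs: Message[]): Promise<string>;
declare function llmSummarize(msgs: Message[]): Promise<string>;

const CONTEXT_WINDOW = 200_000; // model limit, in tokens
const RESERVE = 16_000;         // headroom for the next response (Pi-style)
const KEEP_RECENT = 4;          // last few messages kept verbatim

async function step(history: Message[], incoming: Message): Promise<void> {
  history.push(incoming);

  // Threshold reached -> compress/summarize.
  if (countTokens(history) > CONTEXT_WINDOW - RESERVE) {
    const summary = await llmSummarize(history.slice(0, -KEEP_RECENT));
    const recent = history.slice(-KEEP_RECENT);
    // The summary replaces older content and is injected as a user-role
    // message so the conversation can continue on top of it.
    history.splice(0, history.length, { role: "user", content: summary }, ...recent);
  }

  // Every LLM call sends the full accumulated history.
  const reply = await llmComplete(history);
  history.push({ role: "assistant", content: reply });
}
```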


Architecture Spectrum

Single-loop agents                          Multi-node workflow
(one context, one LLM)                      (multiple contexts, multiple LLMs)

Pi ── Codex ── Gemini CLI ── Claude Code ── OpenClaw ── Self-developed agent
│      │          │              │              │           │
│  per-item    2-pass         server-side    multi-stage   dual-channel
│  truncation  verify         compaction     pipeline      (Ports + Context)
│              + tool         + context      + pluggable   + per-node filter
│              pre-summary    awareness      engine        + proactive summary
simple ──────────────────────────────────────────── complex

Key Dimensions Comparison

Context Accumulation

| Agent | What enters context | Pre-processing |
| --- | --- | --- |
| Pi | Full tool results, all messages | None |
| Codex | Truncated tool results (per-item, 10KB default) | Per-item truncation at record time |
| Gemini CLI | Pre-summarized large tool outputs | LLM summarization before entry + reverse token budget |
| Claude Code | Full tool results, all messages | None (API handles compaction) |
| OpenClaw | Full tool results | Multi-stage: sanitize → validate → truncate → assemble |
| OpenCode | Full tool results, pruned after 40K token budget | Two-phase: prune old tool outputs + LLM summarization |
| Self-developed agent | Summary exchange only (full exchange disabled) | Per-node context_filter |
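Codex's per-item truncation is the lightest-touch pre-processing step: clamp each tool result at record time, before it ever enters the history. A TypeScript sketch of the idea (Codex itself is Rust; the 10KB default comes from the table, while the head-and-tail split is an assumption, not necessarily Codex's choice):

```typescript
const MAX_TOOL_OUTPUT_BYTES = 10 * 1024; // 10KB per-item default (see table)

// Clamp a tool result at record time. Keeping both head and tail preserves
// the command echo and the final error/result. The byte math is approximate
// for non-ASCII text, since slice() works on UTF-16 code units.
function truncateToolOutput(output: string): string {
  const bytes = new TextEncoder().encode(output).length;
  if (bytes <= MAX_TOOL_OUTPUT_BYTES) return output;
  const half = Math.floor(MAX_TOOL_OUTPUT_BYTES / 2);
  const head = output.slice(0, half);
  const tail = output.slice(-half);
  return `${head}\n[... ${bytes - MAX_TOOL_OUTPUT_BYTES} bytes truncated ...]\n${tail}`;
}
```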

Compaction Strategy

| Agent | Location | Trigger | Method | Verification |
| --- | --- | --- | --- | --- |
| Pi | Client | contextWindow - 16K reserve | Single LLM call, 6-section summary | None |
| Codex (OpenAI) | Server | Configurable threshold | Encrypted opaque compaction block | N/A (server-side) |
| Codex (other) | Client | Same as above | Single LLM call, 4-section summary | None |
| Gemini CLI | Client | 50% of token limit | LLM summary + probe verification | 2nd LLM call verifies completeness |
| Claude Code | Server (API) | ~80% of context window | 9-section structured summary | None (but 3 analysis variants) |
| OpenClaw | Client (inherited from Pi) | Same as Pi | Same as Pi, or custom ContextEngine | Depends on engine |
| OpenCode | Client | Context ≥ usable input limit | Two-phase: prune tool outputs + LLM 5-section summary | None; plugin hook for custom compaction |
| Self-developed agent | N/A | Per-node (proactive) | summary_exchange templates | None; no reactive compaction fallback |
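OpenCode's two-phase method is representative of the client-side rows above. A sketch, reusing the `Message` type from the first example (the pruning placeholder text and the limit value are hypothetical; only the 40K tool-output budget comes from the tables):

```typescript
declare function countTokens(msgs: Message[]): number;
declare function llmSummarize(msgs: Message[]): Promise<string>;

const TOOL_OUTPUT_BUDGET = 40_000;  // OpenCode's tool-output token budget
const USABLE_INPUT_LIMIT = 180_000; // hypothetical usable input limit

async function compactIfNeeded(history: Message[]): Promise<Message[]> {
  // Phase 1: walk newest-to-oldest and blank out tool outputs once the
  // cumulative tool-output tokens exceed the budget.
  let toolTokens = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const msg = history[i];
    if (msg.role !== "tool") continue;
    toolTokens += countTokens([msg]);
    if (toolTokens > TOOL_OUTPUT_BUDGET) msg.content = "[tool output pruned]";
  }

  // Phase 2: if the context still exceeds the usable input limit, fall back
  // to an LLM summary (OpenCode's is a 5-section template) injected as a
  // user-role message.
  if (countTokens(history) < USABLE_INPUT_LIMIT) return history;
  const summary = await llmSummarize(history);
  return [{ role: "user", content: summary }];
}
```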

Sub-Agent Context Model

| Agent | Sub-agent type | Context isolation | Return to parent |
| --- | --- | --- | --- |
| Pi | Extension (OS process spawn) | Full isolation | Final text only |
| Codex | None | N/A | N/A |
| Gemini CLI | In-process (new GeminiChat) | Fresh chat instance | Final text only |
| Claude Code | 6+ types (Explore, Plan, Fork...) | Fresh context (except Fork: inherits parent) | Final text only |
| OpenClaw | Gateway RPC (sessions_spawn) | Session-level isolation | Text + bidirectional steering |
| OpenCode | Session-based (Task tool) | Separate SQLite session, resumable | Final text in <task_result> tags |
| Self-developed agent | Capability nodes | Per-node context_filter (3 tiers) | summary_exchange + port_values |
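The common contract across the table: the child gets a fresh context, and only its final text crosses back. A sketch reusing the earlier `Message` type and `llmComplete` stub (a real child would run the full accumulate/compact loop with tools, not a single call):

```typescript
declare function llmComplete(msgs: Message[]): Promise<string>;

// Spawn a sub-agent with a clean, focused context window.
async function runSubAgent(systemPrompt: string, task: string): Promise<string> {
  const childHistory: Message[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: task },
  ];
  // In a real agent this is the whole agent loop from the first sketch.
  return llmComplete(childHistory);
}

// Parent side: an entire exploration collapses into one compact message;
// the child's intermediate tool results never enter the parent history.
// parentHistory.push({ role: "tool", content: await runSubAgent(...) });
```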

System Prompt

| Agent | Size | Dynamic injection |
| --- | --- | --- |
| Pi | ~300 words, single template | None |
| Codex | Single comprehensive file (prompt.md) | None |
| Gemini CLI | Section-based, toggleable, model-aware | GEMINI.md loading |
| Claude Code | 65+ modular files, ~8K tokens | 20+ system-reminder templates, per-event |
| OpenClaw | 15+ sections, 3 modes (full/minimal/none) | Minimal |
| OpenCode | Provider-specific prompts (Anthropic/GPT/Gemini/default) | AGENTS.md + CLAUDE.md + CONTEXT.md hierarchy |
| Self-developed agent | YAML profile templates per capability | Per-node prompt rendering with variables |
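The instruction-file hierarchies in the rightmost column (GEMINI.md, AGENTS.md/CLAUDE.md/CONTEXT.md) are typically resolved by walking from the working directory toward the filesystem root. A sketch of that lookup; the file precedence here is an assumption, not OpenCode's documented order:

```typescript
import { existsSync, readFileSync } from "node:fs";
import { dirname, join } from "node:path";

// Collect instruction files from cwd up to the filesystem root, then order
// them root-most first so more specific files are appended last.
function loadProjectContext(cwd: string): string[] {
  const names = ["AGENTS.md", "CLAUDE.md", "CONTEXT.md"];
  const found: string[] = [];
  for (let dir = cwd; ; dir = dirname(dir)) {
    for (const name of names) {
      const path = join(dir, name);
      if (existsSync(path)) found.push(readFileSync(path, "utf8"));
    }
    if (dir === dirname(dir)) break; // dirname("/") === "/" at the root
  }
  return found.reverse();
}
```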

Design Patterns Identified

Pattern 1: Reactive vs Proactive Compression

Most agents compress reactively: they wait until the context is nearly full, then compact.

Exceptions:

  • Codex: per-item truncation at entry time (proactive for tool outputs)
  • Gemini CLI: tool output pre-summarization (proactive for large results)
  • Self-developed agent: summary_exchange at node completion (proactive for all node outputs)

Pattern 2: Client-Side → Server-Side Migration

Context compaction is moving server-side:

  • 2025: Pi, Gemini CLI, OpenClaw all compact client-side
  • 2026: Claude Code (compact-2026-01-12) and Codex (/responses/compact) move to server-side APIs
  • Server-side compaction enables encrypted state preservation (Codex), mid-stream compaction (Codex), and simpler clients

Pattern 3: Single Channel vs Dual Channel

All mainstream agents use a single channel: everything (user messages, tool results, system reminders, summaries) goes into one conversation array.

Self-developed agent's dual-channel design (Ports for structured data, ContextMessages for semantic memory) is the only exception studied. This prevents structured data from inflating the conversation context.
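A sketch of the dual-channel shape, reusing the earlier `Message` type (field names are illustrative, derived from the terminology above):

```typescript
// Channel 1 (Ports): structured data flowing between nodes, never
// serialized into the LLM conversation. Channel 2 (ContextMessages):
// semantic memory that IS sent to the model.
interface NodeContext {
  ports: Record<string, unknown>; // tables, file bodies, JSON payloads
  contextMessages: Message[];     // compact summary_exchange entries
}

// A node reads structured inputs from ports and emits only a short summary
// into the message channel, so large payloads never inflate the
// conversation context.
```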

Pattern 4: Context Awareness as a Model Feature

Claude Code's <budget:token_budget> and <system_warning> tags make the model itself aware of remaining context capacity. No other agent has this. Combined with server-side compaction, the model can self-manage without client-side heuristics.
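The tag formats are Claude Code's; everything else in this sketch, a client-side emulation of the same idea, is hypothetical:

```typescript
declare function countTokens(msgs: Message[]): number;

// Make the model aware of its remaining capacity by appending a budget tag
// before each call. Claude Code does this server-side; this client-side
// emulation (including the warning threshold and text) is an assumption.
function withTokenBudget(history: Message[], contextWindow: number): Message[] {
  const remaining = contextWindow - countTokens(history);
  const tag =
    remaining < 20_000
      ? `<system_warning>context low: ${remaining} tokens left</system_warning>`
      : `<budget:token_budget>${remaining}</budget:token_budget>`;
  return [...history, { role: "user", content: tag }];
}
```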

Pattern 5: Sub-Agents as Context Management

Using sub-agents is fundamentally a context management strategy: give a focused task its own clean context window, get back a compressed summary. This pattern appears in Claude Code (Explore/Plan agents), OpenClaw (sessions_spawn), Gemini CLI (LocalAgentExecutor), and Self-developed agent (capability nodes with context_filter).

Pattern 6: Context Rot Awareness (from Anthropic)

Anthropic identifies four types of context degradation (anthropic-context-engineering.research.md):

| Type | Description | Agents that address it |
| --- | --- | --- |
| Poisoning (incorrect info) | Stale tool results from modified files | Only Claude Code (file modification detection) |
| Distraction (irrelevant info) | Old tool outputs consuming attention | Codex, Gemini CLI, OpenCode (truncation/pruning) |
| Confusion (similar info) | Two similar files causing misassociation | No agent addresses this systematically |
| Clash (contradictory info) | Old and new versions of the same data | OpenCode fork/revert (lets the user branch away) |

Most agents only address distraction. Poisoning, confusion, and clash are largely unmitigated.
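Poisoning is the one degradation type with a known mitigation: Claude Code's file modification detection. The mechanism sketched here (mtime tracking plus a pre-call reminder) is an assumption about how such detection can work, not extracted behavior:

```typescript
import { statSync } from "node:fs";

// Remember each file's mtime at read time; before the next LLM call, list
// any files that changed since, so stale tool results can be flagged.
const readTimes = new Map<string, number>();

function recordRead(path: string): void {
  readTimes.set(path, statSync(path).mtimeMs);
}

function staleReads(): string[] {
  return [...readTimes.keys()].filter(
    (path) => statSync(path).mtimeMs !== readTimes.get(path),
  );
}

// e.g. inject before the call:
// `<system-reminder>modified since read: ${staleReads().join(", ")}</system-reminder>`
```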

Pattern 7: Anthropic Recommendations vs Practice

Key gaps between what Anthropic recommends and what agents actually do (full analysis in anthropic-context-engineering.research.md):

  • Sub-agent returns should be 1-2K tokens → No agent enforces this (all unbounded); a sketch of such a cap follows this list
  • Compaction should maximize recall → Only Gemini CLI verifies with a second LLM call
  • Context quality should be evaluated → No agent measures compression information loss
  • Tool result clearing is the safest first step → Only Codex, OpenCode, and Claude Code do this; Pi and OpenClaw skip it entirely
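The first gap is mechanically easy to close: cap the child's return and re-summarize when it is over budget. No studied agent does this; the sketch reuses the earlier stubs, and the condensation prompt is invented:

```typescript
declare function countTokens(msgs: Message[]): number;
declare function llmSummarize(msgs: Message[]): Promise<string>;

// Enforce Anthropic's 1-2K-token recommendation on sub-agent returns: if
// the child's final text is over budget, compress it once more before it
// enters the parent context.
async function boundedReturn(finalText: string, maxTokens = 2_000): Promise<string> {
  const asMessage: Message = { role: "assistant", content: finalText };
  if (countTokens([asMessage]) <= maxTokens) return finalText;
  return llmSummarize([
    { role: "user", content: `Condense to under ${maxTokens} tokens:\n\n${finalText}` },
  ]);
}
```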

Open Questions

  1. Graph-based context: Memory research found knowledge graphs (Graphiti) to be a breakthrough. No agent uses graph structures for context management. Could tracking causal relationships between tool calls improve compression quality?

  2. Optimal compression threshold: Pi compresses near the limit, Gemini CLI at 50%. What's the optimal point? Earlier compression loses less information per compression event but compresses more often.

  3. Verification cost: Gemini CLI's two-pass verification catches lost information but doubles the compression cost. Is it worth it? No one else does it.

  4. Encrypted vs readable compaction: Codex's server returns opaque encrypted state. This preserves model-internal representation but is unauditable. Claude Code's 9-section text summary is readable but may lose latent semantics. Which is better?

  5. When to filter vs when to send all: Pi's "send everything" works with 1M context windows. But context rot (accuracy degradation with length) suggests filtering may be better even when context fits. Where's the crossover point?