Context Research Plan

Last Updated: 2026-03-23

Goal

Research how mainstream agents assemble and manage context within a conversation:

1. How tokens are generated and streamed
2. How context (system prompt, messages, tool results) is stitched together before each LLM call
3. How different agents handle token budget, compaction, and truncation
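The assembly step in point 2 can be sketched in a few lines. This is an illustrative sketch only, not any particular agent's implementation: the function names (`count_tokens`, `assemble_context`) are hypothetical, and a word-count proxy stands in for a real tokenizer such as tiktoken.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real model tokenizer.
    return len(text.split())

def assemble_context(system_prompt: str, messages: list[dict], budget: int) -> list[dict]:
    """Stitch system prompt + conversation into one message list,
    dropping the oldest messages that would overflow the token budget."""
    kept: list[dict] = []
    used = count_tokens(system_prompt)
    # Walk newest-to-oldest so the most recent turns survive truncation.
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()
    return [{"role": "system", "content": system_prompt}] + kept
```

Most of the studied agents refine this basic shape rather than replace it: the interesting differences are in what happens to the dropped messages (compaction, summarization) rather than in the stitching itself.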

Completed

  • [x] Clone open source repos as submodules (Pi, OpenClaw, Gemini CLI, Codex, OpenCode, claude-code-system-prompts)
  • [x] Study Pi: core agent loop and context compaction → pi.research.md
  • [x] Study OpenClaw: ContextEngine plugin architecture and assemble() flow → openclaw.research.md
  • [x] Study Gemini CLI: context management implementation → gemini-cli.research.md
  • [x] Study Claude Code: system prompts, compaction, sub-agents (via community extraction) → claude-code-context.research.md
  • [x] Study Codex: dual compaction, context manager (open source, Rust) → codex-context.research.md
  • [x] Study OpenCode: two-phase compaction, fork/revert, resumable sub-agents → opencode.research.md
  • [x] Write per-project research documents (7 total)
  • [x] Cross-project comparison → context.summary.md
  • [x] Cross-domain findings (Memory × Context) → findings.md
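A pattern recurring across the studied agents is replacing older turns with a summary once usage crosses a threshold. A minimal sketch, assuming a hypothetical `compact` helper and a placeholder `summarize` callable standing in for the LLM summarization call:

```python
from typing import Callable

def compact(messages: list[dict], keep_recent: int = 4,
            summarize: Callable[[list[dict]], str] = lambda msgs: "...") -> list[dict]:
    """Replace all but the most recent turns with a single summary
    message. `summarize` is a placeholder for an LLM call."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "user",
               "content": f"[Conversation summary]\n{summarize(old)}"}
    return [summary] + recent
```

The per-project documents differ mainly in when this fires (threshold vs. explicit command), what the summary prompt asks for, and whether summarization happens in one pass or two.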

Open Areas

Discussed but not deeply researched

| Topic | Status | Notes |
| --- | --- | --- |
| Token streaming mechanics | Mentioned in goal #1, not studied | How each agent handles SSE/WebSocket/stdio streaming. Lower priority — this is transport, not context management |
| Anthropic "Effective Context Engineering" article | Read and referenced | Could do detailed breakdown: official recommendations vs what agents actually implement |
| Prompt placement empirics | Identified as gap in findings | No agent does A/B testing. AI Muse 18-model benchmark is the closest. Would need own experiments to go further |
| Knowledge graphs for context | Identified as gap in findings | Graphiti is Memory-side only. No one applies graph structures to context management |
| Compaction quality measurement | Identified as unsolved problem | No standard metric exists. Could survey academic work on summarization evaluation |

Not yet started

| Topic | Priority | Notes |
| --- | --- | --- |
| Learning (continual learning) | TODO | New research direction. Focus on academic survey + any production systems doing user-level adaptation. See CLAUDE.md |
| Cross-domain article | Future | Synthesis article covering Memory × Context × Learning findings. Not urgent — deepen understanding first |