# Context Research Plan
Last Updated: 2026-03-23
## Goal
Research how mainstream agents assemble and manage context within a conversation:

1. How tokens are generated and streamed
2. How context (system prompt, messages, tool results) is stitched together before each LLM call
3. How different agents handle token budget, compaction, and truncation
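To make goals 2 and 3 concrete, here is a minimal sketch of context assembly under a token budget. It is not taken from any of the studied agents: the names (`assemble_context`, `count_tokens`) and the chars/4 token heuristic are illustrative assumptions only.

```python
# Hypothetical sketch: stitch system prompt + message history (including tool
# results) into one prompt, dropping the oldest turns once a token budget
# would be exceeded. Real agents use the model's tokenizer and smarter
# compaction; this is only meant to frame the research questions.

def count_tokens(text: str) -> int:
    # Crude approximation (roughly 4 characters per token).
    return max(1, len(text) // 4)

def assemble_context(system_prompt: str, messages: list[dict], budget: int) -> list[dict]:
    """Return a message list that fits the budget, keeping the most recent turns."""
    remaining = budget - count_tokens(system_prompt)
    kept: list[dict] = []
    # Walk history newest-first so truncation drops the oldest turns.
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    kept.reverse()
    return [{"role": "system", "content": system_prompt}, *kept]

if __name__ == "__main__":
    history = [
        {"role": "user", "content": "Summarize the repo structure."},
        {"role": "tool", "content": "src/ contains 42 files ..."},
        {"role": "assistant", "content": "The repo has three top-level packages ..."},
        {"role": "user", "content": "Now explain the compaction logic."},
    ]
    prompt = assemble_context("You are a coding agent.", history, budget=200)
    print([m["role"] for m in prompt])
```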
## Completed
- [x] Clone open source repos as submodules (Pi, OpenClaw, Gemini CLI, Codex, OpenCode, claude-code-system-prompts)
- [x] Study Pi: core agent loop and context compaction → pi.research.md
- [x] Study OpenClaw: ContextEngine plugin architecture and assemble() flow → openclaw.research.md
- [x] Study Gemini CLI: context management implementation → gemini-cli.research.md
- [x] Study Claude Code: system prompts, compaction, sub-agents (via community extraction) → claude-code-context.research.md
- [x] Study Codex: dual compaction, context manager (open source, Rust) → codex-context.research.md
- [x] Study OpenCode: two-phase compaction, fork/revert, resumable sub-agents → opencode.research.md
- [x] Write per-project research documents (7 total)
- [x] Cross-project comparison → context.summary.md
- [x] Cross-domain findings (Memory × Context) → findings.md
## Open Areas
### Discussed but not deeply researched
| Topic | Status | Notes |
|---|---|---|
| Token streaming mechanics | Mentioned in goal #1, not studied | How each agent handles SSE/WebSocket/stdio streaming. Lower priority — this is transport, not context management |
| Anthropic "Effective Context Engineering" article | Read and referenced | Could do detailed breakdown: official recommendations vs what agents actually implement |
| Prompt placement empirics | Identified as gap in findings | No agent does A/B testing. AI Muse 18-model benchmark is the closest. Would need own experiments to go further |
| Knowledge graphs for context | Identified as gap in findings | Graphiti is Memory-side only. No one applies graph structures to context management |
| Compaction quality measurement | Identified as unsolved problem | No standard metric exists. Could survey academic work on summarization evaluation (one naive proxy is sketched after this table) |
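One way to make "compaction quality" measurable, offered purely as an illustrative assumption (not a method used by any of the studied agents or drawn from the literature): treat concrete identifiers in the pre-compaction transcript as facts the summary must preserve, and compute their recall.

```python
# Hypothetical proxy for compaction quality: what fraction of concrete
# identifiers (file paths, function calls) from the original transcript
# survive in the compacted summary. The regex and metric are assumptions
# for illustration, not an established evaluation method.
import re

IDENTIFIER = re.compile(r"[A-Za-z_][\w./-]*\.(?:md|py|rs|ts)|[A-Za-z_]\w*\(\)")

def identifier_recall(original: str, summary: str) -> float:
    """Fraction of identifiers from the original that still appear in the summary."""
    wanted = set(IDENTIFIER.findall(original))
    if not wanted:
        return 1.0
    kept = {ident for ident in wanted if ident in summary}
    return len(kept) / len(wanted)

original = "Edited src/context.py and called assemble() before compact()."
summary = "Updated src/context.py; compaction handled by compact()."
print(f"{identifier_recall(original, summary):.2f}")  # 0.67 (assemble() was dropped)
```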
### Not yet started
| Topic | Priority | Notes |
|---|---|---|
| Learning (continual learning) | TODO | New research direction. Focus on academic survey + any production systems doing user-level adaptation. See CLAUDE.md |
| Cross-domain article | Future | Synthesis article covering Memory × Context × Learning findings. Not urgent — deepen understanding first |