LLM Agent Information Management: Research Summary¶
Last Updated: 2026-03-24
A synthesis of research across 20+ projects studying how LLM agents manage information — from memory frameworks and context management to the frontier of continuous learning.
0. Two Analytical Angles¶
This research examines LLM agent information management from two complementary angles.
Angle 1: Three Methods — How Agents Handle Information¶
| Method | What it does | Maturity |
|---|---|---|
| Compression | Reduce information to fit constraints — summarization, fact extraction, truncation | Production-ready, universally deployed |
| Retrieval | Select relevant information from a larger pool — vector search, text search, graph traversal | Production-ready, approaches vary widely |
| Learning | Write knowledge into model weights so it persists without external storage | Research frontier, no production deployment |
Angle 2: Three Directions — Where Agents Apply These Methods¶
| Direction | Time scale | Research status |
|---|---|---|
| Memory | Cross-session — persist knowledge between conversations | 15 projects studied |
| Context | Within-session — manage the context window during a conversation | 7 agents + Anthropic guidance studied |
| Learning | Post-deployment — adapt model weights after training | Plan written, not started |
Parts 1-3 follow the three directions. Part 4 discusses cross-domain findings visible only when studying memory and context together.
1. Memory: Persisting Knowledge Across Conversations¶
1.1 Consumer Product Memory: Two Philosophies¶
Reverse engineering of ChatGPT and Claude reveals two opposing approaches. (ChatGPT details | Claude details)
| | ChatGPT | Claude |
|---|---|---|
| Strategy | Pre-inject everything | Retrieve on demand |
| Mechanism | 33 pre-computed facts + recent chat summaries, always in context | conversation_search and recent_chats tools, invoked when needed |
| Trade-off | Fixed token cost, automatic continuity | Variable cost, risk of missing relevant info |
Neither uses vector databases or RAG at runtime — both rely on simpler approaches than expected.
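To make the contrast concrete, here is a minimal Python sketch of the two philosophies. The function names, data shapes, and wording are illustrative assumptions, not the actual ChatGPT or Claude internals.

```python
def build_preinjected_prompt(facts: list[str], summaries: list[str], user_msg: str) -> str:
    """ChatGPT-style pre-injection: pay a fixed token cost by always placing
    pre-computed facts and recent-chat summaries ahead of the user message."""
    memory_block = "\n".join(f"- {item}" for item in facts + summaries)
    return f"Known user facts and recent context:\n{memory_block}\n\nUser: {user_msg}"


def conversation_search(query: str, archive: dict[str, str]) -> list[str]:
    """Claude-style on-demand retrieval: exposed to the model as a tool, so
    nothing is injected unless the model decides to call it (variable cost,
    plus the risk of never calling it when it should)."""
    return [text for text in archive.values() if query.lower() in text.lower()]
```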
1.2 Memory Frameworks: Two Generations¶
Generation 1 (2024-2025): Foundational Approaches¶
| Framework | Core idea | What makes it different | Production adoption |
|---|---|---|---|
| Mem0 | LLM-driven fact CRUD | LLM decides ADD/UPDATE/DELETE on atomic facts; dual storage (vector + graph) | AWS Agent SDK official provider |
| Letta | Three-tier self-editing memory | Agent writes to its own prompt (Core/Recall/Archival); active, not passive | 11x sales AI, Kognitos |
| Graphiti | Bi-temporal knowledge graph | Temporal validity on entities and relationships; graph traversal retrieval | Zep AI (YC), 21.2k stars |
Key differentiator from traditional RAG: all three use active memory management — the LLM decides what to store and delete, rather than passively chunking documents. (Production details)
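A minimal sketch of what LLM-driven fact CRUD can look like, assuming a generic `llm(prompt) -> str` completion function. The prompt and JSON schema below are assumptions for illustration, not Mem0's actual format.

```python
import json
from typing import Callable


def apply_memory_ops(
    llm: Callable[[str], str],        # any completion function returning text
    existing_facts: dict[str, str],   # fact_id -> fact text
    new_messages: str,
) -> dict[str, str]:
    """Active memory management: the model, not a chunking pipeline,
    decides what to add, update, or delete in the fact store."""
    prompt = (
        "Existing facts:\n" + json.dumps(existing_facts, indent=2)
        + "\n\nNew conversation:\n" + new_messages
        + '\n\nReturn JSON: [{"op": "ADD|UPDATE|DELETE", "id": "...", "text": "..."}]'
    )
    for op in json.loads(llm(prompt)):
        if op["op"] == "DELETE":
            existing_facts.pop(op["id"], None)
        else:  # ADD and UPDATE both write the fact under its id
            existing_facts[op["id"]] = op["text"]
    return existing_facts
```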
Generation 2 (2026 Q1): New Wave¶
| System | Core innovation | LongMemEval |
|---|---|---|
| Hindsight | Four memory networks (World/Experience/Observation) + reflect operation; multi-strategy retrieval (graph + BM25 + vector) | 91.4% |
| Mastra OM | Pure compression, no retrieval — Observer + Reflector agents compress everything into context | 94.87% |
| MemOS | Memory as OS resource; MemCube unifies three types (plaintext, KV-cache, weights) | N/A |
| Supermemory | ASMR: LLM-as-retriever replaces vector DB; 3+3 agent pipeline with ensemble voting | 98.6% (oracle) |
The generation shift reveals two key trends: LLM-as-retriever replacing vector search (Supermemory), and pure compression as a viable alternative to retrieval (Mastra OM's 94.87% with zero retrieval challenges the assumption that retrieval is necessary at all).
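A hedged sketch of the LLM-as-retriever idea: instead of embedding the query and searching a vector index, show the model the candidate memories and let it reason about relevance. The single-pass prompt below is a simplification; Supermemory's actual pipeline uses multiple agents with ensemble voting.

```python
import json
from typing import Callable


def llm_as_retriever(
    llm: Callable[[str], str],
    memories: dict[str, str],   # memory_id -> text
    query: str,
    k: int = 5,
) -> list[str]:
    """Replace vector search with LLM reasoning over a catalog of candidates."""
    catalog = "\n".join(f"{mid}: {text[:200]}" for mid, text in memories.items())
    prompt = (
        f"Question: {query}\n\nMemories:\n{catalog}\n\n"
        f"Return a JSON list of up to {k} memory ids most useful for answering."
    )
    selected = json.loads(llm(prompt))
    return [memories[mid] for mid in selected if mid in memories]
```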
1.3 Coding Assistant Memory¶
A different domain where memory means "understanding the codebase":
| Assistant | Key innovation |
|---|---|
| Cursor | Custom embedding model trained from agent session traces |
| Augment | Real-time personal index per developer; edit event streaming (+2.6% improvement) |
| Continue | BYOM architecture with content-addressed caching |
See memory.ecosystem.md for the full market overview, including vector databases (Qdrant, Chroma), graph databases, and embedding models.
2. Context: Managing the Conversation Window¶
2.1 Universal Pattern¶
All 7 agents studied share the same base model: a linear message history that is compressed once it approaches the context limit. (Full comparison)
The differences are in when, how, and where compression happens.
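Assuming that shared base model is "linear history plus threshold-triggered summarization", a minimal sketch looks like this; token counting and summarization are passed in as generic callables, and the recency window is an arbitrary choice.

```python
from typing import Callable


def maybe_compact(
    messages: list[dict],                           # [{"role": ..., "content": ...}, ...]
    count_tokens: Callable[[list[dict]], int],
    summarize: Callable[[list[dict]], str],
    limit: int,
    threshold: float = 0.9,                         # Pi waits until near the limit; Gemini CLI uses 0.5
) -> list[dict]:
    """Shared base loop: keep a linear message array and, once usage crosses
    a threshold, replace older messages with an LLM-written summary."""
    if count_tokens(messages) < threshold * limit:
        return messages
    recent = messages[-10:]                         # arbitrary recency window to keep verbatim
    summary = summarize(messages[:-10])
    return [{"role": "user", "content": f"Summary of earlier conversation:\n{summary}"}] + recent
```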
2.2 Architecture Spectrum¶
| Agent | Key characteristic | Detail |
|---|---|---|
| Pi | Minimal | Single LLM summary at near-limit; ~300 word system prompt; no pre-processing |
| Codex | Per-item truncation | 10KB limit per tool output at record time; dual compaction (server encrypted + client LLM) |
| Gemini CLI | Verified compression | LLM summary + second LLM probe to catch omissions; 50% threshold |
| OpenCode | Two-phase | Prune old tool outputs first, then LLM summary; resumable sub-agents; fork/revert |
| Claude Code | Server-side + model-aware | API compaction with 9-section summary; model knows its remaining budget (<budget:token_budget>) |
| OpenClaw | Multi-stage pipeline | sanitize → validate → truncate → assemble; pluggable ContextEngine with 7 lifecycle hooks |
2.3 Key Design Patterns¶
These patterns reveal a common trajectory: compression responsibility is shifting from client heuristics toward server-side APIs and model self-management.
Reactive → Proactive Compression — Most agents wait until context is nearly full. Exceptions: Codex truncates per-item at entry, Gemini CLI pre-summarizes tool outputs. Earlier, lighter compression reduces the risk of a single catastrophic information loss event.
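A sketch of the proactive, per-item variant in the spirit of Codex's 10KB cap at record time; the byte limit and truncation marker are illustrative assumptions.

```python
MAX_TOOL_OUTPUT_BYTES = 10 * 1024   # per-item cap applied when the output is recorded


def record_tool_output(history: list[dict], name: str, output: str) -> None:
    """Proactive per-item compression: truncate each tool result as it enters
    history, instead of waiting for the whole context to fill."""
    data = output.encode("utf-8")
    if len(data) > MAX_TOOL_OUTPUT_BYTES:
        kept = data[:MAX_TOOL_OUTPUT_BYTES].decode("utf-8", errors="ignore")
        output = kept + f"\n[... truncated {len(data) - MAX_TOOL_OUTPUT_BYTES} bytes ...]"
    history.append({"role": "tool", "name": name, "content": output})
```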
Client-Side → Server-Side Compaction — Claude Code (compact-2026-01-12) and Codex (/responses/compact) offload compaction to server APIs. This enables encrypted state preservation, mid-stream compaction, and simpler client implementations.
Model Self-Management — Claude Code's <budget:token_budget> makes the model aware of remaining capacity. Combined with server-side compaction, the model can self-manage without client heuristics. No other agent has this — it may represent the direction all agents converge toward.
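A sketch of how a client might surface that budget to the model; the exact tag wording Claude Code injects is not public, so the format below is an assumption modeled on the tag name above.

```python
def inject_budget(system_prompt: str, used_tokens: int, limit: int) -> str:
    """Model self-management: tell the model how much context budget remains
    so it can decide when to wrap up or request compaction."""
    remaining = max(limit - used_tokens, 0)
    return f"{system_prompt}\n\n<budget:token_budget>{remaining}</budget:token_budget>"
```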
Context Rot — Anthropic identifies four types of context degradation: poisoning (stale info), distraction (irrelevant info), confusion (similar info), clash (contradictory info). Most agents only address distraction. The other three are largely unmitigated. (Anthropic guidance details)
3. Learning: Writing Knowledge Into Weights (TODO)¶
The third method and third research direction. Currently no production agent uses weight updates for personalization — all rely on external memory (compression + retrieval).
Research plan covers four directions: (Full plan)
| Direction | What | Status |
|---|---|---|
| Academic survey | Continual learning, catastrophic forgetting, self-evolving LLMs | TODO |
| Per-user personalization | Multi-LoRA serving, Sakana Doc-to-LoRA, DoorDash production case | TODO |
| VTuber / Character AI | Neuro-sama (weight-embedded personality) vs prompt-crafted personality | TODO |
| Hybrid pipeline | Accumulate external memories → batch LoRA fine-tune → deploy | TODO (speculative) |
Key question: Is per-user LoRA a viable middle ground between prompt injection and full fine-tuning?
4. Cross-Domain Findings¶
Findings visible only when studying memory and context together. (Full findings with 10 detailed analyses)
Memory and Context Are the Same Problem at Different Time Scales¶
Compaction IS memory creation. When Claude Code generates a 9-section summary during compaction, it creates a "memory" of the conversation. When Mem0 extracts facts, it "compacts" the conversation into durable storage. The methods (compression, retrieval) are shared; only the time scale differs. Techniques from one domain should transfer to the other.
Two Philosophies Appear in Both Domains¶
"Give everything, trust the model" (ChatGPT pre-injects all facts; Pi sends full history) vs "Curate aggressively, minimize noise" (Claude retrieves on-demand; OpenClaw's multi-stage pipeline). Neither is strictly better. As context windows grow (1M+), the balance shifts toward "trust the model" — but context rot (§2.3) pushes back. This tension is unresolved.
Compression Quality Is the Shared Unsolved Problem¶
Every system that compresses — whether memory extraction (Mem0 losing nuance) or context compaction (Pi's unverified summary) — risks silent information loss. No system reliably knows what was lost. Gemini CLI's two-pass probe (§2.2) is the only verification attempt, and it doubles the cost.
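A sketch of the two-pass idea, assuming a generic `llm` callable; the prompts are illustrative rather than Gemini CLI's actual ones, and the second call is what doubles the cost. A caller would re-summarize or append the probe's findings whenever the reply is not "NONE".

```python
from typing import Callable


def verified_compress(llm: Callable[[str], str], history: str) -> tuple[str, str]:
    """Two-pass verified compression: one call writes the summary, a second
    call probes it against the original transcript for omissions."""
    summary = llm(
        "Summarize this agent session, preserving decisions, file paths, "
        f"and unresolved tasks:\n\n{history}"
    )
    probe = llm(
        f"Original session:\n{history}\n\nSummary:\n{summary}\n\n"
        "List any facts, constraints, or open tasks present in the original "
        "but missing from the summary. Reply 'NONE' if nothing is missing."
    )
    return summary, probe
```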
Graph Structures Are Unexplored in Context¶
Graphiti's bi-temporal knowledge graph is a memory breakthrough (§1.2). In context management, no agent uses graph structures — all use linear message arrays. No one tracks causal relationships between tool calls or how user intent evolves during a conversation.
Text Search Dominates Over RAG in Agentic Loops¶
Every coding agent studied uses grep/glob (text search) at runtime, not vector search. Reasons: zero index cost, exact matching, always current, explainable. RAG appears in the memory layer (cross-session retrieval) but not in real-time agent operation. This suggests RAG's value is in knowledge-base retrieval, not agentic execution. (Full analysis)
Sub-Agents Are a Compression Strategy¶
Sub-agents appear as a context management technique: give a focused task its own clean context window, return a compressed summary. This is the same pattern as memory extraction — take raw experience, distill it, store only the distillation. Designing sub-agent boundaries is fundamentally a compression design problem.
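A minimal sketch of the pattern, with the agent runtime and summarizer abstracted as callables (both are assumptions standing in for a real implementation).

```python
from typing import Callable


def run_subagent(
    run_task: Callable[[str], list[dict]],    # executes the task in a fresh, clean context
    summarize: Callable[[list[dict]], str],   # distills the child transcript
    task: str,
) -> str:
    """Sub-agent as compression: the child may burn many tokens in its own
    context window, but only the distilled summary returns to the parent."""
    transcript = run_task(task)
    return summarize(transcript)
```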
5. Open Questions¶
Compression¶
- Verification: Can we build a cheaper alternative to Gemini CLI's two-pass probe?
- Optimal threshold: Pi compresses near the limit, Gemini CLI at 50%. Where's the optimal point?
- Encrypted vs readable: Codex preserves opaque model state; Claude Code produces readable 9-section text. Which loses less?
Retrieval¶
- Graph in context: Could graph-based context representation (§4) produce better compression than linear summaries?
- LLM-as-retriever: Supermemory's ASMR replaces vector DB with LLM reasoning. Viable direction, or cost prohibitive?
- Prompt placement: No agent validates whether placing rules in the system prompt versus the user message affects output quality. No A/B testing exists.
Learning¶
- Weight-based memory: Is per-user LoRA viable? What's the update cycle and storage cost?
- Hybrid pipeline: Can agents accumulate external memories and periodically fine-tune them into weights?
- Personality boundary: At what point does fine-tuning produce something prompt engineering can't replicate?
Research Materials¶
Individual Project Research¶
| Category | Projects |
|---|---|
| Memory frameworks (Gen 1) | Mem0, Letta, Graphiti |
| Memory frameworks (Gen 2) | Hindsight, Mastra OM, MemOS, Supermemory |
| Vector databases | Qdrant, Chroma |
| Coding assistants | Cursor, Augment, Continue |
| Context management | Pi, OpenClaw, Gemini CLI, Claude Code, Codex, OpenCode |
| Reverse engineering | ChatGPT Memory, Claude Memory |
| Official guidance | Anthropic Context Engineering |
Summary Documents¶
| Document | Scope |
|---|---|
| findings.md | Cross-domain findings (10 findings, detailed analysis) |
| context.summary.md | Context management comparison (6 agents, patterns, open questions) |
| memory.summary.md | Memory research summary (Phase 1: 2025-12) |
| memory.ecosystem.md | Market overview (GitHub stars, funding, selection guides) |
Plans¶
| Document | Status |
|---|---|
| plan/1-context-research.md | Completed |
| plan/2-learning-research.md | TODO (4 research directions planned) |
| plan/3-memory-update.md | Research done, summary pending |