LLM Agent Information Management: Research Summary

Last Updated: 2026-03-24

A synthesis of research across 20+ projects studying how LLM agents manage information — from memory frameworks and context management to the frontier of continuous learning.


0. Two Analytical Angles

This research examines LLM agent information management from two complementary angles.

Angle 1: Three Methods — How Agents Handle Information

| Method | What it does | Maturity |
|---|---|---|
| Compression | Reduce information to fit constraints: summarization, fact extraction, truncation | Production-ready, universally deployed |
| Retrieval | Select relevant information from a larger pool: vector search, text search, graph traversal | Production-ready, approaches vary widely |
| Learning | Write knowledge into model weights so it persists without external storage | Research frontier, no production deployment |

Angle 2: Three Directions — Where Agents Apply These Methods

| Direction | Time scale | Research status |
|---|---|---|
| Memory | Cross-session: persist knowledge between conversations | 15 projects studied |
| Context | Within-session: manage the context window during a conversation | 6 agents + Anthropic guidance studied |
| Learning | Post-deployment: adapt model weights after training | Plan written, not started |

Parts 1-3 follow the three directions. Part 4 discusses cross-domain findings visible only when studying memory and context together.


1. Memory: Persisting Knowledge Across Conversations

1.1 Consumer Product Memory: Two Philosophies

Reverse engineering of ChatGPT and Claude reveals two opposing approaches. (ChatGPT details | Claude details)

| | ChatGPT | Claude |
|---|---|---|
| Strategy | Pre-inject everything | Retrieve on demand |
| Mechanism | 33 pre-computed facts + recent chat summaries, always in context | conversation_search and recent_chats tools, invoked when needed |
| Trade-off | Fixed token cost, automatic continuity | Variable cost, risk of missing relevant info |

Neither uses vector databases or RAG at runtime — both rely on simpler approaches than expected.
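The contrast can be sketched in a few lines of Python. This is a toy illustration of the two context-assembly strategies, not actual product code; the message shapes follow the common chat-API convention, and only the tool names (conversation_search, recent_chats) come from the table above:

```python
def build_context_pre_inject(facts, summaries, user_msg):
    """ChatGPT-style: every stored fact and summary is injected up front,
    on every turn, whether or not it is relevant."""
    memory_block = "\n".join(facts + summaries)
    return [
        {"role": "system", "content": f"Known user facts:\n{memory_block}"},
        {"role": "user", "content": user_msg},
    ]

def build_context_on_demand(user_msg, tools=("conversation_search", "recent_chats")):
    """Claude-style: context starts lean; memory tools are offered, and the
    model decides whether past context is worth a lookup."""
    messages = [
        {"role": "system", "content": "Call a memory tool if past context would help."},
        {"role": "user", "content": user_msg},
    ]
    return messages, list(tools)

msgs = build_context_pre_inject(
    ["Prefers Python", "Works in fintech"], ["Last chat: debugging a parser"], "Hi again")
# Fixed token cost: the memory block is always present, relevant or not.
```

The trade-off in the table falls directly out of the shapes: the first function pays a constant token cost every turn; the second pays nothing until the model decides a lookup is needed, and loses if it decides wrong.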

1.2 Memory Frameworks: Two Generations

Generation 1 (2024-2025): Foundational Approaches

| Framework | Core idea | What makes it different | Production adoption |
|---|---|---|---|
| Mem0 | LLM-driven fact CRUD | LLM decides ADD/UPDATE/DELETE on atomic facts; dual storage (vector + graph) | AWS Agent SDK official provider |
| Letta | Three-tier self-editing memory | Agent writes to its own prompt (Core/Recall/Archival); active, not passive | 11x sales AI, Kognitos |
| Graphiti | Bi-temporal knowledge graph | Temporal validity on entities and relationships; graph traversal retrieval | Zep AI (YC), 21.2k stars |

Key differentiator from traditional RAG: all three use active memory management — the LLM decides what to store and delete, rather than passively chunking documents. (Production details)
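The active-management pattern reduces to a small decision loop. In this hedged sketch the decision is stubbed deterministically so it runs; in Mem0 and its peers, the decision is an LLM call over the candidate fact plus its retrieved neighbors, and the function name and store layout here are illustrative:

```python
memory = {}  # subject -> current fact text

def apply_fact(memory, subject, text):
    """Decide and apply one memory operation for an extracted fact.
    Returning the op name makes the decision auditable."""
    if text is None and subject in memory:
        del memory[subject]          # fact retracted by the user
        return "DELETE"
    if subject not in memory:
        memory[subject] = text       # novel fact
        return "ADD"
    if memory[subject] != text:
        memory[subject] = text       # contradicting fact replaces the old one
        return "UPDATE"
    return "NOOP"                    # duplicate; store nothing

apply_fact(memory, "editor", "uses Vim")    # -> "ADD"
apply_fact(memory, "editor", "uses Emacs")  # -> "UPDATE"
```

The contrast with passive RAG is the UPDATE and DELETE branches: a chunked document store can only accumulate, while an active store converges on the current truth.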

Generation 2 (2026 Q1): New Wave

| System | Core innovation | LongMemEval |
|---|---|---|
| Hindsight | Four memory networks (World/Experience/Observation) + reflect operation; multi-strategy retrieval (graph + BM25 + vector) | 91.4% |
| Mastra OM | Pure compression, no retrieval — Observer + Reflector agents compress everything into context | 94.87% |
| MemOS | Memory as OS resource; MemCube unifies three types (plaintext, KV-cache, weights) | N/A |
| Supermemory | ASMR: LLM-as-retriever replaces vector DB; 3+3 agent pipeline with ensemble voting | 98.6% (oracle) |

The generation shift reveals two key trends: LLM-as-retriever replacing vector search (Supermemory), and pure compression emerging as a viable alternative to retrieval (Mastra OM's 94.87% with zero retrieval challenges the assumption that retrieval is necessary at all).
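The pure-compression idea can be illustrated with a bounded running summary. Only the Observer/Reflector names come from the table above; the internals below are invented for the sketch, and a real reflector would rewrite the summary with an LLM rather than slice it:

```python
def observe(summary: str, event: str, max_chars: int = 500) -> str:
    """Fold one observed event into a single bounded summary.
    There is no store and no retrieval step: the summary IS the memory."""
    summary = f"{summary}\n- {event}".strip()
    if len(summary) > max_chars:
        # Crude stand-in for the Reflector's LLM re-compression pass.
        summary = summary[-max_chars:]
    return summary

state = ""
for turn in ["user asked about parsers", "agent proposed a grammar", "tests passed"]:
    state = observe(state, turn)
# The agent carries `state` forward; nothing else persists.
```

The design bet is that a good enough compressor makes the "what do I retrieve?" question disappear, at the cost of making silent loss (see §4) the only failure mode left.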

1.3 Coding Assistant Memory

A different domain where memory means "understanding the codebase":

| Assistant | Key innovation |
|---|---|
| Cursor | Custom embedding model trained from agent session traces |
| Augment | Real-time personal index per developer; edit event streaming (+2.6% improvement) |
| Continue | BYOM architecture with content-addressed caching |

See memory.ecosystem.md for full market overview including vector databases (Qdrant, Chroma), graph databases, and embedding models.


2. Context: Managing the Conversation Window

2.1 Universal Pattern

All six agents studied share the same base pattern: (Full comparison)

Messages accumulate → Threshold reached → Compress/summarize → Continue with summary

The differences are in when, how, and where compression happens.
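The shared loop is compact enough to sketch directly. summarize() is a stub standing in for each agent's LLM call, and the threshold numbers are illustrative (the real when/how/where is exactly what the table below compares):

```python
def summarize(messages):
    """Stub for the LLM summarization call each agent plugs in here."""
    return {"role": "system", "content": f"[summary of {len(messages)} messages]"}

def append_with_compaction(history, msg, max_len=8, keep_recent=2):
    """Accumulate -> threshold reached -> compress -> continue with summary."""
    history.append(msg)
    if len(history) >= max_len:                      # threshold reached
        old, recent = history[:-keep_recent], history[-keep_recent:]
        history = [summarize(old)] + recent           # continue with summary
    return history

history = []
for i in range(20):
    history = append_with_compaction(history, {"role": "user", "content": f"msg {i}"})
# History stays bounded; recent messages survive verbatim, the rest as summary.
```

Every design choice in §2.2 is a parameter of this loop: Pi sets max_len near the context limit, Gemini CLI at 50%, Codex additionally caps each msg before it ever enters history.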

2.2 Architecture Spectrum

| Agent | Key characteristic | Detail |
|---|---|---|
| Pi | Minimal | Single LLM summary at near-limit; ~300 word system prompt; no pre-processing |
| Codex | Per-item truncation | 10KB limit per tool output at record time; dual compaction (server encrypted + client LLM) |
| Gemini CLI | Verified compression | LLM summary + second LLM probe to catch omissions; 50% threshold |
| OpenCode | Two-phase | Prune old tool outputs first, then LLM summary; resumable sub-agents; fork/revert |
| Claude Code | Server-side + model-aware | API compaction with 9-section summary; model knows its remaining budget (<budget:token_budget>) |
| OpenClaw | Multi-stage pipeline | sanitize → validate → truncate → assemble; pluggable ContextEngine with 7 lifecycle hooks |

2.3 Key Design Patterns

These patterns reveal a common trajectory: compression responsibility is shifting from client heuristics toward server-side APIs and model self-management.

Reactive → Proactive Compression — Most agents wait until context is nearly full. Exceptions: Codex truncates per-item at entry, Gemini CLI pre-summarizes tool outputs. Earlier, lighter compression reduces the risk of a single catastrophic information loss event.
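Record-time capping in the Codex style can be sketched as follows. The 10KB figure comes from the table above; the truncation-marker format is invented for the illustration:

```python
MAX_TOOL_OUTPUT = 10 * 1024  # per-item byte cap, applied when the result is recorded

def record_tool_output(output: str) -> str:
    """Cap one tool result at entry so no single output can dominate the
    window later; an explicit marker tells the model something was cut."""
    data = output.encode("utf-8")
    if len(data) <= MAX_TOOL_OUTPUT:
        return output
    kept = data[:MAX_TOOL_OUTPUT].decode("utf-8", errors="ignore")
    return kept + f"\n[... truncated {len(data) - MAX_TOOL_OUTPUT} bytes]"
```

Because the cap is per-item and immediate, the later "nearly full" compression event has far less to lose in one shot, which is exactly the risk-spreading argument above.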

Client-Side → Server-Side Compaction — Claude Code (compact-2026-01-12) and Codex (/responses/compact) offload compaction to server APIs. This enables encrypted state preservation, mid-stream compaction, and simpler client implementations.

Model Self-Management — Claude Code's <budget:token_budget> makes the model aware of remaining capacity. Combined with server-side compaction, the model can self-manage without client heuristics. No other agent has this — it may represent the direction all agents converge toward.
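Budget awareness is mechanically simple: surface the remaining capacity to the model each turn. The tag name is Claude Code's as reported above, but the exact wire format and placement here are my illustrative rendering, not Anthropic's documented one:

```python
def with_budget(messages, context_limit: int, used_tokens: int):
    """Append a system note telling the model how many tokens remain,
    so it can pace its own tool use and decide when to wrap up."""
    remaining = max(context_limit - used_tokens, 0)
    budget_note = {
        "role": "system",
        "content": f"<budget:token_budget>{remaining}</budget:token_budget>",
    }
    return messages + [budget_note]

turn = with_budget([{"role": "user", "content": "refactor this module"}],
                   context_limit=200_000, used_tokens=150_000)
```

Combined with a server-side compaction endpoint, this removes the client from the loop entirely: the model sees its budget, and the server handles the squeeze.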

Context Rot — Anthropic identifies four types of context degradation: poisoning (stale info), distraction (irrelevant info), confusion (similar info), clash (contradictory info). Most agents only address distraction. The other three are largely unmitigated. (Anthropic guidance details)


3. Learning: Writing Knowledge Into Weights (TODO)

The third method and third research direction. Currently no production agent uses weight updates for personalization — all rely on external memory (compression + retrieval).

Research plan covers four directions: (Full plan)

| Direction | What | Status |
|---|---|---|
| Academic survey | Continual learning, catastrophic forgetting, self-evolving LLMs | TODO |
| Per-user personalization | Multi-LoRA serving, Sakana Doc-to-LoRA, DoorDash production case | TODO |
| VTuber / Character AI | Neuro-sama (weight-embedded personality) vs prompt-crafted personality | TODO |
| Hybrid pipeline | Accumulate external memories → batch LoRA fine-tune → deploy | TODO (speculative) |

Key question: Is per-user LoRA a viable middle ground between prompt injection and full fine-tuning?
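The storage arithmetic behind that question is worth making concrete. A LoRA adapter replaces a full d×d weight delta with two thin factors B (d×r) and A (r×d), r ≪ d, applied as W' = W + BA. A toy pure-Python sketch (dimensions are unrealistically small; production d is in the thousands and r typically 4-64):

```python
def matmul(X, Y):
    """Naive matrix product, kept dependency-free for the sketch."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

d, r = 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # shared base weights
B = [[0.1] for _ in range(d)]        # d x r, stored per user
A = [[0.0, 1.0, 0.0, 0.0]]          # r x d, stored per user

delta = matmul(B, A)                 # rank-1 update, materialized only at serve time
W_user = [[w + dv for w, dv in zip(wrow, drow)] for wrow, drow in zip(W, delta)]
# Per-user storage: d*r + r*d = 8 numbers vs d*d = 16 for a full delta;
# the gap grows quadratically with d, which is what makes per-user adapters plausible.
```

The open parts of the question (update cadence, serving many adapters per GPU) are exactly what the Multi-LoRA line in the plan above targets.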


4. Cross-Domain Findings

Findings visible only when studying memory and context together. (Full findings with 10 detailed analyses)

Memory and Context Are the Same Problem at Different Time Scales

Compaction IS memory creation. When Claude Code generates a 9-section summary during compaction, it creates a "memory" of the conversation. When Mem0 extracts facts, it "compacts" the conversation into durable storage. The methods (compression, retrieval) are shared; only the time scale differs. Techniques from one domain should transfer to the other.

Two Philosophies Appear in Both Domains

"Give everything, trust the model" (ChatGPT pre-injects all facts; Pi sends full history) vs "Curate aggressively, minimize noise" (Claude retrieves on-demand; OpenClaw's multi-stage pipeline). Neither is strictly better. As context windows grow (1M+), the balance shifts toward "trust the model" — but context rot (§2.3) pushes back. This tension is unresolved.

Compression Quality Is the Shared Unsolved Problem

Every system that compresses — whether memory extraction (Mem0 losing nuance) or context compaction (Pi's unverified summary) — risks silent information loss. No system reliably knows what was lost. Gemini CLI's two-pass probe (§2.2) is the only verification attempt, and it doubles the cost.
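The probe idea itself is simple; the cost lives in what implements it. A keyword-based stand-in (the real Gemini CLI probe is a second LLM call over source and summary, which is precisely why it doubles the cost; this deterministic version only illustrates the shape):

```python
def probe_summary(source: str, summary: str, must_keep: list[str]) -> list[str]:
    """Return required items present in the source but missing from the
    summary, i.e. information the compression silently dropped."""
    return [item for item in must_keep if item in source and item not in summary]

omissions = probe_summary(
    source="User set API_KEY=abc and prefers tabs over spaces.",
    summary="User set API_KEY and configured the editor.",
    must_keep=["API_KEY", "tabs"],
)
# omissions == ["tabs"]: the formatting preference was lost in compression.
```

A cheaper verifier than a second full LLM pass, perhaps along these extract-then-check lines, is the open question listed in §5.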

Graph Structures Are Unexplored in Context

Graphiti's bi-temporal knowledge graph is a memory breakthrough (§1.2). In context management, no agent uses graph structures — all use linear message arrays. No one tracks causal relationships between tool calls or how user intent evolves during a conversation.

Text Search Dominates Over RAG in Agentic Loops

Every coding agent studied uses grep/glob (text search) at runtime, not vector search. Reasons: zero index cost, exact matching, always current, explainable. RAG appears in the memory layer (cross-session retrieval) but not in real-time agent operation. This suggests RAG's value is in knowledge-base retrieval, not agentic execution. (Full analysis)
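The runtime advantages are visible in how little machinery text search needs. A grep-like tool over in-memory files (real agents shell out to grep/ripgrep over the working tree; this stand-in just shows that there is no index to build, embed, or refresh):

```python
import re

def grep(files: dict[str, str], pattern: str) -> list[tuple[str, int, str]]:
    """Return (path, line_no, line) for every regex match.
    Always current: it reads whatever the files contain right now."""
    hits = []
    rx = re.compile(pattern)
    for path, text in files.items():
        for i, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((path, i, line))
    return hits

files = {"app.py": "def load_config():\n    return CONFIG", "util.py": "CONFIG = {}"}
hits = grep(files, r"CONFIG")
# Exact, explainable results with zero index cost: each hit is a literal line.
```

A vector index would need (re)embedding on every edit and can only answer "similar", not "exactly this identifier", which is why it stays in the cross-session memory layer.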

Sub-Agents Are a Compression Strategy

Sub-agents appear as a context management technique: give a focused task its own clean context window, return a compressed summary. This is the same pattern as memory extraction — take raw experience, distill it, store only the distillation. Designing sub-agent boundaries is fundamentally a compression design problem.
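The pattern in miniature (run_subagent is a stub; a real one drives its own LLM loop against its own clean context, and only the distilled result crosses back):

```python
def run_subagent(task: str) -> str:
    """Run a focused task in an isolated context; return only a summary."""
    # Stand-in for many tool calls accumulating in the child's own window...
    transcript = [f"step {i}: ..." for i in range(50)]
    # ...of which the parent sees only the distillation.
    return f"Result for '{task}' ({len(transcript)} steps elided)"

parent_context = ["user: audit the auth module"]
parent_context.append(run_subagent("audit auth module"))
# The parent pays for one line instead of fifty: drawing the sub-agent
# boundary decides what gets compressed away, i.e. it is compression design.
```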


5. Open Questions

Compression

  • Verification: Can we build a cheaper alternative to Gemini CLI's two-pass probe?
  • Optimal threshold: Pi compresses near the limit, Gemini CLI at 50%. Where's the optimal point?
  • Encrypted vs readable: Codex preserves opaque model state; Claude Code produces readable 9-section text. Which loses less?

Retrieval

  • Graph in context: Could graph-based context representation (§4) produce better compression than linear summaries?
  • LLM-as-retriever: Supermemory's ASMR replaces vector DB with LLM reasoning. Viable direction, or cost prohibitive?
  • Prompt placement: No agent validates whether placing rules in the system prompt vs the user message affects output quality; no A/B testing exists.

Learning

  • Weight-based memory: Is per-user LoRA viable? What's the update cycle and storage cost?
  • Hybrid pipeline: Can agents accumulate external memories and periodically fine-tune them into weights?
  • Personality boundary: At what point does fine-tuning produce something prompt engineering can't replicate?

Research Materials

Individual Project Research

| Category | Projects |
|---|---|
| Memory frameworks (Gen 1) | Mem0, Letta, Graphiti |
| Memory frameworks (Gen 2) | Hindsight, Mastra OM, MemOS, Supermemory |
| Vector databases | Qdrant, Chroma |
| Coding assistants | Cursor, Augment, Continue |
| Context management | Pi, OpenClaw, Gemini CLI, Claude Code, Codex, OpenCode |
| Reverse engineering | ChatGPT Memory, Claude Memory |
| Official guidance | Anthropic Context Engineering |

Summary Documents

| Document | Scope |
|---|---|
| findings.md | Cross-domain findings (10 findings, detailed analysis) |
| context.summary.md | Context management comparison (6 agents, patterns, open questions) |
| memory.summary.md | Memory research summary (Phase 1: 2025-12) |
| memory.ecosystem.md | Market overview (GitHub stars, funding, selection guides) |

Plans

| Document | Status |
|---|---|
| plan/1-context-research.md | Completed |
| plan/2-learning-research.md | TODO (4 research directions planned) |
| plan/3-memory-update.md | Research done, summary pending |