
Cross-Domain Findings: Memory × Context

Last Updated: 2026-04-15

Findings from studying memory implementations (8 projects) and context management (7 agents) in LLM systems.


Framework: Three Pillars of LLM Information Management

Memory and context are not separate problems — they are the same problem at different time scales. Context is "memory within a conversation"; memory is "context across conversations." Compaction generates summaries that are functionally identical to memory extraction.

All mechanisms studied across both domains reduce to three fundamental pillars:

Pillar 1: Compression

Reduce information to fit constraints. Necessary because context windows are finite and attention degrades with length (context rot).

| Approach | Mechanism | Examples |
| --- | --- | --- |
| Programmatic removal | Rule-based truncation, no LLM involved | Codex per-item truncation (10KB limit), OpenCode prune() (erase old tool outputs) |
| Tool result clearing | Remove raw results of past tool calls | Claude Code context_editing API, Gemini CLI reverse token budget |
| LLM summarization | Generate compressed summary when approaching threshold | All agents' compaction (Pi 6-section, Claude Code 9-section, Gemini CLI with verification probe) |
| Hierarchical compression | Multi-level: recent → detailed, old → summary, very old → facts | Memory systems (Mem0 extracts facts, Letta's three-tier Core/Recall/Archival) |
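The first two rows need no model in the loop. Below is a minimal sketch of such a rule-based pass over an OpenAI-style message list, assuming a per-item size cap and a recency cutoff; the constants, stub text, and function name are illustrative, not the actual Codex or OpenCode logic.

```python
MAX_ITEM_CHARS = 10_000   # per-item cap, analogous to a 10KB truncation rule
KEEP_RECENT = 5           # raw outputs kept only for the N most recent tool calls

def prune_tool_outputs(messages: list[dict]) -> list[dict]:
    """Apply rule-based compression to a message list (oldest first)."""
    tool_indices = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    old = set(tool_indices[:-KEEP_RECENT]) if len(tool_indices) > KEEP_RECENT else set()

    pruned = []
    for i, msg in enumerate(messages):
        msg = dict(msg)
        if msg.get("role") == "tool":
            content = msg.get("content", "")
            if i in old:
                # Erase old raw output entirely; leave a stub so the model
                # still sees that the call happened.
                msg["content"] = "[output pruned]"
            elif len(content) > MAX_ITEM_CHARS:
                # Truncate oversized items rather than dropping them.
                msg["content"] = content[:MAX_ITEM_CHARS] + "\n…[truncated]"
        pruned.append(msg)
    return pruned
```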

Pillar 2: Retrieval

Select relevant information from a larger pool and inject into context. Necessary because not all stored information is relevant to the current task.

| Approach | When loaded | Examples |
| --- | --- | --- |
| Preload everything | Session start, fixed cost | ChatGPT Memory (33 facts always injected), Pi (full history every call) |
| Preload important + JIT rest | Hybrid: important at start, details on demand | Claude Code (CLAUDE.md preloaded + glob/grep on demand), Gemini CLI (GEMINI.md + tools) |
| On-demand only | When model decides to search | Claude Memory (conversation_search tool), vector search (Qdrant/Chroma) |
| Per-node filtering | Per LLM call, tag-based | Self-developed workflow agent (context_filter: Full/Cap-restricted/None per capability) |
| Hierarchical retrieval | Top-level first, drill down | Claude Code memory (MEMORY.md first 200 lines preloaded, deeper content via memory_search/memory_get) |

Current trend: Hybrid (preload + JIT) is the dominant approach. Preload everything doesn't scale; pure JIT is too aggressive and risks missing information.
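A minimal sketch of the hybrid pattern, assuming a single always-preloaded file and a grep-like tool the model can call on demand; the file name, tool behavior, and result cap are illustrative rather than any specific agent's implementation.

```python
from pathlib import Path

PRELOAD_FILE = Path("PROJECT_MEMORY.md")   # hypothetical CLAUDE.md / GEMINI.md analogue

def build_initial_context(system_prompt: str) -> list[dict]:
    """Fixed-cost preload: the always-relevant file rides along with every session."""
    preloaded = PRELOAD_FILE.read_text() if PRELOAD_FILE.exists() else ""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"Project memory:\n{preloaded}"},
    ]

def search_files(pattern: str, root: str = ".") -> str:
    """JIT retrieval: exposed to the model as a tool, scans live files on demand."""
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern in line:
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return "\n".join(hits[:50]) or "no matches"
```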

Pillar 3: Continuous Learning (Unexplored)

Write knowledge into model weights so it persists without external storage. This would eliminate the need for external memory systems entirely.

| Status | Details |
| --- | --- |
| Academic research | Continual learning, catastrophic forgetting mitigation (LoRA, self-distillation, rehearsal) |
| Production reality | No coding agent does this. All use external memory (Pillar 1 + 2) instead of weight updates |
| Blockers | Catastrophic forgetting, compute cost, no standard methodology for per-user adaptation |
| Future direction | May converge with memory — e.g., accumulate external memories, then periodically batch-write into weights via fine-tuning |

The most effective systems combine Pillar 1 and 2: compress what's old, retrieve what's relevant. Pillar 3 remains a research frontier.


Detailed Findings

Finding 1: Memory and Context Are the Same Problem at Different Time Scales

Memory (cross-session) and context (within-session) face identical challenges:

| Challenge | Memory | Context |
| --- | --- | --- |
| What to keep | Fact extraction (Mem0), entity tracking (Zep) | Compaction summary (all agents) |
| What to discard | Conflict resolution, outdated facts | Old tool outputs, resolved errors |
| How to compress | LLM summarization, knowledge graph | LLM summarization, structured templates |
| How to retrieve | Vector search, graph traversal, pre-injection | Full context, filtering, token budgeting |

Compaction IS memory creation. When Claude Code generates a 9-section summary during compaction, it's creating a "memory" of the conversation. When Mem0 extracts facts from a conversation, it's "compacting" the conversation into durable storage.
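A sketch of that equivalence: one distillation step with two destinations. The prompt wording and the `llm` callable are placeholders; the point is that compaction and memory extraction consume the same input and differ only in where the output goes.

```python
from typing import Callable

def distill(messages: list[dict], llm: Callable[[str], str]) -> str:
    """Shared step: compress a transcript into durable facts and open tasks."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)
    return llm(f"Summarize the durable facts and open tasks in:\n{transcript}")

def compact(messages: list[dict], llm) -> list[dict]:
    """Context path: replace old turns with the distillation."""
    summary = distill(messages, llm)
    return [{"role": "system", "content": f"Summary of earlier turns:\n{summary}"}]

def extract_memory(messages: list[dict], llm, store: list[str]) -> None:
    """Memory path: persist the distillation across sessions."""
    store.append(distill(messages, llm))
```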

Implication: Techniques from one domain likely transfer to the other. Graph-based memory (Graphiti) has no equivalent in context management yet. Proactive compression from context (self-developed workflow agent's per-node summary) has no equivalent in memory yet.

Finding 2: Two Philosophies Appear in Both Domains

"Give everything, trust the model" - Memory: ChatGPT pre-injects all 33 facts every conversation - Context: Pi sends full history every LLM call

"Curate aggressively, minimize noise" - Memory: Claude retrieves on-demand via conversation_search - Context: OpenClaw's multi-stage pipeline, self-developed workflow agent's per-node context_filter

Neither philosophy is strictly better. The "trust the model" approach is simpler to implement and works well with large context windows. The "curate" approach scales better but adds engineering complexity and risks filtering out relevant information.

As context windows grow (1M+ tokens), the balance shifts toward "trust the model" for context management. But context rot (accuracy degradation with length) pushes back toward curation. This tension is unresolved.

Finding 3: Compression Quality Is the Shared Unsolved Problem

Every system that compresses information risks losing something critical.

Memory side:
  • Mem0's fact extraction can lose nuance ("user prefers Python" loses the context of why)
  • Letta's self-editing memory can drift from the original facts over many updates
  • Graphiti's knowledge graph preserves relationships but may miss implicit context

Context side:
  • Pi's single-pass summary has no verification — information loss is silent
  • Gemini CLI adds a second LLM "probe" call to catch omissions (only agent to do this)
  • Claude Code uses 9 structured sections to ensure coverage, but doesn't verify
  • Codex's encrypted server-side compaction preserves latent model state, but is opaque

No system has a reliable way to know what was lost during compression. Gemini CLI's probe is the closest attempt, but it doubles the cost.
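A hedged sketch of what a verification probe looks like in spirit: a second LLM call is asked what the summary dropped. The probe prompt and `llm` callable are assumptions, not the actual Gemini CLI implementation; the extra pass over the full transcript is why the cost roughly doubles.

```python
def verify_compaction(original: str, summary: str, llm) -> bool:
    """Return True if the probe finds nothing important missing from the summary."""
    probe = (
        "Below is a transcript and a summary of it. List any decisions, file "
        "paths, or constraints present in the transcript but missing from the "
        "summary. Reply with NONE if nothing important is missing.\n\n"
        f"TRANSCRIPT:\n{original}\n\nSUMMARY:\n{summary}"
    )
    missing = llm(probe)          # second full-length call: the source of the doubled cost
    return missing.strip().upper() == "NONE"
```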

Finding 4: Structured vs Narrative Is a Fundamental Split

Memory research found three forms: structured facts (Mem0), narrative text (Letta), relationship graphs (Graphiti).

Context research found the same split: all mainstream agents use a single narrative channel (conversation messages), while the self-developed workflow agent separates structured data (Ports) from narrative context (ContextMessages).

In mainstream agents, structured data (JSON tool results, code snippets, file contents) is forced into the narrative conversation format. This wastes tokens and makes extraction harder for the model. The dual-channel approach addresses this but adds architectural complexity.
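An illustrative sketch of a dual-channel node input, loosely modeled on the Ports vs ContextMessages split; the class and field names are hypothetical. Structured payloads stay typed until the moment of the LLM call instead of being interleaved with the conversation history.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class NodeInput:
    ports: dict[str, Any] = field(default_factory=dict)    # structured channel: JSON, file contents, tool results
    context: list[dict] = field(default_factory=list)      # narrative channel: conversation messages only

    def render_prompt(self) -> list[dict]:
        """Structured data is attached as one compact, labeled block at call
        time, rather than being scattered through the message history."""
        port_block = "\n".join(f"### {name}\n{value}" for name, value in self.ports.items())
        return self.context + [{"role": "user", "content": f"Inputs:\n{port_block}"}]
```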

Implication: The industry-standard "everything is a message" approach may be fundamentally wasteful for structured workflows. As agent tasks become more complex (multi-step, multi-tool), pressure to separate channels will increase.

Finding 5: Server-Side Processing Is the Trend

Both domains are moving computation server-side:

| Era | Memory | Context |
| --- | --- | --- |
| Early | Client-side vector DB (Chroma, Qdrant) | Client-side compaction (Pi, Gemini CLI) |
| Current | Cloud memory services, API-integrated | Server-side compaction API (Claude Code compact-2026-01-12, Codex /responses/compact) |
| Emerging | Model-native memory (ChatGPT built-in) | Model-native context awareness (Claude <budget:token_budget>) |

The endpoint is likely model-native: the model itself manages both memory and context, with the harness providing only raw inputs. Claude's context awareness (model knows its remaining token budget) is a step in this direction.

Finding 6: Sub-Agents Are a Context Strategy, Not Just a Feature

Sub-agents appear in context research as a practical solution to context overflow: give a focused task its own clean context window, return a compressed summary.

This is the same pattern as memory extraction: take raw experience, distill it into a compact representation, store only the distillation.

| Pattern | Memory equivalent | Context equivalent |
| --- | --- | --- |
| Extract and store | Mem0 fact extraction | Sub-agent returns summary |
| Full vs compressed | Raw conversation vs facts | Full tool output vs summary_exchange |
| Selective retrieval | Vector search top-k | context_filter per node |

Implication: Designing sub-agent boundaries is fundamentally a compression design problem — what information should cross the boundary, and in what form.
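A minimal sketch of the pattern, assuming a generic `run_agent` loop supplied by the harness: the child starts from a clean context seeded only with its task, and only its final summary crosses back into the parent.

```python
def run_subagent(task: str, run_agent, max_summary_tokens: int = 500) -> str:
    """Run a focused sub-agent in an isolated context; return only its summary."""
    child_context = [
        {"role": "system", "content": (
            "You are a focused sub-agent. Explore as needed, then answer with a "
            f"summary under {max_summary_tokens} tokens: findings, file paths, open questions."
        )},
        {"role": "user", "content": task},
    ]
    transcript = run_agent(child_context)   # the child may burn a large context here
    return transcript[-1]["content"]        # only the distillation crosses the boundary
```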

Finding 7: Knowledge Graphs Are Unexplored in Context

Memory research identified Graphiti's bi-temporal knowledge graph as the 2025 breakthrough (21.2k GitHub stars). It tracks entities, relationships, and temporal validity.

In context management, no agent uses graph structures. All use linear message arrays + text summaries. No one tracks:
  • Causal relationships between tool calls
  • How user intent evolves during a conversation
  • Dependencies between code modifications

This is a potential research direction: could graph-based context representation produce better compression than linear summaries? The information exists (tool A's output fed into tool B's input), but it's flattened into text when it enters the context.
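A speculative sketch of what such a representation could look like: tool calls become nodes, "output fed into input" becomes an edge, and retrieval can follow the causal chain instead of replaying the linear transcript. No studied agent implements this; all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCallNode:
    call_id: str
    tool: str
    args: dict
    output_digest: str                     # compressed digest, not the raw output

@dataclass
class ContextGraph:
    nodes: dict[str, ToolCallNode] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)   # (producer, consumer)

    def add_call(self, node: ToolCallNode, consumed_from: list[str]) -> None:
        self.nodes[node.call_id] = node
        self.edges += [(src, node.call_id) for src in consumed_from]

    def ancestors(self, call_id: str) -> set[str]:
        """Retrieve only the causal chain behind one result."""
        found, frontier = set(), {call_id}
        while frontier:
            nxt = {src for src, dst in self.edges if dst in frontier} - found
            found |= nxt
            frontier = nxt
        return found
```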

Finding 8: No One Validates Prompt Placement Empirically

All studied agents use simple prompt placement strategies (system prompt at start, everything else in messages). None run A/B tests on:
  • Whether rules in system prompt vs user message affect output quality
  • Whether message ordering within context affects task completion
  • Whether filtering context (OpenClaw/self-developed workflow agent) improves or degrades performance vs sending everything (Pi)

The AI Muse 18-model benchmark is the closest empirical work, and it only tested constraint compliance, not agent task performance. This is a gap in the field.
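A sketch of the missing experiment: the same tasks run with the rules in the system prompt versus appended to the user message, scored by a task-specific checker. The `llm` and `passes` callables are placeholders, not an existing harness.

```python
def ab_test_placement(tasks: list[str], rules: str, llm, passes) -> dict[str, float]:
    """Paired comparison: each task is run once per placement variant."""
    scores = {"system": 0.0, "user": 0.0}
    for task in tasks:
        variants = {
            "system": [{"role": "system", "content": rules},
                       {"role": "user", "content": task}],
            "user": [{"role": "system", "content": "You are a coding agent."},
                     {"role": "user", "content": f"{task}\n\nRules:\n{rules}"}],
        }
        for name, msgs in variants.items():
            scores[name] += passes(task, llm(msgs))   # passes() returns 0 or 1
    return {name: s / len(tasks) for name, s in scores.items()}
```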

Finding 9: Both Domains Reduce to Two Fundamental Operations — Compression and Retrieval

Every mechanism studied in both memory and context can be classified as either compression (reduce information to fit constraints) or retrieval (select relevant information from a larger pool).

| Domain | Compression | Retrieval |
| --- | --- | --- |
| Memory | Mem0 fact extraction, Letta self-editing, ChatGPT pre-computed summaries | Claude conversation_search, vector search (Qdrant/Chroma), Graphiti graph traversal |
| Context | Compaction (all agents), tool output truncation/summarization, summary_exchange | OpenClaw assemble(tokenBudget), self-developed workflow agent context_filter, sub-agent targeted exploration |

Each system is a different mix of the two:

  • Compression-heavy: Pi (send everything, compact when full), ChatGPT Memory (pre-compress 33 facts, always inject)
  • Retrieval-heavy: Claude Memory (on-demand search, no pre-injection), Graphiti (graph traversal for relevant entities)
  • Balanced: OpenClaw (filter + assemble + compact), Claude Code (server-side compact + sub-agent exploration), self-developed workflow agent (per-node filter + proactive summary)

Core tradeoff:
  • Compression: Information is irreversibly lost, but context stays small and cost is low
  • Retrieval: Information is preserved, but requires indexing/query infrastructure and may miss relevant items

This framing suggests that advancing either domain means improving one of two things: better compression (lose less during summarization) or better retrieval (find more relevant information with less noise). The most effective systems will combine both — compress what's old, retrieve what's relevant.
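A minimal sketch of that combination under assumed helper functions: old turns are summarized away (lossy compression) while the raw turns remain queryable and are re-injected only when they match the current task (selective retrieval).

```python
def manage_context(messages: list[dict], query: str, summarize, keep_recent: int = 10) -> list[dict]:
    """Combine the two pillars: compress what's old, retrieve what's relevant."""
    old, recent = messages[:-keep_recent], messages[-keep_recent:]

    # Pillar 1: compression — old turns are replaced by a lossy summary.
    compressed = [{"role": "system", "content": summarize(old)}] if old else []

    # Pillar 2: retrieval — raw old turns stay available and are re-injected
    # only when they match the current task (keyword match as a stand-in for
    # any retrieval mechanism).
    retrieved = [m for m in old if query.lower() in m["content"].lower()][:3]

    return compressed + retrieved + recent
```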

Finding 10: Text Search Dominates Over RAG in Practice

Despite the hype around RAG (Retrieval-Augmented Generation) and vector databases, no coding agent uses RAG in its core agentic loop. All 7 studied agents rely on text search:

| Agent | Search tool | Method |
| --- | --- | --- |
| Claude Code | glob, grep, Read | File pattern matching + regex |
| Codex | rg (ripgrep) | Regex text search |
| OpenCode | ripgrep | Regex text search |
| Gemini CLI | built-in grep/glob | File pattern + regex |
| Pi | grep, find | Standard unix tools |
| OpenClaw | inherited from Pi | Same |
| Self-developed agent | N/A (data via ports) | Structured port bindings |

Anthropic calls this "Agentic Search" — but the underlying mechanism is glob + grep, not embedding similarity.
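A minimal sketch of what that tool amounts to in practice, assuming ripgrep is installed; the function name and result cap are illustrative.

```python
import subprocess

def grep_tool(pattern: str, path: str = ".", max_lines: int = 100) -> str:
    """Exact, zero-index, always-current text search over live files."""
    proc = subprocess.run(
        ["rg", "--line-number", "--no-heading", pattern, path],
        capture_output=True, text=True,
    )
    lines = proc.stdout.splitlines()[:max_lines]
    return "\n".join(lines) or "no matches"
```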

Why text search wins in coding agents

| Factor | Text search (grep/glob) | RAG (vector search) |
| --- | --- | --- |
| Index cost | Zero — scan on demand | High — must compute embeddings upfront |
| Latency | ripgrep: milliseconds on large codebases | Vector query: faster, but index must exist |
| Precision | Exact — find exactly handleAuth | Fuzzy — may return semantically similar but wrong results |
| Explainability | You know why it matched | Opaque similarity score |
| Staleness | Always current (reads live files) | Index can be stale after edits |

Where RAG does appear

RAG is used in the memory layer (cross-session retrieval), not the context layer (within-session search):

| Scenario | Best method | Why |
| --- | --- | --- |
| Find a function in codebase | Text search | Exact identifier match, zero index cost |
| Find relevant past conversation | Vector search (RAG) | Natural language, semantic matching needed |
| Find user preferences in memory | Text search may suffice | Short structured facts, keyword matching works |
| Find related documentation | Vector search (RAG) | Long natural language, semantic relevance |

The production reality

Memory research found the same pattern: ChatGPT Memory uses pre-computed facts (no RAG at runtime), Claude Memory uses tool-based search. The vector databases studied (Qdrant, Chroma) are powerful infrastructure, but the agents that ship to millions of users chose simpler approaches.

This suggests RAG's value is in knowledge-base retrieval (documentation, past conversations) rather than real-time agent operation (finding code, executing tasks).


Summary Table

| Finding | Status | Action |
| --- | --- | --- |
| Memory ≈ Context at different time scales | Established | Transfer techniques across domains |
| Two philosophies (trust vs curate) | Observed, no winner | Choice depends on context window size and task type |
| Compression quality unsolved | Universal problem | Gemini CLI's probe approach worth investigating |
| Structured vs narrative split | Emerging recognition | Dual-channel architectures may become standard |
| Server-side trend | In progress | Expect more API-native memory and context features |
| Sub-agents = compression strategy | Underrecognized | Design sub-agent boundaries as compression boundaries |
| Knowledge graphs unexplored in context | Gap | Research opportunity |
| Prompt placement unvalidated | Gap | Empirical testing needed |
| Compression + Retrieval as two fundamental ops | Framework | Use to classify and evaluate any memory/context mechanism |
| Text search dominates over RAG in practice | Observed | RAG for knowledge bases; text search for real-time agent operation |
| Benchmark scores are backbone-dependent, cross-paper tables invalid | Established (Anatomy 2602.19320) | Require Δ vs Full-Context, matched backbone, LLM-judge, latency |
| Most engineering systems occupy one corner of the memory cube | Established (Survey 2512.13564) | Broaden coverage on Forms and Functions axes (MemOS, Hindsight are rare outliers) |
| Memory evolution (retroactive note updates) is an uncovered mechanism | Gap (A-MEM 2502.12110) | Consider retroactive refinement vs full regeneration |
| Learned memory policies exist (RL-trained GRPO) | New (AgeMem 2601.01885) | First tracked system beyond "prompted LLM" control |

Finding 11: Cross-Paper Benchmark Comparisons Are Mostly Invalid

Added 2026-04-15 after "Anatomy of Agentic Memory" (arxiv 2602.19320).

Every cross-system accuracy claim in our research.md files (and in most 2026 memory papers) fails at least one of four validity conditions:

  1. Matched backbone — same system swings 40+ points on LoCoMo just from gpt-4o-mini → Qwen-2.5-3B. Most tables mix backbones.
  2. Delta vs Full-Context baseline — raw scores on a saturated benchmark (LongMemEval-S and LoCoMo are both in the "Moderate saturation" band) tell you little. Δ = Score_MAG − Score_FullContext is what measures contribution; a toy illustration follows this list.
  3. Semantic-aware judging — F1 on golden spans diverges from LLM-judge semantic utility by up to ~15 points.
  4. Latency / cost reporting — hidden cost often dominates. Most papers skip this.
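A toy illustration of the Δ rule, with made-up numbers: a memory system's contribution is only defined relative to a full-context baseline on the same backbone.

```python
def delta(score_memory: float, score_full_context: float, same_backbone: bool) -> float:
    """Δ = Score_MAG − Score_FullContext, valid only on a matched backbone."""
    if not same_backbone:
        raise ValueError("Δ is undefined across different backbones")
    return score_memory - score_full_context

# e.g. 78.0 with the memory system vs 74.5 full-context on the same backbone
# gives Δ = +3.5, the mechanism's actual contribution (illustrative numbers).
print(delta(78.0, 74.5, same_backbone=True))
```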

Claims in our existing research that need methodology asterisks:
  • Hindsight "91.4% LongMemEval" (backbone-dependent)
  • Supermemory "98.6% oracle" (self-report, no Δ)
  • Mastra "94.87% LongMemEval" (tied to gpt-5-mini)
  • Any MAGMA / LiCoMemory / SimpleMem vs baseline table that mixes backbones

Action: add "⚠️ Backbone-dependent, cross-system comparison invalid" note wherever single-number benchmark claims appear without Δ. Apply when building comparison tables in future research.

Finding 12: The Memory Design Cube Has Underexplored Corners

Added 2026-04-15 after 2026 Memory Survey (arxiv 2512.13564).

Organizing our engineering-side research against the survey's three-axis taxonomy (Forms × Functions × Dynamics):

  • Nearly all engineering systems (Mem0, Supermemory, ChatGPT, Claude, OpenClaw) occupy ONE corner: token-level × factual × retrieval-heavy
  • Hindsight is rare in Function breadth (factual + experiential + observational)
  • MemOS is rare in Form breadth (token + parametric + latent)
  • Letta is a taxonomy blind spot — its locus of control (LLM self-edits memory) has no axis in the framework

Under-explored directions:
  • Parametric memory (adapters / LoRA) — only MemOS
  • Latent memory (hidden state / KV cache as persistent store) — only experimental work
  • Experiential memory (case / strategy / skill) — only Hindsight
  • Learned memory policy — only AgeMem (2026)

These gaps match the "where to look next" list in plan/4-academic-and-retrieval-research.md.