LLM Memory Systems: 2025 Technology Landscape and Production Reality

Last Updated: 2025-12-28

While researching LLM memory systems, I found some surprising things. ChatGPT and Claude take opposite approaches to memory: one always injects, the other retrieves on demand. Even more surprising, agent CLI tools like Claude Code, Codex, and Gemini CLI have memory implementations far simpler than expected: no RAG, no knowledge graphs, just plain sliding windows. This article covers open-source frameworks, vector databases, coding assistants, and the real state of production deployment.

Covers: open-source memory frameworks (Mem0/Letta/Graphiti), vector databases (Qdrant/Chroma), coding assistants (Cursor/Augment/Continue), ChatGPT and Claude memory reverse engineering, Agent CLI analysis, and production deployment reality.


1. Memory Frameworks: Three Technical Approaches

Current mainstream memory frameworks represent three distinct technical philosophies:

Mem0: LLM-Driven CRUD

Mem0's core idea is using large models to manage memory CRUD operations. It extracts facts from conversations and lets the model judge which version is more trustworthy when conflicts arise.
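
The loop is easy to picture. Below is a minimal sketch of the pattern, not Mem0's actual code: extract facts from a turn, then let the model reconcile each new fact against stored memories. The `chat` helper and both prompts are placeholders.

```python
import json

def chat(prompt: str) -> str:
    """Stand-in for any chat-completion call (OpenAI, Anthropic, ...)."""
    raise NotImplementedError

def extract_facts(turn: str) -> list[str]:
    # Step 1: pull durable, standalone facts out of a conversation turn.
    prompt = f"Extract standalone user facts from this message as a JSON list:\n{turn}"
    return json.loads(chat(prompt))

def reconcile(new_fact: str, existing: list[str]) -> dict:
    # Step 2: the model decides ADD / UPDATE / DELETE / NOOP against stored
    # memories, including judging which version to trust when facts conflict.
    prompt = (
        "Existing memories:\n" + "\n".join(existing)
        + f"\nNew fact: {new_fact}\n"
        + 'Reply as JSON: {"op": "ADD|UPDATE|DELETE|NOOP", "target": null}'
    )
    return json.loads(chat(prompt))
```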

Production validation: Mem0 has become the official memory provider for AWS Agent SDK, with real commercial cases (Sunflower healthcare, RevisionDojo education). This proves the value of "simple but effective" in production environments.

Letta: Three-Tier Memory + Self-Editing

Letta (formerly MemGPT) designed an OS-like three-tier architecture:

  • Core Memory: core facts placed in the system prompt
  • Recall Memory: vector retrieval over conversation history
  • Archival Memory: long-term knowledge storage

What makes it unique is letting the agent decide when to update its own memory. Already adopted by 11x (sales AI) and Kognitos (enterprise automation).
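
A sketch of what self-editing looks like in practice. The tool names follow MemGPT's convention (core_memory_append / core_memory_replace); the wiring around them is illustrative, not Letta's implementation.

```python
# The agent is handed tools whose side effect is rewriting its own core
# memory block in the system prompt.
core_memory: dict[str, str] = {"human": "", "persona": ""}

def core_memory_append(section: str, content: str) -> str:
    """Tool: record a newly learned fact in core memory."""
    core_memory[section] += "\n" + content
    return "OK"

def core_memory_replace(section: str, old: str, new: str) -> str:
    """Tool: correct an outdated fact in place."""
    core_memory[section] = core_memory[section].replace(old, new)
    return "OK"

# Registered as tool calls: when the model notices a stable fact ("user moved
# to Shanghai"), it edits its own memory instead of relying on a pipeline.
```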

Graphiti: Bi-Temporal Knowledge Graph

Graphiti, from the Zep team, introduces a temporal dimension — tracking not just "when we learned it" (transaction_time) but also "when the fact itself occurred" (valid_time).

This solves the state change problem: "user lived in Beijing last year" and "user lives in Shanghai now" are both true, but traditional systems struggle to preserve both simultaneously.
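
A toy illustration of the bi-temporal idea; the record layout is mine, not Graphiti's schema. Both residence facts coexist, and queries select by time instead of overwriting:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime        # when the fact held in the world
    valid_to: datetime | None   # None = still true
    transaction_time: datetime  # when the system learned it

facts = [
    Fact("user", "lives_in", "Beijing",
         datetime(2024, 1, 1), datetime(2025, 1, 1), datetime(2024, 3, 1)),
    Fact("user", "lives_in", "Shanghai",
         datetime(2025, 1, 1), None, datetime(2025, 2, 10)),
]

def lives_in(at: datetime) -> str | None:
    # Query by valid time: neither fact is deleted, both remain answerable.
    for f in facts:
        if f.predicate == "lives_in" and f.valid_from <= at \
                and (f.valid_to is None or at < f.valid_to):
            return f.obj
    return None
```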


2. Vector Databases: Two Positions

Qdrant: Balancing Performance and Features

Qdrant pursues "supporting complex filtering while maintaining recall quality": filterable HNSW indexing, sparse vector support, and RRF/DBSF hybrid ranking.
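
For example, a filtered semantic search with the Python client might look like this (the collection name, payload field, and placeholder vector are assumptions; newer client releases also offer query_points for the same purpose):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="memories",            # hypothetical collection
    query_vector=[0.1] * 384,              # stand-in for a real embedding
    query_filter=models.Filter(            # filter applied during HNSW traversal
        must=[models.FieldCondition(key="user_id",
                                    match=models.MatchValue(value="u42"))]
    ),
    limit=5,
)
```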

In production, a common Redis + Qdrant dual-layer architecture emerges: Redis for hot data and fast access, Qdrant for cold data and semantic retrieval.

Chroma: Developer Experience First

Chroma optimizes for developer experience: its pre-filtering mechanism and clean API make prototyping very smooth. A Rust rewrite (v1.0) is underway to close the gap on performance.
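
A minimal Chroma session showing the clean API and metadata pre-filtering (the collection name and documents are made up):

```python
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) for disk
col = client.create_collection("memories")

col.add(
    ids=["m1", "m2"],
    documents=["User prefers dark mode", "User lives in Shanghai"],
    metadatas=[{"kind": "preference"}, {"kind": "profile"}],
)

# Metadata filter is applied before the vector search (pre-filtering).
res = col.query(query_texts=["where does the user live?"],
                n_results=1,
                where={"kind": "profile"})
```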


3. Coding Assistants: Memory in Practice

Cursor: Learning from User Behavior

Cursor uses agent session trace data to train its own embedding model. Its vector representations are specialized for "code understanding" scenarios, not general text.

Augment: Real-Time Incremental Indexing

Augment focuses on "real-time" — monitoring edit events and dynamically updating a personal code index. According to public data, this delivers a 2.6% quality improvement.
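
Augment's implementation is proprietary, but the pattern itself is simple to sketch with the `watchdog` library: react to edit events and re-index only the changed file. The `DictIndex` stand-in and the file filter are my assumptions.

```python
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class DictIndex:
    """Trivial stand-in for a real embedding index."""
    def __init__(self):
        self.docs: dict[str, str] = {}
    def upsert(self, path: str, text: str) -> None:
        self.docs[path] = text  # a real index would re-embed here

class ReindexHandler(FileSystemEventHandler):
    def __init__(self, index: DictIndex):
        self.index = index

    def on_modified(self, event):
        if event.is_directory or not event.src_path.endswith(".py"):
            return
        with open(event.src_path) as f:
            # Re-index only the changed file instead of rebuilding everything.
            self.index.upsert(event.src_path, f.read())

observer = Observer()
observer.schedule(ReindexHandler(DictIndex()), path="src/", recursive=True)
observer.start()
```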

Continue: Open Architecture

Continue chose the BYOM (Bring Your Own Model) approach, paired with content-addressed caching. More framework than product, suited for customization needs.
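
Content-addressed caching is easy to illustrate: the cache key is a hash of the chunk itself, so unchanged content is never re-embedded regardless of which model you bring. A minimal sketch, with the `embed` stand-in marking where BYOM plugs in:

```python
import hashlib

cache: dict[str, list[float]] = {}

def embed(text: str) -> list[float]:
    raise NotImplementedError  # plug in your own model here (BYOM)

def cached_embedding(chunk: str) -> list[float]:
    key = hashlib.sha256(chunk.encode()).hexdigest()  # address = hash(content)
    if key not in cache:
        cache[key] = embed(chunk)  # only new or changed chunks hit the model
    return cache[key]
```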


4. Consumer Product Reverse Engineering: ChatGPT vs Claude Memory

Reverse-engineering request patterns and system behavior shows that ChatGPT and Claude adopt fundamentally different memory architectures, representing two opposing design philosophies.

ChatGPT: Pre-Computed Injection (Passive Memory)

ChatGPT's memory is an "always inject" model:

  • Storage: ~33 fact summaries + recent conversation summaries
  • Injection timing: Automatically included at the start of every conversation, invisible to users
  • Update mechanism: Background async extraction, no impact on conversation latency

Design philosophy: Sacrifice context space for simplicity and reliability. Users don't need to wait for retrieval — memory "naturally" exists in the conversation.

Trade-offs:

  • Pros: low latency, smooth experience, simple implementation
  • Costs: fixed context window consumption, limited memory capacity
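
A minimal sketch of the always-inject pattern as inferred from the observed behavior (not OpenAI's code; the cap and prompt layout are illustrative):

```python
MAX_FACTS = 33  # observed cap on stored fact summaries

def build_system_prompt(base: str, facts: list[str]) -> str:
    # The memory block is prepended to every conversation, unconditionally.
    memory_block = "\n".join(f"- {f}" for f in facts[:MAX_FACTS])
    return f"{base}\n\n# Memory (auto-injected, invisible to the user)\n{memory_block}"

# Every request pays the fixed token cost of the memory block, but no
# retrieval step runs at conversation time; extraction happens async.
```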

Claude: On-Demand Retrieval (Active Memory)

Claude implements memory as explicit tool calls:

  • Tool interfaces: conversation_search (semantic search of history), recent_chats (recent conversation list)
  • Trigger timing: Invoked only when the model determines it's needed; user sees "searching memory"
  • Retrieval scope: Can span longer time ranges of conversation history

Design philosophy: On-demand retrieval, precise matching. Only consume resources when truly needed.

Trade-offs:

  • Pros: saves tokens, theoretically supports larger-scale memory
  • Costs: increased latency, depends on the model correctly judging when memory is needed
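
The tool names below come from the observed behavior above; the schemas and wiring are my assumptions about how such tools would be declared:

```python
# Memory exposed as tools: the model calls these only when it judges
# memory is needed, so idle conversations spend zero tokens on it.
memory_tools = [
    {
        "name": "conversation_search",
        "description": "Semantic search over the user's past conversations.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "recent_chats",
        "description": "List the user's most recent conversations.",
        "input_schema": {
            "type": "object",
            "properties": {"n": {"type": "integer", "default": 10}},
        },
    },
]
# Passed as the tools parameter of a chat request; the user sees
# "searching memory" whenever the model actually issues a call.
```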

The Essential Difference

| Dimension | ChatGPT (Passive) | Claude (Active) |
| --- | --- | --- |
| Memory trigger | Auto-injection | Tool call |
| User perception | Invisible | Visible "searching" |
| Context usage | Fixed cost | On-demand cost |
| Latency | Low | Increases during retrieval |
| Capacity limit | Limited by injection volume | Theoretically larger |
| Implementation complexity | Low | High |

This is not just a technical choice — it reflects product philosophy: ChatGPT pursues "seamless experience," Claude pursues "transparent control."


5. Agent CLI Tools: Surprisingly Simple

After studying the implementations of Claude Code, Codex, and Gemini CLI, I found a striking phenomenon: these tools' "memory" approaches are far simpler than expected.

| Tool | Storage Format | Compression Method |
| --- | --- | --- |
| Claude Code | JSONL | Plaintext summary |
| Codex | JSONL | Encrypted JWT compression |
| Gemini CLI | Server-side | New session file per compression |

Key finding: No complex RAG pipelines, no knowledge graphs — just plain sliding windows + summary compression. This stands in stark contrast to the various advanced approaches discussed in academic papers.
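
The whole mechanism fits in a few lines. A sketch of sliding window plus summary compression, where `summarize` stands in for one LLM call and the turn count is illustrative:

```python
def compact(history: list[dict], max_tokens: int,
            count_tokens, summarize) -> list[dict]:
    """Shrink a chat history once it exceeds the context budget."""
    if count_tokens(history) <= max_tokens:
        return history
    keep = history[-10:]        # sliding window: recent turns survive verbatim
    older = history[:-10]
    summary = summarize(older)  # one LLM call compresses everything else
    return [{"role": "system",
             "content": f"Summary of earlier turns: {summary}"}] + keep
```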


6. Production Deployment Status

Consumer Products: Memory Features Live

ChatGPT and Claude have both officially launched memory features that users can experience in daily conversations:

| Product | Memory Mode | Core Feature |
| --- | --- | --- |
| ChatGPT | Passive injection | 33 fact summaries, seamless experience |
| Claude | Active retrieval | Tool calls, transparent control |

This marks a shift: memory features are transitioning from experimental to standard.

B2B Frameworks: Vertical Deployment

| Framework | Deployment Cases | Characteristics |
| --- | --- | --- |
| Mem0 | AWS Agent SDK, Sunflower, RevisionDojo | Simple approaches deploy first |
| Letta | 11x, Kognitos | Complex stateful agents |
| Graphiti | Zep AI platform core | Temporal knowledge graph |

Enterprise Common Architecture

Most enterprises (Walmart, JP Morgan, etc.) don't rely on a single framework; instead they build a dual-layer memory architecture (sketched after this list):

  • Hot layer (Redis): Recent 10-20 conversation turns, fast access
  • Cold layer (Vector DB): Semantic retrieval of historical conversations
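
A minimal sketch of this hot/cold split using redis-py, with the cold store behind an assumed `upsert`/`search` interface and illustrative thresholds:

```python
import json
import redis

r = redis.Redis()
HOT_TURNS = 20  # recent turns served straight from Redis

def remember_turn(session: str, turn: dict, cold_store) -> None:
    r.lpush(f"chat:{session}", json.dumps(turn))   # hot layer: O(1) append
    r.ltrim(f"chat:{session}", 0, HOT_TURNS - 1)   # keep only recent turns
    cold_store.upsert(turn)                        # cold layer: embed + index

def recall(session: str, query: str, cold_store) -> dict:
    hot = [json.loads(t) for t in r.lrange(f"chat:{session}", 0, HOT_TURNS - 1)]
    cold = cold_store.search(query, limit=5)       # semantic recall of old turns
    return {"recent": hot, "relevant": cold}
```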

Dual Purpose of Vector Databases

A common point of confusion: vector databases serve not just RAG (searching documents to answer questions) but also conversation memory (searching past conversations to remember who the user is). Twilio, Aquant, and OpenAI all use this approach.


7. Summary

Technical Level

Memory systems are evolving from "vector retrieval" to "structured + lifecycle management." The core problems are clear: what to extract, how to store, when to retrieve, how to update. But the optimal solution is far from settled — fact extraction, graph structures, and temporal modeling each have their advocates.

Business Level

Memory capability has become a core competitive differentiator:

  • Consumer products: ChatGPT and Claude have made memory a standard feature
  • B2B frameworks: Mem0's AWS partnership proves the commercial value of memory
  • Developer tools: Cursor/Augment use memory to improve code understanding and developer retention

The reality: the consumer side is live (ChatGPT/Claude), while B2B is still in early exploration. True long-term memory and cross-session learning still have a long way to go.

Observations

The most surprising finding: production-grade CLI tools universally adopt simple approaches. This might indicate:

  1. Simple approaches are sufficient for current scenarios
  2. The marginal benefit of complex approaches doesn't cover their cost
  3. Or, better memory systems are the next competitive frontier

Memory is the key capability that transforms LLMs from "tools" into "assistants." The technology stack is still evolving rapidly and is worth continued attention.


Resources

Based on open-source code analysis and product reverse engineering. Full research materials:


Research period: December 2025