LLM Memory Systems: 2025 Technology Landscape and Production Reality¶
Last Updated: 2025-12-28
While researching LLM memory systems, I found several surprises. ChatGPT and Claude take completely opposite approaches to memory: one always injects, the other retrieves on demand. Even more surprising, agent CLI tools like Claude Code, Codex, and Gemini CLI have memory implementations far simpler than expected: no RAG, no knowledge graphs, just plain sliding windows. This article covers open-source frameworks, vector databases, coding assistants, and the real state of production deployment.
Covers: open-source memory frameworks (Mem0/Letta/Graphiti), vector databases (Qdrant/Chroma), coding assistants (Cursor/Augment/Continue), ChatGPT and Claude memory reverse engineering, Agent CLI analysis, and production deployment reality.
1. Memory Frameworks: Three Technical Approaches¶
Current mainstream memory frameworks represent three distinct technical philosophies:
Mem0: LLM-Driven CRUD¶
Mem0's core idea is using large models to manage memory CRUD operations. It extracts facts from conversations and lets the model judge which version is more trustworthy when conflicts arise.
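The extract-then-reconcile loop can be sketched in a few lines. This is an illustrative stand-in, not Mem0's actual API: the `Fact` record, `reconcile` method, and the `judge` callable are all assumptions, with `judge` standing in for the LLM that decides which version of a fact to keep.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    value: str
    version: int = 1

class MemoryStore:
    """Illustrative LLM-driven CRUD loop: extract facts from conversation,
    then let a judge decide whether a new fact adds, updates, or is dropped."""
    def __init__(self):
        self.facts = {}  # subject -> Fact

    def reconcile(self, new: Fact, judge) -> str:
        old = self.facts.get(new.subject)
        if old is None:
            self.facts[new.subject] = new
            return "ADD"
        # In Mem0 this judgment is delegated to an LLM; here `judge`
        # is any callable that returns the fact to keep.
        keep = judge(old, new)
        if keep is new:
            new.version = old.version + 1
            self.facts[new.subject] = new
            return "UPDATE"
        return "NOOP"

# Stand-in judge: always prefer the newer fact (a real system asks the LLM).
store = MemoryStore()
store.reconcile(Fact("city", "Beijing"), judge=lambda o, n: n)
op = store.reconcile(Fact("city", "Shanghai"), judge=lambda o, n: n)
print(op, store.facts["city"].value)  # UPDATE Shanghai
```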
Production validation: Mem0 has become the official memory provider for AWS Agent SDK, with real commercial cases (Sunflower healthcare, RevisionDojo education). This proves the value of "simple but effective" in production environments.
Letta: Three-Tier Memory + Self-Editing¶
Letta (formerly MemGPT) designed an OS-like three-tier architecture:
- Core Memory: Core facts placed in the system prompt
- Recall Memory: Vector retrieval of conversation history
- Archival Memory: Long-term knowledge storage
What makes it unique is letting the agent decide when to update its own memory. Already adopted by 11x (sales AI) and Kognitos (enterprise automation).
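The three tiers and the self-editing idea can be sketched as follows. Class and method names here are assumptions for illustration, not Letta's real interface; the point is that the memory-editing method is exposed to the agent itself as a tool.

```python
class ThreeTierMemory:
    """Illustrative sketch of a MemGPT/Letta-style memory hierarchy."""
    def __init__(self):
        self.core = {}        # always rendered into the system prompt
        self.recall = []      # conversation history; vector-searched in practice
        self.archival = []    # long-term knowledge store

    def system_prompt(self) -> str:
        facts = "\n".join(f"{k}: {v}" for k, v in self.core.items())
        return f"[CORE MEMORY]\n{facts}"

    # Exposed to the agent as a tool: the agent edits its own memory.
    def core_memory_replace(self, key: str, value: str):
        self.core[key] = value

mem = ThreeTierMemory()
mem.core_memory_replace("user_name", "Alice")
print(mem.system_prompt())
```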
Graphiti: Bi-Temporal Knowledge Graph¶
Graphiti, from the Zep team, introduces a temporal dimension — tracking not just "when we learned it" (transaction_time) but also "when the fact itself occurred" (valid_time).
This solves the state change problem: "user lived in Beijing last year" and "user lives in Shanghai now" are both true, but traditional systems struggle to preserve both simultaneously.
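A minimal bi-temporal record makes the Beijing/Shanghai case concrete. The field names below are illustrative, not Graphiti's schema; the key property is that closing a fact's validity interval preserves it instead of overwriting it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BiTemporalFact:
    """valid_from/valid_to: when the fact held in the world (valid_time).
    transaction_time: when the system learned it."""
    statement: str
    valid_from: str
    valid_to: Optional[str]      # None = still true
    transaction_time: str

facts = [
    BiTemporalFact("user lives in Beijing", "2023-01", "2024-06", "2024-07"),
    BiTemporalFact("user lives in Shanghai", "2024-06", None, "2024-07"),
]

def as_of(month: str):
    """Query by world time: both facts survive, neither overwrites the other."""
    return [f.statement for f in facts
            if f.valid_from <= month and (f.valid_to is None or month < f.valid_to)]

print(as_of("2023-05"))  # ['user lives in Beijing']
print(as_of("2024-12"))  # ['user lives in Shanghai']
```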
2. Vector Databases: Two Positions¶
Qdrant: Balancing Performance and Features¶
Qdrant pursues "supporting complex filtering while maintaining recall quality": filtrable HNSW indexing, sparse vector support, RRF/DBSF hybrid ranking.
In production, a common Redis + Qdrant dual-layer architecture emerges: Redis for hot data and fast access, Qdrant for cold data and semantic retrieval.
Chroma: Developer Experience First¶
Chroma chose extreme developer experience — its pre-filtering mechanism and clean API make prototyping very smooth. Currently undergoing a Rust v1.0 rewrite, catching up on the "performance" front.
3. Coding Assistants: Memory in Practice¶
Cursor: Learning from User Behavior¶
Cursor uses agent session trace data to train its own embedding model. Its vector representations are specialized for "code understanding" scenarios, not general text.
Augment: Real-Time Incremental Indexing¶
Augment focuses on "real-time" — monitoring edit events and dynamically updating a personal code index. According to public data, this delivers a 2.6% quality improvement.
Continue: Open Architecture¶
Continue chose the BYOM (Bring Your Own Model) approach, paired with content-addressed caching. More framework than product, suited for customization needs.
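Content-addressed caching is easy to sketch: key the cache by a hash of the chunk's content, so identical content is embedded once regardless of file path or timestamp. This is a generic illustration, not Continue's implementation; the stand-in embedder is a placeholder for any BYOM model.

```python
import hashlib

class ContentAddressedCache:
    """The cache key is the SHA-256 of the chunk itself, so unchanged
    content is never re-embedded, even after a file is moved or touched."""
    def __init__(self, embed):
        self.embed = embed
        self.store = {}  # sha256 hex -> embedding
        self.misses = 0

    def get(self, chunk: str):
        key = hashlib.sha256(chunk.encode()).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed(chunk)
        return self.store[key]

cache = ContentAddressedCache(embed=lambda s: [len(s)])  # stand-in embedder
cache.get("def f(): pass")
cache.get("def f(): pass")   # cache hit: identical content, same hash
print(cache.misses)  # 1
```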
4. Consumer Product Reverse Engineering: ChatGPT vs Claude Memory¶
Reverse engineering of request patterns and system behavior shows that ChatGPT and Claude adopt fundamentally different memory architectures, representing two opposing design philosophies.
ChatGPT: Pre-Computed Injection (Passive Memory)¶
ChatGPT's memory is an "always inject" model:
- Storage: ~33 fact summaries + recent conversation summaries
- Injection timing: Automatically included at the start of every conversation, invisible to users
- Update mechanism: Background async extraction, no impact on conversation latency
Design philosophy: Sacrifice context space for simplicity and reliability. Users don't need to wait for retrieval — memory "naturally" exists in the conversation.
Trade-offs:
- Pros: Low latency, smooth experience, simple implementation
- Costs: Fixed context window consumption, limited memory capacity
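The always-inject model reduces to prompt assembly. A minimal sketch, assuming nothing about OpenAI's internals beyond what the article describes: the `~33` cap comes from the observed fact-summary count, while the function and section names are invented for illustration.

```python
def build_prompt(system: str, fact_summaries: list[str], user_msg: str,
                 max_facts: int = 33) -> str:
    """Always-inject memory: a capped list of pre-computed fact summaries
    is prepended to every conversation, invisible to the user. The cost is
    fixed: the memory block consumes context whether or not it is needed."""
    memory_block = "\n".join(f"- {f}" for f in fact_summaries[:max_facts])
    return f"{system}\n\n[USER MEMORY]\n{memory_block}\n\n[USER]\n{user_msg}"

prompt = build_prompt(
    "You are a helpful assistant.",
    ["Prefers concise answers", "Works as a data engineer"],
    "What ETL tool should I use?",
)
print(prompt)
```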
Claude: On-Demand Retrieval (Active Memory)¶
Claude implements memory as explicit tool calls:
- Tool interfaces: `conversation_search` (semantic search of history), `recent_chats` (recent conversation list)
- Trigger timing: Invoked only when the model determines it's needed; the user sees "searching memory"
- Retrieval scope: Can span longer time ranges of conversation history
Design philosophy: On-demand retrieval, precise matching. Only consume resources when truly needed.
Trade-offs:
- Pros: Saves tokens, theoretically supports larger-scale memory
- Costs: Increased latency, depends on the model correctly judging when memory is needed
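The on-demand model is a tool-dispatch loop. The tool names `conversation_search` and `recent_chats` come from the article; everything else below (keyword-overlap search instead of embeddings, the dispatch table, the call format) is an assumption for illustration, not Claude's implementation.

```python
history = [
    {"date": "2025-11-02", "text": "We discussed Qdrant filtering"},
    {"date": "2025-12-10", "text": "You asked about Letta core memory"},
]

def conversation_search(query: str):
    """Naive stand-in for semantic search: keyword overlap, not embeddings."""
    terms = set(query.lower().split())
    return [m for m in history if terms & set(m["text"].lower().split())]

def recent_chats(n: int = 5):
    return sorted(history, key=lambda m: m["date"], reverse=True)[:n]

TOOLS = {"conversation_search": conversation_search, "recent_chats": recent_chats}

# Nothing is injected up front; the model emits a tool call only when it
# decides memory is needed, and only then is context spent on results.
call = {"name": "conversation_search", "args": {"query": "qdrant filtering"}}
result = TOOLS[call["name"]](**call["args"])
print(result)
```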
The Essential Difference¶
| Dimension | ChatGPT (Passive) | Claude (Active) |
|---|---|---|
| Memory trigger | Auto-injection | Tool call |
| User perception | Invisible | Visible "searching" |
| Context usage | Fixed cost | On-demand cost |
| Latency | Low | Increases during retrieval |
| Capacity limit | Limited by injection volume | Theoretically larger |
| Implementation complexity | Low | High |
This is not just a technical choice — it reflects product philosophy: ChatGPT pursues "seamless experience," Claude pursues "transparent control."
5. Agent CLI Tools: Surprisingly Simple¶
After studying the implementations of Claude Code, Codex, and Gemini CLI, I found a striking phenomenon: these tools' "memory" approaches are far simpler than expected.
| Tool | Storage Format | Compression Method |
|---|---|---|
| Claude Code | JSONL | Plaintext summary |
| Codex | JSONL | Encrypted JWT compression |
| Gemini | Server-side | New session file per compression |
Key finding: No complex RAG pipelines, no knowledge graphs — just plain sliding windows + summary compression. This stands in stark contrast to the various advanced approaches discussed in academic papers.
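The sliding-window-plus-summary scheme fits in one function. This is a generic sketch of the pattern, not any specific tool's code; a real CLI would ask the LLM for the summary rather than emit a placeholder line.

```python
def compress(messages: list[str], window: int = 4) -> list[str]:
    """Sliding window + summary compression: when history exceeds the
    window, older turns collapse into a single summary entry and only
    the most recent turns are kept verbatim."""
    if len(messages) <= window:
        return messages
    old, recent = messages[:-window], messages[-window:]
    summary = f"[summary of {len(old)} earlier turns]"  # LLM-written in practice
    return [summary] + recent

log = [f"turn {i}" for i in range(10)]
print(compress(log))
# ['[summary of 6 earlier turns]', 'turn 6', 'turn 7', 'turn 8', 'turn 9']
```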
6. Production Deployment Status¶
Consumer Products: Memory Features Live¶
ChatGPT and Claude have both officially launched memory features that users can experience in daily conversations:
| Product | Memory Mode | Core Feature |
|---|---|---|
| ChatGPT | Passive injection | 33 fact summaries, seamless experience |
| Claude | Active retrieval | Tool calls, transparent control |
This marks a transition: memory features are moving from experimental to standard.
B2B Frameworks: Vertical Deployment¶
| Framework | Deployment Cases | Characteristics |
|---|---|---|
| Mem0 | AWS Agent SDK, Sunflower, RevisionDojo | Simple approaches deploy first |
| Letta | 11x, Kognitos | Complex stateful agents |
| Graphiti | Zep AI platform core | Temporal knowledge graph |
Enterprise Common Architecture¶
Most enterprises (Walmart, JP Morgan, etc.) don't use a single framework but instead build a dual-layer memory architecture:
- Hot layer (Redis): Recent 10-20 conversation turns, fast access
- Cold layer (Vector DB): Semantic retrieval of historical conversations
Dual Purpose of Vector Databases¶
An easily confused point: vector databases are used not just for RAG (searching documents to answer questions) but also for conversation memory (searching past conversations to remember who the user is). Twilio, Aquant, and OpenAI all use this approach.
7. Summary¶
Technical Level¶
Memory systems are evolving from "vector retrieval" to "structured + lifecycle management." The core problems are clear: what to extract, how to store, when to retrieve, how to update. But the optimal solution is far from settled — fact extraction, graph structures, and temporal modeling each have their advocates.
Business Level¶
Memory capability has become a core competitive differentiator:
- Consumer products: ChatGPT and Claude have made memory a standard feature
- B2B frameworks: Mem0's AWS partnership proves the commercial value of memory
- Developer tools: Cursor/Augment use memory to improve code understanding and developer retention
The reality: the consumer side is live (ChatGPT/Claude), while B2B is still in early exploration. True long-term memory and cross-session learning still have a long way to go.
Observations¶
The most surprising finding: production-grade CLI tools universally adopt simple approaches. This might indicate:
- Simple approaches are sufficient for current scenarios
- The marginal benefit of complex approaches doesn't cover their cost
- Or, better memory systems are the next competitive frontier
Memory is the key capability that transforms LLMs from "tools" into "assistants." The technology stack is still evolving rapidly and is worth continued attention.
Resources¶
Based on open-source code analysis and product reverse engineering. Full research materials:
- Mem0 Research
- Letta Research
- Graphiti Research
- ChatGPT Memory Reverse Engineering
- Claude Memory Reverse Engineering
- Agent CLI Session File Analysis
- Production Adoption Research
Research period: December 2025