Letta (MemGPT) Technical Research Report¶
Last Updated: 2025-12-17
Research Methodology: This document was generated through collaborative code repository analysis by Claude Code, Gemini CLI, and Codex CLI, with final summarization by Claude Code.
Overview¶
Letta (formerly MemGPT) is an open-source framework that transforms stateless LLMs into stateful agents with persistent memory. Based on the MemGPT paper's "LLM Operating System" principles, it implements a hierarchical memory system with virtual context management, enabling unbounded context through intelligent paging.
Source: Letta GitHub Repository
1. Core Concept: The "LLM OS" Analogy¶
Letta treats the LLM agent as an operating system:
| OS Component | LLM Equivalent | Letta Implementation |
|---|---|---|
| CPU | LLM (GPT-4, Claude, etc.) | Processing and reasoning |
| RAM | Context Window | Limited "hot" memory for immediate processing |
| Hard Drive | External Storage | Unlimited "cold" storage (vector DBs, SQL) |
| Kernel | Agent Orchestrator | Manages data paging between RAM and Disk |
Key Insight: Unlike standard RAG which passively retrieves before generation, Letta is proactive - the LLM decides if, when, and what to retrieve using tools.
2. Three-Tier Memory Architecture¶
Memory Tiers¶
| Tier | Analogy | Description | Always In-Context? |
|---|---|---|---|
| Core Memory | RAM/BIOS | Agent identity + working memory | Yes |
| Recall Memory | Chat Logs | Conversation history | No (searchable) |
| Archival Memory | Hard Drive | Long-term knowledge storage | No (searchable) |
A. Core Memory (In-Context)¶
Location: letta/schemas/memory.py (line 56), letta/schemas/block.py
Core memory consists of editable Blocks always present in the system prompt:
class Memory(BaseModel):
blocks: List[Block] # Editable memory blocks
file_blocks: List[FileBlock] # File content blocks
agent_type: AgentType # Determines rendering style
Default Blocks:
- persona - Who the agent is (personality, capabilities)
- human - Facts about the user
Block Schema:
class Block(BaseModel):
label: str # Identifier (e.g., "persona", "human")
value: str # Actual text content
limit: int = 2000 # Character limit (CORE_MEMORY_BLOCK_CHAR_LIMIT)
read_only: bool # Prevent agent edits
description: str # Block purpose documentation
Rendering: Two modes in Memory.compile():
- Standard XML rendering for most providers
- Line-numbered format for Anthropic models (_render_memory_blocks_line_numbered())
B. Archival Memory (Long-Term Storage)¶
Location: letta/services/passage_manager.py
Vector-embedded passages stored in database:
- Storage: Vector DB (Chroma, Qdrant, pgvector) or native PostgreSQL
- Access: Via archival_memory_search() and archival_memory_insert() tools
- Search: Hybrid search using RRF (Reciprocal Rank Fusion) combining vector similarity + full-text search
Schema: ArchivalPassage ORM model with embeddings, timestamps, and tags.
C. Recall Memory (Conversation History)¶
Location: letta/services/message_manager.py
Database-stored message history:
- Access: Via conversation_search() tool
- Features:
- Date range filtering
- Role-based filtering (user/assistant/tool)
- Hybrid search (text + semantic)
- Integration: Supports Turbopuffer for fast vector search (line 1051)
3. Virtual Context Management¶
Context Window Structure¶
The actual prompt sent to the LLM is dynamically assembled:
[System Instructions]
[Core Memory Block: Persona]
[Core Memory Block: Human]
[Tool Definitions (Memory Functions)]
[Notification: "70 previous messages hidden..."]
[Recent Message History]
Context Window Tracking¶
Location: letta/services/context_window_calculator/context_window_calculator.py
class ContextWindowOverview:
num_system_message: int # Tokens in system prompt
num_core_memory: int # Tokens in core memory blocks
num_external_memory_summary: int # Archival/recall summary tokens
num_messages: int # Conversation message tokens
num_tools: int # Tool definition tokens
context_window_limit: int # Max tokens
context_window_size_current: int # Current usage
Paging/Swapping Mechanism¶
Location: letta/services/summarizer/summarizer.py
When context exceeds threshold:
1. Detection: context_window_size_current > memory_warning_threshold
2. Eviction: Older messages removed from active window → remain in Recall Memory
3. Summarization: partial_evict_summarization() compresses old messages
4. Insertion: Summary message inserted as index 1 (after system message)
def simple_summary(messages) -> str:
# Uses separate "ephemeral summary agent" to generate summary
# Falls back to transcript truncation if context too large
4. Self-Editing Memory (Key Differentiator)¶
The agent has write-access to its own prompts via tools.
Core Memory Edit Tools¶
Location: letta/services/tool_executor/core_tool_executor.py
| Tool | Line | Description |
|---|---|---|
core_memory_append() |
320 | Add content to block |
core_memory_replace() |
329 | Replace old_content with new_content |
memory_replace() |
347 | Modern version with old_str/new_str |
memory_apply_patch() |
416 | Apply unified diff-style patches |
memory_insert() |
526 | Insert text at specific line |
memory_rethink() |
599 | Complete block rewrite |
memory_finish_edits() |
643 | Commit pending edits |
Self-Edit Execution Flow¶
1. User: "My favorite color is actually blue, not red."
2. LLM reasons: "I need to update the 'Human' block"
3. LLM emits: core_memory_replace(label="human", old="fav_color: red", new="fav_color: blue")
4. Letta executes function, updates DB
5. System prompt rebuilt with new memory state for next turn
Safeguards¶
- Read-only protection: Checks
block.read_onlybefore edits - Line number validation: Regex prevents agents from including line numbers in edits
- Uniqueness checks: Ensures
old_strappears exactly once - Persistence:
update_memory_if_changed_async()compares and updates only changed blocks
5. Agent Architecture¶
Agent Class Hierarchy¶
Location: letta/agent.py (line 96)
class Agent:
agent_state: AgentState # Configuration and state
agent_manager: AgentManager # Persistence operations
block_manager: BlockManager # Block CRUD
message_manager: MessageManager # Message history
llm_client: LLMClient # LLM API abstraction
Execution Loop¶
Main Loop: step() (line 753)
step() with chaining support
├── Loop iteration:
│ ├── inner_step() - Single LLM interaction
│ │ ├── Load persisted memory from blocks
│ │ ├── Get in-context messages
│ │ ├── _get_ai_reply() - Call LLM with tools
│ │ ├── _handle_ai_response() - Parse & execute tools
│ │ └── Persist new messages to DB
│ ├── Check heartbeat_request flag
│ └── Handle chaining conditions
└── Return aggregated usage stats
Agent Types¶
| Type | Description |
|---|---|
memgpt_agent |
Original MemGPT implementation |
memgpt_v2_agent |
Refreshed toolset |
letta_v1_agent |
Simplified, no forced tool calls |
react_agent |
ReAct-style reasoning |
sleeptime_agent |
With background memory agent |
workflow_agent |
Auto-clearing message buffer |
6. Tool System¶
Tool Types¶
Location: letta/schemas/tool.py
| Type | Description |
|---|---|
LETTA_CORE |
Core memory and search tools (send_message, conversation_search, archival_memory_*) |
CUSTOM |
User-defined Python functions |
EXTERNAL_MCP |
Model Context Protocol integrations |
COMPOSIO |
Third-party action integrations |
Tool Execution Pipeline¶
Location: letta/services/tool_executor/
ToolExecutionManager (orchestrator)
├── ToolExecutorFactory → routes to appropriate executor
├── LettaCoreToolExecutor → handles core Letta tools
├── SandboxToolExecutor → isolated execution for custom tools
└── ToolExecutionSandbox → E2B sandbox support
Tool Rules¶
Location: letta/schemas/tool_rule.py
| Rule | Behavior |
|---|---|
TerminalToolRule |
Tool that ends execution chain |
RequiresApprovalToolRule |
Requires human approval |
ContinueToolRule |
Forces heartbeat after execution |
InitToolRule |
Force first message tool |
7. Persistence Layer¶
ORM Architecture¶
Location: letta/orm/
SQLAlchemy ORM with PostgreSQL (recommended) or SQLite:
| Model | Purpose |
|---|---|
AgentModel |
Agent configuration |
BlockModel |
Memory blocks |
MessageModel |
Conversation history |
ArchivalPassageModel |
Archival memory passages |
Persistence Flow¶
# Memory update flow (agent_manager.py:1555)
for block in new_memory.blocks:
old_block = get_block_by_id(block.id)
if block.value != old_block.value:
block_manager.update_block(block)
# Refresh memory from DB
agent_state.memory = Memory(
blocks=[get_block_by_id(b.id) for b in memory.blocks]
)
8. LLM Provider Integration¶
Supported Providers¶
Location: letta/llm_api/
| Category | Providers |
|---|---|
| Cloud | OpenAI, Anthropic, Google (Gemini, Vertex), Azure OpenAI |
| Enterprise | AWS Bedrock, Together AI, Groq |
| Local | Ollama, LM Studio |
| Specialized | XAI, DeepSeek |
Client Architecture¶
# Factory pattern
LLMClient.create(provider_type, put_inner_thoughts_first, actor)
# Each client inherits from LLMClientBase:
- build_request_data() # Convert to provider format
- request_async() # Send API request
- convert_response_to_chat_completion() # Normalize response
Provider-Specific Handling¶
- Anthropic: Line-numbered memory rendering (memory.py lines 279-293)
- OpenAI: Standard function calling format
- Reasoning models: Special handling for o1/o3 models
9. Letta vs Standard RAG¶
| Feature | Standard RAG | Letta |
|---|---|---|
| Retrieval | Passive: retrieve before LLM sees query | Active: LLM decides if/when/what to retrieve |
| Memory | Read-only: LLM cannot update vector DB | Read/Write: LLM actively saves knowledge |
| State | Stateless: reset after every request | Stateful: identity persists across sessions |
| Context | Polluted: often stuffed with irrelevant chunks | Curated: agent maintains concise Core Memory |
| Update Mechanism | Append-only | Self-editing with CRUD operations |
10. Recent Developments (2024-2025)¶
-
Letta v1 Agent: New architecture optimized for reasoning models (GPT-4o, Claude 3.5), more flexible than strict MemGPT structure
-
Split-Thread Architecture: Separation of "thinking" (internal monologue/tool execution) from "speaking" (user-facing response)
-
Server/Client Split: FastAPI backend managing agent states, allowing multiple frontend clients (web, CLI, Discord)
-
Sleeptime Agents: Background memory management agent for organizing and compacting memory
11. Key Takeaways¶
-
LLM as Operating System: Memory management inspired by OS virtual memory principles
-
Three-Tier Memory: Core (always in-context) → Recall (searchable history) → Archival (vector storage)
-
Self-Editing is Key: Agent writes to its own prompt, unlike passive RAG systems
-
Virtual Context: Summarization + paging creates illusion of infinite context
-
Tool-Based Memory Access: Memory operations exposed as LLM function calls
-
Stateful by Design: Agent identity and knowledge persist across sessions
-
Provider Agnostic: Clean abstraction layer supports 15+ LLM providers
References¶
- Letta GitHub Repository
- Letta Documentation
- MemGPT Paper - "MemGPT: Towards LLMs as Operating Systems"
- Core implementation:
letta/agent.py(2000+ lines) - Memory schemas:
letta/schemas/memory.py,letta/schemas/block.py - Tool executor:
letta/services/tool_executor/