# Hindsight Technical Research Report
Last Updated: 2026-03-24
Research Methodology: This document was generated through source code analysis of the vectorize-io/hindsight repository and the associated arXiv paper (2512.12818), supplemented by web research of benchmarks, blog posts, and community discussions.
## Overview
Hindsight is an open-source (MIT) agent memory system from Vectorize.io that organizes long-term memory into epistemically distinct networks and provides three core operations: retain, recall, and reflect. It achieves state-of-the-art results on the LongMemEval (91.4%) and LoCoMo (89.61%) benchmarks by combining biomimetic memory organization, multi-strategy retrieval, and disposition-aware reasoning.
The architecture unifies two subsystems described in the paper:

- Tempr (Temporal Entity Memory Priming Retrieval) — implements retain and recall
- Cara (Coherent Adaptive Reasoning Agents) — implements reflect with configurable disposition traits
## 1. Core Architecture

### Memory Networks
Hindsight organizes memory into four epistemically distinct fact types, stored as rows in a single memory_units PostgreSQL table differentiated by a fact_type column:
| Network | `fact_type` | Description | Created By |
|---|---|---|---|
| World | `world` | Objective facts about the external environment ("Alice works at Google") | Retain (LLM extraction) |
| Experience | `experience` | Agent's own interactions, written in first person ("I helped user debug their API") | Retain (LLM extraction; `fact_type='assistant'` in the extraction schema) |
| Observation | `observation` | Preference-neutral entity summaries synthesized from underlying facts | Consolidation engine (automatic, post-retain) |
| Opinion | `opinion` | Subjective judgments with confidence scores (deprecated in current code) | Originally Cara; now removed via migration |
The paper describes four networks. In the current codebase, the valid recall fact types are `world`, `experience`, and `observation` (`VALID_RECALL_FACT_TYPES` in `response_models.py`); the opinion network has been deprecated and its entries deleted via an Alembic migration.
### Monorepo Structure
```
hindsight/
├── hindsight-api-slim/          # Core FastAPI server + memory engine (Python, uv)
│   └── hindsight_api/
│       ├── engine/              # Core memory engine
│       │   ├── memory_engine.py # Main orchestrator
│       │   ├── retain/          # Retain pipeline modules
│       │   ├── search/          # Multi-strategy retrieval
│       │   ├── reflect/         # Cara reflect agent
│       │   ├── consolidation/   # Observation synthesis
│       │   └── directives/      # Hard behavioral rules
│       ├── api/http.py          # FastAPI HTTP routers
│       └── api/mcp.py           # MCP server
├── hindsight-control-plane/     # Admin UI (Next.js)
├── hindsight-cli/               # CLI tool (Rust)
├── hindsight-clients/           # Generated SDKs (Python, TypeScript, Rust)
├── hindsight-integrations/      # Framework integrations (LiteLLM, OpenAI, LangGraph, CrewAI, Claude Code, etc.)
└── hindsight-docs/              # Docusaurus documentation site
```
### Database Schema
PostgreSQL with pgvector. Key tables:
| Table | Purpose |
|---|---|
| `banks` | Memory banks (isolated per-user/agent "brains") with name, mission, disposition traits |
| `memory_units` | All facts (world, experience, observation) with embeddings, BM25 search vectors, temporal fields |
| `entities` | Canonical entity records (resolved from mentions) |
| `entity_links` | Links between memory units and entities |
| `memory_links` | Graph edges: semantic, temporal, causal, entity links between memory units |
| `documents` | Document tracking for multi-part ingestion |
| `chunks` | Raw text chunks for expand/retrieval |
| `mental_models` | User-defined stored reflect responses (pinned reflections) |
Each memory unit carries: `text`, `context`, `embedding` (vector), `search_vector` (BM25), `event_date`, `occurred_start`, `occurred_end`, `mentioned_at`, `fact_type`, `confidence_score`, `tags`, `metadata`, `document_id`, `chunk_id`.
## 2. Tempr: Retain and Recall

### Retain Pipeline
The retain operation (`retain/orchestrator.py`) processes content through a multi-stage pipeline:
```
Input Content
      │
      ▼
[1] Fact Extraction (LLM)
      │   - Extracts structured facts with: what, when, where, who, why
      │   - Classifies each as world or experience (assistant)
      │   - Extracts entities, causal relations, temporal ranges
      │   - Three extraction modes: standard, verbose, verbatim
      │
      ▼
[2] Embedding Generation
      │   - Augments fact text with date context
      │   - Local sentence-transformers or TEI (Text Embeddings Inference)
      │
      ▼
[3] Entity Resolution
      │   - LLM-extracted entities + user-provided entities
      │   - Resolved to canonical entity IDs via fuzzy matching
      │   - Two strategies: "full" (load all bank entities) or "trigram" (pg_trgm GIN index)
      │   - Co-occurrence tracking between entities
      │
      ▼
[4] Database Transaction (single atomic write)
      │   - Store memory units with embeddings + BM25 vectors
      │   - Create entity links
      │   - Create temporal links (time-proximity weighted, 24h window)
      │   - Create semantic links (top-5 nearest neighbors, similarity >= 0.7)
      │   - Create causal links (extracted by LLM during fact extraction)
      │   - Document and chunk tracking
      │
      ▼
[5] Post-Transaction
      │   - Flush entity stats (counts, co-occurrences)
      │   - Trigger consolidation job (background)
```
Fact Extraction Schema (from `fact_extraction.py`): Each fact is a Pydantic model with structured fields (what, when, where, who, why), combined into a single text string: `"what | Involving: who | why"`. The LLM also extracts:

- `occurred_start` / `occurred_end` — ISO timestamps for datable events
- `entities` — named entities (people, places, concepts)
- `causal_relations` — links to previous facts in the batch (index-based, forward-only)
- `fact_type` — `world` or `assistant`
### Recall Pipeline
The recall operation runs four retrieval strategies in parallel, then fuses and reranks results:
```
Query
  │
  ├──► Query Analysis (dateparser: extract temporal constraints)
  ├──► Query Embedding
  │
  ▼
┌─────────────────────────────────────────────────┐
│             4-Way Parallel Retrieval            │
│                                                 │
│  [Semantic]   [BM25]      [Graph]    [Temporal] │
│  Vector sim   Full-text   Entity/    Time-range │
│  HNSW index   tsvector    causal     + semantic │
│  or vchord                traversal  spreading  │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
        Reciprocal Rank Fusion (k=60)
                       │
                       ▼
        Cross-Encoder Reranking
        + Recency boost (±10%)
        + Temporal proximity boost (±10%)
                       │
                       ▼
        Token-budgeted output
```
Retrieval details:

- Semantic — pgvector HNSW cosine similarity, per-bank per-fact-type partial indexes; over-fetches 5x then trims. Minimum similarity threshold of 0.3.
- BM25 — Three backends: native PostgreSQL `tsvector`, `vchord_bm25`, or `pg_textsearch`. Keyword matching via tokenized query.
- Graph — Three pluggable strategies:
  - MPFP (Meta-Path Forward Push) — default. Sublinear graph traversal combining meta-path patterns from HIN literature with Forward Push local propagation from approximate PPR. Lazy edge loading, hop-synchronized across all patterns to keep DB queries at O(hops). Predefined patterns include `[semantic, semantic]` (topic expansion), `[entity, temporal]` (entity timeline), and `[semantic, causes]` (reasoning chains).
  - BFS — Spreading activation with decay (original algorithm)
  - Link Expansion — Direct single-hop expansion through entity, semantic, and causal links
- Temporal — Two-phase: date-ranked filtering within the time window, then embedding similarity on the top-50 candidates per fact type. Temporal spreading to adjacent events.
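For intuition, the Forward Push primitive that MPFP builds on can be sketched on a small in-memory graph. This is not Hindsight's implementation — the real traversal loads edges lazily from PostgreSQL, weights them, and synchronizes hops across meta-path patterns — and the `alpha`/`eps` values here are conventional PPR defaults, not values from the codebase.

```python
from collections import defaultdict

def forward_push(neighbors, source, alpha=0.15, eps=1e-4):
    """Forward Push local propagation for approximate personalized PageRank.

    `neighbors` maps node -> list of adjacent nodes (unweighted here for
    simplicity). Mass starts at `source`; each push settles an `alpha`
    share locally and spreads the rest to neighbors, stopping once every
    per-degree residual falls below `eps`. Mass at dangling nodes is
    simply dropped in this sketch.
    """
    estimate = defaultdict(float)   # approximate PPR mass settled at each node
    residual = defaultdict(float)   # mass still waiting to be pushed
    residual[source] = 1.0
    queue = [source]
    while queue:
        u = queue.pop()
        deg = len(neighbors.get(u, ())) or 1
        if residual[u] / deg <= eps:
            continue                # already below threshold (stale queue entry)
        mass = residual[u]
        residual[u] = 0.0
        estimate[u] += alpha * mass
        share = (1 - alpha) * mass / deg
        for v in neighbors.get(u, ()):
            residual[v] += share
            if residual[v] / (len(neighbors.get(v, ())) or 1) > eps:
                queue.append(v)
    return dict(estimate)
```

Because pushes only touch the neighborhoods of nodes holding residual mass, the work is local to the query's region of the graph rather than proportional to total graph size, which is the property the report calls sublinear.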
Fusion: Reciprocal Rank Fusion merges the four result lists. Then a cross-encoder reranker scores each candidate. Combined scoring applies recency and temporal proximity as multiplicative boosts (±10% each) on top of the cross-encoder score.
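The fusion and boost arithmetic above is easy to sketch. The `k=60` constant matches the report; the boost inputs are assumed to be precomputed values in `[-0.1, +0.1]` (the ±10% figures), applied multiplicatively to the cross-encoder score.

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    Takes several ranked lists of memory IDs and returns a single list
    ordered by fused score (highest first)."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def boosted_score(cross_encoder_score, recency_boost, temporal_boost):
    """Apply recency and temporal-proximity boosts (each assumed to lie in
    [-0.1, +0.1]) multiplicatively on top of the cross-encoder score."""
    return cross_encoder_score * (1.0 + recency_boost) * (1.0 + temporal_boost)
```

A document ranked highly by several retrieval arms accumulates reciprocal-rank mass from each list, so consensus candidates rise to the top even when no single arm ranked them first.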
### Link Types in the Memory Graph
| Link Type | Created At | Weight Computation |
|---|---|---|
| `semantic` | Retain time | Cosine similarity between embeddings (top-5 neighbors, >= 0.7) |
| `temporal` | Retain time | `max(0.3, 1.0 - time_diff_hours / 24)` — proximity within 24h window |
| `causal` | Retain time | LLM-extracted `caused_by` relations with strength 0.0-1.0 |
| `entity` | Retain time | Co-occurrence through shared resolved entities |
## 3. Cara: Reflect
The reflect operation (`reflect/agent.py`) is an agentic loop that reasons over retrieved memories using LLM tool calling. It implements hierarchical retrieval:
### Hierarchical Retrieval Strategy
- Mental Models (`search_mental_models`) — User-curated stored reflect responses (highest quality, manually maintained)
- Observations (`search_observations`) — Auto-consolidated knowledge from memories, with freshness tracking (`is_stale`)
- Raw Facts (`recall`) — World facts and experiences as ground truth
The agent iterates up to `DEFAULT_MAX_ITERATIONS = 10`, calling tools to gather evidence, then produces a final answer grounded in retrieved memories.
### Available Tools
| Tool | Purpose |
|---|---|
| `search_mental_models` | Search user-curated mental models (pinned reflections) |
| `search_observations` | Search auto-consolidated observations with freshness info |
| `recall` | Search raw facts (world + experience) |
| `expand` | Retrieve full chunk/document context for a memory |
| `done` | Produce final answer with supporting memory IDs |
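The overall tool-calling loop can be sketched as follows. `llm_step` and the tool signatures are stand-ins, not the real Cara interfaces: the sketch only shows the control flow (call tools until `done` or the iteration budget is exhausted).

```python
DEFAULT_MAX_ITERATIONS = 10  # matches the constant named in the report

def reflect_loop(llm_step, tools, max_iterations=DEFAULT_MAX_ITERATIONS):
    """Minimal agentic loop (illustrative, not Cara's implementation).

    `llm_step(transcript)` stands in for one LLM call: it returns the next
    (tool_name, args) pair given the evidence gathered so far. When the
    model calls `done`, its args are the final answer payload.
    """
    transcript = []
    for _ in range(max_iterations):
        tool_name, args = llm_step(transcript)   # model picks the next tool
        if tool_name == "done":
            return args                          # final answer + memory IDs
        result = tools[tool_name](**args)        # e.g. recall, expand, ...
        transcript.append((tool_name, args, result))
    return {"answer": None, "error": "iteration budget exhausted"}
```

Because every tool result is appended to the transcript before the next LLM call, later steps can refine their queries based on earlier evidence — the behavior that distinguishes an agentic loop from one-shot retrieval.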
### Disposition Traits (Cara)
Memory banks can have configurable disposition traits that affect reflect behavior (not recall):
| Trait | Range | Low (1) | High (5) |
|---|---|---|---|
| Skepticism | 1-5 | Trusting — accepts information at face value | Skeptical — questions and doubts information |
| Literalism | 1-5 | Flexible — reads between the lines | Literal — interprets information strictly as stated |
| Empathy | 1-5 | Detached — ignores emotional context | Empathetic — considers feelings and relationships |
These traits are injected into the reflect system prompt as `Disposition: skepticism=3, literalism=2, empathy=4`. Given the same facts, agents with different dispositions form different conclusions.
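Rendering that prompt line is straightforward; the sketch below matches the format quoted above, while the 1-5 range validation is an assumption based on the trait table.

```python
def disposition_line(skepticism: int, literalism: int, empathy: int) -> str:
    """Render bank disposition traits into the line injected into the
    reflect system prompt. Range checking is an assumption; the format
    string matches the example in the report."""
    for name, value in (("skepticism", skepticism),
                        ("literalism", literalism),
                        ("empathy", empathy)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1-5, got {value}")
    return f"Disposition: skepticism={skepticism}, literalism={literalism}, empathy={empathy}"
```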
### Directives
Separate from dispositions, directives are hard rules injected into prompts (e.g., "Always respond in formal English", "Never share personal data"). They are user-defined, prioritized, and enforced with stronger language in the prompt than disposition traits.
## 4. Consolidation (Observation Synthesis)
The consolidation engine (`consolidation/consolidator.py`) runs as a background job after retain operations. It processes new, unconsolidated memories and produces observations:
Pipeline:
1. Fetch unconsolidated memories from the bank
2. Retrieve existing observations for context
3. LLM decides for each batch: CREATE new observation, UPDATE existing one, or DELETE obsolete one
4. Store observations as memory_units with fact_type='observation', tracking proof_count, source_memory_ids, and history
Consolidation prompt rules:

- Redundant info (same info worded differently) → UPDATE existing observation
- Contradictions/updates → capture both states with temporal markers ("used to X, now Y")
- Resolve vague references when new facts provide concrete values
- Never merge observations about different people or unrelated topics
Observations vs. Mental Models:
- Observations — auto-generated bottom-up by the consolidation engine from raw facts. Stored in the `memory_units` table with `fact_type='observation'`.
- Mental Models — user-defined queries stored in the `mental_models` table. Refreshed on demand via reflect. Can serve as directives.
### Observation Trends
Each observation has a computed trend based on evidence timestamps:
| Trend | Meaning |
|---|---|
| `STABLE` | Evidence spread across time, continues to present |
| `STRENGTHENING` | More/denser evidence recently |
| `WEAKENING` | Evidence mostly old, sparse recently |
| `NEW` | All evidence within recent window |
| `STALE` | No evidence in recent window |
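A hedged sketch of how such trends could be computed from evidence timestamps. The report names only the five categories; the window length and the count-based thresholds below are invented for illustration and almost certainly differ from the actual algorithm.

```python
from datetime import datetime, timedelta

def observation_trend(evidence_dates: list[datetime], now: datetime,
                      recent_window_days: int = 30) -> str:
    """Classify an observation's trend from its evidence timestamps.

    Illustrative logic only: splits evidence into a recent window vs.
    older history and compares counts. Window length and thresholds are
    assumptions, not values from the codebase.
    """
    if not evidence_dates:
        return "STALE"
    cutoff = now - timedelta(days=recent_window_days)
    recent = [d for d in evidence_dates if d >= cutoff]
    older = [d for d in evidence_dates if d < cutoff]
    if not recent:
        return "STALE"           # no evidence in the recent window
    if not older:
        return "NEW"             # all evidence is recent
    if len(recent) > len(older):
        return "STRENGTHENING"   # denser evidence recently
    if len(recent) * 3 < len(older):
        return "WEAKENING"       # evidence mostly old, sparse recently
    return "STABLE"              # spread across time, continuing to present
```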
## 5. Benchmark Performance

### LongMemEval Results (as of January 2026)
Hindsight achieved state-of-the-art performance, the first memory system to cross 90%:
| System | Overall | Info Extract | Multi-Session | Temporal | Knowledge Update | Abstention |
|---|---|---|---|---|---|---|
| Hindsight (OSS-120B) | 91.4% | — | — | — | — | — |
| Hindsight (OSS-20B) | 83.6% | — | — | — | — | — |
| Full-context GPT-4o | 49.0% | — | 21.1% | 31.6% | 60.3% | — |
| Full-context baseline (20B) | 39.0% | — | — | — | — | — |
Key improvements with Hindsight over the full-context baseline:

- Multi-session: 21.1% → 79.7%
- Temporal reasoning: 31.6% → 79.7%
- Knowledge updates: 60.3% → 84.6%
- Overall: +44.6 points
Results independently reproduced by Virginia Tech Sanghani Center and The Washington Post.
### LoCoMo Results
| System | Overall |
|---|---|
| Hindsight (Gemini-3) | 89.61% |
| Hindsight (OSS-120B) | 85.67% |
| Hindsight (OSS-20B) | 83.18% |
| Memobase | 75.78% |
## 6. Comparison with Other Memory Systems
| Feature | Hindsight | Mem0 | Letta (MemGPT) | Graphiti (Zep) | Supermemory |
|---|---|---|---|---|---|
| Memory Model | 4 epistemically distinct networks (world, experience, observation + deprecated opinion) | Dual store: vector + optional graph | In-context memory management via MemGPT architecture | Temporal knowledge graph with episodic/semantic edges | Vector store with auto-chunking |
| Storage | PostgreSQL + pgvector (single DB) | 24+ vector stores + Neo4j/Memgraph | PostgreSQL + pgvector | Neo4j graph DB | Multiple vector backends |
| Retrieval | 4-way parallel (semantic + BM25 + graph + temporal) + RRF + cross-encoder reranking | Vector similarity + graph traversal + optional reranking | LLM-managed retrieval within conversation context | Graph traversal with temporal edges | Vector similarity |
| Graph Traversal | MPFP (sublinear, meta-path patterns, lazy loading) or BFS spreading activation | Optional Neo4j entity-relation graph | N/A (LLM decides what to retrieve) | Temporal knowledge graph with entity resolution | N/A |
| Temporal Reasoning | First-class: temporal links, temporal retrieval arm, date-range spreading, occurred_start/end per fact | No native temporal support | No native temporal support | Temporal edges in knowledge graph | No native temporal support |
| Memory Updates | Consolidation engine: auto-synthesizes observations, handles contradictions with temporal markers | LLM-driven CRUD (ADD/UPDATE/DELETE) per fact | LLM edits memory blocks in-context | Graph edge invalidation with temporal validity | Append-only |
| Reflect/Reasoning | Agentic loop with hierarchical retrieval (mental models → observations → raw facts) | Not built-in | LLM reasons over in-context memory | Not built-in (graph query) | Not built-in |
| Disposition/Personality | Configurable traits (skepticism, literalism, empathy) per bank | Not supported | Not supported | Not supported | Not supported |
| Fact Classification | LLM classifies: world vs. experience + causal relations + entities | Single fact type | Core memory vs. archival memory | Episodic vs. semantic edges | Single type |
| LongMemEval | 91.4% | 49.0% (self-reported) | Not published | 71.2% (self-reported) | Not published |
| License | MIT | Apache 2.0 | Apache 2.0 | Apache 2.0 | MIT |
| Deployment | Single Docker container (embedded PostgreSQL) or external DB | Requires separate vector store + optional graph DB | Server with PostgreSQL | Requires Neo4j + separate services | Self-hosted or cloud |
## 7. Key Differentiators

### vs. Mem0
- Hindsight uses four parallel retrieval strategies vs. Mem0's vector similarity + optional graph
- Hindsight has native temporal reasoning (first-class temporal links and retrieval)
- Hindsight auto-consolidates observations; Mem0 uses LLM-driven CRUD on individual facts
- Hindsight's reflect provides agentic reasoning; Mem0 has no built-in reasoning layer
- Single PostgreSQL deployment vs. Mem0's multi-service setup
### vs. Letta (MemGPT)
- Fundamentally different paradigm: Hindsight is an external memory service; Letta manages memory in-context via the LLM itself
- Hindsight's structured graph enables sublinear retrieval; Letta pays full-context LLM cost
- Hindsight provides benchmark-validated accuracy; Letta's approach is more autonomous but harder to evaluate
### vs. Graphiti (Zep)
- Both use graph-based memory, but different graph structures: Hindsight uses a heterogeneous memory graph (semantic/temporal/causal/entity edges); Graphiti uses a temporal knowledge graph
- Hindsight combines graph traversal with three other retrieval strategies (semantic, BM25, temporal) via RRF; Graphiti primarily uses graph traversal
- Hindsight's MPFP algorithm is sublinear in graph size; Graphiti uses full graph queries
### Unique Capabilities
- MPFP Algorithm — Novel sublinear graph traversal combining meta-path patterns with Forward Push propagation. Hop-synchronized execution reduces DB queries to O(hops) regardless of pattern count.
- Epistemic Separation — Structurally distinguishes evidence (world/experience facts) from inference (observations) from instructions (directives).
- Disposition-Aware Reasoning — Same facts, different conclusions based on configurable agent personality traits.
- Consolidation with Temporal Markers — Handles contradictions gracefully ("used to X, now Y") rather than overwriting.
- Observation Trends — Algorithmically computed freshness signals (stable, strengthening, weakening, new, stale) on synthesized knowledge.
## 8. Integration Ecosystem
Hindsight provides integrations for:

- LLM Wrappers: LiteLLM, OpenAI-compatible (drop-in replacement for API calls)
- Agent Frameworks: LangGraph, CrewAI, PydanticAI, Agno, AI SDK (Vercel), Hermes
- Coding Agents: Claude Code, OpenClaw, NemoClaw
- Protocol: MCP (Model Context Protocol) server built-in
The LLM wrapper approach enables adding memory to existing agents with minimal code changes — swap the LLM client for the Hindsight wrapper, and memories are stored/retrieved automatically on each LLM call.
## Sources
- arXiv Paper: Hindsight is 20/20 (2512.12818)
- GitHub Repository
- Hindsight Documentation
- Vectorize Blog: Introducing Hindsight
- VentureBeat: 91% Accuracy on LongMemEval
- PR Newswire: Vectorize Breaks 90%
- Vectorize: Hindsight vs Mem0
- Vectorize: Best AI Agent Memory Systems 2026
- Hindsight Benchmarks Repository