# Gemini CLI Context Management Research
Last Updated: 2026-03-19
Source: google-gemini/gemini-cli (Apache 2.0)
Research focus: How Gemini CLI assembles and manages context within a conversation.
## Architecture Overview

Gemini CLI is a monorepo with the core logic in `packages/core/src/`. Key locations for context management:

| Component | Path | Role |
|---|---|---|
| Chat state | `core/geminiChat.ts` | Manages conversation history array, sends messages to API |
| Agent loop | `agents/local-executor.ts` | Orchestrates turns, triggers compression, manages subagents |
| Turn | `core/turn.ts` | Single turn execution: stream response, handle tool calls |
| Compression | `services/chatCompressionService.ts` | Chat history compression (summarization + truncation) |
| Tool output summarizer | `utils/summarizer.ts` | LLM-based tool output summarization |
| System prompt | `prompts/promptProvider.ts`, `prompts/snippets.ts` | Section-based system prompt construction |
| Token limits | `core/tokenLimits.ts` | Per-model token limit definitions |
## Context Accumulation Model

Like Pi, Gemini CLI uses a single history array that accumulates indefinitely:

```typescript
// geminiChat.ts
this.history.push(userContent); // User message added
const requestContents = this.getHistory(true); // Full history sent to API
// ... after response ...
this.history.push({ role: 'model', parts: consolidatedParts }); // Model response added
```

Every `sendMessageStream()` call sends the full curated history to the Gemini API. There is no pre-send truncation or token-budgeting pipeline like OpenClaw's.
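A minimal sketch of this accumulate-everything loop, with hypothetical types standing in for the real `geminiChat.ts` API:

```typescript
// Illustrative only: `ChatHistory`, `Part`, and `Content` are stand-ins,
// not the real gemini-cli types.
type Part = { text: string };
type Content = { role: 'user' | 'model'; parts: Part[] };

class ChatHistory {
  private history: Content[] = [];

  // Every send appends the user turn, ships the FULL history, appends the reply.
  send(userText: string, callApi: (contents: Content[]) => string): string {
    this.history.push({ role: 'user', parts: [{ text: userText }] });
    const reply = callApi(this.history); // full history, no truncation
    this.history.push({ role: 'model', parts: [{ text: reply }] });
    return reply;
  }

  length(): number {
    return this.history.length;
  }
}
```

The point of the sketch: the array only ever grows; any shrinking happens in the separate compression service described next.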
## Compression (Chat Compression Service)

### Trigger

Compression triggers when the token count exceeds a threshold (`chatCompressionService.ts`). It is invoked from `local-executor.ts` after each turn via `tryCompressChat()`.
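The trigger reduces to a one-line comparison against the per-model limit. A hedged sketch: the 50% fraction comes from the comparison table later in these notes, while the limit table entry and names here are illustrative:

```typescript
// Sketch of the compression trigger. The threshold fraction (50%) is from
// these notes; the model name and limit value below are hypothetical.
const COMPRESSION_TOKEN_THRESHOLD = 0.5;

const TOKEN_LIMITS: Record<string, number> = {
  'gemini-test-model': 1_000_000, // illustrative entry, not from tokenLimits.ts
};

function shouldCompress(model: string, tokenCount: number): boolean {
  const limit = TOKEN_LIMITS[model] ?? 1_000_000; // fallback is an assumption
  return tokenCount > limit * COMPRESSION_TOKEN_THRESHOLD;
}
```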
### Algorithm: Three-Phase Compression
**Phase 1: Tool Output Truncation (Budget-Based)**

`truncateHistoryToBudget()` implements a "Reverse Token Budget" strategy:

- Iterates from newest to oldest messages
- Keeps a running tally of function response tokens
- Recent tool outputs are preserved in full (high fidelity for the current context)
- Once the budget is exceeded (`COMPRESSION_FUNCTION_RESPONSE_TOKEN_BUDGET` = 50,000 tokens), older large tool outputs are truncated to their last 30 lines, with the full output saved to a temp file
- This runs BEFORE summarization, ensuring the summarizer doesn't get overwhelmed
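The reverse-budget walk could look roughly like this. The 50,000-token budget and 30-line tail are from the notes; the container type and the chars/4 token estimator are assumptions:

```typescript
// Sketch of the "Reverse Token Budget": walk newest → oldest, spend budget on
// recent tool outputs, truncate older ones to their last 30 lines.
const FUNCTION_RESPONSE_TOKEN_BUDGET = 50_000;
const KEPT_TAIL_LINES = 30;

interface ToolOutput { lines: string[] }

// Crude chars/4 heuristic; the real code counts tokens properly.
const estimateTokens = (o: ToolOutput): number =>
  o.lines.join('\n').length / 4;

function truncateHistoryToBudget(outputs: ToolOutput[]): ToolOutput[] {
  let spent = 0;
  const result = [...outputs];
  // Iterate from newest (end of array) to oldest (start).
  for (let i = result.length - 1; i >= 0; i--) {
    spent += estimateTokens(result[i]);
    if (spent > FUNCTION_RESPONSE_TOKEN_BUDGET) {
      // Over budget: keep only the tail of this older output.
      result[i] = { lines: result[i].lines.slice(-KEPT_TAIL_LINES) };
    }
  }
  return result;
}
```

The effect: the newest outputs are charged against the budget first, so they survive intact while old bulky outputs get clipped.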
**Phase 2: LLM Summarization with Verification**

- Find split point: keep the last 30% of history (`COMPRESSION_PRESERVE_THRESHOLD` = 0.3), compress the older 70%
- The split point must land on a user message (not on a function response)
- High-fidelity decision: if the original (pre-truncation) history fits in the model's token limit, send the original to the summarizer for better quality; otherwise send the truncated version
- Generate a `<state_snapshot>` using a dedicated compression prompt
- Verification probe: a second LLM call critically evaluates the snapshot and improves it if information was lost
- If a previous snapshot exists in the history, the prompt instructs the model to integrate it (incremental update)
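The split-point search can be sketched over a flat role list. Only the 0.3 preserve fraction is from the notes; the function name and types are illustrative:

```typescript
// Sketch of finding the compression split point: start at the ~70% mark,
// then advance until the preserved slice begins at a user message.
type Role = 'user' | 'model' | 'function';
const PRESERVE_THRESHOLD = 0.3;

function findSplitPoint(roles: Role[]): number {
  // Everything before `split` gets summarized; everything from it on is kept.
  let split = Math.floor(roles.length * (1 - PRESERVE_THRESHOLD));
  // Never split on a function response: move forward to the next user turn.
  while (split < roles.length && roles[split] !== 'user') split++;
  return split;
}
```

Advancing (rather than retreating) keeps at most 30% of history verbatim, at the cost of summarizing slightly more when the 70% mark lands mid-exchange.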
**Phase 3: Fallback (Truncation Only)**

If summarization previously failed (`hasFailedCompressionAttempt`):

- Skip LLM summarization entirely
- Only apply Phase 1 truncation
- Status: `CONTENT_TRUNCATED` (vs `COMPRESSED` for full summarization)
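The fallback branch reduces to a small decision function. The two status names mirror the notes; the signature is invented for illustration:

```typescript
// Sketch of graceful degradation: once an LLM summarization attempt has
// failed, later compressions skip the LLM and report truncation only.
type CompressionStatus = 'COMPRESSED' | 'CONTENT_TRUNCATED';

function compressOutcome(
  hasFailedCompressionAttempt: boolean,
  summarize: () => string | null, // returns null on failure (assumption)
): CompressionStatus {
  if (!hasFailedCompressionAttempt) {
    const snapshot = summarize();
    if (snapshot !== null) return 'COMPRESSED';
  }
  // Phase 1 truncation has already run; don't re-attempt the LLM call.
  return 'CONTENT_TRUNCATED';
}
```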
### Post-Compression History Structure

```
[user: <state_snapshot>...]   ← Summary injected as user message
[model: "Got it. Thanks..."]  ← Synthetic acknowledgment
[...recent 30% of history...] ← Preserved recent messages
```
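Assembling that shape could look like this; the acknowledgment string and turn type are illustrative, not the exact values used:

```typescript
// Sketch of rebuilding history after compression: snapshot as a user turn,
// a synthetic model acknowledgment, then the preserved recent tail.
type Turn = { role: 'user' | 'model'; text: string };

function buildCompressedHistory(snapshot: string, preserved: Turn[]): Turn[] {
  return [
    { role: 'user', text: `<state_snapshot>${snapshot}</state_snapshot>` },
    { role: 'model', text: 'Got it. Thanks for the additional context!' }, // illustrative wording
    ...preserved,
  ];
}
```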
## Tool Output Summarization

Separate from chat compression, Gemini CLI has a dedicated tool output summarizer (`utils/summarizer.ts`):

- Triggered when individual tool results exceed a size threshold
- Uses a dedicated LLM call with the `summarizer-default` model config
- Context-aware summarization: uses conversation history to understand what information matters
- Different strategies for directory listings (structural), text content, and shell command output (preserves error stack traces)

This means large tool outputs can be summarized before they even enter the main context, unlike Pi, where full tool results always go into context.
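A sketch of the dispatch implied above, with an assumed size threshold and invented strategy names:

```typescript
// Illustrative strategy selection: only outputs over a size threshold are
// summarized, with a strategy picked by content kind. The threshold value
// and strategy names are assumptions, not the real summarizer.ts config.
type OutputKind = 'directory-listing' | 'shell' | 'text';
const SUMMARIZE_THRESHOLD_CHARS = 4_000; // hypothetical threshold

function pickStrategy(kind: OutputKind, output: string): string {
  if (output.length <= SUMMARIZE_THRESHOLD_CHARS) return 'pass-through';
  switch (kind) {
    case 'directory-listing': return 'structural-summary';
    case 'shell': return 'summarize-preserving-stack-traces';
    default: return 'generic-text-summary';
  }
}
```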
## System Prompt Construction

`promptProvider.ts` builds the system prompt from composable sections:

| Section | Content |
|---|---|
| `preamble` | Identity, interactive vs non-interactive mode |
| `coreMandates` | Core behavioral rules, memory instructions |
| `subAgents` | Available sub-agent definitions |
| `agentSkills` | Available skills with descriptions and locations |
| `taskTracker` | Task tracking instructions (if enabled) |
| `primaryWorkflows` | Coding workflow guidance, tool usage patterns |
| `planningWorkflow` | Plan mode specific instructions (if in plan mode) |
| `operationalGuidelines` | Shell efficiency, interactive shell guidance |
| `sandbox` | Sandbox mode guidance (macos-seatbelt, generic, outside) |
| `interactiveYoloMode` | Auto-approve mode instructions |
| `gitRepo` | Git-specific instructions (if in a git repo) |
| `finalReminder` | Legacy model compatibility reminder |
Notable features:

- Sections toggleable: each section can be enabled/disabled via `isSectionEnabled()`, with env var overrides
- Model-aware: modern vs legacy model snippets (`snippets.ts` vs `snippets.legacy.ts`)
- Template override: the `GEMINI_SYSTEM_MD` env var can point to a custom `system.md` file
- Hierarchical memory: supports global + extension + project level memory injection
- `GEMINI.md` context files: project-specific instructions loaded from the filesystem
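The section-based assembly might be sketched like this. The env-var naming scheme for the `isSectionEnabled()` override is an assumption, not the real one:

```typescript
// Sketch of composable-section prompt assembly with per-section toggles.
type Section = { name: string; render: () => string };

function isSectionEnabled(name: string, env: Record<string, string>): boolean {
  // Hypothetical scheme: GEMINI_PROMPT_SECTION_GITREPO=0 disables gitRepo.
  return env[`GEMINI_PROMPT_SECTION_${name.toUpperCase()}`] !== '0';
}

function buildSystemPrompt(sections: Section[], env: Record<string, string>): string {
  return sections
    .filter((s) => isSectionEnabled(s.name, env))
    .map((s) => s.render())
    .join('\n\n');
}
```

The value of this design is that model-specific, mode-specific, and environment-specific guidance can be swapped in and out without touching the other sections.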
## Subagent Model

Gemini CLI has in-process subagents via `LocalAgentExecutor`:

```
Main agent (LocalAgentExecutor)
│
├─ tool call: subagent_tool({ task: "..." })
│    │
│    └─ LocalSubagentInvocation
│         └─ Creates new LocalAgentExecutor (in same process)
│              ├─ New GeminiChat instance (independent history)
│              ├─ Own system prompt (from agent definition)
│              └─ Streams activity back to parent as tool live output
│
├─ tool result: subagent's final output text
│
└─ Continues in main context
```
Key characteristics:

- In-process, new chat instance: the subagent runs in the same Node.js process but with a fresh `GeminiChat` (independent history)
- Agent definitions: loaded from the filesystem (`agentLoader.ts`); specify name, description, model config, tools, system prompt
- Activity streaming: the subagent streams progress (thoughts, tool calls) back to the parent as `ToolLiveOutput`
- Built-in agents: `codebase-investigator` (fast recon), browser agent, `cli-help-agent`, `generalist-agent`
- A2A support: also supports the Agent-to-Agent protocol for remote agents
- One-way context: the parent's context doesn't flow to the subagent; only the final result returns
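The one-way contract can be sketched as follows; every name here is an illustrative stand-in for the real executor API:

```typescript
// Sketch of the subagent contract: fresh chat (independent history), its own
// system prompt, and only the final text returned to the parent.
interface AgentDefinition { name: string; systemPrompt: string }

class FakeChat {
  history: string[] = [];
  constructor(public systemPrompt: string) {}
  run(task: string): string {
    this.history.push(task); // history starts empty: no parent context flows in
    return `result for: ${task}`;
  }
}

function runSubagent(def: AgentDefinition, task: string): string {
  const chat = new FakeChat(def.systemPrompt); // independent history per invocation
  return chat.run(task);                       // only the final output escapes
}
```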
## Comparison: Pi vs OpenClaw vs Gemini CLI

| Aspect | Pi | OpenClaw | Gemini CLI |
|---|---|---|---|
| Context model | Infinite accumulate | Infinite + multi-stage pipeline | Infinite accumulate |
| Pre-send processing | None | sanitize → validate → truncate → assemble | None |
| Compression trigger | `contextWindow - reserve` | Same (inherited from Pi) | 50% of token limit |
| Compression approach | LLM summary (1 call) | Same (inherited) + engine can override | LLM summary (2 calls: generate + verify) |
| Tool output handling | Full results in context | Full results in context | Pre-summarization of large outputs + budget-based truncation |
| Verification | None | None | Probe step verifies snapshot quality |
| Fallback on failure | Error | Error | Truncation-only mode (no LLM re-attempt) |
| Subagent | Extension (process isolation) | Built-in (gateway RPC) | Built-in (in-process, new chat) |
| Subagent context | Fully isolated | Isolated + bidirectional communication | Isolated (new `GeminiChat` instance) |
| System prompt | ~300 words | 15+ sections, 3 modes | Section-based, toggleable, model-aware |
## Session Storage vs Runtime Context

Previous research (`agent-cli/gemini-session-files.md`) analyzed Gemini CLI's session file structure:

- Sessions stored as single JSON files at `~/.gemini/tmp/{project-hash}/chats/session-{date}-{hash}.json`
- Simple flat structure: `{ sessionId, projectHash, messages: [...] }`
- Three message types: `user`, `gemini`, `info`

The earlier finding noted "context managed server-side, not exposed to client". Source code analysis shows this is not accurate for the current version; compression is fully client-side:

- `ChatCompressionService` runs locally, calling the Gemini API for summarization
- The `<state_snapshot>` is injected as a user message into the local chat history
- `chat.setHistory(newHistory)` replaces the in-memory history after compression
- Session files record the full conversation, including the compressed state

Key insight: the session file shows the final state (post-compression history), but the compression logic itself (split point calculation, truncation, summarization, verification) is invisible in the session file; only the result is persisted.
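The flat file shape, plus a check for whether a persisted session was ever compressed, might look like this. Field names follow the notes; the message payload type is an assumption:

```typescript
// Sketch of the flat session-file shape and a post-hoc compression check.
interface SessionFile {
  sessionId: string;
  projectHash: string;
  messages: { type: 'user' | 'gemini' | 'info'; content: string }[];
}

// Only the *result* of compression survives in the file: a user message
// carrying a <state_snapshot> tag.
function wasCompressed(session: SessionFile): boolean {
  return session.messages.some(
    (m) => m.type === 'user' && m.content.includes('<state_snapshot>'),
  );
}
```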
## Unique Design Choices

- Two-pass compression: generate summary → verify → fix. An extra LLM call, but higher-quality snapshots
- Tool output pre-summarization: large tool results are summarized BEFORE entering context, not just at compaction time. This is unique; Pi and OpenClaw put full tool results into context
- Reverse Token Budget for tool outputs: the newest tool results get full fidelity, the oldest get truncated first. Smart prioritization
- Graceful degradation: if summarization fails once, switches to truncation-only mode for all subsequent compressions (avoids repeated expensive failures)
- 50% threshold: a more aggressive compression trigger than Pi's "near the limit" approach. Compresses when half the context window is used
- State snapshot format: uses `<state_snapshot>` XML tags in the summary, enabling detection of previous snapshots for incremental updates
- In-process subagents: unlike Pi (OS process) or OpenClaw (gateway RPC), subagents run in the same Node.js process with a fresh chat instance. The simplest isolation model