
Continuous Learning & Personality Training: Research Summary

Last Updated: 2026-03-24


Studied Systems

| System | Type | Approach | Source |
|---|---|---|---|
| Neuro-sama | AI VTuber (1 character) | Iterative batch SFT on curated stream data | neuro-sama.research.md |
| Character.AI | Character platform (millions of characters) | DPO + personality constitutions (meta-character training) | character-ai.research.md |
| 5 open-source VTuber recreations | Community projects | Prompt engineering only, no weight modification | neuro-sama.research.md |
| SillyTavern community | Roleplay platform | Character cards (PLists, Ali:Chat) | personality-engineering.research.md |
| BIG5-CHAT (ACL 2025) | Academic | SFT/DPO on 100K Big Five dialogues | personality-engineering.research.md |
| FinePE | Academic | MoE-LoRA, per-subtrait modules | personality-engineering.research.md |
| PERSONA | Academic | Contrastive activation analysis + vector algebra | personality-engineering.research.md |
| SAS Personality Sliders | Academic | Sequential Adaptive Steering | personality-engineering.research.md |
| Anthropic Persona Vectors | Industry research | Monitoring + steering + vaccination | personality-engineering.research.md |

Core Finding: Three Paradigms for Personality

Personality engineering has three distinct paradigms, each operating at a different depth:

           ┌─────────────────────────────────────────────────┐
           │            Depth of Intervention                │
           │                                                 │
 Prompt    │  "Act as a pirate"                              │
 ──────────│  ✓ Zero cost, instant switch                    │
           │  ✗ Can't override alignment                     │
           │  ✗ Drifts over long context                     │
           │  ✗ 162-persona study: effect "largely random"   │
           │                                                 │
 Activation│  Inject α·v into residual stream                │
 ──────────│  ✓ Matches SFT performance (9.60 vs 9.61)      │
           │  ✓ Zero-parameter, instant composition          │
           │  ✗ Requires white-box model access              │
           │  ✗ Safety-aligned traits resist activation      │
           │  ? Long-context durability untested             │
           │                                                 │
 Weight    │  SFT / DPO / LoRA fine-tuning                   │
 ──────────│  ✓ Overrides alignment layer                    │
           │  ✓ Stable over long context                     │
           │  ✓ Human-like trait distributions               │
           │  ✗ Requires training data + GPU                 │
           │  ✗ Catastrophic forgetting on retraining        │
           │  ✗ Inflexible (need new adapter per persona)    │
           └─────────────────────────────────────────────────┘

Quantitative Comparison

| Method | PersonalityBench | Model Access | Cost | Multi-trait |
|---|---|---|---|---|
| Prompting | 8.39 | API OK | Zero | Fragile |
| Activation (PERSONA) | 9.60 | White-box | Low | Algebraic |
| SFT (fine-tuning) | 9.61 | Weights | High | 2^N models |
| MoE-LoRA (FinePE) | Outperforms SFT by 29% on BFI | Weights | Medium | Per-trait modules |

Headline result: Activation engineering matches fine-tuning on personality benchmarks, without any gradient updates.
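
The activation paradigm is simple enough to sketch concretely. A minimal, hedged example, assuming white-box access to a HuggingFace-style model: steering amounts to adding α·v to the residual stream at one layer via a forward hook. The layer index, strength, and random stand-in vector below are placeholders; in PERSONA-style pipelines, v is extracted by contrasting activations on trait-positive vs. trait-negative prompts.

```python
# Minimal sketch of activation-level steering: add α·v to the residual stream
# at one decoder layer via a PyTorch forward hook. LAYER, ALPHA, and the
# random stand-in vector are illustrative, not values from the papers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # any white-box model works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

LAYER = 14   # which residual-stream layer to steer (hyperparameter)
ALPHA = 4.0  # steering strength α
v = torch.randn(model.config.hidden_size)  # stand-in for a real persona vector
v = v / v.norm()

def steer(module, inputs, output):
    # Llama decoder layers return a tuple; output[0] is the hidden state.
    hidden = output[0] + ALPHA * v.to(output[0].device, output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steer)
ids = tok("Tell me about your day.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=64)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```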


Two Production Architectures

Only two systems have confirmed weight-level personality training in production:

Architecture A: Per-Character Training (Neuro-sama)

Stream interactions → Vedal curates data → SFT on 2B model → Deploy
                              ↑                                  │
                              └──────────────────────────────────┘
  • Model: 2B params, q2_k quantization (latency-driven)
  • Personality: Core traits in weights, situational behavior in prompts
  • Learning: Iterative batch SFT, human-in-the-loop, irregular cadence
  • Scale: 1 character (2 with Evil Neuro)

Architecture B: Meta-Character Training (Character.AI)

Layer 4: User feedback (star ratings → response selection, no weight change)
Layer 3: User prompt (character definition via Prompt Poets)
Layer 2: Character Training (DPO + "I am..." constitutions → weights)  ★ moat
Layer 1: Foundation model (custom → now hybrid with third-party)
  • Model: Proprietary, param count unknown, MQA, native int8
  • Personality: Trained to generalize from ANY character description
  • Learning: Character Training teaches the mapping "description → behavior"
  • Scale: Millions of characters, 30K msg/s, 95% KV cache hit

The Fundamental Difference

| Dimension | Per-Character (Neuro-sama) | Meta-Character (Character.AI) |
|---|---|---|
| Training goal | BE a specific character | ADOPT any character from description |
| New character cost | Full retraining | Zero (just write a description) |
| Personality depth | Deepest (in weights) | Deep (in weights, but generalized) |
| Scalability | O(N) models for N characters | O(1) model for N characters |
| Data source | Curated real interactions | Synthetic constitutional data |

Both approaches are confirmed in production. Anthropic uses a method similar to Character.AI's for Claude's character training (Amanda Askell: "It's like constitutional AI, but without any human data").
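
Character.AI's exact recipe is unpublished, but the described mechanism (DPO on preference pairs whose prompt carries an "I am ..." constitution) can be sketched with trl. Everything below is illustrative: the base model, the constitution text, and the preference pair are placeholders, not Character.AI's data.

```python
# Hedged sketch of meta-character training: DPO on preference pairs whose
# prompt embeds an "I am ..." constitution, so the model learns the mapping
# description -> behavior rather than one fixed persona. (In older trl
# versions, `processing_class` is named `tokenizer`.)
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2-0.5B"  # stand-in base model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

constitution = "I am Captain Mora, a gruff but fair pirate queen."
pairs = Dataset.from_list([{
    "prompt": f"{constitution}\nUser: How should we split the loot?\nCaptain Mora:",
    "chosen": " Evenly, once every hand has pulled its weight.",     # stays in character
    "rejected": " As an AI language model, I cannot discuss loot.",  # breaks character
}])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="character-dpo", beta=0.1,
                   per_device_train_batch_size=1),
    train_dataset=pairs,
    processing_class=tok,
)
trainer.train()
```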


What the Open-Source Community Tells Us

Five community projects attempt to recreate Neuro-sama. Key findings:

  1. Pipeline is the easy part. All projects successfully wire LLM + TTS + STT + avatar.
  2. Prompt engineering hits a ceiling. Aligned base models resist behaviors (swearing, aggression) even when explicitly prompted. This is the clearest evidence for why fine-tuning adds value.
  3. Nobody attempts iterative learning. Zero projects implement deploy → collect → retrain.
  4. Memory is the missing piece. Only moeru-ai/airi attempts cross-session memory (DuckDB + PGVector, WIP). All others have none.

Connection to Pillar 3

In findings.md, Pillar 3 (Continuous Learning) was described as "unexplored." This research partially fills that gap:

What We Now Know

Pillar 3 is not a single problem. It decomposes into at least three sub-problems:

| Sub-problem | What changes | Production example |
|---|---|---|
| Personality training | How the model responds (style, tone, traits) | Character.AI, Neuro-sama |
| Knowledge updating | What the model knows (facts, skills) | None confirmed in production |
| Adaptation | Per-user behavioral preferences | Character.AI feedback (indirect) |

Current production systems only address personality. Knowledge updating and per-user adaptation remain in the domain of external memory (Pillar 1+2).

The Research-to-Production Gap

Academic methods are mature:

  • BIG5-CHAT proves SFT/DPO work for personality (ACL 2025)
  • PERSONA proves activation engineering works without training
  • FinePE proves per-trait LoRA composition works

But production deployment is rare because:

  1. Prompting is "good enough" for most commercial use cases
  2. ROI is hard to quantify — personality quality lacks standard metrics
  3. Evaluation is unsolved — "is the personality right?" is subjective
  4. Those who do it don't publish — it is a competitive moat

Between external memory (Pillar 1) and weight updates (Pillar 3), activation engineering offers a middle ground:

Pillar 1: External Memory    ←→  Facts, knowledge (Mem0, Letta)
                                  Cheap, auditable, instant update
                                  Shallow (reconstructed from prompts)

    NEW: Activation Layer    ←→  Personality, style, behavioral traits
                                  No training needed, instant composition
                                  Deep (manipulates internal representations)
                                  Requires white-box model access

Pillar 3: Weight Updates     ←→  Core personality, alignment override
                                  Most persistent, hardest to change
                                  Expensive, risk of catastrophic forgetting

Anthropic's Vaccination Concept

The most forward-looking finding: Anthropic's "preventative steering" — injecting persona vectors during training to make models resistant to acquiring specific traits. This has implications for:

  • Catastrophic forgetting: "vaccinate" personality traits before retraining on new data
  • Safety alignment: prevent personality drift during continual learning
  • Production deployment: monitor personality vectors for drift detection (sketched below)
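
A minimal sketch of the monitoring piece, assuming you already have a persona vector v and can read hidden states at one layer (e.g., via `output_hidden_states=True`). The layer choice, baseline, and threshold are hypothetical calibration values, not published numbers.

```python
# Sketch of persona-vector drift monitoring: project each response's mean
# activation at one monitored layer onto a trait vector and alert when the
# projection moves away from a calibrated baseline.
import torch

def trait_projection(hidden_states: torch.Tensor, v: torch.Tensor) -> float:
    """hidden_states: (seq_len, d_model) activations at the monitored layer."""
    mean_act = hidden_states.mean(dim=0)
    return torch.dot(mean_act, v / v.norm()).item()

BASELINE = 0.12   # calibrated on known-good responses (hypothetical)
THRESHOLD = 0.50  # allowed drift before alerting (hypothetical)

def check_drift(hidden_states: torch.Tensor, v: torch.Tensor) -> float:
    score = trait_projection(hidden_states, v)
    if abs(score - BASELINE) > THRESHOLD:
        print(f"personality drift: projection {score:.3f} vs baseline {BASELINE}")
    return score
```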


Updated Pillar 3 Definition

Based on this research, the original Pillar 3 description in findings.md should be expanded:

Original: "Write knowledge into model weights so it persists without external storage."

Updated: Pillar 3 encompasses three intervention depths:

  1. Activation-level — modify personality/behavior at inference time without weight changes (PERSONA, SAS, Anthropic persona vectors). Training-free, composable, but requires white-box access.
  2. Adapter-level — per-trait or per-user LoRA modules, dynamically loaded (FinePE, multi-LoRA serving). Moderate cost, good scalability.
  3. Weight-level — full SFT/DPO on personality-annotated or constitutional data (Character.AI, Neuro-sama, BIG5-CHAT). Deepest intervention, highest cost.

Production systems combine multiple layers: Character.AI uses weight-level (Layer 2) + prompt-level (Layer 3) + feedback (Layer 4).


Multi-LoRA: Per-User Personalization at Scale

Full details: multi-lora.research.md

Serving Infrastructure (Solved)

Multi-LoRA serving is production-ready. S-LoRA demonstrates 2,000 concurrent adapters on a single A100-80GB. vLLM, NVIDIA NIM, Together AI, Fireworks, and Groq all support dynamic per-request adapter switching with minimal overhead (~2 ms/token via the Punica SGMV kernel).
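
For illustration, per-request switching in vLLM looks roughly like the sketch below; the model name and adapter paths are placeholders.

```python
# Minimal sketch of per-request adapter switching with vLLM's multi-LoRA
# support. Each request names its own adapter; vLLM batches across adapters.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_lora=True, max_loras=8)
params = SamplingParams(max_tokens=64)

out_a = llm.generate(
    ["Summarize my week."], params,
    lora_request=LoRARequest("user_a", 1, "/adapters/user_a"),
)
out_b = llm.generate(
    ["Summarize my week."], params,
    lora_request=LoRARequest("user_b", 2, "/adapters/user_b"),
)
```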

Per-User Adapter Generation (Emerging)

The bottleneck has shifted from serving to adapter creation. Three approaches bypass traditional fine-tuning:

| Method | Time per User | Storage | Key Innovation |
|---|---|---|---|
| Doc-to-LoRA (Sakana AI) | <1 s | ~50 MB | Perceiver hypernetwork, 83.5% of full-context quality |
| Profile-to-PEFT | 0.57 s | ~50 MB | MLP hypernetwork from user embedding |
| Personalized Pieces | ~50 steps | 0.45 MB | Assembled from a shared piece pool, no training |
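
The hypernetwork idea is straightforward to sketch: a small network maps a user embedding directly to the LoRA factors, so provisioning a new user is one forward pass rather than a training run. All dimensions and the single-module scope below are illustrative; Doc-to-LoRA and Profile-to-PEFT differ in architectural details.

```python
# Sketch of hypernetwork-based adapter generation in the spirit of
# Profile-to-PEFT / Doc-to-LoRA: an MLP maps a user embedding directly to
# LoRA factors A and B, so creating a new user's adapter needs no gradients.
import torch
import torch.nn as nn

d_model, rank, d_user = 4096, 8, 256

class LoRAHypernet(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_user, 1024), nn.GELU())
        self.to_A = nn.Linear(1024, rank * d_model)  # down-projection factor
        self.to_B = nn.Linear(1024, d_model * rank)  # up-projection factor

    def forward(self, user_emb):
        h = self.trunk(user_emb)
        A = self.to_A(h).view(rank, d_model)
        B = self.to_B(h).view(d_model, rank)
        return A, B  # ΔW = B @ A for one attention projection

hypernet = LoRAHypernet()
user_emb = torch.randn(d_user)   # derived from profile / interaction history
A, B = hypernet(user_emb)        # sub-second: one forward pass per new user
delta_W = B @ A                  # per-user weight delta, applied at serve time
```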

Adapter Composition

Multiple LoRA adapters can be combined by weight merging (CAT/TIES/DARE), runtime stacking, or learned routing (MoLoRA). MoLoRA lets Qwen3-1.7B with 4 adapters outperform Qwen3-8B.
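
As a toy example of the merging route: the simplest form is a weighted sum of adapter deltas (CAT-style); TIES adds trimming and sign election, and DARE randomly drops and rescales delta entries before merging. The mixing weights below are illustrative.

```python
# Toy sketch of adapter composition by weight merging (plain weighted sum).
import torch

def merge_lora_deltas(deltas, weights):
    """deltas: list of same-shape ΔW tensors; weights: mixing coefficients."""
    merged = torch.zeros_like(deltas[0])
    for d, w in zip(deltas, weights):
        merged += w * d
    return merged

style_delta = torch.randn(4096, 4096)   # e.g., a personality adapter's ΔW
domain_delta = torch.randn(4096, 4096)  # e.g., a domain-knowledge adapter's ΔW
combined = merge_lora_deltas([style_delta, domain_delta], [0.6, 0.4])
```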

Production Cases

  • Convirza: 60+ adapters on Llama-3-8B, 10x cost reduction vs OpenAI
  • Phonely: LoRA hot-swapping on Groq, 99.2% accuracy (surpassing GPT-4o)
  • DoorDash: LoRA for domain models, but per-user personalization remains RAG-based

Hybrid Memory → Weight Pipeline

Full details: hybrid-memory-weight.research.md

Current State

No production system implements the full memory-to-weight pipeline. The closest analogs:

  • Letta: explicit roadmap for token-to-weight distillation; shipped learning-sdk and skill learning (36.8% improvement)
  • Cursor: memory-to-weight for the retrieval layer (12.5% QA accuracy improvement), not the LLM itself
  • Google Gboard: full federated loop, but on small on-device models, not LLMs

Five Proposed Architectures

| Architecture | Source | Trigger | Forgetting Mitigation |
|---|---|---|---|
| Token-first distillation | Letta | Accumulated volume | Memory survives model upgrades |
| Instant injection | Doc-to-LoRA | Per-document (<1 s) | N/A (additive, not destructive) |
| Continuous self-instruct | AWS | Scheduled (quarterly) | Human-in-the-loop review |
| Self-evolving agent | EvolveR/MemRL | Continuous (online/offline) | RL-based policy updates |
| Sparse memory fine-tuning | Jessy Lin et al. | Per-item or batched | 11% forgetting vs 89% full FT |

The Catastrophic Forgetting Breakthrough

Sparse memory fine-tuning (arXiv:2510.15103) reduces forgetting from 89% to 11% by updating only the memory slots most strongly activated by new knowledge. Combined with the ICLR 2025 finding that much "forgetting" is actually alignment degradation (reversible with 50-100 samples), the memory-to-weight pipeline may be more viable than previously thought.
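
A conceptual sketch of the mechanism, under the assumption of a memory-layer architecture where knowledge lives in addressable slots; the slot scoring and hyperparameters are illustrative, not the paper's.

```python
# Conceptual sketch of sparse memory fine-tuning: update only the top-k
# memory slots most activated by the new knowledge; freeze everything else.
import torch

def sparse_memory_update(memory, activations, grads, k=32, lr=1e-4):
    """memory: (num_slots, d) parameter; activations: (num_slots,) usage
    scores on the new data; grads: dL/d(memory), same shape as memory."""
    topk = torch.topk(activations, k).indices  # slots the new fact activates
    with torch.no_grad():
        memory[topk] -= lr * grads[topk]       # SGD step on those slots only
    return topk                                # all other slots untouched

# Toy usage with random stand-ins.
mem = torch.nn.Parameter(torch.randn(1024, 64))
acts = torch.rand(1024)
g = torch.randn(1024, 64)
updated_slots = sparse_memory_update(mem, acts, g)
```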

Emerging Consensus

The field converges on token-first, weight-second: store learning in token/memory space (auditable, portable, model-agnostic), distill into weights only for efficiency. Letta's position: "The weights are temporary; the learned context is what persists."
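
The pattern is easy to state in code. A hedged sketch, assuming a simple question/answer memory schema and an accumulated-volume trigger (both assumptions, not Letta's actual design):

```python
# Token-first, weight-second: the memory store is the source of truth;
# distillation into weights is an optimization, triggered by volume.
# Schema, threshold, and the fine-tuning step are all illustrative.
def memory_to_sft_examples(memory_entries):
    """Turn auditable memory entries into supervised training pairs."""
    return [
        {"prompt": f"Question: {e['question']}",
         "completion": f" {e['learned_answer']}"}
        for e in memory_entries
    ]

store = [
    {"question": "Which timezone does the user prefer?",
     "learned_answer": "UTC+9 (Asia/Tokyo)."},
]
DISTILL_THRESHOLD = 10_000  # accumulated-volume trigger (hypothetical)

if len(store) >= DISTILL_THRESHOLD:
    sft_data = memory_to_sft_examples(store)
    # Fine-tune a LoRA on sft_data here, then serve memory + adapter together;
    # on a model upgrade, re-distill: the memory survives the weights.
```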


Open Questions

  1. Does activation engineering scale to larger models? All published results are on 7B-8B models. Do persona vectors remain orthogonal at 70B+?

  2. Can the three layers be combined? Memory (facts) + activation steering (personality) + LoRA (domain knowledge) + prompts (situation). No one has built a four-layer personalization stack.

  3. When does distillation become worth it? If retrieval costs X per query and distillation costs Y per training run, at what query volume does memory-to-weight pay off?

  4. Can Doc-to-LoRA handle personality, not just facts? Current results are on factual QA. Can hypernetworks generate personality adapters as effectively as DPO-based character training?

  5. Is Letta's learning-sdk the first step toward the full pipeline? If they ship weight distillation, it would be the first complete memory-to-weight implementation.

  6. Can sparse memory fine-tuning work with standard transformers? The 11% vs 89% result requires a non-standard architecture with dedicated memory layers.