# Continuous Learning Research Plan
Last Updated: 2026-03-24
## Goal
Research how LLMs can learn and adapt after deployment — the "Pillar 3" identified in findings.md. This is the missing piece: Memory (external storage) and Context (window management) are well-studied, but writing knowledge into model weights remains largely unexplored in production.
## Research Approach
Focus on what is publicly available: production systems, open-source projects, published case studies, and academic surveys. Scope excludes original research and model training.
## Research Directions
### Direction 1: AI VTuber / Character AI (Custom-Trained Personality)
The clearest real-world example of "personality written into weights."
| Target | Details | Status |
|---|---|---|
| Neuro-sama | 2B parameter custom-trained LLM by Vedal. Training data from Twitch interactions. Personality from weights, not prompts. Technical details intentionally private | ✅ Done → neuro-sama.research.md |
| Open-LLM-VTuber | Open source project (GitHub). Uses prompt engineering for personality — contrast with Neuro-sama's weight-based approach | ✅ Covered in neuro-sama.research.md |
| Character.AI | DPO + personality constitutions for meta-character training. One model generalizes to ANY character given a description. 30K msg/s. Post-Google pivot to third-party pre-trained + proprietary post-training | ✅ Done → character-ai.research.md |
| Community attempts | 5 projects surveyed: kimjammer/Neuro, Open-LLM-VTuber, moeru-ai/airi, AIRIS-VtuberAI, VedalAI/neuro-sdk. All focus on pipeline engineering; none attempt weight-level personality or iterative learning. Memory implementations minimal. | ✅ Covered in neuro-sama.research.md |
**Key question:** What's the boundary between "prompt-crafted personality" and "weight-embedded personality"? At what point does fine-tuning produce something that prompt engineering can't replicate?
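One way to make that boundary concrete is the shape of the artifact each approach produces: a prompt persona is text paid for on every request, while a weight persona is a training dataset baked in once. A minimal sketch (the persona text and function names are hypothetical, not from any surveyed system):

```python
import json

# Hypothetical persona, for illustration only.
PERSONA = "A sarcastic streamer who teases chat but never breaks character."

def prompt_based_request(user_msg):
    """Prompt-crafted personality: the persona rides in the context
    window, costs tokens on every call, and can drift or be overridden
    as the conversation grows."""
    return {
        "messages": [
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": user_msg},
        ]
    }

def weight_based_example(user_msg, in_character_reply):
    """Weight-embedded personality: one JSONL line of a fine-tuning set.
    Thousands of in-character exchanges are trained into the weights,
    so no system prompt is needed at inference time."""
    return json.dumps({
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": in_character_reply},
        ]
    })

req = prompt_based_request("How are you today?")
line = weight_based_example("How are you today?", "Thriving, unlike your aim.")
```

The asymmetry is the research question: the second form can shape tone in ways the first can't express, but it can't be edited per-request.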
### Cross-Cutting: Personality Engineering Methods
| Target | Details | Status |
|---|---|---|
| Prompt-based personality | SillyTavern character cards, Eliza character files, 162-persona study (null results). Boundary: can't override alignment, drifts over long context | ✅ Done → personality-engineering.research.md |
| Fine-tuning for personality | BIG5-CHAT (ACL 2025), OpenCharacter, FinePE (MoE-LoRA per Big Five subtrait) | ✅ Done → personality-engineering.research.md |
| Activation engineering | PERSONA (matches SFT training-free), SAS personality sliders, Anthropic persona vectors (monitoring + vaccination) | ✅ Done → personality-engineering.research.md |
### Direction 2: Per-User Personalization (Multi-LoRA)
The most production-ready approach to continuous adaptation.
| Target | Details | Status |
|---|---|---|
| Multi-LoRA serving | vLLM, S-LoRA (2000 adapters/GPU), Punica (SGMV kernel), NVIDIA NIM, LoRAX, Together AI, Fireworks, Groq. Serving is solved | ✅ Done → multi-lora.research.md |
| Sakana AI Doc-to-LoRA | Hypernetwork generates LoRA from document in <1s. 83.5% of full-context quality. ~50MB per adapter | ✅ Done → multi-lora.research.md |
| Profile-to-PEFT | MLP hypernetwork generates per-user LoRA in 0.57s. Also: Personalized Pieces (0.45MB/user), Apple PLUM, MTA | ✅ Done → multi-lora.research.md |
| DoorDash personalization | Uses LoRA for domain-specific models, but per-user personalization is RAG-based. Also: Convirza (60+ adapters), Phonely (99.2% accuracy) | ✅ Done → multi-lora.research.md |
**Key question:** Is per-user LoRA a viable middle ground between prompt injection and full fine-tuning? What are the production tradeoffs (storage, latency, staleness)?
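The storage and serving tradeoff follows from the LoRA arithmetic itself. A minimal sketch of per-request adapter selection, with toy sizes (the adapter registry and user IDs are hypothetical):

```python
def matvec(M, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """y = Wx + scale * B(Ax). The base W stays frozen and shared across
    all users; only the low-rank pair (A: r x d_in, B: d_out x r) is
    per-user, so per-user storage grows with rank r, not with d."""
    delta = matvec(B, matvec(A, x))
    return [y + scale * d for y, d in zip(matvec(W, x), delta)]

# Toy sizes, illustration only: d_in = d_out = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]          # shared base weights
adapters = {                          # per-user adapter registry
    "user_a": ([[1.0, 1.0]], [[0.5], [0.5]]),
}
A, B = adapters["user_a"]             # selected per request
y = lora_forward(W, A, B, [2.0, 3.0])
```

At realistic sizes (d = 4096, r = 8) an adapter pair holds r·(d_in + d_out) ≈ 66K parameters per weight matrix versus ~16.8M for the frozen base, which is why a single GPU can host thousands of adapters (S-LoRA's figure above) while serving one shared model.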
### Direction 3: Academic Survey (Continual Learning for LLMs)
Understand the theoretical landscape and what's possible vs what's practical.
| Target | Details | Status |
|---|---|---|
| ACM CSUR 2025 survey | Continual Learning of LLMs — comprehensive survey covering continual pre-training, instruction tuning, and alignment | ✅ Referenced in hybrid-memory-weight.research.md |
| Catastrophic forgetting solutions | Sparse memory fine-tuning (11% vs 89% forgetting), self-synthesized rehearsal, LoRA "learns less forgets less" | ✅ Done → hybrid-memory-weight.research.md |
| Self-Evolving LLMs | MoE-CL (Tencent, production), EvolveR, MemRL, MemSkill | ✅ Done → hybrid-memory-weight.research.md |
| Spurious forgetting | ICLR 2025: much "forgetting" is alignment degradation, not knowledge loss. Reversible with 50-100 samples | ✅ Done → hybrid-memory-weight.research.md |
**Key question:** How close is continual learning to production-ready? What's the gap between academic SOTA and what's deployable?
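The rehearsal-style mitigations in the table share one mechanism: every fine-tuning batch revisits old (or self-synthesized) examples so updates on new data don't overwrite old behavior. A minimal sketch, assuming a fixed replay fraction (parameter names are illustrative):

```python
import random

def rehearsal_batches(new_data, replay_buffer,
                      batch_size=8, replay_frac=0.25, seed=0):
    """Yield fine-tuning batches that mix a fixed fraction of replayed
    examples into every batch of new data, the basic defense against
    catastrophic forgetting."""
    rng = random.Random(seed)
    n_replay = max(1, round(batch_size * replay_frac))
    n_new = batch_size - n_replay
    for i in range(0, len(new_data), n_new):
        old = rng.sample(replay_buffer, min(n_replay, len(replay_buffer)))
        yield new_data[i:i + n_new] + old

new = [("new", i) for i in range(12)]
old = [("old", i) for i in range(50)]
batches = list(rehearsal_batches(new, old))
```

Self-synthesized rehearsal replaces `replay_buffer` with examples the model generates about its own earlier behavior, avoiding the need to retain original training data.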
### Direction 4: Hybrid Memory → Weight Pipeline
The logical endpoint: external memories accumulated over time, periodically fine-tuned into weights.
| Target | Details | Status |
|---|---|---|
| Concept exploration | No production system does full pipeline. Letta has explicit roadmap (token-first + weight distillation). 5 architecture proposals surveyed | ✅ Done → hybrid-memory-weight.research.md |
| Federated learning | Google Gboard (30+ models, DP ε≤1), Apple (keyboard), FwdLLM (LLaMA-7B on mobile). All small models, not LLMs | ✅ Done → hybrid-memory-weight.research.md |
| Cursor's approach | Session traces → LLM ranking oracle → train custom embedding model. 12.5% QA accuracy improvement. Memory-to-weight for retrieval layer | ✅ Done → hybrid-memory-weight.research.md |
**Key question:** Is "accumulate memories then fine-tune" a viable architecture? What would the update cycle look like?
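The update cycle in question can be sketched as a threshold-triggered loop. This is a hypothetical shape only: per the survey above, no production system implements the full pipeline, and all names and thresholds here are illustrative:

```python
class MemoryToWeightPipeline:
    """Sketch of the 'accumulate then distill' cycle: token-first
    operation, with a periodic consolidation step into weights."""

    def __init__(self, threshold=100):
        self.threshold = threshold  # memory count that triggers a cycle
        self.memories = []          # external store (token-first phase)
        self.cycles = 0

    def observe(self, user_msg, reply):
        """Normal operation: every interaction lands in external memory
        and is immediately retrievable via the context window."""
        self.memories.append({"prompt": user_msg, "completion": reply})
        if len(self.memories) >= self.threshold:
            self.distill()

    def distill(self):
        """Periodic update: stand-in for converting memories to training
        pairs, fine-tuning a LoRA (or merging into base weights),
        evaluating for forgetting, and deploying the new adapter."""
        self.memories.clear()
        self.cycles += 1

pipe = MemoryToWeightPipeline(threshold=2)
pipe.observe("hi", "hello")
pipe.observe("bye", "see you")
```

The open design questions map onto the stubs: what the trigger should be (count, time, drift), and how `distill` avoids the forgetting failure modes from Direction 3.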
## Prioritization
| Priority | Direction | Reason |
|---|---|---|
| 1 | Direction 3 (Academic survey) | Establishes theoretical foundation before looking at practice |
| 2 | Direction 2 (Multi-LoRA) | Most production-ready, actionable for real systems |
| 3 | Direction 1 (VTuber/Character) | Interesting case study but limited public technical detail |
| 4 | Direction 4 (Hybrid pipeline) | Mostly speculative, explore last |
## Output
Per-direction research documents (*.research.md), cross-direction summary, and update to findings.md Pillar 3 section.