Continuous Learning Research Plan

Last Updated: 2026-03-24

Goal

Research how LLMs can learn and adapt after deployment — the "Pillar 3" identified in findings.md. This is the missing piece: Memory (external storage) and Context (window management) are well-studied, but writing knowledge into model weights remains largely unexplored in production.

Research Approach

Focus on what is publicly available: production systems, open-source projects, published case studies, and academic surveys. No original research or model training is planned.


Research Directions

Direction 1: AI VTuber / Character AI (Custom-Trained Personality)

The clearest real-world example of "personality written into weights."

| Target | Details | Status |
| --- | --- | --- |
| Neuro-sama | 2B-parameter custom-trained LLM by Vedal. Training data from Twitch interactions. Personality comes from weights, not prompts. Technical details intentionally private | ✅ Done → neuro-sama.research.md |
| Open-LLM-VTuber | Open-source project (GitHub). Uses prompt engineering for personality (contrast with Neuro-sama's weight-based approach) | ✅ Covered in neuro-sama.research.md |
| Character.AI | DPO + personality constitutions for meta-character training. One model generalizes to any character given a description. 30K msg/s. Post-Google pivot to third-party pre-trained models + proprietary post-training | ✅ Done → character-ai.research.md |
| Community attempts | 5 projects surveyed: kimjammer/Neuro, Open-LLM-VTuber, moeru-ai/airi, AIRIS-VtuberAI, VedalAI/neuro-sdk. All focus on pipeline engineering; none attempt weight-level personality or iterative learning. Memory implementations are minimal | ✅ Covered in neuro-sama.research.md |

Key question: What's the boundary between "prompt-crafted personality" and "weight-embedded personality"? At what point does fine-tuning produce something that prompt engineering can't replicate?
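One side of that boundary is easy to make concrete. The community projects above keep personality entirely in the prompt; a minimal sketch of that approach (field names are illustrative, not any project's actual card schema):

```python
# Minimal sketch of prompt-based personality, the approach used by the
# open-source projects surveyed above. Field names are illustrative,
# not any project's actual card schema.

def build_system_prompt(card: dict) -> str:
    """Assemble a character card into a system prompt.

    The personality lives entirely in tokens: it consumes context on
    every request and can drift over a long conversation.
    """
    lines = [
        f"You are {card['name']}.",
        f"Personality: {card['personality']}",
        f"Speaking style: {card['style']}",
    ]
    lines += [f"Example: {ex}" for ex in card.get("examples", [])]
    return "\n".join(lines)

card = {
    "name": "Mika",
    "personality": "sarcastic, curious, easily distracted",
    "style": "short sentences, frequent dry jokes",
    "examples": ["Chat: hi -> Mika: oh. you're back. thrilling."],
}
prompt = build_system_prompt(card)
```

Everything here is tokens, which is exactly the boundary in question: the persona re-enters the context window on every request, and nothing persists in the model itself.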

Cross-Cutting: Personality Engineering Methods

| Target | Details | Status |
| --- | --- | --- |
| Prompt-based personality | SillyTavern character cards, Eliza character files, 162-persona study (null results). Limits: cannot override alignment; drifts over long context | ✅ Done → personality-engineering.research.md |
| Fine-tuning for personality | BIG5-CHAT (ACL 2025), OpenCharacter, FinePE (MoE-LoRA per Big Five subtrait) | ✅ Done → personality-engineering.research.md |
| Activation engineering | PERSONA (matches SFT training-free), SAS personality sliders, Anthropic persona vectors (monitoring + vaccination) | ✅ Done → personality-engineering.research.md |
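The activation-engineering line of work steers personality at inference time rather than via training. A toy, self-contained sketch of the core idea; the vectors here are invented, while real persona vectors are extracted from model activations:

```python
# Toy sketch of activation steering: add a scaled "persona vector" to a
# hidden state at inference time, no training involved. Real systems
# derive the vector from model activations; these numbers are made up.

def persona_vector(with_trait, without_trait):
    """Trait direction: difference of mean activations with/without it."""
    dims = len(with_trait[0])
    mean = lambda rows, i: sum(r[i] for r in rows) / len(rows)
    return [mean(with_trait, i) - mean(without_trait, i)
            for i in range(dims)]

def steer(hidden, v, alpha):
    """h' = h + alpha * v, applied at a chosen layer on each forward pass."""
    return [h + alpha * x for h, x in zip(hidden, v)]

# Two 2-dim "activations" per condition (real ones are model-sized).
with_trait = [[1.0, 0.0], [1.0, 0.2]]
without_trait = [[0.0, 0.0], [0.0, 0.2]]
v = persona_vector(with_trait, without_trait)   # trait direction
steered = steer([0.5, 0.5], v, alpha=2.0)       # strengthened trait
```

The same direction vector supports both uses in the table: add it (or subtract it) to steer behavior, or monitor its projection onto live activations to detect persona drift.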

Direction 2: Per-User Personalization (Multi-LoRA)

The most production-ready approach to continuous adaptation.

| Target | Details | Status |
| --- | --- | --- |
| Multi-LoRA serving | vLLM, S-LoRA (2,000 adapters/GPU), Punica (SGMV kernel), NVIDIA NIM, LoRAX, Together AI, Fireworks, Groq. Serving is a solved problem | ✅ Done → multi-lora.research.md |
| Sakana AI Doc-to-LoRA | Hypernetwork generates a LoRA from a document in under 1 s. 83.5% of full-context quality. ~50MB per adapter | ✅ Done → multi-lora.research.md |
| Profile-to-PEFT | MLP hypernetwork generates a per-user LoRA in 0.57 s. Also: Personalized Pieces (0.45MB/user), Apple PLUM, MTA | ✅ Done → multi-lora.research.md |
| DoorDash personalization | Uses LoRA for domain-specific models, but per-user personalization is RAG-based. Also: Convirza (60+ adapters), Phonely (99.2% accuracy) | ✅ Done → multi-lora.research.md |

Key question: Is per-user LoRA a viable middle ground between prompt injection and full fine-tuning? What are the production tradeoffs (storage, latency, staleness)?
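The LoRA math behind that tradeoff question is compact. A pure-Python sketch of the per-user idea, as a stand-in for the batched kernels in S-LoRA/Punica; all shapes and values are toy-sized and invented:

```python
# Pure-Python sketch of per-user LoRA: a shared base weight W plus a
# per-user low-rank delta, W_eff = W + (alpha / r) * B @ A. A stand-in
# for the batched kernels in S-LoRA/Punica; all shapes are toy-sized.

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def effective_weight(W, adapter, alpha=1.0):
    """Materialize W_eff for one user's (B, A) pair, rank r = rows of A."""
    B, A = adapter
    scale = alpha / len(A)
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

adapters = {  # one tiny rank-1 adapter per user (values invented)
    "alice": ([[1.0], [0.0]], [[0.5, 0.5]]),
}
W = [[1.0, 0.0], [0.0, 1.0]]          # shared base weight
W_alice = effective_weight(W, adapters["alice"])
```

Production servers never materialize W_eff per user: they keep W shared and apply each request's low-rank delta to the activations in a batched kernel, which is what makes thousands of adapters per GPU feasible.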

Direction 3: Academic Survey (Continual Learning for LLMs)

Understand the theoretical landscape and what's possible vs what's practical.

| Target | Details | Status |
| --- | --- | --- |
| ACM CSUR 2025 survey | Continual Learning of LLMs: a comprehensive survey covering continual pre-training, instruction tuning, and alignment | ✅ Referenced in hybrid-memory-weight.research.md |
| Catastrophic forgetting solutions | Sparse memory fine-tuning (11% vs 89% forgetting), self-synthesized rehearsal, LoRA ("learns less and forgets less") | ✅ Done → hybrid-memory-weight.research.md |
| Self-evolving LLMs | MoE-CL (Tencent, in production), EvolveR, MemRL, MemSkill | ✅ Done → hybrid-memory-weight.research.md |
| Spurious forgetting | ICLR 2025: much "forgetting" is alignment degradation, not knowledge loss, and is reversible with 50-100 samples | ✅ Done → hybrid-memory-weight.research.md |
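In practice, most of the forgetting mitigations above reduce to controlling which gradients the old behavior keeps receiving. A minimal sketch of the rehearsal idea; the batch size and replay fraction are arbitrary, and self-synthesized rehearsal would generate `old_data` from the model itself rather than store it:

```python
# Minimal sketch of rehearsal for continual fine-tuning: each training
# batch mixes new-task examples with a fraction of old-task (or
# self-synthesized) examples so earlier behavior keeps getting gradient.

import random

def rehearsal_batches(new_data, old_data, batch_size=4,
                      replay_frac=0.25, seed=0):
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_frac))
    n_new = batch_size - n_replay
    for i in range(0, len(new_data), n_new):
        batch = new_data[i:i + n_new]
        batch += rng.sample(old_data, min(n_replay, len(old_data)))
        yield batch

new = [f"new-{i}" for i in range(6)]
old = [f"old-{i}" for i in range(10)]
batches = list(rehearsal_batches(new, old))
# every batch now carries at least one replayed old-task example
```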

Key question: How close is continual learning to production-ready? What's the gap between academic SOTA and what's deployable?

Direction 4: Hybrid Memory → Weight Pipeline

The logical endpoint: external memories accumulated over time, periodically fine-tuned into weights.

| Target | Details | Status |
| --- | --- | --- |
| Concept exploration | No production system runs the full pipeline. Letta has an explicit roadmap (token-first + weight distillation). 5 architecture proposals surveyed | ✅ Done → hybrid-memory-weight.research.md |
| Federated learning | Google Gboard (30+ models, DP ε ≤ 1), Apple (keyboard), FwdLLM (LLaMA-7B on mobile). All small models, not LLMs | ✅ Done → hybrid-memory-weight.research.md |
| Cursor's approach | Session traces → LLM ranking oracle → train a custom embedding model. 12.5% QA accuracy improvement. Memory-to-weight for the retrieval layer | ✅ Done → hybrid-memory-weight.research.md |

Key question: Is "accumulate memories then fine-tune" a viable architecture? What would the update cycle look like?
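As a toy answer to the update-cycle question, hedged heavily since no surveyed system ships this: memories serve retrieval immediately (token-first), and once enough accumulate, a batch is distilled into the weights. `_distill` below is a placeholder for an actual LoRA fine-tuning job, and the threshold and class shape are invented:

```python
# Sketch of the "accumulate memories, then fine-tune" loop: new facts
# are usable immediately via retrieval, and a batch is periodically
# distilled into an adapter. The distill step is a placeholder for a
# real fine-tuning job; every detail here is illustrative.

class MemoryToWeightPipeline:
    def __init__(self, distill_threshold=3):
        self.memory = []            # token-first: usable right away
        self.distilled = []         # memories already baked into weights
        self.adapter_version = 0
        self.distill_threshold = distill_threshold

    def observe(self, fact: str):
        self.memory.append(fact)
        if len(self.memory) >= self.distill_threshold:
            self._distill()

    def _distill(self):
        self.adapter_version += 1   # stand-in for a LoRA training run
        self.distilled += self.memory
        self.memory = []            # the adapter now carries these facts

pipe = MemoryToWeightPipeline()
for fact in ["likes Go", "tz=UTC+9", "prefers terse replies", "uses vim"]:
    pipe.observe(fact)
```

Even this toy version surfaces the real design questions: what triggers a distillation (count, time, drift), what happens to a fact between observation and distillation, and how a stale adapter is retired.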


Prioritization

| Priority | Direction | Reason |
| --- | --- | --- |
| 1 | Direction 3 (Academic survey) | Establishes the theoretical foundation before looking at practice |
| 2 | Direction 2 (Multi-LoRA) | Most production-ready; actionable for real systems |
| 3 | Direction 1 (VTuber/Character) | Interesting case study, but limited public technical detail |
| 4 | Direction 4 (Hybrid pipeline) | Mostly speculative; explore last |

Output

Per-direction research documents (*.research.md), a cross-direction summary, and an update to the Pillar 3 section of findings.md.