LLM Personality Engineering: Methods, Boundaries, and Tradeoffs

Last Updated: 2026-03-24

Overview

How do you make an LLM embody a specific persona? This document surveys three paradigms — prompt engineering, fine-tuning, and activation engineering — and maps the boundary between what each can achieve.

The key finding: personality is not a single problem, but a spectrum of increasingly deep interventions, each with distinct tradeoffs in cost, controllability, and persistence.

Paradigm Map

                    Cost / Effort
        Fine-tuning      │  ● BIG5-CHAT (SFT/DPO)
        (weight change)  │  ● OpenCharacter
                         │  ● FinePE (MoE-LoRA)
                         │  ● Neuro-sama (iterative SFT)
  Activation Engineering │  ● PERSONA (vector algebra)
  (inference-time,       │  ● SAS Personality Sliders
   no weight change)     │  ● Anthropic Persona Vectors
        Prompting        │  ● SillyTavern character cards
        (no change)      │  ● System prompt + few-shot
                         │  ● Eliza character files
                         └──────────────────────────────► Personality Depth / Robustness

Paradigm 1: Prompt Engineering

The most accessible approach. Define personality through system prompts, character descriptions, and example dialogues.

Techniques (from SillyTavern Community)

The SillyTavern roleplay community has developed the most sophisticated prompt-based personality engineering. Key formats:

| Format | Description |
|---|---|
| PLists | Comma-separated trait lists: [brave, sarcastic, loyal] |
| Ali:Chat | Example dialogues with {{char}}: and {{user}}: prefixes, separated by <START> tags |
| W++ | Structured attribute blocks (deprecated, but historically influential) |
| Plain prose | Natural language character description + backstory |
Best practices (empirical, from community testing):

  • The first message is the single most important element: "the model is more likely to pick up style and length constraints from the first message than anything else"
  • An MBTI profile added to the character card "significantly improves personality consistency"
  • Backstory plus example conversations are more effective than trait lists alone
  • Token budget matters: "a 1000-token character definition cuts the AI's memory in half" on small-context models
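A minimal sketch of how these formats combine in practice, with a PList feeding into an Ali:Chat example dialogue (the character, traits, and lines are invented for illustration):

```text
{{char}}'s persona: [Mira: brave, sarcastic, loyal; speech: terse, dry humor]

<START>
{{user}}: The bridge is out. What now?
{{char}}: *shrugs* We swim. Unless you'd rather stand here admiring the river.
```

Per the best practices above, that first example message does double duty: it demonstrates both the voice and the target response length.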

What Prompting Can Do

  • Set broad personality traits (friendly, sarcastic, formal)
  • Establish speech patterns and vocabulary
  • Define character knowledge and backstory
  • Control response length and format

What Prompting Cannot Do

  1. Override alignment training. kimjammer's finding: aligned base models "stubbornly avoid swearing" even when the system prompt explicitly demands it. The safety layer built into weights overrides prompt-level instructions.

  2. Maintain consistency over long contexts. The SAS personality slider paper notes: "prompt engineering remains fragile: models frequently exhibit contextual drift within long context windows." As conversation grows, early personality instructions get diluted.

  3. Achieve statistical validity. The paper "When 'A Helpful Assistant' Is Not Really Helpful" (arxiv 2311.10054) tested 162 personas across 2,410 questions on 9 models and found: "most of the personas have no or negative impact on LLM's performance." Persona effects are "largely unpredictable" — even optimized selection barely outperforms random choice.

  4. Produce human-like trait distributions. BIG5-CHAT showed that prompting produces personality scores that don't match how humans actually express traits. SFT/DPO-trained models produce "intra-trait correlations more closely matching human data."

Eliza Framework (a16z)

Worth noting as a production-grade prompt-based system. Eliza uses JSON character files that define personality, knowledge, and behavior. Architecture separates Runtime, Character, Client, Adapter, and Plugin layers. Character files are purely prompt-based — no fine-tuning, no activation engineering. The sophistication is in the agent framework, not the personality method.
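A minimal sketch of what such a character file can look like (field names approximated from public elizaOS examples and not guaranteed current; the persona itself is invented):

```json
{
  "name": "Captain Reyes",
  "bio": ["Retired cargo pilot turned dockside fixer", "Speaks in short, clipped sentences"],
  "adjectives": ["gruff", "loyal", "pragmatic"],
  "messageExamples": [
    [
      { "user": "{{user1}}", "content": { "text": "Can you get me off-planet?" } },
      { "user": "Captain Reyes", "content": { "text": "Depends. How much trouble are you in?" } }
    ]
  ],
  "style": { "all": ["terse", "no small talk"] }
}
```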

Paradigm 2: Fine-Tuning

Modify model weights to embed personality. Three sub-approaches:

2a. Supervised Fine-Tuning (SFT)

Train on personality-annotated dialogue data.

BIG5-CHAT (ACL 2025):

  • 100K dialogues grounded in the Big Five personality model
  • Dataset constructed from SODA (social scenarios) + PsychGenerator (846K Facebook posts with Big Five annotations)
  • SFT and DPO both outperform prompting on BFI and IPIP-NEO personality assessments
  • Surprising finding: personality affects reasoning. Models with higher conscientiousness + agreeableness and lower extraversion + neuroticism perform better on reasoning tasks, matching psychological research on humans
  • Uses LoRA for efficient training

OpenCharacter (arxiv 2501.15427):

  • 20K synthetic characters + 306K role-playing instruction-response pairs
  • Two strategies: response rewriting (rewrite an existing response in character) and response generation (generate a new response in character)
  • A fine-tuned LLaMA-3 8B achieves GPT-4o-level role-playing
  • Key insight: character info goes into the system prompt during fine-tuning, so the model learns to generalize from character descriptions

2b. Mixture of LoRA Experts (Per-Trait Modules)

FinePE (ScienceDirect 2026):

  • Assigns a separate LoRA module to each Big Five subtrait (60 subtraits total)
  • A gating mechanism learns optimal combination weights
  • 120K Q&A pairs across subtrait subsets
  • Achieves average induced scores of 4.91 (high) / 1.61 (low), outperforming the second-best method by 29%
  • Addresses the "granularity gap": standard SFT averages across traits, while FinePE achieves fine-grained disentangled control
  • Key advantage over prompting: prompting suffers from "conflicting trait expression" when combining multiple traits
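The per-trait-LoRA-with-gating idea can be sketched in a few lines of numpy. This is an illustrative shape-level sketch, not FinePE's actual architecture: the softmax gate, dimensions, and random initialization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 16, 4, 3               # hidden size, LoRA rank, number of trait experts

W = rng.normal(size=(d, d))              # frozen base weight
A = rng.normal(size=(n_experts, r, d))   # per-trait LoRA down-projections
B = np.zeros((n_experts, d, r))          # per-trait LoRA up-projections (zero-init)
B[0] = rng.normal(size=(d, r))           # pretend expert 0 has been trained

x = rng.normal(size=d)

# Gating: one learned score per trait expert (logits invented here)
logits = np.array([2.0, -1.0, 0.5])
g = np.exp(logits) / np.exp(logits).sum()    # softmax over experts

# Output = frozen base path + gated sum of LoRA expert deltas
delta = sum(g[i] * B[i] @ (A[i] @ x) for i in range(n_experts))
y = W @ x + delta
```

The point of the gate is exactly the "granularity gap" above: each subtrait keeps its own low-rank delta, and composition happens in the gate weights rather than by averaging one set of weights across all traits.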

Fusian (arxiv 2603.15405):

  • Multi-LoRA fusion for fine-grained continuous MBTI personality control
  • A similar per-trait LoRA approach, applied to the MBTI framework instead of the Big Five

2c. Iterative Batch SFT (Neuro-sama)

See neuro-sama.research.md for full analysis.

  • 2B parameter model, q2_k quantization
  • Training data from curated stream transcripts
  • Human-in-the-loop curation (Vedal manually selects data)
  • Deploy → collect → curate → retrain → deploy cycle
  • The only known production system doing iterative personality fine-tuning

Fine-Tuning Tradeoffs

| Advantage | Limitation |
|---|---|
| Overrides the alignment layer (can produce behaviors prompts can't) | Requires training data and GPU compute |
| Statistically valid personality profiles | Catastrophic forgetting risk on retraining |
| Robust over long contexts | Per-personality model or adapter needed |
| Human-like trait distributions (BIG5-CHAT) | Less flexible than prompting (can't switch on the fly) |

Paradigm 3: Activation Engineering

The newest approach. Manipulate internal activations at inference time — no prompt changes, no weight changes. This is the most surprising finding in this survey.

3a. PERSONA Framework (arxiv 2602.15669)

Method: Contrastive activation analysis. Generate responses under trait-expressing and trait-suppressing prompts. The persona vector = difference between mean activations.

Key discovery: Personality traits appear as extractable, approximately orthogonal directions in the model's representation space that support algebraic operations.

Operations:

  • Scalar multiplication: α × vector controls intensity. Pearson correlation > 0.9 between α and personality score
  • Vector addition: combine vectors for multi-trait personas (e.g., outgoing + compassionate)
  • Vector subtraction: suppress some traits while amplifying others
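In code, the contrastive extraction and the vector algebra reduce to a few numpy operations. A minimal sketch, with activations faked as random draws; in practice they would be hidden states collected from a chosen transformer layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 100                           # hidden size, number of contrastive prompts

# Mean activations under trait-expressing vs. trait-suppressing prompts
h_pos = rng.normal(loc=0.5, size=(n, d)).mean(axis=0)
h_neg = rng.normal(loc=-0.5, size=(n, d)).mean(axis=0)

# Persona vector = difference of mean activations
v_outgoing = h_pos - h_neg
v_compassionate = rng.normal(size=d)     # a second trait vector, extracted the same way

def steer(h, vectors, alphas):
    """Add scaled persona vectors to a hidden state h."""
    for v, a in zip(vectors, alphas):
        h = h + a * v
    return h

h = rng.normal(size=d)
# Amplify "outgoing", mildly suppress "compassionate"
h_steered = steer(h, [v_outgoing, v_compassionate], [1.5, -0.5])
```

The approximate orthogonality claim is what licenses the `steer` loop: if the trait directions were strongly correlated, adding one would also drag the others.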

Performance:

| Method | PersonalityBench Score |
|---|---|
| PERSONA (training-free) | 9.60 |
| Supervised Fine-Tuning | 9.61 |
| Neuron-based (NPTI) | 9.43 |
| Simple Prompting | 8.39 |

PERSONA matches fine-tuning performance without any gradient updates. This is the headline result.

Limitations:

  • Safety-aligned traits resist activation (e.g., "self-interested" shows strong resistance)
  • Lower improvements on factual-accuracy maintenance (43.8-61.4% win rates)
  • Requires white-box model access (can't be used on API-only models)

3b. SAS Personality Sliders (arxiv 2603.03326)

Core operation: h' = h + α·v (inject scaled steering vector into residual stream)

Key innovation: Naive multi-vector steering causes "representation collapse." Sequential Adaptive Steering (SAS) solves this by training subsequent probes on residual streams shifted by prior interventions, effectively orthogonalizing vectors.

Result: Users adjust five sliders (Big Five traits) to create complex personas instantly. Zero-parameter, negligible inference overhead.
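A toy sketch of the sequential idea, with Gram-Schmidt-style projection standing in for SAS's probe-retraining step (a deliberate simplification, not the paper's actual procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Raw per-trait steering vectors, e.g. one per Big Five trait
raw = [rng.normal(size=d) for _ in range(5)]

# Naive multi-vector steering adds correlated vectors, which interfere.
# The sequential version fits each new vector on a residual stream already
# shifted by earlier interventions; here we approximate that by projecting
# out the components along previously applied directions.
applied = []
for v in raw:
    for u in applied:
        v = v - (v @ u) * u              # remove overlap with earlier directions
    applied.append(v / np.linalg.norm(v))

def steer(h, alphas):
    """h' = h + sum_i alpha_i * v_i over the orthogonalized sliders."""
    for a, u in zip(alphas, applied):
        h = h + a * u
    return h

h = steer(rng.normal(size=d), alphas=[1.0, -0.5, 0.0, 2.0, 0.3])
```

Each slot in `alphas` behaves like one of the five user-facing sliders: because the applied directions no longer overlap, moving one slider does not silently move the others.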

Comparison:

| Approach | Parameter Cost | Multi-trait Support |
|---|---|---|
| Fine-tuning/DPO | Full retraining | Manual composition |
| LoRA merging | Weight updates | Limited |
| Naive steering | None | Fails (interference) |
| SAS | None | Succeeds |

Validated on Llama-3-8B, Mistral-7B, Qwen-7B.

3c. Anthropic Persona Vectors (Anthropic Research)

Scope: Evil, sycophancy, hallucination, politeness, humor, optimism

Applications:

  1. Monitoring: detect personality shifts during deployment or training before they manifest in outputs
  2. Steering (inference-time): subtract vectors to suppress traits, but this degrades general capabilities
  3. Vaccination (training-time): add vectors during training to prevent trait acquisition, with "little-to-no degradation" in capabilities
  4. Data flagging: identify training samples likely to induce unwanted traits
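The monitoring use case is essentially a projection: score each activation by its component along a persona vector and alert when it exceeds a baseline. A minimal sketch; the vectors, shift, and threshold are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

v_sycophancy = rng.normal(size=d)
v_sycophancy /= np.linalg.norm(v_sycophancy)

def persona_score(h, v):
    """Component of a hidden state along a unit-norm persona vector."""
    return float(h @ v)

# A normal activation vs. one artificially shifted toward the trait
h_normal = rng.normal(size=d)
h_drifted = h_normal + 3.0 * v_sycophancy

threshold = persona_score(h_normal, v_sycophancy) + 1.0
alert = persona_score(h_drifted, v_sycophancy) > threshold   # flag for review
```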

Tested on: Qwen 2.5-7B-Instruct, Llama-3.1-8B-Instruct

Framing: Anthropic positions this as a safety/alignment tool, not a personality customization tool. The emphasis is on monitoring and preventing unwanted traits, not on enabling arbitrary persona creation.

3d. PRISM: Routing Between Persona and Base (arxiv 2603.18507)

A hybrid approach. Core finding: expert personas improve alignment tasks but damage knowledge tasks (MMLU drops from 71.6% to 68.0%).

PRISM learns a binary gate that routes queries to a specialized LoRA adapter only when persona activation helps, routing others to the base model. Result: +1.7 points overall improvement with no MMLU degradation.
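The routing itself is just a learned binary gate. A hypothetical sketch using a logistic gate over query features; the feature vector, gate weights, and 0.5 threshold are invented, and the two "models" are stand-in functions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

w_gate = rng.normal(size=d)      # learned gate weights (random here)

def base_model(x):
    return "base answer"         # stand-in for the unmodified model

def persona_adapter(x):
    return "persona answer"      # stand-in for the LoRA-adapted model

def route(x):
    """Send the query to the persona adapter only when the gate fires."""
    p = 1.0 / (1.0 + np.exp(-(w_gate @ x)))   # logistic gate
    return persona_adapter(x) if p > 0.5 else base_model(x)

answer = route(rng.normal(size=d))
```

The design point is that the gate is binary per query, so knowledge-heavy queries (the MMLU-style cases above) can bypass the persona path entirely instead of paying its accuracy cost.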

Cross-Paradigm Comparison

| Dimension | Prompting | Fine-Tuning | Activation Engineering |
|---|---|---|---|
| Cost | Zero | GPU hours + data | Compute for vector extraction |
| Flexibility | Instant persona switch | Need new adapter/model | Instant (adjust α) |
| Depth | Surface-level | Deep (in weights) | Deep (in activations) |
| Robustness | Drifts over long context | Stable | Stable within operating range |
| Override alignment | No | Yes | Partially (safety-aligned traits resist) |
| Multi-trait composition | Fragile | Complex (2^N models) | Algebraic (SAS) |
| Model access needed | API OK | Weights needed | Weights + activations needed |
| Production readiness | Mature | Mature | Research stage |
| Human-like trait distribution | No | Yes (BIG5-CHAT) | Untested |

The Boundary: When to Use What

Prompting is sufficient when:

  • Broad personality traits (helpful, formal, friendly) are enough
  • The target behavior doesn't conflict with alignment training
  • Context windows are short
  • You need instant switching between personas
  • You're using API-only models

Fine-tuning is needed when:

  • You need behaviors that conflict with alignment (swearing, aggression, non-standard safety)
  • Long-context consistency matters
  • Statistical validity of personality is important (psychometric tests)
  • You're building a production character system (Neuro-sama, Character.AI)

Activation engineering is the future for:

  • Real-time personality customization (user-facing sliders)
  • Multi-trait composition without exponential model count
  • Safety monitoring (detecting personality drift)
  • Research into personality mechanisms (how models represent traits)

Connection to Continuous Learning Research

This survey connects to the broader research project in several ways:

  1. Pillar 3 decomposition: "Learning" isn't just about factual knowledge. Personality is a distinct type of learned behavior. The three paradigms map to three levels of learning depth.

  2. Activation engineering as the missing link: Between external memory (Pillar 1) and weight updates (Pillar 3), activation engineering offers a middle ground — modify behavior without modifying weights. This could enable "personality memory" that's deeper than prompts but cheaper than fine-tuning.

  3. Anthropic's vaccination concept: The idea that you can make a model resistant to acquiring certain personality traits during training has implications for continual learning — you could protect against catastrophic forgetting of personality by "vaccinating" before retraining on new data.

  4. BIG5-CHAT's reasoning finding: Personality isn't just aesthetic — it affects model capability. Higher conscientiousness improves reasoning. This means personality engineering is also capability engineering.

Open Questions

  1. Does activation engineering scale to larger models? Current results are on 7B-8B models. Do personality vectors remain orthogonal and manipulable at 70B+?

  2. Can activation engineering be combined with memory systems? Imagine: Mem0-style facts + persona vectors for personality + prompt for situation. A three-layer personality stack.

  3. What's the long-context durability of activation steering? Prompts drift. Do steered activations also drift over very long conversations?

  4. Can persona vectors be learned from deployment data? Instead of Vedal manually curating → fine-tuning, could you extract persona vectors from interaction history automatically?

  5. Is Anthropic using persona vectors in Claude? They frame it as research, but the monitoring and vaccination capabilities suggest production applicability.

References

Prompting

Fine-Tuning

Activation Engineering

Hybrid / Routing

Psychometrics