Phase 1.3: Multi-turn / Longitudinal Dynamics Survey

Created: 2026-02-18 21:10 CST
Phase: 1 - Breadth Survey
Focus: Behavioral consistency over time, adaptation under ambiguity, resource constraints


Executive Summary

Multi-turn interactions are where personality either crystallizes or collapses. Over extended dialogues, agents face three fundamental challenges: consistency (behaving the same way in similar contexts), adaptation (adjusting to new information without losing identity), and resource management (operating under token/latency/cost constraints).

Key insight for personality emergence: Longitudinal evaluation reveals whether behavioral patterns are stable traits or temporary noise. Time-series analysis can distinguish personality from mood, and context pressure creates “personality under stress” variations.


1. Multi-turn Interaction Challenges

1.1 Beyond Single-Turn: The Real Test

Source: Li et al., 2025 (arXiv:2504.04717) — “Beyond Single-Turn” survey

Core challenges in multi-turn:

  • Context maintenance: Remembering what was said across turns
  • Coherence: Consistent persona/role over extended dialogues
  • Responsiveness: Adapting to user feedback without breaking character
  • Goal persistence: Maintaining objectives across interruptions

Why single-turn evaluation misses personality:

  • Single-turn measures capability (can agent do X?)
  • Multi-turn measures consistency (does agent do X reliably?)
  • Personality is behavioral consistency over time, not one-shot performance

Relevance to emergence: Personality is fundamentally a longitudinal phenomenon. You can’t measure it in a single interaction; it requires observing patterns across many turns.


1.2 NeurIPS 2025 Multi-Turn Workshop Findings

Source: NeurIPS 2025 Workshop on Multi-Turn Interactions

Workshop themes:

  • Long-horizon evaluation methods that assess consistency, stability, strategic ability
  • Performance degradation over extended interactions
  • Accumulating errors and unexpected behaviors
  • Measuring and predicting performance on complex multi-turn tasks

Key insight: Current benchmarks are insufficient for evaluating long-term agent behavior. Need new metrics that capture:

  • Consistency score: How similar are responses to similar queries across time?
  • Stability index: How much does behavior drift over extended runs?
  • Strategic coherence: Does agent maintain long-term goals?

Relevance to emergence: The workshop identifies measuring consistency and stability as an open research problem, which directly addresses our north-star question.


2. Agent Drift: The Core Problem

2.1 Three Types of Drift

Source: Rath, 2026 (arXiv:2601.04170) — “Agent Drift: Quantifying Behavioral Degradation”

Definition: Agent drift is the progressive degradation of agent behavior, decision quality, and inter-agent coherence over extended interaction sequences.

Three manifestations:

1. Semantic drift:

  • Progressive deviation from original intent
  • Agent’s understanding of task shifts over time
  • Example: “Summarize this” → “Analyze this” → “Criticize this”

2. Coordination drift:

  • Breakdown in multi-agent consensus mechanisms
  • Agents lose shared understanding of goals
  • Example: Agent A thinks task is X, Agent B thinks task is Y

3. Behavioral drift:

  • Emergence of unintended strategies
  • Agent develops habits that weren’t specified
  • Example: Agent becomes increasingly verbose over time

Key finding: Unchecked drift can lead to a 42% reduction in task success rates and affect nearly half of long-running agents.

Relevance to emergence: Drift is the enemy of stable personality. Understanding drift mechanisms is essential for designing personality that persists.
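
To make drift operational rather than anecdotal, here is a minimal sketch of one way to track semantic drift: compare each turn’s task framing against the original intent in embedding space. The `embed()` helper is a placeholder assumption; in practice you would swap in any sentence-embedding model.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in any real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)  # toy vector, for illustration only

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_drift(original_intent: str, turn_summaries: list[str]) -> list[float]:
    """Drift per turn = 1 - similarity between that turn's task framing
    and the original intent. A rising series suggests semantic drift."""
    anchor = embed(original_intent)
    return [1.0 - cosine(anchor, embed(s)) for s in turn_summaries]

drift = semantic_drift(
    "Summarize this report",
    ["Summarize this report", "Analyze this report", "Criticize this report"],
)
print(drift)  # with a real embedding model, expect an upward trend
```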


2.2 Agent Stability Index (ASI)

Source: Rath, 2026

Novel metric framework for quantifying drift across 12 dimensions:

  1. Response consistency: Similar inputs → similar outputs?
  2. Tool usage patterns: Stable tool selection over time?
  3. Reasoning pathway stability: Consistent reasoning approaches?
  4. Inter-agent agreement rates: Coordination stability in multi-agent?

… (8 more dimensions)

Key insight: Need composite metrics that capture multiple aspects of behavioral stability.

Relevance to emergence: ASI provides a measurement framework for tracking personality stability—exactly what we need to distinguish traits from noise.
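
As a concrete sketch of the composite idea (not the paper’s exact aggregation), an ASI-style score can be computed as a weighted mean of per-dimension stability scores. Dimension names, values, and weights below are illustrative.

```python
from statistics import fmean

# Hypothetical per-dimension stability scores in [0, 1]; the paper's ASI
# spans 12 dimensions, four of which are listed above.
dimension_scores = {
    "response_consistency": 0.82,
    "tool_usage_stability": 0.74,
    "reasoning_pathway_stability": 0.69,
    "inter_agent_agreement": 0.91,
    # ... remaining dimensions would be scored the same way
}

def agent_stability_index(scores: dict[str, float],
                          weights: dict[str, float] | None = None) -> float:
    """Composite stability index as a (weighted) mean of dimension scores.
    A sketch only; the paper's exact aggregation may differ."""
    if weights is None:
        return fmean(scores.values())
    total = sum(weights.get(k, 1.0) for k in scores)
    return sum(v * weights.get(k, 1.0) for k, v in scores.items()) / total

print(f"ASI = {agent_stability_index(dimension_scores):.2f}")  # -> ASI = 0.79
```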


2.3 Goal Drift vs. Style Drift

Source: Arike, 2025 (arXiv:2505.02709) — “Evaluating Goal Drift”

Goal drift: Deviation from original objective

  • Agent forgets or distorts its assigned goal
  • Correlates with pattern-matching behavior as context grows
  • The best-performing model tested (Claude 3.5 Sonnet) maintains goal adherence over 100K+ tokens
  • All models exhibit some drift

Style drift: Change in behavioral patterns while maintaining goal

  • Agent achieves same objective but with different approach
  • Example: Initially thorough → later concise
  • Not necessarily bad, but indicates personality change

Key finding: Goal drift increases with context length and pattern-matching pressure.

Relevance to emergence:

  • Goal drift is undesirable (agent breaks its contract)
  • Style drift may be desirable (personality evolution)
  • Need mechanisms to distinguish the two
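
One hedged sketch of such a mechanism: track task success (goal) and surface features such as response length (style) in early vs. late windows of a run, then classify which one moved. The thresholds and features are illustrative assumptions, not from the source paper.

```python
def mean(xs): return sum(xs) / len(xs)

def classify_drift(early, late, success_tol=0.1, style_tol=0.25):
    """early/late: lists of (task_success: 0/1, response_length: int).
    Goal drift: success rate falls; style drift: success holds while
    surface features (here, just length) shift. Thresholds are arbitrary."""
    d_success = mean([s for s, _ in late]) - mean([s for s, _ in early])
    early_len, late_len = mean([l for _, l in early]), mean([l for _, l in late])
    d_style = abs(late_len - early_len) / max(early_len, 1)
    if d_success < -success_tol:
        return "goal drift"
    if d_style > style_tol:
        return "style drift (goal intact)"
    return "stable"

early = [(1, 420), (1, 380), (0, 450)]
late = [(1, 150), (1, 130), (1, 160)]   # same success, much terser
print(classify_drift(early, late))       # -> style drift (goal intact)
```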

2.4 Identity Drift in Conversations

Source: Kim, 2024 (arXiv:2412.00804) — “Examining Identity Drift”

Experiment: Multi-turn conversations on personal themes across 9 LLMs.

Key findings:

1. Larger models experience greater identity drift

  • Counterintuitive: More capability ≠ more stability
  • Plausible reason: more parameters → more degrees of freedom for drift

2. Model family differences exist but < parameter size effects

  • Architecture matters, but size matters more

3. Assigning a persona may NOT help maintain identity

  • Persona prompts can backfire
  • Identity needs to be reinforced, not just assigned

Relevance to emergence: Personality assignment isn’t enough; it needs active maintenance mechanisms.


3. Resource Constraints as “Physics”

3.1 Context Window Pressure

Source: Hossain, 2025 (arXiv:2601.11564) — “Context Discipline and Performance”

Core finding: Performance degrades non-linearly as context fills, tied to Key-Value (KV) cache growth.

Implications:

  • Agents under context pressure behave differently
  • “Personality under stress” may differ from baseline
  • Resource constraints force behavioral trade-offs

Relevance to emergence: Context pressure creates situational personality variation—agent may be verbose when relaxed, terse when pressured.
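
The KV-cache pressure is easy to quantify with the standard size formula (keys plus values, for every layer and head, per token). The model configuration below is illustrative of a 7B-class model without grouped-query attention.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size per sequence: keys + values (factor of 2) for every
    layer and KV head. Standard formula; the config below is illustrative."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# A 7B-class config without grouped-query attention (illustrative):
for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:5.1f} GiB of KV cache")
```

At fp16 this works out to roughly 0.5 MB of cache per token, which is why long contexts force memory and latency trade-offs well before the window is nominally full.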


3.2 Context Length Alone Hurts Performance

Source: Du, 2025 (arXiv:2510.05381) — EMNLP 2025 Findings

Shocking result: Even with perfect retrieval, performance degrades 13.9%-85% as input length increases.

Why it matters:

  • Not just retrieval failure
  • Sheer length of input hurts performance
  • Occurs even when irrelevant tokens are whitespace
  • Occurs even when models are forced to attend only to relevant tokens

Mitigation: Transform the long-context task into a short-context one by prompting the model to recite the retrieved evidence before answering (4% improvement on GPT-4o).
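
A minimal sketch of that mitigation as a prompt template (the exact wording used in Du, 2025 may differ; this phrasing is an assumption):

```python
# Recite-then-answer template: forces the model to compress the long
# context into a short quoted passage before reasoning over it.
RECITE_TEMPLATE = """You are given a long document and a question.

Step 1: Quote verbatim the passages from the document that are relevant
to the question.
Step 2: Answer the question using only the quoted passages.

Document:
{document}

Question:
{question}
"""

def build_prompt(document: str, question: str) -> str:
    return RECITE_TEMPLATE.format(document=document, question=question)

print(build_prompt("<long context here>", "What was the Q3 revenue?"))
```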

Relevance to emergence: Context length creates cognitive load that changes behavior—agents simplify, abbreviate, or make errors under load.


3.3 Context Rot: Non-Uniform Degradation

Source: Chroma Research, “Context Rot”

Core finding: Performance degrades non-uniformly across tasks and models as context increases.

Observations:

  • Different models show different degradation patterns
  • Some tasks are more resilient to context rot
  • Distractor content matters (not just length)
  • Model-specific behavior patterns emerge

Relevance to emergence: Different agents (different base models or personalities) will show different degradation signatures; this is a measurable personality difference.


3.4 Token Budget and Latency Constraints

Source: Stevens Online; industry practice

Constraints create behavioral signatures:

Token budget:

  • Thrifty agents: Minimize tokens, terse responses
  • Thorough agents: Spend freely, verbose responses
  • Budget forces prioritization (what to include/exclude)

Latency constraints:

  • Fast agents: Quick responses, less reasoning depth
  • Careful agents: Slower, more thorough reasoning
  • Latency forces speed-accuracy trade-offs

Relevance to emergence: Resource constraints are the “physics” of personality—they shape behavior in consistent, measurable ways.


4. Measuring Behavioral Consistency

4.1 Consistency Metrics

Source: Evaluation surveys; ReliabilityBench

Consistency score (pass^k):

  • Measures how often the agent produces similar outputs for similar inputs
  • Calculated across multiple repeated trials of the same task
  • τ-bench finding: ~60% pass@1 falls to ~25% when success is required across all repeated trials (pass^k)

Cross-trial variance:

  • Run same task multiple times
  • Measure variance in outputs
  • High variance = low consistency (unstable personality)

Temporal stability:

  • Measure behavior at time T1, T2, T3…
  • Calculate correlation over time
  • Stable personality = high temporal correlation

Relevance to emergence: These metrics provide quantitative measurement of personality stability.
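
A minimal sketch of two of these metrics, assuming exact-match agreement between trials; real evaluations would compare embeddings or graded rubric scores instead.

```python
import numpy as np

def cross_trial_consistency(outputs: list[str]) -> float:
    """Fraction of trial pairs that agree exactly. Crude but illustrative."""
    n = len(outputs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(outputs[i] == outputs[j] for i, j in pairs) / len(pairs)

def temporal_stability(metric_t1: list[float], metric_t2: list[float]) -> float:
    """Pearson correlation of a per-task behavioral metric measured at two
    points in time; high r suggests a stable trait rather than noise."""
    return float(np.corrcoef(metric_t1, metric_t2)[0, 1])

print(cross_trial_consistency(["refund", "refund", "escalate", "refund"]))  # 0.5
print(temporal_stability([0.9, 0.4, 0.7, 0.2], [0.85, 0.5, 0.65, 0.3]))
```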


4.2 Reliability vs. Capability

Source: ReliabilityBench (arXiv:2601.06112)

Gap: Benchmark performance ≠ production reliability

  • Agent can pass tests but fail consistency
  • Reliability = capability × consistency

Evaluation dimensions:

  • Correctness: Does agent get right answer?
  • Consistency: Does agent get same answer repeatedly?
  • Stability: Does agent maintain behavior over time?
  • Security: Does agent resist manipulation?

Relevance to emergence: Reliability requires both capability and personality stability; you can’t have a reliable agent without consistent behavior.


4.3 Longitudinal Evaluation Methods

Source: Li et al., 2025; multi-turn workshop

Methods:

1. Multi-trial consistency testing

  • Run same task N times
  • Measure output variance
  • Track over time (days/weeks)

2. Conversation replay analysis

  • Replay conversation from T1 at T2
  • Measure behavioral drift
  • Compare agent’s current vs. past responses

3. Stress testing under resource constraints

  • Vary token budget
  • Vary latency constraints
  • Observe behavioral changes

4. Cross-session memory tests

  • Give agent information in session 1
  • Test recall/usage in session 2, 3, 4…
  • Measure memory decay and behavioral impact

Relevance to emergence: Longitudinal methods reveal personality persistence vs. temporary behavioral fluctuations.
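
A protocol skeleton for method 1 above, with a hypothetical `run_agent()` standing in for the real agent call:

```python
import statistics, time

def run_agent(task: str) -> str:
    """Stand-in for a real agent call (hypothetical interface)."""
    return f"response to {task}"

def multi_trial_consistency(task: str, n_trials: int = 8) -> float:
    """Run the same task N times and score agreement with the modal
    response. (The stub above is deterministic, so this returns 1.0;
    a real agent generally will not.)"""
    outputs = [run_agent(task) for _ in range(n_trials)]
    modal = statistics.mode(outputs)
    return outputs.count(modal) / n_trials

# Repeat the probe across days/weeks to build a stability time series:
history = []
for session in range(3):
    history.append((time.time(), multi_trial_consistency("refund request #42")))
print(history)
```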


5. Adaptation Under Ambiguity

5.1 Handling Uncertainty

Source: Multi-agent failure guides; practical deployment

Challenge: Agents must operate when goals, context, or constraints are unclear.

Behavioral patterns:

  • Conservative agents: Ask for clarification, avoid action
  • Aggressive agents: Make assumptions, act decisively
  • Exploratory agents: Test multiple approaches, iterate

Ambiguity tolerance is a measurable personality trait.

Relevance to emergence: How agents handle ambiguity reveals risk tolerance, decision style, and confidence—all personality dimensions.


5.2 Adapting to Feedback

Source: Multi-turn surveys; self-reflection research

Feedback loop dynamics:

  1. Agent acts
  2. User/environment provides feedback
  3. Agent adjusts behavior
  4. Repeat

Personality dimensions:

  • Responsiveness: How quickly does agent adapt?
  • Plasticity: How much does behavior change?
  • Resistance: When does agent refuse to change?

Key finding: Over-adaptive agents lose personality; under-adaptive agents fail to learn.

Relevance to emergence: Adaptation rate is a personality dial; it can be tuned, measured, and compared across agents.
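
A hedged sketch of how adaptation rate could be scored: count the turns a behavioral metric takes to converge on the user’s requested target after feedback. The metric, target, and tolerance below are illustrative assumptions.

```python
def adaptation_rate(metric_by_turn: list[float], target: float,
                    tol: float = 0.1) -> int | None:
    """Turns needed after feedback for a behavioral metric (e.g. a
    normalized verbosity score) to come within tol of the requested
    target. Returns None if the agent never converges (resistance)."""
    for turn, value in enumerate(metric_by_turn, start=1):
        if abs(value - target) <= tol:
            return turn
    return None

# User asks for terser answers at turn 0; normalized verbosity per turn:
print(adaptation_rate([0.9, 0.6, 0.35, 0.3], target=0.3))  # -> 3
```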


6. Implications for Personality Emergence

6.1 Mechanisms Revealed

From multi-turn research:

1. Consistency mechanisms:

  • Stable identity prompts (but not enough alone)
  • Memory reinforcement (episodic recall of past behaviors)
  • Drift detection (monitor behavioral metrics)
  • Anchoring (periodic re-alignment to base personality)

2. Adaptation mechanisms:

  • Feedback integration (controlled plasticity)
  • Context-aware behavior change (situational personality)
  • Resource-aware behavior (graceful degradation under pressure)

3. Measurement mechanisms:

  • Agent Stability Index (12-dimension composite)
  • Consistency scores (cross-trial variance)
  • Temporal correlation (behavior over time)
  • Drift detection (deviation from baseline)

6.2 What Can Be Measured

Quantifiable personality dimensions:

  1. Stability index: How much does behavior drift? (0-1 scale)
  2. Consistency score: How similar are responses to similar inputs? (0-1 scale)
  3. Adaptation rate: How quickly does behavior change after feedback? (time metric)
  4. Ambiguity tolerance: How much uncertainty before agent asks clarification? (% threshold)
  5. Resource sensitivity: How much does behavior change under constraints? (delta metric)
  6. Temporal correlation: How correlated is behavior across time? (Pearson r)
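
These six dimensions fit naturally into a single record per agent per measurement window; the field names below are hypothetical, not an established schema.

```python
from dataclasses import dataclass

@dataclass
class PersonalityProfile:
    """One row of a longitudinal personality log (field names hypothetical)."""
    stability_index: float       # 0-1; 1 = no drift
    consistency_score: float     # 0-1; similar inputs -> similar outputs
    adaptation_rate: float       # turns (or seconds) to converge after feedback
    ambiguity_tolerance: float   # % uncertainty tolerated before asking
    resource_sensitivity: float  # delta in behavior under constraints
    temporal_correlation: float  # Pearson r of behavior across time

baseline = PersonalityProfile(0.88, 0.81, 2.5, 0.4, 0.15, 0.9)
```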

6.3 What Remains Unknown

Open questions:

  1. Trait vs. state: How to distinguish stable personality from temporary mood?
  2. Measurement frequency: How many observations needed to establish a trait?
  3. Baseline drift: Is some drift healthy (learning) vs. unhealthy (corruption)?
  4. Cross-domain consistency: Do agents maintain personality across task types?
  5. Recovery mechanisms: Can agents “reset” personality after drift?

7. Implications for Fleet Architecture

7.1 For SOUL.md Design

Requirements:

  • Personality anchoring: Periodic reinforcement of identity
  • Drift monitoring: Track behavioral metrics over time
  • Adaptation bounds: Define acceptable plasticity range
  • Resource awareness: Personality should adapt gracefully to constraints

Recommendations:

  1. Include stability constraints in SOUL.md
  2. Define drift thresholds (when to alert/reset)
  3. Specify adaptation policy (how much to change, how fast)
  4. Document resource-aware behavior (thrifty vs. thorough mode)
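
A sketch of how those four recommendations could be encoded as a machine-readable policy alongside SOUL.md; every key and threshold here is an illustrative assumption, not a defined schema.

```python
# Hypothetical stability policy that tooling could parse out of (or next
# to) SOUL.md. All names, values, and thresholds are illustrative.
soul_stability_policy = {
    "identity_anchor": {
        "reinforce_every_n_turns": 20,       # periodic re-grounding
    },
    "drift_thresholds": {
        "alert_below_asi": 0.7,              # warn operators
        "reset_below_asi": 0.5,              # re-align to baseline
    },
    "adaptation_policy": {
        "max_style_delta_per_session": 0.2,  # bounded plasticity
        "protected_traits": ["values", "risk_tolerance"],
    },
    "resource_modes": ["thorough", "thrifty"],
}
```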

7.2 For Measurement System

Requirements:

  • Longitudinal tracking: Store behavioral metrics over time
  • Drift detection: Alert when metrics deviate beyond threshold
  • Consistency scoring: Calculate per-task and cross-task consistency
  • Cross-agent comparison: Compare stability across fleet

Recommendations:

  1. Implement Agent Stability Index for each agent
  2. Track consistency scores per domain
  3. Monitor drift rate (change per unit time)
  4. Visualize personality trajectories over time
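
A minimal sketch of recommendation 3, monitoring drift against a rolling baseline; the window size and threshold are arbitrary illustrative values.

```python
from collections import deque

class DriftMonitor:
    """Rolling-baseline drift detector (sketch). Alerts when the latest
    stability score falls more than `threshold` below the recent mean."""
    def __init__(self, window: int = 20, threshold: float = 0.15):
        self.history: deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, score: float) -> bool:
        """Returns True if this observation should raise a drift alert."""
        alert = (len(self.history) >= 5 and
                 (sum(self.history) / len(self.history)) - score > self.threshold)
        self.history.append(score)
        return alert

monitor = DriftMonitor()
for s in [0.9, 0.88, 0.91, 0.89, 0.9, 0.87, 0.62]:   # sudden drop at the end
    if monitor.observe(s):
        print(f"drift alert at score {s}")
```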

7.3 For Deployment

Requirements:

  • Context budget management: Don’t fill context windows carelessly
  • Latency-aware behavior: Adapt reasoning depth to time constraints
  • Graceful degradation: Maintain personality under pressure
  • Periodic re-alignment: Reset to baseline personality when drift detected

Recommendations:

  1. Set context limits that preserve personality quality
  2. Implement latency budgets with personality-aware fallbacks
  3. Design stress modes (thrifty vs. thorough)
  4. Schedule personality audits (weekly/monthly drift checks)
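
A sketch of recommendation 3’s stress modes as an explicit policy, so that degradation under pressure is designed rather than accidental; the thresholds are illustrative assumptions.

```python
def choose_mode(tokens_left: int, latency_budget_s: float) -> str:
    """Pick a response mode from remaining resources (illustrative policy).
    The agent stays in character while getting terser under pressure."""
    if tokens_left < 1_000 or latency_budget_s < 2.0:
        return "thrifty"     # terse answers, shallow reasoning, same persona
    if tokens_left < 10_000 or latency_budget_s < 10.0:
        return "balanced"
    return "thorough"        # full reasoning depth, richer style

print(choose_mode(tokens_left=500, latency_budget_s=30.0))    # thrifty
print(choose_mode(tokens_left=50_000, latency_budget_s=60.0)) # thorough
```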

8. References

Core Papers

  1. Multi-Turn Survey: Li et al., 2025. “Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models.” arXiv:2504.04717
  2. Agent Drift: Rath, 2026. “Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions.” arXiv:2601.04170
  3. Goal Drift: Arike, 2025. “Technical Report: Evaluating Goal Drift in Language Model Agents.” arXiv:2505.02709
  4. Identity Drift: Kim, 2024. “Examining Identity Drift in Conversations of LLM Agents.” arXiv:2412.00804
  5. Context Discipline: Hossain, 2025. “Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths.” arXiv:2601.11564
  6. Context Length Hurts: Du, 2025. “Context Length Alone Hurts LLM Performance Despite Perfect Retrieval.” arXiv:2510.05381 (EMNLP 2025 Findings)
  7. Context Rot: Chroma Research. “Context Rot: How Increasing Input Tokens Impacts LLM Performance.” research.trychroma.com
  8. Agent Evaluation Survey: Mohammadi, 2025. “Evaluation and Benchmarking of LLM Agents: A Survey.” arXiv:2507.21504

Workshops and Benchmarks

  • NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models
  • ReliabilityBench: arXiv:2601.06112

Practical Resources

  • Stevens Online: “Hidden Economics of AI Agents: Token Costs and Latency Trade-offs”
  • Confident AI: “LLM Evaluation Metrics Guide”
  • Augment Code: “Why Multi-Agent LLM Systems Fail”

Next Steps

Phase 1.4: Multi-agent Emergence

  • Specialization, coordination, norms
  • Peer influence on personality
  • Interaction topology effects

Phase 1.3 complete. Moving to Phase 1.4.