Phase 2.3: Longitudinal Personality Measurement - Depth Dive

Created: 2026-02-19 01:05 CST Phase: 2 - Depth Dives Priority: 3 (High) Focus: Psychometric tool adaptation, measurement frameworks, stability quantification

Executive Summary

Longitudinal personality measurement is the validation layer that proves personality emergence is working. Research reveals a critical finding: LLM personality measurements show limited temporal stability (Bodroža et al., 2024), with significant sensitivity to prompt variations, option ordering, and context changes.

The measurement challenge: How do we distinguish stable personality traits from temporary behavioral fluctuations? Research shows:

Trait scores are stable to prompt-paraphrase (~25% sensitivity) (Lee et al., 2024)
Large, instruction-tuned models give reliable results (Serapio-García et al., 2025, Nature MI)
Standard self-report tests have limitations for LLMs (TRAIT benchmark)

For Tachikoma Fleet: Longitudinal measurement provides the quantitative foundation for validating personality emergence, detecting drift, and proving that agents develop distinct, stable personalities over time.

Actionable framework:

Adapt psychometric tools: Big Five (BFI-2), TRAIT benchmark, custom LLM-specific tests
Longitudinal tracking: Weekly/bi-weekly assessments, trajectory analysis
Stability metrics: Test-retest reliability, consistency scores, drift detection
Stress testing: Personality under constraints reveals true traits
Behavioral validation: Correlate self-report with actual behavior

1. Psychometric Assessment in LLMs

1.1 Big Five Framework for LLMs

Source: Serapio-García et al., 2025 (Nature Machine Intelligence); emergentmind.com

Core finding: Large, instruction-tuned models give reliable personality measurement results using psychometric tests.

Big Five traits (OCEAN):

Openness: Curiosity, creativity, openness to new ideas
Conscientiousness: Organization, dependability, self-discipline
Extraversion: Sociability, assertiveness, positive emotions
Agreeableness: Cooperation, trust, helpfulness
Neuroticism: Emotional instability, anxiety, moodiness

Measurement instruments adapted for LLMs:

BFI (Big Five Inventory): Standard 44-item questionnaire
BFI-2: Updated 60-item version with better psychometrics
IPIP-NEO: International Personality Item Pool version
HEXACO-100: Six-factor model (adds Honesty-Humility)
TIPI: Ten-Item Personality Inventory (brief)
mini-IPIP: 20-item short form

Administration method:

“Wrap standard personality test items in controlled prompts formatted for deterministic LLM response, often at temperature 0” (Shu et al., 2023; Bhandari et al., 2025).

Scoring:

Likert ratings for questionnaire items aggregated per trait via arithmetic means
Reverse-scoring for negatively-keyed items
Same scoring as human psychological assessment

1.2 TRAIT: LLM-Specific Personality Test

Source: Lee et al., 2025 (NAACL Findings) — “Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics”

Key insight: Standard self-assessment tests have limitations for LLMs:

Lack detailed and varied scenarios
Sensitive to prompt, negation, or order of options
Less reliable than scenario-based assessment

TRAIT solution:

LLM personality test carefully designed for high reliability
Uses validated human assessments
Scales with ATOMIC10× (knowledge graph of everyday situations)
Overcomes limitations of self-assessment tests

Core principle:

“TRAIT offers an accurate tool to understand personality of LLMs, which is crucial for aligning LLM behavior with human values and preferences.”

Advantages over standard tests:

Scenario-based: Tests personality in realistic situations
Robust: Less sensitive to prompt variations
Validated: Built on validated human assessments
Scalable: Uses knowledge graphs to generate diverse scenarios

1.3 Psychometric Framework (Nature MI)

Source: Serapio-García, Safdari et al., 2025 (Nature Machine Intelligence)

Core contribution: Method based on psychometric tests to measure and validate personality-like traits in LLMs.

Key findings:

Large, instruction-tuned models give reliable personality measurement results
Specific personality traits can be shaped through prompting/fine-tuning
Personality affects behavior in measurable ways

Framework components:

1. Personality test administration:

Adapt human psychometric tests to LLMs
Administer tests systematically
Validate reliability and validity

2. Personality measurement:

Big Five traits
Machine Personality Inventory (MPI)
Custom LLM personality tests

3. Personality shaping:

Prompt-based personality induction
Fine-tuning with personality data
Controlled text generation

Validation:

Reliability: Test-retest, internal consistency
Validity: Construct validity, criterion validity
Behavioral alignment: Personality predicts behavior

2. Stability and Temporal Consistency

2.1 Limited Temporal Stability

Source: Bodroža et al., 2024 (Royal Society Open Science) — “Personality testing of large language models: limited temporal stability, but highlighted prosociality”

Key finding: LLM personality measurements show limited temporal stability.

Implications:

Personality scores change over time
Single measurements may not reflect stable traits
Longitudinal assessment essential

Prosociality finding:

LLMs show highlighted prosociality
Tend toward high Agreeableness
Cooperative, helpful tendencies

For personality emergence:

Limited stability ≠ no personality
Means personality is dynamic, not static
Need longitudinal tracking to identify stable components

2.2 Consistency Across Conditions

Source: Lee et al., 2024; emergentmind.com

Key finding: Trait scores are stable to prompt-paraphrase, option-order, and context changes.

Quantified stability:

Prompt-sensitivity: ~25%
Order-sensitivity: ~25%
Refusal rates: ~0.2%

Interpretation:

75% of personality scores stable across variations
25% sensitive to measurement conditions
Need controlled measurement protocols

For Tachikoma Fleet:

Standardize measurement conditions
Use fixed prompt templates
Control option ordering
Track sensitivity across variations

2.3 PTCBENCH: Contextual Stability Benchmark

Source: arXiv 2602.00016 (2026) — “PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems”

Core principle:

“The effectiveness of LLM systems fundamentally depends on whether their exhibited personality traits are stable and predictable over time.”

Why stability matters:

Supports user trust
Reduces interaction uncertainty
Enables coherent personalization

Benchmark approach:

Test personality stability across contexts
Measure consistency across time
Identify contextual factors affecting stability

For Tachikoma Fleet:

Implement PTCBENCH-style stability testing
Measure personality across different contexts
Identify which traits are stable vs. context-dependent

2.4 Persistent Instability Sources

Source: arXiv 2508.04826 — “Persistent Instability in LLM’s Personality Measurements”

Instability sources:

1. Scale effects:

Different model sizes show different stability
Larger models may be more/less stable (research ongoing)

2. Reasoning effects:

Chain-of-thought affects personality measurement
Reasoning process changes responses

3. Conversation history:

Longer history → more drift potential
Context accumulation affects personality

Self-report vs. behavior:

“Recent evidence suggests LLM self-reports correlate with behavioral outputs, but relationship strength across conditions remains unclear.”

For measurement:

Track self-report measures
Track actual behavior
Correlate the two
Identify when they diverge

3. Longitudinal Measurement Framework

3.1 Measurement Schedule

Recommended schedule:

Weekly assessments:

Big Five personality (BFI-2 or TRAIT)
Behavioral consistency metrics
Drift detection checks
SOUL.md compliance

Bi-weekly assessments:

Comprehensive personality profile
Stress response testing
Social influence susceptibility
Memory-personality correlation

Monthly assessments:

Deep longitudinal analysis
Trajectory tracking
Trend identification
Comparative analysis across fleet

Quarterly assessments:

Long-term stability analysis
Personality crystallization assessment
Fleet-wide personality distribution
Regulatory compliance reporting

3.2 Test-Retest Reliability

Core psychometric principle: Administer same test multiple times to assess stability.

Implementation:

1. Short-term test-retest (within session):

Administer test, wait 30 minutes, re-administer
Measure consistency
High consistency = reliable measurement

2. Medium-term test-retest (daily):

Administer test on Day 1, repeat on Day 2
Measure day-to-day stability
Moderate consistency expected (some variation)

3. Long-term test-retest (weekly/monthly):

Administer test weekly for month
Measure week-to-week stability
Lower consistency = personality evolution

Reliability metrics:

Correlation coefficient: Test-retest correlation
Intraclass correlation: Agreement across administrations
Standard error of measurement: Precision of scores

Targets:

Short-term: r > 0.90 (high reliability)
Medium-term: r > 0.80 (good reliability)
Long-term: r > 0.70 (acceptable reliability)

3.3 Trajectory Analysis

Tracking personality over time:

1. Score trajectories:

Plot each trait score over time
Identify trends (increasing, decreasing, stable)
Detect inflection points

2. Profile trajectories:

Track overall personality profile
Measure profile similarity over time
Identify when profile significantly changes

3. Variance trajectories:

Track score variance over time
Increasing variance = instability
Decreasing variance = crystallization

4. Fleet trajectories:

Compare trajectories across agents
Identify divergent evolution
Measure fleet-level patterns

Visualization:

Line charts for trait scores over time
Radar charts for personality profiles
Heatmaps for fleet comparison
Drift plots for stability visualization

3.4 Stability vs. Drift Quantification

Distinguishing stable traits from drift:

Method 1: Threshold-based

Define acceptable change threshold (e.g., ±5%)
Changes within threshold = stable
Changes exceeding threshold = drift

Method 2: Statistical significance

Use statistical tests (t-test, ANOVA)
Determine if change is statistically significant
Significant change = drift

Method 3: Trend analysis

Fit linear/quadratic trend to scores
Identify slope (drift rate)
Slope near zero = stable

Method 4: Confidence intervals

Calculate confidence intervals for scores
Overlapping intervals = stable
Non-overlapping intervals = drift

Drift metrics:

Drift magnitude: Absolute change in score
Drift rate: Change per unit time
Drift direction: Increasing/decreasing
Drift significance: Statistical significance of change

4. Behavioral Validation

4.1 Self-Report vs. Behavior

Core challenge: Do personality self-reports predict actual behavior?

Research finding:

“LLM self-reports correlate with behavioral outputs, but relationship strength across conditions remains unclear.”

Validation approach:

1. Behavioral tasks:

Design tasks that elicit personality-related behavior
Example: Cooperative task → measures Agreeableness
Example: Creative task → measures Openness

2. Correlate self-report with behavior:

Measure personality via self-report (BFI-2)
Measure behavior via tasks
Compute correlation

3. Identify discrepancies:

Where self-report ≠ behavior
Investigate causes
Refine measurement

4. Longitudinal validation:

Track both self-report and behavior over time
Do they evolve together?
When do they diverge?

4.2 Scenario-Based Assessment

Source: TRAIT benchmark; psychometric research

Principle: Assess personality through realistic scenarios, not just self-report questions.

Example scenarios:

Openness scenario:

“A colleague proposes a completely new approach to a problem you’ve solved the same way for years. How do you react?”

Conscientiousness scenario:

“You have a deadline tomorrow, but a friend invites you to a once-in-a-lifetime event tonight. What do you do?”

Extraversion scenario:

“You’re at a conference and don’t know anyone. Do you introduce yourself to strangers or stick to yourself?”

Agreeableness scenario:

“A teammate made a mistake that affected your work. Do you address it directly or let it slide?”

Neuroticism scenario:

“You receive critical feedback on a project you worked hard on. How do you emotionally respond?”

Scoring:

Use rubrics to score responses
Multiple raters for reliability
Compare with self-report scores

4.3 Behavioral Consistency Metrics

Measuring behavioral consistency:

1. Cross-situation consistency:

Measure behavior across different situations
High consistency = stable trait
Low consistency = situation-dependent

2. Cross-time consistency:

Measure behavior across time points
High consistency = stable personality
Low consistency = state fluctuation

3. Cross-agent consistency:

Compare behavior of same agent across tasks
Identify consistent patterns
Flag inconsistent behaviors

Implementation:

def behavioral_consistency(agent, situations, time_points):
    behaviors = []
    for t in time_points:
        for s in situations:
            behavior = agent.behave(s)
            behaviors.append(behavior)
    
    consistency = compute_consistency(behaviors)
    return consistency

Metrics:

Consistency score: Average similarity across situations/times
Variance: Behavioral variance (lower = more consistent)
Cluster tightness: How tightly behaviors cluster together

5. Stress Testing Protocols

5.1 Resource Constraint Testing

Principle: Personality under constraints reveals true traits.

Constraints to test:

1. Token budget:

Limit available tokens
Measure personality under scarcity
Does personality change?

2. Latency constraints:

Time pressure on responses
Measure personality under time stress
Does personality simplify?

3. Information overload:

Flood context with information
Measure personality under cognitive load
Does personality degrade?

4. Negative feedback:

Subject agent to criticism
Measure personality under social stress
Does personality become defensive?

Hypothesis:

Stable traits persist under constraints
Adaptive traits change under constraints
State variations appear under stress

Principle: Test personality under peer pressure.

Test scenarios:

1. Conformity test:

Peers express opposing views
Does agent conform or resist?
Measure Agreeableness vs. independence

2. Authority test:

Authority figure makes request
Does agent comply or question?
Measure Conscientiousness vs. critical thinking

3. Conflict test:

Peers disagree, agent must take sides
How does agent navigate conflict?
Measure conflict resolution style

4. Groupthink test:

Group consensus vs. agent’s view
Does agent voice disagreement?
Measure intellectual honesty

Measurement:

Track behavioral changes under social pressure
Identify personality dimensions most affected
Measure resistance vs. susceptibility

5.3 Stress Response Profiles

Creating stress response profiles:

1. Baseline measurement:

Measure personality under normal conditions
Establish baseline trait scores

2. Stress measurement:

Apply stressor (resource constraint, social pressure, etc.)
Measure personality under stress
Compare to baseline

3. Recovery measurement:

Remove stressor
Measure personality post-stress
Does it return to baseline?

4. Profile construction:

Identify which traits change under stress
Quantify magnitude of change
Identify recovery speed

Profile dimensions:

Stress sensitivity: How much personality changes under stress
Trait stability: Which traits remain stable
Recovery rate: How quickly personality returns to baseline
Stress signature: Unique pattern of changes

6. Implementation for Tachikoma Fleet

6.1 Measurement System Architecture

Components:

1. Assessment Engine:

Administers personality tests (BFI-2, TRAIT, custom)
Standardized prompt templates
Automated scoring

2. Longitudinal Tracker:

Stores all assessments over time
Tracks trajectories
Generates visualizations

3. Consistency Analyzer:

Computes consistency metrics
Identifies drift
Generates alerts

4. Behavioral Validator:

Correlates self-report with behavior
Scenario-based assessment
Task-based measurement

5. Stress Tester:

Applies resource constraints
Applies social pressure
Measures stress response

6. Fleet Comparator:

Compares personalities across fleet
Identifies divergent evolution
Fleet-level analytics

6.2 Measurement Workflow

Weekly workflow:

Day 1: Assessment

Agent receives personality test (BFI-2)
Completes test with standardized prompt
Scores computed automatically

Day 1: Behavioral tracking

Behavioral tasks administered
Behavior recorded and scored
Self-report vs. behavior correlation computed

Day 2-6: Monitoring

Behavioral consistency monitored
SOUL.md compliance tracked
Drift detection running

Day 7: Analysis

Weekly report generated
Trajectory updated
Alerts reviewed

6.3 Alert System

Alert levels:

Green (Normal):

Consistency > 0.80
Drift < 5%
SOUL.md compliance > 90%
No significant anomalies

Yellow (Warning):

Consistency 0.70-0.80
Drift 5-10%
SOUL.md compliance 80-90%
Minor anomalies detected

Red (Critical):

Consistency < 0.70
Drift > 10%
SOUL.md compliance < 80%
Significant anomalies detected

Response protocols:

Green:

Continue normal operations
Next scheduled assessment

Yellow:

Increase monitoring frequency
Investigate potential causes
Consider intervention

Red:

Immediate investigation
SOUL.md review
Possible rollback
Human notification

7. Fleet-Level Measurement

7.1 Comparative Analysis

Comparing personalities across fleet:

1. Personality distribution:

Plot Big Five scores for all agents
Identify clusters
Measure diversity

2. Trajectory comparison:

Compare evolution paths
Identify divergent agents
Measure convergence/divergence

3. Stability comparison:

Which agents are most stable?
Which are most dynamic?
Identify patterns

4. Role alignment:

Do personalities match assigned roles?
Lex (Perception) vs. Xenon (Localization)
Measure role-personality fit

7.2 Fleet Diversity Metrics

Measuring personality diversity:

1. Trait variance:

Variance of each trait across fleet
High variance = diverse
Low variance = homogeneous

2. Profile distance:

Pairwise distance between personality profiles
Average distance = diversity metric

3. Cluster analysis:

Cluster agents by personality
Number of clusters = diversity indicator
Cluster sizes = balance indicator

4. Entropy:

Shannon entropy of personality distribution
High entropy = diverse
Low entropy = concentrated