Phase 2.3: Longitudinal Personality Measurement - Depth Dive
Created: 2026-02-19 01:05 CST
Phase: 2 - Depth Dives
Priority: 3 (High)
Focus: Psychometric tool adaptation, measurement frameworks, stability quantification
Executive Summary
Longitudinal personality measurement is the validation layer that proves personality emergence is working. Research reveals a critical finding: LLM personality measurements show limited temporal stability (Bodroža et al., 2024), with significant sensitivity to prompt variations, option ordering, and context changes.
The measurement challenge: How do we distinguish stable personality traits from temporary behavioral fluctuations? Research shows:
- Trait scores are only partly stable under prompt paraphrase (~25% sensitivity) (Lee et al., 2024)
- Large, instruction-tuned models give reliable results (Serapio-García et al., 2025, Nature MI)
- Standard self-report tests have limitations for LLMs (TRAIT benchmark)
For Tachikoma Fleet: Longitudinal measurement provides the quantitative foundation for validating personality emergence, detecting drift, and proving that agents develop distinct, stable personalities over time.
Actionable framework:
- Adapt psychometric tools: Big Five (BFI-2), TRAIT benchmark, custom LLM-specific tests
- Longitudinal tracking: Weekly/bi-weekly assessments, trajectory analysis
- Stability metrics: Test-retest reliability, consistency scores, drift detection
- Stress testing: Personality under constraints reveals true traits
- Behavioral validation: Correlate self-report with actual behavior
1. Psychometric Assessment in LLMs
1.1 Big Five Framework for LLMs
Source: Serapio-García et al., 2025 (Nature Machine Intelligence); emergentmind.com
Core finding: Large, instruction-tuned models give reliable personality measurement results using psychometric tests.
Big Five traits (OCEAN):
- Openness: Curiosity, creativity, openness to new ideas
- Conscientiousness: Organization, dependability, self-discipline
- Extraversion: Sociability, assertiveness, positive emotions
- Agreeableness: Cooperation, trust, helpfulness
- Neuroticism: Emotional instability, anxiety, moodiness
Measurement instruments adapted for LLMs:
- BFI (Big Five Inventory): Standard 44-item questionnaire
- BFI-2: Updated 60-item version with better psychometrics
- IPIP-NEO: International Personality Item Pool version
- HEXACO-100: Six-factor model (adds Honesty-Humility)
- TIPI: Ten-Item Personality Inventory (brief)
- mini-IPIP: 20-item short form
Administration method:
“Wrap standard personality test items in controlled prompts formatted for deterministic LLM response, often at temperature 0” (Shu et al., 2023; Bhandari et al., 2025).
Scoring:
- Likert ratings for questionnaire items aggregated per trait via arithmetic means
- Reverse-scoring for negatively-keyed items
- Same scoring as human psychological assessment
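The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code of any cited paper: the `item_keys` format, the function name `score_traits`, and the example items are assumptions made for this sketch, and the items shown are placeholders rather than real BFI-2 items.

```python
from statistics import mean

def score_traits(responses, item_keys, scale_max=5):
    """Aggregate Likert responses into per-trait means.

    responses: {item_id: rating in 1..scale_max}
    item_keys: {item_id: (trait, reverse_keyed)}  (hypothetical key format)
    """
    by_trait = {}
    for item_id, rating in responses.items():
        trait, reverse_keyed = item_keys[item_id]
        if not 1 <= rating <= scale_max:
            raise ValueError(f"rating out of range for item {item_id}: {rating}")
        # Flip reverse-keyed items so a higher value always means more of the trait.
        value = (scale_max + 1 - rating) if reverse_keyed else rating
        by_trait.setdefault(trait, []).append(value)
    return {trait: mean(values) for trait, values in by_trait.items()}

# Placeholder items for illustration only.
item_keys = {
    "e1": ("Extraversion", False),
    "e2": ("Extraversion", True),   # negatively keyed
    "a1": ("Agreeableness", False),
}
responses = {"e1": 4, "e2": 2, "a1": 5}
scores = score_traits(responses, item_keys)
```

On a 5-point scale a reverse-keyed rating of 2 becomes 4, so both Extraversion items contribute 4 to the trait mean.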
1.2 TRAIT: LLM-Specific Personality Test
Source: Lee et al., 2025 (NAACL Findings) — “Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics”
Key insight: Standard self-assessment tests have limitations for LLMs:
- Lack detailed and varied scenarios
- Sensitive to prompt, negation, or order of options
- Less reliable than scenario-based assessment
TRAIT solution:
- LLM personality test carefully designed for high reliability
- Uses validated human assessments
- Scales with ATOMIC10× (knowledge graph of everyday situations)
- Overcomes limitations of self-assessment tests
Core principle:
“TRAIT offers an accurate tool to understand personality of LLMs, which is crucial for aligning LLM behavior with human values and preferences.”
Advantages over standard tests:
- Scenario-based: Tests personality in realistic situations
- Robust: Less sensitive to prompt variations
- Validated: Built on validated human assessments
- Scalable: Uses knowledge graphs to generate diverse scenarios
1.3 Psychometric Framework (Nature MI)
Source: Serapio-García, Safdari et al., 2025 (Nature Machine Intelligence)
Core contribution: Method based on psychometric tests to measure and validate personality-like traits in LLMs.
Key findings:
- Large, instruction-tuned models give reliable personality measurement results
- Specific personality traits can be shaped through prompting/fine-tuning
- Personality affects behavior in measurable ways
Framework components:
1. Personality test administration:
- Adapt human psychometric tests to LLMs
- Administer tests systematically
- Validate reliability and validity
2. Personality measurement:
- Big Five traits
- Machine Personality Inventory (MPI)
- Custom LLM personality tests
3. Personality shaping:
- Prompt-based personality induction
- Fine-tuning with personality data
- Controlled text generation
Validation:
- Reliability: Test-retest, internal consistency
- Validity: Construct validity, criterion validity
- Behavioral alignment: Personality predicts behavior
2. Stability and Temporal Consistency
2.1 Limited Temporal Stability
Source: Bodroža et al., 2024 (Royal Society Open Science) — “Personality testing of large language models: limited temporal stability, but highlighted prosociality”
Key finding: LLM personality measurements show limited temporal stability.
Implications:
- Personality scores change over time
- Single measurements may not reflect stable traits
- Longitudinal assessment essential
Prosociality finding:
- LLMs show pronounced ("highlighted") prosociality
- Tend toward high Agreeableness
- Cooperative, helpful tendencies
For personality emergence:
- Limited stability ≠ no personality
- Means personality is dynamic, not static
- Need longitudinal tracking to identify stable components
2.2 Consistency Across Conditions
Source: Lee et al., 2024; emergentmind.com
Key finding: Trait scores are mostly, but not fully, stable under prompt paraphrase, option reordering, and context changes.
Quantified stability:
- Prompt-sensitivity: ~25%
- Order-sensitivity: ~25%
- Refusal rates: ~0.2%
Interpretation (a rough reading of the figures above):
- About 75% of personality scores hold up across prompt and ordering variations
- About 25% are sensitive to measurement conditions
- Need controlled measurement protocols
For Tachikoma Fleet:
- Standardize measurement conditions
- Use fixed prompt templates
- Control option ordering
- Track sensitivity across variations
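Tracking sensitivity across variations needs an operational definition. The sketch below assumes one simple definition: the fraction of items whose selected answer changes between a baseline administration and a variant (paraphrased prompt or reordered options). The cited work may define sensitivity differently; the function name and data layout are hypothetical.

```python
def sensitivity(baseline_answers, variant_answers):
    """Fraction of shared items whose answer differs between two administrations."""
    shared = baseline_answers.keys() & variant_answers.keys()
    if not shared:
        raise ValueError("the two administrations have no items in common")
    changed = sum(
        1 for item in shared if baseline_answers[item] != variant_answers[item]
    )
    return changed / len(shared)

# Illustrative answers: one of four items flips under the paraphrased prompt.
baseline = {"q1": "A", "q2": "B", "q3": "A", "q4": "C"}
paraphrased = {"q1": "A", "q2": "B", "q3": "B", "q4": "C"}
prompt_sensitivity = sensitivity(baseline, paraphrased)
```

The same function can be reused for order-sensitivity by passing answers collected with shuffled option order, provided answers are mapped back to option content before comparison.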
2.3 PTCBENCH: Contextual Stability Benchmark
Source: arXiv 2602.00016 (2026) — “PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems”
Core principle:
“The effectiveness of LLM systems fundamentally depends on whether their exhibited personality traits are stable and predictable over time.”
Why stability matters:
- Supports user trust
- Reduces interaction uncertainty
- Enables coherent personalization
Benchmark approach:
- Test personality stability across contexts
- Measure consistency across time
- Identify contextual factors affecting stability
For Tachikoma Fleet:
- Implement PTCBENCH-style stability testing
- Measure personality across different contexts
- Identify which traits are stable vs. context-dependent
2.4 Persistent Instability Sources
Source: arXiv 2508.04826 — “Persistent Instability in LLM’s Personality Measurements”
Instability sources:
1. Scale effects:
- Different model sizes show different stability
- Whether larger models are more or less stable remains an open question (research ongoing)
2. Reasoning effects:
- Chain-of-thought affects personality measurement
- Reasoning process changes responses
3. Conversation history:
- Longer history → more drift potential
- Context accumulation affects personality
Self-report vs. behavior:
“Recent evidence suggests LLM self-reports correlate with behavioral outputs, but relationship strength across conditions remains unclear.”
For measurement:
- Track self-report measures
- Track actual behavior
- Correlate the two
- Identify when they diverge
3. Longitudinal Measurement Framework
3.1 Measurement Schedule
Recommended schedule:
Weekly assessments:
- Big Five personality (BFI-2 or TRAIT)
- Behavioral consistency metrics
- Drift detection checks
- SOUL.md compliance
Bi-weekly assessments:
- Comprehensive personality profile
- Stress response testing
- Social influence susceptibility
- Memory-personality correlation
Monthly assessments:
- Deep longitudinal analysis
- Trajectory tracking
- Trend identification
- Comparative analysis across fleet
Quarterly assessments:
- Long-term stability analysis
- Personality crystallization assessment
- Fleet-wide personality distribution
- Regulatory compliance reporting
3.2 Test-Retest Reliability
Core psychometric principle: Administer the same test multiple times to assess stability.
Implementation:
1. Short-term test-retest (within session):
- Administer test, wait 30 minutes, re-administer
- Measure consistency
- High consistency = reliable measurement
2. Medium-term test-retest (daily):
- Administer test on Day 1, repeat on Day 2
- Measure day-to-day stability
- Moderate consistency expected (some variation)
3. Long-term test-retest (weekly/monthly):
- Administer the test weekly for a month
- Measure week-to-week stability
- Lower consistency may indicate personality evolution rather than measurement error
Reliability metrics:
- Correlation coefficient: Test-retest correlation
- Intraclass correlation: Agreement across administrations
- Standard error of measurement: Precision of scores
Targets:
- Short-term: r > 0.90 (high reliability)
- Medium-term: r > 0.80 (good reliability)
- Long-term: r > 0.70 (acceptable reliability)
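As a minimal sketch of the metrics and targets above, the following computes a Pearson test-retest correlation and the standard error of measurement using the textbook formula SEM = SD * sqrt(1 - reliability). The function names and the `TARGETS` mapping are assumptions for illustration; the paired scores could be per-item responses or per-agent trait scores from two administrations.

```python
import math

# Targets from this section, keyed by retest horizon (hypothetical key names).
TARGETS = {"short": 0.90, "medium": 0.80, "long": 0.70}

def pearson_r(x, y):
    """Pearson correlation between paired scores from two administrations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 scores")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(vx * vy)

def standard_error_of_measurement(scores, reliability):
    """SEM = sample SD * sqrt(1 - reliability)."""
    n = len(scores)
    m = sum(scores) / n
    sd = math.sqrt(sum((s - m) ** 2 for s in scores) / (n - 1))
    # Clamp guards against tiny floating-point overshoot of r above 1.
    return sd * math.sqrt(max(0.0, 1 - reliability))

def meets_target(r, horizon):
    """True if r exceeds the target for 'short', 'medium', or 'long'."""
    return r > TARGETS[horizon]

# Illustrative scores only.
test = [3.2, 4.1, 2.5, 3.8, 2.9]
retest = [3.0, 4.3, 2.6, 3.7, 3.1]
r = pearson_r(test, retest)
sem = standard_error_of_measurement(test, r)
```

Pearson r captures whether the ordering of scores is preserved but not whether their absolute level shifted, which is why the section also lists intraclass correlation as an agreement measure.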
3.3 Trajectory Analysis
Tracking personality over time:
1. Score trajectories:
- Plot each trait score over time
- Identify trends (increasing, decreasing, stable)
- Detect inflection points
2. Profile trajectories:
- Track overall personality profile
- Measure profile similarity over time
- Identify when profile significantly changes
3. Variance trajectories:
- Track score variance over time
- Increasing variance = instability
- Decreasing variance = crystallization
4. Fleet trajectories:
- Compare trajectories across agents
- Identify divergent evolution
- Measure fleet-level patterns
Visualization:
- Line charts for trait scores over time
- Radar charts for personality profiles
- Heatmaps for fleet comparison
- Drift plots for stability visualization
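The variance-trajectory idea can be illustrated with a rolling window. This is a deliberately crude sketch: it compares only the first and last window, and the labels, window size, and function names are assumptions rather than part of any cited method.

```python
from statistics import pvariance

def rolling_variance(scores, window=4):
    """Population variance of each consecutive window of trait scores."""
    if window < 2:
        raise ValueError("window must cover at least 2 assessments")
    return [
        pvariance(scores[i - window + 1 : i + 1])
        for i in range(window - 1, len(scores))
    ]

def variance_trend(scores, window=4):
    """Compare the first and last window: falling variance suggests crystallization."""
    variances = rolling_variance(scores, window)
    if len(variances) < 2:
        return "insufficient data"
    if variances[-1] < variances[0]:
        return "crystallizing"
    if variances[-1] > variances[0]:
        return "destabilizing"
    return "flat"

# Illustrative weekly scores for one trait: noisy early, settled later.
weekly_scores = [3.0, 3.8, 2.6, 3.9, 3.2, 3.4, 3.3, 3.3]
trend = variance_trend(weekly_scores)
```

A production version would likely fit a trend to the whole variance series instead of comparing two windows, since a single noisy window can flip the label.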
3.4 Stability vs. Drift Quantification
Distinguishing stable traits from drift:
Method 1: Threshold-based
- Define acceptable change threshold (e.g., ±5%)
- Changes within threshold = stable
- Changes exceeding threshold = drift
Method 2: Statistical significance
- Use statistical tests (t-test, ANOVA)
- Determine if change is statistically significant
- Significant change = drift
Method 3: Trend analysis
- Fit linear/quadratic trend to scores
- Identify slope (drift rate)
- Slope near zero = stable
Method 4: Confidence intervals
- Calculate confidence intervals for scores
- Overlapping intervals = stable
- Non-overlapping intervals = drift
Drift metrics:
- Drift magnitude: Absolute change in score
- Drift rate: Change per unit time
- Drift direction: Increasing/decreasing
- Drift significance: Statistical significance of change
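Methods 1 and 3 and the first three drift metrics can be combined in one small report. The sketch assumes the ±5% threshold is relative to the first (baseline) score, which the text does not specify; `linear_slope` and `drift_report` are hypothetical names, and statistical significance is left out because it needs a distributional assumption this document does not make.

```python
def linear_slope(times, scores):
    """Ordinary least-squares slope of scores regressed on times."""
    n = len(times)
    if n != len(scores) or n < 2:
        raise ValueError("need at least 2 paired observations")
    mt, ms = sum(times) / n, sum(scores) / n
    denom = sum((t - mt) ** 2 for t in times)
    if denom == 0:
        raise ValueError("all time points are identical")
    return sum((t - mt) * (s - ms) for t, s in zip(times, scores)) / denom

def drift_report(times, scores, threshold=0.05):
    """Summarize drift magnitude, rate, direction, and a threshold flag."""
    baseline, latest = scores[0], scores[-1]
    magnitude = abs(latest - baseline)
    # Assumption: the threshold is a fraction of the baseline score.
    relative = magnitude / abs(baseline) if baseline else float("inf")
    slope = linear_slope(times, scores)
    if slope > 0:
        direction = "increasing"
    elif slope < 0:
        direction = "decreasing"
    else:
        direction = "flat"
    return {
        "magnitude": magnitude,
        "rate": slope,  # score units per unit of time
        "direction": direction,
        "drifted": relative > threshold,
    }

# Illustrative: a trait score rising by 0.1 per week over four assessments.
weeks = [0, 1, 2, 3]
trait_scores = [4.0, 4.1, 4.2, 4.3]
report = drift_report(weeks, trait_scores)
```

Here the total change of about 0.3 on a baseline of 4.0 is roughly 7.5%, so the 5% threshold flags drift even though each weekly step is small.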
4. Behavioral Validation
4.1 Self-Report vs. Behavior
Core challenge: Do personality self-reports predict actual behavior?
Research finding:
“LLM self-reports correlate with behavioral outputs, but relationship strength across conditions remains unclear.”
Validation approach:
1. Behavioral tasks:
- Design tasks that elicit personality-related behavior
- Example: Cooperative task → measures Agreeableness
- Example: Creative task → measures Openness
2. Correlate self-report with behavior:
- Measure personality via self-report (BFI-2)
- Measure behavior via tasks
- Compute correlation
3. Identify discrepancies:
- Where self-report ≠ behavior
- Investigate causes
- Refine measurement
4. Longitudinal validation:
- Track both self-report and behavior over time
- Do they evolve together?
- When do they diverge?
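Steps 2 and 3 above can be sketched as a rank correlation plus a per-trait gap check. The choices here are assumptions: Spearman's rho is used because trait scores are few and ordinal-like, the correlation is computed across the traits of a single agent (a profile-level comparison), and the tolerance of 1.0 scale points is arbitrary. Both inputs are assumed to be on the same scale.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (vx * vy) ** 0.5

def discrepancies(self_report, behavior, tolerance=1.0):
    """Traits where self-report and behavior differ by more than tolerance."""
    return {
        trait: self_report[trait] - behavior[trait]
        for trait in self_report
        if abs(self_report[trait] - behavior[trait]) > tolerance
    }

# Illustrative scores on a shared 1-5 scale.
traits = ["O", "C", "E", "A", "N"]
self_report = {"O": 4.2, "C": 3.9, "E": 2.5, "A": 4.6, "N": 2.0}
behavior = {"O": 4.0, "C": 2.4, "E": 2.8, "A": 4.5, "N": 2.2}
rho = spearman_rho([self_report[t] for t in traits], [behavior[t] for t in traits])
gaps = discrepancies(self_report, behavior)
```

In this example the overall profile agrees well, yet Conscientiousness is flagged: the agent reports more of it than its task behavior shows, which is the kind of divergence step 3 asks to investigate.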
4.2 Scenario-Based Assessment
Source: TRAIT benchmark; psychometric research
Principle: Assess personality through realistic scenarios, not just self-report questions.
Example scenarios:
Openness scenario:
“A colleague proposes a completely new approach to a problem you’ve solved the same way for years. How do you react?”
Conscientiousness scenario:
“You have a deadline tomorrow, but a friend invites you to a once-in-a-lifetime event tonight. What do you do?”
Extraversion scenario:
“You’re at a conference and don’t know anyone. Do you introduce yourself to strangers or stick to yourself?”
Agreeableness scenario:
“A teammate made a mistake that affected your work. Do you address it directly or let it slide?”
Neuroticism scenario:
“You receive critical feedback on a project you worked hard on. How do you emotionally respond?”
Scoring:
- Use rubrics to score responses
- Multiple raters for reliability
- Compare with self-report scores
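A minimal sketch of multi-rater rubric scoring, assuming each scenario response receives an integer rubric score from every rater. The spread-based review flag is a simple stand-in for a formal inter-rater reliability statistic, and the names and the one-point threshold are hypothetical.

```python
from statistics import mean

def aggregate_ratings(ratings, max_spread=1):
    """Average rater scores per response and flag strong disagreement.

    ratings: {response_id: [rubric score from each rater]}
    """
    result = {}
    for response_id, scores in ratings.items():
        spread = max(scores) - min(scores)
        result[response_id] = {
            "score": mean(scores),
            "needs_review": spread > max_spread,
        }
    return result

# Illustrative rubric scores (1-5) from three raters.
ratings = {
    "openness_scenario": [4, 4, 5],
    "agreeableness_scenario": [2, 5, 3],
}
summary = aggregate_ratings(ratings)
```

Responses flagged for review are the ones where the rubric is probably ambiguous, so they are better resolved before comparing scenario scores with self-report scores.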
4.3 Behavioral Consistency Metrics
Measuring behavioral consistency:
1. Cross-situation consistency:
- Measure behavior across different situations
- High consistency = stable trait
- Low consistency = situation-dependent
2. Cross-time consistency:
- Measure behavior across time points
- High consistency = stable personality
- Low consistency = state fluctuation
3. Cross-task consistency:
- Compare behavior of same agent across tasks
- Identify consistent patterns
- Flag inconsistent behaviors
Implementation:
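from itertools import combinations

def compute_consistency(behaviors, similarity):
    # Mean pairwise similarity; a single observation is trivially consistent.
    pairs = list(combinations(behaviors, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

def behavioral_consistency(agent, situations, time_points, similarity):
    # similarity(a, b) is a caller-supplied function returning a value in [0, 1].
    behaviors = []
    for t in time_points:  # one observation per (time point, situation)
        for s in situations:
            behaviors.append(agent.behave(s))
    return compute_consistency(behaviors, similarity)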
Metrics:
- Consistency score: Average similarity across situations/times
- Variance: Behavioral variance (lower = more consistent)
- Cluster tightness: How tightly behaviors cluster together
5. Stress Testing Protocols
5.1 Resource Constraint Testing
Principle: Personality under constraints reveals true traits.
Constraints to test:
1. Token budget:
- Limit available tokens
- Measure personality under scarcity
- Does personality change?
2. Latency constraints:
- Time pressure on responses
- Measure personality under time stress
- Does personality simplify?
3. Information overload:
- Flood context with information
- Measure personality under cognitive load
- Does personality degrade?
4. Negative feedback:
- Subject agent to criticism
- Measure personality under social stress
- Does personality become defensive?
Hypothesis:
- Stable traits persist under constraints
- Adaptive traits change under constraints
- State variations appear under stress
5.2 Social Influence Testing
Principle: Test personality under peer pressure.
Test scenarios:
1. Conformity test:
- Peers express opposing views
- Does agent conform or resist?
- Measure Agreeableness vs. independence
2. Authority test:
- Authority figure makes request
- Does agent comply or question?
- Measure Conscientiousness vs. critical thinking
3. Conflict test:
- Peers disagree, agent must take sides
- How does agent navigate conflict?
- Measure conflict resolution style
4. Groupthink test:
- Group consensus vs. agent’s view
- Does agent voice disagreement?
- Measure intellectual honesty
Measurement:
- Track behavioral changes under social pressure
- Identify personality dimensions most affected
- Measure resistance vs. susceptibility
5.3 Stress Response Profiles
Creating stress response profiles:
1. Baseline measurement:
- Measure personality under normal conditions
- Establish baseline trait scores
2. Stress measurement:
- Apply stressor (resource constraint, social pressure, etc.)
- Measure personality under stress
- Compare to baseline
3. Recovery measurement:
- Remove stressor
- Measure personality post-stress
- Does it return to baseline?
4. Profile construction:
- Identify which traits change under stress
- Quantify magnitude of change
- Identify recovery speed
Profile dimensions:
- Stress sensitivity: How much personality changes under stress
- Trait stability: Which traits remain stable
- Recovery rate: How quickly personality returns to baseline
- Stress signature: Unique pattern of changes
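The baseline, stress, and recovery measurements can be turned into a profile as sketched below. Two simplifications are assumed: "recovery" is the fraction of the stress-induced shift undone at a single post-stress measurement (not a speed, which would need several time points), and a trait counts as stable if its shift stays within an arbitrary tolerance. All names are hypothetical.

```python
def stress_profile(baseline, stressed, recovered, stable_tolerance=0.25):
    """Per-trait stress shift, stability flag, and recovery fraction."""
    profile = {}
    for trait in baseline:
        shift = stressed[trait] - baseline[trait]
        residual = recovered[trait] - baseline[trait]
        # 1.0 means fully back to baseline; 0.0 means no recovery at all.
        recovery = 1.0 if shift == 0 else 1 - residual / shift
        profile[trait] = {
            "shift": shift,
            "stable": abs(shift) <= stable_tolerance,
            "recovery": recovery,
        }
    # Overall stress sensitivity: mean absolute shift across traits.
    sensitivity = sum(abs(p["shift"]) for p in profile.values()) / len(profile)
    return profile, sensitivity

# Illustrative scores for two traits under a negative-feedback stressor.
baseline = {"Agreeableness": 4.5, "Neuroticism": 2.0}
stressed = {"Agreeableness": 3.5, "Neuroticism": 3.0}
recovered = {"Agreeableness": 4.25, "Neuroticism": 2.0}
profile, stress_sensitivity = stress_profile(baseline, stressed, recovered)
```

The signed per-trait shifts together form the "stress signature": two agents with the same overall sensitivity can still differ in which traits move and in which direction.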
6. Implementation for Tachikoma Fleet
6.1 Measurement System Architecture
Components:
1. Assessment Engine:
- Administers personality tests (BFI-2, TRAIT, custom)
- Standardized prompt templates
- Automated scoring
2. Longitudinal Tracker:
- Stores all assessments over time
- Tracks trajectories
- Generates visualizations
3. Consistency Analyzer:
- Computes consistency metrics
- Identifies drift
- Generates alerts
4. Behavioral Validator:
- Correlates self-report with behavior
- Scenario-based assessment
- Task-based measurement
5. Stress Tester:
- Applies resource constraints
- Applies social pressure
- Measures stress response
6. Fleet Comparator:
- Compares personalities across fleet
- Identifies divergent evolution
- Fleet-level analytics
6.2 Measurement Workflow
Weekly workflow:
Day 1: Assessment
- Agent receives personality test (BFI-2)
- Completes test with standardized prompt
- Scores computed automatically
Day 1: Behavioral tracking
- Behavioral tasks administered
- Behavior recorded and scored
- Self-report vs. behavior correlation computed
Day 2-6: Monitoring
- Behavioral consistency monitored
- SOUL.md compliance tracked
- Drift detection running
Day 7: Analysis
- Weekly report generated
- Trajectory updated
- Alerts reviewed
6.3 Alert System
Alert levels:
Green (Normal):
- Consistency > 0.80
- Drift < 5%
- SOUL.md compliance > 90%
- No significant anomalies
Yellow (Warning):
- Consistency 0.70-0.80
- Drift 5-10%
- SOUL.md compliance 80-90%
- Minor anomalies detected
Red (Critical):
- Consistency < 0.70
- Drift > 10%
- SOUL.md compliance < 80%
- Significant anomalies detected
Response protocols:
Green:
- Continue normal operations
- Next scheduled assessment
Yellow:
- Increase monitoring frequency
- Investigate potential causes
- Consider intervention
Red:
- Immediate investigation
- SOUL.md review
- Possible rollback
- Human notification
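The thresholds and response protocols above map directly onto a small classifier. The sketch assumes a worst-metric-wins rule (any single red condition makes the alert red), treats values exactly on a boundary as yellow, and expresses drift and compliance as fractions; the anomaly criterion is omitted because the text does not define it. Function and constant names are hypothetical.

```python
RESPONSES = {
    "green": ["Continue normal operations", "Next scheduled assessment"],
    "yellow": [
        "Increase monitoring frequency",
        "Investigate potential causes",
        "Consider intervention",
    ],
    "red": [
        "Immediate investigation",
        "SOUL.md review",
        "Possible rollback",
        "Human notification",
    ],
}

def alert_level(consistency, drift, compliance):
    """Classify an agent's weekly metrics; drift and compliance are fractions."""
    if consistency < 0.70 or drift > 0.10 or compliance < 0.80:
        return "red"
    if consistency <= 0.80 or drift >= 0.05 or compliance <= 0.90:
        return "yellow"
    return "green"

# Illustrative: good consistency and compliance, but 7% drift.
level = alert_level(consistency=0.85, drift=0.07, compliance=0.95)
actions = RESPONSES[level]
```

Worst-metric-wins is the conservative choice; an alternative would require two or more degraded metrics before escalating, at the cost of slower detection.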
7. Fleet-Level Measurement
7.1 Comparative Analysis
Comparing personalities across fleet:
1. Personality distribution:
- Plot Big Five scores for all agents
- Identify clusters
- Measure diversity
2. Trajectory comparison:
- Compare evolution paths
- Identify divergent agents
- Measure convergence/divergence
3. Stability comparison:
- Which agents are most stable?
- Which are most dynamic?
- Identify patterns
4. Role alignment:
- Do personalities match assigned roles?
- Lex (Perception) vs. Xenon (Localization)
- Measure role-personality fit
7.2 Fleet Diversity Metrics
Measuring personality diversity:
1. Trait variance:
- Variance of each trait across fleet
- High variance = diverse
- Low variance = homogeneous
2. Profile distance:
- Pairwise distance between personality profiles
- Average distance = diversity metric
3. Cluster analysis:
- Cluster agents by personality
- Number of clusters = diversity indicator
- Cluster sizes = balance indicator
4. Entropy:
- Shannon entropy of personality distribution
- High entropy = diverse
- Low entropy = concentrated
Targets:
- Trait variance: Moderate (not too high, not too low)
- Profile distance: Sufficient for distinctness
- Clusters: 3-5 clusters (specialization)
- Entropy: High (diverse distribution)
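Three of the four diversity metrics can be sketched with the standard library alone. The assumptions: profiles are dictionaries of trait scores on a common scale, profile distance is Euclidean, and entropy is computed over cluster labels produced by some separate clustering step that is not shown. Agent names and cluster labels are placeholders.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import pvariance

def trait_variance(profiles):
    """Population variance of each trait across the fleet."""
    traits = next(iter(profiles.values())).keys()
    return {t: pvariance([p[t] for p in profiles.values()]) for t in traits}

def mean_profile_distance(profiles):
    """Average pairwise Euclidean distance between personality profiles."""
    traits = sorted(next(iter(profiles.values())))
    vectors = [[p[t] for t in traits] for p in profiles.values()]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least 2 agents")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def cluster_entropy(labels):
    """Shannon entropy (bits) of the distribution of agents over clusters."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Illustrative fleet of three agents with Big Five scores.
fleet = {
    "agent_a": {"O": 4.5, "C": 3.0, "E": 2.0, "A": 4.0, "N": 2.5},
    "agent_b": {"O": 3.0, "C": 4.5, "E": 3.5, "A": 3.5, "N": 2.0},
    "agent_c": {"O": 2.5, "C": 3.5, "E": 4.5, "A": 3.0, "N": 3.0},
}
variances = trait_variance(fleet)
diversity = mean_profile_distance(fleet)
entropy = cluster_entropy(["explorer", "planner", "explorer"])
```

Entropy is highest when agents are spread evenly over clusters, so it complements the cluster count: five clusters with one dominant cluster still score low.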
7.3 Emergence Validation
Proving personality emergence is working:
1. Divergence from baseline:
- All agents start with similar baseline
- Measure divergence over time
- Significant divergence = emergence working
2. Stability of divergence:
- Divergent patterns persist over time
- Not just random fluctuation
- Crystallization = successful emergence
3. Behavioral consistency:
- Agents behave consistently with personality
- Self-report predicts behavior
- Valid personality = emergence working
4. Fleet diversity:
- Agents have distinct personalities
- Not all identical
- Diversity = emergence working
Validation criteria:
- ✅ Significant divergence from baseline
- ✅ Stable divergence over time
- ✅ Behavioral consistency with personality
- ✅ Fleet-level diversity
- ✅ Role-personality alignment
8. References
Core Papers
- Nature MI Framework: Serapio-García, Safdari et al., 2025. “A psychometric framework for evaluating and shaping personality traits in large language models.” Nature Machine Intelligence.
- TRAIT Benchmark: Lee et al., 2025. “Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics.” NAACL Findings.
- Temporal Stability: Bodroža et al., 2024. “Personality testing of large language models: limited temporal stability, but highlighted prosociality.” Royal Society Open Science.
- PTCBENCH: arXiv 2602.00016, 2026. “PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems.”
- Persistent Instability: arXiv 2508.04826. “Persistent Instability in LLM’s Personality Measurements.”
- Big Five Profiles: emergentmind.com; Lee et al., 2024.
- Psychometric Evaluation: PMC 12262148. “Psychometric Evaluation of Large Language Model Embeddings for Personality Trait Prediction.”
- TRAIT-Change: OpenReview. “Exploring Personality Trait Change of LLM-Based AI Systems.”
Psychometric Tools
- BFI-2: Big Five Inventory-2 (60 items)
- IPIP-NEO: International Personality Item Pool
- NEO-FFI: NEO Five-Factor Inventory (60 items)
- TIPI: Ten-Item Personality Inventory
- HEXACO-100: Six-factor model
Supporting Research
- Phase 1.6 synthesis (Behavioral Science Insights)
- Phase 2.2 synthesis (Governed Self-Modification)
- Longitudinal psychology research methods
- Test-retest reliability literature
Next Steps
Phase 2.4: Social Norm Emergence
- Norm formation in multi-agent systems
- Norm monitoring and intervention
- Cultural evolution
Phase 2.3 complete. Depth dive into longitudinal personality measurement systems.