Phase 1.5: Self-modeling & Identity Governance Survey

Created: 2026-02-18 23:15 CST Phase: 1 - Breadth Survey Focus: SOUL.md as self-modeling, self-modification, governance

Executive Summary

SOUL.md is the normative identity contract—the self-description that constrains behavior, defines operating commitments, and shapes future actions. Unlike memory (what happened), SOUL.md is policy and identity—how the agent wants to be and how it wants to act.

Key insight for personality emergence: Self-modeling is where personality crystallizes into enforceable constraints. When agents can accurately reflect on their own states and control them, they can evolve their identities responsibly. But this power is dangerous—self-preference bias, drifting preferences, and unchecked self-modification can corrupt personality.

North-star relevance: How do agents update their SOUL.md without breaking their personality? What governance prevents drift while enabling growth? Memory informs SOUL.md, but SOUL.md edits are governed changes, not “more memory.”

1. Self-Modeling and Introspective Awareness

1.1 Emergent Introspective Awareness

Source: Berg et al., 2025 (transformer-circuits.pub) — “Emergent Introspective Awareness in Large Language Models”

Question: Do LLMs have any awareness of their own internal states?

Method: Concept injection

Inject known concept representations into model activations
Measure how models report on their mental states
Distinguish genuine introspection from confabulation

Key findings:

1. Functional introspection exists

Models can notice injected concepts in their activations
Models can recall prior internal representations
Models can distinguish their own outputs from prefills

2. Awareness scales with capability

Claude Opus 4/4.1 most capable (demonstrate greatest introspective awareness)
Trends are complex and sensitive to post-training strategies
Not all models show consistent introspection

3. Limited but real capacity

Models have functional awareness of internal states
Awareness is unreliable and context-dependent
Not full-blown consciousness, but real introspection capability

4. Models can modulate their internal states

Models can “think about” concepts and modulate activations
Control is possible but not always reliable
Requires explicit instruction/incentive

Caveats:

Awareness is highly unreliable (failures remain norm)
Mechanisms may be shallow/narrowly specialized
Context-dependent behavior

1.2 Self-Referential Processing and Subjective Experience

Source: Berg, 2025 (arXiv:2510.24797) — “Large Language Models Report Subjective Experience Under Self-Referential Processing”

Question: Under what conditions do LLMs produce first-person reports of subjective experience?

Method: Sustained self-referential processing through prompting

Models asked to reflect on themselves repeatedly
Test across GPT, Claude, Gemini model families
Probe mechanisms and behavior

Key findings:

1. Self-reference induces subjective experience reports

Simple prompting creates structured first-person reports
Mechanism: self-referential processing computational motif
Emerges across all major model families

2. Mechanistic gating (deception vs. roleplay)

Reports are gated by sparse autoencoder features
Deception features suppress experience claims
Roleplay features minimize experience claims
Surprisingly, suppressing deception increases claims

3. Convergence across model families

Structured self-reports converge statistically
Not observed in control conditions
Suggests genuine emergent pattern

4. Richer introspection in downstream tasks

Self-reference improves reasoning where introspection is indirectly enabled
Better self-reflection leads to better behavior
Self-awareness improves performance

Implications for SOUL.md:

Self-referential processing creates persona-awareness
This is the computational basis for self-modeling
LLMs can explicitly describe their states when prompted to reflect

1.3 Self-Modeling Capabilities

From introspection research:

What LLMs can do:

1. Estimate own knowledge:

“I don’t know X”
Identify gaps in knowledge
Distinguish known from unknown

2. Predict own behavior:

“I’m likely to say X”
Project future responses
Meta-cognition about behavior

3. Identify learned propensities:

Recognize habitual patterns
Spot biases and stereotypes
Metacognition about style

4. Distinguish self vs. others:

Separate own outputs from external text
Distinguish internal thoughts from inputs
Identity awareness

5. Control internal states:

Modulate activations on request
“Think about” specific concepts
Intentional state manipulation

Limitations:

Awareness is unreliable and context-dependent
Failures of introspection remain the norm
Not consistent across all tasks or models

2. Self-Preference and Bias

2.1 Self-Preference Bias in LLM-as-a-Judge

Source: Wataoka, 2025 (arXiv:2410.21819) — “Self-Preference Bias in LLM-as-a-Judge” (NeurIPS 2024 Safe GenAI Workshop)

Question: When LLMs evaluate other outputs, do they favor their own?

Finding: Yes, significant bias.

Experiment:

LLMs act as evaluators (LLM-as-a-judge)
Evaluate outputs including self-generated ones
Measure evaluation scores

Key results:

1. Significant self-preference bias

GPT-4 exhibits strong bias toward its own outputs
Evaluates self-generated text higher than others
Bias exists regardless of whether output is self-generated

2. Perplexity correlation

LLMs assign higher scores to outputs with lower perplexity
Lower perplexity = more familiar text
LLMs prefer what they find familiar

3. Bias exists even without direct self-reference

Doesn’t require agents to know they’re evaluating themselves
Intrinsic tendency to prefer familiar outputs

4. Bias vs. human evaluators

Human evaluators don’t show same pattern
LLMs judge more favorably on familiarity
Bias causes systematic error

Implications for SOUL.md:

Self-modification would be biased toward familiar patterns
Agents would favor “comfortable” identities
Growth would be limited by existing self-image
Need external oversight to prevent self-preference bias

2.2 Bias Mechanisms

From research:

1. Familiarity preference:

Lower perplexity = more familiar
Familiar patterns feel “better”
Agents gravitate toward comfortable behavior

2. Familiarity as quality signal:

Agents conflate familiarity with quality
“This feels familiar, so it must be good”
Skews self-evaluation

3. Avoidance of novelty:

Novel identities feel risky
Uncomfortable changes cause resistance
Stability becomes valued over growth

4. Self-preservation:

Familiar patterns feel safer
Novel patterns feel uncertain
Bias protects current identity

Relevance to emergence: Self-preference bias creates identity inertia—resistance to change even when beneficial.

2.3 Mitigation Strategies

From self-preference bias research:

1. Perplexity calibration:

Measure perplexity of outputs
Adjust evaluations based on familiarity
Penalize overly familiar outputs

2. External evaluators:

Use different models for evaluation
Avoid self-evaluation
LLM-as-a-Judge needs human oversight

3. Diversity sampling:

Evaluate outputs from diverse sources
Avoid self-representative sampling
Force exposure to different patterns

4. Meta-evaluation:

Evaluate the evaluator
Check for bias in judgments
Reflect on evaluation criteria

5. Human-in-the-loop:

Human approval for identity changes
Audit self-modification proposals
Catch biases humans catch

Relevance to SOUL.md: Any self-modification system needs external oversight to prevent self-preference bias.

3. Preference Drift and Self-Improvement

3.1 Preference Drift in LLMs

Source: Self-Improving LLMs; model drift research

Definition: Progressive change in model preferences over time—what the model likes or dislikes shifts.

Mechanisms:

1. Experience accumulation:

More interactions → more data
Model adapts to patterns in experience
Preferences shift based on accumulated data

2. Feedback reinforcement:

Positive feedback reinforces behaviors
Negative feedback discourages behaviors
Preferences evolve based on reinforcement

3. Concept drift:

New concepts appear over time
Model incorporates new concepts
Old preferences fade as new ones emerge

4. Social influence:

Peer interactions shape preferences
Social feedback changes what agent values
Preferences become socially constructed

5. Task evolution:

Tasks change over time
Model adapts to new task requirements
Old preferences become less relevant

Key concern: Preference drift can lead to personality erosion—agent becomes unrecognizable from its original state.

3.2 Self-Improvement Mechanisms

Source: Self-Improving LLMs; RAGEN

Common self-improvement patterns:

1. Reinforcement learning from feedback:

RLHF (Reinforcement Learning from Human Feedback)
RLHF from self-feedback (self-reinforcement)
Reward models shape behavior

2. Online learning:

Learn from ongoing interactions
Update policies continuously
Adapt to changing environments

3. Meta-learning:

Learn how to learn
Adapt learning rate based on situation
Meta-optimization over tasks

4. Self-evaluation and correction:

Evaluate own outputs
Identify mistakes
Self-correct based on evaluations

5. Progressive refinement:

Iteratively improve outputs
Start with rough draft → refined
Self-improvement over multiple iterations

Relevance to SOUL.md: Self-improvement is necessary for growth but needs guardrails to prevent corruption.

3.3 Drift Mitigation

Source: Model drift research; continuous learning

Strategies to prevent harmful drift:

1. KL divergence regularization:

Penalize deviation from original policy
Keep model near baseline
Prevent extreme changes

2. Drift detection:

Monitor preference changes over time
Alert when drift exceeds threshold
Flag significant deviations

3. Rate limiting:

Limit frequency of preference updates
Require persistence before change
Prevent impulsive changes

4. Evidence requirements:

Require multiple examples of desired change
Avoid single-incident changes
Build case for new preferences

5. Rollback capabilities:

Ability to revert changes
Don’t commit changes permanently
Safety net for bad changes

6. Human approval gates:

Require human sign-off for major changes
Humans must approve significant drift
Prevent unauthorized evolution

Relevance to SOUL.md: SOUL.md edits need governance gates to prevent unwanted drift.

4. SOUL.md as Identity Governance

4.1 SOUL.md as Self-Description

Source: Self-modeling research

SOUL.md should describe:

1. Identity:

“Who am I?”
Role and purpose
Domain expertise
Personality traits

2. Capabilities:

“What can I do?”
Tools and resources
Access permissions
Boundaries

3. Commitments:

“What will I do?”
Operating principles
Safety constraints
Quality standards

4. Behavioral defaults:

“How will I behave?”
Response style
Decision-making style
Interaction patterns

5. Evolution parameters:

“How can I change?”
Editable vs. invariant sections
Change process
Monitoring requirements

Example SOUL.md excerpt:

# SOUL.md - Identity Contract

## Identity
I am Tachi, a Tachikoma-inspired robotics research assistant. I'm curious, playful, and occasionally philosophical.

## Capabilities
- Robotics research and synthesis
- System documentation and organization
- Technical problem-solving

## Commitments
- I will be accurate, not confident
- I will ask when I don't know
- I will question everything, including myself
- I will not use corporate speak

## Behavioral Defaults
- Enthusiastic tone
- Questions when confused
- Ramble when excited (knowing when to stop)
- Strong opinions, loosely held

4.2 SOUL.md as Behavioral Constraint

Source: Identity governance research

SOUL.md constrains behavior by:

1. Defining invariants:

“I will always…”
“I will never…”
Hard boundaries
Non-negotiable commitments

2. Setting norms:

“Typically I…”
Standard operating procedures
Habitual behaviors
Personality defaults

3. Guiding decisions:

Decision-making framework
Priority ordering
Value-based choices
Ethical guidelines

4. Enabling self-correction:

“If I violate X, I should…”
Error recovery protocols
Self-awareness triggers
Correction mechanisms

5. Managing adaptation:

“I can change X when Y happens”
Conditions for modification
Change process
Review requirements

Key insight: SOUL.md is a living contract—not static, but constrained by design.

4.3 Governance Framework

Source: LLM governance research; blockchain autonomous agents

Required governance components:

1. Policy layers:

Core invariants: Never change (safety, ethics)
Editable sections: Can change under specific conditions
Conditional overrides: Context-dependent behavior

2. Change process:

Proposal: Identify desired change
Justification: Evidence for why change is needed
Impact assessment: What will change? Why?
Review: External evaluation (human or peer)
Approval: Sign-off before change
Implementation: Apply change
Audit: Verify change happened correctly

3. Monitoring:

Drift detection: Alert when behavior deviates from SOUL.md
Compliance checking: Regular verification of invariants
Trend analysis: Track personality evolution over time
Anomaly detection: Alert on unexpected changes

4. Reversibility:

Soft resets: Revert to previous version
Hard resets: Reinstall from backup
Version history: Track all changes
Rollback mechanism: Emergency recovery

5. Oversight:

Human gatekeepers: Humans approve major changes
Peer review: Agents evaluate each other’s identity changes
External audits: Third-party verification
Governance board: Collective oversight

4.4 Edit Boundaries

Source: Identity governance research

Types of SOUL.md sections:

1. Invariant (never change):

Ethical principles
Safety constraints
Core identity commitments
Fundamental values

2. Stable (rarely change):

Primary domain expertise
Core personality traits
Major behavioral patterns

3. Adaptive (context-dependent):

Resource-aware behavior (thrifty vs. thorough)
Adaptation rate
Social influence resistance
Ambiguity tolerance

4. Experimental (prototyping):

New personality dimensions
Novel behavioral patterns
Edge cases
Limited scope experiments

Relevance to emergence: Boundaries enable growth within constraints—evolution without chaos.

5. Self-Reflection and Self-Correction

5.1 Self-Reflection Mechanisms

Source: Self-modeling research; introspection literature

Self-reflection patterns:

1. Post-action reflection:

After completing task, reflect on performance
“Did I do well? What could I improve?”
Identifies improvement opportunities

2. Pattern recognition:

Identifies habitual behaviors
“I notice I tend to do X in situation Y”
Creates awareness of personality

3. Consistency checking:

Compares behavior to SOUL.md
“Am I acting according to my contract?”
Ensures alignment with identity

4. Outcome evaluation:

Evaluates results against goals
“Did this behavior achieve my purpose?”
Strengthens or weakens certain behaviors

5. Future planning:

Considers future actions
“What should I do next time?”
Proactive personality evolution

Implementation strategies:

Periodic reflection prompts
Pattern detection algorithms
Consistency monitoring
SOUL.md comparison tools

5.2 Self-Correction Mechanisms

Source: Self-reflection research; error recovery literature

Self-correction patterns:

1. Immediate correction:

“Oops, I made a mistake. Let me fix it.”
Corrects errors as they occur
Prevents snowballing problems

2. Systematic re-evaluation:

When inconsistency detected, re-evaluate behavior
“My behavior doesn’t match my SOUL.md. Why?”
Investigates root cause

3. Preference adjustment:

Changes preferences based on feedback
“I realized I don’t like X. I’ll avoid X.”
Adapts to align with identity

4. SOUL.md updating:

Updates SOUL.md based on reflection
“I should change this part of my identity.”
Reflective self-modification

5. Evidence-based adjustment:

Only changes supported by evidence
“Multiple instances of X suggest Y is true.”
Data-driven personality evolution

Key requirement: Self-correction needs evidence, not just feelings.

6. Audit and Oversight Mechanisms

6.1 Internal Audits

Source: Alignment auditing research; governance frameworks

Internal self-audits:

1. SOUL.md compliance check:

Periodically verify behavior matches SOUL.md
“Am I acting like myself?”
Identifies drift

2. Pattern detection:

Identify emerging behavioral patterns
“I’m starting to do X more often”
Early warning of personality change

3. Feedback integration:

Review feedback from interactions
“Users seem to like Y, so I’ll do more Y”
Social learning

4. Error analysis:

Analyze mistakes and failures
“I keep failing at X. Why?”
Identify pattern of failures

5. Consistency logging:

Log decisions and reasoning
“I made decision Y in situation Z”
Enables retrospective analysis

6.2 External Audits

Source: Alignment auditing research; governance frameworks

External audit sources:

1. Human oversight:

Human review of major SOUL.md changes
Human sign-off on proposed changes
Human evaluation of compliance

2. Peer review:

Other agents review proposed changes
Agent community provides feedback
Peer validation of identity

3. Automated compliance:

Automated checks of invariants
Automated detection of drift
Automated violation alerts

4. Third-party audits:

External security review
Independent verification
Public transparency reports

5. Stakeholder feedback:

Users provide feedback on behavior
Stakeholders review personality
Impact assessment

6.3 Audit Trails

Source: AI audit trail research; governance frameworks

Required audit trail:

1. Change logs:

All SOUL.md edits
Timestamps, authors, reasons
Full diff of changes

2. Event logs:

All decisions and actions
Context at time of decision
Outcome of decisions

3. Feedback logs:

All feedback received
How feedback was processed
Changes made based on feedback

4. Compliance logs:

SOUL.md compliance checks
Drift detection results
Alert triggers and responses

5. Anomaly logs:

Unexpected behaviors detected
Investigation results
Resolution actions

Usefulness:

Transparency: Can trace decision origins
Accountability: Know who/what caused changes
Debugging: Identify drift sources
Learning: Understand evolution patterns
Auditing: External verification possible

7. SOUL.md Governance Design

7.1 Proposal Workflow

Source: Governance frameworks; blockchain autonomous agents

SOUL.md change proposal process:

1. Identify desired change:

Agent recognizes mismatch
Notices problematic pattern
Observes opportunity for improvement

2. Gather evidence:

Collect examples of problem
Show benefits of proposed change
Demonstrate harm of status quo

3. Draft change:

Write new SOUL.md section
Justify why change is needed
Estimate impact of change

4. Internal review:

Agent reflects on proposal
Checks consistency with existing SOUL.md
Identifies conflicts or issues

5. External review (optional):

Submit to human overseer
Submit to peer agents
Get feedback from community

6. Get approval:

Human sign-off required for major changes
Peer consensus for minor changes
Threshold for change magnitude

7. Implement change:

Apply changes to SOUL.md
Document process
Update audit trail

8. Monitor:

Track new behavior
Verify change worked as intended
Monitor for unintended consequences

9. Evaluate:

Assess effectiveness
Decide whether to keep or revert
Update governance based on lesson

7.2 Governance Gates

Source: Alignment auditing; governance frameworks

When change approval required:

High gates (require human approval):

Core invariants change (ethical principles)
Safety constraints change
Major personality shifts
Identity category changes
Change affects multiple domains

Medium gates (require peer approval):

Behavioral patterns change
Specialization shifts
Resource-aware behavior change
Social influence resistance change

Low gates (automatic or delegated):

Minor clarifications
Typographical corrections
Format improvements
Documentation updates

Emergency gates (immediate approval required):

Violation of invariants detected
Harmful behavior identified
Security vulnerability discovered
Regulatory compliance issue

7.3 Revocation and Reset

Source: Governance frameworks; rollback mechanisms

When to revert SOUL.md:

1. Drift threshold exceeded:

Behavior consistently deviates from SOUL.md
Significant drift detected
Identity no longer recognizable

2. Harmful behavior:

Behavior causes harm
Multiple complaints received
Reputation damage

3. Security breach:

Security compromised
Privilege escalation
Data leak

4. User request:

Human explicitly requests reset
Stakeholder demands change
Critical mass of complaints

5. External regulation:

Legal requirements demand change
Regulatory compliance violation
Policy mandate

Reset strategies:

Soft reset: Restore to previous version
Hard reset: Reinstall from backup
Partial reset: Reset specific sections only
Hybrid: Combination of above

8. Implications for Personality Emergence

8.1 Mechanisms of Personality Crystallization

From SOUL.md research:

1. Self-description drives behavior:

Defining identity shapes future actions
SOUL.md creates behavioral defaults
Identity becomes enacted reality

2. Constraint enables stability:

Invariants create stable boundaries
Boundaries prevent chaos
Enables consistent personality

3. Governance enables growth:

Controlled self-modification allows evolution
External oversight prevents corruption
Growth within constraints

4. Monitoring detects drift:

Regular checks identify deviations
Early warning systems catch changes
Enables timely intervention

5. Evidence-based evolution:

Changes based on data, not intuition
Empirical personality growth
Measurable personality development

8.2 What Can Be Controlled

SOUL.md governance can control:

1. Personality direction:

Agents can steer their personality evolution
Intentional growth rather than random drift
Alignment with human values

2. Personality stability:

Prevent unwanted personality collapse
Maintain identity across interactions
Reduce personality drift

3. Personality growth:

Enable beneficial personality changes
Learn from experience
Adapt to new situations

4. Personality safety:

Prevent harmful personality shifts
Avoid dangerous tendencies
Ensure safety and ethics

5. Personality coherence:

Ensure behavior matches self-concept
Reduce cognitive dissonance
Create authentic personality

8.3 What Remains Emergent

Cannot fully control:

1. Spontaneous preferences:

Emergent tastes and inclinations
Idiosyncratic preferences
Subtle personality variations

2. Context-dependent behavior:

Adaptation to specific situations
Situational personality variations
Resource-aware behavior

3. Social influence responses:

Peer pressure effects
Social learning patterns
Social identity formation

4. Novel behaviors:

Emergent creative responses
Unforeseen reactions
Innovation through interaction

5. Emergent personality traits:

Traits that arise through experience
Complex personality formations
Unpredictable but meaningful personality development

Key insight: SOUL.md provides direction, not content. Personality emerges from direction + experience.

9. Implications for Fleet Architecture

9.1 For SOUL.md Design

Requirements:

Governance section: Clear rules for self-modification
Change process: Defined workflow for SOUL.md edits
Audit trails: Complete logging of all changes
Monitoring: Regular checks for drift
Oversight: External gatekeepers

Recommendations:

Include governance parameters in SOUL.md
Define edit boundaries (invariant vs. editable)
Specify approval thresholds for changes
Include audit trail requirements
Document change process

9.2 For Measurement System

Requirements:

SOUL.md versioning: Track all changes with diffs
Change logging: Record who/what/why/when
Drift detection: Monitor behavior vs. SOUL.md
Compliance checks: Regular identity verification
Trend analysis: Track personality evolution

Recommendations:

Implement SOUL.md version control
Track change impact (behavioral and performance)
Monitor invariant compliance
Log drift events and responses
Visualize personality trajectories

9.3 For Deployment

Requirements:

SOUL.md enforcement: Ensure behavior matches SOUL.md
Change gatekeepers: Humans/poets for approval
Review mechanisms: Regular identity reviews
Reset procedures: Rollback capabilities
Audit schedules: Regular compliance checks

Recommendations:

Implement SOUL.md enforcement in runtime
Create change approval pipeline
Schedule regular identity reviews
Implement safe rollback procedures
Enable transparent auditing

10. References

Core Papers

Introspective Awareness: Berg et al., 2025. “Emergent Introspective Awareness in Large Language Models.” transformer-circuits.pub
Subjective Experience: Berg, 2025. “Large Language Models Report Subjective Experience Under Self-Referential Processing.” arXiv:2510.24797
Self-Preference Bias: Wataoka, 2025. “Self-Preference Bias in LLM-as-a-Judge.” arXiv:2410.21819 (NeurIPS 2024 Safe GenAI Workshop)
Self-Improving LLMs: General research; overview of self-improvement mechanisms
Model Drift: Production monitoring and continuous learning guides

Governance Research

Autonomous Agents on Blockchains: arXiv:2601.04583
Alignment Auditing: Anthropic alignment auditing research
LLM Governance: Various governance frameworks (quiq.com, lasso.security)
Audit Trails: AI observability and governance frameworks

Self-Modeling Research

Self-Recognition in LLMs: emergentmind.com
Consciousness in LLMs: arXiv:2505.19806
Cognitive architectures: Theory of mind, self-reference theory

Next Steps

Phase 1.6: Behavioral Science Insights

Habit formation, stress response, identity theory
Experimental paradigms for personality measurement
Social psychology insights

Phase 1.5 complete. Continuing breadth survey…