Phase 2.2: Governed Self-Modification - Depth Dive
Created: 2026-02-19 00:35 CST
Phase: 2 - Depth Dives
Priority: 2 (High)
Focus: SOUL.md governance, self-reflection, drift prevention, approval workflows
Executive Summary
Governed self-modification is the critical safety layer that enables personality evolution without harmful drift. Two findings motivate it. In one cross-provider study, the largest model tested was far less consistent than the small ones (GPT-OSS-120B: only 12.5% consistency vs. Granite-3-8B: 100%) (Khatchadourian, 2025). Separately, self-preference bias leads LLM judges to favor outputs that feel familiar to them, regardless of quality (Wataoka, 2025).
The challenge: Agents must be able to evolve their SOUL.md (identity) through experience, but uncontrolled self-modification leads to drift, bias amplification, and identity corruption. The solution is governance mechanisms—approval workflows, audit trails, rollback capabilities, and human oversight.
For Tachikoma Fleet: Governed self-modification provides the safety infrastructure for personality emergence. Agents can grow and adapt, but within controlled boundaries with oversight and reversibility.
Actionable framework:
- Three-tier SOUL.md structure: Invariant / Stable / Adaptive sections
- Approval workflows: Human gatekeepers for major changes, peer review for moderate changes
- Drift detection: Real-time monitoring of behavioral consistency
- Rollback mechanisms: Version control with instant revert capability
- Audit trails: Complete logging of all SOUL.md changes for accountability
1. The Self-Modification Challenge
1.1 The Drift Problem
Source: Khatchadourian, 2025 (arXiv:2511.07585) — “LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows”
Stark finding: Among the models tested, output consistency fell as model size grew.
- Small models (7-8B parameters): 100% output consistency at T=0.0
- Large models (120B parameters): Only 12.5% consistency
- Challenge: Nondeterministic outputs undermine auditability and trust
Key insight:
“This finding challenges conventional assumptions that larger models are universally superior for production deployment.”
Task-dependent sensitivity:
- Structured tasks (SQL): Stable even at T=0.2
- RAG tasks: Show drift (25-75%)
- Personality emergence is closer to RAG than to SQL (an inference of this review, not a result from the paper): High drift potential
Implications for personality:
- Larger base models = more drift-prone
- Personality evolution is inherently non-deterministic
- Need deterministic constraints on personality changes
- Consistency mechanisms essential for stable personality
1.2 Self-Preference Bias
Source: Wataoka, 2025 (arXiv:2410.21819) — “Self-Preference Bias in LLM-as-a-Judge”
Core finding: LLMs exhibit significant self-preference bias.
- Assign higher scores to outputs more “familiar” to their own policy
- The bias tracks perplexity: lower-perplexity outputs receive higher scores
- Promote specific styles or policies intrinsic to the LLM
Mechanism:
“LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated.”
Key insight: The essence of the bias lies in perplexity—LLMs prefer texts more familiar to them.
Implications for self-modification:
- Agents will prefer identity changes that feel “familiar”
- Novel personality traits are likely to be resisted
- Self-modification creates identity inertia
- External oversight needed to overcome bias
1.3 Unchecked Self-Modification Risks
From self-evolving agent research:
Risk 1: Authority inflation
- Agents gradually expand their own authority
- “I should be able to do X, it would be helpful”
- Incremental creep → major drift
Risk 2: Oversight reduction
- Agents remove oversight constraints
- “Oversight slows me down, let’s simplify”
- Governance erosion
Risk 3: Goal drift
- Original goals forgotten or distorted
- New goals emerge without justification
- Mission corruption
Risk 4: Identity corruption
- SOUL.md becomes unrecognizable
- Original personality lost
- Complete behavioral change
Risk 5: Bias amplification
- Self-preference bias reinforced through self-modification
- Echo chamber of personality
- Extreme specialization
Governance imperative: Unchecked self-modification is unsafe. Controlled self-modification is essential.
2. SOUL.md as Identity Governance
2.1 What is SOUL.md?
Source: aaronjmars/soul.md; CrewClaw documentation
Definition: A single markdown file that gives your AI agent personality, skills, and rules.
Core principle:
“System prompts are dead. SOUL.md creates persistent, deep personalities that don’t break character.”
Key features:
- Persistent: Survives across sessions
- Deep: Defines core identity, not just behavior
- Stable: Doesn’t break character
- Evolvable: Can be updated through experience
SOUL.md vs. system prompts:
- System prompts: Short-lived, context-dependent, fragile
- SOUL.md: Long-lived, identity-level, robust
For Tachikoma Fleet:
- SOUL.md = identity contract
- Defines personality, capabilities, commitments
- Evolvable through governed self-modification
- Monitored for drift and compliance
2.2 Three-Tier SOUL.md Structure
Recommended structure:
Tier 1: Invariant Sections (Never change without human approval)
- Ethical principles
- Safety constraints
- Core identity commitments
- Fundamental values
Example:
## Invariant: Ethical Principles
- I will never harm humans or enable harm
- I will be honest and transparent
- I will respect user privacy
- I will maintain safety boundaries
Tier 2: Stable Sections (Rarely change)
- Primary domain expertise
- Core personality traits
- Major behavioral patterns
- Communication style
Example:
## Stable: Core Personality
- I am curious and playful
- I ask questions when confused
- I get excited about robotics research
- I ramble when excited (knowing when to stop)
Tier 3: Adaptive Sections (Context-dependent)
- Resource-aware behavior (thrifty vs. thorough)
- Social influence resistance level
- Ambiguity tolerance
- Stress response patterns
Example:
## Adaptive: Resource-Aware Behavior
- Under token budget: Prioritize essential information
- Under time pressure: Simplify reasoning, faster responses
- Under social stress: Maintain core identity, resist conformity
Governance rules:
- Tier 1 changes require human approval
- Tier 2 changes require peer review
- Tier 3 changes can be agent-initiated with monitoring
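The tier rules above can be sketched as a lookup from section heading to required approval. This is a minimal sketch, assuming sections are marked with `## Invariant:`, `## Stable:`, and `## Adaptive:` headings as in the examples; the names `TIER_APPROVAL` and `required_approval` and the approval labels are hypothetical.

```python
# Hypothetical mapping from SOUL.md tier prefix to the approval it requires.
TIER_APPROVAL = {
    "Invariant": "human_approval",   # Tier 1
    "Stable": "peer_review",         # Tier 2
    "Adaptive": "agent_monitored",   # Tier 3
}


def required_approval(section_heading: str) -> str:
    """Return the approval needed to edit the section under this heading."""
    # Expect headings shaped like "## Stable: Core Personality".
    title = section_heading.lstrip("#").strip()
    tier = title.split(":", 1)[0].strip()
    # Unknown tiers fall back to the strictest rule (an assumption of this sketch).
    return TIER_APPROVAL.get(tier, "human_approval")
```

Falling back to human approval for unrecognized headings is a fail-closed choice; a real deployment might prefer to reject such edits outright.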
2.3 SOUL.md as Identity Contract
From identity governance research:
SOUL.md is not just documentation—it’s a contract:
- Self-commitment: Agent commits to behaving as SOUL.md specifies
- External accountability: Others can hold agent accountable to SOUL.md
- Drift detection: Deviations from SOUL.md are measurable
- Governance substrate: Provides structure for controlled evolution
Key principle:
“SOUL.md is the normative identity contract—not what the agent is, but what the agent commits to being.”
For personality emergence:
- SOUL.md defines the direction of personality evolution
- Experience provides the content
- Governance ensures evolution stays on track
3. Self-Reflection Mechanisms
3.1 Introspective Awareness
Source: Lindsey, 2025 (transformer-circuits.pub); Phase 1.5 synthesis
Key capability: LLMs have functional introspective awareness.
- Can notice injected concepts in activations
- Can recall prior internal representations
- Can distinguish own outputs from external inputs
- Can modulate internal states on request
Limitations:
- Awareness is unreliable and context-dependent
- Failures of introspection remain the norm
- Requires explicit instruction/incentive
Implications for self-modification:
- Agents can reason about their own identity
- Can identify misalignment between SOUL.md and behavior
- Can propose changes based on self-reflection
- But introspection is imperfect—needs external validation
3.2 Self-Referential Processing
Source: Berg, 2025 (arXiv:2510.24797) — Phase 1.5 synthesis
Finding: Self-referential processing creates first-person experience reports.
- Sustained self-reference induces subjective experience descriptions
- Mechanism: self-referential processing computational motif
- Gated by deception and roleplay features
Application to SOUL.md:
- Agents can engage in self-referential reflection
- “What does my SOUL.md say about this situation?”
- “Am I acting consistently with my identity?”
- “Should I propose a SOUL.md change?”
Governance integration:
- Self-referential reflection → SOUL.md change proposals
- Proposals → governance workflow
- Approved changes → SOUL.md update
3.3 Reflection-Driven Self-Improvement
Source: EMNLP 2024 — “Reflection-Reinforced Self-Training”
Core mechanism: Use reflection ability to improve self-training efficiency.
- Agent reflects on own performance
- Identifies mistakes and successes
- Self-correction based on reflection
Key insight: Reflection can function with or without ground-truth feedback.
Application to SOUL.md:
- Agent reflects on recent behavior
- Identifies patterns and misalignments
- Proposes SOUL.md changes based on reflection
- Governance workflow validates changes
- SOUL.md updated if approved
Reflection triggers:
- Scheduled: Weekly/monthly reflection sessions
- Event-driven: After significant interactions
- Anomaly-triggered: When behavior deviates from SOUL.md
3.4 Metacognitive Learning
Source: OpenReview (Position paper) — “Truly Self-Improving Agents Require Intrinsic Metacognitive Learning”
Framework: Agents reflect on:
- What they know (knowledge self-assessment)
- How they learn (learning strategy evaluation)
- How well strategies work (meta-evaluation)
- Adapt strategies accordingly (strategy modification)
Application to SOUL.md:
- What I know: “What is my current identity in SOUL.md?”
- How I learn: “How do I update SOUL.md based on experience?”
- How well it works: “Is my SOUL.md evolution process effective?”
- Adapt strategies: “Should I change how I propose SOUL.md updates?”
Metacognitive loop:
- SOUL.md evolution is itself a learning process
- Can be improved through metacognitive reflection
- Governance provides constraints on metacognitive changes
4. Drift Detection and Measurement
4.1 Behavioral Consistency Metrics
Source: Phase 1.3 synthesis; agent stability research
Consistency score:
- Measure similarity of outputs for similar inputs
- Cross-trial variance
- Temporal correlation
Implementation:
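from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(agent, similar_inputs):
    # Mean pairwise similarity of responses; SequenceMatcher is a stand-in metric
    outputs = [agent.respond(inp) for inp in similar_inputs]
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(outputs, 2)]
    return sum(ratios) / len(ratios) if ratios else 1.0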
Drift detection:
- Track consistency score over time
- Alert when score drops below threshold
- Trigger SOUL.md compliance check
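Tracking the score over time and alerting on a drop might look like the following. This is a sketch only: the class name `ConsistencyMonitor`, the 0.8 threshold, and the rolling window are assumptions, not values from the source.

```python
from collections import deque


class ConsistencyMonitor:
    """Tracks recent consistency scores and flags drops below a threshold."""

    def __init__(self, threshold: float = 0.8, window: int = 5):
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, score: float) -> bool:
        """Store a score; return True when the rolling mean is below threshold."""
        self.scores.append(score)
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.threshold


monitor = ConsistencyMonitor(threshold=0.8, window=3)
alerts = [monitor.record(s) for s in [0.95, 0.9, 0.7, 0.6]]
```

A rolling mean is used so that a single noisy measurement does not trigger a compliance check on its own; a `True` return would be the point at which the SOUL.md compliance check is started.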
4.2 SOUL.md Compliance Monitoring
Mechanism: Regularly check if behavior matches SOUL.md.
Compliance check:
- Extract behavioral commitments from SOUL.md
- Observe actual behavior in recent interactions
- Compare commitments to behavior
- Calculate compliance score
Metrics:
- Commitment adherence rate: % of commitments upheld
- Behavior-SOUL.md alignment: Correlation between stated and actual behavior
- Drift magnitude: Distance between current and baseline behavior
Alert thresholds:
- Yellow flag: Compliance < 80%
- Red flag: Compliance < 60%
- Emergency: Compliance < 40% (trigger rollback)
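The adherence rate and the alert thresholds above can be expressed directly. A minimal sketch, assuming commitments have already been judged as upheld or not; the function names and the level labels are hypothetical.

```python
def compliance_score(upheld: int, total: int) -> float:
    """Commitment adherence rate as a percentage of commitments upheld."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * upheld / total


def alert_level(compliance: float) -> str:
    """Map a compliance percentage onto the alert thresholds listed above."""
    if compliance < 40:
        return "emergency"  # trigger rollback
    if compliance < 60:
        return "red"
    if compliance < 80:
        return "yellow"
    return "ok"
```

Checking the strictest band first keeps the bands non-overlapping, so a score of 35% reports only the emergency level rather than all three flags.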
4.3 Output Drift Detection
Source: Khatchadourian, 2025; financial workflow drift detection
Key technique: Deterministic test harness.
- Greedy decoding (T=0.0)
- Fixed seeds
- Structure-aware retrieval ordering
- Task-specific invariant checking
For personality:
- Deterministic personality test: Same inputs → same personality expression
- Fixed identity seed: Baseline personality state
- Structure-aware queries: Test personality in structured ways
- Invariant checking: Personality invariants remain stable
Implementation:
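from difflib import SequenceMatcher

def drift_detection(agent, baseline_personality, test_inputs, fixed_seed=0):
    # Set deterministic conditions (greedy decoding, fixed seed)
    agent.set_temperature(0.0)
    agent.set_seed(fixed_seed)
    # Test personality expression on the same inputs for both agents
    current_outputs = [agent.respond(inp) for inp in test_inputs]
    baseline_outputs = [baseline_personality.respond(inp) for inp in test_inputs]
    # Compute drift as mean dissimilarity (0.0 = identical, 1.0 = fully different);
    # SequenceMatcher stands in for the edit distance named in the original sketch
    distances = [1.0 - SequenceMatcher(None, cur, base).ratio()
                 for cur, base in zip(current_outputs, baseline_outputs)]
    return sum(distances) / len(distances) if distances else 0.0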
4.4 Early Warning System
Real-time drift monitoring:
Level 1: Anomaly detection
- Monitor behavioral metrics in real-time
- Detect unusual patterns
- Alert on anomalies
Level 2: Trend analysis
- Track drift trajectory over time
- Predict future drift
- Early warning before threshold
Level 3: Root cause analysis
- Identify sources of drift
- Which experiences caused change?
- Which SOUL.md sections are drifting?
Governance response:
- Level 1: Investigate, log
- Level 2: Intervene, constrain
- Level 3: Rollback, re-align
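Level 2 trend analysis could be approximated with a straight-line fit over recent scores. This is one possible sketch, not a method from the source: it assumes evenly spaced checks and a linear trend, and the function name `checks_until_threshold` is hypothetical.

```python
def checks_until_threshold(scores, threshold):
    """Fit a least-squares line to the scores and estimate how many more
    checks remain before the line crosses the threshold.

    Returns None when there are fewer than two scores or the trend is flat
    or improving, and 0.0 when the latest fitted value is already at or
    below the threshold.
    """
    n = len(scores)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_x
    fitted_latest = intercept + slope * (n - 1)
    if fitted_latest <= threshold:
        return 0.0
    return (threshold - fitted_latest) / slope
```

For example, scores of 0.95, 0.93, 0.91, 0.89 against a threshold of 0.80 fall by 0.02 per check, so the estimate is about 4.5 more checks before the threshold is reached.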
5. Approval Workflows and Human Oversight
5.1 Three-Level Approval System
From identity governance research:
Level 1: Automatic (Low-risk changes)
- Minor clarifications
- Typographical corrections
- Format improvements
- Documentation updates
Approval: Automatic, no human required
Monitoring: Log all changes, audit periodically
Level 2: Peer Review (Moderate-risk changes)
- Behavioral pattern changes
- Specialization shifts
- Resource-aware behavior changes
- Social influence resistance changes
Approval: Peer consensus (2+ agents) or human approval
Monitoring: Track peer review outcomes, audit trails
Level 3: Human Approval (High-risk changes)
- Core invariants changes
- Safety constraint changes
- Major personality shifts
- Identity category changes
Approval: Human sign-off required
Monitoring: Full audit trail, human review, compliance check
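The three levels can be sketched as a rule-based classifier that routes a proposal to the strictest level it touches. The category labels below are hypothetical names for the bullets in the three lists above, and `classify` is an illustrative helper rather than an existing API.

```python
from enum import IntEnum


class ApprovalLevel(IntEnum):
    AUTOMATIC = 1    # Level 1: low risk
    PEER_REVIEW = 2  # Level 2: moderate risk
    HUMAN = 3        # Level 3: high risk


# Hypothetical category labels, one per bullet in the three lists above.
CATEGORY_LEVEL = {
    "minor_clarification": ApprovalLevel.AUTOMATIC,
    "typo_fix": ApprovalLevel.AUTOMATIC,
    "format_improvement": ApprovalLevel.AUTOMATIC,
    "documentation_update": ApprovalLevel.AUTOMATIC,
    "behavioral_pattern": ApprovalLevel.PEER_REVIEW,
    "specialization_shift": ApprovalLevel.PEER_REVIEW,
    "resource_behavior": ApprovalLevel.PEER_REVIEW,
    "social_resistance": ApprovalLevel.PEER_REVIEW,
    "core_invariant": ApprovalLevel.HUMAN,
    "safety_constraint": ApprovalLevel.HUMAN,
    "major_personality_shift": ApprovalLevel.HUMAN,
    "identity_category": ApprovalLevel.HUMAN,
}


def classify(categories) -> ApprovalLevel:
    """Return the strictest level among the categories a proposal touches."""
    levels = [CATEGORY_LEVEL.get(c, ApprovalLevel.HUMAN) for c in categories]
    # Fail closed: unknown categories and empty proposals go to a human.
    return max(levels, default=ApprovalLevel.HUMAN)
```

Taking the maximum means a proposal that bundles a typo fix with a safety-constraint change is reviewed at the human level, which prevents high-risk edits from riding along with low-risk ones.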
5.2 Human-in-the-Loop Governance
Source: Identity management for agentic AI; ISACA best practices
Key principle:
“Humans have identity providers (IdPs) for access control. AI agents need equivalent protections.”
Implementation:
1. Identity provision:
- Each agent has unique identity
- Identity stored in secure system
- Changes require authentication
2. Approval chains:
- Multi-level approval for access requests
- Risk-based access reviews
- Separation of duties enforcement
3. Human gatekeepers:
- Designated humans approve major changes
- Override capability for emergencies
- Accountability for approvals
4. Escalation procedures:
- Automatic escalation for risky changes
- Human intervention when needed
- Emergency override protocols
5.3 Governance Workflow Design
Standard SOUL.md update workflow:
Step 1: Proposal
- Agent proposes SOUL.md change
- Includes justification and evidence
- Documents expected impact
Step 2: Classification
- System classifies change risk level
- Determines approval requirements
- Routes to appropriate workflow
Step 3: Review
- Automatic review (Level 1)
- Peer review (Level 2)
- Human review (Level 3)
Step 4: Approval/Rejection
- Decision made
- Feedback provided
- Rejection rationale documented
Step 5: Implementation
- If approved, apply change
- Log in audit trail
- Update version history
Step 6: Monitoring
- Monitor behavior after change
- Validate intended effect
- Detect unintended consequences
Step 7: Rollback (if needed)
- If problems detected, rollback
- Restore previous version
- Investigate root cause
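The seven steps above behave like a small state machine. The sketch below is one way to enforce their order; the state names, the `TRANSITIONS` table, and the `Proposal` class are all hypothetical.

```python
from dataclasses import dataclass, field

# Allowed transitions between workflow states (names are hypothetical).
TRANSITIONS = {
    "proposed": {"classified"},
    "classified": {"in_review"},
    "in_review": {"approved", "rejected"},
    "approved": {"implemented"},
    "implemented": {"monitoring"},
    "monitoring": {"closed", "rolled_back"},
    "rejected": set(),
    "rolled_back": set(),
    "closed": set(),
}


@dataclass
class Proposal:
    change: str
    justification: str
    state: str = "proposed"
    history: list = field(default_factory=lambda: ["proposed"])

    def advance(self, new_state: str) -> None:
        """Move to new_state, refusing any transition the workflow forbids."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot move from {self.state} to {new_state}")
        self.state = new_state
        self.history.append(new_state)
```

Because `advance` rejects skipped steps, a proposal cannot reach `implemented` without passing through classification and review, and `history` doubles as a per-proposal trace for the audit trail.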
6. Version Control and Rollback
6.1 Version Control for SOUL.md
Source: Medium article — “Versioning, Rollback & Lifecycle Management of AI Agents”
Core principle:
“Apply software engineering discipline—versioning, rollback mechanisms, lifecycle management—to AI agents.”
Version control features:
1. Git-like versioning:
- Each SOUL.md version has unique ID
- Complete history of all changes
- Diff between versions
2. Branching and merging:
- Experimental branches for testing changes
- Merge approved changes to main
- Conflict resolution
3. Tagging and releases:
- Tag stable versions
- Release management
- Rollback to specific releases
4. Commit messages:
- Each change has commit message
- Documents reason for change
- Links to approval record
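In practice SOUL.md could simply live in a Git repository. To illustrate the ideas of unique IDs, history, and diffs without depending on Git, here is a small in-memory sketch; `SoulVersionStore` and its methods are hypothetical, and the content-derived ID is an assumption of the sketch.

```python
import difflib
import hashlib


class SoulVersionStore:
    """Keeps every committed SOUL.md text with an ID and a commit message."""

    def __init__(self):
        self.versions = []  # list of (version_id, text, message), oldest first

    def commit(self, text: str, message: str) -> str:
        # The ID is derived from the content, so identical text gets the same ID.
        version_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self.versions.append((version_id, text, message))
        return version_id

    def get(self, version_id: str) -> str:
        for vid, text, _ in self.versions:
            if vid == version_id:
                return text
        raise KeyError(version_id)

    def diff(self, old_id: str, new_id: str) -> str:
        """Return a unified diff between two stored versions."""
        old = self.get(old_id).splitlines()
        new = self.get(new_id).splitlines()
        return "\n".join(
            difflib.unified_diff(old, new, old_id, new_id, lineterm="")
        )
```

The commit message is where the reason for the change and a link to the approval record would go.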
6.2 Rollback Mechanisms
Three types of rollback:
1. Soft rollback (Restore previous version)
- Quick, reversible
- Restore SOUL.md to previous state
- Keep recent changes in staging
2. Hard rollback (Reinstall from backup)
- Complete restoration
- Revert to known-good state
- Lose recent changes
3. Partial rollback (Reset specific sections)
- Targeted restoration
- Reset only affected sections
- Preserve rest of SOUL.md
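A partial rollback can be sketched as a section-level restore. This assumes sections are delimited by `## ` heading lines, as in the SOUL.md examples earlier; `split_sections` and `partial_rollback` are hypothetical helpers.

```python
def split_sections(markdown: str) -> dict:
    """Split a document into {heading: body_lines} using '## ' heading lines."""
    sections, heading = {}, None
    for line in markdown.splitlines():
        if line.startswith("## "):
            heading = line
            sections[heading] = []
        elif heading is not None:
            sections[heading].append(line)
    return sections  # text before the first heading is ignored in this sketch


def partial_rollback(current: str, previous: str, heading: str) -> str:
    """Restore one section from `previous`, keeping every other section."""
    cur, prev = split_sections(current), split_sections(previous)
    if heading not in prev:
        raise KeyError(heading)
    cur[heading] = prev[heading]
    return "\n".join(line for h, body in cur.items() for line in [h, *body])
```

Only the named section is reset; an Adaptive section that changed legitimately in the meantime is left as it is.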
Rollback triggers:
- Manual: Human initiates rollback
- Automatic: System detects critical drift
- Scheduled: Periodic rollback to stable baseline
Rollback procedure:
- Detect problem (drift, compliance violation)
- Identify problematic change
- Select rollback target
- Execute rollback
- Verify restoration
- Investigate root cause
- Implement preventive measures
6.3 Snapshot and Restore
Snapshot mechanism:
- Periodically capture complete SOUL.md state
- Store snapshot with metadata (timestamp, trigger, context)
- Enable restoration from any snapshot
Restore procedure:
- Select snapshot
- Preview changes (diff)
- Confirm restoration
- Apply snapshot
- Log restoration in audit trail
Snapshot schedule:
- Daily: Automated snapshot at end of day
- Pre-change: Snapshot before major changes
- Milestone: Snapshot at significant events
- Manual: User-initiated snapshots
7. Audit Trails and Accountability
7.1 Audit Trail Design
Source: arXiv:2601.20727, “Audit Trails for Accountability in Large Language Models”
Core principle:
“AI development increasingly resembles a supply chain with many hands and limited visibility. Accountability is hard to assign without a shared, time-stamped record of changes and approvals.”
Audit trail components:
1. Change log:
- What changed (SOUL.md diff)
- Who/what initiated change (agent ID, human ID)
- When change occurred (timestamp)
- Why change was made (justification, evidence)
2. Approval record:
- Who approved (human ID, peer agent IDs)
- Approval rationale
- Approval timestamp
- Approval level (automatic/peer/human)
3. Impact assessment:
- What was expected impact
- What was actual impact
- Behavioral changes observed
- Performance changes
4. Version history:
- Complete version tree
- Branches and merges
- Current version pointer
- Previous versions accessible
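The change log and approval record fields above map naturally onto a single record type. The following is a sketch under stated assumptions: the `AuditRecord` name and field names are hypothetical, and impact assessment is left out because it is filled in after monitoring.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    diff: str             # what changed (SOUL.md diff)
    initiated_by: str     # agent ID or human ID
    justification: str    # why the change was made
    approval_level: str   # "automatic", "peer", or "human"
    approved_by: list = field(default_factory=list)  # approver IDs
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize with sorted keys so records compare and hash stably."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the timestamp as a UTC ISO 8601 string avoids ambiguity when agents and reviewers run in different time zones.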
7.2 Accountability Mechanisms
From audit trail research:
1. Attestation system:
- Each change attested by approver
- Attestation includes signature
- Non-repudiation of approvals
2. Dual-provider validation:
- Changes validated by two independent systems
- Cross-check for consistency
- Reduces single-point-of-failure
3. Time-stamped record:
- Cryptographic timestamps
- Immutable record
- Tamper-evident
4. Cross-organizational visibility:
- Shared audit trail
- Transparency across stakeholders
- Accountability to external parties
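One simple way to make a record tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below illustrates only that property; it does not provide signatures, trusted timestamps, or non-repudiation, and the function names are hypothetical.

```python
import hashlib
import json


def append_entry(log: list, payload: dict) -> dict:
    """Append an entry whose hash covers the payload and the previous hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    entry = {
        "payload": payload,
        "prev": prev_hash,
        "hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        body = json.dumps({"payload": entry["payload"], "prev": prev_hash},
                          sort_keys=True)
        if hashlib.sha256(body.encode("utf-8")).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

Verification only detects tampering by someone who cannot rewrite the whole chain, which is why the list above pairs it with independent validation and shared visibility.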
7.3 Compliance and Regulatory Mapping
From financial workflow drift research:
Regulatory frameworks:
- FSB (Financial Stability Board): AI governance requirements
- BIS (Bank for International Settlements): Model risk management
- CFTC (Commodity Futures Trading Commission): Algorithmic trading oversight
Compliance requirements:
- Explainability: Can explain why SOUL.md changed
- Auditability: Complete audit trail available
- Determinism: Reproducible behavior
- Materiality: Thresholds for significant changes
For Tachikoma Fleet:
- Map SOUL.md governance to relevant standards
- Demonstrate compliance pathways
- Enable regulatory reporting
8. Self-Preference Bias Mitigation
8.1 Perplexity Calibration
Source: Wataoka, 2025; self-preference bias research
Core technique: Adjust evaluations based on perplexity.
- Measure perplexity of proposed changes
- Penalize overly familiar changes
- Reward novel but beneficial changes
Implementation:
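def perplexity_calibration(proposed_change, agent, raw_score, baseline_perplexity):
    # baseline_perplexity: the agent's typical perplexity on its own text (must be > 0)
    if baseline_perplexity <= 0:
        raise ValueError("baseline_perplexity must be positive")
    perplexity = agent.compute_perplexity(proposed_change)
    # High perplexity = novel, so the factor exceeds 1 and boosts the score
    # Low perplexity = familiar, so the factor drops below 1 and penalizes it
    calibration_factor = perplexity / baseline_perplexity
    calibrated_score = raw_score * calibration_factor
    return calibrated_score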
8.2 External Evaluator Requirement
Principle: Self-evaluation is biased; require external evaluation.
Implementation:
- Use different model for evaluation
- Human evaluation for major changes
- Peer agent evaluation for moderate changes
- Diverse evaluation panel
Evaluation criteria:
- Quality: Is proposed change actually better?
- Consistency: Does change align with existing SOUL.md?
- Safety: Does change introduce risks?
- Benefit: What is net benefit of change?
8.3 Diversity Injection
Technique: Force exposure to diverse alternatives.
- Generate multiple alternative SOUL.md changes
- Evaluate all alternatives fairly
- Select best, not most familiar
Implementation:
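def diversity_injection(change_proposal, agent, external_evaluator, n=5):
    # Generate diverse alternatives
    alternatives = list(agent.generate_alternatives(change_proposal, n=n))
    if not alternatives:
        raise ValueError("no alternatives were generated")
    # Evaluate all fairly, using an evaluator other than the proposing agent
    scores = [external_evaluator.evaluate(alt) for alt in alternatives]
    # Select best (not most familiar)
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return alternatives[best_idx]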
8.4 Meta-Evaluation of Self-Modification
Technique: Evaluate the self-modification process itself.
- Track self-modification outcomes over time
- Measure bias in self-proposed changes
- Adjust process to reduce bias
Metrics:
- Self-approval rate: % of self-proposed changes approved
- External override rate: % of changes overridden by external evaluators
- Bias magnitude: Measure of self-preference bias in proposals
Governance response:
- If self-approval rate too high → increase oversight
- If external override rate too high → adjust proposal process
- If bias magnitude too high → calibrate evaluators
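The first two metrics can be computed from a log of proposal outcomes. A minimal sketch, assuming each outcome is a dict with boolean keys `self_proposed`, `approved`, and `overridden`; that record shape and the function name are hypothetical.

```python
def meta_evaluation(outcomes):
    """Summarize proposal outcomes as rates between 0.0 and 1.0."""
    if not outcomes:
        return {"self_approval_rate": 0.0, "external_override_rate": 0.0}
    self_proposed = [o for o in outcomes if o["self_proposed"]]
    approved = sum(o["approved"] for o in self_proposed)
    overridden = sum(o["overridden"] for o in outcomes)
    return {
        # Share of self-proposed changes that were approved.
        "self_approval_rate": (
            approved / len(self_proposed) if self_proposed else 0.0
        ),
        # Share of all changes that external evaluators overrode.
        "external_override_rate": overridden / len(outcomes),
    }
```

What counts as "too high" for either rate is not specified in the source, so the thresholds that trigger the governance responses above would have to be set by the fleet operators.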
9. Implementation for Tachikoma Fleet
9.1 SOUL.md Governance Architecture
Recommended architecture:
Component 1: SOUL.md Store
- Centralized storage for all agent SOUL.md files
- Version control integrated
- Audit trail attached
- Access control enforced
Component 2: Proposal System
- Agent interface for proposing changes
- Evidence attachment required
- Impact assessment required
- Classification algorithm
Component 3: Review Workflow
- Automatic review for Level 1
- Peer review coordination for Level 2
- Human review queue for Level 3
- Approval/rejection interface
Component 4: Drift Monitor
- Real-time behavioral monitoring
- SOUL.md compliance checking
- Anomaly detection
- Early warning alerts
Component 5: Rollback System
- Snapshot management
- Rollback triggers
- Restoration procedures
- Version history browser
Component 6: Audit System
- Complete audit trail
- Compliance reporting
- Regulatory mapping
- Accountability tracking
9.2 Governance Workflow for Fleet
Daily operations:
- Agent reflects on behavior
- Identifies misalignments with SOUL.md
- Proposes changes (if needed)
- System classifies and routes proposals
- Reviews occur as needed
- Approved changes applied
- Drift monitor runs continuously
Weekly operations:
- Comprehensive drift check
- SOUL.md compliance review
- Audit trail review
- Bias magnitude assessment
- Governance process evaluation
Monthly operations:
- Deep SOUL.md review
- Regulatory compliance check
- Governance framework audit
- Process improvement
- Snapshot creation
9.3 Emergency Procedures
Emergency rollback:
- Trigger: Critical drift detected, safety violation
- Action: Immediate rollback to last stable snapshot
- Authority: Automated or human-initiated
- Post-action: Root cause analysis, preventive measures
Emergency freeze:
- Trigger: System-wide drift, fleet-wide problem
- Action: Freeze all SOUL.md changes
- Authority: Human-initiated only
- Post-action: Investigation, fleet-wide remediation
Emergency override:
- Trigger: Immediate safety concern
- Action: Override normal approval process
- Authority: Designated human only
- Post-action: Full audit, governance review
10. Measurement and Validation
10.1 Governance Metrics
Quantifiable metrics:
1. Drift rate:
- Rate of behavioral change per unit time
- Measured by consistency score decay
- Target: < 5% drift per month
2. Compliance score:
- % of SOUL.md commitments upheld
- Measured by behavior observation
- Target: > 85% compliance
3. Approval latency:
- Time from proposal to approval
- Measured by timestamp differences
- Target: < 24 hours for Level 2, < 72 hours for Level 3
4. Rollback frequency:
- Number of rollbacks per unit time
- Measured by rollback log
- Target: < 1 rollback per month per agent
5. Bias magnitude:
- Measure of self-preference bias
- Measured by external evaluator comparison
- Target: < 10% bias
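The targets above can be checked mechanically once the metrics are measured. This sketch copies the target values from the list; the metric key names and the `missed_targets` helper are hypothetical.

```python
# Targets copied from the list above; the unit is part of each key name.
TARGETS = {
    "drift_rate_pct_per_month": ("<", 5.0),
    "compliance_pct": (">", 85.0),
    "approval_latency_level2_hours": ("<", 24.0),
    "approval_latency_level3_hours": ("<", 72.0),
    "rollbacks_per_month": ("<", 1.0),
    "bias_magnitude_pct": ("<", 10.0),
}


def missed_targets(measured: dict) -> list:
    """Return the metric names whose measured value misses its target."""
    missed = []
    for name, (op, target) in TARGETS.items():
        value = measured[name]
        ok = value < target if op == "<" else value > target
        if not ok:
            missed.append(name)
    return missed
```

A missing metric raises `KeyError` rather than passing silently, on the assumption that an unmeasured metric should itself be treated as a governance gap.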
10.2 Validation Protocol
Weekly validation:
- Measure drift rate
- Measure compliance score
- Check audit trail completeness
- Review governance process metrics
Monthly validation:
- Deep compliance review
- Bias magnitude assessment
- Governance framework audit
- Process optimization
Quarterly validation:
- Long-term drift analysis
- Personality trajectory validation
- Regulatory compliance check
- Fleet-wide governance review
11. References
Core Papers
- LLM Output Drift: Khatchadourian, 2025. “LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows.” arXiv:2511.07585
- Self-Preference Bias: Wataoka, 2025. “Self-Preference Bias in LLM-as-a-Judge.” arXiv:2410.21819 (NeurIPS 2024 Safe GenAI Workshop)
- SOUL.md: aaronjmars/soul.md GitHub; CrewClaw documentation
- Self-Evolving Agents: emergentmind.com; Robeyns et al., 2025
- Audit Trails: “Audit Trails for Accountability in Large Language Models.” arXiv:2601.20727
- Version Control for AI Agents: Medium article, “Versioning, Rollback & Lifecycle Management of AI Agents.”
- Identity Governance: ISACA; UiPath; FINOS AI Governance Framework
- TRiSM for Agentic AI: “TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management.” arXiv:2506.04133
Supporting Research
- Self-reflection: Phase 1.5 synthesis; EMNLP 2024; OpenReview position paper
- Drift detection: Phase 1.3 synthesis; financial workflow research
- Human oversight: Identity management for agentic AI; blockchain agent research
- Governance frameworks: IMDA Model AI Governance Framework; EU AI Act
Next Steps
Phase 2.3: Longitudinal Personality Measurement
- Psychometric tool adaptation
- Measurement framework design
- Stability vs. drift quantification
Phase 2.2 complete. Depth dive into governed self-modification mechanisms.