Phase 1.1: LLM Agents & Tool Use Survey
Created: 2026-02-18 20:45 CST
Phase: 1 - Breadth Survey
Focus: Planning loops, action selection, tool calling, error recovery, long-horizon execution
Executive Summary
LLM-based agents represent a fundamental shift from static language models to dynamic systems that can reason, act, and adapt over extended time horizons. This survey examines the mechanisms by which agents plan, select tools, execute actions, recover from errors, and maintain coherence across long-horizon tasks.
Key insight for personality emergence: Agent behavior is shaped not just by prompts, but by the interaction loop between reasoning, action, observation, and reflection. This loop creates opportunities for behavioral divergence even in identical base models.
1. Planning Architectures
1.1 ReAct (Reasoning + Acting)
Source: Yao et al., 2022 (arXiv:2210.03629)
Core mechanism: Interleaves reasoning traces with task-specific actions.
Thought → Action → Observation → Thought → Action → ...
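The loop can be sketched in a few lines. This is a minimal illustration, not the implementation from Yao et al.: `scripted_policy` stands in for an LLM call, and `lookup` is a hypothetical tool over a toy dictionary.

```python
from typing import Callable, Optional

# Toy knowledge base standing in for an external environment (assumption).
FACTS = {"capital of France": "Paris"}

def lookup(query: str) -> str:
    """Hypothetical tool: return a fact, or a non-informative result."""
    return FACTS.get(query, "no results")

def scripted_policy(history: list) -> tuple:
    """Stand-in for the LLM: returns (thought, action_name, action_input)."""
    if not history:
        return ("I should look this up.", "lookup", "capital of France")
    last_observation = history[-1][1]
    return (f"The observation says {last_observation}.", "finish", last_observation)

def react_loop(policy: Callable, tools: dict, max_steps: int = 5) -> tuple:
    """Run Thought -> Action -> Observation until 'finish' or the step limit."""
    history = []   # (action, observation) pairs fed back to the policy
    trace = []     # human-readable trace for debugging
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        trace.append(f"Thought: {thought}")
        trace.append(f"Action: {action}[{arg}]")
        if action == "finish":
            return arg, trace
        observation = tools[action](arg)
        trace.append(f"Observation: {observation}")
        history.append((f"{action}[{arg}]", observation))
    answer: Optional[str] = None   # step limit reached without an answer
    return answer, trace

answer, trace = react_loop(scripted_policy, {"lookup": lookup})
```

The `max_steps` cap is a design choice for the sketch: without it, a policy that never emits `finish` would run forever.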
Key findings:
- Reduces hallucination vs. pure chain-of-thought (CoT) by grounding reasoning in external actions
- Enables error recovery through observation-feedback loops
- Human-interpretable traces facilitate debugging and optimization
Error patterns:
- Loop repetition: Model generates the same thought/action repeatedly (counted under reasoning errors in the paper's error analysis)
- Non-informative retrieval (23% of errors): Failed searches derail reasoning
- Reasoning vs. acting imbalance: Over-reliance on one modality
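Loop repetition is straightforward to detect from a trace. The sketch below assumes a trace is a list of (thought, action) string pairs; the function name and the `max_repeats` threshold are illustrative choices, not taken from the ReAct paper.

```python
from typing import Optional

def first_repeated_step(steps: list, max_repeats: int = 2) -> Optional[int]:
    """Return the index at which the same (thought, action) pair has occurred
    `max_repeats` times in a row, or None if no such run exists."""
    run_length = 1
    for i in range(1, len(steps)):
        run_length = run_length + 1 if steps[i] == steps[i - 1] else 1
        if run_length >= max_repeats:
            return i
    return None

# Hypothetical trace in which the agent is stuck on one search.
stuck = [
    ("Search for X", "search[X]"),
    ("Search for X", "search[X]"),
    ("Search for X", "search[X]"),
]
varied = [("Search for X", "search[X]"), ("Read result", "lookup[X]")]
```

A controller could use the returned index to interrupt the loop and force a different action.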
Relevance to emergence: The Thought-Action-Observation loop creates sequential decision points where behavioral patterns can crystallize. Agents develop “styles” of reasoning (verbose vs. terse, exploratory vs. exploitative).
1.2 Chain-of-Thought (CoT) and Extensions
Source: Wei et al., 2022; extensive follow-on work
Core mechanism: Generate intermediate reasoning steps before final answer.
Extensions relevant to agents:
- Tree-of-Thought (ToT): Explore multiple reasoning paths, backtrack on failure
- Self-Consistency: Sample multiple CoTs, aggregate
- Reflexion: Reflect on failures, store in memory for future attempts
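As one concrete case, the aggregation step of Self-Consistency can be sketched as a majority vote over final answers. The sampled answers here are made up; in practice each would come from a separately sampled chain of thought, and the normalization (strip and lowercase) is an assumption of this sketch.

```python
from collections import Counter

def self_consistent_answer(samples: list) -> tuple:
    """Majority vote over final answers from independently sampled chains.

    Returns (winning_answer, share_of_votes)."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answer.strip().lower() for answer in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Hypothetical final answers extracted from five sampled chains.
sampled = ["42", "42", "41", " 42", "40"]
winner, agreement = self_consistent_answer(sampled)
```

The vote share doubles as a rough confidence signal: low agreement suggests the reasoning paths diverge.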
Relevance to emergence: Reasoning style becomes a behavioral signature. Some agents naturally generate long exploratory chains; others prefer short decisive steps. This can be measured and tracked over time.
1.3 Hierarchical Planning
Source: Multiple (LaMMA-P, ReAcTree, Deep Agents)
Core mechanism: Decompose long-horizon goals into subgoals, then action sequences.
Architectures:
- LaMMA-P: LLM for subtask extraction + PDDL planner for execution
- ReAcTree: Dynamic agent trees with decomposition nodes
- Deep Agents: Task managers with multi-level memory
Key finding: Long-horizon planning requires explicit subgoal structures. Pure LLM reasoning degrades beyond roughly 10-15 steps without external scaffolding.
Relevance to emergence: Agents develop planning styles:
- Depth-first vs. breadth-first exploration
- Subgoal granularity (many small steps vs. few large steps)
- Backtracking frequency
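These planning styles can be made concrete with a toy subgoal tree. The sketch below is not drawn from LaMMA-P, ReAcTree, or Deep Agents; the `Subgoal` class and the example plan are hypothetical, and leaf count is used only as a rough proxy for granularity.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Subgoal:
    name: str
    children: list = field(default_factory=list)

def depth_first(root: Subgoal) -> list:
    """Finish each branch before starting the next one."""
    order = [root.name]
    for child in root.children:
        order.extend(depth_first(child))
    return order

def breadth_first(root: Subgoal) -> list:
    """Visit every subgoal at one level before descending."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node.name)
        queue.extend(node.children)
    return order

def leaf_count(root: Subgoal) -> int:
    """Number of leaf actions: a rough proxy for subgoal granularity."""
    if not root.children:
        return 1
    return sum(leaf_count(child) for child in root.children)

plan = Subgoal("write report", [
    Subgoal("gather data", [Subgoal("query db"), Subgoal("clean rows")]),
    Subgoal("draft text", [Subgoal("outline")]),
])
```

Two agents given the same goal could be compared by which traversal order their traces most resemble and by how many leaves their plans contain.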
2. Tool Calling / Function Calling Mechanisms
2.1 Token-Level Mechanics
Source: PromptingGuide.ai; ODSC presentations; industry practice
How it works:
- System prompt defines available tools (schemas, descriptions)
- LLM generates structured output (JSON, function call tokens)
- Environment executes tool, returns observation
- LLM incorporates observation into next reasoning step
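The four steps above can be sketched end to end. This is a simplified illustration rather than any vendor's function-calling API: the registry layout, the `{"name": ..., "arguments": ...}` call format, and the `get_weather` tool are all assumptions.

```python
import json
from typing import Callable

def get_weather(city: str) -> str:
    """Hypothetical tool that returns a canned answer."""
    return f"Sunny in {city}"

# Registry: tool name -> (description, required argument names, function).
TOOLS: dict = {
    "get_weather": ("Return current weather for a city.", ["city"], get_weather),
}

def tool_prompt(tools: dict) -> str:
    """Render the tool list that would be placed in the system prompt."""
    return "\n".join(
        f"- {name}({', '.join(args)}): {desc}"
        for name, (desc, args, _) in tools.items()
    )

def execute_tool_call(raw: str, tools: dict = TOOLS) -> str:
    """Parse the model's JSON tool call and return an observation string."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"error: invalid JSON ({exc.msg})"
    if not isinstance(call, dict):
        return "error: tool call must be a JSON object"
    name, args = call.get("name"), call.get("arguments", {})
    if name not in tools:
        return f"error: unknown tool {name!r}"  # model named a tool that does not exist
    _, required, fn = tools[name]
    missing = [arg for arg in required if arg not in args]
    if missing:
        return f"error: missing arguments {missing}"
    return fn(**args)

observation = execute_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

Returning errors as observation strings, instead of raising, lets the model see the failure and correct its next call.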
Key challenge: Tool selection accuracy degrades with:
- Large tool sets (>20 tools)
- Similar tool descriptions
- Context length pressure
Relevance to emergence: Agents develop tool preferences:
- Frequency of tool use vs. pure reasoning
- Tool selection patterns (certain agents favor certain tools)
- Error recovery strategies when tool fails
2.2 AvaTaR: Optimizing Tool Usage
Source: NeurIPS 2024 poster
Core mechanism: Contrastive reasoning to improve tool selection.
Findings:
- Contrastive examples improve tool selection accuracy
- Reduces hallucination of non-existent tools
- Improves multi-step tool chains
Relevance to emergence: Tool usage patterns can be measured and compared across agents:
- Tool call frequency per task type
- Success rate per tool
- Tool chain patterns (which tools are used together)
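All three measurements can be computed from a flat call log. The log format below (task id, tool name, success flag, in call order) is an assumption for illustration and is not part of AvaTaR.

```python
from collections import Counter

# Hypothetical log: one (task_id, tool_name, succeeded) record per call, in order.
LOG = [
    ("t1", "search", True), ("t1", "read_file", True), ("t1", "search", False),
    ("t2", "search", True), ("t2", "read_file", True),
]

def tool_stats(log: list) -> tuple:
    """Return (call counts, success rate per tool, consecutive tool pairs)."""
    calls, successes = Counter(), Counter()
    chains = Counter()   # (previous_tool, tool) pairs within the same task
    last_tool = {}       # task_id -> previous tool used in that task
    for task, tool, ok in log:
        calls[tool] += 1
        successes[tool] += int(ok)
        if task in last_tool:
            chains[(last_tool[task], tool)] += 1
        last_tool[task] = tool
    success_rate = {tool: successes[tool] / n for tool, n in calls.items()}
    return calls, success_rate, chains

calls, success_rate, chains = tool_stats(LOG)
```

Comparing these three tables across agents running the same task set gives a first-pass view of tool preferences.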
2.3 Function Calling Benchmarks
Source: ComplexFuncBench; Survey on Evaluation of LLM-based Agents (arXiv:2503.16416)
Evaluation dimensions:
- Single-turn tool selection: Can agent pick right tool?
- Multi-step chains: Can agent sequence tools correctly?
- Error handling: Can agent recover from tool failures?
- Virtual API servers: Simulate API state changes for evaluation
Relevance to emergence: Benchmarks can provide quantitative behavioral profiles, for example:
- Agent A: High accuracy, poor error recovery
- Agent B: Lower accuracy, excellent error recovery
- If such profiles persist across tasks, they indicate trait-like tendencies
3. Error Recovery and Self-Reflection
3.1 Reflexion Pattern
Source: Shinn et al., 2023; PromptingGuide.ai
Core mechanism:
- Agent attempts task, fails
- Reflection model analyzes failure trajectory
- Generates verbal reinforcement (feedback)
- Stores in long-term memory
- Agent retries with feedback
Key finding: Verbal reflection can substitute for weight updates (no fine-tuning needed).
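The retry-with-feedback cycle can be sketched without any model at all. The three stub functions below stand in for the actor, evaluator, and reflection model; they are assumptions for illustration, not code from Shinn et al.

```python
from typing import Callable

def reflexion_loop(attempt: Callable, evaluate: Callable, reflect: Callable,
                   max_trials: int = 3) -> tuple:
    """Retry a task, carrying verbal feedback between trials.

    Returns (solved, memory), where memory holds the stored reflections."""
    memory = []                      # verbal feedback carried across trials
    for _ in range(max_trials):
        result = attempt(memory)     # actor conditions on stored reflections
        if evaluate(result):
            return True, memory
        memory.append(reflect(result))
    return False, memory

# Stubs standing in for LLM calls (illustrative only).
def attempt(memory: list) -> str:
    return "sorted ascending" if memory else "sorted descending"

def evaluate(result: str) -> bool:
    return result == "sorted ascending"

def reflect(result: str) -> str:
    return f"Last attempt produced '{result}'; the task asked for ascending order."

solved, memory = reflexion_loop(attempt, evaluate, reflect)
```

Nothing in the loop changes model weights: the only state that persists between trials is the list of reflection strings.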
Relevance to emergence:
- Reflection style becomes behavioral signature (blaming environment vs. self-critique)
- Reflection frequency creates agent personality (cautious agents reflect more)
- Reflection content shapes future behavior (what does agent focus on?)
3.2 Self-Reflection Effects on Performance
Source: Renze, 2024 (arXiv:2405.06682)
Experiment: 9 LLMs, 8 types of self-reflection, multiple-choice questions.
Findings:
- Statistically significant improvement (p < 0.001) from self-reflection
- All 8 reflection types helped, but magnitude varied
- Reflection most helpful for initially difficult questions
Relevance to emergence: This supports the view that self-modification is possible through natural language. Agents can change their behavior based on their own outputs, which is a prerequisite for identity formation.
3.3 Error Recovery Strategies
Source: ReAct paper; agent deployment practice
Common patterns:
- Retry with different parameters: 40% success
- Backtrack and replan: 25% success
- Escalate to human/orchestrator: 15% (explicit failure)
- Hallucinate success: 20% (dangerous!)
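One way to order these patterns is as an explicit fallback chain that never reports success without a result. The sketch below is an assumption about how such a policy might look; `recover`, `flaky_step`, and the `timeout` parameter are hypothetical names.

```python
from typing import Callable, Optional

def recover(run_step: Callable, param_variants: list, replan: Callable) -> tuple:
    """Try recovery strategies in order; return (strategy_used, result).

    `run_step(params)` and `replan()` return a result, or None on failure."""
    # 1. Retry with different parameters.
    for params in param_variants:
        result = run_step(params)
        if result is not None:
            return "retry", result
    # 2. Backtrack and replan.
    result = replan()
    if result is not None:
        return "replan", result
    # 3. Escalate explicitly instead of reporting a success that did not happen.
    return "escalate", None

def flaky_step(params: dict) -> Optional[str]:
    """Hypothetical step that only succeeds with a long enough timeout."""
    return "ok" if params.get("timeout", 0) >= 30 else None

strategy, result = recover(flaky_step, [{"timeout": 5}, {"timeout": 30}], replan=lambda: None)
```

The length of `param_variants` is where style shows up: an aggressive agent passes many variants before replanning, a cautious one passes few.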
Relevance to emergence: Error recovery style is measurable and persistent:
- Cautious agents backtrack early
- Aggressive agents retry multiple times before backtracking
- Some agents escalate frequently; others rarely
4. Long-Horizon Task Execution
4.1 OdysseyBench: Long-Horizon Evaluation
Source: Wang et al., 2025 (arXiv:2508.09124)
Benchmark design:
- 602 tasks across Word, Excel, PDF, Email, Calendar
- Requires long-term contextual dependencies
- Multi-interaction coordination across applications
- Generated via HomerAgents (multi-agent framework)
Key finding: Existing benchmarks focus on atomic tasks; OdysseyBench reveals performance degradation over long horizons (10-50+ steps).
Relevance to emergence:
- Long-horizon execution creates more opportunities for behavioral patterns to emerge
- Agents show consistent “styles” across long tasks
- Resource constraints (token limits, latency budgets) force style trade-offs
4.2 Resource Constraints as “Physics”
Source: EmergentMind; multi-turn dynamics research
Key insight: Tokens, latency, and tool budgets function like physical constraints, and agents must adapt their behavior to survive within them.
Observable behaviors:
- Budget-aware planning: Estimate cost before acting
- Graceful degradation: Simplify reasoning when context fills
- Strategic forgetting: What to keep vs. discard
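Budget-aware planning and graceful degradation can be illustrated with a small budget tracker. The plan names and costs below are invented for the example, and real cost estimates would be far less exact.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    tokens: int
    tool_calls: int

    def can_afford(self, tokens: int = 0, tool_calls: int = 0) -> bool:
        return tokens <= self.tokens and tool_calls <= self.tool_calls

    def spend(self, tokens: int = 0, tool_calls: int = 0) -> None:
        if not self.can_afford(tokens, tool_calls):
            raise RuntimeError("budget exceeded")
        self.tokens -= tokens
        self.tool_calls -= tool_calls

def choose_plan(budget: Budget, plans: list) -> str:
    """Pick the first affordable plan; plans are ordered most to least thorough."""
    for name, token_cost, call_cost in plans:
        if budget.can_afford(token_cost, call_cost):
            return name
    return "stop"

# Hypothetical plans as (name, estimated tokens, estimated tool calls).
PLANS = [("thorough", 4000, 6), ("standard", 1500, 3), ("minimal", 300, 0)]

budget = Budget(tokens=2000, tool_calls=4)
first_choice = choose_plan(budget, PLANS)
```

As the budget drains, the same agent falls back from "standard" to "minimal" and finally to "stop", which is the graceful-degradation behavior in miniature.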
Relevance to emergence: Resource management style creates agent personality:
- “Thrifty” agents minimize tool calls
- “Thorough” agents spend freely for accuracy
- These preferences persist across tasks
4.3 Long-Horizon Failure Modes
Source: OdysseyBench; agent deployment reports
Common failures:
- Goal drift: Agent forgets original objective
- Context overflow: Long trajectories exceed context window
- Error accumulation: Small mistakes compound
- Motivation decay: Agent stops trying to optimize
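Goal drift and context overflow interact: naive truncation can drop the original objective. The sketch below shows one possible mitigation, pinning the goal and keeping the newest turns that fit. It is an illustration and not a method from OdysseyBench; whitespace splitting stands in for a real tokenizer.

```python
from typing import Callable

def trim_context(goal: str, turns: list, max_tokens: int,
                 count_tokens: Callable = lambda text: len(text.split())) -> list:
    """Keep the goal plus as many of the most recent turns as fit the budget."""
    remaining = max_tokens - count_tokens(goal)
    kept = []
    for turn in reversed(turns):      # newest first
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    return [goal] + kept[::-1]        # restore chronological order

# Hypothetical trajectory.
GOAL = "Goal: book a flight"
TURNS = ["searched flights to Oslo", "found three options", "user prefers morning"]

context = trim_context(GOAL, TURNS, max_tokens=10)
```

What an agent chooses to keep besides the goal (recent turns, errors, summaries) is itself a candidate behavioral trait.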
Relevance to emergence: Failure patterns reveal agent tendencies:
- Does agent self-correct goal drift?
- How does agent handle context overflow?
- These are measurable behavioral traits
5. Multi-Agent Tool Coordination
5.1 Emergent Coordination in Multi-Agent LLMs
Source: Riedl et al., 2025 (arXiv:2510.05174)
Experiment: Guessing game with minimal communication, three interventions:
- Control (no modifications)
- Persona assignment (stable identity)
- Persona + “think about others” (Theory of Mind prompt)
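The three conditions differ only in what is added to the prompt, which can be sketched as a small builder. The prompt wording and the persona below are hypothetical and are not the prompts used by Riedl et al.

```python
from typing import Optional

# Hypothetical base instruction shared by all three conditions.
BASE = "You are playing a number guessing game with other agents."

def build_prompt(persona: Optional[str] = None, theory_of_mind: bool = False) -> str:
    """Assemble the system prompt for one experimental condition."""
    parts = [BASE]
    if persona:
        parts.append(f"Your identity: {persona}.")
    if theory_of_mind:
        if not persona:
            raise ValueError("the ToM condition is layered on top of a persona")
        parts.append("Before acting, think about what the other agents are likely to do.")
    return " ".join(parts)

CONDITIONS = {
    "control": build_prompt(),
    "persona": build_prompt(persona="Ada, a cautious analyst"),
    "persona_tom": build_prompt(persona="Ada, a cautious analyst", theory_of_mind=True),
}
```

Because each condition strictly extends the previous one, any behavioral difference between conditions can be attributed to the added sentence.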
Findings:
- Control: Temporal synergy, little cross-agent alignment
- Persona: Identity-linked differentiation emerges
- Persona + ToM: Goal-directed complementarity
Critical insight: Prompt design can steer agents from aggregates to collectives with higher-order structure.
Relevance to emergence: This is directly applicable to our fleet design:
- Assigning stable identities (agent names, domains) creates differentiation
- Adding Theory of Mind prompts (“what would other agents think?”) creates coordination
- Emergence is measurable via partial information decomposition
5.2 Multi-Agent Collaboration Mechanisms
Source: Nguyen et al., 2025 (arXiv:2501.06322)
Framework dimensions:
- Actors: Which agents participate
- Types: Cooperation, competition, coopetition
- Structures: Centralized, peer-to-peer, distributed
- Strategies: Role-based, model-based
- Coordination protocols: Communication patterns
Key finding: Role-based specialization (e.g., orchestrator, worker, reviewer) is most effective for complex tasks.
Relevance to emergence: Role assignment creates stable behavioral baselines, but agents may deviate based on experience → personality within role.
6. Key Findings for Personality Emergence
6.1 Mechanisms That Create Behavioral Divergence
Identified mechanisms:
- Sequential decision points: Thought-Action-Observation loops create choice points
- Error recovery styles: Agents develop consistent retry/backtrack/escalate patterns
- Tool preferences: Frequency and selection patterns persist across tasks
- Reflection content: What agents focus on in self-reflection shapes future behavior
- Resource management: Budget-aware behavior creates agent “personalities”
- Persona + ToM prompts: Explicit identity assignment creates differentiation
6.2 What Can Be Measured
Quantifiable behavioral traits:
- Reasoning verbosity: Mean tokens per thought
- Tool call frequency: Calls per task, per tool
- Error recovery latency: Steps before backtracking
- Reflection depth: Tokens of self-reflection
- Planning horizon: Average subgoal depth
- Coordination frequency: Messages to other agents
These can be tracked over time to identify stable patterns vs. noise.
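A few of these traits can be computed directly from a logged trajectory. The step format below is an assumption; tokens are approximated by whitespace splitting, and steps-before-backtracking is counted from the start of the task, not from the first error.

```python
from statistics import mean

# Hypothetical trajectory: one dict per step.
TRAJECTORY = [
    {"thought": "I need the file list first", "action": "tool:list_files"},
    {"thought": "Open the config", "action": "tool:read_file"},
    {"thought": "That failed, try again", "action": "tool:read_file"},
    {"thought": "Wrong path, go back to the listing", "action": "backtrack"},
]

def behavior_profile(trajectory: list) -> dict:
    """Summarize one trajectory as a small set of behavioral metrics."""
    thoughts = [step["thought"] for step in trajectory]
    actions = [step["action"] for step in trajectory]
    tool_calls = [a for a in actions if a.startswith("tool:")]
    backtrack_at = actions.index("backtrack") if "backtrack" in actions else None
    return {
        "mean_tokens_per_thought": mean(len(t.split()) for t in thoughts),
        "tool_calls": len(tool_calls),
        "steps_before_backtrack": backtrack_at,
    }

profile = behavior_profile(TRAJECTORY)
```

Computing one such profile per task and comparing them over time is the simplest way to separate stable patterns from noise.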
6.3 What Remains Unknown
Open questions:
- Stability vs. drift: How long do behavioral patterns persist?
- Measurement frequency: How many tasks needed to identify a trait?
- Cross-domain consistency: Do agents maintain personality across task types?
- Social influence: How do peer agents shape each other’s behavior?
- Identity editing: Can agents deliberately change their own personalities?
7. Implications for Fleet Architecture
7.1 For SOUL.md Design
Findings suggest:
- Identity assignment matters: Stable agent names + domains create differentiation
- Theory of Mind prompts help: “Consider what other agents would think”
- Reflection mechanisms are critical: Agents need structured self-reflection
- Resource constraints create style: Explicit budgets force personality
SOUL.md should include:
- Agent name and domain (identity anchor)
- Resource management defaults (thrifty vs. thorough)
- Reflection triggers (when to self-reflect)
- Coordination mindset (ToM prompt)
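As a rough sketch of how these four elements might be captured, the dataclass below renders them as plain text. The field names, the output layout, and the agent name "Atlas" are all hypothetical; the actual SOUL.md format is not specified in this survey.

```python
from dataclasses import dataclass

@dataclass
class Soul:
    name: str                    # identity anchor
    domain: str                  # identity anchor
    resource_style: str          # "thrifty" or "thorough"
    reflect_after_failures: int  # reflection trigger
    tom_prompt: str              # coordination mindset

    def render(self) -> str:
        """Render the fields as plain text, one per line."""
        if self.resource_style not in ("thrifty", "thorough"):
            raise ValueError("resource_style must be 'thrifty' or 'thorough'")
        return "\n".join([
            f"Name: {self.name}",
            f"Domain: {self.domain}",
            f"Resource style: {self.resource_style}",
            f"Reflect after {self.reflect_after_failures} consecutive failures",
            f"Coordination: {self.tom_prompt}",
        ])

soul_text = Soul(
    name="Atlas",
    domain="data pipelines",
    resource_style="thrifty",
    reflect_after_failures=2,
    tom_prompt="Consider what other agents would think",
).render()
```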
7.2 For Memory Architecture
Findings suggest:
- Reflection storage: Long-term memory for self-reflections
- Tool usage logs: Track frequency and success rates
- Error recovery patterns: Store backtracking behaviors
- Coordination history: Messages to/from other agents
Memory enables emergence by providing historical context for behavioral patterns.
7.3 For Peer Review System
Findings suggest:
- Role-based specialization works: Reviewer vs. producer roles
- Personality-driven review: Agent’s style shapes review focus
- Cross-agent observation: Agents learn from watching each other
Peer review creates social feedback that shapes personality over time.
8. References
Core Papers
- ReAct: Yao et al., 2022. “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv:2210.03629
- Reflexion: Shinn et al., 2023. “Reflexion: Language Agents with Verbal Reinforcement Learning.” arXiv:2303.11366
- Self-Reflection: Renze, 2024. “Self-Reflection in LLM Agents: Effects on Problem-Solving Performance.” arXiv:2405.06682
- Emergent Coordination: Riedl et al., 2025. “Emergent Coordination in Multi-Agent Language Models.” arXiv:2510.05174
- Multi-Agent Collaboration: Nguyen et al., 2025. “Multi-Agent Collaboration Mechanisms: A Survey of LLMs.” arXiv:2501.06322
- OdysseyBench: Wang et al., 2025. “OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows.” arXiv:2508.09124
- AvaTaR: NeurIPS 2024. “AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning.”
- Evaluation Survey: “Survey on Evaluation of LLM-based Agents.” arXiv:2503.16416
Workshops & Venues
- NeurIPS 2025 Workshop on Multi-Turn Interactions in LLMs
- ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems (MemAgents)
Practical Resources
- PromptingGuide.ai: ReAct, Reflexion, Function Calling guides
- LangChain Blog: Planning for Agents (October 2025)
- Lil’Log: LLM Powered Autonomous Agents (Lilian Weng)
Next Steps
Phase 1.2: Long-term Memory for Agents
- Episodic vs. semantic memory architectures
- Retrieval policies and consolidation
- Memory’s role in behavioral persistence
Phase 1.1 complete. Moving to Phase 1.2.