Phase 1.1: LLM Agents & Tool Use Survey

Created: 2026-02-18 20:45 CST
Phase: 1 - Breadth Survey
Focus: Planning loops, action selection, tool calling, error recovery, long-horizon execution

Executive Summary

LLM-based agents represent a fundamental shift from static language models to dynamic systems that can reason, act, and adapt over extended time horizons. This survey examines the mechanisms by which agents plan, select tools, execute actions, recover from errors, and maintain coherence across long-horizon tasks.

Key insight for personality emergence: Agent behavior is shaped not just by prompts, but by the interaction loop between reasoning, action, observation, and reflection. This loop creates opportunities for behavioral divergence even in identical base models.

1. Planning Architectures

1.1 ReAct (Reasoning + Acting)

Source: Yao et al., 2022 (arXiv:2210.03629)

Core mechanism: Interleaves reasoning traces with task-specific actions.

Thought → Action → Observation → Thought → Action → ...
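
A minimal sketch of this loop, assuming the model is wrapped as a policy function that returns a (thought, action, argument) triple. The names `react_loop`, `scripted_policy`, and the `lookup` tool are hypothetical; the scripted policy only stands in for an LLM so the control flow can be run as-is.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a policy maps the trace so far to (thought, action, argument).
Step = Tuple[str, str, str]
Policy = Callable[[List[str]], Step]


def react_loop(policy: Policy, tools: Dict[str, Callable[[str], str]],
               max_steps: int = 8) -> Tuple[str, List[str]]:
    """Run Thought -> Action -> Observation until the policy emits 'finish'."""
    trace: List[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(trace)
        trace.append(f"Thought: {thought}")
        trace.append(f"Action: {action}[{arg}]")
        if action == "finish":
            return arg, trace
        # Unknown tools yield an error observation instead of raising,
        # so the policy can see the failure and react to it.
        tool = tools.get(action)
        observation = tool(arg) if tool else f"error: unknown tool '{action}'"
        trace.append(f"Observation: {observation}")
    return "", trace  # step budget exhausted without an answer


def scripted_policy(trace: List[str]) -> Step:
    """Stand-in for an LLM: look something up once, then answer with it."""
    prefix = "Observation: "
    observations = [t for t in trace if t.startswith(prefix)]
    if not observations:
        return ("I need the capital of France.", "lookup", "France")
    return ("The observation answers the question.", "finish",
            observations[-1][len(prefix):])


tools = {"lookup": lambda q: {"France": "Paris"}.get(q, "no result")}
answer, trace = react_loop(scripted_policy, tools)
```

The returned trace is the human-readable record the findings below refer to: each decision point is logged as text, which is also what makes reasoning style measurable.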

Key findings:

Reduces hallucination vs. pure chain-of-thought (CoT) by grounding reasoning in external actions
Enables error recovery through observation-feedback loops
Human-interpretable traces facilitate debugging and optimization

Error patterns:

Loop repetition: Model generates the same thought/action repeatedly (classed as a reasoning error in the paper's HotpotQA failure analysis)
Non-informative retrieval (23% of errors): Failed searches derail reasoning
Reasoning vs. acting imbalance: Over-reliance on one modality
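
Loop repetition is straightforward to flag at runtime. The check below is a sketch, assuming actions are logged as (tool, argument) pairs; `is_looping` and the window size of 3 are illustrative choices, not something taken from the ReAct paper.

```python
from typing import List, Tuple


def is_looping(actions: List[Tuple[str, str]], window: int = 3) -> bool:
    """True if the last `window` (tool, argument) pairs are all identical."""
    if len(actions) < window:
        return False
    recent = actions[-window:]
    return all(a == recent[0] for a in recent)


stuck = [("search", "capital of France")] * 3
varied = [("search", "France"), ("search", "Paris"), ("search", "France")]
```

An orchestrator could call this after every step and force a replan, or inject a hint, once it returns True.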

Relevance to emergence: The Thought-Action-Observation loop creates sequential decision points where behavioral patterns can crystallize. Agents develop “styles” of reasoning (verbose vs. terse, exploratory vs. exploitative).

1.2 Chain-of-Thought (CoT) and Extensions

Source: Wei et al., 2022; extensive follow-on work

Core mechanism: Generate intermediate reasoning steps before final answer.

Extensions relevant to agents:

Tree-of-Thought (ToT): Explore multiple reasoning paths, backtrack on failure
Self-Consistency: Sample multiple CoTs, aggregate
Reflexion: Reflect on failures, store in memory for future attempts
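
As an illustration of the aggregation step in Self-Consistency, the sketch below takes a majority vote over the final answers of several sampled chains. The sampling itself is out of scope here; `samples` is made-up data and the normalization (strip plus lowercase) is an assumption.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency(answers: List[str]) -> Tuple[str, float]:
    """Return the most common final answer and its share of the samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


samples = ["42", " 42", "41", "42", "40"]  # final answers from five sampled CoTs
best, agreement = self_consistency(samples)
```

The agreement share doubles as a rough confidence signal: low agreement suggests the question deserves more samples or a different strategy.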

Relevance to emergence: Reasoning style becomes a behavioral signature. Some agents naturally generate long exploratory chains; others prefer short decisive steps. This can be measured and tracked over time.

1.3 Hierarchical Planning

Source: Multiple (LaMMA-P, ReAcTree, Deep Agents)

Core mechanism: Decompose long-horizon goals into subgoals, then action sequences.

Architectures:

LaMMA-P: LLM for subtask extraction + PDDL planner for execution
ReAcTree: Dynamic agent trees with decomposition nodes
Deep Agents: Task managers with multi-level memory

Key finding: Long-horizon planning requires explicit subgoal structures. Pure LLM reasoning degrades beyond roughly 10-15 steps without external scaffolding.

Relevance to emergence: Agents develop planning styles:

Depth-first vs. breadth-first exploration
Subgoal granularity (many small steps vs. few large steps)
Backtracking frequency
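
These planning styles can be made concrete with a subgoal tree. The sketch below is a generic illustration, not the data structure used by LaMMA-P, ReAcTree, or Deep Agents; `Subgoal`, the traversal functions, and the example plan are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subgoal:
    name: str
    children: List["Subgoal"] = field(default_factory=list)


def depth_first(root: Subgoal) -> List[str]:
    """Finish each subgoal's descendants before moving to its sibling."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node.name)
        stack.extend(reversed(node.children))  # keep left-to-right order
    return order


def breadth_first(root: Subgoal) -> List[str]:
    """Visit all subgoals at one level before descending."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node.name)
        queue.extend(node.children)
    return order


def leaf_count(root: Subgoal) -> int:
    """Number of primitive steps: a rough proxy for subgoal granularity."""
    if not root.children:
        return 1
    return sum(leaf_count(c) for c in root.children)


plan = Subgoal("send report", [
    Subgoal("gather data", [Subgoal("query db"), Subgoal("clean rows")]),
    Subgoal("write summary"),
])
```

Comparing the visiting order an agent actually follows against these two reference orders gives one way to label its exploration style.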

2. Tool Calling / Function Calling Mechanisms

2.1 Token-Level Mechanics

Source: PromptingGuide.ai; ODSC presentations; industry practice

How it works:

System prompt defines available tools (schemas, descriptions)
LLM generates structured output (JSON, function call tokens)
Environment executes tool, returns observation
LLM incorporates observation into next reasoning step
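
The four steps above can be sketched without any particular provider API. Everything here is an assumption for illustration: the `TOOLS` registry, the `{"name": ..., "arguments": ...}` call format, and the two toy tools. Real function-calling APIs differ in schema details.

```python
import json
from typing import Any, Dict

# Hypothetical registry: tool name -> description and a callable taking keyword args.
TOOLS: Dict[str, Dict[str, Any]] = {
    "add": {"description": "Add two integers a and b.",
            "fn": lambda a, b: a + b},
    "upper": {"description": "Uppercase a string s.",
              "fn": lambda s: s.upper()},
}


def tool_prompt() -> str:
    """Step 1: describe the available tools for the system prompt."""
    return "\n".join(f"- {name}: {spec['description']}"
                     for name, spec in TOOLS.items())


def execute(model_output: str) -> str:
    """Steps 2-3: parse the model's JSON call, run the tool, return an observation."""
    try:
        call = json.loads(model_output)
        name, args = call["name"], call.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return f"error: malformed tool call ({exc})"
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    try:
        return str(TOOLS[name]["fn"](**args))
    except Exception as exc:  # surface tool failures as observations
        return f"error: {name} failed ({exc})"


observation = execute('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```

Step 4 is simply appending the returned string to the conversation. Returning errors as observations rather than raising keeps the agent in the loop and lets it attempt recovery.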

Key challenge: Tool selection accuracy degrades with:

Large tool sets (>20 tools)
Similar tool descriptions
Context length pressure

Relevance to emergence: Agents develop tool preferences:

Frequency of tool use vs. pure reasoning
Tool selection patterns (certain agents favor certain tools)
Error recovery strategies when tool fails

2.2 AvaTaR: Optimizing Tool Usage

Source: NeurIPS 2024 poster

Core mechanism: Contrastive reasoning to improve tool selection.

Findings:

Contrastive examples improve tool selection accuracy
Reduces hallucination of non-existent tools
Improves multi-step tool chains

Relevance to emergence: Tool usage patterns can be measured and compared across agents:

Tool call frequency per task type
Success rate per tool
Tool chain patterns (which tools are used together)
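
A sketch of how these three measurements could be computed from a call log. The log format (task id, tool name, success flag) and the sample entries are assumptions; this is unrelated to how AvaTaR itself evaluates agents.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical log format: one (task_id, tool_name, succeeded) tuple per call, in order.
Log = List[Tuple[str, str, bool]]


def tool_stats(log: Log) -> Dict[str, Dict[str, float]]:
    """Call count and success rate per tool."""
    calls, wins = Counter(), Counter()
    for _, tool, ok in log:
        calls[tool] += 1
        wins[tool] += int(ok)
    return {t: {"calls": calls[t], "success_rate": wins[t] / calls[t]}
            for t in calls}


def tool_chains(log: Log) -> Counter:
    """Count consecutive tool pairs within the same task."""
    by_task = defaultdict(list)
    for task, tool, _ in log:
        by_task[task].append(tool)
    pairs = Counter()
    for tools in by_task.values():
        pairs.update(zip(tools, tools[1:]))
    return pairs


log = [("t1", "search", True), ("t1", "read", True),
       ("t2", "search", False), ("t2", "search", True), ("t2", "read", True)]
stats = tool_stats(log)
chains = tool_chains(log)
```

Grouping the same statistics by task type instead of by tool would give the per-task-type frequency listed above.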

2.3 Function Calling Benchmarks

Source: ComplexFuncBench; Survey on Evaluation of LLM-based Agents (arXiv:2503.16416)

Evaluation dimensions:

Single-turn tool selection: Can agent pick right tool?
Multi-step chains: Can agent sequence tools correctly?
Error handling: Can agent recover from tool failures?
Virtual API servers: Simulate API state changes for evaluation

Relevance to emergence: Benchmarks provide quantitative behavioral profiles:

Agent A: High accuracy, poor error recovery
Agent B: Lower accuracy, excellent error recovery
Profiles that persist across tasks indicate trait-like tendencies

3. Error Recovery and Self-Reflection

3.1 Reflexion Pattern

Source: Shinn et al., 2023; PromptingGuide.ai

Core mechanism:

Agent attempts task, fails
Reflection model analyzes failure trajectory
Generates verbal reinforcement (feedback)
Stores in long-term memory
Agent retries with feedback
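
The five steps can be sketched as a retry loop around two callables. This is a simplified illustration, not the implementation from Shinn et al.: `attempt` and `reflect` stand in for the actor and the reflection model, and the toy versions below exist only so the loop runs end to end.

```python
from typing import Callable, List, Tuple


def reflexion(attempt: Callable[[List[str]], Tuple[bool, str]],
              reflect: Callable[[str], str],
              max_trials: int = 3) -> Tuple[bool, List[str]]:
    """Retry a task, feeding stored verbal reflections into each new attempt."""
    memory: List[str] = []  # long-term store of reflections
    for _ in range(max_trials):
        success, trajectory = attempt(memory)
        if success:
            return True, memory
        memory.append(reflect(trajectory))
    return False, memory


# Toy stand-ins for the actor and the reflection model.
def toy_attempt(memory: List[str]) -> Tuple[bool, str]:
    if any("check units" in note for note in memory):
        return True, "converted km to m, then answered"
    return False, "answered without converting km to m"


def toy_reflect(trajectory: str) -> str:
    return f"Last attempt failed ({trajectory}); next time check units first."


solved, notes = reflexion(toy_attempt, toy_reflect)
```

Nothing in the loop updates model weights; the only state carried between trials is the list of reflection strings.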

Key finding: Verbal reflection can substitute for weight updates (no fine-tuning needed).

Relevance to emergence:

Reflection style becomes behavioral signature (blaming environment vs. self-critique)
Reflection frequency creates agent personality (cautious agents reflect more)
Reflection content shapes future behavior (what does agent focus on?)

3.2 Self-Reflection Effects on Performance

Source: Renze & Guven, 2024 (arXiv:2405.06682)

Experiment: 9 LLMs, 8 types of self-reflection, multiple-choice questions.

Findings:

Statistically significant improvement (p < 0.001) from self-reflection
All 8 reflection types helped, but magnitude varied
Reflection most helpful for initially difficult questions

Relevance to emergence: This suggests that self-modification is possible through natural language alone. Agents can change their behavior based on their own outputs, which is a prerequisite for identity formation.

3.3 Error Recovery Strategies

Source: ReAct paper; agent deployment practice

Common patterns:

Retry with different parameters: 40% success
Backtrack and replan: 25% success
Escalate to human/orchestrator: 15% (explicit failure)
Hallucinate success: 20% (dangerous!)

Relevance to emergence: Error recovery style is measurable and persistent:

Cautious agents backtrack early
Aggressive agents retry multiple times before backtracking
Some agents escalate frequently; others rarely
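
One way to make recovery style explicit is to parameterize it. The sketch below is an assumption-laden illustration: `RecoveryPolicy` and its two thresholds are hypothetical, and "hallucinate success" is deliberately not an available action.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPolicy:
    max_retries: int     # retries with changed parameters before backtracking
    max_backtracks: int  # replans before escalating

    def next_action(self, retries: int, backtracks: int) -> str:
        """Choose the next recovery step given how many were already used."""
        if retries < self.max_retries:
            return "retry"
        if backtracks < self.max_backtracks:
            return "backtrack"
        return "escalate"


cautious = RecoveryPolicy(max_retries=0, max_backtracks=2)
aggressive = RecoveryPolicy(max_retries=3, max_backtracks=1)
```

Fitting these two thresholds to an agent's observed behavior would give a compact, comparable description of its recovery style.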

4. Long-Horizon Task Execution

4.1 OdysseyBench: Long-Horizon Evaluation

Source: Wang et al., 2025 (arXiv:2508.09124)

Benchmark design:

602 tasks across Word, Excel, PDF, Email, Calendar
Requires long-term contextual dependencies
Multi-interaction coordination across applications
Generated via HomerAgents (multi-agent framework)

Key finding: Existing benchmarks focus on atomic tasks; OdysseyBench reveals performance degradation over long horizons (10-50+ steps).

Relevance to emergence:

Long-horizon execution creates more opportunities for behavioral patterns to emerge
Agents show consistent “styles” across long tasks
Resource constraints (token limits, latency budgets) force style trade-offs

4.2 Resource Constraints as “Physics”

Source: EmergentMind; multi-turn dynamics research

Key insight: Tokens, latency, and tool budgets function like physical constraints; agents must adapt their behavior to survive.

Observable behaviors:

Budget-aware planning: Estimate cost before acting
Graceful degradation: Simplify reasoning when context fills
Strategic forgetting: What to keep vs. discard
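
Two of these behaviors are easy to sketch. The code below assumes a crude whitespace token count in place of a real tokenizer and a keep-the-goal-plus-recent-messages forgetting rule; both are illustrative choices, and all function names are hypothetical.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


def can_afford(estimated_cost: int, spent: int, budget: int) -> bool:
    """Budget-aware planning: check an action's estimated cost before acting."""
    return spent + estimated_cost <= budget


def trim_context(messages: List[str], budget: int, keep_first: int = 1) -> List[str]:
    """Strategic forgetting: keep the first messages (e.g. the goal) and as many
    of the most recent messages as fit in the remaining budget.
    The head is always kept, even if it alone exceeds the budget."""
    head = messages[:keep_first]
    remaining = budget - sum(count_tokens(m) for m in head)
    tail: List[str] = []
    for msg in reversed(messages[keep_first:]):
        cost = count_tokens(msg)
        if cost > remaining:
            break
        tail.append(msg)
        remaining -= cost
    return head + tail[::-1]


history = ["goal: file the expense report", "opened inbox",
           "found receipt email", "downloaded receipt pdf"]
trimmed = trim_context(history, budget=10)
```

Pinning the goal message is one simple guard against goal drift when older context is discarded.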

Relevance to emergence: Resource management style creates agent personality:

“Thrifty” agents minimize tool calls
“Thorough” agents spend freely for accuracy
These preferences persist across tasks

4.3 Long-Horizon Failure Modes

Source: OdysseyBench; agent deployment reports

Common failures:

Goal drift: Agent forgets original objective
Context overflow: Long trajectories exceed context window
Error accumulation: Small mistakes compound
Motivation decay: Agent stops trying to optimize

Relevance to emergence: Failure patterns reveal agent tendencies:

Does agent self-correct goal drift?
How does agent handle context overflow?
These are measurable behavioral traits

5. Multi-Agent Tool Coordination

5.1 Emergent Coordination in Multi-Agent LLMs

Source: Riedl et al., 2025 (arXiv:2510.05174)

Experiment: Guessing game with minimal communication, three interventions:

Control (no modifications)
Persona assignment (stable identity)
Persona + “think about others” (Theory of Mind prompt)
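
For fleet experiments, the three conditions could be reproduced with a small prompt builder. This is a sketch only: the wording of `TOM_HINT`, the persona text, and the task placeholder are invented here and are not the prompts used in the paper.

```python
from typing import Optional

# Hypothetical wording for the Theory of Mind instruction.
TOM_HINT = "Before answering, think about what the other agents are likely to do."


def build_system_prompt(task: str, persona: Optional[str] = None,
                        theory_of_mind: bool = False) -> str:
    """Assemble a system prompt for one experimental condition."""
    parts = []
    if persona:
        parts.append(f"You are {persona}.")
    parts.append(task)
    if theory_of_mind:
        parts.append(TOM_HINT)
    return " ".join(parts)


task = "Play the guessing game described below."  # placeholder task text
persona = "Ada, a cautious statistician"          # made-up persona
conditions = {
    "control": build_system_prompt(task),
    "persona": build_system_prompt(task, persona=persona),
    "persona_tom": build_system_prompt(task, persona=persona, theory_of_mind=True),
}
```

Keeping the task text identical across conditions means any behavioral difference can be attributed to the persona and ToM additions.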

Findings:

Control: Temporal synergy, little cross-agent alignment
Persona: Identity-linked differentiation emerges
Persona + ToM: Goal-directed complementarity

Critical insight: Prompt design can steer agents from aggregates to collectives with higher-order structure.

Relevance to emergence: This is directly applicable to our fleet design:

Assigning stable identities (agent names, domains) creates differentiation
Adding Theory of Mind prompts (“what would other agents think?”) creates coordination
Emergence is measurable via partial information decomposition

5.2 Multi-Agent Collaboration Mechanisms

Source: Tran et al., 2025 (arXiv:2501.06322)

Framework dimensions:

Actors: Which agents participate
Types: Cooperation, competition, coopetition
Structures: Centralized, peer-to-peer, distributed
Strategies: Role-based, model-based
Coordination protocols: Communication patterns

Key finding: Role-based specialization (e.g., orchestrator, worker, reviewer) is most effective for complex tasks.

Relevance to emergence: Role assignment creates stable behavioral baselines, but agents may deviate based on experience → personality within role.

6. Key Findings for Personality Emergence

6.1 Mechanisms That Create Behavioral Divergence

Identified mechanisms:

Sequential decision points: Thought-Action-Observation loops create choice points
Error recovery styles: Agents develop consistent retry/backtrack/escalate patterns
Tool preferences: Frequency and selection patterns persist across tasks
Reflection content: What agents focus on in self-reflection shapes future behavior
Resource management: Budget-aware behavior creates agent “personalities”
Persona + ToM prompts: Explicit identity assignment creates differentiation

6.2 What Can Be Measured

Quantifiable behavioral traits:

Reasoning verbosity: Mean tokens per thought
Tool call frequency: Calls per task, per tool
Error recovery latency: Steps before backtracking
Reflection depth: Tokens of self-reflection
Planning horizon: Average subgoal depth
Coordination frequency: Messages to other agents

These can be tracked over time to identify stable patterns vs. noise.
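
A sketch of how several of these traits could be computed from one trajectory. The step record format is an assumption, word counts stand in for tokens, and "steps before backtracking" is simplified to the index of the first backtrack in the trajectory.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical step record:
# {"kind": "thought" | "tool_call" | "backtrack" | "message", "text": str}
Step = Dict[str, str]


def trait_profile(trajectory: List[Step]) -> Dict[str, float]:
    """Summarize one trajectory as a small set of behavioral measurements."""
    thoughts = [s["text"] for s in trajectory if s["kind"] == "thought"]
    kinds = [s["kind"] for s in trajectory]
    return {
        "reasoning_verbosity": mean(len(t.split()) for t in thoughts) if thoughts else 0.0,
        "tool_calls": kinds.count("tool_call"),
        "steps_before_backtrack": kinds.index("backtrack") if "backtrack" in kinds else len(kinds),
        "messages_sent": kinds.count("message"),
    }


trajectory = [
    {"kind": "thought", "text": "check the inbox first"},
    {"kind": "tool_call", "text": "search_inbox(receipt)"},
    {"kind": "thought", "text": "no receipt found"},
    {"kind": "backtrack", "text": "replan from goal"},
    {"kind": "tool_call", "text": "search_drive(receipt)"},
]
profile = trait_profile(trajectory)
```

Computing one profile per task and comparing the spread within an agent against the spread between agents is one way to separate stable patterns from noise.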

6.3 What Remains Unknown

Open questions:

Stability vs. drift: How long do behavioral patterns persist?
Measurement frequency: How many tasks needed to identify a trait?
Cross-domain consistency: Do agents maintain personality across task types?
Social influence: How do peer agents shape each other’s behavior?
Identity editing: Can agents deliberately change their own personalities?

7. Implications for Fleet Architecture

7.1 For SOUL.md Design

Findings suggest:

Identity assignment matters: Stable agent names + domains create differentiation
Theory of Mind prompts help: “Consider what other agents would think”
Reflection mechanisms are critical: Agents need structured self-reflection
Resource constraints create style: Explicit budgets force personality

SOUL.md should include:

Agent name and domain (identity anchor)
Resource management defaults (thrifty vs. thorough)
Reflection triggers (when to self-reflect)
Coordination mindset (ToM prompt)
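
As a starting point, the four elements could be captured in a small record and rendered to a file. The field names, the layout of the rendered text, and the example values are all hypothetical; no SOUL.md format has been fixed yet.

```python
from dataclasses import dataclass


@dataclass
class Soul:
    name: str                 # identity anchor
    domain: str               # identity anchor
    resource_style: str       # "thrifty" or "thorough"
    reflection_trigger: str   # when to self-reflect
    coordination_prompt: str  # Theory of Mind prompt


def render_soul(soul: Soul) -> str:
    """Render the record as SOUL.md text (layout is an assumption)."""
    return "\n".join([
        f"# {soul.name}",
        f"Domain: {soul.domain}",
        f"Resource style: {soul.resource_style}",
        f"Reflect when: {soul.reflection_trigger}",
        f"Coordination: {soul.coordination_prompt}",
    ])


atlas = Soul(
    name="Atlas",
    domain="data pipelines",
    resource_style="thrifty",
    reflection_trigger="after any failed tool call",
    coordination_prompt="Consider what other agents would think.",
)
soul_md = render_soul(atlas)
```

Keeping the source of truth in a structured record makes it possible to diff an agent's configured defaults against its measured behavior later.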

7.2 For Memory Architecture

Findings suggest:

Reflection storage: Long-term memory for self-reflections
Tool usage logs: Track frequency and success rates
Error recovery patterns: Store backtracking behaviors
Coordination history: Messages to/from other agents

Memory enables emergence by providing historical context for behavioral patterns.

7.3 For Peer Review System

Findings suggest:

Role-based specialization works: Reviewer vs. producer roles
Personality-driven review: Agent’s style shapes review focus
Cross-agent observation: Agents learn from watching each other

Peer review creates social feedback that shapes personality over time.

8. References

Core Papers

ReAct: Yao et al., 2022. “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv:2210.03629
Reflexion: Shinn et al., 2023. “Reflexion: Language Agents with Verbal Reinforcement Learning.” arXiv:2303.11366
Self-Reflection: Renze & Guven, 2024. “Self-Reflection in LLM Agents: Effects on Problem-Solving Performance.” arXiv:2405.06682
Emergent Coordination: Riedl et al., 2025. “Emergent Coordination in Multi-Agent Language Models.” arXiv:2510.05174
Multi-Agent Collaboration: Tran et al., 2025. “Multi-Agent Collaboration Mechanisms: A Survey of LLMs.” arXiv:2501.06322
OdysseyBench: Wang et al., 2025. “OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows.” arXiv:2508.09124
AvaTaR: NeurIPS 2024. “AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning.”
Survey on Evaluation: “Survey on Evaluation of LLM-based Agents.” arXiv:2503.16416

Workshops & Venues

NeurIPS 2025 Workshop on Multi-Turn Interactions in LLMs
ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems (MemAgents)

Practical Resources

PromptingGuide.ai: ReAct, Reflexion, Function Calling guides
LangChain Blog: Planning for Agents (October 2025)
Lil’Log: LLM Powered Autonomous Agents (Lilian Weng)

Next Steps

Phase 1.2: Long-term Memory for Agents

Episodic vs. semantic memory architectures
Retrieval policies and consolidation
Memory’s role in behavioral persistence

Phase 1.1 complete. Moving to Phase 1.2.