Local LLM Integration Summary¶

Status: ✅ Proof of Concept Complete
Date: October 12, 2025
Branch: feature/local-llm-support

Overview¶

Successfully integrated local LLM support (vLLM) with the ShadowHound mission agent, proving the concept works. Also improved cloud LLM (GPT-4o) performance through agent optimizations.

What We Built¶

1. vLLM Integration on Thor (Jetson AGX Orin)¶

Container: NVIDIA official vLLM image (3.5x faster than Ollama)
Endpoint: http://192.168.10.116:8000/v1 (OpenAI-compatible API)
Models Tested:
✅ Mistral-7B-Instruct-v0.3 (works, inconsistent tool calling)
✅ Qwen2.5-Coder-7B-Instruct (works, needs tuning)
Setup Script: scripts/setup_vllm_thor.sh
Custom Chat Template: Fixed Mistral tokenizer hang and tool_call_id format

2. Local Embeddings (Semantic Memory)¶

Model: sentence-transformers/all-MiniLM-L6-v2
Auto-Detection: Automatically uses local embeddings when OPENAI_BASE_URL is not api.openai.com
Performance: Fast, no API costs
Storage: ChromaDB local vector database

3. DIMOS Agent Improvements¶

Fixed: Added tool_choice='auto' to OpenAIAgent._send_query() (commit 545b343)
Fixed: Added temperature=0.0 for deterministic responses
Impact: Improved tool calling reliability for BOTH local and cloud LLMs
Location: src/dimos-unitree/dimos/agents/agent.py lines 895-910

4. Configuration System¶

File: .env
Easy Switch: Comment/uncomment blocks to switch between cloud and local
Validated: Both configurations tested and working

Performance Results¶

Cloud LLM (GPT-4o) - Production Ready ✅¶

Command: "take a step back"
Response Time: 2.89s
Tool Calling: 100% consistent
Result: ✅ Perfect execution

Command: "take a step back and rotate 45 degrees right"  
Response Time: 4.17s
Tool Calling: 100% consistent
Result: ✅ Perfect execution

Command: "take a step forward and rotate 180 degrees"
Response Time: 3.81s
Tool Calling: 100% consistent
Result: ✅ Perfect execution

Benefits of Agent Optimizations (tool_choice + temperature): - Cloud response times improved (~3-7s, was ~10-30s before) - Tool calling now 100% consistent (no more text explanations) - Multi-step commands execute flawlessly

Local LLM (vLLM + Mistral-7B) - Proof of Concept ⚠️¶

Status: Works but inconsistent
Success Rate: ~60% (sometimes returns text, sometimes tool_calls)
When it works: Robot moves correctly
When it fails: Returns text explanation instead of calling functions

Issues:
- tool_choice='auto' not fully respected by Mistral-7B
- Prompt engineering affects consistency unpredictably
- Need more model/parameter tuning

Achievements:
✅ vLLM inference working (9-11s response time on Jetson)
✅ Local embeddings working (no API costs)
✅ Robot executed commands from local LLM multiple times
✅ Custom chat template fixed tokenizer issues

Key Files Modified¶

1. DIMOS Submodule (Pushed to GitHub)¶

src/dimos-unitree/dimos/agents/agent.py
- Added tool_choice='auto' parameter (line 901)
- Added temperature=0.0 parameter (line 906)
- Added debug logging for tool_choice validation
- Commit: 545b343
- Branch: fix/webrtc-instant-commands-and-progress

2. Mission Executor¶

src/shadowhound_mission_agent/shadowhound_mission_agent/mission_executor.py
- Simplified system prompt (line 77-79)
- Reduced max_output_tokens to 150 (line 71)
- Passes system_prompt via system_query parameter (line 358)

3. Configuration Files¶

.env - Dual configuration (cloud active, local commented)
scripts/setup_vllm_thor.sh - Thor vLLM deployment script
test_tool_call_format.sh - vLLM validation test

Configuration Examples¶

Cloud (GPT-4o) - ACTIVE¶

AGENT_BACKEND=openai
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o
USE_LOCAL_EMBEDDINGS=false

Local (vLLM) - COMMENTED OUT¶

# AGENT_BACKEND=openai
# OPENAI_BASE_URL=http://192.168.10.116:8000/v1
# OPENAI_MODEL=mistralai/Mistral-7B-Instruct-v0.3
# USE_LOCAL_EMBEDDINGS=true

Lessons Learned¶

What Worked ✅¶

vLLM is much faster than Ollama (~3.5x on Jetson)
OpenAI-compatible API is the right abstraction - same code for cloud and local
tool_choice='auto' is critical - without it, LLMs inconsistently choose format
temperature=0.0 helps consistency - robotics needs determinism
Local embeddings work great - sentence-transformers is fast and free
Auto-detection of local embeddings - no manual configuration needed

What Didn't Work ⚠️¶

Aggressive system prompts confuse tool calling - "You MUST call functions" breaks format
Mistral-7B tool calling unreliable - even with tool_choice='auto'
Small models need more tuning - 7B parameters may be too small for consistent tool use
Python cache issues on host - needed aggressive cleanup after submodule updates

Open Questions ❓¶

Why does Mistral ignore tool_choice? - vLLM bug or model limitation?
Would larger models (70B) be more consistent? - needs Thor with more VRAM
Can we force tool calling with constrained decoding? - vLLM feature to explore
Is the custom chat template correct? - might need Mistral-specific tweaks

Recommendations¶

For Production (Now)¶

✅ Use Cloud (GPT-4o) - Consistent, fast, reliable - Agent optimizations improved performance significantly - Cost is acceptable for real missions (~$0.01-0.05 per command)

For Future Local LLM Work¶

⚠️ Needs More Investigation - Try larger models (70B) when hardware permits - Explore vLLM constrained decoding features - Test other tool-calling models (Llama 3.1, Qwen2.5, etc.) - Consider fine-tuning a model specifically for robot control

Next Steps¶

✅ Merge agent improvements to dev (benefits both cloud and local)
✅ Keep vLLM scripts in repo for future experiments
⏸️ Pause local LLM work until hardware/model upgrades available
📝 Document the setup for future reference

Technical Debt / TODOs¶

[ ] Remove debug logging from DIMOS agent.py (or make it conditional)
[ ] Test Qwen2.5-Coder-7B tool calling consistency
[ ] Document vLLM custom chat template requirements
[ ] Add vLLM health check to start.sh
[ ] Consider model registry system (swap models easily)
[ ] Profile memory usage on Thor (can we run 13B models?)

How to Use This Work¶

Switch to Local LLM (Experimental)¶

# 1. On Thor, start vLLM
ssh thor
cd /path/to/shadowhound
./scripts/setup_vllm_thor.sh

# 2. On laptop, edit .env
# Comment out cloud config, uncomment local config

# 3. Restart
./start.sh

# 4. Test with simple commands
# Note: May return text instead of executing ~40% of the time

Switch to Cloud (Production)¶

# 1. Edit .env
# Uncomment cloud config, comment out local config

# 2. Restart
./start.sh

# 3. Enjoy consistent tool calling!

Acknowledgments¶

vLLM Team: For the excellent inference server
DIMOS Framework: For clean agent abstractions
Mistral AI: For the Mistral-7B model
Sentence Transformers: For local embeddings

Conclusion¶

This work successfully proved that: 1. ✅ Local LLM inference works on Jetson AGX Orin 2. ✅ OpenAI-compatible API abstraction is the right pattern 3. ✅ Agent improvements benefit both cloud and local LLMs 4. ✅ Local embeddings are production-ready

However, for production robot missions: - Use cloud LLM (GPT-4o) for consistent, reliable tool calling - Keep local LLM infrastructure for future when models improve

The agent improvements (tool_choice='auto' + temperature=0.0) are valuable regardless of backend and should be merged to dev.

Status: Ready for merge to dev ✅