# Local LLM Exploration

## Context

ShadowHound initially used the OpenAI cloud API for LLM inference (~12 s per response). For an autonomous mobile robot, this latency is problematic:

- Network dependency (the robot may operate without internet)
- Slow iteration cycles during development
- Cost considerations for continuous operation
- Privacy concerns for household deployment

Goal: Deploy an LLM locally on the Thor AGX Orin for faster, offline-capable inference.
## Hypothesis

A vLLM server running on the Thor AGX Orin (64 GB RAM) can host 7B-13B parameter models with:

- Sub-second response times (vs. ~12 s cloud)
- Tool calling support (critical for the DIMOS agent)
- Acceptable quality for household assistant tasks
## Experiments

### Experiment 1: Llama 3.1 8B Instruct (2025-10-10 Morning)
Commit: ca48cf3
What We Tried:
- Model: meta-llama/Meta-Llama-3.1-8B-Instruct
- vLLM server on Thor (port 8000)
- Tool calling with DIMOS agent
Results:

- ❌ License issues: Llama models require gated access
- ❌ Tokenizer hang: Loading got stuck on tokenizer initialization
- ⚠️ Licensing concern: Meta's license may restrict commercial use

Decision: Abandon. Licensing complexity and technical issues make this model unsuitable for an open-source project.
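For reference, the gated-access failure above is usually hit at weight-download time. A minimal pre-flight sketch, assuming the standard `HF_TOKEN` environment variable and the `huggingface_hub` client (neither is part of the original experiment notes):

```python
# Hypothetical pre-flight check before pointing vLLM at a gated Llama repo.
import os
from huggingface_hub import login  # pip install huggingface_hub

token = os.environ.get("HF_TOKEN")  # assumed variable name; use whatever your setup defines
if token is None:
    raise SystemExit("Set HF_TOKEN to a token that has been granted access to the Llama repo")
login(token=token)  # caches credentials so the weight download can authenticate
```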
### Experiment 2: Mistral 7B Instruct v0.3 (2025-10-10 Afternoon)
Commit: 3ac1e01
What We Tried:
- Model: mistralai/Mistral-7B-Instruct-v0.3
- vLLM server on Thor (port 8000)
- Tool calling validation script
- DIMOS agent integration test
Results:

- ✅ Tool calling works! Properly formats and executes tool calls
- ✅ Apache 2.0 license (fully open for any use)
- ✅ Performance: ~37 tokens/sec baseline
- ✅ Response time: ~0.5s vs 12s cloud (24x faster, 96% reduction)
- ✅ Quality: Acceptable for household assistant tasks

Decision: Selected as the primary local model. Best balance of performance, licensing, and functionality.
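A minimal sketch of the kind of tool-calling check this experiment relied on, assuming vLLM's OpenAI-compatible server is already running on Thor at port 8000. The hostname, tool schema, and prompt are illustrative, not the contents of `test_backend_validation.py`, and the vLLM flags in the comment should be verified against the installed version:

```python
# Server side (run separately on Thor), e.g.:
#   vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8000 \
#       --enable-auto-tool-choice --tool-call-parser mistral
from openai import OpenAI

client = OpenAI(base_url="http://thor.local:8000/v1", api_key="not-needed")  # hostname assumed

tools = [{
    "type": "function",
    "function": {
        "name": "move_to",  # illustrative robot skill, not a real DIMOS tool name
        "description": "Drive the robot to a named location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Go to the kitchen."}],
    tools=tools,
)

msg = resp.choices[0].message
# Mistral 7B v0.3 returns structured tool_calls here, which is what the agent needs.
print(msg.tool_calls or msg.content)
```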
### Experiment 3: Qwen 2.5 7B Instruct (2025-10-10 Evening)
Commit: 16be165
What We Tried:
- Model: Qwen/Qwen2.5-7B-Instruct
- Same vLLM setup as Mistral
- Tool calling test
Results:

- ❌ Tool calling broken: Returns tool calls as JSON text instead of executing them
- ⚠️ Format issue: Output contains the tool call structure but doesn't trigger execution
- ✅ Performance: Similar speed to Mistral
- ✅ License: Apache 2.0

Decision: Reject. Tool calling is critical for agent functionality; the format incompatibility is a blocker.
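The failure mode is easy to detect programmatically. A sketch of the distinction, using the same illustrative endpoint and tool schema as the Mistral snippet above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://thor.local:8000/v1", api_key="not-needed")  # assumed endpoint
tools = [{  # same illustrative schema as the Mistral sketch
    "type": "function",
    "function": {
        "name": "move_to",
        "description": "Drive the robot to a named location",
        "parameters": {"type": "object",
                       "properties": {"location": {"type": "string"}},
                       "required": ["location"]},
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Go to the kitchen."}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    # What the agent framework expects: structured calls it can dispatch.
    print("structured tool call:", msg.tool_calls[0].function.name)
else:
    # What Qwen produced in this setup: the call serialized as plain text in
    # `content`, which the DIMOS agent never executes.
    print("tool call leaked into content:", msg.content)
```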
### Experiment 4: Local Embeddings (SentenceTransformers) (2025-10-10 Evening)
Commit: b0c0e26
What We Tried:
- Replace OpenAI embeddings with sentence-transformers/all-MiniLM-L6-v2
- Test with DIMOS memory system
- Validate embedding quality for RAG
Results:

- ✅ Works with non-OpenAI backends
- ✅ Fast: Local embedding generation
- ✅ Quality: Sufficient for memory retrieval
- ✅ No API costs or rate limits

Decision: Adopt. Default embedding model for non-OpenAI backends.
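A minimal sketch of local embedding generation with this model, independent of DIMOS (the document and query strings are made up):

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = ["The charger is in the hallway closet.", "The dog bowl is next to the fridge."]
query = "Where is the charging dock?"

doc_vecs = model.encode(docs, normalize_embeddings=True)    # shape (2, 384)
query_vec = model.encode(query, normalize_embeddings=True)  # shape (384,)

# With normalized vectors, a dot product is cosine similarity.
scores = doc_vecs @ query_vec
print(docs[scores.argmax()])
```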
### Experiment 5: DIMOS Memory Fix (2025-10-10 Late Evening)
Commit: d9b6340
What We Tried:

- Fix DIMOS auto-creating OpenAI memory even with a local LLM
- Patch to respect the backend configuration
- Test with local LLM + local embeddings

Results:

- ✅ Fixed: DIMOS now respects the backend choice
- ✅ Local LLM + local embeddings working together
- ✅ No OpenAI dependency when using the local backend

Decision: Merged. Critical fix for local deployment.
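The shape of the fix, sketched as hypothetical code; the function and variable names below are invented for illustration and do not match DIMOS internals (the real patch is in commit d9b6340):

```python
# Hypothetical illustration of the fix's intent, not DIMOS code.
import os

def make_embedder(backend: str):
    if backend == "openai":
        from openai import OpenAI
        client = OpenAI()  # requires OPENAI_API_KEY
        return lambda text: client.embeddings.create(
            model="text-embedding-3-small", input=text
        ).data[0].embedding
    # Local backend: no OpenAI client is ever constructed.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return lambda text: model.encode(text).tolist()

# Before the fix: an OpenAI-backed memory was created unconditionally.
# After the fix: the configured backend is respected.
embed = make_embedder(os.environ.get("LLM_BACKEND", "local"))  # env var name is an assumption
```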
## Final Results

### What Worked

Mistral 7B Instruct v0.3 selected as primary local model:

- Performance: 37 tok/s baseline, ~0.5s response time
- Tool calling: Fully functional with DIMOS agent
- License: Apache 2.0 (open for any use)
- Quality: Acceptable for household tasks
- 24x speed improvement over OpenAI cloud (0.5s vs 12s)

Local embeddings (SentenceTransformers):

- No external API dependency
- Fast local generation
- Works with DIMOS memory system
### What Didn't Work

Llama 3.1 8B:

- Gated model access (licensing friction)
- Tokenizer hang issues
- License restrictions unclear for commercial use

Qwen 2.5 7B:

- Tool calling format incompatible with agent framework
- Returns JSON text instead of executing tool calls

Hermes-2-Pro:

- Briefly considered but not tested (switched to Mistral after good results)
## Key Decisions

- Mistral 7B as default local model
    - Rationale: Best balance of performance, licensing, and tool calling support
    - Trade-off: May not match GPT-4 quality, but speed/cost/privacy wins
- Local embeddings for non-OpenAI backends
    - Rationale: Removes the last OpenAI dependency
    - Trade-off: Slightly lower quality than OpenAI embeddings, but acceptable
- Dual-backend support maintained
    - Rationale: Cloud is still useful for complex tasks and development
    - Configuration: `.env` file selects the backend (see the sketch after this list)
- Tool calling is non-negotiable
    - Rationale: The agent framework requires tool execution
    - Impact: Eliminates many otherwise viable models
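A minimal sketch of what `.env`-driven backend selection can look like; the variable names (`LLM_BACKEND`, `LOCAL_LLM_URL`, `LOCAL_LLM_MODEL`) and the cloud model name are assumptions, not ShadowHound's actual configuration keys:

```python
# .env (illustrative):
#   LLM_BACKEND=local
#   LOCAL_LLM_URL=http://thor.local:8000/v1
#   LOCAL_LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
import os
from dotenv import load_dotenv  # pip install python-dotenv
from openai import OpenAI

load_dotenv()

if os.getenv("LLM_BACKEND", "local") == "openai":
    client = OpenAI()                             # cloud backend, uses OPENAI_API_KEY
    model = os.getenv("OPENAI_MODEL", "gpt-4o")   # cloud model name is an assumption
else:
    client = OpenAI(
        base_url=os.getenv("LOCAL_LLM_URL", "http://localhost:8000/v1"),
        api_key="not-needed",  # vLLM's OpenAI-compatible server does not check the key by default
    )
    model = os.getenv("LOCAL_LLM_MODEL", "mistralai/Mistral-7B-Instruct-v0.3")

reply = client.chat.completions.create(
    model=model, messages=[{"role": "user", "content": "Status check."}]
)
print(reply.choices[0].message.content)
```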
## Performance Metrics
| Backend | Response Time | Tokens/Sec | Cost | Dependency |
|---|---|---|---|---|
| OpenAI Cloud | ~12s | ~8 tok/s | $$ | Internet |
| Mistral 7B Local | ~0.5s | ~37 tok/s | Free | Thor AGX |
Speed improvement: 24x faster (96% latency reduction)
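A rough sketch of how latency and throughput figures like these can be collected against an OpenAI-compatible endpoint (prompt, endpoint, and token limit are illustrative; this is not the project's benchmark script):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://thor.local:8000/v1", api_key="not-needed")  # assumed endpoint

t0 = time.perf_counter()
resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "List three household chores a robot could help with."}],
    max_tokens=128,
)
elapsed = time.perf_counter() - t0

generated = resp.usage.completion_tokens
print(f"latency: {elapsed:.2f} s, throughput: {generated / elapsed:.1f} tok/s")
```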
## Constraints Identified
- Thor compute budget unknown: Need to profile full system under load
- Model size limit: 7B-13B models fit in 64GB RAM with room for other processes (rough estimate sketched after this list)
- Tool calling format: Not all models support OpenAI-compatible tool calling
- Licensing critical: Must verify Apache 2.0 or MIT for deployment
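On the model size limit: a back-of-envelope estimate of why a 7B model in 16-bit weights leaves most of the 64 GB budget free (the KV cache/overhead allowance is a rough assumption):

```python
# Back-of-envelope serving footprint for a 7B model in fp16/bf16.
params_billions = 7.0
bytes_per_param = 2                              # fp16/bf16 weights
weights_gb = params_billions * bytes_per_param   # ~14 GB of weights
kv_cache_and_overhead_gb = 10.0                  # rough allowance: KV cache, activations, runtime
total_gb = weights_gb + kv_cache_and_overhead_gb
print(f"~{total_gb:.0f} GB used of 64 GB -> ~{64 - total_gb:.0f} GB left for the rest of the stack")
```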
## Implementation Status

- [x] Research complete (4 models tested)
- [x] Approach validated (Mistral 7B working)
- [x] Implementation merged (commit 3ac1e01)
- [x] Documentation updated (performance notes, roadmap)
- [x] Local embeddings integrated
- [x] DIMOS memory fix applied
## Follow-Up Work

### Immediate
- Profile Thor under full system load (compute budget)
- Test Mistral 7B on real robot missions
### Future
- Experiment with quantized models (4-bit for memory efficiency)
- Test Mistral 22B if compute budget allows
- Evaluate vision-language models for perception tasks
## References

- Thor Performance Notes
- Local AI Roadmap
- vLLM Documentation: https://docs.vllm.ai/
- Mistral AI: https://mistral.ai/
- Tool calling test script: `test_backend_validation.py`

Total Time: ~13 hours (full-day marathon)
Commits: 60+ (most intensive development day)
Outcome: ✅ Local LLM deployment validated, 24x speed improvement achieved