# Ollama Model Comparison for ShadowHound

**Last Updated:** 2025-01-10
**Purpose:** Document model selection rationale for robot control workloads
**Hardware Target:** Thor (128GB RAM, ~58GB available for models)
## Executive Summary
After comprehensive research of the Ollama model ecosystem (100+ available models), we've identified 5 primary candidates and 2 optional models for ShadowHound testing. These models were selected based on:
- Memory constraints: Must fit in 58GB available RAM
- Use case alignment: Excel at structured output (JSON), planning, and reasoning
- Popularity/maturity: High pull counts indicate community validation
- Specialization: Task-specific models (coding, reasoning) vs general-purpose
- Performance diversity: Range from 7B (fast) to 70B (quality) parameters
**Expected Winner:** qwen2.5-coder:32b for production, phi4:14b for development
## Model Selection Matrix
| Model | Size | RAM | Pulls | Specialization | Use Case Fit | Speed Estimate |
|---|---|---|---|---|---|---|
| phi4:14b | 9.1GB | ~10GB | 5.3M | SOTA efficiency | Fast dev/test | ~80 tok/s ⚡⚡⚡ |
| qwen2.5-coder:32b | 20GB | ~20GB | 7.5M | Code/JSON specialist | Navigation plans | ~35 tok/s ⚡⚡ |
| qwq:32b | 20GB | ~20GB | 1.7M | Reasoning specialist | Complex planning | ~35 tok/s ⚡⚡ |
| llama3.3:70b | 43GB | ~43GB | 2.6M | General purpose | Baseline upgrade | ~18 tok/s ⚡ |
| deepseek-r1:7b | 4.7GB | ~5GB | 65.2M* | Reasoning chains | Fast reasoning | ~120 tok/s ⚡⚡⚡⚡ |
| hermes3:70b | 50GB | ~50GB | 339K | Function calling | Future tools API | ~18 tok/s ⚡ |
| gemma2:27b | 17GB | ~17GB | 8.1M | Efficient | Middle ground | ~45 tok/s ⚡⚡ |
*Combined pulls across all deepseek-r1 sizes
## Tier 1: Primary Testing Candidates
### 🥇 qwen2.5-coder:32b - JSON/Coding Specialist
**Why This is Likely the Winner:**

- **Built for structured output:** Trained specifically for code generation
- **JSON expertise:** Should achieve 95-100/100 on navigation prompt quality
- **Proven popularity:** 7.5M pulls = battle-tested in production
- **Optimal size:** 32B parameters = sweet spot for quality/speed
- **Robot control fit:** Navigation plans are essentially code (structured instructions)

**Expected Performance:**

- Simple prompts: 95-100/100 quality
- Navigation (JSON): 98-100/100 ← Best in class
- Reasoning: 85-90/100 (not specialized, but capable)
- Speed: ~35 tok/s (good enough for missions)

**Use Case:** Primary production model for mission planning and navigation
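To make the workload concrete, the sketch below asks a local Ollama server for a navigation plan as JSON, using the API's `format: "json"` option to constrain the output. This is a minimal illustration only: it assumes Ollama's standard `/api/generate` endpoint on the default port, and the prompt and plan schema are placeholder assumptions, not ShadowHound's actual mission prompt.

```python
# Minimal sketch: request a JSON navigation plan from a local Ollama server.
# The prompt and plan schema below are illustrative placeholders.
import json

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder:32b",
        "prompt": (
            "Return a navigation plan as JSON with keys 'steps' (list of "
            "{'action', 'target'} objects) and 'abort_conditions' (list of "
            "strings). Goal: move from the dock to the lab doorway."
        ),
        "format": "json",  # ask Ollama to constrain the model to valid JSON
        "stream": False,
    },
    timeout=120,
)
response.raise_for_status()
plan = json.loads(response.json()["response"])  # invalid JSON here = failed mission
print(plan["steps"])
```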
### 🥈 qwq:32b - Reasoning Specialist
**Why This Matters:**

- **Purpose-built reasoning:** From Qwen's reasoning-focused series
- **Complex planning:** Excels at multi-step logical decisions
- **Emerging but proven:** 1.7M pulls; newer, but gaining traction
- **Same size as qwen-coder:** Direct comparison at the 32B parameter level

**Expected Performance:**

- Simple prompts: 95-100/100 quality
- Navigation (JSON): 85-95/100 (formatting secondary to logic)
- Reasoning: 95-100/100 ← Best in class
- Speed: ~35 tok/s (comparable to qwen-coder)

**Use Case:** Alternative primary if reasoning matters more than JSON formatting, or for complex obstacle scenarios
### 🥉 phi4:14b - Speed Champion
**Why This is Essential:**

- **Microsoft SOTA:** State-of-the-art small model from Microsoft Research
- **Efficiency leader:** 5.3M pulls, highly regarded in the community
- **Development speed:** 2.5x faster than 32B models, near-instant responses
- **Surprising quality:** Small models improved dramatically in 2024-2025

**Expected Performance:**

- Simple prompts: 95-100/100 quality
- Navigation (JSON): 90-95/100 (strong, but not specialist level)
- Reasoning: 85-95/100 (impressive for the size)
- Speed: ~80 tok/s ← 2.5x faster than 32B models

**Use Case:** Development/testing model; backup for production when speed is critical
## Tier 2: Validation & Comparison
### llama3.3:70b - Latest Meta Release
**Why Test This:**

- **Direct upgrade:** Meta claims "similar to llama3.1:405b performance"
- **Current baseline successor:** Natural evolution from llama3.1:70b
- **Proven architecture:** Meta's LLaMA series is an industry standard
- **Largest in our range:** 70B = maximum quality we can fit

**Expected Performance:**

- All tasks: 90-100/100 (generalist strength)
- Speed: ~18 tok/s (slowest, but acceptable)

**Use Case:** Validation that specialized models beat generalists
### deepseek-r1:7b - Experimental Reasoning
**Why This is Interesting:**

- **"Reasoning approaching O3":** Cutting-edge reasoning architecture
- **Massive popularity:** 65.2M combined pulls (all sizes) = most popular
- **Ultra-fast:** Smallest model = fastest responses
- **Curiosity test:** Can 7B compete with 32B on reasoning?

**Expected Performance:**

- Simple prompts: 90-95/100
- Navigation (JSON): 80-90/100 (size limitation)
- Reasoning: 90-95/100 (specialized architecture compensates for size)
- Speed: ~120 tok/s ← Fastest option

**Use Case:** CI/testing, curiosity about the reasoning vs size tradeoff
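One practical note if deepseek-r1 is used for structured output: its responses interleave a chain-of-thought block before the final answer, which has to be stripped before JSON parsing. A sketch, assuming the reasoning arrives inline between `<think>` tags as Ollama's deepseek-r1 builds emit it:

```python
# Sketch: strip deepseek-r1's chain-of-thought block before parsing the
# final answer as JSON. Assumes the reasoning arrives inline between
# <think>...</think> tags, as Ollama's deepseek-r1 builds emit it.
import json
import re

def parse_r1_answer(raw: str) -> dict:
    answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(answer)
```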
## Tier 3: Optional Models
### hermes3:70b - Tool-Use Specialist (Optional)
**Why Consider:**

- **Function calling:** Built for tool-based use cases
- **Skills API alignment:** Fits a future expansion of the skills API with tool definitions
- **70B quality:** Large model = high capability

**When to Test:**

- If planning to expand the skills API with function calling
- If tool-use capabilities (structured API calls) are needed
- If there is spare time for extra benchmarking

**Downside:** ~50GB RAM = 86% memory utilization (risky)
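If the skills API does grow tool definitions, Ollama's `/api/chat` endpoint accepts an OpenAI-style `tools` array, which is the calling style hermes3 is trained for. Below is a sketch of what one skill might look like as a tool; the `move_to` skill and its parameters are hypothetical, not part of the current skills API.

```python
# Sketch: exposing a hypothetical "move_to" skill as a tool definition for
# Ollama's /api/chat endpoint. The skill name and parameters are assumptions.
import requests

move_to_tool = {
    "type": "function",
    "function": {
        "name": "move_to",
        "description": "Drive the robot to a named waypoint",
        "parameters": {
            "type": "object",
            "properties": {
                "waypoint": {"type": "string", "description": "Named map location"},
            },
            "required": ["waypoint"],
        },
    },
}

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hermes3:70b",
        "messages": [{"role": "user", "content": "Go to the charging dock."}],
        "tools": [move_to_tool],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
# Tool calls, if the model decides to make any, appear under message.tool_calls.
print(resp.json()["message"].get("tool_calls"))
```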
### gemma2:27b - Google's Efficient Model (Optional)
**Why Consider:**

- **Middle ground:** Sits between the 14B and 32B candidates
- **Google quality:** 8.1M pulls, proven
- **Efficiency focus:** Optimized for resource usage

**When to Test:**

- If the 32B models are too slow
- If the 14B models have insufficient quality
- If a compromise option is needed

**Downside:** Likely beaten by phi4 on speed and by qwen-coder on quality
## Memory Usage Analysis
Thor has 128GB total RAM:

- ROS2 + Nav2 + Perception: ~30GB
- System overhead: ~20GB
- OS/buffers: ~20GB
- Available for models: ~58GB
### Models by Memory Footprint
**Small (can run 5+ simultaneously):**

- deepseek-r1:7b: ~5GB ✅
- phi4:14b: ~9GB ✅

**Medium (can run 2 simultaneously):**

- gemma2:27b: ~17GB ✅
- qwen2.5-coder:32b: ~20GB ✅
- qwq:32b: ~20GB ✅

**Large (run one at a time):**

- llama3.3:70b: ~43GB ✅ (74% utilization)
- hermes3:70b: ~50GB ⚠️ (86% utilization - risky)

**Cannot Fit:**

- llama4:maverick: 245GB ❌
- qwen3:235b: 147GB ❌
- deepseek-r1:671b: 404GB ❌
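A quick arithmetic sanity check against the ~58GB budget, using the footprint estimates above. This is just a sketch that mirrors the numbers in this section; it is useful when weighing whether a primary and backup model can stay resident at the same time.

```python
# Sketch: sanity-check model pairings against Thor's ~58GB model budget,
# using the footprint estimates from the lists above.
BUDGET_GB = 58

FOOTPRINT_GB = {
    "deepseek-r1:7b": 5,
    "phi4:14b": 9,
    "gemma2:27b": 17,
    "qwen2.5-coder:32b": 20,
    "qwq:32b": 20,
    "llama3.3:70b": 43,
    "hermes3:70b": 50,
}

def fits(*models: str) -> bool:
    """Return True if the combined footprint stays within the budget."""
    total = sum(FOOTPRINT_GB[m] for m in models)
    print(f"{' + '.join(models)}: {total}GB ({total / BUDGET_GB:.0%} of budget)")
    return total <= BUDGET_GB

fits("qwen2.5-coder:32b", "phi4:14b")  # 29GB (50%) -> primary + backup co-resident
fits("llama3.3:70b", "phi4:14b")       # 52GB (90%) -> tight, but fits
fits("hermes3:70b", "phi4:14b")        # 59GB (102%) -> does not fit
```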
## Expected Benchmark Results
### Prediction Matrix
Based on model architectures and specializations:
| Model | Simple (3 words) | Navigation (JSON) | Reasoning (obstacle) | Avg Speed |
|---|---|---|---|---|
| phi4:14b | 100 | 92 | 88 | ~80 tok/s |
| qwen2.5-coder:32b | 100 | 98 ⭐ | 87 | ~35 tok/s |
| qwq:32b | 100 | 90 | 96 ⭐ | ~35 tok/s |
| llama3.3:70b | 100 | 95 | 92 | ~18 tok/s |
| deepseek-r1:7b | 95 | 85 | 92 | ~120 tok/s |
**Key Insights:**

- **Simple prompts:** All models should score 95-100 (trivial task)
- **Navigation:** qwen2.5-coder should dominate (JSON specialist)
- **Reasoning:** qwq should lead (reasoning specialist)
- **Speed:** Size is king (7B > 14B > 32B > 70B)
## Decision Framework
### Use Case: Navigation-Heavy Missions (Most Likely)

**Recommendation:** qwen2.5-coder:32b primary, phi4:14b backup

**Rationale:**

- Navigation plans are the primary workload
- JSON generation quality is critical (invalid JSON = failed mission)
- qwen2.5-coder is purpose-built for structured output
- phi4 provides a fast fallback (2.5x speed) with acceptable quality

Configuration:

```bash
PRIMARY_MODEL="qwen2.5-coder:32b"
BACKUP_MODEL="phi4:14b"
```
### Use Case: Complex Reasoning Missions

**Recommendation:** qwq:32b primary, deepseek-r1:7b backup

**Rationale:**

- Multi-step logical planning is the primary challenge
- Obstacle avoidance requires spatial reasoning
- qwq is reasoning-focused and should excel at "which side of the doorway" scenarios
- deepseek-r1 offers ultra-fast reasoning for iterative planning

Configuration:

```bash
PRIMARY_MODEL="qwq:32b"
BACKUP_MODEL="deepseek-r1:7b"
```
### Use Case: Development/Testing

**Recommendation:** phi4:14b primary, deepseek-r1:7b optional

**Rationale:**

- Dev iterations need fast responses
- Quality is "good enough" for testing
- Iteration is 2-3x faster than with the production models
- deepseek-r1 covers ultra-fast reasoning tests

Configuration:

```bash
PRIMARY_MODEL="phi4:14b"
BACKUP_MODEL="deepseek-r1:7b"
```
### Use Case: Production Quality (Regardless of Speed)

**Recommendation:** llama3.3:70b primary, qwen2.5-coder:32b backup

**Rationale:**

- Largest model = highest quality (generalist)
- Speed is acceptable for mission planning (not real-time)
- qwen-coder as backup reduces the RAM footprint to ~20GB if needed

Configuration:

```bash
PRIMARY_MODEL="llama3.3:70b"
BACKUP_MODEL="qwen2.5-coder:32b"
```
## What Makes These Different from llama3.1?
### Current Baseline (llama3.1:8b / llama3.1:70b)
- **Type:** General-purpose instruction-following models
- **Strengths:** Broad capabilities, proven in production
- **Weaknesses:** Not specialized; older architecture (released mid-2024)
### Why Alternatives May Be Better
#### Specialization
- **qwen2.5-coder:** Trained on code/structured data → better JSON
- **qwq:** Trained with reasoning chains → better planning
- **hermes3:** Trained for function calling → better tool use
#### Efficiency
- **phi4:** 1/5 the size of a 70B model with similar quality (new architecture)
- **deepseek-r1:** 1/10 the size, with specialized reasoning
#### Modern Architecture
- **llama3.3:** Improved over 3.1 (released December 2024)
- **qwen2.5/qwq:** Late-2024 releases with 128K context and 18T training tokens
- **deepseek-r1:** Built-in chain-of-thought reasoning; the model card claims reasoning approaching O3
#### Training Data
- **qwen2.5-coder:** Massive code corpus (better at JSON/structure)
- **phi4:** Textbook-quality data (better at reasoning despite its size)
## Next Steps: Running the Benchmark
### 1. Pull Models
On Thor, pull the Tier 1 + Tier 2 models (Tier 3 is optional):
```bash
# SSH to Thor
ssh thor

# Pull models (will take 15-30 minutes total)
docker exec ollama ollama pull phi4:14b            # ~9GB download
docker exec ollama ollama pull qwen2.5-coder:32b   # ~20GB download
docker exec ollama ollama pull qwq:32b             # ~20GB download
docker exec ollama ollama pull llama3.3:70b        # ~43GB download
docker exec ollama ollama pull deepseek-r1:7b      # ~5GB download

# Optional Tier 3
# docker exec ollama ollama pull hermes3:70b       # ~50GB download
# docker exec ollama ollama pull gemma2:27b        # ~17GB download
```
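To confirm the pulls landed, Ollama's `/api/tags` endpoint lists the installed models. A quick check, assuming the API is reachable at the default port on Thor:

```python
# Sketch: verify the expected models are installed by querying Ollama's
# /api/tags endpoint (run on Thor, or point the URL at Thor's address).
import requests

EXPECTED = {"phi4:14b", "qwen2.5-coder:32b", "qwq:32b", "llama3.3:70b", "deepseek-r1:7b"}

tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
installed = {m["name"] for m in tags["models"]}
missing = EXPECTED - installed
if missing:
    print("missing models:", ", ".join(sorted(missing)))
else:
    print("all expected models are installed")
```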
### 2. Run Benchmark
```bash
cd ~/shadowhound
./scripts/benchmark_ollama_models.sh
```

**Expected Duration:** 15-25 minutes (5 models × 3 prompts × ~60s each)
### 3. Analyze Results
The script will output:

- Performance summary (speed, tokens/sec, TTFT); the sketch below shows where these numbers come from
- Quality scores (0-100 per task)
- Recommendations based on speed vs quality tradeoffs

Look for:

- ✅ Quality scores >90 on navigation prompts (critical)
- ✅ Quality scores >85 on reasoning prompts (important)
- ⚖️ Speed tradeoffs (2x slower for 10% better quality = a good deal)
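For reference, the speed metrics fall out of the metadata Ollama attaches to a completed `/api/generate` response, where durations are reported in nanoseconds. A sketch of the arithmetic, assuming the benchmark script parses the non-streaming response (a streaming run would instead measure TTFT as time to the first chunk):

```python
# Sketch: derive tokens/sec and an approximate TTFT from the metadata that a
# completed (non-streaming) /api/generate response carries; durations are ns.
def summarize(resp: dict) -> None:
    tok_per_s = resp["eval_count"] / resp["eval_duration"] * 1e9
    # Rough TTFT proxy: model load time plus prompt processing time.
    ttft_s = (resp.get("load_duration", 0) + resp["prompt_eval_duration"]) / 1e9
    print(f"{tok_per_s:.1f} tok/s, ~{ttft_s:.2f}s to first token")
```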
### 4. Make Data-Driven Decision
Based on actual results:
```python
# Pseudo-logic for the decision (inputs come from the benchmark output)
if qwen_coder_nav_quality > 95 and qwen_coder_speed > 30:
    PRIMARY = "qwen2.5-coder:32b"   # JSON specialist wins
elif qwq_reasoning > 95 and missions_are_reasoning_heavy:
    PRIMARY = "qwq:32b"             # Reasoning specialist wins
elif phi4_quality > 90 and speed_is_critical:
    PRIMARY = "phi4:14b"            # Speed champion wins
else:
    PRIMARY = "llama3.3:70b"        # Safe generalist choice

if PRIMARY in ("qwen2.5-coder:32b", "qwq:32b"):
    BACKUP = "phi4:14b"             # Fast backup for a 32B primary
else:
    BACKUP = "qwen2.5-coder:32b"    # Quality backup for a speed/generalist primary
```
### 5. Update Configuration
Edit `scripts/setup_ollama_thor.sh`:

```bash
# Before (baseline)
PRIMARY_MODEL="llama3.1:70b"
BACKUP_MODEL="llama3.1:8b"

# After (data-driven choice)
PRIMARY_MODEL="qwen2.5-coder:32b"   # Or the winner from the benchmark
BACKUP_MODEL="phi4:14b"
```
## Research Sources
### Model Information
- **Ollama Library:** https://ollama.com/library (100+ models with specs)
- **Ollama GitHub:** https://github.com/ollama/ollama (154k stars, active)
- **Model Cards:** Individual model pages in the Ollama library
### Key Models Investigated
**Reasoning Specialists:**

- deepseek-r1 (1.5b-671b): "Reasoning approaching O3", 65.2M pulls
- qwq (32b): "Reasoning model of Qwen series", 1.7M pulls
- openthinker (7b-32b): "Distilled from DeepSeek-R1", 601K pulls

**Coding Specialists:**

- qwen2.5-coder (0.5b-32b): "Code generation, reasoning, fixing", 7.5M pulls
- qwen3-coder (30b-480b): "Agentic and coding tasks", 471K pulls
- deepseek-coder-v2 (16b-236b): "GPT4-Turbo comparable", 1.1M pulls

**General Purpose:**

- llama3.3 (70b): "Similar to llama3.1:405b", 2.6M pulls
- qwen2.5 (0.5b-72b): "18T tokens, 128K context", 14.8M pulls
- phi4 (14b): "Microsoft state-of-the-art", 5.3M pulls
- gemma2 (2b-27b): "High-performing, efficient", 8.1M pulls

**Function Calling:**

- hermes3 (3b-405b): "Tool-based use cases", 339K pulls
- granite3.1-dense (2b-8b): "RAG and tool support", 121K pulls
## Selection Methodology
- **Memory filtering:** Eliminated models >60GB (Thor constraint)
- **Popularity filtering:** Prioritized models with >1M pulls (proven)
- **Specialization matching:** Selected coding/reasoning specialists for robot control
- **Size diversity:** Covered the 7B-70B range for speed vs quality comparison
- **Community validation:** Checked GitHub stars, pull counts, recent activity
## Appendix: Full Model Landscape
For reference, here are other notable models that didn't make the cut and why:
### Too Large for Thor (>60GB)
- llama4:maverick (400B): 245GB RAM required ❌
- deepseek-r1:671b: 404GB RAM required ❌
- qwen3:235b: 147GB RAM required ❌
- deepseek-v3:671b: 404GB RAM required ❌
### Too Small (Insufficient Quality)
- smollm2 (135m-1.7b): Compact but limited capability ⚠️
- gemma3:1b: Tiny, good for edge but not robot control ⚠️
### Not Specialized for Use Case
- mistral (7b): Good general model, but beaten by phi4 ⚠️
- llava (7b-34b): Vision-focused, but we don't need VLM yet ⚠️
- codellama (7b): Older coding model, beaten by qwen2.5-coder ⚠️
### Redundant with Better Options
- qwen3 (various): Newer, but qwen2.5-coder more specialized ⚠️
- llama3.1:8b: Original baseline, but phi4 likely better ⚠️
- gemma3 (various): Good, but covered by gemma2:27b ⚠️
### Experimental/Unproven
- cogito (3b-70b): Interesting hybrid, but only 548K pulls ⚠️
- openthinker (7b-32b): Promising, but beaten by qwq ⚠️
## Changelog
### 2025-01-10 - Initial Selection
- Researched 100+ Ollama models
- Selected 5 primary + 2 optional candidates
- Documented selection rationale
- Created testing matrix and decision framework
**Next Update:** After benchmark results are available, update this document with actual performance data and the final recommendation.