Ollama Model Comparison for ShadowHound

Last Updated: 2025-01-10
Purpose: Document model selection rationale for robot control workloads
Hardware Target: Thor (128GB RAM, ~58GB available for models)


Executive Summary

After comprehensive research of the Ollama model ecosystem (100+ available models), we've identified 5 primary candidates and 2 optional models for ShadowHound testing. These models were selected based on:

  1. Memory constraints: Must fit in 58GB available RAM
  2. Use case alignment: Excel at structured output (JSON), planning, and reasoning
  3. Popularity/maturity: High pull counts indicate community validation
  4. Specialization: Task-specific models (coding, reasoning) vs general-purpose
  5. Performance diversity: Range from 7B (fast) to 70B (quality) parameters

Expected Winner: qwen2.5-coder:32b for production, phi4:14b for development


Model Selection Matrix

Model              | Size  | RAM   | Pulls  | Specialization       | Use Case Fit     | Speed Estimate
-------------------|-------|-------|--------|----------------------|------------------|--------------------
phi4:14b           | 9.1GB | ~10GB | 5.3M   | SOTA efficiency      | Fast dev/test    | ~80 tok/s ⚡⚡⚡
qwen2.5-coder:32b  | 20GB  | ~20GB | 7.5M   | Code/JSON specialist | Navigation plans | ~35 tok/s ⚡⚡
qwq:32b            | 20GB  | ~20GB | 1.7M   | Reasoning specialist | Complex planning | ~35 tok/s ⚡⚡
llama3.3:70b       | 43GB  | ~43GB | 2.6M   | General purpose      | Baseline upgrade | ~18 tok/s ⚡
deepseek-r1:7b     | 4.7GB | ~5GB  | 65.2M* | Reasoning chains     | Fast reasoning   | ~120 tok/s ⚡⚡⚡⚡
hermes3:70b        | 50GB  | ~50GB | 339K   | Function calling     | Future tools API | ~18 tok/s ⚡
gemma2:27b         | 17GB  | ~17GB | 8.1M   | Efficient            | Middle ground    | ~45 tok/s ⚡⚡

*Combined pulls across all deepseek-r1 sizes


Tier 1: Primary Testing Candidates

🥇 qwen2.5-coder:32b - JSON/Coding Specialist

Why This is Likely the Winner:

  • Built for structured output: Trained specifically for code generation
  • JSON expertise: Should achieve 95-100/100 on navigation prompt quality
  • Proven popularity: 7.5M pulls = battle-tested in production
  • Optimal size: 32B parameters = sweet spot for quality/speed
  • Robot control fit: Navigation plans are essentially code (structured instructions)

Expected Performance:

  • Simple prompts: 95-100/100 quality
  • Navigation (JSON): 98-100/100 ← Best in class
  • Reasoning: 85-90/100 (not specialized, but capable)
  • Speed: ~35 tok/s (good enough for missions)

Use Case: Primary production model for mission planning and navigation
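
To make the JSON requirement concrete, here is a minimal sketch of requesting a navigation plan as strict JSON from Ollama. The /api/generate endpoint and the "format": "json" option are standard Ollama API; the prompt wording and the plan schema are illustrative, not the actual ShadowHound planner interface.

# Sketch: ask qwen2.5-coder for a navigation plan constrained to valid JSON
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama reachable locally

prompt = (
    "Produce a navigation plan as JSON with keys 'waypoints' "
    "(list of [x, y] in metres) and 'speed' (m/s) to reach the charging dock."
)

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "qwen2.5-coder:32b",
        "prompt": prompt,
        "format": "json",   # constrain the output to valid JSON
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()

plan = json.loads(resp.json()["response"])  # invalid JSON raises here
print(plan)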


🥈 qwq:32b - Reasoning Specialist

Why This Matters:

  • Purpose-built reasoning: From Qwen's reasoning-focused series
  • Complex planning: Excels at multi-step logical decisions
  • Emerging but proven: 1.7M pulls, newer but gaining traction
  • Same size as qwen-coder: Direct comparison at the 32B parameter level

Expected Performance:

  • Simple prompts: 95-100/100 quality
  • Navigation (JSON): 85-95/100 (formatting secondary to logic)
  • Reasoning: 95-100/100 ← Best in class
  • Speed: ~35 tok/s (comparable to qwen-coder)

Use Case: Alternative primary when reasoning quality matters more than JSON formatting, or for complex obstacle scenarios


🥉 phi4:14b - Speed Champion

Why This is Essential:

  • Microsoft SOTA: State-of-the-art small model from Microsoft Research
  • Efficiency leader: 5.3M pulls, highly regarded in the community
  • Development speed: 2.5x faster than 32B models, near-instant responses
  • Surprising quality: Small models have dramatically improved in 2024-2025

Expected Performance:

  • Simple prompts: 95-100/100 quality
  • Navigation (JSON): 90-95/100 (strong, but not specialist level)
  • Reasoning: 85-95/100 (impressive for its size)
  • Speed: ~80 tok/s ← 2.5x faster than 32B models

Use Case: Development/testing model; backup for production when speed is critical


Tier 2: Validation & Comparison

llama3.3:70b - Latest Meta Release

Why Test This:

  • Direct upgrade: Meta claims "similar to llama3.1:405b performance"
  • Current baseline successor: Natural evolution from llama3.1:70b
  • Proven architecture: Meta's LLaMA series is an industry standard
  • Largest in our range: 70B = the maximum quality we can fit

Expected Performance:

  • All tasks: 90-100/100 (generalist strength)
  • Speed: ~18 tok/s (slowest, but acceptable)

Use Case: Validation that specialized models beat generalists


deepseek-r1:7b - Experimental Reasoning

Why This is Interesting:

  • "Reasoning approaching O3": Cutting-edge reasoning architecture
  • Massive popularity: 65.2M combined pulls (all sizes) = most popular
  • Ultra-fast: Smallest model = fastest responses
  • Curiosity test: Can 7B compete with 32B on reasoning?

Expected Performance:

  • Simple prompts: 90-95/100
  • Navigation (JSON): 80-90/100 (size limitation)
  • Reasoning: 90-95/100 (specialized architecture compensates for size)
  • Speed: ~120 tok/s ← Fastest option

Use Case: CI/testing, curiosity about reasoning vs size tradeoff


Tier 3: Optional Models

hermes3:70b - Tool-Use Specialist (Optional)

Why Consider:

  • Function calling: Built for tool-based use cases
  • Skills API alignment: Future expansion of the skills API with tool definitions
  • 70B quality: Large model = high capability

When to Test:

  • If planning to expand the skills API with function calling (see the sketch after this section)
  • If tool-use capabilities (structured API calls) are needed
  • If there is extra time for benchmarking

Downside: 50GB RAM = 86% memory utilization (risky)
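
If the skills API does grow tool definitions, the call would look roughly like the sketch below, which uses the tools field of Ollama's /api/chat; the move_to tool is hypothetical, not an existing ShadowHound skill.

# Sketch: function calling through Ollama's chat API with a hypothetical skill
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hermes3:70b",
        "messages": [{"role": "user", "content": "Go to the kitchen doorway."}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "move_to",  # hypothetical skill, not a real ShadowHound API
                "description": "Drive the robot to a named location",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }],
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()

# Models with tool support return structured calls instead of free text
for call in resp.json()["message"].get("tool_calls", []):
    print(call["function"]["name"], call["function"]["arguments"])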


gemma2:27b - Google's Efficient Model (Optional)

Why Consider:

  • Middle ground: Between 14B and 32B
  • Google quality: 8.1M pulls, proven
  • Efficiency focus: Optimized for resource usage

When to Test:

  • If 32B models are too slow
  • If 14B models have insufficient quality
  • If a compromise option is needed

Downside: Likely beaten by phi4 (speed) and qwen-coder (quality)


Memory Usage Analysis

Thor has 128GB total RAM:

  • ROS2 + Nav2 + Perception: ~30GB
  • System overhead: ~20GB
  • OS/buffers: ~20GB
  • Available for models: ~58GB

Models by Memory Footprint

Small (can run 5+ simultaneously):

  • deepseek-r1:7b: ~5GB ✅
  • phi4:14b: ~9GB ✅

Medium (can run 2 simultaneously):

  • gemma2:27b: ~17GB ✅
  • qwen2.5-coder:32b: ~20GB ✅
  • qwq:32b: ~20GB ✅

Large (run one at a time):

  • llama3.3:70b: ~43GB ✅ (74% utilization)
  • hermes3:70b: ~50GB ⚠️ (86% utilization - risky)

Cannot Fit:

  • llama4:maverick: 245GB ❌
  • qwen3:235b: 147GB ❌
  • deepseek-r1:671b: 404GB ❌
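
A quick sanity check of these footprints against the ~58GB budget (numbers taken from the matrix above), including which pairs can stay loaded together as primary + backup:

# Which models fit Thor's ~58GB model budget, alone or in pairs
from itertools import combinations

BUDGET_GB = 58
footprint_gb = {            # approximate resident RAM per loaded model
    "deepseek-r1:7b": 5,
    "phi4:14b": 9,
    "gemma2:27b": 17,
    "qwen2.5-coder:32b": 20,
    "qwq:32b": 20,
    "llama3.3:70b": 43,
    "hermes3:70b": 50,
}

for name, gb in footprint_gb.items():
    status = "fits" if gb <= BUDGET_GB else "too large"
    print(f"{name:<20} {gb:>3}GB  {status}")

# Pairs that fit simultaneously (useful primary + backup combinations)
for (a, ga), (b, gb2) in combinations(footprint_gb.items(), 2):
    if ga + gb2 <= BUDGET_GB:
        print(f"pair ok: {a} + {b} = {ga + gb2}GB")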


Expected Benchmark Results

Prediction Matrix

Based on model architectures and specializations:

Model              | Simple (3 words) | Navigation (JSON) | Reasoning (obstacle) | Avg Speed
-------------------|------------------|-------------------|----------------------|------------
phi4:14b           | 100              | 92                | 88                   | ~80 tok/s
qwen2.5-coder:32b  | 100              | 98                | 87                   | ~35 tok/s
qwq:32b            | 100              | 90                | 96                   | ~35 tok/s
llama3.3:70b       | 100              | 95                | 92                   | ~18 tok/s
deepseek-r1:7b     | 95               | 85                | 92                   | ~120 tok/s

Key Insights:

  • Simple prompts: All models should score 95-100 (trivial task)
  • Navigation: qwen2.5-coder should dominate (JSON specialist)
  • Reasoning: qwq should lead (reasoning specialist)
  • Speed: Size is king (7B > 14B > 32B > 70B)


Decision Framework

Use Case: Navigation-Heavy Missions (Most Likely)

Recommendation: qwen2.5-coder:32b primary, phi4:14b backup

Rationale:

  • Navigation plans are the primary workload
  • JSON generation quality is critical (invalid JSON = failed mission)
  • qwen2.5-coder is purpose-built for structured output
  • phi4 provides a fast fallback (2.5x speed) with acceptable quality

Configuration:

PRIMARY_MODEL="qwen2.5-coder:32b"
BACKUP_MODEL="phi4:14b"

Use Case: Complex Reasoning Missions

Recommendation: qwq:32b primary, deepseek-r1:7b backup

Rationale:

  • Multi-step logical planning is the primary challenge
  • Obstacle avoidance requires spatial reasoning
  • qwq is reasoning-focused and should excel at "which side of the doorway" scenarios
  • deepseek-r1 offers ultra-fast reasoning for iterative planning

Configuration:

PRIMARY_MODEL="qwq:32b"
BACKUP_MODEL="deepseek-r1:7b"

Use Case: Development/Testing

Recommendation: phi4:14b primary, deepseek-r1:7b optional

Rationale:

  • Dev iterations need fast responses
  • Quality is "good enough" for testing
  • Can iterate 2-3x faster than with production models
  • deepseek-r1 for ultra-fast reasoning tests

Configuration:

PRIMARY_MODEL="phi4:14b"
BACKUP_MODEL="deepseek-r1:7b"

Use Case: Production Quality (Regardless of Speed)

Recommendation: llama3.3:70b primary, qwen2.5-coder:32b backup

Rationale:

  • Largest model = highest quality (generalist)
  • Speed is acceptable for mission planning (not real-time)
  • qwen-coder as backup reduces RAM to 20GB if needed

Configuration:

PRIMARY_MODEL="llama3.3:70b"
BACKUP_MODEL="qwen2.5-coder:32b"

What Makes These Different from llama3.1?

Current Baseline (llama3.1:8b / llama3.1:70b)

  • Type: General-purpose instruction-following models
  • Strengths: Broad capabilities, proven in production
  • Weaknesses: Not specialized, older architecture (released 2024)

Why Alternatives May Be Better

Specialization

  • qwen2.5-coder: Trained on code/structured data → better JSON
  • qwq: Trained with reasoning chains → better planning
  • hermes3: Trained for function calling → better tool use

Efficiency

  • phi4: 1/5 the size of 70B, similar quality (new architecture)
  • deepseek-r1: 1/10 the size, specialized reasoning

Modern Architecture

  • llama3.3: Improved over 3.1 (released late 2024)
  • qwen2.5/qwq: Recent releases with 128K context and 18T tokens of training data
  • deepseek-r1: Chain-of-thought reasoning, claimed to approach O3-level performance

Training Data

  • qwen2.5-coder: Massive code corpus (better at JSON/structure)
  • phi4: Textbook-quality data (better at reasoning despite size)

Next Steps: Running the Benchmark

1. Pull Models

On Thor, pull the Tier 1 + Tier 2 models (Tier 3 is optional):

# SSH to Thor
ssh thor

# Pull models (will take 15-30 minutes total)
docker exec ollama ollama pull phi4:14b          # ~9GB download
docker exec ollama ollama pull qwen2.5-coder:32b # ~20GB download
docker exec ollama ollama pull qwq:32b           # ~20GB download
docker exec ollama ollama pull llama3.3:70b      # ~43GB download
docker exec ollama ollama pull deepseek-r1:7b    # ~5GB download

# Optional Tier 3
# docker exec ollama ollama pull hermes3:70b     # ~50GB download
# docker exec ollama ollama pull gemma2:27b      # ~17GB download

2. Run Benchmark

cd ~/shadowhound
./scripts/benchmark_ollama_models.sh

Expected Duration: 15-25 minutes (5 models × 3 prompts × ~60s each)
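
For context, a single measurement of the kind the script performs can be sketched as follows (an illustration, not the actual benchmark_ollama_models.sh logic): stream one prompt, time the first token, and compute tokens/sec from the eval_count and eval_duration fields that Ollama reports in its final streamed chunk.

# Sketch: measure TTFT and tokens/sec for one model + prompt via the Ollama API
import json
import time
import requests

def measure(model: str, prompt: str) -> dict:
    start = time.monotonic()
    ttft = None
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=600,
    ) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if ttft is None and chunk.get("response"):
                ttft = time.monotonic() - start        # time to first token
            if chunk.get("done"):
                # eval_duration is reported in nanoseconds
                tok_s = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
                return {"ttft_s": ttft, "tok_per_s": tok_s}
    return {}

print(measure("phi4:14b", "Reply with exactly three words."))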

3. Analyze Results

The script will output:

  • Performance summary (speed, tokens/sec, TTFT = time to first token)
  • Quality scores (0-100 per task)
  • Recommendations based on speed vs quality tradeoffs

Look for:

  • ✅ Quality scores >90 on navigation prompts (critical)
  • ✅ Quality scores >85 on reasoning prompts (important)
  • ⚖️ Speed tradeoffs (2x slower for 10% better quality = a good deal)
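
The navigation criterion ultimately reduces to "does the output parse and contain what the planner needs". A hedged sketch of that kind of check (the expected keys are assumptions, not the benchmark's actual rubric):

# Sketch: score a navigation response by JSON validity and expected fields
import json

def navigation_quality(raw_output: str) -> int:
    """Return a rough 0-100 score; invalid JSON means a failed mission."""
    try:
        plan = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0
    score = 50                                   # parses at all
    if isinstance(plan.get("waypoints"), list):
        score += 30                              # has a waypoint list
    if "speed" in plan:
        score += 20                              # has a speed field
    return score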

4. Make Data-Driven Decision

Based on actual results:

# Illustrative decision logic (the results dict shape and thresholds are
# examples; adapt to the benchmark script's actual output)
def choose_models(results: dict, missions_are_reasoning_heavy: bool,
                  speed_is_critical: bool) -> tuple:
    """Return (PRIMARY, BACKUP) from per-model benchmark results."""
    coder = results["qwen2.5-coder:32b"]
    if coder["nav_quality"] > 95 and coder["tok_per_s"] > 30:
        primary = "qwen2.5-coder:32b"   # JSON specialist wins
    elif results["qwq:32b"]["reasoning"] > 95 and missions_are_reasoning_heavy:
        primary = "qwq:32b"             # Reasoning specialist wins
    elif results["phi4:14b"]["avg_quality"] > 90 and speed_is_critical:
        primary = "phi4:14b"            # Speed champion wins
    else:
        primary = "llama3.3:70b"        # Safe generalist choice

    if primary in ("qwen2.5-coder:32b", "qwq:32b"):
        backup = "phi4:14b"             # Fast backup for a 32B primary
    else:
        backup = "qwen2.5-coder:32b"    # Quality backup for a speed primary
    return primary, backup

5. Update Configuration

Edit scripts/setup_ollama_thor.sh:

# Before (baseline)
PRIMARY_MODEL="llama3.1:70b"
BACKUP_MODEL="llama3.1:8b"

# After (data-driven choice)
PRIMARY_MODEL="qwen2.5-coder:32b"  # Or winner from benchmark
BACKUP_MODEL="phi4:14b"

Research Sources

Model Information

  • Ollama Library: https://ollama.com/library (100+ models with specs)
  • Ollama GitHub: https://github.com/ollama/ollama (154k stars, active)
  • Model Cards: Individual model pages on Ollama library

Key Models Investigated

Reasoning Specialists:

  • deepseek-r1 (1.5b-671b): "Reasoning approaching O3", 65.2M pulls
  • qwq (32b): "Reasoning model of Qwen series", 1.7M pulls
  • openthinker (7b-32b): "Distilled from DeepSeek-R1", 601K pulls

Coding Specialists:

  • qwen2.5-coder (0.5b-32b): "Code generation, reasoning, fixing", 7.5M pulls
  • qwen3-coder (30b-480b): "Agentic and coding tasks", 471K pulls
  • deepseek-coder-v2 (16b-236b): "GPT4-Turbo comparable", 1.1M pulls

General Purpose:

  • llama3.3 (70b): "Similar to llama3.1:405b", 2.6M pulls
  • qwen2.5 (0.5b-72b): "18T tokens, 128K context", 14.8M pulls
  • phi4 (14b): "Microsoft state-of-the-art", 5.3M pulls
  • gemma2 (2b-27b): "High-performing, efficient", 8.1M pulls

Function Calling:

  • hermes3 (3b-405b): "Tool-based use cases", 339K pulls
  • granite3.1-dense (2b-8b): "RAG and tool support", 121K pulls

Selection Methodology

  1. Memory filtering: Eliminated models >60GB (Thor constraint)
  2. Popularity filtering: Prioritized models with >1M pulls (proven)
  3. Specialization matching: Selected coding/reasoning specialists for robot control
  4. Size diversity: Covered 7B-70B range for speed vs quality comparison
  5. Community validation: Checked GitHub stars, pull counts, recent activity
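
The first two filtering steps can be illustrated with a few of the candidates from this document (footprints and pull counts copied from the matrix and appendix above):

# Sketch: memory + popularity filters from the selection methodology
candidates = [
    {"name": "phi4:14b", "ram_gb": 10, "pulls_m": 5.3},
    {"name": "qwen2.5-coder:32b", "ram_gb": 20, "pulls_m": 7.5},
    {"name": "qwq:32b", "ram_gb": 20, "pulls_m": 1.7},
    {"name": "llama3.3:70b", "ram_gb": 43, "pulls_m": 2.6},
    {"name": "hermes3:70b", "ram_gb": 50, "pulls_m": 0.339},
    {"name": "deepseek-r1:671b", "ram_gb": 404, "pulls_m": 65.2},
]

shortlist = [
    m for m in candidates
    if m["ram_gb"] <= 60        # memory filter: must fit on Thor
    and m["pulls_m"] >= 1.0     # popularity filter: >1M pulls preferred
]
print([m["name"] for m in shortlist])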

Appendix: Full Model Landscape

For reference, here are other notable models that didn't make the cut and why:

Too Large for Thor (>60GB)

  • llama4:maverick (400B): 245GB RAM required ❌
  • deepseek-r1:671b: 404GB RAM required ❌
  • qwen3:235b: 147GB RAM required ❌
  • deepseek-v3:671b: 404GB RAM required ❌

Too Small (Insufficient Quality)

  • smollm2 (135m-1.7b): Compact but limited capability ⚠️
  • gemma3:1b: Tiny, good for edge but not robot control ⚠️

Not Specialized for Use Case

  • mistral (7b): Good general model, but beaten by phi4 ⚠️
  • llava (7b-34b): Vision-focused, but we don't need VLM yet ⚠️
  • codellama (7b): Older coding model, beaten by qwen2.5-coder ⚠️

Redundant with Better Options

  • qwen3 (various): Newer, but qwen2.5-coder more specialized ⚠️
  • llama3.1:8b: Original baseline, but phi4 likely better ⚠️
  • gemma3 (various): Good, but covered by gemma2:27b ⚠️

Experimental/Unproven

  • cogito (3b-70b): Interesting hybrid, but only 548K pulls ⚠️
  • openthinker (7b-32b): Promising, but beaten by qwq ⚠️

Changelog

2025-01-10 - Initial Selection

  • Researched 100+ Ollama models
  • Selected 5 primary + 2 optional candidates
  • Documented selection rationale
  • Created testing matrix and decision framework

Next Update: After benchmark results are available, update with actual performance data and final recommendation.