# Ollama Model Benchmarking Guide

**Last Updated**: 2025-10-09
**Purpose**: Test and compare Ollama models before deployment
## Overview

The `benchmark_ollama_models.sh` script helps you objectively compare different Ollama models on your actual hardware (Thor) before committing to one for production use.
## What It Tests

- **Speed**: Tokens per second generation rate
- **Latency**: Time to first token (responsiveness)
- **Quality**: Response accuracy, completeness, instruction-following (see the Quality Scoring Guide)
- **Resource Usage**: Model size and memory footprint
## Test Scenarios

The benchmark runs three types of prompts to simulate real mission agent tasks:

- **Simple**: Basic acknowledgment (tests baseline speed + instruction following)
- **Navigation**: JSON plan generation (tests structured output + validity)
- **Reasoning**: Problem-solving task (tests logic + explanation quality)

**New in v2.0**: Automated quality scoring inspired by OpenAI Evals and IFEval. Each response gets a 0-100 quality score based on accuracy, completeness, and task-specific criteria.
## Quick Start

### 1. Make Sure Ollama is Running

```bash
# On Thor - check container status
docker ps | grep ollama
# Should show the ollama container running on port 11434
```

If it's not running, start it:

```bash
./scripts/setup_ollama_thor.sh
```
### 2. Run Benchmark

```bash
# On Thor
cd ~/shadowhound
./scripts/benchmark_ollama_models.sh
```
### 3. Review Results

The script will:

1. Pull models if not already downloaded
2. Warm up each model (the first inference is always slower)
3. Run 3 test prompts per model
4. Generate JSON results + a summary report
5. Provide a recommendation

**Runtime**: ~10-20 minutes depending on the models tested
## Understanding the Output

### During Execution

```text
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing: llama3.1:8b
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Model Info:
  Size: 4.9GB | Parameters: 8B | Family: llama

Warming up model...
✓ Model loaded

Testing: simple
  Duration: 1.23s | Tokens: 15 | Speed: 12.2 tok/s | TTFT: 0.15s
Testing: navigation
  Duration: 3.45s | Tokens: 85 | Speed: 24.6 tok/s | TTFT: 0.18s
Testing: reasoning
  Duration: 4.12s | Tokens: 120 | Speed: 29.1 tok/s | TTFT: 0.21s
```
### Summary Report

```text
============================================================
BENCHMARK SUMMARY
============================================================

llama3.1:8b
------------------------------------------------------------
  Total Duration:    8.80s
  Total Tokens:      220
  Avg Speed:         21.9 tokens/sec
  Avg Time to First: 0.180s

  Performance by Task:
    simple       1.23s | 12.2 tok/s | Q: 100/100
    navigation   3.45s | 24.6 tok/s | Q:  85/100
    reasoning    4.12s | 29.1 tok/s | Q:  72/100

llama3.1:70b
------------------------------------------------------------
  Total Duration:    24.50s
  Total Tokens:      265
  Avg Speed:         10.8 tokens/sec
  Avg Time to First: 0.850s

  Performance by Task:
    simple       3.20s |  4.7 tok/s | Q: 100/100
    navigation   8.80s | 10.5 tok/s | Q: 100/100
    reasoning   12.50s | 16.2 tok/s | Q:  96/100

============================================================
RECOMMENDATIONS
============================================================

🚀 Fastest Model: llama3.1:8b (21.9 tok/s)
🎯 Best Quality:  llama3.1:70b (98.7/100)

📊 Speed vs Quality Tradeoff:
  llama3.1:8b    Speed: 21.9 tok/s | Quality: 85.7/100
  llama3.1:70b   Speed: 10.8 tok/s | Quality: 98.7/100

💡 Recommendation:
  🌟 Use llama3.1:70b - 13pts better quality, only 2.0x slower!
```

**Note**: Quality scores (Q: X/100) measure response accuracy and instruction-following. See the Quality Scoring Guide for details.
## Metrics Explained

### Speed Metrics

#### Tokens Per Second (tok/s)

- Higher is better
- How fast the model generates text
- 8B: typically 15-30 tok/s
- 70B: typically 5-15 tok/s

#### Time to First Token (TTFT)

- Lower is better
- How quickly the model starts responding
- Important for perceived responsiveness
- 8B: ~0.1-0.3s
- 70B: ~0.5-1.5s

#### Duration

- Total time to complete the response
- Depends on both speed and response length
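
Under the hood, Ollama's non-streaming `/api/generate` response includes timing fields (reported in nanoseconds) from which these metrics can be derived. A minimal sketch with curl and jq; note the TTFT approximation is my own shorthand, and the benchmark script may measure it differently (e.g., by timing the first streamed chunk):

```bash
# One non-streaming generation; Ollama reports durations in nanoseconds
RESP=$(curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Say hello in exactly 3 words", "stream": false}')

echo "$RESP" | jq -r '
  "Tokens:   \(.eval_count)",
  "Speed:    \(.eval_count / (.eval_duration / 1e9)) tok/s",
  "Duration: \(.total_duration / 1e9) s",
  "TTFT ~    \((.load_duration + .prompt_eval_duration) / 1e9) s"'
```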
### Quality Metrics

#### Quality Score (0-100)

- Higher is better
- Automated evaluation of response quality
- Measures:
  - **Simple**: Instruction following (word count, format)
  - **Navigation**: JSON validity, structure, completeness
  - **Reasoning**: Answer presence, logic, explanation quality
- Score bands:
  - 90-100: Excellent (production-ready)
  - 75-89: Good (usable)
  - 60-74: Fair (consider for non-critical tasks)
  - <60: Poor (significant issues)

See ollama_quality_scoring.md for the complete scoring methodology.
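
To give a flavor of how the navigation check works (the real criteria and weights live in the script and the scoring guide; the scoring step below is hypothetical), JSON validity can be tested mechanically:

```bash
# Hypothetical sketch: check whether a navigation response parses as JSON
RESPONSE=$(curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Generate a JSON plan...", "stream": false}' \
  | jq -r '.response')

if echo "$RESPONSE" | jq -e . >/dev/null 2>&1; then
  echo "valid JSON: award structure points"   # hypothetical scoring step
else
  echo "invalid JSON: no structure points"
fi
```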
## Customizing Tests

### Add More Models

Edit `scripts/benchmark_ollama_models.sh`:

```bash
# Models to test (in order of size)
declare -a MODELS=(
    "llama3.1:8b"
    "llama3.1:70b"
    "mistral:7b"    # Add alternative models
    "phi:latest"    # Add smaller models (Phi-2)
)
```
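
Before editing the list, you can check which models are already downloaded on Thor:

```bash
docker exec ollama ollama list
```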
### Add Custom Prompts

```bash
declare -A TEST_PROMPTS=(
    ["simple"]="Say hello in exactly 3 words"
    ["navigation"]="Generate a JSON plan..."
    ["reasoning"]="A robot needs to..."
    ["custom"]="Your custom test prompt here"  # Add your own
)
```
### Change Test Parameters

In the `benchmark_prompt` function, adjust:

```bash
"options": {
    "temperature": 0.7,   # Randomness (0-1)
    "num_predict": 200    # Max tokens to generate
}
```
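
For context, those options travel in the body of the `/api/generate` request. A minimal standalone sketch (the prompt is illustrative):

```bash
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Say hello in exactly 3 words",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_predict": 200
  }
}' | jq -r '.response'
```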
## Interpreting Results for ShadowHound

### Speed vs Quality Tradeoff
| Model | Speed | Quality | Use Case |
|---|---|---|---|
| llama3.1:8b | ⚡⚡⚡⚡⚡ | ⭐⭐⭐⭐ | Development, testing, simple tasks |
| llama3.1:70b | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | Production missions, complex planning |
### Mission Agent Requirements

Ideal characteristics for robot missions:

- ✅ **Latency**: <2s total response time for simple commands
- ✅ **Throughput**: >10 tok/s for plan generation
- ✅ **Quality**: Reliable JSON output, good reasoning
- ✅ **Consistency**: Repeatable results with temp=0.7
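
To spot-check the latency target against a live model, you can time a round trip yourself (a minimal sketch; the prompt mirrors the simple test case):

```bash
# Wall-clock time for one simple command (target: <2s end-to-end)
time curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Say hello in exactly 3 words", "stream": false}' \
  | jq -r '.response'
```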
### Decision Matrix

**Choose 8B if**:

- You need fast iteration during development
- Simple navigation tasks are the primary use case
- You want snappy responses
- You run many inference requests in parallel

**Choose 70B if**:

- Complex mission planning is needed
- Quality/reliability is critical
- You can accept 2-4s response times
- You need better reasoning and understanding

**Recommendation**: Run both! Use 8B for development and switch to 70B for actual missions.
## Advanced: Automated Testing

### Run Benchmark on Schedule

Add to Thor's crontab:

```bash
# Benchmark weekly to track performance
0 2 * * 0 /home/user/shadowhound/scripts/benchmark_ollama_models.sh > /tmp/ollama_benchmark.log 2>&1
```
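
If weekly runs accumulate, a companion entry can prune old results (this assumes the script saves its JSON into `~/ollama_benchmarks/`, as the next subsection suggests):

```bash
# Prune benchmark results older than 90 days
30 2 * * 0 find /home/user/ollama_benchmarks -name '*.json' -mtime +90 -delete
```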
### Compare Models Over Time

```bash
# View historical results
ls -lh ~/ollama_benchmarks/

# Compare two benchmark runs
diff <(jq '.[0].tests' benchmark1.json) <(jq '.[0].tests' benchmark2.json)
```
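
To tabulate average speed per model across all saved runs, a short loop works (this assumes the same JSON schema as the model-selection example below):

```bash
# Average tok/s per model, per benchmark file
for f in ~/ollama_benchmarks/*.json; do
  echo "== $f"
  jq -r '.[] | "\(.model)\t\([.tests[].tokens_per_second] | add / length) tok/s"' "$f"
done
```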
### Automated Model Selection

Use benchmark results in launch files:

```python
import json
from pathlib import Path

# Load the latest benchmark results
benchmark_file = Path.home() / "ollama_benchmarks" / "latest.json"
with open(benchmark_file) as f:
    results = json.load(f)

# Pick the first model that meets the minimum speed threshold
selected_model = "llama3.1:8b"  # fallback if nothing qualifies
for model in results:
    avg_speed = sum(t["tokens_per_second"] for t in model["tests"]) / len(model["tests"])
    if avg_speed > 15.0:  # minimum acceptable tok/s
        selected_model = model["model"]
        break
```
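
This sketch assumes the benchmark script writes (or symlinks) its most recent results to `latest.json`; if your runs are only timestamped files in `~/ollama_benchmarks/`, point `benchmark_file` at the newest one instead.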
## Troubleshooting

### Benchmark Fails to Connect

```bash
# Check that Ollama is running
curl http://localhost:11434/api/tags

# Check the container
docker ps | grep ollama

# View logs
docker logs ollama
```
### Model Download Fails

```bash
# Check disk space
df -h ~/ollama-data

# Check internet connectivity
ping ollama.com

# Manually pull the model
docker exec ollama ollama pull llama3.1:8b
```
### Slow Performance

The first run is always slower because models need to be loaded into memory.

```bash
# Warm up the model manually
docker exec ollama ollama run llama3.1:8b "hi"

# Then run the benchmark
./scripts/benchmark_ollama_models.sh
```
### Out of Memory

Thor has 128GB of RAM and should handle the 70B model. If OOM occurs:

```bash
# Check memory usage
free -h

# See what's using memory
docker stats

# Stop other containers
docker stop <other-containers>
```
## Example Results (Reference)

### NVIDIA Jetson AGX Thor (128GB RAM)

**llama3.1:8b**

- Avg Speed: 22 tok/s
- Avg TTFT: 0.18s
- Total for 3 tests: ~8s
- Quality: Good for most tasks

**llama3.1:70b**

- Avg Speed: 11 tok/s
- Avg TTFT: 0.85s
- Total for 3 tests: ~24s
- Quality: Excellent, near GPT-4

**Conclusion**: The 70B model is ~2x slower but delivers significantly better quality. Worth it for production missions!
## Integration with Setup Script

The benchmark results inform your model choice in `setup_ollama_thor.sh`:

```bash
# Based on benchmark results:
PRIMARY_MODEL="llama3.1:70b"  # If quality is the priority
# OR
PRIMARY_MODEL="llama3.1:8b"   # If speed is the priority
```

You can also use an environment variable to switch at runtime:

```bash
export OLLAMA_MODEL="llama3.1:8b"   # Fast for development
# OR
export OLLAMA_MODEL="llama3.1:70b"  # Quality for missions
```
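
On the consuming side, a hedged sketch of reading that variable with a sensible default (the fallback model here is an assumption, not the script's documented behavior):

```bash
# Use OLLAMA_MODEL if set, otherwise fall back to the fast 8B model
MODEL="${OLLAMA_MODEL:-llama3.1:8b}"
echo "Using model: $MODEL"
```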
## Best Practices

- **Benchmark on Thor**: Don't trust specs; test on your actual hardware
- **Test with real prompts**: Add your actual mission commands to `TEST_PROMPTS`
- **Consider task variety**: Keep a balance of simple and complex tasks
- **Warm-up matters**: The first inference is slower; the benchmark accounts for this
- **Track over time**: Re-benchmark after Ollama updates
## Related Documentation

- ollama_models.md - Model recommendations
- ollama_setup.md - Installation guide
- Setup script: `scripts/setup_ollama_thor.sh`
- Test script: `scripts/test_ollama_laptop.sh`

**Benchmark before you deploy!** Objective data beats guessing. 🎯