Ollama Model Benchmarking Guide

Last Updated: 2025-10-09
Purpose: Test and compare Ollama models before deployment


Overview

The benchmark_ollama_models.sh script helps you objectively compare different Ollama models on your actual hardware (Thor) before committing to one for production use.

What It Tests

  1. Speed: Tokens per second generation rate
  2. Latency: Time to first token (responsiveness)
  3. Quality: Response accuracy, completeness, instruction-following (see Quality Scoring Guide)
  4. Resource Usage: Model size and memory footprint

Test Scenarios

The benchmark runs three types of prompts to simulate real mission agent tasks:

  • Simple: Basic acknowledgment (tests baseline speed + instruction following)
  • Navigation: JSON plan generation (tests structured output + validity)
  • Reasoning: Problem-solving task (tests logic + explanation quality)

New in v2.0: Automated quality scoring inspired by OpenAI Evals and IFEval. Each response gets a 0-100 quality score based on accuracy, completeness, and task-specific criteria.


Quick Start

1. Make Sure Ollama is Running

# On Thor - check container status
docker ps | grep ollama

# Should show ollama container running on port 11434

If not running, start it:

./scripts/setup_ollama_thor.sh

2. Run Benchmark

# On Thor
cd ~/shadowhound
./scripts/benchmark_ollama_models.sh

3. Review Results

The script will:

  1. Pull models if not already downloaded
  2. Warm up each model (first inference is always slower)
  3. Run 3 test prompts per model
  4. Generate JSON results + summary report
  5. Provide a recommendation

Runtime: ~10-20 minutes depending on models tested


Understanding the Output

During Execution

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing: llama3.1:8b
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Model Info:
Size: 4.9GB | Parameters: 8B | Family: llama

Warming up model...
✓ Model loaded

  Testing: simple
    Duration: 1.23s | Tokens: 15 | Speed: 12.2 tok/s | TTFT: 0.15s
  Testing: navigation
    Duration: 3.45s | Tokens: 85 | Speed: 24.6 tok/s | TTFT: 0.18s
  Testing: reasoning
    Duration: 4.12s | Tokens: 120 | Speed: 29.1 tok/s | TTFT: 0.21s

Summary Report

============================================================
BENCHMARK SUMMARY
============================================================

llama3.1:8b
------------------------------------------------------------
  Total Duration:      8.80s
  Total Tokens:        220
  Avg Speed:           21.9 tokens/sec
  Avg Time to First:   0.180s

  Performance by Task:
    simple           1.23s  |  12.2 tok/s  |  Q: 100/100
    navigation       3.45s  |  24.6 tok/s  |  Q: 85/100
    reasoning        4.12s  |  29.1 tok/s  |  Q: 72/100

llama3.1:70b
------------------------------------------------------------
  Total Duration:      24.50s
  Total Tokens:        265
  Avg Speed:           10.8 tokens/sec
  Avg Time to First:   0.850s

  Performance by Task:
    simple           3.20s  |   4.7 tok/s  |  Q: 100/100
    navigation       8.80s  |  10.5 tok/s  |  Q: 100/100
    reasoning       12.50s  |  16.2 tok/s  |  Q: 96/100

============================================================
RECOMMENDATIONS
============================================================

🚀 Fastest Model:       llama3.1:8b (21.9 tok/s)
🎯 Best Quality:        llama3.1:70b (98.7/100)

📊 Speed vs Quality Tradeoff:
   llama3.1:8b          Speed: 21.9 tok/s  |  Quality: 85.7/100
   llama3.1:70b         Speed: 10.8 tok/s  |  Quality: 98.7/100

💡 Recommendation:
   🌟 Use llama3.1:70b - 13pts better quality, only 2.0x slower!

Note: Quality scores (Q: X/100) measure response accuracy and instruction-following. See Quality Scoring Guide for details.


Metrics Explained

Speed Metrics

Tokens Per Second (tok/s)

  • Higher is better
  • How fast the model generates text
  • 8B: Typically 15-30 tok/s
  • 70B: Typically 5-15 tok/s

Time to First Token (TTFT)

  • Lower is better
  • How quickly the model starts responding
  • Important for perceived responsiveness
  • 8B: ~0.1-0.3s
  • 70B: ~0.5-1.5s

Duration

  • Total time to complete the response
  • Depends on both speed and response length
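
To spot-check these numbers outside the benchmark script, the timing fields Ollama returns from a non-streaming /api/generate call are enough to reproduce them. A minimal sketch (eval_count, eval_duration, and the other fields are reported in nanoseconds by current Ollama releases; verify against your version, and note TTFT is approximated here as load time plus prompt evaluation time):

# Single non-streaming request; Ollama reports timing fields in nanoseconds
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Say hello in exactly 3 words", "stream": false}' |
jq -r '
  "Tokens:    \(.eval_count)",
  "Speed:     \(.eval_count / (.eval_duration / 1e9) | round) tok/s",
  "TTFT (~):  \((.load_duration + .prompt_eval_duration) / 1e9) s",
  "Duration:  \(.total_duration / 1e9) s"
'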

Quality Metrics

Quality Score (0-100)

  • Higher is better
  • Automated evaluation of response quality
  • Measures:
      • Simple: Instruction following (word count, format)
      • Navigation: JSON validity, structure, completeness
      • Reasoning: Answer presence, logic, explanation quality
  • Score bands:
      • 90-100: Excellent (production-ready)
      • 75-89: Good (usable)
      • 60-74: Fair (consider for non-critical tasks)
      • <60: Poor (significant issues)

See ollama_quality_scoring.md for the complete scoring methodology.
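
As a rough illustration of the kind of check the navigation scorer performs (the real rubric lives in the scoring guide; the field names and point values below are made up for this example), a response can be tested for JSON validity and expected fields with jq:

# Illustrative only: valid JSON earns base points, each expected field adds more.
# The actual scorer and its point breakdown are defined in the benchmark script.
RESPONSE='{"goal": "waypoint_3", "steps": ["rotate", "move_forward"]}'
SCORE=0
if echo "$RESPONSE" | jq -e . > /dev/null 2>&1; then
    SCORE=$((SCORE + 50))                      # parses as valid JSON
    for field in goal steps; do
        if echo "$RESPONSE" | jq -e --arg f "$field" 'has($f)' > /dev/null; then
            SCORE=$((SCORE + 25))              # expected field is present
        fi
    done
fi
echo "Quality: ${SCORE}/100"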


Customizing Tests

Add More Models

Edit scripts/benchmark_ollama_models.sh:

# Models to test (in order of size)
declare -a MODELS=(
    "llama3.1:8b"
    "llama3.1:70b"
    "mistral:7b"           # Add alternative models
    "phi-2:latest"         # Add smaller models
)
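
Before adding a tag, it can help to confirm which models are already pulled locally; the /api/tags endpoint (the same one used in the troubleshooting section) lists them:

# List locally available models and their sizes (bytes)
curl -s http://localhost:11434/api/tags | jq -r '.models[] | "\(.name)\t\(.size)"'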

Add Custom Prompts

declare -A TEST_PROMPTS=(
    ["simple"]="Say hello in exactly 3 words"
    ["navigation"]="Generate a JSON plan..."
    ["reasoning"]="A robot needs to..."
    ["custom"]="Your custom test prompt here"  # Add your own
)

Change Test Parameters

# In the benchmark_prompt function, adjust:
"options": {
    "temperature": 0.7,      # Randomness (0-1)
    "num_predict": 200       # Max tokens to generate
}

Interpreting Results for ShadowHound

Speed vs Quality Tradeoff

Model           Speed       Quality     Use Case
------------------------------------------------------------
llama3.1:8b     ⚡⚡⚡⚡⚡      ⭐⭐⭐⭐        Development, testing, simple tasks
llama3.1:70b    ⚡⚡⚡         ⭐⭐⭐⭐⭐       Production missions, complex planning

Mission Agent Requirements

Ideal characteristics for robot missions:

  ✅ Latency: <2s total response time for simple commands
  ✅ Throughput: >10 tok/s for plan generation
  ✅ Quality: Reliable JSON output, good reasoning
  ✅ Consistency: Repeatable results with temp=0.7

Decision Matrix

Choose 8B if:

  • You need fast iteration during development
  • Simple navigation tasks are the primary use case
  • You want snappy responses
  • You run many inference requests in parallel

Choose 70B if:

  • Complex mission planning is needed
  • Quality/reliability is critical
  • You can accept 2-4s response times
  • You need better reasoning and understanding

Recommendation: Run both! Use 8B for development, switch to 70B for actual missions.


Advanced: Automated Testing

Run Benchmark on Schedule

Add to Thor's crontab:

# Benchmark weekly to track performance
0 2 * * 0 /home/user/shadowhound/scripts/benchmark_ollama_models.sh > /tmp/ollama_benchmark.log 2>&1

Compare Models Over Time

# View historical results
ls -lh ~/ollama_benchmarks/

# Compare two benchmark runs
diff <(jq '.[0].tests' benchmark1.json) <(jq '.[0].tests' benchmark2.json)
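
For a quick ranking from a single results file, per-model averages can also be computed directly with jq. This assumes the layout used by the selection snippet below (a list of objects with model and tests[].tokens_per_second); adjust the paths if your script's output differs:

# Average generation speed per model, fastest first
jq -r '
  map({model, avg: (([.tests[].tokens_per_second] | add) / (.tests | length))})
  | sort_by(-.avg)[]
  | "\(.model)\t\(.avg | round) tok/s"
' ~/ollama_benchmarks/latest.json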

Automated Model Selection

Use benchmark results in launch files:

# In launch file
import json
from pathlib import Path

# Load the latest benchmark results (a list of per-model entries)
benchmark_file = Path.home() / "ollama_benchmarks" / "latest.json"
with open(benchmark_file) as f:
    results = json.load(f)

# Pick the first model that meets the minimum speed threshold
selected_model = None
for model in results:
    avg_speed = sum(t['tokens_per_second'] for t in model['tests']) / len(model['tests'])
    if avg_speed > 15.0:  # Minimum acceptable speed (tok/s)
        selected_model = model['model']
        break

if selected_model is None:
    # No model met the threshold; fall back to the first entry
    selected_model = results[0]['model']

Troubleshooting

Benchmark Fails to Connect

# Check Ollama is running
curl http://localhost:11434/api/tags

# Check container
docker ps | grep ollama

# View logs
docker logs ollama

Model Download Fails

# Check disk space
df -h ~/ollama-data

# Check internet
ping ollama.com

# Manually pull model
docker exec ollama ollama pull llama3.1:8b

Slow Performance

First run is always slower - models need to be loaded into memory.

# Warm up model manually
docker exec ollama ollama run llama3.1:8b "hi"

# Then run benchmark

Out of Memory

Thor has 128 GB of RAM and should comfortably handle the 70B model. If OOM still occurs:

# Check memory usage
free -h

# Check what's using memory
docker stats

# Stop other containers
docker stop <other-containers>

Example Results (Reference)

NVIDIA Jetson AGX Thor (128GB RAM)

llama3.1:8b

  • Avg Speed: 22 tok/s
  • Avg TTFT: 0.18s
  • Total for 3 tests: ~8s
  • Quality: Good for most tasks

llama3.1:70b

  • Avg Speed: 11 tok/s
  • Avg TTFT: 0.85s
  • Total for 3 tests: ~24s
  • Quality: Excellent, near GPT-4

Conclusion: 70B is ~2x slower but produces significantly better output. Worth it for production missions!


Integration with Setup Script

The benchmark results inform your choice in setup_ollama_thor.sh:

# Based on benchmark results:
PRIMARY_MODEL="llama3.1:70b"    # If quality is priority
# OR
PRIMARY_MODEL="llama3.1:8b"     # If speed is priority

You can also use an environment variable to switch at runtime:

export OLLAMA_MODEL="llama3.1:8b"  # Fast for development
# OR
export OLLAMA_MODEL="llama3.1:70b" # Quality for missions

Best Practices

  1. Benchmark on Thor - Don't trust specs, test on your actual hardware
  2. Test with real prompts - Add your actual mission commands to TEST_PROMPTS
  3. Consider task variety - Keep a balance of simple and complex tasks
  4. Warm-up matters - The first inference is slower; the benchmark accounts for this
  5. Track over time - Re-benchmark after Ollama updates

Related Documentation

  • ollama_models.md - Model recommendations
  • ollama_setup.md - Installation guide
  • Setup script: scripts/setup_ollama_thor.sh
  • Test script: scripts/test_ollama_laptop.sh

Benchmark before you deploy! Objective data beats guessing. 🎯