Quality Scoring - Quick Reference

TL;DR: The benchmark now measures both speed AND quality to help you choose between fast (8B) and accurate (70B) models.


What's New

Before (Performance Only)

llama3.1:8b:  21.9 tok/s  ← Fast!
llama3.1:70b: 10.8 tok/s  ← Slow...

Decision: Use 8B? 🤷

After (Performance + Quality)

llama3.1:8b:  21.9 tok/s  |  Quality: 85.7/100  ← Fast but makes mistakes
llama3.1:70b: 10.8 tok/s  |  Quality: 98.7/100  ← Slower but reliable

Decision: Use 70B! A 13-point quality gain is worth the 2x slowdown ✅
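
The trade-off behind that decision is simple arithmetic on the two lines above. A minimal sketch, using only the figures shown; the `compare` helper and the dictionary layout are hypothetical and not part of the benchmark script:

```python
def compare(fast, accurate):
    """Return (quality gain in points, slowdown factor) when switching from fast to accurate."""
    quality_gain = accurate["quality"] - fast["quality"]
    slowdown = fast["tok_s"] / accurate["tok_s"]
    return round(quality_gain, 1), round(slowdown, 2)

# Figures copied from the "After" listing above.
m8b = {"tok_s": 21.9, "quality": 85.7}
m70b = {"tok_s": 10.8, "quality": 98.7}

gain, slowdown = compare(m8b, m70b)
print(f"Quality gain: {gain} pts, slowdown: {slowdown}x")
# prints: Quality gain: 13.0 pts, slowdown: 2.03x
```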

How It Works

Automated Checks (No Human Required)

Prompt Type   Checks
Simple        ✅ Word count
              ✅ Format compliance
              ✅ No errors
Navigation    ✅ Valid JSON
              ✅ Required fields
              ✅ Correct structure
Reasoning     ✅ Answer present
              ✅ Explanation quality
              ✅ Logic markers
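
To make the checks above concrete, here is a rough stdlib-only sketch of how such automated scoring could look. It is not the benchmark's actual code: the function names, the thresholds (`max_words`, `min_words`), the required JSON fields (`action`, `target`), the marker list, and the equal weighting of checks are all assumptions made for illustration.

```python
import json

# Assumed list of words that signal a reasoning step.
LOGIC_MARKERS = ("because", "therefore", "since", "thus", "hence")

def score(checks):
    """Percentage of passed checks, 0-100 (equal weighting is an assumption)."""
    return round(100 * sum(checks) / len(checks), 1)

def check_simple(text, max_words=20):
    words = text.split()
    return score([
        0 < len(words) <= max_words,             # word count
        text.strip().endswith((".", "!", "?")),  # format compliance (assumed: a full sentence)
        "error" not in text.lower(),             # no errors
    ])

def check_navigation(text, required=("action", "target")):
    try:
        data = json.loads(text)                  # valid JSON
    except json.JSONDecodeError:
        return 0.0
    is_object = isinstance(data, dict)           # correct structure (assumed: a JSON object)
    has_fields = is_object and all(k in data for k in required)  # required fields
    return score([True, has_fields, is_object])

def check_reasoning(text, min_words=15):
    lower = text.lower()
    return score([
        "answer" in lower,                       # answer present
        len(text.split()) >= min_words,          # explanation quality (length as a crude proxy)
        any(m in lower for m in LOGIC_MARKERS),  # logic markers
    ])
```

For example, `check_navigation('{"action": "move"}')` parses as a JSON object but lacks the assumed `target` field, so it passes two of three checks and scores 66.7.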

Score Interpretation

Score    Meaning
90-100   ✅ Production ready
75-89    ⚠️ Usable with caution
60-74    ⚠️ Non-critical only
<60      ❌ Not recommended
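
The bands above translate directly into a lookup. A small sketch, assuming each band's lower bound is inclusive so that fractional scores such as 85.7 fall into the band below the next threshold; the `interpret` name is hypothetical:

```python
def interpret(score):
    """Map a 0-100 quality score to the band labels from the table above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 90:
        return "Production ready"
    if score >= 75:
        return "Usable with caution"
    if score >= 60:
        return "Non-critical only"
    return "Not recommended"

print(interpret(85.7))  # prints: Usable with caution
print(interpret(98.7))  # prints: Production ready
```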

Example Output

llama3.1:8b
  Performance by Task:
    simple      1.23s  |  12.2 tok/s  |  Q: 100/100  ← Perfect
    navigation  3.45s  |  24.6 tok/s  |  Q: 85/100   ← Some JSON issues
    reasoning   4.12s  |  29.1 tok/s  |  Q: 72/100   ← Weak logic

llama3.1:70b
  Performance by Task:
    simple      3.20s  |   4.7 tok/s  |  Q: 100/100  ← Perfect
    navigation  8.80s  |  10.5 tok/s  |  Q: 100/100  ← Perfect JSON
    reasoning  12.50s  |  16.2 tok/s  |  Q: 96/100   ← Strong reasoning
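
The overall quality figures quoted earlier (85.7 and 98.7) are consistent with an unweighted mean of the per-task scores in this output. That is an inference from the numbers, not a documented formula; the script may aggregate differently. A quick check, with a hypothetical `overall_quality` helper:

```python
from statistics import mean

# Per-task quality scores copied from the example output above.
per_task_quality = {
    "llama3.1:8b": {"simple": 100, "navigation": 85, "reasoning": 72},
    "llama3.1:70b": {"simple": 100, "navigation": 100, "reasoning": 96},
}

def overall_quality(task_scores):
    """Unweighted mean of per-task scores, rounded to one decimal (assumed aggregation)."""
    return round(mean(list(task_scores.values())), 1)

for model, scores in per_task_quality.items():
    print(f"{model}: {overall_quality(scores)}/100")
# prints:
# llama3.1:8b: 85.7/100
# llama3.1:70b: 98.7/100
```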

When to Use Which Model

Use 8B When:

  • 🚀 Speed critical (real-time responses)
  • 🧪 Development/testing (fast iteration)
  • 📝 Simple tasks (both models score 100/100 on the simple prompt)

Use 70B When:

  • 🎯 Accuracy critical (mission planning)
  • 🧠 Complex reasoning needed
  • 🏭 Production deployment
  • 🔒 Safety-critical tasks

Quick Start

# On Thor - run benchmark (auto-scores quality)
./scripts/benchmark_ollama_models.sh

# Takes 10-20 minutes, outputs:
# - Performance metrics (speed, latency)
# - Quality scores (0-100 per task)
# - Intelligent recommendation

Learn More


Added: 2025-10-09
Status: Production Ready
Dependencies: None (Python 3 stdlib only)