# Quality Scoring - Quick Reference

**TL;DR:** The benchmark now measures both speed **and** quality, so you can choose between the fast (8B) and accurate (70B) models.
## What's New
### Before (Performance Only)

    llama3.1:8b:  21.9 tok/s  ← Fast!
    llama3.1:70b: 10.8 tok/s  ← Slow...

    Decision: Use 8B? 🤷
### After (Performance + Quality)

    llama3.1:8b:  21.9 tok/s | Quality: 85.7/100  ← Fast, but makes mistakes
    llama3.1:70b: 10.8 tok/s | Quality: 98.7/100  ← Slower, but reliable

    Decision: Use 70B! The 13-point quality gain is worth the 2x slowdown ✅
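In other words, the recommendation weighs quality gained against speed lost. A minimal Python sketch of that trade-off (the function name, tuple format, and threshold are illustrative assumptions, not the benchmark's actual rule):

```python
def recommend(fast, accurate, min_gain_per_slowdown=5.0):
    """Pick a model from (tok_per_s, quality) tuples for a fast and an accurate model."""
    fast_speed, fast_quality = fast
    acc_speed, acc_quality = accurate
    slowdown = fast_speed / acc_speed          # e.g. 21.9 / 10.8 ≈ 2.0x
    quality_gain = acc_quality - fast_quality  # e.g. 98.7 - 85.7 = 13 points
    # Accept the slowdown only if it buys enough quality per unit of slowdown.
    return "accurate" if quality_gain / slowdown >= min_gain_per_slowdown else "fast"

recommend((21.9, 85.7), (10.8, 98.7))  # -> "accurate" (13 pts / ~2x ≈ 6.4 per x)
```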
## How It Works
### Automated Checks (No Human Required)
| Prompt Type | Checks |
|---|---|
| Simple | ✅ Word count ✅ Format compliance ✅ No errors |
| Navigation | ✅ Valid JSON ✅ Required fields ✅ Correct structure |
| Reasoning | ✅ Answer present ✅ Explanation quality ✅ Logic markers |
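For a sense of what these checks look like in code, here is a minimal Python sketch (stdlib only, matching the dependency note at the end); the function names, required JSON fields, and thresholds are illustrative assumptions rather than the script's actual implementation:

```python
import json
import re


def score_simple(response: str, max_words: int = 50) -> float:
    """Simple prompts: word count, format compliance, no error text."""
    checks = [
        len(response.split()) <= max_words,   # word count within limit
        "\n" not in response.strip(),         # single-line format compliance
        "error" not in response.lower(),      # no error messages
    ]
    return 100.0 * sum(checks) / len(checks)


def score_navigation(response: str, required=("waypoints", "speed")) -> float:
    """Navigation prompts: valid JSON, required fields, correct structure."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return 0.0                            # invalid JSON fails outright
    checks = [isinstance(data, dict)] + [field in data for field in required]
    return 100.0 * sum(checks) / len(checks)


def score_reasoning(response: str) -> float:
    """Reasoning prompts: answer present, explanation quality, logic markers."""
    markers = ("because", "therefore", "first", "then", "so")
    checks = [
        bool(re.search(r"\banswer\b", response, re.IGNORECASE)),  # answer present
        len(response.split()) > 30,                               # non-trivial explanation
        any(m in response.lower() for m in markers),              # logic markers
    ]
    return 100.0 * sum(checks) / len(checks)
```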
### Score Interpretation

| Score | Meaning |
|---|---|
| 90-100 | ✅ Production ready |
| 75-89 | ⚠️ Usable with caution |
| 60-74 | ⚠️ Non-critical only |
| <60 | ❌ Not recommended |
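The bands map directly onto a simple lookup, for example (a sketch; the function name is hypothetical):

```python
def interpret(score: float) -> str:
    """Map a 0-100 quality score to the band in the table above."""
    if score >= 90:
        return "Production ready"
    if score >= 75:
        return "Usable with caution"
    if score >= 60:
        return "Non-critical only"
    return "Not recommended"
```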
## Example Output

    llama3.1:8b
    Performance by Task:
      simple       1.23s | 12.2 tok/s | Q: 100/100  ← Perfect
      navigation   3.45s | 24.6 tok/s | Q:  85/100  ← Some JSON issues
      reasoning    4.12s | 29.1 tok/s | Q:  72/100  ← Weak logic

    llama3.1:70b
    Performance by Task:
      simple       3.20s |  4.7 tok/s | Q: 100/100  ← Perfect
      navigation   8.80s | 10.5 tok/s | Q: 100/100  ← Perfect JSON
      reasoning   12.50s | 16.2 tok/s | Q:  96/100  ← Strong reasoning
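The overall quality figures in the TL;DR (85.7 and 98.7) are consistent with a simple unweighted mean of the per-task scores:

```python
def overall_quality(task_scores: dict) -> float:
    """Unweighted mean of per-task quality scores (0-100)."""
    return sum(task_scores.values()) / len(task_scores)

overall_quality({"simple": 100, "navigation": 85, "reasoning": 72})   # ≈ 85.7
overall_quality({"simple": 100, "navigation": 100, "reasoning": 96})  # ≈ 98.7
```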
## When to Use Which Model
### Use 8B When:

- Speed is critical (real-time responses)
- 🧪 Development/testing (fast iteration)
- Simple tasks (both models score 100/100 on the simple word-count task)
### Use 70B When:

- 🎯 Accuracy is critical (mission planning)
- 🧠 Complex reasoning is needed
- 🏭 Production deployment
- Safety-critical tasks
## Quick Start

    # On Thor - run benchmark (auto-scores quality)
    ./scripts/benchmark_ollama_models.sh

    # Takes 10-20 minutes, outputs:
    # - Performance metrics (speed, latency)
    # - Quality scores (0-100 per task)
    # - Intelligent recommendation
## Learn More

- Full Guide: ollama_quality_scoring.md - Complete methodology
- Benchmarking: ollama_benchmarking.md - How to run tests
- Academic Background:
    - IFEval Paper - Instruction-following evaluation
    - OpenAI Evals - Evaluation framework
Added: 2025-10-09
Status: Production Ready
Dependencies: None (Python 3 stdlib only)