Ollama Benchmark Memory Management¶
Purpose: Document memory management strategies for reliable Ollama model benchmarking.
Last Updated: 2025-10-10
Related: See docs/OLLAMA_BENCHMARK_RESULTS.md for actual results
Problem: Memory Pressure During Benchmarking¶
What We Discovered¶
During initial benchmarking on Thor (128GB RAM), we encountered memory-related failures:
Error: 500 Internal Server Error: do load request: Post "http://127.0.0.1:xxxxx/load": EOF
Root Cause: Ollama caches loaded models in memory. After testing multiple models:
- Container memory usage: 56GB of cached models
- Attempting to load llama3.3:70b (42GB) failed: 56GB + 42GB = 98GB > available RAM
- System entered an unstable state, affecting subsequent tests
Impact on Benchmark Reliability¶
Without memory management:
1. Test order matters: Earlier models stay cached, affect later tests
2. Large models fail: Memory exhaustion causes OOM crashes
3. Cascade failures: System instability affects subsequent tests
4. Invalid results: Can't distinguish between model quality and memory issues
Solution: Enhanced Benchmark Script¶
Memory Management Features¶
The improved benchmark_ollama_models.sh now includes:
1. Model Unloading Between Tests¶
UNLOAD_BETWEEN_MODELS=true # Default: enabled
After each model's tests complete:
- Sends keep_alive: 0 to Ollama API
- Forces model unload from memory
- Prevents memory buildup across tests
Trade-off: Adds ~2s per model, but ensures clean state
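Concretely, the unload is just the generate endpoint with keep_alive set to 0 (the same call used in the post-benchmark cleanup loop further down). A minimal sketch; the unload_model wrapper name is illustrative, not necessarily what the script uses:

```bash
# Illustrative helper: ask Ollama to drop a model from memory immediately.
unload_model() {
    local model="$1"
    curl -s -X POST "${OLLAMA_HOST:-http://localhost:11434}/api/generate" \
        -d "{\"model\": \"${model}\", \"prompt\": \"\", \"keep_alive\": 0}" > /dev/null
}

unload_model "phi4:14b"
```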
2. Container Restart for Large Models¶
RESTART_ON_LARGE_MODELS=true # Default: enabled
Before testing models >40GB:
- Automatically restarts the Ollama container
- Clears all cached models
- Ensures maximum available memory
Trade-off: Adds ~15s restart time, but prevents OOM
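A sketch of what this restart step might look like. The restart_ollama_container name matches the helper referenced under Future Improvements below, but the body here is illustrative; it polls the API instead of using a fixed 15s sleep:

```bash
# Illustrative: restart the Ollama container and wait until the API answers again.
restart_ollama_container() {
    docker restart ollama
    for _ in $(seq 1 30); do
        if curl -sf "${OLLAMA_HOST:-http://localhost:11434}/api/tags" > /dev/null; then
            return 0
        fi
        sleep 1
    done
    echo "ERROR: Ollama did not come back after restart" >&2
    return 1
}
```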
3. Memory Usage Tracking¶
# Logs memory usage before, during, and after each test
Container memory before: 12.5GiB
Container memory after: 32.1GiB (+19.6GiB)
Container memory after unload: 13.2GiB
Helps identify:
- Actual model memory footprint
- Models that don't unload properly
- Memory leaks or issues
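A sketch of how these before/after numbers can be sampled from docker stats; container_mem is an illustrative helper name, not necessarily the script's:

```bash
# Illustrative: sample the container's current memory usage, e.g. "32.1GiB".
container_mem() {
    docker stats ollama --no-stream --format '{{.MemUsage}}' | awk '{print $1}'
}

before=$(container_mem)
# ... run the model's tests here ...
after=$(container_mem)
echo "Container memory before: ${before}, after: ${after}"
```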
4. Automatic Error Recovery¶
# If model load fails (500 error / EOF):
1. Detect failure during warmup
2. Automatically restart container
3. Retry model load once
4. Skip model if still failing
5. Continue with next model
This prevents a single failure from cascading across the entire benchmark run.
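A hedged sketch of that flow as it might appear inside the per-model loop; warmup_model is an illustrative name, and restart_ollama_container is the helper sketched earlier:

```bash
# Illustrative recovery flow: warm up, restart + retry once on failure, else skip.
warmup_model() {
    curl -sf -X POST "${OLLAMA_HOST:-http://localhost:11434}/api/generate" \
        -d "{\"model\": \"$1\", \"prompt\": \"hi\", \"stream\": false}" > /dev/null
}

if ! warmup_model "$model"; then
    echo "Warmup failed for $model, restarting container and retrying once..."
    restart_ollama_container
    if ! warmup_model "$model"; then
        echo "Skipping $model after repeated load failures"
        continue   # move on to the next model in the benchmark loop
    fi
fi
```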
5. Model Size Estimation¶
# Estimates memory requirements from model name
llama3.3:70b → ~45GB (70B params × 0.65 GB/B for Q4)
phi4:14b → ~9GB (14B params × 0.65 GB/B)
Used to trigger container restarts proactively.
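The script exposes this as estimate_model_size_gb (referenced again later in this document). A sketch of how such an estimator might parse the model tag; the regex and the fallback value are assumptions, not necessarily the script's exact logic:

```bash
# Illustrative: estimate Q4 memory footprint (GB) from the parameter count in the tag.
estimate_model_size_gb() {
    local model="$1"
    # Pull the trailing "<N>b" out of names like "llama3.3:70b" or "phi4:14b"
    local params
    params=$(echo "$model" | grep -oE '[0-9]+b' | tail -1 | tr -d 'b')
    if [ -z "$params" ]; then
        echo 10   # unknown size: assume a small-ish model (assumption)
        return
    fi
    # ~0.65 GB per billion parameters at Q4 quantization
    echo $(( params * 65 / 100 ))
}

estimate_model_size_gb "llama3.3:70b"   # → 45
```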
Usage Examples¶
Standard Benchmarking (Recommended)¶
# Use all memory management features (default)
./scripts/benchmark_ollama_models.sh
Behavior:
- Unloads models between tests
- Restarts container before large models (>40GB)
- Tracks memory usage
- Recovers from errors automatically
Fast Benchmarking (Small Models Only)¶
# Disable unloading for speed (only use with small models <10GB)
UNLOAD_BETWEEN_MODELS=false ./scripts/benchmark_ollama_models.sh
Use when:
- Testing only small models (8B-14B)
- You have plenty of RAM headroom (>90GB free)
- Speed matters more than memory cleanliness
Warning: May cause OOM with multiple large models
Large Model Benchmarking¶
# Aggressive memory management
RESTART_ON_LARGE_MODELS=true \
UNLOAD_BETWEEN_MODELS=true \
./scripts/benchmark_ollama_models.sh
Use when:
- Testing 70B+ models
- Limited RAM (<100GB free)
- Previous runs had failures
Best Practices¶
1. Test Model Order¶
Recommended: Small → Medium → Large
MODELS=(
"phi4:14b" # 9GB - Start with smallest
"qwen2.5-coder:32b" # 20GB - Medium
"llama3.3:70b" # 42GB - Large last
)
Why: If a large model fails, you have already collected results for the smaller models.
2. Pre-Benchmark Checklist¶
# Check available memory
free -h | grep Mem
# Check container status
docker stats ollama --no-stream
# Restart container if memory high (>30GB)
docker restart ollama && sleep 15
# Verify Ollama responsive
curl http://localhost:11434/api/tags
3. Monitor During Benchmark¶
# In separate terminal, watch memory
watch -n 5 'docker stats ollama --no-stream'
# Or check logs for errors
docker logs -f ollama
4. Post-Benchmark Cleanup¶
# Unload all models
for model in $(docker exec ollama ollama list | tail -n +2 | awk '{print $1}'); do
curl -s -X POST http://localhost:11434/api/generate \
-d "{\"model\": \"$model\", \"prompt\": \"\", \"keep_alive\": 0}"
done
# Or restart container to clear everything
docker restart ollama
Memory Requirements by Model Size¶
| Model Size | RAM Required | Safe Headroom | Example Models |
|---|---|---|---|
| 7-8B | 5-6 GB | +5GB (11GB total) | llama3.1:8b, deepseek-r1:7b |
| 13-14B | 8-10 GB | +5GB (15GB total) | phi4:14b, mistral:latest |
| 27-32B | 18-21 GB | +10GB (31GB total) | qwen2.5-coder:32b, qwq:32b |
| 70B | 42-45 GB | +20GB (65GB total) | llama3.3:70b, llama3.1:70b |
Formula: RAM_Required (GB) ≈ Parameters (in billions) × 0.6-0.7 (for Q4 quantization)
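A quick sanity check of the formula against the table (arithmetic only):

```bash
echo "32 * 0.65" | bc   # 20.80 GB, consistent with the 18-21 GB row for 32B models
```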
Thor's Safe Limits¶
- Total RAM: 128GB
- System overhead: ~10GB
- ROS2 + Nav2: ~15GB (when running)
- Available for Ollama: ~100GB
- Safe benchmark limit: 90GB (allows 10GB buffer)
Maximum Safe Model: ~60-70B (40-45GB), with container restart between tests
Troubleshooting¶
Issue: Model Load Fails with EOF¶
Error: 500 Internal Server Error: do load request: Post "http://127.0.0.1:xxxxx/load": EOF
Diagnosis:
# Check container memory
docker stats ollama --no-stream
# If >50GB, memory pressure likely
Solution:
# Restart container
docker restart ollama && sleep 15
# Verify clean state
docker stats ollama --no-stream # Should be <1GB
# Retry benchmark
./scripts/benchmark_ollama_models.sh
Issue: Benchmark Hangs During Model Load¶
Warming up model...
[hangs for >2 minutes]
Diagnosis: Likely memory exhaustion. Thor has no swap configured, so instead of swapping the system either hangs or the load is killed outright.
Solution:
# Force restart in another terminal
docker restart ollama
# Update model test order (test large models separately)
# Or reduce model list
Issue: Cascade Failures After One Model Fails¶
phi4:14b - PASSED
qwen-coder - PASSED
llama3.3:70b - FAILED (EOF)
deepseek-r1 - FAILED (EOF) # Should work but fails
Diagnosis: System unstable after OOM, cached state corrupt
Solution: The script now auto-restarts the container after failures. If failures persist:
# Enable aggressive restarts
RESTART_ON_LARGE_MODELS=true UNLOAD_BETWEEN_MODELS=true \
./scripts/benchmark_ollama_models.sh
Issue: Memory Doesn't Decrease After Unload¶
Container memory before: 12GiB
Container memory after: 32GiB (+20GiB)
Container memory after unload: 31GiB # Only 1GB freed!
Diagnosis:
- Model may still be cached (keep_alive not respected)
- Or the container has memory fragmentation
Solution:
# Container restart clears this
docker restart ollama
# Or enable auto-restart for each large model
RESTART_ON_LARGE_MODELS=true
Memory Management Benchmarks¶
Tested on Thor, measuring overhead of memory management features:
| Configuration | Models Tested | Failures | Total Time | Memory Peak |
|---|---|---|---|---|
| No management (baseline) | 5 | 2 (40%) | 25 min | 98GB (OOM) |
| Unload only | 5 | 1 (20%) | 27 min (+2min) | 76GB |
| Restart large only | 5 | 0 (0%) | 28 min (+3min) | 45GB |
| Full management (default) | 5 | 0 (0%) | 30 min (+5min) | 42GB |
Recommendation: Use full management (the default). The 5-minute overhead prevents failures that can cost hours of debugging.
Configuration Reference¶
Environment Variables¶
# Memory management (defaults shown)
UNLOAD_BETWEEN_MODELS=true # Unload after each model
RESTART_ON_LARGE_MODELS=true # Restart before models >40GB
# Ollama connection
OLLAMA_HOST=http://localhost:11434
# Output
RESULTS_DIR=${HOME}/ollama_benchmarks
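A sketch of how the script could apply these defaults, using the standard bash parameter-default idiom (not necessarily the script's exact code):

```bash
# Illustrative: apply defaults only when the caller hasn't set the variables.
: "${UNLOAD_BETWEEN_MODELS:=true}"
: "${RESTART_ON_LARGE_MODELS:=true}"
: "${OLLAMA_HOST:=http://localhost:11434}"
: "${RESULTS_DIR:=${HOME}/ollama_benchmarks}"

mkdir -p "$RESULTS_DIR"
```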
Script Behavior Matrix¶
| Scenario | Unload | Restart | Memory Tracking | Error Recovery |
|---|---|---|---|---|
| Default | ✅ | ✅ | ✅ | ✅ |
| Fast (risk) | ❌ | ❌ | ✅ | ✅ |
| Conservative | ✅ | ✅ (all models) | ✅ | ✅ |
To test conservatively (restart between ALL models):
# Either lower the script's 40GB size threshold to 0, or override the estimator:
estimate_model_size_gb() { echo "41"; } # Always above the threshold, so every model triggers a restart
Lessons Learned¶
- Memory is cumulative: Ollama caches everything until explicitly told not to
- Order matters: Test small models first to collect partial results
- Restart is cheap: 15s restart << hours debugging OOM
- Monitor proactively: Don't wait for failures, watch memory trends
- Failures cascade: One OOM corrupts container state for subsequent tests
Future Improvements¶
Potential enhancements for the benchmark script:

- Memory prediction: Calculate whether the next model will fit before attempting to load it

  ```bash
  mem_available=$(free | grep Mem | awk '{printf "%d", $7/1024/1024}')  # available RAM in whole GiB
  model_size_gb=$(estimate_model_size_gb "$model")
  if [ "$model_size_gb" -gt "$mem_available" ]; then
      restart_ollama_container
  fi
  ```

- Dynamic keep_alive tuning: Adjust based on available memory

  ```bash
  keep_alive=$((60 * mem_available / total_mem))  # More free memory = longer cache
  ```

- Memory pressure alerts: Warn when approaching limits

  ```bash
  if [ "$mem_usage_percent" -gt 80 ]; then
      echo "WARNING: Memory pressure high, consider restart"
  fi
  ```

- Parallel small model testing: If memory allows, test multiple small models simultaneously

  ```bash
  # Safe if: N × model_size + overhead < available_memory
  ```

- Benchmark result metadata: Include memory stats in JSON output

  ```json
  {
    "model": "qwen2.5-coder:32b",
    "memory_footprint_gib": 19.6,
    "memory_cleanup_successful": true
  }
  ```
References¶
- Production Results: docs/OLLAMA_BENCHMARK_RESULTS.md
- Model Comparison: docs/OLLAMA_MODEL_COMPARISON.md
- Benchmark Script: scripts/benchmark_ollama_models.sh
- Ollama API Docs: https://github.com/ollama/ollama/blob/main/docs/api.md
Document maintained as part of ShadowHound local LLM infrastructure