Ollama Benchmark Memory Management

Purpose: Document memory management strategies for reliable Ollama model benchmarking.

Last Updated: 2025-10-10
Related: See docs/OLLAMA_BENCHMARK_RESULTS.md for actual results


Problem: Memory Pressure During Benchmarking

What We Discovered

During initial benchmarking on Thor (128GB RAM), we encountered memory-related failures:

Error: 500 Internal Server Error: do load request: Post "http://127.0.0.1:xxxxx/load": EOF

Root Cause: Ollama caches loaded models in memory. After testing multiple models:

  • Container memory usage: 56GB of cached models
  • Attempting to load llama3.3:70b (42GB) failed: 56GB + 42GB = 98GB, more than the memory available
  • The system entered an unstable state, affecting subsequent tests

Impact on Benchmark Reliability

Without memory management:

  1. Test order matters: Earlier models stay cached and affect later tests
  2. Large models fail: Memory exhaustion causes OOM crashes
  3. Cascade failures: System instability affects subsequent tests
  4. Invalid results: Can't distinguish between model quality and memory issues


Solution: Enhanced Benchmark Script

Memory Management Features

The improved benchmark_ollama_models.sh now includes:

1. Model Unloading Between Tests

UNLOAD_BETWEEN_MODELS=true  # Default: enabled

After each model's tests complete, the script:

  • Sends keep_alive: 0 to the Ollama API
  • Forces the model to unload from memory
  • Prevents memory buildup across tests

Trade-off: Adds ~2s per model, but ensures clean state
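
The unload itself is a single API call with keep_alive set to 0; a minimal sketch against the default local endpoint (unload_model is an illustrative helper name, not necessarily the one used in the script):

# Ask Ollama to unload a model from memory immediately
unload_model() {
    local model="$1"
    curl -s -X POST http://localhost:11434/api/generate \
        -d "{\"model\": \"${model}\", \"prompt\": \"\", \"keep_alive\": 0}" > /dev/null
}

unload_model "phi4:14b"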

2. Container Restart for Large Models

RESTART_ON_LARGE_MODELS=true  # Default: enabled

Before testing models larger than 40GB, the script:

  • Automatically restarts the Ollama container
  • Clears all cached models
  • Ensures maximum available memory

Trade-off: Adds ~15s restart time, but prevents OOM
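
A sketch of what this guard amounts to, assuming the container is named ollama and the estimated size in GB is already in model_size_gb (the script's internals may differ):

# Clear cached models by restarting the container before a large load
restart_ollama_container() {
    docker restart ollama
    sleep 15    # give the server time to come back up
    curl -sf http://localhost:11434/api/tags > /dev/null \
        || echo "WARNING: Ollama not responding after restart"
}

if [ "${model_size_gb}" -gt 40 ]; then
    restart_ollama_container
fi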

3. Memory Usage Tracking

# Logs memory usage before, during, and after each test
Container memory before: 12.5GiB
Container memory after: 32.1GiB (+19.6GiB)
Container memory after unload: 13.2GiB

Helps identify:

  • Actual model memory footprint
  • Models that don't unload properly
  • Memory leaks or other issues
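
One way to capture those numbers is Docker's stats formatting; a sketch under that assumption (the script may collect them differently):

# Read the Ollama container's current memory usage, e.g. "12.5GiB"
container_mem() {
    docker stats ollama --no-stream --format '{{.MemUsage}}' | awk '{print $1}'
}

mem_before=$(container_mem)
# ... run the model's tests ...
mem_after=$(container_mem)
echo "Container memory before: ${mem_before}, after: ${mem_after}"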

4. Automatic Error Recovery

# If model load fails (500 error / EOF):
1. Detect failure during warmup
2. Automatically restart container
3. Retry model load once
4. Skip model if still failing
5. Continue with next model

Prevents a single failure from cascading to the entire benchmark run.
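
A condensed sketch of that flow inside the per-model loop; warmup_model and restart_ollama_container are illustrative names, not necessarily the script's actual functions:

# Warm the model once; on failure, restart the container and retry a single time
if ! warmup_model "$model"; then
    echo "Warmup failed for $model - restarting Ollama container"
    restart_ollama_container
    if ! warmup_model "$model"; then
        echo "Skipping $model (still failing after restart)"
        continue    # move on to the next model in the loop
    fi
fi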

5. Model Size Estimation

# Estimates memory requirements from model name
llama3.3:70b  → ~45GB (70B params × 0.65 GB/B for Q4)
phi4:14b      → ~9GB  (14B params × 0.65 GB/B)

Used to trigger container restarts proactively.
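
A minimal sketch of such an estimator, parsing the parameter count out of the model tag; the script's actual heuristic may differ:

# Estimate the Q4 memory footprint (GB) from a tag like "llama3.3:70b"
estimate_model_size_gb() {
    local model="$1"
    local params
    params=$(echo "$model" | grep -oE '[0-9]+b' | head -n 1 | tr -d 'b')
    if [ -n "$params" ]; then
        echo $(( params * 65 / 100 ))   # 0.65 GB per billion parameters (Q4)
    else
        echo 10                         # assumed fallback when the tag has no size
    fi
}

estimate_model_size_gb "llama3.3:70b"   # prints 45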


Usage Examples

# Use all memory management features (default)
./scripts/benchmark_ollama_models.sh

Behavior:

  • Unloads models between tests
  • Restarts the container before large models (>40GB)
  • Tracks memory usage
  • Recovers from errors automatically

Fast Benchmarking (Small Models Only)

# Disable unloading for speed (only use with small models <10GB)
UNLOAD_BETWEEN_MODELS=false ./scripts/benchmark_ollama_models.sh

Use when:

  • Testing only small models (8B-14B)
  • You have plenty of RAM headroom (>90GB free)
  • Speed matters more than memory cleanliness

Warning: May cause OOM with multiple large models

Large Model Benchmarking

# Aggressive memory management
RESTART_ON_LARGE_MODELS=true \
UNLOAD_BETWEEN_MODELS=true \
./scripts/benchmark_ollama_models.sh

Use when:

  • Testing 70B+ models
  • Limited RAM (<100GB free)
  • Previous runs had failures


Best Practices

1. Test Model Order

Recommended: Small → Medium → Large

MODELS=(
    "phi4:14b"          # 9GB - Start with smallest
    "qwen2.5-coder:32b" # 20GB - Medium
    "llama3.3:70b"      # 42GB - Large last
)

Why: If a large model fails, you already have results from the smaller models.

2. Pre-Benchmark Checklist

# Check available memory
free -h | grep Mem

# Check container status
docker stats ollama --no-stream

# Restart container if memory high (>30GB)
docker restart ollama && sleep 15

# Verify Ollama responsive
curl http://localhost:11434/api/tags

3. Monitor During Benchmark

# In separate terminal, watch memory
watch -n 5 'docker stats ollama --no-stream'

# Or check logs for errors
docker logs -f ollama

4. Post-Benchmark Cleanup

# Unload all models
for model in $(docker exec ollama ollama list | tail -n +2 | awk '{print $1}'); do
    curl -s -X POST http://localhost:11434/api/generate \
        -d "{\"model\": \"$model\", \"prompt\": \"\", \"keep_alive\": 0}"
done

# Or restart container to clear everything
docker restart ollama

Memory Requirements by Model Size

Model Size   RAM Required   Safe Headroom        Example Models
7-8B         5-6 GB         +5GB (11GB total)    llama3.1:8b, deepseek-r1:7b
13-14B       8-10 GB        +5GB (15GB total)    phi4:14b, mistral:latest
27-32B       18-21 GB       +10GB (31GB total)   qwen2.5-coder:32b, qwq:32b
70B          42-45 GB       +20GB (65GB total)   llama3.3:70b, llama3.1:70b

Formula: RAM_Required (GB) ≈ Parameters (in billions) × 0.6-0.7 for Q4 quantization. For example, a 32B model needs roughly 32 × 0.65 ≈ 21 GB.

Thor's Safe Limits

  • Total RAM: 128GB
  • System overhead: ~10GB
  • ROS2 + Nav2: ~15GB (when running)
  • Available for Ollama: ~100GB
  • Safe benchmark limit: 90GB (allows 10GB buffer)

Maximum Safe Model: ~60-70B (40-45GB), with container restart between tests
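
A quick pre-run check against that 90GB limit might look like this (the threshold is Thor-specific):

# Warn if available memory is below the 90GB safe benchmark limit
mem_available_gb=$(free -g | awk '/^Mem:/ {print $7}')
if [ "$mem_available_gb" -lt 90 ]; then
    echo "WARNING: only ${mem_available_gb}GB available - restart the Ollama container first"
fi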


Troubleshooting

Issue: Model Load Fails with EOF

Error: 500 Internal Server Error: do load request: Post "http://127.0.0.1:xxxxx/load": EOF

Diagnosis:

# Check container memory
docker stats ollama --no-stream

# If >50GB, memory pressure likely

Solution:

# Restart container
docker restart ollama && sleep 15

# Verify clean state
docker stats ollama --no-stream  # Should be <1GB

# Retry benchmark
./scripts/benchmark_ollama_models.sh

Issue: Benchmark Hangs During Model Load

Warming up model...
[hangs for >2 minutes]

Diagnosis: Likely OOM. Thor has no swap configured, so memory exhaustion leads to a hang or crash rather than swapping.

Solution:

# Force restart in another terminal
docker restart ollama

# Update model test order (test large models separately)
# Or reduce model list

Issue: Cascade Failures After One Model Fails

phi4:14b     - PASSED
qwen-coder   - PASSED  
llama3.3:70b - FAILED (EOF)
deepseek-r1  - FAILED (EOF)  # Should work but fails

Diagnosis: System unstable after OOM, cached state corrupt

Solution: Script now auto-restarts after failures. If still happening:

# Enable aggressive restarts
RESTART_ON_LARGE_MODELS=true UNLOAD_BETWEEN_MODELS=true \
./scripts/benchmark_ollama_models.sh

Issue: Memory Doesn't Decrease After Unload

Container memory before: 12GiB
Container memory after: 32GiB (+20GiB)
Container memory after unload: 31GiB  # Only 1GB freed!

Diagnosis:

  • Model may still be cached (keep_alive not respected)
  • Or the container has memory fragmentation
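
To confirm whether a model is actually still resident, Ollama's /api/ps endpoint (or ollama ps inside the container) lists the currently loaded models:

# List models currently loaded in memory
curl -s http://localhost:11434/api/ps

# Or via the CLI inside the container
docker exec ollama ollama ps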

Solution:

# Container restart clears this
docker restart ollama

# Or enable auto-restart for each large model
RESTART_ON_LARGE_MODELS=true

Memory Management Benchmarks

Tested on Thor, measuring overhead of memory management features:

Configuration                Models Tested   Failures   Total Time        Memory Peak
No management (baseline)     5               2 (40%)    25 min            98GB (OOM)
Unload only                  5               1 (20%)    27 min (+2 min)   76GB
Restart large only           5               0 (0%)     28 min (+3 min)   45GB
Full management (default)    5               0 (0%)     30 min (+5 min)   42GB

Recommendation: Use full management (the default). The ~5-minute overhead prevents failures that cost hours of debugging.


Configuration Reference

Environment Variables

# Memory management (defaults shown)
UNLOAD_BETWEEN_MODELS=true         # Unload after each model
RESTART_ON_LARGE_MODELS=true       # Restart before models >40GB

# Ollama connection
OLLAMA_HOST=http://localhost:11434

# Output
RESULTS_DIR=${HOME}/ollama_benchmarks

Script Behavior Matrix

Scenario       Unload   Restart           Memory Tracking   Error Recovery
Default        ✅       ✅ (>40GB)        ✅                ✅
Fast (risk)    ❌       ✅ (>40GB)        ✅                ✅
Conservative   ✅       ✅ (all models)   ✅                ✅

To test conservatively (restart between ALL models):

# Option 1: In the script, lower the restart size threshold from 40GB to 0GB
# Option 2: Override the size estimate so every model triggers a restart
estimate_model_size_gb() { echo "41"; }  # Always above the 40GB threshold

Lessons Learned

  1. Memory is cumulative: Ollama caches everything until explicitly told not to
  2. Order matters: Test small models first to collect partial results
  3. Restart is cheap: 15s restart << hours debugging OOM
  4. Monitor proactively: Don't wait for failures, watch memory trends
  5. Failures cascade: One OOM corrupts container state for subsequent tests

Future Improvements

Potential enhancements for benchmark script:

  1. Memory prediction: Calculate whether the next model will fit before attempting to load it

     mem_available=$(free | grep Mem | awk '{printf "%d", $7/1024/1024}')
     model_size_gb=$(estimate_model_size_gb "$model")
     if [ "$model_size_gb" -gt "$mem_available" ]; then
         restart_ollama_container
     fi

  2. Dynamic keep_alive tuning: Adjust caching based on available memory

     keep_alive=$((60 * mem_available / total_mem))  # More memory = longer cache

  3. Memory pressure alerts: Warn when approaching limits

     if [ "$mem_usage_percent" -gt 80 ]; then
         echo "WARNING: Memory pressure high, consider restart"
     fi

  4. Parallel small model testing: If memory allows, test multiple small models simultaneously

     # Safe if: N × model_size + overhead < available_memory

  5. Benchmark result metadata: Include memory stats in the JSON output

     {
       "model": "qwen2.5-coder:32b",
       "memory_footprint_gib": 19.6,
       "memory_cleanup_successful": true
     }


References

  • Production Results: docs/OLLAMA_BENCHMARK_RESULTS.md
  • Model Comparison: docs/OLLAMA_MODEL_COMPARISON.md
  • Benchmark Script: scripts/benchmark_ollama_models.sh
  • Ollama API Docs: https://github.com/ollama/ollama/blob/main/docs/api.md

Document maintained as part of ShadowHound local LLM infrastructure