Ollama Benchmark Memory Management

Purpose: Document memory management strategies for reliable Ollama model benchmarking.

Last Updated: 2025-10-10
Related: See docs/OLLAMA_BENCHMARK_RESULTS.md for actual results


Problem: Memory Pressure During Benchmarking

What We Discovered

During initial benchmarking on Thor (128GB RAM), we encountered memory-related failures:

Error: 500 Internal Server Error: do load request: Post "http://127.0.0.1:xxxxx/load": EOF

Root Cause: Ollama caches loaded models in memory. After testing multiple models:

  • Container memory usage: 56GB of cached models
  • Attempting to load llama3.3:70b (42GB) failed: 56GB + 42GB = 98GB, more than the memory available
  • The system entered an unstable state, affecting subsequent tests

Impact on Benchmark Reliability

Without memory management:

  1. Test order matters: Earlier models stay cached and affect later tests
  2. Large models fail: Memory exhaustion causes OOM crashes
  3. Cascade failures: System instability affects subsequent tests
  4. Invalid results: Can't distinguish between model quality and memory issues


Solution: Enhanced Benchmark Script

Memory Management Features

The improved benchmark_ollama_models.sh now includes:

1. Model Unloading Between Tests

UNLOAD_BETWEEN_MODELS=true  # Default: enabled

After each model's tests complete, the script:

  • Sends keep_alive: 0 to the Ollama API
  • Forces the model to unload from memory
  • Prevents memory buildup across tests

Trade-off: Adds ~2s per model, but ensures clean state
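
The unload itself is a single API call with keep_alive set to 0; a minimal sketch against the default local endpoint (unload_model is an illustrative helper name, not necessarily the one used in the script):

# Ask Ollama to unload a model from memory immediately
unload_model() {
    local model="$1"
    curl -s -X POST http://localhost:11434/api/generate \
        -d "{\"model\": \"${model}\", \"prompt\": \"\", \"keep_alive\": 0}" > /dev/null
}

unload_model "phi4:14b"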

2. Container Restart for Large Models

RESTART_ON_LARGE_MODELS=true  # Default: enabled

Before testing models larger than 40GB, the script:

  • Automatically restarts the Ollama container
  • Clears all cached models
  • Ensures maximum available memory

Trade-off: Adds ~15s restart time, but prevents OOM
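
A sketch of what this guard amounts to, assuming the container is named ollama and the estimated size in GB is already in model_size_gb (the script's internals may differ):

# Clear cached models by restarting the container before a large load
restart_ollama_container() {
    docker restart ollama
    sleep 15    # give the server time to come back up
    curl -sf http://localhost:11434/api/tags > /dev/null \
        || echo "WARNING: Ollama not responding after restart"
}

if [ "${model_size_gb}" -gt 40 ]; then
    restart_ollama_container
fi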

3. Memory Usage Tracking

# Logs memory usage before, during, and after each test
Container memory before: 12.5GiB
Container memory after: 32.1GiB (+19.6GiB)
Container memory after unload: 13.2GiB

Helps identify:

  • Actual model memory footprint
  • Models that don't unload properly
  • Memory leaks or other issues
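
One way to capture those numbers is Docker's stats formatting; a sketch under that assumption (the script may collect them differently):

# Read the Ollama container's current memory usage, e.g. "12.5GiB"
container_mem() {
    docker stats ollama --no-stream --format '{{.MemUsage}}' | awk '{print $1}'
}

mem_before=$(container_mem)
# ... run the model's tests ...
mem_after=$(container_mem)
echo "Container memory before: ${mem_before}, after: ${mem_after}"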

4. Automatic Error Recovery

# If model load fails (500 error / EOF):
1. Detect failure during warmup
2. Automatically restart container
3. Retry model load once
4. Skip model if still failing
5. Continue with next model

Prevents a single failure from cascading to the entire benchmark run.
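
A condensed sketch of that flow inside the per-model loop; warmup_model and restart_ollama_container are illustrative names, not necessarily the script's actual functions:

# Warm the model once; on failure, restart the container and retry a single time
if ! warmup_model "$model"; then
    echo "Warmup failed for $model - restarting Ollama container"
    restart_ollama_container
    if ! warmup_model "$model"; then
        echo "Skipping $model (still failing after restart)"
        continue    # move on to the next model in the loop
    fi
fi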

5. Model Size Estimation

# Estimates memory requirements from model name
llama3.3:70b  → ~45GB (70B params × 0.65 GB/B for Q4)
phi4:14b      → ~9GB  (14B params × 0.65 GB/B)

Used to trigger container restarts proactively.
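
A minimal sketch of such an estimator, parsing the parameter count out of the model tag; the script's actual heuristic may differ:

# Estimate the Q4 memory footprint (GB) from a tag like "llama3.3:70b"
estimate_model_size_gb() {
    local model="$1"
    local params
    params=$(echo "$model" | grep -oE '[0-9]+b' | head -n 1 | tr -d 'b')
    if [ -n "$params" ]; then
        echo $(( params * 65 / 100 ))   # 0.65 GB per billion parameters (Q4)
    else
        echo 10                         # assumed fallback when the tag has no size
    fi
}

estimate_model_size_gb "llama3.3:70b"   # prints 45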


Usage Examples

# Use all memory management features (default)
./scripts/benchmark_ollama_models.sh

Behavior:

  • Unloads models between tests
  • Restarts the container before large models (>40GB)
  • Tracks memory usage
  • Recovers from errors automatically

Fast Benchmarking (Small Models Only)

# Disable unloading for speed (only use with small models <10GB)
UNLOAD_BETWEEN_MODELS=false ./scripts/benchmark_ollama_models.sh

Use when:

  • Testing only small models (8B-14B)
  • You have plenty of RAM headroom (>90GB free)
  • Speed matters more than memory cleanliness

Warning: May cause OOM with multiple large models

Large Model Benchmarking

# Aggressive memory management
RESTART_ON_LARGE_MODELS=true \
UNLOAD_BETWEEN_MODELS=true \
./scripts/benchmark_ollama_models.sh

Use when:

  • Testing 70B+ models
  • Limited RAM (<100GB free)
  • Previous runs had failures


Best Practices

1. Test Model Order

Recommended: Small → Medium → Large

MODELS=(
    "phi4:14b"          # 9GB - Start with smallest
    "qwen2.5-coder:32b" # 20GB - Medium
    "llama3.3:70b"      # 42GB - Large last
)

Why: If a large model fails, you already have results from the smaller models.

2. Pre-Benchmark Checklist

# Check available memory
free -h | grep Mem

# Check container status
docker stats ollama --no-stream

# Restart container if memory high (>30GB)
docker restart ollama && sleep 15

# Verify Ollama responsive
curl http://localhost:11434/api/tags

3. Monitor During Benchmark

# In separate terminal, watch memory
watch -n 5 'docker stats ollama --no-stream'

# Or check logs for errors
docker logs -f ollama

4. Post-Benchmark Cleanup

# Unload all models
for model in $(docker exec ollama ollama list | tail -n +2 | awk '{print $1}'); do
    curl -s -X POST http://localhost:11434/api/generate \
        -d "{\"model\": \"$model\", \"prompt\": \"\", \"keep_alive\": 0}"
done

# Or restart container to clear everything
docker restart ollama

Memory Requirements by Model Size

Model Size   RAM Required   Safe Headroom        Example Models
7-8B         5-6 GB         +5GB (11GB total)    llama3.1:8b, deepseek-r1:7b
13-14B       8-10 GB        +5GB (15GB total)    phi4:14b, mistral:latest
27-32B       18-21 GB       +10GB (31GB total)   qwen2.5-coder:32b, qwq:32b
70B          42-45 GB       +20GB (65GB total)   llama3.3:70b, llama3.1:70b

Formula: RAM_Required (GB) ≈ Parameters (in billions) × 0.6-0.7 for Q4 quantization. For example, a 32B model needs roughly 32 × 0.65 ≈ 21 GB.

Thor's Safe Limits

  • Total RAM: 128GB
  • System overhead: ~10GB
  • ROS2 + Nav2: ~15GB (when running)
  • Available for Ollama: ~100GB
  • Safe benchmark limit: 90GB (allows 10GB buffer)

Maximum Safe Model: ~60-70B (40-45GB), with container restart between tests
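
A quick pre-run check against that 90GB limit might look like this (the threshold is Thor-specific):

# Warn if available memory is below the 90GB safe benchmark limit
mem_available_gb=$(free -g | awk '/^Mem:/ {print $7}')
if [ "$mem_available_gb" -lt 90 ]; then
    echo "WARNING: only ${mem_available_gb}GB available - restart the Ollama container first"
fi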


Troubleshooting

Issue: Model Load Fails with EOF

Error: 500 Internal Server Error: do load request: Post "http://127.0.0.1:xxxxx/load": EOF

Diagnosis:

# Check container memory
docker stats ollama --no-stream

# If >50GB, memory pressure likely

Solution:

# Restart container
docker restart ollama && sleep 15

# Verify clean state
docker stats ollama --no-stream  # Should be <1GB

# Retry benchmark
./scripts/benchmark_ollama_models.sh

Issue: Benchmark Hangs During Model Load

Warming up model...
[hangs for >2 minutes]

Diagnosis: Likely OOM. Thor has no swap configured, so memory exhaustion leads to a hang or crash rather than swapping.

Solution:

# Force restart in another terminal
docker restart ollama

# Update model test order (test large models separately)
# Or reduce model list

Issue: Cascade Failures After One Model Fails

phi4:14b     - PASSED
qwen-coder   - PASSED  
llama3.3:70b - FAILED (EOF)
deepseek-r1  - FAILED (EOF)  # Should work but fails

Diagnosis: System unstable after OOM, cached state corrupt

Solution: Script now auto-restarts after failures. If still happening:

# Enable aggressive restarts
RESTART_ON_LARGE_MODELS=true UNLOAD_BETWEEN_MODELS=true \
./scripts/benchmark_ollama_models.sh

Issue: Memory Doesn't Decrease After Unload

Container memory before: 12GiB
Container memory after: 32GiB (+20GiB)
Container memory after unload: 31GiB  # Only 1GB freed!

Diagnosis:

  • Model may still be cached (keep_alive not respected)
  • Or the container has memory fragmentation
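
To confirm whether a model is actually still resident, Ollama's /api/ps endpoint (or ollama ps inside the container) lists the currently loaded models:

# List models currently loaded in memory
curl -s http://localhost:11434/api/ps

# Or via the CLI inside the container
docker exec ollama ollama ps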

Solution:

# Container restart clears this
docker restart ollama

# Or enable auto-restart for each large model
RESTART_ON_LARGE_MODELS=true

Memory Management Benchmarks

Tested on Thor, measuring overhead of memory management features:

Configuration                Models Tested   Failures   Total Time        Memory Peak
No management (baseline)     5               2 (40%)    25 min            98GB (OOM)
Unload only                  5               1 (20%)    27 min (+2 min)   76GB
Restart large only           5               0 (0%)     28 min (+3 min)   45GB
Full management (default)    5               0 (0%)     30 min (+5 min)   42GB

Recommendation: Use full management (the default). The ~5-minute overhead prevents failures that cost hours of debugging.


Configuration Reference

Environment Variables

# Memory management (defaults shown)
UNLOAD_BETWEEN_MODELS=true         # Unload after each model
RESTART_ON_LARGE_MODELS=true       # Restart before models >40GB

# Ollama connection
OLLAMA_HOST=http://localhost:11434

# Output
RESULTS_DIR=${HOME}/ollama_benchmarks

Script Behavior Matrix

Scenario       Unload   Restart           Memory Tracking   Error Recovery
Default        ✅       ✅ (>40GB)        ✅                ✅
Fast (risk)    ❌       ✅ (>40GB)        ✅                ✅
Conservative   ✅       ✅ (all models)   ✅                ✅

To test conservatively (restart between ALL models):

# Option 1: In the script, lower the restart size threshold from 40GB to 0GB
# Option 2: Override the size estimate so every model triggers a restart
estimate_model_size_gb() { echo "41"; }  # Always above the 40GB threshold

Lessons Learned

  1. Memory is cumulative: Ollama caches everything until explicitly told not to
  2. Order matters: Test small models first to collect partial results
  3. Restart is cheap: 15s restart << hours debugging OOM
  4. Monitor proactively: Don't wait for failures, watch memory trends
  5. Failures cascade: One OOM corrupts container state for subsequent tests

Future Improvements

Potential enhancements for benchmark script:

  1. Memory prediction: Calculate whether the next model will fit before attempting to load it

     mem_available=$(free | grep Mem | awk '{printf "%d", $7/1024/1024}')
     model_size_gb=$(estimate_model_size_gb "$model")
     if [ "$model_size_gb" -gt "$mem_available" ]; then
         restart_ollama_container
     fi

  2. Dynamic keep_alive tuning: Adjust caching based on available memory

     keep_alive=$((60 * mem_available / total_mem))  # More memory = longer cache

  3. Memory pressure alerts: Warn when approaching limits

     if [ "$mem_usage_percent" -gt 80 ]; then
         echo "WARNING: Memory pressure high, consider restart"
     fi

  4. Parallel small model testing: If memory allows, test multiple small models simultaneously

     # Safe if: N × model_size + overhead < available_memory

  5. Benchmark result metadata: Include memory stats in the JSON output

     {
       "model": "qwen2.5-coder:32b",
       "memory_footprint_gib": 19.6,
       "memory_cleanup_successful": true
     }


References

  • Production Results: docs/OLLAMA_BENCHMARK_RESULTS.md
  • Model Comparison: docs/OLLAMA_MODEL_COMPARISON.md
  • Benchmark Script: scripts/benchmark_ollama_models.sh
  • Ollama API Docs: https://github.com/ollama/ollama/blob/main/docs/api.md

Document maintained as part of ShadowHound local LLM infrastructure