# Ollama Local LLM Support - Current Status and TODOs

Date: 2025-10-10 (End of Day)
Branch: feature/local-llm-support
Status: Ready for Robot Testing
Next Test Date: 2025-10-11

## 🎯 Current State Summary
### ✅ What's Working

- Ollama Container on Thor
  - Jetson-optimized container running: `ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04`
  - HTTP API accessible at `http://192.168.50.10:11434` (a quick health-check sketch appears at the end of this list)
  - Container properly configured with GPU access
  - Service stable and responding to requests

- Models Pulled and Available
  - ✅ phi4:14b - Primary candidate (~7.7GB model size)
  - ✅ qwen2.5-coder:32b (~20GB)
  - ✅ qwq:32b
  - ✅ llama3.3:70b
  - ✅ deepseek-r1:7b
  - ✅ gpt-oss:20b (testing only)

- Backend Validation System
  - Two-layer validation implemented:
    - Early check in `start.sh` (warns if backend unreachable)
    - Authoritative check in `mission_agent.py` (exits if validation fails)
  - Automatic validation on mission agent startup
  - Clear error messages for troubleshooting
  - See `docs/LLM_BACKEND_VALIDATION.md` for details

- Benchmark Infrastructure
  - Quality scoring system implemented (`scripts/quality_scorer.py`)
  - Automated benchmark script (`scripts/benchmark_ollama_models.sh`)
  - GPU validation tests (baseline + final verification)
  - Auto-start container if not running
  - API-only operations (no docker restart issues)

- Documentation Complete
  - Setup guide for Thor (`scripts/setup_ollama_thor.sh`)
  - Performance notes (`docs/THOR_PERFORMANCE_NOTES.md`)
  - Deployment checklist (`docs/OLLAMA_DEPLOYMENT_CHECKLIST.md`)
  - Backend validation guide (`docs/LLM_BACKEND_VALIDATION.md`)
  - Benchmark results template (`docs/OLLAMA_BENCHMARK_RESULTS.md`)

- GPU Monitoring Tools Ready
  - jtop installation scripts created and committed:
    - `scripts/install_jtop_thor.sh` (main installer)
    - `scripts/patch_thor_jp7_in_repo.sh` (Thor GPU patches)
  - Security analysis completed (`docs/SECURITY_ANALYSIS_JTOP.md`)
  - NOT YET INSTALLED on Thor
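A quick health check of the endpoint and pulled models (the same kind of reachability test the two-layer validation performs at startup) can be run from the dev machine. This is a minimal sketch, assuming the host/port above and that `curl` and `jq` are installed; it is not one of the repo scripts:

```bash
# Minimal health check against the Thor Ollama endpoint (sketch, not a repo script)
OLLAMA_HOST="http://192.168.50.10:11434"

# Ollama answers at the root path when the server is up
curl -sf "$OLLAMA_HOST/" || { echo "Ollama unreachable at $OLLAMA_HOST"; exit 1; }

# List the models the server knows about (same data as `ollama list`)
curl -sf "$OLLAMA_HOST/api/tags" | jq -r '.models[].name'
```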
### ⚠️ Known Issues

#### 1. GPU Performance Degradation (CRITICAL)

Symptom: After running benchmark or unloading/reloading models, GPU performance drops 50-86%

Evidence from Tonight's Testing:
- BASELINE (Fresh): gpt-oss:20b at 37.00 tok/s ✅
- POST-BENCHMARK: ~5 tok/s ❌ (86% slower!)
- FINAL VERIFICATION: ~5 tok/s ❌ (FAILED)

Impact:
- Benchmark results unreliable after first model test
- Production use may suffer degradation over time
- Affects ALL models, not model-specific

Current Workaround:
- Reboot Thor before production use
- Avoid unnecessary model unload/reload cycles
- Keep models loaded with `keep_alive` parameter

Root Cause: UNKNOWN
- Not fixed by API-only operations
- Not fixed by avoiding container restart
- Possibly Jetson container issue
- Possibly CUDA context persistence issue
- Possibly GPU memory fragmentation

Priority: HIGH - Must investigate before production deployment. A quick way to measure the eval rate directly from the API (useful for spotting degradation without rerunning the full benchmark) is sketched below.
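One lightweight way to quantify the degradation is to read the timing fields Ollama returns from `/api/generate` (`eval_count` tokens over `eval_duration` nanoseconds). A minimal sketch, assuming `curl` and `jq` on a machine that can reach Thor; run it once after a fresh reboot and again after a model unload/reload and compare:

```bash
# Single-request eval-rate check (sketch): tok/s = eval_count / eval_duration * 1e9
OLLAMA_HOST="http://192.168.50.10:11434"

curl -s "$OLLAMA_HOST/api/generate" -d '{
  "model": "gpt-oss:20b",
  "prompt": "Count to 10",
  "stream": false
}' | jq '{model, eval_count, eval_duration,
         tok_per_s: (.eval_count / .eval_duration * 1e9)}'
```

A healthy baseline should land near the ~37 tok/s seen above; a result around 5 tok/s indicates the degraded state.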
#### 2. Memory Monitoring Disabled

Issue: `nvidia-smi` on Thor returns N/A for GPU memory metrics

Current State:
- Memory monitoring in benchmark script commented out (lines 165-195)
- Cannot track per-model VRAM usage accurately
- Cannot verify models fit in available memory

Solution Available: Install jtop with Thor patches
- Scripts ready: `install_jtop_thor.sh` + `patch_thor_jp7_in_repo.sh`
- Provides accurate GPU memory, utilization, power, temperature
- NOT YET INSTALLED
- Until then, Ollama's own reporting can serve as an interim check (see the sketch below)

Priority: MEDIUM - Should install before robot testing
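As an interim measure until jtop is installed, Ollama itself reports the size of loaded models. A sketch, assuming the container is named `ollama` as elsewhere in this document; the `size`/`size_vram` field names are as I recall from the Ollama API docs, so verify against the actual response:

```bash
# Which models are currently loaded, how big they are, and CPU vs GPU placement
docker exec ollama ollama ps

# Same information over HTTP from the dev machine
curl -s http://192.168.50.10:11434/api/ps | jq '.models[] | {name, size, size_vram}'
```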
## 📊 Benchmark Results (Partial)

### Test Conditions
- System: Thor (Jetson AGX Orin 64GB)
- Container: Jetson-optimized Ollama
- Date: 2025-10-10 evening
- Status: INCOMPLETE - GPU degradation invalidated results
Observed Performance (Before Degradation)¶
phi4:14b - Most promising candidate - Speed: ~15-20 tok/s estimated (not formally measured) - Quality: Strong for reasoning tasks - Size: Reasonable memory footprint (~7.7GB) - JSON: Good structured output capability - Recommendation: Primary model for robot testing
qwen2.5-coder:32b - Previous benchmark winner - Speed: ~4-5 tok/s (previous benchmark) - Quality: 98/100 (JSON specialist) - Size: Large memory footprint (~20GB) - JSON: Excellent structured output - Status: Backup option, may be too slow for real-time robot
### GPU Validation Results

- ✅ BASELINE TEST: 36.67 tok/s (gpt-oss:20b)
- ❌ FINAL TEST: ~5 tok/s (FAILED - degraded)

Conclusion: Benchmark results unreliable, must retest after fixing GPU issue
## 📋 TODOs Before Robot Testing

### Priority 1: CRITICAL (Must Fix)

#### [ ] 1.1: Install jtop on Thor for GPU Monitoring

```bash
# On Thor
cd ~/shadowhound
git pull origin feature/local-llm-support
sudo ./scripts/install_jtop_thor.sh

# Verify installation
systemctl status jtop.service
sudo jtop   # Should show GPU memory, not N/A
```

Why Critical: Need accurate GPU monitoring during robot testing to:
- Verify models fit in VRAM
- Monitor for memory pressure
- Track GPU utilization during missions
- Debug performance issues
Estimated Time: 15 minutes
Blockers: None, scripts ready
#### [ ] 1.2: Investigate GPU Performance Degradation

Hypothesis Testing Plan:

- Test CUDA Persistence Mode

  ```bash
  # On Thor, before benchmark
  sudo nvidia-smi -pm 1   # Enable persistence mode

  # Run single model test twice
  # Compare speeds - should be consistent
  ```

- Test Standard Container vs Jetson Container

  ```bash
  # Try standard ollama/ollama container
  docker run -d --gpus=all -p 11435:11434 ollama/ollama

  # Test same model multiple times
  # Check if degradation still occurs
  ```

- Test Model Keep-Alive Behavior

  ```bash
  # Load model and keep it in memory (keep_alive: -1 = keep loaded forever;
  # JSON does not allow inline comments, so the note lives up here)
  curl http://localhost:11434/api/generate -d '{
    "model": "phi4:14b",
    "prompt": "test",
    "keep_alive": -1
  }'

  # Run multiple queries
  # Check if keeping model loaded prevents degradation
  ```

- Monitor with jtop During Benchmark

  ```bash
  # Terminal 1: Run jtop
  sudo jtop

  # Terminal 2: Run benchmark
  ./scripts/benchmark_ollama_models.sh

  # Watch for:
  # - GPU memory fragmentation
  # - Clock speed changes
  # - Thermal throttling
  # - Power state changes
  ```

- Check NVIDIA/Ollama Known Issues
  - Search Jetson forums for similar reports
  - Check Ollama GitHub issues for Jetson-specific bugs
  - Test with different Ollama versions

Success Criteria: Identify root cause and implement fix OR document reliable workaround

Estimated Time: 2-4 hours

Priority: CRITICAL - Blocks production deployment
### Priority 2: HIGH (Should Do Before Robot Test)

#### [ ] 2.1: Reboot Thor and Verify Clean State

```bash
# On Thor
sudo reboot

# After reboot, verify GPU baseline
docker exec ollama ollama run --verbose gpt-oss:20b "Count to 10"
# Should see: ~37 tok/s (eval rate)
```
Why: Ensure clean starting state for robot testing
Estimated Time: 5 minutes (+ reboot time)
#### [ ] 2.2: Verify phi4:14b Performance in Clean State

```bash
# On Thor, after reboot
docker exec ollama ollama run --verbose phi4:14b "Explain what a quadruped robot is in one sentence."

# Expected:
# - Speed: 15-20 tok/s
# - Quality: Coherent, concise response
# - JSON capable: Test with structured output prompt
```

Test JSON Output:

```bash
docker exec ollama ollama run phi4:14b "Return JSON with keys: name (value: GO2), type (value: quadruped), speed (value: fast)"

# Expected output:
# {"name": "GO2", "type": "quadruped", "speed": "fast"}
```

An API-level variant that forces JSON mode is sketched after this item.

Success Criteria:
- Speed >15 tok/s in clean state
- Valid JSON generation
- Coherent reasoning
Estimated Time: 10 minutes
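For a stricter check than prompting alone, the same test can go through the HTTP API with Ollama's JSON mode (`"format": "json"`), which constrains the model to emit valid JSON. A sketch, assuming `curl` and `jq` and the Thor endpoint used elsewhere in this doc; piping the response through `jq` doubles as a validity check (it exits non-zero on malformed JSON):

```bash
curl -s http://192.168.50.10:11434/api/generate -d '{
  "model": "phi4:14b",
  "prompt": "Return JSON with keys: name (value: GO2), type (value: quadruped), speed (value: fast)",
  "format": "json",
  "stream": false
}' | jq -r '.response' | jq .
```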
#### [ ] 2.3: Document phi4:14b as Primary Test Model

Update `docs/OLLAMA_BENCHMARK_RESULTS.md` with decision rationale:
- Speed: Good balance (15-20 tok/s)
- Size: Reasonable VRAM (~7.7GB)
- Quality: Strong reasoning capability
- JSON: Capable of structured output
- Availability: Already pulled on Thor
Estimated Time: 10 minutes
### Priority 3: MEDIUM (Good to Have)

#### [ ] 3.1: Update Deployment Checklist with Current Status

Mark Phase 1 and Phase 2 tests as ready:
- Mission agent backend validation ✅ (implemented)
- Ollama connection working ✅
- Models available ✅
- GPU monitoring tools ready ⏳ (need to install)
Estimated Time: 5 minutes
#### [ ] 3.2: Create Quick Start Guide for Tomorrow

Write `docs/ROBOT_TEST_QUICKSTART.md` with:
1. Pre-flight checks (reboot Thor, verify GPU)
2. Launch commands (mission agent with phi4:14b)
3. Test missions (3-5 simple commands)
4. What to look for (speed, quality, errors)
5. Troubleshooting steps
Estimated Time: 15 minutes
#### [ ] 3.3: Prepare Test Mission Scripts

Create `scripts/test_missions.txt` with pre-written missions:

```
# Simple missions for robot testing
1. "Move forward 1 meter"
2. "Rotate 90 degrees to the left"
3. "Move to coordinates x=2.0, y=1.0"
4. "Take a photo"
5. "Return to starting position"
```
Estimated Time: 10 minutes
## 🧪 Robot Testing Plan (2025-10-11)

### Test Setup

- Hardware Prerequisites
  - [ ] GO2 robot powered on and responsive
  - [ ] Thor rebooted (clean GPU state)
  - [ ] ROS2 bridge running (go2_ros2_sdk)
  - [ ] Network connectivity verified

- Software Prerequisites
  - [ ] jtop installed and working
  - [ ] phi4:14b verified in clean state
  - [ ] Mission agent tested with backend validation
  - [ ] Dashboard accessible

- Monitoring Setup
  - [ ] Terminal 1: `sudo jtop` (GPU monitoring)
  - [ ] Terminal 2: Mission agent logs
  - [ ] Terminal 3: ROS2 topic monitoring (`/cmd_vel`, `/odom`) - example commands below
  - [ ] Browser: Web dashboard (http://localhost:8080)
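Possible commands for the Terminal 3 monitoring window (a sketch; the topic names are taken from the list above and may differ in the actual go2_ros2_sdk setup):

```bash
ros2 topic list | grep -E 'cmd_vel|odom'   # confirm the topics exist
ros2 topic hz /cmd_vel                     # command rate while a mission executes
ros2 topic echo /odom --once               # spot-check current odometry
```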
### Test Sequence

#### Phase 1: Startup Validation (10 min)

- Launch mission agent with phi4:14b
- Verify backend validation passes
- Verify first response time (<5 seconds)
- Check jtop shows model loaded (~7-8GB GPU memory)

#### Phase 2: Text-Only Tests (15 min)

- "Describe what you are"
- "Create a navigation plan to move forward 2 meters"
- "Generate JSON: {action: nav.goto, x: 1.0, y: 0.0}"

Success Criteria:
- All responses coherent and fast (<5s)
- JSON output valid
- No errors in logs
#### Phase 3: Robot Motion Tests (30 min)

- "Move forward 0.5 meters" (conservative first test)
- "Rotate 45 degrees to the left"
- "Move forward 1 meter then stop"
- "Navigate to x=2.0, y=1.0"

Success Criteria:
- Robot executes commands
- Movements approximately correct (±20% tolerance)
- No crashes or safety issues
- Mission agent reports success
#### Phase 4: Performance Validation (15 min)

- Monitor GPU memory during 5 consecutive missions
- Check for memory leaks or degradation
- Time 5 missions, average response time
- Verify speeds stay consistent (>15 tok/s) - see the sketch after this phase

Success Criteria:
- GPU memory stable (<10GB)
- No performance degradation
- Response times consistent (<5s variance)
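For the consistency check, the eval-rate measurement from the degradation section can simply be repeated in a loop. A minimal sketch, assuming `curl` and `jq` and the Thor endpoint; the prompt is arbitrary and only there to exercise the model:

```bash
OLLAMA_HOST="http://192.168.50.10:11434"
for i in 1 2 3 4 5; do
  rate=$(curl -s "$OLLAMA_HOST/api/generate" -d '{
    "model": "phi4:14b",
    "prompt": "Summarize the mission status in one sentence.",
    "stream": false,
    "keep_alive": -1
  }' | jq '.eval_count / .eval_duration * 1e9')
  echo "run $i: ${rate} tok/s"   # should stay above ~15 tok/s on every run
done
```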
### Abort Conditions
- Robot behaves unsafely (stop immediately)
- GPU performance degrades (note pattern, may need to investigate)
- Mission agent crashes repeatedly
- Response times >30 seconds
## 📝 Decision Log

### Model Selection: phi4:14b (Primary)
Date: 2025-10-10
Rationale:
- Best speed/quality tradeoff observed
- Reasonable memory footprint
- Good reasoning capability
- JSON generation capable
- Already available on Thor
Alternatives:
- qwen2.5-coder:32b (backup if more accuracy needed, slower)
- OpenAI fallback (always available)

### Benchmark Status: INCOMPLETE
Date: 2025-10-10
Reason: GPU performance degradation invalidated results
Action: Must retest after fixing degradation issue
Impact: Using phi4:14b based on preliminary observations, not formal benchmark
### jtop Installation: DEFERRED
Date: 2025-10-10
Reason: End of day, scripts complete but not yet run
Action: Install before robot testing tomorrow
Impact: None (scripts ready, just need to execute)
## 📅 Tomorrow's Workflow (2025-10-11)

### Morning Setup (30 min)

- ☕ Coffee first
- Reboot Thor → wait for clean state
- Install jtop → verify GPU monitoring works
- Verify phi4:14b performance in clean state
- Review test plan
### Robot Testing (2 hours)

- Launch mission agent with phi4:14b
- Run Phase 1: Startup validation
- Run Phase 2: Text-only tests
- Run Phase 3: Robot motion tests
- Run Phase 4: Performance validation
- Document results in deployment checklist
### Afternoon Analysis (1 hour)

- Review test results
- Investigate any issues found
- Document decision: PASS → merge OR FAIL → fix
- Update documentation with findings
### If PASS: Merge to dev (30 min)

- Final documentation updates
- Merge feature/local-llm-support → dev
- Tag release
- Celebrate! 🎉
### If FAIL: Debug (variable)

- Investigate GPU degradation issue
- Test alternative models/configurations
- Document blockers
- Plan next steps
## 📚 Need Help? Reference These Docs

- Backend validation issues: `docs/LLM_BACKEND_VALIDATION.md`
- GPU performance problems: `docs/THOR_PERFORMANCE_NOTES.md`
- Benchmark procedures: `docs/OLLAMA_BENCHMARK_RESULTS.md`
- Deployment steps: `docs/OLLAMA_DEPLOYMENT_CHECKLIST.md`
- jtop installation: `scripts/install_jtop_thor.sh` (run with `sudo`)
- Security concerns: `docs/SECURITY_ANALYSIS_JTOP.md`
## 🎯 Success Definition

Minimum criteria for merge:
- ✅ Mission agent starts with Ollama backend
- ✅ phi4:14b generates valid responses
- ✅ Robot executes 3+ navigation commands successfully
- ✅ Performance stable over 10+ missions
- ✅ GPU monitoring working (jtop installed)
- ✅ No critical bugs or safety issues
- ⚠️ GPU degradation issue documented (workaround: reboot)

Nice to have:
- Root cause of GPU degradation identified and fixed
- Benchmark completed with reliable results
- Multiple models tested and compared
## 💭 Final Notes from 2025-10-10 Testing

What went well:
- Ollama infrastructure solid and stable
- Backend validation system working perfectly
- phi4:14b shows promise (fast, capable)
- Documentation comprehensive and helpful
- jtop scripts ready for deployment

What needs work:
- GPU degradation mystery needs solving
- Need accurate memory monitoring (install jtop)
- Benchmark incomplete due to degradation
- Production deployment blocked until degradation understood

Confidence level:
- 🟢 HIGH for phi4:14b basic functionality
- 🟡 MEDIUM for production deployment (need to solve degradation)
- 🟢 HIGH for tools and infrastructure
Overall assessment: Ready for robot testing tomorrow with phi4:14b. If testing goes well, can merge with known issue documented (GPU degradation workaround = reboot). If degradation occurs during robot operation, must investigate further before production use.
Last Updated: 2025-10-10 23:00
Next Update: After robot testing 2025-10-11
Status: READY FOR TOMORROW'S TEST 🚀