
Ollama Deployment Test Checklist

Purpose: Validate Ollama local LLM backend works correctly with ShadowHound robot before merging to dev/main.

Date Created: 2025-10-10
Branch: feature/local-llm-support
Target System: Thor + GO2 Robot


Pre-Deployment Validation

✅ Prerequisites (Updated 2025-10-10 EOD)

  • [x] Ollama container running on Thor (Jetson-optimized)
  • [x] phi4:14b model pulled (PRIMARY for testing, ~7.7GB)
  • [x] qwen2.5-coder:32b model pulled (BACKUP, 98/100 quality, slower)
  • [x] Models available and working (basic validation done)
  • [x] Backend validation system implemented (two-layer)
  • [x] Container memory stable
  • [x] Benchmark infrastructure complete
  • [ ] ⚠️ jtop NOT YET INSTALLED - Need GPU monitoring before robot test
  • [ ] ⚠️ GPU degradation issue unresolved - Workaround: reboot Thor before testing
  • [ ] ⚠️ Benchmark incomplete - GPU degradation invalidated results (~5 tok/s post-test)

STATUS: Ready for robot testing with phi4:14b after Thor reboot and jtop installation


Test Plan: End-to-End Robot Integration

⚠️ IMPORTANT PRE-TEST STEPS (Added 2025-10-10):

  1. Reboot Thor for clean GPU state

     # On Thor
     sudo reboot
     # Wait for system to come back up (~2 min)

  2. Install jtop for GPU monitoring

     cd ~/shadowhound
     git pull origin feature/local-llm-support
     sudo ./scripts/install_jtop_thor.sh
     # Verify: systemctl status jtop.service
     # Verify: sudo jtop (should show GPU memory, not N/A)

  3. Verify phi4:14b baseline performance (see the sketch after this list)

     docker exec ollama ollama run --verbose phi4:14b "Count to 10"
     # Expected: 15-20 tok/s (eval rate)
     # If much slower, investigate GPU degradation

  4. Setup monitoring terminals:
     • Terminal 1: sudo jtop (watch GPU memory/utilization)
     • Terminal 2: Mission agent logs
     • Terminal 3: ros2 topic echo /cmd_vel (robot commands)
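For step 3, a quick repeatability check helps distinguish a one-off slow model load from genuine GPU degradation. A minimal sketch, assuming Ollama's --verbose timing stats (including the "eval rate" line) are emitted alongside the response:

# Run the baseline prompt three times and extract the eval rate each time
for i in 1 2 3; do
  docker exec ollama ollama run --verbose phi4:14b "Count to 10" 2>&1 | grep -E "^eval rate"
done
# All three runs should report roughly 15-20 tok/s; a downward trend suggests degradation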

Phase 1: Mission Agent Startup (15 min)

Test 1.1: Launch Mission Agent with Ollama Backend

On laptop (in devcontainer):

# Launch mission agent with phi4:14b (PRIMARY TEST MODEL)
ros2 launch shadowhound_mission_agent mission_agent.launch.py \
    agent_backend:=ollama \
    ollama_base_url:=http://192.168.50.10:11434 \
    ollama_model:=phi4:14b \
    web_host:=0.0.0.0 \
    web_port:=8080

# Alternative: qwen2.5-coder:32b (if phi4 has issues)
# ollama_model:=qwen2.5-coder:32b

Expected Output:

[INFO] [mission_agent]: Starting Mission Agent with Ollama backend
[INFO] [mission_agent]: Ollama URL: http://192.168.50.10:11434
[INFO] [mission_agent]: Model: phi4:14b
============================================================
🔍 VALIDATING LLM BACKEND CONNECTION
============================================================
Testing ollama backend...
  URL: http://192.168.50.10:11434
  Model: phi4:14b
  Checking Ollama service...
  ✅ Ollama service responding
  ✅ Model 'phi4:14b' available
  Sending test prompt...
  ✅ Test prompt succeeded (response: 'OK')
============================================================
✅ Ollama backend validation PASSED
============================================================
[INFO] [mission_agent]: MissionExecutor ready!
[INFO] [mission_agent]: Web dashboard: http://0.0.0.0:8080
[INFO] [mission_agent]: Ready to accept missions

New Feature: The mission agent now automatically validates the LLM backend connection on startup. If validation fails, the node exits immediately with a clear error message. See docs/LLM_BACKEND_VALIDATION.md for details.

⚠️ Watch jtop Terminal: Should see GPU memory jump to ~7-8GB when phi4:14b loads on first request

Verify:

  • [ ] Mission agent starts without errors
  • [ ] Backend validation PASSED (automatic on startup)
  • [ ] Ollama connection successful
  • [ ] Model loads correctly (watch jtop: ~7-8GB GPU memory for phi4:14b)
  • [ ] First response <5 seconds (phi4 is fast: 15-20 tok/s expected)
  • [ ] No timeout errors
  • [ ] jtop shows GPU memory stable (not increasing rapidly)

Troubleshooting:

  • If validation fails: the error message identifies the specific issue (service unreachable, model not found, etc.)
  • If connection fails: check that OLLAMA_BASE_URL matches the Thor IP
  • If model not found: verify the model was pulled on Thor (docker exec ollama ollama list)
  • If timeout: check that the Thor firewall allows port 11434
  • Detailed troubleshooting: docs/LLM_BACKEND_VALIDATION.md
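When troubleshooting, roughly the same two-layer checks the validator performs can be reproduced by hand against Ollama's standard HTTP API (GET /api/tags to list models, POST /api/generate for a test prompt); adjust the IP if your Thor differs:

# Layer 1: is the Ollama service responding, and is phi4:14b pulled?
curl -sf http://192.168.50.10:11434/api/tags | grep phi4 || echo "service down or model missing"

# Layer 2: can the model actually answer a test prompt?
curl -sf http://192.168.50.10:11434/api/generate \
    -d '{"model": "phi4:14b", "prompt": "Reply with OK", "stream": false}'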


Test 1.2: Web Dashboard Access

On laptop browser: Navigate to http://localhost:8080

Verify:

  • [ ] Dashboard loads successfully
  • [ ] Backend indicator shows "OLLAMA" (not "OPENAI")
  • [ ] Model name displays: "phi4:14b" (or qwen2.5-coder:32b if using backup)
  • [ ] Connection status: GREEN
  • [ ] No JavaScript console errors

Screenshot: Take screenshot of dashboard showing Ollama backend active


Phase 2: Simple Mission Tests (30 min)

Test 2.1: Text Response (No Robot Motion)

Mission: "Describe what you are"

Expected LLM Response:

  • Should identify as robot assistant
  • Concise response (phi4 is direct, not overly verbose)
  • Fast response (<3 seconds for phi4:14b at the expected 15-20 tok/s)

⚠️ Watch for GPU Degradation: If response is slow (>10s), check jtop for issues

Verify:

  • [ ] Response received within 10 seconds
  • [ ] Response is coherent and relevant
  • [ ] No errors in mission agent logs
  • [ ] Dashboard shows response correctly

Logs to Check:

# In separate terminal
ros2 topic echo /shadowhound/mission/status

Test 2.2: Navigation Plan Generation (JSON Output)

Mission: "Create a navigation plan to explore 3 meters forward, then rotate right"

Expected LLM Response (JSON structure):

{
  "steps": [
    {"action": "nav.goto", "params": {"x": 3.0, "y": 0.0}},
    {"action": "nav.rotate", "params": {"yaw": -1.57}}
  ]
}

Verify:

  • [ ] JSON structure is valid (scripted check below)
  • [ ] Skills are correctly identified (nav.goto, nav.rotate)
  • [ ] Parameters have reasonable values
  • [ ] phi4:14b produces valid JSON (good structured output capability)
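A scripted version of the first two checks, sketched with jq; the nav.*/perception.* action namespaces are inferred from the skills used in this checklist, not a confirmed registry:

# Paste the generated plan here; jq -e exits non-zero if the JSON is invalid,
# if steps is empty, or if any step lacks a recognized action or a params object
PLAN='{"steps": [{"action": "nav.goto", "params": {"x": 3.0, "y": 0.0}}, {"action": "nav.rotate", "params": {"yaw": -1.57}}]}'
echo "$PLAN" | jq -e '
  .steps | length > 0 and
  all(.[]; (.action | test("^(nav|perception)\\.")) and (.params | type == "object"))
'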

Compare: Try same mission with OpenAI backend later to validate quality difference


Test 2.3: Complex Reasoning

Mission: "If the robot is 0.6m wide and needs to pass through a 0.8m doorway with an obstacle 0.3m to the left, should it go right or left?"

Expected Response:

  • Logical reasoning explaining the choice
  • Correct answer: "Go RIGHT" (0.4m clearance vs 0.1m)
  • Clear explanation

Verify:

  • [ ] Correct reasoning
  • [ ] Not overly verbose (phi4 should be concise)
  • [ ] Fast response (<5 seconds)


Phase 3: Robot Hardware Integration (45 min)

Prerequisites:

  • GO2 robot powered on
  • ROS2 bridge active (go2_ros2_sdk running on Thor)
  • Nav2 stack running
  • Map loaded

Test 3.1: Simple Navigation Command

Mission: "Move forward 1 meter"

Expected Behavior:

  1. LLM generates navigation plan: {"steps": [{"action": "nav.goto", "params": {"x": 1.0, "y": 0.0}}]}
  2. Mission agent executes skill: nav.goto
  3. Robot moves forward ~1 meter
  4. Success reported back to dashboard

Verify:

  • [ ] LLM generates correct JSON plan
  • [ ] Skill execution starts (check /cmd_vel topic)
  • [ ] Robot physically moves
  • [ ] Distance approximately correct (±0.2m tolerance; see the recording sketch below)
  • [ ] Mission completes successfully
  • [ ] Dashboard shows "SUCCESS" status
  • [ ] jtop shows stable GPU memory (no sudden spikes or leaks)

Logs to Check:

# Monitor velocity commands
ros2 topic echo /cmd_vel

# Monitor skill execution
ros2 topic echo /shadowhound/skill/status
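To check the ±0.2m tolerance after the run rather than by eye, it may help to record the relevant topics to a bag for offline review. A minimal sketch; /odom is an assumed odometry topic name, substitute whatever the go2_ros2_sdk bridge actually publishes:

# Start before issuing the mission; stop with Ctrl+C after it completes
ros2 bag record -o nav_test_1m /cmd_vel /odom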

Test 3.2: Multi-Step Mission

Mission: "Rotate 90 degrees left, move forward 2 meters, then rotate back to original heading"

Expected Plan:

{
  "steps": [
    {"action": "nav.rotate", "params": {"yaw": 1.57}},
    {"action": "nav.goto", "params": {"x": 2.0, "y": 0.0}},
    {"action": "nav.rotate", "params": {"yaw": -1.57}}
  ]
}

Verify:

  • [ ] LLM generates multi-step plan
  • [ ] Each step executes in sequence
  • [ ] Robot completes full mission
  • [ ] Final heading approximately correct (±15° tolerance)
  • [ ] No skill execution failures


Test 3.3: Perception Integration

Mission: "Take a photo"

Expected Plan:

{
  "steps": [
    {"action": "perception.snapshot", "params": {}}
  ]
}

Verify:

  • [ ] LLM identifies perception skill
  • [ ] Camera image captured
  • [ ] Image displayed in dashboard (if implemented)
  • [ ] Mission reports success


Test 3.4: Error Handling

Mission: "Move to an unreachable location behind a wall"

Expected Behavior:

  • LLM generates plan (may not know about the wall)
  • Navigation skill attempts execution
  • Nav2 reports failure (obstacle/timeout)
  • Mission agent reports error to LLM
  • LLM suggests alternative or acknowledges failure

Verify:

  • [ ] System doesn't crash on navigation failure
  • [ ] Error properly reported to user
  • [ ] LLM provides helpful error message
  • [ ] System ready for next mission


Phase 4: Performance Validation (30 min)

Test 4.1: Response Time Benchmarks

Execute 10 simple missions and measure:

Metrics:

  • Time to first token (TTFT)
  • Total completion time
  • End-to-end mission time (LLM + skill execution)

Target Performance (based on preliminary observations):

  • TTFT: <2 seconds (phi4 is fast)
  • Simple text response: <3 seconds
  • JSON navigation plan: <5 seconds
  • Quality: good reasoning and JSON generation

⚠️ Known Issue: GPU degradation may occur after multiple model unload/reload cycles. If speeds drop significantly, note in results and plan investigation.

Record Results:

# Use benchmark script
cd scripts
./benchmark_ollama_models.sh  # Re-run with production config

# Or manual timing
time echo "Mission: move forward 1m" | ros2 topic pub --once /shadowhound/mission ...
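For a scripted measurement that bypasses ROS entirely, a sketch against Ollama's /api/generate endpoint; the non-streaming response includes eval_count, eval_duration, and total_duration (durations in nanoseconds):

# Time a single completion and compute tokens per second
RESP=$(curl -s http://192.168.50.10:11434/api/generate \
    -d '{"model": "phi4:14b", "prompt": "Count to 10", "stream": false}')
echo "$RESP" | jq '{tok_per_s: (.eval_count / (.eval_duration / 1e9)), total_s: (.total_duration / 1e9)}'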

Verify:

  • [ ] Performance meets benchmarked expectations
  • [ ] No degradation from Thor system load
  • [ ] Consistent response times (low variance)


Test 4.2: Memory Stability

Run 20+ missions consecutively and monitor:

On Thor:

# Monitor container memory (watch for GPU memory specifically with jtop)
watch -n 10 'docker stats ollama --no-stream'

# Also monitor GPU memory with jtop
sudo jtop  # Watch GPU section
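To keep a record across the 20+ mission soak rather than watching live, a minimal logging loop (container stats only; GPU memory still needs jtop):

# Append container memory/CPU to a log every 30s; stop with Ctrl+C
while true; do
  echo "$(date -Is) $(docker stats ollama --no-stream --format '{{.MemUsage}} {{.CPUPerc}}')" >> ollama_soak.log
  sleep 30
done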

Verify:

  • [ ] GPU memory usage stable (~7-8GB for phi4:14b)
  • [ ] No memory leaks (no gradual increase in GPU VRAM)
  • [ ] No OOM errors
  • [ ] Model stays loaded between missions (jtop GPU memory stays elevated)
  • [ ] No performance degradation (if speeds drop >20%, investigate)

Expected Memory:

  • Initial: ~500MB (empty container)
  • After first mission: ~7-8GB GPU VRAM (model loaded)
  • After 20 missions: still ~7-8GB GPU VRAM (stable)
  • Container RAM: ~2-3GB (stable)


Test 4.3: Concurrent Operation

With robot operating:

  • Launch Nav2 stack
  • Launch mission agent with Ollama
  • Execute navigation mission
  • Monitor system resources on Thor

On Thor:

# Check CPU, memory, GPU
htop
docker stats
sudo jtop   # preferred over nvidia-smi for GPU stats on Jetson

Verify:

  • [ ] Thor CPU usage <80%
  • [ ] Memory usage comfortable (<100GB of 128GB)
  • [ ] No resource contention
  • [ ] All systems responsive


Phase 5: Backup Model Testing (15 min)

Test 5.1: Fallback to qwen2.5-coder:32b

Reconfigure mission agent to use backup model:

ros2 launch shadowhound_mission_agent mission_agent.launch.py \
    agent_backend:=ollama \
    ollama_model:=qwen2.5-coder:32b

Run same test missions:

  1. Simple text response
  2. Navigation plan generation
  3. Multi-step mission

Verify:

  • [ ] qwen2.5-coder:32b works correctly
  • [ ] Slower responses than phi4:14b (expected tradeoff for the larger model)
  • [ ] Quality still high (98/100 benchmark)
  • [ ] Valid backup option if phi4:14b has issues


Phase 6: Comparison with OpenAI (15 min)

Optional but Recommended: Compare with cloud backend

Test 6.1: OpenAI Backend Baseline

Launch with OpenAI:

ros2 launch shadowhound_mission_agent mission_agent.launch.py \
    agent_backend:=openai \
    openai_model:=gpt-4-turbo

Run same test missions and compare:

| Metric            | qwen2.5-coder:32b | phi4:14b | gpt-4-turbo |
|-------------------|-------------------|----------|-------------|
| Response Time     | ~5s               | ~2s      | ~3s         |
| JSON Quality      |                   |          |             |
| Reasoning Quality |                   |          |             |
| Cost per Mission  | $0                | $0       | ~$0.03      |
| Privacy           | Local             | Local    | Cloud       |

Verify:

  • [ ] Ollama quality competitive with OpenAI
  • [ ] Local response times acceptable
  • [ ] No degradation in mission success rate


Post-Testing: Documentation

Test Results Summary

Date Tested: _____
Tested By: _____
System: Thor + GO2 Robot
Branch: feature/local-llm-support

Overall Results

  • [ ] PASS: All critical tests passed
  • [ ] PASS WITH ISSUES: Some non-critical failures (document below)
  • [ ] FAIL: Critical issues blocking deployment

Performance Summary

| Metric               | Target  | Actual | Status |
|----------------------|---------|--------|--------|
| Mission Success Rate | >95%    | ___%   |        |
| Avg Response Time    | <5s     | ___s   |        |
| Memory Stability     | Stable  |        |        |
| JSON Quality         | >90/100 |        |        |

Issues Encountered

  1. Issue: _____
     Severity: Critical / High / Medium / Low
     Workaround: _____
     Resolution: _____

  2. (Add more as needed)

Recommendations

  • [ ] APPROVED FOR MERGE: All tests passed, ready for dev branch
  • [ ] CONDITIONAL APPROVAL: Minor issues, document and merge
  • [ ] NOT READY: Critical issues must be fixed first

Deployment Checklist (Pre-Merge)

Code Quality

  • [ ] All commits have descriptive messages
  • [ ] No debug code or commented-out sections
  • [ ] Configuration files updated with production values
  • [ ] Documentation complete and accurate

Testing

  • [ ] All phases completed successfully
  • [ ] Test results documented above
  • [ ] Edge cases considered (failures, timeouts, etc.)
  • [ ] Performance meets expectations

Documentation

  • [ ] README.md updated with Ollama instructions
  • [ ] Configuration examples provided
  • [ ] Troubleshooting guide complete
  • [ ] Benchmark results documented

Integration

  • [ ] No breaking changes to existing code
  • [ ] Backwards compatible (OpenAI backend still works)
  • [ ] Launch files updated
  • [ ] Dependencies documented

Merge Process

1. Final Review

# On laptop, review all changes
cd /workspaces/shadowhound
git diff dev...feature/local-llm-support

# Check for unintended changes
git status

2. Update Documentation

  • [ ] Update main README.md with Ollama setup
  • [ ] Add deployment notes
  • [ ] Update CHANGELOG.md (if exists)

3. Merge to dev

# Ensure feature branch is up to date
git checkout feature/local-llm-support
git pull

# Switch to dev and merge
git checkout dev
git pull
git merge feature/local-llm-support

# Resolve any conflicts
# Test one more time on dev branch
# Push to remote
git push

4. Verify dev Branch

  • [ ] CI/CD passes (if configured)
  • [ ] Quick smoke test on dev branch
  • [ ] No unexpected issues

5. Merge to main (Production)

Only after dev branch validated:

git checkout main
git pull
git merge dev
git tag -a v1.1.0 -m "Add Ollama local LLM support (phi4:14b primary, qwen2.5-coder:32b backup)"
git push
git push --tags

Rollback Plan

If issues discovered after merge:

Quick Rollback

# Revert merge commit
git revert -m 1 <merge-commit-hash>
git push

# Or reset to before merge (if not pushed)
git reset --hard HEAD~1

Fallback Configuration

# Temporarily switch back to OpenAI
ros2 launch shadowhound_mission_agent mission_agent.launch.py \
    agent_backend:=openai

Success Criteria

Minimum Requirements for Merge:

  1. Mission agent starts successfully with Ollama backend
  2. At least 3 simple missions execute correctly
  3. At least 1 hardware navigation mission succeeds
  4. No memory leaks or stability issues
  5. Performance meets benchmarked expectations
  6. Backup model (qwen2.5-coder:32b) works as fallback
  7. All documentation complete


Notes & Observations

(Use this section during testing to record observations, unexpected behavior, or insights)


Last Updated: 2025-10-10
Next Review: After robot testing complete