# Thor System Utilities Setup
Priority: Medium
Status: Backlog
Related: Issue #12 - Local LLM Support
## Problem
Thor needs better system monitoring and memory management tools for working with LLMs:
- jtop not installed: can't monitor GPU usage, memory, or thermals in real time
- Manual memory clearing: users report having to clear the memory cache by hand between model loads
- No automated cleanup: switching models requires manual intervention
From NVIDIA forum users:

> "I also note that every time I switch models I have to go into jtop and clear the memory cache manually. Am I missing something?"
## Required Tools

### 1. jtop (jetson-stats)
What: Real-time monitoring for Jetson systems (GPU, CPU, RAM, thermal, power)
Install:
```bash
sudo apt update
sudo apt install python3-pip
sudo pip3 install -U jetson-stats
sudo systemctl restart jtop.service

# Reboot required
sudo reboot

# Then run: jtop
```
Usage:

- Real-time GPU memory monitoring
- Thermal throttling detection
- Power consumption tracking
- Memory cache clearing (via UI)
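A quick post-install sanity check (a sketch; assumes the `jtop.service` name from the install steps above, and `jetson_release`, which ships with jetson-stats):

```bash
# Confirm the jtop service is running after the reboot
sudo systemctl status jtop.service --no-pager

# Print the board / JetPack info that jetson-stats detected
jetson_release

# Launch the interactive monitor (press q to quit)
jtop
```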
### 2. Automated Memory Management
What: Script to clear caches before model loads
Commands:
```bash
# Clear page cache, dentries, and inodes
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Kill any lingering python3 processes still holding memory
# (note: this kills ALL python3 processes; true zombies can't be killed by a signal
#  and only disappear once their parent reaps them)
sudo pkill -9 python3

# Check available memory
free -h
```
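A minimal sketch of the cleanup script called for in the acceptance criteria (`scripts/thor_cleanup_memory.sh`), built from the commands above; the decision to kill python3 and the before/after reporting are assumptions to adjust:

```bash
#!/usr/bin/env bash
# scripts/thor_cleanup_memory.sh - free as much RAM as possible before a model load.
# Sketch only: adjust or remove the python3 kill if it is too blunt for your setup.
set -euo pipefail

echo "Memory before cleanup:"
free -h

# Stop lingering python3 processes that may still hold GPU/CPU memory.
# WARNING: kills ALL python3 processes.
sudo pkill -9 python3 || true

# Flush dirty pages, then drop page cache, dentries, and inodes.
sudo sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

echo "Memory after cleanup:"
free -h
```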
### 3. Model Switching Helper
What: Script to safely switch between models with cleanup
Pseudocode:
```
switch_model.sh <model_name>
  1. Stop current vLLM container
  2. Clear GPU memory
  3. Clear system caches
  4. Wait 5 seconds
  5. Start new model
  6. Verify health
```
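A bash sketch of `scripts/thor_switch_model.sh` (the path named in the acceptance criteria). The container name, image tag, and port are assumptions; adjust them to match the existing vLLM startup script. The health check polls vLLM's OpenAI-compatible server, which exposes a `/health` endpoint:

```bash
#!/usr/bin/env bash
# scripts/thor_switch_model.sh - stop the running vLLM container, clean up, start a new model.
# Sketch only: CONTAINER_NAME, IMAGE, and PORT are placeholders.
set -euo pipefail

MODEL="${1:?usage: thor_switch_model.sh <model_name>}"
CONTAINER_NAME="vllm"
IMAGE="vllm/vllm-openai:latest"   # hypothetical image tag - use whatever Thor already runs
PORT=8000

# 1. Stop the current vLLM container (ignore errors if nothing is running)
docker stop "$CONTAINER_NAME" 2>/dev/null || true
docker rm "$CONTAINER_NAME" 2>/dev/null || true

# 2-3. Clear leftover processes and system caches
./scripts/thor_cleanup_memory.sh

# 4. Give the driver a moment to release memory
sleep 5

# 5. Start the new model
docker run -d --name "$CONTAINER_NAME" --runtime nvidia --network host \
  "$IMAGE" --model "$MODEL" --port "$PORT"

# 6. Verify health (poll /health until the server answers or we give up)
for _ in $(seq 1 60); do
  if curl -sf "http://localhost:${PORT}/health" > /dev/null; then
    echo "Model ${MODEL} is up on port ${PORT}"
    exit 0
  fi
  sleep 5
done

echo "Model ${MODEL} failed to become healthy" >&2
exit 1
```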
## Acceptance Criteria
- [ ] jtop installed and working on Thor
- [ ] Document jtop usage in Thor setup guide
- [ ] Create memory cleanup script (scripts/thor_cleanup_memory.sh)
- [ ] Create model switching script (scripts/thor_switch_model.sh)
- [ ] Add cleanup to vLLM startup script (automatic)
- [ ] Test: Switch between 3 different models without manual intervention
- [ ] Document when/why manual cleanup might still be needed
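For the three-model switching test, a throwaway loop, assuming the switch script above; the model names are placeholders, not a recommendation:

```bash
# Placeholder model IDs - substitute whatever is actually staged on Thor
for model in meta-llama/Llama-3.1-8B-Instruct Qwen/Qwen2.5-7B-Instruct mistralai/Mistral-7B-Instruct-v0.3; do
  ./scripts/thor_switch_model.sh "$model" || { echo "switch to $model failed"; exit 1; }
  free -h   # record memory after each switch for the write-up
done
```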
## Nice to Have
- Telegraf + Grafana dashboard for Thor metrics
- Automated alerts on OOM conditions
- Model memory requirements database
- Pre-flight checks before model load
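For the pre-flight check idea, a rough sketch, assuming the caller passes the model's estimated memory need in GiB; on Jetson-class boards CPU and GPU share the same RAM, so free system memory is a reasonable proxy for what the GPU can use:

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: refuse to load a model if available RAM is below its estimated need.
REQUIRED_GIB="${1:?usage: preflight.sh <required_gib>}"
AVAILABLE_GIB=$(free -g | awk '/^Mem:/ {print $7}')   # "available" column, in GiB

if [ "$AVAILABLE_GIB" -lt "$REQUIRED_GIB" ]; then
  echo "Only ${AVAILABLE_GIB} GiB available, need ${REQUIRED_GIB} GiB - run cleanup first" >&2
  exit 1
fi
echo "Pre-flight OK: ${AVAILABLE_GIB} GiB available"
```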
## Timeline
Phase 1 (Quick Fix): Add memory cleanup to existing vLLM script
Phase 2 (Full Solution): Install jtop, create helper scripts
Phase 3 (Advanced): Monitoring dashboard
## Notes
Current vLLM script already does:
```bash
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
```

This may not be enough between model switches; a more aggressive cleanup could still be needed (one possible sequence is sketched below).
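One possible shape for that more aggressive cleanup (a sketch; `/proc/sys/vm/compact_memory` only exists when the kernel is built with memory compaction, so it is guarded):

```bash
# Possible additions to the vLLM startup script - all optional, verify on Thor first.

# Stop anything still holding memory from the previous model (kills ALL python3 processes)
sudo pkill -9 python3 || true

# Flush dirty pages and drop page cache, dentries, and inodes
sudo sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

# Ask the kernel to defragment physical memory (only if compaction is available)
if [ -e /proc/sys/vm/compact_memory ]; then
  echo 1 | sudo tee /proc/sys/vm/compact_memory > /dev/null
fi

free -h
```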
Priority: Do this AFTER we validate vLLM works end-to-end. Don't want to debug jtop installation issues when testing LLM functionality.