# vLLM Quick Start for Thor
**Recommended:** Use NVIDIA's official vLLM container; it is the fastest and most stable option.
## Prerequisites
None! Mistral-7B-Instruct-v0.3 is Apache 2.0 licensed and doesn't require authentication.
~~The previously used Qwen model required HuggingFace authentication~~ (no longer using Qwen). If you later switch to a gated model, authenticate first:

```bash
# On Thor - run once
huggingface-cli login
```

When prompted:

- Get a token from https://huggingface.co/settings/tokens (read access)
- Answer Yes to "Add token as git credential" (it persists across sessions)
- Accept the model's license on its HuggingFace page (for Qwen: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct)
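To confirm the login took effect, query your identity with the same CLI:

```bash
# Prints your HuggingFace username if the token is valid
huggingface-cli whoami
```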
See `docs/vllm_huggingface_auth.md` for troubleshooting.
## Quick Setup (5 minutes)
### On Thor
```bash
cd ~/shadowhound   # or wherever you cloned it
git pull origin feature/local-llm-support
./scripts/setup_vllm_thor.sh
```
That's it! The script will:

1. Pull NVIDIA's vLLM container (~10GB)
2. Start the server with Mistral-7B-Instruct-v0.3 (Apache 2.0, no license restrictions!)
3. Enable native tool calling support
4. Expose an OpenAI-compatible API on port 8000
**Note:** Mistral is fully open source (Apache 2.0) - no gating or authentication!
First run takes longer while downloading the model (~5GB).
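For reference, the script's core is roughly the `docker run` below. This is a sketch, not the script verbatim: the exact image tag and defaults live in `scripts/setup_vllm_thor.sh`; the `vllm serve` flags shown are standard vLLM options.

```bash
# Sketch only - check setup_vllm_thor.sh for the pinned image tag and defaults
docker run --rm --gpus all --name vllm-server \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:latest \
  vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral \
    --gpu-memory-utilization 0.8 \
    --port 8000
```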
### On Your Laptop
Update `.env`:

```bash
AGENT_BACKEND=openai
OPENAI_BASE_URL=http://192.168.10.116:8000/v1
OPENAI_MODEL=mistralai/Mistral-7B-Instruct-v0.3
USE_PLANNING_AGENT=false

# API key (required by DIMOS, use dummy for vLLM)
OPENAI_API_KEY=sk-dummy-key-for-vllm
```
**Note on embeddings:** The agent automatically detects that you're using a local LLM (non-OpenAI base URL) and switches to local embeddings (sentence-transformers). There's no need to set `USE_LOCAL_EMBEDDINGS=true` unless you want to be explicit.
See `docs/vllm_env_example.txt` for the complete configuration.
Rebuild and test:
```bash
cd ~/shadowhound
git pull origin feature/local-llm-support
colcon build --packages-select shadowhound_mission_agent
source install/setup.bash
./start.sh
```
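If the agent can't reach the model, first confirm the server is reachable from the laptop (this uses the Thor address from `.env`; `/v1/models` is the standard OpenAI-compatible model listing that vLLM serves):

```bash
# Should list mistralai/Mistral-7B-Instruct-v0.3
curl http://192.168.10.116:8000/v1/models
```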
## Test It
From the laptop:
```bash
# Simple test
curl -X POST http://192.168.10.116:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
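To exercise tool calling as well, include a `tools` array in the request. The `get_weather` function below is a made-up example for testing, not something the agent defines:

```bash
curl -X POST http://192.168.10.116:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

A working setup returns a structured `tool_calls` array in the response instead of JSON pasted into `content` (the failure mode seen with Qwen).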
Via the web UI:

- Go to http://localhost:8501
- Send: "hi"
- You should get an actual response (not 'GGGGG'!)
## Why Mistral 7B?
- ✅ **Apache 2.0 license** - fully open, no restrictions or gating!
- ✅ **Native tool calling** - built into the model
- ✅ **Officially validated by vLLM** - tested and stable
- ✅ **No authentication required** - just download and run
- ✅ **Smaller than Llama** - faster inference, less memory
- ✅ **Official NVIDIA support** for Thor
- ✅ **OpenAI-compatible API** - seamless integration
From the vLLM docs: Mistral is one of the primary models tested for tool calling and uses the `mistral` parser.
## Stopping the Server
Press Ctrl+C in the terminal running the script.
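If the container keeps running in the background, stop it by name instead (assumes the `vllm-server` container name used by the script):

```bash
docker stop vllm-server
```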
## Troubleshooting
"ValueError: No embedding data received"¶
**Cause:** This shouldn't happen anymore; the agent auto-detects local LLM backends.
**If it does happen:** force local embeddings in `.env`:

```bash
USE_LOCAL_EMBEDDINGS=true
```
Then rebuild: `colcon build --packages-select shadowhound_mission_agent`
### Out of memory
Edit the script and reduce `GPU_MEMORY=0.8` to `0.6` or `0.5`.
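That is a one-line change in `setup_vllm_thor.sh` (variable name from the note above):

```bash
GPU_MEMORY=0.6   # was 0.8; drop to 0.5 if OOM persists
```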
### Model download fails
Check your internet connection and HuggingFace access.
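You can also pre-fetch the model outside the container to isolate network problems; the download lands in the local HuggingFace cache, which the container sketch above mounts in:

```bash
# Downloads to ~/.cache/huggingface by default
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3
```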
### Container won't start

```bash
# Check logs
docker logs vllm-server

# Verify GPU
nvidia-smi

# Free memory
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```
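Once the container is running, vLLM's built-in health endpoint should respond (run on Thor; it returns HTTP 200 after the model finishes loading):

```bash
curl -i http://localhost:8000/health
```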
## Different Models
Other vLLM-validated tool-calling models, if you want to try them:
```bash
# In setup_vllm_thor.sh, change MODEL= and the parser flag:

# Current default (Apache 2.0, no license gating!):
MODEL="mistralai/Mistral-7B-Instruct-v0.3"
--tool-call-parser mistral

# Llama 3.1 (requires Meta license acceptance):
MODEL="meta-llama/Llama-3.1-8B-Instruct"
--tool-call-parser hermes

# IBM Granite (Apache 2.0):
MODEL="ibm-granite/granite-3.0-8b-instruct"
--tool-call-parser hermes
```
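After changing `MODEL`, restart the server so the new model loads (a sketch; assumes the `vllm-server` container name used elsewhere in this guide):

```bash
docker rm -f vllm-server
./scripts/setup_vllm_thor.sh
```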
⚠️ **Models that DON'T work:**

- `Qwen/Qwen2.5-Coder-7B-Instruct` - returns JSON as text, not tool calls
- `NousResearch/Hermes-2-Pro-Llama-3-8B` - CUDA index errors
✅ **Recommended:** stick with Mistral - it's open source and just works!
## Performance
**Expected latency:** 5-10 seconds per command (much faster than Ollama's 12-15s)
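To measure this yourself, wrap the simple test request in `time` (same endpoint and payload as in Test It):

```bash
time curl -s -X POST http://192.168.10.116:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}' \
  -o /dev/null
```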
## Related
- Issue #12: LLM Alternatives
- NVIDIA vLLM Announcement
- `scripts/README.md` - comparison of all setup options
- `docs/vllm_huggingface_auth.md` - authentication troubleshooting