vLLM Mistral Tokenizer Hang¶
Issue¶
When running vLLM with Mistral-7B-Instruct-v0.3, API requests hang indefinitely with these symptoms:
# curl hangs: request body fully sent, nothing received after ~2 minutes
curl -X POST http://192.168.10.116:8000/v1/chat/completions ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   329    0     0  100   329      0      3  0:01:49  0:01:48  0:00:01     0
# vLLM logs show warnings:
(APIServer pid=1) INFO: Non-Mistral tokenizer detected when using a Mistral model...
(APIServer pid=1) INFO: Engine 000: Avg prompt throughput: 5.2 tokens/s,
Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs,
GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Key Indicators:
1. Tokenizer warning: "Non-Mistral tokenizer detected"
2. Very low throughput: 5.2 tokens/s prompt, 0.1 tokens/s generation
3. Zero GPU KV cache usage: 0.0% (should be >0% while a request is being processed)
4. Request stuck: shown as Running but not progressing
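If a request is stuck, a quick probe helps distinguish a hung engine from a dead API server. This sketch assumes the vLLM OpenAI-compatible server at the address used above, which normally exposes /health and /v1/models:
# API server answers but completions stall -> engine-level problem, not networking
curl -sf http://192.168.10.116:8000/health && echo "API server is up"
curl -s http://192.168.10.116:8000/v1/models | jq -r '.data[].id'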
Root Causes¶
1. Tokenizer Mismatch¶
vLLM fails to automatically detect Mistral's correct tokenizer, falling back to a generic one that doesn't work properly with tool calling.
Solution: Explicitly specify the tokenizer with the --tokenizer flag:
vllm serve "mistralai/Mistral-7B-Instruct-v0.3" \
    --tokenizer "mistralai/Mistral-7B-Instruct-v0.3" \
    --enable-auto-tool-choice \
    --tool-call-parser mistral
2. FLASHINFER Backend Incompatibility¶
The VLLM_ATTENTION_BACKEND=FLASHINFER environment variable causes issues on Jetson AGX Orin with certain models.
Solution: Remove the FLASHINFER backend specification and let vLLM auto-select the backend:
# Don't set this:
# -e VLLM_ATTENTION_BACKEND=FLASHINFER
# Let vLLM choose the best backend automatically
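After a restart, the startup log should report which attention backend was actually selected; the exact wording varies between vLLM versions, so this grep is a best-effort check:
# Report the attention backend chosen at startup (wording varies by version)
docker logs vllm-server 2>&1 | grep -i "backend"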
Fixes Applied¶
Updated scripts/setup_vllm_thor.sh¶
Change 1: Add explicit tokenizer
vllm serve "${MODEL}" \
--port 8000 \
--host 0.0.0.0 \
--trust-remote-code \
+ --tokenizer "${MODEL}" \
--max-model-len ${MAX_MODEL_LEN} \
Change 2: Remove FLASHINFER backend
docker run --rm -it --network host \
--name "${CONTAINER_NAME}" \
--shm-size=16g \
--runtime=nvidia \
--gpus all \
- -e VLLM_ATTENTION_BACKEND=FLASHINFER \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
Testing¶
1. Restart vLLM with fixes¶
# On Thor
cd ~/shadowhound
git pull origin feature/local-llm-support
# Stop old container
docker stop vllm-server 2>/dev/null || true
# Restart with fixes
./scripts/setup_vllm_thor.sh
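Weight loading can take a few minutes on first start. A small wait loop avoids testing too early; it assumes the server's /health endpoint and the Thor address used throughout this doc:
# Poll until the API server answers (model weights can take a few minutes to load)
until curl -sf http://192.168.10.116:8000/health > /dev/null; do
    echo "Waiting for vLLM to come up..."
    sleep 10
done
echo "vLLM is ready"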
2. Verify tokenizer detection¶
Look for these in startup logs:
# Should NOT see:
INFO: Non-Mistral tokenizer detected when using a Mistral model...
# Should see proper model loading:
INFO: Loading model weights...
INFO: Model loaded successfully
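Rather than scanning the whole log by eye, grep for the warning directly; no output is the healthy result:
# No output means the Mistral tokenizer was picked up correctly
docker logs vllm-server 2>&1 | grep -i "Non-Mistral tokenizer"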
3. Test basic completion (no tool calling)¶
curl -X POST http://192.168.10.116:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 20
}'
Expected:
- Response within ~5 seconds
- Actual text completion
- No "Non-Mistral tokenizer" warning
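To put a number on "within ~5 seconds", the same request can be timed with curl's standard --write-out option:
# Same request, but print status code and total time instead of the body
curl -sS -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  -X POST http://192.168.10.116:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 20}'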
4. Test tool calling¶
curl -X POST http://192.168.10.116:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Move forward 1 meter"}],
"tools": [{
"type": "function",
"function": {
"name": "Move",
"description": "Move the robot",
"parameters": {
"type": "object",
"properties": {
"x": {"type": "number", "description": "Distance in meters"}
},
"required": ["x"]
}
}
}],
"tool_choice": "auto"
}' | jq '.choices[0].message'
Expected:
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "...",
"type": "function",
"function": {
"name": "Move",
"arguments": "{\"x\": 1.0}"
}
}
]
}
NOT:
{
"content": "```json\n{\"name\": \"Move\", \"arguments\": {\"x\": 1.0}}\n```",
"tool_calls": [] // ❌ Empty!
}
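The same pass/fail distinction can be scripted with jq. This sketch assumes the request body from the curl above has been saved to a file, here hypothetically named tool_call_request.json:
# Exit code 0 only if the model returned a non-empty tool_calls array
curl -s -X POST http://192.168.10.116:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d @tool_call_request.json \
  | jq -e '.choices[0].message.tool_calls | length > 0' > /dev/null \
  && echo "Tool calling works" \
  || echo "No tool_calls returned (model fell back to plain text)"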
Performance Expectations¶
With fixes applied, you should see:
# Good performance indicators:
INFO: Engine 000:
Avg prompt throughput: 50-100 tokens/s (was 5.2!)
Avg generation throughput: 20-40 tokens/s (was 0.1!)
GPU KV cache usage: 5-15% (was 0.0%!)
Typical response times:
- Basic chat: 0.5-2 seconds
- Tool calling: 1-3 seconds
- Complex multi-tool: 3-5 seconds
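These figures come from vLLM's periodic engine stats line; to watch them live while running test requests (log wording may differ slightly between versions):
# Follow only the periodic throughput / KV-cache stats lines
docker logs -f vllm-server 2>&1 | grep --line-buffered "Avg prompt throughput"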
Alternative Backends (If Issues Persist)¶
If problems persist, pin a specific attention backend explicitly via the same VLLM_ATTENTION_BACKEND environment variable (in this Docker setup, pass it to the container with -e):
Option 1: xFormers (Most Compatible)¶
VLLM_ATTENTION_BACKEND=XFORMERS \
vllm serve "${MODEL}" \
    --tokenizer "${MODEL}" \
    ...
Option 2: PyTorch Native¶
VLLM_ATTENTION_BACKEND=TORCH_SDPA \
vllm serve "${MODEL}" \
    --tokenizer "${MODEL}" \
    ...
Other Potential Issues¶
Model Download Incomplete¶
If the first run was interrupted, the model download may be incomplete or corrupted:
# Clear cache and re-download
rm -rf ~/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3
./scripts/setup_vllm_thor.sh
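As a rough sanity check after re-downloading, the cached snapshot should contain the full set of safetensors shards and total roughly 14-15 GB for this model (approximate figure):
# Total size and individual weight shards in the Hugging Face cache
du -sh ~/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3
find ~/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3 -name "*.safetensors" -exec ls -lh {} \;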
Memory Fragmentation¶
If Thor has been running models for a while:
# Restart to clear GPU memory completely
sudo reboot
Docker Shared Memory Too Small¶
Increase --shm-size if you see out-of-memory errors:
# In setup_vllm_thor.sh, change:
--shm-size=16g # Increase to 32g if needed
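Before bumping the value, it is worth checking whether shared memory is actually under pressure inside the running container:
# /dev/shm size should match --shm-size; a high Use% suggests it needs to grow
docker exec vllm-server df -h /dev/shm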
Monitoring¶
Watch vLLM logs for health indicators:
docker logs -f vllm-server 2>&1 | grep -E "(INFO|WARNING|ERROR)"
Healthy signs:
- No tokenizer warnings
- GPU KV cache usage >0%
- Throughput >20 tokens/s
- Requests complete in <5 seconds
Unhealthy signs:
- "Non-Mistral tokenizer detected" warnings
- GPU KV cache usage stuck at 0.0%
- Throughput <10 tokens/s
- Requests time out
References¶
- vLLM Tool Calling Docs: https://docs.vllm.ai/en/stable/features/tool_calling.html
- Mistral Model Card: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
- vLLM Attention Backends: https://docs.vllm.ai/en/stable/design/attention.html
- Issue: vllm_tool_calling_not_executing
Status¶
Current State: Fixes applied, awaiting testing
Next Steps:
1. ✅ Added --tokenizer flag to setup script
2. ✅ Removed FLASHINFER backend
3. ⏳ User needs to restart vLLM on Thor
4. ⏳ Test basic completion (should be fast now!)
5. ⏳ Test tool calling (should return tool_calls array!)
6. ⏳ Test robot control (should actually move!)
Expected Outcome: With the explicit tokenizer and the default attention backend, Mistral should load correctly and serve requests at normal speed, and tool calling should return properly formatted function calls within 1-3 seconds.