# Start Script Issues and Improvements

**Date:** October 12, 2025
**File:** `start.sh`
**Status:** Multiple issues identified
**Priority:** Medium to High
## Issue 1: `--agent-only` Still Requires Topics (CRITICAL)

**Severity:** HIGH
**Lines:** 1248-1252

### Problem
```bash
# Agent-only mode: skip driver and verification
if [ "$AGENT_ONLY" = true ]; then
    print_info "Agent-only mode: Skipping driver launch and verification"
    echo ""
    launch_mission_agent
    return $?
fi
```
What happens:

- Skips driver launch ✅
- Skips verification ✅
- But the agent STILL subscribes to topics and hangs! ❌
Root cause: The `--agent-only` flag only affects the start script, not the agent itself. The agent (via DIMOS `UnitreeROSControl`) unconditionally subscribes to:

- `camera/compressed` (or `camera/image_raw` if `use_raw=True`)
- `go2_states`
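One quick way to confirm this from a running (hung) agent is to dump the node's subscription list. The node name below is an assumption - check `ros2 node list` first:

```bash
# Find the agent node, then inspect what it subscribes to
ros2 node list
ros2 node info /mission_agent   # assumed node name; look under "Subscribers:"
```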
### Why This Is Hard
Mock mode is not just about skipping topics - it requires proper interface mocking:

- ROS Control Layer (`UnitreeROSControl`):
    - Subscribes to topics in `__init__`
    - Needs mock callbacks that simulate robot state
    - Needs mock action clients for navigation skills
- Skills Layer (DIMOS):
    - Skills like `Move`, `SpinLeft` expect real robot responses
    - Need to simulate: pose updates, velocity feedback, completion signals
    - Need to track: virtual position, orientation, timing
- Memory Layer (ChromaDB + embeddings):
    - This part already works! ✅ (we just fixed it)
    - Can test in isolation with `test_embeddings_fix.py`
### Quick Workaround (Not Full Solution)

Add the `disable_video_stream=True` flag to the DIMOS initialization:
```python
# In mission_agent.py
self.control = UnitreeROSControl(
    disable_video_stream=True,  # Skip camera subscription
    mock_connection=True,       # Skip action clients
)
```
But this only helps with the camera - `go2_states` is still required!
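For bench testing, the remaining `go2_states` dependency could be satisfied by hand-publishing empty state messages. The `go2_interfaces/msg/Go2State` type is an assumption based on the Go2 driver stack - verify it with `ros2 topic info /go2_states` against a real setup first:

```bash
# Publish empty state messages at 10 Hz so the agent's go2_states
# subscription receives data (message type is assumed, not confirmed)
ros2 topic pub --rate 10 /go2_states go2_interfaces/msg/Go2State "{}"
```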
### Proper Solution (Separate Issue)

Mock mode needs its own design and implementation. Recommended:

1. Create: `docs/issues/mock_mode_architecture.md` - Design document
2. Implement: Mock providers for all robot interfaces
3. Test: Comprehensive mock mode test suite
For now, testing requires either:

- Full hardware setup (`./start.sh --prod`)
- External driver running (`./start.sh --skip-driver`)
- Standalone scripts like `test_embeddings_fix.py` (for specific components)

Related: See `docs/issues/mock_mode_ros_topic_dependency.md`
## Issue 2: Mock Mode Still Launches Driver

**Severity:** MEDIUM
**Lines:** 1258-1267

### Problem
```bash
# Stage 1: Launch robot driver (unless skipped)
if [ "$SKIP_DRIVER" != true ]; then
    if ! launch_robot_driver; then
        # ...
    fi
fi
```
What happens when `--mock` is used:

- Driver still launches unless explicitly skipped
- Mock mode should imply no driver is needed
- Wastes time and confuses users
### Expected Behavior

Mock mode should automatically skip the driver:
```bash
# Stage 1: Launch robot driver (unless skipped or mock)
if [ "$SKIP_DRIVER" != true ] && [ "$MOCK_ROBOT" != "true" ]; then
    if ! launch_robot_driver; then
        # ...
    fi
else
    if [ "$MOCK_ROBOT" = "true" ]; then
        print_info "Mock mode: Skipping robot driver launch"
    else
        print_info "Skipping robot driver launch (--skip-driver flag)"
    fi
fi
```
## Issue 3: Verification Runs Even When Driver Skipped

**Severity:** MEDIUM
**Lines:** 1270-1278

### Problem
```bash
# Stage 2: Verify topics (unless skipped or mock mode)
if [ "$SKIP_DRIVER" != true ] && [ "$MOCK_ROBOT" != "true" ]; then
    if ! verify_robot_topics; then
        # ...
    fi
fi
```
The logic is correct, BUT:

- If the user passes `--skip-driver` (without `--mock`), verification is skipped
- This means they are assuming a driver is already running
- But we never verify that it is actually there!
### Expected Behavior

Topics should still be verified when the driver launch is skipped (assuming an external driver):
```bash
# Stage 2: Verify topics
if [ "$MOCK_ROBOT" != "true" ]; then
    if [ "$SKIP_DRIVER" = true ]; then
        print_info "Driver launch skipped - verifying external driver topics..."
    else
        print_info "Verifying driver launched successfully..."
    fi

    if ! verify_robot_topics; then
        print_error "Topic verification failed"
        if [ "$SKIP_DRIVER" = true ]; then
            print_info "Is the external driver running?"
        fi
        read -p "Launch mission agent anyway? [y/N]: " continue_choice
        # ...
    fi
else
    print_info "Mock mode: Skipping topic verification"
fi
```
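With that change, `--skip-driver` pairs cleanly with an externally managed driver. For example (the `go2_robot_sdk` package name is an assumption - substitute the real driver launch target):

```bash
# Terminal 1: external robot driver (assumed package/launch names)
ros2 launch go2_robot_sdk robot.launch.py

# Terminal 2: start script skips the driver but still verifies its topics
./start.sh --skip-driver
```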
## Issue 4: Topic Verification Should Match Actual Configuration

**Severity:** LOW
**Lines:** 1120-1128

### Problem
```bash
local critical_topics=(
    "/go2_states"
    "/camera/image_raw"
    "/imu"
    "/odom"
)
```
Issues:

1. Hard-codes `/camera/image_raw`, but DIMOS uses configurable camera topics:
    - Default: `camera/compressed` (when `use_raw=False`)
    - Optional: `camera/image_raw` (when `use_raw=True`)
2. The mission agent also subscribes to `/camera/image_raw` for the web UI
3. `/imu` and `/odom` are checked, but the agent doesn't strictly require them
### Context

From `dimos/robot/unitree/unitree_ros_control.py`:
```python
CAMERA_TOPICS = {
    "raw": {"topic": "camera/image_raw", "type": Image},
    "compressed": {"topic": "camera/compressed", "type": CompressedImage},
}

# Line 101: Default uses compressed
active_camera_topics = {
    "main": self.CAMERA_TOPICS["raw" if use_raw else "compressed"]
}
```
### Expected Behavior

Option 1: Check both camera topics (one will exist)
```bash
local critical_topics=(
    "/go2_states"   # Robot state (always required)
)
local camera_topics=(
    "/camera/image_raw"    # Used by: mission agent web UI, DIMOS if use_raw=True
    "/camera/compressed"   # Used by: DIMOS default
)
# Check if at least one camera topic exists (see sketch below)
```
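A minimal sketch of that at-least-one check, assuming it runs inside `verify_robot_topics` next to the arrays above and reuses the script's `print_*` helpers and `topics_ok` flag:

```bash
# Sketch: pass if any one camera topic is currently advertised
local available_topics
available_topics=$(ros2 topic list 2>/dev/null)
local camera_ok=false
for topic in "${camera_topics[@]}"; do
    if echo "$available_topics" | grep -qx "$topic"; then
        print_success "Camera topic found: $topic"
        camera_ok=true
        break
    fi
done
if [ "$camera_ok" = false ]; then
    print_error "No camera topic available (checked: ${camera_topics[*]})"
    topics_ok=false
fi
```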
Option 2: Make it configurable (read from launch params)
```bash
# Get camera topic from environment or launch args
local camera_topic=${CAMERA_TOPIC:-/camera/compressed}
```
### Note

This is LOW priority because:

- In production, the driver publishes BOTH topics (the webrtc bridge converts)
- Verification will pass either way
- It only matters for minimal test setups
## Issue 5: Misleading "Topics Look Good" Message

**Severity:** LOW
**Lines:** 1149-1155

### Problem
```bash
# If critical topics are missing, abort
if [ "$topics_ok" = false ]; then
    print_error "Critical topics are missing - cannot launch mission agent"
    # ...
    return 1
fi

# Topics look good, ask for final confirmation
read -p "Topics look good? Continue to launch mission agent? [Y/n]: " continue_choice
```
Issue: If critical topics are missing, we abort. But when they are all present, we still ask for confirmation even though the check already passed. The interaction is redundant.
### Expected Behavior

Only ask if there are warnings (not errors):
if [ "$topics_ok" = false ]; then
print_error "Critical topics are missing - cannot launch mission agent"
read -p "Launch anyway (may fail)? [y/N]: " continue_choice
if [ "$continue_choice" != "y" ] && [ "$continue_choice" != "Y" ]; then
return 1
fi
elif [ "$topics_warnings" = true ]; then
print_warning "Some optional topics are missing"
read -p "Continue to launch mission agent? [Y/n]: " continue_choice
if [ "$continue_choice" = "n" ] || [ "$continue_choice" = "N" ]; then
return 1
fi
else
print_success "All topics available - proceeding to launch"
fi
## Issue 6: PYTHONPATH and DIMOS Import

**Severity:** LOW
**Lines:** 1238

### Problem
```bash
# Set PYTHONPATH for DIMOS
export PYTHONPATH="${SCRIPT_DIR}/src/dimos-unitree:${PYTHONPATH}"
```
Issue: This assumes DIMOS is a submodule at that exact path. If the structure changes or DIMOS is installed differently, this breaks.
### Expected Behavior

Check that the path exists before adding it:
```bash
# Set PYTHONPATH for DIMOS
if [ -d "${SCRIPT_DIR}/src/dimos-unitree" ]; then
    export PYTHONPATH="${SCRIPT_DIR}/src/dimos-unitree:${PYTHONPATH}"
    print_info "Added DIMOS to PYTHONPATH"
else
    print_warning "DIMOS submodule not found at src/dimos-unitree"
    print_info "Agent may fail if DIMOS is not installed"
fi
```
## Issue 7: CONN_TYPE Default May Be Wrong

**Severity:** MEDIUM
**Lines:** 1241

### Problem
```bash
# Export CONN_TYPE for DIMOS (defaults to webrtc if not set)
export CONN_TYPE=${CONN_TYPE:-webrtc}
```
Issue: Hardcodes `webrtc` as the default. But:

- CycloneDDS mode exists and is valid
- Should respect the `.env` configuration
- WebRTC requires WiFi, CycloneDDS requires Ethernet
### Expected Behavior

Get the default from `.env` or make it explicit:
```bash
# Export CONN_TYPE for DIMOS
if [ -z "$CONN_TYPE" ]; then
    print_warning "CONN_TYPE not set - defaulting to 'webrtc'"
    print_info "Set CONN_TYPE in .env for CycloneDDS mode"
    export CONN_TYPE="webrtc"
else
    print_info "Using CONN_TYPE=$CONN_TYPE"
fi
```
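A matching `.env` entry then makes the choice explicit. The value string `cyclonedds` is an assumption - use whatever identifier DIMOS actually expects:

```bash
# .env (assumed value names)
CONN_TYPE=cyclonedds   # wired Ethernet setup; use "webrtc" over WiFi
```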
## Issue 8: No Validation of Required Environment Variables

**Severity:** MEDIUM
**Lines:** N/A (missing feature)

### Problem
The script doesn't validate that required environment variables are set before launching:

- `OPENAI_API_KEY` (if using the OpenAI backend)
- `OPENAI_BASE_URL` (if using vLLM/Ollama)
- `OPENAI_MODEL` or `OLLAMA_MODEL`
What happens:

- Agent launches
- Fails during initialization
- Cryptic error messages
### Expected Behavior

Add validation before launch:
```bash
validate_environment() {
    print_section "Environment Validation"

    local agent_backend=${AGENT_BACKEND:-openai}
    local all_ok=true

    if [ "$agent_backend" = "openai" ]; then
        local base_url=${OPENAI_BASE_URL:-https://api.openai.com/v1}
        case "$base_url" in
            *api.openai.com*)
                # Using OpenAI cloud - an API key is required
                if [ -z "$OPENAI_API_KEY" ]; then
                    print_error "OPENAI_API_KEY not set (required for OpenAI cloud)"
                    all_ok=false
                else
                    print_success "OPENAI_API_KEY configured"
                fi
                ;;
            *)
                # Using local LLM (vLLM, LocalAI, etc.) - no API key required
                print_success "Using local LLM: $base_url"
                print_info "No API key required for local LLM"
                ;;
        esac

        if [ -z "$OPENAI_MODEL" ]; then
            print_warning "OPENAI_MODEL not set - will use default"
        else
            print_success "Model: $OPENAI_MODEL"
        fi
    elif [ "$agent_backend" = "ollama" ]; then
        if [ -z "$OLLAMA_BASE_URL" ]; then
            print_warning "OLLAMA_BASE_URL not set - using default (http://localhost:11434)"
        else
            print_success "Ollama: $OLLAMA_BASE_URL"
        fi

        if [ -z "$OLLAMA_MODEL" ]; then
            print_error "OLLAMA_MODEL not set (required)"
            all_ok=false
        else
            print_success "Model: $OLLAMA_MODEL"
        fi
    fi

    echo ""
    if [ "$all_ok" = false ]; then
        print_error "Environment validation failed"
        print_info "Check your .env file and try again"
        return 1
    fi
    return 0
}
```
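The call site is then a single guarded line early in the launch flow. A sketch, assuming the script has a `main`-style entry sequence:

```bash
# Before Stage 1: fail fast on bad configuration
if ! validate_environment; then
    exit 1
fi
```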
## Issue 9: No Health Check for LLM Backend

**Severity:** LOW
**Lines:** N/A (missing feature)

### Problem
The script doesn't verify that the LLM backend (vLLM, Ollama, OpenAI) is actually reachable before launching the agent.

What happens:

- Agent launches
- Tries to connect to the LLM
- Fails during the first query
- Not user-friendly
### Expected Behavior

Add an optional health check:
```bash
check_llm_backend() {
    print_section "LLM Backend Health Check"

    local agent_backend=${AGENT_BACKEND:-openai}

    if [ "$agent_backend" = "openai" ]; then
        local base_url=${OPENAI_BASE_URL:-https://api.openai.com/v1}
        print_info "Checking LLM endpoint: $base_url"

        case "$base_url" in
            *api.openai.com*)
                print_info "OpenAI cloud endpoint - skipping health check"
                ;;
            *)
                # Local LLM - check if accessible
                local models_url="${base_url}/models"
                if curl -s -f -m 5 "$models_url" >/dev/null 2>&1; then
                    print_success "LLM endpoint accessible"

                    # Try to get model list
                    local models=$(curl -s -m 5 "$models_url" 2>/dev/null)
                    if [ -n "$models" ]; then
                        print_info "Available models:"
                        echo "$models" | jq -r '.data[].id' 2>/dev/null | head -5 | sed 's/^/ - /'
                    fi
                else
                    print_warning "LLM endpoint not accessible (may be down)"
                    print_info "Check if vLLM/Ollama is running"
                    read -p "Continue anyway? [y/N]: " continue_choice
                    if [ "$continue_choice" != "y" ] && [ "$continue_choice" != "Y" ]; then
                        return 1
                    fi
                fi
                ;;
        esac
    fi

    echo ""
    return 0
}
```
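Since the check adds a network round-trip, it could be gated behind an opt-out variable. `SHADOWHOUND_SKIP_LLM_CHECK` is a hypothetical name, not an existing flag:

```bash
# Optional: skip with SHADOWHOUND_SKIP_LLM_CHECK=true (hypothetical variable)
if [ "${SHADOWHOUND_SKIP_LLM_CHECK:-false}" != "true" ]; then
    check_llm_backend || exit 1
fi
```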
## Issue 10: Cleanup Doesn't Kill Background Driver

**Severity:** LOW
**Lines:** 1340-1348

### Problem
```bash
# Kill robot driver if we started it
if [ -f "/tmp/shadowhound_driver.pid" ]; then
    local driver_pid=$(cat /tmp/shadowhound_driver.pid 2>/dev/null)
    if [ -n "$driver_pid" ]; then
        print_info "Stopping robot driver (PID: $driver_pid)..."
        kill $driver_pid 2>/dev/null || true
        sleep 1
        kill -9 $driver_pid 2>/dev/null || true
    fi
    rm -f /tmp/shadowhound_driver.pid
fi
```
Issue: The PID file approach is fragile:

- What if the driver crashes and the PID is reused?
- What if the PID file is stale?
- What if the driver was started externally?
### Expected Behavior

Use process name matching:
```bash
# Kill robot driver processes
print_info "Stopping robot driver..."
pkill -f "go2_driver_node" 2>/dev/null && print_success "Driver stopped" || print_info "No driver found"
pkill -f "robot.launch.py" 2>/dev/null || true

# Clean up PID file
rm -f /tmp/shadowhound_driver.pid 2>/dev/null || true
```
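One caveat: `pkill -f` also kills a driver the user started externally. If the script should stop only what it launched, a process-group sketch avoids both that and the stale-PID problem (the `go2_robot_sdk` package name is an assumption):

```bash
# Launch: put the driver in its own session/process group
setsid ros2 launch go2_robot_sdk robot.launch.py &   # assumed launch target
echo $! > /tmp/shadowhound_driver.pid                # session leader: PID == PGID

# Cleanup: signal only the group we created
if [ -f /tmp/shadowhound_driver.pid ]; then
    kill -- -"$(cat /tmp/shadowhound_driver.pid)" 2>/dev/null || true
    rm -f /tmp/shadowhound_driver.pid
fi
```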
## Summary of Critical Issues

| Issue | Severity | Impact | Blocks |
|---|---|---|---|
| #1: `--agent-only` hangs | HIGH | Can't test agent in isolation | Testing |
| #2: Mock launches driver | MEDIUM | Wastes time, confusing | UX |
| #3: Verification skipped | MEDIUM | May miss problems | Reliability |
| #8: No env validation | MEDIUM | Cryptic errors | UX |

Note: Issue #4 (camera topics) is downgraded to LOW - both topics are published in production, so verification passes either way.
## Recommended Fix Priority

- **DEFER** (needs architecture work):
    - Issue #1: Mock mode requires proper interface mocking (not a quick fix)
    - Action: Create the `docs/issues/mock_mode_architecture.md` design document
    - Workaround: Use the full driver or component-specific test scripts
- **HIGH** (usability - quick wins):
    - Issue #8: Add environment validation (prevents cryptic errors)
    - Issue #2: Mock mode skips driver automatically (1-line fix)
- **MEDIUM** (correctness):
    - Issue #3: Verify topics when the driver is skipped
    - Issue #7: `CONN_TYPE` handling
- **LOW** (nice to have):
    - Issue #4: Camera topic detection (works in production anyway)
    - Issue #5: Better user prompts
    - Issue #6: PYTHONPATH validation
    - Issue #9: LLM health check
    - Issue #10: Better cleanup
## Related Files

- `start.sh` - Main launch script
- `docs/issues/mock_mode_ros_topic_dependency.md` - Root cause of Issue #1
- `src/shadowhound_mission_agent/launch/mission_agent.launch.py` - Agent launch file
- `.env` - Environment configuration
## Next Steps

1. Fix Issue #1 with the quick workaround (1-line change)
2. Create a comprehensive fix PR addressing all issues
3. Add integration tests for the different launch modes
4. Update documentation with launch mode examples