Knowledge Base: Watchdog Timers

Table of Contents

Knowledge Base: Watchdog Timers

Definition

A watchdog timer is a hardware or software mechanism that monitors system activity and takes corrective action if the system fails to respond within a specified time period. In autonomous systems, watchdogs detect software hangs, infinite loops, or deadlocks and can trigger recovery actions such as restarting processes, entering safe mode, or performing emergency stops.

Also Known As: Deadman switch, heartbeat monitor, liveness checker

When to Use

Watchdog timers are essential safety mechanisms in autonomous systems where software failures could lead to unsafe behavior.

Use This When:

System must continue operating or fail safely despite software bugs
Long-running processes that might hang or deadlock
Real-time systems requiring guaranteed response times
Safety-critical applications (vehicles, medical devices, industrial automation)
Systems with potential for resource exhaustion (memory leaks, runaway computation)

Don’t Use This When:

Non-critical applications where restart overhead is unacceptable
Systems with naturally variable response times where timeout is ambiguous
As a substitute for fixing underlying bugs (watchdogs are last resort, not primary solution)
When recovery actions themselves might be unsafe

How It Works

A watchdog timer runs independently of the monitored system. The monitored system must periodically “pet” or “kick” the watchdog to indicate it’s still functioning. If the watchdog doesn’t receive updates within the timeout period, it assumes the system has failed and triggers recovery.

Key Components

Timer: Counts down from a specified timeout value
Reset Mechanism: Allows monitored system to reset the timer (“pet the dog”)
Timeout Handler: Executes recovery action when timer expires
Independence: Watchdog must be independent of monitored system (separate thread, process, or hardware)

Process or Flow

System initialization: Set watchdog timeout (e.g., 1 second)
Normal operation: Main loop pets watchdog periodically (e.g., every 500ms)
Failure scenario: Main loop hangs or crashes, stops petting watchdog
Timeout: Watchdog timer expires after 1 second without being pet
Recovery: Watchdog triggers recovery action (restart, safe mode, emergency stop)

Tradeoffs and Considerations

Advantages

Fault detection: Catches software failures that might otherwise go unnoticed
Automatic recovery: Can restore operation without human intervention
Safety backstop: Prevents indefinite unsafe behavior from software bugs
Simple concept: Easy to understand and explain to safety reviewers

Disadvantages

False positives: May trigger on legitimate slow operations (e.g., sensor initialization)
Recovery complexity: Determining safe recovery action can be difficult
Overhead: Adds code complexity and runtime overhead (usually minimal)
Tuning challenge: Timeout value is critical - too short causes false alarms, too long delays recovery

Design Considerations

Timeout value must be longer than worst-case legitimate execution time
Recovery action must be safer than allowing failure to continue
Watchdog itself must be reliable (prefer hardware watchdogs for critical systems)
Consider hierarchical watchdogs (process-level, system-level)
Log watchdog events for debugging and analysis

Common Pitfalls

Pitfall 1: Watchdog in Same Thread as Monitored Code

Problem: If the monitored code hangs, the watchdog petting code also hangs, so watchdog never fires
Why It Happens: Misunderstanding of independence requirement
Solution: Watchdog must run in separate thread, process, or hardware. The petting action should be the ONLY thing preventing timeout.

Pitfall 2: Timeout Too Short

Problem: False positives from legitimate slow operations (sensor init, file I/O, network requests)
Why It Happens: Underestimating worst-case execution time or not accounting for system load
Solution: Measure actual execution times under worst-case conditions. Add margin (2-3x typical time). Consider suspending watchdog during known slow operations.

Pitfall 3: Unsafe Recovery Action

Problem: Recovery action itself causes unsafe behavior (e.g., sudden stop while moving at high speed)
Why It Happens: Not thinking through consequences of recovery in all possible states
Solution: Design recovery to be safe from any state. Often this means gentle degradation (reduce speed, then stop) rather than immediate action. May need state-dependent recovery strategies.

Pitfall 4: Petting Watchdog in Exception Handler

Problem: Code has subtle bug, throws exception, exception handler pets watchdog before restart, watchdog never fires
Why It Happens: Trying to be “too smart” about keeping watchdog happy
Solution: Pet watchdog only in main loop, AFTER all critical operations succeed. Exception handlers should NOT pet watchdog.

Pitfall 5: No Logging or Diagnostics

Problem: Watchdog fires, system recovers, but root cause remains unknown
Why It Happens: Focus on recovery without thinking about debugging
Solution: Log watchdog events with context (what was executing, system state). Preserve crash dumps if possible. Track watchdog event frequency.

Examples

Example 1: ROS Node Health Monitor

Context: ROS-based autonomous robot with multiple nodes. Critical nodes must be monitored for health.

Implementation:

Each critical node publishes heartbeat message every 100ms
Separate watchdog node subscribes to heartbeats
If no heartbeat for 500ms, watchdog triggers recovery (restart node or enter safe mode)
Watchdog itself monitored by system-level watchdog timer

Outcome: Successfully caught several failure modes in testing: node deadlocks, network issues, resource exhaustion. False positive rate reduced to near-zero after tuning timeout values.

Example 2: Hardware Watchdog for Flight Controller

Context: Quadrotor flight controller running on embedded processor. Software bugs could cause crashes.

Implementation:

Hardware watchdog timer on microcontroller (independent of main CPU)
Main control loop pets watchdog every iteration (50Hz)
Watchdog timeout set to 100ms (2x loop period)
On timeout, hardware forces system reset and enters safe mode (land immediately)

Outcome: Caught several hard-to-reproduce bugs in development. In one case, prevented crash when main loop hung due to I2C peripheral failure. System landed safely and logged diagnostic information.

References

Kopetz & Veríssimo (1993) - “Real-time and dependability concepts” - Link
Dubrova (2013) - “Fault-Tolerant Design” - Chapter on watchdog timers

Graceful Degradation (to be created)
Fail-Safe Design Patterns (to be created)
System Health Monitoring (to be created)

Safety Mechanisms in Autonomous Systems (to be created)

External Resources

Wikipedia: Watchdog Timer
Embedded Systems Design: Watchdog Timers
ROS 2 Lifecycle Nodes - Related concept

Implementation Notes

Practical guidance for implementing watchdog timers.

Quick Checklist

Identify critical components to monitor
Determine appropriate timeout value (measure worst-case execution time)
Ensure watchdog runs independently (separate thread/process/hardware)
Design safe recovery action for all possible states
Implement logging and diagnostics
Test with injected failures (hang simulation)
Verify false positive rate is acceptable
Document timeout values and rationale

Code Snippet Example (Python)

import threading
import time

class Watchdog:
    def __init__(self, timeout_sec, callback):
        self.timeout = timeout_sec
        self.callback = callback
        self.last_pet = time.time()
        self.running = True
        self.thread = threading.Thread(target=self._monitor, daemon=True)
        self.thread.start()
    
    def pet(self):
        """Reset the watchdog timer (call regularly from monitored code)"""
        self.last_pet = time.time()
    
    def _monitor(self):
        """Internal monitoring loop (runs in separate thread)"""
        while self.running:
            time.sleep(0.1)  # Check every 100ms
            if time.time() - self.last_pet > self.timeout:
                print(f"WATCHDOG TIMEOUT after {self.timeout}s!")
                self.callback()  # Trigger recovery
                self.last_pet = time.time()  # Reset to avoid repeated triggers
    
    def stop(self):
        self.running = False
        self.thread.join()

# Usage example
def recovery_action():
    print("Recovery: Entering safe mode")
    # Implement safe recovery here

def perform_task():
    """Placeholder for your critical task"""
    # Replace this with your actual task logic
    pass

watchdog = Watchdog(timeout_sec=1.0, callback=recovery_action)

# Main loop
try:
    while True:
        # Do critical work
        perform_task()
        
        # Pet watchdog only after successful completion
        watchdog.pet()
        
        time.sleep(0.5)  # Loop rate: 2Hz
finally:
    watchdog.stop()

Tags: safety, reliability, system-architecture