ADR-007: Kill Switch Design

Status

Accepted

Context

Autonomous security response systems pose risks if they malfunction:

False positives could disable legitimate users/systems
Bugs could trigger cascading actions
Compromised AI could be weaponized
External events may require immediate halt

We needed an emergency stop mechanism that is:

Fast to activate (< 1 second)
Globally effective
Difficult to accidentally trigger
Easy to recover from

Decision

We implemented a global kill switch with the following design:

Architecture

                    ┌─────────────┐
                    │ Kill Switch │
                    │   State     │
                    └──────┬──────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Orchestrator  │  │ Policy Engine │  │ Action Runner │
│               │  │               │  │               │
│ check()       │  │ check()       │  │ check()       │
│ before        │  │ before        │  │ before        │
│ processing    │  │ evaluation    │  │ execution     │
└───────────────┘  └───────────────┘  └───────────────┘

State

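use chrono::{DateTime, Utc};

/// Current state of the global kill switch.
pub struct KillSwitchStatus {
    /// True while automation is halted.
    pub active: bool,
    /// Why the switch was activated, if set.
    pub reason: Option<String>,
    /// Who activated it (a user or "system"), if set.
    pub activated_by: Option<String>,
    /// When it was activated, if set.
    pub activated_at: Option<DateTime<Utc>>,
}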
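One way the state could be shared so that every component sees an activation immediately is sketched below. This is an illustration under assumptions, not the project's implementation: the `KillSwitch` holder, its synchronous method signatures, and the use of `std::time::SystemTime` in place of `DateTime<Utc>` (to keep the sketch dependency-free) are all hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;
use std::time::SystemTime;

/// Same fields as the ADR's `KillSwitchStatus`.
/// Assumption: `SystemTime` stands in for `DateTime<Utc>` to avoid external crates.
#[derive(Debug, Clone, Default)]
pub struct KillSwitchStatus {
    pub active: bool,
    pub reason: Option<String>,
    pub activated_by: Option<String>,
    pub activated_at: Option<SystemTime>,
}

/// Hypothetical holder for the global state; share it as `Arc<KillSwitch>`.
#[derive(Default)]
pub struct KillSwitch {
    // Lock-free flag so the hot-path check stays cheap.
    active: AtomicBool,
    // Metadata used for display and auditing.
    status: RwLock<KillSwitchStatus>,
}

impl KillSwitch {
    pub fn new() -> Self {
        Self::default()
    }

    /// Cheap check intended for the orchestrator, policy engine, and action runner.
    pub fn is_active(&self) -> bool {
        self.active.load(Ordering::SeqCst)
    }

    /// Records who halted automation and why, then raises the flag.
    pub fn activate(&self, reason: &str, activated_by: &str) {
        let mut status = self.status.write().unwrap();
        *status = KillSwitchStatus {
            active: true,
            reason: Some(reason.to_string()),
            activated_by: Some(activated_by.to_string()),
            activated_at: Some(SystemTime::now()),
        };
        self.active.store(true, Ordering::SeqCst);
    }

    /// Clears the metadata and lowers the flag.
    pub fn deactivate(&self) {
        let mut status = self.status.write().unwrap();
        *status = KillSwitchStatus::default();
        self.active.store(false, Ordering::SeqCst);
    }

    /// Returns a copy of the current status for dashboards or the API.
    pub fn status(&self) -> KillSwitchStatus {
        self.status.read().unwrap().clone()
    }
}
```

Keeping the flag in an atomic separate from the metadata means a check never waits on a lock, which fits the sub-second activation goal stated in the Context section.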

Check Points

The kill switch is checked at four points in the pipeline:

Alert Processing: Before creating incidents from alerts
Policy Evaluation: Before evaluating approval policies
Action Execution: Before executing any response action
Playbook Execution: Before running playbook stages
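The check points above could share a single guard that each stage calls before doing any work. The sketch below is hypothetical: the `CheckPoint` enum, the `Halted` error, the flag-only `KillSwitch`, and `execute_action` are names invented for illustration and do not come from the codebase.

```rust
use std::fmt;
use std::sync::atomic::{AtomicBool, Ordering};

/// The four check points listed in this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckPoint {
    AlertProcessing,
    PolicyEvaluation,
    ActionExecution,
    PlaybookExecution,
}

/// Hypothetical error returned when a stage is blocked by the switch.
#[derive(Debug, PartialEq, Eq)]
pub struct Halted(pub CheckPoint);

impl fmt::Display for Halted {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "kill switch active, halted at {:?}", self.0)
    }
}

impl std::error::Error for Halted {}

/// Minimal stand-in for the shared switch (flag only).
#[derive(Default)]
pub struct KillSwitch {
    active: AtomicBool,
}

impl KillSwitch {
    pub fn set_active(&self, on: bool) {
        self.active.store(on, Ordering::SeqCst);
    }

    /// Call at the top of each stage; `?` propagates the halt to the caller.
    pub fn check(&self, at: CheckPoint) -> Result<(), Halted> {
        if self.active.load(Ordering::SeqCst) {
            Err(Halted(at))
        } else {
            Ok(())
        }
    }
}

/// Example stage: the action runs only if the check passes.
pub fn execute_action(ks: &KillSwitch, action: &str) -> Result<String, Halted> {
    ks.check(CheckPoint::ActionExecution)?;
    Ok(format!("executed {}", action))
}
```

Returning an error that names the check point, rather than a bare boolean, makes it easier to log where processing stopped.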

Activation

// Via API
POST /api/kill-switch/activate
{
    "reason": "Investigating false positive surge",
    "activated_by": "[email protected]"
}

// Via CLI
tw-cli kill-switch activate --reason "Emergency maintenance"

// Programmatic
kill_switch.activate("Anomaly detected", "system").await;

Deactivation

// Via API
POST /api/kill-switch/deactivate
{
    "reason": "Issue resolved"
}

// Only admins can deactivate
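The admin-only rule could be enforced inside the deactivation path itself rather than only at the API layer. The following is a minimal sketch under assumptions: the `Role` enum, the `DeactivateError` variants, and the returned audit string are hypothetical, since the ADR states only that admins may deactivate.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical roles; only `Admin` is implied by the ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Admin,
    Analyst,
}

/// Hypothetical reasons a deactivation request can be rejected.
#[derive(Debug, PartialEq, Eq)]
pub enum DeactivateError {
    /// The caller is not an admin.
    Forbidden,
    /// The switch was not active, so there is nothing to deactivate.
    NotActive,
}

/// Minimal stand-in for the shared switch (flag only).
pub struct KillSwitch {
    active: AtomicBool,
}

impl KillSwitch {
    pub fn new(active: bool) -> Self {
        Self { active: AtomicBool::new(active) }
    }

    pub fn is_active(&self) -> bool {
        self.active.load(Ordering::SeqCst)
    }

    /// Deactivates the switch if the caller is an admin.
    /// On success, returns a line suitable for an audit log.
    pub fn deactivate(&self, caller: Role, reason: &str) -> Result<String, DeactivateError> {
        if caller != Role::Admin {
            return Err(DeactivateError::Forbidden);
        }
        // `swap` returns the previous value, so a false result means
        // the switch was already inactive.
        if !self.active.swap(false, Ordering::SeqCst) {
            return Err(DeactivateError::NotActive);
        }
        Ok(format!("kill switch deactivated: {}", reason))
    }
}
```

Checking the role before touching the flag ensures a rejected request leaves the switch active.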

Event Notification

Activation triggers:

KillSwitchActivated event to all subscribers
Dashboard alert banner
Notification to configured channels
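The fan-out to subscribers could look roughly like the sketch below, which uses standard-library channels. Only the `KillSwitchActivated` event name comes from this ADR; the `EventBus` type, its methods, and the event's fields are assumptions made for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Event emitted on activation. The field set is an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum KillSwitchEvent {
    KillSwitchActivated { reason: String, activated_by: String },
}

/// Hypothetical in-process bus; each subscriber gets its own channel.
#[derive(Default)]
pub struct EventBus {
    subscribers: Vec<Sender<KillSwitchEvent>>,
}

impl EventBus {
    /// Registers a subscriber (for example the dashboard or a chat notifier).
    pub fn subscribe(&mut self) -> Receiver<KillSwitchEvent> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Sends the event to every subscriber and drops those whose
    /// receiver has gone away. Returns the number of successful deliveries.
    pub fn publish(&mut self, event: KillSwitchEvent) -> usize {
        self.subscribers
            .retain(|tx| tx.send(event.clone()).is_ok());
        self.subscribers.len()
    }
}
```

Pruning disconnected subscribers during `publish` keeps one dead consumer from affecting delivery to the rest.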

Consequences

Positive

Immediate halt of all automation
Clear audit trail of activation/deactivation
Multiple activation methods (UI, API, CLI)
Visible status in all interfaces

Negative

In-memory state only: a restart loses the state and resets the switch to inactive
No automatic activation triggers yet
Single global switch (no per-action granularity)
Requires admin access to deactivate

Future Enhancements

Persistent State: Store kill switch state in database
Auto-Activation: Trigger on anomaly detection
Scoped Switches: Per-action-type or per-connector switches
Scheduled Deactivation: Auto-deactivate after timeout
Two-Person Rule: Require multiple admins for deactivation

Operational Procedures

When the kill switch is activated:

All pending actions remain pending
New alerts create incidents but stop at enrichment
Dashboard shows prominent warning banner
Existing approved actions are NOT rolled back
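The "pending actions remain pending" behavior can be pictured as a runner that re-checks the switch before every action and leaves the queue untouched while the switch is active. This is a sketch under assumptions: `run_pending`, the string-based action queue, and the bare `AtomicBool` flag are hypothetical simplifications.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

/// Drains the queue only while the kill switch is inactive.
/// Returns the actions that were executed; anything left in `queue`
/// stays pending for manual review after recovery.
pub fn run_pending(kill_switch: &AtomicBool, queue: &mut VecDeque<String>) -> Vec<String> {
    let mut executed = Vec::new();
    // Re-check before every action so an activation takes effect mid-batch.
    while !kill_switch.load(Ordering::SeqCst) {
        match queue.pop_front() {
            // A real runner would perform the response action here.
            Some(action) => executed.push(action),
            None => break,
        }
    }
    executed
}
```

Because nothing is dequeued while the switch is active, the pending actions are still available for the manual review step in the recovery procedure.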

To recover:

Investigate root cause
Fix underlying issue
Deactivate kill switch
Manually review pending actions
Resume normal operations

Triage Warden