Triage Warden

AI-powered security incident triage and response platform

Triage Warden automates the analysis of, and response to, security incidents using AI agents, configurable playbooks, and integrations with your existing security stack.

Features

  • AI-Powered Triage: Automated analysis of phishing emails, malware alerts, and suspicious login attempts
  • Configurable Playbooks: Define custom investigation and response workflows
  • Policy Engine: Role-based approval workflows for sensitive actions
  • Connector Framework: Integrate with VirusTotal, Splunk, CrowdStrike, Jira, Microsoft 365, and more
  • Web Dashboard: Real-time incident management with approval workflows
  • REST API: Programmatic access for automation and integration
  • Audit Trail: Complete logging of all actions and decisions

Quick Example

# Analyze a phishing email
tw-cli incident create --type phishing --source "email-gateway" --data '{"subject": "Urgent: Update Account"}'

# Run AI triage
tw-cli triage run --incident INC-2024-001

# View the verdict
tw-cli incident get INC-2024-001 --format json

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        Web Dashboard                             │
│                    (HTMX + Askama Templates)                     │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                         REST API                                 │
│                     (Axum + Tower)                               │
└─────────────────────────────────────────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐    ┌───────────────────┐    ┌───────────────┐
│ Policy Engine │    │   AI Triage Agent │    │    Actions    │
│    (Rust)     │    │     (Python)      │    │    (Rust)     │
└───────────────┘    └───────────────────┘    └───────────────┘
        │                       │                       │
        └───────────────────────┼───────────────────────┘
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Connector Layer                            │
│        (VirusTotal, Splunk, CrowdStrike, Jira, M365)            │
└─────────────────────────────────────────────────────────────────┘

Getting Started

  1. Installation - Install Triage Warden
  2. Quick Start - Create your first incident
  3. Configuration - Configure connectors and policies

License

Triage Warden is licensed under the MIT License.

Getting Started

Welcome to Triage Warden! This guide will help you get up and running quickly.

Prerequisites

  • Rust 1.75+ (for building from source)
  • Python 3.11+ (for AI triage agents)
  • SQLite or PostgreSQL (for data storage)
  • uv (recommended Python package manager)

Installation Options

  1. From Source - Build and run locally
  2. Docker - Run in containers
  3. Pre-built Binaries - Download releases

Next Steps

Installation

Building from Source

Prerequisites

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

Clone and Build

# Clone the repository
git clone https://github.com/zachyking/triage-warden.git
cd triage-warden

# Build Rust components
cargo build --release

# Install Python dependencies
cd python
uv sync

Verify Installation

# Check the CLI
./target/release/tw-cli --version

# Run tests
cargo test
cd python && uv run pytest

Docker

# Build the image
docker build -t triage-warden .

# Run with default settings
docker run -p 8080:8080 triage-warden

# Run with custom configuration
docker run -p 8080:8080 \
  -e TW_DATABASE_URL=postgres://user:pass@host/db \
  -e TW_VIRUSTOTAL_API_KEY=your-key \
  triage-warden

Pre-built Binaries

Download the latest release from the releases page.

Available platforms:

  • Linux x86_64 (glibc)
  • Linux x86_64 (musl)
  • macOS x86_64
  • macOS aarch64 (Apple Silicon)

# Example for macOS
curl -LO https://github.com/zachyking/triage-warden/releases/latest/download/triage-warden-macos-aarch64.tar.gz
tar xzf triage-warden-macos-aarch64.tar.gz
./triage-warden --version

Database Setup

SQLite (Default)

SQLite is used by default. The database file is created automatically:

# Default location
DATABASE_URL=sqlite://./triage_warden.db

# Custom location
DATABASE_URL=sqlite:///var/lib/triage-warden/data.db

PostgreSQL

For production deployments:

# Create database
createdb triage_warden

# Set connection string
export DATABASE_URL=postgres://user:password@localhost/triage_warden

# Run migrations
tw-cli db migrate

Next Steps

Quick Start

Get Triage Warden running and process your first incident in 5 minutes.

1. Start the Server

# Start with default settings (SQLite, mock connectors)
cargo run --bin tw-api

# Or use the release binary
./target/release/tw-api

The web dashboard is now available at http://localhost:8080.

2. Create an Incident

Via Web Dashboard

  1. Open http://localhost:8080 in your browser
  2. Click "New Incident"
  3. Fill in the incident details:
    • Type: Phishing
    • Source: Email Gateway
    • Severity: Medium
  4. Click Create

Via CLI

tw-cli incident create \
  --type phishing \
  --source "email-gateway" \
  --severity medium \
  --data '{
    "subject": "Urgent: Verify Your Account",
    "sender": "[email protected]",
    "recipient": "[email protected]"
  }'

Via API

curl -X POST http://localhost:8080/api/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "incident_type": "phishing",
    "source": "email-gateway",
    "severity": "medium",
    "raw_data": {
      "subject": "Urgent: Verify Your Account",
      "sender": "[email protected]"
    }
  }'

3. Run AI Triage

# Trigger triage for the incident
tw-cli triage run --incident INC-2024-0001

The AI agent will:

  1. Parse email headers and content
  2. Check sender reputation
  3. Analyze URLs and attachments
  4. Generate a verdict with confidence score
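
The verdict pairs a classification with a confidence score, as the example output in the next step shows. A minimal Rust sketch of that pairing (types, names, and the threshold are illustrative, not the actual tw-core definitions):

```rust
// Illustrative triage result; fields mirror the CLI output
// (verdict, confidence, recommended actions) but are not the real tw-core types.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Benign,
    Suspicious,
    Malicious,
}

pub struct TriageResult {
    pub verdict: Verdict,
    pub confidence: f64,
    pub recommended_actions: Vec<String>,
}

impl TriageResult {
    /// Whether the result is strong enough to act on without a human in the loop.
    pub fn auto_actionable(&self, threshold: f64) -> bool {
        self.verdict == Verdict::Malicious && self.confidence >= threshold
    }
}

fn main() {
    let result = TriageResult {
        verdict: Verdict::Malicious,
        confidence: 0.92,
        recommended_actions: vec!["quarantine_email".to_string(), "block_sender".to_string()],
    };
    // With a hypothetical 0.7 threshold, a 0.92-confidence malicious verdict qualifies.
    assert!(result.auto_actionable(0.7));
    println!("{:?}: {} recommended actions", result.verdict, result.recommended_actions.len());
}
```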

4. View the Verdict

# Get incident with triage results
tw-cli incident get INC-2024-0001

# Example output:
# Incident: INC-2024-0001
# Type: phishing
# Status: triaged
# Verdict: malicious
# Confidence: 0.92
# Recommended Actions:
#   - quarantine_email
#   - block_sender
#   - notify_user

5. Execute Actions

Actions may require approval based on your policy configuration:

# Request to quarantine the email
tw-cli action execute --incident INC-2024-0001 --action quarantine_email

# If auto-approved:
# Action executed: quarantine_email (status: completed)

# If requires approval:
# Action pending approval from: Senior Analyst

Approve pending actions via the dashboard at /approvals.

Next Steps

Configuration

Triage Warden is configured through environment variables and configuration files.

Environment Variables

Core Settings

Variable           Description                                      Default
TW_DATABASE_URL    Database connection string                       sqlite://./triage_warden.db
TW_HOST            API server host                                  0.0.0.0
TW_PORT            API server port                                  8080
TW_LOG_LEVEL       Logging level (trace, debug, info, warn, error)  info
TW_ADMIN_PASSWORD  Initial admin password                           (generated)

Connector Selection

Variable               Description                  Values
TW_THREAT_INTEL_MODE   Threat intelligence backend  mock, virustotal
TW_SIEM_MODE           SIEM backend                 mock, splunk
TW_EDR_MODE            EDR backend                  mock, crowdstrike
TW_EMAIL_GATEWAY_MODE  Email gateway backend        mock, m365
TW_TICKETING_MODE      Ticketing backend            mock, jira

VirusTotal

TW_THREAT_INTEL_MODE=virustotal
TW_VIRUSTOTAL_API_KEY=your-api-key-here

Splunk

TW_SIEM_MODE=splunk
TW_SPLUNK_URL=https://splunk.company.com:8089
TW_SPLUNK_TOKEN=your-token-here

CrowdStrike

TW_EDR_MODE=crowdstrike
TW_CROWDSTRIKE_CLIENT_ID=your-client-id
TW_CROWDSTRIKE_CLIENT_SECRET=your-client-secret
TW_CROWDSTRIKE_REGION=us-1  # us-1, us-2, eu-1

Microsoft 365

TW_EMAIL_GATEWAY_MODE=m365
TW_M365_TENANT_ID=your-tenant-id
TW_M365_CLIENT_ID=your-client-id
TW_M365_CLIENT_SECRET=your-client-secret

Jira

TW_TICKETING_MODE=jira
TW_JIRA_URL=https://company.atlassian.net
[email protected]
TW_JIRA_API_TOKEN=your-api-token
TW_JIRA_PROJECT_KEY=SEC

AI Provider

TW_AI_PROVIDER=anthropic  # anthropic, openai
TW_ANTHROPIC_API_KEY=your-api-key
# or
TW_OPENAI_API_KEY=your-api-key

Configuration File

For complex configurations, use a TOML file:

# config.toml

[server]
host = "0.0.0.0"
port = 8080
log_level = "info"

[database]
url = "postgres://user:pass@localhost/triage_warden"
max_connections = 10

[connectors.threat_intel]
mode = "virustotal"
api_key = "${TW_VIRUSTOTAL_API_KEY}"
rate_limit = 4  # requests per minute

[connectors.siem]
mode = "splunk"
url = "https://splunk.company.com:8089"
token = "${TW_SPLUNK_TOKEN}"

[connectors.edr]
mode = "crowdstrike"
client_id = "${TW_CROWDSTRIKE_CLIENT_ID}"
client_secret = "${TW_CROWDSTRIKE_CLIENT_SECRET}"
region = "us-1"

[ai]
provider = "anthropic"
model = "claude-sonnet-4-20250514"
max_tokens = 4096

[policy]
default_action_approval = "auto"  # auto, analyst, senior, manager
high_severity_approval = "senior"
critical_action_approval = "manager"

Load with:

tw-api --config config.toml
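
Values like `api_key = "${TW_VIRUSTOTAL_API_KEY}"` are resolved from the environment when the file is loaded. A minimal sketch of that `${VAR}` substitution, under the assumption that unresolved names are left verbatim (the real loader may instead error):

```rust
// Expand ${VAR} placeholders in a config value. `lookup` supplies variable
// values (e.g. backed by std::env::var); unresolved names are kept verbatim.
// Illustrative only -- not the actual tw-api config loader.
pub fn expand<F: Fn(&str) -> Option<String>>(value: &str, lookup: F) -> String {
    let mut out = String::new();
    let mut rest = value;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        match rest[start + 2..].find('}') {
            Some(end) => {
                let name = &rest[start + 2..start + 2 + end];
                match lookup(name) {
                    Some(v) => out.push_str(&v),
                    None => out.push_str(&rest[start..start + 3 + end]),
                }
                rest = &rest[start + 3 + end..];
            }
            None => {
                // No closing brace: keep the remainder as-is.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    fn lookup(name: &str) -> Option<String> {
        if name == "TW_SPLUNK_TOKEN" { Some("tok-123".to_string()) } else { None }
    }
    assert_eq!(expand("${TW_SPLUNK_TOKEN}", lookup), "tok-123");
    assert_eq!(expand("url=${MISSING}", lookup), "url=${MISSING}");
}
```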

Policy Rules

Policy rules control action approval requirements. See Policy Engine for details.

# Example policy rule
[[policy.rules]]
name = "isolate_host_requires_manager"
action = "isolate_host"
severity = ["high", "critical"]
approval_level = "manager"

Logging

Configure structured logging:

# JSON output for production
TW_LOG_FORMAT=json

# Pretty output for development
TW_LOG_FORMAT=pretty

# Filter specific modules
RUST_LOG=tw_api=debug,tw_core=info

Next Steps

Web Dashboard

Browser-based interface for incident management.

Overview

The dashboard provides:

  • Real-time incident monitoring
  • Approval workflow management
  • Playbook configuration
  • System settings

Access at: http://localhost:8080

Features

Home Dashboard

The main dashboard displays:

  • KPIs: Open incidents, pending approvals, triage rate
  • Recent Incidents: Latest incidents with status
  • Trend Charts: Incident volume over time
  • Quick Actions: Create incident, run playbook

Incident Management

  • List view with filtering and sorting
  • Detail view with full incident context
  • Action execution interface
  • Triage results and reasoning

Approval Workflow

  • Queue of pending approvals
  • One-click approve/reject
  • Bulk approval for related actions
  • SLA countdown timers

Playbook Management

  • Create and edit playbooks
  • Visual step editor
  • Test with sample data
  • Execution history

Settings

  • Connector configuration
  • Policy rule management
  • User administration
  • System preferences

Path            Description
/               Dashboard home
/incidents      Incident list
/incidents/:id  Incident detail
/approvals      Pending approvals
/playbooks      Playbook management
/settings       System settings
/login          Login page

Next Steps

Incidents

Managing incidents in the web dashboard.

Incident List

Access at /incidents

Filtering

  • Status: Open, Triaged, Resolved
  • Severity: Low, Medium, High, Critical
  • Type: Phishing, Malware, Suspicious Login
  • Date Range: Custom time period

Sorting

Click column headers to sort:

  • Created (newest/oldest)
  • Severity (highest/lowest)
  • Status

Bulk Actions

Select multiple incidents for:

  • Bulk resolve
  • Bulk escalate
  • Export to CSV

Incident Detail

Click an incident to view details.

Overview Tab

  • Incident metadata
  • AI verdict and confidence
  • Recommended actions
  • Timeline of events

Raw Data Tab

  • Original incident data (JSON)
  • Parsed email content (for phishing)
  • Detection details (for malware)

Actions Tab

  • Available actions
  • Executed actions with results
  • Pending approvals

Enrichment Tab

  • Threat intelligence results
  • SIEM correlation data
  • Related incidents

Creating Incidents

Click "New Incident" button.

Required Fields

  • Type: Select incident type
  • Source: Origin of the incident
  • Severity: Initial severity assessment

Optional Fields

  • Description: Free-form description
  • Raw Data: JSON payload
  • Assignee: Initial assignment

Executing Actions

From the incident detail page:

  1. Click "Actions" tab
  2. Select action from dropdown
  3. Fill in parameters
  4. Click "Execute"

If approval is required:

  • Action appears in pending state
  • Notification sent to approvers
  • Status updates when approved/rejected

Keyboard Shortcuts

Shortcut  Action
j / k     Navigate list
Enter     Open incident
Esc       Close modal
a         Open actions menu
e         Escalate
r         Resolve

Real-time Updates

The dashboard uses HTMX for live updates:

  • New incidents appear automatically
  • Status changes reflect immediately
  • Approval decisions update in real-time

Approvals

Managing action approvals in the web dashboard.

Approval Queue

Access at /approvals

The queue shows all actions pending your approval based on your role level.

Queue Columns

  • Action: Type of action requested
  • Incident: Related incident
  • Requested By: Who/what requested it
  • Requested At: When requested
  • SLA: Time remaining to respond

Filtering

  • Approval Level: Analyst, Senior, Manager
  • Action Type: Specific actions
  • Incident Type: Phishing, malware, etc.

Approval Detail

Click an approval to see full context.

Context Section

  • Full incident details
  • AI reasoning (if from triage)
  • Related actions already taken

Decision Section

  • Approve: Execute the action
  • Reject: Decline with reason
  • Delegate: Assign to another approver

Approving Actions

Single Approval

  1. Click on pending action
  2. Review incident context
  3. Click "Approve" or "Reject"
  4. Add optional comment
  5. Confirm decision

Bulk Approval

For related actions:

  1. Select multiple actions (checkbox)
  2. Click "Bulk Approve" or "Bulk Reject"
  3. Add comment applying to all
  4. Confirm

Rejection

When rejecting:

  1. Click "Reject"
  2. Required: Enter rejection reason
  3. Optionally suggest alternative
  4. Confirm

The requester is notified of rejection and reason.

SLA Indicators

Color   Meaning
Green   Plenty of time
Yellow  < 50% time remaining
Orange  < 25% time remaining
Red     SLA exceeded
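
The colors map directly onto the fraction of SLA time remaining. A sketch of that mapping, using the thresholds above (function name and exact boundary handling are illustrative):

```rust
// Map the fraction of SLA time remaining to an indicator color.
// Thresholds follow the SLA table; boundary behavior is an assumption.
pub fn sla_color(fraction_remaining: f64) -> &'static str {
    if fraction_remaining <= 0.0 {
        "red" // SLA exceeded
    } else if fraction_remaining < 0.25 {
        "orange"
    } else if fraction_remaining < 0.50 {
        "yellow"
    } else {
        "green"
    }
}

fn main() {
    assert_eq!(sla_color(0.8), "green");
    assert_eq!(sla_color(0.4), "yellow");
    assert_eq!(sla_color(0.2), "orange");
    assert_eq!(sla_color(0.0), "red");
}
```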

Notifications

You receive notifications for:

  • New actions requiring your approval
  • SLA warnings (50%, 75% elapsed)
  • Escalations to your level

Configure notification preferences in Settings.

Delegation

If unavailable:

  1. Go to Settings > Delegation
  2. Select delegate user
  3. Set date range
  4. Delegate receives your approvals

Audit Trail

All approvals are logged:

  • Who approved/rejected
  • When decision was made
  • Time to approve
  • Comments provided

View at Settings > Audit Logs.

Playbooks

Managing playbooks in the web dashboard.

Playbook List

Access at /playbooks

Views

  • Active: Currently enabled playbooks
  • Inactive: Disabled playbooks
  • All: Complete list

Information Displayed

  • Name and description
  • Trigger conditions
  • Last run time
  • Success rate

Creating Playbooks

Click "New Playbook" button.

Basic Information

  • Name: Unique identifier
  • Description: What this playbook does
  • Version: Semantic version

Triggers

Configure when playbook runs:

  • Incident Type: Phishing, malware, etc.
  • Auto Run: Run automatically on new incidents
  • Conditions: Additional criteria

Variables

Define playbook variables:

quarantine_threshold: 0.7
notification_channel: "#security"

Step Editor

Visual editor for playbook steps.

Adding Steps

  1. Click "Add Step"
  2. Select action type
  3. Configure parameters
  4. Set output variable name

Step Types

  • Action: Execute an action
  • Condition: Branch logic
  • AI Analysis: Get AI verdict
  • Parallel: Run steps concurrently

Connections

  • Drag to reorder steps
  • Connect condition branches
  • Set dependencies

Testing Playbooks

Dry Run

  1. Click "Test"
  2. Select or create test incident
  3. Toggle "Dry Run"
  4. View step-by-step execution

With Live Data

  1. Click "Test"
  2. Select real incident
  3. Leave "Dry Run" off
  4. Actions will execute (with approval)

Execution History

View past executions:

  • Execution timestamp
  • Incident processed
  • Steps completed
  • Final verdict
  • Duration

Click execution for detailed trace.

Import/Export

Export

  1. Select playbook
  2. Click "Export"
  3. Download YAML file

Import

  1. Click "Import"
  2. Upload YAML file
  3. Review parsed playbook
  4. Click "Create"

Playbook Versions

Playbooks are versioned:

  1. Edit playbook
  2. Bump version number
  3. Save as new version
  4. Old version kept for rollback

View version history and compare changes.

Settings

System configuration in the web dashboard.

Settings Tabs

Access at /settings

General

  • Instance Name: Display name for this installation
  • Time Zone: Default timezone for display
  • Date Format: Date/time display format
  • Theme: Light/dark mode preference

Connectors

Configure external integrations.

Threat Intelligence

  • Mode: Mock or VirusTotal
  • API Key (for VirusTotal)
  • Rate limit settings

SIEM

  • Mode: Mock or Splunk
  • URL and authentication
  • Default search index

EDR

  • Mode: Mock or CrowdStrike
  • OAuth credentials
  • Region selection

Email Gateway

  • Mode: Mock or Microsoft 365
  • Azure AD configuration
  • Tenant settings

Ticketing

  • Mode: Mock or Jira
  • Instance URL
  • Project configuration

Policies

Manage policy rules.

Creating Rules

  1. Click "Add Rule"
  2. Enter rule name
  3. Define matching criteria
  4. Set decision (allow/deny/approval)
  5. Save

Rule Priority

Drag rules to reorder. First matching rule wins.
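
First-match-wins evaluation amounts to an ordered scan over the rule list. A sketch of that scan (types are illustrative stand-ins, not the real tw-policy API, and the fall-through default of Allow is an assumption):

```rust
// Ordered first-match rule scan: the first rule whose action and severity
// both match decides the outcome. Illustrative types, not tw-policy's.
#[derive(Clone, Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny,
    RequireApproval(String), // approval level, e.g. "manager"
}

pub struct Rule {
    pub action: String,
    pub severities: Vec<String>,
    pub decision: Decision,
}

pub fn evaluate(rules: &[Rule], action: &str, severity: &str) -> Decision {
    rules
        .iter()
        .find(|r| r.action == action && r.severities.iter().any(|s| s == severity))
        .map(|r| r.decision.clone())
        // No rule matched: fall back to a default (assumed Allow here).
        .unwrap_or(Decision::Allow)
}

fn main() {
    let rules = vec![
        Rule {
            action: "isolate_host".into(),
            severities: vec!["high".into(), "critical".into()],
            decision: Decision::RequireApproval("manager".into()),
        },
        Rule {
            action: "isolate_host".into(),
            severities: vec!["low".into(), "high".into()],
            decision: Decision::Allow,
        },
    ];
    // The first matching rule wins: high-severity isolation needs a manager,
    // even though a later rule would allow it.
    assert_eq!(
        evaluate(&rules, "isolate_host", "high"),
        Decision::RequireApproval("manager".into())
    );
    assert_eq!(evaluate(&rules, "isolate_host", "low"), Decision::Allow);
}
```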

Users

User management (admin only).

User List

  • Username and email
  • Role (viewer/analyst/senior/admin)
  • Last login
  • Status (active/disabled)

Creating Users

  1. Click "Add User"
  2. Enter email and username
  3. Set initial role
  4. Generate or set password
  5. Send invitation email

Role Management

Assign roles:

  • Viewer: Read-only access
  • Analyst: Execute actions, approve analyst-level
  • Senior: Approve senior-level
  • Admin: Full access

Notifications

Configure notification preferences.

Channels

  • Email: SMTP settings
  • Slack: Webhook URL
  • Teams: Connector URL
  • PagerDuty: Integration key

Preferences

For each notification type:

  • Enable/disable channel
  • Set priority threshold
  • Configure quiet hours

Audit Logs

View system audit trail.

Filtering

  • Date range
  • Event type
  • User
  • Resource

Export

Export logs to CSV for compliance.

API Keys

Manage API credentials.

Creating Keys

  1. Click "Create API Key"
  2. Enter name and description
  3. Select scopes
  4. Set expiration (optional)
  5. Copy generated key

Revoking Keys

Click "Revoke" on any key. Revocation is immediate.

Backup & Restore

Database management.

Backup

  1. Click "Create Backup"
  2. Wait for completion
  3. Download backup file

Restore

  1. Click "Restore"
  2. Upload backup file
  3. Confirm restore
  4. System restarts

About

System information:

  • Version number
  • Build information
  • License status
  • Support links

CLI Reference

Command-line interface for Triage Warden.

Installation

The CLI is built with the main project:

cargo build --release
./target/release/tw-cli --help

Global Options

tw-cli [OPTIONS] <COMMAND>

Options:
  -c, --config <FILE>     Configuration file path
  -v, --verbose           Enable verbose output
  -q, --quiet             Suppress non-error output
  --json                  Output as JSON
  -h, --help              Print help
  -V, --version           Print version

Environment Variables

Variable    Description
TW_API_URL  API server URL (default: http://localhost:8080)
TW_API_KEY  API key for authentication
TW_CONFIG   Path to config file

Commands Overview

Command    Description
incident   Manage incidents
action     Execute and manage actions
triage     Run AI triage
playbook   Manage playbooks
policy     Manage policy rules
connector  Manage connectors
user       User management
api-key    API key management
webhook    Webhook management
config     Configuration management
db         Database operations
serve      Start API server

Quick Examples

# List open incidents
tw-cli incident list --status open

# Create incident
tw-cli incident create --type phishing --severity high

# Run triage
tw-cli triage run --incident INC-2024-001

# Execute action
tw-cli action execute --incident INC-2024-001 --action quarantine_email

# Approve pending action
tw-cli action approve act-abc123

# Start server
tw-cli serve --port 8080

Next Steps

CLI Commands

Detailed reference for all CLI commands.

incident

Manage security incidents.

list

tw-cli incident list [OPTIONS]

Options:
  --status <STATUS>      Filter by status (open, triaged, resolved)
  --severity <SEVERITY>  Filter by severity
  --type <TYPE>          Filter by incident type
  --limit <N>            Maximum results (default: 20)
  --offset <N>           Skip first N results
  --sort <FIELD>         Sort field (created_at, severity)
  --desc                 Sort descending

get

tw-cli incident get <ID> [OPTIONS]

Options:
  --format <FORMAT>      Output format (table, json, yaml)
  --include-actions      Include action history
  --include-enrichment   Include enrichment data

create

tw-cli incident create [OPTIONS]

Options:
  --type <TYPE>          Incident type (required)
  --source <SOURCE>      Incident source (required)
  --severity <SEVERITY>  Initial severity (default: medium)
  --data <JSON>          Raw incident data as JSON
  --file <FILE>          Read data from file
  --auto-triage          Run triage after creation

update

tw-cli incident update <ID> [OPTIONS]

Options:
  --severity <SEVERITY>  Update severity
  --status <STATUS>      Update status
  --assignee <USER>      Assign to user

resolve

tw-cli incident resolve <ID> [OPTIONS]

Options:
  --resolution <TEXT>    Resolution notes
  --false-positive       Mark as false positive

action

Execute and manage actions.

execute

tw-cli action execute [OPTIONS]

Options:
  --incident <ID>        Associated incident
  --action <NAME>        Action to execute (required)
  --param <KEY=VALUE>    Action parameter (repeatable)
  --emergency            Emergency override (manager only)

list

tw-cli action list [OPTIONS]

Options:
  --incident <ID>        Filter by incident
  --status <STATUS>      Filter by status
  --pending              Show only pending approval

get

tw-cli action get <ID>

approve

tw-cli action approve <ID> [OPTIONS]

Options:
  --comment <TEXT>       Approval comment

reject

tw-cli action reject <ID> [OPTIONS]

Options:
  --reason <TEXT>        Rejection reason (required)

rollback

tw-cli action rollback <ID> [OPTIONS]

Options:
  --reason <TEXT>        Rollback reason

triage

Run AI triage.

run

tw-cli triage run [OPTIONS]

Options:
  --incident <ID>        Incident to triage (required)
  --playbook <NAME>      Specific playbook
  --model <MODEL>        AI model override
  --wait                 Wait for completion

status

tw-cli triage status <TRIAGE_ID>

playbook

Manage playbooks.

list

tw-cli playbook list [OPTIONS]

Options:
  --enabled              Only enabled playbooks
  --trigger-type <TYPE>  Filter by trigger type

get

tw-cli playbook get <ID>

add

tw-cli playbook add <FILE>

update

tw-cli playbook update <ID> <FILE>

delete

tw-cli playbook delete <ID>

run

tw-cli playbook run <ID> [OPTIONS]

Options:
  --incident <ID>        Incident to process
  --var <KEY=VALUE>      Override variable (repeatable)
  --dry-run              Don't execute actions

test

tw-cli playbook test <NAME> [OPTIONS]

Options:
  --incident <ID>        Use existing incident
  --data <JSON>          Use mock data
  --dry-run              Don't execute actions

validate

tw-cli playbook validate <FILE>

export

tw-cli playbook export <ID> [OPTIONS]

Options:
  -o, --output <FILE>    Output file (default: stdout)

policy

Manage policy rules.

list

tw-cli policy list

add

tw-cli policy add [OPTIONS]

Options:
  --name <NAME>          Rule name (required)
  --action <ACTION>      Action to match
  --pattern <PATTERN>    Action pattern (glob)
  --severity <SEVERITY>  Severity condition
  --approval-level <L>   Required approval level
  --allow                Auto-allow
  --deny                 Deny with reason
  --reason <TEXT>        Denial reason

delete

tw-cli policy delete <NAME>

test

tw-cli policy test [OPTIONS]

Options:
  --action <ACTION>      Action to test
  --severity <SEVERITY>  Incident severity
  --proposer-type <T>    Proposer type
  --confidence <N>       AI confidence score

connector

Manage connectors.

status

tw-cli connector status

test

tw-cli connector test <NAME>

configure

tw-cli connector configure <NAME> [OPTIONS]

Options:
  --mode <MODE>          Connector mode
  --api-key <KEY>        API key
  --url <URL>            Service URL

user

User management.

list

tw-cli user list

create

tw-cli user create [OPTIONS]

Options:
  --username <NAME>      Username (required)
  --email <EMAIL>        Email address
  --role <ROLE>          User role
  --service-account      Create as service account

update

tw-cli user update <ID> [OPTIONS]

Options:
  --role <ROLE>          New role
  --enabled              Enable user
  --disabled             Disable user

delete

tw-cli user delete <ID>

api-key

API key management.

list

tw-cli api-key list

create

tw-cli api-key create [OPTIONS]

Options:
  --name <NAME>          Key name (required)
  --scopes <SCOPES>      Comma-separated scopes
  --user <USER>          Associated user
  --expires <DATE>       Expiration date

revoke

tw-cli api-key revoke <PREFIX>

rotate

tw-cli api-key rotate <PREFIX>

webhook

Webhook management.

list

tw-cli webhook list

add

tw-cli webhook add <SOURCE> [OPTIONS]

Options:
  --secret <SECRET>      Webhook secret
  --auto-triage          Enable auto-triage
  --playbook <NAME>      Playbook to run

test

tw-cli webhook test <SOURCE>

delete

tw-cli webhook delete <SOURCE>

db

Database operations.

migrate

tw-cli db migrate

backup

tw-cli db backup [OPTIONS]

Options:
  -o, --output <FILE>    Backup file path

restore

tw-cli db restore <FILE>

serve

Start the API server.

tw-cli serve [OPTIONS]

Options:
  --host <HOST>          Bind address (default: 0.0.0.0)
  --port <PORT>          Port number (default: 8080)
  --config <FILE>        Configuration file

Architecture Overview

Triage Warden is built as a modular, layered system combining Rust for performance-critical components and Python for AI capabilities.

System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                           Clients                                    │
│              (Web Browser, CLI, API Consumers)                       │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         API Layer (tw-api)                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  REST API   │  │ Web Handlers│  │  Webhooks   │  │   Metrics   │ │
│  │   (Axum)    │  │(HTMX+Askama)│  │             │  │ (Prometheus)│ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
                                   │
        ┌──────────────────────────┼──────────────────────────┐
        ▼                          ▼                          ▼
┌───────────────┐        ┌───────────────────┐       ┌───────────────┐
│ Policy Engine │        │   Action Registry │       │  Event Bus    │
│   (tw-policy) │        │    (tw-actions)   │       │  (tw-core)    │
└───────────────┘        └───────────────────┘       └───────────────┘
        │                          │                          │
        └──────────────────────────┼──────────────────────────┘
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Core Domain (tw-core)                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  Incidents  │  │  Playbooks  │  │   Users     │  │   Audit     │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Database Layer (SQLx)                             │
│              (SQLite for dev, PostgreSQL for prod)                   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                      Python Bridge (tw-bridge)                       │
│                          (PyO3 Bindings)                             │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       AI Layer (tw_ai)                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │Triage Agent │  │    Tools    │  │  Playbook   │  │  Evaluation │ │
│  │  (Claude)   │  │             │  │   Engine    │  │  Framework  │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Connector Layer (tw-connectors)                   │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │ VirusTotal  │  │   Splunk    │  │ CrowdStrike │  │    Jira     │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Crate Structure

Crate             Purpose
tw-api            HTTP server, REST API, web handlers, webhooks
tw-core           Domain models, database repositories, event bus
tw-actions        Action handlers (quarantine, isolate, notify, etc.)
tw-policy         Policy engine, approval rules, decision evaluation
tw-connectors     External service integrations (VirusTotal, Splunk, etc.)
tw-bridge         PyO3 bindings exposing Rust to Python
tw-cli            Command-line interface
tw-observability  Metrics, tracing, logging infrastructure

Key Design Decisions

Rust + Python Hybrid

  • Rust: Core platform, API server, policy engine, actions
  • Python: AI agents, LLM integrations, playbook execution
  • Bridge: PyO3 enables Python to call Rust connectors and actions

Trait-Based Connectors

All connectors implement traits for testability:

#[async_trait]
pub trait ThreatIntelConnector: Send + Sync {
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport>;
    async fn lookup_url(&self, url: &str) -> ConnectorResult<ThreatReport>;
    async fn lookup_domain(&self, domain: &str) -> ConnectorResult<ThreatReport>;
}
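
Because callers depend only on the trait, a mock backend slots in wherever the production one does. A simplified synchronous analogue showing that swap (the real trait is async and the types here are stand-ins, not tw-connectors code):

```rust
// Simplified, synchronous stand-in for the connector trait, to show how a
// mock backend (TW_THREAT_INTEL_MODE=mock) slots in. The real trait is async
// and returns ConnectorResult; these types are illustrative.
pub struct ThreatReport {
    pub indicator: String,
    pub malicious: bool,
}

pub trait ThreatIntel {
    fn lookup_hash(&self, hash: &str) -> ThreatReport;
}

// Mock backend: flags a fixed set of known-bad hashes.
pub struct MockThreatIntel {
    pub known_bad: Vec<String>,
}

impl ThreatIntel for MockThreatIntel {
    fn lookup_hash(&self, hash: &str) -> ThreatReport {
        ThreatReport {
            indicator: hash.to_string(),
            malicious: self.known_bad.iter().any(|h| h == hash),
        }
    }
}

fn main() {
    // Code under test depends on `dyn ThreatIntel`, never on a concrete backend.
    let intel: Box<dyn ThreatIntel> = Box::new(MockThreatIntel {
        known_bad: vec!["44d88612fea8a8f36de82e1278abb02f".to_string()],
    });
    assert!(intel.lookup_hash("44d88612fea8a8f36de82e1278abb02f").malicious);
    assert!(!intel.lookup_hash("0000").malicious);
}
```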

Event-Driven Architecture

The event bus enables loose coupling:

event_bus.publish(Event::IncidentCreated { id, incident_type });
event_bus.publish(Event::ActionExecuted { action_id, result });

Policy-First Actions

All actions pass through the policy engine:

Request → Policy Evaluation → (Allowed | Denied | RequiresApproval) → Execute
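
That gate can be pictured as a three-way decision type the executor matches on (a sketch; names and the dispatch strings are illustrative, not the actual tw-policy API):

```rust
// The gate in front of every action: policy evaluation yields one of three
// outcomes and the executor dispatches on it. Illustrative names only.
pub enum PolicyDecision {
    Allowed,
    Denied { reason: String },
    RequiresApproval { level: String },
}

pub fn dispatch(decision: PolicyDecision) -> String {
    match decision {
        PolicyDecision::Allowed => "executing action".to_string(),
        PolicyDecision::Denied { reason } => format!("rejected: {reason}"),
        PolicyDecision::RequiresApproval { level } => format!("queued for {level} approval"),
    }
}

fn main() {
    assert_eq!(dispatch(PolicyDecision::Allowed), "executing action");
    assert_eq!(
        dispatch(PolicyDecision::RequiresApproval { level: "manager".into() }),
        "queued for manager approval"
    );
}
```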

Next Steps

Components

Detailed description of each major component in Triage Warden.

tw-api

The HTTP server and web interface.

REST API Routes

| Route | Description |
| --- | --- |
| GET /api/incidents | List incidents with filtering |
| POST /api/incidents | Create new incident |
| GET /api/incidents/:id | Get incident details |
| POST /api/incidents/:id/actions | Execute action on incident |
| GET /api/playbooks | List playbooks |
| POST /api/webhooks/:source | Receive webhook events |

Web Handlers

Server-rendered pages using HTMX and Askama templates:

  • Dashboard with KPIs
  • Incident list and detail views
  • Approval workflow interface
  • Playbook management
  • Settings configuration

Authentication

  • Session-based auth for web dashboard
  • API key auth for programmatic access
  • Role-based access control (admin, analyst, viewer)

tw-core

Core domain logic and data access.

Domain Models

pub struct Incident {
    pub id: Uuid,
    pub incident_type: IncidentType,
    pub severity: Severity,
    pub status: IncidentStatus,
    pub source: String,
    pub raw_data: serde_json::Value,
    pub verdict: Option<Verdict>,
    pub confidence: Option<f64>,
    pub created_at: DateTime<Utc>,
}

pub struct Action {
    pub id: Uuid,
    pub incident_id: Uuid,
    pub action_type: ActionType,
    pub status: ActionStatus,
    pub approval_level: Option<ApprovalLevel>,
    pub executed_by: Option<String>,
}

Repositories

Database access layer with SQLite and PostgreSQL support:

  • IncidentRepository
  • ActionRepository
  • PlaybookRepository
  • UserRepository
  • AuditRepository

Event Bus

Async event distribution:

pub enum Event {
    IncidentCreated { id: Uuid },
    IncidentUpdated { id: Uuid },
    ActionRequested { id: Uuid },
    ActionApproved { id: Uuid, approver: String },
    ActionExecuted { id: Uuid, success: bool },
}

tw-actions

Action handlers for incident response.

Email Actions

| Action | Description |
| --- | --- |
| parse_email | Extract headers, body, attachments |
| check_email_authentication | Validate SPF/DKIM/DMARC |
| quarantine_email | Move to quarantine |
| block_sender | Add to blocklist |

Lookup Actions

| Action | Description |
| --- | --- |
| lookup_sender_reputation | Check sender against threat intel |
| lookup_urls | Analyze URLs in content |
| lookup_attachments | Hash and check attachments |

Host Actions

| Action | Description |
| --- | --- |
| isolate_host | Network isolation via EDR |
| scan_host | Trigger endpoint scan |

Notification Actions

| Action | Description |
| --- | --- |
| notify_user | Send user notification |
| notify_reporter | Update incident reporter |
| escalate | Route to approval level |
| create_ticket | Create Jira ticket |

tw-policy

Policy engine for action approval.

Rule Evaluation

pub struct PolicyRule {
    pub name: String,
    pub action_type: ActionType,
    pub conditions: Vec<Condition>,
    pub approval_level: ApprovalLevel,
}

pub enum PolicyDecision {
    Allowed,
    Denied { reason: String },
    RequiresApproval { level: ApprovalLevel },
}

Approval Levels

  1. Auto - No approval required
  2. Analyst - Any analyst can approve
  3. Senior - Senior analyst required
  4. Manager - SOC manager required
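
One simple way to model these levels is an ordered enum, so "can this user approve that request" becomes a comparison. A sketch, assuming higher levels subsume lower ones (an assumption about the approval semantics, not taken from tw-policy):

```python
from enum import IntEnum

class ApprovalLevel(IntEnum):
    AUTO = 1      # no approval required
    ANALYST = 2   # any analyst can approve
    SENIOR = 3    # senior analyst required
    MANAGER = 4   # SOC manager required

def can_approve(user_level: ApprovalLevel, required: ApprovalLevel) -> bool:
    """A user may approve any request at or below their own level."""
    return user_level >= required

print(can_approve(ApprovalLevel.SENIOR, ApprovalLevel.ANALYST))   # True
print(can_approve(ApprovalLevel.ANALYST, ApprovalLevel.MANAGER))  # False
```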

tw-connectors

External service integrations.

Connector Trait

#[async_trait]
pub trait Connector: Send + Sync {
    fn name(&self) -> &str;
    fn connector_type(&self) -> &str;
    async fn health_check(&self) -> ConnectorResult<ConnectorHealth>;
    async fn test_connection(&self) -> ConnectorResult<bool>;
}

Available Connectors

| Type | Implementations |
| --- | --- |
| Threat Intel | VirusTotal, Mock |
| SIEM | Splunk, Mock |
| EDR | CrowdStrike, Mock |
| Email Gateway | Microsoft 365, Mock |
| Ticketing | Jira, Mock |

tw-bridge

PyO3 bindings for Python integration.

Exposed Classes

from tw_bridge import ThreatIntelBridge, SIEMBridge, EDRBridge

# Use connectors from Python
threat_intel = ThreatIntelBridge("virustotal")
result = threat_intel.lookup_hash("abc123...")

tw_ai (Python)

AI triage and playbook execution.

Triage Agent

Claude-powered agent for incident analysis:

agent = TriageAgent(model="claude-sonnet-4-20250514")
verdict = await agent.analyze(incident)
# Returns: Verdict(classification="malicious", confidence=0.92, ...)

Playbook Engine

YAML-based playbook execution:

name: phishing_triage
steps:
  - action: parse_email
  - action: check_email_authentication
  - action: lookup_sender_reputation
  - condition: sender_reputation < 0.3
    action: quarantine_email
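
Conditions like `sender_reputation < 0.3` can be checked against the results accumulated by earlier steps. A minimal sketch of such an evaluator (the real playbook engine's condition syntax may be richer than this `field op number` form):

```python
import operator
import re

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq,
       "<": operator.lt, ">": operator.gt}

def check(condition: str, context: dict) -> bool:
    """Evaluate a 'field op number' condition against prior step results."""
    m = re.fullmatch(r"(\w+)\s*(<=|>=|==|<|>)\s*([\d.]+)", condition.strip())
    if not m:
        raise ValueError(f"unsupported condition: {condition}")
    field, op, value = m.groups()
    return OPS[op](context[field], float(value))

print(check("sender_reputation < 0.3", {"sender_reputation": 0.12}))  # True
```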

Data Flow

How data moves through Triage Warden from incident creation to resolution.

Incident Lifecycle

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Created   │────▶│   Triaging  │────▶│   Triaged   │────▶│  Resolved   │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │                   │
       ▼                   ▼                   ▼                   ▼
   Webhook/API        AI Agent          Actions Executed      Closed
   receives data      analyzes          (with approval)

Detailed Flow

1. Incident Creation

External Source (Email Gateway, SIEM, EDR)
                    │
                    ▼
            Webhook Endpoint
            /api/webhooks/:source
                    │
                    ▼
         ┌──────────────────┐
         │  Parse & Validate │
         │  Incoming Data    │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Create Incident   │
         │ Record in DB      │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Publish Event:    │
         │ IncidentCreated   │
         └──────────────────┘

2. AI Triage

         ┌──────────────────┐
         │ Event: Incident   │
         │ Created           │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Load Playbook     │
         │ (based on type)   │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Execute Playbook  │
         │ Steps             │
         └──────────────────┘
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
┌───────────────┐       ┌───────────────┐
│ Enrichment    │       │ AI Analysis   │
│ Actions       │       │ (Claude)      │
│ - parse_email │       │               │
│ - lookup_*    │       │ Generates:    │
└───────────────┘       │ - Verdict     │
        │               │ - Confidence  │
        │               │ - Reasoning   │
        │               │ - Actions     │
        └───────┬───────┴───────────────┘
                │
                ▼
         ┌──────────────────┐
         │ Update Incident   │
         │ with Verdict      │
         └──────────────────┘

3. Action Execution

         ┌──────────────────┐
         │ Action Request    │
         │ (from agent or    │
         │  human)           │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Build Action      │
         │ Context           │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Policy Engine     │
         │ Evaluation        │
         └──────────────────┘
                    │
        ┌───────────┼───────────┐
        ▼           ▼           ▼
   ┌────────┐  ┌────────┐  ┌────────┐
   │Allowed │  │Denied  │  │Requires│
   │        │  │        │  │Approval│
   └────────┘  └────────┘  └────────┘
        │           │           │
        ▼           ▼           ▼
   Execute      Return       Queue for
   Action       Error        Approval
        │                       │
        │                       ▼
        │              ┌──────────────┐
        │              │ Notify       │
        │              │ Approvers    │
        │              └──────────────┘
        │                       │
        │                       ▼
        │              ┌──────────────┐
        │              │ Wait for     │
        │              │ Approval     │
        │              └──────────────┘
        │                       │
        │        ┌──────────────┴──────────────┐
        │        ▼                             ▼
        │   ┌────────┐                    ┌────────┐
        │   │Approved│                    │Rejected│
        │   └────────┘                    └────────┘
        │        │                             │
        │        ▼                             ▼
        │   Execute Action               Update Status
        │        │
        └────────┴─────────┐
                           ▼
                  ┌──────────────┐
                  │ Connector    │
                  │ Execution    │
                  │ (External    │
                  │  Service)    │
                  └──────────────┘
                           │
                           ▼
                  ┌──────────────┐
                  │ Update       │
                  │ Action       │
                  │ Status       │
                  └──────────────┘
                           │
                           ▼
                  ┌──────────────┐
                  │ Audit Log    │
                  │ Entry        │
                  └──────────────┘

Data Stores

Primary Database

| Table | Purpose |
| --- | --- |
| incidents | Incident records |
| actions | Action requests and results |
| playbooks | Playbook definitions |
| users | User accounts |
| sessions | Active sessions |
| api_keys | API credentials |
| audit_logs | Action audit trail |
| connectors | Connector configurations |
| policies | Policy rules |
| notifications | Notification history |
| settings | System settings |

Event Bus (In-Memory)

Transient event distribution for real-time updates:

  • Incident lifecycle events
  • Action status changes
  • Approval notifications
  • System health events

External Data Flow

Inbound (Webhooks)

Email Gateway ──────┐
SIEM Alerts ────────┼──▶ Webhook Handler ──▶ Incident Creation
EDR Events ─────────┘

Outbound (Connectors)

                           ┌──▶ VirusTotal (threat intel)
Action Execution ──────────┼──▶ Splunk (SIEM queries)
                           ├──▶ CrowdStrike (host actions)
                           ├──▶ M365 (email actions)
                           └──▶ Jira (ticketing)

Metrics Flow

Rust Components ───┬──▶ Prometheus Registry ──▶ /metrics endpoint
Python Components ─┘

Exposed metrics:

  • triage_warden_incidents_total{type, severity}
  • triage_warden_actions_total{action, status}
  • triage_warden_triage_duration_seconds{type}
  • triage_warden_connector_requests_total{connector, status}

Security Model

Triage Warden implements defense-in-depth with multiple security layers.

Authentication

Web Dashboard

Session-based authentication with secure cookies:

  • Session tokens: Random 256-bit tokens
  • Cookie settings: HttpOnly, Secure, SameSite=Lax
  • Session duration: 8 hours (configurable)
  • CSRF protection: Per-request tokens on all state-changing forms

API Access

API key authentication for programmatic access:

curl -H "Authorization: Bearer tw_abc123_secretkey" \
  https://api.example.com/api/incidents

API key features:

  • Prefix stored in plain text for lookup (tw_abc123)
  • Secret portion hashed with Argon2
  • Scopes limit allowed operations
  • Expiration dates supported
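
The prefix/secret split can be sketched as follows. SHA-256 is used here for brevity (the text above mentions Argon2 for the secret portion), and the exact key format is illustrative:

```python
import hashlib
import secrets

def issue_key():
    """Generate a key shaped like tw_<prefix>_<secret>; store only prefix + hash."""
    prefix = f"tw_{secrets.token_hex(4)}"
    secret = secrets.token_urlsafe(32)
    key = f"{prefix}_{secret}"
    record = {"prefix": prefix,
              "hash": hashlib.sha256(key.encode()).hexdigest()}
    return key, record

def verify(key: str, store: dict) -> bool:
    """Locate the record by plain-text prefix, then compare hashes."""
    prefix = "_".join(key.split("_")[:2])   # e.g. "tw_abc123"
    record = store.get(prefix)
    return (record is not None and
            record["hash"] == hashlib.sha256(key.encode()).hexdigest())

key, record = issue_key()
store = {record["prefix"]: record}
print(verify(key, store))                   # True
print(verify("tw_abc123_wrongkey", store))  # False
```

The plain-text prefix makes lookup cheap while keeping the full key unrecoverable from the database.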

Authorization

Role-Based Access Control (RBAC)

| Role | Capabilities |
| --- | --- |
| Viewer | Read incidents, view dashboards |
| Analyst | Viewer + execute low-risk actions, approve analyst-level |
| Senior Analyst | Analyst + execute medium-risk actions, approve senior-level |
| Admin | Full access, user management, system configuration |

Policy-Based Action Control

The policy engine evaluates every action request:

Policy evaluation flow:

ActionRequest
    → Build ActionContext (action_type, target, severity, proposer)
    → Evaluate policy rules
    → Return PolicyDecision
        - Allowed: Execute immediately
        - Denied: Return error with reason
        - RequiresApproval: Queue for specified approval level

Example Policy Rules

# Low-risk actions auto-approve
[[policy.rules]]
name = "auto_approve_lookups"
action_patterns = ["lookup_*"]
decision = "allowed"

# High-severity host isolation requires manager
[[policy.rules]]
name = "isolate_requires_manager"
action = "isolate_host"
severity = ["high", "critical"]
approval_level = "manager"

# Block dangerous actions on production
[[policy.rules]]
name = "no_delete_in_prod"
action_patterns = ["delete_*"]
environment = "production"
decision = "denied"
reason = "Deletion not allowed in production"
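
How `action_patterns` and the severity/environment filters might be matched, sketched with shell-style globbing (an assumption; the actual matching semantics live in tw-policy):

```python
from fnmatch import fnmatch

def rule_matches(rule: dict, action: str, severity=None, environment=None) -> bool:
    """Match a request against a rule's action patterns and optional filters."""
    patterns = rule.get("action_patterns") or [rule.get("action", "")]
    if not any(fnmatch(action, p) for p in patterns):
        return False
    if rule.get("severity") and severity not in rule["severity"]:
        return False
    if rule.get("environment") and environment != rule["environment"]:
        return False
    return True

rule = {"action_patterns": ["lookup_*"]}
print(rule_matches(rule, "lookup_urls"))   # True
print(rule_matches(rule, "isolate_host"))  # False
```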

Multi-Tenant Isolation

Triage Warden supports multi-tenancy with strong data isolation guarantees.

Row-Level Security (RLS)

PostgreSQL Row-Level Security provides database-level tenant isolation:

-- Each table has RLS policies that filter by tenant
-- Application sets tenant context at the start of each request
SELECT set_tenant_context('tenant-uuid-here');

-- All subsequent queries automatically filtered
SELECT * FROM incidents;  -- Only returns current tenant's data

Key Features:

| Feature | Description |
| --- | --- |
| Automatic filtering | All SELECT/UPDATE/DELETE queries filtered by tenant |
| Insert validation | INSERT must match current tenant context |
| Fail-secure | No tenant context = no data access |
| Defense-in-depth | Database enforces isolation even if app has bugs |

Tenant Context Management

The application manages tenant context through several mechanisms:

  1. Request Middleware: Resolves tenant from subdomain, header, or JWT
  2. Session Variable: Sets app.current_tenant on each database connection
  3. Context Guard: RAII pattern ensures cleanup

// Using the tenant context guard
async fn handle_request(pool: &TenantAwarePool, tenant_id: Uuid) -> anyhow::Result<()> {
    let _guard = TenantContextGuard::new(pool, tenant_id).await?;

    // All queries here are automatically filtered by tenant
    let incidents = incident_repo.list_all().await?;

    // Context cleared when the guard drops
    Ok(())
}

Admin Operations

Admin operations that need to bypass RLS use a separate connection pool:

  • Admin pool: Superuser role that bypasses RLS policies
  • Use cases: Tenant management, cross-tenant reporting, maintenance
  • Access control: Restricted to Admin role users only

Tables Protected by RLS

All tenant-scoped data tables have RLS enabled:

  • incidents, actions, approvals, audit_logs
  • users, api_keys, sessions
  • playbooks, policies, connectors
  • notification_channels, settings

System tables (tenants, feature_flags) do NOT have RLS.

Debugging RLS Issues

-- Check current tenant context
SELECT get_current_tenant();

-- View RLS policies for a table
SELECT * FROM pg_policies WHERE tablename = 'incidents';

-- Check if RLS is enabled
SELECT relname, relrowsecurity
FROM pg_class
WHERE relname IN ('incidents', 'tenants');

Data Protection

At Rest

  • Database encryption: SQLite with SQLCipher (optional), PostgreSQL with TDE
  • Credential storage: All API keys/tokens hashed with Argon2id
  • Secrets management: Environment variables or external secret stores

In Transit

  • TLS 1.3: Required for all external connections
  • Certificate validation: Strict validation for connectors
  • Internal traffic: TLS optional for localhost development

Sensitive Data Handling

// Credentials redacted in logs
impl std::fmt::Debug for ApiKey {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "ApiKey {{ prefix: {}, secret: [REDACTED] }}", self.prefix)
    }
}

Audit Trail

All security-relevant actions logged:

| Event | Data Captured |
| --- | --- |
| Login | user_id, ip_address, success, timestamp |
| Logout | user_id, session_duration |
| Action executed | action_id, user_id, incident_id, result |
| Action approved | action_id, approver_id, decision |
| Policy change | user_id, old_value, new_value |
| User management | admin_id, target_user, operation |

Audit log retention: 90 days (configurable)

Connector Security

Credential Management

Connector credentials stored encrypted:

# Environment variables (recommended)
TW_VIRUSTOTAL_API_KEY=your-key

# Or store encrypted in the database (read the key without echoing it)
read -rs TW_KEY && tw-cli connector set virustotal --api-key "$TW_KEY"

Rate Limiting

Built-in rate limiting prevents API abuse:

| Connector | Default Limit |
| --- | --- |
| VirusTotal | 4 req/min (free tier) |
| Splunk | 100 req/min |
| CrowdStrike | 50 req/min |
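
A token bucket is a common way to implement such per-connector limits: the bucket refills continuously at the configured rate, and a request proceeds only if a whole token is available. A minimal sketch (not the actual tw-connectors implementation):

```python
import time

class TokenBucket:
    """Simple token-bucket limiter: `rate` requests per `per` seconds."""
    def __init__(self, rate, per, clock=time.monotonic):
        self.capacity = rate
        self.tokens = float(rate)
        self.fill_rate = rate / per
        self.clock = clock
        self.last = clock()

    def allow(self):
        # Refill proportionally to elapsed time, capped at capacity
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.fill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# VirusTotal free tier: 4 requests per minute
bucket = TokenBucket(rate=4, per=60)
print([bucket.allow() for _ in range(5)])  # [True, True, True, True, False]
```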

Circuit Breaker

Automatic failure handling:

  • After 5 consecutive failures, the circuit opens
  • While open, requests fail fast for 30 seconds
  • A half-open state then admits test requests before fully closing
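
The three states and the transitions between them can be sketched as follows (illustrative only; thresholds match the defaults described above):

```python
import time

class CircuitBreaker:
    """Closed → Open after `threshold` consecutive failures; Half-Open after `cooldown`."""
    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half_open"   # allow a test request through
        return "open"            # fail fast

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()

cb = CircuitBreaker()
for _ in range(5):
    cb.record(success=False)
print(cb.state())  # open
```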

Input Validation

API Requests

  • JSON schema validation on all endpoints
  • Size limits on request bodies (1MB default)
  • Type coercion disabled (strict typing)

Webhook Payloads

  • HMAC signature verification
  • Replay attack prevention (timestamp validation)
  • Payload size limits

// Webhook signature verification (hmac_sha256 and constant_time_compare
// are illustrative helpers)
fn verify_webhook(payload: &[u8], signature: &str, secret: &str) -> bool {
    let expected = hmac_sha256(secret, payload);
    constant_time_compare(signature, &expected)
}
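
Python's standard library can express the same check; `hmac.compare_digest` provides the constant-time comparison that defeats timing attacks on the signature:

```python
import hashlib
import hmac

def sign(secret: bytes, payload: bytes) -> str:
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify_webhook(payload: bytes, signature: str, secret: bytes) -> bool:
    # compare_digest runs in constant time regardless of where strings differ
    return hmac.compare_digest(signature, sign(secret, payload))

payload = b'{"event": "alert"}'
secret = b"webhook-shared-secret"
print(verify_webhook(payload, sign(secret, payload), secret))  # True
print(verify_webhook(payload, "forged-signature", secret))     # False
```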

Secure Defaults

  • HTTPS enforced in production
  • Secure cookie flags enabled
  • CORS restricted to configured origins
  • Debug endpoints disabled in production
  • Verbose errors only in development

Security Headers

Default response headers:

Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
Content-Security-Policy: default-src 'self'

Vulnerability Disclosure

Report security vulnerabilities to: [email protected]

We follow responsible disclosure practices and aim to respond within 48 hours.

Database Schema

Triage Warden supports both SQLite (development/small deployments) and PostgreSQL (production). This document describes the database schema used by both backends.

Overview

The database consists of 14 tables organized into four logical groups:

  • Core Incident Management: incidents, audit_logs, actions, approvals
  • Configuration: playbooks, connectors, policies, notification_channels, settings
  • Authentication: users, sessions, api_keys
  • Multi-Tenancy: tenants, feature_flags

Multi-Tenancy

All tenant-scoped tables include a tenant_id foreign key that references the tenants table. In PostgreSQL, Row-Level Security (RLS) policies automatically filter all queries by the current tenant context.

tenants

Tenant organizations in a multi-tenant deployment.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Organization display name |
| slug | TEXT | UNIQUE, NOT NULL | URL-safe identifier for routing |
| status | ENUM/TEXT | DEFAULT 'active' | active, suspended, pending_deletion |
| settings | JSON/TEXT | DEFAULT '{}' | Tenant-specific settings |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: slug (unique), status

feature_flags

Feature flag configuration for gradual rollouts.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| name | TEXT | PRIMARY KEY | Flag name |
| description | TEXT | DEFAULT '' | Flag description |
| default_enabled | BOOLEAN | DEFAULT FALSE | Default state |
| tenant_overrides | JSON | DEFAULT '{}' | Per-tenant overrides |
| percentage_rollout | INTEGER | NULLABLE | 0-100 percentage rollout |
| created_at | TIMESTAMP | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP | NOT NULL | Last update timestamp |

Note: The tenants and feature_flags tables are NOT protected by RLS.

Entity Relationship Diagram

┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│    users     │       │   api_keys   │       │   sessions   │
├──────────────┤       ├──────────────┤       ├──────────────┤
│ id (PK)      │◄──────│ user_id (FK) │       │ id (PK)      │
│ email        │       │ id (PK)      │       │ data         │
│ username     │       │ key_hash     │       │ expiry_date  │
│ password_hash│       │ scopes       │       └──────────────┘
│ role         │       └──────────────┘
└──────────────┘

┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│  incidents   │       │  audit_logs  │       │   actions    │
├──────────────┤       ├──────────────┤       ├──────────────┤
│ id (PK)      │◄──────│ incident_id  │       │ id (PK)      │
│ source       │       │ id (PK)      │       │ incident_id  │──┐
│ severity     │       │ action       │       │ action_type  │  │
│ status       │◄──────│ actor        │       │ target       │  │
│ alert_data   │       │ details      │       │ approval_status│ │
│ enrichments  │       │ created_at   │       └──────────────┘  │
│ analysis     │       └──────────────┘                         │
│ proposed_actions│                                             │
│ ticket_id    │       ┌──────────────┐                         │
│ tags         │       │  approvals   │◄────────────────────────┘
│ metadata     │       ├──────────────┤
└──────────────┘       │ id (PK)      │
                       │ action_id    │
                       │ incident_id  │
                       │ status       │
                       └──────────────┘

Core Tables

incidents

Stores security incidents created from incoming alerts.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| tenant_id | UUID/TEXT | FK → tenants, NOT NULL | Owning tenant |
| source | JSON/TEXT | NOT NULL | Alert source metadata |
| severity | ENUM/TEXT | NOT NULL | info, low, medium, high, critical |
| status | ENUM/TEXT | NOT NULL | See Status Values |
| alert_data | JSON/TEXT | NOT NULL | Original alert payload |
| enrichments | JSON/TEXT | DEFAULT '[]' | Array of enrichment results |
| analysis | JSON/TEXT | NULLABLE | AI triage analysis |
| proposed_actions | JSON/TEXT | DEFAULT '[]' | Array of proposed actions |
| ticket_id | TEXT | NULLABLE | External ticket reference |
| tags | JSON/TEXT | DEFAULT '[]' | User-defined tags |
| metadata | JSON/TEXT | DEFAULT '{}' | Additional metadata |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: (tenant_id, status), (tenant_id, severity), (tenant_id, created_at), status, severity, created_at, updated_at

RLS: Protected by Row-Level Security in PostgreSQL.

Incident Status Values

  • new - Newly created from alert
  • enriching - Gathering threat intelligence
  • analyzing - AI analysis in progress
  • pending_review - Awaiting analyst review
  • pending_approval - Actions awaiting approval
  • executing - Actions being executed
  • resolved - Incident resolved
  • false_positive - Marked as false positive
  • escalated - Escalated to higher tier
  • closed - Administratively closed
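
These statuses imply a state machine. The transition map below is illustrative, inferred from the lifecycle description rather than taken from the code:

```python
# Hypothetical transition map; the authoritative rules live in tw-core
TRANSITIONS = {
    "new":              {"enriching", "closed", "false_positive"},
    "enriching":        {"analyzing"},
    "analyzing":        {"pending_review", "pending_approval"},
    "pending_review":   {"pending_approval", "resolved", "false_positive", "escalated"},
    "pending_approval": {"executing", "pending_review"},
    "executing":        {"resolved", "escalated"},
}

def can_transition(current: str, target: str) -> bool:
    """Terminal states (resolved, closed, …) have no outgoing transitions."""
    return target in TRANSITIONS.get(current, set())

print(can_transition("new", "enriching"))  # True
print(can_transition("new", "resolved"))   # False
```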

audit_logs

Immutable audit trail for all incident actions.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| incident_id | UUID/TEXT | FK → incidents | Parent incident |
| action | TEXT | NOT NULL | Action type (status_changed, action_approved, etc.) |
| actor | TEXT | NOT NULL | Username or "system" |
| details | JSON/TEXT | NULLABLE | Action-specific details |
| created_at | TIMESTAMP/TEXT | NOT NULL | Action timestamp |

Indexes: incident_id, created_at

actions

Stores proposed and executed response actions.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| incident_id | UUID/TEXT | FK → incidents | Parent incident |
| action_type | TEXT | NOT NULL | isolate_host, disable_user, block_ip, etc. |
| target | JSON/TEXT | NOT NULL | Action target details |
| parameters | JSON/TEXT | DEFAULT '{}' | Action parameters |
| reason | TEXT | NOT NULL | Justification for action |
| priority | INTEGER | DEFAULT 50 | Execution priority (1-100) |
| approval_status | ENUM/TEXT | NOT NULL | See Approval Status Values |
| approved_by | TEXT | NULLABLE | Approving user |
| approval_timestamp | TIMESTAMP/TEXT | NULLABLE | Approval time |
| result | JSON/TEXT | NULLABLE | Execution result |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| executed_at | TIMESTAMP/TEXT | NULLABLE | Execution timestamp |

Indexes: incident_id, approval_status, created_at

Approval Status Values

  • pending - Awaiting approval decision
  • auto_approved - Automatically approved by policy
  • approved - Manually approved
  • denied - Manually denied
  • executed - Successfully executed
  • failed - Execution failed

approvals

Tracks multi-level approval workflows.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| action_id | UUID/TEXT | FK → actions | Related action |
| incident_id | UUID/TEXT | FK → incidents | Parent incident |
| approval_level | TEXT | NOT NULL | analyst, senior, manager, executive |
| status | ENUM/TEXT | NOT NULL | pending, approved, denied, expired |
| requested_by | TEXT | NOT NULL | Requesting user/system |
| requested_at | TIMESTAMP/TEXT | NOT NULL | Request timestamp |
| decided_by | TEXT | NULLABLE | Deciding user |
| decided_at | TIMESTAMP/TEXT | NULLABLE | Decision timestamp |
| decision_reason | TEXT | NULLABLE | Optional reason |
| expires_at | TIMESTAMP/TEXT | NULLABLE | Approval expiration |

Indexes: action_id, status, expires_at

Configuration Tables

playbooks

Automation workflow definitions.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Playbook name |
| description | TEXT | NULLABLE | Description |
| trigger_type | TEXT | NOT NULL | alert_type, severity, source, manual |
| trigger_condition | TEXT | NULLABLE | Trigger condition expression |
| stages | JSON/TEXT | DEFAULT '[]' | Array of workflow stages |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| execution_count | INTEGER | DEFAULT 0 | Times executed |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: name, trigger_type, enabled, created_at

connectors

External integration configurations.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Display name |
| connector_type | TEXT | NOT NULL | virus_total, jira, splunk, etc. |
| config | JSON/TEXT | DEFAULT '{}' | Connection configuration (encrypted credentials) |
| status | TEXT | DEFAULT 'unknown' | connected, disconnected, error, unknown |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| last_health_check | TIMESTAMP/TEXT | NULLABLE | Last health check time |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: name, connector_type, status, enabled

policies

Approval and automation policy rules.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Policy name |
| description | TEXT | NULLABLE | Description |
| condition | TEXT | NOT NULL | Condition expression |
| action | TEXT | NOT NULL | auto_approve, require_approval, deny |
| approval_level | TEXT | NULLABLE | Required approval level |
| priority | INTEGER | DEFAULT 0 | Evaluation priority |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: name, action, priority, enabled

notification_channels

Alert notification configurations.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Channel name |
| channel_type | TEXT | NOT NULL | slack, teams, email, pagerduty, webhook |
| config | JSON/TEXT | DEFAULT '{}' | Channel configuration |
| events | JSON/TEXT | DEFAULT '[]' | Subscribed event types |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: name, channel_type, enabled

settings

Key-value configuration store.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| key | TEXT | PRIMARY KEY | Setting key (general, rate_limits, llm) |
| value | JSON/TEXT | NOT NULL | Setting value as JSON |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Authentication Tables

users

User accounts for dashboard and API access.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| email | TEXT | UNIQUE, NOT NULL | Email address |
| username | TEXT | UNIQUE, NOT NULL | Login username |
| password_hash | TEXT | NOT NULL | Argon2 password hash |
| role | ENUM/TEXT | NOT NULL | admin, analyst, viewer |
| display_name | TEXT | NULLABLE | Display name |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Account active status |
| last_login_at | TIMESTAMP/TEXT | NULLABLE | Last login timestamp |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: email, username, role, enabled

sessions

User session storage (tower-sessions compatible).

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | TEXT | PRIMARY KEY | Session ID |
| data | BLOB | NOT NULL | Encrypted session data |
| expiry_date | INTEGER | NOT NULL | Unix timestamp expiration |

Indexes: expiry_date

api_keys

API key authentication.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| user_id | UUID/TEXT | FK → users | Owner user |
| name | TEXT | NOT NULL | Key display name |
| key_hash | TEXT | NOT NULL | SHA-256 hash of key |
| key_prefix | TEXT | NOT NULL | First 8 chars for identification |
| scopes | JSON/TEXT | DEFAULT '[]' | Allowed API scopes |
| expires_at | TIMESTAMP/TEXT | NULLABLE | Key expiration |
| last_used_at | TIMESTAMP/TEXT | NULLABLE | Last usage timestamp |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |

Indexes: user_id, key_prefix, expires_at

Database-Specific Notes

SQLite

  • UUIDs stored as TEXT
  • Timestamps stored as ISO 8601 TEXT
  • Boolean stored as INTEGER (0/1)
  • JSON stored as TEXT
  • Uses CHECK constraints for enums

PostgreSQL

  • Native UUID type
  • Native TIMESTAMPTZ type
  • Native BOOLEAN type
  • Native JSONB type with indexing
  • Uses custom ENUM types for status fields
  • Row-Level Security (RLS) enabled on all tenant-scoped tables

Row-Level Security

PostgreSQL deployments use RLS for defense-in-depth tenant isolation:

-- RLS policy example (automatically applied to all queries)
CREATE POLICY incidents_select_tenant_isolation ON incidents
    FOR SELECT
    USING (tenant_id = current_setting('app.current_tenant', true)::uuid);

To set the tenant context:

-- Set before executing tenant-scoped queries
SELECT set_tenant_context('00000000-0000-0000-0000-000000000001'::uuid);

-- Or use the session variable directly
SET app.current_tenant = '00000000-0000-0000-0000-000000000001';

Helper functions:

| Function | Description |
| --- | --- |
| set_tenant_context(uuid) | Sets tenant context, returns previous value |
| get_current_tenant() | Returns current tenant UUID or NULL |
| clear_tenant_context() | Clears tenant context |

Migrations

Migrations are managed by SQLx and located in:

  • SQLite: crates/tw-core/src/db/migrations/sqlite/
  • PostgreSQL: crates/tw-core/src/db/migrations/postgres/

Run migrations automatically on startup or manually:

# SQLite
tw-cli db migrate --database-url "sqlite:data/triage.db"

# PostgreSQL
tw-cli db migrate --database-url "postgres://user:pass@host/db"

Connectors

Connectors integrate Triage Warden with external security tools and services.

Overview

Each connector type has a trait interface and multiple implementations:

| Type | Purpose | Implementations |
| --- | --- | --- |
| Threat Intelligence | Hash/URL/domain reputation | VirusTotal, Mock |
| SIEM | Log queries and correlation | Splunk, Mock |
| EDR | Endpoint detection and response | CrowdStrike, Mock |
| Email Gateway | Email security operations | Microsoft 365, Mock |
| Ticketing | Incident ticket management | Jira, Mock |

Configuration

Select connector implementations via environment variables:

# Use real connectors
TW_THREAT_INTEL_MODE=virustotal
TW_SIEM_MODE=splunk
TW_EDR_MODE=crowdstrike
TW_EMAIL_GATEWAY_MODE=m365
TW_TICKETING_MODE=jira

# Or use mocks for testing
TW_THREAT_INTEL_MODE=mock
TW_SIEM_MODE=mock
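
A factory that selects the implementation from these variables might look like the sketch below. The class names are hypothetical; only the `TW_*_MODE` variables come from the configuration above:

```python
import os

class MockThreatIntel:
    """Offline stand-in selected when TW_THREAT_INTEL_MODE=mock."""
    def lookup_hash(self, h: str) -> dict:
        return {"indicator": h, "malicious": False, "confidence": 0.0}

class VirusTotalThreatIntel:
    """Placeholder for the real connector; would call the VirusTotal API."""
    def __init__(self, api_key: str):
        self.api_key = api_key

def make_threat_intel(env=None):
    """Pick an implementation based on TW_THREAT_INTEL_MODE (default: mock)."""
    env = os.environ if env is None else env
    if env.get("TW_THREAT_INTEL_MODE", "mock") == "virustotal":
        return VirusTotalThreatIntel(env["TW_VIRUSTOTAL_API_KEY"])
    return MockThreatIntel()

connector = make_threat_intel({"TW_THREAT_INTEL_MODE": "mock"})
print(type(connector).__name__)  # MockThreatIntel
```

Defaulting to the mock keeps tests and local development from ever touching external services by accident.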

Connector Trait

All connectors implement the base Connector trait:

#[async_trait]
pub trait Connector: Send + Sync {
    /// Unique identifier for this connector instance
    fn name(&self) -> &str;

    /// Type of connector (threat_intel, siem, edr, etc.)
    fn connector_type(&self) -> &str;

    /// Check connector health
    async fn health_check(&self) -> ConnectorResult<ConnectorHealth>;

    /// Test connection to the service
    async fn test_connection(&self) -> ConnectorResult<bool>;
}

pub enum ConnectorHealth {
    Healthy,
    Degraded { message: String },
    Unhealthy { message: String },
}

Error Handling

Connectors return ConnectorResult<T> with detailed error types:

pub enum ConnectorError {
    /// Service returned an error
    RequestFailed(String),

    /// Resource not found
    NotFound(String),

    /// Authentication failed
    AuthenticationFailed(String),

    /// Rate limit exceeded
    RateLimited { retry_after: Option<Duration> },

    /// Network or connection error
    NetworkError(String),

    /// Invalid response from service
    InvalidResponse(String),
}

Health Monitoring

Check connector health via the API:

curl http://localhost:8080/api/connectors/health

{
  "connectors": [
    { "name": "virustotal", "type": "threat_intel", "status": "healthy" },
    { "name": "splunk", "type": "siem", "status": "healthy" },
    { "name": "crowdstrike", "type": "edr", "status": "degraded", "message": "High latency" }
  ]
}
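Callers often want a single rollup status from this response. A minimal sketch, assuming a worst-status-wins precedence (this precedence rule is an assumption, not documented behavior):

```python
def overall_status(connectors: list[dict]) -> str:
    """Worst status wins: unhealthy > degraded > healthy."""
    order = {"healthy": 0, "degraded": 1, "unhealthy": 2}
    return max((c["status"] for c in connectors),
               key=order.__getitem__, default="healthy")

health = {
    "connectors": [
        {"name": "virustotal", "type": "threat_intel", "status": "healthy"},
        {"name": "splunk", "type": "siem", "status": "healthy"},
        {"name": "crowdstrike", "type": "edr", "status": "degraded", "message": "High latency"},
    ]
}
print(overall_status(health["connectors"]))  # degraded
```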


Threat Intelligence Connector

Query threat intelligence services for reputation data on hashes, URLs, domains, and IP addresses.

Interface

#[async_trait]
pub trait ThreatIntelConnector: Connector {
    /// Look up file hash reputation
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport>;

    /// Look up URL reputation
    async fn lookup_url(&self, url: &str) -> ConnectorResult<ThreatReport>;

    /// Look up domain reputation
    async fn lookup_domain(&self, domain: &str) -> ConnectorResult<ThreatReport>;

    /// Look up IP address reputation
    async fn lookup_ip(&self, ip: &str) -> ConnectorResult<ThreatReport>;
}

pub struct ThreatReport {
    pub indicator: String,
    pub indicator_type: IndicatorType,
    pub malicious: bool,
    pub confidence: f64,
    pub categories: Vec<String>,
    pub first_seen: Option<DateTime<Utc>>,
    pub last_seen: Option<DateTime<Utc>>,
    pub sources: Vec<ThreatSource>,
}

VirusTotal

Configuration

TW_THREAT_INTEL_MODE=virustotal
TW_VIRUSTOTAL_API_KEY=your-api-key-here

Rate Limits

Tier    | Requests/Minute
------- | ---------------
Free    | 4
Premium | 500+

The connector automatically handles rate limiting with exponential backoff.

Supported Lookups

Method        | VT Endpoint        | Notes
------------- | ------------------ | ------------------
lookup_hash   | /files/{hash}      | MD5, SHA1, SHA256
lookup_url    | /urls/{url_id}     | Base64-encoded URL
lookup_domain | /domains/{domain}  | Domain reputation
lookup_ip     | /ip_addresses/{ip} | IP reputation

Example Usage

let connector = VirusTotalConnector::new(api_key)?;

let report = connector.lookup_hash("44d88612fea8a8f36de82e1278abb02f").await?;
println!("Malicious: {}", report.malicious);
println!("Confidence: {:.2}", report.confidence);
println!("Categories: {:?}", report.categories);

Response Mapping

VirusTotal detection ratios map to confidence scores:

Detection Ratio | Confidence | Classification
--------------- | ---------- | ----------------
0%              | 0.0        | Clean
1-10%           | 0.3        | Suspicious
11-50%          | 0.6        | Likely Malicious
51-100%         | 0.9        | Malicious
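The mapping above can be expressed as a small function; the function name and signature below are illustrative, not the connector's actual API:

```python
def classify(detections: int, total_engines: int) -> tuple[float, str]:
    """Map a VirusTotal-style detection ratio to (confidence, classification)."""
    ratio = 100 * detections / total_engines
    if ratio == 0:
        return 0.0, "Clean"
    if ratio <= 10:
        return 0.3, "Suspicious"
    if ratio <= 50:
        return 0.6, "Likely Malicious"
    return 0.9, "Malicious"

print(classify(45, 70))  # (0.9, 'Malicious')
```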

Mock Connector

For testing without external API calls:

TW_THREAT_INTEL_MODE=mock

The mock connector returns predictable results based on indicator patterns:

Pattern               | Result
--------------------- | --------------------------
Contains "malicious"  | Malicious, confidence 0.95
Contains "suspicious" | Suspicious, confidence 0.5
Contains "clean"      | Clean, confidence 0.1
Default               | Clean, confidence 0.2
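The mock's behavior is easy to reproduce for local test fixtures. A minimal sketch mirroring the table (treating only the "malicious" pattern as malicious=True is an assumption about the mock's output shape):

```python
def mock_lookup(indicator: str) -> dict:
    """Pattern-based mock verdict, mirroring the table above."""
    s = indicator.lower()
    if "malicious" in s:
        return {"malicious": True, "confidence": 0.95}
    if "suspicious" in s:
        return {"malicious": False, "confidence": 0.5}
    if "clean" in s:
        return {"malicious": False, "confidence": 0.1}
    return {"malicious": False, "confidence": 0.2}  # default: assumed clean

print(mock_lookup("https://malicious.example.com"))  # {'malicious': True, 'confidence': 0.95}
```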

Python Bridge

Access from Python via the bridge:

from tw_bridge import ThreatIntelBridge

# Create bridge (uses TW_THREAT_INTEL_MODE env var)
bridge = ThreatIntelBridge()

# Or specify mode explicitly
bridge = ThreatIntelBridge("virustotal")

# Lookup hash
result = bridge.lookup_hash("44d88612fea8a8f36de82e1278abb02f")
print(f"Malicious: {result['malicious']}")
print(f"Confidence: {result['confidence']}")

# Lookup URL
result = bridge.lookup_url("https://example.com/suspicious")

# Lookup domain
result = bridge.lookup_domain("malware-site.com")

Caching

Results are cached to reduce API calls:

Lookup Type | Cache Duration
----------- | --------------
Hash        | 24 hours
URL         | 1 hour
Domain      | 6 hours
IP          | 6 hours

Cache is stored in the database and shared across instances.
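The TTL behavior can be sketched with an in-memory stand-in (the real cache is database-backed and shared; this class is purely illustrative):

```python
import time

# TTLs per lookup type, matching the table above (in seconds).
TTL = {"hash": 24 * 3600, "url": 3600, "domain": 6 * 3600, "ip": 6 * 3600}

class LookupCache:
    """In-memory sketch of the lookup cache; `now` is injectable for testing."""
    def __init__(self):
        self._store = {}

    def put(self, kind: str, key: str, value, now=None):
        now = time.time() if now is None else now
        self._store[(kind, key)] = (now, value)

    def get(self, kind: str, key: str, now=None):
        now = time.time() if now is None else now
        entry = self._store.get((kind, key))
        if entry and now - entry[0] < TTL[kind]:
            return entry[1]
        return None  # missing or expired

cache = LookupCache()
cache.put("url", "https://example.com", {"malicious": False}, now=0)
print(cache.get("url", "https://example.com", now=1800))  # hit: {'malicious': False}
print(cache.get("url", "https://example.com", now=7200))  # expired -> None
```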

Adding Custom Providers

Implement the ThreatIntelConnector trait:

pub struct CustomThreatIntelConnector {
    client: reqwest::Client,
    api_key: String,
}

#[async_trait]
impl Connector for CustomThreatIntelConnector {
    fn name(&self) -> &str { "custom" }
    fn connector_type(&self) -> &str { "threat_intel" }
    // ... implement health_check, test_connection
}

#[async_trait]
impl ThreatIntelConnector for CustomThreatIntelConnector {
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport> {
        // Custom implementation
    }
    // ... implement other methods
}

See Adding Connectors for full details.

SIEM Connector

Query SIEM platforms for log data, run searches, and correlate events.

Interface

#[async_trait]
pub trait SIEMConnector: Connector {
    /// Run a search query
    async fn search(&self, query: &str, time_range: TimeRange) -> ConnectorResult<SearchResults>;

    /// Get events by ID
    async fn get_events(&self, event_ids: &[String]) -> ConnectorResult<Vec<SIEMEvent>>;

    /// Get related events (correlation)
    async fn get_related_events(
        &self,
        indicator: &str,
        indicator_type: IndicatorType,
        time_range: TimeRange,
    ) -> ConnectorResult<Vec<SIEMEvent>>;
}

pub struct SIEMEvent {
    pub id: String,
    pub timestamp: DateTime<Utc>,
    pub source: String,
    pub event_type: String,
    pub severity: String,
    pub raw_data: serde_json::Value,
}

pub struct SearchResults {
    pub events: Vec<SIEMEvent>,
    pub total_count: u64,
    pub search_id: String,
}

Splunk

Configuration

TW_SIEM_MODE=splunk
TW_SPLUNK_URL=https://splunk.company.com:8089
TW_SPLUNK_TOKEN=your-token-here

Token Permissions

The Splunk token requires these capabilities:

  • search - Run searches
  • list_inputs - Health check
  • rest_access - REST API access

Example Searches

let connector = SplunkConnector::new(url, token)?;

// Search for events
let results = connector.search(
    r#"index=security sourcetype=firewall action=blocked"#,
    TimeRange::last_hours(24),
).await?;

// Find related events by IP
let related = connector.get_related_events(
    "192.168.1.100",
    IndicatorType::IpAddress,
    TimeRange::last_hours(1),
).await?;

Search Query Translation

Common queries and their SPL translations:

Triage Warden Query | Splunk SPL
------------------- | -----------------------------------------------
IP correlation      | index=* src_ip="{ip}" OR dest_ip="{ip}"
User activity       | index=* user="{user}"
Hash lookup         | index=* (file_hash="{hash}" OR sha256="{hash}")
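A minimal sketch of that translation layer (a real implementation would also escape quotes in the indicator before substitution):

```python
def to_spl(indicator: str, indicator_type: str) -> str:
    """Translate an indicator into SPL, per the mapping above."""
    templates = {
        "ip": 'index=* src_ip="{i}" OR dest_ip="{i}"',
        "user": 'index=* user="{i}"',
        "hash": 'index=* (file_hash="{i}" OR sha256="{i}")',
    }
    return templates[indicator_type].format(i=indicator)

print(to_spl("192.168.1.100", "ip"))  # index=* src_ip="192.168.1.100" OR dest_ip="192.168.1.100"
```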

Performance Tips

  • Use specific indexes in queries
  • Limit time ranges when possible
  • Use | head 1000 to limit results

Mock Connector

For testing:

TW_SIEM_MODE=mock

The mock returns sample security events matching the query pattern.

Python Bridge

from tw_bridge import SIEMBridge

bridge = SIEMBridge("splunk")

# Run a search
results = bridge.search(
    query='index=security action=blocked',
    hours=24
)

for event in results['events']:
    print(f"{event['timestamp']}: {event['source']}")

# Get related events
related = bridge.get_related_events(
    indicator="192.168.1.100",
    indicator_type="ip",
    hours=1
)

Adding Custom SIEM

Implement the SIEMConnector trait:

pub struct ElasticSIEMConnector {
    client: elasticsearch::Elasticsearch,
}

#[async_trait]
impl SIEMConnector for ElasticSIEMConnector {
    async fn search(&self, query: &str, time_range: TimeRange) -> ConnectorResult<SearchResults> {
        // Translate to Elasticsearch DSL and execute
    }
    // ... implement other methods
}

EDR Connector

Integrate with Endpoint Detection and Response platforms for host information and response actions.

Interface

#[async_trait]
pub trait EDRConnector: Connector {
    /// Get host information
    async fn get_host(&self, host_id: &str) -> ConnectorResult<HostInfo>;

    /// Search for hosts
    async fn search_hosts(&self, query: &str) -> ConnectorResult<Vec<HostInfo>>;

    /// Get recent detections for a host
    async fn get_detections(&self, host_id: &str) -> ConnectorResult<Vec<Detection>>;

    /// Isolate a host from the network
    async fn isolate_host(&self, host_id: &str) -> ConnectorResult<ActionResult>;

    /// Remove host isolation
    async fn unisolate_host(&self, host_id: &str) -> ConnectorResult<ActionResult>;

    /// Trigger a scan on the host
    async fn scan_host(&self, host_id: &str) -> ConnectorResult<ActionResult>;
}

pub struct HostInfo {
    pub id: String,
    pub hostname: String,
    pub platform: String,
    pub os_version: String,
    pub agent_version: String,
    pub last_seen: DateTime<Utc>,
    pub isolation_status: IsolationStatus,
    pub tags: Vec<String>,
}

pub struct Detection {
    pub id: String,
    pub timestamp: DateTime<Utc>,
    pub severity: String,
    pub tactic: String,
    pub technique: String,
    pub description: String,
    pub process_name: Option<String>,
    pub file_path: Option<String>,
}

CrowdStrike

Configuration

TW_EDR_MODE=crowdstrike
TW_CROWDSTRIKE_CLIENT_ID=your-client-id
TW_CROWDSTRIKE_CLIENT_SECRET=your-client-secret
TW_CROWDSTRIKE_REGION=us-1  # us-1, us-2, eu-1, usgov-1

API Scopes Required

The API client requires these scopes:

  • Hosts: Read - Get host information
  • Hosts: Write - Isolation actions
  • Detections: Read - Get detections
  • Real Time Response: Write - Scan actions

OAuth2 Token Management

The connector automatically handles token refresh:

// Token refreshed automatically when expired
let connector = CrowdStrikeConnector::new(client_id, client_secret, region)?;

// All subsequent calls use valid token
let host = connector.get_host("abc123").await?;

Example Usage

// Get host information
let host = connector.get_host("aid:abc123").await?;
println!("Hostname: {}", host.hostname);
println!("Last seen: {}", host.last_seen);

// Check for detections
let detections = connector.get_detections("aid:abc123").await?;
for d in detections {
    println!("{}: {} - {}", d.timestamp, d.severity, d.description);
}

// Isolate compromised host
let result = connector.isolate_host("aid:abc123").await?;
if result.success {
    println!("Host isolated successfully");
}

Action Confirmation

Isolation and scan actions require policy approval. See Policy Engine.

Mock Connector

TW_EDR_MODE=mock

The mock provides sample hosts and detections for testing.

Python Bridge

from tw_bridge import EDRBridge

bridge = EDRBridge("crowdstrike")

# Get host info
host = bridge.get_host("aid:abc123")
print(f"Hostname: {host['hostname']}")
print(f"Platform: {host['platform']}")

# Get detections
detections = bridge.get_detections("aid:abc123")
for d in detections:
    print(f"{d['severity']}: {d['description']}")

# Isolate host (requires policy approval)
result = bridge.isolate_host("aid:abc123")
if result['success']:
    print("Host isolated")

Response Actions

Action       | Description       | Rollback
------------ | ----------------- | --------------
isolate_host | Network isolation | unisolate_host
scan_host    | On-demand scan    | N/A

Isolation Behavior

When isolated:

  • Host cannot communicate on network
  • Falcon agent maintains connection to cloud
  • User may see isolation notification

Rate Limits

Endpoint            | Limit
------------------- | -------
Host queries        | 100/min
Detection queries   | 50/min
Containment actions | 10/min

Email Gateway Connector

Manage email security operations including search, quarantine, and sender blocking.

Interface

#[async_trait]
pub trait EmailGatewayConnector: Connector {
    /// Search for emails
    async fn search_emails(&self, query: EmailSearchQuery) -> ConnectorResult<Vec<EmailMessage>>;

    /// Get specific email by ID
    async fn get_email(&self, message_id: &str) -> ConnectorResult<EmailMessage>;

    /// Move email to quarantine
    async fn quarantine_email(&self, message_id: &str) -> ConnectorResult<ActionResult>;

    /// Release email from quarantine
    async fn release_email(&self, message_id: &str) -> ConnectorResult<ActionResult>;

    /// Block sender
    async fn block_sender(&self, sender: &str) -> ConnectorResult<ActionResult>;

    /// Unblock sender
    async fn unblock_sender(&self, sender: &str) -> ConnectorResult<ActionResult>;

    /// Get threat data for email
    async fn get_threat_data(&self, message_id: &str) -> ConnectorResult<EmailThreatData>;
}

pub struct EmailMessage {
    pub id: String,
    pub internet_message_id: String,
    pub sender: String,
    pub recipients: Vec<String>,
    pub subject: String,
    pub received_at: DateTime<Utc>,
    pub has_attachments: bool,
    pub attachments: Vec<EmailAttachment>,
    pub urls: Vec<String>,
    pub headers: HashMap<String, String>,
    pub threat_assessment: Option<ThreatAssessment>,
}

pub struct EmailSearchQuery {
    pub sender: Option<String>,
    pub recipient: Option<String>,
    pub subject_contains: Option<String>,
    pub timerange: TimeRange,
    pub has_attachments: Option<bool>,
    pub threat_type: Option<String>,
    pub limit: usize,
}

Microsoft 365

Configuration

TW_EMAIL_GATEWAY_MODE=m365
TW_M365_TENANT_ID=your-tenant-id
TW_M365_CLIENT_ID=your-client-id
TW_M365_CLIENT_SECRET=your-client-secret

App Registration

Create an Azure AD app registration with these API permissions:

Permission                | Type        | Purpose
------------------------- | ----------- | ---------------------
Mail.Read                 | Application | Read emails
Mail.ReadWrite            | Application | Quarantine actions
ThreatAssessment.Read.All | Application | Threat data
Policy.Read.All           | Application | Block list management

Example Usage

let connector = M365Connector::new(tenant_id, client_id, client_secret)?;

// Search for suspicious emails
let query = EmailSearchQuery {
    sender: Some("attacker@example.com".to_string()),
    timerange: TimeRange::last_hours(24),
    ..Default::default()
};
let emails = connector.search_emails(query).await?;

// Quarantine malicious email
let result = connector.quarantine_email("AAMkAGI2...").await?;

// Block sender
let result = connector.block_sender("attacker@example.com").await?;

Quarantine Behavior

When quarantined:

  • Email moved to quarantine folder
  • User notified (configurable)
  • Admin can release if false positive

Mock Connector

TW_EMAIL_GATEWAY_MODE=mock

Provides sample emails with various threat characteristics:

  • Phishing with malicious URLs
  • Malware with executable attachments
  • BEC/impersonation attempts
  • Clean legitimate emails

Python Bridge

from tw_bridge import EmailGatewayBridge

bridge = EmailGatewayBridge("m365")

# Search emails
emails = bridge.search_emails(
    sender="attacker@example.com",
    hours=24
)

for email in emails:
    print(f"From: {email['sender']}")
    print(f"Subject: {email['subject']}")
    print(f"Attachments: {len(email['attachments'])}")

# Quarantine email
result = bridge.quarantine_email("AAMkAGI2...")
if result['success']:
    print("Email quarantined")

# Block sender
result = bridge.block_sender("attacker@example.com")

Response Actions

Action           | Description        | Rollback
---------------- | ------------------ | --------------
quarantine_email | Move to quarantine | release_email
block_sender     | Add to blocklist   | unblock_sender

Threat Data

Get detailed threat information:

let threat_data = connector.get_threat_data("AAMkAGI2...").await?;

println!("Delivery action: {}", threat_data.delivery_action);
println!("Threat types: {:?}", threat_data.threat_types);
println!("Detection methods: {:?}", threat_data.detection_methods);

Fields:

  • delivery_action: Delivered, Quarantined, Blocked
  • threat_types: Phishing, Malware, Spam, BEC
  • detection_methods: URLAnalysis, AttachmentScanning, ImpersonationDetection
  • urls_clicked: URLs clicked by recipient (if tracking enabled)

Ticketing Connector

Create and manage security incident tickets in external ticketing systems.

Interface

#[async_trait]
pub trait TicketingConnector: Connector {
    /// Create a new ticket
    async fn create_ticket(&self, ticket: CreateTicketRequest) -> ConnectorResult<Ticket>;

    /// Get ticket by ID
    async fn get_ticket(&self, ticket_id: &str) -> ConnectorResult<Ticket>;

    /// Update ticket fields
    async fn update_ticket(&self, ticket_id: &str, update: UpdateTicketRequest) -> ConnectorResult<Ticket>;

    /// Add comment to ticket
    async fn add_comment(&self, ticket_id: &str, comment: &str) -> ConnectorResult<()>;

    /// Search tickets
    async fn search_tickets(&self, query: TicketSearchQuery) -> ConnectorResult<Vec<Ticket>>;
}

pub struct CreateTicketRequest {
    pub title: String,
    pub description: String,
    pub priority: TicketPriority,
    pub ticket_type: String,
    pub labels: Vec<String>,
    pub assignee: Option<String>,
    pub custom_fields: HashMap<String, String>,
}

pub struct Ticket {
    pub id: String,
    pub key: String,
    pub title: String,
    pub description: String,
    pub status: String,
    pub priority: TicketPriority,
    pub assignee: Option<String>,
    pub created_at: DateTime<Utc>,
    pub updated_at: DateTime<Utc>,
    pub url: String,
}

Jira

Configuration

TW_TICKETING_MODE=jira
TW_JIRA_URL=https://company.atlassian.net
TW_JIRA_EMAIL=your-email@company.com
TW_JIRA_API_TOKEN=your-api-token
TW_JIRA_PROJECT_KEY=SEC

API Token

Generate an API token at: https://id.atlassian.com/manage-profile/security/api-tokens

Required permissions:

  • Create issues
  • Edit issues
  • Add comments
  • Browse project

Example Usage

let connector = JiraConnector::new(url, email, token, project_key)?;

// Create security ticket
let request = CreateTicketRequest {
    title: "Phishing Incident - INC-2024-001".to_string(),
    description: "Phishing email detected and quarantined.\n\n## Details\n...".to_string(),
    priority: TicketPriority::High,
    ticket_type: "Security Incident".to_string(),
    labels: vec!["phishing".to_string(), "triage-warden".to_string()],
    assignee: Some("analyst@company.com".to_string()),
    custom_fields: HashMap::new(),
};

let ticket = connector.create_ticket(request).await?;
println!("Created: {} - {}", ticket.key, ticket.url);

// Add investigation notes
connector.add_comment(
    &ticket.id,
    "## Investigation Notes\n\n- Sender reputation: Malicious\n- URLs: 2 phishing links"
).await?;

Issue Types

Configure the Jira project with these issue types:

Issue Type        | Usage
----------------- | --------------------------------
Security Incident | Main incident ticket
Investigation     | Sub-task for investigation steps
Remediation       | Sub-task for response actions

Custom Fields

Map custom fields in configuration:

TW_JIRA_FIELD_SEVERITY=customfield_10001
TW_JIRA_FIELD_INCIDENT_ID=customfield_10002
TW_JIRA_FIELD_VERDICT=customfield_10003

Mock Connector

TW_TICKETING_MODE=mock

Simulates ticket operations with in-memory storage.

Python Bridge

from tw_bridge import TicketingBridge

bridge = TicketingBridge("jira")

# Create ticket
ticket = bridge.create_ticket(
    title="Phishing Incident - INC-2024-001",
    description="Phishing email detected...",
    priority="high",
    ticket_type="Security Incident",
    labels=["phishing", "triage-warden"]
)
print(f"Created: {ticket['key']}")
print(f"URL: {ticket['url']}")

# Add comment
bridge.add_comment(
    ticket_id=ticket['id'],
    comment="Investigation complete. Verdict: Malicious"
)

# Update status
bridge.update_ticket(
    ticket_id=ticket['id'],
    status="Done"
)

# Search tickets
tickets = bridge.search_tickets(
    query="project = SEC AND labels = phishing",
    limit=10
)

Ticket Templates

Define templates for consistent ticket creation:

# config/ticket_templates.toml

[templates.phishing]
title = "Phishing: {subject}"
description = """
## Incident Summary
- **Type**: Phishing
- **Severity**: {severity}
- **Incident ID**: {incident_id}

## Details
{details}

## Recommended Actions
{recommended_actions}
"""
labels = ["phishing", "triage-warden"]

[templates.malware]
title = "Malware Alert: {hostname}"
description = """
## Incident Summary
- **Type**: Malware
- **Host**: {hostname}
- **Detection**: {detection}

## IOCs
{iocs}
"""
labels = ["malware", "triage-warden"]
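The placeholders in these templates use brace syntax, so rendering a template against incident fields can be sketched with Python's str.format; the actual template engine may differ:

```python
# Minimal render of a phishing-style template; template contents mirror
# the TOML above (trimmed to title and labels for brevity).
template = {
    "title": "Phishing: {subject}",
    "labels": ["phishing", "triage-warden"],
}

def render(template: dict, **fields) -> dict:
    """Fill {placeholder} fields in the template's title."""
    ticket = dict(template)
    ticket["title"] = template["title"].format(**fields)
    return ticket

ticket = render(template, subject="Urgent: Update Account")
print(ticket["title"])  # Phishing: Urgent: Update Account
```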

Integration with Incidents

Tickets are automatically linked to incidents:

// Create ticket action stores the ticket key
let action = execute_action("create_ticket", incident_id, params).await?;

// Incident updated with ticket reference
incident.metadata["ticket_key"] = "SEC-1234";
incident.metadata["ticket_url"] = "https://company.atlassian.net/browse/SEC-1234";

Actions

Actions are the executable operations that Triage Warden can perform in response to incidents.

Overview

Actions fall into several categories:

Category     | Purpose                   | Examples
------------ | ------------------------- | ---------------------------------------
Analysis     | Extract and parse data    | parse_email, check_email_authentication
Lookup       | Enrich with external data | lookup_sender_reputation, lookup_urls
Response     | Take containment actions  | quarantine_email, isolate_host
Notification | Alert stakeholders        | notify_user, escalate
Ticketing    | Create/update tickets     | create_ticket, add_ticket_comment

Action Trait

All actions implement the Action trait:

#[async_trait]
pub trait Action: Send + Sync {
    /// Action name (used in playbooks and API)
    fn name(&self) -> &str;

    /// Human-readable description
    fn description(&self) -> &str;

    /// Required and optional parameters
    fn required_parameters(&self) -> Vec<ParameterDef>;

    /// Whether this action supports rollback
    fn supports_rollback(&self) -> bool;

    /// Execute the action
    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError>;

    /// Rollback the action (if supported)
    async fn rollback(&self, _context: ActionContext) -> Result<ActionResult, ActionError> {
        Err(ActionError::RollbackNotSupported)
    }
}

Action Context

Actions receive an ActionContext with:

pub struct ActionContext {
    /// Unique execution ID
    pub execution_id: Uuid,

    /// Parameters passed to the action
    pub parameters: HashMap<String, serde_json::Value>,

    /// Related incident (if any)
    pub incident_id: Option<Uuid>,

    /// User or agent requesting the action
    pub proposer: String,

    /// Connectors available for use
    pub connectors: ConnectorRegistry,
}

Action Result

Actions return an ActionResult:

pub struct ActionResult {
    /// Whether the action succeeded
    pub success: bool,

    /// Action name
    pub action_name: String,

    /// Human-readable summary
    pub message: String,

    /// Execution duration
    pub duration: Duration,

    /// Output data (action-specific)
    pub output: HashMap<String, serde_json::Value>,

    /// Whether rollback is available
    pub rollback_available: bool,
}

Policy Integration

All actions pass through the policy engine before execution:

Action Request → Policy Evaluation → Decision
                                       ├─ Allowed → Execute
                                       ├─ Denied → Return Error
                                       └─ RequiresApproval → Queue

See Policy Engine for approval configuration.
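The three-way decision above can be sketched as a lookup over configured rules; the rule set and the default-deny fallback below are illustrative assumptions, not the real policy engine:

```python
from enum import Enum

class Decision(Enum):
    ALLOWED = "allowed"
    DENIED = "denied"
    REQUIRES_APPROVAL = "requires_approval"

# Illustrative rule set; real rules live in the policy configuration.
RULES = {
    "lookup_urls": Decision.ALLOWED,
    "quarantine_email": Decision.REQUIRES_APPROVAL,
    "delete_mailbox": Decision.DENIED,
}

def evaluate(action: str) -> Decision:
    """Default-deny evaluation over the configured rules."""
    return RULES.get(action, Decision.DENIED)

print(evaluate("quarantine_email").value)  # requires_approval
```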

Executing Actions

Via API

curl -X POST http://localhost:8080/api/incidents/{id}/actions \
  -H "Content-Type: application/json" \
  -d '{
    "action": "quarantine_email",
    "parameters": {
      "message_id": "AAMkAGI2...",
      "reason": "Phishing detected"
    }
  }'

Via CLI

tw-cli action execute \
  --incident INC-2024-001 \
  --action quarantine_email \
  --param message_id=AAMkAGI2... \
  --param reason="Phishing detected"

Via Playbook

steps:
  - action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
      reason: "Automated response to phishing"

Available Actions

Email Actions

Actions for analyzing and responding to email-based threats.

Analysis Actions

parse_email

Extract headers, body, attachments, and URLs from raw email.

Parameters:

Name      | Type   | Required | Description
--------- | ------ | -------- | ---------------------------
raw_email | string | Yes      | Raw email content (RFC 822)

Output:

{
  "headers": {
    "From": "sender@example.com",
    "To": "user@company.com",
    "Subject": "Important Document",
    "Date": "2024-01-15T10:30:00Z",
    "Message-ID": "<abc123@example.com>",
    "X-Originating-IP": "[192.168.1.100]"
  },
  "sender": "sender@example.com",
  "recipients": ["user@company.com"],
  "subject": "Important Document",
  "body_text": "Please review the attached document...",
  "body_html": "<html>...",
  "attachments": [
    {
      "filename": "document.pdf",
      "content_type": "application/pdf",
      "size": 102400,
      "sha256": "abc123..."
    }
  ],
  "urls": [
    "https://example.com/document",
    "https://suspicious-site.com/login"
  ]
}

check_email_authentication

Validate SPF, DKIM, and DMARC authentication results.

Parameters:

Name    | Type   | Required | Description
------- | ------ | -------- | --------------------------------
headers | object | Yes      | Email headers (from parse_email)

Output:

{
  "spf": {
    "result": "pass",
    "domain": "example.com"
  },
  "dkim": {
    "result": "pass",
    "domain": "example.com",
    "selector": "default"
  },
  "dmarc": {
    "result": "pass",
    "policy": "reject"
  },
  "authentication_passed": true,
  "risk_indicators": []
}

Risk Indicators:

  • spf_fail - SPF validation failed
  • dkim_fail - DKIM signature invalid
  • dmarc_fail - DMARC policy violation
  • header_mismatch - From/Reply-To mismatch
  • suspicious_routing - Unusual mail routing
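A simplified sketch of how the first four indicators might be derived from the authentication output and headers (suspicious_routing omitted; the header_mismatch heuristic is an assumption):

```python
def risk_indicators(auth: dict, headers: dict) -> list[str]:
    """Derive risk indicators from authentication results and email headers."""
    indicators = []
    if auth["spf"]["result"] != "pass":
        indicators.append("spf_fail")
    if auth["dkim"]["result"] != "pass":
        indicators.append("dkim_fail")
    if auth["dmarc"]["result"] != "pass":
        indicators.append("dmarc_fail")
    # Assumed heuristic: Reply-To differing from From suggests spoofing.
    reply_to = headers.get("Reply-To")
    if reply_to and reply_to != headers.get("From"):
        indicators.append("header_mismatch")
    return indicators

auth = {"spf": {"result": "fail"}, "dkim": {"result": "pass"}, "dmarc": {"result": "fail"}}
headers = {"From": "ceo@company.com", "Reply-To": "ceo@attacker.example"}
print(risk_indicators(auth, headers))  # ['spf_fail', 'dmarc_fail', 'header_mismatch']
```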

Response Actions

quarantine_email

Move email to quarantine via email gateway.

Parameters:

Name       | Type   | Required | Description
---------- | ------ | -------- | ---------------------
message_id | string | Yes      | Email message ID
reason     | string | No       | Reason for quarantine

Output:

{
  "quarantine_id": "quar-abc123",
  "message_id": "AAMkAGI2...",
  "quarantined_at": "2024-01-15T10:35:00Z"
}

Rollback: release_email - Releases email from quarantine

block_sender

Add sender to organization blocklist.

Parameters:

Name   | Type   | Required | Description
------ | ------ | -------- | ----------------------------------
sender | string | Yes      | Email address to block
scope  | string | No       | Block scope: organization or user

Output:

{
  "block_id": "block-abc123",
  "sender": "attacker@example.com",
  "scope": "organization",
  "blocked_at": "2024-01-15T10:35:00Z"
}

Rollback: unblock_sender - Removes sender from blocklist

Usage Examples

Phishing Response Playbook

name: phishing_response
steps:
  - action: parse_email
    output: parsed

  - action: check_email_authentication
    parameters:
      headers: "{{ parsed.headers }}"
    output: auth

  - action: lookup_sender_reputation
    parameters:
      sender: "{{ parsed.sender }}"
    output: reputation

  - condition: "reputation.score < 0.3 or not auth.authentication_passed"
    action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
      reason: "Failed authentication and low sender reputation"

  - condition: "reputation.score < 0.2"
    action: block_sender
    parameters:
      sender: "{{ parsed.sender }}"
      scope: organization

CLI Example

# Quarantine suspicious email
tw-cli action execute \
  --action quarantine_email \
  --param message_id="AAMkAGI2..." \
  --param reason="Phishing indicators detected"

# Block malicious sender
tw-cli action execute \
  --action block_sender \
  --param sender="attacker@example.com" \
  --param scope=organization

Host Actions

Actions for endpoint containment and investigation.

isolate_host

Network-isolate a compromised host via EDR.

Parameters:

Name    | Type   | Required | Description
------- | ------ | -------- | ---------------------
host_id | string | Yes      | EDR host/agent ID
reason  | string | No       | Reason for isolation

Output:

{
  "isolation_id": "iso-abc123",
  "host_id": "aid:xyz789",
  "hostname": "WORKSTATION-01",
  "isolated_at": "2024-01-15T10:40:00Z",
  "status": "isolated"
}

Behavior:

  • Host network access blocked
  • EDR agent maintains cloud connectivity
  • User notified (configurable)

Rollback: unisolate_host

Policy: Typically requires senior analyst or manager approval.

unisolate_host

Remove network isolation from a host.

Parameters:

Name    | Type   | Required | Description
------- | ------ | -------- | ------------------------------
host_id | string | Yes      | EDR host/agent ID
reason  | string | No       | Reason for removing isolation

Output:

{
  "host_id": "aid:xyz789",
  "hostname": "WORKSTATION-01",
  "unisolated_at": "2024-01-15T14:00:00Z",
  "status": "active"
}

scan_host

Trigger on-demand malware scan on a host.

Parameters:

Name      | Type   | Required | Description
--------- | ------ | -------- | -------------------------------
host_id   | string | Yes      | EDR host/agent ID
scan_type | string | No       | quick or full (default: quick)

Output:

{
  "scan_id": "scan-abc123",
  "host_id": "aid:xyz789",
  "scan_type": "quick",
  "started_at": "2024-01-15T10:45:00Z",
  "status": "running"
}

Note: Scan results are retrieved separately as they may take time.

Usage Examples

Malware Response Playbook

name: malware_response
steps:
  - action: isolate_host
    parameters:
      host_id: "{{ incident.raw_data.host_id }}"
      reason: "Malware detection - automated isolation"
    output: isolation

  - action: scan_host
    parameters:
      host_id: "{{ incident.raw_data.host_id }}"
      scan_type: full

  - action: create_ticket
    parameters:
      title: "Malware Incident - {{ incident.raw_data.hostname }}"
      priority: high

  - action: notify_user
    parameters:
      user: "{{ incident.raw_data.user }}"
      message: "Your workstation has been isolated due to a security incident"

CLI Example

# Isolate compromised host
tw-cli action execute \
  --action isolate_host \
  --param host_id="aid:xyz789" \
  --param reason="Active malware infection"

# This action typically requires approval
# Check approval status:
tw-cli action status act-123456

# After investigation, remove isolation:
tw-cli action execute \
  --action unisolate_host \
  --param host_id="aid:xyz789" \
  --param reason="Malware cleaned, host verified"

API Example

# Request host isolation
curl -X POST http://localhost:8080/api/incidents/INC-2024-001/actions \
  -H "Content-Type: application/json" \
  -d '{
    "action": "isolate_host",
    "parameters": {
      "host_id": "aid:xyz789",
      "reason": "Suspected compromise"
    }
  }'

# Response (if requires approval):
{
  "action_id": "act-abc123",
  "status": "pending_approval",
  "approval_level": "manager",
  "message": "Action requires SOC Manager approval"
}

Policy Configuration

Host actions are typically high-impact and require approval:

[[policy.rules]]
name = "isolate_requires_approval"
action = "isolate_host"
approval_level = "senior"

[[policy.rules]]
name = "critical_isolate_requires_manager"
action = "isolate_host"
severity = ["critical"]
approval_level = "manager"

Lookup Actions

Actions for enriching incidents with threat intelligence data.

lookup_sender_reputation

Query threat intelligence for sender domain and IP reputation.

Parameters:

Name           | Type   | Required | Description
-------------- | ------ | -------- | ------------------
sender         | string | Yes      | Email address
originating_ip | string | No       | Sending server IP

Output:

{
  "sender": "attacker@domain.com",
  "domain": "domain.com",
  "domain_reputation": {
    "score": 0.25,
    "categories": ["phishing", "newly-registered"],
    "first_seen": "2024-01-10",
    "registrar": "NameCheap"
  },
  "ip_reputation": {
    "ip": "192.168.1.100",
    "score": 0.3,
    "categories": ["spam", "proxy"],
    "country": "RU",
    "asn": "AS12345"
  },
  "overall_score": 0.25,
  "risk_level": "high"
}

Score Interpretation:

Score     | Risk Level
--------- | -----------
0.0 - 0.3 | High risk
0.3 - 0.6 | Medium risk
0.6 - 0.8 | Low risk
0.8 - 1.0 | Clean
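The bands above translate directly into a small classifier; the boundary handling (lower bound inclusive) is an assumption, since the table's bands overlap at the edges:

```python
def risk_level(score: float) -> str:
    """Map a reputation score to the risk bands above."""
    if score < 0.3:
        return "high"
    if score < 0.6:
        return "medium"
    if score < 0.8:
        return "low"
    return "clean"

print(risk_level(0.25))  # high
```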

lookup_urls

Check URLs against threat intelligence.

Parameters:

Name | Type  | Required | Description
---- | ----- | -------- | ----------------------
urls | array | Yes      | List of URLs to check

Output:

{
  "results": [
    {
      "url": "https://legitimate-site.com/page",
      "malicious": false,
      "categories": ["business"],
      "confidence": 0.95
    },
    {
      "url": "https://phishing-site.com/login",
      "malicious": true,
      "categories": ["phishing", "credential-theft"],
      "confidence": 0.92,
      "threat_details": {
        "targeted_brand": "Microsoft",
        "first_seen": "2024-01-14"
      }
    }
  ],
  "malicious_count": 1,
  "total_count": 2
}

lookup_attachments

Hash attachments and check against threat intelligence.

Parameters:

Name        | Type  | Required | Description
----------- | ----- | -------- | --------------------------------------
attachments | array | Yes      | List of attachment objects with sha256

Output:

{
  "results": [
    {
      "filename": "invoice.pdf",
      "sha256": "abc123...",
      "malicious": false,
      "file_type": "PDF document",
      "confidence": 0.9
    },
    {
      "filename": "update.exe",
      "sha256": "def456...",
      "malicious": true,
      "file_type": "Windows executable",
      "confidence": 0.98,
      "threat_details": {
        "malware_family": "Emotet",
        "first_seen": "2024-01-12",
        "detection_engines": 45
      }
    }
  ],
  "malicious_count": 1,
  "total_count": 2
}

lookup_hash

Look up a single file hash.

Parameters:

Name | Type   | Required | Description
---- | ------ | -------- | --------------------------
hash | string | Yes      | MD5, SHA1, or SHA256 hash

Output:

{
  "hash": "abc123...",
  "hash_type": "sha256",
  "malicious": true,
  "confidence": 0.95,
  "malware_family": "Emotet",
  "categories": ["trojan", "banking"],
  "first_seen": "2024-01-12",
  "last_seen": "2024-01-15",
  "detection_ratio": "45/70"
}

lookup_ip

Query IP address reputation.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| ip | string | Yes | IP address |

Output:

{
  "ip": "192.168.1.100",
  "malicious": true,
  "confidence": 0.8,
  "categories": ["c2", "malware-distribution"],
  "country": "RU",
  "asn": "AS12345",
  "asn_org": "Example ISP",
  "last_seen": "2024-01-15",
  "associated_malware": ["Cobalt Strike"]
}

Usage in Playbooks

name: email_triage
steps:
  - action: parse_email
    output: parsed

  - action: lookup_sender_reputation
    parameters:
      sender: "{{ parsed.sender }}"
    output: sender_rep

  - action: lookup_urls
    parameters:
      urls: "{{ parsed.urls }}"
    output: url_results

  - action: lookup_attachments
    parameters:
      attachments: "{{ parsed.attachments }}"
    output: attachment_results

  # Make decision based on lookups
  - condition: >
      sender_rep.risk_level == 'high' or
      url_results.malicious_count > 0 or
      attachment_results.malicious_count > 0
    set_verdict:
      classification: malicious
      confidence: 0.9

Caching

Lookup results are cached to reduce API calls:

| Lookup | Cache Duration |
|--------|----------------|
| Hash | 24 hours |
| URL | 1 hour |
| Domain | 6 hours |
| IP | 6 hours |

Force a fresh lookup by passing the skip_cache: true parameter.
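
The caching behavior can be sketched as a small TTL cache using the durations above; the key scheme, the fetch callback, and the skip_cache handling are illustrative, not the actual implementation:

```python
import time

# TTLs from the table above, in seconds
CACHE_TTL_SECONDS = {"hash": 24 * 3600, "url": 3600, "domain": 6 * 3600, "ip": 6 * 3600}

_cache: dict = {}  # (kind, value) -> (stored_at, result)

def cached_lookup(kind: str, value: str, fetch, skip_cache: bool = False) -> dict:
    """Return a cached result if fresh, otherwise call fetch and cache it."""
    key = (kind, value)
    now = time.monotonic()
    if not skip_cache and key in _cache:
        stored_at, result = _cache[key]
        if now - stored_at < CACHE_TTL_SECONDS[kind]:
            return result  # still within TTL
    result = fetch(value)
    _cache[key] = (now, result)
    return result
```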

Notification Actions

Actions for alerting stakeholders and managing escalation.

notify_user

Send notification to an affected user.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| user | string | Yes | User email or ID |
| message | string | Yes | Notification message |
| channel | string | No | email, slack, teams (default: email) |
| template | string | No | Notification template name |

Output:

{
  "notification_id": "notif-abc123",
  "recipient": "[email protected]",
  "channel": "email",
  "sent_at": "2024-01-15T10:50:00Z",
  "status": "delivered"
}

Templates:

# templates/notifications.yaml
security_alert:
  subject: "Security Alert: Action Required"
  body: |
    A security incident affecting your account has been detected.

    Incident ID: {{ incident_id }}
    Type: {{ incident_type }}

    {{ message }}

    If you did not initiate this activity, please contact IT Security.

notify_reporter

Send status update to the incident reporter.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| incident_id | string | Yes | Incident ID |
| status | string | Yes | Status update message |
| include_verdict | bool | No | Include AI verdict (default: false) |

Output:

{
  "notification_id": "notif-def456",
  "reporter": "[email protected]",
  "status": "delivered"
}

escalate

Route incident to appropriate approval level.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| incident_id | string | Yes | Incident ID |
| escalation_level | string | Yes | analyst, senior, manager |
| reason | string | Yes | Reason for escalation |
| override_assignee | string | No | Specific person to assign |
| custom_sla_hours | int | No | Custom SLA (overrides default) |
| notify_channels | array | No | Additional channels (slack, pagerduty) |

Output:

{
  "escalation_id": "esc-abc123",
  "incident_id": "INC-2024-001",
  "escalation_level": "senior",
  "assigned_to": "[email protected]",
  "due_date": "2024-01-15T12:50:00Z",
  "priority": "high",
  "sla_hours": 2
}

Default SLAs:

| Level | SLA |
|---------|---------|
| Analyst | 4 hours |
| Senior | 2 hours |
| Manager | 1 hour |

create_ticket

Create ticket in external ticketing system.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| title | string | Yes | Ticket title |
| description | string | Yes | Ticket description |
| priority | string | No | low, medium, high, critical |
| assignee | string | No | Initial assignee |
| labels | array | No | Ticket labels |

Output:

{
  "ticket_id": "12345",
  "ticket_key": "SEC-1234",
  "url": "https://company.atlassian.net/browse/SEC-1234",
  "created_at": "2024-01-15T10:55:00Z"
}

log_false_positive

Record a false positive for tuning.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| incident_id | string | Yes | Incident ID |
| reason | string | Yes | Why this is a false positive |
| feedback | string | No | Additional feedback for AI improvement |

Output:

{
  "fp_id": "fp-abc123",
  "incident_id": "INC-2024-001",
  "recorded_at": "2024-01-15T11:00:00Z",
  "used_for_training": true
}

run_triage_agent

Trigger AI triage agent on an incident.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| incident_id | string | Yes | Incident ID |
| playbook | string | No | Specific playbook to use |
| model | string | No | AI model override |

Output:

{
  "triage_id": "triage-abc123",
  "incident_id": "INC-2024-001",
  "verdict": "malicious",
  "confidence": 0.92,
  "reasoning": "Multiple indicators of phishing...",
  "recommended_actions": [
    "quarantine_email",
    "block_sender",
    "notify_user"
  ],
  "completed_at": "2024-01-15T10:52:00Z"
}

Usage Examples

Escalation Playbook

name: auto_escalate
trigger:
  - verdict: malicious
  - confidence: ">= 0.9"
  - severity: critical

steps:
  - action: escalate
    parameters:
      incident_id: "{{ incident.id }}"
      escalation_level: manager
      reason: "High-confidence critical incident requiring immediate attention"
      notify_channels:
        - slack
        - pagerduty

  - action: create_ticket
    parameters:
      title: "CRITICAL: {{ incident.subject }}"
      priority: critical

CLI Examples

# Escalate to senior analyst
tw-cli action execute \
  --incident INC-2024-001 \
  --action escalate \
  --param escalation_level=senior \
  --param reason="Complex threat requiring expertise"

# Create ticket
tw-cli action execute \
  --incident INC-2024-001 \
  --action create_ticket \
  --param title="Phishing Investigation" \
  --param priority=high

# Record false positive
tw-cli action execute \
  --incident INC-2024-001 \
  --action log_false_positive \
  --param reason="Legitimate vendor communication"

Policy Engine

The policy engine controls action approval workflows and enforces security boundaries.

Overview

Every action request passes through the policy engine:

Action Request → Build Context → Evaluate Rules → Decision
                                                    ├─ Allowed → Execute
                                                    ├─ Denied → Reject
                                                    └─ RequiresApproval → Queue

Policy Decision Types

| Decision | Behavior |
|----------|----------|
| Allowed | Action executes immediately |
| Denied | Action rejected with reason |
| RequiresApproval | Queued for specified approval level |

Action Context

The policy engine evaluates these attributes:

pub struct ActionContext {
    /// The action being requested
    pub action_type: String,

    /// Target of the action (host, email, user, etc.)
    pub target: String,

    /// Incident severity (if associated)
    pub severity: Option<Severity>,

    /// AI confidence score (if from triage)
    pub confidence: Option<f64>,

    /// Who/what is requesting the action
    pub proposer: Proposer,

    /// Additional context
    pub metadata: HashMap<String, Value>,
}

pub enum Proposer {
    User { id: String, role: Role },
    Agent { name: String },
    Playbook { name: String },
    System,
}

Default Policies

Without custom rules, these defaults apply:

| Action Category | Default Decision |
|-----------------|------------------|
| Lookup actions | Allowed |
| Analysis actions | Allowed |
| Notification actions | Allowed |
| Response actions | RequiresApproval (analyst) |
| Host containment | RequiresApproval (senior) |

Policy Rules

Define rules to control when actions require approval.

Rule Structure

[[policy.rules]]
name = "rule_name"
description = "Human-readable description"

# Matching criteria
action = "action_name"           # Specific action
action_patterns = ["pattern_*"]  # Glob patterns

# Conditions (all must match)
severity = ["high", "critical"]  # Incident severity
confidence_min = 0.8             # Minimum AI confidence
proposer_type = "agent"          # Who's requesting
proposer_role = "analyst"        # Role (if user)

# Decision
decision = "allowed"             # or "denied" or "requires_approval"
approval_level = "senior"        # If requires_approval
reason = "Explanation"           # If denied

Rule Examples

Auto-Approve Lookups

[[policy.rules]]
name = "auto_approve_lookups"
description = "Lookup actions are always allowed"
action_patterns = ["lookup_*"]
decision = "allowed"

Require Approval for Response Actions

[[policy.rules]]
name = "response_needs_analyst"
description = "Response actions require analyst approval"
action_patterns = ["quarantine_*", "block_*"]
decision = "requires_approval"
approval_level = "analyst"

High-Severity Host Isolation

[[policy.rules]]
name = "critical_isolation_needs_manager"
description = "Critical severity host isolation requires manager"
action = "isolate_host"
severity = ["critical"]
decision = "requires_approval"
approval_level = "manager"

Block Dangerous Actions in Production

[[policy.rules]]
name = "no_delete_production"
description = "Deletion actions not allowed in production"
action_patterns = ["delete_*"]
environment = "production"
decision = "denied"
reason = "Deletion actions are not permitted in production"

Trust High-Confidence AI Decisions

[[policy.rules]]
name = "trust_high_confidence_ai"
description = "Auto-approve when AI is highly confident"
proposer_type = "agent"
confidence_min = 0.95
severity = ["low", "medium"]
action_patterns = ["quarantine_email", "block_sender"]
decision = "allowed"

Analyst Self-Service

[[policy.rules]]
name = "analyst_can_notify"
description = "Analysts can send notifications without approval"
action_patterns = ["notify_*"]
proposer_role = "analyst"
decision = "allowed"

Rule Evaluation Order

Rules are evaluated in order. First matching rule wins.

# More specific rules first
[[policy.rules]]
name = "critical_isolation"
action = "isolate_host"
severity = ["critical"]
decision = "requires_approval"
approval_level = "manager"

# General fallback
[[policy.rules]]
name = "default_isolation"
action = "isolate_host"
decision = "requires_approval"
approval_level = "senior"
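
First-match-wins evaluation over rules like these can be sketched as follows; the matching logic is deliberately reduced to the action and severity fields, and the fallback decision when no rule matches is an assumption:

```python
def evaluate(rules: list, action: str, severity: str = None) -> dict:
    """Return the first rule matching the action context."""
    for rule in rules:
        if rule.get("action") and rule["action"] != action:
            continue  # rule targets a different action
        if rule.get("severity") and severity not in rule["severity"]:
            continue  # severity condition not met
        return rule  # first matching rule wins
    # Assumed fallback when no rule matches
    return {"decision": "requires_approval", "approval_level": "analyst"}

rules = [
    {"name": "critical_isolation", "action": "isolate_host",
     "severity": ["critical"], "approval_level": "manager"},
    {"name": "default_isolation", "action": "isolate_host",
     "approval_level": "senior"},
]
```

A critical isolate_host request matches critical_isolation; any other severity falls through to default_isolation, which is why the more specific rule must come first.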

Condition Operators

Severity Matching

severity = ["high", "critical"]  # Match any in list

Confidence Ranges

confidence_min = 0.8   # Minimum confidence
confidence_max = 0.95  # Maximum confidence

Pattern Matching

action_patterns = ["lookup_*"]        # Prefix match
action_patterns = ["*_email"]         # Suffix match
action_patterns = ["*block*"]         # Contains
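
These glob semantics (prefix, suffix, contains) are the same ones Python's fnmatch provides, which is enough to illustrate how action_patterns behave; this is a sketch, not the engine's actual matcher:

```python
from fnmatch import fnmatch

def matches_any(action: str, patterns: list) -> bool:
    """True if the action name matches any of the glob patterns."""
    return any(fnmatch(action, pattern) for pattern in patterns)
```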

Proposer Conditions

proposer_type = "user"      # user, agent, playbook, system
proposer_role = "analyst"   # Only for user proposers

Managing Rules

Via Configuration File

# Load rules from config/policy.toml at startup
tw-api --config config/policy.toml

Via API

# List rules
curl http://localhost:8080/api/policies

# Create rule
curl -X POST http://localhost:8080/api/policies \
  -H "Content-Type: application/json" \
  -d '{
    "name": "new_rule",
    "action": "isolate_host",
    "approval_level": "senior"
  }'

Via CLI

# List rules
tw-cli policy list

# Add rule
tw-cli policy add \
  --name "block_needs_approval" \
  --action "block_sender" \
  --approval-level analyst

Testing Rules

Simulate policy evaluation without executing:

tw-cli policy test \
  --action isolate_host \
  --severity critical \
  --proposer-type agent \
  --confidence 0.92

# Output:
# Decision: RequiresApproval
# Level: manager
# Matched Rule: critical_isolation_needs_manager

Approval Levels

Understanding the approval workflow in Triage Warden.

Approval Hierarchy

Manager (SOC Manager)
    │
    ▼
Senior (Senior Analyst)
    │
    ▼
Analyst (Security Analyst)
    │
    ▼
Auto (No approval needed)

Higher levels can approve actions at their level or below.
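
The "level or below" rule reduces to an ordering comparison; the numeric ranks here are illustrative:

```python
# Illustrative ranks for the hierarchy above (higher = more authority)
APPROVAL_RANK = {"auto": 0, "analyst": 1, "senior": 2, "manager": 3}

def can_approve(approver_level: str, required_level: str) -> bool:
    """An approver may approve actions at their level or below."""
    return APPROVAL_RANK[approver_level] >= APPROVAL_RANK[required_level]
```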

Level Definitions

| Level | Role | Typical Actions |
|-------|------|-----------------|
| Auto | System | Lookups, analysis, low-risk notifications |
| Analyst | Security Analyst | Email quarantine, sender blocking |
| Senior | Senior Analyst | Host isolation, broad blocks |
| Manager | SOC Manager | Critical containment, policy changes |

Approval Workflow

1. Action Requested

tw-cli action execute --incident INC-001 --action isolate_host

2. Policy Evaluation

The policy engine evaluates the request and returns:

{
  "decision": "requires_approval",
  "approval_level": "senior",
  "reason": "Host isolation requires senior analyst approval"
}

3. Action Queued

Action stored with pending status:

{
  "action_id": "act-abc123",
  "incident_id": "INC-001",
  "action_type": "isolate_host",
  "status": "pending_approval",
  "approval_level": "senior",
  "requested_by": "[email protected]",
  "requested_at": "2024-01-15T10:30:00Z"
}

4. Approvers Notified

Notification sent to eligible approvers via configured channels.

5. Approval Decision

Approver reviews and decides:

Approve:

tw-cli action approve act-abc123 --comment "Verified threat"

Reject:

tw-cli action reject act-abc123 --reason "False positive, user traveling"

6. Execution or Rejection

  • Approved: Action executes automatically
  • Rejected: Action marked rejected, requester notified

Approval UI

Access pending approvals at /approvals in the web dashboard.

Features:

  • Filterable list of pending actions
  • Incident context display
  • One-click approve/reject
  • Bulk approval for related actions

SLA Tracking

Each approval level has a default SLA:

| Level | Default SLA |
|---------|-------------|
| Analyst | 4 hours |
| Senior | 2 hours |
| Manager | 1 hour |

Overdue approvals are:

  1. Highlighted in the dashboard
  2. Re-sent to approvers as reminders
  3. Optionally escalated to the next level
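
Overdue detection is just a comparison of the current time against the request time plus the SLA; a sketch assuming the default durations above:

```python
from datetime import datetime, timedelta, timezone

# Default SLA durations from the table above
SLA_HOURS = {"analyst": 4, "senior": 2, "manager": 1}

def due_date(requested_at: datetime, level: str) -> datetime:
    """When the approval breaches its SLA."""
    return requested_at + timedelta(hours=SLA_HOURS[level])

def is_overdue(requested_at: datetime, level: str, now: datetime) -> bool:
    return now > due_date(requested_at, level)
```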

Delegation

Approvers can delegate when unavailable:

tw-cli approval delegate \
  --from [email protected] \
  --to [email protected] \
  --until 2024-01-20

Approval Groups

Configure approval groups for redundancy:

[approval_groups]
senior_analysts = [
  "[email protected]",
  "[email protected]",
  "[email protected]"
]

managers = [
  "[email protected]",
  "[email protected]"
]

Any member of the group can approve.

Audit Trail

All approval decisions are logged:

{
  "event": "action_approved",
  "action_id": "act-abc123",
  "approver": "[email protected]",
  "decision": "approved",
  "comment": "Verified threat indicators",
  "timestamp": "2024-01-15T10:45:00Z",
  "time_to_approve": "15m"
}

Emergency Override

In emergencies, managers can bypass approval:

tw-cli action execute \
  --incident INC-001 \
  --action isolate_host \
  --emergency \
  --reason "Active ransomware, immediate containment required"

Emergency overrides:

  • Are logged with high visibility
  • Require manager credentials
  • Trigger additional notifications

Natural Language Queries

Query your security data using plain English instead of writing Splunk SPL, Elasticsearch KQL, or SQL by hand.

Overview

The NL Query Interface (Stage 4.1) lets analysts type questions like "show me critical incidents from the last 24 hours" and have Triage Warden translate them into structured queries against your SIEM, log store, or incident database.

The pipeline has four stages:

  1. Intent classification -- determines what the analyst is trying to do
  2. Entity extraction -- pulls out IPs, domains, hashes, date ranges, etc.
  3. Query translation -- converts the parsed intent + entities into the target query language
  4. Backend execution -- runs the query against Splunk, Elasticsearch, or SQL

Supported Intents

| Intent | Example query |
|--------|---------------|
| search_incidents | "show me open critical incidents" |
| search_logs | "find authentication failures in the last hour" |
| lookup_ioc | "check reputation for 192.168.1.100" |
| explain_incident | "what happened in INC-2024-0042?" |
| compare_incidents | "compare INC-001 and INC-002" |
| timeline_query | "show me events from last week" |
| asset_lookup | "who owns server web-prod-01?" |
| statistics | "how many phishing incidents this month?" |

Intent classification uses keyword matching and regex patterns -- no LLM call is needed for routing.
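
A minimal keyword/regex router in that spirit might look like this; the keyword lists, intent coverage, and fallback intent are assumptions for illustration, not the shipped classifier:

```python
import re

# Patterns checked in order; first intent with a matching pattern wins
INTENT_PATTERNS = {
    "lookup_ioc": [r"\breputation\b", r"\bcheck\b.*\b\d{1,3}(?:\.\d{1,3}){3}\b"],
    "statistics": [r"\bhow many\b"],
    "search_incidents": [r"\bincidents?\b"],
}

def classify(query: str) -> str:
    """Route a natural-language query to an intent without any LLM call."""
    q = query.lower()
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(p, q) for p in patterns):
            return intent
    return "search_logs"  # assumed fallback intent
```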

Entity Extraction

The entity extractor recognizes security-specific tokens:

  • IP addresses -- IPv4 (192.168.1.100)
  • Domains -- evil-domain.com
  • Hashes -- MD5 (32 hex chars), SHA-1 (40), SHA-256 (64)
  • Incident IDs -- INC-2024-0042, #42
  • Date ranges -- "last 24 hours", "past 7 days", 2024-01-01 to 2024-01-31
  • Usernames, hostnames, CVE IDs
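
A few of those extractors can be sketched with regular expressions; the patterns are simplified (for example, the IPv4 regex does not validate octet ranges) and are not the actual extractor:

```python
import re

PATTERNS = {
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),          # IPv4, unvalidated octets
    "sha256": re.compile(r"\b[0-9a-fA-F]{64}\b"),              # 64 hex chars
    "incident_id": re.compile(r"\bINC-\d{4}-\d+\b"),           # e.g. INC-2024-0042
}

def extract_entities(text: str) -> dict:
    """Return all matches for each recognized token type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
```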

Query Translation

Once intent and entities are extracted, NLQueryTranslator builds a structured query object:

from tw_ai.nl_query import NLQueryTranslator

translator = NLQueryTranslator()
result = translator.translate(
    "show me failed logins from 10.0.0.50 in the last hour"
)
# result.intent.intent = QueryIntent.SEARCH_LOGS
# result.structured_query holds the backend-specific query

Backend Adapters

The translator outputs queries for three backends:

| Backend | Output format | Example / use case |
|---------|---------------|--------------------|
| Splunk | SPL queries | index=auth action=failure src_ip=10.0.0.50 earliest=-1h |
| Elasticsearch | KQL / Query DSL | event.action:failure AND source.ip:10.0.0.50 |
| SQL | SQL WHERE clauses | Incident database queries |

Conversation Context

Multi-turn conversations are supported via ConversationContext. When an analyst asks "now show me the same for last week", the system retains the entities from the previous turn.

from tw_ai.nl_query import ConversationContext

ctx = ConversationContext()
ctx.update("show me incidents from 10.0.0.50", entities=[...])
ctx.update("now filter to critical only", entities=[...])
# Second turn inherits the IP entity from the first

Security and Audit

All NL queries are sanitized before execution to prevent injection attacks. The QuerySanitizer strips dangerous characters and SQL keywords from user input.

Every query is logged to the QueryAuditLog with:

  • Original natural language query
  • Classified intent and confidence
  • Translated structured query
  • Execution timestamp and user ID

API Endpoint

When FastAPI is available, the NL query service exposes a REST endpoint:

curl -X POST http://localhost:8080/api/v1/nl/query \
  -H "Content-Type: application/json" \
  -d '{"query": "show me critical incidents from the last 24 hours"}'

Configuration

No special configuration is required. The NL query engine uses the same SIEM and database connections already configured in config/default.yaml.

To add custom keywords for intent classification:

from tw_ai.nl_query import IntentClassifier, QueryIntent

classifier = IntentClassifier(
    custom_keywords={
        QueryIntent.SEARCH_LOGS: ["splunk", "kibana"],
    }
)

Automated Threat Hunting

Proactively search for threats across your environment using hypothesis-driven hunts with built-in query templates mapped to MITRE ATT&CK.

Overview

The threat hunting module (Stage 5.1) provides:

  • Hunt management -- create, schedule, and track hunts with hypotheses
  • Built-in query library -- 20+ pre-built queries across 8 MITRE ATT&CK categories
  • Multi-platform queries -- Splunk SPL and Elasticsearch KQL templates
  • Finding promotion -- promote hunt findings directly to incidents

Hunt Lifecycle

A hunt progresses through these statuses:

| Status | Description |
|--------|-------------|
| draft | Hunt is being designed, not yet executable |
| active | Hunt is enabled and will run on schedule or trigger |
| paused | Temporarily suspended |
| completed | Finished executing (one-time hunts) |
| failed | Execution encountered errors |
| archived | No longer active, kept for reference |

Creating a Hunt

Via API

curl -X POST http://localhost:8080/api/v1/hunts \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Detect Kerberoasting",
    "hypothesis": "Attackers may request TGS tickets for service accounts to crack offline",
    "hunt_type": "scheduled",
    "queries": [
      {
        "query_type": "splunk",
        "query": "index=wineventlog EventCode=4769 TicketEncryptionType=0x17 | stats count by ServiceName",
        "description": "Detect RC4-encrypted TGS requests",
        "timeout_secs": 300,
        "expected_baseline": 5
      }
    ],
    "schedule": {
      "cron_expression": "0 */4 * * *",
      "timezone": "UTC",
      "max_runtime_secs": 600
    },
    "mitre_techniques": ["T1558.003"],
    "data_sources": ["windows_event_logs"],
    "tags": ["credential-access", "priority-high"],
    "enabled": true
  }'

Hunt Types

| Type | Description |
|------|-------------|
| scheduled | Runs on a cron schedule |
| continuous | Runs as a streaming query |
| on_demand | Runs only when manually triggered |
| triggered | Runs when a condition is met (e.g., new threat intel) |

Built-in Query Library

Access 20+ pre-built queries via the API:

curl http://localhost:8080/api/v1/hunts/queries/library

Queries span 8 MITRE ATT&CK categories:

  • Initial Access
  • Execution
  • Persistence
  • Credential Access
  • Lateral Movement
  • Collection
  • Command and Control
  • Exfiltration

Each built-in query includes Splunk SPL and Elasticsearch KQL templates, expected baselines for anomaly detection, and configurable parameters.

Executing a Hunt

Trigger a hunt manually:

curl -X POST http://localhost:8080/api/v1/hunts/{hunt_id}/execute

The response includes findings with severity levels, evidence data, and the query that produced each finding.

Viewing Results

# Get all results for a hunt
curl http://localhost:8080/api/v1/hunts/{hunt_id}/results

Each result includes:

  • Total and critical finding counts
  • Duration and execution status
  • Individual findings with severity, evidence, and matched query

Promoting Findings to Incidents

When a hunt finding warrants investigation, promote it to a full incident:

curl -X POST http://localhost:8080/api/v1/hunts/{hunt_id}/findings/{finding_id}/promote

This creates a new incident with the finding's evidence, severity, and hunt metadata attached.

Query Languages

| Language | Identifier | Example |
|----------|------------|---------|
| Splunk SPL | splunk | index=wineventlog EventCode=4625 |
| Elasticsearch | elasticsearch | event.code: 4625 |
| SQL | sql | SELECT * FROM events WHERE event_code = 4625 |
| Kusto (KQL) | kusto | SecurityEvent \| where EventID == 4625 |
| Custom | custom | Any custom query syntax |

Python Hypothesis Generator

The Python tw_ai package includes an AI-powered hypothesis generator that suggests new hunts based on current threat intelligence and recent incident patterns.

Collaboration

Coordinate incident response across your team with assignments, comments, real-time events, activity feeds, and shift handoffs.

Overview

The collaboration module (Stage 4.3) adds team workflow features to incident management:

  • Incident assignment -- manual and auto-assignment with rules
  • Comments -- threaded discussion on incidents with mentions
  • Real-time events -- live updates pushed to connected clients
  • Activity feed -- chronological audit trail of all actions
  • Shift handoff -- structured handoff reports between shifts

Incident Assignment

Manual Assignment

Assign an incident to an analyst through the web UI's assignment picker, or via the web endpoint:

curl -X POST http://localhost:8080/web/incidents/{id}/assign \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d 'assignee=analyst-uuid'

Auto-Assignment Rules

The system supports rule-based auto-assignment. Rules are defined in the application configuration and evaluated when new incidents arrive. Each rule specifies conditions and an assignee target:

| Field | Description |
|-------|-------------|
| name | Human-readable rule name |
| conditions | List of conditions to match (severity, incident type, source, tag) |
| assignee | Who to assign to (see Assignee Targets below) |
| priority | Evaluation order (lower number = higher priority) |

Rules are evaluated in priority order. The first matching rule wins.

Note: Auto-assignment rule management via API is planned for a future release. Rules are currently configured at the application level.

Assignee Targets

| Type | Description |
|------|-------------|
| user | Assign to a specific analyst by ID |
| team | Round-robin across team members |
| on_call | Assign to whoever is on-call |
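
The team target's round-robin behavior can be sketched with an in-memory rotation; in practice the assignment cursor would live in the database rather than an iterator:

```python
from itertools import cycle

class TeamAssigner:
    """Round-robin assignment across team members (illustrative only)."""

    def __init__(self, members: list):
        self._members = cycle(members)

    def next_assignee(self) -> str:
        return next(self._members)
```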

Comments

Add discussion, analysis notes, and action records to incidents.

Creating a Comment

curl -X POST http://localhost:8080/api/v1/comments \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "incident-uuid",
    "content": "Found lateral movement evidence via PsExec. @senior-analyst please review.",
    "comment_type": "analysis",
    "mentions": ["senior-analyst-uuid"]
  }'

Comment Types

| Type | Use case |
|------|----------|
| note | General notes and observations |
| analysis | Technical findings and analysis |
| action_taken | Record of actions performed |
| question | Questions for other team members |
| resolution | Final resolution summary |

Filtering Comments

# All comments for an incident
curl "http://localhost:8080/api/v1/comments?incident_id={id}"

# Only analysis comments
curl "http://localhost:8080/api/v1/comments?incident_id={id}&comment_type=analysis"

# Comments by a specific analyst
curl "http://localhost:8080/api/v1/comments?author_id={analyst_id}"

Comments support pagination with page and per_page query parameters.

Real-time Events

The real-time event system pushes updates to connected clients when incidents are modified, comments are added, or assignments change. Events include:

  • Incident status changes
  • New comments and mentions
  • Assignment updates
  • Action execution results
  • Field-level change tracking

Subscribers can filter events by incident ID, event type, or severity.

Activity Feed

Every action on an incident is recorded in the activity feed, providing a complete audit trail:

  • Who did what and when
  • What fields changed (with before/after values)
  • Comment and assignment history
  • Action execution records

Filter the activity feed by incident, user, or activity type.

Shift Handoff

Generate structured handoff reports at shift transitions:

curl -X POST http://localhost:8080/api/v1/handoffs \
  -H "Content-Type: application/json" \
  -d '{
    "shift_start": "2025-01-15T08:00:00Z",
    "shift_end": "2025-01-15T16:00:00Z",
    "notes": "Ongoing phishing campaign targeting finance department"
  }'

Handoff reports include:

  • Summary of open incidents per severity
  • Actions pending approval
  • Recent escalations
  • Custom notes from the outgoing team

Agentic AI Response

Control how much autonomy the AI has when responding to incidents, from fully manual to fully autonomous, with time-based rules and per-action overrides.

Overview

The Agentic AI Response system (Stage 5.4) provides configurable autonomy levels that determine which actions the AI can execute automatically and which require human approval. It includes:

  • Four autonomy levels with increasing automation
  • Per-action and per-severity overrides
  • Time-based rules for different autonomy during business hours vs. off-hours
  • Execution guardrails to prevent dangerous actions
  • Full audit trail of every autonomy decision

Autonomy Levels

| Level | Actions auto-executed | Human role |
|-------|-----------------------|------------|
| assisted | None | AI suggests, human executes everything |
| supervised | Low-risk only | AI auto-executes safe actions, human approves the rest |
| autonomous | All except protected | AI handles most actions, human reviews protected targets |
| full_autonomous | Everything | Emergency mode -- AI executes all actions (requires special auth) |

Risk Level Mapping

Each action has an inherent risk level that determines whether it can be auto-executed:

| Risk level | Auto-execute in Supervised? | Auto-execute in Autonomous? |
|------------|-----------------------------|-----------------------------|
| none | Yes | Yes |
| low | Yes | Yes |
| medium | No | Yes |
| high | No | Yes |
| critical | No | No (requires full_autonomous) |
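
The mapping collapses to a per-level risk ceiling; the helper below is an illustrative reading of the table, not the actual resolver:

```python
# Risk levels in increasing order, and the highest risk each autonomy
# level may auto-execute (None = nothing is auto-executed)
RISK_ORDER = ["none", "low", "medium", "high", "critical"]
MAX_AUTO_RISK = {"assisted": None, "supervised": "low",
                 "autonomous": "high", "full_autonomous": "critical"}

def can_auto_execute(risk: str, level: str) -> bool:
    """True if an action of this risk may run without human approval."""
    ceiling = MAX_AUTO_RISK[level]
    if ceiling is None:
        return False
    return RISK_ORDER.index(risk) <= RISK_ORDER.index(ceiling)
```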

Configuration

Get Current Config

curl http://localhost:8080/api/v1/autonomy/config

Update Config

curl -X PUT http://localhost:8080/api/v1/autonomy/config \
  -H "Content-Type: application/json" \
  -d '{
    "default_level": "supervised",
    "per_action_overrides": {
      "isolate_host": "assisted",
      "create_ticket": "autonomous"
    },
    "per_severity_overrides": {
      "critical": "assisted",
      "low": "autonomous"
    },
    "time_based_rules": [
      {
        "name": "Business hours - supervised",
        "start_hour": 9,
        "end_hour": 17,
        "days_of_week": [1, 2, 3, 4, 5],
        "level": "supervised"
      },
      {
        "name": "Off-hours - autonomous",
        "start_hour": 17,
        "end_hour": 9,
        "days_of_week": [0, 1, 2, 3, 4, 5, 6],
        "level": "autonomous"
      }
    ],
    "emergency_contacts": ["[email protected]"]
  }'

Resolution Priority

When resolving the autonomy level for a given action, overrides are checked in this order:

  1. Per-action overrides (highest priority)
  2. Per-severity overrides
  3. Time-based rules
  4. Default level (fallback)
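
That resolution order can be sketched as a cascade of lookups; the helper and its signature are illustrative (the hour_level argument stands in for the result of time-based rule matching):

```python
def resolve_level(action: str, severity: str, hour_level, config: dict) -> str:
    """Resolve autonomy level using the documented priority order."""
    if action in config.get("per_action_overrides", {}):
        return config["per_action_overrides"][action]      # 1. per-action
    if severity in config.get("per_severity_overrides", {}):
        return config["per_severity_overrides"][severity]  # 2. per-severity
    if hour_level is not None:
        return hour_level                                  # 3. time-based rule
    return config["default_level"]                         # 4. fallback
```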

Resolve for a Specific Action

Check what the system would decide for a specific action + severity combination:

curl -X POST http://localhost:8080/api/v1/autonomy/resolve \
  -H "Content-Type: application/json" \
  -d '{"action": "isolate_host", "severity": "critical"}'

Response:

{
  "level": "assisted",
  "auto_execute": false,
  "reason": "Per-action override for 'isolate_host'"
}

Time-Based Rules

Time-based rules let you run with less autonomy during business hours (when analysts are available) and more autonomy during nights and weekends.

| Field | Description |
|-------|-------------|
| name | Human-readable rule name |
| start_hour | Start hour, 0-23 inclusive |
| end_hour | End hour, 0-24 exclusive |
| days_of_week | Array of days (0=Sunday through 6=Saturday) |
| level | Autonomy level when rule applies |

Hours wrap around midnight: start_hour: 22, end_hour: 6 means 10 PM to 6 AM.
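
The wrap-around check can be sketched as follows (start inclusive, end exclusive, per the table above):

```python
def rule_applies(hour: int, start_hour: int, end_hour: int) -> bool:
    """True if the hour falls in [start_hour, end_hour), wrapping past midnight."""
    if start_hour < end_hour:
        return start_hour <= hour < end_hour
    return hour >= start_hour or hour < end_hour  # window wraps past midnight
```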

Execution Guardrails

The guardrails system (configured in config/guardrails.yaml) provides hard limits regardless of autonomy level:

  • Forbidden actions -- actions that can never be automated (e.g., delete_user, wipe_host)
  • Protected assets -- targets that always require human approval (production systems, domain controllers)
  • Rate limits -- maximum actions per hour/day to prevent runaway automation
  • Blast radius limits -- caps on how many targets a single action can affect

See Guardrails Reference for full configuration details.

Audit Log

Every autonomy decision is logged for compliance and debugging:

curl "http://localhost:8080/api/v1/autonomy/audit?limit=20"

# Filter by incident
curl "http://localhost:8080/api/v1/autonomy/audit?incident_id={id}"

Each audit entry records:

  • Action and severity evaluated
  • Resolved autonomy level
  • Whether auto-execution was allowed
  • Reason for the decision
  • Whether the action was actually executed
  • Execution outcome

Attack Surface Integration

Correlate incidents with known vulnerabilities and external exposures using integrations with vulnerability scanners and attack surface monitoring platforms.

Overview

The attack surface module (Stage 5.2) connects Triage Warden to:

  • Vulnerability scanners -- Qualys, Tenable, and Rapid7 for known vulnerability data
  • Attack surface monitors -- Censys and SecurityScorecard for external exposure discovery
  • Risk scoring -- combined risk assessment using vulnerability and exposure data

Vulnerability Scanners

Supported Platforms

| Platform | Connector | Capabilities |
|----------|-----------|--------------|
| Qualys | QualysConnector | Asset vulns, scan results, CVE lookup, recent findings |
| Tenable | TenableConnector | Asset vulns, scan results, CVE lookup, recent findings |
| Rapid7 | Rapid7Connector | Asset vulns, scan results, CVE lookup, recent findings |

VulnerabilityScanner Trait

All scanners implement the same trait, making them interchangeable:

pub trait VulnerabilityScanner: Connector {
    async fn get_vulnerabilities_for_asset(&self, asset_id: &str) -> ConnectorResult<Vec<Vulnerability>>;
    async fn get_scan_results(&self, scan_id: &str) -> ConnectorResult<ScanResult>;
    async fn get_recent_vulnerabilities(&self, since: DateTime<Utc>, limit: Option<usize>) -> ConnectorResult<Vec<Vulnerability>>;
    async fn get_vulnerability_by_cve(&self, cve_id: &str) -> ConnectorResult<Option<Vulnerability>>;
}

Vulnerability Data

Each vulnerability includes:

| Field | Description |
|-------|-------------|
| cve_id | CVE identifier (if assigned) |
| severity | Informational, Low, Medium, High, Critical |
| cvss_score | CVSS base score (0.0 - 10.0) |
| affected_asset_ids | Which assets are affected |
| exploit_available | Whether a public exploit exists |
| patch_available | Whether a vendor patch is available |
| status | Open, Remediated, Accepted, FalsePositive |

Scan Results

Query scan results for summary data:

| Field | Description |
|-------|-------------|
| total_hosts | Number of hosts scanned |
| vulnerabilities_found | Total vulnerabilities discovered |
| critical_count | Critical severity findings |
| high_count | High severity findings |
| status | Pending, Running, Completed, Failed, Cancelled |

Attack Surface Monitoring

Supported Platforms

| Platform | Connector | Capabilities |
|----------|-----------|--------------|
| Censys | CensysConnector | Domain exposures, asset exposure, risk scoring |
| SecurityScorecard | ScorecardConnector | Domain exposures, asset exposure, risk scoring |

AttackSurfaceMonitor Trait

pub trait AttackSurfaceMonitor: Connector {
    async fn get_exposures(&self, domain: &str) -> ConnectorResult<Vec<ExternalExposure>>;
    async fn get_asset_exposure(&self, asset_id: &str) -> ConnectorResult<Vec<ExternalExposure>>;
    async fn get_risk_score(&self, domain: &str) -> ConnectorResult<Option<f32>>;
}

Exposure Types

The system detects these categories of external exposure:

Type | Description | Example
open_port | Open network port with identified service | Port 22 running SSH
expired_certificate | TLS certificate past its expiry date | example.com cert expired
weak_cipher | Deprecated or weak TLS cipher in use | RC4 cipher detected
exposed_service | Publicly accessible service that may be unintended | Elasticsearch on public IP
dns_issue | DNS misconfiguration | Missing SPF record
misconfigured_header | Missing or incorrect HTTP security header | No X-Frame-Options

Each exposure includes a risk score (0.0 to 100.0) and structured details.

Risk Scoring

Risk scores from vulnerability scanners and ASM platforms are combined during incident triage to assess the exposure of affected assets. When the AI agent triages an incident involving a compromised host, it can check:

  1. What known vulnerabilities exist on the host
  2. Whether public exploits are available for those vulnerabilities
  3. What external exposures exist for the host or its domain
  4. The overall risk score for the affected domain

This context helps the agent make more accurate severity assessments and recommend appropriate response actions.
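The documentation does not specify how the two score sources are blended; the sketch below is one illustrative way to do it in Python. The 60/40 weighting, the CVSS-to-100 scaling, and the exploit bonus are all assumptions for illustration, not the shipped algorithm.

```python
from dataclasses import dataclass

@dataclass
class HostRisk:
    cvss_scores: list[float]      # CVSS base scores (0.0-10.0) for open vulns
    exploit_available: bool       # any vuln with a public exploit
    exposure_scores: list[float]  # ASM exposure risk scores (0.0-100.0)

def combined_risk(host: HostRisk) -> float:
    """Blend vulnerability and exposure data into a single 0-100 risk score."""
    vuln = max(host.cvss_scores, default=0.0) * 10       # scale CVSS onto 0-100
    exposure = max(host.exposure_scores, default=0.0)
    score = 0.6 * vuln + 0.4 * exposure                  # illustrative weighting
    if host.exploit_available:
        score = min(100.0, score + 15.0)                 # public exploit raises risk
    return round(score, 1)
```

A host with a CVSS 9.8 vulnerability, a public exploit, and a high-risk exposure would score at the cap, while a mid-severity host without exploits lands in the middle of the range.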

Configuration

Add vulnerability scanner and ASM connectors in config/default.yaml:

connectors:
  qualys:
    connector_type: qualys
    enabled: true
    base_url: https://qualysapi.qualys.com
    api_key: ${QUALYS_USERNAME}
    api_secret: ${QUALYS_PASSWORD}
    timeout_secs: 60

  censys:
    connector_type: censys
    enabled: true
    base_url: https://search.censys.io
    api_key: ${CENSYS_API_ID}
    api_secret: ${CENSYS_SECRET}
    timeout_secs: 30

Content Packages

Share playbooks, hunts, knowledge articles, and saved queries between Triage Warden instances using distributable content packages.

Overview

The content package system (Stage 5.5) provides:

  • Import/export of playbooks, hunts, knowledge, and queries
  • Package validation before import
  • Conflict resolution when imported content already exists
  • Semantic versioning and compatibility tracking

Package Format

A content package consists of a manifest and a list of content items:

{
  "manifest": {
    "name": "phishing-response-kit",
    "version": "1.2.0",
    "description": "Playbooks and hunts for phishing incident response",
    "author": "Security Team",
    "license": "MIT",
    "tags": ["phishing", "email", "social-engineering"],
    "compatibility": ">=2.0.0"
  },
  "contents": [
    {
      "type": "playbook",
      "name": "phishing-triage",
      "data": { "stages": [...] }
    },
    {
      "type": "hunt",
      "name": "credential-harvesting-detection",
      "data": { "hypothesis": "...", "queries": [...] }
    },
    {
      "type": "knowledge",
      "title": "Phishing Indicators Guide",
      "content": "Common phishing indicators include..."
    },
    {
      "type": "query",
      "name": "failed-logins-by-source",
      "query_type": "siem",
      "query": "event.type:authentication AND event.outcome:failure | stats count by source.ip"
    }
  ]
}

Content Types

Type | Description | Stored in
playbook | Automated response workflows | Playbook repository
hunt | Threat hunt definitions with queries | Hunt store
knowledge | Reference articles and guides | Knowledge base
query | Saved search queries | Query library

Manifest Fields

Field | Required | Description
name | Yes | Unique package name
version | Yes | Semantic version string
description | Yes | What the package contains
author | Yes | Creator name or organization
license | No | License identifier (e.g., "MIT", "Apache-2.0")
tags | No | Categorization tags
compatibility | No | Minimum Triage Warden version required

Importing Packages

curl -X POST http://localhost:8080/api/v1/packages/import \
  -H "Content-Type: application/json" \
  -d '{
    "package": { ... },
    "conflict_resolution": "skip"
  }'

Response:

{
  "imported": 3,
  "skipped": 1,
  "errors": []
}

Conflict Resolution

When an imported item has the same name as an existing one:

Mode | Behavior
skip | Keep existing, ignore the imported item (default)
overwrite | Replace existing with the imported version
rename | Import with a modified name (e.g., phishing-triage-imported-1)
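The three modes can be sketched as a small Python decision function. This is an illustration of the semantics described above, not the server's implementation; the `-imported-N` suffix format is inferred from the example name.

```python
def resolve_conflict(name: str, existing: set[str], mode: str = "skip"):
    """Decide what to do with an imported item whose name may already exist.

    Returns (action, final_name), where action is "import", "skip", or "overwrite".
    """
    if name not in existing:
        return ("import", name)          # no conflict at all
    if mode == "skip":
        return ("skip", name)            # keep existing, ignore import
    if mode == "overwrite":
        return ("overwrite", name)       # replace existing item
    if mode == "rename":
        n = 1
        while f"{name}-imported-{n}" in existing:
            n += 1                       # find the first free suffix
        return ("import", f"{name}-imported-{n}")
    raise ValueError(f"unknown conflict mode: {mode}")
```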

Validating Packages

Check a package for errors before importing:

curl -X POST http://localhost:8080/api/v1/packages/validate \
  -H "Content-Type: application/json" \
  -d '{ "manifest": { ... }, "contents": [ ... ] }'

Response:

{
  "valid": true,
  "warnings": ["Package author is not specified"],
  "errors": [],
  "content_count": 4
}

Validation checks:

  • Package name and version are present
  • All content items have non-empty names
  • Warns on missing author or empty content list
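The checks listed above can be expressed as a short Python sketch that produces the same report shape as the `/packages/validate` response. The function is a hypothetical client-side mirror of the server's validation, under the assumption that `knowledge` items use `title` where other types use `name`.

```python
def validate_package(package: dict) -> dict:
    """Validate a package dict and return a report like the API response."""
    manifest = package.get("manifest", {})
    contents = package.get("contents", [])
    errors, warnings = [], []

    # Required manifest fields
    if not manifest.get("name"):
        errors.append("Package name is required")
    if not manifest.get("version"):
        errors.append("Package version is required")

    # Warn-level checks
    if not manifest.get("author"):
        warnings.append("Package author is not specified")
    if not contents:
        warnings.append("Package has no content items")

    # Every item needs a non-empty name (knowledge items use "title")
    for i, item in enumerate(contents):
        if not (item.get("name") or item.get("title")):
            errors.append(f"Content item {i} has an empty name")

    return {
        "valid": not errors,
        "warnings": warnings,
        "errors": errors,
        "content_count": len(contents),
    }
```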

Exporting Content

Export a Playbook

curl -X POST http://localhost:8080/api/v1/packages/export/playbook/{playbook_id} \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-playbook-package",
    "version": "1.0.0",
    "description": "Exported playbook",
    "author": "Security Team",
    "license": "MIT",
    "tags": ["phishing"]
  }'

Export a Hunt

curl -X POST http://localhost:8080/api/v1/packages/export/hunt/{hunt_id} \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-hunt-package",
    "version": "1.0.0",
    "description": "Exported hunt",
    "author": "Threat Hunting Team"
  }'

Both return the full package JSON that can be shared or imported into another instance.

AI Triage

Automated incident analysis using Claude AI agents.

Overview

The triage agent analyzes security incidents to:

  1. Classify - Determine if the incident is malicious, suspicious, or benign
  2. Assess confidence - Quantify certainty in the classification
  3. Explain - Provide reasoning for the verdict
  4. Recommend - Suggest response actions

How It Works

Incident → Playbook Selection → Tool Execution → AI Analysis → Verdict
  1. Incident received - New incident created via webhook or API
  2. Playbook selected - Based on incident type (phishing, malware, etc.)
  3. Tools executed - Parse data, lookup reputation, check authentication
  4. AI analysis - Claude analyzes gathered data
  5. Verdict returned - Classification with confidence and recommendations

Example Verdict

{
  "incident_id": "INC-2024-001",
  "classification": "malicious",
  "confidence": 0.92,
  "category": "phishing",
  "reasoning": "Multiple indicators suggest this is a credential phishing attempt:\n1. Sender domain registered 2 days ago\n2. SPF and DKIM authentication failed\n3. URL leads to a fake Microsoft login page\n4. Subject uses urgency tactics",
  "recommended_actions": [
    {
      "action": "quarantine_email",
      "priority": 1,
      "reason": "Prevent user access to phishing content"
    },
    {
      "action": "block_sender",
      "priority": 2,
      "reason": "Sender has no legitimate history"
    },
    {
      "action": "notify_user",
      "priority": 3,
      "reason": "Educate user about phishing attempt"
    }
  ],
  "iocs": [
    {"type": "domain", "value": "phishing-site.com"},
    {"type": "ip", "value": "192.168.1.100"}
  ],
  "mitre_attack": ["T1566.001", "T1078"]
}

Triggering Triage

Automatic (Webhook)

Configure webhooks to auto-triage new incidents:

webhooks:
  email_gateway:
    auto_triage: true
    playbook: phishing_triage

Manual (CLI)

tw-cli triage run --incident INC-2024-001

Manual (API)

curl -X POST http://localhost:8080/api/incidents/INC-2024-001/triage

Triage Agent

The AI agent that analyzes security incidents.

Architecture

┌─────────────────────────────────────────────────────────┐
│                     Triage Agent                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │   Claude    │  │   Tools     │  │  Playbook   │     │
│  │   Model     │  │   (Bridge)  │  │   Engine    │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│                    Python Bridge                         │
│           (ThreatIntelBridge, SIEMBridge, etc.)         │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│                   Rust Connectors                        │
│        (VirusTotal, Splunk, CrowdStrike, etc.)          │
└─────────────────────────────────────────────────────────┘

Agent Configuration

# python/tw_ai/agents/config.py
class AgentConfig:
    model: str = "claude-sonnet-4-20250514"
    max_tokens: int = 4096
    temperature: float = 0.1
    max_tool_calls: int = 10
    timeout_seconds: int = 120

Environment variables:

TW_AI_PROVIDER=anthropic
TW_ANTHROPIC_API_KEY=your-key
TW_AI_MODEL=claude-sonnet-4-20250514

Available Tools

The agent has access to these tools via the Python bridge:

Tool | Purpose
parse_email | Extract email components
check_email_authentication | Validate SPF/DKIM/DMARC
lookup_sender_reputation | Query sender reputation
lookup_urls | Check URL reputation
lookup_attachments | Check attachment hashes
search_siem | Query SIEM for related events
get_host_info | Get EDR host information

Agent Workflow

async def triage(self, incident: Incident) -> Verdict:
    # 1. Load appropriate playbook
    playbook = self.load_playbook(incident.incident_type)

    # 2. Execute playbook steps (tools)
    context = {}
    for step in playbook.steps:
        result = await self.execute_step(step, incident, context)
        context[step.output] = result

    # 3. Build analysis prompt
    prompt = self.build_analysis_prompt(incident, context)

    # 4. Get AI verdict
    response = await self.client.messages.create(
        model=self.config.model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=self.config.max_tokens
    )

    # 5. Parse and return verdict
    return self.parse_verdict(response)

System Prompt

The agent uses a specialized system prompt:

You are an expert security analyst assistant. Analyze the provided security
incident data and determine:

1. Classification: Is this malicious, suspicious, benign, or inconclusive?
2. Confidence: How certain are you (0.0 to 1.0)?
3. Category: What type of threat is this (phishing, malware, etc.)?
4. Reasoning: Explain your analysis step by step
5. Recommended Actions: What should be done to respond?

Use the tool results provided to inform your analysis. Be thorough but concise.
Cite specific evidence for your conclusions.

Tool Calling

The agent can call tools during analysis:

# Agent decides to check URL reputation
tool_result = await self.call_tool(
    name="lookup_urls",
    parameters={"urls": ["https://suspicious-site.com/login"]}
)

# Result used in analysis
# {
#   "results": [{
#     "url": "https://suspicious-site.com/login",
#     "malicious": true,
#     "categories": ["phishing"],
#     "confidence": 0.95
#   }]
# }

Customizing the Agent

Custom System Prompt

agent = TriageAgent(
    system_prompt="""
    You are a SOC analyst specializing in email security.
    Focus on phishing indicators and BEC patterns.
    Always check sender authentication carefully.
    """
)

Custom Tools

Register additional tools:

@agent.tool
async def custom_lookup(domain: str) -> dict:
    """Look up domain in internal threat database."""
    return await internal_db.query(domain)

Model Selection

# Use different models for different scenarios
if incident.severity == "critical":
    agent = TriageAgent(model="claude-opus-4-20250514")
else:
    agent = TriageAgent(model="claude-sonnet-4-20250514")

Error Handling

The agent handles failures gracefully:

try:
    verdict = await agent.triage(incident)
except ToolError as e:
    # Tool failed - continue with available data
    verdict = await agent.triage_partial(incident, failed_tools=[e.tool])
except AIError as e:
    # AI call failed - return inconclusive
    verdict = Verdict.inconclusive(reason=str(e))

Metrics

Agent metrics exported to Prometheus:

  • triage_duration_seconds - Time to complete triage
  • triage_tool_calls_total - Tool calls per triage
  • triage_verdict_total - Verdicts by classification
  • triage_confidence_histogram - Confidence score distribution

Verdict Types

Understanding the classification outcomes from AI triage.

Classifications

Classification | Description | Typical Response
Malicious | Confirmed threat | Immediate containment
Suspicious | Likely threat, needs investigation | Queue for analyst review
Benign | Not a threat | Close or archive
Inconclusive | Insufficient data | Request more information

Malicious

The incident is a confirmed security threat.

Criteria:

  • Multiple strong threat indicators
  • High-confidence threat intelligence matches
  • Clear malicious intent (credential theft, malware, etc.)

Example:

{
  "classification": "malicious",
  "confidence": 0.95,
  "category": "phishing",
  "reasoning": "Email contains credential phishing page targeting Microsoft 365. Sender domain registered yesterday, fails all email authentication. URL redirects to fake login mimicking Microsoft branding."
}

Response:

  • Execute recommended containment actions
  • Create incident ticket
  • Notify affected users

Suspicious

The incident shows concerning indicators but lacks definitive proof.

Criteria:

  • Some threat indicators present
  • Mixed or conflicting signals
  • Unusual but not clearly malicious behavior

Example:

{
  "classification": "suspicious",
  "confidence": 0.65,
  "category": "potential_phishing",
  "reasoning": "Email sender is unknown but domain is 6 months old with valid authentication. URL leads to legitimate document sharing service but file name uses urgency tactics. Recipient has not received email from this sender before."
}

Response:

  • Queue for analyst review
  • Gather additional context
  • Consider temporary quarantine pending review

Benign

The incident is not a security threat.

Criteria:

  • No threat indicators found
  • Known good sender/source
  • Normal expected behavior

Example:

{
  "classification": "benign",
  "confidence": 0.92,
  "category": "legitimate_email",
  "reasoning": "Email from known vendor with established sending history. All authentication passes. Attachment is a standard invoice PDF matching expected format. No suspicious URLs or indicators."
}

Response:

  • Close incident
  • Release from quarantine if held
  • Update detection rules if false positive

Inconclusive

Insufficient data to make a determination.

Criteria:

  • Missing critical information
  • Tool failures preventing analysis
  • Conflicting strong indicators

Example:

{
  "classification": "inconclusive",
  "confidence": 0.3,
  "category": "unknown",
  "reasoning": "Unable to analyze attachment - file corrupted. Sender reputation service unavailable. Email authentication results are mixed (SPF pass, DKIM fail). Need manual review of attachment content.",
  "missing_data": [
    "attachment_analysis",
    "sender_reputation"
  ]
}

Response:

  • Escalate to analyst
  • Retry failed tool calls
  • Request additional information

Confidence Scores

Confidence ranges and their meaning:

Range | Interpretation
0.9 - 1.0 | Very high confidence, clear evidence
0.7 - 0.9 | High confidence, strong indicators
0.5 - 0.7 | Moderate confidence, mixed signals
0.3 - 0.5 | Low confidence, limited evidence
0.0 - 0.3 | Very low confidence, insufficient data

Category Types

Email Threats

Category | Description
phishing | Credential theft attempt
spear_phishing | Targeted phishing
bec | Business email compromise
malware_delivery | Malicious attachment/link
spam | Unsolicited bulk email

Endpoint Threats

Category | Description
malware | Malicious software detected
ransomware | Ransomware activity
cryptominer | Cryptocurrency mining
rat | Remote access trojan
pup | Potentially unwanted program

Access Threats

Category | Description
brute_force | Password guessing attempt
credential_stuffing | Leaked credential use
impossible_travel | Geographically impossible login
account_takeover | Compromised account

Using Verdicts

Automation Rules

# Auto-respond to high-confidence malicious
- trigger:
    classification: malicious
    confidence: ">= 0.9"
  actions:
    - quarantine_email
    - block_sender
    - create_ticket

# Queue suspicious for review
- trigger:
    classification: suspicious
  actions:
    - escalate:
        level: analyst
        reason: "Suspicious activity requires review"
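The trigger semantics above can be sketched in Python. `trigger_matches` and its operator parsing are a hypothetical illustration of how such rules might be evaluated, not the actual rule engine: string values beginning with a comparison operator (as in `">= 0.9"`) are treated as numeric comparisons, everything else as equality.

```python
import operator

# Supported comparison prefixes in trigger values (an assumed convention)
_OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}

def trigger_matches(trigger: dict, verdict: dict) -> bool:
    """Return True if every trigger condition matches the verdict."""
    for field, expected in trigger.items():
        actual = verdict.get(field)
        if isinstance(expected, str) and expected.split()[0] in _OPS:
            op, threshold = expected.split()          # e.g. ">= 0.9"
            if not _OPS[op](float(actual), float(threshold)):
                return False
        elif actual != expected:                      # plain equality match
            return False
    return True
```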

Metrics

Track verdict distribution:

# Verdict counts by classification
sum by (classification) (triage_verdict_total)

# Average confidence by category
avg by (category) (triage_confidence)

Confidence Scoring

How the AI agent determines confidence in its verdicts.

Confidence Factors

The agent considers multiple factors when calculating confidence:

Evidence Quality

Factor | Impact
Threat intel match (high confidence) | +0.3
Threat intel match (low confidence) | +0.1
Authentication failure | +0.2
Known malicious indicator | +0.3
Suspicious pattern | +0.1

Evidence Quantity

Indicators | Confidence Boost
1 indicator | Base
2-3 indicators | +0.1
4-5 indicators | +0.2
6+ indicators | +0.3

Data Completeness

Missing Data | Confidence Penalty
None | 0
Minor (sender reputation) | -0.1
Moderate (attachment analysis) | -0.2
Major (multiple tools failed) | -0.3

Calculation Example

Phishing Email Analysis:

Base confidence: 0.5

Evidence found:
+ SPF failed: +0.15
+ DKIM failed: +0.15
+ Sender domain < 7 days old: +0.2
+ URL matches phishing pattern: +0.25
+ VirusTotal flags URL as phishing: +0.2

Evidence count (5): +0.2

Data completeness: All tools succeeded: +0

Final confidence: 0.5 + 0.15 + 0.15 + 0.2 + 0.25 + 0.2 + 0.2 = 1.65, capped at 0.99

Verdict: malicious, confidence: 0.99
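The worked example above reduces to a short scoring function. This is an illustrative Python sketch of that arithmetic (base, per-evidence increments, quantity boost, completeness penalty, cap), not the agent's internal implementation:

```python
def calculate_confidence(evidence: list[float],
                         missing_penalty: float = 0.0,
                         base: float = 0.5) -> float:
    """Score as in the worked example: base + evidence + quantity boost - penalty."""
    score = base + sum(evidence)
    n = len(evidence)
    if n >= 6:          # quantity boost per the Evidence Quantity table
        score += 0.3
    elif n >= 4:
        score += 0.2
    elif n >= 2:
        score += 0.1
    score -= missing_penalty        # Data Completeness penalty
    return min(round(score, 2), 0.99)  # cap at 0.99
```

Running it on the five pieces of evidence from the example (`[0.15, 0.15, 0.2, 0.25, 0.2]`) reproduces the capped 0.99 verdict confidence.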

Confidence Thresholds

Policy decisions use confidence thresholds:

# Auto-quarantine high confidence malicious
[[policy.rules]]
name = "auto_quarantine_confident"
classification = "malicious"
confidence_min = 0.9
action = "quarantine_email"
decision = "allowed"

# Require review for lower confidence
[[policy.rules]]
name = "review_uncertain"
confidence_max = 0.7
decision = "requires_approval"
approval_level = "analyst"
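The threshold rules above amount to first-match policy evaluation. A minimal Python sketch of that logic, assuming missing `classification` matches any verdict, missing bounds default to the full 0.0-1.0 range, and unmatched verdicts fall back conservatively to approval (all assumptions, not the real policy engine):

```python
def policy_decision(rules: list[dict], classification: str, confidence: float) -> str:
    """Return the decision of the first rule whose constraints all match."""
    for rule in rules:
        if rule.get("classification") not in (None, classification):
            continue                                  # wrong classification
        if confidence < rule.get("confidence_min", 0.0):
            continue                                  # below minimum confidence
        if confidence > rule.get("confidence_max", 1.0):
            continue                                  # above maximum confidence
        return rule["decision"]
    return "requires_approval"  # conservative default when nothing matches
```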

Confidence Calibration

The agent is calibrated so confidence correlates with accuracy:

Stated Confidence | Expected Accuracy
0.9 | ~90% of verdicts correct
0.8 | ~80% of verdicts correct
0.7 | ~70% of verdicts correct
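Calibration can be checked offline by bucketing reviewed verdicts by stated confidence and comparing observed accuracy per bucket. A small Python sketch of that computation (the 0.1-wide bucket scheme is an assumption mirroring the metric's `confidence_bucket` label):

```python
from collections import defaultdict

def calibration_report(verdicts: list[tuple[float, bool]]) -> dict[str, float]:
    """Group (confidence, was_correct) pairs into 0.1-wide buckets
    and return the observed accuracy per bucket."""
    buckets = defaultdict(list)
    for confidence, correct in verdicts:
        lo = min(int(confidence * 10) / 10, 0.9)      # 1.0 joins the top bucket
        buckets[f"{lo:.1f}-{lo + 0.1:.1f}"].append(correct)
    return {b: round(sum(v) / len(v), 2) for b, v in sorted(buckets.items())}
```

A well-calibrated agent shows accuracy close to the bucket's lower bound; a large gap in either direction suggests over- or under-confidence.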

Monitoring Calibration

Track calibration with metrics:

# Accuracy at confidence level
triage_accuracy_by_confidence{confidence_bucket="0.9-1.0"}

Improving Calibration

  1. Feedback loop - Log false positives to improve
  2. Periodic review - Sample low-confidence verdicts
  3. Model updates - Retrain with corrected examples

Handling Low Confidence

When confidence is low:

Option 1: Escalate

- condition: confidence < 0.6
  action: escalate
  parameters:
    level: analyst
    reason: "Low confidence verdict requires human review"

Option 2: Gather More Data

- condition: confidence < 0.6
  action: request_additional_data
  parameters:
    - "sender_history"
    - "recipient_context"

Option 3: Conservative Default

- condition: confidence < 0.6
  action: quarantine_email
  parameters:
    reason: "Quarantined pending review due to uncertainty"

Confidence in UI

Dashboard displays confidence visually:

Confidence | Display
0.9+ | Green badge, "High Confidence"
0.7-0.9 | Yellow badge, "Moderate Confidence"
0.5-0.7 | Orange badge, "Low Confidence"
<0.5 | Red badge, "Very Low Confidence"
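The badge mapping above translates directly into a small helper; this Python sketch mirrors the table (the `(color, label)` return shape is an illustrative choice):

```python
def confidence_badge(confidence: float) -> tuple[str, str]:
    """Map a confidence score to the dashboard badge (color, label)."""
    if confidence >= 0.9:
        return ("green", "High Confidence")
    if confidence >= 0.7:
        return ("yellow", "Moderate Confidence")
    if confidence >= 0.5:
        return ("orange", "Low Confidence")
    return ("red", "Very Low Confidence")
```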

Improving Confidence

Actions that help the agent be more confident:

  1. Complete data - Ensure all tools succeed
  2. Rich context - Provide incident metadata
  3. Historical data - Include past incidents with similar patterns
  4. Clear playbooks - Well-defined analysis steps

Playbooks

Playbooks define automated investigation and response workflows.

Overview

A playbook is a sequence of steps that:

  1. Gather and analyze incident data
  2. Enrich with threat intelligence
  3. Determine verdict and response
  4. Execute approved actions

Playbook Structure

name: phishing_triage
description: Automated phishing email analysis
version: "1.0"

# When this playbook applies
triggers:
  incident_type: phishing
  auto_run: true

# Variables available to steps
variables:
  quarantine_threshold: 0.7
  block_threshold: 0.3

# Execution steps
steps:
  - name: Parse Email
    action: parse_email
    parameters:
      raw_email: "{{ incident.raw_data.raw_email }}"
    output: parsed

  - name: Check Authentication
    action: check_email_authentication
    parameters:
      headers: "{{ parsed.headers }}"
    output: auth

  - name: Check Sender
    action: lookup_sender_reputation
    parameters:
      sender: "{{ parsed.sender }}"
    output: sender_rep

  - name: Check URLs
    action: lookup_urls
    parameters:
      urls: "{{ parsed.urls }}"
    output: url_results
    condition: "{{ parsed.urls | length > 0 }}"

  - name: Quarantine if Malicious
    action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
      reason: "Automated quarantine - phishing detected"
    condition: >
      sender_rep.score < variables.quarantine_threshold or
      url_results.malicious_count > 0 or
      not auth.authentication_passed

# Final verdict generation
verdict:
  use_ai: true
  model: claude-sonnet-4-20250514
  context:
    - parsed
    - auth
    - sender_rep
    - url_results

Triggers

Define when a playbook runs:

triggers:
  # Run for specific incident types
  incident_type: phishing

  # Auto-run on incident creation
  auto_run: true

  # Or require manual trigger
  auto_run: false

  # Conditions
  conditions:
    severity: ["medium", "high", "critical"]
    source: "email_gateway"

Steps

Basic Step

- name: Step Name
  action: action_name
  parameters:
    key: value
  output: variable_name

Conditional Step

- name: Block Known Bad
  action: block_sender
  parameters:
    sender: "{{ parsed.sender }}"
  condition: "{{ sender_rep.score < 0.2 }}"

Parallel Steps

- parallel:
    - action: lookup_urls
      parameters:
        urls: "{{ parsed.urls }}"
      output: url_results

    - action: lookup_attachments
      parameters:
        attachments: "{{ parsed.attachments }}"
      output: attachment_results

Loop Steps

- name: Check Each URL
  loop: "{{ parsed.urls }}"
  action: lookup_url
  parameters:
    url: "{{ item }}"
  output: url_results
  aggregate: list

Variables

Built-in Variables

Variable | Description
incident | The incident being processed
incident.id | Incident ID
incident.raw_data | Original incident data
incident.severity | Incident severity
variables | Playbook-defined variables

Step Outputs

Each step's output is available to subsequent steps:

- action: parse_email
  output: parsed

- action: lookup_urls
  parameters:
    urls: "{{ parsed.urls }}"  # Use previous output

Templates

Use Jinja2-style templates:

parameters:
  message: "Alert for {{ incident.id }}: {{ parsed.subject }}"
  priority: "{{ 'high' if incident.severity == 'critical' else 'medium' }}"

Creating Playbooks

Guide to writing custom playbooks for your security workflows.

Getting Started

1. Create Playbook File

mkdir -p playbooks
touch playbooks/my_playbook.yaml

2. Define Basic Structure

name: my_playbook
description: Description of what this playbook does
version: "1.0"

triggers:
  incident_type: phishing
  auto_run: true

steps:
  - name: First Step
    action: parse_email
    output: result

3. Register Playbook

tw-cli playbook add playbooks/my_playbook.yaml

Step Types

Action Step

Execute a registered action:

- name: Parse Email Content
  action: parse_email
  parameters:
    raw_email: "{{ incident.raw_data.raw_email }}"
  output: parsed
  on_error: continue  # or "fail" (default)

Condition Step

Branch based on conditions:

- name: Check if High Risk
  condition: "{{ sender_rep.score < 0.3 }}"
  then:
    - action: quarantine_email
      parameters:
        message_id: "{{ incident.raw_data.message_id }}"
  else:
    - action: log_event
      parameters:
        message: "Low risk, no action needed"

AI Analysis Step

Get AI verdict:

- name: AI Analysis
  type: ai_analysis
  model: claude-sonnet-4-20250514
  context:
    - parsed
    - auth_results
    - reputation
  prompt: |
    Analyze this email for phishing indicators.
    Consider the authentication results and sender reputation.
  output: ai_verdict

Notification Step

Send alerts:

- name: Alert Team
  action: notify_channel
  parameters:
    channel: slack
    message: |
      New {{ incident.severity }} incident detected
      ID: {{ incident.id }}
      Type: {{ incident.incident_type }}

Error Handling

Per-Step Error Handling

- name: Check Reputation
  action: lookup_sender_reputation
  parameters:
    sender: "{{ parsed.sender }}"
  output: reputation
  on_error: continue  # Don't fail playbook if this fails
  default_output:     # Use this if step fails
    score: 0.5
    risk_level: "unknown"

Global Error Handler

on_error:
  - action: notify_channel
    parameters:
      channel: slack
      message: "Playbook {{ playbook.name }} failed: {{ error.message }}"
  - action: escalate
    parameters:
      level: analyst
      reason: "Automated triage failed"

Variables and Templates

Define Variables

variables:
  high_risk_threshold: 0.3
  quarantine_enabled: true
  notification_channel: "#security-alerts"

Use Variables

- name: Check Risk
  condition: "{{ sender_rep.score < variables.high_risk_threshold }}"
  then:
    - action: quarantine_email
      condition: "{{ variables.quarantine_enabled }}"

Template Functions

parameters:
  # String manipulation
  domain: "{{ parsed.sender | split('@') | last }}"

  # Conditionals
  priority: "{{ 'critical' if incident.severity == 'critical' else 'high' }}"

  # Lists
  all_urls: "{{ parsed.urls | join(', ') }}"
  url_count: "{{ parsed.urls | length }}"

  # Defaults
  assignee: "{{ incident.assignee | default('unassigned') }}"

Testing Playbooks

Dry Run

tw-cli playbook test my_playbook \
  --incident INC-2024-001 \
  --dry-run

With Mock Data

tw-cli playbook test my_playbook \
  --data '{"raw_email": "From: [email protected]..."}'

Validate Syntax

tw-cli playbook validate playbooks/my_playbook.yaml

Best Practices

1. Use Descriptive Names

# Good
- name: Check sender domain reputation

# Bad
- name: step1

2. Handle Failures Gracefully

- name: External Lookup
  action: lookup_sender_reputation
  on_error: continue
  default_output:
    score: 0.5

3. Add Timeouts

- name: Slow External API
  action: custom_lookup
  timeout: 30s

4. Log Key Decisions

- name: Log Verdict
  action: log_event
  parameters:
    level: info
    message: "Verdict: {{ verdict.classification }} ({{ verdict.confidence }})"

5. Version Your Playbooks

name: phishing_triage
version: "2.1.0"
changelog:
  - "2.1.0: Added attachment analysis"
  - "2.0.0: Restructured for parallel lookups"

Example: Complete Playbook

name: comprehensive_phishing_triage
description: Full phishing email analysis with all checks
version: "2.0"

triggers:
  incident_type: phishing
  auto_run: true

variables:
  quarantine_threshold: 0.3
  block_threshold: 0.2

steps:
  # Parse email
  - name: Parse Email
    action: parse_email
    parameters:
      raw_email: "{{ incident.raw_data.raw_email }}"
    output: parsed

  # Parallel enrichment
  - name: Enrich Data
    parallel:
      - action: check_email_authentication
        parameters:
          headers: "{{ parsed.headers }}"
        output: auth

      - action: lookup_sender_reputation
        parameters:
          sender: "{{ parsed.sender }}"
        output: sender_rep

      - action: lookup_urls
        parameters:
          urls: "{{ parsed.urls }}"
        output: urls
        condition: "{{ parsed.urls | length > 0 }}"

      - action: lookup_attachments
        parameters:
          attachments: "{{ parsed.attachments }}"
        output: attachments
        condition: "{{ parsed.attachments | length > 0 }}"

  # AI Analysis
  - name: AI Verdict
    type: ai_analysis
    model: claude-sonnet-4-20250514
    context: [parsed, auth, sender_rep, urls, attachments]
    output: verdict

  # Response actions
  - name: Quarantine Malicious
    action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
    condition: >
      verdict.classification == 'malicious' and
      verdict.confidence >= variables.quarantine_threshold

  - name: Block Repeat Offender
    action: block_sender
    parameters:
      sender: "{{ parsed.sender }}"
    condition: >
      sender_rep.score < variables.block_threshold

  - name: Create Ticket
    action: create_ticket
    parameters:
      title: "{{ verdict.classification | title }}: {{ parsed.subject | truncate(50) }}"
      priority: "{{ incident.severity }}"
    condition: "{{ verdict.classification != 'benign' }}"

on_error:
  - action: escalate
    parameters:
      level: analyst
      reason: "Playbook execution failed"

Built-in Playbooks

Ready-to-use playbooks included with Triage Warden.

Email Security

phishing_triage

Comprehensive phishing email analysis.

Triggers: incident_type: phishing

Steps:

  1. Parse email headers and body
  2. Check SPF/DKIM/DMARC authentication
  3. Look up sender reputation
  4. Analyze URLs against threat intel
  5. Check attachment hashes
  6. AI analysis and verdict
  7. Auto-quarantine if malicious (confidence > 0.8)

Usage:

tw-cli playbook run phishing_triage --incident INC-2024-001

spam_triage

Quick spam classification.

Triggers: incident_type: spam

Steps:

  1. Parse email
  2. Check spam indicators (bulk headers, suspicious patterns)
  3. Classify as spam/not spam
  4. Auto-archive high-confidence spam

bec_detection

Business Email Compromise detection.

Triggers: incident_type: bec

Steps:

  1. Parse email
  2. Check for executive impersonation
  3. Analyze reply-to mismatch
  4. Check for urgency indicators
  5. Verify sender against directory
  6. AI analysis for social engineering patterns

Endpoint Security

malware_triage

Malware alert analysis.

Triggers: incident_type: malware

Steps:

  1. Get host information from EDR
  2. Look up file hash
  3. Check related processes
  4. Query SIEM for lateral movement
  5. AI verdict
  6. Auto-isolate if critical severity + high confidence

suspicious_login

Anomalous login investigation.

Triggers: incident_type: suspicious_login

Steps:

  1. Get login details
  2. Check for impossible travel
  3. Query user's recent activity
  4. Check IP reputation
  5. Verify device fingerprint
  6. AI analysis

Customizing Built-in Playbooks

Override Variables

tw-cli playbook run phishing_triage \
  --incident INC-2024-001 \
  --var quarantine_threshold=0.9 \
  --var auto_block=false

Fork and Modify

# Export built-in playbook
tw-cli playbook export phishing_triage > my_phishing.yaml

# Edit as needed
vim my_phishing.yaml

# Register custom version
tw-cli playbook add my_phishing.yaml

Extend with Hooks

# my_phishing.yaml
extends: phishing_triage

# Add steps after parent playbook
after_steps:
  - name: Custom Logging
    action: log_to_siem
    parameters:
      event: phishing_verdict
      data: "{{ verdict }}"

# Override variables
variables:
  quarantine_threshold: 0.85

Playbook Comparison

Playbook | AI Used | Auto-Response | Typical Duration
phishing_triage | Yes | Quarantine, Block | 30-60s
spam_triage | No | Archive | 5-10s
bec_detection | Yes | Escalate | 45-90s
malware_triage | Yes | Isolate | 60-120s
suspicious_login | Yes | Lock account | 30-60s

Monitoring Playbooks

Execution Metrics

# Playbook execution count
sum by (playbook) (playbook_executions_total)

# Average duration
avg by (playbook) (playbook_duration_seconds)

# Success rate
sum(playbook_executions_total{status="success"}) /
sum(playbook_executions_total)

Alerts

# Alert on playbook failures
- alert: PlaybookFailureRate
  expr: |
    sum(rate(playbook_executions_total{status="failed"}[5m])) /
    sum(rate(playbook_executions_total[5m])) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Playbook failure rate above 10%"

REST API

Programmatic access to Triage Warden functionality.

Base URL

http://localhost:8080/api

Authentication

See Authentication for details.

API Key

curl -H "Authorization: Bearer tw_abc123_secretkey" \
  http://localhost:8080/api/incidents

For browser-based access, use session authentication via /login.

Response Format

All responses are JSON:

{
  "data": { ... },
  "meta": {
    "page": 1,
    "per_page": 20,
    "total": 150
  }
}

Error Responses

{
  "error": {
    "code": "not_found",
    "message": "Incident not found",
    "details": { ... }
  }
}

HTTP Status Codes

| Code | Meaning |
|---|---|
| 200 | Success |
| 201 | Created |
| 400 | Bad Request |
| 401 | Unauthorized |
| 403 | Forbidden |
| 404 | Not Found |
| 422 | Validation Error |
| 429 | Rate Limited |
| 500 | Server Error |

Endpoints Overview

Incidents

| Method | Path | Description |
|---|---|---|
| GET | /incidents | List incidents |
| POST | /incidents | Create incident |
| GET | /incidents/:id | Get incident |
| PUT | /incidents/:id | Update incident |
| DELETE | /incidents/:id | Delete incident |
| POST | /incidents/:id/triage | Run triage |
| POST | /incidents/:id/actions | Execute action |

Actions

| Method | Path | Description |
|---|---|---|
| GET | /actions | List actions |
| GET | /actions/:id | Get action |
| POST | /actions/:id/approve | Approve action |
| POST | /actions/:id/reject | Reject action |

Playbooks

| Method | Path | Description |
|---|---|---|
| GET | /playbooks | List playbooks |
| POST | /playbooks | Create playbook |
| GET | /playbooks/:id | Get playbook |
| PUT | /playbooks/:id | Update playbook |
| DELETE | /playbooks/:id | Delete playbook |
| POST | /playbooks/:id/run | Run playbook |

Webhooks

| Method | Path | Description |
|---|---|---|
| POST | /webhooks/:source | Receive webhook |

System

| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |
| GET | /connectors/health | Connector status |

Pagination

List endpoints support pagination:

curl "http://localhost:8080/api/incidents?page=2&per_page=50"

Parameters:

  • page - Page number (default: 1)
  • per_page - Items per page (default: 20, max: 100)
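For scripting, the page, per_page, and total fields in meta are enough to walk every page. A minimal sketch (the iter_pages helper and its fetch_page callable are illustrative, not part of any SDK):

```python
def iter_pages(fetch_page, per_page=100):
    """Yield items from every page of a paginated list endpoint.

    fetch_page(page, per_page) must return the parsed JSON body,
    i.e. a dict with "data" and "meta" keys as documented above.
    """
    page = 1
    while True:
        body = fetch_page(page, per_page)
        yield from body["data"]
        meta = body["meta"]
        # Stop once this page reaches or passes the reported total
        if meta["page"] * meta["per_page"] >= meta["total"]:
            return
        page += 1
```

With requests, fetch_page could be as simple as `lambda page, per_page: session.get(url, params={"page": page, "per_page": per_page}).json()`.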

Filtering

Filter list results:

curl "http://localhost:8080/api/incidents?status=open&severity=high"

Common filters:

  • status - Filter by status
  • severity - Filter by severity
  • type - Filter by incident type
  • created_after - Created after date
  • created_before - Created before date

Sorting

curl "http://localhost:8080/api/incidents?sort=-created_at"

  • Prefix with - for descending order
  • Default: -created_at (newest first)

Rate Limiting

API requests are rate limited:

| Endpoint | Limit |
|---|---|
| Read operations | 100/min |
| Write operations | 20/min |
| Triage requests | 10/min |

Rate limit headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1705320000
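Clients can read these headers to pace themselves instead of waiting for a 429. A sketch (rate_limit_delay is an illustrative helper; X-RateLimit-Reset is treated as epoch seconds, as in the example above):

```python
import time

def rate_limit_delay(headers, now=None):
    """Return seconds to wait before the next request, based on the
    X-RateLimit-* headers. 0.0 means requests remain in the window."""
    now = time.time() if now is None else now
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0
    # Quota exhausted: wait until the reset timestamp
    reset = int(headers.get("X-RateLimit-Reset", str(int(now))))
    return max(0.0, reset - now)
```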

API Authentication

Authenticate with the Triage Warden API.

API Keys

Creating an API Key

# Via CLI
tw-cli api-key create --name "automation-script" --scopes read,write

# Output:
# API Key created successfully
# Key: tw_abc123_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# WARNING: Store this key securely. It cannot be retrieved again.

Using API Keys

Include in the Authorization header:

curl -H "Authorization: Bearer tw_abc123_secretkey" \
  http://localhost:8080/api/incidents

API Key Scopes

| Scope | Permissions |
|---|---|
| read | Read incidents, actions, playbooks |
| write | Create/update incidents, execute actions |
| admin | User management, system configuration |

Managing API Keys

# List keys
tw-cli api-key list

# Revoke key
tw-cli api-key revoke tw_abc123

# Rotate key
tw-cli api-key rotate tw_abc123

Session Authentication

For web dashboard access:

Login

curl -X POST http://localhost:8080/login \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "username=analyst&password=secret&csrf_token=xxx" \
  -c cookies.txt

Using Session

curl -b cookies.txt http://localhost:8080/api/incidents

Logout

curl -X POST http://localhost:8080/logout -b cookies.txt

CSRF Protection

State-changing requests require CSRF tokens:

  1. Get token from login page or API
  2. Include in request header or body

# Header
curl -X POST http://localhost:8080/api/incidents \
  -H "X-CSRF-Token: abc123" \
  -b cookies.txt \
  -d '{"type": "phishing"}'

# Form body
curl -X POST http://localhost:8080/api/incidents \
  -d "csrf_token=abc123&type=phishing" \
  -b cookies.txt

Webhook Authentication

Webhooks use HMAC signatures:

Configuring Webhook Secret

tw-cli webhook add email-gateway \
  --url http://localhost:8080/api/webhooks/email-gateway \
  --secret "your-secret-key"

Verifying Signatures

Triage Warden validates the X-Webhook-Signature header:

X-Webhook-Signature: sha256=abc123...

Signature is computed as:

HMAC-SHA256(secret, timestamp + "." + body)

Signature Verification Example

import hmac
import hashlib

def verify_signature(payload: bytes, signature: str, secret: str, timestamp: str) -> bool:
    expected = hmac.new(
        secret.encode(),
        f"{timestamp}.{payload.decode()}".encode(),
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)
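The sending side mirrors this: sign timestamp + "." + body with the shared secret and prefix the hex digest with sha256=. A sketch (sign_payload is an illustrative helper, not part of any SDK):

```python
import hashlib
import hmac

def sign_payload(payload: bytes, secret: str, timestamp: str) -> str:
    """Produce the X-Webhook-Signature value for an outgoing payload."""
    digest = hmac.new(
        secret.encode(),
        f"{timestamp}.{payload.decode()}".encode(),
        hashlib.sha256,
    ).hexdigest()
    return f"sha256={digest}"
```

A signature produced this way verifies with the verify_signature function above when both sides share the same secret and timestamp.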

Service Accounts

For automated systems:

# Create service account
tw-cli user create \
  --username automation-bot \
  --role analyst \
  --service-account

# Generate API key for service account
tw-cli api-key create \
  --user automation-bot \
  --name "ci-cd-integration" \
  --scopes read,write

Security Best Practices

  1. Rotate keys regularly - Set up automated rotation
  2. Use minimal scopes - Only grant necessary permissions
  3. Secure storage - Use secret managers, not code
  4. Monitor usage - Review audit logs for suspicious activity
  5. IP allowlisting - Restrict API access by IP (optional)

# Enable IP allowlist
tw-cli config set api.allowed_ips "10.0.0.0/8,192.168.1.0/24"

Error Responses

401 Unauthorized

Missing or invalid credentials:

{
  "error": {
    "code": "unauthorized",
    "message": "Invalid or missing authentication"
  }
}

403 Forbidden

Valid credentials but insufficient permissions:

{
  "error": {
    "code": "forbidden",
    "message": "Insufficient permissions for this operation"
  }
}

Incidents API

Create, read, update, and manage security incidents.

List Incidents

GET /api/incidents

Query Parameters

| Parameter | Type | Description |
|---|---|---|
| status | string | Filter by status (open, triaged, resolved) |
| severity | string | Filter by severity (low, medium, high, critical) |
| type | string | Filter by incident type |
| created_after | datetime | Created after timestamp |
| created_before | datetime | Created before timestamp |
| page | integer | Page number |
| per_page | integer | Items per page |
| sort | string | Sort field (prefix - for desc) |

Example

curl "http://localhost:8080/api/incidents?status=open&severity=high&per_page=10" \
  -H "Authorization: Bearer tw_xxx"

Response

{
  "data": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "incident_number": "INC-2024-0001",
      "incident_type": "phishing",
      "severity": "high",
      "status": "open",
      "source": "email_gateway",
      "created_at": "2024-01-15T10:30:00Z",
      "updated_at": "2024-01-15T10:30:00Z"
    }
  ],
  "meta": {
    "page": 1,
    "per_page": 10,
    "total": 42
  }
}

Get Incident

GET /api/incidents/:id

Example

curl "http://localhost:8080/api/incidents/550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer tw_xxx"

Response

{
  "data": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "incident_number": "INC-2024-0001",
    "incident_type": "phishing",
    "severity": "high",
    "status": "triaged",
    "source": "email_gateway",
    "raw_data": {
      "message_id": "AAMkAGI2...",
      "sender": "[email protected]",
      "subject": "Urgent: Update Account"
    },
    "verdict": {
      "classification": "malicious",
      "confidence": 0.92,
      "category": "phishing",
      "reasoning": "Multiple phishing indicators..."
    },
    "recommended_actions": [
      "quarantine_email",
      "block_sender"
    ],
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T10:35:00Z",
    "triaged_at": "2024-01-15T10:35:00Z"
  }
}

Create Incident

POST /api/incidents

Request Body

{
  "incident_type": "phishing",
  "source": "email_gateway",
  "severity": "medium",
  "raw_data": {
    "message_id": "AAMkAGI2...",
    "sender": "[email protected]",
    "recipient": "[email protected]",
    "subject": "Important Document",
    "received_at": "2024-01-15T10:00:00Z"
  }
}

Example

curl -X POST "http://localhost:8080/api/incidents" \
  -H "Authorization: Bearer tw_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "incident_type": "phishing",
    "source": "email_gateway",
    "severity": "medium",
    "raw_data": {...}
  }'

Response

{
  "data": {
    "id": "550e8400-e29b-41d4-a716-446655440001",
    "incident_number": "INC-2024-0002",
    "status": "open",
    "created_at": "2024-01-15T11:00:00Z"
  }
}
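A typical client chains the create and triage endpoints. A sketch, assuming a requests-style session with the Authorization header already set (create_and_triage is an illustrative helper, not part of any SDK):

```python
def create_and_triage(session, base_url, incident):
    """Create an incident, run AI triage on it, and return the verdict.

    `session` is any requests-compatible client (e.g. requests.Session
    with the Authorization header configured).
    """
    # POST /api/incidents returns the new incident's id
    resp = session.post(f"{base_url}/api/incidents", json=incident)
    resp.raise_for_status()
    incident_id = resp.json()["data"]["id"]

    # POST /api/incidents/:id/triage returns the verdict on completion
    resp = session.post(f"{base_url}/api/incidents/{incident_id}/triage")
    resp.raise_for_status()
    return resp.json()["data"]["verdict"]
```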

Update Incident

PUT /api/incidents/:id

Request Body

{
  "severity": "high",
  "status": "resolved",
  "resolution": "False positive - legitimate vendor email"
}

Delete Incident

DELETE /api/incidents/:id

Note: Requires admin role.

Run Triage

POST /api/incidents/:id/triage

Trigger AI triage on an incident.

Request Body (Optional)

{
  "playbook": "custom_phishing",
  "force": true
}

Response

{
  "data": {
    "triage_id": "triage-abc123",
    "status": "completed",
    "verdict": {
      "classification": "malicious",
      "confidence": 0.92
    },
    "duration_ms": 45000
  }
}

Execute Action

POST /api/incidents/:id/actions

Execute an action on an incident.

Request Body

{
  "action": "quarantine_email",
  "parameters": {
    "message_id": "AAMkAGI2...",
    "reason": "Phishing detected"
  }
}

Response (Immediate Execution)

{
  "data": {
    "action_id": "act-abc123",
    "status": "completed",
    "result": {
      "success": true,
      "message": "Email quarantined successfully"
    }
  }
}

Response (Pending Approval)

{
  "data": {
    "action_id": "act-abc123",
    "status": "pending_approval",
    "approval_level": "senior",
    "message": "Action requires senior analyst approval"
  }
}

Get Incident Actions

GET /api/incidents/:id/actions

List all actions for an incident.

Response

{
  "data": [
    {
      "id": "act-abc123",
      "action_type": "quarantine_email",
      "status": "completed",
      "executed_at": "2024-01-15T10:40:00Z",
      "executed_by": "system"
    },
    {
      "id": "act-def456",
      "action_type": "block_sender",
      "status": "pending_approval",
      "approval_level": "analyst",
      "requested_at": "2024-01-15T10:41:00Z"
    }
  ]
}

Actions API

Manage action execution and approvals.

List Actions

GET /api/actions

Query Parameters

| Parameter | Type | Description |
|---|---|---|
| status | string | pending, pending_approval, completed, failed |
| action_type | string | Filter by action type |
| incident_id | uuid | Filter by incident |
| approval_level | string | analyst, senior, manager |

Example

curl "http://localhost:8080/api/actions?status=pending_approval" \
  -H "Authorization: Bearer tw_xxx"

Response

{
  "data": [
    {
      "id": "act-abc123",
      "incident_id": "550e8400-e29b-41d4-a716-446655440000",
      "action_type": "isolate_host",
      "status": "pending_approval",
      "approval_level": "senior",
      "parameters": {
        "host_id": "aid:xyz789",
        "reason": "Malware detected"
      },
      "requested_by": "triage_agent",
      "requested_at": "2024-01-15T10:45:00Z"
    }
  ]
}
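To build an approval queue from this response, filter on status and approval level. A sketch, assuming the analyst < senior < manager ordering implied by the approval levels above (adjust if your deployment defines levels differently):

```python
APPROVAL_ORDER = {"analyst": 0, "senior": 1, "manager": 2}

def approvable_actions(actions, my_level):
    """Filter a list-actions response down to actions this user can approve."""
    rank = APPROVAL_ORDER[my_level]
    return [
        a for a in actions
        if a["status"] == "pending_approval"
        and APPROVAL_ORDER[a["approval_level"]] <= rank
    ]
```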

Get Action

GET /api/actions/:id

Response

{
  "data": {
    "id": "act-abc123",
    "incident_id": "550e8400-e29b-41d4-a716-446655440000",
    "action_type": "isolate_host",
    "status": "pending_approval",
    "approval_level": "senior",
    "parameters": {
      "host_id": "aid:xyz789",
      "reason": "Malware detected"
    },
    "requested_by": "triage_agent",
    "requested_at": "2024-01-15T10:45:00Z",
    "incident": {
      "incident_number": "INC-2024-0001",
      "incident_type": "malware",
      "severity": "high"
    }
  }
}

Approve Action

POST /api/actions/:id/approve

Request Body

{
  "comment": "Verified threat, approved for isolation"
}

Response

{
  "data": {
    "id": "act-abc123",
    "status": "completed",
    "approved_by": "[email protected]",
    "approved_at": "2024-01-15T11:00:00Z",
    "result": {
      "success": true,
      "message": "Host isolated successfully"
    }
  }
}

Errors

403 Forbidden - Insufficient approval level:

{
  "error": {
    "code": "insufficient_approval_level",
    "message": "This action requires senior analyst approval",
    "required_level": "senior",
    "your_level": "analyst"
  }
}

Reject Action

POST /api/actions/:id/reject

Request Body

{
  "reason": "False positive - user confirmed legitimate activity"
}

Response

{
  "data": {
    "id": "act-abc123",
    "status": "rejected",
    "rejected_by": "[email protected]",
    "rejected_at": "2024-01-15T11:00:00Z",
    "rejection_reason": "False positive - user confirmed legitimate activity"
  }
}

Execute Action Directly

POST /api/actions/execute

Execute an action without associating with an incident.

Request Body

{
  "action": "block_sender",
  "parameters": {
    "sender": "[email protected]"
  }
}

Response

{
  "data": {
    "action_id": "act-ghi789",
    "status": "completed",
    "result": {
      "success": true,
      "message": "Sender blocked"
    }
  }
}

Get Action Types

GET /api/actions/types

List all available action types.

Response

{
  "data": [
    {
      "name": "quarantine_email",
      "description": "Move email to quarantine",
      "category": "email",
      "supports_rollback": true,
      "parameters": [
        {
          "name": "message_id",
          "type": "string",
          "required": true
        },
        {
          "name": "reason",
          "type": "string",
          "required": false
        }
      ]
    },
    {
      "name": "isolate_host",
      "description": "Network-isolate a host",
      "category": "endpoint",
      "supports_rollback": true,
      "default_approval_level": "senior",
      "parameters": [...]
    }
  ]
}
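These descriptors make it easy to validate parameters client-side before calling the execute endpoint. A sketch (missing_parameters is an illustrative helper):

```python
def missing_parameters(action_type, parameters):
    """Return names of required parameters absent from `parameters`,
    given an action-type descriptor as returned by GET /api/actions/types."""
    required = [p["name"] for p in action_type["parameters"] if p["required"]]
    return [name for name in required if name not in parameters]
```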

Rollback Action

POST /api/actions/:id/rollback

Rollback a previously executed action.

Request Body

{
  "reason": "False positive confirmed"
}

Response

{
  "data": {
    "rollback_action_id": "act-jkl012",
    "original_action_id": "act-abc123",
    "status": "completed",
    "result": {
      "success": true,
      "message": "Host unisolated successfully"
    }
  }
}

Errors

400 Bad Request - Action doesn't support rollback:

{
  "error": {
    "code": "rollback_not_supported",
    "message": "Action type 'notify_user' does not support rollback"
  }
}

Playbooks API

Manage and execute playbooks.

List Playbooks

GET /api/playbooks

Response

{
  "data": [
    {
      "id": "pb-abc123",
      "name": "phishing_triage",
      "description": "Automated phishing email analysis",
      "version": "2.0",
      "enabled": true,
      "triggers": {
        "incident_type": "phishing",
        "auto_run": true
      },
      "created_at": "2024-01-01T00:00:00Z",
      "updated_at": "2024-01-10T00:00:00Z"
    }
  ]
}

Get Playbook

GET /api/playbooks/:id

Response

{
  "data": {
    "id": "pb-abc123",
    "name": "phishing_triage",
    "description": "Automated phishing email analysis",
    "version": "2.0",
    "enabled": true,
    "triggers": {
      "incident_type": "phishing",
      "auto_run": true
    },
    "variables": {
      "quarantine_threshold": 0.7
    },
    "steps": [
      {
        "name": "Parse Email",
        "action": "parse_email",
        "parameters": {
          "raw_email": "{{ incident.raw_data.raw_email }}"
        },
        "output": "parsed"
      }
    ],
    "created_at": "2024-01-01T00:00:00Z",
    "updated_at": "2024-01-10T00:00:00Z"
  }
}

Create Playbook

POST /api/playbooks

Request Body

{
  "name": "custom_playbook",
  "description": "My custom investigation playbook",
  "triggers": {
    "incident_type": "phishing",
    "auto_run": false
  },
  "steps": [
    {
      "name": "Parse Email",
      "action": "parse_email",
      "output": "parsed"
    }
  ]
}

Response

{
  "data": {
    "id": "pb-def456",
    "name": "custom_playbook",
    "version": "1.0",
    "created_at": "2024-01-15T12:00:00Z"
  }
}

Update Playbook

PUT /api/playbooks/:id

Request Body

{
  "description": "Updated description",
  "enabled": false
}

Delete Playbook

DELETE /api/playbooks/:id

Note: Built-in playbooks cannot be deleted.

Run Playbook

POST /api/playbooks/:id/run

Execute a playbook on an incident.

Request Body

{
  "incident_id": "550e8400-e29b-41d4-a716-446655440000",
  "variables": {
    "quarantine_threshold": 0.9
  }
}

Response

{
  "data": {
    "execution_id": "exec-abc123",
    "playbook_id": "pb-abc123",
    "incident_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "started_at": "2024-01-15T12:00:00Z",
    "completed_at": "2024-01-15T12:00:45Z",
    "steps_completed": 5,
    "steps_total": 5,
    "verdict": {
      "classification": "malicious",
      "confidence": 0.92
    }
  }
}

Get Playbook Executions

GET /api/playbooks/:id/executions

Response

{
  "data": [
    {
      "execution_id": "exec-abc123",
      "incident_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "completed",
      "duration_ms": 45000,
      "started_at": "2024-01-15T12:00:00Z"
    }
  ]
}

Validate Playbook

POST /api/playbooks/validate

Validate playbook YAML without creating it.

Request Body

{
  "content": "name: test\nsteps:\n  - action: parse_email"
}

Response (Valid)

{
  "data": {
    "valid": true,
    "warnings": []
  }
}

Response (Invalid)

{
  "data": {
    "valid": false,
    "errors": [
      {
        "line": 3,
        "message": "Unknown action: invalid_action"
      }
    ]
  }
}

Export Playbook

GET /api/playbooks/:id/export

Download playbook as YAML file.

Response

name: phishing_triage
description: Automated phishing email analysis
version: "2.0"
...

Webhooks API

Receive events from external security tools.

Endpoint

POST /api/webhooks/:source

Where :source identifies the sending system (e.g., email-gateway, edr, siem).

Authentication

Webhooks are authenticated via HMAC signatures:

X-Webhook-Signature: sha256=abc123...
X-Webhook-Timestamp: 1705320000

Registering Webhook Sources

Via CLI

tw-cli webhook add email-gateway \
  --secret "your-secret-key" \
  --auto-triage true \
  --playbook phishing_triage

Via API

curl -X POST "http://localhost:8080/api/webhooks" \
  -H "Authorization: Bearer tw_xxx" \
  -d '{
    "source": "email-gateway",
    "secret": "your-secret-key",
    "auto_triage": true,
    "playbook": "phishing_triage"
  }'

Payload Formats

Generic Format

{
  "event_type": "security_alert",
  "timestamp": "2024-01-15T10:00:00Z",
  "source": "email-gateway",
  "data": {
    "alert_id": "alert-123",
    "severity": "high",
    "details": {...}
  }
}

Microsoft Defender for Office 365

{
  "eventType": "PhishingEmail",
  "id": "AAMkAGI2...",
  "creationTime": "2024-01-15T10:00:00Z",
  "severity": "high",
  "category": "Phish",
  "entityType": "Email",
  "data": {
    "sender": "[email protected]",
    "subject": "Urgent Action Required",
    "recipients": ["[email protected]"]
  }
}

CrowdStrike Falcon

{
  "metadata": {
    "eventType": "DetectionSummaryEvent",
    "eventCreationTime": 1705320000000
  },
  "event": {
    "DetectId": "ldt:abc123",
    "Severity": 4,
    "HostnameField": "WORKSTATION-01",
    "DetectName": "Malicious File Detected"
  }
}

Splunk Alert

{
  "result": {
    "host": "server-01",
    "source": "WinEventLog:Security",
    "sourcetype": "WinEventLog",
    "_raw": "...",
    "EventCode": "4625"
  },
  "search_name": "Failed Login Alert",
  "trigger_time": 1705320000
}

Response

Success

{
  "status": "accepted",
  "incident_id": "550e8400-e29b-41d4-a716-446655440000",
  "incident_number": "INC-2024-0001"
}

Queued for Processing

{
  "status": "queued",
  "queue_id": "queue-abc123",
  "message": "Event queued for processing"
}

Configuring Auto-Triage

When auto_triage is enabled, incidents created from webhooks are automatically triaged:

# webhook_config.yaml
sources:
  email-gateway:
    secret: "${EMAIL_GATEWAY_SECRET}"
    auto_triage: true
    playbook: phishing_triage
    severity_mapping:
      critical: critical
      high: high
      medium: medium
      low: low

  edr:
    secret: "${EDR_SECRET}"
    auto_triage: true
    playbook: malware_triage

Testing Webhooks

Send Test Event

# Generate signature
TIMESTAMP=$(date +%s)
BODY='{"event_type":"test","data":{}}'
SIGNATURE=$(echo -n "${TIMESTAMP}.${BODY}" | openssl dgst -sha256 -hmac "your-secret" | awk '{print $2}')

# Send request
curl -X POST "http://localhost:8080/api/webhooks/email-gateway" \
  -H "Content-Type: application/json" \
  -H "X-Webhook-Signature: sha256=${SIGNATURE}" \
  -H "X-Webhook-Timestamp: ${TIMESTAMP}" \
  -d "${BODY}"

Verify Configuration

tw-cli webhook test email-gateway

Error Handling

Invalid Signature

{
  "error": {
    "code": "invalid_signature",
    "message": "Webhook signature verification failed"
  }
}

Unknown Source

{
  "error": {
    "code": "unknown_source",
    "message": "Webhook source 'unknown' is not registered"
  }
}

Replay Attack

{
  "error": {
    "code": "timestamp_expired",
    "message": "Webhook timestamp is too old (>5 minutes)"
  }
}

Monitoring Webhooks

Metrics

# Webhook receive rate
rate(webhook_received_total[5m])

# Error rate by source
rate(webhook_errors_total[5m])

Logs

tw-cli logs --filter webhook --tail 100

API Error Codes

All API errors return a consistent JSON structure with an error code, message, and optional details.

Error Response Format

{
  "code": "ERROR_CODE",
  "message": "Human-readable error message",
  "details": { ... },
  "request_id": "optional-request-id"
}

Error Codes Reference

Authentication Errors (4xx)

| Code | HTTP Status | Description | Resolution |
|---|---|---|---|
| UNAUTHORIZED | 401 | Missing or invalid authentication | Provide valid API key or session cookie |
| INVALID_CREDENTIALS | 401 | Invalid username or password | Check login credentials |
| SESSION_EXPIRED | 401 | Session has expired | Re-authenticate to get new session |
| INVALID_SIGNATURE | 401 | Webhook signature validation failed | Verify webhook secret configuration |
| FORBIDDEN | 403 | Authenticated but not authorized | Check user role and permissions |
| CSRF_VALIDATION_FAILED | 403 | CSRF token missing or invalid | Include valid CSRF token in request |
| ACCOUNT_DISABLED | 403 | User account is disabled | Contact administrator |

Client Errors (4xx)

| Code | HTTP Status | Description | Resolution |
|---|---|---|---|
| NOT_FOUND | 404 | Resource not found | Verify resource ID exists |
| BAD_REQUEST | 400 | Malformed request | Check request syntax and parameters |
| CONFLICT | 409 | Resource conflict (e.g., already exists) | Action already completed or duplicate resource |
| UNPROCESSABLE_ENTITY | 422 | Semantic error in request | Check request logic and data validity |
| VALIDATION_ERROR | 422 | Field validation failed | See details for field-specific errors |
| RATE_LIMIT_EXCEEDED | 429 | Too many requests | Wait and retry with exponential backoff |

Server Errors (5xx)

| Code | HTTP Status | Description | Resolution |
|---|---|---|---|
| INTERNAL_ERROR | 500 | Unexpected server error | Check server logs, contact support |
| DATABASE_ERROR | 500 | Database operation failed | Check database connectivity |
| SERVICE_UNAVAILABLE | 503 | Service temporarily unavailable | Retry later |

Detailed Error Examples

Validation Error

When field validation fails, the response includes detailed field-level errors:

{
  "code": "VALIDATION_ERROR",
  "message": "Validation failed",
  "details": {
    "name": {
      "code": "required",
      "message": "Name is required"
    },
    "email": {
      "code": "invalid_format",
      "message": "Invalid email format"
    }
  }
}

Not Found Error

{
  "code": "NOT_FOUND",
  "message": "Not found: Incident 550e8400-e29b-41d4-a716-446655440000 not found"
}

Conflict Error

Returned when attempting an action that conflicts with current state:

{
  "code": "CONFLICT",
  "message": "Conflict: Action is not pending approval (current status: Approved)"
}

Rate Limit Error

{
  "code": "RATE_LIMIT_EXCEEDED",
  "message": "Rate limit exceeded"
}

A Retry-After header is included when available; honor it before retrying.

Unauthorized Error

{
  "code": "UNAUTHORIZED",
  "message": "Unauthorized: No authentication provided"
}

Error Handling Best Practices

Client Implementation

import time

import requests

def handle_api_error(response):
    error = response.json()
    code = error.get('code')

    if code == 'RATE_LIMIT_EXCEEDED':
        # Back off for the server-suggested interval, then retry
        retry_after = int(response.headers.get('Retry-After', 60))
        time.sleep(retry_after)
        return retry_request()  # application-defined retry helper

    elif code == 'SESSION_EXPIRED':
        # Re-authenticate, then retry
        refresh_session()  # application-defined re-auth helper
        return retry_request()

    elif code == 'VALIDATION_ERROR':
        # Surface field-specific errors
        for field, details in error.get('details', {}).items():
            print(f"Field '{field}': {details['message']}")

    elif code in ['INTERNAL_ERROR', 'DATABASE_ERROR']:
        # Log and alert on server errors
        log_error(error)  # application-defined logger
        raise ServerError(error['message'])

Retry Strategy

For transient errors (5xx, RATE_LIMIT_EXCEEDED), implement exponential backoff:

import time
import random

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        # Application-defined exceptions raised for 429/503 responses
        except (RateLimitError, ServiceUnavailableError):
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

HTTP Status Code Summary

| Status | Meaning | Retryable |
|---|---|---|
| 400 | Bad Request | No |
| 401 | Unauthorized | After re-auth |
| 403 | Forbidden | No |
| 404 | Not Found | No |
| 409 | Conflict | No |
| 422 | Unprocessable Entity | After fixing request |
| 429 | Rate Limited | Yes, with backoff |
| 500 | Internal Error | Yes, with caution |
| 503 | Service Unavailable | Yes, with backoff |
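The retryability column reduces to a small predicate. A sketch (should_retry is an illustrative helper; the 401 row is retryable only once, after re-authenticating):

```python
RETRYABLE_STATUSES = {429, 500, 503}

def should_retry(status: int, reauthenticated: bool = False) -> bool:
    """Whether a failed request is worth retrying, per the table above."""
    if status == 401:
        # Retry once after re-authenticating; give up if we already did
        return not reauthenticated
    return status in RETRYABLE_STATUSES
```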

Configuration Guide

Complete guides for configuring Triage Warden.

Initial Setup

After installation, configure Triage Warden in this order:

  1. Environment Variables - Set required environment variables
  2. Connectors - Connect to your security tools
  3. Notifications - Set up alert channels
  4. Playbooks - Create automation workflows
  5. Policies - Define approval and safety rules
  6. SSO Integrations - Configure enterprise identity providers

Quick Configuration

First Run

After starting Triage Warden, log in with the default credentials:

  • Username: admin
  • Password: admin

Important: Change the default password immediately!

Essential Settings

Navigate to Settings and configure:

  1. General

    • Organization name
    • Timezone
    • Operation mode (Assisted → Supervised → Autonomous)
  2. AI/LLM

    • Select provider (Anthropic, OpenAI, or Local)
    • Enter API key
    • Choose model
  3. Connectors (at minimum)

    • Threat intelligence (VirusTotal recommended)
    • Your primary SIEM or alert source
  4. Notifications

    • At least one channel for critical alerts

Configuration Methods

Most settings can be configured through the web dashboard at Settings.

Pros:

  • User-friendly interface
  • Validation feedback
  • Immediate effect

Environment Variables

For deployment configuration and secrets:

# Required
DATABASE_URL=postgres://...
TW_ENCRYPTION_KEY=...

# Optional overrides
TW_LLM_PROVIDER=anthropic
TW_LLM_MODEL=claude-3-sonnet

See Environment Variables Reference for full list.

Configuration Files

For complex configurations:

# config/default.yaml
server:
  bind_address: "0.0.0.0:8080"

guardrails:
  max_actions_per_incident: 10
  blocked_actions: []

Configuration Hierarchy

Configuration is loaded in this order (later overrides earlier):

1. Built-in defaults
         ↓
2. config/default.yaml
         ↓
3. config/{environment}.yaml
         ↓
4. Environment variables
         ↓
5. Database settings (via UI)
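The override behavior can be pictured as a recursive dictionary merge, later layers winning key by key. A sketch of the precedence only (merge_config and effective_config are illustrative, not the actual loader):

```python
def merge_config(base, override):
    """Merge one configuration layer over another; keys in `override`
    win, and nested dicts merge key by key rather than being replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

def effective_config(*layers):
    """Apply layers in order: defaults, YAML files, env vars, DB settings."""
    config = {}
    for layer in layers:
        config = merge_config(config, layer)
    return config
```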

Validation

Triage Warden validates configuration at startup:

# Validate without starting
triage-warden serve --validate-only

# Check specific configuration
triage-warden config check

Common Validation Errors

| Error | Solution |
|---|---|
| Missing TW_ENCRYPTION_KEY | Set encryption key environment variable |
| Invalid DATABASE_URL | Check connection string format |
| LLM API key required | Set API key or disable LLM features |
| Guardrails file not found | Create config/guardrails.yaml |

Backup Configuration

Before making changes, backup current settings:

# Export settings via API
curl -H "Authorization: Bearer $API_KEY" \
  http://localhost:8080/api/settings/export > settings-backup.json

# Restore settings
curl -X POST -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d @settings-backup.json \
  http://localhost:8080/api/settings/import

Environment Variables Reference

Complete reference of all environment variables for Triage Warden.

Required Variables

These must be set for Triage Warden to start.

Database

| Variable | Description | Example |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | postgres://user:pass@localhost:5432/triage_warden |

Connection String Format:

postgres://username:password@hostname:port/database?sslmode=require

SSL Modes:

  • disable - No SSL (development only)
  • require - SSL required, no certificate verification
  • verify-ca - Verify server certificate against CA
  • verify-full - Verify server certificate and hostname

Security

| Variable | Description | Example |
|---|---|---|
| TW_ENCRYPTION_KEY | Credential encryption key (32 bytes, base64) | K7gNU3sdo+OL0wNhqoVW... |
| TW_JWT_SECRET | JWT signing secret (min 32 characters) | your-very-long-jwt-secret-here |
| TW_SESSION_SECRET | Session encryption secret | your-session-secret-here |

Generating Keys:

# Encryption key (32 bytes, base64)
openssl rand -base64 32

# JWT/Session secret (hex)
openssl rand -hex 32

Server Configuration

| Variable | Description | Default |
|---|---|---|
| TW_BIND_ADDRESS | Server bind address | 0.0.0.0:8080 |
| TW_BASE_URL | Public URL for the application | http://localhost:8080 |
| TW_TRUSTED_PROXIES | Comma-separated trusted proxy IPs | None |
| TW_MAX_REQUEST_SIZE | Maximum request body size | 10MB |
| TW_REQUEST_TIMEOUT | Request timeout in seconds | 30 |

Example:

TW_BIND_ADDRESS=0.0.0.0:8080
TW_BASE_URL=https://triage.company.com
TW_TRUSTED_PROXIES=10.0.0.0/8,172.16.0.0/12

Database Configuration

| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | Connection string | Required |
| DATABASE_MAX_CONNECTIONS | Maximum pool connections | 10 |
| DATABASE_MIN_CONNECTIONS | Minimum pool connections | 1 |
| DATABASE_CONNECT_TIMEOUT | Connection timeout (seconds) | 30 |
| DATABASE_IDLE_TIMEOUT | Idle connection timeout (seconds) | 600 |
| DATABASE_MAX_LIFETIME | Max connection lifetime (seconds) | 1800 |

High-Traffic Configuration:

DATABASE_MAX_CONNECTIONS=50
DATABASE_MIN_CONNECTIONS=5
DATABASE_IDLE_TIMEOUT=300

Authentication

| Variable | Description | Default |
|---|---|---|
| TW_JWT_SECRET | JWT signing secret | Required |
| TW_JWT_EXPIRY | JWT token expiry | 24h |
| TW_SESSION_SECRET | Session encryption key | Required |
| TW_SESSION_EXPIRY | Session duration | 7d |
| TW_CSRF_ENABLED | Enable CSRF protection | true |
| TW_COOKIE_SECURE | Require HTTPS for cookies | false |
| TW_COOKIE_SAME_SITE | SameSite cookie policy | lax |

Production Settings:

TW_COOKIE_SECURE=true
TW_COOKIE_SAME_SITE=strict
TW_SESSION_EXPIRY=1d

LLM Configuration

Provider Selection

| Variable | Description | Default |
|---|---|---|
| TW_LLM_PROVIDER | LLM provider | openai |
| TW_LLM_MODEL | Model name | gpt-4-turbo |
| TW_LLM_ENABLED | Enable LLM features | true |

Valid Providers: openai, anthropic, azure, local

API Keys

| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key |
| ANTHROPIC_API_KEY | Anthropic API key |
| AZURE_OPENAI_API_KEY | Azure OpenAI API key |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint URL |

Model Parameters

| Variable | Description | Default |
|---|---|---|
| TW_LLM_TEMPERATURE | Response randomness (0.0-2.0) | 0.2 |
| TW_LLM_MAX_TOKENS | Maximum response tokens | 4096 |
| TW_LLM_TIMEOUT | Request timeout (seconds) | 60 |

Example Configuration:

# Using Anthropic
TW_LLM_PROVIDER=anthropic
TW_LLM_MODEL=claude-3-sonnet-20240229
ANTHROPIC_API_KEY=sk-ant-api03-...
TW_LLM_TEMPERATURE=0.1
TW_LLM_MAX_TOKENS=8192

# Using Azure OpenAI
TW_LLM_PROVIDER=azure
AZURE_OPENAI_API_KEY=your-azure-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
TW_LLM_MODEL=gpt-4-deployment-name

Logging & Observability

Variable | Description | Default
RUST_LOG | Log level filter | info
TW_LOG_FORMAT | Log format (json or pretty) | json
TW_LOG_FILE | Log file path (optional) | None

Log Levels

# Basic levels
RUST_LOG=info          # Info and above
RUST_LOG=debug         # Debug and above
RUST_LOG=warn          # Warnings and errors only

# Granular control
RUST_LOG=info,triage_warden=debug                    # Debug for app, info for deps
RUST_LOG=warn,triage_warden::api=debug               # Debug specific module
RUST_LOG=info,sqlx=warn,hyper=warn                   # Quiet noisy dependencies

Metrics & Tracing

Variable | Description | Default
TW_METRICS_ENABLED | Enable Prometheus metrics | true
TW_METRICS_PATH | Metrics endpoint path | /metrics
TW_TRACING_ENABLED | Enable distributed tracing | false
OTEL_EXPORTER_OTLP_ENDPOINT | OpenTelemetry endpoint | None
OTEL_SERVICE_NAME | Service name for traces | triage-warden

Tracing Setup:

TW_TRACING_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=triage-warden-prod

Rate Limiting

Variable | Description | Default
TW_RATE_LIMIT_ENABLED | Enable rate limiting | true
TW_RATE_LIMIT_REQUESTS | Requests per window | 100
TW_RATE_LIMIT_WINDOW | Rate limit window | 1m
TW_RATE_LIMIT_BURST | Burst allowance | 20
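
Rate limits with a window plus a burst allowance are commonly implemented as a token bucket. The sketch below is illustrative of that general technique, not Triage Warden's actual implementation:

```python
import time

class TokenBucket:
    """Allows `requests` per `window_seconds`, plus a `burst` allowance."""

    def __init__(self, requests, window_seconds, burst):
        self.capacity = requests + burst
        self.tokens = float(self.capacity)
        self.refill_rate = requests / window_seconds  # tokens per second
        self.updated = time.monotonic()

    def allow(self):
        # Refill tokens for the time elapsed since the last check.
        now = time.monotonic()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# TW_RATE_LIMIT_REQUESTS=100, TW_RATE_LIMIT_WINDOW=1m, TW_RATE_LIMIT_BURST=20
bucket = TokenBucket(requests=100, window_seconds=60, burst=20)
```

With these settings a quiet client can send up to 120 requests at once (100 + 20 burst), after which requests are admitted at the steady refill rate.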

Webhooks

Variable | Description | Default
TW_WEBHOOK_SECRET | Default webhook signature secret | None
TW_WEBHOOK_SPLUNK_SECRET | Splunk-specific secret | None
TW_WEBHOOK_CROWDSTRIKE_SECRET | CrowdStrike-specific secret | None
TW_WEBHOOK_DEFENDER_SECRET | Defender-specific secret | None
TW_WEBHOOK_SENTINEL_SECRET | Sentinel-specific secret | None

CORS Configuration

Variable | Description | Default
TW_CORS_ENABLED | Enable CORS | true
TW_CORS_ORIGINS | Allowed origins (comma-separated) | *
TW_CORS_METHODS | Allowed methods | GET,POST,PUT,DELETE,OPTIONS
TW_CORS_HEADERS | Allowed headers | *
TW_CORS_MAX_AGE | Preflight cache duration (seconds) | 86400

Production CORS:

TW_CORS_ORIGINS=https://triage.company.com,https://admin.company.com

Feature Flags

Variable | Description | Default
TW_FEATURE_PLAYBOOKS | Enable playbook execution | true
TW_FEATURE_AUTO_ENRICH | Enable automatic enrichment | true
TW_FEATURE_API_KEYS | Enable API key management | true

Development Variables

Not recommended for production:

Variable | Description | Default
TW_DEV_MODE | Enable development mode | false
TW_SEED_DATA | Seed database with test data | false
TW_DISABLE_AUTH | Disable authentication | false

Example Configurations

Development

DATABASE_URL=sqlite:./dev.db
TW_ENCRYPTION_KEY=$(openssl rand -base64 32)
TW_JWT_SECRET=dev-jwt-secret-not-for-production
TW_SESSION_SECRET=dev-session-secret
RUST_LOG=debug
TW_LOG_FORMAT=pretty
TW_DEV_MODE=true

Production

# Database
DATABASE_URL=postgres://tw:[email protected]:5432/triage_warden?sslmode=verify-full
DATABASE_MAX_CONNECTIONS=25

# Security
TW_ENCRYPTION_KEY=your-production-encryption-key
TW_JWT_SECRET=your-production-jwt-secret-minimum-32-chars
TW_SESSION_SECRET=your-production-session-secret
TW_COOKIE_SECURE=true
TW_COOKIE_SAME_SITE=strict

# Server
TW_BASE_URL=https://triage.company.com
TW_TRUSTED_PROXIES=10.0.0.0/8

# LLM
TW_LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-api03-...
TW_LLM_MODEL=claude-3-sonnet-20240229

# Logging
RUST_LOG=info
TW_LOG_FORMAT=json
TW_METRICS_ENABLED=true

# Rate limiting
TW_RATE_LIMIT_ENABLED=true
TW_RATE_LIMIT_REQUESTS=200
TW_RATE_LIMIT_WINDOW=1m

Kubernetes

apiVersion: v1
kind: Secret
metadata:
  name: triage-warden-secrets
type: Opaque
stringData:
  DATABASE_URL: "postgres://user:pass@postgres:5432/triage_warden"
  TW_ENCRYPTION_KEY: "base64-encoded-32-byte-key"
  TW_JWT_SECRET: "jwt-signing-secret"
  TW_SESSION_SECRET: "session-secret"
  ANTHROPIC_API_KEY: "sk-ant-..."
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: triage-warden-config
data:
  TW_BASE_URL: "https://triage.company.com"
  TW_LLM_PROVIDER: "anthropic"
  TW_LLM_MODEL: "claude-3-sonnet-20240229"
  RUST_LOG: "info"
  TW_METRICS_ENABLED: "true"

Connector Setup Guide

Step-by-step instructions for configuring each connector type.

Overview

Connectors enable Triage Warden to:

  • Ingest alerts from SIEMs and security tools
  • Enrich incidents with threat intelligence
  • Execute actions like creating tickets or isolating hosts
  • Send notifications to communication platforms

Adding a Connector

  1. Navigate to Settings → Connectors
  2. Click Add Connector
  3. Select connector type
  4. Fill in the required fields
  5. Click Test Connection to verify
  6. Click Save

Threat Intelligence Connectors

VirusTotal

Enriches file hashes, URLs, IPs, and domains with reputation data.

Prerequisites:

  • VirusTotal account with an API key (free or premium tier)

Configuration:

Field | Value
Name | VirusTotal
Type | virustotal
API Key | Your API key
Rate Limit | 4 (free) or 500 (premium)

Rate Limits:

  • Free tier: 4 requests/minute
  • Premium: 500+ requests/minute

Verify It Works:

  1. Create a test incident with a known-bad hash
  2. Check incident enrichments for VirusTotal data

AlienVault OTX

Open threat intelligence from AlienVault.

Prerequisites:

  • AlienVault OTX account with an API key

Configuration:

Field | Value
Name | AlienVault OTX
Type | alienvault
API Key | Your OTX API key

SIEM Connectors

Splunk

Ingest alerts from Splunk and run queries.

Prerequisites:

  • Splunk Enterprise or Cloud
  • HTTP Event Collector (HEC) token
  • User with search capabilities

Configuration:

Field | Value
Name | Splunk Production
Type | splunk
Host | https://splunk.company.com:8089
Username | Service account username
Password | Service account password
App | search (or your app context)

Setting Up Webhooks:

  1. In Splunk, create an alert action that sends to webhook
  2. Configure webhook URL: https://triage.company.com/api/webhooks/splunk
  3. Set webhook secret in Triage Warden connector config

Elastic Security

Connect to Elastic Security for SIEM alerts.

Prerequisites:

  • Elasticsearch 7.x or 8.x
  • User with read access to security indices

Configuration:

Field | Value
Name | Elastic SIEM
Type | elastic
URL | https://elasticsearch.company.com:9200
Username | Service account username
Password | Service account password
Index Pattern | security-* or .alerts-security.*

Microsoft Sentinel

Azure Sentinel integration for cloud SIEM.

Prerequisites:

  • Azure subscription with Sentinel workspace
  • App registration with Log Analytics Reader role

Configuration:

Field | Value
Name | Azure Sentinel
Type | sentinel
Workspace ID | Log Analytics Workspace ID
Tenant ID | Azure AD Tenant ID
Client ID | App Registration Client ID
Client Secret | App Registration Secret

Azure Setup:

  1. Create App Registration in Azure AD
  2. Grant Log Analytics Reader role on Sentinel workspace
  3. Create client secret
  4. Copy IDs and secret to Triage Warden

EDR Connectors

CrowdStrike Falcon

Endpoint detection and host isolation.

Prerequisites:

  • CrowdStrike Falcon subscription
  • API client with appropriate scopes

Configuration:

Field | Value
Name | CrowdStrike Falcon
Type | crowdstrike
Region | us-1, us-2, eu-1, or us-gov-1
Client ID | OAuth Client ID
Client Secret | OAuth Client Secret

Required API Scopes:

  • Detections: Read
  • Hosts: Read, Write (for isolation)
  • Incidents: Read

CrowdStrike Setup:

  1. Go to Support → API Clients and Keys
  2. Create new API client
  3. Select required scopes
  4. Copy Client ID and Secret

Microsoft Defender for Endpoint

MDE integration for alerts and host actions.

Prerequisites:

  • Microsoft 365 E5 or Defender for Endpoint license
  • App registration with Defender API permissions

Configuration:

Field | Value
Name | Defender for Endpoint
Type | defender
Tenant ID | Azure AD Tenant ID
Client ID | App Registration Client ID
Client Secret | App Registration Secret

Required API Permissions:

  • Alert.Read.All
  • Machine.Read.All
  • Machine.Isolate (for isolation actions)

SentinelOne

SentinelOne EDR integration.

Prerequisites:

  • SentinelOne console access
  • API token with appropriate permissions

Configuration:

Field | Value
Name | SentinelOne
Type | sentinelone
Console URL | https://usea1-pax8.sentinelone.net
API Token | Your API token

Ticketing Connectors

Jira

Create and manage security tickets.

Prerequisites:

  • Jira Cloud or Server instance
  • API token (Cloud) or password (Server)

Configuration:

Field | Value
Name | Jira Security
Type | jira
URL | https://yourcompany.atlassian.net
Email | Your Jira email
API Token | API token from Atlassian account
Default Project | SEC (your security project key)

Jira Cloud Setup:

  1. Go to id.atlassian.com/manage-profile/security/api-tokens
  2. Create API token
  3. Use your email as username

Jira Server Setup:

  • Use password instead of API token
  • Ensure user has project access

ServiceNow

ServiceNow ITSM integration.

Prerequisites:

  • ServiceNow instance
  • User with incident table access

Configuration:

Field | Value
Name | ServiceNow
Type | servicenow
Instance URL | https://yourcompany.service-now.com
Username | Service account username
Password | Service account password

Identity Connectors

Microsoft 365 / Azure AD

User management and sign-in data.

Prerequisites:

  • Azure AD with appropriate licenses
  • App registration with Graph API permissions

Configuration:

Field | Value
Name | Microsoft 365
Type | m365
Tenant ID | Azure AD Tenant ID
Client ID | App Registration Client ID
Client Secret | App Registration Secret

Required API Permissions:

  • User.Read.All
  • AuditLog.Read.All
  • User.RevokeSessions.All (for user disable)

Google Workspace

Google Workspace user management.

Prerequisites:

  • Google Workspace admin access
  • Service account with domain-wide delegation

Configuration:

Field | Value
Name | Google Workspace
Type | google
Service Account JSON | Paste JSON key file contents
Domain | company.com

Google Setup:

  1. Create service account in Google Cloud Console
  2. Enable domain-wide delegation
  3. Add required OAuth scopes in Google Admin
  4. Download JSON key file

Testing Connectors

After configuration, always test:

  1. Click Test Connection in connector settings
  2. Check the response for success/errors
  3. For ingestion connectors, verify sample data appears

Common Issues

Error | Solution
Connection refused | Check URL and network access
401 Unauthorized | Verify credentials/API key
403 Forbidden | Check permissions/scopes
SSL certificate error | Verify certificate or disable verification
Rate limited | Reduce request rate or upgrade tier

Connector Health

Monitor connector health at Settings → Connectors or via API:

curl http://localhost:8080/health/detailed | jq '.components.connectors'

Healthy connectors show status connected. Troubleshoot any showing error or disconnected.

Playbooks Guide

Create effective automated response playbooks.

What is a Playbook?

A playbook is an automated workflow that executes when specific conditions are met. Playbooks contain:

  • Trigger - Conditions that start the playbook
  • Stages - Ordered groups of steps
  • Steps - Individual actions to execute

Creating a Playbook

Via Web UI

  1. Navigate to Playbooks
  2. Click Create Playbook
  3. Enter name and description
  4. Configure trigger conditions
  5. Add stages and steps
  6. Enable and save

Via API

curl -X POST http://localhost:8080/api/playbooks \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Phishing Response",
    "description": "Automated response for phishing alerts",
    "trigger": {
      "type": "incident_created",
      "conditions": {
        "source": "email_gateway",
        "severity": ["high", "critical"]
      }
    },
    "stages": [...]
  }'

Trigger Types

incident_created

Fires when a new incident is created.

{
  "type": "incident_created",
  "conditions": {
    "severity": ["high", "critical"],
    "source": "crowdstrike",
    "title_contains": "malware"
  }
}

incident_updated

Fires when an incident is updated.

{
  "type": "incident_updated",
  "conditions": {
    "field": "severity",
    "new_value": "critical"
  }
}

scheduled

Fires on a schedule (cron format).

{
  "type": "scheduled",
  "schedule": "0 */6 * * *"
}

manual

Only triggered manually by user action.

{
  "type": "manual"
}
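
Trigger conditions like the ones above reduce to simple field matching against the incoming incident. A hedged sketch of how such matching might work (field and condition names follow the JSON examples above; the function itself is illustrative):

```python
def trigger_matches(incident: dict, conditions: dict) -> bool:
    """Return True only if every condition matches the incident's fields."""
    for key, expected in conditions.items():
        if key == "title_contains":
            # Substring match on the title, case-insensitive.
            if expected.lower() not in incident.get("title", "").lower():
                return False
        elif isinstance(expected, list):
            # A list means "any of these", e.g. severity: ["high", "critical"].
            if incident.get(key) not in expected:
                return False
        elif incident.get(key) != expected:
            return False
    return True

incident = {"title": "Malware beacon detected", "severity": "high", "source": "crowdstrike"}
conditions = {"severity": ["high", "critical"], "source": "crowdstrike", "title_contains": "malware"}
```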

Stages

Stages group steps that should execute together. Configure:

  • Name - Descriptive name
  • Description - What this stage does
  • Parallel - Execute steps in parallel (default: false)

Sequential Execution

{
  "stages": [
    {
      "name": "Enrichment",
      "steps": [/* step 1, step 2, step 3 */]
    },
    {
      "name": "Response",
      "steps": [/* step 4, step 5 */]
    }
  ]
}

Steps in Enrichment complete before Response starts.

Parallel Execution

{
  "stages": [
    {
      "name": "Gather Intel",
      "parallel": true,
      "steps": [
        {"action": "lookup_hash_virustotal"},
        {"action": "lookup_ip_reputation"},
        {"action": "lookup_domain_reputation"}
      ]
    }
  ]
}

All lookups run simultaneously.

Step Types

Enrichment Actions

lookup_hash

Look up file hash reputation.

{
  "action": "lookup_hash",
  "parameters": {
    "hash": "{{ incident.iocs.file_hash }}",
    "providers": ["virustotal", "alienvault"]
  }
}

lookup_ip

Look up IP address reputation.

{
  "action": "lookup_ip",
  "parameters": {
    "ip": "{{ incident.source_ip }}"
  }
}

lookup_domain

Look up domain reputation.

{
  "action": "lookup_domain",
  "parameters": {
    "domain": "{{ incident.domain }}"
  }
}

lookup_user

Get user details from identity provider.

{
  "action": "lookup_user",
  "parameters": {
    "email": "{{ incident.user_email }}",
    "provider": "m365"
  }
}

Containment Actions

isolate_host

Isolate endpoint from network.

{
  "action": "isolate_host",
  "parameters": {
    "hostname": "{{ incident.hostname }}",
    "provider": "crowdstrike"
  },
  "requires_approval": true
}

disable_user

Disable user account.

{
  "action": "disable_user",
  "parameters": {
    "email": "{{ incident.user_email }}",
    "provider": "m365"
  },
  "requires_approval": true
}

block_ip

Block IP address at firewall.

{
  "action": "block_ip",
  "parameters": {
    "ip": "{{ incident.source_ip }}",
    "duration": "24h"
  },
  "requires_approval": true
}

Notification Actions

send_notification

Send alert to notification channel.

{
  "action": "send_notification",
  "parameters": {
    "channel": "slack-security",
    "message": "Critical incident: {{ incident.title }}"
  }
}

create_ticket

Create ticket in ticketing system.

{
  "action": "create_ticket",
  "parameters": {
    "provider": "jira",
    "project": "SEC",
    "type": "Incident",
    "title": "{{ incident.title }}",
    "description": "{{ incident.description }}"
  }
}

Analysis Actions

analyze_with_llm

Run AI analysis on incident.

{
  "action": "analyze_with_llm",
  "parameters": {
    "prompt": "Analyze this security incident and provide recommendations",
    "include_enrichments": true
  }
}

Utility Actions

wait

Pause execution for specified duration.

{
  "action": "wait",
  "parameters": {
    "duration": "5m"
  }
}

set_severity

Update incident severity.

{
  "action": "set_severity",
  "parameters": {
    "severity": "critical"
  }
}

add_comment

Add comment to incident.

{
  "action": "add_comment",
  "parameters": {
    "comment": "Automated enrichment complete. Found {{ enrichments.virustotal.positives }} detections."
  }
}

Variables and Templates

Use Jinja2-style templates to reference incident data:

Available Variables

Variable | Description
{{ incident.id }} | Incident UUID
{{ incident.title }} | Incident title
{{ incident.severity }} | Severity level
{{ incident.source }} | Alert source
{{ incident.description }} | Full description
{{ incident.hostname }} | Affected hostname
{{ incident.username }} | Affected username
{{ incident.source_ip }} | Source IP address
{{ incident.iocs.* }} | Extracted IOCs
{{ enrichments.* }} | Enrichment results
{{ previous_step.output }} | Previous step output
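
At its core, expanding these placeholders is a nested-dictionary lookup. A minimal sketch of `{{ ... }}` substitution (illustrative only; the real engine also supports expressions and filters):

```python
import re

def render(template: str, context: dict) -> str:
    """Replace {{ dotted.path }} placeholders with values from a nested dict."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

message = render(
    "Critical incident: {{ incident.title }} ({{ incident.severity }})",
    {"incident": {"title": "Malware on HOST-1", "severity": "critical"}},
)
```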

Conditional Logic

{
  "action": "isolate_host",
  "conditions": "{{ incident.severity == 'critical' and enrichments.virustotal.positives > 5 }}"
}

Approval Requirements

Mark steps as requiring approval for dangerous actions:

{
  "action": "disable_user",
  "requires_approval": true
}

When requires_approval: true:

  1. Step pauses at approval queue
  2. Analyst reviews and approves/denies
  3. Execution continues or stops

Example Playbooks

Phishing Triage

{
  "name": "Phishing Triage",
  "description": "Automated triage for reported phishing emails",
  "trigger": {
    "type": "incident_created",
    "conditions": {
      "source": "email_gateway",
      "title_contains": "phishing"
    }
  },
  "stages": [
    {
      "name": "Extract and Enrich",
      "parallel": true,
      "steps": [
        {
          "action": "lookup_domain",
          "parameters": {"domain": "{{ incident.sender_domain }}"}
        },
        {
          "action": "lookup_url",
          "parameters": {"url": "{{ incident.iocs.url }}"}
        },
        {
          "action": "lookup_user",
          "parameters": {"email": "{{ incident.recipient }}"}
        }
      ]
    },
    {
      "name": "Analyze",
      "steps": [
        {
          "action": "analyze_with_llm",
          "parameters": {
            "prompt": "Analyze this phishing attempt and determine if it's targeted spear-phishing"
          }
        }
      ]
    },
    {
      "name": "Respond",
      "steps": [
        {
          "action": "send_notification",
          "parameters": {
            "channel": "slack-phishing",
            "message": "Phishing alert: {{ incident.title }}\nSender: {{ incident.sender }}\nVerdict: {{ analysis.verdict }}"
          }
        },
        {
          "action": "create_ticket",
          "conditions": "{{ analysis.verdict == 'malicious' }}",
          "parameters": {
            "provider": "jira",
            "project": "SEC",
            "title": "Phishing: {{ incident.title }}"
          }
        }
      ]
    }
  ]
}

Malware Containment

{
  "name": "Malware Containment",
  "description": "Isolate hosts with confirmed malware",
  "trigger": {
    "type": "incident_created",
    "conditions": {
      "source": "crowdstrike",
      "severity": "critical",
      "title_contains": "malware"
    }
  },
  "stages": [
    {
      "name": "Verify",
      "steps": [
        {
          "action": "lookup_hash",
          "parameters": {"hash": "{{ incident.iocs.file_hash }}"}
        }
      ]
    },
    {
      "name": "Contain",
      "steps": [
        {
          "action": "isolate_host",
          "conditions": "{{ enrichments.virustotal.positives >= 5 }}",
          "requires_approval": true,
          "parameters": {
            "hostname": "{{ incident.hostname }}",
            "reason": "Confirmed malware with {{ enrichments.virustotal.positives }} detections"
          }
        }
      ]
    },
    {
      "name": "Notify",
      "steps": [
        {
          "action": "send_notification",
          "parameters": {
            "channel": "pagerduty-security",
            "message": "Host {{ incident.hostname }} isolated due to malware"
          }
        }
      ]
    }
  ]
}

Best Practices

  1. Start small - Begin with enrichment-only playbooks before adding containment
  2. Require approval - Always require approval for containment actions initially
  3. Test in staging - Test playbooks with mock incidents first
  4. Monitor execution - Watch playbook executions for errors
  5. Document thoroughly - Include clear descriptions for each stage/step
  6. Use conditions - Don't execute actions blindly; use conditions to validate
  7. Handle failures - Consider what happens if a step fails

Troubleshooting

Playbook Not Triggering

  • Verify trigger conditions match incoming incidents
  • Check playbook is enabled
  • Review trigger condition syntax

Step Failing

  • Check connector is healthy
  • Verify required parameters are provided
  • Check variable templates resolve correctly
  • Review step logs in incident timeline

Approval Stuck

  • Check Approvals queue for pending items
  • Verify approvers have notification channel configured
  • Consider timeout settings for approvals

Notifications Setup Guide

Configure notification channels for alerts and incident updates.

Overview

Triage Warden supports multiple notification channels:

  • Slack - Team messaging
  • Microsoft Teams - Enterprise collaboration
  • PagerDuty - On-call alerting
  • Email - SMTP notifications
  • Webhooks - Custom integrations

Adding a Notification Channel

  1. Navigate to Settings → Notifications
  2. Click Add Channel
  3. Select channel type
  4. Configure settings
  5. Test and save

Slack

Prerequisites

  • Slack workspace admin access
  • Slack app with webhook permissions

Setup Steps

  1. Create Slack App:

    • Go to api.slack.com/apps
    • Click Create New App → From scratch
    • Name it "Triage Warden" and select your workspace
  2. Enable Incoming Webhooks:

    • In app settings, click Incoming Webhooks
    • Toggle Activate Incoming Webhooks to On
    • Click Add New Webhook to Workspace
    • Select the channel for alerts
  3. Copy Webhook URL:

    • Copy the webhook URL (starts with https://hooks.slack.com/...)
  4. Configure in Triage Warden:

Field | Value
Name | Slack - Security
Type | slack
Webhook URL | Your webhook URL
Channel | #security-alerts

Message Format

Triage Warden sends formatted Slack messages with:

  • Severity color coding (red=critical, orange=high, yellow=medium, gray=low)
  • Incident summary and details
  • Quick action buttons (View, Acknowledge)
  • Enrichment highlights

Example Notification

{
  "attachments": [{
    "color": "#ff0000",
    "title": "Critical: Malware Detected on WORKSTATION-001",
    "text": "CrowdStrike detected Emotet malware on endpoint",
    "fields": [
      {"title": "Source", "value": "CrowdStrike", "short": true},
      {"title": "Severity", "value": "Critical", "short": true}
    ],
    "actions": [
      {"type": "button", "text": "View Incident", "url": "https://..."}
    ]
  }]
}

Microsoft Teams

Prerequisites

  • Microsoft 365 account
  • Teams channel where you can add connectors

Setup Steps

  1. Add Incoming Webhook Connector:

    • In Teams, go to the channel for alerts
    • Click ... → Connectors
    • Find Incoming Webhook and click Configure
    • Name it "Triage Warden" and upload an icon (optional)
    • Click Create
  2. Copy Webhook URL:

    • Copy the generated webhook URL
  3. Configure in Triage Warden:

Field | Value
Name | Teams - Security
Type | teams
Webhook URL | Your webhook URL

Adaptive Cards

Triage Warden sends Teams notifications as Adaptive Cards with:

  • Severity indicators
  • Incident details in structured format
  • Action buttons for quick response

PagerDuty

Prerequisites

  • PagerDuty account
  • Service with Events API v2 integration

Setup Steps

  1. Create PagerDuty Service:

    • In PagerDuty, go to Services → New Service
    • Name it "Triage Warden Alerts"
    • Add an escalation policy
  2. Add Events API Integration:

    • On the service page, go to Integrations
    • Click Add Integration
    • Select Events API v2
    • Copy the Integration Key
  3. Configure in Triage Warden:

Field | Value
Name | PagerDuty - Security
Type | pagerduty
Integration Key | Your integration key
Severity Mapping | See below

Severity Mapping

Map Triage Warden severities to PagerDuty:

TW Severity | PagerDuty Severity
Critical | critical
High | error
Medium | warning
Low | info
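
The mapping is a straight lookup when building the PagerDuty event. A sketch of how it might be applied to produce an Events API v2 trigger payload (the helper and incident fields are illustrative; the payload keys follow PagerDuty's public Events API):

```python
PAGERDUTY_SEVERITY = {
    "critical": "critical",
    "high": "error",
    "medium": "warning",
    "low": "info",
}

def to_pagerduty_event(incident: dict, routing_key: str) -> dict:
    """Build a PagerDuty Events API v2 trigger payload from an incident."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": incident["title"],
            "source": incident.get("source", "triage-warden"),
            # Fall back to the lowest urgency for unknown severities.
            "severity": PAGERDUTY_SEVERITY.get(incident["severity"], "info"),
        },
    }

event = to_pagerduty_event({"title": "Malware detected", "severity": "high"}, "R0UT1NGKEY")
```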

Auto-Resolution

Configure auto-resolution to close PagerDuty incidents when Triage Warden incidents are resolved:

notifications:
  pagerduty:
    auto_resolve: true
    resolve_on_status:
      - resolved
      - closed
      - false_positive

Email (SMTP)

Prerequisites

  • SMTP server credentials
  • Recipient email addresses

Configuration

Field | Value
Name | Email - SOC Team
Type | email
SMTP Host | smtp.company.com
SMTP Port | 587
Username | [email protected]
Password | SMTP password
From Address | [email protected]
To Addresses | [email protected]
Use TLS | true

Email Templates

Customize email templates by creating files in config/templates/:

config/templates/
├── email_incident_created.html
├── email_incident_updated.html
└── email_incident_resolved.html

Template variables:

  • {{ incident.title }} - Incident title
  • {{ incident.severity }} - Severity level
  • {{ incident.source }} - Alert source
  • {{ incident.description }} - Full description
  • {{ incident.url }} - Link to incident

Custom Webhooks

Send notifications to any HTTP endpoint.

Configuration

Field | Value
Name | Custom - SIEM
Type | webhook
URL | https://siem.company.com/api/alerts
Method | POST
Headers | {"Authorization": "Bearer ..."}
Secret | Webhook signing secret (optional)

Payload Format

Default JSON payload:

{
  "event_type": "incident_created",
  "timestamp": "2024-01-15T10:30:00Z",
  "incident": {
    "id": "uuid",
    "title": "Alert Title",
    "severity": "high",
    "source": "crowdstrike",
    "description": "...",
    "created_at": "2024-01-15T10:29:00Z"
  }
}

Webhook Signatures

If a secret is configured, Triage Warden signs webhooks with HMAC-SHA256:

X-TW-Signature: sha256=<signature>
X-TW-Timestamp: <unix_timestamp>

Verify signatures:

import hmac
import hashlib

def verify_signature(payload, signature, secret, timestamp):
    expected = hmac.new(
        secret.encode(),
        f"{timestamp}.{payload}".encode(),
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)
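
Since the signature covers the timestamp, a common companion check is rejecting stale timestamps to limit replay attacks. This freshness helper is a suggested pattern, not a documented Triage Warden requirement:

```python
import time

def is_fresh(timestamp: str, max_skew_seconds: int = 300) -> bool:
    """Reject webhooks whose X-TW-Timestamp is too far from the current time."""
    return abs(time.time() - int(timestamp)) <= max_skew_seconds

# Accept only requests that are both correctly signed and recent, e.g.:
# if verify_signature(body, sig, secret, ts) and is_fresh(ts): ...
```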

Notification Rules

Configure when and how notifications are sent.

Severity Filtering

Send only high/critical alerts to PagerDuty:

notifications:
  rules:
    - channel: pagerduty-security
      conditions:
        severity:
          - critical
          - high

Time-Based Rules

Different channels for business hours vs. after hours:

notifications:
  rules:
    - channel: slack-security
      conditions:
        hours: "09:00-17:00"
        days: ["mon", "tue", "wed", "thu", "fri"]
    - channel: pagerduty-oncall
      conditions:
        hours: "17:00-09:00"
        days: ["sat", "sun"]
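
Note that the after-hours window above wraps past midnight. A sketch of window matching that handles both same-day and overnight ranges (a hypothetical helper mirroring the `hours` syntax):

```python
from datetime import time

def in_hours_window(now: time, window: str) -> bool:
    """Check an HH:MM-HH:MM window; ranges like 17:00-09:00 wrap past midnight."""
    start_s, end_s = window.split("-")
    start = time(*[int(x) for x in start_s.split(":")])
    end = time(*[int(x) for x in end_s.split(":")])
    if start <= end:
        return start <= now < end
    # Overnight wrap: match evening OR early morning.
    return now >= start or now < end
```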

Source-Based Rules

Route by alert source:

notifications:
  rules:
    - channel: slack-phishing
      conditions:
        source: email_gateway
    - channel: slack-edr
      conditions:
        source:
          - crowdstrike
          - defender

Testing Notifications

Test via UI

  1. Go to Settings → Notifications
  2. Click Test next to any channel
  3. Check that test message arrives

Test via API

curl -X POST http://localhost:8080/api/notifications/test \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "channel_id": "uuid-of-channel",
    "message": "Test notification from Triage Warden"
  }'

Test via CLI

triage-warden notifications test --channel slack-security

Troubleshooting

Notifications Not Arriving

  1. Check channel health:

    curl http://localhost:8080/health/detailed | jq '.components.notifications'
    
  2. Verify webhook URL:

    • Test URL with curl
    • Check for firewalls or network restrictions
  3. Check logs:

    grep "notification" /var/log/triage-warden/app.log
    

Rate Limiting

If notifications are delayed:

  • Slack: 1 message per second per channel
  • PagerDuty: 120 events per minute
  • Teams: 4 messages per second

Configure rate limits:

notifications:
  rate_limits:
    slack: 1/s
    pagerduty: 2/s
    teams: 4/s

Duplicate Notifications

If receiving duplicates:

  1. Check for multiple channels targeting same destination
  2. Enable deduplication:
    notifications:
      deduplicate: true
      dedupe_window: 5m
    
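
Deduplication suppresses repeats of the same notification within the configured window. A sketch of the idea (illustrative; how Triage Warden keys notifications internally is not specified here):

```python
def deduplicate(events, window_seconds=300):
    """Drop events whose key was already emitted within the window.

    `events` is a time-ordered list of (timestamp, key) tuples.
    """
    last_emitted = {}
    emitted = []
    for ts, key in events:
        if key not in last_emitted or ts - last_emitted[key] > window_seconds:
            emitted.append((ts, key))
            last_emitted[key] = ts  # only refresh on emit, so bursts still surface eventually
    return emitted
```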

Policies Guide

Configure approval policies, guardrails, and safety rules for Triage Warden.

Overview

Policies control what actions Triage Warden can take automatically and what requires human approval. The policy engine provides:

  • Approval Requirements - Which actions need human approval
  • Guardrails - Safety limits on automated actions
  • Kill Switch - Emergency halt for all automation
  • Audit Logging - Complete action history

Policy Configuration

Policies are defined in config/guardrails.yaml or via the web UI at Settings → Policies.

Basic Structure

# config/guardrails.yaml
version: "1"

# Global settings
global:
  operation_mode: supervised  # assisted, supervised, autonomous
  kill_switch_enabled: false
  max_actions_per_incident: 10
  max_concurrent_actions: 5

# Action-specific policies
actions:
  isolate_host:
    requires_approval: true
    approval_level: high
    allowed_sources:
      - crowdstrike
      - defender

  disable_user:
    requires_approval: true
    approval_level: critical
    max_per_hour: 5

  lookup_hash:
    requires_approval: false
    rate_limit: 100/minute

# Approval rules
approvals:
  levels:
    low:
      auto_approve_after: 5m
      approvers: [analyst]
    medium:
      auto_approve_after: 30m
      approvers: [analyst, senior_analyst]
    high:
      auto_approve_after: never
      approvers: [senior_analyst, manager]
    critical:
      auto_approve_after: never
      approvers: [manager]
      require_count: 2

Operation Modes

Assisted Mode

Human-in-the-loop for all decisions:

  • All actions require explicit approval
  • AI provides recommendations only
  • Best for initial deployment and high-risk environments
global:
  operation_mode: assisted

Supervised Mode

Balanced automation with oversight:

  • Low-risk actions (lookups, enrichment) run automatically
  • Medium/high-risk actions require approval
  • Humans can intervene at any time
global:
  operation_mode: supervised

Autonomous Mode

Maximum automation:

  • Most actions run without approval
  • Only critical actions require human review
  • Use only after thorough testing
global:
  operation_mode: autonomous

Approval Levels

Configuring Approval Requirements

Each action type can have an approval requirement:

Action Type | Default Level | Typical Setting
lookup_* | none | none
send_notification | none | none
create_ticket | low | none or low
add_comment | none | none
set_severity | low | low
block_ip | high | high
isolate_host | critical | high or critical
disable_user | critical | critical

Approval Workflow

  1. Action Requested - Playbook or AI requests an action
  2. Policy Check - Engine evaluates approval requirements
  3. Queue or Execute - Action queued for approval or runs immediately
  4. Approval Decision - Approver accepts or denies
  5. Execution - Approved action executes
  6. Audit Log - All decisions recorded
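
The policy check at step 2 can be sketched as a small decision function. This mirrors the guardrails schema shown earlier in this guide but is illustrative, not the engine's actual code:

```python
def policy_decision(action: str, policy: dict) -> str:
    """Decide whether an action executes, queues for approval, or is blocked."""
    if policy.get("global", {}).get("kill_switch_enabled"):
        return "blocked"
    if action in policy.get("guardrails", {}).get("blocked_actions", []):
        return "blocked"
    rule = policy.get("actions", {}).get(action, {})
    if rule.get("requires_approval"):
        return "queued"
    return "execute"

policy = {
    "global": {"kill_switch_enabled": False},
    "guardrails": {"blocked_actions": ["delete_user"]},
    "actions": {"isolate_host": {"requires_approval": True}, "lookup_hash": {}},
}
```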

Approval Escalation

Configure escalation for unanswered approvals:

approvals:
  escalation:
    enabled: true
    rules:
      - after: 15m
        notify: [slack-security]
      - after: 30m
        notify: [pagerduty-oncall]
        escalate_to: manager
      - after: 1h
        auto_deny: true
        reason: "Approval timeout"
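
Escalation rules are evaluated against how long the approval has been pending; each rule whose threshold has elapsed fires. A sketch of that selection (illustrative, assuming only `m` and `h` suffixes as in the YAML above):

```python
def due_escalations(pending_minutes: float, rules: list) -> list:
    """Return escalation rules whose `after` threshold has already elapsed."""
    def minutes(after: str) -> float:
        if after.endswith("h"):
            return float(after[:-1]) * 60
        return float(after[:-1])  # assume a trailing "m"
    return [r for r in rules if pending_minutes >= minutes(r["after"])]

rules = [
    {"after": "15m", "notify": ["slack-security"]},
    {"after": "30m", "notify": ["pagerduty-oncall"], "escalate_to": "manager"},
    {"after": "1h", "auto_deny": True},
]
```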

Guardrails

Rate Limits

Prevent runaway automation:

guardrails:
  rate_limits:
    # Global limits
    global:
      max_actions_per_minute: 100
      max_actions_per_hour: 1000

    # Per-action limits
    isolate_host:
      max_per_hour: 10
      max_per_day: 50

    disable_user:
      max_per_hour: 5
      max_per_day: 20

Blocked Actions

Completely prevent certain actions:

guardrails:
  blocked_actions:
    - delete_user        # Never allow
    - format_disk        # Never allow
    - disable_mfa        # Too dangerous

Conditional Rules

Allow/deny based on conditions:

guardrails:
  conditional_rules:
    - action: isolate_host
      deny_if:
        - hostname_contains: "dc"      # Don't isolate domain controllers
        - hostname_contains: "prod-db" # Don't isolate production databases
        - is_server: true

    - action: disable_user
      deny_if:
        - is_admin: true               # Don't disable admins
        - is_service_account: true     # Don't disable service accounts
      require_if:
        - department: "executive"      # Extra approval for executives
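
These deny conditions are essentially predicates over the action's target attributes, any one of which blocks the action. A sketch of evaluation (condition names mirror the YAML above; the evaluator itself is hypothetical):

```python
def is_denied(target: dict, deny_if: list) -> bool:
    """Deny when any condition matches the action's target attributes."""
    for condition in deny_if:
        for key, expected in condition.items():
            if key.endswith("_contains"):
                # e.g. hostname_contains checks a substring of target["hostname"].
                field = key[: -len("_contains")]
                if expected in target.get(field, ""):
                    return True
            elif target.get(key) == expected:
                return True
    return False

deny_rules = [
    {"hostname_contains": "dc"},
    {"hostname_contains": "prod-db"},
    {"is_server": True},
]
```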

Asset Protection

Protect critical assets:

guardrails:
  protected_assets:
    hosts:
      - pattern: "dc-*"
        actions_blocked: [isolate_host, shutdown]
        reason: "Domain controllers require manual intervention"

      - pattern: "prod-*"
        require_approval: critical
        reason: "Production systems require manager approval"

    users:
      - pattern: "*@executive.company.com"
        require_approval: critical

      - pattern: "svc-*"
        actions_blocked: [disable_user, reset_password]

Kill Switch

Emergency Automation Halt

The kill switch immediately stops all automated actions:

Via UI:

  1. Go to Settings → Safety
  2. Click Activate Kill Switch
  3. Enter reason
  4. Confirm

Via API:

curl -X POST http://localhost:8080/api/kill-switch/activate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Investigating potential false positives"}'

Via CLI:

triage-warden kill-switch activate --reason "Emergency halt"

Kill Switch Effects

When active:

  • All pending actions are paused
  • New automated actions are blocked
  • Manual actions are still allowed
  • Alerts continue to be ingested
  • Enrichment continues (read-only)

Deactivating

Only users with admin or manager role can deactivate:

curl -X POST http://localhost:8080/api/kill-switch/deactivate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Issue resolved, resuming normal operations"}'

Audit Logging

What's Logged

Every action is logged with:

  • Timestamp
  • Action type
  • Target (host, user, etc.)
  • Requestor (playbook, user, AI)
  • Approver (if required)
  • Result (success, failure, denied)
  • Full context

Viewing Audit Logs

Via UI:

  • Settings → Audit Log
  • Filter by date, action type, user, result

Via API:

curl "http://localhost:8080/api/audit?action=isolate_host&from=2024-01-01" \
  -H "Authorization: Bearer $API_KEY"

Audit Retention

Configure retention in config/guardrails.yaml:

audit:
  retention_days: 365
  archive_to: s3://audit-logs-bucket/triage-warden/

Policy Testing

Dry Run Mode

Test policies without executing actions:

global:
  dry_run: true  # Log what would happen, don't execute

Policy Simulator

Test specific scenarios:

curl -X POST http://localhost:8080/api/policies/simulate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "isolate_host",
    "context": {
      "hostname": "dc-primary",
      "severity": "critical",
      "source": "crowdstrike"
    }
  }'

Response:

{
  "allowed": false,
  "reason": "Host matches protected pattern 'dc-*'",
  "would_require_approval": null,
  "matching_rules": [
    "protected_assets.hosts[0]"
  ]
}

Best Practices

1. Start Restrictive

Begin with assisted mode and strict approvals. Loosen over time as you build confidence.

2. Protect Critical Assets

Always define protected assets for:

  • Domain controllers
  • Production databases
  • Executive accounts
  • Service accounts

3. Use Approval Escalation

Don't let approvals sit forever. Configure timeouts and escalations.

4. Monitor Guardrail Hits

Alert when guardrails are triggered frequently; repeated hits may indicate:

  • Misconfiguration
  • Attack in progress
  • Need to adjust thresholds

5. Test Policy Changes

Always use dry run or simulator before deploying policy changes.

6. Keep Audit Logs

Maintain audit logs for compliance and incident review. Archive to external storage.

Example: Phishing Response Policy

Complete policy for phishing incident automation:

version: "1"

global:
  operation_mode: supervised

actions:
  # Enrichment - automatic
  lookup_url:
    requires_approval: false
    rate_limit: 100/minute

  lookup_domain:
    requires_approval: false
    rate_limit: 100/minute

  lookup_user:
    requires_approval: false
    rate_limit: 50/minute

  # Notifications - automatic
  send_notification:
    requires_approval: false

  # Containment - requires approval
  block_sender:
    requires_approval: true
    approval_level: medium
    max_per_hour: 50

  quarantine_email:
    requires_approval: true
    approval_level: low
    auto_approve_confidence: 0.95

  disable_user:
    requires_approval: true
    approval_level: critical

guardrails:
  conditional_rules:
    - action: disable_user
      deny_if:
        - is_admin: true
        - is_executive: true

    - action: quarantine_email
      auto_approve_if:
        - ai_confidence: "> 0.95"
        - virustotal_malicious: "> 5"

Default Configuration Reference

The default configuration file (config/default.yaml) contains all settings for a Triage Warden deployment. Copy this file and customize it for your environment.

Sensitive values should use environment variable interpolation: ${ENV_VAR_NAME}.
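The `${ENV_VAR_NAME}` interpolation can be sketched with a small std-only scanner. This is illustrative; the real config loader may handle escaping or nesting differently, and the resolver closure is an assumption made so the sketch stays testable:

```rust
use std::env;

// Replace every ${NAME} placeholder using `resolve`, leaving the
// placeholder intact when the variable is unset. Sketch only.
fn interpolate<F: Fn(&str) -> Option<String>>(input: &str, resolve: F) -> String {
    let mut out = String::new();
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        match rest[start..].find('}') {
            Some(end) => {
                let name = &rest[start + 2..start + end];
                match resolve(name) {
                    Some(val) => out.push_str(&val),
                    None => out.push_str(&rest[start..=start + end]),
                }
                rest = &rest[start + end + 1..];
            }
            None => {
                // unterminated placeholder: emit as-is
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    // Real usage would resolve from the process environment:
    let from_env = |name: &str| env::var(name).ok();
    let _ = interpolate("api_key: ${JIRA_API_KEY}", from_env);

    // Deterministic demo resolver:
    let fixed = |name: &str| (name == "JIRA_API_KEY").then(|| "s3cret".to_string());
    assert_eq!(interpolate("api_key: ${JIRA_API_KEY}", fixed), "api_key: s3cret");
}
```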

Operation Mode

operation_mode: supervised

Mode          Description
assisted      AI observes and suggests only, no automated actions
supervised    Low-risk actions automated, high-risk requires approval
autonomous    Full automation for configured incident types

Concurrency

max_concurrent_incidents: 50

Maximum number of incidents being processed at the same time. Increase for high-volume environments; decrease to limit resource usage.

Connectors

External service integrations. Each connector follows the same structure:

connectors:
  <connector_name>:
    connector_type: <type>
    enabled: true
    base_url: <url>
    api_key: ${API_KEY_ENV_VAR}
    api_secret: ""
    timeout_secs: 30
    settings:
      <connector-specific settings>

Common Fields

Field           Type      Description
connector_type  String    Connector implementation to use
enabled         Boolean   Whether this connector is active
base_url        String    Base URL for the service API
api_key         String    API key or username (use ${ENV_VAR})
api_secret      String    API secret or password (use ${ENV_VAR})
timeout_secs    Integer   HTTP request timeout in seconds
settings        Map       Connector-specific settings

Jira

connectors:
  jira:
    connector_type: jira
    enabled: true
    base_url: https://your-company.atlassian.net
    api_key: ${JIRA_API_KEY}
    timeout_secs: 30
    settings:
      project_key: SEC
      default_issue_type: Incident

VirusTotal

connectors:
  virustotal:
    connector_type: virustotal
    enabled: true
    base_url: https://www.virustotal.com
    api_key: ${VIRUSTOTAL_API_KEY}
    timeout_secs: 30
    settings:
      cache_ttl_secs: 3600

Splunk (SIEM)

connectors:
  splunk:
    connector_type: splunk
    enabled: true
    base_url: https://splunk.company.com:8089
    api_key: ${SPLUNK_TOKEN}
    settings:
      index: main
      earliest_time: -24h

CrowdStrike (EDR)

connectors:
  crowdstrike:
    connector_type: crowdstrike
    enabled: true
    base_url: https://api.crowdstrike.com
    api_key: ${CS_CLIENT_ID}
    api_secret: ${CS_CLIENT_SECRET}

LLM Configuration

llm:
  provider: anthropic
  model: claude-3-5-sonnet-20241022
  api_key: ${ANTHROPIC_API_KEY}
  base_url: ""
  max_tokens: 4096
  temperature: 0.1

Field        Description
provider     LLM provider: anthropic, openai, or local
model        Model identifier
api_key      API key (use ${ENV_VAR})
base_url     Custom endpoint URL for local/self-hosted models
max_tokens   Maximum tokens in LLM responses
temperature  Sampling temperature (lower = more deterministic)

Policy Configuration

policy:
  guardrails_path: config/guardrails.yaml
  default_approval_level: analyst
  auto_approve_low_risk: true
  confidence_threshold: 0.9

Field                   Description
guardrails_path         Path to the guardrails configuration file
default_approval_level  Default approval level for unknown actions (analyst, senior, manager)
auto_approve_low_risk   Whether low-risk actions can be auto-approved
confidence_threshold    Minimum AI confidence for auto-approval (0.0-1.0)

Logging Configuration

logging:
  level: info
  json_format: false
  # file_path: /var/log/triage-warden/triage-warden.log

Field        Description
level        Log level: trace, debug, info, warn, error
json_format  Use structured JSON format (recommended for production)
file_path    Optional log file path; omit to log to stdout

Database Configuration

database:
  url: sqlite://triage-warden.db?mode=rwc
  max_connections: 10
  run_migrations: true

Field            Description
url              Database connection string
max_connections  Connection pool size
run_migrations   Whether to run migrations on startup

Database URLs

Database           URL format
SQLite (dev)       sqlite://triage-warden.db?mode=rwc
PostgreSQL (prod)  postgres://user:pass@host:5432/triage_warden

API Server Configuration

api:
  port: 8080
  host: "0.0.0.0"
  enable_swagger: true
  timeout_secs: 30

Field           Description
port            TCP port to listen on
host            Bind address (0.0.0.0 for all interfaces, 127.0.0.1 for localhost only)
enable_swagger  Serve Swagger UI at /swagger-ui
timeout_secs    HTTP request timeout in seconds

Guardrails Reference

The guardrails configuration file (config/guardrails.yaml) defines security boundaries for AI-automated actions. These rules apply regardless of the current autonomy level.

Deny List

Actions and targets that are never allowed automatically.

Denied Actions

deny_list:
  actions:
    - delete_user          # Too destructive
    - wipe_host            # Too destructive
    - delete_all_emails    # Too destructive
    - modify_firewall      # High risk

Add any action name here to prevent the AI from ever executing it. These actions can still be performed manually by an analyst.

Target Patterns

Regex patterns that match protected systems. Any automated action targeting a hostname or identifier that matches these patterns requires human approval.

deny_list:
  target_patterns:
    - ".*-prod-.*"         # Production systems
    - "dc\\d+\\..*"        # Domain controllers
    - ".*-critical-.*"     # Explicitly marked critical
    - ".*\\.corp\\..*"     # Corporate infrastructure

Protected IPs

Specific IP addresses that must never be targeted by automated actions.

deny_list:
  protected_ips:
    - "10.0.0.1"           # Core router
    - "10.0.0.2"           # DNS server
    - "10.0.0.3"           # DHCP server

Protected Users

User accounts that are protected from automated modifications (disable, password reset, etc.). Supports exact matches and glob patterns.

deny_list:
  protected_users:
    - "admin"
    - "root"
    - "administrator"
    - "service-account-*"
    - "svc-*"
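Matching a user against these globs can be sketched as follows (std-only, supporting exact names and a single `*` wildcard; the real matcher may support richer glob syntax):

```rust
// Match a value against a glob with at most one '*' wildcard,
// e.g. "svc-*", "*@executive.company.com", or an exact name.
// Sketch only; not the shipped matcher.
fn glob_match(pattern: &str, value: &str) -> bool {
    match pattern.find('*') {
        None => pattern == value,
        Some(i) => {
            let (prefix, suffix) = (&pattern[..i], &pattern[i + 1..]);
            // length guard prevents prefix/suffix overlap on short values
            value.len() >= prefix.len() + suffix.len()
                && value.starts_with(prefix)
                && value.ends_with(suffix)
        }
    }
}

fn main() {
    assert!(glob_match("svc-*", "svc-backup"));
    assert!(glob_match("admin", "admin"));
    assert!(!glob_match("svc-*", "alice"));
}
```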

Rate Limits

Prevent runaway automation by capping how many times each action can be executed.

rate_limits:
  isolate_host:
    max_per_hour: 5
    max_per_day: 20
    max_concurrent: 2

  disable_user:
    max_per_hour: 10
    max_per_day: 50
    max_concurrent: 5

  block_ip:
    max_per_hour: 20
    max_per_day: 100
    max_concurrent: 10

  quarantine_email:
    max_per_hour: 50
    max_per_day: 500
    max_concurrent: 20

Field           Description
max_per_hour    Maximum executions in a rolling 60-minute window
max_per_day     Maximum executions in a rolling 24-hour window
max_concurrent  Maximum simultaneous in-flight executions
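The rolling windows can be sketched with a timestamp deque. This illustrative version takes injected timestamps (seconds) instead of wall-clock time so the logic is easy to test; it is not the shipped limiter:

```rust
use std::collections::VecDeque;

// Rolling-window counter: allows at most `max` events per `window_secs`.
// Sketch only; timestamps are injected rather than read from a clock.
struct RollingLimit {
    window_secs: u64,
    max: usize,
    events: VecDeque<u64>,
}

impl RollingLimit {
    fn new(window_secs: u64, max: usize) -> Self {
        Self { window_secs, max, events: VecDeque::new() }
    }

    // Returns true (and records the event) if the action may run at `now`.
    fn try_acquire(&mut self, now: u64) -> bool {
        // drop events that have aged out of the window
        while let Some(&t) = self.events.front() {
            if now - t >= self.window_secs { self.events.pop_front(); } else { break; }
        }
        if self.events.len() < self.max {
            self.events.push_back(now);
            true
        } else {
            false
        }
    }
}

fn main() {
    // max_per_hour: 5, as in the isolate_host example above
    let mut limit = RollingLimit::new(3600, 5);
    for i in 0..5 {
        assert!(limit.try_acquire(i));
    }
    assert!(!limit.try_acquire(10));  // sixth call inside the hour is denied
    assert!(limit.try_acquire(3700)); // old events expired, allowed again
}
```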

Approval Policies

Define when human approval is required, and at what level.

approval_policies:
  - name: critical_asset_protection
    description: "Require senior approval for actions on critical assets"
    condition:
      target_criticality:
        - critical
        - high
    requires: senior
    can_override: false

Condition Fields

Field               Type             Description
target_criticality  List of strings  Asset criticality levels that trigger this policy
action_type         List of strings  Action types that trigger this policy
confidence_below    Float (0.0-1.0)  Trigger when AI confidence is below this threshold

Approval Levels

Level    Who can approve
analyst  Any analyst
senior   Senior analyst or above
manager  SOC manager

Overridability

When can_override: true, a senior user can bypass the approval requirement. When false, the approval is mandatory and cannot be skipped.

Auto-Approve Rules

Actions that can be executed automatically when specific conditions are met, even in supervised mode.

auto_approve_rules:
  - name: ticket_operations
    description: "Auto-approve ticket creation and updates"
    action_types:
      - create_ticket
      - update_ticket
      - add_ticket_comment
    conditions:
      - confidence_above: 0.5

  - name: email_quarantine_high_confidence
    description: "Auto-approve email quarantine for high-confidence phishing"
    action_types:
      - quarantine_email
    conditions:
      - confidence_above: 0.95
      - verdict: true_positive

Condition Fields

Field             Type             Description
confidence_above  Float (0.0-1.0)  AI confidence must exceed this value
verdict           String           AI verdict must match (e.g., true_positive)

All conditions in the list must be met (AND logic).
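The AND semantics can be sketched as follows (field names mirror the YAML above; the Rust types themselves are assumptions, not the real engine):

```rust
// Evaluate an auto-approve rule's conditions with AND semantics.
// Sketch only; types are illustrative, not the shipped policy engine.
struct TriageContext {
    confidence: f64,
    verdict: &'static str,
}

enum Condition {
    ConfidenceAbove(f64),
    Verdict(&'static str),
}

// All conditions must hold for the rule to auto-approve.
fn auto_approved(conditions: &[Condition], ctx: &TriageContext) -> bool {
    conditions.iter().all(|c| match c {
        Condition::ConfidenceAbove(t) => ctx.confidence > *t,
        Condition::Verdict(v) => ctx.verdict == *v,
    })
}

fn main() {
    // email_quarantine_high_confidence from the YAML above
    let rule = [Condition::ConfidenceAbove(0.95), Condition::Verdict("true_positive")];
    let ctx = TriageContext { confidence: 0.97, verdict: "true_positive" };
    assert!(auto_approved(&rule, &ctx));

    let low = TriageContext { confidence: 0.80, verdict: "true_positive" };
    assert!(!auto_approved(&rule, &low)); // one failing condition denies the rule
}
```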

Data Policies

Control how sensitive data is handled in logs and LLM prompts.

data_policies:
  pii_filter: true
  pii_patterns:
    - "\\b\\d{3}-\\d{2}-\\d{4}\\b"      # SSN
    - "\\b\\d{16}\\b"                    # Credit card

  secrets_redaction: true
  secret_patterns:
    - "(?i)api[_-]?key"
    - "(?i)password"
    - "(?i)secret"
    - "(?i)token"
    - "(?i)credential"

  audit_data_access: true

Field              Description
pii_filter         Enable PII filtering in logs and LLM prompts
pii_patterns       Regex patterns matching PII to redact
secrets_redaction  Enable secret detection and redaction
secret_patterns    Regex patterns matching secrets to redact
audit_data_access  Log all data access operations
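The secret patterns above are regexes; as a simplified illustration of the redaction idea, here is a std-only keyword sketch. The keyword list and output format are assumptions, not the shipped behavior:

```rust
// Redact values on lines whose key contains a sensitive keyword.
// The shipped filter uses the regex patterns above; this sketch uses
// plain case-insensitive substring checks to show the idea.
const SECRET_KEYWORDS: [&str; 5] = ["api_key", "apikey", "password", "secret", "token"];

fn redact_line(line: &str) -> String {
    let lower = line.to_lowercase();
    let sensitive = SECRET_KEYWORDS.iter().any(|k| lower.contains(k));
    if sensitive {
        match line.split_once(':') {
            Some((key, _)) => format!("{}: [REDACTED]", key),
            None => "[REDACTED]".to_string(),
        }
    } else {
        line.to_string()
    }
}

fn main() {
    assert_eq!(redact_line("api_key: abc123"), "api_key: [REDACTED]");
    assert_eq!(redact_line("severity: high"), "severity: high");
}
```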

Escalation Rules

Define automatic escalation triggers.

escalation_rules:
  - name: repeated_false_positives
    description: "Escalate if same alert type has high FP rate"
    condition:
      false_positive_rate_above: 0.5
      sample_size_min: 10
    action: escalate_to_analyst

  - name: incident_correlation
    description: "Escalate if multiple related incidents detected"
    condition:
      related_incidents_above: 3
      time_window_hours: 1
    action: escalate_to_senior

  - name: critical_severity
    description: "Always escalate critical severity incidents"
    condition:
      severity: critical
    action: escalate_to_manager

Escalation Actions

Action               Description
escalate_to_analyst  Route to any available analyst
escalate_to_senior   Route to a senior analyst
escalate_to_manager  Route to the SOC manager

Integrations

Triage Warden supports integrations for identity, telemetry, enrichment, and response workflows.

SSO

Use the SSO guides to configure OIDC or SAML with your identity provider:

SSO Integration Guide

Triage Warden supports enterprise SSO through both OIDC and SAML endpoints.

Supported Flows

  • OIDC login: /auth/oidc/login
  • OIDC callback: /auth/oidc/callback
  • OIDC logout: /auth/oidc/logout
  • SAML metadata: /auth/saml/metadata
  • SAML login: /auth/saml/login
  • SAML ACS: /auth/saml/acs
  • SAML SLO: /auth/saml/slo

Common Environment Variables

  • TW_OIDC_ISSUER
  • TW_OIDC_CLIENT_ID
  • TW_OIDC_CLIENT_SECRET
  • TW_OIDC_REDIRECT_URI
  • TW_OIDC_SCOPES
  • TW_OIDC_JWKS_URI (optional override; discovery jwks_uri is used by default)
  • TW_OIDC_REQUIRE_MFA
  • TW_SSO_ROLE_MAPPING
  • TW_SSO_DEFAULT_ROLE
  • TW_SSO_AUTO_CREATE_USERS
  • TW_SAML_ENTITY_ID
  • TW_SAML_ACS_URL
  • TW_SAML_IDP_SSO_URL
  • TW_SAML_CERTIFICATE
  • TW_SAML_PRIVATE_KEY
  • TW_SAML_EXPECTED_ISSUER
  • TW_SAML_REQUIRE_MFA

Use provider-specific documents in this folder for exact values.

Security Notes

  • OIDC ID tokens are validated for issuer/audience/nonce/expiration and signature (JWKS).
  • SAML assertions enforce request correlation (InResponseTo), destination checks, signature presence, SHA-2 algorithm allow-listing, and certificate pinning checks.

Okta Setup

1. Create Application

  1. Okta Admin: Applications > Create App Integration.
  2. Choose OIDC - Web Application (recommended) or SAML 2.0.
  3. Configure sign-in redirect URI:
    • https://<your-host>/auth/oidc/callback

2. OIDC Environment Variables

  • TW_OIDC_ISSUER=https://<okta-domain>/oauth2/default
  • TW_OIDC_CLIENT_ID=<okta-client-id>
  • TW_OIDC_CLIENT_SECRET=<okta-client-secret>
  • TW_OIDC_REDIRECT_URI=https://<your-host>/auth/oidc/callback
  • TW_OIDC_SCOPES=openid,profile,email,groups
  • TW_OIDC_REQUIRE_MFA=true

3. Group to Role Mapping

Example:

  • TW_SSO_ROLE_MAPPING=okta-soc-admin=admin,okta-soc-analyst=analyst,okta-soc-viewer=viewer

4. Optional SCIM Provisioning

SCIM can be enabled on top of JIT provisioning for pre-provisioning and automated lifecycle management. JIT remains active as a fallback for first-login provisioning.

Azure AD (Microsoft Entra ID) Setup

1. Register App

  1. Microsoft Entra admin center: Applications > App registrations > New registration.
  2. Add redirect URI:
    • OIDC: https://<your-host>/auth/oidc/callback
    • SAML ACS (if using SAML): https://<your-host>/auth/saml/acs
  3. Save Application (client) ID and Directory (tenant) ID.

2. Configure OIDC in Triage Warden

Set:

  • TW_OIDC_ISSUER=https://login.microsoftonline.com/<tenant-id>/v2.0
  • TW_OIDC_CLIENT_ID=<application-client-id>
  • TW_OIDC_CLIENT_SECRET=<generated-client-secret>
  • TW_OIDC_REDIRECT_URI=https://<your-host>/auth/oidc/callback
  • TW_OIDC_SCOPES=openid,profile,email
  • TW_OIDC_REQUIRE_MFA=true (recommended)

3. Claims and Group Mapping

  1. In app Token configuration, add group claims.
  2. Map groups to roles:
    • TW_SSO_ROLE_MAPPING=SOC-Admins=admin,SOC-Analysts=analyst,SOC-Viewers=viewer

4. Conditional Access / MFA

  1. Create conditional access policy requiring MFA for the app.
  2. Keep TW_OIDC_REQUIRE_MFA=true to enforce server-side claim checks.

Google Workspace Setup

1. Create OAuth Client

  1. Google Cloud Console: configure the OAuth consent screen.
  2. Create OAuth client (Web application).
  3. Add authorized redirect URI:
    • https://<your-host>/auth/oidc/callback

2. OIDC Configuration

  • TW_OIDC_ISSUER=https://accounts.google.com
  • TW_OIDC_CLIENT_ID=<google-client-id>
  • TW_OIDC_CLIENT_SECRET=<google-client-secret>
  • TW_OIDC_REDIRECT_URI=https://<your-host>/auth/oidc/callback
  • TW_OIDC_SCOPES=openid,profile,email

3. Role Mapping

Google Workspace group claims may require Cloud Identity configuration. Use mapped group names:

  • TW_SSO_ROLE_MAPPING=tw-admins=admin,tw-analysts=analyst,tw-viewers=viewer

4. MFA

Enforce 2-Step Verification in Workspace admin policies and set:

  • TW_OIDC_REQUIRE_MFA=true

Generic OIDC/SAML Setup

OIDC Checklist

  1. Configure redirect URI: https://<host>/auth/oidc/callback.
  2. Set:
    • TW_OIDC_ISSUER
    • TW_OIDC_CLIENT_ID
    • TW_OIDC_CLIENT_SECRET
    • TW_OIDC_REDIRECT_URI
  3. Optional claim overrides:
    • TW_OIDC_EMAIL_CLAIM
    • TW_OIDC_NAME_CLAIM
    • TW_OIDC_GROUPS_CLAIM
    • TW_OIDC_ROLES_CLAIM
    • TW_OIDC_MFA_CLAIM
  4. Configure role mapping:
    • TW_SSO_ROLE_MAPPING=external_group=internal_role,...

SAML Checklist

  1. Download SP metadata from https://<host>/auth/saml/metadata.
  2. Configure IdP to POST assertions to https://<host>/auth/saml/acs.
  3. Set:
    • TW_SAML_ENTITY_ID
    • TW_SAML_ACS_URL
    • TW_SAML_IDP_SSO_URL
    • TW_SAML_CERTIFICATE
  4. Optional:
    • TW_SAML_PRIVATE_KEY (required for encrypted assertions)
    • TW_SAML_IDP_SLO_URL
    • TW_SAML_EXPECTED_ISSUER
    • TW_SAML_REQUIRE_MFA

Security Recommendations

  • Always require TLS termination.
  • Keep TW_OIDC_REQUIRE_MFA=true and TW_SAML_REQUIRE_MFA=true for privileged tenants.
  • Use least-privilege role mappings.
  • Rotate OIDC client secrets and SAML certificates regularly.

Architectural Decision Records

This directory contains Architectural Decision Records (ADRs) for Triage Warden.

What is an ADR?

An ADR is a document that captures an important architectural decision made along with its context and consequences.

ADR Index

Number  Title                                        Status    Date
001     Event Bus Architecture                       Accepted  2026-02
002     Dual Database Support (SQLite + PostgreSQL)  Accepted  2026-02
003     Credential Encryption at Rest                Accepted  2026-02
004     Session Management Strategy                  Accepted  2026-02
005     API Key Format and Security                  Accepted  2026-02
006     Operation Modes (Supervised/Autonomous)      Accepted  2026-02
007     Kill Switch Design                           Accepted  2026-02

ADR Template

New ADRs should follow this template:

# ADR-XXX: Title

## Status

Proposed | Accepted | Deprecated | Superseded

## Context

What is the issue that we're seeing that is motivating this decision or change?

## Decision

What is the change that we're proposing and/or doing?

## Consequences

What becomes easier or more difficult to do because of this change?

ADR-001: Event Bus Architecture

Status

Accepted

Context

Triage Warden needs to coordinate multiple components (enrichment, analysis, action execution, notifications) in response to security incidents. We needed a way to:

  1. Decouple components for independent development and testing
  2. Enable real-time updates to the dashboard
  3. Support both synchronous and asynchronous processing
  4. Maintain an audit trail of all system events

Decision

We implemented an in-process event bus using Tokio channels with the following design:

Event Types

All significant system events are captured as TriageEvent variants:

  • AlertReceived - New alert from webhook
  • IncidentCreated - Incident created from alert
  • EnrichmentComplete - Single enrichment finished
  • EnrichmentPhaseComplete - All enrichments done
  • AnalysisComplete - AI analysis finished
  • ActionsProposed - Response actions proposed
  • ActionApproved/Denied - Action approval decision
  • ActionExecuted - Action completed
  • StatusChanged - Incident status transition
  • TicketCreated - External ticket created
  • IncidentEscalated - Incident escalated
  • IncidentResolved - Incident resolved
  • KillSwitchActivated - Emergency stop triggered

Delivery Mechanisms

  1. Broadcast Channel: For real-time dashboard updates via SSE
  2. Named Subscribers: For component-specific processing queues
  3. Event History: In-memory buffer for recent event retrieval

Error Handling

Events are fire-and-forget with fallback logging:

  • publish() - Returns Result for cases where failure matters
  • publish_with_fallback() - Logs errors, never fails (for non-critical events)

Consequences

Positive

  • Components are loosely coupled and independently testable
  • Dashboard receives real-time updates without polling
  • Complete event history available for debugging
  • Failed subscribers don't block the main processing flow

Negative

  • In-process only - no distributed event bus
  • Event history is limited and in-memory (lost on restart)
  • No guaranteed delivery or replay capability
  • Broadcast channel has limited buffer (may drop events under load)

Future Considerations

For high-availability deployments, consider:

  • Redis Pub/Sub for distributed events
  • PostgreSQL LISTEN/NOTIFY for persistent events
  • External message queue (RabbitMQ, Kafka) for durability

ADR-002: Dual Database Support (SQLite + PostgreSQL)

Status

Accepted

Context

Triage Warden needed to support different deployment scenarios:

  1. Development/Testing: Quick setup without external dependencies
  2. Small Deployments: Single-server installations with minimal infrastructure
  3. Production: Scalable deployments with high availability requirements

We evaluated:

  • SQLite only (simple but limited scalability)
  • PostgreSQL only (powerful but heavy for small deployments)
  • Dual support (flexibility but increased complexity)

Decision

We implemented dual database support using SQLx with compile-time query verification:

Architecture

┌─────────────────────────────────────────┐
│              Application                │
├─────────────────────────────────────────┤
│           Repository Traits             │
│   (IncidentRepository, UserRepository)  │
├──────────────────┬──────────────────────┤
│  SqliteXxxRepo   │    PgXxxRepo         │
├──────────────────┼──────────────────────┤
│   SQLite Pool    │   PostgreSQL Pool    │
└──────────────────┴──────────────────────┘

Implementation

  • DbPool enum wraps both pool types
  • Each repository has SQLite and PostgreSQL implementations
  • Factory functions create the appropriate implementation based on pool type
  • Migrations are maintained separately for each database

Database Selection

Determined by DATABASE_URL environment variable:

  • sqlite:path/to/file.db → SQLite
  • postgres://user:pass@host/db → PostgreSQL
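Backend selection from the URL scheme can be sketched as follows (illustrative; constructing the actual `DbPool` variants is more involved and elided here):

```rust
// Select the database backend from the DATABASE_URL scheme.
// Sketch only; real code would build the matching connection pool.
#[derive(Debug, PartialEq)]
enum Backend {
    Sqlite,
    Postgres,
}

fn backend_for(url: &str) -> Option<Backend> {
    if url.starts_with("sqlite:") {
        Some(Backend::Sqlite)
    } else if url.starts_with("postgres://") || url.starts_with("postgresql://") {
        Some(Backend::Postgres)
    } else {
        None
    }
}

fn main() {
    assert_eq!(backend_for("sqlite:triage-warden.db"), Some(Backend::Sqlite));
    assert_eq!(backend_for("postgres://tw:pw@localhost/tw"), Some(Backend::Postgres));
    assert_eq!(backend_for("mysql://nope"), None);
}
```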

Consequences

Positive

  • Zero-config development with SQLite
  • Production-ready PostgreSQL support
  • Same API regardless of database backend
  • Compile-time query verification for both backends

Negative

  • Duplicate migration files
  • Some features may have different behavior (e.g., JSON querying)
  • More complex testing matrix
  • Cannot use PostgreSQL-specific features (CTEs, window functions) without SQLite equivalents

Trade-offs

Feature             SQLite      PostgreSQL
Setup complexity    None        Requires server
Concurrent writes   Limited     Excellent
JSON indexing       Basic       JSONB with GIN
Full-text search    Limited     Excellent
Connection pooling  In-process  Network
Backup              File copy   pg_dump

ADR-003: Credential Encryption at Rest

Status

Accepted

Context

Triage Warden stores sensitive credentials for external integrations:

  • API keys for threat intelligence services (VirusTotal, etc.)
  • OAuth tokens for cloud services (Microsoft, Google)
  • Webhook secrets for SIEM integrations
  • SMTP credentials for email notifications

These credentials must be protected at rest in the database.

Decision

We implemented AES-256-GCM encryption for sensitive fields:

Encryption Scheme

  • Algorithm: AES-256-GCM (authenticated encryption)
  • Key Derivation: HKDF from master key + unique salt per value
  • Nonce: 96-bit random nonce per encryption
  • Storage Format: Base64(nonce || ciphertext || auth_tag)

Key Management

ENCRYPTION_KEY (env var)
        │
        ▼
    HKDF-SHA256
        │
    ┌───┴───┐
    │ Salt  │ (per-value, stored with ciphertext)
    └───┬───┘
        ▼
   Derived Key
        │
        ▼
   AES-256-GCM

Implementation

pub trait CredentialEncryptor: Send + Sync {
    fn encrypt(&self, plaintext: &str) -> Result<String, EncryptionError>;
    fn decrypt(&self, ciphertext: &str) -> Result<String, EncryptionError>;
}

Two implementations:

  • Aes256GcmEncryptor - Production encryption
  • NoOpEncryptor - Development mode (disabled encryption)

Encrypted Fields

Table                  Field                 Contains
connectors             config.api_key        API keys
connectors             config.client_secret  OAuth secrets
settings               llm.api_key           LLM provider API key
notification_channels  config.webhook_url    Webhook URLs with tokens

Consequences

Positive

  • Credentials protected if database is compromised
  • Authenticated encryption prevents tampering
  • Per-value salt prevents rainbow table attacks
  • Key rotation possible without re-encrypting all values

Negative

  • Cannot search encrypted fields
  • Master key must be securely managed
  • Performance overhead for encryption/decryption
  • Key loss = data loss (no recovery without key)

Security Considerations

  1. Key Storage: Use environment variable or secrets manager
  2. Key Rotation: Implement key versioning for rotation
  3. Audit: Log all decryption operations
  4. Memory: Clear sensitive data from memory after use

ADR-004: Session Management Strategy

Status

Accepted

Context

The dashboard requires user authentication with session management. We needed to decide between:

  1. JWT tokens (stateless)
  2. Server-side sessions (stateful)
  3. Hybrid approach

Requirements:

  • Secure authentication for web dashboard
  • Support for session revocation
  • CSRF protection for form submissions
  • Reasonable session lifetime

Decision

We chose server-side sessions stored in the database using tower-sessions:

Session Architecture

Browser                          Server
   │                                │
   │  POST /auth/login              │
   │  (username, password)          │
   ├───────────────────────────────►│
   │                                │ Validate credentials
   │                                │ Create session in DB
   │  Set-Cookie: id=session_id     │
   │◄───────────────────────────────┤
   │                                │
   │  GET /dashboard                │
   │  Cookie: id=session_id         │
   ├───────────────────────────────►│
   │                                │ Load session from DB
   │                                │ Verify not expired
   │  200 OK                        │
   │◄───────────────────────────────┤

Session Storage

Sessions are stored in the sessions table:

Column       Type     Description
id           TEXT     Session ID (secure random)
data         BLOB     Encrypted session data
expiry_date  INTEGER  Unix timestamp

Session Data

struct SessionData {
    user_id: Uuid,
    username: String,
    role: UserRole,
    login_csrf: String,  // CSRF token for sensitive actions
}

Security Measures

  1. Secure Cookies: HttpOnly, Secure (in production), SameSite=Lax
  2. CSRF Protection: Token in session, validated on state-changing requests
  3. Session Expiry: 24-hour default, configurable
  4. Rotation: New session ID on privilege changes

Consequences

Positive

  • Sessions can be revoked immediately
  • No token size limits for session data
  • CSRF tokens integrated naturally
  • Easy to implement "logout all devices"

Negative

  • Database read on every authenticated request
  • Session table requires cleanup (expired sessions)
  • Horizontal scaling requires shared database
  • Slightly higher latency than JWTs

Comparison with JWTs

Aspect       Sessions               JWTs
Revocation   Immediate              Requires blacklist
Storage      Server                 Client
Scalability  Requires shared store  Stateless
Size         Cookie only            Full payload
Security     Keys in DB             Signature verification

ADR-005: API Key Format and Security

Status

Accepted

Context

Triage Warden exposes a REST API that needs programmatic authentication. We needed to design an API key format that is:

  1. Secure against brute-force attacks
  2. Easily identifiable (for revocation)
  3. User-friendly for debugging
  4. Compatible with common tooling

Decision

We adopted a prefixed API key format similar to GitHub and Stripe:

Key Format

tw_<user_prefix>_<random_secret>

Example: tw_abc12345_9f8e7d6c5b4a3210fedcba9876543210

Components:

  • tw_ - Application prefix (identifies Triage Warden keys)
  • <user_prefix> - First 8 chars for identification (stored in DB)
  • <random_secret> - 32 bytes of cryptographic randomness

Storage

Only the hash is stored, never the raw key:

Column      Value
key_prefix  tw_abc12345 (for lookup)
key_hash    SHA-256(full_key)

Authentication Flow

1. Extract key from Authorization header
2. Parse prefix (first 11 chars)
3. Look up by prefix in database
4. Compute SHA-256 of provided key
5. Compare with stored hash (constant-time)
6. Check expiration and scopes
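Step 5's constant-time comparison can be sketched as follows. Production code would typically use a vetted crate (e.g. `subtle`); this hand-rolled version is only illustrative:

```rust
// Constant-time equality: XOR every byte pair and accumulate the
// differences, so timing does not depend on where the inputs differ.
// Sketch only; prefer a vetted constant-time crate in production.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

fn main() {
    // comparing stored vs. computed key hashes (hex strings as bytes)
    assert!(constant_time_eq(b"9f8e7d6c", b"9f8e7d6c"));
    assert!(!constant_time_eq(b"9f8e7d6c", b"9f8e7d6d"));
}
```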

Key Generation

use rand::Rng;
use sha2::{Sha256, Digest};

fn generate_api_key(user_id: Uuid) -> (String, String, String) {
    let secret: [u8; 32] = rand::thread_rng().gen();
    let secret_hex = hex::encode(secret);

    let prefix = format!("tw_{}", &user_id.to_string()[..8]);
    let full_key = format!("{}_{}", prefix, secret_hex);
    let key_hash = hex::encode(Sha256::digest(full_key.as_bytes()));

    (full_key, prefix, key_hash)  // Return key once, store prefix + hash
}

Consequences

Positive

  • Keys are identifiable without exposing secrets
  • Prefix enables efficient database lookup
  • Format is familiar to developers
  • Hash storage protects against database leaks
  • Constant-time comparison prevents timing attacks

Negative

  • Keys must be stored securely by users (cannot be recovered)
  • Prefix lookup could reveal key existence (minor info leak)
  • Longer keys than simple tokens

Security Properties

Property    Implementation
Entropy     256 bits (32 random bytes)
Storage     SHA-256 hash only
Comparison  Constant-time
Revocation  Delete from database
Expiration  Optional expiry_at field
Scopes      JSON array of allowed operations

ADR-006: Operation Modes (Supervised/Autonomous)

Status

Accepted

Context

Security automation involves a trust spectrum from fully manual to fully autonomous. Organizations have different risk tolerances and regulatory requirements. We needed to support:

  1. Organizations starting with automation (cautious)
  2. Mature SOCs ready for autonomous response
  3. Gradual transition between modes
  4. Compliance with approval requirements

Decision

We implemented three operation modes configurable at the system level:

Modes

| Mode            | Description                                             | Default Approval |
|-----------------|---------------------------------------------------------|------------------|
| supervised      | All actions require human approval                      | require_approval |
| semi_autonomous | Low-risk actions auto-approved, high-risk need approval | policy-based     |
| autonomous      | Actions auto-approved unless policy denies              | auto_approve     |
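The mapping from mode to default decision can be sketched as below. Type and function names are illustrative, not the actual implementation; the risk classification of an action is assumed to come from policy metadata.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum OperationMode {
    Supervised,
    SemiAutonomous,
    Autonomous,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Decision {
    RequireApproval,
    AutoApprove,
}

/// Default decision, applied only when no explicit policy matched.
fn mode_default(mode: OperationMode, high_risk: bool) -> Decision {
    match mode {
        // supervised: everything waits for a human
        OperationMode::Supervised => Decision::RequireApproval,
        // semi_autonomous: only low-risk actions proceed automatically
        OperationMode::SemiAutonomous if high_risk => Decision::RequireApproval,
        OperationMode::SemiAutonomous => Decision::AutoApprove,
        // autonomous: proceed unless a policy denied earlier
        OperationMode::Autonomous => Decision::AutoApprove,
    }
}

fn main() {
    assert_eq!(mode_default(OperationMode::Supervised, false), Decision::RequireApproval);
    assert_eq!(mode_default(OperationMode::SemiAutonomous, true), Decision::RequireApproval);
    assert_eq!(mode_default(OperationMode::SemiAutonomous, false), Decision::AutoApprove);
    assert_eq!(mode_default(OperationMode::Autonomous, true), Decision::AutoApprove);
}
```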

Mode Selection Flow

Incoming Action
      │
      ▼
┌──────────────────┐
│ Check Kill Switch│
└────────┬─────────┘
         │ (not active)
         ▼
┌──────────────────┐
│ Evaluate Policies│
└────────┬─────────┘
         │
    ┌────┴─────┐
    │ Explicit │
    │ Policy?  │
    └────┬─────┘
    Yes  │  No
    │    │
    │    ▼
    │ ┌─────────────────┐
    │ │ Apply Mode      │
    │ │ Default         │
    │ └────────┬────────┘
    │          │
    └────┬─────┘
         │
         ▼
   Final Decision

Policy Override

Policies can override mode defaults:

policies:
  - name: "Block critical IPs always requires approval"
    condition: "action.type == 'block_ip' && target.is_critical"
    action: "require_approval"
    approval_level: "manager"

  - name: "Low severity lookups auto-approved"
    condition: "action.type == 'lookup' && incident.severity in ['info', 'low']"
    action: "auto_approve"

Configuration

# config.yaml
general:
  mode: "supervised"  # supervised | semi_autonomous | autonomous

Or via API:

curl -X PUT /api/settings/general \
  -d '{"mode": "semi_autonomous"}'

Consequences

Positive

  • Flexible for different organizational needs
  • Gradual automation adoption path
  • Policies provide fine-grained control
  • Easy to fall back to supervised mode

Negative

  • More complex decision logic
  • Potential for misconfiguration
  • Requires clear documentation of behavior
  • Audit trails must capture mode at decision time

Mode Comparison

| Scenario         | Supervised      | Semi-Auto       | Autonomous        |
|------------------|-----------------|-----------------|-------------------|
| Block malware IP | Approval needed | Auto-approved   | Auto-approved     |
| Disable user     | Approval needed | Approval needed | Auto-approved     |
| Isolate host     | Approval needed | Approval needed | Approval (policy) |
| Lookup IOC       | Approval needed | Auto-approved   | Auto-approved     |

ADR-007: Kill Switch Design

Status

Accepted

Context

Autonomous security response systems pose risks if they malfunction:

  1. False positives could disable legitimate users/systems
  2. Bugs could trigger cascading actions
  3. Compromised AI could be weaponized
  4. External events may require immediate halt

We needed an emergency stop mechanism that is:

  • Fast to activate (< 1 second)
  • Globally effective
  • Difficult to accidentally trigger
  • Easy to recover from

Decision

We implemented a global kill switch with the following design:

Architecture

                    ┌─────────────┐
                    │ Kill Switch │
                    │   State     │
                    └──────┬──────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Orchestrator  │  │ Policy Engine │  │ Action Runner │
│               │  │               │  │               │
│ check()       │  │ check()       │  │ check()       │
│ before        │  │ before        │  │ before        │
│ processing    │  │ evaluation    │  │ execution     │
└───────────────┘  └───────────────┘  └───────────────┘

State

pub struct KillSwitchStatus {
    pub active: bool,
    pub reason: Option<String>,
    pub activated_by: Option<String>,
    pub activated_at: Option<DateTime<Utc>>,
}

Check Points

The kill switch is checked at multiple points:

  1. Alert Processing: Before creating incidents from alerts
  2. Policy Evaluation: Before evaluating approval policies
  3. Action Execution: Before executing any response action
  4. Playbook Execution: Before running playbook stages
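A minimal in-memory guard for these check points might look like the following sketch. The real switch also records reason, actor, and timestamp (see the `KillSwitchStatus` struct above); only the boolean gate is shown here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal global kill switch: a single atomic flag that every
/// component consults before doing work.
pub struct KillSwitch {
    active: AtomicBool,
}

impl KillSwitch {
    pub fn new() -> Self {
        Self { active: AtomicBool::new(false) }
    }

    pub fn activate(&self) {
        self.active.store(true, Ordering::SeqCst);
    }

    pub fn deactivate(&self) {
        self.active.store(false, Ordering::SeqCst);
    }

    /// Called at each check point; the caller aborts if this returns true.
    pub fn is_active(&self) -> bool {
        self.active.load(Ordering::SeqCst)
    }
}

fn main() {
    let ks = KillSwitch::new();
    assert!(!ks.is_active());
    ks.activate();
    assert!(ks.is_active());
    ks.deactivate();
    assert!(!ks.is_active());
}
```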

Activation

// Via API
POST /api/kill-switch/activate
{
    "reason": "Investigating false positive surge",
    "activated_by": "admin@example.com"
}

// Via CLI
tw-cli kill-switch activate --reason "Emergency maintenance"

// Programmatic
kill_switch.activate("Anomaly detected", "system").await;

Deactivation

// Via API
POST /api/kill-switch/deactivate
{
    "reason": "Issue resolved"
}

// Only admins can deactivate

Event Notification

Activation triggers:

  • KillSwitchActivated event to all subscribers
  • Dashboard alert banner
  • Notification to configured channels
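The fan-out to subscribers can be sketched with plain channels. This is illustrative only: the `EventBus` type and `Event` enum are assumptions, and a real async implementation would more likely use a broadcast channel.

```rust
use std::sync::mpsc;

#[derive(Clone, Debug, PartialEq)]
enum Event {
    KillSwitchActivated { reason: String },
}

/// Hypothetical fan-out bus: each subscriber owns a receiver, and
/// publishing clones the event to every registered sender.
struct EventBus {
    subscribers: Vec<mpsc::Sender<Event>>,
}

impl EventBus {
    fn new() -> Self {
        Self { subscribers: Vec::new() }
    }

    fn subscribe(&mut self) -> mpsc::Receiver<Event> {
        let (tx, rx) = mpsc::channel();
        self.subscribers.push(tx);
        rx
    }

    fn publish(&self, event: Event) {
        for tx in &self.subscribers {
            // Ignore subscribers that have dropped their receiver
            let _ = tx.send(event.clone());
        }
    }
}

fn main() {
    let mut bus = EventBus::new();
    let rx = bus.subscribe();
    bus.publish(Event::KillSwitchActivated { reason: "test".into() });
    let Event::KillSwitchActivated { reason } = rx.recv().unwrap();
    assert_eq!(reason, "test");
}
```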

Consequences

Positive

  • Immediate halt of all automation
  • Clear audit trail of activation/deactivation
  • Multiple activation methods (UI, API, CLI)
  • Visible status in all interfaces

Negative

  • In-memory state (lost on restart, resets to inactive)
  • No automatic activation triggers yet
  • Single global switch (no per-action granularity)
  • Requires admin access to deactivate

Future Enhancements

  1. Persistent State: Store kill switch state in database
  2. Auto-Activation: Trigger on anomaly detection
  3. Scoped Switches: Per-action-type or per-connector switches
  4. Scheduled Deactivation: Auto-deactivate after timeout
  5. Two-Person Rule: Require multiple admins for deactivation

Operational Procedures

When kill switch is activated:

  1. All pending actions remain pending
  2. New alerts create incidents but stop at enrichment
  3. Dashboard shows prominent warning banner
  4. Existing approved actions are NOT rolled back

To recover:

  1. Investigate root cause
  2. Fix underlying issue
  3. Deactivate kill switch
  4. Manually review pending actions
  5. Resume normal operations

Production Deployment

This section covers deploying Triage Warden in production environments.

Deployment Options

Triage Warden can be deployed in several ways:

  • Docker - Recommended for most deployments. Quick setup with Docker Compose.
  • Kubernetes - For orchestrated, scalable deployments using raw manifests.
  • Helm Chart - Recommended for Kubernetes. Templated deployment with environment-specific values.
  • Binary - Direct binary installation on Linux servers.

Before You Deploy

Before deploying to production, review:

  1. Production Checklist - Security and configuration requirements
  2. Configuration Reference - All environment variables and settings
  3. Database Setup - PostgreSQL configuration for production
  4. Security Hardening - TLS, secrets, network policies
  5. Scaling - Horizontal scaling considerations

Quick Start

For a quick production deployment with Docker:

# Clone the repository
git clone https://github.com/your-org/triage-warden.git
cd triage-warden/deploy/docker

# Configure environment
cp .env.example .env
# Edit .env with your settings

# Generate encryption key
echo "TW_ENCRYPTION_KEY=$(openssl rand -base64 32)" >> .env

# Start services
docker compose -f docker-compose.prod.yml up -d

Architecture Overview

A typical production deployment includes:

                    ┌─────────────────┐
                    │  Load Balancer  │
                    │   (TLS term.)   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
        ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
        │  Triage   │  │  Triage   │  │  Triage   │
        │  Warden   │  │  Warden   │  │  Warden   │
        │ Instance 1│  │ Instance 2│  │ Instance 3│
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              └──────────────┼──────────────┘
                             │
                    ┌────────▼────────┐
                    │   PostgreSQL    │
                    │   (Primary)     │
                    └─────────────────┘

Support

For deployment assistance:

Production Checklist

Complete this checklist before deploying Triage Warden to production.

Security Requirements

Authentication & Secrets

  • Encryption key configured: Set TW_ENCRYPTION_KEY with a 32-byte base64-encoded key

    # Generate a secure key
    openssl rand -base64 32
    
  • JWT secret configured: Set TW_JWT_SECRET with a strong random value

    openssl rand -hex 32
    
  • Session secret configured: Set TW_SESSION_SECRET for session encryption

  • Default admin password changed: Change the default admin credentials immediately after first login

  • API keys use scoped permissions: Don't create API keys with * scope in production

Network Security

  • TLS enabled: All traffic should use HTTPS
  • TLS certificates valid: Use certificates from a trusted CA (not self-signed)
  • Internal traffic encrypted: Database connections use TLS
  • Firewall rules configured: Only expose necessary ports (443 for HTTPS)
  • Rate limiting enabled: Protect against brute force attacks

Database Security

  • PostgreSQL in production: Don't use SQLite for production workloads
  • Database user has minimal permissions: Use a dedicated user, not superuser
  • Database connections encrypted: Enable sslmode=require or verify-full
  • Regular backups configured: Automated daily backups with tested restore procedure

Configuration Requirements

Required Environment Variables

| Variable          | Description                                  | Example                                                      |
|-------------------|----------------------------------------------|--------------------------------------------------------------|
| DATABASE_URL      | PostgreSQL connection string                 | postgres://user:pass@host:5432/triage_warden?sslmode=require |
| TW_ENCRYPTION_KEY | Credential encryption key (32 bytes, base64) | K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72...                          |
| TW_JWT_SECRET     | JWT signing secret                           | your-256-bit-secret                                          |
| TW_SESSION_SECRET | Session encryption secret                    | another-secret-value                                         |
| RUST_LOG          | Log level                                    | info or triage_warden=debug                                  |

Optional Environment Variables

| Variable            | Description               | Default                    |
|---------------------|---------------------------|----------------------------|
| TW_BIND_ADDRESS     | Server bind address       | 0.0.0.0:8080               |
| TW_BASE_URL         | Public URL for callbacks  | https://triage.example.com |
| TW_TRUSTED_PROXIES  | Comma-separated proxy IPs | None                       |
| TW_MAX_REQUEST_SIZE | Maximum request body size | 10MB                       |

LLM Configuration (if using AI features)

  • LLM API key configured: Set via UI or environment variable
  • Rate limits configured: Prevent runaway API costs
  • Model selected appropriately: Balance cost vs. capability

Infrastructure Requirements

Minimum Hardware

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU       | 2 cores | 4 cores     |
| RAM       | 2 GB    | 4 GB        |
| Storage   | 20 GB   | 50 GB SSD   |

Database Requirements

| Metric             | Minimum | Recommended |
|--------------------|---------|-------------|
| PostgreSQL Version | 14      | 15+         |
| Connections        | 20      | 50+         |
| Storage            | 10 GB   | 50 GB+      |

Network Requirements

  • Outbound HTTPS (443) to:
    • LLM provider (api.openai.com, api.anthropic.com)
    • Configured connectors (VirusTotal, Jira, etc.)
  • Inbound HTTPS (443) from:
    • Users accessing the dashboard
    • Webhook sources (SIEM, EDR systems)

Monitoring & Observability

Health Checks

  • Health endpoint accessible: GET /health returns component status
  • Readiness probe configured: GET /ready for load balancer
  • Liveness probe configured: GET /live for container orchestration

Metrics & Logging

  • Prometheus metrics exposed: GET /metrics endpoint enabled
  • Log aggregation configured: Logs shipped to central system
  • Alerting rules configured: Alerts for critical failures

Recommended alert rules:

| Alert                      | Condition                        | Severity |
|----------------------------|----------------------------------|----------|
| Service Down               | /health returns unhealthy for 5m | Critical |
| Database Connection Failed | Database component unhealthy     | Critical |
| Kill Switch Active         | Kill switch activated            | Warning  |
| High Error Rate            | >5% HTTP 5xx responses           | Warning  |
| Connector Unhealthy        | Any connector in error state     | Warning  |
| LLM API Errors             | LLM requests failing             | Warning  |

Operational Readiness

Documentation

  • Runbooks available: Team has access to operational runbooks
  • Contact list current: On-call rotation and escalation paths defined
  • Recovery procedures tested: Backup restore verified within last 30 days

Access Control

  • Admin accounts audited: Remove unnecessary admin users
  • API keys audited: Revoke unused or over-privileged keys
  • Audit logging enabled: User actions are logged

Backup & Recovery

  • Database backups automated: Daily backups with 30-day retention
  • Backup encryption enabled: Backups encrypted at rest
  • Recovery time objective defined: Team knows target RTO
  • Recovery procedure documented: Step-by-step restore guide exists

Pre-Launch Testing

Functional Tests

  • User login works with configured auth
  • Incidents can be created via webhook
  • Playbooks execute correctly
  • Connectors authenticate successfully
  • Notifications are delivered

Load Testing

  • Tested with expected concurrent users
  • Tested with expected webhook volume
  • Response times acceptable under load

Failover Testing

  • Application recovers from database restart
  • Application handles LLM API failures gracefully
  • Kill switch stops all automation when activated

Sign-Off

| Role              | Name | Date | Signature |
|-------------------|------|------|-----------|
| Security Review   |      |      |           |
| Operations Review |      |      |           |
| Development Lead  |      |      |           |

Quick Validation Commands

# Check health endpoint
curl -s https://triage.example.com/health | jq

# Verify TLS certificate
openssl s_client -connect triage.example.com:443 -servername triage.example.com

# Test database connectivity (from application)
curl -s https://triage.example.com/health/detailed | jq '.components.database'

# Verify all connectors healthy
curl -s https://triage.example.com/health/detailed | jq '.components.connectors'

Docker Deployment

Deploy Triage Warden using Docker and Docker Compose.

Prerequisites

  • Docker Engine 20.10+
  • Docker Compose v2.0+
  • 2 GB RAM minimum for the basic setup (4 GB+ for HA)
  • 20 GB disk space

Overview

Triage Warden provides three Docker Compose configurations:

| File                   | Purpose           | Use Case                          |
|------------------------|-------------------|-----------------------------------|
| docker-compose.yml     | Basic setup       | Quick start, single instance      |
| docker-compose.dev.yml | Development       | Local development with hot reload |
| docker-compose.ha.yml  | High Availability | HA testing, multi-instance        |

Quick Start

# Clone the repository
git clone https://github.com/your-org/triage-warden.git
cd triage-warden/deploy/docker

# Copy and configure environment
cp .env.example .env

# Generate required secrets
echo "TW_ENCRYPTION_KEY=$(openssl rand -base64 32)" >> .env
echo "TW_JWT_SECRET=$(openssl rand -hex 32)" >> .env
echo "TW_SESSION_SECRET=$(openssl rand -hex 32)" >> .env
echo "POSTGRES_PASSWORD=$(openssl rand -hex 16)" >> .env

# Start services
docker compose up -d

# Check status
docker compose ps
docker compose logs -f triage-warden

Access the dashboard at http://localhost:8080

Default credentials: admin / admin (change immediately!)

Configuration

Environment Variables

Edit .env file with your configuration:

# Database
POSTGRES_USER=triage_warden
POSTGRES_PASSWORD=your-secure-password
POSTGRES_DB=triage_warden
DATABASE_URL=postgres://triage_warden:your-secure-password@postgres:5432/triage_warden

# Application
TW_BIND_ADDRESS=0.0.0.0:8080
TW_BASE_URL=https://triage.example.com
TW_ENCRYPTION_KEY=your-32-byte-base64-key
TW_JWT_SECRET=your-jwt-secret
TW_SESSION_SECRET=your-session-secret

# Logging
RUST_LOG=info

# LLM (optional)
OPENAI_API_KEY=sk-...
# or
ANTHROPIC_API_KEY=sk-ant-...

Production Configuration

For production, use docker-compose.prod.yml:

docker compose -f docker-compose.prod.yml up -d

Key differences from development:

  • Uses external PostgreSQL volume for data persistence
  • Enables health checks
  • Sets resource limits
  • Configures restart policies

Docker Compose Files

Basic Setup (docker-compose.yml)

version: '3.8'

services:
  triage-warden:
    image: ghcr.io/your-org/triage-warden:latest
    ports:
      - "8080:8080"
    environment:
      - DATABASE_URL=${DATABASE_URL}
      - TW_ENCRYPTION_KEY=${TW_ENCRYPTION_KEY}
      - TW_JWT_SECRET=${TW_JWT_SECRET}
      - TW_SESSION_SECRET=${TW_SESSION_SECRET}
      - RUST_LOG=${RUST_LOG:-info}
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 5s
      timeout: 5s
      retries: 5

volumes:
  postgres_data:

Production (docker-compose.prod.yml)

version: '3.8'

services:
  triage-warden:
    image: ghcr.io/your-org/triage-warden:latest
    ports:
      - "8080:8080"
    environment:
      - DATABASE_URL=${DATABASE_URL}
      - TW_ENCRYPTION_KEY=${TW_ENCRYPTION_KEY}
      - TW_JWT_SECRET=${TW_JWT_SECRET}
      - TW_SESSION_SECRET=${TW_SESSION_SECRET}
      - TW_BASE_URL=${TW_BASE_URL}
      - RUST_LOG=${RUST_LOG:-info}
    depends_on:
      postgres:
        condition: service_healthy
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/live"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d:ro
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

volumes:
  postgres_data:
    external: true
    name: triage_warden_postgres

High Availability Testing

The HA configuration runs multiple instances for testing distributed features locally before deploying to Kubernetes.

Architecture

                    ┌─────────────┐
                    │   Traefik   │
                    │   (LB)      │
                    └──────┬──────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼─────┐  ┌──────▼─────┐  ┌──────▼──────┐
    │   API-1    │  │   API-2    │  │   API-N     │
    │  (serve)   │  │  (serve)   │  │  (serve)    │
    └──────┬─────┘  └──────┬─────┘  └──────┬──────┘
           │               │               │
           └───────────────┼───────────────┘
                           │
    ┌──────────────────────┼──────────────────────┐
    │                      │                      │
    ▼                      ▼                      ▼
┌───────────┐       ┌────────────┐        ┌────────────┐
│   Redis   │◄─────►│ PostgreSQL │◄──────►│Orchestrator│
│ (MQ/Cache)│       │    (DB)    │        │ (1 leader) │
└───────────┘       └────────────┘        └────────────┘

Starting HA Stack

# Navigate to deploy directory
cd deploy/docker

# Configure environment
cp .env.example .env
# Edit .env with required values

# Start all services
docker-compose -f docker-compose.ha.yml up -d

# Start with monitoring stack
docker-compose -f docker-compose.ha.yml --profile monitoring up -d

Accessing Services

| Service             | URL                   | Description                       |
|---------------------|-----------------------|-----------------------------------|
| API (Load Balanced) | http://localhost:8080 | Main application endpoint         |
| Traefik Dashboard   | http://localhost:8081 | Load balancer metrics             |
| Prometheus          | http://localhost:9090 | Metrics (with monitoring profile) |
| Grafana             | http://localhost:3000 | Dashboards (admin/admin)          |
| PostgreSQL          | localhost:5432        | Database (for debugging)          |
| Redis               | localhost:6379        | Cache/MQ (for debugging)          |

Verifying HA Behavior

# Check all instances are healthy
curl -s http://localhost:8080/health | jq

# Check load balancing (run multiple times)
for i in {1..10}; do
  curl -s http://localhost:8080/health | jq -r '.instance_id // "unknown"'
done

# Check leader election
curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'

# Simulate failure - stop one API instance
docker stop tw-api-1

# Verify traffic still flows
curl -s http://localhost:8080/health

# Restart the instance
docker start tw-api-1

Testing Orchestrator Failover

# Check which orchestrator is leader
docker exec tw-orchestrator-1 curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'
docker exec tw-orchestrator-2 curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'

# Stop the leader
docker stop tw-orchestrator-1

# Verify failover (second orchestrator becomes leader)
sleep 5
docker exec tw-orchestrator-2 curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'

# Restart original
docker start tw-orchestrator-1

Building the Image

To build the Docker image locally:

# From repository root
docker build -t triage-warden:local -f deploy/docker/Dockerfile .

# Build with no cache
docker-compose -f docker-compose.ha.yml build --no-cache

# Build specific service
docker-compose -f docker-compose.ha.yml build api-1

# Use local image
# In docker-compose.yml, change:
# image: ghcr.io/your-org/triage-warden:latest
# to:
# image: triage-warden:local

Persistent Storage

Volume Management

# List volumes
docker volume ls | grep triage-warden

# Backup PostgreSQL
docker exec tw-postgres pg_dump -U triage triage_warden > backup.sql

# Restore PostgreSQL
cat backup.sql | docker exec -i tw-postgres psql -U triage triage_warden

# Backup Redis
docker exec tw-redis redis-cli BGSAVE
docker cp tw-redis:/data/dump.rdb ./redis-backup.rdb

Cleaning Up

# Stop services
docker-compose -f docker-compose.ha.yml down

# Stop and remove volumes (WARNING: deletes all data)
docker-compose -f docker-compose.ha.yml down -v

# Remove only unused volumes
docker volume prune

Common Operations

View Logs

# All services
docker compose logs -f

# Specific service
docker compose logs -f triage-warden

# Last 100 lines
docker compose logs --tail=100 triage-warden

# With timestamps
docker-compose -f docker-compose.ha.yml logs -f --timestamps

Restart Services

# Restart all
docker compose restart

# Restart specific service
docker compose restart triage-warden

Update to New Version

# Pull new images
docker compose pull

# Recreate containers
docker compose up -d

# Verify update
docker compose ps
curl http://localhost:8080/health | jq '.version'

Database Operations

# Create backup
docker compose exec postgres pg_dump -U triage_warden triage_warden > backup.sql

# Restore backup
docker compose exec -T postgres psql -U triage_warden triage_warden < backup.sql

# Access database shell
docker compose exec postgres psql -U triage_warden triage_warden

Debug Mode

Enable debug logging:

# In .env file
RUST_LOG=debug,triage_warden=trace,tw_api=trace,tw_core=trace
TW_LOG_FORMAT=pretty  # Human-readable format

Inspecting Containers

# Shell access
docker exec -it tw-api-1 /bin/sh

# Check process status
docker exec tw-api-1 ps aux

# Check network connectivity
docker exec tw-api-1 curl -v telnet://postgres:5432
docker exec tw-api-1 curl -v telnet://redis:6379

Resource Limits

The HA configuration includes resource limits suitable for local testing:

| Service      | CPU Limit | Memory Limit |
|--------------|-----------|--------------|
| API          | 1 core    | 512MB        |
| Orchestrator | 1.5 cores | 1GB          |
| PostgreSQL   | 1 core    | 1GB          |
| Redis        | 0.5 core  | 512MB        |
| Traefik      | 0.5 core  | 256MB        |

Adjust in docker-compose.ha.yml under deploy.resources.

TLS Configuration

For production, use a reverse proxy (nginx, Traefik, Caddy) for TLS termination:

With Traefik

# Add to docker-compose.prod.yml
services:
  traefik:
    image: traefik:v2.10
    command:
      - "--providers.docker=true"
      - "--entrypoints.websecure.address=:443"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
    ports:
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - letsencrypt:/letsencrypt

  triage-warden:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.triage.rule=Host(`triage.example.com`)"
      - "traefik.http.routers.triage.entrypoints=websecure"
      - "traefik.http.routers.triage.tls.certresolver=letsencrypt"

volumes:
  letsencrypt:

Troubleshooting

Container Won't Start

# Check logs for errors
docker compose logs triage-warden

# Common issues:
# - DATABASE_URL not set or incorrect
# - TW_ENCRYPTION_KEY missing
# - PostgreSQL not ready (check depends_on health)

Database Connection Failed

# Verify PostgreSQL is running
docker compose ps postgres

# Check PostgreSQL logs
docker compose logs postgres

# Test connection
docker compose exec postgres pg_isready -U triage_warden

# Verify connection from app container
docker exec tw-api-1 curl -v telnet://postgres:5432

Port Conflicts

# Find process using port 8080
lsof -i :8080

# Use different ports
# In docker-compose.ha.yml or via environment:
# - "8090:80" instead of "8080:80"

Container Exits Immediately

# Check exit code and logs
docker-compose -f docker-compose.ha.yml logs api-1

# Common causes:
# - Missing environment variables
# - Database not ready
# - Invalid configuration

Redis Connection Issues

# Test Redis connectivity
docker exec tw-api-1 curl -v telnet://redis:6379

# Check Redis logs
docker-compose -f docker-compose.ha.yml logs redis

# Connect to Redis CLI
docker exec -it tw-redis redis-cli ping

Out of Memory

# Check container memory usage
docker stats

# Increase limits in docker-compose.prod.yml
deploy:
  resources:
    limits:
      memory: 4G  # Increase from 2G

Next Steps

Kubernetes Deployment Guide

This guide covers deploying Triage Warden to Kubernetes, either with the recommended Helm chart or with raw manifests.

Prerequisites

Before deploying, ensure you have:

  • Kubernetes cluster version 1.25 or later
  • kubectl configured with cluster access
  • Helm 3.8+ (see Helm Chart for Helm-based deployment)
  • Container registry access to pull Triage Warden images
  • PostgreSQL database (managed or self-hosted)
  • Redis (optional, required for HA deployments)

Optional Prerequisites

  • Ingress controller (nginx-ingress or Traefik recommended)
  • cert-manager for automatic TLS certificate management
  • Prometheus Operator for metrics and alerting

Quick Start with Helm

1. Add the Helm Repository

# Add the Triage Warden Helm repository
helm repo add triage-warden https://charts.triage-warden.io
helm repo update

2. Create Namespace

kubectl create namespace triage-warden

3. Create Secrets

Generate required secrets before deployment:

# Generate encryption keys
export TW_ENCRYPTION_KEY=$(openssl rand -base64 32)
export TW_JWT_SECRET=$(openssl rand -hex 32)
export TW_SESSION_SECRET=$(openssl rand -hex 32)

# Create Kubernetes secret
kubectl create secret generic triage-warden-secrets \
  --namespace triage-warden \
  --from-literal=TW_ENCRYPTION_KEY="$TW_ENCRYPTION_KEY" \
  --from-literal=TW_JWT_SECRET="$TW_JWT_SECRET" \
  --from-literal=TW_SESSION_SECRET="$TW_SESSION_SECRET" \
  --from-literal=DATABASE_URL="postgres://user:password@postgres:5432/triage_warden"

4. Install Triage Warden

# Basic installation
helm install triage-warden triage-warden/triage-warden \
  --namespace triage-warden \
  --set global.domain=triage.example.com

# Installation with custom values
helm install triage-warden triage-warden/triage-warden \
  --namespace triage-warden \
  --values values-production.yaml

5. Verify Deployment

# Check pod status
kubectl get pods -n triage-warden

# Check service status
kubectl get svc -n triage-warden

# View logs
kubectl logs -n triage-warden -l app.kubernetes.io/name=triage-warden -f

Helm Configuration

Minimal Production Values

Create a values-production.yaml file:

# values-production.yaml
global:
  domain: triage.example.com

api:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

orchestrator:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

postgresql:
  # Use external database
  enabled: false
  external:
    host: postgres.example.com
    port: 5432
    database: triage_warden
    existingSecret: triage-warden-secrets
    existingSecretPasswordKey: DATABASE_PASSWORD

redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: true
    existingSecret: triage-warden-secrets
    existingSecretPasswordKey: REDIS_PASSWORD

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  tls:
    - secretName: triage-warden-tls
      hosts:
        - triage.example.com

monitoring:
  enabled: true
  serviceMonitor:
    enabled: true

Common Configuration Options

| Parameter             | Description                     | Default                             |
|-----------------------|---------------------------------|-------------------------------------|
| api.replicas          | Number of API server replicas   | 2                                   |
| orchestrator.replicas | Number of orchestrator replicas | 2                                   |
| image.repository      | Container image repository      | ghcr.io/triage-warden/triage-warden |
| image.tag             | Container image tag             | latest                              |
| ingress.enabled       | Enable ingress                  | true                                |
| postgresql.enabled    | Deploy PostgreSQL               | true                                |
| redis.enabled         | Deploy Redis                    | true                                |
| monitoring.enabled    | Enable monitoring               | true                                |

Manual Deployment (Without Helm)

If you prefer to use raw Kubernetes manifests:

Architecture

                        ┌─────────────────┐
                        │    Ingress      │
                        │  (TLS + routing)│
                        └────────┬────────┘
                                 │
                ┌────────────────┼────────────────┐
                │                │                │
          ┌─────▼─────┐    ┌─────▼─────┐    ┌─────▼─────┐
          │    Pod    │    │    Pod    │    │    Pod    │
          │  replica  │    │  replica  │    │  replica  │
          └─────┬─────┘    └─────┬─────┘    └─────┬─────┘
                │                │                │
                └────────────────┼────────────────┘
                                 │
                        ┌────────▼────────┐
                        │    Service      │
                        │  (ClusterIP)    │
                        └────────┬────────┘
                                 │
                        ┌────────▼────────┐
                        │   PostgreSQL    │
                        │  (StatefulSet)  │
                        └─────────────────┘

Manifests

Namespace

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: triage-warden
  labels:
    app.kubernetes.io/name: triage-warden

Secret

# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: triage-warden-secrets
  namespace: triage-warden
type: Opaque
stringData:
  # Generate these values securely!
  # encryption-key: $(openssl rand -base64 32)
  # jwt-secret: $(openssl rand -hex 32)
  # session-secret: $(openssl rand -hex 32)
  encryption-key: "REPLACE_WITH_BASE64_32_BYTE_KEY"
  jwt-secret: "REPLACE_WITH_JWT_SECRET"
  session-secret: "REPLACE_WITH_SESSION_SECRET"
  database-url: "postgres://triage_warden:password@postgres-postgresql:5432/triage_warden"

ConfigMap

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: triage-warden-config
  namespace: triage-warden
data:
  RUST_LOG: "info"
  TW_BIND_ADDRESS: "0.0.0.0:8080"
  TW_BASE_URL: "https://triage.example.com"

Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triage-warden
  namespace: triage-warden
  labels:
    app.kubernetes.io/name: triage-warden
    app.kubernetes.io/component: server
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
  template:
    metadata:
      labels:
        app.kubernetes.io/name: triage-warden
    spec:
      serviceAccountName: triage-warden
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: triage-warden
          image: ghcr.io/your-org/triage-warden:latest
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: triage-warden-secrets
                  key: database-url
            - name: TW_ENCRYPTION_KEY
              valueFrom:
                secretKeyRef:
                  name: triage-warden-secrets
                  key: encryption-key
            - name: TW_JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: triage-warden-secrets
                  key: jwt-secret
            - name: TW_SESSION_SECRET
              valueFrom:
                secretKeyRef:
                  name: triage-warden-secrets
                  key: session-secret
          envFrom:
            - configMapRef:
                name: triage-warden-config
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /live
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL

Service

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: triage-warden
  namespace: triage-warden
  labels:
    app.kubernetes.io/name: triage-warden
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: triage-warden

Ingress

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triage-warden
  namespace: triage-warden
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - triage.example.com
      secretName: triage-warden-tls
  rules:
    - host: triage.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: triage-warden
                port:
                  number: 80

ServiceAccount

# serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: triage-warden
  namespace: triage-warden

HorizontalPodAutoscaler

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
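The utilization targets above feed the standard HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A minimal sketch (Python, with hypothetical utilization numbers):

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_r: int = 2, max_r: int = 10) -> int:
    """Approximate the HPA scaling decision:
    desired = ceil(current * currentMetric / targetMetric), clamped to [min, max]."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_r, min(max_r, desired))

# 3 replicas averaging 90% CPU against a 70% target -> scale out to 4
print(desired_replicas(3, 90, 70))  # 4
```

With two metrics configured (CPU and memory), the HPA computes a desired count per metric and takes the largest.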

PodDisruptionBudget

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden

Apply Manifests

kubectl apply -f deploy/kubernetes/namespace.yaml
kubectl apply -f deploy/kubernetes/secret.yaml
kubectl apply -f deploy/kubernetes/configmap.yaml
kubectl apply -f deploy/kubernetes/serviceaccount.yaml
kubectl apply -f deploy/kubernetes/deployment.yaml
kubectl apply -f deploy/kubernetes/service.yaml
kubectl apply -f deploy/kubernetes/ingress.yaml
kubectl apply -f deploy/kubernetes/hpa.yaml
kubectl apply -f deploy/kubernetes/pdb.yaml
kubectl apply -f deploy/kubernetes/servicemonitor.yaml  # requires the Prometheus Operator CRDs

High Availability Configuration

For production HA deployments:

API Server HA

The API servers are stateless and can be scaled horizontally:

api:
  replicas: 3
  podAntiAffinity:
    enabled: true
    topologyKey: kubernetes.io/hostname
  topologySpreadConstraints:
    enabled: true
    maxSkew: 1

Orchestrator HA

Orchestrators use leader election to coordinate singleton tasks:

orchestrator:
  replicas: 2
  leaderElection:
    enabled: true
    leaseDuration: 15s
    renewDeadline: 10s
    retryPeriod: 2s

Pod Disruption Budget

Ensure availability during updates:

podDisruptionBudget:
  enabled: true
  minAvailable: 1

Database Setup

Using Helm (PostgreSQL)

# Add Bitnami repo
helm repo add bitnami https://charts.bitnami.com/bitnami

# Install PostgreSQL
helm install postgres bitnami/postgresql \
  --namespace triage-warden \
  --set auth.username=triage_warden \
  --set auth.password=your-secure-password \
  --set auth.database=triage_warden \
  --set primary.persistence.size=20Gi

Using External Database

Update the secret with your external database URL:

kubectl create secret generic triage-warden-secrets \
  --namespace triage-warden \
  --from-literal=database-url="postgres://user:pass@db.example.com:5432/triage_warden?sslmode=require"
# ...plus the remaining keys (encryption-key, jwt-secret, session-secret)

Monitoring

ServiceMonitor (Prometheus)

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

PrometheusRule (Alerts)

# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  groups:
    - name: triage-warden
      rules:
        - alert: TriageWardenDown
          expr: up{job="triage-warden"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Triage Warden is down"
            description: "Triage Warden has been down for more than 5 minutes."

        - alert: TriageWardenHighErrorRate
          expr: rate(http_requests_total{job="triage-warden",status=~"5.."}[5m]) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate in Triage Warden"

Upgrading

Helm Upgrade

# Check current version
helm list -n triage-warden

# Upgrade to new version
helm upgrade triage-warden triage-warden/triage-warden \
  --namespace triage-warden \
  --values values-production.yaml \
  --set image.tag=v1.1.0

# Monitor the rollout
kubectl rollout status deployment/triage-warden-api -n triage-warden

Rollback

# View release history
helm history triage-warden -n triage-warden

# Rollback to previous version
helm rollback triage-warden 1 -n triage-warden

Database Migrations

Triage Warden automatically runs database migrations on startup. For manual control:

# Run migrations manually
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  triage-warden migrate

# Check migration status
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  triage-warden migrate --status

TLS Configuration

Using cert-manager

ingress:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  tls:
    - secretName: triage-warden-tls
      hosts:
        - triage.example.com

Manual TLS Secret

kubectl create secret tls triage-warden-tls \
  --namespace triage-warden \
  --cert=tls.crt \
  --key=tls.key

Security Hardening

Network Policy

# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: postgresql
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
    - to:  # External APIs (LLM, connectors)
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443

Troubleshooting

Pod Not Starting

# Check pod events
kubectl describe pod -n triage-warden -l app.kubernetes.io/name=triage-warden

# Check logs
kubectl logs -n triage-warden -l app.kubernetes.io/name=triage-warden --previous

# Common issues:
# - ImagePullBackOff: Check image name and registry credentials
# - CrashLoopBackOff: Check logs for startup errors
# - Pending: Check resource requests and node capacity

Database Connection Issues

# Test database connectivity from a pod
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  curl -v telnet://postgres-postgresql:5432

# Check the database URL (the secret key is database-url)
kubectl get secret triage-warden-secrets -n triage-warden -o jsonpath='{.data.database-url}' | base64 -d

Health Check Failures

# Check liveness endpoint
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  curl -s http://localhost:8080/live

# Check readiness endpoint
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  curl -s http://localhost:8080/ready

# Check detailed health
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  curl -s http://localhost:8080/health/detailed | jq

Leader Election Issues

# Check which instance is the leader (pod name assumes a StatefulSet-style suffix)
kubectl exec -it triage-warden-orchestrator-0 -n triage-warden -- \
  curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'

# Check leader leases in Redis (SCAN avoids blocking the server the way KEYS can)
kubectl exec -it triage-warden-redis-0 -n triage-warden -- \
  redis-cli --scan --pattern "tw:leader:*"

Performance Issues

# Check resource usage
kubectl top pods -n triage-warden

# Check HPA status
kubectl get hpa -n triage-warden

# View Prometheus metrics
kubectl port-forward svc/prometheus -n monitoring 9090:9090

Ingress Not Working

# Check ingress
kubectl describe ingress triage-warden -n triage-warden

# Check TLS secret
kubectl get secret triage-warden-tls -n triage-warden

# Check ingress controller logs
kubectl logs -l app.kubernetes.io/name=ingress-nginx -n ingress-nginx

Operations

View Logs

# All pods
kubectl logs -l app.kubernetes.io/name=triage-warden -n triage-warden -f

# Specific pod
kubectl logs -f deployment/triage-warden -n triage-warden

# Previous container (after crash)
kubectl logs deployment/triage-warden -n triage-warden --previous

Scale Deployment

# Manual scale
kubectl scale deployment triage-warden -n triage-warden --replicas=5

# Check HPA status
kubectl get hpa -n triage-warden

Rolling Update

# Update image
kubectl set image deployment/triage-warden \
  triage-warden=ghcr.io/your-org/triage-warden:v1.2.0 \
  -n triage-warden

# Watch rollout
kubectl rollout status deployment/triage-warden -n triage-warden

# Rollback if needed
kubectl rollout undo deployment/triage-warden -n triage-warden

Uninstalling

Helm Uninstall

# Uninstall Triage Warden
helm uninstall triage-warden -n triage-warden

# Delete namespace (optional, removes all resources)
kubectl delete namespace triage-warden

# Delete PVCs if needed
kubectl delete pvc -n triage-warden --all

Helm Chart Deployment

Deploy Triage Warden to Kubernetes using the bundled Helm chart. This is the recommended approach for Kubernetes deployments, providing templated manifests with environment-specific value overrides.

The chart lives at deploy/helm/ in the repository.

Prerequisites

  • Kubernetes 1.25+
  • Helm 3.8+
  • External PostgreSQL database (required)
  • External Redis (optional, required for HA deployments)
  • Ingress controller (nginx recommended)
  • cert-manager (for automatic TLS)
  • Prometheus Operator (for monitoring)

Quick Start

Development

# Create a values file
cat > my-values.yaml << EOF
postgresql:
  host: "postgres.default.svc.cluster.local"
  port: 5432
  database: "triage_warden"
  username: "triage"
  password: "your-password"

secrets:
  encryptionKey: "$(openssl rand -base64 32)"
  jwtSecret: "$(openssl rand -hex 32)"
  sessionSecret: "$(openssl rand -hex 32)"

config:
  enableSwagger: true
  secureCookies: false
EOF

# Install
helm install triage-warden ./deploy/helm -f my-values.yaml

Production

# Create namespace
kubectl create namespace triage-warden

# Create secrets externally (recommended)
kubectl create secret generic triage-warden-secrets \
  --namespace triage-warden \
  --from-literal=TW_ENCRYPTION_KEY="$(openssl rand -base64 32)" \
  --from-literal=TW_JWT_SECRET="$(openssl rand -hex 32)" \
  --from-literal=TW_SESSION_SECRET="$(openssl rand -hex 32)"

kubectl create secret generic postgresql-credentials \
  --namespace triage-warden \
  --from-literal=postgresql-password="your-db-password"

# Install with production values
helm install triage-warden ./deploy/helm \
  --namespace triage-warden \
  -f deploy/helm/values-prod.yaml

Value Files

The chart ships with pre-built value files for common scenarios:

| File | Purpose |
|---|---|
| values.yaml | Defaults (base for all environments) |
| values-dev.yaml | Single-instance development (debug logging, no TLS) |
| values-prod.yaml | Multi-instance production (3 API replicas, TLS, monitoring) |
| values-ha.yaml | Maximum availability (5+ replicas, zone spreading, strict anti-affinity) |

Override with -f:

helm install triage-warden ./deploy/helm \
  --namespace triage-warden \
  -f deploy/helm/values-prod.yaml \
  -f my-secrets.yaml

Key Parameters

Application

| Parameter | Description | Default |
|---|---|---|
| api.replicas | API server replicas | 2 |
| api.resources.requests.cpu | CPU request | 100m |
| api.resources.requests.memory | Memory request | 256Mi |
| orchestrator.replicas | Orchestrator replicas | 1 |
| config.logLevel | Log level | info |
| config.enableSwagger | Enable Swagger UI | false |

Database

| Parameter | Description | Default |
|---|---|---|
| postgresql.host | PostgreSQL host (required) | "" |
| postgresql.port | PostgreSQL port | 5432 |
| postgresql.database | Database name | triage_warden |
| postgresql.existingSecret | Existing secret with password | "" |
| postgresql.sslMode | SSL mode | require |

Networking

| Parameter | Description | Default |
|---|---|---|
| ingress.enabled | Enable ingress | false |
| ingress.className | Ingress class name | nginx |
| networkPolicy.enabled | Enable network policies | false |

Scaling & HA

| Parameter | Description | Default |
|---|---|---|
| autoscaling.enabled | Enable HPA | false |
| autoscaling.minReplicas | Minimum replicas | 2 |
| autoscaling.maxReplicas | Maximum replicas | 10 |
| podDisruptionBudget.enabled | Enable PDB | false |

Monitoring

| Parameter | Description | Default |
|---|---|---|
| serviceMonitor.enabled | Enable ServiceMonitor | false |
| prometheusRules.enabled | Enable alerting rules | false |

See deploy/helm/values.yaml for the complete list.

Components

The chart deploys two main components:

  • API Server (deployment-api.yaml) - Handles HTTP requests, webhooks, and the web UI
  • Orchestrator (deployment-orchestrator.yaml) - Manages background tasks, scheduling, and automation

Supporting resources: ServiceAccount, ConfigMap, Secret, Service, Ingress, HPA, PDB, NetworkPolicy, ServiceMonitor, PrometheusRule.

External Secrets

For production, use an external secrets manager instead of storing secrets in values files:

secrets:
  create: false
  existingSecret: "triage-warden-secrets"

Any tool that can create the referenced Kubernetes Secret out-of-band (for example, an external secrets operator or your CI pipeline) is compatible with this setting.

Upgrading

helm upgrade triage-warden ./deploy/helm \
  --namespace triage-warden \
  -f deploy/helm/values-prod.yaml

# Monitor rollout
kubectl rollout status deployment/triage-warden-api -n triage-warden

Rollback

helm history triage-warden -n triage-warden
helm rollback triage-warden 1 -n triage-warden

Uninstalling

helm uninstall triage-warden -n triage-warden
kubectl delete namespace triage-warden

Alerts

When prometheusRules.enabled: true, the chart installs these alerts:

  • TriageWardenDown - Instance unreachable for 2+ minutes
  • TriageWardenHighErrorRate - 5xx errors exceed 5%
  • TriageWardenKillSwitchActive - Kill switch activated
  • TriageWardenDatabaseUnhealthy - Database connection issues
  • TriageWardenHighLatency - P99 latency above 1 second
  • TriageWardenConnectorUnhealthy - Connector health issues

The HA values file (values-ha.yaml) adds zone-balance and replica-mismatch alerts.

Configuration Reference

This document provides a comprehensive reference for all Triage Warden configuration options.

Configuration Methods

Triage Warden can be configured through:

  1. Environment variables (recommended for production)
  2. Configuration file (config/default.yaml)
  3. Command-line arguments (for specific settings)

Environment variables take precedence over configuration file values.

Environment Variables

Security Settings (Required)

| Variable | Description | Example |
|---|---|---|
| TW_ENCRYPTION_KEY | 32-byte base64 key for encrypting credentials stored in database | `openssl rand -base64 32` |
| TW_JWT_SECRET | Secret for signing JWT tokens (min 32 chars) | `openssl rand -hex 32` |
| TW_SESSION_SECRET | Secret for signing session cookies (min 32 chars) | `openssl rand -hex 32` |

Warning: These secrets must be consistent across all instances in a cluster. Changing them will invalidate existing sessions and encrypted data.

Database Configuration

| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | postgres://user:pass@host:5432/db |
| DATABASE_MAX_CONNECTIONS | Maximum connection pool size | 25 |
| DATABASE_MIN_CONNECTIONS | Minimum connection pool size | 5 |
| DATABASE_CONNECT_TIMEOUT | Connection timeout in seconds | 30 |
| DATABASE_IDLE_TIMEOUT | Idle connection timeout in seconds | 600 |
| DATABASE_MAX_LIFETIME | Maximum connection lifetime in seconds | 1800 |
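One sizing constraint worth checking: every replica opens its own pool, so replicas × DATABASE_MAX_CONNECTIONS must stay below PostgreSQL's max_connections, with headroom for superuser and maintenance sessions. A quick sanity-check sketch, with assumed numbers:

```python
def pool_fits(api_replicas: int, pool_max: int, pg_max_connections: int,
              reserved: int = 10) -> bool:
    """Check that all replicas together cannot exhaust PostgreSQL's
    max_connections (minus headroom for superuser/maintenance sessions)."""
    return api_replicas * pool_max <= pg_max_connections - reserved

# 3 API replicas x 25 connections = 75, fits a default max_connections of 100
print(pool_fits(3, 25, 100))  # True
print(pool_fits(5, 25, 100))  # False
```

If the check fails, either lower DATABASE_MAX_CONNECTIONS or put PgBouncer in front of the database.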

Connection String Format:

postgres://username:password@hostname:port/database?sslmode=require

SSL modes: disable, allow, prefer, require, verify-ca, verify-full
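If you need to inspect a connection string programmatically, Python's standard urlsplit handles this URL shape; the credentials and host below are placeholders:

```python
from urllib.parse import urlsplit, parse_qs

url = "postgres://triage:s3cret@db.example.com:5432/triage_warden?sslmode=require"
parts = urlsplit(url)
query = parse_qs(parts.query)

print(parts.username)          # triage
print(parts.hostname)          # db.example.com
print(parts.port)              # 5432
print(parts.path.lstrip("/"))  # triage_warden
print(query["sslmode"][0])     # require
```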

Redis Configuration

Redis is required for HA deployments (message queue, cache, leader election).

| Variable | Description | Default |
|---|---|---|
| REDIS_URL | Redis connection URL | redis://localhost:6379 |
| TW_MESSAGE_QUEUE_ENABLED | Enable Redis-based message queue | false |
| TW_CACHE_ENABLED | Enable Redis-based cache | false |
| TW_LEADER_ELECTION_ENABLED | Enable Redis-based leader election | false |
| TW_CACHE_TTL_SECONDS | Default cache TTL | 3600 |
| TW_CACHE_MAX_SIZE | Maximum cache entries | 10000 |

Connection URL Formats:

redis://localhost:6379
redis://:password@localhost:6379
redis://localhost:6379/0
rediss://localhost:6379  # TLS

Server Configuration

| Variable | Description | Default |
|---|---|---|
| TW_BIND_ADDRESS | Address and port to bind | 0.0.0.0:8080 |
| TW_BASE_URL | Public URL for the application | http://localhost:8080 |
| TW_ENV | Environment: development, production | development |
| TW_TRUSTED_PROXIES | CIDR ranges for trusted reverse proxies | (empty) |
| TW_REQUEST_BODY_LIMIT | Max request body size in bytes | 10485760 (10MB) |
| TW_REQUEST_TIMEOUT | Request timeout in seconds | 30 |

Instance Configuration

| Variable | Description | Default |
|---|---|---|
| TW_INSTANCE_ID | Unique identifier for this instance | Auto-generated |
| TW_INSTANCE_TYPE | Instance type: api, orchestrator, combined | combined |

Authentication & Sessions

| Variable | Description | Default |
|---|---|---|
| TW_COOKIE_SECURE | Require HTTPS for cookies | true in production |
| TW_COOKIE_SAME_SITE | SameSite policy: strict, lax, none | strict |
| TW_SESSION_EXPIRY_SECONDS | Session duration | 86400 (24 hours) |
| TW_CSRF_ENABLED | Enable CSRF protection | true |
| TW_ADMIN_PASSWORD | Initial admin password (first run only) | Auto-generated |

CORS Configuration

| Variable | Description | Default |
|---|---|---|
| TW_CORS_ALLOWED_ORIGINS | Allowed origins (comma-separated) | Same origin only |
| TW_CORS_ALLOW_CREDENTIALS | Allow credentials in CORS requests | true |
| TW_CORS_MAX_AGE | Preflight cache duration in seconds | 3600 |

LLM Configuration

| Variable | Description | Default |
|---|---|---|
| TW_LLM_PROVIDER | LLM provider: anthropic, openai, azure, local | anthropic |
| TW_LLM_MODEL | Model identifier | claude-3-sonnet-20240229 |
| TW_LLM_TEMPERATURE | Generation temperature (0.0-2.0) | 0.2 |
| TW_LLM_MAX_TOKENS | Maximum response tokens | 4096 |
| TW_LLM_TIMEOUT_SECONDS | API call timeout | 60 |
| TW_LLM_RETRY_ATTEMPTS | Number of retry attempts | 3 |
| TW_LLM_RETRY_DELAY_MS | Delay between retries | 1000 |
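The retry settings describe a simple bounded-retry loop: up to TW_LLM_RETRY_ATTEMPTS tries with TW_LLM_RETRY_DELAY_MS between them. A sketch of that behavior (the actual client may also back off or distinguish retryable errors):

```python
import time

def call_with_retries(call, attempts: int = 3, delay_ms: int = 1000):
    """Retry a flaky call: up to `attempts` tries with a fixed delay
    between them, re-raising the last error if all tries fail."""
    last_err = None
    for i in range(attempts):
        try:
            return call()
        except Exception as err:  # a real client would catch its specific error type
            last_err = err
            if i < attempts - 1:
                time.sleep(delay_ms / 1000)
    raise last_err

# Simulate an API that fails twice, then succeeds
calls = iter([RuntimeError("overloaded"), RuntimeError("overloaded"), "verdict"])

def flaky():
    item = next(calls)
    if isinstance(item, Exception):
        raise item
    return item

print(call_with_retries(flaky, attempts=3, delay_ms=10))  # verdict
```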

Provider-specific API Keys:

| Variable | Provider |
|---|---|
| ANTHROPIC_API_KEY | Anthropic Claude |
| OPENAI_API_KEY | OpenAI GPT |
| AZURE_OPENAI_API_KEY | Azure OpenAI |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint URL |

Orchestrator Configuration

| Variable | Description | Default |
|---|---|---|
| TW_OPERATION_MODE | Mode: supervised, assisted, autonomous | supervised |
| TW_AUTO_APPROVE_LOW_RISK | Auto-approve low-risk actions | false |
| TW_MAX_CONCURRENT_INCIDENTS | Max concurrent incident processing | 100 |
| TW_ENRICHMENT_TIMEOUT_SECONDS | Enrichment step timeout | 60 |
| TW_ANALYSIS_TIMEOUT_SECONDS | AI analysis timeout | 120 |
| TW_ACTION_TIMEOUT_SECONDS | Action execution timeout | 300 |

Logging Configuration

| Variable | Description | Default |
|---|---|---|
| RUST_LOG | Log level filter | info |
| TW_LOG_FORMAT | Format: json, pretty | json in production |
| TW_LOG_INCLUDE_LOCATION | Include file/line in logs | false |

Log Level Examples:

# Basic level
RUST_LOG=info

# Per-module levels
RUST_LOG=info,triage_warden=debug,tw_api=trace

# All debug
RUST_LOG=debug

Metrics Configuration

| Variable | Description | Default |
|---|---|---|
| TW_METRICS_ENABLED | Enable Prometheus metrics | true |
| TW_METRICS_PATH | Metrics endpoint path | /metrics |
| TW_METRICS_INCLUDE_LABELS | Include additional labels | true |

Rate Limiting

| Variable | Description | Default |
|---|---|---|
| TW_RATE_LIMIT_ENABLED | Enable rate limiting | true |
| TW_RATE_LIMIT_REQUESTS | Requests per window | 200 |
| TW_RATE_LIMIT_WINDOW | Window duration (e.g., 1m, 1h) | 1m |
| TW_RATE_LIMIT_BURST | Burst allowance | 50 |
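These settings suggest token-bucket-style limiting: a steady refill of TW_RATE_LIMIT_REQUESTS per TW_RATE_LIMIT_WINDOW, with TW_RATE_LIMIT_BURST of headroom. The server's exact algorithm isn't documented here; this sketch just illustrates how rate and burst interact:

```python
import time

class TokenBucket:
    """Token-bucket sketch: tokens refill at rate_per_window / window_s
    per second, capped at the burst capacity; each request costs one token."""
    def __init__(self, rate_per_window: int, window_s: float, burst: int):
        self.rate = rate_per_window / window_s   # tokens per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_window=200, window_s=60, burst=50)
# The first 50 back-to-back requests ride the burst allowance...
print(all(bucket.allow() for _ in range(50)))  # True
# ...the 51st is throttled until tokens refill at ~3.3/s.
print(bucket.allow())  # False
```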

Feature Flags

| Variable | Description | Default |
|---|---|---|
| TW_FEATURE_PLAYBOOKS | Enable playbook automation | true |
| TW_FEATURE_AUTO_ENRICH | Enable automatic enrichment | true |
| TW_FEATURE_API_KEYS | Enable API key authentication | true |
| TW_FEATURE_MULTI_TENANT | Enable multi-tenancy | false |
| TW_ENABLE_SWAGGER | Enable Swagger UI | true in dev |

Webhook Configuration

| Variable | Description | Default |
|---|---|---|
| TW_WEBHOOK_SECRET | Default webhook signature secret | (empty) |
| TW_WEBHOOK_TIMEOUT_SECONDS | Webhook delivery timeout | 30 |
| TW_WEBHOOK_RETRY_ATTEMPTS | Delivery retry attempts | 3 |

Source-specific webhook secrets:

| Variable | Source |
|---|---|
| TW_WEBHOOK_SPLUNK_SECRET | Splunk HEC |
| TW_WEBHOOK_CROWDSTRIKE_SECRET | CrowdStrike |
| TW_WEBHOOK_SENTINEL_SECRET | Microsoft Sentinel |
| TW_WEBHOOK_GITHUB_SECRET | GitHub (for DevSecOps) |
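Webhook signature secrets are typically used to verify an HMAC over the request body. The header name and digest scheme vary by source, so treat this as the general pattern rather than Triage Warden's exact wire format:

```python
import hmac, hashlib

def verify_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Verify an HMAC-SHA256 signature over the raw request body,
    using a constant-time comparison to avoid timing leaks."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

secret = "whsec_example"            # hypothetical TW_WEBHOOK_SECRET value
body = b'{"alert": "phishing"}'
sig = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

print(verify_signature(secret, body, sig))       # True
print(verify_signature(secret, body, "0" * 64))  # False
```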

Configuration File

Configuration can also be provided via a YAML file.

File Locations

Triage Warden searches for configuration in order:

  1. Path specified by --config flag
  2. $HOME/.config/triage-warden/config.yaml
  3. /etc/triage-warden/config.yaml
  4. ./config/default.yaml

Example Configuration File

# config/default.yaml

# Server configuration
server:
  bind_address: "0.0.0.0:8080"
  base_url: "https://triage.example.com"
  trusted_proxies:
    - "10.0.0.0/8"
    - "172.16.0.0/12"

# Database configuration
database:
  url: "postgres://triage:password@localhost:5432/triage_warden"
  max_connections: 25
  min_connections: 5
  connect_timeout: 30

# Redis configuration (for HA)
redis:
  url: "redis://localhost:6379"
  message_queue:
    enabled: true
  cache:
    enabled: true
    ttl_seconds: 3600
  leader_election:
    enabled: true

# LLM configuration
llm:
  provider: anthropic
  model: claude-3-sonnet-20240229
  temperature: 0.2
  max_tokens: 4096
  # API key should be set via environment variable

# Orchestrator settings
orchestrator:
  operation_mode: supervised
  auto_approve_low_risk: false
  max_concurrent_incidents: 100
  timeouts:
    enrichment: 60
    analysis: 120
    action: 300

# Logging
logging:
  level: info
  format: json

# Metrics
metrics:
  enabled: true
  path: /metrics

# Rate limiting
rate_limit:
  enabled: true
  requests_per_minute: 200
  burst: 50

# Feature flags
features:
  playbooks: true
  auto_enrich: true
  api_keys: true
  multi_tenant: false

# Connectors
connectors:
  crowdstrike:
    enabled: true
    type: edr
    base_url: "https://api.crowdstrike.com"
    # Credentials via environment or secrets

  splunk:
    enabled: true
    type: siem
    base_url: "https://splunk.example.com:8089"

Precedence

Configuration is loaded in this order (later overrides earlier):

  1. Default values (built into application)
  2. Configuration file (config/default.yaml)
  3. Environment-specific file (config/{TW_ENV}.yaml)
  4. Environment variables
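That layering is an ordinary "later layer wins" merge that recurses into nested tables. A sketch with hypothetical logging settings:

```python
def merge(base: dict, override: dict) -> dict:
    """Merge two config layers: later layers win, and nested
    tables are merged key-by-key instead of replaced wholesale."""
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], val)
        else:
            out[key] = val
    return out

defaults = {"logging": {"level": "info", "format": "json"}}
file_cfg = {"logging": {"level": "debug"}}
env_vars = {"logging": {"format": "pretty"}}

# Apply layers in precedence order: defaults < file < environment variables
resolved = merge(merge(defaults, file_cfg), env_vars)
print(resolved)  # {'logging': {'level': 'debug', 'format': 'pretty'}}
```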

Generating Secrets

Encryption Key (32 bytes, base64)

# macOS/Linux
openssl rand -base64 32

# Alternative using /dev/urandom
head -c 32 /dev/urandom | base64

JWT/Session Secrets

# Hex-encoded secret
openssl rand -hex 32

# Or use a password generator
pwgen -s 64 1

Database URL Format

PostgreSQL

postgres://username:password@hostname:port/database?sslmode=require

Options:

  • sslmode=disable - No SSL (development only)
  • sslmode=require - Require SSL, don't verify certificate
  • sslmode=verify-ca - Require SSL, verify CA
  • sslmode=verify-full - Require SSL, verify CA and hostname

Connection Pooling (PgBouncer)

postgres://username:password@pgbouncer:6432/database?sslmode=require

Operation Modes

Triage Warden supports three operation modes:

Supervised Mode (Default)

All actions require human approval:

TW_OPERATION_MODE=supervised
TW_AUTO_APPROVE_LOW_RISK=false

Assisted Mode

Low-risk actions are auto-approved, high-risk require approval:

TW_OPERATION_MODE=assisted
TW_AUTO_APPROVE_LOW_RISK=true

Autonomous Mode

All actions within guardrails are auto-executed:

TW_OPERATION_MODE=autonomous

Warning: Autonomous mode should only be enabled after thorough testing and with appropriate guardrails configured.

Health Check Endpoints

| Endpoint | Purpose | Response |
|---|---|---|
| /health | Basic health status | {"status": "healthy", ...} |
| /health/detailed | Full component status | Includes all components |
| /live | Liveness probe (Kubernetes) | 200 OK |
| /ready | Readiness probe (Kubernetes) | 200 OK or 503 |

Health Status Values

| Status | Description |
|---|---|
| healthy | All components operational |
| degraded | Some non-critical components failing |
| unhealthy | Critical components failing |
| halted | Kill switch activated |
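A sketch of how these statuses could be derived from component health; which components count as critical is an assumption here (the database is, connectors are not):

```python
def overall_status(kill_switch: bool, components: dict) -> str:
    """Map component health (name -> bool) to the documented statuses.
    The kill switch overrides everything; critical failures are
    'unhealthy'; any other failure is 'degraded'."""
    critical = {"database"}  # assumed set of critical components
    if kill_switch:
        return "halted"
    failing = {name for name, healthy in components.items() if not healthy}
    if failing & critical:
        return "unhealthy"
    if failing:
        return "degraded"
    return "healthy"

print(overall_status(False, {"database": True, "connector_jira": False}))  # degraded
print(overall_status(False, {"database": False}))                          # unhealthy
print(overall_status(True,  {"database": True}))                           # halted
```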

Security Best Practices

  1. Never commit secrets to version control
  2. Use different secrets for each environment
  3. Rotate secrets periodically
  4. Enable TLS in production (TW_COOKIE_SECURE=true)
  5. Restrict trusted proxies to known IP ranges
  6. Enable rate limiting in production
  7. Use read-only database users where possible

Environment-Specific Recommendations

Development

TW_ENV=development
TW_LOG_FORMAT=pretty
RUST_LOG=debug,triage_warden=trace
TW_COOKIE_SECURE=false
TW_ENABLE_SWAGGER=true

Staging

TW_ENV=production
TW_LOG_FORMAT=json
RUST_LOG=info,triage_warden=debug
TW_COOKIE_SECURE=true
TW_ENABLE_SWAGGER=true

Production

TW_ENV=production
TW_LOG_FORMAT=json
RUST_LOG=info
TW_COOKIE_SECURE=true
TW_ENABLE_SWAGGER=false
TW_METRICS_ENABLED=true
TW_RATE_LIMIT_ENABLED=true

High-Availability

DATABASE_URL=postgres://tw_user:pass@pgbouncer:6432/triage_warden?sslmode=require
DATABASE_MAX_CONNECTIONS=50
TW_TRUSTED_PROXIES=10.0.0.0/8
TW_METRICS_ENABLED=true
TW_TRACING_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317

Operations Guide

Operational procedures and runbooks for Triage Warden.

Runbooks

Quick Reference

Health Check Endpoints

| Endpoint | Purpose | Expected Response |
|---|---|---|
| GET /live | Liveness probe | 200 OK |
| GET /ready | Readiness probe | 200 OK if ready, 503 if not |
| GET /health | Basic health | JSON with status |
| GET /health/detailed | Full component health | JSON with all components |

Key Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| http_requests_total | Total HTTP requests | N/A |
| http_request_duration_seconds | Request latency | p99 > 1s |
| http_requests_in_flight | Concurrent requests | > 100 |
| db_pool_connections_active | Active DB connections | > 80% of max |
| incidents_total | Total incidents processed | N/A |
| actions_executed_total | Total actions executed | N/A |

Emergency Contacts

| Role | Contact | Escalation |
|---|---|---|
| On-call Engineer | PagerDuty | Auto-escalates after 15m |
| Security Lead | [email protected] | Critical security issues |
| Database Admin | [email protected] | Database emergencies |

Common Commands

Docker

# View logs
docker compose logs -f triage-warden

# Restart service
docker compose restart triage-warden

# Check health
curl http://localhost:8080/health | jq

# Database backup
docker compose exec postgres pg_dump -U triage_warden > backup.sql

Kubernetes

# View logs
kubectl logs -f deployment/triage-warden -n triage-warden

# Restart pods
kubectl rollout restart deployment/triage-warden -n triage-warden

# Check health
kubectl exec -it deployment/triage-warden -n triage-warden -- curl -s localhost:8080/health | jq

# Scale up/down
kubectl scale deployment triage-warden -n triage-warden --replicas=5

Database

# Connect to PostgreSQL
psql $DATABASE_URL

# Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'triage_warden';

# Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

Service Dependencies

      ┌──────────────────┐
      │  Triage Warden   │
      └────────┬─────────┘
               │
    ┌──────────┼──────────┬──────────┐
    │          │          │          │
    ▼          ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│Postgres│ │  LLM   │ │Connec- │ │Notifi- │
│   DB   │ │  API   │ │ tors   │ │cations │
└────────┘ └────────┘ └────────┘ └────────┘

Dependency Health Impact

| Dependency | If Unavailable |
|---|---|
| PostgreSQL | Service fails readiness, no data access |
| LLM API | AI analysis disabled, manual triage only |
| Connectors | Specific integrations fail, core works |
| Notifications | Alerts not delivered, incidents still process |

Scheduled Tasks

| Task | Schedule | Description |
|---|---|---|
| Database backup | Daily 2:00 AM | Full PostgreSQL backup |
| Connector health check | Every 5 minutes | Verify connector connectivity |
| Incident cleanup | Weekly Sunday 3:00 AM | Archive old incidents |
| Log rotation | Daily | Rotate and compress logs |
| Certificate renewal | 30 days before expiry | Renew TLS certificates |

Monitoring Guide

This guide covers monitoring, metrics, and alerting for Triage Warden deployments.

Overview

Triage Warden exposes metrics in Prometheus format and supports integration with common observability stacks.

┌─────────────────────────────────────────────────────────────┐
│                    Monitoring Stack                          │
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Prometheus  │───▶│   Grafana    │    │ Alertmanager │  │
│  │  (scraping)  │    │ (dashboards) │    │  (alerts)    │  │
│  └──────┬───────┘    └──────────────┘    └──────────────┘  │
│         │                                                    │
└─────────┼────────────────────────────────────────────────────┘
          │
          │ /metrics
          │
┌─────────▼────────────────────────────────────────────────────┐
│                    Triage Warden                              │
│  ┌───────────┐  ┌───────────┐  ┌─────────────┐              │
│  │ API-1     │  │ API-2     │  │Orchestrator │              │
│  │ :8080     │  │ :8080     │  │    :8080    │              │
│  └───────────┘  └───────────┘  └─────────────┘              │
└──────────────────────────────────────────────────────────────┘

Metrics Endpoints

| Endpoint | Format | Description |
|---|---|---|
| /metrics | Prometheus | Prometheus-compatible metrics |
| /api/metrics | JSON | Dashboard-friendly JSON format |
| /health | JSON | Basic health status |
| /health/detailed | JSON | Comprehensive health including components |

Available Metrics

HTTP Metrics

# Request counter by method, path, status
http_requests_total{method="GET", path="/api/incidents", status="200"} 1234

# Request duration histogram
http_request_duration_seconds_bucket{method="GET", path="/api/incidents", le="0.1"} 900
http_request_duration_seconds_bucket{method="GET", path="/api/incidents", le="0.5"} 1100
http_request_duration_seconds_bucket{method="GET", path="/api/incidents", le="1.0"} 1200

# Active connections
http_connections_active 42
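The `_bucket` series are cumulative counts, which is what lets PromQL's histogram_quantile() estimate latency percentiles by linear interpolation. The same arithmetic applied to the sample buckets above:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets by
    linear interpolation, in the spirit of PromQL's histogram_quantile()."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# Cumulative counts from the sample above: <=0.1s: 900, <=0.5s: 1100, <=1.0s: 1200
buckets = [(0.1, 900), (0.5, 1100), (1.0, 1200)]
print(round(histogram_quantile(0.99, buckets), 3))  # 0.94
```

So a p99 of 0.94s is already close to the 1s alert threshold in the sample data.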

Incident Metrics

# Total incidents by severity and status
triage_warden_incidents_total{severity="critical", status="new"} 5
triage_warden_incidents_total{severity="high", status="resolved"} 128

# Incidents currently being processed
triage_warden_incidents_in_progress 12

# Triage duration histogram
triage_warden_triage_duration_seconds_bucket{le="60"} 500
triage_warden_triage_duration_seconds_bucket{le="300"} 800

Action Metrics

# Actions by type and status
triage_warden_actions_total{action_type="isolate_host", status="success"} 45
triage_warden_actions_total{action_type="isolate_host", status="failed"} 2

# Pending approvals
triage_warden_actions_pending_approval 8

# Action execution duration
triage_warden_action_duration_seconds_bucket{action_type="isolate_host", le="30"} 40

System Metrics

# Kill switch status
kill_switch_active 0

# Component health (1=healthy, 0=unhealthy)
component_healthy{component="database"} 1
component_healthy{component="redis"} 1
component_healthy{component="connector_crowdstrike"} 1

# Database connection pool
db_pool_connections_total 25
db_pool_connections_idle 20
db_pool_connections_waiting 0

# Cache statistics
cache_hits_total 10000
cache_misses_total 500
cache_size 2500
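From the cache counters above, the hit ratio is hits / (hits + misses). Using the sample values as an illustration:

```shell
# Cache hit ratio from the sample counters above
hits=10000
misses=500
ratio=$(awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.2f", h / (h + m) * 100 }')
echo "$ratio"   # 95.24
```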

LLM Metrics

# LLM API calls by provider and model
llm_requests_total{provider="anthropic", model="claude-3-sonnet"} 500

# LLM latency
llm_request_duration_seconds_bucket{provider="anthropic", le="5"} 400
llm_request_duration_seconds_bucket{provider="anthropic", le="30"} 490

# Token usage
llm_tokens_used_total{provider="anthropic", type="input"} 150000
llm_tokens_used_total{provider="anthropic", type="output"} 75000

Message Queue Metrics

# Queue depth by topic
mq_messages_pending{topic="triage.alerts"} 15
mq_messages_pending{topic="triage.enrichment"} 3

# Message processing rate
mq_messages_processed_total{topic="triage.alerts"} 5000
mq_messages_acknowledged_total{topic="triage.alerts"} 4995

Prometheus Configuration

Basic Scrape Config

# prometheus.yml
scrape_configs:
  - job_name: 'triage-warden'
    static_configs:
      - targets:
          - 'triage-warden-api:8080'
          - 'triage-warden-orchestrator:8080'
    metrics_path: /metrics
    scrape_interval: 15s
    scrape_timeout: 10s

Kubernetes ServiceMonitor

For Prometheus Operator deployments:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: triage-warden
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
  namespaceSelector:
    matchNames:
      - triage-warden
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s

Pod Annotations (Alternative)

If using annotation-based discovery:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

Alerting Rules

PrometheusRule Resource

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triage-warden-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: triage-warden.availability
      rules:
        # Service Down
        - alert: TriageWardenDown
          expr: up{job="triage-warden"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Triage Warden instance is down"
            description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."

        # High Error Rate
        - alert: TriageWardenHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="triage-warden",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{job="triage-warden"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "More than 5% of requests are returning 5xx errors."

        # Database Unhealthy
        - alert: TriageWardenDatabaseUnhealthy
          expr: component_healthy{job="triage-warden",component="database"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Database connection lost"
            description: "Triage Warden cannot connect to the database."

    - name: triage-warden.performance
      rules:
        # High Latency
        - alert: TriageWardenHighLatency
          expr: |
            histogram_quantile(0.99,
              rate(http_request_duration_seconds_bucket{job="triage-warden"}[5m])
            ) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High API latency"
            description: "P99 latency is above 1 second for the last 10 minutes."

        # Slow Triage Time
        - alert: TriageWardenSlowTriage
          expr: |
            histogram_quantile(0.90,
              rate(triage_warden_triage_duration_seconds_bucket[1h])
            ) > 300
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Incident triage taking too long"
            description: "P90 triage duration is above 5 minutes."

    - name: triage-warden.operations
      rules:
        # Kill Switch Active
        - alert: TriageWardenKillSwitchActive
          expr: kill_switch_active == 1
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Kill switch is active"
            description: "All automation has been halted by the kill switch."

        # High Pending Approvals
        - alert: TriageWardenHighPendingApprovals
          expr: triage_warden_actions_pending_approval > 50
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High number of pending approvals"
            description: "{{ $value }} actions are waiting for approval."

        # Connector Unhealthy
        - alert: TriageWardenConnectorUnhealthy
          expr: component_healthy{component=~"connector_.*"} == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Connector {{ $labels.component }} is unhealthy"
            description: "Connector has been unhealthy for more than 10 minutes."

        # Queue Backlog
        - alert: TriageWardenQueueBacklog
          expr: mq_messages_pending{topic="triage.alerts"} > 100
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Alert queue backlog growing"
            description: "{{ $value }} unprocessed alerts in queue."

    - name: triage-warden.resources
      rules:
        # High CPU
        - alert: TriageWardenHighCPU
          expr: |
            sum(rate(container_cpu_usage_seconds_total{
              container="triage-warden"
            }[5m])) by (pod) > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage"
            description: "Pod {{ $labels.pod }} CPU usage above 80%."

        # High Memory
        - alert: TriageWardenHighMemory
          expr: |
            container_memory_usage_bytes{container="triage-warden"} /
            container_spec_memory_limit_bytes{container="triage-warden"} > 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage"
            description: "Pod {{ $labels.pod }} memory usage above 90%."

        # Database Connection Exhaustion
        - alert: TriageWardenDBConnectionsLow
          expr: db_pool_connections_idle < 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Database connection pool nearly exhausted"
            description: "Only {{ $value }} idle connections remaining."

Key Metrics to Monitor

SLI/SLO Recommendations

| Indicator | Target | Alert Threshold |
|-----------|--------|-----------------|
| Availability | 99.9% | < 99.5% |
| API Latency P99 | < 500ms | > 1s |
| Error Rate | < 0.1% | > 1% |
| Triage Time P90 | < 5min | > 10min |
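To put the 99.9% availability target in concrete terms, the allowed downtime over a 30-day window works out to:

```shell
# Downtime budget in minutes for a 99.9% availability target over 30 days
target=0.999
minutes=$(awk -v t="$target" 'BEGIN { printf "%.1f", (1 - t) * 30 * 24 * 60 }')
echo "$minutes"   # 43.2
```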

Dashboard Panels

Overview:

  • Instance count and status
  • Requests per second
  • Error rate percentage
  • Active incidents

Performance:

  • Request latency histogram
  • Database query duration
  • LLM response time
  • Cache hit ratio

Operations:

  • Incidents by severity/status
  • Actions executed vs pending
  • Queue depths
  • Connector health matrix

Resources:

  • CPU utilization by instance
  • Memory utilization by instance
  • Database connections
  • Redis memory usage

Grafana Dashboards

Importing Dashboards

Triage Warden provides pre-built Grafana dashboards:

# Download dashboard JSON
curl -o triage-warden-dashboard.json \
  https://raw.githubusercontent.com/triage-warden/triage-warden/main/deploy/grafana/dashboards/overview.json

# Import via Grafana API
curl -X POST -H "Content-Type: application/json" \
  -d @triage-warden-dashboard.json \
  http://admin:admin@localhost:3000/api/dashboards/db

Dashboard Provisioning

For automatic dashboard provisioning in Kubernetes:

# ConfigMap for dashboard provisioning
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  labels:
    grafana_dashboard: "1"
data:
  triage-warden.json: |
    {
      "dashboard": {
        "title": "Triage Warden",
        "panels": [...]
      }
    }

Example Panel Queries

Requests per Second:

sum(rate(http_requests_total{job="triage-warden"}[5m]))

Error Rate:

sum(rate(http_requests_total{job="triage-warden",status=~"5.."}[5m])) /
sum(rate(http_requests_total{job="triage-warden"}[5m])) * 100

P99 Latency:

histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="triage-warden"}[5m])) by (le)
)

Incidents by Status:

triage_warden_incidents_total{job="triage-warden"}

Cache Hit Ratio:

sum(rate(cache_hits_total[5m])) /
(sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100

Logging

Log Format

Triage Warden outputs structured JSON logs:

{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "info",
  "target": "tw_api::routes::incidents",
  "message": "Incident created",
  "incident_id": "123e4567-e89b-12d3-a456-426614174000",
  "severity": "high",
  "source": "crowdstrike",
  "trace_id": "abc123",
  "span_id": "def456"
}

Log Aggregation

Loki Configuration:

# promtail config
scrape_configs:
  - job_name: triage-warden
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: triage-warden
        action: keep
    pipeline_stages:
      - json:
          expressions:
            level: level
            incident_id: incident_id
            trace_id: trace_id
      - labels:
          level:
          incident_id:

Elasticsearch/Fluentd:

# Fluentd config
<match kubernetes.var.log.containers.triage-warden**>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name triage-warden
  <buffer>
    @type file
    path /var/log/fluentd-buffers/triage-warden
  </buffer>
</match>

Log Queries

Find errors:

level:ERROR

Slow requests:

duration_ms:>1000

Specific user actions:

user.id:"user-uuid" AND target:*auth*
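Because the logs are structured JSON (see Log Format above), the same filters can be applied locally with jq before logs reach the aggregation system. A sketch with an inlined sample line (the field values here are illustrative):

```shell
# Filter a structured JSON log line for errors and print the message
log='{"timestamp":"2024-01-15T10:30:00.000Z","level":"error","message":"Enrichment failed","incident_id":"abc-123"}'
msg=$(echo "$log" | jq -r 'select(.level == "error") | .message')
echo "$msg"   # Enrichment failed
```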

Distributed Tracing

OpenTelemetry Configuration

# Environment variables
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=triage-warden
OTEL_TRACES_EXPORTER=otlp

Trace Propagation

Triage Warden propagates trace context through:

  • HTTP headers (W3C Trace Context)
  • Message queue metadata
  • Internal async tasks
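A W3C Trace Context header has the shape `version-traceid-spanid-flags`: a 2-hex-digit version, a 32-hex-digit trace ID, a 16-hex-digit span ID, and 2-hex-digit flags. A sketch of constructing one for a manual test request:

```shell
# Build a W3C traceparent header: version-traceid-spanid-flags
trace_id=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')  # 16 bytes -> 32 hex chars
span_id=$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')    # 8 bytes -> 16 hex chars
traceparent="00-${trace_id}-${span_id}-01"                        # flags=01: sampled
echo "$traceparent"
```

Passing this as a `traceparent` header on an API request lets you correlate the resulting spans in your tracing backend.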

Health Check Integration

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

Health Status Interpretation

| Status | HTTP Code | Meaning |
|--------|-----------|---------|
| healthy | 200 | All systems operational |
| degraded | 200 | Non-critical issues |
| unhealthy | 503 | Critical component failure |
| halted | 200 | Kill switch active |

Synthetic Monitoring

# blackbox-exporter probe
modules:
  http_triage_warden:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      fail_if_body_not_matches_regexp:
        - '"status":"healthy"'

Uptime Monitoring

Configure external uptime monitoring (Pingdom, UptimeRobot, etc.) to check:

  • https://triage.example.com/live - Basic availability
  • https://triage.example.com/ready - Full readiness

SLO/SLI Definitions

Availability SLO

Target: 99.9% availability

# SLI: Successful requests / Total requests
sum(rate(http_requests_total{job="triage-warden",status!~"5.."}[30d])) /
sum(rate(http_requests_total{job="triage-warden"}[30d]))

Latency SLO

Target: 99% of requests < 500ms

# SLI: Requests under threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{job="triage-warden",le="0.5"}[30d])) /
sum(rate(http_request_duration_seconds_count{job="triage-warden"}[30d]))

Error Budget

# Remaining error budget
1 - (
  (1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d])))) /
  (1 - 0.999)
)
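The error budget expression above reduces to `1 - (observed error rate / allowed error rate)`. A sketch with sample request counts (the numbers are illustrative):

```shell
# Remaining error budget for a 99.9% SLO, given sample request counts
total=1000000
errors=500
remaining=$(awk -v t="$total" -v e="$errors" 'BEGIN {
  slo = 0.999
  error_rate = e / t        # observed error rate: 0.0005
  budget = 1 - slo          # allowed error rate: 0.001
  printf "%.2f", 1 - error_rate / budget
}')
echo "$remaining"   # 0.50
```

Here half the 30-day budget has been consumed: error rate 0.05% against an allowance of 0.1%.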

Troubleshooting with Metrics

High Latency Investigation

# Identify slow endpoints
topk(5,
  histogram_quantile(0.99,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (path, le)
  )
)

# Check database query time
histogram_quantile(0.99,
  rate(db_query_duration_seconds_bucket[5m])
)

Memory Issues

# Memory growth rate
deriv(process_resident_memory_bytes{job="triage-warden"}[1h])

# Compare to limits
container_memory_usage_bytes / container_spec_memory_limit_bytes

Queue Bottlenecks

# Processing rate vs arrival rate
rate(mq_messages_processed_total[5m]) - rate(mq_messages_received_total[5m])

# Time in queue
histogram_quantile(0.95, rate(mq_message_wait_seconds_bucket[5m]))

Next Steps

Horizontal Scaling Guide

This guide covers scaling Triage Warden horizontally to handle increased load and ensure high availability.

Architecture Overview

Triage Warden consists of two main components that scale differently:

                    ┌─────────────────────┐
                    │   Load Balancer     │
                    │  (Traefik/nginx)    │
                    └──────────┬──────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   API Server  │      │   API Server  │      │   API Server  │
│   (stateless) │      │   (stateless) │      │   (stateless) │
└───────┬───────┘      └───────┬───────┘      └───────┬───────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│ Orchestrator  │      │ Orchestrator  │      │ Orchestrator  │
│   (worker)    │      │   (worker)    │      │   (leader)    │
└───────┬───────┘      └───────┬───────┘      └───────┬───────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│    Redis      │      │  PostgreSQL   │      │  PostgreSQL   │
│  (MQ + Cache) │      │   (primary)   │      │   (replica)   │
└───────────────┘      └───────────────┘      └───────────────┘

Scaling Components

API Servers

API servers are stateless and can be scaled horizontally without coordination.

When to Scale:

  • CPU utilization > 70% sustained
  • Request latency P99 > 500ms
  • Concurrent connections approaching limits

Scaling Method:

# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-warden-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Helm Configuration:

api:
  replicas: 3
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

Orchestrators

Orchestrators process incidents asynchronously. They use leader election for singleton tasks (scheduled jobs, metrics aggregation) while allowing parallel incident processing across all instances.

When to Scale:

  • Incident queue depth increasing
  • Mean time to triage increasing
  • Worker CPU utilization > 70%

Scaling Considerations:

  1. Leader Tasks: Only one orchestrator runs scheduled jobs
  2. Worker Tasks: All orchestrators process incidents from the queue
  3. State Sharing: Uses Redis for message queue and coordination

Configuration:

orchestrator:
  replicas: 3
  leaderElection:
    enabled: true
    leaseDuration: 15s
    renewDeadline: 10s
    retryPeriod: 2s

When to Scale

Metrics to Monitor

| Metric | Warning Threshold | Critical Threshold | Action |
|--------|-------------------|--------------------|--------|
| http_request_duration_seconds P99 | > 500ms | > 1s | Scale API |
| cpu_usage_percent | > 70% | > 85% | Scale component |
| memory_usage_percent | > 80% | > 90% | Scale or optimize |
| incident_queue_depth | > 100 | > 500 | Scale orchestrators |
| db_connection_pool_waiting | > 0 | > 5 | Increase pool size |
| redis_connected_clients | > 80% max | > 95% max | Scale Redis |

Capacity Planning

API Server Capacity (per instance):

  • ~500 requests/second (simple endpoints)
  • ~100 requests/second (complex queries)
  • ~50 concurrent WebSocket connections

Orchestrator Capacity (per instance):

  • ~10 incidents processed concurrently
  • ~5 concurrent LLM analysis calls
  • ~20 concurrent enrichment requests
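These per-instance figures translate directly into an instance count for a target load. A sketch using the ~10 concurrent incidents per orchestrator from above (the target load of 25 is illustrative):

```shell
# Orchestrator instances needed for a target concurrent-incident load,
# assuming ~10 concurrent incidents per instance
concurrent_incidents=25
per_instance=10
instances=$(( (concurrent_incidents + per_instance - 1) / per_instance ))  # ceiling division
echo "$instances"   # 3
```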

Scaling Decision Matrix

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| High API latency | API overloaded | Scale API servers |
| Growing queue depth | Orchestrators overloaded | Scale orchestrators |
| Database timeouts | Connection exhaustion | Increase pool, add replicas |
| Cache misses high | Cache too small | Increase Redis memory |
| LLM rate limits | Too many concurrent calls | Add rate limiting, queue |

Database Scaling

Connection Pooling

Each instance maintains a connection pool. Total connections:

Total = API_instances * pool_size + Orchestrator_instances * pool_size

Example: 3 API + 2 Orchestrator with pool_size=15:

Total = (3 * 15) + (2 * 15) = 75 connections
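The same arithmetic as a script, useful when sizing `max_connections` on the PostgreSQL side:

```shell
# Total PostgreSQL connections for the example above:
# (API instances + orchestrator instances) * per-instance pool size
api_instances=3
orchestrator_instances=2
pool_size=15
total=$(( (api_instances + orchestrator_instances) * pool_size ))
echo "$total"   # 75
```

Leave headroom above this total for superuser connections, migrations, and ad-hoc psql sessions.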

Configuration:

database:
  max_connections: 15  # Per instance
  min_connections: 2
  connect_timeout: 30

Read Replicas

For read-heavy workloads, configure read replicas:

database:
  primary_url: "postgres://user:pass@primary:5432/db"
  replica_url: "postgres://user:pass@replica:5432/db"
  read_replica_enabled: true

Connection Pooler (PgBouncer)

For large deployments, use PgBouncer:

# Kubernetes ConfigMap for PgBouncer
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    triage_warden = host=postgres port=5432 dbname=triage_warden

    [pgbouncer]
    listen_port = 6432
    listen_addr = 0.0.0.0
    auth_type = md5
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 50

Redis Scaling

Standalone vs Cluster

Standalone (default): Suitable for most deployments

  • Up to ~100k ops/second
  • Single point of failure (use replica for HA)

Cluster: For high-throughput requirements

  • Horizontal scaling across nodes
  • Automatic sharding

Redis Configuration

redis:
  architecture: replication  # standalone, replication, cluster
  master:
    resources:
      limits:
        memory: 2Gi
  replica:
    replicaCount: 2

Cache Sizing

Calculate cache memory needs:

Memory = average_entry_size * expected_entries * 1.5 (overhead)

Example: 1KB average, 100k entries:

Memory = 1KB * 100,000 * 1.5 = 150MB
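The same sizing calculation as a script (decimal units, matching the 150MB figure above):

```shell
# Cache memory for the example above: 1 KB average entry, 100k entries, 1.5x overhead
entry_kb=1
entries=100000
memory_mb=$(awk -v s="$entry_kb" -v n="$entries" 'BEGIN { printf "%.0f", s * n * 1.5 / 1000 }')
echo "${memory_mb}MB"   # 150MB
```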

Load Balancer Configuration

Health Checks

Configure proper health checks for load balancing:

# Traefik
- "traefik.http.services.api.loadbalancer.healthcheck.path=/ready"
- "traefik.http.services.api.loadbalancer.healthcheck.interval=5s"
- "traefik.http.services.api.loadbalancer.healthcheck.timeout=3s"

Session Affinity

For WebSocket connections, enable sticky sessions:

# Traefik
- "traefik.http.services.api.loadbalancer.sticky.cookie.name=tw_server"
- "traefik.http.services.api.loadbalancer.sticky.cookie.httpOnly=true"

Rate Limiting

Configure rate limiting at the load balancer level:

# Traefik rate limiting middleware
http:
  middlewares:
    rate-limit:
      rateLimit:
        average: 100
        burst: 50
        period: 1s

Kubernetes Autoscaling

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-warden-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric scaling (requires Prometheus adapter)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

Vertical Pod Autoscaler (VPA)

For automatic resource adjustment:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: triage-warden-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden-api
  updatePolicy:
    updateMode: "Auto"  # or "Off" for recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: triage-warden
        minAllowed:
          cpu: 250m
          memory: 256Mi
        maxAllowed:
          cpu: 4
          memory: 4Gi

Pod Disruption Budget

Ensure availability during scaling:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triage-warden-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
      app.kubernetes.io/component: api

Scaling Best Practices

1. Scale Gradually

  • Increase by 25-50% at a time
  • Monitor for 10-15 minutes before next scale
  • Watch for downstream bottlenecks

2. Test Scale Limits

# Load testing with k6
k6 run --vus 100 --duration 5m load-test.js

3. Set Resource Limits

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

4. Use Pod Anti-Affinity

Spread pods across nodes:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: triage-warden
          topologyKey: kubernetes.io/hostname

5. Configure Topology Spread

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: triage-warden

Troubleshooting Scaling Issues

Pods Not Scaling Up

# Check HPA status
kubectl describe hpa triage-warden-api

# Check metrics availability
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq

# Check events
kubectl get events --sort-by='.lastTimestamp' | grep -i scale

Pods Stuck Pending

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check pod events
kubectl describe pod <pod-name> | grep -A 10 Events

Scaling Oscillation

If pods scale up and down frequently:

  1. Increase stabilization window
  2. Adjust metric thresholds
  3. Add cooldown periods

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # 10 min

Next Steps

Backup & Restore

Procedures for backing up and restoring Triage Warden data.

Overview

Triage Warden stores all persistent data in PostgreSQL. Regular backups are essential for disaster recovery.

What to backup:

  • PostgreSQL database (all data)
  • Configuration files (optional, if customized)
  • TLS certificates (if not using cert-manager)

What NOT to backup:

  • Application containers (stateless, rebuilt from image)
  • Logs (should be in log aggregation system)
  • Metrics (stored in Prometheus)

Backup Procedures

Manual Backup

Docker

# Create backup directory
mkdir -p /backups/triage-warden

# Create timestamped backup
BACKUP_FILE="/backups/triage-warden/backup-$(date +%Y%m%d-%H%M%S).sql"

docker compose exec -T postgres pg_dump \
  -U triage_warden \
  --format=custom \
  --compress=9 \
  triage_warden > "$BACKUP_FILE"

# Verify backup
pg_restore --list "$BACKUP_FILE" | head -20

echo "Backup created: $BACKUP_FILE ($(du -h $BACKUP_FILE | cut -f1))"

Kubernetes

# Get PostgreSQL pod
PG_POD=$(kubectl get pods -n triage-warden -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}')

# Create backup
BACKUP_FILE="backup-$(date +%Y%m%d-%H%M%S).sql"

kubectl exec -n triage-warden $PG_POD -- \
  pg_dump -U triage_warden --format=custom --compress=9 triage_warden \
  > "$BACKUP_FILE"

# Upload to S3 (optional)
aws s3 cp "$BACKUP_FILE" s3://your-backup-bucket/triage-warden/

Automated Backup

Docker (Cron)

# /etc/cron.d/triage-warden-backup
0 2 * * * root /opt/triage-warden/scripts/backup.sh >> /var/log/triage-warden-backup.log 2>&1

#!/bin/bash
# /opt/triage-warden/scripts/backup.sh

set -e

BACKUP_DIR="/backups/triage-warden"
RETENTION_DAYS=30
BACKUP_FILE="$BACKUP_DIR/backup-$(date +%Y%m%d-%H%M%S).sql"

# Create backup
cd /opt/triage-warden
docker compose exec -T postgres pg_dump \
  -U triage_warden \
  --format=custom \
  --compress=9 \
  triage_warden > "$BACKUP_FILE"

# Verify backup
if ! pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
  echo "ERROR: Backup verification failed"
  rm -f "$BACKUP_FILE"
  exit 1
fi

# Cleanup old backups
find "$BACKUP_DIR" -name "backup-*.sql" -mtime +$RETENTION_DAYS -delete

echo "Backup completed: $BACKUP_FILE"

Kubernetes (CronJob)

apiVersion: batch/v1
kind: CronJob
metadata:
  name: triage-warden-backup
  namespace: triage-warden
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:15-alpine
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-postgresql
                      key: postgres-password
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  BACKUP_FILE="/backups/backup-$(date +%Y%m%d-%H%M%S).sql"
                  pg_dump -h postgres-postgresql -U triage_warden \
                    --format=custom --compress=9 triage_warden > "$BACKUP_FILE"
                  echo "Backup completed: $BACKUP_FILE"
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc

Restore Procedures

Prerequisites

  1. Stop the Triage Warden application (to prevent data conflicts)
  2. Have the backup file accessible
  3. Database credentials available

Full Restore

Docker

# Stop application
docker compose stop triage-warden

# Restore from backup
docker compose exec -T postgres pg_restore \
  -U triage_warden \
  --clean \
  --if-exists \
  --no-owner \
  -d triage_warden < /path/to/backup.sql

# Start application
docker compose start triage-warden

# Verify
curl http://localhost:8080/health | jq

Kubernetes

# Scale down application
kubectl scale deployment triage-warden -n triage-warden --replicas=0

# Get PostgreSQL pod
PG_POD=$(kubectl get pods -n triage-warden -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}')

# Copy backup to pod
kubectl cp backup.sql triage-warden/$PG_POD:/tmp/backup.sql

# Restore
kubectl exec -n triage-warden $PG_POD -- \
  pg_restore -U triage_warden --clean --if-exists --no-owner \
  -d triage_warden /tmp/backup.sql

# Scale up application
kubectl scale deployment triage-warden -n triage-warden --replicas=3

# Verify
kubectl exec -it deployment/triage-warden -n triage-warden -- curl -s localhost:8080/health

Point-in-Time Recovery

For point-in-time recovery, enable PostgreSQL WAL archiving:

# PostgreSQL configuration
archive_mode: on
archive_command: 'aws s3 cp %p s3://your-bucket/wal/%f'

Recovery procedure:

# 1. Stop PostgreSQL
# 2. Clear data directory
# 3. Restore base backup
# 4. Create recovery.signal
# 5. Set recovery_target_time in postgresql.conf
# 6. Start PostgreSQL

Verification

After any restore, verify:

# 1. Health check passes
curl http://localhost:8080/health | jq '.status'
# Expected: "healthy"

# 2. Recent incidents exist
curl http://localhost:8080/api/incidents | jq '. | length'

# 3. User can login
# Test via UI or API

# 4. Connectors configured
curl http://localhost:8080/health/detailed | jq '.components.connectors'

Backup Storage

Local Storage

  • Pros: Simple, fast
  • Cons: Single point of failure
  • Recommendation: Development only

Cloud Storage (S3/GCS/Azure Blob)

# Upload to S3
aws s3 cp backup.sql s3://bucket/triage-warden/backup-$(date +%Y%m%d).sql

# Download from S3
aws s3 cp s3://bucket/triage-warden/backup-20240115.sql ./restore.sql

Encryption

Encrypt backups before storing:

# Encrypt backup
gpg --symmetric --cipher-algo AES256 backup.sql

# Decrypt for restore
gpg --decrypt backup.sql.gpg > backup.sql

Disaster Recovery Plan

RTO/RPO Targets

| Metric | Target |
|--------|--------|
| Recovery Time Objective (RTO) | 4 hours |
| Recovery Point Objective (RPO) | 24 hours |

Recovery Steps

  1. Assess the situation

    • Determine extent of data loss
    • Identify latest valid backup
  2. Provision new infrastructure

    • Deploy new database instance
    • Deploy new application instances
  3. Restore data

    • Restore database from backup
    • Verify data integrity
  4. Reconfigure

    • Update DNS/load balancer
    • Reconfigure connectors if needed
    • Reset API keys if compromised
  5. Verify and communicate

    • Run health checks
    • Test critical workflows
    • Notify stakeholders

Testing Schedule

| Test | Frequency | Last Tested |
|------|-----------|-------------|
| Backup verification | Weekly | |
| Restore to test environment | Monthly | |
| Full DR simulation | Quarterly | |

Troubleshooting Guide

Common issues and their solutions.

Quick Diagnostics

# Check overall health
curl -s http://localhost:8080/health/detailed | jq

# Check logs for errors (last 100 lines)
docker compose logs --tail=100 triage-warden | grep -i error

# Check resource usage
docker stats --no-stream

Common Issues

Service Won't Start

Symptoms

  • Container exits immediately
  • "Connection refused" errors
  • Health check fails

Diagnosis

# Check container logs
docker compose logs triage-warden

# Check exit code
docker compose ps -a

Common Causes & Solutions

Missing environment variables:

Error: Required environment variable TW_ENCRYPTION_KEY not set

Solution: Ensure all required env vars are set in .env

Database connection failed:

Error: Failed to connect to database: Connection refused

Solution:

  1. Verify PostgreSQL is running: docker compose ps postgres
  2. Check DATABASE_URL is correct
  3. Verify network connectivity

Invalid encryption key:

Error: Invalid encryption key: must be 32 bytes base64-encoded

Solution: Generate new key: openssl rand -base64 32


Database Connection Issues

Symptoms

  • /ready returns 503
  • "Database unavailable" in health check
  • Queries timing out

Diagnosis

# Check database health
docker compose exec postgres pg_isready -U triage_warden

# Check connection count
docker compose exec postgres psql -U triage_warden -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname = 'triage_warden';"

# Check for locks
docker compose exec postgres psql -U triage_warden -c \
  "SELECT * FROM pg_locks WHERE NOT granted;"

Solutions

Connection pool exhausted:

# Increase max connections in docker-compose.yml
DATABASE_MAX_CONNECTIONS=50

# Or kill idle connections
docker compose exec postgres psql -U triage_warden -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE datname = 'triage_warden' AND state = 'idle' AND pid <> pg_backend_pid();"

PostgreSQL not ready:

# Wait for PostgreSQL to be ready
until docker compose exec postgres pg_isready -U triage_warden; do
  echo "Waiting for PostgreSQL..."
  sleep 2
done

Authentication Issues

Symptoms

  • "Invalid credentials" on login
  • "Session expired" errors
  • API returns 401

Diagnosis

# Check if user exists
docker compose exec postgres psql -U triage_warden -c \
  "SELECT username, enabled, last_login_at FROM users;"

# Check session configuration
curl -s http://localhost:8080/health/detailed | jq '.components'

Solutions

Reset admin password:

# Generate a new bcrypt password hash (requires htpasswd from apache2-utils)
NEW_HASH=$(htpasswd -bnBC 10 "" "newpassword" | tr -d ':\n')

# Update in database
docker compose exec postgres psql -U triage_warden -c \
  "UPDATE users SET password_hash = '$NEW_HASH' WHERE username = 'admin';"

Clear sessions:

docker compose exec postgres psql -U triage_warden -c \
  "DELETE FROM sessions;"

User account disabled:

docker compose exec postgres psql -U triage_warden -c \
  "UPDATE users SET enabled = true WHERE username = 'admin';"

LLM/AI Features Not Working

Symptoms

  • "LLM analysis failed" errors
  • No AI verdicts on incidents
  • Empty analysis in incident details

Diagnosis

# Check LLM configuration
curl -s http://localhost:8080/health/detailed | jq '.components.llm'

# Check for API key
docker compose exec triage-warden env | grep -E "(OPENAI|ANTHROPIC)_API_KEY"

# Check LLM settings in database
docker compose exec postgres psql -U triage_warden -c \
  "SELECT provider, model, enabled FROM settings WHERE key = 'llm';"

Solutions

API key not configured:

# Set via environment variable
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
docker compose up -d

LLM disabled: Configure via UI: Settings → AI/LLM → Enable toggle

Rate limited: Check provider dashboard for rate limit status. Consider:

  • Upgrading API tier
  • Reducing max_tokens (many provider rate limits are token-based)
  • Adding request delays
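One common way to add request delays is exponential backoff with jitter around the LLM call; a minimal Python sketch (the retried function is a stand-in for your provider client, not an actual Triage Warden API):

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Yield capped exponential backoff delays (seconds) with full jitter."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def with_retries(fn, max_retries=5):
    """Call fn(), sleeping between attempts; the final attempt propagates errors."""
    for delay in backoff_delays(max_retries):
        try:
            return fn()
        except Exception:
            time.sleep(delay)
    return fn()
```

Full jitter (a uniform draw up to the cap) spreads retries out so that many clients rate-limited at the same moment do not all retry in lockstep.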

Connector Failures

Symptoms

  • "Connector error" status in settings
  • Failed enrichments
  • Missing threat intel data

Diagnosis

# Check connector status
curl -s http://localhost:8080/health/detailed | jq '.components.connectors'

# Test specific connector
curl -X POST http://localhost:8080/api/connectors/{id}/test

Solutions by Connector

VirusTotal:

  • Verify API key is valid
  • Check rate limits (4 req/min for free tier)
  • Ensure outbound HTTPS to virustotal.com allowed

Jira:

  • Verify base URL (include /rest/api/3)
  • Use API token, not password
  • Check project key exists

CrowdStrike:

  • Verify OAuth client credentials
  • Check API scopes granted
  • Verify region (us-1, us-2, eu-1)

Splunk:

  • Verify HEC token is valid
  • Check SSL certificate if using HTTPS
  • Verify index exists

High Memory Usage

Symptoms

  • Container OOM killed
  • Slow response times
  • "Out of memory" errors

Diagnosis

# Check container memory
docker stats --no-stream triage-warden

# Check for memory leaks (trending)
docker stats triage-warden  # Watch over time

Solutions

Increase memory limits:

# docker-compose.yml
deploy:
  resources:
    limits:
      memory: 4G

Reduce connection pool:

DATABASE_MAX_CONNECTIONS=5

Enable verbose logging to pinpoint memory-heavy code paths (note: Rust has no garbage collector to tune):

RUST_LOG=info,triage_warden=debug

Slow Performance

Symptoms

  • High latency on API calls
  • Dashboard loads slowly
  • Timeouts on queries

Diagnosis

# Check response times
curl -w "@curl-format.txt" -s http://localhost:8080/health -o /dev/null

# Check database query times (requires the pg_stat_statements extension)
docker compose exec postgres psql -U triage_warden -c \
  "SELECT query, mean_exec_time, calls FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Check for table bloat
docker compose exec postgres psql -U triage_warden -c \
  "SELECT relname, n_dead_tup, n_live_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"
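The `@curl-format.txt` referenced above is a curl write-out template file; a minimal version might look like this (all variables are standard `curl -w` write-out variables):

```
   time_namelookup:  %{time_namelookup}s\n
      time_connect:  %{time_connect}s\n
   time_appconnect:  %{time_appconnect}s\n
time_starttransfer:  %{time_starttransfer}s\n
        time_total:  %{time_total}s\n
```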

Solutions

Add database indexes:

-- Common helpful indexes
CREATE INDEX idx_incidents_created_at ON incidents(created_at DESC);
CREATE INDEX idx_incidents_severity ON incidents(severity);
CREATE INDEX idx_audit_log_timestamp ON audit_log(timestamp DESC);

Vacuum database:

docker compose exec postgres psql -U triage_warden -c "VACUUM ANALYZE;"

Statement caching: prepared-statement caching is already enabled by default in the connection pool; no action needed.


Kill Switch Issues

Symptoms

  • Automation stopped unexpectedly
  • "Kill switch active" warnings
  • Actions blocked

Diagnosis

# Check kill switch status
curl -s http://localhost:8080/api/kill-switch | jq

# Check who activated it
curl -s http://localhost:8080/health/detailed | jq '.components.kill_switch'

Solutions

Deactivate kill switch:

curl -X POST http://localhost:8080/api/kill-switch/deactivate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Confirmed safe to resume"}'

Or via UI: Settings → Safety → Re-enable Automation


Webhook Not Receiving Events

Symptoms

  • No incidents created from SIEM
  • Webhook endpoint returns errors
  • Events missing

Diagnosis

# Test webhook endpoint
curl -X POST http://localhost:8080/api/webhooks/generic \
  -H "Content-Type: application/json" \
  -d '{"title": "Test Alert", "severity": "medium"}'

# Check webhook logs
docker compose logs triage-warden | grep -i webhook

Solutions

Signature validation failing:

  • Verify webhook secret matches source configuration
  • Check signature header name (X-Signature, X-Hub-Signature-256, etc.)
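Signature checks are typically an HMAC of the raw request body; a minimal Python sketch of GitHub-style `X-Hub-Signature-256` verification (the header format and secret are illustrative, check your source's webhook documentation):

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Compare a 'sha256=<hex>' header against the HMAC of the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels
    return hmac.compare_digest(expected, signature_header)
```

Note that the HMAC must be computed over the raw bytes as received; re-serializing the parsed JSON will usually produce a different digest.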

Payload format incorrect:

  • Check source webhook format documentation
  • Use generic webhook with custom mapping

Firewall blocking:

  • Ensure source IP can reach webhook endpoint
  • Check for WAF rules blocking requests

Diagnostic Commands

Get System Info

# Application version
curl -s http://localhost:8080/health | jq '.version'

# Database version
docker compose exec postgres psql -U triage_warden -c "SELECT version();"

# Container info
docker compose version
docker version

Export Debug Bundle

#!/bin/bash
# Create debug bundle
BUNDLE_DIR="debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE_DIR"

# Health check
curl -s http://localhost:8080/health/detailed > "$BUNDLE_DIR/health.json"

# Recent logs
docker compose logs --tail=1000 triage-warden > "$BUNDLE_DIR/app.log"
docker compose logs --tail=500 postgres > "$BUNDLE_DIR/db.log"

# Configuration (redacted)
docker compose config | grep -v -E "(PASSWORD|SECRET|KEY)" > "$BUNDLE_DIR/config.yml"

# Create archive
tar -czf "$BUNDLE_DIR.tar.gz" "$BUNDLE_DIR"
rm -rf "$BUNDLE_DIR"

echo "Debug bundle: $BUNDLE_DIR.tar.gz"

Getting Help

If you can't resolve the issue:

  1. Check GitHub Issues for known issues
  2. Create a new issue with:
    • Triage Warden version
    • Deployment method (Docker/K8s)
    • Error messages
    • Debug bundle (with secrets redacted)
  3. Contact support: [email protected]

Contributing

Guide to contributing to Triage Warden.

Getting Started

  1. Fork the repository
  2. Clone your fork
  3. Set up the development environment
  4. Create a branch for your changes
  5. Submit a pull request

Development Setup

Prerequisites

  • Rust 1.75+
  • Python 3.11+
  • uv (Python package manager)
  • SQLite (for development)

Initial Setup

# Clone repository
git clone https://github.com/your-username/triage-warden.git
cd triage-warden

# Install Rust dependencies
cargo build

# Install Python dependencies
cd python
uv sync
cd ..

# Run tests
cargo test
cd python && uv run pytest

Code Style

Rust

  • Follow standard Rust conventions
  • Run cargo fmt before committing
  • Run cargo clippy and fix warnings
  • Document public APIs with doc comments

Python

  • Follow PEP 8
  • Run ruff check and black before committing
  • Type hints required (mypy strict mode)
  • Docstrings for public functions

Pre-commit Hooks

Pre-commit hooks are already configured for the repository and run automatically:

# The project has pre-commit configured in .git/hooks
# It runs automatically on commit:
# - cargo fmt
# - cargo clippy
# - ruff
# - black
# - mypy

Pull Request Process

  1. Create a branch

    git checkout -b feature/my-feature
    
  2. Make changes

    • Write code
    • Add tests
    • Update documentation
  3. Run checks

    cargo fmt && cargo clippy
    cargo test
    cd python && uv run pytest
    
  4. Commit

    git commit -m "feat: add new feature"
    
  5. Push and create PR

    git push origin feature/my-feature
    
  6. Address review feedback

Commit Messages

Follow conventional commits:

type(scope): description

[optional body]

[optional footer]

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation
  • refactor: Code refactoring
  • test: Adding tests
  • chore: Maintenance
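A commit header in this format can be checked with a small regex; a sketch whose allowed types mirror the list above (CI integration is left as an exercise):

```python
import re

# type, optional (scope), then ": " and a description
COMMIT_RE = re.compile(r"^(feat|fix|docs|refactor|test|chore)(\([a-z0-9-]+\))?: .+")

def is_conventional(header: str) -> bool:
    """Check the first line of a commit message against conventional-commit form."""
    return COMMIT_RE.match(header) is not None
```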

Testing

Rust Tests

# Run all tests
cargo test

# Run specific crate tests
cargo test -p tw-api

# Run with output
cargo test -- --nocapture

Python Tests

cd python
uv run pytest

# Run specific tests
uv run pytest tests/test_agents.py

# With coverage
uv run pytest --cov=tw_ai

Integration Tests

# Start test server
cargo run --bin tw-api &

# Run integration tests
./scripts/integration-tests.sh

Documentation

  • Update docs for API changes
  • Add examples for new features
  • Keep README.md current

Build docs locally:

cd docs-site
mdbook serve

Issue Reporting

When reporting issues:

  1. Search existing issues first
  2. Use issue templates
  3. Include:
    • Version information
    • Steps to reproduce
    • Expected vs actual behavior
    • Relevant logs

Questions

  • Open a GitHub Discussion
  • Check existing discussions first
  • Tag appropriately

License

By contributing, you agree that your contributions will be licensed under the MIT License.

Building from Source

Complete guide to building Triage Warden.

Prerequisites

Rust

# Install Rust via rustup
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Verify installation
rustc --version  # Should be 1.75+

Python

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify installation
uv --version

System Dependencies

macOS

brew install openssl pkg-config

Ubuntu/Debian

sudo apt-get install build-essential pkg-config libssl-dev

Fedora

sudo dnf install gcc openssl-devel pkgconfig

Building

Debug Build

cargo build

Outputs:

  • target/debug/tw-api
  • target/debug/tw-cli

Release Build

cargo build --release

Outputs:

  • target/release/tw-api
  • target/release/tw-cli

Python Package

cd python
uv sync
uv build

PyO3 Bridge

The bridge is built automatically with cargo:

cd tw-bridge
cargo build --release

Build Options

Feature Flags

# Build with PostgreSQL support only
cargo build --no-default-features --features postgres

# Build with all features
cargo build --all-features

Cross-Compilation

# For Linux (from macOS)
rustup target add x86_64-unknown-linux-gnu
cargo build --release --target x86_64-unknown-linux-gnu

# For musl (static binary)
rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl

Docker Build

Build Image

docker build -t triage-warden .

Multi-Stage Dockerfile

# Builder stage
FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

# Runtime stage
FROM debian:bookworm-slim
# ca-certificates is needed for outbound TLS (connectors, LLM APIs)
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/tw-api /usr/local/bin/
CMD ["tw-api"]

Verification

Run Tests

# Rust tests
cargo test

# Python tests
cd python && uv run pytest

# All tests
./scripts/test-all.sh

Linting

# Rust
cargo fmt --check
cargo clippy -- -D warnings

# Python
cd python
uv run ruff check
uv run black --check .
uv run mypy .

Smoke Test

# Start server
./target/release/tw-api &

# Health check
curl http://localhost:8080/health

# Stop server
kill %1

Troubleshooting

OpenSSL Errors

# macOS
export OPENSSL_DIR=$(brew --prefix openssl)

# Linux
export OPENSSL_DIR=/usr

PyO3 Build Issues

# Ensure Python is found
export PYO3_PYTHON=$(which python3)

# Clean and rebuild
cargo clean -p tw-bridge
cargo build -p tw-bridge

Out of Memory

# Reduce parallel jobs
cargo build -j 2

Testing

Guide to testing Triage Warden.

Test Structure

triage-warden/
├── crates/
│   ├── tw-api/src/
│   │   └── tests/           # API integration tests
│   ├── tw-core/src/
│   │   └── tests/           # Core unit tests
│   └── tw-actions/src/
│       └── tests/           # Action handler tests
└── python/
    └── tests/               # Python tests

Running Tests

All Tests

# Rust
cargo test

# Python
cd python && uv run pytest

# Everything
./scripts/test-all.sh

Specific Tests

# Single crate
cargo test -p tw-api

# Single test
cargo test test_incident_creation

# Pattern match
cargo test incident

# With output
cargo test -- --nocapture

Unit Tests

Rust Unit Tests

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_incident_creation() {
        let incident = Incident::new(
            IncidentType::Phishing,
            Severity::High,
        );
        assert_eq!(incident.status, IncidentStatus::Open);
    }

    #[tokio::test]
    async fn test_async_operation() {
        let result = async_function().await;
        assert!(result.is_ok());
    }
}

Python Unit Tests

import pytest
from tw_ai.agents import TriageAgent

def test_agent_creation():
    agent = TriageAgent()
    assert agent.model == "claude-sonnet-4-20250514"

@pytest.mark.asyncio
async def test_triage():
    agent = TriageAgent()
    verdict = await agent.triage(mock_incident)
    assert verdict.classification in ["malicious", "benign"]

Integration Tests

API Integration Tests

#[tokio::test]
async fn test_incident_api() {
    let app = create_test_app().await;

    // Create incident
    let response = app
        .oneshot(
            Request::builder()
                .method("POST")
                .uri("/api/incidents")
                .header("Content-Type", "application/json")
                .body(Body::from(r#"{"type":"phishing"}"#))
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::CREATED);
}

Database Tests

#[tokio::test]
async fn test_repository() {
    // Use in-memory SQLite
    let pool = create_test_pool().await;
    let repo = SqliteIncidentRepository::new(pool);

    let incident = repo.create(&new_incident).await.unwrap();
    let found = repo.get(incident.id).await.unwrap();

    assert_eq!(found.unwrap().id, incident.id);
}

Test Fixtures

Rust Fixtures

// tests/fixtures.rs
pub fn mock_incident() -> Incident {
    Incident {
        id: Uuid::new_v4(),
        incident_type: IncidentType::Phishing,
        severity: Severity::High,
        status: IncidentStatus::Open,
        raw_data: json!({"subject": "Test"}),
        ..Default::default()
    }
}

Python Fixtures

# tests/conftest.py
import pytest

@pytest.fixture
def mock_incident():
    return {
        "id": "test-123",
        "type": "phishing",
        "severity": "high",
        "raw_data": {"subject": "Test Email"}
    }

@pytest.fixture
def mock_connector():
    return MockThreatIntelConnector()

Mocking

Rust Mocking

use mockall::mock;

mock! {
    ThreatIntelConnector {}

    #[async_trait]
    impl ThreatIntelConnector for ThreatIntelConnector {
        async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport>;
    }
}

#[tokio::test]
async fn test_with_mock() {
    let mut mock = MockThreatIntelConnector::new();
    mock.expect_lookup_hash()
        .returning(|_| Ok(ThreatReport::clean()));

    let result = function_using_connector(&mock).await;
    assert!(result.is_ok());
}

Python Mocking

from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_with_mock():
    with patch("tw_ai.agents.tools.lookup_hash") as mock:
        mock.return_value = {"malicious": False}

        agent = TriageAgent()
        verdict = await agent.triage(mock_incident)

        mock.assert_called_once()

Test Coverage

Rust Coverage

cargo install cargo-tarpaulin
cargo tarpaulin --out Html

Python Coverage

cd python
uv run pytest --cov=tw_ai --cov-report=html

CI Testing

GitHub Actions runs tests on every PR:

# .github/workflows/test.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo test
      - run: cargo clippy -- -D warnings

Test Data

Evaluation Test Cases

Test cases for AI triage evaluation:

# python/tw_ai/evaluation/test_cases/phishing.yaml
- name: obvious_phishing
  input:
    sender: "[email protected]"
    subject: "Urgent: Verify Account"
    urls: ["https://phishing-site.com/login"]
    auth_results: {spf: fail, dkim: fail}
  expected:
    classification: malicious
    min_confidence: 0.8
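A test case like this is checked by comparing a triage verdict against its `expected` block; a minimal sketch of that comparison (the verdict shape is an assumption for illustration, not the actual `tw_ai` evaluation API):

```python
def check_case(expected: dict, verdict: dict) -> bool:
    """Return True if a verdict satisfies a test case's expected block."""
    if verdict.get("classification") != expected["classification"]:
        return False
    # min_confidence is a floor; absent means any confidence passes
    return verdict.get("confidence", 0.0) >= expected.get("min_confidence", 0.0)
```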

Run evaluation:

cd python
uv run pytest tests/test_evaluation.py

Adding Connectors

Guide to implementing new connectors.

Connector Architecture

Connectors follow a trait-based pattern:

Connector Trait (base)
    │
    ├── ThreatIntelConnector
    ├── SIEMConnector
    ├── EDRConnector
    ├── EmailGatewayConnector
    └── TicketingConnector

Implementing a Connector

1. Create the File

touch crates/tw-connectors/src/threat_intel/my_provider.rs

2. Implement Base Trait

use crate::traits::{Connector, ConnectorError, ConnectorHealth, ConnectorResult};
use async_trait::async_trait;

pub struct MyProviderConnector {
    client: reqwest::Client,
    api_key: String,
    base_url: String,
}

impl MyProviderConnector {
    pub fn new(api_key: String) -> Result<Self, ConnectorError> {
        let client = reqwest::Client::builder()
            .timeout(std::time::Duration::from_secs(30))
            .build()
            .map_err(|e| ConnectorError::Configuration(e.to_string()))?;

        Ok(Self {
            client,
            api_key,
            base_url: "https://api.myprovider.com".to_string(),
        })
    }
}

#[async_trait]
impl Connector for MyProviderConnector {
    fn name(&self) -> &str {
        "my_provider"
    }

    fn connector_type(&self) -> &str {
        "threat_intel"
    }

    async fn health_check(&self) -> ConnectorResult<ConnectorHealth> {
        let response = self.client
            .get(format!("{}/health", self.base_url))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| ConnectorError::NetworkError(e.to_string()))?;

        if response.status().is_success() {
            Ok(ConnectorHealth::Healthy)
        } else {
            Ok(ConnectorHealth::Unhealthy {
                message: "Health check failed".to_string(),
            })
        }
    }

    async fn test_connection(&self) -> ConnectorResult<bool> {
        match self.health_check().await? {
            ConnectorHealth::Healthy => Ok(true),
            _ => Ok(false),
        }
    }
}

3. Implement Specialized Trait

use crate::traits::{ThreatIntelConnector, ThreatReport, IndicatorType};

#[async_trait]
impl ThreatIntelConnector for MyProviderConnector {
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport> {
        let response = self.client
            .get(format!("{}/files/{}", self.base_url, hash))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| ConnectorError::NetworkError(e.to_string()))?;

        if response.status() == reqwest::StatusCode::NOT_FOUND {
            return Ok(ThreatReport {
                indicator: hash.to_string(),
                indicator_type: IndicatorType::FileHash,
                malicious: false,
                confidence: 0.0,
                categories: vec![],
                first_seen: None,
                last_seen: None,
                sources: vec![],
            });
        }

        let data: ApiResponse = response.json().await
            .map_err(|e| ConnectorError::InvalidResponse(e.to_string()))?;

        Ok(self.convert_response(data))
    }

    async fn lookup_url(&self, url: &str) -> ConnectorResult<ThreatReport> {
        // Similar implementation
        todo!()
    }

    async fn lookup_domain(&self, domain: &str) -> ConnectorResult<ThreatReport> {
        // Similar implementation
        todo!()
    }

    async fn lookup_ip(&self, ip: &str) -> ConnectorResult<ThreatReport> {
        // Similar implementation
        todo!()
    }
}

4. Add to Module

// crates/tw-connectors/src/threat_intel/mod.rs
mod my_provider;
pub use my_provider::MyProviderConnector;

5. Register in Bridge

// tw-bridge/src/lib.rs
impl ThreatIntelBridge {
    pub fn new(mode: &str) -> PyResult<Self> {
        let connector: Arc<dyn ThreatIntelConnector + Send + Sync> = match mode {
            "virustotal" => Arc::new(VirusTotalConnector::new(
                std::env::var("TW_VIRUSTOTAL_API_KEY")
                    .map_err(|_| PyErr::new::<pyo3::exceptions::PyValueError, _>(
                        "TW_VIRUSTOTAL_API_KEY not set"
                    ))?
            )?),
            "my_provider" => Arc::new(MyProviderConnector::new(
                std::env::var("TW_MY_PROVIDER_API_KEY")
                    .map_err(|_| PyErr::new::<pyo3::exceptions::PyValueError, _>(
                        "TW_MY_PROVIDER_API_KEY not set"
                    ))?
            )?),
            _ => Arc::new(MockThreatIntelConnector::new("mock")),
        };

        Ok(Self { connector })
    }
}

Error Handling

Use appropriate error types:

pub enum ConnectorError {
    /// Configuration issue
    Configuration(String),

    /// Network/connection error
    NetworkError(String),

    /// Authentication failed
    AuthenticationFailed(String),

    /// Resource not found
    NotFound(String),

    /// Rate limited
    RateLimited { retry_after: Option<Duration> },

    /// Invalid response from service
    InvalidResponse(String),

    /// Request failed
    RequestFailed(String),
}

Rate Limiting

Implement rate limiting in your connector:

use governor::{Quota, RateLimiter};

pub struct MyProviderConnector {
    client: reqwest::Client,
    api_key: String,
    rate_limiter: RateLimiter<...>,
}

impl MyProviderConnector {
    async fn make_request(&self, url: &str) -> ConnectorResult<Response> {
        self.rate_limiter.until_ready().await;

        self.client.get(url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| ConnectorError::NetworkError(e.to_string()))
    }
}

Testing

Unit Tests

#[cfg(test)]
mod tests {
    use super::*;
    use wiremock::{MockServer, Mock, ResponseTemplate};
    use wiremock::matchers::{method, path};

    #[tokio::test]
    async fn test_lookup_hash() {
        let mock_server = MockServer::start().await;

        Mock::given(method("GET"))
            .and(path("/files/abc123"))
            .respond_with(ResponseTemplate::new(200).set_body_json(json!({
                "malicious": true,
                "confidence": 0.95
            })))
            .mount(&mock_server)
            .await;

        let connector = MyProviderConnector::with_base_url(
            "test-key".to_string(),
            mock_server.uri(),
        );

        let result = connector.lookup_hash("abc123").await.unwrap();
        assert!(result.malicious);
    }
}

Documentation

Document your connector:

//! MyProvider threat intelligence connector.
//!
//! # Configuration
//!
//! Set `TW_MY_PROVIDER_API_KEY` environment variable.
//!
//! # Example
//!
//! ```rust
//! let connector = MyProviderConnector::new(api_key)?;
//! let report = connector.lookup_hash("abc123").await?;
//! ```

Adding Actions

Guide to implementing new action handlers.

Action Architecture

Actions implement the Action trait:

#[async_trait]
pub trait Action: Send + Sync {
    fn name(&self) -> &str;
    fn description(&self) -> &str;
    fn required_parameters(&self) -> Vec<ParameterDef>;
    fn supports_rollback(&self) -> bool;

    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError>;

    async fn rollback(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        Err(ActionError::RollbackNotSupported)
    }
}

Implementing an Action

1. Create the File

touch crates/tw-actions/src/my_action.rs

2. Define the Action

use crate::registry::{
    Action, ActionContext, ActionError, ActionResult, ParameterDef, ParameterType,
};
use async_trait::async_trait;
use chrono::Utc;
use std::collections::HashMap;
use tracing::{info, instrument};

/// My custom action handler.
pub struct MyAction;

impl MyAction {
    pub fn new() -> Self {
        Self
    }
}

impl Default for MyAction {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl Action for MyAction {
    fn name(&self) -> &str {
        "my_action"
    }

    fn description(&self) -> &str {
        "Description of what this action does"
    }

    fn required_parameters(&self) -> Vec<ParameterDef> {
        vec![
            ParameterDef::required(
                "target",
                "The target of the action",
                ParameterType::String,
            ),
            ParameterDef::optional(
                "force",
                "Force the action even if conditions aren't met",
                ParameterType::Boolean,
                serde_json::json!(false),
            ),
        ]
    }

    fn supports_rollback(&self) -> bool {
        true
    }

    #[instrument(skip(self, context))]
    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        let started_at = Utc::now();

        // Get required parameter
        let target = context.require_string("target")?;

        // Get optional parameter with default
        let force = context
            .get_param("force")
            .and_then(|v| v.as_bool())
            .unwrap_or(false);

        info!("Executing my_action on target: {}", target);

        // Perform the action
        // ...

        // Build output
        let mut output = HashMap::new();
        output.insert("target".to_string(), serde_json::json!(target));
        output.insert("success".to_string(), serde_json::json!(true));

        Ok(ActionResult::success(
            self.name(),
            &format!("Action completed on {}", target),
            started_at,
            output,
        ))
    }

    async fn rollback(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        let started_at = Utc::now();
        let target = context.require_string("target")?;

        info!("Rolling back my_action on target: {}", target);

        // Perform rollback
        // ...

        let mut output = HashMap::new();
        output.insert("target".to_string(), serde_json::json!(target));

        Ok(ActionResult::success(
            &format!("{}_rollback", self.name()),
            &format!("Rollback completed on {}", target),
            started_at,
            output,
        ))
    }
}

3. Add to Module

// crates/tw-actions/src/lib.rs
mod my_action;
pub use my_action::MyAction;

4. Register in Registry

// crates/tw-actions/src/registry.rs
impl ActionRegistry {
    pub fn new() -> Self {
        let mut registry = Self {
            actions: HashMap::new(),
        };

        // Register built-in actions
        registry.register(Box::new(QuarantineEmailAction::new()));
        registry.register(Box::new(BlockSenderAction::new()));
        registry.register(Box::new(MyAction::new())); // Add here

        registry
    }
}

Parameter Types

Available parameter types:

pub enum ParameterType {
    String,
    Integer,
    Float,
    Boolean,
    List,
    Object,
}

Define parameters:

fn required_parameters(&self) -> Vec<ParameterDef> {
    vec![
        ParameterDef::required("name", "Description", ParameterType::String),
        ParameterDef::optional("count", "Description", ParameterType::Integer, json!(10)),
        ParameterDef::optional("tags", "Description", ParameterType::List, json!([])),
    ]
}

Using Connectors

Actions can use connectors via dependency injection:

pub struct MyAction {
    connector: Arc<dyn MyConnector + Send + Sync>,
}

impl MyAction {
    pub fn new(connector: Arc<dyn MyConnector + Send + Sync>) -> Self {
        Self { connector }
    }
}

#[async_trait]
impl Action for MyAction {
    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        // Use connector
        let result = self.connector.do_something().await
            .map_err(|e| ActionError::ExecutionFailed(e.to_string()))?;

        // ...
    }
}

Error Handling

Use appropriate error types:

pub enum ActionError {
    /// Missing or invalid parameters
    InvalidParameters(String),

    /// Execution failed
    ExecutionFailed(String),

    /// Action timed out
    Timeout,

    /// Rollback not supported
    RollbackNotSupported,

    /// Policy denied the action
    PolicyDenied(String),
}

Testing

Unit Tests

#[cfg(test)]
mod tests {
    use super::*;
    use uuid::Uuid;

    #[tokio::test]
    async fn test_my_action_success() {
        let action = MyAction::new();

        let context = ActionContext::new(Uuid::new_v4())
            .with_param("target", serde_json::json!("test-target"));

        let result = action.execute(context).await.unwrap();

        assert!(result.success);
        assert_eq!(result.output["target"], "test-target");
    }

    #[tokio::test]
    async fn test_my_action_missing_param() {
        let action = MyAction::new();
        let context = ActionContext::new(Uuid::new_v4());

        let result = action.execute(context).await;

        assert!(matches!(result, Err(ActionError::InvalidParameters(_))));
    }

    #[tokio::test]
    async fn test_my_action_rollback() {
        let action = MyAction::new();
        assert!(action.supports_rollback());

        let context = ActionContext::new(Uuid::new_v4())
            .with_param("target", serde_json::json!("test-target"));

        let result = action.rollback(context).await.unwrap();
        assert!(result.success);
    }
}

Policy Integration

Actions are automatically evaluated by the policy engine. Configure default approval:

# Default policy for new action
[[policy.rules]]
name = "my_action_default"
action = "my_action"
approval_level = "analyst"

Documentation

Document your action:

//! My custom action.
//!
//! This action performs X on target Y.
//!
//! # Parameters
//!
//! - `target` (required): The target to act on
//! - `force` (optional): Force execution (default: false)
//!
//! # Example
//!
//! ```yaml
//! - action: my_action
//!   parameters:
//!     target: "example"
//!     force: true
//! ```
//!
//! # Rollback
//!
//! This action supports rollback via `my_action_rollback`.

Changelog

All notable changes to Triage Warden.

[Unreleased]

Added

  • AI-powered triage agent with Claude integration
  • Configurable playbooks for automated investigation
  • Policy engine with approval workflows
  • Connector framework for external integrations
  • Web dashboard with HTMX
  • REST API for programmatic access
  • CLI for command-line operations

Connectors

  • VirusTotal threat intelligence
  • Splunk SIEM integration
  • CrowdStrike EDR integration
  • Microsoft 365 email gateway
  • Jira ticketing integration

Actions

  • Email: parse_email, check_email_authentication, quarantine_email, block_sender
  • Lookup: lookup_sender_reputation, lookup_urls, lookup_attachments
  • Host: isolate_host, scan_host
  • Notification: notify_user, escalate, create_ticket

[0.1.0] - 2024-01-15

Added

  • Initial release
  • Core incident management
  • Basic web interface
  • SQLite database support
  • Mock connectors for development

Version Numbering

This project follows Semantic Versioning:

  • MAJOR: Incompatible API changes
  • MINOR: Backwards-compatible new features
  • PATCH: Backwards-compatible bug fixes
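Semantic versions compare numerically component by component, not lexicographically; a quick sketch (pre-release and build-metadata suffixes are out of scope here):

```python
def parse_semver(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into an integer tuple for ordering."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))
```

Tuple ordering then gives the right result where string comparison would not (e.g. "0.10.0" sorts after "0.9.0").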

Upgrade Guide

From 0.x to 1.0

When 1.0 is released, an upgrade guide will be provided here.