Triage Warden

AI-powered security incident triage and response platform

Triage Warden automates the analysis of, and response to, security incidents using AI agents, configurable playbooks, and integrations with your existing security stack.

Features

  • AI-Powered Triage: Automated analysis of phishing emails, malware alerts, and suspicious login attempts
  • Configurable Playbooks: Define custom investigation and response workflows
  • Policy Engine: Role-based approval workflows for sensitive actions
  • Connector Framework: Integrate with VirusTotal, Splunk, CrowdStrike, Jira, Microsoft 365, and more
  • Web Dashboard: Real-time incident management with approval workflows
  • REST API: Programmatic access for automation and integration
  • Audit Trail: Complete logging of all actions and decisions

Quick Example

# Analyze a phishing email
tw-cli incident create --type phishing --source "email-gateway" --data '{"subject": "Urgent: Update Account"}'

# Run AI triage
tw-cli triage run --incident INC-2024-001

# View the verdict
tw-cli incident get INC-2024-001 --format json

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        Web Dashboard                             │
│                    (HTMX + Askama Templates)                     │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                         REST API                                 │
│                     (Axum + Tower)                               │
└─────────────────────────────────────────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐    ┌───────────────────┐    ┌───────────────┐
│ Policy Engine │    │   AI Triage Agent │    │    Actions    │
│    (Rust)     │    │     (Python)      │    │    (Rust)     │
└───────────────┘    └───────────────────┘    └───────────────┘
        │                       │                       │
        └───────────────────────┼───────────────────────┘
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Connector Layer                            │
│        (VirusTotal, Splunk, CrowdStrike, Jira, M365)            │
└─────────────────────────────────────────────────────────────────┘

Getting Started

  1. Installation - Install Triage Warden
  2. Quick Start - Create your first incident
  3. Configuration - Configure connectors and policies

License

Triage Warden is licensed under the MIT License.

Getting Started

Welcome to Triage Warden! This guide will help you get up and running quickly.

Prerequisites

  • Rust 1.75+ (for building from source)
  • Python 3.11+ (for AI triage agents)
  • SQLite or PostgreSQL (for data storage)
  • uv (recommended Python package manager)

Installation Options

  1. From Source - Build and run locally
  2. Docker - Run in containers
  3. Pre-built Binaries - Download releases

Next Steps

Installation

Building from Source

Prerequisites

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

Clone and Build

# Clone the repository
git clone https://github.com/zachyking/triage-warden.git
cd triage-warden

# Build Rust components
cargo build --release

# Install Python dependencies
cd python
uv sync

Verify Installation

# Check the CLI
./target/release/tw-cli --version

# Run tests
cargo test
cd python && uv run pytest

Docker

# Build the image
docker build -t triage-warden .

# Run with default settings
docker run -p 8080:8080 triage-warden

# Run with custom configuration
docker run -p 8080:8080 \
  -e TW_DATABASE_URL=postgres://user:pass@host/db \
  -e TW_VIRUSTOTAL_API_KEY=your-key \
  triage-warden

Pre-built Binaries

Download the latest release from the releases page.

Available platforms:

  • Linux x86_64 (glibc)
  • Linux x86_64 (musl)
  • macOS x86_64
  • macOS aarch64 (Apple Silicon)

# Example for macOS
curl -LO https://github.com/zachyking/triage-warden/releases/latest/download/triage-warden-macos-aarch64.tar.gz
tar xzf triage-warden-macos-aarch64.tar.gz
./triage-warden --version

Database Setup

SQLite (Default)

SQLite is used by default. The database file is created automatically:

# Default location
DATABASE_URL=sqlite://./triage_warden.db

# Custom location
DATABASE_URL=sqlite:///var/lib/triage-warden/data.db

PostgreSQL

For production deployments:

# Create database
createdb triage_warden

# Set connection string
export DATABASE_URL=postgres://user:password@localhost/triage_warden

# Run migrations
tw-cli db migrate

Next Steps

Quick Start

Get Triage Warden running and process your first incident in 5 minutes.

1. Start the Server

# Start with default settings (SQLite, mock connectors)
cargo run --bin tw-api

# Or use the release binary
./target/release/tw-api

The web dashboard is now available at http://localhost:8080.

2. Create an Incident

Via Web Dashboard

  1. Open http://localhost:8080 in your browser
  2. Click "New Incident"
  3. Fill in the incident details:
    • Type: Phishing
    • Source: Email Gateway
    • Severity: Medium
  4. Click Create

Via CLI

tw-cli incident create \
  --type phishing \
  --source "email-gateway" \
  --severity medium \
  --data '{
    "subject": "Urgent: Verify Your Account",
    "sender": "[email protected]",
    "recipient": "[email protected]"
  }'

Via API

curl -X POST http://localhost:8080/api/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "incident_type": "phishing",
    "source": "email-gateway",
    "severity": "medium",
    "raw_data": {
      "subject": "Urgent: Verify Your Account",
      "sender": "[email protected]"
    }
  }'

3. Run AI Triage

# Trigger triage for the incident
tw-cli triage run --incident INC-2024-0001

The AI agent will:

  1. Parse email headers and content
  2. Check sender reputation
  3. Analyze URLs and attachments
  4. Generate a verdict with confidence score
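
The verdict pairs a classification with a confidence score, as the example output in the next step shows. A minimal Rust sketch of that pairing (types, names, and the threshold are illustrative, not the actual tw-core definitions):

```rust
// Illustrative triage result; fields mirror the CLI output
// (verdict, confidence, recommended actions) but are not the real tw-core types.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Benign,
    Suspicious,
    Malicious,
}

pub struct TriageResult {
    pub verdict: Verdict,
    pub confidence: f64,
    pub recommended_actions: Vec<String>,
}

impl TriageResult {
    /// Whether the result is strong enough to act on without a human in the loop.
    pub fn auto_actionable(&self, threshold: f64) -> bool {
        self.verdict == Verdict::Malicious && self.confidence >= threshold
    }
}

fn main() {
    let result = TriageResult {
        verdict: Verdict::Malicious,
        confidence: 0.92,
        recommended_actions: vec!["quarantine_email".to_string(), "block_sender".to_string()],
    };
    // With a hypothetical 0.7 threshold, a 0.92-confidence malicious verdict qualifies.
    assert!(result.auto_actionable(0.7));
    println!("{:?}: {} recommended actions", result.verdict, result.recommended_actions.len());
}
```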

4. View the Verdict

# Get incident with triage results
tw-cli incident get INC-2024-0001

# Example output:
# Incident: INC-2024-0001
# Type: phishing
# Status: triaged
# Verdict: malicious
# Confidence: 0.92
# Recommended Actions:
#   - quarantine_email
#   - block_sender
#   - notify_user

5. Execute Actions

Actions may require approval based on your policy configuration:

# Request to quarantine the email
tw-cli action execute --incident INC-2024-0001 --action quarantine_email

# If auto-approved:
# Action executed: quarantine_email (status: completed)

# If requires approval:
# Action pending approval from: Senior Analyst

Approve pending actions via the dashboard at /approvals.

Next Steps

Configuration

Triage Warden is configured through environment variables and configuration files.

Environment Variables

Core Settings

Variable           Description                                      Default
TW_DATABASE_URL    Database connection string                       sqlite://./triage_warden.db
TW_HOST            API server host                                  0.0.0.0
TW_PORT            API server port                                  8080
TW_LOG_LEVEL       Logging level (trace, debug, info, warn, error)  info
TW_ADMIN_PASSWORD  Initial admin password                           (generated)

Connector Selection

Variable               Description                  Values
TW_THREAT_INTEL_MODE   Threat intelligence backend  mock, virustotal
TW_SIEM_MODE           SIEM backend                 mock, splunk
TW_EDR_MODE            EDR backend                  mock, crowdstrike
TW_EMAIL_GATEWAY_MODE  Email gateway backend        mock, m365
TW_TICKETING_MODE      Ticketing backend            mock, jira

VirusTotal

TW_THREAT_INTEL_MODE=virustotal
TW_VIRUSTOTAL_API_KEY=your-api-key-here

Splunk

TW_SIEM_MODE=splunk
TW_SPLUNK_URL=https://splunk.company.com:8089
TW_SPLUNK_TOKEN=your-token-here

CrowdStrike

TW_EDR_MODE=crowdstrike
TW_CROWDSTRIKE_CLIENT_ID=your-client-id
TW_CROWDSTRIKE_CLIENT_SECRET=your-client-secret
TW_CROWDSTRIKE_REGION=us-1  # us-1, us-2, eu-1

Microsoft 365

TW_EMAIL_GATEWAY_MODE=m365
TW_M365_TENANT_ID=your-tenant-id
TW_M365_CLIENT_ID=your-client-id
TW_M365_CLIENT_SECRET=your-client-secret

Jira

TW_TICKETING_MODE=jira
TW_JIRA_URL=https://company.atlassian.net
[email protected]
TW_JIRA_API_TOKEN=your-api-token
TW_JIRA_PROJECT_KEY=SEC

AI Provider

TW_AI_PROVIDER=anthropic  # anthropic, openai
TW_ANTHROPIC_API_KEY=your-api-key
# or
TW_OPENAI_API_KEY=your-api-key

Configuration File

For complex configurations, use a TOML file:

# config.toml

[server]
host = "0.0.0.0"
port = 8080
log_level = "info"

[database]
url = "postgres://user:pass@localhost/triage_warden"
max_connections = 10

[connectors.threat_intel]
mode = "virustotal"
api_key = "${TW_VIRUSTOTAL_API_KEY}"
rate_limit = 4  # requests per minute

[connectors.siem]
mode = "splunk"
url = "https://splunk.company.com:8089"
token = "${TW_SPLUNK_TOKEN}"

[connectors.edr]
mode = "crowdstrike"
client_id = "${TW_CROWDSTRIKE_CLIENT_ID}"
client_secret = "${TW_CROWDSTRIKE_CLIENT_SECRET}"
region = "us-1"

[ai]
provider = "anthropic"
model = "claude-sonnet-4-20250514"
max_tokens = 4096

[policy]
default_action_approval = "auto"  # auto, analyst, senior, manager
high_severity_approval = "senior"
critical_action_approval = "manager"

Load with:

tw-api --config config.toml
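
Values like `api_key = "${TW_VIRUSTOTAL_API_KEY}"` are resolved from the environment when the file is loaded. A minimal sketch of that `${VAR}` substitution, under the assumption that unresolved names are left verbatim (the real loader may instead error):

```rust
// Expand ${VAR} placeholders in a config value. `lookup` supplies variable
// values (e.g. backed by std::env::var); unresolved names are kept verbatim.
// Illustrative only -- not the actual tw-api config loader.
pub fn expand<F: Fn(&str) -> Option<String>>(value: &str, lookup: F) -> String {
    let mut out = String::new();
    let mut rest = value;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        match rest[start + 2..].find('}') {
            Some(end) => {
                let name = &rest[start + 2..start + 2 + end];
                match lookup(name) {
                    Some(v) => out.push_str(&v),
                    None => out.push_str(&rest[start..start + 3 + end]),
                }
                rest = &rest[start + 3 + end..];
            }
            None => {
                // No closing brace: keep the remainder as-is.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    fn lookup(name: &str) -> Option<String> {
        if name == "TW_SPLUNK_TOKEN" { Some("tok-123".to_string()) } else { None }
    }
    assert_eq!(expand("${TW_SPLUNK_TOKEN}", lookup), "tok-123");
    assert_eq!(expand("url=${MISSING}", lookup), "url=${MISSING}");
}
```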

Policy Rules

Policy rules control action approval requirements. See Policy Engine for details.

# Example policy rule
[[policy.rules]]
name = "isolate_host_requires_manager"
action = "isolate_host"
severity = ["high", "critical"]
approval_level = "manager"

Logging

Configure structured logging:

# JSON output for production
TW_LOG_FORMAT=json

# Pretty output for development
TW_LOG_FORMAT=pretty

# Filter specific modules
RUST_LOG=tw_api=debug,tw_core=info

Next Steps

Web Dashboard

Browser-based interface for incident management.

Overview

The dashboard provides:

  • Real-time incident monitoring
  • Approval workflow management
  • Playbook configuration
  • System settings

Access at: http://localhost:8080

Features

Home Dashboard

The main dashboard displays:

  • KPIs: Open incidents, pending approvals, triage rate
  • Recent Incidents: Latest incidents with status
  • Trend Charts: Incident volume over time
  • Quick Actions: Create incident, run playbook

Incident Management

  • List view with filtering and sorting
  • Detail view with full incident context
  • Action execution interface
  • Triage results and reasoning

Approval Workflow

  • Queue of pending approvals
  • One-click approve/reject
  • Bulk approval for related actions
  • SLA countdown timers

Playbook Management

  • Create and edit playbooks
  • Visual step editor
  • Test with sample data
  • Execution history

Settings

  • Connector configuration
  • Policy rule management
  • User administration
  • System preferences

Path            Description
/               Dashboard home
/incidents      Incident list
/incidents/:id  Incident detail
/approvals      Pending approvals
/playbooks      Playbook management
/settings       System settings
/login          Login page

Next Steps

Incidents

Managing incidents in the web dashboard.

Incident List

Access at /incidents

Filtering

  • Status: Open, Triaged, Resolved
  • Severity: Low, Medium, High, Critical
  • Type: Phishing, Malware, Suspicious Login
  • Date Range: Custom time period

Sorting

Click column headers to sort:

  • Created (newest/oldest)
  • Severity (highest/lowest)
  • Status

Bulk Actions

Select multiple incidents for:

  • Bulk resolve
  • Bulk escalate
  • Export to CSV

Incident Detail

Click an incident to view details.

Overview Tab

  • Incident metadata
  • AI verdict and confidence
  • Recommended actions
  • Timeline of events

Raw Data Tab

  • Original incident data (JSON)
  • Parsed email content (for phishing)
  • Detection details (for malware)

Actions Tab

  • Available actions
  • Executed actions with results
  • Pending approvals

Enrichment Tab

  • Threat intelligence results
  • SIEM correlation data
  • Related incidents

Creating Incidents

Click "New Incident" button.

Required Fields

  • Type: Select incident type
  • Source: Origin of the incident
  • Severity: Initial severity assessment

Optional Fields

  • Description: Free-form description
  • Raw Data: JSON payload
  • Assignee: Initial assignment

Executing Actions

From the incident detail page:

  1. Click "Actions" tab
  2. Select action from dropdown
  3. Fill in parameters
  4. Click "Execute"

If approval is required:

  • Action appears in pending state
  • Notification sent to approvers
  • Status updates when approved/rejected

Keyboard Shortcuts

Shortcut  Action
j / k     Navigate list
Enter     Open incident
Esc       Close modal
a         Open actions menu
e         Escalate
r         Resolve

Real-time Updates

The dashboard uses HTMX for live updates:

  • New incidents appear automatically
  • Status changes reflect immediately
  • Approval decisions update in real-time

Approvals

Managing action approvals in the web dashboard.

Approval Queue

Access at /approvals

The queue shows all actions pending your approval based on your role level.

Queue Columns

  • Action: Type of action requested
  • Incident: Related incident
  • Requested By: Who/what requested it
  • Requested At: When requested
  • SLA: Time remaining to respond

Filtering

  • Approval Level: Analyst, Senior, Manager
  • Action Type: Specific actions
  • Incident Type: Phishing, malware, etc.

Approval Detail

Click an approval to see full context.

Context Section

  • Full incident details
  • AI reasoning (if from triage)
  • Related actions already taken

Decision Section

  • Approve: Execute the action
  • Reject: Decline with reason
  • Delegate: Assign to another approver

Approving Actions

Single Approval

  1. Click on pending action
  2. Review incident context
  3. Click "Approve" or "Reject"
  4. Add optional comment
  5. Confirm decision

Bulk Approval

For related actions:

  1. Select multiple actions (checkbox)
  2. Click "Bulk Approve" or "Bulk Reject"
  3. Add comment applying to all
  4. Confirm

Rejection

When rejecting:

  1. Click "Reject"
  2. Required: Enter rejection reason
  3. Optionally suggest alternative
  4. Confirm

The requester is notified of rejection and reason.

SLA Indicators

Color   Meaning
Green   Plenty of time
Yellow  < 50% time remaining
Orange  < 25% time remaining
Red     SLA exceeded
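
The colors map directly onto the fraction of SLA time remaining. A sketch of that mapping, using the thresholds above (function name and exact boundary handling are illustrative):

```rust
// Map the fraction of SLA time remaining to an indicator color.
// Thresholds follow the SLA table; boundary behavior is an assumption.
pub fn sla_color(fraction_remaining: f64) -> &'static str {
    if fraction_remaining <= 0.0 {
        "red" // SLA exceeded
    } else if fraction_remaining < 0.25 {
        "orange"
    } else if fraction_remaining < 0.50 {
        "yellow"
    } else {
        "green"
    }
}

fn main() {
    assert_eq!(sla_color(0.8), "green");
    assert_eq!(sla_color(0.4), "yellow");
    assert_eq!(sla_color(0.2), "orange");
    assert_eq!(sla_color(0.0), "red");
}
```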

Notifications

You receive notifications for:

  • New actions requiring your approval
  • SLA warnings (50%, 75% elapsed)
  • Escalations to your level

Configure notification preferences in Settings.

Delegation

If unavailable:

  1. Go to Settings > Delegation
  2. Select delegate user
  3. Set date range
  4. Delegate receives your approvals

Audit Trail

All approvals are logged:

  • Who approved/rejected
  • When decision was made
  • Time to approve
  • Comments provided

View at Settings > Audit Logs.

Playbooks

Managing playbooks in the web dashboard.

Playbook List

Access at /playbooks

Views

  • Active: Currently enabled playbooks
  • Inactive: Disabled playbooks
  • All: Complete list

Information Displayed

  • Name and description
  • Trigger conditions
  • Last run time
  • Success rate

Creating Playbooks

Click "New Playbook" button.

Basic Information

  • Name: Unique identifier
  • Description: What this playbook does
  • Version: Semantic version

Triggers

Configure when playbook runs:

  • Incident Type: Phishing, malware, etc.
  • Auto Run: Run automatically on new incidents
  • Conditions: Additional criteria

Variables

Define playbook variables:

quarantine_threshold: 0.7
notification_channel: "#security"

Step Editor

Visual editor for playbook steps.

Adding Steps

  1. Click "Add Step"
  2. Select action type
  3. Configure parameters
  4. Set output variable name

Step Types

  • Action: Execute an action
  • Condition: Branch logic
  • AI Analysis: Get AI verdict
  • Parallel: Run steps concurrently

Connections

  • Drag to reorder steps
  • Connect condition branches
  • Set dependencies

Testing Playbooks

Dry Run

  1. Click "Test"
  2. Select or create test incident
  3. Toggle "Dry Run"
  4. View step-by-step execution

With Live Data

  1. Click "Test"
  2. Select real incident
  3. Leave "Dry Run" off
  4. Actions will execute (with approval)

Execution History

View past executions:

  • Execution timestamp
  • Incident processed
  • Steps completed
  • Final verdict
  • Duration

Click execution for detailed trace.

Import/Export

Export

  1. Select playbook
  2. Click "Export"
  3. Download YAML file

Import

  1. Click "Import"
  2. Upload YAML file
  3. Review parsed playbook
  4. Click "Create"

Playbook Versions

Playbooks are versioned:

  1. Edit playbook
  2. Bump version number
  3. Save as new version
  4. Old version kept for rollback

View version history and compare changes.

Settings

System configuration in the web dashboard.

Settings Tabs

Access at /settings

General

  • Instance Name: Display name for this installation
  • Time Zone: Default timezone for display
  • Date Format: Date/time display format
  • Theme: Light/dark mode preference

Connectors

Configure external integrations.

Threat Intelligence

  • Mode: Mock or VirusTotal
  • API Key (for VirusTotal)
  • Rate limit settings

SIEM

  • Mode: Mock or Splunk
  • URL and authentication
  • Default search index

EDR

  • Mode: Mock or CrowdStrike
  • OAuth credentials
  • Region selection

Email Gateway

  • Mode: Mock or Microsoft 365
  • Azure AD configuration
  • Tenant settings

Ticketing

  • Mode: Mock or Jira
  • Instance URL
  • Project configuration

Policies

Manage policy rules.

Creating Rules

  1. Click "Add Rule"
  2. Enter rule name
  3. Define matching criteria
  4. Set decision (allow/deny/approval)
  5. Save

Rule Priority

Drag rules to reorder. First matching rule wins.
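
First-match-wins evaluation amounts to an ordered scan over the rule list. A sketch of that scan (types are illustrative stand-ins, not the real tw-policy API, and the fall-through default of Allow is an assumption):

```rust
// Ordered first-match rule scan: the first rule whose action and severity
// both match decides the outcome. Illustrative types, not tw-policy's.
#[derive(Clone, Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny,
    RequireApproval(String), // approval level, e.g. "manager"
}

pub struct Rule {
    pub action: String,
    pub severities: Vec<String>,
    pub decision: Decision,
}

pub fn evaluate(rules: &[Rule], action: &str, severity: &str) -> Decision {
    rules
        .iter()
        .find(|r| r.action == action && r.severities.iter().any(|s| s == severity))
        .map(|r| r.decision.clone())
        // No rule matched: fall back to a default (assumed Allow here).
        .unwrap_or(Decision::Allow)
}

fn main() {
    let rules = vec![
        Rule {
            action: "isolate_host".into(),
            severities: vec!["high".into(), "critical".into()],
            decision: Decision::RequireApproval("manager".into()),
        },
        Rule {
            action: "isolate_host".into(),
            severities: vec!["low".into(), "high".into()],
            decision: Decision::Allow,
        },
    ];
    // The first matching rule wins: high-severity isolation needs a manager,
    // even though a later rule would allow it.
    assert_eq!(
        evaluate(&rules, "isolate_host", "high"),
        Decision::RequireApproval("manager".into())
    );
    assert_eq!(evaluate(&rules, "isolate_host", "low"), Decision::Allow);
}
```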

Users

User management (admin only).

User List

  • Username and email
  • Role (viewer/analyst/senior/admin)
  • Last login
  • Status (active/disabled)

Creating Users

  1. Click "Add User"
  2. Enter email and username
  3. Set initial role
  4. Generate or set password
  5. Send invitation email

Role Management

Assign roles:

  • Viewer: Read-only access
  • Analyst: Execute actions, approve analyst-level
  • Senior: Approve senior-level
  • Admin: Full access

Notifications

Configure notification preferences.

Channels

  • Email: SMTP settings
  • Slack: Webhook URL
  • Teams: Connector URL
  • PagerDuty: Integration key

Preferences

For each notification type:

  • Enable/disable channel
  • Set priority threshold
  • Configure quiet hours

Audit Logs

View system audit trail.

Filtering

  • Date range
  • Event type
  • User
  • Resource

Export

Export logs to CSV for compliance.

API Keys

Manage API credentials.

Creating Keys

  1. Click "Create API Key"
  2. Enter name and description
  3. Select scopes
  4. Set expiration (optional)
  5. Copy generated key

Revoking Keys

Click "Revoke" on any key. Revocation is immediate.

Backup & Restore

Database management.

Backup

  1. Click "Create Backup"
  2. Wait for completion
  3. Download backup file

Restore

  1. Click "Restore"
  2. Upload backup file
  3. Confirm restore
  4. System restarts

About

System information:

  • Version number
  • Build information
  • License status
  • Support links

CLI Reference

Command-line interface for Triage Warden.

Installation

The CLI is built with the main project:

cargo build --release
./target/release/tw-cli --help

Global Options

tw-cli [OPTIONS] <COMMAND>

Options:
  -c, --config <FILE>     Configuration file path
  -v, --verbose           Enable verbose output
  -q, --quiet             Suppress non-error output
  --json                  Output as JSON
  -h, --help              Print help
  -V, --version           Print version

Environment Variables

Variable    Description
TW_API_URL  API server URL (default: http://localhost:8080)
TW_API_KEY  API key for authentication
TW_CONFIG   Path to config file

Commands Overview

Command    Description
incident   Manage incidents
action     Execute and manage actions
triage     Run AI triage
playbook   Manage playbooks
policy     Manage policy rules
connector  Manage connectors
user       User management
api-key    API key management
webhook    Webhook management
config     Configuration management
db         Database operations
serve      Start API server

Quick Examples

# List open incidents
tw-cli incident list --status open

# Create incident
tw-cli incident create --type phishing --severity high

# Run triage
tw-cli triage run --incident INC-2024-001

# Execute action
tw-cli action execute --incident INC-2024-001 --action quarantine_email

# Approve pending action
tw-cli action approve act-abc123

# Start server
tw-cli serve --port 8080

Next Steps

CLI Commands

Detailed reference for all CLI commands.

incident

Manage security incidents.

list

tw-cli incident list [OPTIONS]

Options:
  --status <STATUS>      Filter by status (open, triaged, resolved)
  --severity <SEVERITY>  Filter by severity
  --type <TYPE>          Filter by incident type
  --limit <N>            Maximum results (default: 20)
  --offset <N>           Skip first N results
  --sort <FIELD>         Sort field (created_at, severity)
  --desc                 Sort descending

get

tw-cli incident get <ID> [OPTIONS]

Options:
  --format <FORMAT>      Output format (table, json, yaml)
  --include-actions      Include action history
  --include-enrichment   Include enrichment data

create

tw-cli incident create [OPTIONS]

Options:
  --type <TYPE>          Incident type (required)
  --source <SOURCE>      Incident source (required)
  --severity <SEVERITY>  Initial severity (default: medium)
  --data <JSON>          Raw incident data as JSON
  --file <FILE>          Read data from file
  --auto-triage          Run triage after creation

update

tw-cli incident update <ID> [OPTIONS]

Options:
  --severity <SEVERITY>  Update severity
  --status <STATUS>      Update status
  --assignee <USER>      Assign to user

resolve

tw-cli incident resolve <ID> [OPTIONS]

Options:
  --resolution <TEXT>    Resolution notes
  --false-positive       Mark as false positive

action

Execute and manage actions.

execute

tw-cli action execute [OPTIONS]

Options:
  --incident <ID>        Associated incident
  --action <NAME>        Action to execute (required)
  --param <KEY=VALUE>    Action parameter (repeatable)
  --emergency            Emergency override (manager only)

list

tw-cli action list [OPTIONS]

Options:
  --incident <ID>        Filter by incident
  --status <STATUS>      Filter by status
  --pending              Show only pending approval

get

tw-cli action get <ID>

approve

tw-cli action approve <ID> [OPTIONS]

Options:
  --comment <TEXT>       Approval comment

reject

tw-cli action reject <ID> [OPTIONS]

Options:
  --reason <TEXT>        Rejection reason (required)

rollback

tw-cli action rollback <ID> [OPTIONS]

Options:
  --reason <TEXT>        Rollback reason

triage

Run AI triage.

run

tw-cli triage run [OPTIONS]

Options:
  --incident <ID>        Incident to triage (required)
  --playbook <NAME>      Specific playbook
  --model <MODEL>        AI model override
  --wait                 Wait for completion

status

tw-cli triage status <TRIAGE_ID>

playbook

Manage playbooks.

list

tw-cli playbook list [OPTIONS]

Options:
  --enabled              Only enabled playbooks
  --trigger-type <TYPE>  Filter by trigger type

get

tw-cli playbook get <ID>

add

tw-cli playbook add <FILE>

update

tw-cli playbook update <ID> <FILE>

delete

tw-cli playbook delete <ID>

run

tw-cli playbook run <ID> [OPTIONS]

Options:
  --incident <ID>        Incident to process
  --var <KEY=VALUE>      Override variable (repeatable)
  --dry-run              Don't execute actions

test

tw-cli playbook test <NAME> [OPTIONS]

Options:
  --incident <ID>        Use existing incident
  --data <JSON>          Use mock data
  --dry-run              Don't execute actions

validate

tw-cli playbook validate <FILE>

export

tw-cli playbook export <ID> [OPTIONS]

Options:
  -o, --output <FILE>    Output file (default: stdout)

policy

Manage policy rules.

list

tw-cli policy list

add

tw-cli policy add [OPTIONS]

Options:
  --name <NAME>          Rule name (required)
  --action <ACTION>      Action to match
  --pattern <PATTERN>    Action pattern (glob)
  --severity <SEVERITY>  Severity condition
  --approval-level <L>   Required approval level
  --allow                Auto-allow
  --deny                 Deny with reason
  --reason <TEXT>        Denial reason

delete

tw-cli policy delete <NAME>

test

tw-cli policy test [OPTIONS]

Options:
  --action <ACTION>      Action to test
  --severity <SEVERITY>  Incident severity
  --proposer-type <T>    Proposer type
  --confidence <N>       AI confidence score

connector

Manage connectors.

status

tw-cli connector status

test

tw-cli connector test <NAME>

configure

tw-cli connector configure <NAME> [OPTIONS]

Options:
  --mode <MODE>          Connector mode
  --api-key <KEY>        API key
  --url <URL>            Service URL

user

User management.

list

tw-cli user list

create

tw-cli user create [OPTIONS]

Options:
  --username <NAME>      Username (required)
  --email <EMAIL>        Email address
  --role <ROLE>          User role
  --service-account      Create as service account

update

tw-cli user update <ID> [OPTIONS]

Options:
  --role <ROLE>          New role
  --enabled              Enable user
  --disabled             Disable user

delete

tw-cli user delete <ID>

api-key

API key management.

list

tw-cli api-key list

create

tw-cli api-key create [OPTIONS]

Options:
  --name <NAME>          Key name (required)
  --scopes <SCOPES>      Comma-separated scopes
  --user <USER>          Associated user
  --expires <DATE>       Expiration date

revoke

tw-cli api-key revoke <PREFIX>

rotate

tw-cli api-key rotate <PREFIX>

webhook

Webhook management.

list

tw-cli webhook list

add

tw-cli webhook add <SOURCE> [OPTIONS]

Options:
  --secret <SECRET>      Webhook secret
  --auto-triage          Enable auto-triage
  --playbook <NAME>      Playbook to run

test

tw-cli webhook test <SOURCE>

delete

tw-cli webhook delete <SOURCE>

db

Database operations.

migrate

tw-cli db migrate

backup

tw-cli db backup [OPTIONS]

Options:
  -o, --output <FILE>    Backup file path

restore

tw-cli db restore <FILE>

serve

Start the API server.

tw-cli serve [OPTIONS]

Options:
  --host <HOST>          Bind address (default: 0.0.0.0)
  --port <PORT>          Port number (default: 8080)
  --config <FILE>        Configuration file

Architecture Overview

Triage Warden is built as a modular, layered system combining Rust for performance-critical components and Python for AI capabilities.

System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                           Clients                                    │
│              (Web Browser, CLI, API Consumers)                       │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         API Layer (tw-api)                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  REST API   │  │ Web Handlers│  │  Webhooks   │  │   Metrics   │ │
│  │   (Axum)    │  │(HTMX+Askama)│  │             │  │ (Prometheus)│ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
                                   │
        ┌──────────────────────────┼──────────────────────────┐
        ▼                          ▼                          ▼
┌───────────────┐        ┌───────────────────┐       ┌───────────────┐
│ Policy Engine │        │   Action Registry │       │  Event Bus    │
│   (tw-policy) │        │    (tw-actions)   │       │  (tw-core)    │
└───────────────┘        └───────────────────┘       └───────────────┘
        │                          │                          │
        └──────────────────────────┼──────────────────────────┘
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Core Domain (tw-core)                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  Incidents  │  │  Playbooks  │  │   Users     │  │   Audit     │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Database Layer (SQLx)                             │
│              (SQLite for dev, PostgreSQL for prod)                   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                      Python Bridge (tw-bridge)                       │
│                          (PyO3 Bindings)                             │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       AI Layer (tw_ai)                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │Triage Agent │  │    Tools    │  │  Playbook   │  │  Evaluation │ │
│  │  (Claude)   │  │             │  │   Engine    │  │  Framework  │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Connector Layer (tw-connectors)                   │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │ VirusTotal  │  │   Splunk    │  │ CrowdStrike │  │    Jira     │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Crate Structure

Crate             Purpose
tw-api            HTTP server, REST API, web handlers, webhooks
tw-core           Domain models, database repositories, event bus
tw-actions        Action handlers (quarantine, isolate, notify, etc.)
tw-policy         Policy engine, approval rules, decision evaluation
tw-connectors     External service integrations (VirusTotal, Splunk, etc.)
tw-bridge         PyO3 bindings exposing Rust to Python
tw-cli            Command-line interface
tw-observability  Metrics, tracing, logging infrastructure

Key Design Decisions

Rust + Python Hybrid

  • Rust: Core platform, API server, policy engine, actions
  • Python: AI agents, LLM integrations, playbook execution
  • Bridge: PyO3 enables Python to call Rust connectors and actions

Trait-Based Connectors

All connectors implement traits for testability:

#[async_trait]
pub trait ThreatIntelConnector: Send + Sync {
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport>;
    async fn lookup_url(&self, url: &str) -> ConnectorResult<ThreatReport>;
    async fn lookup_domain(&self, domain: &str) -> ConnectorResult<ThreatReport>;
}
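
Because callers depend only on the trait, a mock backend slots in wherever the production one does. A simplified synchronous analogue showing that swap (the real trait is async and the types here are stand-ins, not tw-connectors code):

```rust
// Simplified, synchronous stand-in for the connector trait, to show how a
// mock backend (TW_THREAT_INTEL_MODE=mock) slots in. The real trait is async
// and returns ConnectorResult; these types are illustrative.
pub struct ThreatReport {
    pub indicator: String,
    pub malicious: bool,
}

pub trait ThreatIntel {
    fn lookup_hash(&self, hash: &str) -> ThreatReport;
}

// Mock backend: flags a fixed set of known-bad hashes.
pub struct MockThreatIntel {
    pub known_bad: Vec<String>,
}

impl ThreatIntel for MockThreatIntel {
    fn lookup_hash(&self, hash: &str) -> ThreatReport {
        ThreatReport {
            indicator: hash.to_string(),
            malicious: self.known_bad.iter().any(|h| h == hash),
        }
    }
}

fn main() {
    // Code under test depends on `dyn ThreatIntel`, never on a concrete backend.
    let intel: Box<dyn ThreatIntel> = Box::new(MockThreatIntel {
        known_bad: vec!["44d88612fea8a8f36de82e1278abb02f".to_string()],
    });
    assert!(intel.lookup_hash("44d88612fea8a8f36de82e1278abb02f").malicious);
    assert!(!intel.lookup_hash("0000").malicious);
}
```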

Event-Driven Architecture

The event bus enables loose coupling:

event_bus.publish(Event::IncidentCreated { id, incident_type });
event_bus.publish(Event::ActionExecuted { action_id, result });

Policy-First Actions

All actions pass through the policy engine:

Request → Policy Evaluation → (Allowed | Denied | RequiresApproval) → Execute
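
That gate can be pictured as a three-way decision type the executor matches on (a sketch; names and the dispatch strings are illustrative, not the actual tw-policy API):

```rust
// The gate in front of every action: policy evaluation yields one of three
// outcomes and the executor dispatches on it. Illustrative names only.
pub enum PolicyDecision {
    Allowed,
    Denied { reason: String },
    RequiresApproval { level: String },
}

pub fn dispatch(decision: PolicyDecision) -> String {
    match decision {
        PolicyDecision::Allowed => "executing action".to_string(),
        PolicyDecision::Denied { reason } => format!("rejected: {reason}"),
        PolicyDecision::RequiresApproval { level } => format!("queued for {level} approval"),
    }
}

fn main() {
    assert_eq!(dispatch(PolicyDecision::Allowed), "executing action");
    assert_eq!(
        dispatch(PolicyDecision::RequiresApproval { level: "manager".into() }),
        "queued for manager approval"
    );
}
```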

Next Steps

Components

Detailed description of each major component in Triage Warden.

tw-api

The HTTP server and web interface.

REST API Routes

| Route | Description |
| --- | --- |
| GET /api/incidents | List incidents with filtering |
| POST /api/incidents | Create new incident |
| GET /api/incidents/:id | Get incident details |
| POST /api/incidents/:id/actions | Execute action on incident |
| GET /api/playbooks | List playbooks |
| POST /api/webhooks/:source | Receive webhook events |

Web Handlers

Server-rendered pages using HTMX and Askama templates:

  • Dashboard with KPIs
  • Incident list and detail views
  • Approval workflow interface
  • Playbook management
  • Settings configuration

Authentication

  • Session-based auth for web dashboard
  • API key auth for programmatic access
  • Role-based access control (admin, analyst, viewer)

tw-core

Core domain logic and data access.

Domain Models

pub struct Incident {
    pub id: Uuid,
    pub incident_type: IncidentType,
    pub severity: Severity,
    pub status: IncidentStatus,
    pub source: String,
    pub raw_data: serde_json::Value,
    pub verdict: Option<Verdict>,
    pub confidence: Option<f64>,
    pub created_at: DateTime<Utc>,
}

pub struct Action {
    pub id: Uuid,
    pub incident_id: Uuid,
    pub action_type: ActionType,
    pub status: ActionStatus,
    pub approval_level: Option<ApprovalLevel>,
    pub executed_by: Option<String>,
}

Repositories

Database access layer with SQLite and PostgreSQL support:

  • IncidentRepository
  • ActionRepository
  • PlaybookRepository
  • UserRepository
  • AuditRepository

Event Bus

Async event distribution:

pub enum Event {
    IncidentCreated { id: Uuid },
    IncidentUpdated { id: Uuid },
    ActionRequested { id: Uuid },
    ActionApproved { id: Uuid, approver: String },
    ActionExecuted { id: Uuid, success: bool },
}

tw-actions

Action handlers for incident response.

Email Actions

| Action | Description |
| --- | --- |
| parse_email | Extract headers, body, attachments |
| check_email_authentication | Validate SPF/DKIM/DMARC |
| quarantine_email | Move to quarantine |
| block_sender | Add to blocklist |

Lookup Actions

| Action | Description |
| --- | --- |
| lookup_sender_reputation | Check sender against threat intel |
| lookup_urls | Analyze URLs in content |
| lookup_attachments | Hash and check attachments |

Host Actions

| Action | Description |
| --- | --- |
| isolate_host | Network isolation via EDR |
| scan_host | Trigger endpoint scan |

Notification Actions

| Action | Description |
| --- | --- |
| notify_user | Send user notification |
| notify_reporter | Update incident reporter |
| escalate | Route to approval level |
| create_ticket | Create Jira ticket |

tw-policy

Policy engine for action approval.

Rule Evaluation

pub struct PolicyRule {
    pub name: String,
    pub action_type: ActionType,
    pub conditions: Vec<Condition>,
    pub approval_level: ApprovalLevel,
}

pub enum PolicyDecision {
    Allowed,
    Denied { reason: String },
    RequiresApproval { level: ApprovalLevel },
}

Approval Levels

  1. Auto - No approval required
  2. Analyst - Any analyst can approve
  3. Senior - Senior analyst required
  4. Manager - SOC manager required
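
One simple way to model these levels is an ordered enum, so "can this user approve that request" becomes a comparison. A sketch, assuming higher levels subsume lower ones (an assumption about the approval semantics, not taken from tw-policy):

```python
from enum import IntEnum

class ApprovalLevel(IntEnum):
    AUTO = 1      # no approval required
    ANALYST = 2   # any analyst can approve
    SENIOR = 3    # senior analyst required
    MANAGER = 4   # SOC manager required

def can_approve(user_level: ApprovalLevel, required: ApprovalLevel) -> bool:
    """A user may approve any request at or below their own level."""
    return user_level >= required

print(can_approve(ApprovalLevel.SENIOR, ApprovalLevel.ANALYST))   # True
print(can_approve(ApprovalLevel.ANALYST, ApprovalLevel.MANAGER))  # False
```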

tw-connectors

External service integrations.

Connector Trait

#[async_trait]
pub trait Connector: Send + Sync {
    fn name(&self) -> &str;
    fn connector_type(&self) -> &str;
    async fn health_check(&self) -> ConnectorResult<ConnectorHealth>;
    async fn test_connection(&self) -> ConnectorResult<bool>;
}

Available Connectors

| Type | Implementations |
| --- | --- |
| Threat Intel | VirusTotal, Mock |
| SIEM | Splunk, Mock |
| EDR | CrowdStrike, Mock |
| Email Gateway | Microsoft 365, Mock |
| Ticketing | Jira, Mock |

tw-bridge

PyO3 bindings for Python integration.

Exposed Classes

from tw_bridge import ThreatIntelBridge, SIEMBridge, EDRBridge

# Use connectors from Python
threat_intel = ThreatIntelBridge("virustotal")
result = threat_intel.lookup_hash("abc123...")

tw_ai (Python)

AI triage and playbook execution.

Triage Agent

Claude-powered agent for incident analysis:

agent = TriageAgent(model="claude-sonnet-4-20250514")
verdict = await agent.analyze(incident)
# Returns: Verdict(classification="malicious", confidence=0.92, ...)

Playbook Engine

YAML-based playbook execution:

name: phishing_triage
steps:
  - action: parse_email
  - action: check_email_authentication
  - action: lookup_sender_reputation
  - condition: sender_reputation < 0.3
    action: quarantine_email
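
Conditions like `sender_reputation < 0.3` can be checked against the results accumulated by earlier steps. A minimal sketch of such an evaluator (the real playbook engine's condition syntax may be richer than this `field op number` form):

```python
import operator
import re

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq,
       "<": operator.lt, ">": operator.gt}

def check(condition: str, context: dict) -> bool:
    """Evaluate a 'field op number' condition against prior step results."""
    m = re.fullmatch(r"(\w+)\s*(<=|>=|==|<|>)\s*([\d.]+)", condition.strip())
    if not m:
        raise ValueError(f"unsupported condition: {condition}")
    field, op, value = m.groups()
    return OPS[op](context[field], float(value))

print(check("sender_reputation < 0.3", {"sender_reputation": 0.12}))  # True
```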

Data Flow

How data moves through Triage Warden from incident creation to resolution.

Incident Lifecycle

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Created   │────▶│   Triaging  │────▶│   Triaged   │────▶│  Resolved   │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │                   │
       ▼                   ▼                   ▼                   ▼
   Webhook/API        AI Agent          Actions Executed      Closed
   receives data      analyzes          (with approval)

Detailed Flow

1. Incident Creation

External Source (Email Gateway, SIEM, EDR)
                    │
                    ▼
            Webhook Endpoint
            /api/webhooks/:source
                    │
                    ▼
         ┌──────────────────┐
         │  Parse & Validate │
         │  Incoming Data    │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Create Incident   │
         │ Record in DB      │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Publish Event:    │
         │ IncidentCreated   │
         └──────────────────┘

2. AI Triage

         ┌──────────────────┐
         │ Event: Incident   │
         │ Created           │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Load Playbook     │
         │ (based on type)   │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Execute Playbook  │
         │ Steps             │
         └──────────────────┘
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
┌───────────────┐       ┌───────────────┐
│ Enrichment    │       │ AI Analysis   │
│ Actions       │       │ (Claude)      │
│ - parse_email │       │               │
│ - lookup_*    │       │ Generates:    │
└───────────────┘       │ - Verdict     │
        │               │ - Confidence  │
        │               │ - Reasoning   │
        │               │ - Actions     │
        └───────┬───────┴───────────────┘
                │
                ▼
         ┌──────────────────┐
         │ Update Incident   │
         │ with Verdict      │
         └──────────────────┘

3. Action Execution

         ┌──────────────────┐
         │ Action Request    │
         │ (from agent or    │
         │  human)           │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Build Action      │
         │ Context           │
         └──────────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ Policy Engine     │
         │ Evaluation        │
         └──────────────────┘
                    │
        ┌───────────┼───────────┐
        ▼           ▼           ▼
   ┌────────┐  ┌────────┐  ┌────────┐
   │Allowed │  │Denied  │  │Requires│
   │        │  │        │  │Approval│
   └────────┘  └────────┘  └────────┘
        │           │           │
        ▼           ▼           ▼
   Execute      Return       Queue for
   Action       Error        Approval
        │                       │
        │                       ▼
        │              ┌──────────────┐
        │              │ Notify       │
        │              │ Approvers    │
        │              └──────────────┘
        │                       │
        │                       ▼
        │              ┌──────────────┐
        │              │ Wait for     │
        │              │ Approval     │
        │              └──────────────┘
        │                       │
        │        ┌──────────────┴──────────────┐
        │        ▼                             ▼
        │   ┌────────┐                    ┌────────┐
        │   │Approved│                    │Rejected│
        │   └────────┘                    └────────┘
        │        │                             │
        │        ▼                             ▼
        │   Execute Action               Update Status
        │        │
        └────────┴─────────┐
                           ▼
                  ┌──────────────┐
                  │ Connector    │
                  │ Execution    │
                  │ (External    │
                  │  Service)    │
                  └──────────────┘
                           │
                           ▼
                  ┌──────────────┐
                  │ Update       │
                  │ Action       │
                  │ Status       │
                  └──────────────┘
                           │
                           ▼
                  ┌──────────────┐
                  │ Audit Log    │
                  │ Entry        │
                  └──────────────┘

Data Stores

Primary Database

| Table | Purpose |
| --- | --- |
| incidents | Incident records |
| actions | Action requests and results |
| playbooks | Playbook definitions |
| users | User accounts |
| sessions | Active sessions |
| api_keys | API credentials |
| audit_logs | Action audit trail |
| connectors | Connector configurations |
| policies | Policy rules |
| notifications | Notification history |
| settings | System settings |

Event Bus (In-Memory)

Transient event distribution for real-time updates:

  • Incident lifecycle events
  • Action status changes
  • Approval notifications
  • System health events

External Data Flow

Inbound (Webhooks)

Email Gateway ──────┐
SIEM Alerts ────────┼──▶ Webhook Handler ──▶ Incident Creation
EDR Events ─────────┘

Outbound (Connectors)

                           ┌──▶ VirusTotal (threat intel)
Action Execution ──────────┼──▶ Splunk (SIEM queries)
                           ├──▶ CrowdStrike (host actions)
                           ├──▶ M365 (email actions)
                           └──▶ Jira (ticketing)

Metrics Flow

Rust Components ───┬──▶ Prometheus Registry ──▶ /metrics endpoint
Python Components ─┘

Exposed metrics:

  • triage_warden_incidents_total{type, severity}
  • triage_warden_actions_total{action, status}
  • triage_warden_triage_duration_seconds{type}
  • triage_warden_connector_requests_total{connector, status}

Security Model

Triage Warden implements defense-in-depth with multiple security layers.

Authentication

Web Dashboard

Session-based authentication with secure cookies:

  • Session tokens: Random 256-bit tokens
  • Cookie settings: HttpOnly, Secure, SameSite=Lax
  • Session duration: 8 hours (configurable)
  • CSRF protection: Per-request tokens on all state-changing forms

API Access

API key authentication for programmatic access:

curl -H "Authorization: Bearer tw_abc123_secretkey" \
  https://api.example.com/api/incidents

API key features:

  • Prefix stored in plain text for lookup (tw_abc123)
  • Secret portion hashed with Argon2
  • Scopes limit allowed operations
  • Expiration dates supported
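
The prefix/secret split can be sketched as follows. SHA-256 is used here for brevity (the text above mentions Argon2 for the secret portion), and the exact key format is illustrative:

```python
import hashlib
import secrets

def issue_key():
    """Generate a key shaped like tw_<prefix>_<secret>; store only prefix + hash."""
    prefix = f"tw_{secrets.token_hex(4)}"
    secret = secrets.token_urlsafe(32)
    key = f"{prefix}_{secret}"
    record = {"prefix": prefix,
              "hash": hashlib.sha256(key.encode()).hexdigest()}
    return key, record

def verify(key: str, store: dict) -> bool:
    """Locate the record by plain-text prefix, then compare hashes."""
    prefix = "_".join(key.split("_")[:2])   # e.g. "tw_abc123"
    record = store.get(prefix)
    return (record is not None and
            record["hash"] == hashlib.sha256(key.encode()).hexdigest())

key, record = issue_key()
store = {record["prefix"]: record}
print(verify(key, store))                   # True
print(verify("tw_abc123_wrongkey", store))  # False
```

The plain-text prefix makes lookup cheap while keeping the full key unrecoverable from the database.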

Authorization

Role-Based Access Control (RBAC)

| Role | Capabilities |
| --- | --- |
| Viewer | Read incidents, view dashboards |
| Analyst | Viewer + execute low-risk actions, approve analyst-level |
| Senior Analyst | Analyst + execute medium-risk actions, approve senior-level |
| Admin | Full access, user management, system configuration |

Policy-Based Action Control

The policy engine evaluates every action request:

Policy evaluation flow:

ActionRequest
    → Build ActionContext (action_type, target, severity, proposer)
    → Evaluate policy rules
    → Return PolicyDecision
        - Allowed: Execute immediately
        - Denied: Return error with reason
        - RequiresApproval: Queue for specified approval level

Example Policy Rules

# Low-risk actions auto-approve
[[policy.rules]]
name = "auto_approve_lookups"
action_patterns = ["lookup_*"]
decision = "allowed"

# High-severity host isolation requires manager
[[policy.rules]]
name = "isolate_requires_manager"
action = "isolate_host"
severity = ["high", "critical"]
approval_level = "manager"

# Block dangerous actions on production
[[policy.rules]]
name = "no_delete_in_prod"
action_patterns = ["delete_*"]
environment = "production"
decision = "denied"
reason = "Deletion not allowed in production"
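
How `action_patterns` and the severity/environment filters might be matched, sketched with shell-style globbing (an assumption; the actual matching semantics live in tw-policy):

```python
from fnmatch import fnmatch

def rule_matches(rule: dict, action: str, severity=None, environment=None) -> bool:
    """Match a request against a rule's action patterns and optional filters."""
    patterns = rule.get("action_patterns") or [rule.get("action", "")]
    if not any(fnmatch(action, p) for p in patterns):
        return False
    if rule.get("severity") and severity not in rule["severity"]:
        return False
    if rule.get("environment") and environment != rule["environment"]:
        return False
    return True

rule = {"action_patterns": ["lookup_*"]}
print(rule_matches(rule, "lookup_urls"))   # True
print(rule_matches(rule, "isolate_host"))  # False
```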

Multi-Tenant Isolation

Triage Warden supports multi-tenancy with strong data isolation guarantees.

Row-Level Security (RLS)

PostgreSQL Row-Level Security provides database-level tenant isolation:

-- Each table has RLS policies that filter by tenant
-- Application sets tenant context at the start of each request
SELECT set_tenant_context('tenant-uuid-here');

-- All subsequent queries automatically filtered
SELECT * FROM incidents;  -- Only returns current tenant's data

Key Features:

| Feature | Description |
| --- | --- |
| Automatic filtering | All SELECT/UPDATE/DELETE queries filtered by tenant |
| Insert validation | INSERT must match current tenant context |
| Fail-secure | No tenant context = no data access |
| Defense-in-depth | Database enforces isolation even if app has bugs |

Tenant Context Management

The application manages tenant context through several mechanisms:

  1. Request Middleware: Resolves tenant from subdomain, header, or JWT
  2. Session Variable: Sets app.current_tenant on each database connection
  3. Context Guard: RAII pattern ensures cleanup

// Using the tenant context guard
async fn handle_request(pool: &TenantAwarePool, tenant_id: Uuid) -> anyhow::Result<()> {
    let _guard = TenantContextGuard::new(pool, tenant_id).await?;

    // All queries here are automatically filtered by tenant
    let incidents = incident_repo.list_all().await?;

    // Context cleared when the guard drops
    Ok(())
}

Admin Operations

Admin operations that need to bypass RLS use a separate connection pool:

  • Admin pool: Superuser role that bypasses RLS policies
  • Use cases: Tenant management, cross-tenant reporting, maintenance
  • Access control: Restricted to Admin role users only

Tables Protected by RLS

All tenant-scoped data tables have RLS enabled:

  • incidents, actions, approvals, audit_logs
  • users, api_keys, sessions
  • playbooks, policies, connectors
  • notification_channels, settings

System tables (tenants, feature_flags) do NOT have RLS.

Debugging RLS Issues

-- Check current tenant context
SELECT get_current_tenant();

-- View RLS policies for a table
SELECT * FROM pg_policies WHERE tablename = 'incidents';

-- Check if RLS is enabled
SELECT relname, relrowsecurity
FROM pg_class
WHERE relname IN ('incidents', 'tenants');

Data Protection

At Rest

  • Database encryption: SQLite with SQLCipher (optional), PostgreSQL with TDE
  • Credential storage: All API keys/tokens hashed with Argon2id
  • Secrets management: Environment variables or external secret stores

In Transit

  • TLS 1.3: Required for all external connections
  • Certificate validation: Strict validation for connectors
  • Internal traffic: TLS optional for localhost development

Sensitive Data Handling

// Credentials redacted in logs
impl std::fmt::Debug for ApiKey {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "ApiKey {{ prefix: {}, secret: [REDACTED] }}", self.prefix)
    }
}

Audit Trail

All security-relevant actions logged:

| Event | Data Captured |
| --- | --- |
| Login | user_id, ip_address, success, timestamp |
| Logout | user_id, session_duration |
| Action executed | action_id, user_id, incident_id, result |
| Action approved | action_id, approver_id, decision |
| Policy change | user_id, old_value, new_value |
| User management | admin_id, target_user, operation |

Audit log retention: 90 days (configurable)

Connector Security

Credential Management

Connector credentials stored encrypted:

# Environment variables (recommended)
TW_VIRUSTOTAL_API_KEY=your-key

# Or store encrypted in the database (read the key without echoing it)
read -rs TW_KEY && tw-cli connector set virustotal --api-key "$TW_KEY"

Rate Limiting

Built-in rate limiting prevents API abuse:

| Connector | Default Limit |
| --- | --- |
| VirusTotal | 4 req/min (free tier) |
| Splunk | 100 req/min |
| CrowdStrike | 50 req/min |
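
A token bucket is a common way to implement such per-connector limits: the bucket refills continuously at the configured rate, and a request proceeds only if a whole token is available. A minimal sketch (not the actual tw-connectors implementation):

```python
import time

class TokenBucket:
    """Simple token-bucket limiter: `rate` requests per `per` seconds."""
    def __init__(self, rate, per, clock=time.monotonic):
        self.capacity = rate
        self.tokens = float(rate)
        self.fill_rate = rate / per
        self.clock = clock
        self.last = clock()

    def allow(self):
        # Refill proportionally to elapsed time, capped at capacity
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.fill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# VirusTotal free tier: 4 requests per minute
bucket = TokenBucket(rate=4, per=60)
print([bucket.allow() for _ in range(5)])  # [True, True, True, True, False]
```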

Circuit Breaker

Automatic failure handling:

  • After 5 consecutive failures, the circuit opens
  • While open, requests fail fast for 30 seconds
  • A half-open state then admits test requests before fully closing
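
The three states and the transitions between them can be sketched as follows (illustrative only; thresholds match the defaults described above):

```python
import time

class CircuitBreaker:
    """Closed → Open after `threshold` consecutive failures; Half-Open after `cooldown`."""
    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half_open"   # allow a test request through
        return "open"            # fail fast

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()

cb = CircuitBreaker()
for _ in range(5):
    cb.record(success=False)
print(cb.state())  # open
```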

Input Validation

API Requests

  • JSON schema validation on all endpoints
  • Size limits on request bodies (1MB default)
  • Type coercion disabled (strict typing)

Webhook Payloads

  • HMAC signature verification
  • Replay attack prevention (timestamp validation)
  • Payload size limits

// Webhook signature verification (hmac_sha256 and constant_time_compare
// are illustrative helpers)
fn verify_webhook(payload: &[u8], signature: &str, secret: &str) -> bool {
    let expected = hmac_sha256(secret, payload);
    constant_time_compare(signature, &expected)
}
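
Python's standard library can express the same check; `hmac.compare_digest` provides the constant-time comparison that defeats timing attacks on the signature:

```python
import hashlib
import hmac

def sign(secret: bytes, payload: bytes) -> str:
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify_webhook(payload: bytes, signature: str, secret: bytes) -> bool:
    # compare_digest runs in constant time regardless of where strings differ
    return hmac.compare_digest(signature, sign(secret, payload))

payload = b'{"event": "alert"}'
secret = b"webhook-shared-secret"
print(verify_webhook(payload, sign(secret, payload), secret))  # True
print(verify_webhook(payload, "forged-signature", secret))     # False
```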

Secure Defaults

  • HTTPS enforced in production
  • Secure cookie flags enabled
  • CORS restricted to configured origins
  • Debug endpoints disabled in production
  • Verbose errors only in development

Security Headers

Default response headers:

Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
Content-Security-Policy: default-src 'self'

Vulnerability Disclosure

Report security vulnerabilities to: [email protected]

We follow responsible disclosure practices and aim to respond within 48 hours.

Database Schema

Triage Warden supports both SQLite (development/small deployments) and PostgreSQL (production). This document describes the database schema used by both backends.

Overview

The database consists of 14 tables organized into four logical groups:

  • Core Incident Management: incidents, audit_logs, actions, approvals
  • Configuration: playbooks, connectors, policies, notification_channels, settings
  • Authentication: users, sessions, api_keys
  • Multi-Tenancy: tenants, feature_flags

Multi-Tenancy

All tenant-scoped tables include a tenant_id foreign key that references the tenants table. In PostgreSQL, Row-Level Security (RLS) policies automatically filter all queries by the current tenant context.

tenants

Tenant organizations in a multi-tenant deployment.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Organization display name |
| slug | TEXT | UNIQUE, NOT NULL | URL-safe identifier for routing |
| status | ENUM/TEXT | DEFAULT 'active' | active, suspended, pending_deletion |
| settings | JSON/TEXT | DEFAULT '{}' | Tenant-specific settings |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: slug (unique), status

feature_flags

Feature flag configuration for gradual rollouts.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| name | TEXT | PRIMARY KEY | Flag name |
| description | TEXT | DEFAULT '' | Flag description |
| default_enabled | BOOLEAN | DEFAULT FALSE | Default state |
| tenant_overrides | JSON | DEFAULT '{}' | Per-tenant overrides |
| percentage_rollout | INTEGER | NULLABLE | 0-100 percentage rollout |
| created_at | TIMESTAMP | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP | NOT NULL | Last update timestamp |

Note: The tenants and feature_flags tables are NOT protected by RLS.

Entity Relationship Diagram

┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│    users     │       │   api_keys   │       │   sessions   │
├──────────────┤       ├──────────────┤       ├──────────────┤
│ id (PK)      │◄──────│ user_id (FK) │       │ id (PK)      │
│ email        │       │ id (PK)      │       │ data         │
│ username     │       │ key_hash     │       │ expiry_date  │
│ password_hash│       │ scopes       │       └──────────────┘
│ role         │       └──────────────┘
└──────────────┘

┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│  incidents   │       │  audit_logs  │       │   actions    │
├──────────────┤       ├──────────────┤       ├──────────────┤
│ id (PK)      │◄──────│ incident_id  │       │ id (PK)      │
│ source       │       │ id (PK)      │       │ incident_id  │──┐
│ severity     │       │ action       │       │ action_type  │  │
│ status       │◄──────│ actor        │       │ target       │  │
│ alert_data   │       │ details      │       │ approval_status│ │
│ enrichments  │       │ created_at   │       └──────────────┘  │
│ analysis     │       └──────────────┘                         │
│ proposed_actions│                                             │
│ ticket_id    │       ┌──────────────┐                         │
│ tags         │       │  approvals   │◄────────────────────────┘
│ metadata     │       ├──────────────┤
└──────────────┘       │ id (PK)      │
                       │ action_id    │
                       │ incident_id  │
                       │ status       │
                       └──────────────┘

Core Tables

incidents

Stores security incidents created from incoming alerts.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| tenant_id | UUID/TEXT | FK → tenants, NOT NULL | Owning tenant |
| source | JSON/TEXT | NOT NULL | Alert source metadata |
| severity | ENUM/TEXT | NOT NULL | info, low, medium, high, critical |
| status | ENUM/TEXT | NOT NULL | See Status Values |
| alert_data | JSON/TEXT | NOT NULL | Original alert payload |
| enrichments | JSON/TEXT | DEFAULT '[]' | Array of enrichment results |
| analysis | JSON/TEXT | NULLABLE | AI triage analysis |
| proposed_actions | JSON/TEXT | DEFAULT '[]' | Array of proposed actions |
| ticket_id | TEXT | NULLABLE | External ticket reference |
| tags | JSON/TEXT | DEFAULT '[]' | User-defined tags |
| metadata | JSON/TEXT | DEFAULT '{}' | Additional metadata |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: (tenant_id, status), (tenant_id, severity), (tenant_id, created_at), status, severity, created_at, updated_at

RLS: Protected by Row-Level Security in PostgreSQL.

Incident Status Values

  • new - Newly created from alert
  • enriching - Gathering threat intelligence
  • analyzing - AI analysis in progress
  • pending_review - Awaiting analyst review
  • pending_approval - Actions awaiting approval
  • executing - Actions being executed
  • resolved - Incident resolved
  • false_positive - Marked as false positive
  • escalated - Escalated to higher tier
  • closed - Administratively closed
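
These statuses imply a state machine. The transition map below is illustrative, inferred from the lifecycle description rather than taken from the code:

```python
# Hypothetical transition map; the authoritative rules live in tw-core
TRANSITIONS = {
    "new":              {"enriching", "closed", "false_positive"},
    "enriching":        {"analyzing"},
    "analyzing":        {"pending_review", "pending_approval"},
    "pending_review":   {"pending_approval", "resolved", "false_positive", "escalated"},
    "pending_approval": {"executing", "pending_review"},
    "executing":        {"resolved", "escalated"},
}

def can_transition(current: str, target: str) -> bool:
    """Terminal states (resolved, closed, …) have no outgoing transitions."""
    return target in TRANSITIONS.get(current, set())

print(can_transition("new", "enriching"))  # True
print(can_transition("new", "resolved"))   # False
```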

audit_logs

Immutable audit trail for all incident actions.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| incident_id | UUID/TEXT | FK → incidents | Parent incident |
| action | TEXT | NOT NULL | Action type (status_changed, action_approved, etc.) |
| actor | TEXT | NOT NULL | Username or "system" |
| details | JSON/TEXT | NULLABLE | Action-specific details |
| created_at | TIMESTAMP/TEXT | NOT NULL | Action timestamp |

Indexes: incident_id, created_at

actions

Stores proposed and executed response actions.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| incident_id | UUID/TEXT | FK → incidents | Parent incident |
| action_type | TEXT | NOT NULL | isolate_host, disable_user, block_ip, etc. |
| target | JSON/TEXT | NOT NULL | Action target details |
| parameters | JSON/TEXT | DEFAULT '{}' | Action parameters |
| reason | TEXT | NOT NULL | Justification for action |
| priority | INTEGER | DEFAULT 50 | Execution priority (1-100) |
| approval_status | ENUM/TEXT | NOT NULL | See Approval Status Values |
| approved_by | TEXT | NULLABLE | Approving user |
| approval_timestamp | TIMESTAMP/TEXT | NULLABLE | Approval time |
| result | JSON/TEXT | NULLABLE | Execution result |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| executed_at | TIMESTAMP/TEXT | NULLABLE | Execution timestamp |

Indexes: incident_id, approval_status, created_at

Approval Status Values

  • pending - Awaiting approval decision
  • auto_approved - Automatically approved by policy
  • approved - Manually approved
  • denied - Manually denied
  • executed - Successfully executed
  • failed - Execution failed

approvals

Tracks multi-level approval workflows.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| action_id | UUID/TEXT | FK → actions | Related action |
| incident_id | UUID/TEXT | FK → incidents | Parent incident |
| approval_level | TEXT | NOT NULL | analyst, senior, manager, executive |
| status | ENUM/TEXT | NOT NULL | pending, approved, denied, expired |
| requested_by | TEXT | NOT NULL | Requesting user/system |
| requested_at | TIMESTAMP/TEXT | NOT NULL | Request timestamp |
| decided_by | TEXT | NULLABLE | Deciding user |
| decided_at | TIMESTAMP/TEXT | NULLABLE | Decision timestamp |
| decision_reason | TEXT | NULLABLE | Optional reason |
| expires_at | TIMESTAMP/TEXT | NULLABLE | Approval expiration |

Indexes: action_id, status, expires_at

Configuration Tables

playbooks

Automation workflow definitions.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Playbook name |
| description | TEXT | NULLABLE | Description |
| trigger_type | TEXT | NOT NULL | alert_type, severity, source, manual |
| trigger_condition | TEXT | NULLABLE | Trigger condition expression |
| stages | JSON/TEXT | DEFAULT '[]' | Array of workflow stages |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| execution_count | INTEGER | DEFAULT 0 | Times executed |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: name, trigger_type, enabled, created_at

connectors

External integration configurations.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Display name |
| connector_type | TEXT | NOT NULL | virus_total, jira, splunk, etc. |
| config | JSON/TEXT | DEFAULT '{}' | Connection configuration (encrypted credentials) |
| status | TEXT | DEFAULT 'unknown' | connected, disconnected, error, unknown |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| last_health_check | TIMESTAMP/TEXT | NULLABLE | Last health check time |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: name, connector_type, status, enabled

policies

Approval and automation policy rules.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Policy name |
| description | TEXT | NULLABLE | Description |
| condition | TEXT | NOT NULL | Condition expression |
| action | TEXT | NOT NULL | auto_approve, require_approval, deny |
| approval_level | TEXT | NULLABLE | Required approval level |
| priority | INTEGER | DEFAULT 0 | Evaluation priority |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: name, action, priority, enabled

notification_channels

Alert notification configurations.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Channel name |
| channel_type | TEXT | NOT NULL | slack, teams, email, pagerduty, webhook |
| config | JSON/TEXT | DEFAULT '{}' | Channel configuration |
| events | JSON/TEXT | DEFAULT '[]' | Subscribed event types |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: name, channel_type, enabled

settings

Key-value configuration store.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| key | TEXT | PRIMARY KEY | Setting key (general, rate_limits, llm) |
| value | JSON/TEXT | NOT NULL | Setting value as JSON |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Authentication Tables

users

User accounts for dashboard and API access.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| email | TEXT | UNIQUE, NOT NULL | Email address |
| username | TEXT | UNIQUE, NOT NULL | Login username |
| password_hash | TEXT | NOT NULL | Argon2 password hash |
| role | ENUM/TEXT | NOT NULL | admin, analyst, viewer |
| display_name | TEXT | NULLABLE | Display name |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Account active status |
| last_login_at | TIMESTAMP/TEXT | NULLABLE | Last login timestamp |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |

Indexes: email, username, role, enabled

sessions

User session storage (tower-sessions compatible).

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | TEXT | PRIMARY KEY | Session ID |
| data | BLOB | NOT NULL | Encrypted session data |
| expiry_date | INTEGER | NOT NULL | Unix timestamp expiration |

Indexes: expiry_date

api_keys

API key authentication.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| user_id | UUID/TEXT | FK → users | Owner user |
| name | TEXT | NOT NULL | Key display name |
| key_hash | TEXT | NOT NULL | SHA-256 hash of key |
| key_prefix | TEXT | NOT NULL | First 8 chars for identification |
| scopes | JSON/TEXT | DEFAULT '[]' | Allowed API scopes |
| expires_at | TIMESTAMP/TEXT | NULLABLE | Key expiration |
| last_used_at | TIMESTAMP/TEXT | NULLABLE | Last usage timestamp |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |

Indexes: user_id, key_prefix, expires_at

Database-Specific Notes

SQLite

  • UUIDs stored as TEXT
  • Timestamps stored as ISO 8601 TEXT
  • Boolean stored as INTEGER (0/1)
  • JSON stored as TEXT
  • Uses CHECK constraints for enums

PostgreSQL

  • Native UUID type
  • Native TIMESTAMPTZ type
  • Native BOOLEAN type
  • Native JSONB type with indexing
  • Uses custom ENUM types for status fields
  • Row-Level Security (RLS) enabled on all tenant-scoped tables

Row-Level Security

PostgreSQL deployments use RLS for defense-in-depth tenant isolation:

-- RLS policy example (automatically applied to all queries)
CREATE POLICY incidents_select_tenant_isolation ON incidents
    FOR SELECT
    USING (tenant_id = current_setting('app.current_tenant', true)::uuid);

To set the tenant context:

-- Set before executing tenant-scoped queries
SELECT set_tenant_context('00000000-0000-0000-0000-000000000001'::uuid);

-- Or use the session variable directly
SET app.current_tenant = '00000000-0000-0000-0000-000000000001';

Helper functions:

| Function | Description |
| --- | --- |
| set_tenant_context(uuid) | Sets tenant context, returns previous value |
| get_current_tenant() | Returns current tenant UUID or NULL |
| clear_tenant_context() | Clears tenant context |

Migrations

Migrations are managed by SQLx and located in:

  • SQLite: crates/tw-core/src/db/migrations/sqlite/
  • PostgreSQL: crates/tw-core/src/db/migrations/postgres/

Run migrations automatically on startup or manually:

# SQLite
tw-cli db migrate --database-url "sqlite:data/triage.db"

# PostgreSQL
tw-cli db migrate --database-url "postgres://user:pass@host/db"

Connectors

Connectors integrate Triage Warden with external security tools and services.

Overview

Each connector type has a trait interface and multiple implementations:

| Type | Purpose | Implementations |
| --- | --- | --- |
| Threat Intelligence | Hash/URL/domain reputation | VirusTotal, Mock |
| SIEM | Log queries and correlation | Splunk, Mock |
| EDR | Endpoint detection and response | CrowdStrike, Mock |
| Email Gateway | Email security operations | Microsoft 365, Mock |
| Ticketing | Incident ticket management | Jira, Mock |

Configuration

Select connector implementations via environment variables:

# Use real connectors
TW_THREAT_INTEL_MODE=virustotal
TW_SIEM_MODE=splunk
TW_EDR_MODE=crowdstrike
TW_EMAIL_GATEWAY_MODE=m365
TW_TICKETING_MODE=jira

# Or use mocks for testing
TW_THREAT_INTEL_MODE=mock
TW_SIEM_MODE=mock
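
A factory that selects the implementation from these variables might look like the sketch below. The class names are hypothetical; only the `TW_*_MODE` variables come from the configuration above:

```python
import os

class MockThreatIntel:
    """Offline stand-in selected when TW_THREAT_INTEL_MODE=mock."""
    def lookup_hash(self, h: str) -> dict:
        return {"indicator": h, "malicious": False, "confidence": 0.0}

class VirusTotalThreatIntel:
    """Placeholder for the real connector; would call the VirusTotal API."""
    def __init__(self, api_key: str):
        self.api_key = api_key

def make_threat_intel(env=None):
    """Pick an implementation based on TW_THREAT_INTEL_MODE (default: mock)."""
    env = os.environ if env is None else env
    if env.get("TW_THREAT_INTEL_MODE", "mock") == "virustotal":
        return VirusTotalThreatIntel(env["TW_VIRUSTOTAL_API_KEY"])
    return MockThreatIntel()

connector = make_threat_intel({"TW_THREAT_INTEL_MODE": "mock"})
print(type(connector).__name__)  # MockThreatIntel
```

Defaulting to the mock keeps tests and local development from ever touching external services by accident.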

Connector Trait

All connectors implement the base Connector trait:

#[async_trait]
pub trait Connector: Send + Sync {
    /// Unique identifier for this connector instance
    fn name(&self) -> &str;

    /// Type of connector (threat_intel, siem, edr, etc.)
    fn connector_type(&self) -> &str;

    /// Check connector health
    async fn health_check(&self) -> ConnectorResult<ConnectorHealth>;

    /// Test connection to the service
    async fn test_connection(&self) -> ConnectorResult<bool>;
}

pub enum ConnectorHealth {
    Healthy,
    Degraded { message: String },
    Unhealthy { message: String },
}

Error Handling

Connectors return ConnectorResult<T> with detailed error types:

pub enum ConnectorError {
    /// Service returned an error
    RequestFailed(String),

    /// Resource not found
    NotFound(String),

    /// Authentication failed
    AuthenticationFailed(String),

    /// Rate limit exceeded
    RateLimited { retry_after: Option<Duration> },

    /// Network or connection error
    NetworkError(String),

    /// Invalid response from service
    InvalidResponse(String),
}

Health Monitoring

Check connector health via the API:

curl http://localhost:8080/api/connectors/health

{
  "connectors": [
    { "name": "virustotal", "type": "threat_intel", "status": "healthy" },
    { "name": "splunk", "type": "siem", "status": "healthy" },
    { "name": "crowdstrike", "type": "edr", "status": "degraded", "message": "High latency" }
  ]
}
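Callers often want a single rollup status from this response. A minimal sketch, assuming a worst-status-wins precedence (this precedence rule is an assumption, not documented behavior):

```python
def overall_status(connectors: list[dict]) -> str:
    """Worst status wins: unhealthy > degraded > healthy."""
    order = {"healthy": 0, "degraded": 1, "unhealthy": 2}
    return max((c["status"] for c in connectors),
               key=order.__getitem__, default="healthy")

health = {
    "connectors": [
        {"name": "virustotal", "type": "threat_intel", "status": "healthy"},
        {"name": "splunk", "type": "siem", "status": "healthy"},
        {"name": "crowdstrike", "type": "edr", "status": "degraded", "message": "High latency"},
    ]
}
print(overall_status(health["connectors"]))  # degraded
```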


Threat Intelligence Connector

Query threat intelligence services for reputation data on hashes, URLs, domains, and IP addresses.

Interface

#[async_trait]
pub trait ThreatIntelConnector: Connector {
    /// Look up file hash reputation
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport>;

    /// Look up URL reputation
    async fn lookup_url(&self, url: &str) -> ConnectorResult<ThreatReport>;

    /// Look up domain reputation
    async fn lookup_domain(&self, domain: &str) -> ConnectorResult<ThreatReport>;

    /// Look up IP address reputation
    async fn lookup_ip(&self, ip: &str) -> ConnectorResult<ThreatReport>;
}

pub struct ThreatReport {
    pub indicator: String,
    pub indicator_type: IndicatorType,
    pub malicious: bool,
    pub confidence: f64,
    pub categories: Vec<String>,
    pub first_seen: Option<DateTime<Utc>>,
    pub last_seen: Option<DateTime<Utc>>,
    pub sources: Vec<ThreatSource>,
}

VirusTotal

Configuration

TW_THREAT_INTEL_MODE=virustotal
TW_VIRUSTOTAL_API_KEY=your-api-key-here

Rate Limits

Tier    | Requests/Minute
------- | ---------------
Free    | 4
Premium | 500+

The connector automatically handles rate limiting with exponential backoff.

Supported Lookups

Method        | VT Endpoint        | Notes
------------- | ------------------ | ------------------
lookup_hash   | /files/{hash}      | MD5, SHA1, SHA256
lookup_url    | /urls/{url_id}     | Base64-encoded URL
lookup_domain | /domains/{domain}  | Domain reputation
lookup_ip     | /ip_addresses/{ip} | IP reputation

Example Usage

let connector = VirusTotalConnector::new(api_key)?;

let report = connector.lookup_hash("44d88612fea8a8f36de82e1278abb02f").await?;
println!("Malicious: {}", report.malicious);
println!("Confidence: {:.2}", report.confidence);
println!("Categories: {:?}", report.categories);

Response Mapping

VirusTotal detection ratios map to confidence scores:

Detection Ratio | Confidence | Classification
--------------- | ---------- | ----------------
0%              | 0.0        | Clean
1-10%           | 0.3        | Suspicious
11-50%          | 0.6        | Likely Malicious
51-100%         | 0.9        | Malicious
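The mapping above can be expressed as a small function; the function name and signature below are illustrative, not the connector's actual API:

```python
def classify(detections: int, total_engines: int) -> tuple[float, str]:
    """Map a VirusTotal-style detection ratio to (confidence, classification)."""
    ratio = 100 * detections / total_engines
    if ratio == 0:
        return 0.0, "Clean"
    if ratio <= 10:
        return 0.3, "Suspicious"
    if ratio <= 50:
        return 0.6, "Likely Malicious"
    return 0.9, "Malicious"

print(classify(45, 70))  # (0.9, 'Malicious')
```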

Mock Connector

For testing without external API calls:

TW_THREAT_INTEL_MODE=mock

The mock connector returns predictable results based on indicator patterns:

Pattern               | Result
--------------------- | --------------------------
Contains "malicious"  | Malicious, confidence 0.95
Contains "suspicious" | Suspicious, confidence 0.5
Contains "clean"      | Clean, confidence 0.1
Default               | Clean, confidence 0.2
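The mock's behavior is easy to reproduce for local test fixtures. A minimal sketch mirroring the table (treating only the "malicious" pattern as malicious=True is an assumption about the mock's output shape):

```python
def mock_lookup(indicator: str) -> dict:
    """Pattern-based mock verdict, mirroring the table above."""
    s = indicator.lower()
    if "malicious" in s:
        return {"malicious": True, "confidence": 0.95}
    if "suspicious" in s:
        return {"malicious": False, "confidence": 0.5}
    if "clean" in s:
        return {"malicious": False, "confidence": 0.1}
    return {"malicious": False, "confidence": 0.2}  # default: assumed clean

print(mock_lookup("https://malicious.example.com"))  # {'malicious': True, 'confidence': 0.95}
```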

Python Bridge

Access from Python via the bridge:

from tw_bridge import ThreatIntelBridge

# Create bridge (uses TW_THREAT_INTEL_MODE env var)
bridge = ThreatIntelBridge()

# Or specify mode explicitly
bridge = ThreatIntelBridge("virustotal")

# Lookup hash
result = bridge.lookup_hash("44d88612fea8a8f36de82e1278abb02f")
print(f"Malicious: {result['malicious']}")
print(f"Confidence: {result['confidence']}")

# Lookup URL
result = bridge.lookup_url("https://example.com/suspicious")

# Lookup domain
result = bridge.lookup_domain("malware-site.com")

Caching

Results are cached to reduce API calls:

Lookup Type | Cache Duration
----------- | --------------
Hash        | 24 hours
URL         | 1 hour
Domain      | 6 hours
IP          | 6 hours

Cache is stored in the database and shared across instances.
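The TTL behavior can be sketched with an in-memory stand-in (the real cache is database-backed and shared; this class is purely illustrative):

```python
import time

# TTLs per lookup type, matching the table above (in seconds).
TTL = {"hash": 24 * 3600, "url": 3600, "domain": 6 * 3600, "ip": 6 * 3600}

class LookupCache:
    """In-memory sketch of the lookup cache; `now` is injectable for testing."""
    def __init__(self):
        self._store = {}

    def put(self, kind: str, key: str, value, now=None):
        now = time.time() if now is None else now
        self._store[(kind, key)] = (now, value)

    def get(self, kind: str, key: str, now=None):
        now = time.time() if now is None else now
        entry = self._store.get((kind, key))
        if entry and now - entry[0] < TTL[kind]:
            return entry[1]
        return None  # missing or expired

cache = LookupCache()
cache.put("url", "https://example.com", {"malicious": False}, now=0)
print(cache.get("url", "https://example.com", now=1800))  # hit: {'malicious': False}
print(cache.get("url", "https://example.com", now=7200))  # expired -> None
```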

Adding Custom Providers

Implement the ThreatIntelConnector trait:

pub struct CustomThreatIntelConnector {
    client: reqwest::Client,
    api_key: String,
}

#[async_trait]
impl Connector for CustomThreatIntelConnector {
    fn name(&self) -> &str { "custom" }
    fn connector_type(&self) -> &str { "threat_intel" }
    // ... implement health_check, test_connection
}

#[async_trait]
impl ThreatIntelConnector for CustomThreatIntelConnector {
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport> {
        // Custom implementation
    }
    // ... implement other methods
}

See Adding Connectors for full details.

SIEM Connector

Query SIEM platforms for log data, run searches, and correlate events.

Interface

#[async_trait]
pub trait SIEMConnector: Connector {
    /// Run a search query
    async fn search(&self, query: &str, time_range: TimeRange) -> ConnectorResult<SearchResults>;

    /// Get events by ID
    async fn get_events(&self, event_ids: &[String]) -> ConnectorResult<Vec<SIEMEvent>>;

    /// Get related events (correlation)
    async fn get_related_events(
        &self,
        indicator: &str,
        indicator_type: IndicatorType,
        time_range: TimeRange,
    ) -> ConnectorResult<Vec<SIEMEvent>>;
}

pub struct SIEMEvent {
    pub id: String,
    pub timestamp: DateTime<Utc>,
    pub source: String,
    pub event_type: String,
    pub severity: String,
    pub raw_data: serde_json::Value,
}

pub struct SearchResults {
    pub events: Vec<SIEMEvent>,
    pub total_count: u64,
    pub search_id: String,
}

Splunk

Configuration

TW_SIEM_MODE=splunk
TW_SPLUNK_URL=https://splunk.company.com:8089
TW_SPLUNK_TOKEN=your-token-here

Token Permissions

The Splunk token requires these capabilities:

  • search - Run searches
  • list_inputs - Health check
  • rest_access - REST API access

Example Searches

let connector = SplunkConnector::new(url, token)?;

// Search for events
let results = connector.search(
    r#"index=security sourcetype=firewall action=blocked"#,
    TimeRange::last_hours(24),
).await?;

// Find related events by IP
let related = connector.get_related_events(
    "192.168.1.100",
    IndicatorType::IpAddress,
    TimeRange::last_hours(1),
).await?;

Search Query Translation

Common queries and their SPL translations:

Triage Warden Query | Splunk SPL
------------------- | -----------------------------------------------
IP correlation      | index=* src_ip="{ip}" OR dest_ip="{ip}"
User activity       | index=* user="{user}"
Hash lookup         | index=* (file_hash="{hash}" OR sha256="{hash}")
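A minimal sketch of that translation layer (a real implementation would also escape quotes in the indicator before substitution):

```python
def to_spl(indicator: str, indicator_type: str) -> str:
    """Translate an indicator into SPL, per the mapping above."""
    templates = {
        "ip": 'index=* src_ip="{i}" OR dest_ip="{i}"',
        "user": 'index=* user="{i}"',
        "hash": 'index=* (file_hash="{i}" OR sha256="{i}")',
    }
    return templates[indicator_type].format(i=indicator)

print(to_spl("192.168.1.100", "ip"))  # index=* src_ip="192.168.1.100" OR dest_ip="192.168.1.100"
```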

Performance Tips

  • Use specific indexes in queries
  • Limit time ranges when possible
  • Use | head 1000 to limit results

Mock Connector

For testing:

TW_SIEM_MODE=mock

The mock returns sample security events matching the query pattern.

Python Bridge

from tw_bridge import SIEMBridge

bridge = SIEMBridge("splunk")

# Run a search
results = bridge.search(
    query='index=security action=blocked',
    hours=24
)

for event in results['events']:
    print(f"{event['timestamp']}: {event['source']}")

# Get related events
related = bridge.get_related_events(
    indicator="192.168.1.100",
    indicator_type="ip",
    hours=1
)

Adding Custom SIEM

Implement the SIEMConnector trait:

pub struct ElasticSIEMConnector {
    client: elasticsearch::Elasticsearch,
}

#[async_trait]
impl SIEMConnector for ElasticSIEMConnector {
    async fn search(&self, query: &str, time_range: TimeRange) -> ConnectorResult<SearchResults> {
        // Translate to Elasticsearch DSL and execute
    }
    // ... implement other methods
}

EDR Connector

Integrate with Endpoint Detection and Response platforms for host information and response actions.

Interface

#[async_trait]
pub trait EDRConnector: Connector {
    /// Get host information
    async fn get_host(&self, host_id: &str) -> ConnectorResult<HostInfo>;

    /// Search for hosts
    async fn search_hosts(&self, query: &str) -> ConnectorResult<Vec<HostInfo>>;

    /// Get recent detections for a host
    async fn get_detections(&self, host_id: &str) -> ConnectorResult<Vec<Detection>>;

    /// Isolate a host from the network
    async fn isolate_host(&self, host_id: &str) -> ConnectorResult<ActionResult>;

    /// Remove host isolation
    async fn unisolate_host(&self, host_id: &str) -> ConnectorResult<ActionResult>;

    /// Trigger a scan on the host
    async fn scan_host(&self, host_id: &str) -> ConnectorResult<ActionResult>;
}

pub struct HostInfo {
    pub id: String,
    pub hostname: String,
    pub platform: String,
    pub os_version: String,
    pub agent_version: String,
    pub last_seen: DateTime<Utc>,
    pub isolation_status: IsolationStatus,
    pub tags: Vec<String>,
}

pub struct Detection {
    pub id: String,
    pub timestamp: DateTime<Utc>,
    pub severity: String,
    pub tactic: String,
    pub technique: String,
    pub description: String,
    pub process_name: Option<String>,
    pub file_path: Option<String>,
}

CrowdStrike

Configuration

TW_EDR_MODE=crowdstrike
TW_CROWDSTRIKE_CLIENT_ID=your-client-id
TW_CROWDSTRIKE_CLIENT_SECRET=your-client-secret
TW_CROWDSTRIKE_REGION=us-1  # us-1, us-2, eu-1, usgov-1

API Scopes Required

The API client requires these scopes:

  • Hosts: Read - Get host information
  • Hosts: Write - Isolation actions
  • Detections: Read - Get detections
  • Real Time Response: Write - Scan actions

OAuth2 Token Management

The connector automatically handles token refresh:

// Token refreshed automatically when expired
let connector = CrowdStrikeConnector::new(client_id, client_secret, region)?;

// All subsequent calls use valid token
let host = connector.get_host("abc123").await?;

Example Usage

// Get host information
let host = connector.get_host("aid:abc123").await?;
println!("Hostname: {}", host.hostname);
println!("Last seen: {}", host.last_seen);

// Check for detections
let detections = connector.get_detections("aid:abc123").await?;
for d in detections {
    println!("{}: {} - {}", d.timestamp, d.severity, d.description);
}

// Isolate compromised host
let result = connector.isolate_host("aid:abc123").await?;
if result.success {
    println!("Host isolated successfully");
}

Action Confirmation

Isolation and scan actions require policy approval. See Policy Engine.

Mock Connector

TW_EDR_MODE=mock

The mock provides sample hosts and detections for testing.

Python Bridge

from tw_bridge import EDRBridge

bridge = EDRBridge("crowdstrike")

# Get host info
host = bridge.get_host("aid:abc123")
print(f"Hostname: {host['hostname']}")
print(f"Platform: {host['platform']}")

# Get detections
detections = bridge.get_detections("aid:abc123")
for d in detections:
    print(f"{d['severity']}: {d['description']}")

# Isolate host (requires policy approval)
result = bridge.isolate_host("aid:abc123")
if result['success']:
    print("Host isolated")

Response Actions

Action       | Description       | Rollback
------------ | ----------------- | --------------
isolate_host | Network isolation | unisolate_host
scan_host    | On-demand scan    | N/A

Isolation Behavior

When isolated:

  • Host cannot communicate on network
  • Falcon agent maintains connection to cloud
  • User may see isolation notification

Rate Limits

Endpoint            | Limit
------------------- | -------
Host queries        | 100/min
Detection queries   | 50/min
Containment actions | 10/min

Email Gateway Connector

Manage email security operations including search, quarantine, and sender blocking.

Interface

#[async_trait]
pub trait EmailGatewayConnector: Connector {
    /// Search for emails
    async fn search_emails(&self, query: EmailSearchQuery) -> ConnectorResult<Vec<EmailMessage>>;

    /// Get specific email by ID
    async fn get_email(&self, message_id: &str) -> ConnectorResult<EmailMessage>;

    /// Move email to quarantine
    async fn quarantine_email(&self, message_id: &str) -> ConnectorResult<ActionResult>;

    /// Release email from quarantine
    async fn release_email(&self, message_id: &str) -> ConnectorResult<ActionResult>;

    /// Block sender
    async fn block_sender(&self, sender: &str) -> ConnectorResult<ActionResult>;

    /// Unblock sender
    async fn unblock_sender(&self, sender: &str) -> ConnectorResult<ActionResult>;

    /// Get threat data for email
    async fn get_threat_data(&self, message_id: &str) -> ConnectorResult<EmailThreatData>;
}

pub struct EmailMessage {
    pub id: String,
    pub internet_message_id: String,
    pub sender: String,
    pub recipients: Vec<String>,
    pub subject: String,
    pub received_at: DateTime<Utc>,
    pub has_attachments: bool,
    pub attachments: Vec<EmailAttachment>,
    pub urls: Vec<String>,
    pub headers: HashMap<String, String>,
    pub threat_assessment: Option<ThreatAssessment>,
}

pub struct EmailSearchQuery {
    pub sender: Option<String>,
    pub recipient: Option<String>,
    pub subject_contains: Option<String>,
    pub timerange: TimeRange,
    pub has_attachments: Option<bool>,
    pub threat_type: Option<String>,
    pub limit: usize,
}

Microsoft 365

Configuration

TW_EMAIL_GATEWAY_MODE=m365
TW_M365_TENANT_ID=your-tenant-id
TW_M365_CLIENT_ID=your-client-id
TW_M365_CLIENT_SECRET=your-client-secret

App Registration

Create an Azure AD app registration with these API permissions:

Permission                | Type        | Purpose
------------------------- | ----------- | ---------------------
Mail.Read                 | Application | Read emails
Mail.ReadWrite            | Application | Quarantine actions
ThreatAssessment.Read.All | Application | Threat data
Policy.Read.All           | Application | Block list management

Example Usage

let connector = M365Connector::new(tenant_id, client_id, client_secret)?;

// Search for suspicious emails
let query = EmailSearchQuery {
    sender: Some("attacker@example.com".to_string()),
    timerange: TimeRange::last_hours(24),
    ..Default::default()
};
let emails = connector.search_emails(query).await?;

// Quarantine malicious email
let result = connector.quarantine_email("AAMkAGI2...").await?;

// Block sender
let result = connector.block_sender("attacker@example.com").await?;

Quarantine Behavior

When quarantined:

  • Email moved to quarantine folder
  • User notified (configurable)
  • Admin can release if false positive

Mock Connector

TW_EMAIL_GATEWAY_MODE=mock

Provides sample emails with various threat characteristics:

  • Phishing with malicious URLs
  • Malware with executable attachments
  • BEC/impersonation attempts
  • Clean legitimate emails

Python Bridge

from tw_bridge import EmailGatewayBridge

bridge = EmailGatewayBridge("m365")

# Search emails
emails = bridge.search_emails(
    sender="attacker@example.com",
    hours=24
)

for email in emails:
    print(f"From: {email['sender']}")
    print(f"Subject: {email['subject']}")
    print(f"Attachments: {len(email['attachments'])}")

# Quarantine email
result = bridge.quarantine_email("AAMkAGI2...")
if result['success']:
    print("Email quarantined")

# Block sender
result = bridge.block_sender("attacker@example.com")

Response Actions

Action           | Description        | Rollback
---------------- | ------------------ | --------------
quarantine_email | Move to quarantine | release_email
block_sender     | Add to blocklist   | unblock_sender

Threat Data

Get detailed threat information:

let threat_data = connector.get_threat_data("AAMkAGI2...").await?;

println!("Delivery action: {}", threat_data.delivery_action);
println!("Threat types: {:?}", threat_data.threat_types);
println!("Detection methods: {:?}", threat_data.detection_methods);

Fields:

  • delivery_action: Delivered, Quarantined, Blocked
  • threat_types: Phishing, Malware, Spam, BEC
  • detection_methods: URLAnalysis, AttachmentScanning, ImpersonationDetection
  • urls_clicked: URLs clicked by recipient (if tracking enabled)

Ticketing Connector

Create and manage security incident tickets in external ticketing systems.

Interface

#[async_trait]
pub trait TicketingConnector: Connector {
    /// Create a new ticket
    async fn create_ticket(&self, ticket: CreateTicketRequest) -> ConnectorResult<Ticket>;

    /// Get ticket by ID
    async fn get_ticket(&self, ticket_id: &str) -> ConnectorResult<Ticket>;

    /// Update ticket fields
    async fn update_ticket(&self, ticket_id: &str, update: UpdateTicketRequest) -> ConnectorResult<Ticket>;

    /// Add comment to ticket
    async fn add_comment(&self, ticket_id: &str, comment: &str) -> ConnectorResult<()>;

    /// Search tickets
    async fn search_tickets(&self, query: TicketSearchQuery) -> ConnectorResult<Vec<Ticket>>;
}

pub struct CreateTicketRequest {
    pub title: String,
    pub description: String,
    pub priority: TicketPriority,
    pub ticket_type: String,
    pub labels: Vec<String>,
    pub assignee: Option<String>,
    pub custom_fields: HashMap<String, String>,
}

pub struct Ticket {
    pub id: String,
    pub key: String,
    pub title: String,
    pub description: String,
    pub status: String,
    pub priority: TicketPriority,
    pub assignee: Option<String>,
    pub created_at: DateTime<Utc>,
    pub updated_at: DateTime<Utc>,
    pub url: String,
}

Jira

Configuration

TW_TICKETING_MODE=jira
TW_JIRA_URL=https://company.atlassian.net
TW_JIRA_EMAIL=your-email@company.com
TW_JIRA_API_TOKEN=your-api-token
TW_JIRA_PROJECT_KEY=SEC

API Token

Generate an API token at: https://id.atlassian.com/manage-profile/security/api-tokens

Required permissions:

  • Create issues
  • Edit issues
  • Add comments
  • Browse project

Example Usage

let connector = JiraConnector::new(url, email, token, project_key)?;

// Create security ticket
let request = CreateTicketRequest {
    title: "Phishing Incident - INC-2024-001".to_string(),
    description: "Phishing email detected and quarantined.\n\n## Details\n...".to_string(),
    priority: TicketPriority::High,
    ticket_type: "Security Incident".to_string(),
    labels: vec!["phishing".to_string(), "triage-warden".to_string()],
    assignee: Some("analyst@company.com".to_string()),
    custom_fields: HashMap::new(),
};

let ticket = connector.create_ticket(request).await?;
println!("Created: {} - {}", ticket.key, ticket.url);

// Add investigation notes
connector.add_comment(
    &ticket.id,
    "## Investigation Notes\n\n- Sender reputation: Malicious\n- URLs: 2 phishing links"
).await?;

Issue Types

Configure the Jira project with these issue types:

Issue Type        | Usage
----------------- | --------------------------------
Security Incident | Main incident ticket
Investigation     | Sub-task for investigation steps
Remediation       | Sub-task for response actions

Custom Fields

Map custom fields in configuration:

TW_JIRA_FIELD_SEVERITY=customfield_10001
TW_JIRA_FIELD_INCIDENT_ID=customfield_10002
TW_JIRA_FIELD_VERDICT=customfield_10003

Mock Connector

TW_TICKETING_MODE=mock

Simulates ticket operations with in-memory storage.

Python Bridge

from tw_bridge import TicketingBridge

bridge = TicketingBridge("jira")

# Create ticket
ticket = bridge.create_ticket(
    title="Phishing Incident - INC-2024-001",
    description="Phishing email detected...",
    priority="high",
    ticket_type="Security Incident",
    labels=["phishing", "triage-warden"]
)
print(f"Created: {ticket['key']}")
print(f"URL: {ticket['url']}")

# Add comment
bridge.add_comment(
    ticket_id=ticket['id'],
    comment="Investigation complete. Verdict: Malicious"
)

# Update status
bridge.update_ticket(
    ticket_id=ticket['id'],
    status="Done"
)

# Search tickets
tickets = bridge.search_tickets(
    query="project = SEC AND labels = phishing",
    limit=10
)

Ticket Templates

Define templates for consistent ticket creation:

# config/ticket_templates.toml

[templates.phishing]
title = "Phishing: {subject}"
description = """
## Incident Summary
- **Type**: Phishing
- **Severity**: {severity}
- **Incident ID**: {incident_id}

## Details
{details}

## Recommended Actions
{recommended_actions}
"""
labels = ["phishing", "triage-warden"]

[templates.malware]
title = "Malware Alert: {hostname}"
description = """
## Incident Summary
- **Type**: Malware
- **Host**: {hostname}
- **Detection**: {detection}

## IOCs
{iocs}
"""
labels = ["malware", "triage-warden"]
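The placeholders in these templates use brace syntax, so rendering a template against incident fields can be sketched with Python's str.format; the actual template engine may differ:

```python
# Minimal render of a phishing-style template; template contents mirror
# the TOML above (trimmed to title and labels for brevity).
template = {
    "title": "Phishing: {subject}",
    "labels": ["phishing", "triage-warden"],
}

def render(template: dict, **fields) -> dict:
    """Fill {placeholder} fields in the template's title."""
    ticket = dict(template)
    ticket["title"] = template["title"].format(**fields)
    return ticket

ticket = render(template, subject="Urgent: Update Account")
print(ticket["title"])  # Phishing: Urgent: Update Account
```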

Integration with Incidents

Tickets are automatically linked to incidents:

// Create ticket action stores the ticket key
let action = execute_action("create_ticket", incident_id, params).await?;

// Incident updated with ticket reference
incident.metadata["ticket_key"] = "SEC-1234";
incident.metadata["ticket_url"] = "https://company.atlassian.net/browse/SEC-1234";

Actions

Actions are the executable operations that Triage Warden can perform in response to incidents.

Overview

Actions fall into several categories:

Category     | Purpose                   | Examples
------------ | ------------------------- | ---------------------------------------
Analysis     | Extract and parse data    | parse_email, check_email_authentication
Lookup       | Enrich with external data | lookup_sender_reputation, lookup_urls
Response     | Take containment actions  | quarantine_email, isolate_host
Notification | Alert stakeholders        | notify_user, escalate
Ticketing    | Create/update tickets     | create_ticket, add_ticket_comment

Action Trait

All actions implement the Action trait:

#[async_trait]
pub trait Action: Send + Sync {
    /// Action name (used in playbooks and API)
    fn name(&self) -> &str;

    /// Human-readable description
    fn description(&self) -> &str;

    /// Required and optional parameters
    fn required_parameters(&self) -> Vec<ParameterDef>;

    /// Whether this action supports rollback
    fn supports_rollback(&self) -> bool;

    /// Execute the action
    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError>;

    /// Rollback the action (if supported)
    async fn rollback(&self, _context: ActionContext) -> Result<ActionResult, ActionError> {
        Err(ActionError::RollbackNotSupported)
    }
}

Action Context

Actions receive an ActionContext with:

pub struct ActionContext {
    /// Unique execution ID
    pub execution_id: Uuid,

    /// Parameters passed to the action
    pub parameters: HashMap<String, serde_json::Value>,

    /// Related incident (if any)
    pub incident_id: Option<Uuid>,

    /// User or agent requesting the action
    pub proposer: String,

    /// Connectors available for use
    pub connectors: ConnectorRegistry,
}

Action Result

Actions return an ActionResult:

pub struct ActionResult {
    /// Whether the action succeeded
    pub success: bool,

    /// Action name
    pub action_name: String,

    /// Human-readable summary
    pub message: String,

    /// Execution duration
    pub duration: Duration,

    /// Output data (action-specific)
    pub output: HashMap<String, serde_json::Value>,

    /// Whether rollback is available
    pub rollback_available: bool,
}

Policy Integration

All actions pass through the policy engine before execution:

Action Request → Policy Evaluation → Decision
                                       ├─ Allowed → Execute
                                       ├─ Denied → Return Error
                                       └─ RequiresApproval → Queue

See Policy Engine for approval configuration.
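The three-way decision above can be sketched as a lookup over configured rules; the rule set and the default-deny fallback below are illustrative assumptions, not the real policy engine:

```python
from enum import Enum

class Decision(Enum):
    ALLOWED = "allowed"
    DENIED = "denied"
    REQUIRES_APPROVAL = "requires_approval"

# Illustrative rule set; real rules live in the policy configuration.
RULES = {
    "lookup_urls": Decision.ALLOWED,
    "quarantine_email": Decision.REQUIRES_APPROVAL,
    "delete_mailbox": Decision.DENIED,
}

def evaluate(action: str) -> Decision:
    """Default-deny evaluation over the configured rules."""
    return RULES.get(action, Decision.DENIED)

print(evaluate("quarantine_email").value)  # requires_approval
```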

Executing Actions

Via API

curl -X POST http://localhost:8080/api/incidents/{id}/actions \
  -H "Content-Type: application/json" \
  -d '{
    "action": "quarantine_email",
    "parameters": {
      "message_id": "AAMkAGI2...",
      "reason": "Phishing detected"
    }
  }'

Via CLI

tw-cli action execute \
  --incident INC-2024-001 \
  --action quarantine_email \
  --param message_id=AAMkAGI2... \
  --param reason="Phishing detected"

Via Playbook

steps:
  - action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
      reason: "Automated response to phishing"

Available Actions

Email Actions

Actions for analyzing and responding to email-based threats.

Analysis Actions

parse_email

Extract headers, body, attachments, and URLs from raw email.

Parameters:

Name      | Type   | Required | Description
--------- | ------ | -------- | ---------------------------
raw_email | string | Yes      | Raw email content (RFC 822)

Output:

{
  "headers": {
    "From": "sender@example.com",
    "To": "user@company.com",
    "Subject": "Important Document",
    "Date": "2024-01-15T10:30:00Z",
    "Message-ID": "<abc123@example.com>",
    "X-Originating-IP": "[192.168.1.100]"
  },
  "sender": "sender@example.com",
  "recipients": ["user@company.com"],
  "subject": "Important Document",
  "body_text": "Please review the attached document...",
  "body_html": "<html>...",
  "attachments": [
    {
      "filename": "document.pdf",
      "content_type": "application/pdf",
      "size": 102400,
      "sha256": "abc123..."
    }
  ],
  "urls": [
    "https://example.com/document",
    "https://suspicious-site.com/login"
  ]
}

check_email_authentication

Validate SPF, DKIM, and DMARC authentication results.

Parameters:

Name    | Type   | Required | Description
------- | ------ | -------- | --------------------------------
headers | object | Yes      | Email headers (from parse_email)

Output:

{
  "spf": {
    "result": "pass",
    "domain": "example.com"
  },
  "dkim": {
    "result": "pass",
    "domain": "example.com",
    "selector": "default"
  },
  "dmarc": {
    "result": "pass",
    "policy": "reject"
  },
  "authentication_passed": true,
  "risk_indicators": []
}

Risk Indicators:

  • spf_fail - SPF validation failed
  • dkim_fail - DKIM signature invalid
  • dmarc_fail - DMARC policy violation
  • header_mismatch - From/Reply-To mismatch
  • suspicious_routing - Unusual mail routing
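A simplified sketch of how the first four indicators might be derived from the authentication output and headers (suspicious_routing omitted; the header_mismatch heuristic is an assumption):

```python
def risk_indicators(auth: dict, headers: dict) -> list[str]:
    """Derive risk indicators from authentication results and email headers."""
    indicators = []
    if auth["spf"]["result"] != "pass":
        indicators.append("spf_fail")
    if auth["dkim"]["result"] != "pass":
        indicators.append("dkim_fail")
    if auth["dmarc"]["result"] != "pass":
        indicators.append("dmarc_fail")
    # Assumed heuristic: Reply-To differing from From suggests spoofing.
    reply_to = headers.get("Reply-To")
    if reply_to and reply_to != headers.get("From"):
        indicators.append("header_mismatch")
    return indicators

auth = {"spf": {"result": "fail"}, "dkim": {"result": "pass"}, "dmarc": {"result": "fail"}}
headers = {"From": "ceo@company.com", "Reply-To": "ceo@attacker.example"}
print(risk_indicators(auth, headers))  # ['spf_fail', 'dmarc_fail', 'header_mismatch']
```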

Response Actions

quarantine_email

Move email to quarantine via email gateway.

Parameters:

Name       | Type   | Required | Description
---------- | ------ | -------- | ---------------------
message_id | string | Yes      | Email message ID
reason     | string | No       | Reason for quarantine

Output:

{
  "quarantine_id": "quar-abc123",
  "message_id": "AAMkAGI2...",
  "quarantined_at": "2024-01-15T10:35:00Z"
}

Rollback: release_email - Releases email from quarantine

block_sender

Add sender to organization blocklist.

Parameters:

Name   | Type   | Required | Description
------ | ------ | -------- | ----------------------------------
sender | string | Yes      | Email address to block
scope  | string | No       | Block scope: organization or user

Output:

{
  "block_id": "block-abc123",
  "sender": "attacker@example.com",
  "scope": "organization",
  "blocked_at": "2024-01-15T10:35:00Z"
}

Rollback: unblock_sender - Removes sender from blocklist

Usage Examples

Phishing Response Playbook

name: phishing_response
steps:
  - action: parse_email
    output: parsed

  - action: check_email_authentication
    parameters:
      headers: "{{ parsed.headers }}"
    output: auth

  - action: lookup_sender_reputation
    parameters:
      sender: "{{ parsed.sender }}"
    output: reputation

  - condition: "reputation.score < 0.3 or not auth.authentication_passed"
    action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
      reason: "Failed authentication and low sender reputation"

  - condition: "reputation.score < 0.2"
    action: block_sender
    parameters:
      sender: "{{ parsed.sender }}"
      scope: organization

CLI Example

# Quarantine suspicious email
tw-cli action execute \
  --action quarantine_email \
  --param message_id="AAMkAGI2..." \
  --param reason="Phishing indicators detected"

# Block malicious sender
tw-cli action execute \
  --action block_sender \
  --param sender="attacker@example.com" \
  --param scope=organization

Host Actions

Actions for endpoint containment and investigation.

isolate_host

Network-isolate a compromised host via EDR.

Parameters:

Name    | Type   | Required | Description
------- | ------ | -------- | ---------------------
host_id | string | Yes      | EDR host/agent ID
reason  | string | No       | Reason for isolation

Output:

{
  "isolation_id": "iso-abc123",
  "host_id": "aid:xyz789",
  "hostname": "WORKSTATION-01",
  "isolated_at": "2024-01-15T10:40:00Z",
  "status": "isolated"
}

Behavior:

  • Host network access blocked
  • EDR agent maintains cloud connectivity
  • User notified (configurable)

Rollback: unisolate_host

Policy: Typically requires senior analyst or manager approval.

unisolate_host

Remove network isolation from a host.

Parameters:

Name    | Type   | Required | Description
------- | ------ | -------- | ------------------------------
host_id | string | Yes      | EDR host/agent ID
reason  | string | No       | Reason for removing isolation

Output:

{
  "host_id": "aid:xyz789",
  "hostname": "WORKSTATION-01",
  "unisolated_at": "2024-01-15T14:00:00Z",
  "status": "active"
}

scan_host

Trigger on-demand malware scan on a host.

Parameters:

Name      | Type   | Required | Description
--------- | ------ | -------- | -------------------------------
host_id   | string | Yes      | EDR host/agent ID
scan_type | string | No       | quick or full (default: quick)

Output:

{
  "scan_id": "scan-abc123",
  "host_id": "aid:xyz789",
  "scan_type": "quick",
  "started_at": "2024-01-15T10:45:00Z",
  "status": "running"
}

Note: Scan results are retrieved separately as they may take time.

Usage Examples

Malware Response Playbook

name: malware_response
steps:
  - action: isolate_host
    parameters:
      host_id: "{{ incident.raw_data.host_id }}"
      reason: "Malware detection - automated isolation"
    output: isolation

  - action: scan_host
    parameters:
      host_id: "{{ incident.raw_data.host_id }}"
      scan_type: full

  - action: create_ticket
    parameters:
      title: "Malware Incident - {{ incident.raw_data.hostname }}"
      priority: high

  - action: notify_user
    parameters:
      user: "{{ incident.raw_data.user }}"
      message: "Your workstation has been isolated due to a security incident"

CLI Example

# Isolate compromised host
tw-cli action execute \
  --action isolate_host \
  --param host_id="aid:xyz789" \
  --param reason="Active malware infection"

# This action typically requires approval
# Check approval status:
tw-cli action status act-123456

# After investigation, remove isolation:
tw-cli action execute \
  --action unisolate_host \
  --param host_id="aid:xyz789" \
  --param reason="Malware cleaned, host verified"

API Example

# Request host isolation
curl -X POST http://localhost:8080/api/incidents/INC-2024-001/actions \
  -H "Content-Type: application/json" \
  -d '{
    "action": "isolate_host",
    "parameters": {
      "host_id": "aid:xyz789",
      "reason": "Suspected compromise"
    }
  }'

# Response (if requires approval):
{
  "action_id": "act-abc123",
  "status": "pending_approval",
  "approval_level": "manager",
  "message": "Action requires SOC Manager approval"
}

Policy Configuration

Host actions are typically high-impact and require approval:

[[policy.rules]]
name = "isolate_requires_approval"
action = "isolate_host"
approval_level = "senior"

[[policy.rules]]
name = "critical_isolate_requires_manager"
action = "isolate_host"
severity = ["critical"]
approval_level = "manager"

Lookup Actions

Actions for enriching incidents with threat intelligence data.

lookup_sender_reputation

Query threat intelligence for sender domain and IP reputation.

Parameters:

Name           | Type   | Required | Description
-------------- | ------ | -------- | ------------------
sender         | string | Yes      | Email address
originating_ip | string | No       | Sending server IP

Output:

{
  "sender": "attacker@domain.com",
  "domain": "domain.com",
  "domain_reputation": {
    "score": 0.25,
    "categories": ["phishing", "newly-registered"],
    "first_seen": "2024-01-10",
    "registrar": "NameCheap"
  },
  "ip_reputation": {
    "ip": "192.168.1.100",
    "score": 0.3,
    "categories": ["spam", "proxy"],
    "country": "RU",
    "asn": "AS12345"
  },
  "overall_score": 0.25,
  "risk_level": "high"
}

Score Interpretation:

Score     | Risk Level
--------- | -----------
0.0 - 0.3 | High risk
0.3 - 0.6 | Medium risk
0.6 - 0.8 | Low risk
0.8 - 1.0 | Clean
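The bands above translate directly into a small classifier; the boundary handling (lower bound inclusive) is an assumption, since the table's bands overlap at the edges:

```python
def risk_level(score: float) -> str:
    """Map a reputation score to the risk bands above."""
    if score < 0.3:
        return "high"
    if score < 0.6:
        return "medium"
    if score < 0.8:
        return "low"
    return "clean"

print(risk_level(0.25))  # high
```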

lookup_urls

Check URLs against threat intelligence.

Parameters:

Name | Type  | Required | Description
---- | ----- | -------- | ----------------------
urls | array | Yes      | List of URLs to check

Output:

{
  "results": [
    {
      "url": "https://legitimate-site.com/page",
      "malicious": false,
      "categories": ["business"],
      "confidence": 0.95
    },
    {
      "url": "https://phishing-site.com/login",
      "malicious": true,
      "categories": ["phishing", "credential-theft"],
      "confidence": 0.92,
      "threat_details": {
        "targeted_brand": "Microsoft",
        "first_seen": "2024-01-14"
      }
    }
  ],
  "malicious_count": 1,
  "total_count": 2
}

lookup_attachments

Hash attachments and check against threat intelligence.

Parameters:

Name        | Type  | Required | Description
----------- | ----- | -------- | --------------------------------------
attachments | array | Yes      | List of attachment objects with sha256

Output:

{
  "results": [
    {
      "filename": "invoice.pdf",
      "sha256": "abc123...",
      "malicious": false,
      "file_type": "PDF document",
      "confidence": 0.9
    },
    {
      "filename": "update.exe",
      "sha256": "def456...",
      "malicious": true,
      "file_type": "Windows executable",
      "confidence": 0.98,
      "threat_details": {
        "malware_family": "Emotet",
        "first_seen": "2024-01-12",
        "detection_engines": 45
      }
    }
  ],
  "malicious_count": 1,
  "total_count": 2
}

lookup_hash

Look up a single file hash.

Parameters:

Name | Type   | Required | Description
---- | ------ | -------- | --------------------------
hash | string | Yes      | MD5, SHA1, or SHA256 hash

Output:

{
  "hash": "abc123...",
  "hash_type": "sha256",
  "malicious": true,
  "confidence": 0.95,
  "malware_family": "Emotet",
  "categories": ["trojan", "banking"],
  "first_seen": "2024-01-12",
  "last_seen": "2024-01-15",
  "detection_ratio": "45/70"
}

lookup_ip

Query IP address reputation.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| ip | string | Yes | IP address |

Output:

{
  "ip": "192.168.1.100",
  "malicious": true,
  "confidence": 0.8,
  "categories": ["c2", "malware-distribution"],
  "country": "RU",
  "asn": "AS12345",
  "asn_org": "Example ISP",
  "last_seen": "2024-01-15",
  "associated_malware": ["Cobalt Strike"]
}

Usage in Playbooks

name: email_triage
steps:
  - action: parse_email
    output: parsed

  - action: lookup_sender_reputation
    parameters:
      sender: "{{ parsed.sender }}"
    output: sender_rep

  - action: lookup_urls
    parameters:
      urls: "{{ parsed.urls }}"
    output: url_results

  - action: lookup_attachments
    parameters:
      attachments: "{{ parsed.attachments }}"
    output: attachment_results

  # Make decision based on lookups
  - condition: >
      sender_rep.risk_level == 'high' or
      url_results.malicious_count > 0 or
      attachment_results.malicious_count > 0
    set_verdict:
      classification: malicious
      confidence: 0.9

Caching

Lookup results are cached to reduce API calls:

| Lookup | Cache Duration |
|--------|----------------|
| Hash | 24 hours |
| URL | 1 hour |
| Domain | 6 hours |
| IP | 6 hours |

Force a fresh lookup by passing the skip_cache: true parameter.
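
The caching behavior can be sketched as a small TTL cache using the durations above; the key scheme, the fetch callback, and the skip_cache handling are illustrative, not the actual implementation:

```python
import time

# TTLs from the table above, in seconds
CACHE_TTL_SECONDS = {"hash": 24 * 3600, "url": 3600, "domain": 6 * 3600, "ip": 6 * 3600}

_cache: dict = {}  # (kind, value) -> (stored_at, result)

def cached_lookup(kind: str, value: str, fetch, skip_cache: bool = False) -> dict:
    """Return a cached result if fresh, otherwise call fetch and cache it."""
    key = (kind, value)
    now = time.monotonic()
    if not skip_cache and key in _cache:
        stored_at, result = _cache[key]
        if now - stored_at < CACHE_TTL_SECONDS[kind]:
            return result  # still within TTL
    result = fetch(value)
    _cache[key] = (now, result)
    return result
```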

Notification Actions

Actions for alerting stakeholders and managing escalation.

notify_user

Send notification to an affected user.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| user | string | Yes | User email or ID |
| message | string | Yes | Notification message |
| channel | string | No | email, slack, teams (default: email) |
| template | string | No | Notification template name |

Output:

{
  "notification_id": "notif-abc123",
  "recipient": "[email protected]",
  "channel": "email",
  "sent_at": "2024-01-15T10:50:00Z",
  "status": "delivered"
}

Templates:

# templates/notifications.yaml
security_alert:
  subject: "Security Alert: Action Required"
  body: |
    A security incident affecting your account has been detected.

    Incident ID: {{ incident_id }}
    Type: {{ incident_type }}

    {{ message }}

    If you did not initiate this activity, please contact IT Security.

notify_reporter

Send status update to the incident reporter.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| incident_id | string | Yes | Incident ID |
| status | string | Yes | Status update message |
| include_verdict | bool | No | Include AI verdict (default: false) |

Output:

{
  "notification_id": "notif-def456",
  "reporter": "[email protected]",
  "status": "delivered"
}

escalate

Route incident to appropriate approval level.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| incident_id | string | Yes | Incident ID |
| escalation_level | string | Yes | analyst, senior, manager |
| reason | string | Yes | Reason for escalation |
| override_assignee | string | No | Specific person to assign |
| custom_sla_hours | int | No | Custom SLA (overrides default) |
| notify_channels | array | No | Additional channels (slack, pagerduty) |

Output:

{
  "escalation_id": "esc-abc123",
  "incident_id": "INC-2024-001",
  "escalation_level": "senior",
  "assigned_to": "[email protected]",
  "due_date": "2024-01-15T12:50:00Z",
  "priority": "high",
  "sla_hours": 2
}

Default SLAs:

| Level | SLA |
|---------|---------|
| Analyst | 4 hours |
| Senior | 2 hours |
| Manager | 1 hour |

create_ticket

Create ticket in external ticketing system.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| title | string | Yes | Ticket title |
| description | string | Yes | Ticket description |
| priority | string | No | low, medium, high, critical |
| assignee | string | No | Initial assignee |
| labels | array | No | Ticket labels |

Output:

{
  "ticket_id": "12345",
  "ticket_key": "SEC-1234",
  "url": "https://company.atlassian.net/browse/SEC-1234",
  "created_at": "2024-01-15T10:55:00Z"
}

log_false_positive

Record a false positive for tuning.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| incident_id | string | Yes | Incident ID |
| reason | string | Yes | Why this is a false positive |
| feedback | string | No | Additional feedback for AI improvement |

Output:

{
  "fp_id": "fp-abc123",
  "incident_id": "INC-2024-001",
  "recorded_at": "2024-01-15T11:00:00Z",
  "used_for_training": true
}

run_triage_agent

Trigger AI triage agent on an incident.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| incident_id | string | Yes | Incident ID |
| playbook | string | No | Specific playbook to use |
| model | string | No | AI model override |

Output:

{
  "triage_id": "triage-abc123",
  "incident_id": "INC-2024-001",
  "verdict": "malicious",
  "confidence": 0.92,
  "reasoning": "Multiple indicators of phishing...",
  "recommended_actions": [
    "quarantine_email",
    "block_sender",
    "notify_user"
  ],
  "completed_at": "2024-01-15T10:52:00Z"
}

Usage Examples

Escalation Playbook

name: auto_escalate
trigger:
  - verdict: malicious
  - confidence: ">= 0.9"
  - severity: critical

steps:
  - action: escalate
    parameters:
      incident_id: "{{ incident.id }}"
      escalation_level: manager
      reason: "High-confidence critical incident requiring immediate attention"
      notify_channels:
        - slack
        - pagerduty

  - action: create_ticket
    parameters:
      title: "CRITICAL: {{ incident.subject }}"
      priority: critical

CLI Examples

# Escalate to senior analyst
tw-cli action execute \
  --incident INC-2024-001 \
  --action escalate \
  --param escalation_level=senior \
  --param reason="Complex threat requiring expertise"

# Create ticket
tw-cli action execute \
  --incident INC-2024-001 \
  --action create_ticket \
  --param title="Phishing Investigation" \
  --param priority=high

# Record false positive
tw-cli action execute \
  --incident INC-2024-001 \
  --action log_false_positive \
  --param reason="Legitimate vendor communication"

Policy Engine

The policy engine controls action approval workflows and enforces security boundaries.

Overview

Every action request passes through the policy engine:

Action Request → Build Context → Evaluate Rules → Decision
                                                    ├─ Allowed → Execute
                                                    ├─ Denied → Reject
                                                    └─ RequiresApproval → Queue

Policy Decision Types

| Decision | Behavior |
|----------|----------|
| Allowed | Action executes immediately |
| Denied | Action rejected with reason |
| RequiresApproval | Queued for specified approval level |

Action Context

The policy engine evaluates these attributes:

pub struct ActionContext {
    /// The action being requested
    pub action_type: String,

    /// Target of the action (host, email, user, etc.)
    pub target: String,

    /// Incident severity (if associated)
    pub severity: Option<Severity>,

    /// AI confidence score (if from triage)
    pub confidence: Option<f64>,

    /// Who/what is requesting the action
    pub proposer: Proposer,

    /// Additional context
    pub metadata: HashMap<String, Value>,
}

pub enum Proposer {
    User { id: String, role: Role },
    Agent { name: String },
    Playbook { name: String },
    System,
}

Default Policies

Without custom rules, these defaults apply:

| Action Category | Default Decision |
|-----------------|------------------|
| Lookup actions | Allowed |
| Analysis actions | Allowed |
| Notification actions | Allowed |
| Response actions | RequiresApproval (analyst) |
| Host containment | RequiresApproval (senior) |

Policy Rules

Define rules to control when actions require approval.

Rule Structure

[[policy.rules]]
name = "rule_name"
description = "Human-readable description"

# Matching criteria
action = "action_name"           # Specific action
action_patterns = ["pattern_*"]  # Glob patterns

# Conditions (all must match)
severity = ["high", "critical"]  # Incident severity
confidence_min = 0.8             # Minimum AI confidence
proposer_type = "agent"          # Who's requesting
proposer_role = "analyst"        # Role (if user)

# Decision
decision = "allowed"             # or "denied" or "requires_approval"
approval_level = "senior"        # If requires_approval
reason = "Explanation"           # If denied

Rule Examples

Auto-Approve Lookups

[[policy.rules]]
name = "auto_approve_lookups"
description = "Lookup actions are always allowed"
action_patterns = ["lookup_*"]
decision = "allowed"

Require Approval for Response Actions

[[policy.rules]]
name = "response_needs_analyst"
description = "Response actions require analyst approval"
action_patterns = ["quarantine_*", "block_*"]
decision = "requires_approval"
approval_level = "analyst"

High-Severity Host Isolation

[[policy.rules]]
name = "critical_isolation_needs_manager"
description = "Critical severity host isolation requires manager"
action = "isolate_host"
severity = ["critical"]
decision = "requires_approval"
approval_level = "manager"

Block Dangerous Actions in Production

[[policy.rules]]
name = "no_delete_production"
description = "Deletion actions not allowed in production"
action_patterns = ["delete_*"]
environment = "production"
decision = "denied"
reason = "Deletion actions are not permitted in production"

Trust High-Confidence AI Decisions

[[policy.rules]]
name = "trust_high_confidence_ai"
description = "Auto-approve when AI is highly confident"
proposer_type = "agent"
confidence_min = 0.95
severity = ["low", "medium"]
action_patterns = ["quarantine_email", "block_sender"]
decision = "allowed"

Analyst Self-Service

[[policy.rules]]
name = "analyst_can_notify"
description = "Analysts can send notifications without approval"
action_patterns = ["notify_*"]
proposer_role = "analyst"
decision = "allowed"

Rule Evaluation Order

Rules are evaluated in order. First matching rule wins.

# More specific rules first
[[policy.rules]]
name = "critical_isolation"
action = "isolate_host"
severity = ["critical"]
decision = "requires_approval"
approval_level = "manager"

# General fallback
[[policy.rules]]
name = "default_isolation"
action = "isolate_host"
decision = "requires_approval"
approval_level = "senior"
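
First-match-wins evaluation over rules like these can be sketched as follows; the matching logic is deliberately reduced to the action and severity fields, and the fallback decision when no rule matches is an assumption:

```python
def evaluate(rules: list, action: str, severity: str = None) -> dict:
    """Return the first rule matching the action context."""
    for rule in rules:
        if rule.get("action") and rule["action"] != action:
            continue  # rule targets a different action
        if rule.get("severity") and severity not in rule["severity"]:
            continue  # severity condition not met
        return rule  # first matching rule wins
    # Assumed fallback when no rule matches
    return {"decision": "requires_approval", "approval_level": "analyst"}

rules = [
    {"name": "critical_isolation", "action": "isolate_host",
     "severity": ["critical"], "approval_level": "manager"},
    {"name": "default_isolation", "action": "isolate_host",
     "approval_level": "senior"},
]
```

A critical isolate_host request matches critical_isolation; any other severity falls through to default_isolation, which is why the more specific rule must come first.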

Condition Operators

Severity Matching

severity = ["high", "critical"]  # Match any in list

Confidence Ranges

confidence_min = 0.8   # Minimum confidence
confidence_max = 0.95  # Maximum confidence

Pattern Matching

action_patterns = ["lookup_*"]        # Prefix match
action_patterns = ["*_email"]         # Suffix match
action_patterns = ["*block*"]         # Contains
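
These glob semantics (prefix, suffix, contains) are the same ones Python's fnmatch provides, which is enough to illustrate how action_patterns behave; this is a sketch, not the engine's actual matcher:

```python
from fnmatch import fnmatch

def matches_any(action: str, patterns: list) -> bool:
    """True if the action name matches any of the glob patterns."""
    return any(fnmatch(action, pattern) for pattern in patterns)
```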

Proposer Conditions

proposer_type = "user"      # user, agent, playbook, system
proposer_role = "analyst"   # Only for user proposers

Managing Rules

Via Configuration File

# Load rules from config/policy.toml at startup
tw-api --config config/policy.toml

Via API

# List rules
curl http://localhost:8080/api/policies

# Create rule
curl -X POST http://localhost:8080/api/policies \
  -H "Content-Type: application/json" \
  -d '{
    "name": "new_rule",
    "action": "isolate_host",
    "approval_level": "senior"
  }'

Via CLI

# List rules
tw-cli policy list

# Add rule
tw-cli policy add \
  --name "block_needs_approval" \
  --action "block_sender" \
  --approval-level analyst

Testing Rules

Simulate policy evaluation without executing:

tw-cli policy test \
  --action isolate_host \
  --severity critical \
  --proposer-type agent \
  --confidence 0.92

# Output:
# Decision: RequiresApproval
# Level: manager
# Matched Rule: critical_isolation_needs_manager

Approval Levels

Understanding the approval workflow in Triage Warden.

Approval Hierarchy

Manager (SOC Manager)
    │
    ▼
Senior (Senior Analyst)
    │
    ▼
Analyst (Security Analyst)
    │
    ▼
Auto (No approval needed)

Higher levels can approve actions at their level or below.
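
The "level or below" rule reduces to an ordering comparison; the numeric ranks here are illustrative:

```python
# Illustrative ranks for the hierarchy above (higher = more authority)
APPROVAL_RANK = {"auto": 0, "analyst": 1, "senior": 2, "manager": 3}

def can_approve(approver_level: str, required_level: str) -> bool:
    """An approver may approve actions at their level or below."""
    return APPROVAL_RANK[approver_level] >= APPROVAL_RANK[required_level]
```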

Level Definitions

| Level | Role | Typical Actions |
|-------|------|-----------------|
| Auto | System | Lookups, analysis, low-risk notifications |
| Analyst | Security Analyst | Email quarantine, sender blocking |
| Senior | Senior Analyst | Host isolation, broad blocks |
| Manager | SOC Manager | Critical containment, policy changes |

Approval Workflow

1. Action Requested

tw-cli action execute --incident INC-001 --action isolate_host

2. Policy Evaluation

The policy engine evaluates the request and returns:

{
  "decision": "requires_approval",
  "approval_level": "senior",
  "reason": "Host isolation requires senior analyst approval"
}

3. Action Queued

Action stored with pending status:

{
  "action_id": "act-abc123",
  "incident_id": "INC-001",
  "action_type": "isolate_host",
  "status": "pending_approval",
  "approval_level": "senior",
  "requested_by": "[email protected]",
  "requested_at": "2024-01-15T10:30:00Z"
}

4. Approvers Notified

Notification sent to eligible approvers via configured channels.

5. Approval Decision

Approver reviews and decides:

Approve:

tw-cli action approve act-abc123 --comment "Verified threat"

Reject:

tw-cli action reject act-abc123 --reason "False positive, user traveling"

6. Execution or Rejection

  • Approved: Action executes automatically
  • Rejected: Action marked rejected, requester notified

Approval UI

Access pending approvals at /approvals in the web dashboard.

Features:

  • Filterable list of pending actions
  • Incident context display
  • One-click approve/reject
  • Bulk approval for related actions

SLA Tracking

Each approval level has a default SLA:

| Level | Default SLA |
|---------|-------------|
| Analyst | 4 hours |
| Senior | 2 hours |
| Manager | 1 hour |

Overdue approvals are:

  1. Highlighted in the dashboard
  2. Re-sent to approvers as reminders
  3. Optionally escalated to the next level
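
Overdue detection is just a comparison of the current time against the request time plus the SLA; a sketch assuming the default durations above:

```python
from datetime import datetime, timedelta, timezone

# Default SLA durations from the table above
SLA_HOURS = {"analyst": 4, "senior": 2, "manager": 1}

def due_date(requested_at: datetime, level: str) -> datetime:
    """When the approval breaches its SLA."""
    return requested_at + timedelta(hours=SLA_HOURS[level])

def is_overdue(requested_at: datetime, level: str, now: datetime) -> bool:
    return now > due_date(requested_at, level)
```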

Delegation

Approvers can delegate when unavailable:

tw-cli approval delegate \
  --from [email protected] \
  --to [email protected] \
  --until 2024-01-20

Approval Groups

Configure approval groups for redundancy:

[approval_groups]
senior_analysts = [
  "[email protected]",
  "[email protected]",
  "[email protected]"
]

managers = [
  "[email protected]",
  "[email protected]"
]

Any member of the group can approve.

Audit Trail

All approval decisions are logged:

{
  "event": "action_approved",
  "action_id": "act-abc123",
  "approver": "[email protected]",
  "decision": "approved",
  "comment": "Verified threat indicators",
  "timestamp": "2024-01-15T10:45:00Z",
  "time_to_approve": "15m"
}

Emergency Override

In emergencies, managers can bypass approval:

tw-cli action execute \
  --incident INC-001 \
  --action isolate_host \
  --emergency \
  --reason "Active ransomware, immediate containment required"

Emergency overrides:

  • Are logged with high visibility
  • Require manager credentials
  • Trigger additional notifications

Natural Language Queries

Query your security data using plain English instead of writing Splunk SPL, Elasticsearch KQL, or SQL by hand.

Overview

The NL Query Interface (Stage 4.1) lets analysts type questions like "show me critical incidents from the last 24 hours" and have Triage Warden translate them into structured queries against your SIEM, log store, or incident database.

The pipeline has four stages:

  1. Intent classification -- determines what the analyst is trying to do
  2. Entity extraction -- pulls out IPs, domains, hashes, date ranges, etc.
  3. Query translation -- converts the parsed intent + entities into the target query language
  4. Backend execution -- runs the query against Splunk, Elasticsearch, or SQL

Supported Intents

| Intent | Example query |
|--------|---------------|
| search_incidents | "show me open critical incidents" |
| search_logs | "find authentication failures in the last hour" |
| lookup_ioc | "check reputation for 192.168.1.100" |
| explain_incident | "what happened in INC-2024-0042?" |
| compare_incidents | "compare INC-001 and INC-002" |
| timeline_query | "show me events from last week" |
| asset_lookup | "who owns server web-prod-01?" |
| statistics | "how many phishing incidents this month?" |

Intent classification uses keyword matching and regex patterns -- no LLM call is needed for routing.
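
A minimal keyword/regex router in that spirit might look like this; the keyword lists, intent coverage, and fallback intent are assumptions for illustration, not the shipped classifier:

```python
import re

# Patterns checked in order; first intent with a matching pattern wins
INTENT_PATTERNS = {
    "lookup_ioc": [r"\breputation\b", r"\bcheck\b.*\b\d{1,3}(?:\.\d{1,3}){3}\b"],
    "statistics": [r"\bhow many\b"],
    "search_incidents": [r"\bincidents?\b"],
}

def classify(query: str) -> str:
    """Route a natural-language query to an intent without any LLM call."""
    q = query.lower()
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(p, q) for p in patterns):
            return intent
    return "search_logs"  # assumed fallback intent
```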

Entity Extraction

The entity extractor recognizes security-specific tokens:

  • IP addresses -- IPv4 (192.168.1.100)
  • Domains -- evil-domain.com
  • Hashes -- MD5 (32 hex chars), SHA-1 (40), SHA-256 (64)
  • Incident IDs -- INC-2024-0042, #42
  • Date ranges -- "last 24 hours", "past 7 days", 2024-01-01 to 2024-01-31
  • Usernames, hostnames, CVE IDs
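
A few of those extractors can be sketched with regular expressions; the patterns are simplified (for example, the IPv4 regex does not validate octet ranges) and are not the actual extractor:

```python
import re

PATTERNS = {
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),          # IPv4, unvalidated octets
    "sha256": re.compile(r"\b[0-9a-fA-F]{64}\b"),              # 64 hex chars
    "incident_id": re.compile(r"\bINC-\d{4}-\d+\b"),           # e.g. INC-2024-0042
}

def extract_entities(text: str) -> dict:
    """Return all matches for each recognized token type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
```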

Query Translation

Once intent and entities are extracted, NLQueryTranslator builds a structured query object:

from tw_ai.nl_query import NLQueryTranslator

translator = NLQueryTranslator()
result = translator.translate(
    "show me failed logins from 10.0.0.50 in the last hour"
)
# result.intent.intent = QueryIntent.SEARCH_LOGS
# result.structured_query holds the backend-specific query

Backend Adapters

The translator outputs queries for three backends:

| Backend | Output format | Example / use case |
|---------|---------------|--------------------|
| Splunk | SPL queries | index=auth action=failure src_ip=10.0.0.50 earliest=-1h |
| Elasticsearch | KQL / Query DSL | event.action:failure AND source.ip:10.0.0.50 |
| SQL | SQL WHERE clauses | Incident database queries |

Conversation Context

Multi-turn conversations are supported via ConversationContext. When an analyst asks "now show me the same for last week", the system retains the entities from the previous turn.

from tw_ai.nl_query import ConversationContext

ctx = ConversationContext()
ctx.update("show me incidents from 10.0.0.50", entities=[...])
ctx.update("now filter to critical only", entities=[...])
# Second turn inherits the IP entity from the first

Security and Audit

All NL queries are sanitized before execution to prevent injection attacks. The QuerySanitizer strips dangerous characters and SQL keywords from user input.

Every query is logged to the QueryAuditLog with:

  • Original natural language query
  • Classified intent and confidence
  • Translated structured query
  • Execution timestamp and user ID

API Endpoint

When FastAPI is available, the NL query service exposes a REST endpoint:

curl -X POST http://localhost:8080/api/v1/nl/query \
  -H "Content-Type: application/json" \
  -d '{"query": "show me critical incidents from the last 24 hours"}'

Configuration

No special configuration is required. The NL query engine uses the same SIEM and database connections already configured in config/default.yaml.

To add custom keywords for intent classification:

from tw_ai.nl_query import IntentClassifier, QueryIntent

classifier = IntentClassifier(
    custom_keywords={
        QueryIntent.SEARCH_LOGS: ["splunk", "kibana"],
    }
)

Automated Threat Hunting

Proactively search for threats across your environment using hypothesis-driven hunts with built-in query templates mapped to MITRE ATT&CK.

Overview

The threat hunting module (Stage 5.1) provides:

  • Hunt management -- create, schedule, and track hunts with hypotheses
  • Built-in query library -- 20+ pre-built queries across 8 MITRE ATT&CK categories
  • Multi-platform queries -- Splunk SPL and Elasticsearch KQL templates
  • Finding promotion -- promote hunt findings directly to incidents

Hunt Lifecycle

A hunt progresses through these statuses:

| Status | Description |
|--------|-------------|
| draft | Hunt is being designed, not yet executable |
| active | Hunt is enabled and will run on schedule or trigger |
| paused | Temporarily suspended |
| completed | Finished executing (one-time hunts) |
| failed | Execution encountered errors |
| archived | No longer active, kept for reference |

Creating a Hunt

Via API

curl -X POST http://localhost:8080/api/v1/hunts \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Detect Kerberoasting",
    "hypothesis": "Attackers may request TGS tickets for service accounts to crack offline",
    "hunt_type": "scheduled",
    "queries": [
      {
        "query_type": "splunk",
        "query": "index=wineventlog EventCode=4769 TicketEncryptionType=0x17 | stats count by ServiceName",
        "description": "Detect RC4-encrypted TGS requests",
        "timeout_secs": 300,
        "expected_baseline": 5
      }
    ],
    "schedule": {
      "cron_expression": "0 */4 * * *",
      "timezone": "UTC",
      "max_runtime_secs": 600
    },
    "mitre_techniques": ["T1558.003"],
    "data_sources": ["windows_event_logs"],
    "tags": ["credential-access", "priority-high"],
    "enabled": true
  }'

Hunt Types

| Type | Description |
|------|-------------|
| scheduled | Runs on a cron schedule |
| continuous | Runs as a streaming query |
| on_demand | Runs only when manually triggered |
| triggered | Runs when a condition is met (e.g., new threat intel) |

Built-in Query Library

Access 20+ pre-built queries via the API:

curl http://localhost:8080/api/v1/hunts/queries/library

Queries span 8 MITRE ATT&CK categories:

  • Initial Access
  • Execution
  • Persistence
  • Credential Access
  • Lateral Movement
  • Collection
  • Command and Control
  • Exfiltration

Each built-in query includes Splunk SPL and Elasticsearch KQL templates, expected baselines for anomaly detection, and configurable parameters.

Executing a Hunt

Trigger a hunt manually:

curl -X POST http://localhost:8080/api/v1/hunts/{hunt_id}/execute

The response includes findings with severity levels, evidence data, and the query that produced each finding.

Viewing Results

# Get all results for a hunt
curl http://localhost:8080/api/v1/hunts/{hunt_id}/results

Each result includes:

  • Total and critical finding counts
  • Duration and execution status
  • Individual findings with severity, evidence, and matched query

Promoting Findings to Incidents

When a hunt finding warrants investigation, promote it to a full incident:

curl -X POST http://localhost:8080/api/v1/hunts/{hunt_id}/findings/{finding_id}/promote

This creates a new incident with the finding's evidence, severity, and hunt metadata attached.

Query Languages

| Language | Identifier | Example |
|----------|------------|---------|
| Splunk SPL | splunk | index=wineventlog EventCode=4625 |
| Elasticsearch | elasticsearch | event.code: 4625 |
| SQL | sql | SELECT * FROM events WHERE event_code = 4625 |
| Kusto (KQL) | kusto | SecurityEvent \| where EventID == 4625 |
| Custom | custom | Any custom query syntax |

Python Hypothesis Generator

The Python tw_ai package includes an AI-powered hypothesis generator that suggests new hunts based on current threat intelligence and recent incident patterns.

Collaboration

Coordinate incident response across your team with assignments, comments, real-time events, activity feeds, and shift handoffs.

Overview

The collaboration module (Stage 4.3) adds team workflow features to incident management:

  • Incident assignment -- manual and auto-assignment with rules
  • Comments -- threaded discussion on incidents with mentions
  • Real-time events -- live updates pushed to connected clients
  • Activity feed -- chronological audit trail of all actions
  • Shift handoff -- structured handoff reports between shifts

Incident Assignment

Manual Assignment

Assign an incident to an analyst through the web UI's assignment picker, or via the web endpoint:

curl -X POST http://localhost:8080/web/incidents/{id}/assign \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d 'assignee=analyst-uuid'

Auto-Assignment Rules

The system supports rule-based auto-assignment. Rules are defined in the application configuration and evaluated when new incidents arrive. Each rule specifies conditions and an assignee target:

| Field | Description |
|-------|-------------|
| name | Human-readable rule name |
| conditions | List of conditions to match (severity, incident type, source, tag) |
| assignee | Who to assign to (see Assignee Targets below) |
| priority | Evaluation order (lower number = higher priority) |

Rules are evaluated in priority order. The first matching rule wins.

Note: Auto-assignment rule management via API is planned for a future release. Rules are currently configured at the application level.

Assignee Targets

| Type | Description |
|------|-------------|
| user | Assign to a specific analyst by ID |
| team | Round-robin across team members |
| on_call | Assign to whoever is on-call |
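
The team target's round-robin behavior can be sketched with an in-memory rotation; in practice the assignment cursor would live in the database rather than an iterator:

```python
from itertools import cycle

class TeamAssigner:
    """Round-robin assignment across team members (illustrative only)."""

    def __init__(self, members: list):
        self._members = cycle(members)

    def next_assignee(self) -> str:
        return next(self._members)
```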

Comments

Add discussion, analysis notes, and action records to incidents.

Creating a Comment

curl -X POST http://localhost:8080/api/v1/comments \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "incident-uuid",
    "content": "Found lateral movement evidence via PsExec. @senior-analyst please review.",
    "comment_type": "analysis",
    "mentions": ["senior-analyst-uuid"]
  }'

Comment Types

| Type | Use case |
|------|----------|
| note | General notes and observations |
| analysis | Technical findings and analysis |
| action_taken | Record of actions performed |
| question | Questions for other team members |
| resolution | Final resolution summary |

Filtering Comments

# All comments for an incident
curl "http://localhost:8080/api/v1/comments?incident_id={id}"

# Only analysis comments
curl "http://localhost:8080/api/v1/comments?incident_id={id}&comment_type=analysis"

# Comments by a specific analyst
curl "http://localhost:8080/api/v1/comments?author_id={analyst_id}"

Comments support pagination with page and per_page query parameters.

Real-time Events

The real-time event system pushes updates to connected clients when incidents are modified, comments are added, or assignments change. Events include:

  • Incident status changes
  • New comments and mentions
  • Assignment updates
  • Action execution results
  • Field-level change tracking

Subscribers can filter events by incident ID, event type, or severity.

Activity Feed

Every action on an incident is recorded in the activity feed, providing a complete audit trail:

  • Who did what and when
  • What fields changed (with before/after values)
  • Comment and assignment history
  • Action execution records

Filter the activity feed by incident, user, or activity type.

Shift Handoff

Generate structured handoff reports at shift transitions:

curl -X POST http://localhost:8080/api/v1/handoffs \
  -H "Content-Type: application/json" \
  -d '{
    "shift_start": "2025-01-15T08:00:00Z",
    "shift_end": "2025-01-15T16:00:00Z",
    "notes": "Ongoing phishing campaign targeting finance department"
  }'

Handoff reports include:

  • Summary of open incidents per severity
  • Actions pending approval
  • Recent escalations
  • Custom notes from the outgoing team

Agentic AI Response

Control how much autonomy the AI has when responding to incidents, from fully manual to fully autonomous, with time-based rules and per-action overrides.

Overview

The Agentic AI Response system (Stage 5.4) provides configurable autonomy levels that determine which actions the AI can execute automatically and which require human approval. It includes:

  • Four autonomy levels with increasing automation
  • Per-action and per-severity overrides
  • Time-based rules for different autonomy during business hours vs. off-hours
  • Execution guardrails to prevent dangerous actions
  • Full audit trail of every autonomy decision

Autonomy Levels

| Level | Actions auto-executed | Human role |
|-------|-----------------------|------------|
| assisted | None | AI suggests, human executes everything |
| supervised | Low-risk only | AI auto-executes safe actions, human approves the rest |
| autonomous | All except protected | AI handles most actions, human reviews protected targets |
| full_autonomous | Everything | Emergency mode -- AI executes all actions (requires special auth) |

Risk Level Mapping

Each action has an inherent risk level that determines whether it can be auto-executed:

| Risk level | Auto-execute in Supervised? | Auto-execute in Autonomous? |
|------------|-----------------------------|-----------------------------|
| none | Yes | Yes |
| low | Yes | Yes |
| medium | No | Yes |
| high | No | Yes |
| critical | No | No (requires full_autonomous) |
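
The mapping collapses to a per-level risk ceiling; the helper below is an illustrative reading of the table, not the actual resolver:

```python
# Risk levels in increasing order, and the highest risk each autonomy
# level may auto-execute (None = nothing is auto-executed)
RISK_ORDER = ["none", "low", "medium", "high", "critical"]
MAX_AUTO_RISK = {"assisted": None, "supervised": "low",
                 "autonomous": "high", "full_autonomous": "critical"}

def can_auto_execute(risk: str, level: str) -> bool:
    """True if an action of this risk may run without human approval."""
    ceiling = MAX_AUTO_RISK[level]
    if ceiling is None:
        return False
    return RISK_ORDER.index(risk) <= RISK_ORDER.index(ceiling)
```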

Configuration

Get Current Config

curl http://localhost:8080/api/v1/autonomy/config

Update Config

curl -X PUT http://localhost:8080/api/v1/autonomy/config \
  -H "Content-Type: application/json" \
  -d '{
    "default_level": "supervised",
    "per_action_overrides": {
      "isolate_host": "assisted",
      "create_ticket": "autonomous"
    },
    "per_severity_overrides": {
      "critical": "assisted",
      "low": "autonomous"
    },
    "time_based_rules": [
      {
        "name": "Business hours - supervised",
        "start_hour": 9,
        "end_hour": 17,
        "days_of_week": [1, 2, 3, 4, 5],
        "level": "supervised"
      },
      {
        "name": "Off-hours - autonomous",
        "start_hour": 17,
        "end_hour": 9,
        "days_of_week": [0, 1, 2, 3, 4, 5, 6],
        "level": "autonomous"
      }
    ],
    "emergency_contacts": ["[email protected]"]
  }'

Resolution Priority

When resolving the autonomy level for a given action, overrides are checked in this order:

  1. Per-action overrides (highest priority)
  2. Per-severity overrides
  3. Time-based rules
  4. Default level (fallback)
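
That resolution order can be sketched as a cascade of lookups; the helper and its signature are illustrative (the hour_level argument stands in for the result of time-based rule matching):

```python
def resolve_level(action: str, severity: str, hour_level, config: dict) -> str:
    """Resolve autonomy level using the documented priority order."""
    if action in config.get("per_action_overrides", {}):
        return config["per_action_overrides"][action]      # 1. per-action
    if severity in config.get("per_severity_overrides", {}):
        return config["per_severity_overrides"][severity]  # 2. per-severity
    if hour_level is not None:
        return hour_level                                  # 3. time-based rule
    return config["default_level"]                         # 4. fallback
```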

Resolve for a Specific Action

Check what the system would decide for a specific action + severity combination:

curl -X POST http://localhost:8080/api/v1/autonomy/resolve \
  -H "Content-Type: application/json" \
  -d '{"action": "isolate_host", "severity": "critical"}'

Response:

{
  "level": "assisted",
  "auto_execute": false,
  "reason": "Per-action override for 'isolate_host'"
}

Time-Based Rules

Time-based rules let you run with less autonomy during business hours (when analysts are available) and more autonomy during nights and weekends.

| Field | Description |
|-------|-------------|
| name | Human-readable rule name |
| start_hour | Start hour, 0-23 inclusive |
| end_hour | End hour, 0-24 exclusive |
| days_of_week | Array of days (0=Sunday through 6=Saturday) |
| level | Autonomy level when rule applies |

Hours wrap around midnight: start_hour: 22, end_hour: 6 means 10 PM to 6 AM.
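
The wrap-around check can be sketched as follows (start inclusive, end exclusive, per the table above):

```python
def rule_applies(hour: int, start_hour: int, end_hour: int) -> bool:
    """True if the hour falls in [start_hour, end_hour), wrapping past midnight."""
    if start_hour < end_hour:
        return start_hour <= hour < end_hour
    return hour >= start_hour or hour < end_hour  # window wraps past midnight
```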

Execution Guardrails

The guardrails system (configured in config/guardrails.yaml) provides hard limits regardless of autonomy level:

  • Forbidden actions -- actions that can never be automated (e.g., delete_user, wipe_host)
  • Protected assets -- targets that always require human approval (production systems, domain controllers)
  • Rate limits -- maximum actions per hour/day to prevent runaway automation
  • Blast radius limits -- caps on how many targets a single action can affect

See Guardrails Reference for full configuration details.

Audit Log

Every autonomy decision is logged for compliance and debugging:

curl "http://localhost:8080/api/v1/autonomy/audit?limit=20"

# Filter by incident
curl "http://localhost:8080/api/v1/autonomy/audit?incident_id={id}"

Each audit entry records:

  • Action and severity evaluated
  • Resolved autonomy level
  • Whether auto-execution was allowed
  • Reason for the decision
  • Whether the action was actually executed
  • Execution outcome

Attack Surface Integration

Correlate incidents with known vulnerabilities and external exposures using integrations with vulnerability scanners and attack surface monitoring platforms.

Overview

The attack surface module (Stage 5.2) connects Triage Warden to:

  • Vulnerability scanners -- Qualys, Tenable, and Rapid7 for known vulnerability data
  • Attack surface monitors -- Censys and SecurityScorecard for external exposure discovery
  • Risk scoring -- combined risk assessment using vulnerability and exposure data

Vulnerability Scanners

Supported Platforms

| Platform | Connector | Capabilities |
|----------|-----------|--------------|
| Qualys | QualysConnector | Asset vulns, scan results, CVE lookup, recent findings |
| Tenable | TenableConnector | Asset vulns, scan results, CVE lookup, recent findings |
| Rapid7 | Rapid7Connector | Asset vulns, scan results, CVE lookup, recent findings |

VulnerabilityScanner Trait

All scanners implement the same trait, making them interchangeable:

pub trait VulnerabilityScanner: Connector {
    async fn get_vulnerabilities_for_asset(&self, asset_id: &str) -> ConnectorResult<Vec<Vulnerability>>;
    async fn get_scan_results(&self, scan_id: &str) -> ConnectorResult<ScanResult>;
    async fn get_recent_vulnerabilities(&self, since: DateTime<Utc>, limit: Option<usize>) -> ConnectorResult<Vec<Vulnerability>>;
    async fn get_vulnerability_by_cve(&self, cve_id: &str) -> ConnectorResult<Option<Vulnerability>>;
}

Vulnerability Data

Each vulnerability includes:

| Field | Description |
|-------|-------------|
| cve_id | CVE identifier (if assigned) |
| severity | Informational, Low, Medium, High, Critical |
| cvss_score | CVSS base score (0.0 - 10.0) |
| affected_asset_ids | Which assets are affected |
| exploit_available | Whether a public exploit exists |
| patch_available | Whether a vendor patch is available |
| status | Open, Remediated, Accepted, FalsePositive |

Scan Results

Query scan results for summary data:

| Field | Description |
|-------|-------------|
| total_hosts | Number of hosts scanned |
| vulnerabilities_found | Total vulnerabilities discovered |
| critical_count | Critical severity findings |
| high_count | High severity findings |
| status | Pending, Running, Completed, Failed, Cancelled |

Attack Surface Monitoring

Supported Platforms

| Platform | Connector | Capabilities |
|----------|-----------|--------------|
| Censys | CensysConnector | Domain exposures, asset exposure, risk scoring |
| SecurityScorecard | ScorecardConnector | Domain exposures, asset exposure, risk scoring |

AttackSurfaceMonitor Trait

pub trait AttackSurfaceMonitor: Connector {
    async fn get_exposures(&self, domain: &str) -> ConnectorResult<Vec<ExternalExposure>>;
    async fn get_asset_exposure(&self, asset_id: &str) -> ConnectorResult<Vec<ExternalExposure>>;
    async fn get_risk_score(&self, domain: &str) -> ConnectorResult<Option<f32>>;
}

Exposure Types

The system detects these categories of external exposure:

Type | Description | Example
open_port | Open network port with identified service | Port 22 running SSH
expired_certificate | TLS certificate past its expiry date | example.com cert expired
weak_cipher | Deprecated or weak TLS cipher in use | RC4 cipher detected
exposed_service | Publicly accessible service that may be unintended | Elasticsearch on public IP
dns_issue | DNS misconfiguration | Missing SPF record
misconfigured_header | Missing or incorrect HTTP security header | No X-Frame-Options

Each exposure includes a risk score (0.0 to 100.0) and structured details.

Risk Scoring

Risk scores from vulnerability scanners and ASM platforms are combined during incident triage to assess the exposure of affected assets. When the AI agent triages an incident involving a compromised host, it can check:

  1. What known vulnerabilities exist on the host
  2. Whether public exploits are available for those vulnerabilities
  3. What external exposures exist for the host or its domain
  4. The overall risk score for the affected domain

This context helps the agent make more accurate severity assessments and recommend appropriate response actions.
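The documentation does not specify how the two score sources are blended; the sketch below is one illustrative way to do it in Python. The 60/40 weighting, the CVSS-to-100 scaling, and the exploit bonus are all assumptions for illustration, not the shipped algorithm.

```python
from dataclasses import dataclass

@dataclass
class HostRisk:
    cvss_scores: list[float]      # CVSS base scores (0.0-10.0) for open vulns
    exploit_available: bool       # any vuln with a public exploit
    exposure_scores: list[float]  # ASM exposure risk scores (0.0-100.0)

def combined_risk(host: HostRisk) -> float:
    """Blend vulnerability and exposure data into a single 0-100 risk score."""
    vuln = max(host.cvss_scores, default=0.0) * 10       # scale CVSS onto 0-100
    exposure = max(host.exposure_scores, default=0.0)
    score = 0.6 * vuln + 0.4 * exposure                  # illustrative weighting
    if host.exploit_available:
        score = min(100.0, score + 15.0)                 # public exploit raises risk
    return round(score, 1)
```

A host with a CVSS 9.8 vulnerability, a public exploit, and a high-risk exposure would score at the cap, while a mid-severity host without exploits lands in the middle of the range.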

Configuration

Add vulnerability scanner and ASM connectors in config/default.yaml:

connectors:
  qualys:
    connector_type: qualys
    enabled: true
    base_url: https://qualysapi.qualys.com
    api_key: ${QUALYS_USERNAME}
    api_secret: ${QUALYS_PASSWORD}
    timeout_secs: 60

  censys:
    connector_type: censys
    enabled: true
    base_url: https://search.censys.io
    api_key: ${CENSYS_API_ID}
    api_secret: ${CENSYS_SECRET}
    timeout_secs: 30

Content Packages

Share playbooks, hunts, knowledge articles, and saved queries between Triage Warden instances using distributable content packages.

Overview

The content package system (Stage 5.5) provides:

  • Import/export of playbooks, hunts, knowledge, and queries
  • Package validation before import
  • Conflict resolution when imported content already exists
  • Semantic versioning and compatibility tracking

Package Format

A content package consists of a manifest and a list of content items:

{
  "manifest": {
    "name": "phishing-response-kit",
    "version": "1.2.0",
    "description": "Playbooks and hunts for phishing incident response",
    "author": "Security Team",
    "license": "MIT",
    "tags": ["phishing", "email", "social-engineering"],
    "compatibility": ">=2.0.0"
  },
  "contents": [
    {
      "type": "playbook",
      "name": "phishing-triage",
      "data": { "stages": [...] }
    },
    {
      "type": "hunt",
      "name": "credential-harvesting-detection",
      "data": { "hypothesis": "...", "queries": [...] }
    },
    {
      "type": "knowledge",
      "title": "Phishing Indicators Guide",
      "content": "Common phishing indicators include..."
    },
    {
      "type": "query",
      "name": "failed-logins-by-source",
      "query_type": "siem",
      "query": "event.type:authentication AND event.outcome:failure | stats count by source.ip"
    }
  ]
}

Content Types

Type | Description | Stored in
playbook | Automated response workflows | Playbook repository
hunt | Threat hunt definitions with queries | Hunt store
knowledge | Reference articles and guides | Knowledge base
query | Saved search queries | Query library

Manifest Fields

Field | Required | Description
name | Yes | Unique package name
version | Yes | Semantic version string
description | Yes | What the package contains
author | Yes | Creator name or organization
license | No | License identifier (e.g., "MIT", "Apache-2.0")
tags | No | Categorization tags
compatibility | No | Minimum Triage Warden version required

Importing Packages

curl -X POST http://localhost:8080/api/v1/packages/import \
  -H "Content-Type: application/json" \
  -d '{
    "package": { ... },
    "conflict_resolution": "skip"
  }'

Response:

{
  "imported": 3,
  "skipped": 1,
  "errors": []
}

Conflict Resolution

When an imported item has the same name as an existing one:

Mode | Behavior
skip | Keep existing, ignore the imported item (default)
overwrite | Replace existing with the imported version
rename | Import with a modified name (e.g., phishing-triage-imported-1)
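The three modes can be sketched as a small Python decision function. This is an illustration of the semantics described above, not the server's implementation; the `-imported-N` suffix format is inferred from the example name.

```python
def resolve_conflict(name: str, existing: set[str], mode: str = "skip"):
    """Decide what to do with an imported item whose name may already exist.

    Returns (action, final_name), where action is "import", "skip", or "overwrite".
    """
    if name not in existing:
        return ("import", name)          # no conflict at all
    if mode == "skip":
        return ("skip", name)            # keep existing, ignore import
    if mode == "overwrite":
        return ("overwrite", name)       # replace existing item
    if mode == "rename":
        n = 1
        while f"{name}-imported-{n}" in existing:
            n += 1                       # find the first free suffix
        return ("import", f"{name}-imported-{n}")
    raise ValueError(f"unknown conflict mode: {mode}")
```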

Validating Packages

Check a package for errors before importing:

curl -X POST http://localhost:8080/api/v1/packages/validate \
  -H "Content-Type: application/json" \
  -d '{ "manifest": { ... }, "contents": [ ... ] }'

Response:

{
  "valid": true,
  "warnings": ["Package author is not specified"],
  "errors": [],
  "content_count": 4
}

Validation checks:

  • Package name and version are present
  • All content items have non-empty names
  • Warns on missing author or empty content list
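The checks listed above can be expressed as a short Python sketch that produces the same report shape as the `/packages/validate` response. The function is a hypothetical client-side mirror of the server's validation, under the assumption that `knowledge` items use `title` where other types use `name`.

```python
def validate_package(package: dict) -> dict:
    """Validate a package dict and return a report like the API response."""
    manifest = package.get("manifest", {})
    contents = package.get("contents", [])
    errors, warnings = [], []

    # Required manifest fields
    if not manifest.get("name"):
        errors.append("Package name is required")
    if not manifest.get("version"):
        errors.append("Package version is required")

    # Warn-level checks
    if not manifest.get("author"):
        warnings.append("Package author is not specified")
    if not contents:
        warnings.append("Package has no content items")

    # Every item needs a non-empty name (knowledge items use "title")
    for i, item in enumerate(contents):
        if not (item.get("name") or item.get("title")):
            errors.append(f"Content item {i} has an empty name")

    return {
        "valid": not errors,
        "warnings": warnings,
        "errors": errors,
        "content_count": len(contents),
    }
```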

Exporting Content

Export a Playbook

curl -X POST http://localhost:8080/api/v1/packages/export/playbook/{playbook_id} \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-playbook-package",
    "version": "1.0.0",
    "description": "Exported playbook",
    "author": "Security Team",
    "license": "MIT",
    "tags": ["phishing"]
  }'

Export a Hunt

curl -X POST http://localhost:8080/api/v1/packages/export/hunt/{hunt_id} \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-hunt-package",
    "version": "1.0.0",
    "description": "Exported hunt",
    "author": "Threat Hunting Team"
  }'

Both return the full package JSON that can be shared or imported into another instance.

AI Triage

Automated incident analysis using Claude AI agents.

Overview

The triage agent analyzes security incidents to:

  1. Classify - Determine if the incident is malicious, suspicious, or benign
  2. Assess confidence - Quantify certainty in the classification
  3. Explain - Provide reasoning for the verdict
  4. Recommend - Suggest response actions

How It Works

Incident → Playbook Selection → Tool Execution → AI Analysis → Verdict
  1. Incident received - New incident created via webhook or API
  2. Playbook selected - Based on incident type (phishing, malware, etc.)
  3. Tools executed - Parse data, lookup reputation, check authentication
  4. AI analysis - Claude analyzes gathered data
  5. Verdict returned - Classification with confidence and recommendations

Example Verdict

{
  "incident_id": "INC-2024-001",
  "classification": "malicious",
  "confidence": 0.92,
  "category": "phishing",
  "reasoning": "Multiple indicators suggest this is a credential phishing attempt:\n1. Sender domain registered 2 days ago\n2. SPF and DKIM authentication failed\n3. URL leads to a fake Microsoft login page\n4. Subject uses urgency tactics",
  "recommended_actions": [
    {
      "action": "quarantine_email",
      "priority": 1,
      "reason": "Prevent user access to phishing content"
    },
    {
      "action": "block_sender",
      "priority": 2,
      "reason": "Sender has no legitimate history"
    },
    {
      "action": "notify_user",
      "priority": 3,
      "reason": "Educate user about phishing attempt"
    }
  ],
  "iocs": [
    {"type": "domain", "value": "phishing-site.com"},
    {"type": "ip", "value": "192.168.1.100"}
  ],
  "mitre_attack": ["T1566.001", "T1078"]
}

Triggering Triage

Automatic (Webhook)

Configure webhooks to auto-triage new incidents:

webhooks:
  email_gateway:
    auto_triage: true
    playbook: phishing_triage

Manual (CLI)

tw-cli triage run --incident INC-2024-001

Manual (API)

curl -X POST http://localhost:8080/api/incidents/INC-2024-001/triage

Triage Agent

The AI agent that analyzes security incidents.

Architecture

┌─────────────────────────────────────────────────────────┐
│                     Triage Agent                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │   Claude    │  │   Tools     │  │  Playbook   │     │
│  │   Model     │  │   (Bridge)  │  │   Engine    │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│                    Python Bridge                         │
│           (ThreatIntelBridge, SIEMBridge, etc.)         │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│                   Rust Connectors                        │
│        (VirusTotal, Splunk, CrowdStrike, etc.)          │
└─────────────────────────────────────────────────────────┘

Agent Configuration

# python/tw_ai/agents/config.py
class AgentConfig:
    model: str = "claude-sonnet-4-20250514"
    max_tokens: int = 4096
    temperature: float = 0.1
    max_tool_calls: int = 10
    timeout_seconds: int = 120

Environment variables:

TW_AI_PROVIDER=anthropic
TW_ANTHROPIC_API_KEY=your-key
TW_AI_MODEL=claude-sonnet-4-20250514

Available Tools

The agent has access to these tools via the Python bridge:

Tool | Purpose
parse_email | Extract email components
check_email_authentication | Validate SPF/DKIM/DMARC
lookup_sender_reputation | Query sender reputation
lookup_urls | Check URL reputation
lookup_attachments | Check attachment hashes
search_siem | Query SIEM for related events
get_host_info | Get EDR host information

Agent Workflow

async def triage(self, incident: Incident) -> Verdict:
    # 1. Load appropriate playbook
    playbook = self.load_playbook(incident.incident_type)

    # 2. Execute playbook steps (tools)
    context = {}
    for step in playbook.steps:
        result = await self.execute_step(step, incident, context)
        context[step.output] = result

    # 3. Build analysis prompt
    prompt = self.build_analysis_prompt(incident, context)

    # 4. Get AI verdict
    response = await self.client.messages.create(
        model=self.config.model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=self.config.max_tokens
    )

    # 5. Parse and return verdict
    return self.parse_verdict(response)

System Prompt

The agent uses a specialized system prompt:

You are an expert security analyst assistant. Analyze the provided security
incident data and determine:

1. Classification: Is this malicious, suspicious, benign, or inconclusive?
2. Confidence: How certain are you (0.0 to 1.0)?
3. Category: What type of threat is this (phishing, malware, etc.)?
4. Reasoning: Explain your analysis step by step
5. Recommended Actions: What should be done to respond?

Use the tool results provided to inform your analysis. Be thorough but concise.
Cite specific evidence for your conclusions.

Tool Calling

The agent can call tools during analysis:

# Agent decides to check URL reputation
tool_result = await self.call_tool(
    name="lookup_urls",
    parameters={"urls": ["https://suspicious-site.com/login"]}
)

# Result used in analysis
# {
#   "results": [{
#     "url": "https://suspicious-site.com/login",
#     "malicious": true,
#     "categories": ["phishing"],
#     "confidence": 0.95
#   }]
# }

Customizing the Agent

Custom System Prompt

agent = TriageAgent(
    system_prompt="""
    You are a SOC analyst specializing in email security.
    Focus on phishing indicators and BEC patterns.
    Always check sender authentication carefully.
    """
)

Custom Tools

Register additional tools:

@agent.tool
async def custom_lookup(domain: str) -> dict:
    """Look up domain in internal threat database."""
    return await internal_db.query(domain)

Model Selection

# Use different models for different scenarios
if incident.severity == "critical":
    agent = TriageAgent(model="claude-opus-4-20250514")
else:
    agent = TriageAgent(model="claude-sonnet-4-20250514")

Error Handling

The agent handles failures gracefully:

try:
    verdict = await agent.triage(incident)
except ToolError as e:
    # Tool failed - continue with available data
    verdict = await agent.triage_partial(incident, failed_tools=[e.tool])
except AIError as e:
    # AI call failed - return inconclusive
    verdict = Verdict.inconclusive(reason=str(e))

Metrics

Agent metrics exported to Prometheus:

  • triage_duration_seconds - Time to complete triage
  • triage_tool_calls_total - Tool calls per triage
  • triage_verdict_total - Verdicts by classification
  • triage_confidence_histogram - Confidence score distribution

Verdict Types

Understanding the classification outcomes from AI triage.

Classifications

Classification | Description | Typical Response
Malicious | Confirmed threat | Immediate containment
Suspicious | Likely threat, needs investigation | Queue for analyst review
Benign | Not a threat | Close or archive
Inconclusive | Insufficient data | Request more information

Malicious

The incident is a confirmed security threat.

Criteria:

  • Multiple strong threat indicators
  • High-confidence threat intelligence matches
  • Clear malicious intent (credential theft, malware, etc.)

Example:

{
  "classification": "malicious",
  "confidence": 0.95,
  "category": "phishing",
  "reasoning": "Email contains credential phishing page targeting Microsoft 365. Sender domain registered yesterday, fails all email authentication. URL redirects to fake login mimicking Microsoft branding."
}

Response:

  • Execute recommended containment actions
  • Create incident ticket
  • Notify affected users

Suspicious

The incident shows concerning indicators but lacks definitive proof.

Criteria:

  • Some threat indicators present
  • Mixed or conflicting signals
  • Unusual but not clearly malicious behavior

Example:

{
  "classification": "suspicious",
  "confidence": 0.65,
  "category": "potential_phishing",
  "reasoning": "Email sender is unknown but domain is 6 months old with valid authentication. URL leads to legitimate document sharing service but file name uses urgency tactics. Recipient has not received email from this sender before."
}

Response:

  • Queue for analyst review
  • Gather additional context
  • Consider temporary quarantine pending review

Benign

The incident is not a security threat.

Criteria:

  • No threat indicators found
  • Known good sender/source
  • Normal expected behavior

Example:

{
  "classification": "benign",
  "confidence": 0.92,
  "category": "legitimate_email",
  "reasoning": "Email from known vendor with established sending history. All authentication passes. Attachment is a standard invoice PDF matching expected format. No suspicious URLs or indicators."
}

Response:

  • Close incident
  • Release from quarantine if held
  • Update detection rules if false positive

Inconclusive

Insufficient data to make a determination.

Criteria:

  • Missing critical information
  • Tool failures preventing analysis
  • Conflicting strong indicators

Example:

{
  "classification": "inconclusive",
  "confidence": 0.3,
  "category": "unknown",
  "reasoning": "Unable to analyze attachment - file corrupted. Sender reputation service unavailable. Email authentication results are mixed (SPF pass, DKIM fail). Need manual review of attachment content.",
  "missing_data": [
    "attachment_analysis",
    "sender_reputation"
  ]
}

Response:

  • Escalate to analyst
  • Retry failed tool calls
  • Request additional information

Confidence Scores

Confidence ranges and their meaning:

Range | Interpretation
0.9 - 1.0 | Very high confidence, clear evidence
0.7 - 0.9 | High confidence, strong indicators
0.5 - 0.7 | Moderate confidence, mixed signals
0.3 - 0.5 | Low confidence, limited evidence
0.0 - 0.3 | Very low confidence, insufficient data

Category Types

Email Threats

Category | Description
phishing | Credential theft attempt
spear_phishing | Targeted phishing
bec | Business email compromise
malware_delivery | Malicious attachment/link
spam | Unsolicited bulk email

Endpoint Threats

Category | Description
malware | Malicious software detected
ransomware | Ransomware activity
cryptominer | Cryptocurrency mining
rat | Remote access trojan
pup | Potentially unwanted program

Access Threats

Category | Description
brute_force | Password guessing attempt
credential_stuffing | Leaked credential use
impossible_travel | Geographically impossible login
account_takeover | Compromised account

Using Verdicts

Automation Rules

# Auto-respond to high-confidence malicious
- trigger:
    classification: malicious
    confidence: ">= 0.9"
  actions:
    - quarantine_email
    - block_sender
    - create_ticket

# Queue suspicious for review
- trigger:
    classification: suspicious
  actions:
    - escalate:
        level: analyst
        reason: "Suspicious activity requires review"
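The trigger semantics above can be sketched in Python. `trigger_matches` and its operator parsing are a hypothetical illustration of how such rules might be evaluated, not the actual rule engine: string values beginning with a comparison operator (as in `">= 0.9"`) are treated as numeric comparisons, everything else as equality.

```python
import operator

# Supported comparison prefixes in trigger values (an assumed convention)
_OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}

def trigger_matches(trigger: dict, verdict: dict) -> bool:
    """Return True if every trigger condition matches the verdict."""
    for field, expected in trigger.items():
        actual = verdict.get(field)
        if isinstance(expected, str) and expected.split()[0] in _OPS:
            op, threshold = expected.split()          # e.g. ">= 0.9"
            if not _OPS[op](float(actual), float(threshold)):
                return False
        elif actual != expected:                      # plain equality match
            return False
    return True
```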

Metrics

Track verdict distribution:

# Verdict counts by classification
sum by (classification) (triage_verdict_total)

# Average confidence by category
avg by (category) (triage_confidence)

Confidence Scoring

How the AI agent determines confidence in its verdicts.

Confidence Factors

The agent considers multiple factors when calculating confidence:

Evidence Quality

Factor | Impact
Threat intel match (high confidence) | +0.3
Threat intel match (low confidence) | +0.1
Authentication failure | +0.2
Known malicious indicator | +0.3
Suspicious pattern | +0.1

Evidence Quantity

Indicators | Confidence Boost
1 indicator | Base
2-3 indicators | +0.1
4-5 indicators | +0.2
6+ indicators | +0.3

Data Completeness

Missing Data | Confidence Penalty
None | 0
Minor (sender reputation) | -0.1
Moderate (attachment analysis) | -0.2
Major (multiple tools failed) | -0.3

Calculation Example

Phishing Email Analysis:

Base confidence: 0.5

Evidence found:
+ SPF failed: +0.15
+ DKIM failed: +0.15
+ Sender domain < 7 days old: +0.2
+ URL matches phishing pattern: +0.25
+ VirusTotal flags URL as phishing: +0.2

Evidence count (5): +0.2

Data completeness: All tools succeeded: +0

Final confidence: 0.5 + 0.15 + 0.15 + 0.2 + 0.25 + 0.2 + 0.2 = 1.65, capped at 0.99

Verdict: malicious, confidence: 0.99
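The worked example above reduces to a short scoring function. This is an illustrative Python sketch of that arithmetic (base, per-evidence increments, quantity boost, completeness penalty, cap), not the agent's internal implementation:

```python
def calculate_confidence(evidence: list[float],
                         missing_penalty: float = 0.0,
                         base: float = 0.5) -> float:
    """Score as in the worked example: base + evidence + quantity boost - penalty."""
    score = base + sum(evidence)
    n = len(evidence)
    if n >= 6:          # quantity boost per the Evidence Quantity table
        score += 0.3
    elif n >= 4:
        score += 0.2
    elif n >= 2:
        score += 0.1
    score -= missing_penalty        # Data Completeness penalty
    return min(round(score, 2), 0.99)  # cap at 0.99
```

Running it on the five pieces of evidence from the example (`[0.15, 0.15, 0.2, 0.25, 0.2]`) reproduces the capped 0.99 verdict confidence.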

Confidence Thresholds

Policy decisions use confidence thresholds:

# Auto-quarantine high confidence malicious
[[policy.rules]]
name = "auto_quarantine_confident"
classification = "malicious"
confidence_min = 0.9
action = "quarantine_email"
decision = "allowed"

# Require review for lower confidence
[[policy.rules]]
name = "review_uncertain"
confidence_max = 0.7
decision = "requires_approval"
approval_level = "analyst"
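The threshold rules above amount to first-match policy evaluation. A minimal Python sketch of that logic, assuming missing `classification` matches any verdict, missing bounds default to the full 0.0-1.0 range, and unmatched verdicts fall back conservatively to approval (all assumptions, not the real policy engine):

```python
def policy_decision(rules: list[dict], classification: str, confidence: float) -> str:
    """Return the decision of the first rule whose constraints all match."""
    for rule in rules:
        if rule.get("classification") not in (None, classification):
            continue                                  # wrong classification
        if confidence < rule.get("confidence_min", 0.0):
            continue                                  # below minimum confidence
        if confidence > rule.get("confidence_max", 1.0):
            continue                                  # above maximum confidence
        return rule["decision"]
    return "requires_approval"  # conservative default when nothing matches
```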

Confidence Calibration

The agent is calibrated so confidence correlates with accuracy:

Stated Confidence | Expected Accuracy
0.9 | ~90% of verdicts correct
0.8 | ~80% of verdicts correct
0.7 | ~70% of verdicts correct
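Calibration can be checked offline by bucketing reviewed verdicts by stated confidence and comparing observed accuracy per bucket. A small Python sketch of that computation (the 0.1-wide bucket scheme is an assumption mirroring the metric's `confidence_bucket` label):

```python
from collections import defaultdict

def calibration_report(verdicts: list[tuple[float, bool]]) -> dict[str, float]:
    """Group (confidence, was_correct) pairs into 0.1-wide buckets
    and return the observed accuracy per bucket."""
    buckets = defaultdict(list)
    for confidence, correct in verdicts:
        lo = min(int(confidence * 10) / 10, 0.9)      # 1.0 joins the top bucket
        buckets[f"{lo:.1f}-{lo + 0.1:.1f}"].append(correct)
    return {b: round(sum(v) / len(v), 2) for b, v in sorted(buckets.items())}
```

A well-calibrated agent shows accuracy close to the bucket's lower bound; a large gap in either direction suggests over- or under-confidence.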

Monitoring Calibration

Track calibration with metrics:

# Accuracy at confidence level
triage_accuracy_by_confidence{confidence_bucket="0.9-1.0"}

Improving Calibration

  1. Feedback loop - Log false positives to improve
  2. Periodic review - Sample low-confidence verdicts
  3. Model updates - Retrain with corrected examples

Handling Low Confidence

When confidence is low:

Option 1: Escalate

- condition: confidence < 0.6
  action: escalate
  parameters:
    level: analyst
    reason: "Low confidence verdict requires human review"

Option 2: Gather More Data

- condition: confidence < 0.6
  action: request_additional_data
  parameters:
    - "sender_history"
    - "recipient_context"

Option 3: Conservative Default

- condition: confidence < 0.6
  action: quarantine_email
  parameters:
    reason: "Quarantined pending review due to uncertainty"

Confidence in UI

Dashboard displays confidence visually:

Confidence | Display
0.9+ | Green badge, "High Confidence"
0.7-0.9 | Yellow badge, "Moderate Confidence"
0.5-0.7 | Orange badge, "Low Confidence"
<0.5 | Red badge, "Very Low Confidence"
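The badge mapping above translates directly into a small helper; this Python sketch mirrors the table (the `(color, label)` return shape is an illustrative choice):

```python
def confidence_badge(confidence: float) -> tuple[str, str]:
    """Map a confidence score to the dashboard badge (color, label)."""
    if confidence >= 0.9:
        return ("green", "High Confidence")
    if confidence >= 0.7:
        return ("yellow", "Moderate Confidence")
    if confidence >= 0.5:
        return ("orange", "Low Confidence")
    return ("red", "Very Low Confidence")
```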

Improving Confidence

Actions that help the agent be more confident:

  1. Complete data - Ensure all tools succeed
  2. Rich context - Provide incident metadata
  3. Historical data - Include past incidents with similar patterns
  4. Clear playbooks - Well-defined analysis steps

Playbooks

Playbooks define automated investigation and response workflows.

Overview

A playbook is a sequence of steps that:

  1. Gather and analyze incident data
  2. Enrich with threat intelligence
  3. Determine verdict and response
  4. Execute approved actions

Playbook Structure

name: phishing_triage
description: Automated phishing email analysis
version: "1.0"

# When this playbook applies
triggers:
  incident_type: phishing
  auto_run: true

# Variables available to steps
variables:
  quarantine_threshold: 0.7
  block_threshold: 0.3

# Execution steps
steps:
  - name: Parse Email
    action: parse_email
    parameters:
      raw_email: "{{ incident.raw_data.raw_email }}"
    output: parsed

  - name: Check Authentication
    action: check_email_authentication
    parameters:
      headers: "{{ parsed.headers }}"
    output: auth

  - name: Check Sender
    action: lookup_sender_reputation
    parameters:
      sender: "{{ parsed.sender }}"
    output: sender_rep

  - name: Check URLs
    action: lookup_urls
    parameters:
      urls: "{{ parsed.urls }}"
    output: url_results
    condition: "{{ parsed.urls | length > 0 }}"

  - name: Quarantine if Malicious
    action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
      reason: "Automated quarantine - phishing detected"
    condition: >
      sender_rep.score < variables.quarantine_threshold or
      url_results.malicious_count > 0 or
      not auth.authentication_passed

# Final verdict generation
verdict:
  use_ai: true
  model: claude-sonnet-4-20250514
  context:
    - parsed
    - auth
    - sender_rep
    - url_results

Triggers

Define when a playbook runs:

triggers:
  # Run for specific incident types
  incident_type: phishing

  # Auto-run on incident creation
  auto_run: true

  # Or require manual trigger
  auto_run: false

  # Conditions
  conditions:
    severity: ["medium", "high", "critical"]
    source: "email_gateway"

Steps

Basic Step

- name: Step Name
  action: action_name
  parameters:
    key: value
  output: variable_name

Conditional Step

- name: Block Known Bad
  action: block_sender
  parameters:
    sender: "{{ parsed.sender }}"
  condition: "{{ sender_rep.score < 0.2 }}"

Parallel Steps

- parallel:
    - action: lookup_urls
      parameters:
        urls: "{{ parsed.urls }}"
      output: url_results

    - action: lookup_attachments
      parameters:
        attachments: "{{ parsed.attachments }}"
      output: attachment_results

Loop Steps

- name: Check Each URL
  loop: "{{ parsed.urls }}"
  action: lookup_url
  parameters:
    url: "{{ item }}"
  output: url_results
  aggregate: list

Variables

Built-in Variables

Variable | Description
incident | The incident being processed
incident.id | Incident ID
incident.raw_data | Original incident data
incident.severity | Incident severity
variables | Playbook-defined variables

Step Outputs

Each step's output is available to subsequent steps:

- action: parse_email
  output: parsed

- action: lookup_urls
  parameters:
    urls: "{{ parsed.urls }}"  # Use previous output

Templates

Use Jinja2-style templates:

parameters:
  message: "Alert for {{ incident.id }}: {{ parsed.subject }}"
  priority: "{{ 'high' if incident.severity == 'critical' else 'medium' }}"

Creating Playbooks

Guide to writing custom playbooks for your security workflows.

Getting Started

1. Create Playbook File

mkdir -p playbooks
touch playbooks/my_playbook.yaml

2. Define Basic Structure

name: my_playbook
description: Description of what this playbook does
version: "1.0"

triggers:
  incident_type: phishing
  auto_run: true

steps:
  - name: First Step
    action: parse_email
    output: result

3. Register Playbook

tw-cli playbook add playbooks/my_playbook.yaml

Step Types

Action Step

Execute a registered action:

- name: Parse Email Content
  action: parse_email
  parameters:
    raw_email: "{{ incident.raw_data.raw_email }}"
  output: parsed
  on_error: continue  # or "fail" (default)

Condition Step

Branch based on conditions:

- name: Check if High Risk
  condition: "{{ sender_rep.score < 0.3 }}"
  then:
    - action: quarantine_email
      parameters:
        message_id: "{{ incident.raw_data.message_id }}"
  else:
    - action: log_event
      parameters:
        message: "Low risk, no action needed"

AI Analysis Step

Get AI verdict:

- name: AI Analysis
  type: ai_analysis
  model: claude-sonnet-4-20250514
  context:
    - parsed
    - auth_results
    - reputation
  prompt: |
    Analyze this email for phishing indicators.
    Consider the authentication results and sender reputation.
  output: ai_verdict

Notification Step

Send alerts:

- name: Alert Team
  action: notify_channel
  parameters:
    channel: slack
    message: |
      New {{ incident.severity }} incident detected
      ID: {{ incident.id }}
      Type: {{ incident.incident_type }}

Error Handling

Per-Step Error Handling

- name: Check Reputation
  action: lookup_sender_reputation
  parameters:
    sender: "{{ parsed.sender }}"
  output: reputation
  on_error: continue  # Don't fail playbook if this fails
  default_output:     # Use this if step fails
    score: 0.5
    risk_level: "unknown"

Global Error Handler

on_error:
  - action: notify_channel
    parameters:
      channel: slack
      message: "Playbook {{ playbook.name }} failed: {{ error.message }}"
  - action: escalate
    parameters:
      level: analyst
      reason: "Automated triage failed"

Variables and Templates

Define Variables

variables:
  high_risk_threshold: 0.3
  quarantine_enabled: true
  notification_channel: "#security-alerts"

Use Variables

- name: Check Risk
  condition: "{{ sender_rep.score < variables.high_risk_threshold }}"
  then:
    - action: quarantine_email
      condition: "{{ variables.quarantine_enabled }}"

Template Functions

parameters:
  # String manipulation
  domain: "{{ parsed.sender | split('@') | last }}"

  # Conditionals
  priority: "{{ 'critical' if incident.severity == 'critical' else 'high' }}"

  # Lists
  all_urls: "{{ parsed.urls | join(', ') }}"
  url_count: "{{ parsed.urls | length }}"

  # Defaults
  assignee: "{{ incident.assignee | default('unassigned') }}"

Testing Playbooks

Dry Run

tw-cli playbook test my_playbook \
  --incident INC-2024-001 \
  --dry-run

With Mock Data

tw-cli playbook test my_playbook \
  --data '{"raw_email": "From: [email protected]..."}'

Validate Syntax

tw-cli playbook validate playbooks/my_playbook.yaml

Best Practices

1. Use Descriptive Names

# Good
- name: Check sender domain reputation

# Bad
- name: step1

2. Handle Failures Gracefully

- name: External Lookup
  action: lookup_sender_reputation
  on_error: continue
  default_output:
    score: 0.5

3. Add Timeouts

- name: Slow External API
  action: custom_lookup
  timeout: 30s

4. Log Key Decisions

- name: Log Verdict
  action: log_event
  parameters:
    level: info
    message: "Verdict: {{ verdict.classification }} ({{ verdict.confidence }})"

5. Version Your Playbooks

name: phishing_triage
version: "2.1.0"
changelog:
  - "2.1.0: Added attachment analysis"
  - "2.0.0: Restructured for parallel lookups"

Example: Complete Playbook

name: comprehensive_phishing_triage
description: Full phishing email analysis with all checks
version: "2.0"

triggers:
  incident_type: phishing
  auto_run: true

variables:
  quarantine_threshold: 0.3
  block_threshold: 0.2

steps:
  # Parse email
  - name: Parse Email
    action: parse_email
    parameters:
      raw_email: "{{ incident.raw_data.raw_email }}"
    output: parsed

  # Parallel enrichment
  - name: Enrich Data
    parallel:
      - action: check_email_authentication
        parameters:
          headers: "{{ parsed.headers }}"
        output: auth

      - action: lookup_sender_reputation
        parameters:
          sender: "{{ parsed.sender }}"
        output: sender_rep

      - action: lookup_urls
        parameters:
          urls: "{{ parsed.urls }}"
        output: urls
        condition: "{{ parsed.urls | length > 0 }}"

      - action: lookup_attachments
        parameters:
          attachments: "{{ parsed.attachments }}"
        output: attachments
        condition: "{{ parsed.attachments | length > 0 }}"

  # AI Analysis
  - name: AI Verdict
    type: ai_analysis
    model: claude-sonnet-4-20250514
    context: [parsed, auth, sender_rep, urls, attachments]
    output: verdict

  # Response actions
  - name: Quarantine Malicious
    action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
    condition: >
      verdict.classification == 'malicious' and
      verdict.confidence >= variables.quarantine_threshold

  - name: Block Repeat Offender
    action: block_sender
    parameters:
      sender: "{{ parsed.sender }}"
    condition: >
      sender_rep.score < variables.block_threshold

  - name: Create Ticket
    action: create_ticket
    parameters:
      title: "{{ verdict.classification | title }}: {{ parsed.subject | truncate(50) }}"
      priority: "{{ incident.severity }}"
    condition: "{{ verdict.classification != 'benign' }}"

on_error:
  - action: escalate
    parameters:
      level: analyst
      reason: "Playbook execution failed"

Built-in Playbooks

Ready-to-use playbooks included with Triage Warden.

Email Security

phishing_triage

Comprehensive phishing email analysis.

Triggers: incident_type: phishing

Steps:

  1. Parse email headers and body
  2. Check SPF/DKIM/DMARC authentication
  3. Look up sender reputation
  4. Analyze URLs against threat intel
  5. Check attachment hashes
  6. AI analysis and verdict
  7. Auto-quarantine if malicious (confidence > 0.8)

Usage:

tw-cli playbook run phishing_triage --incident INC-2024-001

spam_triage

Quick spam classification.

Triggers: incident_type: spam

Steps:

  1. Parse email
  2. Check spam indicators (bulk headers, suspicious patterns)
  3. Classify as spam/not spam
  4. Auto-archive high-confidence spam

bec_detection

Business Email Compromise detection.

Triggers: incident_type: bec

Steps:

  1. Parse email
  2. Check for executive impersonation
  3. Analyze reply-to mismatch
  4. Check for urgency indicators
  5. Verify sender against directory
  6. AI analysis for social engineering patterns

Endpoint Security

malware_triage

Malware alert analysis.

Triggers: incident_type: malware

Steps:

  1. Get host information from EDR
  2. Look up file hash
  3. Check related processes
  4. Query SIEM for lateral movement
  5. AI verdict
  6. Auto-isolate if critical severity + high confidence

suspicious_login

Anomalous login investigation.

Triggers: incident_type: suspicious_login

Steps:

  1. Get login details
  2. Check for impossible travel
  3. Query user's recent activity
  4. Check IP reputation
  5. Verify device fingerprint
  6. AI analysis

Customizing Built-in Playbooks

Override Variables

tw-cli playbook run phishing_triage \
  --incident INC-2024-001 \
  --var quarantine_threshold=0.9 \
  --var auto_block=false

Fork and Modify

# Export built-in playbook
tw-cli playbook export phishing_triage > my_phishing.yaml

# Edit as needed
vim my_phishing.yaml

# Register custom version
tw-cli playbook add my_phishing.yaml

Extend with Hooks

# my_phishing.yaml
extends: phishing_triage

# Add steps after parent playbook
after_steps:
  - name: Custom Logging
    action: log_to_siem
    parameters:
      event: phishing_verdict
      data: "{{ verdict }}"

# Override variables
variables:
  quarantine_threshold: 0.85

Playbook Comparison

Playbook | AI Used | Auto-Response | Typical Duration
phishing_triage | Yes | Quarantine, Block | 30-60s
spam_triage | No | Archive | 5-10s
bec_detection | Yes | Escalate | 45-90s
malware_triage | Yes | Isolate | 60-120s
suspicious_login | Yes | Lock account | 30-60s

Monitoring Playbooks

Execution Metrics

# Playbook execution count
sum by (playbook) (playbook_executions_total)

# Average duration
avg by (playbook) (playbook_duration_seconds)

# Success rate
sum(playbook_executions_total{status="success"}) /
sum(playbook_executions_total)

Alerts

# Alert on playbook failures
- alert: PlaybookFailureRate
  expr: |
    sum(rate(playbook_executions_total{status="failed"}[5m])) /
    sum(rate(playbook_executions_total[5m])) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Playbook failure rate above 10%"

REST API

Programmatic access to Triage Warden functionality.

Base URL

http://localhost:8080/api

Authentication

See Authentication for details.

API Key

curl -H "Authorization: Bearer tw_abc123_secretkey" \
  http://localhost:8080/api/incidents

For browser-based access, use session authentication via /login.

Response Format

All responses are JSON:

{
  "data": { ... },
  "meta": {
    "page": 1,
    "per_page": 20,
    "total": 150
  }
}

Error Responses

{
  "error": {
    "code": "not_found",
    "message": "Incident not found",
    "details": { ... }
  }
}

HTTP Status Codes

| Code | Meaning |
|---|---|
| 200 | Success |
| 201 | Created |
| 400 | Bad Request |
| 401 | Unauthorized |
| 403 | Forbidden |
| 404 | Not Found |
| 422 | Validation Error |
| 429 | Rate Limited |
| 500 | Server Error |

Endpoints Overview

Incidents

| Method | Path | Description |
|---|---|---|
| GET | /incidents | List incidents |
| POST | /incidents | Create incident |
| GET | /incidents/:id | Get incident |
| PUT | /incidents/:id | Update incident |
| DELETE | /incidents/:id | Delete incident |
| POST | /incidents/:id/triage | Run triage |
| POST | /incidents/:id/actions | Execute action |

Actions

| Method | Path | Description |
|---|---|---|
| GET | /actions | List actions |
| GET | /actions/:id | Get action |
| POST | /actions/:id/approve | Approve action |
| POST | /actions/:id/reject | Reject action |

Playbooks

| Method | Path | Description |
|---|---|---|
| GET | /playbooks | List playbooks |
| POST | /playbooks | Create playbook |
| GET | /playbooks/:id | Get playbook |
| PUT | /playbooks/:id | Update playbook |
| DELETE | /playbooks/:id | Delete playbook |
| POST | /playbooks/:id/run | Run playbook |

Webhooks

| Method | Path | Description |
|---|---|---|
| POST | /webhooks/:source | Receive webhook |

System

| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |
| GET | /connectors/health | Connector status |

Pagination

List endpoints support pagination:

curl "http://localhost:8080/api/incidents?page=2&per_page=50"

Parameters:

  • page - Page number (default: 1)
  • per_page - Items per page (default: 20, max: 100)
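For scripting, the page, per_page, and total fields in meta are enough to walk every page. A minimal sketch (the iter_pages helper and its fetch_page callable are illustrative, not part of any SDK):

```python
def iter_pages(fetch_page, per_page=100):
    """Yield items from every page of a paginated list endpoint.

    fetch_page(page, per_page) must return the parsed JSON body,
    i.e. a dict with "data" and "meta" keys as documented above.
    """
    page = 1
    while True:
        body = fetch_page(page, per_page)
        yield from body["data"]
        meta = body["meta"]
        # Stop once this page reaches or passes the reported total
        if meta["page"] * meta["per_page"] >= meta["total"]:
            return
        page += 1
```

With requests, fetch_page could be as simple as `lambda page, per_page: session.get(url, params={"page": page, "per_page": per_page}).json()`.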

Filtering

Filter list results:

curl "http://localhost:8080/api/incidents?status=open&severity=high"

Common filters:

  • status - Filter by status
  • severity - Filter by severity
  • type - Filter by incident type
  • created_after - Created after date
  • created_before - Created before date

Sorting

curl "http://localhost:8080/api/incidents?sort=-created_at"

  • Prefix with - for descending order
  • Default: -created_at (newest first)

Rate Limiting

API requests are rate limited:

| Endpoint | Limit |
|---|---|
| Read operations | 100/min |
| Write operations | 20/min |
| Triage requests | 10/min |

Rate limit headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1705320000
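Clients can read these headers to pace themselves instead of waiting for a 429. A sketch (rate_limit_delay is an illustrative helper; X-RateLimit-Reset is treated as epoch seconds, as in the example above):

```python
import time

def rate_limit_delay(headers, now=None):
    """Return seconds to wait before the next request, based on the
    X-RateLimit-* headers. 0.0 means requests remain in the window."""
    now = time.time() if now is None else now
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0
    # Quota exhausted: wait until the reset timestamp
    reset = int(headers.get("X-RateLimit-Reset", str(int(now))))
    return max(0.0, reset - now)
```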

API Authentication

Authenticate with the Triage Warden API.

API Keys

Creating an API Key

# Via CLI
tw-cli api-key create --name "automation-script" --scopes read,write

# Output:
# API Key created successfully
# Key: tw_abc123_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# WARNING: Store this key securely. It cannot be retrieved again.

Using API Keys

Include in the Authorization header:

curl -H "Authorization: Bearer tw_abc123_secretkey" \
  http://localhost:8080/api/incidents

API Key Scopes

| Scope | Permissions |
|---|---|
| read | Read incidents, actions, playbooks |
| write | Create/update incidents, execute actions |
| admin | User management, system configuration |

Managing API Keys

# List keys
tw-cli api-key list

# Revoke key
tw-cli api-key revoke tw_abc123

# Rotate key
tw-cli api-key rotate tw_abc123

Session Authentication

For web dashboard access:

Login

curl -X POST http://localhost:8080/login \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "username=analyst&password=secret&csrf_token=xxx" \
  -c cookies.txt

Using Session

curl -b cookies.txt http://localhost:8080/api/incidents

Logout

curl -X POST http://localhost:8080/logout -b cookies.txt

CSRF Protection

State-changing requests require CSRF tokens:

  1. Get token from login page or API
  2. Include in request header or body

# Header
curl -X POST http://localhost:8080/api/incidents \
  -H "X-CSRF-Token: abc123" \
  -b cookies.txt \
  -d '{"type": "phishing"}'

# Form body
curl -X POST http://localhost:8080/api/incidents \
  -d "csrf_token=abc123&type=phishing" \
  -b cookies.txt

Webhook Authentication

Webhooks use HMAC signatures:

Configuring Webhook Secret

tw-cli webhook add email-gateway \
  --url http://localhost:8080/api/webhooks/email-gateway \
  --secret "your-secret-key"

Verifying Signatures

Triage Warden validates the X-Webhook-Signature header:

X-Webhook-Signature: sha256=abc123...

Signature is computed as:

HMAC-SHA256(secret, timestamp + "." + body)

Signature Verification Example

import hmac
import hashlib

def verify_signature(payload: bytes, signature: str, secret: str, timestamp: str) -> bool:
    expected = hmac.new(
        secret.encode(),
        f"{timestamp}.{payload.decode()}".encode(),
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)
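The sending side mirrors this: sign timestamp + "." + body with the shared secret and prefix the hex digest with sha256=. A sketch (sign_payload is an illustrative helper, not part of any SDK):

```python
import hashlib
import hmac

def sign_payload(payload: bytes, secret: str, timestamp: str) -> str:
    """Produce the X-Webhook-Signature value for an outgoing payload."""
    digest = hmac.new(
        secret.encode(),
        f"{timestamp}.{payload.decode()}".encode(),
        hashlib.sha256,
    ).hexdigest()
    return f"sha256={digest}"
```

A signature produced this way verifies with the verify_signature function above when both sides share the same secret and timestamp.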

Service Accounts

For automated systems:

# Create service account
tw-cli user create \
  --username automation-bot \
  --role analyst \
  --service-account

# Generate API key for service account
tw-cli api-key create \
  --user automation-bot \
  --name "ci-cd-integration" \
  --scopes read,write

Security Best Practices

  1. Rotate keys regularly - Set up automated rotation
  2. Use minimal scopes - Only grant necessary permissions
  3. Secure storage - Use secret managers, not code
  4. Monitor usage - Review audit logs for suspicious activity
  5. IP allowlisting - Restrict API access by IP (optional)

# Enable IP allowlist
tw-cli config set api.allowed_ips "10.0.0.0/8,192.168.1.0/24"

Error Responses

401 Unauthorized

Missing or invalid credentials:

{
  "error": {
    "code": "unauthorized",
    "message": "Invalid or missing authentication"
  }
}

403 Forbidden

Valid credentials but insufficient permissions:

{
  "error": {
    "code": "forbidden",
    "message": "Insufficient permissions for this operation"
  }
}

Incidents API

Create, read, update, and manage security incidents.

List Incidents

GET /api/incidents

Query Parameters

| Parameter | Type | Description |
|---|---|---|
| status | string | Filter by status (open, triaged, resolved) |
| severity | string | Filter by severity (low, medium, high, critical) |
| type | string | Filter by incident type |
| created_after | datetime | Created after timestamp |
| created_before | datetime | Created before timestamp |
| page | integer | Page number |
| per_page | integer | Items per page |
| sort | string | Sort field (prefix - for desc) |

Example

curl "http://localhost:8080/api/incidents?status=open&severity=high&per_page=10" \
  -H "Authorization: Bearer tw_xxx"

Response

{
  "data": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "incident_number": "INC-2024-0001",
      "incident_type": "phishing",
      "severity": "high",
      "status": "open",
      "source": "email_gateway",
      "created_at": "2024-01-15T10:30:00Z",
      "updated_at": "2024-01-15T10:30:00Z"
    }
  ],
  "meta": {
    "page": 1,
    "per_page": 10,
    "total": 42
  }
}

Get Incident

GET /api/incidents/:id

Example

curl "http://localhost:8080/api/incidents/550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer tw_xxx"

Response

{
  "data": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "incident_number": "INC-2024-0001",
    "incident_type": "phishing",
    "severity": "high",
    "status": "triaged",
    "source": "email_gateway",
    "raw_data": {
      "message_id": "AAMkAGI2...",
      "sender": "[email protected]",
      "subject": "Urgent: Update Account"
    },
    "verdict": {
      "classification": "malicious",
      "confidence": 0.92,
      "category": "phishing",
      "reasoning": "Multiple phishing indicators..."
    },
    "recommended_actions": [
      "quarantine_email",
      "block_sender"
    ],
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T10:35:00Z",
    "triaged_at": "2024-01-15T10:35:00Z"
  }
}

Create Incident

POST /api/incidents

Request Body

{
  "incident_type": "phishing",
  "source": "email_gateway",
  "severity": "medium",
  "raw_data": {
    "message_id": "AAMkAGI2...",
    "sender": "[email protected]",
    "recipient": "[email protected]",
    "subject": "Important Document",
    "received_at": "2024-01-15T10:00:00Z"
  }
}

Example

curl -X POST "http://localhost:8080/api/incidents" \
  -H "Authorization: Bearer tw_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "incident_type": "phishing",
    "source": "email_gateway",
    "severity": "medium",
    "raw_data": {...}
  }'

Response

{
  "data": {
    "id": "550e8400-e29b-41d4-a716-446655440001",
    "incident_number": "INC-2024-0002",
    "status": "open",
    "created_at": "2024-01-15T11:00:00Z"
  }
}
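A typical client chains the create and triage endpoints. A sketch, assuming a requests-style session with the Authorization header already set (create_and_triage is an illustrative helper, not part of any SDK):

```python
def create_and_triage(session, base_url, incident):
    """Create an incident, run AI triage on it, and return the verdict.

    `session` is any requests-compatible client (e.g. requests.Session
    with the Authorization header configured).
    """
    # POST /api/incidents returns the new incident's id
    resp = session.post(f"{base_url}/api/incidents", json=incident)
    resp.raise_for_status()
    incident_id = resp.json()["data"]["id"]

    # POST /api/incidents/:id/triage returns the verdict on completion
    resp = session.post(f"{base_url}/api/incidents/{incident_id}/triage")
    resp.raise_for_status()
    return resp.json()["data"]["verdict"]
```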

Update Incident

PUT /api/incidents/:id

Request Body

{
  "severity": "high",
  "status": "resolved",
  "resolution": "False positive - legitimate vendor email"
}

Delete Incident

DELETE /api/incidents/:id

Note: Requires admin role.

Run Triage

POST /api/incidents/:id/triage

Trigger AI triage on an incident.

Request Body (Optional)

{
  "playbook": "custom_phishing",
  "force": true
}

Response

{
  "data": {
    "triage_id": "triage-abc123",
    "status": "completed",
    "verdict": {
      "classification": "malicious",
      "confidence": 0.92
    },
    "duration_ms": 45000
  }
}

Execute Action

POST /api/incidents/:id/actions

Execute an action on an incident.

Request Body

{
  "action": "quarantine_email",
  "parameters": {
    "message_id": "AAMkAGI2...",
    "reason": "Phishing detected"
  }
}

Response (Immediate Execution)

{
  "data": {
    "action_id": "act-abc123",
    "status": "completed",
    "result": {
      "success": true,
      "message": "Email quarantined successfully"
    }
  }
}

Response (Pending Approval)

{
  "data": {
    "action_id": "act-abc123",
    "status": "pending_approval",
    "approval_level": "senior",
    "message": "Action requires senior analyst approval"
  }
}

Get Incident Actions

GET /api/incidents/:id/actions

List all actions for an incident.

Response

{
  "data": [
    {
      "id": "act-abc123",
      "action_type": "quarantine_email",
      "status": "completed",
      "executed_at": "2024-01-15T10:40:00Z",
      "executed_by": "system"
    },
    {
      "id": "act-def456",
      "action_type": "block_sender",
      "status": "pending_approval",
      "approval_level": "analyst",
      "requested_at": "2024-01-15T10:41:00Z"
    }
  ]
}

Actions API

Manage action execution and approvals.

List Actions

GET /api/actions

Query Parameters

| Parameter | Type | Description |
|---|---|---|
| status | string | pending, pending_approval, completed, failed |
| action_type | string | Filter by action type |
| incident_id | uuid | Filter by incident |
| approval_level | string | analyst, senior, manager |

Example

curl "http://localhost:8080/api/actions?status=pending_approval" \
  -H "Authorization: Bearer tw_xxx"

Response

{
  "data": [
    {
      "id": "act-abc123",
      "incident_id": "550e8400-e29b-41d4-a716-446655440000",
      "action_type": "isolate_host",
      "status": "pending_approval",
      "approval_level": "senior",
      "parameters": {
        "host_id": "aid:xyz789",
        "reason": "Malware detected"
      },
      "requested_by": "triage_agent",
      "requested_at": "2024-01-15T10:45:00Z"
    }
  ]
}
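To build an approval queue from this response, filter on status and approval level. A sketch, assuming the analyst < senior < manager ordering implied by the approval levels above (adjust if your deployment defines levels differently):

```python
APPROVAL_ORDER = {"analyst": 0, "senior": 1, "manager": 2}

def approvable_actions(actions, my_level):
    """Filter a list-actions response down to actions this user can approve."""
    rank = APPROVAL_ORDER[my_level]
    return [
        a for a in actions
        if a["status"] == "pending_approval"
        and APPROVAL_ORDER[a["approval_level"]] <= rank
    ]
```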

Get Action

GET /api/actions/:id

Response

{
  "data": {
    "id": "act-abc123",
    "incident_id": "550e8400-e29b-41d4-a716-446655440000",
    "action_type": "isolate_host",
    "status": "pending_approval",
    "approval_level": "senior",
    "parameters": {
      "host_id": "aid:xyz789",
      "reason": "Malware detected"
    },
    "requested_by": "triage_agent",
    "requested_at": "2024-01-15T10:45:00Z",
    "incident": {
      "incident_number": "INC-2024-0001",
      "incident_type": "malware",
      "severity": "high"
    }
  }
}

Approve Action

POST /api/actions/:id/approve

Request Body

{
  "comment": "Verified threat, approved for isolation"
}

Response

{
  "data": {
    "id": "act-abc123",
    "status": "completed",
    "approved_by": "[email protected]",
    "approved_at": "2024-01-15T11:00:00Z",
    "result": {
      "success": true,
      "message": "Host isolated successfully"
    }
  }
}

Errors

403 Forbidden - Insufficient approval level:

{
  "error": {
    "code": "insufficient_approval_level",
    "message": "This action requires senior analyst approval",
    "required_level": "senior",
    "your_level": "analyst"
  }
}

Reject Action

POST /api/actions/:id/reject

Request Body

{
  "reason": "False positive - user confirmed legitimate activity"
}

Response

{
  "data": {
    "id": "act-abc123",
    "status": "rejected",
    "rejected_by": "[email protected]",
    "rejected_at": "2024-01-15T11:00:00Z",
    "rejection_reason": "False positive - user confirmed legitimate activity"
  }
}

Execute Action Directly

POST /api/actions/execute

Execute an action without associating with an incident.

Request Body

{
  "action": "block_sender",
  "parameters": {
    "sender": "[email protected]"
  }
}

Response

{
  "data": {
    "action_id": "act-ghi789",
    "status": "completed",
    "result": {
      "success": true,
      "message": "Sender blocked"
    }
  }
}

Get Action Types

GET /api/actions/types

List all available action types.

Response

{
  "data": [
    {
      "name": "quarantine_email",
      "description": "Move email to quarantine",
      "category": "email",
      "supports_rollback": true,
      "parameters": [
        {
          "name": "message_id",
          "type": "string",
          "required": true
        },
        {
          "name": "reason",
          "type": "string",
          "required": false
        }
      ]
    },
    {
      "name": "isolate_host",
      "description": "Network-isolate a host",
      "category": "endpoint",
      "supports_rollback": true,
      "default_approval_level": "senior",
      "parameters": [...]
    }
  ]
}
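These descriptors make it easy to validate parameters client-side before calling the execute endpoint. A sketch (missing_parameters is an illustrative helper):

```python
def missing_parameters(action_type, parameters):
    """Return names of required parameters absent from `parameters`,
    given an action-type descriptor as returned by GET /api/actions/types."""
    required = [p["name"] for p in action_type["parameters"] if p["required"]]
    return [name for name in required if name not in parameters]
```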

Rollback Action

POST /api/actions/:id/rollback

Rollback a previously executed action.

Request Body

{
  "reason": "False positive confirmed"
}

Response

{
  "data": {
    "rollback_action_id": "act-jkl012",
    "original_action_id": "act-abc123",
    "status": "completed",
    "result": {
      "success": true,
      "message": "Host unisolated successfully"
    }
  }
}

Errors

400 Bad Request - Action doesn't support rollback:

{
  "error": {
    "code": "rollback_not_supported",
    "message": "Action type 'notify_user' does not support rollback"
  }
}

Playbooks API

Manage and execute playbooks.

List Playbooks

GET /api/playbooks

Response

{
  "data": [
    {
      "id": "pb-abc123",
      "name": "phishing_triage",
      "description": "Automated phishing email analysis",
      "version": "2.0",
      "enabled": true,
      "triggers": {
        "incident_type": "phishing",
        "auto_run": true
      },
      "created_at": "2024-01-01T00:00:00Z",
      "updated_at": "2024-01-10T00:00:00Z"
    }
  ]
}

Get Playbook

GET /api/playbooks/:id

Response

{
  "data": {
    "id": "pb-abc123",
    "name": "phishing_triage",
    "description": "Automated phishing email analysis",
    "version": "2.0",
    "enabled": true,
    "triggers": {
      "incident_type": "phishing",
      "auto_run": true
    },
    "variables": {
      "quarantine_threshold": 0.7
    },
    "steps": [
      {
        "name": "Parse Email",
        "action": "parse_email",
        "parameters": {
          "raw_email": "{{ incident.raw_data.raw_email }}"
        },
        "output": "parsed"
      }
    ],
    "created_at": "2024-01-01T00:00:00Z",
    "updated_at": "2024-01-10T00:00:00Z"
  }
}

Create Playbook

POST /api/playbooks

Request Body

{
  "name": "custom_playbook",
  "description": "My custom investigation playbook",
  "triggers": {
    "incident_type": "phishing",
    "auto_run": false
  },
  "steps": [
    {
      "name": "Parse Email",
      "action": "parse_email",
      "output": "parsed"
    }
  ]
}

Response

{
  "data": {
    "id": "pb-def456",
    "name": "custom_playbook",
    "version": "1.0",
    "created_at": "2024-01-15T12:00:00Z"
  }
}

Update Playbook

PUT /api/playbooks/:id

Request Body

{
  "description": "Updated description",
  "enabled": false
}

Delete Playbook

DELETE /api/playbooks/:id

Note: Built-in playbooks cannot be deleted.

Run Playbook

POST /api/playbooks/:id/run

Execute a playbook on an incident.

Request Body

{
  "incident_id": "550e8400-e29b-41d4-a716-446655440000",
  "variables": {
    "quarantine_threshold": 0.9
  }
}

Response

{
  "data": {
    "execution_id": "exec-abc123",
    "playbook_id": "pb-abc123",
    "incident_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "started_at": "2024-01-15T12:00:00Z",
    "completed_at": "2024-01-15T12:00:45Z",
    "steps_completed": 5,
    "steps_total": 5,
    "verdict": {
      "classification": "malicious",
      "confidence": 0.92
    }
  }
}

Get Playbook Executions

GET /api/playbooks/:id/executions

Response

{
  "data": [
    {
      "execution_id": "exec-abc123",
      "incident_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "completed",
      "duration_ms": 45000,
      "started_at": "2024-01-15T12:00:00Z"
    }
  ]
}

Validate Playbook

POST /api/playbooks/validate

Validate playbook YAML without creating it.

Request Body

{
  "content": "name: test\nsteps:\n  - action: parse_email"
}

Response (Valid)

{
  "data": {
    "valid": true,
    "warnings": []
  }
}

Response (Invalid)

{
  "data": {
    "valid": false,
    "errors": [
      {
        "line": 3,
        "message": "Unknown action: invalid_action"
      }
    ]
  }
}

Export Playbook

GET /api/playbooks/:id/export

Download playbook as YAML file.

Response

name: phishing_triage
description: Automated phishing email analysis
version: "2.0"
...

Webhooks API

Receive events from external security tools.

Endpoint

POST /api/webhooks/:source

Where :source identifies the sending system (e.g., email-gateway, edr, siem).

Authentication

Webhooks are authenticated via HMAC signatures:

X-Webhook-Signature: sha256=abc123...
X-Webhook-Timestamp: 1705320000

Registering Webhook Sources

Via CLI

tw-cli webhook add email-gateway \
  --secret "your-secret-key" \
  --auto-triage true \
  --playbook phishing_triage

Via API

curl -X POST "http://localhost:8080/api/webhooks" \
  -H "Authorization: Bearer tw_xxx" \
  -d '{
    "source": "email-gateway",
    "secret": "your-secret-key",
    "auto_triage": true,
    "playbook": "phishing_triage"
  }'

Payload Formats

Generic Format

{
  "event_type": "security_alert",
  "timestamp": "2024-01-15T10:00:00Z",
  "source": "email-gateway",
  "data": {
    "alert_id": "alert-123",
    "severity": "high",
    "details": {...}
  }
}

Microsoft Defender for Office 365

{
  "eventType": "PhishingEmail",
  "id": "AAMkAGI2...",
  "creationTime": "2024-01-15T10:00:00Z",
  "severity": "high",
  "category": "Phish",
  "entityType": "Email",
  "data": {
    "sender": "[email protected]",
    "subject": "Urgent Action Required",
    "recipients": ["[email protected]"]
  }
}

CrowdStrike Falcon

{
  "metadata": {
    "eventType": "DetectionSummaryEvent",
    "eventCreationTime": 1705320000000
  },
  "event": {
    "DetectId": "ldt:abc123",
    "Severity": 4,
    "HostnameField": "WORKSTATION-01",
    "DetectName": "Malicious File Detected"
  }
}

Splunk Alert

{
  "result": {
    "host": "server-01",
    "source": "WinEventLog:Security",
    "sourcetype": "WinEventLog",
    "_raw": "...",
    "EventCode": "4625"
  },
  "search_name": "Failed Login Alert",
  "trigger_time": 1705320000
}

Response

Success

{
  "status": "accepted",
  "incident_id": "550e8400-e29b-41d4-a716-446655440000",
  "incident_number": "INC-2024-0001"
}

Queued for Processing

{
  "status": "queued",
  "queue_id": "queue-abc123",
  "message": "Event queued for processing"
}

Configuring Auto-Triage

When auto_triage is enabled, incidents created from webhooks are automatically triaged:

# webhook_config.yaml
sources:
  email-gateway:
    secret: "${EMAIL_GATEWAY_SECRET}"
    auto_triage: true
    playbook: phishing_triage
    severity_mapping:
      critical: critical
      high: high
      medium: medium
      low: low

  edr:
    secret: "${EDR_SECRET}"
    auto_triage: true
    playbook: malware_triage

Testing Webhooks

Send Test Event

# Generate signature
TIMESTAMP=$(date +%s)
BODY='{"event_type":"test","data":{}}'
SIGNATURE=$(echo -n "${TIMESTAMP}.${BODY}" | openssl dgst -sha256 -hmac "your-secret" | awk '{print $2}')

# Send request
curl -X POST "http://localhost:8080/api/webhooks/email-gateway" \
  -H "Content-Type: application/json" \
  -H "X-Webhook-Signature: sha256=${SIGNATURE}" \
  -H "X-Webhook-Timestamp: ${TIMESTAMP}" \
  -d "${BODY}"

Verify Configuration

tw-cli webhook test email-gateway

Error Handling

Invalid Signature

{
  "error": {
    "code": "invalid_signature",
    "message": "Webhook signature verification failed"
  }
}

Unknown Source

{
  "error": {
    "code": "unknown_source",
    "message": "Webhook source 'unknown' is not registered"
  }
}

Replay Attack

{
  "error": {
    "code": "timestamp_expired",
    "message": "Webhook timestamp is too old (>5 minutes)"
  }
}

Monitoring Webhooks

Metrics

# Webhook receive rate
rate(webhook_received_total[5m])

# Error rate by source
rate(webhook_errors_total[5m])

Logs

tw-cli logs --filter webhook --tail 100

API Error Codes

All API errors return a consistent JSON structure with an error code, message, and optional details.

Error Response Format

{
  "code": "ERROR_CODE",
  "message": "Human-readable error message",
  "details": { ... },
  "request_id": "optional-request-id"
}

Error Codes Reference

Authentication Errors (4xx)

| Code | HTTP Status | Description | Resolution |
|---|---|---|---|
| UNAUTHORIZED | 401 | Missing or invalid authentication | Provide valid API key or session cookie |
| INVALID_CREDENTIALS | 401 | Invalid username or password | Check login credentials |
| SESSION_EXPIRED | 401 | Session has expired | Re-authenticate to get new session |
| INVALID_SIGNATURE | 401 | Webhook signature validation failed | Verify webhook secret configuration |
| FORBIDDEN | 403 | Authenticated but not authorized | Check user role and permissions |
| CSRF_VALIDATION_FAILED | 403 | CSRF token missing or invalid | Include valid CSRF token in request |
| ACCOUNT_DISABLED | 403 | User account is disabled | Contact administrator |

Client Errors (4xx)

| Code | HTTP Status | Description | Resolution |
|---|---|---|---|
| NOT_FOUND | 404 | Resource not found | Verify resource ID exists |
| BAD_REQUEST | 400 | Malformed request | Check request syntax and parameters |
| CONFLICT | 409 | Resource conflict (e.g., already exists) | Action already completed or duplicate resource |
| UNPROCESSABLE_ENTITY | 422 | Semantic error in request | Check request logic and data validity |
| VALIDATION_ERROR | 422 | Field validation failed | See details for field-specific errors |
| RATE_LIMIT_EXCEEDED | 429 | Too many requests | Wait and retry with exponential backoff |

Server Errors (5xx)

| Code | HTTP Status | Description | Resolution |
|---|---|---|---|
| INTERNAL_ERROR | 500 | Unexpected server error | Check server logs, contact support |
| DATABASE_ERROR | 500 | Database operation failed | Check database connectivity |
| SERVICE_UNAVAILABLE | 503 | Service temporarily unavailable | Retry later |

Detailed Error Examples

Validation Error

When field validation fails, the response includes detailed field-level errors:

{
  "code": "VALIDATION_ERROR",
  "message": "Validation failed",
  "details": {
    "name": {
      "code": "required",
      "message": "Name is required"
    },
    "email": {
      "code": "invalid_format",
      "message": "Invalid email format"
    }
  }
}

Not Found Error

{
  "code": "NOT_FOUND",
  "message": "Not found: Incident 550e8400-e29b-41d4-a716-446655440000 not found"
}

Conflict Error

Returned when attempting an action that conflicts with current state:

{
  "code": "CONFLICT",
  "message": "Conflict: Action is not pending approval (current status: Approved)"
}

Rate Limit Error

{
  "code": "RATE_LIMIT_EXCEEDED",
  "message": "Rate limit exceeded"
}

A Retry-After header is included when available; honor it before retrying.

Unauthorized Error

{
  "code": "UNAUTHORIZED",
  "message": "Unauthorized: No authentication provided"
}

Error Handling Best Practices

Client Implementation

import time

import requests

def handle_api_error(response):
    error = response.json()
    code = error.get('code')

    if code == 'RATE_LIMIT_EXCEEDED':
        # Back off for the server-suggested interval, then retry
        retry_after = int(response.headers.get('Retry-After', 60))
        time.sleep(retry_after)
        return retry_request()  # application-defined retry helper

    elif code == 'SESSION_EXPIRED':
        # Re-authenticate, then retry
        refresh_session()  # application-defined re-auth helper
        return retry_request()

    elif code == 'VALIDATION_ERROR':
        # Surface field-specific errors
        for field, details in error.get('details', {}).items():
            print(f"Field '{field}': {details['message']}")

    elif code in ['INTERNAL_ERROR', 'DATABASE_ERROR']:
        # Log and alert on server errors
        log_error(error)  # application-defined logger
        raise ServerError(error['message'])

Retry Strategy

For transient errors (5xx, RATE_LIMIT_EXCEEDED), implement exponential backoff:

import time
import random

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        # Application-defined exceptions raised for 429/503 responses
        except (RateLimitError, ServiceUnavailableError):
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

HTTP Status Code Summary

| Status | Meaning | Retryable |
|---|---|---|
| 400 | Bad Request | No |
| 401 | Unauthorized | After re-auth |
| 403 | Forbidden | No |
| 404 | Not Found | No |
| 409 | Conflict | No |
| 422 | Unprocessable Entity | After fixing request |
| 429 | Rate Limited | Yes, with backoff |
| 500 | Internal Error | Yes, with caution |
| 503 | Service Unavailable | Yes, with backoff |
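The retryability column reduces to a small predicate. A sketch (should_retry is an illustrative helper; the 401 row is retryable only once, after re-authenticating):

```python
RETRYABLE_STATUSES = {429, 500, 503}

def should_retry(status: int, reauthenticated: bool = False) -> bool:
    """Whether a failed request is worth retrying, per the table above."""
    if status == 401:
        # Retry once after re-authenticating; give up if we already did
        return not reauthenticated
    return status in RETRYABLE_STATUSES
```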

Configuration Guide

Complete guides for configuring Triage Warden.

Initial Setup

After installation, configure Triage Warden in this order:

  1. Environment Variables - Set required environment variables
  2. Connectors - Connect to your security tools
  3. Notifications - Set up alert channels
  4. Playbooks - Create automation workflows
  5. Policies - Define approval and safety rules
  6. SSO Integrations - Configure enterprise identity providers

Quick Configuration

First Run

After starting Triage Warden, log in with the default credentials:

  • Username: admin
  • Password: admin

Important: Change the default password immediately!

Essential Settings

Navigate to Settings and configure:

  1. General

    • Organization name
    • Timezone
    • Operation mode (Assisted → Supervised → Autonomous)
  2. AI/LLM

    • Select provider (Anthropic, OpenAI, or Local)
    • Enter API key
    • Choose model
  3. Connectors (at minimum)

    • Threat intelligence (VirusTotal recommended)
    • Your primary SIEM or alert source
  4. Notifications

    • At least one channel for critical alerts

Configuration Methods

Most settings can be configured through the web dashboard at Settings.

Pros:

  • User-friendly interface
  • Validation feedback
  • Immediate effect

Environment Variables

For deployment configuration and secrets:

# Required
DATABASE_URL=postgres://...
TW_ENCRYPTION_KEY=...

# Optional overrides
TW_LLM_PROVIDER=anthropic
TW_LLM_MODEL=claude-3-sonnet

See Environment Variables Reference for full list.

Configuration Files

For complex configurations:

# config/default.yaml
server:
  bind_address: "0.0.0.0:8080"

guardrails:
  max_actions_per_incident: 10
  blocked_actions: []

Configuration Hierarchy

Configuration is loaded in this order (later overrides earlier):

1. Built-in defaults
         ↓
2. config/default.yaml
         ↓
3. config/{environment}.yaml
         ↓
4. Environment variables
         ↓
5. Database settings (via UI)
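The override behavior can be pictured as a recursive dictionary merge, later layers winning key by key. A sketch of the precedence only (merge_config and effective_config are illustrative, not the actual loader):

```python
def merge_config(base, override):
    """Merge one configuration layer over another; keys in `override`
    win, and nested dicts merge key by key rather than being replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

def effective_config(*layers):
    """Apply layers in order: defaults, YAML files, env vars, DB settings."""
    config = {}
    for layer in layers:
        config = merge_config(config, layer)
    return config
```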

Validation

Triage Warden validates configuration at startup:

# Validate without starting
triage-warden serve --validate-only

# Check specific configuration
triage-warden config check

Common Validation Errors

| Error | Solution |
|---|---|
| Missing TW_ENCRYPTION_KEY | Set encryption key environment variable |
| Invalid DATABASE_URL | Check connection string format |
| LLM API key required | Set API key or disable LLM features |
| Guardrails file not found | Create config/guardrails.yaml |

Backup Configuration

Before making changes, backup current settings:

# Export settings via API
curl -H "Authorization: Bearer $API_KEY" \
  http://localhost:8080/api/settings/export > settings-backup.json

# Restore settings
curl -X POST -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d @settings-backup.json \
  http://localhost:8080/api/settings/import

Environment Variables Reference

Complete reference of all environment variables for Triage Warden.

Required Variables

These must be set for Triage Warden to start.

Database

| Variable | Description | Example |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | postgres://user:pass@localhost:5432/triage_warden |

Connection String Format:

postgres://username:password@hostname:port/database?sslmode=require

SSL Modes:

  • disable - No SSL (development only)
  • require - SSL required, no certificate verification
  • verify-ca - Verify server certificate against CA
  • verify-full - Verify server certificate and hostname

Security

| Variable | Description | Example |
|---|---|---|
| TW_ENCRYPTION_KEY | Credential encryption key (32 bytes, base64) | K7gNU3sdo+OL0wNhqoVW... |
| TW_JWT_SECRET | JWT signing secret (min 32 characters) | your-very-long-jwt-secret-here |
| TW_SESSION_SECRET | Session encryption secret | your-session-secret-here |

Generating Keys:

# Encryption key (32 bytes, base64)
openssl rand -base64 32

# JWT/Session secret (hex)
openssl rand -hex 32

Server Configuration

| Variable | Description | Default |
|---|---|---|
| TW_BIND_ADDRESS | Server bind address | 0.0.0.0:8080 |
| TW_BASE_URL | Public URL for the application | http://localhost:8080 |
| TW_TRUSTED_PROXIES | Comma-separated trusted proxy IPs | None |
| TW_MAX_REQUEST_SIZE | Maximum request body size | 10MB |
| TW_REQUEST_TIMEOUT | Request timeout in seconds | 30 |

Example:

TW_BIND_ADDRESS=0.0.0.0:8080
TW_BASE_URL=https://triage.company.com
TW_TRUSTED_PROXIES=10.0.0.0/8,172.16.0.0/12

Database Configuration

| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | Connection string | Required |
| DATABASE_MAX_CONNECTIONS | Maximum pool connections | 10 |
| DATABASE_MIN_CONNECTIONS | Minimum pool connections | 1 |
| DATABASE_CONNECT_TIMEOUT | Connection timeout (seconds) | 30 |
| DATABASE_IDLE_TIMEOUT | Idle connection timeout (seconds) | 600 |
| DATABASE_MAX_LIFETIME | Max connection lifetime (seconds) | 1800 |

High-Traffic Configuration:

DATABASE_MAX_CONNECTIONS=50
DATABASE_MIN_CONNECTIONS=5
DATABASE_IDLE_TIMEOUT=300

Authentication

| Variable | Description | Default |
|---|---|---|
| TW_JWT_SECRET | JWT signing secret | Required |
| TW_JWT_EXPIRY | JWT token expiry | 24h |
| TW_SESSION_SECRET | Session encryption key | Required |
| TW_SESSION_EXPIRY | Session duration | 7d |
| TW_CSRF_ENABLED | Enable CSRF protection | true |
| TW_COOKIE_SECURE | Require HTTPS for cookies | false |
| TW_COOKIE_SAME_SITE | SameSite cookie policy | lax |

Production Settings:

TW_COOKIE_SECURE=true
TW_COOKIE_SAME_SITE=strict
TW_SESSION_EXPIRY=1d

LLM Configuration

Provider Selection

| Variable | Description | Default |
|---|---|---|
| TW_LLM_PROVIDER | LLM provider | openai |
| TW_LLM_MODEL | Model name | gpt-4-turbo |
| TW_LLM_ENABLED | Enable LLM features | true |

Valid Providers: openai, anthropic, azure, local

API Keys

| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key |
| ANTHROPIC_API_KEY | Anthropic API key |
| AZURE_OPENAI_API_KEY | Azure OpenAI API key |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint URL |

Model Parameters

| Variable | Description | Default |
|---|---|---|
| TW_LLM_TEMPERATURE | Response randomness (0.0-2.0) | 0.2 |
| TW_LLM_MAX_TOKENS | Maximum response tokens | 4096 |
| TW_LLM_TIMEOUT | Request timeout (seconds) | 60 |

Example Configuration:

# Using Anthropic
TW_LLM_PROVIDER=anthropic
TW_LLM_MODEL=claude-3-sonnet-20240229
ANTHROPIC_API_KEY=sk-ant-api03-...
TW_LLM_TEMPERATURE=0.1
TW_LLM_MAX_TOKENS=8192

# Using Azure OpenAI
TW_LLM_PROVIDER=azure
AZURE_OPENAI_API_KEY=your-azure-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
TW_LLM_MODEL=gpt-4-deployment-name

Logging & Observability

Variable | Description | Default
RUST_LOG | Log level filter | info
TW_LOG_FORMAT | Log format (json or pretty) | json
TW_LOG_FILE | Log file path (optional) | None

Log Levels

# Basic levels
RUST_LOG=info          # Info and above
RUST_LOG=debug         # Debug and above
RUST_LOG=warn          # Warnings and errors only

# Granular control
RUST_LOG=info,triage_warden=debug                    # Debug for app, info for deps
RUST_LOG=warn,triage_warden::api=debug               # Debug specific module
RUST_LOG=info,sqlx=warn,hyper=warn                   # Quiet noisy dependencies

Metrics & Tracing

Variable | Description | Default
TW_METRICS_ENABLED | Enable Prometheus metrics | true
TW_METRICS_PATH | Metrics endpoint path | /metrics
TW_TRACING_ENABLED | Enable distributed tracing | false
OTEL_EXPORTER_OTLP_ENDPOINT | OpenTelemetry endpoint | None
OTEL_SERVICE_NAME | Service name for traces | triage-warden

Tracing Setup:

TW_TRACING_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=triage-warden-prod

Rate Limiting

Variable | Description | Default
TW_RATE_LIMIT_ENABLED | Enable rate limiting | true
TW_RATE_LIMIT_REQUESTS | Requests per window | 100
TW_RATE_LIMIT_WINDOW | Rate limit window | 1m
TW_RATE_LIMIT_BURST | Burst allowance | 20
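
Rate limits with a window plus a burst allowance are commonly implemented as a token bucket. The sketch below is illustrative of that general technique, not Triage Warden's actual implementation:

```python
import time

class TokenBucket:
    """Allows `requests` per `window_seconds`, plus a `burst` allowance."""

    def __init__(self, requests, window_seconds, burst):
        self.capacity = requests + burst
        self.tokens = float(self.capacity)
        self.refill_rate = requests / window_seconds  # tokens per second
        self.updated = time.monotonic()

    def allow(self):
        # Refill tokens for the time elapsed since the last check.
        now = time.monotonic()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# TW_RATE_LIMIT_REQUESTS=100, TW_RATE_LIMIT_WINDOW=1m, TW_RATE_LIMIT_BURST=20
bucket = TokenBucket(requests=100, window_seconds=60, burst=20)
```

With these settings a quiet client can send up to 120 requests at once (100 + 20 burst), after which requests are admitted at the steady refill rate.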

Webhooks

Variable | Description | Default
TW_WEBHOOK_SECRET | Default webhook signature secret | None
TW_WEBHOOK_SPLUNK_SECRET | Splunk-specific secret | None
TW_WEBHOOK_CROWDSTRIKE_SECRET | CrowdStrike-specific secret | None
TW_WEBHOOK_DEFENDER_SECRET | Defender-specific secret | None
TW_WEBHOOK_SENTINEL_SECRET | Sentinel-specific secret | None

CORS Configuration

Variable | Description | Default
TW_CORS_ENABLED | Enable CORS | true
TW_CORS_ORIGINS | Allowed origins (comma-separated) | *
TW_CORS_METHODS | Allowed methods | GET,POST,PUT,DELETE,OPTIONS
TW_CORS_HEADERS | Allowed headers | *
TW_CORS_MAX_AGE | Preflight cache duration (seconds) | 86400

Production CORS:

TW_CORS_ORIGINS=https://triage.company.com,https://admin.company.com

Feature Flags

Variable | Description | Default
TW_FEATURE_PLAYBOOKS | Enable playbook execution | true
TW_FEATURE_AUTO_ENRICH | Enable automatic enrichment | true
TW_FEATURE_API_KEYS | Enable API key management | true

Development Variables

Not recommended for production:

Variable | Description | Default
TW_DEV_MODE | Enable development mode | false
TW_SEED_DATA | Seed database with test data | false
TW_DISABLE_AUTH | Disable authentication | false

Example Configurations

Development

DATABASE_URL=sqlite:./dev.db
TW_ENCRYPTION_KEY=$(openssl rand -base64 32)
TW_JWT_SECRET=dev-jwt-secret-not-for-production
TW_SESSION_SECRET=dev-session-secret
RUST_LOG=debug
TW_LOG_FORMAT=pretty
TW_DEV_MODE=true

Production

# Database
DATABASE_URL=postgres://tw:[email protected]:5432/triage_warden?sslmode=verify-full
DATABASE_MAX_CONNECTIONS=25

# Security
TW_ENCRYPTION_KEY=your-production-encryption-key
TW_JWT_SECRET=your-production-jwt-secret-minimum-32-chars
TW_SESSION_SECRET=your-production-session-secret
TW_COOKIE_SECURE=true
TW_COOKIE_SAME_SITE=strict

# Server
TW_BASE_URL=https://triage.company.com
TW_TRUSTED_PROXIES=10.0.0.0/8

# LLM
TW_LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-api03-...
TW_LLM_MODEL=claude-3-sonnet-20240229

# Logging
RUST_LOG=info
TW_LOG_FORMAT=json
TW_METRICS_ENABLED=true

# Rate limiting
TW_RATE_LIMIT_ENABLED=true
TW_RATE_LIMIT_REQUESTS=200
TW_RATE_LIMIT_WINDOW=1m

Kubernetes

apiVersion: v1
kind: Secret
metadata:
  name: triage-warden-secrets
type: Opaque
stringData:
  DATABASE_URL: "postgres://user:pass@postgres:5432/triage_warden"
  TW_ENCRYPTION_KEY: "base64-encoded-32-byte-key"
  TW_JWT_SECRET: "jwt-signing-secret"
  TW_SESSION_SECRET: "session-secret"
  ANTHROPIC_API_KEY: "sk-ant-..."
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: triage-warden-config
data:
  TW_BASE_URL: "https://triage.company.com"
  TW_LLM_PROVIDER: "anthropic"
  TW_LLM_MODEL: "claude-3-sonnet-20240229"
  RUST_LOG: "info"
  TW_METRICS_ENABLED: "true"

Connector Setup Guide

Step-by-step instructions for configuring each connector type.

Overview

Connectors enable Triage Warden to:

  • Ingest alerts from SIEMs and security tools
  • Enrich incidents with threat intelligence
  • Execute actions like creating tickets or isolating hosts
  • Send notifications to communication platforms

Adding a Connector

  1. Navigate to Settings → Connectors
  2. Click Add Connector
  3. Select connector type
  4. Fill in the required fields
  5. Click Test Connection to verify
  6. Click Save

Threat Intelligence Connectors

VirusTotal

Enriches file hashes, URLs, IPs, and domains with reputation data.

Prerequisites:

  • VirusTotal account with an API key (free or premium tier)

Configuration:

Field | Value
Name | VirusTotal
Type | virustotal
API Key | Your API key
Rate Limit | 4 (free) or 500 (premium)

Rate Limits:

  • Free tier: 4 requests/minute
  • Premium: 500+ requests/minute

Verify It Works:

  1. Create a test incident with a known-bad hash
  2. Check incident enrichments for VirusTotal data

AlienVault OTX

Open threat intelligence from AlienVault.

Prerequisites:

  • AlienVault OTX account with an API key

Configuration:

Field | Value
Name | AlienVault OTX
Type | alienvault
API Key | Your OTX API key

SIEM Connectors

Splunk

Ingest alerts from Splunk and run queries.

Prerequisites:

  • Splunk Enterprise or Cloud
  • HTTP Event Collector (HEC) token
  • User with search capabilities

Configuration:

Field | Value
Name | Splunk Production
Type | splunk
Host | https://splunk.company.com:8089
Username | Service account username
Password | Service account password
App | search (or your app context)

Setting Up Webhooks:

  1. In Splunk, create an alert action that sends to webhook
  2. Configure webhook URL: https://triage.company.com/api/webhooks/splunk
  3. Set webhook secret in Triage Warden connector config

Elastic Security

Connect to Elastic Security for SIEM alerts.

Prerequisites:

  • Elasticsearch 7.x or 8.x
  • User with read access to security indices

Configuration:

Field | Value
Name | Elastic SIEM
Type | elastic
URL | https://elasticsearch.company.com:9200
Username | Service account username
Password | Service account password
Index Pattern | security-* or .alerts-security.*

Microsoft Sentinel

Azure Sentinel integration for cloud SIEM.

Prerequisites:

  • Azure subscription with Sentinel workspace
  • App registration with Log Analytics Reader role

Configuration:

Field | Value
Name | Azure Sentinel
Type | sentinel
Workspace ID | Log Analytics Workspace ID
Tenant ID | Azure AD Tenant ID
Client ID | App Registration Client ID
Client Secret | App Registration Secret

Azure Setup:

  1. Create App Registration in Azure AD
  2. Grant Log Analytics Reader role on Sentinel workspace
  3. Create client secret
  4. Copy IDs and secret to Triage Warden

EDR Connectors

CrowdStrike Falcon

Endpoint detection and host isolation.

Prerequisites:

  • CrowdStrike Falcon subscription
  • API client with appropriate scopes

Configuration:

Field | Value
Name | CrowdStrike Falcon
Type | crowdstrike
Region | us-1, us-2, eu-1, or us-gov-1
Client ID | OAuth Client ID
Client Secret | OAuth Client Secret

Required API Scopes:

  • Detections: Read
  • Hosts: Read, Write (for isolation)
  • Incidents: Read

CrowdStrike Setup:

  1. Go to Support → API Clients and Keys
  2. Create new API client
  3. Select required scopes
  4. Copy Client ID and Secret

Microsoft Defender for Endpoint

MDE integration for alerts and host actions.

Prerequisites:

  • Microsoft 365 E5 or Defender for Endpoint license
  • App registration with Defender API permissions

Configuration:

Field | Value
Name | Defender for Endpoint
Type | defender
Tenant ID | Azure AD Tenant ID
Client ID | App Registration Client ID
Client Secret | App Registration Secret

Required API Permissions:

  • Alert.Read.All
  • Machine.Read.All
  • Machine.Isolate (for isolation actions)

SentinelOne

SentinelOne EDR integration.

Prerequisites:

  • SentinelOne console access
  • API token with appropriate permissions

Configuration:

Field | Value
Name | SentinelOne
Type | sentinelone
Console URL | https://usea1-pax8.sentinelone.net
API Token | Your API token

Ticketing Connectors

Jira

Create and manage security tickets.

Prerequisites:

  • Jira Cloud or Server instance
  • API token (Cloud) or password (Server)

Configuration:

Field | Value
Name | Jira Security
Type | jira
URL | https://yourcompany.atlassian.net
Email | Your Jira email
API Token | API token from Atlassian account
Default Project | SEC (your security project key)

Jira Cloud Setup:

  1. Go to id.atlassian.com/manage-profile/security/api-tokens
  2. Create API token
  3. Use your email as username

Jira Server Setup:

  • Use password instead of API token
  • Ensure user has project access

ServiceNow

ServiceNow ITSM integration.

Prerequisites:

  • ServiceNow instance
  • User with incident table access

Configuration:

Field | Value
Name | ServiceNow
Type | servicenow
Instance URL | https://yourcompany.service-now.com
Username | Service account username
Password | Service account password

Identity Connectors

Microsoft 365 / Azure AD

User management and sign-in data.

Prerequisites:

  • Azure AD with appropriate licenses
  • App registration with Graph API permissions

Configuration:

Field | Value
Name | Microsoft 365
Type | m365
Tenant ID | Azure AD Tenant ID
Client ID | App Registration Client ID
Client Secret | App Registration Secret

Required API Permissions:

  • User.Read.All
  • AuditLog.Read.All
  • User.RevokeSessions.All (for user disable)

Google Workspace

Google Workspace user management.

Prerequisites:

  • Google Workspace admin access
  • Service account with domain-wide delegation

Configuration:

Field | Value
Name | Google Workspace
Type | google
Service Account JSON | Paste JSON key file contents
Domain | company.com

Google Setup:

  1. Create service account in Google Cloud Console
  2. Enable domain-wide delegation
  3. Add required OAuth scopes in Google Admin
  4. Download JSON key file

Testing Connectors

After configuration, always test:

  1. Click Test Connection in connector settings
  2. Check the response for success/errors
  3. For ingestion connectors, verify sample data appears

Common Issues

Error | Solution
Connection refused | Check URL and network access
401 Unauthorized | Verify credentials/API key
403 Forbidden | Check permissions/scopes
SSL certificate error | Verify certificate or disable verification
Rate limited | Reduce request rate or upgrade tier

Connector Health

Monitor connector health at Settings → Connectors or via API:

curl http://localhost:8080/health/detailed | jq '.components.connectors'

Healthy connectors show status connected. Troubleshoot any showing error or disconnected.

Playbooks Guide

Create effective automated response playbooks.

What is a Playbook?

A playbook is an automated workflow that executes when specific conditions are met. Playbooks contain:

  • Trigger - Conditions that start the playbook
  • Stages - Ordered groups of steps
  • Steps - Individual actions to execute

Creating a Playbook

Via Web UI

  1. Navigate to Playbooks
  2. Click Create Playbook
  3. Enter name and description
  4. Configure trigger conditions
  5. Add stages and steps
  6. Enable and save

Via API

curl -X POST http://localhost:8080/api/playbooks \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Phishing Response",
    "description": "Automated response for phishing alerts",
    "trigger": {
      "type": "incident_created",
      "conditions": {
        "source": "email_gateway",
        "severity": ["high", "critical"]
      }
    },
    "stages": [...]
  }'

Trigger Types

incident_created

Fires when a new incident is created.

{
  "type": "incident_created",
  "conditions": {
    "severity": ["high", "critical"],
    "source": "crowdstrike",
    "title_contains": "malware"
  }
}

incident_updated

Fires when an incident is updated.

{
  "type": "incident_updated",
  "conditions": {
    "field": "severity",
    "new_value": "critical"
  }
}

scheduled

Fires on a schedule (cron format).

{
  "type": "scheduled",
  "schedule": "0 */6 * * *"
}

manual

Only triggered manually by user action.

{
  "type": "manual"
}
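
Trigger conditions like the ones above reduce to simple field matching against the incoming incident. A hedged sketch of how such matching might work (field and condition names follow the JSON examples above; the function itself is illustrative):

```python
def trigger_matches(incident: dict, conditions: dict) -> bool:
    """Return True only if every condition matches the incident's fields."""
    for key, expected in conditions.items():
        if key == "title_contains":
            # Substring match on the title, case-insensitive.
            if expected.lower() not in incident.get("title", "").lower():
                return False
        elif isinstance(expected, list):
            # A list means "any of these", e.g. severity: ["high", "critical"].
            if incident.get(key) not in expected:
                return False
        elif incident.get(key) != expected:
            return False
    return True

incident = {"title": "Malware beacon detected", "severity": "high", "source": "crowdstrike"}
conditions = {"severity": ["high", "critical"], "source": "crowdstrike", "title_contains": "malware"}
```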

Stages

Stages group steps that should execute together. Configure:

  • Name - Descriptive name
  • Description - What this stage does
  • Parallel - Execute steps in parallel (default: false)

Sequential Execution

{
  "stages": [
    {
      "name": "Enrichment",
      "steps": [/* step 1, step 2, step 3 */]
    },
    {
      "name": "Response",
      "steps": [/* step 4, step 5 */]
    }
  ]
}

Steps in Enrichment complete before Response starts.

Parallel Execution

{
  "stages": [
    {
      "name": "Gather Intel",
      "parallel": true,
      "steps": [
        {"action": "lookup_hash_virustotal"},
        {"action": "lookup_ip_reputation"},
        {"action": "lookup_domain_reputation"}
      ]
    }
  ]
}

All lookups run simultaneously.

Step Types

Enrichment Actions

lookup_hash

Look up file hash reputation.

{
  "action": "lookup_hash",
  "parameters": {
    "hash": "{{ incident.iocs.file_hash }}",
    "providers": ["virustotal", "alienvault"]
  }
}

lookup_ip

Look up IP address reputation.

{
  "action": "lookup_ip",
  "parameters": {
    "ip": "{{ incident.source_ip }}"
  }
}

lookup_domain

Look up domain reputation.

{
  "action": "lookup_domain",
  "parameters": {
    "domain": "{{ incident.domain }}"
  }
}

lookup_user

Get user details from identity provider.

{
  "action": "lookup_user",
  "parameters": {
    "email": "{{ incident.user_email }}",
    "provider": "m365"
  }
}

Containment Actions

isolate_host

Isolate endpoint from network.

{
  "action": "isolate_host",
  "parameters": {
    "hostname": "{{ incident.hostname }}",
    "provider": "crowdstrike"
  },
  "requires_approval": true
}

disable_user

Disable user account.

{
  "action": "disable_user",
  "parameters": {
    "email": "{{ incident.user_email }}",
    "provider": "m365"
  },
  "requires_approval": true
}

block_ip

Block IP address at firewall.

{
  "action": "block_ip",
  "parameters": {
    "ip": "{{ incident.source_ip }}",
    "duration": "24h"
  },
  "requires_approval": true
}

Notification Actions

send_notification

Send alert to notification channel.

{
  "action": "send_notification",
  "parameters": {
    "channel": "slack-security",
    "message": "Critical incident: {{ incident.title }}"
  }
}

create_ticket

Create ticket in ticketing system.

{
  "action": "create_ticket",
  "parameters": {
    "provider": "jira",
    "project": "SEC",
    "type": "Incident",
    "title": "{{ incident.title }}",
    "description": "{{ incident.description }}"
  }
}

Analysis Actions

analyze_with_llm

Run AI analysis on incident.

{
  "action": "analyze_with_llm",
  "parameters": {
    "prompt": "Analyze this security incident and provide recommendations",
    "include_enrichments": true
  }
}

Utility Actions

wait

Pause execution for specified duration.

{
  "action": "wait",
  "parameters": {
    "duration": "5m"
  }
}

set_severity

Update incident severity.

{
  "action": "set_severity",
  "parameters": {
    "severity": "critical"
  }
}

add_comment

Add comment to incident.

{
  "action": "add_comment",
  "parameters": {
    "comment": "Automated enrichment complete. Found {{ enrichments.virustotal.positives }} detections."
  }
}

Variables and Templates

Use Jinja2-style templates to reference incident data:

Available Variables

Variable | Description
{{ incident.id }} | Incident UUID
{{ incident.title }} | Incident title
{{ incident.severity }} | Severity level
{{ incident.source }} | Alert source
{{ incident.description }} | Full description
{{ incident.hostname }} | Affected hostname
{{ incident.username }} | Affected username
{{ incident.source_ip }} | Source IP address
{{ incident.iocs.* }} | Extracted IOCs
{{ enrichments.* }} | Enrichment results
{{ previous_step.output }} | Previous step output
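
At its core, expanding these placeholders is a nested-dictionary lookup. A minimal sketch of `{{ ... }}` substitution (illustrative only; the real engine also supports expressions and filters):

```python
import re

def render(template: str, context: dict) -> str:
    """Replace {{ dotted.path }} placeholders with values from a nested dict."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

message = render(
    "Critical incident: {{ incident.title }} ({{ incident.severity }})",
    {"incident": {"title": "Malware on HOST-1", "severity": "critical"}},
)
```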

Conditional Logic

{
  "action": "isolate_host",
  "conditions": "{{ incident.severity == 'critical' and enrichments.virustotal.positives > 5 }}"
}

Approval Requirements

Mark steps as requiring approval for dangerous actions:

{
  "action": "disable_user",
  "requires_approval": true
}

When requires_approval: true:

  1. Step pauses at approval queue
  2. Analyst reviews and approves/denies
  3. Execution continues or stops

Example Playbooks

Phishing Triage

{
  "name": "Phishing Triage",
  "description": "Automated triage for reported phishing emails",
  "trigger": {
    "type": "incident_created",
    "conditions": {
      "source": "email_gateway",
      "title_contains": "phishing"
    }
  },
  "stages": [
    {
      "name": "Extract and Enrich",
      "parallel": true,
      "steps": [
        {
          "action": "lookup_domain",
          "parameters": {"domain": "{{ incident.sender_domain }}"}
        },
        {
          "action": "lookup_url",
          "parameters": {"url": "{{ incident.iocs.url }}"}
        },
        {
          "action": "lookup_user",
          "parameters": {"email": "{{ incident.recipient }}"}
        }
      ]
    },
    {
      "name": "Analyze",
      "steps": [
        {
          "action": "analyze_with_llm",
          "parameters": {
            "prompt": "Analyze this phishing attempt and determine if it's targeted spear-phishing"
          }
        }
      ]
    },
    {
      "name": "Respond",
      "steps": [
        {
          "action": "send_notification",
          "parameters": {
            "channel": "slack-phishing",
            "message": "Phishing alert: {{ incident.title }}\nSender: {{ incident.sender }}\nVerdict: {{ analysis.verdict }}"
          }
        },
        {
          "action": "create_ticket",
          "conditions": "{{ analysis.verdict == 'malicious' }}",
          "parameters": {
            "provider": "jira",
            "project": "SEC",
            "title": "Phishing: {{ incident.title }}"
          }
        }
      ]
    }
  ]
}

Malware Containment

{
  "name": "Malware Containment",
  "description": "Isolate hosts with confirmed malware",
  "trigger": {
    "type": "incident_created",
    "conditions": {
      "source": "crowdstrike",
      "severity": "critical",
      "title_contains": "malware"
    }
  },
  "stages": [
    {
      "name": "Verify",
      "steps": [
        {
          "action": "lookup_hash",
          "parameters": {"hash": "{{ incident.iocs.file_hash }}"}
        }
      ]
    },
    {
      "name": "Contain",
      "steps": [
        {
          "action": "isolate_host",
          "conditions": "{{ enrichments.virustotal.positives >= 5 }}",
          "requires_approval": true,
          "parameters": {
            "hostname": "{{ incident.hostname }}",
            "reason": "Confirmed malware with {{ enrichments.virustotal.positives }} detections"
          }
        }
      ]
    },
    {
      "name": "Notify",
      "steps": [
        {
          "action": "send_notification",
          "parameters": {
            "channel": "pagerduty-security",
            "message": "Host {{ incident.hostname }} isolated due to malware"
          }
        }
      ]
    }
  ]
}

Best Practices

  1. Start small - Begin with enrichment-only playbooks before adding containment
  2. Require approval - Always require approval for containment actions initially
  3. Test in staging - Test playbooks with mock incidents first
  4. Monitor execution - Watch playbook executions for errors
  5. Document thoroughly - Include clear descriptions for each stage/step
  6. Use conditions - Don't execute actions blindly; use conditions to validate
  7. Handle failures - Consider what happens if a step fails

Troubleshooting

Playbook Not Triggering

  • Verify trigger conditions match incoming incidents
  • Check playbook is enabled
  • Review trigger condition syntax

Step Failing

  • Check connector is healthy
  • Verify required parameters are provided
  • Check variable templates resolve correctly
  • Review step logs in incident timeline

Approval Stuck

  • Check Approvals queue for pending items
  • Verify approvers have notification channel configured
  • Consider timeout settings for approvals

Notifications Setup Guide

Configure notification channels for alerts and incident updates.

Overview

Triage Warden supports multiple notification channels:

  • Slack - Team messaging
  • Microsoft Teams - Enterprise collaboration
  • PagerDuty - On-call alerting
  • Email - SMTP notifications
  • Webhooks - Custom integrations

Adding a Notification Channel

  1. Navigate to Settings → Notifications
  2. Click Add Channel
  3. Select channel type
  4. Configure settings
  5. Test and save

Slack

Prerequisites

  • Slack workspace admin access
  • Slack app with webhook permissions

Setup Steps

  1. Create Slack App:

    • Go to api.slack.com/apps
    • Click Create New App → From scratch
    • Name it "Triage Warden" and select your workspace
  2. Enable Incoming Webhooks:

    • In app settings, click Incoming Webhooks
    • Toggle Activate Incoming Webhooks to On
    • Click Add New Webhook to Workspace
    • Select the channel for alerts
  3. Copy Webhook URL:

    • Copy the webhook URL (starts with https://hooks.slack.com/...)
  4. Configure in Triage Warden:

Field | Value
Name | Slack - Security
Type | slack
Webhook URL | Your webhook URL
Channel | #security-alerts

Message Format

Triage Warden sends formatted Slack messages with:

  • Severity color coding (red=critical, orange=high, yellow=medium, gray=low)
  • Incident summary and details
  • Quick action buttons (View, Acknowledge)
  • Enrichment highlights

Example Notification

{
  "attachments": [{
    "color": "#ff0000",
    "title": "Critical: Malware Detected on WORKSTATION-001",
    "text": "CrowdStrike detected Emotet malware on endpoint",
    "fields": [
      {"title": "Source", "value": "CrowdStrike", "short": true},
      {"title": "Severity", "value": "Critical", "short": true}
    ],
    "actions": [
      {"type": "button", "text": "View Incident", "url": "https://..."}
    ]
  }]
}

Microsoft Teams

Prerequisites

  • Microsoft 365 account
  • Teams channel where you can add connectors

Setup Steps

  1. Add Incoming Webhook Connector:

    • In Teams, go to the channel for alerts
    • Click ... → Connectors
    • Find Incoming Webhook and click Configure
    • Name it "Triage Warden" and upload an icon (optional)
    • Click Create
  2. Copy Webhook URL:

    • Copy the generated webhook URL
  3. Configure in Triage Warden:

Field | Value
Name | Teams - Security
Type | teams
Webhook URL | Your webhook URL

Adaptive Cards

Triage Warden sends Teams notifications as Adaptive Cards with:

  • Severity indicators
  • Incident details in structured format
  • Action buttons for quick response

PagerDuty

Prerequisites

  • PagerDuty account
  • Service with Events API v2 integration

Setup Steps

  1. Create PagerDuty Service:

    • In PagerDuty, go to Services → New Service
    • Name it "Triage Warden Alerts"
    • Add an escalation policy
  2. Add Events API Integration:

    • On the service page, go to Integrations
    • Click Add Integration
    • Select Events API v2
    • Copy the Integration Key
  3. Configure in Triage Warden:

Field | Value
Name | PagerDuty - Security
Type | pagerduty
Integration Key | Your integration key
Severity Mapping | See below

Severity Mapping

Map Triage Warden severities to PagerDuty:

TW Severity | PagerDuty Severity
Critical | critical
High | error
Medium | warning
Low | info
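
The mapping is a straight lookup when building the PagerDuty event. A sketch of how it might be applied to produce an Events API v2 trigger payload (the helper and incident fields are illustrative; the payload keys follow PagerDuty's public Events API):

```python
PAGERDUTY_SEVERITY = {
    "critical": "critical",
    "high": "error",
    "medium": "warning",
    "low": "info",
}

def to_pagerduty_event(incident: dict, routing_key: str) -> dict:
    """Build a PagerDuty Events API v2 trigger payload from an incident."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": incident["title"],
            "source": incident.get("source", "triage-warden"),
            # Fall back to the lowest urgency for unknown severities.
            "severity": PAGERDUTY_SEVERITY.get(incident["severity"], "info"),
        },
    }

event = to_pagerduty_event({"title": "Malware detected", "severity": "high"}, "R0UT1NGKEY")
```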

Auto-Resolution

Configure auto-resolution to close PagerDuty incidents when Triage Warden incidents are resolved:

notifications:
  pagerduty:
    auto_resolve: true
    resolve_on_status:
      - resolved
      - closed
      - false_positive

Email (SMTP)

Prerequisites

  • SMTP server credentials
  • Recipient email addresses

Configuration

Field | Value
Name | Email - SOC Team
Type | email
SMTP Host | smtp.company.com
SMTP Port | 587
Username | [email protected]
Password | SMTP password
From Address | [email protected]
To Addresses | [email protected]
Use TLS | true

Email Templates

Customize email templates by creating files in config/templates/:

config/templates/
├── email_incident_created.html
├── email_incident_updated.html
└── email_incident_resolved.html

Template variables:

  • {{ incident.title }} - Incident title
  • {{ incident.severity }} - Severity level
  • {{ incident.source }} - Alert source
  • {{ incident.description }} - Full description
  • {{ incident.url }} - Link to incident

Custom Webhooks

Send notifications to any HTTP endpoint.

Configuration

Field | Value
Name | Custom - SIEM
Type | webhook
URL | https://siem.company.com/api/alerts
Method | POST
Headers | {"Authorization": "Bearer ..."}
Secret | Webhook signing secret (optional)

Payload Format

Default JSON payload:

{
  "event_type": "incident_created",
  "timestamp": "2024-01-15T10:30:00Z",
  "incident": {
    "id": "uuid",
    "title": "Alert Title",
    "severity": "high",
    "source": "crowdstrike",
    "description": "...",
    "created_at": "2024-01-15T10:29:00Z"
  }
}

Webhook Signatures

If a secret is configured, Triage Warden signs webhooks with HMAC-SHA256:

X-TW-Signature: sha256=<signature>
X-TW-Timestamp: <unix_timestamp>

Verify signatures:

import hmac
import hashlib

def verify_signature(payload, signature, secret, timestamp):
    expected = hmac.new(
        secret.encode(),
        f"{timestamp}.{payload}".encode(),
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)
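
Since the signature covers the timestamp, a common companion check is rejecting stale timestamps to limit replay attacks. This freshness helper is a suggested pattern, not a documented Triage Warden requirement:

```python
import time

def is_fresh(timestamp: str, max_skew_seconds: int = 300) -> bool:
    """Reject webhooks whose X-TW-Timestamp is too far from the current time."""
    return abs(time.time() - int(timestamp)) <= max_skew_seconds

# Accept only requests that are both correctly signed and recent, e.g.:
# if verify_signature(body, sig, secret, ts) and is_fresh(ts): ...
```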

Notification Rules

Configure when and how notifications are sent.

Severity Filtering

Send only high/critical alerts to PagerDuty:

notifications:
  rules:
    - channel: pagerduty-security
      conditions:
        severity:
          - critical
          - high

Time-Based Rules

Different channels for business hours vs. after hours:

notifications:
  rules:
    - channel: slack-security
      conditions:
        hours: "09:00-17:00"
        days: ["mon", "tue", "wed", "thu", "fri"]
    - channel: pagerduty-oncall
      conditions:
        hours: "17:00-09:00"
        days: ["sat", "sun"]
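
Note that the after-hours window above wraps past midnight. A sketch of window matching that handles both same-day and overnight ranges (a hypothetical helper mirroring the `hours` syntax):

```python
from datetime import time

def in_hours_window(now: time, window: str) -> bool:
    """Check an HH:MM-HH:MM window; ranges like 17:00-09:00 wrap past midnight."""
    start_s, end_s = window.split("-")
    start = time(*[int(x) for x in start_s.split(":")])
    end = time(*[int(x) for x in end_s.split(":")])
    if start <= end:
        return start <= now < end
    # Overnight wrap: match evening OR early morning.
    return now >= start or now < end
```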

Source-Based Rules

Route by alert source:

notifications:
  rules:
    - channel: slack-phishing
      conditions:
        source: email_gateway
    - channel: slack-edr
      conditions:
        source:
          - crowdstrike
          - defender

Testing Notifications

Test via UI

  1. Go to Settings → Notifications
  2. Click Test next to any channel
  3. Check that test message arrives

Test via API

curl -X POST http://localhost:8080/api/notifications/test \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "channel_id": "uuid-of-channel",
    "message": "Test notification from Triage Warden"
  }'

Test via CLI

triage-warden notifications test --channel slack-security

Troubleshooting

Notifications Not Arriving

  1. Check channel health:

    curl http://localhost:8080/health/detailed | jq '.components.notifications'
    
  2. Verify webhook URL:

    • Test URL with curl
    • Check for firewalls or network restrictions
  3. Check logs:

    grep "notification" /var/log/triage-warden/app.log
    

Rate Limiting

If notifications are delayed:

  • Slack: 1 message per second per channel
  • PagerDuty: 120 events per minute
  • Teams: 4 messages per second

Configure rate limits:

notifications:
  rate_limits:
    slack: 1/s
    pagerduty: 2/s
    teams: 4/s

Duplicate Notifications

If receiving duplicates:

  1. Check for multiple channels targeting same destination
  2. Enable deduplication:
    notifications:
      deduplicate: true
      dedupe_window: 5m
    
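
Deduplication suppresses repeats of the same notification within the configured window. A sketch of the idea (illustrative; how Triage Warden keys notifications internally is not specified here):

```python
def deduplicate(events, window_seconds=300):
    """Drop events whose key was already emitted within the window.

    `events` is a time-ordered list of (timestamp, key) tuples.
    """
    last_emitted = {}
    emitted = []
    for ts, key in events:
        if key not in last_emitted or ts - last_emitted[key] > window_seconds:
            emitted.append((ts, key))
            last_emitted[key] = ts  # only refresh on emit, so bursts still surface eventually
    return emitted
```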

Policies Guide

Configure approval policies, guardrails, and safety rules for Triage Warden.

Overview

Policies control what actions Triage Warden can take automatically and what requires human approval. The policy engine provides:

  • Approval Requirements - Which actions need human approval
  • Guardrails - Safety limits on automated actions
  • Kill Switch - Emergency halt for all automation
  • Audit Logging - Complete action history

Policy Configuration

Policies are defined in config/guardrails.yaml or via the web UI at Settings → Policies.

Basic Structure

# config/guardrails.yaml
version: "1"

# Global settings
global:
  operation_mode: supervised  # assisted, supervised, autonomous
  kill_switch_enabled: false
  max_actions_per_incident: 10
  max_concurrent_actions: 5

# Action-specific policies
actions:
  isolate_host:
    requires_approval: true
    approval_level: high
    allowed_sources:
      - crowdstrike
      - defender

  disable_user:
    requires_approval: true
    approval_level: critical
    max_per_hour: 5

  lookup_hash:
    requires_approval: false
    rate_limit: 100/minute

# Approval rules
approvals:
  levels:
    low:
      auto_approve_after: 5m
      approvers: [analyst]
    medium:
      auto_approve_after: 30m
      approvers: [analyst, senior_analyst]
    high:
      auto_approve_after: never
      approvers: [senior_analyst, manager]
    critical:
      auto_approve_after: never
      approvers: [manager]
      require_count: 2

Operation Modes

Assisted Mode

Human-in-the-loop for all decisions:

  • All actions require explicit approval
  • AI provides recommendations only
  • Best for initial deployment and high-risk environments
global:
  operation_mode: assisted

Supervised Mode

Balanced automation with oversight:

  • Low-risk actions (lookups, enrichment) run automatically
  • Medium/high-risk actions require approval
  • Humans can intervene at any time
global:
  operation_mode: supervised

Autonomous Mode

Maximum automation:

  • Most actions run without approval
  • Only critical actions require human review
  • Use only after thorough testing
global:
  operation_mode: autonomous

Approval Levels

Configuring Approval Requirements

Each action type can have an approval requirement:

Action Type | Default Level | Typical Setting
lookup_* | none | none
send_notification | none | none
create_ticket | low | none or low
add_comment | none | none
set_severity | low | low
block_ip | high | high
isolate_host | critical | high or critical
disable_user | critical | critical

Approval Workflow

  1. Action Requested - Playbook or AI requests an action
  2. Policy Check - Engine evaluates approval requirements
  3. Queue or Execute - Action queued for approval or runs immediately
  4. Approval Decision - Approver accepts or denies
  5. Execution - Approved action executes
  6. Audit Log - All decisions recorded
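
The policy check at step 2 can be sketched as a small decision function. This mirrors the guardrails schema shown earlier in this guide but is illustrative, not the engine's actual code:

```python
def policy_decision(action: str, policy: dict) -> str:
    """Decide whether an action executes, queues for approval, or is blocked."""
    if policy.get("global", {}).get("kill_switch_enabled"):
        return "blocked"
    if action in policy.get("guardrails", {}).get("blocked_actions", []):
        return "blocked"
    rule = policy.get("actions", {}).get(action, {})
    if rule.get("requires_approval"):
        return "queued"
    return "execute"

policy = {
    "global": {"kill_switch_enabled": False},
    "guardrails": {"blocked_actions": ["delete_user"]},
    "actions": {"isolate_host": {"requires_approval": True}, "lookup_hash": {}},
}
```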

Approval Escalation

Configure escalation for unanswered approvals:

approvals:
  escalation:
    enabled: true
    rules:
      - after: 15m
        notify: [slack-security]
      - after: 30m
        notify: [pagerduty-oncall]
        escalate_to: manager
      - after: 1h
        auto_deny: true
        reason: "Approval timeout"
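
Escalation rules are evaluated against how long the approval has been pending; each rule whose threshold has elapsed fires. A sketch of that selection (illustrative, assuming only `m` and `h` suffixes as in the YAML above):

```python
def due_escalations(pending_minutes: float, rules: list) -> list:
    """Return escalation rules whose `after` threshold has already elapsed."""
    def minutes(after: str) -> float:
        if after.endswith("h"):
            return float(after[:-1]) * 60
        return float(after[:-1])  # assume a trailing "m"
    return [r for r in rules if pending_minutes >= minutes(r["after"])]

rules = [
    {"after": "15m", "notify": ["slack-security"]},
    {"after": "30m", "notify": ["pagerduty-oncall"], "escalate_to": "manager"},
    {"after": "1h", "auto_deny": True},
]
```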

Guardrails

Rate Limits

Prevent runaway automation:

guardrails:
  rate_limits:
    # Global limits
    global:
      max_actions_per_minute: 100
      max_actions_per_hour: 1000

    # Per-action limits
    isolate_host:
      max_per_hour: 10
      max_per_day: 50

    disable_user:
      max_per_hour: 5
      max_per_day: 20

Blocked Actions

Completely prevent certain actions:

guardrails:
  blocked_actions:
    - delete_user        # Never allow
    - format_disk        # Never allow
    - disable_mfa        # Too dangerous

Conditional Rules

Allow/deny based on conditions:

guardrails:
  conditional_rules:
    - action: isolate_host
      deny_if:
        - hostname_contains: "dc"      # Don't isolate domain controllers
        - hostname_contains: "prod-db" # Don't isolate production databases
        - is_server: true

    - action: disable_user
      deny_if:
        - is_admin: true               # Don't disable admins
        - is_service_account: true     # Don't disable service accounts
      require_if:
        - department: "executive"      # Extra approval for executives
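
These deny conditions are essentially predicates over the action's target attributes, any one of which blocks the action. A sketch of evaluation (condition names mirror the YAML above; the evaluator itself is hypothetical):

```python
def is_denied(target: dict, deny_if: list) -> bool:
    """Deny when any condition matches the action's target attributes."""
    for condition in deny_if:
        for key, expected in condition.items():
            if key.endswith("_contains"):
                # e.g. hostname_contains checks a substring of target["hostname"].
                field = key[: -len("_contains")]
                if expected in target.get(field, ""):
                    return True
            elif target.get(key) == expected:
                return True
    return False

deny_rules = [
    {"hostname_contains": "dc"},
    {"hostname_contains": "prod-db"},
    {"is_server": True},
]
```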

Asset Protection

Protect critical assets:

guardrails:
  protected_assets:
    hosts:
      - pattern: "dc-*"
        actions_blocked: [isolate_host, shutdown]
        reason: "Domain controllers require manual intervention"

      - pattern: "prod-*"
        require_approval: critical
        reason: "Production systems require manager approval"

    users:
      - pattern: "*@executive.company.com"
        require_approval: critical

      - pattern: "svc-*"
        actions_blocked: [disable_user, reset_password]

Kill Switch

Emergency Automation Halt

The kill switch immediately stops all automated actions:

Via UI:

  1. Go to Settings → Safety
  2. Click Activate Kill Switch
  3. Enter reason
  4. Confirm

Via API:

curl -X POST http://localhost:8080/api/kill-switch/activate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Investigating potential false positives"}'

Via CLI:

triage-warden kill-switch activate --reason "Emergency halt"

Kill Switch Effects

When active:

  • All pending actions are paused
  • New automated actions are blocked
  • Manual actions are still allowed
  • Alerts continue to be ingested
  • Enrichment continues (read-only)

Deactivating

Only users with admin or manager role can deactivate:

curl -X POST http://localhost:8080/api/kill-switch/deactivate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Issue resolved, resuming normal operations"}'

Audit Logging

What's Logged

Every action is logged with:

  • Timestamp
  • Action type
  • Target (host, user, etc.)
  • Requestor (playbook, user, AI)
  • Approver (if required)
  • Result (success, failure, denied)
  • Full context

Viewing Audit Logs

Via UI:

  • Settings → Audit Log
  • Filter by date, action type, user, result

Via API:

curl "http://localhost:8080/api/audit?action=isolate_host&from=2024-01-01" \
  -H "Authorization: Bearer $API_KEY"

Audit Retention

Configure retention in config/guardrails.yaml:

audit:
  retention_days: 365
  archive_to: s3://audit-logs-bucket/triage-warden/

Policy Testing

Dry Run Mode

Test policies without executing actions:

global:
  dry_run: true  # Log what would happen, don't execute

Policy Simulator

Test specific scenarios:

curl -X POST http://localhost:8080/api/policies/simulate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "isolate_host",
    "context": {
      "hostname": "dc-primary",
      "severity": "critical",
      "source": "crowdstrike"
    }
  }'

Response:

{
  "allowed": false,
  "reason": "Host matches protected pattern 'dc-*'",
  "would_require_approval": null,
  "matching_rules": [
    "protected_assets.hosts[0]"
  ]
}

Best Practices

1. Start Restrictive

Begin with assisted mode and strict approvals. Loosen over time as you build confidence.

2. Protect Critical Assets

Always define protected assets for:

  • Domain controllers
  • Production databases
  • Executive accounts
  • Service accounts

3. Use Approval Escalation

Don't let approvals sit forever. Configure timeouts and escalations.

4. Monitor Guardrail Hits

Alert when guardrails are triggered frequently; repeated hits may indicate:

  • Misconfiguration
  • Attack in progress
  • Need to adjust thresholds

5. Test Policy Changes

Always use dry run or simulator before deploying policy changes.

6. Keep Audit Logs

Maintain audit logs for compliance and incident review. Archive to external storage.

Example: Phishing Response Policy

Complete policy for phishing incident automation:

version: "1"

global:
  operation_mode: supervised

actions:
  # Enrichment - automatic
  lookup_url:
    requires_approval: false
    rate_limit: 100/minute

  lookup_domain:
    requires_approval: false
    rate_limit: 100/minute

  lookup_user:
    requires_approval: false
    rate_limit: 50/minute

  # Notifications - automatic
  send_notification:
    requires_approval: false

  # Containment - requires approval
  block_sender:
    requires_approval: true
    approval_level: medium
    max_per_hour: 50

  quarantine_email:
    requires_approval: true
    approval_level: low
    auto_approve_confidence: 0.95

  disable_user:
    requires_approval: true
    approval_level: critical

guardrails:
  conditional_rules:
    - action: disable_user
      deny_if:
        - is_admin: true
        - is_executive: true

    - action: quarantine_email
      auto_approve_if:
        - ai_confidence: "> 0.95"
        - virustotal_malicious: "> 5"

Default Configuration Reference

The default configuration file (config/default.yaml) contains all settings for a Triage Warden deployment. Copy this file and customize it for your environment.

Sensitive values should use environment variable interpolation: ${ENV_VAR_NAME}.
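The `${ENV_VAR_NAME}` interpolation can be sketched with a small std-only scanner. This is illustrative; the real config loader may handle escaping or nesting differently, and the resolver closure is an assumption made so the sketch stays testable:

```rust
use std::env;

// Replace every ${NAME} placeholder using `resolve`, leaving the
// placeholder intact when the variable is unset. Sketch only.
fn interpolate<F: Fn(&str) -> Option<String>>(input: &str, resolve: F) -> String {
    let mut out = String::new();
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        match rest[start..].find('}') {
            Some(end) => {
                let name = &rest[start + 2..start + end];
                match resolve(name) {
                    Some(val) => out.push_str(&val),
                    None => out.push_str(&rest[start..=start + end]),
                }
                rest = &rest[start + end + 1..];
            }
            None => {
                // unterminated placeholder: emit as-is
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    // Real usage would resolve from the process environment:
    let from_env = |name: &str| env::var(name).ok();
    let _ = interpolate("api_key: ${JIRA_API_KEY}", from_env);

    // Deterministic demo resolver:
    let fixed = |name: &str| (name == "JIRA_API_KEY").then(|| "s3cret".to_string());
    assert_eq!(interpolate("api_key: ${JIRA_API_KEY}", fixed), "api_key: s3cret");
}
```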

Operation Mode

operation_mode: supervised

Mode          Description
assisted      AI observes and suggests only, no automated actions
supervised    Low-risk actions automated, high-risk requires approval
autonomous    Full automation for configured incident types

Concurrency

max_concurrent_incidents: 50

Maximum number of incidents being processed at the same time. Increase for high-volume environments; decrease to limit resource usage.

Connectors

External service integrations. Each connector follows the same structure:

connectors:
  <connector_name>:
    connector_type: <type>
    enabled: true
    base_url: <url>
    api_key: ${API_KEY_ENV_VAR}
    api_secret: ""
    timeout_secs: 30
    settings:
      <connector-specific settings>

Common Fields

Field           Type      Description
connector_type  String    Connector implementation to use
enabled         Boolean   Whether this connector is active
base_url        String    Base URL for the service API
api_key         String    API key or username (use ${ENV_VAR})
api_secret      String    API secret or password (use ${ENV_VAR})
timeout_secs    Integer   HTTP request timeout in seconds
settings        Map       Connector-specific settings

Jira

connectors:
  jira:
    connector_type: jira
    enabled: true
    base_url: https://your-company.atlassian.net
    api_key: ${JIRA_API_KEY}
    timeout_secs: 30
    settings:
      project_key: SEC
      default_issue_type: Incident

VirusTotal

connectors:
  virustotal:
    connector_type: virustotal
    enabled: true
    base_url: https://www.virustotal.com
    api_key: ${VIRUSTOTAL_API_KEY}
    timeout_secs: 30
    settings:
      cache_ttl_secs: 3600

Splunk (SIEM)

connectors:
  splunk:
    connector_type: splunk
    enabled: true
    base_url: https://splunk.company.com:8089
    api_key: ${SPLUNK_TOKEN}
    settings:
      index: main
      earliest_time: -24h

CrowdStrike (EDR)

connectors:
  crowdstrike:
    connector_type: crowdstrike
    enabled: true
    base_url: https://api.crowdstrike.com
    api_key: ${CS_CLIENT_ID}
    api_secret: ${CS_CLIENT_SECRET}

LLM Configuration

llm:
  provider: anthropic
  model: claude-3-5-sonnet-20241022
  api_key: ${ANTHROPIC_API_KEY}
  base_url: ""
  max_tokens: 4096
  temperature: 0.1

Field        Description
provider     LLM provider: anthropic, openai, or local
model        Model identifier
api_key      API key (use ${ENV_VAR})
base_url     Custom endpoint URL for local/self-hosted models
max_tokens   Maximum tokens in LLM responses
temperature  Sampling temperature (lower = more deterministic)

Policy Configuration

policy:
  guardrails_path: config/guardrails.yaml
  default_approval_level: analyst
  auto_approve_low_risk: true
  confidence_threshold: 0.9

Field                   Description
guardrails_path         Path to the guardrails configuration file
default_approval_level  Default approval level for unknown actions (analyst, senior, manager)
auto_approve_low_risk   Whether low-risk actions can be auto-approved
confidence_threshold    Minimum AI confidence for auto-approval (0.0-1.0)

Logging Configuration

logging:
  level: info
  json_format: false
  # file_path: /var/log/triage-warden/triage-warden.log

Field        Description
level        Log level: trace, debug, info, warn, error
json_format  Use structured JSON format (recommended for production)
file_path    Optional log file path; omit to log to stdout

Database Configuration

database:
  url: sqlite://triage-warden.db?mode=rwc
  max_connections: 10
  run_migrations: true

Field            Description
url              Database connection string
max_connections  Connection pool size
run_migrations   Whether to run migrations on startup

Database URLs

Database           URL format
SQLite (dev)       sqlite://triage-warden.db?mode=rwc
PostgreSQL (prod)  postgres://user:pass@host:5432/triage_warden

API Server Configuration

api:
  port: 8080
  host: "0.0.0.0"
  enable_swagger: true
  timeout_secs: 30

Field           Description
port            TCP port to listen on
host            Bind address (0.0.0.0 for all interfaces, 127.0.0.1 for localhost only)
enable_swagger  Serve Swagger UI at /swagger-ui
timeout_secs    HTTP request timeout in seconds

Guardrails Reference

The guardrails configuration file (config/guardrails.yaml) defines security boundaries for AI-automated actions. These rules apply regardless of the current autonomy level.

Deny List

Actions and targets that are never allowed automatically.

Denied Actions

deny_list:
  actions:
    - delete_user          # Too destructive
    - wipe_host            # Too destructive
    - delete_all_emails    # Too destructive
    - modify_firewall      # High risk

Add any action name here to prevent the AI from ever executing it. These actions can still be performed manually by an analyst.

Target Patterns

Regex patterns that match protected systems. Any automated action targeting a hostname or identifier that matches these patterns requires human approval.

deny_list:
  target_patterns:
    - ".*-prod-.*"         # Production systems
    - "dc\\d+\\..*"        # Domain controllers
    - ".*-critical-.*"     # Explicitly marked critical
    - ".*\\.corp\\..*"     # Corporate infrastructure

Protected IPs

Specific IP addresses that must never be targeted by automated actions.

deny_list:
  protected_ips:
    - "10.0.0.1"           # Core router
    - "10.0.0.2"           # DNS server
    - "10.0.0.3"           # DHCP server

Protected Users

User accounts that are protected from automated modifications (disable, password reset, etc.). Supports exact matches and glob patterns.

deny_list:
  protected_users:
    - "admin"
    - "root"
    - "administrator"
    - "service-account-*"
    - "svc-*"
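Matching a user against these globs can be sketched as follows (std-only, supporting exact names and a single `*` wildcard; the real matcher may support richer glob syntax):

```rust
// Match a value against a glob with at most one '*' wildcard,
// e.g. "svc-*", "*@executive.company.com", or an exact name.
// Sketch only; not the shipped matcher.
fn glob_match(pattern: &str, value: &str) -> bool {
    match pattern.find('*') {
        None => pattern == value,
        Some(i) => {
            let (prefix, suffix) = (&pattern[..i], &pattern[i + 1..]);
            // length guard prevents prefix/suffix overlap on short values
            value.len() >= prefix.len() + suffix.len()
                && value.starts_with(prefix)
                && value.ends_with(suffix)
        }
    }
}

fn main() {
    assert!(glob_match("svc-*", "svc-backup"));
    assert!(glob_match("admin", "admin"));
    assert!(!glob_match("svc-*", "alice"));
}
```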

Rate Limits

Prevent runaway automation by capping how many times each action can be executed.

rate_limits:
  isolate_host:
    max_per_hour: 5
    max_per_day: 20
    max_concurrent: 2

  disable_user:
    max_per_hour: 10
    max_per_day: 50
    max_concurrent: 5

  block_ip:
    max_per_hour: 20
    max_per_day: 100
    max_concurrent: 10

  quarantine_email:
    max_per_hour: 50
    max_per_day: 500
    max_concurrent: 20

Field           Description
max_per_hour    Maximum executions in a rolling 60-minute window
max_per_day     Maximum executions in a rolling 24-hour window
max_concurrent  Maximum simultaneous in-flight executions
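The rolling windows can be sketched with a timestamp deque. This illustrative version takes injected timestamps (seconds) instead of wall-clock time so the logic is easy to test; it is not the shipped limiter:

```rust
use std::collections::VecDeque;

// Rolling-window counter: allows at most `max` events per `window_secs`.
// Sketch only; timestamps are injected rather than read from a clock.
struct RollingLimit {
    window_secs: u64,
    max: usize,
    events: VecDeque<u64>,
}

impl RollingLimit {
    fn new(window_secs: u64, max: usize) -> Self {
        Self { window_secs, max, events: VecDeque::new() }
    }

    // Returns true (and records the event) if the action may run at `now`.
    fn try_acquire(&mut self, now: u64) -> bool {
        // drop events that have aged out of the window
        while let Some(&t) = self.events.front() {
            if now - t >= self.window_secs { self.events.pop_front(); } else { break; }
        }
        if self.events.len() < self.max {
            self.events.push_back(now);
            true
        } else {
            false
        }
    }
}

fn main() {
    // max_per_hour: 5, as in the isolate_host example above
    let mut limit = RollingLimit::new(3600, 5);
    for i in 0..5 {
        assert!(limit.try_acquire(i));
    }
    assert!(!limit.try_acquire(10));  // sixth call inside the hour is denied
    assert!(limit.try_acquire(3700)); // old events expired, allowed again
}
```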

Approval Policies

Define when human approval is required, and at what level.

approval_policies:
  - name: critical_asset_protection
    description: "Require senior approval for actions on critical assets"
    condition:
      target_criticality:
        - critical
        - high
    requires: senior
    can_override: false

Condition Fields

Field               Type             Description
target_criticality  List of strings  Asset criticality levels that trigger this policy
action_type         List of strings  Action types that trigger this policy
confidence_below    Float (0.0-1.0)  Trigger when AI confidence is below this threshold

Approval Levels

Level    Who can approve
analyst  Any analyst
senior   Senior analyst or above
manager  SOC manager

Overridability

When can_override: true, a senior user can bypass the approval requirement. When false, the approval is mandatory and cannot be skipped.

Auto-Approve Rules

Actions that can be executed automatically when specific conditions are met, even in supervised mode.

auto_approve_rules:
  - name: ticket_operations
    description: "Auto-approve ticket creation and updates"
    action_types:
      - create_ticket
      - update_ticket
      - add_ticket_comment
    conditions:
      - confidence_above: 0.5

  - name: email_quarantine_high_confidence
    description: "Auto-approve email quarantine for high-confidence phishing"
    action_types:
      - quarantine_email
    conditions:
      - confidence_above: 0.95
      - verdict: true_positive

Condition Fields

Field             Type             Description
confidence_above  Float (0.0-1.0)  AI confidence must exceed this value
verdict           String           AI verdict must match (e.g., true_positive)

All conditions in the list must be met (AND logic).
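The AND semantics can be sketched as follows (field names mirror the YAML above; the Rust types themselves are assumptions, not the real engine):

```rust
// Evaluate an auto-approve rule's conditions with AND semantics.
// Sketch only; types are illustrative, not the shipped policy engine.
struct TriageContext {
    confidence: f64,
    verdict: &'static str,
}

enum Condition {
    ConfidenceAbove(f64),
    Verdict(&'static str),
}

// All conditions must hold for the rule to auto-approve.
fn auto_approved(conditions: &[Condition], ctx: &TriageContext) -> bool {
    conditions.iter().all(|c| match c {
        Condition::ConfidenceAbove(t) => ctx.confidence > *t,
        Condition::Verdict(v) => ctx.verdict == *v,
    })
}

fn main() {
    // email_quarantine_high_confidence from the YAML above
    let rule = [Condition::ConfidenceAbove(0.95), Condition::Verdict("true_positive")];
    let ctx = TriageContext { confidence: 0.97, verdict: "true_positive" };
    assert!(auto_approved(&rule, &ctx));

    let low = TriageContext { confidence: 0.80, verdict: "true_positive" };
    assert!(!auto_approved(&rule, &low)); // one failing condition denies the rule
}
```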

Data Policies

Control how sensitive data is handled in logs and LLM prompts.

data_policies:
  pii_filter: true
  pii_patterns:
    - "\\b\\d{3}-\\d{2}-\\d{4}\\b"      # SSN
    - "\\b\\d{16}\\b"                    # Credit card

  secrets_redaction: true
  secret_patterns:
    - "(?i)api[_-]?key"
    - "(?i)password"
    - "(?i)secret"
    - "(?i)token"
    - "(?i)credential"

  audit_data_access: true

Field              Description
pii_filter         Enable PII filtering in logs and LLM prompts
pii_patterns       Regex patterns matching PII to redact
secrets_redaction  Enable secret detection and redaction
secret_patterns    Regex patterns matching secrets to redact
audit_data_access  Log all data access operations
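The secret patterns above are regexes; as a simplified illustration of the redaction idea, here is a std-only keyword sketch. The keyword list and output format are assumptions, not the shipped behavior:

```rust
// Redact values on lines whose key contains a sensitive keyword.
// The shipped filter uses the regex patterns above; this sketch uses
// plain case-insensitive substring checks to show the idea.
const SECRET_KEYWORDS: [&str; 5] = ["api_key", "apikey", "password", "secret", "token"];

fn redact_line(line: &str) -> String {
    let lower = line.to_lowercase();
    let sensitive = SECRET_KEYWORDS.iter().any(|k| lower.contains(k));
    if sensitive {
        match line.split_once(':') {
            Some((key, _)) => format!("{}: [REDACTED]", key),
            None => "[REDACTED]".to_string(),
        }
    } else {
        line.to_string()
    }
}

fn main() {
    assert_eq!(redact_line("api_key: abc123"), "api_key: [REDACTED]");
    assert_eq!(redact_line("severity: high"), "severity: high");
}
```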

Escalation Rules

Define automatic escalation triggers.

escalation_rules:
  - name: repeated_false_positives
    description: "Escalate if same alert type has high FP rate"
    condition:
      false_positive_rate_above: 0.5
      sample_size_min: 10
    action: escalate_to_analyst

  - name: incident_correlation
    description: "Escalate if multiple related incidents detected"
    condition:
      related_incidents_above: 3
      time_window_hours: 1
    action: escalate_to_senior

  - name: critical_severity
    description: "Always escalate critical severity incidents"
    condition:
      severity: critical
    action: escalate_to_manager

Escalation Actions

Action               Description
escalate_to_analyst  Route to any available analyst
escalate_to_senior   Route to a senior analyst
escalate_to_manager  Route to the SOC manager

Integrations

Triage Warden supports integrations for identity, telemetry, enrichment, and response workflows.

SSO

Use the SSO guides to configure OIDC or SAML with your identity provider:

SSO Integration Guide

Triage Warden supports enterprise SSO through both OIDC and SAML endpoints.

Supported Flows

  • OIDC login: /auth/oidc/login
  • OIDC callback: /auth/oidc/callback
  • OIDC logout: /auth/oidc/logout
  • SAML metadata: /auth/saml/metadata
  • SAML login: /auth/saml/login
  • SAML ACS: /auth/saml/acs
  • SAML SLO: /auth/saml/slo

Common Environment Variables

  • TW_OIDC_ISSUER
  • TW_OIDC_CLIENT_ID
  • TW_OIDC_CLIENT_SECRET
  • TW_OIDC_REDIRECT_URI
  • TW_OIDC_SCOPES
  • TW_OIDC_JWKS_URI (optional override; discovery jwks_uri is used by default)
  • TW_OIDC_REQUIRE_MFA
  • TW_SSO_ROLE_MAPPING
  • TW_SSO_DEFAULT_ROLE
  • TW_SSO_AUTO_CREATE_USERS
  • TW_SAML_ENTITY_ID
  • TW_SAML_ACS_URL
  • TW_SAML_IDP_SSO_URL
  • TW_SAML_CERTIFICATE
  • TW_SAML_PRIVATE_KEY
  • TW_SAML_EXPECTED_ISSUER
  • TW_SAML_REQUIRE_MFA

Use provider-specific documents in this folder for exact values.

Security Notes

  • OIDC ID tokens are validated for issuer/audience/nonce/expiration and signature (JWKS).
  • SAML assertions enforce request correlation (InResponseTo), destination checks, signature presence, SHA-2 algorithm allow-listing, and certificate pinning checks.

Okta Setup

1. Create Application

  1. Okta Admin: Applications > Create App Integration.
  2. Choose OIDC - Web Application (recommended) or SAML 2.0.
  3. Configure sign-in redirect URI:
    • https://<your-host>/auth/oidc/callback

2. OIDC Environment Variables

  • TW_OIDC_ISSUER=https://<okta-domain>/oauth2/default
  • TW_OIDC_CLIENT_ID=<okta-client-id>
  • TW_OIDC_CLIENT_SECRET=<okta-client-secret>
  • TW_OIDC_REDIRECT_URI=https://<your-host>/auth/oidc/callback
  • TW_OIDC_SCOPES=openid,profile,email,groups
  • TW_OIDC_REQUIRE_MFA=true

3. Group to Role Mapping

Example:

  • TW_SSO_ROLE_MAPPING=okta-soc-admin=admin,okta-soc-analyst=analyst,okta-soc-viewer=viewer

4. Optional SCIM Provisioning

SCIM can be enabled on top of JIT provisioning for pre-provisioning and automated lifecycle management. JIT remains active as a fallback for first-login provisioning.

Azure AD (Microsoft Entra ID) Setup

1. Register App

  1. Microsoft Entra admin center: Applications > App registrations > New registration.
  2. Add redirect URI:
    • OIDC: https://<your-host>/auth/oidc/callback
    • SAML ACS (if using SAML): https://<your-host>/auth/saml/acs
  3. Save Application (client) ID and Directory (tenant) ID.

2. Configure OIDC in Triage Warden

Set:

  • TW_OIDC_ISSUER=https://login.microsoftonline.com/<tenant-id>/v2.0
  • TW_OIDC_CLIENT_ID=<application-client-id>
  • TW_OIDC_CLIENT_SECRET=<generated-client-secret>
  • TW_OIDC_REDIRECT_URI=https://<your-host>/auth/oidc/callback
  • TW_OIDC_SCOPES=openid,profile,email
  • TW_OIDC_REQUIRE_MFA=true (recommended)

3. Claims and Group Mapping

  1. In app Token configuration, add group claims.
  2. Map groups to roles:
    • TW_SSO_ROLE_MAPPING=SOC-Admins=admin,SOC-Analysts=analyst,SOC-Viewers=viewer

4. Conditional Access / MFA

  1. Create conditional access policy requiring MFA for the app.
  2. Keep TW_OIDC_REQUIRE_MFA=true to enforce server-side claim checks.

Google Workspace Setup

1. Create OAuth Client

  1. Google Cloud Console: configure the OAuth consent screen.
  2. Create OAuth client (Web application).
  3. Add authorized redirect URI:
    • https://<your-host>/auth/oidc/callback

2. OIDC Configuration

  • TW_OIDC_ISSUER=https://accounts.google.com
  • TW_OIDC_CLIENT_ID=<google-client-id>
  • TW_OIDC_CLIENT_SECRET=<google-client-secret>
  • TW_OIDC_REDIRECT_URI=https://<your-host>/auth/oidc/callback
  • TW_OIDC_SCOPES=openid,profile,email

3. Role Mapping

Google Workspace group claims may require Cloud Identity configuration. Use mapped group names:

  • TW_SSO_ROLE_MAPPING=tw-admins=admin,tw-analysts=analyst,tw-viewers=viewer

4. MFA

Enforce 2-Step Verification in Workspace admin policies and set:

  • TW_OIDC_REQUIRE_MFA=true

Generic OIDC/SAML Setup

OIDC Checklist

  1. Configure redirect URI: https://<host>/auth/oidc/callback.
  2. Set:
    • TW_OIDC_ISSUER
    • TW_OIDC_CLIENT_ID
    • TW_OIDC_CLIENT_SECRET
    • TW_OIDC_REDIRECT_URI
  3. Optional claim overrides:
    • TW_OIDC_EMAIL_CLAIM
    • TW_OIDC_NAME_CLAIM
    • TW_OIDC_GROUPS_CLAIM
    • TW_OIDC_ROLES_CLAIM
    • TW_OIDC_MFA_CLAIM
  4. Configure role mapping:
    • TW_SSO_ROLE_MAPPING=external_group=internal_role,...

SAML Checklist

  1. Download SP metadata from https://<host>/auth/saml/metadata.
  2. Configure IdP to POST assertions to https://<host>/auth/saml/acs.
  3. Set:
    • TW_SAML_ENTITY_ID
    • TW_SAML_ACS_URL
    • TW_SAML_IDP_SSO_URL
    • TW_SAML_CERTIFICATE
  4. Optional:
    • TW_SAML_PRIVATE_KEY (required for encrypted assertions)
    • TW_SAML_IDP_SLO_URL
    • TW_SAML_EXPECTED_ISSUER
    • TW_SAML_REQUIRE_MFA

Security Recommendations

  • Always require TLS termination.
  • Keep TW_OIDC_REQUIRE_MFA=true and TW_SAML_REQUIRE_MFA=true for privileged tenants.
  • Use least-privilege role mappings.
  • Rotate OIDC client secrets and SAML certificates regularly.

Architectural Decision Records

This directory contains Architectural Decision Records (ADRs) for Triage Warden.

What is an ADR?

An ADR is a document that captures an important architectural decision made along with its context and consequences.

ADR Index

Number  Title                                        Status    Date
001     Event Bus Architecture                       Accepted  2026-02
002     Dual Database Support (SQLite + PostgreSQL)  Accepted  2026-02
003     Credential Encryption at Rest                Accepted  2026-02
004     Session Management Strategy                  Accepted  2026-02
005     API Key Format and Security                  Accepted  2026-02
006     Operation Modes (Supervised/Autonomous)      Accepted  2026-02
007     Kill Switch Design                           Accepted  2026-02

ADR Template

New ADRs should follow this template:

# ADR-XXX: Title

## Status

Proposed | Accepted | Deprecated | Superseded

## Context

What is the issue that we're seeing that is motivating this decision or change?

## Decision

What is the change that we're proposing and/or doing?

## Consequences

What becomes easier or more difficult to do because of this change?

ADR-001: Event Bus Architecture

Status

Accepted

Context

Triage Warden needs to coordinate multiple components (enrichment, analysis, action execution, notifications) in response to security incidents. We needed a way to:

  1. Decouple components for independent development and testing
  2. Enable real-time updates to the dashboard
  3. Support both synchronous and asynchronous processing
  4. Maintain an audit trail of all system events

Decision

We implemented an in-process event bus using Tokio channels with the following design:

Event Types

All significant system events are captured as TriageEvent variants:

  • AlertReceived - New alert from webhook
  • IncidentCreated - Incident created from alert
  • EnrichmentComplete - Single enrichment finished
  • EnrichmentPhaseComplete - All enrichments done
  • AnalysisComplete - AI analysis finished
  • ActionsProposed - Response actions proposed
  • ActionApproved/Denied - Action approval decision
  • ActionExecuted - Action completed
  • StatusChanged - Incident status transition
  • TicketCreated - External ticket created
  • IncidentEscalated - Incident escalated
  • IncidentResolved - Incident resolved
  • KillSwitchActivated - Emergency stop triggered

Delivery Mechanisms

  1. Broadcast Channel: For real-time dashboard updates via SSE
  2. Named Subscribers: For component-specific processing queues
  3. Event History: In-memory buffer for recent event retrieval

Error Handling

Events are fire-and-forget with fallback logging:

  • publish() - Returns Result for cases where failure matters
  • publish_with_fallback() - Logs errors, never fails (for non-critical events)

Consequences

Positive

  • Components are loosely coupled and independently testable
  • Dashboard receives real-time updates without polling
  • Complete event history available for debugging
  • Failed subscribers don't block the main processing flow

Negative

  • In-process only - no distributed event bus
  • Event history is limited and in-memory (lost on restart)
  • No guaranteed delivery or replay capability
  • Broadcast channel has limited buffer (may drop events under load)

Future Considerations

For high-availability deployments, consider:

  • Redis Pub/Sub for distributed events
  • PostgreSQL LISTEN/NOTIFY for persistent events
  • External message queue (RabbitMQ, Kafka) for durability

ADR-002: Dual Database Support (SQLite + PostgreSQL)

Status

Accepted

Context

Triage Warden needed to support different deployment scenarios:

  1. Development/Testing: Quick setup without external dependencies
  2. Small Deployments: Single-server installations with minimal infrastructure
  3. Production: Scalable deployments with high availability requirements

We evaluated:

  • SQLite only (simple but limited scalability)
  • PostgreSQL only (powerful but heavy for small deployments)
  • Dual support (flexibility but increased complexity)

Decision

We implemented dual database support using SQLx with compile-time query verification:

Architecture

┌─────────────────────────────────────────┐
│              Application                │
├─────────────────────────────────────────┤
│           Repository Traits             │
│   (IncidentRepository, UserRepository)  │
├──────────────────┬──────────────────────┤
│  SqliteXxxRepo   │    PgXxxRepo         │
├──────────────────┼──────────────────────┤
│   SQLite Pool    │   PostgreSQL Pool    │
└──────────────────┴──────────────────────┘

Implementation

  • DbPool enum wraps both pool types
  • Each repository has SQLite and PostgreSQL implementations
  • Factory functions create the appropriate implementation based on pool type
  • Migrations are maintained separately for each database

Database Selection

Determined by DATABASE_URL environment variable:

  • sqlite:path/to/file.db → SQLite
  • postgres://user:pass@host/db → PostgreSQL
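Backend selection from the URL scheme can be sketched as follows (illustrative; constructing the actual `DbPool` variants is more involved and elided here):

```rust
// Select the database backend from the DATABASE_URL scheme.
// Sketch only; real code would build the matching connection pool.
#[derive(Debug, PartialEq)]
enum Backend {
    Sqlite,
    Postgres,
}

fn backend_for(url: &str) -> Option<Backend> {
    if url.starts_with("sqlite:") {
        Some(Backend::Sqlite)
    } else if url.starts_with("postgres://") || url.starts_with("postgresql://") {
        Some(Backend::Postgres)
    } else {
        None
    }
}

fn main() {
    assert_eq!(backend_for("sqlite:triage-warden.db"), Some(Backend::Sqlite));
    assert_eq!(backend_for("postgres://tw:pw@localhost/tw"), Some(Backend::Postgres));
    assert_eq!(backend_for("mysql://nope"), None);
}
```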

Consequences

Positive

  • Zero-config development with SQLite
  • Production-ready PostgreSQL support
  • Same API regardless of database backend
  • Compile-time query verification for both backends

Negative

  • Duplicate migration files
  • Some features may have different behavior (e.g., JSON querying)
  • More complex testing matrix
  • Cannot use PostgreSQL-specific features (CTEs, window functions) without SQLite equivalents

Trade-offs

Feature             SQLite      PostgreSQL
Setup complexity    None        Requires server
Concurrent writes   Limited     Excellent
JSON indexing       Basic       JSONB with GIN
Full-text search    Limited     Excellent
Connection pooling  In-process  Network
Backup              File copy   pg_dump

ADR-003: Credential Encryption at Rest

Status

Accepted

Context

Triage Warden stores sensitive credentials for external integrations:

  • API keys for threat intelligence services (VirusTotal, etc.)
  • OAuth tokens for cloud services (Microsoft, Google)
  • Webhook secrets for SIEM integrations
  • SMTP credentials for email notifications

These credentials must be protected at rest in the database.

Decision

We implemented AES-256-GCM encryption for sensitive fields:

Encryption Scheme

  • Algorithm: AES-256-GCM (authenticated encryption)
  • Key Derivation: HKDF from master key + unique salt per value
  • Nonce: 96-bit random nonce per encryption
  • Storage Format: Base64(nonce || ciphertext || auth_tag)

Key Management

ENCRYPTION_KEY (env var)
        │
        ▼
    HKDF-SHA256
        │
    ┌───┴───┐
    │ Salt  │ (per-value, stored with ciphertext)
    └───┬───┘
        ▼
   Derived Key
        │
        ▼
   AES-256-GCM

Implementation

pub trait CredentialEncryptor: Send + Sync {
    fn encrypt(&self, plaintext: &str) -> Result<String, EncryptionError>;
    fn decrypt(&self, ciphertext: &str) -> Result<String, EncryptionError>;
}

Two implementations:

  • Aes256GcmEncryptor - Production encryption
  • NoOpEncryptor - Development mode (disabled encryption)

Encrypted Fields

Table                  Field                 Contains
connectors             config.api_key        API keys
connectors             config.client_secret  OAuth secrets
settings               llm.api_key           LLM provider API key
notification_channels  config.webhook_url    Webhook URLs with tokens

Consequences

Positive

  • Credentials protected if database is compromised
  • Authenticated encryption prevents tampering
  • Per-value salt prevents rainbow table attacks
  • Key rotation possible without re-encrypting all values

Negative

  • Cannot search encrypted fields
  • Master key must be securely managed
  • Performance overhead for encryption/decryption
  • Key loss = data loss (no recovery without key)

Security Considerations

  1. Key Storage: Use environment variable or secrets manager
  2. Key Rotation: Implement key versioning for rotation
  3. Audit: Log all decryption operations
  4. Memory: Clear sensitive data from memory after use

ADR-004: Session Management Strategy

Status

Accepted

Context

The dashboard requires user authentication with session management. We needed to decide between:

  1. JWT tokens (stateless)
  2. Server-side sessions (stateful)
  3. Hybrid approach

Requirements:

  • Secure authentication for web dashboard
  • Support for session revocation
  • CSRF protection for form submissions
  • Reasonable session lifetime

Decision

We chose server-side sessions stored in the database using tower-sessions:

Session Architecture

Browser                          Server
   │                                │
   │  POST /auth/login              │
   │  (username, password)          │
   ├───────────────────────────────►│
   │                                │ Validate credentials
   │                                │ Create session in DB
   │  Set-Cookie: id=session_id     │
   │◄───────────────────────────────┤
   │                                │
   │  GET /dashboard                │
   │  Cookie: id=session_id         │
   ├───────────────────────────────►│
   │                                │ Load session from DB
   │                                │ Verify not expired
   │  200 OK                        │
   │◄───────────────────────────────┤

Session Storage

Sessions are stored in the sessions table:

Column       Type     Description
id           TEXT     Session ID (secure random)
data         BLOB     Encrypted session data
expiry_date  INTEGER  Unix timestamp

Session Data

struct SessionData {
    user_id: Uuid,
    username: String,
    role: UserRole,
    login_csrf: String,  // CSRF token for sensitive actions
}

Security Measures

  1. Secure Cookies: HttpOnly, Secure (in production), SameSite=Lax
  2. CSRF Protection: Token in session, validated on state-changing requests
  3. Session Expiry: 24-hour default, configurable
  4. Rotation: New session ID on privilege changes

Consequences

Positive

  • Sessions can be revoked immediately
  • No token size limits for session data
  • CSRF tokens integrated naturally
  • Easy to implement "logout all devices"

Negative

  • Database read on every authenticated request
  • Session table requires cleanup (expired sessions)
  • Horizontal scaling requires shared database
  • Slightly higher latency than JWTs

Comparison with JWTs

Aspect       Sessions               JWTs
Revocation   Immediate              Requires blacklist
Storage      Server                 Client
Scalability  Requires shared store  Stateless
Size         Cookie only            Full payload
Security     Keys in DB             Signature verification

ADR-005: API Key Format and Security

Status

Accepted

Context

Triage Warden exposes a REST API that needs programmatic authentication. We needed to design an API key format that is:

  1. Secure against brute-force attacks
  2. Easily identifiable (for revocation)
  3. User-friendly for debugging
  4. Compatible with common tooling

Decision

We adopted a prefixed API key format similar to GitHub and Stripe:

Key Format

tw_<user_prefix>_<random_secret>

Example: tw_abc12345_9f8e7d6c5b4a3210fedcba9876543210

Components:

  • tw_ - Application prefix (identifies Triage Warden keys)
  • <user_prefix> - First 8 chars for identification (stored in DB)
  • <random_secret> - 32 bytes of cryptographic randomness

Storage

Only the hash is stored, never the raw key:

Column      Value
key_prefix  tw_abc12345 (for lookup)
key_hash    SHA-256(full_key)

Authentication Flow

1. Extract key from Authorization header
2. Parse prefix (first 11 chars)
3. Look up by prefix in database
4. Compute SHA-256 of provided key
5. Compare with stored hash (constant-time)
6. Check expiration and scopes
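Step 5's constant-time comparison can be sketched as follows. Production code would typically use a vetted crate (e.g. `subtle`); this hand-rolled version is only illustrative:

```rust
// Constant-time equality: XOR every byte pair and accumulate the
// differences, so timing does not depend on where the inputs differ.
// Sketch only; prefer a vetted constant-time crate in production.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

fn main() {
    // comparing stored vs. computed key hashes (hex strings as bytes)
    assert!(constant_time_eq(b"9f8e7d6c", b"9f8e7d6c"));
    assert!(!constant_time_eq(b"9f8e7d6c", b"9f8e7d6d"));
}
```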

Key Generation

use rand::Rng;
use sha2::{Sha256, Digest};

fn generate_api_key(user_id: Uuid) -> (String, String, String) {
    let secret: [u8; 32] = rand::thread_rng().gen();
    let secret_hex = hex::encode(secret);

    let prefix = format!("tw_{}", &user_id.to_string()[..8]);
    let full_key = format!("{}_{}", prefix, secret_hex);
    let key_hash = hex::encode(Sha256::digest(full_key.as_bytes()));

    (full_key, prefix, key_hash)  // Return key once, store prefix + hash
}

Consequences

Positive

  • Keys are identifiable without exposing secrets
  • Prefix enables efficient database lookup
  • Format is familiar to developers
  • Hash storage protects against database leaks
  • Constant-time comparison prevents timing attacks

Negative

  • Keys must be stored securely by users (cannot be recovered)
  • Prefix lookup could reveal key existence (minor info leak)
  • Longer keys than simple tokens

Security Properties

Property    Implementation
Entropy     256 bits (32 random bytes)
Storage     SHA-256 hash only
Comparison  Constant-time
Revocation  Delete from database
Expiration  Optional expiry_at field
Scopes      JSON array of allowed operations

ADR-006: Operation Modes (Supervised/Autonomous)

Status

Accepted

Context

Security automation involves a trust spectrum from fully manual to fully autonomous. Organizations have different risk tolerances and regulatory requirements. We needed to support:

  1. Organizations starting with automation (cautious)
  2. Mature SOCs ready for autonomous response
  3. Gradual transition between modes
  4. Compliance with approval requirements

Decision

We implemented three operation modes configurable at the system level:

Modes

| Mode            | Description                                             | Default Approval |
|-----------------|---------------------------------------------------------|------------------|
| supervised      | All actions require human approval                      | require_approval |
| semi_autonomous | Low-risk actions auto-approved, high-risk need approval | policy-based     |
| autonomous      | Actions auto-approved unless policy denies              | auto_approve     |
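The mapping from mode to default decision can be sketched as below. Type and function names are illustrative, not the actual implementation; the risk classification of an action is assumed to come from policy metadata.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum OperationMode {
    Supervised,
    SemiAutonomous,
    Autonomous,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Decision {
    RequireApproval,
    AutoApprove,
}

/// Default decision, applied only when no explicit policy matched.
fn mode_default(mode: OperationMode, high_risk: bool) -> Decision {
    match mode {
        // supervised: everything waits for a human
        OperationMode::Supervised => Decision::RequireApproval,
        // semi_autonomous: only low-risk actions proceed automatically
        OperationMode::SemiAutonomous if high_risk => Decision::RequireApproval,
        OperationMode::SemiAutonomous => Decision::AutoApprove,
        // autonomous: proceed unless a policy denied earlier
        OperationMode::Autonomous => Decision::AutoApprove,
    }
}

fn main() {
    assert_eq!(mode_default(OperationMode::Supervised, false), Decision::RequireApproval);
    assert_eq!(mode_default(OperationMode::SemiAutonomous, true), Decision::RequireApproval);
    assert_eq!(mode_default(OperationMode::SemiAutonomous, false), Decision::AutoApprove);
    assert_eq!(mode_default(OperationMode::Autonomous, true), Decision::AutoApprove);
}
```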

Mode Selection Flow

Incoming Action
      │
      ▼
┌──────────────────┐
│ Check Kill Switch│
└────────┬─────────┘
         │ (not active)
         ▼
┌──────────────────┐
│ Evaluate Policies│
└────────┬─────────┘
         │
    ┌────┴─────┐
    │ Explicit │
    │ Policy?  │
    └────┬─────┘
    Yes  │  No
    │    │
    │    ▼
    │ ┌─────────────────┐
    │ │ Apply Mode      │
    │ │ Default         │
    │ └────────┬────────┘
    │          │
    └────┬─────┘
         │
         ▼
   Final Decision

Policy Override

Policies can override mode defaults:

policies:
  - name: "Block critical IPs always requires approval"
    condition: "action.type == 'block_ip' && target.is_critical"
    action: "require_approval"
    approval_level: "manager"

  - name: "Low severity lookups auto-approved"
    condition: "action.type == 'lookup' && incident.severity in ['info', 'low']"
    action: "auto_approve"

Configuration

# config.yaml
general:
  mode: "supervised"  # supervised | semi_autonomous | autonomous

Or via API:

curl -X PUT /api/settings/general \
  -d '{"mode": "semi_autonomous"}'

Consequences

Positive

  • Flexible for different organizational needs
  • Gradual automation adoption path
  • Policies provide fine-grained control
  • Easy to fall back to supervised mode

Negative

  • More complex decision logic
  • Potential for misconfiguration
  • Requires clear documentation of behavior
  • Audit trails must capture mode at decision time

Mode Comparison

| Scenario         | Supervised      | Semi-Auto       | Autonomous        |
|------------------|-----------------|-----------------|-------------------|
| Block malware IP | Approval needed | Auto-approved   | Auto-approved     |
| Disable user     | Approval needed | Approval needed | Auto-approved     |
| Isolate host     | Approval needed | Approval needed | Approval (policy) |
| Lookup IOC       | Approval needed | Auto-approved   | Auto-approved     |

ADR-007: Kill Switch Design

Status

Accepted

Context

Autonomous security response systems pose risks if they malfunction:

  1. False positives could disable legitimate users/systems
  2. Bugs could trigger cascading actions
  3. Compromised AI could be weaponized
  4. External events may require immediate halt

We needed an emergency stop mechanism that is:

  • Fast to activate (< 1 second)
  • Globally effective
  • Difficult to accidentally trigger
  • Easy to recover from

Decision

We implemented a global kill switch with the following design:

Architecture

                    ┌─────────────┐
                    │ Kill Switch │
                    │   State     │
                    └──────┬──────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Orchestrator  │  │ Policy Engine │  │ Action Runner │
│               │  │               │  │               │
│ check()       │  │ check()       │  │ check()       │
│ before        │  │ before        │  │ before        │
│ processing    │  │ evaluation    │  │ execution     │
└───────────────┘  └───────────────┘  └───────────────┘

State

pub struct KillSwitchStatus {
    pub active: bool,
    pub reason: Option<String>,
    pub activated_by: Option<String>,
    pub activated_at: Option<DateTime<Utc>>,
}

Check Points

The kill switch is checked at multiple points:

  1. Alert Processing: Before creating incidents from alerts
  2. Policy Evaluation: Before evaluating approval policies
  3. Action Execution: Before executing any response action
  4. Playbook Execution: Before running playbook stages
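A minimal in-memory guard for these check points might look like the following sketch. The real switch also records reason, actor, and timestamp (see the `KillSwitchStatus` struct above); only the boolean gate is shown here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal global kill switch: a single atomic flag that every
/// component consults before doing work.
pub struct KillSwitch {
    active: AtomicBool,
}

impl KillSwitch {
    pub fn new() -> Self {
        Self { active: AtomicBool::new(false) }
    }

    pub fn activate(&self) {
        self.active.store(true, Ordering::SeqCst);
    }

    pub fn deactivate(&self) {
        self.active.store(false, Ordering::SeqCst);
    }

    /// Called at each check point; the caller aborts if this returns true.
    pub fn is_active(&self) -> bool {
        self.active.load(Ordering::SeqCst)
    }
}

fn main() {
    let ks = KillSwitch::new();
    assert!(!ks.is_active());
    ks.activate();
    assert!(ks.is_active());
    ks.deactivate();
    assert!(!ks.is_active());
}
```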

Activation

// Via API
POST /api/kill-switch/activate
{
    "reason": "Investigating false positive surge",
    "activated_by": "admin@example.com"
}

// Via CLI
tw-cli kill-switch activate --reason "Emergency maintenance"

// Programmatic
kill_switch.activate("Anomaly detected", "system").await;

Deactivation

// Via API
POST /api/kill-switch/deactivate
{
    "reason": "Issue resolved"
}

// Only admins can deactivate

Event Notification

Activation triggers:

  • KillSwitchActivated event to all subscribers
  • Dashboard alert banner
  • Notification to configured channels
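The fan-out to subscribers can be sketched with plain channels. This is illustrative only: the `EventBus` type and `Event` enum are assumptions, and a real async implementation would more likely use a broadcast channel.

```rust
use std::sync::mpsc;

#[derive(Clone, Debug, PartialEq)]
enum Event {
    KillSwitchActivated { reason: String },
}

/// Hypothetical fan-out bus: each subscriber owns a receiver, and
/// publishing clones the event to every registered sender.
struct EventBus {
    subscribers: Vec<mpsc::Sender<Event>>,
}

impl EventBus {
    fn new() -> Self {
        Self { subscribers: Vec::new() }
    }

    fn subscribe(&mut self) -> mpsc::Receiver<Event> {
        let (tx, rx) = mpsc::channel();
        self.subscribers.push(tx);
        rx
    }

    fn publish(&self, event: Event) {
        for tx in &self.subscribers {
            // Ignore subscribers that have dropped their receiver
            let _ = tx.send(event.clone());
        }
    }
}

fn main() {
    let mut bus = EventBus::new();
    let rx = bus.subscribe();
    bus.publish(Event::KillSwitchActivated { reason: "test".into() });
    let Event::KillSwitchActivated { reason } = rx.recv().unwrap();
    assert_eq!(reason, "test");
}
```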

Consequences

Positive

  • Immediate halt of all automation
  • Clear audit trail of activation/deactivation
  • Multiple activation methods (UI, API, CLI)
  • Visible status in all interfaces

Negative

  • In-memory state (lost on restart, resets to inactive)
  • No automatic activation triggers yet
  • Single global switch (no per-action granularity)
  • Requires admin access to deactivate

Future Enhancements

  1. Persistent State: Store kill switch state in database
  2. Auto-Activation: Trigger on anomaly detection
  3. Scoped Switches: Per-action-type or per-connector switches
  4. Scheduled Deactivation: Auto-deactivate after timeout
  5. Two-Person Rule: Require multiple admins for deactivation

Operational Procedures

When kill switch is activated:

  1. All pending actions remain pending
  2. New alerts create incidents but stop at enrichment
  3. Dashboard shows prominent warning banner
  4. Existing approved actions are NOT rolled back

To recover:

  1. Investigate root cause
  2. Fix underlying issue
  3. Deactivate kill switch
  4. Manually review pending actions
  5. Resume normal operations

Production Deployment

This section covers deploying Triage Warden in production environments.

Deployment Options

Triage Warden can be deployed in several ways:

  • Docker - Recommended for most deployments. Quick setup with Docker Compose.
  • Kubernetes - For orchestrated, scalable deployments using raw manifests.
  • Helm Chart - Recommended for Kubernetes. Templated deployment with environment-specific values.
  • Binary - Direct binary installation on Linux servers.

Before You Deploy

Before deploying to production, review:

  1. Production Checklist - Security and configuration requirements
  2. Configuration Reference - All environment variables and settings
  3. Database Setup - PostgreSQL configuration for production
  4. Security Hardening - TLS, secrets, network policies
  5. Scaling - Horizontal scaling considerations

Quick Start

For a quick production deployment with Docker:

# Clone the repository
git clone https://github.com/your-org/triage-warden.git
cd triage-warden/deploy/docker

# Configure environment
cp .env.example .env
# Edit .env with your settings

# Generate encryption key
echo "TW_ENCRYPTION_KEY=$(openssl rand -base64 32)" >> .env

# Start services
docker compose -f docker-compose.prod.yml up -d

Architecture Overview

A typical production deployment includes:

                    ┌─────────────────┐
                    │  Load Balancer  │
                    │   (TLS term.)   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
        ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
        │  Triage   │  │  Triage   │  │  Triage   │
        │  Warden   │  │  Warden   │  │  Warden   │
        │ Instance 1│  │ Instance 2│  │ Instance 3│
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              └──────────────┼──────────────┘
                             │
                    ┌────────▼────────┐
                    │   PostgreSQL    │
                    │   (Primary)     │
                    └─────────────────┘

Support

For deployment assistance:

Production Checklist

Complete this checklist before deploying Triage Warden to production.

Security Requirements

Authentication & Secrets

  • Encryption key configured: Set TW_ENCRYPTION_KEY with a 32-byte base64-encoded key

    # Generate a secure key
    openssl rand -base64 32
    
  • JWT secret configured: Set TW_JWT_SECRET with a strong random value

    openssl rand -hex 32
    
  • Session secret configured: Set TW_SESSION_SECRET for session encryption

  • Default admin password changed: Change the default admin credentials immediately after first login

  • API keys use scoped permissions: Don't create API keys with * scope in production

Network Security

  • TLS enabled: All traffic should use HTTPS
  • TLS certificates valid: Use certificates from a trusted CA (not self-signed)
  • Internal traffic encrypted: Database connections use TLS
  • Firewall rules configured: Only expose necessary ports (443 for HTTPS)
  • Rate limiting enabled: Protect against brute force attacks

Database Security

  • PostgreSQL in production: Don't use SQLite for production workloads
  • Database user has minimal permissions: Use a dedicated user, not superuser
  • Database connections encrypted: Enable sslmode=require or verify-full
  • Regular backups configured: Automated daily backups with tested restore procedure

Configuration Requirements

Required Environment Variables

| Variable          | Description                                  | Example                                                      |
|-------------------|----------------------------------------------|--------------------------------------------------------------|
| DATABASE_URL      | PostgreSQL connection string                 | postgres://user:pass@host:5432/triage_warden?sslmode=require |
| TW_ENCRYPTION_KEY | Credential encryption key (32 bytes, base64) | K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72...                          |
| TW_JWT_SECRET     | JWT signing secret                           | your-256-bit-secret                                          |
| TW_SESSION_SECRET | Session encryption secret                    | another-secret-value                                         |
| RUST_LOG          | Log level                                    | info or triage_warden=debug                                  |

Optional Environment Variables

| Variable            | Description               | Default                    |
|---------------------|---------------------------|----------------------------|
| TW_BIND_ADDRESS     | Server bind address       | 0.0.0.0:8080               |
| TW_BASE_URL         | Public URL for callbacks  | https://triage.example.com |
| TW_TRUSTED_PROXIES  | Comma-separated proxy IPs | None                       |
| TW_MAX_REQUEST_SIZE | Maximum request body size | 10MB                       |

LLM Configuration (if using AI features)

  • LLM API key configured: Set via UI or environment variable
  • Rate limits configured: Prevent runaway API costs
  • Model selected appropriately: Balance cost vs. capability

Infrastructure Requirements

Minimum Hardware

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU       | 2 cores | 4 cores     |
| RAM       | 2 GB    | 4 GB        |
| Storage   | 20 GB   | 50 GB SSD   |

Database Requirements

| Metric             | Minimum | Recommended |
|--------------------|---------|-------------|
| PostgreSQL Version | 14      | 15+         |
| Connections        | 20      | 50+         |
| Storage            | 10 GB   | 50 GB+      |

Network Requirements

  • Outbound HTTPS (443) to:
    • LLM provider (api.openai.com, api.anthropic.com)
    • Configured connectors (VirusTotal, Jira, etc.)
  • Inbound HTTPS (443) from:
    • Users accessing the dashboard
    • Webhook sources (SIEM, EDR systems)

Monitoring & Observability

Health Checks

  • Health endpoint accessible: GET /health returns component status
  • Readiness probe configured: GET /ready for load balancer
  • Liveness probe configured: GET /live for container orchestration

Metrics & Logging

  • Prometheus metrics exposed: GET /metrics endpoint enabled
  • Log aggregation configured: Logs shipped to central system
  • Alerting rules configured: Alerts for critical failures

Recommended alert rules:

| Alert                      | Condition                        | Severity |
|----------------------------|----------------------------------|----------|
| Service Down               | /health returns unhealthy for 5m | Critical |
| Database Connection Failed | Database component unhealthy     | Critical |
| Kill Switch Active         | Kill switch activated            | Warning  |
| High Error Rate            | >5% HTTP 5xx responses           | Warning  |
| Connector Unhealthy        | Any connector in error state     | Warning  |
| LLM API Errors             | LLM requests failing             | Warning  |

Operational Readiness

Documentation

  • Runbooks available: Team has access to operational runbooks
  • Contact list current: On-call rotation and escalation paths defined
  • Recovery procedures tested: Backup restore verified within last 30 days

Access Control

  • Admin accounts audited: Remove unnecessary admin users
  • API keys audited: Revoke unused or over-privileged keys
  • Audit logging enabled: User actions are logged

Backup & Recovery

  • Database backups automated: Daily backups with 30-day retention
  • Backup encryption enabled: Backups encrypted at rest
  • Recovery time objective defined: Team knows target RTO
  • Recovery procedure documented: Step-by-step restore guide exists

Pre-Launch Testing

Functional Tests

  • User login works with configured auth
  • Incidents can be created via webhook
  • Playbooks execute correctly
  • Connectors authenticate successfully
  • Notifications are delivered

Load Testing

  • Tested with expected concurrent users
  • Tested with expected webhook volume
  • Response times acceptable under load

Failover Testing

  • Application recovers from database restart
  • Application handles LLM API failures gracefully
  • Kill switch stops all automation when activated

Sign-Off

| Role              | Name | Date | Signature |
|-------------------|------|------|-----------|
| Security Review   |      |      |           |
| Operations Review |      |      |           |
| Development Lead  |      |      |           |

Quick Validation Commands

# Check health endpoint
curl -s https://triage.example.com/health | jq

# Verify TLS certificate
openssl s_client -connect triage.example.com:443 -servername triage.example.com

# Test database connectivity (from application)
curl -s https://triage.example.com/health/detailed | jq '.components.database'

# Verify all connectors healthy
curl -s https://triage.example.com/health/detailed | jq '.components.connectors'

Docker Deployment

Deploy Triage Warden using Docker and Docker Compose.

Prerequisites

  • Docker Engine 20.10+
  • Docker Compose v2.0+
  • 2 GB RAM minimum for the basic setup (4 GB+ for HA)
  • 20 GB disk space

Overview

Triage Warden provides three Docker Compose configurations:

| File                   | Purpose           | Use Case                          |
|------------------------|-------------------|-----------------------------------|
| docker-compose.yml     | Basic setup       | Quick start, single instance      |
| docker-compose.dev.yml | Development       | Local development with hot reload |
| docker-compose.ha.yml  | High Availability | HA testing, multi-instance        |

Quick Start

# Clone the repository
git clone https://github.com/your-org/triage-warden.git
cd triage-warden/deploy/docker

# Copy and configure environment
cp .env.example .env

# Generate required secrets
echo "TW_ENCRYPTION_KEY=$(openssl rand -base64 32)" >> .env
echo "TW_JWT_SECRET=$(openssl rand -hex 32)" >> .env
echo "TW_SESSION_SECRET=$(openssl rand -hex 32)" >> .env
echo "POSTGRES_PASSWORD=$(openssl rand -hex 16)" >> .env

# Start services
docker compose up -d

# Check status
docker compose ps
docker compose logs -f triage-warden

Access the dashboard at http://localhost:8080

Default credentials: admin / admin (change immediately!)

Configuration

Environment Variables

Edit .env file with your configuration:

# Database
POSTGRES_USER=triage_warden
POSTGRES_PASSWORD=your-secure-password
POSTGRES_DB=triage_warden
DATABASE_URL=postgres://triage_warden:your-secure-password@postgres:5432/triage_warden

# Application
TW_BIND_ADDRESS=0.0.0.0:8080
TW_BASE_URL=https://triage.example.com
TW_ENCRYPTION_KEY=your-32-byte-base64-key
TW_JWT_SECRET=your-jwt-secret
TW_SESSION_SECRET=your-session-secret

# Logging
RUST_LOG=info

# LLM (optional)
OPENAI_API_KEY=sk-...
# or
ANTHROPIC_API_KEY=sk-ant-...

Production Configuration

For production, use docker-compose.prod.yml:

docker compose -f docker-compose.prod.yml up -d

Key differences from development:

  • Uses external PostgreSQL volume for data persistence
  • Enables health checks
  • Sets resource limits
  • Configures restart policies

Docker Compose Files

Basic Setup (docker-compose.yml)

version: '3.8'

services:
  triage-warden:
    image: ghcr.io/your-org/triage-warden:latest
    ports:
      - "8080:8080"
    environment:
      - DATABASE_URL=${DATABASE_URL}
      - TW_ENCRYPTION_KEY=${TW_ENCRYPTION_KEY}
      - TW_JWT_SECRET=${TW_JWT_SECRET}
      - TW_SESSION_SECRET=${TW_SESSION_SECRET}
      - RUST_LOG=${RUST_LOG:-info}
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 5s
      timeout: 5s
      retries: 5

volumes:
  postgres_data:

Production (docker-compose.prod.yml)

version: '3.8'

services:
  triage-warden:
    image: ghcr.io/your-org/triage-warden:latest
    ports:
      - "8080:8080"
    environment:
      - DATABASE_URL=${DATABASE_URL}
      - TW_ENCRYPTION_KEY=${TW_ENCRYPTION_KEY}
      - TW_JWT_SECRET=${TW_JWT_SECRET}
      - TW_SESSION_SECRET=${TW_SESSION_SECRET}
      - TW_BASE_URL=${TW_BASE_URL}
      - RUST_LOG=${RUST_LOG:-info}
    depends_on:
      postgres:
        condition: service_healthy
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/live"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d:ro
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

volumes:
  postgres_data:
    external: true
    name: triage_warden_postgres

High Availability Testing

The HA configuration runs multiple instances for testing distributed features locally before deploying to Kubernetes.

Architecture

                    ┌─────────────┐
                    │   Traefik   │
                    │   (LB)      │
                    └──────┬──────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼─────┐  ┌──────▼─────┐  ┌──────▼──────┐
    │   API-1    │  │   API-2    │  │   API-N     │
    │  (serve)   │  │  (serve)   │  │  (serve)    │
    └──────┬─────┘  └──────┬─────┘  └──────┬──────┘
           │               │               │
           └───────────────┼───────────────┘
                           │
    ┌──────────────────────┼──────────────────────┐
    │                      │                      │
    ▼                      ▼                      ▼
┌───────────┐       ┌────────────┐        ┌────────────┐
│   Redis   │◄─────►│ PostgreSQL │◄──────►│Orchestrator│
│ (MQ/Cache)│       │    (DB)    │        │ (1 leader) │
└───────────┘       └────────────┘        └────────────┘

Starting HA Stack

# Navigate to deploy directory
cd deploy/docker

# Configure environment
cp .env.example .env
# Edit .env with required values

# Start all services
docker-compose -f docker-compose.ha.yml up -d

# Start with monitoring stack
docker-compose -f docker-compose.ha.yml --profile monitoring up -d

Accessing Services

| Service             | URL                   | Description                       |
|---------------------|-----------------------|-----------------------------------|
| API (Load Balanced) | http://localhost:8080 | Main application endpoint         |
| Traefik Dashboard   | http://localhost:8081 | Load balancer metrics             |
| Prometheus          | http://localhost:9090 | Metrics (with monitoring profile) |
| Grafana             | http://localhost:3000 | Dashboards (admin/admin)          |
| PostgreSQL          | localhost:5432        | Database (for debugging)          |
| Redis               | localhost:6379        | Cache/MQ (for debugging)          |

Verifying HA Behavior

# Check all instances are healthy
curl -s http://localhost:8080/health | jq

# Check load balancing (run multiple times)
for i in {1..10}; do
  curl -s http://localhost:8080/health | jq -r '.instance_id // "unknown"'
done

# Check leader election
curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'

# Simulate failure - stop one API instance
docker stop tw-api-1

# Verify traffic still flows
curl -s http://localhost:8080/health

# Restart the instance
docker start tw-api-1

Testing Orchestrator Failover

# Check which orchestrator is leader
docker exec tw-orchestrator-1 curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'
docker exec tw-orchestrator-2 curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'

# Stop the leader
docker stop tw-orchestrator-1

# Verify failover (second orchestrator becomes leader)
sleep 5
docker exec tw-orchestrator-2 curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'

# Restart original
docker start tw-orchestrator-1

Building the Image

To build the Docker image locally:

# From repository root
docker build -t triage-warden:local -f deploy/docker/Dockerfile .

# Build with no cache
docker-compose -f docker-compose.ha.yml build --no-cache

# Build specific service
docker-compose -f docker-compose.ha.yml build api-1

# Use local image
# In docker-compose.yml, change:
# image: ghcr.io/your-org/triage-warden:latest
# to:
# image: triage-warden:local

Persistent Storage

Volume Management

# List volumes
docker volume ls | grep triage-warden

# Backup PostgreSQL
docker exec tw-postgres pg_dump -U triage triage_warden > backup.sql

# Restore PostgreSQL
cat backup.sql | docker exec -i tw-postgres psql -U triage triage_warden

# Backup Redis
docker exec tw-redis redis-cli BGSAVE
docker cp tw-redis:/data/dump.rdb ./redis-backup.rdb

Cleaning Up

# Stop services
docker-compose -f docker-compose.ha.yml down

# Stop and remove volumes (WARNING: deletes all data)
docker-compose -f docker-compose.ha.yml down -v

# Remove only unused volumes
docker volume prune

Common Operations

View Logs

# All services
docker compose logs -f

# Specific service
docker compose logs -f triage-warden

# Last 100 lines
docker compose logs --tail=100 triage-warden

# With timestamps
docker-compose -f docker-compose.ha.yml logs -f --timestamps

Restart Services

# Restart all
docker compose restart

# Restart specific service
docker compose restart triage-warden

Update to New Version

# Pull new images
docker compose pull

# Recreate containers
docker compose up -d

# Verify update
docker compose ps
curl http://localhost:8080/health | jq '.version'

Database Operations

# Create backup
docker compose exec postgres pg_dump -U triage_warden triage_warden > backup.sql

# Restore backup
docker compose exec -T postgres psql -U triage_warden triage_warden < backup.sql

# Access database shell
docker compose exec postgres psql -U triage_warden triage_warden

Debug Mode

Enable debug logging:

# In .env file
RUST_LOG=debug,triage_warden=trace,tw_api=trace,tw_core=trace
TW_LOG_FORMAT=pretty  # Human-readable format

Inspecting Containers

# Shell access
docker exec -it tw-api-1 /bin/sh

# Check process status
docker exec tw-api-1 ps aux

# Check network connectivity
docker exec tw-api-1 curl -v telnet://postgres:5432
docker exec tw-api-1 curl -v telnet://redis:6379

Resource Limits

The HA configuration includes resource limits suitable for local testing:

| Service      | CPU Limit | Memory Limit |
|--------------|-----------|--------------|
| API          | 1 core    | 512MB        |
| Orchestrator | 1.5 cores | 1GB          |
| PostgreSQL   | 1 core    | 1GB          |
| Redis        | 0.5 core  | 512MB        |
| Traefik      | 0.5 core  | 256MB        |

Adjust in docker-compose.ha.yml under deploy.resources.

TLS Configuration

For production, use a reverse proxy (nginx, Traefik, Caddy) for TLS termination:

With Traefik

# Add to docker-compose.prod.yml
services:
  traefik:
    image: traefik:v2.10
    command:
      - "--providers.docker=true"
      - "--entrypoints.websecure.address=:443"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
    ports:
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - letsencrypt:/letsencrypt

  triage-warden:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.triage.rule=Host(`triage.example.com`)"
      - "traefik.http.routers.triage.entrypoints=websecure"
      - "traefik.http.routers.triage.tls.certresolver=letsencrypt"

volumes:
  letsencrypt:

Troubleshooting

Container Won't Start

# Check logs for errors
docker compose logs triage-warden

# Common issues:
# - DATABASE_URL not set or incorrect
# - TW_ENCRYPTION_KEY missing
# - PostgreSQL not ready (check depends_on health)

Database Connection Failed

# Verify PostgreSQL is running
docker compose ps postgres

# Check PostgreSQL logs
docker compose logs postgres

# Test connection
docker compose exec postgres pg_isready -U triage_warden

# Verify connection from app container
docker exec tw-api-1 curl -v telnet://postgres:5432

Port Conflicts

# Find process using port 8080
lsof -i :8080

# Use different ports
# In docker-compose.ha.yml or via environment:
# - "8090:80" instead of "8080:80"

Container Exits Immediately

# Check exit code and logs
docker-compose -f docker-compose.ha.yml logs api-1

# Common causes:
# - Missing environment variables
# - Database not ready
# - Invalid configuration

Redis Connection Issues

# Test Redis connectivity
docker exec tw-api-1 curl -v telnet://redis:6379

# Check Redis logs
docker-compose -f docker-compose.ha.yml logs redis

# Connect to Redis CLI
docker exec -it tw-redis redis-cli ping

Out of Memory

# Check container memory usage
docker stats

# Increase limits in docker-compose.prod.yml
deploy:
  resources:
    limits:
      memory: 4G  # Increase from 2G

Next Steps

Kubernetes Deployment Guide

This guide covers deploying Triage Warden to Kubernetes, either with the recommended Helm chart or with raw manifests.

Prerequisites

Before deploying, ensure you have:

  • Kubernetes cluster version 1.25 or later
  • kubectl configured with cluster access
  • Helm 3.8+ (see Helm Chart for Helm-based deployment)
  • Container registry access to pull Triage Warden images
  • PostgreSQL database (managed or self-hosted)
  • Redis (optional, required for HA deployments)

Optional Prerequisites

  • Ingress controller (nginx-ingress or Traefik recommended)
  • cert-manager for automatic TLS certificate management
  • Prometheus Operator for metrics and alerting

Quick Start with Helm

1. Add the Helm Repository

# Add the Triage Warden Helm repository
helm repo add triage-warden https://charts.triage-warden.io
helm repo update

2. Create Namespace

kubectl create namespace triage-warden

3. Create Secrets

Generate required secrets before deployment:

# Generate encryption keys
export TW_ENCRYPTION_KEY=$(openssl rand -base64 32)
export TW_JWT_SECRET=$(openssl rand -hex 32)
export TW_SESSION_SECRET=$(openssl rand -hex 32)

# Create Kubernetes secret
kubectl create secret generic triage-warden-secrets \
  --namespace triage-warden \
  --from-literal=TW_ENCRYPTION_KEY="$TW_ENCRYPTION_KEY" \
  --from-literal=TW_JWT_SECRET="$TW_JWT_SECRET" \
  --from-literal=TW_SESSION_SECRET="$TW_SESSION_SECRET" \
  --from-literal=DATABASE_URL="postgres://user:password@postgres:5432/triage_warden"

4. Install Triage Warden

# Basic installation
helm install triage-warden triage-warden/triage-warden \
  --namespace triage-warden \
  --set global.domain=triage.example.com

# Installation with custom values
helm install triage-warden triage-warden/triage-warden \
  --namespace triage-warden \
  --values values-production.yaml

5. Verify Deployment

# Check pod status
kubectl get pods -n triage-warden

# Check service status
kubectl get svc -n triage-warden

# View logs
kubectl logs -n triage-warden -l app.kubernetes.io/name=triage-warden -f

Helm Configuration

Minimal Production Values

Create a values-production.yaml file:

# values-production.yaml
global:
  domain: triage.example.com

api:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

orchestrator:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

postgresql:
  # Use external database
  enabled: false
  external:
    host: postgres.example.com
    port: 5432
    database: triage_warden
    existingSecret: triage-warden-secrets
    existingSecretPasswordKey: DATABASE_PASSWORD

redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: true
    existingSecret: triage-warden-secrets
    existingSecretPasswordKey: REDIS_PASSWORD

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  tls:
    - secretName: triage-warden-tls
      hosts:
        - triage.example.com

monitoring:
  enabled: true
  serviceMonitor:
    enabled: true

Common Configuration Options

| Parameter             | Description                     | Default                             |
|-----------------------|---------------------------------|-------------------------------------|
| api.replicas          | Number of API server replicas   | 2                                   |
| orchestrator.replicas | Number of orchestrator replicas | 2                                   |
| image.repository      | Container image repository      | ghcr.io/triage-warden/triage-warden |
| image.tag             | Container image tag             | latest                              |
| ingress.enabled       | Enable ingress                  | true                                |
| postgresql.enabled    | Deploy PostgreSQL               | true                                |
| redis.enabled         | Deploy Redis                    | true                                |
| monitoring.enabled    | Enable monitoring               | true                                |

Manual Deployment (Without Helm)

If you prefer to use raw Kubernetes manifests:

Architecture

                        ┌─────────────────┐
                        │    Ingress      │
                        │  (TLS + routing)│
                        └────────┬────────┘
                                 │
                ┌────────────────┼────────────────┐
                │                │                │
          ┌─────▼─────┐    ┌─────▼─────┐    ┌─────▼─────┐
          │    Pod    │    │    Pod    │    │    Pod    │
          │  replica  │    │  replica  │    │  replica  │
          └─────┬─────┘    └─────┬─────┘    └─────┬─────┘
                │                │                │
                └────────────────┼────────────────┘
                                 │
                        ┌────────▼────────┐
                        │    Service      │
                        │  (ClusterIP)    │
                        └────────┬────────┘
                                 │
                        ┌────────▼────────┐
                        │   PostgreSQL    │
                        │  (StatefulSet)  │
                        └─────────────────┘

Manifests

Namespace

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: triage-warden
  labels:
    app.kubernetes.io/name: triage-warden

Secret

# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: triage-warden-secrets
  namespace: triage-warden
type: Opaque
stringData:
  # Generate these values securely!
  # encryption-key: $(openssl rand -base64 32)
  # jwt-secret: $(openssl rand -hex 32)
  # session-secret: $(openssl rand -hex 32)
  encryption-key: "REPLACE_WITH_BASE64_32_BYTE_KEY"
  jwt-secret: "REPLACE_WITH_JWT_SECRET"
  session-secret: "REPLACE_WITH_SESSION_SECRET"
  database-url: "postgres://triage_warden:password@postgres-postgresql:5432/triage_warden"

ConfigMap

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: triage-warden-config
  namespace: triage-warden
data:
  RUST_LOG: "info"
  TW_BIND_ADDRESS: "0.0.0.0:8080"
  TW_BASE_URL: "https://triage.example.com"

Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triage-warden
  namespace: triage-warden
  labels:
    app.kubernetes.io/name: triage-warden
    app.kubernetes.io/component: server
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
  template:
    metadata:
      labels:
        app.kubernetes.io/name: triage-warden
    spec:
      serviceAccountName: triage-warden
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: triage-warden
          image: ghcr.io/your-org/triage-warden:latest
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: triage-warden-secrets
                  key: database-url
            - name: TW_ENCRYPTION_KEY
              valueFrom:
                secretKeyRef:
                  name: triage-warden-secrets
                  key: encryption-key
            - name: TW_JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: triage-warden-secrets
                  key: jwt-secret
            - name: TW_SESSION_SECRET
              valueFrom:
                secretKeyRef:
                  name: triage-warden-secrets
                  key: session-secret
          envFrom:
            - configMapRef:
                name: triage-warden-config
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /live
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL

Service

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: triage-warden
  namespace: triage-warden
  labels:
    app.kubernetes.io/name: triage-warden
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: triage-warden

Ingress

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triage-warden
  namespace: triage-warden
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - triage.example.com
      secretName: triage-warden-tls
  rules:
    - host: triage.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: triage-warden
                port:
                  number: 80

ServiceAccount

# serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: triage-warden
  namespace: triage-warden

HorizontalPodAutoscaler

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
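The utilization targets above feed the standard HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A minimal sketch (Python, with hypothetical utilization numbers):

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_r: int = 2, max_r: int = 10) -> int:
    """Approximate the HPA scaling decision:
    desired = ceil(current * currentMetric / targetMetric), clamped to [min, max]."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_r, min(max_r, desired))

# 3 replicas averaging 90% CPU against a 70% target -> scale out to 4
print(desired_replicas(3, 90, 70))  # 4
```

With two metrics configured (CPU and memory), the HPA computes a desired count per metric and takes the largest.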

PodDisruptionBudget

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden

Apply Manifests

kubectl apply -f deploy/kubernetes/namespace.yaml
kubectl apply -f deploy/kubernetes/secret.yaml
kubectl apply -f deploy/kubernetes/configmap.yaml
kubectl apply -f deploy/kubernetes/serviceaccount.yaml
kubectl apply -f deploy/kubernetes/deployment.yaml
kubectl apply -f deploy/kubernetes/service.yaml
kubectl apply -f deploy/kubernetes/ingress.yaml
kubectl apply -f deploy/kubernetes/hpa.yaml
kubectl apply -f deploy/kubernetes/pdb.yaml
kubectl apply -f deploy/kubernetes/servicemonitor.yaml  # requires the Prometheus Operator CRDs

High Availability Configuration

For production HA deployments:

API Server HA

The API servers are stateless and can be scaled horizontally:

api:
  replicas: 3
  podAntiAffinity:
    enabled: true
    topologyKey: kubernetes.io/hostname
  topologySpreadConstraints:
    enabled: true
    maxSkew: 1

Orchestrator HA

Orchestrators use leader election to coordinate singleton tasks:

orchestrator:
  replicas: 2
  leaderElection:
    enabled: true
    leaseDuration: 15s
    renewDeadline: 10s
    retryPeriod: 2s

Pod Disruption Budget

Ensure availability during updates:

podDisruptionBudget:
  enabled: true
  minAvailable: 1

Database Setup

Using Helm (PostgreSQL)

# Add Bitnami repo
helm repo add bitnami https://charts.bitnami.com/bitnami

# Install PostgreSQL
helm install postgres bitnami/postgresql \
  --namespace triage-warden \
  --set auth.username=triage_warden \
  --set auth.password=your-secure-password \
  --set auth.database=triage_warden \
  --set primary.persistence.size=20Gi

Using External Database

Update the secret with your external database URL:

kubectl create secret generic triage-warden-secrets \
  --namespace triage-warden \
  --from-literal=database-url="postgres://user:pass@db.example.com:5432/triage_warden?sslmode=require"
# ...plus the remaining keys (encryption-key, jwt-secret, session-secret)

Monitoring

ServiceMonitor (Prometheus)

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

PrometheusRule (Alerts)

# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  groups:
    - name: triage-warden
      rules:
        - alert: TriageWardenDown
          expr: up{job="triage-warden"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Triage Warden is down"
            description: "Triage Warden has been down for more than 5 minutes."

        - alert: TriageWardenHighErrorRate
          expr: rate(http_requests_total{job="triage-warden",status=~"5.."}[5m]) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate in Triage Warden"

Upgrading

Helm Upgrade

# Check current version
helm list -n triage-warden

# Upgrade to new version
helm upgrade triage-warden triage-warden/triage-warden \
  --namespace triage-warden \
  --values values-production.yaml \
  --set image.tag=v1.1.0

# Monitor the rollout
kubectl rollout status deployment/triage-warden-api -n triage-warden

Rollback

# View release history
helm history triage-warden -n triage-warden

# Rollback to previous version
helm rollback triage-warden 1 -n triage-warden

Database Migrations

Triage Warden automatically runs database migrations on startup. For manual control:

# Run migrations manually
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  triage-warden migrate

# Check migration status
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  triage-warden migrate --status

TLS Configuration

Using cert-manager

ingress:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  tls:
    - secretName: triage-warden-tls
      hosts:
        - triage.example.com

Manual TLS Secret

kubectl create secret tls triage-warden-tls \
  --namespace triage-warden \
  --cert=tls.crt \
  --key=tls.key

Security Hardening

Network Policy

# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: triage-warden
  namespace: triage-warden
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: postgresql
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
    - to:  # External APIs (LLM, connectors)
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443

Troubleshooting

Pod Not Starting

# Check pod events
kubectl describe pod -n triage-warden -l app.kubernetes.io/name=triage-warden

# Check logs
kubectl logs -n triage-warden -l app.kubernetes.io/name=triage-warden --previous

# Common issues:
# - ImagePullBackOff: Check image name and registry credentials
# - CrashLoopBackOff: Check logs for startup errors
# - Pending: Check resource requests and node capacity

Database Connection Issues

# Test database connectivity from a pod
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  curl -v telnet://postgres-postgresql:5432

# Check the database URL (the secret key is database-url)
kubectl get secret triage-warden-secrets -n triage-warden -o jsonpath='{.data.database-url}' | base64 -d

Health Check Failures

# Check liveness endpoint
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  curl -s http://localhost:8080/live

# Check readiness endpoint
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  curl -s http://localhost:8080/ready

# Check detailed health
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
  curl -s http://localhost:8080/health/detailed | jq

Leader Election Issues

# Check which instance is the leader (pod name assumes a StatefulSet-style suffix)
kubectl exec -it triage-warden-orchestrator-0 -n triage-warden -- \
  curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'

# Check leader leases in Redis (SCAN avoids blocking the server the way KEYS can)
kubectl exec -it triage-warden-redis-0 -n triage-warden -- \
  redis-cli --scan --pattern "tw:leader:*"

Performance Issues

# Check resource usage
kubectl top pods -n triage-warden

# Check HPA status
kubectl get hpa -n triage-warden

# View Prometheus metrics
kubectl port-forward svc/prometheus -n monitoring 9090:9090

Ingress Not Working

# Check ingress
kubectl describe ingress triage-warden -n triage-warden

# Check TLS secret
kubectl get secret triage-warden-tls -n triage-warden

# Check ingress controller logs
kubectl logs -l app.kubernetes.io/name=ingress-nginx -n ingress-nginx

Operations

View Logs

# All pods
kubectl logs -l app.kubernetes.io/name=triage-warden -n triage-warden -f

# Specific pod
kubectl logs -f deployment/triage-warden -n triage-warden

# Previous container (after crash)
kubectl logs deployment/triage-warden -n triage-warden --previous

Scale Deployment

# Manual scale
kubectl scale deployment triage-warden -n triage-warden --replicas=5

# Check HPA status
kubectl get hpa -n triage-warden

Rolling Update

# Update image
kubectl set image deployment/triage-warden \
  triage-warden=ghcr.io/your-org/triage-warden:v1.2.0 \
  -n triage-warden

# Watch rollout
kubectl rollout status deployment/triage-warden -n triage-warden

# Rollback if needed
kubectl rollout undo deployment/triage-warden -n triage-warden

Uninstalling

Helm Uninstall

# Uninstall Triage Warden
helm uninstall triage-warden -n triage-warden

# Delete namespace (optional, removes all resources)
kubectl delete namespace triage-warden

# Delete PVCs if needed
kubectl delete pvc -n triage-warden --all

Helm Chart Deployment

Deploy Triage Warden to Kubernetes using the bundled Helm chart. This is the recommended approach for Kubernetes deployments, providing templated manifests with environment-specific value overrides.

The chart lives at deploy/helm/ in the repository.

Prerequisites

  • Kubernetes 1.25+
  • Helm 3.8+
  • External PostgreSQL database (required)
  • External Redis (optional, required for HA deployments)
  • Ingress controller (nginx recommended)
  • cert-manager (for automatic TLS)
  • Prometheus Operator (for monitoring)

Quick Start

Development

# Create a values file
cat > my-values.yaml << EOF
postgresql:
  host: "postgres.default.svc.cluster.local"
  port: 5432
  database: "triage_warden"
  username: "triage"
  password: "your-password"

secrets:
  encryptionKey: "$(openssl rand -base64 32)"
  jwtSecret: "$(openssl rand -hex 32)"
  sessionSecret: "$(openssl rand -hex 32)"

config:
  enableSwagger: true
  secureCookies: false
EOF

# Install
helm install triage-warden ./deploy/helm -f my-values.yaml

Production

# Create namespace
kubectl create namespace triage-warden

# Create secrets externally (recommended)
kubectl create secret generic triage-warden-secrets \
  --namespace triage-warden \
  --from-literal=TW_ENCRYPTION_KEY="$(openssl rand -base64 32)" \
  --from-literal=TW_JWT_SECRET="$(openssl rand -hex 32)" \
  --from-literal=TW_SESSION_SECRET="$(openssl rand -hex 32)"

kubectl create secret generic postgresql-credentials \
  --namespace triage-warden \
  --from-literal=postgresql-password="your-db-password"

# Install with production values
helm install triage-warden ./deploy/helm \
  --namespace triage-warden \
  -f deploy/helm/values-prod.yaml

Value Files

The chart ships with pre-built value files for common scenarios:

| File | Purpose |
|---|---|
| values.yaml | Defaults (base for all environments) |
| values-dev.yaml | Single-instance development (debug logging, no TLS) |
| values-prod.yaml | Multi-instance production (3 API replicas, TLS, monitoring) |
| values-ha.yaml | Maximum availability (5+ replicas, zone spreading, strict anti-affinity) |

Override with -f:

helm install triage-warden ./deploy/helm \
  --namespace triage-warden \
  -f deploy/helm/values-prod.yaml \
  -f my-secrets.yaml

Key Parameters

Application

| Parameter | Description | Default |
|---|---|---|
| api.replicas | API server replicas | 2 |
| api.resources.requests.cpu | CPU request | 100m |
| api.resources.requests.memory | Memory request | 256Mi |
| orchestrator.replicas | Orchestrator replicas | 1 |
| config.logLevel | Log level | info |
| config.enableSwagger | Enable Swagger UI | false |

Database

| Parameter | Description | Default |
|---|---|---|
| postgresql.host | PostgreSQL host (required) | "" |
| postgresql.port | PostgreSQL port | 5432 |
| postgresql.database | Database name | triage_warden |
| postgresql.existingSecret | Existing secret with password | "" |
| postgresql.sslMode | SSL mode | require |

Networking

| Parameter | Description | Default |
|---|---|---|
| ingress.enabled | Enable ingress | false |
| ingress.className | Ingress class name | nginx |
| networkPolicy.enabled | Enable network policies | false |

Scaling & HA

| Parameter | Description | Default |
|---|---|---|
| autoscaling.enabled | Enable HPA | false |
| autoscaling.minReplicas | Minimum replicas | 2 |
| autoscaling.maxReplicas | Maximum replicas | 10 |
| podDisruptionBudget.enabled | Enable PDB | false |

Monitoring

| Parameter | Description | Default |
|---|---|---|
| serviceMonitor.enabled | Enable ServiceMonitor | false |
| prometheusRules.enabled | Enable alerting rules | false |

See deploy/helm/values.yaml for the complete list.

Components

The chart deploys two main components:

  • API Server (deployment-api.yaml) - Handles HTTP requests, webhooks, and the web UI
  • Orchestrator (deployment-orchestrator.yaml) - Manages background tasks, scheduling, and automation

Supporting resources: ServiceAccount, ConfigMap, Secret, Service, Ingress, HPA, PDB, NetworkPolicy, ServiceMonitor, PrometheusRule.

External Secrets

For production, use an external secrets manager instead of storing secrets in values files:

secrets:
  create: false
  existingSecret: "triage-warden-secrets"

Any tool that can create the referenced Kubernetes Secret out-of-band (for example, an external secrets operator or your CI pipeline) is compatible with this setting.

Upgrading

helm upgrade triage-warden ./deploy/helm \
  --namespace triage-warden \
  -f deploy/helm/values-prod.yaml

# Monitor rollout
kubectl rollout status deployment/triage-warden-api -n triage-warden

Rollback

helm history triage-warden -n triage-warden
helm rollback triage-warden 1 -n triage-warden

Uninstalling

helm uninstall triage-warden -n triage-warden
kubectl delete namespace triage-warden

Alerts

When prometheusRules.enabled: true, the chart installs these alerts:

  • TriageWardenDown - Instance unreachable for 2+ minutes
  • TriageWardenHighErrorRate - 5xx errors exceed 5%
  • TriageWardenKillSwitchActive - Kill switch activated
  • TriageWardenDatabaseUnhealthy - Database connection issues
  • TriageWardenHighLatency - P99 latency above 1 second
  • TriageWardenConnectorUnhealthy - Connector health issues

The HA values file (values-ha.yaml) adds zone-balance and replica-mismatch alerts.

Configuration Reference

This document provides a comprehensive reference for all Triage Warden configuration options.

Configuration Methods

Triage Warden can be configured through:

  1. Environment variables (recommended for production)
  2. Configuration file (config/default.yaml)
  3. Command-line arguments (for specific settings)

Environment variables take precedence over configuration file values.

Environment Variables

Security Settings (Required)

| Variable | Description | Example |
|---|---|---|
| TW_ENCRYPTION_KEY | 32-byte base64 key for encrypting credentials stored in database | `openssl rand -base64 32` |
| TW_JWT_SECRET | Secret for signing JWT tokens (min 32 chars) | `openssl rand -hex 32` |
| TW_SESSION_SECRET | Secret for signing session cookies (min 32 chars) | `openssl rand -hex 32` |

Warning: These secrets must be consistent across all instances in a cluster. Changing them will invalidate existing sessions and encrypted data.

Database Configuration

| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | postgres://user:pass@host:5432/db |
| DATABASE_MAX_CONNECTIONS | Maximum connection pool size | 25 |
| DATABASE_MIN_CONNECTIONS | Minimum connection pool size | 5 |
| DATABASE_CONNECT_TIMEOUT | Connection timeout in seconds | 30 |
| DATABASE_IDLE_TIMEOUT | Idle connection timeout in seconds | 600 |
| DATABASE_MAX_LIFETIME | Maximum connection lifetime in seconds | 1800 |
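One sizing constraint worth checking: every replica opens its own pool, so replicas × DATABASE_MAX_CONNECTIONS must stay below PostgreSQL's max_connections, with headroom for superuser and maintenance sessions. A quick sanity-check sketch, with assumed numbers:

```python
def pool_fits(api_replicas: int, pool_max: int, pg_max_connections: int,
              reserved: int = 10) -> bool:
    """Check that all replicas together cannot exhaust PostgreSQL's
    max_connections (minus headroom for superuser/maintenance sessions)."""
    return api_replicas * pool_max <= pg_max_connections - reserved

# 3 API replicas x 25 connections = 75, fits a default max_connections of 100
print(pool_fits(3, 25, 100))  # True
print(pool_fits(5, 25, 100))  # False
```

If the check fails, either lower DATABASE_MAX_CONNECTIONS or put PgBouncer in front of the database.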

Connection String Format:

postgres://username:password@hostname:port/database?sslmode=require

SSL modes: disable, allow, prefer, require, verify-ca, verify-full
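If you need to inspect a connection string programmatically, Python's standard urlsplit handles this URL shape; the credentials and host below are placeholders:

```python
from urllib.parse import urlsplit, parse_qs

url = "postgres://triage:s3cret@db.example.com:5432/triage_warden?sslmode=require"
parts = urlsplit(url)
query = parse_qs(parts.query)

print(parts.username)          # triage
print(parts.hostname)          # db.example.com
print(parts.port)              # 5432
print(parts.path.lstrip("/"))  # triage_warden
print(query["sslmode"][0])     # require
```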

Redis Configuration

Redis is required for HA deployments (message queue, cache, leader election).

| Variable | Description | Default |
|---|---|---|
| REDIS_URL | Redis connection URL | redis://localhost:6379 |
| TW_MESSAGE_QUEUE_ENABLED | Enable Redis-based message queue | false |
| TW_CACHE_ENABLED | Enable Redis-based cache | false |
| TW_LEADER_ELECTION_ENABLED | Enable Redis-based leader election | false |
| TW_CACHE_TTL_SECONDS | Default cache TTL | 3600 |
| TW_CACHE_MAX_SIZE | Maximum cache entries | 10000 |

Connection URL Formats:

redis://localhost:6379
redis://:password@localhost:6379
redis://localhost:6379/0
rediss://localhost:6379  # TLS

Server Configuration

| Variable | Description | Default |
|---|---|---|
| TW_BIND_ADDRESS | Address and port to bind | 0.0.0.0:8080 |
| TW_BASE_URL | Public URL for the application | http://localhost:8080 |
| TW_ENV | Environment: development, production | development |
| TW_TRUSTED_PROXIES | CIDR ranges for trusted reverse proxies | (empty) |
| TW_REQUEST_BODY_LIMIT | Max request body size in bytes | 10485760 (10MB) |
| TW_REQUEST_TIMEOUT | Request timeout in seconds | 30 |

Instance Configuration

| Variable | Description | Default |
|---|---|---|
| TW_INSTANCE_ID | Unique identifier for this instance | Auto-generated |
| TW_INSTANCE_TYPE | Instance type: api, orchestrator, combined | combined |

Authentication & Sessions

| Variable | Description | Default |
|---|---|---|
| TW_COOKIE_SECURE | Require HTTPS for cookies | true in production |
| TW_COOKIE_SAME_SITE | SameSite policy: strict, lax, none | strict |
| TW_SESSION_EXPIRY_SECONDS | Session duration | 86400 (24 hours) |
| TW_CSRF_ENABLED | Enable CSRF protection | true |
| TW_ADMIN_PASSWORD | Initial admin password (first run only) | Auto-generated |

CORS Configuration

| Variable | Description | Default |
|---|---|---|
| TW_CORS_ALLOWED_ORIGINS | Allowed origins (comma-separated) | Same origin only |
| TW_CORS_ALLOW_CREDENTIALS | Allow credentials in CORS requests | true |
| TW_CORS_MAX_AGE | Preflight cache duration in seconds | 3600 |

LLM Configuration

| Variable | Description | Default |
|---|---|---|
| TW_LLM_PROVIDER | LLM provider: anthropic, openai, azure, local | anthropic |
| TW_LLM_MODEL | Model identifier | claude-3-sonnet-20240229 |
| TW_LLM_TEMPERATURE | Generation temperature (0.0-2.0) | 0.2 |
| TW_LLM_MAX_TOKENS | Maximum response tokens | 4096 |
| TW_LLM_TIMEOUT_SECONDS | API call timeout | 60 |
| TW_LLM_RETRY_ATTEMPTS | Number of retry attempts | 3 |
| TW_LLM_RETRY_DELAY_MS | Delay between retries | 1000 |
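The retry settings describe a simple bounded-retry loop: up to TW_LLM_RETRY_ATTEMPTS tries with TW_LLM_RETRY_DELAY_MS between them. A sketch of that behavior (the actual client may also back off or distinguish retryable errors):

```python
import time

def call_with_retries(call, attempts: int = 3, delay_ms: int = 1000):
    """Retry a flaky call: up to `attempts` tries with a fixed delay
    between them, re-raising the last error if all tries fail."""
    last_err = None
    for i in range(attempts):
        try:
            return call()
        except Exception as err:  # a real client would catch its specific error type
            last_err = err
            if i < attempts - 1:
                time.sleep(delay_ms / 1000)
    raise last_err

# Simulate an API that fails twice, then succeeds
calls = iter([RuntimeError("overloaded"), RuntimeError("overloaded"), "verdict"])

def flaky():
    item = next(calls)
    if isinstance(item, Exception):
        raise item
    return item

print(call_with_retries(flaky, attempts=3, delay_ms=10))  # verdict
```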

Provider-specific API Keys:

| Variable | Provider |
|---|---|
| ANTHROPIC_API_KEY | Anthropic Claude |
| OPENAI_API_KEY | OpenAI GPT |
| AZURE_OPENAI_API_KEY | Azure OpenAI |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint URL |

Orchestrator Configuration

| Variable | Description | Default |
|---|---|---|
| TW_OPERATION_MODE | Mode: supervised, assisted, autonomous | supervised |
| TW_AUTO_APPROVE_LOW_RISK | Auto-approve low-risk actions | false |
| TW_MAX_CONCURRENT_INCIDENTS | Max concurrent incident processing | 100 |
| TW_ENRICHMENT_TIMEOUT_SECONDS | Enrichment step timeout | 60 |
| TW_ANALYSIS_TIMEOUT_SECONDS | AI analysis timeout | 120 |
| TW_ACTION_TIMEOUT_SECONDS | Action execution timeout | 300 |

Logging Configuration

| Variable | Description | Default |
|---|---|---|
| RUST_LOG | Log level filter | info |
| TW_LOG_FORMAT | Format: json, pretty | json in production |
| TW_LOG_INCLUDE_LOCATION | Include file/line in logs | false |

Log Level Examples:

# Basic level
RUST_LOG=info

# Per-module levels
RUST_LOG=info,triage_warden=debug,tw_api=trace

# All debug
RUST_LOG=debug

Metrics Configuration

| Variable | Description | Default |
|---|---|---|
| TW_METRICS_ENABLED | Enable Prometheus metrics | true |
| TW_METRICS_PATH | Metrics endpoint path | /metrics |
| TW_METRICS_INCLUDE_LABELS | Include additional labels | true |

Rate Limiting

| Variable | Description | Default |
|---|---|---|
| TW_RATE_LIMIT_ENABLED | Enable rate limiting | true |
| TW_RATE_LIMIT_REQUESTS | Requests per window | 200 |
| TW_RATE_LIMIT_WINDOW | Window duration (e.g., 1m, 1h) | 1m |
| TW_RATE_LIMIT_BURST | Burst allowance | 50 |
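These settings suggest token-bucket-style limiting: a steady refill of TW_RATE_LIMIT_REQUESTS per TW_RATE_LIMIT_WINDOW, with TW_RATE_LIMIT_BURST of headroom. The server's exact algorithm isn't documented here; this sketch just illustrates how rate and burst interact:

```python
import time

class TokenBucket:
    """Token-bucket sketch: tokens refill at rate_per_window / window_s
    per second, capped at the burst capacity; each request costs one token."""
    def __init__(self, rate_per_window: int, window_s: float, burst: int):
        self.rate = rate_per_window / window_s   # tokens per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_window=200, window_s=60, burst=50)
# The first 50 back-to-back requests ride the burst allowance...
print(all(bucket.allow() for _ in range(50)))  # True
# ...the 51st is throttled until tokens refill at ~3.3/s.
print(bucket.allow())  # False
```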

Feature Flags

| Variable | Description | Default |
|---|---|---|
| TW_FEATURE_PLAYBOOKS | Enable playbook automation | true |
| TW_FEATURE_AUTO_ENRICH | Enable automatic enrichment | true |
| TW_FEATURE_API_KEYS | Enable API key authentication | true |
| TW_FEATURE_MULTI_TENANT | Enable multi-tenancy | false |
| TW_ENABLE_SWAGGER | Enable Swagger UI | true in dev |

Webhook Configuration

| Variable | Description | Default |
|---|---|---|
| TW_WEBHOOK_SECRET | Default webhook signature secret | (empty) |
| TW_WEBHOOK_TIMEOUT_SECONDS | Webhook delivery timeout | 30 |
| TW_WEBHOOK_RETRY_ATTEMPTS | Delivery retry attempts | 3 |

Source-specific webhook secrets:

| Variable | Source |
|---|---|
| TW_WEBHOOK_SPLUNK_SECRET | Splunk HEC |
| TW_WEBHOOK_CROWDSTRIKE_SECRET | CrowdStrike |
| TW_WEBHOOK_SENTINEL_SECRET | Microsoft Sentinel |
| TW_WEBHOOK_GITHUB_SECRET | GitHub (for DevSecOps) |
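Webhook signature secrets are typically used to verify an HMAC over the request body. The header name and digest scheme vary by source, so treat this as the general pattern rather than Triage Warden's exact wire format:

```python
import hmac, hashlib

def verify_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Verify an HMAC-SHA256 signature over the raw request body,
    using a constant-time comparison to avoid timing leaks."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

secret = "whsec_example"            # hypothetical TW_WEBHOOK_SECRET value
body = b'{"alert": "phishing"}'
sig = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

print(verify_signature(secret, body, sig))       # True
print(verify_signature(secret, body, "0" * 64))  # False
```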

Configuration File

Configuration can also be provided via a YAML file.

File Locations

Triage Warden searches for configuration in order:

  1. Path specified by --config flag
  2. $HOME/.config/triage-warden/config.yaml
  3. /etc/triage-warden/config.yaml
  4. ./config/default.yaml

Example Configuration File

# config/default.yaml

# Server configuration
server:
  bind_address: "0.0.0.0:8080"
  base_url: "https://triage.example.com"
  trusted_proxies:
    - "10.0.0.0/8"
    - "172.16.0.0/12"

# Database configuration
database:
  url: "postgres://triage:password@localhost:5432/triage_warden"
  max_connections: 25
  min_connections: 5
  connect_timeout: 30

# Redis configuration (for HA)
redis:
  url: "redis://localhost:6379"
  message_queue:
    enabled: true
  cache:
    enabled: true
    ttl_seconds: 3600
  leader_election:
    enabled: true

# LLM configuration
llm:
  provider: anthropic
  model: claude-3-sonnet-20240229
  temperature: 0.2
  max_tokens: 4096
  # API key should be set via environment variable

# Orchestrator settings
orchestrator:
  operation_mode: supervised
  auto_approve_low_risk: false
  max_concurrent_incidents: 100
  timeouts:
    enrichment: 60
    analysis: 120
    action: 300

# Logging
logging:
  level: info
  format: json

# Metrics
metrics:
  enabled: true
  path: /metrics

# Rate limiting
rate_limit:
  enabled: true
  requests_per_minute: 200
  burst: 50

# Feature flags
features:
  playbooks: true
  auto_enrich: true
  api_keys: true
  multi_tenant: false

# Connectors
connectors:
  crowdstrike:
    enabled: true
    type: edr
    base_url: "https://api.crowdstrike.com"
    # Credentials via environment or secrets

  splunk:
    enabled: true
    type: siem
    base_url: "https://splunk.example.com:8089"

Precedence

Configuration is loaded in this order (later overrides earlier):

  1. Default values (built into application)
  2. Configuration file (config/default.yaml)
  3. Environment-specific file (config/{TW_ENV}.yaml)
  4. Environment variables
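That layering is an ordinary "later layer wins" merge that recurses into nested tables. A sketch with hypothetical logging settings:

```python
def merge(base: dict, override: dict) -> dict:
    """Merge two config layers: later layers win, and nested
    tables are merged key-by-key instead of replaced wholesale."""
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], val)
        else:
            out[key] = val
    return out

defaults = {"logging": {"level": "info", "format": "json"}}
file_cfg = {"logging": {"level": "debug"}}
env_vars = {"logging": {"format": "pretty"}}

# Apply layers in precedence order: defaults < file < environment variables
resolved = merge(merge(defaults, file_cfg), env_vars)
print(resolved)  # {'logging': {'level': 'debug', 'format': 'pretty'}}
```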

Generating Secrets

Encryption Key (32 bytes, base64)

# macOS/Linux
openssl rand -base64 32

# Alternative using /dev/urandom
head -c 32 /dev/urandom | base64

JWT/Session Secrets

# Hex-encoded secret
openssl rand -hex 32

# Or use a password generator
pwgen -s 64 1

Database URL Format

PostgreSQL

postgres://username:password@hostname:port/database?sslmode=require

Options:

  • sslmode=disable - No SSL (development only)
  • sslmode=require - Require SSL, don't verify certificate
  • sslmode=verify-ca - Require SSL, verify CA
  • sslmode=verify-full - Require SSL, verify CA and hostname

Connection Pooling (PgBouncer)

postgres://username:password@pgbouncer:6432/database?sslmode=require

Operation Modes

Triage Warden supports three operation modes:

Supervised Mode (Default)

All actions require human approval:

TW_OPERATION_MODE=supervised
TW_AUTO_APPROVE_LOW_RISK=false

Assisted Mode

Low-risk actions are auto-approved, high-risk require approval:

TW_OPERATION_MODE=assisted
TW_AUTO_APPROVE_LOW_RISK=true

Autonomous Mode

All actions within guardrails are auto-executed:

TW_OPERATION_MODE=autonomous

Warning: Autonomous mode should only be enabled after thorough testing and with appropriate guardrails configured.

Health Check Endpoints

| Endpoint | Purpose | Response |
|---|---|---|
| /health | Basic health status | {"status": "healthy", ...} |
| /health/detailed | Full component status | Includes all components |
| /live | Liveness probe (Kubernetes) | 200 OK |
| /ready | Readiness probe (Kubernetes) | 200 OK or 503 |

Health Status Values

| Status | Description |
|---|---|
| healthy | All components operational |
| degraded | Some non-critical components failing |
| unhealthy | Critical components failing |
| halted | Kill switch activated |
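A sketch of how these statuses could be derived from component health; which components count as critical is an assumption here (the database is, connectors are not):

```python
def overall_status(kill_switch: bool, components: dict) -> str:
    """Map component health (name -> bool) to the documented statuses.
    The kill switch overrides everything; critical failures are
    'unhealthy'; any other failure is 'degraded'."""
    critical = {"database"}  # assumed set of critical components
    if kill_switch:
        return "halted"
    failing = {name for name, healthy in components.items() if not healthy}
    if failing & critical:
        return "unhealthy"
    if failing:
        return "degraded"
    return "healthy"

print(overall_status(False, {"database": True, "connector_jira": False}))  # degraded
print(overall_status(False, {"database": False}))                          # unhealthy
print(overall_status(True,  {"database": True}))                           # halted
```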

Security Best Practices

  1. Never commit secrets to version control
  2. Use different secrets for each environment
  3. Rotate secrets periodically
  4. Enable TLS in production (TW_COOKIE_SECURE=true)
  5. Restrict trusted proxies to known IP ranges
  6. Enable rate limiting in production
  7. Use read-only database users where possible

Environment-Specific Recommendations

Development

TW_ENV=development
TW_LOG_FORMAT=pretty
RUST_LOG=debug,triage_warden=trace
TW_COOKIE_SECURE=false
TW_ENABLE_SWAGGER=true

Staging

TW_ENV=production
TW_LOG_FORMAT=json
RUST_LOG=info,triage_warden=debug
TW_COOKIE_SECURE=true
TW_ENABLE_SWAGGER=true

Production

TW_ENV=production
TW_LOG_FORMAT=json
RUST_LOG=info
TW_COOKIE_SECURE=true
TW_ENABLE_SWAGGER=false
TW_METRICS_ENABLED=true
TW_RATE_LIMIT_ENABLED=true

High-Availability

DATABASE_URL=postgres://tw_user:pass@pgbouncer:6432/triage_warden?sslmode=require
DATABASE_MAX_CONNECTIONS=50
TW_TRUSTED_PROXIES=10.0.0.0/8
TW_METRICS_ENABLED=true
TW_TRACING_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317

Operations Guide

Operational procedures and runbooks for Triage Warden.

Runbooks

Quick Reference

Health Check Endpoints

| Endpoint | Purpose | Expected Response |
|---|---|---|
| GET /live | Liveness probe | 200 OK |
| GET /ready | Readiness probe | 200 OK if ready, 503 if not |
| GET /health | Basic health | JSON with status |
| GET /health/detailed | Full component health | JSON with all components |

Key Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| http_requests_total | Total HTTP requests | N/A |
| http_request_duration_seconds | Request latency | p99 > 1s |
| http_requests_in_flight | Concurrent requests | > 100 |
| db_pool_connections_active | Active DB connections | > 80% of max |
| incidents_total | Total incidents processed | N/A |
| actions_executed_total | Total actions executed | N/A |

Emergency Contacts

| Role | Contact | Escalation |
|---|---|---|
| On-call Engineer | PagerDuty | Auto-escalates after 15m |
| Security Lead | [email protected] | Critical security issues |
| Database Admin | [email protected] | Database emergencies |

Common Commands

Docker

# View logs
docker compose logs -f triage-warden

# Restart service
docker compose restart triage-warden

# Check health
curl http://localhost:8080/health | jq

# Database backup
docker compose exec postgres pg_dump -U triage_warden > backup.sql

Kubernetes

# View logs
kubectl logs -f deployment/triage-warden -n triage-warden

# Restart pods
kubectl rollout restart deployment/triage-warden -n triage-warden

# Check health
kubectl exec -it deployment/triage-warden -n triage-warden -- curl -s localhost:8080/health | jq

# Scale up/down
kubectl scale deployment triage-warden -n triage-warden --replicas=5

Database

# Connect to PostgreSQL
psql $DATABASE_URL

# Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'triage_warden';

# Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

Service Dependencies

      ┌──────────────────┐
      │  Triage Warden   │
      └────────┬─────────┘
               │
    ┌──────────┼──────────┬──────────┐
    │          │          │          │
    ▼          ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│Postgres│ │  LLM   │ │Connec- │ │Notifi- │
│   DB   │ │  API   │ │ tors   │ │cations │
└────────┘ └────────┘ └────────┘ └────────┘

Dependency Health Impact

| Dependency | If Unavailable |
|---|---|
| PostgreSQL | Service fails readiness, no data access |
| LLM API | AI analysis disabled, manual triage only |
| Connectors | Specific integrations fail, core works |
| Notifications | Alerts not delivered, incidents still process |

Scheduled Tasks

| Task | Schedule | Description |
|---|---|---|
| Database backup | Daily 2:00 AM | Full PostgreSQL backup |
| Connector health check | Every 5 minutes | Verify connector connectivity |
| Incident cleanup | Weekly Sunday 3:00 AM | Archive old incidents |
| Log rotation | Daily | Rotate and compress logs |
| Certificate renewal | 30 days before expiry | Renew TLS certificates |

Monitoring Guide

This guide covers monitoring, metrics, and alerting for Triage Warden deployments.

Overview

Triage Warden exposes metrics in Prometheus format and supports integration with common observability stacks.

┌─────────────────────────────────────────────────────────────┐
│                    Monitoring Stack                          │
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Prometheus  │───▶│   Grafana    │    │ Alertmanager │  │
│  │  (scraping)  │    │ (dashboards) │    │  (alerts)    │  │
│  └──────┬───────┘    └──────────────┘    └──────────────┘  │
│         │                                                    │
└─────────┼────────────────────────────────────────────────────┘
          │
          │ /metrics
          │
┌─────────▼────────────────────────────────────────────────────┐
│                    Triage Warden                              │
│  ┌───────────┐  ┌───────────┐  ┌─────────────┐              │
│  │ API-1     │  │ API-2     │  │Orchestrator │              │
│  │ :8080     │  │ :8080     │  │    :8080    │              │
│  └───────────┘  └───────────┘  └─────────────┘              │
└──────────────────────────────────────────────────────────────┘

Metrics Endpoints

| Endpoint | Format | Description |
|---|---|---|
| /metrics | Prometheus | Prometheus-compatible metrics |
| /api/metrics | JSON | Dashboard-friendly JSON format |
| /health | JSON | Basic health status |
| /health/detailed | JSON | Comprehensive health including components |

Available Metrics

HTTP Metrics

# Request counter by method, path, status
http_requests_total{method="GET", path="/api/incidents", status="200"} 1234

# Request duration histogram
http_request_duration_seconds_bucket{method="GET", path="/api/incidents", le="0.1"} 900
http_request_duration_seconds_bucket{method="GET", path="/api/incidents", le="0.5"} 1100
http_request_duration_seconds_bucket{method="GET", path="/api/incidents", le="1.0"} 1200

# Active connections
http_connections_active 42
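The `_bucket` series are cumulative counts, which is what lets PromQL's histogram_quantile() estimate latency percentiles by linear interpolation. The same arithmetic applied to the sample buckets above:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets by
    linear interpolation, in the spirit of PromQL's histogram_quantile()."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# Cumulative counts from the sample above: <=0.1s: 900, <=0.5s: 1100, <=1.0s: 1200
buckets = [(0.1, 900), (0.5, 1100), (1.0, 1200)]
print(round(histogram_quantile(0.99, buckets), 3))  # 0.94
```

So a p99 of 0.94s is already close to the 1s alert threshold in the sample data.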

Incident Metrics

# Total incidents by severity and status
triage_warden_incidents_total{severity="critical", status="new"} 5
triage_warden_incidents_total{severity="high", status="resolved"} 128

# Incidents currently being processed
triage_warden_incidents_in_progress 12

# Triage duration histogram
triage_warden_triage_duration_seconds_bucket{le="60"} 500
triage_warden_triage_duration_seconds_bucket{le="300"} 800

Action Metrics

# Actions by type and status
triage_warden_actions_total{action_type="isolate_host", status="success"} 45
triage_warden_actions_total{action_type="isolate_host", status="failed"} 2

# Pending approvals
triage_warden_actions_pending_approval 8

# Action execution duration
triage_warden_action_duration_seconds_bucket{action_type="isolate_host", le="30"} 40

System Metrics

# Kill switch status
kill_switch_active 0

# Component health (1=healthy, 0=unhealthy)
component_healthy{component="database"} 1
component_healthy{component="redis"} 1
component_healthy{component="connector_crowdstrike"} 1

# Database connection pool
db_pool_connections_total 25
db_pool_connections_idle 20
db_pool_connections_waiting 0

# Cache statistics
cache_hits_total 10000
cache_misses_total 500
cache_size 2500
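From the cache counters above, the hit ratio is hits / (hits + misses). Using the sample values as an illustration:

```shell
# Cache hit ratio from the sample counters above
hits=10000
misses=500
ratio=$(awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.2f", h / (h + m) * 100 }')
echo "$ratio"   # 95.24
```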

LLM Metrics

# LLM API calls by provider and model
llm_requests_total{provider="anthropic", model="claude-3-sonnet"} 500

# LLM latency
llm_request_duration_seconds_bucket{provider="anthropic", le="5"} 400
llm_request_duration_seconds_bucket{provider="anthropic", le="30"} 490

# Token usage
llm_tokens_used_total{provider="anthropic", type="input"} 150000
llm_tokens_used_total{provider="anthropic", type="output"} 75000

Message Queue Metrics

# Queue depth by topic
mq_messages_pending{topic="triage.alerts"} 15
mq_messages_pending{topic="triage.enrichment"} 3

# Message processing rate
mq_messages_processed_total{topic="triage.alerts"} 5000
mq_messages_acknowledged_total{topic="triage.alerts"} 4995

Prometheus Configuration

Basic Scrape Config

# prometheus.yml
scrape_configs:
  - job_name: 'triage-warden'
    static_configs:
      - targets:
          - 'triage-warden-api:8080'
          - 'triage-warden-orchestrator:8080'
    metrics_path: /metrics
    scrape_interval: 15s
    scrape_timeout: 10s

Kubernetes ServiceMonitor

For Prometheus Operator deployments:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: triage-warden
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
  namespaceSelector:
    matchNames:
      - triage-warden
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s

Pod Annotations (Alternative)

If using annotation-based discovery:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

Alerting Rules

PrometheusRule Resource

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triage-warden-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: triage-warden.availability
      rules:
        # Service Down
        - alert: TriageWardenDown
          expr: up{job="triage-warden"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Triage Warden instance is down"
            description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."

        # High Error Rate
        - alert: TriageWardenHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="triage-warden",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{job="triage-warden"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "More than 5% of requests are returning 5xx errors."

        # Database Unhealthy
        - alert: TriageWardenDatabaseUnhealthy
          expr: component_healthy{job="triage-warden",component="database"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Database connection lost"
            description: "Triage Warden cannot connect to the database."

    - name: triage-warden.performance
      rules:
        # High Latency
        - alert: TriageWardenHighLatency
          expr: |
            histogram_quantile(0.99,
              rate(http_request_duration_seconds_bucket{job="triage-warden"}[5m])
            ) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High API latency"
            description: "P99 latency is above 1 second for the last 10 minutes."

        # Slow Triage Time
        - alert: TriageWardenSlowTriage
          expr: |
            histogram_quantile(0.90,
              rate(triage_warden_triage_duration_seconds_bucket[1h])
            ) > 300
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Incident triage taking too long"
            description: "P90 triage duration is above 5 minutes."

    - name: triage-warden.operations
      rules:
        # Kill Switch Active
        - alert: TriageWardenKillSwitchActive
          expr: kill_switch_active == 1
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Kill switch is active"
            description: "All automation has been halted by the kill switch."

        # High Pending Approvals
        - alert: TriageWardenHighPendingApprovals
          expr: triage_warden_actions_pending_approval > 50
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High number of pending approvals"
            description: "{{ $value }} actions are waiting for approval."

        # Connector Unhealthy
        - alert: TriageWardenConnectorUnhealthy
          expr: component_healthy{component=~"connector_.*"} == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Connector {{ $labels.component }} is unhealthy"
            description: "Connector has been unhealthy for more than 10 minutes."

        # Queue Backlog
        - alert: TriageWardenQueueBacklog
          expr: mq_messages_pending{topic="triage.alerts"} > 100
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Alert queue backlog growing"
            description: "{{ $value }} unprocessed alerts in queue."

    - name: triage-warden.resources
      rules:
        # High CPU
        - alert: TriageWardenHighCPU
          expr: |
            sum(rate(container_cpu_usage_seconds_total{
              container="triage-warden"
            }[5m])) by (pod) > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage"
            description: "Pod {{ $labels.pod }} CPU usage above 80%."

        # High Memory
        - alert: TriageWardenHighMemory
          expr: |
            container_memory_usage_bytes{container="triage-warden"} /
            container_spec_memory_limit_bytes{container="triage-warden"} > 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage"
            description: "Pod {{ $labels.pod }} memory usage above 90%."

        # Database Connection Exhaustion
        - alert: TriageWardenDBConnectionsLow
          expr: db_pool_connections_idle < 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Database connection pool nearly exhausted"
            description: "Only {{ $value }} idle connections remaining."

Key Metrics to Monitor

SLI/SLO Recommendations

| Indicator | Target | Alert Threshold |
|-----------|--------|-----------------|
| Availability | 99.9% | < 99.5% |
| API Latency P99 | < 500ms | > 1s |
| Error Rate | < 0.1% | > 1% |
| Triage Time P90 | < 5min | > 10min |
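To put the 99.9% availability target in concrete terms, the allowed downtime over a 30-day window works out to:

```shell
# Downtime budget in minutes for a 99.9% availability target over 30 days
target=0.999
minutes=$(awk -v t="$target" 'BEGIN { printf "%.1f", (1 - t) * 30 * 24 * 60 }')
echo "$minutes"   # 43.2
```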

Dashboard Panels

Overview:

  • Instance count and status
  • Requests per second
  • Error rate percentage
  • Active incidents

Performance:

  • Request latency histogram
  • Database query duration
  • LLM response time
  • Cache hit ratio

Operations:

  • Incidents by severity/status
  • Actions executed vs pending
  • Queue depths
  • Connector health matrix

Resources:

  • CPU utilization by instance
  • Memory utilization by instance
  • Database connections
  • Redis memory usage

Grafana Dashboards

Importing Dashboards

Triage Warden provides pre-built Grafana dashboards:

# Download dashboard JSON
curl -o triage-warden-dashboard.json \
  https://raw.githubusercontent.com/triage-warden/triage-warden/main/deploy/grafana/dashboards/overview.json

# Import via Grafana API
curl -X POST -H "Content-Type: application/json" \
  -d @triage-warden-dashboard.json \
  http://admin:admin@localhost:3000/api/dashboards/db

Dashboard Provisioning

For automatic dashboard provisioning in Kubernetes:

# ConfigMap for dashboard provisioning
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  labels:
    grafana_dashboard: "1"
data:
  triage-warden.json: |
    {
      "dashboard": {
        "title": "Triage Warden",
        "panels": [...]
      }
    }

Example Panel Queries

Requests per Second:

sum(rate(http_requests_total{job="triage-warden"}[5m]))

Error Rate:

sum(rate(http_requests_total{job="triage-warden",status=~"5.."}[5m])) /
sum(rate(http_requests_total{job="triage-warden"}[5m])) * 100

P99 Latency:

histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="triage-warden"}[5m])) by (le)
)

Incidents by Status:

triage_warden_incidents_total{job="triage-warden"}

Cache Hit Ratio:

sum(rate(cache_hits_total[5m])) /
(sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100

Logging

Log Format

Triage Warden outputs structured JSON logs:

{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "info",
  "target": "tw_api::routes::incidents",
  "message": "Incident created",
  "incident_id": "123e4567-e89b-12d3-a456-426614174000",
  "severity": "high",
  "source": "crowdstrike",
  "trace_id": "abc123",
  "span_id": "def456"
}

Log Aggregation

Loki Configuration:

# promtail config
scrape_configs:
  - job_name: triage-warden
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: triage-warden
        action: keep
    pipeline_stages:
      - json:
          expressions:
            level: level
            incident_id: incident_id
            trace_id: trace_id
      - labels:
          level:
          incident_id:

Elasticsearch/Fluentd:

# Fluentd config
<match kubernetes.var.log.containers.triage-warden**>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name triage-warden
  <buffer>
    @type file
    path /var/log/fluentd-buffers/triage-warden
  </buffer>
</match>

Log Queries

Find errors:

level:ERROR

Slow requests:

duration_ms:>1000

Specific user actions:

user.id:"user-uuid" AND target:*auth*
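Because the logs are structured JSON (see Log Format above), the same filters can be applied locally with jq before logs reach the aggregation system. A sketch with an inlined sample line (the field values here are illustrative):

```shell
# Filter a structured JSON log line for errors and print the message
log='{"timestamp":"2024-01-15T10:30:00.000Z","level":"error","message":"Enrichment failed","incident_id":"abc-123"}'
msg=$(echo "$log" | jq -r 'select(.level == "error") | .message')
echo "$msg"   # Enrichment failed
```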

Distributed Tracing

OpenTelemetry Configuration

# Environment variables
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=triage-warden
OTEL_TRACES_EXPORTER=otlp

Trace Propagation

Triage Warden propagates trace context through:

  • HTTP headers (W3C Trace Context)
  • Message queue metadata
  • Internal async tasks
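A W3C Trace Context header has the shape `version-traceid-spanid-flags`: a 2-hex-digit version, a 32-hex-digit trace ID, a 16-hex-digit span ID, and 2-hex-digit flags. A sketch of constructing one for a manual test request:

```shell
# Build a W3C traceparent header: version-traceid-spanid-flags
trace_id=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')  # 16 bytes -> 32 hex chars
span_id=$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')    # 8 bytes -> 16 hex chars
traceparent="00-${trace_id}-${span_id}-01"                        # flags=01: sampled
echo "$traceparent"
```

Passing this as a `traceparent` header on an API request lets you correlate the resulting spans in your tracing backend.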

Health Check Integration

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

Health Status Interpretation

| Status | HTTP Code | Meaning |
|--------|-----------|---------|
| healthy | 200 | All systems operational |
| degraded | 200 | Non-critical issues |
| unhealthy | 503 | Critical component failure |
| halted | 200 | Kill switch active |

Synthetic Monitoring

# blackbox-exporter probe
modules:
  http_triage_warden:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      fail_if_body_not_matches_regexp:
        - '"status":"healthy"'

Uptime Monitoring

Configure external uptime monitoring (Pingdom, UptimeRobot, etc.) to check:

  • https://triage.example.com/live - Basic availability
  • https://triage.example.com/ready - Full readiness

SLO/SLI Definitions

Availability SLO

Target: 99.9% availability

# SLI: Successful requests / Total requests
sum(rate(http_requests_total{job="triage-warden",status!~"5.."}[30d])) /
sum(rate(http_requests_total{job="triage-warden"}[30d]))

Latency SLO

Target: 99% of requests < 500ms

# SLI: Requests under threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{job="triage-warden",le="0.5"}[30d])) /
sum(rate(http_request_duration_seconds_count{job="triage-warden"}[30d]))

Error Budget

# Remaining error budget
1 - (
  (1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d])))) /
  (1 - 0.999)
)
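The error budget expression above reduces to `1 - (observed error rate / allowed error rate)`. A sketch with sample request counts (the numbers are illustrative):

```shell
# Remaining error budget for a 99.9% SLO, given sample request counts
total=1000000
errors=500
remaining=$(awk -v t="$total" -v e="$errors" 'BEGIN {
  slo = 0.999
  error_rate = e / t        # observed error rate: 0.0005
  budget = 1 - slo          # allowed error rate: 0.001
  printf "%.2f", 1 - error_rate / budget
}')
echo "$remaining"   # 0.50
```

Here half the 30-day budget has been consumed: error rate 0.05% against an allowance of 0.1%.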

Troubleshooting with Metrics

High Latency Investigation

# Identify slow endpoints
topk(5,
  histogram_quantile(0.99,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (path, le)
  )
)

# Check database query time
histogram_quantile(0.99,
  rate(db_query_duration_seconds_bucket[5m])
)

Memory Issues

# Memory growth rate
deriv(process_resident_memory_bytes{job="triage-warden"}[1h])

# Compare to limits
container_memory_usage_bytes / container_spec_memory_limit_bytes

Queue Bottlenecks

# Processing rate vs arrival rate
rate(mq_messages_processed_total[5m]) - rate(mq_messages_received_total[5m])

# Time in queue
histogram_quantile(0.95, rate(mq_message_wait_seconds_bucket[5m]))

Next Steps

Horizontal Scaling Guide

This guide covers scaling Triage Warden horizontally to handle increased load and ensure high availability.

Architecture Overview

Triage Warden consists of two main components that scale differently:

                    ┌─────────────────────┐
                    │   Load Balancer     │
                    │  (Traefik/nginx)    │
                    └──────────┬──────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   API Server  │      │   API Server  │      │   API Server  │
│   (stateless) │      │   (stateless) │      │   (stateless) │
└───────┬───────┘      └───────┬───────┘      └───────┬───────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│ Orchestrator  │      │ Orchestrator  │      │ Orchestrator  │
│   (worker)    │      │   (worker)    │      │   (leader)    │
└───────┬───────┘      └───────┬───────┘      └───────┬───────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│    Redis      │      │  PostgreSQL   │      │  PostgreSQL   │
│  (MQ + Cache) │      │   (primary)   │      │   (replica)   │
└───────────────┘      └───────────────┘      └───────────────┘

Scaling Components

API Servers

API servers are stateless and can be scaled horizontally without coordination.

When to Scale:

  • CPU utilization > 70% sustained
  • Request latency P99 > 500ms
  • Concurrent connections approaching limits

Scaling Method:

# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-warden-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Helm Configuration:

api:
  replicas: 3
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

Orchestrators

Orchestrators process incidents asynchronously. They use leader election for singleton tasks (scheduled jobs, metrics aggregation) while allowing parallel incident processing across all instances.

When to Scale:

  • Incident queue depth increasing
  • Mean time to triage increasing
  • Worker CPU utilization > 70%

Scaling Considerations:

  1. Leader Tasks: Only one orchestrator runs scheduled jobs
  2. Worker Tasks: All orchestrators process incidents from the queue
  3. State Sharing: Uses Redis for message queue and coordination

Configuration:

orchestrator:
  replicas: 3
  leaderElection:
    enabled: true
    leaseDuration: 15s
    renewDeadline: 10s
    retryPeriod: 2s

When to Scale

Metrics to Monitor

| Metric | Warning Threshold | Critical Threshold | Action |
|--------|-------------------|--------------------|--------|
| http_request_duration_seconds P99 | > 500ms | > 1s | Scale API |
| cpu_usage_percent | > 70% | > 85% | Scale component |
| memory_usage_percent | > 80% | > 90% | Scale or optimize |
| incident_queue_depth | > 100 | > 500 | Scale orchestrators |
| db_connection_pool_waiting | > 0 | > 5 | Increase pool size |
| redis_connected_clients | > 80% max | > 95% max | Scale Redis |

Capacity Planning

API Server Capacity (per instance):

  • ~500 requests/second (simple endpoints)
  • ~100 requests/second (complex queries)
  • ~50 concurrent WebSocket connections

Orchestrator Capacity (per instance):

  • ~10 incidents processed concurrently
  • ~5 concurrent LLM analysis calls
  • ~20 concurrent enrichment requests
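These per-instance figures translate directly into an instance count for a target load. A sketch using the ~10 concurrent incidents per orchestrator from above (the target load of 25 is illustrative):

```shell
# Orchestrator instances needed for a target concurrent-incident load,
# assuming ~10 concurrent incidents per instance
concurrent_incidents=25
per_instance=10
instances=$(( (concurrent_incidents + per_instance - 1) / per_instance ))  # ceiling division
echo "$instances"   # 3
```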

Scaling Decision Matrix

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| High API latency | API overloaded | Scale API servers |
| Growing queue depth | Orchestrators overloaded | Scale orchestrators |
| Database timeouts | Connection exhaustion | Increase pool, add replicas |
| Cache misses high | Cache too small | Increase Redis memory |
| LLM rate limits | Too many concurrent calls | Add rate limiting, queue |

Database Scaling

Connection Pooling

Each instance maintains a connection pool. Total connections:

Total = API_instances * pool_size + Orchestrator_instances * pool_size

Example: 3 API + 2 Orchestrator with pool_size=15:

Total = (3 * 15) + (2 * 15) = 75 connections
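The same arithmetic as a script, useful when sizing `max_connections` on the PostgreSQL side:

```shell
# Total PostgreSQL connections for the example above:
# (API instances + orchestrator instances) * per-instance pool size
api_instances=3
orchestrator_instances=2
pool_size=15
total=$(( (api_instances + orchestrator_instances) * pool_size ))
echo "$total"   # 75
```

Leave headroom above this total for superuser connections, migrations, and ad-hoc psql sessions.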

Configuration:

database:
  max_connections: 15  # Per instance
  min_connections: 2
  connect_timeout: 30

Read Replicas

For read-heavy workloads, configure read replicas:

database:
  primary_url: "postgres://user:pass@primary:5432/db"
  replica_url: "postgres://user:pass@replica:5432/db"
  read_replica_enabled: true

Connection Pooler (PgBouncer)

For large deployments, use PgBouncer:

# Kubernetes ConfigMap for PgBouncer
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    triage_warden = host=postgres port=5432 dbname=triage_warden

    [pgbouncer]
    listen_port = 6432
    listen_addr = 0.0.0.0
    auth_type = md5
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 50

Redis Scaling

Standalone vs Cluster

Standalone (default): Suitable for most deployments

  • Up to ~100k ops/second
  • Single point of failure (use replica for HA)

Cluster: For high-throughput requirements

  • Horizontal scaling across nodes
  • Automatic sharding

Redis Configuration

redis:
  architecture: replication  # standalone, replication, cluster
  master:
    resources:
      limits:
        memory: 2Gi
  replica:
    replicaCount: 2

Cache Sizing

Calculate cache memory needs:

Memory = average_entry_size * expected_entries * 1.5 (overhead)

Example: 1KB average, 100k entries:

Memory = 1KB * 100,000 * 1.5 = 150MB
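The same sizing calculation as a script (decimal units, matching the 150MB figure above):

```shell
# Cache memory for the example above: 1 KB average entry, 100k entries, 1.5x overhead
entry_kb=1
entries=100000
memory_mb=$(awk -v s="$entry_kb" -v n="$entries" 'BEGIN { printf "%.0f", s * n * 1.5 / 1000 }')
echo "${memory_mb}MB"   # 150MB
```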

Load Balancer Configuration

Health Checks

Configure proper health checks for load balancing:

# Traefik
- "traefik.http.services.api.loadbalancer.healthcheck.path=/ready"
- "traefik.http.services.api.loadbalancer.healthcheck.interval=5s"
- "traefik.http.services.api.loadbalancer.healthcheck.timeout=3s"

Session Affinity

For WebSocket connections, enable sticky sessions:

# Traefik
- "traefik.http.services.api.loadbalancer.sticky.cookie.name=tw_server"
- "traefik.http.services.api.loadbalancer.sticky.cookie.httpOnly=true"

Rate Limiting

Configure rate limiting at the load balancer level:

# Traefik rate limiting middleware
http:
  middlewares:
    rate-limit:
      rateLimit:
        average: 100
        burst: 50
        period: 1s

Kubernetes Autoscaling

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-warden-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric scaling (requires Prometheus adapter)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

Vertical Pod Autoscaler (VPA)

For automatic resource adjustment:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: triage-warden-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden-api
  updatePolicy:
    updateMode: "Auto"  # or "Off" for recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: triage-warden
        minAllowed:
          cpu: 250m
          memory: 256Mi
        maxAllowed:
          cpu: 4
          memory: 4Gi

Pod Disruption Budget

Ensure availability during scaling:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triage-warden-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
      app.kubernetes.io/component: api

Scaling Best Practices

1. Scale Gradually

  • Increase by 25-50% at a time
  • Monitor for 10-15 minutes before next scale
  • Watch for downstream bottlenecks

2. Test Scale Limits

# Load testing with k6
k6 run --vus 100 --duration 5m load-test.js

3. Set Resource Limits

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

4. Use Pod Anti-Affinity

Spread pods across nodes:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: triage-warden
          topologyKey: kubernetes.io/hostname

5. Configure Topology Spread

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: triage-warden

Troubleshooting Scaling Issues

Pods Not Scaling Up

# Check HPA status
kubectl describe hpa triage-warden-api

# Check metrics availability
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq

# Check events
kubectl get events --sort-by='.lastTimestamp' | grep -i scale

Pods Stuck Pending

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check pod events
kubectl describe pod <pod-name> | grep -A 10 Events

Scaling Oscillation

If pods scale up and down frequently:

  1. Increase stabilization window
  2. Adjust metric thresholds
  3. Add cooldown periods

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # 10 min

Next Steps

Backup & Restore

Procedures for backing up and restoring Triage Warden data.

Overview

Triage Warden stores all persistent data in PostgreSQL. Regular backups are essential for disaster recovery.

What to backup:

  • PostgreSQL database (all data)
  • Configuration files (optional, if customized)
  • TLS certificates (if not using cert-manager)

What NOT to backup:

  • Application containers (stateless, rebuilt from image)
  • Logs (should be in log aggregation system)
  • Metrics (stored in Prometheus)

Backup Procedures

Manual Backup

Docker

# Create backup directory
mkdir -p /backups/triage-warden

# Create timestamped backup
BACKUP_FILE="/backups/triage-warden/backup-$(date +%Y%m%d-%H%M%S).sql"

docker compose exec -T postgres pg_dump \
  -U triage_warden \
  --format=custom \
  --compress=9 \
  triage_warden > "$BACKUP_FILE"

# Verify backup
pg_restore --list "$BACKUP_FILE" | head -20

echo "Backup created: $BACKUP_FILE ($(du -h $BACKUP_FILE | cut -f1))"

Kubernetes

# Get PostgreSQL pod
PG_POD=$(kubectl get pods -n triage-warden -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}')

# Create backup
BACKUP_FILE="backup-$(date +%Y%m%d-%H%M%S).sql"

kubectl exec -n triage-warden $PG_POD -- \
  pg_dump -U triage_warden --format=custom --compress=9 triage_warden \
  > "$BACKUP_FILE"

# Upload to S3 (optional)
aws s3 cp "$BACKUP_FILE" s3://your-backup-bucket/triage-warden/

Automated Backup

Docker (Cron)

# /etc/cron.d/triage-warden-backup
0 2 * * * root /opt/triage-warden/scripts/backup.sh >> /var/log/triage-warden-backup.log 2>&1

#!/bin/bash
# /opt/triage-warden/scripts/backup.sh

set -e

BACKUP_DIR="/backups/triage-warden"
RETENTION_DAYS=30
BACKUP_FILE="$BACKUP_DIR/backup-$(date +%Y%m%d-%H%M%S).sql"

# Create backup
cd /opt/triage-warden
docker compose exec -T postgres pg_dump \
  -U triage_warden \
  --format=custom \
  --compress=9 \
  triage_warden > "$BACKUP_FILE"

# Verify backup
if ! pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
  echo "ERROR: Backup verification failed"
  rm -f "$BACKUP_FILE"
  exit 1
fi

# Cleanup old backups
find "$BACKUP_DIR" -name "backup-*.sql" -mtime +$RETENTION_DAYS -delete

echo "Backup completed: $BACKUP_FILE"

Kubernetes (CronJob)

apiVersion: batch/v1
kind: CronJob
metadata:
  name: triage-warden-backup
  namespace: triage-warden
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:15-alpine
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-postgresql
                      key: postgres-password
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  BACKUP_FILE="/backups/backup-$(date +%Y%m%d-%H%M%S).sql"
                  pg_dump -h postgres-postgresql -U triage_warden \
                    --format=custom --compress=9 triage_warden > "$BACKUP_FILE"
                  echo "Backup completed: $BACKUP_FILE"
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc

Restore Procedures

Prerequisites

  1. Stop the Triage Warden application (to prevent data conflicts)
  2. Have the backup file accessible
  3. Database credentials available

Full Restore

Docker

# Stop application
docker compose stop triage-warden

# Restore from backup
docker compose exec -T postgres pg_restore \
  -U triage_warden \
  --clean \
  --if-exists \
  --no-owner \
  -d triage_warden < /path/to/backup.sql

# Start application
docker compose start triage-warden

# Verify
curl http://localhost:8080/health | jq

Kubernetes

# Scale down application
kubectl scale deployment triage-warden -n triage-warden --replicas=0

# Get PostgreSQL pod
PG_POD=$(kubectl get pods -n triage-warden -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}')

# Copy backup to pod
kubectl cp backup.sql triage-warden/$PG_POD:/tmp/backup.sql

# Restore
kubectl exec -n triage-warden $PG_POD -- \
  pg_restore -U triage_warden --clean --if-exists --no-owner \
  -d triage_warden /tmp/backup.sql

# Scale up application
kubectl scale deployment triage-warden -n triage-warden --replicas=3

# Verify
kubectl exec -it deployment/triage-warden -n triage-warden -- curl -s localhost:8080/health

Point-in-Time Recovery

For point-in-time recovery, enable PostgreSQL WAL archiving:

# PostgreSQL configuration
archive_mode: on
archive_command: 'aws s3 cp %p s3://your-bucket/wal/%f'

Recovery procedure:

# 1. Stop PostgreSQL
# 2. Clear data directory
# 3. Restore base backup
# 4. Create recovery.signal
# 5. Set recovery_target_time in postgresql.conf
# 6. Start PostgreSQL

Verification

After any restore, verify:

# 1. Health check passes
curl http://localhost:8080/health | jq '.status'
# Expected: "healthy"

# 2. Recent incidents exist
curl http://localhost:8080/api/incidents | jq '. | length'

# 3. User can login
# Test via UI or API

# 4. Connectors configured
curl http://localhost:8080/health/detailed | jq '.components.connectors'

Backup Storage

Local Storage

  • Pros: Simple, fast
  • Cons: Single point of failure
  • Recommendation: Development only

Cloud Storage (S3/GCS/Azure Blob)

# Upload to S3
aws s3 cp backup.sql s3://bucket/triage-warden/backup-$(date +%Y%m%d).sql

# Download from S3
aws s3 cp s3://bucket/triage-warden/backup-20240115.sql ./restore.sql

Encryption

Encrypt backups before storing:

# Encrypt backup
gpg --symmetric --cipher-algo AES256 backup.sql

# Decrypt for restore
gpg --decrypt backup.sql.gpg > backup.sql

Disaster Recovery Plan

RTO/RPO Targets

| Metric | Target |
|--------|--------|
| Recovery Time Objective (RTO) | 4 hours |
| Recovery Point Objective (RPO) | 24 hours |

Recovery Steps

  1. Assess the situation

    • Determine extent of data loss
    • Identify latest valid backup
  2. Provision new infrastructure

    • Deploy new database instance
    • Deploy new application instances
  3. Restore data

    • Restore database from backup
    • Verify data integrity
  4. Reconfigure

    • Update DNS/load balancer
    • Reconfigure connectors if needed
    • Reset API keys if compromised
  5. Verify and communicate

    • Run health checks
    • Test critical workflows
    • Notify stakeholders

Testing Schedule

| Test | Frequency | Last Tested |
|------|-----------|-------------|
| Backup verification | Weekly | |
| Restore to test environment | Monthly | |
| Full DR simulation | Quarterly | |

Troubleshooting Guide

Common issues and their solutions.

Quick Diagnostics

# Check overall health
curl -s http://localhost:8080/health/detailed | jq

# Check logs for errors (last 100 lines)
docker compose logs --tail=100 triage-warden | grep -i error

# Check resource usage
docker stats --no-stream

Common Issues

Service Won't Start

Symptoms

  • Container exits immediately
  • "Connection refused" errors
  • Health check fails

Diagnosis

# Check container logs
docker compose logs triage-warden

# Check exit code
docker compose ps -a

Common Causes & Solutions

Missing environment variables:

Error: Required environment variable TW_ENCRYPTION_KEY not set

Solution: Ensure all required env vars are set in .env

Database connection failed:

Error: Failed to connect to database: Connection refused

Solution:

  1. Verify PostgreSQL is running: docker compose ps postgres
  2. Check DATABASE_URL is correct
  3. Verify network connectivity

Invalid encryption key:

Error: Invalid encryption key: must be 32 bytes base64-encoded

Solution: Generate new key: openssl rand -base64 32


Database Connection Issues

Symptoms

  • /ready returns 503
  • "Database unavailable" in health check
  • Queries timing out

Diagnosis

# Check database health
docker compose exec postgres pg_isready -U triage_warden

# Check connection count
docker compose exec postgres psql -U triage_warden -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname = 'triage_warden';"

# Check for locks
docker compose exec postgres psql -U triage_warden -c \
  "SELECT * FROM pg_locks WHERE NOT granted;"

Solutions

Connection pool exhausted:

# Increase max connections in docker-compose.yml
DATABASE_MAX_CONNECTIONS=50

# Or kill idle connections
docker compose exec postgres psql -U triage_warden -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE datname = 'triage_warden' AND state = 'idle' AND pid <> pg_backend_pid();"

PostgreSQL not ready:

# Wait for PostgreSQL to be ready
until docker compose exec postgres pg_isready -U triage_warden; do
  echo "Waiting for PostgreSQL..."
  sleep 2
done

Authentication Issues

Symptoms

  • "Invalid credentials" on login
  • "Session expired" errors
  • API returns 401

Diagnosis

# Check if user exists
docker compose exec postgres psql -U triage_warden -c \
  "SELECT username, enabled, last_login_at FROM users;"

# Check session configuration
curl -s http://localhost:8080/health/detailed | jq '.components'

Solutions

Reset admin password:

# Generate a new bcrypt password hash (requires htpasswd from apache2-utils)
NEW_HASH=$(htpasswd -bnBC 10 "" "newpassword" | tr -d ':\n')

# Update in database
docker compose exec postgres psql -U triage_warden -c \
  "UPDATE users SET password_hash = '$NEW_HASH' WHERE username = 'admin';"

Clear sessions:

docker compose exec postgres psql -U triage_warden -c \
  "DELETE FROM sessions;"

User account disabled:

docker compose exec postgres psql -U triage_warden -c \
  "UPDATE users SET enabled = true WHERE username = 'admin';"

LLM/AI Features Not Working

Symptoms

  • "LLM analysis failed" errors
  • No AI verdicts on incidents
  • Empty analysis in incident details

Diagnosis

# Check LLM configuration
curl -s http://localhost:8080/health/detailed | jq '.components.llm'

# Check for API key
docker compose exec triage-warden env | grep -E "(OPENAI|ANTHROPIC)_API_KEY"

# Check LLM settings in database
docker compose exec postgres psql -U triage_warden -c \
  "SELECT provider, model, enabled FROM settings WHERE key = 'llm';"

Solutions

API key not configured:

# Set via environment variable
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
docker compose up -d

LLM disabled: Configure via UI: Settings → AI/LLM → Enable toggle

Rate limited: Check provider dashboard for rate limit status. Consider:

  • Upgrading API tier
  • Reducing max_tokens (many provider rate limits are token-based)
  • Adding request delays
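One common way to add request delays is exponential backoff with jitter around the LLM call; a minimal Python sketch (the retried function is a stand-in for your provider client, not an actual Triage Warden API):

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Yield capped exponential backoff delays (seconds) with full jitter."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def with_retries(fn, max_retries=5):
    """Call fn(), sleeping between attempts; the final attempt propagates errors."""
    for delay in backoff_delays(max_retries):
        try:
            return fn()
        except Exception:
            time.sleep(delay)
    return fn()
```

Full jitter (a uniform draw up to the cap) spreads retries out so that many clients rate-limited at the same moment do not all retry in lockstep.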

Connector Failures

Symptoms

  • "Connector error" status in settings
  • Failed enrichments
  • Missing threat intel data

Diagnosis

# Check connector status
curl -s http://localhost:8080/health/detailed | jq '.components.connectors'

# Test specific connector
curl -X POST http://localhost:8080/api/connectors/{id}/test

Solutions by Connector

VirusTotal:

  • Verify API key is valid
  • Check rate limits (4 req/min for free tier)
  • Ensure outbound HTTPS to virustotal.com allowed

Jira:

  • Verify base URL (include /rest/api/3)
  • Use API token, not password
  • Check project key exists

CrowdStrike:

  • Verify OAuth client credentials
  • Check API scopes granted
  • Verify region (us-1, us-2, eu-1)

Splunk:

  • Verify HEC token is valid
  • Check SSL certificate if using HTTPS
  • Verify index exists

High Memory Usage

Symptoms

  • Container OOM killed
  • Slow response times
  • "Out of memory" errors

Diagnosis

# Check container memory
docker stats --no-stream triage-warden

# Check for memory leaks (trending)
docker stats triage-warden  # Watch over time

Solutions

Increase memory limits:

# docker-compose.yml
deploy:
  resources:
    limits:
      memory: 4G

Reduce connection pool:

DATABASE_MAX_CONNECTIONS=5

Enable verbose logging to pinpoint memory-heavy code paths (note: Rust has no garbage collector to tune):

RUST_LOG=info,triage_warden=debug

Slow Performance

Symptoms

  • High latency on API calls
  • Dashboard loads slowly
  • Timeouts on queries

Diagnosis

# Check response times
curl -w "@curl-format.txt" -s http://localhost:8080/health -o /dev/null

# Check database query times (requires the pg_stat_statements extension)
docker compose exec postgres psql -U triage_warden -c \
  "SELECT query, mean_exec_time, calls FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Check for table bloat
docker compose exec postgres psql -U triage_warden -c \
  "SELECT relname, n_dead_tup, n_live_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"
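The `@curl-format.txt` referenced above is a curl write-out template file; a minimal version might look like this (all variables are standard `curl -w` write-out variables):

```
   time_namelookup:  %{time_namelookup}s\n
      time_connect:  %{time_connect}s\n
   time_appconnect:  %{time_appconnect}s\n
time_starttransfer:  %{time_starttransfer}s\n
        time_total:  %{time_total}s\n
```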

Solutions

Add database indexes:

-- Common helpful indexes
CREATE INDEX idx_incidents_created_at ON incidents(created_at DESC);
CREATE INDEX idx_incidents_severity ON incidents(severity);
CREATE INDEX idx_audit_log_timestamp ON audit_log(timestamp DESC);

Vacuum database:

docker compose exec postgres psql -U triage_warden -c "VACUUM ANALYZE;"

Statement caching: prepared-statement caching is already enabled by default in the connection pool; no action needed.


Kill Switch Issues

Symptoms

  • Automation stopped unexpectedly
  • "Kill switch active" warnings
  • Actions blocked

Diagnosis

# Check kill switch status
curl -s http://localhost:8080/api/kill-switch | jq

# Check who activated it
curl -s http://localhost:8080/health/detailed | jq '.components.kill_switch'

Solutions

Deactivate kill switch:

curl -X POST http://localhost:8080/api/kill-switch/deactivate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Confirmed safe to resume"}'

Or via UI: Settings → Safety → Re-enable Automation


Webhook Not Receiving Events

Symptoms

  • No incidents created from SIEM
  • Webhook endpoint returns errors
  • Events missing

Diagnosis

# Test webhook endpoint
curl -X POST http://localhost:8080/api/webhooks/generic \
  -H "Content-Type: application/json" \
  -d '{"title": "Test Alert", "severity": "medium"}'

# Check webhook logs
docker compose logs triage-warden | grep -i webhook

Solutions

Signature validation failing:

  • Verify webhook secret matches source configuration
  • Check signature header name (X-Signature, X-Hub-Signature-256, etc.)
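Signature checks are typically an HMAC of the raw request body; a minimal Python sketch of GitHub-style `X-Hub-Signature-256` verification (the header format and secret are illustrative, check your source's webhook documentation):

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Compare a 'sha256=<hex>' header against the HMAC of the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels
    return hmac.compare_digest(expected, signature_header)
```

Note that the HMAC must be computed over the raw bytes as received; re-serializing the parsed JSON will usually produce a different digest.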

Payload format incorrect:

  • Check source webhook format documentation
  • Use generic webhook with custom mapping

Firewall blocking:

  • Ensure source IP can reach webhook endpoint
  • Check for WAF rules blocking requests

Diagnostic Commands

Get System Info

# Application version
curl -s http://localhost:8080/health | jq '.version'

# Database version
docker compose exec postgres psql -U triage_warden -c "SELECT version();"

# Container info
docker compose version
docker version

Export Debug Bundle

#!/bin/bash
# Create debug bundle
BUNDLE_DIR="debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE_DIR"

# Health check
curl -s http://localhost:8080/health/detailed > "$BUNDLE_DIR/health.json"

# Recent logs
docker compose logs --tail=1000 triage-warden > "$BUNDLE_DIR/app.log"
docker compose logs --tail=500 postgres > "$BUNDLE_DIR/db.log"

# Configuration (redacted)
docker compose config | grep -v -E "(PASSWORD|SECRET|KEY)" > "$BUNDLE_DIR/config.yml"

# Create archive
tar -czf "$BUNDLE_DIR.tar.gz" "$BUNDLE_DIR"
rm -rf "$BUNDLE_DIR"

echo "Debug bundle: $BUNDLE_DIR.tar.gz"

Getting Help

If you can't resolve the issue:

  1. Check GitHub Issues for known issues
  2. Create a new issue with:
    • Triage Warden version
    • Deployment method (Docker/K8s)
    • Error messages
    • Debug bundle (with secrets redacted)
  3. Contact support: [email protected]

Contributing

Guide to contributing to Triage Warden.

Getting Started

  1. Fork the repository
  2. Clone your fork
  3. Set up the development environment
  4. Create a branch for your changes
  5. Submit a pull request

Development Setup

Prerequisites

  • Rust 1.75+
  • Python 3.11+
  • uv (Python package manager)
  • SQLite (for development)

Initial Setup

# Clone repository
git clone https://github.com/your-username/triage-warden.git
cd triage-warden

# Install Rust dependencies
cargo build

# Install Python dependencies
cd python
uv sync
cd ..

# Run tests
cargo test
cd python && uv run pytest

Code Style

Rust

  • Follow standard Rust conventions
  • Run cargo fmt before committing
  • Run cargo clippy and fix warnings
  • Document public APIs with doc comments

Python

  • Follow PEP 8
  • Run ruff check and black before committing
  • Type hints required (mypy strict mode)
  • Docstrings for public functions

Pre-commit Hooks

Pre-commit hooks are already configured for the repository and run automatically:

# The project has pre-commit configured in .git/hooks
# It runs automatically on commit:
# - cargo fmt
# - cargo clippy
# - ruff
# - black
# - mypy

Pull Request Process

  1. Create a branch

    git checkout -b feature/my-feature
    
  2. Make changes

    • Write code
    • Add tests
    • Update documentation
  3. Run checks

    cargo fmt && cargo clippy
    cargo test
    cd python && uv run pytest
    
  4. Commit

    git commit -m "feat: add new feature"
    
  5. Push and create PR

    git push origin feature/my-feature
    
  6. Address review feedback

Commit Messages

Follow conventional commits:

type(scope): description

[optional body]

[optional footer]

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation
  • refactor: Code refactoring
  • test: Adding tests
  • chore: Maintenance
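A commit header in this format can be checked with a small regex; a sketch whose allowed types mirror the list above (CI integration is left as an exercise):

```python
import re

# type, optional (scope), then ": " and a description
COMMIT_RE = re.compile(r"^(feat|fix|docs|refactor|test|chore)(\([a-z0-9-]+\))?: .+")

def is_conventional(header: str) -> bool:
    """Check the first line of a commit message against conventional-commit form."""
    return COMMIT_RE.match(header) is not None
```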

Testing

Rust Tests

# Run all tests
cargo test

# Run specific crate tests
cargo test -p tw-api

# Run with output
cargo test -- --nocapture

Python Tests

cd python
uv run pytest

# Run specific tests
uv run pytest tests/test_agents.py

# With coverage
uv run pytest --cov=tw_ai

Integration Tests

# Start test server
cargo run --bin tw-api &

# Run integration tests
./scripts/integration-tests.sh

Documentation

  • Update docs for API changes
  • Add examples for new features
  • Keep README.md current

Build docs locally:

cd docs-site
mdbook serve

Issue Reporting

When reporting issues:

  1. Search existing issues first
  2. Use issue templates
  3. Include:
    • Version information
    • Steps to reproduce
    • Expected vs actual behavior
    • Relevant logs

Questions

  • Open a GitHub Discussion
  • Check existing discussions first
  • Tag appropriately

License

By contributing, you agree that your contributions will be licensed under the MIT License.

Building from Source

Complete guide to building Triage Warden.

Prerequisites

Rust

# Install Rust via rustup
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Verify installation
rustc --version  # Should be 1.75+

Python

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify installation
uv --version

System Dependencies

macOS

brew install openssl pkg-config

Ubuntu/Debian

sudo apt-get install build-essential pkg-config libssl-dev

Fedora

sudo dnf install gcc openssl-devel pkgconfig

Building

Debug Build

cargo build

Outputs:

  • target/debug/tw-api
  • target/debug/tw-cli

Release Build

cargo build --release

Outputs:

  • target/release/tw-api
  • target/release/tw-cli

Python Package

cd python
uv sync
uv build

PyO3 Bridge

The bridge is built automatically with cargo:

cd tw-bridge
cargo build --release

Build Options

Feature Flags

# Build with PostgreSQL support only
cargo build --no-default-features --features postgres

# Build with all features
cargo build --all-features

Cross-Compilation

# For Linux (from macOS)
rustup target add x86_64-unknown-linux-gnu
cargo build --release --target x86_64-unknown-linux-gnu

# For musl (static binary)
rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl

Docker Build

Build Image

docker build -t triage-warden .

Multi-Stage Dockerfile

# Builder stage
FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

# Runtime stage
FROM debian:bookworm-slim
# ca-certificates is needed for outbound TLS (connectors, LLM APIs)
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/tw-api /usr/local/bin/
CMD ["tw-api"]

Verification

Run Tests

# Rust tests
cargo test

# Python tests
cd python && uv run pytest

# All tests
./scripts/test-all.sh

Linting

# Rust
cargo fmt --check
cargo clippy -- -D warnings

# Python
cd python
uv run ruff check
uv run black --check .
uv run mypy .

Smoke Test

# Start server
./target/release/tw-api &

# Health check
curl http://localhost:8080/health

# Stop server
kill %1

Troubleshooting

OpenSSL Errors

# macOS
export OPENSSL_DIR=$(brew --prefix openssl)

# Linux
export OPENSSL_DIR=/usr

PyO3 Build Issues

# Ensure Python is found
export PYO3_PYTHON=$(which python3)

# Clean and rebuild
cargo clean -p tw-bridge
cargo build -p tw-bridge

Out of Memory

# Reduce parallel jobs
cargo build -j 2

Testing

Guide to testing Triage Warden.

Test Structure

triage-warden/
├── crates/
│   ├── tw-api/src/
│   │   └── tests/           # API integration tests
│   ├── tw-core/src/
│   │   └── tests/           # Core unit tests
│   └── tw-actions/src/
│       └── tests/           # Action handler tests
└── python/
    └── tests/               # Python tests

Running Tests

All Tests

# Rust
cargo test

# Python
cd python && uv run pytest

# Everything
./scripts/test-all.sh

Specific Tests

# Single crate
cargo test -p tw-api

# Single test
cargo test test_incident_creation

# Pattern match
cargo test incident

# With output
cargo test -- --nocapture

Unit Tests

Rust Unit Tests

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_incident_creation() {
        let incident = Incident::new(
            IncidentType::Phishing,
            Severity::High,
        );
        assert_eq!(incident.status, IncidentStatus::Open);
    }

    #[tokio::test]
    async fn test_async_operation() {
        let result = async_function().await;
        assert!(result.is_ok());
    }
}

Python Unit Tests

import pytest
from tw_ai.agents import TriageAgent

def test_agent_creation():
    agent = TriageAgent()
    assert agent.model == "claude-sonnet-4-20250514"

@pytest.mark.asyncio
async def test_triage():
    agent = TriageAgent()
    verdict = await agent.triage(mock_incident)
    assert verdict.classification in ["malicious", "benign"]

Integration Tests

API Integration Tests

#[tokio::test]
async fn test_incident_api() {
    let app = create_test_app().await;

    // Create incident
    let response = app
        .oneshot(
            Request::builder()
                .method("POST")
                .uri("/api/incidents")
                .header("Content-Type", "application/json")
                .body(Body::from(r#"{"type":"phishing"}"#))
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::CREATED);
}

Database Tests

#[tokio::test]
async fn test_repository() {
    // Use in-memory SQLite
    let pool = create_test_pool().await;
    let repo = SqliteIncidentRepository::new(pool);

    let incident = repo.create(&new_incident).await.unwrap();
    let found = repo.get(incident.id).await.unwrap();

    assert_eq!(found.unwrap().id, incident.id);
}

Test Fixtures

Rust Fixtures

// tests/fixtures.rs
pub fn mock_incident() -> Incident {
    Incident {
        id: Uuid::new_v4(),
        incident_type: IncidentType::Phishing,
        severity: Severity::High,
        status: IncidentStatus::Open,
        raw_data: json!({"subject": "Test"}),
        ..Default::default()
    }
}

Python Fixtures

# tests/conftest.py
import pytest

@pytest.fixture
def mock_incident():
    return {
        "id": "test-123",
        "type": "phishing",
        "severity": "high",
        "raw_data": {"subject": "Test Email"}
    }

@pytest.fixture
def mock_connector():
    return MockThreatIntelConnector()

Mocking

Rust Mocking

use mockall::mock;

mock! {
    ThreatIntelConnector {}

    #[async_trait]
    impl ThreatIntelConnector for ThreatIntelConnector {
        async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport>;
    }
}

#[tokio::test]
async fn test_with_mock() {
    let mut mock = MockThreatIntelConnector::new();
    mock.expect_lookup_hash()
        .returning(|_| Ok(ThreatReport::clean()));

    let result = function_using_connector(&mock).await;
    assert!(result.is_ok());
}

Python Mocking

from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_with_mock():
    with patch("tw_ai.agents.tools.lookup_hash") as mock:
        mock.return_value = {"malicious": False}

        agent = TriageAgent()
        verdict = await agent.triage(mock_incident)

        mock.assert_called_once()

Test Coverage

Rust Coverage

cargo install cargo-tarpaulin
cargo tarpaulin --out Html

Python Coverage

cd python
uv run pytest --cov=tw_ai --cov-report=html

CI Testing

GitHub Actions runs tests on every PR:

# .github/workflows/test.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo test
      - run: cargo clippy -- -D warnings

Test Data

Evaluation Test Cases

Test cases for AI triage evaluation:

# python/tw_ai/evaluation/test_cases/phishing.yaml
- name: obvious_phishing
  input:
    sender: "[email protected]"
    subject: "Urgent: Verify Account"
    urls: ["https://phishing-site.com/login"]
    auth_results: {spf: fail, dkim: fail}
  expected:
    classification: malicious
    min_confidence: 0.8
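A test case like this is checked by comparing a triage verdict against its `expected` block; a minimal sketch of that comparison (the verdict shape is an assumption for illustration, not the actual `tw_ai` evaluation API):

```python
def check_case(expected: dict, verdict: dict) -> bool:
    """Return True if a verdict satisfies a test case's expected block."""
    if verdict.get("classification") != expected["classification"]:
        return False
    # min_confidence is a floor; absent means any confidence passes
    return verdict.get("confidence", 0.0) >= expected.get("min_confidence", 0.0)
```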

Run evaluation:

cd python
uv run pytest tests/test_evaluation.py

Adding Connectors

Guide to implementing new connectors.

Connector Architecture

Connectors follow a trait-based pattern:

Connector Trait (base)
    │
    ├── ThreatIntelConnector
    ├── SIEMConnector
    ├── EDRConnector
    ├── EmailGatewayConnector
    └── TicketingConnector

Implementing a Connector

1. Create the File

touch crates/tw-connectors/src/threat_intel/my_provider.rs

2. Implement Base Trait

use crate::traits::{Connector, ConnectorError, ConnectorHealth, ConnectorResult};
use async_trait::async_trait;

pub struct MyProviderConnector {
    client: reqwest::Client,
    api_key: String,
    base_url: String,
}

impl MyProviderConnector {
    pub fn new(api_key: String) -> Result<Self, ConnectorError> {
        let client = reqwest::Client::builder()
            .timeout(std::time::Duration::from_secs(30))
            .build()
            .map_err(|e| ConnectorError::Configuration(e.to_string()))?;

        Ok(Self {
            client,
            api_key,
            base_url: "https://api.myprovider.com".to_string(),
        })
    }
}

#[async_trait]
impl Connector for MyProviderConnector {
    fn name(&self) -> &str {
        "my_provider"
    }

    fn connector_type(&self) -> &str {
        "threat_intel"
    }

    async fn health_check(&self) -> ConnectorResult<ConnectorHealth> {
        let response = self.client
            .get(format!("{}/health", self.base_url))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| ConnectorError::NetworkError(e.to_string()))?;

        if response.status().is_success() {
            Ok(ConnectorHealth::Healthy)
        } else {
            Ok(ConnectorHealth::Unhealthy {
                message: "Health check failed".to_string(),
            })
        }
    }

    async fn test_connection(&self) -> ConnectorResult<bool> {
        match self.health_check().await? {
            ConnectorHealth::Healthy => Ok(true),
            _ => Ok(false),
        }
    }
}

3. Implement Specialized Trait

use crate::traits::{ThreatIntelConnector, ThreatReport, IndicatorType};

#[async_trait]
impl ThreatIntelConnector for MyProviderConnector {
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport> {
        let response = self.client
            .get(format!("{}/files/{}", self.base_url, hash))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| ConnectorError::NetworkError(e.to_string()))?;

        if response.status() == reqwest::StatusCode::NOT_FOUND {
            return Ok(ThreatReport {
                indicator: hash.to_string(),
                indicator_type: IndicatorType::FileHash,
                malicious: false,
                confidence: 0.0,
                categories: vec![],
                first_seen: None,
                last_seen: None,
                sources: vec![],
            });
        }

        let data: ApiResponse = response.json().await
            .map_err(|e| ConnectorError::InvalidResponse(e.to_string()))?;

        Ok(self.convert_response(data))
    }

    async fn lookup_url(&self, url: &str) -> ConnectorResult<ThreatReport> {
        // Similar implementation
        todo!()
    }

    async fn lookup_domain(&self, domain: &str) -> ConnectorResult<ThreatReport> {
        // Similar implementation
        todo!()
    }

    async fn lookup_ip(&self, ip: &str) -> ConnectorResult<ThreatReport> {
        // Similar implementation
        todo!()
    }
}

4. Add to Module

// crates/tw-connectors/src/threat_intel/mod.rs
mod my_provider;
pub use my_provider::MyProviderConnector;

5. Register in Bridge

// tw-bridge/src/lib.rs
impl ThreatIntelBridge {
    pub fn new(mode: &str) -> PyResult<Self> {
        let connector: Arc<dyn ThreatIntelConnector + Send + Sync> = match mode {
            "virustotal" => Arc::new(VirusTotalConnector::new(
                std::env::var("TW_VIRUSTOTAL_API_KEY")
                    .map_err(|_| PyErr::new::<pyo3::exceptions::PyValueError, _>(
                        "TW_VIRUSTOTAL_API_KEY not set"
                    ))?
            )?),
            "my_provider" => Arc::new(MyProviderConnector::new(
                std::env::var("TW_MY_PROVIDER_API_KEY")
                    .map_err(|_| PyErr::new::<pyo3::exceptions::PyValueError, _>(
                        "TW_MY_PROVIDER_API_KEY not set"
                    ))?
            )?),
            _ => Arc::new(MockThreatIntelConnector::new("mock")),
        };

        Ok(Self { connector })
    }
}

Error Handling

Use appropriate error types:

pub enum ConnectorError {
    /// Configuration issue
    Configuration(String),

    /// Network/connection error
    NetworkError(String),

    /// Authentication failed
    AuthenticationFailed(String),

    /// Resource not found
    NotFound(String),

    /// Rate limited
    RateLimited { retry_after: Option<Duration> },

    /// Invalid response from service
    InvalidResponse(String),

    /// Request failed
    RequestFailed(String),
}

Rate Limiting

Implement rate limiting in your connector:

use governor::{Quota, RateLimiter};

pub struct MyProviderConnector {
    client: reqwest::Client,
    api_key: String,
    rate_limiter: RateLimiter<...>,
}

impl MyProviderConnector {
    async fn make_request(&self, url: &str) -> ConnectorResult<Response> {
        self.rate_limiter.until_ready().await;

        self.client.get(url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| ConnectorError::NetworkError(e.to_string()))
    }
}

Testing

Unit Tests

#[cfg(test)]
mod tests {
    use super::*;
    use wiremock::{MockServer, Mock, ResponseTemplate};
    use wiremock::matchers::{method, path};

    #[tokio::test]
    async fn test_lookup_hash() {
        let mock_server = MockServer::start().await;

        Mock::given(method("GET"))
            .and(path("/files/abc123"))
            .respond_with(ResponseTemplate::new(200).set_body_json(json!({
                "malicious": true,
                "confidence": 0.95
            })))
            .mount(&mock_server)
            .await;

        let connector = MyProviderConnector::with_base_url(
            "test-key".to_string(),
            mock_server.uri(),
        );

        let result = connector.lookup_hash("abc123").await.unwrap();
        assert!(result.malicious);
    }
}

Documentation

Document your connector:

//! MyProvider threat intelligence connector.
//!
//! # Configuration
//!
//! Set `TW_MY_PROVIDER_API_KEY` environment variable.
//!
//! # Example
//!
//! ```rust
//! let connector = MyProviderConnector::new(api_key)?;
//! let report = connector.lookup_hash("abc123").await?;
//! ```

Adding Actions

Guide to implementing new action handlers.

Action Architecture

Actions implement the Action trait:

#[async_trait]
pub trait Action: Send + Sync {
    fn name(&self) -> &str;
    fn description(&self) -> &str;
    fn required_parameters(&self) -> Vec<ParameterDef>;
    fn supports_rollback(&self) -> bool;

    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError>;

    async fn rollback(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        Err(ActionError::RollbackNotSupported)
    }
}

Implementing an Action

1. Create the File

touch crates/tw-actions/src/my_action.rs

2. Define the Action

use crate::registry::{
    Action, ActionContext, ActionError, ActionResult, ParameterDef, ParameterType,
};
use async_trait::async_trait;
use chrono::Utc;
use std::collections::HashMap;
use tracing::{info, instrument};

/// My custom action handler.
pub struct MyAction;

impl MyAction {
    pub fn new() -> Self {
        Self
    }
}

impl Default for MyAction {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl Action for MyAction {
    fn name(&self) -> &str {
        "my_action"
    }

    fn description(&self) -> &str {
        "Description of what this action does"
    }

    fn required_parameters(&self) -> Vec<ParameterDef> {
        vec![
            ParameterDef::required(
                "target",
                "The target of the action",
                ParameterType::String,
            ),
            ParameterDef::optional(
                "force",
                "Force the action even if conditions aren't met",
                ParameterType::Boolean,
                serde_json::json!(false),
            ),
        ]
    }

    fn supports_rollback(&self) -> bool {
        true
    }

    #[instrument(skip(self, context))]
    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        let started_at = Utc::now();

        // Get required parameter
        let target = context.require_string("target")?;

        // Get optional parameter with default
        let force = context
            .get_param("force")
            .and_then(|v| v.as_bool())
            .unwrap_or(false);

        info!("Executing my_action on target: {}", target);

        // Perform the action
        // ...

        // Build output
        let mut output = HashMap::new();
        output.insert("target".to_string(), serde_json::json!(target));
        output.insert("success".to_string(), serde_json::json!(true));

        Ok(ActionResult::success(
            self.name(),
            &format!("Action completed on {}", target),
            started_at,
            output,
        ))
    }

    async fn rollback(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        let started_at = Utc::now();
        let target = context.require_string("target")?;

        info!("Rolling back my_action on target: {}", target);

        // Perform rollback
        // ...

        let mut output = HashMap::new();
        output.insert("target".to_string(), serde_json::json!(target));

        Ok(ActionResult::success(
            &format!("{}_rollback", self.name()),
            &format!("Rollback completed on {}", target),
            started_at,
            output,
        ))
    }
}

3. Add to Module

// crates/tw-actions/src/lib.rs
mod my_action;
pub use my_action::MyAction;

4. Register in Registry

// crates/tw-actions/src/registry.rs
impl ActionRegistry {
    pub fn new() -> Self {
        let mut registry = Self {
            actions: HashMap::new(),
        };

        // Register built-in actions
        registry.register(Box::new(QuarantineEmailAction::new()));
        registry.register(Box::new(BlockSenderAction::new()));
        registry.register(Box::new(MyAction::new())); // Add here

        registry
    }
}

Parameter Types

Available parameter types:

pub enum ParameterType {
    String,
    Integer,
    Float,
    Boolean,
    List,
    Object,
}

Define parameters:

fn required_parameters(&self) -> Vec<ParameterDef> {
    vec![
        ParameterDef::required("name", "Description", ParameterType::String),
        ParameterDef::optional("count", "Description", ParameterType::Integer, json!(10)),
        ParameterDef::optional("tags", "Description", ParameterType::List, json!([])),
    ]
}

Using Connectors

Actions can use connectors via dependency injection:

pub struct MyAction {
    connector: Arc<dyn MyConnector + Send + Sync>,
}

impl MyAction {
    pub fn new(connector: Arc<dyn MyConnector + Send + Sync>) -> Self {
        Self { connector }
    }
}

#[async_trait]
impl Action for MyAction {
    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        // Use connector
        let result = self.connector.do_something().await
            .map_err(|e| ActionError::ExecutionFailed(e.to_string()))?;

        // ...
    }
}

Error Handling

Use appropriate error types:

pub enum ActionError {
    /// Missing or invalid parameters
    InvalidParameters(String),

    /// Execution failed
    ExecutionFailed(String),

    /// Action timed out
    Timeout,

    /// Rollback not supported
    RollbackNotSupported,

    /// Policy denied the action
    PolicyDenied(String),
}

Testing

Unit Tests

#[cfg(test)]
mod tests {
    use super::*;
    use uuid::Uuid;

    #[tokio::test]
    async fn test_my_action_success() {
        let action = MyAction::new();

        let context = ActionContext::new(Uuid::new_v4())
            .with_param("target", serde_json::json!("test-target"));

        let result = action.execute(context).await.unwrap();

        assert!(result.success);
        assert_eq!(result.output["target"], "test-target");
    }

    #[tokio::test]
    async fn test_my_action_missing_param() {
        let action = MyAction::new();
        let context = ActionContext::new(Uuid::new_v4());

        let result = action.execute(context).await;

        assert!(matches!(result, Err(ActionError::InvalidParameters(_))));
    }

    #[tokio::test]
    async fn test_my_action_rollback() {
        let action = MyAction::new();
        assert!(action.supports_rollback());

        let context = ActionContext::new(Uuid::new_v4())
            .with_param("target", serde_json::json!("test-target"));

        let result = action.rollback(context).await.unwrap();
        assert!(result.success);
    }
}

Policy Integration

Actions are automatically evaluated by the policy engine. Configure default approval:

# Default policy for new action
[[policy.rules]]
name = "my_action_default"
action = "my_action"
approval_level = "analyst"

Documentation

Document your action:

//! My custom action.
//!
//! This action performs X on target Y.
//!
//! # Parameters
//!
//! - `target` (required): The target to act on
//! - `force` (optional): Force execution (default: false)
//!
//! # Example
//!
//! ```yaml
//! - action: my_action
//!   parameters:
//!     target: "example"
//!     force: true
//! ```
//!
//! # Rollback
//!
//! This action supports rollback via `my_action_rollback`.

Changelog

All notable changes to Triage Warden.

[Unreleased]

Added

  • AI-powered triage agent with Claude integration
  • Configurable playbooks for automated investigation
  • Policy engine with approval workflows
  • Connector framework for external integrations
  • Web dashboard with HTMX
  • REST API for programmatic access
  • CLI for command-line operations

Connectors

  • VirusTotal threat intelligence
  • Splunk SIEM integration
  • CrowdStrike EDR integration
  • Microsoft 365 email gateway
  • Jira ticketing integration

Actions

  • Email: parse_email, check_email_authentication, quarantine_email, block_sender
  • Lookup: lookup_sender_reputation, lookup_urls, lookup_attachments
  • Host: isolate_host, scan_host
  • Notification: notify_user, escalate, create_ticket

[0.1.0] - 2024-01-15

Added

  • Initial release
  • Core incident management
  • Basic web interface
  • SQLite database support
  • Mock connectors for development

Version Numbering

This project follows Semantic Versioning:

  • MAJOR: Incompatible API changes
  • MINOR: Backwards-compatible new features
  • PATCH: Backwards-compatible bug fixes
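Semantic versions compare numerically component by component, not lexicographically; a quick sketch (pre-release and build-metadata suffixes are out of scope here):

```python
def parse_semver(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into an integer tuple for ordering."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))
```

Tuple ordering then gives the right result where string comparison would not (e.g. "0.10.0" sorts after "0.9.0").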

Upgrade Guide

From 0.x to 1.0

When 1.0 is released, an upgrade guide will be provided here.