Triage Warden
AI-powered security incident triage and response platform
Triage Warden automates analysis of and response to security incidents using AI agents, configurable playbooks, and integrations with your existing security stack.
Features
- AI-Powered Triage: Automated analysis of phishing emails, malware alerts, and suspicious login attempts
- Configurable Playbooks: Define custom investigation and response workflows
- Policy Engine: Role-based approval workflows for sensitive actions
- Connector Framework: Integrate with VirusTotal, Splunk, CrowdStrike, Jira, Microsoft 365, and more
- Web Dashboard: Real-time incident management with approval workflows
- REST API: Programmatic access for automation and integration
- Audit Trail: Complete logging of all actions and decisions
Quick Example
# Analyze a phishing email
tw-cli incident create --type phishing --source "email-gateway" --data '{"subject": "Urgent: Update Account"}'
# Run AI triage
tw-cli triage run --incident INC-2024-001
# View the verdict
tw-cli incident get INC-2024-001 --format json
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ Web Dashboard │
│ (HTMX + Askama Templates) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ REST API │
│ (Axum + Tower) │
└─────────────────────────────────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────────┐ ┌───────────────┐
│ Policy Engine │ │ AI Triage Agent │ │ Actions │
│ (Rust) │ │ (Python) │ │ (Rust) │
└───────────────┘ └───────────────────┘ └───────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ Connector Layer │
│ (VirusTotal, Splunk, CrowdStrike, Jira, M365) │
└─────────────────────────────────────────────────────────────────┘
Getting Started
- Installation - Install Triage Warden
- Quick Start - Create your first incident
- Configuration - Configure connectors and policies
License
Triage Warden is licensed under the MIT License.
Getting Started
Welcome to Triage Warden! This guide will help you get up and running quickly.
Prerequisites
- Rust 1.75+ (for building from source)
- Python 3.11+ (for AI triage agents)
- SQLite or PostgreSQL (for data storage)
- uv (recommended Python package manager)
Installation Options
- From Source - Build and run locally
- Docker - Run in containers
- Pre-built Binaries - Download releases
Next Steps
- Installation - Detailed installation instructions
- Quick Start - Create your first incident in 5 minutes
- Configuration - Configure connectors and settings
Installation
Building from Source
Prerequisites
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
Clone and Build
# Clone the repository
git clone https://github.com/zachyking/triage-warden.git
cd triage-warden
# Build Rust components
cargo build --release
# Install Python dependencies
cd python
uv sync
Verify Installation
# Check the CLI
./target/release/triage-warden --version
# Run tests
cargo test
cd python && uv run pytest
Docker
# Build the image
docker build -t triage-warden .
# Run with default settings
docker run -p 8080:8080 triage-warden
# Run with custom configuration
docker run -p 8080:8080 \
-e TW_DATABASE_URL=postgres://user:pass@host/db \
-e TW_VIRUSTOTAL_API_KEY=your-key \
triage-warden
Pre-built Binaries
Download the latest release from the releases page.
Available platforms:
- Linux x86_64 (glibc)
- Linux x86_64 (musl)
- macOS x86_64
- macOS aarch64 (Apple Silicon)
# Example for macOS
curl -LO https://github.com/zachyking/triage-warden/releases/latest/download/triage-warden-macos-aarch64.tar.gz
tar xzf triage-warden-macos-aarch64.tar.gz
./triage-warden --version
Database Setup
SQLite (Default)
SQLite is used by default. The database file is created automatically:
# Default location
DATABASE_URL=sqlite://./triage_warden.db
# Custom location
DATABASE_URL=sqlite:///var/lib/triage-warden/data.db
PostgreSQL
For production deployments:
# Create database
createdb triage_warden
# Set connection string
export DATABASE_URL=postgres://user:password@localhost/triage_warden
# Run migrations
triage-warden db migrate
Next Steps
- Quick Start - Create your first incident
- Configuration - Configure the system
Quick Start
Get Triage Warden running and process your first incident in 5 minutes.
1. Start the Server
# Start with default settings (SQLite, mock connectors)
cargo run --bin tw-api
# Or use the release binary
./target/release/tw-api
The web dashboard is now available at http://localhost:8080.
2. Create an Incident
Via Web Dashboard
- Open http://localhost:8080 in your browser
- Click "New Incident"
- Fill in the incident details:
- Type: Phishing
- Source: Email Gateway
- Severity: Medium
- Click Create
Via CLI
tw-cli incident create \
--type phishing \
--source "email-gateway" \
--severity medium \
--data '{
"subject": "Urgent: Verify Your Account",
"sender": "[email protected]",
"recipient": "[email protected]"
}'
Via API
curl -X POST http://localhost:8080/api/incidents \
-H "Content-Type: application/json" \
-d '{
"incident_type": "phishing",
"source": "email-gateway",
"severity": "medium",
"raw_data": {
"subject": "Urgent: Verify Your Account",
"sender": "[email protected]"
}
}'
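The same request can be sent from Python. This is a minimal sketch: the endpoint and field names come from the curl example above, while the `Bearer` authorization scheme and the `create_incident` helper are assumptions for illustration.

```python
import json
from urllib import request

API_URL = "http://localhost:8080"  # matches TW_API_URL's default

def build_incident_payload(incident_type, source, severity, raw_data):
    """Build the JSON body expected by POST /api/incidents."""
    return {
        "incident_type": incident_type,
        "source": source,
        "severity": severity,
        "raw_data": raw_data,
    }

def create_incident(payload, api_key=None):
    """POST the incident and return the parsed JSON response."""
    req = request.Request(
        f"{API_URL}/api/incidents",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    if api_key:
        # Assumed auth scheme; see API Authentication for the real one.
        req.add_header("Authorization", f"Bearer {api_key}")
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_incident_payload(
    "phishing", "email-gateway", "medium",
    {"subject": "Urgent: Verify Your Account", "sender": "[email protected]"},
)
```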
3. Run AI Triage
# Trigger triage for the incident
tw-cli triage run --incident INC-2024-0001
The AI agent will:
- Parse email headers and content
- Check sender reputation
- Analyze URLs and attachments
- Generate a verdict with confidence score
4. View the Verdict
# Get incident with triage results
tw-cli incident get INC-2024-0001
# Example output:
# Incident: INC-2024-0001
# Type: phishing
# Status: triaged
# Verdict: malicious
# Confidence: 0.92
# Recommended Actions:
# - quarantine_email
# - block_sender
# - notify_user
5. Execute Actions
Actions may require approval based on your policy configuration:
# Request to quarantine the email
tw-cli action execute --incident INC-2024-0001 --action quarantine_email
# If auto-approved:
# Action executed: quarantine_email (status: completed)
# If requires approval:
# Action pending approval from: Senior Analyst
Approve pending actions via the dashboard at /approvals.
Next Steps
- Configuration - Set up real connectors
- Playbooks - Create automated workflows
- Policy Engine - Configure approval rules
Configuration
Triage Warden is configured through environment variables and configuration files.
Environment Variables
Core Settings
| Variable | Description | Default |
|---|---|---|
TW_DATABASE_URL | Database connection string | sqlite://./triage_warden.db |
TW_HOST | API server host | 0.0.0.0 |
TW_PORT | API server port | 8080 |
TW_LOG_LEVEL | Logging level (trace, debug, info, warn, error) | info |
TW_ADMIN_PASSWORD | Initial admin password | (generated) |
Connector Selection
| Variable | Description | Values |
|---|---|---|
TW_THREAT_INTEL_MODE | Threat intelligence backend | mock, virustotal |
TW_SIEM_MODE | SIEM backend | mock, splunk |
TW_EDR_MODE | EDR backend | mock, crowdstrike |
TW_EMAIL_GATEWAY_MODE | Email gateway backend | mock, m365 |
TW_TICKETING_MODE | Ticketing backend | mock, jira |
VirusTotal
TW_THREAT_INTEL_MODE=virustotal
TW_VIRUSTOTAL_API_KEY=your-api-key-here
Splunk
TW_SIEM_MODE=splunk
TW_SPLUNK_URL=https://splunk.company.com:8089
TW_SPLUNK_TOKEN=your-token-here
CrowdStrike
TW_EDR_MODE=crowdstrike
TW_CROWDSTRIKE_CLIENT_ID=your-client-id
TW_CROWDSTRIKE_CLIENT_SECRET=your-client-secret
TW_CROWDSTRIKE_REGION=us-1 # us-1, us-2, eu-1
Microsoft 365
TW_EMAIL_GATEWAY_MODE=m365
TW_M365_TENANT_ID=your-tenant-id
TW_M365_CLIENT_ID=your-client-id
TW_M365_CLIENT_SECRET=your-client-secret
Jira
TW_TICKETING_MODE=jira
TW_JIRA_URL=https://company.atlassian.net
[email protected]
TW_JIRA_API_TOKEN=your-api-token
TW_JIRA_PROJECT_KEY=SEC
AI Provider
TW_AI_PROVIDER=anthropic # anthropic, openai
TW_ANTHROPIC_API_KEY=your-api-key
# or
TW_OPENAI_API_KEY=your-api-key
Configuration File
For complex configurations, use a TOML file:
# config.toml
[server]
host = "0.0.0.0"
port = 8080
log_level = "info"
[database]
url = "postgres://user:pass@localhost/triage_warden"
max_connections = 10
[connectors.threat_intel]
mode = "virustotal"
api_key = "${TW_VIRUSTOTAL_API_KEY}"
rate_limit = 4 # requests per minute
[connectors.siem]
mode = "splunk"
url = "https://splunk.company.com:8089"
token = "${TW_SPLUNK_TOKEN}"
[connectors.edr]
mode = "crowdstrike"
client_id = "${TW_CROWDSTRIKE_CLIENT_ID}"
client_secret = "${TW_CROWDSTRIKE_CLIENT_SECRET}"
region = "us-1"
[ai]
provider = "anthropic"
model = "claude-sonnet-4-20250514"
max_tokens = 4096
[policy]
default_action_approval = "auto" # auto, analyst, senior, manager
high_severity_approval = "senior"
critical_action_approval = "manager"
Load with:
tw-api --config config.toml
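The `${VAR}` placeholders in `config.toml` are resolved from the environment at load time. A sketch of that interpolation, with a hypothetical `expand_env` helper (the real loader may behave differently, e.g. for unset variables):

```python
import os
import re

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with values from the environment.
    Unset variables expand to "" here; that default is an assumption."""
    return re.sub(
        r"\$\{([A-Za-z0-9_]+)\}",
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )

# Example using the key name from the config.toml snippet above.
os.environ["TW_VIRUSTOTAL_API_KEY"] = "demo-key"
expanded = expand_env("${TW_VIRUSTOTAL_API_KEY}")
```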
Policy Rules
Policy rules control action approval requirements. See Policy Engine for details.
# Example policy rule
[[policy.rules]]
name = "isolate_host_requires_manager"
action = "isolate_host"
severity = ["high", "critical"]
approval_level = "manager"
Logging
Configure structured logging:
# JSON output for production
TW_LOG_FORMAT=json
# Pretty output for development
TW_LOG_FORMAT=pretty
# Filter specific modules
RUST_LOG=tw_api=debug,tw_core=info
Next Steps
- Connectors - Detailed connector configuration
- Policy Engine - Configure approval workflows
- API Authentication - Set up API access
Web Dashboard
Browser-based interface for incident management.
Overview
The dashboard provides:
- Real-time incident monitoring
- Approval workflow management
- Playbook configuration
- System settings
Access at: http://localhost:8080
Features
Home Dashboard
The main dashboard displays:
- KPIs: Open incidents, pending approvals, triage rate
- Recent Incidents: Latest incidents with status
- Trend Charts: Incident volume over time
- Quick Actions: Create incident, run playbook
Incident Management
- List view with filtering and sorting
- Detail view with full incident context
- Action execution interface
- Triage results and reasoning
Approval Workflow
- Queue of pending approvals
- One-click approve/reject
- Bulk approval for related actions
- SLA countdown timers
Playbook Management
- Create and edit playbooks
- Visual step editor
- Test with sample data
- Execution history
Settings
- Connector configuration
- Policy rule management
- User administration
- System preferences
Navigation
| Path | Description |
|---|---|
/ | Dashboard home |
/incidents | Incident list |
/incidents/:id | Incident detail |
/approvals | Pending approvals |
/playbooks | Playbook management |
/settings | System settings |
/login | Login page |
Next Steps
- Incidents - Managing incidents
- Approvals - Approval workflow
- Playbooks - Playbook configuration
- Settings - System settings
Incidents
Managing incidents in the web dashboard.
Incident List
Access at /incidents
Filtering
- Status: Open, Triaged, Resolved
- Severity: Low, Medium, High, Critical
- Type: Phishing, Malware, Suspicious Login
- Date Range: Custom time period
Sorting
Click column headers to sort:
- Created (newest/oldest)
- Severity (highest/lowest)
- Status
Bulk Actions
Select multiple incidents for:
- Bulk resolve
- Bulk escalate
- Export to CSV
Incident Detail
Click an incident to view details.
Overview Tab
- Incident metadata
- AI verdict and confidence
- Recommended actions
- Timeline of events
Raw Data Tab
- Original incident data (JSON)
- Parsed email content (for phishing)
- Detection details (for malware)
Actions Tab
- Available actions
- Executed actions with results
- Pending approvals
Enrichment Tab
- Threat intelligence results
- SIEM correlation data
- Related incidents
Creating Incidents
Click "New Incident" button.
Required Fields
- Type: Select incident type
- Source: Origin of the incident
- Severity: Initial severity assessment
Optional Fields
- Description: Free-form description
- Raw Data: JSON payload
- Assignee: Initial assignment
Executing Actions
From the incident detail page:
- Click "Actions" tab
- Select action from dropdown
- Fill in parameters
- Click "Execute"
If approval is required:
- Action appears in pending state
- Notification sent to approvers
- Status updates when approved/rejected
Keyboard Shortcuts
| Shortcut | Action |
|---|---|
j / k | Navigate list |
Enter | Open incident |
Esc | Close modal |
a | Open actions menu |
e | Escalate |
r | Resolve |
Real-time Updates
The dashboard uses HTMX for live updates:
- New incidents appear automatically
- Status changes reflect immediately
- Approval decisions update in real-time
Approvals
Managing action approvals in the web dashboard.
Approval Queue
Access at /approvals
The queue shows all actions pending your approval based on your role level.
Queue Columns
- Action: Type of action requested
- Incident: Related incident
- Requested By: Who/what requested it
- Requested At: When requested
- SLA: Time remaining to respond
Filtering
- Approval Level: Analyst, Senior, Manager
- Action Type: Specific actions
- Incident Type: Phishing, malware, etc.
Approval Detail
Click an approval to see full context.
Context Section
- Full incident details
- AI reasoning (if from triage)
- Related actions already taken
Decision Section
- Approve: Execute the action
- Reject: Decline with reason
- Delegate: Assign to another approver
Approving Actions
Single Approval
- Click on pending action
- Review incident context
- Click "Approve" or "Reject"
- Add optional comment
- Confirm decision
Bulk Approval
For related actions:
- Select multiple actions (checkbox)
- Click "Bulk Approve" or "Bulk Reject"
- Add comment applying to all
- Confirm
Rejection
When rejecting:
- Click "Reject"
- Required: Enter rejection reason
- Optionally suggest alternative
- Confirm
The requester is notified of rejection and reason.
SLA Indicators
| Color | Meaning |
|---|---|
| Green | Plenty of time |
| Yellow | < 50% time remaining |
| Orange | < 25% time remaining |
| Red | SLA exceeded |
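The color buckets above map directly onto time remaining. A sketch of that mapping, assuming a hypothetical `sla_color` helper (the dashboard's actual implementation is not shown here):

```python
def sla_color(elapsed_seconds: float, sla_seconds: float) -> str:
    """Map an approval's elapsed time to the SLA indicator color."""
    if elapsed_seconds >= sla_seconds:
        return "red"                       # SLA exceeded
    remaining = 1.0 - elapsed_seconds / sla_seconds
    if remaining < 0.25:
        return "orange"                    # < 25% time remaining
    if remaining < 0.50:
        return "yellow"                    # < 50% time remaining
    return "green"                         # plenty of time
```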
Notifications
You receive notifications for:
- New actions requiring your approval
- SLA warnings (50%, 75% elapsed)
- Escalations to your level
Configure notification preferences in Settings.
Delegation
If unavailable:
- Go to Settings > Delegation
- Select delegate user
- Set date range
- Delegate receives your approvals
Audit Trail
All approvals are logged:
- Who approved/rejected
- When decision was made
- Time to approve
- Comments provided
View at Settings > Audit Logs.
Playbooks
Managing playbooks in the web dashboard.
Playbook List
Access at /playbooks
Views
- Active: Currently enabled playbooks
- Inactive: Disabled playbooks
- All: Complete list
Information Displayed
- Name and description
- Trigger conditions
- Last run time
- Success rate
Creating Playbooks
Click "New Playbook" button.
Basic Information
- Name: Unique identifier
- Description: What this playbook does
- Version: Semantic version
Triggers
Configure when playbook runs:
- Incident Type: Phishing, malware, etc.
- Auto Run: Run automatically on new incidents
- Conditions: Additional criteria
Variables
Define playbook variables:
quarantine_threshold: 0.7
notification_channel: "#security"
Step Editor
Visual editor for playbook steps.
Adding Steps
- Click "Add Step"
- Select action type
- Configure parameters
- Set output variable name
Step Types
- Action: Execute an action
- Condition: Branch logic
- AI Analysis: Get AI verdict
- Parallel: Run steps concurrently
Connections
- Drag to reorder steps
- Connect condition branches
- Set dependencies
Testing Playbooks
Dry Run
- Click "Test"
- Select or create test incident
- Toggle "Dry Run"
- View step-by-step execution
With Live Data
- Click "Test"
- Select real incident
- Leave "Dry Run" off
- Actions will execute (with approval)
Execution History
View past executions:
- Execution timestamp
- Incident processed
- Steps completed
- Final verdict
- Duration
Click execution for detailed trace.
Import/Export
Export
- Select playbook
- Click "Export"
- Download YAML file
Import
- Click "Import"
- Upload YAML file
- Review parsed playbook
- Click "Create"
Playbook Versions
Playbooks are versioned:
- Edit playbook
- Bump version number
- Save as new version
- Old version kept for rollback
View version history and compare changes.
Settings
System configuration in the web dashboard.
Settings Tabs
Access at /settings
General
- Instance Name: Display name for this installation
- Time Zone: Default timezone for display
- Date Format: Date/time display format
- Theme: Light/dark mode preference
Connectors
Configure external integrations.
Threat Intelligence
- Mode: Mock or VirusTotal
- API Key (for VirusTotal)
- Rate limit settings
SIEM
- Mode: Mock or Splunk
- URL and authentication
- Default search index
EDR
- Mode: Mock or CrowdStrike
- OAuth credentials
- Region selection
Email Gateway
- Mode: Mock or Microsoft 365
- Azure AD configuration
- Tenant settings
Ticketing
- Mode: Mock or Jira
- Instance URL
- Project configuration
Policies
Manage policy rules.
Creating Rules
- Click "Add Rule"
- Enter rule name
- Define matching criteria
- Set decision (allow/deny/approval)
- Save
Rule Priority
Drag rules to reorder. First matching rule wins.
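First-match-wins evaluation can be sketched as a single pass over the ordered rule list. The rule shapes mirror the fields from the "Creating Rules" steps above; the fall-through default of `"allow"` is an assumption, not documented behavior:

```python
def evaluate(rules, action, severity):
    """Return the decision of the first rule whose criteria match.
    List order is the drag-to-reorder priority from the Policies tab."""
    for rule in rules:
        if rule.get("action") is not None and rule["action"] != action:
            continue
        if "severity" in rule and severity not in rule["severity"]:
            continue
        return rule["decision"]
    return "allow"  # assumed default when no rule matches

# Hypothetical rules, modeled on the policy examples in Configuration.
rules = [
    {"name": "isolate_requires_manager", "action": "isolate_host",
     "severity": ["high", "critical"], "decision": "approval:manager"},
    {"name": "isolate_fallback_deny", "action": "isolate_host",
     "decision": "deny"},
]
```

With this ordering, a high-severity `isolate_host` requires manager approval, while a low-severity one falls through to the second rule and is denied.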
Users
User management (admin only).
User List
- Username and email
- Role (viewer/analyst/senior/admin)
- Last login
- Status (active/disabled)
Creating Users
- Click "Add User"
- Enter email and username
- Set initial role
- Generate or set password
- Send invitation email
Role Management
Assign roles:
- Viewer: Read-only access
- Analyst: Execute actions, approve analyst-level
- Senior: Approve senior-level
- Admin: Full access
Notifications
Configure notification preferences.
Channels
- Email: SMTP settings
- Slack: Webhook URL
- Teams: Connector URL
- PagerDuty: Integration key
Preferences
For each notification type:
- Enable/disable channel
- Set priority threshold
- Configure quiet hours
Audit Logs
View system audit trail.
Filtering
- Date range
- Event type
- User
- Resource
Export
Export logs to CSV for compliance.
API Keys
Manage API credentials.
Creating Keys
- Click "Create API Key"
- Enter name and description
- Select scopes
- Set expiration (optional)
- Copy generated key
Revoking Keys
Click "Revoke" on any key. Revocation is immediate.
Backup & Restore
Database management.
Backup
- Click "Create Backup"
- Wait for completion
- Download backup file
Restore
- Click "Restore"
- Upload backup file
- Confirm restore
- System restarts
About
System information:
- Version number
- Build information
- License status
- Support links
CLI Reference
Command-line interface for Triage Warden.
Installation
The CLI is built with the main project:
cargo build --release
./target/release/tw-cli --help
Global Options
tw-cli [OPTIONS] <COMMAND>
Options:
-c, --config <FILE> Configuration file path
-v, --verbose Enable verbose output
-q, --quiet Suppress non-error output
--json Output as JSON
-h, --help Print help
-V, --version Print version
Environment Variables
| Variable | Description |
|---|---|
TW_API_URL | API server URL (default: http://localhost:8080) |
TW_API_KEY | API key for authentication |
TW_CONFIG | Path to config file |
Commands Overview
| Command | Description |
|---|---|
incident | Manage incidents |
action | Execute and manage actions |
triage | Run AI triage |
playbook | Manage playbooks |
policy | Manage policy rules |
connector | Manage connectors |
user | User management |
api-key | API key management |
webhook | Webhook management |
config | Configuration management |
db | Database operations |
serve | Start API server |
Quick Examples
# List open incidents
tw-cli incident list --status open
# Create incident
tw-cli incident create --type phishing --severity high
# Run triage
tw-cli triage run --incident INC-2024-001
# Execute action
tw-cli action execute --incident INC-2024-001 --action quarantine_email
# Approve pending action
tw-cli action approve act-abc123
# Start server
tw-cli serve --port 8080
Next Steps
- Commands - Detailed command reference
CLI Commands
Detailed reference for all CLI commands.
incident
Manage security incidents.
list
tw-cli incident list [OPTIONS]
Options:
--status <STATUS> Filter by status (open, triaged, resolved)
--severity <SEVERITY> Filter by severity
--type <TYPE> Filter by incident type
--limit <N> Maximum results (default: 20)
--offset <N> Skip first N results
--sort <FIELD> Sort field (created_at, severity)
--desc Sort descending
get
tw-cli incident get <ID> [OPTIONS]
Options:
--format <FORMAT> Output format (table, json, yaml)
--include-actions Include action history
--include-enrichment Include enrichment data
create
tw-cli incident create [OPTIONS]
Options:
--type <TYPE> Incident type (required)
--source <SOURCE> Incident source (required)
--severity <SEVERITY> Initial severity (default: medium)
--data <JSON> Raw incident data as JSON
--file <FILE> Read data from file
--auto-triage Run triage after creation
update
tw-cli incident update <ID> [OPTIONS]
Options:
--severity <SEVERITY> Update severity
--status <STATUS> Update status
--assignee <USER> Assign to user
resolve
tw-cli incident resolve <ID> [OPTIONS]
Options:
--resolution <TEXT> Resolution notes
--false-positive Mark as false positive
action
Execute and manage actions.
execute
tw-cli action execute [OPTIONS]
Options:
--incident <ID> Associated incident
--action <NAME> Action to execute (required)
--param <KEY=VALUE> Action parameter (repeatable)
--emergency Emergency override (manager only)
list
tw-cli action list [OPTIONS]
Options:
--incident <ID> Filter by incident
--status <STATUS> Filter by status
--pending Show only pending approval
get
tw-cli action get <ID>
approve
tw-cli action approve <ID> [OPTIONS]
Options:
--comment <TEXT> Approval comment
reject
tw-cli action reject <ID> [OPTIONS]
Options:
--reason <TEXT> Rejection reason (required)
rollback
tw-cli action rollback <ID> [OPTIONS]
Options:
--reason <TEXT> Rollback reason
triage
Run AI triage.
run
tw-cli triage run [OPTIONS]
Options:
--incident <ID> Incident to triage (required)
--playbook <NAME> Specific playbook
--model <MODEL> AI model override
--wait Wait for completion
status
tw-cli triage status <TRIAGE_ID>
playbook
Manage playbooks.
list
tw-cli playbook list [OPTIONS]
Options:
--enabled Only enabled playbooks
--trigger-type <TYPE> Filter by trigger type
get
tw-cli playbook get <ID>
add
tw-cli playbook add <FILE>
update
tw-cli playbook update <ID> <FILE>
delete
tw-cli playbook delete <ID>
run
tw-cli playbook run <ID> [OPTIONS]
Options:
--incident <ID> Incident to process
--var <KEY=VALUE> Override variable (repeatable)
--dry-run Don't execute actions
test
tw-cli playbook test <NAME> [OPTIONS]
Options:
--incident <ID> Use existing incident
--data <JSON> Use mock data
--dry-run Don't execute actions
validate
tw-cli playbook validate <FILE>
export
tw-cli playbook export <ID> [OPTIONS]
Options:
-o, --output <FILE> Output file (default: stdout)
policy
Manage policy rules.
list
tw-cli policy list
add
tw-cli policy add [OPTIONS]
Options:
--name <NAME> Rule name (required)
--action <ACTION> Action to match
--pattern <PATTERN> Action pattern (glob)
--severity <SEVERITY> Severity condition
--approval-level <L> Required approval level
--allow Auto-allow
--deny Deny with reason
--reason <TEXT> Denial reason
delete
tw-cli policy delete <NAME>
test
tw-cli policy test [OPTIONS]
Options:
--action <ACTION> Action to test
--severity <SEVERITY> Incident severity
--proposer-type <T> Proposer type
--confidence <N> AI confidence score
connector
Manage connectors.
status
tw-cli connector status
test
tw-cli connector test <NAME>
configure
tw-cli connector configure <NAME> [OPTIONS]
Options:
--mode <MODE> Connector mode
--api-key <KEY> API key
--url <URL> Service URL
user
User management.
list
tw-cli user list
create
tw-cli user create [OPTIONS]
Options:
--username <NAME> Username (required)
--email <EMAIL> Email address
--role <ROLE> User role
--service-account Create as service account
update
tw-cli user update <ID> [OPTIONS]
Options:
--role <ROLE> New role
--enabled Enable user
--disabled Disable user
delete
tw-cli user delete <ID>
api-key
API key management.
list
tw-cli api-key list
create
tw-cli api-key create [OPTIONS]
Options:
--name <NAME> Key name (required)
--scopes <SCOPES> Comma-separated scopes
--user <USER> Associated user
--expires <DATE> Expiration date
revoke
tw-cli api-key revoke <PREFIX>
rotate
tw-cli api-key rotate <PREFIX>
webhook
Webhook management.
list
tw-cli webhook list
add
tw-cli webhook add <SOURCE> [OPTIONS]
Options:
--secret <SECRET> Webhook secret
--auto-triage Enable auto-triage
--playbook <NAME> Playbook to run
test
tw-cli webhook test <SOURCE>
delete
tw-cli webhook delete <SOURCE>
db
Database operations.
migrate
tw-cli db migrate
backup
tw-cli db backup [OPTIONS]
Options:
-o, --output <FILE> Backup file path
restore
tw-cli db restore <FILE>
serve
Start the API server.
tw-cli serve [OPTIONS]
Options:
--host <HOST> Bind address (default: 0.0.0.0)
--port <PORT> Port number (default: 8080)
--config <FILE> Configuration file
Architecture Overview
Triage Warden is built as a modular, layered system combining Rust for performance-critical components and Python for AI capabilities.
System Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ Clients │
│ (Web Browser, CLI, API Consumers) │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ API Layer (tw-api) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ REST API │ │ Web Handlers│ │ Webhooks │ │ Metrics │ │
│ │ (Axum) │ │(HTMX+Askama)│ │ │ │ (Prometheus)│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
┌──────────────────────────┼──────────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────────┐ ┌───────────────┐
│ Policy Engine │ │ Action Registry │ │ Event Bus │
│ (tw-policy) │ │ (tw-actions) │ │ (tw-core) │
└───────────────┘ └───────────────────┘ └───────────────┘
│ │ │
└──────────────────────────┼──────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Core Domain (tw-core) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Incidents │ │ Playbooks │ │ Users │ │ Audit │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Database Layer (SQLx) │
│ (SQLite for dev, PostgreSQL for prod) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Python Bridge (tw-bridge) │
│ (PyO3 Bindings) │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ AI Layer (tw_ai) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │Triage Agent │ │ Tools │ │ Playbook │ │ Evaluation │ │
│ │ (Claude) │ │ │ │ Engine │ │ Framework │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Connector Layer (tw-connectors) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ VirusTotal │ │ Splunk │ │ CrowdStrike │ │ Jira │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Crate Structure
| Crate | Purpose |
|---|---|
tw-api | HTTP server, REST API, web handlers, webhooks |
tw-core | Domain models, database repositories, event bus |
tw-actions | Action handlers (quarantine, isolate, notify, etc.) |
tw-policy | Policy engine, approval rules, decision evaluation |
tw-connectors | External service integrations (VirusTotal, Splunk, etc.) |
tw-bridge | PyO3 bindings exposing Rust to Python |
tw-cli | Command-line interface |
tw-observability | Metrics, tracing, logging infrastructure |
Key Design Decisions
Rust + Python Hybrid
- Rust: Core platform, API server, policy engine, actions
- Python: AI agents, LLM integrations, playbook execution
- Bridge: PyO3 enables Python to call Rust connectors and actions
Trait-Based Connectors
All connectors implement traits for testability:
#[async_trait]
pub trait ThreatIntelConnector: Send + Sync {
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport>;
    async fn lookup_url(&self, url: &str) -> ConnectorResult<ThreatReport>;
    async fn lookup_domain(&self, domain: &str) -> ConnectorResult<ThreatReport>;
}
Event-Driven Architecture
The event bus enables loose coupling:
event_bus.publish(Event::IncidentCreated { id, incident_type });
event_bus.publish(Event::ActionExecuted { action_id, result });
Policy-First Actions
All actions pass through the policy engine:
Request → Policy Evaluation → (Allowed | Denied | RequiresApproval) → Execute
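The gate at the end of that flow can be sketched in Python. The three decision classes mirror the `PolicyDecision` variants named above; `execute_action` and its string results are illustrative stand-ins, not the platform's API:

```python
from dataclasses import dataclass

# Python mirrors of the Rust PolicyDecision variants, for illustration.
class Allowed:
    pass

@dataclass
class Denied:
    reason: str

@dataclass
class RequiresApproval:
    level: str

def execute_action(action: str, decision, approved: bool = False) -> str:
    """Gate execution on the policy decision, per the flow above."""
    if isinstance(decision, Denied):
        return f"denied: {decision.reason}"
    if isinstance(decision, RequiresApproval) and not approved:
        return f"pending approval ({decision.level})"
    return f"executed: {action}"
```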
Next Steps
- Components - Detailed component descriptions
- Data Flow - How data moves through the system
- Security Model - Authentication and authorization
Components
Detailed description of each major component in Triage Warden.
tw-api
The HTTP server and web interface.
REST API Routes
| Route | Description |
|---|---|
GET /api/incidents | List incidents with filtering |
POST /api/incidents | Create new incident |
GET /api/incidents/:id | Get incident details |
POST /api/incidents/:id/actions | Execute action on incident |
GET /api/playbooks | List playbooks |
POST /api/webhooks/:source | Receive webhook events |
Web Handlers
Server-rendered pages using HTMX and Askama templates:
- Dashboard with KPIs
- Incident list and detail views
- Approval workflow interface
- Playbook management
- Settings configuration
Authentication
- Session-based auth for web dashboard
- API key auth for programmatic access
- Role-based access control (admin, analyst, viewer)
tw-core
Core domain logic and data access.
Domain Models
pub struct Incident {
    pub id: Uuid,
    pub incident_type: IncidentType,
    pub severity: Severity,
    pub status: IncidentStatus,
    pub source: String,
    pub raw_data: serde_json::Value,
    pub verdict: Option<Verdict>,
    pub confidence: Option<f64>,
    pub created_at: DateTime<Utc>,
}

pub struct Action {
    pub id: Uuid,
    pub incident_id: Uuid,
    pub action_type: ActionType,
    pub status: ActionStatus,
    pub approval_level: Option<ApprovalLevel>,
    pub executed_by: Option<String>,
}
Repositories
Database access layer with SQLite and PostgreSQL support:
- IncidentRepository
- ActionRepository
- PlaybookRepository
- UserRepository
- AuditRepository
Event Bus
Async event distribution:
pub enum Event {
    IncidentCreated { id: Uuid },
    IncidentUpdated { id: Uuid },
    ActionRequested { id: Uuid },
    ActionApproved { id: Uuid, approver: String },
    ActionExecuted { id: Uuid, success: bool },
}
tw-actions
Action handlers for incident response.
Email Actions
| Action | Description |
|---|---|
parse_email | Extract headers, body, attachments |
check_email_authentication | Validate SPF/DKIM/DMARC |
quarantine_email | Move to quarantine |
block_sender | Add to blocklist |
Lookup Actions
| Action | Description |
|---|---|
lookup_sender_reputation | Check sender against threat intel |
lookup_urls | Analyze URLs in content |
lookup_attachments | Hash and check attachments |
Host Actions
| Action | Description |
|---|---|
isolate_host | Network isolation via EDR |
scan_host | Trigger endpoint scan |
Notification Actions
| Action | Description |
|---|---|
| `notify_user` | Send user notification |
| `notify_reporter` | Update incident reporter |
| `escalate` | Route to an approval level |
| `create_ticket` | Create Jira ticket |
tw-policy
Policy engine for action approval.
Rule Evaluation
```rust
pub struct PolicyRule {
    pub name: String,
    pub action_type: ActionType,
    pub conditions: Vec<Condition>,
    pub approval_level: ApprovalLevel,
}

pub enum PolicyDecision {
    Allowed,
    Denied { reason: String },
    RequiresApproval { level: ApprovalLevel },
}
```
Approval Levels
- Auto - No approval required
- Analyst - Any analyst can approve
- Senior - Senior analyst required
- Manager - SOC manager required
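As a sketch of how an action request might be resolved to one of these levels, consider the helper below. The rule table and function are illustrative, not the shipped policy engine:

```python
# Severity ordering used when a rule sets a minimum severity.
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]

# Hypothetical rules: (action prefix, minimum severity or None, approval level).
# First match wins; real rules live in the policy configuration.
RULES = [
    ("lookup_", None, "auto"),
    ("quarantine_email", None, "analyst"),
    ("isolate_host", "high", "manager"),
    ("isolate_host", None, "senior"),
]

def approval_level(action: str, severity: str) -> str:
    """Return the first matching rule's approval level, defaulting to analyst."""
    for prefix, min_sev, level in RULES:
        if not action.startswith(prefix):
            continue
        if min_sev and SEVERITY_ORDER.index(severity) < SEVERITY_ORDER.index(min_sev):
            continue
        return level
    return "analyst"
```

With these example rules, `lookup_*` actions auto-approve, while `isolate_host` escalates from senior to manager once severity reaches high.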
tw-connectors
External service integrations.
Connector Trait
```rust
#[async_trait]
pub trait Connector: Send + Sync {
    fn name(&self) -> &str;
    fn connector_type(&self) -> &str;
    async fn health_check(&self) -> ConnectorResult<ConnectorHealth>;
    async fn test_connection(&self) -> ConnectorResult<bool>;
}
```
Available Connectors
| Type | Implementations |
|---|---|
| Threat Intel | VirusTotal, Mock |
| SIEM | Splunk, Mock |
| EDR | CrowdStrike, Mock |
| Email Gateway | Microsoft 365, Mock |
| Ticketing | Jira, Mock |
tw-bridge
PyO3 bindings for Python integration.
Exposed Classes
```python
from tw_bridge import ThreatIntelBridge, SIEMBridge, EDRBridge

# Use connectors from Python
threat_intel = ThreatIntelBridge("virustotal")
result = threat_intel.lookup_hash("abc123...")
```
tw_ai (Python)
AI triage and playbook execution.
Triage Agent
Claude-powered agent for incident analysis:
```python
agent = TriageAgent(model="claude-sonnet-4-20250514")
verdict = await agent.analyze(incident)
# Returns: Verdict(classification="malicious", confidence=0.92, ...)
```
Playbook Engine
YAML-based playbook execution:
```yaml
name: phishing_triage
steps:
  - action: parse_email
  - action: check_email_authentication
  - action: lookup_sender_reputation
  - condition: sender_reputation < 0.3
    action: quarantine_email
```
Data Flow
How data moves through Triage Warden from incident creation to resolution.
Incident Lifecycle
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Created │────▶│ Triaging │────▶│ Triaged │────▶│ Resolved │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
▼ ▼ ▼ ▼
Webhook/API AI Agent Actions Executed Closed
receives data analyzes (with approval)
Detailed Flow
1. Incident Creation
External Source (Email Gateway, SIEM, EDR)
│
▼
Webhook Endpoint
/api/webhooks/:source
│
▼
┌──────────────────┐
│ Parse & Validate │
│ Incoming Data │
└──────────────────┘
│
▼
┌──────────────────┐
│ Create Incident │
│ Record in DB │
└──────────────────┘
│
▼
┌──────────────────┐
│ Publish Event: │
│ IncidentCreated │
└──────────────────┘
2. AI Triage
┌──────────────────┐
│ Event: Incident │
│ Created │
└──────────────────┘
│
▼
┌──────────────────┐
│ Load Playbook │
│ (based on type) │
└──────────────────┘
│
▼
┌──────────────────┐
│ Execute Playbook │
│ Steps │
└──────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Enrichment │ │ AI Analysis │
│ Actions │ │ (Claude) │
│ - parse_email │ │ │
│ - lookup_* │ │ Generates: │
└───────────────┘ │ - Verdict │
│ │ - Confidence │
│ │ - Reasoning │
│ │ - Actions │
        └───────┬───────┴───────────────┘
│
▼
┌──────────────────┐
│ Update Incident │
│ with Verdict │
└──────────────────┘
3. Action Execution
┌──────────────────┐
│ Action Request │
│ (from agent or │
│ human) │
└──────────────────┘
│
▼
┌──────────────────┐
│ Build Action │
│ Context │
└──────────────────┘
│
▼
┌──────────────────┐
│ Policy Engine │
│ Evaluation │
└──────────────────┘
│
┌───────────┼───────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│Allowed │ │Denied │ │Requires│
│ │ │ │ │Approval│
└────────┘ └────────┘ └────────┘
│ │ │
▼ ▼ ▼
Execute Return Queue for
Action Error Approval
│ │
│ ▼
│ ┌──────────────┐
│ │ Notify │
│ │ Approvers │
│ └──────────────┘
│ │
│ ▼
│ ┌──────────────┐
│ │ Wait for │
│ │ Approval │
│ └──────────────┘
│ │
│ ┌──────────────┴──────────────┐
│ ▼ ▼
│ ┌────────┐ ┌────────┐
│ │Approved│ │Rejected│
│ └────────┘ └────────┘
│ │ │
│ ▼ ▼
│ Execute Action Update Status
│ │
└────────┴─────────┐
▼
┌──────────────┐
│ Connector │
│ Execution │
│ (External │
│ Service) │
└──────────────┘
│
▼
┌──────────────┐
│ Update │
│ Action │
│ Status │
└──────────────┘
│
▼
┌──────────────┐
│ Audit Log │
│ Entry │
└──────────────┘
Data Stores
Primary Database
| Table | Purpose |
|---|---|
| `incidents` | Incident records |
| `actions` | Action requests and results |
| `playbooks` | Playbook definitions |
| `users` | User accounts |
| `sessions` | Active sessions |
| `api_keys` | API credentials |
| `audit_logs` | Action audit trail |
| `connectors` | Connector configurations |
| `policies` | Policy rules |
| `notifications` | Notification history |
| `settings` | System settings |
Event Bus (In-Memory)
Transient event distribution for real-time updates:
- Incident lifecycle events
- Action status changes
- Approval notifications
- System health events
External Data Flow
Inbound (Webhooks)
Email Gateway ──────┐
SIEM Alerts ────────┼──▶ Webhook Handler ──▶ Incident Creation
EDR Events ─────────┘
Outbound (Connectors)
┌──▶ VirusTotal (threat intel)
Action Execution ──────────┼──▶ Splunk (SIEM queries)
├──▶ CrowdStrike (host actions)
├──▶ M365 (email actions)
└──▶ Jira (ticketing)
Metrics Flow
Rust Components ──┬──▶ Prometheus Registry ──▶ /metrics endpoint
Python Components ─┘
Exposed metrics:
- `triage_warden_incidents_total{type, severity}`
- `triage_warden_actions_total{action, status}`
- `triage_warden_triage_duration_seconds{type}`
- `triage_warden_connector_requests_total{connector, status}`
Security Model
Triage Warden implements defense-in-depth with multiple security layers.
Authentication
Web Dashboard
Session-based authentication with secure cookies:
- Session tokens: Random 256-bit tokens
- Cookie settings: HttpOnly, Secure, SameSite=Lax
- Session duration: 8 hours (configurable)
- CSRF protection: Per-request tokens on all state-changing forms
API Access
API key authentication for programmatic access:
```bash
curl -H "Authorization: Bearer tw_abc123_secretkey" \
  https://api.example.com/api/incidents
```
API key features:
- Prefix stored in plain text for lookup (`tw_abc123`)
- Secret portion hashed with Argon2
- Scopes limit allowed operations
- Expiration dates supported
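The lookup-then-verify flow can be sketched as below. This is illustrative: SHA-256 stands in for the hashing step, the key store is a plain dict, and the `tw_<prefix>_<secret>` layout is taken from the example above:

```python
import hashlib
import hmac

# Hypothetical stored record: prefix kept in plain text for lookup,
# secret portion stored only as a hash.
KEY_STORE = {
    "tw_abc123": hashlib.sha256(b"secretkey").hexdigest(),
}

def verify_api_key(presented: str) -> bool:
    """Split 'tw_<prefix>_<secret>', look up by prefix, compare secret hashes."""
    try:
        tw, prefix, secret = presented.split("_", 2)
    except ValueError:
        return False  # malformed key
    stored = KEY_STORE.get(f"{tw}_{prefix}")
    if stored is None:
        return False  # unknown prefix
    candidate = hashlib.sha256(secret.encode()).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(candidate, stored)
```

The prefix makes the database lookup an indexed equality query; only the comparison of the hashed secret needs to be constant-time.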
Authorization
Role-Based Access Control (RBAC)
| Role | Capabilities |
|---|---|
| Viewer | Read incidents, view dashboards |
| Analyst | Viewer + execute low-risk actions, approve analyst-level |
| Senior Analyst | Analyst + execute medium-risk actions, approve senior-level |
| Admin | Full access, user management, system configuration |
Policy-Based Action Control
The policy engine evaluates every action request:
```text
ActionRequest
  → Build ActionContext (action_type, target, severity, proposer)
  → Evaluate policy rules
  → Return PolicyDecision:
      - Allowed: execute immediately
      - Denied: return an error with the reason
      - RequiresApproval: queue for the specified approval level
```
Example Policy Rules
```toml
# Low-risk actions auto-approve
[[policy.rules]]
name = "auto_approve_lookups"
action_patterns = ["lookup_*"]
decision = "allowed"

# High-severity host isolation requires manager
[[policy.rules]]
name = "isolate_requires_manager"
action = "isolate_host"
severity = ["high", "critical"]
approval_level = "manager"

# Block dangerous actions on production
[[policy.rules]]
name = "no_delete_in_prod"
action_patterns = ["delete_*"]
environment = "production"
decision = "denied"
reason = "Deletion not allowed in production"
```
Multi-Tenant Isolation
Triage Warden supports multi-tenancy with strong data isolation guarantees.
Row-Level Security (RLS)
PostgreSQL Row-Level Security provides database-level tenant isolation:
```sql
-- Each table has RLS policies that filter by tenant.
-- The application sets the tenant context at the start of each request:
SELECT set_tenant_context('tenant-uuid-here');

-- All subsequent queries are automatically filtered:
SELECT * FROM incidents;  -- only returns the current tenant's data
```
Key Features:
| Feature | Description |
|---|---|
| Automatic filtering | All SELECT/UPDATE/DELETE queries filtered by tenant |
| Insert validation | INSERT must match current tenant context |
| Fail-secure | No tenant context = no data access |
| Defense-in-depth | Database enforces isolation even if app has bugs |
Tenant Context Management
The application manages tenant context through several mechanisms:
- Request Middleware: Resolves the tenant from subdomain, header, or JWT
- Session Variable: Sets `app.current_tenant` on each database connection
- Context Guard: RAII pattern ensures cleanup
```rust
// Using the tenant context guard
async fn handle_request(pool: &TenantAwarePool, tenant_id: Uuid) -> Result<()> {
    let _guard = TenantContextGuard::new(pool, tenant_id).await?;

    // All queries here are automatically filtered by tenant
    let incidents = incident_repo.list_all().await?;

    // Context is cleared when the guard drops
    Ok(())
}
```
Admin Operations
Admin operations that need to bypass RLS use a separate connection pool:
- Admin pool: Superuser role that bypasses RLS policies
- Use cases: Tenant management, cross-tenant reporting, maintenance
- Access control: Restricted to Admin role users only
Tables Protected by RLS
All tenant-scoped data tables have RLS enabled:
- `incidents`, `actions`, `approvals`, `audit_logs`
- `users`, `api_keys`, `sessions`
- `playbooks`, `policies`, `connectors`
- `notification_channels`, `settings`
System tables (tenants, feature_flags) do NOT have RLS.
Debugging RLS Issues
```sql
-- Check the current tenant context
SELECT get_current_tenant();

-- View RLS policies for a table
SELECT * FROM pg_policies WHERE tablename = 'incidents';

-- Check whether RLS is enabled
SELECT relname, relrowsecurity
FROM pg_class
WHERE relname IN ('incidents', 'tenants');
```
Data Protection
At Rest
- Database encryption: SQLite with SQLCipher (optional), PostgreSQL with TDE
- Credential storage: All API keys/tokens hashed with Argon2id
- Secrets management: Environment variables or external secret stores
In Transit
- TLS 1.3: Required for all external connections
- Certificate validation: Strict validation for connectors
- Internal traffic: TLS optional for localhost development
Sensitive Data Handling
```rust
// Credentials are redacted in logs
impl std::fmt::Debug for ApiKey {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "ApiKey {{ prefix: {}, secret: [REDACTED] }}", self.prefix)
    }
}
```
Audit Trail
All security-relevant actions logged:
| Event | Data Captured |
|---|---|
| Login | user_id, ip_address, success, timestamp |
| Logout | user_id, session_duration |
| Action executed | action_id, user_id, incident_id, result |
| Action approved | action_id, approver_id, decision |
| Policy change | user_id, old_value, new_value |
| User management | admin_id, target_user, operation |
Audit log retention: 90 days (configurable)
Connector Security
Credential Management
Connector credentials stored encrypted:
```bash
# Environment variables (recommended)
TW_VIRUSTOTAL_API_KEY=your-key

# Or store encrypted in the database (prompting keeps the key out of shell history)
read -rs TW_KEY && tw-cli connector set virustotal --api-key "$TW_KEY"
```
Rate Limiting
Built-in rate limiting prevents API abuse:
| Connector | Default Limit |
|---|---|
| VirusTotal | 4 req/min (free tier) |
| Splunk | 100 req/min |
| CrowdStrike | 50 req/min |
Circuit Breaker
Automatic failure handling:
- After 5 consecutive failures, the circuit opens
- Requests fail fast for 30 seconds
- Then a half-open state allows test requests through
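That behaviour can be sketched as a small state machine. The thresholds match the description above; the class and its clock-injection parameter are illustrative:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; fail fast for `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            return True             # half-open: let a test request through
        return False                # open: fail fast

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None   # close the circuit again
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
```

A successful test request in the half-open state closes the circuit; another failure re-opens it and restarts the cooldown.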
Input Validation
API Requests
- JSON schema validation on all endpoints
- Size limits on request bodies (1MB default)
- Type coercion disabled (strict typing)
Webhook Payloads
- HMAC signature verification
- Replay attack prevention (timestamp validation)
- Payload size limits
```rust
// Webhook signature verification
fn verify_webhook(payload: &[u8], signature: &str, secret: &str) -> bool {
    let expected = hmac_sha256(secret, payload);
    constant_time_compare(signature, &expected)
}
```
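A runnable Python sketch of the same checks, adding the timestamp window used for replay prevention. The signed-string layout (`"<timestamp>." + payload`) and the 5-minute tolerance are assumptions, not the wire format Triage Warden actually uses:

```python
import hashlib
import hmac
import time

def verify_webhook(payload: bytes, signature: str, timestamp: float,
                   secret: str, tolerance: float = 300.0, now=None) -> bool:
    """Reject stale timestamps, then compare HMAC-SHA256 in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance:
        return False  # replay / clock-skew rejection
    # Sign timestamp + payload so the timestamp itself is authenticated.
    mac = hmac.new(secret.encode(),
                   f"{timestamp}.".encode() + payload,
                   hashlib.sha256).hexdigest()
    return hmac.compare_digest(mac, signature)
```

Including the timestamp in the signed string matters: validating it separately would let an attacker replay an old signed payload under a fresh timestamp.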
Secure Defaults
- HTTPS enforced in production
- Secure cookie flags enabled
- CORS restricted to configured origins
- Debug endpoints disabled in production
- Verbose errors only in development
Security Headers
Default response headers:
```text
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
Content-Security-Policy: default-src 'self'
```
Vulnerability Disclosure
Report security vulnerabilities to: [email protected]
We follow responsible disclosure practices and aim to respond within 48 hours.
Database Schema
Triage Warden supports both SQLite (development/small deployments) and PostgreSQL (production). This document describes the database schema used by both backends.
Overview
The database consists of 13 tables organized into four logical groups:
- Core Incident Management: incidents, audit_logs, actions, approvals
- Configuration: playbooks, connectors, policies, notification_channels, settings
- Authentication: users, sessions, api_keys
- Multi-Tenancy: tenants, feature_flags
Multi-Tenancy
All tenant-scoped tables include a tenant_id foreign key that references the tenants table. In PostgreSQL, Row-Level Security (RLS) policies automatically filter all queries by the current tenant context.
tenants
Tenant organizations in a multi-tenant deployment.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Organization display name |
| slug | TEXT | UNIQUE, NOT NULL | URL-safe identifier for routing |
| status | ENUM/TEXT | DEFAULT 'active' | active, suspended, pending_deletion |
| settings | JSON/TEXT | DEFAULT '{}' | Tenant-specific settings |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |
Indexes: slug (unique), status
feature_flags
Feature flag configuration for gradual rollouts.
| Column | Type | Constraints | Description |
|---|---|---|---|
| name | TEXT | PRIMARY KEY | Flag name |
| description | TEXT | DEFAULT '' | Flag description |
| default_enabled | BOOLEAN | DEFAULT FALSE | Default state |
| tenant_overrides | JSON | DEFAULT '{}' | Per-tenant overrides |
| percentage_rollout | INTEGER | NULLABLE | 0-100 percentage rollout |
| created_at | TIMESTAMP | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP | NOT NULL | Last update timestamp |
Note: The tenants and feature_flags tables are NOT protected by RLS.
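One common way to make `percentage_rollout` deterministic is to hash the flag name and tenant id into a 0-99 bucket, so a given tenant always gets the same answer as the percentage ramps up. The sketch below is illustrative; the shipped logic may differ:

```python
import hashlib

def flag_enabled(flag_name: str, tenant_id: str, percentage,
                 default: bool = False) -> bool:
    """Deterministic percentage rollout: same tenant, same answer."""
    if percentage is None:
        return default  # no rollout configured; fall back to default_enabled
    # Hash flag + tenant so different flags ramp independently.
    digest = hashlib.sha256(f"{flag_name}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percentage
```

Because the bucket depends only on the flag and tenant, raising `percentage_rollout` from 20 to 50 keeps every tenant already enabled at 20 enabled at 50.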
Entity Relationship Diagram
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ users │ │ api_keys │ │ sessions │
├──────────────┤ ├──────────────┤ ├──────────────┤
│ id (PK) │◄──────│ user_id (FK) │ │ id (PK) │
│ email │ │ id (PK) │ │ data │
│ username │ │ key_hash │ │ expiry_date │
│ password_hash│ │ scopes │ └──────────────┘
│ role │ └──────────────┘
└──────────────┘
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ incidents │ │ audit_logs │ │ actions │
├──────────────┤ ├──────────────┤ ├──────────────┤
│ id (PK) │◄──────│ incident_id │ │ id (PK) │
│ source │ │ id (PK) │ │ incident_id │──┐
│ severity │ │ action │ │ action_type │ │
│ status │◄──────│ actor │ │ target │ │
│ alert_data │ │ details │ │ approval_status│ │
│ enrichments │ │ created_at │ └──────────────┘ │
│ analysis │ └──────────────┘ │
│ proposed_actions│ │
│ ticket_id │ ┌──────────────┐ │
│ tags │ │ approvals │◄────────────────────────┘
│ metadata │ ├──────────────┤
└──────────────┘ │ id (PK) │
│ action_id │
│ incident_id │
│ status │
└──────────────┘
Core Tables
incidents
Stores security incidents created from incoming alerts.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| tenant_id | UUID/TEXT | FK → tenants, NOT NULL | Owning tenant |
| source | JSON/TEXT | NOT NULL | Alert source metadata |
| severity | ENUM/TEXT | NOT NULL | info, low, medium, high, critical |
| status | ENUM/TEXT | NOT NULL | See Status Values |
| alert_data | JSON/TEXT | NOT NULL | Original alert payload |
| enrichments | JSON/TEXT | DEFAULT '[]' | Array of enrichment results |
| analysis | JSON/TEXT | NULLABLE | AI triage analysis |
| proposed_actions | JSON/TEXT | DEFAULT '[]' | Array of proposed actions |
| ticket_id | TEXT | NULLABLE | External ticket reference |
| tags | JSON/TEXT | DEFAULT '[]' | User-defined tags |
| metadata | JSON/TEXT | DEFAULT '{}' | Additional metadata |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |
Indexes: (tenant_id, status), (tenant_id, severity), (tenant_id, created_at), status, severity, created_at, updated_at
RLS: Protected by Row-Level Security in PostgreSQL.
Incident Status Values
- `new` - Newly created from alert
- `enriching` - Gathering threat intelligence
- `analyzing` - AI analysis in progress
- `pending_review` - Awaiting analyst review
- `pending_approval` - Actions awaiting approval
- `executing` - Actions being executed
- `resolved` - Incident resolved
- `false_positive` - Marked as false positive
- `escalated` - Escalated to higher tier
- `closed` - Administratively closed
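These states imply a transition graph. A sketch of validating moves between them follows; the allowed edges are inferred from the status descriptions and lifecycle diagram, not an authoritative state machine:

```python
# Inferred transition graph; the real service may allow more edges.
TRANSITIONS = {
    "new": {"enriching", "false_positive", "closed"},
    "enriching": {"analyzing"},
    "analyzing": {"pending_review"},
    "pending_review": {"pending_approval", "resolved", "false_positive", "escalated"},
    "pending_approval": {"executing"},
    "executing": {"resolved"},
    "resolved": {"closed"},
    "escalated": {"pending_review", "closed"},
    "false_positive": {"closed"},
    "closed": set(),  # terminal
}

def can_transition(current: str, target: str) -> bool:
    """True if moving from `current` to `target` is an allowed edge."""
    return target in TRANSITIONS.get(current, set())
```

Encoding the graph as data keeps the validation a one-line set lookup and makes the allowed lifecycle easy to audit.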
audit_logs
Immutable audit trail for all incident actions.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| incident_id | UUID/TEXT | FK → incidents | Parent incident |
| action | TEXT | NOT NULL | Action type (status_changed, action_approved, etc.) |
| actor | TEXT | NOT NULL | Username or "system" |
| details | JSON/TEXT | NULLABLE | Action-specific details |
| created_at | TIMESTAMP/TEXT | NOT NULL | Action timestamp |
Indexes: incident_id, created_at
actions
Stores proposed and executed response actions.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| incident_id | UUID/TEXT | FK → incidents | Parent incident |
| action_type | TEXT | NOT NULL | isolate_host, disable_user, block_ip, etc. |
| target | JSON/TEXT | NOT NULL | Action target details |
| parameters | JSON/TEXT | DEFAULT '{}' | Action parameters |
| reason | TEXT | NOT NULL | Justification for action |
| priority | INTEGER | DEFAULT 50 | Execution priority (1-100) |
| approval_status | ENUM/TEXT | NOT NULL | See Approval Status Values |
| approved_by | TEXT | NULLABLE | Approving user |
| approval_timestamp | TIMESTAMP/TEXT | NULLABLE | Approval time |
| result | JSON/TEXT | NULLABLE | Execution result |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| executed_at | TIMESTAMP/TEXT | NULLABLE | Execution timestamp |
Indexes: incident_id, approval_status, created_at
Approval Status Values
- `pending` - Awaiting approval decision
- `auto_approved` - Automatically approved by policy
- `approved` - Manually approved
- `denied` - Manually denied
- `executed` - Successfully executed
- `failed` - Execution failed
approvals
Tracks multi-level approval workflows.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| action_id | UUID/TEXT | FK → actions | Related action |
| incident_id | UUID/TEXT | FK → incidents | Parent incident |
| approval_level | TEXT | NOT NULL | analyst, senior, manager, executive |
| status | ENUM/TEXT | NOT NULL | pending, approved, denied, expired |
| requested_by | TEXT | NOT NULL | Requesting user/system |
| requested_at | TIMESTAMP/TEXT | NOT NULL | Request timestamp |
| decided_by | TEXT | NULLABLE | Deciding user |
| decided_at | TIMESTAMP/TEXT | NULLABLE | Decision timestamp |
| decision_reason | TEXT | NULLABLE | Optional reason |
| expires_at | TIMESTAMP/TEXT | NULLABLE | Approval expiration |
Indexes: action_id, status, expires_at
Configuration Tables
playbooks
Automation workflow definitions.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Playbook name |
| description | TEXT | NULLABLE | Description |
| trigger_type | TEXT | NOT NULL | alert_type, severity, source, manual |
| trigger_condition | TEXT | NULLABLE | Trigger condition expression |
| stages | JSON/TEXT | DEFAULT '[]' | Array of workflow stages |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| execution_count | INTEGER | DEFAULT 0 | Times executed |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |
Indexes: name, trigger_type, enabled, created_at
connectors
External integration configurations.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Display name |
| connector_type | TEXT | NOT NULL | virus_total, jira, splunk, etc. |
| config | JSON/TEXT | DEFAULT '{}' | Connection configuration (encrypted credentials) |
| status | TEXT | DEFAULT 'unknown' | connected, disconnected, error, unknown |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| last_health_check | TIMESTAMP/TEXT | NULLABLE | Last health check time |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |
Indexes: name, connector_type, status, enabled
policies
Approval and automation policy rules.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Policy name |
| description | TEXT | NULLABLE | Description |
| condition | TEXT | NOT NULL | Condition expression |
| action | TEXT | NOT NULL | auto_approve, require_approval, deny |
| approval_level | TEXT | NULLABLE | Required approval level |
| priority | INTEGER | DEFAULT 0 | Evaluation priority |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |
Indexes: name, action, priority, enabled
notification_channels
Alert notification configurations.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| name | TEXT | NOT NULL | Channel name |
| channel_type | TEXT | NOT NULL | slack, teams, email, pagerduty, webhook |
| config | JSON/TEXT | DEFAULT '{}' | Channel configuration |
| events | JSON/TEXT | DEFAULT '[]' | Subscribed event types |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Active status |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |
Indexes: name, channel_type, enabled
settings
Key-value configuration store.
| Column | Type | Constraints | Description |
|---|---|---|---|
| key | TEXT | PRIMARY KEY | Setting key (general, rate_limits, llm) |
| value | JSON/TEXT | NOT NULL | Setting value as JSON |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |
Authentication Tables
users
User accounts for dashboard and API access.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| email | TEXT | UNIQUE, NOT NULL | Email address |
| username | TEXT | UNIQUE, NOT NULL | Login username |
| password_hash | TEXT | NOT NULL | Argon2 password hash |
| role | ENUM/TEXT | NOT NULL | admin, analyst, viewer |
| display_name | TEXT | NULLABLE | Display name |
| enabled | BOOLEAN/INTEGER | DEFAULT TRUE | Account active status |
| last_login_at | TIMESTAMP/TEXT | NULLABLE | Last login timestamp |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
| updated_at | TIMESTAMP/TEXT | NOT NULL | Last update timestamp |
Indexes: email, username, role, enabled
sessions
User session storage (tower-sessions compatible).
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | TEXT | PRIMARY KEY | Session ID |
| data | BLOB | NOT NULL | Encrypted session data |
| expiry_date | INTEGER | NOT NULL | Unix timestamp expiration |
Indexes: expiry_date
api_keys
API key authentication.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | UUID/TEXT | PRIMARY KEY | Unique identifier |
| user_id | UUID/TEXT | FK → users | Owner user |
| name | TEXT | NOT NULL | Key display name |
| key_hash | TEXT | NOT NULL | SHA-256 hash of key |
| key_prefix | TEXT | NOT NULL | First 8 chars for identification |
| scopes | JSON/TEXT | DEFAULT '[]' | Allowed API scopes |
| expires_at | TIMESTAMP/TEXT | NULLABLE | Key expiration |
| last_used_at | TIMESTAMP/TEXT | NULLABLE | Last usage timestamp |
| created_at | TIMESTAMP/TEXT | NOT NULL | Creation timestamp |
Indexes: user_id, key_prefix, expires_at
Database-Specific Notes
SQLite
- UUIDs stored as TEXT
- Timestamps stored as ISO 8601 TEXT
- Boolean stored as INTEGER (0/1)
- JSON stored as TEXT
- Uses `CHECK` constraints for enums
PostgreSQL
- Native UUID type
- Native TIMESTAMPTZ type
- Native BOOLEAN type
- Native JSONB type with indexing
- Uses custom ENUM types for status fields
- Row-Level Security (RLS) enabled on all tenant-scoped tables
Row-Level Security
PostgreSQL deployments use RLS for defense-in-depth tenant isolation:
```sql
-- RLS policy example (automatically applied to all queries)
CREATE POLICY incidents_select_tenant_isolation ON incidents
    FOR SELECT
    USING (tenant_id = current_setting('app.current_tenant', true)::uuid);
```
To set the tenant context:
```sql
-- Set before executing tenant-scoped queries
SELECT set_tenant_context('00000000-0000-0000-0000-000000000001'::uuid);

-- Or use the session variable directly
SET app.current_tenant = '00000000-0000-0000-0000-000000000001';
```
Helper functions:
| Function | Description |
|---|---|
| `set_tenant_context(uuid)` | Sets tenant context, returns previous value |
| `get_current_tenant()` | Returns current tenant UUID or NULL |
| `clear_tenant_context()` | Clears tenant context |
Migrations
Migrations are managed by SQLx and located in:
- SQLite: `crates/tw-core/src/db/migrations/sqlite/`
- PostgreSQL: `crates/tw-core/src/db/migrations/postgres/`
Run migrations automatically on startup or manually:
```bash
# SQLite
tw-cli db migrate --database-url "sqlite:data/triage.db"

# PostgreSQL
tw-cli db migrate --database-url "postgres://user:pass@host/db"
```
Connectors
Connectors integrate Triage Warden with external security tools and services.
Overview
Each connector type has a trait interface and multiple implementations:
| Type | Purpose | Implementations |
|---|---|---|
| Threat Intelligence | Hash/URL/domain reputation | VirusTotal, Mock |
| SIEM | Log queries and correlation | Splunk, Mock |
| EDR | Endpoint detection and response | CrowdStrike, Mock |
| Email Gateway | Email security operations | Microsoft 365, Mock |
| Ticketing | Incident ticket management | Jira, Mock |
Configuration
Select connector implementations via environment variables:
```bash
# Use real connectors
TW_THREAT_INTEL_MODE=virustotal
TW_SIEM_MODE=splunk
TW_EDR_MODE=crowdstrike
TW_EMAIL_GATEWAY_MODE=m365
TW_TICKETING_MODE=jira

# Or use mocks for testing
TW_THREAT_INTEL_MODE=mock
TW_SIEM_MODE=mock
```
Connector Trait
All connectors implement the base Connector trait:
```rust
#[async_trait]
pub trait Connector: Send + Sync {
    /// Unique identifier for this connector instance
    fn name(&self) -> &str;

    /// Type of connector (threat_intel, siem, edr, etc.)
    fn connector_type(&self) -> &str;

    /// Check connector health
    async fn health_check(&self) -> ConnectorResult<ConnectorHealth>;

    /// Test connection to the service
    async fn test_connection(&self) -> ConnectorResult<bool>;
}

pub enum ConnectorHealth {
    Healthy,
    Degraded { message: String },
    Unhealthy { message: String },
}
```
Error Handling
Connectors return ConnectorResult<T> with detailed error types:
```rust
pub enum ConnectorError {
    /// Service returned an error
    RequestFailed(String),
    /// Resource not found
    NotFound(String),
    /// Authentication failed
    AuthenticationFailed(String),
    /// Rate limit exceeded
    RateLimited { retry_after: Option<Duration> },
    /// Network or connection error
    NetworkError(String),
    /// Invalid response from service
    InvalidResponse(String),
}
```
Health Monitoring
Check connector health via the API:
```bash
curl http://localhost:8080/api/connectors/health
```

```json
{
  "connectors": [
    { "name": "virustotal", "type": "threat_intel", "status": "healthy" },
    { "name": "splunk", "type": "siem", "status": "healthy" },
    { "name": "crowdstrike", "type": "edr", "status": "degraded", "message": "High latency" }
  ]
}
```
Next Steps
- Threat Intelligence - VirusTotal configuration
- SIEM - Splunk configuration
- EDR - CrowdStrike configuration
- Email Gateway - Microsoft 365 configuration
- Ticketing - Jira configuration
Threat Intelligence Connector
Query threat intelligence services for reputation data on hashes, URLs, domains, and IP addresses.
Interface
```rust
#[async_trait]
pub trait ThreatIntelConnector: Connector {
    /// Look up file hash reputation
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport>;

    /// Look up URL reputation
    async fn lookup_url(&self, url: &str) -> ConnectorResult<ThreatReport>;

    /// Look up domain reputation
    async fn lookup_domain(&self, domain: &str) -> ConnectorResult<ThreatReport>;

    /// Look up IP address reputation
    async fn lookup_ip(&self, ip: &str) -> ConnectorResult<ThreatReport>;
}

pub struct ThreatReport {
    pub indicator: String,
    pub indicator_type: IndicatorType,
    pub malicious: bool,
    pub confidence: f64,
    pub categories: Vec<String>,
    pub first_seen: Option<DateTime<Utc>>,
    pub last_seen: Option<DateTime<Utc>>,
    pub sources: Vec<ThreatSource>,
}
```
VirusTotal
Configuration
```bash
TW_THREAT_INTEL_MODE=virustotal
TW_VIRUSTOTAL_API_KEY=your-api-key-here
```
Rate Limits
| Tier | Requests/Minute |
|---|---|
| Free | 4 |
| Premium | 500+ |
The connector automatically handles rate limiting with exponential backoff.
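The backoff schedule can be sketched as follows. The base delay and cap are assumptions; honoring a server-provided `retry_after` mirrors the `ConnectorError::RateLimited { retry_after }` variant described earlier:

```python
def next_delay(attempt: int, retry_after=None, base: float = 1.0,
               cap: float = 60.0) -> float:
    """Delay before retry `attempt` (0-based).

    Prefer a server-provided Retry-After value; otherwise use
    capped exponential backoff: base * 2^attempt, at most `cap`.
    """
    if retry_after is not None:
        return retry_after
    return min(base * (2 ** attempt), cap)
```

A caller would sleep for `next_delay(attempt, retry_after)` after each rate-limited response, so delays grow 1s, 2s, 4s, ... until they hit the cap.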
Supported Lookups
| Method | VT Endpoint | Notes |
|---|---|---|
| `lookup_hash` | `/files/{hash}` | MD5, SHA-1, SHA-256 |
| `lookup_url` | `/urls/{url_id}` | Base64-encoded URL |
| `lookup_domain` | `/domains/{domain}` | Domain reputation |
| `lookup_ip` | `/ip_addresses/{ip}` | IP reputation |
Example Usage
```rust
let connector = VirusTotalConnector::new(api_key)?;
let report = connector.lookup_hash("44d88612fea8a8f36de82e1278abb02f").await?;

println!("Malicious: {}", report.malicious);
println!("Confidence: {:.2}", report.confidence);
println!("Categories: {:?}", report.categories);
```
Response Mapping
VirusTotal detection ratios map to confidence scores:
| Detection Ratio | Confidence | Classification |
|---|---|---|
| 0% | 0.0 | Clean |
| 1-10% | 0.3 | Suspicious |
| 11-50% | 0.6 | Likely Malicious |
| 51-100% | 0.9 | Malicious |
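The mapping in the table can be expressed as a small helper. The thresholds are copied from the table; the function name is illustrative:

```python
def classify(detection_ratio: float) -> tuple:
    """Map a VirusTotal detection ratio (0.0-1.0) to (confidence, label)."""
    if detection_ratio == 0:
        return 0.0, "Clean"
    if detection_ratio <= 0.10:
        return 0.3, "Suspicious"
    if detection_ratio <= 0.50:
        return 0.6, "Likely Malicious"
    return 0.9, "Malicious"
```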
Mock Connector
For testing without external API calls:
```bash
TW_THREAT_INTEL_MODE=mock
```
The mock connector returns predictable results based on indicator patterns:
| Pattern | Result |
|---|---|
| Contains "malicious" | Malicious, confidence 0.95 |
| Contains "suspicious" | Suspicious, confidence 0.5 |
| Contains "clean" | Clean, confidence 0.1 |
| Default | Clean, confidence 0.2 |
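The pattern table amounts to a few substring checks. This is a sketch of that behaviour, not the actual mock implementation:

```python
def mock_lookup(indicator: str) -> dict:
    """Return a canned verdict based on substrings in the indicator."""
    if "malicious" in indicator:
        return {"malicious": True, "confidence": 0.95}
    if "suspicious" in indicator:
        return {"malicious": False, "confidence": 0.5}
    if "clean" in indicator:
        return {"malicious": False, "confidence": 0.1}
    return {"malicious": False, "confidence": 0.2}  # default: clean
```

Deterministic results like these let playbook tests assert on exact verdicts without hitting an external API.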
Python Bridge
Access from Python via the bridge:
```python
from tw_bridge import ThreatIntelBridge

# Create bridge (uses the TW_THREAT_INTEL_MODE env var)
bridge = ThreatIntelBridge()

# Or specify the mode explicitly
bridge = ThreatIntelBridge("virustotal")

# Look up a hash
result = bridge.lookup_hash("44d88612fea8a8f36de82e1278abb02f")
print(f"Malicious: {result['malicious']}")
print(f"Confidence: {result['confidence']}")

# Look up a URL
result = bridge.lookup_url("https://example.com/suspicious")

# Look up a domain
result = bridge.lookup_domain("malware-site.com")
```
Caching
Results are cached to reduce API calls:
| Lookup Type | Cache Duration |
|---|---|
| Hash | 24 hours |
| URL | 1 hour |
| Domain | 6 hours |
| IP | 6 hours |
Cache is stored in the database and shared across instances.
Adding Custom Providers
Implement the ThreatIntelConnector trait:
```rust
pub struct CustomThreatIntelConnector {
    client: reqwest::Client,
    api_key: String,
}

#[async_trait]
impl Connector for CustomThreatIntelConnector {
    fn name(&self) -> &str { "custom" }
    fn connector_type(&self) -> &str { "threat_intel" }
    // ... implement health_check, test_connection
}

#[async_trait]
impl ThreatIntelConnector for CustomThreatIntelConnector {
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport> {
        // Custom implementation
    }
    // ... implement the other lookup methods
}
```
See Adding Connectors for full details.
SIEM Connector
Query SIEM platforms for log data, run searches, and correlate events.
Interface
```rust
#[async_trait]
pub trait SIEMConnector: Connector {
    /// Run a search query
    async fn search(&self, query: &str, time_range: TimeRange) -> ConnectorResult<SearchResults>;

    /// Get events by ID
    async fn get_events(&self, event_ids: &[String]) -> ConnectorResult<Vec<SIEMEvent>>;

    /// Get related events (correlation)
    async fn get_related_events(
        &self,
        indicator: &str,
        indicator_type: IndicatorType,
        time_range: TimeRange,
    ) -> ConnectorResult<Vec<SIEMEvent>>;
}

pub struct SIEMEvent {
    pub id: String,
    pub timestamp: DateTime<Utc>,
    pub source: String,
    pub event_type: String,
    pub severity: String,
    pub raw_data: serde_json::Value,
}

pub struct SearchResults {
    pub events: Vec<SIEMEvent>,
    pub total_count: u64,
    pub search_id: String,
}
```
Splunk
Configuration
```sh
TW_SIEM_MODE=splunk
TW_SPLUNK_URL=https://splunk.company.com:8089
TW_SPLUNK_TOKEN=your-token-here
```
Token Permissions
The Splunk token requires these capabilities:
- `search` - Run searches
- `list_inputs` - Health check
- `rest_access` - REST API access
Example Searches
```rust
let connector = SplunkConnector::new(url, token)?;

// Search for events
let results = connector.search(
    r#"index=security sourcetype=firewall action=blocked"#,
    TimeRange::last_hours(24),
).await?;

// Find related events by IP
let related = connector.get_related_events(
    "192.168.1.100",
    IndicatorType::IpAddress,
    TimeRange::last_hours(1),
).await?;
```
Search Query Translation
Common queries translated to SPL:
| Triage Warden Query | Splunk SPL |
|---|---|
| IP correlation | index=* src_ip="{ip}" OR dest_ip="{ip}" |
| User activity | index=* user="{user}" |
| Hash lookup | index=* (file_hash="{hash}" OR sha256="{hash}") |
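The translation amounts to filling indicator values into SPL templates. A hypothetical sketch of that expansion (the helper name and fallback branch are assumptions, not the connector's actual code):

```rust
/// Expand an indicator into SPL, following the translation table above.
/// Hypothetical helper for illustration only.
fn to_spl(indicator: &str, kind: &str) -> String {
    match kind {
        "ip" => format!(r#"index=* src_ip="{indicator}" OR dest_ip="{indicator}""#),
        "user" => format!(r#"index=* user="{indicator}""#),
        "hash" => format!(r#"index=* (file_hash="{indicator}" OR sha256="{indicator}")"#),
        // Assumed fallback: bare term search
        _ => format!(r#"index=* "{indicator}""#),
    }
}

fn main() {
    assert_eq!(
        to_spl("192.168.1.100", "ip"),
        r#"index=* src_ip="192.168.1.100" OR dest_ip="192.168.1.100""#
    );
}
```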
Performance Tips
- Use specific indexes in queries
- Limit time ranges when possible
- Use `| head 1000` to limit results
Mock Connector
For testing:
```sh
TW_SIEM_MODE=mock
```
The mock returns sample security events matching the query pattern.
Python Bridge
```python
from tw_bridge import SIEMBridge

bridge = SIEMBridge("splunk")

# Run a search
results = bridge.search(
    query='index=security action=blocked',
    hours=24
)
for event in results['events']:
    print(f"{event['timestamp']}: {event['source']}")

# Get related events
related = bridge.get_related_events(
    indicator="192.168.1.100",
    indicator_type="ip",
    hours=1
)
```
Adding Custom SIEM
Implement the SIEMConnector trait:
```rust
pub struct ElasticSIEMConnector {
    client: elasticsearch::Elasticsearch,
}

#[async_trait]
impl SIEMConnector for ElasticSIEMConnector {
    async fn search(&self, query: &str, time_range: TimeRange) -> ConnectorResult<SearchResults> {
        // Translate to Elasticsearch DSL and execute
    }
    // ... implement other methods
}
```
EDR Connector
Integrate with Endpoint Detection and Response platforms for host information and response actions.
Interface
```rust
#[async_trait]
pub trait EDRConnector: Connector {
    /// Get host information
    async fn get_host(&self, host_id: &str) -> ConnectorResult<HostInfo>;

    /// Search for hosts
    async fn search_hosts(&self, query: &str) -> ConnectorResult<Vec<HostInfo>>;

    /// Get recent detections for a host
    async fn get_detections(&self, host_id: &str) -> ConnectorResult<Vec<Detection>>;

    /// Isolate a host from the network
    async fn isolate_host(&self, host_id: &str) -> ConnectorResult<ActionResult>;

    /// Remove host isolation
    async fn unisolate_host(&self, host_id: &str) -> ConnectorResult<ActionResult>;

    /// Trigger a scan on the host
    async fn scan_host(&self, host_id: &str) -> ConnectorResult<ActionResult>;
}

pub struct HostInfo {
    pub id: String,
    pub hostname: String,
    pub platform: String,
    pub os_version: String,
    pub agent_version: String,
    pub last_seen: DateTime<Utc>,
    pub isolation_status: IsolationStatus,
    pub tags: Vec<String>,
}

pub struct Detection {
    pub id: String,
    pub timestamp: DateTime<Utc>,
    pub severity: String,
    pub tactic: String,
    pub technique: String,
    pub description: String,
    pub process_name: Option<String>,
    pub file_path: Option<String>,
}
```
CrowdStrike
Configuration
```sh
TW_EDR_MODE=crowdstrike
TW_CROWDSTRIKE_CLIENT_ID=your-client-id
TW_CROWDSTRIKE_CLIENT_SECRET=your-client-secret
TW_CROWDSTRIKE_REGION=us-1  # us-1, us-2, eu-1, usgov-1
```
API Scopes Required
The API client requires these scopes:
- `Hosts: Read` - Get host information
- `Hosts: Write` - Isolation actions
- `Detections: Read` - Get detections
- `Real Time Response: Write` - Scan actions
OAuth2 Token Management
The connector automatically handles token refresh:
```rust
// Token refreshed automatically when expired
let connector = CrowdStrikeConnector::new(client_id, client_secret, region)?;

// All subsequent calls use a valid token
let host = connector.get_host("abc123").await?;
```
Example Usage
```rust
// Get host information
let host = connector.get_host("aid:abc123").await?;
println!("Hostname: {}", host.hostname);
println!("Last seen: {}", host.last_seen);

// Check for detections
let detections = connector.get_detections("aid:abc123").await?;
for d in detections {
    println!("{}: {} - {}", d.timestamp, d.severity, d.description);
}

// Isolate compromised host
let result = connector.isolate_host("aid:abc123").await?;
if result.success {
    println!("Host isolated successfully");
}
```
Action Confirmation
Isolation and scan actions require policy approval. See Policy Engine.
Mock Connector
```sh
TW_EDR_MODE=mock
```
The mock provides sample hosts and detections for testing.
Python Bridge
```python
from tw_bridge import EDRBridge

bridge = EDRBridge("crowdstrike")

# Get host info
host = bridge.get_host("aid:abc123")
print(f"Hostname: {host['hostname']}")
print(f"Platform: {host['platform']}")

# Get detections
detections = bridge.get_detections("aid:abc123")
for d in detections:
    print(f"{d['severity']}: {d['description']}")

# Isolate host (requires policy approval)
result = bridge.isolate_host("aid:abc123")
if result['success']:
    print("Host isolated")
```
Response Actions
| Action | Description | Rollback |
|---|---|---|
| `isolate_host` | Network isolation | `unisolate_host` |
| `scan_host` | On-demand scan | N/A |
Isolation Behavior
When isolated:
- Host cannot communicate on network
- Falcon agent maintains connection to cloud
- User may see isolation notification
Rate Limits
| Endpoint | Limit |
|---|---|
| Host queries | 100/min |
| Detection queries | 50/min |
| Containment actions | 10/min |
Email Gateway Connector
Manage email security operations including search, quarantine, and sender blocking.
Interface
```rust
#[async_trait]
pub trait EmailGatewayConnector: Connector {
    /// Search for emails
    async fn search_emails(&self, query: EmailSearchQuery) -> ConnectorResult<Vec<EmailMessage>>;

    /// Get specific email by ID
    async fn get_email(&self, message_id: &str) -> ConnectorResult<EmailMessage>;

    /// Move email to quarantine
    async fn quarantine_email(&self, message_id: &str) -> ConnectorResult<ActionResult>;

    /// Release email from quarantine
    async fn release_email(&self, message_id: &str) -> ConnectorResult<ActionResult>;

    /// Block sender
    async fn block_sender(&self, sender: &str) -> ConnectorResult<ActionResult>;

    /// Unblock sender
    async fn unblock_sender(&self, sender: &str) -> ConnectorResult<ActionResult>;

    /// Get threat data for email
    async fn get_threat_data(&self, message_id: &str) -> ConnectorResult<EmailThreatData>;
}

pub struct EmailMessage {
    pub id: String,
    pub internet_message_id: String,
    pub sender: String,
    pub recipients: Vec<String>,
    pub subject: String,
    pub received_at: DateTime<Utc>,
    pub has_attachments: bool,
    pub attachments: Vec<EmailAttachment>,
    pub urls: Vec<String>,
    pub headers: HashMap<String, String>,
    pub threat_assessment: Option<ThreatAssessment>,
}

pub struct EmailSearchQuery {
    pub sender: Option<String>,
    pub recipient: Option<String>,
    pub subject_contains: Option<String>,
    pub timerange: TimeRange,
    pub has_attachments: Option<bool>,
    pub threat_type: Option<String>,
    pub limit: usize,
}
```
Microsoft 365
Configuration
```sh
TW_EMAIL_GATEWAY_MODE=m365
TW_M365_TENANT_ID=your-tenant-id
TW_M365_CLIENT_ID=your-client-id
TW_M365_CLIENT_SECRET=your-client-secret
```
App Registration
Create an Azure AD app registration with these API permissions:
| Permission | Type | Purpose |
|---|---|---|
| `Mail.Read` | Application | Read emails |
| `Mail.ReadWrite` | Application | Quarantine actions |
| `ThreatAssessment.Read.All` | Application | Threat data |
| `Policy.Read.All` | Application | Block list management |
Example Usage
```rust
let connector = M365Connector::new(tenant_id, client_id, client_secret)?;

// Search for suspicious emails
let query = EmailSearchQuery {
    sender: Some("[email protected]".to_string()),
    timerange: TimeRange::last_hours(24),
    ..Default::default()
};
let emails = connector.search_emails(query).await?;

// Quarantine malicious email
let result = connector.quarantine_email("AAMkAGI2...").await?;

// Block sender
let result = connector.block_sender("[email protected]").await?;
```
Quarantine Behavior
When quarantined:
- Email moved to quarantine folder
- User notified (configurable)
- Admin can release if false positive
Mock Connector
```sh
TW_EMAIL_GATEWAY_MODE=mock
```
Provides sample emails with various threat characteristics:
- Phishing with malicious URLs
- Malware with executable attachments
- BEC/impersonation attempts
- Clean legitimate emails
Python Bridge
```python
from tw_bridge import EmailGatewayBridge

bridge = EmailGatewayBridge("m365")

# Search emails
emails = bridge.search_emails(
    sender="[email protected]",
    hours=24
)
for email in emails:
    print(f"From: {email['sender']}")
    print(f"Subject: {email['subject']}")
    print(f"Attachments: {len(email['attachments'])}")

# Quarantine email
result = bridge.quarantine_email("AAMkAGI2...")
if result['success']:
    print("Email quarantined")

# Block sender
result = bridge.block_sender("[email protected]")
```
Response Actions
| Action | Description | Rollback |
|---|---|---|
| `quarantine_email` | Move to quarantine | `release_email` |
| `block_sender` | Add to blocklist | `unblock_sender` |
Threat Data
Get detailed threat information:
```rust
let threat_data = connector.get_threat_data("AAMkAGI2...").await?;
println!("Delivery action: {}", threat_data.delivery_action);
println!("Threat types: {:?}", threat_data.threat_types);
println!("Detection methods: {:?}", threat_data.detection_methods);
```
Fields:
- `delivery_action`: Delivered, Quarantined, Blocked
- `threat_types`: Phishing, Malware, Spam, BEC
- `detection_methods`: URLAnalysis, AttachmentScanning, ImpersonationDetection
- `urls_clicked`: URLs clicked by recipient (if tracking enabled)
Ticketing Connector
Create and manage security incident tickets in external ticketing systems.
Interface
```rust
#[async_trait]
pub trait TicketingConnector: Connector {
    /// Create a new ticket
    async fn create_ticket(&self, ticket: CreateTicketRequest) -> ConnectorResult<Ticket>;

    /// Get ticket by ID
    async fn get_ticket(&self, ticket_id: &str) -> ConnectorResult<Ticket>;

    /// Update ticket fields
    async fn update_ticket(&self, ticket_id: &str, update: UpdateTicketRequest) -> ConnectorResult<Ticket>;

    /// Add comment to ticket
    async fn add_comment(&self, ticket_id: &str, comment: &str) -> ConnectorResult<()>;

    /// Search tickets
    async fn search_tickets(&self, query: TicketSearchQuery) -> ConnectorResult<Vec<Ticket>>;
}

pub struct CreateTicketRequest {
    pub title: String,
    pub description: String,
    pub priority: TicketPriority,
    pub ticket_type: String,
    pub labels: Vec<String>,
    pub assignee: Option<String>,
    pub custom_fields: HashMap<String, String>,
}

pub struct Ticket {
    pub id: String,
    pub key: String,
    pub title: String,
    pub description: String,
    pub status: String,
    pub priority: TicketPriority,
    pub assignee: Option<String>,
    pub created_at: DateTime<Utc>,
    pub updated_at: DateTime<Utc>,
    pub url: String,
}
```
Jira
Configuration
```sh
TW_TICKETING_MODE=jira
TW_JIRA_URL=https://company.atlassian.net
[email protected]
TW_JIRA_API_TOKEN=your-api-token
TW_JIRA_PROJECT_KEY=SEC
```
API Token
Generate an API token at: https://id.atlassian.com/manage-profile/security/api-tokens
Required permissions:
- Create issues
- Edit issues
- Add comments
- Browse project
Example Usage
```rust
let connector = JiraConnector::new(url, email, token, project_key)?;

// Create security ticket
let request = CreateTicketRequest {
    title: "Phishing Incident - INC-2024-001".to_string(),
    description: "Phishing email detected and quarantined.\n\n## Details\n...".to_string(),
    priority: TicketPriority::High,
    ticket_type: "Security Incident".to_string(),
    labels: vec!["phishing".to_string(), "triage-warden".to_string()],
    assignee: Some("[email protected]".to_string()),
    custom_fields: HashMap::new(),
};
let ticket = connector.create_ticket(request).await?;
println!("Created: {} - {}", ticket.key, ticket.url);

// Add investigation notes
connector.add_comment(
    &ticket.id,
    "## Investigation Notes\n\n- Sender reputation: Malicious\n- URLs: 2 phishing links",
).await?;
```
Issue Types
Configure the Jira project with these issue types:
| Issue Type | Usage |
|---|---|
| Security Incident | Main incident ticket |
| Investigation | Sub-task for investigation steps |
| Remediation | Sub-task for response actions |
Custom Fields
Map custom fields in configuration:
```sh
TW_JIRA_FIELD_SEVERITY=customfield_10001
TW_JIRA_FIELD_INCIDENT_ID=customfield_10002
TW_JIRA_FIELD_VERDICT=customfield_10003
```
Mock Connector
```sh
TW_TICKETING_MODE=mock
```
Simulates ticket operations with in-memory storage.
Python Bridge
```python
from tw_bridge import TicketingBridge

bridge = TicketingBridge("jira")

# Create ticket
ticket = bridge.create_ticket(
    title="Phishing Incident - INC-2024-001",
    description="Phishing email detected...",
    priority="high",
    ticket_type="Security Incident",
    labels=["phishing", "triage-warden"]
)
print(f"Created: {ticket['key']}")
print(f"URL: {ticket['url']}")

# Add comment
bridge.add_comment(
    ticket_id=ticket['id'],
    comment="Investigation complete. Verdict: Malicious"
)

# Update status
bridge.update_ticket(
    ticket_id=ticket['id'],
    status="Done"
)

# Search tickets
tickets = bridge.search_tickets(
    query="project = SEC AND labels = phishing",
    limit=10
)
```
Ticket Templates
Define templates for consistent ticket creation:
```toml
# config/ticket_templates.toml

[templates.phishing]
title = "Phishing: {subject}"
description = """
## Incident Summary
- **Type**: Phishing
- **Severity**: {severity}
- **Incident ID**: {incident_id}

## Details
{details}

## Recommended Actions
{recommended_actions}
"""
labels = ["phishing", "triage-warden"]

[templates.malware]
title = "Malware Alert: {hostname}"
description = """
## Incident Summary
- **Type**: Malware
- **Host**: {hostname}
- **Detection**: {detection}

## IOCs
{iocs}
"""
labels = ["malware", "triage-warden"]
```
Integration with Incidents
Tickets are automatically linked to incidents:
```rust
// The create_ticket action stores the ticket key
let action = execute_action("create_ticket", incident_id, params).await?;

// Incident updated with ticket reference
incident.metadata["ticket_key"] = "SEC-1234";
incident.metadata["ticket_url"] = "https://company.atlassian.net/browse/SEC-1234";
```
Actions
Actions are the executable operations that Triage Warden can perform in response to incidents.
Overview
Actions fall into several categories:
| Category | Purpose | Examples |
|---|---|---|
| Analysis | Extract and parse data | parse_email, check_email_authentication |
| Lookup | Enrich with external data | lookup_sender_reputation, lookup_urls |
| Response | Take containment actions | quarantine_email, isolate_host |
| Notification | Alert stakeholders | notify_user, escalate |
| Ticketing | Create/update tickets | create_ticket, add_ticket_comment |
Action Trait
All actions implement the Action trait:
```rust
#[async_trait]
pub trait Action: Send + Sync {
    /// Action name (used in playbooks and API)
    fn name(&self) -> &str;

    /// Human-readable description
    fn description(&self) -> &str;

    /// Required and optional parameters
    fn required_parameters(&self) -> Vec<ParameterDef>;

    /// Whether this action supports rollback
    fn supports_rollback(&self) -> bool;

    /// Execute the action
    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError>;

    /// Rollback the action (if supported)
    async fn rollback(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        Err(ActionError::RollbackNotSupported)
    }
}
```
Action Context
Actions receive an ActionContext with:
```rust
pub struct ActionContext {
    /// Unique execution ID
    pub execution_id: Uuid,
    /// Parameters passed to the action
    pub parameters: HashMap<String, serde_json::Value>,
    /// Related incident (if any)
    pub incident_id: Option<Uuid>,
    /// User or agent requesting the action
    pub proposer: String,
    /// Connectors available for use
    pub connectors: ConnectorRegistry,
}
```
Action Result
Actions return an ActionResult:
```rust
pub struct ActionResult {
    /// Whether the action succeeded
    pub success: bool,
    /// Action name
    pub action_name: String,
    /// Human-readable summary
    pub message: String,
    /// Execution duration
    pub duration: Duration,
    /// Output data (action-specific)
    pub output: HashMap<String, serde_json::Value>,
    /// Whether rollback is available
    pub rollback_available: bool,
}
```
Policy Integration
All actions pass through the policy engine before execution:
```text
Action Request → Policy Evaluation → Decision
                                       ├─ Allowed → Execute
                                       ├─ Denied → Return Error
                                       └─ RequiresApproval → Queue
```
See Policy Engine for approval configuration.
Executing Actions
Via API
```sh
curl -X POST http://localhost:8080/api/incidents/{id}/actions \
  -H "Content-Type: application/json" \
  -d '{
    "action": "quarantine_email",
    "parameters": {
      "message_id": "AAMkAGI2...",
      "reason": "Phishing detected"
    }
  }'
```
Via CLI
```sh
tw-cli action execute \
  --incident INC-2024-001 \
  --action quarantine_email \
  --param message_id=AAMkAGI2... \
  --param reason="Phishing detected"
```
Via Playbook
```yaml
steps:
  - action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
      reason: "Automated response to phishing"
```
Available Actions
- Email Actions - Email parsing and response
- Host Actions - Endpoint containment
- Lookup Actions - Threat intelligence enrichment
- Notification Actions - Alerts and escalation
Email Actions
Actions for analyzing and responding to email-based threats.
Analysis Actions
parse_email
Extract headers, body, attachments, and URLs from raw email.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `raw_email` | string | Yes | Raw email content (RFC 822) |
Output:
```json
{
  "headers": {
    "From": "[email protected]",
    "To": "[email protected]",
    "Subject": "Important Document",
    "Date": "2024-01-15T10:30:00Z",
    "Message-ID": "<[email protected]>",
    "X-Originating-IP": "[192.168.1.100]"
  },
  "sender": "[email protected]",
  "recipients": ["[email protected]"],
  "subject": "Important Document",
  "body_text": "Please review the attached document...",
  "body_html": "<html>...",
  "attachments": [
    {
      "filename": "document.pdf",
      "content_type": "application/pdf",
      "size": 102400,
      "sha256": "abc123..."
    }
  ],
  "urls": [
    "https://example.com/document",
    "https://suspicious-site.com/login"
  ]
}
```
check_email_authentication
Validate SPF, DKIM, and DMARC authentication results.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `headers` | object | Yes | Email headers (from `parse_email`) |
Output:
```json
{
  "spf": {
    "result": "pass",
    "domain": "example.com"
  },
  "dkim": {
    "result": "pass",
    "domain": "example.com",
    "selector": "default"
  },
  "dmarc": {
    "result": "pass",
    "policy": "reject"
  },
  "authentication_passed": true,
  "risk_indicators": []
}
```
Risk Indicators:
- `spf_fail` - SPF validation failed
- `dkim_fail` - DKIM signature invalid
- `dmarc_fail` - DMARC policy violation
- `header_mismatch` - From/Reply-To mismatch
- `suspicious_routing` - Unusual mail routing
Response Actions
quarantine_email
Move email to quarantine via email gateway.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `message_id` | string | Yes | Email message ID |
| `reason` | string | No | Reason for quarantine |
Output:
```json
{
  "quarantine_id": "quar-abc123",
  "message_id": "AAMkAGI2...",
  "quarantined_at": "2024-01-15T10:35:00Z"
}
```
Rollback: `release_email` - Releases email from quarantine
block_sender
Add sender to organization blocklist.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `sender` | string | Yes | Email address to block |
| `scope` | string | No | Block scope: `organization` or `user` |
Output:
```json
{
  "block_id": "block-abc123",
  "sender": "[email protected]",
  "scope": "organization",
  "blocked_at": "2024-01-15T10:35:00Z"
}
```
Rollback: `unblock_sender` - Removes sender from blocklist
Usage Examples
Phishing Response Playbook
```yaml
name: phishing_response
steps:
  - action: parse_email
    output: parsed

  - action: check_email_authentication
    parameters:
      headers: "{{ parsed.headers }}"
    output: auth

  - action: lookup_sender_reputation
    parameters:
      sender: "{{ parsed.sender }}"
    output: reputation

  - condition: "reputation.score < 0.3 or not auth.authentication_passed"
    action: quarantine_email
    parameters:
      message_id: "{{ incident.raw_data.message_id }}"
      reason: "Failed authentication and low sender reputation"

  - condition: "reputation.score < 0.2"
    action: block_sender
    parameters:
      sender: "{{ parsed.sender }}"
      scope: organization
```
CLI Example
```sh
# Quarantine suspicious email
tw-cli action execute \
  --action quarantine_email \
  --param message_id="AAMkAGI2..." \
  --param reason="Phishing indicators detected"

# Block malicious sender
tw-cli action execute \
  --action block_sender \
  --param sender="[email protected]" \
  --param scope=organization
```
Host Actions
Actions for endpoint containment and investigation.
isolate_host
Network-isolate a compromised host via EDR.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `host_id` | string | Yes | EDR host/agent ID |
| `reason` | string | No | Reason for isolation |
Output:
```json
{
  "isolation_id": "iso-abc123",
  "host_id": "aid:xyz789",
  "hostname": "WORKSTATION-01",
  "isolated_at": "2024-01-15T10:40:00Z",
  "status": "isolated"
}
```
Behavior:
- Host network access blocked
- EDR agent maintains cloud connectivity
- User notified (configurable)
Rollback: `unisolate_host`
Policy: Typically requires senior analyst or manager approval.
unisolate_host
Remove network isolation from a host.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `host_id` | string | Yes | EDR host/agent ID |
| `reason` | string | No | Reason for removing isolation |
Output:
```json
{
  "host_id": "aid:xyz789",
  "hostname": "WORKSTATION-01",
  "unisolated_at": "2024-01-15T14:00:00Z",
  "status": "active"
}
```
scan_host
Trigger on-demand malware scan on a host.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `host_id` | string | Yes | EDR host/agent ID |
| `scan_type` | string | No | `quick` or `full` (default: `quick`) |
Output:
```json
{
  "scan_id": "scan-abc123",
  "host_id": "aid:xyz789",
  "scan_type": "quick",
  "started_at": "2024-01-15T10:45:00Z",
  "status": "running"
}
```
Note: Scan results are retrieved separately as they may take time.
Usage Examples
Malware Response Playbook
```yaml
name: malware_response
steps:
  - action: isolate_host
    parameters:
      host_id: "{{ incident.raw_data.host_id }}"
      reason: "Malware detection - automated isolation"
    output: isolation

  - action: scan_host
    parameters:
      host_id: "{{ incident.raw_data.host_id }}"
      scan_type: full

  - action: create_ticket
    parameters:
      title: "Malware Incident - {{ incident.raw_data.hostname }}"
      priority: high

  - action: notify_user
    parameters:
      user: "{{ incident.raw_data.user }}"
      message: "Your workstation has been isolated due to a security incident"
```
CLI Example
```sh
# Isolate compromised host
tw-cli action execute \
  --action isolate_host \
  --param host_id="aid:xyz789" \
  --param reason="Active malware infection"

# This action typically requires approval
# Check approval status:
tw-cli action status act-123456

# After investigation, remove isolation:
tw-cli action execute \
  --action unisolate_host \
  --param host_id="aid:xyz789" \
  --param reason="Malware cleaned, host verified"
```
API Example
```sh
# Request host isolation
curl -X POST http://localhost:8080/api/incidents/INC-2024-001/actions \
  -H "Content-Type: application/json" \
  -d '{
    "action": "isolate_host",
    "parameters": {
      "host_id": "aid:xyz789",
      "reason": "Suspected compromise"
    }
  }'

# Response (if requires approval):
{
  "action_id": "act-abc123",
  "status": "pending_approval",
  "approval_level": "manager",
  "message": "Action requires SOC Manager approval"
}
```
Policy Configuration
Host actions are typically high-impact and require approval:
```toml
[[policy.rules]]
name = "isolate_requires_approval"
action = "isolate_host"
approval_level = "senior"

[[policy.rules]]
name = "critical_isolate_requires_manager"
action = "isolate_host"
severity = ["critical"]
approval_level = "manager"
```
Lookup Actions
Actions for enriching incidents with threat intelligence data.
lookup_sender_reputation
Query threat intelligence for sender domain and IP reputation.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `sender` | string | Yes | Email address |
| `originating_ip` | string | No | Sending server IP |
Output:
```json
{
  "sender": "[email protected]",
  "domain": "domain.com",
  "domain_reputation": {
    "score": 0.25,
    "categories": ["phishing", "newly-registered"],
    "first_seen": "2024-01-10",
    "registrar": "NameCheap"
  },
  "ip_reputation": {
    "ip": "192.168.1.100",
    "score": 0.3,
    "categories": ["spam", "proxy"],
    "country": "RU",
    "asn": "AS12345"
  },
  "overall_score": 0.25,
  "risk_level": "high"
}
```
Score Interpretation:
| Score | Risk Level |
|---|---|
| 0.0 - 0.3 | High risk |
| 0.3 - 0.6 | Medium risk |
| 0.6 - 0.8 | Low risk |
| 0.8 - 1.0 | Clean |
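The score bands can be expressed as a small threshold function. A hypothetical helper; the exact boundary handling (strict vs. inclusive) is an assumption:

```rust
/// Map a reputation score (0.0–1.0) to the risk levels in the table above.
/// Illustrative only; not part of the action's public API.
fn risk_level(score: f64) -> &'static str {
    if score < 0.3 {
        "high"
    } else if score < 0.6 {
        "medium"
    } else if score < 0.8 {
        "low"
    } else {
        "clean"
    }
}

fn main() {
    // Matches the example output above: overall_score 0.25 → risk_level "high"
    assert_eq!(risk_level(0.25), "high");
}
```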
lookup_urls
Check URLs against threat intelligence.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `urls` | array | Yes | List of URLs to check |
Output:
```json
{
  "results": [
    {
      "url": "https://legitimate-site.com/page",
      "malicious": false,
      "categories": ["business"],
      "confidence": 0.95
    },
    {
      "url": "https://phishing-site.com/login",
      "malicious": true,
      "categories": ["phishing", "credential-theft"],
      "confidence": 0.92,
      "threat_details": {
        "targeted_brand": "Microsoft",
        "first_seen": "2024-01-14"
      }
    }
  ],
  "malicious_count": 1,
  "total_count": 2
}
```
lookup_attachments
Hash attachments and check against threat intelligence.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `attachments` | array | Yes | List of attachment objects with `sha256` |
Output:
```json
{
  "results": [
    {
      "filename": "invoice.pdf",
      "sha256": "abc123...",
      "malicious": false,
      "file_type": "PDF document",
      "confidence": 0.9
    },
    {
      "filename": "update.exe",
      "sha256": "def456...",
      "malicious": true,
      "file_type": "Windows executable",
      "confidence": 0.98,
      "threat_details": {
        "malware_family": "Emotet",
        "first_seen": "2024-01-12",
        "detection_engines": 45
      }
    }
  ],
  "malicious_count": 1,
  "total_count": 2
}
```
lookup_hash
Look up a single file hash.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `hash` | string | Yes | MD5, SHA1, or SHA256 hash |
Output:
```json
{
  "hash": "abc123...",
  "hash_type": "sha256",
  "malicious": true,
  "confidence": 0.95,
  "malware_family": "Emotet",
  "categories": ["trojan", "banking"],
  "first_seen": "2024-01-12",
  "last_seen": "2024-01-15",
  "detection_ratio": "45/70"
}
```
lookup_ip
Query IP address reputation.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `ip` | string | Yes | IP address |
Output:
```json
{
  "ip": "192.168.1.100",
  "malicious": true,
  "confidence": 0.8,
  "categories": ["c2", "malware-distribution"],
  "country": "RU",
  "asn": "AS12345",
  "asn_org": "Example ISP",
  "last_seen": "2024-01-15",
  "associated_malware": ["Cobalt Strike"]
}
```
Usage in Playbooks
```yaml
name: email_triage
steps:
  - action: parse_email
    output: parsed

  - action: lookup_sender_reputation
    parameters:
      sender: "{{ parsed.sender }}"
    output: sender_rep

  - action: lookup_urls
    parameters:
      urls: "{{ parsed.urls }}"
    output: url_results

  - action: lookup_attachments
    parameters:
      attachments: "{{ parsed.attachments }}"
    output: attachment_results

  # Make decision based on lookups
  - condition: >
      sender_rep.risk_level == 'high' or
      url_results.malicious_count > 0 or
      attachment_results.malicious_count > 0
    set_verdict:
      classification: malicious
      confidence: 0.9
```
Caching
Lookup results are cached to reduce API calls:
| Lookup | Cache Duration |
|---|---|
| Hash | 24 hours |
| URL | 1 hour |
| Domain | 6 hours |
| IP | 6 hours |
Force a fresh lookup with the `skip_cache: true` parameter.
Notification Actions
Actions for alerting stakeholders and managing escalation.
notify_user
Send notification to an affected user.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `user` | string | Yes | User email or ID |
| `message` | string | Yes | Notification message |
| `channel` | string | No | `email`, `slack`, `teams` (default: `email`) |
| `template` | string | No | Notification template name |
Output:
```json
{
  "notification_id": "notif-abc123",
  "recipient": "[email protected]",
  "channel": "email",
  "sent_at": "2024-01-15T10:50:00Z",
  "status": "delivered"
}
```
Templates:
```yaml
# templates/notifications.yaml
security_alert:
  subject: "Security Alert: Action Required"
  body: |
    A security incident affecting your account has been detected.

    Incident ID: {{ incident_id }}
    Type: {{ incident_type }}

    {{ message }}

    If you did not initiate this activity, please contact IT Security.
```
notify_reporter
Send status update to the incident reporter.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `incident_id` | string | Yes | Incident ID |
| `status` | string | Yes | Status update message |
| `include_verdict` | bool | No | Include AI verdict (default: `false`) |
Output:
```json
{
  "notification_id": "notif-def456",
  "reporter": "[email protected]",
  "status": "delivered"
}
```
escalate
Route incident to appropriate approval level.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `incident_id` | string | Yes | Incident ID |
| `escalation_level` | string | Yes | `analyst`, `senior`, `manager` |
| `reason` | string | Yes | Reason for escalation |
| `override_assignee` | string | No | Specific person to assign |
| `custom_sla_hours` | int | No | Custom SLA (overrides default) |
| `notify_channels` | array | No | Additional channels (`slack`, `pagerduty`) |
Output:
```json
{
  "escalation_id": "esc-abc123",
  "incident_id": "INC-2024-001",
  "escalation_level": "senior",
  "assigned_to": "[email protected]",
  "due_date": "2024-01-15T12:50:00Z",
  "priority": "high",
  "sla_hours": 2
}
```
Default SLAs:
| Level | SLA |
|---|---|
| Analyst | 4 hours |
| Senior | 2 hours |
| Manager | 1 hour |
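The default SLAs reduce to a per-level lookup. A hypothetical helper illustrating the table above (real SLAs can be overridden via `custom_sla_hours`):

```rust
/// Default SLA in hours for each escalation level, per the table above.
/// Illustrative only; unknown levels return None by assumption.
fn default_sla_hours(level: &str) -> Option<u32> {
    match level {
        "analyst" => Some(4),
        "senior" => Some(2),
        "manager" => Some(1),
        _ => None,
    }
}

fn main() {
    // Matches the escalate example output: level "senior" → sla_hours 2
    assert_eq!(default_sla_hours("senior"), Some(2));
}
```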
create_ticket
Create ticket in external ticketing system.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `title` | string | Yes | Ticket title |
| `description` | string | Yes | Ticket description |
| `priority` | string | No | `low`, `medium`, `high`, `critical` |
| `assignee` | string | No | Initial assignee |
| `labels` | array | No | Ticket labels |
Output:
```json
{
  "ticket_id": "12345",
  "ticket_key": "SEC-1234",
  "url": "https://company.atlassian.net/browse/SEC-1234",
  "created_at": "2024-01-15T10:55:00Z"
}
```
log_false_positive
Record a false positive for tuning.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `incident_id` | string | Yes | Incident ID |
| `reason` | string | Yes | Why this is a false positive |
| `feedback` | string | No | Additional feedback for AI improvement |
Output:
{
"fp_id": "fp-abc123",
"incident_id": "INC-2024-001",
"recorded_at": "2024-01-15T11:00:00Z",
"used_for_training": true
}
run_triage_agent
Trigger AI triage agent on an incident.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `incident_id` | string | Yes | Incident ID |
| `playbook` | string | No | Specific playbook to use |
| `model` | string | No | AI model override |
Output:
```json
{
  "triage_id": "triage-abc123",
  "incident_id": "INC-2024-001",
  "verdict": "malicious",
  "confidence": 0.92,
  "reasoning": "Multiple indicators of phishing...",
  "recommended_actions": [
    "quarantine_email",
    "block_sender",
    "notify_user"
  ],
  "completed_at": "2024-01-15T10:52:00Z"
}
```
Usage Examples
Escalation Playbook
name: auto_escalate
trigger:
  - verdict: malicious
  - confidence: ">= 0.9"
  - severity: critical
steps:
  - action: escalate
    parameters:
      incident_id: "{{ incident.id }}"
      escalation_level: manager
      reason: "High-confidence critical incident requiring immediate attention"
      notify_channels:
        - slack
        - pagerduty
  - action: create_ticket
    parameters:
      title: "CRITICAL: {{ incident.subject }}"
      priority: critical
CLI Examples
# Escalate to senior analyst
tw-cli action execute \
--incident INC-2024-001 \
--action escalate \
--param escalation_level=senior \
--param reason="Complex threat requiring expertise"
# Create ticket
tw-cli action execute \
--incident INC-2024-001 \
--action create_ticket \
--param title="Phishing Investigation" \
--param priority=high
# Record false positive
tw-cli action execute \
--incident INC-2024-001 \
--action log_false_positive \
--param reason="Legitimate vendor communication"
Policy Engine
The policy engine controls action approval workflows and enforces security boundaries.
Overview
Every action request passes through the policy engine:
Action Request → Build Context → Evaluate Rules → Decision
├─ Allowed → Execute
├─ Denied → Reject
└─ RequiresApproval → Queue
Policy Decision Types
| Decision | Behavior |
|---|---|
| Allowed | Action executes immediately |
| Denied | Action rejected with reason |
| RequiresApproval | Queued for specified approval level |
Action Context
The policy engine evaluates these attributes:
pub struct ActionContext {
    /// The action being requested
    pub action_type: String,
    /// Target of the action (host, email, user, etc.)
    pub target: String,
    /// Incident severity (if associated)
    pub severity: Option<Severity>,
    /// AI confidence score (if from triage)
    pub confidence: Option<f64>,
    /// Who/what is requesting the action
    pub proposer: Proposer,
    /// Additional context
    pub metadata: HashMap<String, Value>,
}

pub enum Proposer {
    User { id: String, role: Role },
    Agent { name: String },
    Playbook { name: String },
    System,
}
Default Policies
Without custom rules, these defaults apply:
| Action Category | Default Decision |
|---|---|
| Lookup actions | Allowed |
| Analysis actions | Allowed |
| Notification actions | Allowed |
| Response actions | RequiresApproval (analyst) |
| Host containment | RequiresApproval (senior) |
Next Steps
- Rules - Configure custom policy rules
- Approval Levels - Understanding approval workflow
Policy Rules
Define rules to control when actions require approval.
Rule Structure
[[policy.rules]]
name = "rule_name"
description = "Human-readable description"
# Matching criteria
action = "action_name" # Specific action
action_patterns = ["pattern_*"] # Glob patterns
# Conditions (all must match)
severity = ["high", "critical"] # Incident severity
confidence_min = 0.8 # Minimum AI confidence
proposer_type = "agent" # Who's requesting
proposer_role = "analyst" # Role (if user)
# Decision
decision = "allowed" # or "denied" or "requires_approval"
approval_level = "senior" # If requires_approval
reason = "Explanation" # If denied
Rule Examples
Auto-Approve Lookups
[[policy.rules]]
name = "auto_approve_lookups"
description = "Lookup actions are always allowed"
action_patterns = ["lookup_*"]
decision = "allowed"
Require Approval for Response Actions
[[policy.rules]]
name = "response_needs_analyst"
description = "Response actions require analyst approval"
action_patterns = ["quarantine_*", "block_*"]
decision = "requires_approval"
approval_level = "analyst"
High-Severity Host Isolation
[[policy.rules]]
name = "critical_isolation_needs_manager"
description = "Critical severity host isolation requires manager"
action = "isolate_host"
severity = ["critical"]
decision = "requires_approval"
approval_level = "manager"
Block Dangerous Actions in Production
[[policy.rules]]
name = "no_delete_production"
description = "Deletion actions not allowed in production"
action_patterns = ["delete_*"]
environment = "production"
decision = "denied"
reason = "Deletion actions are not permitted in production"
Trust High-Confidence AI Decisions
[[policy.rules]]
name = "trust_high_confidence_ai"
description = "Auto-approve when AI is highly confident"
proposer_type = "agent"
confidence_min = 0.95
severity = ["low", "medium"]
action_patterns = ["quarantine_email", "block_sender"]
decision = "allowed"
Analyst Self-Service
[[policy.rules]]
name = "analyst_can_notify"
description = "Analysts can send notifications without approval"
action_patterns = ["notify_*"]
proposer_role = "analyst"
decision = "allowed"
Rule Evaluation Order
Rules are evaluated in order. First matching rule wins.
# More specific rules first
[[policy.rules]]
name = "critical_isolation"
action = "isolate_host"
severity = ["critical"]
decision = "requires_approval"
approval_level = "manager"
# General fallback
[[policy.rules]]
name = "default_isolation"
action = "isolate_host"
decision = "requires_approval"
approval_level = "senior"
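The first-match semantics can be sketched in Python (illustrative only -- the actual policy engine is implemented in Rust; the rule fields mirror the TOML structure above):

```python
from fnmatch import fnmatch

# Illustrative rule list mirroring the TOML examples above.
RULES = [
    {"name": "critical_isolation", "action": "isolate_host",
     "severity": ["critical"], "decision": "requires_approval",
     "approval_level": "manager"},
    {"name": "default_isolation", "action": "isolate_host",
     "decision": "requires_approval", "approval_level": "senior"},
]

def evaluate(action, severity, rules=RULES):
    """Return the first rule whose conditions all match, else None."""
    for rule in rules:
        if "action" in rule and rule["action"] != action:
            continue
        if "action_patterns" in rule and not any(
                fnmatch(action, p) for p in rule["action_patterns"]):
            continue
        if "severity" in rule and severity not in rule["severity"]:
            continue
        return rule  # first match wins -- later rules are never consulted
    return None
```

Because evaluation stops at the first match, a general fallback placed before a specific rule would shadow it entirely.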
Condition Operators
Severity Matching
severity = ["high", "critical"] # Match any in list
Confidence Ranges
confidence_min = 0.8 # Minimum confidence
confidence_max = 0.95 # Maximum confidence
Pattern Matching
action_patterns = ["lookup_*"] # Prefix match
action_patterns = ["*_email"] # Suffix match
action_patterns = ["*block*"] # Contains
Proposer Conditions
proposer_type = "user" # user, agent, playbook, system
proposer_role = "analyst" # Only for user proposers
Managing Rules
Via Configuration File
# config/policy.toml
tw-api --config config/policy.toml
Via API
# List rules
curl http://localhost:8080/api/policies
# Create rule
curl -X POST http://localhost:8080/api/policies \
-H "Content-Type: application/json" \
-d '{
"name": "new_rule",
"action": "isolate_host",
"approval_level": "senior"
}'
Via CLI
# List rules
tw-cli policy list
# Add rule
tw-cli policy add \
--name "block_needs_approval" \
--action "block_sender" \
--approval-level analyst
Testing Rules
Simulate policy evaluation without executing:
tw-cli policy test \
--action isolate_host \
--severity critical \
--proposer-type agent \
--confidence 0.92
# Output:
# Decision: RequiresApproval
# Level: manager
# Matched Rule: critical_isolation_needs_manager
Approval Levels
Understanding the approval workflow in Triage Warden.
Approval Hierarchy
Manager (SOC Manager)
│
▼
Senior (Senior Analyst)
│
▼
Analyst (Security Analyst)
│
▼
Auto (No approval needed)
Higher levels can approve actions at their level or below.
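The "at their level or below" rule amounts to an ordering check; a minimal sketch (the level ranking is taken from the hierarchy above, the function name is illustrative):

```python
# Approval levels in ascending order of authority.
LEVELS = ["auto", "analyst", "senior", "manager"]

def can_approve(approver_level: str, required_level: str) -> bool:
    """An approver may approve actions at their level or below."""
    return LEVELS.index(approver_level) >= LEVELS.index(required_level)
```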
Level Definitions
| Level | Role | Typical Actions |
|---|---|---|
| Auto | System | Lookups, analysis, low-risk notifications |
| Analyst | Security Analyst | Email quarantine, sender blocking |
| Senior | Senior Analyst | Host isolation, broad blocks |
| Manager | SOC Manager | Critical containment, policy changes |
Approval Workflow
1. Action Requested
tw-cli action execute --incident INC-001 --action isolate_host
2. Policy Evaluation
Policy engine evaluates and returns:
{
"decision": "requires_approval",
"approval_level": "senior",
"reason": "Host isolation requires senior analyst approval"
}
3. Action Queued
Action stored with pending status:
{
"action_id": "act-abc123",
"incident_id": "INC-001",
"action_type": "isolate_host",
"status": "pending_approval",
"approval_level": "senior",
"requested_by": "[email protected]",
"requested_at": "2024-01-15T10:30:00Z"
}
4. Approvers Notified
Notification sent to eligible approvers via configured channels.
5. Approval Decision
Approver reviews and decides:
Approve:
tw-cli action approve act-abc123 --comment "Verified threat"
Reject:
tw-cli action reject act-abc123 --reason "False positive, user traveling"
6. Execution or Rejection
- Approved: Action executes automatically
- Rejected: Action marked rejected, requester notified
Approval UI
Access pending approvals at /approvals in the web dashboard.
Features:
- Filterable list of pending actions
- Incident context display
- One-click approve/reject
- Bulk approval for related actions
SLA Tracking
Each approval level has a default SLA:
| Level | Default SLA |
|---|---|
| Analyst | 4 hours |
| Senior | 2 hours |
| Manager | 1 hour |
Overdue approvals are:
- Highlighted in the dashboard
- Re-sent to approvers as reminders
- Optionally escalated to the next level
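The SLA math is straightforward; a hedged sketch using the default SLA table above (function names are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Default SLAs from the table above.
SLA_HOURS = {"analyst": 4, "senior": 2, "manager": 1}

def due_at(requested_at: datetime, level: str) -> datetime:
    """Deadline for an approval requested at the given time."""
    return requested_at + timedelta(hours=SLA_HOURS[level])

def is_overdue(requested_at: datetime, level: str, now: datetime) -> bool:
    return now > due_at(requested_at, level)
```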
Delegation
Approvers can delegate when unavailable:
tw-cli approval delegate \
--from [email protected] \
--to [email protected] \
--until 2024-01-20
Approval Groups
Configure approval groups for redundancy:
[approval_groups]
senior_analysts = [
"[email protected]",
"[email protected]",
"[email protected]"
]
managers = [
"[email protected]",
"[email protected]"
]
Any member of the group can approve.
Audit Trail
All approval decisions are logged:
{
"event": "action_approved",
"action_id": "act-abc123",
"approver": "[email protected]",
"decision": "approved",
"comment": "Verified threat indicators",
"timestamp": "2024-01-15T10:45:00Z",
"time_to_approve": "15m"
}
Emergency Override
In emergencies, managers can bypass approval:
tw-cli action execute \
--incident INC-001 \
--action isolate_host \
--emergency \
--reason "Active ransomware, immediate containment required"
Emergency overrides:
- Are logged with high visibility
- Require manager credentials
- Trigger additional notifications
Natural Language Queries
Query your security data using plain English instead of writing Splunk SPL, Elasticsearch KQL, or SQL by hand.
Overview
The NL Query Interface (Stage 4.1) lets analysts type questions like "show me critical incidents from the last 24 hours" and have Triage Warden translate them into structured queries against your SIEM, log store, or incident database.
The pipeline has four stages:
- Intent classification -- determines what the analyst is trying to do
- Entity extraction -- pulls out IPs, domains, hashes, date ranges, etc.
- Query translation -- converts the parsed intent + entities into the target query language
- Backend execution -- runs the query against Splunk, Elasticsearch, or SQL
Supported Intents
| Intent | Example query |
|---|---|
| search_incidents | "show me open critical incidents" |
| search_logs | "find authentication failures in the last hour" |
| lookup_ioc | "check reputation for 192.168.1.100" |
| explain_incident | "what happened in INC-2024-0042?" |
| compare_incidents | "compare INC-001 and INC-002" |
| timeline_query | "show me events from last week" |
| asset_lookup | "who owns server web-prod-01?" |
| statistics | "how many phishing incidents this month?" |
Intent classification uses keyword matching and regex patterns -- no LLM call is needed for routing.
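A keyword router of this kind can be sketched in a few lines; the keyword lists, their check order, and the fallback intent below are illustrative, not the shipped ones:

```python
# Illustrative keyword table. Insertion order matters: more specific
# intents are checked before broader ones.
INTENT_KEYWORDS = {
    "statistics": ["how many", "count of"],
    "lookup_ioc": ["reputation"],
    "search_incidents": ["incident", "incidents"],
}

def classify(query: str) -> str:
    q = query.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return intent
    return "search_logs"  # fallback when no keyword matches
```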
Entity Extraction
The entity extractor recognizes security-specific tokens:
- IP addresses -- IPv4 (`192.168.1.100`)
- Domains -- `evil-domain.com`
- Hashes -- MD5 (32 hex chars), SHA-1 (40), SHA-256 (64)
- Incident IDs -- `INC-2024-0042`, `#42`
- Date ranges -- "last 24 hours", "past 7 days", `2024-01-01 to 2024-01-31`
- Usernames, hostnames, CVE IDs
Query Translation
Once intent and entities are extracted, NLQueryTranslator builds a structured query object:
from tw_ai.nl_query import NLQueryTranslator

translator = NLQueryTranslator()
result = translator.translate(
    "show me failed logins from 10.0.0.50 in the last hour"
)
# result.intent.intent = QueryIntent.SEARCH_LOGS
# result.structured_query returns the backend-specific query
Backend Adapters
The translator outputs queries for three backends:
| Backend | Output format | Use case |
|---|---|---|
| Splunk | SPL queries | index=auth action=failure src_ip=10.0.0.50 earliest=-1h |
| Elasticsearch | KQL / Query DSL | event.action:failure AND source.ip:10.0.0.50 |
| SQL | SQL WHERE clauses | Incident database queries |
Conversation Context
Multi-turn conversations are supported via ConversationContext. When an analyst asks "now show me the same for last week", the system retains the entities from the previous turn.
from tw_ai.nl_query import ConversationContext
ctx = ConversationContext()
ctx.update("show me incidents from 10.0.0.50", entities=[...])
ctx.update("now filter to critical only", entities=[...])
# Second turn inherits the IP entity from the first
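The carry-over behavior amounts to merging each turn's entities over the previous turn's; a sketch of that idea (the real `ConversationContext` API may differ):

```python
# Illustrative stand-in for ConversationContext's entity carry-over.
class Context:
    def __init__(self):
        self.entities = {}

    def update(self, query: str, entities: dict) -> dict:
        """Merge this turn's entities over the previous turn's and
        return the combined view the current turn sees."""
        self.entities = {**self.entities, **entities}
        return self.entities
```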
Security and Audit
All NL queries are sanitized before execution to prevent injection attacks. The QuerySanitizer strips dangerous characters and SQL keywords from user input.
Every query is logged to the QueryAuditLog with:
- Original natural language query
- Classified intent and confidence
- Translated structured query
- Execution timestamp and user ID
API Endpoint
When FastAPI is available, the NL query service exposes a REST endpoint:
curl -X POST http://localhost:8080/api/v1/nl/query \
-H "Content-Type: application/json" \
-d '{"query": "show me critical incidents from the last 24 hours"}'
Configuration
No special configuration is required. The NL query engine uses the same SIEM and database connections already configured in config/default.yaml.
To add custom keywords for intent classification:
from tw_ai.nl_query import IntentClassifier, QueryIntent

classifier = IntentClassifier(
    custom_keywords={
        QueryIntent.SEARCH_LOGS: ["splunk", "kibana"],
    }
)
Automated Threat Hunting
Proactively search for threats across your environment using hypothesis-driven hunts with built-in query templates mapped to MITRE ATT&CK.
Overview
The threat hunting module (Stage 5.1) provides:
- Hunt management -- create, schedule, and track hunts with hypotheses
- Built-in query library -- 20+ pre-built queries across 8 MITRE ATT&CK categories
- Multi-platform queries -- Splunk SPL and Elasticsearch KQL templates
- Finding promotion -- promote hunt findings directly to incidents
Hunt Lifecycle
A hunt progresses through these statuses:
| Status | Description |
|---|---|
| draft | Hunt is being designed, not yet executable |
| active | Hunt is enabled and will run on schedule or trigger |
| paused | Temporarily suspended |
| completed | Finished executing (one-time hunts) |
| failed | Execution encountered errors |
| archived | No longer active, kept for reference |
Creating a Hunt
Via API
curl -X POST http://localhost:8080/api/v1/hunts \
-H "Content-Type: application/json" \
-d '{
"name": "Detect Kerberoasting",
"hypothesis": "Attackers may request TGS tickets for service accounts to crack offline",
"hunt_type": "scheduled",
"queries": [
{
"query_type": "splunk",
"query": "index=wineventlog EventCode=4769 TicketEncryptionType=0x17 | stats count by ServiceName",
"description": "Detect RC4-encrypted TGS requests",
"timeout_secs": 300,
"expected_baseline": 5
}
],
"schedule": {
"cron_expression": "0 */4 * * *",
"timezone": "UTC",
"max_runtime_secs": 600
},
"mitre_techniques": ["T1558.003"],
"data_sources": ["windows_event_logs"],
"tags": ["credential-access", "priority-high"],
"enabled": true
}'
Hunt Types
| Type | Description |
|---|---|
| scheduled | Runs on a cron schedule |
| continuous | Runs as a streaming query |
| on_demand | Runs only when manually triggered |
| triggered | Runs when a condition is met (e.g., new threat intel) |
Built-in Query Library
Access 20+ pre-built queries via the API:
curl http://localhost:8080/api/v1/hunts/queries/library
Queries span 8 MITRE ATT&CK categories:
- Initial Access
- Execution
- Persistence
- Credential Access
- Lateral Movement
- Collection
- Command and Control
- Exfiltration
Each built-in query includes Splunk SPL and Elasticsearch KQL templates, expected baselines for anomaly detection, and configurable parameters.
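How a library entry's template and baseline might be used can be sketched as follows; the `splunk_template` and `expected_baseline` field names are assumptions for illustration:

```python
from string import Template

# Illustrative library entry: a parameterized SPL template plus an
# expected baseline count for anomaly detection.
QUERY = {
    "splunk_template": Template(
        "index=$index EventCode=4769 TicketEncryptionType=0x17"),
    "expected_baseline": 5,
}

def render(query: dict, **params) -> str:
    """Fill the template's parameters to produce a runnable query."""
    return query["splunk_template"].substitute(**params)

def is_anomalous(query: dict, observed_count: int) -> bool:
    """Flag a finding when the observed count exceeds the baseline."""
    return observed_count > query["expected_baseline"]
```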
Executing a Hunt
Trigger a hunt manually:
curl -X POST http://localhost:8080/api/v1/hunts/{hunt_id}/execute
The response includes findings with severity levels, evidence data, and the query that produced each finding.
Viewing Results
# Get all results for a hunt
curl http://localhost:8080/api/v1/hunts/{hunt_id}/results
Each result includes:
- Total and critical finding counts
- Duration and execution status
- Individual findings with severity, evidence, and matched query
Promoting Findings to Incidents
When a hunt finding warrants investigation, promote it to a full incident:
curl -X POST http://localhost:8080/api/v1/hunts/{hunt_id}/findings/{finding_id}/promote
This creates a new incident with the finding's evidence, severity, and hunt metadata attached.
Query Languages
| Language | Identifier | Example |
|---|---|---|
| Splunk SPL | splunk | index=wineventlog EventCode=4625 |
| Elasticsearch | elasticsearch | event.code: 4625 |
| SQL | sql | SELECT * FROM events WHERE event_code = 4625 |
| Kusto (KQL) | kusto | SecurityEvent \| where EventID == 4625 |
| Custom | custom | Any custom query syntax |
Python Hypothesis Generator
The Python tw_ai package includes an AI-powered hypothesis generator that suggests new hunts based on current threat intelligence and recent incident patterns.
Collaboration
Coordinate incident response across your team with assignments, comments, real-time events, activity feeds, and shift handoffs.
Overview
The collaboration module (Stage 4.3) adds team workflow features to incident management:
- Incident assignment -- manual and auto-assignment with rules
- Comments -- threaded discussion on incidents with mentions
- Real-time events -- live updates pushed to connected clients
- Activity feed -- chronological audit trail of all actions
- Shift handoff -- structured handoff reports between shifts
Incident Assignment
Manual Assignment
Assign an incident to an analyst through the web UI's assignment picker, or via the web endpoint:
curl -X POST http://localhost:8080/web/incidents/{id}/assign \
-H "Content-Type: application/x-www-form-urlencoded" \
-d 'assignee=analyst-uuid'
Auto-Assignment Rules
The system supports rule-based auto-assignment. Rules are defined in the application configuration and evaluated when new incidents arrive. Each rule specifies conditions and an assignee target:
| Field | Description |
|---|---|
| name | Human-readable rule name |
| conditions | List of conditions to match (severity, incident type, source, tag) |
| assignee | Who to assign to (see Assignee Targets below) |
| priority | Evaluation order (lower number = higher priority) |
Rules are evaluated in priority order. The first matching rule wins.
Note: Auto-assignment rule management via API is planned for a future release. Rules are currently configured at the application level.
Assignee Targets
| Type | Description |
|---|---|
user | Assign to a specific analyst by ID |
team | Round-robin across team members |
on_call | Assign to whoever is on-call |
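Priority-ordered rules plus the three assignee targets can be sketched together; rule shapes, team names, and the on-call lookup below are illustrative:

```python
from itertools import cycle

# Illustrative rules, pre-sorted so lower priority numbers run first.
RULES = sorted([
    {"name": "critical to on-call", "priority": 1,
     "conditions": {"severity": "critical"}, "assignee": ("on_call", None)},
    {"name": "phishing to email team", "priority": 10,
     "conditions": {"incident_type": "phishing"},
     "assignee": ("team", ["alice", "bob"])},
], key=lambda r: r["priority"])

_rr = {}  # per-rule round-robin cursors for team targets

def auto_assign(incident: dict, on_call: str = "carol"):
    """First matching rule wins; team targets rotate round-robin."""
    for rule in RULES:
        if all(incident.get(k) == v for k, v in rule["conditions"].items()):
            kind, target = rule["assignee"]
            if kind == "on_call":
                return on_call
            if kind == "team":
                it = _rr.setdefault(rule["name"], cycle(target))
                return next(it)
    return None  # no rule matched; incident stays unassigned
```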
Comments
Add discussion, analysis notes, and action records to incidents.
Creating a Comment
curl -X POST http://localhost:8080/api/v1/comments \
-H "Content-Type: application/json" \
-d '{
"incident_id": "incident-uuid",
"content": "Found lateral movement evidence via PsExec. @senior-analyst please review.",
"comment_type": "analysis",
"mentions": ["senior-analyst-uuid"]
}'
Comment Types
| Type | Use case |
|---|---|
| note | General notes and observations |
| analysis | Technical findings and analysis |
| action_taken | Record of actions performed |
| question | Questions for other team members |
| resolution | Final resolution summary |
Filtering Comments
# All comments for an incident
curl "http://localhost:8080/api/v1/comments?incident_id={id}"
# Only analysis comments
curl "http://localhost:8080/api/v1/comments?incident_id={id}&comment_type=analysis"
# Comments by a specific analyst
curl "http://localhost:8080/api/v1/comments?author_id={analyst_id}"
Comments support pagination with page and per_page query parameters.
Real-time Events
The real-time event system pushes updates to connected clients when incidents are modified, comments are added, or assignments change. Events include:
- Incident status changes
- New comments and mentions
- Assignment updates
- Action execution results
- Field-level change tracking
Subscribers can filter events by incident ID, event type, or severity.
Activity Feed
Every action on an incident is recorded in the activity feed, providing a complete audit trail:
- Who did what and when
- What fields changed (with before/after values)
- Comment and assignment history
- Action execution records
Filter the activity feed by incident, user, or activity type.
Shift Handoff
Generate structured handoff reports at shift transitions:
curl -X POST http://localhost:8080/api/v1/handoffs \
-H "Content-Type: application/json" \
-d '{
"shift_start": "2025-01-15T08:00:00Z",
"shift_end": "2025-01-15T16:00:00Z",
"notes": "Ongoing phishing campaign targeting finance department"
}'
Handoff reports include:
- Summary of open incidents per severity
- Actions pending approval
- Recent escalations
- Custom notes from the outgoing team
Agentic AI Response
Control how much autonomy the AI has when responding to incidents, from fully manual to fully autonomous, with time-based rules and per-action overrides.
Overview
The Agentic AI Response system (Stage 5.4) provides configurable autonomy levels that determine which actions the AI can execute automatically and which require human approval. It includes:
- Four autonomy levels with increasing automation
- Per-action and per-severity overrides
- Time-based rules for different autonomy during business hours vs. off-hours
- Execution guardrails to prevent dangerous actions
- Full audit trail of every autonomy decision
Autonomy Levels
| Level | Actions auto-executed | Human role |
|---|---|---|
| assisted | None | AI suggests, human executes everything |
| supervised | Low-risk only | AI auto-executes safe actions, human approves the rest |
| autonomous | All except protected | AI handles most actions, human reviews protected targets |
| full_autonomous | Everything | Emergency mode -- AI executes all actions (requires special auth) |
Risk Level Mapping
Each action has an inherent risk level that determines whether it can be auto-executed:
| Risk level | Auto-execute in Supervised? | Auto-execute in Autonomous? |
|---|---|---|
| none | Yes | Yes |
| low | Yes | Yes |
| medium | No | Yes |
| high | No | Yes |
| critical | No | No (requires full_autonomous) |
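The table reduces to a lookup of which risk levels each autonomy level may auto-execute; a sketch of that mapping (protected-asset checks and full_autonomous authorization are out of scope here):

```python
# Risk levels each autonomy level may auto-execute, per the table above.
AUTO_EXECUTE = {
    "assisted":        set(),
    "supervised":      {"none", "low"},
    "autonomous":      {"none", "low", "medium", "high"},
    "full_autonomous": {"none", "low", "medium", "high", "critical"},
}

def can_auto_execute(level: str, risk: str) -> bool:
    return risk in AUTO_EXECUTE[level]
```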
Configuration
Get Current Config
curl http://localhost:8080/api/v1/autonomy/config
Update Config
curl -X PUT http://localhost:8080/api/v1/autonomy/config \
-H "Content-Type: application/json" \
-d '{
"default_level": "supervised",
"per_action_overrides": {
"isolate_host": "assisted",
"create_ticket": "autonomous"
},
"per_severity_overrides": {
"critical": "assisted",
"low": "autonomous"
},
"time_based_rules": [
{
"name": "Business hours - supervised",
"start_hour": 9,
"end_hour": 17,
"days_of_week": [1, 2, 3, 4, 5],
"level": "supervised"
},
{
"name": "Off-hours - autonomous",
"start_hour": 17,
"end_hour": 9,
"days_of_week": [0, 1, 2, 3, 4, 5, 6],
"level": "autonomous"
}
],
"emergency_contacts": ["[email protected]"]
}'
Resolution Priority
When resolving the autonomy level for a given action, overrides are checked in this order:
- Per-action overrides (highest priority)
- Per-severity overrides
- Time-based rules
- Default level (fallback)
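That priority order can be sketched directly; the config shape follows the update example above, and for brevity this sketch handles same-day time windows only:

```python
# Resolution order: per-action override, then per-severity override,
# then time-based rules, then the default level.
def resolve_level(cfg: dict, action: str, severity: str, hour: int) -> str:
    if action in cfg.get("per_action_overrides", {}):
        return cfg["per_action_overrides"][action]
    if severity in cfg.get("per_severity_overrides", {}):
        return cfg["per_severity_overrides"][severity]
    for rule in cfg.get("time_based_rules", []):
        if rule["start_hour"] <= hour < rule["end_hour"]:
            return rule["level"]
    return cfg["default_level"]
```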
Resolve for a Specific Action
Check what the system would decide for a specific action + severity combination:
curl -X POST http://localhost:8080/api/v1/autonomy/resolve \
-H "Content-Type: application/json" \
-d '{"action": "isolate_host", "severity": "critical"}'
Response:
{
"level": "assisted",
"auto_execute": false,
"reason": "Per-action override for 'isolate_host'"
}
Time-Based Rules
Time-based rules let you run with less autonomy during business hours (when analysts are available) and more autonomy during nights and weekends.
| Field | Description |
|---|---|
| name | Human-readable rule name |
| start_hour | Start hour, 0-23 inclusive |
| end_hour | End hour, 0-24 exclusive |
| days_of_week | Array of days (0=Sunday through 6=Saturday) |
| level | Autonomy level when rule applies |
Hours wrap around midnight: start_hour: 22, end_hour: 6 means 10 PM to 6 AM.
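One way such a wrap-around window can be evaluated (an illustrative sketch, not the shipped implementation):

```python
def in_window(hour: int, start: int, end: int) -> bool:
    """True if `hour` falls inside [start, end), wrapping midnight
    when start >= end (e.g. 22-6 covers 10 PM to 6 AM)."""
    if start < end:
        return start <= hour < end       # same-day window, e.g. 9-17
    return hour >= start or hour < end   # wraps midnight, e.g. 22-6
```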
Execution Guardrails
The guardrails system (configured in config/guardrails.yaml) provides hard limits regardless of autonomy level:
- Forbidden actions -- actions that can never be automated (e.g., `delete_user`, `wipe_host`)
- Protected assets -- targets that always require human approval (production systems, domain controllers)
- Rate limits -- maximum actions per hour/day to prevent runaway automation
- Blast radius limits -- caps on how many targets a single action can affect
See Guardrails Reference for full configuration details.
Audit Log
Every autonomy decision is logged for compliance and debugging:
curl "http://localhost:8080/api/v1/autonomy/audit?limit=20"
# Filter by incident
curl "http://localhost:8080/api/v1/autonomy/audit?incident_id={id}"
Each audit entry records:
- Action and severity evaluated
- Resolved autonomy level
- Whether auto-execution was allowed
- Reason for the decision
- Whether the action was actually executed
- Execution outcome
Attack Surface Integration
Correlate incidents with known vulnerabilities and external exposures using integrations with vulnerability scanners and attack surface monitoring platforms.
Overview
The attack surface module (Stage 5.2) connects Triage Warden to:
- Vulnerability scanners -- Qualys, Tenable, and Rapid7 for known vulnerability data
- Attack surface monitors -- Censys and SecurityScorecard for external exposure discovery
- Risk scoring -- combined risk assessment using vulnerability and exposure data
Vulnerability Scanners
Supported Platforms
| Platform | Connector | Capabilities |
|---|---|---|
| Qualys | QualysConnector | Asset vulns, scan results, CVE lookup, recent findings |
| Tenable | TenableConnector | Asset vulns, scan results, CVE lookup, recent findings |
| Rapid7 | Rapid7Connector | Asset vulns, scan results, CVE lookup, recent findings |
VulnerabilityScanner Trait
All scanners implement the same trait, making them interchangeable:
pub trait VulnerabilityScanner: Connector {
    async fn get_vulnerabilities_for_asset(&self, asset_id: &str)
        -> ConnectorResult<Vec<Vulnerability>>;
    async fn get_scan_results(&self, scan_id: &str) -> ConnectorResult<ScanResult>;
    async fn get_recent_vulnerabilities(&self, since: DateTime<Utc>, limit: Option<usize>)
        -> ConnectorResult<Vec<Vulnerability>>;
    async fn get_vulnerability_by_cve(&self, cve_id: &str)
        -> ConnectorResult<Option<Vulnerability>>;
}
Vulnerability Data
Each vulnerability includes:
| Field | Description |
|---|---|
| cve_id | CVE identifier (if assigned) |
| severity | Informational, Low, Medium, High, Critical |
| cvss_score | CVSS base score (0.0 - 10.0) |
| affected_asset_ids | Which assets are affected |
| exploit_available | Whether a public exploit exists |
| patch_available | Whether a vendor patch is available |
| status | Open, Remediated, Accepted, FalsePositive |
Scan Results
Query scan results for summary data:
| Field | Description |
|---|---|
total_hosts | Number of hosts scanned |
vulnerabilities_found | Total vulnerabilities discovered |
critical_count | Critical severity findings |
high_count | High severity findings |
status | Pending, Running, Completed, Failed, Cancelled |
Attack Surface Monitoring
Supported Platforms
| Platform | Connector | Capabilities |
|---|---|---|
| Censys | CensysConnector | Domain exposures, asset exposure, risk scoring |
| SecurityScorecard | ScorecardConnector | Domain exposures, asset exposure, risk scoring |
AttackSurfaceMonitor Trait
pub trait AttackSurfaceMonitor: Connector {
    async fn get_exposures(&self, domain: &str) -> ConnectorResult<Vec<ExternalExposure>>;
    async fn get_asset_exposure(&self, asset_id: &str) -> ConnectorResult<Vec<ExternalExposure>>;
    async fn get_risk_score(&self, domain: &str) -> ConnectorResult<Option<f32>>;
}
Exposure Types
The system detects these categories of external exposure:
| Type | Description | Example |
|---|---|---|
| open_port | Open network port with identified service | Port 22 running SSH |
| expired_certificate | TLS certificate past its expiry date | example.com cert expired |
| weak_cipher | Deprecated or weak TLS cipher in use | RC4 cipher detected |
| exposed_service | Publicly accessible service that may be unintended | Elasticsearch on public IP |
| dns_issue | DNS misconfiguration | Missing SPF record |
| misconfigured_header | Missing or incorrect HTTP security header | No X-Frame-Options |
Each exposure includes a risk score (0.0 to 100.0) and structured details.
Risk Scoring
Risk scores from vulnerability scanners and ASM platforms are combined during incident triage to assess the exposure of affected assets. When the AI agent triages an incident involving a compromised host, it can check:
- What known vulnerabilities exist on the host
- Whether public exploits are available for those vulnerabilities
- What external exposures exist for the host or its domain
- The overall risk score for the affected domain
This context helps the agent make more accurate severity assessments and recommend appropriate response actions.
Configuration
Add vulnerability scanner and ASM connectors in config/default.yaml:
connectors:
  qualys:
    connector_type: qualys
    enabled: true
    base_url: https://qualysapi.qualys.com
    api_key: ${QUALYS_USERNAME}
    api_secret: ${QUALYS_PASSWORD}
    timeout_secs: 60
  censys:
    connector_type: censys
    enabled: true
    base_url: https://search.censys.io
    api_key: ${CENSYS_API_ID}
    api_secret: ${CENSYS_SECRET}
    timeout_secs: 30
Content Packages
Share playbooks, hunts, knowledge articles, and saved queries between Triage Warden instances using distributable content packages.
Overview
The content package system (Stage 5.5) provides:
- Import/export of playbooks, hunts, knowledge, and queries
- Package validation before import
- Conflict resolution when imported content already exists
- Semantic versioning and compatibility tracking
Package Format
A content package consists of a manifest and a list of content items:
{
"manifest": {
"name": "phishing-response-kit",
"version": "1.2.0",
"description": "Playbooks and hunts for phishing incident response",
"author": "Security Team",
"license": "MIT",
"tags": ["phishing", "email", "social-engineering"],
"compatibility": ">=2.0.0"
},
"contents": [
{
"type": "playbook",
"name": "phishing-triage",
"data": { "stages": [...] }
},
{
"type": "hunt",
"name": "credential-harvesting-detection",
"data": { "hypothesis": "...", "queries": [...] }
},
{
"type": "knowledge",
"title": "Phishing Indicators Guide",
"content": "Common phishing indicators include..."
},
{
"type": "query",
"name": "failed-logins-by-source",
"query_type": "siem",
"query": "event.type:authentication AND event.outcome:failure | stats count by source.ip"
}
]
}
Content Types
| Type | Description | Stored in |
|---|---|---|
| playbook | Automated response workflows | Playbook repository |
| hunt | Threat hunt definitions with queries | Hunt store |
| knowledge | Reference articles and guides | Knowledge base |
| query | Saved search queries | Query library |
Manifest Fields
| Field | Required | Description |
|---|---|---|
| name | Yes | Unique package name |
| version | Yes | Semantic version string |
| description | Yes | What the package contains |
| author | Yes | Creator name or organization |
| license | No | License identifier (e.g., "MIT", "Apache-2.0") |
| tags | No | Categorization tags |
| compatibility | No | Minimum Triage Warden version required |
Importing Packages
curl -X POST http://localhost:8080/api/v1/packages/import \
-H "Content-Type: application/json" \
-d '{
"package": { ... },
"conflict_resolution": "skip"
}'
Response:
{
"imported": 3,
"skipped": 1,
"errors": []
}
Conflict Resolution
When an imported item has the same name as an existing one:
| Mode | Behavior |
|---|---|
| skip | Keep existing, ignore the imported item (default) |
| overwrite | Replace existing with the imported version |
| rename | Import with a modified name (e.g., phishing-triage-imported-1) |
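The three modes can be sketched against an in-memory store; the rename suffix format follows the example above, and the store/return values are illustrative:

```python
def import_item(store: dict, name: str, data, mode: str = "skip"):
    """Import one content item, resolving name conflicts per `mode`."""
    if name not in store:
        store[name] = data
        return "imported"
    if mode == "skip":
        return "skipped"          # keep the existing item untouched
    if mode == "overwrite":
        store[name] = data
        return "imported"
    if mode == "rename":
        n = 1
        while f"{name}-imported-{n}" in store:
            n += 1                # find the first free suffix
        store[f"{name}-imported-{n}"] = data
        return "imported"
    raise ValueError(f"unknown conflict mode: {mode}")
```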
Validating Packages
Check a package for errors before importing:
curl -X POST http://localhost:8080/api/v1/packages/validate \
-H "Content-Type: application/json" \
-d '{ "manifest": { ... }, "contents": [ ... ] }'
Response:
{
"valid": true,
"warnings": ["Package author is not specified"],
"errors": [],
"content_count": 4
}
Validation checks:
- Package name and version are present
- All content items have non-empty names
- Warns on missing author or empty content list
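A client-side pre-check mirroring these rules can catch problems before the round trip to `/packages/validate`. This sketch is illustrative only; the server's validation is authoritative and may check more than this:

```python
def prevalidate(package: dict) -> dict:
    """Pre-check a package against the documented validation rules:
    name/version required, non-empty content names, warnings on a
    missing author or empty content list. Illustrative sketch only."""
    manifest = package.get("manifest", {})
    contents = package.get("contents", [])
    errors, warnings = [], []

    # Errors: package name and version must be present
    if not manifest.get("name"):
        errors.append("Package name is missing")
    if not manifest.get("version"):
        errors.append("Package version is missing")

    # Errors: every content item needs a non-empty name (or title)
    for i, item in enumerate(contents):
        if not (item.get("name") or item.get("title")):
            errors.append(f"Content item {i} has an empty name")

    # Warnings: missing author or empty content list
    if not manifest.get("author"):
        warnings.append("Package author is not specified")
    if not contents:
        warnings.append("Package contains no content items")

    return {"valid": not errors, "errors": errors,
            "warnings": warnings, "content_count": len(contents)}
```

The return shape follows the `/packages/validate` response above (`valid`, `warnings`, `errors`, `content_count`).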
Exporting Content
Export a Playbook
curl -X POST http://localhost:8080/api/v1/packages/export/playbook/{playbook_id} \
-H "Content-Type: application/json" \
-d '{
"name": "my-playbook-package",
"version": "1.0.0",
"description": "Exported playbook",
"author": "Security Team",
"license": "MIT",
"tags": ["phishing"]
}'
Export a Hunt
curl -X POST http://localhost:8080/api/v1/packages/export/hunt/{hunt_id} \
-H "Content-Type: application/json" \
-d '{
"name": "my-hunt-package",
"version": "1.0.0",
"description": "Exported hunt",
"author": "Threat Hunting Team"
}'
Both return the full package JSON that can be shared or imported into another instance.
AI Triage
Automated incident analysis using Claude AI agents.
Overview
The triage agent analyzes security incidents to:
- Classify - Determine if the incident is malicious, suspicious, or benign
- Assess confidence - Quantify certainty in the classification
- Explain - Provide reasoning for the verdict
- Recommend - Suggest response actions
How It Works
Incident → Playbook Selection → Tool Execution → AI Analysis → Verdict
- Incident received - New incident created via webhook or API
- Playbook selected - Based on incident type (phishing, malware, etc.)
- Tools executed - Parse data, lookup reputation, check authentication
- AI analysis - Claude analyzes gathered data
- Verdict returned - Classification with confidence and recommendations
Example Verdict
{
"incident_id": "INC-2024-001",
"classification": "malicious",
"confidence": 0.92,
"category": "phishing",
"reasoning": "Multiple indicators suggest this is a credential phishing attempt:\n1. Sender domain registered 2 days ago\n2. SPF and DKIM authentication failed\n3. URL leads to a fake Microsoft login page\n4. Subject uses urgency tactics",
"recommended_actions": [
{
"action": "quarantine_email",
"priority": 1,
"reason": "Prevent user access to phishing content"
},
{
"action": "block_sender",
"priority": 2,
"reason": "Sender has no legitimate history"
},
{
"action": "notify_user",
"priority": 3,
"reason": "Educate user about phishing attempt"
}
],
"iocs": [
{"type": "domain", "value": "phishing-site.com"},
{"type": "ip", "value": "192.168.1.100"}
],
"mitre_attack": ["T1566.001", "T1078"]
}
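For programmatic consumers, the verdict JSON maps naturally onto a small dataclass. The field names follow the example above; the class itself is an illustrative sketch, not part of the Triage Warden client library:

```python
from dataclasses import dataclass, field

@dataclass
class RecommendedAction:
    action: str
    priority: int
    reason: str

@dataclass
class Verdict:
    incident_id: str
    classification: str   # malicious | suspicious | benign | inconclusive
    confidence: float     # 0.0 - 1.0
    category: str
    reasoning: str
    recommended_actions: list = field(default_factory=list)
    iocs: list = field(default_factory=list)
    mitre_attack: list = field(default_factory=list)

    @classmethod
    def from_json(cls, data: dict) -> "Verdict":
        # Parse nested actions; the remaining fields map one-to-one
        actions = [RecommendedAction(**a) for a in data.get("recommended_actions", [])]
        return cls(
            incident_id=data["incident_id"],
            classification=data["classification"],
            confidence=data["confidence"],
            category=data["category"],
            reasoning=data["reasoning"],
            recommended_actions=actions,
            iocs=data.get("iocs", []),
            mitre_attack=data.get("mitre_attack", []),
        )
```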
Triggering Triage
Automatic (Webhook)
Configure webhooks to auto-triage new incidents:
webhooks:
email_gateway:
auto_triage: true
playbook: phishing_triage
Manual (CLI)
tw-cli triage run --incident INC-2024-001
Manual (API)
curl -X POST http://localhost:8080/api/incidents/INC-2024-001/triage
Next Steps
- Triage Agent - Agent architecture and configuration
- Verdict Types - Understanding classifications
- Confidence Scoring - How confidence is calculated
Triage Agent
The AI agent that analyzes security incidents.
Architecture
┌─────────────────────────────────────────────────────────┐
│ Triage Agent │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Claude │ │ Tools │ │ Playbook │ │
│ │ Model │ │ (Bridge) │ │ Engine │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Python Bridge │
│ (ThreatIntelBridge, SIEMBridge, etc.) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Rust Connectors │
│ (VirusTotal, Splunk, CrowdStrike, etc.) │
└─────────────────────────────────────────────────────────┘
Agent Configuration
# python/tw_ai/agents/config.py
class AgentConfig:
model: str = "claude-sonnet-4-20250514"
max_tokens: int = 4096
temperature: float = 0.1
max_tool_calls: int = 10
timeout_seconds: int = 120
Environment variables:
TW_AI_PROVIDER=anthropic
TW_ANTHROPIC_API_KEY=your-key
TW_AI_MODEL=claude-sonnet-4-20250514
Available Tools
The agent has access to these tools via the Python bridge:
| Tool | Purpose |
|---|---|
| parse_email | Extract email components |
| check_email_authentication | Validate SPF/DKIM/DMARC |
| lookup_sender_reputation | Query sender reputation |
| lookup_urls | Check URL reputation |
| lookup_attachments | Check attachment hashes |
| search_siem | Query SIEM for related events |
| get_host_info | Get EDR host information |
Agent Workflow
async def triage(self, incident: Incident) -> Verdict:
# 1. Load appropriate playbook
playbook = self.load_playbook(incident.incident_type)
# 2. Execute playbook steps (tools)
context = {}
for step in playbook.steps:
result = await self.execute_step(step, incident, context)
context[step.output] = result
# 3. Build analysis prompt
prompt = self.build_analysis_prompt(incident, context)
# 4. Get AI verdict
response = await self.client.messages.create(
model=self.config.model,
messages=[{"role": "user", "content": prompt}],
max_tokens=self.config.max_tokens
)
# 5. Parse and return verdict
return self.parse_verdict(response)
System Prompt
The agent uses a specialized system prompt:
You are an expert security analyst assistant. Analyze the provided security
incident data and determine:
1. Classification: Is this malicious, suspicious, benign, or inconclusive?
2. Confidence: How certain are you (0.0 to 1.0)?
3. Category: What type of threat is this (phishing, malware, etc.)?
4. Reasoning: Explain your analysis step by step
5. Recommended Actions: What should be done to respond?
Use the tool results provided to inform your analysis. Be thorough but concise.
Cite specific evidence for your conclusions.
Tool Calling
The agent can call tools during analysis:
# Agent decides to check URL reputation
tool_result = await self.call_tool(
name="lookup_urls",
parameters={"urls": ["https://suspicious-site.com/login"]}
)
# Result used in analysis
# {
# "results": [{
# "url": "https://suspicious-site.com/login",
# "malicious": true,
# "categories": ["phishing"],
# "confidence": 0.95
# }]
# }
Customizing the Agent
Custom System Prompt
agent = TriageAgent(
system_prompt="""
You are a SOC analyst specializing in email security.
Focus on phishing indicators and BEC patterns.
Always check sender authentication carefully.
"""
)
Custom Tools
Register additional tools:
@agent.tool
async def custom_lookup(domain: str) -> dict:
"""Look up domain in internal threat database."""
return await internal_db.query(domain)
Model Selection
# Use different models for different scenarios
if incident.severity == "critical":
agent = TriageAgent(model="claude-opus-4-20250514")
else:
agent = TriageAgent(model="claude-sonnet-4-20250514")
Error Handling
The agent handles failures gracefully:
try:
verdict = await agent.triage(incident)
except ToolError as e:
# Tool failed - continue with available data
verdict = await agent.triage_partial(incident, failed_tools=[e.tool])
except AIError as e:
# AI call failed - return inconclusive
verdict = Verdict.inconclusive(reason=str(e))
Metrics
Agent metrics exported to Prometheus:
- `triage_duration_seconds` - Time to complete triage
- `triage_tool_calls_total` - Tool calls per triage
- `triage_verdict_total` - Verdicts by classification
- `triage_confidence_histogram` - Confidence score distribution
Verdict Types
Understanding the classification outcomes from AI triage.
Classifications
| Classification | Description | Typical Response |
|---|---|---|
| Malicious | Confirmed threat | Immediate containment |
| Suspicious | Likely threat, needs investigation | Queue for analyst review |
| Benign | Not a threat | Close or archive |
| Inconclusive | Insufficient data | Request more information |
Malicious
The incident is a confirmed security threat.
Criteria:
- Multiple strong threat indicators
- High-confidence threat intelligence matches
- Clear malicious intent (credential theft, malware, etc.)
Example:
{
"classification": "malicious",
"confidence": 0.95,
"category": "phishing",
"reasoning": "Email contains credential phishing page targeting Microsoft 365. Sender domain registered yesterday, fails all email authentication. URL redirects to fake login mimicking Microsoft branding."
}
Response:
- Execute recommended containment actions
- Create incident ticket
- Notify affected users
Suspicious
The incident shows concerning indicators but lacks definitive proof.
Criteria:
- Some threat indicators present
- Mixed or conflicting signals
- Unusual but not clearly malicious behavior
Example:
{
"classification": "suspicious",
"confidence": 0.65,
"category": "potential_phishing",
"reasoning": "Email sender is unknown but domain is 6 months old with valid authentication. URL leads to legitimate document sharing service but file name uses urgency tactics. Recipient has not received email from this sender before."
}
Response:
- Queue for analyst review
- Gather additional context
- Consider temporary quarantine pending review
Benign
The incident is not a security threat.
Criteria:
- No threat indicators found
- Known good sender/source
- Normal expected behavior
Example:
{
"classification": "benign",
"confidence": 0.92,
"category": "legitimate_email",
"reasoning": "Email from known vendor with established sending history. All authentication passes. Attachment is a standard invoice PDF matching expected format. No suspicious URLs or indicators."
}
Response:
- Close incident
- Release from quarantine if held
- Update detection rules if false positive
Inconclusive
Insufficient data to make a determination.
Criteria:
- Missing critical information
- Tool failures preventing analysis
- Conflicting strong indicators
Example:
{
"classification": "inconclusive",
"confidence": 0.3,
"category": "unknown",
"reasoning": "Unable to analyze attachment - file corrupted. Sender reputation service unavailable. Email authentication results are mixed (SPF pass, DKIM fail). Need manual review of attachment content.",
"missing_data": [
"attachment_analysis",
"sender_reputation"
]
}
Response:
- Escalate to analyst
- Retry failed tool calls
- Request additional information
Confidence Scores
Confidence ranges and their meaning:
| Range | Interpretation |
|---|---|
| 0.9 - 1.0 | Very high confidence, clear evidence |
| 0.7 - 0.9 | High confidence, strong indicators |
| 0.5 - 0.7 | Moderate confidence, mixed signals |
| 0.3 - 0.5 | Low confidence, limited evidence |
| 0.0 - 0.3 | Very low confidence, insufficient data |
Category Types
Email Threats
| Category | Description |
|---|---|
| phishing | Credential theft attempt |
| spear_phishing | Targeted phishing |
| bec | Business email compromise |
| malware_delivery | Malicious attachment/link |
| spam | Unsolicited bulk email |
Endpoint Threats
| Category | Description |
|---|---|
| malware | Malicious software detected |
| ransomware | Ransomware activity |
| cryptominer | Cryptocurrency mining |
| rat | Remote access trojan |
| pup | Potentially unwanted program |
Access Threats
| Category | Description |
|---|---|
| brute_force | Password guessing attempt |
| credential_stuffing | Leaked credential use |
| impossible_travel | Geographically impossible login |
| account_takeover | Compromised account |
Using Verdicts
Automation Rules
# Auto-respond to high-confidence malicious
- trigger:
classification: malicious
confidence: ">= 0.9"
actions:
- quarantine_email
- block_sender
- create_ticket
# Queue suspicious for review
- trigger:
classification: suspicious
actions:
- escalate:
level: analyst
reason: "Suspicious activity requires review"
Metrics
Track verdict distribution:
# Verdict counts by classification
sum by (classification) (triage_verdict_total)
# Average confidence by category
avg by (category) (triage_confidence)
Confidence Scoring
How the AI agent determines confidence in its verdicts.
Confidence Factors
The agent considers multiple factors when calculating confidence:
Evidence Quality
| Factor | Impact |
|---|---|
| Threat intel match (high confidence) | +0.3 |
| Threat intel match (low confidence) | +0.1 |
| Authentication failure | +0.2 |
| Known malicious indicator | +0.3 |
| Suspicious pattern | +0.1 |
Evidence Quantity
| Indicators | Confidence Boost |
|---|---|
| 1 indicator | Base |
| 2-3 indicators | +0.1 |
| 4-5 indicators | +0.2 |
| 6+ indicators | +0.3 |
Data Completeness
| Missing Data | Confidence Penalty |
|---|---|
| None | 0 |
| Minor (sender reputation) | -0.1 |
| Moderate (attachment analysis) | -0.2 |
| Major (multiple tools failed) | -0.3 |
Calculation Example
Phishing Email Analysis:
Base confidence: 0.5
Evidence found:
+ SPF failed: +0.15
+ DKIM failed: +0.15
+ Sender domain < 7 days old: +0.2
+ URL matches phishing pattern: +0.25
+ VirusTotal flags URL as phishing: +0.2
Evidence count (5): +0.2
Data completeness: All tools succeeded: +0
Final confidence: 0.5 + 0.15 + 0.15 + 0.2 + 0.25 + 0.2 + 0.2 = 1.65, capped at 0.99
Verdict: malicious, confidence: 0.99
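The arithmetic above can be expressed as a simple additive model. The weights are the illustrative values from this page, not the agent's actual internals:

```python
def score_confidence(evidence: dict, missing_penalty: float = 0.0,
                     base: float = 0.5) -> float:
    """Additive confidence sketch using the documented example weights.
    evidence maps indicator name -> weight; illustrative only."""
    total = base + sum(evidence.values())

    # Evidence-count bonus from the table: 2-3 -> +0.1, 4-5 -> +0.2, 6+ -> +0.3
    n = len(evidence)
    if n >= 6:
        total += 0.3
    elif n >= 4:
        total += 0.2
    elif n >= 2:
        total += 0.1

    total -= missing_penalty           # data-completeness penalty
    return min(max(total, 0.0), 0.99)  # clamp to [0, 0.99]

# Reproduce the phishing example: five indicators, all tools succeeded
evidence = {
    "spf_failed": 0.15,
    "dkim_failed": 0.15,
    "young_sender_domain": 0.2,
    "url_phishing_pattern": 0.25,
    "virustotal_phishing": 0.2,
}
print(round(score_confidence(evidence), 2))  # 0.99 (raw sum 1.65, capped)
```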
Confidence Thresholds
Policy decisions use confidence thresholds:
# Auto-quarantine high confidence malicious
[[policy.rules]]
name = "auto_quarantine_confident"
classification = "malicious"
confidence_min = 0.9
action = "quarantine_email"
decision = "allowed"
# Require review for lower confidence
[[policy.rules]]
name = "review_uncertain"
confidence_max = 0.7
decision = "requires_approval"
approval_level = "analyst"
Confidence Calibration
The agent is calibrated so confidence correlates with accuracy:
| Stated Confidence | Expected Accuracy |
|---|---|
| 0.9 | ~90% of verdicts correct |
| 0.8 | ~80% of verdicts correct |
| 0.7 | ~70% of verdicts correct |
Monitoring Calibration
Track calibration with metrics:
# Accuracy at confidence level
triage_accuracy_by_confidence{confidence_bucket="0.9-1.0"}
Improving Calibration
- Feedback loop - Log false positives to improve
- Periodic review - Sample low-confidence verdicts
- Model updates - Retrain with corrected examples
Handling Low Confidence
When confidence is low:
Option 1: Escalate
- condition: confidence < 0.6
action: escalate
parameters:
level: analyst
reason: "Low confidence verdict requires human review"
Option 2: Gather More Data
- condition: confidence < 0.6
action: request_additional_data
parameters:
- "sender_history"
- "recipient_context"
Option 3: Conservative Default
- condition: confidence < 0.6
action: quarantine_email
parameters:
reason: "Quarantined pending review due to uncertainty"
Confidence in UI
Dashboard displays confidence visually:
| Confidence | Display |
|---|---|
| 0.9+ | Green badge, "High Confidence" |
| 0.7-0.9 | Yellow badge, "Moderate Confidence" |
| 0.5-0.7 | Orange badge, "Low Confidence" |
| <0.5 | Red badge, "Very Low Confidence" |
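The badge mapping in the table above reduces to a threshold lookup. This is an illustrative helper; the exact boundary handling (>= vs >) in the dashboard is an assumption:

```python
def confidence_badge(confidence: float) -> tuple[str, str]:
    """Map a confidence score to the dashboard badge from the table
    above. Boundary handling is assumed to be inclusive lower bounds."""
    if confidence >= 0.9:
        return ("green", "High Confidence")
    if confidence >= 0.7:
        return ("yellow", "Moderate Confidence")
    if confidence >= 0.5:
        return ("orange", "Low Confidence")
    return ("red", "Very Low Confidence")
```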
Improving Confidence
Actions that help the agent be more confident:
- Complete data - Ensure all tools succeed
- Rich context - Provide incident metadata
- Historical data - Include past incidents with similar patterns
- Clear playbooks - Well-defined analysis steps
Playbooks
Playbooks define automated investigation and response workflows.
Overview
A playbook is a sequence of steps that:
- Gather and analyze incident data
- Enrich with threat intelligence
- Determine verdict and response
- Execute approved actions
Playbook Structure
name: phishing_triage
description: Automated phishing email analysis
version: "1.0"
# When this playbook applies
triggers:
incident_type: phishing
auto_run: true
# Variables available to steps
variables:
quarantine_threshold: 0.7
block_threshold: 0.3
# Execution steps
steps:
- name: Parse Email
action: parse_email
parameters:
raw_email: "{{ incident.raw_data.raw_email }}"
output: parsed
- name: Check Authentication
action: check_email_authentication
parameters:
headers: "{{ parsed.headers }}"
output: auth
- name: Check Sender
action: lookup_sender_reputation
parameters:
sender: "{{ parsed.sender }}"
output: sender_rep
- name: Check URLs
action: lookup_urls
parameters:
urls: "{{ parsed.urls }}"
output: url_results
condition: "{{ parsed.urls | length > 0 }}"
- name: Quarantine if Malicious
action: quarantine_email
parameters:
message_id: "{{ incident.raw_data.message_id }}"
reason: "Automated quarantine - phishing detected"
condition: >
sender_rep.score < variables.quarantine_threshold or
url_results.malicious_count > 0 or
not auth.authentication_passed
# Final verdict generation
verdict:
use_ai: true
model: claude-sonnet-4-20250514
context:
- parsed
- auth
- sender_rep
- url_results
Triggers
Define when a playbook runs:
triggers:
# Run for specific incident types
incident_type: phishing
# Auto-run on incident creation
auto_run: true
# Or require manual trigger
auto_run: false
# Conditions
conditions:
severity: ["medium", "high", "critical"]
source: "email_gateway"
Steps
Basic Step
- name: Step Name
action: action_name
parameters:
key: value
output: variable_name
Conditional Step
- name: Block Known Bad
action: block_sender
parameters:
sender: "{{ parsed.sender }}"
condition: "{{ sender_rep.score < 0.2 }}"
Parallel Steps
- parallel:
- action: lookup_urls
parameters:
urls: "{{ parsed.urls }}"
output: url_results
- action: lookup_attachments
parameters:
attachments: "{{ parsed.attachments }}"
output: attachment_results
Loop Steps
- name: Check Each URL
loop: "{{ parsed.urls }}"
action: lookup_url
parameters:
url: "{{ item }}"
output: url_results
aggregate: list
Variables
Built-in Variables
| Variable | Description |
|---|---|
| incident | The incident being processed |
| incident.id | Incident ID |
| incident.raw_data | Original incident data |
| incident.severity | Incident severity |
| variables | Playbook-defined variables |
Step Outputs
Each step's output is available to subsequent steps:
- action: parse_email
output: parsed
- action: lookup_urls
parameters:
urls: "{{ parsed.urls }}" # Use previous output
Templates
Use Jinja2-style templates:
parameters:
message: "Alert for {{ incident.id }}: {{ parsed.subject }}"
priority: "{{ 'high' if incident.severity == 'critical' else 'medium' }}"
Next Steps
- Creating Playbooks - Write your own playbooks
- Built-in Playbooks - Ready-to-use playbooks
Creating Playbooks
Guide to writing custom playbooks for your security workflows.
Getting Started
1. Create Playbook File
mkdir -p playbooks
touch playbooks/my_playbook.yaml
2. Define Basic Structure
name: my_playbook
description: Description of what this playbook does
version: "1.0"
triggers:
incident_type: phishing
auto_run: true
steps:
- name: First Step
action: parse_email
output: result
3. Register Playbook
tw-cli playbook add playbooks/my_playbook.yaml
Step Types
Action Step
Execute a registered action:
- name: Parse Email Content
action: parse_email
parameters:
raw_email: "{{ incident.raw_data.raw_email }}"
output: parsed
on_error: continue # or "fail" (default)
Condition Step
Branch based on conditions:
- name: Check if High Risk
condition: "{{ sender_rep.score < 0.3 }}"
then:
- action: quarantine_email
parameters:
message_id: "{{ incident.raw_data.message_id }}"
else:
- action: log_event
parameters:
message: "Low risk, no action needed"
AI Analysis Step
Get AI verdict:
- name: AI Analysis
type: ai_analysis
model: claude-sonnet-4-20250514
context:
- parsed
- auth_results
- reputation
prompt: |
Analyze this email for phishing indicators.
Consider the authentication results and sender reputation.
output: ai_verdict
Notification Step
Send alerts:
- name: Alert Team
action: notify_channel
parameters:
channel: slack
message: |
New {{ incident.severity }} incident detected
ID: {{ incident.id }}
Type: {{ incident.incident_type }}
Error Handling
Per-Step Error Handling
- name: Check Reputation
action: lookup_sender_reputation
parameters:
sender: "{{ parsed.sender }}"
output: reputation
on_error: continue # Don't fail playbook if this fails
default_output: # Use this if step fails
score: 0.5
risk_level: "unknown"
Global Error Handler
on_error:
- action: notify_channel
parameters:
channel: slack
message: "Playbook {{ playbook.name }} failed: {{ error.message }}"
- action: escalate
parameters:
level: analyst
reason: "Automated triage failed"
Variables and Templates
Define Variables
variables:
high_risk_threshold: 0.3
quarantine_enabled: true
notification_channel: "#security-alerts"
Use Variables
- name: Check Risk
condition: "{{ sender_rep.score < variables.high_risk_threshold }}"
then:
- action: quarantine_email
condition: "{{ variables.quarantine_enabled }}"
Template Functions
parameters:
# String manipulation
domain: "{{ parsed.sender | split('@') | last }}"
# Conditionals
priority: "{{ 'critical' if incident.severity == 'critical' else 'high' }}"
# Lists
all_urls: "{{ parsed.urls | join(', ') }}"
url_count: "{{ parsed.urls | length }}"
# Defaults
assignee: "{{ incident.assignee | default('unassigned') }}"
Testing Playbooks
Dry Run
tw-cli playbook test my_playbook \
--incident INC-2024-001 \
--dry-run
With Mock Data
tw-cli playbook test my_playbook \
--data '{"raw_email": "From: [email protected]..."}'
Validate Syntax
tw-cli playbook validate playbooks/my_playbook.yaml
Best Practices
1. Use Descriptive Names
# Good
- name: Check sender domain reputation
# Bad
- name: step1
2. Handle Failures Gracefully
- name: External Lookup
action: lookup_sender_reputation
on_error: continue
default_output:
score: 0.5
3. Add Timeouts
- name: Slow External API
action: custom_lookup
timeout: 30s
4. Log Key Decisions
- name: Log Verdict
action: log_event
parameters:
level: info
message: "Verdict: {{ verdict.classification }} ({{ verdict.confidence }})"
5. Version Your Playbooks
name: phishing_triage
version: "2.1.0"
changelog:
- "2.1.0: Added attachment analysis"
- "2.0.0: Restructured for parallel lookups"
Example: Complete Playbook
name: comprehensive_phishing_triage
description: Full phishing email analysis with all checks
version: "2.0"
triggers:
incident_type: phishing
auto_run: true
variables:
quarantine_threshold: 0.3
block_threshold: 0.2
steps:
# Parse email
- name: Parse Email
action: parse_email
parameters:
raw_email: "{{ incident.raw_data.raw_email }}"
output: parsed
# Parallel enrichment
- name: Enrich Data
parallel:
- action: check_email_authentication
parameters:
headers: "{{ parsed.headers }}"
output: auth
- action: lookup_sender_reputation
parameters:
sender: "{{ parsed.sender }}"
output: sender_rep
- action: lookup_urls
parameters:
urls: "{{ parsed.urls }}"
output: urls
condition: "{{ parsed.urls | length > 0 }}"
- action: lookup_attachments
parameters:
attachments: "{{ parsed.attachments }}"
output: attachments
condition: "{{ parsed.attachments | length > 0 }}"
# AI Analysis
- name: AI Verdict
type: ai_analysis
model: claude-sonnet-4-20250514
context: [parsed, auth, sender_rep, urls, attachments]
output: verdict
# Response actions
- name: Quarantine Malicious
action: quarantine_email
parameters:
message_id: "{{ incident.raw_data.message_id }}"
condition: >
verdict.classification == 'malicious' and
verdict.confidence >= variables.quarantine_threshold
- name: Block Repeat Offender
action: block_sender
parameters:
sender: "{{ parsed.sender }}"
condition: >
sender_rep.score < variables.block_threshold
- name: Create Ticket
action: create_ticket
parameters:
title: "{{ verdict.classification | title }}: {{ parsed.subject | truncate(50) }}"
priority: "{{ incident.severity }}"
condition: "{{ verdict.classification != 'benign' }}"
on_error:
- action: escalate
parameters:
level: analyst
reason: "Playbook execution failed"
Built-in Playbooks
Ready-to-use playbooks included with Triage Warden.
Email Security
phishing_triage
Comprehensive phishing email analysis.
Triggers: incident_type: phishing
Steps:
- Parse email headers and body
- Check SPF/DKIM/DMARC authentication
- Look up sender reputation
- Analyze URLs against threat intel
- Check attachment hashes
- AI analysis and verdict
- Auto-quarantine if malicious (confidence > 0.8)
Usage:
tw-cli playbook run phishing_triage --incident INC-2024-001
spam_triage
Quick spam classification.
Triggers: incident_type: spam
Steps:
- Parse email
- Check spam indicators (bulk headers, suspicious patterns)
- Classify as spam/not spam
- Auto-archive low-confidence spam
bec_detection
Business Email Compromise detection.
Triggers: incident_type: bec
Steps:
- Parse email
- Check for executive impersonation
- Analyze reply-to mismatch
- Check for urgency indicators
- Verify sender against directory
- AI analysis for social engineering patterns
Endpoint Security
malware_triage
Malware alert analysis.
Triggers: incident_type: malware
Steps:
- Get host information from EDR
- Look up file hash
- Check related processes
- Query SIEM for lateral movement
- AI verdict
- Auto-isolate if critical severity + high confidence
suspicious_login
Anomalous login investigation.
Triggers: incident_type: suspicious_login
Steps:
- Get login details
- Check for impossible travel
- Query user's recent activity
- Check IP reputation
- Verify device fingerprint
- AI analysis
Customizing Built-in Playbooks
Override Variables
tw-cli playbook run phishing_triage \
--incident INC-2024-001 \
--var quarantine_threshold=0.9 \
--var auto_block=false
Fork and Modify
# Export built-in playbook
tw-cli playbook export phishing_triage > my_phishing.yaml
# Edit as needed
vim my_phishing.yaml
# Register custom version
tw-cli playbook add my_phishing.yaml
Extend with Hooks
# my_phishing.yaml
extends: phishing_triage
# Add steps after parent playbook
after_steps:
- name: Custom Logging
action: log_to_siem
parameters:
event: phishing_verdict
data: "{{ verdict }}"
# Override variables
variables:
quarantine_threshold: 0.85
Playbook Comparison
| Playbook | AI Used | Auto-Response | Typical Duration |
|---|---|---|---|
| phishing_triage | Yes | Quarantine, Block | 30-60s |
| spam_triage | No | Archive | 5-10s |
| bec_detection | Yes | Escalate | 45-90s |
| malware_triage | Yes | Isolate | 60-120s |
| suspicious_login | Yes | Lock account | 30-60s |
Monitoring Playbooks
Execution Metrics
# Playbook execution count
sum by (playbook) (playbook_executions_total)
# Average duration
avg by (playbook) (playbook_duration_seconds)
# Success rate
sum(playbook_executions_total{status="success"}) /
sum(playbook_executions_total)
Alerts
# Alert on playbook failures
- alert: PlaybookFailureRate
expr: |
sum(rate(playbook_executions_total{status="failed"}[5m])) /
sum(rate(playbook_executions_total[5m])) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Playbook failure rate above 10%"
REST API
Programmatic access to Triage Warden functionality.
Base URL
http://localhost:8080/api
Authentication
See Authentication for details.
API Key
curl -H "Authorization: Bearer tw_abc123_secretkey" \
http://localhost:8080/api/incidents
Session Cookie
For browser-based access, use session authentication via /login.
Response Format
All responses are JSON:
{
"data": { ... },
"meta": {
"page": 1,
"per_page": 20,
"total": 150
}
}
Error Responses
{
"error": {
"code": "not_found",
"message": "Incident not found",
"details": { ... }
}
}
HTTP Status Codes
| Code | Meaning |
|---|---|
| 200 | Success |
| 201 | Created |
| 400 | Bad Request |
| 401 | Unauthorized |
| 403 | Forbidden |
| 404 | Not Found |
| 422 | Validation Error |
| 429 | Rate Limited |
| 500 | Server Error |
Endpoints Overview
Incidents
| Method | Path | Description |
|---|---|---|
| GET | /incidents | List incidents |
| POST | /incidents | Create incident |
| GET | /incidents/:id | Get incident |
| PUT | /incidents/:id | Update incident |
| DELETE | /incidents/:id | Delete incident |
| POST | /incidents/:id/triage | Run triage |
| POST | /incidents/:id/actions | Execute action |
Actions
| Method | Path | Description |
|---|---|---|
| GET | /actions | List actions |
| GET | /actions/:id | Get action |
| POST | /actions/:id/approve | Approve action |
| POST | /actions/:id/reject | Reject action |
Playbooks
| Method | Path | Description |
|---|---|---|
| GET | /playbooks | List playbooks |
| POST | /playbooks | Create playbook |
| GET | /playbooks/:id | Get playbook |
| PUT | /playbooks/:id | Update playbook |
| DELETE | /playbooks/:id | Delete playbook |
| POST | /playbooks/:id/run | Run playbook |
Webhooks
| Method | Path | Description |
|---|---|---|
| POST | /webhooks/:source | Receive webhook |
System
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |
| GET | /connectors/health | Connector status |
Pagination
List endpoints support pagination:
curl "http://localhost:8080/api/incidents?page=2&per_page=50"
Parameters:
- `page` - Page number (default: 1)
- `per_page` - Items per page (default: 20, max: 100)
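Walking all pages of a list endpoint follows directly from the `data`/`meta` envelope shown under Response Format. In this sketch, `fetch_page` is a hypothetical callable you supply (e.g. a wrapper around `GET /api/incidents?page=N`):

```python
from typing import Callable, Iterator

def iter_all(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every item across all pages of a paginated list endpoint,
    using the documented data/meta envelope. fetch_page(page) returns
    the parsed JSON body for that page."""
    page = 1
    while True:
        body = fetch_page(page)
        yield from body["data"]
        meta = body["meta"]
        # Stop once the pages seen so far cover the reported total
        if meta["page"] * meta["per_page"] >= meta["total"]:
            break
        page += 1
```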
Filtering
Filter list results:
curl "http://localhost:8080/api/incidents?status=open&severity=high"
Common filters:
- `status` - Filter by status
- `severity` - Filter by severity
- `type` - Filter by incident type
- `created_after` - Created after date
- `created_before` - Created before date
Sorting
curl "http://localhost:8080/api/incidents?sort=-created_at"
- Prefix a field with `-` for descending order
- Default: `-created_at` (newest first)
Rate Limiting
API requests are rate limited:
| Endpoint | Limit |
|---|---|
| Read operations | 100/min |
| Write operations | 20/min |
| Triage requests | 10/min |
Rate limit headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1705320000
Next Steps
- Authentication - API authentication
- Incidents - Incident endpoints
- Actions - Action endpoints
- Playbooks - Playbook endpoints
- Webhooks - Webhook integration
API Authentication
Authenticate with the Triage Warden API.
API Keys
Creating an API Key
# Via CLI
tw-cli api-key create --name "automation-script" --scopes read,write
# Output:
# API Key created successfully
# Key: tw_abc123_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# WARNING: Store this key securely. It cannot be retrieved again.
Using API Keys
Include in the Authorization header:
curl -H "Authorization: Bearer tw_abc123_secretkey" \
http://localhost:8080/api/incidents
API Key Scopes
| Scope | Permissions |
|---|---|
| read | Read incidents, actions, playbooks |
| write | Create/update incidents, execute actions |
| admin | User management, system configuration |
Managing API Keys
# List keys
tw-cli api-key list
# Revoke key
tw-cli api-key revoke tw_abc123
# Rotate key
tw-cli api-key rotate tw_abc123
Session Authentication
For web dashboard access:
Login
curl -X POST http://localhost:8080/login \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "username=analyst&password=secret&csrf_token=xxx" \
-c cookies.txt
Using Session
curl -b cookies.txt http://localhost:8080/api/incidents
Logout
curl -X POST http://localhost:8080/logout -b cookies.txt
CSRF Protection
State-changing requests require CSRF tokens:
- Get token from login page or API
- Include in request header or body
# Header
curl -X POST http://localhost:8080/api/incidents \
-H "X-CSRF-Token: abc123" \
-b cookies.txt \
-d '{"type": "phishing"}'
# Form body
curl -X POST http://localhost:8080/api/incidents \
-d "csrf_token=abc123&type=phishing" \
-b cookies.txt
Webhook Authentication
Webhooks use HMAC signatures:
Configuring Webhook Secret
tw-cli webhook add email-gateway \
--url http://localhost:8080/api/webhooks/email-gateway \
--secret "your-secret-key"
Verifying Signatures
Triage Warden validates the X-Webhook-Signature header:
X-Webhook-Signature: sha256=abc123...
Signature is computed as:
HMAC-SHA256(secret, timestamp + "." + body)
Signature Verification Example
import hmac
import hashlib
def verify_signature(payload: bytes, signature: str, secret: str, timestamp: str) -> bool:
expected = hmac.new(
secret.encode(),
f"{timestamp}.{payload.decode()}".encode(),
hashlib.sha256
).hexdigest()
return hmac.compare_digest(f"sha256={expected}", signature)
Service Accounts
For automated systems:
# Create service account
tw-cli user create \
--username automation-bot \
--role analyst \
--service-account
# Generate API key for service account
tw-cli api-key create \
--user automation-bot \
--name "ci-cd-integration" \
--scopes read,write
Security Best Practices
- Rotate keys regularly - Set up automated rotation
- Use minimal scopes - Only grant necessary permissions
- Secure storage - Use secret managers, not code
- Monitor usage - Review audit logs for suspicious activity
- IP allowlisting - Restrict API access by IP (optional)
# Enable IP allowlist
tw-cli config set api.allowed_ips "10.0.0.0/8,192.168.1.0/24"
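The allowlist check itself is plain CIDR matching, reproducible with the standard library's ipaddress module. A minimal sketch: the networks match the example above, and ip_allowed is a hypothetical helper, not part of the product.

```python
import ipaddress

# Same CIDR ranges as the tw-cli example above
ALLOWED_NETWORKS = [
    ipaddress.ip_network(n)
    for n in "10.0.0.0/8,192.168.1.0/24".split(",")
]

def ip_allowed(client_ip: str) -> bool:
    # True if the client address falls inside any allowed network
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```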
Error Responses
401 Unauthorized
Missing or invalid credentials:
{
"error": {
"code": "unauthorized",
"message": "Invalid or missing authentication"
}
}
403 Forbidden
Valid credentials but insufficient permissions:
{
"error": {
"code": "forbidden",
"message": "Insufficient permissions for this operation"
}
}
Incidents API
Create, read, update, and manage security incidents.
List Incidents
GET /api/incidents
Query Parameters
| Parameter | Type | Description |
|---|---|---|
status | string | Filter by status (open, triaged, resolved) |
severity | string | Filter by severity (low, medium, high, critical) |
type | string | Filter by incident type |
created_after | datetime | Created after timestamp |
created_before | datetime | Created before timestamp |
page | integer | Page number |
per_page | integer | Items per page |
sort | string | Sort field (prefix - for desc) |
Example
curl "http://localhost:8080/api/incidents?status=open&severity=high&per_page=10" \
-H "Authorization: Bearer tw_xxx"
Response
{
"data": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"incident_number": "INC-2024-0001",
"incident_type": "phishing",
"severity": "high",
"status": "open",
"source": "email_gateway",
"created_at": "2024-01-15T10:30:00Z",
"updated_at": "2024-01-15T10:30:00Z"
}
],
"meta": {
"page": 1,
"per_page": 10,
"total": 42
}
}
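Walking every page of the listing is a matter of following the meta block until page × per_page reaches total. A sketch under the assumption that fetch_page wraps your HTTP call and returns the parsed JSON shown above; the stub below stands in for it.

```python
def iter_incidents(fetch_page, per_page=10):
    # Yield incidents page by page until meta says we've seen them all
    page = 1
    while True:
        resp = fetch_page(page, per_page)
        yield from resp["data"]
        meta = resp["meta"]
        if meta["page"] * meta["per_page"] >= meta["total"]:
            break
        page += 1

# Stub simulating three incidents served two per page
def fake_fetch(page, per_page):
    items = [{"id": i} for i in range(3)]
    start = (page - 1) * per_page
    return {
        "data": items[start:start + per_page],
        "meta": {"page": page, "per_page": per_page, "total": len(items)},
    }

ids = [inc["id"] for inc in iter_incidents(fake_fetch, per_page=2)]
```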
Get Incident
GET /api/incidents/:id
Example
curl "http://localhost:8080/api/incidents/550e8400-e29b-41d4-a716-446655440000" \
-H "Authorization: Bearer tw_xxx"
Response
{
"data": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"incident_number": "INC-2024-0001",
"incident_type": "phishing",
"severity": "high",
"status": "triaged",
"source": "email_gateway",
"raw_data": {
"message_id": "AAMkAGI2...",
"sender": "[email protected]",
"subject": "Urgent: Update Account"
},
"verdict": {
"classification": "malicious",
"confidence": 0.92,
"category": "phishing",
"reasoning": "Multiple phishing indicators..."
},
"recommended_actions": [
"quarantine_email",
"block_sender"
],
"created_at": "2024-01-15T10:30:00Z",
"updated_at": "2024-01-15T10:35:00Z",
"triaged_at": "2024-01-15T10:35:00Z"
}
}
Create Incident
POST /api/incidents
Request Body
{
"incident_type": "phishing",
"source": "email_gateway",
"severity": "medium",
"raw_data": {
"message_id": "AAMkAGI2...",
"sender": "[email protected]",
"recipient": "[email protected]",
"subject": "Important Document",
"received_at": "2024-01-15T10:00:00Z"
}
}
Example
curl -X POST "http://localhost:8080/api/incidents" \
-H "Authorization: Bearer tw_xxx" \
-H "Content-Type: application/json" \
-d '{
"incident_type": "phishing",
"source": "email_gateway",
"severity": "medium",
"raw_data": {...}
}'
Response
{
"data": {
"id": "550e8400-e29b-41d4-a716-446655440001",
"incident_number": "INC-2024-0002",
"status": "open",
"created_at": "2024-01-15T11:00:00Z"
}
}
Update Incident
PUT /api/incidents/:id
Request Body
{
"severity": "high",
"status": "resolved",
"resolution": "False positive - legitimate vendor email"
}
Delete Incident
DELETE /api/incidents/:id
Note: Requires admin role.
Run Triage
POST /api/incidents/:id/triage
Trigger AI triage on an incident.
Request Body (Optional)
{
"playbook": "custom_phishing",
"force": true
}
Response
{
"data": {
"triage_id": "triage-abc123",
"status": "completed",
"verdict": {
"classification": "malicious",
"confidence": 0.92
},
"duration_ms": 45000
}
}
Execute Action
POST /api/incidents/:id/actions
Execute an action on an incident.
Request Body
{
"action": "quarantine_email",
"parameters": {
"message_id": "AAMkAGI2...",
"reason": "Phishing detected"
}
}
Response (Immediate Execution)
{
"data": {
"action_id": "act-abc123",
"status": "completed",
"result": {
"success": true,
"message": "Email quarantined successfully"
}
}
}
Response (Pending Approval)
{
"data": {
"action_id": "act-abc123",
"status": "pending_approval",
"approval_level": "senior",
"message": "Action requires senior analyst approval"
}
}
Get Incident Actions
GET /api/incidents/:id/actions
List all actions for an incident.
Response
{
"data": [
{
"id": "act-abc123",
"action_type": "quarantine_email",
"status": "completed",
"executed_at": "2024-01-15T10:40:00Z",
"executed_by": "system"
},
{
"id": "act-def456",
"action_type": "block_sender",
"status": "pending_approval",
"approval_level": "analyst",
"requested_at": "2024-01-15T10:41:00Z"
}
]
}
Actions API
Manage action execution and approvals.
List Actions
GET /api/actions
Query Parameters
| Parameter | Type | Description |
|---|---|---|
status | string | pending, pending_approval, completed, failed |
action_type | string | Filter by action type |
incident_id | uuid | Filter by incident |
approval_level | string | analyst, senior, manager |
Example
curl "http://localhost:8080/api/actions?status=pending_approval" \
-H "Authorization: Bearer tw_xxx"
Response
{
"data": [
{
"id": "act-abc123",
"incident_id": "550e8400-e29b-41d4-a716-446655440000",
"action_type": "isolate_host",
"status": "pending_approval",
"approval_level": "senior",
"parameters": {
"host_id": "aid:xyz789",
"reason": "Malware detected"
},
"requested_by": "triage_agent",
"requested_at": "2024-01-15T10:45:00Z"
}
]
}
Get Action
GET /api/actions/:id
Response
{
"data": {
"id": "act-abc123",
"incident_id": "550e8400-e29b-41d4-a716-446655440000",
"action_type": "isolate_host",
"status": "pending_approval",
"approval_level": "senior",
"parameters": {
"host_id": "aid:xyz789",
"reason": "Malware detected"
},
"requested_by": "triage_agent",
"requested_at": "2024-01-15T10:45:00Z",
"incident": {
"incident_number": "INC-2024-0001",
"incident_type": "malware",
"severity": "high"
}
}
}
Approve Action
POST /api/actions/:id/approve
Request Body
{
"comment": "Verified threat, approved for isolation"
}
Response
{
"data": {
"id": "act-abc123",
"status": "completed",
"approved_by": "[email protected]",
"approved_at": "2024-01-15T11:00:00Z",
"result": {
"success": true,
"message": "Host isolated successfully"
}
}
}
Errors
403 Forbidden - Insufficient approval level:
{
"error": {
"code": "insufficient_approval_level",
"message": "This action requires senior analyst approval",
"required_level": "senior",
"your_level": "analyst"
}
}
Reject Action
POST /api/actions/:id/reject
Request Body
{
"reason": "False positive - user confirmed legitimate activity"
}
Response
{
"data": {
"id": "act-abc123",
"status": "rejected",
"rejected_by": "[email protected]",
"rejected_at": "2024-01-15T11:00:00Z",
"rejection_reason": "False positive - user confirmed legitimate activity"
}
}
Execute Action Directly
POST /api/actions/execute
Execute an action without associating with an incident.
Request Body
{
"action": "block_sender",
"parameters": {
"sender": "[email protected]"
}
}
Response
{
"data": {
"action_id": "act-ghi789",
"status": "completed",
"result": {
"success": true,
"message": "Sender blocked"
}
}
}
Get Action Types
GET /api/actions/types
List all available action types.
Response
{
"data": [
{
"name": "quarantine_email",
"description": "Move email to quarantine",
"category": "email",
"supports_rollback": true,
"parameters": [
{
"name": "message_id",
"type": "string",
"required": true
},
{
"name": "reason",
"type": "string",
"required": false
}
]
},
{
"name": "isolate_host",
"description": "Network-isolate a host",
"category": "endpoint",
"supports_rollback": true,
"default_approval_level": "senior",
"parameters": [...]
}
]
}
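The parameter schemas in this response can drive client-side validation before submitting an action. A hypothetical helper, sketched against the quarantine_email schema above:

```python
def validate_parameters(action_schema, params):
    # Return a list of problems (empty list means the parameters are valid)
    problems = []
    known = {p["name"] for p in action_schema["parameters"]}
    for p in action_schema["parameters"]:
        if p["required"] and p["name"] not in params:
            problems.append(f"missing required parameter: {p['name']}")
    for name in params:
        if name not in known:
            problems.append(f"unknown parameter: {name}")
    return problems

# Schema as returned by GET /api/actions/types
quarantine = {
    "name": "quarantine_email",
    "parameters": [
        {"name": "message_id", "type": "string", "required": True},
        {"name": "reason", "type": "string", "required": False},
    ],
}
```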
Rollback Action
POST /api/actions/:id/rollback
Rollback a previously executed action.
Request Body
{
"reason": "False positive confirmed"
}
Response
{
"data": {
"rollback_action_id": "act-jkl012",
"original_action_id": "act-abc123",
"status": "completed",
"result": {
"success": true,
"message": "Host unisolated successfully"
}
}
}
Errors
400 Bad Request - Action doesn't support rollback:
{
"error": {
"code": "rollback_not_supported",
"message": "Action type 'notify_user' does not support rollback"
}
}
Playbooks API
Manage and execute playbooks.
List Playbooks
GET /api/playbooks
Response
{
"data": [
{
"id": "pb-abc123",
"name": "phishing_triage",
"description": "Automated phishing email analysis",
"version": "2.0",
"enabled": true,
"triggers": {
"incident_type": "phishing",
"auto_run": true
},
"created_at": "2024-01-01T00:00:00Z",
"updated_at": "2024-01-10T00:00:00Z"
}
]
}
Get Playbook
GET /api/playbooks/:id
Response
{
"data": {
"id": "pb-abc123",
"name": "phishing_triage",
"description": "Automated phishing email analysis",
"version": "2.0",
"enabled": true,
"triggers": {
"incident_type": "phishing",
"auto_run": true
},
"variables": {
"quarantine_threshold": 0.7
},
"steps": [
{
"name": "Parse Email",
"action": "parse_email",
"parameters": {
"raw_email": "{{ incident.raw_data.raw_email }}"
},
"output": "parsed"
}
],
"created_at": "2024-01-01T00:00:00Z",
"updated_at": "2024-01-10T00:00:00Z"
}
}
Create Playbook
POST /api/playbooks
Request Body
{
"name": "custom_playbook",
"description": "My custom investigation playbook",
"triggers": {
"incident_type": "phishing",
"auto_run": false
},
"steps": [
{
"name": "Parse Email",
"action": "parse_email",
"output": "parsed"
}
]
}
Response
{
"data": {
"id": "pb-def456",
"name": "custom_playbook",
"version": "1.0",
"created_at": "2024-01-15T12:00:00Z"
}
}
Update Playbook
PUT /api/playbooks/:id
Request Body
{
"description": "Updated description",
"enabled": false
}
Delete Playbook
DELETE /api/playbooks/:id
Note: Built-in playbooks cannot be deleted.
Run Playbook
POST /api/playbooks/:id/run
Execute a playbook on an incident.
Request Body
{
"incident_id": "550e8400-e29b-41d4-a716-446655440000",
"variables": {
"quarantine_threshold": 0.9
}
}
Response
{
"data": {
"execution_id": "exec-abc123",
"playbook_id": "pb-abc123",
"incident_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"started_at": "2024-01-15T12:00:00Z",
"completed_at": "2024-01-15T12:00:45Z",
"steps_completed": 5,
"steps_total": 5,
"verdict": {
"classification": "malicious",
"confidence": 0.92
}
}
}
Get Playbook Executions
GET /api/playbooks/:id/executions
Response
{
"data": [
{
"execution_id": "exec-abc123",
"incident_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"duration_ms": 45000,
"started_at": "2024-01-15T12:00:00Z"
}
]
}
Validate Playbook
POST /api/playbooks/validate
Validate playbook YAML without creating it.
Request Body
{
"content": "name: test\nsteps:\n - action: parse_email"
}
Response (Valid)
{
"data": {
"valid": true,
"warnings": []
}
}
Response (Invalid)
{
"data": {
"valid": false,
"errors": [
{
"line": 3,
"message": "Unknown action: invalid_action"
}
]
}
}
Export Playbook
GET /api/playbooks/:id/export
Download playbook as YAML file.
Response
name: phishing_triage
description: Automated phishing email analysis
version: "2.0"
...
Webhooks API
Receive events from external security tools.
Endpoint
POST /api/webhooks/:source
Where :source identifies the sending system (e.g., email-gateway, edr, siem).
Authentication
Webhooks are authenticated via HMAC signatures:
X-Webhook-Signature: sha256=abc123...
X-Webhook-Timestamp: 1705320000
Registering Webhook Sources
Via CLI
tw-cli webhook add email-gateway \
--secret "your-secret-key" \
--auto-triage true \
--playbook phishing_triage
Via API
curl -X POST "http://localhost:8080/api/webhooks" \
-H "Authorization: Bearer tw_xxx" \
-d '{
"source": "email-gateway",
"secret": "your-secret-key",
"auto_triage": true,
"playbook": "phishing_triage"
}'
Payload Formats
Generic Format
{
"event_type": "security_alert",
"timestamp": "2024-01-15T10:00:00Z",
"source": "email-gateway",
"data": {
"alert_id": "alert-123",
"severity": "high",
"details": {...}
}
}
Microsoft Defender for Office 365
{
"eventType": "PhishingEmail",
"id": "AAMkAGI2...",
"creationTime": "2024-01-15T10:00:00Z",
"severity": "high",
"category": "Phish",
"entityType": "Email",
"data": {
"sender": "[email protected]",
"subject": "Urgent Action Required",
"recipients": ["[email protected]"]
}
}
CrowdStrike Falcon
{
"metadata": {
"eventType": "DetectionSummaryEvent",
"eventCreationTime": 1705320000000
},
"event": {
"DetectId": "ldt:abc123",
"Severity": 4,
"HostnameField": "WORKSTATION-01",
"DetectName": "Malicious File Detected"
}
}
Splunk Alert
{
"result": {
"host": "server-01",
"source": "WinEventLog:Security",
"sourcetype": "WinEventLog",
"_raw": "...",
"EventCode": "4625"
},
"search_name": "Failed Login Alert",
"trigger_time": 1705320000
}
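Each source reports severity differently, so ingestion typically normalizes to one scale. A sketch of such a normalizer; the CrowdStrike numeric mapping (1-2 low, 3 medium, 4 high, 5 critical) and the Splunk default are illustrative assumptions, not documented contracts.

```python
def normalize_severity(source: str, payload: dict) -> str:
    # Map a source-specific payload onto low/medium/high/critical
    if source == "defender":
        return payload["severity"]  # already a label
    if source == "crowdstrike":
        n = payload["event"]["Severity"]
        return {1: "low", 2: "low", 3: "medium", 4: "high", 5: "critical"}[n]
    if source == "splunk":
        # Splunk alert payloads carry no severity field; default it and
        # let the playbook raise it after enrichment
        return "medium"
    # Generic format carries severity under data
    return payload.get("data", {}).get("severity", "medium")
```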
Response
Success
{
"status": "accepted",
"incident_id": "550e8400-e29b-41d4-a716-446655440000",
"incident_number": "INC-2024-0001"
}
Queued for Processing
{
"status": "queued",
"queue_id": "queue-abc123",
"message": "Event queued for processing"
}
Configuring Auto-Triage
When auto_triage is enabled, incidents created from webhooks are automatically triaged:
# webhook_config.yaml
sources:
email-gateway:
secret: "${EMAIL_GATEWAY_SECRET}"
auto_triage: true
playbook: phishing_triage
severity_mapping:
critical: critical
high: high
medium: medium
low: low
edr:
secret: "${EDR_SECRET}"
auto_triage: true
playbook: malware_triage
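Applied to an incoming event, the severity_mapping and auto_triage settings above behave roughly like this. A sketch: route_event is illustrative, and defaulting unmapped severities to medium is an assumption.

```python
# Mirrors the email-gateway entry from webhook_config.yaml above
webhook_config = {
    "email-gateway": {
        "auto_triage": True,
        "playbook": "phishing_triage",
        "severity_mapping": {
            "critical": "critical", "high": "high",
            "medium": "medium", "low": "low",
        },
    },
}

def route_event(source: str, event_severity: str):
    # Translate severity and decide whether a playbook runs automatically
    cfg = webhook_config[source]
    severity = cfg["severity_mapping"].get(event_severity, "medium")
    playbook = cfg["playbook"] if cfg["auto_triage"] else None
    return severity, playbook
```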
Testing Webhooks
Send Test Event
# Generate signature
TIMESTAMP=$(date +%s)
BODY='{"event_type":"test","data":{}}'
SIGNATURE=$(echo -n "${TIMESTAMP}.${BODY}" | openssl dgst -sha256 -hmac "your-secret" | awk '{print $NF}')
# Send request
curl -X POST "http://localhost:8080/api/webhooks/email-gateway" \
-H "Content-Type: application/json" \
-H "X-Webhook-Signature: sha256=${SIGNATURE}" \
-H "X-Webhook-Timestamp: ${TIMESTAMP}" \
-d "${BODY}"
Verify Configuration
tw-cli webhook test email-gateway
Error Handling
Invalid Signature
{
"error": {
"code": "invalid_signature",
"message": "Webhook signature verification failed"
}
}
Unknown Source
{
"error": {
"code": "unknown_source",
"message": "Webhook source 'unknown' is not registered"
}
}
Replay Attack
{
"error": {
"code": "timestamp_expired",
"message": "Webhook timestamp is too old (>5 minutes)"
}
}
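The replay window is plain arithmetic on the X-Webhook-Timestamp header; the 5-minute cutoff matches the error above. A sketch (timestamp_fresh is a hypothetical helper):

```python
import time

REPLAY_WINDOW_SECONDS = 300  # server rejects timestamps older than 5 minutes

def timestamp_fresh(header_timestamp: str, now=None):
    # Reject webhooks whose X-Webhook-Timestamp falls outside the window
    now = time.time() if now is None else now
    return abs(now - int(header_timestamp)) <= REPLAY_WINDOW_SECONDS
```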
Monitoring Webhooks
Metrics
# Webhook receive rate
rate(webhook_received_total[5m])
# Error rate by source
rate(webhook_errors_total[5m])
Logs
tw-cli logs --filter webhook --tail 100
API Error Codes
All API errors return a consistent JSON structure with an error code, message, and optional details.
Error Response Format
{
"code": "ERROR_CODE",
"message": "Human-readable error message",
"details": { ... },
"request_id": "optional-request-id"
}
Error Codes Reference
Authentication Errors (4xx)
| Code | HTTP Status | Description | Resolution |
|---|---|---|---|
UNAUTHORIZED | 401 | Missing or invalid authentication | Provide valid API key or session cookie |
INVALID_CREDENTIALS | 401 | Invalid username or password | Check login credentials |
SESSION_EXPIRED | 401 | Session has expired | Re-authenticate to get new session |
INVALID_SIGNATURE | 401 | Webhook signature validation failed | Verify webhook secret configuration |
FORBIDDEN | 403 | Authenticated but not authorized | Check user role and permissions |
CSRF_VALIDATION_FAILED | 403 | CSRF token missing or invalid | Include valid CSRF token in request |
ACCOUNT_DISABLED | 403 | User account is disabled | Contact administrator |
Client Errors (4xx)
| Code | HTTP Status | Description | Resolution |
|---|---|---|---|
NOT_FOUND | 404 | Resource not found | Verify resource ID exists |
BAD_REQUEST | 400 | Malformed request | Check request syntax and parameters |
CONFLICT | 409 | Resource conflict (e.g., already exists) | Action already completed or duplicate resource |
UNPROCESSABLE_ENTITY | 422 | Semantic error in request | Check request logic and data validity |
VALIDATION_ERROR | 422 | Field validation failed | See details for field-specific errors |
RATE_LIMIT_EXCEEDED | 429 | Too many requests | Wait and retry with exponential backoff |
Server Errors (5xx)
| Code | HTTP Status | Description | Resolution |
|---|---|---|---|
INTERNAL_ERROR | 500 | Unexpected server error | Check server logs, contact support |
DATABASE_ERROR | 500 | Database operation failed | Check database connectivity |
SERVICE_UNAVAILABLE | 503 | Service temporarily unavailable | Retry later |
Detailed Error Examples
Validation Error
When field validation fails, the response includes detailed field-level errors:
{
"code": "VALIDATION_ERROR",
"message": "Validation failed",
"details": {
"name": {
"code": "required",
"message": "Name is required"
},
"email": {
"code": "invalid_format",
"message": "Invalid email format"
}
}
}
Not Found Error
{
"code": "NOT_FOUND",
"message": "Not found: Incident 550e8400-e29b-41d4-a716-446655440000 not found"
}
Conflict Error
Returned when attempting an action that conflicts with current state:
{
"code": "CONFLICT",
"message": "Conflict: Action is not pending approval (current status: Approved)"
}
Rate Limit Error
{
"code": "RATE_LIMIT_EXCEEDED",
"message": "Rate limit exceeded"
}
When present, the Retry-After response header indicates how long to wait before retrying.
Unauthorized Error
{
"code": "UNAUTHORIZED",
"message": "Unauthorized: No authentication provided"
}
Error Handling Best Practices
Client Implementation
import time

import requests

def handle_api_error(response):
    # retry_request, refresh_session, log_error, and ServerError are
    # placeholders for your own client plumbing
    error = response.json()
    code = error.get('code')
    if code == 'RATE_LIMIT_EXCEEDED':
        # Honor Retry-After, then retry with backoff
        retry_after = int(response.headers.get('Retry-After', 60))
        time.sleep(retry_after)
        return retry_request()
    elif code == 'SESSION_EXPIRED':
        # Re-authenticate and retry
        refresh_session()
        return retry_request()
    elif code == 'VALIDATION_ERROR':
        # Surface field-specific errors from details
        for field, details in error.get('details', {}).items():
            print(f"Field '{field}': {details['message']}")
    elif code in ('INTERNAL_ERROR', 'DATABASE_ERROR'):
        # Log and alert on server errors
        log_error(error)
        raise ServerError(error['message'])
Retry Strategy
For transient errors (5xx, RATE_LIMIT_EXCEEDED), implement exponential backoff:
import time
import random

def retry_with_backoff(func, max_retries=3):
    # RateLimitError and ServiceUnavailableError are placeholder exception
    # types raised by your client wrapper
    for attempt in range(max_retries):
        try:
            return func()
        except (RateLimitError, ServiceUnavailableError):
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: 1-2s, 2-3s, 4-5s, ...
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
HTTP Status Code Summary
| Status | Meaning | Retryable |
|---|---|---|
| 400 | Bad Request | No |
| 401 | Unauthorized | After re-auth |
| 403 | Forbidden | No |
| 404 | Not Found | No |
| 409 | Conflict | No |
| 422 | Unprocessable Entity | After fixing request |
| 429 | Rate Limited | Yes, with backoff |
| 500 | Internal Error | Yes, with caution |
| 503 | Service Unavailable | Yes, with backoff |
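The table collapses into a small client-side decision function. A hypothetical helper mirroring the Retryable column:

```python
def retry_decision(status: int) -> str:
    # Map an HTTP status from the summary table to a client action
    if status in (429, 503):
        return "retry with backoff"
    if status == 500:
        return "retry with caution"
    if status == 401:
        return "re-authenticate, then retry"
    if status == 422:
        return "fix the request, then retry"
    return "do not retry"
```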
Configuration Guide
Complete guides for configuring Triage Warden.
Initial Setup
After installation, configure Triage Warden in this order:
- Environment Variables - Set required environment variables
- Connectors - Connect to your security tools
- Notifications - Set up alert channels
- Playbooks - Create automation workflows
- Policies - Define approval and safety rules
- SSO Integrations - Configure enterprise identity providers
Quick Configuration
First Run
After starting Triage Warden, log in with the default credentials:
- Username: admin
- Password: admin
Important: Change the default password immediately!
Essential Settings
Navigate to Settings and configure:
- General
  - Organization name
  - Timezone
  - Operation mode (Assisted → Supervised → Autonomous)
- AI/LLM
  - Select provider (Anthropic, OpenAI, or Local)
  - Enter API key
  - Choose model
- Connectors (at minimum)
  - Threat intelligence (VirusTotal recommended)
  - Your primary SIEM or alert source
- Notifications
  - At least one channel for critical alerts
Configuration Methods
Web UI (Recommended)
Most settings can be configured through the web dashboard at Settings.
Pros:
- User-friendly interface
- Validation feedback
- Immediate effect
Environment Variables
For deployment configuration and secrets:
# Required
DATABASE_URL=postgres://...
TW_ENCRYPTION_KEY=...
# Optional overrides
TW_LLM_PROVIDER=anthropic
TW_LLM_MODEL=claude-3-sonnet
See Environment Variables Reference for full list.
Configuration Files
For complex configurations:
# config/default.yaml
server:
bind_address: "0.0.0.0:8080"
guardrails:
max_actions_per_incident: 10
blocked_actions: []
Configuration Hierarchy
Configuration is loaded in this order (later overrides earlier):
1. Built-in defaults
↓
2. config/default.yaml
↓
3. config/{environment}.yaml
↓
4. Environment variables
↓
5. Database settings (via UI)
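The override behavior can be sketched as a layered dictionary merge, assuming nested sections merge key-by-key while scalar values replace (an illustration of the hierarchy, not the actual loader):

```python
def merge_config(*layers):
    # Later layers override earlier ones, matching the hierarchy above
    result = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = merge_config(result[key], value)
            else:
                result[key] = value
    return result

defaults = {"server": {"bind_address": "0.0.0.0:8080"}, "log": "info"}
env_overrides = {"log": "debug"}
merged = merge_config(defaults, env_overrides)
```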
Validation
Triage Warden validates configuration at startup:
# Validate without starting
triage-warden serve --validate-only
# Check specific configuration
triage-warden config check
Common Validation Errors
| Error | Solution |
|---|---|
Missing TW_ENCRYPTION_KEY | Set encryption key environment variable |
Invalid DATABASE_URL | Check connection string format |
LLM API key required | Set API key or disable LLM features |
Guardrails file not found | Create config/guardrails.yaml |
Backup Configuration
Before making changes, backup current settings:
# Export settings via API
curl -H "Authorization: Bearer $API_KEY" \
http://localhost:8080/api/settings/export > settings-backup.json
# Restore settings
curl -X POST -H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d @settings-backup.json \
http://localhost:8080/api/settings/import
Next Steps
Environment Variables Reference
Complete reference of all environment variables for Triage Warden.
Required Variables
These must be set for Triage Warden to start.
Database
| Variable | Description | Example |
|---|---|---|
DATABASE_URL | PostgreSQL connection string | postgres://user:pass@localhost:5432/triage_warden |
Connection String Format:
postgres://username:password@hostname:port/database?sslmode=require
SSL Modes:
- disable - No SSL (development only)
- require - SSL required, no certificate verification
- verify-ca - Verify server certificate against CA
- verify-full - Verify server certificate and hostname
Security
| Variable | Description | Example |
|---|---|---|
TW_ENCRYPTION_KEY | Credential encryption key (32 bytes, base64) | K7gNU3sdo+OL0wNhqoVW... |
TW_JWT_SECRET | JWT signing secret (min 32 characters) | your-very-long-jwt-secret-here |
TW_SESSION_SECRET | Session encryption secret | your-session-secret-here |
Generating Keys:
# Encryption key (32 bytes, base64)
openssl rand -base64 32
# JWT/Session secret (hex)
openssl rand -hex 32
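The same keys can be generated from Python's secrets module when openssl isn't handy:

```python
import base64
import secrets

# 32 random bytes, base64-encoded -- equivalent to `openssl rand -base64 32`
encryption_key = base64.b64encode(secrets.token_bytes(32)).decode()

# 32 random bytes as hex -- equivalent to `openssl rand -hex 32`
jwt_secret = secrets.token_hex(32)
```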
Server Configuration
| Variable | Description | Default |
|---|---|---|
TW_BIND_ADDRESS | Server bind address | 0.0.0.0:8080 |
TW_BASE_URL | Public URL for the application | http://localhost:8080 |
TW_TRUSTED_PROXIES | Comma-separated trusted proxy IPs | None |
TW_MAX_REQUEST_SIZE | Maximum request body size | 10MB |
TW_REQUEST_TIMEOUT | Request timeout in seconds | 30 |
Example:
TW_BIND_ADDRESS=0.0.0.0:8080
TW_BASE_URL=https://triage.company.com
TW_TRUSTED_PROXIES=10.0.0.0/8,172.16.0.0/12
Database Configuration
| Variable | Description | Default |
|---|---|---|
DATABASE_URL | Connection string | Required |
DATABASE_MAX_CONNECTIONS | Maximum pool connections | 10 |
DATABASE_MIN_CONNECTIONS | Minimum pool connections | 1 |
DATABASE_CONNECT_TIMEOUT | Connection timeout (seconds) | 30 |
DATABASE_IDLE_TIMEOUT | Idle connection timeout (seconds) | 600 |
DATABASE_MAX_LIFETIME | Max connection lifetime (seconds) | 1800 |
High-Traffic Configuration:
DATABASE_MAX_CONNECTIONS=50
DATABASE_MIN_CONNECTIONS=5
DATABASE_IDLE_TIMEOUT=300
Authentication
| Variable | Description | Default |
|---|---|---|
TW_JWT_SECRET | JWT signing secret | Required |
TW_JWT_EXPIRY | JWT token expiry | 24h |
TW_SESSION_SECRET | Session encryption key | Required |
TW_SESSION_EXPIRY | Session duration | 7d |
TW_CSRF_ENABLED | Enable CSRF protection | true |
TW_COOKIE_SECURE | Require HTTPS for cookies | false |
TW_COOKIE_SAME_SITE | SameSite cookie policy | lax |
Production Settings:
TW_COOKIE_SECURE=true
TW_COOKIE_SAME_SITE=strict
TW_SESSION_EXPIRY=1d
LLM Configuration
Provider Selection
| Variable | Description | Default |
|---|---|---|
TW_LLM_PROVIDER | LLM provider | openai |
TW_LLM_MODEL | Model name | gpt-4-turbo |
TW_LLM_ENABLED | Enable LLM features | true |
Valid Providers: openai, anthropic, azure, local
API Keys
| Variable | Description |
|---|---|
OPENAI_API_KEY | OpenAI API key |
ANTHROPIC_API_KEY | Anthropic API key |
AZURE_OPENAI_API_KEY | Azure OpenAI API key |
AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint URL |
Model Parameters
| Variable | Description | Default |
|---|---|---|
TW_LLM_TEMPERATURE | Response randomness (0.0-2.0) | 0.2 |
TW_LLM_MAX_TOKENS | Maximum response tokens | 4096 |
TW_LLM_TIMEOUT | Request timeout (seconds) | 60 |
Example Configuration:
# Using Anthropic
TW_LLM_PROVIDER=anthropic
TW_LLM_MODEL=claude-3-sonnet-20240229
ANTHROPIC_API_KEY=sk-ant-api03-...
TW_LLM_TEMPERATURE=0.1
TW_LLM_MAX_TOKENS=8192
# Using Azure OpenAI
TW_LLM_PROVIDER=azure
AZURE_OPENAI_API_KEY=your-azure-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
TW_LLM_MODEL=gpt-4-deployment-name
Logging & Observability
| Variable | Description | Default |
|---|---|---|
RUST_LOG | Log level filter | info |
TW_LOG_FORMAT | Log format (json or pretty) | json |
TW_LOG_FILE | Log file path (optional) | None |
Log Levels
# Basic levels
RUST_LOG=info # Info and above
RUST_LOG=debug # Debug and above
RUST_LOG=warn # Warnings and errors only
# Granular control
RUST_LOG=info,triage_warden=debug # Debug for app, info for deps
RUST_LOG=warn,triage_warden::api=debug # Debug specific module
RUST_LOG=info,sqlx=warn,hyper=warn # Quiet noisy dependencies
Metrics & Tracing
| Variable | Description | Default |
|---|---|---|
TW_METRICS_ENABLED | Enable Prometheus metrics | true |
TW_METRICS_PATH | Metrics endpoint path | /metrics |
TW_TRACING_ENABLED | Enable distributed tracing | false |
OTEL_EXPORTER_OTLP_ENDPOINT | OpenTelemetry endpoint | None |
OTEL_SERVICE_NAME | Service name for traces | triage-warden |
Tracing Setup:
TW_TRACING_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=triage-warden-prod
Rate Limiting
| Variable | Description | Default |
|---|---|---|
TW_RATE_LIMIT_ENABLED | Enable rate limiting | true |
TW_RATE_LIMIT_REQUESTS | Requests per window | 100 |
TW_RATE_LIMIT_WINDOW | Rate limit window | 1m |
TW_RATE_LIMIT_BURST | Burst allowance | 20 |
Webhooks
| Variable | Description | Default |
|---|---|---|
TW_WEBHOOK_SECRET | Default webhook signature secret | None |
TW_WEBHOOK_SPLUNK_SECRET | Splunk-specific secret | None |
TW_WEBHOOK_CROWDSTRIKE_SECRET | CrowdStrike-specific secret | None |
TW_WEBHOOK_DEFENDER_SECRET | Defender-specific secret | None |
TW_WEBHOOK_SENTINEL_SECRET | Sentinel-specific secret | None |
CORS Configuration
| Variable | Description | Default |
|---|---|---|
TW_CORS_ENABLED | Enable CORS | true |
TW_CORS_ORIGINS | Allowed origins (comma-separated) | * |
TW_CORS_METHODS | Allowed methods | GET,POST,PUT,DELETE,OPTIONS |
TW_CORS_HEADERS | Allowed headers | * |
TW_CORS_MAX_AGE | Preflight cache duration (seconds) | 86400 |
Production CORS:
TW_CORS_ORIGINS=https://triage.company.com,https://admin.company.com
Feature Flags
| Variable | Description | Default |
|---|---|---|
TW_FEATURE_PLAYBOOKS | Enable playbook execution | true |
TW_FEATURE_AUTO_ENRICH | Enable automatic enrichment | true |
TW_FEATURE_API_KEYS | Enable API key management | true |
Development Variables
Not recommended for production:
| Variable | Description | Default |
|---|---|---|
TW_DEV_MODE | Enable development mode | false |
TW_SEED_DATA | Seed database with test data | false |
TW_DISABLE_AUTH | Disable authentication | false |
Example Configurations
Development
DATABASE_URL=sqlite:./dev.db
TW_ENCRYPTION_KEY=$(openssl rand -base64 32)
TW_JWT_SECRET=dev-jwt-secret-not-for-production
TW_SESSION_SECRET=dev-session-secret
RUST_LOG=debug
TW_LOG_FORMAT=pretty
TW_DEV_MODE=true
Production
# Database
DATABASE_URL=postgres://tw:[email protected]:5432/triage_warden?sslmode=verify-full
DATABASE_MAX_CONNECTIONS=25
# Security
TW_ENCRYPTION_KEY=your-production-encryption-key
TW_JWT_SECRET=your-production-jwt-secret-minimum-32-chars
TW_SESSION_SECRET=your-production-session-secret
TW_COOKIE_SECURE=true
TW_COOKIE_SAME_SITE=strict
# Server
TW_BASE_URL=https://triage.company.com
TW_TRUSTED_PROXIES=10.0.0.0/8
# LLM
TW_LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-api03-...
TW_LLM_MODEL=claude-3-sonnet-20240229
# Logging
RUST_LOG=info
TW_LOG_FORMAT=json
TW_METRICS_ENABLED=true
# Rate limiting
TW_RATE_LIMIT_ENABLED=true
TW_RATE_LIMIT_REQUESTS=200
TW_RATE_LIMIT_WINDOW=1m
Kubernetes
apiVersion: v1
kind: Secret
metadata:
name: triage-warden-secrets
type: Opaque
stringData:
DATABASE_URL: "postgres://user:pass@postgres:5432/triage_warden"
TW_ENCRYPTION_KEY: "base64-encoded-32-byte-key"
TW_JWT_SECRET: "jwt-signing-secret"
TW_SESSION_SECRET: "session-secret"
ANTHROPIC_API_KEY: "sk-ant-..."
---
apiVersion: v1
kind: ConfigMap
metadata:
name: triage-warden-config
data:
TW_BASE_URL: "https://triage.company.com"
TW_LLM_PROVIDER: "anthropic"
TW_LLM_MODEL: "claude-3-sonnet-20240229"
RUST_LOG: "info"
TW_METRICS_ENABLED: "true"
Connector Setup Guide
Step-by-step instructions for configuring each connector type.
Overview
Connectors enable Triage Warden to:
- Ingest alerts from SIEMs and security tools
- Enrich incidents with threat intelligence
- Execute actions like creating tickets or isolating hosts
- Send notifications to communication platforms
Adding a Connector
- Navigate to Settings → Connectors
- Click Add Connector
- Select connector type
- Fill in the required fields
- Click Test Connection to verify
- Click Save
Threat Intelligence Connectors
VirusTotal
Enriches file hashes, URLs, IPs, and domains with reputation data.
Prerequisites:
- VirusTotal account (free or premium)
- API key from virustotal.com/gui/my-apikey
Configuration:
| Field | Value |
|---|---|
| Name | VirusTotal |
| Type | virustotal |
| API Key | Your API key |
| Rate Limit | 4 (free) or 500 (premium) |
Rate Limits:
- Free tier: 4 requests/minute
- Premium: 500+ requests/minute
Verify It Works:
- Create a test incident with a known-bad hash
- Check incident enrichments for VirusTotal data
AlienVault OTX
Open threat intelligence from AlienVault.
Prerequisites:
- OTX account at otx.alienvault.com
- API key from Settings → API Keys
Configuration:
| Field | Value |
|---|---|
| Name | AlienVault OTX |
| Type | alienvault |
| API Key | Your OTX API key |
SIEM Connectors
Splunk
Ingest alerts from Splunk and run queries.
Prerequisites:
- Splunk Enterprise or Cloud
- HTTP Event Collector (HEC) token
- User with search capabilities
Configuration:
| Field | Value |
|---|---|
| Name | Splunk Production |
| Type | splunk |
| Host | https://splunk.company.com:8089 |
| Username | Service account username |
| Password | Service account password |
| App | search (or your app context) |
Setting Up Webhooks:
- In Splunk, create an alert action that sends to webhook
- Configure the webhook URL: https://triage.company.com/api/webhooks/splunk
- Set the webhook secret in the Triage Warden connector config
Elastic Security
Connect to Elastic Security for SIEM alerts.
Prerequisites:
- Elasticsearch 7.x or 8.x
- User with read access to security indices
Configuration:
| Field | Value |
|---|---|
| Name | Elastic SIEM |
| Type | elastic |
| URL | https://elasticsearch.company.com:9200 |
| Username | Service account username |
| Password | Service account password |
| Index Pattern | security-* or .alerts-security.* |
Microsoft Sentinel
Azure Sentinel integration for cloud SIEM.
Prerequisites:
- Azure subscription with Sentinel workspace
- App registration with Log Analytics Reader role
Configuration:
| Field | Value |
|---|---|
| Name | Azure Sentinel |
| Type | sentinel |
| Workspace ID | Log Analytics Workspace ID |
| Tenant ID | Azure AD Tenant ID |
| Client ID | App Registration Client ID |
| Client Secret | App Registration Secret |
Azure Setup:
- Create App Registration in Azure AD
- Grant the Log Analytics Reader role on the Sentinel workspace
- Create a client secret
- Copy IDs and secret to Triage Warden
EDR Connectors
CrowdStrike Falcon
Endpoint detection and host isolation.
Prerequisites:
- CrowdStrike Falcon subscription
- API client with appropriate scopes
Configuration:
| Field | Value |
|---|---|
| Name | CrowdStrike Falcon |
| Type | crowdstrike |
| Region | us-1, us-2, eu-1, or us-gov-1 |
| Client ID | OAuth Client ID |
| Client Secret | OAuth Client Secret |
Required API Scopes:
- `Detections: Read`
- `Hosts: Read, Write` (for isolation)
- `Incidents: Read`
CrowdStrike Setup:
- Go to Support → API Clients and Keys
- Create new API client
- Select required scopes
- Copy Client ID and Secret
Microsoft Defender for Endpoint
MDE integration for alerts and host actions.
Prerequisites:
- Microsoft 365 E5 or Defender for Endpoint license
- App registration with Defender API permissions
Configuration:
| Field | Value |
|---|---|
| Name | Defender for Endpoint |
| Type | defender |
| Tenant ID | Azure AD Tenant ID |
| Client ID | App Registration Client ID |
| Client Secret | App Registration Secret |
Required API Permissions:
- `Alert.Read.All`
- `Machine.Read.All`
- `Machine.Isolate` (for isolation actions)
SentinelOne
SentinelOne EDR integration.
Prerequisites:
- SentinelOne console access
- API token with appropriate permissions
Configuration:
| Field | Value |
|---|---|
| Name | SentinelOne |
| Type | sentinelone |
| Console URL | https://usea1-pax8.sentinelone.net |
| API Token | Your API token |
Ticketing Connectors
Jira
Create and manage security tickets.
Prerequisites:
- Jira Cloud or Server instance
- API token (Cloud) or password (Server)
Configuration:
| Field | Value |
|---|---|
| Name | Jira Security |
| Type | jira |
| URL | https://yourcompany.atlassian.net |
| Email | Your Jira email |
| API Token | API token from Atlassian account |
| Default Project | SEC (your security project key) |
Jira Cloud Setup:
- Go to id.atlassian.com/manage-profile/security/api-tokens
- Create API token
- Use your email as username
Jira Server Setup:
- Use password instead of API token
- Ensure user has project access
ServiceNow
ServiceNow ITSM integration.
Prerequisites:
- ServiceNow instance
- User with incident table access
Configuration:
| Field | Value |
|---|---|
| Name | ServiceNow |
| Type | servicenow |
| Instance URL | https://yourcompany.service-now.com |
| Username | Service account username |
| Password | Service account password |
Identity Connectors
Microsoft 365 / Azure AD
User management and sign-in data.
Prerequisites:
- Azure AD with appropriate licenses
- App registration with Graph API permissions
Configuration:
| Field | Value |
|---|---|
| Name | Microsoft 365 |
| Type | m365 |
| Tenant ID | Azure AD Tenant ID |
| Client ID | App Registration Client ID |
| Client Secret | App Registration Secret |
Required API Permissions:
- `User.Read.All`
- `AuditLog.Read.All`
- `User.RevokeSessions.All` (for user disable)
Google Workspace
Google Workspace user management.
Prerequisites:
- Google Workspace admin access
- Service account with domain-wide delegation
Configuration:
| Field | Value |
|---|---|
| Name | Google Workspace |
| Type | google |
| Service Account JSON | Paste JSON key file contents |
| Domain | company.com |
Google Setup:
- Create service account in Google Cloud Console
- Enable domain-wide delegation
- Add required OAuth scopes in Google Admin
- Download JSON key file
Testing Connectors
After configuration, always test:
- Click Test Connection in connector settings
- Check the response for success/errors
- For ingestion connectors, verify sample data appears
Common Issues
| Error | Solution |
|---|---|
| Connection refused | Check URL and network access |
| 401 Unauthorized | Verify credentials/API key |
| 403 Forbidden | Check permissions/scopes |
| SSL certificate error | Verify the certificate chain; avoid disabling verification outside of testing |
| Rate limited | Reduce request rate or upgrade tier |
Connector Health
Monitor connector health at Settings → Connectors or via API:
curl http://localhost:8080/health/detailed | jq '.components.connectors'
Healthy connectors show status connected. Troubleshoot any showing error or disconnected.
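As a monitoring sketch, unhealthy connectors can be filtered out of the health response programmatically. This assumes the `/health/detailed` response shape implied above, with each connector reporting a `status` field; the exact payload may differ in your deployment:

```python
import json

def unhealthy_connectors(health_json: str) -> list[str]:
    """Return the names of connectors whose status is not 'connected'."""
    health = json.loads(health_json)
    connectors = health.get("components", {}).get("connectors", {})
    return [name for name, info in connectors.items()
            if info.get("status") != "connected"]

# Illustrative response body (not the exact API output)
sample = json.dumps({
    "components": {
        "connectors": {
            "virustotal": {"status": "connected"},
            "splunk": {"status": "error"},
        }
    }
})
print(unhealthy_connectors(sample))  # ['splunk']
```

A script like this can feed a cron job or dashboard that alerts when any connector drifts out of the `connected` state.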
Playbooks Guide
Create effective automated response playbooks.
What is a Playbook?
A playbook is an automated workflow that executes when specific conditions are met. Playbooks contain:
- Trigger - Conditions that start the playbook
- Stages - Ordered groups of steps
- Steps - Individual actions to execute
Creating a Playbook
Via Web UI
- Navigate to Playbooks
- Click Create Playbook
- Enter name and description
- Configure trigger conditions
- Add stages and steps
- Enable and save
Via API
curl -X POST http://localhost:8080/api/playbooks \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Phishing Response",
"description": "Automated response for phishing alerts",
"trigger": {
"type": "incident_created",
"conditions": {
"source": "email_gateway",
"severity": ["high", "critical"]
}
},
"stages": [...]
}'
Trigger Types
incident_created
Fires when a new incident is created.
{
"type": "incident_created",
"conditions": {
"severity": ["high", "critical"],
"source": "crowdstrike",
"title_contains": "malware"
}
}
incident_updated
Fires when an incident is updated.
{
"type": "incident_updated",
"conditions": {
"field": "severity",
"new_value": "critical"
}
}
scheduled
Fires on a schedule, specified in cron format; the example below runs every six hours.
{
"type": "scheduled",
"schedule": "0 */6 * * *"
}
manual
Only triggered manually by user action.
{
"type": "manual"
}
Stages
Stages group steps that should execute together. Configure:
- Name - Descriptive name
- Description - What this stage does
- Parallel - Execute steps in parallel (default: false)
Sequential Execution
{
"stages": [
{
"name": "Enrichment",
"steps": [/* step 1, step 2, step 3 */]
},
{
"name": "Response",
"steps": [/* step 4, step 5 */]
}
]
}
Steps in Enrichment complete before Response starts.
Parallel Execution
{
"stages": [
{
"name": "Gather Intel",
"parallel": true,
"steps": [
{"action": "lookup_hash_virustotal"},
{"action": "lookup_ip_reputation"},
{"action": "lookup_domain_reputation"}
]
}
]
}
All lookups run simultaneously.
Step Types
Enrichment Actions
lookup_hash
Look up file hash reputation.
{
"action": "lookup_hash",
"parameters": {
"hash": "{{ incident.iocs.file_hash }}",
"providers": ["virustotal", "alienvault"]
}
}
lookup_ip
Look up IP address reputation.
{
"action": "lookup_ip",
"parameters": {
"ip": "{{ incident.source_ip }}"
}
}
lookup_domain
Look up domain reputation.
{
"action": "lookup_domain",
"parameters": {
"domain": "{{ incident.domain }}"
}
}
lookup_user
Get user details from identity provider.
{
"action": "lookup_user",
"parameters": {
"email": "{{ incident.user_email }}",
"provider": "m365"
}
}
Containment Actions
isolate_host
Isolate endpoint from network.
{
"action": "isolate_host",
"parameters": {
"hostname": "{{ incident.hostname }}",
"provider": "crowdstrike"
},
"requires_approval": true
}
disable_user
Disable user account.
{
"action": "disable_user",
"parameters": {
"email": "{{ incident.user_email }}",
"provider": "m365"
},
"requires_approval": true
}
block_ip
Block IP address at firewall.
{
"action": "block_ip",
"parameters": {
"ip": "{{ incident.source_ip }}",
"duration": "24h"
},
"requires_approval": true
}
Notification Actions
send_notification
Send alert to notification channel.
{
"action": "send_notification",
"parameters": {
"channel": "slack-security",
"message": "Critical incident: {{ incident.title }}"
}
}
create_ticket
Create ticket in ticketing system.
{
"action": "create_ticket",
"parameters": {
"provider": "jira",
"project": "SEC",
"type": "Incident",
"title": "{{ incident.title }}",
"description": "{{ incident.description }}"
}
}
Analysis Actions
analyze_with_llm
Run AI analysis on incident.
{
"action": "analyze_with_llm",
"parameters": {
"prompt": "Analyze this security incident and provide recommendations",
"include_enrichments": true
}
}
Utility Actions
wait
Pause execution for specified duration.
{
"action": "wait",
"parameters": {
"duration": "5m"
}
}
set_severity
Update incident severity.
{
"action": "set_severity",
"parameters": {
"severity": "critical"
}
}
add_comment
Add comment to incident.
{
"action": "add_comment",
"parameters": {
"comment": "Automated enrichment complete. Found {{ enrichments.virustotal.positives }} detections."
}
}
Variables and Templates
Use Jinja2-style templates to reference incident data:
Available Variables
| Variable | Description |
|---|---|
| `{{ incident.id }}` | Incident UUID |
| `{{ incident.title }}` | Incident title |
| `{{ incident.severity }}` | Severity level |
| `{{ incident.source }}` | Alert source |
| `{{ incident.description }}` | Full description |
| `{{ incident.hostname }}` | Affected hostname |
| `{{ incident.username }}` | Affected username |
| `{{ incident.source_ip }}` | Source IP address |
| `{{ incident.iocs.* }}` | Extracted IOCs |
| `{{ enrichments.* }}` | Enrichment results |
| `{{ previous_step.output }}` | Previous step output |
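For intuition, placeholder resolution works roughly like the following sketch. This is a simplified stand-in for the real Jinja2-style engine (no filters or conditionals), and the `render` helper and sample context are illustrative only:

```python
import re

def render(template: str, context: dict) -> str:
    """Resolve {{ dotted.path }} placeholders against a nested dict.

    Simplified illustration only -- the real engine also supports
    filters and conditional expressions.
    """
    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).strip().split("."):
            value = value[part]  # walk the dotted path, e.g. incident.title
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", lookup, template)

context = {"incident": {"title": "Suspicious login", "severity": "high"}}
msg = render("Alert: {{ incident.title }} ({{ incident.severity }})", context)
print(msg)  # Alert: Suspicious login (high)
```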
Conditional Logic
{
"action": "isolate_host",
"conditions": "{{ incident.severity == 'critical' and enrichments.virustotal.positives > 5 }}"
}
Approval Requirements
Mark steps as requiring approval for dangerous actions:
{
"action": "disable_user",
"requires_approval": true
}
When requires_approval: true:
- Step pauses at approval queue
- Analyst reviews and approves/denies
- Execution continues or stops
Example Playbooks
Phishing Triage
{
"name": "Phishing Triage",
"description": "Automated triage for reported phishing emails",
"trigger": {
"type": "incident_created",
"conditions": {
"source": "email_gateway",
"title_contains": "phishing"
}
},
"stages": [
{
"name": "Extract and Enrich",
"parallel": true,
"steps": [
{
"action": "lookup_domain",
"parameters": {"domain": "{{ incident.sender_domain }}"}
},
{
"action": "lookup_url",
"parameters": {"url": "{{ incident.iocs.url }}"}
},
{
"action": "lookup_user",
"parameters": {"email": "{{ incident.recipient }}"}
}
]
},
{
"name": "Analyze",
"steps": [
{
"action": "analyze_with_llm",
"parameters": {
"prompt": "Analyze this phishing attempt and determine if it's targeted spear-phishing"
}
}
]
},
{
"name": "Respond",
"steps": [
{
"action": "send_notification",
"parameters": {
"channel": "slack-phishing",
"message": "Phishing alert: {{ incident.title }}\nSender: {{ incident.sender }}\nVerdict: {{ analysis.verdict }}"
}
},
{
"action": "create_ticket",
"conditions": "{{ analysis.verdict == 'malicious' }}",
"parameters": {
"provider": "jira",
"project": "SEC",
"title": "Phishing: {{ incident.title }}"
}
}
]
}
]
}
Malware Containment
{
"name": "Malware Containment",
"description": "Isolate hosts with confirmed malware",
"trigger": {
"type": "incident_created",
"conditions": {
"source": "crowdstrike",
"severity": "critical",
"title_contains": "malware"
}
},
"stages": [
{
"name": "Verify",
"steps": [
{
"action": "lookup_hash",
"parameters": {"hash": "{{ incident.iocs.file_hash }}"}
}
]
},
{
"name": "Contain",
"steps": [
{
"action": "isolate_host",
"conditions": "{{ enrichments.virustotal.positives >= 5 }}",
"requires_approval": true,
"parameters": {
"hostname": "{{ incident.hostname }}",
"reason": "Confirmed malware with {{ enrichments.virustotal.positives }} detections"
}
}
]
},
{
"name": "Notify",
"steps": [
{
"action": "send_notification",
"parameters": {
"channel": "pagerduty-security",
"message": "Host {{ incident.hostname }} isolated due to malware"
}
}
]
}
]
}
Best Practices
- Start small - Begin with enrichment-only playbooks before adding containment
- Require approval - Always require approval for containment actions initially
- Test in staging - Test playbooks with mock incidents first
- Monitor execution - Watch playbook executions for errors
- Document thoroughly - Include clear descriptions for each stage/step
- Use conditions - Don't execute actions blindly; use conditions to validate
- Handle failures - Consider what happens if a step fails
Troubleshooting
Playbook Not Triggering
- Verify trigger conditions match incoming incidents
- Check playbook is enabled
- Review trigger condition syntax
Step Failing
- Check connector is healthy
- Verify required parameters are provided
- Check variable templates resolve correctly
- Review step logs in incident timeline
Approval Stuck
- Check Approvals queue for pending items
- Verify approvers have notification channel configured
- Consider timeout settings for approvals
Notifications Setup Guide
Configure notification channels for alerts and incident updates.
Overview
Triage Warden supports multiple notification channels:
- Slack - Team messaging
- Microsoft Teams - Enterprise collaboration
- PagerDuty - On-call alerting
- Email - SMTP notifications
- Webhooks - Custom integrations
Adding a Notification Channel
- Navigate to Settings → Notifications
- Click Add Channel
- Select channel type
- Configure settings
- Test and save
Slack
Prerequisites
- Slack workspace admin access
- Slack app with webhook permissions
Setup Steps
- Create Slack App:
  - Go to api.slack.com/apps
  - Click Create New App → From scratch
  - Name it "Triage Warden" and select your workspace
- Enable Incoming Webhooks:
  - In app settings, click Incoming Webhooks
  - Toggle Activate Incoming Webhooks to On
  - Click Add New Webhook to Workspace
  - Select the channel for alerts
- Copy Webhook URL:
  - Copy the webhook URL (starts with https://hooks.slack.com/...)
- Configure in Triage Warden:
| Field | Value |
|---|---|
| Name | Slack - Security |
| Type | slack |
| Webhook URL | Your webhook URL |
| Channel | #security-alerts |
Message Format
Triage Warden sends formatted Slack messages with:
- Severity color coding (red=critical, orange=high, yellow=medium, gray=low)
- Incident summary and details
- Quick action buttons (View, Acknowledge)
- Enrichment highlights
Example Notification
{
"attachments": [{
"color": "#ff0000",
"title": "Critical: Malware Detected on WORKSTATION-001",
"text": "CrowdStrike detected Emotet malware on endpoint",
"fields": [
{"title": "Source", "value": "CrowdStrike", "short": true},
{"title": "Severity", "value": "Critical", "short": true}
],
"actions": [
{"type": "button", "text": "View Incident", "url": "https://..."}
]
}]
}
Microsoft Teams
Prerequisites
- Microsoft 365 account
- Teams channel where you can add connectors
Setup Steps
- Add Incoming Webhook Connector:
  - In Teams, go to the channel for alerts
  - Click ... → Connectors
  - Find Incoming Webhook and click Configure
  - Name it "Triage Warden" and upload an icon (optional)
  - Click Create
- Copy Webhook URL:
  - Copy the generated webhook URL
- Configure in Triage Warden:
| Field | Value |
|---|---|
| Name | Teams - Security |
| Type | teams |
| Webhook URL | Your webhook URL |
Adaptive Cards
Triage Warden sends Teams notifications as Adaptive Cards with:
- Severity indicators
- Incident details in structured format
- Action buttons for quick response
PagerDuty
Prerequisites
- PagerDuty account
- Service with Events API v2 integration
Setup Steps
- Create PagerDuty Service:
  - In PagerDuty, go to Services → New Service
  - Name it "Triage Warden Alerts"
  - Add an escalation policy
- Add Events API Integration:
  - On the service page, go to Integrations
  - Click Add Integration
  - Select Events API v2
  - Copy the Integration Key
- Configure in Triage Warden:
| Field | Value |
|---|---|
| Name | PagerDuty - Security |
| Type | pagerduty |
| Integration Key | Your integration key |
| Severity Mapping | See below |
Severity Mapping
Map Triage Warden severities to PagerDuty:
| TW Severity | PagerDuty Severity |
|---|---|
| Critical | critical |
| High | error |
| Medium | warning |
| Low | info |
Auto-Resolution
Configure auto-resolution to close PagerDuty incidents when Triage Warden incidents are resolved:
notifications:
  pagerduty:
    auto_resolve: true
    resolve_on_status:
      - resolved
      - closed
      - false_positive
Email (SMTP)
Prerequisites
- SMTP server credentials
- Recipient email addresses
Configuration
| Field | Value |
|---|---|
| Name | Email - SOC Team |
| Type | email |
| SMTP Host | smtp.company.com |
| SMTP Port | 587 |
| Username | [email protected] |
| Password | SMTP password |
| From Address | [email protected] |
| To Addresses | [email protected] |
| Use TLS | true |
Email Templates
Customize email templates by creating files in config/templates/:
config/templates/
├── email_incident_created.html
├── email_incident_updated.html
└── email_incident_resolved.html
Template variables:
- `{{ incident.title }}` - Incident title
- `{{ incident.severity }}` - Severity level
- `{{ incident.source }}` - Alert source
- `{{ incident.description }}` - Full description
- `{{ incident.url }}` - Link to incident
Custom Webhooks
Send notifications to any HTTP endpoint.
Configuration
| Field | Value |
|---|---|
| Name | Custom - SIEM |
| Type | webhook |
| URL | https://siem.company.com/api/alerts |
| Method | POST |
| Headers | {"Authorization": "Bearer ..."} |
| Secret | Webhook signing secret (optional) |
Payload Format
Default JSON payload:
{
"event_type": "incident_created",
"timestamp": "2024-01-15T10:30:00Z",
"incident": {
"id": "uuid",
"title": "Alert Title",
"severity": "high",
"source": "crowdstrike",
"description": "...",
"created_at": "2024-01-15T10:29:00Z"
}
}
Webhook Signatures
If a secret is configured, Triage Warden signs webhooks with HMAC-SHA256:
X-TW-Signature: sha256=<signature>
X-TW-Timestamp: <unix_timestamp>
Verify signatures:
import hmac
import hashlib

def verify_signature(payload, signature, secret, timestamp):
    expected = hmac.new(
        secret.encode(),
        f"{timestamp}.{payload}".encode(),
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)
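A round-trip check of the scheme clarifies both sides. The header names and `timestamp.payload` signing format are as documented above; the secret and payload values here are made up for illustration:

```python
import hashlib
import hmac

def sign(payload: str, secret: str, timestamp: str) -> str:
    """Produce the X-TW-Signature value for a payload (sender side)."""
    digest = hmac.new(secret.encode(),
                      f"{timestamp}.{payload}".encode(),
                      hashlib.sha256).hexdigest()
    return f"sha256={digest}"

def verify_signature(payload, signature, secret, timestamp):
    """Receiver side: constant-time comparison against a recomputed signature."""
    return hmac.compare_digest(sign(payload, secret, timestamp), signature)

payload = '{"event_type": "incident_created"}'
sig = sign(payload, "example-secret", "1700000000")
assert verify_signature(payload, sig, "example-secret", "1700000000")
assert not verify_signature(payload, sig, "wrong-secret", "1700000000")
```

Including the timestamp in the signed string lets receivers reject stale requests and mitigates replay attacks.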
Notification Rules
Configure when and how notifications are sent.
Severity Filtering
Send only high/critical alerts to PagerDuty:
notifications:
  rules:
    - channel: pagerduty-security
      conditions:
        severity:
          - critical
          - high
Time-Based Rules
Different channels for business hours vs. after hours:
notifications:
  rules:
    - channel: slack-security
      conditions:
        hours: "09:00-17:00"
        days: ["mon", "tue", "wed", "thu", "fri"]
    - channel: pagerduty-oncall
      conditions:
        hours: "17:00-09:00"
        days: ["sat", "sun"]
Source-Based Rules
Route by alert source:
notifications:
  rules:
    - channel: slack-phishing
      conditions:
        source: email_gateway
    - channel: slack-edr
      conditions:
        source:
          - crowdstrike
          - defender
Testing Notifications
Test via UI
- Go to Settings → Notifications
- Click Test next to any channel
- Check that test message arrives
Test via API
curl -X POST http://localhost:8080/api/notifications/test \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"channel_id": "uuid-of-channel",
"message": "Test notification from Triage Warden"
}'
Test via CLI
triage-warden notifications test --channel slack-security
Troubleshooting
Notifications Not Arriving
- Check channel health:
  curl http://localhost:8080/health/detailed | jq '.components.notifications'
- Verify the webhook URL:
  - Test the URL with curl
  - Check for firewalls or network restrictions
- Check logs:
  grep "notification" /var/log/triage-warden/app.log
Rate Limiting
If notifications are delayed:
- Slack: 1 message per second per channel
- PagerDuty: 120 events per minute
- Teams: 4 messages per second
Configure rate limits:
notifications:
  rate_limits:
    slack: 1/s
    pagerduty: 2/s
    teams: 4/s
Duplicate Notifications
If receiving duplicates:
- Check for multiple channels targeting same destination
- Enable deduplication:
notifications:
  deduplicate: true
  dedupe_window: 5m
Policies Guide
Configure approval policies, guardrails, and safety rules for Triage Warden.
Overview
Policies control what actions Triage Warden can take automatically and what requires human approval. The policy engine provides:
- Approval Requirements - Which actions need human approval
- Guardrails - Safety limits on automated actions
- Kill Switch - Emergency halt for all automation
- Audit Logging - Complete action history
Policy Configuration
Policies are defined in config/guardrails.yaml or via the web UI at Settings → Policies.
Basic Structure
# config/guardrails.yaml
version: "1"
# Global settings
global:
  operation_mode: supervised # assisted, supervised, autonomous
  kill_switch_enabled: false
  max_actions_per_incident: 10
  max_concurrent_actions: 5
# Action-specific policies
actions:
  isolate_host:
    requires_approval: true
    approval_level: high
    allowed_sources:
      - crowdstrike
      - defender
  disable_user:
    requires_approval: true
    approval_level: critical
    max_per_hour: 5
  lookup_hash:
    requires_approval: false
    rate_limit: 100/minute
# Approval rules
approvals:
  levels:
    low:
      auto_approve_after: 5m
      approvers: [analyst]
    medium:
      auto_approve_after: 30m
      approvers: [analyst, senior_analyst]
    high:
      auto_approve_after: never
      approvers: [senior_analyst, manager]
    critical:
      auto_approve_after: never
      approvers: [manager]
      require_count: 2
Operation Modes
Assisted Mode
Human-in-the-loop for all decisions:
- All actions require explicit approval
- AI provides recommendations only
- Best for initial deployment and high-risk environments
global:
  operation_mode: assisted
Supervised Mode (Recommended)
Balanced automation with oversight:
- Low-risk actions (lookups, enrichment) run automatically
- Medium/high-risk actions require approval
- Humans can intervene at any time
global:
  operation_mode: supervised
Autonomous Mode
Maximum automation:
- Most actions run without approval
- Only critical actions require human review
- Use only after thorough testing
global:
  operation_mode: autonomous
Approval Levels
Configuring Approval Requirements
Each action type can have an approval requirement:
| Action Type | Default Level | Typical Setting |
|---|---|---|
| lookup_* | none | none |
| send_notification | none | none |
| create_ticket | low | none or low |
| add_comment | none | none |
| set_severity | low | low |
| block_ip | high | high |
| isolate_host | critical | high or critical |
| disable_user | critical | critical |
Approval Workflow
- Action Requested - Playbook or AI requests an action
- Policy Check - Engine evaluates approval requirements
- Queue or Execute - Action queued for approval or runs immediately
- Approval Decision - Approver accepts or denies
- Execution - Approved action executes
- Audit Log - All decisions recorded
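The queue-or-execute decision in steps 2-3 can be sketched as a pure function over the action policies. The `route_action` helper and its string return values are hypothetical names for illustration; field names mirror the configuration shown earlier:

```python
def route_action(action: str, policies: dict) -> str:
    """Decide whether an action executes immediately or waits for approval.

    Unknown actions fall back to requiring approval -- a safe default
    consistent with a restrictive policy posture.
    """
    policy = policies.get(action, {})
    if policy.get("requires_approval", True):
        return f"queued:{policy.get('approval_level', 'high')}"
    return "execute"

policies = {
    "lookup_hash": {"requires_approval": False},
    "isolate_host": {"requires_approval": True, "approval_level": "high"},
}
print(route_action("lookup_hash", policies))   # execute
print(route_action("isolate_host", policies))  # queued:high
print(route_action("wipe_host", policies))     # queued:high (unknown action)
```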
Approval Escalation
Configure escalation for unanswered approvals:
approvals:
  escalation:
    enabled: true
    rules:
      - after: 15m
        notify: [slack-security]
      - after: 30m
        notify: [pagerduty-oncall]
        escalate_to: manager
      - after: 1h
        auto_deny: true
        reason: "Approval timeout"
Guardrails
Rate Limits
Prevent runaway automation:
guardrails:
  rate_limits:
    # Global limits
    global:
      max_actions_per_minute: 100
      max_actions_per_hour: 1000
    # Per-action limits
    isolate_host:
      max_per_hour: 10
      max_per_day: 50
    disable_user:
      max_per_hour: 5
      max_per_day: 20
Blocked Actions
Completely prevent certain actions:
guardrails:
  blocked_actions:
    - delete_user # Never allow
    - format_disk # Never allow
    - disable_mfa # Too dangerous
Conditional Rules
Allow/deny based on conditions:
guardrails:
  conditional_rules:
    - action: isolate_host
      deny_if:
        - hostname_contains: "dc" # Don't isolate domain controllers
        - hostname_contains: "prod-db" # Don't isolate production databases
        - is_server: true
    - action: disable_user
      deny_if:
        - is_admin: true # Don't disable admins
        - is_service_account: true # Don't disable service accounts
      require_if:
        - department: "executive" # Extra approval for executives
Asset Protection
Protect critical assets:
guardrails:
  protected_assets:
    hosts:
      - pattern: "dc-*"
        actions_blocked: [isolate_host, shutdown]
        reason: "Domain controllers require manual intervention"
      - pattern: "prod-*"
        require_approval: critical
        reason: "Production systems require manager approval"
    users:
      - pattern: "*@executive.company.com"
        require_approval: critical
      - pattern: "svc-*"
        actions_blocked: [disable_user, reset_password]
Kill Switch
Emergency Automation Halt
The kill switch immediately stops all automated actions:
Via UI:
- Go to Settings → Safety
- Click Activate Kill Switch
- Enter reason
- Confirm
Via API:
curl -X POST http://localhost:8080/api/kill-switch/activate \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"reason": "Investigating potential false positives"}'
Via CLI:
triage-warden kill-switch activate --reason "Emergency halt"
Kill Switch Effects
When active:
- All pending actions are paused
- New automated actions are blocked
- Manual actions still allowed
- Alerts continue to be ingested
- Enrichment continues (read-only)
Deactivating
Only users with admin or manager role can deactivate:
curl -X POST http://localhost:8080/api/kill-switch/deactivate \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"reason": "Issue resolved, resuming normal operations"}'
Audit Logging
What's Logged
Every action is logged with:
- Timestamp
- Action type
- Target (host, user, etc.)
- Requestor (playbook, user, AI)
- Approver (if required)
- Result (success, failure, denied)
- Full context
Viewing Audit Logs
Via UI:
- Settings → Audit Log
- Filter by date, action type, user, result
Via API:
curl "http://localhost:8080/api/audit?action=isolate_host&from=2024-01-01" \
-H "Authorization: Bearer $API_KEY"
Audit Retention
Configure retention in config/guardrails.yaml:
audit:
  retention_days: 365
  archive_to: s3://audit-logs-bucket/triage-warden/
Policy Testing
Dry Run Mode
Test policies without executing actions:
global:
  dry_run: true # Log what would happen, don't execute
Policy Simulator
Test specific scenarios:
curl -X POST http://localhost:8080/api/policies/simulate \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"action": "isolate_host",
"context": {
"hostname": "dc-primary",
"severity": "critical",
"source": "crowdstrike"
}
}'
Response:
{
"allowed": false,
"reason": "Host matches protected pattern 'dc-*'",
"would_require_approval": null,
"matching_rules": [
"protected_assets.hosts[0]"
]
}
Best Practices
1. Start Restrictive
Begin with assisted mode and strict approvals. Loosen over time as you build confidence.
2. Protect Critical Assets
Always define protected assets for:
- Domain controllers
- Production databases
- Executive accounts
- Service accounts
3. Use Approval Escalation
Don't let approvals sit forever. Configure timeouts and escalations.
4. Monitor Guardrail Hits
Alert when guardrails are triggered frequently—it may indicate:
- Misconfiguration
- Attack in progress
- Need to adjust thresholds
5. Test Policy Changes
Always use dry run or simulator before deploying policy changes.
6. Keep Audit Logs
Maintain audit logs for compliance and incident review. Archive to external storage.
Example: Phishing Response Policy
Complete policy for phishing incident automation:
version: "1"
global:
  operation_mode: supervised
actions:
  # Enrichment - automatic
  lookup_url:
    requires_approval: false
    rate_limit: 100/minute
  lookup_domain:
    requires_approval: false
    rate_limit: 100/minute
  lookup_user:
    requires_approval: false
    rate_limit: 50/minute
  # Notifications - automatic
  send_notification:
    requires_approval: false
  # Containment - requires approval
  block_sender:
    requires_approval: true
    approval_level: medium
    max_per_hour: 50
  quarantine_email:
    requires_approval: true
    approval_level: low
    auto_approve_confidence: 0.95
  disable_user:
    requires_approval: true
    approval_level: critical
guardrails:
  conditional_rules:
    - action: disable_user
      deny_if:
        - is_admin: true
        - is_executive: true
    - action: quarantine_email
      auto_approve_if:
        - ai_confidence: "> 0.95"
        - virustotal_malicious: "> 5"
Default Configuration Reference
The default configuration file (config/default.yaml) contains all settings for a Triage Warden deployment. Copy this file and customize it for your environment.
Sensitive values should use environment variable interpolation: ${ENV_VAR_NAME}.
Operation Mode
operation_mode: supervised
| Mode | Description |
|---|---|
assisted | AI observes and suggests only, no automated actions |
supervised | Low-risk actions automated, high-risk requires approval |
autonomous | Full automation for configured incident types |
Concurrency
max_concurrent_incidents: 50
Maximum number of incidents being processed at the same time. Increase for high-volume environments; decrease to limit resource usage.
Connectors
External service integrations. Each connector follows the same structure:
connectors:
  <connector_name>:
    connector_type: <type>
    enabled: true
    base_url: <url>
    api_key: ${API_KEY_ENV_VAR}
    api_secret: ""
    timeout_secs: 30
    settings:
      <connector-specific settings>
Common Fields
| Field | Type | Description |
|---|---|---|
connector_type | String | Connector implementation to use |
enabled | Boolean | Whether this connector is active |
base_url | String | Base URL for the service API |
api_key | String | API key or username (use ${ENV_VAR}) |
api_secret | String | API secret or password (use ${ENV_VAR}) |
timeout_secs | Integer | HTTP request timeout in seconds |
settings | Map | Connector-specific settings |
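The `${ENV_VAR}` interpolation described above can be approximated like this. This is a sketch of the general technique; Triage Warden's actual resolution logic is not shown here and may differ:

```python
import os
import re

def interpolate(value: str) -> str:
    """Replace ${VAR} references with environment values; error if unset.

    Failing loudly on a missing variable is usually preferable to
    silently passing an empty credential to a connector.
    """
    def resolve(match: re.Match) -> str:
        var = match.group(1)
        if var not in os.environ:
            raise KeyError(f"environment variable {var} is not set")
        return os.environ[var]
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", resolve, value)

os.environ["VIRUSTOTAL_API_KEY"] = "demo-key"  # for illustration only
print(interpolate("${VIRUSTOTAL_API_KEY}"))  # demo-key
```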
Jira
connectors:
  jira:
    connector_type: jira
    enabled: true
    base_url: https://your-company.atlassian.net
    api_key: ${JIRA_API_KEY}
    timeout_secs: 30
    settings:
      project_key: SEC
      default_issue_type: Incident
VirusTotal
connectors:
  virustotal:
    connector_type: virustotal
    enabled: true
    base_url: https://www.virustotal.com
    api_key: ${VIRUSTOTAL_API_KEY}
    timeout_secs: 30
    settings:
      cache_ttl_secs: 3600
Splunk (SIEM)
connectors:
  splunk:
    connector_type: splunk
    enabled: true
    base_url: https://splunk.company.com:8089
    api_key: ${SPLUNK_TOKEN}
    settings:
      index: main
      earliest_time: -24h
CrowdStrike (EDR)
connectors:
  crowdstrike:
    connector_type: crowdstrike
    enabled: true
    base_url: https://api.crowdstrike.com
    api_key: ${CS_CLIENT_ID}
    api_secret: ${CS_CLIENT_SECRET}
LLM Configuration
llm:
  provider: anthropic
  model: claude-3-5-sonnet-20241022
  api_key: ${ANTHROPIC_API_KEY}
  base_url: ""
  max_tokens: 4096
  temperature: 0.1
| Field | Description |
|---|---|
provider | LLM provider: anthropic, openai, or local |
model | Model identifier |
api_key | API key (use ${ENV_VAR}) |
base_url | Custom endpoint URL for local/self-hosted models |
max_tokens | Maximum tokens in LLM responses |
temperature | Sampling temperature (lower = more deterministic) |
Policy Configuration
policy:
  guardrails_path: config/guardrails.yaml
  default_approval_level: analyst
  auto_approve_low_risk: true
  confidence_threshold: 0.9
| Field | Description |
|---|---|
guardrails_path | Path to the guardrails configuration file |
default_approval_level | Default approval level for unknown actions (analyst, senior, manager) |
auto_approve_low_risk | Whether low-risk actions can be auto-approved |
confidence_threshold | Minimum AI confidence for auto-approval (0.0-1.0) |
Logging Configuration
logging:
  level: info
  json_format: false
  # file_path: /var/log/triage-warden/triage-warden.log
| Field | Description |
|---|---|
level | Log level: trace, debug, info, warn, error |
json_format | Use structured JSON format (recommended for production) |
file_path | Optional log file path; omit to log to stdout |
Database Configuration
database:
  url: sqlite://triage-warden.db?mode=rwc
  max_connections: 10
  run_migrations: true
| Field | Description |
|---|---|
url | Database connection string |
max_connections | Connection pool size |
run_migrations | Whether to run migrations on startup |
Database URLs
| Database | URL format |
|---|---|
| SQLite (dev) | sqlite://triage-warden.db?mode=rwc |
| PostgreSQL (prod) | postgres://user:pass@host:5432/triage_warden |
API Server Configuration
api:
  port: 8080
  host: "0.0.0.0"
  enable_swagger: true
  timeout_secs: 30
| Field | Description |
|---|---|
port | TCP port to listen on |
host | Bind address (0.0.0.0 for all interfaces, 127.0.0.1 for localhost only) |
enable_swagger | Serve Swagger UI at /swagger-ui |
timeout_secs | HTTP request timeout in seconds |
Guardrails Reference
The guardrails configuration file (config/guardrails.yaml) defines security boundaries for AI-automated actions. These rules apply regardless of the current autonomy level.
Deny List
Actions and targets that are never allowed automatically.
Denied Actions
deny_list:
  actions:
    - delete_user # Too destructive
    - wipe_host # Too destructive
    - delete_all_emails # Too destructive
    - modify_firewall # High risk
Add any action name here to prevent the AI from ever executing it. These actions can still be performed manually by an analyst.
Target Patterns
Regex patterns that match protected systems. Any automated action targeting a hostname or identifier that matches these patterns requires human approval.
```yaml
deny_list:
  target_patterns:
    - ".*-prod-.*"     # Production systems
    - "dc\\d+\\..*"    # Domain controllers
    - ".*-critical-.*" # Explicitly marked critical
    - ".*\\.corp\\..*" # Corporate infrastructure
```
Protected IPs
Specific IP addresses that must never be targeted by automated actions.
```yaml
deny_list:
  protected_ips:
    - "10.0.0.1" # Core router
    - "10.0.0.2" # DNS server
    - "10.0.0.3" # DHCP server
```
Protected Users
User accounts that are protected from automated modifications (disable, password reset, etc.). Supports exact matches and glob patterns.
```yaml
deny_list:
  protected_users:
    - "admin"
    - "root"
    - "administrator"
    - "service-account-*"
    - "svc-*"
```
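The matching behavior can be illustrated with a minimal sketch supporting exact names and a single trailing `*` glob. This is an illustrative stand-in, not Triage Warden's actual matcher; the function name `is_protected` is hypothetical.

```rust
/// Returns true if `user` matches any protected-user entry.
/// Supports exact matches and a single trailing `*` glob (e.g. "svc-*").
/// A sketch only; the real implementation may differ.
fn is_protected(user: &str, patterns: &[&str]) -> bool {
    patterns.iter().any(|p| match p.strip_suffix('*') {
        Some(prefix) => user.starts_with(prefix), // glob: prefix match
        None => user == *p,                       // exact match
    })
}

fn main() {
    let patterns = ["admin", "root", "administrator", "service-account-*", "svc-*"];
    assert!(is_protected("svc-backup", &patterns)); // glob match
    assert!(is_protected("admin", &patterns));      // exact match
    assert!(!is_protected("jdoe", &patterns));      // not protected
    println!("ok");
}
```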
Rate Limits
Prevent runaway automation by capping how many times each action can be executed.
```yaml
rate_limits:
  isolate_host:
    max_per_hour: 5
    max_per_day: 20
    max_concurrent: 2
  disable_user:
    max_per_hour: 10
    max_per_day: 50
    max_concurrent: 5
  block_ip:
    max_per_hour: 20
    max_per_day: 100
    max_concurrent: 10
  quarantine_email:
    max_per_hour: 50
    max_per_day: 500
    max_concurrent: 20
```
| Field | Description |
|---|---|
| `max_per_hour` | Maximum executions in a rolling 60-minute window |
| `max_per_day` | Maximum executions in a rolling 24-hour window |
| `max_concurrent` | Maximum simultaneous in-flight executions |
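A rolling window limit like `max_per_hour` can be sketched with a queue of recent execution timestamps. This is a simplified illustration using caller-supplied timestamps in seconds; the `RateLimit` type and `try_execute` method are assumptions, not the actual implementation.

```rust
use std::collections::VecDeque;

/// Sliding-window counter for one action type (sketch of `max_per_hour`).
struct RateLimit {
    max_per_hour: usize,
    timestamps: VecDeque<u64>, // execution times (seconds) within the window
}

impl RateLimit {
    fn new(max_per_hour: usize) -> Self {
        Self { max_per_hour, timestamps: VecDeque::new() }
    }

    /// Records and allows the execution if the rolling limit permits it.
    fn try_execute(&mut self, now: u64) -> bool {
        // Drop entries older than the rolling 60-minute window.
        while self.timestamps.front().map_or(false, |&t| now - t >= 3600) {
            self.timestamps.pop_front();
        }
        if self.timestamps.len() < self.max_per_hour {
            self.timestamps.push_back(now);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut limit = RateLimit::new(5); // isolate_host: max_per_hour: 5
    for i in 0..5 {
        assert!(limit.try_execute(i)); // first five succeed
    }
    assert!(!limit.try_execute(10));  // sixth within the hour is denied
    assert!(limit.try_execute(4000)); // allowed again once the window rolls
    println!("ok");
}
```

The same structure extends to `max_per_day` with a 24-hour window; `max_concurrent` would instead track in-flight executions that are released on completion.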
Approval Policies
Define when human approval is required, and at what level.
```yaml
approval_policies:
  - name: critical_asset_protection
    description: "Require senior approval for actions on critical assets"
    condition:
      target_criticality:
        - critical
        - high
    requires: senior
    can_override: false
```
Condition Fields
| Field | Type | Description |
|---|---|---|
| `target_criticality` | List of strings | Asset criticality levels that trigger this policy |
| `action_type` | List of strings | Action types that trigger this policy |
| `confidence_below` | Float (0.0-1.0) | Trigger when AI confidence is below this threshold |
Approval Levels
| Level | Who can approve |
|---|---|
| `analyst` | Any analyst |
| `senior` | Senior analyst or above |
| `manager` | SOC manager |
Overridability
When `can_override: true`, a senior user can bypass the approval requirement. When `false`, the approval is mandatory and cannot be skipped.
Auto-Approve Rules
Actions that can be executed automatically when specific conditions are met, even in supervised mode.
```yaml
auto_approve_rules:
  - name: ticket_operations
    description: "Auto-approve ticket creation and updates"
    action_types:
      - create_ticket
      - update_ticket
      - add_ticket_comment
    conditions:
      - confidence_above: 0.5
  - name: email_quarantine_high_confidence
    description: "Auto-approve email quarantine for high-confidence phishing"
    action_types:
      - quarantine_email
    conditions:
      - confidence_above: 0.95
      - verdict: true_positive
```
Condition Fields
| Field | Type | Description |
|---|---|---|
| `confidence_above` | Float (0.0-1.0) | AI confidence must exceed this value |
| `verdict` | String | AI verdict must match (e.g., `true_positive`) |
All conditions in the list must be met (AND logic).
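The AND semantics can be sketched as follows. The `Condition` enum and `auto_approve` function are illustrative names mirroring the config fields, not the actual types.

```rust
/// One auto-approve condition, mirroring the YAML fields above (sketch).
enum Condition {
    ConfidenceAbove(f64),
    Verdict(&'static str),
}

/// All conditions must hold (AND logic) for the rule to fire.
fn auto_approve(conditions: &[Condition], confidence: f64, verdict: &str) -> bool {
    conditions.iter().all(|c| match c {
        Condition::ConfidenceAbove(t) => confidence > *t, // must exceed threshold
        Condition::Verdict(v) => verdict == *v,           // must match exactly
    })
}

fn main() {
    // email_quarantine_high_confidence: confidence_above: 0.95 AND verdict: true_positive
    let rule = [Condition::ConfidenceAbove(0.95), Condition::Verdict("true_positive")];
    assert!(auto_approve(&rule, 0.97, "true_positive"));
    assert!(!auto_approve(&rule, 0.97, "false_positive")); // verdict mismatch
    assert!(!auto_approve(&rule, 0.90, "true_positive"));  // confidence too low
    println!("ok");
}
```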
Data Policies
Control how sensitive data is handled in logs and LLM prompts.
```yaml
data_policies:
  pii_filter: true
  pii_patterns:
    - "\\b\\d{3}-\\d{2}-\\d{4}\\b" # SSN
    - "\\b\\d{16}\\b"              # Credit card
  secrets_redaction: true
  secret_patterns:
    - "(?i)api[_-]?key"
    - "(?i)password"
    - "(?i)secret"
    - "(?i)token"
    - "(?i)credential"
  audit_data_access: true
```
| Field | Description |
|---|---|
| `pii_filter` | Enable PII filtering in logs and LLM prompts |
| `pii_patterns` | Regex patterns matching PII to redact |
| `secrets_redaction` | Enable secret detection and redaction |
| `secret_patterns` | Regex patterns matching secrets to redact |
| `audit_data_access` | Log all data access operations |
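The redaction step can be sketched with a simplified keyword scan standing in for the regex patterns above. This is an illustration of the idea only; the real implementation applies the configured `secret_patterns` regexes and may behave differently.

```rust
/// Redact values on lines whose key matches a known secret keyword.
/// Keyword-based sketch; the real code uses the `secret_patterns` regexes.
fn redact_line(line: &str) -> String {
    const KEYWORDS: [&str; 5] = ["api_key", "password", "secret", "token", "credential"];
    let lower = line.to_lowercase(); // case-insensitive, like the (?i) patterns
    if KEYWORDS.iter().any(|k| lower.contains(k)) {
        // Keep the key for debuggability, drop the value.
        match line.split_once('=') {
            Some((key, _)) => format!("{}=[REDACTED]", key),
            None => "[REDACTED]".to_string(),
        }
    } else {
        line.to_string()
    }
}

fn main() {
    assert_eq!(redact_line("API_KEY=abc123"), "API_KEY=[REDACTED]");
    assert_eq!(redact_line("user_password=hunter2"), "user_password=[REDACTED]");
    assert_eq!(redact_line("hostname=web01"), "hostname=web01"); // untouched
    println!("ok");
}
```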
Escalation Rules
Define automatic escalation triggers.
```yaml
escalation_rules:
  - name: repeated_false_positives
    description: "Escalate if same alert type has high FP rate"
    condition:
      false_positive_rate_above: 0.5
      sample_size_min: 10
    action: escalate_to_analyst
  - name: incident_correlation
    description: "Escalate if multiple related incidents detected"
    condition:
      related_incidents_above: 3
      time_window_hours: 1
    action: escalate_to_senior
  - name: critical_severity
    description: "Always escalate critical severity incidents"
    condition:
      severity: critical
    action: escalate_to_manager
```
Escalation Actions
| Action | Description |
|---|---|
| `escalate_to_analyst` | Route to any available analyst |
| `escalate_to_senior` | Route to a senior analyst |
| `escalate_to_manager` | Route to the SOC manager |
Integrations
Triage Warden supports integrations for identity, telemetry, enrichment, and response workflows.
SSO
Use the SSO guides below to configure OIDC or SAML with your identity provider.
SSO Integration Guide
Triage Warden supports enterprise SSO through both OIDC and SAML endpoints.
Supported Flows
- OIDC login: `/auth/oidc/login`
- OIDC callback: `/auth/oidc/callback`
- OIDC logout: `/auth/oidc/logout`
- SAML metadata: `/auth/saml/metadata`
- SAML login: `/auth/saml/login`
- SAML ACS: `/auth/saml/acs`
- SAML SLO: `/auth/saml/slo`
Common Environment Variables
- `TW_OIDC_ISSUER`
- `TW_OIDC_CLIENT_ID`
- `TW_OIDC_CLIENT_SECRET`
- `TW_OIDC_REDIRECT_URI`
- `TW_OIDC_SCOPES`
- `TW_OIDC_JWKS_URI` (optional override; the discovery `jwks_uri` is used by default)
- `TW_OIDC_REQUIRE_MFA`
- `TW_SSO_ROLE_MAPPING`
- `TW_SSO_DEFAULT_ROLE`
- `TW_SSO_AUTO_CREATE_USERS`
- `TW_SAML_ENTITY_ID`
- `TW_SAML_ACS_URL`
- `TW_SAML_IDP_SSO_URL`
- `TW_SAML_CERTIFICATE`
- `TW_SAML_PRIVATE_KEY`
- `TW_SAML_EXPECTED_ISSUER`
- `TW_SAML_REQUIRE_MFA`
Use the provider-specific guides in this folder for exact values.
Security Notes
- OIDC ID tokens are validated for issuer, audience, nonce, expiration, and signature (JWKS).
- SAML assertions enforce request correlation (`InResponseTo`), destination checks, signature presence, SHA-2 algorithm allow-listing, and certificate pinning checks.
Okta Setup
1. Create Application
- In the Okta Admin console, go to `Applications > Create App Integration`.
- Choose `OIDC - Web Application` (recommended) or SAML 2.0.
- Configure the sign-in redirect URI: `https://<your-host>/auth/oidc/callback`
2. OIDC Environment Variables
```
TW_OIDC_ISSUER=https://<okta-domain>/oauth2/default
TW_OIDC_CLIENT_ID=<okta-client-id>
TW_OIDC_CLIENT_SECRET=<okta-client-secret>
TW_OIDC_REDIRECT_URI=https://<your-host>/auth/oidc/callback
TW_OIDC_SCOPES=openid,profile,email,groups
TW_OIDC_REQUIRE_MFA=true
```
3. Group to Role Mapping
Example:
```
TW_SSO_ROLE_MAPPING=okta-soc-admin=admin,okta-soc-analyst=analyst,okta-soc-viewer=viewer
```
4. Optional SCIM Provisioning
SCIM can be enabled on top of JIT provisioning for pre-provisioning and automated lifecycle. JIT remains active for first-login provisioning fallback.
Azure AD (Microsoft Entra ID) Setup
1. Register App
- In the Microsoft Entra admin center, go to `Applications > App registrations > New registration`.
- Add redirect URIs:
  - OIDC: `https://<your-host>/auth/oidc/callback`
  - SAML ACS (if using SAML): `https://<your-host>/auth/saml/acs`
- Save the `Application (client) ID` and `Directory (tenant) ID`.
2. Configure OIDC in Triage Warden
Set:
```
TW_OIDC_ISSUER=https://login.microsoftonline.com/<tenant-id>/v2.0
TW_OIDC_CLIENT_ID=<application-client-id>
TW_OIDC_CLIENT_SECRET=<generated-client-secret>
TW_OIDC_REDIRECT_URI=https://<your-host>/auth/oidc/callback
TW_OIDC_SCOPES=openid,profile,email
TW_OIDC_REQUIRE_MFA=true  # recommended
```
3. Claims and Group Mapping
- In the app's `Token configuration`, add group claims.
- Map groups to roles:

```
TW_SSO_ROLE_MAPPING=SOC-Admins=admin,SOC-Analysts=analyst,SOC-Viewers=viewer
```
4. Conditional Access / MFA
- Create a conditional access policy requiring MFA for the app.
- Keep `TW_OIDC_REQUIRE_MFA=true` to enforce server-side claim checks.
Google Workspace Setup
1. Configure OAuth Consent and Client
- Google Cloud Console: configure OAuth consent screen.
- Create OAuth client (Web application).
- Add the authorized redirect URI: `https://<your-host>/auth/oidc/callback`
2. OIDC Configuration
```
TW_OIDC_ISSUER=https://accounts.google.com
TW_OIDC_CLIENT_ID=<google-client-id>
TW_OIDC_CLIENT_SECRET=<google-client-secret>
TW_OIDC_REDIRECT_URI=https://<your-host>/auth/oidc/callback
TW_OIDC_SCOPES=openid,profile,email
```
3. Role Mapping
Google Workspace group claims may require Cloud Identity configuration. Use mapped group names:
```
TW_SSO_ROLE_MAPPING=tw-admins=admin,tw-analysts=analyst,tw-viewers=viewer
```
4. MFA
Enforce 2-Step Verification in Workspace admin policies and set:
```
TW_OIDC_REQUIRE_MFA=true
```
Generic OIDC/SAML Setup
OIDC Checklist
- Configure the redirect URI: `https://<host>/auth/oidc/callback`.
- Set:
  - `TW_OIDC_ISSUER`
  - `TW_OIDC_CLIENT_ID`
  - `TW_OIDC_CLIENT_SECRET`
  - `TW_OIDC_REDIRECT_URI`
- Optional claim overrides:
  - `TW_OIDC_EMAIL_CLAIM`
  - `TW_OIDC_NAME_CLAIM`
  - `TW_OIDC_GROUPS_CLAIM`
  - `TW_OIDC_ROLES_CLAIM`
  - `TW_OIDC_MFA_CLAIM`
- Configure role mapping: `TW_SSO_ROLE_MAPPING=external_group=internal_role,...`
SAML Checklist
- Download SP metadata from `https://<host>/auth/saml/metadata`.
- Configure the IdP to POST assertions to `https://<host>/auth/saml/acs`.
- Set:
  - `TW_SAML_ENTITY_ID`
  - `TW_SAML_ACS_URL`
  - `TW_SAML_IDP_SSO_URL`
  - `TW_SAML_CERTIFICATE`
- Optional:
  - `TW_SAML_PRIVATE_KEY` (required for encrypted assertions)
  - `TW_SAML_IDP_SLO_URL`
  - `TW_SAML_EXPECTED_ISSUER`
  - `TW_SAML_REQUIRE_MFA`
Security Recommendations
- Always require TLS termination.
- Keep `TW_OIDC_REQUIRE_MFA=true` and `TW_SAML_REQUIRE_MFA=true` for privileged tenants.
- Use least-privilege role mappings.
- Rotate OIDC client secrets and SAML certificates regularly.
Architectural Decision Records
This directory contains Architectural Decision Records (ADRs) for Triage Warden.
What is an ADR?
An ADR is a document that captures an important architectural decision made along with its context and consequences.
ADR Index
| Number | Title | Status | Date |
|---|---|---|---|
| 001 | Event Bus Architecture | Accepted | 2026-02 |
| 002 | Dual Database Support (SQLite + PostgreSQL) | Accepted | 2026-02 |
| 003 | Credential Encryption at Rest | Accepted | 2026-02 |
| 004 | Session Management Strategy | Accepted | 2026-02 |
| 005 | API Key Format and Security | Accepted | 2026-02 |
| 006 | Operation Modes (Supervised/Autonomous) | Accepted | 2026-02 |
| 007 | Kill Switch Design | Accepted | 2026-02 |
ADR Template
New ADRs should follow this template:
```markdown
# ADR-XXX: Title

## Status
Proposed | Accepted | Deprecated | Superseded

## Context
What is the issue that we're seeing that is motivating this decision or change?

## Decision
What is the change that we're proposing and/or doing?

## Consequences
What becomes easier or more difficult to do because of this change?
```
ADR-001: Event Bus Architecture
Status
Accepted
Context
Triage Warden needs to coordinate multiple components (enrichment, analysis, action execution, notifications) in response to security incidents. We needed a way to:
- Decouple components for independent development and testing
- Enable real-time updates to the dashboard
- Support both synchronous and asynchronous processing
- Maintain an audit trail of all system events
Decision
We implemented an in-process event bus using Tokio channels with the following design:
Event Types
All significant system events are captured as TriageEvent variants:
- `AlertReceived` - New alert from webhook
- `IncidentCreated` - Incident created from alert
- `EnrichmentComplete` - Single enrichment finished
- `EnrichmentPhaseComplete` - All enrichments done
- `AnalysisComplete` - AI analysis finished
- `ActionsProposed` - Response actions proposed
- `ActionApproved`/`ActionDenied` - Action approval decision
- `ActionExecuted` - Action completed
- `StatusChanged` - Incident status transition
- `TicketCreated` - External ticket created
- `IncidentEscalated` - Incident escalated
- `IncidentResolved` - Incident resolved
- `KillSwitchActivated` - Emergency stop triggered
Delivery Mechanisms
- Broadcast Channel: For real-time dashboard updates via SSE
- Named Subscribers: For component-specific processing queues
- Event History: In-memory buffer for recent event retrieval
Error Handling
Events are fire-and-forget with fallback logging:
- `publish()` - Returns a `Result` for cases where failure matters
- `publish_with_fallback()` - Logs errors, never fails (for non-critical events)
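The two publish variants can be sketched with `std` channels standing in for Tokio's broadcast channel. This is an illustration of the pattern only; the event variants shown, the `EventBus` struct, and its fields are simplified stand-ins for the real types.

```rust
use std::sync::mpsc::{channel, Sender};

#[derive(Clone, Debug, PartialEq)]
enum TriageEvent {
    IncidentCreated(String), // one variant stands in for the full enum
}

struct EventBus {
    subscribers: Vec<Sender<TriageEvent>>,
}

impl EventBus {
    /// Returns Err if delivery fails -- for cases where failure matters.
    fn publish(&self, event: TriageEvent) -> Result<(), String> {
        for tx in &self.subscribers {
            tx.send(event.clone()).map_err(|e| e.to_string())?;
        }
        Ok(())
    }

    /// Logs and swallows delivery errors -- for non-critical events.
    fn publish_with_fallback(&self, event: TriageEvent) {
        for tx in &self.subscribers {
            if let Err(e) = tx.send(event.clone()) {
                eprintln!("event delivery failed: {e}"); // fallback logging
            }
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    let bus = EventBus { subscribers: vec![tx] };
    bus.publish(TriageEvent::IncidentCreated("INC-2024-001".into())).unwrap();
    assert_eq!(rx.recv().unwrap(), TriageEvent::IncidentCreated("INC-2024-001".into()));

    drop(rx); // subscriber goes away
    // publish_with_fallback never fails, even with a dead subscriber
    bus.publish_with_fallback(TriageEvent::IncidentCreated("INC-2024-002".into()));
    println!("ok");
}
```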
Consequences
Positive
- Components are loosely coupled and independently testable
- Dashboard receives real-time updates without polling
- Complete event history available for debugging
- Failed subscribers don't block the main processing flow
Negative
- In-process only - no distributed event bus
- Event history is limited and in-memory (lost on restart)
- No guaranteed delivery or replay capability
- Broadcast channel has limited buffer (may drop events under load)
Future Considerations
For high-availability deployments, consider:
- Redis Pub/Sub for distributed events
- PostgreSQL LISTEN/NOTIFY for persistent events
- External message queue (RabbitMQ, Kafka) for durability
ADR-002: Dual Database Support (SQLite + PostgreSQL)
Status
Accepted
Context
Triage Warden needed to support different deployment scenarios:
- Development/Testing: Quick setup without external dependencies
- Small Deployments: Single-server installations with minimal infrastructure
- Production: Scalable deployments with high availability requirements
We evaluated:
- SQLite only (simple but limited scalability)
- PostgreSQL only (powerful but heavy for small deployments)
- Dual support (flexibility but increased complexity)
Decision
We implemented dual database support using SQLx with compile-time query verification:
Architecture
┌─────────────────────────────────────────┐
│ Application │
├─────────────────────────────────────────┤
│ Repository Traits │
│ (IncidentRepository, UserRepository) │
├──────────────────┬──────────────────────┤
│ SqliteXxxRepo │ PgXxxRepo │
├──────────────────┼──────────────────────┤
│ SQLite Pool │ PostgreSQL Pool │
└──────────────────┴──────────────────────┘
Implementation
- `DbPool` enum wraps both pool types
- Each repository has SQLite and PostgreSQL implementations
- Factory functions create the appropriate implementation based on pool type
- Migrations are maintained separately for each database
Database Selection
Determined by the `DATABASE_URL` environment variable:
- `sqlite:path/to/file.db` → SQLite
- `postgres://user:pass@host/db` → PostgreSQL
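The scheme-based dispatch can be sketched as follows. The `Backend` enum and `backend_from_url` function are illustrative; the real code builds SQLx pools behind the `DbPool` enum rather than returning a plain tag.

```rust
/// Backend selection from the connection-string scheme (sketch).
#[derive(Debug, PartialEq)]
enum Backend {
    Sqlite,
    Postgres,
}

fn backend_from_url(url: &str) -> Result<Backend, String> {
    if url.starts_with("sqlite:") {
        Ok(Backend::Sqlite)
    } else if url.starts_with("postgres://") {
        Ok(Backend::Postgres)
    } else {
        Err(format!("unsupported DATABASE_URL: {url}"))
    }
}

fn main() {
    assert_eq!(backend_from_url("sqlite:triage-warden.db"), Ok(Backend::Sqlite));
    assert_eq!(backend_from_url("postgres://user:pass@host/db"), Ok(Backend::Postgres));
    assert!(backend_from_url("mysql://host/db").is_err()); // unsupported scheme
    println!("ok");
}
```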
Consequences
Positive
- Zero-config development with SQLite
- Production-ready PostgreSQL support
- Same API regardless of database backend
- Compile-time query verification for both backends
Negative
- Duplicate migration files
- Some features may have different behavior (e.g., JSON querying)
- More complex testing matrix
- Cannot use PostgreSQL-specific features (CTEs, window functions) without SQLite equivalents
Trade-offs
| Feature | SQLite | PostgreSQL |
|---|---|---|
| Setup complexity | None | Requires server |
| Concurrent writes | Limited | Excellent |
| JSON indexing | Basic | JSONB with GIN |
| Full-text search | Limited | Excellent |
| Connection pooling | In-process | Network |
| Backup | File copy | pg_dump |
ADR-003: Credential Encryption at Rest
Status
Accepted
Context
Triage Warden stores sensitive credentials for external integrations:
- API keys for threat intelligence services (VirusTotal, etc.)
- OAuth tokens for cloud services (Microsoft, Google)
- Webhook secrets for SIEM integrations
- SMTP credentials for email notifications
These credentials must be protected at rest in the database.
Decision
We implemented AES-256-GCM encryption for sensitive fields:
Encryption Scheme
- Algorithm: AES-256-GCM (authenticated encryption)
- Key Derivation: HKDF from master key + unique salt per value
- Nonce: 96-bit random nonce per encryption
- Storage Format: Base64(nonce || ciphertext || auth_tag)
Key Management
ENCRYPTION_KEY (env var)
│
▼
HKDF-SHA256
│
┌───┴───┐
│ Salt │ (per-value, stored with ciphertext)
└───┬───┘
▼
Derived Key
│
▼
AES-256-GCM
Implementation
```rust
pub trait CredentialEncryptor: Send + Sync {
    fn encrypt(&self, plaintext: &str) -> Result<String, EncryptionError>;
    fn decrypt(&self, ciphertext: &str) -> Result<String, EncryptionError>;
}
```
Two implementations:
- `Aes256GcmEncryptor` - Production encryption
- `NoOpEncryptor` - Development mode (encryption disabled)
Encrypted Fields
| Table | Field | Contains |
|---|---|---|
| connectors | config.api_key | API keys |
| connectors | config.client_secret | OAuth secrets |
| settings | llm.api_key | LLM provider API key |
| notification_channels | config.webhook_url | Webhook URLs with tokens |
Consequences
Positive
- Credentials protected if database is compromised
- Authenticated encryption prevents tampering
- Per-value salt prevents rainbow table attacks
- Key rotation possible without re-encrypting all values
Negative
- Cannot search encrypted fields
- Master key must be securely managed
- Performance overhead for encryption/decryption
- Key loss = data loss (no recovery without key)
Security Considerations
- Key Storage: Use environment variable or secrets manager
- Key Rotation: Implement key versioning for rotation
- Audit: Log all decryption operations
- Memory: Clear sensitive data from memory after use
ADR-004: Session Management Strategy
Status
Accepted
Context
The dashboard requires user authentication with session management. We needed to decide between:
- JWT tokens (stateless)
- Server-side sessions (stateful)
- Hybrid approach
Requirements:
- Secure authentication for web dashboard
- Support for session revocation
- CSRF protection for form submissions
- Reasonable session lifetime
Decision
We chose server-side sessions stored in the database using tower-sessions:
Session Architecture
Browser Server
│ │
│ POST /auth/login │
│ (username, password) │
├───────────────────────────────►│
│ │ Validate credentials
│ │ Create session in DB
│ Set-Cookie: id=session_id │
│◄───────────────────────────────┤
│ │
│ GET /dashboard │
│ Cookie: id=session_id │
├───────────────────────────────►│
│ │ Load session from DB
│ │ Verify not expired
│ 200 OK │
│◄───────────────────────────────┤
Session Storage
Sessions are stored in the sessions table:
| Column | Type | Description |
|---|---|---|
| id | TEXT | Session ID (secure random) |
| data | BLOB | Encrypted session data |
| expiry_date | INTEGER | Unix timestamp |
Session Data
```rust
struct SessionData {
    user_id: Uuid,
    username: String,
    role: UserRole,
    login_csrf: String, // CSRF token for sensitive actions
}
```
Security Measures
- Secure Cookies: HttpOnly, Secure (in production), SameSite=Lax
- CSRF Protection: Token in session, validated on state-changing requests
- Session Expiry: 24-hour default, configurable
- Rotation: New session ID on privilege changes
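The expiry check against the `expiry_date` column is a simple timestamp comparison. A minimal sketch, assuming the 24-hour default lifetime; in the real code `tower-sessions` handles this internally, and the function names here are hypothetical.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// A session is valid while the current time is before its expiry (sketch).
fn session_valid(expiry_date: u64, now: u64) -> bool {
    now < expiry_date
}

/// Expiry for a fresh session: 24-hour default lifetime.
fn new_expiry(now: u64) -> u64 {
    now + 24 * 60 * 60
}

fn main() {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_secs();
    let expiry = new_expiry(now);
    assert!(session_valid(expiry, now));         // fresh session is valid
    assert!(!session_valid(expiry, expiry + 1)); // expired one second later
    println!("ok");
}
```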
Consequences
Positive
- Sessions can be revoked immediately
- No token size limits for session data
- CSRF tokens integrated naturally
- Easy to implement "logout all devices"
Negative
- Database read on every authenticated request
- Session table requires cleanup (expired sessions)
- Horizontal scaling requires shared database
- Slightly higher latency than JWTs
Comparison with JWTs
| Aspect | Sessions | JWTs |
|---|---|---|
| Revocation | Immediate | Requires blacklist |
| Storage | Server | Client |
| Scalability | Requires shared store | Stateless |
| Size | Cookie only | Full payload |
| Security | Keys in DB | Signature verification |
ADR-005: API Key Format and Security
Status
Accepted
Context
Triage Warden exposes a REST API that needs programmatic authentication. We needed to design an API key format that is:
- Secure against brute-force attacks
- Easily identifiable (for revocation)
- User-friendly for debugging
- Compatible with common tooling
Decision
We adopted a prefixed API key format similar to GitHub and Stripe:
Key Format
```
tw_<user_prefix>_<random_secret>
```

Example: `tw_abc12345_9f8e7d6c5b4a3210fedcba9876543210`
Components:
- `tw_` - Application prefix (identifies Triage Warden keys)
- `<user_prefix>` - First 8 chars for identification (stored in DB)
- `<random_secret>` - 32 bytes of cryptographic randomness
Storage
Only the hash is stored, never the raw key:
| Column | Value |
|---|---|
| key_prefix | tw_abc12345 (for lookup) |
| key_hash | SHA-256(full_key) |
Authentication Flow
1. Extract key from Authorization header
2. Parse prefix (first 11 chars)
3. Look up by prefix in database
4. Compute SHA-256 of provided key
5. Compare with stored hash (constant-time)
6. Check expiration and scopes
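Step 5's constant-time comparison can be sketched with a difference accumulator. This is an illustration of the technique; production code would typically use a vetted crate (e.g. `subtle`) rather than a hand-rolled version.

```rust
/// Constant-time equality for hash comparison (sketch of step 5).
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Accumulate differences instead of returning early, so timing does
    // not reveal how many leading bytes matched.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

fn main() {
    assert!(constant_time_eq(b"9f8e7d6c", b"9f8e7d6c"));
    assert!(!constant_time_eq(b"9f8e7d6c", b"9f8e7d6d")); // last byte differs
    assert!(!constant_time_eq(b"short", b"longer-value")); // length mismatch
    println!("ok");
}
```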
Key Generation
```rust
use rand::Rng;
use sha2::{Sha256, Digest};

fn generate_api_key(user_id: Uuid) -> (String, String, String) {
    let secret: [u8; 32] = rand::thread_rng().gen();
    let secret_hex = hex::encode(secret);
    let prefix = format!("tw_{}", &user_id.to_string()[..8]);
    let full_key = format!("{}_{}", prefix, secret_hex);
    let key_hash = hex::encode(Sha256::digest(full_key.as_bytes()));
    (full_key, prefix, key_hash) // Return key once, store prefix + hash
}
```
Consequences
Positive
- Keys are identifiable without exposing secrets
- Prefix enables efficient database lookup
- Format is familiar to developers
- Hash storage protects against database leaks
- Constant-time comparison prevents timing attacks
Negative
- Keys must be stored securely by users (cannot be recovered)
- Prefix lookup could reveal key existence (minor info leak)
- Longer keys than simple tokens
Security Properties
| Property | Implementation |
|---|---|
| Entropy | 256 bits (32 random bytes) |
| Storage | SHA-256 hash only |
| Comparison | Constant-time |
| Revocation | Delete from database |
| Expiration | Optional expiry_at field |
| Scopes | JSON array of allowed operations |
ADR-006: Operation Modes (Supervised/Autonomous)
Status
Accepted
Context
Security automation involves a trust spectrum from fully manual to fully autonomous. Organizations have different risk tolerances and regulatory requirements. We needed to support:
- Organizations starting with automation (cautious)
- Mature SOCs ready for autonomous response
- Gradual transition between modes
- Compliance with approval requirements
Decision
We implemented three operation modes configurable at the system level:
Modes
| Mode | Description | Default Approval |
|---|---|---|
| `supervised` | All actions require human approval | `require_approval` |
| `semi_autonomous` | Low-risk actions auto-approved, high-risk need approval | policy-based |
| `autonomous` | Actions auto-approved unless policy denies | `auto_approve` |
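The mode defaults in the table above can be sketched as a single match. The enums and the `mode_default` function are illustrative names, and "low risk" is reduced to a boolean for brevity; the real decision logic consults the policy engine first.

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    Supervised,
    SemiAutonomous,
    Autonomous,
}

#[derive(Debug, PartialEq)]
enum Decision {
    RequireApproval,
    AutoApprove,
}

/// Default decision when no explicit policy matches (sketch).
fn mode_default(mode: &Mode, low_risk: bool) -> Decision {
    match mode {
        Mode::Supervised => Decision::RequireApproval,
        Mode::SemiAutonomous if low_risk => Decision::AutoApprove,
        Mode::SemiAutonomous => Decision::RequireApproval,
        Mode::Autonomous => Decision::AutoApprove,
    }
}

fn main() {
    assert_eq!(mode_default(&Mode::Supervised, true), Decision::RequireApproval);
    assert_eq!(mode_default(&Mode::SemiAutonomous, true), Decision::AutoApprove);
    assert_eq!(mode_default(&Mode::SemiAutonomous, false), Decision::RequireApproval);
    assert_eq!(mode_default(&Mode::Autonomous, false), Decision::AutoApprove);
    println!("ok");
}
```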
Mode Selection Flow
Incoming Action
│
▼
┌─────────────────┐
│ Check Kill Switch│
└────────┬────────┘
│ (not active)
▼
┌─────────────────┐
│ Evaluate Policies│
└────────┬────────┘
│
┌────┴────┐
│ Explicit │
│ Policy? │
└────┬────┘
Yes │ No
│ │
│ ▼
│ ┌─────────────────┐
│ │ Apply Mode │
│ │ Default │
│ └────────┬────────┘
│ │
└────┬─────┘
│
▼
Final Decision
Policy Override
Policies can override mode defaults:
```yaml
policies:
  - name: "Block critical IPs always requires approval"
    condition: "action.type == 'block_ip' && target.is_critical"
    action: "require_approval"
    approval_level: "manager"
  - name: "Low severity lookups auto-approved"
    condition: "action.type == 'lookup' && incident.severity in ['info', 'low']"
    action: "auto_approve"
```
Configuration
```yaml
# config.yaml
general:
  mode: "supervised" # supervised | semi_autonomous | autonomous
```
Or via API:
```shell
curl -X PUT /api/settings/general \
  -d '{"mode": "semi_autonomous"}'
```
Consequences
Positive
- Flexible for different organizational needs
- Gradual automation adoption path
- Policies provide fine-grained control
- Easy to fall back to supervised mode
Negative
- More complex decision logic
- Potential for misconfiguration
- Requires clear documentation of behavior
- Audit trails must capture mode at decision time
Mode Comparison
| Scenario | Supervised | Semi-Auto | Autonomous |
|---|---|---|---|
| Block malware IP | Approval needed | Auto-approved | Auto-approved |
| Disable user | Approval needed | Approval needed | Auto-approved |
| Isolate host | Approval needed | Approval needed | Approval (policy) |
| Lookup IOC | Approval needed | Auto-approved | Auto-approved |
ADR-007: Kill Switch Design
Status
Accepted
Context
Autonomous security response systems pose risks if they malfunction:
- False positives could disable legitimate users/systems
- Bugs could trigger cascading actions
- Compromised AI could be weaponized
- External events may require immediate halt
We needed an emergency stop mechanism that is:
- Fast to activate (< 1 second)
- Globally effective
- Difficult to accidentally trigger
- Easy to recover from
Decision
We implemented a global kill switch with the following design:
Architecture
┌─────────────┐
│ Kill Switch │
│ State │
└──────┬──────┘
│
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Orchestrator │ │ Policy Engine │ │ Action Runner │
│ │ │ │ │ │
│ check() │ │ check() │ │ check() │
│ before │ │ before │ │ before │
│ processing │ │ evaluation │ │ execution │
└───────────────┘ └───────────────┘ └───────────────┘
State
```rust
pub struct KillSwitchStatus {
    pub active: bool,
    pub reason: Option<String>,
    pub activated_by: Option<String>,
    pub activated_at: Option<DateTime<Utc>>,
}
```
Check Points
The kill switch is checked at multiple points:
- Alert Processing: Before creating incidents from alerts
- Policy Evaluation: Before evaluating approval policies
- Action Execution: Before executing any response action
- Playbook Execution: Before running playbook stages
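The check-point pattern can be sketched with a shared atomic flag. A simplified illustration: the real `KillSwitchStatus` also records who activated it, when, and why, and the type and method names here are stand-ins.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared kill-switch flag checked before each processing stage (sketch).
#[derive(Clone)]
struct KillSwitch(Arc<AtomicBool>);

impl KillSwitch {
    fn new() -> Self {
        Self(Arc::new(AtomicBool::new(false)))
    }

    fn activate(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    /// Called at each check point; Err halts that pipeline stage.
    fn check(&self) -> Result<(), &'static str> {
        if self.0.load(Ordering::SeqCst) {
            Err("kill switch active")
        } else {
            Ok(())
        }
    }
}

fn main() {
    let ks = KillSwitch::new();
    assert!(ks.check().is_ok()); // normal operation
    ks.activate();               // emergency stop
    assert!(ks.check().is_err()); // every check point now refuses to proceed
    println!("ok");
}
```

Cloning the handle shares the same underlying flag, so one `activate()` is immediately visible at every check point.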
Activation
```
// Via API
POST /api/kill-switch/activate
{ "reason": "Investigating false positive surge", "activated_by": "[email protected]" }

// Via CLI
tw-cli kill-switch activate --reason "Emergency maintenance"

// Programmatic
kill_switch.activate("Anomaly detected", "system").await;
```
Deactivation
```
// Via API
POST /api/kill-switch/deactivate
{ "reason": "Issue resolved" }

// Only admins can deactivate
```
Event Notification
Activation triggers:
- `KillSwitchActivated` event to all subscribers
- Dashboard alert banner
- Notification to configured channels
Consequences
Positive
- Immediate halt of all automation
- Clear audit trail of activation/deactivation
- Multiple activation methods (UI, API, CLI)
- Visible status in all interfaces
Negative
- In-memory state (lost on restart, resets to inactive)
- No automatic activation triggers yet
- Single global switch (no per-action granularity)
- Requires admin access to deactivate
Future Enhancements
- Persistent State: Store kill switch state in database
- Auto-Activation: Trigger on anomaly detection
- Scoped Switches: Per-action-type or per-connector switches
- Scheduled Deactivation: Auto-deactivate after timeout
- Two-Person Rule: Require multiple admins for deactivation
Operational Procedures
When kill switch is activated:
- All pending actions remain pending
- New alerts create incidents but stop at enrichment
- Dashboard shows prominent warning banner
- Existing approved actions are NOT rolled back
To recover:
- Investigate root cause
- Fix underlying issue
- Deactivate kill switch
- Manually review pending actions
- Resume normal operations
Production Deployment
This section covers deploying Triage Warden in production environments.
Deployment Options
Triage Warden can be deployed in several ways:
- Docker - Recommended for most deployments. Quick setup with Docker Compose.
- Kubernetes - For orchestrated, scalable deployments using raw manifests.
- Helm Chart - Recommended for Kubernetes. Templated deployment with environment-specific values.
- Binary - Direct binary installation on Linux servers.
Before You Deploy
Before deploying to production, review:
- Production Checklist - Security and configuration requirements
- Configuration Reference - All environment variables and settings
- Database Setup - PostgreSQL configuration for production
- Security Hardening - TLS, secrets, network policies
- Scaling - Horizontal scaling considerations
Quick Start
For a quick production deployment with Docker:
```shell
# Clone the repository
git clone https://github.com/your-org/triage-warden.git
cd triage-warden/deploy/docker

# Configure environment
cp .env.example .env
# Edit .env with your settings

# Generate encryption key
echo "TW_ENCRYPTION_KEY=$(openssl rand -base64 32)" >> .env

# Start services
docker compose -f docker-compose.prod.yml up -d
```
Architecture Overview
A typical production deployment includes:
┌─────────────────┐
│ Load Balancer │
│ (TLS term.) │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Triage │ │ Triage │ │ Triage │
│ Warden │ │ Warden │ │ Warden │
│ Instance 1│ │ Instance 2│ │ Instance 3│
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└──────────────┼──────────────┘
│
┌────────▼────────┐
│ PostgreSQL │
│ (Primary) │
└─────────────────┘
Support
For deployment assistance:
- Check the Troubleshooting Guide
- Review GitHub Issues
- Contact support at [email protected]
Production Checklist
Complete this checklist before deploying Triage Warden to production.
Security Requirements
Authentication & Secrets
- Encryption key configured: Set `TW_ENCRYPTION_KEY` with a 32-byte base64-encoded key (generate with `openssl rand -base64 32`)
- JWT secret configured: Set `TW_JWT_SECRET` with a strong random value (generate with `openssl rand -hex 32`)
- Session secret configured: Set `TW_SESSION_SECRET` for session encryption
- Default admin password changed: Change the default admin credentials immediately after first login
- API keys use scoped permissions: Don't create API keys with the `*` scope in production
Network Security
- TLS enabled: All traffic should use HTTPS
- TLS certificates valid: Use certificates from a trusted CA (not self-signed)
- Internal traffic encrypted: Database connections use TLS
- Firewall rules configured: Only expose necessary ports (443 for HTTPS)
- Rate limiting enabled: Protect against brute force attacks
Database Security
- PostgreSQL in production: Don't use SQLite for production workloads
- Database user has minimal permissions: Use a dedicated user, not superuser
- Database connections encrypted: Enable `sslmode=require` or `sslmode=verify-full`
- Regular backups configured: Automated daily backups with tested restore procedure
Configuration Requirements
Required Environment Variables
| Variable | Description | Example |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | `postgres://user:pass@host:5432/triage_warden?sslmode=require` |
| `TW_ENCRYPTION_KEY` | Credential encryption key (32 bytes, base64) | `K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72...` |
| `TW_JWT_SECRET` | JWT signing secret | `your-256-bit-secret` |
| `TW_SESSION_SECRET` | Session encryption secret | `another-secret-value` |
| `RUST_LOG` | Log level | `info` or `triage_warden=debug` |
Optional but Recommended
| Variable | Description | Default |
|---|---|---|
| `TW_BIND_ADDRESS` | Server bind address | `0.0.0.0:8080` |
| `TW_BASE_URL` | Public URL for callbacks | `https://triage.example.com` |
| `TW_TRUSTED_PROXIES` | Comma-separated proxy IPs | None |
| `TW_MAX_REQUEST_SIZE` | Maximum request body size | 10MB |
LLM Configuration (if using AI features)
- LLM API key configured: Set via UI or environment variable
- Rate limits configured: Prevent runaway API costs
- Model selected appropriately: Balance cost vs. capability
Infrastructure Requirements
Minimum Hardware
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 2 cores | 4 cores |
| RAM | 2 GB | 4 GB |
| Storage | 20 GB | 50 GB SSD |
Database Requirements
| Metric | Minimum | Recommended |
|---|---|---|
| PostgreSQL Version | 14 | 15+ |
| Connections | 20 | 50+ |
| Storage | 10 GB | 50 GB+ |
Network Requirements
- Outbound HTTPS (443) to:
- LLM provider (api.openai.com, api.anthropic.com)
- Configured connectors (VirusTotal, Jira, etc.)
- Inbound HTTPS (443) from:
- Users accessing the dashboard
- Webhook sources (SIEM, EDR systems)
Monitoring & Observability
Health Checks
- Health endpoint accessible: `GET /health` returns component status
- Readiness probe configured: `GET /ready` for the load balancer
- Liveness probe configured: `GET /live` for container orchestration
Metrics & Logging
- Prometheus metrics exposed: `GET /metrics` endpoint enabled
- Log aggregation configured: Logs shipped to a central system
- Alerting rules configured: Alerts for critical failures
Recommended Alerts
| Alert | Condition | Severity |
|---|---|---|
| Service Down | /health returns unhealthy for 5m | Critical |
| Database Connection Failed | Database component unhealthy | Critical |
| Kill Switch Active | Kill switch activated | Warning |
| High Error Rate | >5% HTTP 5xx responses | Warning |
| Connector Unhealthy | Any connector in error state | Warning |
| LLM API Errors | LLM requests failing | Warning |
Operational Readiness
Documentation
- Runbooks available: Team has access to operational runbooks
- Contact list current: On-call rotation and escalation paths defined
- Recovery procedures tested: Backup restore verified within last 30 days
Access Control
- Admin accounts audited: Remove unnecessary admin users
- API keys audited: Revoke unused or over-privileged keys
- Audit logging enabled: User actions are logged
Backup & Recovery
- Database backups automated: Daily backups with 30-day retention
- Backup encryption enabled: Backups encrypted at rest
- Recovery time objective defined: Team knows target RTO
- Recovery procedure documented: Step-by-step restore guide exists
Pre-Launch Testing
Functional Tests
- User login works with configured auth
- Incidents can be created via webhook
- Playbooks execute correctly
- Connectors authenticate successfully
- Notifications are delivered
Load Testing
- Tested with expected concurrent users
- Tested with expected webhook volume
- Response times acceptable under load
Failover Testing
- Application recovers from database restart
- Application handles LLM API failures gracefully
- Kill switch stops all automation when activated
Sign-Off
| Role | Name | Date | Signature |
|---|---|---|---|
| Security Review | |||
| Operations Review | |||
| Development Lead |
Quick Validation Commands
# Check health endpoint
curl -s https://triage.example.com/health | jq
# Verify TLS certificate
openssl s_client -connect triage.example.com:443 -servername triage.example.com
# Test database connectivity (from application)
curl -s https://triage.example.com/health/detailed | jq '.components.database'
# Verify all connectors healthy
curl -s https://triage.example.com/health/detailed | jq '.components.connectors'
Docker Deployment
Deploy Triage Warden using Docker and Docker Compose.
Prerequisites
- Docker Engine 20.10+
- Docker Compose v2.0+
- 2 GB RAM minimum for the basic setup (4 GB+ for HA)
- 20 GB disk space
Overview
Triage Warden provides several Docker Compose configurations:
| File | Purpose | Use Case |
|---|---|---|
| docker-compose.yml | Basic setup | Quick start, single instance |
| docker-compose.dev.yml | Development | Local development with hot reload |
| docker-compose.prod.yml | Production | Single host with resource limits and restart policies |
| docker-compose.ha.yml | High Availability | HA testing, multi-instance |
Quick Start
# Clone the repository
git clone https://github.com/your-org/triage-warden.git
cd triage-warden/deploy/docker
# Copy and configure environment
cp .env.example .env
# Generate required secrets
echo "TW_ENCRYPTION_KEY=$(openssl rand -base64 32)" >> .env
echo "TW_JWT_SECRET=$(openssl rand -hex 32)" >> .env
echo "TW_SESSION_SECRET=$(openssl rand -hex 32)" >> .env
echo "POSTGRES_PASSWORD=$(openssl rand -hex 16)" >> .env
# Start services
docker compose up -d
# Check status
docker compose ps
docker compose logs -f triage-warden
Access the dashboard at http://localhost:8080
Default credentials: admin / admin (change immediately!)
Configuration
Environment Variables
Edit .env file with your configuration:
# Database
POSTGRES_USER=triage_warden
POSTGRES_PASSWORD=your-secure-password
POSTGRES_DB=triage_warden
DATABASE_URL=postgres://triage_warden:your-secure-password@postgres:5432/triage_warden
# Application
TW_BIND_ADDRESS=0.0.0.0:8080
TW_BASE_URL=https://triage.example.com
TW_ENCRYPTION_KEY=your-32-byte-base64-key
TW_JWT_SECRET=your-jwt-secret
TW_SESSION_SECRET=your-session-secret
# Logging
RUST_LOG=info
# LLM (optional)
OPENAI_API_KEY=sk-...
# or
ANTHROPIC_API_KEY=sk-ant-...
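Before running `docker compose up`, it is worth checking that .env defines every required variable. A minimal sketch, run against a throwaway file here so it is self-contained; point envfile at your real .env in practice:

```shell
# Verify a .env file defines all required variables. A sample file is
# written to a temp path for illustration only.
envfile="$(mktemp)"
cat > "$envfile" << 'EOF'
DATABASE_URL=postgres://triage_warden:pw@postgres:5432/triage_warden
TW_ENCRYPTION_KEY=placeholder
TW_JWT_SECRET=placeholder
TW_SESSION_SECRET=placeholder
EOF
missing=0
for var in DATABASE_URL TW_ENCRYPTION_KEY TW_JWT_SECRET TW_SESSION_SECRET; do
  grep -q "^${var}=" "$envfile" || { echo "missing: $var" >&2; missing=1; }
done
[ "$missing" -eq 0 ] && echo "env ok"
rm -f "$envfile"
```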
Production Configuration
For production, use docker-compose.prod.yml:
docker compose -f docker-compose.prod.yml up -d
Key differences from development:
- Uses external PostgreSQL volume for data persistence
- Enables health checks
- Sets resource limits
- Configures restart policies
Docker Compose Files
Basic (docker-compose.yml)
version: '3.8'
services:
triage-warden:
image: ghcr.io/your-org/triage-warden:latest
ports:
- "8080:8080"
environment:
- DATABASE_URL=${DATABASE_URL}
- TW_ENCRYPTION_KEY=${TW_ENCRYPTION_KEY}
- TW_JWT_SECRET=${TW_JWT_SECRET}
- TW_SESSION_SECRET=${TW_SESSION_SECRET}
- RUST_LOG=${RUST_LOG:-info}
depends_on:
postgres:
condition: service_healthy
postgres:
image: postgres:15-alpine
environment:
- POSTGRES_USER=${POSTGRES_USER}
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
- POSTGRES_DB=${POSTGRES_DB}
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
interval: 5s
timeout: 5s
retries: 5
volumes:
postgres_data:
Production (docker-compose.prod.yml)
version: '3.8'
services:
triage-warden:
image: ghcr.io/your-org/triage-warden:latest
ports:
- "8080:8080"
environment:
- DATABASE_URL=${DATABASE_URL}
- TW_ENCRYPTION_KEY=${TW_ENCRYPTION_KEY}
- TW_JWT_SECRET=${TW_JWT_SECRET}
- TW_SESSION_SECRET=${TW_SESSION_SECRET}
- TW_BASE_URL=${TW_BASE_URL}
- RUST_LOG=${RUST_LOG:-info}
depends_on:
postgres:
condition: service_healthy
restart: unless-stopped
deploy:
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '0.5'
memory: 512M
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/live"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
postgres:
image: postgres:15-alpine
environment:
- POSTGRES_USER=${POSTGRES_USER}
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
- POSTGRES_DB=${POSTGRES_DB}
volumes:
- postgres_data:/var/lib/postgresql/data
- ./init-scripts:/docker-entrypoint-initdb.d:ro
restart: unless-stopped
deploy:
resources:
limits:
cpus: '1'
memory: 1G
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
interval: 10s
timeout: 5s
retries: 5
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
volumes:
postgres_data:
external: true
name: triage_warden_postgres  # create once with: docker volume create triage_warden_postgres
High Availability Testing
The HA configuration runs multiple instances for testing distributed features locally before deploying to Kubernetes.
Architecture
┌─────────────┐
│ Traefik │
│ (LB) │
└──────┬──────┘
│
┌───────────────┼───────────────┐
│ │ │
┌──────▼─────┐ ┌──────▼─────┐ ┌──────▼──────┐
│ API-1 │ │ API-2 │ │ API-N │
│ (serve) │ │ (serve) │ │ (serve) │
└──────┬─────┘ └──────┬─────┘ └──────┬──────┘
│ │ │
└───────────────┼───────────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐           ┌───────────┐           ┌────────────┐
│  Redis   │◄─────────►│ PostgreSQL│◄─────────►│Orchestrator│
│(MQ/Cache)│           │   (DB)    │           │ (1 leader) │
└──────────┘           └───────────┘           └────────────┘
Starting HA Stack
# Navigate to deploy directory
cd deploy/docker
# Configure environment
cp .env.example .env
# Edit .env with required values
# Start all services
docker compose -f docker-compose.ha.yml up -d
# Start with monitoring stack
docker compose -f docker-compose.ha.yml --profile monitoring up -d
Accessing Services
| Service | URL | Description |
|---|---|---|
| API (Load Balanced) | http://localhost:8080 | Main application endpoint |
| Traefik Dashboard | http://localhost:8081 | Load balancer metrics |
| Prometheus | http://localhost:9090 | Metrics (with monitoring profile) |
| Grafana | http://localhost:3000 | Dashboards (admin/admin) |
| PostgreSQL | localhost:5432 | Database (for debugging) |
| Redis | localhost:6379 | Cache/MQ (for debugging) |
Verifying HA Behavior
# Check all instances are healthy
curl -s http://localhost:8080/health | jq
# Check load balancing (run multiple times)
for i in {1..10}; do
curl -s http://localhost:8080/health | jq -r '.instance_id // "unknown"'
done
# Check leader election
curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'
# Simulate failure - stop one API instance
docker stop tw-api-1
# Verify traffic still flows
curl -s http://localhost:8080/health
# Restart the instance
docker start tw-api-1
Testing Orchestrator Failover
# Check which orchestrator is leader
docker exec tw-orchestrator-1 curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'
docker exec tw-orchestrator-2 curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'
# Stop the leader
docker stop tw-orchestrator-1
# Verify failover (second orchestrator becomes leader)
sleep 5
docker exec tw-orchestrator-2 curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'
# Restart original
docker start tw-orchestrator-1
Building the Image
To build the Docker image locally:
# From repository root
docker build -t triage-warden:local -f deploy/docker/Dockerfile .
# Build with no cache
docker compose -f docker-compose.ha.yml build --no-cache
# Build specific service
docker compose -f docker-compose.ha.yml build api-1
# Use local image
# In docker-compose.yml, change:
# image: ghcr.io/your-org/triage-warden:latest
# to:
# image: triage-warden:local
Persistent Storage
Volume Management
# List volumes
docker volume ls | grep triage
# Backup PostgreSQL
docker exec tw-postgres pg_dump -U triage_warden triage_warden > backup.sql
# Restore PostgreSQL
docker exec -i tw-postgres psql -U triage_warden triage_warden < backup.sql
# Backup Redis
docker exec tw-redis redis-cli BGSAVE
docker cp tw-redis:/data/dump.rdb ./redis-backup.rdb
Cleaning Up
# Stop services
docker compose -f docker-compose.ha.yml down
# Stop and remove volumes (WARNING: deletes all data)
docker compose -f docker-compose.ha.yml down -v
# Remove only unused volumes
docker volume prune
Common Operations
View Logs
# All services
docker compose logs -f
# Specific service
docker compose logs -f triage-warden
# Last 100 lines
docker compose logs --tail=100 triage-warden
# With timestamps
docker compose -f docker-compose.ha.yml logs -f --timestamps
Restart Services
# Restart all
docker compose restart
# Restart specific service
docker compose restart triage-warden
Update to New Version
# Pull new images
docker compose pull
# Recreate containers
docker compose up -d
# Verify update
docker compose ps
curl http://localhost:8080/health | jq '.version'
Database Operations
# Create backup
docker compose exec postgres pg_dump -U triage_warden triage_warden > backup.sql
# Restore backup
docker compose exec -T postgres psql -U triage_warden triage_warden < backup.sql
# Access database shell
docker compose exec postgres psql -U triage_warden triage_warden
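The backup command above can be wrapped in a dated script that also enforces the 30-day retention target from the production checklist. A sketch, using the service and user names from this guide:

```shell
# Nightly backup sketch: dated dump plus 30-day retention cleanup.
backup_dir="./backups"
mkdir -p "$backup_dir"
backup_file="$backup_dir/triage_warden_$(date +%Y%m%d_%H%M%S).sql"
docker compose exec -T postgres pg_dump -U triage_warden triage_warden > "$backup_file"
# An empty file means the dump failed; check before trusting it.
[ -s "$backup_file" ] || echo "warning: dump is empty" >&2
# Remove dumps older than 30 days
find "$backup_dir" -name 'triage_warden_*.sql' -mtime +30 -delete
```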
Debug Mode
Enable debug logging:
# In .env file
RUST_LOG=debug,triage_warden=trace,tw_api=trace,tw_core=trace
TW_LOG_FORMAT=pretty # Human-readable format
Inspecting Containers
# Shell access
docker exec -it tw-api-1 /bin/sh
# Check process status
docker exec tw-api-1 ps aux
# Check network connectivity
docker exec tw-api-1 curl -v telnet://postgres:5432
docker exec tw-api-1 curl -v telnet://redis:6379
Resource Limits
The HA configuration includes resource limits suitable for local testing:
| Service | CPU Limit | Memory Limit |
|---|---|---|
| API | 1 core | 512MB |
| Orchestrator | 1.5 cores | 1GB |
| PostgreSQL | 1 core | 1GB |
| Redis | 0.5 core | 512MB |
| Traefik | 0.5 core | 256MB |
Adjust in docker-compose.ha.yml under deploy.resources.
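Summing the memory limits above gives the ceiling one replica of each service can consume; multiply the API and orchestrator rows by your replica counts when sizing the host:

```shell
# Memory ceiling (MB) for one replica of each HA service, from the table above.
api=512; orchestrator=1024; postgres=1024; redis=512; traefik=256
total=$(( api + orchestrator + postgres + redis + traefik ))
echo "single-replica memory ceiling: ${total} MB"
```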
TLS Configuration
For production, use a reverse proxy (nginx, Traefik, Caddy) for TLS termination:
With Traefik
# Add to docker-compose.prod.yml
services:
traefik:
image: traefik:v2.10
command:
- "--providers.docker=true"
- "--entrypoints.websecure.address=:443"
- "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
- "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
- "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
ports:
- "443:443"
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro
- letsencrypt:/letsencrypt
triage-warden:
labels:
- "traefik.enable=true"
- "traefik.http.routers.triage.rule=Host(`triage.example.com`)"
- "traefik.http.routers.triage.entrypoints=websecure"
- "traefik.http.routers.triage.tls.certresolver=letsencrypt"
volumes:
letsencrypt:
Troubleshooting
Container Won't Start
# Check logs for errors
docker compose logs triage-warden
# Common issues:
# - DATABASE_URL not set or incorrect
# - TW_ENCRYPTION_KEY missing
# - PostgreSQL not ready (check depends_on health)
Database Connection Failed
# Verify PostgreSQL is running
docker compose ps postgres
# Check PostgreSQL logs
docker compose logs postgres
# Test connection
docker compose exec postgres pg_isready -U triage_warden
# Verify connection from app container
docker exec tw-api-1 curl -v telnet://postgres:5432
Port Conflicts
# Find process using port 8080
lsof -i :8080
# Use different ports
# In docker-compose.ha.yml or via environment:
# - "8090:80" instead of "8080:80"
Container Exits Immediately
# Check exit code and logs
docker compose -f docker-compose.ha.yml logs api-1
# Common causes:
# - Missing environment variables
# - Database not ready
# - Invalid configuration
Redis Connection Issues
# Test Redis connectivity
docker exec tw-api-1 curl -v telnet://redis:6379
# Check Redis logs
docker compose -f docker-compose.ha.yml logs redis
# Connect to Redis CLI
docker exec -it tw-redis redis-cli ping
Out of Memory
# Check container memory usage
docker stats
# Increase limits in docker-compose.prod.yml
deploy:
resources:
limits:
memory: 4G # Increase from 2G
Next Steps
- Configure connectors
- Set up notifications
- Create playbooks
- Set up monitoring
- Deploy to Kubernetes using raw manifests
- Deploy with Helm for templated Kubernetes deployments
Kubernetes Deployment Guide
This guide covers deploying Triage Warden to Kubernetes, with a Helm quick start and a raw-manifest alternative. For full chart details, see the Helm Chart guide.
Prerequisites
Before deploying, ensure you have:
- Kubernetes cluster version 1.25 or later
- kubectl configured with cluster access
- Helm 3.8+ (see Helm Chart for Helm-based deployment)
- Container registry access to pull Triage Warden images
- PostgreSQL database (managed or self-hosted)
- Redis (optional, required for HA deployments)
Optional Prerequisites
- Ingress controller (nginx-ingress or Traefik recommended)
- cert-manager for automatic TLS certificate management
- Prometheus Operator for metrics and alerting
Quick Start with Helm
1. Add the Helm Repository
# Add the Triage Warden Helm repository
helm repo add triage-warden https://charts.triage-warden.io
helm repo update
2. Create Namespace
kubectl create namespace triage-warden
3. Create Secrets
Generate required secrets before deployment:
# Generate encryption keys
export TW_ENCRYPTION_KEY=$(openssl rand -base64 32)
export TW_JWT_SECRET=$(openssl rand -hex 32)
export TW_SESSION_SECRET=$(openssl rand -hex 32)
# Create Kubernetes secret
kubectl create secret generic triage-warden-secrets \
--namespace triage-warden \
--from-literal=TW_ENCRYPTION_KEY="$TW_ENCRYPTION_KEY" \
--from-literal=TW_JWT_SECRET="$TW_JWT_SECRET" \
--from-literal=TW_SESSION_SECRET="$TW_SESSION_SECRET" \
--from-literal=DATABASE_URL="postgres://user:password@postgres:5432/triage_warden"
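Secret values come back base64-encoded when read from the cluster. The round-trip is simulated below without a cluster; the commented kubectl line shows the live equivalent:

```shell
# Kubernetes Secret data is stored base64-encoded; decoding must
# round-trip exactly. Simulated locally, no cluster required:
plain='postgres://user:password@postgres:5432/triage_warden'
encoded="$(printf '%s' "$plain" | base64 | tr -d '\n')"
decoded="$(printf '%s' "$encoded" | base64 -d)"
[ "$decoded" = "$plain" ] && echo "round-trip ok"
# Against the cluster:
# kubectl get secret triage-warden-secrets -n triage-warden \
#   -o jsonpath='{.data.DATABASE_URL}' | base64 -d
```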
4. Install Triage Warden
# Basic installation
helm install triage-warden triage-warden/triage-warden \
--namespace triage-warden \
--set global.domain=triage.example.com
# Installation with custom values
helm install triage-warden triage-warden/triage-warden \
--namespace triage-warden \
--values values-production.yaml
5. Verify Deployment
# Check pod status
kubectl get pods -n triage-warden
# Check service status
kubectl get svc -n triage-warden
# View logs
kubectl logs -n triage-warden -l app.kubernetes.io/name=triage-warden -f
Helm Configuration
Minimal Production Values
Create a values-production.yaml file:
# values-production.yaml
global:
domain: triage.example.com
api:
replicas: 2
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2000m
memory: 2Gi
orchestrator:
replicas: 2
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2000m
memory: 2Gi
postgresql:
# Use external database
enabled: false
external:
host: postgres.example.com
port: 5432
database: triage_warden
existingSecret: triage-warden-secrets
existingSecretPasswordKey: DATABASE_PASSWORD
redis:
enabled: true
architecture: standalone
auth:
enabled: true
existingSecret: triage-warden-secrets
existingSecretPasswordKey: REDIS_PASSWORD
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
tls:
- secretName: triage-warden-tls
hosts:
- triage.example.com
monitoring:
enabled: true
serviceMonitor:
enabled: true
Common Configuration Options
| Parameter | Description | Default |
|---|---|---|
| api.replicas | Number of API server replicas | 2 |
| orchestrator.replicas | Number of orchestrator replicas | 2 |
| image.repository | Container image repository | ghcr.io/triage-warden/triage-warden |
| image.tag | Container image tag | latest |
| ingress.enabled | Enable ingress | true |
| postgresql.enabled | Deploy PostgreSQL | true |
| redis.enabled | Deploy Redis | true |
| monitoring.enabled | Enable monitoring | true |
Manual Deployment (Without Helm)
If you prefer to use raw Kubernetes manifests:
Architecture
┌─────────────────┐
│ Ingress │
│ (TLS + routing)│
└────────┬────────┘
│
┌────────────────┼────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Pod │ │ Pod │ │ Pod │
│ replica │ │ replica │ │ replica │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└────────────────┼────────────────┘
│
┌────────▼────────┐
│ Service │
│ (ClusterIP) │
└────────┬────────┘
│
┌────────▼────────┐
│ PostgreSQL │
│ (StatefulSet) │
└─────────────────┘
Manifests
Namespace
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: triage-warden
labels:
app.kubernetes.io/name: triage-warden
Secret
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
name: triage-warden-secrets
namespace: triage-warden
type: Opaque
stringData:
# Generate these values securely!
# encryption-key: $(openssl rand -base64 32)
# jwt-secret: $(openssl rand -hex 32)
# session-secret: $(openssl rand -hex 32)
encryption-key: "REPLACE_WITH_BASE64_32_BYTE_KEY"
jwt-secret: "REPLACE_WITH_JWT_SECRET"
session-secret: "REPLACE_WITH_SESSION_SECRET"
database-url: "postgres://triage_warden:password@postgres-postgresql:5432/triage_warden"
ConfigMap
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: triage-warden-config
namespace: triage-warden
data:
RUST_LOG: "info"
TW_BIND_ADDRESS: "0.0.0.0:8080"
TW_BASE_URL: "https://triage.example.com"
Deployment
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: triage-warden
namespace: triage-warden
labels:
app.kubernetes.io/name: triage-warden
app.kubernetes.io/component: server
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/name: triage-warden
template:
metadata:
labels:
app.kubernetes.io/name: triage-warden
spec:
serviceAccountName: triage-warden
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: triage-warden
image: ghcr.io/your-org/triage-warden:latest
imagePullPolicy: Always
ports:
- name: http
containerPort: 8080
protocol: TCP
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: triage-warden-secrets
key: database-url
- name: TW_ENCRYPTION_KEY
valueFrom:
secretKeyRef:
name: triage-warden-secrets
key: encryption-key
- name: TW_JWT_SECRET
valueFrom:
secretKeyRef:
name: triage-warden-secrets
key: jwt-secret
- name: TW_SESSION_SECRET
valueFrom:
secretKeyRef:
name: triage-warden-secrets
key: session-secret
envFrom:
- configMapRef:
name: triage-warden-config
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 1000m
memory: 1Gi
livenessProbe:
httpGet:
path: /live
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: http
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
Service
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: triage-warden
namespace: triage-warden
labels:
app.kubernetes.io/name: triage-warden
spec:
type: ClusterIP
ports:
- port: 80
targetPort: http
protocol: TCP
name: http
selector:
app.kubernetes.io/name: triage-warden
Ingress
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: triage-warden
namespace: triage-warden
annotations:
kubernetes.io/ingress.class: nginx
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
tls:
- hosts:
- triage.example.com
secretName: triage-warden-tls
rules:
- host: triage.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: triage-warden
port:
number: 80
ServiceAccount
# serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: triage-warden
namespace: triage-warden
HorizontalPodAutoscaler
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: triage-warden
namespace: triage-warden
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: triage-warden
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
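The HPA computes desired replicas as ceil(currentReplicas × currentUtilization / target). With the 70% CPU target above, three replicas averaging 90% CPU scale out to four:

```shell
# HPA scaling math: desired = ceil(current * utilization / target).
# Integer ceiling via (a + b - 1) / b.
current=3; utilization=90; target=70
desired=$(( (current * utilization + target - 1) / target ))
echo "desired replicas: $desired"
```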
PodDisruptionBudget
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: triage-warden
namespace: triage-warden
spec:
minAvailable: 1
selector:
matchLabels:
app.kubernetes.io/name: triage-warden
Apply Manifests
kubectl apply -f deploy/kubernetes/namespace.yaml
kubectl apply -f deploy/kubernetes/secret.yaml
kubectl apply -f deploy/kubernetes/configmap.yaml
kubectl apply -f deploy/kubernetes/serviceaccount.yaml
kubectl apply -f deploy/kubernetes/deployment.yaml
kubectl apply -f deploy/kubernetes/service.yaml
kubectl apply -f deploy/kubernetes/ingress.yaml
kubectl apply -f deploy/kubernetes/hpa.yaml
kubectl apply -f deploy/kubernetes/pdb.yaml
kubectl apply -f deploy/kubernetes/servicemonitor.yaml
High Availability Configuration
For production HA deployments:
API Server HA
The API servers are stateless and can be scaled horizontally:
api:
replicas: 3
podAntiAffinity:
enabled: true
topologyKey: kubernetes.io/hostname
topologySpreadConstraints:
enabled: true
maxSkew: 1
Orchestrator HA
Orchestrators use leader election to coordinate singleton tasks:
orchestrator:
replicas: 2
leaderElection:
enabled: true
leaseDuration: 15s
renewDeadline: 10s
retryPeriod: 2s
Pod Disruption Budget
Ensure availability during updates:
podDisruptionBudget:
enabled: true
minAvailable: 1
Database Setup
Using Helm (PostgreSQL)
# Add Bitnami repo
helm repo add bitnami https://charts.bitnami.com/bitnami
# Install PostgreSQL
helm install postgres bitnami/postgresql \
--namespace triage-warden \
--set auth.username=triage_warden \
--set auth.password=your-secure-password \
--set auth.database=triage_warden \
--set primary.persistence.size=20Gi
Using External Database
Update the secret with your external database URL:
kubectl create secret generic triage-warden-secrets \
--namespace triage-warden \
--from-literal=database-url="postgres://user:[email protected]:5432/triage_warden?sslmode=require" \
# ... other secrets
Monitoring
ServiceMonitor (Prometheus)
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: triage-warden
namespace: triage-warden
spec:
selector:
matchLabels:
app.kubernetes.io/name: triage-warden
endpoints:
- port: http
path: /metrics
interval: 30s
PrometheusRule (Alerts)
# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: triage-warden
namespace: triage-warden
spec:
groups:
- name: triage-warden
rules:
- alert: TriageWardenDown
expr: up{job="triage-warden"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Triage Warden is down"
description: "Triage Warden has been down for more than 5 minutes."
- alert: TriageWardenHighErrorRate
expr: rate(http_requests_total{job="triage-warden",status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate in Triage Warden"
Upgrading
Helm Upgrade
# Check current version
helm list -n triage-warden
# Upgrade to new version
helm upgrade triage-warden triage-warden/triage-warden \
--namespace triage-warden \
--values values-production.yaml \
--set image.tag=v1.1.0
# Monitor the rollout
kubectl rollout status deployment/triage-warden-api -n triage-warden
Rollback
# View release history
helm history triage-warden -n triage-warden
# Rollback to previous version
helm rollback triage-warden 1 -n triage-warden
Database Migrations
Triage Warden automatically runs database migrations on startup. For manual control:
# Run migrations manually
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
triage-warden migrate
# Check migration status
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
triage-warden migrate --status
TLS Configuration
Using cert-manager
ingress:
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
tls:
- secretName: triage-warden-tls
hosts:
- triage.example.com
Manual TLS Secret
kubectl create secret tls triage-warden-tls \
--namespace triage-warden \
--cert=tls.crt \
--key=tls.key
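Before creating the TLS secret, confirm tls.crt and tls.key actually pair by comparing public-key digests. A throwaway self-signed pair is generated here so the snippet is self-contained:

```shell
# Generate a throwaway self-signed pair, then verify cert and key match
# by comparing their public-key SHA-256 digests.
workdir="$(mktemp -d)"
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=triage.example.com" \
  -keyout "$workdir/tls.key" -out "$workdir/tls.crt" 2>/dev/null
cert_digest="$(openssl x509 -in "$workdir/tls.crt" -pubkey -noout | openssl sha256)"
key_digest="$(openssl pkey -in "$workdir/tls.key" -pubout 2>/dev/null | openssl sha256)"
[ "$cert_digest" = "$key_digest" ] && echo "cert and key match"
rm -rf "$workdir"
```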
Security Hardening
Network Policy
# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: triage-warden
namespace: triage-warden
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: triage-warden
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app.kubernetes.io/name: postgresql
ports:
- protocol: TCP
port: 5432
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- to: # External APIs (LLM, connectors)
- ipBlock:
cidr: 0.0.0.0/0
except:
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
ports:
- protocol: TCP
port: 443
Troubleshooting
Pod Not Starting
# Check pod events
kubectl describe pod -n triage-warden -l app.kubernetes.io/name=triage-warden
# Check logs
kubectl logs -n triage-warden -l app.kubernetes.io/name=triage-warden --previous
# Common issues:
# - ImagePullBackOff: Check image name and registry credentials
# - CrashLoopBackOff: Check logs for startup errors
# - Pending: Check resource requests and node capacity
Database Connection Issues
# Test database connectivity from a pod
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
curl -v telnet://postgres:5432
# Check database URL
kubectl get secret triage-warden-secrets -n triage-warden -o jsonpath="{.data['database-url']}" | base64 -d
Health Check Failures
# Check liveness endpoint
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
curl -s http://localhost:8080/live
# Check readiness endpoint
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
curl -s http://localhost:8080/ready
# Check detailed health
kubectl exec -it deployment/triage-warden-api -n triage-warden -- \
curl -s http://localhost:8080/health/detailed | jq
Leader Election Issues
# Check which instance is the leader (repeat for each orchestrator pod)
kubectl get pods -n triage-warden -l app.kubernetes.io/name=triage-warden
kubectl exec -it <orchestrator-pod> -n triage-warden -- \
  curl -s http://localhost:8080/health/detailed | jq '.components.leader_elector'
# Check leader lease in Redis
kubectl exec -it <redis-pod> -n triage-warden -- \
  redis-cli KEYS "tw:leader:*"
Performance Issues
# Check resource usage
kubectl top pods -n triage-warden
# Check HPA status
kubectl get hpa -n triage-warden
# View Prometheus metrics
kubectl port-forward svc/prometheus -n monitoring 9090:9090
Ingress Not Working
# Check ingress
kubectl describe ingress triage-warden -n triage-warden
# Check TLS secret
kubectl get secret triage-warden-tls -n triage-warden
# Check ingress controller logs
kubectl logs -l app.kubernetes.io/name=ingress-nginx -n ingress-nginx
Operations
View Logs
# All pods
kubectl logs -l app.kubernetes.io/name=triage-warden -n triage-warden -f
# Specific pod
kubectl logs -f deployment/triage-warden -n triage-warden
# Previous container (after crash)
kubectl logs deployment/triage-warden -n triage-warden --previous
Scale Deployment
# Manual scale
kubectl scale deployment triage-warden -n triage-warden --replicas=5
# Check HPA status
kubectl get hpa -n triage-warden
Rolling Update
# Update image
kubectl set image deployment/triage-warden \
triage-warden=ghcr.io/your-org/triage-warden:v1.2.0 \
-n triage-warden
# Watch rollout
kubectl rollout status deployment/triage-warden -n triage-warden
# Rollback if needed
kubectl rollout undo deployment/triage-warden -n triage-warden
Uninstalling
Helm Uninstall
# Uninstall Triage Warden
helm uninstall triage-warden -n triage-warden
# Delete namespace (optional, removes all resources)
kubectl delete namespace triage-warden
# Delete PVCs if needed
kubectl delete pvc -n triage-warden --all
Next Steps
- Configure monitoring and alerting
- Set up horizontal scaling
- Review configuration options
Helm Chart Deployment
Deploy Triage Warden to Kubernetes using the bundled Helm chart. This is the recommended approach for Kubernetes deployments, providing templated manifests with environment-specific value overrides.
The chart lives at deploy/helm/ in the repository.
Prerequisites
- Kubernetes 1.25+
- Helm 3.8+
- External PostgreSQL database (required)
- External Redis (optional, required for HA deployments)
- Ingress controller (nginx recommended)
- cert-manager (for automatic TLS)
- Prometheus Operator (for monitoring)
Quick Start
Development
# Create a values file
cat > my-values.yaml << EOF
postgresql:
host: "postgres.default.svc.cluster.local"
port: 5432
database: "triage_warden"
username: "triage"
password: "your-password"
secrets:
encryptionKey: "$(openssl rand -base64 32)"
jwtSecret: "$(openssl rand -hex 32)"
sessionSecret: "$(openssl rand -hex 32)"
config:
enableSwagger: true
secureCookies: false
EOF
# Install
helm install triage-warden ./deploy/helm -f my-values.yaml
Production
# Create namespace
kubectl create namespace triage-warden
# Create secrets externally (recommended)
kubectl create secret generic triage-warden-secrets \
--namespace triage-warden \
--from-literal=TW_ENCRYPTION_KEY="$(openssl rand -base64 32)" \
--from-literal=TW_JWT_SECRET="$(openssl rand -hex 32)" \
--from-literal=TW_SESSION_SECRET="$(openssl rand -hex 32)"
kubectl create secret generic postgresql-credentials \
--namespace triage-warden \
--from-literal=postgresql-password="your-db-password"
# Install with production values
helm install triage-warden ./deploy/helm \
--namespace triage-warden \
-f deploy/helm/values-prod.yaml
Value Files
The chart ships with pre-built value files for common scenarios:
| File | Purpose |
|---|---|
| values.yaml | Defaults (base for all environments) |
| values-dev.yaml | Single-instance development (debug logging, no TLS) |
| values-prod.yaml | Multi-instance production (3 API replicas, TLS, monitoring) |
| values-ha.yaml | Maximum availability (5+ replicas, zone spreading, strict anti-affinity) |
Override with -f:
helm install triage-warden ./deploy/helm \
--namespace triage-warden \
-f deploy/helm/values-prod.yaml \
-f my-secrets.yaml
Key Parameters
Application
| Parameter | Description | Default |
|---|---|---|
| api.replicas | API server replicas | 2 |
| api.resources.requests.cpu | CPU request | 100m |
| api.resources.requests.memory | Memory request | 256Mi |
| orchestrator.replicas | Orchestrator replicas | 1 |
| config.logLevel | Log level | info |
| config.enableSwagger | Enable Swagger UI | false |
Database
| Parameter | Description | Default |
|---|---|---|
| postgresql.host | PostgreSQL host (required) | "" |
| postgresql.port | PostgreSQL port | 5432 |
| postgresql.database | Database name | triage_warden |
| postgresql.existingSecret | Existing secret with password | "" |
| postgresql.sslMode | SSL mode | require |
Networking
| Parameter | Description | Default |
|---|---|---|
ingress.enabled | Enable ingress | false |
ingress.className | Ingress class name | nginx |
networkPolicy.enabled | Enable network policies | false |
Scaling & HA
| Parameter | Description | Default |
|---|---|---|
autoscaling.enabled | Enable HPA | false |
autoscaling.minReplicas | Minimum replicas | 2 |
autoscaling.maxReplicas | Maximum replicas | 10 |
podDisruptionBudget.enabled | Enable PDB | false |
Monitoring
| Parameter | Description | Default |
|---|---|---|
serviceMonitor.enabled | Enable ServiceMonitor | false |
prometheusRules.enabled | Enable alerting rules | false |
See deploy/helm/values.yaml for the complete list.
Components
The chart deploys two main components:
- API Server (`deployment-api.yaml`) - Handles HTTP requests, webhooks, and the web UI
- Orchestrator (`deployment-orchestrator.yaml`) - Manages background tasks, scheduling, and automation
Supporting resources: ServiceAccount, ConfigMap, Secret, Service, Ingress, HPA, PDB, NetworkPolicy, ServiceMonitor, PrometheusRule.
External Secrets
For production, use an external secrets manager instead of storing secrets in values files:
secrets:
create: false
existingSecret: "triage-warden-secrets"
Compatible with:
- External Secrets Operator
- AWS Secrets Manager with IRSA
- HashiCorp Vault
Upgrading
helm upgrade triage-warden ./deploy/helm \
--namespace triage-warden \
-f deploy/helm/values-prod.yaml
# Monitor rollout
kubectl rollout status deployment/triage-warden-api -n triage-warden
Rollback
helm history triage-warden -n triage-warden
helm rollback triage-warden 1 -n triage-warden
Uninstalling
helm uninstall triage-warden -n triage-warden
kubectl delete namespace triage-warden
Alerts
When prometheusRules.enabled: true, the chart installs these alerts:
- `TriageWardenDown` - Instance unreachable for 2+ minutes
- `TriageWardenHighErrorRate` - 5xx errors exceed 5%
- `TriageWardenKillSwitchActive` - Kill switch activated
- `TriageWardenDatabaseUnhealthy` - Database connection issues
- `TriageWardenHighLatency` - P99 latency above 1 second
- `TriageWardenConnectorUnhealthy` - Connector health issues
The HA values file (values-ha.yaml) adds zone-balance and replica-mismatch alerts.
Next Steps
- Production Checklist - Security and configuration review
- Monitoring - Set up dashboards and alerting
- Scaling - Horizontal scaling guidance
- Raw Manifests - Alternative: deploy without Helm
Configuration Reference
This document provides a comprehensive reference for all Triage Warden configuration options.
Configuration Methods
Triage Warden can be configured through:
- Environment variables (recommended for production)
- Configuration file (`config/default.yaml`)
- Command-line arguments (for specific settings)
Environment variables take precedence over configuration file values.
Environment Variables
Security Settings (Required)
| Variable | Description | Example |
|---|---|---|
TW_ENCRYPTION_KEY | 32-byte base64 key for encrypting credentials stored in database | openssl rand -base64 32 |
TW_JWT_SECRET | Secret for signing JWT tokens (min 32 chars) | openssl rand -hex 32 |
TW_SESSION_SECRET | Secret for signing session cookies (min 32 chars) | openssl rand -hex 32 |
Warning: These secrets must be consistent across all instances in a cluster. Changing them will invalidate existing sessions and encrypted data.
Database Configuration
| Variable | Description | Default |
|---|---|---|
DATABASE_URL | PostgreSQL connection string | postgres://user:pass@host:5432/db |
DATABASE_MAX_CONNECTIONS | Maximum connection pool size | 25 |
DATABASE_MIN_CONNECTIONS | Minimum connection pool size | 5 |
DATABASE_CONNECT_TIMEOUT | Connection timeout in seconds | 30 |
DATABASE_IDLE_TIMEOUT | Idle connection timeout in seconds | 600 |
DATABASE_MAX_LIFETIME | Maximum connection lifetime in seconds | 1800 |
Connection String Format:
postgres://username:password@hostname:port/database?sslmode=require
SSL modes: disable, allow, prefer, require, verify-ca, verify-full
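Since the connection string is a standard URL, its parts can be sanity-checked with a stock URL parser. A small illustrative sketch (not part of Triage Warden) that extracts the `sslmode`:

```python
from urllib.parse import urlparse, parse_qs

def describe_db_url(url: str) -> dict:
    """Break a PostgreSQL connection string into its parts."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    return {
        "user": parsed.username,
        "host": parsed.hostname,
        "port": parsed.port or 5432,
        "database": parsed.path.lstrip("/"),
        # Fall back to "prefer", libpq's own default, when sslmode is absent.
        "sslmode": params.get("sslmode", ["prefer"])[0],
    }

info = describe_db_url(
    "postgres://triage:secret@db.example.com:5432/triage_warden?sslmode=require"
)
print(info["sslmode"])  # require
```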
Redis Configuration
Redis is required for HA deployments (message queue, cache, leader election).
| Variable | Description | Default |
|---|---|---|
REDIS_URL | Redis connection URL | redis://localhost:6379 |
TW_MESSAGE_QUEUE_ENABLED | Enable Redis-based message queue | false |
TW_CACHE_ENABLED | Enable Redis-based cache | false |
TW_LEADER_ELECTION_ENABLED | Enable Redis-based leader election | false |
TW_CACHE_TTL_SECONDS | Default cache TTL | 3600 |
TW_CACHE_MAX_SIZE | Maximum cache entries | 10000 |
Connection URL Formats:
redis://localhost:6379
redis://:password@localhost:6379
redis://localhost:6379/0
rediss://localhost:6379 # TLS
Server Configuration
| Variable | Description | Default |
|---|---|---|
TW_BIND_ADDRESS | Address and port to bind | 0.0.0.0:8080 |
TW_BASE_URL | Public URL for the application | http://localhost:8080 |
TW_ENV | Environment: development, production | development |
TW_TRUSTED_PROXIES | CIDR ranges for trusted reverse proxies | `` |
TW_REQUEST_BODY_LIMIT | Max request body size in bytes | 10485760 (10MB) |
TW_REQUEST_TIMEOUT | Request timeout in seconds | 30 |
Instance Configuration
| Variable | Description | Default |
|---|---|---|
TW_INSTANCE_ID | Unique identifier for this instance | Auto-generated |
TW_INSTANCE_TYPE | Instance type: api, orchestrator, combined | combined |
Authentication & Sessions
| Variable | Description | Default |
|---|---|---|
TW_COOKIE_SECURE | Require HTTPS for cookies | true in production |
TW_COOKIE_SAME_SITE | SameSite policy: strict, lax, none | strict |
TW_SESSION_EXPIRY_SECONDS | Session duration | 86400 (24 hours) |
TW_CSRF_ENABLED | Enable CSRF protection | true |
TW_ADMIN_PASSWORD | Initial admin password (first run only) | Auto-generated |
CORS Configuration
| Variable | Description | Default |
|---|---|---|
TW_CORS_ALLOWED_ORIGINS | Allowed origins (comma-separated) | Same origin only |
TW_CORS_ALLOW_CREDENTIALS | Allow credentials in CORS requests | true |
TW_CORS_MAX_AGE | Preflight cache duration in seconds | 3600 |
LLM Configuration
| Variable | Description | Default |
|---|---|---|
TW_LLM_PROVIDER | LLM provider: anthropic, openai, azure, local | anthropic |
TW_LLM_MODEL | Model identifier | claude-3-sonnet-20240229 |
TW_LLM_TEMPERATURE | Generation temperature (0.0-2.0) | 0.2 |
TW_LLM_MAX_TOKENS | Maximum response tokens | 4096 |
TW_LLM_TIMEOUT_SECONDS | API call timeout | 60 |
TW_LLM_RETRY_ATTEMPTS | Number of retry attempts | 3 |
TW_LLM_RETRY_DELAY_MS | Delay between retries | 1000 |
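The retry settings above describe a simple fixed-delay loop. A sketch of the implied behavior, using stand-in functions rather than the real client:

```python
import time

def call_with_retries(call, attempts=3, delay_ms=1000):
    """Retry `call` up to `attempts` times, sleeping `delay_ms` between failures."""
    last_err = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as err:
            last_err = err
            if attempt < attempts - 1:
                time.sleep(delay_ms / 1000)
    raise last_err

# Stand-in for the LLM client: times out twice, then answers.
state = {"calls": 0}
def flaky_llm_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated timeout")
    return "verdict: benign"

print(call_with_retries(flaky_llm_call, attempts=3, delay_ms=1))  # verdict: benign
```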
Provider-specific API Keys:
| Variable | Provider |
|---|---|
ANTHROPIC_API_KEY | Anthropic Claude |
OPENAI_API_KEY | OpenAI GPT |
AZURE_OPENAI_API_KEY | Azure OpenAI |
AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint URL |
Orchestrator Configuration
| Variable | Description | Default |
|---|---|---|
TW_OPERATION_MODE | Mode: supervised, assisted, autonomous | supervised |
TW_AUTO_APPROVE_LOW_RISK | Auto-approve low-risk actions | false |
TW_MAX_CONCURRENT_INCIDENTS | Max concurrent incident processing | 100 |
TW_ENRICHMENT_TIMEOUT_SECONDS | Enrichment step timeout | 60 |
TW_ANALYSIS_TIMEOUT_SECONDS | AI analysis timeout | 120 |
TW_ACTION_TIMEOUT_SECONDS | Action execution timeout | 300 |
Logging Configuration
| Variable | Description | Default |
|---|---|---|
RUST_LOG | Log level filter | info |
TW_LOG_FORMAT | Format: json, pretty | json in production |
TW_LOG_INCLUDE_LOCATION | Include file/line in logs | false |
Log Level Examples:
# Basic level
RUST_LOG=info
# Per-module levels
RUST_LOG=info,triage_warden=debug,tw_api=trace
# All debug
RUST_LOG=debug
Metrics Configuration
| Variable | Description | Default |
|---|---|---|
TW_METRICS_ENABLED | Enable Prometheus metrics | true |
TW_METRICS_PATH | Metrics endpoint path | /metrics |
TW_METRICS_INCLUDE_LABELS | Include additional labels | true |
Rate Limiting
| Variable | Description | Default |
|---|---|---|
TW_RATE_LIMIT_ENABLED | Enable rate limiting | true |
TW_RATE_LIMIT_REQUESTS | Requests per window | 200 |
TW_RATE_LIMIT_WINDOW | Window duration (e.g., 1m, 1h) | 1m |
TW_RATE_LIMIT_BURST | Burst allowance | 50 |
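The limiter algorithm itself isn't specified here; a token bucket is one plausible reading of these parameters (requests per window plus a burst allowance), sketched below. The explicit `now` argument is an artifact of the sketch, used to keep it deterministic:

```python
class TokenBucket:
    """Allow `rate` requests per `window` seconds plus a `burst` allowance."""
    def __init__(self, rate: int, window: float, burst: int):
        self.capacity = rate + burst
        self.tokens = float(self.capacity)
        self.refill_per_sec = rate / window
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """`now` is a timestamp in seconds; injected for determinism."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=200, window=60.0, burst=50)
# 300 requests arriving in the same instant: the bucket absorbs 250, the rest are rejected.
allowed = sum(bucket.allow(now=0.0) for _ in range(300))
print(allowed)  # 250
# One second later, refill has restored rate/window * 1s, i.e. about 3.33 tokens.
print(sum(bucket.allow(now=1.0) for _ in range(10)))  # 3
```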
Feature Flags
| Variable | Description | Default |
|---|---|---|
TW_FEATURE_PLAYBOOKS | Enable playbook automation | true |
TW_FEATURE_AUTO_ENRICH | Enable automatic enrichment | true |
TW_FEATURE_API_KEYS | Enable API key authentication | true |
TW_FEATURE_MULTI_TENANT | Enable multi-tenancy | false |
TW_ENABLE_SWAGGER | Enable Swagger UI | true in dev |
Webhook Configuration
| Variable | Description | Default |
|---|---|---|
TW_WEBHOOK_SECRET | Default webhook signature secret | `` |
TW_WEBHOOK_TIMEOUT_SECONDS | Webhook delivery timeout | 30 |
TW_WEBHOOK_RETRY_ATTEMPTS | Delivery retry attempts | 3 |
Source-specific webhook secrets:
| Variable | Source |
|---|---|
TW_WEBHOOK_SPLUNK_SECRET | Splunk HEC |
TW_WEBHOOK_CROWDSTRIKE_SECRET | CrowdStrike |
TW_WEBHOOK_SENTINEL_SECRET | Microsoft Sentinel |
TW_WEBHOOK_GITHUB_SECRET | GitHub (for DevSecOps) |
Configuration File
Configuration can also be provided via a YAML file.
File Locations
Triage Warden searches for a configuration file in the following order:
- Path specified by the `--config` flag
- `$HOME/.config/triage-warden/config.yaml`
- `/etc/triage-warden/config.yaml`
- `./config/default.yaml`
Example Configuration File
# config/default.yaml
# Server configuration
server:
bind_address: "0.0.0.0:8080"
base_url: "https://triage.example.com"
trusted_proxies:
- "10.0.0.0/8"
- "172.16.0.0/12"
# Database configuration
database:
url: "postgres://triage:password@localhost:5432/triage_warden"
max_connections: 25
min_connections: 5
connect_timeout: 30
# Redis configuration (for HA)
redis:
url: "redis://localhost:6379"
message_queue:
enabled: true
cache:
enabled: true
ttl_seconds: 3600
leader_election:
enabled: true
# LLM configuration
llm:
provider: anthropic
model: claude-3-sonnet-20240229
temperature: 0.2
max_tokens: 4096
# API key should be set via environment variable
# Orchestrator settings
orchestrator:
operation_mode: supervised
auto_approve_low_risk: false
max_concurrent_incidents: 100
timeouts:
enrichment: 60
analysis: 120
action: 300
# Logging
logging:
level: info
format: json
# Metrics
metrics:
enabled: true
path: /metrics
# Rate limiting
rate_limit:
enabled: true
requests_per_minute: 200
burst: 50
# Feature flags
features:
playbooks: true
auto_enrich: true
api_keys: true
multi_tenant: false
# Connectors
connectors:
crowdstrike:
enabled: true
type: edr
base_url: "https://api.crowdstrike.com"
# Credentials via environment or secrets
splunk:
enabled: true
type: siem
base_url: "https://splunk.example.com:8089"
Precedence
Configuration is loaded in this order (later overrides earlier):
- Default values (built into application)
- Configuration file (`config/default.yaml`)
- Environment-specific file (`config/{TW_ENV}.yaml`)
- Environment variables
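The layering amounts to successive dictionary overlays, with later sources winning. A simplified sketch, assuming flat keys (the real loader handles nested sections):

```python
def load_config(defaults: dict, file_config: dict, env_config: dict) -> dict:
    """Merge configuration sources; later sources override earlier ones."""
    merged = {}
    for source in (defaults, file_config, env_config):
        merged.update(source)
    return merged

defaults = {"log_level": "info", "rate_limit": 200}   # built into the application
file_config = {"log_level": "debug"}                  # from config/default.yaml
env_config = {"rate_limit": 500}                      # from e.g. TW_RATE_LIMIT_REQUESTS

config = load_config(defaults, file_config, env_config)
print(config)  # {'log_level': 'debug', 'rate_limit': 500}
```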
Generating Secrets
Encryption Key (32 bytes, base64)
# macOS/Linux
openssl rand -base64 32
# Alternative using /dev/urandom
head -c 32 /dev/urandom | base64
JWT/Session Secrets
# Hex-encoded secret
openssl rand -hex 32
# Or use a password generator
pwgen -s 64 1
Database URL Format
PostgreSQL
postgres://username:password@hostname:port/database?sslmode=require
Options:
- `sslmode=disable` - No SSL (development only)
- `sslmode=require` - Require SSL, don't verify certificate
- `sslmode=verify-ca` - Require SSL, verify CA
- `sslmode=verify-full` - Require SSL, verify CA and hostname
Connection Pooling (PgBouncer)
postgres://username:password@pgbouncer:6432/database?sslmode=require
Operation Modes
Triage Warden supports three operation modes:
Supervised Mode (Default)
All actions require human approval:
TW_OPERATION_MODE=supervised
TW_AUTO_APPROVE_LOW_RISK=false
Assisted Mode
Low-risk actions are auto-approved, high-risk require approval:
TW_OPERATION_MODE=assisted
TW_AUTO_APPROVE_LOW_RISK=true
Autonomous Mode
All actions within guardrails are auto-executed:
TW_OPERATION_MODE=autonomous
Warning: Autonomous mode should only be enabled after thorough testing and with appropriate guardrails configured.
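The three modes reduce to a small decision rule over action risk. An illustrative sketch (the risk levels are assumptions; real guardrails are richer than this):

```python
def requires_approval(mode: str, risk: str) -> bool:
    """Decide whether a proposed action needs a human sign-off."""
    if mode == "supervised":
        return True                      # every action is reviewed
    if mode == "assisted":
        return risk != "low"             # only low-risk actions auto-run
    if mode == "autonomous":
        return False                     # guardrails, not humans, gate actions
    raise ValueError(f"unknown mode: {mode}")

print(requires_approval("supervised", "low"))   # True
print(requires_approval("assisted", "low"))     # False
print(requires_approval("assisted", "high"))    # True
print(requires_approval("autonomous", "high"))  # False
```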
Health Check Endpoints
| Endpoint | Purpose | Response |
|---|---|---|
/health | Basic health status | {"status": "healthy", ...} |
/health/detailed | Full component status | Includes all components |
/live | Liveness probe (Kubernetes) | 200 OK |
/ready | Readiness probe (Kubernetes) | 200 OK or 503 |
Health Status Values
| Status | Description |
|---|---|
healthy | All components operational |
degraded | Some non-critical components failing |
unhealthy | Critical components failing |
halted | Kill switch activated |
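One way the roll-up from individual component checks to an overall status could work (the criticality split between components is an assumption, not documented behavior):

```python
def overall_status(components: dict, critical: set, kill_switch: bool = False) -> str:
    """Roll per-component health up into a single status string."""
    if kill_switch:
        return "halted"
    failing = {name for name, healthy in components.items() if not healthy}
    if failing & critical:
        return "unhealthy"   # a critical component is down
    if failing:
        return "degraded"    # only non-critical components are down
    return "healthy"

components = {"database": True, "redis": True, "connector_crowdstrike": False}
print(overall_status(components, critical={"database", "redis"}))  # degraded
print(overall_status({"database": False}, critical={"database"}))  # unhealthy
```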
Security Best Practices
- Never commit secrets to version control
- Use different secrets for each environment
- Rotate secrets periodically
- Enable TLS in production (`TW_COOKIE_SECURE=true`)
- Restrict trusted proxies to known IP ranges
- Enable rate limiting in production
- Use read-only database users where possible
Environment-Specific Recommendations
Development
TW_ENV=development
TW_LOG_FORMAT=pretty
RUST_LOG=debug,triage_warden=trace
TW_COOKIE_SECURE=false
TW_ENABLE_SWAGGER=true
Staging
TW_ENV=production
TW_LOG_FORMAT=json
RUST_LOG=info,triage_warden=debug
TW_COOKIE_SECURE=true
TW_ENABLE_SWAGGER=true
Production
TW_ENV=production
TW_LOG_FORMAT=json
RUST_LOG=info
TW_COOKIE_SECURE=true
TW_ENABLE_SWAGGER=false
TW_METRICS_ENABLED=true
TW_RATE_LIMIT_ENABLED=true
High-Availability
DATABASE_URL=postgres://tw_user:pass@pgbouncer:6432/triage_warden?sslmode=require
DATABASE_MAX_CONNECTIONS=50
TW_TRUSTED_PROXIES=10.0.0.0/8
TW_METRICS_ENABLED=true
TW_TRACING_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
Next Steps
- Configure monitoring
- Set up horizontal scaling
- Deploy to Kubernetes
Operations Guide
Operational procedures and runbooks for Triage Warden.
Runbooks
- Backup & Restore - Database backup and recovery procedures
- Monitoring - Prometheus metrics, alerting, and dashboards
- Troubleshooting - Common issues and solutions
- Maintenance - Routine maintenance tasks
- Incident Response - Emergency procedures
- Upgrade Guide - Version upgrade procedures
Quick Reference
Health Check Endpoints
| Endpoint | Purpose | Expected Response |
|---|---|---|
GET /live | Liveness probe | 200 OK |
GET /ready | Readiness probe | 200 OK if ready, 503 if not |
GET /health | Basic health | JSON with status |
GET /health/detailed | Full component health | JSON with all components |
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
http_requests_total | Total HTTP requests | N/A |
http_request_duration_seconds | Request latency | p99 > 1s |
http_requests_in_flight | Concurrent requests | > 100 |
db_pool_connections_active | Active DB connections | > 80% of max |
incidents_total | Total incidents processed | N/A |
actions_executed_total | Total actions executed | N/A |
Emergency Contacts
| Role | Contact | Escalation |
|---|---|---|
| On-call Engineer | PagerDuty | Auto-escalates after 15m |
| Security Lead | [email protected] | Critical security issues |
| Database Admin | [email protected] | Database emergencies |
Common Commands
Docker
# View logs
docker compose logs -f triage-warden
# Restart service
docker compose restart triage-warden
# Check health
curl http://localhost:8080/health | jq
# Database backup
docker compose exec postgres pg_dump -U triage_warden > backup.sql
Kubernetes
# View logs
kubectl logs -f deployment/triage-warden -n triage-warden
# Restart pods
kubectl rollout restart deployment/triage-warden -n triage-warden
# Check health
kubectl exec -it deployment/triage-warden -n triage-warden -- curl -s localhost:8080/health | jq
# Scale up/down
kubectl scale deployment triage-warden -n triage-warden --replicas=5
Database
# Connect to PostgreSQL
psql $DATABASE_URL
# Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'triage_warden';
# Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
Service Dependencies
          ┌──────────────────┐
          │  Triage Warden   │
          └─────────┬────────┘
                    │
    ┌──────────┬────┴────┬─────────┐
    │          │         │         │
    ▼          ▼         ▼         ▼
┌────────┐ ┌───────┐ ┌───────┐ ┌───────┐
│Postgres│ │  LLM  │ │Connec-│ │Notifi-│
│   DB   │ │  API  │ │ tors  │ │cations│
└────────┘ └───────┘ └───────┘ └───────┘
Dependency Health Impact
| Dependency | If Unavailable |
|---|---|
| PostgreSQL | Service fails readiness, no data access |
| LLM API | AI analysis disabled, manual triage only |
| Connectors | Specific integrations fail, core works |
| Notifications | Alerts not delivered, incidents still process |
Scheduled Tasks
| Task | Schedule | Description |
|---|---|---|
| Database backup | Daily 2:00 AM | Full PostgreSQL backup |
| Connector health check | Every 5 minutes | Verify connector connectivity |
| Incident cleanup | Weekly Sunday 3:00 AM | Archive old incidents |
| Log rotation | Daily | Rotate and compress logs |
| Certificate renewal | 30 days before expiry | Renew TLS certificates |
Monitoring Guide
This guide covers monitoring, metrics, and alerting for Triage Warden deployments.
Overview
Triage Warden exposes metrics in Prometheus format and supports integration with common observability stacks.
┌─────────────────────────────────────────────────────────────┐
│ Monitoring Stack │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Prometheus │───▶│ Grafana │ │ Alertmanager │ │
│ │ (scraping) │ │ (dashboards) │ │ (alerts) │ │
│ └──────┬───────┘ └──────────────┘ └──────────────┘ │
│ │ │
└─────────┼────────────────────────────────────────────────────┘
│
│ /metrics
│
┌─────────▼────────────────────────────────────────────────────┐
│ Triage Warden │
│ ┌───────────┐ ┌───────────┐ ┌─────────────┐ │
│ │ API-1 │ │ API-2 │ │Orchestrator │ │
│ │ :8080 │ │ :8080 │ │ :8080 │ │
│ └───────────┘ └───────────┘ └─────────────┘ │
└──────────────────────────────────────────────────────────────┘
Metrics Endpoints
| Endpoint | Format | Description |
|---|---|---|
/metrics | Prometheus | Prometheus-compatible metrics |
/api/metrics | JSON | Dashboard-friendly JSON format |
/health | JSON | Basic health status |
/health/detailed | JSON | Comprehensive health including components |
Available Metrics
HTTP Metrics
# Request counter by method, path, status
http_requests_total{method="GET", path="/api/incidents", status="200"} 1234
# Request duration histogram
http_request_duration_seconds_bucket{method="GET", path="/api/incidents", le="0.1"} 900
http_request_duration_seconds_bucket{method="GET", path="/api/incidents", le="0.5"} 1100
http_request_duration_seconds_bucket{method="GET", path="/api/incidents", le="1.0"} 1200
# Active connections
http_connections_active 42
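Prometheus-style quantile estimates come from linear interpolation over these cumulative buckets. The same arithmetic in a sketch, assuming the implicit `+Inf` bucket count equals the last bucket shown (1200):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (le, count) buckets, PromQL-style."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# Cumulative counts from the example above.
buckets = [(0.1, 900), (0.5, 1100), (1.0, 1200)]
print(round(histogram_quantile(0.99, buckets), 3))  # 0.94
```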
Incident Metrics
# Total incidents by severity and status
triage_warden_incidents_total{severity="critical", status="new"} 5
triage_warden_incidents_total{severity="high", status="resolved"} 128
# Incidents currently being processed
triage_warden_incidents_in_progress 12
# Triage duration histogram
triage_warden_triage_duration_seconds_bucket{le="60"} 500
triage_warden_triage_duration_seconds_bucket{le="300"} 800
Action Metrics
# Actions by type and status
triage_warden_actions_total{action_type="isolate_host", status="success"} 45
triage_warden_actions_total{action_type="isolate_host", status="failed"} 2
# Pending approvals
triage_warden_actions_pending_approval 8
# Action execution duration
triage_warden_action_duration_seconds_bucket{action_type="isolate_host", le="30"} 40
System Metrics
# Kill switch status
kill_switch_active 0
# Component health (1=healthy, 0=unhealthy)
component_healthy{component="database"} 1
component_healthy{component="redis"} 1
component_healthy{component="connector_crowdstrike"} 1
# Database connection pool
db_pool_connections_total 25
db_pool_connections_idle 20
db_pool_connections_waiting 0
# Cache statistics
cache_hits_total 10000
cache_misses_total 500
cache_size 2500
LLM Metrics
# LLM API calls by provider and model
llm_requests_total{provider="anthropic", model="claude-3-sonnet"} 500
# LLM latency
llm_request_duration_seconds_bucket{provider="anthropic", le="5"} 400
llm_request_duration_seconds_bucket{provider="anthropic", le="30"} 490
# Token usage
llm_tokens_used_total{provider="anthropic", type="input"} 150000
llm_tokens_used_total{provider="anthropic", type="output"} 75000
Message Queue Metrics
# Queue depth by topic
mq_messages_pending{topic="triage.alerts"} 15
mq_messages_pending{topic="triage.enrichment"} 3
# Message processing rate
mq_messages_processed_total{topic="triage.alerts"} 5000
mq_messages_acknowledged_total{topic="triage.alerts"} 4995
Prometheus Configuration
Basic Scrape Config
# prometheus.yml
scrape_configs:
- job_name: 'triage-warden'
static_configs:
- targets:
- 'triage-warden-api:8080'
- 'triage-warden-orchestrator:8080'
metrics_path: /metrics
scrape_interval: 15s
scrape_timeout: 10s
Kubernetes ServiceMonitor
For Prometheus Operator deployments:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: triage-warden
labels:
release: prometheus
spec:
selector:
matchLabels:
app.kubernetes.io/name: triage-warden
namespaceSelector:
matchNames:
- triage-warden
endpoints:
- port: http
path: /metrics
interval: 15s
scrapeTimeout: 10s
Pod Annotations (Alternative)
If using annotation-based discovery:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
Alerting Rules
PrometheusRule Resource
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: triage-warden-alerts
labels:
release: prometheus
spec:
groups:
- name: triage-warden.availability
rules:
# Service Down
- alert: TriageWardenDown
expr: up{job="triage-warden"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Triage Warden instance is down"
description: "{{ $labels.instance }} has been unreachable for more than 2 minutes."
# High Error Rate
- alert: TriageWardenHighErrorRate
expr: |
sum(rate(http_requests_total{job="triage-warden",status=~"5.."}[5m])) /
sum(rate(http_requests_total{job="triage-warden"}[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "More than 5% of requests are returning 5xx errors."
# Database Unhealthy
- alert: TriageWardenDatabaseUnhealthy
expr: component_healthy{job="triage-warden",component="database"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Database connection lost"
description: "Triage Warden cannot connect to the database."
- name: triage-warden.performance
rules:
# High Latency
- alert: TriageWardenHighLatency
expr: |
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket{job="triage-warden"}[5m])
) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "High API latency"
description: "P99 latency is above 1 second for the last 10 minutes."
# Slow Triage Time
- alert: TriageWardenSlowTriage
expr: |
histogram_quantile(0.90,
rate(triage_warden_triage_duration_seconds_bucket[1h])
) > 300
for: 30m
labels:
severity: warning
annotations:
summary: "Incident triage taking too long"
description: "P90 triage duration is above 5 minutes."
- name: triage-warden.operations
rules:
# Kill Switch Active
- alert: TriageWardenKillSwitchActive
expr: kill_switch_active == 1
for: 0m
labels:
severity: warning
annotations:
summary: "Kill switch is active"
description: "All automation has been halted by the kill switch."
# High Pending Approvals
- alert: TriageWardenHighPendingApprovals
expr: triage_warden_actions_pending_approval > 50
for: 15m
labels:
severity: warning
annotations:
summary: "High number of pending approvals"
description: "{{ $value }} actions are waiting for approval."
# Connector Unhealthy
- alert: TriageWardenConnectorUnhealthy
expr: component_healthy{component=~"connector_.*"} == 0
for: 10m
labels:
severity: warning
annotations:
summary: "Connector {{ $labels.component }} is unhealthy"
description: "Connector has been unhealthy for more than 10 minutes."
# Queue Backlog
- alert: TriageWardenQueueBacklog
expr: mq_messages_pending{topic="triage.alerts"} > 100
for: 15m
labels:
severity: warning
annotations:
summary: "Alert queue backlog growing"
description: "{{ $value }} unprocessed alerts in queue."
- name: triage-warden.resources
rules:
# High CPU
- alert: TriageWardenHighCPU
expr: |
sum(rate(container_cpu_usage_seconds_total{
container="triage-warden"
}[5m])) by (pod) > 0.8
for: 15m
labels:
severity: warning
annotations:
summary: "High CPU usage"
description: "Pod {{ $labels.pod }} CPU usage above 80%."
# High Memory
- alert: TriageWardenHighMemory
expr: |
container_memory_usage_bytes{container="triage-warden"} /
container_spec_memory_limit_bytes{container="triage-warden"} > 0.9
for: 15m
labels:
severity: warning
annotations:
summary: "High memory usage"
description: "Pod {{ $labels.pod }} memory usage above 90%."
# Database Connection Exhaustion
- alert: TriageWardenDBConnectionsLow
expr: db_pool_connections_idle < 2
for: 5m
labels:
severity: warning
annotations:
summary: "Database connection pool nearly exhausted"
description: "Only {{ $value }} idle connections remaining."
Key Metrics to Monitor
SLI/SLO Recommendations
| Indicator | Target | Alert Threshold |
|---|---|---|
| Availability | 99.9% | < 99.5% |
| API Latency P99 | < 500ms | > 1s |
| Error Rate | < 0.1% | > 1% |
| Triage Time P90 | < 5min | > 10min |
Dashboard Panels
Overview:
- Instance count and status
- Requests per second
- Error rate percentage
- Active incidents
Performance:
- Request latency histogram
- Database query duration
- LLM response time
- Cache hit ratio
Operations:
- Incidents by severity/status
- Actions executed vs pending
- Queue depths
- Connector health matrix
Resources:
- CPU utilization by instance
- Memory utilization by instance
- Database connections
- Redis memory usage
Grafana Dashboards
Importing Dashboards
Triage Warden provides pre-built Grafana dashboards:
# Download dashboard JSON
curl -o triage-warden-dashboard.json \
https://raw.githubusercontent.com/triage-warden/triage-warden/main/deploy/grafana/dashboards/overview.json
# Import via Grafana API
curl -X POST -H "Content-Type: application/json" \
-d @triage-warden-dashboard.json \
http://admin:admin@localhost:3000/api/dashboards/db
Dashboard Provisioning
For automatic dashboard provisioning in Kubernetes:
# ConfigMap for dashboard provisioning
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
labels:
grafana_dashboard: "1"
data:
triage-warden.json: |
{
"dashboard": {
"title": "Triage Warden",
"panels": [...]
}
}
Example Panel Queries
Requests per Second:
sum(rate(http_requests_total{job="triage-warden"}[5m]))
Error Rate:
sum(rate(http_requests_total{job="triage-warden",status=~"5.."}[5m])) /
sum(rate(http_requests_total{job="triage-warden"}[5m])) * 100
P99 Latency:
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="triage-warden"}[5m])) by (le)
)
Incidents by Status:
triage_warden_incidents_total{job="triage-warden"}
Cache Hit Ratio:
sum(rate(cache_hits_total[5m])) /
(sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100
Logging
Log Format
Triage Warden outputs structured JSON logs:
{
"timestamp": "2024-01-15T10:30:00.000Z",
"level": "info",
"target": "tw_api::routes::incidents",
"message": "Incident created",
"incident_id": "123e4567-e89b-12d3-a456-426614174000",
"severity": "high",
"source": "crowdstrike",
"trace_id": "abc123",
"span_id": "def456"
}
Log Aggregation
Loki Configuration:
# promtail config
scrape_configs:
- job_name: triage-warden
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
regex: triage-warden
action: keep
pipeline_stages:
- json:
expressions:
level: level
incident_id: incident_id
trace_id: trace_id
- labels:
level:
incident_id:
Elasticsearch/Fluentd:
# Fluentd config
<match kubernetes.var.log.containers.triage-warden**>
@type elasticsearch
host elasticsearch
port 9200
index_name triage-warden
<buffer>
@type file
path /var/log/fluentd-buffers/triage-warden
</buffer>
</match>
Log Queries
Find errors:
level:ERROR
Slow requests:
duration_ms:>1000
Specific user actions:
user.id:"user-uuid" AND target:*auth*
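Outside an aggregator, the same queries can be run over raw JSON log lines. An illustrative sketch (not part of the product; field names follow the examples above):

```python
import json

def filter_logs(lines, level=None, min_duration_ms=None):
    """Yield parsed log records matching the given criteria."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if level and record.get("level") != level:
            continue
        if min_duration_ms and record.get("duration_ms", 0) <= min_duration_ms:
            continue
        yield record

logs = [
    '{"level": "info", "message": "Incident created", "duration_ms": 12}',
    '{"level": "error", "message": "Connector timeout", "duration_ms": 30000}',
    'not json at all',
]
print([r["message"] for r in filter_logs(logs, level="error")])         # ['Connector timeout']
print([r["message"] for r in filter_logs(logs, min_duration_ms=1000)])  # ['Connector timeout']
```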
Distributed Tracing
OpenTelemetry Configuration
# Environment variables
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=triage-warden
OTEL_TRACES_EXPORTER=otlp
Trace Propagation
Triage Warden propagates trace context through:
- HTTP headers (W3C Trace Context)
- Message queue metadata
- Internal async tasks
Health Check Integration
Kubernetes Probes
livenessProbe:
httpGet:
path: /live
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
Health Status Interpretation
| Status | HTTP Code | Meaning |
|---|---|---|
| healthy | 200 | All systems operational |
| degraded | 200 | Non-critical issues |
| unhealthy | 503 | Critical component failure |
| halted | 200 | Kill switch active |
Synthetic Monitoring
# blackbox-exporter probe
modules:
http_triage_warden:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200]
method: GET
fail_if_body_not_matches_regexp:
- '"status":"healthy"'
Uptime Monitoring
Configure external uptime monitoring (Pingdom, UptimeRobot, etc.) to check:
- `https://triage.example.com/live` - Basic availability
- `https://triage.example.com/ready` - Full readiness
SLO/SLI Definitions
Availability SLO
Target: 99.9% availability
# SLI: Successful requests / Total requests
sum(rate(http_requests_total{job="triage-warden",status!~"5.."}[30d])) /
sum(rate(http_requests_total{job="triage-warden"}[30d]))
Latency SLO
Target: 99% of requests < 500ms
# SLI: Requests under threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{job="triage-warden",le="0.5"}[30d])) /
sum(rate(http_request_duration_seconds_count{job="triage-warden"}[30d]))
Error Budget
# Remaining error budget
1 - (
(1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d])))) /
(1 - 0.999)
)
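In plain terms: a 99.9% SLO leaves a 0.1% error budget for the window, and the budget remaining is one minus the fraction already spent. A sketch with illustrative numbers:

```python
def error_budget_remaining(total: int, errors: int, slo: float = 0.999) -> float:
    """Fraction of the window's error budget still unspent."""
    observed_error_rate = errors / total
    budget = 1.0 - slo                  # 0.001 for a 99.9% SLO
    return 1.0 - observed_error_rate / budget

# 10M requests this month, 4,000 of them 5xx: 0.04% errors against a 0.1% budget.
print(round(error_budget_remaining(10_000_000, 4_000), 3))  # 0.6 -> 60% of the budget remains
```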
Troubleshooting with Metrics
High Latency Investigation
# Identify slow endpoints
topk(5,
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (path, le)
)
)
# Check database query time
histogram_quantile(0.99,
rate(db_query_duration_seconds_bucket[5m])
)
Memory Issues
# Memory growth rate
deriv(process_resident_memory_bytes{job="triage-warden"}[1h])
# Compare to limits
container_memory_usage_bytes / container_spec_memory_limit_bytes
Queue Bottlenecks
# Processing rate vs arrival rate
rate(mq_messages_processed_total[5m]) - rate(mq_messages_received_total[5m])
# Time in queue
histogram_quantile(0.95, rate(mq_message_wait_seconds_bucket[5m]))
Next Steps
- Configure horizontal scaling based on metrics
- Review configuration options
- Set up Kubernetes deployment
Horizontal Scaling Guide
This guide covers scaling Triage Warden horizontally to handle increased load and ensure high availability.
Architecture Overview
Triage Warden consists of two main components that scale differently:
┌─────────────────────┐
│ Load Balancer │
│ (Traefik/nginx) │
└──────────┬──────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ API Server │ │ API Server │ │ API Server │
│ (stateless) │ │ (stateless) │ │ (stateless) │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└──────────────────────┼──────────────────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Orchestrator │ │ Orchestrator │ │ Orchestrator │
│ (worker) │ │ (worker) │ │ (leader) │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└──────────────────────┼──────────────────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Redis │ │ PostgreSQL │ │ PostgreSQL │
│ (MQ + Cache) │ │ (primary) │ │ (replica) │
└───────────────┘ └───────────────┘ └───────────────┘
Scaling Components
API Servers
API servers are stateless and can be scaled horizontally without coordination.
When to Scale:
- CPU utilization > 70% sustained
- Request latency P99 > 500ms
- Concurrent connections approaching limits
Scaling Method:
# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-warden-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Helm Configuration:
api:
  replicas: 3
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
Orchestrators
Orchestrators process incidents asynchronously. They use leader election for singleton tasks (scheduled jobs, metrics aggregation) while allowing parallel incident processing across all instances.
When to Scale:
- Incident queue depth increasing
- Mean time to triage increasing
- Worker CPU utilization > 70%
Scaling Considerations:
- Leader Tasks: Only one orchestrator runs scheduled jobs
- Worker Tasks: All orchestrators process incidents from the queue
- State Sharing: Uses Redis for message queue and coordination
Configuration:
orchestrator:
  replicas: 3
  leaderElection:
    enabled: true
    leaseDuration: 15s
    renewDeadline: 10s
    retryPeriod: 2s
When to Scale
Metrics to Monitor
| Metric | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|
| `http_request_duration_seconds` P99 | > 500ms | > 1s | Scale API |
| `cpu_usage_percent` | > 70% | > 85% | Scale component |
| `memory_usage_percent` | > 80% | > 90% | Scale or optimize |
| `incident_queue_depth` | > 100 | > 500 | Scale orchestrators |
| `db_connection_pool_waiting` | > 0 | > 5 | Increase pool size |
| `redis_connected_clients` | > 80% of max | > 95% of max | Scale Redis |
Capacity Planning
API Server Capacity (per instance):
- ~500 requests/second (simple endpoints)
- ~100 requests/second (complex queries)
- ~50 concurrent WebSocket connections
Orchestrator Capacity (per instance):
- ~10 incidents processed concurrently
- ~5 concurrent LLM analysis calls
- ~20 concurrent enrichment requests
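These per-instance figures can be turned into a rough replica count. A sketch of that sizing arithmetic (the `replicas_needed` helper and the 70% headroom factor are assumptions for illustration):

```python
import math

# Per-instance capacity figures from the guide above
API_RPS_PER_INSTANCE = 100        # conservative: the complex-query figure
ORCH_INCIDENTS_PER_INSTANCE = 10  # concurrent incidents per orchestrator

def replicas_needed(load: float, per_instance: float, headroom: float = 0.7) -> int:
    """Replicas so steady-state load stays under `headroom` utilization, min 2 for HA."""
    return max(2, math.ceil(load / (per_instance * headroom)))

print(replicas_needed(450, API_RPS_PER_INSTANCE))        # 7 API servers for 450 req/s
print(replicas_needed(25, ORCH_INCIDENTS_PER_INSTANCE))  # 4 orchestrators for 25 incidents
```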
Scaling Decision Matrix
| Symptom | Likely Cause | Solution |
|---|---|---|
| High API latency | API overloaded | Scale API servers |
| Growing queue depth | Orchestrators overloaded | Scale orchestrators |
| Database timeouts | Connection exhaustion | Increase pool, add replicas |
| Cache misses high | Cache too small | Increase Redis memory |
| LLM rate limits | Too many concurrent calls | Add rate limiting, queue |
Database Scaling
Connection Pooling
Each instance maintains a connection pool. Total connections:
Total = API_instances * pool_size + Orchestrator_instances * pool_size
Example: 3 API + 2 Orchestrator with pool_size=15:
Total = (3 * 15) + (2 * 15) = 75 connections
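The same arithmetic as a checkable helper (hypothetical, for sizing only) — keep the total comfortably below PostgreSQL's `max_connections`:

```python
def total_db_connections(api_instances: int, orch_instances: int, pool_size: int) -> int:
    # Every instance of either component holds a full pool
    return (api_instances + orch_instances) * pool_size

total = total_db_connections(3, 2, 15)
print(total)  # 75
```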
Configuration:
database:
  max_connections: 15  # Per instance
  min_connections: 2
  connect_timeout: 30
Read Replicas
For read-heavy workloads, configure read replicas:
database:
  primary_url: "postgres://user:pass@primary:5432/db"
  replica_url: "postgres://user:pass@replica:5432/db"
  read_replica_enabled: true
Connection Pooler (PgBouncer)
For large deployments, use PgBouncer:
# Kubernetes ConfigMap for PgBouncer
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    triage_warden = host=postgres port=5432 dbname=triage_warden

    [pgbouncer]
    listen_port = 6432
    listen_addr = 0.0.0.0
    auth_type = md5
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 50
Redis Scaling
Standalone vs Cluster
Standalone (default): Suitable for most deployments
- Up to ~100k ops/second
- Single point of failure (use replica for HA)
Cluster: For high-throughput requirements
- Horizontal scaling across nodes
- Automatic sharding
Redis Configuration
redis:
  architecture: replication  # standalone, replication, cluster
  master:
    resources:
      limits:
        memory: 2Gi
  replica:
    replicaCount: 2
Cache Sizing
Calculate cache memory needs:
Memory = average_entry_size * expected_entries * 1.5 (overhead)
Example: 1KB average, 100k entries:
Memory = 1KB * 100,000 * 1.5 = 150MB
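As a sketch of the sizing formula above (the helper name is illustrative):

```python
def cache_memory_bytes(avg_entry_bytes: int, entries: int, overhead: float = 1.5) -> int:
    # The 1.5x overhead covers Redis per-key metadata and fragmentation
    return int(avg_entry_bytes * entries * overhead)

print(cache_memory_bytes(1_000, 100_000) // 1_000_000)  # 150 (MB)
```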
Load Balancer Configuration
Health Checks
Configure proper health checks for load balancing:
# Traefik
- "traefik.http.services.api.loadbalancer.healthcheck.path=/ready"
- "traefik.http.services.api.loadbalancer.healthcheck.interval=5s"
- "traefik.http.services.api.loadbalancer.healthcheck.timeout=3s"
Session Affinity
For WebSocket connections, enable sticky sessions:
# Traefik
- "traefik.http.services.api.loadbalancer.sticky.cookie.name=tw_server"
- "traefik.http.services.api.loadbalancer.sticky.cookie.httpOnly=true"
Rate Limiting
Configure rate limiting at the load balancer level:
# Traefik rate limiting middleware
http:
  middlewares:
    rate-limit:
      rateLimit:
        average: 100
        burst: 50
        period: 1s
Kubernetes Autoscaling
Horizontal Pod Autoscaler (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-warden-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  # CPU-based scaling
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Memory-based scaling
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric scaling (requires Prometheus adapter)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
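The HPA controller computes its target as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A sketch of that arithmetic:

```python
import math

def hpa_desired_replicas(current: int, current_metric: float, target_metric: float) -> int:
    # Kubernetes HPA scaling formula (before min/max clamping and behavior policies)
    return math.ceil(current * (current_metric / target_metric))

print(hpa_desired_replicas(3, 90, 70))  # CPU at 90% against a 70% target -> 4
print(hpa_desired_replicas(4, 35, 70))  # load halved -> 2
```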
Vertical Pod Autoscaler (VPA)
For automatic resource adjustment:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: triage-warden-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-warden-api
  updatePolicy:
    updateMode: "Auto"  # or "Off" for recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: triage-warden
      minAllowed:
        cpu: 250m
        memory: 256Mi
      maxAllowed:
        cpu: 4
        memory: 4Gi
Pod Disruption Budget
Ensure availability during scaling:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triage-warden-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
      app.kubernetes.io/component: api
Scaling Best Practices
1. Scale Gradually
- Increase by 25-50% at a time
- Monitor for 10-15 minutes before next scale
- Watch for downstream bottlenecks
2. Test Scale Limits
# Load testing with k6
k6 run --vus 100 --duration 5m load-test.js
3. Set Resource Limits
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi
4. Use Pod Anti-Affinity
Spread pods across nodes:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: triage-warden
        topologyKey: kubernetes.io/hostname
5. Configure Topology Spread
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app.kubernetes.io/name: triage-warden
Troubleshooting Scaling Issues
Pods Not Scaling Up
# Check HPA status
kubectl describe hpa triage-warden-api
# Check metrics availability
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq
# Check events
kubectl get events --sort-by='.lastTimestamp' | grep -i scale
Pods Stuck Pending
# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check pod events
kubectl describe pod <pod-name> | grep -A 10 Events
Scaling Oscillation
If pods scale up and down frequently:
- Increase stabilization window
- Adjust metric thresholds
- Add cooldown periods
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # 10 min
Next Steps
- Set up monitoring for scaling metrics
- Review configuration options
- Configure Kubernetes deployment
Backup & Restore
Procedures for backing up and restoring Triage Warden data.
Overview
Triage Warden stores all persistent data in PostgreSQL. Regular backups are essential for disaster recovery.
What to backup:
- PostgreSQL database (all data)
- Configuration files (optional, if customized)
- TLS certificates (if not using cert-manager)
What NOT to backup:
- Application containers (stateless, rebuilt from image)
- Logs (should be in log aggregation system)
- Metrics (stored in Prometheus)
Backup Procedures
Manual Backup
Docker
# Create backup directory
mkdir -p /backups/triage-warden
# Create timestamped backup
BACKUP_FILE="/backups/triage-warden/backup-$(date +%Y%m%d-%H%M%S).sql"
docker compose exec -T postgres pg_dump \
-U triage_warden \
--format=custom \
--compress=9 \
triage_warden > "$BACKUP_FILE"
# Verify backup
pg_restore --list "$BACKUP_FILE" | head -20
echo "Backup created: $BACKUP_FILE ($(du -h $BACKUP_FILE | cut -f1))"
Kubernetes
# Get PostgreSQL pod
PG_POD=$(kubectl get pods -n triage-warden -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}')
# Create backup
BACKUP_FILE="backup-$(date +%Y%m%d-%H%M%S).sql"
kubectl exec -n triage-warden $PG_POD -- \
pg_dump -U triage_warden --format=custom --compress=9 triage_warden \
> "$BACKUP_FILE"
# Upload to S3 (optional)
aws s3 cp "$BACKUP_FILE" s3://your-backup-bucket/triage-warden/
Automated Backup
Docker (Cron)
# /etc/cron.d/triage-warden-backup
0 2 * * * root /opt/triage-warden/scripts/backup.sh >> /var/log/triage-warden-backup.log 2>&1
#!/bin/bash
# /opt/triage-warden/scripts/backup.sh
set -e
BACKUP_DIR="/backups/triage-warden"
RETENTION_DAYS=30
BACKUP_FILE="$BACKUP_DIR/backup-$(date +%Y%m%d-%H%M%S).sql"
# Create backup
cd /opt/triage-warden
docker compose exec -T postgres pg_dump \
-U triage_warden \
--format=custom \
--compress=9 \
triage_warden > "$BACKUP_FILE"
# Verify backup
if ! pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
echo "ERROR: Backup verification failed"
rm -f "$BACKUP_FILE"
exit 1
fi
# Cleanup old backups
find "$BACKUP_DIR" -name "backup-*.sql" -mtime +$RETENTION_DAYS -delete
echo "Backup completed: $BACKUP_FILE"
Kubernetes (CronJob)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: triage-warden-backup
  namespace: triage-warden
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: postgres:15-alpine
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-postgresql
                  key: postgres-password
            command:
            - /bin/sh
            - -c
            - |
              set -e
              BACKUP_FILE="/backups/backup-$(date +%Y%m%d-%H%M%S).sql"
              pg_dump -h postgres-postgresql -U triage_warden \
                --format=custom --compress=9 triage_warden > "$BACKUP_FILE"
              echo "Backup completed: $BACKUP_FILE"
            volumeMounts:
            - name: backup-storage
              mountPath: /backups
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup-pvc
Restore Procedures
Prerequisites
- Stop the Triage Warden application (to prevent data conflicts)
- Have the backup file accessible
- Database credentials available
Full Restore
Docker
# Stop application
docker compose stop triage-warden
# Restore from backup
docker compose exec -T postgres pg_restore \
-U triage_warden \
--clean \
--if-exists \
--no-owner \
-d triage_warden < /path/to/backup.sql
# Start application
docker compose start triage-warden
# Verify
curl http://localhost:8080/health | jq
Kubernetes
# Scale down application
kubectl scale deployment triage-warden -n triage-warden --replicas=0
# Get PostgreSQL pod
PG_POD=$(kubectl get pods -n triage-warden -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}')
# Copy backup to pod
kubectl cp backup.sql triage-warden/$PG_POD:/tmp/backup.sql
# Restore
kubectl exec -n triage-warden $PG_POD -- \
pg_restore -U triage_warden --clean --if-exists --no-owner \
-d triage_warden /tmp/backup.sql
# Scale up application
kubectl scale deployment triage-warden -n triage-warden --replicas=3
# Verify
kubectl exec -it deployment/triage-warden -n triage-warden -- curl -s localhost:8080/health
Point-in-Time Recovery
For point-in-time recovery, enable PostgreSQL WAL archiving:
# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://your-bucket/wal/%f'
Recovery procedure:
# 1. Stop PostgreSQL
# 2. Clear data directory
# 3. Restore base backup
# 4. Create recovery.signal
# 5. Set recovery_target_time in postgresql.conf
# 6. Start PostgreSQL
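The recovery settings from step 5 can be sketched in postgresql.conf (bucket path and target time are placeholders to adjust):

```ini
# postgresql.conf — point-in-time recovery sketch
restore_command = 'aws s3 cp s3://your-bucket/wal/%f %p'
recovery_target_time = '2024-01-15 09:30:00 UTC'
recovery_target_action = 'promote'
```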
Verification
After any restore, verify:
# 1. Health check passes
curl http://localhost:8080/health | jq '.status'
# Expected: "healthy"
# 2. Recent incidents exist
curl http://localhost:8080/api/incidents | jq '. | length'
# 3. User can login
# Test via UI or API
# 4. Connectors configured
curl http://localhost:8080/health/detailed | jq '.components.connectors'
Backup Storage
Local Storage
- Pros: Simple, fast
- Cons: Single point of failure
- Recommendation: Development only
Cloud Storage (S3/GCS/Azure Blob)
# Upload to S3
aws s3 cp backup.sql s3://bucket/triage-warden/backup-$(date +%Y%m%d).sql
# Download from S3
aws s3 cp s3://bucket/triage-warden/backup-20240115.sql ./restore.sql
Encryption
Encrypt backups before storing:
# Encrypt backup
gpg --symmetric --cipher-algo AES256 backup.sql
# Decrypt for restore
gpg --decrypt backup.sql.gpg > backup.sql
Disaster Recovery Plan
RTO/RPO Targets
| Metric | Target |
|---|---|
| Recovery Time Objective (RTO) | 4 hours |
| Recovery Point Objective (RPO) | 24 hours |
Recovery Steps
1. Assess the situation
   - Determine extent of data loss
   - Identify latest valid backup
2. Provision new infrastructure
   - Deploy new database instance
   - Deploy new application instances
3. Restore data
   - Restore database from backup
   - Verify data integrity
4. Reconfigure
   - Update DNS/load balancer
   - Reconfigure connectors if needed
   - Reset API keys if compromised
5. Verify and communicate
   - Run health checks
   - Test critical workflows
   - Notify stakeholders
Testing Schedule
| Test | Frequency | Last Tested |
|---|---|---|
| Backup verification | Weekly | |
| Restore to test environment | Monthly | |
| Full DR simulation | Quarterly | |
Troubleshooting Guide
Common issues and their solutions.
Quick Diagnostics
# Check overall health
curl -s http://localhost:8080/health/detailed | jq
# Check logs for errors (last 100 lines)
docker compose logs --tail=100 triage-warden | grep -i error
# Check resource usage
docker stats --no-stream
Common Issues
Service Won't Start
Symptoms
- Container exits immediately
- "Connection refused" errors
- Health check fails
Diagnosis
# Check container logs
docker compose logs triage-warden
# Check exit code
docker compose ps -a
Common Causes & Solutions
Missing environment variables:
Error: Required environment variable TW_ENCRYPTION_KEY not set
Solution: Ensure all required env vars are set in .env
Database connection failed:
Error: Failed to connect to database: Connection refused
Solution:
- Verify PostgreSQL is running: `docker compose ps postgres`
- Check `DATABASE_URL` is correct
- Verify network connectivity
Invalid encryption key:
Error: Invalid encryption key: must be 32 bytes base64-encoded
Solution: Generate new key: openssl rand -base64 32
Database Connection Issues
Symptoms
- `/ready` returns 503
- "Database unavailable" in health check
- Queries timing out
Diagnosis
# Check database health
docker compose exec postgres pg_isready -U triage_warden
# Check connection count
docker compose exec postgres psql -U triage_warden -c \
"SELECT count(*) FROM pg_stat_activity WHERE datname = 'triage_warden';"
# Check for locks
docker compose exec postgres psql -U triage_warden -c \
"SELECT * FROM pg_locks WHERE NOT granted;"
Solutions
Connection pool exhausted:
# Increase max connections in docker-compose.yml
DATABASE_MAX_CONNECTIONS=50
# Or kill idle connections
docker compose exec postgres psql -U triage_warden -c \
"SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE datname = 'triage_warden' AND state = 'idle' AND pid <> pg_backend_pid();"
PostgreSQL not ready:
# Wait for PostgreSQL to be ready
until docker compose exec postgres pg_isready -U triage_warden; do
echo "Waiting for PostgreSQL..."
sleep 2
done
Authentication Issues
Symptoms
- "Invalid credentials" on login
- "Session expired" errors
- API returns 401
Diagnosis
# Check if user exists
docker compose exec postgres psql -U triage_warden -c \
"SELECT username, enabled, last_login_at FROM users;"
# Check session configuration
curl -s http://localhost:8080/health/detailed | jq '.components'
Solutions
Reset admin password:
# Generate new password hash (requires bcrypt)
NEW_HASH=$(htpasswd -bnBC 10 "" "newpassword" | tr -d ':\n')
# Update in database
docker compose exec postgres psql -U triage_warden -c \
"UPDATE users SET password_hash = '$NEW_HASH' WHERE username = 'admin';"
Clear sessions:
docker compose exec postgres psql -U triage_warden -c \
"DELETE FROM sessions;"
User account disabled:
docker compose exec postgres psql -U triage_warden -c \
"UPDATE users SET enabled = true WHERE username = 'admin';"
LLM/AI Features Not Working
Symptoms
- "LLM analysis failed" errors
- No AI verdicts on incidents
- Empty analysis in incident details
Diagnosis
# Check LLM configuration
curl -s http://localhost:8080/health/detailed | jq '.components.llm'
# Check for API key
docker compose exec triage-warden env | grep -E "(OPENAI|ANTHROPIC)_API_KEY"
# Check LLM settings in database
docker compose exec postgres psql -U triage_warden -c \
"SELECT provider, model, enabled FROM settings WHERE key = 'llm';"
Solutions
API key not configured:
# Set via environment variable
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
docker compose up -d
LLM disabled: Configure via UI: Settings → AI/LLM → Enable toggle
Rate limited: Check provider dashboard for rate limit status. Consider:
- Upgrading API tier
- Reducing temperature/max_tokens
- Adding request delays
Connector Failures
Symptoms
- "Connector error" status in settings
- Failed enrichments
- Missing threat intel data
Diagnosis
# Check connector status
curl -s http://localhost:8080/health/detailed | jq '.components.connectors'
# Test specific connector
curl -X POST http://localhost:8080/api/connectors/{id}/test
Solutions by Connector
VirusTotal:
- Verify API key is valid
- Check rate limits (4 req/min for free tier)
- Ensure outbound HTTPS to virustotal.com allowed
Jira:
- Verify base URL (include `/rest/api/3`)
- Use API token, not password
- Check project key exists
CrowdStrike:
- Verify OAuth client credentials
- Check API scopes granted
- Verify region (us-1, us-2, eu-1)
Splunk:
- Verify HEC token is valid
- Check SSL certificate if using HTTPS
- Verify index exists
High Memory Usage
Symptoms
- Container OOM killed
- Slow response times
- "Out of memory" errors
Diagnosis
# Check container memory
docker stats --no-stream triage-warden
# Check for memory leaks (trending)
docker stats triage-warden # Watch over time
Solutions
Increase memory limits:
# docker-compose.yml
deploy:
  resources:
    limits:
      memory: 4G
Reduce connection pool:
DATABASE_MAX_CONNECTIONS=5
Enable debug logging to investigate memory behavior (Rust has no garbage collector, so look for allocation patterns in the logs):
RUST_LOG=info,triage_warden=debug
Slow Performance
Symptoms
- High latency on API calls
- Dashboard loads slowly
- Timeouts on queries
Diagnosis
# Check response times
curl -w "@curl-format.txt" -s http://localhost:8080/health -o /dev/null
# Check database query times
docker compose exec postgres psql -U triage_warden -c \
"SELECT query, mean_exec_time, calls FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
# Check for table bloat
docker compose exec postgres psql -U triage_warden -c \
"SELECT relname, n_dead_tup, n_live_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"
Solutions
Add database indexes:
-- Common helpful indexes
CREATE INDEX idx_incidents_created_at ON incidents(created_at DESC);
CREATE INDEX idx_incidents_severity ON incidents(severity);
CREATE INDEX idx_audit_log_timestamp ON audit_log(timestamp DESC);
Vacuum database:
docker compose exec postgres psql -U triage_warden -c "VACUUM ANALYZE;"
Enable query caching: Already enabled by default in connection pool.
Kill Switch Issues
Symptoms
- Automation stopped unexpectedly
- "Kill switch active" warnings
- Actions blocked
Diagnosis
# Check kill switch status
curl -s http://localhost:8080/api/kill-switch | jq
# Check who activated it
curl -s http://localhost:8080/health/detailed | jq '.components.kill_switch'
Solutions
Deactivate kill switch:
curl -X POST http://localhost:8080/api/kill-switch/deactivate \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"reason": "Confirmed safe to resume"}'
Or via UI: Settings → Safety → Re-enable Automation
Webhook Not Receiving Events
Symptoms
- No incidents created from SIEM
- Webhook endpoint returns errors
- Events missing
Diagnosis
# Test webhook endpoint
curl -X POST http://localhost:8080/api/webhooks/generic \
-H "Content-Type: application/json" \
-d '{"title": "Test Alert", "severity": "medium"}'
# Check webhook logs
docker compose logs triage-warden | grep -i webhook
Solutions
Signature validation failing:
- Verify webhook secret matches source configuration
- Check signature header name (X-Signature, X-Hub-Signature-256, etc.)
Payload format incorrect:
- Check source webhook format documentation
- Use generic webhook with custom mapping
Firewall blocking:
- Ensure source IP can reach webhook endpoint
- Check for WAF rules blocking requests
Diagnostic Commands
Get System Info
# Application version
curl -s http://localhost:8080/health | jq '.version'
# Database version
docker compose exec postgres psql -U triage_warden -c "SELECT version();"
# Container info
docker compose version
docker version
Export Debug Bundle
#!/bin/bash
# Create debug bundle
BUNDLE_DIR="debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE_DIR"
# Health check
curl -s http://localhost:8080/health/detailed > "$BUNDLE_DIR/health.json"
# Recent logs
docker compose logs --tail=1000 triage-warden > "$BUNDLE_DIR/app.log"
docker compose logs --tail=500 postgres > "$BUNDLE_DIR/db.log"
# Configuration (redacted)
docker compose config | grep -v -E "(PASSWORD|SECRET|KEY)" > "$BUNDLE_DIR/config.yml"
# Create archive
tar -czf "$BUNDLE_DIR.tar.gz" "$BUNDLE_DIR"
rm -rf "$BUNDLE_DIR"
echo "Debug bundle: $BUNDLE_DIR.tar.gz"
Getting Help
If you can't resolve the issue:
- Check GitHub Issues for known issues
- Create a new issue with:
- Triage Warden version
- Deployment method (Docker/K8s)
- Error messages
- Debug bundle (with secrets redacted)
- Contact support: [email protected]
Contributing
Guide to contributing to Triage Warden.
Getting Started
- Fork the repository
- Clone your fork
- Set up the development environment
- Create a branch for your changes
- Submit a pull request
Development Setup
Prerequisites
- Rust 1.75+
- Python 3.11+
- uv (Python package manager)
- SQLite (for development)
Initial Setup
# Clone repository
git clone https://github.com/your-username/triage-warden.git
cd triage-warden
# Install Rust dependencies
cargo build
# Install Python dependencies
cd python
uv sync
cd ..
# Run tests
cargo test
cd python && uv run pytest
Code Style
Rust
- Follow standard Rust conventions
- Run `cargo fmt` before committing
- Run `cargo clippy` and fix warnings
- Document public APIs with doc comments
Python
- Follow PEP 8
- Run `ruff check` and `black` before committing
- Type hints required (mypy strict mode)
- Docstrings for public functions
Pre-commit Hooks
Install pre-commit hooks:
# The project has pre-commit configured in .git/hooks
# It runs automatically on commit:
# - cargo fmt
# - cargo clippy
# - ruff
# - black
# - mypy
Pull Request Process
-
Create a branch
git checkout -b feature/my-feature -
Make changes
- Write code
- Add tests
- Update documentation
-
Run checks
cargo fmt && cargo clippy cargo test cd python && uv run pytest -
Commit
git commit -m "feat: add new feature" -
Push and create PR
git push origin feature/my-feature -
Address review feedback
Commit Messages
Follow conventional commits:
type(scope): description
[optional body]
[optional footer]
Types:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation
- `refactor`: Code refactoring
- `test`: Adding tests
- `chore`: Maintenance
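A minimal check for the first line of a commit message, mirroring the types above (illustrative only, not a project hook):

```python
import re

# Conventional-commit first line: type, optional (scope), colon, description
PATTERN = re.compile(r"^(feat|fix|docs|refactor|test|chore)(\([a-z0-9-]+\))?: .+")

def is_valid(message: str) -> bool:
    return bool(PATTERN.match(message.splitlines()[0]))

print(is_valid("feat(api): add incident filters"))  # True
print(is_valid("added some stuff"))                 # False
```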
Testing
Rust Tests
# Run all tests
cargo test
# Run specific crate tests
cargo test -p tw-api
# Run with output
cargo test -- --nocapture
Python Tests
cd python
uv run pytest
# Run specific tests
uv run pytest tests/test_agents.py
# With coverage
uv run pytest --cov=tw_ai
Integration Tests
# Start test server
cargo run --bin tw-api &
# Run integration tests
./scripts/integration-tests.sh
Documentation
- Update docs for API changes
- Add examples for new features
- Keep README.md current
Build docs locally:
cd docs-site
mdbook serve
Issue Reporting
When reporting issues:
- Search existing issues first
- Use issue templates
- Include:
- Version information
- Steps to reproduce
- Expected vs actual behavior
- Relevant logs
Questions
- Open a GitHub Discussion
- Check existing discussions first
- Tag appropriately
License
By contributing, you agree that your contributions will be licensed under the MIT License.
Building from Source
Complete guide to building Triage Warden.
Prerequisites
Rust
# Install Rust via rustup
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Verify installation
rustc --version # Should be 1.75+
Python
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Verify installation
uv --version
System Dependencies
macOS
brew install openssl pkg-config
Ubuntu/Debian
sudo apt-get install build-essential pkg-config libssl-dev
Fedora
sudo dnf install gcc openssl-devel pkgconfig
Building
Debug Build
cargo build
Outputs:
- target/debug/tw-api
- target/debug/tw-cli
Release Build
cargo build --release
Outputs:
- target/release/tw-api
- target/release/tw-cli
Python Package
cd python
uv sync
uv build
PyO3 Bridge
The bridge is built automatically with cargo:
cd tw-bridge
cargo build --release
Build Options
Feature Flags
# Build with PostgreSQL support only
cargo build --no-default-features --features postgres
# Build with all features
cargo build --all-features
Cross-Compilation
# For Linux (from macOS)
rustup target add x86_64-unknown-linux-gnu
cargo build --release --target x86_64-unknown-linux-gnu
# For musl (static binary)
rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl
Docker Build
Build Image
docker build -t triage-warden .
Multi-Stage Dockerfile
# Builder stage
FROM rust:1.75 as builder
WORKDIR /app
COPY . .
RUN cargo build --release
# Runtime stage
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/tw-api /usr/local/bin/
CMD ["tw-api"]
Verification
Run Tests
# Rust tests
cargo test
# Python tests
cd python && uv run pytest
# All tests
./scripts/test-all.sh
Linting
# Rust
cargo fmt --check
cargo clippy -- -D warnings
# Python
cd python
uv run ruff check
uv run black --check .
uv run mypy .
Smoke Test
# Start server
./target/release/tw-api &
# Health check
curl http://localhost:8080/api/health
# Stop server
kill %1
Troubleshooting
OpenSSL Errors
# macOS
export OPENSSL_DIR=$(brew --prefix openssl)
# Linux
export OPENSSL_DIR=/usr
PyO3 Build Issues
# Ensure Python is found
export PYO3_PYTHON=$(which python3)
# Clean and rebuild
cargo clean -p tw-bridge
cargo build -p tw-bridge
Out of Memory
# Reduce parallel jobs
cargo build -j 2
Testing
Guide to testing Triage Warden.
Test Structure
triage-warden/
├── crates/
│ ├── tw-api/src/
│ │ └── tests/ # API integration tests
│ ├── tw-core/src/
│ │ └── tests/ # Core unit tests
│ └── tw-actions/src/
│ └── tests/ # Action handler tests
└── python/
└── tests/ # Python tests
Running Tests
All Tests
# Rust
cargo test
# Python
cd python && uv run pytest
# Everything
./scripts/test-all.sh
Specific Tests
# Single crate
cargo test -p tw-api
# Single test
cargo test test_incident_creation
# Pattern match
cargo test incident
# With output
cargo test -- --nocapture
Unit Tests
Rust Unit Tests
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_incident_creation() {
        let incident = Incident::new(
            IncidentType::Phishing,
            Severity::High,
        );
        assert_eq!(incident.status, IncidentStatus::Open);
    }

    #[tokio::test]
    async fn test_async_operation() {
        let result = async_function().await;
        assert!(result.is_ok());
    }
}
Python Unit Tests
import pytest
from tw_ai.agents import TriageAgent

def test_agent_creation():
    agent = TriageAgent()
    assert agent.model == "claude-sonnet-4-20250514"

@pytest.mark.asyncio
async def test_triage():
    agent = TriageAgent()
    verdict = await agent.triage(mock_incident)
    assert verdict.classification in ["malicious", "benign"]
Integration Tests
API Integration Tests
#[tokio::test]
async fn test_incident_api() {
    let app = create_test_app().await;

    // Create incident
    let response = app
        .oneshot(
            Request::builder()
                .method("POST")
                .uri("/api/incidents")
                .header("Content-Type", "application/json")
                .body(Body::from(r#"{"type":"phishing"}"#))
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::CREATED);
}
Database Tests
#[tokio::test]
async fn test_repository() {
    // Use in-memory SQLite
    let pool = create_test_pool().await;
    let repo = SqliteIncidentRepository::new(pool);

    let incident = repo.create(&new_incident).await.unwrap();
    let found = repo.get(incident.id).await.unwrap();

    assert_eq!(found.unwrap().id, incident.id);
}
Test Fixtures
Rust Fixtures
// tests/fixtures.rs
pub fn mock_incident() -> Incident {
    Incident {
        id: Uuid::new_v4(),
        incident_type: IncidentType::Phishing,
        severity: Severity::High,
        status: IncidentStatus::Open,
        raw_data: json!({"subject": "Test"}),
        ..Default::default()
    }
}
Python Fixtures
# tests/conftest.py
import pytest

@pytest.fixture
def mock_incident():
    return {
        "id": "test-123",
        "type": "phishing",
        "severity": "high",
        "raw_data": {"subject": "Test Email"}
    }

@pytest.fixture
def mock_connector():
    return MockThreatIntelConnector()
Mocking
Rust Mocking
use mockall::mock;

mock! {
    ThreatIntelConnector {}

    #[async_trait]
    impl ThreatIntelConnector for ThreatIntelConnector {
        async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport>;
    }
}

#[tokio::test]
async fn test_with_mock() {
    let mut mock = MockThreatIntelConnector::new();
    mock.expect_lookup_hash()
        .returning(|_| Ok(ThreatReport::clean()));

    let result = function_using_connector(&mock).await;
    assert!(result.is_ok());
}
Python Mocking
from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_with_mock():
    with patch("tw_ai.agents.tools.lookup_hash") as mock:
        mock.return_value = {"malicious": False}

        agent = TriageAgent()
        verdict = await agent.triage(mock_incident)

        mock.assert_called_once()
Test Coverage
Rust Coverage
cargo install cargo-tarpaulin
cargo tarpaulin --out Html
Python Coverage
cd python
uv run pytest --cov=tw_ai --cov-report=html
CI Testing
GitHub Actions runs tests on every PR:
# .github/workflows/test.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo test
      - run: cargo clippy -- -D warnings
Test Data
Evaluation Test Cases
Test cases for AI triage evaluation:
# python/tw_ai/evaluation/test_cases/phishing.yaml
- name: obvious_phishing
  input:
    sender: "[email protected]"
    subject: "Urgent: Verify Account"
    urls: ["https://phishing-site.com/login"]
    auth_results: {spf: fail, dkim: fail}
  expected:
    classification: malicious
    min_confidence: 0.8
Run evaluation:
cd python
uv run pytest tests/test_evaluation.py
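The pass criterion for a case like the one above can be sketched as follows (field names follow the YAML; the helper itself is illustrative, not the evaluation harness):

```python
def passes(expected: dict, verdict: dict) -> bool:
    # A case passes when the classification matches and confidence clears the floor
    return (verdict["classification"] == expected["classification"]
            and verdict["confidence"] >= expected["min_confidence"])

expected = {"classification": "malicious", "min_confidence": 0.8}
print(passes(expected, {"classification": "malicious", "confidence": 0.92}))  # True
print(passes(expected, {"classification": "malicious", "confidence": 0.5}))   # False
```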
Adding Connectors
Guide to implementing new connectors.
Connector Architecture
Connectors follow a trait-based pattern:
Connector Trait (base)
│
├── ThreatIntelConnector
├── SIEMConnector
├── EDRConnector
├── EmailGatewayConnector
└── TicketingConnector
Implementing a Connector
1. Create the File
touch crates/tw-connectors/src/threat_intel/my_provider.rs
2. Implement Base Trait
use crate::traits::{Connector, ConnectorError, ConnectorHealth, ConnectorResult};
use async_trait::async_trait;

pub struct MyProviderConnector {
    client: reqwest::Client,
    api_key: String,
    base_url: String,
}

impl MyProviderConnector {
    pub fn new(api_key: String) -> Result<Self, ConnectorError> {
        let client = reqwest::Client::builder()
            .timeout(std::time::Duration::from_secs(30))
            .build()
            .map_err(|e| ConnectorError::Configuration(e.to_string()))?;

        Ok(Self {
            client,
            api_key,
            base_url: "https://api.myprovider.com".to_string(),
        })
    }
}

#[async_trait]
impl Connector for MyProviderConnector {
    fn name(&self) -> &str {
        "my_provider"
    }

    fn connector_type(&self) -> &str {
        "threat_intel"
    }

    async fn health_check(&self) -> ConnectorResult<ConnectorHealth> {
        let response = self.client
            .get(format!("{}/health", self.base_url))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| ConnectorError::NetworkError(e.to_string()))?;

        if response.status().is_success() {
            Ok(ConnectorHealth::Healthy)
        } else {
            Ok(ConnectorHealth::Unhealthy {
                message: "Health check failed".to_string(),
            })
        }
    }

    async fn test_connection(&self) -> ConnectorResult<bool> {
        match self.health_check().await? {
            ConnectorHealth::Healthy => Ok(true),
            _ => Ok(false),
        }
    }
}
3. Implement Specialized Trait
use crate::traits::{ThreatIntelConnector, ThreatReport, IndicatorType};

#[async_trait]
impl ThreatIntelConnector for MyProviderConnector {
    async fn lookup_hash(&self, hash: &str) -> ConnectorResult<ThreatReport> {
        let response = self.client
            .get(format!("{}/files/{}", self.base_url, hash))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| ConnectorError::NetworkError(e.to_string()))?;

        if response.status() == reqwest::StatusCode::NOT_FOUND {
            return Ok(ThreatReport {
                indicator: hash.to_string(),
                indicator_type: IndicatorType::FileHash,
                malicious: false,
                confidence: 0.0,
                categories: vec![],
                first_seen: None,
                last_seen: None,
                sources: vec![],
            });
        }

        let data: ApiResponse = response.json().await
            .map_err(|e| ConnectorError::InvalidResponse(e.to_string()))?;

        Ok(self.convert_response(data))
    }

    async fn lookup_url(&self, url: &str) -> ConnectorResult<ThreatReport> {
        // Similar implementation
        todo!()
    }

    async fn lookup_domain(&self, domain: &str) -> ConnectorResult<ThreatReport> {
        // Similar implementation
        todo!()
    }

    async fn lookup_ip(&self, ip: &str) -> ConnectorResult<ThreatReport> {
        // Similar implementation
        todo!()
    }
}
4. Add to Module
// crates/tw-connectors/src/threat_intel/mod.rs
mod my_provider;

pub use my_provider::MyProviderConnector;
5. Register in Bridge
// tw-bridge/src/lib.rs
impl ThreatIntelBridge {
    pub fn new(mode: &str) -> PyResult<Self> {
        let connector: Arc<dyn ThreatIntelConnector + Send + Sync> = match mode {
            "virustotal" => Arc::new(VirusTotalConnector::new(
                std::env::var("TW_VIRUSTOTAL_API_KEY")
                    .map_err(|_| PyErr::new::<pyo3::exceptions::PyValueError, _>(
                        "TW_VIRUSTOTAL_API_KEY not set"
                    ))?
            )?),
            "my_provider" => Arc::new(MyProviderConnector::new(
                std::env::var("TW_MY_PROVIDER_API_KEY")
                    .map_err(|_| PyErr::new::<pyo3::exceptions::PyValueError, _>(
                        "TW_MY_PROVIDER_API_KEY not set"
                    ))?
            )?),
            _ => Arc::new(MockThreatIntelConnector::new("mock")),
        };

        Ok(Self { connector })
    }
}
Error Handling
Use appropriate error types:
pub enum ConnectorError {
    /// Configuration issue
    Configuration(String),
    /// Network/connection error
    NetworkError(String),
    /// Authentication failed
    AuthenticationFailed(String),
    /// Resource not found
    NotFound(String),
    /// Rate limited
    RateLimited { retry_after: Option<Duration> },
    /// Invalid response from service
    InvalidResponse(String),
    /// Request failed
    RequestFailed(String),
}
Rate Limiting
Implement rate limiting in your connector:
use governor::{Quota, RateLimiter};

pub struct MyProviderConnector {
    client: reqwest::Client,
    api_key: String,
    rate_limiter: RateLimiter<...>,
}

impl MyProviderConnector {
    async fn make_request(&self, url: &str) -> ConnectorResult<Response> {
        self.rate_limiter.until_ready().await;

        self.client.get(url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| ConnectorError::NetworkError(e.to_string()))
    }
}
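The core idea of `until_ready` is language-agnostic: block until the quota allows another request, then record when the next one may go out. As a hedged illustration (not the `governor` algorithm itself, which uses a GCRA-based scheme), here is a minimal interval-based limiter:

```python
# Minimal sketch of the rate-limiting idea: each call to until_ready()
# blocks until at least `interval` seconds have passed since the slot
# reserved by the previous call. Illustrative only.
import time

class IntervalLimiter:
    def __init__(self, per_second: float):
        self.interval = 1.0 / per_second
        self.next_allowed = 0.0  # monotonic timestamp of next permitted call

    def until_ready(self) -> None:
        now = time.monotonic()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.interval

limiter = IntervalLimiter(per_second=10)
start = time.monotonic()
for _ in range(4):
    limiter.until_ready()  # first call passes immediately, rest are throttled
elapsed = time.monotonic() - start  # roughly 0.3 s for the 3 throttled calls
```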
Testing
Unit Tests
#[cfg(test)]
mod tests {
    use super::*;
    use wiremock::{MockServer, Mock, ResponseTemplate};
    use wiremock::matchers::{method, path};

    #[tokio::test]
    async fn test_lookup_hash() {
        let mock_server = MockServer::start().await;

        Mock::given(method("GET"))
            .and(path("/files/abc123"))
            .respond_with(ResponseTemplate::new(200).set_body_json(json!({
                "malicious": true,
                "confidence": 0.95
            })))
            .mount(&mock_server)
            .await;

        let connector = MyProviderConnector::with_base_url(
            "test-key".to_string(),
            mock_server.uri(),
        );

        let result = connector.lookup_hash("abc123").await.unwrap();
        assert!(result.malicious);
    }
}
Documentation
Document your connector:
//! MyProvider threat intelligence connector.
//!
//! # Configuration
//!
//! Set `TW_MY_PROVIDER_API_KEY` environment variable.
//!
//! # Example
//!
//! ```rust
//! let connector = MyProviderConnector::new(api_key)?;
//! let report = connector.lookup_hash("abc123").await?;
//! ```
Adding Actions
Guide to implementing new action handlers.
Action Architecture
Actions implement the Action trait:
#[async_trait]
pub trait Action: Send + Sync {
    fn name(&self) -> &str;
    fn description(&self) -> &str;
    fn required_parameters(&self) -> Vec<ParameterDef>;
    fn supports_rollback(&self) -> bool;

    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError>;

    async fn rollback(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        Err(ActionError::RollbackNotSupported)
    }
}
Implementing an Action
1. Create the File
touch crates/tw-actions/src/my_action.rs
2. Define the Action
use crate::registry::{
    Action, ActionContext, ActionError, ActionResult, ParameterDef, ParameterType,
};
use async_trait::async_trait;
use chrono::Utc;
use std::collections::HashMap;
use tracing::{info, instrument};

/// My custom action handler.
pub struct MyAction;

impl MyAction {
    pub fn new() -> Self {
        Self
    }
}

impl Default for MyAction {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl Action for MyAction {
    fn name(&self) -> &str {
        "my_action"
    }

    fn description(&self) -> &str {
        "Description of what this action does"
    }

    fn required_parameters(&self) -> Vec<ParameterDef> {
        vec![
            ParameterDef::required(
                "target",
                "The target of the action",
                ParameterType::String,
            ),
            ParameterDef::optional(
                "force",
                "Force the action even if conditions aren't met",
                ParameterType::Boolean,
                serde_json::json!(false),
            ),
        ]
    }

    fn supports_rollback(&self) -> bool {
        true
    }

    #[instrument(skip(self, context))]
    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        let started_at = Utc::now();

        // Get required parameter
        let target = context.require_string("target")?;

        // Get optional parameter with default
        let force = context
            .get_param("force")
            .and_then(|v| v.as_bool())
            .unwrap_or(false);

        info!("Executing my_action on target: {}", target);

        // Perform the action
        // ...

        // Build output
        let mut output = HashMap::new();
        output.insert("target".to_string(), serde_json::json!(target));
        output.insert("success".to_string(), serde_json::json!(true));

        Ok(ActionResult::success(
            self.name(),
            &format!("Action completed on {}", target),
            started_at,
            output,
        ))
    }

    async fn rollback(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        let started_at = Utc::now();
        let target = context.require_string("target")?;

        info!("Rolling back my_action on target: {}", target);

        // Perform rollback
        // ...

        let mut output = HashMap::new();
        output.insert("target".to_string(), serde_json::json!(target));

        Ok(ActionResult::success(
            &format!("{}_rollback", self.name()),
            &format!("Rollback completed on {}", target),
            started_at,
            output,
        ))
    }
}
3. Add to Module
// crates/tw-actions/src/lib.rs
mod my_action;

pub use my_action::MyAction;
4. Register in Registry
// crates/tw-actions/src/registry.rs
impl ActionRegistry {
    pub fn new() -> Self {
        let mut registry = Self {
            actions: HashMap::new(),
        };

        // Register built-in actions
        registry.register(Box::new(QuarantineEmailAction::new()));
        registry.register(Box::new(BlockSenderAction::new()));
        registry.register(Box::new(MyAction::new())); // Add here

        registry
    }
}
Parameter Types
Available parameter types:
pub enum ParameterType {
    String,
    Integer,
    Float,
    Boolean,
    List,
    Object,
}
Define parameters:
fn required_parameters(&self) -> Vec<ParameterDef> {
    vec![
        ParameterDef::required("name", "Description", ParameterType::String),
        ParameterDef::optional("count", "Description", ParameterType::Integer, json!(10)),
        ParameterDef::optional("tags", "Description", ParameterType::List, json!([])),
    ]
}
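The definitions above drive a simple resolution rule: a required parameter must be supplied by the caller, while an optional one falls back to its declared default. As a hedged sketch (the dict shapes and `resolve_params` name are illustrative, not the real registry API), the logic looks like this:

```python
# Hypothetical sketch of parameter resolution against a definition list:
# required params must be present, optional params take their defaults.
def resolve_params(defs: list[dict], supplied: dict) -> dict:
    resolved = {}
    for d in defs:
        name = d["name"]
        if name in supplied:
            resolved[name] = supplied[name]
        elif d["required"]:
            raise ValueError(f"missing required parameter: {name}")
        else:
            resolved[name] = d["default"]
    return resolved

defs = [
    {"name": "name", "required": True},
    {"name": "count", "required": False, "default": 10},
    {"name": "tags", "required": False, "default": []},
]
print(resolve_params(defs, {"name": "scan"}))
# {'name': 'scan', 'count': 10, 'tags': []}
```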
Using Connectors
Actions can use connectors via dependency injection:
pub struct MyAction {
    connector: Arc<dyn MyConnector + Send + Sync>,
}

impl MyAction {
    pub fn new(connector: Arc<dyn MyConnector + Send + Sync>) -> Self {
        Self { connector }
    }
}

#[async_trait]
impl Action for MyAction {
    async fn execute(&self, context: ActionContext) -> Result<ActionResult, ActionError> {
        // Use connector
        let result = self.connector.do_something().await
            .map_err(|e| ActionError::ExecutionFailed(e.to_string()))?;

        // ...
    }
}
Error Handling
Use appropriate error types:
pub enum ActionError {
    /// Missing or invalid parameters
    InvalidParameters(String),
    /// Execution failed
    ExecutionFailed(String),
    /// Action timed out
    Timeout,
    /// Rollback not supported
    RollbackNotSupported,
    /// Policy denied the action
    PolicyDenied(String),
}
Testing
Unit Tests
#[cfg(test)]
mod tests {
    use super::*;
    use uuid::Uuid;

    #[tokio::test]
    async fn test_my_action_success() {
        let action = MyAction::new();
        let context = ActionContext::new(Uuid::new_v4())
            .with_param("target", serde_json::json!("test-target"));

        let result = action.execute(context).await.unwrap();
        assert!(result.success);
        assert_eq!(result.output["target"], "test-target");
    }

    #[tokio::test]
    async fn test_my_action_missing_param() {
        let action = MyAction::new();
        let context = ActionContext::new(Uuid::new_v4());

        let result = action.execute(context).await;
        assert!(matches!(result, Err(ActionError::InvalidParameters(_))));
    }

    #[tokio::test]
    async fn test_my_action_rollback() {
        let action = MyAction::new();
        assert!(action.supports_rollback());

        let context = ActionContext::new(Uuid::new_v4())
            .with_param("target", serde_json::json!("test-target"));

        let result = action.rollback(context).await.unwrap();
        assert!(result.success);
    }
}
Policy Integration
Actions are automatically evaluated by the policy engine. Configure a default approval level for the new action:
# Default policy for new action
[[policy.rules]]
name = "my_action_default"
action = "my_action"
approval_level = "analyst"
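To make the rule semantics concrete, here is a hedged sketch of the matching step: the first rule whose `action` field matches the action name decides the approval level. The rule shape is taken from the TOML above; the fallback level and the `approval_level` function are illustrative assumptions, not the real (Rust) policy engine.

```python
# Hypothetical sketch: pick the approval level for an action from a rule list.
# Falls back to a caller-supplied default when no rule matches (assumption).
def approval_level(rules: list[dict], action: str, default: str = "admin") -> str:
    for rule in rules:
        if rule["action"] == action:
            return rule["approval_level"]
    return default

rules = [
    {"name": "my_action_default", "action": "my_action", "approval_level": "analyst"},
]
print(approval_level(rules, "my_action"))  # analyst
```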
Documentation
Document your action:
//! My custom action.
//!
//! This action performs X on target Y.
//!
//! # Parameters
//!
//! - `target` (required): The target to act on
//! - `force` (optional): Force execution (default: false)
//!
//! # Example
//!
//! ```yaml
//! - action: my_action
//!   parameters:
//!     target: "example"
//!     force: true
//! ```
//!
//! # Rollback
//!
//! This action supports rollback via `my_action_rollback`.
Changelog
All notable changes to Triage Warden.
[Unreleased]
Added
- AI-powered triage agent with Claude integration
- Configurable playbooks for automated investigation
- Policy engine with approval workflows
- Connector framework for external integrations
- Web dashboard with HTMX
- REST API for programmatic access
- CLI for command-line operations
Connectors
- VirusTotal threat intelligence
- Splunk SIEM integration
- CrowdStrike EDR integration
- Microsoft 365 email gateway
- Jira ticketing integration
Actions
- Email: parse_email, check_email_authentication, quarantine_email, block_sender
- Lookup: lookup_sender_reputation, lookup_urls, lookup_attachments
- Host: isolate_host, scan_host
- Notification: notify_user, escalate, create_ticket
[0.1.0] - 2024-01-15
Added
- Initial release
- Core incident management
- Basic web interface
- SQLite database support
- Mock connectors for development
Version Numbering
This project follows Semantic Versioning:
- MAJOR: Incompatible API changes
- MINOR: Backwards-compatible new features
- PATCH: Backwards-compatible bug fixes
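One practical consequence of this scheme is that versions compare component-wise and numerically, not as strings, so `0.10.0` is newer than `0.9.1`. A small sketch (release tags without pre-release suffixes assumed):

```python
# Parse a plain MAJOR.MINOR.PATCH version into a comparable tuple.
# Pre-release/build suffixes (e.g. "1.0.0-rc.1") are out of scope here.
def parse(version: str) -> tuple[int, int, int]:
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

print(parse("0.10.0") > parse("0.9.1"))   # True: numeric, not string, comparison
print(parse("1.2.0") > parse("1.1.9"))    # True
```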
Upgrade Guide
From 0.x to 1.0
When 1.0 is released, an upgrade guide will be provided here.