Operations Guide

Operational procedures and runbooks for Triage Warden.

Runbooks

Quick Reference

Health Check Endpoints

| Endpoint | Purpose | Expected Response |
| --- | --- | --- |
| GET /live | Liveness probe | 200 OK |
| GET /ready | Readiness probe | 200 OK if ready, 503 if not |
| GET /health | Basic health | JSON with status |
| GET /health/detailed | Full component health | JSON with all components |
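
The readiness contract above (200 when ready, 503 when not) can be checked from a script. The following is a minimal sketch, not part of Triage Warden itself: the function names `interpret_ready` and `probe_status` and the `BASE_URL` variable are hypothetical, and the default address is assumed from the `localhost:8080` commands in this guide.

```shell
#!/usr/bin/env bash

# Map an HTTP status code returned by GET /ready to a readiness state.
# 200 and 503 come from the endpoint table; anything else is reported as unknown.
interpret_ready() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "not-ready" ;;
    *)   echo "unknown" ;;
  esac
}

# Print only the HTTP status code for a probe path such as /live or /ready.
# BASE_URL is an assumed variable; it defaults to the address used elsewhere in this guide.
probe_status() {
  local base_url="${BASE_URL:-http://localhost:8080}"
  curl -s -o /dev/null -w '%{http_code}' "${base_url}$1"
}
```

A typical call would be `interpret_ready "$(probe_status /ready)"`. When the service cannot be reached at all, curl reports the status code as `000`, which this sketch classifies as `unknown`.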

Key Metrics

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| http_requests_total | Total HTTP requests | N/A |
| http_request_duration_seconds | Request latency | p99 > 1s |
| http_requests_in_flight | Concurrent requests | > 100 |
| db_pool_connections_active | Active DB connections | > 80% of max |
| incidents_total | Total incidents processed | N/A |
| actions_executed_total | Total actions executed | N/A |
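
The two numeric thresholds above can be expressed as simple comparisons. This is an illustrative sketch only: the function names are hypothetical, and how the current metric values are obtained (for example from a metrics scrape) is left out because this guide does not specify it.

```shell
#!/usr/bin/env bash

# Succeed (exit 0) when active DB connections exceed 80% of the pool maximum.
# Integer arithmetic avoids floating point: active/max > 0.8  <=>  active*100 > max*80.
db_pool_over_threshold() {
  local active="$1" max="$2"
  [ $((active * 100)) -gt $((max * 80)) ]
}

# Succeed (exit 0) when concurrent requests exceed the threshold of 100.
in_flight_over_threshold() {
  local in_flight="$1"
  [ "$in_flight" -gt 100 ]
}
```

For example, `db_pool_over_threshold 17 20` succeeds (85% of the pool is in use), while `db_pool_over_threshold 16 20` does not, because exactly 80% is not above the threshold.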

Emergency Contacts

| Role | Contact | Escalation |
| --- | --- | --- |
| On-call Engineer | PagerDuty | Auto-escalates after 15m |
| Security Lead | [email protected] | Critical security issues |
| Database Admin | [email protected] | Database emergencies |

Common Commands

Docker

# View logs
docker compose logs -f triage-warden

# Restart service
docker compose restart triage-warden

# Check health
curl -s http://localhost:8080/health | jq

# Database backup (-T disables TTY allocation so the redirected dump stays clean)
docker compose exec -T postgres pg_dump -U triage_warden > backup.sql

Kubernetes

# View logs
kubectl logs -f deployment/triage-warden -n triage-warden

# Restart pods
kubectl rollout restart deployment/triage-warden -n triage-warden

# Check health (no -it: a TTY is not needed and can garble output piped to jq)
kubectl exec deployment/triage-warden -n triage-warden -- curl -s localhost:8080/health | jq

# Scale up/down
kubectl scale deployment triage-warden -n triage-warden --replicas=5

Database

# Connect to PostgreSQL (run from a shell; the statements below run inside psql)
psql "$DATABASE_URL"

-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'triage_warden';

-- Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

Service Dependencies

┌──────────────────┐
│  Triage Warden   │
└────────┬─────────┘
         │
    ┌────┴─────┬──────────┬──────────┐
    │          │          │          │
    ▼          ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│Postgres│ │  LLM   │ │Connec- │ │Notifi- │
│   DB   │ │  API   │ │ tors   │ │cations │
└────────┘ └────────┘ └────────┘ └────────┘

Dependency Health Impact

| Dependency | If Unavailable |
| --- | --- |
| PostgreSQL | Service fails readiness, no data access |
| LLM API | AI analysis disabled, manual triage only |
| Connectors | Specific integrations fail, core works |
| Notifications | Alerts not delivered, incidents still process |

Scheduled Tasks

| Task | Schedule | Description |
| --- | --- | --- |
| Database backup | Daily 2:00 AM | Full PostgreSQL backup |
| Connector health check | Every 5 minutes | Verify connector connectivity |
| Incident cleanup | Weekly Sunday 3:00 AM | Archive old incidents |
| Log rotation | Daily | Rotate and compress logs |
| Certificate renewal | 30 days before expiry | Renew TLS certificates |
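
The "30 days before expiry" rule for certificate renewal amounts to a comparison between the expiry time and the current time. The sketch below illustrates that arithmetic only; the function name `cert_due_for_renewal` and the `RENEWAL_WINDOW_DAYS` variable are hypothetical, and this guide does not describe how Triage Warden actually schedules renewal.

```shell
#!/usr/bin/env bash

# Renewal window taken from the schedule table: 30 days before expiry.
RENEWAL_WINDOW_DAYS=30

# Succeed (exit 0) when the certificate expires within the renewal window.
# $1: certificate expiry as a Unix epoch timestamp (seconds)
# $2: current time as a Unix epoch timestamp (defaults to now)
cert_due_for_renewal() {
  local expiry_epoch="$1"
  local now_epoch="${2:-$(date +%s)}"
  local window_seconds=$((RENEWAL_WINDOW_DAYS * 86400))
  [ $((expiry_epoch - now_epoch)) -le "$window_seconds" ]
}
```

When a certificate file is available, `openssl x509 -checkend 2592000 -noout -in cert.pem` performs a comparable check directly (2592000 seconds is 30 days); the file name `cert.pem` is a placeholder.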