ADR-001: Event Bus Architecture

Status

Accepted

Context

Triage Warden needs to coordinate multiple components (enrichment, analysis, action execution, notifications) in response to security incidents. We needed a way to:

Decouple components for independent development and testing
Enable real-time updates to the dashboard
Support both synchronous and asynchronous processing
Maintain an audit trail of all system events

Decision

We implemented an in-process event bus using Tokio channels with the following design:

Event Types

All significant system events are captured as TriageEvent variants:

AlertReceived - New alert from webhook
IncidentCreated - Incident created from alert
EnrichmentComplete - Single enrichment finished
EnrichmentPhaseComplete - All enrichments done
AnalysisComplete - AI analysis finished
ActionsProposed - Response actions proposed
ActionApproved/Denied - Action approval decision
ActionExecuted - Action completed
StatusChanged - Incident status transition
TicketCreated - External ticket created
IncidentEscalated - Incident escalated
IncidentResolved - Incident resolved
KillSwitchActivated - Emergency stop triggered

Delivery Mechanisms

Broadcast Channel: For real-time dashboard updates via SSE
Named Subscribers: For component-specific processing queues
Event History: In-memory buffer for recent event retrieval

Error Handling

Events are fire-and-forget with fallback logging:

publish() - Returns Result for cases where failure matters
publish_with_fallback() - Logs errors, never fails (for non-critical events)

Consequences

Positive

Components are loosely coupled and independently testable
Dashboard receives real-time updates without polling
Complete event history available for debugging
Failed subscribers don't block the main processing flow

Negative

In-process only - no distributed event bus
Event history is limited and in-memory (lost on restart)
No guaranteed delivery or replay capability
Broadcast channel has limited buffer (may drop events under load)

Future Considerations

For high-availability deployments, consider:

Redis Pub/Sub for distributed events
PostgreSQL LISTEN/NOTIFY for persistent events
External message queue (RabbitMQ, Kafka) for durability