Feature: SLA Monitoring, Health Checks & Incident Response
Spec: PF-83
Last Updated: 2026-03-17
Overview
This runbook covers operational procedures for monitoring SLA health, responding to breaches, and troubleshooting common issues.
Architecture
Current (Phase 1)
- SLA definitions and instances are stored in `pf_sla_definitions` and `pf_sla_instances`
- Instance status is computed client-side via hooks (`useSLADashboard`, `useSLAInstances`)
- Events are published to `fw_domain_events` for downstream notifications
- Deadline calculation uses an interval-based fallback (`now() + interval`)
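The Phase 1 behaviour described above can be sketched as follows. This is a minimal illustration, not the actual hook code: the `SlaInstance` shape, the status values, and the `targetMinutes` parameter are assumptions, and only the `deadline_at` name appears elsewhere in this runbook.

```typescript
// Hypothetical shapes for illustration; the real table columns may differ.
type SlaStatus = "active" | "paused" | "completed" | "breached";

interface SlaInstance {
  id: string;
  status: SlaStatus;   // stored status (assumed values)
  deadline_at: string; // ISO 8601 timestamp
}

// Interval-based fallback: deadline = start + target interval,
// with no business-calendar awareness.
function computeDeadline(start: Date, targetMinutes: number): Date {
  return new Date(start.getTime() + targetMinutes * 60 * 1000);
}

// Client-side derivation: an instance stored as "active" is reported as
// "breached" once its deadline has passed. Any other stored status is
// returned unchanged.
function deriveStatus(instance: SlaInstance, now: Date): SlaStatus {
  if (instance.status !== "active") return instance.status;
  return now.getTime() > new Date(instance.deadline_at).getTime()
    ? "breached"
    : "active";
}
```

If status is derived this way, the stored row can remain `active` after the deadline until something writes to it, so the displayed status and the stored status may differ in Phase 1.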
Planned (Phase 2)
- `pf-sla-checker` Edge Function: periodic cron job that scans active instances, updates statuses, and fires breach/warning events automatically
- Database Webhook on `fw_domain_events`: auto-instantiates SLA instances when trigger events arrive
- PF-84 Business Calendars: business-hours-aware deadline calculation (excludes weekends and holidays)
- Breach detection latency target: < 2 minutes from event to status update
Health Checks
Active Instance Count
Compliance Rate (Last 30 Days)
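As a rough illustration of what this check measures, the sketch below computes a compliance rate from rows already loaded in memory. It assumes that compliance means the share of instances closed in the window that were completed rather than breached. The `ClosedInstance` shape, its `closed_at` field, and the status values are hypothetical and may not match `pf_sla_instances`.

```typescript
interface ClosedInstance {
  status: "completed" | "breached"; // assumed terminal statuses
  closed_at: string;                // ISO 8601 timestamp (hypothetical column)
}

// Returns the fraction (0..1) of instances closed within the last
// `windowDays` days that were completed, or null if none closed.
function complianceRate(
  rows: ClosedInstance[],
  now: Date,
  windowDays: number = 30
): number | null {
  const cutoff = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  const inWindow = rows.filter((r) => {
    const t = new Date(r.closed_at).getTime();
    return t >= cutoff && t <= now.getTime();
  });
  if (inWindow.length === 0) return null;
  const completed = inWindow.filter((r) => r.status === "completed").length;
  return completed / inWindow.length;
}
```

Returning `null` for an empty window keeps "no data" distinct from a genuine 0% rate.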
Stale Active Instances
Instances that have been active longer than 2× their target duration may indicate missed completion events.
Manual Instance Management
Pause an Instance (UI)
- Navigate to Settings → SLA → Instances
- Find the instance → ⋮ menu → Pause
- Enter reason → Confirm
Resume a Paused Instance (UI)
- Find the paused instance → ⋮ menu → Resume
Extend a Deadline (SQL — Admin Only)
Force-Complete a Stale Instance (SQL — Admin Only)
Troubleshooting
Instance Not Created After Trigger Event
- Verify the trigger event exists in `fw_domain_events`
- Check that the SLA definition's `trigger_event_type` matches the event's `event_type`
- Confirm the definition's `is_active` flag is `true`
- Phase 2: Check the `pf-sla-checker` edge function logs for errors
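The first three checks can be expressed as a small diagnostic. This is a sketch under assumptions: `trigger_event_type`, `event_type`, and `is_active` come from the steps above, while the `SlaDefinition` and `DomainEvent` shapes and the `explainNoInstance` helper are hypothetical.

```typescript
interface SlaDefinition {
  id: string;
  trigger_event_type: string;
  is_active: boolean;
}

interface DomainEvent {
  id: string;
  event_type: string;
}

// Returns the reasons why `definition` would not have produced an instance
// from `events`. An empty array means at least one event matches the
// definition's trigger type and the definition is active.
function explainNoInstance(
  definition: SlaDefinition,
  events: DomainEvent[]
): string[] {
  const reasons: string[] = [];
  if (events.length === 0) {
    reasons.push("no trigger events found");
  } else if (!events.some((e) => e.event_type === definition.trigger_event_type)) {
    reasons.push(`no event has event_type "${definition.trigger_event_type}"`);
  }
  if (!definition.is_active) {
    reasons.push("definition is not active");
  }
  return reasons;
}
```

The comparison in this sketch is an exact string match, so a difference in case or stray whitespace between the two type strings is reported as a mismatch.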
Instance Stuck in Active (Should Be Breached)
- Check if `deadline_at` has passed; the checker may not have run yet (Phase 2)
- In Phase 1, breach detection is client-side; the dashboard will show it correctly on next load
- To manually mark as breached:
Missed Completion Events
- Query `fw_domain_events` for the expected completion event type
- If the event exists but the instance wasn't completed, check the event matching logic
- Phase 2: The checker edge function will handle this automatically
High Active Instance Count
- Run the active instance count query above
- If near the max limit, archive or complete stale instances
- Review whether definitions are too broad (triggering on high-frequency events)
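To see which definitions contribute most to the active count, the instances can be grouped by definition. A minimal sketch, assuming each instance row carries a `definition_id` and a `status` field (both names are hypothetical):

```typescript
interface InstanceRow {
  definition_id: string; // hypothetical reference to the SLA definition
  status: string;
}

// Counts active instances per definition, sorted by count descending.
function activeCountByDefinition(rows: InstanceRow[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    if (row.status !== "active") continue;
    counts.set(row.definition_id, (counts.get(row.definition_id) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

A definition that dominates the top of this list is a candidate for review, since it may be triggering on a high-frequency event.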
Monitoring Alerts (Planned)
| Alert | Condition | Action |
|---|---|---|
| High breach rate | > 20% breaches in 24h | Review affected definitions; extend deadlines if targets are unrealistic |
| Stale instances | Active instances > 2× target duration | Investigate missed completion events |
| Instance count spike | > 500 new instances in 1 hour | Check for runaway trigger events; disable definition if needed |
| Checker failure | Edge function errors > 5 in 10 min | Check edge function logs; redeploy if needed |
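Since these alerts are planned rather than implemented, the following is only a sketch of how the four conditions in the table could be evaluated. The `AlertMetrics` shape and its field names are hypothetical, and the breach rate is interpreted here as breached instances divided by all instances closed in the same 24 hours, which is an assumption. The thresholds are taken from the table.

```typescript
interface AlertMetrics {
  closedLast24h: number;          // instances completed or breached in 24h (assumed denominator)
  breachedLast24h: number;
  staleActiveCount: number;       // active for more than 2x target duration
  newInstancesLastHour: number;
  checkerErrorsLast10Min: number;
}

type AlertName =
  | "High breach rate"
  | "Stale instances"
  | "Instance count spike"
  | "Checker failure";

// Returns the names of the alerts whose conditions currently hold.
// All comparisons are strict, matching the ">" conditions in the table.
function firingAlerts(m: AlertMetrics): AlertName[] {
  const alerts: AlertName[] = [];
  if (m.closedLast24h > 0 && m.breachedLast24h / m.closedLast24h > 0.2) {
    alerts.push("High breach rate");
  }
  if (m.staleActiveCount > 0) alerts.push("Stale instances");
  if (m.newInstancesLastHour > 500) alerts.push("Instance count spike");
  if (m.checkerErrorsLast10Min > 5) alerts.push("Checker failure");
  return alerts;
}
```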