Feature: SLA Monitoring, Health Checks & Incident Response
Spec: PF-83
Last Updated: 2026-03-17

Overview

This runbook covers operational procedures for monitoring SLA health, responding to breaches, and troubleshooting common issues.

Architecture

Current (Phase 1)

  • SLA definitions and instances stored in pf_sla_definitions and pf_sla_instances
  • Instance status computed client-side via hooks (useSLADashboard, useSLAInstances)
  • Events published to fw_domain_events for downstream notifications
  • Deadline calculation uses an interval-based fallback (now() + interval), so deadlines run on wall-clock time with no business-hours awareness
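
As a rough illustration of the Phase 1 behaviour listed above, the sketch below derives a display status on the client and computes an interval-based deadline. The `SlaInstance` shape, the `paused` status value, and the function names `intervalDeadline` and `effectiveStatus` are assumptions made for illustration; the real hooks (useSLADashboard, useSLAInstances) may work differently.

```typescript
// Hypothetical row shape: only status and deadline_at are taken from this runbook.
type SlaStatus = "active" | "paused" | "completed" | "breached";

interface SlaInstance {
  status: SlaStatus;
  deadline_at: string; // ISO 8601 timestamp
}

// Interval-based fallback: deadline = start + fixed duration, ignoring business hours.
function intervalDeadline(startedAt: Date, targetMinutes: number): Date {
  return new Date(startedAt.getTime() + targetMinutes * 60000);
}

// A row stored as "active" is displayed as "breached" once its deadline has passed.
// Every other stored status is returned unchanged.
function effectiveStatus(instance: SlaInstance, now: Date = new Date()): SlaStatus {
  const deadline = new Date(instance.deadline_at).getTime();
  if (instance.status === "active" && deadline < now.getTime()) {
    return "breached";
  }
  return instance.status;
}
```

Because the stored row is not modified, the database can still report `active` for an instance the dashboard shows as breached.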

Planned (Phase 2)

  • pf-sla-checker Edge Function — Periodic cron job to scan active instances, update statuses, and fire breach/warning events automatically
  • Database Webhook on fw_domain_events — Auto-instantiate SLA instances when trigger events arrive
  • PF-84 Business Calendars — Business-hours-aware deadline calculation (excludes weekends, holidays)
  • Breach detection latency target: < 2 minutes from event to status update

Health Checks

Active Instance Count

SELECT status, COUNT(*)
FROM pf_sla_instances
WHERE organization_id = '<org_id>'
  AND deleted_at IS NULL
GROUP BY status;

Compliance Rate (Last 30 Days)

SELECT
  COUNT(*) FILTER (WHERE status = 'completed') AS completed,
  COUNT(*) FILTER (WHERE status = 'breached') AS breached,
  ROUND(
    100.0 * COUNT(*) FILTER (WHERE status = 'completed')
    / NULLIF(COUNT(*) FILTER (WHERE status IN ('completed', 'breached')), 0),
    1
  ) AS compliance_pct
FROM pf_sla_instances
WHERE organization_id = '<org_id>'
  AND deleted_at IS NULL
  AND started_at > NOW() - INTERVAL '30 days';
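
If the two counts are already available on the client, the same percentage can be computed without another query. This is a minimal sketch that mirrors the arithmetic of the query above; `compliancePct` is a hypothetical helper name, not part of the product.

```typescript
// Mirrors the SQL: completed / (completed + breached), rounded to one decimal place.
// Returns null when no instance has resolved yet, matching the NULLIF(..., 0) case.
function compliancePct(completed: number, breached: number): number | null {
  const resolved = completed + breached;
  if (resolved === 0) {
    return null;
  }
  return Math.round((1000 * completed) / resolved) / 10;
}
```

Returning `null` rather than 0 or 100 keeps "no data" distinct from a real compliance figure.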

Stale Active Instances

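Active instances more than one hour past their deadline may indicate missed completion events. This query uses a fixed one-hour grace period rather than the 2× target duration threshold listed under the planned alerts:
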
SELECT si.id, sd.name, si.started_at, si.deadline_at,
  EXTRACT(EPOCH FROM (NOW() - si.deadline_at)) / 3600 AS hours_past_deadline
FROM pf_sla_instances si
JOIN pf_sla_definitions sd ON si.definition_id = sd.id
WHERE si.status = 'active'
  AND si.deadline_at < NOW() - INTERVAL '1 hour'
ORDER BY si.deadline_at ASC;
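
The same filter can be applied to rows that are already loaded, for example in a dashboard view. The sketch below assumes a hypothetical `ActiveInstance` shape and helper names, and assumes the rows passed in are already limited to active instances.

```typescript
// Hypothetical shape: only id and deadline_at are taken from the query above.
interface ActiveInstance {
  id: string;
  deadline_at: string; // ISO 8601 timestamp
}

// Hours elapsed since the deadline, the same arithmetic as EXTRACT(EPOCH ...) / 3600.
function hoursPastDeadline(deadlineAt: string, now: Date): number {
  return (now.getTime() - new Date(deadlineAt).getTime()) / 3600000;
}

// Keep rows more than graceHours past their deadline, oldest deadline first.
function staleInstances(
  rows: ActiveInstance[],
  now: Date,
  graceHours: number = 1
): ActiveInstance[] {
  return rows
    .filter((row) => hoursPastDeadline(row.deadline_at, now) > graceHours)
    .sort(
      (a, b) =>
        new Date(a.deadline_at).getTime() - new Date(b.deadline_at).getTime()
    );
}
```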

Manual Instance Management

Pause an Instance (UI)

  1. Navigate to Settings → SLA → Instances
  2. Find the instance → menu → Pause
  3. Enter reason → Confirm

Resume a Paused Instance (UI)

  1. Find the paused instance → menu → Resume

Extend a Deadline (SQL — Admin Only)

UPDATE pf_sla_instances
SET deadline_at = deadline_at + INTERVAL '24 hours',
    updated_at = NOW()
WHERE id = '<instance_id>'
  AND organization_id = '<org_id>';

Force-Complete a Stale Instance (SQL — Admin Only)

UPDATE pf_sla_instances
SET status = 'completed',
    completed_at = NOW(),
    updated_at = NOW()
WHERE id = '<instance_id>'
  AND organization_id = '<org_id>';

Troubleshooting

Instance Not Created After Trigger Event

  1. Verify the trigger event exists in fw_domain_events
  2. Check the SLA definition’s trigger_event_type matches the event’s event_type
  3. Confirm the definition’s is_active flag is true
  4. Phase 2: Check pf-sla-checker edge function logs for errors
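
The first three checks above can be expressed as one function that reports the first failing condition. This is only a sketch: the interfaces carry just the fields named in this runbook, and `diagnoseMissingInstance` is a hypothetical helper, not an existing API.

```typescript
// Hypothetical shapes limited to the fields mentioned in the steps above.
interface SlaDefinition {
  trigger_event_type: string;
  is_active: boolean;
}

interface DomainEvent {
  event_type: string;
}

// Returns a description of the first failed check, or null if all checks pass.
function diagnoseMissingInstance(
  definition: SlaDefinition,
  event: DomainEvent | undefined
): string | null {
  if (event === undefined) {
    return "trigger event not found in fw_domain_events";
  }
  if (definition.trigger_event_type !== event.event_type) {
    return "trigger_event_type does not match event_type";
  }
  if (!definition.is_active) {
    return "definition is not active";
  }
  return null;
}
```

A `null` result means the definition and event look consistent, so the cause is likely elsewhere, such as the instantiation path itself.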

Instance Stuck in Active (Should Be Breached)

  1. Check if deadline_at has passed — the checker may not have run yet (Phase 2)
  2. In Phase 1, breach detection is client-side; the dashboard will show it correctly on next load
  3. To manually mark the instance as breached:

UPDATE pf_sla_instances
SET status = 'breached', updated_at = NOW()
WHERE id = '<instance_id>'
  AND organization_id = '<org_id>'
  AND status = 'active'
  AND deadline_at < NOW();

Missed Completion Events

  1. Query fw_domain_events for the expected completion event type
  2. If the event exists but the instance wasn’t completed, check event matching logic
  3. Phase 2: The checker edge function will handle this automatically

High Active Instance Count

  1. Run the active instance count query above
  2. If near the max limit, archive or complete stale instances
  3. Review whether definitions are too broad (triggering on high-frequency events)

Monitoring Alerts (Planned)

| Alert | Condition | Action |
| --- | --- | --- |
| High breach rate | > 20% breaches in 24h | Review affected definitions; extend deadlines if targets are unrealistic |
| Stale instances | Active instances > 2× target duration | Investigate missed completion events |
| Instance count spike | > 500 new instances in 1 hour | Check for runaway trigger events; disable definition if needed |
| Checker failure | Edge function errors > 5 in 10 min | Check edge function logs; redeploy if needed |
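
Since these alerts are still planned, the following is only a sketch of how three of the thresholds above might be evaluated. The `AlertInputs` shape and `firingAlerts` are hypothetical. The breach-rate denominator (completed plus breached instances in the window) is an assumption, because the table does not define it. The stale-instance alert is left out because it needs a per-definition target duration that this runbook does not describe.

```typescript
// Hypothetical metric snapshot: field names are illustrative only.
interface AlertInputs {
  breached24h: number; // instances breached in the last 24 hours
  resolved24h: number; // completed + breached in the last 24 hours (assumed denominator)
  newInstances1h: number; // instances created in the last hour
  checkerErrors10m: number; // edge function errors in the last 10 minutes
}

// Returns the names of the alerts whose thresholds are exceeded.
function firingAlerts(metrics: AlertInputs): string[] {
  const alerts: string[] = [];
  if (
    metrics.resolved24h > 0 &&
    metrics.breached24h / metrics.resolved24h > 0.2
  ) {
    alerts.push("High breach rate");
  }
  if (metrics.newInstances1h > 500) {
    alerts.push("Instance count spike");
  }
  if (metrics.checkerErrors10m > 5) {
    alerts.push("Checker failure");
  }
  return alerts;
}
```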
