# SLA Management — Operational Runbook

**Feature:** SLA Monitoring, Health Checks & Incident Response\
**Spec:** PF-83\
**Last Updated:** 2026-03-17

***

## Overview

This runbook covers operational procedures for monitoring SLA health, responding to breaches, and troubleshooting common issues.

## Architecture

### Current (Phase 1)

* SLA definitions and instances stored in `pf_sla_definitions` and `pf_sla_instances`
* Instance status computed client-side via hooks (`useSLADashboard`, `useSLAInstances`)
* Events published to `fw_domain_events` for downstream notifications
* Deadline calculation uses interval-based fallback (`now() + interval`)
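The two Phase 1 behaviours above, the interval-based deadline fallback and client-side status derivation, can be sketched roughly as follows. This is a minimal illustration, not the actual implementation of `useSLADashboard` or `useSLAInstances`: the `SlaInstance` shape, the `paused` status value, the `targetMinutes` parameter, and both function names are assumptions.

```typescript
// Hypothetical row shape: only `status`, `started_at`, and `deadline_at`
// are column names taken from this runbook; `paused` is an assumed value.
type SlaStatus = "active" | "paused" | "completed" | "breached";

interface SlaInstance {
  id: string;
  status: SlaStatus;
  started_at: string; // ISO 8601 timestamp
  deadline_at: string; // ISO 8601 timestamp
}

// Interval-based fallback: deadline = start + fixed duration.
// Weekends and holidays are NOT excluded (that is what PF-84 would add).
function fallbackDeadline(startedAt: Date, targetMinutes: number): Date {
  return new Date(startedAt.getTime() + targetMinutes * 60000);
}

// Derive the status to display without writing anything to the database.
// Only `active` rows are re-evaluated; other statuses pass through unchanged.
function effectiveStatus(instance: SlaInstance, now: Date): SlaStatus {
  if (instance.status !== "active") return instance.status;
  const deadline = new Date(instance.deadline_at).getTime();
  return deadline < now.getTime() ? "breached" : "active";
}
```

Because the status is derived at read time, a row can still be stored as `active` while the dashboard displays it as breached. The queries in this runbook read the stored value.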

### Planned (Phase 2)

* **`pf-sla-checker` Edge Function** — Periodic cron job to scan active instances, update statuses, and fire breach/warning events automatically
* **Database Webhook** on `fw_domain_events` — Auto-instantiate SLA instances when trigger events arrive
* **PF-84 Business Calendars** — Business-hours-aware deadline calculation (excludes weekends, holidays)
* **Breach detection latency target:** \< 2 minutes from event to status update
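As a rough sketch of what one pass of the planned checker might decide, the function below classifies active instances into breach and warning events. Everything here is an assumption made for illustration: the event type names `sla.warning` and `sla.breached`, the 80% warning threshold, and the function name are hypothetical, and the real `pf-sla-checker` may work differently.

```typescript
// Hypothetical input shape; `started_at` and `deadline_at` are the only
// names taken from this runbook.
interface ActiveInstance {
  id: string;
  started_at: string; // ISO 8601 timestamp
  deadline_at: string; // ISO 8601 timestamp
}

// Assumed event names, not confirmed by the PF-83 spec.
interface CheckerEvent {
  instanceId: string;
  type: "sla.warning" | "sla.breached";
}

// One scan pass: decide which events to fire for the given active instances.
// `warnAt` is the fraction of the target window after which a warning fires.
function scanActiveInstances(
  instances: ActiveInstance[],
  now: Date,
  warnAt: number = 0.8,
): CheckerEvent[] {
  const events: CheckerEvent[] = [];
  for (const inst of instances) {
    const start = Date.parse(inst.started_at);
    const deadline = Date.parse(inst.deadline_at);
    const total = deadline - start;
    const elapsed = now.getTime() - start;
    if (now.getTime() > deadline) {
      // Deadline has passed: the instance is breached.
      events.push({ instanceId: inst.id, type: "sla.breached" });
    } else if (total > 0 && elapsed / total >= warnAt) {
      // Deadline not yet passed, but most of the window is used up.
      events.push({ instanceId: inst.id, type: "sla.warning" });
    }
  }
  return events;
}
```

A real checker would also need to avoid re-firing the same event on every run, for example by recording the last event emitted per instance.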

## Health Checks

### Active Instance Count

```sql
SELECT status, COUNT(*)
FROM pf_sla_instances
WHERE organization_id = '<org_id>'
  AND deleted_at IS NULL
GROUP BY status;
```

### Compliance Rate (Last 30 Days)

```sql
SELECT
  COUNT(*) FILTER (WHERE status = 'completed') AS completed,
  COUNT(*) FILTER (WHERE status = 'breached') AS breached,
  ROUND(
    100.0 * COUNT(*) FILTER (WHERE status = 'completed')
    / NULLIF(COUNT(*) FILTER (WHERE status IN ('completed', 'breached')), 0),
    1
  ) AS compliance_pct
FROM pf_sla_instances
WHERE organization_id = '<org_id>'
  AND deleted_at IS NULL  -- exclude soft-deleted rows, as in the count query
  AND started_at > NOW() - INTERVAL '30 days';
```
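For reference, the arithmetic behind `compliance_pct` can be written out as a small function. This is only an illustrative sketch (the function name is hypothetical): it mirrors the query by counting only resolved instances (completed or breached) in the denominator and by returning `null` when there are none, which is what `NULLIF(..., 0)` produces in SQL.

```typescript
// Percentage of resolved instances that completed, rounded to one decimal.
// Active or paused instances are not part of the calculation.
function compliancePct(completed: number, breached: number): number | null {
  const resolved = completed + breached;
  // No resolved instances yet: there is no meaningful rate to report.
  if (resolved === 0) return null;
  return Math.round((1000 * completed) / resolved) / 10;
}
```

For example, 47 completed and 3 breached instances give `compliancePct(47, 3)`, which is 94, while an organization with no resolved instances in the window gets `null` rather than a division-by-zero error.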

### Stale Active Instances

Instances that have been active longer than 2× their target duration may indicate missed completion events. The query below uses a simpler proxy: it lists active instances that are more than one hour past `deadline_at`, oldest first.

```sql
SELECT si.id, sd.name, si.started_at, si.deadline_at,
  EXTRACT(EPOCH FROM (NOW() - si.deadline_at)) / 3600 AS hours_past_deadline
FROM pf_sla_instances si
JOIN pf_sla_definitions sd ON si.definition_id = sd.id
WHERE si.organization_id = '<org_id>'
  AND si.status = 'active'
  AND si.deleted_at IS NULL
  AND si.deadline_at < NOW() - INTERVAL '1 hour'
ORDER BY si.deadline_at ASC;
```
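The 2× target-duration criterion itself could be checked along these lines. This sketch assumes the target duration can be recovered as `deadline_at - started_at`, which would not hold if a deadline has been manually extended; the function name and the `factor` parameter are hypothetical.

```typescript
// Returns true when an instance has been running for more than
// `factor` times its target duration (assumed: deadline_at - started_at).
function isStale(
  startedAt: string,
  deadlineAt: string,
  now: Date,
  factor: number = 2,
): boolean {
  const start = Date.parse(startedAt);
  const target = Date.parse(deadlineAt) - start;
  // Reject unparseable timestamps (NaN) and non-positive durations.
  if (!(target > 0)) return false;
  return now.getTime() - start > factor * target;
}
```

With a 4-hour target, an instance started at 00:00 is flagged only after 08:00, so it is reported later than by the one-hour-past-deadline query above.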

## Manual Instance Management

### Pause an Instance (UI)

1. Navigate to **Settings** → **SLA** → **Instances**
2. Find the instance → **⋮** menu → **Pause**
3. Enter reason → **Confirm**

### Resume a Paused Instance (UI)

1. Navigate to **Settings** → **SLA** → **Instances**
2. Find the paused instance → **⋮** menu → **Resume**

### Extend a Deadline (SQL — Admin Only)

```sql
UPDATE pf_sla_instances
SET deadline_at = deadline_at + INTERVAL '24 hours',
    updated_at = NOW()
WHERE id = '<instance_id>'
  AND organization_id = '<org_id>';
```

### Force-Complete a Stale Instance (SQL — Admin Only)

```sql
UPDATE pf_sla_instances
SET status = 'completed',
    completed_at = NOW(),
    updated_at = NOW()
WHERE id = '<instance_id>'
  AND organization_id = '<org_id>';
```

## Troubleshooting

### Instance Not Created After Trigger Event

1. Verify the trigger event exists in `fw_domain_events`
2. Check the SLA definition's `trigger_event_type` matches the event's `event_type`
3. Confirm the definition's `is_active` flag is `true`
4. Phase 2: Check `pf-sla-checker` edge function logs for errors
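Checks 1 to 3 above can be expressed as a small diagnostic helper. This is a sketch under assumptions: the interfaces include only the fields named in this runbook, and the function name and message wording are hypothetical.

```typescript
// Minimal shapes using only fields mentioned in this runbook.
interface SlaDefinition {
  name: string;
  trigger_event_type: string;
  is_active: boolean;
}

interface DomainEvent {
  event_type: string;
}

// Returns the reasons an event would not have produced an instance.
// An empty array means none of the three checks explains the problem.
function whyNotTriggered(
  definition: SlaDefinition,
  event: DomainEvent | undefined,
): string[] {
  if (event === undefined) {
    return ["trigger event not found in fw_domain_events"];
  }
  const reasons: string[] = [];
  if (definition.trigger_event_type !== event.event_type) {
    reasons.push(
      "trigger_event_type '" + definition.trigger_event_type +
        "' does not match event_type '" + event.event_type + "'",
    );
  }
  if (!definition.is_active) {
    reasons.push("definition '" + definition.name + "' is not active");
  }
  return reasons;
}
```

The comparison is an exact, case-sensitive string match, so a difference in case or a trailing space between the two type names is enough to prevent a match.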

### Instance Stuck in Active (Should Be Breached)

1. Check whether `deadline_at` has passed. In Phase 2, the checker may simply not have run yet.
2. In Phase 1, breach detection is client-side, so the stored `status` may still read `active`; the dashboard shows the breach correctly on next load.
3. To manually mark as breached:

```sql
UPDATE pf_sla_instances
SET status = 'breached', updated_at = NOW()
WHERE id = '<instance_id>'
  AND organization_id = '<org_id>'
  AND status = 'active'  -- never overwrite a completed or paused instance
  AND deadline_at < NOW();
```

### Missed Completion Events

1. Query `fw_domain_events` for the expected completion event type
2. If the event exists but the instance wasn't completed, check event matching logic
3. Phase 2: The checker edge function will handle this automatically

### High Active Instance Count

1. Run the active instance count query above
2. If near the max limit, archive or complete stale instances
3. Review whether definitions are too broad (triggering on high-frequency events)

## Monitoring Alerts (Planned)

| Alert                | Condition                             | Action                                                                   |
| -------------------- | ------------------------------------- | ------------------------------------------------------------------------ |
| High breach rate     | > 20% breaches in 24h                 | Review affected definitions; extend deadlines if targets are unrealistic |
| Stale instances      | Active instances > 2× target duration | Investigate missed completion events                                     |
| Instance count spike | > 500 new instances in 1 hour         | Check for runaway trigger events; disable definition if needed           |
| Checker failure      | Edge function errors > 5 in 10 min    | Check edge function logs; redeploy if needed                             |
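The conditions in the table could be evaluated along the following lines once the alerts are implemented. This is a hedged sketch, not the planned design: the `AlertInputs` fields and the function name are hypothetical, and the breach rate is assumed to mean breached instances as a share of resolved (completed plus breached) instances, which the table does not specify.

```typescript
// Hypothetical pre-aggregated metrics; field names are assumptions.
interface AlertInputs {
  completedLast24h: number;
  breachedLast24h: number;
  staleActiveInstances: number; // active for more than 2x target duration
  newInstancesLastHour: number;
  checkerErrorsLast10Min: number;
}

// Returns the names of the alerts (as listed in the table) that would fire.
function firingAlerts(m: AlertInputs): string[] {
  const alerts: string[] = [];
  const resolved = m.completedLast24h + m.breachedLast24h;
  // Assumed denominator: resolved instances only. Skip when nothing resolved.
  if (resolved > 0 && m.breachedLast24h / resolved > 0.2) {
    alerts.push("High breach rate");
  }
  if (m.staleActiveInstances > 0) alerts.push("Stale instances");
  if (m.newInstancesLastHour > 500) alerts.push("Instance count spike");
  if (m.checkerErrorsLast10Min > 5) alerts.push("Checker failure");
  return alerts;
}
```

All thresholds are strict, matching the `>` comparisons in the table: exactly 20% breaches or exactly 500 new instances does not fire an alert.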

## References

* [PF-83 Spec](../../specs/pf/specs/PF-83-sla-management-platform-layer.md)
* [Admin Guide](sla-management-admin-guide.md)
* [User Guide](sla-management-user-guide.md)
* [Migration Notes](sla-management-migration-notes.md)
