# Monitoring & Alerting Runbook

**Spec:** PF-07 Phase 2 & 3 — Error Handling & Monitoring\
**Last Updated:** 2026-03-14

***

## 1. Overview

Encore Health OS uses **Sentry** for error tracking, performance monitoring, and session replay. Custom performance metrics are stored in `pf_health_metrics` via the platform `performanceMonitor`.

### Architecture

```
┌─────────────┐   errors, traces   ┌─────────────┐
│  React App  │ ──────────────────→ │   Sentry    │
│  (browser)  │   replay, logs     │  Dashboard  │
└─────────────┘                    └─────────────┘
      │
      │ page load, custom marks,
      │ API histograms
      ▼
┌─────────────────────┐
│  pf_health_metrics  │
│  (Supabase)         │
└─────────────────────┘
```

***

## 2. Sentry Configuration

**File:** `src/platform/monitoring/sentry.ts`

| Setting                    | Value     | Notes                                          |
| -------------------------- | --------- | ---------------------------------------------- |
| `tracesSampleRate`         | 0.5 (50%) | Auth/billing/clinical/HR-payroll forced to 1.0 |
| `profilesSampleRate`       | 0.1 (10%) | JS Self-Profiling API                          |
| `replaysSessionSampleRate` | 0.0       | Off by default                                 |
| `replaysOnErrorSampleRate` | 1.0       | 100% on errors                                 |
| `enableLogs`               | true      | Structured log search                          |
| `enableMetrics`            | true      | Custom metrics (SDK 10.25+)                    |
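
The forced 1.0 rate for critical areas is typically expressed as a sampler function rather than a fixed rate. The sketch below shows one way that could look. The route prefixes and the `SamplingContextLike` type are assumptions made for illustration, not the contents of `sentry.ts`; the real SDK sampling context carries more fields than the single `name` used here.

```typescript
// Minimal shape of the sampling context this sketch relies on (assumption).
interface SamplingContextLike {
  name: string; // transaction name, assumed to be a route path
}

// Hypothetical route prefixes for the areas that are always traced.
const ALWAYS_TRACED_PREFIXES = ["/auth", "/billing", "/clinical", "/hr"];

const DEFAULT_TRACE_RATE = 0.5;

// Returns the sample rate for one transaction:
// 1.0 for critical routes, the 50% default for everything else.
function tracesSampler(context: SamplingContextLike): number {
  const isCritical = ALWAYS_TRACED_PREFIXES.some(
    (prefix) => context.name === prefix || context.name.startsWith(prefix + "/"),
  );
  return isCritical ? 1.0 : DEFAULT_TRACE_RATE;
}

// The remaining values from the table, shaped as options for Sentry.init().
const sentryOptions = {
  tracesSampler,
  profilesSampleRate: 0.1,
  replaysSessionSampleRate: 0.0,
  replaysOnErrorSampleRate: 1.0,
  enableLogs: true,
  enableMetrics: true,
};
```

Matching on `prefix + "/"` rather than a bare `startsWith(prefix)` keeps an unrelated route such as `/authors` at the default rate.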

### PHI Scrubbing

The `beforeSend` callback:

* Truncates all event/exception messages to 500 characters
* Strips emails, phone numbers, SSNs, DOBs via regex
* Drops breadcrumb messages matching PHI patterns
* Only UUIDs are sent as user/org context — never names, emails, or clinical data
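
As a rough illustration of the scrubbing steps listed above, the following sketch applies a set of regex replacements and then truncates. The patterns are deliberately simplified assumptions (US-style phone numbers, `MM/DD/YYYY` dates) and the function names are hypothetical; the production patterns in `sentry.ts` may be broader.

```typescript
const MAX_MESSAGE_LENGTH = 500;

// Simplified, illustrative patterns (assumptions, not the production set).
const PHI_PATTERNS: Array<[RegExp, string]> = [
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[email]"],
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[ssn]"],
  [/\b\d{1,2}\/\d{1,2}\/\d{4}\b/g, "[dob]"],
  [/\b\d{3}[\s.-]\d{3}[\s.-]\d{4}\b/g, "[phone]"],
];

// Replaces PHI-looking substrings, then truncates to 500 characters.
// Scrubbing runs first so a value cut in half by truncation cannot slip through.
function scrubPhi(message: string): string {
  let scrubbed = message;
  for (const [pattern, placeholder] of PHI_PATTERNS) {
    scrubbed = scrubbed.replace(pattern, placeholder);
  }
  return scrubbed.slice(0, MAX_MESSAGE_LENGTH);
}

// True when any pattern matches; a breadcrumb with such a message would be dropped.
// String.prototype.search ignores the regex's lastIndex, so the shared
// global patterns are safe to reuse here.
function containsPhi(message: string): boolean {
  return PHI_PATTERNS.some(([pattern]) => message.search(pattern) !== -1);
}
```

In a `beforeSend` callback, `scrubPhi` would be applied to the event and exception messages, and `containsPhi` would decide which breadcrumbs to discard. UUIDs pass through unchanged because none of the patterns match their hex segments.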

### Source Maps

Source maps are uploaded via `@sentry/vite-plugin` during the Vercel build. The `SENTRY_AUTH_TOKEN`, `SENTRY_ORG`, and `SENTRY_PROJECT` environment variables must be set in the Vercel project settings.

**Verification:** After a deploy, check Sentry → Settings → Source Maps → Artifacts to confirm the release has uploaded maps.
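
For orientation, a typical `@sentry/vite-plugin` setup looks roughly like the fragment below. This is a sketch of the plugin's documented usage, not a copy of this project's `vite.config.ts`, which may contain additional plugins and options.

```typescript
// vite.config.ts (sketch; the project's real config may differ)
import { defineConfig } from "vite";
import { sentryVitePlugin } from "@sentry/vite-plugin";

export default defineConfig({
  build: {
    // The plugin can only upload source maps that the build actually emits.
    sourcemap: true,
  },
  plugins: [
    // Reads the three environment variables set in the Vercel project settings.
    sentryVitePlugin({
      org: process.env.SENTRY_ORG,
      project: process.env.SENTRY_PROJECT,
      authToken: process.env.SENTRY_AUTH_TOKEN,
    }),
  ],
});
```

If any of the three variables is missing in Vercel, the upload step cannot authenticate, which is the first thing to check when a release shows no artifacts.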

***

## 3. Alerting Thresholds

### Error Rate

| Metric                   | Warning          | Critical | Action                                             |
| ------------------------ | ---------------- | -------- | -------------------------------------------------- |
| Error rate (events/min)  | > 10/min         | > 50/min | Check Sentry Issues feed; page on-call if critical |
| Unique issues (new/hour) | > 5              | > 15     | Review new issues for regressions                  |
| Unhandled rejection rate | > 1% of sessions | > 5%     | Investigate JS errors in production                |
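
The thresholds in this table can be read as a simple two-level comparison. The sketch below encodes them as data and classifies a measured value; the type and constant names are hypothetical and are not taken from the codebase or from Sentry's alert configuration.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Values copied from the table above.
const ERROR_RATE_THRESHOLDS = {
  eventsPerMinute: { warning: 10, critical: 50 },
  newIssuesPerHour: { warning: 5, critical: 15 },
  unhandledRejectionSessionPct: { warning: 1, critical: 5 },
};

// Thresholds are strict ("> 10/min"), so a value equal to the limit is still "ok".
function classify(value: number, threshold: Threshold): Severity {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}
```

For example, `classify(12, ERROR_RATE_THRESHOLDS.eventsPerMinute)` yields `"warning"`, which maps to checking the Sentry Issues feed without paging on-call.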

### Performance (Web Vitals)

| Metric                          | Good    | Needs Improvement | Poor    |
| ------------------------------- | ------- | ----------------- | ------- |
| LCP (Largest Contentful Paint)  | ≤ 2.5s  | 2.5–4.0s          | > 4.0s  |
| INP (Interaction to Next Paint) | ≤ 200ms | 200–500ms         | > 500ms |
| CLS (Cumulative Layout Shift)   | ≤ 0.1   | 0.1–0.25          | > 0.25  |
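
To make the band boundaries concrete, here is a minimal sketch that rates a measured value against the table. The function and constant names are hypothetical; LCP and INP are assumed to be measured in milliseconds and CLS as a unitless score.

```typescript
type VitalName = "LCP" | "INP" | "CLS";
type VitalRating = "good" | "needs-improvement" | "poor";

// [upper bound of "good", upper bound of "needs improvement"].
// LCP and INP in milliseconds, CLS unitless.
const VITAL_BOUNDS: Record<VitalName, [number, number]> = {
  LCP: [2500, 4000],
  INP: [200, 500],
  CLS: [0.1, 0.25],
};

// Both bounds are inclusive on the better side, matching "≤ 2.5s" and "> 4.0s".
function rateVital(name: VitalName, value: number): VitalRating {
  const [goodMax, needsImprovementMax] = VITAL_BOUNDS[name];
  if (value <= goodMax) return "good";
  if (value <= needsImprovementMax) return "needs-improvement";
  return "poor";
}
```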

### API Performance

| Metric               | Warning | Critical |
| -------------------- | ------- | -------- |
| p95 API latency      | > 2s    | > 5s     |
| API error rate (5xx) | > 1%    | > 5%     |
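
The two API metrics can be derived from raw request samples as sketched below. The nearest-rank percentile used here is one common definition of p95 and is an assumption; the monitoring backend may interpolate differently. `ApiSample` and `summarizeApi` are hypothetical names.

```typescript
type ApiSeverity = "ok" | "warning" | "critical";

interface ApiSample {
  durationMs: number;
  status: number; // HTTP status code
}

// Nearest-rank percentile: the smallest sample with at least p% of values at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Computes p95 latency and the 5xx rate, then applies the thresholds from the table.
function summarizeApi(samples: ApiSample[]) {
  const p95Ms = percentile(samples.map((s) => s.durationMs), 95);
  const serverErrors = samples.filter((s) => s.status >= 500 && s.status <= 599).length;
  const errorRatePct = samples.length === 0 ? 0 : (serverErrors * 100) / samples.length;

  const latency: ApiSeverity = p95Ms > 5000 ? "critical" : p95Ms > 2000 ? "warning" : "ok";
  const errors: ApiSeverity = errorRatePct > 5 ? "critical" : errorRatePct > 1 ? "warning" : "ok";

  return { p95Ms, errorRatePct, latency, errors };
}
```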

***

## 4. Dashboards

### Sentry Project

* **Issues:** Real-time error feed with stack traces and session replay
* **Performance:** Transaction duration, Web Vitals, throughput
* **Replays:** Session recordings for error context
* **Logs:** Structured log search (`Sentry.logger.*`)

### Key Sentry Queries

```
# Unresolved issues first seen in the last hour with more than 10 events
is:unresolved times_seen:>10 firstSeen:-1h

# Errors on auth routes
transaction:/auth/* is:unresolved

# Clinical module errors
module:cl is:unresolved
```

### Platform Health Metrics

Custom metrics in `pf_health_metrics` (Supabase):

* Page load timing (`page_load`, `dom_ready`)
* Custom marks (`startMark`/`endMark`)
* API response time histograms

Query these metrics through the Supabase dashboard or the platform health module.

***

## 5. Error Boundaries

The application uses a layered error boundary strategy:

| Level             | Location              | Behavior                                                      |
| ----------------- | --------------------- | ------------------------------------------------------------- |
| **Global (root)** | `main.tsx`            | Catches catastrophic failures; shows full-page fallback       |
| **Global (app)**  | `App.tsx`             | Defense-in-depth; catches errors inside providers             |
| **Feature**       | `RouteLoader.tsx`     | Per-module isolation; module crash doesn't break other routes |
| **Component**     | Individual components | Optional; for non-critical widgets                            |

The double global boundary (`main.tsx` + `App.tsx`) is **intentional** — the outer boundary catches errors that occur during provider initialization.

***

## 6. Correlation IDs

Every auth state change (sign-in, sign-out, token refresh) generates a `correlation_id` via `crypto.randomUUID()`. This ID is:

* Set in the logger context for all subsequent logs
* Included in structured log entries
* Useful for tracing a user session across log entries
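
The pattern described above can be sketched as a logger that rotates its correlation ID on each auth event. `AuthAwareLogger` and its methods are hypothetical names, not the platform logger's API. The ID generator is injected so that the browser can pass `() => crypto.randomUUID()` while tests pass a deterministic stub.

```typescript
interface LogEntry {
  message: string;
  correlation_id: string | undefined;
}

class AuthAwareLogger {
  readonly entries: LogEntry[] = [];
  private correlationId: string | undefined;
  private readonly generateId: () => string;

  // In the browser: new AuthAwareLogger(() => crypto.randomUUID())
  constructor(generateId: () => string) {
    this.generateId = generateId;
  }

  // Call on sign-in, sign-out, and token refresh.
  onAuthStateChange(): string {
    this.correlationId = this.generateId();
    return this.correlationId;
  }

  // Every entry carries the ID of the most recent auth state change.
  info(message: string): void {
    this.entries.push({ message, correlation_id: this.correlationId });
  }
}
```

Filtering stored log entries by a single `correlation_id` then returns everything logged between two consecutive auth state changes.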

***

## 7. Escalation Procedure

1. **P3 (Low):** A new non-critical issue appears in Sentry → assign it to the relevant core team at the next standup
2. **P2 (Medium):** Error rate crosses the warning threshold → investigate within 4 hours
3. **P1 (High):** Error rate crosses the critical threshold, or auth/billing errors occur → investigate within 1 hour
4. **P0 (Critical):** Application-wide crash or data integrity issue → page on-call immediately
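
The same ladder can be written as a small triage function, which is useful when wiring alerts to a paging tool. The `IncidentSignal` fields are hypothetical inputs chosen for this sketch; the checks run from most to least severe so that the highest applicable priority wins.

```typescript
type Priority = "P0" | "P1" | "P2" | "P3";

// Hypothetical summary of what is known about an incident.
interface IncidentSignal {
  appWideCrash: boolean;
  dataIntegrityIssue: boolean;
  errorRate: "ok" | "warning" | "critical";
  affectsAuthOrBilling: boolean;
}

// Response targets from the escalation list.
const RESPONSE_TARGET: Record<Priority, string> = {
  P0: "page on-call immediately",
  P1: "investigate within 1 hour",
  P2: "investigate within 4 hours",
  P3: "assign at the next standup",
};

function triage(signal: IncidentSignal): Priority {
  if (signal.appWideCrash || signal.dataIntegrityIssue) return "P0";
  if (signal.errorRate === "critical" || signal.affectsAuthOrBilling) return "P1";
  if (signal.errorRate === "warning") return "P2";
  return "P3"; // a new, non-critical issue
}
```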

***

## 8. Maintenance

### Sentry Housekeeping

* Review and resolve/archive stale issues monthly
* Update `ignoreErrors` patterns when new non-actionable errors are identified
* Verify source map uploads after Vite/build tool upgrades
* Review sampling rates quarterly (adjust based on event volume and budget)

### Performance Monitor

* `performanceMonitor` flushes metrics to `pf_health_metrics` every 30 seconds
* Metrics are sampled at 10% in production, 100% in development
* Stale metrics can be cleaned up via SQL on `pf_health_metrics`
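
The buffering, sampling, and flush behavior described above can be sketched as follows. This is not the actual `performanceMonitor` implementation: the class name, option names, and `Metric` shape are assumptions, and the sink stands in for whatever writes rows to `pf_health_metrics`. Retry and error handling are omitted.

```typescript
interface Metric {
  name: string;
  value: number; // duration in milliseconds
  recordedAt: number;
}

interface MonitorOptions {
  sink: (batch: Metric[]) => unknown; // e.g. an insert into pf_health_metrics
  sampleRate: number; // 0.1 in production, 1.0 in development
  random?: () => number;
  now?: () => number;
}

const FLUSH_INTERVAL_MS = 30000;

class BufferedMonitor {
  private buffer: Metric[] = [];
  private marks = new Map<string, number>();
  private readonly sink: (batch: Metric[]) => unknown;
  private readonly sampleRate: number;
  private readonly random: () => number;
  private readonly now: () => number;

  constructor(options: MonitorOptions) {
    this.sink = options.sink;
    this.sampleRate = options.sampleRate;
    this.random = options.random ?? Math.random;
    this.now = options.now ?? Date.now;
  }

  startMark(name: string): void {
    this.marks.set(name, this.now());
  }

  // Records the elapsed time since the matching startMark, if there was one.
  endMark(name: string): void {
    const startedAt = this.marks.get(name);
    if (startedAt === undefined) return;
    this.marks.delete(name);
    this.record(name, this.now() - startedAt);
  }

  // Keeps a metric only when the random draw falls under the sample rate.
  record(name: string, value: number): void {
    if (this.random() >= this.sampleRate) return;
    this.buffer.push({ name, value, recordedAt: this.now() });
  }

  // Hands the buffered batch to the sink and returns how many metrics were sent.
  flush(): number {
    if (this.buffer.length === 0) return 0;
    const batch = this.buffer;
    this.buffer = [];
    void this.sink(batch);
    return batch.length;
  }

  // Starts the periodic flush; the caller owns the returned timer handle.
  start(): ReturnType<typeof setInterval> {
    return setInterval(() => this.flush(), FLUSH_INTERVAL_MS);
  }
}
```

Swapping the buffer before calling the sink means metrics recorded while a flush is in flight go into the next batch instead of being lost.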
