Monitoring & Alerting Guide - Encore Health OS

Version: 1.0.0
Last Updated: 2025-01-07

This guide documents the monitoring and alerting strategy for the Encore Health OS Platform, covering error tracking, performance monitoring, log aggregation, and incident response.

Overview
Error Tracking
Performance Monitoring
Log Aggregation
Alerting Configuration
Dashboard Setup
Incident Response
Best Practices
Troubleshooting

Overview

Encore Health OS uses a multi-layered monitoring approach:

Error Tracking: Sentry (implemented), LogRocket (planned)
Performance Monitoring: Core Web Vitals, custom metrics
Log Aggregation: Structured logging, Supabase logs
Alerting: Email/SMS notifications for critical issues
Dashboards: Supabase Dashboard, custom dashboards (planned)

Monitoring Goals:

Detect errors before users report them
Track performance degradation
Monitor system health
Enable rapid incident response

Error Tracking

Current Implementation

Structured Logging:

Location: src/platform/monitoring/logger.ts
Format: JSON with standard fields
PHI Protection: Never logs PHI/PII

Log Levels:

debug - Development debugging
info - Normal operations
warn - Warning conditions
error - Error conditions
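
The four levels above are ordered by severity, so a logger configured with a minimum level drops anything below it. The following is a minimal sketch of that filtering, not the contents of logger.ts; the names LEVEL_RANK and shouldLog are hypothetical.

```typescript
// Hypothetical sketch: these names are illustrative, not the logger.ts API.
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

// Higher rank means more severe.
const LEVEL_RANK: Record<LogLevel, number> = {
  debug: 0,
  info: 1,
  warn: 2,
  error: 3,
};

// Returns true when an entry at `level` passes a logger whose minimum is `minLevel`.
function shouldLog(level: LogLevel, minLevel: LogLevel): boolean {
  return LEVEL_RANK[level] >= LEVEL_RANK[minLevel];
}

console.log(shouldLog('debug', 'info')); // false: debug is below info
console.log(shouldLog('error', 'warn')); // true: error is above warn
```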

Sentry Integration (Implemented)

Current implementation: src/platform/monitoring/sentry.ts (PF-07).

Initialization runs in src/main.tsx via initSentry() before React renders.
Package: @sentry/react only (no deprecated @sentry/tracing).
Tracing uses reactRouterV6BrowserTracingIntegration; replay uses replayIntegration with HIPAA-safe options (maskAllText, blockAllMedia).

Required environment variable:

VITE_SENTRY_DSN – If unset, Sentry is disabled (enabled: false). Do not commit DSN to repo; use env per environment.

Optional (for source map upload on build):

SENTRY_AUTH_TOKEN – Auth token for uploads (e.g. production CI).
SENTRY_ORG – Sentry organization slug.
SENTRY_PROJECT – Sentry project slug.

When all three are set, npm run build generates source maps and uploads them via @sentry/vite-plugin so production errors show readable stack traces.

Initialization (reference): See src/platform/monitoring/sentry.ts. Summary:

dsn and enabled from VITE_SENTRY_DSN.
release from VITE_APP_VERSION (buildId from Vite).
beforeSend scrubs ui.input breadcrumb data and truncates message/exception text to limit PHI.
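
To make the beforeSend behavior concrete, here is a rough sketch of a scrubbing function with the same shape as the summary above. It is not the code in sentry.ts: the SketchEvent and SketchBreadcrumb types are local stand-ins for the Sentry event types, and the 200-character limit is an assumed value.

```typescript
// Local stand-ins for the parts of a Sentry event this sketch touches.
// They are assumptions for illustration, not the @sentry/react types.
interface SketchBreadcrumb {
  category?: string;
  message?: string;
  data?: Record<string, unknown>;
}

interface SketchEvent {
  message?: string;
  breadcrumbs?: SketchBreadcrumb[];
  exception?: { values?: Array<{ type?: string; value?: string }> };
}

// Hypothetical limit; the real value lives in sentry.ts.
const MAX_TEXT_LENGTH = 200;

// Cuts long text so free-form content (which may contain PHI) is limited.
function truncateText(text: string | undefined, max: number = MAX_TEXT_LENGTH): string | undefined {
  if (text === undefined || text.length <= max) return text;
  return text.slice(0, max) + '...';
}

// Shaped like a beforeSend hook: takes an event and returns the scrubbed copy.
function scrubEvent(event: SketchEvent): SketchEvent {
  // Drop the data payload of ui.input breadcrumbs; leave other breadcrumbs alone.
  const breadcrumbs = event.breadcrumbs?.map((crumb) =>
    crumb.category === 'ui.input' ? { ...crumb, data: undefined } : crumb,
  );
  // Truncate each exception value.
  const exception = event.exception && {
    ...event.exception,
    values: event.exception.values?.map((entry) => ({
      ...entry,
      value: truncateText(entry.value),
    })),
  };
  return { ...event, message: truncateText(event.message), breadcrumbs, exception };
}
```

Truncation only limits exposure; it does not guarantee that a shortened message is free of PHI, so messages should still be written without identifying data.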

Error boundaries: Use the platform ErrorBoundary from @/platform/monitoring, which reports to Sentry and shows a fallback UI. Do not use @sentry/react’s ErrorBoundary directly; the platform boundary is used in App.tsx and route-level boundaries.

Planned Integration: LogRocket

Alternative to Sentry:

Session replay
User interaction tracking
Network request monitoring

Setup Steps (Future):

Create LogRocket project
Install SDK:
```
npm install logrocket
```

Initialize in src/main.tsx:

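```
import LogRocket from 'logrocket';

LogRocket.init(import.meta.env.VITE_LOGROCKET_APP_ID, {
  shouldCaptureIP: false, // Privacy: do not record client IP addresses
  dom: {
    inputSanitizer: true, // Obfuscate user input elements in recordings
  },
});
```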

Error Tracking Best Practices

✅ DO:

Capture all unhandled errors
Include correlation IDs
Sanitize PHI/PII before sending
Group similar errors
Track error rates

❌ DON’T:

Log full user data
Include passwords or tokens
Log PHI/PII
Overwhelm with noise

Performance Monitoring

Core Web Vitals

Current Implementation:

Location: src/platform/monitoring/performance-monitor.ts
Metrics: LCP, INP (replaces FID), CLS
Sampling: 10% in production, 100% in development
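
As an illustration of the sampling rule, the sketch below decides whether a session reports metrics. It is an assumption about how such sampling is commonly done, not the logic in performance-monitor.ts; shouldSample and sampleRateFor are hypothetical names.

```typescript
// Decides whether a session should collect metrics.
// `random` is injectable so the decision can be tested; it defaults to Math.random.
function shouldSample(sampleRate: number, random: () => number = Math.random): boolean {
  if (sampleRate <= 0) return false; // sampling disabled
  if (sampleRate >= 1) return true; // always sample
  return random() < sampleRate;
}

// Hypothetical helper mirroring "10% in production, 100% in development".
function sampleRateFor(mode: 'production' | 'development'): number {
  return mode === 'production' ? 0.1 : 1.0;
}

// With a 10% rate, roughly one session in ten reports metrics.
console.log(shouldSample(sampleRateFor('production')));
```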

Metrics Tracked:

LCP (Largest Contentful Paint): < 2.5s (good)
INP (Interaction to Next Paint): < 200ms (good)
CLS (Cumulative Layout Shift): < 0.1 (good)
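
The "good" thresholds above can be turned into a small rating function. The sketch below is illustrative only: the upper boundaries for "needs improvement" (4000 ms, 500 ms, 0.25) are the commonly published Core Web Vitals values rather than numbers taken from this guide, and rateVital is a hypothetical name.

```typescript
type VitalName = 'LCP' | 'INP' | 'CLS';
type VitalRating = 'good' | 'needs-improvement' | 'poor';

// [upper bound for "good", upper bound for "needs-improvement"].
// LCP and INP are in milliseconds; CLS is unitless.
const VITAL_THRESHOLDS: Record<VitalName, [number, number]> = {
  LCP: [2500, 4000],
  INP: [200, 500],
  CLS: [0.1, 0.25],
};

// Classifies a measured value against the thresholds for its metric.
function rateVital(metric: VitalName, value: number): VitalRating {
  const [goodMax, needsImprovementMax] = VITAL_THRESHOLDS[metric];
  if (value <= goodMax) return 'good';
  if (value <= needsImprovementMax) return 'needs-improvement';
  return 'poor';
}

console.log(rateVital('LCP', 2400)); // good
console.log(rateVital('INP', 350)); // needs-improvement
console.log(rateVital('CLS', 0.3)); // poor
```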

Usage:

```
import { performanceMonitor } from '@/platform/monitoring';

// Initialize
performanceMonitor.init({
  sampleRate: 0.1, // 10% sampling
  enablePerformanceTracking: true,
});

// Custom metric
performanceMonitor.recordMetric({
  name: 'form_submit_duration',
  value: 1234, // milliseconds
  route: '/forms/submit',
});
```

Performance Targets

Lighthouse Scores:

Performance: 85+
Accessibility: 90+
Best Practices: 90+
SEO: 90+
PWA: 90+

Core Web Vitals:

LCP: < 2.5s
INP: < 200ms
CLS: < 0.1

Time to Interactive: < 3.5s on 3G

Custom Metrics

Track Business Metrics:

Form submission time
Report generation time
API response times
Database query times

Example:

```
// Start timing
performanceMonitor.markStart('form_submit');

// ... form submission logic ...

// End timing
performanceMonitor.markEnd('form_submit');
```
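
For readers who want to see what a start/end timer involves, here is a minimal standalone sketch. It is not the performanceMonitor implementation: the MarkTimer class, the injectable clock, and the "_duration" name suffix are all assumptions made for illustration.

```typescript
interface DurationMetric {
  name: string;
  value: number; // milliseconds
}

// Hypothetical stand-in for a mark-based timer.
class MarkTimer {
  private readonly starts = new Map<string, number>();
  private readonly now: () => number;

  // The clock is injectable so durations can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Records the start time for `label`, replacing any earlier unfinished mark.
  markStart(label: string): void {
    this.starts.set(label, this.now());
  }

  // Returns the elapsed time since markStart, or undefined if there was no start.
  markEnd(label: string): DurationMetric | undefined {
    const startedAt = this.starts.get(label);
    if (startedAt === undefined) return undefined;
    this.starts.delete(label);
    return { name: `${label}_duration`, value: this.now() - startedAt };
  }
}

const demoTimer = new MarkTimer();
demoTimer.markStart('form_submit');
console.log(demoTimer.markEnd('form_submit')?.name); // form_submit_duration
```

Returning undefined for an unmatched markEnd keeps a missing start from producing a misleading zero or negative duration.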

Log Aggregation

Structured Logging

Format:

```
{
  "timestamp": "2025-01-07T10:00:00Z",
  "level": "info",
  "module": "hr",
  "action": "create_employee",
  "message": "Employee created successfully",
  "user_id": "uuid",
  "org_id": "uuid",
  "correlation_id": "uuid",
  "context": {
    "employee_id": "uuid"
  }
}
```

Standard Fields:

timestamp - ISO 8601 timestamp
level - Log level (debug, info, warn, error)
module - Module/core name
action - Action being performed
message - Human-readable message
user_id - User ID (stable UUID, not PHI)
org_id - Organization ID
site_id - Site ID (if applicable)
correlation_id - Request correlation ID

PHI Protection:

Never log names, emails, SSNs, addresses
Only log stable IDs (UUIDs)
Sanitize error messages
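
The standard fields and the PHI rules can be combined in one place where log entries are built. The following is a sketch under stated assumptions, not the logger.ts implementation: buildLogEntry and sanitizeMessage are hypothetical names, and the two regular expressions only catch email addresses and US SSN-shaped numbers.

```typescript
interface LogEntry {
  timestamp: string;
  level: 'debug' | 'info' | 'warn' | 'error';
  module: string;
  action: string;
  message: string;
  user_id?: string;
  org_id?: string;
  site_id?: string;
  correlation_id: string;
  context?: Record<string, string>;
}

// Best-effort patterns; they cannot detect names or addresses.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

// Replaces email addresses and SSN-shaped numbers with placeholders.
function sanitizeMessage(message: string): string {
  return message
    .replace(EMAIL_PATTERN, '[redacted-email]')
    .replace(SSN_PATTERN, '[redacted-ssn]');
}

// Adds an ISO 8601 timestamp and sanitizes the message before the entry is emitted.
function buildLogEntry(fields: Omit<LogEntry, 'timestamp'>, now: Date = new Date()): LogEntry {
  return {
    ...fields,
    timestamp: now.toISOString(),
    message: sanitizeMessage(fields.message),
  };
}

console.log(JSON.stringify(buildLogEntry({
  level: 'info',
  module: 'hr',
  action: 'create_employee',
  message: 'Employee created successfully',
  correlation_id: 'uuid',
})));
```

Pattern-based redaction is a safety net only. The primary control remains what the rules above state: pass stable IDs to the logger and keep names, emails, and similar data out of messages in the first place.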

Log Destinations

Development:

Console (pretty-printed)
Browser DevTools

Production (Planned):

Log aggregation service (Datadog, Logtail, etc.)
Supabase function logs
Error tracking service (Sentry)

Supabase Logs

Edge Function Logs:

View in Supabase Dashboard → Edge Functions → Logs
Filter by function, time range, log level
Export logs for analysis

Database Logs:

Query logs in Supabase Dashboard → Database → Logs
Monitor slow queries
Track connection usage

Alerting Configuration

Alert Types

Critical Alerts (Immediate):

System downtime
Database connection failures
Authentication failures
RLS policy violations
Security breaches

Warning Alerts (Within 1 hour):

High error rates (> 1%)
Performance degradation
High database CPU (> 80%)
Storage usage > 80%
Edge function failures

Info Alerts (Daily digest):

Daily usage statistics
Weekly performance summary
Monthly security review

Alert Channels

Email:

Use send-email-notification edge function
Send to platform team email
Include correlation IDs and context

SMS (Critical Only):

Use send-sms-notification edge function
Only for critical alerts
Keep messages concise

In-App Notifications:

Use Platform Notifications (PF-10)
Show in application UI
Persist in database
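
One way to encode the channel rules is a lookup from severity to channels, plus a formatter that keeps SMS bodies short. This is a sketch only: the mapping of in-app notifications to critical and warning alerts, the 160-character SMS limit, and every name in the block are assumptions, and no edge function is called.

```typescript
type AlertSeverity = 'critical' | 'warning' | 'info';
type AlertChannel = 'email' | 'sms' | 'in-app' | 'daily-digest';

interface PlatformAlert {
  severity: AlertSeverity;
  title: string;
  correlationId: string;
}

// SMS appears only under critical, matching "SMS (Critical Only)".
const CHANNELS_BY_SEVERITY: Record<AlertSeverity, AlertChannel[]> = {
  critical: ['email', 'sms', 'in-app'],
  warning: ['email', 'in-app'],
  info: ['daily-digest'],
};

function channelsFor(alert: PlatformAlert): AlertChannel[] {
  return CHANNELS_BY_SEVERITY[alert.severity];
}

// Assumed single-segment SMS length.
const SMS_MAX_LENGTH = 160;

// Builds a concise SMS body that always keeps the correlation ID.
function formatSms(alert: PlatformAlert): string {
  const prefix = `${alert.severity.toUpperCase()}: `;
  const suffix = ` [${alert.correlationId}]`;
  const room = Math.max(SMS_MAX_LENGTH - prefix.length - suffix.length, 0);
  return prefix + alert.title.slice(0, room) + suffix;
}

const demoAlert: PlatformAlert = {
  severity: 'critical',
  title: 'Database connection failures',
  correlationId: 'abc',
};
console.log(channelsFor(demoAlert)); // [ 'email', 'sms', 'in-app' ]
console.log(formatSms(demoAlert)); // CRITICAL: Database connection failures [abc]
```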

Alert Configuration (Planned)

Set up alerts for:

Error rate > 1% over 5 minutes
LCP > 3s for > 10% of users
Database CPU > 80% for > 5 minutes
Edge function failure rate > 5%
Storage usage > 90%

Example Alert Rule:

```
alert:
  name: High Error Rate
  condition: error_rate > 0.01
  duration: 5m
  channels:
    # SMS is reserved for critical alerts, so a warning uses email only
    - email: platform-team@northsight.com
  severity: warning
```
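
To show how a rule such as "error rate > 1% over 5 minutes" might be evaluated, the sketch below computes the error rate over a trailing window of request counts. Because alert configuration is still planned, this is purely illustrative; the sample shape and function names are hypothetical.

```typescript
interface RequestSample {
  timestampMs: number;
  requests: number;
  errors: number;
}

interface RateRule {
  threshold: number; // for example 0.01 for 1%
  windowMs: number; // for example 5 minutes in milliseconds
}

// Error rate across all samples in the window (nowMs - windowMs, nowMs].
function errorRateOverWindow(samples: RequestSample[], nowMs: number, windowMs: number): number {
  let requests = 0;
  let errors = 0;
  for (const sample of samples) {
    if (sample.timestampMs > nowMs - windowMs && sample.timestampMs <= nowMs) {
      requests += sample.requests;
      errors += sample.errors;
    }
  }
  // No traffic in the window means there is nothing to alert on.
  return requests === 0 ? 0 : errors / requests;
}

function shouldFire(rule: RateRule, samples: RequestSample[], nowMs: number): boolean {
  return errorRateOverWindow(samples, nowMs, rule.windowMs) > rule.threshold;
}

const highErrorRate: RateRule = { threshold: 0.01, windowMs: 5 * 60_000 };
const recentSamples: RequestSample[] = [
  { timestampMs: 0, requests: 1000, errors: 50 }, // outside the window at t = 10 min
  { timestampMs: 400_000, requests: 1000, errors: 5 },
  { timestampMs: 500_000, requests: 1000, errors: 20 },
];
console.log(shouldFire(highErrorRate, recentSamples, 600_000)); // true: 25 / 2000 = 1.25%
```

Aggregating counts over the window, rather than averaging per-sample rates, keeps a low-traffic minute with one failure from dominating the result.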

Dashboard Setup

Supabase Dashboard

Available Metrics:

Database CPU/Memory usage
API request count
Storage usage
Edge function invocations
Authentication events

Access:

Go to Supabase Dashboard
Navigate to Project → Metrics
View real-time and historical data

Custom Dashboard (Planned)

Metrics to Display:

Error rate (last 24 hours)
Performance metrics (LCP, INP, CLS)
Active users
API response times
Database query performance
Edge function success rate

Tools:

Grafana (planned)
Datadog (planned)
Custom React dashboard (planned)

Incident Response

Incident Severity Levels

P0 - Critical:

System down
Data breach
Security incident
Response: Immediate (< 15 minutes)

P1 - High:

Major feature broken
Performance degradation
High error rate
Response: Within 1 hour

P2 - Medium:

Minor feature broken
Performance issues (non-critical)
Response: Within 4 hours

P3 - Low:

Cosmetic issues
Non-critical bugs
Response: Next business day
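
The response targets translate directly into deadlines. The sketch below computes one from a detection time; it assumes business days are Monday to Friday in UTC and ignores holidays, and the function names are hypothetical.

```typescript
type IncidentPriority = 'P0' | 'P1' | 'P2' | 'P3';

// Response targets in minutes for the time-boxed priorities.
const RESPONSE_MINUTES: Record<'P0' | 'P1' | 'P2', number> = {
  P0: 15,
  P1: 60,
  P2: 240,
};

// Start of the next Monday-to-Friday day in UTC (holidays are not considered).
function nextBusinessDayUtc(from: Date): Date {
  const next = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate() + 1));
  while (next.getUTCDay() === 0 || next.getUTCDay() === 6) {
    next.setUTCDate(next.getUTCDate() + 1);
  }
  return next;
}

// Latest time by which a response is expected for the given priority.
function responseDeadline(priority: IncidentPriority, detectedAt: Date): Date {
  if (priority === 'P3') return nextBusinessDayUtc(detectedAt);
  return new Date(detectedAt.getTime() + RESPONSE_MINUTES[priority] * 60_000);
}

const detectedAt = new Date('2025-01-07T10:00:00Z');
console.log(responseDeadline('P0', detectedAt).toISOString()); // 2025-01-07T10:15:00.000Z
console.log(responseDeadline('P2', detectedAt).toISOString()); // 2025-01-07T14:00:00.000Z
```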

Incident Response Process

1. Detection:

Monitor alerts
Review error tracking
Check performance metrics

2. Triage:

Assess severity
Identify root cause
Assign owner

3. Resolution:

Fix issue
Deploy fix
Verify resolution

4. Post-Mortem:

Document incident
Identify improvements
Update procedures

On-Call Rotation

Responsibilities:

Monitor alerts
Respond to incidents
Escalate if needed
Document incidents

Rotation:

Weekly rotation (planned)
24/7 coverage (planned)
Escalation path defined

Best Practices

1. Monitoring Coverage

✅ DO:

Monitor all critical paths
Track business metrics
Set up alerts for anomalies
Review metrics regularly

❌ DON’T:

Monitor everything (too noisy)
Ignore false positives
Make alerts too sensitive
Forget to update alerts

2. Error Tracking

✅ DO:

Capture all errors
Include context
Group similar errors
Track error rates

❌ DON’T:

Log PHI/PII
Overwhelm with noise
Ignore error trends
Skip error boundaries

3. Performance Monitoring

✅ DO:

Track Core Web Vitals
Monitor custom metrics
Set performance budgets
Optimize slow paths

❌ DON’T:

Track too many metrics
Ignore performance regressions
Skip performance testing
Forget mobile performance

4. Alerting

✅ DO:

Set meaningful thresholds
Include context in alerts
Test alert delivery
Review and tune alerts

❌ DON’T:

Alert on everything
Ignore alert fatigue
Skip alert testing
Forget to update contacts

Troubleshooting

Issue: Too Many Alerts

Symptoms:

Alert fatigue
Important alerts missed

Solutions:

Increase alert thresholds
Reduce alert frequency
Group similar alerts
Use alert suppression
Review and tune alerts
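
Grouping and suppression can be as simple as a cooldown per alert group. The following sketch shows the idea; the AlertSuppressor class, the group key format, and the 10-minute cooldown are assumptions for illustration rather than an existing platform feature.

```typescript
// Suppresses repeat alerts for the same group key within a cooldown period.
class AlertSuppressor {
  private readonly lastSent = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if the alert should be delivered, false if it is a suppressed repeat.
  shouldSend(groupKey: string, nowMs: number): boolean {
    const previous = this.lastSent.get(groupKey);
    if (previous !== undefined && nowMs - previous < this.cooldownMs) {
      return false;
    }
    this.lastSent.set(groupKey, nowMs);
    return true;
  }
}

// Hypothetical group key: alert name plus affected module.
const suppressor = new AlertSuppressor(10 * 60_000);
console.log(suppressor.shouldSend('high_error_rate:hr', 0)); // true: first occurrence
console.log(suppressor.shouldSend('high_error_rate:hr', 60_000)); // false: within cooldown
console.log(suppressor.shouldSend('high_error_rate:hr', 600_000)); // true: cooldown elapsed
```

The cooldown is measured from the last delivered alert, so a continuously firing condition still produces a reminder once per cooldown period instead of going silent.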

Issue: Missing Alerts

Symptoms:

Issues not detected
Users report before alerts

Solutions:

Lower alert thresholds
Add more alert types
Improve monitoring coverage
Test alert delivery
Review alert configuration

Issue: Performance Monitoring Not Working

Symptoms:

No metrics collected
Dashboard empty

Solutions:

Verify initialization:

```
performanceMonitor.init({ enablePerformanceTracking: true });
```

Check sample rate (may be too low)
Verify web-vitals library installed
Check browser console for errors
Test in development (100% sampling)

Issue: Logs Not Appearing

Symptoms:

No logs in console
Missing log entries

Solutions:

Check the log level (entries below the configured level are filtered out)
Verify logger initialized
Check browser console filters
Verify structured format
Test with explicit log call

Monitoring Checklist

Setup

Ongoing

Related Documentation

Performance Patterns: constitution.md §5.6 (Performance)
Error Handling: src/platform/monitoring/logger.ts
Performance Monitor: src/platform/monitoring/performance-monitor.ts
Production Readiness: docs/operations/PRODUCTION_READINESS.md

Document Owner: Platform Operations Team
Review Frequency: Quarterly
Last Updated: 2025-01-07
