Version: 1.0.0
Last Updated: 2025-01-07
This guide documents the monitoring and alerting strategy for the Encore Health OS Platform, covering error tracking, performance monitoring, log aggregation, and incident response.

Table of Contents

  1. Overview
  2. Error Tracking
  3. Performance Monitoring
  4. Log Aggregation
  5. Alerting Configuration
  6. Dashboard Setup
  7. Incident Response
  8. Best Practices
  9. Troubleshooting

Overview

Encore Health OS uses a multi-layered monitoring approach:
  • Error Tracking: Sentry (implemented), LogRocket (planned)
  • Performance Monitoring: Core Web Vitals, custom metrics
  • Log Aggregation: Structured logging, Supabase logs
  • Alerting: Email/SMS notifications for critical issues
  • Dashboards: Supabase Dashboard, custom dashboards (planned)
Monitoring Goals:
  • Detect errors before users report them
  • Track performance degradation
  • Monitor system health
  • Enable rapid incident response

Error Tracking

Current Implementation

Structured Logging:
  • Location: src/platform/monitoring/logger.ts
  • Format: JSON with standard fields
  • PHI Protection: Never logs PHI/PII
Log Levels:
  • debug - Development debugging
  • info - Normal operations
  • warn - Warning conditions
  • error - Error conditions
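The four levels above imply a severity ordering that a logger can use to drop messages below a configured minimum. The sketch below illustrates that idea only; the numeric ranks and the shouldLog helper are assumptions for illustration and are not taken from src/platform/monitoring/logger.ts.

```typescript
// Level names come from the list above; the numeric ranks are an assumption.
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

const LEVEL_RANK: Record<LogLevel, number> = {
  debug: 10,
  info: 20,
  warn: 30,
  error: 40,
};

// Hypothetical helper: true when `level` meets or exceeds the configured minimum.
function shouldLog(level: LogLevel, minLevel: LogLevel): boolean {
  return LEVEL_RANK[level] >= LEVEL_RANK[minLevel];
}
```

With a minimum of info, debug messages are dropped while warn and error messages pass through.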

Sentry Integration (Implemented)

Current implementation: src/platform/monitoring/sentry.ts (PF-07). Initialization runs in src/main.tsx via initSentry() before React renders. Package: @sentry/react only (no deprecated @sentry/tracing). Tracing uses reactRouterV6BrowserTracingIntegration; replay uses replayIntegration with HIPAA-safe options (maskAllText, blockAllMedia).
Required environment variable:
  • VITE_SENTRY_DSN – If unset, Sentry is disabled (enabled: false). Do not commit DSN to repo; use env per environment.
Optional (for source map upload on build):
  • SENTRY_AUTH_TOKEN – Auth token for uploads (e.g. production CI).
  • SENTRY_ORG – Sentry organization slug.
  • SENTRY_PROJECT – Sentry project slug.
When all three are set, npm run build generates source maps and uploads them via @sentry/vite-plugin so production errors show readable stack traces.
Initialization (reference): See src/platform/monitoring/sentry.ts. Summary:
  • dsn and enabled from VITE_SENTRY_DSN.
  • release from VITE_APP_VERSION (buildId from Vite).
  • beforeSend scrubs ui.input breadcrumb data and truncates message/exception text to limit PHI.
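The scrubbing step summarized above can be pictured as a pure function over the outgoing event. This is a minimal sketch, not the code in sentry.ts: the MonitoringEvent and Breadcrumb shapes are simplified local stand-ins for the Sentry event type, and scrubEvent, truncate, and the 200-character limit are hypothetical.

```typescript
// Simplified local shapes (assumption); the real Sentry event has many more fields.
interface Breadcrumb {
  category?: string;
  message?: string;
  data?: Record<string, unknown>;
}

interface MonitoringEvent {
  message?: string;
  breadcrumbs?: Breadcrumb[];
  exception?: { values?: Array<{ type?: string; value?: string }> };
}

const MAX_TEXT_LENGTH = 200; // assumed limit, not taken from sentry.ts

// Cuts long free text so that less potentially sensitive content leaves the browser.
function truncate(text: string | undefined): string | undefined {
  if (text === undefined || text.length <= MAX_TEXT_LENGTH) return text;
  return text.slice(0, MAX_TEXT_LENGTH) + '[truncated]';
}

// Hypothetical scrubber: drops ui.input breadcrumb data and truncates message/exception text.
function scrubEvent(event: MonitoringEvent): MonitoringEvent {
  return {
    ...event,
    message: truncate(event.message),
    breadcrumbs: event.breadcrumbs?.map((crumb) =>
      crumb.category === 'ui.input' ? { ...crumb, data: undefined } : crumb,
    ),
    exception: event.exception && {
      values: event.exception.values?.map((v) => ({ ...v, value: truncate(v.value) })),
    },
  };
}
```

Keeping the scrubber free of SDK calls makes it easy to unit test with plain objects before wiring it into a beforeSend hook.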
Error boundaries: Use the platform ErrorBoundary from @/platform/monitoring, which reports to Sentry and shows a fallback UI. Do not use @sentry/react’s ErrorBoundary directly; the platform boundary is used in App.tsx and route-level boundaries.

Planned Integration: LogRocket

Alternative to Sentry:
  • Session replay
  • User interaction tracking
  • Network request monitoring
Setup Steps (Future):
  1. Create LogRocket project
  2. Install SDK:
    npm install logrocket
    
  3. Initialize in src/main.tsx:
    import LogRocket from 'logrocket';
    
    LogRocket.init(import.meta.env.VITE_LOGROCKET_APP_ID, {
      shouldCaptureIP: false, // Do not record client IP addresses
      dom: {
        inputSanitizer: true, // Obfuscate all user input fields
      },
    });
    

Error Tracking Best Practices

✅ DO:
  • Capture all unhandled errors
  • Include correlation IDs
  • Sanitize PHI/PII before sending
  • Group similar errors
  • Track error rates
❌ DON’T:
  • Log full user data
  • Include passwords or tokens
  • Log PHI/PII
  • Overwhelm with noise

Performance Monitoring

Core Web Vitals

Current Implementation:
  • Location: src/platform/monitoring/performance-monitor.ts
  • Metrics: LCP, INP (replaces FID), CLS
  • Sampling: 10% in production, 100% in development
Metrics Tracked:
  • LCP (Largest Contentful Paint): < 2.5s (good)
  • INP (Interaction to Next Paint): < 200ms (good)
  • CLS (Cumulative Layout Shift): < 0.1 (good)
Usage:
import { performanceMonitor } from '@/platform/monitoring';

// Initialize
performanceMonitor.init({
  sampleRate: 0.1, // 10% sampling
  enablePerformanceTracking: true,
});

// Custom metric
performanceMonitor.recordMetric({
  name: 'form_submit_duration',
  value: 1234, // milliseconds
  route: '/forms/submit',
});
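The sampleRate option above suggests a per-session decision about whether to report metrics at all. A minimal sketch of such a decision follows; shouldSample is a hypothetical helper, and how performance-monitor.ts actually applies the rate is not shown in this guide.

```typescript
// Hypothetical helper: decides whether this page load reports metrics.
// `random` is injectable so the decision can be tested deterministically.
function shouldSample(sampleRate: number, random: () => number = Math.random): boolean {
  if (sampleRate <= 0) return false; // sampling disabled
  if (sampleRate >= 1) return true;  // e.g. 100% in development
  return random() < sampleRate;      // e.g. roughly 10% of sessions at 0.1
}
```

Making the decision once per page load, rather than once per metric, keeps all metrics from a sampled session together.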

Performance Targets

Lighthouse Scores:
  • Performance: 85+
  • Accessibility: 90+
  • Best Practices: 90+
  • SEO: 90+
  • PWA: 90+
Core Web Vitals:
  • LCP: < 2.5s
  • INP: < 200ms
  • CLS: < 0.1
Time to Interactive: < 3.5s on 3G
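The Core Web Vitals targets above can be encoded as a small lookup so that measured values are checked the same way everywhere. This sketch assumes LCP and INP are reported in milliseconds and CLS as a unitless score; meetsTarget and failingVitals are hypothetical names.

```typescript
type VitalName = 'LCP' | 'INP' | 'CLS';

// Targets from this guide: LCP < 2.5s, INP < 200ms, CLS < 0.1.
const TARGETS: Record<VitalName, number> = {
  LCP: 2500, // milliseconds
  INP: 200,  // milliseconds
  CLS: 0.1,  // unitless layout-shift score
};

// True when the measured value is strictly below the target.
function meetsTarget(name: VitalName, value: number): boolean {
  return value < TARGETS[name];
}

// Returns the names of measured vitals that miss their target.
function failingVitals(measured: Partial<Record<VitalName, number>>): VitalName[] {
  const names: VitalName[] = ['LCP', 'INP', 'CLS'];
  return names.filter((name) => {
    const value = measured[name];
    return value !== undefined && !meetsTarget(name, value);
  });
}
```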

Custom Metrics

Track Business Metrics:
  • Form submission time
  • Report generation time
  • API response times
  • Database query times
Example:
import { performanceMonitor } from '@/platform/monitoring';

// Start timing
performanceMonitor.markStart('form_submit');

// ... form submission logic ...

// End timing
performanceMonitor.markEnd('form_submit');
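One risk with paired start/end calls is that an exception between them leaves the timing open. The sketch below shows one way to guard against that with try/finally. SpanTimer is a hypothetical stand-in for the platform monitor (its real internals are not documented here), and timed is an assumed helper name.

```typescript
// Hypothetical stand-in for a monitor exposing markStart/markEnd.
class SpanTimer {
  private starts = new Map<string, number>();
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now; // injectable clock for deterministic tests
  }

  markStart(name: string): void {
    this.starts.set(name, this.now());
  }

  // Returns elapsed milliseconds, or undefined if markStart was never called.
  markEnd(name: string): number | undefined {
    const start = this.starts.get(name);
    if (start === undefined) return undefined;
    this.starts.delete(name);
    return this.now() - start;
  }
}

// Wraps an async operation so markEnd runs even when the operation throws.
async function timed<T>(timer: SpanTimer, name: string, fn: () => Promise<T>): Promise<T> {
  timer.markStart(name);
  try {
    return await fn();
  } finally {
    timer.markEnd(name); // a real monitor would record the duration here
  }
}
```

A caller would then write something like timed(timer, 'form_submit', () => submitForm(data)), where submitForm is whatever async function performs the work.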

Log Aggregation

Structured Logging

Format:
{
  "timestamp": "2025-01-07T10:00:00Z",
  "level": "info",
  "module": "hr",
  "action": "create_employee",
  "message": "Employee created successfully",
  "user_id": "uuid",
  "org_id": "uuid",
  "correlation_id": "uuid",
  "context": {
    "employee_id": "uuid"
  }
}
Standard Fields:
  • timestamp - ISO 8601 timestamp
  • level - Log level (debug, info, warn, error)
  • module - Module/core name
  • action - Action being performed
  • message - Human-readable message
  • user_id - User ID (stable UUID, not PHI)
  • org_id - Organization ID
  • site_id - Site ID (if applicable)
  • correlation_id - Request correlation ID
PHI Protection:
  • Never log names, emails, SSNs, addresses
  • Only log stable IDs (UUIDs)
  • Sanitize error messages
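To make the "only log stable IDs" rule concrete, the sketch below builds an entry in the format shown above and redacts any context value that does not look like a UUID. It is an illustration under assumptions: buildLogEntry, sanitizeContext, and the '[redacted]' placeholder are hypothetical and may differ from what logger.ts does.

```typescript
interface LogEntry {
  timestamp: string;
  level: 'debug' | 'info' | 'warn' | 'error';
  module: string;
  action: string;
  message: string;
  user_id?: string;
  org_id?: string;
  site_id?: string;
  correlation_id?: string;
  context?: Record<string, string>;
}

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Keeps context values that look like UUIDs; everything else is redacted.
function sanitizeContext(context: Record<string, unknown>): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const [key, value] of Object.entries(context)) {
    safe[key] = typeof value === 'string' && UUID_PATTERN.test(value) ? value : '[redacted]';
  }
  return safe;
}

// Hypothetical builder: stamps the entry with an ISO 8601 time and a sanitized context.
function buildLogEntry(
  fields: Omit<LogEntry, 'timestamp' | 'context'>,
  context: Record<string, unknown> = {},
  now: Date = new Date(),
): LogEntry {
  return { timestamp: now.toISOString(), ...fields, context: sanitizeContext(context) };
}
```

An allow-list (only UUID-shaped values pass) is stricter than a deny-list of known PHI fields, so new fields are redacted by default until someone deliberately permits them.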

Log Destinations

Development:
  • Console (pretty-printed)
  • Browser DevTools
Production (Planned):
  • Log aggregation service (Datadog, Logtail, etc.)
  • Supabase function logs
  • Error tracking service (Sentry; integration already implemented)

Supabase Logs

Edge Function Logs:
  • View in Supabase Dashboard → Edge Functions → Logs
  • Filter by function, time range, log level
  • Export logs for analysis
Database Logs:
  • Query logs in Supabase Dashboard → Database → Logs
  • Monitor slow queries
  • Track connection usage

Alerting Configuration

Alert Types

Critical Alerts (Immediate):
  • System downtime
  • Database connection failures
  • Authentication failures
  • RLS policy violations
  • Security breaches
Warning Alerts (Within 1 hour):
  • High error rates (> 1%)
  • Performance degradation
  • High database CPU (> 80%)
  • Storage usage > 80%
  • Edge function failures
Info Alerts (Daily digest):
  • Daily usage statistics
  • Weekly performance summary
  • Monthly security review

Alert Channels

Email:
  • Use send-email-notification edge function
  • Send to platform team email
  • Include correlation IDs and context
SMS (Critical Only):
  • Use send-sms-notification edge function
  • Only for critical alerts
  • Keep messages concise
In-App Notifications:
  • Use Platform Notifications (PF-10)
  • Show in application UI
  • Persist in database
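The channel rules above amount to a mapping from alert severity to delivery channels. The sketch below shows one possible mapping. Only "SMS for critical alerts only" comes from this guide; sending warnings to email plus in-app and info alerts to email alone are assumptions, and channelsFor is a hypothetical helper.

```typescript
type Severity = 'critical' | 'warning' | 'info';
type Channel = 'email' | 'sms' | 'in_app';

// Hypothetical routing table; adjust to the team's actual policy.
function channelsFor(severity: Severity): Channel[] {
  switch (severity) {
    case 'critical':
      return ['sms', 'email', 'in_app']; // SMS is reserved for critical alerts
    case 'warning':
      return ['email', 'in_app'];        // assumption
    case 'info':
      return ['email'];                  // assumption: delivered as a digest
  }
}
```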

Alert Configuration (Planned)

Set up alerts for:
  • Error rate > 1% over 5 minutes
  • LCP > 3s for > 10% of users
  • Database CPU > 80% for > 5 minutes
  • Edge function failure rate > 5%
  • Storage usage > 90%
Example Alert Rule:
alert:
  name: High Error Rate
  condition: error_rate > 0.01
  duration: 5m
  channels:
    - email: platform-team@northsight.com
    - sms: +1234567890
  severity: warning
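To show how a rule like the one above might be evaluated, the sketch below computes the error rate over the trailing window and compares it with the threshold. It reads "duration: 5m" as "aggregate over the last 5 minutes", which is an assumption; the WindowSample shape and the errorRate and isFiring helpers are hypothetical.

```typescript
// One bucket of traffic counters (hypothetical shape).
interface WindowSample {
  timestampMs: number;
  requests: number;
  errors: number;
}

interface AlertRule {
  name: string;
  threshold: number;  // e.g. 0.01 for 1%
  durationMs: number; // e.g. 5 minutes
}

// Error rate across samples inside the window (nowMs - durationMs, nowMs].
function errorRate(samples: WindowSample[], nowMs: number, durationMs: number): number {
  let requests = 0;
  let errors = 0;
  for (const s of samples) {
    if (s.timestampMs > nowMs - durationMs && s.timestampMs <= nowMs) {
      requests += s.requests;
      errors += s.errors;
    }
  }
  return requests === 0 ? 0 : errors / requests; // no traffic counts as no errors
}

function isFiring(rule: AlertRule, samples: WindowSample[], nowMs: number): boolean {
  return errorRate(samples, nowMs, rule.durationMs) > rule.threshold;
}
```

Aggregating counts before dividing avoids letting a low-traffic bucket with a single error dominate the rate.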

Dashboard Setup

Supabase Dashboard

Available Metrics:
  • Database CPU/Memory usage
  • API request count
  • Storage usage
  • Edge function invocations
  • Authentication events
Access:
  • Go to Supabase Dashboard
  • Navigate to Project → Metrics
  • View real-time and historical data

Custom Dashboard (Planned)

Metrics to Display:
  • Error rate (last 24 hours)
  • Performance metrics (LCP, INP, CLS)
  • Active users
  • API response times
  • Database query performance
  • Edge function success rate
Tools:
  • Grafana (planned)
  • Datadog (planned)
  • Custom React dashboard (planned)

Incident Response

Incident Severity Levels

P0 - Critical:
  • System down
  • Data breach
  • Security incident
  • Response: Immediate (< 15 minutes)
P1 - High:
  • Major feature broken
  • Performance degradation
  • High error rate
  • Response: Within 1 hour
P2 - Medium:
  • Minor feature broken
  • Performance issues (non-critical)
  • Response: Within 4 hours
P3 - Low:
  • Cosmetic issues
  • Non-critical bugs
  • Response: Next business day
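The response targets above can be captured in a lookup so that tooling computes response deadlines consistently. This is a sketch under assumptions: responseDueBy is a hypothetical helper, and P3 returns null because "next business day" depends on a business calendar that this guide does not define.

```typescript
type Priority = 'P0' | 'P1' | 'P2' | 'P3';

// Response windows from the severity table, in minutes; null means "next business day".
const RESPONSE_WINDOW_MINUTES: Record<Priority, number | null> = {
  P0: 15,
  P1: 60,
  P2: 240,
  P3: null,
};

// Returns the latest acceptable first-response time, or null when no fixed window applies.
function responseDueBy(priority: Priority, detectedAt: Date): Date | null {
  const minutes = RESPONSE_WINDOW_MINUTES[priority];
  if (minutes === null) return null;
  return new Date(detectedAt.getTime() + minutes * 60 * 1000);
}
```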

Incident Response Process

1. Detection:
  • Monitor alerts
  • Review error tracking
  • Check performance metrics
2. Triage:
  • Assess severity
  • Identify root cause
  • Assign owner
3. Resolution:
  • Fix issue
  • Deploy fix
  • Verify resolution
4. Post-Mortem:
  • Document incident
  • Identify improvements
  • Update procedures

On-Call Rotation

Responsibilities:
  • Monitor alerts
  • Respond to incidents
  • Escalate if needed
  • Document incidents
Rotation:
  • Weekly rotation (planned)
  • 24/7 coverage (planned)
  • Escalation path defined

Best Practices

1. Monitoring Coverage

✅ DO:
  • Monitor all critical paths
  • Track business metrics
  • Set up alerts for anomalies
  • Review metrics regularly
❌ DON’T:
  • Monitor everything (too noisy)
  • Ignore false positives
  • Set overly sensitive alert thresholds
  • Forget to update alerts

2. Error Tracking

✅ DO:
  • Capture all errors
  • Include context
  • Group similar errors
  • Track error rates
❌ DON’T:
  • Log PHI/PII
  • Overwhelm with noise
  • Ignore error trends
  • Skip error boundaries

3. Performance Monitoring

✅ DO:
  • Track Core Web Vitals
  • Monitor custom metrics
  • Set performance budgets
  • Optimize slow paths
❌ DON’T:
  • Track too many metrics
  • Ignore performance regressions
  • Skip performance testing
  • Forget mobile performance

4. Alerting

✅ DO:
  • Set meaningful thresholds
  • Include context in alerts
  • Test alert delivery
  • Review and tune alerts
❌ DON’T:
  • Alert on everything
  • Ignore alert fatigue
  • Skip alert testing
  • Forget to update contacts

Troubleshooting

Issue: Too Many Alerts

Symptoms:
  • Alert fatigue
  • Important alerts missed
Solutions:
  1. Increase alert thresholds
  2. Reduce alert frequency
  3. Group similar alerts
  4. Use alert suppression
  5. Review and tune alerts
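Alert suppression, mentioned in the solutions above, is often implemented as a per-alert cooldown: once an alert fires, repeats with the same key are dropped until the cooldown expires. The sketch below illustrates that technique; AlertSuppressor and its in-memory state are hypothetical and not part of the platform.

```typescript
// Hypothetical in-memory suppressor keyed by alert name.
class AlertSuppressor {
  private lastSent = new Map<string, number>();
  private cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if the alert should be delivered, false if it is suppressed.
  shouldSend(alertKey: string, nowMs: number): boolean {
    const last = this.lastSent.get(alertKey);
    if (last !== undefined && nowMs - last < this.cooldownMs) return false;
    this.lastSent.set(alertKey, nowMs);
    return true;
  }
}
```

Because suppressed repeats do not reset the timer, a continuously failing system still produces one alert per cooldown period rather than going silent.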

Issue: Missing Alerts

Symptoms:
  • Issues not detected
  • Users report before alerts
Solutions:
  1. Lower alert thresholds
  2. Add more alert types
  3. Improve monitoring coverage
  4. Test alert delivery
  5. Review alert configuration

Issue: Performance Monitoring Not Working

Symptoms:
  • No metrics collected
  • Dashboard empty
Solutions:
  1. Verify initialization:
    performanceMonitor.init({ enablePerformanceTracking: true });
    
  2. Check sample rate (may be too low)
  3. Verify web-vitals library installed
  4. Check browser console for errors
  5. Test in development (100% sampling)

Issue: Logs Not Appearing

Symptoms:
  • No logs in console
  • Missing log entries
Solutions:
  1. Check the log level (messages below the configured level are filtered out)
  2. Verify logger initialized
  3. Check browser console filters
  4. Verify structured format
  5. Test with explicit log call

Monitoring Checklist

Setup

  • Error tracking configured (Sentry/LogRocket)
  • Performance monitoring initialized
  • Log aggregation configured
  • Alerts configured
  • Dashboard created
  • On-call rotation established

Ongoing

  • Daily error review
  • Weekly performance review
  • Monthly alert tuning
  • Quarterly monitoring review
  • Incident post-mortems completed

Related Documentation

  • Performance Patterns: constitution.md §5.6 (Performance)
  • Error Handling: src/platform/monitoring/logger.ts
  • Performance Monitor: src/platform/monitoring/performance-monitor.ts
  • Production Readiness: docs/operations/PRODUCTION_READINESS.md

Document Owner: Platform Operations Team
Review Frequency: Quarterly