FW Operating Runbook - Encore Health OS

Version: 1.0.0 Last Updated: 2026-04-18 Audience: On-call platform admin / FW core engineer Companion docs:

Quick triage table

Symptom	First page to open	Likely cause	First action
Workflow executions stuck “in flight” for hours	Automation Observability	Worker crashed or saturated	Confirm worker healthy; if not, restart
DLQ counter rising fast	DLQ admin guide	External dependency outage	Triage by error class; throttle source
`fw_workflow_audit_events` hash mismatch (FW-62 once shipped)	Audit Trail (FW-43)	DBA-level tamper or migration drift	Pause writers; security review
Inbound webhook 5xx flood	External Webhooks (Inbound)	Source mis-configured	Lower endpoint rate limit; coordinate with source
KPI snapshot lag (no new snapshots)	KPI Dashboards & Snapshots	Snapshot cron stuck (deferred edge fn)	Run ad-hoc; track FW-58 deferred items
Rate-limit saturation across many workflows	Rate Limiting & Throttling	Org-wide spike or runaway workflow	Identify workflow; pause / throttle
Approval chain stuck without escalation	Approval Chains	Misconfigured SLA or bad assignee	Reassign; fix chain
Execution timeout cascade	Retry & Circuit Breakers	External dependency timeout	Open circuit; address dependency
Prefill engine returning wrong values	Form Prefill & Smart Defaults	Allowlist / mapping bug	Disable rule; rebuild from sample data

Runbook 1 — Worker stuck (FW-46)

Symptoms

Queue depth growing in Automation Observability.
New executions don’t progress past “queued”.
p95 wait time spiking.

Diagnosis

Open Automation Observability → switch to Current view.
Confirm queue depth + zero throughput.
Check the platform’s edge-function dashboard (Supabase) for workflow-executor-worker — is it running? Crashing? Out-of-memory?
Check fw_module_settings.fw_worker_concurrency_ceiling per org. Is it 0? (Misconfig.)
If FW-61 has shipped (planned), open the worker capacity dashboard for saturation telemetry.

Recovery

If the worker has crashed: re-deploy the edge function. Pre-FW-46-EN-02 (planned) the redeploy will lose any in-flight pgmq leases — the watchdog (FW-49) will time them out and re-enqueue.
If it’s a config issue: raise fw_worker_concurrency_ceiling for the affected org via fw_module_settings.
If it’s an external-dependency timeout: see Runbook 5.
Confirm queue drains in Automation Observability.

Postmortem actions

Capture timeline + what you changed.
File an enhancement entry under FW-46-ENHANCEMENTS.md if the issue is structural.
Update this runbook with the specific fix.

Runbook 2 — DLQ growing (FW-47)

Symptoms

DLQ counter rising in observability.
Workflow owners receiving DLQ notifications.

Diagnosis

Open the DLQ admin page.
Filter by error class (transient / policy / data / external).
For external failures, identify the dependency and its status.

Recovery

Confirm affected organization_id and ensure all commands include an organization_id filter so changes apply to that org only.
External dependency restored → bulk retry the affected entries (scoped to the confirmed organization_id).
Bad data at source → ask the source owner to fix; bulk discard impossible-to-fix entries (scoped to the confirmed organization_id).
Misconfigured workflow → disable the workflow until fixed (scoped to the confirmed organization_id).
Permission policy changed → grant the needed permission OR discard entries (scoped to the confirmed organization_id).

Postmortem

Tighten DLQ retention if the queue is overflowing the dashboard.
Set up a saved filter for the error class to speed future triage.
Consider adding Rate Limiting on the source.

Runbook 3 — Audit chain break (FW-43; FW-62 once shipped)

Symptoms (FW-62 once shipped)

Audit integrity verifier reports a hash mismatch.
Compliance officer can’t reconcile a date range.

Symptoms (today, pre-FW-62)

Suspected tampering / accidental privileged write.
Audit row count anomaly.

Diagnosis

Confirm the suspected range in Audit & Compliance Reporting.
Pre-FW-62: cross-reference application logs vs DB row count for the range.
Post-FW-62: open Audit & Compliance > Integrity tab, validate the range, identify the broken hash chain link.

Recovery

Confirm affected organization_id and ensure all pause/restore operations are scoped to that org only.
Pause writers to the affected tables (scoped to the confirmed organization_id).
Engage security review. Determine: privilege misuse, migration error, or true tampering?
Restore from PF-11 cold-storage archive if available for the range (scoped to the confirmed organization_id).
Document in incident ledger; report per HIPAA breach-notification rules if applicable.

Postmortem

Accelerate FW-62 implementation if not already in progress.
Audit who has service_role access; rotate.
File security-auditor pre-flight against fw_workflow_audit_events.

Runbook 4 — Webhook flood (FW-59)

Symptoms

A specific webhook endpoint receiving 100s+ requests / minute.
429 responses in the endpoint logs (good — rate limit working).
Or downstream workflow saturated (rate limit too loose).

Diagnosis

Open the endpoint’s Logs tab.
Confirm source IP / API key.
Confirm rate-limit state.

Recovery

Confirm affected organization_id and ensure all throttle/disable operations are scoped to that org only.
Lower endpoint rate limit (Rate Limiting & Throttling) to throttle (scoped to the confirmed organization_id).
Coordinate with source — they should implement exponential backoff on 429.
Disable the endpoint if the source is uncoordinated and PHI-touching (scoped to the confirmed organization_id).
Rotate the secret if the flood looks malicious (scoped to the confirmed organization_id).

Postmortem

Add IP allow-list if the source has stable IPs.
Document the partner’s accepted rate in docs/architecture/integrations/.
Consider FW-53 budget changes platform-wide.

Runbook 5 — Execution timeout cascade (FW-49)

Symptoms

Many executions hitting timed_out status.
Watchdog auto-cancellation events flooding audit log.

Diagnosis

Open Automation Observability, filter status = timed_out.
Identify common workflow / step.
Inspect external dependencies the step calls.

Recovery

Confirm affected organization_id and ensure all circuit breaker/retry operations are scoped to that org only.
Open circuit breaker manually (Retry & Circuit Breakers) on the affected workflow node (scoped to the confirmed organization_id) — fast-fails subsequent calls so they go to DLQ instead of timing out.
Address the dependency (restart, scale, etc.).
Reset circuits when dependency is healthy (scoped to the confirmed organization_id).
Bulk retry from DLQ (scoped to the confirmed organization_id).

Postmortem

Tune workflow timeout per node based on observed dependency p95.
Consider compensation actions for partial-state recovery.

Runbook 6 — KPI snapshot lag (FW-58)

Symptoms

KPI dashboard shows stale values.
Trend charts have gaps.

Diagnosis

The scheduled-PDF edge function and snapshot cron are deferred in the current FW-58 release. Manual ad-hoc render is the only dispatch path today.
Confirm fw_kpi_snapshots rows for the period via DB.
If snapshot cron has shipped per FW-58 follow-up, check edge function health.

Recovery

Run ad-hoc snapshot from KPI > KPI page > Refresh.
For PDF dispatch: render ad-hoc and email out-of-band until FW-58 deferred items ship.

Runbook 7 — Rate-limit saturation across the org

Symptoms

Multiple workflows hitting rate limits.
Stats page shows widespread throttling.

Diagnosis

Open Rate Limiting > Stats.
Identify the top-throttled workflow (single bad actor likely).
Cross-check with Automation Observability — is one workflow consuming > 50% of throughput?

Recovery

Confirm affected organization_id and ensure all disable/tune operations are scoped to that org only.
Disable the runaway workflow until investigated (scoped to the confirmed organization_id).
Tune the per-org limit if legitimate growth (raise carefully; scoped to the confirmed organization_id).
Add a per-workflow limit on the noisy workflow even after it’s restored (scoped to the confirmed organization_id).

Runbook 8 — Approval queue overload (FW-34)

Symptoms

Approval inbox showing 100s of pending items for a single approver.
SLA breach alerts piling up.

Diagnosis

Confirm the approver is real and authenticated.
Confirm the routing rule isn’t broken (FW-54 → all routes lead to one user).
Confirm the approver hasn’t delegated (delegation should redistribute).

Recovery

Bulk reassign to peers (forms_admin can do this).
Set up role-based assignment in the chain so workload distributes.
Configure escalation so SLA breaches auto-route.

Runbook 9 — Prefill engine failure (FW-60)

Symptoms

Forms loading with no prefilled values when they should.
Or prefilled with wrong values.

Diagnosis

Open the form’s Prefill tab.
For each rule, confirm the entity record exists / URL parameter present.
Check rule priority (first-win).
Test in Preview with the same context.

Recovery

Disable the bad rule to stop blast radius.
Fix the mapping (entity field, JSONPath, context key).
Re-enable.
If the issue is in the allowlist: file a request with platform team to extend / fix.

When to escalate

Suspected tampering / breach → security review (Runbook 3).
Worker outage that affects > 5 orgs simultaneously → platform team.
Persistent regression after a FW PR → revert + escalate to FW core lead.
Compliance / regulatory deadline at risk → compliance officer.

Updating this runbook

After every significant incident, add a new runbook section or update an existing one.
Cross-link to the admin / developer / compliance docs in packages/docs/.
Bump the version.
Reviewed quarterly.

Overview

Clinical

Practice Management

Recovery Housing

Finance & Revenue

Workforce & HR

Community Engagement

Facilities & Inventory

Forms & Workflow

Governance & Compliance

IT Service Management

Leadership

Platform

Documentation Index

​Quick triage table

​Runbook 1 — Worker stuck (FW-46)

​Symptoms

​Diagnosis

​Recovery

​Postmortem actions

​Runbook 2 — DLQ growing (FW-47)

​Symptoms

​Diagnosis

​Recovery

​Postmortem

​Runbook 3 — Audit chain break (FW-43; FW-62 once shipped)

​Symptoms (FW-62 once shipped)

​Symptoms (today, pre-FW-62)

​Diagnosis

​Recovery

​Postmortem

​Runbook 4 — Webhook flood (FW-59)

​Symptoms

​Diagnosis

​Recovery

​Postmortem

​Runbook 5 — Execution timeout cascade (FW-49)

​Symptoms

​Diagnosis

​Recovery

​Postmortem

​Runbook 6 — KPI snapshot lag (FW-58)

​Symptoms

​Diagnosis

​Recovery

​Runbook 7 — Rate-limit saturation across the org

​Symptoms

​Diagnosis

​Recovery

​Runbook 8 — Approval queue overload (FW-34)

​Symptoms

​Diagnosis

​Recovery

​Runbook 9 — Prefill engine failure (FW-60)

​Symptoms

​Diagnosis

​Recovery

​When to escalate

​Updating this runbook

Quick triage table

Runbook 1 — Worker stuck (FW-46)

Symptoms

Diagnosis

Recovery

Postmortem actions

Runbook 2 — DLQ growing (FW-47)

Symptoms

Diagnosis

Recovery

Postmortem

Runbook 3 — Audit chain break (FW-43; FW-62 once shipped)

Symptoms (FW-62 once shipped)

Symptoms (today, pre-FW-62)

Diagnosis

Recovery

Postmortem

Runbook 4 — Webhook flood (FW-59)

Symptoms

Diagnosis

Recovery

Postmortem

Runbook 5 — Execution timeout cascade (FW-49)

Symptoms

Diagnosis

Recovery

Postmortem

Runbook 6 — KPI snapshot lag (FW-58)

Symptoms

Diagnosis

Recovery

Runbook 7 — Rate-limit saturation across the org

Symptoms

Diagnosis

Recovery

Runbook 8 — Approval queue overload (FW-34)

Symptoms

Diagnosis

Recovery

Runbook 9 — Prefill engine failure (FW-60)

Symptoms

Diagnosis

Recovery

When to escalate

Updating this runbook