Version: 1.0.0
Last Updated: 2026-04-18
Audience: On-call platform admin / FW core engineer
Companion docs:
- Automation Observability (admin)
- Dead Letter Queue (admin)
- Workflow Execution Replay (admin)
- Audit & Compliance Reporting (admin)
- FW Developer Reference
Quick triage table
| Symptom | First page to open | Likely cause | First action |
|---|---|---|---|
| Workflow executions stuck “in flight” for hours | Automation Observability | Worker crashed or saturated | Confirm worker healthy; if not, restart |
| DLQ counter rising fast | DLQ admin guide | External dependency outage | Triage by error class; throttle source |
| fw_workflow_audit_events hash mismatch (FW-62 once shipped) | Audit Trail (FW-43) | DBA-level tamper or migration drift | Pause writers; security review |
| Inbound webhook 5xx flood | External Webhooks (Inbound) | Source mis-configured | Lower endpoint rate limit; coordinate with source |
| KPI snapshot lag (no new snapshots) | KPI Dashboards & Snapshots | Snapshot cron stuck (deferred edge fn) | Run ad-hoc; track FW-58 deferred items |
| Rate-limit saturation across many workflows | Rate Limiting & Throttling | Org-wide spike or runaway workflow | Identify workflow; pause / throttle |
| Approval chain stuck without escalation | Approval Chains | Misconfigured SLA or bad assignee | Reassign; fix chain |
| Execution timeout cascade | Retry & Circuit Breakers | External dependency timeout | Open circuit; address dependency |
| Prefill engine returning wrong values | Form Prefill & Smart Defaults | Allowlist / mapping bug | Disable rule; rebuild from sample data |
Runbook 1 — Worker stuck (FW-46)
Symptoms
- Queue depth growing in Automation Observability.
- New executions don’t progress past “queued”.
- p95 wait time spiking.
Diagnosis
- Open Automation Observability → switch to Current view.
- Confirm queue depth + zero throughput.
- Check the platform’s edge-function dashboard (Supabase) for workflow-executor-worker: is it running? Crashing? Out-of-memory?
- Check fw_module_settings.fw_worker_concurrency_ceiling per org. Is it 0? (Misconfig.)
- If FW-61 has shipped (planned), open the worker capacity dashboard for saturation telemetry.
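The diagnosis checks above can be read as a small decision function. This is a minimal sketch only: the WorkerSnapshot fields and the returned labels are hypothetical names for values an operator reads off the dashboards, not actual platform telemetry fields.

```python
from dataclasses import dataclass


@dataclass
class WorkerSnapshot:
    """Hypothetical bundle of values read manually from the dashboards."""
    queue_depth: int            # executions waiting in "queued"
    throughput_per_min: float   # executions completed in the last minute
    worker_running: bool        # edge function reported as running
    concurrency_ceiling: int    # fw_worker_concurrency_ceiling for the org


def diagnose_worker(s: WorkerSnapshot) -> str:
    """Map the Runbook 1 diagnosis checks to a likely cause label."""
    # A queue that is empty or still draining is not the "stuck" case.
    if s.queue_depth == 0 or s.throughput_per_min > 0:
        return "not_stuck"
    if not s.worker_running:
        return "worker_crashed"
    if s.concurrency_ceiling == 0:
        return "config_ceiling_zero"
    # Worker is up and allowed to run, yet nothing completes.
    return "check_external_dependency"
```

The order of the checks mirrors the list: confirm the stall first, then worker health, then configuration, and only then suspect a dependency.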
Recovery
- If the worker has crashed: re-deploy the edge function. Until FW-46-EN-02 (planned) ships, the redeploy loses any in-flight pgmq leases; the watchdog (FW-49) will time them out and re-enqueue them.
- If it’s a config issue: raise fw_worker_concurrency_ceiling for the affected org via fw_module_settings.
- If it’s an external-dependency timeout: see Runbook 5.
- Confirm queue drains in Automation Observability.
Postmortem actions
- Capture timeline + what you changed.
- File an enhancement entry under FW-46-ENHANCEMENTS.md if the issue is structural.
- Update this runbook with the specific fix.
Runbook 2 — DLQ growing (FW-47)
Symptoms
- DLQ counter rising in observability.
- Workflow owners receiving DLQ notifications.
Diagnosis
- Open the DLQ admin page.
- Filter by error class (transient / policy / data / external).
- For external failures, identify the dependency and its status.
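If DLQ entries have been exported for offline triage, a per-class count for one org shows quickly which error class dominates. A minimal sketch, assuming each exported entry is a dict with hypothetical organization_id and error_class keys; the real export shape may differ.

```python
from collections import Counter

# The four error classes named in the DLQ admin filter.
ERROR_CLASSES = ("transient", "policy", "data", "external")


def summarize_dlq(entries, organization_id):
    """Count DLQ entries per error class for a single organization."""
    counts = Counter({cls: 0 for cls in ERROR_CLASSES})
    for entry in entries:
        if entry["organization_id"] != organization_id:
            continue  # never mix orgs in one triage pass
        cls = entry["error_class"]
        counts[cls if cls in ERROR_CLASSES else "unknown"] += 1
    return dict(counts)
```

A result such as a high external count with everything else near zero points at a dependency outage rather than bad data.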
Recovery
- Confirm affected organization_id and ensure all commands include an organization_id filter so changes apply to that org only.
- External dependency restored → bulk retry the affected entries (scoped to the confirmed organization_id).
- Bad data at source → ask the source owner to fix; bulk discard impossible-to-fix entries (scoped to the confirmed organization_id).
- Misconfigured workflow → disable the workflow until fixed (scoped to the confirmed organization_id).
- Permission policy changed → grant the needed permission OR discard entries (scoped to the confirmed organization_id).
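The org-scoping rule in the recovery steps above is easy to enforce in any ad-hoc tooling by refusing to select entries without an explicit organization_id. A sketch under assumptions: select_for_bulk_action and the entry keys (id, organization_id, error_class) are hypothetical, and the actual retry or discard still happens through the DLQ admin page.

```python
def select_for_bulk_action(entries, organization_id, error_class):
    """Return the ids of DLQ entries eligible for a bulk retry or discard.

    Refuses to run without an organization_id so a bulk action can never
    silently span every org.
    """
    if not organization_id:
        raise ValueError("organization_id is required for any bulk DLQ action")
    return [
        entry["id"]
        for entry in entries
        if entry["organization_id"] == organization_id
        and entry["error_class"] == error_class
    ]
```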
Postmortem
- Tighten DLQ retention if the queue is overflowing the dashboard.
- Set up a saved filter for the error class to speed future triage.
- Consider adding Rate Limiting on the source.
Runbook 3 — Audit chain break (FW-43; FW-62 once shipped)
Symptoms (FW-62 once shipped)
- Audit integrity verifier reports a hash mismatch.
- Compliance officer can’t reconcile a date range.
Symptoms (today, pre-FW-62)
- Suspected tampering / accidental privileged write.
- Audit row count anomaly.
Diagnosis
- Confirm the suspected range in Audit & Compliance Reporting.
- Pre-FW-62: cross-reference application logs vs DB row count for the range.
- Post-FW-62: open Audit & Compliance > Integrity tab, validate the range, identify the broken hash chain link.
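To illustrate what "identify the broken hash chain link" means, the sketch below walks a generic SHA-256 hash chain and reports the first row that fails verification. FW-62 has not shipped, so everything here is an assumption: the row fields (prev_hash, payload, hash), the genesis value, and the hashing of canonical JSON are illustrative and not the verifier's actual scheme.

```python
import hashlib
import json

# Assumed starting value for the first row's prev_hash.
GENESIS = "0" * 64


def row_hash(prev_hash, payload):
    """Hash the previous hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def first_broken_link(rows):
    """Return the index of the first row that fails verification, else None."""
    prev = GENESIS
    for index, row in enumerate(rows):
        linked = row["prev_hash"] == prev
        intact = row["hash"] == row_hash(prev, row["payload"])
        if not (linked and intact):
            return index
        prev = row["hash"]
    return None
```

Because every hash depends on its predecessor, a single altered row invalidates verification from that row onward, which is why the first failing index is the one worth investigating.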
Recovery
- Confirm affected organization_id and ensure all pause/restore operations are scoped to that org only.
- Pause writers to the affected tables (scoped to the confirmed organization_id).
- Engage security review. Determine: privilege misuse, migration error, or true tampering?
- Restore from PF-11 cold-storage archive if available for the range (scoped to the confirmed organization_id).
- Document in incident ledger; report per HIPAA breach-notification rules if applicable.
Postmortem
- Accelerate FW-62 implementation if not already in progress.
- Audit who has service_role access; rotate.
- File security-auditor pre-flight against fw_workflow_audit_events.
Runbook 4 — Webhook flood (FW-59)
Symptoms
- A specific webhook endpoint receiving hundreds of requests per minute or more.
- 429 responses in the endpoint logs (good — rate limit working).
- Or downstream workflow saturated (rate limit too loose).
Diagnosis
- Open the endpoint’s Logs tab.
- Confirm source IP / API key.
- Confirm rate-limit state.
Recovery
- Confirm affected organization_id and ensure all throttle/disable operations are scoped to that org only.
- Lower endpoint rate limit (Rate Limiting & Throttling) to throttle (scoped to the confirmed organization_id).
- Coordinate with source — they should implement exponential backoff on 429.
- Disable the endpoint if the source is uncoordinated and PHI-touching (scoped to the confirmed organization_id).
- Rotate the secret if the flood looks malicious (scoped to the confirmed organization_id).
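When coordinating with the source, it can help to share what "exponential backoff on 429" looks like on their side. A minimal sketch, assuming a hypothetical send callable that performs one delivery attempt and returns the HTTP status code; the base delay, cap, and retry count are illustrative values, and production senders would usually add jitter and honour a Retry-After header if one is returned.

```python
import time


def send_with_backoff(send, max_retries=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Retry send() on HTTP 429, doubling the wait after each attempt.

    Returns the final status code: the first non-429 response, or 429
    if all retries are exhausted.
    """
    status = None
    for attempt in range(max_retries + 1):
        status = send()
        if status != 429:
            return status
        if attempt < max_retries:
            # Waits of 1s, 2s, 4s, ... capped at `cap` seconds.
            sleep(min(cap, base * 2 ** attempt))
    return status
```

The sleep parameter is injectable so the behaviour can be checked without actually waiting.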
Postmortem
- Add IP allow-list if the source has stable IPs.
- Document the partner’s accepted rate in docs/architecture/integrations/.
- Consider FW-53 budget changes platform-wide.
Runbook 5 — Execution timeout cascade (FW-49)
Symptoms
- Many executions hitting timed_out status.
- Watchdog auto-cancellation events flooding audit log.
Diagnosis
- Open Automation Observability, filter status = timed_out.
- Identify common workflow / step.
- Inspect external dependencies the step calls.
Recovery
- Confirm affected organization_id and ensure all circuit breaker/retry operations are scoped to that org only.
- Open circuit breaker manually (Retry & Circuit Breakers) on the affected workflow node (scoped to the confirmed organization_id) — fast-fails subsequent calls so they go to DLQ instead of timing out.
- Address the dependency (restart, scale, etc.).
- Reset circuits when dependency is healthy (scoped to the confirmed organization_id).
- Bulk retry from DLQ (scoped to the confirmed organization_id).
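The effect of manually opening and later resetting a circuit can be sketched as follows. This is a generic illustration of the fast-fail behaviour, not the platform's implementation: ManualCircuitBreaker, CircuitOpenError, and the (organization_id, node_id) key are all hypothetical.

```python
class CircuitOpenError(Exception):
    """Raised instead of calling a dependency whose circuit is open."""


class ManualCircuitBreaker:
    """Operator-controlled breaker keyed by (organization_id, node_id)."""

    def __init__(self):
        self._open = set()

    def open(self, organization_id, node_id):
        self._open.add((organization_id, node_id))

    def reset(self, organization_id, node_id):
        self._open.discard((organization_id, node_id))

    def call(self, organization_id, node_id, fn):
        # Fail immediately rather than waiting for the dependency to time out.
        if (organization_id, node_id) in self._open:
            raise CircuitOpenError(f"circuit open for {organization_id}/{node_id}")
        return fn()
```

Keying the breaker by organization keeps the change scoped: opening the circuit for one org leaves the same node untouched for every other org.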
Postmortem
- Tune workflow timeout per node based on observed dependency p95.
- Consider compensation actions for partial-state recovery.
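For the timeout-tuning action, one simple approach is to take the observed p95 of the dependency's call durations and add headroom. A sketch under assumptions: the nearest-rank percentile method and the 1.5x headroom factor are illustrative choices, not a platform recommendation.

```python
def p95(durations):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    if not durations:
        raise ValueError("at least one duration is required")
    ordered = sorted(durations)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def suggested_timeout(durations, headroom=1.5):
    """Per-node timeout: observed p95 multiplied by a headroom factor."""
    return p95(durations) * headroom
```

For example, a node whose dependency p95 is 4 seconds would get a 6 second timeout with the default headroom.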
Runbook 6 — KPI snapshot lag (FW-58)
Symptoms
- KPI dashboard shows stale values.
- Trend charts have gaps.
Diagnosis
- The scheduled-PDF edge function and snapshot cron are deferred in the current FW-58 release. Manual ad-hoc render is the only dispatch path today.
- Confirm fw_kpi_snapshots rows for the period via DB.
- If snapshot cron has shipped per FW-58 follow-up, check edge function health.
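Once the snapshot timestamps for the period have been pulled from fw_kpi_snapshots, gaps can be located mechanically. A minimal sketch, assuming the timestamps are already loaded as datetime values and that the expected snapshot interval is known; the column that holds the timestamp is not specified here.

```python
from datetime import datetime, timedelta


def find_gaps(snapshot_times, expected_interval):
    """Return (earlier, later) pairs where consecutive snapshots are too far apart."""
    ordered = sorted(snapshot_times)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > expected_interval:
            gaps.append((earlier, later))
    return gaps
```

Each returned pair brackets a missing window, which is the range to backfill with an ad-hoc snapshot.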
Recovery
- Run ad-hoc snapshot from KPI > KPI page > Refresh.
- For PDF dispatch: render ad-hoc and email out-of-band until FW-58 deferred items ship.
Runbook 7 — Rate-limit saturation across the org
Symptoms
- Multiple workflows hitting rate limits.
- Stats page shows widespread throttling.
Diagnosis
- Open Rate Limiting > Stats.
- Identify the top-throttled workflow (single bad actor likely).
- Cross-check with Automation Observability — is one workflow consuming > 50% of throughput?
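The "> 50% of throughput" check can be made explicit with per-workflow execution counts taken from Automation Observability. A minimal sketch; dominant_workflow and the input mapping of workflow id to execution count are hypothetical.

```python
def dominant_workflow(executions_by_workflow, threshold=0.5):
    """Return the workflow whose share of executions exceeds threshold, else None."""
    total = sum(executions_by_workflow.values())
    if total == 0:
        return None
    workflow, count = max(executions_by_workflow.items(), key=lambda kv: kv[1])
    return workflow if count / total > threshold else None
```

A None result suggests an org-wide spike rather than a single runaway workflow, which changes the recovery path from disabling one workflow to tuning the per-org limit.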
Recovery
- Confirm affected organization_id and ensure all disable/tune operations are scoped to that org only.
- Disable the runaway workflow until investigated (scoped to the confirmed organization_id).
- Tune the per-org limit if legitimate growth (raise carefully; scoped to the confirmed organization_id).
- Add a per-workflow limit on the noisy workflow even after it’s restored (scoped to the confirmed organization_id).
Runbook 8 — Approval queue overload (FW-34)
Symptoms
- Approval inbox showing hundreds of pending items for a single approver.
- SLA breach alerts piling up.
Diagnosis
- Confirm the approver is real and authenticated.
- Confirm the routing rule isn’t broken (FW-54 → all routes lead to one user).
- Confirm the approver hasn’t delegated (delegation should redistribute).
Recovery
- Bulk reassign to peers (forms_admin can do this).
- Set up role-based assignment in the chain so workload distributes.
- Configure escalation so SLA breaches auto-route.
Runbook 9 — Prefill engine failure (FW-60)
Symptoms
- Forms loading with no prefilled values when they should.
- Or prefilled with wrong values.
Diagnosis
- Open the form’s Prefill tab.
- For each rule, confirm the entity record exists / URL parameter present.
- Check rule priority (first-win).
- Test in Preview with the same context.
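First-win priority is a common source of "wrong value" reports, because a higher-priority rule that resolves to any value masks every rule below it. The sketch below illustrates the idea only; the rule shape (name, priority, context_key, enabled), the convention that a lower priority number wins, and the skipping of rules that resolve to nothing are assumptions rather than the prefill engine's documented behaviour.

```python
def resolve_prefill(rules, context):
    """Return (rule_name, value) from the first enabled rule that resolves.

    Rules are tried in ascending priority order; a rule whose context key
    is missing or None is skipped. Returns None if no rule resolves.
    """
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if not rule.get("enabled", True):
            continue
        value = context.get(rule["context_key"])
        if value is not None:
            return rule["name"], value
    return None
```

Under these assumptions, disabling a bad rule lets the next rule in priority order take over, which is why disabling is a safe first recovery step.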
Recovery
- Disable the bad rule to limit the blast radius.
- Fix the mapping (entity field, JSONPath, context key).
- Re-enable.
- If the issue is in the allowlist: file a request with platform team to extend / fix.
When to escalate
- Suspected tampering / breach → security review (Runbook 3).
- Worker outage that affects > 5 orgs simultaneously → platform team.
- Persistent regression after a FW PR → revert + escalate to FW core lead.
- Compliance / regulatory deadline at risk → compliance officer.
Updating this runbook
- After every significant incident, add a new runbook section or update an existing one.
- Cross-link to the admin / developer / compliance docs in packages/docs/.
- Bump the version.
- Review this runbook quarterly.