Version: 1.0.0
Last Updated: 2026-04-18
Audience: On-call platform admin / FW core engineer
Companion docs:

Quick triage table

| Symptom | First page to open | Likely cause | First action |
| --- | --- | --- | --- |
| Workflow executions stuck “in flight” for hours | Automation Observability | Worker crashed or saturated | Confirm worker healthy; if not, restart |
| DLQ counter rising fast | DLQ admin guide | External dependency outage | Triage by error class; throttle source |
| fw_workflow_audit_events hash mismatch (FW-62 once shipped) | Audit Trail (FW-43) | DBA-level tamper or migration drift | Pause writers; security review |
| Inbound webhook 5xx flood | External Webhooks (Inbound) | Source mis-configured | Lower endpoint rate limit; coordinate with source |
| KPI snapshot lag (no new snapshots) | KPI Dashboards & Snapshots | Snapshot cron stuck (deferred edge fn) | Run ad-hoc; track FW-58 deferred items |
| Rate-limit saturation across many workflows | Rate Limiting & Throttling | Org-wide spike or runaway workflow | Identify workflow; pause / throttle |
| Approval chain stuck without escalation | Approval Chains | Misconfigured SLA or bad assignee | Reassign; fix chain |
| Execution timeout cascade | Retry & Circuit Breakers | External dependency timeout | Open circuit; address dependency |
| Prefill engine returning wrong values | Form Prefill & Smart Defaults | Allowlist / mapping bug | Disable rule; rebuild from sample data |

Runbook 1 — Worker stuck (FW-46)

Symptoms

  • Queue depth growing in Automation Observability.
  • New executions don’t progress past “queued”.
  • p95 wait time spiking.

Diagnosis

  1. Open Automation Observability → switch to Current view.
  2. Confirm queue depth + zero throughput.
  3. Check the platform’s edge-function dashboard (Supabase) for workflow-executor-worker — is it running? Crashing? Out-of-memory?
  4. Check fw_module_settings.fw_worker_concurrency_ceiling per org. A value of 0 is a misconfiguration.
  5. If FW-61 has shipped (planned), open the worker capacity dashboard for saturation telemetry.

Recovery

  1. If the worker has crashed: re-deploy the edge function. Until FW-46-EN-02 (planned) ships, the redeploy loses any in-flight pgmq leases; the watchdog (FW-49) times them out and re-enqueues them.
  2. If it’s a config issue: raise fw_worker_concurrency_ceiling for the affected org via fw_module_settings.
  3. If it’s an external-dependency timeout: see Runbook 5.
  4. Confirm queue drains in Automation Observability.
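
The diagnosis and recovery steps above amount to a small decision table. A minimal sketch in Python, assuming the on-call engineer has already read the four inputs off the dashboards by hand. `WorkerSnapshot`, its field names, and the returned action strings are hypothetical and not part of the FW codebase.

```python
from dataclasses import dataclass


@dataclass
class WorkerSnapshot:
    """Values read manually from the dashboards (hypothetical structure)."""
    queue_depth: int            # pending executions in Automation Observability
    throughput_per_min: float   # completed executions per minute
    worker_running: bool        # status on the edge-function dashboard
    concurrency_ceiling: int    # fw_worker_concurrency_ceiling for the org


def suggest_recovery(s: WorkerSnapshot) -> str:
    """Map a snapshot onto the recovery steps of Runbook 1."""
    if s.queue_depth == 0 or s.throughput_per_min > 0:
        return "no stall detected: keep monitoring"
    if not s.worker_running:
        return "redeploy the edge function"
    if s.concurrency_ceiling <= 0:
        return "raise fw_worker_concurrency_ceiling"
    return "suspect external dependency: see Runbook 5"
```

The order of the checks mirrors the recovery list: a crashed worker is ruled out before configuration, and an external dependency is the fallback when both look healthy.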

Postmortem actions

  • Capture timeline + what you changed.
  • File an enhancement entry under FW-46-ENHANCEMENTS.md if the issue is structural.
  • Update this runbook with the specific fix.

Runbook 2 — DLQ growing (FW-47)

Symptoms

  • DLQ counter rising in observability.
  • Workflow owners receiving DLQ notifications.

Diagnosis

  1. Open the DLQ admin page.
  2. Filter by error class (transient / policy / data / external).
  3. For external failures, identify the dependency and its status.

Recovery

  1. Confirm affected organization_id and ensure all commands include an organization_id filter so changes apply to that org only.
  2. External dependency restored → bulk retry the affected entries (scoped to the confirmed organization_id).
  3. Bad data at source → ask the source owner to fix; bulk discard impossible-to-fix entries (scoped to the confirmed organization_id).
  4. Misconfigured workflow → disable the workflow until fixed (scoped to the confirmed organization_id).
  5. Permission policy changed → grant the needed permission OR discard entries (scoped to the confirmed organization_id).
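
The org-scoping rule in step 1 is easy to get wrong under pressure, so it can help to see it as code. The sketch below groups DLQ entries by error class for one organization only. The entry shape (`id`, `organization_id`, `error_class`) and the class-to-action mapping are assumptions made for illustration; the real DLQ schema may differ.

```python
from collections import defaultdict

# Assumed mapping from error class to the recovery step that usually applies.
ACTIONS = {
    "external": "bulk retry once the dependency is restored",
    "data": "ask the source owner to fix; discard unfixable entries",
    "policy": "grant the needed permission or discard",
}


def plan_dlq_actions(entries, organization_id):
    """Group one org's DLQ entry ids by error class and attach a suggested action."""
    grouped = defaultdict(list)
    for entry in entries:
        if entry["organization_id"] != organization_id:
            continue  # never touch another org's entries
        grouped[entry["error_class"]].append(entry["id"])
    return {
        cls: {"action": ACTIONS.get(cls, "triage manually"), "entry_ids": ids}
        for cls, ids in grouped.items()
    }
```

Classes without a mapped action fall back to manual triage instead of guessing.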

Postmortem

  • Tighten DLQ retention if the queue is overflowing the dashboard.
  • Set up a saved filter for the error class to speed future triage.
  • Consider adding Rate Limiting on the source.

Runbook 3 — Audit chain break (FW-43; FW-62 once shipped)

Symptoms (FW-62 once shipped)

  • Audit integrity verifier reports a hash mismatch.
  • Compliance officer can’t reconcile a date range.

Symptoms (today, pre-FW-62)

  • Suspected tampering / accidental privileged write.
  • Audit row count anomaly.

Diagnosis

  1. Confirm the suspected range in Audit & Compliance Reporting.
  2. Pre-FW-62: cross-reference application logs vs DB row count for the range.
  3. Post-FW-62: open Audit & Compliance > Integrity tab, validate the range, identify the broken hash chain link.
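
For readers unfamiliar with hash chains, the sketch below shows what "identify the broken hash chain link" means in principle. It assumes each event stores a `payload` and a `hash` computed as SHA-256 over the previous event's hash concatenated with the payload. That scheme is an illustration only; the actual FW-62 format is not specified here.

```python
import hashlib


def link_hash(prev_hash: str, payload: str) -> str:
    """Hash of one event: SHA-256 over previous hash plus payload (assumed scheme)."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def first_broken_link(events, genesis=""):
    """Return the index of the first event whose stored hash does not match, or None."""
    prev = genesis
    for i, event in enumerate(events):
        if event["hash"] != link_hash(prev, event["payload"]):
            return i
        prev = event["hash"]
    return None
```

Because each hash depends on the one before it, editing or deleting a row invalidates the check at that position, which is why the first mismatch marks where to start the investigation.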

Recovery

  1. Confirm affected organization_id and ensure all pause/restore operations are scoped to that org only.
  2. Pause writers to the affected tables (scoped to the confirmed organization_id).
  3. Engage security review. Determine: privilege misuse, migration error, or true tampering?
  4. Restore from PF-11 cold-storage archive if available for the range (scoped to the confirmed organization_id).
  5. Document in incident ledger; report per HIPAA breach-notification rules if applicable.

Postmortem

  • Accelerate FW-62 implementation if not already in progress.
  • Audit who has service_role access; rotate.
  • File security-auditor pre-flight against fw_workflow_audit_events.

Runbook 4 — Webhook flood (FW-59)

Symptoms

  • A specific webhook endpoint receiving hundreds of requests per minute or more.
  • 429 responses in the endpoint logs, which means the rate limit is working.
  • Alternatively, a saturated downstream workflow, which means the rate limit is too loose.

Diagnosis

  1. Open the endpoint’s Logs tab.
  2. Confirm source IP / API key.
  3. Confirm rate-limit state.

Recovery

  1. Confirm affected organization_id and ensure all throttle/disable operations are scoped to that org only.
  2. Lower endpoint rate limit (Rate Limiting & Throttling) to throttle (scoped to the confirmed organization_id).
  3. Coordinate with source — they should implement exponential backoff on 429.
  4. Disable the endpoint if the source is uncoordinated and PHI-touching (scoped to the confirmed organization_id).
  5. Rotate the secret if the flood looks malicious (scoped to the confirmed organization_id).
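
When coordinating with the source in step 3, a concrete example of exponential backoff on 429 can be shared with them. The following is a minimal sketch using capped exponential delays with full jitter. The function names, the 1-second base, and the 60-second cap are illustrative assumptions, and `send` stands in for whatever HTTP call the partner makes.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound in seconds of the wait after failed attempt `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def send_with_backoff(send, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call send() until it returns a status other than 429 or attempts run out."""
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Full jitter: wait a random fraction of the exponential bound.
            sleep(rng() * backoff_delay(attempt))
    return status
```

The jitter spreads retries from many clients over time so they do not all return at the same instant once the limit window resets.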

Postmortem

  • Add IP allow-list if the source has stable IPs.
  • Document the partner’s accepted rate in docs/architecture/integrations/.
  • Consider FW-53 budget changes platform-wide.

Runbook 5 — Execution timeout cascade (FW-49)

Symptoms

  • Many executions hitting timed_out status.
  • Watchdog auto-cancellation events flooding audit log.

Diagnosis

  1. Open Automation Observability, filter status = timed_out.
  2. Identify common workflow / step.
  3. Inspect external dependencies the step calls.

Recovery

  1. Confirm affected organization_id and ensure all circuit breaker/retry operations are scoped to that org only.
  2. Manually open the circuit breaker (Retry & Circuit Breakers) on the affected workflow node (scoped to the confirmed organization_id). An open circuit fast-fails subsequent calls so they go to the DLQ instead of timing out.
  3. Address the dependency (restart, scale, etc.).
  4. Reset circuits when dependency is healthy (scoped to the confirmed organization_id).
  5. Bulk retry from DLQ (scoped to the confirmed organization_id).
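
The open, reset, and retry sequence above can be pictured with a toy model. This sketch covers only the manual open and reset used in this runbook; production breakers usually also trip automatically on a failure threshold. The class and its methods are hypothetical and do not reflect the FW implementation.

```python
class CircuitBreaker:
    """Minimal breaker with manual open/reset, mirroring recovery steps 2 and 4."""

    def __init__(self):
        self.is_open = False

    def open(self):
        self.is_open = True

    def reset(self):
        self.is_open = False

    def call(self, fn, dead_letter):
        """Run fn() when closed; when open, skip the call and park fn in dead_letter."""
        if self.is_open:
            dead_letter.append(fn)
            return None
        return fn()
```

While the breaker is open the dependency is never called, so nothing waits for a timeout. After the reset, the parked calls can be replayed, which corresponds to the bulk retry from the DLQ.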

Postmortem

  • Tune workflow timeout per node based on observed dependency p95.
  • Consider compensation actions for partial-state recovery.
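
Tuning a timeout from the observed p95 is simple arithmetic. A sketch using the nearest-rank percentile follows; the 1.5 headroom factor is an assumption to adjust per dependency, and the function names are hypothetical.

```python
def p95(samples):
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def suggest_timeout(samples_ms, headroom=1.5):
    """Timeout = observed p95 times a safety factor (the default is an assumption)."""
    return p95(samples_ms) * headroom
```

For example, if a dependency's p95 latency is 2,000 ms, this suggests a 3,000 ms node timeout.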

Runbook 6 — KPI snapshot lag (FW-58)

Symptoms

  • KPI dashboard shows stale values.
  • Trend charts have gaps.

Diagnosis

  1. The scheduled-PDF edge function and snapshot cron are deferred in the current FW-58 release. Manual ad-hoc render is the only dispatch path today.
  2. Confirm fw_kpi_snapshots rows for the period via DB.
  3. If snapshot cron has shipped per FW-58 follow-up, check edge function health.

Recovery

  1. Run ad-hoc snapshot from KPI > KPI page > Refresh.
  2. For PDF dispatch: render ad-hoc and email out-of-band until FW-58 deferred items ship.

Runbook 7 — Rate-limit saturation across the org

Symptoms

  • Multiple workflows hitting rate limits.
  • Stats page shows widespread throttling.

Diagnosis

  1. Open Rate Limiting > Stats.
  2. Identify the top-throttled workflow (single bad actor likely).
  3. Cross-check with Automation Observability — is one workflow consuming > 50% of throughput?
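
The "more than 50% of throughput" check in step 3 can be run over an export of recent executions. A minimal sketch, assuming the export reduces to a list of workflow IDs with one entry per execution; `dominant_workflow` is a hypothetical helper.

```python
from collections import Counter


def dominant_workflow(execution_workflow_ids, threshold=0.5):
    """Return (workflow_id, share) if one workflow exceeds `threshold`, else None."""
    counts = Counter(execution_workflow_ids)
    if not counts:
        return None
    workflow_id, n = counts.most_common(1)[0]
    share = n / sum(counts.values())
    return (workflow_id, share) if share > threshold else None
```

A `None` result suggests an org-wide spike and not a single runaway workflow, which points toward tuning the per-org limit.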

Recovery

  1. Confirm affected organization_id and ensure all disable/tune operations are scoped to that org only.
  2. Disable the runaway workflow until investigated (scoped to the confirmed organization_id).
  3. Tune the per-org limit if legitimate growth (raise carefully; scoped to the confirmed organization_id).
  4. Add a per-workflow limit on the noisy workflow even after it’s restored (scoped to the confirmed organization_id).

Runbook 8 — Approval queue overload (FW-34)

Symptoms

  • Approval inbox showing hundreds of pending items for a single approver.
  • SLA breach alerts piling up.

Diagnosis

  1. Confirm the approver is real and authenticated.
  2. Confirm the routing rule isn’t broken (FW-54: check whether all routes lead to one user).
  3. Confirm the approver hasn’t delegated (delegation should redistribute).

Recovery

  1. Bulk reassign to peers (forms_admin can do this).
  2. Set up role-based assignment in the chain so workload distributes.
  3. Configure escalation so SLA breaches auto-route.
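
One way to plan the bulk reassignment in step 1 is a round-robin split across peers. This is an illustrative policy, not something the runbook prescribes, and the function below only computes the assignment; applying it still goes through the admin tooling.

```python
from itertools import cycle


def reassign_round_robin(pending_item_ids, peers):
    """Spread pending approvals evenly across peers; returns {item_id: peer}."""
    if not peers:
        raise ValueError("need at least one peer to reassign to")
    return dict(zip(pending_item_ids, cycle(peers)))
```

With five items and two peers, the first peer receives three items and the second receives two.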

Runbook 9 — Prefill engine failure (FW-60)

Symptoms

  • Forms load with no prefilled values when values are expected.
  • Or forms load prefilled with wrong values.

Diagnosis

  1. Open the form’s Prefill tab.
  2. For each rule, confirm the entity record exists / URL parameter present.
  3. Check rule priority (first-win).
  4. Test in Preview with the same context.
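
First-win priority is a common source of "wrong value" reports, because a higher-priority rule can shadow the one the form author expected. The sketch below illustrates the idea under two assumptions: a lower `priority` number wins, and each rule reads a single key from a context dictionary. The rule shape is hypothetical and does not mirror the FW-60 schema.

```python
def resolve_prefill(rules, context):
    """Return (rule_name, value) from the first rule, by priority, that yields a value."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        value = context.get(rule["context_key"])
        if value is not None:
            return rule["name"], value
    return None
```

When a rule's source is missing (no entity record, no URL parameter), it yields nothing and the next rule in priority order gets its turn, which is why step 2 checks each source before looking at the mapping.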

Recovery

  1. Disable the bad rule to limit the blast radius.
  2. Fix the mapping (entity field, JSONPath, context key).
  3. Re-enable.
  4. If the issue is in the allowlist: file a request with platform team to extend / fix.

When to escalate

  • Suspected tampering / breach → security review (Runbook 3).
  • Worker outage that affects > 5 orgs simultaneously → platform team.
  • Persistent regression after an FW PR → revert + escalate to FW core lead.
  • Compliance / regulatory deadline at risk → compliance officer.

Updating this runbook

  • After every significant incident, add a new runbook section or update an existing one.
  • Cross-link to the admin / developer / compliance docs in packages/docs/.
  • Bump the version.
  • Review this runbook quarterly.