Troubleshooting

When a hands-on demo fails, start with the diagnostics table on that demo page (Operational diagnostics or When a step shows FAILED, as listed below); those rows cover the symptoms you are most likely to hit during the walkthrough.

Demo | First-line fixes
Demo: Mock LLM and MCP | Operational diagnostics: ENGINE_URL, allowlist, MCP subprocess
Demo: Quickstart | When a step shows FAILED: OPENAI_API_KEY, local model URL, worker logs

Use this page when those tables do not resolve the issue, or when the failure happens before you reach a demo (install, deploy, or engine health).

Local stack diagnostics

When containers look unhealthy after Installation, start here before diving into saga-level errors.

make doctor

make doctor prints docker compose ps and recent logs from migrate, engine, and worker — enough to spot a failed migration, crash loop, or port conflict.

Symptom | What to try
Added LLM or MCP credentials to .env but steps still fail with missing key / connection errors | Restart the worker (docker compose up -d worker), not the engine. Compose injects .env at container start. Retry the failed step or start a new saga.
Local Ollama works on the host (curl localhost:11434) but worker steps fail with connection errors | Ollama may bind only 127.0.0.1, or host.docker.internal may not resolve in the worker container on Linux; see Configuration → Local LLM under Docker (Ollama)
Stale schema or bad credentials after editing .env | make reset (wipes engine_db_data and runs make up); only when you can lose local DB state
Stop stack, keep data | make down
Wipe data without full restart sequence | make clean, then make up
Minimal template, migration failed | docker compose -f docker-compose.example.yml logs migrate

Adminer browses Postgres on the dev stack. Default ports and Makefile targets are in Configuration → Dev stack. Prefer warden list sagas --trace-id … and warden list steps --trace-id … for saga state before opening the database UI.

Diagnose the failure mode

Most infrastructure failures fall into one of two buckets:

  • Network boundary — the CLI cannot reach the engine API (ENGINE_URL, health checks, published ports).
  • Outbox / worker execution — the engine wrote commands to Postgres, but the worker is not claiming or finishing them (crash loop, stale claim, step stuck IN_PROGRESS).

Use Stack and CLI (network boundary) for the first and Runtime (outbox and worker) for the second. Deploy and start covers validation errors that occur before any instance runs.
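The triage above can be sketched as a small lookup. This is an illustrative helper only, not part of the warden CLI: the `Bucket` enum, the `classify` function, and the choice of marker substrings are assumptions, with the substrings copied from the symptom tables on this page.

```python
from enum import Enum


class Bucket(Enum):
    """Sections of this page, named after the two failure buckets."""
    NETWORK_BOUNDARY = "Stack and CLI"
    WORKER_EXECUTION = "Runtime"
    UNKNOWN = "unknown"


# Substrings taken from the symptom tables on this page. The mapping is a
# hypothetical illustration, not logic shipped with the CLI.
NETWORK_MARKERS = ("ENGINE_URL is required", "/v1/health failed")
WORKER_MARKERS = ("IN_PROGRESS", "STEP_FAILED")


def classify(symptom: str) -> Bucket:
    """Return the section most likely to cover the given symptom text."""
    if any(marker in symptom for marker in NETWORK_MARKERS):
        return Bucket.NETWORK_BOUNDARY
    if any(marker in symptom for marker in WORKER_MARKERS):
        return Bucket.WORKER_EXECUTION
    return Bucket.UNKNOWN
```

Anything that lands in `UNKNOWN` is a hint to check Deploy and start or the linked demo tables first.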

Stack and CLI (network boundary)

What you see | Likely cause | What to do
bash: warden: command not found | CLI not on shell PATH | Run make sync-dev, then source .venv/bin/activate or prefix with uv run (e.g. uv run warden ping); see Installation
ERROR ENGINE_URL is required … | ENGINE_URL unset | Set per Configuration
GET /v1/health failed: … connection refused | Engine not running | make up; docker compose ps; confirm port 8000
GET /v1/health failed: … timed out | Engine overloaded or blocked | make doctor; retry when healthy
Engine/worker exits: Database schema is not initialized | Migrations did not run | make doctor or docker compose logs migrate; see Local stack diagnostics
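To tell the connection refused and timed out rows apart without the CLI, a standalone probe along these lines may help. It is a minimal sketch using only the Python standard library: `probe` and `hint_for` are hypothetical names, the 3-second timeout is an arbitrary choice, and only the /v1/health path and the ENGINE_URL variable come from this page. The response body is not inspected because its format is not documented here.

```python
import os
import socket
import urllib.error
import urllib.request


def hint_for(exc: BaseException) -> str:
    """Map a failed health probe to the matching row of the table above."""
    # urllib wraps low-level socket errors in URLError; unwrap to the cause.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, ConnectionRefusedError):
        return "engine not running: make up; docker compose ps; confirm port 8000"
    if isinstance(reason, (TimeoutError, socket.timeout)):
        return "engine overloaded or blocked: make doctor; retry when healthy"
    return f"unclassified failure: {reason!r}"


def probe(timeout: float = 3.0) -> str:
    """Call GET /v1/health on ENGINE_URL and summarize the outcome."""
    base = os.environ.get("ENGINE_URL")
    if not base:
        return "ENGINE_URL unset: set it per Configuration"
    url = base.rstrip("/") + "/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"engine reachable: HTTP {resp.status}"
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return hint_for(exc)


if __name__ == "__main__":
    print(probe())
```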

Deploy and start

warden deploy validates YAML, worker references, and prompt files on disk. It does not check LLM credentials; missing keys surface only when a step runs (see Demo: Quickstart).

What you see | Likely cause | What to do
ERROR file not found: config/… | Wrong -f path | Run from repo root
… workers that are not registered … | Saga deployed before worker | Register the worker manifest first; see Demo: Mock LLM and MCP
… prompt is invalid: Prompt file not found … | Missing template or wrong PROMPTS_ROOT on engine | Mount ./config/prompts; see Configuration
SagaDefinition not found … on start | Definition not deployed, or (namespace, name, version) mismatch | Redeploy; match -n, -v, and --namespace to manifest fields (omit --namespace only when manifest uses default)
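For the (namespace, name, version) mismatch in the last row, comparing the values you pass on the command line against the manifest before starting can save a round trip. The sketch below is hypothetical: it assumes the manifest has already been loaded into a dict with top-level `namespace`, `name`, and `version` keys, and that a manifest without a namespace falls back to `default`. Check both assumptions against your actual manifest layout.

```python
def find_mismatches(manifest: dict, name: str, version: str,
                    namespace: str = "default") -> list[str]:
    """List the key fields where the requested values differ from the manifest."""
    declared = {
        # Assumption: a manifest without a namespace uses "default".
        "namespace": manifest.get("namespace", "default"),
        "name": manifest.get("name"),
        "version": manifest.get("version"),
    }
    requested = {"namespace": namespace, "name": name, "version": version}
    return [
        f"{field}: requested {requested[field]!r}, manifest declares {declared[field]!r}"
        for field in ("namespace", "name", "version")
        # Compare as strings so a YAML-parsed integer version still matches "1".
        if str(requested[field]) != str(declared[field])
    ]


# Hypothetical manifest values, for illustration only.
manifest = {"namespace": "demo", "name": "quickstart", "version": 1}
for problem in find_mismatches(manifest, name="quickstart", version="1"):
    print(problem)
```

In this example the only reported field is the namespace, because the call omits it while the manifest declares a non-default one.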

Runtime (outbox and worker)

What you see | Likely cause | What to do
Saga RUNNING; step stuck IN_PROGRESS | Worker down or orphaned claim | make doctor; Saga recovery after recovery timeouts
Saga FAILED shortly after start | Worker returned STEP_FAILED | warden list steps --trace-id … (failed rows show FAILED*); add --errors for one-line briefs or warden show step … for full error_details
Empty list sagas but you started one | Namespace filter mismatch: instances are isolated by namespace; a list query with the wrong filter returns nothing, not a missing saga | Pass the same --namespace used at warden start saga (default: default), or pin --trace-id
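The empty-list row is easier to recognize with a toy model of namespace isolation. This is not the engine's implementation: `SagaInstance` and `list_sagas` are hypothetical names, and the trace ID and namespace values are made up. It only illustrates why a query with the wrong filter returns an empty result instead of an error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SagaInstance:
    trace_id: str
    namespace: str


def list_sagas(instances: list[SagaInstance],
               namespace: str = "default") -> list[SagaInstance]:
    """Return only the instances in the requested namespace."""
    return [saga for saga in instances if saga.namespace == namespace]


# One saga started in a non-default namespace (illustrative values).
started = [SagaInstance(trace_id="t-1", namespace="payments")]

print(list_sagas(started))                        # filter falls back to "default"
print(list_sagas(started, namespace="payments"))  # same namespace as at start
```

The first call returns an empty list even though the saga exists; the second finds it.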

For MCP tool failures (GitHub demo, hosted SSE, stdio auth), see MCP and tools. Mock MCP subprocess issues are covered in Demo: Mock LLM and MCP → Operational diagnostics. Reason-step completion errors (no_submit_call, structured_output_failed, etc.) are listed in Saga manifests → Reason step execution and Demo: Quickstart → When a step shows FAILED.

What's next

If demo tables and the sections above did not resolve the issue, continue with Saga recovery for stuck IN_PROGRESS steps and operator retries (warden saga retry-step, warden saga retry-compensation). Cross-check Configuration → Recovery timeouts when claims or outbox rows stay stale after worker restarts.