Troubleshooting
When a hands-on demo fails, start with the Operational diagnostics table on that demo page — those rows cover the symptoms you hit during the walkthrough.
| Demo | First-line fixes |
|---|---|
| Demo: Mock LLM and MCP | Operational diagnostics — ENGINE_URL, allowlist, MCP subprocess |
| Demo: Quickstart | When a step shows FAILED — OPENAI_API_KEY, local model URL, worker logs |
Use this page when those tables do not resolve the issue, or when the failure happens before you reach a demo (install, deploy, or engine health).
Local stack diagnostics
When containers look unhealthy after Installation, start here before diving into saga-level errors.
make doctor
make doctor prints docker compose ps and recent logs from migrate, engine, and worker — enough to spot a failed migration, crash loop, or port conflict.
| Symptom | What to try |
|---|---|
Added LLM or MCP credentials to .env but steps still fail with missing key / connection errors | Restart the worker (docker compose up -d worker) — not the engine. Compose injects .env at container start. Retry the failed step or start a new saga. |
Local Ollama works on the host (curl localhost:11434) but worker steps fail with connection errors | Ollama may bind only 127.0.0.1, or host.docker.internal may not resolve in the worker container on Linux — Configuration → Local LLM under Docker (Ollama) |
Stale schema or bad credentials after editing .env | make reset (wipes engine_db_data and runs make up) — only when you can lose local DB state |
| Stop stack, keep data | make down |
| Wipe data without full restart sequence | make clean, then make up |
| Minimal template, migration failed | docker compose -f docker-compose.example.yml logs migrate |
Adminer browses Postgres on the dev stack. Default ports and Makefile targets are in Configuration → Dev stack. Prefer warden list sagas --trace-id … and warden list steps --trace-id … for saga state before opening the database UI.
Diagnose the failure mode
Most infrastructure failures fall into one of two buckets:
- Network boundary — the CLI cannot reach the engine API (
ENGINE_URL, health checks, published ports). - Outbox / worker execution — the engine wrote commands to Postgres, but the worker is not claiming or finishing them (crash loop, stale claim, step stuck
IN_PROGRESS).
Use Stack and CLI for the first; Runtime for the second. Deploy and start covers validation before any instance runs.
Stack and CLI (network boundary)
| What you see | Likely cause | What to do |
|---|---|---|
bash: warden: command not found | CLI not on shell PATH | Run make sync-dev, then source .venv/bin/activate or prefix with uv run (e.g. uv run warden ping) — see Installation |
ERROR ENGINE_URL is required … | ENGINE_URL unset | Set per Configuration |
GET /v1/health failed: … connection refused | Engine not running | make up; docker compose ps; confirm port 8000 |
GET /v1/health failed: … timed out | Engine overloaded or blocked | make doctor; retry when healthy |
Engine/worker exits: Database schema is not initialized | Migrations did not run | make doctor or docker compose logs migrate — Local stack diagnostics |
Deploy and start
warden deploy validates YAML, worker references, and prompt files on disk. It does not check LLM credentials — missing keys surface at step runtime on Demo: Quickstart.
| What you see | Likely cause | What to do |
|---|---|---|
ERROR file not found: config/… | Wrong -f path | Run from repo root |
… workers that are not registered … | Saga deployed before worker | Register the worker manifest first — see Demo: Mock LLM and MCP |
… prompt is invalid: Prompt file not found … | Missing template or wrong PROMPTS_ROOT on engine | Mount ./config/prompts; see Configuration |
SagaDefinition not found … on start | Definition not deployed, or (namespace, name, version) mismatch | Redeploy; match -n, -v, and --namespace to manifest fields (omit --namespace only when manifest uses default) |
Runtime (outbox and worker)
| What you see | Likely cause | What to do |
|---|---|---|
Saga RUNNING; step stuck IN_PROGRESS | Worker down or orphaned claim | make doctor; Saga recovery after recovery timeouts |
Saga FAILED shortly after start | Worker returned STEP_FAILED | warden list steps --trace-id … — failed rows show FAILED*; add --errors for one-line briefs or warden show step … for full error_details |
Empty list sagas but you started one | Namespace filter mismatch — instances are isolated by namespace; a list query with the wrong filter returns nothing, not a missing saga | Pass the same --namespace used at warden start saga (default default), or pin --trace-id |
For MCP tool failures (GitHub demo, hosted SSE, stdio auth), see MCP and tools. Mock MCP subprocess issues are covered in Demo: Mock LLM and MCP → Operational diagnostics. Reason-step completion errors (no_submit_call, structured_output_failed, etc.) are listed in Saga manifests → Reason step execution and Demo: Quickstart → When a step shows FAILED.
What's next
If demo tables and the sections above did not resolve the issue, continue with Saga recovery for stuck IN_PROGRESS steps and operator retries (warden saga retry-step, warden saga retry-compensation). Cross-check Configuration → Recovery timeouts when claims or outbox rows stay stale after worker restarts.
Related
- Configuration — env reference and recovery timeouts
- Saga recovery — operator retry when steps stay
IN_PROGRESS - Open Core vs Enterprise