Observability

When a saga runs, state lands in Postgres and spans land in your OTLP collector — if you configure one. Both paths share the same Warden trace_id, so you can move between SQL and traces without guessing which run you are looking at.

Postgres is always available on the dev stack. OpenTelemetry is optional: set OTLP_ENDPOINT when you want distributed traces in Jaeger, Tempo, Honeycomb, or similar. The sections below cover Postgres diagnostics (status and stuck steps), then OTLP setup and how Warden correlates spans to saga rows, and finally execution timing in Postgres and on spans.

Postgres

Postgres is the system of record. Every saga and step instance is a row — not an in-memory event you have to catch in time.

The primary tables:

| Table | What it holds |
| --- | --- |
| saga_instances | One row per saga execution: status, trace_id, start/end times |
| saga_step_instances | One row per step execution: status, outputs, errors |
| outbox_events | Transactional outbox: dispatched work and HITL decisions |

The trace_id is the correlation key: the 32-character hex ID returned by warden start saga. Use warden list sagas --trace-id … for saga status and warden list steps --trace-id … --json for per-step rows (the JSON includes error_details on failures). The HTTP equivalents are GET /v1/sagas?trace_id=… and GET /v1/sagas/steps?trace_id=… (see Start and monitor). In SQL, join saga_step_instances.saga_trace_id to saga_instances.trace_id.
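
As a rough illustration of the lookup flow above, the Python sketch below checks that a value has the 32-character hex shape and builds the two HTTP query URLs. The base URL is a placeholder assumption, not a documented default, and the helper names are hypothetical; only the paths and the trace_id query parameter come from the text above.

```python
import re
from urllib.parse import urlencode

# Warden trace ids are described above as 32-character hex strings.
TRACE_ID_RE = re.compile(r"[0-9a-fA-F]{32}")


def is_trace_id(value: str) -> bool:
    """Return True when value has the 32-char hex shape of a trace_id."""
    return TRACE_ID_RE.fullmatch(value) is not None


def saga_query_urls(base_url: str, trace_id: str) -> dict:
    """Build the saga and step lookup URLs for one trace_id.

    base_url is a hypothetical engine address supplied by the caller.
    """
    if not is_trace_id(trace_id):
        raise ValueError(f"not a 32-character hex trace_id: {trace_id!r}")
    query = urlencode({"trace_id": trace_id})
    root = base_url.rstrip("/")
    return {
        "saga": f"{root}/v1/sagas?{query}",
        "steps": f"{root}/v1/sagas/steps?{query}",
    }


# Placeholder host and id, for illustration only.
urls = saga_query_urls("http://localhost:8000", "0123456789abcdef0123456789abcdef")
print(urls["steps"])
```

Validating the shape first catches the common mistake of pasting a collector Trace ID fragment or a 16-character span_id into a saga lookup.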

For ad-hoc SQL or Adminer on the dev stack, query the tables directly. See Start and monitor.

Stuck or stranded steps

If a step stays IN_PROGRESS with no worker progress, do not patch saga_step_instances directly — updating rows in SQL bypasses the engine FSM and will not enqueue compensation or emit outbox events.

| Symptom | First checks | Recovery path |
| --- | --- | --- |
| IN_PROGRESS, worker crashed mid-command | Worker logs for trace_id; processed_commands idempotency row | Wait for claim reap (WORKER_STALE_CLAIM_SECONDS, default 1800s) to clear stale claims and redeliver the outbox command |
| COMPENSATING with failed undo step | Worker logs for EXECUTE_COMPENSATION | Fix environment; see Saga recovery |

-- Diagnose: steps that look stranded
SELECT saga_trace_id, span_id, step_id, status, started_at, error_details
FROM saga_step_instances
WHERE status = 'IN_PROGRESS'
ORDER BY started_at;

Full runbook: Saga recovery.
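
The diagnostic query returns every IN_PROGRESS row, including steps that are simply still running. A minimal read-only sketch for narrowing the list: it keeps rows whose started_at is older than the claim-reap window. It assumes the rows were fetched as dicts with timezone-aware started_at values; the helper name and the sample step ids are hypothetical. Because started_at is the step start rather than the claim time, a match is a candidate to investigate, not proof that the step is stranded.

```python
from datetime import datetime, timedelta, timezone

# Default WORKER_STALE_CLAIM_SECONDS quoted in the table above.
DEFAULT_STALE_CLAIM_SECONDS = 1800


def past_reap_window(rows, now, stale_seconds=DEFAULT_STALE_CLAIM_SECONDS):
    """Return IN_PROGRESS rows whose started_at is older than the window.

    rows: iterable of dicts shaped like the diagnostic SELECT (assumed shape).
    now:  timezone-aware datetime used as the comparison point.
    """
    cutoff = now - timedelta(seconds=stale_seconds)
    return [
        row for row in rows
        if row["status"] == "IN_PROGRESS" and row["started_at"] < cutoff
    ]


# Made-up rows for illustration.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"step_id": "charge", "status": "IN_PROGRESS",
     "started_at": now - timedelta(minutes=45)},
    {"step_id": "notify", "status": "IN_PROGRESS",
     "started_at": now - timedelta(minutes=5)},
]
print([row["step_id"] for row in past_reap_window(rows, now)])  # ['charge']
```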

OpenTelemetry

Engine and worker export spans when OTLP is configured at startup. The default exporter uses OTLP gRPC (port 4317).

OTLP endpoint setup

| Variable | Notes |
| --- | --- |
| OTLP_ENDPOINT | Preferred Warden setting (.env or process env) |
| OTEL_EXPORTER_OTLP_ENDPOINT | Fallback when OTLP_ENDPOINT is unset |
| OTLP_INSECURE | Default true; set false when the collector expects TLS |

Example for a collector reachable from the host:

OTLP_ENDPOINT=http://127.0.0.1:4317
OTLP_INSECURE=true

Point OTLP_ENDPOINT at your collector's gRPC OTLP URL (Grafana Tempo, Honeycomb, Jaeger, etc.).
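
The precedence in the table can be sketched as a small resolver over an environment mapping. This illustrates the documented order only and is not Warden's actual startup code; the function name is hypothetical, and the set of strings treated as false is an assumption of the sketch.

```python
def resolve_otlp_settings(env):
    """Resolve (endpoint, insecure) from an environment mapping.

    Precedence follows the table above: OTLP_ENDPOINT first, then
    OTEL_EXPORTER_OTLP_ENDPOINT. endpoint is None when neither is set.
    The accepted boolean spellings are an assumption of this sketch.
    """
    endpoint = env.get("OTLP_ENDPOINT") or env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    raw = env.get("OTLP_INSECURE", "true").strip().lower()
    insecure = raw not in ("false", "0", "no")
    return endpoint, insecure


# The example configuration above, expressed as a mapping.
print(resolve_otlp_settings({
    "OTLP_ENDPOINT": "http://127.0.0.1:4317",
    "OTLP_INSECURE": "true",
}))
```

Passing a mapping instead of reading os.environ directly keeps the function easy to check; call it with os.environ in a real process.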

TLS and other OTLP options

Warden passes endpoint and insecure into the OTLP gRPC exporter. It does not wrap headers, certificates, or timeouts in its own settings. For those, set standard OpenTelemetry environment variables on the engine or worker process — no Warden code changes required. The SDK reads them when Warden leaves the corresponding exporter arguments unset.

| Concern | Variable(s) | Loaded by |
| --- | --- | --- |
| Plaintext vs TLS | OTLP_INSECURE (default true) | Warden (always applied; do not rely on OTEL_EXPORTER_OTLP_INSECURE alone) |
| Collector URL | OTLP_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT | Warden or SDK (see table above) |
| API key / auth headers | OTEL_EXPORTER_OTLP_HEADERS or OTEL_EXPORTER_OTLP_TRACES_HEADERS | OpenTelemetry SDK |
| Custom CA / mTLS | OTEL_EXPORTER_OTLP_TRACES_CERTIFICATE, …_CLIENT_KEY, …_CLIENT_CERTIFICATE | OpenTelemetry SDK (when OTLP_INSECURE=false) |
| Timeout, compression | OTEL_EXPORTER_OTLP_TRACES_TIMEOUT, …_COMPRESSION, etc. | OpenTelemetry SDK |

For basic TLS with system-trusted CAs, set OTLP_INSECURE=false and use an https://… endpoint (or your vendor's gRPC TLS URL). An https:// scheme also forces a secure channel per the OTel spec.
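
The interaction between the endpoint scheme and the insecure flag can be stated as a two-line rule. The sketch below encodes only what this page says (https:// wins, otherwise the flag decides); it is not the exporter's source code, and the function name and hostnames are hypothetical.

```python
from urllib.parse import urlparse


def channel_is_secure(endpoint: str, insecure: bool) -> bool:
    """Return True when the rule above selects a TLS channel.

    An https:// scheme wins over the insecure flag; otherwise the flag decides.
    """
    if urlparse(endpoint).scheme == "https":
        return True
    return not insecure


print(channel_is_secure("http://127.0.0.1:4317", insecure=True))
print(channel_is_secure("https://collector.example.com:4317", insecure=True))
```

The second call shows why leaving OTLP_INSECURE=true next to an https:// endpoint still yields TLS; setting both consistently avoids surprises.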

Vendor auth without TLS changes

Many hosted collectors accept OTEL_EXPORTER_OTLP_HEADERS (or the …_TRACES_HEADERS variant) for API keys while leaving OTLP_INSECURE=true. Follow your vendor's OTLP gRPC docs. This pass-through is not integration-tested in the Warden repo — verify export in your collector UI.

Service names in traces: engine-node, worker-node.

Warden correlation on spans

Postgres owns workflow identity. At saga start the engine assigns saga_instances.trace_id (32-char hex) and each saga_step_instances.span_id (16-char hex) — UUID values, not copied from OpenTelemetry. CLI commands and SQL use those ids.

OpenTelemetry builds a separate trace graph (engine ingest, worker execution, nested operations). Each span gets OTel's own Trace ID and Span ID in the collector UI.

Warden bridges the two by setting span attributes on telemetry spans:

| Attribute | Value | Same as Jaeger "Trace ID" / "Span ID"? |
| --- | --- | --- |
| saga.id | Postgres saga_instances.trace_id | No; different field, filter on this attribute |
| saga.step_span_id | Postgres saga_step_instances.span_id | No; one step row can have many child spans |

Outbox messages also carry trace_context so engine and worker spans link in the trace UI. That propagation is OTel's graph; saga.id is still the id returned by warden start saga.

To correlate: copy trace_id from warden start saga (or warden list sagas), filter traces on attribute saga.id, then use warden list steps --trace-id … or query saga_step_instances by saga_trace_id. Do not assume the collector's top-level Trace ID column is your saga id.
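
To make the distinction concrete, the sketch below filters a list of exported spans by the saga.id attribute and groups them by saga.step_span_id. The span dict shape (name, trace_id, attributes) is a hypothetical export format chosen for illustration, and all ids are made up; adapt the field names to whatever your collector returns.

```python
from collections import defaultdict


def spans_by_step(spans, saga_trace_id):
    """Group span names by saga.step_span_id for one Warden saga.

    Matching uses the saga.id attribute, never the collector's own
    trace_id field. Spans without a step id are grouped under None.
    """
    grouped = defaultdict(list)
    for span in spans:
        attrs = span.get("attributes", {})
        if attrs.get("saga.id") != saga_trace_id:
            continue
        grouped[attrs.get("saga.step_span_id")].append(span["name"])
    return dict(grouped)


saga_id = "0123456789abcdef0123456789abcdef"  # from warden start saga
spans = [
    # The collector's trace_id differs from saga.id on purpose.
    {"name": "trigger_step", "trace_id": "f" * 32,
     "attributes": {"saga.id": saga_id, "saga.step_span_id": "a1b2c3d4e5f60718"}},
    {"name": "handle_worker_command", "trace_id": "f" * 32,
     "attributes": {"saga.id": saga_id, "saga.step_span_id": "a1b2c3d4e5f60718"}},
    {"name": "unrelated", "trace_id": saga_id, "attributes": {}},
]
print(spans_by_step(spans, saga_id))
```

The third span carries the saga id in the collector's trace_id field but has no saga.id attribute, so it is correctly excluded.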

Execution timing

Forward steps and compensation undo rows persist structured millisecond buckets in saga_step_instances.execution_timing (JSONB). Timing is never injected into user output_payload / saga context data.

Worker result events (STEP_COMPLETED, STEP_FAILED, STEP_COMPENSATED, COMPENSATION_FAILED) carry top-level timing.worker on the outbox wire. The engine merges engine-side buckets at ingest.

Bucket reference

| Bucket | Process | Forward step | Compensation undo row |
| --- | --- | --- | --- |
| hydration_ms | Worker | Command hydrate | Same |
| setup_ms | Worker | Adapter + MCP bootstrap | Same |
| llm_ms | Worker | react: ReAct turns; simple: single structured LLM call | 0 for single-tool undo; ReAct sum for multi-tool |
| tool_ms | Worker | MCP invokes | Single ainvoke or ReAct tool sum |
| when_cel_ms | Engine | when.cel evaluation at schedule | n/a (FSM-driven) |
| schedule_ms | Engine | trigger_step / trigger_compensation | Child row create + command emit |
| policy_ms | Engine | Before-commit + after-reason gates | 0 today |
| dispatch_to_ingest_ms | Engine | End-to-end step wall-clock from worker dispatch to result ingest | Same |

dispatch_to_ingest_ms is the headline step duration — wall-clock from when the engine commits the worker command to the outbox until the engine begins ingesting the worker's result. Worker buckets (llm_ms, tool_ms, …) describe work inside that window; they overlap this value, so do not sum them for total latency.
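
One way to read a timing document without double counting is to take dispatch_to_ingest_ms as the total and express each worker bucket as a share of it. The sketch assumes a {"worker": {...}, "engine": {...}} layout, inferred from the timing.worker wire field rather than from a documented schema, and the numbers are made up.

```python
def timing_summary(execution_timing):
    """Report the headline duration and each worker bucket's share of it.

    Assumes a {"worker": {...}, "engine": {...}} layout (an inference,
    not a documented schema). Worker buckets are never added to the total.
    """
    total = execution_timing.get("engine", {}).get("dispatch_to_ingest_ms")
    worker = execution_timing.get("worker", {})
    shares = {}
    if total:  # skip shares when the headline value is missing or zero
        shares = {name: round(ms / total, 3) for name, ms in worker.items()}
    return {"total_ms": total, "worker_share": shares}


# Illustrative values only.
example = {
    "worker": {"hydration_ms": 20, "setup_ms": 80, "llm_ms": 1200, "tool_ms": 500},
    "engine": {"schedule_ms": 5, "dispatch_to_ingest_ms": 2000},
}
print(timing_summary(example))
```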

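Every bucket is a monotonic time.perf_counter() segment measured inside a single process. Warden does no cross-process wall-clock math between worker and engine hosts, so clock skew between machines does not affect the values.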
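
For readers unfamiliar with the pattern, a per-process bucket recorder built on time.perf_counter() might look like the sketch below. The class and method names are hypothetical and this is not Warden's implementation; it only shows how monotonic segments accumulate into named millisecond buckets.

```python
import time
from contextlib import contextmanager


class TimingBuckets:
    """Accumulate elapsed milliseconds per named bucket within one process."""

    def __init__(self):
        self.ms = {}

    @contextmanager
    def measure(self, bucket):
        start = time.perf_counter()  # monotonic, so clock adjustments do not matter
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.ms[bucket] = self.ms.get(bucket, 0.0) + elapsed_ms


timing = TimingBuckets()
with timing.measure("tool_ms"):
    time.sleep(0.01)  # stand-in for a tool call
with timing.measure("tool_ms"):
    time.sleep(0.01)  # a second segment adds to the same bucket
print(sorted(timing.ms))
```

Because perf_counter() values are only meaningful within one process, each side records its own segments and ships the resulting durations, never raw timestamps.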

Inspect timing

HTTP: GET /v1/sagas/steps?trace_id=… returns timing on each row. Forward and undo rows both appear; an undo row points at the step it compensates through compensates_span_id.

-- Forward steps
SELECT span_id, step_id, status, execution_timing
FROM saga_step_instances
WHERE saga_trace_id = :trace_id
AND compensates_span_id IS NULL;

-- Compensation undo rows
SELECT span_id, compensates_span_id, status, execution_timing
FROM saga_step_instances
WHERE saga_trace_id = :trace_id
AND compensates_span_id IS NOT NULL;

See Compensation for single-tool vs ReAct undo timing expectations.
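
If you fetch both result sets in one pass, the forward and undo rows can be paired client-side through compensates_span_id. The sketch assumes rows arrive as a list of dicts keyed by the column names in the queries above; the function name and sample values are hypothetical, and execution_timing is treated as an opaque value.

```python
def pair_undo_rows(rows):
    """Map each forward span_id to its timing and any compensation timings.

    rows: list of dicts with span_id, compensates_span_id and
    execution_timing keys (assumed shape). Undo rows whose target is
    not in the list are ignored.
    """
    paired = {}
    for row in rows:
        if row.get("compensates_span_id") is None:
            paired[row["span_id"]] = {
                "forward": row.get("execution_timing"),
                "undo": [],
            }
    for row in rows:
        target = row.get("compensates_span_id")
        if target is not None and target in paired:
            paired[target]["undo"].append(row.get("execution_timing"))
    return paired


# Made-up rows for illustration.
rows = [
    {"span_id": "aaaaaaaaaaaaaaaa", "compensates_span_id": None,
     "execution_timing": {"tag": "forward"}},
    {"span_id": "bbbbbbbbbbbbbbbb", "compensates_span_id": "aaaaaaaaaaaaaaaa",
     "execution_timing": {"tag": "undo"}},
]
print(pair_undo_rows(rows))
```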

OpenTelemetry span attributes

When OTLP_ENDPOINT (or OTEL_EXPORTER_OTLP_ENDPOINT) is configured, the same timing buckets are mirrored as span attributes on existing handler spans. Postgres execution_timing remains authoritative; attributes are for Jaeger/Tempo drill-down.

| Attribute prefix | Example spans |
| --- | --- |
| timing.worker.* | Worker handle_worker_command (DO_STEP, DO_COMMIT, EXECUTE_COMPENSATION) |
| timing.engine.schedule_ms, policy_ms, when_cel_ms | Engine trigger_step / trigger_compensation |
| timing.engine.dispatch_to_ingest_ms | Engine ingest handlers (handle_step_completed, failures, compensation ingest) |
| timing.engine.policy_ms (after-reason) | Engine handle_step_completed only |

Worker timing appears on the worker span, not the ingest span. Filter Jaeger on saga.id=<trace_id> and inspect DO_STEP vs handle_step_completed separately.

Every span that records timing attributes is also tagged with saga.id — including trigger_step (where when_cel_ms and before_commit policy_ms land) and handle_step_completed (where after_reason policy_ms and dispatch_to_ingest_ms land). Warden sets those attributes from the bound saga row or ingest payload, not from OpenTelemetry's wire Trace ID, so one Jaeger filter on saga.id surfaces scheduling, policy, and worker spans for the same run. For what those buckets mean in manifest terms, see Conditional branching (when.cel) and Policies.
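
The attribute naming can be pictured as a flattening of the timing document into dotted keys plus the saga.id tag. The sketch below shows the naming only: it assumes a {"worker": {...}, "engine": {...}} input layout (an inference, not a documented schema), uses a hypothetical function name, and does not reproduce how Warden splits the attributes across worker and engine spans.

```python
def timing_attributes(saga_trace_id, execution_timing):
    """Flatten timing buckets into timing.<process>.<bucket> attribute keys.

    The result also carries saga.id so a single filter finds the span.
    """
    attrs = {"saga.id": saga_trace_id}
    for process in ("worker", "engine"):
        for bucket, value in execution_timing.get(process, {}).items():
            attrs[f"timing.{process}.{bucket}"] = value
    return attrs


# Made-up id and values for illustration.
print(timing_attributes(
    "0123456789abcdef0123456789abcdef",
    {"worker": {"llm_ms": 1200}, "engine": {"dispatch_to_ingest_ms": 2000}},
))
```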

Correlating MCP tool logs

saga.id and timing attributes appear on Warden engine and worker spans. They are not automatically injected into MCP subprocess environments or external tool HTTP headers. To correlate a third-party tool failure back to a saga run, filter Jaeger on saga.id=<trace_id> and inspect the parent worker span, check worker logs for the command's saga_trace_id, or pass trace_id explicitly through a with binding into tool arguments when your integration needs it.

For simple reason steps, execution time is recorded in a single llm_ms bucket (no tool loop). With the enterprise plugin enabled, reasoning audit captures the structured chat transcript shape ([system, human, assistant]) — for simple, the assistant message is the validated JSON string. See Open Core vs Enterprise.

OpenTelemetry exports these attributes when the span ends (batch flush from the BatchSpanProcessor), so you will not see partial bucket updates mid-handler. Background maintenance paths — outbox reap and operator recovery — repair state only and do not emit timing attributes.

What's next

Next up: CLI overview — start sagas, approve HITL holds, and retry stuck compensation from the terminal.