Distributed tracing (OpenTelemetry)
Answer "where did this request spend its time?" with one Tempo / Jaeger / Datadog query. Opt-in, failure-safe, off by default.
When a portal request triggers a chain — HTTP → DB → agent subprocess → MCP server call → outbound webhook — the AIFactory web-server emits OpenTelemetry spans that link every hop into a single trace. Operators on call can pull up one trace ID and see the full causal graph, including the agent's own work.
Until you point AIFactory at a collector, the tracing layer is a near-zero-cost no-op: spans are built in memory and dropped at export, with no I/O and no observable overhead. Flip the toggle once a collector is reachable.
When you need this
You want tracing when any of these apply:
- Diagnosing latency: "every dashboard page is slow on Mondays — which dependency is it?"
- Debugging cross-pod issues in multi-replica deployments (an event published on pod A, dispatched on pod B).
- Auditing the agent-task lifecycle: which phase took how long? Which MCP call dominated coding time?
- Operating a managed deployment with a contractual latency SLO.
You don't need it for:
- Laptop installs / dev sessions (the no-export default is the right choice).
- Single-replica production where you already have Prometheus + structured logs.
- Sub-1-rps workloads where one slow request is easier to debug by reading the log line directly.
What's in scope (and what's not)
| Aspect | v1.1 status |
|---|---|
| HTTP request spans (FastAPI auto-instrumentation) | ✅ |
| DB call spans (SQLAlchemy + asyncpg) | ✅ |
| Outbound httpx spans | ✅ |
| Redis pub/sub spans | ✅ |
| Agent phase spans (task:phase:coding, etc.) | ✅ |
| Cross-replica trace continuity (Redis envelope carries traceparent) | ✅ |
| Agent-subprocess trace continuity (TRACEPARENT env var) | ✅ (context only; agent doesn't export) |
| Sampling configuration (ParentBased + ratio) | ✅ |
| Trace-aware logs (request_id = trace_id when in span) | ✅ |
| Per-MCP-server spans inside the agent | ❌ (v1.2 — needs SDK changes) |
| Custom OTel processors (TailSampling, etc.) | ❌ (set via OTel SDK env vars) |
How it works
Browser ─HTTP─▶ web-server pod A ─AsyncPG─▶ Postgres
                    │
                    ├─Redis publish─▶ Redis ─Redis subscribe─▶ web-server pod B
                    │
                    └─subprocess spawn (TRACEPARENT env)─▶ agent process
                                                                │
                                                                └─httpx, MCP calls...
Every box opens its own span; the trace ID flows through every arrow. Tempo / Jaeger / Datadog stitch them into a single waterfall view.
The three propagation boundaries:
- In-process (web-server → DB / httpx / redis): OTel auto-instrumentation handles it.
- Cross-replica (web-server pod A → pod B via Redis pub/sub): AIFactory adds a trace.traceparent field to the Redis envelope; the receiving pod re-attaches the parent context before dispatching the event locally. Backward-compatible: old envelopes without the field still dispatch normally.
- Cross-process (web-server → agent subprocess): AIFactory injects TRACEPARENT into the subprocess environment via make_subprocess_env. The agent's init_agent_tracing() (in apps/backend/core/tracing_bootstrap.py) extracts it on startup and attaches the parent context so the agent's logs and metrics carry the originating trace ID.
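The two AIFactory-specific boundaries both come down to carrying a W3C traceparent string across a hop. The sketch below is a stdlib-only illustration of that idea, not AIFactory's code: the function names, the "event" key, and the nested "trace" / "traceparent" layout are assumptions made for the example (the real helpers are make_subprocess_env and init_agent_tracing()).

```python
import os
import re
from typing import Dict, Mapping, Optional, Tuple

# W3C Trace Context, version 00: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id: int, span_id: int, sampled: bool) -> str:
    """Render a version-00 traceparent value from numeric IDs."""
    flags = 1 if sampled else 0
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(value: Optional[str]) -> Optional[Tuple[int, int, bool]]:
    """Return (trace_id, span_id, sampled), or None if missing or malformed."""
    if not value:
        return None
    match = _TRACEPARENT_RE.match(value.strip().lower())
    if match is None:
        return None
    trace_id = int(match.group(1), 16)
    span_id = int(match.group(2), 16)
    if trace_id == 0 or span_id == 0:  # all-zero IDs are invalid per the spec
        return None
    sampled = bool(int(match.group(3), 16) & 0x01)
    return trace_id, span_id, sampled


def wrap_envelope(event: dict, traceparent: Optional[str]) -> dict:
    """Publisher side (hypothetical layout): attach the context if there is one."""
    envelope = {"event": event}
    if traceparent:
        envelope["trace"] = {"traceparent": traceparent}
    return envelope


def parent_from_envelope(envelope: Mapping) -> Optional[Tuple[int, int, bool]]:
    """Subscriber side: old envelopes without the field simply yield None."""
    trace = envelope.get("trace") or {}
    return parse_traceparent(trace.get("traceparent"))


def subprocess_env_with_trace(
    traceparent: Optional[str], base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Copy the environment and add TRACEPARENT for the child process."""
    env = dict(os.environ if base_env is None else base_env)
    if traceparent:
        env["TRACEPARENT"] = traceparent
    return env
```

A receiver that gets None from parent_from_envelope just dispatches the event without a parent, which is the backward-compatible path described above.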
Turning it on
The otel: block in values.yaml controls everything. Minimal config:
otel:
enabled: true
endpoint: http://tempo.observability.svc:4317
That's it. The web-server picks up the endpoint at lifespan startup, installs the OTLP exporter, and starts emitting spans.
Full configuration
otel:
enabled: true
endpoint: http://tempo.observability.svc:4317
protocol: grpc # or http/protobuf
serviceName: aifactory-prod-eu # default: aifactory-web
samplingRatio: 0.1 # 10% of root spans
headersSecretName: tempo-headers # for vendor auth (Honeycomb / Datadog / NewRelic)
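One plausible way for these values to reach the Python SDK is through the standard OTEL_* environment variables. The mapping below is an assumption for illustration, not the chart's actual template: otel_env_from_values is a hypothetical helper, and the defaults for protocol (grpc) and samplingRatio (1.0) are guesses, since only the serviceName default is stated here.

```python
from typing import Dict, Mapping


def otel_env_from_values(otel: Mapping[str, object]) -> Dict[str, str]:
    """Translate an otel: values block into standard OTel SDK env vars."""
    if not otel.get("enabled"):
        return {}  # disabled: leave the SDK unconfigured
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": str(otel["endpoint"]),
        # Assumed default; the values file only shows grpc and http/protobuf as options.
        "OTEL_EXPORTER_OTLP_PROTOCOL": str(otel.get("protocol", "grpc")),
        "OTEL_SERVICE_NAME": str(otel.get("serviceName", "aifactory-web")),
        # Standard sampler name for ParentBased(TraceIdRatioBased(ratio)).
        "OTEL_TRACES_SAMPLER": "parentbased_traceidratio",
        # Assumed default of 1.0 when samplingRatio is omitted.
        "OTEL_TRACES_SAMPLER_ARG": str(otel.get("samplingRatio", 1.0)),
    }
```

Reading the config this way makes it easy to check a deployment: the same variables can be inspected in the running pod's environment.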
The headersSecretName references a Kubernetes Secret with the key OTEL_EXPORTER_OTLP_HEADERS. Use it for vendor APIs that need a token:
kubectl create secret generic tempo-headers \
--from-literal=OTEL_EXPORTER_OTLP_HEADERS="api-key=hc_xxxxx,team=infra" \
-n aifactory
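The value of OTEL_EXPORTER_OTLP_HEADERS is a comma-separated list of key=value pairs. As a quick sanity check before creating the Secret, a small parser like the following can confirm the string splits the way you expect. This is a simplified sketch (parse_otlp_headers is a hypothetical helper, and the SDK's own parser may be stricter or normalize keys differently).

```python
from typing import Dict
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict, skipping malformed entries."""
    headers: Dict[str, str] = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or not key:
            continue  # no '=' or empty key: ignore rather than fail
        # Values may be percent-encoded, e.g. a space written as %20.
        headers[key] = unquote(value.strip())
    return headers
```

For the example Secret above, parse_otlp_headers("api-key=hc_xxxxx,team=infra") yields two headers, api-key and team.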
Validators
The chart fails helm template with a clear message when:
- otel.enabled=true but otel.endpoint is empty.
- otel.headersSecretName is set but otel.enabled is false (operator typo trap: the Secret would never be consumed).
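The same two rules can be expressed outside Helm, for example to lint a values file in CI before it reaches the chart. The sketch below restates the rules in Python; validate_otel_values and its error messages are hypothetical, and the chart's real checks live in its templates.

```python
from typing import List, Mapping


def validate_otel_values(otel: Mapping[str, object]) -> List[str]:
    """Return a list of human-readable errors; empty means the block is valid."""
    errors: List[str] = []
    enabled = bool(otel.get("enabled", False))
    endpoint = str(otel.get("endpoint") or "").strip()
    headers_secret = str(otel.get("headersSecretName") or "").strip()

    # Rule 1: tracing cannot export without a destination.
    if enabled and not endpoint:
        errors.append("otel.enabled=true requires a non-empty otel.endpoint")
    # Rule 2: a headers Secret with tracing disabled would never be consumed.
    if headers_secret and not enabled:
        errors.append("otel.headersSecretName is set but otel.enabled is false")
    return errors
```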
Failure-safe contract
If the collector goes down mid-request, the OTel SDK's own machinery logs a WARNING and drops spans. The calling code path never crashes. This is a hard contract: AIFactory wraps every integration point in try/except so a broken tracer can't break the app.
Verify it yourself: stop the OTLP collector during a load test, then check that:
- The application's /api/health keeps returning 200.
- Logs grow with otel.exporter WARNING lines (not errors).
- No span data shows up in your backend during the outage.
- Span data resumes immediately once the collector is back.
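The try/except contract described in this section can be pictured as a small decorator around each tracing hook. This is a minimal sketch of the pattern, not AIFactory's implementation: failure_safe, the logger name, and inject_trace_headers are all hypothetical.

```python
import functools
import logging

logger = logging.getLogger("aifactory.tracing")  # hypothetical logger name


def failure_safe(default=None):
    """Decorator: swallow any exception from a tracing hook and return `default`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # WARNING, not ERROR: the request itself is still healthy.
                logger.warning(
                    "tracing hook %s failed; continuing without it",
                    func.__name__,
                    exc_info=True,
                )
                return default

        return wrapper

    return decorator


@failure_safe(default={})
def inject_trace_headers(headers: dict) -> dict:
    """Stand-in for an integration point; here it simulates a broken tracer."""
    raise RuntimeError("tracer not initialised")
```

Calling inject_trace_headers({}) logs one warning and returns the empty default, so the caller proceeds as if tracing were simply off.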
Trace-aware logs
When a request hits FastAPI, the OTel instrumentor opens a root span before AIFactory's middleware runs. AIFactory's CorrelationIdMiddleware then sources its request_id from the active trace ID — so every log line carries the same 32-hex identifier that shows up in the trace backend.
Resolution order:
- Client-provided X-Request-ID (operators correlating across services with their own IDs always win).
- Active OTel trace ID (when a span is active).
- Fresh UUID (fallback when neither exists).
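The resolution order reads naturally as a three-step fallback. The function below is an illustrative sketch rather than the actual CorrelationIdMiddleware: resolve_request_id is a hypothetical name, header keys are assumed to be lower-cased already, and the active trace ID is passed in as an integer (0 or None meaning "no active span").

```python
import uuid
from typing import Mapping, Optional


def resolve_request_id(
    headers: Mapping[str, str], active_trace_id: Optional[int]
) -> str:
    """Pick the request ID: client header, then trace ID, then a fresh UUID."""
    client_id = (headers.get("x-request-id") or "").strip()
    if client_id:
        return client_id  # 1. the caller's own ID always wins
    if active_trace_id:
        # 2. same 32-hex rendering that trace backends display
        return f"{active_trace_id:032x}"
    return str(uuid.uuid4())  # 3. nothing to correlate with: generate one
```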
In Tempo, click any span → "View logs" jumps straight to the matching log lines. The other direction works too: paste a trace ID from a log line into Tempo's search bar to pull up the full waterfall.
Sampling guide
| Workload shape | Recommended samplingRatio |
|---|---|
| Dev / staging | 1.0 (sample everything) |
| Low-traffic production (< 1 rps) | 1.0 |
| Medium production (1–50 rps) | 0.1 (10%) |
| High production (50+ rps) | 0.01 (1%) — adjust per cost budget |
| Performance test "no-export" baseline | 0.0 (constructs spans, exports none) |
The sampler is ParentBased(TraceIdRatioBased(ratio)): children always inherit the parent's decision, so a sampled request stays end-to-end visible across the agent / MCP / Redis hops. Only root spans (those without an inherited parent) go through the ratio decision.
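To make the parent-based behaviour concrete, here is a simplified stdlib-only model of the decision. It is a sketch of the idea, not the SDK's sampler: should_sample and ratio_decision are hypothetical names, and the threshold comparison on the low 64 bits of the trace ID is a common way to implement a deterministic ratio that may differ in detail from the real code.

```python
from typing import Optional

_TRACE_ID_MASK = (1 << 64) - 1  # compare only the low 64 bits of the trace ID


def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio check: the same trace ID always gets the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (_TRACE_ID_MASK + 1))
    return (trace_id & _TRACE_ID_MASK) < bound


def should_sample(
    trace_id: int, ratio: float, parent_sampled: Optional[bool]
) -> bool:
    """ParentBased wrapper: inherit the parent's decision, else apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled  # child span: never re-roll the dice
    return ratio_decision(trace_id, ratio)  # root span: ratio decides
```

Because children inherit the decision, lowering samplingRatio on one service does not cut holes in traces that an upstream service already chose to sample.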
What's not yet supported
- Custom processors (TailSamplingProcessor, attribute filters, etc.): AIFactory doesn't wrap these. Batch-processor and span-limit tuning goes through the OTel SDK's standard env vars (OTEL_BSP_*, OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT, etc.), which the Python SDK picks up automatically.
- Metrics + Logs signals: only Traces in v1.1. The Prometheus metrics + structlog stack already covers those layers; OTel signals will land if/when operators ask for unification.
- Per-MCP-server spans inside the agent: the agent inherits the parent trace context (so the trace ID is consistent), but the agent doesn't itself export spans for individual MCP calls in v1.1. The web-server's outbound httpx calls to MCP servers DO show up.
See also
- Multi-replica deployment: how Redis pub/sub fans out events (the traceparent envelope field is added on top).
- GitHub issue #42: the original design doc.
- Design doc in-repo: docs/plans/2026-05-28-otel-tracing-design.md.