feat(relay): add OpenTelemetry tracing, keep Prometheus metrics #1398
wpfleger96 wants to merge 4 commits into
…eus export
Replace metrics-rs/metrics-exporter-prometheus with OpenTelemetry native
instruments backed by both a Prometheus text endpoint (:9102) and an OTLP
gRPC exporter. Add distributed tracing via tracing-opentelemetry. Add DB
and Redis pool metrics.
## What changed
### Metrics
- Rewrote metrics.rs as an OTEL setup module: SdkMeterProvider with a
PrometheusExporter (pull-based, same /metrics endpoint) and an optional
PeriodicReader+OTLP exporter gated on OTEL_EXPORTER_OTLP_ENDPOINT.
- Migrated all 41 metrics::counter!/histogram!/gauge! call sites across
connection.rs, subscription.rs, state.rs, handlers/, and api/ to the
pre-built Metrics struct (OnceLock, zero per-call-site allocation).
- Preserved every metric name, type, label set, and histogram bucket from
the prior implementation so existing Prometheus scrapers (including the
Datadog Agent openmetrics annotation) need no changes.
- Instruments lazy-init to OTEL noop meter when install() hasn't been
called, matching prior metrics-rs behaviour in unit tests.
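
The pre-built handle pattern described above can be sketched with the standard library alone. This is only an illustration: the real module holds OpenTelemetry instruments, while this sketch substitutes plain atomics, and the names `Metrics`, `metrics()`, `events_received`, and `record_event_received` are assumptions rather than the PR's identifiers.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the pre-built handle struct. Plain atomics are
/// used in place of OpenTelemetry instruments so the sketch needs only std.
pub struct Metrics {
    pub events_received: AtomicU64,
    pub ws_connections: AtomicU64,
}

static METRICS: OnceLock<Metrics> = OnceLock::new();

/// Returns the process-wide handles, building them on first use.
/// Later calls only read the already-initialised value.
pub fn metrics() -> &'static Metrics {
    METRICS.get_or_init(|| Metrics {
        events_received: AtomicU64::new(0),
        ws_connections: AtomicU64::new(0),
    })
}

/// Example call site: no name lookup and no allocation per call.
/// Returns the counter value after the increment.
pub fn record_event_received() -> u64 {
    metrics().events_received.fetch_add(1, Ordering::Relaxed) + 1
}
```

Because `get_or_init` runs the initialiser at most once, call sites that run before any explicit setup still get a valid handle, which is the property the lazy-init bullet relies on.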
### Tracing
- Added telemetry.rs: try_init_tracer() initialises an OTLP gRPC span
exporter + SdkTracerProvider when OTEL_EXPORTER_OTLP_ENDPOINT is set;
returns None (zero overhead) when unset.
- Wired OpenTelemetryLayer into the tracing_subscriber stack in main.rs
alongside the existing JSON fmt layer (stdout logs unchanged).
- Added #[instrument] spans on hot paths: handle_event, fan_out_pubsub_event,
handle_auth, SubRegistry::fan_out_scoped.
### DB and Redis pool metrics
- Added Db::pool_stats() -> DbPoolStats in buzz-db (exposes sqlx pool
size and num_idle).
- Added background task in main.rs polling pool stats every 10 s
(configurable via BUZZ_POOL_METRICS_INTERVAL_SECS) and emitting
buzz_db_pool_{size,idle,active} and buzz_redis_pool_{available,size,
max,waiting} gauges.
### Graceful shutdown
- SdkMeterProvider and optional SdkTracerProvider flushed on SIGTERM
drain.
## Environment variables (all optional)
- OTEL_EXPORTER_OTLP_ENDPOINT — unset disables OTEL entirely
- OTEL_SERVICE_NAME — defaults to buzz-relay
- OTEL_RESOURCE_ATTRIBUTES — extra resource attributes
- OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG — sampling strategy
- BUZZ_POOL_METRICS_INTERVAL_SECS — pool poll interval (default 10)
Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…entity, metrics port bind

## Fixes

### IMPORTANT 1: Prometheus bind failure now fails startup
Bind the metrics TcpListener synchronously in install() before
tokio::spawn. serve_prometheus() now accepts a pre-bound TcpListener
instead of a port number. A port conflict panics at startup (matching
prior behaviour) rather than silently dropping the metrics endpoint from
a detached task.

### IMPORTANT 2: OTLP service.name defaults to buzz-relay
Build a single shared Resource via service_resource() in telemetry.rs
using ResourceBuilder::with_service_name("buzz-relay") followed by
with_detector(EnvResourceDetector) so that OTEL_SERVICE_NAME and
OTEL_RESOURCE_ATTRIBUTES still win when set. Both SdkTracerProvider and
SdkMeterProvider receive the same Resource so traces and metrics
correlate under the same service identity in Datadog.

### IMPORTANT 3: Span topology: WS flow now produces one connected trace
Create explicit parent spans in handle_text_message() for EVENT
(ws.event), REQ (ws.req), COUNT (ws.count), and AUTH (ws.auth) messages,
each carrying conn_id. Spawned handler futures are wrapped with
.instrument(span) so the tracing context is not dropped at the
tokio::spawn boundary. handle_event() and handle_auth() now call
Span::current().record() to populate the event_id and kind/conn_id
fields declared in their #[instrument] attributes.

### NIT 4: target_info series suppressed for byte-parity
Add .without_target_info() to the Prometheus exporter builder so the new
Resource (non-empty after fix 2) does not inject a target_info series
that the old metrics-rs endpoint never emitted.

### NIT 5: BUZZ_POOL_METRICS_INTERVAL_SECS=0 no longer panics
Clamp interval_secs to >= 1. tokio::time::interval(Duration::ZERO)
panics; a config typo of 0 would have silently killed the pool metrics
task.

### CI: cargo fmt drift
Run cargo fmt --all to fix rustfmt line-wrapping across the migrated
crate::metrics::metrics().<handle>.add(...) call sites.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
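
The interval clamp from NIT 5 can be sketched as follows. This is a minimal illustration, not the code in main.rs: the function name `pool_metrics_interval` is hypothetical, and falling back to the default on an unparsable value is an assumption of this sketch.

```rust
use std::time::Duration;

/// Default poll interval in seconds, per the PR description.
const DEFAULT_INTERVAL_SECS: u64 = 10;

/// Turns the raw value of BUZZ_POOL_METRICS_INTERVAL_SECS into a poll period.
/// Missing or unparsable values fall back to the default (an assumption of
/// this sketch); parsed values are clamped to at least one second, because
/// tokio::time::interval panics on a zero period.
pub fn pool_metrics_interval(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_INTERVAL_SECS)
        .max(1);
    Duration::from_secs(secs)
}
```

A caller would typically pass `std::env::var("BUZZ_POOL_METRICS_INTERVAL_SECS").ok().as_deref()` and hand the result to the timer.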
EnvResourceDetector reads OTEL_RESOURCE_ATTRIBUTES only, not
OTEL_SERVICE_NAME (opentelemetry_sdk 0.32.1 resource/env.rs:23).
SdkProvidedResourceDetector does read OTEL_SERVICE_NAME but always emits
a service.name key, falling back to unknown_service:<exe> when unset,
which would clobber the buzz-relay default.

Read OTEL_SERVICE_NAME explicitly: a non-empty value wins over the
buzz-relay fallback. A service.name set in OTEL_RESOURCE_ATTRIBUTES
(via EnvResourceDetector, overlaid last) still wins over
OTEL_SERVICE_NAME. Note that the OpenTelemetry specification defines the
opposite precedence: OTEL_SERVICE_NAME takes precedence over a
service.name given in OTEL_RESOURCE_ATTRIBUTES.

Correct the module and function doc comments that claimed the SDK
detector handled OTEL_SERVICE_NAME automatically.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
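
The resolution order this commit describes can be sketched with plain string handling. This is an illustration under stated assumptions, not the code in telemetry.rs: `resolve_service_name` is a hypothetical helper, and the attribute parsing is simplified (comma-separated `key=value` pairs, no percent-decoding).

```rust
/// Resolves the effective `service.name` in the order the commit describes:
/// 1. the hard-coded `buzz-relay` fallback,
/// 2. a non-empty OTEL_SERVICE_NAME,
/// 3. a `service.name` entry in OTEL_RESOURCE_ATTRIBUTES, overlaid last.
pub fn resolve_service_name(
    service_name_env: Option<&str>,
    resource_attrs_env: Option<&str>,
) -> String {
    let mut name = String::from("buzz-relay");

    // An empty or whitespace-only OTEL_SERVICE_NAME is treated as unset.
    if let Some(v) = service_name_env.map(str::trim).filter(|v| !v.is_empty()) {
        name = v.to_string();
    }

    // Simplified parse of "k1=v1,k2=v2"; entries without '=' are skipped.
    if let Some(attrs) = resource_attrs_env {
        for pair in attrs.split(',') {
            if let Some((key, value)) = pair.split_once('=') {
                if key.trim() == "service.name" && !value.trim().is_empty() {
                    name = value.trim().to_string();
                }
            }
        }
    }
    name
}
```

Taking the two values as parameters rather than reading the environment inside the function keeps the precedence logic testable without mutating process state.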
Exported metrics reference

All metric names, types, and label sets are preserved verbatim from the prior implementation.

Naming note: the two HTTP framework metrics (http_requests_total and http_request_latency_ms) are intentionally unprefixed.

HTTP framework
| Metric | Type | Labels | Description |
|---|---|---|---|
| `http_requests_total` | Counter | `code`, `caller`, `action` | HTTP requests served. `caller` comes from the Istio `x-envoy-downstream-service-cluster` header (validated, `unknown` fallback); `action` is the matched route pattern. Health/metrics/unmatched paths are skipped to bound cardinality. |
| `http_request_latency_ms` | Histogram | `code`, `caller`, `action` | Request latency in ms. Explicit buckets: 5/10/25/50/100/250/500/1000/2500/5000/10000. |

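
To make the explicit bucket list concrete, here is a small sketch of how observations map onto those bounds. It uses only the bucket values listed above; the function names are hypothetical and this is not the exporter's implementation.

```rust
/// Explicit upper bounds (ms) for http_request_latency_ms, as listed above.
pub const LATENCY_BUCKETS_MS: [f64; 11] = [
    5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 2500.0, 5000.0, 10000.0,
];

/// Returns the smallest bucket upper bound that contains `value_ms`,
/// or `None` when the value only falls into the implicit `+Inf` bucket.
pub fn first_bucket_le(value_ms: f64) -> Option<f64> {
    LATENCY_BUCKETS_MS.iter().copied().find(|le| value_ms <= *le)
}

/// Cumulative counts per bucket, the way Prometheus exposes histograms:
/// each `le` bucket counts every observation less than or equal to its bound.
pub fn cumulative_counts(observations: &[f64]) -> Vec<u64> {
    LATENCY_BUCKETS_MS
        .iter()
        .map(|le| observations.iter().filter(|v| **v <= *le).count() as u64)
        .collect()
}
```

Because the counts are cumulative, changing any bound changes every series above it, which is why keeping the bucket boundaries identical matters for existing dashboards.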
WebSocket connections
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_ws_connections_total` | Counter | - | WebSocket connections accepted. |
| `buzz_ws_connections_active` | UpDownCounter | - | Currently-open WebSocket connections (incremented on register, decremented on close). |
| `buzz_ws_backpressure_disconnects_total` | Counter | - | Connections dropped because the client could not keep up with the send queue. |
| `buzz_ws_auth_timeouts_total` | Counter | - | Connections closed for failing to authenticate within the auth window. |

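
The increment-on-register, decrement-on-close behaviour of `buzz_ws_connections_active` can be sketched with an RAII guard. This is one possible shape, not the relay's code: `ConnectionGuard` and the static `WS_CONNECTIONS_ACTIVE` are hypothetical, and an atomic integer stands in for the real instrument.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Stand-in for the buzz_ws_connections_active up-down counter.
pub static WS_CONNECTIONS_ACTIVE: AtomicI64 = AtomicI64::new(0);

/// Hypothetical RAII guard: the count goes up when a connection registers
/// and comes back down when the guard is dropped, on every exit path.
pub struct ConnectionGuard {
    _private: (),
}

impl ConnectionGuard {
    /// Registers a connection and returns the guard that tracks it.
    pub fn register() -> Self {
        WS_CONNECTIONS_ACTIVE.fetch_add(1, Ordering::Relaxed);
        ConnectionGuard { _private: () }
    }
}

impl Drop for ConnectionGuard {
    fn drop(&mut self) {
        WS_CONNECTIONS_ACTIVE.fetch_sub(1, Ordering::Relaxed);
    }
}
```

Tying the decrement to `Drop` means an early return or a panic in the connection task cannot leave the gauge permanently inflated.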
Subscriptions
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_subscriptions_active` | UpDownCounter | - | Currently-active REQ subscriptions across all connections. |

Event ingest
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_events_received_total` | Counter | `kind` | Events received over WS. `kind` is bounded to a known allow-list (else `other`) to prevent cardinality explosion. |
| `buzz_events_stored_total` | Counter | `kind` | Events successfully persisted. |
| `buzz_events_rejected_total` | Counter | `reason` | Events rejected, labeled by reason. |
| `buzz_event_processing_seconds` | Histogram | - | End-to-end event processing time. Buckets (s): 0.001/0.005/0.01/0.025/0.05/0.1/0.25/0.5/1/5. |

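
The `kind` bounding described for `buzz_events_received_total` might look like the sketch below. The allow-list contents are not given in this PR, so `KNOWN_KINDS` and `kind_label` are purely hypothetical; only the "else `other`" fallback comes from the table.

```rust
/// Hypothetical allow-list; the actual set of kinds is not given in the PR.
const KNOWN_KINDS: [u32; 4] = [0, 1, 7, 22242];

/// Maps an event kind to its metric label. Kinds outside the allow-list
/// collapse into "other", so the label can take at most
/// KNOWN_KINDS.len() + 1 distinct values.
pub fn kind_label(kind: u32) -> String {
    if KNOWN_KINDS.contains(&kind) {
        kind.to_string()
    } else {
        "other".to_string()
    }
}
```

Without the fallback, a client could create one time series per arbitrary kind number it sends.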
Fan-out / multi-node
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_fanout_recipients` | Histogram | - | Number of recipients per fanned-out event. Integer-count buckets: 0/1/5/10/25/50/100/500/1000. |
| `buzz_multinode_fanout_total` | Counter | - | Cross-pod fan-out operations published to the pub/sub bus. |
| `buzz_multinode_fanout_lag_total` | Counter | - | Messages dropped because a pod's multi-node fan-out consumer lagged the broadcast channel. |
| `buzz_cache_invalidation_lag_total` | Counter | - | Cache-invalidation messages dropped because a pod's consumer lagged. |

Auth
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_auth_attempts_total` | Counter | `method` | NIP-42 auth attempts (`method=nip42`). |
| `buzz_auth_failures_total` | Counter | `reason` | Auth failures by reason (`allowlist_denied`, `not_relay_member`, `nip42_invalid`). |

Media uploads
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_media_uploads_total` | Counter | `mime` | Successful media uploads, labeled by MIME type. |
| `buzz_media_upload_rejections_total` | Counter | `reason` | Upload rejections (`rate_limit`, `concurrency`). |

Workflows
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_workflow_runs_total` | Counter | `trigger` | Workflow runs, labeled by trigger kind. |

Audit log
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_audit_log_seconds` | Histogram | - | Audit-log write latency. Buckets (s): same `DURATION_BUCKETS_S` as event processing. |
| `buzz_audit_log_errors_total` | Counter | - | Audit-log write failures. |
| `buzz_audit_send_errors_total` | Counter | - | Failures sending audit entries downstream. |

Caches
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_membership_cache_hits_total` | Counter | - | Membership-cache hits. |
| `buzz_membership_cache_misses_total` | Counter | - | Membership-cache misses. |
| `buzz_accessible_channels_cache_hits_total` | Counter | - | Accessible-channels-cache hits. |
| `buzz_accessible_channels_cache_misses_total` | Counter | - | Accessible-channels-cache misses. |

COUNT fallback
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_count_fallback_rejections_total` | Counter | - | COUNT queries rejected for requiring a too-broad fallback scan. |

Connection-pool gauges (periodic, every BUZZ_POOL_METRICS_INTERVAL_SECS, default 10s)
| Metric | Type | Labels | Description |
|---|---|---|---|
| `buzz_db_pool_size` | Gauge | - | Total Postgres pool connections. |
| `buzz_db_pool_idle` | Gauge | - | Idle Postgres pool connections. |
| `buzz_db_pool_active` | Gauge | - | In-use Postgres pool connections (size - idle). |
| `buzz_redis_pool_size` | Gauge | - | Current Redis pool connections. |
| `buzz_redis_pool_available` | Gauge | - | Available (idle) Redis pool connections. |
| `buzz_redis_pool_max` | Gauge | - | Configured Redis pool max. |
| `buzz_redis_pool_waiting` | Gauge | - | Callers waiting on a Redis connection. |

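
The three Postgres gauges derive from two pool readings, with `active` computed as size minus idle. A minimal sketch follows; the field layout of `DbPoolStats` and the `gauges` helper are assumptions for illustration, since the PR only states that the struct exposes the pool size and idle count.

```rust
/// Hypothetical shape of the snapshot returned by `Db::pool_stats()`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DbPoolStats {
    pub size: u32,
    pub idle: u32,
}

impl DbPoolStats {
    /// In-use connections (size - idle). The two readings may be taken at
    /// slightly different moments, so the subtraction saturates at zero
    /// instead of underflowing.
    pub fn active(&self) -> u32 {
        self.size.saturating_sub(self.idle)
    }

    /// (metric name, value) pairs for the three Postgres gauges.
    pub fn gauges(&self) -> [(&'static str, f64); 3] {
        [
            ("buzz_db_pool_size", f64::from(self.size)),
            ("buzz_db_pool_idle", f64::from(self.idle)),
            ("buzz_db_pool_active", f64::from(self.active())),
        ]
    }
}
```

A polling task would take one snapshot per tick and set each gauge from the returned pairs.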
Two buzz_search_index_* handles (_seconds histogram, _errors_total counter) are declared but currently have no emitters — the search path moved off the Typesense backend, so they register as zero-value series. Left in place to avoid touching unrelated code in this migration PR; can be pruned in a follow-up.
I did the cross-check Tyler asked for. Confirmation / recommendation: I would keep Prometheus/OpenMetrics scraped by the Datadog Agent as the primary production metrics path, and treat the OTEL/OTLP path as opt-in / less paved for now. What I found:
Net: Will’s stated refactor direction — Prometheus/OpenMetrics for relay health/activity metrics, Datadog Agent scrape from pods, OTEL only where explicitly needed (e.g. traces behind |
…e, keep OTLP tracing

Drop the OpenTelemetry metrics exporter (opentelemetry-prometheus,
prometheus crate, OTLP metric push) in favour of the original metrics-rs
facade (metrics::counter!/gauge!/histogram! macros) backed by
metrics-exporter-prometheus. OTLP tracing (telemetry.rs,
tracing-opentelemetry, OTEL trace spans in connection/event/auth
handlers) is intentionally preserved; only the metrics path is reverted.

Changes:
- Cargo.toml: restore metrics + metrics-exporter-prometheus workspace
  deps; strip metrics/logs features from opentelemetry,
  opentelemetry_sdk, opentelemetry-otlp, tracing-opentelemetry; remove
  opentelemetry-prometheus and prometheus = "0.14"
- metrics.rs: restore PrometheusBuilder-based setup verbatim from
  origin/main
- All call sites: crate::metrics::metrics().<handle>.add/record ->
  metrics-rs macros (counter!/gauge!/histogram!) across connection, auth,
  event, count, state, subscription, bridge, media
- main.rs: fanout_lag + cache_inval_lag background consumers converted to
  metrics::counter!; pool gauge section (net-new, kept per spec)
  converted from relay_metrics::meter() OTEL API to
  metrics::gauge!.set(); meter_provider shutdown removed; telemetry
  comment updated to trace-only
- telemetry.rs: doc comment updated; Resource is for trace provider only
- Cargo.lock: opentelemetry-prometheus and prometheus crates removed

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
## Summary

Adds distributed tracing to the buzz relay via OpenTelemetry while keeping metrics on the existing Prometheus scrape path. OTLP carries traces only; the `:9102` Prometheus text endpoint and every existing metric name, label, and bucket are unchanged. Also adds additive DB/Redis connection-pool gauges on the existing `metrics-rs` path.

## Changes

### Tracing (new)

- `telemetry.rs`: installs an OTLP gRPC span exporter + `SdkTracerProvider` only when `OTEL_EXPORTER_OTLP_ENDPOINT` is set; no-ops (zero overhead, no connection attempted) when unset.
- `Resource` reads `OTEL_SERVICE_NAME` explicitly with a `buzz-relay` fallback, plus an `EnvResourceDetector` overlay for `OTEL_RESOURCE_ATTRIBUTES`.
- `OpenTelemetryLayer` wired into the `tracing_subscriber` stack in `main.rs` alongside the existing JSON `fmt` layer; stdout structured logs are unchanged.
- Spans `ws.auth`, `ws.event`, `ws.req`, `ws.count` in `connection.rs` / `handlers/auth.rs` / `handlers/event.rs` (carrying `conn_id` / `event_id` / `kind` / `sub_id`), and `#[instrument(skip_all)]` on `SubRegistry::fan_out_scoped`.

### Metrics (unchanged path)

- `metrics.rs` is unchanged from `main`: the `metrics-rs` / `PrometheusBuilder` setup, every metric name, label set, and histogram bucket boundary are preserved. Existing Prometheus scrapers and the Datadog Agent `openmetrics` annotation need no changes.

### DB and Redis pool gauges (new, additive)

- `Db::pool_stats() -> DbPoolStats` added to `buzz-db` (exposes `sqlx` pool `size()` and `num_idle()` only; minimal accessor, no SQL or mutation).
- `main.rs` polls pool stats (interval via `BUZZ_POOL_METRICS_INTERVAL_SECS`, clamped to >= 1s) and emits via `metrics::gauge!`:
  - `buzz_db_pool_size`, `buzz_db_pool_idle`, `buzz_db_pool_active`
  - `buzz_redis_pool_available`, `buzz_redis_pool_size`, `buzz_redis_pool_max`, `buzz_redis_pool_waiting`

### Graceful shutdown

## Environment variables

- `OTEL_EXPORTER_OTLP_ENDPOINT`
- `OTEL_SERVICE_NAME` (default `buzz-relay`): `service.name` resource attribute on traces
- `OTEL_RESOURCE_ATTRIBUTES`
- `BUZZ_METRICS_PORT` (default `9102`)
- `BUZZ_POOL_METRICS_INTERVAL_SECS` (default `10`)

## Backward compatibility

With `OTEL_EXPORTER_OTLP_ENDPOINT` unset: `/metrics` on `:9102` serves the same Prometheus text format with identical metric names/labels, JSON stdout logs are unchanged, and no OTLP connection is attempted. Zero behavioral change for existing deployments.

Related: block-coder-tf-stacks#2267 (staging relay OTLP endpoint config).