feat(relay): add OpenTelemetry tracing, keep Prometheus metrics by wpfleger96 · Pull Request #1398 · block/buzz

wpfleger96 · 2026-06-30T15:38:26Z

Summary

Adds distributed tracing to the buzz relay via OpenTelemetry while keeping metrics on the existing Prometheus scrape path. OTLP carries traces only; the :9102 Prometheus text endpoint and every existing metric name, label, and bucket are unchanged. Also adds additive DB/Redis connection-pool gauges on the existing metrics-rs path.

Changes

Tracing (new)

- New telemetry.rs: installs an OTLP gRPC span exporter + SdkTracerProvider only when OTEL_EXPORTER_OTLP_ENDPOINT is set; no-ops (zero overhead, no connection attempted) when unset.
- The trace Resource reads OTEL_SERVICE_NAME explicitly with a buzz-relay fallback, plus an EnvResourceDetector overlay for OTEL_RESOURCE_ATTRIBUTES.
- OpenTelemetryLayer is wired into the tracing_subscriber stack in main.rs alongside the existing JSON fmt layer; stdout structured logs are unchanged.
- Spans added on hot paths: ws.auth, ws.event, ws.req, ws.count in connection.rs/handlers/auth.rs/handlers/event.rs (carrying conn_id/event_id/kind/sub_id), and #[instrument(skip_all)] on SubRegistry::fan_out_scoped.
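
For readers who want the gating rule in isolation, here is a minimal std-only sketch of the "tracing only when OTEL_EXPORTER_OTLP_ENDPOINT is set" decision. It does not use the OpenTelemetry crates; `TracingMode` and `tracing_mode` are hypothetical names, and treating a blank value as unset is an assumption, not something this PR states.

```rust
/// Env var named in the PR description.
const OTLP_ENDPOINT_VAR: &str = "OTEL_EXPORTER_OTLP_ENDPOINT";

/// Outcome of the startup decision (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum TracingMode {
    /// No exporter is built and no connection is attempted.
    Disabled,
    /// An OTLP gRPC span exporter would be built for this endpoint.
    Otlp { endpoint: String },
}

/// Decide the tracing mode from an env-style lookup.
/// The lookup is injected so the rule can be exercised without touching
/// the real process environment.
fn tracing_mode(lookup: impl Fn(&str) -> Option<String>) -> TracingMode {
    match lookup(OTLP_ENDPOINT_VAR) {
        // Assumption: a blank value is treated the same as an unset variable.
        Some(v) if !v.trim().is_empty() => TracingMode::Otlp {
            endpoint: v.trim().to_string(),
        },
        _ => TracingMode::Disabled,
    }
}
```

In a binary this would be called as `tracing_mode(|k| std::env::var(k).ok())`, and only the `Otlp` arm would go on to build an exporter.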

Metrics — unchanged path

metrics.rs is unchanged from main: the metrics-rs / PrometheusBuilder setup, every metric name, label set, and histogram bucket boundary are preserved. Existing Prometheus scrapers and the Datadog Agent openmetrics annotation need no changes.

DB and Redis pool gauges (new, additive)

Db::pool_stats() -> DbPoolStats added to buzz-db (exposes sqlx pool size() and num_idle() only — minimal accessor, no SQL or mutation).
Background task in main.rs polls pool stats (interval via BUZZ_POOL_METRICS_INTERVAL_SECS, clamped to >= 1s) and emits via metrics::gauge!:
- buzz_db_pool_size, buzz_db_pool_idle, buzz_db_pool_active
- buzz_redis_pool_available, buzz_redis_pool_size, buzz_redis_pool_max, buzz_redis_pool_waiting
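
The interval clamp and the derived `active` gauge can be sketched without sqlx or tokio. This is a rough illustration under stated assumptions: the `DbPoolStats` field names and the `poll_interval` helper are hypothetical, the field types follow sqlx (where `size()` returns `u32` and `num_idle()` returns `usize`), and falling back to the default on an unparsable value is an assumption rather than confirmed behaviour.

```rust
use std::time::Duration;

/// Default poll interval in seconds, per the environment-variable table.
const DEFAULT_POLL_SECS: u64 = 10;

/// Mirrors the two values the PR says `Db::pool_stats()` exposes.
/// Field names are hypothetical.
#[derive(Debug, Clone, Copy)]
struct DbPoolStats {
    size: u32,
    idle: usize,
}

impl DbPoolStats {
    /// In-use connections, i.e. `size - idle`.
    /// Saturating, because the two values are sampled separately and
    /// `idle` could briefly exceed `size`.
    fn active(&self) -> u64 {
        (self.size as u64).saturating_sub(self.idle as u64)
    }
}

/// Parse the raw BUZZ_POOL_METRICS_INTERVAL_SECS value and clamp it to >= 1s.
/// Assumption: missing or unparsable input falls back to the default.
fn poll_interval(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_POLL_SECS)
        .max(1);
    Duration::from_secs(secs)
}
```

The clamp matters because, as a later fix commit in this PR notes, `tokio::time::interval(Duration::ZERO)` panics.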

Graceful shutdown

The OTLP tracer provider is flushed on SIGTERM drain (after audit drain), with warning-only error handling. No meter-provider shutdown — metrics stay on the Prometheus exporter.

Environment variables

Variable	Default	Purpose
`OTEL_EXPORTER_OTLP_ENDPOINT`	(unset = tracing disabled)	OTLP gRPC trace endpoint
`OTEL_SERVICE_NAME`	`buzz-relay`	`service.name` resource attribute on traces
`OTEL_RESOURCE_ATTRIBUTES`	—	Extra trace resource attributes
`BUZZ_METRICS_PORT`	`9102`	Prometheus scrape port (unchanged)
`BUZZ_POOL_METRICS_INTERVAL_SECS`	`10`	Pool stats poll interval

Backward compatibility

With OTEL_EXPORTER_OTLP_ENDPOINT unset: /metrics on :9102 serves the same Prometheus text format with identical metric names/labels, JSON stdout logs are unchanged, and no OTLP connection is attempted. Zero behavioral change for existing deployments.

Related: block-coder-tf-stacks#2267 — staging relay OTLP endpoint config.

…eus export

Replace metrics-rs/metrics-exporter-prometheus with OpenTelemetry native instruments backed by both a Prometheus text endpoint (:9102) and an OTLP gRPC exporter. Add distributed tracing via tracing-opentelemetry. Add DB and Redis pool metrics.

## What changed

### Metrics

- Rewrote metrics.rs as an OTEL setup module: SdkMeterProvider with a PrometheusExporter (pull-based, same /metrics endpoint) and an optional PeriodicReader+OTLP exporter gated on OTEL_EXPORTER_OTLP_ENDPOINT.
- Migrated all 41 metrics::counter!/histogram!/gauge! call sites across connection.rs, subscription.rs, state.rs, handlers/, and api/ to the pre-built Metrics struct (OnceLock, zero per-call-site allocation).
- Preserved every metric name, type, label set, and histogram bucket from the prior implementation so existing Prometheus scrapers (including the Datadog Agent openmetrics annotation) need no changes.
- Instruments lazy-init to OTEL noop meter when install() hasn't been called, matching prior metrics-rs behaviour in unit tests.

### Tracing

- Added telemetry.rs: try_init_tracer() initialises an OTLP gRPC span exporter + SdkTracerProvider when OTEL_EXPORTER_OTLP_ENDPOINT is set; returns None (zero overhead) when unset.
- Wired OpenTelemetryLayer into the tracing_subscriber stack in main.rs alongside the existing JSON fmt layer (stdout logs unchanged).
- Added #[instrument] spans on hot paths: handle_event, fan_out_pubsub_event, handle_auth, SubRegistry::fan_out_scoped.

### DB and Redis pool metrics

- Added Db::pool_stats() -> DbPoolStats in buzz-db (exposes sqlx pool size and num_idle).
- Added background task in main.rs polling pool stats every 10 s (configurable via BUZZ_POOL_METRICS_INTERVAL_SECS) and emitting buzz_db_pool_{size,idle,active} and buzz_redis_pool_{available,size,max,waiting} gauges.

### Graceful shutdown

- SdkMeterProvider and optional SdkTracerProvider flushed on SIGTERM drain.

## Environment variables (all optional)

- OTEL_EXPORTER_OTLP_ENDPOINT: unset disables OTEL entirely
- OTEL_SERVICE_NAME: defaults to buzz-relay
- OTEL_RESOURCE_ATTRIBUTES: extra resource attributes
- OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG: sampling strategy
- BUZZ_POOL_METRICS_INTERVAL_SECS: pool poll interval (default 10)

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…entity, metrics port bind

## Fixes

### IMPORTANT 1: Prometheus bind failure now fails startup

Bind the metrics TcpListener synchronously in install() before tokio::spawn. serve_prometheus() now accepts a pre-bound TcpListener instead of a port number. A port conflict panics at startup (matching prior behaviour) rather than silently dropping the metrics endpoint from a detached task.

### IMPORTANT 2: OTLP service.name defaults to buzz-relay

Build a single shared Resource via service_resource() in telemetry.rs using ResourceBuilder::with_service_name(buzz-relay) followed by with_detector(EnvResourceDetector) so that OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES still win when set. Both SdkTracerProvider and SdkMeterProvider receive the same Resource so traces and metrics correlate under the same service identity in Datadog.

### IMPORTANT 3: Span topology: WS flow now produces one connected trace

Create explicit parent spans in handle_text_message() for EVENT (ws.event), REQ (ws.req), COUNT (ws.count), and AUTH (ws.auth) messages, each carrying conn_id. Spawned handler futures are wrapped with .instrument(span) so the tracing context is not dropped at the tokio::spawn boundary. handle_event() and handle_auth() now call Span::current().record() to populate the event_id and kind/conn_id fields declared in their #[instrument] attributes.

### NIT 4: target_info series suppressed for byte-parity

Add .without_target_info() to the Prometheus exporter builder so the new Resource (non-empty after fix 2) does not inject a target_info series that the old metrics-rs endpoint never emitted.

### NIT 5: BUZZ_POOL_METRICS_INTERVAL_SECS=0 no longer panics

Clamp interval_secs to >= 1. tokio::time::interval(Duration::ZERO) panics; a config typo of 0 would have silently killed the pool metrics task.

### CI: cargo fmt drift

Run cargo fmt --all to fix rustfmt line-wrapping across the migrated crate::metrics::metrics().<handle>.add(...) call sites.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
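
The "bind before spawn" fix in this commit can be illustrated with the standard library alone. The sketch below uses `std::net` and `std::thread` as stand-ins for the tokio listener and `tokio::spawn`; `bind_metrics_listener` and `serve_in_background` are hypothetical names, and the worker body is a placeholder rather than a real Prometheus responder.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::thread;

/// Bind synchronously so that a port conflict is returned to the caller
/// at startup instead of being lost inside a detached task.
fn bind_metrics_listener(addr: SocketAddr) -> io::Result<TcpListener> {
    TcpListener::bind(addr)
}

/// Hand the already-bound listener to a background worker.
/// Placeholder body: a real server would write the metrics text response.
fn serve_in_background(listener: TcpListener) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for stream in listener.incoming() {
            // Accept and immediately close each connection.
            drop(stream);
        }
    })
}
```

A caller that wants the startup-panic behaviour described above would write something like `let listener = bind_metrics_listener(addr).expect("metrics port bind failed");` before calling `serve_in_background(listener)`. Because the worker receives a listener rather than a port number, it has no bind step left that could fail silently.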

EnvResourceDetector reads OTEL_RESOURCE_ATTRIBUTES only, not OTEL_SERVICE_NAME (opentelemetry_sdk 0.32.1 resource/env.rs:23). SdkProvidedResourceDetector does read OTEL_SERVICE_NAME but always emits a service.name key, falling back to unknown_service:<exe> when unset, which would clobber the buzz-relay default.

Read OTEL_SERVICE_NAME explicitly: non-empty value wins over the buzz-relay fallback; OTEL_RESOURCE_ATTRIBUTES (via EnvResourceDetector overlaid last) still wins over OTEL_SERVICE_NAME per OTEL spec.

Correct the module and function doc comments that claimed the SDK detector handled OTEL_SERVICE_NAME automatically.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
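
The resolution order this commit describes (fallback, then OTEL_SERVICE_NAME, then the OTEL_RESOURCE_ATTRIBUTES overlay) can be sketched as a pure function. This is a std-only approximation, not the code in telemetry.rs: both function names are hypothetical, the parser handles only plain comma-separated `key=value` pairs, and letting the last duplicate key win is an assumption.

```rust
/// Fallback named in the PR description.
const DEFAULT_SERVICE_NAME: &str = "buzz-relay";

/// Extract `service.name` from an OTEL_RESOURCE_ATTRIBUTES-style string,
/// i.e. a comma-separated list of `key=value` pairs.
/// Assumption: if the key appears more than once, the last entry wins.
fn service_name_from_attrs(attrs: &str) -> Option<String> {
    attrs
        .split(',')
        .filter_map(|pair| pair.split_once('='))
        .map(|(k, v)| (k.trim(), v.trim()))
        .filter(|(k, v)| *k == "service.name" && !v.is_empty())
        .map(|(_, v)| v.to_string())
        .last()
}

/// Resolve the effective service name in the order the commit describes.
fn resolve_service_name(service_name: Option<&str>, resource_attrs: Option<&str>) -> String {
    // Step 1 and 2: a non-empty OTEL_SERVICE_NAME replaces the fallback.
    let base = service_name
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .unwrap_or(DEFAULT_SERVICE_NAME)
        .to_string();
    // Step 3: the resource-attributes overlay is applied last, so a
    // `service.name` entry there replaces whatever was chosen above.
    resource_attrs
        .and_then(service_name_from_attrs)
        .unwrap_or(base)
}
```

For example, `resolve_service_name(Some("relay-staging"), Some("service.name=override"))` yields `override` under this ordering. One point that may be worth double-checking: the OpenTelemetry environment-variable specification documents OTEL_SERVICE_NAME as taking precedence over a `service.name` entry in OTEL_RESOURCE_ATTRIBUTES, which is the reverse of the overlay order sketched here.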

wpfleger96 · 2026-06-30T17:37:25Z

Exported metrics reference

All metric names, types, and label sets are preserved verbatim from the prior metrics-rs implementation, so existing Prometheus scrapers and Datadog dashboards need no changes. Names are emitted identically on both readers: the Prometheus /metrics text endpoint on :9102 (always on) and the OTLP push exporter (only when OTEL_EXPORTER_OTLP_ENDPOINT is set).

Naming note: the two HTTP framework metrics are intentionally unprefixed (http_*); every other relay metric carries the buzz_ prefix. The Prometheus exporter is configured .without_units() / .without_counter_suffixes() / .without_scope_info() / .without_target_info() to keep byte-parity with the old endpoint.

HTTP framework (`track_metrics` middleware)

Metric	Type	Labels	Description
`http_requests_total`	Counter	`code`, `caller`, `action`	HTTP requests served. `caller` from the Istio `x-envoy-downstream-service-cluster` header (validated, `unknown` fallback); `action` is the matched route pattern. Health/metrics/unmatched paths skipped to bound cardinality.
`http_request_latency_ms`	Histogram	`code`, `caller`, `action`	Request latency in ms. Explicit buckets: 5/10/25/50/100/250/500/1000/2500/5000/10000.
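
As a reminder of how explicit buckets behave on the scrape side, the sketch below applies Prometheus's cumulative `le` (less-than-or-equal) semantics to the latency bounds from the table above. It is independent of the metrics-rs exporter; `first_bucket` and `cumulative_counts` are hypothetical helper names.

```rust
/// Latency bucket upper bounds in milliseconds, copied from the table above.
const LATENCY_BUCKETS_MS: [f64; 11] = [
    5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 2500.0, 5000.0, 10000.0,
];

/// Index of the first bucket whose upper bound is >= `value`.
/// `None` means the sample only lands in the implicit +Inf bucket.
fn first_bucket(bounds: &[f64], value: f64) -> Option<usize> {
    bounds.iter().position(|&le| value <= le)
}

/// Cumulative count per bound, matching how `_bucket{le="..."}` series
/// are exposed: each bucket counts every sample at or below its bound.
fn cumulative_counts(bounds: &[f64], samples: &[f64]) -> Vec<u64> {
    bounds
        .iter()
        .map(|&le| samples.iter().filter(|&&s| s <= le).count() as u64)
        .collect()
}
```

With samples of 3, 7.5, 120, and 20000 ms, the counts for the 5 and 10 ms bounds are 1 and 2, the 250 ms bound and above report 3, and the 20000 ms sample appears only in +Inf.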

WebSocket connections

Metric	Type	Labels	Description
`buzz_ws_connections_total`	Counter	—	WebSocket connections accepted.
`buzz_ws_connections_active`	UpDownCounter	—	Currently-open WebSocket connections (incremented on register, decremented on close).
`buzz_ws_backpressure_disconnects_total`	Counter	—	Connections dropped because the client could not keep up with the send queue.
`buzz_ws_auth_timeouts_total`	Counter	—	Connections closed for failing to authenticate within the auth window.
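
The increment-on-register, decrement-on-close behaviour of `buzz_ws_connections_active` can be modelled with a drop guard. This is one possible pattern, not necessarily how the relay does it: `ConnectionGuard` is a hypothetical type, and an `AtomicI64` stands in for the real metric handle.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Holds one "active connection" slot for as long as it is alive.
struct ConnectionGuard {
    active: Arc<AtomicI64>,
}

impl ConnectionGuard {
    /// Increment the shared counter when a connection is registered.
    fn register(active: Arc<AtomicI64>) -> Self {
        active.fetch_add(1, Ordering::Relaxed);
        Self { active }
    }
}

impl Drop for ConnectionGuard {
    /// Decrement on close, including early returns and panics that unwind.
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::Relaxed);
    }
}
```

Tying the decrement to `Drop` means every exit path from a connection task releases its slot, so the value cannot drift upward from a missed decrement.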

Subscriptions

Metric	Type	Labels	Description
`buzz_subscriptions_active`	UpDownCounter	—	Currently-active REQ subscriptions across all connections.

Event ingest

Metric	Type	Labels	Description
`buzz_events_received_total`	Counter	`kind`	Events received over WS. `kind` is bounded to a known allow-list (else `other`) to prevent cardinality explosion.
`buzz_events_stored_total`	Counter	`kind`	Events successfully persisted.
`buzz_events_rejected_total`	Counter	`reason`	Events rejected, labeled by reason.
`buzz_event_processing_seconds`	Histogram	—	End-to-end event processing time. Buckets (s): 0.001/0.005/0.01/0.025/0.05/0.1/0.25/0.5/1/5.
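
The `kind` label bounding mentioned in the table above amounts to mapping any value outside an allow-list to `other`. A minimal sketch follows; the allow-list contents are hypothetical, since the relay's actual list is not shown in this PR, and `kind_label` is an illustrative name.

```rust
/// Hypothetical allow-list; the relay's real list is not shown in this PR.
const KNOWN_KINDS: [u32; 4] = [0, 1, 3, 7];

/// Map an event kind to a bounded label value.
/// Anything outside the allow-list collapses to "other", so the label can
/// take at most `KNOWN_KINDS.len() + 1` distinct values.
fn kind_label(kind: u32) -> String {
    if KNOWN_KINDS.contains(&kind) {
        kind.to_string()
    } else {
        "other".to_string()
    }
}
```

Feeding every kind from 0 to 9999 through `kind_label` produces only five distinct label values with this list, which is what keeps the per-kind counter series count fixed regardless of what clients send.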

Fan-out / multi-node

Metric	Type	Labels	Description
`buzz_fanout_recipients`	Histogram	—	Number of recipients per fanned-out event. Integer-count buckets: 0/1/5/10/25/50/100/500/1000.
`buzz_multinode_fanout_total`	Counter	—	Cross-pod fan-out operations published to the pub/sub bus.
`buzz_multinode_fanout_lag_total`	Counter	—	Messages dropped because a pod's multi-node fan-out consumer lagged the broadcast channel.
`buzz_cache_invalidation_lag_total`	Counter	—	Cache-invalidation messages dropped because a pod's consumer lagged.

Auth

Metric	Type	Labels	Description
`buzz_auth_attempts_total`	Counter	`method`	NIP-42 auth attempts (`method=nip42`).
`buzz_auth_failures_total`	Counter	`reason`	Auth failures by reason (`allowlist_denied`, `not_relay_member`, `nip42_invalid`).

Media uploads

Metric	Type	Labels	Description
`buzz_media_uploads_total`	Counter	`mime`	Successful media uploads, labeled by MIME type.
`buzz_media_upload_rejections_total`	Counter	`reason`	Upload rejections (`rate_limit`, `concurrency`).

Workflows

Metric	Type	Labels	Description
`buzz_workflow_runs_total`	Counter	`trigger`	Workflow runs, labeled by trigger kind.

Audit log

Metric	Type	Labels	Description
`buzz_audit_log_seconds`	Histogram	—	Audit-log write latency. Buckets (s): same `DURATION_BUCKETS_S` as event processing.
`buzz_audit_log_errors_total`	Counter	—	Audit-log write failures.
`buzz_audit_send_errors_total`	Counter	—	Failures sending audit entries downstream.

Caches

Metric	Type	Labels	Description
`buzz_membership_cache_hits_total`	Counter	—	Membership-cache hits.
`buzz_membership_cache_misses_total`	Counter	—	Membership-cache misses.
`buzz_accessible_channels_cache_hits_total`	Counter	—	Accessible-channels-cache hits.
`buzz_accessible_channels_cache_misses_total`	Counter	—	Accessible-channels-cache misses.

COUNT fallback

Metric	Type	Labels	Description
`buzz_count_fallback_rejections_total`	Counter	—	COUNT queries rejected for requiring a too-broad fallback scan.

Connection-pool gauges (periodic, every `BUZZ_POOL_METRICS_INTERVAL_SECS`, default 10s)

Metric	Type	Labels	Description
`buzz_db_pool_size`	Gauge	—	Total Postgres pool connections.
`buzz_db_pool_idle`	Gauge	—	Idle Postgres pool connections.
`buzz_db_pool_active`	Gauge	—	In-use Postgres pool connections (`size - idle`).
`buzz_redis_pool_size`	Gauge	—	Current Redis pool connections.
`buzz_redis_pool_available`	Gauge	—	Available (idle) Redis pool connections.
`buzz_redis_pool_max`	Gauge	—	Configured Redis pool max.
`buzz_redis_pool_waiting`	Gauge	—	Callers waiting on a Redis connection.

Two buzz_search_index_* handles (_seconds histogram, _errors_total counter) are declared but currently have no emitters — the search path moved off the Typesense backend, so they register as zero-value series. Left in place to avoid touching unrelated code in this migration PR; can be pruned in a follow-up.

tlongwell-block · 2026-06-30T17:58:22Z

I did the cross-check Tyler asked for.

Confirmation / recommendation: I would keep Prometheus/OpenMetrics scraped by the Datadog Agent as the primary production metrics path, and treat the OTEL/OTLP path as opt-in / less paved for now.

What I found:

Datadog Agent can scrape Prometheus/OpenMetrics from Kubernetes pods. Block docs explicitly say the Datadog Agent has built-in OpenMetrics support, configured per pod/container via ad.datadoghq.com/${CONTAINER_NAME}.checks annotations, and that the Agent watches the Kubernetes API for annotated pods/containers to scrape. See: https://dev-guides.sqprod.co/cash/docs/platform/architecture/custom_metrics_collection
Afterpay/Block observability docs give the same Kubernetes pod-annotation flow, including openmetrics_endpoint, metrics, histogram_buckets_as_distributions, send_distribution_buckets, and send_monotonic_counter: https://dev-guides.sqprod.co/afterpay/docs/observability/how-to/ingest-prom-metrics
This PR preserves a Prometheus text endpoint on :9102/metrics and configures the OTEL Prometheus exporter with .without_units(), .without_counter_suffixes(), .without_scope_info(), and .without_target_info() to keep metric names/output compatible with the old metrics-rs/Prometheus endpoint. That supports the claim that existing Datadog OpenMetrics scraping should not need to change.
Internal guidance/search results point to OTEL/OTLP metrics as an active migration/trial area rather than the universally paved path. I found support for Tyler’s “finicky internally” read: collector/OTLP metrics work exists, but docs and internal references repeatedly steer production custom metrics toward Datadog OpenMetrics scraping; OTEL collector/SDK migration is still more situational. I would phrase it as “not abandoned, but less mature/paved internally than Prometheus/OpenMetrics → Datadog Agent.”
For tracing, the companion infra PR’s target endpoint matches existing in-cluster usage in block-coder-tf-stacks (cachew and blox-orchestrator already point at http://datadog-agent.datadog-agent.svc.cluster.local:4317). So using the Datadog Agent OTLP gRPC receiver for staging traces is directionally reasonable, but I’d avoid coupling metrics to OTLP unless there’s a specific reason.

Net: Will’s stated refactor direction — Prometheus/OpenMetrics for relay health/activity metrics, Datadog Agent scrape from pods, OTEL only where explicitly needed (e.g. traces behind OTEL_EXPORTER_OTLP_ENDPOINT) — matches the internal paved path better than a full OTEL metrics migration.

…e, keep OTLP tracing

Drop the OpenTelemetry metrics exporter (opentelemetry-prometheus, prometheus crate, OTLP metric push) in favour of the original metrics-rs facade (metrics::counter!/gauge!/histogram! macros) backed by metrics-exporter-prometheus. OTLP tracing (telemetry.rs, tracing-opentelemetry, OTEL trace spans in connection/event/auth handlers) is intentionally preserved; only the metrics path is reverted.

Changes:

- Cargo.toml: restore metrics + metrics-exporter-prometheus workspace deps; strip metrics/logs features from opentelemetry, opentelemetry_sdk, opentelemetry-otlp, tracing-opentelemetry; remove opentelemetry-prometheus and prometheus = "0.14"
- metrics.rs: restore PrometheusBuilder-based setup verbatim from origin/main
- All call sites: crate::metrics::metrics().<handle>.add/record -> metrics-rs macros (counter!/gauge!/histogram!) across connection, auth, event, count, state, subscription, bridge, media
- main.rs: fanout_lag + cache_inval_lag background consumers converted to metrics::counter!; pool gauge section (net-new, kept per spec) converted from relay_metrics::meter() OTEL API to metrics::gauge!.set(); meter_provider shutdown removed; telemetry comment updated to trace-only
- telemetry.rs: doc comment updated: Resource is for trace provider only
- Cargo.lock: opentelemetry-prometheus and prometheus crates removed

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

wpfleger96 marked this pull request as draft June 30, 2026 15:50

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits June 30, 2026 12:02

wpfleger96 marked this pull request as ready for review June 30, 2026 17:39

wpfleger96 changed the title ~~feat(relay): migrate telemetry to OpenTelemetry with OTLP and Prometheus export~~ feat(relay): add OpenTelemetry tracing, keep Prometheus metrics Jun 30, 2026

wpfleger96 wants to merge 4 commits into main from duncan/otel-migration
