feat(relay): add OpenTelemetry tracing, keep Prometheus metrics #1398

Open
wpfleger96 wants to merge 4 commits into main from duncan/otel-migration

@wpfleger96 wpfleger96 commented Jun 30, 2026

Collaborator

Summary

Adds distributed tracing to the buzz relay via OpenTelemetry while keeping metrics on the existing Prometheus scrape path. OTLP carries traces only; the :9102 Prometheus text endpoint and every existing metric name, label, and bucket are unchanged. Also adds DB/Redis connection-pool gauges on the existing metrics-rs path; these are purely additive.

Changes

Tracing (new)

  • New telemetry.rs: installs an OTLP gRPC span exporter + SdkTracerProvider only when OTEL_EXPORTER_OTLP_ENDPOINT is set; no-ops (zero overhead, no connection attempted) when unset.
  • The trace Resource reads OTEL_SERVICE_NAME explicitly with a buzz-relay fallback, plus an EnvResourceDetector overlay for OTEL_RESOURCE_ATTRIBUTES.
  • OpenTelemetryLayer wired into the tracing_subscriber stack in main.rs alongside the existing JSON fmt layer — stdout structured logs are unchanged.
  • Spans added on hot paths: ws.auth, ws.event, ws.req, ws.count in connection.rs, handlers/auth.rs, and handlers/event.rs (carrying conn_id/event_id/kind/sub_id), and #[instrument(skip_all)] on SubRegistry::fan_out_scoped.
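
The gating and service-name rules in the list above can be sketched with the standard library alone. This is a minimal illustration, not the code in telemetry.rs: `TraceConfig`, `trace_config`, and `service_name_from_attrs` are hypothetical names, environment values are passed in as arguments so the logic stays testable, and treating an empty endpoint as unset is an assumption. The precedence follows the order stated in this PR (the OTEL_RESOURCE_ATTRIBUTES overlay is applied last).

```rust
/// Hypothetical result of reading the tracing-related environment variables.
#[derive(Debug, PartialEq)]
pub struct TraceConfig {
    pub endpoint: String,
    pub service_name: String,
}

/// Extracts `service.name` from an `OTEL_RESOURCE_ATTRIBUTES`-style string
/// of comma-separated `key=value` pairs. Returns the first non-empty match.
fn service_name_from_attrs(attrs: &str) -> Option<String> {
    attrs
        .split(',')
        .filter_map(|pair| pair.split_once('='))
        .find(|(key, _)| key.trim() == "service.name")
        .map(|(_, value)| value.trim().to_string())
        .filter(|value| !value.is_empty())
}

/// Returns `None` when no endpoint is configured, meaning tracing stays off
/// and no exporter would be built.
pub fn trace_config(
    endpoint: Option<&str>,
    service_name: Option<&str>,
    resource_attrs: Option<&str>,
) -> Option<TraceConfig> {
    // Assumption: an empty or whitespace-only endpoint counts as unset.
    let endpoint = endpoint.map(str::trim).filter(|e| !e.is_empty())?;

    // A non-empty OTEL_SERVICE_NAME wins over the buzz-relay fallback.
    let base = service_name
        .map(str::trim)
        .filter(|n| !n.is_empty())
        .unwrap_or("buzz-relay")
        .to_string();

    // The resource-attributes overlay is applied last, so it wins if present.
    let service_name = resource_attrs
        .and_then(service_name_from_attrs)
        .unwrap_or(base);

    Some(TraceConfig {
        endpoint: endpoint.to_string(),
        service_name,
    })
}
```

In a real binary the three arguments would come from `std::env::var(...).ok()`; keeping them as parameters avoids mutating process-wide state in tests.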

Metrics — unchanged path

  • metrics.rs is unchanged from main: the metrics-rs / PrometheusBuilder setup, every metric name, label set, and histogram bucket boundary are preserved. Existing Prometheus scrapers and the Datadog Agent openmetrics annotation need no changes.

DB and Redis pool gauges (new, additive)

  • Db::pool_stats() -> DbPoolStats added to buzz-db (exposes sqlx pool size() and num_idle() only — minimal accessor, no SQL or mutation).
  • Background task in main.rs polls pool stats (interval via BUZZ_POOL_METRICS_INTERVAL_SECS, clamped to >= 1s) and emits via metrics::gauge!:
    • buzz_db_pool_size, buzz_db_pool_idle, buzz_db_pool_active
    • buzz_redis_pool_available, buzz_redis_pool_size, buzz_redis_pool_max, buzz_redis_pool_waiting
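
The derived gauge and the interval clamp described above could look roughly like this. It is a sketch under assumptions, not the code in buzz-db or main.rs: the field types of `DbPoolStats` are guessed, `pool_poll_interval` is a hypothetical helper, and falling back to the default on an unparsable value is an assumption.

```rust
use std::time::Duration;

/// Snapshot of a connection pool, mirroring the two values the PR says
/// `Db::pool_stats()` exposes. Field types here are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct DbPoolStats {
    pub size: u32,
    pub idle: u32,
}

impl DbPoolStats {
    /// In-use connections (size - idle). Saturating subtraction guards
    /// against a snapshot where idle momentarily exceeds size.
    pub fn active(&self) -> u32 {
        self.size.saturating_sub(self.idle)
    }
}

/// Parses a `BUZZ_POOL_METRICS_INTERVAL_SECS`-style value.
/// Missing or unparsable input falls back to 10 s; the result is clamped
/// to at least 1 s so a value of 0 never reaches the timer.
pub fn pool_poll_interval(raw: Option<&str>) -> Duration {
    const DEFAULT_SECS: u64 = 10;
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_SECS)
        .max(1);
    Duration::from_secs(secs)
}
```

The clamp matters because `tokio::time::interval` panics when given a zero period.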

Graceful shutdown

  • The OTLP tracer provider is flushed on SIGTERM drain (after audit drain), with warning-only error handling. No meter-provider shutdown — metrics stay on the Prometheus exporter.

Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | (unset = tracing disabled) | OTLP gRPC trace endpoint |
| OTEL_SERVICE_NAME | buzz-relay | service.name resource attribute on traces |
| OTEL_RESOURCE_ATTRIBUTES | | Extra trace resource attributes |
| BUZZ_METRICS_PORT | 9102 | Prometheus scrape port (unchanged) |
| BUZZ_POOL_METRICS_INTERVAL_SECS | 10 | Pool stats poll interval |

Backward compatibility

With OTEL_EXPORTER_OTLP_ENDPOINT unset: /metrics on :9102 serves the same Prometheus text format with identical metric names/labels, JSON stdout logs are unchanged, and no OTLP connection is attempted. Zero behavioral change for existing deployments.

Related: block-coder-tf-stacks#2267 — staging relay OTLP endpoint config.

…eus export

Replace metrics-rs/metrics-exporter-prometheus with OpenTelemetry native
instruments backed by both a Prometheus text endpoint (:9102) and an OTLP
gRPC exporter. Add distributed tracing via tracing-opentelemetry. Add DB
and Redis pool metrics.

## What changed

### Metrics
- Rewrote metrics.rs as an OTEL setup module: SdkMeterProvider with a
  PrometheusExporter (pull-based, same /metrics endpoint) and an optional
  PeriodicReader+OTLP exporter gated on OTEL_EXPORTER_OTLP_ENDPOINT.
- Migrated all 41 metrics::counter!/histogram!/gauge! call sites across
  connection.rs, subscription.rs, state.rs, handlers/, and api/ to the
  pre-built Metrics struct (OnceLock, zero per-call-site allocation).
- Preserved every metric name, type, label set, and histogram bucket from
  the prior implementation so existing Prometheus scrapers (including the
  Datadog Agent openmetrics annotation) need no changes.
- Instruments lazy-init to OTEL noop meter when install() hasn't been
  called, matching prior metrics-rs behaviour in unit tests.

### Tracing
- Added telemetry.rs: try_init_tracer() initialises an OTLP gRPC span
  exporter + SdkTracerProvider when OTEL_EXPORTER_OTLP_ENDPOINT is set;
  returns None (zero overhead) when unset.
- Wired OpenTelemetryLayer into the tracing_subscriber stack in main.rs
  alongside the existing JSON fmt layer (stdout logs unchanged).
- Added #[instrument] spans on hot paths: handle_event, fan_out_pubsub_event,
  handle_auth, SubRegistry::fan_out_scoped.

### DB and Redis pool metrics
- Added Db::pool_stats() -> DbPoolStats in buzz-db (exposes sqlx pool
  size and num_idle).
- Added background task in main.rs polling pool stats every 10 s
  (configurable via BUZZ_POOL_METRICS_INTERVAL_SECS) and emitting
  buzz_db_pool_{size,idle,active} and buzz_redis_pool_{available,size,
  max,waiting} gauges.

### Graceful shutdown
- SdkMeterProvider and optional SdkTracerProvider flushed on SIGTERM
  drain.

## Environment variables (all optional)
- OTEL_EXPORTER_OTLP_ENDPOINT — unset disables OTEL entirely
- OTEL_SERVICE_NAME — defaults to buzz-relay
- OTEL_RESOURCE_ATTRIBUTES — extra resource attributes
- OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG — sampling strategy
- BUZZ_POOL_METRICS_INTERVAL_SECS — pool poll interval (default 10)

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 marked this pull request as draft June 30, 2026 15:50
npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits June 30, 2026 12:02
…entity, metrics port bind

## Fixes

### IMPORTANT 1 — Prometheus bind failure now fails startup
Bind the metrics TcpListener synchronously in install() before tokio::spawn.
serve_prometheus() now accepts a pre-bound TcpListener instead of a port
number. A port conflict panics at startup (matching prior behaviour) rather
than silently dropping the metrics endpoint from a detached task.
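
A rough standard-library illustration of the bind-before-spawn pattern described here. It uses `std::thread` in place of `tokio::spawn`, and `start_metrics_server` is a hypothetical name, so treat it as a sketch of the idea rather than the code in install():

```rust
use std::io::{self, Write};
use std::net::{SocketAddr, TcpListener};
use std::thread;

/// Binds the listener on the caller's thread, then serves from a
/// background thread. A port conflict is returned to the caller instead
/// of being lost inside the detached task.
pub fn start_metrics_server(addr: &str) -> io::Result<SocketAddr> {
    // Synchronous bind: fails here, at startup, if the port is taken.
    let listener = TcpListener::bind(addr)?;
    let local_addr = listener.local_addr()?;

    // Only the already-bound listener moves into the background thread.
    thread::spawn(move || {
        for stream in listener.incoming() {
            if let Ok(mut stream) = stream {
                // Placeholder response; a real endpoint would render metrics.
                let _ = stream.write_all(
                    b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n",
                );
            }
        }
    });

    Ok(local_addr)
}
```

A caller that wants the fail-fast behaviour described above can call `.expect("metrics port bind failed")` on the result.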

### IMPORTANT 2 — OTLP service.name defaults to buzz-relay
Build a single shared Resource via service_resource() in telemetry.rs using
ResourceBuilder::with_service_name("buzz-relay") followed by
with_detector(EnvResourceDetector) so that OTEL_SERVICE_NAME and
OTEL_RESOURCE_ATTRIBUTES still win when set. Both SdkTracerProvider and
SdkMeterProvider receive the same Resource so traces and metrics correlate
under the same service identity in Datadog.

### IMPORTANT 3 — Span topology: WS flow now produces one connected trace
Create explicit parent spans in handle_text_message() for EVENT (ws.event),
REQ (ws.req), COUNT (ws.count), and AUTH (ws.auth) messages, each carrying
conn_id. Spawned handler futures are wrapped with .instrument(span) so the
tracing context is not dropped at the tokio::spawn boundary. handle_event()
and handle_auth() now call Span::current().record() to populate the event_id
and kind/conn_id fields declared in their #[instrument] attributes.

### NIT 4 — target_info series suppressed for byte-parity
Add .without_target_info() to the Prometheus exporter builder so the new
Resource (non-empty after fix 2) does not inject a target_info series that
the old metrics-rs endpoint never emitted.

### NIT 5 — BUZZ_POOL_METRICS_INTERVAL_SECS=0 no longer panics
Clamp interval_secs to >= 1. tokio::time::interval(Duration::ZERO) panics;
a config typo of 0 would have silently killed the pool metrics task.

### CI — cargo fmt drift
Run cargo fmt --all to fix rustfmt line-wrapping across the migrated
crate::metrics::metrics().<handle>.add(...) call sites.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
EnvResourceDetector reads OTEL_RESOURCE_ATTRIBUTES only, not
OTEL_SERVICE_NAME (opentelemetry_sdk 0.32.1 resource/env.rs:23).
SdkProvidedResourceDetector does read OTEL_SERVICE_NAME but always
emits a service.name key, falling back to unknown_service:<exe> when
unset — which would clobber the buzz-relay default.

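Read OTEL_SERVICE_NAME explicitly: a non-empty value wins over the
buzz-relay fallback; a service.name in OTEL_RESOURCE_ATTRIBUTES (via
EnvResourceDetector, overlaid last) still wins over OTEL_SERVICE_NAME.
Note that the OTEL spec itself gives OTEL_SERVICE_NAME precedence over
a service.name set in OTEL_RESOURCE_ATTRIBUTES.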
Correct the module and function doc comments that claimed the SDK
detector handled OTEL_SERVICE_NAME automatically.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96

Collaborator Author

Exported metrics reference

All metric names, types, and label sets are preserved verbatim from the prior metrics-rs implementation, so existing Prometheus scrapers and Datadog dashboards need no changes. Names are emitted identically on both readers: the Prometheus /metrics text endpoint on :9102 (always on) and the OTLP push exporter (only when OTEL_EXPORTER_OTLP_ENDPOINT is set).

Naming note: the two HTTP framework metrics are intentionally unprefixed (http_*); every other relay metric carries the buzz_ prefix. The Prometheus exporter is configured .without_units() / .without_counter_suffixes() / .without_scope_info() / .without_target_info() to keep byte-parity with the old endpoint.

HTTP framework (track_metrics middleware)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| http_requests_total | Counter | code, caller, action | HTTP requests served. caller from the Istio x-envoy-downstream-service-cluster header (validated, unknown fallback); action is the matched route pattern. Health/metrics/unmatched paths skipped to bound cardinality. |
| http_request_latency_ms | Histogram | code, caller, action | Request latency in ms. Explicit buckets: 5/10/25/50/100/250/500/1000/2500/5000/10000. |
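
For readers less familiar with Prometheus histograms, the explicit buckets above are cumulative: each observation is counted in every bucket whose upper bound is at least the observed value, plus an implicit +Inf bucket. A minimal sketch of that counting rule, with `cumulative_bucket_counts` as a hypothetical helper rather than anything in the relay:

```rust
/// Bucket upper bounds (ms) quoted for `http_request_latency_ms`.
pub const LATENCY_BUCKETS_MS: [f64; 11] = [
    5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 2500.0, 5000.0, 10000.0,
];

/// Returns one cumulative count per bound, followed by the `+Inf` bucket.
pub fn cumulative_bucket_counts(samples: &[f64], bounds: &[f64]) -> Vec<u64> {
    let mut counts = vec![0u64; bounds.len() + 1];
    for &sample in samples {
        for (i, &bound) in bounds.iter().enumerate() {
            // "le" semantics: the sample lands in every bucket >= its value.
            if sample <= bound {
                counts[i] += 1;
            }
        }
        // The +Inf bucket counts every observation.
        counts[bounds.len()] += 1;
    }
    counts
}
```

This is why changing a bucket boundary alters existing series: the counts under every `le` label depend on the full set of bounds.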

WebSocket connections

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_ws_connections_total | Counter | | WebSocket connections accepted. |
| buzz_ws_connections_active | UpDownCounter | | Currently-open WebSocket connections (incremented on register, decremented on close). |
| buzz_ws_backpressure_disconnects_total | Counter | | Connections dropped because the client could not keep up with the send queue. |
| buzz_ws_auth_timeouts_total | Counter | | Connections closed for failing to authenticate within the auth window. |
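
One common way to keep an "active" count like buzz_ws_connections_active balanced is an RAII guard, so the decrement cannot be skipped on an early return or panic. The sketch below is an assumption about how such bookkeeping can be structured, not a description of the relay's code; `ConnectionGuard` is a hypothetical type and a plain atomic stands in for the metric handle.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Holds one slot in the active-connection count for as long as it lives.
pub struct ConnectionGuard {
    active: Arc<AtomicI64>,
}

impl ConnectionGuard {
    /// Increments the count when a connection is registered.
    pub fn register(active: Arc<AtomicI64>) -> Self {
        active.fetch_add(1, Ordering::Relaxed);
        Self { active }
    }
}

impl Drop for ConnectionGuard {
    /// Decrements the count when the connection closes, however it closes.
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::Relaxed);
    }
}
```

The guard would typically live inside the per-connection task, so it is dropped when that task ends.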

Subscriptions

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_subscriptions_active | UpDownCounter | | Currently-active REQ subscriptions across all connections. |

Event ingest

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_events_received_total | Counter | kind | Events received over WS. kind is bounded to a known allow-list (else other) to prevent cardinality explosion. |
| buzz_events_stored_total | Counter | kind | Events successfully persisted. |
| buzz_events_rejected_total | Counter | reason | Events rejected, labeled by reason. |
| buzz_event_processing_seconds | Histogram | | End-to-end event processing time. Buckets (s): 0.001/0.005/0.01/0.025/0.05/0.1/0.25/0.5/1/5. |
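
The allow-list bounding of the kind label can be illustrated as follows. The kinds in `KNOWN_KINDS` are placeholders chosen for the example, not the relay's actual allow-list, and `kind_label` is a hypothetical helper.

```rust
/// Placeholder allow-list; the relay's real list is not shown in this PR.
pub const KNOWN_KINDS: [u32; 3] = [0, 1, 7];

/// Maps an event kind to a metric label value. Anything outside the
/// allow-list collapses to "other", so the label can take at most
/// `KNOWN_KINDS.len() + 1` distinct values.
pub fn kind_label(kind: u32) -> String {
    if KNOWN_KINDS.contains(&kind) {
        kind.to_string()
    } else {
        "other".to_string()
    }
}
```

Without the fallback, a client sending arbitrary kind numbers could create an unbounded number of time series.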

Fan-out / multi-node

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_fanout_recipients | Histogram | | Number of recipients per fanned-out event. Integer-count buckets: 0/1/5/10/25/50/100/500/1000. |
| buzz_multinode_fanout_total | Counter | | Cross-pod fan-out operations published to the pub/sub bus. |
| buzz_multinode_fanout_lag_total | Counter | | Messages dropped because a pod's multi-node fan-out consumer lagged the broadcast channel. |
| buzz_cache_invalidation_lag_total | Counter | | Cache-invalidation messages dropped because a pod's consumer lagged. |

Auth

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_auth_attempts_total | Counter | method | NIP-42 auth attempts (method=nip42). |
| buzz_auth_failures_total | Counter | reason | Auth failures by reason (allowlist_denied, not_relay_member, nip42_invalid). |

Media uploads

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_media_uploads_total | Counter | mime | Successful media uploads, labeled by MIME type. |
| buzz_media_upload_rejections_total | Counter | reason | Upload rejections (rate_limit, concurrency). |

Workflows

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_workflow_runs_total | Counter | trigger | Workflow runs, labeled by trigger kind. |

Audit log

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_audit_log_seconds | Histogram | | Audit-log write latency. Buckets (s): same DURATION_BUCKETS_S as event processing. |
| buzz_audit_log_errors_total | Counter | | Audit-log write failures. |
| buzz_audit_send_errors_total | Counter | | Failures sending audit entries downstream. |

Caches

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_membership_cache_hits_total | Counter | | Membership-cache hits. |
| buzz_membership_cache_misses_total | Counter | | Membership-cache misses. |
| buzz_accessible_channels_cache_hits_total | Counter | | Accessible-channels-cache hits. |
| buzz_accessible_channels_cache_misses_total | Counter | | Accessible-channels-cache misses. |

COUNT fallback

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_count_fallback_rejections_total | Counter | | COUNT queries rejected for requiring a too-broad fallback scan. |

Connection-pool gauges (periodic, every BUZZ_POOL_METRICS_INTERVAL_SECS, default 10s)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| buzz_db_pool_size | Gauge | | Total Postgres pool connections. |
| buzz_db_pool_idle | Gauge | | Idle Postgres pool connections. |
| buzz_db_pool_active | Gauge | | In-use Postgres pool connections (size - idle). |
| buzz_redis_pool_size | Gauge | | Current Redis pool connections. |
| buzz_redis_pool_available | Gauge | | Available (idle) Redis pool connections. |
| buzz_redis_pool_max | Gauge | | Configured Redis pool max. |
| buzz_redis_pool_waiting | Gauge | | Callers waiting on a Redis connection. |

Two buzz_search_index_* handles (_seconds histogram, _errors_total counter) are declared but currently have no emitters — the search path moved off the Typesense backend, so they register as zero-value series. Left in place to avoid touching unrelated code in this migration PR; can be pruned in a follow-up.

@wpfleger96 wpfleger96 marked this pull request as ready for review June 30, 2026 17:39
@tlongwell-block

Collaborator

I did the cross-check Tyler asked for.

Confirmation / recommendation: I would keep Prometheus/OpenMetrics scraped by the Datadog Agent as the primary production metrics path, and treat the OTEL/OTLP path as opt-in / less paved for now.

What I found:

  • Datadog Agent can scrape Prometheus/OpenMetrics from Kubernetes pods. Block docs explicitly say the Datadog Agent has built-in OpenMetrics support, configured per pod/container via ad.datadoghq.com/${CONTAINER_NAME}.checks annotations, and that the Agent watches the Kubernetes API for annotated pods/containers to scrape. See: https://dev-guides.sqprod.co/cash/docs/platform/architecture/custom_metrics_collection
  • Afterpay/Block observability docs give the same Kubernetes pod-annotation flow, including openmetrics_endpoint, metrics, histogram_buckets_as_distributions, send_distribution_buckets, and send_monotonic_counter: https://dev-guides.sqprod.co/afterpay/docs/observability/how-to/ingest-prom-metrics
  • This PR preserves a Prometheus text endpoint on :9102/metrics and configures the OTEL Prometheus exporter with .without_units(), .without_counter_suffixes(), .without_scope_info(), and .without_target_info() to keep metric names/output compatible with the old metrics-rs/Prometheus endpoint. That supports the claim that existing Datadog OpenMetrics scraping should not need to change.
  • Internal guidance/search results point to OTEL/OTLP metrics as an active migration/trial area rather than the universally paved path. I found support for Tyler’s “finicky internally” read: collector/OTLP metrics work exists, but docs and internal references repeatedly steer production custom metrics toward Datadog OpenMetrics scraping; OTEL collector/SDK migration is still more situational. I would phrase it as “not abandoned, but less mature/paved internally than Prometheus/OpenMetrics → Datadog Agent.”
  • For tracing, the companion infra PR’s target endpoint matches existing in-cluster usage in block-coder-tf-stacks (cachew and blox-orchestrator already point at http://datadog-agent.datadog-agent.svc.cluster.local:4317). So using the Datadog Agent OTLP gRPC receiver for staging traces is directionally reasonable, but I’d avoid coupling metrics to OTLP unless there’s a specific reason.

Net: Will’s stated refactor direction — Prometheus/OpenMetrics for relay health/activity metrics, Datadog Agent scrape from pods, OTEL only where explicitly needed (e.g. traces behind OTEL_EXPORTER_OTLP_ENDPOINT) — matches the internal paved path better than a full OTEL metrics migration.

…e, keep OTLP tracing

Drop the OpenTelemetry metrics exporter (opentelemetry-prometheus, prometheus
crate, OTLP metric push) in favour of the original metrics-rs facade
(metrics::counter!/gauge!/histogram! macros) backed by metrics-exporter-prometheus.

OTLP tracing (telemetry.rs, tracing-opentelemetry, OTEL trace spans in
connection/event/auth handlers) is intentionally preserved — only the metrics
path is reverted.

Changes:
- Cargo.toml: restore metrics + metrics-exporter-prometheus workspace deps;
  strip metrics/logs features from opentelemetry, opentelemetry_sdk,
  opentelemetry-otlp, tracing-opentelemetry; remove opentelemetry-prometheus
  and prometheus = "0.14"
- metrics.rs: restore PrometheusBuilder-based setup verbatim from origin/main
- All call sites: crate::metrics::metrics().<handle>.add/record -> metrics-rs
  macros (counter!/gauge!/histogram!) across connection, auth, event, count,
  state, subscription, bridge, media
- main.rs: fanout_lag + cache_inval_lag background consumers converted to
  metrics::counter!; pool gauge section (net-new, kept per spec) converted from
  relay_metrics::meter() OTEL API to metrics::gauge!(...).set(); meter_provider
  shutdown removed; telemetry comment updated to trace-only
- telemetry.rs: doc comment updated — Resource is for trace provider only
- Cargo.lock: opentelemetry-prometheus and prometheus crates removed

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 changed the title from "feat(relay): migrate telemetry to OpenTelemetry with OTLP and Prometheus export" to "feat(relay): add OpenTelemetry tracing, keep Prometheus metrics" Jun 30, 2026
