feat(observability): backend metrics + tracing primitives #5376
Ma77Ball wants to merge 50 commits into
…, SDK bootstrap (default-off) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tracing primitives Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #5376 +/- ##
============================================
- Coverage 55.27% 55.22% -0.06%
- Complexity 2991 2999 +8
============================================
Files 1117 1123 +6
Lines 43258 43401 +143
Branches 4668 4701 +33
============================================
+ Hits 23912 23969 +57
- Misses 17938 18019 +81
- Partials 1408 1413 +5
| | config | throughput (tuples/sec) | MB/s | latency p50/p95/p99 | max Δ latest / 7d |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 383 | 0.234 | 26,339/32,648/32,648 us | 🔴 +18.5% / 🔴 +111.1% |
| 🔴 | bs=100 sw=10 sl=64 | 806 | 0.492 | 123,598/140,837/140,837 us | 🔴 +13.3% / 🔴 +28.0% |
| ⚪ | bs=1000 sw=10 sl=64 | 929 | 0.567 | 1,079,342/1,117,970/1,117,970 us | ⚪ within ±5% / 🔴 +6.3% |
Baseline details
Latest main 0a5693c from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 383 tuples/sec | 436 tuples/sec | 756.6 tuples/sec | -12.2% | -49.4% |
| bs=10 sw=10 sl=64 | MB/s | 0.234 MB/s | 0.266 MB/s | 0.462 MB/s | -12.0% | -49.3% |
| bs=10 sw=10 sl=64 | p50 | 26,339 us | 22,227 us | 13,009 us | +18.5% | +102.5% |
| bs=10 sw=10 sl=64 | p95 | 32,648 us | 32,197 us | 15,463 us | +1.4% | +111.1% |
| bs=10 sw=10 sl=64 | p99 | 32,648 us | 32,197 us | 18,561 us | +1.4% | +75.9% |
| bs=100 sw=10 sl=64 | throughput | 806 tuples/sec | 860 tuples/sec | 963.83 tuples/sec | -6.3% | -16.4% |
| bs=100 sw=10 sl=64 | MB/s | 0.492 MB/s | 0.525 MB/s | 0.588 MB/s | -6.3% | -16.4% |
| bs=100 sw=10 sl=64 | p50 | 123,598 us | 117,964 us | 103,320 us | +4.8% | +19.6% |
| bs=100 sw=10 sl=64 | p95 | 140,837 us | 124,352 us | 110,058 us | +13.3% | +28.0% |
| bs=100 sw=10 sl=64 | p99 | 140,837 us | 124,352 us | 118,543 us | +13.3% | +18.8% |
| bs=1000 sw=10 sl=64 | throughput | 929 tuples/sec | 926 tuples/sec | 989.07 tuples/sec | +0.3% | -6.1% |
| bs=1000 sw=10 sl=64 | MB/s | 0.567 MB/s | 0.565 MB/s | 0.604 MB/s | +0.4% | -6.1% |
| bs=1000 sw=10 sl=64 | p50 | 1,079,342 us | 1,078,611 us | 1,015,599 us | +0.1% | +6.3% |
| bs=1000 sw=10 sl=64 | p95 | 1,117,970 us | 1,130,759 us | 1,055,944 us | -1.1% | +5.9% |
| bs=1000 sw=10 sl=64 | p99 | 1,117,970 us | 1,130,759 us | 1,086,834 us | -1.1% | +2.9% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,522.27,200,128000,383,0.234,26338.79,32648.39,32648.39
1,100,10,64,20,2481.83,2000,1280000,806,0.492,123598.07,140837.38,140837.38
2,1000,10,64,20,21520.79,20000,12800000,929,0.567,1079341.88,1117969.71,1117969.71
… obs/pr2/backend-emit
… obs/pr2/backend-emit
Call OtelInit.init(<service.name>) in each service main so that its logs bridge to the OTel collector under its own service.name; cap noisy framework loggers (pekko/iceberg/hadoop/kafka/jetty/jersey/grpc/netty/hikari/awssdk) at WARN in each service config.
Services: access-control, config, file, computing-unit-managing, workflow-compiling, computing-unit-master, texera-web, amber.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e span
- WorkflowMetricsRecorder: emit workflow lifecycle metrics keyed by execution, driven from the ExecutionStateStore state-transition chokepoint; registered via WorkflowMetricsRecorder.init() in ComputingUnitMaster
- WorkflowService: wrap initExecutionService in a run-level TexeraTracer span so setup-path logs carry the trace id
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… obs/pr1/foundations
…o obs/pr2/backend-emit
… obs/pr2/backend-emit
Reduced the comments to drop design details and other information not needed in the codebase.
… obs/pr1/foundations
… obs/pr1/foundations
… obs/pr2/backend-emit
What changes were proposed in this PR?
Adds the libraries services use to emit metrics and distributed traces. These stay dormant until the SDK is enabled in PR1.
- TexeraMetrics: metric instruments for workflow lifecycle and throughput. The active-executions metric is modeled as an observable gauge read from the live registry, correcting the phantom-active count a manual up/down counter produced.
- TexeraTracer and SpanAttrs: helpers for creating spans and attaching standardized attributes.
- TraceparentValidator: parses and validates W3C traceparent headers for context propagation.
- Config module; no behavior changes when telemetry is disabled.
Any related issues, documentation, or discussions?
Closes: #5368
Part of #4070. Stacked on #5375.
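For readers following the active-executions change in the description, here is a minimal, standard-library-only sketch of why a value read from the live registry avoids the phantom-active count that a manual up/down counter can produce. It is an illustration under stated assumptions, not the PR's TexeraMetrics code: `ActiveExecutionsSketch`, `registry`, `manualCounter`, `observeActive`, and the state strings are all hypothetical names, and a real implementation would presumably register the read as an OpenTelemetry observable gauge callback rather than call it directly.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical sketch; none of these names come from the PR.
object ActiveExecutionsSketch {
  // Assumed live registry: execution id -> current state.
  val registry: TrieMap[Long, String] = TrieMap.empty[Long, String]

  // Manual up/down counter: correct only if every exit path decrements it.
  var manualCounter: Long = 0L

  // Observable-gauge style: the value is computed from the registry at read time,
  // so it cannot drift away from what the registry actually holds.
  def observeActive(): Long =
    registry.count { case (_, state) => state == "RUNNING" }.toLong

  def start(id: Long): Unit = {
    registry.put(id, "RUNNING")
    manualCounter += 1
  }

  def complete(id: Long): Unit = {
    registry.put(id, "COMPLETED")
    manualCounter -= 1
  }

  // An exit path that removes the execution but forgets the decrement.
  def evict(id: Long): Unit = {
    registry.remove(id)
    ()
  }
}
```

After `start(1)`, `start(2)`, `complete(1)`, and `evict(2)`, `observeActive()` returns 0 while `manualCounter` is still 1. That leftover 1 is the kind of phantom-active count the description refers to.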
How was this PR tested?
sbt scalafmtCheckAll passes; compile and tests run in this PR's CI.
Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF
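
As a reviewer aid for the TraceparentValidator item in the description, the following is a rough sketch of what validating a W3C `traceparent` header involves. It is not the PR's implementation: `TraceparentSketch` and `TraceParent` are hypothetical names, and the sketch accepts only the strict four-field form (`version-traceid-parentid-flags`), so it rejects future versions that append extra fields.

```scala
// Hypothetical sketch; not the TraceparentValidator added by this PR.
final case class TraceParent(version: String, traceId: String, parentId: String, flags: String) {
  // Bit 0 of the flags byte is the "sampled" flag.
  def sampled: Boolean = (Integer.parseInt(flags, 16) & 0x01) == 1
}

object TraceparentSketch {
  // Lowercase hex only: 2-char version, 32-char trace id, 16-char parent id, 2-char flags.
  private val Pattern =
    "^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$".r

  def parse(header: String): Option[TraceParent] =
    Option(header).map(_.trim).flatMap {
      case Pattern(version, traceId, parentId, flags)
          // Version ff is forbidden; all-zero trace and parent ids are invalid.
          if version != "ff" && traceId.exists(_ != '0') && parentId.exists(_ != '0') =>
        Some(TraceParent(version, traceId, parentId, flags))
      case _ => None
    }
}
```

For example, `TraceparentSketch.parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")` yields a `TraceParent` with `sampled` set, while an all-zero trace id, uppercase hex, or a `null` header yields `None`. Returning `Option` rather than throwing lets a caller fall back to starting a fresh trace when the incoming header is malformed.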