
feat(observability): backend metrics + tracing primitives #5376

Draft
Ma77Ball wants to merge 50 commits into apache:main from Ma77Ball:obs/pr2/backend-emit
Conversation

@Ma77Ball Ma77Ball commented Jun 5, 2026

Contributor

What changes were proposed in this PR?

Adds the libraries services use to emit metrics and distributed traces. These stay dormant until the SDK is enabled in PR1.

  • TexeraMetrics: metric instruments for workflow lifecycle and throughput. The active-executions metric is modeled as an observable gauge read from the live registry, correcting the phantom-active count a manual up/down counter produced.
  • TexeraTracer and SpanAttrs: helpers for creating spans and attaching standardized attributes.
  • TraceparentValidator: parses and validates W3C traceparent headers for context propagation.
  • Pure additions to the Config module; no behavior changes when telemetry is disabled.
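
As an illustration of the traceparent check described in the list above, here is a minimal sketch of W3C `traceparent` validation. It is not the PR's `TraceparentValidator`: the names `TraceContext` and `TraceparentSketch` are hypothetical, and the sketch assumes only version `00` headers are accepted (lowercase hex fields, with all-zero trace and parent ids rejected).

```scala
// Hypothetical sketch; names and scope are assumptions, not the PR's code.
final case class TraceContext(traceId: String, parentId: String, sampled: Boolean)

object TraceparentSketch {
  // Version 00 layout: "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags.
  // Only lowercase hex digits are accepted.
  private val Version00 = "00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})".r

  def parse(header: String): Option[TraceContext] =
    Option(header).flatMap {
      // All-zero trace-id or parent-id values are treated as invalid.
      case Version00(traceId, parentId, flags)
          if traceId.exists(_ != '0') && parentId.exists(_ != '0') =>
        // The lowest bit of the flags byte is the "sampled" flag.
        val sampled = (Integer.parseInt(flags, 16) & 0x01) == 1
        Some(TraceContext(traceId, parentId, sampled))
      case _ => None
    }
}
```

Returning `Option` rather than throwing lets a caller fall back to starting a fresh trace when an incoming header is missing or malformed.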

Any related issues, documentation, or discussions?

Closes: #5368
Part of #4070. Stacked on #5375.

How was this PR tested?

  • Unit specs for the metrics, tracer, span-attribute, and traceparent-validator classes.
  • sbt scalafmtCheckAll passes; compile and tests run in this PR's CI.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

Ma77Ball and others added 2 commits June 5, 2026 04:49
…, SDK bootstrap (default-off)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tracing primitives

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov-commenter commented Jun 5, 2026


Codecov Report

❌ Patch coverage is 68.51852% with 119 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.22%. Comparing base (0a5693c) to head (b311b57).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ala/org/apache/texera/observability/OtelInit.scala | 63.30% | 43 Missing and 8 partials ⚠️ |
| ...org/apache/texera/observability/TexeraTracer.scala | 0.00% | 14 Missing ⚠️ |
| ...rg/apache/texera/observability/TexeraMetrics.scala | 84.93% | 4 Missing and 7 partials ⚠️ |
| ...ra/web/observability/WorkflowMetricsRecorder.scala | 44.44% | 8 Missing and 2 partials ⚠️ |
| ...la/org/apache/texera/observability/SpanAttrs.scala | 66.66% | 8 Missing and 2 partials ⚠️ |
| ...e/texera/observability/TexeraOtelLogAppender.scala | 72.22% | 4 Missing and 6 partials ⚠️ |
| ...rg/apache/texera/web/service/WorkflowService.scala | 0.00% | 6 Missing ⚠️ |
| ...la/org/apache/texera/web/ComputingUnitMaster.scala | 0.00% | 2 Missing ⚠️ |
| ...org/apache/texera/observability/LogSanitizer.scala | 96.42% | 0 Missing and 1 partial ⚠️ |
| .../texera/service/ComputingUnitManagingService.scala | 0.00% | 1 Missing ⚠️ |
| ... and 3 more | | |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5376      +/-   ##
============================================
- Coverage     55.27%   55.22%   -0.06%     
- Complexity     2991     2999       +8     
============================================
  Files          1117     1123       +6     
  Lines         43258    43401     +143     
  Branches       4668     4701      +33     
============================================
+ Hits          23912    23969      +57     
- Misses        17938    18019      +81     
- Partials       1408     1413       +5     

| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| access-control-service | `70.14% <100.00%> (+0.14%)` ⬆️ | |
| agent-service | `34.36% <ø> (ø)` | Carriedforward from 43ca4b2 |
| amber | `58.01% <69.16%> (+0.21%)` ⬆️ | |
| computing-unit-managing-service | `0.00% <0.00%> (ø)` | |
| config-service | `50.76% <0.00%> (-0.80%)` ⬇️ | |
| file-service | `58.88% <0.00%> (-0.15%)` ⬇️ | |
| frontend | `48.42% <ø> (-0.46%)` ⬇️ | Carriedforward from 43ca4b2 |
| notebook-migration-service | `78.57% <ø> (ø)` | |
| pyamber | `90.20% <ø> (ø)` | Carriedforward from 43ca4b2 |
| python | `90.76% <ø> (ø)` | Carriedforward from 43ca4b2 |
| workflow-compiling-service | `54.74% <0.00%> (-0.41%)` ⬇️ | |

*This pull request uses carry forward flags. Click here to find out more.

github-actions Bot added the platform label (Non-amber Scala service paths) Jun 5, 2026

github-actions Bot commented Jun 12, 2026

Contributor

⚠️ Benchmark changes need a look

🟢 0 better · 🔴 7 worse · ⚪ 8 noise (<±5%) · 0 without baseline

Compared against main 0a5693c benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

| config | throughput (tuples/sec) | MB/s | latency (p50/p95/p99) | max Δ latest / 7d |
|---|---|---|---|---|
| 🔴 bs=10 sw=10 sl=64 | 383 | 0.234 | 26,339/32,648/32,648 us | 🔴 +18.5% / 🔴 +111.1% |
| 🔴 bs=100 sw=10 sl=64 | 806 | 0.492 | 123,598/140,837/140,837 us | 🔴 +13.3% / 🔴 +28.0% |
| bs=1000 sw=10 sl=64 | 929 | 0.567 | 1,079,342/1,117,970/1,117,970 us | ⚪ within ±5% / 🔴 +6.3% |

Baseline details

Latest main 0a5693c from same runner

| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 383 tuples/sec | 436 tuples/sec | 756.6 tuples/sec | -12.2% | -49.4% |
| bs=10 sw=10 sl=64 | MB/s | 0.234 MB/s | 0.266 MB/s | 0.462 MB/s | -12.0% | -49.3% |
| bs=10 sw=10 sl=64 | p50 | 26,339 us | 22,227 us | 13,009 us | +18.5% | +102.5% |
| bs=10 sw=10 sl=64 | p95 | 32,648 us | 32,197 us | 15,463 us | +1.4% | +111.1% |
| bs=10 sw=10 sl=64 | p99 | 32,648 us | 32,197 us | 18,561 us | +1.4% | +75.9% |
| bs=100 sw=10 sl=64 | throughput | 806 tuples/sec | 860 tuples/sec | 963.83 tuples/sec | -6.3% | -16.4% |
| bs=100 sw=10 sl=64 | MB/s | 0.492 MB/s | 0.525 MB/s | 0.588 MB/s | -6.3% | -16.4% |
| bs=100 sw=10 sl=64 | p50 | 123,598 us | 117,964 us | 103,320 us | +4.8% | +19.6% |
| bs=100 sw=10 sl=64 | p95 | 140,837 us | 124,352 us | 110,058 us | +13.3% | +28.0% |
| bs=100 sw=10 sl=64 | p99 | 140,837 us | 124,352 us | 118,543 us | +13.3% | +18.8% |
| bs=1000 sw=10 sl=64 | throughput | 929 tuples/sec | 926 tuples/sec | 989.07 tuples/sec | +0.3% | -6.1% |
| bs=1000 sw=10 sl=64 | MB/s | 0.567 MB/s | 0.565 MB/s | 0.604 MB/s | +0.4% | -6.1% |
| bs=1000 sw=10 sl=64 | p50 | 1,079,342 us | 1,078,611 us | 1,015,599 us | +0.1% | +6.3% |
| bs=1000 sw=10 sl=64 | p95 | 1,117,970 us | 1,130,759 us | 1,055,944 us | -1.1% | +5.9% |
| bs=1000 sw=10 sl=64 | p99 | 1,117,970 us | 1,130,759 us | 1,086,834 us | -1.1% | +2.9% |

Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,522.27,200,128000,383,0.234,26338.79,32648.39,32648.39
1,100,10,64,20,2481.83,2000,1280000,806,0.492,123598.07,140837.38,140837.38
2,1000,10,64,20,21520.79,20000,12800000,929,0.567,1079341.88,1117969.71,1117969.71

Ma77Ball and others added 13 commits June 14, 2026 18:28
Call OtelInit.init(<service.name>) in each service main so its logs
bridge to the OTel collector under its own service.name; cap noisy
framework loggers (pekko/iceberg/hadoop/kafka/jetty/jersey/grpc/
netty/hikari/awssdk) at WARN in each service config.

Services: access-control, config, file, computing-unit-managing,
workflow-compiling, computing-unit-master, texera-web, amber.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e span

- WorkflowMetricsRecorder: emit workflow lifecycle metrics keyed by
  execution, driven from the ExecutionStateStore state-transition
  chokepoint; registered via WorkflowMetricsRecorder.init() in
  ComputingUnitMaster
- WorkflowService: wrap initExecutionService in a run-level TexeraTracer
  span so setup-path logs carry the trace id

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
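
To illustrate why driving the active-executions value from a live registry avoids a phantom count, here is a self-contained sketch with no OpenTelemetry dependency. All names (`ExecState`, `ActiveExecutionsSketch`, `onTransition`) are hypothetical and do not reflect the PR's `WorkflowMetricsRecorder` or `ExecutionStateStore`; the sketch assumes a single callback that sees every state transition and that a transition may be reported more than once.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical, simplified execution states for illustration only.
sealed trait ExecState
case object Running extends ExecState
case object Completed extends ExecState
case object Failed extends ExecState

final class ActiveExecutionsSketch {
  // Live registry of execution ids currently considered active.
  private val live = ConcurrentHashMap.newKeySet[String]()
  // Manual up/down counter, kept only to show how it can drift.
  private val manual = new AtomicLong(0L)

  // Single chokepoint: every state transition is assumed to pass through here.
  def onTransition(executionId: String, state: ExecState): Unit = state match {
    case Running =>
      live.add(executionId) // idempotent: a repeated Running event changes nothing
      manual.incrementAndGet() // not idempotent: a repeated event double-counts
      ()
    case Completed | Failed =>
      live.remove(executionId)
      manual.decrementAndGet()
      ()
  }

  // What an observable-gauge callback would read at collection time.
  def observedActive: Long = live.size.toLong

  // What a manual up/down counter would report.
  def manualActive: Long = manual.get()
}
```

If `Running` is reported twice for the same execution and then `Completed` once, `observedActive` returns to 0 while `manualActive` stays at 1. With the OpenTelemetry Java API, a value like `observedActive` would typically be reported from a callback registered through `meter.gaugeBuilder(...).ofLongs().buildWithCallback(...)`. This page does not show whether the PR wires it that way.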

github-actions Bot commented Jun 23, 2026

Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @bobbai00, @Yicong-Huang, @aglinxinyuan
    You can notify them by mentioning @bobbai00, @Yicong-Huang, @aglinxinyuan in a comment.

Labels

common · dependencies (Pull requests that update a dependency file) · engine · platform (Non-amber Scala service paths)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Observability] Backend metrics and tracing emission primitives

2 participants