Skip to content

fix: non aggregator readiness#3376

Merged
tac0turtle merged 1 commit into
mainfrom
auricom/fix_nonaggregator_readiness
Jul 1, 2026
Merged

fix: non aggregator readiness#3376
tac0turtle merged 1 commit into
mainfrom
auricom/fix_nonaggregator_readiness

Conversation

@auricom

@auricom auricom commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Overview

Summary by CodeRabbit

  • Bug Fixes
    • Improved /health/ready checks for non-aggregator nodes so readiness now reflects how long it has been since the last block was produced.
    • Returns 503 UNREADY: node not executing blocks when the delay exceeds the configured readiness window.
    • Uses a default fallback threshold when the readiness window is not set, helping avoid false-ready responses.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The /health/ready endpoint's readiness logic is extended for non-aggregator nodes to check execution liveness based on a configurable ReadinessWindowSeconds (default 15s fallback), returning 503 when LastBlockTime exceeds the allowed delay. A corresponding test validates the new behavior.

Changes

Readiness Endpoint Liveness Check

Layer / File(s) Summary
Non-aggregator readiness window logic
pkg/rpc/server/http.go
Updated endpoint documentation to distinguish aggregator vs non-aggregator readiness behavior, and added logic to check state.LastBlockTime against a configurable ReadinessWindowSeconds (falling back to 15s), returning 503 UNREADY when exceeded.
Readiness window test coverage
pkg/rpc/server/http_test.go
Added TestHealthReady_nonAggregatorBlockDelay, a table-driven test covering within-window, exceeded-window, and zero-window fallback cases for the new readiness check.

Estimated code review effort: 2 (Simple) | ~10 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant HTTPHandler as /health/ready
  participant Config as cfg.Node
  participant State

  Client->>HTTPHandler: GET /health/ready
  HTTPHandler->>Config: read ReadinessWindowSeconds
  Config-->>HTTPHandler: value (0 -> fallback 15s)
  HTTPHandler->>State: GetState().LastBlockTime
  State-->>HTTPHandler: LastBlockTime
  HTTPHandler->>HTTPHandler: compute time.Since(LastBlockTime) vs maxAllowedDelay
  alt delay exceeded
    HTTPHandler-->>Client: 503 UNREADY: node not executing blocks
  else within window
    HTTPHandler-->>Client: 200 READY
  end
Loading
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description is just the template stub and does not provide the required Overview, context, goal, or rationale. Add an Overview with context, goal, rationale, and any linked issue or tl;dr per the template.
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title matches the main change: fixing readiness behavior for non-aggregator nodes.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch auricom/fix_nonaggregator_readiness

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed⏩ skippedJul 1, 2026, 7:15 AM

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/rpc/server/http.go (1)

138-147: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Add structured logging for non-aggregator readiness failures.

The liveness endpoint logs store access errors (line 60), but readiness failures are silent. Log the readiness failure with LastBlockTime and maxAllowedDelay to aid production debugging.

 			if time.Since(state.LastBlockTime) > maxAllowedDelay {
+				logger.Warn().
+					Time("last_block_time", state.LastBlockTime).
+					Dur("max_allowed_delay", maxAllowedDelay).
+					Msg("Readiness check failed: node not executing blocks")
 				http.Error(w, "UNREADY: node not executing blocks", http.StatusServiceUnavailable)
 				return
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rpc/server/http.go` around lines 138 - 147, The non-aggregator readiness
check in the HTTP handler is returning UNREADY without any log output, unlike
the store access error path. Update the readiness-failure branch in the HTTP
liveness/readiness handler to emit a structured log when
`time.Since(state.LastBlockTime) > maxAllowedDelay`, including `LastBlockTime`
and `maxAllowedDelay` so failures can be diagnosed in production. Use the
existing HTTP handler context around `state`, `cfg.Node.ReadinessWindowSeconds`,
and `http.Error` to place the log immediately before returning the 503.
pkg/rpc/server/http_test.go (1)

246-265: 🎯 Functional Correctness | 🔵 Trivial | 💤 Low value

Consider adding edge cases for zero and future LastBlockTime.

A time.Time{} zero value (node never executed blocks) and a future LastBlockTime (clock skew) would exercise defensive behavior of the readiness window check.

 	specs := map[string]spec{
 		...
+		"zero last block time never executed blocks": {
+			readinessWindowSeconds: 10,
+			delay:                  0, // use time.Time{} directly in test setup
+			expStatusCode:          http.StatusServiceUnavailable,
+			expBody:                "UNREADY: node not executing blocks\n",
+		},
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rpc/server/http_test.go` around lines 246 - 265, Add edge-case coverage
to the readiness-window tests in http_test.go for the readiness check around the
readiness window logic, especially the table driven cases near specs and the
readiness handler under test. Include a case where LastBlockTime is the zero
time to verify the node is treated as unready, and another where LastBlockTime
is in the future to cover clock-skew behavior. Update the expected status/body
assertions to match the readiness response produced by the relevant readiness
check function.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/rpc/server/http.go`:
- Around line 138-147: Guard the readiness timeout calculation against
uint64-to-time.Duration overflow in the HTTP readiness path. In the readiness
check inside the server logic, before computing maxAllowedDelay from
cfg.Node.ReadinessWindowSeconds, cap the value or saturate it to a safe maximum
so the conversion cannot wrap negative; keep the existing 0 fallback and
preserve the subsequent time.Since(state.LastBlockTime) comparison.

---

Nitpick comments:
In `@pkg/rpc/server/http_test.go`:
- Around line 246-265: Add edge-case coverage to the readiness-window tests in
http_test.go for the readiness check around the readiness window logic,
especially the table driven cases near specs and the readiness handler under
test. Include a case where LastBlockTime is the zero time to verify the node is
treated as unready, and another where LastBlockTime is in the future to cover
clock-skew behavior. Update the expected status/body assertions to match the
readiness response produced by the relevant readiness check function.

In `@pkg/rpc/server/http.go`:
- Around line 138-147: The non-aggregator readiness check in the HTTP handler is
returning UNREADY without any log output, unlike the store access error path.
Update the readiness-failure branch in the HTTP liveness/readiness handler to
emit a structured log when `time.Since(state.LastBlockTime) > maxAllowedDelay`,
including `LastBlockTime` and `maxAllowedDelay` so failures can be diagnosed in
production. Use the existing HTTP handler context around `state`,
`cfg.Node.ReadinessWindowSeconds`, and `http.Error` to place the log immediately
before returning the 503.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8076c0f7-778f-49b9-98dd-3462c32a0d00

📥 Commits

Reviewing files that changed from the base of the PR and between 9e400f9 and 0d830fb.

📒 Files selected for processing (2)
  • pkg/rpc/server/http.go
  • pkg/rpc/server/http_test.go

Comment thread pkg/rpc/server/http.go
Comment on lines +138 to +147
} else {
readinessWindowSeconds := cfg.Node.ReadinessWindowSeconds
if readinessWindowSeconds == 0 {
readinessWindowSeconds = 15 // fallback to default 15s window
}
maxAllowedDelay := time.Duration(readinessWindowSeconds) * time.Second
if time.Since(state.LastBlockTime) > maxAllowedDelay {
http.Error(w, "UNREADY: node not executing blocks", http.StatusServiceUnavailable)
return
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Guard against uint64→Duration overflow edge case.

time.Duration(readinessWindowSeconds) * time.Second can wrap to a negative maxAllowedDelay if ReadinessWindowSeconds exceeds int64 nanosecond capacity (~292 years), causing permanent UNREADY. Add an upper-bound check or saturate before conversion.

 			readinessWindowSeconds := cfg.Node.ReadinessWindowSeconds
 			if readinessWindowSeconds == 0 {
 				readinessWindowSeconds = 15 // fallback to default 15s window
 			}
+			const maxReadinessWindowSeconds = uint64(292 * 365 * 24 * 60 * 60) // ~292 years
+			if readinessWindowSeconds > maxReadinessWindowSeconds {
+				readinessWindowSeconds = maxReadinessWindowSeconds
+			}
 			maxAllowedDelay := time.Duration(readinessWindowSeconds) * time.Second
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} else {
readinessWindowSeconds := cfg.Node.ReadinessWindowSeconds
if readinessWindowSeconds == 0 {
readinessWindowSeconds = 15 // fallback to default 15s window
}
maxAllowedDelay := time.Duration(readinessWindowSeconds) * time.Second
if time.Since(state.LastBlockTime) > maxAllowedDelay {
http.Error(w, "UNREADY: node not executing blocks", http.StatusServiceUnavailable)
return
}
readinessWindowSeconds := cfg.Node.ReadinessWindowSeconds
if readinessWindowSeconds == 0 {
readinessWindowSeconds = 15 // fallback to default 15s window
}
const maxReadinessWindowSeconds = uint64(292 * 365 * 24 * 60 * 60) // ~292 years
if readinessWindowSeconds > maxReadinessWindowSeconds {
readinessWindowSeconds = maxReadinessWindowSeconds
}
maxAllowedDelay := time.Duration(readinessWindowSeconds) * time.Second
if time.Since(state.LastBlockTime) > maxAllowedDelay {
http.Error(w, "UNREADY: node not executing blocks", http.StatusServiceUnavailable)
return
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rpc/server/http.go` around lines 138 - 147, Guard the readiness timeout
calculation against uint64-to-time.Duration overflow in the HTTP readiness path.
In the readiness check inside the server logic, before computing maxAllowedDelay
from cfg.Node.ReadinessWindowSeconds, cap the value or saturate it to a safe
maximum so the conversion cannot wrap negative; keep the existing 0 fallback and
preserve the subsequent time.Since(state.LastBlockTime) comparison.

@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.61%. Comparing base (9e400f9) to head (0d830fb).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3376      +/-   ##
==========================================
- Coverage   60.66%   60.61%   -0.05%     
==========================================
  Files         129      129              
  Lines       14205    14213       +8     
==========================================
- Hits         8617     8615       -2     
- Misses       4635     4642       +7     
- Partials      953      956       +3     
Flag Coverage Δ
combined 60.61% <100.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@auricom auricom changed the title feat: benchmarks container image fix: non aggregator readiness Jul 1, 2026
@tac0turtle tac0turtle merged commit e23f9b8 into main Jul 1, 2026
35 checks passed
@tac0turtle tac0turtle deleted the auricom/fix_nonaggregator_readiness branch July 1, 2026 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants