fix: non aggregator readiness#3376
Conversation
📝 WalkthroughWalkthroughThe /health/ready endpoint's readiness logic is extended for non-aggregator nodes to check execution liveness based on a configurable ReadinessWindowSeconds (default 15s fallback), returning 503 when LastBlockTime exceeds the allowed delay. A corresponding test validates the new behavior. ChangesReadiness Endpoint Liveness Check
Estimated code review effort: 2 (Simple) | ~10 minutes Sequence Diagram(s)sequenceDiagram
participant Client
participant HTTPHandler as /health/ready
participant Config as cfg.Node
participant State
Client->>HTTPHandler: GET /health/ready
HTTPHandler->>Config: read ReadinessWindowSeconds
Config-->>HTTPHandler: value (0 -> fallback 15s)
HTTPHandler->>State: GetState().LastBlockTime
State-->>HTTPHandler: LastBlockTime
HTTPHandler->>HTTPHandler: compute time.Since(LastBlockTime) vs maxAllowedDelay
alt delay exceeded
HTTPHandler-->>Client: 503 UNREADY: node not executing blocks
else within window
HTTPHandler-->>Client: 200 READY
end
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/rpc/server/http.go (1)
138-147: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winAdd structured logging for non-aggregator readiness failures.
The liveness endpoint logs store access errors (line 60), but readiness failures are silent. Log the readiness failure with
LastBlockTimeandmaxAllowedDelayto aid production debugging.if time.Since(state.LastBlockTime) > maxAllowedDelay { + logger.Warn(). + Time("last_block_time", state.LastBlockTime). + Dur("max_allowed_delay", maxAllowedDelay). + Msg("Readiness check failed: node not executing blocks") http.Error(w, "UNREADY: node not executing blocks", http.StatusServiceUnavailable) return }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/rpc/server/http.go` around lines 138 - 147, The non-aggregator readiness check in the HTTP handler is returning UNREADY without any log output, unlike the store access error path. Update the readiness-failure branch in the HTTP liveness/readiness handler to emit a structured log when `time.Since(state.LastBlockTime) > maxAllowedDelay`, including `LastBlockTime` and `maxAllowedDelay` so failures can be diagnosed in production. Use the existing HTTP handler context around `state`, `cfg.Node.ReadinessWindowSeconds`, and `http.Error` to place the log immediately before returning the 503.pkg/rpc/server/http_test.go (1)
246-265: 🎯 Functional Correctness | 🔵 Trivial | 💤 Low valueConsider adding edge cases for zero and future
LastBlockTime.A
time.Time{}zero value (node never executed blocks) and a futureLastBlockTime(clock skew) would exercise defensive behavior of the readiness window check.specs := map[string]spec{ ... + "zero last block time never executed blocks": { + readinessWindowSeconds: 10, + delay: 0, // use time.Time{} directly in test setup + expStatusCode: http.StatusServiceUnavailable, + expBody: "UNREADY: node not executing blocks\n", + }, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/rpc/server/http_test.go` around lines 246 - 265, Add edge-case coverage to the readiness-window tests in http_test.go for the readiness check around the readiness window logic, especially the table driven cases near specs and the readiness handler under test. Include a case where LastBlockTime is the zero time to verify the node is treated as unready, and another where LastBlockTime is in the future to cover clock-skew behavior. Update the expected status/body assertions to match the readiness response produced by the relevant readiness check function.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/rpc/server/http.go`:
- Around line 138-147: Guard the readiness timeout calculation against
uint64-to-time.Duration overflow in the HTTP readiness path. In the readiness
check inside the server logic, before computing maxAllowedDelay from
cfg.Node.ReadinessWindowSeconds, cap the value or saturate it to a safe maximum
so the conversion cannot wrap negative; keep the existing 0 fallback and
preserve the subsequent time.Since(state.LastBlockTime) comparison.
---
Nitpick comments:
In `@pkg/rpc/server/http_test.go`:
- Around line 246-265: Add edge-case coverage to the readiness-window tests in
http_test.go for the readiness check around the readiness window logic,
especially the table driven cases near specs and the readiness handler under
test. Include a case where LastBlockTime is the zero time to verify the node is
treated as unready, and another where LastBlockTime is in the future to cover
clock-skew behavior. Update the expected status/body assertions to match the
readiness response produced by the relevant readiness check function.
In `@pkg/rpc/server/http.go`:
- Around line 138-147: The non-aggregator readiness check in the HTTP handler is
returning UNREADY without any log output, unlike the store access error path.
Update the readiness-failure branch in the HTTP liveness/readiness handler to
emit a structured log when `time.Since(state.LastBlockTime) > maxAllowedDelay`,
including `LastBlockTime` and `maxAllowedDelay` so failures can be diagnosed in
production. Use the existing HTTP handler context around `state`,
`cfg.Node.ReadinessWindowSeconds`, and `http.Error` to place the log immediately
before returning the 503.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8076c0f7-778f-49b9-98dd-3462c32a0d00
📒 Files selected for processing (2)
pkg/rpc/server/http.gopkg/rpc/server/http_test.go
| } else { | ||
| readinessWindowSeconds := cfg.Node.ReadinessWindowSeconds | ||
| if readinessWindowSeconds == 0 { | ||
| readinessWindowSeconds = 15 // fallback to default 15s window | ||
| } | ||
| maxAllowedDelay := time.Duration(readinessWindowSeconds) * time.Second | ||
| if time.Since(state.LastBlockTime) > maxAllowedDelay { | ||
| http.Error(w, "UNREADY: node not executing blocks", http.StatusServiceUnavailable) | ||
| return | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Guard against uint64→Duration overflow edge case.
time.Duration(readinessWindowSeconds) * time.Second can wrap to a negative maxAllowedDelay if ReadinessWindowSeconds exceeds int64 nanosecond capacity (~292 years), causing permanent UNREADY. Add an upper-bound check or saturate before conversion.
readinessWindowSeconds := cfg.Node.ReadinessWindowSeconds
if readinessWindowSeconds == 0 {
readinessWindowSeconds = 15 // fallback to default 15s window
}
+ const maxReadinessWindowSeconds = uint64(292 * 365 * 24 * 60 * 60) // ~292 years
+ if readinessWindowSeconds > maxReadinessWindowSeconds {
+ readinessWindowSeconds = maxReadinessWindowSeconds
+ }
maxAllowedDelay := time.Duration(readinessWindowSeconds) * time.Second📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } else { | |
| readinessWindowSeconds := cfg.Node.ReadinessWindowSeconds | |
| if readinessWindowSeconds == 0 { | |
| readinessWindowSeconds = 15 // fallback to default 15s window | |
| } | |
| maxAllowedDelay := time.Duration(readinessWindowSeconds) * time.Second | |
| if time.Since(state.LastBlockTime) > maxAllowedDelay { | |
| http.Error(w, "UNREADY: node not executing blocks", http.StatusServiceUnavailable) | |
| return | |
| } | |
| readinessWindowSeconds := cfg.Node.ReadinessWindowSeconds | |
| if readinessWindowSeconds == 0 { | |
| readinessWindowSeconds = 15 // fallback to default 15s window | |
| } | |
| const maxReadinessWindowSeconds = uint64(292 * 365 * 24 * 60 * 60) // ~292 years | |
| if readinessWindowSeconds > maxReadinessWindowSeconds { | |
| readinessWindowSeconds = maxReadinessWindowSeconds | |
| } | |
| maxAllowedDelay := time.Duration(readinessWindowSeconds) * time.Second | |
| if time.Since(state.LastBlockTime) > maxAllowedDelay { | |
| http.Error(w, "UNREADY: node not executing blocks", http.StatusServiceUnavailable) | |
| return | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/rpc/server/http.go` around lines 138 - 147, Guard the readiness timeout
calculation against uint64-to-time.Duration overflow in the HTTP readiness path.
In the readiness check inside the server logic, before computing maxAllowedDelay
from cfg.Node.ReadinessWindowSeconds, cap the value or saturate it to a safe
maximum so the conversion cannot wrap negative; keep the existing 0 fallback and
preserve the subsequent time.Since(state.LastBlockTime) comparison.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3376 +/- ##
==========================================
- Coverage 60.66% 60.61% -0.05%
==========================================
Files 129 129
Lines 14205 14213 +8
==========================================
- Hits 8617 8615 -2
- Misses 4635 4642 +7
- Partials 953 956 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Overview
Summary by CodeRabbit
/health/readychecks for non-aggregator nodes so readiness now reflects how long it has been since the last block was produced.503 UNREADY: node not executing blockswhen the delay exceeds the configured readiness window.