feat(autobahn): Add epoch.Registry to maintain committee/stake (CON-358)#3632
feat(autobahn): Add epoch.Registry to maintain committee/stake (CON-358)#3632wen-coding wants to merge 104 commits into
Conversation
PR SummaryHigh Risk Overview Proposals and QCs now encode absolute global ranges on the message: Call sites across ledger/block tests, blocksim, and autobahn tests are updated to Reviewed by Cursor Bugbot for commit a455eaa. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
d764443 to
705d8c9
Compare
ab4c67c to
9cce5d2
Compare
f1f1d55 to
09e677e
Compare
| // EpochWindow returns the epoch→committee map for message acceptance across | ||
| // epoch transitions. With a fixed genesis committee it always returns a | ||
| // single entry. | ||
| func (r *Registry) EpochWindow() map[Index]*types.Committee { |
There was a problem hiding this comment.
what is "window"? I don't see it in the doc.
There was a problem hiding this comment.
This is mostly for the Epoch transition, we need to accept lane blocks for the current Epoch and possibly the Epoch before (repair blocks) or after it (others have got the commitQC to transition into new Epoch before me).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 6 total unresolved issues (including 5 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 06711dd. Configure here.
| lane := proposerKey.Public() | ||
|
|
||
| laneQC := makeLaneQC(rng, committee, keys, lane, MaxLaneRangeInProposal, GenBlockHeaderHash(rng)) | ||
| vs.Epoch = NewEpoch(0, OpenRoadRange(), time.Time{}, committee, 0) |
There was a problem hiding this comment.
ViewSpec nil Epoch test panic
Low Severity
TestNewProposalRejectsLaneRangeLongerThanMaxLaneRangeInProposal calls vs.View() while vs.Epoch is still nil. ViewSpec.View() reads vs.Epoch.EpochIndex(), so the test panics before it can assert on NewProposal.
Reviewed by Cursor Bugbot for commit 06711dd. Configure here.
There was a problem hiding this comment.
A large but well-structured refactor introducing epoch.Registry, baking firstBlock/epochIndex into proposals so GlobalRange() is self-contained on the wire, and binding epoch into signatures. Correct for the current single-epoch (genesis-only) operation; the only concerns are gaps in the not-yet-active multi-epoch transition path, which are intentionally deferred and tracked by TODOs.
Findings: 0 blocking | 6 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Codex flagged that
Registry.VerifyInWindowonly checksLatestEpoch()and thatpushCommitQCdoesn't rotatei.ep, so still-valid previous/next-epoch messages would be rejected during a transition. Both observations are accurate but are NOT live bugs in this PR:AddEpochis never called from production code (verified by grep), so the registry always holds exactly one epoch (index 0) andLatestEpoch()== the only epoch. Both gaps are already tracked by explicit TODOs (registry.go:113, inner.go:117/144, avail/state.go:276) and match the PR's stated 'wire the read side first' design. Recommend keeping them as tracked follow-up items rather than merge blockers, but ensure they are closed before dynamic epochs are enabled. - The proto/wire format change is intentionally hard-breaking:
Proposal.Decoderejects messages missingglobal_first, andView/AppProposalnow requireepoch_index. This is fine given Autobahn is pre-production and the code comment documents there is no rolling-upgrade path, but worth confirming no already-persisted consensus state (inner_a/inner_b.pb) from a prior build must be readable across this upgrade. - cursor-review.md was empty — the Cursor pass produced no output for this synthesis.
- Consider a test that exercises
AddEpochend-to-end (registry with 2+ epochs, then verifying a message from the older epoch) even while the transition is not wired, to lock in the intended window semantics before follow-up PRs rely on them. - 2 suggestion(s)/nit(s) flagged inline on specific lines.
| // Returns a slice of all matching epochs so callers can skip re-verification for any | ||
| // epoch already checked here. | ||
| // TODO: expand to neighbor epochs (previous and next) once multi-epoch transitions are wired up. | ||
| func (r *Registry) VerifyInWindow(fn func(*types.Committee) error) ([]*types.Epoch, error) { |
There was a problem hiding this comment.
[suggestion] VerifyInWindow only verifies against LatestEpoch(). Once AddEpoch is actually invoked (follow-up PR), messages/blocks whose QC belongs to a still-valid previous (or upcoming) epoch will be rejected here and by the epochInSet check in data/state.go:469. Not a live bug today since AddEpoch is never called in production, but this must be expanded to neighbor epochs (as the TODO notes) before dynamic epochs are enabled.
| }}) | ||
| // CommitQC advances to new index; clear all state for new view. | ||
| // TODO: rotate ep when epoch transitions are wired up. | ||
| iSend.Store(inner{persistedInner: persistedInner{CommitQC: utils.Some(qc)}, ep: i.ep}) |
There was a problem hiding this comment.
[suggestion] pushCommitQC verifies against i.ep and re-stores the same i.ep, so the first CommitQC of a newly-registered epoch would fail View.Verify's epoch-index check instead of rotating. Guarded today by the fact that epochs never rotate (AddEpoch unused), and the TODO on line 144 tracks it — flagging so it isn't lost when transitions are wired.
ep is context for the call, so it goes before prev per convention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the WAL has existing entries, override the constructor's firstBlock with the first loaded entry's value rather than trusting the parameter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
View() needs the epoch, so it belongs on inner which owns the epoch. persistedInner.validate() inlines the ViewSpec directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Look up the epoch from the QC (which is guaranteed present before PushBlock is called), verify outside the lock against that specific epoch, then insert. Removes the VerifyInWindow+epochInSet cross-check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
A large, well-tested refactor that consolidates genesis config into epoch.Registry and bakes firstBlock/epoch_index into proposals so GlobalRange() is self-contained and message epoch-binding is verifiable. No blocking correctness or security issues found; the epoch_index/global_first fields are covered by the proposal hash and validated on decode/verify, and attacker-supplied epoch indices are resolved via EpochByIndex (unknown indices rejected). A few non-blocking observations only.
Findings: 0 blocking | 6 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Second-opinion passes produced no material findings: Codex reported "No material findings" (noting it could not run Go tests due to a sandbox lacking Go 1.25.6), and the Cursor review file was empty. I independently reviewed the core logic and reached the same conclusion — no blockers.
- Wire/hash compatibility: adding epoch_index to the hashable View and global_first to the hashable Proposal (plus epoch_index on AppProposal) changes the hash of these messages, making them incompatible with any previously-encoded autobahn messages. The code comments and PR body acknowledge this is acceptable because Autobahn is pre-production with no rolling-upgrade path — flagging it so reviewers are explicitly aware there is no cross-version wire compatibility.
- epoch/registry.go: NewRegistry returns (*Registry, error) but never returns a non-nil error, and Registry methods use the
for range r.mu.RLock() { return ... } / panic("unreachable")idiom. Both are fine, but the vestigial error return on NewRegistry could be dropped to simplify call sites. - avail/state.go: the []*types.Epoch slice returned by Registry.VerifyInWindow is discarded at both call sites (PushBlock, PushVote via
if _, err := ...). The re-verification-skipping optimization the return value is meant to enable is currently unused; acceptable while single-epoch but currently dead API surface. - Registry.AddEpoch has no guard against startRoad <= the highest already-committed road index (only startRoad <= latest.Roads().First is checked). This is explicitly called out with a TODO and AddEpoch is not yet wired to any caller, so it is not currently reachable — but it must be addressed before dynamic epochs go live to prevent retroactively reassigning finalized roads.
- Several epoch-transition paths are intentionally stubbed with consistent TODOs (persisted_inner restart after a transition assumes LatestEpoch; ep rotation in pushCommitQC/PushCommitQC; per-epoch weight counting in avail block votes). The read-side-first approach is reasonable and the TODOs are clear; just ensure follow-up PRs close them before enabling multi-epoch.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ison Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
A large but careful, well-tested refactor extracting genesis config into a new epoch.Registry/Epoch and binding epoch_index through Autobahn message types; no blocking correctness or security issues found, only minor stylistic/consistency notes and the expected TODO-gated dynamic-epoch stubs.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Second-opinion passes produced no findings: OpenAI Codex reported "No material findings" (and flagged that it could not run the targeted Go tests because Go 1.25.6 toolchain download was network-blocked), and the Cursor review file was empty (no output produced). Neither surfaced anything to merge in.
- Dynamic-epoch support is intentionally stubbed: Registry.AddEpoch is not yet wired, LatestEpoch() is assumed everywhere, and several TODOs note that restart-after-epoch-transition (consensus newInner), per-epoch vote weighting (avail block_votes/laneQC), and epoch rotation (consensus pushCommitQC, avail PushCommitQC) will break once transitions are real. This matches the PR's stated "wire the read side first" plan, but the surface area of TODOs is worth tracking as a single follow-up so none is missed when epochs go dynamic.
- Consistency nit: epoch/registry.go AddEpoch builds the open-ended range with math.MaxUint64 while types/epoch.go exposes OpenRoadRange()/utils.MaxRoadIndex for the same purpose; reusing the helper would keep the "open range" concept in one place.
- data/state.go NewState relies on inner.qcs[lb.Number] being non-nil when loading blocks from the WAL to resolve the epoch (qc.QC().Proposal().EpochIndex()); this invariant was already assumed by insertBlock/globalBlockAt, but now an additional call site would nil-deref on a block-without-QC WAL state — consider an explicit guard/error since startup-panic is a poor failure mode.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
|
|
||
| "github.com/sei-protocol/sei-chain/sei-tendermint/autobahn/types" | ||
| "github.com/sei-protocol/sei-chain/sei-tendermint/internal/autobahn/consensus/persist" | ||
|
|
There was a problem hiding this comment.
[nit] Nit: this inserts a blank line between two github.com/sei-protocol/... imports, splitting the same-module import group in two. It won't fail the goimports linter (which preserves manual grouping) but is inconsistent with the surrounding style — consider removing it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, WAL overrides Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
A large but carefully-executed refactor that introduces an epoch.Registry, bakes firstBlock/epoch_index into Proposal/View so GlobalRange() is self-contained on the wire, and routes all committee lookups through the registry. No blocking correctness or security issues found; the epoch-rotation machinery is intentionally read-side-only with documented TODOs, and test coverage is extensive.
Findings: 0 blocking | 5 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Cursor's second-opinion review (cursor-review.md) is empty — that pass produced no output.
- Codex flags that AddEpoch (the execution-bridge activation path) exists while consensus/avail pin their Epoch at startup and never rotate it, so activating a new committee would cause epoch/road mismatches. I confirmed AddEpoch and the rotation-aware code paths are only reached from tests — no production caller invokes AddEpoch — so this is not a live bug in this PR. It is explicitly scoped as follow-up and guarded by TODOs (consensus/inner.go newInner assumes LatestEpoch on restart; avail PushCommitQC has commented-out rotation; avail PushBlock/PushVote verify only the latest epoch via VerifyInWindow). Worth tracking that these must land together with the write side before dynamic epochs are enabled, otherwise a restart after a transition or a message from a non-latest epoch will fail verification.
- AppProposal.epoch_index and Proposal.global_first are decoded as hard-required (nil rejected) even though the proto marks them optional; this is fine given the PR states autobahn is pre-production with no rolling-upgrade path, but the divergence between proto
optionaland decoderrequiredis worth a comment in the .proto. - 2 suggestion(s)/nit(s) flagged inline on specific lines.
| newIdx := s.latest + 1 | ||
| s.epochs[newIdx] = types.NewEpoch( | ||
| newIdx, | ||
| types.RoadRange{First: startRoad, Last: math.MaxUint64}, |
There was a problem hiding this comment.
[nit] Nit: Last: math.MaxUint64 open-ends the new epoch's road range, but the genesis epoch uses types.OpenRoadRange() (which is utils.Max[RoadIndex]()). Since RoadIndex is uint64 these are equal today, but using the same OpenRoadRange()/utils.Max[RoadIndex]() helper here keeps the two construction sites consistent and robust if RoadIndex's width ever changes.
| Number: ViewNumber(*m.Number), | ||
| Index: RoadIndex(*m.Index), | ||
| Number: ViewNumber(*m.Number), | ||
| EpochIndex: EpochIndex(*m.EpochIndex), //nolint:gosec // bounded by uint32 max in practice |
There was a problem hiding this comment.
[nit] EpochIndex is uint32 but decoded from a wire uint64 with //nolint:gosec. An oversized epoch_index (e.g. 2^32) silently truncates rather than being rejected — for genesis it would truncate to 0 and select the genesis epoch. It can't bypass signature verification, but consider explicitly rejecting values > math.MaxUint32 here for clearer failure semantics.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…open road range Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…silently truncating Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n main) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ow check and nolint Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
A large, well-tested refactor introducing epoch.Registry/types.Epoch and making the wire Proposal self-contained (global_first/epoch_index) so GlobalRange() no longer needs a committee; it also fixes a real reproposal double-count bug. No blocking correctness/security issues found; only latent footguns and deferred-epoch TODOs remain.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Second-opinion coverage was thin: the Cursor review file is empty, and Codex reported no material findings while noting it could not run Go tests (Go 1.25.6 unavailable, toolchain download network-blocked). No independent test execution backs this refactor beyond the PR's own tests.
- Epoch transitions are entirely deferred: the registry always returns the genesis epoch and many push/persist paths carry TODOs (per-epoch weight counting in avail pushVote/laneQC, inner.ep rotation in avail/consensus, persistedInner.validate resolving the epoch from the persisted QC rather than LatestEpoch, AddEpoch rejecting startRoad below the commit watermark). Restart after a future epoch transition is explicitly acknowledged to fail validation. Ensure these are tracked before dynamic epochs land.
- Registry.epochs is an append-only map with no pruning; unbounded over long-running nodes once AddEpoch is wired. Minor now, worth a follow-up.
- data.PushBlock's new two-phase locking (blockEpoch → verify → re-lock) is safe/idempotent as written, but the correctness now depends on insertBlock's duplicate guard and the qcs[n]==nil re-check; a brief comment on why the drop-on-nil after re-lock is correct would help future readers.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
| if cQC, ok := vs.CommitQC.Get(); ok { | ||
| return cQC.GlobalRange().Next | ||
| } | ||
| return vs.Epoch.FirstBlock() |
There was a problem hiding this comment.
[nit] ViewSpec.NextGlobalBlock() (and View()/NextTimestamp()) now unconditionally dereference vs.Epoch. Previously ViewSpec{} was a valid genesis spec; now any ViewSpec constructed without an Epoch will nil-panic here. All production call sites in this PR set Epoch, but this is a sharp latent footgun — consider a defensive nil check or at least documenting on ViewSpec that Epoch is mandatory.
…_index from TimeoutVote and field 7 from Proposal Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
A large but clean refactor that introduces epoch.Registry and self-contains firstBlock/epoch_index on the wire, routing all committee lookups through the registry. I found no blocking correctness, security, or test issues; the change is well-covered by tests and the epoch-transition gaps are explicitly deferred behind TODOs.
Findings: 0 blocking | 4 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Both second-opinion passes produced no actionable findings: Codex reported "No material findings" but noted it could not run
go test(environment had Go 1.24.13 vs the required 1.25.6), and Cursor's output file was empty — so neither independently exercised the touched packages. - Verification is inconsistent across layers: consensus (inner.go) verifies incoming CommitQC/TimeoutQC against the cached
i.ep(always LatestEpoch), while avail/state.go and data/state.go resolve the epoch viaEpochByIndex(qc.Proposal().EpochIndex()). Both are safe today (single genesis epoch) and are guarded by TODOs, but the two approaches should be reconciled when dynamic epochs are wired up so a restart/transition doesn't fail validation (the TODO in consensus/inner.go newInner already flags the restart-after-transition risk). epoch.NewRegistryreturns anerrorthat is currently always nil; consider dropping the error return until it can actually fail, or keep it and document that it's reserved for future validation.- The read-side is wired ahead of dynamic-epoch support, so several
Push*/persist paths now do anEpochByIndexlookup and return "unknown epoch_index" errors that are effectively unreachable in production today; ensure follow-up PRs add coverage for the multi-epoch branches before AddEpoch is actually invoked.


Summary
epoch.Registrywhich consolidatesfirstBlockandgenesisTimestamp— genesis config that doesn't change across epochs — alongside the committeefirstBlockintoProposalsoGlobalRange()returns absolute block numbers directly, removing it as a call-site parameter throughout the stack. The proto message is now self-contained on the wire, making it much easier to interpret during emergency debugging. Also fixes a reproposal bug whereTimeoutQC.reproposal()was double-countingfirstBlockregistry.CommitteeFor(roadIndex). The registry currently returns the genesis committee for all RoadIndexes — we wire the read side first so that any bug introduced when implementing dynamic epochs in follow-up PRs is immediately visible at the call sites rather than requiring a cascading refactor to find it🤖 Generated with Claude Code