perf: Switch Aiur from Keccak to Blake3 hashing #470
Draft
samuelburnham wants to merge 1 commit into
Adopt Blake3 for the multi-stark commitment hash, challenger, and recursive verifier
Spans two coordinated branches:
- `multi-stark@blake3` (the prover/verifier hash), and
- `ix@bench/blake3` (in-circuit recursive-verifier port + tests + dependency bump).

Summary
Use Blake3 as the multi-stark commitment scheme's hash — the Merkle MMCS
(leaf hash + 2-to-1 compression) and the Fiat–Shamir challenger — and port the
in-circuit recursive verifier to match, including its vk/claims digest
bindings, so the verifier is fully Blake3 with no Keccak on any hot path.
Blake3's lighter permutation (7 rounds, 512-bit state) wins on native hardware
and, far more so, in-circuit, where its mostly byte-aligned rotations avoid the
expensive bit-decomposition the previous hash required.
- Aiur leaf proving: per-constant geomean 1.51× faster at identical FFT/trace.
- Recursive verifier: ~65× lower FFT cost and ~60× lower execution time. This is the difference between proving the verifier being infeasible on a 250 GB box (baseline: OOM > 200 GB) and routine (Blake3: 6.8 GB / 4.8 s).

Changes
multi-stark (branch `blake3`, `0b260d7`)
- `src/types.rs`: `Challenger`, `Mmcs`, and the leaf/compress hashers built on Blake3 (`SerializingHasher<Blake3>`, `CompressionFunctionFromHasher<Blake3,2,32>`, `[u8;32]` digests, `SerializingChallenger64<Val, HashChallenger<u8, Blake3, 32>>`).
- `Cargo.toml`: add `p3-blake3`.
- Reference generators (`gen_pcs_refs`, `gen_challenger_refs`, `#[cfg(test)]`): the independent source for the Ix self-test reference digests (leaf/compress, the Merkle tree, and the challenger samples).

ix (branch `bench/blake3`)
- `Ix/MultiStark/Pcs.lean`: MMCS leaf hash (`mmcs_hash_row`) and 2-to-1 compression (`mmcs_compress`) over the native Blake3 gadget (`Ix/IxVM/Blake3.lean`); the dead Keccak `PaddingFreeSponge` (`pf_sponge_u64`/`pf_absorb`/`u64_pick`) is removed. `Digest` stays `[U64;4]` (= the 32 Blake3 output bytes, LE-grouped).
- `Ix/MultiStark/Verifier.lean`: the Fiat–Shamir challenger flush uses `blake3`; the generic `cons8` list helper is relocated here from `Keccak.lean`.
- `Ix/MultiStark.lean`: the vk/claims digest bindings (`system_digest`, `claims_digest`) use `b3_to_digest(blake3(..))` in-circuit; `IxVM.blake3` is merged; `merge keccak` is dropped (the verifier no longer compiles any Keccak circuits, which were unused/height-0). The Keccak gadget module stays available.
- `src/aiur/vk_codec.rs`: verifying-key Merkle-cap (de)serialization uses the Blake3 32-byte digest (`[u8;32]`).
- `Benchmarks/RecursiveVerifier.lean` (+ `lakefile.lean` exe target): the recursive-verifier cost benchmark. It proves `factorial(5)`, runs the in-circuit verifier, and reports FFT and (by default) the full prove timeline via texray. Pass `--execute-only` to skip the prove on a memory-constrained box.
- `Cargo.toml` / `Cargo.lock`: bump the multi-stark dependency to `blake3` (`0b260d7`).

Cost breakdown
All numbers on a 32-core / 250 GB AVX-512 host, `parallel` (rayon) build. "FFT cost" = Σ_circuit width·height·log₂(height) (the prover-work / RAM driver). Baseline = current `main` (all-Keccak); Blake3 = this PR (all-Blake3).

1. Aiur program: kernel-constant leaf proving
Workload: full-closure typecheck of 49 `Init` constants (`init.ixe`), FFT 1.7 M → 4.0 B, baseline vs Blake3 from the same commit (only the hash differs), paired per-constant.
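
The FFT cost metric defined under "Cost breakdown" (Σ over circuits of width·height·log₂(height)) can be sketched as follows. This is a minimal illustration, not code from either branch: `TraceShape` and `fft_cost` are hypothetical names, and the sketch assumes every non-empty trace height is a power of two.

```rust
/// Shape of one circuit's trace: `width` columns by `height` rows.
/// Hypothetical type for illustration; not taken from multi-stark or ix.
#[derive(Clone, Copy, Debug)]
pub struct TraceShape {
    pub width: u64,
    pub height: u64,
}

/// FFT cost = sum over circuits of width * height * log2(height).
/// Assumption: each non-zero height is a power of two, so log2 is exact.
/// Height-0 circuits are skipped and contribute nothing to the sum.
pub fn fft_cost(circuits: &[TraceShape]) -> u64 {
    circuits
        .iter()
        .filter(|c| c.height > 0)
        .map(|c| {
            assert!(c.height.is_power_of_two(), "height must be a power of two");
            // For a power of two, the number of trailing zero bits is log2.
            let log2_height = u64::from(c.height.trailing_zeros());
            c.width * c.height * log2_height
        })
        .sum()
}
```

For example, a 10-column trace of height 1024 contributes 10 · 1024 · 10 = 102,400, and a 4-column trace of height 8 contributes 4 · 8 · 3 = 96.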
FFT, trace shape, and execution (witness-gen) time are hash-independent (byte-identical between baseline and Blake3, verified across all 49 constants). The commitment hash only affects the proving step, where the per-constant geomean is 1.51× faster (range 1.04–2.21×). The win shrinks as FFT grows and the NTT/quotient overtake the hash: ~−48% at ~1e6 FFT → ~−23% at ~1e8 → ~−9% at ~4e9, so the typical sharded-workload win is 20–50%.
Memory is dominated by trace/LDE buffers (∝ FFT, identical), not the few-KB hash state.
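
The leaf-proving figures mix two units: a geomean speedup factor (1.51×) and percentage changes (~−48% and so on). The sketch below shows how a geometric mean of paired speedups is computed and how a speedup factor relates to a fractional time saving. It assumes the percentages are changes in prove time, and the function names are hypothetical.

```rust
/// Geometric mean of paired speedups (baseline_time / blake3_time).
/// Returns None for an empty slice or any non-positive (or NaN) ratio.
pub fn geomean(speedups: &[f64]) -> Option<f64> {
    if speedups.is_empty() || speedups.iter().any(|&s| !(s > 0.0)) {
        return None;
    }
    // Average in log space, then exponentiate back.
    let mean_log = speedups.iter().map(|s| s.ln()).sum::<f64>() / speedups.len() as f64;
    Some(mean_log.exp())
}

/// Fraction of time saved for a given speedup factor: 1 - 1/speedup.
pub fn time_saved(speedup: f64) -> f64 {
    1.0 - 1.0 / speedup
}

/// Inverse conversion: speedup factor for a given fraction of time saved.
pub fn speedup_from_time_saved(saved: f64) -> f64 {
    1.0 / (1.0 - saved)
}
```

Under that assumption, a 1.51× speedup corresponds to roughly 34% less prove time, and a 48% reduction corresponds to roughly a 1.92× speedup.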
2. Recursive verifier: `verify_multi_stark_proof` on `factorial(5)`
Inner proof: a `factorial(5)` multi-stark proof (`numQueries=3`, `log_blowup=2`, 30 KB). The verifier recomputes the inner proof's Merkle/transcript hashes and its vk/claims digest bindings in-circuit; this PR makes all of that Blake3.
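
The Changes list notes that `Digest` stays `[U64;4]`, described as the 32 Blake3 output bytes, LE-grouped. One plausible reading is that each consecutive run of 8 bytes becomes one little-endian 64-bit limb. The sketch below illustrates that reading only; the function names are hypothetical, and the exact limb order used in-circuit is an assumption.

```rust
/// Group 32 digest bytes into four u64 limbs: bytes 0..8 form limb 0, and so on,
/// with each limb read in little-endian order.
/// Assumed layout for illustration; not copied from the ix sources.
pub fn bytes_to_limbs_le(digest: &[u8; 32]) -> [u64; 4] {
    let mut limbs = [0u64; 4];
    for (limb, chunk) in limbs.iter_mut().zip(digest.chunks_exact(8)) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        *limb = u64::from_le_bytes(buf);
    }
    limbs
}

/// Inverse of `bytes_to_limbs_le`: serialize four limbs back into 32 bytes.
pub fn limbs_to_bytes_le(limbs: &[u64; 4]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (chunk, limb) in out.chunks_exact_mut(8).zip(limbs.iter()) {
        chunk.copy_from_slice(&limb.to_le_bytes());
    }
    out
}
```

With this layout, the host-side `[u8;32]` digest and the in-circuit four-limb digest carry the same 256 bits, so converting between them is a lossless round trip.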
(¹) The baseline verifier's FFT is ~99% in-circuit hashing (the bit-decomposition
rotations of the previous hash). Making the vk/claims bindings Blake3 too — not just
the MMCS and challenger — matters here: the vk binding alone is ~2.5 B of FFT, so a
verifier with Blake3 commitments but Keccak bindings still sits at ~3.75 B; moving
the bindings as well brings it to ~1.15 B (the ~65× figure).
(²) Proving the baseline verifier OOMs. Measured under a hard 200 GB cgroup cap (box-safe), killed mid-prove with the heaviest stages still ahead.
Stages: `aiur/execute`, `aiur/witness`, `stark/stage1_commit`, `stark/lookup_messages`, `stark/batch_inverse`, `quotient/fri_open` (not reached).
Extrapolating the remaining stages puts the baseline verifier prove at ~300 GB, over the 240 GB shutoff. The Blake3 verifier proves comfortably, peaking at 6.8 GB.

3. Supporting microbenchmarks
give it the edge on native throughput — the source of the ~25% leaf-proving win.
(`p3-blake3` is scalar-per-leaf; a future packed Blake3 hasher would widen it.)
rotations dominate the previous hash). At verifier scale this compounds to ~65×.

Tests
All green under the exact CI invocations (`lake test -- --ignored … multi-stark recursive-verifier` and `IxTests -- ffi`).
- `Ix/MultiStark/Tests.lean`: `pcs_hash_test` (leaf/compress), `pcs_merkle_test` (Merkle `verify_batch` root + tamper), `sample_bits_test` and `pcs_challenger4_test` (challenger samples). All references come from the multi-stark `gen_pcs_refs`/`gen_challenger_refs` generators; the leaf digests were cross-checked independently with `b3sum`.
- `Tests/MultiStark.lean::endToEndSuite` (the `recursive-verifier` runner): host-side vk/claims digests use `Blake3.Rust.hash`; the verifier accepts the honest `factorial(5)` proof and rejects a tampered proof byte and a tampered claim.
- `Tests/FFI/Lifecycle.lean` (run under `ffi`): new tests mirroring the Keccak ones: official empty-input vector, multi-update == one-shot, 20-update destructor stress, 4 KB large-input determinism.

Methodology
- The `bench-typecheck` binaries differ only in the multi-stark hash (FFT-identity confirms apples-to-apples).
- Memory-capped runs use `systemd-run --user --scope -p MemoryMax=200G -p MemorySwapMax=0` so a runaway allocation is cgroup-OOM-killed (just that process), never tripping the box-level 240 GB shutoff.
- Prove timeline via `tracing-texray`; peak RSS via `/usr/bin/time -v`.
- `lake exe bench-recursive-verifier` reproduces §2 (proves by default, ~4.8 s / ~6.8 GB; `--execute-only` for FFT/exec only).

Caveats
- The prover is non-deterministic under `parallel` (rayon): the same statement yields byte-different (valid) proofs, so the verifier authenticates slightly different Merkle paths. Confirmed: `factorial(5)` proof hashes differ across parallel runs but are identical single-threaded (`RAYON_NUM_THREADS=1`); Aiur execution FFT is deterministic (`factorial(5)` execute FFT = 201.627075, every run). Worth a separate upstream determinism fix; it does not affect correctness or the order-of-magnitude conclusions.
- The ~300 GB baseline peak is an extrapolation (measured to 198.8 GB with `quotient/fri_open` still ahead). The infeasibility is measured; the exact peak is not.

Dependency ordering
`multi-stark@blake3` must merge first; the `ix` `Cargo.toml` already pins the dependency to its branch HEAD (`0b260d7`), so re-pin to the merge commit on merge.

Follow-ups
in-circuit) as a future, larger recursion lever.