Skip to content

perf: Switch Aiur from Keccak to Blake3 hashing#470

Draft
samuelburnham wants to merge 1 commit into
mainfrom
sb/aiur-blake3
Draft

perf: Switch Aiur from Keccak to Blake3 hashing#470
samuelburnham wants to merge 1 commit into
mainfrom
sb/aiur-blake3

Conversation

@samuelburnham

Copy link
Copy Markdown
Member

Adopt Blake3 for the multi-stark commitment hash, challenger, and recursive verifier

Spans two coordinated branches: multi-stark@blake3 (the prover/verifier hash)
and ix@bench/blake3 (in-circuit recursive-verifier port + tests + dependency bump).

Summary

Use Blake3 as the multi-stark commitment scheme's hash — the Merkle MMCS
(leaf hash + 2-to-1 compression) and the Fiat–Shamir challenger — and port the
in-circuit recursive verifier to match, including its vk/claims digest
bindings
, so the verifier is fully Blake3 with no Keccak on any hot path.
Blake3's lighter permutation (7 rounds, 512-bit state) wins on native hardware
and, far more so, in-circuit, where its mostly byte-aligned rotations avoid the
expensive bit-decomposition the previous hash required.

  • Leaf proving (any Aiur program): ~25% faster to prove, no RAM change,
    identical FFT/trace.
  • Recursive verification: the fully-Blake3 in-circuit verifier shrinks
    ~65× in FFT and ~60× in execution time — the difference between proving
    the verifier being infeasible on a 250 GB box
    (baseline: OOM > 200 GB) and
    routine (Blake3: 6.8 GB / 4.8 s).

Changes

multi-stark (branch blake3, 0b260d7)

  • src/types.rs: Challenger, Mmcs, and the leaf/compress hashers built on
    Blake3 (SerializingHasher<Blake3>, CompressionFunctionFromHasher<Blake3,2,32>,
    [u8;32] digests, SerializingChallenger64<Val, HashChallenger<u8, Blake3, 32>>).
  • Cargo.toml: add p3-blake3.
  • Reference-value generators (gen_pcs_refs, gen_challenger_refs, #[cfg(test)]):
    the independent source for the Ix self-test reference digests (leaf/compress,
    the Merkle tree, and the challenger samples).

ix (branch bench/blake3)

  • Ix/MultiStark/Pcs.lean: MMCS leaf hash (mmcs_hash_row) and 2-to-1 compression
    (mmcs_compress) over the native Blake3 gadget (Ix/IxVM/Blake3.lean); the dead
    Keccak PaddingFreeSponge (pf_sponge_u64/pf_absorb/u64_pick) is removed.
    Digest stays [U64;4] (= the 32 Blake3 output bytes, LE-grouped).
  • Ix/MultiStark/Verifier.lean: the Fiat–Shamir challenger flush uses blake3;
    the generic cons8 list helper is relocated here from Keccak.lean.
  • Ix/MultiStark.lean: the vk/claims digest bindings (system_digest,
    claims_digest) use b3_to_digest(blake3(..)) in-circuit; IxVM.blake3 is
    merged; merge keccak is dropped (the verifier no longer compiles any Keccak
    circuit — they were unused/height-0). The Keccak gadget module stays available.
  • src/aiur/vk_codec.rs: verifying-key Merkle-cap (de)serialization uses the
    Blake3 32-byte digest ([u8;32]).
  • Benchmarks/RecursiveVerifier.lean (+ lakefile.lean exe target): the
    recursive-verifier cost benchmark — proves factorial(5), runs the in-circuit
    verifier, reports FFT and (by default) the full prove timeline via texray. Pass
    --execute-only to skip the prove on a memory-constrained box.
  • Cargo.toml / Cargo.lock: bump the multi-stark dependency to blake3 (0b260d7).
  • Tests — see the dedicated section below.

Cost breakdown

All numbers on a 32-core / 250 GB AVX-512 host, parallel (rayon) build.
"FFT cost" = Σ_circuit width·height·log₂(height) (the prover-work / RAM driver).
Baseline = current main (all-Keccak); Blake3 = this PR (all-Blake3).

1. Aiur program — kernel-constant leaf proving

Workload: full-closure typecheck of 49 Init constants (init.ixe), FFT
1.7 M → 4.0 B, baseline vs Blake3 from the same commit (only the hash differs),
paired per-constant.

FFT, trace shape, and execution (witness-gen) time are hash-independent
byte-identical between baseline and Blake3, verified across all 49 constants. The
commitment hash only affects the proving step:

  • Proving time: 145.7 s → 109.6 s over the 49 constants (−24.8%);
    per-constant geomean 1.51× faster (range 1.04–2.21×). The win shrinks as FFT
    grows and the NTT/quotient overtake the hash: ~−48% at ~1e6 FFT → ~−23% at ~1e8
    → ~−9% at ~4e9, so the typical sharded-workload win is 20–50%.
  • Peak proving RAM: flat — mean +0.9% (range −2.8% … +5.4%). Peak RSS is
    dominated by trace/LDE buffers (∝ FFT, identical), not the few-KB hash state.

2. Recursive verifier — verify_multi_stark_proof on factorial(5)

Inner proof: a factorial(5) multi-stark proof (numQueries=3, log_blowup=2,
30 KB). The verifier recomputes the inner proof's Merkle/transcript hashes and
its vk/claims digest bindings in-circuit; this PR makes all of that Blake3.

metric baseline verifier Blake3 verifier Δ
In-circuit FFT ~75 B (¹) ~1.15 B ~65× smaller
Execution time ~60 s ~1 s ~60× faster
Execute peak RAM 6.6 GB 0.45 GB ~15×
Proving time OOM, does not complete (²) 4.8 s infeasible → feasible
Proving peak RAM OOM, ~300 GB extrapolated (²) 6.8 GB infeasible → feasible

(¹) The baseline verifier's FFT is ~99% in-circuit hashing (the bit-decomposition
rotations of the previous hash). Making the vk/claims bindings Blake3 too — not just
the MMCS and challenger — matters here: the vk binding alone is ~2.5 B of FFT, so a
verifier with Blake3 commitments but Keccak bindings still sits at ~3.75 B; moving
the bindings as well brings it to ~1.15 B (the ~65× figure).

(²) Proving the baseline verifier OOMs. Measured under a hard 200 GB cgroup cap
(box-safe), killed mid-prove with the heaviest stages still ahead:

prove stage peak RSS
aiur/execute 8.5 GB
aiur/witness 98.1 GB
stark/stage1_commit 166.2 GB
stark/lookup_messages 185.8 GB
stark/batch_inverse 198.8 GB → cgroup-killed at 200 GB (before quotient/fri_open)

Extrapolating the remaining stages puts the baseline verifier prove at ~300 GB
over the 240 GB shutoff. The Blake3 verifier proves comfortably, peaking at 6.8 GB.

3. Supporting microbenchmarks

  • Raw native hash (off-circuit, AVX-512): Blake3's smaller state / fewer rounds
    give it the edge on native throughput — the source of the ~25% leaf-proving win.
    (p3-blake3 is scalar-per-leaf; a future packed Blake3 hasher would widen it.)
  • In-circuit hash gadget: Blake3 is far cheaper in-circuit (bit-decomposition
    rotations dominate the previous hash). At verifier scale this compounds to ~65×.

Tests

All green under the exact CI invocations (lake test -- --ignored … multi-stark recursive-verifier and IxTests -- ffi).

  • Self-test reference values regenerated for Blake3 (Ix/MultiStark/Tests.lean):
    pcs_hash_test (leaf/compress), pcs_merkle_test (Merkle verify_batch root +
    tamper), sample_bits_test and pcs_challenger4_test (challenger samples). All
    references come from the multi-stark gen_pcs_refs/gen_challenger_refs
    generators; the leaf digests were cross-checked independently with b3sum.
  • End-to-end (Tests/MultiStark.lean::endToEndSuite, the recursive-verifier
    runner): host-side vk/claims digests use Blake3.Rust.hash; the verifier accepts
    the honest factorial(5) proof and rejects a tampered proof byte and a tampered
    claim.
  • Blake3 FFI lifecycle (Tests/FFI/Lifecycle.lean, run under ffi): new tests
    mirroring the Keccak ones — official empty-input vector, multi-update == one-shot,
    20-update destructor stress, 4 KB large-input determinism.

Methodology

  • Both hashes built from the same commit; the leaf bench's two bench-typecheck
    binaries differ only in the multi-stark hash (FFT-identity confirms apples-to-apples).
  • The baseline verifier prove was run under systemd-run --user --scope -p MemoryMax=200G -p MemorySwapMax=0 so a runaway allocation is cgroup-OOM-killed
    (just that process), never tripping the box-level 240 GB shutoff.
  • Per-stage timing/RAM via tracing-texray; peak RSS via /usr/bin/time -v.
  • lake exe bench-recursive-verifier reproduces §2 (proves by default, ~4.8 s / ~6.8 GB;
    --execute-only for FFT/exec only).

Caveats

  1. Verifier FFT/exec carry ±~15% run-to-run noise because the multi-stark
    prover is non-deterministic under parallel (rayon): the same statement
    yields byte-different (valid) proofs, so the verifier authenticates slightly
    different Merkle paths. Confirmed: factorial(5) proof hashes differ across
    parallel runs but are identical single-threaded (RAYON_NUM_THREADS=1); Aiur
    execution FFT is deterministic (factorial(5) execute FFT = 201.627075,
    every run). Worth a separate upstream determinism fix; it does not affect
    correctness or the order-of-magnitude conclusions.
  2. Baseline-verifier prove RAM (~300 GB) is extrapolated past the 200 GB cap
    (measured to 198.8 GB with quotient/fri_open still ahead). The infeasibility
    is measured; the exact peak is not.

Dependency ordering

multi-stark@blake3 must merge first; the ix Cargo.toml already pins the
dependency to its branch HEAD (0b260d7) — re-pin to the merge commit on merge.

Follow-ups

  • File the parallel-proving non-determinism upstream (Caveat 1).
  • Optionally add a packed Blake3 hasher (inter-leaf SIMD) to widen the leaf win. Note the Plonky3 Blake3 implementation is still missing "modifications to tune BLAKE3 for hashing small leaves"
  • Native-field Poseidon1/2 for the in-circuit verifier hash (algebraic ≪ bit-oriented
    in-circuit) as a future, larger recursion lever.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant