perf: Switch Aiur from Keccak to Blake3 hashing by samuelburnham · Pull Request #470 · argumentcomputer/ix

samuelburnham · 2026-07-02T02:28:39Z

Adopt Blake3 for the multi-stark commitment hash, challenger, and recursive verifier

Spans two coordinated branches: multi-stark@blake3 (the prover/verifier hash)
and ix@bench/blake3 (in-circuit recursive-verifier port + tests + dependency bump).

Summary

Use Blake3 as the multi-stark commitment scheme's hash — the Merkle MMCS
(leaf hash + 2-to-1 compression) and the Fiat–Shamir challenger — and port the
in-circuit recursive verifier to match, including its vk/claims digest
bindings, so the verifier is fully Blake3 with no Keccak on any hot path.
Blake3's lighter permutation (7 rounds, 512-bit state) wins on native hardware
and, far more so, in-circuit, where its mostly byte-aligned rotations avoid the
expensive bit-decomposition the previous hash required.

Leaf proving (any Aiur program): ~25% faster to prove, no RAM change,
identical FFT/trace.
Recursive verification: the fully-Blake3 in-circuit verifier shrinks
~65× in FFT and ~60× in execution time — the difference between proving
the verifier being infeasible on a 250 GB box (baseline: OOM > 200 GB) and
routine (Blake3: 6.8 GB / 4.8 s).

Changes

`multi-stark` (branch `blake3`, `0b260d7`)

src/types.rs: Challenger, Mmcs, and the leaf/compress hashers built on
Blake3 (SerializingHasher<Blake3>, CompressionFunctionFromHasher<Blake3,2,32>,
[u8;32] digests, SerializingChallenger64<Val, HashChallenger<u8, Blake3, 32>>).
Cargo.toml: add p3-blake3.
Reference-value generators (gen_pcs_refs, gen_challenger_refs, #[cfg(test)]):
the independent source for the Ix self-test reference digests (leaf/compress,
the Merkle tree, and the challenger samples).

`ix` (branch `bench/blake3`)

Ix/MultiStark/Pcs.lean: MMCS leaf hash (mmcs_hash_row) and 2-to-1 compression
(mmcs_compress) over the native Blake3 gadget (Ix/IxVM/Blake3.lean); the dead
Keccak PaddingFreeSponge (pf_sponge_u64/pf_absorb/u64_pick) is removed.
Digest stays [U64;4] (= the 32 Blake3 output bytes, LE-grouped).
Ix/MultiStark/Verifier.lean: the Fiat–Shamir challenger flush uses blake3;
the generic cons8 list helper is relocated here from Keccak.lean.
Ix/MultiStark.lean: the vk/claims digest bindings (system_digest,
claims_digest) use b3_to_digest(blake3(..)) in-circuit; IxVM.blake3 is
merged; merge keccak is dropped (the verifier no longer compiles any Keccak
circuit — they were unused/height-0). The Keccak gadget module stays available.
src/aiur/vk_codec.rs: verifying-key Merkle-cap (de)serialization uses the
Blake3 32-byte digest ([u8;32]).
Benchmarks/RecursiveVerifier.lean (+ lakefile.lean exe target): the
recursive-verifier cost benchmark — proves factorial(5), runs the in-circuit
verifier, reports FFT and (by default) the full prove timeline via texray. Pass
--execute-only to skip the prove on a memory-constrained box.
Cargo.toml / Cargo.lock: bump the multi-stark dependency to blake3 (0b260d7).
Tests — see the dedicated section below.

Cost breakdown

All numbers on a 32-core / 250 GB AVX-512 host, parallel (rayon) build.
"FFT cost" = Σ_circuit width·height·log₂(height) (the prover-work / RAM driver).
Baseline = current main (all-Keccak); Blake3 = this PR (all-Blake3).

1. Aiur program — kernel-constant leaf proving

Workload: full-closure typecheck of 49 Init constants (init.ixe), FFT
1.7 M → 4.0 B, baseline vs Blake3 from the same commit (only the hash differs),
paired per-constant.

FFT, trace shape, and execution (witness-gen) time are hash-independent —
byte-identical between baseline and Blake3, verified across all 49 constants. The
commitment hash only affects the proving step:

Proving time: 145.7 s → 109.6 s over the 49 constants (−24.8%);
per-constant geomean 1.51× faster (range 1.04–2.21×). The win shrinks as FFT
grows and the NTT/quotient overtake the hash: ~−48% at ~1e6 FFT → ~−23% at ~1e8
→ ~−9% at ~4e9, so the typical sharded-workload win is 20–50%.
Peak proving RAM: flat — mean +0.9% (range −2.8% … +5.4%). Peak RSS is
dominated by trace/LDE buffers (∝ FFT, identical), not the few-KB hash state.

2. Recursive verifier — `verify_multi_stark_proof` on `factorial(5)`

Inner proof: a factorial(5) multi-stark proof (numQueries=3, log_blowup=2,
30 KB). The verifier recomputes the inner proof's Merkle/transcript hashes and
its vk/claims digest bindings in-circuit; this PR makes all of that Blake3.

metric	baseline verifier	Blake3 verifier	Δ
In-circuit FFT	~75 B (¹)	~1.15 B	~65× smaller
Execution time	~60 s	~1 s	~60× faster
Execute peak RAM	6.6 GB	0.45 GB	~15×
Proving time	OOM, does not complete (²)	4.8 s	infeasible → feasible
Proving peak RAM	OOM, ~300 GB extrapolated (²)	6.8 GB	infeasible → feasible

(¹) The baseline verifier's FFT is ~99% in-circuit hashing (the bit-decomposition
rotations of the previous hash). Making the vk/claims bindings Blake3 too — not just
the MMCS and challenger — matters here: the vk binding alone is ~2.5 B of FFT, so a
verifier with Blake3 commitments but Keccak bindings still sits at ~3.75 B; moving
the bindings as well brings it to ~1.15 B (the ~65× figure).

(²) Proving the baseline verifier OOMs. Measured under a hard 200 GB cgroup cap
(box-safe), killed mid-prove with the heaviest stages still ahead:

prove stage	peak RSS
`aiur/execute`	8.5 GB
`aiur/witness`	98.1 GB
`stark/stage1_commit`	166.2 GB
`stark/lookup_messages`	185.8 GB
`stark/batch_inverse`	198.8 GB → cgroup-killed at 200 GB (before `quotient`/`fri_open`)

Extrapolating the remaining stages puts the baseline verifier prove at ~300 GB —
over the 240 GB shutoff. The Blake3 verifier proves comfortably, peaking at 6.8 GB.

3. Supporting microbenchmarks

Raw native hash (off-circuit, AVX-512): Blake3's smaller state / fewer rounds
give it the edge on native throughput — the source of the ~25% leaf-proving win.
(p3-blake3 is scalar-per-leaf; a future packed Blake3 hasher would widen it.)
In-circuit hash gadget: Blake3 is far cheaper in-circuit (bit-decomposition
rotations dominate the previous hash). At verifier scale this compounds to ~65×.

Tests

All green under the exact CI invocations (lake test -- --ignored … multi-stark recursive-verifier and IxTests -- ffi).

Self-test reference values regenerated for Blake3 (Ix/MultiStark/Tests.lean):
pcs_hash_test (leaf/compress), pcs_merkle_test (Merkle verify_batch root +
tamper), sample_bits_test and pcs_challenger4_test (challenger samples). All
references come from the multi-stark gen_pcs_refs/gen_challenger_refs
generators; the leaf digests were cross-checked independently with b3sum.
End-to-end (Tests/MultiStark.lean::endToEndSuite, the recursive-verifier
runner): host-side vk/claims digests use Blake3.Rust.hash; the verifier accepts
the honest factorial(5) proof and rejects a tampered proof byte and a tampered
claim.
Blake3 FFI lifecycle (Tests/FFI/Lifecycle.lean, run under ffi): new tests
mirroring the Keccak ones — official empty-input vector, multi-update == one-shot,
20-update destructor stress, 4 KB large-input determinism.

Methodology

Both hashes built from the same commit; the leaf bench's two bench-typecheck
binaries differ only in the multi-stark hash (FFT-identity confirms apples-to-apples).
The baseline verifier prove was run under systemd-run --user --scope -p MemoryMax=200G -p MemorySwapMax=0 so a runaway allocation is cgroup-OOM-killed
(just that process), never tripping the box-level 240 GB shutoff.
Per-stage timing/RAM via tracing-texray; peak RSS via /usr/bin/time -v.
lake exe bench-recursive-verifier reproduces §2 (proves by default, ~4.8 s / ~6.8 GB;
--execute-only for FFT/exec only).

Caveats

Verifier FFT/exec carry ±~15% run-to-run noise because the multi-stark
prover is non-deterministic under parallel (rayon): the same statement
yields byte-different (valid) proofs, so the verifier authenticates slightly
different Merkle paths. Confirmed: factorial(5) proof hashes differ across
parallel runs but are identical single-threaded (RAYON_NUM_THREADS=1); Aiur
execution FFT is deterministic (factorial(5) execute FFT = 201.627075,
every run). Worth a separate upstream determinism fix; it does not affect
correctness or the order-of-magnitude conclusions.
Baseline-verifier prove RAM (~300 GB) is extrapolated past the 200 GB cap
(measured to 198.8 GB with quotient/fri_open still ahead). The infeasibility
is measured; the exact peak is not.

Dependency ordering

multi-stark@blake3 must merge first; the ix Cargo.toml already pins the
dependency to its branch HEAD (0b260d7) — re-pin to the merge commit on merge.

Follow-ups

File the parallel-proving non-determinism upstream (Caveat 1).
Optionally add a packed Blake3 hasher (inter-leaf SIMD) to widen the leaf win. Note the Plonky3 Blake3 implementation is still missing "modifications to tune BLAKE3 for hashing small leaves"
Native-field Poseidon1/2 for the in-circuit verifier hash (algebraic ≪ bit-oriented
in-circuit) as a future, larger recursion lever.

perf: Switch Aiur from Keccak to Blake3 hashing

fa3c1d0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: Switch Aiur from Keccak to Blake3 hashing#470

perf: Switch Aiur from Keccak to Blake3 hashing#470
samuelburnham wants to merge 1 commit into
mainfrom
sb/aiur-blake3

samuelburnham commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

samuelburnham commented Jul 2, 2026

Adopt Blake3 for the multi-stark commitment hash, challenger, and recursive verifier

Summary

Changes

multi-stark (branch blake3, 0b260d7)

ix (branch bench/blake3)

Cost breakdown

1. Aiur program — kernel-constant leaf proving

2. Recursive verifier — verify_multi_stark_proof on factorial(5)

3. Supporting microbenchmarks

Tests

Methodology

Caveats

Dependency ordering

Follow-ups

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`multi-stark` (branch `blake3`, `0b260d7`)

`ix` (branch `bench/blake3`)

2. Recursive verifier — `verify_multi_stark_proof` on `factorial(5)`