[AURON #2160] perf: SIMD short-circuit in JoinHashMap probe#2161
[AURON #2160] perf: SIMD short-circuit in JoinHashMap probe#2161yew1eb wants to merge 7 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Optimizes the SIMD-based probe path in the native-engine join hash map by short-circuiting the “empty slot” SIMD comparison when a hash match is found, targeting reduced instruction count in typical high-hit-rate join workloads.
Changes:
- Splits the probe condition into a fast-path (hash match) and slow-path (empty slot) to avoid an unconditional empty-mask SIMD compare.
- Returns
MapValue::EMPTYdirectly when an empty slot is detected in the probed group.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ShreyeshArangath
left a comment
There was a problem hiding this comment.
How was the performance tested? Can you share some logs/numbers in the PR description as well?
We should probably set up a microbenchmark for lookup_many with controlled hit rates (0%, 50%, 100%) if possible, WDYT?
@ShreyeshArangath Done. Added benches/join_hash_map.rs with 0%/50%/100% hit rates across 5M/10M/20M keys. The numbers are in the PR description: on M2 Pro the win is ~4–5% between hit=0% and hit=100%, which is modest but expected since this is a small hot-path cleanup. Should be safe to merge. |
767b8d1 to
d444755
Compare
[AURON-2160] Optimize join hash map probe by checking hash_matched first before computing empty mask. This reduces ~50% SIMD instructions when hash hit rate is high (typical join scenarios). Before: Always compute both hash_matched and empty SIMD masks. After: Only compute empty mask when hash_matched has no hits. Also add a criterion microbenchmark (benches/join_hash_map.rs) covering realistic BHJ build sizes (5M/10M/20M keys) × three hit rates (0/50/100%). Results on Apple M2 Pro (probe_size=4096): build size | hit=0% | hit=50% | hit=100% ----------------+---------+---------+--------- 5M (~128 MB) | 6.63 µs | 6.52 µs | 6.35 µs 10M (~256 MB) | 6.68 µs | 6.50 µs | 6.36 µs 20M (~512 MB) | 6.70 µs | 6.59 µs | 6.36 µs Latency stays flat because prefetch_read_data (4-step ahead) fully pipelines cache misses. The hit=100% path is consistently ~4-5% faster, aligning with the optimization goal. Instruction-count savings can be confirmed on x86 via: perf stat -e instructions Run benchmark: cargo bench --bench join_hash_map -p datafusion-ext-plans
|
@yew1eb, could you please resolve conflicts? |
|
@SteNicholas Conflicts resolved. CI is green. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
weiqingy
left a comment
There was a problem hiding this comment.
Thanks for taking this on — I walked the reorder against the two build invariants (each distinct hash lands in exactly one slot via the chunk_by(hash) at build time, and insertion always fills the lowest empty lane so occupied lanes pack into a prefix), and the hit-first / empty-second path is equivalent to the old (hash_matched | empty).first_set(). The Copilot pass already covers the prefix-invariant doc, iter_batched, and the transmutes, so just two notes from me inline.
| let label = format!("{size_label}/{rate_label}"); | ||
| group.bench_with_input(BenchmarkId::from_parameter(&label), &label, |b, _| { | ||
| b.iter(|| { | ||
| let result = map.lookup_many(black_box(probe.clone())); |
There was a problem hiding this comment.
Your table shows hit=100% landing a bit ahead of hit=0% within the patched code, which is the expected shape of the win. The one thing it doesn't capture is a master baseline — and since the miss path now runs the two SIMD compares sequentially with an extra any() reduction (vs. the old single first_set over the OR mask), a miss-heavy probe (low-selectivity semi/anti joins) is doing a touch more work. You may well have already checked this — just double-confirming: did hit=0% hold steady against master?
| if let Some(pos) = hash_matched.first_set() { | ||
| hashes[i] = unsafe { | ||
| // safety: transmute MapValue(u32) to u32 | ||
| std::mem::transmute(self.map[e].values[pos]) |
There was a problem hiding this comment.
You've already dropped the unsafe transmute on the empty path just below (hashes[i] = MapValue::EMPTY.0; at line 269) — since this code is in the same module as MapValue, the hit path here can mirror that with hashes[i] = self.map[e].values[pos].0; and retire the last transmute from the hot loop. Worth making the two paths consistent while you're in here?
Which issue does this PR close?
Closes #2160
Rationale for this change
Optimize join hash map probe by checking hash_matched first before computing empty mask.
What changes are included in this PR?
Changes:
hash_matchedbefore computingemptymask.benches/join_hash_map.rswith 0%/50%/100% hit rates × 5M/10M/20M keys.Are there any user-facing changes?
How was this patch tested?
Benchmark (M2 Pro, probe_size=4096):
hit=100% is consistently ~4–5% faster than hit=0%.