[wasm] Enable Vector128 fast paths on Wasm via PackedSimd: hex/Guid, UTF-8, SearchValues, Teddy, Adler32/XXH3 #129838
lewing wants to merge 9 commits
These internal helpers are used by HexConverter.AsciiToHexVector128 and other byte-interleaving code paths. They previously dispatched only to Sse2.UnpackLow/UnpackHigh or AdvSimd.Arm64.ZipLow/ZipHigh and threw NotSupportedException on platforms without either ISA. With the recent change that enables HexConverter and Guid format on Wasm via PackedSimd, the helpers became reachable on browser-wasm and started throwing at runtime in libraries tests.

Lower to PackedSimd.Shuffle with a constant 16-byte index vector (i8x16.shuffle) when PackedSimd is supported.

Validated via System.Runtime.Extensions.Tests on browser-wasm (8224 passing, 0 failed) after the previous run failed 132 tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HexConverter.AsciiToHexVector128, HexConverter.EncodeTo_Vector128 and Guid.FormatGuidVector128Utf8 already use only portable Vector128 ops (Vector128.ShuffleNative, Vector128.UnpackLow/High, Vector128.Shuffle with constant indices) plus an optional AdvSimd.Arm64-specific branch. The gates at Convert.ToHexString, EncodeToUtf8/Utf16, and Guid.ToString required Ssse3 or AdvSimd.Arm64, so Wasm fell back to scalar even with PackedSimd.

Add PackedSimd.IsSupported to the gates and the [CompExactlyDependsOn] attributes on the helpers. The bodies are unchanged; on Wasm the existing else branch (portable Vector128.Shuffle) is selected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetPointerToFirstInvalidChar's inner ASCII-scan loop dispatched on AdvSimd.Arm64 (with bitmask128) or Sse2 (with MoveMask), falling back to a scalar 4-DWORD-at-a-time path otherwise. On Wasm with PackedSimd, neither SIMD branch was taken, so UTF-8 validation took the scalar path.

Add a PackedSimd.IsSupported branch that uses portable Vector128.LoadUnsafe + ExtractMostSignificantBits to compute the same per-byte non-ASCII bitmask used by the Sse2 path. Update the post-loop Debug.Assert to include PackedSimd.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
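The bitmask technique described in this commit can be sketched with portable APIs only. This is a minimal illustration, not the code in Utf8Utility.Validation: the class name `AsciiScanSketch` and the method `IndexOfFirstNonAscii` are hypothetical, and the real loop works on pointers and handles more cases.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration; not part of CoreLib.
static class AsciiScanSketch
{
    // Returns the index of the first byte >= 0x80, or data.Length if every byte is ASCII.
    public static int IndexOfFirstNonAscii(ReadOnlySpan<byte> data)
    {
        ref byte start = ref MemoryMarshal.GetReference(data);
        int i = 0;

        // Process 16 bytes at a time while a full vector is available.
        for (; i <= data.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
        {
            Vector128<byte> chunk = Vector128.LoadUnsafe(ref start, (nuint)i);

            // One bit per byte lane: set when the lane's top bit is set (non-ASCII).
            uint mask = chunk.ExtractMostSignificantBits();
            if (mask != 0)
            {
                // The lowest set bit identifies the first non-ASCII lane in this chunk.
                return i + BitOperations.TrailingZeroCount(mask);
            }
        }

        // Scalar tail for the remaining 0..15 bytes.
        for (; i < data.Length; i++)
        {
            if (data[i] >= 0x80)
            {
                return i;
            }
        }

        return data.Length;
    }
}
```

Because the mask has the same one-bit-per-byte layout that Sse2's MoveMask produces, the code after the loop can stay shared between the two branches, which is the property the commit message relies on.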
Tagging subscribers to this area: @dotnet/area-system-numerics
Pull request overview
This PR enables existing Vector128-based fast paths to run on browser-wasm by widening the feature gates to include System.Runtime.Intrinsics.Wasm.PackedSimd and adding a PackedSimd.Shuffle implementation for Vector128.UnpackLow/UnpackHigh so dependent encode/format paths don’t fall into NotSupportedException on Wasm SIMD-enabled runs.
Changes:
- Add `PackedSimd.IsSupported` implementations for `Vector128.UnpackLow`/`UnpackHigh` using `PackedSimd.Shuffle` (two-vector shuffle).
- Widen existing hex + Guid formatting SIMD gates / `[CompExactlyDependsOn]` to include `PackedSimd`.
- Add a `PackedSimd` vectorized ASCII-scan path in `Utf8Utility.Validation` using `Vector128.LoadUnsafe` + `ExtractMostSignificantBits`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs | Adds PackedSimd support to UnpackLow/UnpackHigh to avoid Wasm falling into the unsupported path. |
| src/libraries/Common/src/System/HexConverter.cs | Expands encode-side Vector128 gate/attributes to include PackedSimd for Wasm SIMD. |
| src/libraries/System.Private.CoreLib/src/System/Guid.cs | Expands Guid vectorized formatting gate/attributes to include PackedSimd on little-endian. |
| src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs | Adds a PackedSimd ASCII-validation fast path using MSB extraction. |
TranscodeToUtf8's 8-char ASCII fast loop used a Vector128<short> read, a mask-and-compare to detect non-ASCII, and a narrow-and-store of 8 bytes using Sse2.PackUnsignedSaturate / AdvSimd.ExtractNarrowingSaturateUnsignedLower. Two follow-on 4-char sites narrowed 4 bytes the same way. All four sites required Sse41.X64 or AdvSimd.Arm64 + LE, so Wasm took the 4-DWORD-at-a-time scalar fallback.

Add PackedSimd branches at every dispatch site:
- Outer entry gate (declaration + entry condition)
- 8-char narrow-store: use the existing portable AND-compare for the non-ASCII test (same code Sse41 already uses) and PackedSimd.ConvertNarrowingSaturateUnsigned + scalar extract for the store
- 4-char narrow-stores: PackedSimd.ConvertNarrowingSaturateUnsigned + AsUInt32().ToScalar() unaligned write

The Sse2.X64.ConvertToUInt64 sub-branch already had an else path that calls AsUInt64().ToScalar(), which works on Wasm without changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SearchValues<string> with 2+ values previously selected the Aho-Corasick implementation on Wasm because the Teddy entry gate in StringSearchValues.cs required Ssse3 or AdvSimd.Arm64. Teddy's core Vector128 primitives in TeddyHelper.cs (LoadAndPack16AsciiChars, the nibble GetNibbles helper, the two-table Shuffle, and RightShift1/2) similarly excluded PackedSimd.

Add PackedSimd branches throughout:
- LoadAndPack16AsciiChars: PackedSimd.ConvertNarrowingSaturateUnsigned
- GetNibbles: PackedSimd needs the explicit '& 0xF' on the low half because Swizzle returns 0 for indices >= 16 (unlike Ssse3's implicit AND of the low 4 bits)
- Shuffle: already uses portable Vector128.ShuffleNative which maps to PackedSimd.Swizzle; just widen the [CompExactlyDependsOn]
- RightShift1/RightShift2: compose two Vector128.ShuffleNative calls with constant index vectors and OR the halves. PackedSimd.Shuffle (two-vector i8x16.shuffle) is impractical due to constant lane index requirements; Swizzle returns 0 for out-of-range indices, which makes the OR safe.

Widen the entry gate in StringSearchValues.cs.CreateFromNormalizedValues and the null-char filter in TryGetTeddyAcceleratedValues (PackedSimd shares Ssse3's PackUnsignedSaturate behavior where signed negative inputs become 0, so null-containing needles produce more false positives on both). Widen [CompExactlyDependsOn] on the IndexOfAnyN2/N3 + Vector128 helpers in AsciiStringSearchValuesTeddyBase.cs.

Validated: System.Memory.Tests on browser-wasm 52249/52249 passing (covers SearchValues<string> Teddy paths via StringSearchValues tests), host 52905/52906 unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
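The two Swizzle points in this commit (the explicit '& 0xF' and the two-swizzle-plus-OR shift) can be illustrated with a scalar model. The sketch below runs on any host and does not call PackedSimd; `SwizzleModel` and its members are hypothetical names, and the RightShift1 result layout (last byte of left followed by the first 15 bytes of right) is an assumption about TeddyHelper rather than something stated in this PR.

```csharp
using System;

// Hypothetical scalar model for illustration; not the TeddyHelper code.
static class SwizzleModel
{
    // Model of Wasm i8x16.swizzle: a lane whose index is >= 16 becomes 0.
    public static byte[] Swizzle(byte[] table, byte[] indices)
    {
        var result = new byte[16];
        for (int i = 0; i < 16; i++)
        {
            result[i] = indices[i] < 16 ? table[indices[i]] : (byte)0;
        }
        return result;
    }

    // Model of x86 pshufb: bit 7 of the index zeroes the lane,
    // otherwise only the low 4 bits of the index are used.
    public static byte[] Pshufb(byte[] table, byte[] indices)
    {
        var result = new byte[16];
        for (int i = 0; i < 16; i++)
        {
            result[i] = (indices[i] & 0x80) != 0 ? (byte)0 : table[indices[i] & 0x0F];
        }
        return result;
    }

    // Assumed RightShift1 layout: { left[15], right[0], ..., right[14] }.
    public static byte[] RightShift1(byte[] left, byte[] right)
    {
        var leftIdx = new byte[16];
        var rightIdx = new byte[16];
        Array.Fill(leftIdx, (byte)0xFF); // out of range: these lanes contribute 0
        leftIdx[0] = 15;                 // lane 0 takes left[15]
        rightIdx[0] = 0xFF;              // lane 0 contributes 0 from right
        for (int i = 1; i < 16; i++)
        {
            rightIdx[i] = (byte)(i - 1); // lane i takes right[i - 1]
        }

        byte[] fromLeft = Swizzle(left, leftIdx);
        byte[] fromRight = Swizzle(right, rightIdx);

        // Each lane is non-zero in at most one operand, so OR merges them losslessly.
        var result = new byte[16];
        for (int i = 0; i < 16; i++)
        {
            result[i] = (byte)(fromLeft[i] | fromRight[i]);
        }
        return result;
    }
}
```

With an index byte such as 0x41, the pshufb model selects table[1] while the swizzle model yields 0, which is why the low-nibble path needs the explicit mask on Wasm.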
SearchValues<char> with values that span more than the ASCII range selects a ProbabilisticMap-based search. The vectorized IndexOfAny / LastIndexOfAny path (using ContainsMask16Chars + IsCharBitNotSet) was previously gated on Sse41 || AdvSimd.Arm64 only, so on Wasm the search fell back to the scalar SimpleLoop even when PackedSimd was available.

This change is subtler than the other enablement PRs because the *layout* of the ProbabilisticMap bitmap also branches on the same gate (SetCharBit/IsCharBitSet at the top of the file). The [BypassReadyToRun] comment there warns that the construction and lookup branches must agree at all times during program execution. Widen all of these gates (SetCharBit/IsCharBitSet, ContainsMask16Chars, the IndexOfAny/LastIndexOfAny entry dispatcher, and the [CompExactlyDependsOn] on the Vector128 worker methods) to include PackedSimd consistently.

ContainsMask16Chars gets a PackedSimd branch that mirrors the Sse2 algorithm using PackedSimd.ConvertNarrowingSaturateUnsigned for the two-vector narrowing step. IsCharBitNotSet already had a PackedSimd dependency for the table lookup via Vector128.ShuffleNative. ProbabilisticWithAsciiCharSearchValues already had PackedSimd dispatch.

Validated: System.Memory.Tests on browser-wasm 52249/52249 passing, host arm64 52905/52906 unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both algorithms were already vectorized on Wasm via the portable
Vector128 else branch (Vector128.Widen + multiply + add), but the
result was 3-5 portable ops per iteration where PackedSimd has a
direct one-instruction equivalent.
Adler32.UpdateVector128: add a PackedSimd branch alongside Sse2 and
AdvSimd that uses PackedSimd.AddPairwiseWidening
(i16x8.extadd_pairwise_i8x16_u and i32x4.extadd_pairwise_i16x8_u) for
the s1 sum and
PackedSimd.MultiplyWideningLower/Upper + AddPairwiseWidening for the
weighted s2 sum.
XxHashShared.MultiplyWideningLower: add a PackedSimd branch that
computes { source[0]*source[1], source[2]*source[3] } via two
shuffles + i64x2.extmul_low_i32x4_u, replacing the portable
mask + 64-bit multiply pair.
Validated: System.IO.Hashing.Tests 4196/4196 passing on both host
arm64 and browser-wasm (the XxHash lane order is checked end-to-end
via the algorithm output bytes — a swap would corrupt every hash).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
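The Adler32 reduction described above can be followed with a scalar model of the widening operations. This is a sketch for intuition only: `PairwiseWideningModel` and its members are hypothetical, arrays stand in for vector lanes, and the real code accumulates into vector registers across many blocks rather than returning a single sum.

```csharp
using System;

// Hypothetical scalar model for illustration; not the Adler32 implementation.
static class PairwiseWideningModel
{
    // Model of i16x8.extadd_pairwise_i8x16_u: 16 bytes -> 8 ushorts, each the sum of two neighbours.
    public static ushort[] AddPairwiseWidening(byte[] v)
    {
        var r = new ushort[8];
        for (int i = 0; i < 8; i++)
        {
            r[i] = (ushort)(v[2 * i] + v[2 * i + 1]);
        }
        return r;
    }

    // Model of i32x4.extadd_pairwise_i16x8_u: 8 ushorts -> 4 uints.
    public static uint[] AddPairwiseWidening(ushort[] v)
    {
        var r = new uint[4];
        for (int i = 0; i < 4; i++)
        {
            r[i] = (uint)(v[2 * i] + v[2 * i + 1]);
        }
        return r;
    }

    // Model of a widening byte multiply over 8 lanes starting at 'start'
    // (start = 0 for the lower half, start = 8 for the upper half).
    static ushort[] MultiplyWidening(byte[] a, byte[] b, int start)
    {
        var r = new ushort[8];
        for (int i = 0; i < 8; i++)
        {
            r[i] = (ushort)(a[start + i] * b[start + i]); // 255 * 255 fits in 16 bits
        }
        return r;
    }

    static uint Sum(uint[] lanes) => lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // s1 contribution of one block: two pairwise widening adds, then a horizontal sum.
    public static uint ByteSum(byte[] block) =>
        Sum(AddPairwiseWidening(AddPairwiseWidening(block)));

    // s2 contribution of one block: widening multiply by the weights, then pairwise widening add.
    public static uint WeightedSum(byte[] block, byte[] weights)
    {
        uint[] lo = AddPairwiseWidening(MultiplyWidening(block, weights, 0));
        uint[] hi = AddPairwiseWidening(MultiplyWidening(block, weights, 8));
        return Sum(lo) + Sum(hi);
    }
}
```

The pair sums of the products stay within 16 bits only while the weights are small (the diff comment in this PR notes weights in the range 1-32), which is why the 16-bit intermediate is safe for the weighted sum.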
Added 4 more commits extending the Wasm Vector128 fast-path enablement:
Additional test coverage on browser-wasm (V8 v15)
The XxHash lane-swap correctness is end-to-end checked: a wrong shuffle order would corrupt every produced hash.
Remaining gap (not in this PR)
Note: The new commits in this PR were drafted with AI/Copilot assistance.
…ow/UnpackHigh

PackedSimd.Shuffle wraps i8x16.shuffle, which requires its 16 lane indices to be compile-time constants. The Mono interpreter accepted a Vector128.Create() constant operand at runtime, but Mono AOT cannot fold it and throws PlatformNotSupportedException at runtime.

The same issue was already known and avoided in TeddyHelper.RightShift1/RightShift2 (see the preceding commit on this branch): use two Vector128.ShuffleNative calls (lowering to PackedSimd.Swizzle, which returns 0 for out-of-range indices) and OR the partial results together. Apply the same pattern in Vector128.UnpackLow/UnpackHigh.

This was caught by CI as 50 GuidTests + cascaded reflection-invoke failures under the WasmTestOnChrome-MONO-ST (AOT) leg on PR dotnet#129838. On the Mono interpreter all callers (HexConverter, Guid.FormatGuid) had already been validated end-to-end.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
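The two-shuffle-plus-OR pattern for UnpackLow can be sketched with the portable Vector128.Shuffle, which also produces 0 for out-of-range indices. This is an illustration under that assumption, not the code in Vector128.cs: `UnpackSketch` is a hypothetical name, and the commit itself uses Vector128.ShuffleNative rather than Vector128.Shuffle.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration; not the internal Vector128.UnpackLow.
static class UnpackSketch
{
    private const byte Z = 0xFF; // out-of-range index: the lane becomes 0

    // Interleaves the low 8 bytes of both inputs: { l0, r0, l1, r1, ..., l7, r7 }.
    public static Vector128<byte> UnpackLow(Vector128<byte> left, Vector128<byte> right)
    {
        // Even lanes come from left; odd lanes are zeroed.
        Vector128<byte> leftIdx = Vector128.Create(
            (byte)0, Z, 1, Z, 2, Z, 3, Z, 4, Z, 5, Z, 6, Z, 7, Z);

        // Odd lanes come from right; even lanes are zeroed.
        Vector128<byte> rightIdx = Vector128.Create(
            Z, (byte)0, Z, 1, Z, 2, Z, 3, Z, 4, Z, 5, Z, 6, Z, 7);

        // Each lane is populated by exactly one of the two shuffles, so OR merges them.
        return Vector128.Shuffle(left, leftIdx) | Vector128.Shuffle(right, rightIdx);
    }
}
```

Both index vectors are single-input shuffles, so neither needs the two-vector i8x16.shuffle whose constant-index requirement caused the AOT failure described above. UnpackHigh would follow the same shape with indices 8 through 15.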
CI feedback addressed
The previous Root cause: Fix ( Unrelated: The
| [BypassReadyToRun] | ||
| private static void SetCharBit(ref uint charMap, byte value) | ||
| { | ||
| if (Sse41.IsSupported || AdvSimd.Arm64.IsSupported) | ||
| if (Sse41.IsSupported || AdvSimd.Arm64.IsSupported || PackedSimd.IsSupported) | ||
| { | ||
| Unsafe.Add(ref Unsafe.As<uint, byte>(ref charMap), value & VectorizedIndexMask) |= (byte)(1u << (value >> VectorizedIndexShift)); | ||
| } |
| // x86 can only be little endian, while ARM can be big or little endian | ||
| // so if we reached this label we need to check both combinations are supported | ||
| Debug.Assert((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported); | ||
| Debug.Assert((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported || PackedSimd.IsSupported); |
| else if (PackedSimd.IsSupported) | ||
| { | ||
| // Widening byte sum: each byte -> ushort pair sum -> uint pair sum, then accumulate into vs1. | ||
| // Because weights are all positive (1-32), unsigned byte * unsigned byte multiply is valid for vs2. | ||
| Vector128<ushort> sumPairs1 = PackedSimd.AddPairwiseWidening(bytes1); | ||
| Vector128<ushort> sumPairs2 = PackedSimd.AddPairwiseWidening(bytes2); | ||
| vs1 += PackedSimd.AddPairwiseWidening(sumPairs1) + PackedSimd.AddPairwiseWidening(sumPairs2); | ||
| | ||
| // bytes * weights -> 8 ushorts low + 8 ushorts high, sum pairwise to 4 uints + 4 uints. | ||
| Vector128<ushort> wprod1Lo = PackedSimd.MultiplyWideningLower(bytes1, tap1.AsByte()); | ||
| Vector128<ushort> wprod1Hi = PackedSimd.MultiplyWideningUpper(bytes1, tap1.AsByte()); | ||
| vs2 += PackedSimd.AddPairwiseWidening(wprod1Lo) + PackedSimd.AddPairwiseWidening(wprod1Hi); | ||
| | ||
| Vector128<ushort> wprod2Lo = PackedSimd.MultiplyWideningLower(bytes2, tap2.AsByte()); | ||
| Vector128<ushort> wprod2Hi = PackedSimd.MultiplyWideningUpper(bytes2, tap2.AsByte()); | ||
| vs2 += PackedSimd.AddPairwiseWidening(wprod2Lo) + PackedSimd.AddPairwiseWidening(wprod2Hi); | ||
| } |
…RightShift

Both helpers previously dispatched to PackedSimd via Vector128.ShuffleNative, which itself has a Ssse3 -> AdvSimd.Arm64 -> PackedSimd if/else chain. The Mono SIMD intrinsic recognizer does not always lower that chain cleanly for less-traveled paths, surfacing as NIY interpreter assertions and runtime startup failures.

Call PackedSimd.Swizzle (i8x16.swizzle) directly under the PackedSimd.IsSupported branch. The semantics are identical to ShuffleNative on Wasm (lanes with indices >= 16 become 0) but the lowering goes through a single recognized intrinsic, avoiding the dispatcher chain.

Validated: System.Memory.Tests on browser-wasm V8 interpreter 52249/52249 (covers TeddyHelper.RightShift1/2). The original NIY OutOfMemoryException:.ctor failure seen in System.Runtime.Tests with the prior ShuffleNative version is gone with this change. AOT behaviour will be re-validated by CI on push.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Enable several `Vector128` fast paths in CoreLib on browser-wasm by adding a `PackedSimd.IsSupported` branch alongside the existing `Sse2`/`Ssse3`/`AdvSimd.Arm64` gates. Before these changes, the SIMD code paths were unreachable on Wasm even though the wasm runtime supports PackedSimd (the test pipeline default sets `WasmEnableSIMD=true`).

Commits (in dependency order, bisect-safe)

1. `Vector128.UnpackLow`/`UnpackHigh`: internal helpers that previously dispatched only to `Sse2` or `AdvSimd.Arm64` and threw `NotSupportedException` otherwise. They are reachable from the encode-side hex path. Lower to `PackedSimd.Shuffle` (`i8x16.shuffle`) with a constant index vector.
2. `HexConverter.AsciiToHexVector128`, `HexConverter.EncodeToUtf8/Utf16`, `HexConverter.EncodeTo_Vector128`, and `Guid.FormatGuidVector128Utf8`. Bodies were already portable (`Vector128.ShuffleNative`, `Vector128.UnpackLow/High`, `Vector128.Shuffle` with constant indices); only the gates excluded Wasm. The decode side already included `PackedSimd` (this fixes the encode/decode asymmetry).
3. `Utf8Utility.Validation` ASCII fast path on Wasm: the inner ASCII-scan loop in `GetPointerToFirstInvalidChar` previously had `AdvSimd.Arm64` and `Sse2` branches falling back to scalar 4-DWORD-at-a-time. Add a `PackedSimd` branch using portable `Vector128.LoadUnsafe` + `ExtractMostSignificantBits` to produce the same per-byte non-ASCII bitmask.

Bodies use existing portable primitives

These helpers already used `Vector128.ShuffleNative` / `Vector128.UnpackLow/High` / `Vector128.Shuffle` with constant indices / `Vector128.ExtractMostSignificantBits`. The bodies are unchanged; only the dispatcher gates and `[CompExactlyDependsOn]` attributes are widened to include `PackedSimd`. `HexConverter.AsciiToHexVector128` and `Vector128.UnpackLow/High` keep their `AdvSimd.Arm64`/`Sse2` branches; the `else if (PackedSimd.IsSupported)` branch is selected on Wasm.

Test results

Validated against `wasm-vector128-fastpaths` head with `WasmEnableSIMD` enabled (browser-wasm test pipeline default):

- `System.Runtime.Tests` (Guid)
- `System.Runtime.Extensions.Tests` (`Convert.ToHexString`/`FromHexString`)
- `System.Memory.Tests` (Utf8 validation, Ascii fast paths, SpanHelpers)

Without commit 1 (`Vector128.UnpackLow/UnpackHigh` Wasm path), `System.Runtime.Extensions.Tests` regresses by 132 tests on browser-wasm. This is caught only when `WasmEnableSIMD=true` is set, which is why the helper needs Wasm coverage before the gates in commit 2 widen.

Baseline build: `./build.sh clr+libs+host -rc release -lc release`. Browser build: `./build.sh -os browser -c Release`. Tests run via `./dotnet.sh build /t:Test ... /p:TargetOS=browser` per `docs/workflow/testing/libraries/testing-wasm.md`.

Files changed

- `src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs` (+20 / −6)
- `src/libraries/Common/src/System/HexConverter.cs` (+6 / −3)
- `src/libraries/System.Private.CoreLib/src/System/Guid.cs` (+3 / −2)
- `src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs` (+11 / −1)

Related: there are several other Vector128 hot paths in CoreLib whose gates exclude `PackedSimd` despite portable bodies: `ProbabilisticMap`, `TeddyHelper`, `Base64{Encoder,Decoder}Helper`, `Adler32`, `XxHashShared`. Those are independent and not in this PR.

Note: This PR description and the commits in this branch were drafted with AI/Copilot assistance.