Skip to content

[wasm] Enable Vector128 fast paths on Wasm via PackedSimd: hex/Guid, UTF-8, SearchValues, Teddy, Adler32/XXH3#129838

Open
lewing wants to merge 9 commits into
dotnet:mainfrom
lewing:wasm-vector128-fastpaths
Open

[wasm] Enable Vector128 fast paths on Wasm via PackedSimd: hex/Guid, UTF-8, SearchValues, Teddy, Adler32/XXH3#129838
lewing wants to merge 9 commits into
dotnet:mainfrom
lewing:wasm-vector128-fastpaths

Conversation

@lewing

@lewing lewing commented Jun 25, 2026

Copy link
Copy Markdown
Member

Enable several Vector128 fast paths in CoreLib on browser-wasm by adding a PackedSimd.IsSupported branch alongside the existing Sse2/Ssse3/AdvSimd.Arm64 gates. Before these changes, the SIMD code paths were unreachable on Wasm even though the wasm runtime supports PackedSimd (the test pipeline default sets WasmEnableSIMD=true).

Commits (in dependency order, bisect-safe)

  1. Add Wasm PackedSimd path to Vector128.UnpackLow/UnpackHigh — internal helpers that previously dispatched only to Sse2 or AdvSimd.Arm64 and threw NotSupportedException otherwise. They are reachable from the encode-side hex path. Lower to PackedSimd.Shuffle (i8x16.shuffle) with a constant index vector.
  2. Enable Vector128 hex/Guid format fast path on WasmHexConverter.AsciiToHexVector128, HexConverter.EncodeToUtf8/Utf16, HexConverter.EncodeTo_Vector128, and Guid.FormatGuidVector128Utf8. Bodies were already portable (Vector128.ShuffleNative, Vector128.UnpackLow/High, Vector128.Shuffle with constant indices) — only the gates excluded Wasm. The decode side already included PackedSimd (this fixes the encode/decode asymmetry).
  3. Vectorize Utf8Utility.Validation ASCII fast path on Wasm — the inner ASCII-scan loop in GetPointerToFirstInvalidChar previously had AdvSimd.Arm64 and Sse2 branches falling back to scalar 4-DWORD-at-a-time. Add a PackedSimd branch using portable Vector128.LoadUnsafe + ExtractMostSignificantBits to produce the same per-byte non-ASCII bitmask.

Bodies use existing portable primitives

These helpers already used Vector128.ShuffleNative / Vector128.UnpackLow/High / Vector128.Shuffle with constant indices / Vector128.ExtractMostSignificantBits. The bodies are unchanged; only the dispatcher gates and [CompExactlyDependsOn] attributes are widened to include PackedSimd. HexConverter.AsciiToHexVector128 and Vector128.UnpackLow/High keep their AdvSimd.Arm64/Sse2 branches; the else if (PackedSimd.IsSupported) branch is selected on Wasm.

Test results

Validated against wasm-vector128-fastpaths head with WasmEnableSIMD enabled (browser-wasm test pipeline default):

Suite Host arm64 browser-wasm (V8 v15)
System.Runtime.Tests (Guid) 69,714 / 69,714 ✅ 67,835 / 67,835 ✅
System.Runtime.Extensions.Tests (Convert.ToHexString/FromHexString) 8,350 / 8,350 ✅ 8,224 / 8,224
System.Memory.Tests (Utf8 validation, Ascii fast paths, SpanHelpers) 52,906 / 52,906 ✅ 52,249 / 52,249 ✅

Without commit 1 (Vector128.UnpackLow/UnpackHigh Wasm path), System.Runtime.Extensions.Tests regresses by 132 tests on browser-wasm — caught only when WasmEnableSIMD=true is set, which is why the helper needs Wasm coverage before the gates in commit 2 widen.

Baseline build: ./build.sh clr+libs+host -rc release -lc release. Browser build: ./build.sh -os browser -c Release. Tests run via ./dotnet.sh build /t:Test ... /p:TargetOS=browser per docs/workflow/testing/libraries/testing-wasm.md.

Files changed

  • src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs (+20 / −6)
  • src/libraries/Common/src/System/HexConverter.cs (+6 / −3)
  • src/libraries/System.Private.CoreLib/src/System/Guid.cs (+3 / −2)
  • src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs (+11 / −1)

Related: there are several other Vector128 hot paths in CoreLib whose gates exclude PackedSimd despite portable bodies — ProbabilisticMap, TeddyHelper, Base64{Encoder,Decoder}Helper, Adler32, XxHashShared. Those are independent and not in this PR.

Note

This PR description and the commits in this branch were drafted with AI/Copilot assistance.

lewing and others added 3 commits June 24, 2026 23:54
These internal helpers are used by HexConverter.AsciiToHexVector128
and other byte-interleaving code paths. They previously dispatched
only to Sse2.UnpackLow/UnpackHigh or AdvSimd.Arm64.ZipLow/ZipHigh
and threw NotSupportedException on platforms without either ISA.

With the recent change that enables HexConverter and Guid format on
Wasm via PackedSimd, the helpers became reachable on browser-wasm
and started throwing at runtime in libraries tests.

Lower to PackedSimd.Shuffle with a constant 16-byte index vector
(i8x16.shuffle) when PackedSimd is supported. Validated via
System.Runtime.Extensions.Tests on browser-wasm (8224 passing, 0
failed) after the previous run failed 132 tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HexConverter.AsciiToHexVector128, HexConverter.EncodeTo_Vector128 and
Guid.FormatGuidVector128Utf8 already use only portable Vector128 ops
(Vector128.ShuffleNative, Vector128.UnpackLow/High, Vector128.Shuffle
with constant indices) plus an optional AdvSimd.Arm64-specific branch.
The gates at Convert.ToHexString, EncodeToUtf8/Utf16, and Guid.ToString
required Ssse3 or AdvSimd.Arm64, so Wasm fell back to scalar even with
PackedSimd.

Add PackedSimd.IsSupported to the gates and the [CompExactlyDependsOn]
attributes on the helpers. The bodies are unchanged; on Wasm the
existing else branch (portable Vector128.Shuffle) is selected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetPointerToFirstInvalidChar's inner ASCII-scan loop dispatched on
AdvSimd.Arm64 (with bitmask128) or Sse2 (with MoveMask), falling back
to a scalar 4-DWORD-at-a-time path otherwise. On Wasm with PackedSimd,
neither SIMD branch was taken, so UTF-8 validation took the scalar
path.

Add a PackedSimd.IsSupported branch that uses portable
Vector128.LoadUnsafe + ExtractMostSignificantBits to compute the same
per-byte non-ASCII bitmask used by the Sse2 path. Update the post-loop
Debug.Assert to include PackedSimd.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 25, 2026 04:54
@dotnet-policy-service

Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

@lewing lewing changed the title Enable Vector128 fast paths on Wasm: hex/Guid format and Utf8 ASCII validation [wasm] Enable Vector128 fast paths on Wasm: hex/Guid format and Utf8 ASCII validation Jun 25, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enables existing Vector128-based fast paths to run on browser-wasm by widening the feature gates to include System.Runtime.Intrinsics.Wasm.PackedSimd and adding a PackedSimd.Shuffle implementation for Vector128.UnpackLow/UnpackHigh so dependent encode/format paths don’t fall into NotSupportedException on Wasm SIMD-enabled runs.

Changes:

  • Add PackedSimd.IsSupported implementations for Vector128.UnpackLow / UnpackHigh using PackedSimd.Shuffle (two-vector shuffle).
  • Widen existing hex + Guid formatting SIMD gates / [CompExactlyDependsOn] to include PackedSimd.
  • Add a PackedSimd vectorized ASCII-scan path in Utf8Utility.Validation using Vector128.LoadUnsafe + ExtractMostSignificantBits.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs Adds PackedSimd support to UnpackLow/UnpackHigh to avoid Wasm falling into the unsupported path.
src/libraries/Common/src/System/HexConverter.cs Expands encode-side Vector128 gate/attributes to include PackedSimd for Wasm SIMD.
src/libraries/System.Private.CoreLib/src/System/Guid.cs Expands Guid vectorized formatting gate/attributes to include PackedSimd on little-endian.
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs Adds a PackedSimd ASCII-validation fast path using MSB extraction.

lewing and others added 4 commits June 25, 2026 00:07
TranscodeToUtf8's 8-char ASCII fast loop used a Vector128<short> read,
a mask-and-compare to detect non-ASCII, and a narrow-and-store of 8
bytes using Sse2.PackUnsignedSaturate / AdvSimd.ExtractNarrowingSatura
teUnsignedLower. Two follow-on 4-char sites narrowed 4 bytes the same
way. All four sites required Sse41.X64 or AdvSimd.Arm64 + LE, so Wasm
took the 4-DWORD-at-a-time scalar fallback.

Add PackedSimd branches at every dispatch site:
 - Outer entry gate (declaration + entry condition)
 - 8-char narrow-store: use the existing portable AND-compare for the
   non-ASCII test (same code Sse41 already uses) and PackedSimd.Convert
   NarrowingSaturateUnsigned + scalar extract for the store
 - 4-char narrow-stores: PackedSimd.ConvertNarrowingSaturateUnsigned +
   AsUInt32().ToScalar() unaligned write

The Sse2.X64.ConvertToUInt64 sub-branch already had an else path that
calls AsUInt64().ToScalar(), which works on Wasm without changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SearchValues<string> with 2+ values previously selected the
Aho-Corasick implementation on Wasm because the Teddy entry gate in
StringSearchValues.cs required Ssse3 or AdvSimd.Arm64. Teddy's core
Vector128 primitives in TeddyHelper.cs (LoadAndPack16AsciiChars, the
nibble GetNibbles helper, the two-table Shuffle, and RightShift1/2)
similarly excluded PackedSimd.

Add PackedSimd branches throughout:
 - LoadAndPack16AsciiChars: PackedSimd.ConvertNarrowingSaturateUnsigned
 - GetNibbles: PackedSimd needs the explicit '& 0xF' on the low half
   because Swizzle returns 0 for indices >= 16 (unlike Ssse3's
   implicit AND of the low 4 bits)
 - Shuffle: already uses portable Vector128.ShuffleNative which maps
   to PackedSimd.Swizzle; just widen the [CompExactlyDependsOn]
 - RightShift1/RightShift2: compose two Vector128.ShuffleNative calls
   with constant index vectors and OR the halves. PackedSimd.Shuffle
   (two-vector i8x16.shuffle) is impractical due to constant lane
   index requirements; Swizzle clamps out-of-range to 0 which makes
   the OR safe.

Widen the entry gate in StringSearchValues.cs.CreateFromNormalizedV
alues and the null-char filter in TryGetTeddyAcceleratedValues
(PackedSimd shares Ssse3's PackUnsignedSaturate behavior where signed
negative inputs become 0, so null-containing needles produce more
false positives on both).

Widen [CompExactlyDependsOn] on the IndexOfAnyN2/N3 + Vector128
helpers in AsciiStringSearchValuesTeddyBase.cs.

Validated: System.Memory.Tests on browser-wasm 52249/52249 passing
(covers SearchValues<string> Teddy paths via StringSearchValues
tests), host 52905/52906 unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SearchValues<char> with values that span more than the ASCII range
selects a ProbabilisticMap-based search. The vectorized IndexOfAny /
LastIndexOfAny path (using ContainsMask16Chars + IsCharBitNotSet)
was previously gated on Sse41 || AdvSimd.Arm64 only, so on Wasm the
search fell back to the scalar SimpleLoop even when PackedSimd was
available.

This change is subtler than the other enablement PRs because the
*layout* of the ProbabilisticMap bitmap also branches on the same
gate (SetCharBit/IsCharBitSet at the top of the file). The
[BypassReadyToRun] comment there warns that the construction and
lookup branches must agree at all times during program execution.
Widen all three gates (SetCharBit/IsCharBitSet, ContainsMask16Chars,
the IndexOfAny/LastIndexOfAny entry dispatcher, and the [CompExactly
DependsOn] on the Vector128 worker methods) to include PackedSimd
consistently.

ContainsMask16Chars gets a PackedSimd branch that mirrors the Sse2
algorithm using PackedSimd.ConvertNarrowingSaturateUnsigned for the
two-vector narrowing step. IsCharBitNotSet already had a PackedSimd
dependency for the table lookup via Vector128.ShuffleNative.

ProbabilisticWithAsciiCharSearchValues already had PackedSimd dispatch.

Validated: System.Memory.Tests on browser-wasm 52249/52249 passing,
host arm64 52905/52906 unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both algorithms were already vectorized on Wasm via the portable
Vector128 else branch (Vector128.Widen + multiply + add), but the
result was 3-5 portable ops per iteration where PackedSimd has a
direct one-instruction equivalent.

Adler32.UpdateVector128: add a PackedSimd branch alongside Sse2 and
AdvSimd that uses PackedSimd.AddPairwiseWidening (i16x8.extadd_pair
wise_i8x16_u and i32x4.extadd_pairwise_i16x8_u) for the s1 sum and
PackedSimd.MultiplyWideningLower/Upper + AddPairwiseWidening for the
weighted s2 sum.

XxHashShared.MultiplyWideningLower: add a PackedSimd branch that
computes { source[0]*source[1], source[2]*source[3] } via two
shuffles + i64x2.extmul_low_i32x4_u, replacing the portable
mask + 64-bit multiply pair.

Validated: System.IO.Hashing.Tests 4196/4196 passing on both host
arm64 and browser-wasm (the XxHash lane order is checked end-to-end
via the algorithm output bytes — a swap would corrupt every hash).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lewing

lewing commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

Added 4 more commits extending the Wasm Vector128 fast-path enablement:

  • da08d905c6eUtf8Utility.Transcoding: vectorize the 8-char ASCII fast path in TranscodeToUtf8 and the two follow-on 4-char narrow-stores using PackedSimd.ConvertNarrowingSaturateUnsigned.
  • 2caa5e25997 — Enable Teddy multi-string search on Wasm. Adds PackedSimd dispatch to TeddyHelper.LoadAndPack16AsciiChars (ConvertNarrowingSaturateUnsigned), GetNibbles, the two-table Shuffle, and RightShift1/RightShift2 (two Vector128.ShuffleNative calls + OR — PackedSimd.Shuffle is impractical due to constant lane index requirements). Widens the entry gate in StringSearchValues.CreateFromNormalizedValues, the null-char filter in TryGetTeddyAcceleratedValues, and the [CompExactlyDependsOn] attributes on IndexOfAnyN2/N3 in AsciiStringSearchValuesTeddyBase.cs.
  • 819c9fe3960ProbabilisticMap vectorized SearchValues. Subtle because the layout of the bitmap also branches on the gate (SetCharBit/IsCharBitSet marked [BypassReadyToRun]). Widens the layout choice, ContainsMask16Chars (new PackedSimd branch using ConvertNarrowingSaturateUnsigned), the IndexOfAny/LastIndexOfAny entry dispatcher, and the [CompExactlyDependsOn] on the Vector128 worker methods, consistently.
  • 80522ca629bAdler32.UpdateVector128 and XxHashShared.MultiplyWideningLower: replace the portable Vector128.Widen + multiply + add sequences with PackedSimd.AddPairwiseWidening + MultiplyWideningLower/Upper and PackedSimd.MultiplyWideningLower (i64x2.extmul_low_i32x4_u). Both were already vectorized on Wasm via the generic else branch; this turns 3–5 ops/iter into 1.

Additional test coverage on browser-wasm (V8 v15)

Suite Result
System.Memory.Tests (post-Teddy, post-ProbabilisticMap) 52,249 / 52,249 ✅
System.Runtime.Tests (post-Utf8Transcoding) 67,835 / 67,835 ✅
System.IO.Hashing.Tests (post-Adler+XxHash) 4,196 / 4,196 ✅
Host arm64 regression checks All matched baseline counts ✅

The XxHash lane-swap correctness is end-to-end checked: a wrong shuffle order would corrupt every produced hash.

Remaining gap (not in this PR)

  • Base64DecoderHelper / Base64EncoderHelper — need pmaddubsw and pmulhuw analogs composed from MultiplyWideningLower/Upper + AddPairwiseWidening or Dot, with careful 8-short lane preservation. Worth its own PR with a benchmark to compare against the scalar path.

Note

The new commits in this PR were drafted with AI/Copilot assistance.

@lewing lewing changed the title [wasm] Enable Vector128 fast paths on Wasm: hex/Guid format and Utf8 ASCII validation [wasm] Enable Vector128 fast paths on Wasm via PackedSimd: hex/Guid, UTF-8, SearchValues, Teddy, Adler32/XXH3 Jun 25, 2026
@lewing lewing requested a review from tannergooding June 25, 2026 05:42
…ow/UnpackHigh

PackedSimd.Shuffle wraps i8x16.shuffle which requires its 16 lane
indices to be compile-time constants. Mono interpreter accepted a
Vector128.Create() constant operand at runtime, but Mono AOT cannot
fold it and throws PlatformNotSupportedException at runtime.

The same impact was already known and avoided in TeddyHelper.Right
Shift1/RightShift2 (see preceding commit on this branch) — use two
Vector128.ShuffleNative calls (lowering to PackedSimd.Swizzle, which
clamps out-of-range indices to 0) and OR the partial results
together. Apply the same pattern in Vector128.UnpackLow/UnpackHigh.

This was caught by CI as 50 GuidTests + cascaded reflection-invoke
failures under the WasmTestOnChrome-MONO-ST (AOT) leg on PR dotnet#129838.
On Mono interpreter all callers (HexConverter, Guid.FormatGuid) had
already been validated end-to-end.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 25, 2026 12:00
@lewing

lewing commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

CI feedback addressed

The previous WasmTestOnChrome-MONO-ST (AOT) leg failed 50 tests in System.Runtime.Tests with PlatformNotSupportedException from Guid formatting + cascaded reflection-invoke failures (StringBuilder.AppendFormat, String.Format, StrongNameKeyPair, etc.).

Root cause: Vector128.UnpackLow/UnpackHigh's PackedSimd branch used PackedSimd.Shuffle (two-vector i8x16.shuffle), whose 16 lane indices must be compile-time constants. The Mono interpreter accepted the Vector128.Create() constant at runtime; the Mono AOT compiler cannot fold it and throws PNS.

Fix (731effa2b3a): replace with two Vector128.ShuffleNative (single-vector, lowers to PackedSimd.Swizzle which clamps out-of-range indices to 0) plus an OR — the same pattern that already worked in TeddyHelper.RightShift1/RightShift2 earlier in this branch.

Unrelated: The Libraries Test Run release coreclr windows x86 Release leg failed two System.Net.Http.Functional.Tests.SocketsHttpHandler_DiagnosticsTest.SendAsync_Success_ConnectionSetupActivityGraphRecorded cases (HTTP networking, untouched by this PR — likely flaky).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comment on lines 80 to 86
[BypassReadyToRun]
private static void SetCharBit(ref uint charMap, byte value)
{
if (Sse41.IsSupported || AdvSimd.Arm64.IsSupported)
if (Sse41.IsSupported || AdvSimd.Arm64.IsSupported || PackedSimd.IsSupported)
{
Unsafe.Add(ref Unsafe.As<uint, byte>(ref charMap), value & VectorizedIndexMask) |= (byte)(1u << (value >> VectorizedIndexShift));
}
Comment on lines 193 to +195
// x86 can only be little endian, while ARM can be big or little endian
// so if we reached this label we need to check both combinations are supported
Debug.Assert((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported);
Debug.Assert((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported || PackedSimd.IsSupported);
Comment on lines +304 to +320
else if (PackedSimd.IsSupported)
{
// Widening byte sum: each byte -> ushort pair sum -> uint pair sum, then accumulate into vs1.
// Because weights are all positive (1-32), unsigned byte * unsigned byte multiply is valid for vs2.
Vector128<ushort> sumPairs1 = PackedSimd.AddPairwiseWidening(bytes1);
Vector128<ushort> sumPairs2 = PackedSimd.AddPairwiseWidening(bytes2);
vs1 += PackedSimd.AddPairwiseWidening(sumPairs1) + PackedSimd.AddPairwiseWidening(sumPairs2);

// bytes * weights -> 8 ushorts low + 8 ushorts high, sum pairwise to 4 uints + 4 uints.
Vector128<ushort> wprod1Lo = PackedSimd.MultiplyWideningLower(bytes1, tap1.AsByte());
Vector128<ushort> wprod1Hi = PackedSimd.MultiplyWideningUpper(bytes1, tap1.AsByte());
vs2 += PackedSimd.AddPairwiseWidening(wprod1Lo) + PackedSimd.AddPairwiseWidening(wprod1Hi);

Vector128<ushort> wprod2Lo = PackedSimd.MultiplyWideningLower(bytes2, tap2.AsByte());
Vector128<ushort> wprod2Hi = PackedSimd.MultiplyWideningUpper(bytes2, tap2.AsByte());
vs2 += PackedSimd.AddPairwiseWidening(wprod2Lo) + PackedSimd.AddPairwiseWidening(wprod2Hi);
}
…RightShift

Both helpers previously dispatched to PackedSimd via
Vector128.ShuffleNative, which itself has a Ssse3 -> AdvSimd.Arm64 ->
PackedSimd if/else chain. The Mono SIMD intrinsic recognizer does not
always lower that chain cleanly for less-traveled paths, surfacing as
NIY interpreter assertions and runtime startup failures.

Call PackedSimd.Swizzle (i8x16.swizzle) directly under the
PackedSimd.IsSupported branch. The semantics are identical to
ShuffleNative on Wasm (clamps indices >= 16 to 0) but the lowering
goes through a single recognized intrinsic, avoiding the dispatcher
chain.

Validated: System.Memory.Tests on browser-wasm V8 interpreter
52249/52249 (covers TeddyHelper.RightShift1/2). The original NIY
OutOfMemoryException:.ctor failure seen in System.Runtime.Tests with
the prior ShuffleNative version is gone with this change. AOT
behaviour will be re-validated by CI on push.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants