fix: encode lone UTF-16 surrogates as U+FFFD to match native bcrypt#171
Open
spokodev wants to merge 1 commit into
Open
fix: encode lone UTF-16 surrogates as U+FFFD to match native bcrypt#171spokodev wants to merge 1 commit into
spokodev wants to merge 1 commit into
Conversation
A password containing an unpaired UTF-16 surrogate was encoded as WTF-8 (e.g. ED A0 80 for U+D800) by utf8Array, whereas Buffer.from(str, "utf8"), TextEncoder, and the native C++ bcrypt binding all map a lone surrogate to the U+FFFD replacement character (EF BF BD). The differing key bytes made bcryptjs hashes of such passwords non-verifiable across implementations, contrary to the README's "Compatible to the C++ bcrypt binding" claim. Emit EF BF BD for any code unit in the surrogate range that is not part of a valid pair. utf8Length already counts a lone surrogate as 3 bytes, so the buffer sizing is unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The README states bcryptjs is "Compatible to the C++ bcrypt binding". For any password containing an unpaired UTF-16 surrogate, it is not: bcryptjs computes a different hash than native
bcrypt, so a hash produced by one cannot be verified by the other.utf8Array's 3-byteelsebranch encodes a lone surrogate as raw WTF-8. Every standards-compliant UTF-8 encoder instead substitutes the U+FFFD replacement character:Because the bytes fed to the KDF differ, the hashes do not match the native binding:
The native binding maps every lone surrogate to the same U+FFFD bytes, exactly as
Buffer.from(str, "utf8")does. bcryptjs is the only one producing distinct, non-interoperable output here.Root cause
In
utf8Array, a high+low surrogate pair is decoded correctly. Anything else withcharCodeAt >= 2048falls through to the finalelse, which 3-byte-encodes the raw code unit. For a code unit in0xD800-0xDFFFthat is not part of a valid pair, that produces WTF-8 rather than U+FFFD.Fix
Before the generic 3-byte branch, detect a code unit in the surrogate range (
(c1 & 0xf800) === 0xd800) that was not consumed as a valid pair, and emit0xEF 0xBF 0xBD(U+FFFD), matchingBuffer.from(str, "utf8")and the native binding.utf8Lengthalready counts a lone surrogate as 3 bytes, so buffer sizing is unchanged.Backwards-compatibility note
This changes the hash bcryptjs computes for the (rare, malformed) lone-surrogate password class. A previously-stored bcryptjs hash of such a password would no longer verify against the new output. The affected inputs were already non-interoperable with every other bcrypt implementation, so this aligns bcryptjs with the standard rather than breaking a working contract, but it is a behavior change worth your judgement.
Verification
Added a
compat_lone_surrogateunit test in the same style ascompat_hash, asserting bcryptjs matches the nativebcryptbinding for five lone-surrogate cases (high alone, low alone, emoji high half, surrounded by ASCII, low-then-high). It fails on the current code and passes with the fix. The fullnpm testsuite (unit + the four TypeScript projects) is green, andcompat_hashstill passes, confirming valid UTF-8, 8-bit, NUL, and 72-byte-truncation cases are unaffected.