Skip to content

fix: encode lone UTF-16 surrogates as U+FFFD to match native bcrypt#171

Open
spokodev wants to merge 1 commit into
dcodeIO:mainfrom
spokodev:fix/utf8-lone-surrogate-replacement
Open

fix: encode lone UTF-16 surrogates as U+FFFD to match native bcrypt#171
spokodev wants to merge 1 commit into
dcodeIO:mainfrom
spokodev:fix/utf8-lone-surrogate-replacement

Conversation

@spokodev

Copy link
Copy Markdown

Problem

The README states bcryptjs is "Compatible to the C++ bcrypt binding". For any password containing an unpaired UTF-16 surrogate, it is not: bcryptjs computes a different hash than native bcrypt, so a hash produced by one cannot be verified by the other.

utf8Array's 3-byte else branch encodes a lone surrogate as raw WTF-8. Every standards-compliant UTF-8 encoder instead substitutes the U+FFFD replacement character:

[...Buffer.from("\uD800", "utf8")]            // [0xEF, 0xBF, 0xBD]  (U+FFFD)
[...new TextEncoder().encode("\uD800")]       // [0xEF, 0xBF, 0xBD]  (U+FFFD)
// bcryptjs utf8Array("\uD800")               // [0xED, 0xA0, 0x80]  (WTF-8)

Because the bytes fed to the KDF differ, the hashes do not match the native binding:

const bcrypt = require("bcrypt");      // native C++ binding
const bcryptjs = require("bcryptjs");
const salt = "$2b$06$DCq7YPn5Rq63x1Lad4cll.";

bcrypt.hashSync("\uD800", salt);
// $2b$06$DCq7YPn5Rq63x1Lad4cll.nt6EZli4Z4eGj7C9JUnk.i4ERs/EVsq
bcryptjs.hashSync("\uD800", salt);
// $2b$06$DCq7YPn5Rq63x1Lad4cll.AEJkiq9W7zpkNhET9G4TKw3czJdIZTu   (mismatch)

The native binding maps every lone surrogate to the same U+FFFD bytes, exactly as Buffer.from(str, "utf8") does. bcryptjs is the only one producing distinct, non-interoperable output here.

Root cause

In utf8Array, a high+low surrogate pair is decoded correctly. Anything else with charCodeAt >= 2048 falls through to the final else, which 3-byte-encodes the raw code unit. For a code unit in 0xD800-0xDFFF that is not part of a valid pair, that produces WTF-8 rather than U+FFFD.

Fix

Before the generic 3-byte branch, detect a code unit in the surrogate range ((c1 & 0xf800) === 0xd800) that was not consumed as a valid pair, and emit 0xEF 0xBF 0xBD (U+FFFD), matching Buffer.from(str, "utf8") and the native binding. utf8Length already counts a lone surrogate as 3 bytes, so buffer sizing is unchanged.

Backwards-compatibility note

This changes the hash bcryptjs computes for the (rare, malformed) lone-surrogate password class. A previously-stored bcryptjs hash of such a password would no longer verify against the new output. The affected inputs were already non-interoperable with every other bcrypt implementation, so this aligns bcryptjs with the standard rather than breaking a working contract, but it is a behavior change worth your judgement.

Verification

Added a compat_lone_surrogate unit test in the same style as compat_hash, asserting bcryptjs matches the native bcrypt binding for five lone-surrogate cases (high alone, low alone, emoji high half, surrounded by ASCII, low-then-high). It fails on the current code and passes with the fix. The full npm test suite (unit + the four TypeScript projects) is green, and compat_hash still passes, confirming valid UTF-8, 8-bit, NUL, and 72-byte-truncation cases are unaffected.

A password containing an unpaired UTF-16 surrogate was encoded as WTF-8
(e.g. ED A0 80 for U+D800) by utf8Array, whereas Buffer.from(str, "utf8"),
TextEncoder, and the native C++ bcrypt binding all map a lone surrogate to
the U+FFFD replacement character (EF BF BD). The differing key bytes made
bcryptjs hashes of such passwords non-verifiable across implementations,
contrary to the README's "Compatible to the C++ bcrypt binding" claim.

Emit EF BF BD for any code unit in the surrogate range that is not part of
a valid pair. utf8Length already counts a lone surrogate as 3 bytes, so the
buffer sizing is unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant