
fix: stop no-dash SSN branch matching digit runs inside hex IDs #163

Open
sidmohan0 wants to merge 2 commits into dev from fix/ssn-no-dash-hex-false-positive

@sidmohan0 sidmohan0 commented Jul 3, 2026

Contributor

Summary

  • The no-dash SSN alternative matched any bare nine-digit run whose neighbors were merely non-digits, so digit runs embedded in longer alphanumeric tokens (random hex IDs, UUID segments) were flagged as SSNs. In the field this made the Claude Code datafog-hook PII firewall block an entire preview session whose randomly generated server ID contained such a run.
  • Tightened the no-dash branch only: the run must not be followed by a letter, and must start at a non-alphanumeric boundary or immediately after an uppercase DE token prefix — the one letter-prefixed shape pinned for v4.4.0 DE_VAT_ID parity (test_ssn_detection_keeps_v44_behavior_for_country_prefixed_digits). The prefix is case-sensitive because lowercase de is a hex byte and would reopen the hex-ID false positive. The dashed branch is unchanged.
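For readers without the diff open, here is a minimal sketch of the boundary change described above. The actual pattern is not quoted in this description, so `OLD_NO_DASH` and `NEW_NO_DASH` are hypothetical reconstructions of the no-dash branch only. The sketch compiles without `IGNORECASE` to keep the `DE` prefix case-sensitive, and it assumes the prefix check looks only at the two characters before the run.

```python
import re

# Hypothetical reconstruction of the old no-dash branch:
# the boundaries only exclude adjacent digits.
OLD_NO_DASH = re.compile(r"(?<!\d)\d{9}(?!\d)")

# Hypothetical reconstruction of the tightened branch.
NEW_NO_DASH = re.compile(
    r"""
    (?:
        (?<![A-Za-z0-9])   # start at a non-alphanumeric boundary
      | (?<=DE)            # or immediately after an uppercase DE prefix
    )
    \d{9}
    (?![A-Za-z0-9])        # not followed by a letter or another digit
    """,
    re.VERBOSE,
)

samples = [
    "server id 7f3a123456789bc2",  # digit run inside a hex ID
    "ssn 123456789 on file",       # standalone nine-digit number
    "DE123456789",                 # uppercase DE prefix (v4.4.0 parity shape)
    "de123456789",                 # lowercase de, which is also a hex byte
]

for text in samples:
    old_hit = OLD_NO_DASH.search(text) is not None
    new_hit = NEW_NO_DASH.search(text) is not None
    print(f"{text!r}: old={old_hit} new={new_hit}")
```

Under these assumptions the old pattern flags all four samples, while the tightened one flags only the standalone number and the uppercase `DE` shape.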

Review notes

An adversarial regex review confirmed: fixed-width lookbehinds are valid, IGNORECASE/VERBOSE interactions are correct, no catastrophic-backtracking risk (tested on 200k-char hex blobs), and the dashed branch is semantically identical to before. Its one HIGH finding — the initial generic two-letter-prefix exception matching ID.../PO.../lowercase de... — is addressed in the second commit by scoping to case-sensitive DE.
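The HIGH finding can be reproduced with a small sketch. Both patterns below are hypothetical stand-ins for the two commits, not the code in the diff. The sketch assumes the pattern is compiled with `IGNORECASE | VERBOSE`, and it uses Python's scoped inline flag `(?-i:...)` (available since Python 3.6) as one way to keep the `DE` prefix case-sensitive. The PR may achieve that differently.

```python
import random
import re

FLAGS = re.IGNORECASE | re.VERBOSE

# Hypothetical first-commit shape: any two-letter prefix is accepted,
# and IGNORECASE widens [A-Z] to lowercase letters as well.
GENERIC_PREFIX = re.compile(
    r"""
    (?: (?<![A-Za-z0-9]) | (?<=[A-Z]{2}) )
    \d{9}
    (?![A-Za-z0-9])
    """,
    FLAGS,
)

# Hypothetical second-commit shape: (?-i:DE) switches IGNORECASE off
# for the prefix only, so lowercase "de" no longer qualifies.
SCOPED_DE = re.compile(
    r"""
    (?: (?<![A-Za-z0-9]) | (?<=(?-i:DE)) )
    \d{9}
    (?![A-Za-z0-9])
    """,
    FLAGS,
)

for text in ["ID123456789", "PO123456789", "de123456789", "DE123456789"]:
    generic = GENERIC_PREFIX.search(text) is not None
    scoped = SCOPED_DE.search(text) is not None
    print(f"{text}: generic={generic} scoped={scoped}")

# Rough stand-in for the large-input check: a seeded 200k-character
# lowercase hex blob should produce no match with the scoped pattern.
rng = random.Random(0)
blob = "".join(rng.choice("0123456789abcdef") for _ in range(200_000))
print("hex blob match:", SCOPED_DE.search(blob))
```

The generic variant accepts all four prefixed strings, whereas the scoped variant accepts only `DE123456789`. The blob search is a correctness check only and says nothing about timing on any particular machine.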

Test plan

  • New regression test: nine-digit runs inside hex/UUID-style tokens and behind generic two-letter record prefixes (ID, PO, lowercase de) are not flagged
  • New test: standalone nine-digit number still flagged
  • Existing SSN parametrized cases and v4.4.0 DE-prefix regression test pass unchanged
  • Full suite: 625 passed; the 4 failures + 1 error also occur on clean dev locally (missing spaCy model, environmental)
  • isort/black/flake8 clean
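The regression cases in the test plan could look roughly like the sketch below. It is self-contained rather than wired into the project's test suite: `NO_DASH_SSN` and `flags_ssn` are hypothetical stand-ins for the library's detector, and the sample strings are made up for illustration. The functions use plain asserts, so pytest would collect them, but they also run when called directly.

```python
import re

# Hypothetical stand-in for the tightened no-dash SSN branch.
NO_DASH_SSN = re.compile(
    r"(?:(?<![A-Za-z0-9])|(?<=(?-i:DE)))\d{9}(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def flags_ssn(text: str) -> bool:
    """Return True if the hypothetical no-dash branch matches anywhere."""
    return NO_DASH_SSN.search(text) is not None


NOT_FLAGGED = [
    "srv-7f3a123456789bc2",                  # run inside a hex ID
    "0f8fad5b-d9cb-469f-a165-123456789abc",  # run inside a UUID-style segment
    "ID123456789",                           # generic two-letter prefix
    "PO123456789",                           # generic two-letter prefix
    "de123456789",                           # lowercase de (hex byte)
]

STILL_FLAGGED = [
    "123456789",        # standalone nine-digit number
    "SSN: 123456789.",  # standalone, surrounded by punctuation
    "DE123456789",      # uppercase DE prefix, v4.4.0 parity shape
]


def test_embedded_runs_not_flagged():
    for text in NOT_FLAGGED:
        assert not flags_ssn(text), text


def test_standalone_and_de_prefix_still_flagged():
    for text in STILL_FLAGGED:
        assert flags_ssn(text), text
```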

sidmohan0 added 2 commits July 3, 2026 12:21
A bare nine-digit run embedded in a longer alphanumeric token (random
hex IDs, UUID segments) matched the SSN pattern because its boundaries
only excluded adjacent digits. In practice this let randomly generated
server IDs trip the Claude Code PII firewall hook and block entire
sessions.

Tighten the no-dash branch only: the run must not be followed by a
letter, and must start at a non-alphanumeric boundary or right after a
two-letter token prefix (preserving v4.4.0 country-code parity, e.g.
DE123456789). The dashed branch keeps its existing boundaries.

Review found the two-letter-prefix exception matched any two-letter
token (ID, PO, ...), and via IGNORECASE also lowercase "de" — a hex
byte, which would reopen the random-hex-ID false positive. The v4.4.0
parity test only pins uppercase DE (DE_VAT_ID overlap), so restrict the
lookbehind to a case-sensitive DE prefix and cover generic and
lowercase prefixes in the regression test.