feat: allowlist support, presidio entity aliases, py.typed (4.7.0) #159
Merged
Adds allowlist (exact values) and allowlist_patterns (full-match regexes) to scan/redact and threads them through both agent adapters: DATAFOG_HOOK_ALLOWLIST / DATAFOG_HOOK_ALLOWLIST_PATTERNS env vars for the Claude Code hook, allowlist/allowlist_patterns params for the LiteLLM guardrail. Motivated by a day of dogfooding: unix timestamps and numeric IDs match the PHONE pattern, and intentional identifiers (own support email, doc placeholders) should be exemptable. Accepts presidio-style entity names (EMAIL_ADDRESS, US_SSN) as input aliases via the existing canonical type map, ships a py.typed marker so downstream type checkers see our annotations, and backports the upstream-review fixes to the in-repo litellm adapter (guardrail spans recorded on the returned dict, redaction reported as intervention). Also corrects an entity-name documentation error introduced in #156: the scan API returns DATE and ZIP_CODE (DOB/ZIP are input aliases).
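The allowlist semantics described above (exact values plus full-match regexes) can be illustrated with a small standalone sketch. This is not datafog's implementation: the `filter_findings` helper and the dict-of-lists findings shape are assumptions made for illustration only.

```python
import re
from typing import Iterable


def filter_findings(
    findings: dict[str, list[str]],
    allowlist: Iterable[str] = (),
    allowlist_patterns: Iterable[str] = (),
) -> dict[str, list[str]]:
    """Drop findings whose text is allowlisted; keep everything else.

    Hypothetical helper: the findings shape (entity type -> matched strings)
    is an assumption, not the library's actual return type.
    """
    exact = set(allowlist)  # exact, case-sensitive comparison
    compiled = [re.compile(p) for p in allowlist_patterns]
    kept: dict[str, list[str]] = {}
    for entity_type, values in findings.items():
        survivors = [
            v
            for v in values
            # fullmatch: the pattern must cover the whole entity text,
            # so a partial match never suppresses a finding.
            if v not in exact and not any(rx.fullmatch(v) for rx in compiled)
        ]
        if survivors:
            kept[entity_type] = survivors
    return kept


findings = {
    "PHONE": ["1719878400", "555-867-5309"],  # first value is a unix timestamp
    "EMAIL": ["support@example.com", "jane@example.org"],
}
filtered = filter_findings(
    findings,
    allowlist=["support@example.com"],
    allowlist_patterns=[r"1[67]\d{8}"],  # 10-digit timestamps starting 16/17
)
print(filtered)
# {'PHONE': ['555-867-5309'], 'EMAIL': ['jane@example.org']}
```

The timestamp and the operator's own support address are suppressed, while the real phone number and the unrelated email survive. A pattern such as `\d{3}` would not suppress `555-867-5309`, because it matches only part of the entity text.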
Review findings: reject quantified groups containing nested quantifiers at compile time (catastrophic backtracking on attacker-influenced entity text), cap pattern length at 512 chars, and skip pattern matching for entities longer than 512 chars (fail-safe: the finding is kept). Match semantics documented as case-sensitive with no Unicode normalization; allowlist entries are operator configuration, never end-user input. Adds regression tests for the rejection heuristic, the smart-engine path, and the redact(entities=..., allowlist=...) guard. Replaces a walrus assignment with a plain one in the litellm adapter.
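A minimal sketch of the guards listed in the review findings follows. The function names, constants, and the textual heuristic are all assumptions for illustration; the actual rejection heuristic in the PR may differ, and this coarse version is deliberately conservative (it can reject some safe patterns and does not claim to catch every pathological one).

```python
import re

MAX_PATTERN_LENGTH = 512  # cap stated in the review findings
MAX_ENTITY_LENGTH = 512   # entities longer than this skip pattern matching

# Coarse textual heuristic (an assumption, not the PR's code): a group whose
# body contains + or * and which is itself followed by +, * or {.
# Flags shapes like (a+)+ or (?:\d*)*.
_NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*{]")


def compile_allowlist_pattern(pattern: str) -> re.Pattern[str]:
    """Validate and compile one allowlist pattern, raising ValueError if unsafe."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError(f"allowlist pattern exceeds {MAX_PATTERN_LENGTH} chars")
    if _NESTED_QUANTIFIER.search(pattern):
        raise ValueError("allowlist pattern has a quantified group with a nested quantifier")
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid allowlist pattern: {exc}") from exc


def is_allowlisted(entity_text: str, compiled: list[re.Pattern[str]]) -> bool:
    """Return True only if some pattern matches the whole entity text."""
    if len(entity_text) > MAX_ENTITY_LENGTH:
        # Fail-safe: skip matching entirely, so the finding is kept.
        return False
    return any(rx.fullmatch(entity_text) for rx in compiled)
```

Rejecting at compile time keeps the cost of a bad pattern at configuration load rather than at match time, when the entity text may be attacker-influenced.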
This was referenced Jul 2, 2026
Merged
What this is
The engine work for a fast 4.7.0 release (pulled forward from DFPY-110/115 scope), motivated by a day of dogfooding the firewall and by the litellm upstream PR:
- scan(text, allowlist=[...]) for exact values, allowlist_patterns=[...] for full-match regexes. Threaded through both adapters: DATAFOG_HOOK_ALLOWLIST / DATAFOG_HOOK_ALLOWLIST_PATTERNS env vars for the Claude Code hook, constructor params for the LiteLLM guardrail. The motivating false-positive catalog from today: unix timestamps and 10-digit API IDs matching PHONE, own email in tool metadata, doc placeholders, test fixtures.
- EMAIL_ADDRESS, US_SSN accepted as input aliases (via the existing CANONICAL_TYPE_MAP, which already had PHONE_NUMBER): the migration bridge for presidio configs.
- py.typed marker shipped via package_data, so downstream type checkers (including litellm's basedpyright gate, which currently needs a suppression for our import) see our annotations.
- Upstream-review fixes backported to the in-repo litellm adapter: guardrail spans recorded on the returned dict, redaction reported as intervention (guardrail_intervened).
- Entity-name documentation corrected: the scan API returns DATE / ZIP_CODE; DOB / ZIP are input aliases, not output types.
Design decisions
- Allowlist patterns are matched with fullmatch: a partial match never suppresses a finding.
- Invalid patterns raise ValueError at the API boundary (fail fast); the hook converts that to its usual fail-open.
- redact(entities=[...], allowlist=...) is rejected explicitly: filtering pre-scanned entities is the caller's job.
Why now (release strategy)
litellm's supply-chain quarantine (exclude-newer = 3 days) means any datafog release takes 3 days to become usable in their CI. Shipping 4.7.0 immediately starts that clock: quarantine-clear by ~July 6, in time for the July 5 CI-pin push to go straight to 4.7.0, and it removes the type-stub suppression from the upstream PR.
Test plan
tests/test_allowlist.py), TDD (verified RED first)
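The fail-fast / fail-open split from the design decisions can be sketched as below. Everything here is a hypothetical stand-in: `scan_stub` is not the datafog API, the 10-digit detector is a toy, and the "allow" / "block" decision strings are assumed labels rather than the hook's real output.

```python
import re


def scan_stub(text: str, allowlist_patterns: list[str]) -> list[str]:
    """Hypothetical library-style API: bad configuration raises ValueError."""
    try:
        compiled = [re.compile(p) for p in allowlist_patterns]
    except re.error as exc:
        # Fail fast at the API boundary.
        raise ValueError(f"invalid allowlist pattern: {exc}") from exc
    candidates = re.findall(r"\d{10}", text)  # toy PHONE-like detector
    return [c for c in candidates if not any(rx.fullmatch(c) for rx in compiled)]


def hook_decision(text: str, allowlist_patterns: list[str]) -> str:
    """Hypothetical hook wrapper: configuration errors never block the call."""
    try:
        findings = scan_stub(text, allowlist_patterns)
    except ValueError:
        return "allow"  # fail-open: a broken allowlist must not wedge the agent
    return "block" if findings else "allow"
```

A library caller sees the ValueError immediately and can fix the configuration, while the hook, which sits in an interactive tool path, degrades to allowing the call.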