DataFog · sidmohan0 · Jul 2, 2026
diff --git a/.bumpversion.cfg b/.bumpversion.cfg
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 4.6.0
+current_version = 4.7.0
 commit = True
 tag = True
 tag_name = v{new_version}

diff --git a/CHANGELOG.MD b/CHANGELOG.MD
@@ -2,6 +2,41 @@
 
 ## [2026-07-02]
 
+### `datafog-python` [4.7.0]
+
+#### Added
+
+- **Allowlist support** on `scan()` and `redact()`: `allowlist=[...]` exempts
+  exact entity texts (your own support address, documentation placeholders);
+  `allowlist_patterns=[...]` exempts entities whose full text matches a regex
+  (e.g. `^\d{10}$` so unix timestamps stop matching as phone numbers).
+  Matching is deliberately strict: case-sensitive, no Unicode normalization,
+  exact/fullmatch only — a partial match never suppresses a finding.
+  Threaded through both agent adapters: `DATAFOG_HOOK_ALLOWLIST` /
+  `DATAFOG_HOOK_ALLOWLIST_PATTERNS` environment variables for the Claude
+  Code hook, `allowlist` / `allowlist_patterns` parameters for the LiteLLM
+  guardrail. Patterns are operator configuration: treat them like code and
+  never accept them from end users. Patterns with nested quantifiers are
+  rejected at configuration time (catastrophic-backtracking guard), pattern
+  length is capped at 512 characters, and entities longer than 512
+  characters skip pattern matching and fail safe (the finding is kept).
+- **Presidio-compatible entity aliases**: `EMAIL_ADDRESS` and `US_SSN` are
+  accepted as input aliases for `EMAIL` and `SSN` (joining the existing
+  `PHONE_NUMBER` alias), so Presidio configurations migrate without renames.
+- **`py.typed` marker**: the package now advertises its inline type
+  annotations to type checkers (PEP 561).
+
+#### Changed
+
+- **LiteLLM guardrail observability**: redaction events are now recorded
+  with `guardrail_status="guardrail_intervened"` (previously `"success"`),
+  so compliance dashboards flag redactions as interventions. Guardrail
+  logging metadata is attached to the request dict actually returned in
+  redact mode, fixing dropped observability records for requests arriving
+  without a pre-existing `metadata` key.
+- Documentation: corrected the engine entity-type list — the scan API
+  returns `DATE` and `ZIP_CODE`; `DOB` and `ZIP` are accepted input aliases.
+
 ### `datafog-python` [4.6.0]
 
 #### Added
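
The strict matching rules in the 4.7.0 entry can be illustrated with a minimal, self-contained sketch. It does not call DataFog: `Finding` and `apply_allowlist` are hypothetical stand-ins that mirror the semantics described above (case-sensitive exact match, `fullmatch` for patterns, pattern matching skipped for texts over 512 characters).

```python
import re
from dataclasses import dataclass

# Same cap the changelog describes; longer texts skip pattern matching.
MAX_PATTERN_SUBJECT_LENGTH = 512


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for a detected entity."""

    type: str
    text: str


def apply_allowlist(findings, allowlist=(), allowlist_patterns=()):
    """Keep every finding that is not explicitly exempted."""
    exact = set(allowlist)
    patterns = [re.compile(p) for p in allowlist_patterns]
    kept = []
    for finding in findings:
        if finding.text in exact:
            continue  # exact, case-sensitive exemption
        short_enough = len(finding.text) <= MAX_PATTERN_SUBJECT_LENGTH
        if short_enough and any(p.fullmatch(finding.text) for p in patterns):
            continue  # the whole text matched a pattern
        kept.append(finding)
    return kept


findings = [
    Finding("EMAIL", "support@example.com"),
    Finding("EMAIL", "Support@example.com"),  # differs only by case
    Finding("PHONE", "1700000000"),  # ten digits, looks like a unix timestamp
    Finding("PHONE", "17000000001"),  # eleven digits, no full match
]
kept = apply_allowlist(
    findings,
    allowlist=["support@example.com"],
    allowlist_patterns=[r"^\d{10}$"],
)
print([f.text for f in kept])
```

Only the exact lowercase address and the ten-digit value are exempted; the capitalised address and the eleven-digit number stay in the results, which is the "partial match never suppresses a finding" behaviour.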

diff --git a/README.md b/README.md
@@ -5,17 +5,43 @@ DataFog is a Python library for detecting and redacting personally identifiable
 It provides:
 
 - Fast structured PII detection via regex
+- An offline PII firewall for AI agents: a Claude Code hook and a LiteLLM
+  gateway guardrail (new in 4.6)
 - Optional NER support via spaCy and GLiNER
 - A simple agent-oriented API for LLM applications
 - Backward-compatible `DataFog` and `TextService` classes
 
-## 4.5 Focus
+## Agent & Gateway Firewall (4.6)
 
-DataFog 4.5 is focused on lightweight text PII screening: a small core install,
-fast regex-based scan/redact helpers, explicit optional extras, and a clearer
-path toward future middleware use cases. Dedicated Sentry, OpenTelemetry,
-logging-framework, and cloud DLP adapters are future-facing work and are not
-part of the 4.5 release.
+DataFog 4.6 adds two ready-made enforcement points that catch PII at the
+moment it would leave your machine — offline, in microseconds, with matched
+values never echoed into logs or transcripts:
+
+- **Claude Code hook** (`datafog-hook`): gates agent tool calls (shell
+  commands, web requests, file writes, MCP tools) and warns the model when
+  prompts or tool results carry PII. ~70ms per invocation including process
+  startup. Easiest install is the
+  [Claude Code plugin](https://github.com/DataFog/datafog-claude-plugin):
+
+  ```
+  /plugin marketplace add DataFog/datafog-claude-plugin
+  /plugin install datafog@datafog
+  ```
+
+  Manual hook setup and limitations: [examples/claude_code_hook/](examples/claude_code_hook/).
+
+- **LiteLLM guardrail** (`DataFogGuardrail`): redacts or blocks PII in
+  requests and responses at the gateway, for any LiteLLM-proxied provider.
+  In-process (~31µs per request), no sidecar service. Setup:
+  [examples/litellm_guardrail/](examples/litellm_guardrail/).
+
+Both default to the high-precision entity set (`EMAIL`, `PHONE`,
+`CREDIT_CARD`, `SSN`); noisier types are opt-in. Since 4.7, known-safe values
+can be exempted with an allowlist: `scan(text, allowlist=[...])` for exact
+values, `allowlist_patterns=[...]` for full-match regexes (e.g. `^\d{10}$` to
+stop unix timestamps matching as phone numbers); both are available in the API
+and in both adapters. Presidio-style entity names (`EMAIL_ADDRESS`,
+`PHONE_NUMBER`, `US_SSN`) are accepted as aliases for easy migration.
 
 ## Installation
 
@@ -42,7 +68,7 @@ pip install datafog[all]
 Python 3.13 support is certified for the core SDK, CLI, `nlp`,
 `nlp-advanced`, and `ocr` install profiles. Donut OCR still requires a model
 that is available locally before runtime use. `distributed` and `all` are not
-newly certified on Python 3.13 in the 4.5 line.
+newly certified on Python 3.13 in the 4.x line.
 
 ## Quick Start
 
@@ -117,7 +143,7 @@ Use the engine that matches your accuracy and dependency constraints:
 
 - `regex`:
   - Fastest and always available.
-  - Best for default structured entities: `EMAIL`, `PHONE`, `SSN`, `CREDIT_CARD`, `IP_ADDRESS`, `DATE`, `ZIP_CODE`.
+  - Best for default structured entities: `EMAIL`, `PHONE`, `SSN`, `CREDIT_CARD`, `IP_ADDRESS`, `DATE`, `ZIP_CODE` (`DOB` and `ZIP` are accepted as input aliases).
   - Use `locales=["de"]` for German structured IDs such as `DE_VAT_ID`, `DE_IBAN`, `DE_TAX_ID`, `DE_POSTAL_CODE`, and passport or residence permit numbers.
 - `spacy`:
   - Requires `pip install datafog[nlp]`.
@@ -131,7 +157,7 @@ Use the engine that matches your accuracy and dependency constraints:
 
 ## Optional OCR And Spark Surfaces
 
-DataFog 4.5 keeps the main package story centered on lightweight text PII
+The 4.x line keeps the main package story centered on lightweight text PII
 screening. OCR and Spark remain supported optional surfaces for users who
 already rely on them, but they are not required for the core import, default
 scan/redact helpers, or guardrail helpers.
@@ -151,7 +177,7 @@ scan/redact helpers, or guardrail helpers.
   - A Java runtime is required by PySpark.
 
 OCR and Spark are not deprecated. Their broader API and packaging overhaul is
-deferred; the 4.5 goal is to keep them explicit, documented, and isolated from
+deferred; the 4.x goal is to keep them explicit, documented, and isolated from
 the lightweight core path.
 
 ## Backward-Compatible APIs
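
The alias handling mentioned in the README changes can be pictured as a simple lookup applied before scanning. The sketch below is an assumption about the shape of that step, not the library's code: `ALIASES` lists only the names mentioned in this change, and `normalize_entity_types` is a hypothetical helper.

```python
# Alias table assembled from the names mentioned in this change; the
# library's real table may contain more entries.
ALIASES = {
    "EMAIL_ADDRESS": "EMAIL",
    "PHONE_NUMBER": "PHONE",
    "US_SSN": "SSN",
    "DOB": "DATE",
    "ZIP": "ZIP_CODE",
}


def normalize_entity_types(names):
    """Map aliases to canonical names, dropping duplicates in order."""
    canonical = []
    for name in names:
        resolved = ALIASES.get(name, name)
        if resolved not in canonical:
            canonical.append(resolved)
    return canonical


presidio_style = ["EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN", "EMAIL", "ZIP"]
print(normalize_entity_types(presidio_style))
```

A Presidio-style list therefore resolves to the canonical names the scan API reports, so results can be compared against `EMAIL`, `PHONE`, `SSN`, and `ZIP_CODE` regardless of which spelling the configuration used.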

diff --git a/datafog/__about__.py b/datafog/__about__.py
@@ -1 +1 @@
-__version__ = "4.6.0"
+__version__ = "4.7.0"
diff --git a/datafog/__init__.py b/datafog/__init__.py
@@ -153,14 +153,28 @@ def scan(
     engine: str = "regex",
     entity_types: list[str] | None = None,
     locales: list[str] | None = None,
+    allowlist: list[str] | None = None,
+    allowlist_patterns: list[str] | None = None,
 ) -> ScanResult:
     """
     v5-preview scan entrypoint.
 
     Defaults to the lightweight regex engine so the core install works without
     optional dependency fallback warnings.
+
+    ``allowlist`` exempts exact entity texts (your own support address, doc
+    placeholders); ``allowlist_patterns`` exempts entities whose full text
+    matches a regex (e.g. ``^\\d{10}$`` so unix timestamps stop matching as
+    phone numbers).
     """
-    return _scan(text=text, engine=engine, entity_types=entity_types, locales=locales)
+    return _scan(
+        text=text,
+        engine=engine,
+        entity_types=entity_types,
+        locales=locales,
+        allowlist=allowlist,
+        allowlist_patterns=allowlist_patterns,
+    )
 
 
 def redact(
@@ -171,12 +185,17 @@ def redact(
     strategy: str = "token",
     preset: str | None = None,
     locales: list[str] | None = None,
+    allowlist: list[str] | None = None,
+    allowlist_patterns: list[str] | None = None,
 ) -> RedactResult:
     """
     v5-preview redaction entrypoint.
 
     If entities are provided, redact those spans. Otherwise, scan text first
-    using the selected engine and redact the detected entities.
+    using the selected engine and redact the detected entities. ``allowlist``
+    and ``allowlist_patterns`` exempt findings from redaction (exact text and
+    full-text regex match respectively); they apply to the scan path and are
+    rejected when explicit ``entities`` are supplied.
     """
     if preset is not None:
         try:
@@ -186,6 +205,11 @@ def redact(
             raise ValueError(f"preset must be one of: {allowed}") from exc
 
     if entities is not None:
+        if allowlist or allowlist_patterns:
+            raise ValueError(
+                "allowlist/allowlist_patterns cannot be combined with explicit "
+                "entities; filter the entities before calling redact"
+            )
         return _redact_entities(text=text, entities=entities, strategy=strategy)
 
     return _scan_and_redact(
@@ -194,6 +218,8 @@ def redact(
         entity_types=entity_types,
         strategy=strategy,
         locales=locales,
+        allowlist=allowlist,
+        allowlist_patterns=allowlist_patterns,
     )
 
 
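
Because `redact()` now raises `ValueError` when an allowlist is combined with explicit `entities`, callers on that path need to filter first. A minimal standalone sketch of "filter, then redact" follows; `Span`, `span_for`, `redact_spans`, and the `[TYPE]` token format are hypothetical illustrations, not DataFog's types or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an entity with character offsets."""

    type: str
    text: str
    start: int
    end: int


def span_for(text, type_, value):
    """Build a Span for the first occurrence of value in text."""
    start = text.index(value)
    return Span(type_, value, start, start + len(value))


def redact_spans(text, spans):
    """Replace each span with a [TYPE] token, working right to left."""
    # Right-to-left keeps earlier offsets valid while the text changes length.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[: span.start] + f"[{span.type}]" + text[span.end :]
    return text


message = "Write to support@example.com, cc jane@corp.example"
spans = [
    span_for(message, "EMAIL", "support@example.com"),
    span_for(message, "EMAIL", "jane@corp.example"),
]

# Filter first, then redact: the explicit-entities path takes no allowlist.
allowlist = {"support@example.com"}
to_redact = [s for s in spans if s.text not in allowlist]
redacted = redact_spans(message, to_redact)
print(redacted)
```

Raising instead of silently ignoring the allowlist keeps the caller from assuming an exemption was applied when it was not.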

diff --git a/datafog/engine.py b/datafog/engine.py
@@ -3,6 +3,7 @@
 from __future__ import annotations
 
 import hashlib
+import re
 import warnings
 from dataclasses import dataclass
 from functools import lru_cache
@@ -23,6 +24,9 @@
     "SOCIAL_SECURITY_NUMBER": "SSN",
     "CREDIT_CARD_NUMBER": "CREDIT_CARD",
     "DATE_OF_BIRTH": "DATE",
+    # Presidio-compatible aliases, so configs migrate without renames.
+    "EMAIL_ADDRESS": "EMAIL",
+    "US_SSN": "SSN",
 }
 
 ALL_ENTITY_TYPES = {
@@ -277,6 +281,74 @@ def _filter_entity_types(
     return [entity for entity in entities if entity.type in allowed]
 
 
+# Python's re module backtracks; a quantified group containing another
+# quantifier (e.g. ``(a+)+``) can take exponential time on adversarial
+# input, and entity text can be attacker-influenced (LLM messages, tool
+# output). Reject that construct outright rather than matching under it.
+_NESTED_QUANTIFIER = re.compile(
+    r"\((?:[^()\\]|\\.)*(?<!\\)[+*}](?:[^()\\]|\\.)*\)\s*[+*{]"
+)
+MAX_ALLOWLIST_PATTERN_LENGTH = 512
+# Entities longer than this skip pattern matching (fail-safe: the finding
+# is kept, never suppressed) so match time stays bounded.
+MAX_PATTERN_SUBJECT_LENGTH = 512
+
+
+def _compile_allowlist_patterns(
+    allowlist_patterns: Optional[list[str]],
+) -> list["re.Pattern[str]"]:
+    compiled = []
+    for raw in allowlist_patterns or []:
+        if len(raw) > MAX_ALLOWLIST_PATTERN_LENGTH:
+            raise ValueError(
+                "allowlist_patterns entries must be at most "
+                f"{MAX_ALLOWLIST_PATTERN_LENGTH} characters"
+            )
+        if _NESTED_QUANTIFIER.search(raw):
+            raise ValueError(
+                "allowlist_patterns contains a quantified group with a nested "
+                f"quantifier ({raw!r}), which risks catastrophic backtracking; "
+                "rewrite the pattern without nesting quantifiers"
+            )
+        try:
+            compiled.append(re.compile(raw))
+        except re.error as exc:
+            raise ValueError(
+                f"allowlist_patterns contains an invalid regex: {raw!r} ({exc})"
+            ) from None
+    return compiled
+
+
+def _apply_allowlist(
+    entities: list[Entity],
+    allowlist: Optional[list[str]],
+    allowlist_patterns: Optional[list[str]],
+) -> list[Entity]:
+    """Drop entities exempted by an exact allowlist entry or a pattern.
+
+    Matching semantics, deliberately strict for a security boundary:
+    exact values are case-sensitive with no Unicode normalization, and
+    patterns must fullmatch the entity text, so a partial match never
+    suppresses a finding. Allowlist entries and patterns are operator
+    configuration; treat them like code and never accept them from end
+    users.
+    """
+    if not allowlist and not allowlist_patterns:
+        return entities
+    exact = set(allowlist or [])
+    patterns = _compile_allowlist_patterns(allowlist_patterns)
+    return [
+        entity
+        for entity in entities
+        if entity.text not in exact
+        and not any(
+            pattern.fullmatch(entity.text)
+            for pattern in patterns
+            if len(entity.text) <= MAX_PATTERN_SUBJECT_LENGTH
+        )
+    ]
+
+
 def _needs_ner(entity_types: Optional[list[str]]) -> bool:
     if entity_types is None:
         return True
@@ -289,14 +361,25 @@ def scan(
     engine: str = "smart",
     entity_types: Optional[list[str]] = None,
     locales: Optional[list[str]] = None,
+    allowlist: Optional[list[str]] = None,
+    allowlist_patterns: Optional[list[str]] = None,
 ) -> ScanResult:
-    """Scan text for PII entities."""
+    """Scan text for PII entities.
+
+    ``allowlist`` exempts exact entity texts (e.g. your own support email);
+    ``allowlist_patterns`` exempts entities whose full text matches a regex
+    (e.g. ``^\\d{10}$`` to stop unix timestamps matching as phone numbers).
+    """
     if not isinstance(text, str):
         raise TypeError("text must be a string")
 
     if engine not in {"regex", "spacy", "gliner", "smart"}:
         raise ValueError("engine must be one of: regex, spacy, gliner, smart")
 
+    # Validate patterns up front so config errors fail fast even when the
+    # text contains no entities.
+    _compile_allowlist_patterns(allowlist_patterns)
+
     regex_entities = _regex_entities(
         text,
         entity_types=entity_types,
@@ -305,6 +388,7 @@ def scan(
 
     if engine == "regex":
         filtered = _filter_entity_types(regex_entities, entity_types)
+        filtered = _apply_allowlist(filtered, allowlist, allowlist_patterns)
         return ScanResult(
             entities=_dedupe_entities(filtered), text=text, engine_used="regex"
         )
@@ -367,6 +451,7 @@ def scan(
                 )
 
     filtered = _filter_entity_types(combined, entity_types)
+    filtered = _apply_allowlist(filtered, allowlist, allowlist_patterns)
     deduped = _dedupe_entities(filtered)
     return ScanResult(
         entities=deduped,
@@ -437,12 +522,16 @@ def scan_and_redact(
     entity_types: Optional[list[str]] = None,
     strategy: str = "token",
     locales: Optional[list[str]] = None,
+    allowlist: Optional[list[str]] = None,
+    allowlist_patterns: Optional[list[str]] = None,
 ) -> RedactResult:
     """Convenience wrapper: scan then redact."""
     scan_result = scan(
         text=text,
         engine=engine,
         entity_types=entity_types,
         locales=locales,
+        allowlist=allowlist,
+        allowlist_patterns=allowlist_patterns,
     )
     return redact(text=text, entities=scan_result.entities, strategy=strategy)
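
To see what the configuration-time checks accept and reject, the guard regex from the `engine.py` hunk can be exercised on its own. This is a sketch: the regex and the 512-character cap are copied from the diff, while `check_pattern` is a hypothetical helper that returns a reason instead of raising `ValueError`.

```python
import re

# Copied from the engine.py hunk in this diff.
NESTED_QUANTIFIER = re.compile(
    r"\((?:[^()\\]|\\.)*(?<!\\)[+*}](?:[^()\\]|\\.)*\)\s*[+*{]"
)
MAX_ALLOWLIST_PATTERN_LENGTH = 512


def check_pattern(raw):
    """Return None if the pattern is accepted, else a short reason."""
    if len(raw) > MAX_ALLOWLIST_PATTERN_LENGTH:
        return "too long"
    if NESTED_QUANTIFIER.search(raw):
        return "nested quantifier"
    try:
        re.compile(raw)
    except re.error:
        return "invalid regex"
    return None


candidates = [
    r"^\d{10}$",  # no group at all
    r"(?:ab|cd)+",  # quantified group, nothing quantified inside
    r"(a\+)+",  # the inner plus is escaped, so it is a literal
    r"(a+)+",  # classic catastrophic-backtracking shape
    r"(\d{3}){2}",  # bounded, but still a quantifier inside a quantified group
    r"[unclosed",  # not a valid regex
    "a" * 513,  # one character over the cap
]
reasons = [check_pattern(raw) for raw in candidates]
print(reasons)
```

The check is a textual heuristic rather than a regex parser, so it errs in both directions: the harmless `(\d{3}){2}` is rejected, while a doubly wrapped group such as `((a+))+` is not flagged by this regex. That is one more reason to treat patterns as operator configuration rather than user input.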