2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 4.6.0
current_version = 4.7.0
commit = True
tag = True
tag_name = v{new_version}
35 changes: 35 additions & 0 deletions CHANGELOG.MD
@@ -2,6 +2,41 @@

## [2026-07-02]

### `datafog-python` [4.7.0]

#### Added

- **Allowlist support** on `scan()` and `redact()`: `allowlist=[...]` exempts
exact entity texts (your own support address, documentation placeholders);
`allowlist_patterns=[...]` exempts entities whose full text matches a regex
(e.g. `^\d{10}$` so unix timestamps stop matching as phone numbers).
Matching is deliberately strict: case-sensitive, no Unicode normalization,
exact/fullmatch only — a partial match never suppresses a finding.
Threaded through both agent adapters: `DATAFOG_HOOK_ALLOWLIST` /
`DATAFOG_HOOK_ALLOWLIST_PATTERNS` environment variables for the Claude
Code hook, `allowlist` / `allowlist_patterns` parameters for the LiteLLM
guardrail. Patterns are operator configuration — treat them like code and
never accept them from end users; patterns with nested quantifiers are
rejected at configuration time (catastrophic-backtracking guard), pattern
length is capped at 512 characters, and entities longer than 512
characters skip pattern matching fail-safe (the finding is kept).
- **Presidio-compatible entity aliases**: `EMAIL_ADDRESS` and `US_SSN` are
accepted as input aliases for `EMAIL` and `SSN` (joining the existing
`PHONE_NUMBER` alias), so Presidio configurations migrate without renames.
- **`py.typed` marker**: the package now advertises its inline type
annotations to type checkers (PEP 561).
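
The pattern safeguards listed in the allowlist entry above can be sketched in isolation. This is a minimal illustration, not the library's code path: the guard regex is copied from the `datafog/engine.py` hunk in this diff, while `check_pattern` and its return strings are hypothetical names used only for this example.

```python
import re

# Guard regex copied from the engine.py hunk in this diff: it flags a
# parenthesised group that contains a quantifier and is itself quantified.
NESTED_QUANTIFIER = re.compile(
    r"\((?:[^()\\]|\\.)*(?<!\\)[+*}](?:[^()\\]|\\.)*\)\s*[+*{]"
)
MAX_PATTERN_LENGTH = 512  # cap stated in the changelog entry


def check_pattern(raw: str) -> str:
    """Classify a candidate allowlist pattern (hypothetical helper)."""
    if len(raw) > MAX_PATTERN_LENGTH:
        return "too long"
    if NESTED_QUANTIFIER.search(raw):
        return "nested quantifier"
    try:
        re.compile(raw)
    except re.error:
        return "invalid regex"
    return "ok"


for candidate in [r"^\d{10}$", "(a+)+", "(ab)+", "("]:
    print(repr(candidate), "->", check_pattern(candidate))
```

Note that the guard is conservative: a bounded inner quantifier such as `(\d{3})+` is also flagged, so such patterns would need rewriting without the nested quantifier.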

#### Changed

- **LiteLLM guardrail observability**: redaction events are now recorded
with `guardrail_status="guardrail_intervened"` (previously `"success"`),
so compliance dashboards flag redactions as interventions. Guardrail
logging metadata is attached to the request dict actually returned in
redact mode, fixing dropped observability records for requests arriving
without a pre-existing `metadata` key.
- Documentation: corrected the engine entity-type list — the scan API
returns `DATE` and `ZIP_CODE`; `DOB` and `ZIP` are accepted input aliases.

### `datafog-python` [4.6.0]

#### Added
46 changes: 36 additions & 10 deletions README.md
@@ -5,17 +5,43 @@ DataFog is a Python library for detecting and redacting personally identifiable
It provides:

- Fast structured PII detection via regex
- An offline PII firewall for AI agents: a Claude Code hook and a LiteLLM
gateway guardrail (new in 4.6)
- Optional NER support via spaCy and GLiNER
- A simple agent-oriented API for LLM applications
- Backward-compatible `DataFog` and `TextService` classes

## 4.5 Focus
## Agent & Gateway Firewall (4.6)

DataFog 4.5 is focused on lightweight text PII screening: a small core install,
fast regex-based scan/redact helpers, explicit optional extras, and a clearer
path toward future middleware use cases. Dedicated Sentry, OpenTelemetry,
logging-framework, and cloud DLP adapters are future-facing work and are not
part of the 4.5 release.
DataFog 4.6 adds two ready-made enforcement points that catch PII at the
moment it would leave your machine — offline, in microseconds, with matched
values never echoed into logs or transcripts:

- **Claude Code hook** (`datafog-hook`): gates agent tool calls (shell
  commands, web requests, file writes, MCP tools) and warns the model when
  prompts or tool results carry PII. ~70ms per invocation including process
  startup. Easiest install is the
  [Claude Code plugin](https://github.com/DataFog/datafog-claude-plugin):

  ```
  /plugin marketplace add DataFog/datafog-claude-plugin
  /plugin install datafog@datafog
  ```

  Manual hook setup and limitations: [examples/claude_code_hook/](examples/claude_code_hook/).

- **LiteLLM guardrail** (`DataFogGuardrail`): redacts or blocks PII in
  requests and responses at the gateway, for any LiteLLM-proxied provider.
  In-process (~31µs per request), no sidecar service. Setup:
  [examples/litellm_guardrail/](examples/litellm_guardrail/).

Both default to the high-precision entity set (`EMAIL`, `PHONE`,
`CREDIT_CARD`, `SSN`); noisier types are opt-in. Known-safe values can be
exempted with an allowlist: `scan(text, allowlist=[...])` for exact values,
`allowlist_patterns=[...]` for full-match regexes (e.g. `^\d{10}$` to stop
unix timestamps matching as phone numbers) — available in both adapters and
the API. Presidio-style entity names (`EMAIL_ADDRESS`, `PHONE_NUMBER`,
`US_SSN`) are accepted as aliases for easy migration.
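
The exact-value and full-match semantics described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not DataFog's implementation: `Finding` and `apply_allowlist` are hypothetical stand-ins, and the sketch omits the pattern validation and 512-character limits that the release notes describe.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical stand-in for a detected entity."""

    type: str
    text: str


def apply_allowlist(findings, allowlist=(), allowlist_patterns=()):
    """Keep findings unless their text is allowlisted exactly or fully matches a pattern."""
    exact = set(allowlist)
    patterns = [re.compile(p) for p in allowlist_patterns]
    return [
        f
        for f in findings
        if f.text not in exact
        and not any(p.fullmatch(f.text) for p in patterns)
    ]


findings = [
    Finding("EMAIL", "support@example.com"),  # exact allowlist hit: dropped
    Finding("EMAIL", "Support@example.com"),  # differs in case: kept
    Finding("PHONE", "1719878400"),           # ten digits, pattern hit: dropped
    Finding("PHONE", "415-555-0100"),         # hyphens break the full match: kept
]
kept = apply_allowlist(
    findings,
    allowlist=["support@example.com"],
    allowlist_patterns=[r"^\d{10}$"],
)
print([f.text for f in kept])  # ['Support@example.com', '415-555-0100']
```

Using `fullmatch` rather than `search` is what keeps a partial match from suppressing a finding: a pattern has to account for the entire entity text.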

## Installation

@@ -42,7 +68,7 @@ pip install datafog[all]
Python 3.13 support is certified for the core SDK, CLI, `nlp`,
`nlp-advanced`, and `ocr` install profiles. Donut OCR still requires a model
that is available locally before runtime use. `distributed` and `all` are not
newly certified on Python 3.13 in the 4.5 line.
newly certified on Python 3.13 in the 4.x line.

## Quick Start

@@ -117,7 +143,7 @@ Use the engine that matches your accuracy and dependency constraints:

- `regex`:
  - Fastest and always available.
  - Best for default structured entities: `EMAIL`, `PHONE`, `SSN`, `CREDIT_CARD`, `IP_ADDRESS`, `DATE`, `ZIP_CODE`.
  - Best for default structured entities: `EMAIL`, `PHONE`, `SSN`, `CREDIT_CARD`, `IP_ADDRESS`, `DATE`, `ZIP_CODE` (`DOB` and `ZIP` are accepted as input aliases).
  - Use `locales=["de"]` for German structured IDs such as `DE_VAT_ID`, `DE_IBAN`, `DE_TAX_ID`, `DE_POSTAL_CODE`, and passport or residence permit numbers.
- `spacy`:
  - Requires `pip install datafog[nlp]`.
@@ -131,7 +157,7 @@ Use the engine that matches your accuracy and dependency constraints:

## Optional OCR And Spark Surfaces

DataFog 4.5 keeps the main package story centered on lightweight text PII
The 4.x line keeps the main package story centered on lightweight text PII
screening. OCR and Spark remain supported optional surfaces for users who
already rely on them, but they are not required for the core import, default
scan/redact helpers, or guardrail helpers.
@@ -151,7 +177,7 @@ scan/redact helpers, or guardrail helpers.
- A Java runtime is required by PySpark.

OCR and Spark are not deprecated. Their broader API and packaging overhaul is
deferred; the 4.5 goal is to keep them explicit, documented, and isolated from
deferred; the 4.x goal is to keep them explicit, documented, and isolated from
the lightweight core path.

## Backward-Compatible APIs
2 changes: 1 addition & 1 deletion datafog/__about__.py
@@ -1 +1 @@
__version__ = "4.6.0"
__version__ = "4.7.0"
30 changes: 28 additions & 2 deletions datafog/__init__.py
@@ -153,14 +153,28 @@ def scan(
    engine: str = "regex",
    entity_types: list[str] | None = None,
    locales: list[str] | None = None,
    allowlist: list[str] | None = None,
    allowlist_patterns: list[str] | None = None,
) -> ScanResult:
    """
    v5-preview scan entrypoint.

    Defaults to the lightweight regex engine so the core install works without
    optional dependency fallback warnings.

    ``allowlist`` exempts exact entity texts (your own support address, doc
    placeholders); ``allowlist_patterns`` exempts entities whose full text
    matches a regex (e.g. ``^\\d{10}$`` so unix timestamps stop matching as
    phone numbers).
    """
    return _scan(text=text, engine=engine, entity_types=entity_types, locales=locales)
    return _scan(
        text=text,
        engine=engine,
        entity_types=entity_types,
        locales=locales,
        allowlist=allowlist,
        allowlist_patterns=allowlist_patterns,
    )


def redact(
@@ -171,12 +185,17 @@ def redact(
    strategy: str = "token",
    preset: str | None = None,
    locales: list[str] | None = None,
    allowlist: list[str] | None = None,
    allowlist_patterns: list[str] | None = None,
) -> RedactResult:
    """
    v5-preview redaction entrypoint.

    If entities are provided, redact those spans. Otherwise, scan text first
    using the selected engine and redact the detected entities.
    using the selected engine and redact the detected entities. ``allowlist``
    and ``allowlist_patterns`` exempt findings from redaction (exact text and
    full-text regex match respectively); they apply to the scan path and are
    rejected when explicit ``entities`` are supplied.
    """
    if preset is not None:
        try:
@@ -186,6 +205,11 @@ def redact(
            raise ValueError(f"preset must be one of: {allowed}") from exc

    if entities is not None:
        if allowlist or allowlist_patterns:
            raise ValueError(
                "allowlist/allowlist_patterns cannot be combined with explicit "
                "entities; filter the entities before calling redact"
            )
        return _redact_entities(text=text, entities=entities, strategy=strategy)

    return _scan_and_redact(
@@ -194,6 +218,8 @@ def redact(
        entity_types=entity_types,
        strategy=strategy,
        locales=locales,
        allowlist=allowlist,
        allowlist_patterns=allowlist_patterns,
    )


91 changes: 90 additions & 1 deletion datafog/engine.py
@@ -3,6 +3,7 @@
from __future__ import annotations

import hashlib
import re
import warnings
from dataclasses import dataclass
from functools import lru_cache
@@ -23,6 +24,9 @@
    "SOCIAL_SECURITY_NUMBER": "SSN",
    "CREDIT_CARD_NUMBER": "CREDIT_CARD",
    "DATE_OF_BIRTH": "DATE",
    # Presidio-compatible aliases, so configs migrate without renames.
    "EMAIL_ADDRESS": "EMAIL",
    "US_SSN": "SSN",
}

ALL_ENTITY_TYPES = {
@@ -277,6 +281,74 @@ def _filter_entity_types(
    return [entity for entity in entities if entity.type in allowed]


# Python's re module backtracks; a quantified group containing another
# quantifier (e.g. ``(a+)+``) can take exponential time on adversarial
# input, and entity text can be attacker-influenced (LLM messages, tool
# output). Reject that construct outright rather than matching under it.
_NESTED_QUANTIFIER = re.compile(
    r"\((?:[^()\\]|\\.)*(?<!\\)[+*}](?:[^()\\]|\\.)*\)\s*[+*{]"
)
MAX_ALLOWLIST_PATTERN_LENGTH = 512
# Entities longer than this skip pattern matching (fail-safe: the finding
# is kept, never suppressed) so match time stays bounded.
MAX_PATTERN_SUBJECT_LENGTH = 512


def _compile_allowlist_patterns(
    allowlist_patterns: Optional[list[str]],
) -> list["re.Pattern[str]"]:
    compiled = []
    for raw in allowlist_patterns or []:
        if len(raw) > MAX_ALLOWLIST_PATTERN_LENGTH:
            raise ValueError(
                "allowlist_patterns entries must be at most "
                f"{MAX_ALLOWLIST_PATTERN_LENGTH} characters"
            )
        if _NESTED_QUANTIFIER.search(raw):
            raise ValueError(
                "allowlist_patterns contains a quantified group with a nested "
                f"quantifier ({raw!r}), which risks catastrophic backtracking; "
                "rewrite the pattern without nesting quantifiers"
            )
        try:
            compiled.append(re.compile(raw))
        except re.error as exc:
            raise ValueError(
                f"allowlist_patterns contains an invalid regex: {raw!r} ({exc})"
            ) from None
    return compiled


def _apply_allowlist(
    entities: list[Entity],
    allowlist: Optional[list[str]],
    allowlist_patterns: Optional[list[str]],
) -> list[Entity]:
    """Drop entities whose exact text is allowlisted.

    Matching semantics, deliberately strict for a security boundary:
    exact values are case-sensitive with no Unicode normalization, and
    patterns must fullmatch the entity text, so a partial match never
    suppresses a finding. Allowlist entries and patterns are operator
    configuration; treat them like code and never accept them from end
    users.
    """
    if not allowlist and not allowlist_patterns:
        return entities
    exact = set(allowlist or [])
    patterns = _compile_allowlist_patterns(allowlist_patterns)
    return [
        entity
        for entity in entities
        if entity.text not in exact
        and not any(
            pattern.fullmatch(entity.text)
            for pattern in patterns
            if len(entity.text) <= MAX_PATTERN_SUBJECT_LENGTH
        )
    ]


def _needs_ner(entity_types: Optional[list[str]]) -> bool:
    if entity_types is None:
        return True
@@ -289,14 +361,25 @@ def scan(
    engine: str = "smart",
    entity_types: Optional[list[str]] = None,
    locales: Optional[list[str]] = None,
    allowlist: Optional[list[str]] = None,
    allowlist_patterns: Optional[list[str]] = None,
) -> ScanResult:
    """Scan text for PII entities."""
    """Scan text for PII entities.

    ``allowlist`` exempts exact entity texts (e.g. your own support email);
    ``allowlist_patterns`` exempts entities whose full text matches a regex
    (e.g. ``^\\d{10}$`` to stop unix timestamps matching as phone numbers).
    """
    if not isinstance(text, str):
        raise TypeError("text must be a string")

    if engine not in {"regex", "spacy", "gliner", "smart"}:
        raise ValueError("engine must be one of: regex, spacy, gliner, smart")

    # Validate patterns up front so config errors fail fast even when the
    # text contains no entities.
    _compile_allowlist_patterns(allowlist_patterns)

    regex_entities = _regex_entities(
        text,
        entity_types=entity_types,
@@ -305,6 +388,7 @@ def scan(

    if engine == "regex":
        filtered = _filter_entity_types(regex_entities, entity_types)
        filtered = _apply_allowlist(filtered, allowlist, allowlist_patterns)
        return ScanResult(
            entities=_dedupe_entities(filtered), text=text, engine_used="regex"
        )
@@ -367,6 +451,7 @@ def scan(
)

    filtered = _filter_entity_types(combined, entity_types)
    filtered = _apply_allowlist(filtered, allowlist, allowlist_patterns)
    deduped = _dedupe_entities(filtered)
    return ScanResult(
        entities=deduped,
@@ -437,12 +522,16 @@ def scan_and_redact(
    entity_types: Optional[list[str]] = None,
    strategy: str = "token",
    locales: Optional[list[str]] = None,
    allowlist: Optional[list[str]] = None,
    allowlist_patterns: Optional[list[str]] = None,
) -> RedactResult:
    """Convenience wrapper: scan then redact."""
    scan_result = scan(
        text=text,
        engine=engine,
        entity_types=entity_types,
        locales=locales,
        allowlist=allowlist,
        allowlist_patterns=allowlist_patterns,
    )
    return redact(text=text, entities=scan_result.entities, strategy=strategy)