DataFog is a Python library for detecting and redacting personally identifiable information (PII).
It provides:
- Fast structured PII detection via regex
- An offline PII firewall for AI agents: a Claude Code hook and a LiteLLM gateway guardrail (new in 4.6)
- Optional NER support via spaCy and GLiNER
- A simple agent-oriented API for LLM applications
- Backward-compatible DataFog and TextService classes
DataFog 4.6 adds two ready-made enforcement points that catch PII at the moment it would leave your machine — offline, in microseconds, with matched values never echoed into logs or transcripts:
- Claude Code hook (datafog-hook): gates agent tool calls (shell commands, web requests, file writes, MCP tools) and warns the model when prompts or tool results carry PII. ~70ms per invocation including process startup. Easiest install is the Claude Code plugin:
  /plugin marketplace add DataFog/datafog-claude-plugin
  /plugin install datafog@datafog
  Manual hook setup and limitations: examples/claude_code_hook/.
- LiteLLM guardrail (DataFogGuardrail): redacts or blocks PII in requests and responses at the gateway, for any LiteLLM-proxied provider. In-process (~31µs per request), no sidecar service. Setup: examples/litellm_guardrail/.
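To make the "matched values never echoed" idea concrete, here is a minimal sketch of a gate that decides whether a payload may leave the machine while reporting only entity types and counts. It uses the standard library only; the `gate` function, the `PATTERNS` table, and the shape of the returned dict are hypothetical and are not DataFog's hook or guardrail implementation.

```python
import re
from collections import Counter

# Illustrative patterns only; these are NOT DataFog's actual detectors.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def gate(payload: str) -> dict:
    """Hypothetical gate: block on any match, report types and counts only."""
    counts = Counter()
    for entity_type, pattern in PATTERNS.items():
        counts[entity_type] += len(pattern.findall(payload))
    # Keep only the types that matched; the matched text is never stored.
    found = {t: n for t, n in counts.items() if n}
    return {"allow": not found, "entity_counts": found}


decision = gate("curl -d 'ssn=123-45-6789' https://example.com")
print(decision)
```

Because the decision carries counts rather than matched strings, it can be written to a log or transcript without leaking the value that triggered it.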
Both default to the high-precision entity set (EMAIL, PHONE,
CREDIT_CARD, SSN); noisier types are opt-in. Known-safe values can be
exempted with an allowlist: scan(text, allowlist=[...]) for exact values,
allowlist_patterns=[...] for full-match regexes (e.g. ^\d{10}$ to stop
unix timestamps matching as phone numbers) — available in both adapters and
the API. Presidio-style entity names (EMAIL_ADDRESS, PHONE_NUMBER,
US_SSN) are accepted as aliases for easy migration.
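The allowlist semantics described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not DataFog's code: `PHONE` is a deliberately loose stand-in pattern, and `find_phones` and `normalize_entity` are hypothetical helpers. Only the alias pairs and the `^\d{10}$` example come from this README.

```python
import re

# Deliberately loose illustrative phone pattern; NOT DataFog's actual regex.
PHONE = re.compile(r"\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}")

# Presidio-style names mapped to the short names listed in this README.
ALIASES = {"EMAIL_ADDRESS": "EMAIL", "PHONE_NUMBER": "PHONE", "US_SSN": "SSN"}


def normalize_entity(name: str) -> str:
    """Map a Presidio-style entity name to its short form, if it is an alias."""
    return ALIASES.get(name, name)


def find_phones(text, allowlist=(), allowlist_patterns=()):
    """Return phone-like matches, skipping allowlisted values."""
    exact = set(allowlist)
    compiled = [re.compile(p) for p in allowlist_patterns]
    kept = []
    for match in PHONE.finditer(text):
        value = match.group()
        if value in exact:
            continue  # exact-value exemption
        if any(p.fullmatch(value) for p in compiled):
            continue  # full-match regex exemption
        kept.append(value)
    return kept


text = "ts=1700000000 call 555-123-4567 or 555-000-0000"
print(find_phones(text))
print(find_phones(text, allowlist=["555-000-0000"], allowlist_patterns=[r"^\d{10}$"]))
```

The first call reports the unix timestamp as a phone number; the second drops both the timestamp (pattern exemption) and the known-safe number (exact exemption). Full-match semantics matter here: the pattern exempts a bare 10-digit run without also exempting a longer string that merely contains one.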
# Core install (regex engine)
pip install datafog
# Add spaCy support
pip install datafog[nlp]
# Add GLiNER + spaCy support
pip install datafog[nlp-advanced]
# Add local OCR support
pip install datafog[ocr]
# Add Spark/distributed support
pip install datafog[distributed]
# Everything
pip install datafog[all]

Python 3.13 support is certified for the core SDK, CLI, nlp,
nlp-advanced, and ocr install profiles. Donut OCR still requires a model
that is available locally before runtime use. distributed and all are not
newly certified on Python 3.13 in the 4.x line.
import datafog
text = "Contact john@example.com or call (555) 123-4567"
clean = datafog.sanitize(text, engine="regex")
print(clean)
# Contact [EMAIL_1] or call [PHONE_1]

import datafog
# 1) Scan prompt text before sending to an LLM
prompt = "My SSN is 123-45-6789"
scan_result = datafog.scan_prompt(prompt, engine="regex")
if scan_result.entities:
    print(f"Detected {len(scan_result.entities)} PII entities")
# 2) Redact model output before returning it
output = "Email me at jane.doe@example.com"
safe_result = datafog.filter_output(output, engine="regex")
print(safe_result.redacted_text)
# Email me at [EMAIL_1]
# 3) One-liner redaction
print(datafog.sanitize("Card: 4111-1111-1111-1111", engine="regex"))
# Card: [CREDIT_CARD_1]

German structured PII is country-specific and opt-in. Use explicit locale selection or entity-type filtering when you want German VAT IDs, German IBANs, tax IDs, postal codes, passports, or residence permits.
import datafog
text = "Steuer-ID 12345678901 liegt vor."
print(datafog.scan(text, engine="regex").entities)
# []
print(datafog.scan(text, engine="regex", locales=["de"]).entities)
# [Entity(type='DE_TAX_ID', text='12345678901', ...)]

import datafog
# Reusable guardrail object
guard = datafog.create_guardrail(engine="regex", on_detect="redact")
@guard
def call_llm() -> str:
    return "Send to admin@example.com"
print(call_llm())
# Send to [EMAIL_1]

Use the engine that matches your accuracy and dependency constraints:
- regex:
  - Fastest and always available.
  - Best for default structured entities: EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS, DATE, ZIP_CODE (DOB and ZIP are accepted as input aliases).
  - Use locales=["de"] for German structured IDs such as DE_VAT_ID, DE_IBAN, DE_TAX_ID, DE_POSTAL_CODE, and passport or residence permit numbers.
- spacy:
  - Requires pip install datafog[nlp].
  - Useful for unstructured entities like person and organization names.
- gliner:
  - Requires pip install datafog[nlp-advanced].
  - Stronger NER coverage than regex for unstructured text.
- smart:
  - Cascades regex with optional NER engines.
  - If optional deps are missing, it degrades gracefully and warns.
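The "cascade and degrade gracefully" behavior of a smart-style engine can be sketched as follows. This is a standard-library illustration of the pattern, not DataFog's implementation: `cascade`, `regex_pass`, the `ner_pass` callback, and the `EMAIL` pattern are all hypothetical names introduced here.

```python
import importlib.util
import re
import warnings

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # illustrative pattern only


def regex_pass(text):
    """Cheap structured pass that needs no optional dependencies."""
    return [("EMAIL", m.group()) for m in EMAIL.finditer(text)]


def cascade(text, ner_module="spacy", ner_pass=None):
    """Run the regex pass, then an optional NER pass if its module is installed."""
    found = regex_pass(text)
    if importlib.util.find_spec(ner_module) is None:
        # Optional dependency missing: warn and fall back to regex results.
        warnings.warn(f"{ner_module} not installed; using regex results only")
        return found
    if ner_pass is not None:
        found.extend(ner_pass(text))
    return found


# A module name that is assumed not to exist, to exercise the fallback path.
print(cascade("Jane wrote from jane@example.com today",
              ner_module="hypothetical_missing_ner_module"))
```

Checking availability with `importlib.util.find_spec` avoids importing a heavy NER package just to learn that it is absent, which keeps the regex-only path fast.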
The 4.x line keeps the main package story centered on lightweight text PII screening. OCR and Spark remain supported optional surfaces for users who already rely on them, but they are not required for the core import, default scan/redact helpers, or guardrail helpers.
- OCR:
  - Install datafog[ocr] for local image OCR helpers.
  - URL-based image downloading also needs datafog[web,ocr].
  - Tesseract usage requires the system tesseract binary.
  - Python 3.13 is validated for the OCR install profile, Pillow, pytesseract, and system Tesseract smoke checks.
  - Donut OCR requires datafog[nlp-advanced,ocr] and a model already available locally.
- Spark:
  - Install datafog[distributed] for SparkService.
  - Spark PII UDF helpers also require datafog[nlp] and an installed spaCy model.
  - A Java runtime is required by PySpark.
OCR and Spark are not deprecated. Their broader API and packaging overhaul is deferred; the 4.x goal is to keep them explicit, documented, and isolated from the lightweight core path.
The existing public API remains available.
from datafog import DataFog
result = DataFog().scan_text("Email john@example.com")
print(result["EMAIL"])

from datafog.services import TextService
service = TextService(engine="regex")
result = service.annotate_text_sync("Call (555) 123-4567")
print(result["PHONE"])

# Scan text
datafog scan-text "john@example.com"
# Redact text
datafog redact-text "john@example.com"
# Replace text with pseudonyms
datafog replace-text "john@example.com"
# Hash detected entities
datafog hash-text "john@example.com"
# Enable German regex identifiers
datafog redact-text "Steuer-ID 12345678901" --locale de

DataFog telemetry is disabled by default.
To opt in:
export DATAFOG_TELEMETRY=1

To force telemetry off:
export DATAFOG_NO_TELEMETRY=1
# or
export DO_NOT_TRACK=1

Telemetry does not include input text or detected PII values.
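One way to read the three variables together is sketched below. The precedence (either opt-out variable wins over the opt-in) is an assumption inferred from the word "force", and treating only the literal value "1" as set is also an assumption. `telemetry_enabled` is a hypothetical helper, not a DataFog function.

```python
import os


def telemetry_enabled(env=None):
    """Hypothetical resolver for the telemetry environment variables."""
    env = os.environ if env is None else env
    # Assumption: either opt-out variable forces telemetry off, even if
    # the opt-in variable is also set.
    if env.get("DATAFOG_NO_TELEMETRY") == "1" or env.get("DO_NOT_TRACK") == "1":
        return False
    # Disabled by default; enabled only on explicit opt-in.
    return env.get("DATAFOG_TELEMETRY") == "1"


print(telemetry_enabled({}))                                               # False
print(telemetry_enabled({"DATAFOG_TELEMETRY": "1"}))                       # True
print(telemetry_enabled({"DATAFOG_TELEMETRY": "1", "DO_NOT_TRACK": "1"}))  # False
```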
git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[all,dev]"
pip install -r requirements-dev.txt
pytest tests/