From c60a3d84fd81ab4fcd279e28a7cbf1f02f8f2ef9 Mon Sep 17 00:00:00 2001
From: Sid Mohan <61345237+sidmohan0@users.noreply.github.com>
Date: Thu, 2 Jul 2026 11:33:54 -0700
Subject: [PATCH] Reapply "Merge pull request #151 from DataFog/docs/v5-roadmap"

This reverts commit 2119918023f49d88f54d9bdcf338fd13ea6bb54f.
---
 .gitignore |  1 -
 ROADMAP.md | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 1 deletion(-)
 create mode 100644 ROADMAP.md

diff --git a/.gitignore b/.gitignore
index 1316a1f3..cf11a42c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -69,5 +69,4 @@ docs/*
 */**/__pycache__/
 notes/benchmarking_notes.md
-Roadmap.md
 notes/*
diff --git a/ROADMAP.md b/ROADMAP.md
new file mode 100644
index 00000000..1f10a519
--- /dev/null
+++ b/ROADMAP.md
@@ -0,0 +1,37 @@
+# DataFog Roadmap
+
+_Last updated: July 2026 — following the [v4.5.0 stable release](https://github.com/DataFog/datafog-python/releases/tag/v4.5.0)._
+
+DataFog's direction for v5: **the fastest, easiest offline PII firewall for AI apps, logs, and datasets** — no network calls, no model downloads on import, one obvious API.
+
+Independent benchmarking ranks DataFog fastest among Python PII libraries by orders of magnitude. The v5 cycle invests that speed advantage into precision, EU coverage, and a leaner package.
+
+## v5.0.0 themes
+
+### 1. Precision: validators, confidence scores, strictness presets
+
+Speed means nothing if you can't trust the findings. v5 adds a zero-dependency validation pass on every hit (Luhn for credit cards, IBAN mod-97, SSN structure rules, IP plausibility), confidence scores on findings, and strictness presets (`strict` / `balanced` / `lenient`) so you pick the precision/recall tradeoff instead of hand-filtering. We'll publish precision/recall benchmarks alongside the speed benchmarks.
+
+### 2. EU language & entity coverage
+
+v4.5.0 introduced German locale support. v5 generalizes it into a locale pack system — new languages become data plus tests, not code changes — starting with locale-independent EU entities (IBAN, VAT IDs, national ID formats) and expanding across FR/ES/IT/NL/PL through the v5.x line.
+
+### 3. A leaner package
+
+The core install stays pydantic-only and tiny. The heavyweight paths are being cut or spun out: the Donut/transformers OCR path goes away (pytesseract remains), the PySpark wrapper becomes a documented recipe, and legacy duplicate modules are deleted. CI enforces wheel-size and import-time budgets.
+
+### 4. Built for pipelines, not just scripts
+
+The core `scan`/`redact` functions are pure, stateless, and thread-safe, with batch/iterator APIs for high-throughput use. Instead of shipping transport connectors, we're publishing recipes for embedding DataFog in Kafka consumers, Vector/Fluent Bit transforms, and OpenTelemetry collector processors.
+
+### 5. Vault-friendly anonymization
+
+Deterministic tokenization with exportable mappings and format-preserving pseudonymization, so DataFog slots into vault-and-token architectures rather than competing with them.
+
+## API direction
+
+`scan()`, `redact()`, and `protect()` — shipped as previews in v4.5.0 — become the primary documented API in v5.0.0. The legacy `detect()`/`process()` functions keep working as compatibility shims throughout the v5.x line.
+
+## Feedback
+
+Roadmap priorities are shaped by user feedback — open a [GitHub issue](https://github.com/DataFog/datafog-python/issues) or join the [Discord](https://discord.gg/bzDth394R4).
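
Reviewer note on theme 1: the roadmap names Luhn and IBAN mod-97 as part of the zero-dependency validation pass but does not show what such a check looks like. The sketch below illustrates the two checksum rules using only the standard library. The function names `luhn_valid` and `iban_valid` are hypothetical and are not taken from DataFog's code or API; the sketch also assumes a candidate string has already been matched by a detector and only needs checksum validation.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Hypothetical helper for illustration; non-digit characters such as
    spaces or dashes are ignored.
    """
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check.

    Hypothetical helper for illustration; it checks only overall length
    bounds and the checksum, not per-country length or BBAN layout.
    """
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not s.isascii() or not s.isalnum():
        return False
    # Move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # Map letters to numbers (A=10 ... Z=35); digits map to themselves.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

A validation pass along these lines could discard or down-weight regex hits that fail the checksum, which is one way the strictness presets might trade precision against recall. How DataFog actually wires validators into confidence scores is not specified in this patch.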