From 0fe9cd1fe03ef3886ab5306723a635b0687a2c6a Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 25 Jun 2026 08:59:49 +0000 Subject: [PATCH 1/2] =?UTF-8?q?docs:=20add=20SDD=20&=20Spec=20Kit=20crash?= =?UTF-8?q?=20course=20(philosophy=20=E2=86=92=20shipped=20product)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Comprehensive guide covering the SDD philosophy (power inversion, core principles), the software design approach (constitution-as-law, intent vs. mechanism layers, gates, template-driven quality, spec persistence), a command-by-command Spec Kit how-to, and a t0-to-shipped end-to-end walkthrough that impersonates a senior AI engineer shipping an agentic product with a human-in-the-loop frontend. Includes anti-patterns, team workflow, calibration guidance, and cheat sheets. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01K7A9nEeKpPESM9ZQJcs2g1 --- docs/guides/sdd-crash-course.md | 890 ++++++++++++++++++++++++++++++++ 1 file changed, 890 insertions(+) create mode 100644 docs/guides/sdd-crash-course.md diff --git a/docs/guides/sdd-crash-course.md b/docs/guides/sdd-crash-course.md new file mode 100644 index 0000000000..5bdf21a1fe --- /dev/null +++ b/docs/guides/sdd-crash-course.md @@ -0,0 +1,890 @@ +# The SDD & Spec Kit Crash Course + +*From philosophy to a shipped product — how a senior engineer uses Spec-Driven +Development as load-bearing infrastructure, not ceremony.* + +> **Who this is for.** Engineers who already ship software and want SDD to be a +> real part of their lifecycle — not a demo. The running example is built by a +> fictional senior AI engineer shipping an **agentic product with a real +> frontend**, because that is where SDD earns its keep: non-determinism, evals, +> human-in-the-loop review, and a UI that has to feel good. +> +> **How to read it.** Parts I–II are the *why* and the *mental model*. Part III +> is the *reference* (every command, when to reach for it). Part IV is the +> *t0 → shipped* walkthrough with real prompts. Part V is how to operate it like +> a senior — including when **not** to use SDD. +> +> **A note on the code samples.** Prompts are shown in full because they are the +> thing you actually type. Generated artifacts (`spec.md`, `plan.md`, etc.) are +> shown as **captions and structure only** — Spec Kit writes the prose; your job +> is to learn the *shape* so you can review it. + +--- + +## Table of Contents + +- [Part I — The Philosophy](#part-i--the-philosophy) +- [Part II — The Software Design Approach](#part-ii--the-software-design-approach) +- [Part III — Spec Kit Features & How-Tos](#part-iii--spec-kit-features--how-tos) +- [Part IV — End-to-End: t0 to a Shipped Product](#part-iv--end-to-end-t0-to-a-shipped-product) +- [Part V — Operating SDD Like a Senior](#part-v--operating-sdd-like-a-senior) +- [Appendix — Cheat Sheets](#appendix--cheat-sheets) + +--- + +## Part I — The Philosophy + +### 1.1 The power inversion + +For decades, **code was the source of truth** and the spec was scaffolding — +written once, approved, then quietly abandoned the moment "real work" began. +The artifact you trusted was the one the machine ran. + +Spec-Driven Development **inverts that relationship**: the **specification +becomes the durable, executable artifact**, and code becomes its *expression* — +regenerable, replaceable, downstream. You no longer hand-translate intent into +syntax and lose fidelity at every step. You maintain intent precisely enough +that an AI agent can produce a faithful implementation from it, again and again. + +The practical consequence: when requirements change, you **change the spec and +regenerate**, instead of archaeology-ing through code to reconstruct what was +meant. The spec stops being documentation *about* the system and becomes the +*lever that moves* the system. + +### 1.2 Why this is viable *now* (and wasn't before) + +This idea is old; what changed is that LLMs got good enough to do the +"specification → working code" step with real fidelity. Three forces converge: + +- **Capability** — models can now interpret rich natural-language intent and + produce coherent, multi-file implementations. +- **Cost of change collapsed** — re-deriving an implementation from an updated + spec is now minutes, not a sprint, so treating code as disposable is rational. +- **Complexity demands it** — modern products carry constraints (compliance, + multi-stack, AI non-determinism) that *need* a structured intent layer to stay + coherent. + +The bottleneck moved. It is no longer "how fast can you type code." It is **"how +precisely can you express what you actually want, and how well can you keep that +expression honest."** SDD is the discipline for that. + +### 1.3 Core principles + +SDD rests on a handful of commitments: + +- **Intent before mechanism** — define the *what* and *why* completely before + the *how*. The stack is a downstream decision, not the starting point. +- **Multi-step refinement over one-shot generation** — you do not "prompt once + and pray." You build a spec, interrogate it, plan against it, decompose it, + and only then implement. Each step is a checkpoint. +- **Explicit uncertainty** — unknowns are *marked*, not silently guessed. + Ambiguity surfaces as `[NEEDS CLARIFICATION]`, not as a confident hallucination + three files deep. +- **Guardrails over vibes** — organizational principles (the *constitution*) and + structured templates constrain the model toward good outcomes instead of + hoping the prompt was perfect. +- **The spec is the contract** — reviews, debates, and decisions happen at the + spec/plan layer where they are cheap, not in a code review where they are + expensive and late. + +### 1.4 What SDD is **not** + +A senior recognizes SDD by what it refuses to be: + +- **Not "write a giant doc, then code by hand."** The artifacts are + *operational* — they drive generation. A spec nobody regenerates from is just + waterfall with extra steps. +- **Not vibe coding.** Vibe coding optimizes for "something that runs." SDD + optimizes for "the thing we agreed to build, traceable to why." +- **Not a replacement for engineering judgment.** The model drafts; **you** own + the constitution, the clarifications, the gate decisions, and the review. SDD + *concentrates* your judgment at the highest-leverage points. +- **Not all-or-nothing.** You can run the full pipeline on a load-bearing feature + and skip straight to `/speckit.implement` on a throwaway. Match ceremony to + stakes (see §5.4). + +--- + +## Part II — The Software Design Approach + +SDD is not just a workflow; it is an opinion about *how software should be +designed*. Five ideas carry the weight. + +### 2.1 The constitution as architectural law + +Before any feature exists, you write a **constitution** — a small set of binding +principles that govern *every* subsequent decision. It lives at +`.specify/memory/constitution.md` and is referenced during specify, plan, and +implement. + +Think of it as the project's **non-negotiable invariants**: code-quality rules, +a testing mandate, UX consistency requirements, performance budgets, dependency +discipline. Crucially, principles are **enforced as gates**, not posted as +inspirational wallpaper. A plan that violates "Test-First" or "Minimal +Dependencies" is supposed to *fail the gate* and force either a fix or an +explicit, justified exception. + +> The original SDD essay frames these as numbered "Articles" (Library-First, +> CLI Interface, Test-First, Simplicity, Anti-Abstraction, Integration-First, +> etc.). The exact articles matter less than the move: **promote your hardest-won +> engineering values into machine-checkable gates that the agent must pass +> through.** + +### 2.2 Intent layer vs. mechanism layer + +SDD draws a hard line through your artifacts: + +| Layer | Artifact | Answers | Owner of truth | +| --- | --- | --- | --- | +| **Intent** | `spec.md` | *What* / *why* — user stories, requirements, acceptance criteria | Product + Eng, no stack | +| **Mechanism** | `plan.md`, `research.md`, `data-model.md`, `contracts/` | *How* — stack, schemas, APIs, trade-offs | Eng | +| **Execution** | `tasks.md` → code | *Do it* — ordered, testable units | Agent, reviewed by Eng | + +The discipline: **keep the stack out of the spec.** "Users can drag-and-drop +albums" belongs in the spec. "We use React DnD Kit with optimistic updates" +belongs in the plan. This separation is what lets the same intent be +re-implemented in a different stack, and what keeps spec reviews about *product* +rather than *frameworks*. + +### 2.3 Template-driven quality: how structure constrains the model + +Spec Kit's templates are not formatting — they are **guardrails that make LLMs +produce better output**. Concretely, the templates: + +- **Prevent premature implementation detail** — the spec template structurally + pushes "how" content into later phases. +- **Force explicit uncertainty** — `[NEEDS CLARIFICATION]` markers are *required* + where the input is silent, converting silent assumptions into visible + questions. +- **Encode checklists and gates** — the model must walk a structured sequence + (requirement completeness, constitution check, simplicity/anti-abstraction + gates) rather than free-associate. +- **Order file creation** — research before data-model before contracts before + tasks, so each artifact stands on a settled foundation. + +The insight worth internalizing: **a good template is a prompt that runs every +time, on every feature, without you remembering to type it.** That is where +consistency at scale comes from. + +### 2.4 The pre-implementation gates ("Phase −1") + +Before code is generated, the plan passes through gates derived from your +constitution. Canonical examples: + +- **Simplicity gate** — are you adding more moving parts than the problem + requires? Default to fewer projects, fewer layers. +- **Anti-abstraction gate** — are you wrapping frameworks in needless + indirection? Use the platform directly until you have a concrete reason not to. +- **Integration-first gate** — are contracts defined and is testing wired against + real integrations before unit minutiae? + +A gate failure is a **feature, not a blocker**: it catches over-engineering and +ambiguity while the cost of changing course is a paragraph edit, not a rewrite. + +### 2.5 Spec persistence: greenfield, brownfield, and evolution + +SDD does **not** prescribe a single way to keep `spec.md`/`plan.md`/`tasks.md` +alive after requirements shift. You choose a persistence model: + +- **0→1 (greenfield)** — generate everything from high-level intent; the spec is + born with the code. +- **Iterative enhancement (brownfield)** — add features to an existing system; + specs describe deltas and integrate with reality on disk. +- **Spec evolution** — when requirements change, update the spec and re-derive, + using `/speckit.converge` to fold reality and intent back together. + +The senior move is to **decide your persistence model deliberately** at project +start, and to treat the spec as a living asset under version control alongside +the code — not a one-time generation prompt. + +--- + +## Part III — Spec Kit Features & How-Tos + +Spec Kit is the toolkit that operationalizes SDD: a CLI (`specify`) that +bootstraps your repo, and a set of agent slash-commands (`/speckit.*`) that drive +the pipeline. + +### 3.1 Install and initialize + +```bash +# Install the CLI (requires uv). Pin to the latest release tag. +uv tool install specify-cli --from git+https://github.com/github/spec-kit.git@vX.Y.Z + +# Scaffold a project wired for your agent. --integration picks the agent. +specify init redaktor --integration copilot # or: claude, gemini, codex, ... + +# Initialize inside an existing repo (brownfield): +specify init . --here --integration copilot + +# Keep the CLI current: +specify self check # read-only: is a newer release available? +specify self upgrade # upgrade in place +``` + +> **Agent-agnostic, with one syntax note.** Most agents expose commands as +> `/speckit.specify`, `/speckit.plan`, etc. **Codex CLI in skills mode** uses +> `$speckit-*`; **GitHub Copilot CLI** uses `/agents` to select the agent. This +> guide uses the `/speckit.*` form, and the examples are written against +> **Claude Code**. + +What `init` gives you: the `.specify/` directory (constitution, templates, +scripts), the `/speckit.*` commands registered with your agent, and the repo +prepared for the `specs/NNN-feature/` layout. + +### 3.2 The core pipeline at a glance + +The spine of every feature is five commands. The supporting cast (clarify, +analyze, checklist, converge, taskstoissues) wraps around them. + +```text +/speckit.constitution → .specify/memory/constitution.md (project law, once-ish) + │ +/speckit.specify → specs/NNN-feature/spec.md (WHAT/WHY; new branch) + │ ⟂ /speckit.clarify (de-risk ambiguity before planning) +/speckit.plan → plan.md + research.md + data-model.md + contracts/ + quickstart.md + │ ⟂ /speckit.checklist (quality-test the requirements) +/speckit.tasks → tasks.md (ordered, [P]-parallelizable) + │ ⟂ /speckit.analyze (cross-artifact consistency audit, read-only) + │ ⟂ /speckit.taskstoissues (optional: fan out to GitHub issues) +/speckit.implement → working code, task-by-task, marking [X] as it goes + │ +/speckit.converge → reconcile code-on-disk with spec; append missing work +``` + +Commands **hand off** to each other — after `specify`, the agent offers to jump +to `plan` or `clarify`; after `plan`, to `tasks` or `checklist`; and so on. You +can follow the rail or jump around as judgment dictates. + +### 3.3 Command-by-command how-to + +Each entry: **what it does · when to reach for it · example prompt · what it +emits · senior tips.** + +#### `/speckit.constitution` + +- **What** — creates/updates the project constitution and keeps dependent + templates in sync. +- **When** — once at project birth; revisit when a principle genuinely changes + (rare; version it). +- **Prompt** + + ```text + # Establish binding principles. Name the values you will NOT compromise, + # and say how they should arbitrate technical decisions. + /speckit.constitution Create principles focused on: (1) every behavior change + ships with a test, (2) AI components are evaluated, not just "tried" — every + prompt/agent has an eval set and a regression threshold, (3) the human is in + the loop for any irreversible action, (4) minimal dependencies and idempotent + file ops, (5) p95 latency budgets per endpoint. Include governance for how + these gate planning and implementation. + ``` + +- **Emits** — `.specify/memory/constitution.md` (with a versioned "Sync Impact" + header). +- **Tips** — keep it short and *enforceable*. A principle you can't imagine + failing a plan against is just decoration. Treat edits as semver: principle + changes are MAJOR. + +#### `/speckit.specify` + +- **What** — turns a natural-language feature description into a structured + spec; creates the feature branch and `specs/NNN-*/` directory. +- **When** — the start of every feature. +- **Prompt** — describe **what** and **why**; resist naming the stack. + + ```text + # Note what's present: actors, flows, rules, edge cases. Note what's absent: + # frameworks, libraries, schemas. That's deliberate. + /speckit.specify Build a service that detects and masks personal data (PII) + in Turkish-language documents before they're shared. A user pastes or uploads + text; the system highlights detected PII (names, national IDs, IBANs, phones, + addresses) with a confidence level; a human reviewer can accept, reject, or + edit each detection; on confirmation the system returns a masked copy plus an + audit record. Reviewers must never be able to export a document with unresolved + high-confidence detections. + ``` + +- **Emits** — `spec.md` with **user stories**, **functional requirements**, + **acceptance criteria**, and `[NEEDS CLARIFICATION]` markers where you were + silent. *(Caption only — learn the shape: numbered stories, testable + requirements, explicit edge cases.)* +- **Tips** — over-specify the *rules and edge cases*; under-specify the + *mechanism*. If you catch yourself writing a library name, move it to the plan. + +#### `/speckit.clarify` + +- **What** — interrogates the spec and asks up to **5 targeted questions** about + underspecified areas, then encodes your answers back into `spec.md`. +- **When** — immediately after `specify`, *before* `plan`, on anything + load-bearing. This is the cheapest bug-prevention in the whole pipeline. +- **Prompt** + + ```text + # Let it drive. Answer crisply; vague answers produce vague specs. + /speckit.clarify + ``` + +- **Emits** — an updated `spec.md` with ambiguities resolved and clarifications + recorded. +- **Tips** — if it asks something you "obviously" know, that's the point: the + spec didn't say it, so the implementation would have guessed. Answer it. + +#### `/speckit.plan` + +- **What** — produces the technical design from the spec and your stack input. +- **When** — once the spec is clarified and you're ready to commit to *how*. +- **Prompt** — *now* you bring the stack and constraints. + + ```text + # This is where architecture lives. Be specific about stack, boundaries, + # and the things you've already decided. + /speckit.plan Build with: FastAPI + Python 3.12 backend, a hybrid PII pipeline + (deterministic regex/checksum validators for IDs/IBANs + an LLM extractor for + names/addresses), an LLM-as-judge verification pass, Postgres for audit records, + and a React + TypeScript review UI with optimistic accept/reject. Stream + detections to the UI via SSE. Keep the detection engine importable and testable + independently of the web layer. No PII leaves the boundary unmasked in logs. + ``` + +- **Emits** — `plan.md`, plus (as warranted) `research.md` (trade-off studies), + `data-model.md` (schemas/entities), `contracts/` (API/event contracts), and + `quickstart.md` (key validation scenarios). *(Caption only.)* +- **Tips** — read `research.md` and `contracts/` like a design doc, because that + is what they are. This is your cheapest chance to kill a bad abstraction. + +#### `/speckit.checklist` + +- **What** — generates a **"unit test for your requirements."** It validates the + *quality* of the spec (completeness, clarity, consistency, coverage) — **not** + the implementation. +- **When** — after plan, before tasks, on domains where requirement quality is + the risk (UX, security, compliance, anything fuzzy). +- **Prompt** + + ```text + # Ask for a domain-scoped checklist. Think "are the requirements well-written?" + # not "does the button work?" + /speckit.checklist Generate a requirements-quality checklist for the human + review UX and the PII-handling rules: are confidence thresholds quantified? + is every detection category's masking behavior specified? are export-blocking + conditions unambiguous? are accessibility requirements for the reviewer + defined? + ``` + +- **Emits** — a checklist of questions that probe whether your *spec* is + airtight. Items like *"Is 'high confidence' quantified?"* — not *"Test that + masking works."* +- **Tips** — failing checklist items send you back to `specify`/`clarify`, not to + code. That's the loop working. + +#### `/speckit.tasks` + +- **What** — decomposes plan + design artifacts into an **ordered, dependency- + aware `tasks.md`**, marking parallelizable tasks `[P]`. +- **When** — once the plan is settled. +- **Prompt** + + ```text + /speckit.tasks Break the plan into tasks + ``` + +- **Emits** — `tasks.md`: setup → foundational → per-story → polish, each task + small and testable, with `[P]` on independent ones. *(Caption only — note the + phasing; you'll scope `implement` against it.)* +- **Tips** — skim for tasks that are too big to verify; a task you can't write a + test for is a task that's underspecified upstream. + +#### `/speckit.analyze` + +- **What** — a **non-destructive, read-only** cross-artifact audit. Checks + `spec.md`, `plan.md`, and `tasks.md` for inconsistencies, gaps, and drift. +- **When** — after tasks, before implement; and any time you suspect artifacts + have diverged. +- **Prompt** + + ```text + /speckit.analyze Run a project analysis for consistency + ``` + +- **Emits** — a report of mismatches (e.g., a requirement with no task, a task + with no requirement, a contract the plan never mentions). It **changes + nothing**; you decide the fixes. +- **Tips** — treat a clean `analyze` as your "ready to implement" gate. + +#### `/speckit.taskstoissues` *(optional)* + +- **What** — converts `tasks.md` into dependency-ordered **GitHub issues**. +- **When** — when work is shared across a team or you want the backlog tracked in + GitHub rather than a file. +- **Prompt** + + ```text + /speckit.taskstoissues Create GitHub issues from the tasks, preserving order + and dependencies, labeled by phase. + ``` + +- **Emits** — GitHub issues (via the GitHub MCP server) mirroring your task DAG. +- **Tips** — great for human/AI hand-offs: a teammate or another agent can pick up + an issue with full spec context linked. + +#### `/speckit.implement` + +- **What** — executes `tasks.md`, building the code task-by-task and marking each + `[X]` as it completes. +- **When** — after a clean analyze. This is the generation step. +- **Prompt** — and critically, **how to scope it** (see §3.5). + + ```text + # Full run for small features: + /speckit.implement Start the implementation in phases + + # Scoped run for large ones (recommended — keeps context healthy): + /speckit.implement only execute the Setup and Foundational phases, then stop + and report progress + ``` + +- **Emits** — working code, committed task-by-task; `tasks.md` updated with `[X]`. +- **Tips** — review *per phase*, not at the end. The completed-task markers mean + you can stop, review, and resume without losing your place. + +#### `/speckit.converge` + +- **What** — assesses **code-on-disk against spec/plan/tasks**, then **appends + the remaining unbuilt work** as new tasks so `implement` can finish it. +- **When** — brownfield reality checks; after manual edits; when a long + implementation drifted; whenever "what's actually built?" ≠ "what we said." +- **Prompt** + + ```text + /speckit.converge Assess the codebase against the spec and append any missing + work as tasks. + ``` + +- **Emits** — newly appended tasks in `tasks.md` capturing the delta between + intent and reality. Then you re-run `implement`. +- **Tips** — this is the *spec-evolution* workhorse. Change the spec, run + `converge`, run `implement` — that's how a living spec stays load-bearing. + +### 3.4 Extensions, presets, and bundles + +Spec Kit is customizable so it can match *your* lifecycle, not just the default. + +- **Extensions** — opt-in command packs that hook into the pipeline. Built-ins + include: + - **`git`** — commit/validate/feature/remote helpers so SDD phases map cleanly + to branches and commits. + - **`bug`** — `assess` → `test` → `fix`: a spec-driven *bug* workflow (reproduce + with a failing test first, then fix). + - **`agent-context`** — keeps the agent's working context updated as the project + grows. + - Extensions hook at defined points (`before_specify`, `before_plan`, …) via + `.specify/extensions.yml`. +- **Presets** — alternative command/template sets. **`lean`** trims the pipeline + for speed; **`scaffold`** is a starting point for authoring your own; the + templates are yours to fork. +- **Bundles** — role-based setups: **developer**, **product-manager**, + **business-analyst**, **security-researcher**. A bundle pre-wires the commands + and emphasis a given role needs. + +> Senior framing: extensions/presets/bundles are how you **encode your team's +> SDLC into the tool**. The default pipeline is a strong opinion; these let you +> make it *your* opinion. + +### 3.5 Handling complex features (the context problem) + +Big features sail through specify/plan/tasks and then **degrade mid- +`implement`** — the agent loses the plan, skips tasks, or hallucinates as the +context window fills and compaction kicks in. The root cause is context +exhaustion, and there are four fixes: + +```text +# 1) Scope each run — the simplest fix, works on any agent: +/speckit.implement only execute tasks T001-T010, then stop and report progress + +# 2) Delegate to sub-agents (if your agent supports them): +/speckit.implement delegate each parallel [P] task to a sub-agent + +# 3) Combine scoping + delegation for very large features: +/speckit.implement execute only the Core phase, delegate [P] tasks to sub-agents + +# 4) Decompose into smaller specs ("spec of specs") — when even one phase is +# too big, split the feature into sub-features, each with its own +# spec/plan/tasks cycle. Highest overhead; reserve for genuinely huge work. +``` + +Because completed tasks are marked `[X]`, scoped runs resume cleanly. **Default +to option 1**; reach for sub-agents when you want parallelism; decompose only +when a single phase still overflows. + +--- + +## Part IV — End-to-End: t0 to a Shipped Product + +> **The cast.** Deniz, a senior AI engineer, is shipping **Redaktör** — an +> agentic Turkish-document **PII detection & masking** service with a +> **human-in-the-loop review UI**. It's representative of real agentic-product +> work: non-deterministic model components, an eval discipline, a human approval +> step, and a frontend that has to feel trustworthy. +> +> Watch for the **load-bearing moves** (⚖️) — the moments where SDD does real +> work a senior refuses to skip. The point isn't the product; it's the *operating +> discipline*. + +### t0 — Frame the problem before touching the tool + +Deniz doesn't open the agent first. He writes three sentences in a notebook: +*who hurts today, what "done" means, what must never happen.* For Redaktör: +"Compliance teams manually redact Turkish docs; done = reviewer-approved masked +copy + audit trail; **never** export a doc with unresolved high-confidence PII." + +⚖️ **Load-bearing move:** the spec is only as good as the intent behind it. Five +minutes of human framing prevents an hour of regenerating a confidently-wrong +spec. + +### Phase 1 — Bootstrap & constitution + +```bash +specify init redaktor --integration claude +cd redaktor +``` + +Then, before any feature, he encodes the **values that arbitrate every later +decision** — and for an AI product, that explicitly includes *evals* and +*human-in-the-loop*: + +```text +# The constitution is where an AI product's hardest invariants become gates. +/speckit.constitution Create binding principles: (1) every behavior change ships +with a test; (2) NON-NEGOTIABLE — every AI component (extractor, judge) has a +versioned eval set with a regression threshold; promotion requires passing it; +(3) the human reviewer is the final authority — no irreversible export without +explicit approval; (4) PII never appears in logs, traces, or error messages; +(5) the detection engine is a library, importable and testable without the web +layer; (6) p95 detection latency budget: 2s for a 2-page document. Add governance: +plans that violate these must fail the gate or carry a written, justified +exception. +``` + +⚖️ **Load-bearing move:** principle (2) turns "we tried the prompt and it seemed +fine" into a *gate*. Principle (4) will later cause `analyze` and review to flag +any logging task that touches raw text. The constitution is doing architecture. + +### Phase 2 — Specify the *what*, ruthlessly stack-free + +```text +# Actors, flows, rules, edge cases. Zero frameworks. Zero model names. +/speckit.specify Build a service that detects and masks personal data in +Turkish-language documents before sharing. A reviewer pastes or uploads text; +the system streams back detected PII (full names, T.C. national IDs, IBANs, +phone numbers, postal addresses) each with a category and a confidence level; +the reviewer accepts, rejects, or edits each detection inline; on confirmation +the system returns a masked copy and writes an immutable audit record (who, +when, what changed). Hard rule: the system must block export while any +high-confidence detection is unresolved. Documents may be up to 20 pages. +``` + +The agent creates branch `001-pii-masking` and `specs/001-pii-masking/spec.md` +with user stories, functional requirements, acceptance criteria — and a few +`[NEEDS CLARIFICATION]` markers where Deniz was silent. + +> *(Caption of `spec.md`, not its contents:)* §User Stories (reviewer-centric), +> §Functional Requirements (FR-1 detect, FR-2 stream w/ confidence, FR-3 inline +> resolve, FR-4 export-block rule, FR-5 audit), §Edge Cases (empty doc, all-PII +> doc, mixed-language doc), §`[NEEDS CLARIFICATION]` on retention + masking +> format. + +⚖️ **Load-bearing move:** Deniz *notices* he never said how masking should look +(`[ALİ YILMAZ]` → `[AD-SOYAD]`? black box? partial?) — but he doesn't fix it by +editing prose. He lets `clarify` extract it, so the resolution is recorded +systematically. + +### Phase 3 — Clarify: spend questions now, not bugs later + +```text +/speckit.clarify +``` + +The agent asks five sharp questions. Deniz answers crisply: + +```text +# (paraphrased Q→A — answers get encoded back into spec.md) +Q: Confidence threshold for "high"? → ≥ 0.85 blocks export; 0.6–0.85 warns. +Q: Masking representation? → category tokens, e.g. [TCKN], [IBAN]. +Q: Retention of original text? → originals purged after audit write; + audit stores only masked + diff. +Q: Multi-language docs? → detect TR PII only; flag non-TR spans. +Q: Who can override an export block? → nobody in v1; it's a hard stop. +``` + +⚖️ **Load-bearing move:** "nobody can override the export block" is now a +*requirement with a test target*, not a tribal assumption. This single +clarification will shape a contract, a task, and a regression test downstream. + +### Phase 4 — Plan: now, and only now, the architecture + +```text +# Stack, boundaries, and the decisions already made enter here. +/speckit.plan Build with: FastAPI + Python 3.12; a HYBRID detection pipeline — +deterministic validators (regex + checksum) for T.C. IDs/IBANs/phones, and an +LLM extractor for names/addresses; an LLM-as-judge pass that scores each +candidate and assigns confidence; Postgres for immutable audit records; React + +TypeScript review UI with optimistic accept/reject and SSE streaming of +detections. Keep `redaktor.engine` importable and web-independent. Structured +logging with a redaction filter so raw text can never be logged. Provide an eval +harness (golden Turkish docs with labeled PII) for both extractor and judge. +``` + +> *(Caption of emitted design artifacts:)* +> +> - `research.md` — trade study: regex-only vs. LLM-only vs. **hybrid** (chosen: +> hybrid for precision on structured IDs + recall on names); SSE vs. +> WebSocket (chosen: SSE, one-way stream). +> - `data-model.md` — entities: `Document`, `Detection{category, span, +> confidence, status}`, `AuditRecord`. +> - `contracts/` — `POST /documents`, `GET /documents/{id}/detections` (SSE), +> `POST /detections/{id}/resolve`, `POST /documents/{id}/export` (enforces the +> block rule). +> - `quickstart.md` — the golden-path validation scenario. + +⚖️ **Load-bearing move:** Deniz reads `research.md` and pushes back: the judge +adds latency. He keeps it but adds a plan note — *deterministic validators +short-circuit the judge for checksum-valid IDs* — protecting the 2s p95 budget +from principle (6). This is a design debate happening in `plan.md`, where it's +free. + +### Phase 5 — Checklist the requirements, then task it out + +Because the **review UX** and **PII rules** are the real risk, he quality-tests +the *requirements* first: + +```text +/speckit.checklist Generate a requirements-quality checklist for the review UX +and PII rules: is every detection category's masking token specified? is the +export-block condition unambiguous and testable? are reviewer keyboard/a11y +requirements defined? is the SSE reconnect behavior on dropped connection +specified? +``` + +Two items fail — SSE reconnect behavior and a11y were never specified. He loops +back through `clarify` to fix the *spec*, not the code. Then: + +```text +/speckit.tasks Break the plan into tasks +``` + +> *(Caption of `tasks.md`:)* Setup (repo, CI, db) → Foundational (engine library, +> validators, eval harness) → Story A (detect+stream) → Story B (resolve UI) → +> Story C (export-block + audit) → Polish (a11y, latency, observability). `[P]` +> on the independent validator and UI-component tasks. + +### Phase 6 — Analyze: the "ready to build" gate + +```text +/speckit.analyze Run a project analysis for consistency +``` + +The read-only audit flags one real gap: **no task covers the "non-TR span" +flagging** from the clarify step. Deniz adds it (re-runs `tasks`), re-analyzes, +gets a clean report. + +⚖️ **Load-bearing move:** a clean `analyze` is his hard gate to implementation. A +requirement with no task is a feature that silently won't exist — caught here for +the price of a re-read. + +### Phase 7 — Implement, scoped by phase + +He never one-shots a feature this size. He scopes by phase to keep context +healthy (§3.5): + +```text +# 1) Foundation first — the engine + evals must exist before stories. +/speckit.implement only execute the Setup and Foundational phases, then stop and +report progress + +# (Deniz reviews: runs the eval harness, confirms validators pass on golden IDs.) + +# 2) Then story by story, reviewing each: +/speckit.implement execute Story A (detect + stream), then stop + +# 3) Large parallelizable polish — delegate if the agent supports sub-agents: +/speckit.implement execute the Polish phase, delegate [P] tasks to sub-agents +``` + +Each task lands as `[X]`. Between phases, Deniz **runs the app and the evals**, +not just the unit tests — because for an AI component, "tests pass" ≠ "it +detects well." The eval set from principle (2) is the real acceptance gate. + +⚖️ **Load-bearing move:** reviewing per phase means a bad masking decision is +caught after Story A, not after the whole product is wired. The `[X]` markers let +him stop and resume without losing the thread. + +### Phase 8 — Converge: reconcile reality with intent + +Mid-build, product asks for one change: *warn-level detections (0.6–0.85) should +also be reviewable, not just high-confidence ones.* Deniz does the +**spec-evolution loop**, not a hand-patch: + +```text +# 1) Update the spec/clarify with the new rule, then: +/speckit.converge Assess the codebase against the updated spec and append any +missing work as tasks. + +# 2) converge appends the delta tasks (UI filter for warn-level, export rule +# stays high-confidence-only). Then finish them: +/speckit.implement execute the newly appended tasks +``` + +⚖️ **Load-bearing move:** the requirement change entered through the *spec* and +propagated *down* — not as a stray commit. The spec stays the source of truth, so +the next engineer (or agent) reads intent that matches reality. + +### Phase 9 — Ship: CI, audit, and the human gate + +Before release, Deniz wires the lifecycle closure (often via the `git` extension +and his CI): + +- **CI gates** run the test suite *and* the eval suite; a regression below the + eval threshold **fails the build** (principle 2, now load-bearing in CI). +- **A redaction-filter test** asserts no raw PII reaches logs (principle 4). +- **An export-block test** asserts the hard stop from the clarify step. +- He has the agent fan remaining work to issues for a teammate: + + ```text + /speckit.taskstoissues Create GitHub issues for the remaining polish tasks, + preserving dependency order, labeled by phase. + ``` + +Redaktör ships: a reviewer pastes a document, watches detections stream in with +confidence, resolves each inline, and exports a masked copy with an audit trail — +and **cannot** export while a high-confidence detection is unresolved, because +that rule has been load-bearing since Phase 3. + +### What made this "senior," in one breath + +The model wrote most of the code. **Deniz owned the leverage points**: framing +intent (t0), promoting invariants to gates (constitution), spending clarify +questions early, debating architecture in `plan.md`, gating on a clean `analyze`, +reviewing *per phase* against **evals** not vibes, and routing every change +through the spec via `converge`. SDD didn't replace his judgment — it +*concentrated* it where it mattered. + +--- + +## Part V — Operating SDD Like a Senior + +### 5.1 Habits that separate load-bearing SDD from theater + +- **Always `clarify` before `plan`** on anything that matters. It's the highest + ROI command in the kit. +- **Read `research.md` and `contracts/`.** If you're not reviewing the design + artifacts, you're vibe coding with extra files. +- **Gate on a clean `analyze`.** Make "no orphan requirements, no orphan tasks" + your definition of implementation-ready. +- **Scope `implement` by phase, review between phases.** Don't discover problems + at the end. +- **Route changes through the spec → `converge` → `implement`.** Hand-patches rot + the spec; once it's stale, it stops being load-bearing and you're back to + code-as-truth. + +### 5.2 Anti-patterns to refuse + +- **Stack in the spec.** Frameworks in `spec.md` poison the intent layer and + block re-implementation. +- **Skipping clarify to "save time."** You don't save it; you move it to + debugging, at 10× the cost. +- **One-shot `implement` on a big feature.** Context exhaustion will corrupt the + back half of the run. +- **A constitution of platitudes.** If a principle can't fail a plan, delete it. +- **Treating generated artifacts as write-once.** A spec you never regenerate + from is documentation, not SDD. + +### 5.3 Fitting SDD to a team + +- Use **`taskstoissues`** to turn a task DAG into a shared backlog; each issue + carries spec context, so humans and agents can both pick up work. +- Use **bundles** (developer / PM / BA / security) so each role gets the + pipeline emphasis they need. +- Use **extensions** (`git`, `bug`, `agent-context`) to bind SDD phases to your + real branch/commit/triage flow. +- Keep `spec.md`/`plan.md`/`tasks.md` **in version control** next to the code and + review them in PRs — the spec diff *is* the design review. + +### 5.4 When **not** to reach for the full pipeline + +SDD ceremony should match stakes: + +| Situation | Reach for | +| --- | --- | +| Load-bearing feature, multiple stories, real risk | Full pipeline + clarify + checklist + analyze | +| Small, well-understood change | specify → plan → tasks → implement (skip the wrappers) | +| Throwaway spike / proof of concept | Straight to `/speckit.implement` with a tight prompt | +| Bug fix | The `bug` extension (assess → failing test → fix) | +| Requirements changed on a live feature | `clarify`/edit spec → `converge` → `implement` | + +The senior skill is calibration: enough structure to stay honest, not so much +that it's theater. + +### 5.5 The one-sentence summary + +**Make intent the durable artifact, promote your invariants to gates, spend your +judgment at the spec/plan/analyze checkpoints, and let the agent regenerate the +code — so that when the world changes, you change the spec and the system +follows.** + +--- + +## Appendix — Cheat Sheets + +### A. Command map + +| Command | Phase | Emits | Reach for it when | +| --- | --- | --- | --- | +| `/speckit.constitution` | Govern | `constitution.md` | Project birth; principle change | +| `/speckit.specify` | Intent | `spec.md` (+ branch) | Every new feature | +| `/speckit.clarify` | Intent | updated `spec.md` | Before planning anything load-bearing | +| `/speckit.plan` | Design | `plan.md`, `research.md`, `data-model.md`, `contracts/`, `quickstart.md` | Spec is clarified | +| `/speckit.checklist` | Design | requirements-quality checklist | Fuzzy/high-risk domains (UX, security) | +| `/speckit.tasks` | Decompose | `tasks.md` | Plan is settled | +| `/speckit.analyze` | Gate | read-only consistency report | Before implement; on suspected drift | +| `/speckit.taskstoissues` | Decompose | GitHub issues | Team/shared backlog | +| `/speckit.implement` | Build | code + `[X]` tasks | After a clean analyze | +| `/speckit.converge` | Evolve | appended delta tasks | Spec changed; brownfield reconcile | + +### B. Artifact graph + +```text +constitution.md ──governs──▶ spec.md ──derives──▶ plan.md ─┬─▶ research.md + ▲ ├─▶ data-model.md + clarify│ checklist ├─▶ contracts/ + │ └─▶ quickstart.md + │ │ + └────────── analyze (audits all) ──┤ + ▼ + tasks.md ──▶ code + ▲ │ + └ converge ┘ +``` + +### C. Glossary + +- **Constitution** — binding project principles, enforced as plan-time gates. +- **Gate (Phase −1)** — a pre-implementation check (simplicity, anti-abstraction, + integration-first) a plan must pass or explicitly waive. +- **`[NEEDS CLARIFICATION]`** — an explicit uncertainty marker; a question, not a + guess. +- **Persistence model** — your chosen strategy for keeping specs alive + (greenfield / brownfield / evolution). +- **`[P]` task** — a task with no blocking dependency; parallelizable. +- **Converge** — reconcile code-on-disk with intent by appending the missing + work as tasks. +- **Bundle / Preset / Extension** — role setups / alternative pipelines / opt-in + command packs that tailor Spec Kit to your SDLC. + +--- + +*Further reading in this repo: [`spec-driven.md`](../../spec-driven.md) (the +full SDD essay and the constitutional articles), +[`docs/concepts/sdd.md`](../concepts/sdd.md), +[`docs/concepts/complex-features.md`](../concepts/complex-features.md), and +[`docs/guides/evolving-specs.md`](./evolving-specs.md).* From a58091e25877625eb80b8e7ee2c1129489cc5d5a Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 25 Jun 2026 09:21:56 +0000 Subject: [PATCH 2/2] docs: rewrite SDD crash course to teach judgment over mechanics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace the feature-catalog/happy-path draft with a field guide centered on how a senior engineer actually thinks in SDD: the model as a fast, confidently-wrong collaborator; SDD as the apparatus that makes disagreement cheap and early; and the irreducible human work the tool can't do. The end-to-end example now goes wrong on purpose — a misframed problem at t0, a clean spec aimed at the wrong target, an over-engineered plan, an architecture inverted toward determinism, model drift caught by a pre-built trap, and a spec found wrong at demo and recovered cheaply via converge. Adds an adversarial artifact 'smell guide', an honest section on where SDD fights you, and process-calibration guidance. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01K7A9nEeKpPESM9ZQJcs2g1 --- docs/guides/sdd-crash-course.md | 1163 +++++++++++-------------------- 1 file changed, 389 insertions(+), 774 deletions(-) diff --git a/docs/guides/sdd-crash-course.md b/docs/guides/sdd-crash-course.md index 5bdf21a1fe..222b833e3c 100644 --- a/docs/guides/sdd-crash-course.md +++ b/docs/guides/sdd-crash-course.md @@ -1,890 +1,505 @@ -# The SDD & Spec Kit Crash Course +# Thinking in SDD -*From philosophy to a shipped product — how a senior engineer uses Spec-Driven -Development as load-bearing infrastructure, not ceremony.* +*A field guide to Spec-Driven Development for engineers who already know the +work is in the judgment, not the typing.* -> **Who this is for.** Engineers who already ship software and want SDD to be a -> real part of their lifecycle — not a demo. The running example is built by a -> fictional senior AI engineer shipping an **agentic product with a real -> frontend**, because that is where SDD earns its keep: non-determinism, evals, -> human-in-the-loop review, and a UI that has to feel good. +> This is not a tour of Spec Kit's commands. The command list is at the bottom in +> one table; you'll learn it in an afternoon. What takes years is knowing **what +> to distrust, when to go backwards, and which mistakes to make cheaply** — and +> that's what SDD is actually for. This guide teaches the thinking. The demo in +> the middle goes wrong on purpose, because that's where the lessons live. > -> **How to read it.** Parts I–II are the *why* and the *mental model*. Part III -> is the *reference* (every command, when to reach for it). Part IV is the -> *t0 → shipped* walkthrough with real prompts. Part V is how to operate it like -> a senior — including when **not** to use SDD. -> -> **A note on the code samples.** Prompts are shown in full because they are the -> thing you actually type. Generated artifacts (`spec.md`, `plan.md`, etc.) are -> shown as **captions and structure only** — Spec Kit writes the prose; your job -> is to learn the *shape* so you can review it. - ---- - -## Table of Contents - -- [Part I — The Philosophy](#part-i--the-philosophy) -- [Part II — The Software Design Approach](#part-ii--the-software-design-approach) -- [Part III — Spec Kit Features & How-Tos](#part-iii--spec-kit-features--how-tos) -- [Part IV — End-to-End: t0 to a Shipped Product](#part-iv--end-to-end-t0-to-a-shipped-product) -- [Part V — Operating SDD Like a Senior](#part-v--operating-sdd-like-a-senior) -- [Appendix — Cheat Sheets](#appendix--cheat-sheets) +> Prompts are shown in full (they're what you type). Generated artifacts are +> shown as captions — you should learn their *shape*, not memorize someone else's +> spec. --- -## Part I — The Philosophy - -### 1.1 The power inversion - -For decades, **code was the source of truth** and the spec was scaffolding — -written once, approved, then quietly abandoned the moment "real work" began. -The artifact you trusted was the one the machine ran. - -Spec-Driven Development **inverts that relationship**: the **specification -becomes the durable, executable artifact**, and code becomes its *expression* — -regenerable, replaceable, downstream. You no longer hand-translate intent into -syntax and lose fidelity at every step. You maintain intent precisely enough -that an AI agent can produce a faithful implementation from it, again and again. - -The practical consequence: when requirements change, you **change the spec and -regenerate**, instead of archaeology-ing through code to reconstruct what was -meant. The spec stops being documentation *about* the system and becomes the -*lever that moves* the system. - -### 1.2 Why this is viable *now* (and wasn't before) - -This idea is old; what changed is that LLMs got good enough to do the -"specification → working code" step with real fidelity. Three forces converge: - -- **Capability** — models can now interpret rich natural-language intent and - produce coherent, multi-file implementations. -- **Cost of change collapsed** — re-deriving an implementation from an updated - spec is now minutes, not a sprint, so treating code as disposable is rational. -- **Complexity demands it** — modern products carry constraints (compliance, - multi-stack, AI non-determinism) that *need* a structured intent layer to stay - coherent. - -The bottleneck moved. It is no longer "how fast can you type code." It is **"how -precisely can you express what you actually want, and how well can you keep that -expression honest."** SDD is the discipline for that. - -### 1.3 Core principles - -SDD rests on a handful of commitments: - -- **Intent before mechanism** — define the *what* and *why* completely before - the *how*. The stack is a downstream decision, not the starting point. -- **Multi-step refinement over one-shot generation** — you do not "prompt once - and pray." You build a spec, interrogate it, plan against it, decompose it, - and only then implement. Each step is a checkpoint. -- **Explicit uncertainty** — unknowns are *marked*, not silently guessed. - Ambiguity surfaces as `[NEEDS CLARIFICATION]`, not as a confident hallucination - three files deep. -- **Guardrails over vibes** — organizational principles (the *constitution*) and - structured templates constrain the model toward good outcomes instead of - hoping the prompt was perfect. -- **The spec is the contract** — reviews, debates, and decisions happen at the - spec/plan layer where they are cheap, not in a code review where they are - expensive and late. - -### 1.4 What SDD is **not** - -A senior recognizes SDD by what it refuses to be: - -- **Not "write a giant doc, then code by hand."** The artifacts are - *operational* — they drive generation. A spec nobody regenerates from is just - waterfall with extra steps. -- **Not vibe coding.** Vibe coding optimizes for "something that runs." SDD - optimizes for "the thing we agreed to build, traceable to why." -- **Not a replacement for engineering judgment.** The model drafts; **you** own - the constitution, the clarifications, the gate decisions, and the review. SDD - *concentrates* your judgment at the highest-leverage points. -- **Not all-or-nothing.** You can run the full pipeline on a load-bearing feature - and skip straight to `/speckit.implement` on a throwaway. Match ceremony to - stakes (see §5.4). +## The one idea everything else hangs on + +You are working with a collaborator that is **fast, fluent, tireless, and +confidently wrong in predictable ways**. It optimizes for *plausibility* — output +that looks like what a good answer looks like — not for *correctness*. It will +hand you a spec that reads beautifully and solves the wrong problem. It will +propose an architecture that's clean, conventional, and two layers heavier than +your problem needs. It will agree with whatever you said last. And it never tells +you what it doesn't know unless you force it to. + +Given that collaborator, the value of SDD is not "specs generate code." It's +this: **SDD front-loads disagreement to the cheapest possible layer.** A wrong +assumption caught in `spec.md` costs a sentence. The same assumption caught in +code costs a day. Caught in production, it costs a customer. Every loop in the +process — clarify, checklist, analyze, converge — exists to drag a future +expensive mistake forward into a present cheap one. + +So the senior's stance toward the whole apparatus is **adversarial, not +compliant.** You are not filling out a workflow. You are running a series of cheap +experiments designed to find out where you and the model are wrong *before* it's +expensive to be wrong. If a step isn't surfacing disagreement, you're doing it as +theater. + +Three corollaries fall out of this, and the rest of the guide is just these three +applied: + +1. **Every artifact the model produces is a suspect, not a deliverable.** Your + job when you read a generated spec or plan is to attack it: *what did it assume + that I didn't say? what failure mode does this gloss? where is it being + plausible instead of right?* + +2. **The process is non-linear because discovery is non-linear.** You will learn + things during planning that invalidate the spec, and things during + implementation that invalidate the plan. A senior expects this and *instruments + for it*. Going backwards isn't failure; refusing to is. + +3. **The hardest part isn't writing the spec — it's knowing what you actually + want, and you usually don't, at the start.** The real problem is often not the + one you typed. The process's job is to help you find the real one before you've + built the wrong one. --- -## Part II — The Software Design Approach - -SDD is not just a workflow; it is an opinion about *how software should be -designed*. Five ideas carry the weight. - -### 2.1 The constitution as architectural law - -Before any feature exists, you write a **constitution** — a small set of binding -principles that govern *every* subsequent decision. It lives at -`.specify/memory/constitution.md` and is referenced during specify, plan, and -implement. +## What the model is good at, and what is yours alone -Think of it as the project's **non-negotiable invariants**: code-quality rules, -a testing mandate, UX consistency requirements, performance budgets, dependency -discipline. Crucially, principles are **enforced as gates**, not posted as -inspirational wallpaper. A plan that violates "Test-First" or "Minimal -Dependencies" is supposed to *fail the gate* and force either a fix or an -explicit, justified exception. +Keep this division sharp, because most wasted effort comes from blurring it. -> The original SDD essay frames these as numbered "Articles" (Library-First, -> CLI Interface, Test-First, Simplicity, Anti-Abstraction, Integration-First, -> etc.). The exact articles matter less than the move: **promote your hardest-won -> engineering values into machine-checkable gates that the agent must pass -> through.** - -### 2.2 Intent layer vs. mechanism layer - -SDD draws a hard line through your artifacts: - -| Layer | Artifact | Answers | Owner of truth | +| The model is genuinely good at | Which means you should | And it is bad at | Which means you must never | | --- | --- | --- | --- | -| **Intent** | `spec.md` | *What* / *why* — user stories, requirements, acceptance criteria | Product + Eng, no stack | -| **Mechanism** | `plan.md`, `research.md`, `data-model.md`, `contracts/` | *How* — stack, schemas, APIs, trade-offs | Eng | -| **Execution** | `tasks.md` → code | *Do it* — ordered, testable units | Agent, reviewed by Eng | - -The discipline: **keep the stack out of the spec.** "Users can drag-and-drop -albums" belongs in the spec. "We use React DnD Kit with optimistic updates" -belongs in the plan. This separation is what lets the same intent be -re-implemented in a different stack, and what keeps spec reviews about *product* -rather than *frameworks*. - -### 2.3 Template-driven quality: how structure constrains the model - -Spec Kit's templates are not formatting — they are **guardrails that make LLMs -produce better output**. Concretely, the templates: - -- **Prevent premature implementation detail** — the spec template structurally - pushes "how" content into later phases. -- **Force explicit uncertainty** — `[NEEDS CLARIFICATION]` markers are *required* - where the input is silent, converting silent assumptions into visible - questions. -- **Encode checklists and gates** — the model must walk a structured sequence - (requirement completeness, constitution check, simplicity/anti-abstraction - gates) rather than free-associate. -- **Order file creation** — research before data-model before contracts before - tasks, so each artifact stands on a settled foundation. - -The insight worth internalizing: **a good template is a prompt that runs every -time, on every feature, without you remembering to type it.** That is where -consistency at scale comes from. - -### 2.4 The pre-implementation gates ("Phase −1") - -Before code is generated, the plan passes through gates derived from your -constitution. Canonical examples: - -- **Simplicity gate** — are you adding more moving parts than the problem - requires? Default to fewer projects, fewer layers. -- **Anti-abstraction gate** — are you wrapping frameworks in needless - indirection? Use the platform directly until you have a concrete reason not to. -- **Integration-first gate** — are contracts defined and is testing wired against - real integrations before unit minutiae? - -A gate failure is a **feature, not a blocker**: it catches over-engineering and -ambiguity while the cost of changing course is a paragraph edit, not a rewrite. - -### 2.5 Spec persistence: greenfield, brownfield, and evolution - -SDD does **not** prescribe a single way to keep `spec.md`/`plan.md`/`tasks.md` -alive after requirements shift. You choose a persistence model: - -- **0→1 (greenfield)** — generate everything from high-level intent; the spec is - born with the code. -- **Iterative enhancement (brownfield)** — add features to an existing system; - specs describe deltas and integrate with reality on disk. -- **Spec evolution** — when requirements change, update the spec and re-derive, - using `/speckit.converge` to fold reality and intent back together. - -The senior move is to **decide your persistence model deliberately** at project -start, and to treat the spec as a living asset under version control alongside -the code — not a one-time generation prompt. - ---- - -## Part III — Spec Kit Features & How-Tos - -Spec Kit is the toolkit that operationalizes SDD: a CLI (`specify`) that -bootstraps your repo, and a set of agent slash-commands (`/speckit.*`) that drive -the pipeline. +| Producing a complete, well-structured first draft fast | Let it draft everything; never start from a blank page | Knowing what's worth building | Outsource the "what" or the "why" | +| Surfacing options and conventions | Use it to widen your view of the solution space | Saying "I don't know" or "this is risky" | Trust silence as a sign of safety | +| Filling in plausible detail | Let it handle mechanism once intent is fixed | Distinguishing plausible from correct | Accept an artifact because it reads well | +| Tireless, consistent mechanical work | Hand it the decomposition and the typing | Holding a large design in coherent context | Let it one-shot a big feature | -### 3.1 Install and initialize +The work that is *yours, irreducibly*: deciding what problem is real, deciding +what "correct" means here, deciding what must never happen, deciding what to cut, +and noticing when the beautiful artifact in front of you is solving a problem you +don't have. SDD doesn't reduce that work. It *clears the noise away from it* so +you can spend your scarce judgment there instead of on boilerplate. -```bash -# Install the CLI (requires uv). Pin to the latest release tag. -uv tool install specify-cli --from git+https://github.com/github/spec-kit.git@vX.Y.Z +--- -# Scaffold a project wired for your agent. --integration picks the agent. -specify init redaktor --integration copilot # or: claude, gemini, codex, ... +## The build that went sideways (and what each turn taught) -# Initialize inside an existing repo (brownfield): -specify init . --here --integration copilot +> Deniz is shipping **Redaktör**: a tool that masks personal data (PII) in +> Turkish documents so a compliance team can share them safely. It's a good +> teaching case because "correct" here is adversarial, non-obvious, and carries +> real liability — exactly the conditions under which plausible-but-wrong is +> dangerous. Watch the *reasoning*, not the commands. Several turns are mistakes, +> caught at different costs. -# Keep the CLI current: -specify self check # read-only: is a newer release available? -specify self upgrade # upgrade in place -``` +### t0 — The first framing is wrong, and finding that out is the whole job -> **Agent-agnostic, with one syntax note.** Most agents expose commands as -> `/speckit.specify`, `/speckit.plan`, etc. **Codex CLI in skills mode** uses -> `$speckit-*`; **GitHub Copilot CLI** uses `/agents` to select the agent. This -> guide uses the `/speckit.*` form, and the examples are written against -> **Claude Code**. +Deniz's first instinct is "build a PII masker." He's about to type that into +`/speckit.specify`. He stops, because he's been burned by exactly this: a clean +spec for the wrong product. -What `init` gives you: the `.specify/` directory (constitution, templates, -scripts), the `/speckit.*` commands registered with your agent, and the repo -prepared for the `specs/NNN-feature/` layout. +He asks the question that the tool will never ask for him: **what is the real +failure here?** Not "the model misses a phone number" — the real failure is *a +compliance officer signs off on a document, and PII leaks anyway, and now it's +their name on the breach.* The product isn't an NLP problem. It's a **trust and +liability** problem. The reviewer has to trust an AI's redaction enough to put +their signature on it. That reframing changes everything downstream: it means +auditability, explainability, and a hard human gate matter *more* than detection +F1. -### 3.2 The core pipeline at a glance +⟶ **Lesson.** The most expensive mistakes are made before you touch the tool, in +the gap between the problem you typed and the problem you have. No command catches +this. It's the one part that is entirely you, and it's the part most people skip. -The spine of every feature is five commands. The supporting cast (clarify, -analyze, checklist, converge, taskstoissues) wraps around them. +He does *not* start with the constitution as a vague "code quality, testing, UX" +prompt (the kind of boilerplate that fails no plan). He writes principles that +encode the *liability* framing as hard gates: ```text -/speckit.constitution → .specify/memory/constitution.md (project law, once-ish) - │ -/speckit.specify → specs/NNN-feature/spec.md (WHAT/WHY; new branch) - │ ⟂ /speckit.clarify (de-risk ambiguity before planning) -/speckit.plan → plan.md + research.md + data-model.md + contracts/ + quickstart.md - │ ⟂ /speckit.checklist (quality-test the requirements) -/speckit.tasks → tasks.md (ordered, [P]-parallelizable) - │ ⟂ /speckit.analyze (cross-artifact consistency audit, read-only) - │ ⟂ /speckit.taskstoissues (optional: fan out to GitHub issues) -/speckit.implement → working code, task-by-task, marking [X] as it goes - │ -/speckit.converge → reconcile code-on-disk with spec; append missing work +# A constitution earns its place only if you can imagine a plan FAILING it. +# These are written to fail plans, not to decorate the repo. +/speckit.constitution Establish gates that a plan must pass or carry a written +exception: (1) Every masked document is reproducible from its audit record alone — +if we can't reconstruct what was redacted and why, the plan fails. (2) The human +reviewer is the last authority; no irreversible export without an explicit +approval action — an "auto-approve high confidence" feature fails this gate by +construction. (3) Raw document text never enters logs, traces, or error messages. +(4) Every model-based component carries an eval set and a regression threshold; +"we tried it and it looked good" is not acceptance. Bias toward fewer moving parts: +a plan that adds a service must justify why a function wouldn't do. ``` -Commands **hand off** to each other — after `specify`, the agent offers to jump -to `plan` or `clarify`; after `plan`, to `tasks` or `checklist`; and so on. You -can follow the rail or jump around as judgment dictates. - -### 3.3 Command-by-command how-to - -Each entry: **what it does · when to reach for it · example prompt · what it -emits · senior tips.** - -#### `/speckit.constitution` - -- **What** — creates/updates the project constitution and keeps dependent - templates in sync. -- **When** — once at project birth; revisit when a principle genuinely changes - (rare; version it). -- **Prompt** - - ```text - # Establish binding principles. Name the values you will NOT compromise, - # and say how they should arbitrate technical decisions. - /speckit.constitution Create principles focused on: (1) every behavior change - ships with a test, (2) AI components are evaluated, not just "tried" — every - prompt/agent has an eval set and a regression threshold, (3) the human is in - the loop for any irreversible action, (4) minimal dependencies and idempotent - file ops, (5) p95 latency budgets per endpoint. Include governance for how - these gate planning and implementation. - ``` - -- **Emits** — `.specify/memory/constitution.md` (with a versioned "Sync Impact" - header). -- **Tips** — keep it short and *enforceable*. A principle you can't imagine - failing a plan against is just decoration. Treat edits as semver: principle - changes are MAJOR. - -#### `/speckit.specify` - -- **What** — turns a natural-language feature description into a structured - spec; creates the feature branch and `specs/NNN-*/` directory. -- **When** — the start of every feature. -- **Prompt** — describe **what** and **why**; resist naming the stack. - - ```text - # Note what's present: actors, flows, rules, edge cases. Note what's absent: - # frameworks, libraries, schemas. That's deliberate. - /speckit.specify Build a service that detects and masks personal data (PII) - in Turkish-language documents before they're shared. A user pastes or uploads - text; the system highlights detected PII (names, national IDs, IBANs, phones, - addresses) with a confidence level; a human reviewer can accept, reject, or - edit each detection; on confirmation the system returns a masked copy plus an - audit record. Reviewers must never be able to export a document with unresolved - high-confidence detections. - ``` - -- **Emits** — `spec.md` with **user stories**, **functional requirements**, - **acceptance criteria**, and `[NEEDS CLARIFICATION]` markers where you were - silent. *(Caption only — learn the shape: numbered stories, testable - requirements, explicit edge cases.)* -- **Tips** — over-specify the *rules and edge cases*; under-specify the - *mechanism*. If you catch yourself writing a library name, move it to the plan. - -#### `/speckit.clarify` - -- **What** — interrogates the spec and asks up to **5 targeted questions** about - underspecified areas, then encodes your answers back into `spec.md`. -- **When** — immediately after `specify`, *before* `plan`, on anything - load-bearing. This is the cheapest bug-prevention in the whole pipeline. -- **Prompt** - - ```text - # Let it drive. Answer crisply; vague answers produce vague specs. - /speckit.clarify - ``` - -- **Emits** — an updated `spec.md` with ambiguities resolved and clarifications - recorded. -- **Tips** — if it asks something you "obviously" know, that's the point: the - spec didn't say it, so the implementation would have guessed. Answer it. - -#### `/speckit.plan` - -- **What** — produces the technical design from the spec and your stack input. -- **When** — once the spec is clarified and you're ready to commit to *how*. -- **Prompt** — *now* you bring the stack and constraints. - - ```text - # This is where architecture lives. Be specific about stack, boundaries, - # and the things you've already decided. - /speckit.plan Build with: FastAPI + Python 3.12 backend, a hybrid PII pipeline - (deterministic regex/checksum validators for IDs/IBANs + an LLM extractor for - names/addresses), an LLM-as-judge verification pass, Postgres for audit records, - and a React + TypeScript review UI with optimistic accept/reject. Stream - detections to the UI via SSE. Keep the detection engine importable and testable - independently of the web layer. No PII leaves the boundary unmasked in logs. - ``` - -- **Emits** — `plan.md`, plus (as warranted) `research.md` (trade-off studies), - `data-model.md` (schemas/entities), `contracts/` (API/event contracts), and - `quickstart.md` (key validation scenarios). *(Caption only.)* -- **Tips** — read `research.md` and `contracts/` like a design doc, because that - is what they are. This is your cheapest chance to kill a bad abstraction. - -#### `/speckit.checklist` - -- **What** — generates a **"unit test for your requirements."** It validates the - *quality* of the spec (completeness, clarity, consistency, coverage) — **not** - the implementation. -- **When** — after plan, before tasks, on domains where requirement quality is - the risk (UX, security, compliance, anything fuzzy). -- **Prompt** - - ```text - # Ask for a domain-scoped checklist. Think "are the requirements well-written?" - # not "does the button work?" - /speckit.checklist Generate a requirements-quality checklist for the human - review UX and the PII-handling rules: are confidence thresholds quantified? - is every detection category's masking behavior specified? are export-blocking - conditions unambiguous? are accessibility requirements for the reviewer - defined? - ``` - -- **Emits** — a checklist of questions that probe whether your *spec* is - airtight. Items like *"Is 'high confidence' quantified?"* — not *"Test that - masking works."* -- **Tips** — failing checklist items send you back to `specify`/`clarify`, not to - code. That's the loop working. - -#### `/speckit.tasks` - -- **What** — decomposes plan + design artifacts into an **ordered, dependency- - aware `tasks.md`**, marking parallelizable tasks `[P]`. -- **When** — once the plan is settled. -- **Prompt** - - ```text - /speckit.tasks Break the plan into tasks - ``` - -- **Emits** — `tasks.md`: setup → foundational → per-story → polish, each task - small and testable, with `[P]` on independent ones. *(Caption only — note the - phasing; you'll scope `implement` against it.)* -- **Tips** — skim for tasks that are too big to verify; a task you can't write a - test for is a task that's underspecified upstream. - -#### `/speckit.analyze` - -- **What** — a **non-destructive, read-only** cross-artifact audit. Checks - `spec.md`, `plan.md`, and `tasks.md` for inconsistencies, gaps, and drift. -- **When** — after tasks, before implement; and any time you suspect artifacts - have diverged. -- **Prompt** - - ```text - /speckit.analyze Run a project analysis for consistency - ``` - -- **Emits** — a report of mismatches (e.g., a requirement with no task, a task - with no requirement, a contract the plan never mentions). It **changes - nothing**; you decide the fixes. -- **Tips** — treat a clean `analyze` as your "ready to implement" gate. - -#### `/speckit.taskstoissues` *(optional)* - -- **What** — converts `tasks.md` into dependency-ordered **GitHub issues**. -- **When** — when work is shared across a team or you want the backlog tracked in - GitHub rather than a file. -- **Prompt** - - ```text - /speckit.taskstoissues Create GitHub issues from the tasks, preserving order - and dependencies, labeled by phase. - ``` - -- **Emits** — GitHub issues (via the GitHub MCP server) mirroring your task DAG. -- **Tips** — great for human/AI hand-offs: a teammate or another agent can pick up - an issue with full spec context linked. - -#### `/speckit.implement` - -- **What** — executes `tasks.md`, building the code task-by-task and marking each - `[X]` as it completes. -- **When** — after a clean analyze. This is the generation step. -- **Prompt** — and critically, **how to scope it** (see §3.5). - - ```text - # Full run for small features: - /speckit.implement Start the implementation in phases - - # Scoped run for large ones (recommended — keeps context healthy): - /speckit.implement only execute the Setup and Foundational phases, then stop - and report progress - ``` - -- **Emits** — working code, committed task-by-task; `tasks.md` updated with `[X]`. -- **Tips** — review *per phase*, not at the end. The completed-task markers mean - you can stop, review, and resume without losing your place. - -#### `/speckit.converge` - -- **What** — assesses **code-on-disk against spec/plan/tasks**, then **appends - the remaining unbuilt work** as new tasks so `implement` can finish it. -- **When** — brownfield reality checks; after manual edits; when a long - implementation drifted; whenever "what's actually built?" ≠ "what we said." -- **Prompt** - - ```text - /speckit.converge Assess the codebase against the spec and append any missing - work as tasks. - ``` - -- **Emits** — newly appended tasks in `tasks.md` capturing the delta between - intent and reality. Then you re-run `implement`. -- **Tips** — this is the *spec-evolution* workhorse. Change the spec, run - `converge`, run `implement` — that's how a living spec stays load-bearing. - -### 3.4 Extensions, presets, and bundles - -Spec Kit is customizable so it can match *your* lifecycle, not just the default. - -- **Extensions** — opt-in command packs that hook into the pipeline. Built-ins - include: - - **`git`** — commit/validate/feature/remote helpers so SDD phases map cleanly - to branches and commits. - - **`bug`** — `assess` → `test` → `fix`: a spec-driven *bug* workflow (reproduce - with a failing test first, then fix). - - **`agent-context`** — keeps the agent's working context updated as the project - grows. - - Extensions hook at defined points (`before_specify`, `before_plan`, …) via - `.specify/extensions.yml`. -- **Presets** — alternative command/template sets. **`lean`** trims the pipeline - for speed; **`scaffold`** is a starting point for authoring your own; the - templates are yours to fork. -- **Bundles** — role-based setups: **developer**, **product-manager**, - **business-analyst**, **security-researcher**. A bundle pre-wires the commands - and emphasis a given role needs. - -> Senior framing: extensions/presets/bundles are how you **encode your team's -> SDLC into the tool**. The default pipeline is a strong opinion; these let you -> make it *your* opinion. - -### 3.5 Handling complex features (the context problem) - -Big features sail through specify/plan/tasks and then **degrade mid- -`implement`** — the agent loses the plan, skips tasks, or hallucinates as the -context window fills and compaction kicks in. The root cause is context -exhaustion, and there are four fixes: - -```text -# 1) Scope each run — the simplest fix, works on any agent: -/speckit.implement only execute tasks T001-T010, then stop and report progress +⟶ **Lesson.** A constitution is not values; it's a set of trip-wires. Principle +(2) is written so that a *specific tempting feature* — auto-approve — is dead on +arrival. That's what makes it load-bearing instead of inspirational. -# 2) Delegate to sub-agents (if your agent supports them): -/speckit.implement delegate each parallel [P] task to a sub-agent +### Specify — the model writes a good spec for the wrong product -# 3) Combine scoping + delegation for very large features: -/speckit.implement execute only the Core phase, delegate [P] tasks to sub-agents +He describes the *what*, deliberately omitting stack, and gets back a clean +`spec.md`. It reads well. That's the trap. -# 4) Decompose into smaller specs ("spec of specs") — when even one phase is -# too big, split the feature into sub-features, each with its own -# spec/plan/tasks cycle. Highest overhead; reserve for genuinely huge work. +```text +/speckit.specify A reviewer uploads a Turkish document; the system detects PII +(names, T.C. national IDs, IBANs, phones, addresses), shows each detection with a +confidence level, lets the reviewer accept/reject/edit inline, and on confirmation +returns a masked copy plus an audit record. Export is blocked while any +high-confidence detection is unresolved. ``` -Because completed tasks are marked `[X]`, scoped runs resume cleanly. **Default -to option 1**; reach for sub-agents when you want parallelism; decompose only -when a single phase still overflows. - ---- - -## Part IV — End-to-End: t0 to a Shipped Product +> *(Caption — the spec the model produced:)* tidy user stories, requirements +> organized around **detection accuracy and the masking flow**, sensible edge +> cases. Looks done. -> **The cast.** Deniz, a senior AI engineer, is shipping **Redaktör** — an -> agentic Turkish-document **PII detection & masking** service with a -> **human-in-the-loop review UI**. It's representative of real agentic-product -> work: non-deterministic model components, an eval discipline, a human approval -> step, and a frontend that has to feel trustworthy. -> -> Watch for the **load-bearing moves** (⚖️) — the moments where SDD does real -> work a senior refuses to skip. The point isn't the product; it's the *operating -> discipline*. +Deniz reads it adversarially and the smell hits immediately: **the spec is +optimizing detection, but his real problem was trust.** The model did exactly what +he asked and nothing he meant. There's no requirement about *why a reviewer would +believe a mask is correct*, nothing about what happens when the model and the +reviewer disagree, nothing about the liability trail beyond a vague "audit +record." It's a spec for a demo, not for someone's signature. -### t0 — Frame the problem before touching the tool +He doesn't tweak it. He rewrites the intent, because a spec aimed at the wrong +target can't be patched into the right one: -Deniz doesn't open the agent first. He writes three sentences in a notebook: -*who hurts today, what "done" means, what must never happen.* For Redaktör: -"Compliance teams manually redact Turkish docs; done = reviewer-approved masked -copy + audit trail; **never** export a doc with unresolved high-confidence PII." +```text +# Re-aim the spec at the real problem: reviewer trust and defensible sign-off. +/speckit.specify Reframe: the user is a compliance officer who must personally +attest that a document is safe to share. The system's job is to make that +attestation FAST and DEFENSIBLE, not just to detect PII. Requirements should cover: +how a reviewer sees what is under a mask without exposing it to anyone else; how +disagreements between the model and the reviewer are recorded; what the audit +record must contain to defend the decision later; and what the reviewer is and +isn't allowed to do (e.g., no bulk auto-accept). Detection is necessary but not the +point — the point is a sign-off the reviewer can stand behind. +``` -⚖️ **Load-bearing move:** the spec is only as good as the intent behind it. Five -minutes of human framing prevents an hour of regenerating a confidently-wrong -spec. +⟶ **Lesson.** "The model gave me a clean spec" is a warning, not a win. A clean +artifact aimed at the wrong problem is more dangerous than a messy one aimed at +the right problem, because the cleanliness lowers your guard. Read every generated +spec by asking *what problem is this actually solving, and is it mine?* -### Phase 1 — Bootstrap & constitution +### Clarify — the questions it asks matter less than the one it doesn't -```bash -specify init redaktor --integration claude -cd redaktor +```text +/speckit.clarify ``` -Then, before any feature, he encodes the **values that arbitrate every later -decision** — and for an AI product, that explicitly includes *evals* and -*human-in-the-loop*: +The model asks five questions. Deniz's read: ```text -# The constitution is where an AI product's hardest invariants become gates. -/speckit.constitution Create binding principles: (1) every behavior change ships -with a test; (2) NON-NEGOTIABLE — every AI component (extractor, judge) has a -versioned eval set with a regression threshold; promotion requires passing it; -(3) the human reviewer is the final authority — no irreversible export without -explicit approval; (4) PII never appears in logs, traces, or error messages; -(5) the detection engine is a library, importable and testable without the web -layer; (6) p95 detection latency budget: 2s for a 2-page document. Add governance: -plans that violate these must fail the gate or carry a written, justified -exception. +# Two are real, three are filler. The filler tells me where the model is pattern- +# matching instead of thinking — and the GAP tells me more than the questions. +Q: Confidence threshold for "high"? → real. ≥0.85 blocks export. +Q: Masking token format? → real. Category tokens: [TCKN], [IBAN]. +Q: Supported file types? → filler. Doesn't change the design. +Q: Max document size? → filler. Pick 20 pages, move on. +Q: Light or dark theme? → noise. ``` -⚖️ **Load-bearing move:** principle (2) turns "we tried the prompt and it seemed -fine" into a *gate*. Principle (4) will later cause `analyze` and review to flag -any logging task that touches raw text. The constitution is doing architecture. +The signal isn't in the answers. It's that **the model did not ask the question +that actually decides the product**: *when the model flags PII the reviewer thinks +is fine — or misses PII the reviewer spots — what happens, and who owns that +decision in the audit trail?* That's the entire trust mechanism, and the tool was +blind to it because it's not a pattern that shows up in training; it's specific to +his liability framing. -### Phase 2 — Specify the *what*, ruthlessly stack-free +So he injects it himself, into the spec, by hand-driving the clarification: ```text -# Actors, flows, rules, edge cases. Zero frameworks. Zero model names. -/speckit.specify Build a service that detects and masks personal data in -Turkish-language documents before sharing. A reviewer pastes or uploads text; -the system streams back detected PII (full names, T.C. national IDs, IBANs, -phone numbers, postal addresses) each with a category and a confidence level; -the reviewer accepts, rejects, or edits each detection inline; on confirmation -the system returns a masked copy and writes an immutable audit record (who, -when, what changed). Hard rule: the system must block export while any -high-confidence detection is unresolved. Documents may be up to 20 pages. +/speckit.clarify Record this decision explicitly in the spec: every reviewer +override of a model detection (accept-despite-low-confidence, reject-despite-high, +manual-add) is a first-class audited event with reviewer identity and timestamp. +The model's opinion and the reviewer's decision are both stored; on conflict, the +reviewer's decision governs the export but the model's dissent is retained in the +record. This is the core trust mechanism — make it a requirement, not an +implementation detail. ``` -The agent creates branch `001-pii-masking` and `specs/001-pii-masking/spec.md` -with user stories, functional requirements, acceptance criteria — and a few -`[NEEDS CLARIFICATION]` markers where Deniz was silent. - -> *(Caption of `spec.md`, not its contents:)* §User Stories (reviewer-centric), -> §Functional Requirements (FR-1 detect, FR-2 stream w/ confidence, FR-3 inline -> resolve, FR-4 export-block rule, FR-5 audit), §Edge Cases (empty doc, all-PII -> doc, mixed-language doc), §`[NEEDS CLARIFICATION]` on retention + masking -> format. +⟶ **Lesson.** Clarify is not "answer the model's quiz." It's a prompt to ask +*yourself* what's underspecified. The model's questions map what's *conventional* +to underspecify; your job is to find what's *uniquely* dangerous to leave +unsaid here. Often the most important clarification is one the tool never raised. -⚖️ **Load-bearing move:** Deniz *notices* he never said how masking should look -(`[ALİ YILMAZ]` → `[AD-SOYAD]`? black box? partial?) — but he doesn't fix it by -editing prose. He lets `clarify` extract it, so the resolution is recorded -systematically. +### Plan — reading the smell of a plausible over-design -### Phase 3 — Clarify: spend questions now, not bugs later +Now stack and constraints are fair game. ```text -/speckit.clarify +/speckit.plan FastAPI + Python 3.12 backend; React + TypeScript reviewer UI with +SSE streaming of detections; Postgres for audit records. Detection engine must be +an importable library, testable without the web layer. Structured logging with a +redaction filter. Eval harness over a labeled Turkish corpus. ``` -The agent asks five sharp questions. Deniz answers crisply: - -```text -# (paraphrased Q→A — answers get encoded back into spec.md) -Q: Confidence threshold for "high"? → ≥ 0.85 blocks export; 0.6–0.85 warns. -Q: Masking representation? → category tokens, e.g. [TCKN], [IBAN]. -Q: Retention of original text? → originals purged after audit write; - audit stores only masked + diff. -Q: Multi-language docs? → detect TR PII only; flag non-TR spans. -Q: Who can override an export block? → nobody in v1; it's a hard stop. -``` +The model returns a confident, conventional plan — and Deniz reads two smells: -⚖️ **Load-bearing move:** "nobody can override the export block" is now a -*requirement with a test target*, not a tribal assumption. This single -clarification will shape a contract, a task, and a regression test downstream. - -### Phase 4 — Plan: now, and only now, the architecture +**Smell 1: it reached for an LLM-as-judge verification pass over every detection.** +Plausible (more checking = better, right?), and wrong for v1. He reasons about the +*cost*, which the model never does unprompted: the judge roughly doubles latency +and per-document cost, and — worse — adds a *second non-deterministic component he +now has to eval and keep from regressing.* For what? Marginal recall on categories +that are already strong. He cuts it, and crucially **writes down why**, so the +next engineer (or the next agent) doesn't helpfully "fix" the omission: ```text -# Stack, boundaries, and the decisions already made enter here. -/speckit.plan Build with: FastAPI + Python 3.12; a HYBRID detection pipeline — -deterministic validators (regex + checksum) for T.C. IDs/IBANs/phones, and an -LLM extractor for names/addresses; an LLM-as-judge pass that scores each -candidate and assigns confidence; Postgres for immutable audit records; React + -TypeScript review UI with optimistic accept/reject and SSE streaming of -detections. Keep `redaktor.engine` importable and web-independent. Structured -logging with a redaction filter so raw text can never be logged. Provide an eval -harness (golden Turkish docs with labeled PII) for both extractor and judge. +/speckit.plan Remove the LLM-judge pass for v1 and record the rationale in plan.md: +it doubles latency/cost and adds a second model to eval for marginal recall on +categories already covered. Revisit only if eval shows a recall gap that determinism +and a single extractor can't close. Keep the architecture as flat as the problem +allows — no message queue, no service split; this is one process until proven +otherwise. ``` -> *(Caption of emitted design artifacts:)* -> -> - `research.md` — trade study: regex-only vs. LLM-only vs. **hybrid** (chosen: -> hybrid for precision on structured IDs + recall on names); SSE vs. -> WebSocket (chosen: SSE, one-way stream). -> - `data-model.md` — entities: `Document`, `Detection{category, span, -> confidence, status}`, `AuditRecord`. -> - `contracts/` — `POST /documents`, `GET /documents/{id}/detections` (SSE), -> `POST /detections/{id}/resolve`, `POST /documents/{id}/export` (enforces the -> block rule). -> - `quickstart.md` — the golden-path validation scenario. - -⚖️ **Load-bearing move:** Deniz reads `research.md` and pushes back: the judge -adds latency. He keeps it but adds a plan note — *deterministic validators -short-circuit the judge for checksum-valid IDs* — protecting the 2s p95 budget -from principle (6). This is a design debate happening in `plan.md`, where it's -free. - -### Phase 5 — Checklist the requirements, then task it out - -Because the **review UX** and **PII rules** are the real risk, he quality-tests -the *requirements* first: +**Smell 2 — and this is the real insight — the architecture is upside down.** +While reviewing `research.md`, Deniz notices that T.C. national IDs and IBANs have +*checksums*. The highest-liability categories don't need a model at all; a few +lines of deterministic validation are more accurate than any LLM and trivially +auditable. That inverts the design: **this is mostly a deterministic system, with +AI used only where determinism fails (names, free-form addresses).** The "AI +product" is 80% boring code, and that's a feature — boring code is auditable, and +auditability is the actual requirement. ```text -/speckit.checklist Generate a requirements-quality checklist for the review UX -and PII rules: is every detection category's masking token specified? is the -export-block condition unambiguous and testable? are reviewer keyboard/a11y -requirements defined? is the SSE reconnect behavior on dropped connection -specified? +/speckit.plan Restructure detection around determinism-first: regex + checksum +validators own T.C. IDs, IBANs, and phones (these are exact and auditable). The +LLM extractor handles only names and free-form addresses, where determinism fails. +Make the validator path the default and the model path the exception. Note in +plan.md: this is deliberate — the most legally sensitive categories must be +explainable without invoking a model. ``` -Two items fail — SSE reconnect behavior and a11y were never specified. He loops -back through `clarify` to fix the *spec*, not the code. Then: - -```text -/speckit.tasks Break the plan into tasks -``` +⟶ **Lesson.** Two senior reflexes here. First: **the model never volunteers the +cost of its suggestions** — latency, money, a new failure surface, eval burden — +so you must price every "let's also add…" yourself and often decline. Second: the +best architectural insight in an AI product is frequently *use less AI* — push +work onto deterministic, auditable code wherever it actually works, and reserve +the model for the irreducibly fuzzy part. Read `research.md` and `plan.md` like a +design review, because that's the last place a bad structure is cheap to kill. -> *(Caption of `tasks.md`:)* Setup (repo, CI, db) → Foundational (engine library, -> validators, eval harness) → Story A (detect+stream) → Story B (resolve UI) → -> Story C (export-block + audit) → Polish (a11y, latency, observability). `[P]` -> on the independent validator and UI-component tasks. +He overrides one of the simplicity-gate's own suggestions (it wanted a separate +validation microservice "for scalability") with a one-line documented exception — +*premature; one process until load data says otherwise.* Gates inform; they don't +command. A senior overrides them **with a written reason**, which is different from +ignoring them. -### Phase 6 — Analyze: the "ready to build" gate +### Tasks & analyze — using the cheap audit as a real gate ```text +/speckit.tasks Break the plan into tasks /speckit.analyze Run a project analysis for consistency ``` -The read-only audit flags one real gap: **no task covers the "non-TR span" -flagging** from the clarify step. Deniz adds it (re-runs `tasks`), re-analyzes, -gets a clean report. +`analyze` is read-only and catches one thing that matters: the override-audit +requirement Deniz injected during clarify has **no task**. It would have silently +not been built — and it's the core trust mechanism. He adds it and re-runs until +clean. -⚖️ **Load-bearing move:** a clean `analyze` is his hard gate to implementation. A -requirement with no task is a feature that silently won't exist — caught here for -the price of a re-read. +⟶ **Lesson.** A requirement with no task is a feature that won't exist, and nobody +will notice until it's missing in production. `analyze` is a near-free adversarial +reader; treating "clean analyze" as a hard gate before implementing is one of the +highest-leverage habits in the whole method. The cost of running it is a minute; +the cost of the gap it finds is a postmortem. -### Phase 7 — Implement, scoped by phase +### Implement — it drifts, and the process (not your vigilance) catches it -He never one-shots a feature this size. He scopes by phase to keep context -healthy (§3.5): +He never one-shots this. He scopes by phase, because he knows the model degrades +as context fills — it starts strong and hallucinates around the edges of a long +run: ```text -# 1) Foundation first — the engine + evals must exist before stories. -/speckit.implement only execute the Setup and Foundational phases, then stop and -report progress +# Foundation first; nothing else can be trusted until the engine + evals exist. +/speckit.implement execute the Setup and Foundational phases only, then stop and +report +``` -# (Deniz reviews: runs the eval harness, confirms validators pass on golden IDs.) +He runs the eval harness himself before going further, because for a model +component **"the unit tests pass" and "it actually detects well" are different +claims**, and only the eval set speaks to the second. -# 2) Then story by story, reviewing each: -/speckit.implement execute Story A (detect + stream), then stop +Then, mid-build around task 7, the model does something very on-brand: in a new +error handler, it logs the offending document snippet to help debugging. That +violates constitution principle (3) — raw text in logs. Deniz doesn't catch it by +reading every line; nobody reliably does that. He catches it because he'd written +a test that greps emitted logs for PII patterns. **The failure mode he predicted, +he instrumented for.** -# 3) Large parallelizable polish — delegate if the agent supports sub-agents: -/speckit.implement execute the Polish phase, delegate [P] tasks to sub-agents +```text +# He'd added this as a foundational task precisely because he knew the model +# would eventually "helpfully" log raw text. The test is the guard, not his eyes. +/speckit.implement the log-redaction test is failing on the new error handler — +fix the handler to log a reference id, not document content, and confirm the test +passes ``` -Each task lands as `[X]`. Between phases, Deniz **runs the app and the evals**, -not just the unit tests — because for an AI component, "tests pass" ≠ "it -detects well." The eval set from principle (2) is the real acceptance gate. +⟶ **Lesson.** You cannot out-vigilance a tireless collaborator by reading harder. +You beat its predictable failures by **building the trap before it walks into +them**: a test that fails on the exact mistake you know it tends to make. The +constitution names the invariant; a test makes the invariant self-enforcing. That +pairing is the difference between a principle and a hope. -⚖️ **Load-bearing move:** reviewing per phase means a bad masking decision is -caught after Story A, not after the whole product is wired. The `[X]` markers let -him stop and resume without losing the thread. +### The spec was *still* wrong — and here's where SDD pays for itself -### Phase 8 — Converge: reconcile reality with intent +Foundation, detection, review UI, audit — all built, all green, demo to the +compliance team. It fails. Not on a bug. The reviewers won't sign off, because +**they can't see what's under a mask without un-masking the whole document**, and +un-masking in front of others is exactly the exposure they're trying to avoid. The +product is technically correct and practically useless. The trust mechanism Deniz +was so proud of is unusable in the room where trust actually happens. -Mid-build, product asks for one change: *warn-level detections (0.6–0.85) should -also be reviewable, not just high-confidence ones.* Deniz does the -**spec-evolution loop**, not a hand-patch: +This is the expensive discovery — the kind that, in a code-first project, means a +painful rewrite. Here it's a *spec* problem (he never specified *how* a reviewer +inspects a mask safely), so he fixes it where it's cheap: he edits intent, then +lets convergence find the delta between the new intent and the built reality. ```text -# 1) Update the spec/clarify with the new rule, then: -/speckit.converge Assess the codebase against the updated spec and append any +# Fix the SPEC, then reconcile reality to it. Not a hand-patch in the UI — +# the change has to live in intent so it survives the next person. +/speckit.clarify Add the missing requirement: a reviewer must be able to inspect +the original value behind a single mask, in place, visible only to them, without +exposing the rest of the document or persisting the reveal. Specify the audit +consequence: each reveal is itself an audited event. + +/speckit.converge Assess the codebase against the updated spec and append the missing work as tasks. -# 2) converge appends the delta tasks (UI filter for warn-level, export rule -# stays high-confidence-only). Then finish them: /speckit.implement execute the newly appended tasks ``` -⚖️ **Load-bearing move:** the requirement change entered through the *spec* and -propagated *down* — not as a stray commit. The spec stays the source of truth, so -the next engineer (or agent) reads intent that matches reality. - -### Phase 9 — Ship: CI, audit, and the human gate +⟶ **Lesson.** This is the entire argument for SDD in one event. You *will* discover +late that the spec was wrong — that's not a process failure, it's the nature of +building things for real humans, whose needs you can't fully know in advance. The +question SDD answers is: **when that discovery lands, does it cost a paragraph or a +rewrite?** Because the intent lived in a spec the system regenerates from, a +demo-day catastrophe became an afternoon. That's the payoff — not the clean happy +path, but the cheap recovery from the inevitable wrong turn. -Before release, Deniz wires the lifecycle closure (often via the `git` extension -and his CI): +### Ship — the gate that lets him sleep -- **CI gates** run the test suite *and* the eval suite; a regression below the - eval threshold **fails the build** (principle 2, now load-bearing in CI). -- **A redaction-filter test** asserts no raw PII reaches logs (principle 4). -- **An export-block test** asserts the hard stop from the clarify step. -- He has the agent fan remaining work to issues for a teammate: +CI runs tests *and* the eval suite; a recall regression below threshold fails the +build, so the model can't silently get worse over time. The log-PII test and the +export-block test are non-negotiable gates. The human approval step is the actual +product. None of this is ceremony — each one is a specific way the system could +have hurt someone, turned into a wall. - ```text - /speckit.taskstoissues Create GitHub issues for the remaining polish tasks, - preserving dependency order, labeled by phase. - ``` +⟶ **Lesson.** "Done" for an AI product isn't "it works." It's "I know the specific +ways this can fail, and each one is now something that fails the build instead of +failing a customer." That confidence is what you're actually shipping. -Redaktör ships: a reviewer pastes a document, watches detections stream in with -confidence, resolves each inline, and exports a masked copy with an audit trail — -and **cannot** export while a high-confidence detection is unresolved, because -that rule has been load-bearing since Phase 3. +--- -### What made this "senior," in one breath +## Reading generated artifacts adversarially: a smell guide -The model wrote most of the code. **Deniz owned the leverage points**: framing -intent (t0), promoting invariants to gates (constitution), spending clarify -questions early, debating architecture in `plan.md`, gating on a clean `analyze`, -reviewing *per phase* against **evals** not vibes, and routing every change -through the spec via `converge`. SDD didn't replace his judgment — it -*concentrated* it where it mattered. +The single most valuable skill SDD demands is reading the model's output like a +hostile reviewer. Concrete tells, by artifact: ---- +**In a spec:** -## Part V — Operating SDD Like a Senior +- It optimizes a *metric* (accuracy, speed) when your real problem is a + *property* (trust, safety, defensibility). → It solved the legible problem, not + yours. +- Edge cases are all the *easy* ones (empty input, too long). The expensive edge + cases — conflict, disagreement, partial failure, liability — are absent. → + The model lists what's conventional, not what's dangerous *here*. +- It reads complete and you can't find a single `[NEEDS CLARIFICATION]`. → It + guessed confidently instead of flagging the unknowns. Suspicious, not + reassuring. -### 5.1 Habits that separate load-bearing SDD from theater +**In a plan:** -- **Always `clarify` before `plan`** on anything that matters. It's the highest - ROI command in the kit. -- **Read `research.md` and `contracts/`.** If you're not reviewing the design - artifacts, you're vibe coding with extra files. -- **Gate on a clean `analyze`.** Make "no orphan requirements, no orphan tasks" - your definition of implementation-ready. -- **Scope `implement` by phase, review between phases.** Don't discover problems - at the end. -- **Route changes through the spec → `converge` → `implement`.** Hand-patches rot - the spec; once it's stale, it stops being load-bearing and you're back to - code-as-truth. +- An extra service, queue, cache, or abstraction "for scale/flexibility" with no + number justifying it. → Plausible over-engineering; make it justify itself or cut + it. +- A new non-deterministic component added "to be safe." → Price the latency, cost, + and *new eval burden* it creates. Usually not worth it. +- Heavy machinery where a checksum, a regex, or a constraint would be exact and + auditable. → The best move is often *less* sophistication, not more. -### 5.2 Anti-patterns to refuse +**In tasks:** -- **Stack in the spec.** Frameworks in `spec.md` poison the intent layer and - block re-implementation. -- **Skipping clarify to "save time."** You don't save it; you move it to - debugging, at 10× the cost. -- **One-shot `implement` on a big feature.** Context exhaustion will corrupt the - back half of the run. -- **A constitution of platitudes.** If a principle can't fail a plan, delete it. -- **Treating generated artifacts as write-once.** A spec you never regenerate - from is documentation, not SDD. +- A requirement you fought for during clarify has no corresponding task. → + `analyze` exists to catch exactly this; gate on it. +- A task you can't imagine writing a test for. → It's underspecified upstream; the + vagueness is in the spec, not the task. -### 5.3 Fitting SDD to a team +The meta-tell across all three: **fluency where there should be friction.** If the +hard, contested part of your problem came out smooth and confident, the model +papered over it. Go look there first. -- Use **`taskstoissues`** to turn a task DAG into a shared backlog; each issue - carries spec context, so humans and agents can both pick up work. -- Use **bundles** (developer / PM / BA / security) so each role gets the - pipeline emphasis they need. -- Use **extensions** (`git`, `bug`, `agent-context`) to bind SDD phases to your - real branch/commit/triage flow. -- Keep `spec.md`/`plan.md`/`tasks.md` **in version control** next to the code and - review them in PRs — the spec diff *is* the design review. +--- -### 5.4 When **not** to reach for the full pipeline +## Where SDD fights you, and what a senior does about it + +Pretending the method is all upside is its own kind of decoration. The honest +failure modes: + +- **It can make wrong decisions feel settled.** A beautifully formatted spec + carries false authority. Counter: treat every artifact as a draft under + suspicion until *you've* attacked it, regardless of how finished it looks. +- **Clarify and analyze can devolve into pedantry** — asking about themes, + flagging trivia. Counter: take the two questions that matter, ignore the noise, + and supply the one they missed. The commands serve your judgment, not the + reverse. +- **The structure can seduce you into completing the workflow instead of solving + the problem.** Running all ten commands feels productive. Counter: if a step + isn't surfacing disagreement or reducing risk, skip it. Ceremony is the enemy. +- **Long `implement` runs rot.** The model loses the plan as context fills. + Counter: scope every run to a phase or a task range; review between; let the + `[X]` markers resume you. Never one-shot anything large. +- **The spec can drift from the code** the moment you hand-patch. Counter: route + changes through the spec → `converge` → `implement`. The instant the spec stops + matching reality, it stops being load-bearing and you've silently reverted to + code-as-truth. -SDD ceremony should match stakes: +### Calibration: how much process is the right amount -| Situation | Reach for | +| Situation | What a senior actually does | | --- | --- | -| Load-bearing feature, multiple stories, real risk | Full pipeline + clarify + checklist + analyze | -| Small, well-understood change | specify → plan → tasks → implement (skip the wrappers) | -| Throwaway spike / proof of concept | Straight to `/speckit.implement` with a tight prompt | -| Bug fix | The `bug` extension (assess → failing test → fix) | -| Requirements changed on a live feature | `clarify`/edit spec → `converge` → `implement` | - -The senior skill is calibration: enough structure to stay honest, not so much -that it's theater. - -### 5.5 The one-sentence summary +| Throwaway spike, learning something | Skip the apparatus. One tight `/speckit.implement` prompt or just write it. | +| Small change to a known system | specify → plan → tasks → implement. Skip the wrappers; the risk is low. | +| Load-bearing feature, real consequences | Full adversarial pass: clarify, checklist on the risky domain, gate on clean analyze, scope implement, review per phase against evals. | +| You're not sure what you're even building | *Stop.* Spend the t0 thinking. The tool can't do this part and will happily build the wrong thing. | +| Requirements changed on a live feature | Edit the spec → converge → implement. Never a stray commit. | -**Make intent the durable artifact, promote your invariants to gates, spend your -judgment at the spec/plan/analyze checkpoints, and let the agent regenerate the -code — so that when the world changes, you change the spec and the system -follows.** +The skill isn't running more steps. It's spending exactly enough structure to keep +yourself honest, and not one step of theater past that. --- -## Appendix — Cheat Sheets - -### A. Command map - -| Command | Phase | Emits | Reach for it when | -| --- | --- | --- | --- | -| `/speckit.constitution` | Govern | `constitution.md` | Project birth; principle change | -| `/speckit.specify` | Intent | `spec.md` (+ branch) | Every new feature | -| `/speckit.clarify` | Intent | updated `spec.md` | Before planning anything load-bearing | -| `/speckit.plan` | Design | `plan.md`, `research.md`, `data-model.md`, `contracts/`, `quickstart.md` | Spec is clarified | -| `/speckit.checklist` | Design | requirements-quality checklist | Fuzzy/high-risk domains (UX, security) | -| `/speckit.tasks` | Decompose | `tasks.md` | Plan is settled | -| `/speckit.analyze` | Gate | read-only consistency report | Before implement; on suspected drift | -| `/speckit.taskstoissues` | Decompose | GitHub issues | Team/shared backlog | -| `/speckit.implement` | Build | code + `[X]` tasks | After a clean analyze | -| `/speckit.converge` | Evolve | appended delta tasks | Spec changed; brownfield reconcile | - -### B. Artifact graph - -```text -constitution.md ──governs──▶ spec.md ──derives──▶ plan.md ─┬─▶ research.md - ▲ ├─▶ data-model.md - clarify│ checklist ├─▶ contracts/ - │ └─▶ quickstart.md - │ │ - └────────── analyze (audits all) ──┤ - ▼ - tasks.md ──▶ code - ▲ │ - └ converge ┘ -``` - -### C. Glossary - -- **Constitution** — binding project principles, enforced as plan-time gates. -- **Gate (Phase −1)** — a pre-implementation check (simplicity, anti-abstraction, - integration-first) a plan must pass or explicitly waive. -- **`[NEEDS CLARIFICATION]`** — an explicit uncertainty marker; a question, not a - guess. -- **Persistence model** — your chosen strategy for keeping specs alive - (greenfield / brownfield / evolution). -- **`[P]` task** — a task with no blocking dependency; parallelizable. -- **Converge** — reconcile code-on-disk with intent by appending the missing - work as tasks. -- **Bundle / Preset / Extension** — role setups / alternative pipelines / opt-in - command packs that tailor Spec Kit to your SDLC. +## If you remember five things + +1. **The model is a fast, confident, predictably-wrong collaborator.** Your stance + is adversarial: every artifact is a suspect until you've attacked it. +2. **SDD's value is cheap, early disagreement.** Every loop drags a future + expensive mistake into a present cheap one. A step that surfaces no + disagreement is theater. +3. **The real problem usually isn't the one you typed.** The t0 reframing — done + by you, before any command — prevents the most expensive class of mistake. +4. **Instrument for the failures you can predict.** Don't out-vigilance the tool; + build the test that fails on the mistake you know it'll make, and let the + constitution name the invariant. +5. **Going backwards is the method working.** You will discover the spec was wrong + late. The win condition isn't avoiding that — it's that fixing intent and + regenerating costs a paragraph instead of a rewrite. --- -*Further reading in this repo: [`spec-driven.md`](../../spec-driven.md) (the -full SDD essay and the constitutional articles), +## Appendix — the commands, compressed + +You'll learn these by using them. They are instruments for the thinking above, not +the point of it. + +| Command | Its job | The senior question it serves | +| --- | --- | --- | +| `constitution` | Binding principles, enforced as plan-time gates | *Which tempting wrong decisions do I want dead on arrival?* | +| `specify` | Natural language → structured spec (+ feature branch) | *What problem am I actually solving — and is this spec aimed at it?* | +| `clarify` | Up to 5 questions; answers encoded into the spec | *What's uniquely dangerous to leave unsaid here — including what it didn't ask?* | +| `plan` | Spec + stack → technical design, research, contracts | *What does each choice cost, and where is this over-built?* | +| `checklist` | "Unit tests for the requirements" — quality, not behavior | *Are the requirements in my risky domain actually airtight?* | +| `tasks` | Plan → ordered, `[P]`-parallelizable task list | *Is every hard-won requirement represented as buildable, testable work?* | +| `analyze` | Read-only cross-artifact consistency audit | *What did we say we'd build that has no task — or build with no reason?* | +| `taskstoissues` | Tasks → dependency-ordered GitHub issues | *How do humans and agents share this work with full context?* | +| `implement` | Executes tasks, marking `[X]`; scope it by phase | *How do I keep context healthy and review before it drifts?* | +| `converge` | Reconcile code-on-disk with intent; append the delta | *Reality changed — how do I make the spec true again instead of patching code?* | + +Customization that lets the method match *your* SDLC rather than the default: +**extensions** (`git`, `bug` for spec-driven debugging, `agent-context`), +**presets** (`lean` trims the pipeline), and **bundles** (developer / PM / BA / +security role setups). + +*Deeper reading in this repo: [`spec-driven.md`](../../spec-driven.md), [`docs/concepts/sdd.md`](../concepts/sdd.md), -[`docs/concepts/complex-features.md`](../concepts/complex-features.md), and +[`docs/concepts/complex-features.md`](../concepts/complex-features.md), [`docs/guides/evolving-specs.md`](./evolving-specs.md).*