diff --git a/docs/guides/sdd-crash-course.md b/docs/guides/sdd-crash-course.md new file mode 100644 index 0000000000..222b833e3c --- /dev/null +++ b/docs/guides/sdd-crash-course.md @@ -0,0 +1,505 @@ +# Thinking in SDD + +*A field guide to Spec-Driven Development for engineers who already know the +work is in the judgment, not the typing.* + +> This is not a tour of Spec Kit's commands. The command list is at the bottom in +> one table; you'll learn it in an afternoon. What takes years is knowing **what +> to distrust, when to go backwards, and which mistakes to make cheaply** — and +> that's what SDD is actually for. This guide teaches the thinking. The demo in +> the middle goes wrong on purpose, because that's where the lessons live. +> +> Prompts are shown in full (they're what you type). Generated artifacts are +> shown as captions — you should learn their *shape*, not memorize someone else's +> spec. + +--- + +## The one idea everything else hangs on + +You are working with a collaborator that is **fast, fluent, tireless, and +confidently wrong in predictable ways**. It optimizes for *plausibility* — output +that looks like what a good answer looks like — not for *correctness*. It will +hand you a spec that reads beautifully and solves the wrong problem. It will +propose an architecture that's clean, conventional, and two layers heavier than +your problem needs. It will agree with whatever you said last. And it never tells +you what it doesn't know unless you force it to. + +Given that collaborator, the value of SDD is not "specs generate code." It's +this: **SDD front-loads disagreement to the cheapest possible layer.** A wrong +assumption caught in `spec.md` costs a sentence. The same assumption caught in +code costs a day. Caught in production, it costs a customer. Every loop in the +process — clarify, checklist, analyze, converge — exists to drag a future +expensive mistake forward into a present cheap one. + +So the senior's stance toward the whole apparatus is **adversarial, not +compliant.** You are not filling out a workflow. You are running a series of cheap +experiments designed to find out where you and the model are wrong *before* it's +expensive to be wrong. If a step isn't surfacing disagreement, you're doing it as +theater. + +Three corollaries fall out of this, and the rest of the guide is just these three +applied: + +1. **Every artifact the model produces is a suspect, not a deliverable.** Your + job when you read a generated spec or plan is to attack it: *what did it assume + that I didn't say? what failure mode does this gloss? where is it being + plausible instead of right?* + +2. **The process is non-linear because discovery is non-linear.** You will learn + things during planning that invalidate the spec, and things during + implementation that invalidate the plan. A senior expects this and *instruments + for it*. Going backwards isn't failure; refusing to is. + +3. **The hardest part isn't writing the spec — it's knowing what you actually + want, and you usually don't, at the start.** The real problem is often not the + one you typed. The process's job is to help you find the real one before you've + built the wrong one. + +--- + +## What the model is good at, and what is yours alone + +Keep this division sharp, because most wasted effort comes from blurring it. + +| The model is genuinely good at | Which means you should | And it is bad at | Which means you must never | +| --- | --- | --- | --- | +| Producing a complete, well-structured first draft fast | Let it draft everything; never start from a blank page | Knowing what's worth building | Outsource the "what" or the "why" | +| Surfacing options and conventions | Use it to widen your view of the solution space | Saying "I don't know" or "this is risky" | Trust silence as a sign of safety | +| Filling in plausible detail | Let it handle mechanism once intent is fixed | Distinguishing plausible from correct | Accept an artifact because it reads well | +| Tireless, consistent mechanical work | Hand it the decomposition and the typing | Holding a large design in coherent context | Let it one-shot a big feature | + +The work that is *yours, irreducibly*: deciding what problem is real, deciding +what "correct" means here, deciding what must never happen, deciding what to cut, +and noticing when the beautiful artifact in front of you is solving a problem you +don't have. SDD doesn't reduce that work. It *clears the noise away from it* so +you can spend your scarce judgment there instead of on boilerplate. + +--- + +## The build that went sideways (and what each turn taught) + +> Deniz is shipping **Redaktör**: a tool that masks personal data (PII) in +> Turkish documents so a compliance team can share them safely. It's a good +> teaching case because "correct" here is adversarial, non-obvious, and carries +> real liability — exactly the conditions under which plausible-but-wrong is +> dangerous. Watch the *reasoning*, not the commands. Several turns are mistakes, +> caught at different costs. + +### t0 — The first framing is wrong, and finding that out is the whole job + +Deniz's first instinct is "build a PII masker." He's about to type that into +`/speckit.specify`. He stops, because he's been burned by exactly this: a clean +spec for the wrong product. + +He asks the question that the tool will never ask for him: **what is the real +failure here?** Not "the model misses a phone number" — the real failure is *a +compliance officer signs off on a document, and PII leaks anyway, and now it's +their name on the breach.* The product isn't an NLP problem. It's a **trust and +liability** problem. The reviewer has to trust an AI's redaction enough to put +their signature on it. That reframing changes everything downstream: it means +auditability, explainability, and a hard human gate matter *more* than detection +F1. + +⟶ **Lesson.** The most expensive mistakes are made before you touch the tool, in +the gap between the problem you typed and the problem you have. No command catches +this. It's the one part that is entirely you, and it's the part most people skip. + +He does *not* start with the constitution as a vague "code quality, testing, UX" +prompt (the kind of boilerplate that fails no plan). He writes principles that +encode the *liability* framing as hard gates: + +```text +# A constitution earns its place only if you can imagine a plan FAILING it. +# These are written to fail plans, not to decorate the repo. +/speckit.constitution Establish gates that a plan must pass or carry a written +exception: (1) Every masked document is reproducible from its audit record alone — +if we can't reconstruct what was redacted and why, the plan fails. (2) The human +reviewer is the last authority; no irreversible export without an explicit +approval action — an "auto-approve high confidence" feature fails this gate by +construction. (3) Raw document text never enters logs, traces, or error messages. +(4) Every model-based component carries an eval set and a regression threshold; +"we tried it and it looked good" is not acceptance. Bias toward fewer moving parts: +a plan that adds a service must justify why a function wouldn't do. +``` + +⟶ **Lesson.** A constitution is not values; it's a set of trip-wires. Principle +(2) is written so that a *specific tempting feature* — auto-approve — is dead on +arrival. That's what makes it load-bearing instead of inspirational. + +### Specify — the model writes a good spec for the wrong product + +He describes the *what*, deliberately omitting stack, and gets back a clean +`spec.md`. It reads well. That's the trap. + +```text +/speckit.specify A reviewer uploads a Turkish document; the system detects PII +(names, T.C. national IDs, IBANs, phones, addresses), shows each detection with a +confidence level, lets the reviewer accept/reject/edit inline, and on confirmation +returns a masked copy plus an audit record. Export is blocked while any +high-confidence detection is unresolved. +``` + +> *(Caption — the spec the model produced:)* tidy user stories, requirements +> organized around **detection accuracy and the masking flow**, sensible edge +> cases. Looks done. + +Deniz reads it adversarially and the smell hits immediately: **the spec is +optimizing detection, but his real problem was trust.** The model did exactly what +he asked and nothing he meant. There's no requirement about *why a reviewer would +believe a mask is correct*, nothing about what happens when the model and the +reviewer disagree, nothing about the liability trail beyond a vague "audit +record." It's a spec for a demo, not for someone's signature. + +He doesn't tweak it. He rewrites the intent, because a spec aimed at the wrong +target can't be patched into the right one: + +```text +# Re-aim the spec at the real problem: reviewer trust and defensible sign-off. +/speckit.specify Reframe: the user is a compliance officer who must personally +attest that a document is safe to share. The system's job is to make that +attestation FAST and DEFENSIBLE, not just to detect PII. Requirements should cover: +how a reviewer sees what is under a mask without exposing it to anyone else; how +disagreements between the model and the reviewer are recorded; what the audit +record must contain to defend the decision later; and what the reviewer is and +isn't allowed to do (e.g., no bulk auto-accept). Detection is necessary but not the +point — the point is a sign-off the reviewer can stand behind. +``` + +⟶ **Lesson.** "The model gave me a clean spec" is a warning, not a win. A clean +artifact aimed at the wrong problem is more dangerous than a messy one aimed at +the right problem, because the cleanliness lowers your guard. Read every generated +spec by asking *what problem is this actually solving, and is it mine?* + +### Clarify — the questions it asks matter less than the one it doesn't + +```text +/speckit.clarify +``` + +The model asks five questions. Deniz's read: + +```text +# Two are real, three are filler. The filler tells me where the model is pattern- +# matching instead of thinking — and the GAP tells me more than the questions. +Q: Confidence threshold for "high"? → real. ≥0.85 blocks export. +Q: Masking token format? → real. Category tokens: [TCKN], [IBAN]. +Q: Supported file types? → filler. Doesn't change the design. +Q: Max document size? → filler. Pick 20 pages, move on. +Q: Light or dark theme? → noise. +``` + +The signal isn't in the answers. It's that **the model did not ask the question +that actually decides the product**: *when the model flags PII the reviewer thinks +is fine — or misses PII the reviewer spots — what happens, and who owns that +decision in the audit trail?* That's the entire trust mechanism, and the tool was +blind to it because it's not a pattern that shows up in training; it's specific to +his liability framing. + +So he injects it himself, into the spec, by hand-driving the clarification: + +```text +/speckit.clarify Record this decision explicitly in the spec: every reviewer +override of a model detection (accept-despite-low-confidence, reject-despite-high, +manual-add) is a first-class audited event with reviewer identity and timestamp. +The model's opinion and the reviewer's decision are both stored; on conflict, the +reviewer's decision governs the export but the model's dissent is retained in the +record. This is the core trust mechanism — make it a requirement, not an +implementation detail. +``` + +⟶ **Lesson.** Clarify is not "answer the model's quiz." It's a prompt to ask +*yourself* what's underspecified. The model's questions map what's *conventional* +to underspecify; your job is to find what's *uniquely* dangerous to leave +unsaid here. Often the most important clarification is one the tool never raised. + +### Plan — reading the smell of a plausible over-design + +Now stack and constraints are fair game. + +```text +/speckit.plan FastAPI + Python 3.12 backend; React + TypeScript reviewer UI with +SSE streaming of detections; Postgres for audit records. Detection engine must be +an importable library, testable without the web layer. Structured logging with a +redaction filter. Eval harness over a labeled Turkish corpus. +``` + +The model returns a confident, conventional plan — and Deniz reads two smells: + +**Smell 1: it reached for an LLM-as-judge verification pass over every detection.** +Plausible (more checking = better, right?), and wrong for v1. He reasons about the +*cost*, which the model never does unprompted: the judge roughly doubles latency +and per-document cost, and — worse — adds a *second non-deterministic component he +now has to eval and keep from regressing.* For what? Marginal recall on categories +that are already strong. He cuts it, and crucially **writes down why**, so the +next engineer (or the next agent) doesn't helpfully "fix" the omission: + +```text +/speckit.plan Remove the LLM-judge pass for v1 and record the rationale in plan.md: +it doubles latency/cost and adds a second model to eval for marginal recall on +categories already covered. Revisit only if eval shows a recall gap that determinism +and a single extractor can't close. Keep the architecture as flat as the problem +allows — no message queue, no service split; this is one process until proven +otherwise. +``` + +**Smell 2 — and this is the real insight — the architecture is upside down.** +While reviewing `research.md`, Deniz notices that T.C. national IDs and IBANs have +*checksums*. The highest-liability categories don't need a model at all; a few +lines of deterministic validation are more accurate than any LLM and trivially +auditable. That inverts the design: **this is mostly a deterministic system, with +AI used only where determinism fails (names, free-form addresses).** The "AI +product" is 80% boring code, and that's a feature — boring code is auditable, and +auditability is the actual requirement. + +```text +/speckit.plan Restructure detection around determinism-first: regex + checksum +validators own T.C. IDs, IBANs, and phones (these are exact and auditable). The +LLM extractor handles only names and free-form addresses, where determinism fails. +Make the validator path the default and the model path the exception. Note in +plan.md: this is deliberate — the most legally sensitive categories must be +explainable without invoking a model. +``` + +⟶ **Lesson.** Two senior reflexes here. First: **the model never volunteers the +cost of its suggestions** — latency, money, a new failure surface, eval burden — +so you must price every "let's also add…" yourself and often decline. Second: the +best architectural insight in an AI product is frequently *use less AI* — push +work onto deterministic, auditable code wherever it actually works, and reserve +the model for the irreducibly fuzzy part. Read `research.md` and `plan.md` like a +design review, because that's the last place a bad structure is cheap to kill. + +He overrides one of the simplicity-gate's own suggestions (it wanted a separate +validation microservice "for scalability") with a one-line documented exception — +*premature; one process until load data says otherwise.* Gates inform; they don't +command. A senior overrides them **with a written reason**, which is different from +ignoring them. + +### Tasks & analyze — using the cheap audit as a real gate + +```text +/speckit.tasks Break the plan into tasks +/speckit.analyze Run a project analysis for consistency +``` + +`analyze` is read-only and catches one thing that matters: the override-audit +requirement Deniz injected during clarify has **no task**. It would have silently +not been built — and it's the core trust mechanism. He adds it and re-runs until +clean. + +⟶ **Lesson.** A requirement with no task is a feature that won't exist, and nobody +will notice until it's missing in production. `analyze` is a near-free adversarial +reader; treating "clean analyze" as a hard gate before implementing is one of the +highest-leverage habits in the whole method. The cost of running it is a minute; +the cost of the gap it finds is a postmortem. + +### Implement — it drifts, and the process (not your vigilance) catches it + +He never one-shots this. He scopes by phase, because he knows the model degrades +as context fills — it starts strong and hallucinates around the edges of a long +run: + +```text +# Foundation first; nothing else can be trusted until the engine + evals exist. +/speckit.implement execute the Setup and Foundational phases only, then stop and +report +``` + +He runs the eval harness himself before going further, because for a model +component **"the unit tests pass" and "it actually detects well" are different +claims**, and only the eval set speaks to the second. + +Then, mid-build around task 7, the model does something very on-brand: in a new +error handler, it logs the offending document snippet to help debugging. That +violates constitution principle (3) — raw text in logs. Deniz doesn't catch it by +reading every line; nobody reliably does that. He catches it because he'd written +a test that greps emitted logs for PII patterns. **The failure mode he predicted, +he instrumented for.** + +```text +# He'd added this as a foundational task precisely because he knew the model +# would eventually "helpfully" log raw text. The test is the guard, not his eyes. +/speckit.implement the log-redaction test is failing on the new error handler — +fix the handler to log a reference id, not document content, and confirm the test +passes +``` + +⟶ **Lesson.** You cannot out-vigilance a tireless collaborator by reading harder. +You beat its predictable failures by **building the trap before it walks into +them**: a test that fails on the exact mistake you know it tends to make. The +constitution names the invariant; a test makes the invariant self-enforcing. That +pairing is the difference between a principle and a hope. + +### The spec was *still* wrong — and here's where SDD pays for itself + +Foundation, detection, review UI, audit — all built, all green, demo to the +compliance team. It fails. Not on a bug. The reviewers won't sign off, because +**they can't see what's under a mask without un-masking the whole document**, and +un-masking in front of others is exactly the exposure they're trying to avoid. The +product is technically correct and practically useless. The trust mechanism Deniz +was so proud of is unusable in the room where trust actually happens. + +This is the expensive discovery — the kind that, in a code-first project, means a +painful rewrite. Here it's a *spec* problem (he never specified *how* a reviewer +inspects a mask safely), so he fixes it where it's cheap: he edits intent, then +lets convergence find the delta between the new intent and the built reality. + +```text +# Fix the SPEC, then reconcile reality to it. Not a hand-patch in the UI — +# the change has to live in intent so it survives the next person. +/speckit.clarify Add the missing requirement: a reviewer must be able to inspect +the original value behind a single mask, in place, visible only to them, without +exposing the rest of the document or persisting the reveal. Specify the audit +consequence: each reveal is itself an audited event. + +/speckit.converge Assess the codebase against the updated spec and append the +missing work as tasks. + +/speckit.implement execute the newly appended tasks +``` + +⟶ **Lesson.** This is the entire argument for SDD in one event. You *will* discover +late that the spec was wrong — that's not a process failure, it's the nature of +building things for real humans, whose needs you can't fully know in advance. The +question SDD answers is: **when that discovery lands, does it cost a paragraph or a +rewrite?** Because the intent lived in a spec the system regenerates from, a +demo-day catastrophe became an afternoon. That's the payoff — not the clean happy +path, but the cheap recovery from the inevitable wrong turn. + +### Ship — the gate that lets him sleep + +CI runs tests *and* the eval suite; a recall regression below threshold fails the +build, so the model can't silently get worse over time. The log-PII test and the +export-block test are non-negotiable gates. The human approval step is the actual +product. None of this is ceremony — each one is a specific way the system could +have hurt someone, turned into a wall. + +⟶ **Lesson.** "Done" for an AI product isn't "it works." It's "I know the specific +ways this can fail, and each one is now something that fails the build instead of +failing a customer." That confidence is what you're actually shipping. + +--- + +## Reading generated artifacts adversarially: a smell guide + +The single most valuable skill SDD demands is reading the model's output like a +hostile reviewer. Concrete tells, by artifact: + +**In a spec:** + +- It optimizes a *metric* (accuracy, speed) when your real problem is a + *property* (trust, safety, defensibility). → It solved the legible problem, not + yours. +- Edge cases are all the *easy* ones (empty input, too long). The expensive edge + cases — conflict, disagreement, partial failure, liability — are absent. → + The model lists what's conventional, not what's dangerous *here*. +- It reads complete and you can't find a single `[NEEDS CLARIFICATION]`. → It + guessed confidently instead of flagging the unknowns. Suspicious, not + reassuring. + +**In a plan:** + +- An extra service, queue, cache, or abstraction "for scale/flexibility" with no + number justifying it. → Plausible over-engineering; make it justify itself or cut + it. +- A new non-deterministic component added "to be safe." → Price the latency, cost, + and *new eval burden* it creates. Usually not worth it. +- Heavy machinery where a checksum, a regex, or a constraint would be exact and + auditable. → The best move is often *less* sophistication, not more. + +**In tasks:** + +- A requirement you fought for during clarify has no corresponding task. → + `analyze` exists to catch exactly this; gate on it. +- A task you can't imagine writing a test for. → It's underspecified upstream; the + vagueness is in the spec, not the task. + +The meta-tell across all three: **fluency where there should be friction.** If the +hard, contested part of your problem came out smooth and confident, the model +papered over it. Go look there first. + +--- + +## Where SDD fights you, and what a senior does about it + +Pretending the method is all upside is its own kind of decoration. The honest +failure modes: + +- **It can make wrong decisions feel settled.** A beautifully formatted spec + carries false authority. Counter: treat every artifact as a draft under + suspicion until *you've* attacked it, regardless of how finished it looks. +- **Clarify and analyze can devolve into pedantry** — asking about themes, + flagging trivia. Counter: take the two questions that matter, ignore the noise, + and supply the one they missed. The commands serve your judgment, not the + reverse. +- **The structure can seduce you into completing the workflow instead of solving + the problem.** Running all ten commands feels productive. Counter: if a step + isn't surfacing disagreement or reducing risk, skip it. Ceremony is the enemy. +- **Long `implement` runs rot.** The model loses the plan as context fills. + Counter: scope every run to a phase or a task range; review between; let the + `[X]` markers resume you. Never one-shot anything large. +- **The spec can drift from the code** the moment you hand-patch. Counter: route + changes through the spec → `converge` → `implement`. The instant the spec stops + matching reality, it stops being load-bearing and you've silently reverted to + code-as-truth. + +### Calibration: how much process is the right amount + +| Situation | What a senior actually does | +| --- | --- | +| Throwaway spike, learning something | Skip the apparatus. One tight `/speckit.implement` prompt or just write it. | +| Small change to a known system | specify → plan → tasks → implement. Skip the wrappers; the risk is low. | +| Load-bearing feature, real consequences | Full adversarial pass: clarify, checklist on the risky domain, gate on clean analyze, scope implement, review per phase against evals. | +| You're not sure what you're even building | *Stop.* Spend the t0 thinking. The tool can't do this part and will happily build the wrong thing. | +| Requirements changed on a live feature | Edit the spec → converge → implement. Never a stray commit. | + +The skill isn't running more steps. It's spending exactly enough structure to keep +yourself honest, and not one step of theater past that. + +--- + +## If you remember five things + +1. **The model is a fast, confident, predictably-wrong collaborator.** Your stance + is adversarial: every artifact is a suspect until you've attacked it. +2. **SDD's value is cheap, early disagreement.** Every loop drags a future + expensive mistake into a present cheap one. A step that surfaces no + disagreement is theater. +3. **The real problem usually isn't the one you typed.** The t0 reframing — done + by you, before any command — prevents the most expensive class of mistake. +4. **Instrument for the failures you can predict.** Don't out-vigilance the tool; + build the test that fails on the mistake you know it'll make, and let the + constitution name the invariant. +5. **Going backwards is the method working.** You will discover the spec was wrong + late. The win condition isn't avoiding that — it's that fixing intent and + regenerating costs a paragraph instead of a rewrite. + +--- + +## Appendix — the commands, compressed + +You'll learn these by using them. They are instruments for the thinking above, not +the point of it. + +| Command | Its job | The senior question it serves | +| --- | --- | --- | +| `constitution` | Binding principles, enforced as plan-time gates | *Which tempting wrong decisions do I want dead on arrival?* | +| `specify` | Natural language → structured spec (+ feature branch) | *What problem am I actually solving — and is this spec aimed at it?* | +| `clarify` | Up to 5 questions; answers encoded into the spec | *What's uniquely dangerous to leave unsaid here — including what it didn't ask?* | +| `plan` | Spec + stack → technical design, research, contracts | *What does each choice cost, and where is this over-built?* | +| `checklist` | "Unit tests for the requirements" — quality, not behavior | *Are the requirements in my risky domain actually airtight?* | +| `tasks` | Plan → ordered, `[P]`-parallelizable task list | *Is every hard-won requirement represented as buildable, testable work?* | +| `analyze` | Read-only cross-artifact consistency audit | *What did we say we'd build that has no task — or build with no reason?* | +| `taskstoissues` | Tasks → dependency-ordered GitHub issues | *How do humans and agents share this work with full context?* | +| `implement` | Executes tasks, marking `[X]`; scope it by phase | *How do I keep context healthy and review before it drifts?* | +| `converge` | Reconcile code-on-disk with intent; append the delta | *Reality changed — how do I make the spec true again instead of patching code?* | + +Customization that lets the method match *your* SDLC rather than the default: +**extensions** (`git`, `bug` for spec-driven debugging, `agent-context`), +**presets** (`lean` trims the pipeline), and **bundles** (developer / PM / BA / +security role setups). + +*Deeper reading in this repo: [`spec-driven.md`](../../spec-driven.md), +[`docs/concepts/sdd.md`](../concepts/sdd.md), +[`docs/concepts/complex-features.md`](../concepts/complex-features.md), +[`docs/guides/evolving-specs.md`](./evolving-specs.md).*