diff --git a/.claude/REVIEW.md b/.claude/REVIEW.md index 19edf00a52..3e1bbecbee 100644 --- a/.claude/REVIEW.md +++ b/.claude/REVIEW.md @@ -15,12 +15,44 @@ Reserve 🔴 for things that would page someone or block a rollback. In this cod - **Queue / concurrency correctness.** RunQueue, MarQS (V1, legacy), redis-worker — any change to enqueue / dequeue / locking semantics. Re-derive the invariant on paper before flagging or accepting. - **Missing index on a hot table.** New Prisma queries against `TaskRun`, `TaskRunExecutionSnapshot`, `JobRun`, `Project`, etc. must use an existing index. Check `internal-packages/database/prisma/schema.prisma` for the relevant `@@index` lines — don't guess and don't propose `EXPLAIN`. - **Recovery-path queries.** Any `TaskRun.findFirst` / `findMany` added to a schedule, run-recovery, or restart loop. Recovery fan-outs (Redis crash, restart storms) turn "rare indexed query" into a DB incident. 🔴 even if indexed. -- **Aggregations on hot tables.** No `COUNT` / `GROUP BY` on `TaskRun` or other multi-million-row tables. Use Redis or ClickHouse for counts. +- **Aggregations on hot tables.** No `COUNT` / `GROUP BY` on `TaskRun` or other tables that can reach billions of rows. Use Redis or ClickHouse for counts. - **Prod Redis blast-radius.** New code paths that `SCAN` with broad patterns (`*foo*`) on prod-shaped Redis, or `EVAL` Lua with `SCAN` loops inside. Both are 🔴. - **`@trigger.dev/core` direct import** from anywhere outside the SDK package. Always import from `@trigger.dev/sdk`. Core direct imports are 🔴 — they break the public API contract. - **Heavy execute-deps imported into request-handler bundles.** Specifically `chat.handover` and similar split-bundle entry points must not transitively import the agent task's execute path. Watch for new imports added at module top-level of route files. - **V1 engine code modified in a "V2 only" PR.** The `apps/webapp/app/v3/` directory contains both. If the PR description says V2-only but it touches `triggerTaskV1`, `cancelTaskRunV1`, `MarQS`, etc. — 🔴. +## Performance (always review) + +Every PR gets a performance pass — not just the ones that look perf-sensitive. For each new query or unit of work, weigh three things: (a) the size of the table it hits, (b) whether it sits on a hot path, (c) whether the data it walks can be deep or wide (run trees, batches). The 🔴 bullets above on indexes, recovery-path queries, aggregations, and Redis `SCAN` are part of this pass — the rest below extends it. + +**Treat these tables as large — no scans, no `COUNT` / `GROUP BY`, no unbounded fetch:** + +- **Postgres — the `TaskRun` family:** `TaskRun`, `TaskRunExecutionSnapshot`, `Waitpoint`, `BatchTaskRun` and their join tables. Assume billions of rows. +- **ClickHouse — `task_events_v1` / `task_events_v2`.** Partitioned by `toDate(inserted_at)`; `ORDER BY (environment_id, toUnixTimestamp(start_time), trace_id)`. Note `span_id` / `parent_span_id` are NOT in the sort key — span-id lookups can't skip granules, only `environment_id` + a `start_time` window can. + +**Hot paths — extra scrutiny on any added query or work:** + +- **Trigger + batch trigger** (`triggerTask.server.ts`, `batchTriggerV3.server.ts`) — see `apps/webapp/CLAUDE.md`; do not add DB queries to these. +- **Dequeue / RunQueue** (`dequeueSystem.ts`, run-queue read/lock paths) — runs on every execution. +- **Execution-snapshot creation in the run engine** — any engine function that writes a `TaskRunExecutionSnapshot` runs per state transition; a new query there multiplies by run volume. +- **OTEL ingestion** (`otel.v1.traces.ts`, `otel.v1.logs.ts`) — write volume scales with customer span counts. +- **Trace + run-list reads** (trace view, run list, span detail) — read paths over the large tables above. + +**Deep / wide shapes — one run can explode into a huge tree or batch; code that walks them is the trap:** + +- Trace span subtrees (deeply nested child runs → deep span trees). +- Batch + parent/child fan-out (one run triggers thousands of children). +- Waitpoint / run-dependency chains. +- Tag / attribute many-to-many joins against the run/event tables. + +**Anti-patterns (severity):** + +- **Per-level fan-out that re-scans a large table once per tree depth** → 🔴. A BFS issuing one query per level (e.g. `parent_span_id IN {thisLevel}`) re-reads the same granules D times for a depth-D tree. Prefer one windowed query + an in-memory tree build. +- **Dropping the partition-pruning predicate** — `inserted_at` for ClickHouse, the `createdAt` window for partitioned Postgres — to "widen" a lookup → 🔴. Without it the query scans every partition. Keep a bounded window even for ancestor / backfill lookups. +- **Unbounded `IN (...)` built from a result set** (a BFS frontier, a batch's child ids) → 🟡. It can reach the row cap (`MAXIMUM_TRACE_SUMMARY_VIEW_COUNT` defaults to 25k). Cap or chunk to ≤1–2k ids per query. +- **Sequential per-level round-trips** where one recursive or windowed query would do → 🟡. N levels = N round-trip latencies stacked. +- **Replacing a single bounded query with a multi-query walk for _every_ call** (not just a rare fallback) → 🔴 on a hot read path, 🟡 elsewhere. Keep the cheap single-query path; branch into the expensive walk only when the cheap one comes up short. + ## Always check - **Tests use testcontainers, not mocks.** Vitest with `redisTest` / `postgresTest` / `containerTest` from `@internal/testcontainers`. Any new `vi.mock(...)` on Redis, Postgres, BullMQ, or other infra is wrong here — 🔴 if added in production-path tests, 🟡 if isolated unit test.