feat(supervisor): proposer-profile optimization + discriminating eval (bottleneck taxonomy) #405
…ifying proposer prompt

The seam discarded the driver's brief (v1: 'worker IGNORES the brief'), so driver-steer was just expensive best-of-N identical workers, with no steering. Thread the brief into each attempt (appended to the surface task prompt) so a re-spawn can take a DIFFERENT, targeted angle. Rewrite the baseline driver prompt as a real PROPOSER: each worker a distinct hypothesis (direct fix → upstream cause → edge case → other module). This is the prerequisite for proposer-profile optimization to have any traction.
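A minimal sketch of the brief-threading idea described in this commit, assuming a simplified task shape. `SurfaceTask`, `withBrief`, and the "Driver brief for this attempt" label are hypothetical names for illustration and are not taken from surface-worker.ts.

```typescript
// Hypothetical task shape; the real types live in surface-worker.ts.
interface SurfaceTask {
  id: string;
  systemPrompt: string;
}

/** Append the driver's brief to the task prompt; an empty brief leaves the task unchanged. */
function withBrief(task: SurfaceTask, brief?: string): SurfaceTask {
  const trimmed = brief?.trim();
  if (!trimmed) return task;
  return {
    ...task,
    systemPrompt: `${task.systemPrompt}\n\nDriver brief for this attempt:\n${trimmed}`,
  };
}

const baseTask: SurfaceTask = { id: "t1", systemPrompt: "Fix the failing test." };

// Each re-spawn gets a different angle instead of an identical retry.
const attempts: SurfaceTask[] = [
  "Try the direct fix at the failing assertion.",
  "Look for an upstream cause in the caller.",
].map((brief) => withBrief(baseTask, brief));
```

Returning the original task when the brief is empty keeps the un-steered path identical to the previous behavior, so only attempts that actually carry a brief see a changed prompt.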
… substrate

- gepa-driver-prompt: GEPA now optimizes the REAL driver/proposer prompt: each candidate runs a full selfImprovingSupervisor rollout, executable-graded by the supervised resolve (no LLM judge). FIX (adversarial review P1, would crash the arm): report the supervised spend to ctx.cost.observe/observeTokens so the backend-integrity guard sees a real backend, not a zero-cost stub.
- hard-coding-env: a mid-difficulty contamination-proof generated task (stack expression evaluator) as the CHEAP optimization substrate, calibrated at reference→100%/stub→0% across 10 seeds. FIX (P2): per-field salt on the dialect RNG (base⟺rounding were 100% aliased → spurious shortcut).
- ablation: thread the supervisor router + matched worker budget into the optimize call (P3: train/serve regime must match).

Known caveats noted: error-credit floor (~33%, constant across candidates so the relative GEPA signal is unaffected) + syntax-error denominator (0 either way).

Adversarial review caught all of these pre-merge; verify phase: tsc+biome clean, eval calibrated.
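The P2 fix (per-field salt on the dialect RNG) can be illustrated with a small sketch. It shows one way such aliasing can arise, assuming each field draws from a fresh generator built on the same seed; the actual cause in hard-coding-env.ts may differ. `pickAliased`, `pickSalted`, and `agreementRate` are hypothetical names, and FNV-1a plus mulberry32 are stand-ins chosen here, not necessarily what the PR uses.

```typescript
// FNV-1a (32-bit): folds a string label into an unsigned 32-bit integer.
function fnv1a(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// mulberry32: a small deterministic PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Aliased: every field takes the first draw of the same seeded stream,
// so two fields with the same option count always pick the same index.
function pickAliased(seed: number, _field: string, n: number): number {
  return Math.floor(mulberry32(seed)() * n);
}

// Salted: the field name is mixed into the seed, giving each field its own stream.
function pickSalted(seed: number, field: string, n: number): number {
  return Math.floor(mulberry32(fnv1a(`${seed}:${field}`))() * n);
}

// Fraction of seeds on which the "base" and "rounding" fields pick the same index.
function agreementRate(
  pick: (seed: number, field: string, n: number) => number,
  seeds: number,
): number {
  let same = 0;
  for (let seed = 0; seed < seeds; seed++) {
    if (pick(seed, "base", 4) === pick(seed, "rounding", 4)) same++;
  }
  return same / seeds;
}

const aliasedRate = agreementRate(pickAliased, 200);
const saltedRate = agreementRate(pickSalted, 200);
console.log({ aliasedRate, saltedRate });
```

With aliased draws, a solver could infer one dialect field from the other, which is the kind of spurious shortcut the fix removes; salting makes the fields vary independently across seeds.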
✅ No Blockers
tangletools left a comment
✅ Approved — 2 non-blocking findings — 9c0fc48d
Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-28T17:30:27Z · immutable trace
tangletools left a comment
🟡 Value Audit — sound-with-nits
| Item | Result |
|---|---|
| Verdict | sound-with-nits |
| Concerns | 4 (2 low, 2 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 339.8s (2 bridge agents) |
| Total | 339.8s |
💰 Value — sound-with-nits
Threads the driver's brief into workers, rewires GEPA to optimize the real supervisor prompt (not a proxy), adds train/serve-match + cost reporting, and ships a mid-difficulty headroom task — each a real correctness fix; minor scaffold duplication vs the existing self-improving-coder example.
- What it does: Four targeted changes in examples/ablation-suite/: (1) surface-worker.ts:73-80 now accepts the driver's `brief` and appends it to the task's systemPrompt, so each spawn can take a different angle (previously every spawn was an identical refine retry; the file's own prior comment marked this as a v1 simplification). (2) ablation.ts:35-51 replaces the terse baseline driver prompt with a proposer-st
- Goals it achieves: Make the driver-steered supervisor actually capable of beating a single agent at higher compute. The PR body's bottleneck taxonomy frames it: on capability-bottlenecked bugs the supervisor can only tie; to beat a single agent it needs a SEARCH-bottlenecked regime where diversified attempts pay. The four changes deliver the machinery for that: brief-threading + a proposer baseline make attempts act
- Assessment: Coherent and in-grain. Each change fixes a real correctness gap rather than adding surface area: the v1 simplification is retired with the exact signature the Executor interface already supported (surface-worker.ts:73); the train/serve match prevents a classic skew bug; the cost.observe call is necessary (without it the integrity guard aborts on a stub-looking cell); and the proxy→real rewiring in
- Better / existing approach: Checked for an existing host-pytest AgenticSurface helper to extend rather than clone (rg 'pytestPassed|export const.*AgenticSurface|hostPytest|pytestSurface' across src/ and examples/). None exists in src/ — the only two implementations of this exact pattern are examples/self-improving-coder/self-improving-coder.ts:95 and the new examples/ablation-suite/hard-coding-env.ts:300. The new file is rig
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound-with-nits
Real proposer + GEPA-over-supervisor wiring that lands in the grain of the codebase; the new discriminating substrate ships calibrated but not yet swapped into the runnable ablation.
- Integration: The three wiring changes all reach real consumers. (1) surface-worker.ts:73-80, the driver's spawn brief reaches the worker: spawn_agent's `task` arg flows scope.spawn → runChild → executor.execute at src/runtime/supervise/scope.ts:585 and src/mcp/tools/coordination.ts:437, so reading it as `brief` and folding it into the surface task systemPrompt is correct (TS narrower-arity is fine). (2) ablat
- Fit with existing patterns: Excellent. hard-coding-env.ts mirrors examples/self-improving-coder/self-improving-coder.ts in shape exactly (same AgenticSurface open/tools/call/score/close, same seed-derived tasks supplier, same reference/stub calibration self-check at lines 561-586 vs coder lines 251-257); it extends the established pattern rather than inventing one. gepa-driver-prompt.ts now routes through selfImprovingSuper
- Real-world viability: Holds up. Each task gets its own mkdtempSync tempdir (hard-coding-env.ts:327) cleaned in close (line 407-412); this is the same crash-leak risk self-improving-coder already carries, not new. pytestPassed (line 300) has a 60s timeout and parses partial stdout on failure, which is realistic. The process-global `workspaces` Map cannot alias: every open mints a unique tempdir, so concurrent fanout workers grade inde
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/ablation-suite/hard-coding-env.ts
- console.log('═══ CALIBRATION ($0) — task solvable + grader discriminates? ═══')
🟡 Cruft: magic number added examples/ablation-suite/hard-coding-env.ts
- errStr: `E${(r(9000, 4) + 1000).toString(36).toUpperCase()}`,
💰 Value Audit
🟡 pytest + AgenticSurface scaffold duplicated between two example tasks [duplication]
hard-coding-env.ts:300-413 clones pytestPassed (17 lines, byte-identical), the workspaces Map pattern, and the open/tools/call/score/close AgenticSurface shape verbatim from examples/self-improving-coder/self-improving-coder.ts:95-208. The calibrate() shape is also near-identical. This is intentional (the file header says 'Mirrors ... exactly in shape') and acceptable for a second example, but if a third contamination-proof task is added the scaffold should be lifted into a small shared helper u
🎯 Usefulness Audit
🟡 Discriminating substrate ships calibrated but not wired into the runnable ablation [integration]
hard-coding-env.ts exports hardCodingEnv/hardCodingTasks but no consumer imports them (grep across repo: only self-references). ablation.ts:28,338-339 still wires codingEnv/codingTasks, the saturated task the PR body argues is the whole reason for the new file. So `pnpm tsx examples/ablation-suite/ablation.ts` runs the new proposer + GEPA machinery against the substrate that cannot show lift, while the substrate that CAN show lift only runs under CALIBRATE=1. The PR's value prop (driver-steer/o
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
What
The proposer-profile optimization path for the self-improving supervisor — so the driver can actually beat a single agent given more compute, not just match it.
- `surface-worker.ts`: thread the driver's brief into each worker (it was discarded: every spawn was an identical refine attempt). Steering now reaches the worker.
- `ablation.ts`: the baseline driver prompt is now a real proposer: each worker a distinct, targeted hypothesis (direct fix → upstream cause → edge case → other module), not "try again".
- `gepa-driver-prompt.ts`: GEPA optimizes the real driver/proposer prompt: each candidate runs a full `selfImprovingSupervisor` rollout, executable-graded by the supervised resolve (no LLM judge). Reports spend to the campaign cost meter so the backend-integrity guard sees a real backend.
- `hard-coding-env.ts`: a mid-difficulty contamination-proof generated task (stack expression evaluator + edge cases) as the cheap optimization substrate (the original generated task is saturated; SWE-bench is too expensive to search prompt-space on). Calibrated reference→100% / stub→0% across 10 seeds.

The finding this PR exists to record (the bottleneck taxonomy)
Cost-aware ablation on real SWE-bench bugs: the driver-steered supervisor ties a single agent at ~10× cost, and so does more single-agent budget — because those bugs are capability-bottlenecked (8 independent shots also fail). Coordination adds search, not capability — it can only pay where the worker can solve the task but a single attempt misses the angle/edge-case (a search-bottlenecked regime). The hard eval is that regime; this PR ships the machinery + substrate to test it cheaply and certify winners on real tasks.
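The taxonomy above can be sketched as a simple classification rule over per-task outcomes. This is an illustrative reading of the paragraph, not code from the PR: `TaskOutcomes`, `classify`, the label names, and the sample data are all hypothetical.

```typescript
type Bottleneck = "solved" | "search" | "capability";

// Hypothetical record of how a worker fared on one task.
interface TaskOutcomes {
  taskId: string;
  singleShotPassed: boolean; // one attempt at the baseline budget
  independentShots: boolean[]; // pass/fail for N further independent attempts
}

/**
 * solved:     a single attempt already passes, so coordination has nothing to add.
 * search:     the single attempt fails but some independent attempt passes,
 *             so diversified attempts can pay.
 * capability: every independent attempt fails too, so more attempts only add cost.
 */
function classify(o: TaskOutcomes): Bottleneck {
  if (o.singleShotPassed) return "solved";
  return o.independentShots.some(Boolean) ? "search" : "capability";
}

// Made-up outcomes purely to exercise the rule.
const sample: TaskOutcomes[] = [
  { taskId: "a", singleShotPassed: true, independentShots: [] },
  { taskId: "b", singleShotPassed: false, independentShots: [false, true, false] },
  { taskId: "c", singleShotPassed: false, independentShots: new Array(8).fill(false) },
];

const labels = sample.map((o) => `${o.taskId}:${classify(o)}`);
console.log(labels.join(" "));
```

Under this reading, a substrate is useful for optimizing the proposer prompt only if a meaningful share of its tasks land in the "search" bucket.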
Verification
src/ (the keystone `analyze` knob it builds on is already on main via feat(supervise): propagated analyze knob — analyst feeds the driver (additive) #404).