feat(supervise): propagated analyze knob — analyst feeds the driver (additive)#404
Conversation
…additive
supervise() / supervisorAgent / driverAgent gain optional analysts + analyzeOnSettle, threaded to
createCoordinationTools (router arm) AND serveCoordinationMcp (sandbox arm). When a worker settles done,
each configured analyst lens runs and re-enters as a finding the driver pulls (await_event) and composes
its next steer from — the self-improving UP-leg that was previously unreachable through supervise() (it
lived only in a hand-wired createCoordinationTools). Purely additive: every field optional, defaults off,
so status-quo supervise(profile, task, {backend, budget}) is byte-identical. Sub-driver propagation rides
a recursive makeWorkerAgent (the caller's seam).
✅ No Blockers —
|
| glm | deepseek | aggregate | |
|---|---|---|---|
| Readiness | 89 | 92 | 89 |
| Confidence | 65 | 65 | 65 |
| Correctness | 89 | 92 | 89 |
| Security | 89 | 92 | 89 |
| Testing | 89 | 92 | 89 |
| Architecture | 89 | 92 | 89 |
Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.
🟡 LOW No test exercises analysts/analyzeOnSettle propagation through supervise()/supervisorAgent() — src/runtime/supervise/supervise.ts
The new options are threaded through three wrappers (supervise.ts:151-152, supervisor-agent.ts:137-138 and 160-161, coordination-driver.ts:210-211), but the existing suite only covers the downstream consumer (coordination.test.ts:482 calls createCoordinationTools directly). A regression that drops one of these spread lines would not be caught. Impact: silent feature loss, not correctness. Fix: one supervisor-agent.test.ts case that builds supervisorAgent with
analysts+analyzeOnSettleset and asserts the registry reaches the spawned worker settle hook (e.g., via a spy makeWorkerAgent).
🟡 LOW analyzeOnSettle silently no-ops without analysts — src/runtime/supervise/supervise.ts
All three layers (supervise.ts:89, supervisor-agent.ts:100, coordination-driver.ts:67) document that analyzeOnSettle 'Requires analysts', but none of them validate that constraint at construction time. The actual guard lives only in createCoordinationTools at coordination.ts:255 (if … opts.analysts && opts.analyzeOnSettle?.length), so a misconfigured supervisor that sets analyzeOnSettle without analysts silently skips all analysis — the run still completes but wastes compute on a supervisor that appears to be self-improving but isn't. An early ValidationError at supervise() construction would fail loud instead. Not a crash or data risk; fit severity reflects the silent-compute-waste failure mode.
🟡 LOW 'Requires analysts' contract is unenforced at this layer — src/runtime/supervise/supervisor-agent.ts
Docstrings at supervise.ts:88, supervisor-agent.ts:101, and coordination-driver.ts:66 all state
analyzeOnSettlerequiresanalysts, but no ValidationError fires when onlyanalyzeOnSettleis set. The downstream guard at coordination.ts:255 (opts.analysts && opts.analyzeOnSettle?.length) silently no-ops, so the user gets status quo instead of a fail-loud. The enforcement belongs in createCoordinationTools (out of shot), but adding a construction-time guard alongside the existing maxTurns/extraTools guards would surface misconfiguration at the layer this shot edits. Low impact — silent no-op, no data loss.
tangletools · 2026-06-28T05:31:35Z · trace
tangletools
left a comment
There was a problem hiding this comment.
✅ Approved — 3 non-blocking findings — 388d41b4
Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-28T05:31:35Z · immutable trace
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 1 (1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 164.3s (2 bridge agents) |
| Total | 164.3s |
💰 Value — sound
Threads an already-implemented analyst-feed primitive (analysts + analyzeOnSettle) through the three higher supervise layers that weren't wiring it; pure forwarding, additive, in-grain, no existing equivalent duplicated.
- What it does: Adds two optional fields —
analysts(an AnalystRegistry of trace-analysis lenses) andanalyzeOnSettle(analyst kind ids) — toDriverAgentOptions,SupervisorAgentDeps, andSuperviseOptions, then conditionally spreads them into the next layer down. The router arm (driverAgent) forwards them to its inlinecreateCoordinationToolscall (coordination-driver.ts:210-211);supervisorAgentf - Goals it achieves: Make the self-improving UP-leg ('analyst feeds the driver') reachable from the canonical one-call API
supervise(profile, task, opts). Before this PR the mechanism was fully built at the primitive layer (createCoordinationTools+serveCoordinationMcp) but thesupervise → supervisorAgent → driverAgentpath wired it at zero levels — so the highest-level, documented entry point could not produ - Assessment: A clean, minimal, correct forwarding change. (1) It genuinely closes a real reachability gap — verified that driverAgent's inline
createCoordinationTools(coordination-driver.ts:204-212) pre-PR built the tools without these fields, so the router arm literally could not fire the analyst hook. (2) It is byte-identical when unset: every field is optional and the conditional-spread idiom (`...(opts. - Better / existing approach: none — this is the right approach. The architecture is a deliberate 3-layer onion where each layer owns a typed, doc-commented options interface and forwards individual knobs via conditional spread; every other optional knob in these same files follows this identical pattern, so the alternative (a shared options-bag passed wholesale) would fight the grain across ~8 existing knobs. The analyst mech
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error
🎯 Usefulness — sound
Purely additive propagation of an existing, lower-layer analyst-on-settle mechanism up to the canonical supervise/supervisorAgent/driverAgent entry points — matches the established maxLiveWorkers threading pattern exactly.
- Integration: Correctly threaded to BOTH arms: the router arm (supervisor-agent.ts:127-141 → driverAgent → coordination-driver.ts:204-212 → createCoordinationTools) and the sandbox arm (supervisor-agent.ts:151-162 → serveCoordinationMcp → coordination-mcp.ts:69-79 → createCoordinationTools). Verified pre-PR state had neither field at these upper layers (git show origin/main confirms), so this PR genuinely opens
- Fit with existing patterns: Mirrors the grain exactly. The
...(opts.x ? { x: opts.x } : {})propagation is identical to howmaxLiveWorkersis already threaded through the same three files. The separateeffort.ts:158 withAnalyst: booleanflag is a DIFFERENT concept (scope-level per-agent ScopeAnalyst passed to runPersonified) and does not compete — different layer, different name, different mechanism. No existing equiva - Real-world viability: Additive and default-off, so status-quo callers see byte-identical behavior. The
analyzeOnSettledoc says 'Requires analysts' but the lower layer enforces this defensively (coordination.ts:255 short-circuits viaopts.analysts && opts.analyzeOnSettle?.length), so a misconfiguration silently no-ops rather than crashing — robustness is preserved. run_analyst (coordination.ts:683) uses optional ch - Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
💰 Value Audit
🟡 No eager validation that analyzeOnSettle requires analysts (symmetry with existing fail-loud guards) [maintenance] ``
The doc comments on all three new fields state 'Requires analysts', but no layer validates it. coordination-driver.ts already throws eagerly for half-wired seams of the same shape — extraTools without executeExtraTool (lines 171-175) and negative maxTurns (189-193), both following the 'fail loud at construction, not buried in a swallowed act() throw' rule stated at lines 177-186. By contrast, analyzeOnSettle without analysts is a SILENT no-op (coordination.ts:255 guards `opts.analysts && opts.an
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
…nto feat/self-improving-supervisor
LOOP LAYERS (on the analyze keystone): - surface-worker.ts: graded-worker seam — a makeWorkerAgent running runAgentic over an AgenticSurface, settling with the surface score as the deliverable verdict (driver spawns/steers graded workers). - gepa-driver-prompt.ts: optimize the driver compose-prompt on TRAIN with an EXECUTABLE JudgeConfig (selfImprove + gepaProposer, reading the surface score — not an LLM judge), frozen, certified. - self-improving-supervisor.ts: one-call DX composing supervise(+analyze) over the graded seam. - ablation.ts: driverSteer + optimize knobs WIRED (blind/steered/self-improving) over the same supervise(); per-task try/catch resilience; cost+significance autopsy intact. - index.ts: re-export AnalystRegistry + MakeWorkerAgent from the loops barrel (host authors its seam). #402 HARDENING: - swe-bench-env: temp-dir leak fix (try/catch+rmSync on clone/checkout fail), symlink realpath-jail (read+edit), workspaces Map into the closure, exported jailPath/isTestPath + unit tests. - self-improving-coder: read_file path guard (symmetric with write_file), dead run_tests handler gone. - swe-self-improve.mts moved to bench/src/ (now under typecheck). Typecheck clean (examples + core); biome clean; 82 supervise tests green on the keystone.
tangletools
left a comment
There was a problem hiding this comment.
✅ Refreshed approval after new commits — 1cb0d45c
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T05:59:55Z
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 1 (1 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 267.5s (2 bridge agents) |
| Total | 267.5s |
💰 Value — sound
Threads two already-implemented options (analysts/analyzeOnSettle) from the public supervise() API down to createCoordinationTools (which already ran them), opening the analyst-feeds-driver loop; purely additive, in-grain plumbing with no reinvention.
- What it does: The true delta vs main is 11 files (+702/-52). The core runtime change (33 lines across coordination-driver.ts, supervise.ts, supervisor-agent.ts) adds optional
analysts+analyzeOnSettlefields toSuperviseOptions/SupervisorAgentDeps/DriverAgentOptionsand threads them — on both the router arm and the sandbox arm — down tocreateCoordinationTools, which ALREADY implemented the pub - Goals it achieves: Make the self-improving UP-leg reachable from the canonical
supervise → supervisorAgent → driverAgentpath. Before this, the analyst-on-settle mechanism existed ONLY inside a hand-wiredcreateCoordinationToolscall — the top-levelsupervise()API wired it at zero levels, so a user of the public API could never get 'the analyst feeds the driver.' After merge, `supervise(profile, task, { analy - Assessment: Sound and in the grain. The runtime threading is genuinely additive: every field is optional and defaults off (conditional spreads
...(opts.x ? {x} : {})), so a status-quosupervise()call is unchanged. It mirrors the EXACT established pattern already used formaxLiveWorkers/extraToolsin all three files — it does not invent a new threading mechanism. Critically, it does NOT reimplement th - Better / existing approach: none — this is the right approach. Searched for an existing path-containment utility to reuse for the bench jail (rg over src/ and packages/ for
realpathSync/startsWith(... + sep)/isInsideJail): none exists outside the bench dir, so the localized utility is not a duplication. Confirmed the examples do not reimplement loops —selfImprovingSupervisorcallssupervise()with the new knob (se - Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error
🎯 Usefulness — sound
Threads the analyze knob (analysts + analyzeOnSettle) through supervise → supervisorAgent → driverAgent (router arm) and supervisorAgent → serveCoordinationMcp (sandbox arm), closing a real reachability gap where the self-improving up-leg existed in createCoordinationTools but was zero-wired at the
- Integration: Fully wired and reachable. supervise.ts:151-152 forwards analysts/analyzeOnSettle into supervisorAgent deps; supervisor-agent.ts:137-138 threads them into driverAgent (router arm) and :160-161 into serveCoordinationMcp (sandbox arm), which already accepts them (coordination-mcp.ts:62-76) and forwards to createCoordinationTools. The underlying mechanism — drainSettlement auto-running each configure
- Fit with existing patterns: Exactly in-grain. Every other supervisor knob (extraTools, maxLiveWorkers, compaction, maxTurns) follows the identical optional-field-threaded-through-each-layer pattern with the same
...(opts.x ? { x } : {})spread idiom; the analysts/analyzeOnSettle pair is the same shape. No competing pattern exists — the AnalystRegistry type and the finding/settled/bus taxonomy are the established coordinati - Real-world viability: Additive and defaults off — a status-quo supervise(profile, task, {backend, budget}) call constructs no analysts registry and the drainSettlement hook at coordination.ts:255 short-circuits on
opts.analysts && opts.analyzeOnSettle?.length, so status-quo behavior is byte-identical. A throwing analyst lens is caught by runTool (coordination-driver.ts:351-358) and folded back as an error string the - Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/ablation-suite/ablation.ts
console.log(
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
…ob for resolved/score The supervise winner result carries the driver's finalize output in result.out (the best-delivered worker's SurfaceWorkerOut), NOT a verdict field — reading result.verdict?.score always returned 0. Read resolved/score off result.out. Live e2e verified: driver-steered supervisor (analyze on) over the contamination-proof codingEnv resolves the task — resolved=true score=1.00 $0.031, the loop closes end-to-end (driver spawns worker, analyst feeds driver, graded winner delivered).
tangletools
left a comment
There was a problem hiding this comment.
✅ Refreshed approval after new commits — b16ce792
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T06:08:59Z
The merge-base changed after approval.
tangletools
left a comment
There was a problem hiding this comment.
✅ Refreshed approval after new commits — 34a5b702
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T06:11:27Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 4 (1 low, 3 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 375.8s (2 bridge agents) |
| Total | 375.8s |
💰 Value — sound-with-nits
Threads an already-built analyst-feed substrate knob up through the one-call supervise() API (the real, in-grain keystone), plus three example/recipe files composing it into a driver-steered + GEPA-tuned supervisor and a bench security hardening — additive end-to-end, with two honest v1 methodology
- What it does: (1) KEYSTONE (src/runtime/supervise/{supervise,supervisor-agent,coordination-driver}.ts, +31 lines): the substrate already supported analysts + analyzeOnSettle (src/mcp/tools/coordination.ts:99-106,255-261, added prior; coordination-mcp.ts:62-76). This PR propagates those two optional fields up through driverAgent → supervisorAgent (both the router arm at supervisor-agent.ts:134-135 and the sandbo
- Goals it achieves: Make the self-improving up-leg (analyst reads a settled worker → finding → driver composes the next steer) reachable through the one-call supervise() API, which was the last closed door: the substrate existed only inside hand-wired createCoordinationTools. Secondary: ship a runnable, honest ablation harness that can measure whether driver-steering and GEPA prompt-tuning actually help, and harden t
- Assessment: The keystone is unambiguously good and the grain is exact: it matches how every other supervise() knob (maxLiveWorkers, compaction, extraTools) is threaded — optional field + conditional spread at each of three layers, both arms. The substrate already existed and was simply unreachable from the one-call door; this opens it with zero behavior change when unset. The examples compose real primitives
- Better / existing approach: none for the keystone — threading the existing substrate knob through the one-call API is the right approach, matching the exact pattern of every peer knob. For the examples, the v1 simplifications are acknowledged next-increments, not architecture flaws. searched: all MakeWorkerAgent factories (only workerFromBackend exists, different substrate); all selfImprove facades (only improve() [profile-b
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error
🎯 Usefulness — sound-with-nits
A coherent, additive keystone — analyzeOnSettle/analysts on supervise() propagates through both arms to a real, tested consumption site — plus first example loop layers that compose existing substrate seams; nothing dead or competing, one honest v1 proxy worth a human's awareness.
- Integration: Fully wired and reachable. The keystone options thread
supervise()(supervise.ts:84-89,151-152) →supervisorAgent(supervisor-agent.ts:97-101) → BOTH arms: the router armdriverAgent(coordination-driver.ts:61-66,210-211) and the sandbox armserveCoordinationMcp(coordination-mcp.ts:62-64,75-76), both reaching the real consumption site at coordination.ts:255-260 where a worker settle fires - Fit with existing patterns: Follows the established grain.
surfaceWorkerSeamis a clean sibling ofworkerFromBackend(supervise.ts:26) — both return a{makeWorkerAgent, deliverable?}pair thatsupervise()consumes.optimizeDriverPromptcalls the realselfImprove+gepaProposerfrom agent-eval with an EXECUTABLE judge reading the surface score (gepa-driver-prompt.ts:119-133) — no reinvention, no LLM-judge flattery - Real-world viability: Holds up on the happy path and compiles+tests green. Two honest v1 tradeoffs are explicitly documented in-code, not hidden: (1) v1 workers ignore the driver's per-worker brief — each spawn is a fresh
refineattempt on the same task (surface-worker.ts:16-18), so the driver's intelligence in v1 is allocation (spawn/stop), not instruction authoring; (2) the GEPA pass optimizes the candidate string - Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/ablation-suite/ablation.ts
console.log(
💰 Value Audit
🟡 GEPA optimizes the prompt in one role (refine's between-shot analyst) but deploys it in another (the supervisor's standing systemPrompt) [better-architecture] ``
In gepa-driver-prompt.ts:101-113 the candidate steerer is passed as analystInstruction to runAgentic-refine (refine's between-shot steerer, per strategy.ts:369), and the winner is returned as systemPrompt (gepa-driver-prompt.ts:157). But in self-improving-supervisor.ts:77 that winner is deployed as profile.systemPrompt — the supervisor BRAIN's standing instruction, a different role than the refine analyst instruction it was fitness-scored against. The docblock (gepa-driver-prompt.ts:17-20) hones
🟡 v1 worker ignores the driver's brief, so the driverSteer ablation arm cannot show steering's effect [proportion] ``
surface-worker.ts:68-71 (the 'v1 SIMPLIFICATION' comment): each spawn is a fresh refine on the SAME task; the driver's spawn brief / steer is discarded. The code is honest about this ('the driver's intelligence in v1 is allocation ... not per-worker instruction authoring'). But the downstream consequence is that the ablation's driver-steer / driver-gepa arms (ablation.ts deltas) burn supervisor-inference tokens (reading settles, composing steers, running analysts) whose effect on the worker is n
🎯 Usefulness Audit
🟡 GEPA optimizes a proxy lever; train and deploy regimes diverge [problem-fit] ``
optimizeDriverPromptmeasures each candidate's fitness as a between-shotanalystInstructioninside onerefinerollout (gepa-driver-prompt.ts:101-113), then deploys the winner as the driver'ssystemPromptinsupervise()(self-improving-supervisor.ts:77). Combined with the v1 simplification that spawned workers ignore the driver's brief (surface-worker.ts:16-18), the string GEPA tunes and the string the driver runs govern different control surfaces — training optimizes within-worker shot
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
❌ Needs Work —
|
| glm | deepseek | aggregate | |
|---|---|---|---|
| Readiness | 4 | 53 | 4 |
| Confidence | 80 | 80 | 80 |
| Correctness | 4 | 53 | 4 |
| Security | 4 | 53 | 4 |
| Testing | 4 | 53 | 4 |
| Architecture | 4 | 53 | 4 |
Full multi-shot audit completed 4/4 planned shots over 11 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 11 changed files. Global verifier still owns final merge decision.
Blocking
🔴 HIGH GEPA optimize cost silently dropped from arm costUsd — invalidates the fair-cost invariant — examples/ablation-suite/ablation.ts
Lines 144-156 call optimizeDriverPrompt, whose return type (gepa-driver-prompt.ts:163) is
{systemPrompt, lift, shipped}with NO cost field. selfImprove with generations=1, populationSize=2 runs a baseline cell + 2 candidate cells across the train slice (each cell = a full refine rollout via runAgentic), so the GEPA pass spends real $ that often exceeds the held-out supervisor run's cost. None of it is added to the arm'susdaccumulator (line 202 only addssup.usdfrom the held-out loop). The ablation header
🔴 HIGH Supervisor budget mis-sizing prevents multi-worker steering (driverSteer/optimize arms non-functional as designed) — examples/ablation-suite/ablation.ts
Lines 193-196 pass
budget: { maxIterations: arm.knobs.budget, maxTokens: (opts.worker.maxTokens ?? 4000) * Math.max(1, arm.knobs.budget) }. With the default BUDGET=2 and maxTokens=4000 this yields a root pool of maxIterations=2, maxTokens=8000. supervise.ts:109defaultPerWorkerforwardsbudget.maxIterationsunchanged (only divides tokens by 4), so each worker reservation = {maxIterations: 2, maxTokens: 2000}. The surfaceWorkerExecutor reportsspent.iterations = r.completions(easily >2 for innerTurns=6), and scope.ts:719 clampSpend clamps it back to the reservation ceiling of 2. After ONE worker settles, freeIterations = 2 - 2 = 0; budget.ts
Other
🟠 MEDIUM No integration test exercises the jail through the actual call() tool dispatch — bench/src/swe-bench-env.test.ts
isInsideJail, jailPath, and isTestPath are unit-tested in isolation (lines 7-69), but no test drives environment.call(handle, 'read_file', {path: 'escape/passwd'}) or edit_file against a workspace containing an escaping symlink. The security boundary is the composition safe() -> resolveInJail() -> readFileSync/writeFileSync inside call() (swe-bench-env.ts:135-176); a wiring bug (wrong variable, missing resolveInJail in one branch) would not be caught by the current suite. The test comment at line 46 acknowledges it 'mirrors'
🟠 MEDIUM Data integrity: uncaptured columns rendered as zero in ablation output — examples/ablation-suite/ablation.ts
Lines 170-222: the driverSteer path accumulates only
usdfromselfImprovingSupervisor—ti,to,ms,shots,compsremain 0 (code comment at L177-178 acknowledges 'UNCAPTURED, not a real zero'). Lines 235-246 constructArmResultwith these zeroes, andprintAutopsy(L258-291) renders0slatency and0.0shots identically to a genuinely-zero-cost run. An ablation reader comparingdriver-steer(0s, 0.0 shots) againstbaseline(real latency/shots) will draw invalid conclusions about cost. F
🟠 MEDIUM progressAnalyst summary always renders '[object Object]' — up-leg is non-functional — examples/ablation-suite/self-improving-supervisor.ts
Line 62:
run: async (_kindId, trace) => ({ summary: \worker produced: ${String(trace).slice(0, 400)}` }). Thetraceargument is the worker's settled blob — aSurfaceWorkerOutobject ({resolved, score, shots, summary}), confirmed at coordination.ts:256const trace = await opts.blobs.get(w.outRef).String(obj)` on a plain object yields the literal '[object Object]', so every analyst finding re-entered on the bus is the identical string 'worker produced: [object Object]'. The module's own header ([lines 4-9](https://github.com/tangle-network/agent-runtime/blob/34a5b702227fd0eeada24c030a7a7a64187bcb03/examples/ablation-suite/self-i
🟠 MEDIUM No test coverage for analysts/analyzeOnSettle at driverAgent, supervisorAgent, or supervise layers — src/runtime/supervise/coordination-driver.ts
The coordination MCP tools have existing analyst tests (coordination.test.ts:320-341 tests list_analysts/run_analyst, coordination.test.ts:482-486 tests analyzeOnSettle hook), but the new pass-through options at the
driverAgent,supervisorAgent, andsuperviselevels have zero test coverage. coordination-driver.test.ts has no analysts-related tests. Thesupervise/supervisorAgenttest files have no references to 'analysts' or 'analyzeOnSettle' at all. A test confirming thatdriverAgent({..., analysts: mockRegistry, analyzeOnSettle: ['x']})correctly passes both tocreateCoordinationToolswould close the coverage gap.
🟠 MEDIUM analyzeOnSettle without analysts silently no-ops — missing ValidationError — src/runtime/supervise/coordination-driver.ts
All 4 files document
analyzeOnSettleas 'Requiresanalysts', but no layer throws ValidationError whenanalyzeOnSettleis non-empty andanalystsis unset. The only gate is a softopts.analysts &&check at coordination.ts:255 indrainSettlement, which silently skips analysis. This is inconsistent with theextraTools/executeExtraToolpattern at coordination-driver.ts:171-174 which fails loud with ValidationError. Users who setanalyzeOnSettlewithoutanalystswill get no findings and no error — a debugging time-sink. Add a ValidationError indriverAgent(matching the extraTools pattern) or at minimum increateCoordinationTools.
🟡 LOW Test hardcodes /etc/passwd and asserts its realpath, reducing portability — bench/src/swe-bench-env.test.ts
The symlink-escape test (lines 53-60) does symlinkSync('/etc', link), then realpathSync(join(dir,'escape/passwd')) and asserts
expect(real).toBe('/etc/passwd'). This assumes /etc/passwd exists and /etc is not itself a symlink (true on Linux and macOS, but not guaranteed in minimal/sandboxed CI images where /etc/passwd may be absent). If realpathSync throws on a missing target the test fails for an environmental rather than logic reason. Low severity since the bench harness is Unix-oriented; consider symlinking to a file the test itself creates inside the temp dir.
🟡 LOW jailPath takes an unused _root parameter — bench/src/swe-bench-env.ts
jailPath(_root, p) at line 34 ignores _root; the comment explains it is for 'signature symmetry' with the realpath check. Intentional and documented, but it can trip unused-arg linters and the symmetry justification is weak (callers could pass root only where needed). Not blocking; mention only as a nit.
🟡 LOW list_files escapes realpath jail through repo symlinks — bench/src/swe-bench-env.ts
The new resolveInJail realpath containment check is applied to read_file (line 140) and edit_file (line 160), but list_files (line 106-134) still uses only the string pre-filter (safe=jailPath). If a checked-out repo contains a symlink pointing outside the workspace (e.g., escape -> /etc), list_files("escape") follows it: existsSync follows the symlink, readdirSync lists the target directory's entries. Ve
🟡 LOW list_files is not jail-protected and leaks symlink names (metadata) — bench/src/swe-bench-env.ts
list_files (lines 106-134) uses only the string safe() filter, not resolveInJail. It walks with lstatSync so it does not follow symlinks or read contents, but it does emit symlink entry names (e.g. a repo symlink
creds -> /root/.aws/credsappears in the listing). Contents remain protected because read_file/edit_file are realpath-jailed, and this is not a regression (old code behaved identically), so impact is metadata-only. Noted because the PR's theme is jail hardening; a follow-up could skip symlink entries in list_files for defense in depth.
🟡 LOW Thrown task's real spend lost from arm cost accounting — examples/ablation-suite/ablation.ts
Lines 171-223: the try/catch counts a thrown task as unresolved (perTask.push(0)) but the throw escapes before
usd += sup.usd/usd += r.usd. For runAgentic the throw is post-hoc (strategy.ts:1048-1053 throws AFTER the supervisor run completed and spent real$), and for the supervisor path a network/quota throw likewise follows real conserved spend. The ablation header asserts '$ are real' for driverSteer arms, but a single thrown task silently drops its full $ from the arm total. Low severity for an example, but contradicts the module's own honesty claim. FIX: wrap each branch in its own try that captures partial spend before re-throwing into t
🟡 LOW No guard on empty train set in optimizeDriverPrompt — examples/ablation-suite/gepa-driver-prompt.ts
Line 78:
const trainTasks = await opts.tasks(opts.trainOffset, opts.trainN). If the task generator returns fewer tasks than requested,scenarioscan be empty or shorter. WithholdoutFraction: 0.34applied to a near-empty set, the held-out gate has insufficient scenarios to certify the winner.selfImprovelikely handles this internally, but a guard (e.g.if (scenarios.length < 2) throw ...) at the call site would prevent silent no-op optimization passes that report a lift of 0.
🟡 LOW Cast-based type recovery from supervise winner output — examples/ablation-suite/self-improving-supervisor.ts
Lines 95-97:
const out = result.kind === 'winner' ? (result.out as SurfaceWorkerOut | undefined) : undefined. The type system cannot verify thatresult.outis aSurfaceWorkerOut— it relies on the co-located invariant thatmakeWorkerAgentwas built bysurfaceWorkerSeam. If this function is forked and composed with a differentmakeWorkerAgent, the cast silently produces a structurally-wrong object. Consider a tag/discriminant on the output or a runtime shape check for robustness.
🟡 LOW deliverable passed to supervise() is dead — makeWorkerAgent override ignores it — examples/ablation-suite/self-improving-supervisor.ts
Line 86 passes
deliverable: seam.deliverableto supervise(), but supervise.ts:131-139 only consumes opts.deliverable when deriving the worker seam via workerFromBackend(opts.backend, opts.deliverable) — a path skipped entirely when opts.makeWorkerAgent is supplied (which this seam does). The seam's makeWorkerAgent already encodesverdict: { valid: r.resolved }in the executor result, so worker settlement IS correctly gated to resolved — but the explicit deliverable argument is decorative and misleads anyone reading the call into thinking supervise uses it. Either drop the argument or document that the gate lives in the executor's verd
🟡 LOW progressAnalyst String() coercion may produce [object Object] — examples/ablation-suite/self-improving-supervisor.ts
Lines 59-60:
run: async (_kindId: string, trace: unknown) => ({ summary:worker produced: ${String(trace).slice(0, 400)}}). For non-string, non-primitivetracevalues,String()returns'[object Object]', giving the driver a useless one-liner. TheSurfaceWorkerOut.summaryis a string so the current surface-worker path is fine, but the analyst registers under the genericAnalystRegistryinterface wheretraceisunknown. Usetypeof trace === 'string' ? trace.slice(0, 400) : JSON.stringify(trace).slice(0, 400)to handle arbitrary shape.
🟡 LOW Worker executor ignores spawn-scoped abort signal — examples/ablation-suite/surface-worker.ts
Lines 62-64:
async execute()declares zero parameters, but Executor.execute (types.ts:84) receives(task, signal). runAgentic (strategy.ts:1030) does not accept an AbortSignal either. So when the supervisor trips the intensity breaker or exhausts budget and aborts the child (scope.ts:605childAbort.signal.aborted), the in-flight worker's refine rollout runs to completion anyway. The worker self-limits via its own budget/innerTurns, so this is not a liveness bug, but abort-propagation — the substrate's intended cancellation channel — is absent. Acceptable for an example; worth a one-line comment so readers don't copy it into a long-running
🟡 LOW analyzeOnSettle-requires-analysts invariant documented but not enforced — src/runtime/supervise/coordination-driver.ts
All three option interfaces (DriverAgentOptions:66-67, SupervisorAgentDeps:99-101, SuperviseOptions:85-89) document 'Requires analysts' for analyzeOnSettle, but no construction-time check enforces it. The codebase explicitly fail-loud-checks analogous paired requirements at coordination-driver.ts:171-175 ('extraTools requires executeExtraTool … a silent no-op the house rules forbid') and at coordination-driver.ts:189-193 (maxTurns>=0). A caller wiring
{ analyzeOnSettle: ['profileRichness'] }withoutanalystsgets silently no analysis (coordination.ts:255 short-circuits onopts.analysts && …) — exactly the silent no-op the existing house-rules comment forbids. Fix: add to the validation block in driverAgent() around [line 171](https://github.com/tangle-network/agent-runtime/blob/34a5b
tangletools · 2026-06-28T06:31:05Z · trace
tangletools
left a comment
There was a problem hiding this comment.
❌ 2 Blocking Findings — 34a5b702
Full multi-shot audit completed 4/4 planned shots over 11 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 11 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-28T06:31:05Z · immutable trace
…egrity, fail-loud, docs - coordination-driver: fail loud when analyzeOnSettle is set without analysts (matches the extraTools guard — no silent no-op) + a keystone test (fail-loud + analysts/analyzeOnSettle pass-through). - self-improving-supervisor: the progress analyst read the worker blob via String() → '[object Object]'; now reads the real SurfaceWorkerOut fields (the up-leg feeds a real summary). Size perWorker so MULTIPLE workers fit the pool (the default reserved the whole iteration pool per worker → only one ever spawned, defeating spawn-a-refined-worker steering). Return the full conserved spend (tokens/latency), not just $. - gepa-driver-prompt: return the TRAIN-side optimization cost (totalCostUsd). - ablation: count GEPA's train cost into the arm $ (fair-cost invariant); capture the supervisor arm's real tokens/latency (kill the fake-zero columns); size the supervisor budget for multi-worker steering. - docs/api regenerated (freshness gate: keystone shifted supervisor-agent line numbers). Addresses the multi-shot reviewer's 2 HIGH + the correctness/cost/coverage MEDIUMs on 34a5b70.
tangletools
left a comment
There was a problem hiding this comment.
✅ Refreshed approval after new commits — 0470036a
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T07:17:40Z
The self-improving supervisor — keystone + loop, end-to-end
A driver-steered, self-improving supervisor: the driver steers the worker, an analyst feeds the driver each round ("here's the last round, compose the next prompt"), the driver's prompt is GEPA-optimized on a frozen holdout, and a cost-aware ablation measures what actually helps. Purely additive — every new knob defaults off, so the status-quo supervisor is byte-identical.
1. Keystone — the propagated
analyzeknob (additive, every level)supervise()/supervisorAgent/driverAgentgain optionalanalysts+analyzeOnSettle, threaded tocreateCoordinationTools(router arm) ANDserveCoordinationMcp(sandbox arm). On a worker settle, each analyst lens runs → afindingthe driver pulls (await_event) and composes its next steer from — the self-improving up-leg that was previously unreachable throughsupervise(). Defaults off → status quo byte-identical. 82 supervise/coordination tests green.2. The loop layers (on the keystone)
surface-worker.ts— the graded-worker seam: amakeWorkerAgentrunningrunAgenticover anAgenticSurface, settling with the surface score as the deliverable verdict (settled ⟺ resolved, not a self-report).gepa-driver-prompt.ts— optimize the driver's compose-prompt on TRAIN with an executableJudgeConfig(selfImprove+gepaProposerreading the surface score, not an LLM judge), frozen, certified.self-improving-supervisor.ts— the one-call DX composingsupervise(+analyze)over the graded seam.ablation.ts—driverSteer+optimizeknobs wired (blind / steered / self-improving over the samesupervise()), per-task resilience, cost + paired-bootstrap significance autopsy.3. Live e2e — proven, no mocks
The driver-steered supervisor (analyze on) over the contamination-proof generated coding task:
resolved=true score=1.00 $0.031 / 141s— the driver spawned a worker, the analyst fed the driver, a worker resolved the task, the graded winner was delivered. The loop closes end-to-end.4. #402 hardening (folded in)
swe-bench-env: temp-dir leak fix (try/catch + rmSync on clone/checkout failure), symlink realpath-jail (read + edit),
workspacesMap into the closure, exportedjailPath/isTestPath+ unit tests. self-improving-coder:read_filepath guard (symmetric withwrite_file), deadrun_testshandler removed. swe-self-improve moved tobench/src/(now under typecheck). ablation per-task try/catch + biome.Supersedes #402 (examples + instrument folded here). Typecheck clean (examples + core); biome clean.