feat(bench): add external benchmark adapters#410
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 114.6s (2 bridge agents) |
| Total | 114.6s |
💰 Value — sound
Adds 4 new external-benchmark adapters that each implement a real deterministic judge (or an honest fail-loud refusal) following the exact established pattern; coherent, in-grain, no duplication.
- What it does: Registers four new adapters under bench/src/benchmarks: webarena-verified (judges an official run-output dir via the `webarena_verified eval-tasks` CLI), tau2-bench (recomputes rewards over an official tau2 Results/trajectory file via tau2's own `compute_simulation_rewards`), agentbench (DBBench subset only; exact-label match after case/whitespace normalization), and toollm (loads ToolBench tasks…
- Goals it achieves: Broaden the deterministic-judge benchmark coverage so profile/skill/prompt changes can be scored against more real external benchmarks. Two goals are visible from the code: (1) add benchmarks whose official harness CAN be driven deterministically (webarena, tau2, agentbench-dbbench) so they produce real scores; (2) for a benchmark whose only judge is non-deterministic (toollm/ToolEval), still expo…
- Assessment: Strong fit. The structure of each new adapter is a near-1:1 match with dabstep.ts (the immediately preceding adapter, commit 5d610e7): same rowToTask→readMeta→selectRows→loadFixtures→fixturesMode→preflight shape, same fail-loud-with-actionable-fix error messages, same `output: OutputAdapter` ownership. The honesty discipline is exactly right: webarena/tau2 refuse to score plain text and require th…
- Better / existing approach: none; this is the right approach. Searched for (a) an existing equivalent any of these 4 could duplicate: all 4 adapter files + 4 fixtures are new files, the registry only adds 4 keys, and HARNESS.md line 207's 'absent' list never included these, so this is genuinely new coverage. (b) A shared helper for the repeated per-adapter boilerplate (selectRows/readMeta/loadFixtures and the final-text `OutputAdapter<s…
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
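The "exact-label match after case/whitespace normalization" judge described for agentbench could look roughly like the sketch below. The helper names are hypothetical, and collapsing internal whitespace runs is an assumption; the adapter in this PR may normalize differently.

```typescript
// Hypothetical helper names; not taken from the PR's agentbench adapter.

/** Trim, lowercase, and collapse whitespace runs to a single space (assumed rule). */
function normalizeLabel(raw: string): string {
  return raw.trim().toLowerCase().replace(/\s+/g, " ");
}

/** Deterministic judge: 1 when the normalized strings are identical, else 0. */
function exactLabelScore(predicted: string, expected: string): 0 | 1 {
  return normalizeLabel(predicted) === normalizeLabel(expected) ? 1 : 0;
}

console.log(exactLabelScore("  New   York ", "new york")); // 1
console.log(exactLabelScore("New York City", "new york")); // 0
```

Because the comparison is a pure string function, the same output always yields the same score, which is what makes it usable as a deterministic judge.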
🎯 Usefulness — sound
Four new benchmark adapters land cleanly behind the existing registry with real (or honestly-refused) deterministic judges — no dead surface, no competing pattern.
- Assessment: Coherent, on-grain, reachable, and documented. All four adapters will do their job: three produce real deterministic scores via the upstream harness, one (toollm) plants a deliberate flag that the benchmark exists but its official judge is non-deterministic — preferable to omitting it or faking scores. No dead surface, no adjacent-problem drift, no awkward interface. Ship.
- Integration: All four adapters are registered in bench/src/adapters.ts:42-45 (webarena-verified, tau2-bench, agentbench, toollm) inside the single ADAPTERS registry. That registry is the seam every consumer reads: resolveAdapter (bench/src/adapters.ts:66) is used by gate-cli.mts:39, trata-gate.mts:150, aec-gate.mts:180; ADAPTERS directly by research-gate.mts:40 and corpus-replay.mts:217; and both are re-export…
- Fit with existing patterns: Squarely in the grain. The fixture-mode (`*_FIXTURES=1`) + fail-loud preflight requiring `*_DIR` + delegate-to-official-evaluator pattern is exactly what dabstep/commit0/programbench/appworld already do (documented in HARNESS.md and types.ts:1-9's 'no self-authored judge' rule). webarena-verified delegates to the official `eval-tasks` over a run output dir; tau2-bench recomputes rewards via tau2's…
- Real-world viability: Holds up on realistic use. preflight throws with actionable install guidance when the upstream checkout/venv is missing (e.g. webarena-verified.ts:161-163, tau2-bench.ts:166-168), so the gate never silently produces zeros. judge fails loud on a missing artifact path. Each judge call is an isolated python subprocess via runVenvPython with no shared mutable state, so concurrency is fine. The path/te…
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
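The fixture-mode plus fail-loud preflight pattern mentioned above can be illustrated with a minimal sketch. Everything here is hypothetical (the `Mode` type, the `preflight` signature, the `EXAMPLE` prefix); only the `*_FIXTURES=1` and `*_DIR` variable shapes come from the review text. The environment is passed in explicitly so the sketch does not depend on Node typings.

```typescript
// Hypothetical types and names; the repo's real preflight may differ.
type Env = Record<string, string | undefined>;

type Mode =
  | { kind: "fixtures" }
  | { kind: "official"; dir: string };

/** Decide how an adapter should run, or throw with an actionable fix. */
function preflight(prefix: string, env: Env): Mode {
  // Fixture mode short-circuits: no upstream checkout is needed.
  if (env[`${prefix}_FIXTURES`] === "1") {
    return { kind: "fixtures" };
  }
  const dir = env[`${prefix}_DIR`];
  if (!dir) {
    // Fail loud rather than letting the gate silently produce zeros.
    throw new Error(
      `${prefix}_DIR is not set. Point it at the upstream checkout, ` +
        `or set ${prefix}_FIXTURES=1 to run against bundled fixtures.`,
    );
  }
  return { kind: "official", dir };
}

console.log(preflight("EXAMPLE", { EXAMPLE_FIXTURES: "1" }).kind); // "fixtures"
```

Returning a tagged union keeps the two run modes explicit for the caller, while the thrown error carries the fix in its message.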
No concerns — sound change, no better or existing approach found. ✅
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
Summary
Verification
- `pnpm exec tsx --test bench/src/benchmarks/external-adapters.test.mts`: 4/4 pass
- `DABSTEP_FIXTURES=1 pnpm exec tsx --test bench/src/benchmarks/dabstep.test.mts`: 6/6 pass
- `pnpm exec tsx --tsconfig tsconfig.json` registry stdin smoke from `bench/`: 4/4 new keys resolve
- `pnpm exec tsc --noEmit -p tsconfig.json` from `bench/`: pass
- `pnpm pack --dry-run` from `bench/`: includes new adapters/fixtures/tests
- `pnpm run typecheck`: pass
- `pnpm run build`: pass
- `git diff --cached --check` before commit: pass
- `git merge-tree --write-tree origin/main HEAD`: clean

Notes
ToolLLM is intentionally task-load-only until a deterministic executable subset exists; this prevents recording fake scores from an LLM judge.
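
A task-load-only adapter of this kind might be sketched as follows. The `Task` and `Adapter` shapes and the placeholder task are assumptions for illustration, not the interfaces defined in the repo.

```typescript
// Hypothetical shapes; the repo's real adapter interface may differ.
interface Task {
  id: string;
  prompt: string;
}

interface Adapter {
  loadTasks(): Task[];
  judge(task: Task, output: string): number;
}

const taskLoadOnlyAdapter: Adapter = {
  // Tasks can still be listed and loaded.
  loadTasks: () => [{ id: "example-1", prompt: "placeholder prompt" }],
  // Judging refuses outright instead of returning 0 or an LLM-derived score.
  judge: (task) => {
    throw new Error(
      `No deterministic judge for task ${task.id}; refusing to record a score.`,
    );
  },
};

console.log(taskLoadOnlyAdapter.loadTasks().length); // 1
```

Throwing from `judge` means a caller cannot mistake "not scored" for a score of zero.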