feat(bench): add external benchmark adapters #410

Merged
drewstone merged 1 commit into main from feat/bench-external-adapters on Jun 29, 2026

@drewstone

Contributor

Summary

  • add WebArena-Verified and tau2-bench adapters that load official tasks and judge only official run/trajectory artifacts
  • add deterministic AgentBench DBBench adapter with exact-label scoring
  • add ToolLLM task-loading adapter that intentionally refuses scoring because official ToolEval is LLM-judged/stochastic
  • register all four adapters and include tiny fixture-mode tests for offline plumbing
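
As an illustration of the exact-label scoring named for the DBBench adapter, here is a minimal sketch. The helper names (`normalizeLabel`, `judgeExactLabel`) and the verdict shape are hypothetical and not taken from the adapter source; the only behavior assumed is an exact match after case and whitespace normalization.

```typescript
// Hypothetical helpers for illustration; the real adapter may differ.

interface LabelVerdict {
  pass: boolean;
  score: 0 | 1;
  expected: string;
  actual: string;
}

// Trim the ends, collapse internal whitespace runs, and lowercase.
function normalizeLabel(raw: string): string {
  return raw.trim().replace(/\s+/g, " ").toLowerCase();
}

// Deterministic judge: 1 only when the normalized strings are identical.
function judgeExactLabel(expected: string, actual: string): LabelVerdict {
  const e = normalizeLabel(expected);
  const a = normalizeLabel(actual);
  const pass = e === a;
  return { pass, score: pass ? 1 : 0, expected: e, actual: a };
}
```

Because the comparison is a plain string equality, the same inputs always produce the same score, which is the property the summary relies on.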

Verification

  • pnpm exec tsx --test bench/src/benchmarks/external-adapters.test.mts — 4/4 pass
  • DABSTEP_FIXTURES=1 pnpm exec tsx --test bench/src/benchmarks/dabstep.test.mts — 6/6 pass
  • pnpm exec tsx --tsconfig tsconfig.json registry stdin smoke from bench/ — 4/4 new keys resolve
  • pnpm exec tsc --noEmit -p tsconfig.json from bench/ — pass
  • pnpm pack --dry-run from bench/ — includes new adapters/fixtures/tests
  • pnpm run typecheck — pass
  • pnpm run build — pass
  • git diff --cached --check before commit — pass
  • git merge-tree --write-tree origin/main HEAD — clean
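
The registry smoke check above could look roughly like the sketch below. The four keys come from this PR; the `AdapterStub` type, the `ADAPTERS_SKETCH` map, and `resolveAdapterSketch` are hypothetical stand-ins for the real registry in bench/src/adapters.ts, whose actual shape is not shown here.

```typescript
// Hypothetical stand-in for the real registry; only the keys are from the PR.
interface AdapterStub {
  name: string;
}

const ADAPTERS_SKETCH = new Map<string, AdapterStub>([
  ["webarena-verified", { name: "webarena-verified" }],
  ["tau2-bench", { name: "tau2-bench" }],
  ["agentbench", { name: "agentbench" }],
  ["toollm", { name: "toollm" }],
]);

// Fail loud on unknown keys and list the known ones in the message.
function resolveAdapterSketch(key: string): AdapterStub {
  const adapter = ADAPTERS_SKETCH.get(key);
  if (adapter === undefined) {
    const known = Array.from(ADAPTERS_SKETCH.keys()).join(", ");
    throw new Error(`unknown benchmark "${key}"; known keys: ${known}`);
  }
  return adapter;
}

// Smoke check: every new key must resolve to an adapter.
const newKeys = ["webarena-verified", "tau2-bench", "agentbench", "toollm"];
const resolved = newKeys.map((key) => resolveAdapterSketch(key));
console.log(`${resolved.length}/${newKeys.length} new keys resolve`);
```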

Notes

ToolLLM is intentionally task-load-only until a deterministic executable subset exists; this prevents recording fake scores from an LLM judge.
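
A minimal sketch of what "task-load-only" can mean in practice: tasks load normally, but the judge throws instead of returning a score. All names here (`ScoringRefusedError`, `toollmSketch`, and the `query_id`/`query` row fields) are assumptions made for illustration, not the adapter's actual API.

```typescript
// Hypothetical shapes for illustration only.
class ScoringRefusedError extends Error {
  constructor(benchmark: string, reason: string) {
    super(`${benchmark}: scoring refused: ${reason}`);
    this.name = "ScoringRefusedError";
  }
}

interface LoadOnlyTask {
  id: string;
  prompt: string;
}

interface ToolRow {
  query_id: string | number; // assumed field name
  query: string; // assumed field name
}

const toollmSketch = {
  // Loading tasks is allowed: it only reshapes rows.
  loadTasks(rows: ToolRow[]): LoadOnlyTask[] {
    return rows.map((row) => ({ id: String(row.query_id), prompt: row.query }));
  },

  // Judging is refused: the return type is `never`, so no score can leak out.
  judge(_task: LoadOnlyTask, _output: string): never {
    throw new ScoringRefusedError(
      "toollm",
      "the official ToolEval judge is LLM-based and stochastic",
    );
  },
};
```

Typing the judge as `never` means a caller cannot accidentally record a number from it; the only way out is the thrown error.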

@tangletools tangletools left a comment

Contributor


🟢 Value Audit — sound

Verdict: sound
Concerns: 0 (none)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 114.6s (2 bridge agents)
Total: 114.6s

💰 Value — sound

Adds 4 new external-benchmark adapters that each implement a real deterministic judge (or an honest fail-loud refusal) following the exact established pattern; coherent, in-grain, no duplication.

  • What it does: Registers four new adapters under bench/src/benchmarks: webarena-verified (judges an official run-output dir via the webarena_verified eval-tasks CLI), tau2-bench (recomputes rewards over an official tau2 Results/trajectory file via tau2's own compute_simulation_rewards), agentbench (DBBench subset only; exact-label match after case/whitespace normalization), and toollm (loads ToolBench tasks
  • Goals it achieves: Broaden the deterministic-judge benchmark coverage so profile/skill/prompt changes can be scored against more real external benchmarks. Two goals are visible from the code: (1) add benchmarks whose official harness CAN be driven deterministically (webarena, tau2, agentbench-dbbench) so they produce real scores; (2) for a benchmark whose only judge is non-deterministic (toollm/ToolEval), still expo
  • Assessment: Strong fit. The structure of each new adapter is a near-1:1 match with dabstep.ts (the immediately preceding adapter, commit 5d610e7): same rowToTask→readMeta→selectRows→loadFixtures→fixturesMode→preflight shape, same fail-loud-with-actionable-fix error messages, same output: OutputAdapter ownership. The honesty discipline is exactly right: webarena/tau2 refuse to score plain text and require th
  • Better / existing approach: none; this is the right approach. Searched for (a) an existing equivalent any of these 4 could duplicate: all 4 adapter files + 4 fixtures are new files, the registry only adds 4 keys, and HARNESS.md line 207's 'absent' list never included these, so this is genuinely new coverage. (b) A shared helper for the repeated per-adapter boilerplate (selectRows/readMeta/loadFixtures and the final-text `OutputAdapter<s
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

Four new benchmark adapters land cleanly behind the existing registry with real (or honestly-refused) deterministic judges — no dead surface, no competing pattern.

  • Assessment: Coherent, on-grain, reachable, and documented. All four adapters will do their job: three produce real deterministic scores via the upstream harness, one (toollm) plants a deliberate flag that the benchmark exists but its official judge is non-deterministic — preferable to omitting it or faking scores. No dead surface, no adjacent-problem drift, no awkward interface. Ship.
  • Integration: All four adapters are registered in bench/src/adapters.ts:42-45 (webarena-verified, tau2-bench, agentbench, toollm) inside the single ADAPTERS registry. That registry is the seam every consumer reads: resolveAdapter (bench/src/adapters.ts:66) is used by gate-cli.mts:39, trata-gate.mts:150, aec-gate.mts:180; ADAPTERS directly by research-gate.mts:40 and corpus-replay.mts:217; and both are re-export
  • Fit with existing patterns: Squarely in the grain. The fixture-mode (*_FIXTURES=1) + fail-loud preflight requiring *_DIR + delegate-to-official-evaluator pattern is exactly what dabstep/commit0/programbench/appworld already do (documented in HARNESS.md and types.ts:1-9's 'no self-authored judge' rule). webarena-verified delegates to the official eval-tasks over a run output dir; tau2-bench recomputes rewards via tau2's
  • Real-world viability: Holds up on realistic use. preflight throws with actionable install guidance when the upstream checkout/venv is missing (e.g. webarena-verified.ts:161-163, tau2-bench.ts:166-168), so the gate never silently produces zeros. judge fails loud on a missing artifact path. Each judge call is an isolated python subprocess via runVenvPython with no shared mutable state, so concurrency is fine. The path/te
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1
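
The fixture-mode plus fail-loud preflight pattern described above can be sketched as follows. This is an illustration under stated assumptions: the function names, the `Env` type, and the injected `exists` callback are hypothetical, and the environment variable prefix is a parameter because the exact variable names for the new adapters are not shown in this review.

```typescript
// Hypothetical sketch; names and signatures are not from the adapter source.
type Env = Record<string, string | undefined>;

// Fixture mode is on when <PREFIX>_FIXTURES is exactly "1".
function fixturesMode(env: Env, prefix: string): boolean {
  return env[`${prefix}_FIXTURES`] === "1";
}

// Returns null in fixture mode, otherwise the validated upstream directory.
// Throws with an actionable hint rather than letting the gate score zeros.
function preflight(
  env: Env,
  prefix: string,
  installHint: string,
  exists: (path: string) => boolean, // injected so the sketch has no imports
): string | null {
  if (fixturesMode(env, prefix)) return null;
  const dir = env[`${prefix}_DIR`];
  if (!dir) {
    throw new Error(`${prefix}_DIR is not set. ${installHint}`);
  }
  if (!exists(dir)) {
    throw new Error(`${prefix}_DIR=${dir} does not exist. ${installHint}`);
  }
  return dir;
}
```

In a Node process, a caller would likely pass `process.env` and `existsSync` from `node:fs`; injecting both keeps the check easy to exercise offline.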

No concerns — sound change, no better or existing approach found. ✅


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260629T044225Z

@drewstone drewstone merged commit 5e2e81a into main Jun 29, 2026
1 check passed