feat(bench): add external benchmark adapters by drewstone · Pull Request #410 · tangle-network/agent-runtime

drewstone · 2026-06-29T04:38:39Z

Summary

add WebArena-Verified and tau2-bench adapters that load official tasks and judge only official run/trajectory artifacts
add deterministic AgentBench DBBench adapter with exact-label scoring
add ToolLLM task-loading adapter that intentionally refuses scoring because official ToolEval is LLM-judged/stochastic
register all four adapters and include tiny fixture-mode tests for offline plumbing
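
For illustration, exact-label scoring of the kind described for the DBBench adapter can be sketched as below. This is a minimal sketch, not the adapter's actual code: the names `normalizeLabel` and `scoreExactLabel` are hypothetical, and the exact normalization (trim, collapse whitespace runs, lowercase) is an assumption.

```typescript
// Hypothetical helpers; the real adapter's identifiers are not shown in this PR text.

// Assumed normalization: trim, collapse internal whitespace runs, lowercase.
function normalizeLabel(label: string): string {
  return label.trim().replace(/\s+/g, " ").toLowerCase();
}

// Deterministic judge: 1 when the normalized labels are identical, else 0.
function scoreExactLabel(predicted: string, expected: string): 0 | 1 {
  return normalizeLabel(predicted) === normalizeLabel(expected) ? 1 : 0;
}

console.log(scoreExactLabel("  Alice   Smith ", "alice smith")); // 1
console.log(scoreExactLabel("alice", "alice smith")); // 0
```

Because the comparison is pure string equality after normalization, the same inputs always give the same score, which is what makes this judge deterministic.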

Verification

pnpm exec tsx --test bench/src/benchmarks/external-adapters.test.mts — 4/4 pass
DABSTEP_FIXTURES=1 pnpm exec tsx --test bench/src/benchmarks/dabstep.test.mts — 6/6 pass
pnpm exec tsx --tsconfig tsconfig.json registry stdin smoke from bench/ — 4/4 new keys resolve
pnpm exec tsc --noEmit -p tsconfig.json from bench/ — pass
pnpm pack --dry-run from bench/ — includes new adapters/fixtures/tests
pnpm run typecheck — pass
pnpm run build — pass
git diff --cached --check before commit — pass
git merge-tree --write-tree origin/main HEAD — clean

Notes

ToolLLM is intentionally task-load-only until a deterministic executable subset exists; this prevents recording fake scores from an LLM judge.
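
A task-load-only adapter along these lines might look like the following sketch. Everything here is hypothetical (the `Task` shape, the `query_id`/`query` row fields, the error class, and the object layout); it only illustrates loading tasks while refusing to produce a score.

```typescript
// Hypothetical shapes; not the repository's actual adapter interface.
interface Task {
  id: string;
  prompt: string;
}

class ScoringUnsupportedError extends Error {
  constructor(benchmark: string, reason: string) {
    super(`${benchmark}: scoring refused: ${reason}`);
    this.name = "ScoringUnsupportedError";
  }
}

const taskLoadOnlyAdapter = {
  name: "toollm",
  // Loading tasks works as usual (row field names are assumed).
  loadTasks(rows: Array<{ query_id: string; query: string }>): Task[] {
    return rows.map((r) => ({ id: r.query_id, prompt: r.query }));
  },
  // Judging always throws, so no score can be recorded by accident.
  judge(_task: Task, _output: string): never {
    throw new ScoringUnsupportedError(
      "toollm",
      "official ToolEval is LLM-judged; no deterministic judge is available",
    );
  },
};
```

Throwing rather than returning a placeholder such as 0 keeps a refused score from being mistaken for a real failing score.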

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	0 (none)
Heuristic	0.0s
Duplication	0.0s
Interrogation	114.6s (2 bridge agents)
Total	114.6s

💰 Value — sound

Adds 4 new external-benchmark adapters that each implement a real deterministic judge (or an honest fail-loud refusal) following the exact established pattern; coherent, in-grain, no duplication.

What it does: Registers four new adapters under bench/src/benchmarks: webarena-verified (judges an official run-output dir via the webarena_verified eval-tasks CLI), tau2-bench (recomputes rewards over an official tau2 Results/trajectory file via tau2's own compute_simulation_rewards), agentbench (DBBench subset only; exact-label match after case/whitespace normalization), and toollm (loads ToolBench tasks
Goals it achieves: Broaden the deterministic-judge benchmark coverage so profile/skill/prompt changes can be scored against more real external benchmarks. Two goals are visible from the code: (1) add benchmarks whose official harness CAN be driven deterministically (webarena, tau2, agentbench-dbbench) so they produce real scores; (2) for a benchmark whose only judge is non-deterministic (toollm/ToolEval), still expo
Assessment: Strong fit. The structure of each new adapter is a near-1:1 match with dabstep.ts (the immediately preceding adapter, commit 5d610e7): same rowToTask→readMeta→selectRows→loadFixtures→fixturesMode→preflight shape, same fail-loud-with-actionable-fix error messages, same output: OutputAdapter ownership. The honesty discipline is exactly right: webarena/tau2 refuse to score plain text and require th
Better / existing approach: none; this is the right approach. Searched for (a) an existing equivalent that any of these 4 could duplicate: all 4 adapter files + 4 fixtures are new files, the registry only adds 4 keys, and HARNESS.md line 207's 'absent' list never included these, so this is genuinely new coverage. (b) A shared helper for the repeated per-adapter boilerplate (selectRows/readMeta/loadFixtures and the final-text `OutputAdapter<s
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 2
Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
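
As a rough illustration of the registration described above, a keyed registry with a fail-loud lookup could be sketched as follows. The four keys come from the review text; `AdapterSketch`, `ADAPTERS_SKETCH`, and `resolveAdapterSketch` are hypothetical stand-ins, not the contents of bench/src/adapters.ts.

```typescript
// Hypothetical stand-in for an adapter entry.
interface AdapterSketch {
  name: string;
  deterministicJudge: boolean;
}

// The four keys named in this PR; the values are illustrative only.
const ADAPTERS_SKETCH = new Map<string, AdapterSketch>([
  ["webarena-verified", { name: "webarena-verified", deterministicJudge: true }],
  ["tau2-bench", { name: "tau2-bench", deterministicJudge: true }],
  ["agentbench", { name: "agentbench", deterministicJudge: true }],
  ["toollm", { name: "toollm", deterministicJudge: false }],
]);

// Unknown keys throw with the list of known keys instead of returning undefined.
function resolveAdapterSketch(key: string): AdapterSketch {
  const adapter = ADAPTERS_SKETCH.get(key);
  if (!adapter) {
    const known = [...ADAPTERS_SKETCH.keys()].join(", ");
    throw new Error(`unknown benchmark "${key}"; known keys: ${known}`);
  }
  return adapter;
}

console.log(resolveAdapterSketch("toollm").deterministicJudge); // false
```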

🎯 Usefulness — sound

Four new benchmark adapters land cleanly behind the existing registry with real (or honestly-refused) deterministic judges — no dead surface, no competing pattern.

Assessment: Coherent, on-grain, reachable, and documented. All four adapters will do their job: three produce real deterministic scores via the upstream harness, one (toollm) plants a deliberate flag that the benchmark exists but its official judge is non-deterministic — preferable to omitting it or faking scores. No dead surface, no adjacent-problem drift, no awkward interface. Ship.
Integration: All four adapters are registered in bench/src/adapters.ts:42-45 (webarena-verified, tau2-bench, agentbench, toollm) inside the single ADAPTERS registry. That registry is the seam every consumer reads: resolveAdapter (bench/src/adapters.ts:66) is used by gate-cli.mts:39, trata-gate.mts:150, aec-gate.mts:180; ADAPTERS directly by research-gate.mts:40 and corpus-replay.mts:217; and both are re-export
Fit with existing patterns: Squarely in the grain. The fixture-mode (*_FIXTURES=1) + fail-loud preflight requiring *_DIR + delegate-to-official-evaluator pattern is exactly what dabstep/commit0/programbench/appworld already do (documented in HARNESS.md and types.ts:1-9's 'no self-authored judge' rule). webarena-verified delegates to the official eval-tasks over a run output dir; tau2-bench recomputes rewards via tau2's
Real-world viability: Holds up on realistic use. preflight throws with actionable install guidance when the upstream checkout/venv is missing (e.g. webarena-verified.ts:161-163, tau2-bench.ts:166-168), so the gate never silently produces zeros. judge fails loud on a missing artifact path. Each judge call is an isolated python subprocess via runVenvPython with no shared mutable state, so concurrency is fine. The path/te
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1
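
The fixture-mode and fail-loud preflight pattern mentioned above can be sketched roughly as follows. This is an assumption-laden sketch: the `preflight` signature, the `EXAMPLE` prefix, and the error wording are hypothetical, and a real preflight would also check that the directory and venv exist on disk.

```typescript
// Environment passed in explicitly so the sketch has no runtime dependencies.
type Env = Record<string, string | undefined>;

interface PreflightResult {
  mode: "fixtures" | "official";
  dir?: string;
}

// <PREFIX>_FIXTURES=1 selects offline fixtures; otherwise <PREFIX>_DIR is required.
function preflight(prefix: string, env: Env): PreflightResult {
  if (env[`${prefix}_FIXTURES`] === "1") {
    return { mode: "fixtures" };
  }
  const dir = env[`${prefix}_DIR`];
  if (!dir) {
    // Fail loud with an actionable fix rather than silently scoring zeros.
    throw new Error(
      `${prefix}_DIR is not set. Point it at a checkout of the official harness, ` +
        `or set ${prefix}_FIXTURES=1 to run in offline fixture mode.`,
    );
  }
  return { mode: "official", dir };
}

console.log(preflight("EXAMPLE", { EXAMPLE_FIXTURES: "1" }).mode); // fixtures
```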

No concerns — sound change, no better or existing approach found. ✅

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260629T044225Z

feat(bench): add external benchmark adapters

762eb1a

tangletools reviewed Jun 29, 2026

drewstone merged commit 5e2e81a into main Jun 29, 2026
1 check passed

drewstone mentioned this pull request Jun 29, 2026

feat(bench): add external benchmark adapters for WebArena, tau-bench, DABStep, AgentBench, and ToolLLM #408

Closed

drewstone merged 1 commit into main from feat/bench-external-adapters
