feat(bench): add DABStep adapter and SDK compat #409

Merged
drewstone merged 3 commits into main from feat/bench-dabstep-adapter on Jun 29, 2026

@drewstone drewstone commented Jun 29, 2026

Summary

  • add a DABStep adapter to @tangle-network/agent-bench using the existing BenchmarkAdapter contract
  • delegate scoring to official DABStep grade.py and fail loud when DABSTEP_DIR / released dataset.csv are missing
  • package bench support files needed by existing adapters, refresh bench deps, and align sandbox 0.9.5 / agent-eval 0.100.0 compatibility
  • expose existing runtime executor types publicly so bench can typecheck against the local runtime source without private imports or runtime-loop changes

Scope note

The runtime edits here are compatibility/type-surface fixes required by the bench package build against current sandbox and agent-eval versions. This PR does not change the runtime loop behavior.

Verification

  • DABSTEP_FIXTURES=1 pnpm exec tsx --test bench/src/benchmarks/dabstep.test.mts
  • node --import tsx --input-type=module -e "import('./bench/src/adapters.ts').then(({resolveAdapter})=>{ const a=resolveAdapter('dabstep'); console.log(a.name); })"
  • pnpm typecheck
  • pnpm --dir bench exec tsc --noEmit -p tsconfig.json
  • pnpm build
  • pnpm install --frozen-lockfile
  • pnpm --dir bench install --frozen-lockfile
  • pnpm --dir bench pack --dry-run
  • pnpm run docs:check
  • git diff --check

@tangletools tangletools left a comment

🟡 Value Audit — sound-with-nits

Verdict: sound-with-nits
Concerns: 1 (1 weak-concern)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 279.2s (2 bridge agents)
Total: 279.2s

💰 Value — sound-with-nits

Adds a 19th benchmark adapter (DABStep) that mirrors the established commit0/programbench pattern exactly; ships clean but bundles unrelated runtime/sandbox compat edits under a bench-titled PR.

  • What it does: Adds a DABStep adapter to bench/src/benchmarks/dabstep.ts (and registers it in adapters.ts). Live mode reads DABSTEP_DIR's released dataset.csv/splits/files/grade.py via an inline Python loader; fixture mode (DABSTEP_FIXTURES=1) reads bench/fixtures/dabstep.json. Scoring is delegated to the official grade.py through a 42-line bench/scripts/dabstep_judge.py driver that imports grade() via importlib
  • Goals it achieves: Let agent-runtime agents be scored on DABStep's data-analysis tasks (EnvCommons/DABStep) using its official deterministic grade.py — no LLM judge. The benchmark suite grows from 18 to 19 adapters behind the single resolveAdapter registry, so any profile/prompt change can be A/B'd against one more real deterministic judge. Secondary goal (per the PR body): refresh bench deps and bring the runtime u
  • Assessment: The adapter is a textbook application of the existing BenchmarkAdapter contract — it reuses _harness.ts (benchRoot/runVenvPython/runVenvScriptStdin), follows the fixture/live/preflight/judge/goldArtifact/output shape used by commit0.ts:218 and programbench.ts:165 verbatim, delegates scoring to the benchmark's own harness, and fails loud rather than fabricating a score. No self-authored judge, no L
  • Better / existing approach: For the adapter itself: none — this is the right approach and matches commit0/programbench line-for-line. Searched for any pre-existing DABStep wiring (git log + grep across .ts/.mts/.py/.md/.json); there is none, so no duplication. For the bundle: the runtime/sandbox 0.9.5 compat edits (environment-provider.ts status field, index.ts type re-exports, trata-gepa.mts rename, decoder-live.mts cast) w
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
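
The fixture/live split described above can be sketched as a small pure function. This is an illustration only: `selectMode`, `Mode`, and `Env` are hypothetical names, not the adapter's real API, and the precedence of `DABSTEP_FIXTURES` over `DABSTEP_DIR` when both are set is an assumption. The environment is passed in as a plain object so the sketch needs no Node.js imports.

```typescript
// Hypothetical sketch: these names are illustrative and do not come from dabstep.ts.
type Env = Record<string, string | undefined>;

type Mode =
  | { kind: "fixtures"; fixturePath: string }
  | { kind: "live"; dabstepDir: string };

function selectMode(env: Env): Mode {
  // DABSTEP_FIXTURES=1 selects the bundled fixture file named in the review.
  // Assumption: the fixture flag wins when both variables are set.
  if (env["DABSTEP_FIXTURES"] === "1") {
    return { kind: "fixtures", fixturePath: "bench/fixtures/dabstep.json" };
  }

  const dir = env["DABSTEP_DIR"];
  if (!dir) {
    // Fail loud rather than silently falling back to fixtures.
    throw new Error(
      "DABSTEP_DIR is not set; point it at the released DABStep data or set DABSTEP_FIXTURES=1",
    );
  }
  return { kind: "live", dabstepDir: dir };
}
```

Returning a discriminated union keeps the two modes from sharing optional fields, so a caller has to check `kind` before touching `dabstepDir`.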

🎯 Usefulness — sound

DABStep adapter is wired into the existing adapter registry and follows the exact commit0 pattern (fixtures-mode plumbing + fail-loud preflight + judge delegates to official grade.py); immediately reachable by every gate/replay runner via resolveAdapter('dabstep').

  • Integration: Fully wired. createDabstepAdapter is registered at bench/src/adapters.ts:36, so resolveAdapter('dabstep') returns it — and resolveAdapter/ADAPTERS is consumed by gate-cli.mts:39, trata-gate.mts:150, aec-gate.mts:180, research-gate.mts:40, corpus-replay.mts:217, and re-exported as public API in bench/src/index.ts:13. HARNESS.md:205 documents the setup, and package.json description bumps 18→19 adapt
  • Fit with existing patterns: Textbook fit. It mirrors commit0 (bench/src/benchmarks/commit0.ts) field-for-field: fixtures-mode flag pattern, readMeta shape, selectRows, loadFixtures vs official loader, runVenvScriptStdin delegation to a scripts/*_judge.py bridge, fail-loud preflight. Uses the shared _harness.ts helpers (benchRoot, runVenvPython, runVenvScriptStdin) exactly as designed. No competing dabstep implementation exis
  • Real-world viability: Robust on error paths. preflight (dabstep.ts:148-159) fails loud with actionable guidance for every missing piece: DABSTEP_DIR unset, missing dataset.csv, missing split file, missing grade.py, missing files dir, and even does a 1-row loadOfficialTasks probe to validate the CSV parses. judge (dabstep.ts:182-198) surfaces the bridge's error field and defaults score to 0 rather than crashing on mal
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1
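
The two error-path behaviours noted above (a preflight that names every missing piece, and a judge that reports score 0 instead of crashing) might look roughly like the sketch below. Everything here is hypothetical: the function names, the file layout under `DABSTEP_DIR`, and the assumption that the judge bridge prints a JSON object with `score` and `error` fields are illustrative guesses, not taken from dabstep.ts. The filesystem check is injected as `exists` so the sketch stays free of imports.

```typescript
// Hypothetical sketch; names, paths, and the JSON shape are assumptions.
type Exists = (path: string) => boolean;

// Collect every missing piece so the error lists all of them at once.
function preflightProblems(
  dabstepDir: string | undefined,
  splitFile: string, // assumed: path of the split file relative to dabstepDir
  exists: Exists,
): string[] {
  if (!dabstepDir) return ["DABSTEP_DIR is unset"];
  const required: Array<[string, string]> = [
    ["dataset.csv", `${dabstepDir}/dataset.csv`],
    ["split file", `${dabstepDir}/${splitFile}`],
    ["grade.py", `${dabstepDir}/grade.py`],
    ["files directory", `${dabstepDir}/files`],
  ];
  return required
    .filter(([, path]) => !exists(path))
    .map(([label, path]) => `missing ${label} at ${path}`);
}

function preflight(dabstepDir: string | undefined, splitFile: string, exists: Exists): void {
  const problems = preflightProblems(dabstepDir, splitFile, exists);
  if (problems.length > 0) {
    throw new Error(`DABStep preflight failed:\n- ${problems.join("\n- ")}`);
  }
}

interface JudgeResult {
  score: number;
  error?: string;
}

// Turn the bridge's stdout into a result, defaulting score to 0 on bad input.
function parseJudgeOutput(stdout: string): JudgeResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(stdout);
  } catch {
    return { score: 0, error: "judge output was not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { score: 0, error: "judge output was not a JSON object" };
  }
  const record = parsed as Record<string, unknown>;
  const rawScore = record["score"];
  const score = typeof rawScore === "number" && Number.isFinite(rawScore) ? rawScore : 0;
  const rawError = record["error"];
  return typeof rawError === "string" ? { score, error: rawError } : { score };
}
```

Keeping the problem collection separate from the throw makes the checks easy to unit-test without touching the filesystem.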

💰 Value Audit

🟡 PR title advertises only the DABStep adapter but bundles a runtime/sandbox compat pass [proportion]

src/runtime/environment-provider.ts:509,912 adds a required status field to PromptResult returns (sandbox 0.9.5 compat); src/runtime/index.ts:459-476 adds three new type re-exports; bench/src/trata-gepa.mts:370 renames driverTarget→proposerTarget (agent-eval API rename); bench/src/decoder-live.mts:50 tightens a TS cast. None of these are required by dabstep.ts itself, which imports only OutputAdapter (bench/src/benchmarks/dabstep.ts:14) — a long-stable symbol. These are the compile/run tax of


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass: What it asks
Heuristic: Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication: Do added function/class names already exist elsewhere in the repo?
Value Audit: What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit: Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260629T041952Z

@drewstone drewstone changed the title from "feat(bench): add DABStep adapter" to "feat(bench): add DABStep adapter and SDK compat" Jun 29, 2026
@drewstone drewstone merged commit 5d610e7 into main Jun 29, 2026
1 check passed

2 participants