feat(brief): pre-execution contract negotiation with output evaluation and auto-retry#68
feat(brief): pre-execution contract negotiation with output evaluation and auto-retry#68Nivesh353 wants to merge 2 commits into
Conversation
…uccess criteria Adds a Planner/Evaluator negotiation loop that turns a task prompt into a concrete, versioned contract of binary pass/fail assertions before an agent runs, then grades the agent's actual output against that contract with auto-retry on failure. Wires it into the CLI (`gitagent brief`), the SDK (`runBriefOrchestration`/`runWithBrief`), and the scheduler.
shreyas-lyzr
left a comment
There was a problem hiding this comment.
Solid addition overall — the planner/evaluator negotiation loop is well-thought-out and the hardening in output-evaluator.ts (explicit temperature/maxTokens, distinct failure reasons) is good prior-work cleanup. Four issues worth addressing before merge, in rough priority order.
| model?: string; | ||
| env?: string; | ||
| runPrompt: (prompt: string) => Promise<string>; | ||
| runPrompt: (prompt: string, briefSuffix?: string) => Promise<string>; |
There was a problem hiding this comment.
The runPrompt signature now has an optional second parameter (briefSuffix?: string) but callers outside this repo who implement SchedulerOptions won't know to pass it — they'll silently ignore the brief injection because nothing enforces forwarding the suffix to the underlying query call. Since this is an exported interface, either document the new parameter clearly in a JSDoc comment, or consider shipping a thin adapter so callers don't have to change: e.g. run the brief injection internally inside executeScheduledJob and keep runPrompt with its original one-argument signature. The current design requires every downstream implementor to update their callback even though the caller already has the suffix.
| currentNegotiation = renegotiation; | ||
| finalBrief = { | ||
| ...finalBrief, | ||
| version: version + attempts, |
There was a problem hiding this comment.
Version number computed as version + attempts is wrong when version came from nextVersion() (which already accounts for archived versions). If, say, four prior versions exist, nextVersion returns 5, then on the second regenerate attempt the in-memory brief gets version 7, skipping 6. Use nextVersion(agentDir, id) again here (or track a running counter separately) so the on-disk filename and the in-memory version field stay consistent.
| "| # | Category | Assertion | How to Verify |", | ||
| "|---|---|---|---|", | ||
| ...draft.assertions.map(a => | ||
| `| ${a.id} | ${a.category} | ${a.assertion} | ${a.test} |`, |
There was a problem hiding this comment.
Assertion text and test descriptions are written raw into Markdown table cells without escaping pipe characters. If an assertion contains | (e.g. "word count must be 800 | 1000" or a regex) the table will be corrupted, and parseBrief re-reads the Markdown table as display-only anyway (the canonical data is in the draft_json HTML comment), so the table is purely presentational. Either escape pipes in those fields (.replace(/\|/g, '\\|')) or add a comment noting the table is non-authoritative. Right now a future reader could be confused about why table rows look broken.
| const green = (s: string) => isTTY ? `\x1b[32m${s}\x1b[0m` : s; | ||
| const red = (s: string) => isTTY ? `\x1b[31m${s}\x1b[0m` : s; | ||
|
|
||
| const CATEGORY_COLORS: Record<string, (s: string) => string> = { |
There was a problem hiding this comment.
CATEGORY_COLORS in approval.ts and report.ts are identical and both ignore the module-level isTTY guard — the color lambdas always emit ANSI codes regardless of TTY. The other helpers (dim, bold, etc.) correctly gate on isTTY; these don't. Also, the same map is copy-pasted in both files. Extract it to a shared colors.ts utility and apply the isTTY guard consistently.
- schedule-runner.ts: revert runPrompt to its original one-arg signature, fold the brief suffix into the prompt internally so external SchedulerOptions implementers don't silently drop brief injection - orchestrator.ts: recompute version via nextVersion() on each regenerate attempt instead of `version + attempts`, which could skip version numbers - storage.ts: escape pipe characters in assertion/test text before writing the Markdown success-criteria table - approval.ts, report.ts, colors.ts: extract duplicated CATEGORY_COLORS into a shared colors.ts, fixing the missing isTTY guard in both copies
Summary
Introduces Agent Brief — a negotiation step that runs before an agent
executes a task, producing a concrete, versioned "contract" of binary
pass/fail success criteria (assertions). Once the agent produces output,
a second evaluator grades the output against that contract and can trigger
automatic retries with targeted feedback on what failed.
This closes the gap where an agent's output "looks done" but silently
violates constraints (word count, tone, banned terms, structure) that were
never made explicit or checked.
What's new (
src/brief/, 13 files)planner.ts— LLM call that turns a task + agent's SOUL/RULES/DUTIESinto a
BriefDraft: task summary, ambiguities, binary assertions(format/content/quality/constraint/behavior/tone), a scoring rubric, and
estimated complexity. Assertion count is capped at 15 in code (backing up
the prompt's own "never exceed 15" instruction) to bound evaluator cost.
evaluator.ts— LLM call that reviews the Planner's draft itself(not the agent's output) for vague/unmeasurable assertions, missing
coverage, or contradictions with RULES.md, and returns critical/warning
issues.
negotiator.ts— drives Planner ⇄ Evaluator rounds until the draft isapproved or a round limit is hit.
output-evaluator.ts— after the agent runs, grades its actual outputagainst each approved assertion (binary, evidence-required, strict).
Hardened in this PR: explicit
constraints: { temperature: 0, maxTokens: 8000 }on the grading call (previously silently inherited the host agent's
possibly-too-small
max_tokens, causing truncated/unparseable JSON), plusthe failure path now distinguishes and reports why grading failed —
LLM call error, truncated response, or empty response — instead of a
generic
Unexpected end of JSON input.orchestrator.ts— top-level flow: negotiate → stale-check →approve/edit/regenerate/skip → save.
runner.ts(runWithBrief) — runs the agent with the brief injectedinto its system prompt, evaluates output, and auto-retries (default 2x)
with a prompt that lists exactly which assertions failed and why.
storage.ts— brief persistence: save/load/list/find/version/archiveunder
.gitagent/briefs/, content-hash based staleness keys.stale.ts— detects when SOUL/RULES/DUTIES changed since a brief wasapproved and reports which assertions are affected.
approval.ts— interactive CLI approval flow (approve/edit/regenerate/skip)and brief list/detail rendering.
editor.ts— opens a draft in$EDITORfor manual editing, validatesthe edited JSON against the schema.
injector.ts— serializes an approved brief into a system-prompt suffix.report.ts— renders the final pass/fail evaluation report.types.ts— shared types (Brief,BriefDraft,BriefAssertion,AssertionResult,OutputVerdict, etc).Integration changes (4 files)
src/loader.ts— adds optionalbrief?.modeltoAgentManifestso anagent.yamlcan pin a specific model for brief negotiation/evaluation,independent of the agent's own
model.preferred.src/index.ts— newgitagent briefCLI subcommand:gitagent brief "<task>"— negotiate + approve + run with evaluation--only— negotiate and save without running--brief-path <path>— run with an already-approved brief--list/--view <id>/--regenerate— manage saved briefs--model/-m,--dir/-d,--yes/-ysrc/exports.ts— exposes the brief API (runBriefOrchestration,generateBrief,runWithBrief, storage helpers, and all brief types) fromthe SDK's public surface.
src/schedule-runner.ts— scheduled jobs now look up an approved briefmatching the schedule's prompt and inject it into the run automatically,
warning if it's gone stale.
Why
Assertions are cheap to write and binary to check, which makes "did the
agent actually do what was asked" a testable question instead of a vibe
check — especially valuable for scheduled/unattended runs where nobody is
reading the output live.
Test plan
gitagent brief "write a 500-word post on X" --dir <agent>— negotiate,approve, confirm the assertion contract looks reasonable
gitagent brief --list/--view <id>render correctlywarning appears on next run
(
src/schedule-runner.ts)tsc --noEmitpasses