Skip to content

feat(brief): pre-execution contract negotiation with output evaluation and auto-retry#68

Open
Nivesh353 wants to merge 2 commits into
open-gitagent:mainfrom
Nivesh353:feat/contract-negotiation
Open

feat(brief): pre-execution contract negotiation with output evaluation and auto-retry#68
Nivesh353 wants to merge 2 commits into
open-gitagent:mainfrom
Nivesh353:feat/contract-negotiation

Conversation

@Nivesh353

Copy link
Copy Markdown
Contributor

Summary

Introduces Agent Brief — a negotiation step that runs before an agent
executes a task, producing a concrete, versioned "contract" of binary
pass/fail success criteria (assertions). Once the agent produces output,
a second evaluator grades the output against that contract and can trigger
automatic retries with targeted feedback on what failed.

This closes the gap where an agent's output "looks done" but silently
violates constraints (word count, tone, banned terms, structure) that were
never made explicit or checked.

What's new (src/brief/, 13 files)

  • planner.ts — LLM call that turns a task + agent's SOUL/RULES/DUTIES
    into a BriefDraft: task summary, ambiguities, binary assertions
    (format/content/quality/constraint/behavior/tone), a scoring rubric, and
    estimated complexity. Assertion count is capped at 15 in code (backing up
    the prompt's own "never exceed 15" instruction) to bound evaluator cost.
  • evaluator.ts — LLM call that reviews the Planner's draft itself
    (not the agent's output) for vague/unmeasurable assertions, missing
    coverage, or contradictions with RULES.md, and returns critical/warning
    issues.
  • negotiator.ts — drives Planner ⇄ Evaluator rounds until the draft is
    approved or a round limit is hit.
  • output-evaluator.ts — after the agent runs, grades its actual output
    against each approved assertion (binary, evidence-required, strict).
    Hardened in this PR: explicit constraints: { temperature: 0, maxTokens: 8000 }
    on the grading call (previously silently inherited the host agent's
    possibly-too-small max_tokens, causing truncated/unparseable JSON), plus
    the failure path now distinguishes and reports why grading failed —
    LLM call error, truncated response, or empty response — instead of a
    generic Unexpected end of JSON input.
  • orchestrator.ts — top-level flow: negotiate → stale-check →
    approve/edit/regenerate/skip → save.
  • runner.ts (runWithBrief) — runs the agent with the brief injected
    into its system prompt, evaluates output, and auto-retries (default 2x)
    with a prompt that lists exactly which assertions failed and why.
  • storage.ts — brief persistence: save/load/list/find/version/archive
    under .gitagent/briefs/, content-hash based staleness keys.
  • stale.ts — detects when SOUL/RULES/DUTIES changed since a brief was
    approved and reports which assertions are affected.
  • approval.ts — interactive CLI approval flow (approve/edit/regenerate/skip)
    and brief list/detail rendering.
  • editor.ts — opens a draft in $EDITOR for manual editing, validates
    the edited JSON against the schema.
  • injector.ts — serializes an approved brief into a system-prompt suffix.
  • report.ts — renders the final pass/fail evaluation report.
  • types.ts — shared types (Brief, BriefDraft, BriefAssertion,
    AssertionResult, OutputVerdict, etc).

Integration changes (4 files)

  • src/loader.ts — adds optional brief?.model to AgentManifest so an
    agent.yaml can pin a specific model for brief negotiation/evaluation,
    independent of the agent's own model.preferred.
  • src/index.ts — new gitagent brief CLI subcommand:
    • gitagent brief "<task>" — negotiate + approve + run with evaluation
    • --only — negotiate and save without running
    • --brief-path <path> — run with an already-approved brief
    • --list / --view <id> / --regenerate — manage saved briefs
    • --model / -m, --dir / -d, --yes / -y
  • src/exports.ts — exposes the brief API (runBriefOrchestration,
    generateBrief, runWithBrief, storage helpers, and all brief types) from
    the SDK's public surface.
  • src/schedule-runner.ts — scheduled jobs now look up an approved brief
    matching the schedule's prompt and inject it into the run automatically,
    warning if it's gone stale.

Why

Assertions are cheap to write and binary to check, which makes "did the
agent actually do what was asked" a testable question instead of a vibe
check — especially valuable for scheduled/unattended runs where nobody is
reading the output live.

Test plan

  • gitagent brief "write a 500-word post on X" --dir <agent> — negotiate,
    approve, confirm the assertion contract looks reasonable
    • Confirm auto-retry fires when an assertion is deliberately violated
  • gitagent brief --list / --view <id> render correctly
  • Edit an agent's SOUL.md after approving a brief — confirm staleness
    warning appears on next run
  • Scheduled job with a matching approved brief picks it up automatically
    (src/schedule-runner.ts)
  • tsc --noEmit passes

…uccess criteria

Adds a Planner/Evaluator negotiation loop that turns a task prompt into a
concrete, versioned contract of binary pass/fail assertions before an agent
runs, then grades the agent's actual output against that contract with
auto-retry on failure. Wires it into the CLI (`gitagent brief`), the SDK
(`runBriefOrchestration`/`runWithBrief`), and the scheduler.

@shreyas-lyzr shreyas-lyzr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid addition overall — the planner/evaluator negotiation loop is well-thought-out and the hardening in output-evaluator.ts (explicit temperature/maxTokens, distinct failure reasons) is good prior-work cleanup. Four issues worth addressing before merge, in rough priority order.

Comment thread src/schedule-runner.ts Outdated
model?: string;
env?: string;
runPrompt: (prompt: string) => Promise<string>;
runPrompt: (prompt: string, briefSuffix?: string) => Promise<string>;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runPrompt signature now has an optional second parameter (briefSuffix?: string) but callers outside this repo who implement SchedulerOptions won't know to pass it — they'll silently ignore the brief injection because nothing enforces forwarding the suffix to the underlying query call. Since this is an exported interface, either document the new parameter clearly in a JSDoc comment, or consider shipping a thin adapter so callers don't have to change: e.g. run the brief injection internally inside executeScheduledJob and keep runPrompt with its original one-argument signature. The current design requires every downstream implementor to update their callback even though the caller already has the suffix.

Comment thread src/brief/orchestrator.ts Outdated
currentNegotiation = renegotiation;
finalBrief = {
...finalBrief,
version: version + attempts,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version number computed as version + attempts is wrong when version came from nextVersion() (which already accounts for archived versions). If, say, four prior versions exist, nextVersion returns 5, then on the second regenerate attempt the in-memory brief gets version 7, skipping 6. Use nextVersion(agentDir, id) again here (or track a running counter separately) so the on-disk filename and the in-memory version field stay consistent.

Comment thread src/brief/storage.ts Outdated
"| # | Category | Assertion | How to Verify |",
"|---|---|---|---|",
...draft.assertions.map(a =>
`| ${a.id} | ${a.category} | ${a.assertion} | ${a.test} |`,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertion text and test descriptions are written raw into Markdown table cells without escaping pipe characters. If an assertion contains | (e.g. "word count must be 800 | 1000" or a regex) the table will be corrupted, and parseBrief re-reads the Markdown table as display-only anyway (the canonical data is in the draft_json HTML comment), so the table is purely presentational. Either escape pipes in those fields (.replace(/\|/g, '\\|')) or add a comment noting the table is non-authoritative. Right now a future reader could be confused about why table rows look broken.

Comment thread src/brief/approval.ts Outdated
const green = (s: string) => isTTY ? `\x1b[32m${s}\x1b[0m` : s;
const red = (s: string) => isTTY ? `\x1b[31m${s}\x1b[0m` : s;

const CATEGORY_COLORS: Record<string, (s: string) => string> = {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CATEGORY_COLORS in approval.ts and report.ts are identical and both ignore the module-level isTTY guard — the color lambdas always emit ANSI codes regardless of TTY. The other helpers (dim, bold, etc.) correctly gate on isTTY; these don't. Also, the same map is copy-pasted in both files. Extract it to a shared colors.ts utility and apply the isTTY guard consistently.

- schedule-runner.ts: revert runPrompt to its original one-arg signature,
  fold the brief suffix into the prompt internally so external
  SchedulerOptions implementers don't silently drop brief injection
- orchestrator.ts: recompute version via nextVersion() on each regenerate
  attempt instead of `version + attempts`, which could skip version numbers
- storage.ts: escape pipe characters in assertion/test text before writing
  the Markdown success-criteria table
- approval.ts, report.ts, colors.ts: extract duplicated CATEGORY_COLORS
  into a shared colors.ts, fixing the missing isTTY guard in both copies
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants