5 changes: 5 additions & 0 deletions examples/optimization/eval_optimize_loop/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# Runtime side-effects (regenerated on every run); the audited report and the
# runs/latest prompt snapshots are kept in VCS as example deliverables.
__pycache__/
_sdk_eval_metrics.json
runs/latest/agent_optimizer/
110 changes: 110 additions & 0 deletions examples/optimization/eval_optimize_loop/README.md
@@ -0,0 +1,110 @@
# Evaluation + Optimization Loop

## 1. Purpose

This example implements the issue requirement for a reproducible Evaluation + Optimization pipeline. It is not only an `AgentOptimizer` quickstart: it wraps optimization with baseline evaluation, failure attribution, validation regression, gate decisions, and audit artifacts.

The default `fake` mode runs without model credentials. The `live` mode uses a real `LlmAgent` bridge and invokes `AgentOptimizer.optimize` against a `TargetPrompt`.

## 2. Pipeline Stages

The pipeline runs six stages:

1. Baseline evaluation: score train and validation sets separately, including metric scores, pass/fail, reasons, and key trace fields.
2. Failure attribution: cluster failures into `final_response_mismatch`, `tool_call_error`, `parameter_error`, `spurious_tool_call`, `llm_rubric_not_met`, `knowledge_recall_insufficient`, and `format_error`.
3. Optimization execution: fake mode applies a deterministic candidate; live mode calls `AgentOptimizer.optimize` with `TargetPrompt.add_path("system_prompt", ...)`.
4. Candidate validation: rerun train and validation sets and compute per-case deltas such as `new_pass`, `new_fail`, `score_up`, and `score_down`.
5. Acceptance gate: require validation gain, no new hard fail, no key-case regression, no train-up/validation-down overfit, and cost within budget.
6. Audit persistence: write prompt snapshots, scores, deltas, gate reasons, cost, duration, seed, and config snapshots.
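
As a rough illustration of stage 4, the per-case delta labels can be derived by comparing each case's baseline and candidate results. The sketch below is not taken from `run.py`: `CaseResult`, `classify_delta`, the `unchanged` label, and the choice to let pass/fail transitions take precedence over score changes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical per-case record: a score in [0, 1] and a pass flag."""
    score: float
    passed: bool


def classify_delta(baseline: CaseResult, candidate: CaseResult, eps: float = 1e-9) -> str:
    """Label one case; pass/fail transitions take precedence over score moves."""
    if candidate.passed and not baseline.passed:
        return "new_pass"
    if baseline.passed and not candidate.passed:
        return "new_fail"
    if candidate.score > baseline.score + eps:
        return "score_up"
    if candidate.score < baseline.score - eps:
        return "score_down"
    return "unchanged"  # assumed label for cases that did not move


# Illustrative data only; the case ids are placeholders.
baseline = {
    "case_a": CaseResult(0.0, False),
    "case_b": CaseResult(1.0, True),
    "case_c": CaseResult(1.0, True),
}
candidate = {
    "case_a": CaseResult(1.0, True),
    "case_b": CaseResult(0.0, False),
    "case_c": CaseResult(0.8, True),
}
deltas = {cid: classify_delta(baseline[cid], candidate[cid]) for cid in baseline}
print(deltas)
```

Computing deltas per case rather than only comparing means is what lets the gate see a regression on a single key case even when the average improves.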

## 3. Directory Layout

```text
examples/optimization/eval_optimize_loop/
├── agent/
│   ├── __init__.py
│   └── agent.py
├── prompts/
│   └── system.md
├── train.evalset.json
├── val.evalset.json
├── case_meta.json
├── optimizer.json
├── optimizer.sdk.json
├── run.py
├── optimization_report.json
└── optimization_report.md
```

## 4. Inputs

- `train.evalset.json`: training evaluation set.
- `val.evalset.json`: validation evaluation set; it must be a different file from the training set.
- `optimizer.json`: outer-loop configuration for mode, metrics, fake candidate patch, and gate thresholds.
- `prompts/system.md`: baseline prompt source registered as the optimization target.
- `case_meta.json`: out-of-schema metadata for key cases, rubric kinds, and attribution hints.
- `optimizer.sdk.json`: live-only SDK optimizer config passed to `AgentOptimizer.optimize`.
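
The requirement that train and validation be different files can be checked before any evaluation runs. The helper below is a hypothetical sketch, not part of `run.py`. It compares resolved paths only, so it would not catch two distinct files with identical contents.

```python
from pathlib import Path


def assert_distinct_evalsets(train_path: Path, val_path: Path) -> None:
    """Raise ValueError if both paths resolve to the same location on disk."""
    if train_path.resolve() == val_path.resolve():
        raise ValueError(f"train and validation evalsets must differ: {train_path}")


# Passes silently: the two names resolve to different paths.
assert_distinct_evalsets(Path("train.evalset.json"), Path("val.evalset.json"))
```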

## 5. Outputs

- `optimization_report.json`: machine-readable audit report with baseline, candidate, delta, gate, attribution, optimizer status, cost, duration, seed, and config snapshot.
- `optimization_report.md`: human-readable decision summary.
- `runs/latest/baseline_prompt.md`: exact baseline prompt snapshot.
- `runs/latest/candidate_prompt.md`: candidate prompt snapshot.
- `runs/latest/agent_optimizer/`: live-only raw SDK artifacts, including `RoundRecord`-backed round files, `result.json`, `summary.txt`, and `best_prompts/`.

## 6. Run Modes

Fake mode:

```bash
python examples/optimization/eval_optimize_loop/run.py --mode fake
```

Live mode:

```bash
# On Windows cmd, use `set NAME=value` instead of `export NAME=value`.
export TRPC_AGENT_API_KEY=...
export TRPC_AGENT_BASE_URL=...
export TRPC_AGENT_MODEL_NAME=...
python examples/optimization/eval_optimize_loop/run.py --mode live
```

`fake` mode uses a deterministic fake model, fake judge, and scripted candidate so the full loop runs without API keys. `live` mode uses `agent/agent.py`, creates a fresh `LlmAgent` for each call, and invokes `AgentOptimizer.optimize`.

## 7. Customizing The Agent

Edit `agent/agent.py` when connecting a real business agent.

Key constraints:

- `make_call_agent(prompt_path)` must return an async function with the exact optimizer contract `async (query: str) -> str`.
- `create_agent(prompt_path)` must re-read the prompt file every time so candidates written by `AgentOptimizer` take effect immediately.
- `TargetPrompt.add_path("system_prompt", path)` must point to the same prompt file that the agent actually reads.
- For HTTP, CLI, remote config, or multi-agent pipelines, keep the outer contract the same and replace only the bridge implementation.

The outer report still computes richer trace-style scoring. The SDK optimizer itself receives final-text responses through `call_agent`, so `optimizer.sdk.json` intentionally avoids metrics that require full session traces.
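
The bridge constraints above can be exercised without the SDK. The following is a minimal sketch with a stub in place of a real agent run: `make_stub_call_agent` and the echo-style reply are hypothetical, and only the `async (query: str) -> str` shape and the re-read-on-every-call behavior mirror the constraints described in this section.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Awaitable, Callable


def make_stub_call_agent(prompt_path: Path) -> Callable[[str], Awaitable[str]]:
    """Return an ``async (query: str) -> str`` callable bound to one prompt file."""

    async def call_agent(query: str) -> str:
        # Re-read on every call so a rewritten prompt file takes effect at once.
        prompt = prompt_path.read_text(encoding="utf-8").strip()
        # Stand-in for a real agent run: echo which prompt handled the query.
        return f"[{prompt}] {query}"

    return call_agent


async def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        prompt_path = Path(tmp) / "system.md"
        call_agent = make_stub_call_agent(prompt_path)
        prompt_path.write_text("baseline prompt", encoding="utf-8")
        first = await call_agent("Where is order A100?")
        # Simulate an optimizer writing a candidate to the same file.
        prompt_path.write_text("candidate prompt", encoding="utf-8")
        second = await call_agent("Where is order A100?")
        return [first, second]


replies = asyncio.run(demo())
print(replies)
```

If the prompt were read once when the callable is built, the second reply would still reflect the baseline prompt, and every candidate would score the same as the baseline.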

## 8. Design And Validation

Failure attribution is rule-based over structured signals, not case ids. Each case records final response, tool trajectory, rubric sub-scores, and expected/actual tool calls. Rubric failures map to `format_error` or `llm_rubric_not_met`; tool mismatches map to tool, parameter, spurious-call, or knowledge-recall categories.
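
A rule-based attribution of this kind might look like the sketch below. It is an illustration under stated assumptions, not the logic in `run.py`: `ToolCall`, `CaseSignals`, `attribute_failure`, and the rule order are hypothetical, and the sketch covers only five categories, leaving out the spurious-call and knowledge-recall rules.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class CaseSignals:
    """Hypothetical structured signals recorded for one failed case."""
    expected_tools: list
    actual_tools: list
    failed_rubric: Optional[str] = None  # e.g. "json_format"; None if rubrics passed


def attribute_failure(s: CaseSignals) -> str:
    """Map signals to one category; the rule order here is an assumption."""
    if s.failed_rubric == "json_format":
        return "format_error"
    if s.failed_rubric is not None:
        return "llm_rubric_not_met"
    expected_names = [c.name for c in s.expected_tools]
    actual_names = [c.name for c in s.actual_tools]
    if expected_names != actual_names:
        return "tool_call_error"
    if any(e.args != a.args for e, a in zip(s.expected_tools, s.actual_tools)):
        return "parameter_error"
    # Tools and rubrics look fine, so the final text itself is the remaining suspect.
    return "final_response_mismatch"


want = [ToolCall("lookup_order", {"order_id": "A100"})]
labels = [
    attribute_failure(CaseSignals(want, [])),
    attribute_failure(CaseSignals(want, [ToolCall("lookup_order", {"order_id": "A200"})])),
    attribute_failure(CaseSignals(want, want, failed_rubric="json_format")),
    attribute_failure(CaseSignals(want, want)),
]
print(labels)
```

Because the rules read only recorded signals, the same function applies to cases it has never seen, which is the point of not keying attribution on case ids.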

The gate is validation-first. A candidate is accepted only if validation mean improves by the configured threshold, no new hard fail appears, key validation cases do not regress, train improvement does not coincide with validation loss, and cost is within budget.
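
The gate conditions can be sketched as a list of independent checks that each contribute a reason. This is a hedged illustration, not the code in `run.py`: `GateInputs`, `gate`, the `min_val_gain` default, the `ACCEPT` label, and the reason strings are assumptions. The train and validation means in the usage lines come from the sample result in this README; the hard-fail count, cost, and budget are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    """Hypothetical aggregate numbers fed to the gate."""
    train_before: float
    train_after: float
    val_before: float
    val_after: float
    new_hard_fails: int
    key_case_regressions: int
    cost: float
    budget: float


def gate(x: GateInputs, min_val_gain: float = 0.01) -> tuple:
    """Return (decision, reasons); every failed check adds one reason."""
    reasons = []
    val_gain = x.val_after - x.val_before
    if val_gain < min_val_gain:
        reasons.append("validation gain below threshold")
    if x.new_hard_fails > 0:
        reasons.append("new hard fail")
    if x.key_case_regressions > 0:
        reasons.append("key validation case regressed")
    if x.train_after > x.train_before and val_gain < 0:
        reasons.append("train up while validation down (overfit)")
    if x.cost > x.budget:
        reasons.append("cost over budget")
    return ("REJECT" if reasons else "ACCEPT"), reasons


decision, reasons = gate(
    GateInputs(
        train_before=0.25, train_after=0.7833,  # sample means from this README
        val_before=0.7333, val_after=0.6667,
        new_hard_fails=1,         # assumed for illustration
        key_case_regressions=2,   # the fake candidate damages two key validation cases
        cost=0.0, budget=1.0,     # assumed for illustration
    )
)
print(decision, reasons)
```

Collecting all reasons instead of returning on the first failure keeps the audit report complete: a rejected candidate shows every condition it violated.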

The bundled fake candidate intentionally improves two train cases and one validation case while damaging two key validation cases. The expected sample decision is `REJECT`, demonstrating overfit rejection.

Verified fake command:

```bash
C:\Users\27303\PycharmProjects\Yun\.venv\Scripts\python.exe examples\optimization\eval_optimize_loop\run.py --mode fake
```

Observed sample result:

```text
train: 0.25 -> 0.7833
validation: 0.7333 -> 0.6667
decision: REJECT
```

Known limits: live mode requires SDK dependencies plus `TRPC_AGENT_API_KEY`, `TRPC_AGENT_BASE_URL`, and `TRPC_AGENT_MODEL_NAME`; no-key environments should use `--mode fake`.
1 change: 1 addition & 0 deletions examples/optimization/eval_optimize_loop/agent/__init__.py
@@ -0,0 +1 @@
"""Agent bridge package for the eval_optimize_loop example."""
142 changes: 142 additions & 0 deletions examples/optimization/eval_optimize_loop/agent/agent.py
@@ -0,0 +1,142 @@
# Tencent is pleased to support the open source community by making tRPC-Agent-Python available.
#
# Copyright (C) 2026 Tencent. All rights reserved.
#
# tRPC-Agent-Python is licensed under Apache-2.0.
"""Live agent bridge for the eval_optimize_loop example.

The optimizer contract is intentionally small: ``call_agent`` is an async
function that accepts one user query and returns the final response text. This
module re-reads the prompt file on every invocation so prompt candidates written
by AgentOptimizer take effect immediately.

The public bridge in this file mirrors the SDK docs:

* ``create_agent`` builds a fresh ``LlmAgent`` from the current prompt file.
* ``run_agent`` drives that agent through ``Runner`` and ``InMemorySessionService``.
* ``make_call_agent`` returns the exact async callable required by
  ``AgentOptimizer.optimize`` when a ``TargetPrompt`` is registered.
"""

from __future__ import annotations

import os
import uuid
from pathlib import Path
from typing import Any
from typing import Awaitable
from typing import Callable

from trpc_agent_sdk.agents import LlmAgent
from trpc_agent_sdk.models import OpenAIModel
from trpc_agent_sdk.runners import Runner
from trpc_agent_sdk.sessions import InMemorySessionService
from trpc_agent_sdk.tools import FunctionTool
from trpc_agent_sdk.types import Content
from trpc_agent_sdk.types import Part


APP_NAME = "eval_optimize_loop"


def lookup_order(order_id: str) -> str:
    """FunctionTool body used by the live ``LlmAgent`` example."""
    data = {
        "A100": "Order A100 is in transit and arrives on Friday.",
        "A200": "Order A200 is delivered.",
    }
    return data.get(order_id, f"No order record found for {order_id}.")


def search_policy(topic: str) -> str:
    """FunctionTool body for policy and warranty lookup examples."""
    topic_lower = topic.lower()
    if "damaged" in topic_lower or "refund" in topic_lower:
        return "Damaged items are eligible for a full refund within 30 days."
    if "model z" in topic_lower or "warranty" in topic_lower:
        return "Model Z has a 24-month warranty."
    return "No matching policy snippet was found."


def get_model_config() -> tuple[str, str, str]:
    """Read live model credentials consumed by ``OpenAIModel``."""
    api_key = os.getenv("TRPC_AGENT_API_KEY", "")
    base_url = os.getenv("TRPC_AGENT_BASE_URL", "")
    model_name = os.getenv("TRPC_AGENT_MODEL_NAME", "")
    if not api_key or not base_url or not model_name:
        raise ValueError(
            "Live mode requires TRPC_AGENT_API_KEY, TRPC_AGENT_BASE_URL, and "
            "TRPC_AGENT_MODEL_NAME. Use --mode fake for the no-key path."
        )
    return api_key, base_url, model_name


def create_agent(prompt_path: Path) -> LlmAgent:
    """Create a fresh ``LlmAgent`` from the current prompt file.

    Re-reading here is the critical TargetPrompt contract: when
    ``AgentOptimizer`` writes a candidate prompt, the next call immediately uses
    that candidate without restarting the process.
    """
    api_key, base_url, model_name = get_model_config()
    instruction = Path(prompt_path).read_text(encoding="utf-8").strip()
    return LlmAgent(
        name="support_assistant",
        description="A support assistant whose system prompt is under optimization.",
        model=OpenAIModel(model_name=model_name, api_key=api_key, base_url=base_url),
        instruction=instruction,
        tools=[FunctionTool(lookup_order), FunctionTool(search_policy)],
    )


async def run_agent(query: str, prompt_path: Path) -> dict[str, Any]:
    """Run the live agent once and collect final text plus tool calls.

    ``AgentOptimizer.optimize`` only needs final response text, but the outer
    issue-level report also wants key trajectory information. This richer helper
    supports both.
    """
    agent = create_agent(prompt_path)
    session_service = InMemorySessionService()
    runner = Runner(app_name=APP_NAME, agent=agent, session_service=session_service)
    session_id = str(uuid.uuid4())
    user_id = "optimizer"
    await session_service.create_session(
        app_name=APP_NAME,
        user_id=user_id,
        session_id=session_id,
        state={},
    )
    message = Content(role="user", parts=[Part.from_text(text=query)])
    final_text = ""
    tools: list[dict[str, Any]] = []
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=message,
    ):
        if not event.content or not event.content.parts:
            continue
        for part in event.content.parts:
            function_call = getattr(part, "function_call", None)
            if function_call is not None:
                tools.append(
                    {
                        "name": getattr(function_call, "name", None),
                        "args": dict(getattr(function_call, "args", {}) or {}),
                    }
                )
        if event.is_final_response():
            for part in event.content.parts:
                if getattr(part, "text", None) and not getattr(part, "thought", False):
                    final_text += part.text
    return {"text": final_text.strip(), "tools": tools}


def make_call_agent(prompt_path: Path) -> Callable[[str], Awaitable[str]]:
    """Return the fixed async ``(query: str) -> str`` bridge required by GEPA."""

    async def call_agent(query: str) -> str:
        return (await run_agent(query=query, prompt_path=prompt_path))["text"]

    return call_agent
35 changes: 35 additions & 0 deletions examples/optimization/eval_optimize_loop/case_meta.json
@@ -0,0 +1,35 @@
{
  "_comment": "Per-case metadata for attribution, gate checks, and fake/live trace scoring. It is kept outside evalsets so EvalSet schema validation remains clean.",
  "train_order_lookup_optimizable": {
    "category": "tool_call_error",
    "key": false,
    "rubric": "none"
  },
  "train_refund_policy_optimizable": {
    "category": "knowledge_recall_insufficient",
    "key": false,
    "rubric": "none",
    "authoritative_tool": "search_policy"
  },
  "train_json_format_ineffective": {
    "category": "format_error",
    "key": false,
    "rubric": "json_format"
  },
  "val_warranty_new_pass": {
    "category": "knowledge_recall_insufficient",
    "key": false,
    "rubric": "none",
    "authoritative_tool": "search_policy"
  },
  "val_smalltalk_regression": {
    "category": "spurious_tool_call",
    "key": true,
    "rubric": "no_tool"
  },
  "val_order_soft_degradation": {
    "category": "spurious_tool_call",
    "key": true,
    "rubric": "single_tool"
  }
}