trpc-group · Adonis-a233 · Jun 30, 2026 · Jun 30, 2026
diff --git a/examples/optimization/eval_optimize_loop/.gitignore b/examples/optimization/eval_optimize_loop/.gitignore
@@ -0,0 +1,5 @@
+# Runtime side-effects (regenerated on every run); the audited report and the
+# runs/latest prompt snapshots are kept in VCS as example deliverables.
+__pycache__/
+_sdk_eval_metrics.json
+runs/latest/agent_optimizer/
diff --git a/examples/optimization/eval_optimize_loop/README.md b/examples/optimization/eval_optimize_loop/README.md
@@ -0,0 +1,110 @@
+# Evaluation + Optimization Loop
+
+## 1. Purpose
+
+This example implements the issue requirement for a reproducible Evaluation + Optimization pipeline. It goes beyond an `AgentOptimizer` quickstart by wrapping optimization with baseline evaluation, failure attribution, validation regression checks, gate decisions, and audit artifacts.
+
+The default `fake` mode runs without model credentials. The `live` mode uses a real `LlmAgent` bridge and invokes `AgentOptimizer.optimize` against a `TargetPrompt`.
+
+## 2. Pipeline Stages
+
+The pipeline runs six stages:
+
+1. Baseline evaluation: score train and validation sets separately, including metric scores, pass/fail, reasons, and key trace fields.
+2. Failure attribution: cluster failures into `final_response_mismatch`, `tool_call_error`, `parameter_error`, `spurious_tool_call`, `llm_rubric_not_met`, `knowledge_recall_insufficient`, and `format_error`.
+3. Optimization execution: fake mode applies a deterministic candidate; live mode calls `AgentOptimizer.optimize` with `TargetPrompt.add_path("system_prompt", ...)`.
+4. Candidate validation: rerun train and validation sets and compute per-case deltas such as `new_pass`, `new_fail`, `score_up`, and `score_down`.
+5. Acceptance gate: require validation gain, no new hard fail, no key-case regression, no train-up/validation-down overfit, and cost within budget.
+6. Audit persistence: write prompt snapshots, scores, deltas, gate reasons, cost, duration, seed, and config snapshots.
+
+## 3. Directory Layout
+
+```text
+examples/optimization/eval_optimize_loop/
+├── agent/
+│   ├── __init__.py
+│   └── agent.py
+├── prompts/
+│   └── system.md
+├── train.evalset.json
+├── val.evalset.json
+├── case_meta.json
+├── optimizer.json
+├── optimizer.sdk.json
+├── run.py
+├── optimization_report.json
+└── optimization_report.md
+```
+
+## 4. Inputs
+
+- `train.evalset.json`: training evaluation set.
+- `val.evalset.json`: validation evaluation set; it must be a different file from train.
+- `optimizer.json`: outer-loop configuration for mode, metrics, fake candidate patch, and gate thresholds.
+- `prompts/system.md`: baseline prompt source registered as the optimization target.
+- `case_meta.json`: out-of-schema metadata for key cases, rubric kinds, and attribution hints.
+- `optimizer.sdk.json`: live-only SDK optimizer config passed to `AgentOptimizer.optimize`.
+
+## 5. Outputs
+
+- `optimization_report.json`: machine-readable audit report with baseline, candidate, delta, gate, attribution, optimizer status, cost, duration, seed, and config snapshot.
+- `optimization_report.md`: human-readable decision summary.
+- `runs/latest/baseline_prompt.md`: exact baseline prompt snapshot.
+- `runs/latest/candidate_prompt.md`: candidate prompt snapshot.
+- `runs/latest/agent_optimizer/`: live-only raw SDK artifacts, including `RoundRecord`-backed round files, `result.json`, `summary.txt`, and `best_prompts/`.
+
+## 6. Run Modes
+
+Fake mode:
+
+```bash
+python examples/optimization/eval_optimize_loop/run.py --mode fake
+```
+
+Live mode:
+
+```bash
+export TRPC_AGENT_API_KEY=...
+export TRPC_AGENT_BASE_URL=...
+export TRPC_AGENT_MODEL_NAME=...
+python examples/optimization/eval_optimize_loop/run.py --mode live
+```
+
+`fake` mode uses a deterministic fake model, fake judge, and scripted candidate so the full loop runs without API keys. `live` mode uses `agent/agent.py`, creates a fresh `LlmAgent` for each call, and invokes `AgentOptimizer.optimize`.
+
+## 7. Customizing The Agent
+
+Edit `agent/agent.py` when connecting a real business agent.
+
+Key constraints:
+
+- `make_call_agent(prompt_path)` must return an async function with the exact optimizer contract `async (query: str) -> str`.
+- `create_agent(prompt_path)` must re-read the prompt file every time so candidates written by `AgentOptimizer` take effect immediately.
+- `TargetPrompt.add_path("system_prompt", path)` must point to the same prompt file that the agent actually reads.
+- For HTTP, CLI, remote config, or multi-agent pipelines, keep the outer contract the same and replace only the bridge implementation.
+
+The outer report still computes richer trace-style scoring. The SDK optimizer itself receives final-text responses through `call_agent`, so `optimizer.sdk.json` intentionally avoids metrics that require full session traces.
+
+## 8. Design And Validation
+
+Failure attribution is rule-based over structured signals, not case ids. Each case records the final response, tool trajectory, rubric sub-scores, and expected/actual tool calls. Rubric failures map to `format_error` or `llm_rubric_not_met`; tool mismatches map to `tool_call_error`, `parameter_error`, `spurious_tool_call`, or `knowledge_recall_insufficient`.
+
+The gate is validation-first. A candidate is accepted only if validation mean improves by the configured threshold, no new hard fail appears, key validation cases do not regress, train improvement does not coincide with validation loss, and cost is within budget.
+
+The bundled fake candidate intentionally improves two train cases and one validation case while damaging two key validation cases. The expected sample decision is `REJECT`, demonstrating overfit rejection.
+
+Verified fake command:
+
+```bash
+python examples/optimization/eval_optimize_loop/run.py --mode fake
+```
+
+Observed sample result:
+
+```text
+train: 0.25 -> 0.7833
+validation: 0.7333 -> 0.6667
+decision: REJECT
+```
+
+Known limits: live mode requires SDK dependencies plus `TRPC_AGENT_API_KEY`, `TRPC_AGENT_BASE_URL`, and `TRPC_AGENT_MODEL_NAME`; no-key environments should use `--mode fake`.
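A minimal sketch of the validation-first gate described in README section 8, assuming nothing about `run.py`, which is not shown in this excerpt. The delta labels and the five gate conditions come from the README. The names `CaseResult`, `classify_delta`, and `gate`, the extra `unchanged` label, the `min_val_gain` and `max_cost` defaults, and the per-case scores are all hypothetical. The mean scores are the observed sample result quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    score: float
    passed: bool


def classify_delta(before: CaseResult, after: CaseResult) -> str:
    """Label one case with the README's delta names ("unchanged" is an assumed extra)."""
    if after.passed != before.passed:
        return "new_pass" if after.passed else "new_fail"
    if after.score != before.score:
        return "score_up" if after.score > before.score else "score_down"
    return "unchanged"


def gate(train_delta, val_delta, val_labels, key_ids, cost,
         min_val_gain=0.01, max_cost=1.0):
    """Return (decision, reasons); thresholds here are placeholder values."""
    reasons = []
    if val_delta < min_val_gain:
        reasons.append("validation gain below threshold")
    if "new_fail" in val_labels.values():
        reasons.append("new hard fail on validation")
    if any(val_labels.get(cid) in ("new_fail", "score_down") for cid in key_ids):
        reasons.append("key validation case regressed")
    if train_delta > 0 and val_delta < 0:
        reasons.append("train up / validation down (overfit)")
    if cost > max_cost:
        reasons.append("cost over budget")
    return ("REJECT" if reasons else "ACCEPT"), reasons


# Illustrative per-case scores; the real report values are not in this excerpt.
val_labels = {
    "val_warranty_new_pass": classify_delta(CaseResult(0.0, False), CaseResult(1.0, True)),
    "val_smalltalk_regression": classify_delta(CaseResult(1.0, True), CaseResult(0.0, False)),
    "val_order_soft_degradation": classify_delta(CaseResult(1.0, True), CaseResult(0.8, True)),
}
key_ids = ["val_smalltalk_regression", "val_order_soft_degradation"]

# Mean deltas taken from the observed sample result in the README.
train_delta = round(0.7833 - 0.25, 4)
val_delta = round(0.6667 - 0.7333, 4)

decision, reasons = gate(train_delta, val_delta, val_labels, key_ids, cost=0.0)
print(decision)
for reason in reasons:
    print("-", reason)
```

Collecting every failed condition, rather than stopping at the first, keeps the audit report useful: a rejected candidate shows all the reasons it was rejected.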
diff --git a/examples/optimization/eval_optimize_loop/agent/__init__.py b/examples/optimization/eval_optimize_loop/agent/__init__.py
@@ -0,0 +1 @@
+"""Agent bridge package for the eval_optimize_loop example."""
diff --git a/examples/optimization/eval_optimize_loop/agent/agent.py b/examples/optimization/eval_optimize_loop/agent/agent.py
@@ -0,0 +1,142 @@
+# Tencent is pleased to support the open source community by making tRPC-Agent-Python available.
+#
+# Copyright (C) 2026 Tencent. All rights reserved.
+#
+# tRPC-Agent-Python is licensed under Apache-2.0.
+"""Live agent bridge for the eval_optimize_loop example.
+
+The optimizer contract is intentionally small: ``call_agent`` is an async
+function that accepts one user query and returns the final response text. This
+module re-reads the prompt file on every invocation so prompt candidates written
+by AgentOptimizer take effect immediately.
+
+The public bridge in this file mirrors the SDK docs:
+
+* ``create_agent`` builds a fresh ``LlmAgent`` from the current prompt file.
+* ``run_agent`` drives that agent through ``Runner`` and ``InMemorySessionService``.
+* ``make_call_agent`` returns the exact async callable required by
+  ``AgentOptimizer.optimize`` when a ``TargetPrompt`` is registered.
+"""
+
+from __future__ import annotations
+
+import os
+import uuid
+from pathlib import Path
+from typing import Any
+from typing import Awaitable
+from typing import Callable
+
+from trpc_agent_sdk.agents import LlmAgent
+from trpc_agent_sdk.models import OpenAIModel
+from trpc_agent_sdk.runners import Runner
+from trpc_agent_sdk.sessions import InMemorySessionService
+from trpc_agent_sdk.tools import FunctionTool
+from trpc_agent_sdk.types import Content
+from trpc_agent_sdk.types import Part
+
+
+APP_NAME = "eval_optimize_loop"
+
+
+def lookup_order(order_id: str) -> str:
+    """FunctionTool body used by the live ``LlmAgent`` example."""
+    data = {
+        "A100": "Order A100 is in transit and arrives on Friday.",
+        "A200": "Order A200 is delivered.",
+    }
+    return data.get(order_id, f"No order record found for {order_id}.")
+
+
+def search_policy(topic: str) -> str:
+    """FunctionTool body for policy and warranty lookup examples."""
+    topic_lower = topic.lower()
+    if "damaged" in topic_lower or "refund" in topic_lower:
+        return "Damaged items are eligible for a full refund within 30 days."
+    if "model z" in topic_lower or "warranty" in topic_lower:
+        return "Model Z has a 24-month warranty."
+    return "No matching policy snippet was found."
+
+
+def get_model_config() -> tuple[str, str, str]:
+    """Read live model credentials consumed by ``OpenAIModel``."""
+    api_key = os.getenv("TRPC_AGENT_API_KEY", "")
+    base_url = os.getenv("TRPC_AGENT_BASE_URL", "")
+    model_name = os.getenv("TRPC_AGENT_MODEL_NAME", "")
+    if not api_key or not base_url or not model_name:
+        raise ValueError(
+            "Live mode requires TRPC_AGENT_API_KEY, TRPC_AGENT_BASE_URL, and "
+            "TRPC_AGENT_MODEL_NAME. Use --mode fake for the no-key path."
+        )
+    return api_key, base_url, model_name
+
+
+def create_agent(prompt_path: Path) -> LlmAgent:
+    """Create a fresh ``LlmAgent`` from the current prompt file.
+
+    Re-reading here is the critical TargetPrompt contract: when
+    ``AgentOptimizer`` writes a candidate prompt, the next call immediately uses
+    that candidate without restarting the process.
+    """
+    api_key, base_url, model_name = get_model_config()
+    instruction = Path(prompt_path).read_text(encoding="utf-8").strip()
+    return LlmAgent(
+        name="support_assistant",
+        description="A support assistant whose system prompt is under optimization.",
+        model=OpenAIModel(model_name=model_name, api_key=api_key, base_url=base_url),
+        instruction=instruction,
+        tools=[FunctionTool(lookup_order), FunctionTool(search_policy)],
+    )
+
+
+async def run_agent(query: str, prompt_path: Path) -> dict[str, Any]:
+    """Run the live agent once and collect final text plus tool calls.
+
+    ``AgentOptimizer.optimize`` only needs final response text, but the outer
+    issue-level report also wants key trajectory information. This richer helper
+    supports both.
+    """
+    agent = create_agent(prompt_path)
+    session_service = InMemorySessionService()
+    runner = Runner(app_name=APP_NAME, agent=agent, session_service=session_service)
+    session_id = str(uuid.uuid4())
+    user_id = "optimizer"
+    await session_service.create_session(
+        app_name=APP_NAME,
+        user_id=user_id,
+        session_id=session_id,
+        state={},
+    )
+    message = Content(role="user", parts=[Part.from_text(text=query)])
+    final_text = ""
+    tools: list[dict[str, Any]] = []
+    async for event in runner.run_async(
+        user_id=user_id,
+        session_id=session_id,
+        new_message=message,
+    ):
+        if not event.content or not event.content.parts:
+            continue
+        for part in event.content.parts:
+            function_call = getattr(part, "function_call", None)
+            if function_call is not None:
+                tools.append(
+                    {
+                        "name": getattr(function_call, "name", None),
+                        "args": dict(getattr(function_call, "args", {}) or {}),
+                    }
+                )
+        if event.is_final_response():
+            for part in event.content.parts:
+                if getattr(part, "text", None) and not getattr(part, "thought", False):
+                    final_text += part.text
+    return {"text": final_text.strip(), "tools": tools}
+
+
+def make_call_agent(prompt_path: Path) -> Callable[[str], Awaitable[str]]:
+    """Return the fixed async ``(query: str) -> str`` bridge required by GEPA."""
+
+    async def call_agent(query: str) -> str:
+        return (await run_agent(query=query, prompt_path=prompt_path))["text"]
+
+    return call_agent
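The re-read contract that `create_agent` and `make_call_agent` rely on can be checked without the SDK or credentials. The sketch below is a hypothetical stand-in, not the bridge from this PR: `make_stub_call_agent` and its echo-style reply are invented for illustration. It keeps the same `async (query: str) -> str` shape and shows that a prompt rewritten on disk is picked up by the next call.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Awaitable, Callable


def make_stub_call_agent(prompt_path: Path) -> Callable[[str], Awaitable[str]]:
    """Hypothetical stand-in for make_call_agent; no SDK or network involved."""

    async def call_agent(query: str) -> str:
        # Re-read the prompt on every call, as create_agent does, so a
        # candidate written to the file takes effect immediately.
        instruction = Path(prompt_path).read_text(encoding="utf-8").strip()
        return f"[{instruction}] {query}"

    return call_agent


async def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        prompt = Path(tmp) / "system.md"
        prompt.write_text("baseline", encoding="utf-8")
        call_agent = make_stub_call_agent(prompt)
        first = await call_agent("Where is order A100?")
        # Simulate the optimizer writing a candidate prompt to the same file.
        prompt.write_text("candidate", encoding="utf-8")
        second = await call_agent("Where is order A100?")
        return [first, second]


outputs = asyncio.run(demo())
print(outputs)
```

If the stub cached the prompt at construction time instead, both calls would return the baseline text, which is the failure mode the README's "must re-read the prompt file every time" constraint guards against.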
diff --git a/examples/optimization/eval_optimize_loop/case_meta.json b/examples/optimization/eval_optimize_loop/case_meta.json
@@ -0,0 +1,35 @@
+{
+  "_comment": "Per-case metadata for attribution, gate checks, and fake/live trace scoring. It is kept outside evalsets so EvalSet schema validation remains clean.",
+  "train_order_lookup_optimizable": {
+    "category": "tool_call_error",
+    "key": false,
+    "rubric": "none"
+  },
+  "train_refund_policy_optimizable": {
+    "category": "knowledge_recall_insufficient",
+    "key": false,
+    "rubric": "none",
+    "authoritative_tool": "search_policy"
+  },
+  "train_json_format_ineffective": {
+    "category": "format_error",
+    "key": false,
+    "rubric": "json_format"
+  },
+  "val_warranty_new_pass": {
+    "category": "knowledge_recall_insufficient",
+    "key": false,
+    "rubric": "none",
+    "authoritative_tool": "search_policy"
+  },
+  "val_smalltalk_regression": {
+    "category": "spurious_tool_call",
+    "key": true,
+    "rubric": "no_tool"
+  },
+  "val_order_soft_degradation": {
+    "category": "spurious_tool_call",
+    "key": true,
+    "rubric": "single_tool"
+  }
+}