fix(harness): harden Claude Code + OpenAI taps and span tracing#446
Merged
Conversation
…e shape Addresses the three P1 Greptile findings on the new sync OpenAI converter (convert_openai_to_agentex_events / OpenAITurn) plus the hosted-tool content mismatch flagged in PR #443: - Parse raw tool-call arguments through a defensive _safe_parse_arguments helper so malformed/truncated/non-dict JSON no longer raises and aborts the turn (matches the Temporal streaming model fallback). - Emit StreamTaskMessageDone for completed reasoning content/summary items so UnifiedEmitter.auto_send releases the context and the reasoning span is closed (reasoning messages used to hang open). - Reserve a fresh message index for every new text item_id so a final answer cannot collide with the preceding reasoning message on reasoning-model streams. - Emit hosted/server-side tool responses as a plain string in TemporalStreamingModel, matching the function-tool response path. Adds regression tests for the converter helpers and reasoning/text sequencing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Greptile follow-up: the _safe_parse_arguments fall-through returned a raw non-dict value (list / scalar / SDK object) when a provider tool passed arguments as something other than a JSON string. ToolRequestContent.arguments is typed Dict[str, object], so that could reject the value and abort the turn — the exact failure the helper exists to prevent. Now serialize SDK objects via model_dump when it yields a dict, otherwise wrap under `value`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The SpanDeriver opened reasoning spans with input={} and closed them with
output=None, so reasoning/thinking text never reached the trace — every
reasoning span showed blank input and output (seen on the Claude Code harness,
but the deriver is shared by all harnesses).
Accumulate the reasoning text per open index from ReasoningContentDelta /
ReasoningSummaryDelta (and seed from any text carried on the Start content for
non-streaming harnesses), then record it as the span output on close — matching
how tool spans accumulate args_buf and record their result. Empty buffers still
close as output=None rather than an empty string.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The golden agent emitted duplicate text messages: a single streamed assistant message (e.g. thinking + text) can materialise as SEPARATE `assistant` envelopes, and the converter reset its per-block streamed-index set after EACH materialised envelope. The thinking envelope (first) deduped fine, but the reset then wiped the set, so the text block in the second envelope lost its "already streamed" marker and was re-emitted. Numeric block indexes can't distinguish "already-streamed text materialising in a later envelope" (skip) from "a new turn's non-streamed text at the same index" (emit) — both arrive as a lone block at index 0. Switch to content-based dedup: record the full text of each streamed text/thinking block on content_block_stop, and skip a materialised block whose text matches (consuming one entry so a genuinely repeated later block still emits). Removes the per-envelope reset and the fragile pending/once-guard index machinery. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ugin Document migrating from the original Temporal `claude_agents` plugin (`run_claude_agent_activity` + bespoke streaming/tracing) to the unified harness tap (`ClaudeCodeTurn` over the CLI stream-json stdout, delivered via UnifiedEmitter), which gets central span derivation — tool AND reasoning spans — and the shared delivery path. Also records the two Claude Code defect fixes (reasoning span text, duplicate-text dedup) in the guide, and adds a deprecation note to the plugin module docstring pointing at the tap. No code removal — the plugin still has consumers (090 tutorial, eval_dashboard_agent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude Code can emit the materialized `assistant` envelope for a content block WHILE that block is still streaming (the envelope arrives between two content_block_delta events, before content_block_stop). At that point the streamed block's buffer has not yet been recorded in _streamed_texts / _streamed_thinkings, so content-recorded dedup misses it and the block is emitted twice (observed as duplicate reasoning messages in the golden agent). Add a second dedup condition: skip a materialized text/thinking block when a streamed block of the same type is still open and its partial buffer is a prefix of the materialized full text. Prefix-match confirms it's the same block without dropping a genuinely different one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The SGP spans API requires a span's input/output to be an object: a scalar or
string payload is rejected with a 422 ("Input should be a valid dictionary") and
the async processor drops the span. The reasoning-span fix records the
chain-of-thought as a plain-string output, so reasoning spans were silently
dropped — observed as "0 reasoning traces" in the golden agent while tool spans
(dict output) survived. Some harnesses' tool results are also plain strings and
would hit the same 422.
Coerce at the tracer boundary: wrap any non-dict input/output under a single key
({"output": ...} / {"input": ...}); dicts and None pass through unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardens the unified-harness taps (Claude Code stream-json + OpenAI Agents sync converter) and the shared span-tracing path. Started from the three P1 Greptile findings on PR #443's new OpenAI sync converter, then grew to cover several real defects surfaced while running the golden agent against this branch locally (duplicate messages, missing reasoning traces). Also documents migrating off the legacy Temporal
claude_agentsplugin.All fixes are behavior-only (no public API changes) and land on the unified harness shared by every harness (LangGraph, Pydantic-AI, OpenAI Agents, Claude Code, Codex).
Fixes
OpenAI Agents sync converter (
_openai_sync.py)_safe_parse_argumentshelper. Malformed/truncated JSON is preserved underraw, non-dict JSON undervalue, and non-string/non-dict payloads (lists, scalars, SDK objects) are serialized or wrapped — always returning a dict soToolRequestContent.arguments(Dict[str, object]) never rejects it and kills the run. (Greptile P1 + follow-up.)StreamTaskMessageDone, soUnifiedEmitter.auto_sendreleases the context. (Greptile P1.)item_idreserves a fresh message index, so a final answer can't overwrite the reasoning message on reasoning-model streams. (Greptile P1.)OpenAI Agents Temporal model (
temporal_streaming_model.py)contentas a plain string, matching the function-tool response path so both render identically. (Greptile summary flag.)Claude Code tap (
_claude_code_sync.py)assistantenvelope mid-stream (beforecontent_block_stop). When a same-type streamed block is still open and its partial buffer is a prefix of the materialized text, the materialized copy is suppressed.Span tracing (
span_derivation.py,tracer.py)SpanDeriveropened reasoning spans with empty input andoutput=None, so reasoning text never reached the trace. It now accumulatesReasoningContentDelta/ReasoningSummaryDeltatext (and any text seeded on the Start content) and records it as the span output. Affects every harness that streams reasoning.input/outputto be objects; a scalar/string payload is 422-rejected and the span silently dropped by the async processor. This is exactly why reasoning spans (string output) showed up as "0 reasoning traces" in the golden agent while tool spans (dict output) survived. The tracer now wraps any non-dict payload ({"output": ...}/{"input": ...}); dicts andNonepass through.Docs
adk/docs/migration-0.16.0.md— migration guide foragentex-client0.16.0 /agentex-sdk0.15.0: removed LangGraph/Pydantic-AI tracing handlers, private_modulespath moves, OpenAI harness facade relocation, the newrun_turnTemporal entry point, the defect fixes above, and §6: migrating off the legacy Temporalclaude_agentsplugin to the unified harness tap. Linked fromadk/docs/harness.md. Deprecation note added to the plugin module docstring. (Follow-up tracked in AGX1-402.)Tests
tests/lib/adk/test_openai_sync.py(arg parsing + reasoning/text sequencing), reasoning-span tests intest_span_derivation.py, content/interleaved dedup tests intest_claude_code_sync.py, payload-coercion tests intest_tracer.py.Verification
Confirmed against the golden agent on this branch:
DEDUP-HIT prefix (skip)fires (duplicate reasoning gone), and the SGP 422 on string span output is the confirmed cause of the missing reasoning traces (now coerced). Worker must be restarted to pick up the editable install.🤖 Generated with Claude Code
Greptile Summary
This PR hardens the OpenAI and unified harness paths for streaming, tracing, and migration. The main changes are:
Confidence Score: 5/5
The changes are focused, covered by targeted tests, and do not leave known merge-blocking issues.
The updated OpenAI sync conversion, hosted-tool response shape, tracing normalization, and Claude Code dedup behavior are exercised by the added and updated test coverage, with no accepted code issues remaining.
What T-Rex did
Comments Outside Diff (1)
General comment
StreamTaskMessageStartevents at index 1, with the text delta also routed to index 1. The validation contract requires each new textitem_idafter reasoning to reserve a fresh message index so final answer text does not collide with the reasoning context.src/agentex/lib/adk/_modules/_openai_sync.pyonly incrementsmessage_indexfor a new textitem_idwhenseen_tool_outputis true. For a reasoning-only stream, the branch falls through toitem_id_to_index[item_id] = message_index, reusing the current reasoning index.ResponseTextDeltaEventhandling for any unseen textitem_id, incrementmessage_indexbefore assigningitem_id_to_index[item_id], regardless ofseen_tool_output, or otherwise detect prior reasoning item allocation and reserve a new index for the text item.Reviews (5): Last reviewed commit: "fix(harness): coerce non-dict span input..." | Re-trigger Greptile