fix(harness): harden Claude Code + OpenAI taps and span tracing by declan-scale · Pull Request #446 · scaleapi/scale-agentex-python

declan-scale · 2026-06-24T20:01:26Z

Summary

Hardens the unified-harness taps (Claude Code stream-json + OpenAI Agents sync converter) and the shared span-tracing path. Started from the three P1 Greptile findings on PR #443's new OpenAI sync converter, then grew to cover several real defects surfaced while running the golden agent against this branch locally (duplicate messages, missing reasoning traces). Also documents migrating off the legacy Temporal claude_agents plugin.

All fixes are behavior-only (no public API changes) and land on the unified harness shared by every harness (LangGraph, Pydantic-AI, OpenAI Agents, Claude Code, Codex).

Fixes

OpenAI Agents sync converter (`_openai_sync.py`)

Bad tool args no longer abort the turn — raw tool-call arguments flow through a defensive _safe_parse_arguments helper. Malformed/truncated JSON is preserved under raw, non-dict JSON under value, and non-string/non-dict payloads (lists, scalars, SDK objects) are serialized or wrapped — always returning a dict so ToolRequestContent.arguments (Dict[str, object]) never rejects it and kills the run. (Greptile P1 + follow-up.)
Reasoning streams no longer hang — completed reasoning content/summary items emit a matching StreamTaskMessageDone, so UnifiedEmitter.auto_send releases the context. (Greptile P1.)
Text no longer collides with reasoning — every new text item_id reserves a fresh message index, so a final answer can't overwrite the reasoning message on reasoning-model streams. (Greptile P1.)

OpenAI Agents Temporal model (`temporal_streaming_model.py`)

Hosted-tool response shape aligned — hosted/server-side tool responses emit content as a plain string, matching the function-tool response path so both render identically. (Greptile summary flag.)

Claude Code tap (`_claude_code_sync.py`)

No more duplicate text/reasoning messages — dedup of streamed-vs-materialized blocks is now content-based (record the streamed block's text, consume on match) instead of numeric block index, which couldn't distinguish "already-streamed block materializing in a later envelope" from "a new turn's block at the same index".
Interleaved materialized blocks deduped — Claude can emit the materialized assistant envelope mid-stream (before content_block_stop). When a same-type streamed block is still open and its partial buffer is a prefix of the materialized text, the materialized copy is suppressed.

Span tracing (`span_derivation.py`, `tracer.py`)

Reasoning text now recorded on derived reasoning spans — SpanDeriver opened reasoning spans with empty input and output=None, so reasoning text never reached the trace. It now accumulates ReasoningContentDelta / ReasoningSummaryDelta text (and any text seeded on the Start content) and records it as the span output. Affects every harness that streams reasoning.
Non-dict span input/output coerced — the SGP spans API requires input/output to be objects; a scalar/string payload is 422-rejected and the span silently dropped by the async processor. This is exactly why reasoning spans (string output) showed up as "0 reasoning traces" in the golden agent while tool spans (dict output) survived. The tracer now wraps any non-dict payload ({"output": ...} / {"input": ...}); dicts and None pass through.

Docs

adk/docs/migration-0.16.0.md — migration guide for agentex-client 0.16.0 / agentex-sdk 0.15.0: removed LangGraph/Pydantic-AI tracing handlers, private _modules path moves, OpenAI harness facade relocation, the new run_turn Temporal entry point, the defect fixes above, and §6: migrating off the legacy Temporal claude_agents plugin to the unified harness tap. Linked from adk/docs/harness.md. Deprecation note added to the plugin module docstring. (Follow-up tracked in AGX1-402.)

Tests

New: tests/lib/adk/test_openai_sync.py (arg parsing + reasoning/text sequencing), reasoning-span tests in test_span_derivation.py, content/interleaved dedup tests in test_claude_code_sync.py, payload-coercion tests in test_tracer.py.
Full harness + Claude Code + OpenAI suites green; lint/format clean.

Verification

Confirmed against the golden agent on this branch: DEDUP-HIT prefix (skip) fires (duplicate reasoning gone), and the SGP 422 on string span output is the confirmed cause of the missing reasoning traces (now coerced). Worker must be restarted to pick up the editable install.

🤖 Generated with Claude Code

Greptile Summary

This PR hardens the OpenAI and unified harness paths for streaming, tracing, and migration. The main changes are:

Safer OpenAI sync tool-argument parsing and reasoning message closure.
Fresh text message indexing after reasoning and tool output.
Hosted-tool response content aligned with function-tool responses.
Reasoning text captured in derived spans and span payloads normalized for SGP tracing.
Claude Code streamed/materialized block dedup updated to use delivered content.
Migration documentation added for the 0.16.0 client and 0.15.0 SDK release.

Confidence Score: 5/5

The changes are focused, covered by targeted tests, and do not leave known merge-blocking issues.

The updated OpenAI sync conversion, hosted-tool response shape, tracing normalization, and Claude Code dedup behavior are exercised by the added and updated test coverage, with no accepted code issues remaining.

T-Rex Logs

What T-Rex did

T-Rex ran tests to exercise malformed JSON handling and ToolRequestContent arguments validation, producing payloads for malformed JSON and for a valid [1, 2] input, and finishing with exit code 0.
The reasoning-index traces were inspected to confirm that text start and delta routing stay anchored at index 1, with text_fresh_index reported as false in both before and after cases.
The hosted tool response flow was observed to transition from a dict payload to a string payload, with shapes_align flipping from false to true, and the related artifacts capture the full command context and JSON output.

_{Ran code and verified through T-Rex}

Comments Outside Diff (1)

General comment

OpenAI sync text items still collide with reasoning message indexes
- Bug
  - In the head run, a completed reasoning item followed by a separate text item still produces both StreamTaskMessageStart events at index 1, with the text delta also routed to index 1. The validation contract requires each new text item_id after reasoning to reserve a fresh message index so final answer text does not collide with the reasoning context.
- Cause
  - The text-delta branch in src/agentex/lib/adk/_modules/_openai_sync.py only increments message_index for a new text item_id when seen_tool_output is true. For a reasoning-only stream, the branch falls through to item_id_to_index[item_id] = message_index, reusing the current reasoning index.
- Fix
  - In the ResponseTextDeltaEvent handling for any unseen text item_id, increment message_index before assigning item_id_to_index[item_id], regardless of seen_tool_output, or otherwise detect prior reasoning item allocation and reserve a new index for the text item.
_{Ran code and verified through T-Rex}

_{Reviews (5): Last reviewed commit: "fix(harness): coerce non-dict span input..." | Re-trigger Greptile}

…e shape Addresses the three P1 Greptile findings on the new sync OpenAI converter (convert_openai_to_agentex_events / OpenAITurn) plus the hosted-tool content mismatch flagged in PR #443: - Parse raw tool-call arguments through a defensive _safe_parse_arguments helper so malformed/truncated/non-dict JSON no longer raises and aborts the turn (matches the Temporal streaming model fallback). - Emit StreamTaskMessageDone for completed reasoning content/summary items so UnifiedEmitter.auto_send releases the context and the reasoning span is closed (reasoning messages used to hang open). - Reserve a fresh message index for every new text item_id so a final answer cannot collide with the preceding reasoning message on reasoning-model streams. - Emit hosted/server-side tool responses as a plain string in TemporalStreamingModel, matching the function-tool response path. Adds regression tests for the converter helpers and reasoning/text sequencing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Greptile follow-up: the _safe_parse_arguments fall-through returned a raw non-dict value (list / scalar / SDK object) when a provider tool passed arguments as something other than a JSON string. ToolRequestContent.arguments is typed Dict[str, object], so that could reject the value and abort the turn — the exact failure the helper exists to prevent. Now serialize SDK objects via model_dump when it yields a dict, otherwise wrap under `value`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The SpanDeriver opened reasoning spans with input={} and closed them with output=None, so reasoning/thinking text never reached the trace — every reasoning span showed blank input and output (seen on the Claude Code harness, but the deriver is shared by all harnesses). Accumulate the reasoning text per open index from ReasoningContentDelta / ReasoningSummaryDelta (and seed from any text carried on the Start content for non-streaming harnesses), then record it as the span output on close — matching how tool spans accumulate args_buf and record their result. Empty buffers still close as output=None rather than an empty string. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The golden agent emitted duplicate text messages: a single streamed assistant message (e.g. thinking + text) can materialise as SEPARATE `assistant` envelopes, and the converter reset its per-block streamed-index set after EACH materialised envelope. The thinking envelope (first) deduped fine, but the reset then wiped the set, so the text block in the second envelope lost its "already streamed" marker and was re-emitted. Numeric block indexes can't distinguish "already-streamed text materialising in a later envelope" (skip) from "a new turn's non-streamed text at the same index" (emit) — both arrive as a lone block at index 0. Switch to content-based dedup: record the full text of each streamed text/thinking block on content_block_stop, and skip a materialised block whose text matches (consuming one entry so a genuinely repeated later block still emits). Removes the per-envelope reset and the fragile pending/once-guard index machinery. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ugin Document migrating from the original Temporal `claude_agents` plugin (`run_claude_agent_activity` + bespoke streaming/tracing) to the unified harness tap (`ClaudeCodeTurn` over the CLI stream-json stdout, delivered via UnifiedEmitter), which gets central span derivation — tool AND reasoning spans — and the shared delivery path. Also records the two Claude Code defect fixes (reasoning span text, duplicate-text dedup) in the guide, and adds a deprecation note to the plugin module docstring pointing at the tap. No code removal — the plugin still has consumers (090 tutorial, eval_dashboard_agent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Claude Code can emit the materialized `assistant` envelope for a content block WHILE that block is still streaming (the envelope arrives between two content_block_delta events, before content_block_stop). At that point the streamed block's buffer has not yet been recorded in _streamed_texts / _streamed_thinkings, so content-recorded dedup misses it and the block is emitted twice (observed as duplicate reasoning messages in the golden agent). Add a second dedup condition: skip a materialized text/thinking block when a streamed block of the same type is still open and its partial buffer is a prefix of the materialized full text. Prefix-match confirms it's the same block without dropping a genuinely different one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The SGP spans API requires a span's input/output to be an object: a scalar or string payload is rejected with a 422 ("Input should be a valid dictionary") and the async processor drops the span. The reasoning-span fix records the chain-of-thought as a plain-string output, so reasoning spans were silently dropped — observed as "0 reasoning traces" in the golden agent while tool spans (dict output) survived. Some harnesses' tool results are also plain strings and would hit the same 422. Coerce at the tracer boundary: wrap any non-dict input/output under a single key ({"output": ...} / {"input": ...}); dicts and None pass through unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread src/agentex/lib/adk/_modules/_openai_sync.py Outdated

declan-scale and others added 5 commits June 24, 2026 16:10

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread src/agentex/lib/adk/_modules/_claude_code_sync.py

declan-scale changed the title ~~fix(openai-agents): harden sync converter + align hosted-tool response shape~~ fix(harness): harden Claude Code + OpenAI taps and span tracing Jun 24, 2026

declan-scale merged commit 5b4359d into next Jun 24, 2026
99 of 101 checks passed

declan-scale deleted the declan-scale/openai-sync-reasoning-fixes branch June 24, 2026 21:54

stainless-app Bot mentioned this pull request Jun 24, 2026

chore: release main #443

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(harness): harden Claude Code + OpenAI taps and span tracing#446

fix(harness): harden Claude Code + OpenAI taps and span tracing#446
declan-scale merged 7 commits into
nextfrom
declan-scale/openai-sync-reasoning-fixes

declan-scale commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

declan-scale commented Jun 24, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fixes

OpenAI Agents sync converter (_openai_sync.py)

OpenAI Agents Temporal model (temporal_streaming_model.py)

Claude Code tap (_claude_code_sync.py)

Span tracing (span_derivation.py, tracer.py)

Docs

Tests

Verification

Greptile Summary

Confidence Score: 5/5

T-Rex Logs

Comments Outside Diff (1)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

declan-scale commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

OpenAI Agents sync converter (`_openai_sync.py`)

OpenAI Agents Temporal model (`temporal_streaming_model.py`)

Claude Code tap (`_claude_code_sync.py`)

Span tracing (`span_derivation.py`, `tracer.py`)