Skip to content

fix(harness): harden Claude Code + OpenAI taps and span tracing#446

Merged
declan-scale merged 7 commits into
nextfrom
declan-scale/openai-sync-reasoning-fixes
Jun 24, 2026
Merged

fix(harness): harden Claude Code + OpenAI taps and span tracing#446
declan-scale merged 7 commits into
nextfrom
declan-scale/openai-sync-reasoning-fixes

Conversation

@declan-scale

@declan-scale declan-scale commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Hardens the unified-harness taps (Claude Code stream-json + OpenAI Agents sync converter) and the shared span-tracing path. Started from the three P1 Greptile findings on PR #443's new OpenAI sync converter, then grew to cover several real defects surfaced while running the golden agent against this branch locally (duplicate messages, missing reasoning traces). Also documents migrating off the legacy Temporal claude_agents plugin.

All fixes are behavior-only (no public API changes) and land on the unified harness shared by every harness (LangGraph, Pydantic-AI, OpenAI Agents, Claude Code, Codex).

Fixes

OpenAI Agents sync converter (_openai_sync.py)

  • Bad tool args no longer abort the turn — raw tool-call arguments flow through a defensive _safe_parse_arguments helper. Malformed/truncated JSON is preserved under raw, non-dict JSON under value, and non-string/non-dict payloads (lists, scalars, SDK objects) are serialized or wrapped — always returning a dict so ToolRequestContent.arguments (Dict[str, object]) never rejects it and kills the run. (Greptile P1 + follow-up.)
  • Reasoning streams no longer hang — completed reasoning content/summary items emit a matching StreamTaskMessageDone, so UnifiedEmitter.auto_send releases the context. (Greptile P1.)
  • Text no longer collides with reasoning — every new text item_id reserves a fresh message index, so a final answer can't overwrite the reasoning message on reasoning-model streams. (Greptile P1.)

OpenAI Agents Temporal model (temporal_streaming_model.py)

  • Hosted-tool response shape aligned — hosted/server-side tool responses emit content as a plain string, matching the function-tool response path so both render identically. (Greptile summary flag.)

Claude Code tap (_claude_code_sync.py)

  • No more duplicate text/reasoning messages — dedup of streamed-vs-materialized blocks is now content-based (record the streamed block's text, consume on match) instead of numeric block index, which couldn't distinguish "already-streamed block materializing in a later envelope" from "a new turn's block at the same index".
  • Interleaved materialized blocks deduped — Claude can emit the materialized assistant envelope mid-stream (before content_block_stop). When a same-type streamed block is still open and its partial buffer is a prefix of the materialized text, the materialized copy is suppressed.

Span tracing (span_derivation.py, tracer.py)

  • Reasoning text now recorded on derived reasoning spansSpanDeriver opened reasoning spans with empty input and output=None, so reasoning text never reached the trace. It now accumulates ReasoningContentDelta / ReasoningSummaryDelta text (and any text seeded on the Start content) and records it as the span output. Affects every harness that streams reasoning.
  • Non-dict span input/output coerced — the SGP spans API requires input/output to be objects; a scalar/string payload is 422-rejected and the span silently dropped by the async processor. This is exactly why reasoning spans (string output) showed up as "0 reasoning traces" in the golden agent while tool spans (dict output) survived. The tracer now wraps any non-dict payload ({"output": ...} / {"input": ...}); dicts and None pass through.

Docs

  • adk/docs/migration-0.16.0.md — migration guide for agentex-client 0.16.0 / agentex-sdk 0.15.0: removed LangGraph/Pydantic-AI tracing handlers, private _modules path moves, OpenAI harness facade relocation, the new run_turn Temporal entry point, the defect fixes above, and §6: migrating off the legacy Temporal claude_agents plugin to the unified harness tap. Linked from adk/docs/harness.md. Deprecation note added to the plugin module docstring. (Follow-up tracked in AGX1-402.)

Tests

  • New: tests/lib/adk/test_openai_sync.py (arg parsing + reasoning/text sequencing), reasoning-span tests in test_span_derivation.py, content/interleaved dedup tests in test_claude_code_sync.py, payload-coercion tests in test_tracer.py.
  • Full harness + Claude Code + OpenAI suites green; lint/format clean.

Verification

Confirmed against the golden agent on this branch: DEDUP-HIT prefix (skip) fires (duplicate reasoning gone), and the SGP 422 on string span output is the confirmed cause of the missing reasoning traces (now coerced). Worker must be restarted to pick up the editable install.

🤖 Generated with Claude Code

Greptile Summary

This PR hardens the OpenAI and unified harness paths for streaming, tracing, and migration. The main changes are:

  • Safer OpenAI sync tool-argument parsing and reasoning message closure.
  • Fresh text message indexing after reasoning and tool output.
  • Hosted-tool response content aligned with function-tool responses.
  • Reasoning text captured in derived spans and span payloads normalized for SGP tracing.
  • Claude Code streamed/materialized block dedup updated to use delivered content.
  • Migration documentation added for the 0.16.0 client and 0.15.0 SDK release.

Confidence Score: 5/5

The changes are focused, covered by targeted tests, and do not leave known merge-blocking issues.

The updated OpenAI sync conversion, hosted-tool response shape, tracing normalization, and Claude Code dedup behavior are exercised by the added and updated test coverage, with no accepted code issues remaining.

T-Rex T-Rex Logs

What T-Rex did

  • T-Rex ran tests to exercise malformed JSON handling and ToolRequestContent arguments validation, producing payloads for malformed JSON and for a valid [1, 2] input, and finishing with exit code 0.
  • The reasoning-index traces were inspected to confirm that text start and delta routing stay anchored at index 1, with text_fresh_index reported as false in both before and after cases.
  • The hosted tool response flow was observed to transition from a dict payload to a string payload, with shapes_align flipping from false to true, and the related artifacts capture the full command context and JSON output.

View all artifacts

T-Rex Ran code and verified through T-Rex

Comments Outside Diff (1)

  1. General comment

    P1 OpenAI sync text items still collide with reasoning message indexes

    • Bug
      • In the head run, a completed reasoning item followed by a separate text item still produces both StreamTaskMessageStart events at index 1, with the text delta also routed to index 1. The validation contract requires each new text item_id after reasoning to reserve a fresh message index so final answer text does not collide with the reasoning context.
    • Cause
      • The text-delta branch in src/agentex/lib/adk/_modules/_openai_sync.py only increments message_index for a new text item_id when seen_tool_output is true. For a reasoning-only stream, the branch falls through to item_id_to_index[item_id] = message_index, reusing the current reasoning index.
    • Fix
      • In the ResponseTextDeltaEvent handling for any unseen text item_id, increment message_index before assigning item_id_to_index[item_id], regardless of seen_tool_output, or otherwise detect prior reasoning item allocation and reserve a new index for the text item.

    T-Rex Ran code and verified through T-Rex

Reviews (5): Last reviewed commit: "fix(harness): coerce non-dict span input..." | Re-trigger Greptile

…e shape

Addresses the three P1 Greptile findings on the new sync OpenAI converter
(convert_openai_to_agentex_events / OpenAITurn) plus the hosted-tool content
mismatch flagged in PR #443:

- Parse raw tool-call arguments through a defensive _safe_parse_arguments helper
  so malformed/truncated/non-dict JSON no longer raises and aborts the turn
  (matches the Temporal streaming model fallback).
- Emit StreamTaskMessageDone for completed reasoning content/summary items so
  UnifiedEmitter.auto_send releases the context and the reasoning span is closed
  (reasoning messages used to hang open).
- Reserve a fresh message index for every new text item_id so a final answer
  cannot collide with the preceding reasoning message on reasoning-model streams.
- Emit hosted/server-side tool responses as a plain string in
  TemporalStreamingModel, matching the function-tool response path.

Adds regression tests for the converter helpers and reasoning/text sequencing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/agentex/lib/adk/_modules/_openai_sync.py Outdated
declan-scale and others added 5 commits June 24, 2026 16:10
Greptile follow-up: the _safe_parse_arguments fall-through returned a raw
non-dict value (list / scalar / SDK object) when a provider tool passed
arguments as something other than a JSON string. ToolRequestContent.arguments
is typed Dict[str, object], so that could reject the value and abort the turn —
the exact failure the helper exists to prevent. Now serialize SDK objects via
model_dump when it yields a dict, otherwise wrap under `value`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The SpanDeriver opened reasoning spans with input={} and closed them with
output=None, so reasoning/thinking text never reached the trace — every
reasoning span showed blank input and output (seen on the Claude Code harness,
but the deriver is shared by all harnesses).

Accumulate the reasoning text per open index from ReasoningContentDelta /
ReasoningSummaryDelta (and seed from any text carried on the Start content for
non-streaming harnesses), then record it as the span output on close — matching
how tool spans accumulate args_buf and record their result. Empty buffers still
close as output=None rather than an empty string.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The golden agent emitted duplicate text messages: a single streamed assistant
message (e.g. thinking + text) can materialise as SEPARATE `assistant` envelopes,
and the converter reset its per-block streamed-index set after EACH materialised
envelope. The thinking envelope (first) deduped fine, but the reset then wiped
the set, so the text block in the second envelope lost its "already streamed"
marker and was re-emitted.

Numeric block indexes can't distinguish "already-streamed text materialising in a
later envelope" (skip) from "a new turn's non-streamed text at the same index"
(emit) — both arrive as a lone block at index 0. Switch to content-based dedup:
record the full text of each streamed text/thinking block on content_block_stop,
and skip a materialised block whose text matches (consuming one entry so a
genuinely repeated later block still emits). Removes the per-envelope reset and
the fragile pending/once-guard index machinery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ugin

Document migrating from the original Temporal `claude_agents` plugin
(`run_claude_agent_activity` + bespoke streaming/tracing) to the unified harness
tap (`ClaudeCodeTurn` over the CLI stream-json stdout, delivered via
UnifiedEmitter), which gets central span derivation — tool AND reasoning spans —
and the shared delivery path. Also records the two Claude Code defect fixes
(reasoning span text, duplicate-text dedup) in the guide, and adds a deprecation
note to the plugin module docstring pointing at the tap. No code removal — the
plugin still has consumers (090 tutorial, eval_dashboard_agent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude Code can emit the materialized `assistant` envelope for a content block
WHILE that block is still streaming (the envelope arrives between two
content_block_delta events, before content_block_stop). At that point the
streamed block's buffer has not yet been recorded in _streamed_texts /
_streamed_thinkings, so content-recorded dedup misses it and the block is
emitted twice (observed as duplicate reasoning messages in the golden agent).

Add a second dedup condition: skip a materialized text/thinking block when a
streamed block of the same type is still open and its partial buffer is a prefix
of the materialized full text. Prefix-match confirms it's the same block without
dropping a genuinely different one.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/agentex/lib/adk/_modules/_claude_code_sync.py
The SGP spans API requires a span's input/output to be an object: a scalar or
string payload is rejected with a 422 ("Input should be a valid dictionary") and
the async processor drops the span. The reasoning-span fix records the
chain-of-thought as a plain-string output, so reasoning spans were silently
dropped — observed as "0 reasoning traces" in the golden agent while tool spans
(dict output) survived. Some harnesses' tool results are also plain strings and
would hit the same 422.

Coerce at the tracer boundary: wrap any non-dict input/output under a single key
({"output": ...} / {"input": ...}); dicts and None pass through unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@declan-scale declan-scale changed the title fix(openai-agents): harden sync converter + align hosted-tool response shape fix(harness): harden Claude Code + OpenAI taps and span tracing Jun 24, 2026
@declan-scale declan-scale merged commit 5b4359d into next Jun 24, 2026
99 of 101 checks passed
@declan-scale declan-scale deleted the declan-scale/openai-sync-reasoning-fixes branch June 24, 2026 21:54
@stainless-app stainless-app Bot mentioned this pull request Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant