feat(temporal): opt-in continue-as-new for long-lived agent workflows #447

Open
danielmillerp wants to merge 1 commit into next from dm/temporal-continue-as-new

danielmillerp (Contributor) commented Jun 24, 2026

Why

Long-lived chat/session agents (e.g. the Emu/FDD researcher) run as a single Temporal workflow that stays open indefinitely. Their event history grows until it hits Temporal's ~50k-event / 50MB limit and the workflow stalls — this is the root cause behind "chats die / state outgrows the 2MB payload" (P0 for EY).

This PR adds an opt-in continue-as-new path so a session can stay open forever by recycling its history, plus the discipline of keeping messages/state outside workflow state so they survive the recycle.

Two orthogonal levers: continue-as-new bounds history size; the chain-wide WORKFLOW_EXECUTION_TIMEOUT_SECONDS bounds wall-clock lifetime. continue-as-new does not extend the execution timeout — raise that knob too to keep workflows long-lived.
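
The split between the two levers can be illustrated with a toy model. This is a sketch under stated assumptions, not SDK code: `simulate_chain` and its parameters are hypothetical, the numbers are made up, and a real continued run starts with a few events rather than zero.

```python
from typing import Optional, Tuple


def simulate_chain(
    events_per_hour: int,
    max_history_length: Optional[int],
    execution_timeout_hours: int,
) -> Tuple[int, int, int]:
    """Toy model of a workflow chain.

    Returns (hours_alive, recycles, peak_history).
    """
    history = 0
    peak_history = 0
    recycles = 0
    for _hour in range(execution_timeout_hours):
        history += events_per_hour
        peak_history = max(peak_history, history)
        if max_history_length is not None and history >= max_history_length:
            # continue-as-new: the next run starts with a fresh history
            # (modelled as 0 events for simplicity)
            history = 0
            recycles += 1
    # The chain-wide execution timeout ends the chain no matter how many
    # times it recycled, so hours_alive is always the timeout itself.
    return execution_timeout_hours, recycles, peak_history
```

In this model, recycling changes only the peak history size; the lifetime is set by the timeout alone, which is why both knobs have to be raised together.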

SDK — BaseWorkflow helpers (opt-in)

  • should_continue_as_new(): the recycle decision, triggered by either Temporal's is_continue_as_new_suggested() or a configurable WORKFLOW_MAX_HISTORY_LENGTH threshold.
  • drain_and_continue_as_new(): waits for all_handlers_finished (so an in-flight turn isn't lost or duplicated at the boundary), then calls continue_as_new.
  • run_until_complete(): drop-in replacement for the usual wait_condition(timeout=None) tail. Gated once behind workflow.patched() so in-flight pre-patch workflows keep the old behaviour and don't hit a non-determinism error on replay.
  • conversation_from_messages(): rebuilds the conversation from the adk.messages ledger after a recycle (messages live in adk.messages, not workflow state).
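
The recycle decision in the first bullet can be sketched as a pure function. This is a minimal illustration, not the PR's implementation: `should_recycle` and its parameters are hypothetical stand-ins for values the SDK would read from Temporal's workflow info and from WORKFLOW_MAX_HISTORY_LENGTH, and the `>=` comparison is an assumption.

```python
from typing import Optional


def should_recycle(
    suggested_by_server: bool,
    history_length: int,
    max_history_length: Optional[int],
) -> bool:
    """Hypothetical stand-in for the should_continue_as_new() decision."""
    # Temporal's own suggestion always wins.
    if suggested_by_server:
        return True
    # No configured threshold: rely on the server suggestion only.
    if max_history_length is None:
        return False
    # Assumption: recycle once the history reaches the threshold.
    return history_length >= max_history_length
```

Keeping the decision free of side effects makes it easy to unit-test without a Temporal test server.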

Config (default OFF — existing agents unaffected)

  • WORKFLOW_CONTINUE_AS_NEW_ENABLED (bool)
  • WORKFLOW_MAX_HISTORY_LENGTH (int | None)
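
As a rough sketch of how the two settings might be read, the helper below parses them from an environment-like mapping. The function name, the accepted truthy strings, and the example value 5000 are assumptions for illustration; the PR does not show how the SDK parses these variables.

```python
from typing import Mapping, Optional, Tuple

# Assumption: these strings count as "enabled"; the SDK's real parsing may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def read_continue_as_new_config(env: Mapping[str, str]) -> Tuple[bool, Optional[int]]:
    """Return (enabled, max_history_length) from an environment-like mapping."""
    raw_enabled = env.get("WORKFLOW_CONTINUE_AS_NEW_ENABLED", "")
    enabled = raw_enabled.strip().lower() in _TRUTHY

    raw_max = env.get("WORKFLOW_MAX_HISTORY_LENGTH", "").strip()
    # An unset or empty value means "no explicit threshold".
    max_history_length = int(raw_max) if raw_max else None
    return enabled, max_history_length
```

In practice this would be called with `os.environ`; with neither variable set it yields the default-off configuration.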

Examples

All 13 long-lived Temporal tutorial agents adopt run_until_complete:

  • Message-based chat (010, 050, 060, 070, 080, 100, 120) — rebuild conversation from adk.messages.
  • Harness/session — persist non-message state to adk.state and re-hydrate on recycle: opaque session handles for claude-sdk (090), claude-code (140), codex (150); rich ModelMessage history for pydantic-ai (110, via ModelMessagesTypeAdapter); langgraph (130) rebuilds from the ledger.
  • 000 (no per-turn state) just swaps the wait.

Every adk.state / adk.messages round-trip is guarded by the enabled flag, so the default path is byte-for-byte unchanged.
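
The persist-and-rehydrate pattern for harness agents might look roughly like the sketch below. It uses an in-memory store as a hypothetical stand-in for adk.state; the class names, method names, and the `session_handle` key are all illustrative assumptions, not the SDK's API.

```python
from typing import Dict, Optional, Tuple


class InMemoryStateStore:
    """Hypothetical stand-in for an external state service such as adk.state."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], dict] = {}
        self.calls = 0  # counts round-trips so the default-off path can be checked

    def get(self, task_id: str, agent_id: str) -> Optional[dict]:
        self.calls += 1
        return self._data.get((task_id, agent_id))

    def put(self, task_id: str, agent_id: str, state: dict) -> None:
        self.calls += 1
        self._data[(task_id, agent_id)] = dict(state)


class SessionWorkflowSketch:
    """Models one workflow run that holds an opaque session handle."""

    def __init__(self, store: InMemoryStateStore, task_id: str, agent_id: str, enabled: bool) -> None:
        self.store = store
        self.task_id = task_id
        self.agent_id = agent_id
        self.enabled = enabled
        self.session_handle: Optional[str] = None

    def rehydrate(self) -> None:
        # Guarded by the flag: the default path makes no store calls at all.
        if not self.enabled:
            return
        saved = self.store.get(self.task_id, self.agent_id)
        if saved is not None:
            self.session_handle = saved.get("session_handle")

    def finish_turn(self, session_handle: str) -> None:
        self.session_handle = session_handle
        if self.enabled:
            self.store.put(self.task_id, self.agent_id, {"session_handle": session_handle})
```

A second `SessionWorkflowSketch` created against the same store plays the role of the run that follows a continue-as-new: it starts with empty in-memory state and recovers the handle through `rehydrate()`.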

Verification

  • New unit tests for the recycle decision logic — tests/lib/core/temporal/test_base_workflow_continue_as_new.py (5 passing).
  • Full tests/lib/core/temporal suite: 8 passed, no regressions.
  • py_compile + ruff clean across all 16 changed files.

Follow-ups (not in this PR)

  • Replay/integration test of drain_and_continue_as_new against a Temporal test server.
  • Validate the pattern (drain + patch + chain-timeout) with the Temporal team before enabling in production.
  • Optional platform-level "transparent for all agents" variant (SDK owns the run loop) — deferred per discussion.

🤖 Generated with Claude Code

Greptile Summary

This PR adds opt-in continue-as-new support for long-lived Temporal agent workflows. The main changes are:

  • New BaseWorkflow helpers for recycle decisions, draining handlers, and rebuilding conversations.
  • New environment flags for enabling continue-as-new and setting a history threshold.
  • Updated Temporal tutorial workflows to use run_until_complete.
  • Message-ledger and adk.state rehydration for workflows that need state across recycles.
  • Unit tests for the continue-as-new decision logic.

Confidence Score: 3/5

The feature is opt-in and well scoped, but the replay compatibility guard needs to move ahead of any activity-emitting prologue work before this can be safely enabled for existing long-lived workflows.

The implementation has targeted tests for the recycle decision logic and keeps the default path disabled, but replay safety for upgraded executions is a key requirement for Temporal workflow changes and remains unresolved.

Files needing attention: src/agentex/lib/core/temporal/workflows/workflow.py, and the Temporal tutorial workflow prologues that perform rehydration, state loading, welcome-message, or workspace setup before entering the guarded wait helper.

T-Rex Logs

What T-Rex did

  • Reproduced the prologue gate verification by running a focused workflow harness with WORKFLOW_CONTINUE_AS_NEW_ENABLED=true and tracing adk.messages.list before the BaseWorkflow.run_until_complete evaluation of workflow.patched.
  • Verified the before-and-after state of temporal SDK helpers by comparing artifacts; before lacked the helpers, after includes them and shows EXIT_CODE 0.
  • Observed changes in workflow memory behavior with rehydrate/persist: without rehydrate, base kept 010 conversation and 150 Codex thread in memory; with rehydrate enabled, default-off paths do not add adk.messages/adk.state calls, while enabled paths rebuild 010 conversation from adk.messages and restore/update 150 Codex session state via adk.state.
  • Compared before and head runtime config artifacts to confirm updated head runtime import/script output and that both have command, working directory, and exit code; this validation targets the Python config contract, not an HTTP endpoint.

T-Rex Ran code and verified through T-Rex

Reviews (2): Last reviewed commit: "feat(temporal): opt-in continue-as-new f..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

danielmillerp changed the base branch from main to next on June 24, 2026 20:20
danielmillerp force-pushed the dm/temporal-continue-as-new branch from 5d63a08 to 4170651 on June 24, 2026 20:22
socket-security Bot commented Jun 24, 2026

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Updated | pypi/agentex-sdk@0.13.0 ⏵ 0.14.0 | 94 (+1) | 100 | 100 | 100 | 100 |
| Updated | pypi/agentex-client@0.13.0 ⏵ 0.15.0 | 99 (+1) | 100 | 100 (+1) | 100 | 100 |

Comment on lines +187 to +188 of src/agentex/lib/core/temporal/workflows/workflow.py
    role = "assistant" if content.author == "agent" else "user"
    conversation.append({"role": role, "content": content.content})

P1 Filter restored turns

conversation_from_messages() restores every text task message as model input. The updated workflows also emit welcome or initialization TextContent from on_task_create, and on_task_create runs again after every continue-as-new. When a workflow recycles, those welcome messages are restored as prior assistant turns and another welcome is written to the ledger, so long-lived chats accumulate repeated initialization text in model context and user-visible history. Please filter the ledger to only real conversation turns, or skip re-emitting initialization messages on continued runs.
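
One way the first suggestion (filtering the ledger) could look is sketched below. It is only an illustration: `LedgerMessage` is a simplified stand-in for the ledger entries, and the `is_initialization` marker is a hypothetical field, not something adk.messages is known to provide.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class LedgerMessage:
    author: str  # "agent" or "user", as in the snippet under review
    content: str
    # Hypothetical marker for welcome/setup text; not a known adk.messages field.
    is_initialization: bool = False


def conversation_from_ledger(messages: Iterable[LedgerMessage]) -> List[Dict[str, str]]:
    """Rebuild model input from the ledger, skipping initialization text."""
    conversation: List[Dict[str, str]] = []
    for message in messages:
        if message.is_initialization:
            continue  # welcome messages are not real conversation turns
        role = "assistant" if message.author == "agent" else "user"
        conversation.append({"role": role, "content": message.content})
    return conversation
```

Filtering keeps the model context clean but does not stop duplicate welcome messages from reaching user-visible history; for that, the second suggestion (skip re-emitting on continued runs) would still be needed.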

Artifacts

Repro: focused runtime harness for restored initialization turn

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: script output showing welcome TextContent restored as assistant turn

  • Keeps the command output available without making the summary code-heavy.

T-Rex Ran code and verified through T-Rex

Long-lived chat/session agents run as a single Temporal workflow that stays
open indefinitely, so their event history grows until it hits Temporal's
~50k-event / 50MB limit and the workflow stalls. This adds an opt-in
continue-as-new path that recycles the history so a session can stay open
forever, plus the discipline of keeping messages/state outside workflow state
so they survive the recycle.

SDK (BaseWorkflow):
- should_continue_as_new(): recycle decision (Temporal's
  is_continue_as_new_suggested() or a configurable WORKFLOW_MAX_HISTORY_LENGTH
  threshold).
- drain_and_continue_as_new(): waits for all_handlers_finished (so an in-flight
  turn is never lost/duplicated at the boundary), then calls continue_as_new.
- run_until_complete(): drop-in replacement for the usual
  wait_condition(timeout=None) tail; gated once behind workflow.patched() so
  in-flight pre-patch workflows keep the old behaviour (no non-determinism on
  replay). Identical behaviour unless WORKFLOW_CONTINUE_AS_NEW_ENABLED is set.
- conversation_from_messages(): rebuild the conversation from the adk.messages
  ledger after a recycle (messages live in adk.messages, not workflow state).

Config (default off, so existing agents are unaffected):
- WORKFLOW_CONTINUE_AS_NEW_ENABLED (bool)
- WORKFLOW_MAX_HISTORY_LENGTH (int|None)

Examples: all 13 long-lived Temporal tutorial agents adopt run_until_complete.
Message-based chat agents rebuild conversation from adk.messages; harness
agents with an opaque session handle (claude-code, codex, claude-sdk) or rich
history (pydantic-ai via ModelMessagesTypeAdapter, langgraph) persist their
non-message state to adk.state and re-hydrate on recycle. Every adk.state /
adk.messages round-trip is guarded by the enabled flag, so the default path is
byte-for-byte unchanged.

Note: continue-as-new bounds history SIZE; it does NOT extend the chain-wide
WORKFLOW_EXECUTION_TIMEOUT_SECONDS (raise that to keep workflows long-lived).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
danielmillerp force-pushed the dm/temporal-continue-as-new branch from 4170651 to 891ef6d on June 24, 2026 21:07
Comment on lines +143 to +146 of src/agentex/lib/core/temporal/workflows/workflow.py
    if not self._continue_as_new_enabled or not workflow.patched(
        CONTINUE_AS_NEW_PATCH_ID
    ):
        await workflow.wait_condition(is_complete, timeout=None)

P1 Gate the prologue

This patch marker runs only after each workflow's on_task_create has already executed its new rehydration and welcome-message work. When WORKFLOW_CONTINUE_AS_NEW_ENABLED=true, an execution that started before this change and then replays on the new code can schedule new commands before reaching this guard: message-ledger workflows call conversation_from_messages() before run_until_complete, opaque-state workflows call adk.state.get_by_task_and_agent, and several workflows create welcome or workspace activities first. Those commands are not in the old event history, so existing long-lived workflows can still hit Temporal nondeterminism even though this branch intends to preserve the old behavior. Please move the old-run patch decision ahead of any new activity-emitting prologue work, or expose a helper that callers check before both rehydration and the recycle wait.
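
The ordering the reviewer asks for can be modelled in plain Python, without the Temporal SDK. In this sketch `run_workflow`, the command names, and the patch ID value are all hypothetical; the `patched` callable stands in for workflow.patched(), which returns False when an old history without the marker is replayed.

```python
from typing import Callable, List

# Placeholder value; the real patch ID is not shown in the PR.
CONTINUE_AS_NEW_PATCH_ID = "continue-as-new"


def run_workflow(enabled: bool, patched: Callable[[str], bool]) -> List[str]:
    """Return the commands this toy workflow would schedule, in order."""
    commands: List[str] = []
    # Decide once, before any new prologue work. Short-circuiting keeps the
    # disabled path from consulting the patch marker at all.
    recycle_active = enabled and patched(CONTINUE_AS_NEW_PATCH_ID)

    if recycle_active:
        # New prologue command: only scheduled when the gate is open.
        commands.append("rehydrate_from_ledger")

    commands.append("recycle_aware_wait" if recycle_active else "plain_wait")
    return commands
```

Because the gate is evaluated first, a replay of a pre-patch history schedules exactly the commands the old code did, which is the property the comment says the current ordering does not guarantee.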

Artifacts

Repro: focused workflow harness tracing prologue command ordering

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: harness output showing adk.messages.list before workflow.patched guard

  • Keeps the command output available without making the summary code-heavy.

T-Rex Ran code and verified through T-Rex
