feat(temporal): opt-in continue-as-new for long-lived agent workflows by danielmillerp · Pull Request #447 · scaleapi/scale-agentex-python

danielmillerp · 2026-06-24T20:18:40Z

Why

Long-lived chat/session agents (e.g. the Emu/FDD researcher) run as a single Temporal workflow that stays open indefinitely. Their event history grows until it hits Temporal's ~50k-event / 50MB limit and the workflow stalls — this is the root cause behind "chats die / state outgrows the 2MB payload" (P0 for EY).

This PR adds an opt-in continue-as-new path so a session can stay open forever by recycling its history, plus the discipline of keeping messages/state outside workflow state so they survive the recycle.

Two orthogonal levers: continue-as-new bounds history size; the chain-wide WORKFLOW_EXECUTION_TIMEOUT_SECONDS bounds wall-clock lifetime. continue-as-new does not extend the execution timeout — raise that knob too to keep workflows long-lived.

SDK — `BaseWorkflow` helpers (opt-in)

should_continue_as_new(): recycle decision, based on Temporal's is_continue_as_new_suggested() or a configurable WORKFLOW_MAX_HISTORY_LENGTH threshold.
drain_and_continue_as_new(): waits for all_handlers_finished (so an in-flight turn isn't lost or duplicated at the boundary), then calls continue_as_new.
run_until_complete(): drop-in replacement for the usual wait_condition(timeout=None) tail. Gated once behind workflow.patched() so in-flight pre-patch workflows keep the old behaviour and don't hit a non-determinism error on replay.
conversation_from_messages(): rebuilds the conversation from the adk.messages ledger after a recycle (messages live in adk.messages, not workflow state).
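
The recycle decision described for should_continue_as_new() reduces to a small predicate. The following is a minimal sketch under stated assumptions, not the PR's implementation: `should_recycle` and its parameter names are hypothetical, the `>=` comparison is a guess at the threshold semantics, and in a real workflow the two inputs would presumably come from `workflow.info().is_continue_as_new_suggested()` and `workflow.info().get_current_history_length()` in the Temporal Python SDK.

```python
from typing import Optional


def should_recycle(
    enabled: bool,
    suggested_by_server: bool,
    history_length: int,
    max_history_length: Optional[int],
) -> bool:
    """Return True when the run should continue-as-new (illustrative only)."""
    if not enabled:
        # Default-off: existing agents never recycle.
        return False
    if suggested_by_server:
        # Temporal itself reports that the history is getting large.
        return True
    if max_history_length is None:
        # No explicit threshold configured; rely on the server suggestion alone.
        return False
    # Assumption: the threshold is inclusive.
    return history_length >= max_history_length
```

Keeping the decision a pure function of its inputs makes it easy to unit test without a Temporal test server, which matches the scope of the tests listed under Verification.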

Config (default OFF — existing agents unaffected)

WORKFLOW_CONTINUE_AS_NEW_ENABLED (bool)
WORKFLOW_MAX_HISTORY_LENGTH (int | None)
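
To illustrate how the two settings might be read, here is a hedged sketch that parses them from a mapping of environment variables. Only the two variable names come from this PR; `ContinueAsNewConfig`, `load_config`, and the accepted truthy strings are assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ContinueAsNewConfig:
    """Hypothetical container for the two settings named in this PR."""

    enabled: bool = False
    max_history_length: Optional[int] = None


def load_config(env: Mapping[str, str] = os.environ) -> ContinueAsNewConfig:
    """Parse the settings; unset or empty values fall back to the defaults."""
    raw_enabled = env.get("WORKFLOW_CONTINUE_AS_NEW_ENABLED", "").strip().lower()
    raw_max = env.get("WORKFLOW_MAX_HISTORY_LENGTH", "").strip()
    return ContinueAsNewConfig(
        # Assumption: these strings count as "on"; anything else is off.
        enabled=raw_enabled in {"1", "true", "yes"},
        max_history_length=int(raw_max) if raw_max else None,
    )
```

With nothing set, `load_config({})` yields the default-off configuration, so existing agents would see no change.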

Examples

All 13 long-lived Temporal tutorial agents adopt run_until_complete:

Message-based chat (010, 050, 060, 070, 080, 100, 120) — rebuild conversation from adk.messages.
Harness/session — persist non-message state to adk.state and re-hydrate on recycle: opaque session handles for claude-sdk (090), claude-code (140), codex (150); rich ModelMessage history for pydantic-ai (110, via ModelMessagesTypeAdapter); langgraph (130) rebuilds from the ledger.
000 (no per-turn state) just swaps the wait.

Every adk.state / adk.messages round-trip is guarded by the enabled flag, so the default path is byte-for-byte unchanged.
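
The persist-and-rehydrate pattern for harness/session agents can be sketched without Temporal at all. Everything below is a stand-in: `InMemoryStateStore` is not the adk.state API, and `SessionRun` only models one run in a continue-as-new chain. The point it shows is that both round-trips are skipped when the flag is off.

```python
from typing import Dict, Optional


class InMemoryStateStore:
    """Stand-in for an external state service (hypothetical, not adk.state)."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, str]] = {}
        self.calls = 0  # counts round-trips so the default-off path can be checked

    def get(self, task_id: str) -> Optional[Dict[str, str]]:
        self.calls += 1
        return self._data.get(task_id)

    def put(self, task_id: str, state: Dict[str, str]) -> None:
        self.calls += 1
        self._data[task_id] = dict(state)


class SessionRun:
    """One run in a continue-as-new chain; in-memory fields die with the run."""

    def __init__(self, task_id: str, store: InMemoryStateStore, enabled: bool) -> None:
        self.task_id = task_id
        self.store = store
        self.enabled = enabled
        self.session_handle: Optional[str] = None

    def rehydrate(self) -> None:
        if not self.enabled:
            return  # default path: no extra round-trip
        saved = self.store.get(self.task_id)
        if saved is not None:
            self.session_handle = saved.get("session_handle")

    def persist(self) -> None:
        if not self.enabled or self.session_handle is None:
            return
        self.store.put(self.task_id, {"session_handle": self.session_handle})
```

A second `SessionRun` built against the same store and task ID plays the role of the run that starts after a recycle: calling `rehydrate()` on it restores the handle that the first run persisted.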

Verification

New unit tests for the recycle decision logic — tests/lib/core/temporal/test_base_workflow_continue_as_new.py (5 passing).
Full tests/lib/core/temporal suite: 8 passed, no regressions.
py_compile + ruff clean across all 16 changed files.

Follow-ups (not in this PR)

Replay/integration test of drain_and_continue_as_new against a Temporal test server.
Validate the pattern (drain + patch + chain-timeout) with the Temporal team before enabling in production.
Optional platform-level "transparent for all agents" variant (SDK owns the run loop) — deferred per discussion.

🤖 Generated with Claude Code

Greptile Summary

This PR adds opt-in continue-as-new support for long-lived Temporal agent workflows. The main changes are:

New BaseWorkflow helpers for recycle decisions, draining handlers, and rebuilding conversations.
New environment flags for enabling continue-as-new and setting a history threshold.
Updated Temporal tutorial workflows to use run_until_complete.
Message-ledger and adk.state rehydration for workflows that need state across recycles.
Unit tests for the continue-as-new decision logic.

Confidence Score: 3/5

The feature is opt-in and well scoped, but the replay compatibility guard needs to move ahead of any activity-emitting prologue work before this can be safely enabled for existing long-lived workflows.

The implementation has targeted tests for the recycle decision logic and keeps the default path disabled, but replay safety for upgraded executions is a key requirement for Temporal workflow changes and remains unresolved.

src/agentex/lib/core/temporal/workflows/workflow.py and the Temporal tutorial workflow prologues that perform rehydration, state loading, welcome-message, or workspace setup before entering the guarded wait helper.

T-Rex Logs

What T-Rex did

Reproduced the prologue gate verification by running a focused workflow harness with WORKFLOW_CONTINUE_AS_NEW_ENABLED=true and tracing adk.messages.list before the BaseWorkflow.run_until_complete evaluation of workflow.patched.
Verified the before-and-after state of temporal SDK helpers by comparing artifacts; before lacked the helpers, after includes them and shows EXIT_CODE 0.
Observed changes in workflow memory behavior with rehydrate/persist: without rehydrate, base kept 010 conversation and 150 Codex thread in memory; with rehydrate enabled, default-off paths do not add adk.messages/adk.state calls, while enabled paths rebuild 010 conversation from adk.messages and restore/update 150 Codex session state via adk.state.
Compared before and head runtime config artifacts to confirm updated head runtime import/script output and that both have command, working directory, and exit code; this validation targets the Python config contract, not an HTTP endpoint.

Ran code and verified through T-Rex

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
src/agentex/lib/core/temporal/workflows/workflow.py:143-146
**Gate the prologue**

This patch marker runs only after each workflow's `on_task_create` has already executed its new rehydration and welcome-message work. When `WORKFLOW_CONTINUE_AS_NEW_ENABLED=true`, an execution that started before this change and then replays on the new code can schedule new commands before reaching this guard: message-ledger workflows call `conversation_from_messages()` before `run_until_complete`, opaque-state workflows call `adk.state.get_by_task_and_agent`, and several workflows create welcome or workspace activities first. Those commands are not in the old event history, so existing long-lived workflows can still hit Temporal nondeterminism even though this branch intends to preserve the old behavior. Please move the old-run patch decision ahead of any new activity-emitting prologue work, or expose a helper that callers check before both rehydration and the recycle wait.

Reviews (2): Last reviewed commit: "feat(temporal): opt-in continue-as-new f..."

Greptile also left 1 inline comment on this PR.

socket-security · 2026-06-24T20:23:24Z

Review the following changes in direct dependencies.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	pypi/agentex-sdk@0.13.0 ⏵ 0.14.0	⁺¹
	pypi/agentex-client@0.13.0 ⏵ 0.15.0	⁺¹		⁺¹


greptile-apps · 2026-06-24T20:28:03Z

+            role = "assistant" if content.author == "agent" else "user"
+            conversation.append({"role": role, "content": content.content})


Filter restored turns

conversation_from_messages() restores every text task message as model input. The updated workflows also emit welcome or initialization TextContent from on_task_create, and on_task_create runs again after every continue-as-new. When a workflow recycles, those welcome messages are restored as prior assistant turns and another welcome is written to the ledger, so long-lived chats accumulate repeated initialization text in model context and user-visible history. Please filter the ledger to only real conversation turns, or skip re-emitting initialization messages on continued runs.
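
One way to act on the first suggestion (filter the ledger to real conversation turns) is sketched below. It mirrors the role mapping from the diff above, but `LedgerMessage` and its `is_initialization` marker are hypothetical: the actual ledger schema is not shown here, so how welcome messages would be tagged is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class LedgerMessage:
    """Simplified stand-in for a text message read back from the ledger."""

    author: str  # "agent" or "user", as in the diff above
    content: str
    is_initialization: bool = False  # hypothetical marker, not a known SDK field


def conversation_from_ledger(messages: Iterable[LedgerMessage]) -> List[Dict[str, str]]:
    """Rebuild model input, skipping welcome/initialization messages."""
    conversation: List[Dict[str, str]] = []
    for message in messages:
        if message.is_initialization:
            # Welcome text is re-emitted on every recycle; keep it out of model context.
            continue
        role = "assistant" if message.author == "agent" else "user"
        conversation.append({"role": role, "content": message.content})
    return conversation
```

This only addresses the model-context half of the problem. Repeated welcome messages in user-visible history would still need the second suggestion, skipping the re-emit on continued runs.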

Artifacts

Repro: focused runtime harness for restored initialization turn

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: script output showing welcome TextContent restored as assistant turn

Keeps the command output available without making the summary code-heavy.

Ran code and verified through T-Rex

Prompt To Fix With AI

This is a comment left during a code review.

Path: src/agentex/lib/core/temporal/workflows/workflow.py
Line: 187-188

Comment:
**Filter restored turns**

`conversation_from_messages()` restores every text task message as model input. The updated workflows also emit welcome or initialization `TextContent` from `on_task_create`, and `on_task_create` runs again after every continue-as-new. When a workflow recycles, those welcome messages are restored as prior assistant turns and another welcome is written to the ledger, so long-lived chats accumulate repeated initialization text in model context and user-visible history. Please filter the ledger to only real conversation turns, or skip re-emitting initialization messages on continued runs.

How can I resolve this? If you propose a fix, please make it concise.

Long-lived chat/session agents run as a single Temporal workflow that stays open indefinitely, so their event history grows until it hits Temporal's ~50k-event / 50MB limit and the workflow stalls. This adds an opt-in continue-as-new path that recycles the history so a session can stay open forever, plus the discipline of keeping messages/state outside workflow state so they survive the recycle.

SDK (BaseWorkflow):
- should_continue_as_new(): recycle decision (Temporal's is_continue_as_new_suggested() or a configurable WORKFLOW_MAX_HISTORY_LENGTH threshold).
- drain_and_continue_as_new(): waits all_handlers_finished (so an in-flight turn is never lost/duplicated at the boundary) then continue_as_new.
- run_until_complete(): drop-in replacement for the usual wait_condition(timeout=None) tail; gated once behind workflow.patched() so in-flight pre-patch workflows keep the old behaviour (no non-determinism on replay). Identical behaviour unless WORKFLOW_CONTINUE_AS_NEW_ENABLED is set.
- conversation_from_messages(): rebuild the conversation from the adk.messages ledger after a recycle (messages live in adk.messages, not workflow state).

Config (default off, so existing agents are unaffected):
- WORKFLOW_CONTINUE_AS_NEW_ENABLED (bool)
- WORKFLOW_MAX_HISTORY_LENGTH (int|None)

Examples: all 13 long-lived Temporal tutorial agents adopt run_until_complete. Message-based chat agents rebuild conversation from adk.messages; harness agents with an opaque session handle (claude-code, codex, claude-sdk) or rich history (pydantic-ai via ModelMessagesTypeAdapter, langgraph) persist their non-message state to adk.state and re-hydrate on recycle. Every adk.state / adk.messages round-trip is guarded by the enabled flag, so the default path is byte-for-byte unchanged.

Note: continue-as-new bounds history SIZE; it does NOT extend the chain-wide WORKFLOW_EXECUTION_TIMEOUT_SECONDS (raise that to keep workflows long-lived).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-06-24T21:13:00Z

+        if not self._continue_as_new_enabled or not workflow.patched(
+            CONTINUE_AS_NEW_PATCH_ID
+        ):
+            await workflow.wait_condition(is_complete, timeout=None)


Gate the prologue

This patch marker runs only after each workflow's on_task_create has already executed its new rehydration and welcome-message work. When WORKFLOW_CONTINUE_AS_NEW_ENABLED=true, an execution that started before this change and then replays on the new code can schedule new commands before reaching this guard: message-ledger workflows call conversation_from_messages() before run_until_complete, opaque-state workflows call adk.state.get_by_task_and_agent, and several workflows create welcome or workspace activities first. Those commands are not in the old event history, so existing long-lived workflows can still hit Temporal nondeterminism even though this branch intends to preserve the old behavior. Please move the old-run patch decision ahead of any new activity-emitting prologue work, or expose a helper that callers check before both rehydration and the recycle wait.
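
The second option in this comment, a helper that callers check before both rehydration and the recycle wait, might look roughly like the sketch below. It is a simulation, not Temporal code: `workflow.patched()` is replaced by a plain callable so the ordering can be checked in isolation, `recycle_active` and `run_workflow` are hypothetical names, and the patch ID string is a placeholder because the real value is not shown on this page.

```python
from typing import Callable, List

# Placeholder value: the real CONTINUE_AS_NEW_PATCH_ID string is not shown here.
CONTINUE_AS_NEW_PATCH_ID = "continue-as-new"


def recycle_active(enabled: bool, patched: Callable[[str], bool]) -> bool:
    """Hypothetical single gate, checked before rehydration and before the wait."""
    # Short-circuits so the patch check is never consulted when the flag is off.
    return enabled and patched(CONTINUE_AS_NEW_PATCH_ID)


def run_workflow(enabled: bool, patched: Callable[[str], bool]) -> List[str]:
    """Return the steps one run would take, in order (simulation only)."""
    steps: List[str] = []
    active = recycle_active(enabled, patched)  # decided once, up front
    if active:
        steps.append("rehydrate_from_ledger")  # new prologue work
        steps.append("recycle_aware_wait")
    else:
        steps.append("legacy_wait")  # the pre-change behaviour, nothing new scheduled
    return steps
```

Passing a callable that returns `False` models an execution whose history predates the patch marker: it takes only the legacy step, so no rehydration work runs ahead of the guard. Whether calling `workflow.patched()` this early is acceptable in every tutorial workflow would still need the replay test listed under Follow-ups.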

Artifacts

Repro: focused workflow harness tracing prologue command ordering

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: harness output showing adk.messages.list before workflow.patched guard

Keeps the command output available without making the summary code-heavy.

Ran code and verified through T-Rex


danielmillerp changed the base branch from main to next June 24, 2026 20:20

danielmillerp force-pushed the dm/temporal-continue-as-new branch from 5d63a08 to 4170651 Compare June 24, 2026 20:22

greptile-apps Bot reviewed Jun 24, 2026


danielmillerp force-pushed the dm/temporal-continue-as-new branch from 4170651 to 891ef6d Compare June 24, 2026 21:07

greptile-apps Bot reviewed Jun 24, 2026


stainless-app Bot force-pushed the next branch from 5b4359d to 521c60d Compare June 24, 2026 22:11

danielmillerp wants to merge 1 commit into next from dm/temporal-continue-as-new
