Speed up and stabilize integration tests (parallelism + deploy retry)#2451

Open
GarrettBeatty wants to merge 12 commits into dev from gcbeatty/fix-custom-authorizer-deploy-retry

GarrettBeatty (Contributor) commented Jun 26, 2026

Summary

TL;DR: run the tests in parallel and fix flaky tests. Running all tests used to take 60 minutes; it now takes 30.

Started as a fix for a flaky CI failure in TestCustomAuthorizerApp.IntegrationTests, then grew into a broader effort to speed up and stabilize the integration-test phase (which dominated CI wall-clock by running everything serially), plus fixes for a few flaky unit tests surfaced along the way.

Reliability

  • Retry custom-authorizer deployment on transient IAM role propagation. The fixture's CloudFormation deploy intermittently rolled back with "The role defined for the function cannot be assumed by Lambda" — a transient IAM eventual-consistency race. DeploymentScript.ps1 now retries the deploy (deleting the rolled-back stack between attempts, since ROLLBACK_COMPLETE can't be re-created) and surfaces CloudFormation failure events.

Speed

  • Run the integration-test projects in parallel. run-integ-tests now runs each *.IntegrationTests.csproj concurrently (run-integ-tests-parallel.ps1); each project deploys its own isolated stack, so they share no state.
  • Build the test projects once, up front. The parallel runner builds all integration projects serially first, then runs dotnet test --no-build in parallel — so the concurrent runs don't each rebuild the shared IntegrationTests.Helpers project and race on its build output.
  • Stack-scoped Lambda lookup. LambdaHelper.FilterByCloudFormationStackAsync uses CloudFormation ListStackResources instead of scanning every Lambda in the account and reading each function's tags — O(stack) instead of O(account), and no shared-account throttling.
  • In-project test parallelism. TestServerlessApp and TestCustomAuthorizerApp share their single deployed-stack fixture across the assembly via IAssemblyFixture instead of one serial [Collection], so the test classes run in parallel (stack still deploys once).
  • Durable suite parallelism. Enabled parallel execution for the durable integ suite (was maxParallelThreads=1).
  • Publish durable test functions once, in a single MSBuild pass. A generated traversal project (Restore;Publish, BuildInParallel) builds the shared dependency projects once and publishes every function to its own bin/publish; tests then only zip the output — replacing per-test cold publishing.

Making durable parallelism safe (rate limits & races)

Enabling parallelism in the durable suite surfaced a series of shared-resource contention issues, fixed in layers:

  • IAM throttling → replaced per-test IAM roles with a single shared execution role (created at most once per account, reused across runs); dispose no longer deletes roles.
  • Lambda control-plane throttling → share the AWS clients statically so adaptive retry coordinates backoff across deployments, and cap concurrent control-plane calls (CreateFunction/DeleteFunction/GetFunctionConfiguration) with a suite-wide gate.
  • Shared-file races → idempotent dotnet tool install across the parallel deploy scripts, and zip each function package to a unique temp path (a function used by more than one test was being zipped to a shared path concurrently).

Developer experience

  • Live integ-test output. The parallel runner streams each project's output line-by-line (prefixed with the project name) instead of buffering until completion; failed projects get a clean reprinted block.

Flaky unit-test fixes (unrelated to the integ work, surfaced in CI)

  • Durable suspend tests (InvokeOperationTests et al.): replaced fixed Task.Delay waits before asserting suspension with a deterministic await on the termination signal (TerminationManager.TerminationTask), bounded by a timeout — the fixed delays raced under CI thread-pool pressure.
  • FileDescriptorLogStream test: the test helper trimmed trailing null bytes from captured output, which flaked ~1/256 of the time when a log header's timestamp ended in 0x00 (16-byte header read as 15). Now captures exactly the bytes written.
  • Streaming E2E test (StreamingE2EWithMoq): ResponseStreamFactory tracks the active invocation in a static field (on-demand) or an AsyncLocal (multi-concurrency), and GetCurrentContext() prefers the AsyncLocal. Several factory tests set the AsyncLocal synchronously on reused xUnit worker threads, leaking into a later on-demand test so its response silently fell back to the buffered path (CapturedHttpBytes null). Fixed test-only (no shipping code changed): the multi-concurrency tests now write the AsyncLocal on isolated Task.Run flows so the mutation can't leak across threads.

Testing

  • Affected projects build clean; PowerShell scripts parse.
  • Verified against AWS where the local environment allowed: custom-authorizer deploys its stack once and all 20 tests pass under the parallel IAssemblyFixture setup; the shared-role + single-pass publish path works (51/51 functions publish in one MSBuild pass).
  • Flaky unit-test fixes verified by stress-running: the streaming test failed ~40% of full-assembly runs before the fix and 12/12 after; the durable suspend and log-stream tests are now deterministic.
  • Remaining end-to-end durable-suite throttling/timing has been validated iteratively on real CI (each contention fix above was driven by a CI run).

…ation failure

The TestCustomAuthorizerApp integration test stack deploys many Lambda
functions that reference IAM roles created in the same stack. CloudFormation
occasionally calls Lambda CreateFunction before the role's trust policy has
propagated through IAM, producing "The role defined for the function cannot
be assumed by Lambda" and rolling the whole stack back, which fails all 20
tests in the project.

Wrap the deploy in a retry loop (3 attempts). Between attempts, delete the
rolled-back stack (a ROLLBACK_COMPLETE stack cannot be re-created) and pause
briefly to let IAM settle. Surface CloudFormation failed-resource events on
each failure for easier debugging.
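
The retry described above lives in DeploymentScript.ps1; the same control flow can be sketched in C# to make the order of operations explicit. This is a minimal sketch, not the script's code: `DeployRetry`, `RunAsync`, both delegates, and the 15-second default pause are hypothetical (the commit only says "3 attempts" and "pause briefly").

```csharp
using System;
using System.Threading.Tasks;

public static class DeployRetry
{
    // Runs deployAsync up to maxAttempts times. After every failed attempt except
    // the last, runs cleanupAsync (for example, deleting a ROLLBACK_COMPLETE stack)
    // and then pauses. The failure of the final attempt propagates to the caller.
    public static async Task RunAsync(
        Func<Task> deployAsync,
        Func<Task> cleanupAsync,
        int maxAttempts = 3,
        TimeSpan? pause = null)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await deployAsync();
                return;
            }
            catch (Exception ex) when (attempt < maxAttempts)
            {
                Console.WriteLine($"Deploy attempt {attempt} of {maxAttempts} failed: {ex.Message}");
                await cleanupAsync();
                // Assumed pause length; gives IAM time to settle before the next try.
                await Task.Delay(pause ?? TimeSpan.FromSeconds(15));
            }
        }
    }
}
```

The exception filter keeps the last failure unwrapped, so the caller still sees the original error once the attempts run out.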
@GarrettBeatty added the Release Not Needed label Jun 26, 2026
The integration-test phase ran everything serially and dominated CI wall-clock.
Four independent changes cut that down:

- run-integ-tests now runs each *.IntegrationTests.csproj concurrently
  (buildtools/run-integ-tests-parallel.ps1). Each project deploys its own
  isolated CloudFormation stack, so they share no state. Replaces the serial
  MSBuild item-batched Exec.

- LambdaHelper.FilterByCloudFormationStackAsync now lists the stack's resources
  via CloudFormation ListStackResources instead of scanning every Lambda in the
  account and reading each function's tags one at a time. O(stack size) instead
  of O(account size), and no longer throttles in a shared test account.

- TestServerlessApp and TestCustomAuthorizerApp integ tests share their single
  deployed-stack fixture across the assembly via IAssemblyFixture (the
  Xunit.Extensions.AssemblyFixture package) instead of one serial
  [Collection]. The stack still deploys once, but the test classes now run in
  parallel.

- The durable execution integ suite (45 independent tests, each deploying its
  own uniquely-named function) no longer forces maxParallelThreads=1; its build
  helper already guards concurrent publishes with a per-directory file lock.

Verified end-to-end against AWS: TestCustomAuthorizerApp deploys its stack once
and all 20 tests pass under the parallel AssemblyFixture setup.
@GarrettBeatty changed the title from "Retry TestCustomAuthorizerApp deployment on transient IAM role propagation failure" to "Speed up and stabilize integration tests (parallelism + deploy retry)" Jun 27, 2026
@GarrettBeatty reopened this Jun 27, 2026
…letion

The parallel runner captured each project's output with Out-String and only
printed it after the project finished, so nothing appeared during the long
integration-test run. Stream each line to the host as it arrives, prefixed with
the project name so the interleaved parallel logs stay attributable. Failed
projects still get their full output reprinted as one clean block at the end.
…fixed delay

InvokeOperationTests.InvokeAsync_FreshExecution_CheckpointsStartAndSuspends
failed intermittently on net10.0 (e.g. CI run on PR #2451). The suspend-path
tests kicked off an operation, slept a fixed 10-50ms, then asserted
tm.IsTerminated. Under CI thread-pool pressure the suspend signal didn't always
fire within that window, so the assert raced and failed.

TerminationManager already exposes TerminationTask, a Task that completes
exactly when Terminate() fires. Replace the fixed delays with a shared
tm.WaitForTerminationAsync() helper that awaits that task (bounded by a 10s
timeout so a genuine non-suspension still fails fast at the assert). Applied to
all 13 suspend-gated sites across 5 test files.

Verified: full suite passes on net8.0 and net10.0, and the previously-flaky
test passed 25/25 consecutive runs on net10.0. Also faster — tests resume the
instant suspension fires instead of always sleeping.
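
A bounded wait of the kind described above might look roughly like this. It is a sketch under assumptions: the real helper is called as tm.WaitForTerminationAsync(), whereas this version takes the termination task directly, and the `TerminationWait` class and the bool return value are hypothetical. Only the 10s bound and the idea of awaiting TerminationTask come from the commit message.

```csharp
using System;
using System.Threading.Tasks;

public static class TerminationWait
{
    // Waits until terminationTask completes or the timeout elapses, whichever
    // comes first. Returns true if the task completed in time. The test then
    // asserts on the termination state, so a genuine non-suspension fails at
    // the assert instead of hanging.
    public static async Task<bool> WaitForTerminationAsync(
        Task terminationTask, TimeSpan? timeout = null)
    {
        var limit = timeout ?? TimeSpan.FromSeconds(10);
        var finished = await Task.WhenAny(terminationTask, Task.Delay(limit));
        return finished == terminationTask;
    }
}
```

Because the wait returns as soon as the task completes, the timeout only costs time on the failure path.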
Running the durable integ suite in parallel (maxParallelThreads=4) surfaced two
contention problems that this addresses.

IAM 'Rate exceeded': each test created and deleted its own IAM role, so several
deployments hammered IAM's (global, single-bucket, low-rate) mutating APIs at
once. Replace per-test roles with a single shared execution role
(durable-integ-shared-execution-role) created at most once per account and
reused across tests and runs, gated so concurrent deployments don't race. It
carries the union of permissions every scenario needs (invoke durable-integ-*
functions + send durable-execution callbacks); no test depends on a role
lacking a permission, so one role is safe. Dispose no longer deletes roles.
Clients also use adaptive retry as a backstop.

Build thrash/timeouts: each test published its function separately and wiped
obj/bin first, so the shared source projects (Amazon.Lambda.DurableExecution
etc.) were rebuilt per-test, and concurrent publishes thrashed MSBuild into
'dotnet timed out'. Publish all functions once, up front, in a single MSBuild
pass via a generated traversal project (Restore;Publish, BuildInParallel) that
builds the shared projects once and publishes each function to its own
bin/publish; tests then only zip that output. Verified: 51/51 functions publish
in one ~16s pass with 0 errors, and the suite no longer throttles IAM.
MaxSizeProducesOneLogFrame intermittently failed with 'Expected: 16, Actual: 15'
on the header length. The header ends with an 8-byte big-endian microsecond
timestamp; roughly 1 in 256 timestamps ends in a 0x00 byte. TestFileStream's
Write captured bytes via TrimTrailingNullBytes(buffer).Take(count), which
stripped that legitimate trailing zero, yielding a 15-byte header.

Capture exactly buffer[offset, offset + count) instead — that is precisely what
the production code wrote, and it no longer depends on the timestamp's value.
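
To illustrate the fix, here is a small capturing stream that records exactly buffer[offset, offset + count) on each write. `CapturingStream` and its `Writes` property are hypothetical stand-ins, not the actual TestFileStream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the test stream: records what each Write call passed in.
public sealed class CapturingStream : MemoryStream
{
    public List<byte[]> Writes { get; } = new List<byte[]>();

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Copy buffer[offset, offset + count) verbatim. Trailing 0x00 bytes are
        // data (for example, the low byte of a timestamp), so nothing is trimmed.
        var copy = new byte[count];
        Array.Copy(buffer, offset, copy, 0, count);
        Writes.Add(copy);
        base.Write(buffer, offset, count);
    }
}
```

With this shape, a 16-byte header whose last byte happens to be 0x00 is still captured as 16 bytes.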
After the shared-role fix removed IAM throttling, the throttling moved to
Lambda's account-wide control-plane APIs: with maxParallelThreads=4, the
combination of CreateFunction + DeleteFunction + WaitForFunctionActive polling
GetFunctionConfiguration exceeded Lambda's limits, surfacing as 'Rate exceeded'
and adaptive retry's 'capacity could not be obtained'.

Two compounding causes addressed:

- Each deployment built its own AWS clients, so adaptive retry's per-client
  rate limiter couldn't coordinate across the parallel deployments — N clients
  each assumed they had capacity and fired at once. Make the Lambda and IAM
  clients static/shared so adaptive retry actually paces the whole suite.

- Cap concurrent Lambda control-plane calls (create/delete/get-configuration)
  with a suite-wide semaphore (limit 2) via a RunControlPlaneAsync helper, so
  the 4 parallel test threads don't collectively exceed Lambda's control-plane
  rate. Data-plane calls (Invoke, durable-execution reads) are not gated. Also
  slow the WaitForFunctionActive poll from 2s to 3s to cut its call rate.
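
The suite-wide gate can be sketched as follows. The field name LambdaControlPlaneGate, the limit of 2, and the helper name RunControlPlaneAsync come from the PR; the helper's signature and the enclosing `ControlPlane` class are assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ControlPlane
{
    // At most two control-plane calls in flight across the whole suite.
    private static readonly SemaphoreSlim LambdaControlPlaneGate = new SemaphoreSlim(2, 2);

    // Assumed shape of the helper: take a slot, run the call, always release the slot.
    public static async Task<T> RunControlPlaneAsync<T>(Func<Task<T>> call)
    {
        await LambdaControlPlaneGate.WaitAsync();
        try
        {
            return await call();
        }
        finally
        {
            LambdaControlPlaneGate.Release();
        }
    }
}
```

A static semaphore is what makes the cap suite-wide: every deployment in the process competes for the same two slots, regardless of how many test threads xUnit runs.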
The CI run no longer throttles IAM or Lambda control-plane (those fixes held),
but parallelism surfaced two shared-file races:

- 'Cannot create .../dotnet/tools/.store/amazon.lambda.tools/6.0.6 because a
  file or directory with the same name already exists': the three
  *.IntegrationTests projects run DeploymentScript.ps1 in parallel and each ran
  'dotnet tool install -g Amazon.Lambda.Tools', colliding on the global tool
  store. Make the install idempotent: skip if already installed, and tolerate
  the concurrent-install race (already-installed/already-exists treated as
  success) with a short retry.

- 'function.zip ... being used by another process' (ApproverFunction): a test
  function that is the external function for more than one test was zipped to a
  shared bin/function.zip by multiple parallel tests at once. Zip to a unique
  temp path per call instead; the read-only published output is still shared.
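
The second fix amounts to giving every zip call its own output path. A minimal sketch, assuming a hypothetical `FunctionPackager` helper and a GUID-based file name (the PR does not say how the unique path is generated):

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class FunctionPackager
{
    // Zips the shared, read-only publish directory to a path no other caller uses,
    // so parallel tests that package the same function never write the same file.
    public static string ZipToUniqueTempPath(string publishDir)
    {
        var zipPath = Path.Combine(Path.GetTempPath(), $"function-{Guid.NewGuid():N}.zip");
        ZipFile.CreateFromDirectory(publishDir, zipPath);
        return zipPath;
    }
}
```

The caller owns the returned file and would normally delete it once the function has been uploaded.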
…tput race

CI failed with 'GenerateDepsFile task failed unexpectedly ... IntegrationTests.Helpers.deps.json
is being used by another process'. The integration test projects share the
IntegrationTests.Helpers ProjectReference; running 'dotnet test' on them in parallel made each run
rebuild that shared project concurrently, racing on its build output.

Build all projects once, serially, before the parallel phase, then run the parallel 'dotnet test'
with --no-build so the concurrent runs only execute tests and never rebuild shared output. The
shared helper is built once; subsequent up-front builds are no-ops.

(The previous run also confirmed the tool-install fix works: the 'already exists' message is now
tolerated and deployment continues — that path is no longer fatal.)
… (test-only)

StreamingE2EWithMoq.Streaming_AllDataTransmitted_ContentRoundTrip flaked in CI
(Assert.NotNull(output) — CapturedHttpBytes was null) only in full-assembly
runs, never in isolation.

Root cause is cross-test contamination of ResponseStreamFactory's static state.
The factory tracks the active invocation in a static field (_onDemandContext,
on-demand mode) or an AsyncLocal (_asyncLocalContext, multi-concurrency mode),
and GetCurrentContext() prefers the AsyncLocal. Several ResponseStreamFactoryTests
called InitializeInvocation(isMultiConcurrency: true) synchronously on the xUnit
worker thread, mutating that thread's ExecutionContext; because xUnit reuses
thread-pool threads, the AsyncLocal value could remain visible to a later
on-demand test. When that test's handler called CreateStream(),
GetCurrentContext() returned the stale AsyncLocal context instead of the
on-demand one, so the bootstrap's on-demand GetStreamIfCreated() saw no stream
and the response silently fell back to the buffered path — CapturedHttpBytes
stayed null.

Fix is test-only (no shipping code changed): run the multi-concurrency tests
that write the AsyncLocal on isolated Task.Run flows so the mutation is confined
to a throwaway ExecutionContext and cannot leak across xUnit's reused threads —
the same pattern the StreamingE2EWithMoq multi-concurrency tests already use.
The streaming tests also reset factory state before each run as belt-and-suspenders.

Verified: the failure reproduced ~40% of full-assembly runs before (2/5); after,
12/12 full-assembly runs pass.
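
The leak and the fix both follow from how AsyncLocal values flow with the ExecutionContext. The sketch below isolates that behavior; `AsyncLocalIsolation` and its members are hypothetical stand-ins for the factory's _asyncLocalContext, not code from the PR.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class AsyncLocalIsolation
{
    // Hypothetical stand-in for the factory's AsyncLocal-backed context.
    public static readonly AsyncLocal<string> Context = new AsyncLocal<string>();

    // Writes synchronously: the value stays on the caller's ExecutionContext
    // after this method returns, which is how a later test could observe it.
    public static void SetInline(string value) => Context.Value = value;

    // Writes inside Task.Run: the change applies to the task's own flow of the
    // context and is not visible to the caller once the task has finished.
    public static Task SetIsolatedAsync(string value) =>
        Task.Run(() => { Context.Value = value; });
}
```

After awaiting SetIsolatedAsync, the caller still reads its previous value from Context, while after SetInline it reads the new one.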
@GarrettBeatty force-pushed the gcbeatty/fix-custom-authorizer-deploy-retry branch from c7eaf3c to 9e6af88 June 29, 2026 15:36
…llel publish

CI failed in the durable pre-publish step with NuGet error:
  'The file .../Amazon.Lambda.Serialization.SystemTextJson/obj/project.assets.json
   already exists.'

The single-MSBuild-pass traversal published all function projects with
Targets=Restore;Publish and BuildInParallel=true. Restore is not parallel-safe:
the function projects share src ProjectReferences (Serialization.SystemTextJson,
DurableExecution, Core, RuntimeSupport), so restoring them concurrently raced on
the shared obj/project.assets.json.

Split into two passes inside the traversal: a single non-parallel Restore across
all projects (writes each shared project's assets once), then the parallel
Publish (restore already done, so no shared-output race). Verified from a fully
cold state (function + shared src obj dirs nuked) — 51/51 functions publish with
0 'already exists' errors.
The TestCustomAuthorizerApp REST API (API Gateway v1) valid-auth tests
(RestUserInfo_WithValidAuth, SimpleRestApiUserInfo_WithValidAuth)
intermittently returned 403 instead of 200. API Gateway returns 403 on the
authorizer allow path when the Lambda authorizer wiring has not finished
propagating to the endpoint being hit.

Three compounding causes, fixed at three layers:

- Root cause: AnnotationsRestApi had no EndpointConfiguration, so SAM
  defaulted to EDGE-optimized. Edge endpoints front through CloudFront and
  propagate over minutes, unevenly across edge PoPs, so a warmed endpoint
  could still 403 on a request that hit a different PoP. Set the REST API to
  REGIONAL (invoke URL format unchanged). The generator never writes
  EndpointConfiguration, so this survives template regeneration.

- Warm-up coverage gap: WarmUpApisAsync only warmed 2 of 4 authorizers and
  never warmed SimpleRestAuthorizer. Now warms one allow path per distinct
  authorizer, REST endpoints first (they settle slower than HTTP v2).

- Per-test resilience: add RetryHelper.SendWithRetryOnForbiddenAsync (takes a
  request factory since HttpRequestMessage cannot be resent) and a
  GetWithValidTokenAsync fixture helper. All 9 allow-path tests now retry a
  transient 403 instead of failing. Deny/no-auth/partial-context tests, which
  legitimately expect 403/401, are unchanged.

Verified locally: all 20 tests pass, stack deploys first try with regional REST API.
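
The per-test retry could be shaped roughly like this. Only the idea of retrying on 403 with a request factory comes from the commit message; the `RetryOnForbidden` class, the parameter list, the attempt count, and the delay are assumptions, and the real RetryHelper.SendWithRetryOnForbiddenAsync may differ.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class RetryOnForbidden
{
    // Builds a fresh request from requestFactory on each attempt, because an
    // HttpRequestMessage instance cannot be sent twice. Retries only on 403;
    // any other status, or the last attempt's response, is returned as-is.
    public static async Task<HttpResponseMessage> SendAsync(
        HttpClient client,
        Func<HttpRequestMessage> requestFactory,
        int maxAttempts = 5,
        TimeSpan? delay = null)
    {
        for (var attempt = 1; ; attempt++)
        {
            using var request = requestFactory();
            var response = await client.SendAsync(request);
            if (response.StatusCode != HttpStatusCode.Forbidden || attempt >= maxAttempts)
            {
                return response;
            }
            response.Dispose();
            // Assumed delay; gives the authorizer wiring time to propagate.
            await Task.Delay(delay ?? TimeSpan.FromSeconds(2));
        }
    }
}
```

Tests that legitimately expect 403 or 401 would keep calling the client directly, so a real deny is never masked by the retry.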
@GarrettBeatty marked this pull request as ready for review June 29, 2026 18:00
@GarrettBeatty requested review from a team as code owners June 29, 2026 18:00
@GarrettBeatty requested review from normj and philasmar June 29, 2026 18:00
// account-rate-limited and are the next bottleneck once IAM is no longer per-test. Cap how many
// run concurrently across the whole suite so the parallel deployments don't collectively exceed
// Lambda's limits; data-plane calls (Invoke, durable-execution reads) are not gated.
private static readonly SemaphoreSlim LambdaControlPlaneGate = new(2, 2);

Contributor Author

Even though it runs in parallel, I still throttle it a bit to avoid hitting rate limits.

Labels

Release Not Needed


2 participants