stfsender: fix segfault on termination (R3C-1147)#193

Open
ktf wants to merge 1 commit into AliceO2Group:dev from ktf:fix-stop-segfault


ktf (Member) commented Jun 24, 2026

At the end of every run, all TfBuilders disconnect from each StfSender. Each UCX disconnect (StfSenderOutputUCX::disconnectTfBuilder) spawned a detached thread that kept progressing a UCX worker and touching object state, but stop() never waited for it: it went straight to ucp_worker_destroy()/ucp_cleanup(), and the object was then destroyed. The still-running detached threads then used freed UCX workers or the destroyed object, causing a SIGSEGV (core dumped) and a burst of errors as connections dropped.

  • StfSenderOutputUCX: track the async endpoint-close threads instead of detaching them and join them in stop() while the workers/context are still valid; add a destructor as a safety net in case stop() is skipped.
  • StfSenderDevice::ResetTask: stop the gRPC server before the output handler, so that no late connect/disconnect/data request can reference the output handler mid-teardown (the unconditional output stop from ce899b9 is kept for flp-only runs).
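
For reviewers, the two changes above can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: `FakeWorker`, `Output`, `RpcServer` and `resetTask` are hypothetical stand-ins (no real UCX or gRPC calls are made), and the sketch only assumes the pattern described above: close threads are stored rather than detached, `stop()` joins them before the worker is destroyed, the destructor calls `stop()` as a safety net, and the request front end is stopped before the output.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a UCX worker; it only records misuse.
struct FakeWorker {
  std::atomic<bool> alive{true};
  std::atomic<int> progress_calls{0};
  std::atomic<int> use_after_destroy{0};

  void progress() {
    if (!alive.load()) { ++use_after_destroy; }  // would be a crash in real code
    ++progress_calls;
  }
  void destroy() { alive = false; }
};

// Hypothetical output handler that tracks its endpoint-close threads.
class Output {
 public:
  ~Output() { stop(); }  // safety net in case stop() was skipped

  // Called once per disconnecting peer; closes the endpoint asynchronously.
  void disconnectPeer() {
    std::lock_guard<std::mutex> lock(mThreadsLock);
    if (mStopped) { return; }  // refuse new work during/after teardown
    mCloseThreads.emplace_back([this] {
      for (int i = 0; i < 50; ++i) {
        mWorker.progress();
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }

  // Idempotent: joins all close threads, then destroys the worker.
  void stop() {
    std::vector<std::thread> threads;
    {
      std::lock_guard<std::mutex> lock(mThreadsLock);
      if (mStopped) { return; }
      mStopped = true;
      threads.swap(mCloseThreads);
    }
    for (auto& t : threads) {
      if (t.joinable()) { t.join(); }
    }
    mWorker.destroy();  // safe: no thread can touch the worker any more
  }

  const FakeWorker& worker() const { return mWorker; }

 private:
  FakeWorker mWorker;
  std::mutex mThreadsLock;
  bool mStopped = false;
  std::vector<std::thread> mCloseThreads;
};

// Hypothetical stand-in for the RPC front end that forwards requests.
struct RpcServer {
  Output* output = nullptr;
  bool running = true;

  void stop() { running = false; }

  // Simulates a disconnect request; rejected once the server is stopped.
  bool handleDisconnect() {
    if (!running) { return false; }
    output->disconnectPeer();
    return true;
  }
};

// Teardown order: stop the request source first, then the output handler.
void resetTask(RpcServer& rpc, Output& out) {
  rpc.stop();
  out.stop();
}
```

In this sketch, calling `handleDisconnect()` a few times and then `resetTask()` leaves `use_after_destroy` at zero, because every close thread has been joined before `destroy()` runs. With detached threads and no join, the same sequence could call `progress()` on an already destroyed worker.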

Ref: https://its.cern.ch/jira/browse/R3C-1147
