Skip to content

[fix](be) Avoid finalized pipeline task submit crash#64899

Open
BiteTheDDDDt wants to merge 1 commit into
apache:masterfrom
BiteTheDDDDt:codex/fix-pipeline-task-submit-race
Open

[fix](be) Avoid finalized pipeline task submit crash#64899
BiteTheDDDDt wants to merge 1 commit into
apache:masterfrom
BiteTheDDDDt:codex/fix-pipeline-task-submit-race

Conversation

@BiteTheDDDDt

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: A runtime filter dependency can wake a pipeline task while another thread is closing or finalizing the same task. HybridTaskScheduler::submit() synchronously calls PipelineTask::is_blockable() before enqueueing, and is_blockable() reads _sink and _operators. After terminal close/finalize starts releasing those resources, a late submit can dereference cleared task resources and crash.

This PR adds a small submit gate on PipelineTask. Terminal close/finalize closes that gate under a dedicated lock, and HybridTaskScheduler::submit() checks it under the same lock before calling is_blockable(). The lock only covers the blockable check; it does not cover the actual queue submit.

Release note

None

Check List (For Author)

  • Test: Unit Test / Static check
    • ./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT
    • build-support/check-format.sh
    • CLANG_TIDY_BINARY=/mnt/disk6/common/ldb_toolchain_028/bin/clang-tidy build-support/run-clang-tidy.sh --base upstream/master --build-dir be/ut_build_ASAN
    • git diff --check
  • Behavior changed: No
  • Does this need documentation: No

@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt BiteTheDDDDt marked this pull request as ready for review June 26, 2026 10:58
@BiteTheDDDDt

Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-H: Total hot run time: 29197 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0e148ec3a0006b3fed5b0d6a0d3ecf85709fe10a, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17721	4021	3981	3981
q2	2025	305	180	180
q3	10336	1432	819	819
q4	4687	466	332	332
q5	7501	1010	595	595
q6	187	163	133	133
q7	775	863	643	643
q8	9367	1588	1630	1588
q9	5596	4562	4528	4528
q10	6768	1793	1523	1523
q11	439	271	248	248
q12	628	429	293	293
q13	18106	3351	2793	2793
q14	277	254	241	241
q15	q16	791	772	711	711
q17	1046	930	910	910
q18	7058	5714	5610	5610
q19	1321	1403	1143	1143
q20	504	411	262	262
q21	6088	2644	2368	2368
q22	438	358	296	296
Total cold run time: 101659 ms
Total hot run time: 29197 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4324	4253	4232	4232
q2	329	345	222	222
q3	4561	4962	4382	4382
q4	2070	2129	1383	1383
q5	4403	4296	4272	4272
q6	232	178	129	129
q7	1720	1635	1957	1635
q8	2612	2177	2120	2120
q9	8312	8478	8014	8014
q10	4790	4721	4281	4281
q11	578	403	405	403
q12	720	754	570	570
q13	3225	3675	2970	2970
q14	294	298	282	282
q15	q16	713	727	634	634
q17	1347	1305	1352	1305
q18	7958	7325	7242	7242
q19	1167	1135	1122	1122
q20	2201	2228	1942	1942
q21	5655	4561	4333	4333
q22	509	460	396	396
Total cold run time: 57720 ms
Total hot run time: 51869 ms

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-DS: Total hot run time: 171345 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0e148ec3a0006b3fed5b0d6a0d3ecf85709fe10a, data reload: false

query5	4370	636	485	485
query6	448	204	179	179
query7	4818	564	315	315
query8	357	180	168	168
query9	8751	4017	3985	3985
query10	440	314	253	253
query11	5923	2335	2132	2132
query12	148	99	98	98
query13	1265	604	406	406
query14	6157	5298	4925	4925
query14_1	4261	4235	4272	4235
query15	208	207	182	182
query16	1001	456	454	454
query17	920	698	592	592
query18	2438	481	355	355
query19	207	190	150	150
query20	110	108	107	107
query21	219	140	116	116
query22	13613	13688	13428	13428
query23	17455	16546	16144	16144
query23_1	16315	16180	16220	16180
query24	7444	1771	1297	1297
query24_1	1340	1311	1319	1311
query25	568	463	407	407
query26	1309	331	178	178
query27	2666	556	345	345
query28	4488	2031	2068	2031
query29	1098	613	507	507
query30	318	227	203	203
query31	1139	1072	963	963
query32	103	65	62	62
query33	530	326	263	263
query34	1170	1118	664	664
query35	778	795	667	667
query36	1426	1411	1231	1231
query37	157	113	91	91
query38	1900	1725	1647	1647
query39	924	919	895	895
query39_1	891	878	886	878
query40	225	126	104	104
query41	72	69	69	69
query42	91	91	94	91
query43	318	325	279	279
query44	1423	799	813	799
query45	212	193	182	182
query46	1115	1210	764	764
query47	2357	2383	2201	2201
query48	415	429	300	300
query49	580	433	322	322
query50	1011	363	263	263
query51	4430	4409	4364	4364
query52	83	82	71	71
query53	269	268	194	194
query54	274	230	213	213
query55	75	73	72	72
query56	242	229	242	229
query57	1460	1412	1328	1328
query58	258	257	207	207
query59	1560	1621	1411	1411
query60	277	242	231	231
query61	156	143	144	143
query62	710	639	585	585
query63	232	192	188	188
query64	2512	739	605	605
query65	4825	4822	4767	4767
query66	1794	457	327	327
query67	29172	28737	28603	28603
query68	3055	1562	943	943
query69	416	297	259	259
query70	1067	1014	963	963
query71	300	237	221	221
query72	2866	2582	2284	2284
query73	844	758	402	402
query74	5096	4896	4786	4786
query75	2578	2539	2174	2174
query76	2302	1166	811	811
query77	343	376	282	282
query78	12562	12448	11800	11800
query79	1417	1109	774	774
query80	829	464	387	387
query81	481	285	244	244
query82	569	159	118	118
query83	318	270	247	247
query84	306	142	112	112
query85	891	512	412	412
query86	407	314	287	287
query87	1856	1829	1757	1757
query88	3699	2837	2783	2783
query89	426	385	336	336
query90	1858	180	183	180
query91	171	156	128	128
query92	60	63	56	56
query93	1441	1445	894	894
query94	615	343	318	318
query95	672	375	452	375
query96	1095	810	343	343
query97	2720	2689	2557	2557
query98	215	203	199	199
query99	1165	1143	1029	1029
Total cold run time: 256881 ms
Total hot run time: 171345 ms

@hello-stephen

Copy link
Copy Markdown
Contributor
ClickBench: Total hot run time: 25.07 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 0e148ec3a0006b3fed5b0d6a0d3ecf85709fe10a, data reload: false

query1	0.00	0.00	0.00
query2	0.09	0.05	0.05
query3	0.26	0.14	0.13
query4	1.60	0.13	0.14
query5	0.26	0.22	0.21
query6	1.24	1.05	1.08
query7	0.03	0.00	0.01
query8	0.06	0.04	0.04
query9	0.38	0.31	0.30
query10	0.57	0.55	0.53
query11	0.20	0.14	0.14
query12	0.18	0.16	0.15
query13	0.48	0.45	0.48
query14	1.04	1.03	1.02
query15	0.61	0.60	0.59
query16	0.30	0.32	0.33
query17	1.10	1.06	1.10
query18	0.24	0.22	0.21
query19	1.98	1.96	2.04
query20	0.02	0.01	0.01
query21	15.43	0.17	0.13
query22	5.00	0.06	0.05
query23	16.17	0.30	0.11
query24	2.98	0.44	0.31
query25	0.11	0.06	0.04
query26	0.74	0.21	0.16
query27	0.05	0.04	0.03
query28	3.43	0.89	0.53
query29	12.49	4.30	3.47
query30	0.26	0.15	0.15
query31	2.76	0.59	0.32
query32	3.21	0.60	0.49
query33	3.14	3.18	3.19
query34	15.80	4.18	3.49
query35	3.56	3.47	3.53
query36	0.55	0.43	0.41
query37	0.08	0.07	0.06
query38	0.05	0.04	0.04
query39	0.04	0.02	0.02
query40	0.18	0.16	0.16
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 96.84 s
Total hot run time: 25.07 s

@hello-stephen

Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 82.35% (14/17) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.91% (21607/39351)
Line Coverage 38.42% (206579/537725)
Region Coverage 34.47% (162485/471381)
Branch Coverage 35.50% (71185/200542)

@hello-stephen

Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (17/17) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.31% (28540/38409)
Line Coverage 58.18% (311013/534527)
Region Coverage 54.81% (259510/473482)
Branch Coverage 56.24% (112912/200785)

@BiteTheDDDDt

Copy link
Copy Markdown
Contributor Author

/review

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one test coverage gap in the new scheduler lifecycle protection.

Critical checkpoint conclusions:

  • Goal: the code adds a submit gate to avoid late HybridTaskScheduler::submit() calling is_blockable() after terminal task cleanup starts. The production path is small and focused, but the added test only proves post-finalize rejection.
  • Concurrency/lifecycle: the relevant race is dependency wake-up submit versus terminal close/finalize. The new lock/gate is aimed at HybridTaskScheduler::submit(); no deadlock or normal scheduling-order regression was found in the inspected paths, but close-time behavior needs direct coverage.
  • Config/compatibility/persistence/data writes: no new config, wire/storage compatibility, transaction, or persistence behavior is involved.
  • Parallel paths: reviewed wake_up, dependency set_ready/is_blocked_by, self-resubmit, RevokableTask, and blockable join/aggregate/spill paths; no separate correctness issue found.
  • Tests: TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT should also cover the close-time gate before finalize, otherwise the close() change can regress while this test still passes.

Validation:

  • git diff --check for the four GitHub PR files passed.
  • build-support/check-format.sh could not run to completion because the runner clang-format is not v16.
  • ./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT failed during setup because thirdparty/installed/bin/protoc is missing.

User focus: no additional user-provided review focus was supplied.

Subagent conclusions: optimizer-rewrite reported no candidates. tests-session-config reported TSC-1, which became the single inline comment. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for this same final comment set.

Comment thread be/test/exec/pipeline/pipeline_task_test.cpp
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: A runtime filter dependency can wake a pipeline task while another thread is closing or finalizing the same task. HybridTaskScheduler::submit() synchronously calls PipelineTask::is_blockable() before enqueueing the task, and is_blockable() reads _sink and _operators. After close/finalize starts releasing terminal task resources, a late submit can therefore dereference cleared task resources and crash.

This change adds a small submit gate on PipelineTask. Terminal close/finalize closes that gate under a dedicated lock, and HybridTaskScheduler::submit() checks the gate under the same lock before calling is_blockable(). The lock only covers the blockable check and does not extend to the actual queue submit.

### Release note

None

### Check List (For Author)

- Test: Unit Test / Static check
    - ./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT
    - build-support/check-format.sh
    - CLANG_TIDY_BINARY=/mnt/disk6/common/ldb_toolchain_028/bin/clang-tidy build-support/run-clang-tidy.sh --base upstream/master --build-dir be/ut_build_ASAN
    - git diff --check
- Behavior changed: No
- Does this need documentation: No
@BiteTheDDDDt BiteTheDDDDt force-pushed the codex/fix-pipeline-task-submit-race branch from 0e148ec to 1dd3207 Compare June 27, 2026 11:49
@BiteTheDDDDt

Copy link
Copy Markdown
Contributor Author

/review

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review summary for current head 1dd3207f04a7311ddec655990fca58d68b0b0b9d.

Opinion: I found no additional blocking issue to raise on the current head. The existing inline thread about the finalized-only test coverage should not be duplicated; the current head adds the close-before-finalize hybrid-submit coverage that thread requested.

Critical checkpoint conclusions:

  • Goal and test proof: the PR prevents HybridTaskScheduler::submit() from calling PipelineTask::is_blockable() after terminal close/finalize begins. The gate is closed from close(close_sink=true) and finalize(), and hybrid submit checks it under _blockable_check_lock before reading _sink/_operators. The new close-before-finalize and finalized-task BE tests cover the intended submit rejection cases.
  • Scope: the change is small and focused on BE pipeline task lifecycle/scheduler handling plus BE unit coverage. I saw no new config, persistence, FE/BE protocol, storage-format, or transaction surface.
  • Concurrency and lifecycle: _blockable_check_lock serializes the accept-submit gate with HybridTaskScheduler::submit(), while the existing _dependency_lifecycle_lock still protects forced dependency unblocking against close/finalize resource clearing. In the production close path, _running remains true through close_task()/finalize(), so a task queued just before the gate closes is requeued until finalization and then skipped.
  • Parallel paths: the hybrid scheduler applies the gate before choosing its internal blocking/simple queue. RevokableTask::is_blockable() remains a separate virtual path and does not dereference the underlying task's sink/operators.
  • Tests and validation: git diff --check is clean for the changed files. I did not run BE UT/build because this runner lacks thirdparty/installed and thirdparty/installed/bin/protoc; Doris format validation is also blocked here because the available clang-format is 18.1.3 while the repo requires clang-format 16.
  • User focus: no additional user-provided focus points were present.

Subagent conclusions: optimizer-rewrite reported no new valuable findings after checking that optimizer/Nereids risks are out of scope and reviewing analogous lifecycle/parallel submit paths. tests-session-config reported no new valuable findings after checking the added tests, duplicate context, and basic whitespace/tooling state. Final convergence round 1 ended with both subagents replying NO_NEW_VALUABLE_FINDINGS for the same ledger and empty proposed inline comment set.

@BiteTheDDDDt

Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-H: Total hot run time: 29431 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1dd3207f04a7311ddec655990fca58d68b0b0b9d, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17780	4187	4145	4145
q2	2033	315	203	203
q3	10303	1448	838	838
q4	4680	471	338	338
q5	7511	854	575	575
q6	185	175	135	135
q7	804	828	636	636
q8	9328	1543	1634	1543
q9	5585	4506	4481	4481
q10	6784	1811	1517	1517
q11	439	282	243	243
q12	632	429	287	287
q13	18056	3453	2759	2759
q14	269	268	239	239
q15	q16	791	777	704	704
q17	1036	1022	1018	1018
q18	7156	5768	5719	5719
q19	1228	1335	1114	1114
q20	495	408	268	268
q21	5680	2598	2363	2363
q22	452	356	306	306
Total cold run time: 101227 ms
Total hot run time: 29431 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4381	4351	4325	4325
q2	317	343	222	222
q3	4626	5053	4416	4416
q4	2111	2188	1372	1372
q5	4506	4340	4440	4340
q6	237	180	129	129
q7	1757	1876	1940	1876
q8	2551	2181	2164	2164
q9	8352	8545	8096	8096
q10	4868	4753	4268	4268
q11	566	410	379	379
q12	754	782	539	539
q13	3307	3686	2975	2975
q14	299	297	270	270
q15	q16	755	725	653	653
q17	1340	1326	1375	1326
q18	8063	7303	7237	7237
q19	1183	1104	1134	1104
q20	2256	2237	1949	1949
q21	5260	4618	4386	4386
q22	515	451	408	408
Total cold run time: 58004 ms
Total hot run time: 52434 ms

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-DS: Total hot run time: 172007 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1dd3207f04a7311ddec655990fca58d68b0b0b9d, data reload: false

query5	4322	616	477	477
query6	445	198	187	187
query7	4815	570	303	303
query8	356	183	169	169
query9	8756	4070	4064	4064
query10	441	327	284	284
query11	5948	2338	2182	2182
query12	158	104	106	104
query13	1263	593	430	430
query14	6284	5317	4994	4994
query14_1	4320	4284	4316	4284
query15	216	204	187	187
query16	993	478	468	468
query17	1153	733	601	601
query18	2511	469	355	355
query19	209	191	150	150
query20	124	110	111	110
query21	219	143	120	120
query22	13693	13645	13501	13501
query23	17306	16461	16206	16206
query23_1	16221	16328	16294	16294
query24	7642	1806	1305	1305
query24_1	1302	1303	1266	1266
query25	564	473	402	402
query26	1307	336	191	191
query27	2605	571	334	334
query28	4421	2032	2044	2032
query29	1067	628	500	500
query30	314	239	207	207
query31	1103	1084	953	953
query32	107	62	64	62
query33	543	325	264	264
query34	1155	1159	654	654
query35	769	805	677	677
query36	1411	1392	1221	1221
query37	159	111	96	96
query38	1902	1739	1652	1652
query39	919	919	892	892
query39_1	879	882	872	872
query40	236	125	101	101
query41	64	64	66	64
query42	94	86	93	86
query43	318	319	283	283
query44	1444	772	784	772
query45	207	187	173	173
query46	1120	1267	720	720
query47	2442	2320	2215	2215
query48	412	444	305	305
query49	567	438	309	309
query50	986	370	262	262
query51	4404	4418	4370	4370
query52	82	81	71	71
query53	241	268	190	190
query54	263	223	196	196
query55	75	71	68	68
query56	244	215	223	215
query57	1399	1413	1324	1324
query58	242	211	198	198
query59	1572	1656	1433	1433
query60	276	239	235	235
query61	157	153	150	150
query62	704	649	597	597
query63	237	188	191	188
query64	2511	759	603	603
query65	4807	4790	4791	4790
query66	1769	445	336	336
query67	28948	28818	28664	28664
query68	3302	1624	930	930
query69	414	309	261	261
query70	1061	987	967	967
query71	308	222	217	217
query72	2928	2657	2227	2227
query73	831	786	455	455
query74	5099	4929	4756	4756
query75	2545	2526	2191	2191
query76	2319	1195	778	778
query77	359	378	292	292
query78	12261	12300	11983	11983
query79	1386	1195	724	724
query80	1128	460	390	390
query81	500	281	235	235
query82	562	154	125	125
query83	341	278	247	247
query84	263	143	116	116
query85	917	522	422	422
query86	403	297	306	297
query87	1862	1827	1764	1764
query88	3687	2812	2779	2779
query89	433	387	332	332
query90	1852	173	176	173
query91	173	170	136	136
query92	59	57	53	53
query93	1525	1483	975	975
query94	635	346	314	314
query95	693	475	340	340
query96	1067	808	377	377
query97	2666	2683	2583	2583
query98	216	208	201	201
query99	1173	1143	1044	1044
Total cold run time: 257155 ms
Total hot run time: 172007 ms

@hello-stephen

Copy link
Copy Markdown
Contributor
ClickBench: Total hot run time: 25.15 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 1dd3207f04a7311ddec655990fca58d68b0b0b9d, data reload: false

query1	0.00	0.00	0.01
query2	0.09	0.04	0.04
query3	0.25	0.14	0.12
query4	1.61	0.16	0.14
query5	0.24	0.22	0.23
query6	1.22	1.06	1.03
query7	0.04	0.01	0.01
query8	0.05	0.04	0.04
query9	0.37	0.32	0.31
query10	0.55	0.56	0.57
query11	0.20	0.15	0.14
query12	0.18	0.15	0.14
query13	0.47	0.47	0.48
query14	1.00	1.02	0.99
query15	0.63	0.60	0.60
query16	0.33	0.32	0.32
query17	1.13	1.13	1.10
query18	0.23	0.22	0.21
query19	2.04	2.00	1.94
query20	0.01	0.02	0.01
query21	15.46	0.21	0.13
query22	4.97	0.05	0.05
query23	16.17	0.31	0.12
query24	2.93	0.41	0.30
query25	0.11	0.06	0.04
query26	0.72	0.21	0.14
query27	0.04	0.03	0.04
query28	3.51	0.93	0.55
query29	12.57	4.25	3.45
query30	0.27	0.16	0.19
query31	2.78	0.58	0.32
query32	3.24	0.60	0.49
query33	3.20	3.18	3.29
query34	15.73	4.22	3.48
query35	3.53	3.50	3.56
query36	0.55	0.45	0.44
query37	0.08	0.06	0.07
query38	0.05	0.04	0.04
query39	0.04	0.03	0.04
query40	0.17	0.16	0.16
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.04	0.03	0.04
Total cold run time: 96.92 s
Total hot run time: 25.15 s

@hello-stephen

Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (17/17) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.21% (28504/38409)
Line Coverage 58.01% (310086/534527)
Region Coverage 54.64% (258708/473482)
Branch Coverage 56.03% (112506/200785)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants