DM-54540: bps report failed while reporting on a restarted task by mxk62 · Pull Request #88 · lsst/ctrl_bps_htcondor

mxk62 · 2026-06-08T21:10:51Z

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes

codecov · 2026-06-08T21:11:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.14%. Comparing base (40020ac) to head (cd109bd).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #88      +/-   ##
==========================================
+ Coverage   84.80%   85.14%   +0.34%     
==========================================
  Files          19       19              
  Lines        4245     4343      +98     
  Branches      456      461       +5     
==========================================
+ Hits         3600     3698      +98     
  Misses        567      567              
  Partials       78       78

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

A bug was causing mixing up condor_dagman job data with those related to DAGMan jobs if `dag.nodes.log` was missing. Fixed that.

During workflow restarts, selected files are being moved to a dedicated backup directory as HTCondor recreates them with the fresh content once a workflow is restarted. However, the backup function was also moving out such files for subdags that succeeded. HTCondor does not rerun DAGs that completed successfully. Their absence led to reporting incorrect job status counts as the information in these files is used by the plugin's reporting mechanism. Made changes to ensure that only files of the failed subdags are backed up.

_udpate_rescue_file() was trying to back up files for each failed subdags However, at the time of its execution these files were already backed up. Also, the function, in its current implementation, doesn't have access to the correct location of the backup directory. As a result it was creating an empty directory tree in the directory with the failed subdag. Modified it so it has a single responsibility -- updating the rescue file of the main DAG.

MichelleGower

3 bps restarts have held jobs and 2 bps restarts if some succeed one first restart doesn't have correct counts (like previously reported problem but need 1 more restart).

MichelleGower · 2026-06-23T16:49:07Z

@@ -0,0 +1 @@
+Fix a bug causing ``bps report`` emit confusing error due to mixing information specific to a ``condor_dagman`` job with the information related to its jobs.


Either add to or replace this with a message about fixing the bug with reporting with job grouping after a restart.

MichelleGower · 2026-06-23T19:59:51Z

+    # after the run is restarted.  As HTCondorService.report() uses information
+    # in these files to determine job statuses, their absence may lead to
+    # reporting incorrect job status counts.
+    for subdag_dir in {file.parent for file in path.glob("subdags/*/*.rescue*")}:


What happens after 1 failure but works in next restart (but other things still fail)? The rescue files stay in the directory, so the report for a 2nd restart will have the same files missing for success problem, right? Was able to reproduce this by changing submit yaml to fail on two visit+detector 50 and manually changing executable to /usr/bin/true in the sub files for one visit+detector before the 1st restart.

If use the code in _update_rescue_file instead, that uses knowledge from the rescue file for what failed.

If use the code in _update_rescue_file instead, that uses knowledge from the rescue file for what failed.

The _update_rescue_file() needs the name of the rescue file htc_backup_files() returns, but htc_backup_files() moved the files already when it is finished. On the other hand, _update_rescue_file() has no knowledge where the backup directory is. Point being, it not a "just". I'll need to change the information flow between these two functions

mxk62 force-pushed the tickets/DM-54540 branch from 6841d9c to 0ca237f Compare June 9, 2026 15:27

mxk62 marked this pull request as ready for review June 9, 2026 15:44

mxk62 force-pushed the tickets/DM-54540 branch from 876ebe1 to 4dfdb04 Compare June 9, 2026 15:45

mxk62 added 5 commits June 17, 2026 11:25

Avoid injecting condor_dagman info to jobs info

b60ffe3

A bug was causing mixing up condor_dagman job data with those related to DAGMan jobs if `dag.nodes.log` was missing. Fixed that.

Add a news item describing the changes

e049fb9

Add unit tests covering read_single_node_status()

0440d0a

mxk62 force-pushed the tickets/DM-54540 branch from 4dfdb04 to cd109bd Compare June 19, 2026 15:48

MichelleGower reviewed Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DM-54540: bps report failed while reporting on a restarted task #88

DM-54540: bps report failed while reporting on a restarted task #88
mxk62 wants to merge 5 commits into
mainfrom
tickets/DM-54540

mxk62 commented Jun 8, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

MichelleGower left a comment

Uh oh!

MichelleGower Jun 23, 2026

Uh oh!

MichelleGower Jun 23, 2026

Uh oh!

mxk62 Jun 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -0,0 +1 @@
		Fix a bug causing ``bps report`` emit confusing error due to mixing information specific to a ``condor_dagman`` job with the information related to its jobs.

Uh oh!

Conversation

mxk62 commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Uh oh!

codecov Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

MichelleGower left a comment

Choose a reason for hiding this comment

Uh oh!

MichelleGower Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

MichelleGower Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

mxk62 Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mxk62 commented Jun 8, 2026 •

edited

Loading

codecov Bot commented Jun 8, 2026 •

edited

Loading

mxk62 Jun 23, 2026 •

edited

Loading