Skip to content

DM-54540: bps report failed while reporting on a restarted task #88

Open
mxk62 wants to merge 5 commits into
mainfrom
tickets/DM-54540
Open

DM-54540: bps report failed while reporting on a restarted task #88
mxk62 wants to merge 5 commits into
mainfrom
tickets/DM-54540

Conversation

@mxk62

@mxk62 mxk62 commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@codecov

codecov Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.14%. Comparing base (40020ac) to head (cd109bd).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #88      +/-   ##
==========================================
+ Coverage   84.80%   85.14%   +0.34%     
==========================================
  Files          19       19              
  Lines        4245     4343      +98     
  Branches      456      461       +5     
==========================================
+ Hits         3600     3698      +98     
  Misses        567      567              
  Partials       78       78              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

@mxk62 mxk62 force-pushed the tickets/DM-54540 branch from 6841d9c to 0ca237f Compare June 9, 2026 15:27
@mxk62 mxk62 marked this pull request as ready for review June 9, 2026 15:44
@mxk62 mxk62 force-pushed the tickets/DM-54540 branch from 876ebe1 to 4dfdb04 Compare June 9, 2026 15:45
mxk62 added 5 commits June 17, 2026 11:25
A bug was causing mixing up condor_dagman job data with those related to
DAGMan jobs if `dag.nodes.log` was missing.  Fixed that.
During workflow restarts, selected files are being moved to a dedicated
backup directory as HTCondor recreates them with the fresh content once
a workflow is restarted.  However, the backup function was also moving
out such files for subdags that succeeded.  HTCondor does not rerun
DAGs that completed successfully.  Their absence led to reporting
incorrect job status counts as the information in these files is used by
the plugin's reporting mechanism.  Made changes to ensure that only
files of the failed subdags are backed up.
_udpate_rescue_file() was trying to back up files for each failed
subdags However, at the time of its execution these files were already
backed up. Also, the function, in its current implementation, doesn't
have access to the correct location of the backup directory. As a result
it was creating an empty directory tree in the directory with the failed
subdag.  Modified it so it has a single responsibility -- updating the
rescue file of the main DAG.
@mxk62 mxk62 force-pushed the tickets/DM-54540 branch from 4dfdb04 to cd109bd Compare June 19, 2026 15:48

@MichelleGower MichelleGower left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 bps restarts have held jobs and 2 bps restarts if some succeed one first restart doesn't have correct counts (like previously reported problem but need 1 more restart).

@@ -0,0 +1 @@
Fix a bug causing ``bps report`` emit confusing error due to mixing information specific to a ``condor_dagman`` job with the information related to its jobs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either add to or replace this with a message about fixing the bug with reporting with job grouping after a restart.

# after the run is restarted. As HTCondorService.report() uses information
# in these files to determine job statuses, their absence may lead to
# reporting incorrect job status counts.
for subdag_dir in {file.parent for file in path.glob("subdags/*/*.rescue*")}:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens after 1 failure but works in next restart (but other things still fail)? The rescue files stay in the directory, so the report for a 2nd restart will have the same files missing for success problem, right? Was able to reproduce this by changing submit yaml to fail on two visit+detector 50 and manually changing executable to /usr/bin/true in the sub files for one visit+detector before the 1st restart.

If use the code in _update_rescue_file instead, that uses knowledge from the rescue file for what failed.

@mxk62 mxk62 Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If use the code in _update_rescue_file instead, that uses knowledge from the rescue file for what failed.

The _update_rescue_file() needs the name of the rescue file htc_backup_files() returns, but htc_backup_files() moved the files already when it is finished. On the other hand, _update_rescue_file() has no knowledge where the backup directory is. Point being, it not a "just". I'll need to change the information flow between these two functions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants