DM-54540: bps report failed while reporting on a restarted task #88
DM-54540: bps report failed while reporting on a restarted task #88mxk62 wants to merge 5 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #88 +/- ##
==========================================
+ Coverage 84.80% 85.14% +0.34%
==========================================
Files 19 19
Lines 4245 4343 +98
Branches 456 461 +5
==========================================
+ Hits 3600 3698 +98
Misses 567 567
Partials 78 78 ☔ View full report in Codecov by Harness. |
A bug was causing mixing up condor_dagman job data with those related to DAGMan jobs if `dag.nodes.log` was missing. Fixed that.
During workflow restarts, selected files are being moved to a dedicated backup directory as HTCondor recreates them with the fresh content once a workflow is restarted. However, the backup function was also moving out such files for subdags that succeeded. HTCondor does not rerun DAGs that completed successfully. Their absence led to reporting incorrect job status counts as the information in these files is used by the plugin's reporting mechanism. Made changes to ensure that only files of the failed subdags are backed up.
_udpate_rescue_file() was trying to back up files for each failed subdags However, at the time of its execution these files were already backed up. Also, the function, in its current implementation, doesn't have access to the correct location of the backup directory. As a result it was creating an empty directory tree in the directory with the failed subdag. Modified it so it has a single responsibility -- updating the rescue file of the main DAG.
MichelleGower
left a comment
There was a problem hiding this comment.
3 bps restarts have held jobs and 2 bps restarts if some succeed one first restart doesn't have correct counts (like previously reported problem but need 1 more restart).
| @@ -0,0 +1 @@ | |||
| Fix a bug causing ``bps report`` emit confusing error due to mixing information specific to a ``condor_dagman`` job with the information related to its jobs. | |||
There was a problem hiding this comment.
Either add to or replace this with a message about fixing the bug with reporting with job grouping after a restart.
| # after the run is restarted. As HTCondorService.report() uses information | ||
| # in these files to determine job statuses, their absence may lead to | ||
| # reporting incorrect job status counts. | ||
| for subdag_dir in {file.parent for file in path.glob("subdags/*/*.rescue*")}: |
There was a problem hiding this comment.
What happens after 1 failure but works in next restart (but other things still fail)? The rescue files stay in the directory, so the report for a 2nd restart will have the same files missing for success problem, right? Was able to reproduce this by changing submit yaml to fail on two visit+detector 50 and manually changing executable to /usr/bin/true in the sub files for one visit+detector before the 1st restart.
If use the code in _update_rescue_file instead, that uses knowledge from the rescue file for what failed.
There was a problem hiding this comment.
If use the code in _update_rescue_file instead, that uses knowledge from the rescue file for what failed.
The _update_rescue_file() needs the name of the rescue file htc_backup_files() returns, but htc_backup_files() moved the files already when it is finished. On the other hand, _update_rescue_file() has no knowledge where the backup directory is. Point being, it not a "just". I'll need to change the information flow between these two functions
Checklist
doc/changes