feat: allow import/reimport to wait for deduplication to complete by valentijnscholten · Pull Request #15007 · DefectDojo/django-DefectDojo

valentijnscholten · 2026-06-14T12:49:17Z

Summary

While thinking about what a v3 import API could look like, I realized we're missing one piece of the puzzel in the current (re)import.

By default deduplication runs as an async celery task. The results are not awaited, so the (re)import process can finish before deduplication is complete.
Consequences are:

The mail notification scan_added contains incorrect statistics
The API response contains incorrect delta statitistics.

With the recent optimizations on both import and deduplication the chances of these incorrect statistics has gone down. However the most confusing case happens when the deduplication is partly finished. In that case the statistics will contain some duplicates, but not all of them. Example report: #13777

Pro also suffers from this when not using the background-import. Also the pattern in this PR could be used for other Pro features like the current asynchronous prioritization.

The solution in this PR is to introduce a new deduplication_execution_mode enum that controls deduplication execution:

async (default): perform deduplication in the background, so not wait for the results.
sync (block_execution==True): perform deduplication in the foreground/in-line with the (re)import.
async_await: perform deduplication in the background and wait for the results.

The async_await is the new option and could be the new default in the future. The block_execution user profile variable is retained as the global "run all async tasks (notifications, JIRA, grading, ...) in the foreground" switch; it is independent of the new deduplication_execution_mode and acts as a fallback (block_execution=True → sync) when no explicit mode is set.
Data is migrated. Using block_execution==True could be used to achieve a similar result, but it would slow the import a lot more because deduplication would run in the foreground. It would also make everything synchronous inluding notifications, pushing to JIRA, etc causing further slowdowns.

The new deduplication_execution_mode can be set in the user profile, but also in the API requests.

In most scenario's the await setting will not lead to significantly longer running import, may only 1 or 2 seconds extra.
The exception is on very busy/overloaded instances where the sync deduplication jobs are queueing up.
For this reason the maximum wait time is 1 minute. After 1 minute the (re)import no longer waits and returns statistics as-is.
A response variable deduplication_complete will be False in this case.

Details

Add a new import/reimport deduplication execution mode, configurable on the user profile via deduplication_execution_mode (async / async_wait / sync) and overridable per request through a deduplication_execution_mode field on the import and reimport endpoints. The request value takes precedence over the profile, otherwise it falls back to async.
async_wait (new): deduplication/post-processing is still dispatched to the background, but the request waits for deduplication to finish before responding, so scan_added notifications and the returned statistics reflect the deduplicated state. JIRA push, product grading and other non-deduplication tasks remain asynchronous and are not awaited. async (default, return immediately) and sync (run inline) round out the modes.
The new field is independent of the existing block_execution flag, which is retained as the global "run all async tasks (notifications, JIRA, grading, ...) in the foreground" switch. A data migration seeds deduplication_execution_mode='sync' for users who had block_execution=True, and the resolver falls back to block_execution → sync, so import behavior is unchanged for them.
In async_wait mode the post-processing batch is dispatched with a per-call ignore_result=False (threaded through dojo_dispatch_task to apply_async) so the request can join on it via AsyncResult.get() despite the global CELERY_TASK_IGNORE_RESULT=True. The post_process_findings_batch task itself stays a plain @app.task, so async/sync imports do not store result rows. The wait is bounded by the new DD_DEDUPLICATION_ASYNC_WAIT_TIMEOUT setting (default 60s) so a missing or stuck worker degrades to responding anyway rather than hanging.
Add a deduplication_complete boolean to the import/reimport response indicating whether deduplication had finished by the time the response was produced (true for sync and for async_wait when it completed within the timeout; false for async).
Make scan_added notifications deduplication-aware: once deduplication has completed, refresh the duplicate status of the affected findings from the database and split the new / reactivated / untouched lists into real and duplicate sub-lists, exclude duplicates from the headline count (so an all-duplicate import sends scan_added_empty), and surface the duplicate groups in the mail and webhook templates. In plain async mode the lists are left untouched, preserving historical behavior.
The import/reimport response statistics are accurate after the await without further change, since they query findings and annotate the duplicate count directly.
The browser import/reimport views resolve deduplication_execution_mode from the user's profile (the API resolves it in the serializer), so UI imports honor the profile setting instead of always defaulting to async. (No per-import selector is added to the UI form in this change.)
Surface deduplication_execution_mode in the user profile form and the user detail view, and add 2.60 upgrade notes.

github-actions · 2026-06-17T20:25:49Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions · 2026-06-17T20:35:53Z

Conflicts have been resolved. A maintainer will review the pull request shortly.

github-actions · 2026-06-17T21:28:32Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Maffooch

Review (code + local e2e). Reviewed the diff and exercised it end-to-end against the running stack.

Verification (all green):

unittests.test_import_execution_mode → 20 passed; unittests.test_importers_performance → 11 passed / 6 skipped (incl. the new test_deduplication_performance_pghistory_async_wait).
ruff --config ruff.toml → clean on all changed non-migration files.
Live e2e against a real Celery worker + broker + django-db result backend (not eager): async_wait import → deduplication_complete=true with correct stats; async → false; invalid mode → 400. So the cross-process AsyncResult.get() join genuinely works with the global CELERY_TASK_IGNORE_RESULT=True — confirming the open question in the helper.py NOTE (see inline comment there).
Server-rendered UI: view_user shows the new row and the profile form dropdown persists the setting.

One blocking item (the migration collision, #1) and four non-blocking suggestions inline below. Nice feature — the duplicate-aware notification split is a clear improvement.

github-actions · 2026-06-25T20:20:20Z

Conflicts have been resolved. A maintainer will review the pull request shortly.

…ion_mode for import dedup block_execution gates every async task (notifications, jira, grading, deletes, reindex, ...) via we_want_async, so it must remain. Restore it as the global "run all async tasks in the foreground" flag and make deduplication_execution_mode an independent field that only controls how import/reimport deduplication post-processing is dispatched and awaited. - Restore the block_execution field; wants_block_execution() reads it again. - resolve_deduplication_execution_mode() falls back to block_execution -> sync. - Replace the destructive remove migration (0270) with a seed-only data migration that maps block_execution=True -> deduplication_execution_mode 'sync'; keep both. - Restore fixtures, the profile form field, and the user detail view (both fields). - Revert the test sweep: tests that need global foreground use block_execution=True again; the dedicated deduplication_execution_mode tests are updated for the split.

…ort/reimport The API resolves deduplication_execution_mode in the serializer, but the UI import/reimport views built their context straight from the form and never resolved it, so UI imports silently defaulted to async regardless of the user's profile. Resolve it from the profile in both UI process_form paths.

Option B keeps block_execution as a checkbox in the profile form, so the integration helper toggles id_block_execution again instead of the deduplication_execution_mode select (which no longer existed under that id).

Collapse the add/rename pair into a single AddField that creates deduplication_execution_mode with its final name, and keep the seed as a separate data migration, per the convention of keeping schema and data migrations apart. Drops the interim rename migration.

…E_LOCATIONS The CI 'true' matrix variant enables V3_FEATURE_LOCATIONS, where loading dojo_testdata.json (containing Endpoints) raises NotImplementedError. Decorate the test classes with @versioned_fixtures so the locations fixture is used in that mode, matching the other import/reimport test suites.

Covers the 'async_wait' execution mode: post-processing is dispatched to a background worker and the request joins on it before responding. The dedup queries run in the worker (separate connection), not the web request, so the only web-side delta over the plain async path is the post-dedup notification refresh SELECT (+1): 109->110 first import, 89->90 second. Does not use CELERY_TASK_ALWAYS_EAGER (that would run the task inline on the request connection and wrongly count worker queries). The dedup batch is dispatched async and the join (AsyncResult.get) is mocked to return instantly, simulating a finished worker. Adds an optional dedup_mode param to _deduplication_performance.

…0270/0271) dev added 0269_normalize_blank_finding_components, colliding with this branch's 0269. Renumber to 0270_usercontactinfo_deduplication_execution_mode and 0271_seed_deduplication_execution_mode and repoint the dependency chain onto dev's 0269.

…dings_batch The async_wait deduplication join works with the global CELERY_TASK_IGNORE_RESULT=True default (verified live by reviewer against a real broker/worker). Removing the per-task override avoids writing a celery_taskmeta result row for every batch of every import (incl. plain async), which were retained for CELERY_RESULT_EXPIRES.

Adds async_wait coverage to the existing tests/dedupe_test.py DedupeTest suite: set the user profile to 'async_wait', import a Checkmarx report into two tests of a dedupe-on-engagement engagement, and assert the duplicates are visible immediately (check_nb_duplicates_no_wait, no sleep/retry) — proving the request blocked until the cross-process deduplication finished. This exercises the real worker->request join that the eager unit tests cannot cover. Adds a set_deduplication_execution_mode() helper to base_test_class.

…_result The async_wait join via AsyncResult.get() was a silent no-op: the global CELERY_TASK_IGNORE_RESULT=True means post_process_findings_batch never stores a result, so .get() returns immediately instead of blocking. The import responded with deduplication_complete=true while dedup had not run. Pass ignore_result=False only on the async_wait dispatch, threaded through dojo_dispatch_task to apply_async. The task decorator stays plain @app.task, so async/sync imports do not write useless celery_taskmeta result rows.

…uard The Selenium async_wait test could not fail when the cross-process join was broken: with only 2 findings and several page navigations between the import and the duplicate check, the background dedupe finished regardless of whether the import actually blocked. It also never read deduplication_complete. Remove it and add ImportAsyncWaitApiTest: a pure-API test that imports a 400-finding report twice into a dedup-on-engagement engagement against the real worker, then asserts (no sleep/retry) that async_wait reports deduplication_complete=true with all 400 marked duplicate at response time, while plain async returns incomplete with fewer marked. A no-op join makes async_wait behave like async and fails the test.

…d dedup delay The giant-report approach was non-deterministic in CI: deduplication is dispatched per batch during the import, so on a fast worker plain `async` finishes dedup before the response and marks all findings too. The count discriminator then couldn't tell async_wait from async (the control failed `400 != 400`), and a broken async_wait could false-pass. Add DD_DEDUPLICATION_BATCH_PROCESS_TEST_DELAY (default 0, no-op in production), set to 3s on the integration-test celeryworker only. Each dedup batch sleeps before doing work, so async_wait blocks past it (all duplicates marked) while async returns mid-delay (none marked) — independent of worker speed. Rewrite ImportAsyncWaitApiTest to a small report and assert async marks exactly 0. Verified locally: green normally; fails `0 != 25` when the join fix is reverted.

…den margin The first delay attempt broke two ways in CI: - It delayed *every* dedup batch in the UI suite, breaking the timing-sensitive close_old_findings_dedupe_test. - 3s was too short: the async control's import + observation GET took longer than that on CI, so dedup finished first and the control marked all findings. Add DD_DEDUPLICATION_BATCH_PROCESS_TEST_DELAY_FILTER (a finding-title prefix, case-insensitive) so only the async_wait test's own findings are delayed -- other dedupe tests run at full speed. Bump the delay to 10s for margin over the import round-trip. The Generic importer title-cases findings, so the match uses istartswith. Verified locally against the real worker: test_wait now blocks ~10s and marks all 25; async marks 0 at response time; reverting the join fix fails 0 != 25.

Diagnostic for CI: surfaces exactly which post_process_findings_batch invocations hit the integration-test delay (and the finding count + filter), to confirm the async_wait test's batches are being delayed as intended.

The async control's marked-count assertion was non-deterministic: its post_process_findings_batch task races the import request's own DB commit, so on CI the findings were not visible when the delay filter ran (no sleep) yet were deduplicated moments later -> the control saw all 25 marked and failed. Drop the count assertion for async and keep only deduplication_complete is False, which is deterministic and mode-based. The duplicate-count guarantee stays on the async_wait test, where the worker-side delay (confirmed firing via the WARN log: 10s/25 findings per batch) makes it robust and a no-op join still fails 0 != NUM_FINDINGS.

Document that test_import_async_wait_returns_statistics cannot fail when the cross-process join is broken (eager .get() always returns inline), pointing to the real-worker ImportAsyncWaitApiTest. Kept for endpoint/field plumbing coverage and future reference.

Extend the deduplication 'Background processing' section with the new deduplication_execution_mode (async / async_wait / sync), the per-profile and per-request override, the deduplication_complete response field, the DD_DEDUPLICATION_ASYNC_WAIT_TIMEOUT bound, and its relationship to the global block_execution flag. Previously this was only in the 2.60 upgrade note.

Record how long the import request blocked on the deduplication join: - DEBUG line with elapsed seconds + task count + success on completion. - the timeout/error WARNING now includes the elapsed time before it gave up. Timing detail is DEBUG so it does not add noise per import in production; the timeout case stays at WARNING since it signals a degraded (respond-anyway) join.

CLAUDE.local.md, fix-jira-push-eligibility.md, phase10-session-prompt.md, scripts/run-nuclei-docker.sh and tagoulus-replcement-options.md were accidentally added in an earlier refactor commit; none belong in this feature. Remove them from version control (kept on disk locally) and gitignore CLAUDE.local.md / CLAUDE.md so they cannot be re-added.

…t dev leaf dev advanced past the previous 0270/0271 numbering (it now ships 0270_finding_visibility_perf_indexes through 0272_reencrypt_tool_config_credentials_aes_gcm), so the PR's 0270/0271 migrations reintroduced a second leaf node off 0269_normalize_blank_finding_components -> 'Conflicting migrations detected; multiple leaf nodes in the migration graph'. Renumber onto the current leaf: - 0270_usercontactinfo_deduplication_execution_mode -> 0273_..., dep 0272_reencrypt_tool_config_credentials_aes_gcm - 0271_seed_deduplication_execution_mode -> 0274_..., dep 0273_usercontactinfo_deduplication_execution_mode Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…unts CI 'Check counts are up to date' measured 94/74 for test_deduplication_performance_pghistory_async_wait (first/second import) vs the committed 93/73. The web-side delta over plain async (92/72) is +2, not +1: the post-dedup notification refresh SELECT plus one result-backend write from the per-dispatch ignore_result=False that enables the AsyncResult.get() join. Update the baseline and the docstring accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mtesauro

Approved

github-actions Bot added New Migration Adding a new migration file. Take care when merging. settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR apiv2 unittests integration_tests ui docs labels Jun 14, 2026

valentijnscholten added this to the 2.60.0 milestone Jun 14, 2026

valentijnscholten changed the title ~~feat: configurable import/reimport execution mode (async_wait) with dedup-accurate notifications~~ feat: allow import/reimport to wait for deduplication to complete Jun 14, 2026

github-actions Bot removed the integration_tests label Jun 14, 2026

valentijnscholten added the affects_pro PRs that affect Pro and need a coordinated release/merge moment. label Jun 14, 2026

valentijnscholten marked this pull request as ready for review June 14, 2026 17:02

valentijnscholten requested review from Maffooch and mtesauro as code owners June 14, 2026 17:02

valentijnscholten marked this pull request as draft June 14, 2026 18:04

valentijnscholten marked this pull request as ready for review June 14, 2026 21:53

github-actions Bot added the conflicts-detected label Jun 17, 2026

valentijnscholten force-pushed the import-execution-mode branch from 69aa97b to 506fbbc Compare June 17, 2026 20:34

github-actions Bot removed the conflicts-detected label Jun 17, 2026

github-actions Bot added the conflicts-detected label Jun 17, 2026

Maffooch reviewed Jun 25, 2026

View reviewed changes

Comment thread dojo/db_migrations/0269_usercontactinfo_deduplication_execution_mode.py Outdated

Comment thread dojo/importers/base_importer.py

Comment thread dojo/finding/helper.py Outdated

Comment thread dojo/importers/options.py

Comment thread unittests/test_import_execution_mode.py

valentijnscholten force-pushed the import-execution-mode branch from 506fbbc to 9d50bd9 Compare June 25, 2026 20:18

github-actions Bot removed the conflicts-detected label Jun 25, 2026

github-actions Bot added docker integration_tests labels Jun 25, 2026

valentijnscholten force-pushed the import-execution-mode branch from 2ebc4e3 to eab42db Compare June 25, 2026 20:25

valentijnscholten added 19 commits June 28, 2026 21:45

docs: 2.60 upgrade notes for deduplication_execution_mode

40cd011

test: revert set_block_execution to the block_execution checkbox

0e01d50

Option B keeps block_execution as a checkbox in the profile form, so the integration helper toggles id_block_execution again instead of the deduplication_execution_mode select (which no longer existed under that id).

test(dedupe): log WARN when the test-only dedup delay fires

1fec8dc

Diagnostic for CI: surfaces exactly which post_process_findings_batch invocations hit the integration-test delay (and the finding count + filter), to confirm the async_wait test's batches are being delayed as intended.

valentijnscholten force-pushed the import-execution-mode branch from 71ecee8 to 5b35236 Compare June 28, 2026 19:45

valentijnscholten and others added 3 commits June 28, 2026 21:58

Maffooch approved these changes Jun 30, 2026

View reviewed changes

Maffooch requested review from Jino-T, blakeaowens and dogboat June 30, 2026 21:32

Merge branch 'dev' into import-execution-mode

91e8a0b

mtesauro approved these changes Jun 30, 2026

View reviewed changes

Jino-T approved these changes Jul 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: allow import/reimport to wait for deduplication to complete#15007

feat: allow import/reimport to wait for deduplication to complete#15007
valentijnscholten wants to merge 28 commits into
DefectDojo:devfrom
valentijnscholten:import-execution-mode

valentijnscholten commented Jun 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 17, 2026

Uh oh!

github-actions Bot commented Jun 17, 2026

Uh oh!

github-actions Bot commented Jun 17, 2026

Uh oh!

Maffooch left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

mtesauro left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

valentijnscholten commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Uh oh!

github-actions Bot commented Jun 17, 2026

Uh oh!

github-actions Bot commented Jun 17, 2026

Uh oh!

github-actions Bot commented Jun 17, 2026

Uh oh!

Maffooch left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

mtesauro left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

valentijnscholten commented Jun 14, 2026 •

edited

Loading