
process issue/pr comment collection as batches #425

Draft
MoralCode wants to merge 5 commits into main from fix/comment_collection_performance

Conversation

@MoralCode

Contributor

Description
This PR reuses the same code that we use for bulk-fetching comments for repos with few issues/PRs and applies it to larger repos.

This PR fixes #419

Notes for Reviewers
Just throwing this together; I haven't extensively tested it yet.

It might need checking to make sure it works for larger repos.
We might also want to block this until we have full-collection testing in place, to verify that we don't hit some kind of pagination limit in the GitHub API.

Signed commits

  • Yes, I signed my commits.

Generative AI disclosure

Please select one option:

  • This contribution was NOT assisted or created by Generative AI tools.
  • This contribution was assisted or created by Generative AI tools.

If AI tools were used, please provide details below:
- What tools were used? Sonnet 4.6 Medium
- How were these tools used? identifying the issue, writing the batched function and its unit tests
- Did you review these outputs before submitting this PR? yes

this will help with a merge conflict later

Signed-off-by: Adrian Edwards <adredwar@redhat.com>
Signed-off-by: Adrian Edwards <adredwar@redhat.com>
Signed-off-by: Adrian Edwards <adredwar@redhat.com>
Signed-off-by: Adrian Edwards <adredwar@redhat.com>
This helps adapt between the generator nature of the heavy API requests from the endpoint and the list-focused processing function, which does all sorts of additional queries and other work.

We can't do it inside the processing function because of all the extra queries; that would be too risky.

Signed-off-by: Adrian Edwards <adredwar@redhat.com>
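
For reviewers, here is a minimal sketch of the generator-to-list adaptation the commit message above describes. It is not the code from this PR: `batched_sketch`, `fake_comment_pages`, and `process_batch` are hypothetical names made up for illustration, and the real `batched` helper in `collectoss.tasks.util.worker_util` may have a different signature or behavior.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched_sketch(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items from any iterable.

    Hypothetical stand-in for the project's own `batched` helper.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls only batch_size items, so the source generator
        # is never fully materialized in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_comment_pages() -> Iterator[dict]:
    """Hypothetical generator standing in for the paginated API endpoint."""
    for comment_id in range(1, 8):
        yield {"id": comment_id}


def process_batch(batch: List[dict]) -> List[int]:
    """Hypothetical list-focused processing step (the real one runs extra queries)."""
    return [comment["id"] for comment in batch]


# The caller adapts the generator into lists; the processing function
# itself stays unchanged and only ever sees a plain list.
processed = [process_batch(batch) for batch in batched_sketch(fake_comment_pages(), 3)]
```

Batching at the call site rather than inside the processing function keeps the function's extra queries untouched, which matches the reasoning in the commit message.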
from collectoss.tasks.init.celery_app import celery_app as celery
from collectoss.tasks.init.celery_app import CoreRepoCollectionTask
from collectoss.application.db.data_parse import *
from collectoss.tasks.github.util.github_data_access import GithubDataAccess, UrlNotFoundException

[pylint] reported by reviewdog 🐶
W0611: Unused UrlNotFoundException imported from collectoss.tasks.github.util.github_data_access (unused-import)

from collectoss.tasks.util.worker_util import batched, remove_duplicate_dicts
from collectoss.tasks.github.util.util import get_owner_repo
from collectoss.application.db.models import PullRequest, Message, Issue, PullRequestMessageRef, IssueMessageRef, Contributor, Repo, CollectionStatus
from collectoss.application.db import get_engine, get_session

[pylint] reported by reviewdog 🐶
W0611: Unused get_engine imported from collectoss.application.db (unused-import)

@@ -2,7 +2,7 @@
import pytest
import sqlalchemy as s

[pylint] reported by reviewdog 🐶
W0611: Unused sqlalchemy imported as s (unused-import)

Development

Successfully merging this pull request may close these issues.

collect_github_messages wildly inefficient with API calls for large repos
