fix: Add is_finished and correct is_empty semantics in RequestQueueClient#1982
fix: Add is_finished and correct is_empty semantics in RequestQueueClient#1982Mantisus wants to merge 7 commits into
is_finished and correct is_empty semantics in RequestQueueClient#1982Conversation
janbuchar
left a comment
There was a problem hiding this comment.
Looks sound, the only thing I'm mildly concerned about is the removal of the is_empty return value caching — I'd feel better if I knew some historical context and why it's safe to do this.
vdusek
left a comment
There was a problem hiding this comment.
Thanks! Could you please implement apify/apify-sdk-python#987 as well?
That way, we can verify that it works on the platform too.
I'd like this to be part of SDK v4, built on top of Crawlee v1.8 with this included.
Sure. |
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
vdusek
left a comment
There was a problem hiding this comment.
A few more comments (three nits and one thing that might have a performance impact)
| @override | ||
| async def is_finished(self) -> bool: | ||
| # If the queue is not empty, it is not finished | ||
| if not await self.is_empty(): |
There was a problem hiding this comment.
is_finished calls await self.is_empty() here, which opens a DB session, and then opens a second session below for get_metadata() and the remaining queries. Since the autoscaled pool polls is_finished/is_empty while scheduling requests, this could have a performance impact.
Maybe we could handle the empty check directly inside is_finished, using only a single session?
There was a problem hiding this comment.
Getting a session from the pool is quite cheap, but it can actually impact performance if the pool is small. However, this will only become apparent when the queue is empty but not yet finished.
But merging is_empty and get_metadata into a single session in is_finished will make the method harder to read.
If we need to optimize these methods, I would consider removing the _add_buffer_record calls in is_empty and is_finished, just to update accessed_at in the metadata. This will have a greater impact on performance.
vdusek
left a comment
There was a problem hiding this comment.
I'd probably use a fix prefix for the PR title. It does add a new method, true, but only because its absence is the root cause of the bug. Something like this:
fix: add is_finished and correct is_empty semantics in RequestQueueClient
would be better in my opinion.
is_finished method to RequestQueueClient interfaceis_finished method to RequestQueueClient interface
is_finished method to RequestQueueClient interfaceis_finished method to RequestQueueClient interface
is_finished method to RequestQueueClient interfaceis_finished and correct is_empty semantics in RequestQueueClient
Description
is_finishedmethod to theRequestQueueClientinterface. It's not a breaking change. The base class defaultsis_finishedtois_emptywhen it isn't overridden. Splits the queue predicates.is_emptymeans there are no requests available to fetch.is_finishedmeans all requests are fully processed (nothing to fetch and nothing in progress).is_finishedin all built-inRequestQueueClient.is_emptyin all built-inRequestQueueClientto match the new meaning.AutoscaledPool. It now skips scheduling new tasks when no requests are available to fetch. For example, while the only request is being processed.Issues
is_finishedmethod to theStorageClientinterface #1966Testing
RequestQueuethat checks theis_emptyandis_finishedvalues through the queue lifecycle. It covers all built-in clients.AutoscaledPooldoesn't schedule new tasks while the only request is being processed.