PyPI - apify - Versions diffs - 3.4.2b28__tar.gz → 3.4.2b30__tar.gz - Mend

apify 3.4.2b28.tar.gz → 3.4.2b30.tar.gz

Files changed (66)

{apify-3.4.2b28 → apify-3.4.2b30}/CHANGELOG.md RENAMED

@@ -31,6 +31,8 @@ All notable changes to this project will be documented in this file.
 - Forward all `Webhook` fields to ad-hoc webhooks ([#963](https://github.com/apify/apify-sdk-python/pull/963)) ([726620b](https://github.com/apify/apify-sdk-python/commit/726620be25da85b74b3f0d1e4f8c1f8f1b29d9b1)) by [@vdusek](https://github.com/vdusek)
 - **scrapy:** Avoid mutating request userData during Scrapy-Apify conversion ([#978](https://github.com/apify/apify-sdk-python/pull/978)) ([b0b7df7](https://github.com/apify/apify-sdk-python/commit/b0b7df72eb169778ab88be04d8b30bb0bdc307d3)) by [@vdusek](https://github.com/vdusek)
 - **scrapy:** Async-thread startup race, shutdown lifecycle, and timeout setting ([#979](https://github.com/apify/apify-sdk-python/pull/979)) ([ae12935](https://github.com/apify/apify-sdk-python/commit/ae1293512f5ee781533dab5b1dd1f0af0fcc2497)) by [@vdusek](https://github.com/vdusek)
+- Commit request queue dedup cache only after batch_add_requests succeeds ([#975](https://github.com/apify/apify-sdk-python/pull/975)) ([078ab87](https://github.com/apify/apify-sdk-python/commit/078ab8744c96e61a5226a9d19869b5e0df71ab23)) by [@vdusek](https://github.com/vdusek)
+- Prevent request queue softlock by adding `is_finished` and correcting `is_empty` ([#1008](https://github.com/apify/apify-sdk-python/pull/1008)) ([4ead0c6](https://github.com/apify/apify-sdk-python/commit/4ead0c64d2a95263b2fa970f5a8fff9141db62b2)) by [@Mantisus](https://github.com/Mantisus), closes [#987](https://github.com/apify/apify-sdk-python/issues/987)
 ### 🚜 Refactor
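
The changelog entry for #975 describes committing the dedup cache only after `batch_add_requests` succeeds. A minimal sketch of that ordering follows, under stated assumptions: `FlakyApi`, `add_with_deferred_cache`, and the plain `set` cache are hypothetical stand-ins written for illustration and are not part of the SDK.

```python
import asyncio


class FlakyApi:
    """Hypothetical stand-in for the platform client: fails on the first call only."""

    def __init__(self) -> None:
        self.calls = 0
        self.stored: set[str] = set()

    async def batch_add(self, keys: list[str]) -> None:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("simulated outage")
        self.stored.update(keys)


async def add_with_deferred_cache(api: FlakyApi, cache: set[str], keys: list[str]) -> list[str]:
    """Send only uncached keys and remember them only once the call has succeeded."""
    new_keys = [key for key in keys if key not in cache]
    if new_keys:
        await api.batch_add(new_keys)  # may raise; the cache is untouched in that case
        cache.update(new_keys)
    return new_keys


async def demo() -> tuple[set[str], int]:
    api = FlakyApi()
    cache: set[str] = set()
    try:
        await add_with_deferred_cache(api, cache, ["a", "b"])
    except ConnectionError:
        pass  # first attempt failed, so nothing was cached
    await add_with_deferred_cache(api, cache, ["a", "b"])  # the retry is actually sent
    return api.stored, api.calls


stored, calls = asyncio.run(demo())
```

If the cache were filled before the call, the failed first attempt would leave both keys marked as known, and the retry would skip them without ever reaching the platform.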

{apify-3.4.2b28 → apify-3.4.2b30}/CONTRIBUTING.md RENAMED

@@ -118,6 +118,22 @@ To run the documentation locally (requires Node.js):
 uv run poe run-docs
 ```
+### Linting the docs and website
+Markdown content (this guide, `README.md`, and the `docs/` folder) is checked with
+[markdownlint](https://github.com/DavidAnson/markdownlint). The Docusaurus website code is linted
+with [oxlint](https://oxc.rs/) and formatted with [oxfmt](https://oxc.rs/). All of them run in CI.
+To run them locally (requires Node.js 22.12 or newer and pnpm), from the `website/` directory:
+```sh
+pnpm lint          # lint Markdown and website code
+pnpm lint:fix      # auto-fix both
+pnpm format        # format the website code
+```
+Doc images are committed as optimized `.webp`. To convert a new image, run
+`pnpm opt:images <path-to-image>` from the `website/` directory.
 ## Commits
 We use [Conventional Commits](https://www.conventionalcommits.org/) format for commit messages. This convention is used to automatically determine version bumps during the release process.
@@ -149,25 +165,22 @@ Publishing new versions to [PyPI](https://pypi.org/project/apify) is automated t
 1. **Do not do this unless absolutely necessary.** In all conceivable scenarios, you should use the `release` workflow instead.
 2. **Make sure you know what you're doing.**
+3. Update the version number by modifying the `version` field under `project` in `pyproject.toml`:
-3. Update the version number:
-- Modify the `version` field under `project` in `pyproject.toml`.
-```toml
-[project]
-name = "apify"
-version = "x.z.y"
-```
+    ```toml
+    [project]
+    name = "apify"
+    version = "x.z.y"
+    ```
 4. Build the package:
-```sh
-uv run poe build
-```
+    ```sh
+    uv run poe build
+    ```
 5. Upload to PyPI:
-```sh
-uv publish --token YOUR_API_TOKEN
-```
+    ```sh
+    uv publish --token YOUR_API_TOKEN
+    ```

{apify-3.4.2b28 → apify-3.4.2b30}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: apify
-Version: 3.4.2b28
+Version: 3.4.2b30
 Summary: Apify SDK for Python
 Project-URL: Apify Homepage, https://apify.com
 Project-URL: Changelog, https://docs.apify.com/sdk/python/docs/changelog
@@ -228,7 +228,7 @@ Classifier: Typing :: Typed
 Requires-Python: >=3.11
 Requires-Dist: apify-client<4.0.0,>=3.0.0
 Requires-Dist: cachetools>=5.5.0
-Requires-Dist: crawlee<2.0.0,>=1.0.4
+Requires-Dist: crawlee<2.0.0,>=1.8.0
 Requires-Dist: cryptography>=42.0.0
 Requires-Dist: impit>=0.8.0
 Requires-Dist: lazy-object-proxy>=1.11.0
@@ -438,7 +438,7 @@ async def main() -> None:
 The full SDK documentation lives at **[docs.apify.com/sdk/python](https://docs.apify.com/sdk/python)**. For the Apify platform itself, see the [Apify documentation](https://docs.apify.com/).
 | Section | What you'll find |
-|---|---|
+| --- | --- |
 | [Overview](https://docs.apify.com/sdk/python/docs/overview) | What the SDK is, what Actors are, and how the pieces fit together. |
 | [Quick start](https://docs.apify.com/sdk/python/docs/quick-start) | Create, run, and deploy your first Python Actor. |
 | [Concepts](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle) | Actor lifecycle, input, storages, events, proxy management, interacting with other Actors, webhooks, accessing the Apify API, logging, configuration, and pay-per-event. |

{apify-3.4.2b28 → apify-3.4.2b30}/README.md RENAMED

@@ -195,7 +195,7 @@ async def main() -> None:
 The full SDK documentation lives at **[docs.apify.com/sdk/python](https://docs.apify.com/sdk/python)**. For the Apify platform itself, see the [Apify documentation](https://docs.apify.com/).
 | Section | What you'll find |
-|---|---|
+| --- | --- |
 | [Overview](https://docs.apify.com/sdk/python/docs/overview) | What the SDK is, what Actors are, and how the pieces fit together. |
 | [Quick start](https://docs.apify.com/sdk/python/docs/quick-start) | Create, run, and deploy your first Python Actor. |
 | [Concepts](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle) | Actor lifecycle, input, storages, events, proxy management, interacting with other Actors, webhooks, accessing the Apify API, logging, configuration, and pay-per-event. |

{apify-3.4.2b28 → apify-3.4.2b30}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "apify"
-version = "3.4.2b28"
+version = "3.4.2b30"
 description = "Apify SDK for Python"
 authors = [{ name = "Apify Technologies s.r.o.", email = "support@apify.com" }]
 license = { file = "LICENSE" }
@@ -36,7 +36,7 @@ keywords = [
 ]
 dependencies = [
     "apify-client>=3.0.0,<4.0.0",
-    "crawlee>=1.0.4,<2.0.0",
+    "crawlee>=1.8.0,<2.0.0",
     "cachetools>=5.5.0",
     "cryptography>=42.0.0",
     "impit>=0.8.0",

{apify-3.4.2b28 → apify-3.4.2b30}/src/apify/storage_clients/_apify/_request_queue_client.py RENAMED

@@ -198,3 +198,7 @@ class ApifyRequestQueueClient(RequestQueueClient):
     @override
     async def is_empty(self) -> bool:
         return await self._implementation.is_empty()
+    @override
+    async def is_finished(self) -> bool:
+        return await self._implementation.is_finished()

{apify-3.4.2b28 → apify-3.4.2b30}/src/apify/storage_clients/_apify/_request_queue_shared_client.py RENAMED

@@ -11,7 +11,12 @@ from cachetools import LRUCache
 from crawlee.storage_clients.models import AddRequestsResponse, ProcessedRequest, RequestQueueMetadata
 from ._models import ApifyRequestQueueMetadata, CachedRequest, RequestQueueHead
-from ._utils import to_crawlee_request, unique_key_to_request_id
+from ._utils import (
+    resolve_awaited_in_flight,
+    settle_pending_addition,
+    to_crawlee_request,
+    unique_key_to_request_id,
+)
 if TYPE_CHECKING:
     from collections.abc import Callable, Coroutine, Sequence
@@ -71,6 +76,19 @@ class ApifyRequestQueueSharedClient:
         self._requests_cache: LRUCache[str, CachedRequest] = LRUCache(maxsize=cache_size)
         """LRU cache storing request objects, keyed by request ID."""
+        self._requests_being_added: dict[str, asyncio.Future[bool]] = {}
+        """In-flight `add_batch_of_requests` markers, keyed by request ID.
+        Coordinates only concurrent `add_batch_of_requests` calls sharing this one client instance (e.g. several
+        producer coroutines adding requests in the same process). It does not coordinate separate client instances
+        or processes, which each keep their own markers; deduplication across clients still relies on the platform.
+        Each future resolves once the platform call that is adding the request settles: `True` if the request was
+        committed, `False` otherwise. A concurrent call adding the same request awaits the future instead of
+        re-sending it, which avoids a duplicate platform write while still avoiding false success when the original
+        add fails.
+        """
         self._queue_has_locked_requests: bool | None = None
         """Whether the queue contains requests currently locked by other clients."""
@@ -87,9 +105,13 @@ class ApifyRequestQueueSharedClient:
         forefront: bool = False,
     ) -> AddRequestsResponse:
         """Specific implementation of this method for the RQ shared access mode."""
+        loop = asyncio.get_running_loop()
         # Do not try to add previously added requests to avoid pointless expensive calls to API
         new_requests: list[Request] = []
         already_present_requests: list[ProcessedRequest] = []
+        # Requests a concurrent `add_batch_of_requests` call is already sending. We await its outcome instead of
+        # re-sending them, as (request, that call's in-flight future) pairs.
+        awaited_in_flight: list[tuple[Request, asyncio.Future[bool]]] = []
         for request in requests:
             request_id = unique_key_to_request_id(request.unique_key)
@@ -106,46 +128,70 @@ class ApifyRequestQueueSharedClient:
                     )
                 )
+            elif request_id in self._requests_being_added:
+                # A concurrent call is already adding this request; await its outcome rather than re-sending it.
+                awaited_in_flight.append((request, self._requests_being_added[request_id]))
             else:
-                # Add new request to the cache.
-                processed_request = ProcessedRequest(
-                    id=request_id,
-                    unique_key=request.unique_key,
-                    was_already_present=True,
-                    was_already_handled=request.was_already_handled,
-                )
-                self._cache_request(
-                    request_id,
-                    processed_request,
-                )
+                # Register an in-flight marker so a concurrent call dedupes against it; caching is deferred
+                # until the platform confirms the request was accepted (see below).
                 new_requests.append(request)
+                self._requests_being_added[request_id] = loop.create_future()
         if new_requests:
             # Prepare requests for API by converting to dictionaries.
             requests_dict = [request.model_dump(by_alias=True) for request in new_requests]
-            # Send requests to API.
-            batch_response = await self._api_client.batch_add_requests(
-                requests=requests_dict,
-                forefront=forefront,
-            )
-            batch_response_dict = batch_response.model_dump(by_alias=True)
-            api_response = AddRequestsResponse.model_validate(batch_response_dict)
-            # Add the locally known already present processed requests based on the local cache.
-            api_response.processed_requests.extend(already_present_requests)
+            committed_request_ids: set[str] = set()
+            try:
+                # Send requests to API.
+                batch_response = await self._api_client.batch_add_requests(
+                    requests=requests_dict,
+                    forefront=forefront,
+                )
-            # Remove unprocessed requests from the cache
-            for unprocessed_request in api_response.unprocessed_requests:
-                unprocessed_request_id = unique_key_to_request_id(unprocessed_request.unique_key)
-                self._requests_cache.pop(unprocessed_request_id, None)
+                batch_response_dict = batch_response.model_dump(by_alias=True)
+                api_response = AddRequestsResponse.model_validate(batch_response_dict)
+                # Commit only the requests the platform actually accepted to the local dedup cache. Caching after
+                # the call succeeds (not before) keeps a failed call from poisoning the cache and silently
+                # deduplicating a later retry of the same request.
+                unprocessed_unique_keys = {request.unique_key for request in api_response.unprocessed_requests}
+                for request in new_requests:
+                    if request.unique_key in unprocessed_unique_keys:
+                        continue
+                    request_id = unique_key_to_request_id(request.unique_key)
+                    self._cache_request(
+                        request_id,
+                        ProcessedRequest(
+                            id=request_id,
+                            unique_key=request.unique_key,
+                            was_already_present=True,
+                            was_already_handled=request.was_already_handled,
+                        ),
+                    )
+                    committed_request_ids.add(request_id)
+                # Add the locally known already present processed requests based on the local cache.
+                api_response.processed_requests.extend(already_present_requests)
+            finally:
+                # Release the in-flight markers we registered. Committed requests tell concurrent callers the
+                # request reached the platform; everything else (unprocessed, API error, cancellation) tells them
+                # it did not, so they retry instead of reporting false success.
+                for request in new_requests:
+                    request_id = unique_key_to_request_id(request.unique_key)
+                    settle_pending_addition(
+                        self._requests_being_added, request_id, committed=request_id in committed_request_ids
+                    )
         else:
             api_response = AddRequestsResponse.model_validate(
                 {'unprocessedRequests': [], 'processedRequests': already_present_requests}
             )
+        # Fold in requests a concurrent call was already adding.
+        await resolve_awaited_in_flight(awaited_in_flight, api_response)
         logger.debug(
             f'Tried to add new requests: {len(new_requests)}, '
             f'succeeded to add new requests: {len(api_response.processed_requests) - len(already_present_requests)}, '
@@ -292,8 +338,18 @@ class ApifyRequestQueueSharedClient:
         # Check _list_head.
         # Without the lock the `is_empty` is prone to falsely report True with some low probability race condition.
         async with self._fetch_lock:
-            head = await self._list_head(limit=1)
-            return len(head.items) == 0 and not self._queue_has_locked_requests
+            return await self._is_empty()
+    async def is_finished(self) -> bool:
+        """Specific implementation of this method for the RQ shared access mode."""
+        async with self._fetch_lock:
+            # The order of operations matters here because `_is_empty()` affects `_queue_has_locked_requests`.
+            return await self._is_empty() and not self._queue_has_locked_requests
+    async def _is_empty(self) -> bool:
+        """Check whether anything is available to fetch. Lock-free core of `is_empty`, caller must hold the lock."""
+        head = await self._list_head(limit=1)
+        return len(head.items) == 0
     async def _get_metadata_estimate(self) -> RequestQueueMetadata:
         """Try to get cached metadata first. If multiple clients, fuse with global metadata.

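The in-flight marker scheme in the hunks above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the SDK's implementation: `InFlightDeduper`, its `_send` method, and the string return values are hypothetical, and the real code works on batches and folds outcomes into an `AddRequestsResponse`.

```python
import asyncio


class InFlightDeduper:
    """Toy model: one shared instance holding a dict of futures keyed by request ID."""

    def __init__(self) -> None:
        self.in_flight: dict[str, asyncio.Future[bool]] = {}
        self.platform_calls = 0

    async def _send(self, request_id: str) -> None:
        # Hypothetical platform call; the sleep lets a second caller arrive meanwhile.
        self.platform_calls += 1
        await asyncio.sleep(0.01)

    async def add(self, request_id: str) -> str:
        existing = self.in_flight.get(request_id)
        if existing is not None:
            # Another call is already sending this request: wait for its outcome.
            return "already_present" if await asyncio.shield(existing) else "unprocessed"

        future = asyncio.get_running_loop().create_future()
        self.in_flight[request_id] = future
        committed = False
        try:
            await self._send(request_id)
            committed = True
            return "added"
        finally:
            # Always release the marker, even on error or cancellation.
            self.in_flight.pop(request_id, None)
            if not future.done():
                future.set_result(committed)


async def demo() -> tuple[list[str], int]:
    deduper = InFlightDeduper()
    results = await asyncio.gather(deduper.add("req-1"), deduper.add("req-1"))
    return list(results), deduper.platform_calls


results, platform_calls = asyncio.run(demo())
```

Both callers ask for the same request, but only the first one reaches the simulated platform. The second waits on the first caller's future and reports the request as already present once that future resolves to `True`.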
{apify-3.4.2b28 → apify-3.4.2b30}/src/apify/storage_clients/_apify/_request_queue_single_client.py RENAMED

@@ -1,5 +1,6 @@
 from __future__ import annotations
+import asyncio
 from collections import deque
 from datetime import UTC, datetime
 from logging import getLogger
@@ -9,7 +10,12 @@ from cachetools import LRUCache
 from crawlee.storage_clients.models import AddRequestsResponse, ProcessedRequest, RequestQueueMetadata
-from ._utils import to_crawlee_request, unique_key_to_request_id
+from ._utils import (
+    resolve_awaited_in_flight,
+    settle_pending_addition,
+    to_crawlee_request,
+    unique_key_to_request_id,
+)
 if TYPE_CHECKING:
     from collections.abc import Sequence
@@ -90,6 +96,19 @@ class ApifyRequestQueueSingleClient:
         Tracked locally to accurately determine when the queue is empty for this single consumer.
         """
+        self._requests_being_added: dict[str, asyncio.Future[bool]] = {}
+        """In-flight `add_batch_of_requests` markers, keyed by request ID.
+        Coordinates only concurrent `add_batch_of_requests` calls sharing this one client instance (e.g. several
+        producer coroutines adding requests in the same process). It does not coordinate separate client instances
+        or processes, which each keep their own markers; deduplication across clients still relies on the platform.
+        Each future resolves once the platform call that is adding the request settles: `True` if the request was
+        committed, `False` otherwise. A concurrent call adding the same request awaits the future instead of
+        re-sending it, which avoids a duplicate platform write while still avoiding false success when the original
+        add fails.
+        """
         self._initialized_caches = False
         """Flag indicating whether local caches have been populated from existing queue contents.
@@ -108,8 +127,12 @@ class ApifyRequestQueueSingleClient:
             await self._init_caches()
             self._initialized_caches = True
+        loop = asyncio.get_running_loop()
         new_requests: list[Request] = []
         already_present_requests: list[ProcessedRequest] = []
+        # Requests a concurrent `add_batch_of_requests` call is already sending. We await its outcome instead of
+        # re-sending them, as (request, that call's in-flight future) pairs.
+        awaited_in_flight: list[tuple[Request, asyncio.Future[bool]]] = []
         for request in requests:
             # Calculate id for request
@@ -135,33 +158,54 @@ class ApifyRequestQueueSingleClient:
                         was_already_handled=request.was_already_handled,
                     )
                 )
+            # Check if a concurrent call is already adding this request, and await its outcome rather than
+            # re-sending it.
+            elif request_id in self._requests_being_added:
+                awaited_in_flight.append((request, self._requests_being_added[request_id]))
             else:
-                # Push the request to the platform. Probably not there, or we are not aware of it
+                # Push the request to the platform. Probably not there, or we are not aware of it. Register an
+                # in-flight marker so a concurrent call dedupes against it; caching is deferred until the
+                # platform confirms the request was accepted (see below).
                 new_requests.append(request)
-                # Update local caches
-                self._requests_cache[request_id] = request
-                if forefront:
-                    self._head_requests.append(request_id)
-                else:
-                    self._head_requests.appendleft(request_id)
+                self._requests_being_added[request_id] = loop.create_future()
         if new_requests:
             # Prepare requests for API by converting to dictionaries.
             requests_dict = [request.model_dump(by_alias=True) for request in new_requests]
-            # Send requests to API.
-            batch_response = await self._api_client.batch_add_requests(requests=requests_dict, forefront=forefront)
-            batch_response_dict = batch_response.model_dump(by_alias=True)
-            api_response = AddRequestsResponse.model_validate(batch_response_dict)
-            # Add the locally known already present processed requests based on the local cache.
-            api_response.processed_requests.extend(already_present_requests)
-            # Remove unprocessed requests from the cache
-            for unprocessed_request in api_response.unprocessed_requests:
-                request_id = unique_key_to_request_id(unprocessed_request.unique_key)
-                self._requests_cache.pop(request_id, None)
+            committed_request_ids: set[str] = set()
+            try:
+                # Send requests to API.
+                batch_response = await self._api_client.batch_add_requests(requests=requests_dict, forefront=forefront)
+                batch_response_dict = batch_response.model_dump(by_alias=True)
+                api_response = AddRequestsResponse.model_validate(batch_response_dict)
+                # Commit only the requests the platform actually accepted to the local caches. Caching after the
+                # call succeeds (not before) keeps a failed call from poisoning the cache and silently
+                # deduplicating a later retry of the same request.
+                unprocessed_unique_keys = {request.unique_key for request in api_response.unprocessed_requests}
+                for request in new_requests:
+                    if request.unique_key in unprocessed_unique_keys:
+                        continue
+                    request_id = unique_key_to_request_id(request.unique_key)
+                    self._requests_cache[request_id] = request
+                    if forefront:
+                        self._head_requests.append(request_id)
+                    else:
+                        self._head_requests.appendleft(request_id)
+                    committed_request_ids.add(request_id)
+                # Add the locally known already present processed requests based on the local cache.
+                api_response.processed_requests.extend(already_present_requests)
+            finally:
+                # Release the in-flight markers we registered. Committed requests tell concurrent callers the
+                # request reached the platform; everything else (unprocessed, API error, cancellation) tells them
+                # it did not, so they retry instead of reporting false success.
+                for request in new_requests:
+                    request_id = unique_key_to_request_id(request.unique_key)
+                    settle_pending_addition(
+                        self._requests_being_added, request_id, committed=request_id in committed_request_ids
+                    )
         else:
             api_response = AddRequestsResponse(
@@ -169,6 +213,9 @@ class ApifyRequestQueueSingleClient:
                 processed_requests=already_present_requests,
             )
+        # Fold in requests a concurrent call was already adding.
+        await resolve_awaited_in_flight(awaited_in_flight, api_response)
         # Update assumed total count for newly added requests.
         new_request_count = 0
         for processed_request in api_response.processed_requests:
@@ -277,9 +324,12 @@ class ApifyRequestQueueSingleClient:
     async def is_empty(self) -> bool:
         """Specific implementation of this method for the RQ single access mode."""
-        # Without the lock the `is_empty` is prone to falsely report True with some low probability race condition.
         await self._ensure_head_is_non_empty()
-        return not self._head_requests and not self._requests_in_progress
+        return not self._head_requests
+    async def is_finished(self) -> bool:
+        """Specific implementation of this method for the RQ single access mode."""
+        return await self.is_empty() and not self._requests_in_progress
     async def _ensure_head_is_non_empty(self) -> None:
         """Ensure that the queue head has requests if they are available in the queue."""

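The last hunk above narrows `is_empty` to "nothing left to fetch" and moves the in-progress check into the new `is_finished`. A toy model of that split is sketched below. `ToyQueue` is a hypothetical class written for illustration; only the two predicate names come from the diff.

```python
from __future__ import annotations

from collections import deque


class ToyQueue:
    """Toy single-consumer queue tracking a head of pending IDs and a set of in-progress IDs."""

    def __init__(self, request_ids: list[str]) -> None:
        self.head: deque[str] = deque(request_ids)
        self.in_progress: set[str] = set()

    def fetch_next(self) -> str | None:
        if not self.head:
            return None
        request_id = self.head.popleft()
        self.in_progress.add(request_id)
        return request_id

    def mark_handled(self, request_id: str) -> None:
        self.in_progress.discard(request_id)

    def is_empty(self) -> bool:
        # Nothing is available to fetch right now.
        return not self.head

    def is_finished(self) -> bool:
        # Nothing to fetch and nothing still being processed.
        return self.is_empty() and not self.in_progress


queue = ToyQueue(["req-1"])
request_id = queue.fetch_next()
state_while_processing = (queue.is_empty(), queue.is_finished())
queue.mark_handled(request_id)
state_after_handling = (queue.is_empty(), queue.is_finished())
```

While `req-1` is being processed the queue is empty but not finished. Once it is marked handled, both predicates return `True`.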
{apify-3.4.2b28 → apify-3.4.2b30}/src/apify/storage_clients/_apify/_utils.py RENAMED

@@ -1,17 +1,22 @@
 from __future__ import annotations
+import asyncio
 import re
 from base64 import b64encode
 from hashlib import sha256
 from typing import TYPE_CHECKING
 from crawlee._utils.crypto import compute_short_hash
+from crawlee.storage_clients.models import ProcessedRequest, UnprocessedRequest
 from apify import Request
 if TYPE_CHECKING:
+    from collections.abc import Iterable
     from apify_client._models import HeadRequest, LockedHeadRequest
     from apify_client._models import Request as ClientRequest
+    from crawlee.storage_clients.models import AddRequestsResponse
     from apify import Configuration
@@ -60,3 +65,48 @@ def to_crawlee_request(client_request: ClientRequest | HeadRequest | LockedHeadR
     # Validate and construct Crawlee Request from the serialized dict
     return Request.model_validate(request_dict)
+def settle_pending_addition(
+    requests_being_added: dict[str, asyncio.Future[bool]],
+    request_id: str,
+    *,
+    committed: bool,
+) -> None:
+    """Resolve the in-flight add marker for a request, unblocking any concurrent call awaiting it.
+    Args:
+        requests_being_added: The client's map of in-flight `add_batch_of_requests` markers.
+        request_id: ID of the request whose in-flight add has settled.
+        committed: Whether the request was committed to the platform.
+    """
+    future = requests_being_added.pop(request_id, None)
+    if future is not None and not future.done():
+        future.set_result(committed)
+async def resolve_awaited_in_flight(
+    awaited_in_flight: Iterable[tuple[Request, asyncio.Future[bool]]],
+    api_response: AddRequestsResponse,
+) -> None:
+    """Await concurrent in-flight adds of these requests and fold the outcome into `api_response`.
+    Requests the concurrent add committed are reported as already present; the rest are reported unprocessed
+    so the caller retries them rather than receiving false success.
+    """
+    for request, future in awaited_in_flight:
+        # Shield the shared in-flight marker: cancelling this awaiting caller must not cancel the future, which
+        # is owned by the original producer and may have other callers waiting on it.
+        if await asyncio.shield(future):
+            api_response.processed_requests.append(
+                ProcessedRequest(
+                    id=unique_key_to_request_id(request.unique_key),
+                    unique_key=request.unique_key,
+                    was_already_present=True,
+                    was_already_handled=request.was_already_handled,
+                )
+            )
+        else:
+            api_response.unprocessed_requests.append(
+                UnprocessedRequest(unique_key=request.unique_key, url=request.url, method=request.method)
+            )
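
The comment in `resolve_awaited_in_flight` explains why the shared future is awaited through `asyncio.shield`. A small sketch of that behavior, using only the standard library, is shown here. The names `wait_for_outcome`, `impatient`, and `patient` are illustrative and do not appear in the package.

```python
import asyncio


async def wait_for_outcome(marker: asyncio.Future[bool]) -> bool:
    # shield() wraps the marker so cancelling this waiter leaves the marker itself pending.
    return await asyncio.shield(marker)


async def demo() -> tuple[bool, bool, bool]:
    marker: asyncio.Future[bool] = asyncio.get_running_loop().create_future()
    impatient = asyncio.create_task(wait_for_outcome(marker))
    patient = asyncio.create_task(wait_for_outcome(marker))
    await asyncio.sleep(0)  # let both waiters start awaiting

    impatient.cancel()
    await asyncio.gather(impatient, return_exceptions=True)
    marker_survived = not marker.cancelled() and not marker.done()

    marker.set_result(True)  # the producer settles the marker
    return impatient.cancelled(), marker_survived, await patient


impatient_cancelled, marker_survived, patient_result = asyncio.run(demo())
```

Cancelling one waiter ends only that waiter. The marker stays pending until its owner resolves it, and the remaining waiter still receives the result.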