PyPI - airflow-toolkit - Versions diffs - 2.3.0__tar.gz → 2.4.0__tar.gz - Mend

airflow-toolkit 2.3.0.tar.gz → 2.4.0.tar.gz

Files changed (55)

{airflow_toolkit-2.3.0/src/airflow_toolkit.egg-info → airflow_toolkit-2.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: airflow-toolkit
-Version: 2.3.0
+Version: 2.4.0
 Summary: A toolkit of operators, hooks and utilities for Apache Airflow 3
 Author-email: Biel Llobera <biel_llobera@dkl.digital>
 Requires-Python: <3.15,>=3.11
@@ -145,8 +145,8 @@ pip install "airflow-toolkit[airflow3-full]"
 | `http` | `providers-http`, `requests`, `jmespath`, `pandas` | `HttpToFilesystem`, `MultiHttpToFilesystem` |
 | `duckdb` | `airflow-provider-duckdb` | `DuckdbToDeltalake` operator |
 | `sqlite` | `providers-sqlite` | SQLite as source or destination |
-| `excel` | `openpyxl` | Excel (`.xlsx` / `.xls`) support in `FilesystemToDatabase` |
-| `avro` | `fastavro` | Avro support in `FilesystemToDatabase` |
+| `excel` | `openpyxl` | Excel (`.xlsx` / `.xls`) support in `FilesystemToDatabase` and `HttpToFilesystem` |
+| `avro` | `fastavro` | Avro support in `FilesystemToDatabase` and `HttpToFilesystem` |
 | `airflow3-full` | all of the above | Quick start / development |
 ---
@@ -191,7 +191,7 @@ Changing the connection's `conn_type` is all that is needed to switch backends
 ### HttpToFilesystem
-Calls an HTTP endpoint and writes the response to any filesystem. Supports pagination, JMESPath filtering, compression, and custom response transformations.
+Calls an HTTP endpoint and writes the response to any filesystem. Supports pagination, JMESPath filtering, compression, OAuth 2.0 authentication, rate limiting, and custom response transformations.
 ```python
 from airflow_toolkit.providers.filesystem.operators.http_to_filesystem import HttpToFilesystem
@@ -208,7 +208,7 @@ HttpToFilesystem(
 )
 ```
-With cursor-based pagination:
+**With cursor-based pagination:**
 ```python
 def next_page(response):
@@ -230,9 +230,70 @@ HttpToFilesystem(
 )
 ```
+**With OAuth 2.0 Client Credentials:**
+`OAuth2ClientCredentials.client_credentials()` returns a configured auth class that fetches the token lazily on the first request and refreshes it automatically 30 seconds before expiry — no manual token management required.
+```python
+from airflow_toolkit.providers.filesystem.operators.auth import OAuth2ClientCredentials
+HttpToFilesystem(
+    task_id='fetch_protected_data',
+    http_conn_id='my_api',
+    filesystem_conn_id='my_data_lake',
+    filesystem_path='raw/data/{{ ds }}/',
+    endpoint='/api/v1/data',
+    method='GET',
+    save_format='jsonl',
+    auth_type=OAuth2ClientCredentials.client_credentials(
+        token_url='https://auth.example.com/oauth2/token',
+        client_id='{{ var.value.oauth2_client_id }}',
+        client_secret='{{ var.value.oauth2_client_secret }}',
+        scope='read',           # optional
+    ),
+)
+```
+**With rate limiting:**
+Use `requests_per_second` to cap how fast paginated requests are sent. This is useful when the API enforces a rate limit.
+```python
+HttpToFilesystem(
+    task_id='fetch_with_rate_limit',
+    http_conn_id='my_api',
+    filesystem_conn_id='my_data_lake',
+    filesystem_path='raw/events/{{ ds }}/',
+    endpoint='/api/v1/events',
+    method='GET',
+    pagination_function=next_page,
+    save_format='jsonl',
+    requests_per_second=3.0,    # max 3 requests per second between pages
+)
+```
+**Supported response formats:**
+`save_format` controls how the response is written to the filesystem. For APIs that return binary formats natively (e.g. a reporting API that streams Excel files), set `source_format` to match the response content type:
+| `source_format` / `save_format` | File extension | Notes |
+|---|---|---|
+| `json` | `.json` | Single JSON object or array |
+| `jsonl` | `.jsonl` | Array response written as one record per line |
+| `csv` | `.csv` | Raw CSV text from the response |
+| `xml` | `.xml` | Raw XML text from the response |
+| `parquet` | `.parquet` | Binary passthrough — API must return Parquet bytes |
+| `excel` | `.xlsx` | Binary passthrough — API must return Excel bytes (requires `[excel]`) |
+| `avro` | `.avro` | Binary passthrough — API must return Avro bytes (requires `[avro]`) |
+| `fixed_width` | `.fwf` | Fixed-width text from the response |
+All text and JSON formats support gzip/zip compression via the `compression` parameter.
 ### MultiHttpToFilesystem
-Runs multiple HTTP requests in a single Airflow task, saving each response as a separate file. Useful for fetching multiple entities or date ranges without creating one task per request.
+Runs multiple HTTP requests in a single Airflow task, saving each response as a separate file. Requests can run **sequentially** (with optional rate limiting) or **in parallel** using a thread pool.
+**Sequential with rate limiting:**
 ```python
 from airflow_toolkit.providers.filesystem.operators.http_to_filesystem import MultiHttpToFilesystem
@@ -244,6 +305,7 @@ MultiHttpToFilesystem(
     filesystem_path='raw/reference/{{ ds }}/',
     method='GET',
     save_format='jsonl',
+    requests_per_second=2.0,    # max 2 requests per second between calls
     multi_requests=[
         {'endpoint': '/api/v1/categories'},
         {'endpoint': '/api/v1/statuses'},
@@ -252,6 +314,31 @@ MultiHttpToFilesystem(
 )
 ```
+**Parallel execution:**
+Set `max_workers` to run requests concurrently using a thread pool. Each request writes to its own file — there are no file collisions.
+```python
+MultiHttpToFilesystem(
+    task_id='fetch_users_parallel',
+    http_conn_id='my_api',
+    filesystem_conn_id='my_data_lake',
+    filesystem_path='raw/users/{{ ds }}/',
+    method='GET',
+    save_format='json',
+    max_workers=5,              # up to 5 concurrent threads
+    multi_requests=[
+        {'endpoint': '/api/v1/users/1'},
+        {'endpoint': '/api/v1/users/2'},
+        {'endpoint': '/api/v1/users/3'},
+        {'endpoint': '/api/v1/users/4'},
+        {'endpoint': '/api/v1/users/5'},
+    ],
+)
+```
+> Rate limiting (`requests_per_second`) applies only in sequential mode. In parallel mode the thread pool controls concurrency — use `max_workers` to avoid overwhelming the API.
 Each entry in `multi_requests` can override any base parameter (`endpoint`, `method`, `headers`, `data`, `jmespath_expression`, `save_format`, `compression`).
 ### SQLToFilesystem

{airflow_toolkit-2.3.0 → airflow_toolkit-2.4.0}/README.md RENAMED

@@ -86,8 +86,8 @@ pip install "airflow-toolkit[airflow3-full]"
 | `http` | `providers-http`, `requests`, `jmespath`, `pandas` | `HttpToFilesystem`, `MultiHttpToFilesystem` |
 | `duckdb` | `airflow-provider-duckdb` | `DuckdbToDeltalake` operator |
 | `sqlite` | `providers-sqlite` | SQLite as source or destination |
-| `excel` | `openpyxl` | Excel (`.xlsx` / `.xls`) support in `FilesystemToDatabase` |
-| `avro` | `fastavro` | Avro support in `FilesystemToDatabase` |
+| `excel` | `openpyxl` | Excel (`.xlsx` / `.xls`) support in `FilesystemToDatabase` and `HttpToFilesystem` |
+| `avro` | `fastavro` | Avro support in `FilesystemToDatabase` and `HttpToFilesystem` |
 | `airflow3-full` | all of the above | Quick start / development |
 ---
@@ -132,7 +132,7 @@ Changing the connection's `conn_type` is all that is needed to switch backends
 ### HttpToFilesystem
-Calls an HTTP endpoint and writes the response to any filesystem. Supports pagination, JMESPath filtering, compression, and custom response transformations.
+Calls an HTTP endpoint and writes the response to any filesystem. Supports pagination, JMESPath filtering, compression, OAuth 2.0 authentication, rate limiting, and custom response transformations.
 ```python
 from airflow_toolkit.providers.filesystem.operators.http_to_filesystem import HttpToFilesystem
@@ -149,7 +149,7 @@ HttpToFilesystem(
 )
 ```
-With cursor-based pagination:
+**With cursor-based pagination:**
 ```python
 def next_page(response):
@@ -171,9 +171,70 @@ HttpToFilesystem(
 )
 ```
+**With OAuth 2.0 Client Credentials:**
+`OAuth2ClientCredentials.client_credentials()` returns a configured auth class that fetches the token lazily on the first request and refreshes it automatically 30 seconds before expiry — no manual token management required.
+```python
+from airflow_toolkit.providers.filesystem.operators.auth import OAuth2ClientCredentials
+HttpToFilesystem(
+    task_id='fetch_protected_data',
+    http_conn_id='my_api',
+    filesystem_conn_id='my_data_lake',
+    filesystem_path='raw/data/{{ ds }}/',
+    endpoint='/api/v1/data',
+    method='GET',
+    save_format='jsonl',
+    auth_type=OAuth2ClientCredentials.client_credentials(
+        token_url='https://auth.example.com/oauth2/token',
+        client_id='{{ var.value.oauth2_client_id }}',
+        client_secret='{{ var.value.oauth2_client_secret }}',
+        scope='read',           # optional
+    ),
+)
+```
+**With rate limiting:**
+Use `requests_per_second` to cap how fast paginated requests are sent. This is useful when the API enforces a rate limit.
+```python
+HttpToFilesystem(
+    task_id='fetch_with_rate_limit',
+    http_conn_id='my_api',
+    filesystem_conn_id='my_data_lake',
+    filesystem_path='raw/events/{{ ds }}/',
+    endpoint='/api/v1/events',
+    method='GET',
+    pagination_function=next_page,
+    save_format='jsonl',
+    requests_per_second=3.0,    # max 3 requests per second between pages
+)
+```
+**Supported response formats:**
+`save_format` controls how the response is written to the filesystem. For APIs that return binary formats natively (e.g. a reporting API that streams Excel files), set `source_format` to match the response content type:
+| `source_format` / `save_format` | File extension | Notes |
+|---|---|---|
+| `json` | `.json` | Single JSON object or array |
+| `jsonl` | `.jsonl` | Array response written as one record per line |
+| `csv` | `.csv` | Raw CSV text from the response |
+| `xml` | `.xml` | Raw XML text from the response |
+| `parquet` | `.parquet` | Binary passthrough — API must return Parquet bytes |
+| `excel` | `.xlsx` | Binary passthrough — API must return Excel bytes (requires `[excel]`) |
+| `avro` | `.avro` | Binary passthrough — API must return Avro bytes (requires `[avro]`) |
+| `fixed_width` | `.fwf` | Fixed-width text from the response |
+All text and JSON formats support gzip/zip compression via the `compression` parameter.
 ### MultiHttpToFilesystem
-Runs multiple HTTP requests in a single Airflow task, saving each response as a separate file. Useful for fetching multiple entities or date ranges without creating one task per request.
+Runs multiple HTTP requests in a single Airflow task, saving each response as a separate file. Requests can run **sequentially** (with optional rate limiting) or **in parallel** using a thread pool.
+**Sequential with rate limiting:**
 ```python
 from airflow_toolkit.providers.filesystem.operators.http_to_filesystem import MultiHttpToFilesystem
@@ -185,6 +246,7 @@ MultiHttpToFilesystem(
     filesystem_path='raw/reference/{{ ds }}/',
     method='GET',
     save_format='jsonl',
+    requests_per_second=2.0,    # max 2 requests per second between calls
     multi_requests=[
         {'endpoint': '/api/v1/categories'},
         {'endpoint': '/api/v1/statuses'},
@@ -193,6 +255,31 @@ MultiHttpToFilesystem(
 )
 ```
+**Parallel execution:**
+Set `max_workers` to run requests concurrently using a thread pool. Each request writes to its own file — there are no file collisions.
+```python
+MultiHttpToFilesystem(
+    task_id='fetch_users_parallel',
+    http_conn_id='my_api',
+    filesystem_conn_id='my_data_lake',
+    filesystem_path='raw/users/{{ ds }}/',
+    method='GET',
+    save_format='json',
+    max_workers=5,              # up to 5 concurrent threads
+    multi_requests=[
+        {'endpoint': '/api/v1/users/1'},
+        {'endpoint': '/api/v1/users/2'},
+        {'endpoint': '/api/v1/users/3'},
+        {'endpoint': '/api/v1/users/4'},
+        {'endpoint': '/api/v1/users/5'},
+    ],
+)
+```
+> Rate limiting (`requests_per_second`) applies only in sequential mode. In parallel mode the thread pool controls concurrency — use `max_workers` to avoid overwhelming the API.
 Each entry in `multi_requests` can override any base parameter (`endpoint`, `method`, `headers`, `data`, `jmespath_expression`, `save_format`, `compression`).
 ### SQLToFilesystem
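
Note on the format table above: the file extension column matches the `_FORMAT_EXTENSIONS` lookup added to `http_to_filesystem.py` in this release. The following is a minimal standalone sketch of that naming rule, not the operator's code. The function name `file_name` and the module-level `FORMAT_EXTENSIONS` dict are hypothetical, and the compression value `"gzip"` in the usage lines is an assumption about what the `compression` parameter accepts.

```python
from typing import Optional

# Only the formats whose extension differs from the format name need an entry;
# every other format falls back to its own name (as the source comment describes).
FORMAT_EXTENSIONS = {"excel": "xlsx", "fixed_width": "fwf"}


def file_name(n_part: int, save_format: str, compression: Optional[str] = None) -> str:
    """Build a part file name the way the 2.4.0 `_file_name` method appears to."""
    ext = FORMAT_EXTENSIONS.get(save_format, save_format)
    name = f"part{n_part:04}.{ext}"  # zero-padded to four digits
    if compression:
        name += f".{compression}"  # the compression value is appended verbatim
    return name


print(file_name(1, "excel"))
print(file_name(12, "jsonl", "gzip"))  # "gzip" is an assumed compression value
```

Under 2.3.0 the extension was the `save_format` value itself, so an Excel response would have been written with a `.excel` suffix rather than `.xlsx`.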

{airflow_toolkit-2.3.0 → airflow_toolkit-2.4.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "airflow-toolkit"
-version = "2.3.0"
+version = "2.4.0"
 description = "A toolkit of operators, hooks and utilities for Apache Airflow 3"
 authors = [{ name = "Biel Llobera", email = "biel_llobera@dkl.digital" }]
 requires-python = ">=3.11,<3.15"

airflow_toolkit-2.4.0/src/airflow_toolkit/providers/filesystem/operators/auth.py ADDED

@@ -0,0 +1,66 @@
+from __future__ import annotations
+import time
+from typing import TYPE_CHECKING
+import requests
+from requests.auth import AuthBase
+if TYPE_CHECKING:
+    from requests import PreparedRequest
+class OAuth2ClientCredentials:
+    """Factory for OAuth 2.0 Client Credentials auth.
+    Returns a configured AuthBase class that fetches and caches tokens,
+    refreshing automatically 30 seconds before expiry.
+    Usage:
+        auth_type=OAuth2ClientCredentials.client_credentials(
+            token_url="https://auth.example.com/token",
+            client_id="{{ var.value.client_id }}",
+            client_secret="{{ var.value.client_secret }}",
+        )
+    """
+    @staticmethod
+    def client_credentials(
+        token_url: str,
+        client_id: str,
+        client_secret: str,
+        scope: str | None = None,
+    ) -> type[AuthBase]:
+        """Return a configured AuthBase subclass for OAuth 2.0 Client Credentials.
+        Each call produces an independent class with its own token cache, so
+        multiple operators with different credentials do not share tokens.
+        """
+        class _OAuth2Auth(AuthBase):
+            _token: str | None = None
+            _expiry: float = 0.0
+            def __call__(self, r: "PreparedRequest") -> "PreparedRequest":
+                cls = type(self)
+                if cls._token is None or time.time() >= cls._expiry - 30:
+                    cls._refresh()
+                r.headers["Authorization"] = f"Bearer {cls._token}"
+                return r
+            @classmethod
+            def _refresh(cls) -> None:
+                payload: dict[str, str] = {
+                    "grant_type": "client_credentials",
+                    "client_id": client_id,
+                    "client_secret": client_secret,
+                }
+                if scope:
+                    payload["scope"] = scope
+                resp = requests.post(token_url, data=payload)
+                resp.raise_for_status()
+                data = resp.json()
+                cls._token = data["access_token"]
+                cls._expiry = time.time() + float(data.get("expires_in", 3600))
+        return _OAuth2Auth
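
Note on `auth.py`: the factory keeps the token and its expiry on the class it returns, so every instance of one returned class shares a cache while separate factory calls stay independent. The sketch below isolates that caching pattern so it can be run without a network. It is an illustration under assumptions, not the package's code: `make_token_cache`, the injected `fetch` callable, and the `clock` parameter are all hypothetical stand-ins for the `requests.post` call and `time.time()` used in the real module.

```python
import time
from typing import Callable, Optional


def make_token_cache(
    fetch: Callable[[], dict],
    skew: float = 30.0,
    clock: Callable[[], float] = time.time,
):
    """Return a new class with its own class-level token cache.

    `fetch` stands in for the token endpoint call and must return a dict
    shaped like an OAuth 2.0 token response.
    """

    class _Cache:
        _token: Optional[str] = None
        _expiry: float = 0.0

        @classmethod
        def token(cls) -> str:
            # Refresh on first use, or once we are within `skew` seconds of expiry.
            if cls._token is None or clock() >= cls._expiry - skew:
                data = fetch()
                cls._token = data["access_token"]
                cls._expiry = clock() + float(data.get("expires_in", 3600))
            return cls._token

    return _Cache


# Usage with a fake token endpoint and a controllable clock.
calls = []
now = [1000.0]


def fake_fetch() -> dict:
    calls.append(1)
    return {"access_token": f"tok{len(calls)}", "expires_in": 60}


cache_a = make_token_cache(fake_fetch, clock=lambda: now[0])
first = cache_a.token()      # fetched lazily on first use
second = cache_a.token()     # served from the cache
now[0] += 31                 # 29 seconds of validity left, inside the 30 s window
third = cache_a.token()      # refreshed
cache_b = make_token_cache(fake_fetch, clock=lambda: now[0])
other = cache_b.token()      # an independent cache, so it fetches its own token
```

One observation from the diff itself: the cache is not guarded by a lock, so with `max_workers` set, several threads could plausibly trigger a refresh at the same moment. That would cost extra token requests rather than produce a wrong header, as far as the code shown here indicates.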

{airflow_toolkit-2.3.0 → airflow_toolkit-2.4.0}/src/airflow_toolkit/providers/filesystem/operators/http_to_filesystem.py RENAMED

@@ -2,7 +2,9 @@ from __future__ import annotations
 import json
 import logging
+import time
 import uuid
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from io import BytesIO, StringIO
 from typing import (
     TYPE_CHECKING,
@@ -38,7 +40,10 @@ if TYPE_CHECKING:
 class HttpBatchOperator(HttpOperator):
     def execute(
-        self, context: Context, use_new_data_parameters_on_pagination=False
+        self,
+        context: Context,
+        use_new_data_parameters_on_pagination: bool = False,
+        delay: float = 0.0,
     ) -> Generator[Any, None, None]:
         self.log.info("Calling HTTP method")
@@ -49,16 +54,22 @@ class HttpBatchOperator(HttpOperator):
         for response in self.paginate_sync(
             response=response,
             use_new_data_parameters_on_pagination=use_new_data_parameters_on_pagination,
+            delay=delay,
         ):
             yield self.process_response(context=context, response=response)
     def paginate_sync(
-        self, response: Response, use_new_data_parameters_on_pagination=False
+        self,
+        response: Response,
+        use_new_data_parameters_on_pagination: bool = False,
+        delay: float = 0.0,
     ) -> Generator[Response, None, None]:
         if not self.pagination_function:
             return
         while True:
+            if delay > 0:
+                time.sleep(delay)
             next_page_params = self.pagination_function(response)
             if not next_page_params:
                 break
@@ -71,7 +82,9 @@ class HttpBatchOperator(HttpOperator):
         return
     def _merge_next_page_parameters(
-        self, next_page_params: dict, use_new_data_parameters_on_pagination=False
+        self,
+        next_page_params: dict,
+        use_new_data_parameters_on_pagination: bool = False,
     ) -> dict:
         """Merge initial request parameters with next page parameters.
@@ -119,7 +132,21 @@ class HttpToFilesystem(BaseOperator):
     template_fields_renderers = HttpOperator.template_fields_renderers
     json_response_source_format = ["json", "jsonl"]
-    binary_response_source_format = ["parquet"]
+    binary_response_source_format = ["parquet", "excel", "avro"]
+    # Maps source_format/save_format → file extension.
+    # Formats that differ from their name (excel → xlsx, fixed_width → fwf) are listed
+    # explicitly; all others fall back to using the format name as-is.
+    _FORMAT_EXTENSIONS: dict[str, str] = {
+        "json": "json",
+        "jsonl": "jsonl",
+        "xml": "xml",
+        "csv": "csv",
+        "parquet": "parquet",
+        "excel": "xlsx",
+        "avro": "avro",
+        "fixed_width": "fwf",
+    }
     def __init__(
         self,
@@ -142,6 +169,7 @@ class HttpToFilesystem(BaseOperator):
         data_transformation_kwargs: dict[str, Any] | None = None,
         file_number_start: int = 1,
         strict_response_schema: bool = True,
+        requests_per_second: float | None = None,
         *args,
         **kwargs,
     ):
@@ -170,14 +198,15 @@ class HttpToFilesystem(BaseOperator):
         self.source_format = source_format if source_format else save_format
         self.file_number_start = file_number_start
         self.strict_response_schema = strict_response_schema
+        self.requests_per_second = requests_per_second
         self.kwargs = kwargs
         if (
-            self.save_format in self.binary_response_source_format
+            self.source_format in self.binary_response_source_format
             and self.compression is not None
         ):
             raise ValueError(
-                f"Compression is not supported for binary response save formats: {self.binary_response_source_format}"
+                f"Compression is not supported for binary source formats: {self.binary_response_source_format}"
             )
         if self.data_transformation and not callable(self.data_transformation):
@@ -204,10 +233,12 @@ class HttpToFilesystem(BaseOperator):
             response_filter=self._response_filter,
             pagination_function=self.pagination_function,
         )
+        delay = (1.0 / self.requests_per_second) if self.requests_per_second else 0.0
         for i, data in enumerate(
             http_batch_operator.execute(
                 context,
                 use_new_data_parameters_on_pagination=self.use_new_data_parameters_on_pagination,
+                delay=delay,
             ),
             start=self.file_number_start,
         ):
@@ -232,9 +263,9 @@ class HttpToFilesystem(BaseOperator):
                 )
                 filesystem_protocol.write(BytesIO(), success_file_path)
-    def _file_name(self, n_part) -> str:
-        file_name = f"part{n_part:04}.{self.save_format}"
+    def _file_name(self, n_part: int) -> str:
+        ext = self._FORMAT_EXTENSIONS.get(self.save_format, self.save_format)
+        file_name = f"part{n_part:04}.{ext}"
         if self.compression:
             file_name += f".{self.compression}"
         return file_name
@@ -263,7 +294,6 @@ class HttpToFilesystem(BaseOperator):
         self.response_filter_data = data
-        # Check if we have a custom data transformation
         if self.data_transformation and self.data_transformation_kwargs:
             transformed = self.data_transformation(
                 data, self.data_transformation_kwargs
@@ -273,8 +303,6 @@ class HttpToFilesystem(BaseOperator):
             transformed = self.data_transformation(data)
             return self._ensure_bytesio(transformed)
-        # If we don't have a custom data transformation, use the default one based on the source_format
         match self.source_format:
             case "json":
                 return json_to_binary(data, self.compression)
@@ -291,10 +319,10 @@ class HttpToFilesystem(BaseOperator):
                     return BytesIO()
                 return list_to_jsonl(data, self.compression)
-            case "xml":
+            case "xml" | "fixed_width":
                 return xml_to_binary(data, self.compression)
-            case "parquet":
+            case "parquet" | "excel" | "avro":
                 return self._ensure_bytesio(data)
             case "csv":
@@ -306,9 +334,6 @@ class HttpToFilesystem(BaseOperator):
                 )
     def _ensure_bytesio(self, value: BytesIO | bytes | str) -> BytesIO:
-        """
-        Ensure the transformation output is a BytesIO object.
-        """
         if isinstance(value, BytesIO):
             return value
         if isinstance(value, bytes):
@@ -329,25 +354,28 @@ class MultiHttpToFilesystem(HttpToFilesystem):
     Args:
         multi_requests: List of request specifications. Each item can override
                        any base operator parameter for that specific request.
+        max_workers: Number of threads for parallel execution. None (default) runs
+                     requests sequentially. Set to an integer to enable concurrency.
+                     Note: rate limiting (requests_per_second) is only applied in
+                     sequential mode.
     Example:
         MultiHttpToFilesystem(
             http_conn_id='api_connection',
-            base_endpoint='/api/v1',
             headers={'Authorization': 'Bearer token'},
             multi_requests=[
                 {'endpoint': '/users/1'},
                 {'endpoint': '/users/2', 'method': 'POST', 'data': {...}},
-                {'endpoint': '/orders', 'headers': {'Custom': 'Header'}}
+                {'endpoint': '/orders', 'headers': {'Custom': 'Header'}},
             ]
         )
     Notes:
         - Pagination is not supported
-        - Requests are executed sequentially within the task
+        - Requests are executed sequentially by default; set max_workers for concurrency
         - Per-request values override base configuration with dict merging for
           headers/data, and replacement for other parameters
-        - All validations are re-applied after each request configuration
+        - All validations are re-applied after each request configuration override
     """
     template_fields = HttpToFilesystem.template_fields + ["multi_requests"]
@@ -356,11 +384,15 @@ class MultiHttpToFilesystem(HttpToFilesystem):
         "multi_requests": "py",
     }
-    # Allowed keys come from the TypedDict (so static + runtime stay in sync)
     _ALLOWED_KEYS = set(RequestSpec.__annotations__.keys())
-    def __init__(self, *, multi_requests: list[RequestSpec], **kwargs):
-        # No pagination in this multi operator
+    def __init__(
+        self,
+        *,
+        multi_requests: list[RequestSpec],
+        max_workers: int | None = None,
+        **kwargs,
+    ):
         if kwargs.get("pagination_function") is not None:
             raise ValueError("Pagination is not supported in MultiHttpToFilesystem")
@@ -369,6 +401,7 @@ class MultiHttpToFilesystem(HttpToFilesystem):
         super().__init__(**kwargs)
         self.multi_requests: list[RequestSpec] = multi_requests
+        self.max_workers = max_workers
     def _capture_request_state(self) -> RequestState:
         return {
@@ -396,29 +429,24 @@ class MultiHttpToFilesystem(HttpToFilesystem):
     @staticmethod
     def _merge_or_replace(base_val: Any, override_val: Any) -> Any:
-        # Shallow-merge dicts; otherwise replace.
         if isinstance(base_val, dict) and isinstance(override_val, dict):
             return {**base_val, **override_val}
         return override_val
     def _apply_request_overrides(self, spec: RequestSpec, base: RequestState) -> None:
-        """
-            Apply per-request configuration overrides on top of base operator settings.
+        """Apply per-request configuration overrides on top of base operator settings.
         Args:
-            spec: Request-specific configuration overrides
-            base: Base operator configuration to restore after request
+            spec: Request-specific configuration overrides.
+            base: Base operator configuration to restore after request.
         Raises:
-            ValueError: If spec contains unknown keys or invalid combinations
+            ValueError: If spec contains unknown keys or invalid combinations.
         """
-        # Validate allowed override keys
         unknown = set(spec.keys()) - self._ALLOWED_KEYS
         if unknown:
             raise ValueError(f"Unknown keys in multi_requests item: {sorted(unknown)}")
-        # Simple fields
         self.endpoint = spec.get("endpoint", base["endpoint"])
         self.method = spec.get("method", base["method"])
         self.auth_type = spec.get("auth_type", base["auth_type"])
@@ -426,7 +454,6 @@ class MultiHttpToFilesystem(HttpToFilesystem):
             "jmespath_expression", base["jmespath_expression"]
         )
-        # Dict-like merge/replace behavior
         if "headers" in spec:
             self.headers = self._merge_or_replace(base["headers"], spec["headers"])
         else:
@@ -437,22 +464,19 @@ class MultiHttpToFilesystem(HttpToFilesystem):
         else:
             self.data = base["data"]
-        # Formats & compression
         self.save_format = spec.get("save_format", base["save_format"])
         self.source_format = spec.get("source_format", base["source_format"])
         self.compression = spec.get("compression", base["compression"])
-        # Validate this request's final state
         self._validate_current_request_state()
     def _validate_current_request_state(self) -> None:
-        # Re-apply critical validations that may be affected by per-request overrides
         if (
-            self.save_format in self.binary_response_source_format
+            self.source_format in self.binary_response_source_format
             and self.compression is not None
         ):
             raise ValueError(
-                f"Compression is not supported for binary response save formats: "
+                f"Compression is not supported for binary source formats: "
                 f"{self.binary_response_source_format}"
             )
         if self.data_transformation and not callable(self.data_transformation):
@@ -466,15 +490,91 @@ class MultiHttpToFilesystem(HttpToFilesystem):
                 "data_transformation must be provided if data_transformation_kwargs is provided"
             )
-    def execute(self, context) -> Any:
+    @staticmethod
+    def _execute_one_request(
+        base_op: "MultiHttpToFilesystem",
+        context: Context,
+        spec: RequestSpec,
+        base_state: RequestState,
+        file_number: int,
+    ) -> None:
+        """Execute a single request as an independent HttpToFilesystem (thread-safe).
+        Creates a standalone operator instance with the merged configuration so that
+        parallel workers never share mutable state.
+        """
+        unknown = set(spec.keys()) - MultiHttpToFilesystem._ALLOWED_KEYS
+        if unknown:
+            raise ValueError(f"Unknown keys in multi_requests item: {sorted(unknown)}")
+        merged: dict[str, Any] = {
+            "endpoint": spec.get("endpoint", base_state["endpoint"]),
+            "method": spec.get("method", base_state["method"]),
+            "auth_type": spec.get("auth_type", base_state["auth_type"]),
+            "jmespath_expression": spec.get(
+                "jmespath_expression", base_state["jmespath_expression"]
+            ),
+            "headers": MultiHttpToFilesystem._merge_or_replace(
+                base_state["headers"], spec["headers"]
+            )
+            if "headers" in spec
+            else base_state["headers"],
+            "data": MultiHttpToFilesystem._merge_or_replace(
+                base_state["data"], spec["data"]
+            )
+            if "data" in spec
+            else base_state["data"],
+            "save_format": spec.get("save_format", base_state["save_format"]),
+            "source_format": spec.get("source_format", base_state["source_format"]),
+            "compression": spec.get("compression", base_state["compression"]),
+        }
+        op = HttpToFilesystem(
+            task_id=f"http-{uuid.uuid4()}",
+            http_conn_id=base_op.http_conn_id,
+            filesystem_conn_id=base_op.filesystem_conn_id,
+            filesystem_path=base_op.filesystem_path,
+            data_transformation=base_op.data_transformation,
+            data_transformation_kwargs=base_op.data_transformation_kwargs or None,
+            create_file_on_success=base_op.create_file_on_success,
+            strict_response_schema=base_op.strict_response_schema,
+            requests_per_second=None,
+            file_number_start=file_number,
+            **merged,
+        )
+        op.execute(context)
+    def execute(self, context: Context) -> Any:
         base = self._capture_request_state()
-        for i, spec in enumerate(self.multi_requests, start=1):
-            self.file_number_start = i
-            try:
-                self._apply_request_overrides(spec, base)
-                super().execute(context)
-            finally:
-                self._restore_request_state(base)
+        if self.max_workers:
+            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
+                futures = [
+                    executor.submit(
+                        MultiHttpToFilesystem._execute_one_request,
+                        self,
+                        context,
+                        spec,
+                        base,
+                        i,
+                    )
+                    for i, spec in enumerate(self.multi_requests, start=1)
+                ]
+                for future in as_completed(futures):
+                    future.result()
+        else:
+            delay = (
+                (1.0 / self.requests_per_second) if self.requests_per_second else 0.0
+            )
+            for i, spec in enumerate(self.multi_requests, start=1):
+                if i > 1 and delay > 0:
+                    time.sleep(delay)
+                self.file_number_start = i
+                try:
+                    self._apply_request_overrides(spec, base)
+                    super().execute(context)
+                finally:
+                    self._restore_request_state(base)
 def list_to_jsonl(data: list[dict], compression: "CompressionOptions") -> BytesIO:
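
Note on the new `execute` paths: the diff converts `requests_per_second` into a fixed sleep of `1 / requests_per_second` seconds and applies it only on the sequential branch, while the `max_workers` branch submits every request to a thread pool and surfaces failures through `future.result()`. A minimal sketch of that control flow follows, with the HTTP call replaced by an arbitrary `worker` callable. The names `delay_for` and `run_requests` are hypothetical and do not exist in the package.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Optional, Sequence


def delay_for(requests_per_second: Optional[float]) -> float:
    """Seconds to sleep between requests; None or 0 means no rate limit."""
    return (1.0 / requests_per_second) if requests_per_second else 0.0


def run_requests(
    specs: Sequence[Any],
    worker: Callable[[int, Any], Any],
    max_workers: Optional[int] = None,
    requests_per_second: Optional[float] = None,
) -> list:
    results = []
    if max_workers:
        # Parallel branch: the rate limit is ignored, the pool size bounds concurrency.
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [
                executor.submit(worker, i, spec)
                for i, spec in enumerate(specs, start=1)
            ]
            for future in as_completed(futures):
                results.append(future.result())  # re-raises a worker's exception
    else:
        # Sequential branch: sleep before every request except the first.
        delay = delay_for(requests_per_second)
        for i, spec in enumerate(specs, start=1):
            if i > 1 and delay > 0:
                time.sleep(delay)
            results.append(worker(i, spec))
    return results


sequential = run_requests(["a", "b"], lambda i, s: (i, s), requests_per_second=100.0)
parallel = run_requests(["a", "b", "c"], lambda i, s: f"{i}:{s}", max_workers=2)
```

Two consequences follow from the truthiness checks in the diff: `requests_per_second=0` behaves like no limit, and `max_workers=0` falls through to the sequential branch. Results on the parallel branch arrive in completion order, which is why the real operator numbers files by submission index rather than by arrival.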

{airflow_toolkit-2.3.0 → airflow_toolkit-2.4.0/src/airflow_toolkit.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: airflow-toolkit
-Version: 2.3.0
+Version: 2.4.0
 Summary: A toolkit of operators, hooks and utilities for Apache Airflow 3
 Author-email: Biel Llobera <biel_llobera@dkl.digital>
 Requires-Python: <3.15,>=3.11
@@ -145,8 +145,8 @@ pip install "airflow-toolkit[airflow3-full]"
 | `http` | `providers-http`, `requests`, `jmespath`, `pandas` | `HttpToFilesystem`, `MultiHttpToFilesystem` |
 | `duckdb` | `airflow-provider-duckdb` | `DuckdbToDeltalake` operator |
 | `sqlite` | `providers-sqlite` | SQLite as source or destination |
-| `excel` | `openpyxl` | Excel (`.xlsx` / `.xls`) support in `FilesystemToDatabase` |
-| `avro` | `fastavro` | Avro support in `FilesystemToDatabase` |
+| `excel` | `openpyxl` | Excel (`.xlsx` / `.xls`) support in `FilesystemToDatabase` and `HttpToFilesystem` |
+| `avro` | `fastavro` | Avro support in `FilesystemToDatabase` and `HttpToFilesystem` |
 | `airflow3-full` | all of the above | Quick start / development |
 ---
@@ -191,7 +191,7 @@ Changing the connection's `conn_type` is all that is needed to switch backends
 ### HttpToFilesystem
-Calls an HTTP endpoint and writes the response to any filesystem. Supports pagination, JMESPath filtering, compression, and custom response transformations.
+Calls an HTTP endpoint and writes the response to any filesystem. Supports pagination, JMESPath filtering, compression, OAuth 2.0 authentication, rate limiting, and custom response transformations.
 ```python
 from airflow_toolkit.providers.filesystem.operators.http_to_filesystem import HttpToFilesystem
@@ -208,7 +208,7 @@ HttpToFilesystem(
 )
 ```
-With cursor-based pagination:
+**With cursor-based pagination:**
 ```python
 def next_page(response):
@@ -230,9 +230,70 @@ HttpToFilesystem(
 )
 ```
+**With OAuth 2.0 Client Credentials:**
+`OAuth2ClientCredentials.client_credentials()` returns a configured auth class that fetches the token lazily on the first request and refreshes it automatically 30 seconds before expiry — no manual token management required.
+```python
+from airflow_toolkit.providers.filesystem.operators.auth import OAuth2ClientCredentials
+HttpToFilesystem(
+    task_id='fetch_protected_data',
+    http_conn_id='my_api',
+    filesystem_conn_id='my_data_lake',
+    filesystem_path='raw/data/{{ ds }}/',
+    endpoint='/api/v1/data',
+    method='GET',
+    save_format='jsonl',
+    auth_type=OAuth2ClientCredentials.client_credentials(
+        token_url='https://auth.example.com/oauth2/token',
+        client_id='{{ var.value.oauth2_client_id }}',
+        client_secret='{{ var.value.oauth2_client_secret }}',
+        scope='read',           # optional
+    ),
+)
+```
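The README text above describes a token that is fetched lazily and refreshed 30 seconds before expiry. The sketch below illustrates that caching rule in isolation, under stated assumptions: `CachedToken`, `fetch_token`, and `clock` are hypothetical names, and this is not the package's `OAuth2ClientCredentials` implementation, which the diff does not show.

```python
import time
from typing import Callable, Optional


class CachedToken:
    """Illustrative token cache; not the package's OAuth2ClientCredentials class."""

    def __init__(
        self,
        fetch_token: Callable[[], tuple[str, float]],  # hypothetical: returns (access_token, expires_in_seconds)
        refresh_margin: float = 30.0,                  # refresh this many seconds before expiry
        clock: Callable[[], float] = time.monotonic,   # injectable clock for testing
    ) -> None:
        self._fetch_token = fetch_token
        self._refresh_margin = refresh_margin
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Nothing is fetched until the first call; afterwards a new token is
        # requested only once the clock is within the margin of expiry.
        if self._token is None or self._clock() >= self._expires_at - self._refresh_margin:
            token, expires_in = self._fetch_token()
            self._token = token
            self._expires_at = self._clock() + expires_in
        return self._token
```

With a token valid for 100 seconds, `get()` returns the cached value until 70 seconds have elapsed and fetches a new one from that point on.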
+**With rate limiting:**
+Use `requests_per_second` to cap how fast paginated requests are sent. This is useful when the API enforces a rate limit.
+```python
+HttpToFilesystem(
+    task_id='fetch_with_rate_limit',
+    http_conn_id='my_api',
+    filesystem_conn_id='my_data_lake',
+    filesystem_path='raw/events/{{ ds }}/',
+    endpoint='/api/v1/events',
+    method='GET',
+    pagination_function=next_page,
+    save_format='jsonl',
+    requests_per_second=3.0,    # max 3 requests per second between pages
+)
+```
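One way to read `requests_per_second` for paginated calls is as a fixed pause of `1 / requests_per_second` seconds before every page after the first. The sketch below assumes that interpretation, borrowed from the sequential branch of `MultiHttpToFilesystem.execute` in this diff; whether `HttpToFilesystem` throttles pages the same way is not visible here. `paginate`, `get_page`, and `next_cursor` are hypothetical names.

```python
import time
from typing import Callable, Iterator, Optional


def paginate(
    get_page: Callable[[Optional[str]], dict],     # hypothetical: performs one request for a cursor
    next_cursor: Callable[[dict], Optional[str]],  # hypothetical: returns None when no pages remain
    requests_per_second: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,   # injectable so the delay can be observed
) -> Iterator[dict]:
    delay = (1.0 / requests_per_second) if requests_per_second else 0.0
    cursor: Optional[str] = None
    first = True
    while True:
        # The first page is requested immediately; later pages wait `delay` seconds.
        if not first and delay > 0:
            sleep(delay)
        first = False
        page = get_page(cursor)
        yield page
        cursor = next_cursor(page)
        if cursor is None:
            break
```

A fixed pause caps the request rate but ignores the time each request takes, so the effective rate is somewhat below the configured value.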
+**Supported response formats:**
+`save_format` controls how the response is written to the filesystem. For APIs that return binary formats natively (e.g. a reporting API that streams Excel files), set `source_format` to match the response content type:
+| `source_format` / `save_format` | File extension | Notes |
+|---|---|---|
+| `json` | `.json` | Single JSON object or array |
+| `jsonl` | `.jsonl` | Array response written as one record per line |
+| `csv` | `.csv` | Raw CSV text from the response |
+| `xml` | `.xml` | Raw XML text from the response |
+| `parquet` | `.parquet` | Binary passthrough — API must return Parquet bytes |
+| `excel` | `.xlsx` | Binary passthrough — API must return Excel bytes (requires `[excel]`) |
+| `avro` | `.avro` | Binary passthrough — API must return Avro bytes (requires `[avro]`) |
+| `fixed_width` | `.fwf` | Fixed-width text from the response |
+All text and JSON formats support gzip/zip compression via the `compression` parameter.
 ### MultiHttpToFilesystem
-Runs multiple HTTP requests in a single Airflow task, saving each response as a separate file. Useful for fetching multiple entities or date ranges without creating one task per request.
+Runs multiple HTTP requests in a single Airflow task, saving each response as a separate file. Requests can run **sequentially** (with optional rate limiting) or **in parallel** using a thread pool.
+**Sequential with rate limiting:**
 ```python
 from airflow_toolkit.providers.filesystem.operators.http_to_filesystem import MultiHttpToFilesystem
@@ -244,6 +305,7 @@ MultiHttpToFilesystem(
     filesystem_path='raw/reference/{{ ds }}/',
     method='GET',
     save_format='jsonl',
+    requests_per_second=2.0,    # max 2 requests per second between calls
     multi_requests=[
         {'endpoint': '/api/v1/categories'},
         {'endpoint': '/api/v1/statuses'},
@@ -252,6 +314,31 @@ MultiHttpToFilesystem(
 )
 ```
+**Parallel execution:**
+Set `max_workers` to run requests concurrently using a thread pool. Each request writes to its own file — there are no file collisions.
+```python
+MultiHttpToFilesystem(
+    task_id='fetch_users_parallel',
+    http_conn_id='my_api',
+    filesystem_conn_id='my_data_lake',
+    filesystem_path='raw/users/{{ ds }}/',
+    method='GET',
+    save_format='json',
+    max_workers=5,              # up to 5 concurrent threads
+    multi_requests=[
+        {'endpoint': '/api/v1/users/1'},
+        {'endpoint': '/api/v1/users/2'},
+        {'endpoint': '/api/v1/users/3'},
+        {'endpoint': '/api/v1/users/4'},
+        {'endpoint': '/api/v1/users/5'},
+    ],
+)
+```
+> Rate limiting (`requests_per_second`) applies only in sequential mode. In parallel mode the thread pool controls concurrency — use `max_workers` to avoid overwhelming the API.
 Each entry in `multi_requests` can override any base parameter (`endpoint`, `method`, `headers`, `data`, `jmespath_expression`, `save_format`, `compression`).
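Per-request overrides of this kind can be pictured as a dictionary merge in which keys from the request entry win over the base parameters. The sketch below assumes that model. `merge_request` is a hypothetical helper, and the diff does not show how the operator actually builds its `merged` arguments.

```python
def merge_request(base: dict, spec: dict) -> dict:
    """Return the base parameters with per-request overrides applied.

    A new dict is built, so neither input is modified.
    """
    return {**base, **spec}


# Base parameters shared by every request (illustrative values).
base = {"endpoint": None, "method": "GET", "save_format": "jsonl", "compression": None}

# Each entry overrides only the keys it names.
specs = [
    {"endpoint": "/api/v1/categories"},
    {"endpoint": "/api/v1/export", "method": "POST", "save_format": "json"},
]

resolved = [merge_request(base, spec) for spec in specs]
```

Building a fresh dict per request leaves the base untouched, which matters when requests run concurrently.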
 ### SQLToFilesystem

{airflow_toolkit-2.3.0 → airflow_toolkit-2.4.0}/src/airflow_toolkit.egg-info/SOURCES.txt RENAMED

@@ -48,5 +48,6 @@ src/airflow_toolkit/providers/deltalake/sensors/filesystem_file.py
 src/airflow_toolkit/providers/filesystem/__init__.py
 src/airflow_toolkit/providers/filesystem/tasks.py
 src/airflow_toolkit/providers/filesystem/operators/__init__.py
+src/airflow_toolkit/providers/filesystem/operators/auth.py
 src/airflow_toolkit/providers/filesystem/operators/filesystem.py
 src/airflow_toolkit/providers/filesystem/operators/http_to_filesystem.py