
airflow-toolkit 2.2.0tar.gz → 2.3.0tar.gz


Files changed (54)

{airflow_toolkit-2.2.0/src/airflow_toolkit.egg-info → airflow_toolkit-2.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: airflow-toolkit
-Version: 2.2.0
+Version: 2.3.0
 Summary: A toolkit of operators, hooks and utilities for Apache Airflow 3
 Author-email: Biel Llobera <biel_llobera@dkl.digital>
 Requires-Python: <3.15,>=3.11
@@ -32,6 +32,10 @@ Provides-Extra: duckdb
 Requires-Dist: airflow-provider-duckdb>=0.1.2; extra == "duckdb"
 Provides-Extra: sqlite
 Requires-Dist: apache-airflow-providers-sqlite; extra == "sqlite"
+Provides-Extra: excel
+Requires-Dist: openpyxl>=3.1; extra == "excel"
+Provides-Extra: avro
+Requires-Dist: fastavro>=1.9; extra == "avro"
 Provides-Extra: airflow3-full
 Requires-Dist: apache-airflow<4,>=3; extra == "airflow3-full"
 Requires-Dist: apache-airflow-providers-fab>=3.0.0; extra == "airflow3-full"
@@ -49,6 +53,8 @@ Requires-Dist: requests>=2.31.0; extra == "airflow3-full"
 Requires-Dist: jmespath<2,>=1.0.1; extra == "airflow3-full"
 Requires-Dist: airflow-provider-duckdb>=0.1.2; extra == "airflow3-full"
 Requires-Dist: apache-airflow-providers-sqlite; extra == "airflow3-full"
+Requires-Dist: openpyxl>=3.1; extra == "airflow3-full"
+Requires-Dist: fastavro>=1.9; extra == "airflow3-full"
 Dynamic: license-file
 # Airflow Toolkit
@@ -136,10 +142,11 @@ pip install "airflow-toolkit[airflow3-full]"
 | `google` | `providers-google` | GCS filesystem backend |
 | `azure` | `providers-microsoft-azure` | Azure Blob / ADLS filesystem backend |
 | `sftp` | `providers-sftp` | SFTP filesystem backend |
-| `slack` | `providers-slack` | Slack failure notifications |
 | `http` | `providers-http`, `requests`, `jmespath`, `pandas` | `HttpToFilesystem`, `MultiHttpToFilesystem` |
 | `duckdb` | `airflow-provider-duckdb` | `DuckdbToDeltalake` operator |
 | `sqlite` | `providers-sqlite` | SQLite as source or destination |
+| `excel` | `openpyxl` | Excel (`.xlsx` / `.xls`) support in `FilesystemToDatabase` |
+| `avro` | `fastavro` | Avro support in `FilesystemToDatabase` |
 | `airflow3-full` | all of the above | Quick start / development |
 ---
@@ -313,7 +320,9 @@ FilesystemToFilesystem(
 ### FilesystemToDatabase
-Reads files (CSV, JSON, or Parquet) from any filesystem and loads them into any SQLAlchemy-compatible database. Handles schema drift automatically: columns present in the file but missing from the table are added; columns present in the table but missing from the file are filled with `NULL`.
+Reads files from any filesystem and loads them into any SQLAlchemy-compatible database. Handles schema drift automatically: columns present in the file but missing from the table are added; columns present in the table but missing from the file are filled with `NULL`.
+**Supported formats:** `csv`, `json`, `parquet`, `excel`, `avro`, `fixed_width`.
 ```python
 from airflow_toolkit.providers.deltalake.operators.filesystem_to_database import FilesystemToDatabaseOperator
@@ -325,7 +334,7 @@ FilesystemToDatabaseOperator(
     filesystem_path='raw/orders/{{ ds }}/',
     db_schema='public',
     db_table='orders',
-    source_format='csv',
+    source_format='csv',                   # 'csv' | 'json' | 'parquet' | 'excel' | 'avro' | 'fixed_width'
     table_aggregation_type='append',       # 'append' | 'replace' | 'fail'
     metadata={
         '_ds':          '{{ ds }}',
@@ -335,6 +344,52 @@ FilesystemToDatabaseOperator(
 )
 ```
+**Excel** (requires the `[excel]` extra):
+```python
+FilesystemToDatabaseOperator(
+    task_id='load_excel_report',
+    filesystem_conn_id='my_data_lake',
+    database_conn_id='my_postgres',
+    filesystem_path='raw/reports/{{ ds }}/',
+    db_table='monthly_report',
+    source_format='excel',
+    source_format_options={'sheet_name': 'Data'},
+)
+```
+**Avro** (requires the `[avro]` extra):
+```python
+FilesystemToDatabaseOperator(
+    task_id='load_avro_events',
+    filesystem_conn_id='my_data_lake',
+    database_conn_id='my_postgres',
+    filesystem_path='raw/events/{{ ds }}/',
+    db_table='events',
+    source_format='avro',
+)
+```
+**Fixed-width** (no extra required — pandas native):
+```python
+FilesystemToDatabaseOperator(
+    task_id='load_fixed_width',
+    filesystem_conn_id='my_data_lake',
+    database_conn_id='my_postgres',
+    filesystem_path='raw/exports/{{ ds }}/',
+    db_table='transactions',
+    source_format='fixed_width',
+    source_format_options={
+        'colspecs': [(0, 10), (10, 25), (25, 35)],
+        'names': ['date', 'description', 'amount'],
+    },
+)
+```
+Each format is matched by file extension: `.csv`/`.csv.gz`, `.json`/`.json.gz`, `.parquet`/`.parquet.gz`, `.xlsx`/`.xls`, `.avro`, `.fwf`/`.txt`/`.dat`. Files with other extensions in the same prefix are silently skipped.
 ### DuckdbToDeltalake
 Executes a DuckDB SQL query and writes the result directly to a Delta Lake table on Azure storage. Useful for in-process transformations that land results as an open table format.
@@ -530,6 +585,50 @@ Each environment maps to a distinct colour across all channels so alerts are rec
 ---
+## Testing Utilities
+### MockFilesystem
+`MockFilesystem` is an in-memory implementation of `FilesystemProtocol` for unit testing. It requires no Docker, no cloud credentials, and no network — all files are stored in a plain Python dict.
+```python
+from airflow_toolkit.testing import MockFilesystem
+# Pre-load files at construction time
+fs = MockFilesystem({
+    "raw/orders/2024-01-01/data.csv": b"id,amount\n1,100\n2,200",
+})
+# Or write files programmatically
+fs.write(b"id,amount\n3,300", "raw/orders/2024-01-02/data.csv")
+# Inspect the result in assertions
+assert fs.check_file("raw/orders/2024-01-01/data.csv")
+assert len(fs.list_files("raw/orders/")) == 2
+assert fs.files["raw/orders/2024-01-01/data.csv"] == b"id,amount\n1,100\n2,200"
+```
+Use it to patch `FilesystemFactory.get_data_lake_filesystem` in your operator tests:
+```python
+from unittest.mock import patch
+from airflow_toolkit.testing import MockFilesystem
+def test_my_pipeline(tmp_path):
+    fs = MockFilesystem({"data/file.csv": b"id,name\n1,Alice"})
+    with patch(
+        "airflow_toolkit.filesystems.filesystem_factory.FilesystemFactory.get_data_lake_filesystem",
+        return_value=fs,
+    ):
+        # run your operator or task here
+        ...
+```
+`MockFilesystem` implements the full `FilesystemProtocol`: `read`, `write`, `delete_file`, `create_prefix`, `delete_prefix`, `check_file`, `check_prefix`, `list_files`.
+---
 ## Running Tests
 ### Integration tests

{airflow_toolkit-2.2.0 → airflow_toolkit-2.3.0}/README.md RENAMED

@@ -83,10 +83,11 @@ pip install "airflow-toolkit[airflow3-full]"
 | `google` | `providers-google` | GCS filesystem backend |
 | `azure` | `providers-microsoft-azure` | Azure Blob / ADLS filesystem backend |
 | `sftp` | `providers-sftp` | SFTP filesystem backend |
-| `slack` | `providers-slack` | Slack failure notifications |
 | `http` | `providers-http`, `requests`, `jmespath`, `pandas` | `HttpToFilesystem`, `MultiHttpToFilesystem` |
 | `duckdb` | `airflow-provider-duckdb` | `DuckdbToDeltalake` operator |
 | `sqlite` | `providers-sqlite` | SQLite as source or destination |
+| `excel` | `openpyxl` | Excel (`.xlsx` / `.xls`) support in `FilesystemToDatabase` |
+| `avro` | `fastavro` | Avro support in `FilesystemToDatabase` |
 | `airflow3-full` | all of the above | Quick start / development |
 ---
@@ -260,7 +261,9 @@ FilesystemToFilesystem(
 ### FilesystemToDatabase
-Reads files (CSV, JSON, or Parquet) from any filesystem and loads them into any SQLAlchemy-compatible database. Handles schema drift automatically: columns present in the file but missing from the table are added; columns present in the table but missing from the file are filled with `NULL`.
+Reads files from any filesystem and loads them into any SQLAlchemy-compatible database. Handles schema drift automatically: columns present in the file but missing from the table are added; columns present in the table but missing from the file are filled with `NULL`.
+**Supported formats:** `csv`, `json`, `parquet`, `excel`, `avro`, `fixed_width`.
 ```python
 from airflow_toolkit.providers.deltalake.operators.filesystem_to_database import FilesystemToDatabaseOperator
@@ -272,7 +275,7 @@ FilesystemToDatabaseOperator(
     filesystem_path='raw/orders/{{ ds }}/',
     db_schema='public',
     db_table='orders',
-    source_format='csv',
+    source_format='csv',                   # 'csv' | 'json' | 'parquet' | 'excel' | 'avro' | 'fixed_width'
     table_aggregation_type='append',       # 'append' | 'replace' | 'fail'
     metadata={
         '_ds':          '{{ ds }}',
@@ -282,6 +285,52 @@ FilesystemToDatabaseOperator(
 )
 ```
+**Excel** (requires the `[excel]` extra):
+```python
+FilesystemToDatabaseOperator(
+    task_id='load_excel_report',
+    filesystem_conn_id='my_data_lake',
+    database_conn_id='my_postgres',
+    filesystem_path='raw/reports/{{ ds }}/',
+    db_table='monthly_report',
+    source_format='excel',
+    source_format_options={'sheet_name': 'Data'},
+)
+```
+**Avro** (requires the `[avro]` extra):
+```python
+FilesystemToDatabaseOperator(
+    task_id='load_avro_events',
+    filesystem_conn_id='my_data_lake',
+    database_conn_id='my_postgres',
+    filesystem_path='raw/events/{{ ds }}/',
+    db_table='events',
+    source_format='avro',
+)
+```
+**Fixed-width** (no extra required — pandas native):
+```python
+FilesystemToDatabaseOperator(
+    task_id='load_fixed_width',
+    filesystem_conn_id='my_data_lake',
+    database_conn_id='my_postgres',
+    filesystem_path='raw/exports/{{ ds }}/',
+    db_table='transactions',
+    source_format='fixed_width',
+    source_format_options={
+        'colspecs': [(0, 10), (10, 25), (25, 35)],
+        'names': ['date', 'description', 'amount'],
+    },
+)
+```
+Each format is matched by file extension: `.csv`/`.csv.gz`, `.json`/`.json.gz`, `.parquet`/`.parquet.gz`, `.xlsx`/`.xls`, `.avro`, `.fwf`/`.txt`/`.dat`. Files with other extensions in the same prefix are silently skipped.
 ### DuckdbToDeltalake
 Executes a DuckDB SQL query and writes the result directly to a Delta Lake table on Azure storage. Useful for in-process transformations that land results as an open table format.
@@ -477,6 +526,50 @@ Each environment maps to a distinct colour across all channels so alerts are rec
 ---
+## Testing Utilities
+### MockFilesystem
+`MockFilesystem` is an in-memory implementation of `FilesystemProtocol` for unit testing. It requires no Docker, no cloud credentials, and no network — all files are stored in a plain Python dict.
+```python
+from airflow_toolkit.testing import MockFilesystem
+# Pre-load files at construction time
+fs = MockFilesystem({
+    "raw/orders/2024-01-01/data.csv": b"id,amount\n1,100\n2,200",
+})
+# Or write files programmatically
+fs.write(b"id,amount\n3,300", "raw/orders/2024-01-02/data.csv")
+# Inspect the result in assertions
+assert fs.check_file("raw/orders/2024-01-01/data.csv")
+assert len(fs.list_files("raw/orders/")) == 2
+assert fs.files["raw/orders/2024-01-01/data.csv"] == b"id,amount\n1,100\n2,200"
+```
+Use it to patch `FilesystemFactory.get_data_lake_filesystem` in your operator tests:
+```python
+from unittest.mock import patch
+from airflow_toolkit.testing import MockFilesystem
+def test_my_pipeline(tmp_path):
+    fs = MockFilesystem({"data/file.csv": b"id,name\n1,Alice"})
+    with patch(
+        "airflow_toolkit.filesystems.filesystem_factory.FilesystemFactory.get_data_lake_filesystem",
+        return_value=fs,
+    ):
+        # run your operator or task here
+        ...
+```
+`MockFilesystem` implements the full `FilesystemProtocol`: `read`, `write`, `delete_file`, `create_prefix`, `delete_prefix`, `check_file`, `check_prefix`, `list_files`.
+---
 ## Running Tests
 ### Integration tests

{airflow_toolkit-2.2.0 → airflow_toolkit-2.3.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "airflow-toolkit"
-version = "2.2.0"
+version = "2.3.0"
 description = "A toolkit of operators, hooks and utilities for Apache Airflow 3"
 authors = [{ name = "Biel Llobera", email = "biel_llobera@dkl.digital" }]
 requires-python = ">=3.11,<3.15"
@@ -49,6 +49,12 @@ duckdb = [
 sqlite = [
     "apache-airflow-providers-sqlite",
 ]
+excel = [
+    "openpyxl>=3.1",
+]
+avro = [
+    "fastavro>=1.9",
+]
 airflow3-full = [
     "apache-airflow>=3,<4",
     "apache-airflow-providers-fab>=3.0.0",
@@ -66,6 +72,8 @@ airflow3-full = [
     "jmespath>=1.0.1,<2",
     "airflow-provider-duckdb>=0.1.2",
     "apache-airflow-providers-sqlite",
+    "openpyxl>=3.1",
+    "fastavro>=1.9",
 ]
 [dependency-groups]
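
Because `openpyxl` and `fastavro` ship only with the new `excel` and `avro` extras, a task that selects one of those formats without the extra installed will fail at import time. The following is a minimal sketch of a guard that turns that failure into an actionable message. `require_extra` is a hypothetical helper for illustration and is not part of the package; the diff itself only shows a plain `import fastavro` inside the `avro` branches.

```python
import importlib
import importlib.util
from types import ModuleType


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Import a top-level module, or raise ImportError naming the extra that provides it.

    Hypothetical helper: the package in this diff does not define it.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed; "
            f'try: pip install "airflow-toolkit[{extra}]"'
        )
    return importlib.import_module(module_name)


# Usage with a module that is always available, so the sketch runs anywhere.
json_module = require_extra("json", "avro")
```

Checking availability up front keeps the error close to the configuration mistake instead of surfacing it midway through a load.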

{airflow_toolkit-2.2.0 → airflow_toolkit-2.3.0}/src/airflow_toolkit/compression_utils.py RENAMED

@@ -1,10 +1,10 @@
 import gzip
 import zipfile
 from io import BytesIO
-from typing import Literal, Union
+from airflow_toolkit.types import CompressionOptions
 DEFAULT_ZIP_FILENAME = "file.zip"
-CompressionOptions = Union[Literal["infer", "gzip", "bz2", "zip", "xz", "zstd"], None]
 def gzip_data(data: bytes) -> bytes:

{airflow_toolkit-2.2.0 → airflow_toolkit-2.3.0}/src/airflow_toolkit/providers/deltalake/operators/filesystem_to_database.py RENAMED

@@ -15,6 +15,8 @@ import urllib.parse
 from collections.abc import Iterator, Mapping
 from typing import Any, Literal
+from airflow_toolkit.types import MetadataSpec
 import pandas as pd
 from sqlalchemy import (
     Boolean,
@@ -40,6 +42,29 @@ logger = logging.getLogger(__name__)
 _BATCH_SIZE = 50_000
+# Maps source_format names to the file extensions they match.
+# Used in execute() to skip blobs that don't belong to the selected format.
+_FORMAT_EXTENSIONS: dict[str, tuple[str, ...]] = {
+    "csv": (".csv", ".csv.gz"),
+    "json": (".json", ".json.gz"),
+    "parquet": (".parquet", ".parquet.gz"),
+    "excel": (".xlsx", ".xls"),
+    "avro": (".avro",),
+    "fixed_width": (".fwf", ".txt", ".dat"),
+}
+# Canonical extension for the temp file created in execute().
+# Must match what the underlying reader expects (e.g. pandas read_excel
+# infers the engine from the file extension).
+_FORMAT_TEMP_SUFFIX: dict[str, str] = {
+    "csv": ".csv",
+    "json": ".json",
+    "parquet": ".parquet",
+    "excel": ".xlsx",
+    "avro": ".avro",
+    "fixed_width": ".fwf",
+}
 type_mapping: dict[str, type[Any]] = {
     "int64": Integer,
     "int": Integer,
@@ -76,11 +101,13 @@ class FilesystemToDatabaseOperator(BaseOperator):
         filesystem_path: str,
         db_table: str,
         db_schema: str | None = None,
-        source_format: Literal["csv", "json", "parquet"] = "csv",
+        source_format: Literal[
+            "csv", "json", "parquet", "excel", "avro", "fixed_width"
+        ] = "csv",
         source_format_options: Mapping[str, Any] | None = None,
         batch_size: int = _BATCH_SIZE,
         table_aggregation_type: Literal["append", "fail", "replace"] = "append",
-        metadata: Mapping[str, str] | None = None,
+        metadata: MetadataSpec | None = None,
         metadata_columns_in_uppercase: bool = True,
         include_source_path: bool = True,
         normalize_unicode: bool = False,
@@ -157,11 +184,12 @@ class FilesystemToDatabaseOperator(BaseOperator):
         if self.idempotent:
             self._delete_existing_run_data(engine)
+        valid_extensions = _FORMAT_EXTENSIONS.get(
+            self.source_format, (f".{self.source_format}",)
+        )
         first_batch = True
         for blob_path in filesystem.list_files(prefix=self.filesystem_path):
-            if not blob_path.endswith(
-                (f".{self.source_format}", f".{self.source_format}.gz")
-            ):
+            if not blob_path.endswith(valid_extensions):
                 logger.warning(
                     f"Blob {blob_path} is not in the right format. Skipping..."
                 )
@@ -175,7 +203,10 @@ class FilesystemToDatabaseOperator(BaseOperator):
                 f"Downloaded {file_mb:.1f} MB in {time.monotonic() - dl_start:.1f}s"
             )
-            tmp_fd, tmp_path = tempfile.mkstemp(suffix=f".{self.source_format}")
+            tmp_suffix = _FORMAT_TEMP_SUFFIX.get(
+                self.source_format, f".{self.source_format}"
+            )
+            tmp_fd, tmp_path = tempfile.mkstemp(suffix=tmp_suffix)
             try:
                 with os.fdopen(tmp_fd, "wb") as tmp_file:
                     tmp_file.write(raw_bytes)
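
The two lookups introduced in the hunks above can be exercised on their own. The sketch below copies the mappings from this diff into module-level constants and wraps the lookups in two hypothetical helpers, `matches_format` and `temp_suffix`, which do not exist in the package; they only mirror the `.get(...)` calls shown in `execute()`.

```python
# Mappings copied from the diff; the helper functions below are illustrative only.
FORMAT_EXTENSIONS: dict[str, tuple[str, ...]] = {
    "csv": (".csv", ".csv.gz"),
    "json": (".json", ".json.gz"),
    "parquet": (".parquet", ".parquet.gz"),
    "excel": (".xlsx", ".xls"),
    "avro": (".avro",),
    "fixed_width": (".fwf", ".txt", ".dat"),
}
FORMAT_TEMP_SUFFIX: dict[str, str] = {
    "csv": ".csv",
    "json": ".json",
    "parquet": ".parquet",
    "excel": ".xlsx",
    "avro": ".avro",
    "fixed_width": ".fwf",
}


def matches_format(blob_path: str, source_format: str) -> bool:
    """Return True when the blob's extension belongs to the selected format."""
    valid = FORMAT_EXTENSIONS.get(source_format, (f".{source_format}",))
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return blob_path.endswith(valid)


def temp_suffix(source_format: str) -> str:
    """Return the suffix used for the temporary file of the selected format."""
    return FORMAT_TEMP_SUFFIX.get(source_format, f".{source_format}")
```

Two consequences follow directly from the mappings: a `.txt` blob is accepted for `fixed_width` but skipped for `csv`, and a `.xls` blob accepted for `excel` is written to a temporary file ending in `.xlsx`. Whether the reader then handles legacy `.xls` content correctly is not shown in this diff.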
@@ -310,6 +341,19 @@ class FilesystemToDatabaseOperator(BaseOperator):
                 if peek_opts.get("lines"):
                     return set(pd.read_json(path, nrows=1, **peek_opts).columns)
                 return set(pd.read_json(path, **peek_opts).columns)
+            case "excel":
+                peek_opts = {k: v for k, v in options.items() if k != "sheet_name"}
+                return set(pd.read_excel(path, nrows=0, **peek_opts).columns)
+            case "avro":
+                import fastavro
+                with open(path, "rb") as f:
+                    reader = fastavro.reader(f)
+                    schema = reader.writer_schema
+                    return {field["name"] for field in schema["fields"]}
+            case "fixed_width":
+                peek_opts = {k: v for k, v in options.items() if k != "chunksize"}
+                return set(pd.read_fwf(path, nrows=0, **peek_opts).columns)
             case _:
                 return set()
@@ -398,6 +442,25 @@ class FilesystemToDatabaseOperator(BaseOperator):
                     yield from pd.read_json(path, chunksize=self.batch_size, **options)
                 else:
                     yield pd.read_json(path, **options)
+            case "excel":
+                df = pd.read_excel(path, **options)
+                for start in range(0, max(len(df), 1), self.batch_size):
+                    yield df.iloc[start : start + self.batch_size].copy()
+            case "avro":
+                import fastavro
+                with open(path, "rb") as f:
+                    reader = fastavro.reader(f)
+                    batch: list[dict[str, Any]] = []
+                    for record in reader:
+                        batch.append(record)
+                        if len(batch) >= self.batch_size:
+                            yield pd.DataFrame(batch)
+                            batch = []
+                    if batch:
+                        yield pd.DataFrame(batch)
+            case "fixed_width":
+                yield from pd.read_fwf(path, chunksize=self.batch_size, **options)
             case _:
                 raise ValueError(f"Unknown source format {self.source_format}")
@@ -544,5 +607,26 @@ class FilesystemToDatabaseOperator(BaseOperator):
                 return pd.read_json(path_or_buf, **options)
             case "parquet":
                 return pd.read_parquet(path_or_buf, **options)
+            case "excel":
+                return pd.read_excel(path_or_buf, **options)
+            case "avro":
+                import fastavro
+                if isinstance(path_or_buf, (str, bytes)):
+                    buf = io.BytesIO(
+                        path_or_buf
+                        if isinstance(path_or_buf, bytes)
+                        else path_or_buf.encode()
+                    )
+                else:
+                    buf = (
+                        path_or_buf
+                        if isinstance(path_or_buf, io.BytesIO)
+                        else io.BytesIO(path_or_buf.read().encode())
+                    )
+                records = list(fastavro.reader(buf))
+                return pd.DataFrame(records)
+            case "fixed_width":
+                return pd.read_fwf(path_or_buf, **options)
             case _:
                 raise ValueError(f"Unknown source format {self.source_format}")
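
The batching added for Avro and Excel follows two slightly different patterns, which the sketch below isolates using only the standard library. `batched_records` mirrors the accumulate-and-flush loop of the `avro` branch (yielding plain lists rather than DataFrames), and `slice_starts` mirrors the `range(0, max(len(df), 1), batch_size)` expression of the `excel` branch. Both function names are hypothetical and chosen for illustration.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def batched_records(
    records: Iterable[dict[str, Any]], batch_size: int
) -> Iterator[list[dict[str, Any]]]:
    """Group a record stream into lists of at most batch_size items."""
    batch: list[dict[str, Any]] = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    # Flush the final, possibly short, batch; nothing is yielded for empty input.
    if batch:
        yield batch


def slice_starts(n_rows: int, batch_size: int) -> list[int]:
    """Start offsets for slicing n_rows into batches, with one offset even when n_rows is 0."""
    return list(range(0, max(n_rows, 1), batch_size))
```

Under these patterns an empty Avro file produces no batches at all, while an empty Excel sheet still produces one empty batch, because `max(len(df), 1)` guarantees a single iteration. The Excel branch also reads the whole sheet into memory before slicing, so `batch_size` there bounds the insert size rather than peak memory.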

{airflow_toolkit-2.2.0 → airflow_toolkit-2.3.0}/src/airflow_toolkit/providers/filesystem/operators/http_to_filesystem.py RENAMED

@@ -9,13 +9,9 @@ from typing import (
     Any,
     Callable,
     Generator,
-    Literal,
     Optional,
-    Type,
 )
-from typing import TypedDict
 import jmespath
 import pandas as pd
@@ -25,16 +21,20 @@ from airflow.utils.helpers import merge_dicts
 from requests import Response
 from airflow_toolkit._compact.airflow_shim import BaseOperator, Context, BaseHook
-from airflow_toolkit.compression_utils import CompressionOptions, compress
+from airflow_toolkit.compression_utils import compress
 from airflow_toolkit.exceptions import ApiResponseTypeError
 from airflow_toolkit.filesystems.filesystem_factory import FilesystemFactory
 from airflow_toolkit.protocols import HttpTransformation
+from airflow_toolkit.types import (
+    CompressionOptions,
+    RequestSpec,
+    RequestState,
+    SaveFormat,
+)
 if TYPE_CHECKING:
     from requests.auth import AuthBase
-SaveFormat = Literal["jsonl"]
 class HttpBatchOperator(HttpOperator):
     def execute(
@@ -318,34 +318,6 @@ class HttpToFilesystem(BaseOperator):
         raise TypeError(f"Unsupported transformation output type: {type(value)!r}")
-class RequestSpec(TypedDict, total=False):
-    """User-provided per-request overrides (all keys optional)."""
-    endpoint: str
-    method: str
-    data: Any
-    headers: dict[str, str]
-    auth_type: Type["AuthBase"] | None
-    jmespath_expression: str | None
-    save_format: "SaveFormat"
-    source_format: "SaveFormat"
-    compression: "CompressionOptions" | None
-class RequestState(TypedDict):
-    """Fully-resolved runtime state (all keys present)."""
-    endpoint: str | None
-    method: str
-    data: Any
-    headers: dict[str, str] | None
-    auth_type: Type["AuthBase"] | None
-    jmespath_expression: str | None
-    save_format: "SaveFormat"
-    source_format: "SaveFormat"
-    compression: "CompressionOptions" | None
 class MultiHttpToFilesystem(HttpToFilesystem):
     """
     Execute multiple HTTP requests in a single task and save each response as a separate file.

airflow_toolkit-2.3.0/src/airflow_toolkit/testing.py ADDED

@@ -0,0 +1,59 @@
+"""Testing utilities for airflow-toolkit.
+Import from here in your unit tests — no Docker, no cloud credentials needed.
+    from airflow_toolkit.testing import MockFilesystem
+    fs = MockFilesystem({"data/2024-01-01.csv": b"id,name\\n1,Alice"})
+    fs.write(b"id,name\\n2,Bob", "data/2024-01-02.csv")
+    assert fs.check_file("data/2024-01-01.csv")
+"""
+from __future__ import annotations
+from io import BytesIO
+class MockFilesystem:
+    """In-memory implementation of FilesystemProtocol for unit testing.
+    Stores all files in a plain dict — no network, no Docker, no credentials.
+    Inspect ``fs.files`` directly in assertions.
+    Args:
+        files: Optional seed data mapping path → bytes.
+    """
+    def __init__(self, files: dict[str, bytes] | None = None) -> None:
+        self.files: dict[str, bytes] = dict(files or {})
+    def read(self, path: str) -> bytes:
+        if path not in self.files:
+            raise FileNotFoundError(f"MockFilesystem: no file at '{path}'")
+        return self.files[path]
+    def write(self, data: str | bytes | BytesIO, path: str) -> None:
+        if isinstance(data, str):
+            data = data.encode()
+        elif isinstance(data, BytesIO):
+            data = data.getvalue()
+        self.files[path] = data
+    def delete_file(self, path: str) -> None:
+        self.files.pop(path, None)
+    def create_prefix(self, prefix: str) -> None:
+        pass
+    def delete_prefix(self, prefix: str) -> None:
+        for key in [k for k in self.files if k.startswith(prefix)]:
+            del self.files[key]
+    def check_file(self, path: str) -> bool:
+        return path in self.files
+    def check_prefix(self, prefix: str) -> bool:
+        return any(k.startswith(prefix) for k in self.files)
+    def list_files(self, prefix: str) -> list[str]:
+        return [k for k in self.files if k.startswith(prefix)]
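
`check_prefix`, `delete_prefix` and `list_files` all rely on a plain `str.startswith` test, so a prefix without a trailing slash also matches sibling paths that merely share the same leading characters. The sketch below reproduces only that comparison on a plain dict; the standalone `list_files` function is a stand-in written for illustration, not an import from the package.

```python
def list_files(files: dict[str, bytes], prefix: str) -> list[str]:
    """Stand-in for MockFilesystem.list_files: keys that start with prefix, in insertion order."""
    return [key for key in files if key.startswith(prefix)]


files = {
    "raw/orders/2024-01-01/data.csv": b"id,amount\n1,100",
    "raw/orders_archive/2023.csv": b"id,amount\n9,900",
}

# Without the trailing slash, the archive path matches as well.
loose = list_files(files, "raw/orders")
# With the trailing slash, only the intended directory matches.
strict = list_files(files, "raw/orders/")
```

When seeding a mock with neighbouring prefixes, ending every prefix with `/` keeps assertions such as `len(fs.list_files(...))` from counting unrelated files.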

airflow_toolkit-2.3.0/src/airflow_toolkit/types.py ADDED

@@ -0,0 +1,51 @@
+from __future__ import annotations
+from typing import TYPE_CHECKING, Any, Literal, Type, TypedDict
+if TYPE_CHECKING:
+    from requests.auth import AuthBase
+# ── Compression ────────────────────────────────────────────────────────────
+CompressionOptions = Literal["infer", "gzip", "bz2", "zip", "xz", "zstd"] | None
+# ── Filesystem / format ────────────────────────────────────────────────────
+SaveFormat = Literal["jsonl"]
+# ── Metadata columns ───────────────────────────────────────────────────────
+# Passed to FilesystemToDatabaseOperator as extra columns added to every row.
+# Key = column name; value = Airflow template string (e.g. "{{ ds }}").
+# Keys prefixed with "_" are coerced to datetime at load time.
+MetadataSpec = dict[str, str]
+# ── HTTP multi-request ─────────────────────────────────────────────────────
+class RequestSpec(TypedDict, total=False):
+    """User-provided per-request overrides for MultiHttpToFilesystem (all keys optional)."""
+    endpoint: str
+    method: str
+    data: Any
+    headers: dict[str, str]
+    auth_type: Type["AuthBase"] | None
+    jmespath_expression: str | None
+    save_format: SaveFormat
+    source_format: SaveFormat
+    compression: CompressionOptions
+class RequestState(TypedDict):
+    """Fully-resolved runtime state for a single HTTP request (all keys required)."""
+    endpoint: str | None
+    method: str
+    data: Any
+    headers: dict[str, str] | None
+    auth_type: Type["AuthBase"] | None
+    jmespath_expression: str | None
+    save_format: SaveFormat
+    source_format: SaveFormat
+    compression: CompressionOptions

{airflow_toolkit-2.2.0 → airflow_toolkit-2.3.0/src/airflow_toolkit.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: airflow-toolkit
-Version: 2.2.0
+Version: 2.3.0
 Summary: A toolkit of operators, hooks and utilities for Apache Airflow 3
 Author-email: Biel Llobera <biel_llobera@dkl.digital>
 Requires-Python: <3.15,>=3.11
@@ -32,6 +32,10 @@ Provides-Extra: duckdb
 Requires-Dist: airflow-provider-duckdb>=0.1.2; extra == "duckdb"
 Provides-Extra: sqlite
 Requires-Dist: apache-airflow-providers-sqlite; extra == "sqlite"
+Provides-Extra: excel
+Requires-Dist: openpyxl>=3.1; extra == "excel"
+Provides-Extra: avro
+Requires-Dist: fastavro>=1.9; extra == "avro"
 Provides-Extra: airflow3-full
 Requires-Dist: apache-airflow<4,>=3; extra == "airflow3-full"
 Requires-Dist: apache-airflow-providers-fab>=3.0.0; extra == "airflow3-full"
@@ -49,6 +53,8 @@ Requires-Dist: requests>=2.31.0; extra == "airflow3-full"
 Requires-Dist: jmespath<2,>=1.0.1; extra == "airflow3-full"
 Requires-Dist: airflow-provider-duckdb>=0.1.2; extra == "airflow3-full"
 Requires-Dist: apache-airflow-providers-sqlite; extra == "airflow3-full"
+Requires-Dist: openpyxl>=3.1; extra == "airflow3-full"
+Requires-Dist: fastavro>=1.9; extra == "airflow3-full"
 Dynamic: license-file
 # Airflow Toolkit
@@ -136,10 +142,11 @@ pip install "airflow-toolkit[airflow3-full]"
 | `google` | `providers-google` | GCS filesystem backend |
 | `azure` | `providers-microsoft-azure` | Azure Blob / ADLS filesystem backend |
 | `sftp` | `providers-sftp` | SFTP filesystem backend |
-| `slack` | `providers-slack` | Slack failure notifications |
 | `http` | `providers-http`, `requests`, `jmespath`, `pandas` | `HttpToFilesystem`, `MultiHttpToFilesystem` |
 | `duckdb` | `airflow-provider-duckdb` | `DuckdbToDeltalake` operator |
 | `sqlite` | `providers-sqlite` | SQLite as source or destination |
+| `excel` | `openpyxl` | Excel (`.xlsx` / `.xls`) support in `FilesystemToDatabase` |
+| `avro` | `fastavro` | Avro support in `FilesystemToDatabase` |
 | `airflow3-full` | all of the above | Quick start / development |
 ---
@@ -313,7 +320,9 @@ FilesystemToFilesystem(
 ### FilesystemToDatabase
-Reads files (CSV, JSON, or Parquet) from any filesystem and loads them into any SQLAlchemy-compatible database. Handles schema drift automatically: columns present in the file but missing from the table are added; columns present in the table but missing from the file are filled with `NULL`.
+Reads files from any filesystem and loads them into any SQLAlchemy-compatible database. Handles schema drift automatically: columns present in the file but missing from the table are added; columns present in the table but missing from the file are filled with `NULL`.
+**Supported formats:** `csv`, `json`, `parquet`, `excel`, `avro`, `fixed_width`.
 ```python
 from airflow_toolkit.providers.deltalake.operators.filesystem_to_database import FilesystemToDatabaseOperator
@@ -325,7 +334,7 @@ FilesystemToDatabaseOperator(
     filesystem_path='raw/orders/{{ ds }}/',
     db_schema='public',
     db_table='orders',
-    source_format='csv',
+    source_format='csv',                   # 'csv' | 'json' | 'parquet' | 'excel' | 'avro' | 'fixed_width'
     table_aggregation_type='append',       # 'append' | 'replace' | 'fail'
     metadata={
         '_ds':          '{{ ds }}',
@@ -335,6 +344,52 @@ FilesystemToDatabaseOperator(
 )
 ```
+**Excel** (requires the `[excel]` extra):
+```python
+FilesystemToDatabaseOperator(
+    task_id='load_excel_report',
+    filesystem_conn_id='my_data_lake',
+    database_conn_id='my_postgres',
+    filesystem_path='raw/reports/{{ ds }}/',
+    db_table='monthly_report',
+    source_format='excel',
+    source_format_options={'sheet_name': 'Data'},
+)
+```
+**Avro** (requires the `[avro]` extra):
+```python
+FilesystemToDatabaseOperator(
+    task_id='load_avro_events',
+    filesystem_conn_id='my_data_lake',
+    database_conn_id='my_postgres',
+    filesystem_path='raw/events/{{ ds }}/',
+    db_table='events',
+    source_format='avro',
+)
+```
+**Fixed-width** (no extra required — pandas native):
+```python
+FilesystemToDatabaseOperator(
+    task_id='load_fixed_width',
+    filesystem_conn_id='my_data_lake',
+    database_conn_id='my_postgres',
+    filesystem_path='raw/exports/{{ ds }}/',
+    db_table='transactions',
+    source_format='fixed_width',
+    source_format_options={
+        'colspecs': [(0, 10), (10, 25), (25, 35)],
+        'names': ['date', 'description', 'amount'],
+    },
+)
+```
+Each format is matched by file extension: `.csv`/`.csv.gz`, `.json`/`.json.gz`, `.parquet`/`.parquet.gz`, `.xlsx`/`.xls`, `.avro`, `.fwf`/`.txt`/`.dat`. Files with other extensions in the same prefix are silently skipped.
 ### DuckdbToDeltalake
 Executes a DuckDB SQL query and writes the result directly to a Delta Lake table on Azure storage. Useful for in-process transformations that land results as an open table format.
@@ -530,6 +585,50 @@ Each environment maps to a distinct colour across all channels so alerts are rec
 ---
+## Testing Utilities
+### MockFilesystem
+`MockFilesystem` is an in-memory implementation of `FilesystemProtocol` for unit testing. It requires no Docker, no cloud credentials, and no network — all files are stored in a plain Python dict.
+```python
+from airflow_toolkit.testing import MockFilesystem
+# Pre-load files at construction time
+fs = MockFilesystem({
+    "raw/orders/2024-01-01/data.csv": b"id,amount\n1,100\n2,200",
+})
+# Or write files programmatically
+fs.write(b"id,amount\n3,300", "raw/orders/2024-01-02/data.csv")
+# Inspect the result in assertions
+assert fs.check_file("raw/orders/2024-01-01/data.csv")
+assert len(fs.list_files("raw/orders/")) == 2
+assert fs.files["raw/orders/2024-01-01/data.csv"] == b"id,amount\n1,100\n2,200"
+```
+Use it to patch `FilesystemFactory.get_data_lake_filesystem` in your operator tests:
+```python
+from unittest.mock import patch
+from airflow_toolkit.testing import MockFilesystem
+def test_my_pipeline(tmp_path):
+    fs = MockFilesystem({"data/file.csv": b"id,name\n1,Alice"})
+    with patch(
+        "airflow_toolkit.filesystems.filesystem_factory.FilesystemFactory.get_data_lake_filesystem",
+        return_value=fs,
+    ):
+        # run your operator or task here
+        ...
+```
+`MockFilesystem` implements the full `FilesystemProtocol`: `read`, `write`, `delete_file`, `create_prefix`, `delete_prefix`, `check_file`, `check_prefix`, `list_files`.
+---
 ## Running Tests
 ### Integration tests

{airflow_toolkit-2.2.0 → airflow_toolkit-2.3.0}/src/airflow_toolkit.egg-info/SOURCES.txt RENAMED

@@ -6,6 +6,8 @@ src/airflow_toolkit/compression_utils.py
 src/airflow_toolkit/exceptions.py
 src/airflow_toolkit/protocols.py
 src/airflow_toolkit/py.typed
+src/airflow_toolkit/testing.py
+src/airflow_toolkit/types.py
 src/airflow_toolkit.egg-info/PKG-INFO
 src/airflow_toolkit.egg-info/SOURCES.txt
 src/airflow_toolkit.egg-info/dependency_links.txt

{airflow_toolkit-2.2.0 → airflow_toolkit-2.3.0}/src/airflow_toolkit.egg-info/requires.txt RENAMED

@@ -21,10 +21,15 @@ requests>=2.31.0
 jmespath<2,>=1.0.1
 airflow-provider-duckdb>=0.1.2
 apache-airflow-providers-sqlite
+openpyxl>=3.1
+fastavro>=1.9
 [amazon]
 apache-airflow-providers-amazon>=9.15.0
+[avro]
+fastavro>=1.9
 [azure]
 apache-airflow-providers-microsoft-azure>=8
@@ -37,6 +42,9 @@ pandas<3,>=2.1.1
 [duckdb]
 airflow-provider-duckdb>=0.1.2
+[excel]
+openpyxl>=3.1
 [google]
 apache-airflow-providers-google>=18