PyPI - pipefunc - Versions diffs - 0.92.0__tar.gz → 0.93.0__tar.gz - Mend

pipefunc 0.92.0.tar.gz → 0.93.0.tar.gz


Files changed (238)

{pipefunc-0.92.0 → pipefunc-0.93.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pipefunc
-Version: 0.92.0
+Version: 0.93.0
 Summary: A Python library for defining, managing, and executing function pipelines.
 Project-URL: homepage, https://pipefunc.readthedocs.io/
 Project-URL: documentation, https://pipefunc.readthedocs.io/

{pipefunc-0.92.0 → pipefunc-0.93.0}/docs/source/concepts/function-io.md RENAMED

@@ -530,3 +530,40 @@ print(result["result"].output)
 6. **`pipeline.map`:** We call `pipeline.map` as before, but now we only need to specify the `internal_shapes` of the lists, not the shape of the status. The `internal_shapes` argument is only needed when you return a list, and it cannot be inferred from the inputs.
 This pattern provides a clean and manageable way to work with functions that logically produce multiple outputs of varying shapes within the current capabilities of `pipefunc`.
+## Working with `polars` DataFrames (Parquet storage and `LazyFrame` inputs)
+`pipefunc` has first-class support for [polars](https://pola.rs/):
+1. **Parquet on disk**: when a function returns a `polars.DataFrame` and the results are stored on disk (e.g., `storage="file_array"` in `pipeline.map`), the output is serialized as a [Parquet](https://parquet.apache.org/) file instead of a pickle. Parquet files are compact, fast to read, and can be inspected with external tools like DuckDB. If Parquet serialization fails (e.g., for unsupported dtypes), `pipefunc` transparently falls back to `cloudpickle`.
+2. **Lazy inputs**: annotate a parameter as `polars.LazyFrame` to receive the upstream `polars.DataFrame` output lazily. When the upstream output is stored on disk as Parquet, the function receives `pl.scan_parquet(...)`, so the full DataFrame is never materialized in memory and polars can apply predicate and projection pushdown. Otherwise (e.g., in-memory storage or `pipeline.run`), the DataFrame is converted with `.lazy()`.
+Type validation understands this conversion: a function returning `pl.DataFrame` may feed into a parameter annotated as `pl.LazyFrame`.
+```{code-cell} ipython3
+import polars as pl
+from pipefunc import Pipeline, pipefunc
+@pipefunc(output_name="df")
+def make_df() -> pl.DataFrame:
+    return pl.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})
+@pipefunc(output_name="mean_y")
+def mean_y(df: pl.LazyFrame) -> float:  # annotate as LazyFrame to load lazily
+    return df.select(pl.col("y").mean()).collect().item()
+pipeline = Pipeline([make_df, mean_y])
+result = pipeline.map({}, run_folder="my_run_folder", parallel=False, show_progress=False)
+print(result["mean_y"].output)
+```
+The `df` output above is stored as a Parquet file in the run folder, and `mean_y` receives a `pl.LazyFrame` that scans it.
+```{note}
+Only top-level `polars.DataFrame` return values are stored as Parquet; DataFrames nested inside other objects (lists, dicts, dataclasses) are pickled as usual.
+The `pl.LazyFrame` conversion applies to parameters annotated *exactly* as `pl.LazyFrame`.
+```

{pipefunc-0.92.0 → pipefunc-0.93.0}/docs/source/faq.md RENAMED

@@ -234,6 +234,10 @@ This section has been moved to [Function Inputs and Outputs](./concepts/function
 This section has been moved to [Function Inputs and Outputs](./concepts/function-io.md#pipefuncs-with-multiple-outputs-of-different-shapes).
+## How does `pipefunc` work with `polars` DataFrames?
+See [Function Inputs and Outputs](./concepts/function-io.md#working-with-polars-dataframes-parquet-storage-and-lazyframe-inputs) for Parquet storage and lazy (`pl.LazyFrame`) inputs.
 ## Simplifying Pipelines
 This section has been moved to [Simplifying Pipelines](./concepts/simplifying-pipelines.md).

{pipefunc-0.92.0 → pipefunc-0.93.0}/pipefunc/_pipefunc.py RENAMED

@@ -37,6 +37,7 @@ from pipefunc._utils import (
     clear_cached_properties,
     format_function_call,
     is_classmethod,
+    is_lazyframe_annotation,
     is_pydantic_base_model,
     requires,
 )
@@ -876,6 +877,22 @@ class PipeFunc(Generic[P, R]):
         type_hints = safe_get_type_hints(func, include_extras=True)
         return {self.renames.get(k, k): v for k, v in type_hints.items() if k != "return"}
+    @functools.cached_property
+    def _lazyframe_parameters(self) -> tuple[str, ...]:
+        """Names of parameters annotated as `polars.LazyFrame`."""
+        return tuple(p for p, a in self.parameter_annotations.items() if is_lazyframe_annotation(a))
+    def _convert_lazyframe_kwargs(self, kwargs: dict[str, Any]) -> None:
+        """Convert `pl.DataFrame` values to `pl.LazyFrame` where the annotation asks for it."""
+        if not self._lazyframe_parameters:  # fast path, avoids per-element overhead
+            return
+        import polars as pl
+        for p in self._lazyframe_parameters:
+            value = kwargs.get(p)
+            if isinstance(value, pl.DataFrame):
+                kwargs[p] = value.lazy()
     @functools.cached_property
     def output_annotation(self) -> dict[str, Any]:
         """Return the type annotation of the wrapped function's output."""

{pipefunc-0.92.0 → pipefunc-0.93.0}/pipefunc/_pipeline/_base.py RENAMED

@@ -615,6 +615,7 @@ class Pipeline:
                 raise ValueError(msg)
             func_args[arg] = value
             used_parameters.add(arg)
+        func._convert_lazyframe_kwargs(func_args)
         return func_args
     def _current_cache(self) -> LRUCache | HybridCache | DiskCache | SimpleCache | None:
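The hook added in this hunk applies an annotation-driven conversion to the assembled arguments. A minimal standard-library sketch of that idea follows. `Eager`, `Lazy`, `lazy_parameters`, and `convert_lazy_kwargs` are hypothetical stand-ins chosen for illustration; they are not pipefunc or polars names, and the real implementation may differ in details such as caching and renames.

```python
from typing import Any, get_type_hints


class Lazy:
    """Hypothetical stand-in for a lazy frame type."""

    def __init__(self, source: "Eager") -> None:
        self.source = source


class Eager:
    """Hypothetical stand-in for an eager frame type."""

    def __init__(self, rows: list[int]) -> None:
        self.rows = rows

    def lazy(self) -> Lazy:
        return Lazy(self)


def lazy_parameters(func: Any) -> tuple[str, ...]:
    """Names of parameters annotated exactly as ``Lazy`` (identity check)."""
    hints = get_type_hints(func)
    return tuple(
        name for name, hint in hints.items() if name != "return" and hint is Lazy
    )


def convert_lazy_kwargs(func: Any, kwargs: dict[str, Any]) -> None:
    """Convert ``Eager`` values in place where the annotation asks for ``Lazy``."""
    for name in lazy_parameters(func):
        value = kwargs.get(name)
        if isinstance(value, Eager):
            kwargs[name] = value.lazy()


def consume(df: Lazy, scale: int) -> int:
    return sum(df.source.rows) * scale


kwargs: dict[str, Any] = {"df": Eager([1, 2, 3]), "scale": 2}
convert_lazy_kwargs(consume, kwargs)  # "df" becomes Lazy, "scale" is untouched
result = consume(**kwargs)
```

Because the check uses identity (`hint is Lazy`), annotations such as `Optional[Lazy]` would not trigger the conversion in this sketch, which mirrors the "annotated exactly" caveat in the documentation change above.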

{pipefunc-0.92.0 → pipefunc-0.93.0}/pipefunc/_utils.py RENAMED

@@ -52,9 +52,23 @@ def at_least_tuple(x: Any) -> tuple[Any, ...]:
     return x if isinstance(x, tuple) else (x,)
+PARQUET_MAGIC = b"PAR1"
+def is_parquet_file(path: Path) -> bool:
+    """Check whether the file at ``path`` is a Parquet file (by magic bytes)."""
+    try:
+        with path.open("rb") as f:
+            return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC
+    except OSError:  # pragma: no cover
+        return False
 def load(path: Path, *, cache: bool = False) -> Any:
-    """Load a cloudpickled object from a path.
+    """Load an object from a path.
+    Reads Parquet files (written by `dump` for ``polars.DataFrame`` objects)
+    as ``polars.DataFrame``, everything else as cloudpickle.
     If ``cache`` is ``True``, the object will be cached in memory.
     """
     if cache:
@@ -62,12 +76,33 @@ def load(path: Path, *, cache: bool = False) -> Any:
         return _cached_load(cache_key)
     with path.open("rb") as f:
+        is_parquet = f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC
+        f.seek(0)
+        if is_parquet:
+            import polars as pl
+            return pl.read_parquet(f)
         return cloudpickle.load(f)
 def dump(obj: Any, path: Path) -> None:
-    """Dump an object to a path using cloudpickle."""
+    """Dump an object to a path.
+    ``polars.DataFrame`` objects are stored as Parquet (falling back to
+    cloudpickle if Parquet serialization fails, e.g., for ``pl.Object``
+    dtype columns); everything else is stored with cloudpickle.
+    """
     path.parent.mkdir(parents=True, exist_ok=True)
+    if is_imported("polars"):
+        import polars as pl
+        if isinstance(obj, pl.DataFrame):
+            try:
+                obj.write_parquet(path)
+            except Exception:  # noqa: BLE001, e.g., unsupported dtypes like pl.Object
+                path.unlink(missing_ok=True)
+            else:
+                return
     with path.open("wb") as f:
         cloudpickle.dump(obj, f)
@@ -629,3 +664,12 @@ def pandas_to_polars(df: Any) -> Any:
         # Fallback to manual conversion if pyarrow is not available
         # This happens when pandas has nullable types but pyarrow is not installed
         return pl.DataFrame({col: df[col].to_numpy() for col in df.columns})
+def is_lazyframe_annotation(annotation: Any) -> bool:
+    """Check whether ``annotation`` is ``polars.LazyFrame``."""
+    if not is_imported("polars"):
+        return False
+    import polars as pl
+    return annotation is pl.LazyFrame
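The `dump`/`load` changes in this file combine two ideas: try a specialized writer and fall back to pickling if it raises, then decide how to read a file by sniffing its leading bytes. The sketch below illustrates both using only the standard library. The names `dump_with_fallback`, `sniff_format`, `fake_parquet_writer`, and `failing_writer` are hypothetical, and the "Parquet" writer only emits the magic bytes rather than a real Parquet file.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Optional

MAGIC = b"PAR1"  # Parquet files begin (and end) with these four bytes


def dump_with_fallback(
    obj: Any,
    path: Path,
    special_writer: Optional[Callable[[Any, Path], None]] = None,
) -> None:
    """Try the specialized writer first; fall back to pickle if it raises."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if special_writer is not None:
        try:
            special_writer(obj, path)
        except Exception:
            path.unlink(missing_ok=True)  # drop any partially written file
        else:
            return
    with path.open("wb") as f:
        pickle.dump(obj, f)


def sniff_format(path: Path) -> str:
    """Classify a file by its first bytes instead of its extension."""
    with path.open("rb") as f:
        return "parquet" if f.read(len(MAGIC)) == MAGIC else "pickle"


def fake_parquet_writer(obj: Any, path: Path) -> None:
    """Hypothetical writer: only produces a file framed by the magic bytes."""
    path.write_bytes(MAGIC + b"payload" + MAGIC)


def failing_writer(obj: Any, path: Path) -> None:
    """Hypothetical writer that leaves a partial file behind and then fails."""
    path.write_bytes(MAGIC)
    raise ValueError("unsupported dtype")


with tempfile.TemporaryDirectory() as tmp:
    ok_path = Path(tmp) / "ok.bin"
    dump_with_fallback({"a": 1}, ok_path, fake_parquet_writer)
    ok_format = sniff_format(ok_path)

    fallback_path = Path(tmp) / "fallback.bin"
    dump_with_fallback({"a": 1}, fallback_path, failing_writer)
    fallback_format = sniff_format(fallback_path)
    with fallback_path.open("rb") as f:
        fallback_value = pickle.load(f)
```

Removing the partial file before falling back matters here: without it, a failed write could leave the magic bytes on disk and the reader would misclassify the file.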

{pipefunc-0.92.0 → pipefunc-0.93.0}/pipefunc/_version.py RENAMED

@@ -3,7 +3,7 @@ from __future__ import annotations
 from pathlib import Path
 # Is set during `onbuild` if `pip install pipefunc` is used
-__version__ = "0.92.0"
+__version__ = "0.93.0"
 if not __version__:
     try:

{pipefunc-0.92.0 → pipefunc-0.93.0}/pipefunc/cache.py RENAMED

@@ -885,6 +885,10 @@ def to_hashable(  # noqa: C901, PLR0911, PLR0912
         if isinstance(obj, polars.DataFrame):
             hsh = to_hashable(obj.to_dict(as_series=False), fallback_to_pickle)
             return (m, tp, hsh)
+        if isinstance(obj, polars.LazyFrame):
+            # Hash the serialized query plan; collecting the data here would
+            # defeat the purpose of using a LazyFrame.
+            return (m, tp, obj.serialize())
     if fallback_to_pickle:
         try:

{pipefunc-0.92.0 → pipefunc-0.93.0}/pipefunc/map/_run.py RENAMED

@@ -30,6 +30,7 @@ from pipefunc._utils import (
     dump,
     ensure_block_allowed,
     get_ncores,
+    is_parquet_file,
     is_running_in_ipynb,
     prod,
 )
@@ -716,7 +717,10 @@ def _func_kwargs(func: PipeFunc, run_info: RunInfo, store: dict[str, StoreType])
         elif p in run_info.inputs:
             kwargs[p] = run_info.inputs[p]
         elif p in run_info.all_output_names:
-            kwargs[p] = _load_from_store(p, store).value
+            if (lazy_frame := _maybe_scan_parquet(func, p, store)) is not None:
+                kwargs[p] = lazy_frame
+            else:
+                kwargs[p] = _load_from_store(p, store).value
         elif p in run_info.defaults and p not in run_info.all_output_names:
             kwargs[p] = run_info.defaults[p]
         else:  # pragma: no cover
@@ -727,6 +731,26 @@ def _func_kwargs(func: PipeFunc, run_info: RunInfo, store: dict[str, StoreType])
     return kwargs
+def _maybe_scan_parquet(func: PipeFunc, parameter: str, store: dict[str, StoreType]) -> Any:
+    """Return a `pl.LazyFrame` scanning the stored Parquet file, if applicable.
+    Only applies when the parameter is annotated as `pl.LazyFrame`, is not
+    indexed by the function's mapspec, and the upstream output is stored on
+    disk as a Parquet file (see `pipefunc._utils.dump`). This avoids
+    materializing the full `pl.DataFrame` in memory.
+    """
+    if parameter not in func._lazyframe_parameters:
+        return None
+    if func.mapspec is not None and parameter in func.mapspec.input_names:
+        return None
+    storage = store[parameter]
+    if not isinstance(storage, Path) or not storage.is_file() or not is_parquet_file(storage):
+        return None
+    import polars as pl
+    return pl.scan_parquet(storage)
 def _select_kwargs(
     func: PipeFunc,
     kwargs: dict[str, Any],
@@ -740,6 +764,7 @@ def _select_kwargs(
     normalized_keys = {k: v[0] if len(v) == 1 else v for k, v in input_keys.items()}
     selected = {k: v[normalized_keys[k]] if k in normalized_keys else v for k, v in kwargs.items()}
     _load_data(selected)
+    func._convert_lazyframe_kwargs(selected)
     return selected
@@ -1695,6 +1720,7 @@ def _execute_single(
     # Otherwise, run the function
     _load_data(kwargs)
+    func._convert_lazyframe_kwargs(kwargs)
     if error_handling == "raise":
         return _get_or_set_cache(func, kwargs, cache, _CTX_RAISE, "raise")
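The guard clauses in `_maybe_scan_parquet` can be read as a single yes/no decision: pass a lazy scan only when every precondition holds, otherwise load eagerly. The following dependency-free sketch restates that decision. `can_scan_lazily` and its parameters are hypothetical names, and the sets stand in for the annotation and mapspec lookups in the real code.

```python
import tempfile
from pathlib import Path
from typing import Any

MAGIC = b"PAR1"


def can_scan_lazily(
    parameter: str,
    lazy_parameters: set[str],
    mapped_inputs: set[str],
    storage: Any,
) -> bool:
    """Return True only if a stored output may be handed over as a lazy scan."""
    if parameter not in lazy_parameters:
        return False  # the consumer did not ask for a lazy frame
    if parameter in mapped_inputs:
        return False  # element-wise inputs are loaded per index instead
    if not isinstance(storage, Path) or not storage.is_file():
        return False  # in-memory storage or missing file
    with storage.open("rb") as f:
        return f.read(len(MAGIC)) == MAGIC  # must actually be Parquet on disk


with tempfile.TemporaryDirectory() as tmp:
    parquet_like = Path(tmp) / "df.bin"
    parquet_like.write_bytes(MAGIC + b"payload" + MAGIC)
    pickled = Path(tmp) / "other.bin"
    pickled.write_bytes(b"\x80\x04N.")  # a pickled None, not Parquet

    scan_ok = can_scan_lazily("df", {"df"}, set(), parquet_like)
    not_annotated = can_scan_lazily("df", set(), set(), parquet_like)
    mapped = can_scan_lazily("df", {"df"}, {"df"}, parquet_like)
    in_memory = can_scan_lazily("df", {"df"}, set(), {"df": object()})
    wrong_format = can_scan_lazily("df", {"df"}, set(), pickled)
```

Ordering the cheap checks first means the file is opened only when the annotation and mapspec conditions already hold.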

{pipefunc-0.92.0 → pipefunc-0.93.0}/pipefunc/map/_storage_array/_file.py RENAMED

@@ -11,7 +11,7 @@ from typing import TYPE_CHECKING, Any
 import cloudpickle  # type: ignore[import-untyped]
 import numpy as np
-from pipefunc._utils import dump, load
+from pipefunc._utils import PARQUET_MAGIC, dump, load
 from ._base import (
     StorageBase,
@@ -326,8 +326,16 @@ def _load_all(filenames: Iterator[Path]) -> list[Any]:
     def maybe_read(f: Path) -> Any | None:
         return _read(f) if f.is_file() else None
-    def maybe_load(x: str | None) -> Any | None:
-        return cloudpickle.loads(x) if x is not None else None
+    def maybe_load(x: bytes | None) -> Any | None:
+        if x is None:
+            return None
+        if x.startswith(PARQUET_MAGIC):
+            import io
+            import polars as pl
+            return pl.read_parquet(io.BytesIO(x))
+        return cloudpickle.loads(x)
     # Delegate file reading to the threadpool but deserialize sequentially,
     # as this is pure Python and CPU bound
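The `_load_all` change keeps the existing split (file reads in a thread pool, deserialization done sequentially) and adds a dispatch on the leading bytes of each payload. A minimal sketch of that pattern, using only the standard library, could look as follows. `read_bytes_if_present`, `decode`, and `load_all` are hypothetical names, and the Parquet branch merely tags the payload instead of calling a real Parquet reader.

```python
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Any, Iterable, Optional

MAGIC = b"PAR1"


def read_bytes_if_present(path: Path) -> Optional[bytes]:
    """Read raw bytes, or return None for files that do not exist yet."""
    return path.read_bytes() if path.is_file() else None


def decode(raw: Optional[bytes]) -> Any:
    """Pick a decoder based on the first bytes of the payload."""
    if raw is None:
        return None
    if raw.startswith(MAGIC):
        # A real implementation would pass these bytes to a Parquet reader;
        # this sketch only tags them so it stays dependency-free.
        return ("parquet", len(raw))
    return pickle.loads(raw)


def load_all(paths: Iterable[Path]) -> list[Any]:
    """Read files concurrently (I/O bound), then decode one by one."""
    with ThreadPoolExecutor() as executor:
        raw_items = list(executor.map(read_bytes_if_present, paths))
    return [decode(raw) for raw in raw_items]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "0.bin").write_bytes(pickle.dumps([1, 2, 3]))
    (folder / "1.bin").write_bytes(MAGIC + b"fake-body" + MAGIC)
    loaded = load_all([folder / "0.bin", folder / "1.bin", folder / "missing.bin"])
```

`executor.map` yields results in input order, so the decoded list lines up with the requested paths even though the reads may finish out of order.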

{pipefunc-0.92.0 → pipefunc-0.93.0}/pipefunc/typing.py RENAMED

@@ -20,6 +20,8 @@ from typing import (
 import numpy as np
+from pipefunc._utils import is_imported
 class NoAnnotation:
     """Marker class for missing type annotations."""
@@ -209,7 +211,7 @@ def _handle_generic_types(
     return None
-def is_type_compatible(
+def is_type_compatible(  # noqa: PLR0911
     incoming_type: Any,
     required_type: Any,
     memo: TypeCheckMemo | None = None,
@@ -228,6 +230,10 @@ def is_type_compatible(
     if _check_identical_or_any(incoming_type, required_type):
         return True
+    if _is_polars_dataframe_to_lazyframe(incoming_type, required_type):
+        # pipefunc converts `pl.DataFrame` values to `pl.LazyFrame` at execution
+        # time when the consuming parameter is annotated as `pl.LazyFrame`.
+        return True
     if (result := _is_typevar_compatible(incoming_type, required_type, memo)) is not None:
         return result
     if (result := _handle_union_types(incoming_type, required_type, memo)) is not None:
@@ -237,6 +243,15 @@ def is_type_compatible(
     return False
+def _is_polars_dataframe_to_lazyframe(incoming_type: Any, required_type: Any) -> bool:
+    """Check for the special-cased `pl.DataFrame` output -> `pl.LazyFrame` input edge."""
+    if not is_imported("polars"):
+        return False
+    import polars as pl
+    return incoming_type is pl.DataFrame and required_type is pl.LazyFrame
 def _is_typevar_compatible(
     incoming_type: Any,
     required_type: Any,
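The typing change special-cases a single directed edge (eager output feeding a lazy input) ahead of the general checks. The sketch below shows that shape with stand-in classes. `Eager`, `Lazy`, `is_special_cased_edge`, and `is_compatible` are hypothetical, and the general rules are deliberately simplified compared with `is_type_compatible`.

```python
from typing import Any


class Eager:
    """Hypothetical stand-in for an eager frame type."""


class Lazy:
    """Hypothetical stand-in for a lazy frame type."""


def is_special_cased_edge(incoming: Any, required: Any) -> bool:
    """Allow exactly one conversion: an Eager output into a Lazy input."""
    return incoming is Eager and required is Lazy


def is_compatible(incoming: Any, required: Any) -> bool:
    """Simplified compatibility check with one special-cased edge."""
    if incoming is required or incoming is Any or required is Any:
        return True
    if is_special_cased_edge(incoming, required):
        # The runtime is assumed to convert the value before the call.
        return True
    if isinstance(incoming, type) and isinstance(required, type):
        return issubclass(incoming, required)
    return False


forward = is_compatible(Eager, Lazy)
backward = is_compatible(Lazy, Eager)
unrelated = is_compatible(int, Lazy)
subclass = is_compatible(bool, int)
```

The edge is intentionally one-directional: validation accepts it only because a matching conversion exists at execution time, and no reverse conversion is implied.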

pipefunc-0.93.0/tests/test_polars_parquet.py ADDED

@@ -0,0 +1,234 @@
+"""Tests for Parquet serialization and `pl.LazyFrame` support (issue #879)."""
+from __future__ import annotations
+import importlib.util
+import sys
+from typing import TYPE_CHECKING, Any
+import numpy as np  # noqa: TC002, needed at runtime to resolve `np.ndarray` annotations
+import pytest
+from pipefunc import Pipeline, pipefunc
+from pipefunc._utils import PARQUET_MAGIC, dump, is_parquet_file, load
+from pipefunc.map import load_outputs
+from pipefunc.typing import is_type_compatible
+has_polars = importlib.util.find_spec("polars") is not None
+pytestmark = pytest.mark.skipif(not has_polars, reason="polars not installed")
+if has_polars:
+    import polars as pl
+if TYPE_CHECKING:
+    from pathlib import Path
+def test_dump_dataframe_as_parquet(tmp_path: Path) -> None:
+    df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
+    path = tmp_path / "df.cloudpickle"
+    dump(df, path)
+    assert path.read_bytes()[:4] == PARQUET_MAGIC
+    assert is_parquet_file(path)
+    loaded = load(path)
+    assert isinstance(loaded, pl.DataFrame)
+    assert loaded.equals(df)
+def test_dump_non_dataframe_still_pickles(tmp_path: Path) -> None:
+    path = tmp_path / "obj.cloudpickle"
+    dump({"a": 1}, path)
+    assert not is_parquet_file(path)
+    assert load(path) == {"a": 1}
+def test_dump_falls_back_to_pickle_on_parquet_failure(
+    tmp_path: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    df = pl.DataFrame({"a": [1, 2]})
+    def fail(*args: Any, **kwargs: Any) -> None:
+        msg = "boom"
+        raise ValueError(msg)
+    monkeypatch.setattr(pl.DataFrame, "write_parquet", fail)
+    path = tmp_path / "df.cloudpickle"
+    dump(df, path)
+    assert path.read_bytes()[:4] != PARQUET_MAGIC
+    loaded = load(path)
+    assert isinstance(loaded, pl.DataFrame)
+    assert loaded.equals(df)
+def test_load_with_cache(tmp_path: Path) -> None:
+    df = pl.DataFrame({"a": [1, 2]})
+    path = tmp_path / "df.cloudpickle"
+    dump(df, path)
+    assert load(path, cache=True).equals(df)
+def test_file_array_with_dataframes(tmp_path: Path) -> None:
+    from pipefunc.map._storage_array._file import FileArray
+    arr = FileArray(tmp_path / "arr", shape=(2,))
+    arr.dump((0,), pl.DataFrame({"a": [1]}))
+    arr.dump((1,), pl.DataFrame({"a": [2]}))
+    assert is_parquet_file(arr._index_to_file(0))
+    element = arr[0,]
+    assert isinstance(element, pl.DataFrame)
+    assert element["a"].to_list() == [1]
+    # `to_array` exercises the threaded `_load_all` byte-sniffing path
+    full = arr.to_array()
+    assert all(isinstance(x, pl.DataFrame) for x in full)
+def test_dataframe_to_lazyframe_type_compatible() -> None:
+    assert is_type_compatible(pl.DataFrame, pl.LazyFrame)
+    assert not is_type_compatible(pl.LazyFrame, pl.DataFrame)
+    assert not is_type_compatible(int, pl.LazyFrame)
+def test_map_lazyframe_input_scans_parquet(tmp_path: Path) -> None:
+    @pipefunc(output_name="df")
+    def make_df() -> pl.DataFrame:
+        return pl.DataFrame({"a": [1, 2, 3]})
+    @pipefunc(output_name="total")
+    def consume(df: pl.LazyFrame) -> int:
+        assert isinstance(df, pl.LazyFrame)
+        # The plan must be a Parquet scan, not an in-memory DataFrame
+        assert "DF" not in df.explain(optimized=False)
+        return df.select(pl.col("a").sum()).collect().item()
+    pipeline = Pipeline([make_df, consume])  # validates type annotations
+    result = pipeline.map({}, run_folder=tmp_path, parallel=False, show_progress=False)
+    assert result["total"].output == 6
+    df_path = tmp_path / "outputs" / "df.cloudpickle"
+    assert is_parquet_file(df_path)
+    loaded = load_outputs("df", run_folder=tmp_path)
+    assert isinstance(loaded, pl.DataFrame)
+    assert loaded["a"].to_list() == [1, 2, 3]
+def test_map_lazyframe_input_without_run_folder() -> None:
+    @pipefunc(output_name="df")
+    def make_df() -> pl.DataFrame:
+        return pl.DataFrame({"a": [1, 2, 3]})
+    @pipefunc(output_name="total")
+    def consume(df: pl.LazyFrame) -> int:
+        assert isinstance(df, pl.LazyFrame)
+        return df.select(pl.col("a").sum()).collect().item()
+    pipeline = Pipeline([make_df, consume])
+    result = pipeline.map({}, parallel=False, show_progress=False, storage="dict")
+    assert result["total"].output == 6
+def test_map_elementwise_lazyframe(tmp_path: Path) -> None:
+    @pipefunc(output_name="df", mapspec="x[i] -> df[i]")
+    def make_df(x: int) -> pl.DataFrame:
+        return pl.DataFrame({"a": [x, x * 2]})
+    @pipefunc(output_name="total", mapspec="df[i] -> total[i]")
+    def consume(df: pl.LazyFrame) -> int:
+        assert isinstance(df, pl.LazyFrame)
+        return df.select(pl.col("a").sum()).collect().item()
+    pipeline = Pipeline([make_df, consume])
+    result = pipeline.map(
+        {"x": [1, 10]},
+        run_folder=tmp_path,
+        parallel=True,
+        show_progress=False,
+    )
+    assert result["total"].output.tolist() == [3, 30]
+def test_map_reduction_keeps_dataframes(tmp_path: Path) -> None:
+    @pipefunc(output_name="df", mapspec="x[i] -> df[i]")
+    def make_df(x: int) -> pl.DataFrame:
+        return pl.DataFrame({"a": [x]})
+    @pipefunc(output_name="n")
+    def reduce_all(df: np.ndarray) -> int:
+        assert all(isinstance(d, pl.DataFrame) for d in df)
+        return len(df)
+    pipeline = Pipeline([make_df, reduce_all])
+    result = pipeline.map(
+        {"x": [1, 10]},
+        run_folder=tmp_path,
+        parallel=False,
+        show_progress=False,
+    )
+    assert result["n"].output == 2
+def test_run_lazyframe_input() -> None:
+    @pipefunc(output_name="df")
+    def make_df() -> pl.DataFrame:
+        return pl.DataFrame({"a": [1, 2, 3]})
+    @pipefunc(output_name="total")
+    def consume(df: pl.LazyFrame) -> int:
+        assert isinstance(df, pl.LazyFrame)
+        return df.select(pl.col("a").sum()).collect().item()
+    pipeline = Pipeline([make_df, consume])
+    assert pipeline.run("total", kwargs={}) == 6
+def test_run_lazyframe_from_input_kwarg() -> None:
+    @pipefunc(output_name="total")
+    def consume(df: pl.LazyFrame) -> int:
+        assert isinstance(df, pl.LazyFrame)
+        return df.select(pl.col("a").sum()).collect().item()
+    pipeline = Pipeline([consume])
+    df = pl.DataFrame({"a": [1, 2, 3]})
+    assert pipeline.run("total", kwargs={"df": df}) == 6
+def test_map_lazyframe_from_input_kwarg(tmp_path: Path) -> None:
+    @pipefunc(output_name="total")
+    def consume(df: pl.LazyFrame) -> int:
+        assert isinstance(df, pl.LazyFrame)
+        return df.select(pl.col("a").sum()).collect().item()
+    pipeline = Pipeline([consume])
+    df = pl.DataFrame({"a": [1, 2, 3]})
+    result = pipeline.map({"df": df}, run_folder=tmp_path, parallel=False, show_progress=False)
+    assert result["total"].output == 6
+def test_lazyframe_passthrough() -> None:
+    @pipefunc(output_name="lf")
+    def make_lf() -> pl.LazyFrame:
+        return pl.DataFrame({"a": [1, 2, 3]}).lazy()
+    @pipefunc(output_name="total")
+    def consume(lf: pl.LazyFrame) -> int:
+        assert isinstance(lf, pl.LazyFrame)
+        return lf.select(pl.col("a").sum()).collect().item()
+    pipeline = Pipeline([make_lf, consume])
+    result = pipeline.map({}, parallel=False, show_progress=False, storage="dict")
+    assert result["total"].output == 6
+def test_to_hashable_lazyframe() -> None:
+    from pipefunc.cache import to_hashable
+    lf = pl.DataFrame({"a": [1, 2]}).lazy()
+    key = to_hashable(lf)
+    assert hash(key) == hash(to_hashable(pl.DataFrame({"a": [1, 2]}).lazy()))
+def test_helpers_when_polars_not_imported(monkeypatch: pytest.MonkeyPatch) -> None:
+    from pipefunc._utils import is_lazyframe_annotation
+    monkeypatch.delitem(sys.modules, "polars")
+    assert not is_lazyframe_annotation(pl.LazyFrame)
+    assert not is_type_compatible(pl.DataFrame, pl.LazyFrame)