polars-checkpoint 0.1.0__tar.gz
- polars_checkpoint-0.1.0/PKG-INFO +88 -0
- polars_checkpoint-0.1.0/README.md +80 -0
- polars_checkpoint-0.1.0/pyproject.toml +9 -0
- polars_checkpoint-0.1.0/setup.cfg +4 -0
- polars_checkpoint-0.1.0/src/polars_checkpoint/__init__.py +0 -0
- polars_checkpoint-0.1.0/src/polars_checkpoint/polars_checkpoint.py +341 -0
- polars_checkpoint-0.1.0/src/polars_checkpoint.egg-info/PKG-INFO +88 -0
- polars_checkpoint-0.1.0/src/polars_checkpoint.egg-info/SOURCES.txt +9 -0
- polars_checkpoint-0.1.0/src/polars_checkpoint.egg-info/dependency_links.txt +1 -0
- polars_checkpoint-0.1.0/src/polars_checkpoint.egg-info/requires.txt +1 -0
- polars_checkpoint-0.1.0/src/polars_checkpoint.egg-info/top_level.txt +1 -0
@@ -0,0 +1,88 @@
+Metadata-Version: 2.4
+Name: polars-checkpoint
+Version: 0.1.0
+Summary: Add your description here
+Requires-Python: >=3.11.1
+Description-Content-Type: text/markdown
+Requires-Dist: polars>=1.39.3
+
+# polars-checkpoint
+
+Materialise Polars LazyFrames to parquet files and scan them back lazily. Defaults to the streaming engine for sink/scan. Useful for reducing memory pressure from expensive intermediate results in complex multi-step transforms.
+
+## Installation
+
+Requires `polars` 1.39.3 or newer and Python 3.11.1 or newer.
+
+## Quick Start
+
+```python
+import polars as pl
+from polars_checkpoint.polars_checkpoint import checkpoint
+
+lf = pl.LazyFrame({"x": range(1_000_000)}).with_columns(y=pl.col("x") * 2)
+
+# Materialises to a temp parquet file; returns a lazy re-scan
+lf = checkpoint(lf)
+
+lf.filter(pl.col("y") > 100).collect()
+```
+
+A process-wide default session manages the temp directory and cleans it up at exit.
+
+## Session API
+
+For explicit control over storage location and lifecycle, use `CheckpointSession`:
+
+```python
+from polars_checkpoint.polars_checkpoint import CheckpointSession
+
+# As a context manager: cleans up on exit from the block
+with CheckpointSession(root_dir="./my_checkpoints") as sess:
+    lf = pl.scan_csv("big.csv")
+    lf = sess.checkpoint(lf, name="after-parse")
+    lf = lf.filter(pl.col("status") == "active")
+    lf = sess.checkpoint(lf, name="filtered")
+```
+
+```python
+# Without a context manager: cleans up at GC or interpreter shutdown,
+# or when you call close() explicitly
+sess = CheckpointSession(root_dir="./my_checkpoints")
+lf = sess.checkpoint(pl.scan_csv("big.csv"), name="raw")
+reloaded = sess["raw"]
+print(sess.summary())
+
+sess.close()  # optional; triggers early cleanup
+```
+
+### `CheckpointSession` constructor
+
+| Parameter | Default | Description |
+|---|---|---|
+| `root_dir` | `None` (auto temp dir) | Parent directory for checkpoint folders. |
+| `cleanup` | `True` | Delete checkpoint files on close / GC / interpreter exit. |
+| `default_sink_kwargs` | `{"compression": "zstd"}` | Defaults passed to `sink_parquet` / `write_parquet`. |
+| `default_scan_kwargs` | `{}` | Defaults passed to `scan_parquet`. |
+
+### Key methods & features
+
+- **`checkpoint(lf, *, name=None, streaming=True, ...)`**: Materialise a LazyFrame to parquet and return a lazy re-scan of the file. Auto-generates a name if none is given. Uses `collect().write_parquet()` instead of `sink_parquet()` when `streaming=False`.
+- **`session[name]`**: Retrieve a checkpoint as a `LazyFrame`.
+- **`name in session`**: Check existence.
+- **`len(session)`** / **`iter(session)`**: Count / list checkpoints.
+- **`summary()`**: Returns a Polars DataFrame with the name, size (MB), and path of each checkpoint.
+- **`close(*, timeout=None)`**: Waits for in-flight writes, then cleans up. The session is also usable as a context manager.
+
+## Thread Safety
+
+Sessions are internally locked. Concurrent `checkpoint()` calls from multiple threads are safe; `close()` waits for all in-flight materialisations before removing files, unless its `timeout` expires first.
+
+## Cleanup Behaviour
+
+| Scenario | `cleanup=True` (default) | `cleanup=False` |
+|---|---|---|
+| `close()` / `__exit__` | Files deleted | Files retained |
+| GC / interpreter shutdown | Files deleted (via `weakref.finalize`) | Files retained |
+
+When `root_dir` is auto-generated, the entire temp directory is removed. When user-supplied, only the individual checkpoint subdirectories created by the session are removed.
@@ -0,0 +1,80 @@
+# polars-checkpoint
+
+Materialise Polars LazyFrames to parquet files and scan them back lazily. Defaults to the streaming engine for sink/scan. Useful for reducing memory pressure from expensive intermediate results in complex multi-step transforms.
+
+## Installation
+
+Requires `polars` 1.39.3 or newer and Python 3.11.1 or newer.
+
+## Quick Start
+
+```python
+import polars as pl
+from polars_checkpoint.polars_checkpoint import checkpoint
+
+lf = pl.LazyFrame({"x": range(1_000_000)}).with_columns(y=pl.col("x") * 2)
+
+# Materialises to a temp parquet file; returns a lazy re-scan
+lf = checkpoint(lf)
+
+lf.filter(pl.col("y") > 100).collect()
+```
+
+A process-wide default session manages the temp directory and cleans it up at exit.
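
One way to picture the process-wide default session is a lazily created, lock-guarded singleton that is replaced once it has been closed. The sketch below uses only the standard library; `Session` and `get_default_session` are hypothetical stand-ins for illustration, not the package's own names, and the real session does far more than track an open/closed flag.

```python
import threading
from typing import Optional


class Session:
    """Hypothetical stand-in: only tracks whether it has been closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


_lock = threading.Lock()
_default: Optional[Session] = None


def get_default_session() -> Session:
    """Return the shared session, creating a new one if missing or closed."""
    global _default
    with _lock:  # serialise creation so two threads never build two defaults
        if _default is None or _default.closed:
            _default = Session()
        return _default


first = get_default_session()
same = get_default_session()    # reused while the session is open
first.close()
fresh = get_default_session()   # a closed default is replaced on next use
```

The lock matters only for the first call from several threads at once; after that the check is cheap and the same object is handed out each time.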
+
+## Session API
+
+For explicit control over storage location and lifecycle, use `CheckpointSession`:
+
+```python
+from polars_checkpoint.polars_checkpoint import CheckpointSession
+
+# As a context manager: cleans up on exit from the block
+with CheckpointSession(root_dir="./my_checkpoints") as sess:
+    lf = pl.scan_csv("big.csv")
+    lf = sess.checkpoint(lf, name="after-parse")
+    lf = lf.filter(pl.col("status") == "active")
+    lf = sess.checkpoint(lf, name="filtered")
+```
+
+```python
+# Without a context manager: cleans up at GC or interpreter shutdown,
+# or when you call close() explicitly
+sess = CheckpointSession(root_dir="./my_checkpoints")
+lf = sess.checkpoint(pl.scan_csv("big.csv"), name="raw")
+reloaded = sess["raw"]
+print(sess.summary())
+
+sess.close()  # optional; triggers early cleanup
+```
+
+### `CheckpointSession` constructor
+
+| Parameter | Default | Description |
+|---|---|---|
+| `root_dir` | `None` (auto temp dir) | Parent directory for checkpoint folders. |
+| `cleanup` | `True` | Delete checkpoint files on close / GC / interpreter exit. |
+| `default_sink_kwargs` | `{"compression": "zstd"}` | Defaults passed to `sink_parquet` / `write_parquet`. |
+| `default_scan_kwargs` | `{}` | Defaults passed to `scan_parquet`. |
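
The keyword defaults in the table appear to layer in a fixed order: the built-in `compression="zstd"`, then the session's `default_sink_kwargs`, then any per-call `sink_kwargs`, with later layers winning. A minimal sketch of that precedence with plain dictionaries follows; `merge_sink_options` is a hypothetical helper name used only here, and `row_group_size` is just an example option.

```python
from typing import Any, Optional

BUILTIN_SINK_DEFAULTS: dict[str, Any] = {"compression": "zstd"}


def merge_sink_options(
    session_defaults: Optional[dict[str, Any]] = None,
    call_overrides: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Later sources win: built-in < session defaults < per-call overrides."""
    return {
        **BUILTIN_SINK_DEFAULTS,
        **(session_defaults or {}),
        **(call_overrides or {}),
    }


# No overrides: only the built-in default applies.
plain = merge_sink_options()

# Session-level defaults add to the built-in one.
tuned = merge_sink_options({"row_group_size": 100_000})

# A per-call override beats the session default for the same key.
overridden = merge_sink_options({"compression": "snappy"}, {"compression": "lz4"})
```

Because each layer is unpacked into a new dictionary, neither the caller's dictionaries nor the built-in defaults are mutated.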
+
+### Key methods & features
+
+- **`checkpoint(lf, *, name=None, streaming=True, ...)`**: Materialise a LazyFrame to parquet and return a lazy re-scan of the file. Auto-generates a name if none is given. Uses `collect().write_parquet()` instead of `sink_parquet()` when `streaming=False`.
+- **`session[name]`**: Retrieve a checkpoint as a `LazyFrame`.
+- **`name in session`**: Check existence.
+- **`len(session)`** / **`iter(session)`**: Count / list checkpoints.
+- **`summary()`**: Returns a Polars DataFrame with the name, size (MB), and path of each checkpoint.
+- **`close(*, timeout=None)`**: Waits for in-flight writes, then cleans up. The session is also usable as a context manager.
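
Checkpoint names double as directory names, so they are normalised before use. The sketch below re-states the rule found in the package source (keep alphanumerics, `-`, `_` and `.`; replace everything else with `_`; strip leading and trailing `.` and `_`) as a standalone function. `normalise_name` is an illustrative name for this sketch, not a public API.

```python
def normalise_name(name: str) -> str:
    """Turn an arbitrary checkpoint name into a filesystem-friendly slug."""
    # Keep letters, digits, '-', '_' and '.'; map every other character to '_'.
    cleaned = "".join(ch if ch.isalnum() or ch in "-_." else "_" for ch in name)
    # Drop leading/trailing dots and underscores (avoids hidden or empty names).
    slug = cleaned.strip("._")
    if not slug:
        raise ValueError(f"Checkpoint name {name!r} normalised to an empty string")
    return slug


unchanged = normalise_name("after-parse")
with_separators = normalise_name("stage 1/raw")
trimmed = normalise_name("..hidden")

# Distinct inputs can collapse to the same slug.
collision = normalise_name("a b") == normalise_name("a/b")
```

Two consequences are worth keeping in mind: different names can map to the same slug, in which case the second `checkpoint()` call would hit an already existing directory, and iterating a session yields the normalised directory names rather than the strings originally passed in.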
+
+## Thread Safety
+
+Sessions are internally locked. Concurrent `checkpoint()` calls from multiple threads are safe; `close()` waits for all in-flight materialisations before removing files, unless its `timeout` expires first.
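
The waiting behaviour can be sketched with a `threading.Condition` and a counter of in-flight jobs. This is a simplified standard-library illustration, not the package's code: `InFlightTracker`, `run` and `slow_write` are hypothetical names, and a short `time.sleep` stands in for a parquet write.

```python
import threading
import time
from typing import Callable, Optional


class InFlightTracker:
    """Hypothetical helper: counts running jobs so close() can wait for them."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._active = 0
        self._closed = False

    def run(self, work: Callable[[], None]) -> None:
        # Register the job under the lock, but do the slow work outside it.
        with self._cond:
            if self._closed:
                raise RuntimeError("tracker is closed")
            self._active += 1
        try:
            work()
        finally:
            with self._cond:
                self._active -= 1
                self._cond.notify_all()  # wake any thread blocked in close()

    def close(self, timeout: Optional[float] = None) -> bool:
        """Refuse new jobs, then wait; True if all jobs drained in time."""
        with self._cond:
            self._closed = True
            return self._cond.wait_for(lambda: self._active == 0, timeout=timeout)


tracker = InFlightTracker()
results: list[str] = []
barrier = threading.Barrier(5)  # 4 workers + the main thread


def slow_write() -> None:
    barrier.wait()    # passes only once every worker has been registered
    time.sleep(0.05)  # stands in for a parquet write
    results.append("done")


threads = [threading.Thread(target=tracker.run, args=(slow_write,)) for _ in range(4)]
for t in threads:
    t.start()
barrier.wait()
drained = tracker.close(timeout=5.0)  # blocks until all four jobs finish
for t in threads:
    t.join()
```

Holding the lock only while updating the counter keeps concurrent writes from serialising on each other; the condition variable is what lets `close()` sleep until the last job signals completion.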
+
+## Cleanup Behaviour
+
+| Scenario | `cleanup=True` (default) | `cleanup=False` |
+|---|---|---|
+| `close()` / `__exit__` | Files deleted | Files retained |
+| GC / interpreter shutdown | Files deleted (via `weakref.finalize`) | Files retained |
+
+When `root_dir` is auto-generated, the entire temp directory is removed. When user-supplied, only the individual checkpoint subdirectories created by the session are removed.
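
The GC and shutdown row relies on `weakref.finalize`, which runs a callback when an object is collected or at interpreter exit, and at most once. The following is a minimal standard-library sketch of that pattern under simplified assumptions: `TempWorkspace` is a hypothetical class, and unlike the package (which detaches its finaliser in `close()` and deletes files itself) it simply invokes the finaliser early.

```python
import gc
import shutil
import tempfile
import weakref
from pathlib import Path


class TempWorkspace:
    """Hypothetical owner of a temp directory that is removed automatically."""

    def __init__(self) -> None:
        self.path = Path(tempfile.mkdtemp(prefix="demo-ckpt-"))
        # The callback receives the path, not self: a reference to self here
        # would keep the object alive and the finaliser would never fire.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, self.path, ignore_errors=True
        )

    def close(self) -> None:
        # Calling a finaliser runs its callback at most once; repeat calls
        # and the later GC/exit hook become no-ops.
        self._finalizer()


# Explicit cleanup.
ws = TempWorkspace()
explicit_path = ws.path
ws.close()
ws.close()  # safe: the finaliser is already dead

# Cleanup when the object is garbage-collected.
ws2 = TempWorkspace()
gc_path = ws2.path
del ws2
gc.collect()  # CPython frees the object on `del`; collect() covers other runtimes
```

Passing only plain values such as the path to the callback is the key design point, and it is the same reason the package hands its finaliser the directory list rather than the session object.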
File without changes
@@ -0,0 +1,341 @@
+from __future__ import annotations
+
+import logging
+import shutil
+import tempfile
+import threading
+import time
+import uuid
+import weakref
+from collections.abc import Iterator
+from pathlib import Path
+from typing import Any
+
+import polars as pl
+
+logger = logging.getLogger(__name__)
+
+_default_session_lock = threading.Lock()
+_default_session: CheckpointSession | None = None
+
+
+def _cleanup_files(
+    root_dir: Path,
+    auto_root: bool,
+    owned_dirs: list[Path],
+) -> None:
+    """Safety-net cleanup invoked by weakref.finalize."""
+    if auto_root:
+        shutil.rmtree(root_dir, ignore_errors=True)
+    else:
+        for d in owned_dirs:
+            shutil.rmtree(d, ignore_errors=True)
+
+
+class CheckpointSession:
+    """Manages a directory of parquet checkpoints for Polars LazyFrames."""
+
+    def __init__(
+        self,
+        root_dir: str | Path | None = None,
+        *,
+        cleanup: bool = True,
+        default_sink_kwargs: dict[str, Any] | None = None,
+        default_scan_kwargs: dict[str, Any] | None = None,
+    ) -> None:
+        self._lock = threading.Lock()
+        self._cond = threading.Condition(self._lock)
+        self._auto_root = root_dir is None
+        if self._auto_root:
+            self.root_dir = Path(tempfile.mkdtemp(prefix="pl-ckpt-"))
+        else:
+            self.root_dir = Path(root_dir)
+            self.root_dir.mkdir(parents=True, exist_ok=True)
+
+        self.cleanup = cleanup
+        self.default_sink_kwargs = {
+            "compression": "zstd",
+            **(default_sink_kwargs or {}),
+        }
+        self.default_scan_kwargs = dict(default_scan_kwargs or {})
+
+        self._owned_dirs: list[Path] = []
+        self._active_ops = 0
+        self._closed = False
+
+        # finalize fires at GC or interpreter shutdown (atexit=True),
+        # whichever comes first. close() calls detach() to prevent
+        # double-cleanup.
+        if self.cleanup:
+            self._finaliser = weakref.finalize(
+                self,
+                _cleanup_files,
+                self.root_dir,
+                self._auto_root,
+                self._owned_dirs,  # same list object, mutated in place
+            )
+            self._finaliser.atexit = True
+        else:
+            self._finaliser = None
+
+    def __repr__(self) -> str:
+        if self._closed:
+            state = "closed"
+        else:
+            n = len(self._owned_dirs)
+            state = f"open, {n} checkpoint{'s' if n != 1 else ''}"
+        return f"<CheckpointSession root_dir={str(self.root_dir)!r} {state}>"
+
+    # -- context manager (optional, for early deterministic cleanup) -----------
+
+    def __enter__(self) -> CheckpointSession:
+        with self._cond:
+            self._ensure_open()
+        return self
+
+    def __exit__(self, *exc: object) -> None:
+        self.close()
+
+    # -- lifecycle -------------------------------------------------------------
+
+    def close(self, *, timeout: float | None = None) -> None:
+        """Close the session, wait for in-flight checkpoints, clean up files."""
+        with self._cond:
+            if self._closed:
+                return
+            self._closed = True
+
+            deadline = None if timeout is None else time.monotonic() + timeout
+            while self._active_ops > 0:
+                remaining = (
+                    None if deadline is None else max(0.0, deadline - time.monotonic())
+                )
+                if remaining is not None and remaining <= 0:
+                    logger.warning(
+                        "Timed out waiting for %d in-flight checkpoint(s); "
+                        "proceeding with cleanup",
+                        self._active_ops,
+                    )
+                    break
+                self._cond.wait(timeout=remaining)
+
+            owned = list(self._owned_dirs)
+
+        # Deactivate the finaliser so it won't double-clean.
+        # If the finaliser has already fired (e.g. at GC), detach() is a
+        # no-op, and the rmtree calls below are idempotent.
+        if self._finaliser is not None:
+            self._finaliser.detach()
+
+        if not self.cleanup:
+            return
+
+        if self._auto_root:
+            shutil.rmtree(self.root_dir, ignore_errors=True)
+        else:
+            for d in owned:
+                shutil.rmtree(d, ignore_errors=True)
+
+    # -- collection protocol ---------------------------------------------------
+
+    def __getitem__(self, name: str) -> pl.LazyFrame:
+        with self._cond:
+            self._ensure_open()
+        slug = _normalise_name(name)
+        path = self.root_dir / slug / "data.parquet"
+        if not path.exists():
+            with self._cond:
+                available = [
+                    d.name for d in self._owned_dirs if (d / "data.parquet").exists()
+                ]
+            raise KeyError(f"No checkpoint named {name!r}. Available: {available}")
+        return pl.scan_parquet(path, **self.default_scan_kwargs)
+
+    def __contains__(self, name: object) -> bool:
+        with self._cond:
+            self._ensure_open()
+        if not isinstance(name, str):
+            return False
+        try:
+            slug = _normalise_name(name)
+        except ValueError:
+            return False
+        return (self.root_dir / slug / "data.parquet").exists()
+
+    def __len__(self) -> int:
+        with self._cond:
+            self._ensure_open()
+            return len(self._owned_dirs)
+
+    def __iter__(self) -> Iterator[str]:
+        with self._cond:
+            self._ensure_open()
+            dirs = list(self._owned_dirs)
+        for d in dirs:
+            yield d.name
+
+    # -- introspection ---------------------------------------------------------
+
+    def summary(self) -> pl.DataFrame:
+        """Return a DataFrame listing all live checkpoints and their sizes."""
+        # Should maybe make it a .show() method that prints a table?
+        with self._cond:
+            self._ensure_open()
+            dirs = list(self._owned_dirs)
+        rows: list[dict[str, Any]] = []
+        for d in dirs:
+            p = d / "data.parquet"
+            if p.exists():
+                rows.append(
+                    {
+                        "name": d.name,
+                        "size_mb": round(p.stat().st_size / 1_048_576, 2),
+                        "path": str(p),
+                    }
+                )
+        if not rows:
+            return pl.DataFrame(
+                schema={
+                    "name": pl.Utf8,
+                    "size_mb": pl.Float64,
+                    "path": pl.Utf8,
+                },
+            )
+        return pl.DataFrame(rows)
+
+    # -- checkpointing ---------------------------------------------------------
+
+    def checkpoint(
+        self,
+        lf: pl.LazyFrame,
+        *,
+        name: str | None = None,
+        sink_kwargs: dict[str, Any] | None = None,
+        scan_kwargs: dict[str, Any] | None = None,
+        streaming: bool = True,
+    ) -> pl.LazyFrame:
+        """Materialise LazyFrame to a parquet checkpoint and return a lazy re-scan."""
+        with self._cond:
+            self._ensure_open()
+            checkpoint_dir = self._new_checkpoint_dir(name)
+            self._active_ops += 1
+
+        checkpoint_path = checkpoint_dir / "data.parquet"
+        sink_opts = {**self.default_sink_kwargs, **(sink_kwargs or {})}
+        scan_opts = {**self.default_scan_kwargs, **(scan_kwargs or {})}
+
+        ok = False
+        try:
+            self._materialise(lf, checkpoint_path, sink_opts, streaming)
+            ok = True
+        finally:
+            with self._cond:
+                if not ok:
+                    try:
+                        self._owned_dirs.remove(checkpoint_dir)
+                    except ValueError:
+                        pass
+                self._active_ops -= 1
+                self._cond.notify_all()
+            if not ok:
+                shutil.rmtree(checkpoint_dir, ignore_errors=True)
+
+        return pl.scan_parquet(checkpoint_path, **scan_opts)
+
+    # -- internals -------------------------------------------------------------
+
+    def _ensure_open(self) -> None:
+        if self._closed:
+            raise RuntimeError("CheckpointSession is closed")
+
+    def _new_checkpoint_dir(self, name: str | None) -> Path:
+        # Must be called under self._cond / self._lock.
+        slug = _normalise_name(name) if name is not None else uuid.uuid4().hex
+        path = self.root_dir / slug
+        try:
+            path.mkdir(parents=True, exist_ok=False)
+        except FileExistsError:
+            msg = f"Checkpoint directory {slug!r} already exists"
+            if name is not None and slug != name:
+                msg += f" (normalised from {name!r})"
+            raise FileExistsError(msg) from None
+        self._owned_dirs.append(path)
+        return path
+
+    @staticmethod
+    def _materialise(
+        lf: pl.LazyFrame,
+        path: Path,
+        sink_opts: dict[str, Any],
+        streaming: bool,
+    ) -> None:
+        mode = "streaming" if streaming else "collect"
+        logger.debug("Materialising checkpoint to %s (%s)", path, mode)
+        t0 = time.perf_counter()
+
+        if streaming:
+            try:
+                lf.sink_parquet(path, **sink_opts)
+            except Exception as exc:
+                raise RuntimeError(
+                    f"Checkpoint materialisation failed: {exc}\n\n"
+                    "If the streaming engine does not support this query "
+                    "plan, retry with streaming=False."
+                ) from exc
+        else:
+            lf.collect().write_parquet(path, **sink_opts)
+
+        elapsed = time.perf_counter() - t0
+        size_mb = path.stat().st_size / 1_048_576
+        logger.debug("Checkpoint written in %.2fs (%.1f MB)", elapsed, size_mb)
+
+
+# -- standalone function --------------------------------------------------------
+
+
+def checkpoint(
+    lf: pl.LazyFrame,
+    *,
+    session: CheckpointSession | None = None,
+    name: str | None = None,
+    sink_kwargs: dict[str, Any] | None = None,
+    scan_kwargs: dict[str, Any] | None = None,
+    streaming: bool = True,
+) -> pl.LazyFrame:
+    """Materialise LazyFrame to a parquet checkpoint."""
+    sess = session if session is not None else _get_default_session()
+    return sess.checkpoint(
+        lf,
+        name=name,
+        sink_kwargs=sink_kwargs,
+        scan_kwargs=scan_kwargs,
+        streaming=streaming,
+    )
+
+
+# -- default session -----------------------------------------------------------
+
+
+def _get_default_session() -> CheckpointSession:
+    """Lazily create a process-wide default session."""
+    global _default_session
+    with _default_session_lock:
+        if _default_session is None or _default_session._closed:
+            _default_session = CheckpointSession()
+        return _default_session
+
+
+# -- utilities -----------------------------------------------------------------
+
+
+def _normalise_name(name: str) -> str:
+    out = []
+    for ch in name:
+        if ch.isalnum() or ch in {"-", "_", "."}:
+            out.append(ch)
+        else:
+            out.append("_")
+    normalised = "".join(out).strip("._")
+    if not normalised:
+        raise ValueError(f"Checkpoint name {name!r} normalised to an empty string")
+    return normalised
@@ -0,0 +1,88 @@
+Metadata-Version: 2.4
+Name: polars-checkpoint
+Version: 0.1.0
+Summary: Add your description here
+Requires-Python: >=3.11.1
+Description-Content-Type: text/markdown
+Requires-Dist: polars>=1.39.3
+
+# polars-checkpoint
+
+Materialise Polars LazyFrames to parquet files and scan them back lazily. Defaults to the streaming engine for sink/scan. Useful for reducing memory pressure from expensive intermediate results in complex multi-step transforms.
+
+## Installation
+
+Requires `polars` 1.39.3 or newer and Python 3.11.1 or newer.
+
+## Quick Start
+
+```python
+import polars as pl
+from polars_checkpoint.polars_checkpoint import checkpoint
+
+lf = pl.LazyFrame({"x": range(1_000_000)}).with_columns(y=pl.col("x") * 2)
+
+# Materialises to a temp parquet file; returns a lazy re-scan
+lf = checkpoint(lf)
+
+lf.filter(pl.col("y") > 100).collect()
+```
+
+A process-wide default session manages the temp directory and cleans it up at exit.
+
+## Session API
+
+For explicit control over storage location and lifecycle, use `CheckpointSession`:
+
+```python
+from polars_checkpoint.polars_checkpoint import CheckpointSession
+
+# As a context manager: cleans up on exit from the block
+with CheckpointSession(root_dir="./my_checkpoints") as sess:
+    lf = pl.scan_csv("big.csv")
+    lf = sess.checkpoint(lf, name="after-parse")
+    lf = lf.filter(pl.col("status") == "active")
+    lf = sess.checkpoint(lf, name="filtered")
+```
+
+```python
+# Without a context manager: cleans up at GC or interpreter shutdown,
+# or when you call close() explicitly
+sess = CheckpointSession(root_dir="./my_checkpoints")
+lf = sess.checkpoint(pl.scan_csv("big.csv"), name="raw")
+reloaded = sess["raw"]
+print(sess.summary())
+
+sess.close()  # optional; triggers early cleanup
+```
+
+### `CheckpointSession` constructor
+
+| Parameter | Default | Description |
+|---|---|---|
+| `root_dir` | `None` (auto temp dir) | Parent directory for checkpoint folders. |
+| `cleanup` | `True` | Delete checkpoint files on close / GC / interpreter exit. |
+| `default_sink_kwargs` | `{"compression": "zstd"}` | Defaults passed to `sink_parquet` / `write_parquet`. |
+| `default_scan_kwargs` | `{}` | Defaults passed to `scan_parquet`. |
+
+### Key methods & features
+
+- **`checkpoint(lf, *, name=None, streaming=True, ...)`**: Materialise a LazyFrame to parquet and return a lazy re-scan of the file. Auto-generates a name if none is given. Uses `collect().write_parquet()` instead of `sink_parquet()` when `streaming=False`.
+- **`session[name]`**: Retrieve a checkpoint as a `LazyFrame`.
+- **`name in session`**: Check existence.
+- **`len(session)`** / **`iter(session)`**: Count / list checkpoints.
+- **`summary()`**: Returns a Polars DataFrame with the name, size (MB), and path of each checkpoint.
+- **`close(*, timeout=None)`**: Waits for in-flight writes, then cleans up. The session is also usable as a context manager.
+
+## Thread Safety
+
+Sessions are internally locked. Concurrent `checkpoint()` calls from multiple threads are safe; `close()` waits for all in-flight materialisations before removing files, unless its `timeout` expires first.
+
+## Cleanup Behaviour
+
+| Scenario | `cleanup=True` (default) | `cleanup=False` |
+|---|---|---|
+| `close()` / `__exit__` | Files deleted | Files retained |
+| GC / interpreter shutdown | Files deleted (via `weakref.finalize`) | Files retained |
+
+When `root_dir` is auto-generated, the entire temp directory is removed. When user-supplied, only the individual checkpoint subdirectories created by the session are removed.
@@ -0,0 +1,9 @@
+README.md
+pyproject.toml
+src/polars_checkpoint/__init__.py
+src/polars_checkpoint/polars_checkpoint.py
+src/polars_checkpoint.egg-info/PKG-INFO
+src/polars_checkpoint.egg-info/SOURCES.txt
+src/polars_checkpoint.egg-info/dependency_links.txt
+src/polars_checkpoint.egg-info/requires.txt
+src/polars_checkpoint.egg-info/top_level.txt
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+polars>=1.39.3
@@ -0,0 +1 @@
+polars_checkpoint