zoopipe 2026.1.20__cp310-abi3-macosx_11_0_arm64.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. zoopipe/__init__.py +72 -0
  2. zoopipe/engines/__init__.py +4 -0
  3. zoopipe/engines/base.py +45 -0
  4. zoopipe/engines/dask.py +225 -0
  5. zoopipe/engines/local.py +215 -0
  6. zoopipe/engines/ray.py +252 -0
  7. zoopipe/hooks/__init__.py +4 -0
  8. zoopipe/hooks/base.py +70 -0
  9. zoopipe/hooks/sql.py +94 -0
  10. zoopipe/input_adapter/__init__.py +24 -0
  11. zoopipe/input_adapter/arrow.py +38 -0
  12. zoopipe/input_adapter/base.py +48 -0
  13. zoopipe/input_adapter/csv.py +144 -0
  14. zoopipe/input_adapter/duckdb.py +54 -0
  15. zoopipe/input_adapter/excel.py +51 -0
  16. zoopipe/input_adapter/json.py +73 -0
  17. zoopipe/input_adapter/kafka.py +39 -0
  18. zoopipe/input_adapter/parquet.py +85 -0
  19. zoopipe/input_adapter/pygen.py +37 -0
  20. zoopipe/input_adapter/sql.py +103 -0
  21. zoopipe/manager.py +211 -0
  22. zoopipe/output_adapter/__init__.py +23 -0
  23. zoopipe/output_adapter/arrow.py +50 -0
  24. zoopipe/output_adapter/base.py +41 -0
  25. zoopipe/output_adapter/csv.py +71 -0
  26. zoopipe/output_adapter/duckdb.py +46 -0
  27. zoopipe/output_adapter/excel.py +42 -0
  28. zoopipe/output_adapter/json.py +66 -0
  29. zoopipe/output_adapter/kafka.py +39 -0
  30. zoopipe/output_adapter/parquet.py +49 -0
  31. zoopipe/output_adapter/pygen.py +29 -0
  32. zoopipe/output_adapter/sql.py +43 -0
  33. zoopipe/pipe.py +263 -0
  34. zoopipe/protocols.py +37 -0
  35. zoopipe/py.typed +0 -0
  36. zoopipe/report.py +173 -0
  37. zoopipe/utils/__init__.py +0 -0
  38. zoopipe/utils/dependency.py +78 -0
  39. zoopipe/zoopipe_rust_core.abi3.so +0 -0
  40. zoopipe-2026.1.20.dist-info/METADATA +231 -0
  41. zoopipe-2026.1.20.dist-info/RECORD +43 -0
  42. zoopipe-2026.1.20.dist-info/WHEEL +4 -0
  43. zoopipe-2026.1.20.dist-info/licenses/LICENSE +21 -0
@@ -0,0 +1,231 @@
+ Metadata-Version: 2.4
+ Name: zoopipe
+ Version: 2026.1.20
+ Requires-Dist: pydantic>=2.0
+ Requires-Dist: dask[distributed]>=2026.1.1 ; extra == 'dask'
+ Requires-Dist: mkdocs>=1.6.1 ; extra == 'docs'
+ Requires-Dist: mkdocs-material>=9.7.1 ; extra == 'docs'
+ Requires-Dist: mkdocstrings[python]>=1.0.0 ; extra == 'docs'
+ Requires-Dist: ray>=2.53.0 ; extra == 'ray'
+ Provides-Extra: dask
+ Provides-Extra: docs
+ Provides-Extra: ray
+ License-File: LICENSE
+ Summary: ZooPipe is a data processing framework that allows you to process data in a declarative way.
+ Author-email: Alberto Daniel Badia <alberto_badia@enlacepatagonia.com>
+ Requires-Python: >=3.10, <3.14
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+ Project-URL: Homepage, https://github.com/albertobadia/zoopipe
+
+ <p align="center">
+ <picture>
+ <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-dark.svg">
+ <source media="(prefers-color-scheme: light)" srcset="docs/assets/logo-light.svg">
+ <img alt="ZooPipe Logo" src="docs/assets/logo-light.svg" width="600">
+ </picture>
+ </p>
+
+ **ZooPipe** is a lean, ultra-high-performance data processing engine for Python. It leverages a **100% Rust core** to handle I/O and orchestration, while keeping the flexibility of Python for schema validation (via Pydantic) and custom data enrichment (via Hooks).
+
+ <p align="center">
+ <a href="https://pypi.org/project/zoopipe/"><img alt="PyPI" src="https://img.shields.io/pypi/v/zoopipe"></a>
+ <img alt="Downloads" src="https://img.shields.io/pypi/dm/zoopipe">
+ <a href="https://github.com/albertobadia/zoopipe/actions/workflows/ci.yml"><img alt="CI" src="https://github.com/albertobadia/zoopipe/actions/workflows/ci.yml/badge.svg"></a>
+ <a href="https://zoopipe.readthedocs.io/"><img alt="ReadTheDocs" src="https://img.shields.io/readthedocs/zoopipe"></a>
+ </p>
+
+ ---
+
+ Read the [docs](https://zoopipe.readthedocs.io/) for more information.
+
+ ## ✨ Key Features
+
+ - 🚀 **100% Native Rust Engine**: The core execution loop, including CSV and JSON parsing/writing, is implemented in Rust for maximum throughput.
+ - 🔍 **Declarative Validation**: Use [Pydantic](https://docs.pydantic.dev/) models to define and validate your data structures naturally.
+ - 🪝 **Python Hooks**: Transform and enrich data at any stage using standard Python functions or classes.
+ - 🚨 **Automated Error Routing**: Native support for routing failed records to a dedicated error output.
+ - 📊 **Multiple Format Support**: Optimized readers/writers for CSV, JSONL, and SQL databases.
+ - 🔧 **Two-Tier Parallelism**: Orchestrate across processes or clusters with **Engines** (Local, Ray), and scale throughput at the node level with Rust **Executors**.
+ - ☁️ **Cloud Native**: Native S3 support and zero-config distributed execution on **Ray** clusters.
+
+ ---
+
+ ## ⚡ Performance & Benchmarks
+
+ Why ZooPipe? Because **vectorization isn't always the answer.**
+
+ Tools like **Pandas** and **Polars** are incredible for analytical workloads (groupby, sum, joins) where operations can be vectorized in C/Rust. However, real-world Data Engineering often involves "chaotic ETL": messy custom rules, API calls per row, hashing, conditional cleanup, and complex normalization that force you to drop down to Python loops.
+
+ **In these "Heavy ETL" scenarios, ZooPipe outperforms Vectorized DataFrames by 3x-8x.**
+
+ ![Benchmark Chart](docs/assets/benchmark.svg)
+
+ > **Key Takeaway**: ZooPipe's "Python-First Architecture" with parallel streaming (`PipeManager`) avoids the serialization overhead that cripples Polars/Pandas when using Python UDFs (`map_elements`/`apply`), and uses **97% less RAM**.
+
+ ### ⚖️ Is this unfair to Pandas/Polars?
+
+ **Yes and No.**
+
+ - **Unfair**: If your workload is purely analytical (e.g., `GROUP BY`, `SUM`, `JOIN`), **Polars and Pandas will likely destroy ZooPipe** because they can use vectorized C/Rust operations on whole columns at once.
+ - **Fair**: In real-world Data Engineering, many pipelines are "chaotic". They require custom hashing, API calls per row, conditional normalization, or complex Pydantic validation. **In these "Python-UDF heavy" scenarios, vectorization breaks down**, and ZooPipe shines by orchestrating parallel Python execution efficiently without the DataFrame overhead (see the sketch below).
+
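To make the "Python-UDF heavy" case concrete, here is a minimal sketch of the kind of per-row work that forces Pandas/Polars into `apply`/`map_elements`, written as a ZooPipe hook. The hook signature mirrors the Quick Example further down; the `email` and `country` fields are purely illustrative.

```python
import hashlib

from zoopipe import BaseHook


class AnonymizeHook(BaseHook):
    """Per-row logic that resists vectorization: hashing plus conditional cleanup."""

    def execute(self, entries, store):
        for entry in entries:
            row = entry["raw_data"]  # hypothetical fields: email, country
            # Hash PII so downstream systems never see the raw value.
            row["email_hash"] = hashlib.sha256(row.pop("email", "").encode()).hexdigest()
            # Conditional normalization that would otherwise become a Python UDF.
            if row.get("country", "").strip().lower() in {"us", "usa", "united states"}:
                row["country"] = "US"
        return entries
```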
+ ### ❓ When to use what?
+
+ | Use **ZooPipe** When... | Use **Pandas / Polars** When... |
+ |---|---|
+ | 🏗️ You have complex, custom Python logic per row (hash, clean, validate). | 🧮 You are doing aggregations (SUM, AVG) or Relational Algebra (JOIN, GROUP BY). |
+ | 🔄 You are processing streaming data or files larger than RAM. | 💾 Your dataset fits comfortably in RAM (or use LazyFrames). |
+ | 🛡️ You need strict schema validation (Pydantic) and error handling. | 🔬 You are doing data exploration or statistical analysis. |
+ | 🚀 You want to mix Rust I/O performance with Python flexibility. | ⚡ Your entire pipeline can be expressed in vectorized expressions. |
+
+
+ ---
+
+ ## 🚀 Quick Start
+
+ ### Installation
+
+ ```bash
+ pip install zoopipe
+ ```
+ Or using uv:
+ ```bash
+ uv add zoopipe
+ ```
+ Or from source (uv recommended):
+ ```bash
+ uv build
+ uv run maturin develop --release
+ ```
+
+ ### Simple Example
+
+ ```python
+ from pydantic import BaseModel, ConfigDict
+ from zoopipe import CSVInputAdapter, CSVOutputAdapter, Pipe
+
+
+ class UserSchema(BaseModel):
+     model_config = ConfigDict(extra="ignore")
+     user_id: str
+     username: str
+     email: str
+
+
+ pipe = Pipe(
+     input_adapter=CSVInputAdapter("users.csv"),
+     output_adapter=CSVOutputAdapter("processed_users.csv"),
+     error_output_adapter=CSVOutputAdapter("errors.csv"),
+     schema_model=UserSchema,
+ )
+
+ pipe.start()
+ pipe.wait()
+
+
+ print(f"Finished! Processed {pipe.report.total_processed} items.")
+ ```
+
+ Automatically split large files or manage multiple independent workflows:
+
+ ```python
+ from zoopipe import PipeManager, MultiProcessEngine, Pipe
+
+ # Create your pipe as usual (Pipe is purely declarative)
+ pipe = Pipe(...)
+
+ # Automatically parallelize across 4 workers
+ # MultiProcessEngine() for local, RayEngine() for clusters
+ manager = PipeManager.parallelize_pipe(
+     pipe,
+     workers=4,
+     engine=MultiProcessEngine()
+ )
+ manager.start()
+ manager.wait()
+ ```
+
+ ---
+
+ ## 📚 Documentation
+
+ ### Core Concepts
+
+
+ #### Hooks
+
+ Hooks are Python classes that allow you to intercept, transform, and enrich data at different stages of the pipeline.
+
+ **[📘 Read the full Hooks Guide](docs/hooks.md)** to learn about lifecycle methods (`setup`, `execute`, `teardown`), state management, and advanced patterns like cursor pagination.
+
+ #### Quick Example
+
+ ```python
+ from zoopipe import BaseHook
+
+ class MyHook(BaseHook):
+     def execute(self, entries, store):
+         for entry in entries:
+             entry["raw_data"]["checked"] = True
+         return entries
+ ```
+
+ > [!IMPORTANT]
+ > If you are using a `schema_model`, the pipeline will output the contents of `validated_data` for successful records.
+ > - To modify data **before** validation, use `pre_validation_hooks` and modify `entry["raw_data"]`.
+ > - To modify data **after** validation (and ensure it reaches the output), use `post_validation_hooks` and modify `entry["validated_data"]`.
+
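As a rough sketch of how the lifecycle methods and the pre/post-validation split fit together: the `setup`/`teardown` signatures and the `post_validation_hooks=` keyword below are assumptions inferred from the note above, so treat the Hooks Guide as the authoritative reference.

```python
from pydantic import BaseModel

from zoopipe import BaseHook, CSVInputAdapter, CSVOutputAdapter, Pipe


class UserSchema(BaseModel):
    user_id: str
    email: str


class EnrichHook(BaseHook):
    def setup(self, store):
        # Runs once before processing; `store` keeps state across batches (signature assumed).
        store["seen"] = 0

    def execute(self, entries, store):
        for entry in entries:
            store["seen"] += 1
            # Post-validation hook, so write to validated_data to reach the output.
            entry["validated_data"]["row_number"] = store["seen"]
        return entries

    def teardown(self, store):
        # Runs once after processing (signature assumed).
        print(f"Enriched {store['seen']} rows")


pipe = Pipe(
    input_adapter=CSVInputAdapter("users.csv"),
    output_adapter=CSVOutputAdapter("enriched_users.csv"),
    schema_model=UserSchema,
    post_validation_hooks=[EnrichHook()],  # keyword assumed from the note above
)
pipe.start()
pipe.wait()
```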
+ #### Executors
+
+ Executors control how ZooPipe scales **up** within a single node using Rust-managed threads. They are the engine under the hood that drives high throughput.
+
+ **[📘 Read the full Executors Guide](docs/executors.md)** to understand the difference between `SingleThreadExecutor` (debug/ordered) and `MultiThreadExecutor` (high-throughput).
+
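For orientation, here is a hypothetical sketch of choosing between the two executors. It assumes both classes are importable from `zoopipe` and that `Pipe` accepts an `executor=` argument; neither detail is confirmed by this README, so check the Executors Guide for the real wiring.

```python
from zoopipe import CSVInputAdapter, CSVOutputAdapter, Pipe
from zoopipe import MultiThreadExecutor, SingleThreadExecutor  # import path assumed

# Debugging or strictly ordered output: a single Rust worker thread.
debug_pipe = Pipe(
    input_adapter=CSVInputAdapter("events.csv"),
    output_adapter=CSVOutputAdapter("events_debug.csv"),
    executor=SingleThreadExecutor(),  # keyword assumed
)

# Production throughput: multiple Rust-managed threads within the node.
fast_pipe = Pipe(
    input_adapter=CSVInputAdapter("events.csv"),
    output_adapter=CSVOutputAdapter("events_fast.csv"),
    executor=MultiThreadExecutor(),  # keyword assumed
)
```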
+ ### Input/Output Adapters
+
+ #### File Formats
+
+ - [**CSV Adapters**](docs/csv.md) - High-performance CSV reading and writing
+ - [**JSON Adapters**](docs/json.md) - JSONL and JSON array format support
+ - [**Excel Adapters**](docs/excel.md) - Read and write Excel (.xlsx) files
+ - [**Parquet Adapters**](docs/parquet.md) - Columnar storage for analytics and data lakes
+ - [**Arrow Adapters**](docs/arrow.md) - Apache Arrow IPC format for high-throughput interoperability
+
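The file adapters are meant to be interchangeable, so converting between formats comes down to swapping adapters in the same declarative `Pipe`. Only the CSV adapter names appear earlier in this README; `JSONInputAdapter` and `ParquetOutputAdapter` below are assumed from that naming pattern and the package's module layout.

```python
from zoopipe import Pipe
from zoopipe import JSONInputAdapter, ParquetOutputAdapter  # class names assumed

# JSONL in, Parquet out: same Pipe shape as the CSV example above.
pipe = Pipe(
    input_adapter=JSONInputAdapter("events.jsonl"),
    output_adapter=ParquetOutputAdapter("events.parquet"),
)
pipe.start()
pipe.wait()
```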
+ #### Databases
+
+ - [**SQL Adapters**](docs/sql.md) - Read from and write to SQL databases with batch optimization
+ - [**SQL Pagination**](docs/sql.md#sqlpaginationinputadapter) - High-performance cursor-style pagination for large tables
+ - [**DuckDB Adapters**](docs/duckdb.md) - Analytical database for OLAP workloads
+
+ #### Messaging Systems
+
+ - [**Kafka Adapters**](docs/kafka.md) - High-throughput messaging
+
+ #### Advanced
+
+ - [**Python Generator Adapters**](docs/pygen.md) - In-memory streaming and testing
+ - [**Cloud Storage (S3)**](docs/cloud-storage.md) - Read and write data from Amazon S3 and compatible services
+ - [**PipeManager**](docs/pipemanager.md) - Run multiple pipes in parallel for distributed processing
+ - [**Ray Guide**](docs/ray.md) - Zero-config distributed execution on Ray clusters
+
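Scaling the earlier `PipeManager` example from local processes to a Ray cluster should come down to swapping the engine. A sketch under two assumptions: `RayEngine` is exported next to `MultiProcessEngine`, and the `ray` extra (`pip install "zoopipe[ray]"`) is installed.

```python
from zoopipe import Pipe, PipeManager
from zoopipe import RayEngine  # export location assumed

pipe = Pipe(...)  # declare input/output adapters as usual

# Same parallelize_pipe API as MultiProcessEngine; workers run on the Ray cluster.
manager = PipeManager.parallelize_pipe(
    pipe,
    workers=8,
    engine=RayEngine(),
)
manager.start()
manager.wait()
```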
+ ---
+
+ ## 🛠 Architecture
+
+ ZooPipe is designed as a thin Python wrapper around a powerful Rust core, featuring a two-tier parallel architecture:
+
+ 1. **Orchestration Tier (Python Engines)**:
+    - Manages distribution across processes or nodes (e.g., `MultiProcessEngine`).
+    - Handles data sharding, process lifecycle, and metrics aggregation.
+ 2. **Execution Tier (Rust BatchExecutors)**:
+    - **Internal Throughput**: High-speed processing within a single process.
+    - **Adapters**: Native CSV/JSON/SQL Readers and Writers.
+    - **NativePipe**: Orchestrates the loop, fetching chunks and routing result batches.
+    - **Executors**: Multi-threaded Rust strategies to bypass the GIL within a node.
+
+ ---
+
+ ## 📄 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
@@ -0,0 +1,43 @@
+ zoopipe/__init__.py,sha256=gPq396dNUZVk1-bx23FvllBRHhywPxTHpjGnnLSWOlE,2487
+ zoopipe/engines/__init__.py,sha256=nOGzFAnjCxhJwIRENfenKzR75EsWB9iVSYI5DfYedfQ,145
+ zoopipe/engines/base.py,sha256=7ZRrlm_84OSKNapkLYlkdRxdRwKhQIDB-jCGRNhvbo0,1147
+ zoopipe/engines/dask.py,sha256=jIuGeV_6KSoQkZvNirPLbcwgMHdvSXIu9cFGrbYZTrY,7327
+ zoopipe/engines/local.py,sha256=6sp-Rn6MrvLo1pHRK2KERkrMHXPpsaXMaNtQcLALEeQ,6923
+ zoopipe/engines/ray.py,sha256=5EgYRlY5nzSAkVXUt0q6zkfOCOO6qb9M0HaTcjy3UFk,8907
+ zoopipe/hooks/__init__.py,sha256=H6nLNAZVIPULIGpPbCfQ74sjF4kexj-f0BI5cVbkVOI,185
+ zoopipe/hooks/base.py,sha256=YgXazJ0jbGe971XMet6WEFkEfezPBjK4pxqjse6nzTw,1854
+ zoopipe/hooks/sql.py,sha256=VS7r2xFe90dif2Tvhhw0VFMU7guUnlnZk81CA48wpOU,2868
+ zoopipe/input_adapter/__init__.py,sha256=Ravsb2U4SASvOMcLpkYZeW-Iq6ITp5xzLm_ryULXIew,908
+ zoopipe/input_adapter/arrow.py,sha256=Fovs9XjMk2Qw5FPGBJm3En57ctnEjTS2vIJiO15YGKo,950
+ zoopipe/input_adapter/base.py,sha256=Mbhb4oKqK1mXXbcFCeD0cf8RvLJBXRxQtHCD-2dbWg8,1440
+ zoopipe/input_adapter/csv.py,sha256=0HHVfFfpL_K_xle4LS6DZuXgD2JJDcq_SeA40P2tAiQ,4882
+ zoopipe/input_adapter/duckdb.py,sha256=_wTjsDvrw8yTL7OTbA9ejXVbh2CH5ZhPGNo7ZayG-Jk,1634
+ zoopipe/input_adapter/excel.py,sha256=ZHkl9dDguYzdu-ExF1kM0lUScoJnwoBtFWAOZTdqSYU,1635
+ zoopipe/input_adapter/json.py,sha256=ZF_aa6oFSBd0WzkYpVAGgu3kAyx9jFPZufie2uGCTng,2140
+ zoopipe/input_adapter/kafka.py,sha256=AnarmVzrkgWhAgerKKzQUgNE3_62YiQgpLJ-hGe2Al0,1052
+ zoopipe/input_adapter/parquet.py,sha256=56zyTT8H7V-yablPkTCU-7WlllVR4o6xPmnS1_Ca-08,2600
+ zoopipe/input_adapter/pygen.py,sha256=Jfy9uefNSiKZYgY0gH6nyeMYbA_NRZ9mMBLNcx9iXjc,996
+ zoopipe/input_adapter/sql.py,sha256=9lN06AgWwA94BkGopKuwu2y9QL-kxNZlJ7H3_f8Q4qw,3243
+ zoopipe/manager.py,sha256=XN7EVzkr9yA_XRXDO2uCg35C_IkvP-A7CXL5Xj7XxVk,7195
+ zoopipe/output_adapter/__init__.py,sha256=eX09URcg0wOQtftq8BTWQodCr690SyyJKP59zKRynYA,878
+ zoopipe/output_adapter/arrow.py,sha256=6WaB7zPIFO_QxyCayn3bQqKVHqD6PPT9wp2jt-wZuCY,1449
+ zoopipe/output_adapter/base.py,sha256=Zq172oMzI7Jhvz9E3t0bB0DNWv1ECU47Q2MFih3-BgE,1206
+ zoopipe/output_adapter/csv.py,sha256=3l1a121ttOK6IlGAeBS0l55m0BNjmDjzX5DLCfjREoI,2136
+ zoopipe/output_adapter/duckdb.py,sha256=l78gdrqSG1au5SC0hp0QYoxTICOra6TXPYs5MGGfmIM,1303
+ zoopipe/output_adapter/excel.py,sha256=DU2lEHHpiuDYT9pgs1jbPCtWxq6tN5t0e73-pGacXZM,1186
+ zoopipe/output_adapter/json.py,sha256=OTJv-rogiIm_5j3pozl7LaZSOt3rCON4pAx1Ab9d11U,1984
+ zoopipe/output_adapter/kafka.py,sha256=dz128E17hDsgRC8oOemitsQ8HPAkaqnhj9zXb9EYlmk,993
+ zoopipe/output_adapter/parquet.py,sha256=ygY65JlAswEuLqs84k9SoyZPBHvRHHao0lO7f5yAg5o,1350
+ zoopipe/output_adapter/pygen.py,sha256=_e3ps55U7IzKR67JsPuNyhpWVBkDgxenKLkDI6lz5FM,811
+ zoopipe/output_adapter/sql.py,sha256=MDDoWnlkGtrlGm3wJZ1U-yOdgQNJXuuFvg6Ki9-7pk4,1135
+ zoopipe/pipe.py,sha256=s86DaxAByNlzygond1BxhIRDruNGutJJK0viORdewDk,9740
+ zoopipe/protocols.py,sha256=kK6dhHLhUchUaGjfzuJYsDCGBxoQhZSCBxIP69yVlH4,967
+ zoopipe/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ zoopipe/report.py,sha256=-fvwDgWNrtwzdynmwCouxvrUEr_M6i10Nip8fiztKkk,5250
+ zoopipe/utils/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ zoopipe/utils/dependency.py,sha256=Ms2V-hR8jtYcbe3Fgnc2qfy31XJw3aIZprYz2j878Zg,2051
+ zoopipe/zoopipe_rust_core.abi3.so,sha256=3HX8XeDN2xBXfiAPEfLprJjhSTwv_sdo5pzWK-Zr6Kk,65972832
+ zoopipe-2026.1.20.dist-info/METADATA,sha256=BiaX3FUmpbppYLRn-u7ilKbEBGGeP13ql5BxGsEwMaQ,9591
+ zoopipe-2026.1.20.dist-info/WHEEL,sha256=vZ12AMAE5CVtd8oYbYGrz3omfHuIZCNO_3P50V00s00,104
+ zoopipe-2026.1.20.dist-info/licenses/LICENSE,sha256=4WRhonN0HErkcdxwCRoaBxdR4suhdVxZj_14XXMVgtw,1077
+ zoopipe-2026.1.20.dist-info/RECORD,,
@@ -0,0 +1,4 @@
+ Wheel-Version: 1.0
+ Generator: maturin (1.11.5)
+ Root-Is-Purelib: false
+ Tag: cp310-abi3-macosx_11_0_arm64
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Alberto Daniel Badia
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.