cdmltrain 0.2.0__tar.gz
- cdmltrain-0.2.0/LICENSE +21 -0
- cdmltrain-0.2.0/PKG-INFO +316 -0
- cdmltrain-0.2.0/README.md +281 -0
- cdmltrain-0.2.0/cdmltrain/__init__.py +4 -0
- cdmltrain-0.2.0/cdmltrain/core.py +111 -0
- cdmltrain-0.2.0/cdmltrain/dataset.py +137 -0
- cdmltrain-0.2.0/cdmltrain/gpu_loader.py +179 -0
- cdmltrain-0.2.0/cdmltrain/src/fast_core.cpp +80 -0
- cdmltrain-0.2.0/cdmltrain/zstd_engine.py +103 -0
- cdmltrain-0.2.0/cdmltrain.egg-info/PKG-INFO +316 -0
- cdmltrain-0.2.0/cdmltrain.egg-info/SOURCES.txt +14 -0
- cdmltrain-0.2.0/cdmltrain.egg-info/dependency_links.txt +1 -0
- cdmltrain-0.2.0/cdmltrain.egg-info/requires.txt +11 -0
- cdmltrain-0.2.0/cdmltrain.egg-info/top_level.txt +1 -0
- cdmltrain-0.2.0/setup.cfg +4 -0
- cdmltrain-0.2.0/setup.py +59 -0
cdmltrain-0.2.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 prem85642

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
cdmltrain-0.2.0/PKG-INFO
ADDED
@@ -0,0 +1,316 @@
Metadata-Version: 2.4
Name: cdmltrain
Version: 0.2.0
Summary: Stream ML datasets from ZIP/ZSTD archives into PyTorch without disk extraction.
Home-page: https://github.com/prem85642/cdmltrain
Author: prem85642
Author-email: your.email@domain.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Pillow>=8.0.0
Provides-Extra: zstd
Requires-Dist: zstandard>=0.20.0; extra == "zstd"
Provides-Extra: gpu
Requires-Dist: torch>=1.9.0; extra == "gpu"
Provides-Extra: full
Requires-Dist: zstandard>=0.20.0; extra == "full"
Requires-Dist: torch>=1.9.0; extra == "full"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# cdmltrain 🚀
### Stream ML Datasets Directly from ZIP / ZSTD Archives — No Extraction. No Wasted Storage. Zero OOM.

[](LICENSE)
[](https://www.python.org/)
[]()

---

## 🔥 The Problem This Solves

Every Data Scientist / ML Engineer hits this wall:

```
"Your 100 GB Kaggle dataset is a ZIP file.
Extracting it takes 2 hours and needs 300 GB of free disk space.
Your Colab/Kaggle notebook crashes with Out-of-Memory errors."
```

**`cdmltrain` eliminates this problem entirely.**

It lets PyTorch read images, audio, text, CSV, JSON — **any data** — directly from a compressed archive **into RAM**, skipping disk extraction completely.

---

## ✨ Key Features

| Feature | Description |
|---|---|
| 🗜️ **Format Agnostic** | Images, audio (.wav), text, CSV, JSON, binary — all supported |
| ⚡ **O(1) Random Access** | ZIP Central Directory indexing — jumps to any file instantly |
| 🧠 **Memory-Safe Cache** | Custom LRU cache enforces strict RAM limits — zero OOM crashes |
| 🔒 **Thread Safe** | Concurrent reads for PyTorch `DataLoader(num_workers=N)` |
| 🔧 **3-Tier Architecture** | Auto-selects best engine based on your hardware |
| 🏎️ **ZSTD Support** | `.tar.zst` archives — 25x faster than ZIP deflate |
| 🎮 **GPU Direct Loader** | Streams data to CUDA VRAM with async prefetch |
| 🌍 **Cross-Platform** | Windows, Linux, macOS — works everywhere |

---
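The "O(1) Random Access" row in the feature table refers to the ZIP central directory, which lists every member of the archive so a reader can jump to one file without scanning the rest. The sketch below illustrates that idea using only Python's standard `zipfile` module. The helper names `build_index` and `read_member` are hypothetical and are not part of the cdmltrain API; how the package implements its index internally is not shown in this diff.

```python
import io
import zipfile


def build_index(archive: zipfile.ZipFile) -> dict:
    """Map member name -> ZipInfo from the central directory (no payload is read)."""
    return {info.filename: info for info in archive.infolist() if not info.is_dir()}


def read_member(archive: zipfile.ZipFile, index: dict, name: str) -> bytes:
    """Decompress a single member straight into memory."""
    return archive.read(index[name])


# Build a small archive in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(5):
        zf.writestr(f"sample_{i}.txt", f"payload {i}")

with zipfile.ZipFile(buffer) as zf:
    index = build_index(zf)
    payload = read_member(zf, index, "sample_3.txt")  # only this member is decompressed

print(payload)  # b'payload 3'
```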

## 🏗️ Architecture (3 Tiers — Auto-Selected)

```
Your ZIP/ZSTD File
        │
        ▼
┌─────────────────────────────────────────────────────┐
│  Tier 1: Python CoreStreamEngine                    │
│  ✅ Works on ANY machine, no dependencies           │
│  → O(1) Index + LRU Cache + Thread Safety           │
├─────────────────────────────────────────────────────┤
│  Tier 2: C++ FastCoreEngine (pybind11)              │
│  ✅ Auto-enabled if C++ Build Tools installed       │
│  → Bypasses Python GIL, faster multi-worker reads  │
├─────────────────────────────────────────────────────┤
│  Tier 3: ZSTD Engine + GPU Direct Loader            │
│  ✅ pip install zstandard (for .tar.zst files)      │
│  ✅ NVIDIA GPU (for direct VRAM streaming)          │
│  → 25x faster decompression + near-zero GPU latency │
└─────────────────────────────────────────────────────┘
        │
        ▼
PyTorch DataLoader → Model Training
```

**The library auto-detects your hardware and picks the best tier. You write the same code regardless.**

---
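Tier auto-selection of the kind described above is commonly built on optional imports: try the fastest backend first and fall back when it is not installed. The following is a minimal sketch of that pattern, not the package's actual selection logic. The function `pick_engine` and the module name `_cdml_hypothetical_fast_core` are made up for illustration.

```python
import importlib


def pick_engine(candidates):
    """Return the first importable module name, or 'pure-python' if none import.

    `candidates` is an ordered list of module names, fastest first.
    """
    for module_name in candidates:
        try:
            importlib.import_module(module_name)
        except ImportError:
            continue  # Optional dependency missing: try the next tier.
        return module_name
    return "pure-python"


# The first name is deliberately bogus (hypothetical compiled extension);
# "zipfile" is a standard-library module that is always importable.
print(pick_engine(["_cdml_hypothetical_fast_core", "zipfile"]))  # zipfile
print(pick_engine(["_cdml_hypothetical_fast_core"]))             # pure-python
```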

## 📦 Installation

> **Note:** PyPI release coming soon. Install locally for now.

```bash
# Step 1: Clone the repo
git clone https://github.com/prem85642/cdmltrain.git
cd cdmltrain

# Step 2: Install
pip install .

# Step 3 (Optional): ZSTD support for .tar.zst archives
pip install zstandard
```

> **C++ Acceleration (Tier 2):**
> - Windows: Install [Microsoft C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
> - Linux: `sudo apt-get install build-essential`
> - If skipped: pure-Python engine runs automatically — no errors.

---

## 🚀 Quick Start

### Basic Usage (ZIP — Any Data Type)
```python
from cdmltrain import CDMLStreamDataset
from torch.utils.data import DataLoader

# Point directly to your ZIP — no extraction needed!
dataset = CDMLStreamDataset(
    zip_path="my_dataset.zip",
    max_cache_mb=2048  # RAM cache limit (MB)
)

dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in dataloader:
    # Train your model normally
    pass
```

### Image Dataset (with PyTorch Transforms)
```python
from cdmltrain import CDMLStreamDataset
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])
])

dataset = CDMLStreamDataset("images.zip", transform=transform, is_image=True)
```

### ZSTD Archive (25x Faster Decompression) ⚡
```python
# Just change the file extension — everything else is identical!
dataset = CDMLStreamDataset("my_dataset.tar.zst", max_cache_mb=2048)
```

### GPU Direct Loader (NVIDIA VRAM Streaming) 🎮
```python
from cdmltrain.gpu_loader import GPUDirectLoader

loader = GPUDirectLoader(dataset, batch_size=32)
# Auto-detects GPU. Falls back to CPU if no GPU found.

for batch in loader:
    output = model(batch.float())  # batch already on GPU VRAM!
```

---
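With `DataLoader(num_workers=N)`, each worker runs in its own process, so an archive handle opened in the parent process should not be shared across workers. One common pattern for archive-backed datasets is to open the file lazily, once per process. The sketch below assumes a plain map-style dataset (`__len__` plus `__getitem__`) and uses only the standard library. `LazyZipDataset` is a hypothetical class, and nothing here is confirmed to match how `CDMLStreamDataset` handles workers.

```python
import os
import tempfile
import zipfile


class LazyZipDataset:
    """Map-style dataset that opens the ZIP lazily, once per process."""

    def __init__(self, zip_path):
        self.zip_path = zip_path
        with zipfile.ZipFile(zip_path) as zf:
            # Keep only the member names so the object stays cheap to copy.
            self.names = [i.filename for i in zf.infolist() if not i.is_dir()]
        self._handle = None
        self._owner_pid = None

    def _archive(self):
        # Reopen when this process (for example a forked worker) has no handle of its own.
        if self._handle is None or self._owner_pid != os.getpid():
            self._handle = zipfile.ZipFile(self.zip_path)
            self._owner_pid = os.getpid()
        return self._handle

    def __getstate__(self):
        # Drop the open handle when the dataset is pickled for a spawned worker.
        state = self.__dict__.copy()
        state["_handle"] = None
        state["_owner_pid"] = None
        return state

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        return self._archive().read(self.names[idx])


# Self-contained demo on a temporary archive.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.zip")
with zipfile.ZipFile(demo_path, "w") as zf:
    for i in range(3):
        zf.writestr(f"item_{i}.bin", bytes([i]) * 4)

dataset = LazyZipDataset(demo_path)
print(len(dataset), dataset[2])
```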

## ⚙️ Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `zip_path` | `str` | *(required)* | Path to `.zip` or `.tar.zst` file |
| `transform` | `callable` | `None` | PyTorch/torchvision transform |
| `is_image` | `bool` | `False` | `True` enables PIL image decoding |
| `max_cache_mb` | `int` | `2048` | Max RAM for caching (MB) |

**`max_cache_mb` Guide:**

| Your RAM | Recommended |
|---|---|
| 4 GB (Colab Free) | `512` |
| 8 GB (Laptop) | `2048` |
| 16 GB (PC) | `6000` |
| 32 GB+ (Server) | `16000` |

---
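A cache limit such as `max_cache_mb` implies eviction by total bytes held, not by number of entries. The sketch below shows one way to build a byte-budgeted LRU cache on top of `collections.OrderedDict`. `ByteBudgetLRU` is a hypothetical class written for illustration, and the conversion `max_bytes = max_cache_mb * 1024 * 1024` is an assumption; the package's own cache in `core.py` may behave differently.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """LRU cache bounded by total payload bytes rather than entry count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items = OrderedDict()

    def get(self, key):
        value = self._items.get(key)
        if value is not None:
            self._items.move_to_end(key)  # Mark as most recently used.
        return value

    def put(self, key, value):
        if len(value) > self.max_bytes:
            return  # Larger than the whole budget: serve it uncached.
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.used_bytes += len(value)
        while self.used_bytes > self.max_bytes:
            _, evicted = self._items.popitem(last=False)  # Evict the oldest entry first.
            self.used_bytes -= len(evicted)


cache = ByteBudgetLRU(max_bytes=1 * 1024 * 1024)  # 1 MB budget
for i in range(8):
    cache.put(f"file_{i}", bytes(256 * 1024))  # 2 MB of data in total

print(cache.used_bytes <= cache.max_bytes)  # True
print(cache.get("file_0"))                  # None: evicted to stay within budget
```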

## 📊 Benchmarks (Real Tests)

### ZSTD vs ZIP Speed (200 files × 10KB)
| Engine | Speed | Speedup |
|---|---|---|
| ZIP (Deflate) — Tier 1/2 | 29,204 files/sec | baseline |
| ZSTD — Tier 3 | **739,653 files/sec** | **🔥 25x faster** |

### GPU Direct Loader (Google Colab T4)
| Metric | Value |
|---|---|
| GPU | Tesla T4 (15.6 GB VRAM) |
| Batch device | `cuda:0` — data streamed to VRAM |
| Throughput | **17,512 items/sec** |
| Epoch time (100 items) | 0.0045s |

### Memory Safety Test
| Test | Result |
|---|---|
| Cache limit: 1MB, Data: 2MB | ✅ Stayed under 1MB |
| Thread safety: 8 workers | ✅ Zero race conditions |
| Corrupted ZIP | ✅ Rejected cleanly |
| 50MB single file | ✅ Byte-exact in 0.108s |
| 1000-file archive | ✅ Indexed in 0.04s |

---
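A thread-safety check like the "8 workers" row above can be approximated with the standard library alone: read the same archive from several threads and compare every result against known contents. The sketch below is an illustration only, not the project's test suite. It guards the shared handle with an explicit `threading.Lock`, which is a conservative choice, and the names `locked_read` and `archive_lock` are hypothetical.

```python
import io
import threading
import zipfile
from concurrent.futures import ThreadPoolExecutor

# Build a small in-memory archive with known contents.
expected = {f"f{i}.bin": bytes([i % 256]) * 2048 for i in range(50)}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in expected.items():
        zf.writestr(name, data)

archive = zipfile.ZipFile(buffer)
archive_lock = threading.Lock()


def locked_read(name):
    # Only one thread at a time touches the shared file object.
    with archive_lock:
        return name, archive.read(name)


# 8 worker threads, each member requested four times.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(locked_read, list(expected) * 4))

archive.close()

mismatches = [name for name, data in results.items() if data != expected[name]]
print(len(results), mismatches)  # 50 []
```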

## 🐛 Debugging / Inspection

Inspect any specific file without unzipping:
```python
dataset.extract_sample_to_disk(idx=42, export_path="./inspection/")
```

---
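Exporting one sample for inspection amounts to copying a single archive member to disk while leaving the rest compressed. The following sketch shows that operation with the standard `zipfile` module. `export_member` is a hypothetical helper whose parameters only loosely mirror `extract_sample_to_disk(idx, export_path)`; the real method's behavior is not documented beyond the call shown above.

```python
import os
import tempfile
import zipfile


def export_member(zip_path, idx, export_dir):
    """Write the idx-th file member of the archive into export_dir and return its path."""
    with zipfile.ZipFile(zip_path) as zf:
        members = [i for i in zf.infolist() if not i.is_dir()]
        info = members[idx]
        os.makedirs(export_dir, exist_ok=True)
        # Use only the base name so a crafted member path cannot escape export_dir.
        target = os.path.join(export_dir, os.path.basename(info.filename))
        with zf.open(info) as src, open(target, "wb") as dst:
            dst.write(src.read())
    return target


# Self-contained demo on a temporary two-file archive.
work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, "demo.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("train/a.txt", "alpha")
    zf.writestr("train/b.txt", "beta")

exported = export_member(archive_path, idx=1, export_dir=os.path.join(work_dir, "inspection"))
print(os.path.basename(exported))  # b.txt
```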

## 🛠️ Troubleshooting

### `ModuleNotFoundError: No module named 'cdmltrain'`
```bash
git clone https://github.com/prem85642/cdmltrain.git && cd cdmltrain && pip install .
```

### `ModuleNotFoundError: No module named 'zstandard'`
```bash
pip install zstandard
```

### `Microsoft Visual C++ 14.0 required` (Windows)
Install [C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/). Or skip — pure Python engine works fine.

### Out of Memory on Colab
```python
dataset = CDMLStreamDataset("data.zip", max_cache_mb=512)  # Reduce cache
```

### `Bad CRC-32` error
```bash
python -c "import zipfile; print(zipfile.ZipFile('file.zip').testzip())"
# None = healthy, anything else = re-download
```

### `PIL.UnidentifiedImageError`
```python
dataset = CDMLStreamDataset("data.zip", is_image=False)  # Not an image dataset
```

---
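The `Bad CRC-32` one-liner above raises an exception when the file is not a ZIP at all. A slightly more defensive variant, sketched here with the standard library only, also reports that case instead of crashing. `first_bad_member` is a hypothetical helper name and is not shipped with the package.

```python
import io
import zipfile


def first_bad_member(source):
    """Return None if every member passes its CRC check, else a short reason string."""
    try:
        with zipfile.ZipFile(source) as zf:
            bad = zf.testzip()  # Reads every member and verifies its CRC-32.
    except zipfile.BadZipFile as exc:
        return f"not a valid ZIP: {exc}"
    return None if bad is None else f"corrupt member: {bad}"


# A healthy archive built in memory, and a buffer that is not a ZIP at all.
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("ok.txt", "hello")

print(first_bad_member(good))  # None
print(first_bad_member(io.BytesIO(b"this is not a zip archive")))
```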

## 🖥️ OS Compatibility

| Feature | Windows | Linux | macOS |
|---|---|---|---|
| ZIP Engine (Tier 1) | ✅ | ✅ | ✅ |
| ZSTD Engine (Tier 3) | ✅ | ✅ | ✅ |
| GPU Direct Loader | ✅ | ✅ | ✅ |
| C++ Fast Engine (Tier 2) | ✅ pre-built | ✅ compile via `pip install .` | ✅ compile via `pip install .` |

---

## 📁 Project Structure

```
cdmltrain/
├── cdmltrain/
│   ├── __init__.py           # Package entry point
│   ├── core.py               # Tier 1: Python CoreStreamEngine
│   ├── dataset.py            # CDMLStreamDataset (auto-tier selection)
│   ├── zstd_engine.py        # Tier 3: ZSTD streaming engine
│   ├── gpu_loader.py         # Tier 3: GPU Direct Loader (CUDA pinned memory)
│   └── src/
│       └── fast_core.cpp     # Tier 2: C++ FastCoreEngine (pybind11)
├── demo.py                   # Quickstart demo
├── quickstart.ipynb          # Jupyter Notebook tutorial
├── test_enterprise_audit.py  # Enterprise QA suite (8 tests)
├── test_zstd_benchmark.py    # ZSTD vs ZIP benchmark
├── test_zstd_compat.py       # Cross-format compatibility test
├── test_gpu_loader.py        # GPU Direct Loader test
├── setup.py                  # pip install configuration
├── requirements.txt          # Dependencies
└── LICENSE                   # MIT License
```

---

## 🤝 Contributing

Pull requests are welcome! For major changes, please open an issue first.

---

## 📄 License

MIT License — see [LICENSE](LICENSE) for details.

**Made with ❤️ for the ML community — because your model matters more than your storage bill.**
cdmltrain-0.2.0/README.md
ADDED
@@ -0,0 +1,281 @@
# cdmltrain 🚀
### Stream ML Datasets Directly from ZIP / ZSTD Archives — No Extraction. No Wasted Storage. Zero OOM.

[](LICENSE)
[](https://www.python.org/)
[]()

---

## 🔥 The Problem This Solves

Every Data Scientist / ML Engineer hits this wall:

```
"Your 100 GB Kaggle dataset is a ZIP file.
Extracting it takes 2 hours and needs 300 GB of free disk space.
Your Colab/Kaggle notebook crashes with Out-of-Memory errors."
```

**`cdmltrain` eliminates this problem entirely.**

It lets PyTorch read images, audio, text, CSV, JSON — **any data** — directly from a compressed archive **into RAM**, skipping disk extraction completely.

---

## ✨ Key Features

| Feature | Description |
|---|---|
| 🗜️ **Format Agnostic** | Images, audio (.wav), text, CSV, JSON, binary — all supported |
| ⚡ **O(1) Random Access** | ZIP Central Directory indexing — jumps to any file instantly |
| 🧠 **Memory-Safe Cache** | Custom LRU cache enforces strict RAM limits — zero OOM crashes |
| 🔒 **Thread Safe** | Concurrent reads for PyTorch `DataLoader(num_workers=N)` |
| 🔧 **3-Tier Architecture** | Auto-selects best engine based on your hardware |
| 🏎️ **ZSTD Support** | `.tar.zst` archives — 25x faster than ZIP deflate |
| 🎮 **GPU Direct Loader** | Streams data to CUDA VRAM with async prefetch |
| 🌍 **Cross-Platform** | Windows, Linux, macOS — works everywhere |

---

## 🏗️ Architecture (3 Tiers — Auto-Selected)

```
Your ZIP/ZSTD File
        │
        ▼
┌─────────────────────────────────────────────────────┐
│  Tier 1: Python CoreStreamEngine                    │
│  ✅ Works on ANY machine, no dependencies           │
│  → O(1) Index + LRU Cache + Thread Safety           │
├─────────────────────────────────────────────────────┤
│  Tier 2: C++ FastCoreEngine (pybind11)              │
│  ✅ Auto-enabled if C++ Build Tools installed       │
│  → Bypasses Python GIL, faster multi-worker reads  │
├─────────────────────────────────────────────────────┤
│  Tier 3: ZSTD Engine + GPU Direct Loader            │
│  ✅ pip install zstandard (for .tar.zst files)      │
│  ✅ NVIDIA GPU (for direct VRAM streaming)          │
│  → 25x faster decompression + near-zero GPU latency │
└─────────────────────────────────────────────────────┘
        │
        ▼
PyTorch DataLoader → Model Training
```

**The library auto-detects your hardware and picks the best tier. You write the same code regardless.**

---

## 📦 Installation

> **Note:** PyPI release coming soon. Install locally for now.

```bash
# Step 1: Clone the repo
git clone https://github.com/prem85642/cdmltrain.git
cd cdmltrain

# Step 2: Install
pip install .

# Step 3 (Optional): ZSTD support for .tar.zst archives
pip install zstandard
```

> **C++ Acceleration (Tier 2):**
> - Windows: Install [Microsoft C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
> - Linux: `sudo apt-get install build-essential`
> - If skipped: pure-Python engine runs automatically — no errors.

---

## 🚀 Quick Start

### Basic Usage (ZIP — Any Data Type)
```python
from cdmltrain import CDMLStreamDataset
from torch.utils.data import DataLoader

# Point directly to your ZIP — no extraction needed!
dataset = CDMLStreamDataset(
    zip_path="my_dataset.zip",
    max_cache_mb=2048  # RAM cache limit (MB)
)

dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in dataloader:
    # Train your model normally
    pass
```

### Image Dataset (with PyTorch Transforms)
```python
from cdmltrain import CDMLStreamDataset
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])
])

dataset = CDMLStreamDataset("images.zip", transform=transform, is_image=True)
```

### ZSTD Archive (25x Faster Decompression) ⚡
```python
# Just change the file extension — everything else is identical!
dataset = CDMLStreamDataset("my_dataset.tar.zst", max_cache_mb=2048)
```

### GPU Direct Loader (NVIDIA VRAM Streaming) 🎮
```python
from cdmltrain.gpu_loader import GPUDirectLoader

loader = GPUDirectLoader(dataset, batch_size=32)
# Auto-detects GPU. Falls back to CPU if no GPU found.

for batch in loader:
    output = model(batch.float())  # batch already on GPU VRAM!
```

---

## ⚙️ Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `zip_path` | `str` | *(required)* | Path to `.zip` or `.tar.zst` file |
| `transform` | `callable` | `None` | PyTorch/torchvision transform |
| `is_image` | `bool` | `False` | `True` enables PIL image decoding |
| `max_cache_mb` | `int` | `2048` | Max RAM for caching (MB) |

**`max_cache_mb` Guide:**

| Your RAM | Recommended |
|---|---|
| 4 GB (Colab Free) | `512` |
| 8 GB (Laptop) | `2048` |
| 16 GB (PC) | `6000` |
| 32 GB+ (Server) | `16000` |

---

## 📊 Benchmarks (Real Tests)

### ZSTD vs ZIP Speed (200 files × 10KB)
| Engine | Speed | Speedup |
|---|---|---|
| ZIP (Deflate) — Tier 1/2 | 29,204 files/sec | baseline |
| ZSTD — Tier 3 | **739,653 files/sec** | **🔥 25x faster** |

### GPU Direct Loader (Google Colab T4)
| Metric | Value |
|---|---|
| GPU | Tesla T4 (15.6 GB VRAM) |
| Batch device | `cuda:0` — data streamed to VRAM |
| Throughput | **17,512 items/sec** |
| Epoch time (100 items) | 0.0045s |

### Memory Safety Test
| Test | Result |
|---|---|
| Cache limit: 1MB, Data: 2MB | ✅ Stayed under 1MB |
| Thread safety: 8 workers | ✅ Zero race conditions |
| Corrupted ZIP | ✅ Rejected cleanly |
| 50MB single file | ✅ Byte-exact in 0.108s |
| 1000-file archive | ✅ Indexed in 0.04s |

---

## 🐛 Debugging / Inspection

Inspect any specific file without unzipping:
```python
dataset.extract_sample_to_disk(idx=42, export_path="./inspection/")
```

---

## 🛠️ Troubleshooting

### `ModuleNotFoundError: No module named 'cdmltrain'`
```bash
git clone https://github.com/prem85642/cdmltrain.git && cd cdmltrain && pip install .
```

### `ModuleNotFoundError: No module named 'zstandard'`
```bash
pip install zstandard
```

### `Microsoft Visual C++ 14.0 required` (Windows)
Install [C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/). Or skip — pure Python engine works fine.

### Out of Memory on Colab
```python
dataset = CDMLStreamDataset("data.zip", max_cache_mb=512)  # Reduce cache
```

### `Bad CRC-32` error
```bash
python -c "import zipfile; print(zipfile.ZipFile('file.zip').testzip())"
# None = healthy, anything else = re-download
```

### `PIL.UnidentifiedImageError`
```python
dataset = CDMLStreamDataset("data.zip", is_image=False)  # Not an image dataset
```

---

## 🖥️ OS Compatibility

| Feature | Windows | Linux | macOS |
|---|---|---|---|
| ZIP Engine (Tier 1) | ✅ | ✅ | ✅ |
| ZSTD Engine (Tier 3) | ✅ | ✅ | ✅ |
| GPU Direct Loader | ✅ | ✅ | ✅ |
| C++ Fast Engine (Tier 2) | ✅ pre-built | ✅ compile via `pip install .` | ✅ compile via `pip install .` |

---

## 📁 Project Structure

```
cdmltrain/
├── cdmltrain/
│   ├── __init__.py           # Package entry point
│   ├── core.py               # Tier 1: Python CoreStreamEngine
│   ├── dataset.py            # CDMLStreamDataset (auto-tier selection)
│   ├── zstd_engine.py        # Tier 3: ZSTD streaming engine
│   ├── gpu_loader.py         # Tier 3: GPU Direct Loader (CUDA pinned memory)
│   └── src/
│       └── fast_core.cpp     # Tier 2: C++ FastCoreEngine (pybind11)
├── demo.py                   # Quickstart demo
├── quickstart.ipynb          # Jupyter Notebook tutorial
├── test_enterprise_audit.py  # Enterprise QA suite (8 tests)
├── test_zstd_benchmark.py    # ZSTD vs ZIP benchmark
├── test_zstd_compat.py       # Cross-format compatibility test
├── test_gpu_loader.py        # GPU Direct Loader test
├── setup.py                  # pip install configuration
├── requirements.txt          # Dependencies
└── LICENSE                   # MIT License
```

---

## 🤝 Contributing

Pull requests are welcome! For major changes, please open an issue first.

---

## 📄 License

MIT License — see [LICENSE](LICENSE) for details.

**Made with ❤️ for the ML community — because your model matters more than your storage bill.**