cdmltrain 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 prem85642
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,316 @@
+ Metadata-Version: 2.4
+ Name: cdmltrain
+ Version: 0.2.0
+ Summary: Stream ML datasets from ZIP/ZSTD archives into PyTorch without disk extraction.
+ Home-page: https://github.com/prem85642/cdmltrain
+ Author: prem85642
+ Author-email: your.email@domain.com
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Classifier: Intended Audience :: Science/Research
+ Requires-Python: >=3.7
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: Pillow>=8.0.0
+ Provides-Extra: zstd
+ Requires-Dist: zstandard>=0.20.0; extra == "zstd"
+ Provides-Extra: gpu
+ Requires-Dist: torch>=1.9.0; extra == "gpu"
+ Provides-Extra: full
+ Requires-Dist: zstandard>=0.20.0; extra == "full"
+ Requires-Dist: torch>=1.9.0; extra == "full"
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: home-page
+ Dynamic: license-file
+ Dynamic: provides-extra
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary
+
+ # cdmltrain 🚀
+ ### Stream ML Datasets Directly from ZIP / ZSTD Archives — No Extraction. No Wasted Storage. Zero OOM.
+
+ [![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
+ [![Python 3.7+](https://img.shields.io/badge/Python-3.7%2B-blue.svg)](https://www.python.org/)
+ [![Platform](https://img.shields.io/badge/Platform-Windows%20%7C%20Linux%20%7C%20macOS-lightgrey)]()
+
+ ---
+
+ ## 🔥 The Problem This Solves
+
+ Every Data Scientist / ML Engineer hits this wall:
+
+ ```
+ "Your 100 GB Kaggle dataset is a ZIP file.
+ Extracting it takes 2 hours and needs 300 GB of free disk space.
+ Your Colab/Kaggle notebook crashes with Out-of-Memory errors."
+ ```
+
+ **`cdmltrain` eliminates this problem entirely.**
+
+ It lets PyTorch read images, audio, text, CSV, JSON — **any data** — directly from a compressed archive **into RAM**, skipping disk extraction completely.
+
+ ---
+
+ ## ✨ Key Features
+
+ | Feature | Description |
+ |---|---|
+ | 🗜️ **Format Agnostic** | Images, audio (.wav), text, CSV, JSON, binary — all supported |
+ | ⚡ **O(1) Random Access** | ZIP Central Directory indexing — jumps to any file instantly |
+ | 🧠 **Memory-Safe Cache** | Custom LRU cache enforces strict RAM limits — zero OOM crashes |
+ | 🔒 **Thread Safe** | Concurrent reads for PyTorch `DataLoader(num_workers=N)` |
+ | 🔧 **3-Tier Architecture** | Auto-selects best engine based on your hardware |
+ | 🏎️ **ZSTD Support** | `.tar.zst` archives — 25x faster than ZIP deflate |
+ | 🎮 **GPU Direct Loader** | Streams data to CUDA VRAM with async prefetch |
+ | 🌍 **Cross-Platform** | Windows, Linux, macOS — works everywhere |
+
+ ---
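The memory-safe cache row above describes an LRU cache with a hard byte budget. As an illustrative sketch only (not `cdmltrain`'s actual implementation), a size-capped LRU can be built on `collections.OrderedDict`, evicting least-recently-used entries once the stored bytes exceed the limit:

```python
from collections import OrderedDict

class ByteLRUCache:
    """Illustrative size-capped LRU cache: evicts least-recently-used
    entries once the total stored bytes would exceed max_bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, data):
        if key in self._store:
            self.current_bytes -= len(self._store.pop(key))
        self._store[key] = data
        self.current_bytes += len(data)
        while self.current_bytes > self.max_bytes and self._store:
            _, evicted = self._store.popitem(last=False)  # drop the LRU entry
            self.current_bytes -= len(evicted)

cache = ByteLRUCache(max_bytes=1024)
cache.put("a", b"x" * 600)
cache.put("b", b"y" * 600)  # would total 1200 bytes, so "a" is evicted
print(cache.get("a"), len(cache.get("b")))  # None 600
```

The same idea, with locking around `get`/`put`, is what lets a fixed `max_cache_mb` hold regardless of dataset size.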
+
+ ## 🏗️ Architecture (3 Tiers — Auto-Selected)
+
+ ```
+ Your ZIP/ZSTD File
+
+
+ ┌─────────────────────────────────────────────────────┐
+ │ Tier 1: Python CoreStreamEngine │
+ │ ✅ Works on ANY machine, no dependencies │
+ │ → O(1) Index + LRU Cache + Thread Safety │
+ ├─────────────────────────────────────────────────────┤
+ │ Tier 2: C++ FastCoreEngine (pybind11) │
+ │ ✅ Auto-enabled if C++ Build Tools installed │
+ │ → Bypasses Python GIL, faster multi-worker reads │
+ ├─────────────────────────────────────────────────────┤
+ │ Tier 3: ZSTD Engine + GPU Direct Loader │
+ │ ✅ pip install zstandard (for .tar.zst files) │
+ │ ✅ NVIDIA GPU (for direct VRAM streaming) │
+ │ → 25x faster decompression + near-zero GPU latency │
+ └─────────────────────────────────────────────────────┘
+
+
+ PyTorch DataLoader → Model Training
+ ```
+
+ **The library auto-detects your hardware and picks the best tier. You write the same code regardless.**
+
+ ---
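Tier 1's O(1) random access rests on the ZIP central directory, which Python's standard library already exposes. This standalone `zipfile` sketch (no `cdmltrain` code involved) shows the mechanic: one directory scan builds an index, after which any member can be read directly, with nothing extracted to disk:

```python
import io
import zipfile

# Build a small in-memory ZIP standing in for a real dataset archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(5):
        zf.writestr(f"sample_{i}.txt", f"payload {i}")

# One pass over the central directory yields an index of every member;
# any member can then be read directly, with no extraction to disk.
zf = zipfile.ZipFile(buf)
index = zf.namelist()   # central-directory listing
data = zf.read(index[3])  # random access to the 4th member
print(len(index), data)  # 5 b'payload 3'
```

Decompressing only the requested member is what keeps per-sample reads cheap when a `DataLoader` asks for shuffled indices.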
+
+ ## 📦 Installation
+
+ > **Note:** PyPI release coming soon. Install locally for now.
+
+ ```bash
+ # Step 1: Clone the repo
+ git clone https://github.com/prem85642/cdmltrain.git
+ cd cdmltrain
+
+ # Step 2: Install
+ pip install .
+
+ # Step 3 (Optional): ZSTD support for .tar.zst archives
+ pip install zstandard
+ ```
+
+ > **C++ Acceleration (Tier 2):**
+ > - Windows: Install [Microsoft C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
+ > - Linux: `sudo apt-get install build-essential`
+ > - If skipped: pure-Python engine runs automatically — no errors.
+
+ ---
+
+ ## 🚀 Quick Start
+
+ ### Basic Usage (ZIP — Any Data Type)
+ ```python
+ from cdmltrain import CDMLStreamDataset
+ from torch.utils.data import DataLoader
+
+ # Point directly to your ZIP — no extraction needed!
+ dataset = CDMLStreamDataset(
+     zip_path="my_dataset.zip",
+     max_cache_mb=2048  # RAM cache limit (MB)
+ )
+
+ dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
+
+ for batch in dataloader:
+     # Train your model normally
+     pass
+ ```
+
+ ### Image Dataset (with PyTorch Transforms)
+ ```python
+ from cdmltrain import CDMLStreamDataset
+ from torchvision import transforms
+
+ transform = transforms.Compose([
+     transforms.Resize((224, 224)),
+     transforms.ToTensor(),
+     transforms.Normalize([0.5], [0.5])
+ ])
+
+ dataset = CDMLStreamDataset("images.zip", transform=transform, is_image=True)
+ ```
+
+ ### ZSTD Archive (25x Faster Decompression) ⚡
+ ```python
+ # Just change the file extension — everything else is identical!
+ dataset = CDMLStreamDataset("my_dataset.tar.zst", max_cache_mb=2048)
+ ```
+
+ ### GPU Direct Loader (NVIDIA VRAM Streaming) 🎮
+ ```python
+ from cdmltrain.gpu_loader import GPUDirectLoader
+
+ loader = GPUDirectLoader(dataset, batch_size=32)
+ # Auto-detects GPU. Falls back to CPU if no GPU found.
+
+ for batch in loader:
+     output = model(batch.float())  # batch already on GPU VRAM!
+ ```
+
+ ---
+
+ ## ⚙️ Configuration Parameters
+
+ | Parameter | Type | Default | Description |
+ |---|---|---|---|
+ | `zip_path` | `str` | *(required)* | Path to `.zip` or `.tar.zst` file |
+ | `transform` | `callable` | `None` | PyTorch/torchvision transform |
+ | `is_image` | `bool` | `False` | `True` enables PIL image decoding |
+ | `max_cache_mb` | `int` | `2048` | Max RAM for caching (MB) |
+
+ **`max_cache_mb` Guide:**
+
+ | Your RAM | Recommended |
+ |---|---|
+ | 4 GB (Colab Free) | `512` |
+ | 8 GB (Laptop) | `2048` |
+ | 16 GB (PC) | `6000` |
+ | 32 GB+ (Server) | `16000` |
+
+ ---
+
+ ## 📊 Benchmarks (Real Tests)
+
+ ### ZSTD vs ZIP Speed (200 files × 10KB)
+ | Engine | Speed | Speedup |
+ |---|---|---|
+ | ZIP (Deflate) — Tier 1/2 | 29,204 files/sec | baseline |
+ | ZSTD — Tier 3 | **739,653 files/sec** | **🔥 25x faster** |
+
+ ### GPU Direct Loader (Google Colab T4)
+ | Metric | Value |
+ |---|---|
+ | GPU | Tesla T4 (15.6 GB VRAM) |
+ | Batch device | `cuda:0` — data streamed to VRAM |
+ | Throughput | **17,512 items/sec** |
+ | Epoch time (100 items) | 0.0045s |
+
+ ### Memory Safety Test
+ | Test | Result |
+ |---|---|
+ | Cache limit: 1MB, Data: 2MB | ✅ Stayed under 1MB |
+ | Thread safety: 8 workers | ✅ Zero race conditions |
+ | Corrupted ZIP | ✅ Rejected cleanly |
+ | 50MB single file | ✅ Byte-exact in 0.108s |
+ | 1000-file archive | ✅ Indexed in 0.04s |
+
+ ---
+
+ ## 🐛 Debugging / Inspection
+
+ Inspect any specific file without unzipping:
+ ```python
+ dataset.extract_sample_to_disk(idx=42, export_path="./inspection/")
+ ```
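For ZIP archives in general, the same kind of single-member inspection is available straight from the standard library. This standalone `zipfile` sketch (independent of `extract_sample_to_disk`) pulls one member's bytes and metadata entirely in memory:

```python
import io
import zipfile

# In-memory stand-in for a dataset archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("img_042.txt", "inspect me")

# Read a single member's metadata and bytes without touching disk.
zf = zipfile.ZipFile(buf)
info = zf.getinfo("img_042.txt")
print(info.file_size, zf.read(info).decode())  # 10 inspect me
```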
+
+ ---
+
+ ## 🛠️ Troubleshooting
+
+ ### `ModuleNotFoundError: No module named 'cdmltrain'`
+ ```bash
+ git clone https://github.com/prem85642/cdmltrain.git && cd cdmltrain && pip install .
+ ```
+
+ ### `ModuleNotFoundError: No module named 'zstandard'`
+ ```bash
+ pip install zstandard
+ ```
+
+ ### `Microsoft Visual C++ 14.0 required` (Windows)
+ Install [C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/). Or skip it — the pure-Python engine works fine.
+
+ ### Out of Memory on Colab
+ ```python
+ dataset = CDMLStreamDataset("data.zip", max_cache_mb=512)  # Reduce cache
+ ```
+
+ ### `Bad CRC-32` error
+ ```bash
+ python -c "import zipfile; print(zipfile.ZipFile('file.zip').testzip())"
+ # None = healthy, anything else = re-download
+ ```
+
+ ### `PIL.UnidentifiedImageError`
+ ```python
+ dataset = CDMLStreamDataset("data.zip", is_image=False)  # Not an image dataset
+ ```
+
+ ---
+
+ ## 🖥️ OS Compatibility
+
+ | Feature | Windows | Linux | macOS |
+ |---|---|---|---|
+ | ZIP Engine (Tier 1) | ✅ | ✅ | ✅ |
+ | ZSTD Engine (Tier 3) | ✅ | ✅ | ✅ |
+ | GPU Direct Loader | ✅ | ✅ | ✅ |
+ | C++ Fast Engine (Tier 2) | ✅ pre-built | ✅ compile via `pip install .` | ✅ compile via `pip install .` |
+
+ ---
+
+ ## 📁 Project Structure
+
+ ```
+ cdmltrain/
+ ├── cdmltrain/
+ │   ├── __init__.py              # Package entry point
+ │   ├── core.py                  # Tier 1: Python CoreStreamEngine
+ │   ├── dataset.py               # CDMLStreamDataset (auto-tier selection)
+ │   ├── zstd_engine.py           # Tier 3: ZSTD streaming engine
+ │   ├── gpu_loader.py            # Tier 3: GPU Direct Loader (CUDA pinned memory)
+ │   └── src/
+ │       └── fast_core.cpp        # Tier 2: C++ FastCoreEngine (pybind11)
+ ├── demo.py                      # Quickstart demo
+ ├── quickstart.ipynb             # Jupyter Notebook tutorial
+ ├── test_enterprise_audit.py     # Enterprise QA suite (8 tests)
+ ├── test_zstd_benchmark.py       # ZSTD vs ZIP benchmark
+ ├── test_zstd_compat.py          # Cross-format compatibility test
+ ├── test_gpu_loader.py           # GPU Direct Loader test
+ ├── setup.py                     # pip install configuration
+ ├── requirements.txt             # Dependencies
+ └── LICENSE                      # MIT License
+ ```
+
+ ---
+
+ ## 🤝 Contributing
+
+ Pull requests are welcome! For major changes, please open an issue first.
+
+ ---
+
+ ## 📄 License
+
+ MIT License — see [LICENSE](LICENSE) for details.
+
+ **Made with ❤️ for the ML community — because your model matters more than your storage bill.**
@@ -0,0 +1,4 @@
+ from .dataset import CDMLStreamDataset
+ from .core import CoreStreamEngine
+
+ __all__ = ["CDMLStreamDataset", "CoreStreamEngine"]