roxxel 0.1.0__tar.gz

@@ -0,0 +1,34 @@
+ name: Publish to PyPI 📦
+
+ on:
+   release:
+     types: [published]
+
+ permissions:
+   contents: read
+
+ jobs:
+   build-n-publish:
+     name: Build and publish Python distribution to PyPI
+     runs-on: ubuntu-latest
+     permissions:
+       # This permission is mandatory for secure Trusted Publishing (OIDC)
+       id-token: write
+
+     steps:
+       - name: Checkout source
+         uses: actions/checkout@v4
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: "3.x"
+
+       - name: Install build dependencies
+         run: python -m pip install build
+
+       - name: Build binary wheel and source tarball
+         run: python -m build
+
+       - name: Publish package distributions to PyPI
+         uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,12 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # Roxxel temporary data files
+ *.rox
+
+ # System & IDE files
+ .DS_Store
+ Thumbs.db
+ .antigravity*
roxxel-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 anon
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
roxxel-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,189 @@
+ Metadata-Version: 2.4
+ Name: roxxel
+ Version: 0.1.0
+ Summary: A zero-RAM, multi-modal, sharded binary dataset manager
+ Author-email: anon160 <anon160@users.noreply.github.com>
+ License: MIT
+ License-File: LICENSE
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.8
+ Requires-Dist: numpy>=1.20.0
+ Description-Content-Type: text/markdown
roxxel-0.1.0/README.md ADDED
@@ -0,0 +1,174 @@
+ # Roxxel 🚀
+
+ **Zero-RAM, Multi-Modal, Sharded Binary Dataset Manager**
+
+ Roxxel is an ultra-lightweight (~300 lines of plain Python) binary dataset format and reader designed for high-performance deep learning pipelines. NumPy is its only dependency.
+
+ By implementing the standard Python sequence protocol over native `numpy.memmap` views, Roxxel presents massive, multi-sharded, variable-length on-disk datasets as a single list-like sequence.
+
+ ---
+
+ ## 💡 Motivation
+
+ Mainstream deep learning data loaders, such as **PyTorch's `DataLoader`**, **Google's `Grain`**, and **TensorFlow's `tf.data`**, attempt to handle every aspect of the data pipeline (I/O, caching, multiprocessing, shuffling, collation, and transformations) in a single monolithic system. In practice this often causes operational friction:
+
+ * **PyTorch DataLoader**: Setting `num_workers > 0` spawns worker processes, which on Linux are created with `fork` by default. Copy-on-write sharing then erodes because CPython's reference counting writes to the pages that hold Python objects, so per-worker memory can grow steadily and look like a leak. Debugging opaque subprocess deadlocks and IPC or file-descriptor exhaustion is also frustrating.
+ * **Google Grain**: While powerful, it introduces a heavyweight dependency footprint and complex pipeline-building abstractions that are difficult to customize or run outside of JAX-specific training pipelines.
+ * **TensorFlow tf.data**: Building robust tf.data pipelines is complex. It is also usually paired with the `TFRecord` binary format, which is hard to inspect or read without pulling in the large TensorFlow library as a dependency.
+
+ ### The Roxxel Philosophy
+ Roxxel shifts the architectural boundary by following the **Unix philosophy of doing one thing and doing it well**. It handles only the storage layer (**safe contiguous file packing, zero-RAM memory mapping, and O(1) seek indexing**) and leaves all batching, threading, and transformations to plain, standard Python and NumPy code.
+
+ ---
+
+ ## 🌟 Unique Benefits of Roxxel
+
+ 1. **Zero-RAM Overhead**: Roxxel maps your dataset into virtual memory with `numpy.memmap`, so the payload is never copied onto the Python heap. Pages are read on demand and live in the operating system's page cache, which the kernel can evict under memory pressure, so even multi-terabyte datasets can be opened on machines with far less RAM.
+ 2. **100% Framework Agnostic**: Because Roxxel is built purely on the Python standard library and NumPy, it is entirely decoupled from any ML framework. You can use the exact same Roxxel dataset across **PyTorch, JAX, TensorFlow, or pure CPU environments** with zero code changes.
+ 3. **No Multiprocessing Deadlocks**: Concurrent reads from a read-only memory map need no locking, so you can implement asynchronous loading using simple Python threads (`threading.Thread`) or thread pools, with no subprocess IPC bottlenecks or fork-related deadlocks to worry about.
+ 4. **Modality-Agnostic Variable-Length Records**: Unlike fixed-width formats, Roxxel stores arbitrary variable-length binary payloads back to back. You can store JPEGs, MP4 clips, text token arrays, or audio samples in a single, unified structure with zero padding waste.
+ 5. **Clean Sharded Portability**: Roxxel automatically splits massive datasets into sequentially numbered shards during writes. During reads, it presents them as a single continuous sequence, using binary search over shard boundaries to locate each record. Shards are easy to distribute, copy, and stream over networks.
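
As a rough illustration of benefit 3, the sketch below reads several byte ranges from one read-only mapping using a thread pool. It uses the standard-library `mmap` module as a stand-in for `numpy.memmap`. The helper names `read_ranges` and `demo` and the file name `payload.bin` are hypothetical and not part of Roxxel.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_ranges(path, ranges, workers=4):
    """Copy each (start, end) byte range out of a read-only mapping, in parallel."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                # Slicing an mmap returns an independent bytes copy,
                # so the results stay valid after the mapping is closed.
                return list(pool.map(lambda r: mm[r[0]:r[1]], ranges))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "payload.bin")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"abcdefghij")
        return read_ranges(path, [(0, 3), (3, 5), (5, 10)])


print(demo())
```

No lock is needed here because every thread only reads; a writable mapping would need its own synchronization.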
+
+ ---
+
+ ## 🛠️ File Format Architecture
+
+ To prevent **header contamination** (where inline metadata blocks interrupt an otherwise flat memory map), Roxxel writes each file as one contiguous payload followed by a trailing index table and a footer:
+
+ ```
+ +-------------------------------------------------------------+
+ |                                                             |
+ |  1. RAW CONTIGUOUS PAYLOAD DATA SECTION                     |
+ |     (No headers, no prefixes, completely clean bytes)       |
+ |                                                             |
+ +-------------------------------------------------------------+
+ |                                                             |
+ |  2. TRAILING INDEX TABLE SECTION                            |
+ |     (Flat array of int64 offsets pointing to record ends)   |
+ |                                                             |
+ +-------------------------------------------------------------+
+ |  3. FOOTER (Exactly 24 bytes)                               |
+ |     [Total Records (8B)] [Raw Data Size (8B)] [MAGIC (8B)]  |
+ +-------------------------------------------------------------+
+ ```
+
+ Because the raw data section is uninterrupted, you can interpret the whole payload as one contiguous array in a single line (for example, for LLM token pre-training), or resolve an individual record with a single index-table lookup, which is $O(1)$ within a shard.
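
The layout above can be sketched with nothing but `struct`. This is a minimal illustration, not Roxxel's own reader. It assumes the little-endian `"<qq8s"` footer and the `b"ROXXEL01"` magic tag used in `roxxel.py`, and the helpers `build_archive` and `read_record` are hypothetical names.

```python
import struct

FOOTER = struct.Struct("<qq8s")  # total_records, raw_data_size, magic (24 bytes)
MAGIC = b"ROXXEL01"


def build_archive(records):
    """Pack records as payload + int64 end offsets + 24-byte footer."""
    payload = b"".join(records)
    ends, total = [], 0
    for rec in records:
        total += len(rec)
        ends.append(total)
    index = struct.pack(f"<{len(ends)}q", *ends)
    return payload + index + FOOTER.pack(len(ends), len(payload), MAGIC)


def read_record(blob, i):
    """Resolve record i using only the footer and the index table."""
    total_records, raw_size, magic = FOOTER.unpack_from(blob, len(blob) - FOOTER.size)
    if magic != MAGIC:
        raise ValueError("missing ROXXEL01 magic tag")
    if not 0 <= i < total_records:
        raise IndexError("Record index out of range.")
    # The index table starts right after the payload; entry i is the end of record i.
    end = struct.unpack_from("<q", blob, raw_size + 8 * i)[0]
    start = 0 if i == 0 else struct.unpack_from("<q", blob, raw_size + 8 * (i - 1))[0]
    return blob[start:end]


blob = build_archive([b"abc", b"de", b"fghi"])
print(len(blob), read_record(blob, 1))
```

Storing end offsets rather than start offsets means record `i` starts where record `i - 1` ends, so no separate length field is needed.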
+
+ ## 📦 Installation
+
+ Roxxel can be installed via `pip` directly from PyPI:
+
+ ```bash
+ pip install roxxel
+ ```
+
+ ---
+
+ ## 🚀 Getting Started
+
+ After installing, import `Roxxel` from the `roxxel` module. Because the library is a single file, you can also copy `roxxel.py` into your project instead.
+
+ ### 1. Writing a Single-File Dataset
+ If the target file already contains a valid archive, `write` appends the new records to it.
+ ```python
+ from roxxel import Roxxel
+
+ # Define a generator that yields raw byte payloads
+ def byte_stream():
+     for i in range(100):
+         yield bytes([i] * 50)  # Yield raw bytes
+
+ rox = Roxxel("./dataset.rox")
+ rox.write(byte_stream())
+ ```
+
+ ### 2. Writing a Sharded Dataset
+ Specify `max_shard_bytes` to automatically split massive data streams into size-capped shards (e.g., `dataset_0000.rox`, `dataset_0001.rox`). The limit applies to the whole shard file (payload, index table, and footer); a single record larger than the limit still gets a shard of its own.
+ ```python
+ # Limit each shard file to 2 GiB
+ rox.write(byte_stream(), max_shard_bytes=2 * 1024 * 1024 * 1024)
+ ```
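
To see how the cap turns into shard boundaries, here is a small sketch of the size estimate that `write` applies before each record: payload so far, plus the new record, plus 8 index bytes per record, plus the 24-byte footer. The function `plan_shards` is a hypothetical helper that only counts records per shard and writes nothing.

```python
def plan_shards(record_sizes, max_shard_bytes):
    """Return the number of records that would land in each shard."""
    counts = []
    offset = 0  # payload bytes in the current shard
    n = 0       # records in the current shard
    for size in record_sizes:
        if size == 0:
            continue  # the writer skips empty payloads
        estimated = offset + size + (n + 1) * 8 + 24
        # Roll over only if the current shard already holds a record,
        # so an oversized record still gets a shard of its own.
        if estimated > max_shard_bytes and n > 0:
            counts.append(n)
            offset, n = 0, 0
        offset += size
        n += 1
    if n > 0:
        counts.append(n)
    return counts


# Same inputs as the package's sharded test: 20 records, 180-byte cap.
print(plan_shards([5, 10, 15, 20] * 5, 180))  # [7, 7, 6]
```

With these inputs the eighth record would push the first shard to 80 + 20 + 64 + 24 = 188 bytes, which exceeds 180, so it opens the second shard.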
+
+ ### 3. Reading and Shuffling (Sequence API)
+ Roxxel accepts a single path, a glob pattern, or a Python list of paths. It presents all matching shards as a single read-only sequence supporting index lookups, negative indices, and slicing. Each record comes back as a read-only `uint8` NumPy view into the mapped file:
+ ```python
+ import numpy as np
+ from roxxel import Roxxel
+
+ # Read and virtualize all shards matching the glob pattern
+ with Roxxel("./dataset_*.rox") as dataset:
+     print("Total virtual records:", len(dataset))
+
+     # 1. Single index lookup: binary search over shards, then O(1) within the shard
+     record = dataset[42]
+
+     # 2. Slice lookup (returns a list of records)
+     subset = dataset[10:20]
+
+     # 3. Global shuffling (handled in three lines of plain NumPy!)
+     shuffled_indices = np.random.permutation(len(dataset))
+     for idx in shuffled_indices:
+         shuffled_record = dataset[idx]  # pages load on demand; repeat reads hit the page cache
+ ```
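
The mapping from a global index to a shard can be sketched with `bisect`, mirroring the approach in `_get_single_item`. The helper `locate` is a hypothetical name. `boundaries` is assumed to hold the cumulative record count at the end of each shard, as `_shard_boundaries` does.

```python
import bisect


def locate(boundaries, idx):
    """Map a global record index to (shard number, index within that shard)."""
    total = boundaries[-1] if boundaries else 0
    if idx < 0:
        idx += total  # support negative indices
    if not 0 <= idx < total:
        raise IndexError("Record index out of range.")
    # bisect_right finds the first shard whose cumulative count exceeds idx.
    shard = bisect.bisect_right(boundaries, idx)
    first = 0 if shard == 0 else boundaries[shard - 1]
    return shard, idx - first


boundaries = [7, 14, 20]  # three shards holding 7, 7 and 6 records
print(locate(boundaries, 6), locate(boundaries, 7), locate(boundaries, -1))
```

Using `bisect_right` rather than `bisect_left` matters at the edges: global index 7 equals the first boundary and must resolve to the first record of the second shard.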
+
+ ---
+
+ ## 🍳 Cookbooks
+
+ ### A. Flat Token Streaming (e.g., LLM Training)
+ If you are doing LLM pre-training, you want to treat your entire dataset as one continuous stream of tokens. Roxxel allows you to ignore record boundaries and read the raw contiguous mapped memory directly. `raw_data` is only populated when the dataset consists of a single file, and the payload size must be a multiple of the token width (2 bytes for `uint16`):
+
+ ```python
+ import numpy as np
+ from roxxel import Roxxel
+
+ with Roxxel("./tokens.rox") as dataset:
+     # Reinterpret the mapped raw bytes as uint16 tokens (no copy)
+     tokens = dataset.raw_data.view(np.uint16)
+
+     # Chunk and batch locally in NumPy:
+     seq_len = 2048
+     total_sequences = len(tokens) // seq_len
+     reshaped_batches = tokens[:total_sequences * seq_len].reshape(total_sequences, seq_len)
+ ```
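
The reinterpretation step can be illustrated without NumPy. The sketch below uses `memoryview.cast` as a standard-library stand-in for `view(np.uint16)` and checks the width requirement first. The name `as_token_sequences` is hypothetical, and tokens are assumed to be native-endian 16-bit unsigned integers.

```python
import struct

TOKEN_BYTES = 2  # width of one uint16 token


def as_token_sequences(raw, seq_len):
    """Reinterpret raw bytes as uint16 tokens and cut them into full sequences."""
    if len(raw) % TOKEN_BYTES:
        raise ValueError("payload size is not a multiple of the token width")
    tokens = memoryview(raw).cast("H")  # zero-copy view, native byte order
    n_seq = len(tokens) // seq_len      # a trailing partial sequence is dropped
    return [tokens[i * seq_len:(i + 1) * seq_len].tolist() for i in range(n_seq)]


raw = struct.pack("=10H", *range(10))  # ten tokens, 20 bytes
print(as_token_sequences(raw, 4))
```

The last two tokens are dropped here for the same reason the NumPy version slices to `total_sequences * seq_len` before reshaping.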
+
+ ### B. High-Performance Asynchronous Prefetch Dataloader
+ If you are training on high-performance GPUs, you want to pre-load batches on a background CPU thread so the GPU does not sit idle waiting on I/O:
+
+ ```python
+ import queue
+ import threading
+ import numpy as np
+ from roxxel import Roxxel
+
+ def async_dataloader(rox_pattern, batch_size=32, prefetch_batches=4, seed=42):
+     dataset = Roxxel(rox_pattern)
+     dataset.open()
+
+     indices = np.arange(len(dataset))
+     rng = np.random.default_rng(seed)
+     rng.shuffle(indices)
+
+     q = queue.Queue(maxsize=prefetch_batches)
+
+     def producer():
+         try:
+             for start_idx in range(0, len(indices), batch_size):
+                 batch_picks = indices[start_idx : start_idx + batch_size]
+
+                 # Fetch and decode/stack
+                 batch_data = [dataset[idx] for idx in batch_picks]
+                 q.put(batch_data)
+         finally:
+             # Always send EOF so the consumer does not block forever if a read fails
+             q.put(None)
+             dataset.close()
+
+     # Start I/O in the background
+     threading.Thread(target=producer, daemon=True).start()
+
+     # Yield batches to the training loop
+     while True:
+         batch = q.get()
+         if batch is None:
+             break
+         yield batch
+ ```
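
One possible hardening of the pattern above is sketched below. It forwards producer exceptions to the consumer and lets the background thread exit if the consumer stops early. It is independent of Roxxel; `prefetch`, `_DONE`, and `bad_batches` are hypothetical names, and `batches` can be any iterable.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, depth=4):
    """Iterate over `batches` on a background thread, buffering up to `depth` items."""
    q = queue.Queue(maxsize=depth)
    stop = threading.Event()

    def put(item):
        # Retry with a timeout so the producer notices an abandoned consumer.
        while not stop.is_set():
            try:
                q.put(item, timeout=0.1)
                return True
            except queue.Full:
                continue
        return False

    def producer():
        try:
            for batch in batches:
                if not put(batch):
                    return
            put(_DONE)
        except Exception as exc:
            put(exc)  # hand the error to the consumer instead of hanging it

    threading.Thread(target=producer, daemon=True).start()
    try:
        while True:
            item = q.get()
            if item is _DONE:
                return
            if isinstance(item, Exception):
                raise item
            yield item
    finally:
        stop.set()  # runs on normal end, on error, and on early generator close


def bad_batches():
    yield [0]
    raise ValueError("boom")


print(list(prefetch(iter([[1, 2], [3]]))))
```

This sketch treats any `Exception` instance in the queue as an error, so it assumes batches themselves are never exception objects.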
+
+ ---
+
+ ## ⚖️ License
+ MIT License. Feel free to use, modify, and distribute.
@@ -0,0 +1,26 @@
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [project]
+ name = "roxxel"
+ version = "0.1.0"
+ description = "A zero-RAM, multi-modal, sharded binary dataset manager"
+ readme = "README.md"
+ requires-python = ">=3.8"
+ license = {text = "MIT"}
+ authors = [
+     {name = "anon160", email = "anon160@users.noreply.github.com"}
+ ]
+ classifiers = [
+     "Programming Language :: Python :: 3",
+     "License :: OSI Approved :: MIT License",
+     "Operating System :: OS Independent",
+     "Topic :: Scientific/Engineering :: Artificial Intelligence",
+ ]
+ dependencies = [
+     "numpy>=1.20.0"
+ ]
+
+ [tool.hatch.build.targets.wheel]
+ packages = ["roxxel.py"]
roxxel-0.1.0/roxxel.py ADDED
@@ -0,0 +1,322 @@
+ import os
+ import glob
+ import struct
+ import bisect
+ import numpy as np
+
+ class Roxxel:
+     """
+     A bare-bones, zero-RAM single-file or multi-sharded dataset manager.
+     Stores raw contiguous payload data, a trailing index table, and a 24-byte footer.
+     Seamlessly virtualizes multiple shards on-disk into a single continuous sequence.
+     """
+     MAGIC_SIGNATURE = b"ROXXEL01"  # 8-byte magic tag that identifies the format
+
+     def __init__(self, filepath="./stream_reservoir.rox"):
+         self.raw_data = None
+         self.index_table = None
+         self._total_records = 0
+         self._is_open = False
+         self._shards = []
+         self._shard_boundaries = []
+
+         # Support single string, list of strings, or glob patterns
+         if isinstance(filepath, list):
+             self.filepaths = filepath
+         elif isinstance(filepath, str):
+             if "*" in filepath or "?" in filepath:
+                 self.filepaths = sorted(glob.glob(filepath))
+             else:
+                 self.filepaths = [filepath]
+         else:
+             raise TypeError("filepath must be a string (file/pattern) or a list of strings.")
+
+     # =====================================================================
+     # API 1: WRITE STREAM (WITH SHARDING SUPPORT)
+     # =====================================================================
+     def write(self, data_generator, max_shard_bytes=None):
+         """
+         Accepts an iterable stream of raw python byte objects.
+         If max_shard_bytes is None, writes/appends to a single file.
+         If max_shard_bytes is provided, splits the stream across multiple shards (e.g., dataset_0000.rox).
+         """
+         self.close()
+
+         if len(self.filepaths) == 0:
+             raise ValueError("No filepath specified to write to.")
+
+         if max_shard_bytes is None:
+             self._write_single_file(self.filepaths[0], data_generator)
+             return
+
+         base_path = self.filepaths[0]
+         if base_path.endswith(".rox"):
+             base_name = base_path[:-4]
+         else:
+             base_name = base_path
+
+         # Find first unused shard index
+         shard_idx = 0
+         while os.path.exists(f"{base_name}_{shard_idx:04d}.rox"):
+             shard_idx += 1
+
+         current_shard_path = None
+         end_offsets = []
+         raw_data_size = 0
+
+         # Try to append to the last existing shard if it has room
+         if shard_idx > 0:
+             last_shard_path = f"{base_name}_{shard_idx-1:04d}.rox"
+             last_shard_size = os.path.getsize(last_shard_path)
+             if last_shard_size < max_shard_bytes:
+                 current_shard_path = last_shard_path
+                 shard_idx -= 1
+
+                 # A file shorter than the 24-byte footer cannot be a valid shard
+                 file_signature = None
+                 if last_shard_size >= 24:
+                     with open(current_shard_path, "rb") as f:
+                         f.seek(last_shard_size - 24)
+                         footer_block = f.read(24)
+                         total_records, raw_data_size, file_signature = struct.unpack("<qq8s", footer_block)
+
+                 if file_signature == self.MAGIC_SIGNATURE:
+                     with open(current_shard_path, "rb") as f:
+                         f.seek(raw_data_size)
+                         end_offsets = np.fromfile(f, dtype="<i8", count=total_records).tolist()
+
+                     with open(current_shard_path, "r+b") as f:
+                         f.truncate(raw_data_size)
+                 else:
+                     # Invalid shard: overwrite it from scratch
+                     current_shard_path = f"{base_name}_{shard_idx:04d}.rox"
+                     end_offsets = []
+                     raw_data_size = 0
+             else:
+                 current_shard_path = f"{base_name}_{shard_idx:04d}.rox"
+         else:
+             current_shard_path = f"{base_name}_{shard_idx:04d}.rox"
+
+         current_offset = raw_data_size
+         # Truncate file to 0 if starting a fresh or overwritten shard
+         if current_offset == 0 and os.path.exists(current_shard_path):
+             open(current_shard_path, "wb").close()
+
+         f_out = open(current_shard_path, "ab")
+
+         try:
+             for item_bytes in data_generator:
+                 if not isinstance(item_bytes, bytes):
+                     raise TypeError("Data generator must exclusively yield raw python 'bytes' objects.")
+
+                 payload_size = len(item_bytes)
+                 if payload_size == 0:
+                     continue
+
+                 # Predict shard size: raw data + index table (8 bytes per record) + 24-byte footer
+                 estimated_size = current_offset + payload_size + (len(end_offsets) + 1) * 8 + 24
+                 if estimated_size > max_shard_bytes and len(end_offsets) > 0:
+                     f_out.close()
+                     self._finalize_shard(current_shard_path, end_offsets, current_offset)
+
+                     shard_idx += 1
+                     current_shard_path = f"{base_name}_{shard_idx:04d}.rox"
+                     print(f"📦 Shard limit reached. Creating new shard: {current_shard_path}")
+
+                     end_offsets = []
+                     current_offset = 0
+                     f_out = open(current_shard_path, "ab")
+
+                 f_out.write(item_bytes)
+                 current_offset += payload_size
+                 end_offsets.append(current_offset)
+         finally:
+             f_out.close()
+
+         if len(end_offsets) > 0:
+             self._finalize_shard(current_shard_path, end_offsets, current_offset)
+
+     def _write_single_file(self, path, data_generator):
+         end_offsets = []
+         raw_data_size = 0
+
+         if os.path.exists(path):
+             total_file_bytes = os.path.getsize(path)
+             if total_file_bytes >= 24:
+                 with open(path, "rb") as f:
+                     f.seek(total_file_bytes - 24)
+                     footer_block = f.read(24)
+                     total_records, raw_data_size, file_signature = struct.unpack("<qq8s", footer_block)
+
+                 if file_signature == self.MAGIC_SIGNATURE:
+                     print("♻️ Found existing archive. Stripping index and footer...")
+                     with open(path, "rb") as f:
+                         f.seek(raw_data_size)
+                         end_offsets = np.fromfile(f, dtype="<i8", count=total_records).tolist()
+
+                     with open(path, "r+b") as f:
+                         f.truncate(raw_data_size)
+                 else:
+                     print("⚠️ Invalid signature in existing archive. Overwriting/starting fresh...")
+                     end_offsets = []
+                     raw_data_size = 0
+
+         current_offset = raw_data_size
+         # Truncate file to 0 if starting fresh or overwriting an invalid archive
+         if current_offset == 0 and os.path.exists(path):
+             open(path, "wb").close()
+
+         with open(path, "ab") as f:
+             for item_bytes in data_generator:
+                 if not isinstance(item_bytes, bytes):
+                     raise TypeError("Data generator must exclusively yield raw python 'bytes' objects.")
+
+                 payload_size = len(item_bytes)
+                 if payload_size == 0:
+                     continue
+
+                 f.write(item_bytes)
+                 current_offset += payload_size
+                 end_offsets.append(current_offset)
+
+         if len(end_offsets) > 0:
+             self._finalize_shard(path, end_offsets, current_offset)
+
+     def _finalize_shard(self, path, end_offsets, raw_data_size):
+         total_records = len(end_offsets)
+         with open(path, "ab") as f:
+             np.array(end_offsets, dtype="<i8").tofile(f)
+             footer = struct.pack("<qq8s", total_records, raw_data_size, self.MAGIC_SIGNATURE)
+             f.write(footer)
+         print(f"✅ Finalized shard {os.path.basename(path)} - Records: {total_records}, Data Bytes: {raw_data_size}")
+
+     # =====================================================================
+     # API 2: READ / LOAD (SHARDED SEQUENCE INTERFACE)
+     # =====================================================================
+     def open(self):
+         """
+         Memory maps all files in the sharded dataset for high-performance read-only access.
+         """
+         if self._is_open:
+             return
+
+         self._shards = []
+         self._shard_boundaries = []
+         self._total_records = 0
+
+         # In case globs returned nothing
+         if len(self.filepaths) == 0:
+             raise FileNotFoundError("No matching files found for the specified dataset path/pattern.")
+
+         for path in self.filepaths:
+             if not os.path.exists(path):
+                 raise FileNotFoundError(f"Missing dataset shard file at {path}.")
+
+             total_file_bytes = os.path.getsize(path)
+             if total_file_bytes < 24:
+                 raise ValueError(f"Corrupted shard {path}: size is less than footer size.")
+
+             with open(path, "rb") as f:
+                 f.seek(total_file_bytes - 24)
+                 footer_block = f.read(24)
+                 total_records, raw_data_size, file_signature = struct.unpack("<qq8s", footer_block)
+
+             if file_signature != self.MAGIC_SIGNATURE:
+                 raise ValueError(f"Corrupted signature in shard {path}.")
+
+             # Open standard python file handle for safe, pythonic descriptor management
+             f_handle = open(path, "rb")
+
+             # Memory map the raw data and index table using the file handle
+             raw_data = np.memmap(
+                 f_handle,
+                 dtype=np.uint8,
+                 mode="r",
+                 offset=0,
+                 shape=(raw_data_size,)
+             )
+
+             index_table = np.memmap(
+                 f_handle,
+                 dtype=np.int64,
+                 mode="r",
+                 offset=raw_data_size,
+                 shape=(total_records,)
+             )
+
+             self._shards.append({
+                 "file_handle": f_handle,
+                 "raw_data": raw_data,
+                 "index_table": index_table,
+                 "total_records": total_records
+             })
+
+             self._total_records += total_records
+             self._shard_boundaries.append(self._total_records)
+
+         # Expose primary shard properties for backward-compatibility if only 1 file exists
+         if len(self._shards) == 1:
+             self.raw_data = self._shards[0]["raw_data"]
+             self.index_table = self._shards[0]["index_table"]
+
+         self._is_open = True
+
+     def close(self):
+         """
+         Closes all mapped file handles and clears metadata.
+         """
+         if not self._is_open:
+             return
+
+         for shard in self._shards:
+             # Delete references to the memmap objects
+             del shard["raw_data"]
+             del shard["index_table"]
+
+             # Cleanly close the underlying Python file handle
+             if shard["file_handle"] is not None:
+                 shard["file_handle"].close()
+
+         self._shards = []
+         self._shard_boundaries = []
+         self._total_records = 0
+         self.raw_data = None
+         self.index_table = None
+         self._is_open = False
+
+     def __len__(self):
+         if not self._is_open:
+             self.open()
+         return self._total_records
+
+     def __getitem__(self, idx):
+         if not self._is_open:
+             self.open()
+
+         if isinstance(idx, slice):
+             start, stop, step = idx.indices(self._total_records)
+             return [self._get_single_item(i) for i in range(start, stop, step)]
+
+         if idx < 0:
+             idx += self._total_records
+
+         if idx < 0 or idx >= self._total_records:
+             raise IndexError("Record index out of range.")
+
+         return self._get_single_item(idx)
+
+     def _get_single_item(self, idx):
+         # Find which shard holds this global index using binary search
+         shard_idx = bisect.bisect_right(self._shard_boundaries, idx)
+
+         # Calculate local index within that shard
+         local_offset = 0 if shard_idx == 0 else self._shard_boundaries[shard_idx - 1]
+         local_idx = idx - local_offset
+
+         shard = self._shards[shard_idx]
+         start = 0 if local_idx == 0 else shard["index_table"][local_idx - 1]
+         end = shard["index_table"][local_idx]
+         return shard["raw_data"][start:end]
+
+     def __enter__(self):
+         self.open()
+         return self
+
+     def __exit__(self, exc_type, exc_val, exc_tb):
+         self.close()
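
The append path in `_write_single_file` (strip the index and footer, extend the payload, then write a fresh index and footer) can be sketched with the standard library alone. This is a simplified illustration under the same `"<qq8s"` footer assumption, not the module's code. `append_records` is a hypothetical helper and, unlike the original, it always writes a footer, even when no records exist.

```python
import os
import struct
import tempfile

FOOTER = struct.Struct("<qq8s")  # total_records, raw_data_size, magic
MAGIC = b"ROXXEL01"


def append_records(path, records):
    """Append records to an archive, rebuilding its trailing index and footer."""
    ends, raw_size = [], 0
    if os.path.exists(path) and os.path.getsize(path) >= FOOTER.size:
        with open(path, "rb") as f:
            f.seek(-FOOTER.size, os.SEEK_END)
            count, size, magic = FOOTER.unpack(f.read(FOOTER.size))
            if magic == MAGIC:
                # Recover the existing end offsets before they are cut off.
                f.seek(size)
                ends = list(struct.unpack(f"<{count}q", f.read(8 * count)))
                raw_size = size
    # Keep the payload of a valid archive; otherwise start from an empty file.
    with open(path, "r+b" if raw_size else "wb") as f:
        f.truncate(raw_size)  # drops the old index table and footer
        f.seek(raw_size)
        for rec in records:
            if not rec:
                continue  # empty payloads are skipped
            f.write(rec)
            raw_size += len(rec)
            ends.append(raw_size)
        f.write(struct.pack(f"<{len(ends)}q", *ends))
        f.write(FOOTER.pack(len(ends), raw_size, MAGIC))
    return ends


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.rox")  # hypothetical file name
    first = append_records(demo_path, [b"abc", b"de"])
    second = append_records(demo_path, [b"fghi"])
    final_size = os.path.getsize(demo_path)

print(first, second, final_size)
```

Because the index sits at the end of the file, appending never moves existing payload bytes; only the small trailer is rewritten.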
@@ -0,0 +1,58 @@
+ import os
+ import glob
+ import numpy as np
+ from roxxel import Roxxel
+
+ def clean_shards(base_name="test_sharded"):
+     for file in glob.glob(f"{base_name}*"):
+         os.remove(file)
+
+ def test_sharded_write_and_read():
+     print("--- Testing Roxxel Sharded Mode ---")
+     base_name = "./test_sharded"
+     clean_shards(base_name)
+
+     # 1. Generate 20 records of variable sizes
+     record_sizes = [5, 10, 15, 20] * 5  # Total 20 records
+     original_records = [bytes([i] * size) for i, size in enumerate(record_sizes)]
+
+     # 2. Write with a small shard size (e.g. 50 bytes of raw payload + metadata overhead per shard)
+     # This should trigger the creation of multiple shards automatically.
+     rox_writer = Roxxel(filepath=f"{base_name}.rox")
+
+     # We set max_shard_bytes to 180 to trigger sharding
+     rox_writer.write(original_records, max_shard_bytes=180)
+
+     # Verify shards were created on-disk
+     created_shards = sorted(glob.glob(f"{base_name}_*.rox"))
+     print(f"Created shards: {created_shards}")
+     assert len(created_shards) > 1
+
+     # 3. Read using glob pattern
+     dataset = Roxxel(filepath=f"{base_name}_*.rox")
+     with dataset:
+         print(f"Virtualized Sharded Dataset Length: {len(dataset)}")
+         assert len(dataset) == len(original_records)
+
+         # Test random access across boundaries
+         for i in range(len(dataset)):
+             record = dataset[i]
+             print(f"  Global Record {i} - Shard-resolved Size: {len(record)}, Unique Value: {record[0]}")
+             assert len(record) == record_sizes[i]
+             assert np.all(record == i)
+
+         # Test negative indexing
+         assert np.all(dataset[-1] == 19)
+
+         # Test slicing
+         sliced = dataset[2:7]
+         print(f"Sliced [2:7] returns {len(sliced)} items")
+         assert len(sliced) == 5
+         assert len(sliced[0]) == record_sizes[2]
+
+     # Clean up files
+     clean_shards(base_name)
+     print("Roxxel Sharded Mode passed successfully!\n")
+
+ if __name__ == "__main__":
+     test_sharded_write_and_read()