PyPI - roxxel - Versions diffs - 0.1.0__tar.gz - Mend

roxxel 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

roxxel-0.1.0/.github/workflows/publish.yml +34 -0
roxxel-0.1.0/.gitignore +12 -0
roxxel-0.1.0/LICENSE +21 -0
roxxel-0.1.0/PKG-INFO +189 -0
roxxel-0.1.0/README.md +174 -0
roxxel-0.1.0/pyproject.toml +26 -0
roxxel-0.1.0/roxxel.py +322 -0
roxxel-0.1.0/tests/test_roxxel.py +58 -0

roxxel-0.1.0/.github/workflows/publish.yml ADDED

@@ -0,0 +1,34 @@
+name: Publish to PyPI 📦
+on:
+  release:
+    types: [published]
+permissions:
+  contents: read
+jobs:
+  build-n-publish:
+    name: Build and publish Python distribution to PyPI
+    runs-on: ubuntu-latest
+    permissions:
+      # This permission is mandatory for secure Trusted Publishing (OIDC)
+      id-token: write
+    steps:
+    - name: Checkout source
+      uses: actions/checkout@v4
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: "3.x"
+    - name: Install build dependencies
+      run: python -m pip install build
+    - name: Build binary wheel and source tarball
+      run: python -m build
+    - name: Publish package distributions to PyPI
+      uses: pypa/gh-action-pypi-publish@release/v1

roxxel-0.1.0/.gitignore ADDED

@@ -0,0 +1,12 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# Roxxel temporary data files
+*.rox
+# System & IDE files
+.DS_Store
+Thumbs.db
+.antigravity*

roxxel-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 anon
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

roxxel-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,189 @@
+Metadata-Version: 2.4
+Name: roxxel
+Version: 0.1.0
+Summary: A zero-RAM, multi-modal, sharded binary dataset manager
+Author-email: anon160 <anon160@users.noreply.github.com>
+License: MIT
+License-File: LICENSE
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.8
+Requires-Dist: numpy>=1.20.0
+Description-Content-Type: text/markdown
+# Roxxel 🚀
+**Zero-RAM, Multi-Modal, Sharded Binary Dataset Manager**
+Roxxel is an ultra-lightweight (~300 lines of plain Python), zero-dependency (except NumPy) binary dataset format and reader designed for high-performance deep learning pipelines.
+By implementing the standard Python sequence protocol over native `numpy.memmap` views, Roxxel virtualizes massive, multi-sharded, variable-length datasets on-disk as a simple, continuous in-memory list.
+---
+## 💡 Motivation
+Mainstream deep learning data loaders—such as **PyTorch's `DataLoader`**, **Google's `Grain`**, and **TensorFlow's `tf.data`**—attempt to handle every aspect of the data pipeline (I/O, caching, multiprocessing, shuffling, collation, and transformations) in a single, massive monolithic system. This inevitably leads to severe operational friction:
+* **PyTorch DataLoader**: Using multiple workers (`num_workers > 0`) starts child processes, which on Linux are created with `fork` by default. Memory usage then tends to grow over time: CPython's reference counting writes to object headers, so pages that the workers initially share copy-on-write are gradually duplicated in every worker, which looks like a memory leak. Debugging opaque subprocess deadlocks and socket/IPC exhaustion adds further friction.
+* **Google Grain**: While powerful, it introduces a heavyweight dependency footprint and complex pipeline building abstractions that are difficult to customize or run outside of JAX-specific training pipelines.
+* **TensorFlow tf.data**: Building robust tf.data pipelines is highly complex. Additionally, it forces you to use the opaque `TFRecord` binary format, which cannot be easily inspected or read without pulling in the massive, multi-gigabyte TensorFlow library as a dependency.
+### The Roxxel Philosophy
+Roxxel shifts the architectural boundary by practicing the **Unix philosophy of doing one thing and doing it well**. It handles only the hardest, most critical parts of storage—**safe contiguous file packing, zero-RAM memory mapping, and O(1) seek indexing**—and leaves all batching, threading, and transformations to plain, standard Python and NumPy code.
+---
+## 🌟 Unique Benefits of Roxxel
+1. **Zero-RAM Overhead**: Roxxel maps your dataset into virtual memory with `numpy.memmap`, so record payloads are served from the operating system's page cache rather than copied onto the Python heap. Even for multi-terabyte datasets, resident memory is bounded by the pages you actually touch, and the kernel can evict them under memory pressure.
+2. **100% Framework Agnostic**: Because Roxxel is built purely on Python standard libraries and NumPy, it is entirely decoupled from any ML framework. You can use the exact same Roxxel dataset across **PyTorch, JAX, TensorFlow, or pure CPU environments** with zero code changes.
+3. **No Multiprocessing Deadlocks**: Because reading from memory maps is natively thread-safe and extremely fast, you can implement high-performance, asynchronous loading using simple Python threads (`threading.Thread`) or thread pools. You never have to worry about subprocess IPC bottlenecks or fork-related deadlocks.
+4. **Modality-Agnostic Variable-Length Records**: Unlike rigid formats, Roxxel accepts arbitrary variable-length binary payloads contiguously. You can store JPEGs, MP4 clips, text token arrays, or audio samples in a single, unified structure with zero padding waste.
+5. **Clean Sharded Portability**: Roxxel automatically splits massive datasets into sequentially numbered shards during writes. During reads, it seamlessly virtualizes them into a single continuous sequence using fast binary search boundaries. Shards are easy to distribute, copy, and stream over networks.
+---
+## 🛠️ File Format Architecture
+To prevent **header contamination** (where inline metadata blocks corrupt flat memory maps), Roxxel writes your entire dataset into a single contiguous binary file with a trailing index table:
+```
++-------------------------------------------------------------+
+|                                                             |
+| 1. RAW CONTIGUOUS PAYLOAD DATA SECTION                      |
+|    (No headers, no prefixes, completely clean bytes)        |
+|                                                             |
++-------------------------------------------------------------+
+|                                                             |
+| 2. TRAILING INDEX TABLE SECTION                             |
+|    (Flat array of int64 offsets pointing to record ends)    |
+|                                                             |
++-------------------------------------------------------------+
+| 3. FOOTER (Exactly 24 bytes)                                |
+|    [Total Records (8B)] [Raw Data Size (8B)] [MAGIC (8B)]  |
++-------------------------------------------------------------+
+```
+Because the raw data section is completely uninterrupted, you can interpret the entire archive as a single contiguous array in one line (e.g. for LLM token pre-training) or resolve an individual record with a single index-table lookup: $O(1)$ within a shard, plus a binary search over the shard boundaries when several shards are open.
+## 📦 Installation
+Roxxel can be installed via `pip` directly from PyPI:
+```bash
+pip install roxxel
+```
+---
+## 🚀 Getting Started
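+After installing, import `Roxxel` from the `roxxel` module. Because the library is a single file, you can alternatively copy `roxxel.py` into your project.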
+### 1. Writing a Single-File Dataset
+```python
+from roxxel import Roxxel
+# Define a generator that yields raw byte payloads
+def byte_stream():
+    for i in range(100):
+        yield bytes([i] * 50)  # Yield raw bytes
+rox = Roxxel("./dataset.rox")
+rox.write(byte_stream())
+```
+### 2. Writing a Sharded Dataset
+Specify `max_shard_bytes` to automatically split massive data streams into dynamically capped shards (e.g., `dataset_0000.rox`, `dataset_0001.rox`):
+```python
+# Limit each shard to 2GB
+rox.write(byte_stream(), max_shard_bytes=2 * 1024 * 1024 * 1024)
+```
+### 3. Reading and Shuffling (Sequence API)
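+Roxxel accepts a single path, a glob pattern, or a Python list of paths. It virtualizes all matching shards into a single read-only sequence supporting index lookups, negative indices, and slicing. Each record is returned as a read-only `uint8` NumPy view into the mapped file, and a slice returns a Python list of such views: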
+```python
+import numpy as np
+from roxxel import Roxxel
+# Read and virtualize all shards matching the glob pattern
+with Roxxel("./dataset_*.rox") as dataset:
+    print("Total virtual records:", len(dataset))
+    # 1. O(1) single index lookup
+    record = dataset[42]
+    # 2. Slice lookup
+    subset = dataset[10:20]
+    # 3. Global Shuffling (handled in three lines of plain NumPy!)
+    shuffled_indices = np.random.permutation(len(dataset))
+    for idx in shuffled_indices:
+        shuffled_record = dataset[idx]  # seek & load happens instantly in page cache
+```
+---
+## 🍳 Cookbooks
+### A. Flat Token Streaming (e.g., LLM Training)
+If you are doing LLM pre-training, you want to treat your entire dataset as one continuous stream of tokens. Roxxel allows you to ignore record boundaries and read the raw contiguous mapped memory directly:
+```python
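+import numpy as np
+from roxxel import Roxxel
+# `raw_data` is only exposed when the dataset resolves to a single file
+with Roxxel("./tokens.rox") as dataset:
+    # Reinterpret the mapped raw bytes as uint16 tokens (the payload size must be a multiple of 2)
+    tokens = dataset.raw_data.view(np.uint16)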
+    # Chunk and batch locally in NumPy:
+    seq_len = 2048
+    total_sequences = len(tokens) // seq_len
+    reshaped_batches = tokens[:total_sequences * seq_len].reshape(total_sequences, seq_len)
+```
+### B. High-Performance Asynchronous Prefetch Dataloader
+If you are training on high-performance GPUs, you want to pre-load batches on a background CPU thread to completely prevent GPU starvation:
+```python
+import queue
+import threading
+import numpy as np
+from roxxel import Roxxel
+def async_dataloader(rox_pattern, batch_size=32, prefetch_batches=4, seed=42):
+    dataset = Roxxel(rox_pattern)
+    dataset.open()
+    indices = np.arange(len(dataset))
+    rng = np.random.default_rng(seed)
+    rng.shuffle(indices)
+    q = queue.Queue(maxsize=prefetch_batches)
+    def producer():
+        try:
+            for start_idx in range(0, len(indices), batch_size):
+                batch_picks = indices[start_idx : start_idx + batch_size]
+                # Fetch the raw records; decode/stack them here if needed
+                batch_data = [dataset[idx] for idx in batch_picks]
+                q.put(batch_data)
+        finally:
+            # Always send the EOF sentinel so the consumer cannot block forever
+            q.put(None)
+            dataset.close()
+    # Start I/O in the background
+    threading.Thread(target=producer, daemon=True).start()
+    # Yield batches to the training loop
+    while True:
+        batch = q.get()
+        if batch is None:
+            break
+        yield batch
+```
+---
+## ⚖️ License
+MIT License. Feel free to use, modify, and distribute.
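
The flat-token cookbook in the description above depends on two preconditions that follow from `roxxel.py`: `raw_data` is only set when the dataset resolves to a single file, and reinterpreting bytes as 16-bit tokens requires an even byte count. The following is a minimal stdlib-only sketch of the same chunking step, assuming native byte order; `chunk_tokens` is a hypothetical helper that is not part of roxxel, and it uses `memoryview.cast` in place of `numpy.ndarray.view`.

```python
import struct


def chunk_tokens(raw: bytes, seq_len: int):
    """Reinterpret raw bytes as native-order uint16 tokens and split them
    into fixed-length sequences, dropping any incomplete tail."""
    usable = len(raw) - len(raw) % 2            # ignore a dangling odd byte
    tokens = memoryview(raw)[:usable].cast("H")  # zero-copy uint16 view
    total_sequences = len(tokens) // seq_len
    return [
        tokens[i * seq_len:(i + 1) * seq_len].tolist()
        for i in range(total_sequences)
    ]


# Seven tokens plus one stray byte; with seq_len=3 only two full sequences remain.
raw = struct.pack("=7H", *range(7)) + b"\x00"
sequences = chunk_tokens(raw, 3)
```

With NumPy the same truncate-then-reshape step is what the cookbook performs on `dataset.raw_data`; the sketch only makes the dropped remainder explicit.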

roxxel-0.1.0/README.md ADDED

@@ -0,0 +1,174 @@
+# Roxxel 🚀
+**Zero-RAM, Multi-Modal, Sharded Binary Dataset Manager**
+Roxxel is an ultra-lightweight (~300 lines of plain Python), zero-dependency (except NumPy) binary dataset format and reader designed for high-performance deep learning pipelines.
+By implementing the standard Python sequence protocol over native `numpy.memmap` views, Roxxel virtualizes massive, multi-sharded, variable-length datasets on-disk as a simple, continuous in-memory list.
+---
+## 💡 Motivation
+Mainstream deep learning data loaders—such as **PyTorch's `DataLoader`**, **Google's `Grain`**, and **TensorFlow's `tf.data`**—attempt to handle every aspect of the data pipeline (I/O, caching, multiprocessing, shuffling, collation, and transformations) in a single, massive monolithic system. This inevitably leads to severe operational friction:
+* **PyTorch DataLoader**: Using multiple workers (`num_workers > 0`) starts child processes, which on Linux are created with `fork` by default. Memory usage then tends to grow over time: CPython's reference counting writes to object headers, so pages that the workers initially share copy-on-write are gradually duplicated in every worker, which looks like a memory leak. Debugging opaque subprocess deadlocks and socket/IPC exhaustion adds further friction.
+* **Google Grain**: While powerful, it introduces a heavyweight dependency footprint and complex pipeline building abstractions that are difficult to customize or run outside of JAX-specific training pipelines.
+* **TensorFlow tf.data**: Building robust tf.data pipelines is highly complex. Additionally, it forces you to use the opaque `TFRecord` binary format, which cannot be easily inspected or read without pulling in the massive, multi-gigabyte TensorFlow library as a dependency.
+### The Roxxel Philosophy
+Roxxel shifts the architectural boundary by practicing the **Unix philosophy of doing one thing and doing it well**. It handles only the hardest, most critical parts of storage—**safe contiguous file packing, zero-RAM memory mapping, and O(1) seek indexing**—and leaves all batching, threading, and transformations to plain, standard Python and NumPy code.
+---
+## 🌟 Unique Benefits of Roxxel
+1. **Zero-RAM Overhead**: Roxxel maps your dataset into virtual memory with `numpy.memmap`, so record payloads are served from the operating system's page cache rather than copied onto the Python heap. Even for multi-terabyte datasets, resident memory is bounded by the pages you actually touch, and the kernel can evict them under memory pressure.
+2. **100% Framework Agnostic**: Because Roxxel is built purely on Python standard libraries and NumPy, it is entirely decoupled from any ML framework. You can use the exact same Roxxel dataset across **PyTorch, JAX, TensorFlow, or pure CPU environments** with zero code changes.
+3. **No Multiprocessing Deadlocks**: Because reading from memory maps is natively thread-safe and extremely fast, you can implement high-performance, asynchronous loading using simple Python threads (`threading.Thread`) or thread pools. You never have to worry about subprocess IPC bottlenecks or fork-related deadlocks.
+4. **Modality-Agnostic Variable-Length Records**: Unlike rigid formats, Roxxel accepts arbitrary variable-length binary payloads contiguously. You can store JPEGs, MP4 clips, text token arrays, or audio samples in a single, unified structure with zero padding waste.
+5. **Clean Sharded Portability**: Roxxel automatically splits massive datasets into sequentially numbered shards during writes. During reads, it seamlessly virtualizes them into a single continuous sequence using fast binary search boundaries. Shards are easy to distribute, copy, and stream over networks.
+---
+## 🛠️ File Format Architecture
+To prevent **header contamination** (where inline metadata blocks corrupt flat memory maps), Roxxel writes your entire dataset into a single contiguous binary file with a trailing index table:
+```
++-------------------------------------------------------------+
+|                                                             |
+| 1. RAW CONTIGUOUS PAYLOAD DATA SECTION                      |
+|    (No headers, no prefixes, completely clean bytes)        |
+|                                                             |
++-------------------------------------------------------------+
+|                                                             |
+| 2. TRAILING INDEX TABLE SECTION                             |
+|    (Flat array of int64 offsets pointing to record ends)    |
+|                                                             |
++-------------------------------------------------------------+
+| 3. FOOTER (Exactly 24 bytes)                                |
+|    [Total Records (8B)] [Raw Data Size (8B)] [MAGIC (8B)]  |
++-------------------------------------------------------------+
+```
+Because the raw data section is completely uninterrupted, you can interpret the entire archive as a single contiguous array in one line (e.g. for LLM token pre-training) or resolve an individual record with a single index-table lookup: $O(1)$ within a shard, plus a binary search over the shard boundaries when several shards are open.
+## 📦 Installation
+Roxxel can be installed via `pip` directly from PyPI:
+```bash
+pip install roxxel
+```
+---
+## 🚀 Getting Started
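+After installing, import `Roxxel` from the `roxxel` module. Because the library is a single file, you can alternatively copy `roxxel.py` into your project.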
+### 1. Writing a Single-File Dataset
+```python
+from roxxel import Roxxel
+# Define a generator that yields raw byte payloads
+def byte_stream():
+    for i in range(100):
+        yield bytes([i] * 50)  # Yield raw bytes
+rox = Roxxel("./dataset.rox")
+rox.write(byte_stream())
+```
+### 2. Writing a Sharded Dataset
+Specify `max_shard_bytes` to automatically split massive data streams into dynamically capped shards (e.g., `dataset_0000.rox`, `dataset_0001.rox`):
+```python
+# Limit each shard to 2GB
+rox.write(byte_stream(), max_shard_bytes=2 * 1024 * 1024 * 1024)
+```
+### 3. Reading and Shuffling (Sequence API)
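+Roxxel accepts a single path, a glob pattern, or a Python list of paths. It virtualizes all matching shards into a single read-only sequence supporting index lookups, negative indices, and slicing. Each record is returned as a read-only `uint8` NumPy view into the mapped file, and a slice returns a Python list of such views: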
+```python
+import numpy as np
+from roxxel import Roxxel
+# Read and virtualize all shards matching the glob pattern
+with Roxxel("./dataset_*.rox") as dataset:
+    print("Total virtual records:", len(dataset))
+    # 1. O(1) single index lookup
+    record = dataset[42]
+    # 2. Slice lookup
+    subset = dataset[10:20]
+    # 3. Global Shuffling (handled in three lines of plain NumPy!)
+    shuffled_indices = np.random.permutation(len(dataset))
+    for idx in shuffled_indices:
+        shuffled_record = dataset[idx]  # seek & load happens instantly in page cache
+```
+---
+## 🍳 Cookbooks
+### A. Flat Token Streaming (e.g., LLM Training)
+If you are doing LLM pre-training, you want to treat your entire dataset as one continuous stream of tokens. Roxxel allows you to ignore record boundaries and read the raw contiguous mapped memory directly:
+```python
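+import numpy as np
+from roxxel import Roxxel
+# `raw_data` is only exposed when the dataset resolves to a single file
+with Roxxel("./tokens.rox") as dataset:
+    # Reinterpret the mapped raw bytes as uint16 tokens (the payload size must be a multiple of 2)
+    tokens = dataset.raw_data.view(np.uint16)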
+    # Chunk and batch locally in NumPy:
+    seq_len = 2048
+    total_sequences = len(tokens) // seq_len
+    reshaped_batches = tokens[:total_sequences * seq_len].reshape(total_sequences, seq_len)
+```
+### B. High-Performance Asynchronous Prefetch Dataloader
+If you are training on high-performance GPUs, you want to pre-load batches on a background CPU thread to completely prevent GPU starvation:
+```python
+import queue
+import threading
+import numpy as np
+from roxxel import Roxxel
+def async_dataloader(rox_pattern, batch_size=32, prefetch_batches=4, seed=42):
+    dataset = Roxxel(rox_pattern)
+    dataset.open()
+    indices = np.arange(len(dataset))
+    rng = np.random.default_rng(seed)
+    rng.shuffle(indices)
+    q = queue.Queue(maxsize=prefetch_batches)
+    def producer():
+        try:
+            for start_idx in range(0, len(indices), batch_size):
+                batch_picks = indices[start_idx : start_idx + batch_size]
+                # Fetch the raw records; decode/stack them here if needed
+                batch_data = [dataset[idx] for idx in batch_picks]
+                q.put(batch_data)
+        finally:
+            # Always send the EOF sentinel so the consumer cannot block forever
+            q.put(None)
+            dataset.close()
+    # Start I/O in the background
+    threading.Thread(target=producer, daemon=True).start()
+    # Yield batches to the training loop
+    while True:
+        batch = q.get()
+        if batch is None:
+            break
+        yield batch
+```
+---
+## ⚖️ License
+MIT License. Feel free to use, modify, and distribute.
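
The file layout described in the README (raw payload, trailing index of record end offsets, 24-byte footer) can be exercised without the library. The following is a minimal in-memory sketch, assuming the `<qq8s` footer struct, little-endian int64 offsets, and the `ROXXEL01` magic used in `roxxel.py`; `build_archive` and `read_records` are hypothetical helpers for illustration, not part of the package.

```python
import struct

FOOTER = struct.Struct("<qq8s")  # total_records, raw_data_size, magic
MAGIC = b"ROXXEL01"


def build_archive(records):
    """Pack records as: raw payload + int64 end offsets + 24-byte footer."""
    payload = b"".join(records)
    ends, position = [], 0
    for record in records:
        position += len(record)
        ends.append(position)
    index = struct.pack(f"<{len(ends)}q", *ends)
    return payload + index + FOOTER.pack(len(ends), len(payload), MAGIC)


def read_records(blob):
    """Recover the records by reading the footer first, then the index."""
    total, raw_size, magic = FOOTER.unpack(blob[-FOOTER.size:])
    if magic != MAGIC:
        raise ValueError("not a Roxxel archive")
    ends = struct.unpack_from(f"<{total}q", blob, raw_size)
    starts = (0,) + ends[:-1]  # each record starts where the previous one ends
    return [blob[s:e] for s, e in zip(starts, ends)]


blob = build_archive([b"ab", b"cde", b"f"])
records = read_records(blob)
```

Reading from the end is what lets the payload section stay free of headers: the footer gives the payload size, which is also the byte offset of the index table.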

roxxel-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,26 @@
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[project]
+name = "roxxel"
+version = "0.1.0"
+description = "A zero-RAM, multi-modal, sharded binary dataset manager"
+readme = "README.md"
+requires-python = ">=3.8"
+license = {text = "MIT"}
+authors = [
+    {name = "anon160", email = "anon160@users.noreply.github.com"}
+]
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+]
+dependencies = [
+    "numpy>=1.20.0"
+]
+[tool.hatch.build.targets.wheel]
+packages = ["roxxel.py"]

roxxel-0.1.0/roxxel.py ADDED

@@ -0,0 +1,322 @@
+import os
+import glob
+import struct
+import bisect
+import numpy as np
+class Roxxel:
+    """
+    A bare-bones, zero-RAM single-file or multi-sharded dataset manager.
+    Stores raw contiguous payload data, a trailing index table, and a 24-byte footer.
+    Seamlessly virtualizes multiple shards on-disk into a single continuous sequence.
+    """
+    MAGIC_SIGNATURE = b"ROXXEL01"  # 8-byte magic tag identifying a Roxxel archive
+    def __init__(self, filepath="./stream_reservoir.rox"):
+        self.raw_data = None
+        self.index_table = None
+        self._total_records = 0
+        self._is_open = False
+        self._shards = []
+        self._shard_boundaries = []
+        # Support single string, list of strings, or glob patterns
+        if isinstance(filepath, list):
+            self.filepaths = filepath
+        elif isinstance(filepath, str):
+            if "*" in filepath or "?" in filepath:
+                self.filepaths = sorted(glob.glob(filepath))
+            else:
+                self.filepaths = [filepath]
+        else:
+            raise TypeError("filepath must be a string (file/pattern) or a list of strings.")
+    # =====================================================================
+    # API 1: WRITE STREAM (WITH SHARDING SUPPORT)
+    # =====================================================================
+    def write(self, data_generator, max_shard_bytes=None):
+        """
+        Accepts an iterable stream of raw python byte objects.
+        If max_shard_bytes is None, writes/appends to a single file.
+        If max_shard_bytes is provided, splits the stream across multiple shards (e.g., dataset_0000.rox).
+        """
+        self.close()
+        if len(self.filepaths) == 0:
+            raise ValueError("No filepath specified to write to.")
+        if max_shard_bytes is None:
+            self._write_single_file(self.filepaths[0], data_generator)
+            return
+        base_path = self.filepaths[0]
+        if base_path.endswith(".rox"):
+            base_name = base_path[:-4]
+        else:
+            base_name = base_path
+        # Find first unused shard index
+        shard_idx = 0
+        while os.path.exists(f"{base_name}_{shard_idx:04d}.rox"):
+            shard_idx += 1
+        current_shard_path = None
+        end_offsets = []
+        raw_data_size = 0
+        # Try to append to the last existing shard if it has room
+        if shard_idx > 0:
+            last_shard_path = f"{base_name}_{shard_idx-1:04d}.rox"
+            last_shard_size = os.path.getsize(last_shard_path)
+            if last_shard_size < max_shard_bytes:
+                current_shard_path = last_shard_path
+                shard_idx -= 1
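+                file_signature = None
+                # Only read the footer if the file is large enough to contain one;
+                # a shorter file is treated like an invalid archive and overwritten
+                if last_shard_size >= 24:
+                    with open(current_shard_path, "rb") as f:
+                        f.seek(last_shard_size - 24)
+                        footer_block = f.read(24)
+                        total_records, raw_data_size, file_signature = struct.unpack("<qq8s", footer_block)
+                if file_signature == self.MAGIC_SIGNATURE: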
+                    with open(current_shard_path, "rb") as f:
+                        f.seek(raw_data_size)
+                        end_offsets = np.fromfile(f, dtype="<i8", count=total_records).tolist()
+                    with open(current_shard_path, "r+b") as f:
+                        f.truncate(raw_data_size)
+                else:
+                    current_shard_path = f"{base_name}_{shard_idx:04d}.rox"
+                    end_offsets = []
+                    raw_data_size = 0
+            else:
+                current_shard_path = f"{base_name}_{shard_idx:04d}.rox"
+        else:
+            current_shard_path = f"{base_name}_{shard_idx:04d}.rox"
+        current_offset = raw_data_size
+        # Truncate file to 0 if starting a fresh or overwritten shard
+        if current_offset == 0 and os.path.exists(current_shard_path):
+            open(current_shard_path, "wb").close()
+        f_out = open(current_shard_path, "ab")
+        try:
+            for item_bytes in data_generator:
+                if not isinstance(item_bytes, bytes):
+                    raise TypeError("Data generator must exclusively yield raw python 'bytes' objects.")
+                payload_size = len(item_bytes)
+                if payload_size == 0:
+                    continue
+                # Predict shard size: raw data + index table (8 bytes per record) + 24-byte footer
+                estimated_size = current_offset + payload_size + (len(end_offsets) + 1) * 8 + 24
+                if estimated_size > max_shard_bytes and len(end_offsets) > 0:
+                    f_out.close()
+                    self._finalize_shard(current_shard_path, end_offsets, current_offset)
+                    shard_idx += 1
+                    current_shard_path = f"{base_name}_{shard_idx:04d}.rox"
+                    print(f"📦 Shard limit reached. Creating new shard: {current_shard_path}")
+                    end_offsets = []
+                    current_offset = 0
+                    f_out = open(current_shard_path, "ab")
+                f_out.write(item_bytes)
+                current_offset += payload_size
+                end_offsets.append(current_offset)
+        finally:
+            f_out.close()
+        if len(end_offsets) > 0:
+            self._finalize_shard(current_shard_path, end_offsets, current_offset)
+    def _write_single_file(self, path, data_generator):
+        end_offsets = []
+        raw_data_size = 0
+        if os.path.exists(path):
+            total_file_bytes = os.path.getsize(path)
+            if total_file_bytes >= 24:
+                with open(path, "rb") as f:
+                    f.seek(total_file_bytes - 24)
+                    footer_block = f.read(24)
+                    total_records, raw_data_size, file_signature = struct.unpack("<qq8s", footer_block)
+                if file_signature == self.MAGIC_SIGNATURE:
+                    print(f"♻️ Found existing archive. Stripping index and footer...")
+                    with open(path, "rb") as f:
+                        f.seek(raw_data_size)
+                        end_offsets = np.fromfile(f, dtype="<i8", count=total_records).tolist()
+                    with open(path, "r+b") as f:
+                        f.truncate(raw_data_size)
+                else:
+                    print("⚠️ Invalid signature in existing archive. Overwriting/starting fresh...")
+                    end_offsets = []
+                    raw_data_size = 0
+        current_offset = raw_data_size
+        # Truncate file to 0 if starting fresh or overwriting an invalid archive
+        if current_offset == 0 and os.path.exists(path):
+            open(path, "wb").close()
+        with open(path, "ab") as f:
+            for item_bytes in data_generator:
+                if not isinstance(item_bytes, bytes):
+                    raise TypeError("Data generator must exclusively yield raw python 'bytes' objects.")
+                payload_size = len(item_bytes)
+                if payload_size == 0:
+                    continue
+                f.write(item_bytes)
+                current_offset += payload_size
+                end_offsets.append(current_offset)
+        if len(end_offsets) > 0:
+            self._finalize_shard(path, end_offsets, current_offset)
+    def _finalize_shard(self, path, end_offsets, raw_data_size):
+        total_records = len(end_offsets)
+        with open(path, "ab") as f:
+            np.array(end_offsets, dtype="<i8").tofile(f)
+            footer = struct.pack("<qq8s", total_records, raw_data_size, self.MAGIC_SIGNATURE)
+            f.write(footer)
+        print(f"✅ Finalized shard {os.path.basename(path)} - Records: {total_records}, Data Bytes: {raw_data_size}")
+    # =====================================================================
+    # API 2: READ / LOAD (SHARDED SEQUENCE INTERFACE)
+    # =====================================================================
+    def open(self):
+        """
+        Memory maps all files in the sharded dataset for high-performance read-only access.
+        """
+        if self._is_open:
+            return
+        self._shards = []
+        self._shard_boundaries = []
+        self._total_records = 0
+        # In case globs returned nothing
+        if len(self.filepaths) == 0:
+            raise FileNotFoundError("No matching files found for the specified dataset path/pattern.")
+        for path in self.filepaths:
+            if not os.path.exists(path):
+                raise FileNotFoundError(f"Missing dataset shard file at {path}.")
+            total_file_bytes = os.path.getsize(path)
+            if total_file_bytes < 24:
+                raise ValueError(f"Corrupted shard {path}: size is less than footer size.")
+            with open(path, "rb") as f:
+                f.seek(total_file_bytes - 24)
+                footer_block = f.read(24)
+                total_records, raw_data_size, file_signature = struct.unpack("<qq8s", footer_block)
+            if file_signature != self.MAGIC_SIGNATURE:
+                raise ValueError(f"Corrupted signature in shard {path}.")
+            # Open standard python file handle for safe, pythonic descriptor management
+            f_handle = open(path, "rb")
+            # Memory map the raw data and index table using the file handle
+            raw_data = np.memmap(
+                f_handle,
+                dtype=np.uint8,
+                mode="r",
+                offset=0,
+                shape=(raw_data_size,)
+            )
+            index_table = np.memmap(
+                f_handle,
+                dtype=np.int64,
+                mode="r",
+                offset=raw_data_size,
+                shape=(total_records,)
+            )
+            self._shards.append({
+                "file_handle": f_handle,
+                "raw_data": raw_data,
+                "index_table": index_table,
+                "total_records": total_records
+            })
+            self._total_records += total_records
+            self._shard_boundaries.append(self._total_records)
+        # Expose primary shard properties for backward-compatibility if only 1 file exists
+        if len(self._shards) == 1:
+            self.raw_data = self._shards[0]["raw_data"]
+            self.index_table = self._shards[0]["index_table"]
+        self._is_open = True
+    def close(self):
+        """
+        Closes all mapped file handles and clears metadata.
+        """
+        if not self._is_open:
+            return
+        for shard in self._shards:
+            # Delete references to the memmap objects
+            del shard["raw_data"]
+            del shard["index_table"]
+            # Cleanly close the underlying Python file handle
+            if shard["file_handle"] is not None:
+                shard["file_handle"].close()
+        self._shards = []
+        self._shard_boundaries = []
+        self._total_records = 0
+        self.raw_data = None
+        self.index_table = None
+        self._is_open = False
+    def __len__(self):
+        if not self._is_open:
+            self.open()
+        return self._total_records
+    def __getitem__(self, idx):
+        if not self._is_open:
+            self.open()
+        if isinstance(idx, slice):
+            start, stop, step = idx.indices(self._total_records)
+            return [self._get_single_item(i) for i in range(start, stop, step)]
+        if idx < 0:
+            idx += self._total_records
+        if idx < 0 or idx >= self._total_records:
+            raise IndexError("Record index out of range.")
+        return self._get_single_item(idx)
+    def _get_single_item(self, idx):
+        # Find which shard holds this global index using binary search
+        shard_idx = bisect.bisect_right(self._shard_boundaries, idx)
+        # Calculate local index within that shard
+        local_offset = 0 if shard_idx == 0 else self._shard_boundaries[shard_idx - 1]
+        local_idx = idx - local_offset
+        shard = self._shards[shard_idx]
+        start = 0 if local_idx == 0 else shard["index_table"][local_idx - 1]
+        end = shard["index_table"][local_idx]
+        return shard["raw_data"][start:end]
+    def __enter__(self):
+        self.open()
+        return self
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.close()
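
The lookup in `_get_single_item` rests on the cumulative `_shard_boundaries` list: entry `k` holds the number of records in shards `0..k`, so `bisect_right` returns the shard that owns a global index. The following is a standalone sketch of that mapping, with the negative-index and range handling from `__getitem__` folded in; `locate` is a hypothetical name used only for illustration.

```python
import bisect


def locate(shard_boundaries, idx):
    """Map a global record index to (shard number, index within that shard).

    shard_boundaries is the cumulative record count per shard,
    e.g. [7, 14, 20] for shards holding 7, 7 and 6 records.
    """
    total = shard_boundaries[-1] if shard_boundaries else 0
    if idx < 0:
        idx += total  # support negative indices like a list
    if not 0 <= idx < total:
        raise IndexError("Record index out of range.")
    shard = bisect.bisect_right(shard_boundaries, idx)
    first_in_shard = 0 if shard == 0 else shard_boundaries[shard - 1]
    return shard, idx - first_in_shard


boundaries = [7, 14, 20]
first_of_second_shard = locate(boundaries, 7)   # index 7 is the 8th record
last_record = locate(boundaries, -1)
```

`bisect_right` rather than `bisect_left` matters at the boundaries: global index 7 equals the first boundary, and it must resolve to shard 1, not shard 0.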

roxxel-0.1.0/tests/test_roxxel.py ADDED

@@ -0,0 +1,58 @@
+import os
+import glob
+import numpy as np
+from roxxel import Roxxel
+def clean_shards(base_name="test_sharded"):
+    for file in glob.glob(f"{base_name}*"):
+        os.remove(file)
+def test_sharded_write_and_read():
+    print("--- Testing Roxxel Sharded Mode ---")
+    base_name = "./test_sharded"
+    clean_shards(base_name)
+    # 1. Generate 20 records of variable sizes
+    record_sizes = [5, 10, 15, 20] * 5  # Total 20 records
+    original_records = [bytes([i] * size) for i, size in enumerate(record_sizes)]
+    # 2. Write with a small shard limit. The 250 bytes of payload, plus 8 bytes of index per record
+    # and a 24-byte footer per shard, cannot fit in one 180-byte shard, so several shards are created.
+    rox_writer = Roxxel(filepath=f"{base_name}.rox")
+    # We set max_shard_bytes to 180 to trigger sharding
+    rox_writer.write(original_records, max_shard_bytes=180)
+    # Verify shards were created on-disk
+    created_shards = sorted(glob.glob(f"{base_name}_*.rox"))
+    print(f"Created shards: {created_shards}")
+    assert len(created_shards) > 1
+    # 3. Read using glob pattern
+    dataset = Roxxel(filepath=f"{base_name}_*.rox")
+    with dataset:
+        print(f"Virtualized Sharded Dataset Length: {len(dataset)}")
+        assert len(dataset) == len(original_records)
+        # Test random access across boundaries
+        for i in range(len(dataset)):
+            record = dataset[i]
+            print(f"  Global Record {i} - Shard-resolved Size: {len(record)}, Unique Value: {record[0]}")
+            assert len(record) == record_sizes[i]
+            assert np.all(record == i)
+        # Test negative indexing
+        assert np.all(dataset[-1] == 19)
+        # Test slicing
+        sliced = dataset[2:7]
+        print(f"Sliced [2:7] returns {len(sliced)} items")
+        assert len(sliced) == 5
+        assert len(sliced[0]) == record_sizes[2]
+    # Clean up files
+    clean_shards(base_name)
+    print("Roxxel Sharded Mode passed successfully!\n")
+if __name__ == "__main__":
+    test_sharded_write_and_read()
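
The test only asserts that more than one shard is produced. The split itself can be predicted from the size estimate in `write` (`current_offset + payload_size + (len(end_offsets) + 1) * 8 + 24`). The following is a pure-Python sketch of that rule, assuming a fresh write with no existing shards to append to; `plan_shards` is a hypothetical helper, not part of roxxel.

```python
def plan_shards(record_sizes, max_shard_bytes, index_entry=8, footer=24):
    """Return the number of records each shard would receive."""
    counts = []
    offset = 0   # payload bytes already placed in the current shard
    n = 0        # records already placed in the current shard
    for size in record_sizes:
        if size == 0:
            continue  # empty payloads are skipped by the writer
        estimated = offset + size + (n + 1) * index_entry + footer
        # Roll over only if the shard already holds a record, so an
        # oversized record still gets a shard of its own.
        if estimated > max_shard_bytes and n > 0:
            counts.append(n)
            offset, n = 0, 0
        offset += size
        n += 1
    if n > 0:
        counts.append(n)
    return counts


# Same inputs as the test: 20 records, 180-byte shard limit.
plan = plan_shards([5, 10, 15, 20] * 5, 180)
```

For the test's inputs this yields three shards holding 7, 7 and 6 records, which matches the cumulative boundaries `[7, 14, 20]` that the reader would build.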