fastsafetensor-3fs-reader 0.3.3__tar.gz

Files changed (29)

fastsafetensor_3fs_reader-0.3.3/MANIFEST.in ADDED

@@ -0,0 +1,6 @@
+include LICENSE
+include README.md
+include pyproject.toml
+include setup.py
+recursive-include fastsafetensor_3fs_reader *.py *.cpp *.hpp *.h *.cc
+recursive-include tests *.py

fastsafetensor_3fs_reader-0.3.3/PKG-INFO ADDED

@@ -0,0 +1,218 @@
+Metadata-Version: 2.4
+Name: fastsafetensor-3fs-reader
+Version: 0.3.3
+Summary: 3FS USRBIO file reader for fastsafetensors
+License: Apache-2.0
+Project-URL: Repository, https://github.com/ABNER-1/fastsafetensor_3fs_reader
+Keywords: 3fs,usrbio,safetensors,gpu,io
+Classifier: Development Status :: 3 - Alpha
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+Provides-Extra: test
+Requires-Dist: pytest>=8.1.1; extra == "test"
+Requires-Dist: numpy; extra == "test"
+Provides-Extra: gpu
+Requires-Dist: torch>=2.0; extra == "gpu"
+Provides-Extra: lint
+Requires-Dist: ruff>=0.6.0; extra == "lint"
+# fastsafetensor-3fs-reader
+3FS USRBIO file reader for fastsafetensors.
+This package provides a high-performance reader for 3FS USRBIO files with
+two backend implementations (C++ and pure-Python) and a mock for testing.
+## Backends
+| Backend | Module | Requirements | Performance |
+|---------|--------|-------------|-------------|
+| **C++** | `reader_cpp.py` | `libhf3fs_api_shared.so` + libtorch + CUDA | Best (GIL-free, native USRBIO async I/O) |
+| **Python** | `reader_py.py` | `hf3fs_py_usrbio` (+ optional PyTorch for GPU) | Good (USRBIO via Client API or OS pread) |
+| **Mock** | `mock.py` | None | For testing only |
+The package auto-selects the best available backend at import time:
+C++ → Python → Mock.  Use `get_backend()` to check which one is active.
+> **Note:** The C++ backend supports **pipelined mode** (double-buffered async
+> H2D copy via `cudaMemcpyAsync`) which overlaps network I/O with GPU memory
+> transfer for significantly better throughput.  Pass `pipelined=True` to
+> `read_chunked()` to enable it.  The Python backend does not support
+> pipelining and will silently fall back to non-pipelined mode.
+## Installation
+### Pure-Python mode (no C++ compilation)
+```bash
+FST3FS_NO_EXT=1 pip install .
+```
+### With C++ extension
+Requires `libhf3fs_api_shared.so` (from a 3FS build) and CUDA Runtime.
+The `hf3fs_usrbio.h` header is bundled in the package, so no external
+header dependency is needed:
+```bash
+export HF3FS_LIB_DIR=/path/to/3FS/build/lib         # directory with libhf3fs_api_shared.so
+pip install .
+```
+### Automatic `libhf3fs_api_shared.so` discovery
+At import time, the package automatically searches for
+`libhf3fs_api_shared.so` using the following priority:
+1. **`HF3FS_LIB_DIR`** environment variable (user-explicit, highest priority).
+2. **`LD_LIBRARY_PATH`** directories (user already configured).
+3. **`hf3fs_py_usrbio` pip install path** — if `hf3fs_py_usrbio` is installed
+   via pip, the library is typically located in a sibling `.libs/` directory
+   (e.g. `site-packages/hf3fs_py_usrbio.libs/`).  This is discovered
+   automatically so you don't need to set `LD_LIBRARY_PATH` manually.
+The library is pre-loaded with `RTLD_GLOBAL` so that both the C++ and
+Python backends can resolve its symbols.  Use `get_hf3fs_lib_path()` to
+check which path was loaded:
+```python
+from fastsafetensor_3fs_reader import get_hf3fs_lib_path
+print(get_hf3fs_lib_path())  # e.g. "/path/to/site-packages/hf3fs_py_usrbio.libs/libhf3fs_api_shared.so"
+```
+### Installing hf3fs_py_usrbio (for the Python backend)
+`hf3fs_py_usrbio` is **not** available on PyPI.  It must be built from the
+[DeepSeek 3FS](https://github.com/deepseek-ai/3FS) source tree:
+```bash
+git clone https://github.com/deepseek-ai/3FS
+cd 3FS
+git submodule update --init --recursive
+# Follow 3FS build instructions (cmake, etc.)
+# After build, install the Python package:
+cd build && pip install ..
+```
+> **Important:** The default pip-installed `hf3fs_py_usrbio` package is
+> suitable for **testing and validation** but is **not recommended for
+> production use**.  For production deployments, build 3FS from source with
+> optimized compiler flags tailored to your hardware.  Refer to projects like
+> [SGLang](https://github.com/sgl-project/sglang) for examples of
+> production-grade 3FS compilation workflows.
+## Usage
+```python
+from fastsafetensor_3fs_reader import (
+    ThreeFSFileReader,
+    MockFileReader,
+    is_available,
+    get_backend,
+)
+# Check which backend is active
+print(f"Backend: {get_backend()}")  # "cpp", "python", or "mock"
+# Use mock reader for testing (always available)
+reader = MockFileReader()
+headers = reader.read_headers_batch(["/path/to/file.safetensors"])
+reader.close()
+# Use 3FS reader when available
+if is_available():
+    reader = ThreeFSFileReader(mount_point="/mnt/3fs")
+    headers = reader.read_headers_batch([
+        "/mnt/3fs/model-00001.safetensors",
+        "/mnt/3fs/model-00002.safetensors",
+    ])
+    # Read tensor data into GPU memory
+    import torch
+    buf = torch.empty(1024 * 1024, dtype=torch.uint8, device="cuda")
+    bytes_read = reader.read_chunked(
+        path="/mnt/3fs/model-00001.safetensors",
+        dev_ptr=buf.data_ptr(),
+        file_offset=0,
+        total_length=1024 * 1024,
+    )
+    reader.close()
+```
+## Benchmark
+The `hack/benchmark/` directory contains a comprehensive benchmarking suite.
+Use `benchmark_runner.py` to measure read throughput across different backends,
+buffer sizes, chunk sizes, and process counts.
+### Full benchmark (read + GPU copy)
+```bash
+python hack/benchmark/benchmark_runner.py \
+    --mount-point /mnt/3fs \
+    --backends cpp,python \
+    --buffer-sizes 8,16,32,64,128,256,512 \
+    --chunk-sizes 8,16,32,64,128,256,512 \
+    --num-processes 1,2,4,8 \
+    --iterations 3
+```
+### Download-only benchmark (host memory only, no GPU copy)
+```bash
+python hack/benchmark/benchmark_runner.py \
+    --mount-point /mnt/3fs \
+    --backends cpp,python \
+    --buffer-sizes 8,16,32,64,128,256,512 \
+    --chunk-sizes 8,16,32,64,128,256,512 \
+    --num-processes 1,2,4,8 \
+    --download-only \
+    --iterations 3
+```
+### Key parameters
+| Parameter | Description | Default                       |
+|-----------|-------------|-------------------------------|
+| `--mount-point` | 3FS FUSE mount-point path | *(required)*                  |
+| `--backends` | Comma-separated backend names | `mock,python,cpp`             |
+| `--buffer-sizes` | Buffer sizes in MB | `8,16,32,64,128,256,512,1024` |
+| `--chunk-sizes` | Chunk sizes in MB | `8,16,32,64,128,256,512,1024`         |
+| `--num-processes` | Process counts | `1,2,4,8`                     |
+| `--download-only` | Read into host memory only (skip GPU copy) | `false`                       |
+| `--iterations` | Iterations per combination | `3`                           |
+| `--mode` | `grid` (sweep all combos) or `single` | `grid`                        |
+| `--output-dir` | Directory for CSV and chart output | `./benchmark_results`         |
+### Performance Results
+> **Test environment:** Single 400 Gbps RDMA NIC.
+> These numbers represent a **loading baseline** under specific storage and
+> network hardware conditions — they do **not** represent the performance
+> ceiling of the system.
+**Model:** DeepSeek-V3 (total ~640 GB safetensors)
+| Configuration | Avg Throughput (GB/s) | Peak Throughput with fastsafetensors (GB/s) | Load Time (s) | Backend |
+|---|---|---|---|---|
+| 8 processes, buffer=8 MB | 35.0 | 32.0 | 30.34 | C++ (non-pipelined) |
+| 8 processes, buffer=16 MB | 37.6 | 36.6 | 25.73 | C++ (pipelined) |
+#### Benchmark: RDMA throughput across buffer sizes (8M / 16M / 32M)
+![RDMA throughput across buffer sizes](docs/images/cpp_performance.png)
+#### Production: model weight loading with fastsafetensors (pipelined, peak 36.6 GB/s)
+![Model weight loading throughput](docs/images/cpp_load.png)
+## License
+Apache-2.0

fastsafetensor_3fs_reader-0.3.3/README.md ADDED

@@ -0,0 +1,192 @@
+# fastsafetensor-3fs-reader
+3FS USRBIO file reader for fastsafetensors.
+This package provides a high-performance reader for 3FS USRBIO files with
+two backend implementations (C++ and pure-Python) and a mock for testing.
+## Backends
+| Backend | Module | Requirements | Performance |
+|---------|--------|-------------|-------------|
+| **C++** | `reader_cpp.py` | `libhf3fs_api_shared.so` + libtorch + CUDA | Best (GIL-free, native USRBIO async I/O) |
+| **Python** | `reader_py.py` | `hf3fs_py_usrbio` (+ optional PyTorch for GPU) | Good (USRBIO via Client API or OS pread) |
+| **Mock** | `mock.py` | None | For testing only |
+The package auto-selects the best available backend at import time:
+C++ → Python → Mock.  Use `get_backend()` to check which one is active.
+> **Note:** The C++ backend supports **pipelined mode** (double-buffered async
+> H2D copy via `cudaMemcpyAsync`) which overlaps network I/O with GPU memory
+> transfer for significantly better throughput.  Pass `pipelined=True` to
+> `read_chunked()` to enable it.  The Python backend does not support
+> pipelining and will silently fall back to non-pipelined mode.
+## Installation
+### Pure-Python mode (no C++ compilation)
+```bash
+FST3FS_NO_EXT=1 pip install .
+```
+### With C++ extension
+Requires `libhf3fs_api_shared.so` (from a 3FS build) and CUDA Runtime.
+The `hf3fs_usrbio.h` header is bundled in the package, so no external
+header dependency is needed:
+```bash
+export HF3FS_LIB_DIR=/path/to/3FS/build/lib         # directory with libhf3fs_api_shared.so
+pip install .
+```
+### Automatic `libhf3fs_api_shared.so` discovery
+At import time, the package automatically searches for
+`libhf3fs_api_shared.so` using the following priority:
+1. **`HF3FS_LIB_DIR`** environment variable (user-explicit, highest priority).
+2. **`LD_LIBRARY_PATH`** directories (user already configured).
+3. **`hf3fs_py_usrbio` pip install path** — if `hf3fs_py_usrbio` is installed
+   via pip, the library is typically located in a sibling `.libs/` directory
+   (e.g. `site-packages/hf3fs_py_usrbio.libs/`).  This is discovered
+   automatically so you don't need to set `LD_LIBRARY_PATH` manually.
+The library is pre-loaded with `RTLD_GLOBAL` so that both the C++ and
+Python backends can resolve its symbols.  Use `get_hf3fs_lib_path()` to
+check which path was loaded:
+```python
+from fastsafetensor_3fs_reader import get_hf3fs_lib_path
+print(get_hf3fs_lib_path())  # e.g. "/path/to/site-packages/hf3fs_py_usrbio.libs/libhf3fs_api_shared.so"
+```
+### Installing hf3fs_py_usrbio (for the Python backend)
+`hf3fs_py_usrbio` is **not** available on PyPI.  It must be built from the
+[DeepSeek 3FS](https://github.com/deepseek-ai/3FS) source tree:
+```bash
+git clone https://github.com/deepseek-ai/3FS
+cd 3FS
+git submodule update --init --recursive
+# Follow 3FS build instructions (cmake, etc.)
+# After build, install the Python package:
+cd build && pip install ..
+```
+> **Important:** The default pip-installed `hf3fs_py_usrbio` package is
+> suitable for **testing and validation** but is **not recommended for
+> production use**.  For production deployments, build 3FS from source with
+> optimized compiler flags tailored to your hardware.  Refer to projects like
+> [SGLang](https://github.com/sgl-project/sglang) for examples of
+> production-grade 3FS compilation workflows.
+## Usage
+```python
+from fastsafetensor_3fs_reader import (
+    ThreeFSFileReader,
+    MockFileReader,
+    is_available,
+    get_backend,
+)
+# Check which backend is active
+print(f"Backend: {get_backend()}")  # "cpp", "python", or "mock"
+# Use mock reader for testing (always available)
+reader = MockFileReader()
+headers = reader.read_headers_batch(["/path/to/file.safetensors"])
+reader.close()
+# Use 3FS reader when available
+if is_available():
+    reader = ThreeFSFileReader(mount_point="/mnt/3fs")
+    headers = reader.read_headers_batch([
+        "/mnt/3fs/model-00001.safetensors",
+        "/mnt/3fs/model-00002.safetensors",
+    ])
+    # Read tensor data into GPU memory
+    import torch
+    buf = torch.empty(1024 * 1024, dtype=torch.uint8, device="cuda")
+    bytes_read = reader.read_chunked(
+        path="/mnt/3fs/model-00001.safetensors",
+        dev_ptr=buf.data_ptr(),
+        file_offset=0,
+        total_length=1024 * 1024,
+    )
+    reader.close()
+```
+## Benchmark
+The `hack/benchmark/` directory contains a comprehensive benchmarking suite.
+Use `benchmark_runner.py` to measure read throughput across different backends,
+buffer sizes, chunk sizes, and process counts.
+### Full benchmark (read + GPU copy)
+```bash
+python hack/benchmark/benchmark_runner.py \
+    --mount-point /mnt/3fs \
+    --backends cpp,python \
+    --buffer-sizes 8,16,32,64,128,256,512 \
+    --chunk-sizes 8,16,32,64,128,256,512 \
+    --num-processes 1,2,4,8 \
+    --iterations 3
+```
+### Download-only benchmark (host memory only, no GPU copy)
+```bash
+python hack/benchmark/benchmark_runner.py \
+    --mount-point /mnt/3fs \
+    --backends cpp,python \
+    --buffer-sizes 8,16,32,64,128,256,512 \
+    --chunk-sizes 8,16,32,64,128,256,512 \
+    --num-processes 1,2,4,8 \
+    --download-only \
+    --iterations 3
+```
+### Key parameters
+| Parameter | Description | Default                       |
+|-----------|-------------|-------------------------------|
+| `--mount-point` | 3FS FUSE mount-point path | *(required)*                  |
+| `--backends` | Comma-separated backend names | `mock,python,cpp`             |
+| `--buffer-sizes` | Buffer sizes in MB | `8,16,32,64,128,256,512,1024` |
+| `--chunk-sizes` | Chunk sizes in MB | `8,16,32,64,128,256,512,1024`         |
+| `--num-processes` | Process counts | `1,2,4,8`                     |
+| `--download-only` | Read into host memory only (skip GPU copy) | `false`                       |
+| `--iterations` | Iterations per combination | `3`                           |
+| `--mode` | `grid` (sweep all combos) or `single` | `grid`                        |
+| `--output-dir` | Directory for CSV and chart output | `./benchmark_results`         |
+### Performance Results
+> **Test environment:** Single 400 Gbps RDMA NIC.
+> These numbers represent a **loading baseline** under specific storage and
+> network hardware conditions — they do **not** represent the performance
+> ceiling of the system.
+**Model:** DeepSeek-V3 (total ~640 GB safetensors)
+| Configuration | Avg Throughput (GB/s) | Peak Throughput with fastsafetensors (GB/s) | Load Time (s) | Backend |
+|---|---|---|---|---|
+| 8 processes, buffer=8 MB | 35.0 | 32.0 | 30.34 | C++ (non-pipelined) |
+| 8 processes, buffer=16 MB | 37.6 | 36.6 | 25.73 | C++ (pipelined) |
+#### Benchmark: RDMA throughput across buffer sizes (8M / 16M / 32M)
+![RDMA throughput across buffer sizes](docs/images/cpp_performance.png)
+#### Production: model weight loading with fastsafetensors (pipelined, peak 36.6 GB/s)
+![Model weight loading throughput](docs/images/cpp_load.png)
+## License
+Apache-2.0
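
The README's "Automatic `libhf3fs_api_shared.so` discovery" section lists three search locations in priority order. The sketch below illustrates that ordering only. It is not the package's `_lib_preload` implementation: `find_hf3fs_lib`, its parameters, and the behavior of falling through to the next location when `HF3FS_LIB_DIR` is set but does not contain the library are assumptions made for illustration.

```python
import os
import tempfile
from typing import Iterable, Mapping, Optional

LIB_NAME = "libhf3fs_api_shared.so"


def find_hf3fs_lib(env: Mapping[str, str], site_dirs: Iterable[str]) -> Optional[str]:
    """Hypothetical helper: return the first matching library path, or None."""
    candidates = []
    # 1. Explicit user override.
    if env.get("HF3FS_LIB_DIR"):
        candidates.append(env["HF3FS_LIB_DIR"])
    # 2. Directories the user already put on LD_LIBRARY_PATH.
    candidates.extend(d for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d)
    # 3. The sibling ".libs" directory of a pip-installed hf3fs_py_usrbio.
    candidates.extend(os.path.join(d, "hf3fs_py_usrbio.libs") for d in site_dirs)
    for directory in candidates:
        path = os.path.join(directory, LIB_NAME)
        if os.path.isfile(path):
            return path
    return None


# Demo with throwaway directories standing in for real install locations.
with tempfile.TemporaryDirectory() as root:
    explicit = os.path.join(root, "explicit")
    site = os.path.join(root, "site-packages")
    for d in (explicit, os.path.join(site, "hf3fs_py_usrbio.libs")):
        os.makedirs(d)
        open(os.path.join(d, LIB_NAME), "wb").close()
    # The explicit directory wins over the pip-installed copy.
    print(find_hf3fs_lib({"HF3FS_LIB_DIR": explicit}, [site]))
    # With no environment hints, the ".libs" directory is used.
    print(find_hf3fs_lib({}, [site]))
```

Passing the environment and the site-packages directories as arguments keeps the sketch testable without touching the real process environment. In the installed package, `get_hf3fs_lib_path()` reports which path was actually loaded.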

fastsafetensor_3fs_reader-0.3.3/fastsafetensor_3fs_reader/__init__.py ADDED

@@ -0,0 +1,49 @@
+# SPDX-License-Identifier: Apache-2.0
+"""fastsafetensor_3fs_reader -- 3FS USRBIO file reader for fastsafetensors.
+Quick start::
+    from fastsafetensor_3fs_reader import ThreeFSFileReader, is_available
+    if is_available():
+        reader = ThreeFSFileReader(mount_point="/mnt/3fs")
+        headers = reader.read_headers_batch(["/mnt/3fs/model.safetensors"])
+        reader.close()
+Backend auto-selection (override via ``FASTSAFETENSORS_BACKEND``)::
+    cpp -> python -> mock
+"""
+from ._lib_preload import get_hf3fs_lib_path, preload_hf3fs_library
+preload_hf3fs_library()  # must run before any backend import
+from ._mount_utils import extract_mount_point
+from .interface import FileReaderInterface
+from .mock import MockFileReader
+from ._backend import (  # noqa: E402
+    create_reader,
+    get_backend,
+    init_backend,
+    is_available,
+)
+# init_backend() must run BEFORE importing ThreeFSFileReader: Python's
+# ``from mod import name`` captures the value at import time.
+init_backend()
+from ._backend import ThreeFSFileReader  # noqa: E402
+__all__ = [
+    "FileReaderInterface",
+    "ThreeFSFileReader",
+    "MockFileReader",
+    "extract_mount_point",
+    "get_hf3fs_lib_path",
+    "is_available",
+    "get_backend",
+    "create_reader",
+]
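
The comment in `__init__.py` says that `init_backend()` must run before `ThreeFSFileReader` is imported, because `from mod import name` binds whatever value the name holds at that moment. A minimal, self-contained sketch of that behavior follows. The `demo_backend` module and its `Reader` and `init` attributes are hypothetical stand-ins and are not part of the package.

```python
import sys
import types

# Hypothetical stand-in for a module such as `_backend`.
demo = types.ModuleType("demo_backend")
demo.Reader = None  # placeholder until init() runs


def _init() -> None:
    # Rebinds the module-level name, as an init function would.
    demo.Reader = dict


demo.init = _init
sys.modules["demo_backend"] = demo  # make it importable by name

from demo_backend import Reader as early  # bound before init(): still None

demo.init()

from demo_backend import Reader as late  # bound after init(): the real class

print(early, late)  # None <class 'dict'>
```

The `early` name is never updated when the module attribute is rebound, which is why the package calls `init_backend()` first and only then imports `ThreeFSFileReader`.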

fastsafetensor_3fs_reader-0.3.3/fastsafetensor_3fs_reader/_backend.py ADDED

@@ -0,0 +1,121 @@
+# SPDX-License-Identifier: Apache-2.0
+"""Backend selection: ``FASTSAFETENSORS_BACKEND`` → cpp / python / mock."""
+from __future__ import annotations
+import logging
+import os
+from typing import Any
+from .interface import FileReaderInterface
+from .mock import MockFileReader
+logger = logging.getLogger(__name__)
+_VALID_BACKENDS = ("cpp", "python", "mock", "auto")
+_BACKEND: str = "mock"
+ThreeFSFileReader: type | None = None
+def _load_backend(name: str) -> None:
+    global ThreeFSFileReader, _BACKEND
+    if name == "cpp":
+        from .reader_cpp import ThreeFSFileReaderCpp
+        ThreeFSFileReader = ThreeFSFileReaderCpp
+        _BACKEND = "cpp"
+    elif name == "python":
+        from .reader_py import ThreeFSFileReaderPy
+        ThreeFSFileReader = ThreeFSFileReaderPy
+        _BACKEND = "python"
+    elif name == "mock":
+        ThreeFSFileReader = MockFileReader
+        _BACKEND = "mock"
+    else:
+        raise ValueError(f"Unknown backend: {name!r}")
+def init_backend() -> None:
+    """Auto-select backend (cpp → python → mock).
+    Override with ``FASTSAFETENSORS_BACKEND=cpp|python|mock`` (``auto`` or unset keeps auto-selection).
+    """
+    forced = os.environ.get("FASTSAFETENSORS_BACKEND", "").lower().strip()
+    if forced and forced not in _VALID_BACKENDS:
+        raise ValueError(
+            f"FASTSAFETENSORS_BACKEND={forced!r} is invalid. "
+            f"Valid values: {', '.join(_VALID_BACKENDS)} (or unset)"
+        )
+    if forced and forced != "auto":
+        _load_backend(forced)
+        logger.info(
+            "using backend=%r (forced via FASTSAFETENSORS_BACKEND)",
+            _BACKEND,
+        )
+    else:
+        for candidate in ("cpp", "python"):
+            try:
+                _load_backend(candidate)
+                logger.info(
+                    "using backend=%r (auto-selected)", _BACKEND
+                )
+                break
+            except ImportError as exc:
+                logger.debug(
+                    "backend=%r not available (%s), trying next",
+                    candidate,
+                    exc,
+                )
+        if ThreeFSFileReader is None:
+            _load_backend("mock")
+            logger.warning(
+                "no real 3FS backend available "
+                "(cpp/python both failed), falling back to mock backend"
+            )
+def is_available() -> bool:
+    return _BACKEND in ("cpp", "python")
+def get_backend() -> str:
+    return _BACKEND
+def create_reader(backend: str = "auto", **kwargs: Any) -> FileReaderInterface:
+    """Create a reader instance, optionally forcing a specific backend.
+    ``**kwargs`` are forwarded to the reader constructor.
+    """
+    if backend == "auto":
+        if ThreeFSFileReader is None:
+            raise RuntimeError("No backend is available")
+        return ThreeFSFileReader(**kwargs)
+    elif backend == "cpp":
+        from .reader_cpp import ThreeFSFileReaderCpp
+        return ThreeFSFileReaderCpp(**kwargs)
+    elif backend == "python":
+        from .reader_py import ThreeFSFileReaderPy
+        return ThreeFSFileReaderPy(**kwargs)
+    elif backend == "mock":
+        return MockFileReader(**kwargs)
+    else:
+        raise ValueError(
+            f"backend={backend!r} is invalid. Valid values: {', '.join(_VALID_BACKENDS)}"
+        )
+__all__ = [
+    "ThreeFSFileReader",
+    "init_backend",
+    "is_available",
+    "get_backend",
+    "create_reader",
+]
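
One detail of `init_backend()` is easy to miss: in auto mode an `ImportError` moves on to the next candidate, while a backend forced through `FASTSAFETENSORS_BACKEND` is loaded without a `try`, so its `ImportError` reaches the caller. The sketch below reproduces only that control flow. `select_backend` and the `loaders` mapping are hypothetical names that replace the real module imports, so the example runs without the package installed.

```python
from typing import Callable, Dict, Tuple

VALID = ("cpp", "python", "mock", "auto")


def select_backend(forced: str, loaders: Dict[str, Callable[[], object]]) -> Tuple[str, object]:
    """Forced name first; otherwise try cpp, then python, then fall back to mock."""
    forced = forced.lower().strip()
    if forced and forced not in VALID:
        raise ValueError(f"invalid backend {forced!r}")
    if forced and forced != "auto":
        # Forced mode: an ImportError from the loader propagates to the caller.
        return forced, loaders[forced]()
    for candidate in ("cpp", "python"):
        try:
            return candidate, loaders[candidate]()
        except ImportError:
            continue  # auto mode: try the next candidate
    return "mock", loaders["mock"]()


def _missing() -> object:
    # Simulates a backend whose extension module cannot be imported.
    raise ImportError("extension not built")


loaders = {"cpp": _missing, "python": lambda: "py-reader", "mock": lambda: "mock-reader"}

print(select_backend("", loaders))       # ('python', 'py-reader')
print(select_backend("MOCK ", loaders))  # ('mock', 'mock-reader')
```

Calling `select_backend("cpp", loaders)` with these loaders raises `ImportError` instead of falling back, which matches the forced branch in the diff above.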