mldataforge 0.1.0__tar.gz → 0.1.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,59 @@
+Metadata-Version: 2.4
+Name: mldataforge
+Version: 0.1.2
+Summary: swiss army knife of scripts for transforming and processing datasets for machine learning.
+Project-URL: Homepage, https://github.com/schneiderkamplab/mldataforge
+Project-URL: Bug Tracker, https://github.com/schneiderkamplab/mldataforge/issues
+Author: Peter Schneider-Kamp
+License-File: LICENSE
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.12
+Requires-Dist: click
+Requires-Dist: datasets
+Requires-Dist: mltiming
+Requires-Dist: mosaicml-streaming
+Provides-Extra: all
+Requires-Dist: build; extra == 'all'
+Requires-Dist: pytest; extra == 'all'
+Requires-Dist: pytest-dependency; extra == 'all'
+Requires-Dist: twine; extra == 'all'
+Provides-Extra: dev
+Requires-Dist: build; extra == 'dev'
+Requires-Dist: twine; extra == 'dev'
+Provides-Extra: test
+Requires-Dist: pytest; extra == 'test'
+Requires-Dist: pytest-dependency; extra == 'test'
+Description-Content-Type: text/markdown
+
+# mldataforge
+swiss army knife of scripts for transforming and processing datasets for machine learning
+
+## scope
+Currently, mldataforge provides space- and time-efficient conversions between JSONL (with or without compression), MosaicML Dataset (MDS format), and Parquet. The implementations handle conversions by individual samples or small batches of samples and make efficient use of multi-core architectures where possible. Consequently, mldataforge is an excellent choice when transforming TB-scale datasets on data processing nodes with many cores.
+
+## installation and general usage
+```
+pip install mldataforge
+python -m mldataforge --help
+```
+
+## usage example: converting MosaicML Dataset (MDS) to Parquet format
+```
+Usage: python -m mldataforge convert mds parquet [OPTIONS] OUTPUT_FILE
+                                                 MDS_DIRECTORIES...
+
+Options:
+  --compression [snappy|gzip|zstd]
+                        Compress the output file (default: snappy).
+  --overwrite           Overwrite existing path.
+  --yes                 Assume yes to all prompts. Use with caution
+                        as it will remove files or even entire
+                        directories without confirmation.
+  --batch-size INTEGER  Batch size for loading data and writing
+                        files (default: 65536).
+  --no-bulk             Use a custom space and time-efficient bulk
+                        reader (only gzip and no compression).
+  --help                Show this message and exit.
+```
@@ -0,0 +1,30 @@
+# mldataforge
+swiss army knife of scripts for transforming and processing datasets for machine learning
+
+## scope
+Currently, mldataforge provides space- and time-efficient conversions between JSONL (with or without compression), MosaicML Dataset (MDS format), and Parquet. The implementations handle conversions by individual samples or small batches of samples and make efficient use of multi-core architectures where possible. Consequently, mldataforge is an excellent choice when transforming TB-scale datasets on data processing nodes with many cores.
+
+## installation and general usage
+```
+pip install mldataforge
+python -m mldataforge --help
+```
+
+## usage example: converting MosaicML Dataset (MDS) to Parquet format
+```
+Usage: python -m mldataforge convert mds parquet [OPTIONS] OUTPUT_FILE
+                                                 MDS_DIRECTORIES...
+
+Options:
+  --compression [snappy|gzip|zstd]
+                        Compress the output file (default: snappy).
+  --overwrite           Overwrite existing path.
+  --yes                 Assume yes to all prompts. Use with caution
+                        as it will remove files or even entire
+                        directories without confirmation.
+  --batch-size INTEGER  Batch size for loading data and writing
+                        files (default: 65536).
+  --no-bulk             Use a custom space and time-efficient bulk
+                        reader (only gzip and no compression).
+  --help                Show this message and exit.
+```
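The help text above maps directly onto a command line. As a minimal sketch, the snippet below drives the documented `convert mds parquet` subcommand from Python via `subprocess`; the MDS directory and output path are placeholders, and the option values are taken from the choices and defaults shown in the help text.

```
import subprocess
import sys

# Placeholder paths; point these at a real MDS directory and a writable output file.
mds_dir = "data/mds"
output_file = "out.parquet"

# Invoke the documented CLI: python -m mldataforge convert mds parquet OUTPUT_FILE MDS_DIRECTORIES...
subprocess.run(
    [
        sys.executable, "-m", "mldataforge",
        "convert", "mds", "parquet",
        output_file, mds_dir,
        "--compression", "zstd",   # one of snappy|gzip|zstd (default: snappy)
        "--batch-size", "65536",   # default batch size from the help text
    ],
    check=True,
)
```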
@@ -0,0 +1,4 @@
+from .commands import cli
+
+if __name__ == "__main__":
+    cli()
@@ -0,0 +1,13 @@
+import click
+
+from .convert import convert
+from .join import join
+
+__all__ = ["cli"]
+
+@click.group()
+def cli():
+    pass
+
+cli.add_command(convert)
+cli.add_command(join)
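With this change the top-level group registers both `convert` and `join`. A minimal sketch of inspecting the resulting CLI from Python with click's test runner, assuming the package is installed as `mldataforge`:

```
from click.testing import CliRunner

from mldataforge.commands import cli

# Render the top-level help; it should now list both `convert` and `join`.
runner = CliRunner()
result = runner.invoke(cli, ["--help"])
print(result.output)
```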
@@ -17,10 +17,11 @@ def mds():
 @overwrite_option()
 @yes_option()
 @batch_size_option()
-def jsonl(output_file, mds_directories, compression, processes, overwrite, yes, batch_size):
+@no_bulk_option()
+def jsonl(output_file, mds_directories, compression, processes, overwrite, yes, batch_size, no_bulk):
     check_arguments(output_file, overwrite, yes, mds_directories)
     save_jsonl(
-        load_mds_directories(mds_directories, batch_size=batch_size),
+        load_mds_directories(mds_directories, batch_size=batch_size, bulk=not no_bulk),
         output_file,
         compression=compression,
         processes=processes,
@@ -28,15 +29,16 @@ def jsonl(output_file, mds_directories, compression, processes, overwrite, yes,
 
 @mds.command()
 @click.argument("output_file", type=click.Path(exists=False), required=True)
-@click.argument("parquet_files", type=click.Path(exists=True), required=True, nargs=-1)
+@click.argument("mds_directories", type=click.Path(exists=True), required=True, nargs=-1)
 @compression_option("snappy", ["snappy", "gzip", "zstd"])
 @overwrite_option()
 @yes_option()
 @batch_size_option()
-def parquet(output_file, parquet_files, compression, overwrite, yes, batch_size):
-    check_arguments(output_file, overwrite, yes, parquet_files)
+@no_bulk_option()
+def parquet(output_file, mds_directories, compression, overwrite, yes, batch_size, no_bulk):
+    check_arguments(output_file, overwrite, yes, mds_directories)
     save_parquet(
-        load_mds_directories(parquet_files, batch_size=batch_size),
+        load_mds_directories(mds_directories, batch_size=batch_size, bulk=not no_bulk),
         output_file,
         compression=compression,
         batch_size=batch_size,
@@ -0,0 +1,64 @@
+import click
+from datasets import load_dataset
+
+from ..options import *
+from ..utils import *
+
+__all__ = ["join"]
+
+@click.group()
+def join():
+    pass
+
+@join.command()
+@click.argument("output_file", type=click.Path(exists=False), required=True)
+@click.argument("jsonl_files", type=click.Path(exists=True), required=True, nargs=-1)
+@compression_option("infer", ["none", "infer", "pigz", "gzip", "bz2", "xz"])
+@processes_option()
+@overwrite_option()
+@yes_option()
+def jsonl(output_file, jsonl_files, compression, processes, overwrite, yes):
+    check_arguments(output_file, overwrite, yes, jsonl_files)
+    save_jsonl(
+        load_dataset("json", data_files=jsonl_files, split="train"),
+        output_file,
+        compression=compression,
+        processes=processes,
+    )
+
+@join.command()
+@click.argument("output_dir", type=click.Path(exists=False), required=True)
+@click.argument("mds_directories", type=click.Path(exists=True), required=True, nargs=-1)
+@compression_option(None, ['none', 'br', 'bz2', 'gzip', 'pigz', 'snappy', 'zstd'])
+@processes_option()
+@overwrite_option()
+@yes_option()
+@batch_size_option()
+@buf_size_option()
+@no_bulk_option()
+def mds(output_dir, mds_directories, compression, processes, overwrite, yes, batch_size, buf_size, no_bulk):
+    check_arguments(output_dir, overwrite, yes, mds_directories)
+    save_mds(
+        load_mds_directories(mds_directories, batch_size=batch_size, bulk=not no_bulk),
+        output_dir,
+        processes=processes,
+        compression=compression,
+        buf_size=buf_size,
+        pigz=use_pigz(compression),
+    )
+
+@join.command()
+@click.argument("output_file", type=click.Path(exists=False), required=True)
+@click.argument("parquet_files", type=click.Path(exists=True), required=True, nargs=-1)
+@compression_option("snappy", ["snappy", "gzip", "zstd"])
+@overwrite_option()
+@yes_option()
+@batch_size_option()
+def parquet(output_file, parquet_files, compression, overwrite, yes, batch_size):
+    check_arguments(output_file, overwrite, yes, parquet_files)
+    save_parquet(
+        load_dataset("parquet", data_files=parquet_files, split="train"),
+        output_file,
+        compression=compression,
+        batch_size=batch_size,
+    )
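The `join` commands concatenate any number of input files into one output. A minimal sketch of the loading step used by `join jsonl`, with placeholder file names; the command itself then hands the resulting dataset to `save_jsonl`:

```
from datasets import load_dataset

# Placeholder inputs; the CLI accepts any number of JSONL files as arguments.
jsonl_files = ["part-000.jsonl", "part-001.jsonl"]

# Mirrors the loading step of `join jsonl`: all inputs are concatenated into
# a single Hugging Face dataset before being written back out.
ds = load_dataset("json", data_files=jsonl_files, split="train")
print(len(ds), ds.column_names)
```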
@@ -0,0 +1,97 @@
+import gzip
+import json
+from mltiming import timing
+import numpy as np
+import os
+from streaming.base.format.mds.encodings import mds_decode
+from typing import Any, Optional, Generator
+
+class MDSBulkReader:
+    def __init__(
+        self,
+        dirnames: list[str],
+        split: Optional[str],
+    ) -> None:
+        self.shards = []
+        self.samples = 0
+        for dirname in dirnames:
+            if split is not None:
+                dirname = os.path.join(dirname, split)
+            index = json.load(open(os.path.join(dirname, "index.json"), 'rt'))
+            for shard in index["shards"]:
+                basename = shard['raw_data']['basename'] if shard['zip_data'] is None else shard['zip_data']['basename']
+                filename = os.path.join(dirname, basename)
+                self.shards.append({
+                    "filename": filename,
+                    "compression": shard['compression'],
+                })
+                self.samples += shard['samples']
+
+    def __len__(self) -> int:
+        return self.samples
+
+    def __iter__(self) -> Generator[dict[str, Any], None, None]:
+        for shard in self.shards:
+            with MDSShardReader(**shard) as reader:
+                for sample in reader:
+                    yield sample
+
+class MDSShardReader:
+    def __init__(
+        self,
+        filename: str,
+        compression: Optional[str],
+    ) -> None:
+        if compression is None:
+            _open = open
+        elif compression == 'gz':
+            _open = gzip.open
+        else:
+            raise ValueError(f'Unsupported compression type: {compression}. Supported types: None, gzip.')
+        self.fp = _open(filename, "rb")
+        self.samples = np.frombuffer(self.fp.read(4), np.uint32)[0]
+        self.index = np.frombuffer(self.fp.read((1+self.samples)*4), np.uint32)
+        info = json.loads(self.fp.read(self.index[0]-self.fp.tell()))
+        self.column_encodings = info["column_encodings"]
+        self.column_names = info["column_names"]
+        self.column_sizes = info["column_sizes"]
+        assert self.fp.tell() == self.index[0]
+
+    def decode_sample(self, data: bytes) -> dict[str, Any]:
+        sizes = []
+        idx = 0
+        for key, size in zip(self.column_names, self.column_sizes):
+            if size:
+                sizes.append(size)
+            else:
+                size, = np.frombuffer(data[idx:idx + 4], np.uint32)
+                sizes.append(size)
+                idx += 4
+        sample = {}
+        for key, encoding, size in zip(self.column_names, self.column_encodings, sizes):
+            value = data[idx:idx + size]
+            sample[key] = mds_decode(encoding, value)
+            idx += size
+        return sample
+
+    def get_sample_data(self, idx: int) -> bytes:
+        begin, end = self.index[idx:idx+2]
+        assert self.fp.tell() == begin
+        data = self.fp.read(end - begin)
+        assert self.fp.tell() == end
+        assert data
+        return data
+
+    def get_item(self, idx: int) -> dict[str, Any]:
+        data = self.get_sample_data(idx)
+        return self.decode_sample(data)
+
+    def __iter__(self) -> Generator[dict[str, Any], None, None]:
+        for i in range(self.samples):
+            yield self.get_item(i)
+
+    def __enter__(self) -> "MDSShardReader":
+        return self
+
+    def __exit__(self, exc_type, exc_value, traceback) -> None:
+        self.fp.close()
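MDSBulkReader walks the shards listed in each directory's index.json sequentially, without building a StreamingDataset. A minimal sketch of using it directly, with a placeholder directory that must contain MDS shards and their index.json:

```
from mldataforge.mds import MDSBulkReader

# Placeholder path; each directory must contain an MDS index.json and its shards.
reader = MDSBulkReader(["data/mds"], split=None)

print(f"total samples: {len(reader)}")
for i, sample in enumerate(reader):
    # Samples are plain dicts keyed by the MDS column names.
    print(sample.keys())
    if i == 2:
        break
```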
@@ -29,6 +29,16 @@ def buf_size_option(default=2**24):
         help=f"Buffer size for pigz compression (default: {default}).",
     )
 
+def no_bulk_option():
+    """
+    Option for specifying whether to use a custom space and time-efficient bulk reader (only gzip and no compression).
+    """
+    return click.option(
+        "--no-bulk",
+        is_flag=True,
+        help="Use a custom space and time-efficient bulk reader (only gzip and no compression).",
+    )
+
 def compression_option(default, choices):
     """
     Option for specifying the compression type.
@@ -12,6 +12,7 @@ import shutil
 from streaming import MDSWriter, StreamingDataset
 from tqdm import tqdm
 
+from .mds import MDSBulkReader
 from .pigz import pigz_open
 
 __all__ = [
@@ -98,7 +99,9 @@ def _infer_compression(file_path):
         return 'zstd'
     return None
 
-def load_mds_directories(mds_directories, split='.', batch_size=2**16):
+def load_mds_directories(mds_directories, split='.', batch_size=2**16, bulk=True):
+    if bulk:
+        return MDSBulkReader(mds_directories, split=split)
     dss = []
     for mds_directory in mds_directories:
         ds = StreamingDataset(
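The new `bulk` flag decides which reader backs a conversion. A minimal sketch, with a placeholder MDS directory:

```
from mldataforge.utils import load_mds_directories

# Placeholder input; any number of MDS directories can be passed.
mds_dirs = ["data/mds"]

# bulk=True (the default) short-circuits to the MDSBulkReader defined in mds.py.
reader = load_mds_directories(mds_dirs, batch_size=2**16, bulk=True)

# bulk=False falls through to the StreamingDataset-based loader; this is the
# path the CLI selects when --no-bulk is given (bulk=not no_bulk).
dataset = load_mds_directories(mds_dirs, batch_size=2**16, bulk=False)
```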
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "mldataforge"
-version = "0.1.0"
+version = "0.1.2"
 authors = [
     { name = "Peter Schneider-Kamp" }
 ]
@@ -25,6 +25,11 @@ dependencies = [
     'mosaicml-streaming'
 ]
 
+[project.optional-dependencies]
+test = ["pytest", "pytest-dependency"]
+dev = ["build", "twine"]
+all = ["build", "twine", "pytest", "pytest-dependency"]
+
 [project.urls]
 "Homepage" = "https://github.com/schneiderkamplab/mldataforge"
 "Bug Tracker" = "https://github.com/schneiderkamplab/mldataforge/issues"
@@ -1,20 +0,0 @@
-Metadata-Version: 2.4
-Name: mldataforge
-Version: 0.1.0
-Summary: swiss army knife of scripts for transforming and processing datasets for machine learning.
-Project-URL: Homepage, https://github.com/schneiderkamplab/mldataforge
-Project-URL: Bug Tracker, https://github.com/schneiderkamplab/mldataforge/issues
-Author: Peter Schneider-Kamp
-License-File: LICENSE
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Operating System :: OS Independent
-Classifier: Programming Language :: Python :: 3
-Requires-Python: >=3.12
-Requires-Dist: click
-Requires-Dist: datasets
-Requires-Dist: mltiming
-Requires-Dist: mosaicml-streaming
-Description-Content-Type: text/markdown
-
-# mldatasets
-swiss army knife of scripts for transforming and processing datasets for machine learning
@@ -1,2 +0,0 @@
-# mldatasets
-swiss army knife of scripts for transforming and processing datasets for machine learning
@@ -1,12 +0,0 @@
-import click
-
-from .commands import convert
-
-@click.group()
-def cli():
-    pass
-
-cli.add_command(convert)
-
-if __name__ == "__main__":
-    cli()
@@ -1,3 +0,0 @@
-from .convert import convert
-
-__all__ = ["convert"]