PyPI - datamaestro - Versions diffs - 1.6.1__tar.gz → 1.7.0__tar.gz - Mend

datamaestro 1.6.1__tar.gz → 1.7.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (100)

datamaestro-1.7.0/.github/workflows/pytest.yml ADDED

@@ -0,0 +1,33 @@
+# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
+# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
+name: Python test
+on:
+  push:
+    branches: [ master ]
+  pull_request:
+    branches: [ master ]
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+    - uses: actions/checkout@v4
+    - name: Install uv
+      uses: astral-sh/setup-uv@v5
+    - name: Set up Python ${{ matrix.python-version }}
+      run: uv python install ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: uv sync --group dev
+    - name: Lint with ruff
+      run: |
+        uv run ruff check .
+        uv run ruff format --check .
+    - name: Test with pytest
+      run: uv run pytest

datamaestro-1.7.0/.pre-commit-config.yaml ADDED

@@ -0,0 +1,23 @@
+default_install_hook_types:
+  - pre-commit
+  - commit-msg
+repos:
+- hooks:
+  - id: check-yaml
+  - id: end-of-file-fixer
+  - id: trailing-whitespace
+  repo: https://github.com/pre-commit/pre-commit-hooks
+  rev: v5.0.0
+- hooks:
+  - id: ruff
+    args: [--fix]
+  - id: ruff-format
+  repo: https://github.com/astral-sh/ruff-pre-commit
+  rev: v0.9.6
+- hooks:
+  - id: conventional-pre-commit
+    stages:
+    - commit-msg
+  repo: https://github.com/compilerla/conventional-pre-commit
+  rev: v4.0.0

{datamaestro-1.6.1 → datamaestro-1.7.0}/.readthedocs.yml RENAMED

@@ -2,20 +2,19 @@
 # Read the Docs configuration file
 # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
-# Required
 version: 2
-sphinx:
-  configuration: docs/source/conf.py
 build:
-  os: "ubuntu-20.04"
+  os: "ubuntu-22.04"
   tools:
-    python: "3.10"
+    python: "3.11"
+sphinx:
+  configuration: docs/source/conf.py
-# Install the package
 python:
   install:
     - method: pip
       path: .
-    - requirements: docs/requirements.txt
+      extra_requirements:
+        - docs

{datamaestro-1.6.1 → datamaestro-1.7.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: datamaestro
-Version: 1.6.1
+Version: 1.7.0
 Summary: Add your description here
 Author-email: Benjamin Piwowarski <benjamin@piwowarski.fr>
 License-File: LICENSE
@@ -25,6 +25,12 @@ Requires-Dist: pymdown-extensions>=10.16
 Requires-Dist: requests>=2.32.4
 Requires-Dist: tqdm>=4.67.1
 Requires-Dist: urllib3>=2.5.0
+Provides-Extra: docs
+Requires-Dist: myst-parser>0.18; extra == 'docs'
+Requires-Dist: sphinx-codeautolink>=0.15; extra == 'docs'
+Requires-Dist: sphinx-rtd-theme==1.2.2; extra == 'docs'
+Requires-Dist: sphinx-toolbox>=4.1.2; extra == 'docs'
+Requires-Dist: sphinx>=4.2; extra == 'docs'
 Description-Content-Type: text/markdown
 [![PyPI version](https://badge.fury.io/py/datamaestro.svg)](https://badge.fury.io/py/datamaestro) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![DOI](https://zenodo.org/badge/4573876.svg)](https://zenodo.org/badge/latestdoi/4573876)
@@ -127,57 +133,50 @@ Out[3]: (dtype('uint8'), (60000, 28, 28))
 ## Python definition of datasets
-Each dataset (or a set of related datasets) is described in Python using a mix of declarative
-and imperative statements. This allows to quickly define how to download dataset using the
-datamaestro declarative API; the imperative part is used when creating the JSON output,
-and is integrated with [experimaestro](http://experimaestro.github.io/experimaestro-python).
+Datasets are defined as Python classes with resource attributes that describe how
+to download and process data. The framework automatically builds a dependency graph
+and handles downloads with two-path safety and state tracking.
-Its syntax is described in the [documentation](https://datamaestro.readthedocs.io).
+```python
+from datamaestro_image.data import ImageClassification, LabelledImages
+from datamaestro.data.tensor import IDX
+from datamaestro.download.single import FileDownloader
+from datamaestro.definitions import AbstractDataset, dataset
-For instance, the MNIST dataset can be described by the following
+@dataset(url="http://yann.lecun.com/exdb/mnist/")
+class MNIST(ImageClassification):
+    """The MNIST database of handwritten digits."""
-```python
-from datamaestro import dataset
-from datamaestro.download.single import download_file
-from datamaestro_image.data import ImageClassification, LabelledImages, IDXImage
-@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
-@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
-@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
-@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
-@dataset(
-  ImageClassification,
-  url="http://yann.lecun.com/exdb/mnist/",
-)
-    return ImageClassification(
-        train=LabelledImages(
-            images=IDXImage(path=train_images), labels=IDXImage(path=train_labels)
-        ),
-        test=LabelledImages(
-            images=IDXImage(path=test_images), labels=IDXImage(path=test_labels)
-        ),
+    TRAIN_IMAGES = FileDownloader(
+        "train_images.idx",
+        "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
+    )
+    TRAIN_LABELS = FileDownloader(
+        "train_labels.idx",
+        "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
+    )
+    TEST_IMAGES = FileDownloader(
+        "test_images.idx",
+        "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
+    )
+    TEST_LABELS = FileDownloader(
+        "test_labels.idx",
+        "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
     )
-```
-When building dataset modules, some extra documentation can be provided:
-```yaml
-  ids: [com.lecun.mnist]
-  entry_point: "datamaestro_image.config.com.lecun:mnist"
-  title: The MNIST database
-  url: http://yann.lecun.com/exdb/mnist/
-  groups: [image-classification]
-  description: |
-    The MNIST database of handwritten digits, available from this page,
-    has a training set of 60,000 examples, and a test set of 10,000
-    examples. It is a subset of a larger set available from NIST. The
-    digits have been size-normalized and centered in a fixed-size image.
+    @classmethod
+    def __create_dataset__(cls, dataset: AbstractDataset):
+        return cls.C(
+            train=LabelledImages(
+                images=IDX(path=cls.TRAIN_IMAGES.path),
+                labels=IDX(path=cls.TRAIN_LABELS.path),
+            ),
+            test=LabelledImages(
+                images=IDX(path=cls.TEST_IMAGES.path),
+                labels=IDX(path=cls.TEST_LABELS.path),
+            ),
+        )
 ```
-This will allow to
-1. Document the dataset
-2. Allow to use the command line interface to manipulate it (download resources, etc.)
+Its syntax is described in the [documentation](https://datamaestro.readthedocs.io).
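
The `Provides-Extra: docs` entries added to PKG-INFO are what the updated `.readthedocs.yml` relies on through `extra_requirements`. As a rough sketch of how such entries can be inspected, the snippet below filters `Requires-Dist` strings by extra. `requirements_for_extra` is a hypothetical helper, and the list is copied by hand from the 1.7.0 PKG-INFO hunk rather than read from an installed package.

```python
def requirements_for_extra(requires, extra):
    """Return the requirement specifiers that are gated behind `extra`."""
    selected = []
    for line in requires:
        # A Requires-Dist value looks like "name>=1.0; extra == 'docs'".
        spec, _, marker = line.partition(";")
        # Normalise quotes and whitespace so both quoting styles match.
        marker = marker.replace('"', "'").replace(" ", "")
        if f"extra=='{extra}'" in marker:
            selected.append(spec.strip())
    return selected


# Requires-Dist values copied from the 1.7.0 PKG-INFO hunk.
REQUIRES_DIST = [
    "requests>=2.32.4",
    "tqdm>=4.67.1",
    "urllib3>=2.5.0",
    "myst-parser>0.18; extra == 'docs'",
    "sphinx-codeautolink>=0.15; extra == 'docs'",
    "sphinx-rtd-theme==1.2.2; extra == 'docs'",
    "sphinx-toolbox>=4.1.2; extra == 'docs'",
    "sphinx>=4.2; extra == 'docs'",
]

docs_requirements = requirements_for_extra(REQUIRES_DIST, "docs")
for requirement in docs_requirements:
    print(requirement)
```

With datamaestro 1.7.0 installed, the same helper could presumably be fed `importlib.metadata.requires("datamaestro") or []`, which yields the raw `Requires-Dist` strings.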

{datamaestro-1.6.1 → datamaestro-1.7.0}/README.md RENAMED

@@ -98,57 +98,50 @@ Out[3]: (dtype('uint8'), (60000, 28, 28))
 ## Python definition of datasets
-Each dataset (or a set of related datasets) is described in Python using a mix of declarative
-and imperative statements. This allows to quickly define how to download dataset using the
-datamaestro declarative API; the imperative part is used when creating the JSON output,
-and is integrated with [experimaestro](http://experimaestro.github.io/experimaestro-python).
+Datasets are defined as Python classes with resource attributes that describe how
+to download and process data. The framework automatically builds a dependency graph
+and handles downloads with two-path safety and state tracking.
-Its syntax is described in the [documentation](https://datamaestro.readthedocs.io).
+```python
+from datamaestro_image.data import ImageClassification, LabelledImages
+from datamaestro.data.tensor import IDX
+from datamaestro.download.single import FileDownloader
+from datamaestro.definitions import AbstractDataset, dataset
-For instance, the MNIST dataset can be described by the following
+@dataset(url="http://yann.lecun.com/exdb/mnist/")
+class MNIST(ImageClassification):
+    """The MNIST database of handwritten digits."""
-```python
-from datamaestro import dataset
-from datamaestro.download.single import download_file
-from datamaestro_image.data import ImageClassification, LabelledImages, IDXImage
-@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
-@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
-@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
-@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
-@dataset(
-  ImageClassification,
-  url="http://yann.lecun.com/exdb/mnist/",
-)
-    return ImageClassification(
-        train=LabelledImages(
-            images=IDXImage(path=train_images), labels=IDXImage(path=train_labels)
-        ),
-        test=LabelledImages(
-            images=IDXImage(path=test_images), labels=IDXImage(path=test_labels)
-        ),
+    TRAIN_IMAGES = FileDownloader(
+        "train_images.idx",
+        "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
+    )
+    TRAIN_LABELS = FileDownloader(
+        "train_labels.idx",
+        "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
+    )
+    TEST_IMAGES = FileDownloader(
+        "test_images.idx",
+        "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
+    )
+    TEST_LABELS = FileDownloader(
+        "test_labels.idx",
+        "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
     )
-```
-When building dataset modules, some extra documentation can be provided:
-```yaml
-  ids: [com.lecun.mnist]
-  entry_point: "datamaestro_image.config.com.lecun:mnist"
-  title: The MNIST database
-  url: http://yann.lecun.com/exdb/mnist/
-  groups: [image-classification]
-  description: |
-    The MNIST database of handwritten digits, available from this page,
-    has a training set of 60,000 examples, and a test set of 10,000
-    examples. It is a subset of a larger set available from NIST. The
-    digits have been size-normalized and centered in a fixed-size image.
+    @classmethod
+    def __create_dataset__(cls, dataset: AbstractDataset):
+        return cls.C(
+            train=LabelledImages(
+                images=IDX(path=cls.TRAIN_IMAGES.path),
+                labels=IDX(path=cls.TRAIN_LABELS.path),
+            ),
+            test=LabelledImages(
+                images=IDX(path=cls.TEST_IMAGES.path),
+                labels=IDX(path=cls.TEST_LABELS.path),
+            ),
+        )
 ```
-This will allow to
-1. Document the dataset
-2. Allow to use the command line interface to manipulate it (download resources, etc.)
+Its syntax is described in the [documentation](https://datamaestro.readthedocs.io).
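
The new README text states that the framework builds a dependency graph from resource attributes declared on the dataset class. The toy below illustrates that general idea with the standard library only. `Resource`, `depends_on`, `processing_order`, and `ToyDataset` are hypothetical names, and this is not datamaestro's implementation.

```python
from graphlib import TopologicalSorter


class Resource:
    """Hypothetical stand-in for a downloadable resource (not datamaestro's class)."""

    def __init__(self, filename, depends_on=()):
        self.filename = filename
        self.depends_on = tuple(depends_on)
        self.name = None

    def __set_name__(self, owner, name):
        # Called by Python when the class body is evaluated, so each
        # resource learns the attribute name it was assigned to.
        self.name = name


def processing_order(dataset_cls):
    """Return resource names so that dependencies come before dependents."""
    resources = {
        name: value
        for name, value in vars(dataset_cls).items()
        if isinstance(value, Resource)
    }
    # Map each resource name to the names it depends on.
    graph = {
        name: {dep.name for dep in resource.depends_on}
        for name, resource in resources.items()
    }
    return list(TopologicalSorter(graph).static_order())


class ToyDataset:
    ARCHIVE = Resource("corpus.tar.gz")
    EXTRACTED = Resource("corpus", depends_on=[ARCHIVE])
    INDEX = Resource("corpus.idx", depends_on=[EXTRACTED])


order = processing_order(ToyDataset)
print(order)
```

Declaring resources as class attributes, as the MNIST example in the diff does, is what makes this kind of introspection possible without a separate registration step.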

datamaestro-1.7.0/docs/source/api/data.md ADDED

@@ -0,0 +1,280 @@
+# Data Types
+Data types define the structure of dataset contents. They inherit from `datamaestro.data.Base`
+and use experimaestro's configuration system for type-safe parameter handling.
+## Base Types
+### Base
+The root class for all data types:
+```python
+from datamaestro.data import Base
+class MyData(Base):
+    """Custom data type"""
+    pass
+```
+```{eval-rst}
+.. autoxpmconfig:: datamaestro.data.Base
+```
+### Generic
+Generic data with a path:
+```{eval-rst}
+.. autoxpmconfig:: datamaestro.data.Generic
+```
+### File
+Single file reference:
+```python
+from datamaestro.data import File
+# In dataset definition
+return File(path=downloaded_path)
+# Usage
+print(ds.path)  # Path to the file
+```
+```{eval-rst}
+.. autoxpmconfig:: datamaestro.data.File
+```
+## CSV Data
+Package: `datamaestro.data.csv`
+### Generic CSV
+```python
+from datamaestro.data.csv import Generic
+return Generic(
+    path=csv_path,
+    separator=",",
+    names_row=0,  # Header row index
+    size=1000,    # Number of rows (optional)
+)
+```
+```{eval-rst}
+.. autoxpmconfig:: datamaestro.data.csv.Generic
+```
+### Matrix CSV
+For numeric CSV data:
+```python
+from datamaestro.data.csv import Matrix
+return Matrix(
+    path=csv_path,
+    separator=",",
+    target=-1,  # Target column index (-1 for last)
+)
+```
+```{eval-rst}
+.. autoxpmconfig:: datamaestro.data.csv.Matrix
+```
+## Tensor Data
+Package: `datamaestro.data.tensor`
+### IDX Format
+The IDX format is used by MNIST and similar datasets:
+```python
+from datamaestro.data.tensor import IDX
+idx_data = IDX(path=idx_file_path)
+# Load as numpy array
+array = idx_data.data()
+print(array.shape)  # e.g., (60000, 28, 28)
+print(array.dtype)  # e.g., uint8
+```
+```{eval-rst}
+.. autoxpmconfig:: datamaestro.data.tensor.IDX
+```
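
For readers unfamiliar with IDX, here is a minimal sketch of the file layout as documented for MNIST: two zero bytes, a one-byte type code, a one-byte dimension count, big-endian 32-bit dimension sizes, then the raw values. `parse_idx` is a hypothetical helper that works on an uncompressed in-memory buffer and returns plain tuples. It is independent of `datamaestro.data.tensor.IDX`, whose `data()` method returns a NumPy array according to the snippet above.

```python
import struct

# IDX type code (third header byte) -> struct format character.
IDX_TYPES = {
    0x08: "B",  # unsigned byte
    0x09: "b",  # signed byte
    0x0B: "h",  # 16-bit integer
    0x0C: "i",  # 32-bit integer
    0x0D: "f",  # 32-bit float
    0x0E: "d",  # 64-bit float
}


def parse_idx(buffer):
    """Parse an uncompressed IDX buffer into (shape, flat tuple of values)."""
    zero, type_code, ndim = struct.unpack_from(">HBB", buffer, 0)
    if zero != 0 or type_code not in IDX_TYPES:
        raise ValueError("not an IDX buffer")
    # Dimension sizes follow the 4-byte magic number, big-endian.
    shape = struct.unpack_from(f">{ndim}I", buffer, 4)
    count = 1
    for dim in shape:
        count *= dim
    offset = 4 + 4 * ndim
    values = struct.unpack_from(f">{count}{IDX_TYPES[type_code]}", buffer, offset)
    return shape, values


# A tiny 2x3 unsigned-byte tensor built by hand for illustration.
sample = bytes([0, 0, 0x08, 2]) + struct.pack(">II", 2, 3) + bytes(range(6))
shape, values = parse_idx(sample)
print(shape, values)
```

The MNIST files referenced elsewhere in this diff are gzip-compressed, so a real buffer would first need to be decompressed, for example with `gzip.decompress`.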
+## Machine Learning
+Package: `datamaestro.data.ml`
+### Supervised Learning
+For supervised learning datasets with train/test splits:
+```python
+from datamaestro.data.ml import Supervised
+return Supervised(
+    train=train_data,
+    test=test_data,
+    validation=validation_data,  # Optional
+)
+```
+```{eval-rst}
+.. autoxpmconfig:: datamaestro.data.ml.Supervised
+```
+## HuggingFace Integration
+Package: `datamaestro.data.huggingface`
+For datasets from the HuggingFace Hub:
+```python
+from datamaestro.data.huggingface import DatasetDict
+return DatasetDict(
+    dataset_id="squad",
+    config=None,  # Optional config name
+)
+```
+## Creating Custom Data Types
+### Basic Custom Type
+Create custom data types by inheriting from {py:class}`~datamaestro.data.Base`.
+Use `Param` from experimaestro to define typed parameters:
+```python
+from pathlib import Path
+from experimaestro import Param
+from datamaestro.data import Base
+class TextCorpus(Base):
+    """A text corpus with documents"""
+    path: Param[Path]
+    """Path to the corpus directory"""
+    encoding: Param[str] = "utf-8"
+    """Text encoding"""
+    def documents(self):
+        """Iterate over documents"""
+        for file in self.path.glob("*.txt"):
+            yield file.read_text(encoding=self.encoding)
+    def __len__(self):
+        return len(list(self.path.glob("*.txt")))
+```
+### Nested Data Types
+```python
+from experimaestro import Param
+from datamaestro.data import Base
+class LabelledData(Base):
+    """Data with labels"""
+    data: Param[Base]
+    """The actual data"""
+    labels: Param[Base]
+    """The labels"""
+class ImageClassification(Base):
+    """Image classification dataset"""
+    train: Param[LabelledData]
+    """Training split"""
+    test: Param[LabelledData]
+    """Test split"""
+    num_classes: Param[int]
+    """Number of classes"""
+```
+### With Data Loading Methods
+```python
+from pathlib import Path
+from experimaestro import Param
+from datamaestro.data import Base
+class JSONLData(Base):
+    """JSON Lines format data"""
+    path: Param[Path]
+    def __iter__(self):
+        """Iterate over records"""
+        import json
+        with open(self.path) as f:
+            for line in f:
+                yield json.loads(line)
+    def to_pandas(self):
+        """Load as pandas DataFrame"""
+        import pandas as pd
+        return pd.read_json(self.path, lines=True)
+    def to_list(self):
+        """Load all records into a list"""
+        return list(self)
+```
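
The iteration pattern in `JSONLData` can be tried without datamaestro or experimaestro installed. The sketch below is a plain-Python stand-in: `JSONLRecords` is a hypothetical class that takes the path in `__init__` instead of declaring it with `Param`, and the sample file is created on the fly. It also skips blank lines and opens the file as UTF-8, two choices that are assumptions of this sketch rather than behaviour shown in the diff.

```python
import json
import tempfile
from pathlib import Path


class JSONLRecords:
    """Plain-Python stand-in for a JSON Lines data type (no Base, no Param)."""

    def __init__(self, path):
        self.path = Path(path)

    def __iter__(self):
        """Yield one decoded record per non-empty line."""
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():  # json.loads would fail on a blank line
                    yield json.loads(line)

    def to_list(self):
        """Load all records into a list."""
        return list(self)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
    records = JSONLRecords(sample).to_list()

print(records)
```

Because `__iter__` is a generator, records are decoded lazily, and `to_list` is only needed when the whole file must be held in memory.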
+### Inheriting from Existing Types
+```python
+from datamaestro.data.csv import Matrix
+class ClassificationMatrix(Matrix):
+    """CSV matrix for classification tasks"""
+    num_classes: Param[int]
+    """Number of target classes"""
+    class_names: Param[list] = None
+    """Optional class names"""
+    def get_class_name(self, index: int) -> str:
+        if self.class_names:
+            return self.class_names[index]
+        return str(index)
+```
+## Type Annotations with Experimaestro
+Data types use experimaestro's annotation system ({py:class}`~experimaestro.Param`,
+{py:class}`~experimaestro.Option`, {py:class}`~experimaestro.Meta`):
+```python
+from experimaestro import Param, Option, Meta
+from datamaestro.data import Base
+class MyData(Base):
+    # Required parameter
+    path: Param[Path]
+    # Optional parameter with default
+    encoding: Param[str] = "utf-8"
+    # Option (not serialized, for runtime configuration)
+    cache_size: Option[int] = 1000
+    # Metadata (not part of configuration identity)
+    description: Meta[str] = ""
+```
+See the [experimaestro documentation](https://experimaestro-python.readthedocs.io/)
+for more details on the configuration system.
