PyPI - data-annotations - Versions diffs - 2.1.2__tar.gz - Mend

data-annotations 2.1.2__tar.gz

Files changed (27)

data_annotations-2.1.2/LICENSE ADDED

@@ -0,0 +1,28 @@
+BSD 3-Clause License
+Copyright (c) 2026, CeDA, University of Basel
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+1. Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+3. Neither the name of the copyright holder nor the names of its
+   contributors may be used to endorse or promote products derived from
+   this software without specific prior written permission.
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

data_annotations-2.1.2/PKG-INFO ADDED

@@ -0,0 +1,616 @@
+Metadata-Version: 2.4
+Name: data-annotations
+Version: 2.1.2
+Summary: Annotate generated data artifacts
+Keywords: annotations,data,metadata,provenance,reproducibility
+Author: Rodrigo C.  G.  Pena
+Author-email: Rodrigo C.  G.  Pena <rodrigo.cerqueiragonzalezpena@unibas.ch>
+License-Expression: BSD-3-Clause
+License-File: LICENSE
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Dist: pydantic>=2.13.1
+Requires-Dist: questionary>=2.1.1 ; extra == 'cli'
+Requires-Dist: typer>=0.16.0 ; extra == 'cli'
+Requires-Python: >=3.12
+Project-URL: Source, https://gitlab.com/ceda-unibas/tools/data-annotations
+Project-URL: Changelog, https://gitlab.com/ceda-unibas/tools/data-annotations/-/blob/main/CHANGELOG.md
+Project-URL: Issues, https://gitlab.com/ceda-unibas/tools/data-annotations/-/issues
+Provides-Extra: cli
+Description-Content-Type: text/markdown
+# data-annotations
+A small Python package for attaching provenance and structured descriptions to the
+files and directories your workflows produce.
+It is designed for lightweight research and reproducibility pipelines where you want
+generated datasets, tables, plots, or reports to carry enough context to explain
+where they came from and what they contain.
+The package captures common provenance automatically and writes plain JSON and
+Markdown artifacts that are easy to inspect or archive. The canonical on-disk format
+is now a single annotation document:
+- Files use `artifact.ext.meta.json`
+- Directories use `manifest.json`
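The naming convention above can be sketched with the standard library alone. `annotation_path_for` is a hypothetical helper written for illustration; it is not part of the package's API, and the package may locate its documents differently.

```python
from pathlib import Path


def annotation_path_for(target: Path, is_directory: bool) -> Path:
    """Hypothetical helper: where the annotation document for `target` would live."""
    if is_directory:
        # Directories carry one document inside the directory itself.
        return target / "manifest.json"
    # Files keep their full name, extension included, and gain a suffix,
    # so "participants.csv" becomes "participants.csv.meta.json".
    return target.with_name(target.name + ".meta.json")


print(annotation_path_for(Path("outputs/participants.csv"), is_directory=False))
print(annotation_path_for(Path("outputs/run-001"), is_directory=True))
```

Appending to the full file name, rather than replacing the extension, keeps `participants.csv` and `participants.parquet` from colliding on the same sidecar.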
+Each annotation document stores four top-level sections:
+- `annotation_version`
+- `subject`
+- `provenance`
+- `description`
+See the [changelog](CHANGELOG.md) for release history and upgrade-oriented notes.
+## Installation
+Install the core library from PyPI with `pip`:
+```bash
+pip install data-annotations
+```
+Or add it to a project with [uv](https://astral.sh/uv/):
+```bash
+uv add data-annotations
+```
+The command-line interface uses optional dependencies. Install the package with
+CLI support when you want to run `data-annotations` commands:
+```bash
+pip install "data-annotations[cli]"
+uv add "data-annotations[cli]"
+```
+For development or unreleased source installs, install directly from GitLab:
+```bash
+uv add "data-annotations @ git+https://gitlab.com/ceda-unibas/tools/data-annotations.git"
+pip install "data-annotations @ git+https://gitlab.com/ceda-unibas/tools/data-annotations.git"
+```
+Pin a source install to a particular release tag `x.y.z` with:
+```bash
+uv add "data-annotations @ git+https://gitlab.com/ceda-unibas/tools/data-annotations.git@x.y.z"
+```
+## What gets captured automatically
+Every annotation document includes provenance with:
+- A UTC creation timestamp
+- Hostname and username
+- The script path and command-line arguments
+- The script path relative to the Git repo root when it can be determined
+- Git commit, branch, dirty state, and canonical repository remote when available
+- The current `SLURM_JOB_ID` when available
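To give a feel for what gathering this context involves, here is a minimal standard-library sketch. It is not the package's implementation: the function name `capture_runtime_context` and every dictionary key below are illustrative assumptions, not the package's schema.

```python
import getpass
import os
import socket
import subprocess
import sys
from datetime import datetime, timezone
from typing import Optional


def _git(*args: str) -> Optional[str]:
    """Run a git command; return stripped stdout, or None if git fails."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def capture_runtime_context() -> dict:
    """Illustrative only: collect the kinds of values listed above."""
    try:
        username = getpass.getuser()
    except Exception:  # no resolvable user, e.g. in some containers
        username = None
    status = _git("status", "--porcelain")
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "username": username,
        "script": sys.argv[0] if sys.argv else None,
        "argv": sys.argv[1:],
        "git_commit": _git("rev-parse", "HEAD"),
        "git_branch": _git("rev-parse", "--abbrev-ref", "HEAD"),
        # None means "not a Git checkout"; otherwise any porcelain output is dirty.
        "git_dirty": None if status is None else bool(status),
        "slurm_job_id": os.environ.get("SLURM_JOB_ID"),
    }


print(capture_runtime_context())
```

Every Git and SLURM value falls back to `None` when unavailable, mirroring the "when available" wording in the list above.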
+You can also attach your own parameters, input file paths, and function names.
+Local filesystem paths in provenance are stored as absolute paths. URI-style inputs
+such as `s3://...` or `https://...` are preserved as provided.
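The path rule above can be illustrated as follows. `normalize_input` is a hypothetical name, and two details are assumptions: that a URI is recognized by a scheme followed by `://`, and that local paths are resolved (symlinks followed) rather than merely made absolute.

```python
from pathlib import Path
from urllib.parse import urlparse


def normalize_input(ref: str) -> str:
    """Hypothetical helper: keep URIs verbatim, make local paths absolute."""
    scheme = urlparse(ref).scheme
    # Requiring "://" and a multi-character scheme avoids mistaking a
    # Windows drive letter such as "C:" for a URI scheme.
    if "://" in ref and len(scheme) > 1:
        return ref
    return str(Path(ref).resolve())


print(normalize_input("s3://bucket/participants.csv"))
print(normalize_input("data/raw/participants.csv"))
```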
+## Quick Start
+The recommended way to annotate your data artifacts is to decorate pipeline
+functions that consume some inputs and parameters, then write those artifacts.
+This keeps the artifact-writing logic explicit while letting `data-annotations` capture
+provenance and emit sidecars automatically.
+For example, here is a complete file-level annotation workflow using the
+`record_file_annotation(...)` decorator. Once `write_participants` is called, it
+automatically generates sidecars `participants.csv.meta.json` and `participants.csv.README.md`.
+The JSON sidecar will contain provenance and description metadata, and the Markdown sidecar
+will have a human-friendly rendering of the description provided in the decorator.
+```python
+from pathlib import Path
+from data_annotations.annotations import record_file_annotation
+from data_annotations.description import AllowedValue, FieldDefinition
+@record_file_annotation(
+    title="Participant Cohort",
+    summary="Participant-level cohort assignments for the validation split.",
+    fields=[
+        FieldDefinition(
+            name="participant_id",
+            data_type="string",
+            summary="Stable participant identifier.",
+            required=True,
+            nullable=False,
+        ),
+        FieldDefinition(
+            name="group",
+            data_type="string",
+            summary="Assigned study group.",
+            allowed_values=[
+                AllowedValue(value="control"),
+                AllowedValue(value="treatment"),
+            ],
+        ),
+    ],
+    primary_key=["participant_id"],
+    artifact_kind="dataset",
+    acquisition_context={"source": "Study A registry export"},
+    generation_context={"pipeline": "baseline-v1"},
+)
+def write_participants(
+    artifact_path: Path,
+    input_path: Path,
+    split: str,
+) -> Path:
+    participant_ids = [
+        line.strip()
+        for line in input_path.read_text(encoding="utf-8").splitlines()[1:]
+        if line.strip()
+    ]
+    artifact_path.parent.mkdir(parents=True, exist_ok=True)
+    artifact_path.write_text(
+        "\n".join(
+            [
+                "participant_id,group,split",
+                *[
+                    f"{participant_id},control,{split}"
+                    for participant_id in participant_ids
+                ],
+            ]
+        )
+        + "\n",
+        encoding="utf-8",
+    )
+    return artifact_path
+# Annotation sidecars are written automatically
+# when the decorated function is called:
+artifact_path = Path("outputs") / "participants.csv"
+write_participants(
+    artifact_path=artifact_path,
+    input_path=Path("data/raw/participants.csv"),
+    split="validation",
+)
+print(f"{artifact_path}.meta.json")
+print(f"{artifact_path}.README.md")
+```
+### Decorator Contract
+You write a normal Python function; the decorated function returns the original
+function's return value unchanged.
+For provenance-bearing decorators, recorded inputs are inferred from named
+function arguments such as `input_path` and `input_paths`. Those arguments
+should correspond to real data dependencies used inside the wrapped function.
+For file decorators:
+- `record_file_manifest(...)`
+- `record_file_annotation(...)`
+- `record_file_description(...)`
+Your function should:
+- accept one argument pointing at the output file path. By default this argument
+  is named `artifact_path`, but you can change the expected name with
+  `artifact_path_arg=...`.
+- use any other normal Python arguments you need for the pipeline step.
+- for provenance-bearing decorators, use argument names listed in `input_args`
+  for real upstream dependencies you want recorded as provenance inputs. By
+  default those names are `("input_path", "input_paths")`.
+Your function may return any value. File decorators do not inspect that return
+value. Returning the generated `artifact_path` is recommended because it is
+convenient for callers, but it is not required.
+For directory decorators:
+- `record_directory_manifest(...)`
+- `record_directory_annotation(...)`
+- `record_directory_description(...)`
+Your function should:
+- accept one argument pointing at the output directory. By default this argument
+  is named `output_dir`, but you can change the expected name with
+  `output_dir_arg=...`.
+- return a materialized iterable, preferably a `list` or `tuple` rather than a
+  generator, describing the files that were produced in that directory. The
+  decorator iterates over this return value to write sidecars.
+Accepted directory return items are:
+- `DocumentedArtifact` when you want per-artifact title, summary, fields,
+  keys, or missing-value metadata.
+- `ProducedFile` when you only need path, kind, and optional precomputed hash.
+- `(path, kind)` tuples when path and artifact kind are enough.
+- plain path-like values when the artifact kind can default to `"other"`.
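One way to picture how such mixed return items could be reduced to a uniform `(path, kind)` pair is sketched below. `StandInProducedFile` is a stand-in dataclass defined here for illustration, not the package's `ProducedFile` model, and `to_path_and_kind` is a hypothetical helper; how the package actually normalizes items is not shown in the README.

```python
from dataclasses import dataclass
from os import PathLike, fspath
from pathlib import Path


@dataclass
class StandInProducedFile:
    """Illustration only; not the package's `ProducedFile`."""

    path: str
    kind: str = "other"


def to_path_and_kind(item) -> tuple:
    """Hypothetical helper: reduce one return item to (path, kind)."""
    if isinstance(item, StandInProducedFile):
        return item.path, item.kind
    if isinstance(item, tuple) and len(item) == 2:
        path, kind = item
        return fspath(path), kind
    if isinstance(item, (str, PathLike)):
        # Plain path-like values fall back to the default kind.
        return fspath(item), "other"
    raise TypeError(f"unsupported return item: {item!r}")


items = [
    StandInProducedFile(path="out/summary.txt", kind="report"),
    (Path("out/roc.png"), "plot"),
    Path("out/notes.txt"),
]
print([to_path_and_kind(item) for item in items])
```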
+For provenance-bearing directory decorators, `input_args` works the same way as
+for file decorators: matching argument names are recorded as inputs, and the
+remaining bound arguments become provenance params.
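The split between inputs and params described above can be sketched with `inspect.signature`. This is an illustration under stated assumptions, not the package's code: `split_bound_arguments` is a hypothetical name, and both skipping the output-path argument and applying declared defaults before splitting are guesses about behavior the README does not spell out.

```python
import inspect


def split_bound_arguments(
    func,
    args,
    kwargs,
    input_args=("input_path", "input_paths"),
    output_arg="artifact_path",
):
    """Hypothetical helper: separate recorded inputs from provenance params."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # assumption: defaults are recorded too
    inputs, params = [], {}
    for name, value in bound.arguments.items():
        if name == output_arg:
            continue  # assumption: the output location is not a param
        if name in input_args:
            # Accept either a single path or a list/tuple of paths.
            if isinstance(value, (list, tuple)):
                inputs.extend(value)
            else:
                inputs.append(value)
        else:
            params[name] = value
    return inputs, params


def write_report(artifact_path, input_path, threshold=0.5):
    """Placeholder pipeline step with the README's default argument names."""


inputs, params = split_bound_arguments(
    write_report, (), {"artifact_path": "out.txt", "input_path": "in.csv"}
)
print(inputs, params)
```

Binding against the signature, rather than inspecting `kwargs` directly, means positional and keyword calls are treated the same way.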
+Here is another decorator pattern example with `record_directory_annotation(...)`:
+```python
+from pathlib import Path
+from data_annotations.annotations import record_directory_annotation
+from data_annotations.description import DocumentedArtifact, FieldDefinition
+from data_annotations.provenance import ProducedFile
+@record_directory_annotation(
+    title="Validation Outputs",
+    summary="Directory-level documentation for the validation run outputs.",
+    acquisition_context={"source": "Study A registry export"},
+    generation_context={"pipeline": "baseline-v1"},
+)
+def build_outputs(
+    output_dir: Path,
+    input_path: Path,
+    split: str,
+):
+    participant_ids = [
+        line.strip()
+        for line in input_path.read_text(encoding="utf-8").splitlines()[1:]
+        if line.strip()
+    ]
+    output_dir.mkdir(parents=True, exist_ok=True)
+    table_path = output_dir / "scores.csv"
+    table_path.write_text(
+        "\n".join(
+            [
+                "participant_id,score,split",
+                *[
+                    f"{participant_id},0.94,{split}"
+                    for participant_id in participant_ids
+                ],
+            ]
+        )
+        + "\n",
+        encoding="utf-8",
+    )
+    report_path = output_dir / "summary.txt"
+    report_path.write_text(
+        (
+            f"Validated {len(participant_ids)} participants from "
+            f"{input_path.name} for the {split} split.\n"
+        ),
+        encoding="utf-8",
+    )
+    plot_path = output_dir / "roc.png"
+    plot_path.write_bytes(
+        (
+            f"plot placeholder derived from {input_path.name} "
+            f"({len(participant_ids)} participants)\n"
+        ).encode("utf-8")
+    )
+    return [
+        DocumentedArtifact(
+            path=str(table_path),
+            kind="dataset",
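+            title="Scores Table",
+            fields=[
+                FieldDefinition(
+                    name="participant_id",
+                    data_type="string",
+                    summary="Stable participant identifier.",
+                ),
+                FieldDefinition(
+                    name="score",
+                    data_type="float",
+                    summary="Validation score for the participant.",
+                ),
+                FieldDefinition(
+                    name="split",
+                    data_type="string",
+                    summary="Data split the score was computed for.",
+                ),
+            ],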
+        ),
+        ProducedFile(path=str(report_path), kind="report"),
+        (plot_path, "plot"),
+    ]
+output_dir = Path("outputs") / "run-001"
+build_outputs(
+    output_dir=output_dir,
+    input_path=Path("data/raw/participants.csv"),
+    split="validation",
+)
+print(output_dir / "manifest.json")
+print(output_dir / "README.md")
+```
+The decorator and direct APIs write the same canonical document shape. If you need
+metadata to vary per call instead of staying fixed at decoration time, use
+`annotate_file(...)`, `annotate_directory(...)`, `write_file_annotation(...)`, or
+`write_directory_annotation(...)` directly instead. See the example gallery in
+`examples/` for runnable examples of all approaches.
+### When To Use Decorators Vs Direct Functions
+If a function is only a final serializer for already-prepared data, prefer the
+direct annotation and writer APIs. They let you attach `inputs=[...]` explicitly.
+## Canonical Document Shape
+File annotations store:
+- `subject.path`
+- `subject.kind`
+- `subject.sha256`
+- `provenance.*`
+- `description.title`
+- `description.summary`
+- `description.fields`
+- `description.primary_key`
+- `description.missing_value_codes`
+- `description.acquisition_context`
+- `description.generation_context`
+- `description.description_updated_at`
+Directory annotations store:
+- `subject.path`
+- `subject.produced_files[]`
+- `provenance.*`
+- `description.title`
+- `description.summary`
+- `description.artifacts[]`
+- `description.acquisition_context`
+- `description.generation_context`
+- `description.description_updated_at`
+The `description` section intentionally excludes provenance linkage fields and
+file kinds for directory artifacts. Kinds live in `subject.produced_files`.
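When inspecting an annotation document by hand, a small dotted-key check against the lists above can be handy. The sketch below is illustrative: `missing_keys` is a hypothetical helper, and since the README does not say whether optional fields are omitted or stored as `null`, a reported key is a prompt to look closer rather than proof of an invalid document.

```python
# A subset of the file-annotation keys listed above.
FILE_KEYS = (
    "subject.path",
    "subject.kind",
    "subject.sha256",
    "description.title",
    "description.summary",
)


def missing_keys(document: dict, dotted_keys=FILE_KEYS) -> list:
    """Hypothetical helper: list dotted keys that are absent from `document`."""
    missing = []
    for dotted in dotted_keys:
        node = document
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


example = {
    "subject": {"path": "participants.csv", "kind": "dataset"},
    "description": {"title": "Participant Cohort"},
}
print(missing_keys(example))
```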
+## Provenance Decorators And Writers
+The `data_annotations.provenance` namespace provides provenance-only entry points.
+Prefer the decorators when you already have a small function that writes artifacts:
+```python
+from pathlib import Path
+from data_annotations.provenance import record_file_manifest
+@record_file_manifest(artifact_kind="report")
+def write_report(
+    artifact_path: Path,
+    input_path: Path,
+    threshold: float = 0.5,
+):
+    artifact_path.parent.mkdir(parents=True, exist_ok=True)
+    artifact_path.write_text(
+        f"threshold applied: {threshold}\nsource={input_path.name}\n",
+        encoding="utf-8",
+    )
+write_report(
+    artifact_path=Path("outputs/summary.txt"),
+    input_path=Path("data/raw/participants.csv"),
+    threshold=0.75,
+)
+```
+Use `record_directory_manifest(...)` for directory outputs. Directory decorators
+accept `DocumentedArtifact`, `ProducedFile`, `(path, kind)`, and plain path-like
+return values.
+If you want the direct writer approach instead, use `write_file_manifest(...)` and
+`write_directory_manifest(...)` (see `examples/`).
+## Description Layer
+The `data_annotations.description` sub-package provides the structured description
+models used by annotation writers and the Markdown sidecar renderers.
+Within those models, the primary human-written narrative field is named `summary`.
+Key public description models:
+- `AllowedValue`
+- `FieldDefinition`
+- `DocumentedArtifact`
+- `ArtifactDescription`
+- `FileDescription`
+- `DirectoryDescription`
+Description decorators and helpers:
+- `record_file_description(...)`
+- `record_directory_description(...)`
+- `write_file_description(...)`
+- `write_directory_description(...)`
+- `render_file_readme(...)`
+- `render_directory_readme(...)`
+Alias helpers `write_file_readme(...)` and `write_directory_readme(...)` are supported.
+Use the decorator forms when the description metadata is stable
+for a function, and use the direct helpers when you want to assemble descriptions
+per call.
+## Recovery Helpers
+Use `artifact_matches_manifest(...)` to verify whether a detached artifact still
+matches an annotation document, and `checkout_manifest_source(...)` to recover the
+recorded code state from Git metadata.
+```python
+from pathlib import Path
+from data_annotations.provenance import (
+    artifact_matches_manifest,
+    checkout_manifest_source,
+)
+annotation_path = Path("outputs/participants.csv.meta.json")
+artifact_path = Path("downloads/participants.csv")
+if artifact_matches_manifest(artifact_path, annotation_path):
+    recovered = checkout_manifest_source(annotation_path)
+    print(recovered.checkout_path)
+    print(recovered.script_path)
+```
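Because file annotations store `subject.sha256`, a detached artifact can also be checked by hand. The sketch below assumes the recorded value is the lowercase hex SHA-256 digest of the file's bytes; `matches_recorded_hash` is a hypothetical helper and makes no claim about what `artifact_matches_manifest(...)` does internally.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(artifact_path: Path, annotation_path: Path) -> bool:
    """Hypothetical helper: compare the file's digest with `subject.sha256`."""
    document = json.loads(annotation_path.read_text(encoding="utf-8"))
    recorded = document.get("subject", {}).get("sha256")
    # A document without a recorded hash cannot confirm a match.
    return recorded is not None and recorded == sha256_of(artifact_path)
```

Call it as `matches_recorded_hash(Path("downloads/participants.csv"), Path("outputs/participants.csv.meta.json"))`, using the same two paths as the example above.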
+## Post-Hoc Annotation
+The strongest workflow is to create provenance and description at the same time
+as the artifact itself. When annotations are written during generation, the
+package can capture runtime context directly and the resulting records are
+typically more complete, precise, and trustworthy.
+For existing artifacts, the CLI provides a post-hoc annotation path so you can
+still attach provenance and description after the fact.
+Post-hoc descriptions can still be very useful, but the quality of post-hoc
+provenance depends on how exact the supplied answers are. In particular, fields
+such as the generating script, command, function, Git commit, repository path,
+inputs, and parameters are only as reliable as the information entered during
+annotation.
+## CLI Workflow
+This package provides a command-line interface (CLI) for retrospective annotation
+and provenance inspection.
+For post-hoc annotation:
+```bash
+data-annotations annotate file path/to/participants.csv
+data-annotations annotate directory path/to/run-001
+```
+These commands prompt for missing details, write `*.meta.json` or `manifest.json`,
+and optionally derive README sidecars. Post-hoc records are marked with
+`capture_mode="post_hoc"`.
+For provenance inspection and source recovery:
+```bash
+data-annotations provenance match path/to/artifact
+data-annotations provenance checkout path/to/artifact
+```
+The `match` command auto-discovers `*.meta.json` for files and `manifest.json` for
+directories, prints a verification summary, and suggests the exact `checkout`
+command to run next when Git recovery metadata is available.
+### Run With `uvx`
+```bash
+uvx --from "data-annotations[cli]" data-annotations provenance match path/to/participants.csv
+```
+### Install And Use With `uv tool`
+```bash
+uv tool install "data-annotations[cli]"
+data-annotations provenance match path/to/participants.csv
+```
+### Run From Repository Root
+From the repository root while developing locally, run `task install` first.
+That task uses `uv sync --extra cli`, so the CLI commands are available in
+the project environment. You can then run:
+```bash
+uv run data-annotations annotate file path/to/participants.csv
+uv run data-annotations annotate directory path/to/run-001
+uv run data-annotations provenance match path/to/participants.csv
+uv run data-annotations provenance checkout path/to/participants.csv
+```
+## API Overview
+### Annotation Models
+- `FileArtifactSubject`
+- `DirectoryArtifactSubject`
+- `FileAnnotationDocument`
+- `DirectoryAnnotationDocument`
+- `FileAnnotationResult`
+- `DirectoryAnnotationResult`
+### Annotation Decorators
+- `record_file_annotation(...)`
+- `record_directory_annotation(...)`
+### Annotation Functions
+- `write_file_annotation(...)`
+- `write_directory_annotation(...)`
+- `annotate_file(...)`
+- `annotate_directory(...)`
+### Description Functions
+- `record_file_description(...)`
+- `record_directory_description(...)`
+- `write_file_description(...)`
+- `write_directory_description(...)`
+- `write_file_readme(...)`
+- `write_directory_readme(...)`
+- `render_file_readme(...)`
+- `render_directory_readme(...)`
+### Provenance Models
+- `ProducedFile`
+- `BaseProvenance`
+- `FileManifest`
+- `DirectoryManifest`
+- `RecoveredSource`
+### Provenance Functions
+- `record_file_manifest(...)`
+- `record_directory_manifest(...)`
+- `write_file_manifest(...)`
+- `write_directory_manifest(...)`
+- `artifact_matches_manifest(...)`
+- `checkout_manifest_source(...)`
+## Examples
+Runnable examples live in `examples/` and mirror the README workflows.
+Run them from the repository root with:
+```bash
+uv run python examples/record_file_annotation.py
+uv run python examples/record_directory_annotation.py
+uv run python examples/record_file_manifest.py
+uv run python examples/record_directory_manifest.py
+uv run python examples/record_file_description.py
+uv run python examples/record_directory_description.py
+uv run python examples/annotate_file.py
+uv run python examples/annotate_directory.py
+uv run python examples/write_file_manifest.py
+uv run python examples/write_directory_manifest.py
+uv run python examples/write_file_description.py
+uv run python examples/write_directory_description.py
+uv run python examples/recover_provenance.py
+uv run python examples/recover_provenance_cli.py
+```
+Each example writes its outputs to a fresh temporary directory and prints the
+location so you can inspect the generated annotation documents and README sidecars.