PyPI - data-annotations - Versions diffs - 2.8.0__tar.gz → 2.8.1__tar.gz - Mend

data-annotations 2.8.0tar.gz → 2.8.1tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (42)

data_annotations-2.8.1/PKG-INFO ADDED

@@ -0,0 +1,161 @@
+Metadata-Version: 2.4
+Name: data-annotations
+Version: 2.8.1
+Summary: Annotate data artifacts with provenance and descriptions
+Keywords: annotations,data,metadata,provenance,reproducibility
+Author: Rodrigo C.  G.  Pena
+Author-email: Rodrigo C.  G.  Pena <rodrigo.cerqueiragonzalezpena@unibas.ch>
+License-Expression: BSD-3-Clause
+License-File: LICENSE
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Dist: pydantic>=2.13.1
+Requires-Dist: pyyaml>=6.0.2
+Requires-Dist: questionary>=2.1.1 ; extra == 'cli'
+Requires-Dist: typer>=0.16.0 ; extra == 'cli'
+Requires-Python: >=3.12
+Project-URL: Source, https://gitlab.com/ceda-unibas/tools/data-annotations
+Project-URL: Changelog, https://gitlab.com/ceda-unibas/tools/data-annotations/-/blob/main/CHANGELOG.md
+Project-URL: Issues, https://gitlab.com/ceda-unibas/tools/data-annotations/-/issues
+Provides-Extra: cli
+Description-Content-Type: text/markdown
+# data-annotations
+`data-annotations` is a Python package for attaching provenance and structured
+descriptions to the files and directories your workflows produce.
+It writes plain JSON annotation sidecars that are easy to inspect, archive, and
+publish with research outputs:
+- files use `artifact.ext.annotation.json`
+- directories use `data-annotations.json` at their root
+Optional Markdown README sidecars can be generated for human-readable summaries.
+## Documentation
+The full documentation is organized as a [Diátaxis](https://diataxis.fr/) site:
+https://ceda-unibas.gitlab.io/tools/data-annotations/
+- Source: https://gitlab.com/ceda-unibas/tools/data-annotations
+- Changelog: https://gitlab.com/ceda-unibas/tools/data-annotations/-/blob/main/CHANGELOG.md
+- Work items: https://gitlab.com/ceda-unibas/tools/data-annotations/-/work_items
+## Installation
+Install the core library from PyPI:
+```bash
+pip install data-annotations
+```
+Or add it to a project with `uv`:
+```bash
+uv add data-annotations
+```
+Install CLI support when you want the `data-annotations` command:
+```bash
+pip install "data-annotations[cli]"
+uv add "data-annotations[cli]"
+```
+## Quick start
+Decorate a function that writes an artifact. When the function runs,
+`data-annotations` records provenance and writes the JSON sidecar.
+```python
+from pathlib import Path
+from data_annotations.annotations import record_file_annotation
+from data_annotations.description import FieldDefinition
+@record_file_annotation(
+    title="Participant Cohort",
+    summary="Participant-level cohort assignments.",
+    fields=[
+        FieldDefinition(
+            name="participant_id",
+            data_type="string",
+            summary="Stable participant identifier.",
+            required=True,
+            nullable=False,
+        ),
+    ],
+    primary_key=["participant_id"],
+    artifact_kind="dataset",
+    write_readme=True,
+)
+def write_participants(artifact_path: Path, input_path: Path) -> Path:
+    participant_ids = [
+        line.strip()
+        for line in input_path.read_text(encoding="utf-8").splitlines()[1:]
+        if line.strip()
+    ]
+    artifact_path.parent.mkdir(parents=True, exist_ok=True)
+    artifact_path.write_text(
+        "participant_id\n" + "\n".join(participant_ids) + "\n",
+        encoding="utf-8",
+    )
+    return artifact_path
+artifact_path = Path("outputs") / "participants.csv"
+write_participants(
+    artifact_path=artifact_path,
+    input_path=Path("data/raw/participants.csv"),
+)
+```
+This writes:
+```text
+outputs/participants.csv
+outputs/participants.csv.annotation.json
+outputs/participants.csv.README.md
+```
+## CLI
+The CLI supports retrospective annotation, provenance inspection, source
+recovery, and sanitized publish bundles.
+```bash
+data-annotations annotate file path/to/participants.csv --write-readme
+data-annotations annotate directory path/to/run-001 --recursive
+data-annotations provenance match path/to/participants.csv
+data-annotations provenance chain path/to/participants.csv
+data-annotations provenance checkout path/to/participants.csv
+data-annotations publish path/to/run-001 path/to/publish-bundle
+```
+## Development
+From a source checkout (assuming you have [Task installed](https://taskfile.dev/docs/installation)):
+```bash
+task install
+task lint
+task type-check
+task test
+```
+Build or preview the documentation site:
+```bash
+task docs-build
+task docs-serve
+```
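
The header lines at the top of `PKG-INFO` use the RFC 822-style core metadata layout, so the standard library's `email` parser can read them. The sketch below is illustrative only: `SAMPLE` copies a few header lines from the diff above, and `requirements_by_extra` is a hypothetical helper with deliberately naive marker handling, not part of `data-annotations`.

```python
from __future__ import annotations

from email import message_from_string

# A few header lines copied from the PKG-INFO diff above (sample only).
SAMPLE = """\
Metadata-Version: 2.4
Name: data-annotations
Version: 2.8.1
Requires-Dist: pydantic>=2.13.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: questionary>=2.1.1 ; extra == 'cli'
Requires-Dist: typer>=0.16.0 ; extra == 'cli'
Requires-Python: >=3.12
Provides-Extra: cli
"""


def requirements_by_extra(pkg_info_text: str) -> dict[str, list[str]]:
    """Group Requires-Dist entries by extra; '' means a core dependency."""
    message = message_from_string(pkg_info_text)
    grouped: dict[str, list[str]] = {}
    for entry in message.get_all("Requires-Dist", []):
        requirement, _, marker = entry.partition(";")
        extra = ""
        if "extra ==" in marker:
            # Naive handling: take the quoted name that follows `extra ==`.
            extra = marker.split("extra ==", 1)[1].strip().strip("'\"")
        grouped.setdefault(extra, []).append(requirement.strip())
    return grouped


grouped = requirements_by_extra(SAMPLE)
print(grouped)
```

Here the two core dependencies land under the empty key and the `questionary` and `typer` requirements land under `cli`, matching the `Provides-Extra: cli` line. A real tool would parse markers with a proper requirement parser rather than string splitting.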

data_annotations-2.8.1/README.md ADDED

@@ -0,0 +1,131 @@
+# data-annotations
+`data-annotations` is a Python package for attaching provenance and structured
+descriptions to the files and directories your workflows produce.
+It writes plain JSON annotation sidecars that are easy to inspect, archive, and
+publish with research outputs:
+- files use `artifact.ext.annotation.json`
+- directories use `data-annotations.json` at their root
+Optional Markdown README sidecars can be generated for human-readable summaries.
+## Documentation
+The full documentation is organized as a [Diátaxis](https://diataxis.fr/) site:
+https://ceda-unibas.gitlab.io/tools/data-annotations/
+- Source: https://gitlab.com/ceda-unibas/tools/data-annotations
+- Changelog: https://gitlab.com/ceda-unibas/tools/data-annotations/-/blob/main/CHANGELOG.md
+- Work items: https://gitlab.com/ceda-unibas/tools/data-annotations/-/work_items
+## Installation
+Install the core library from PyPI:
+```bash
+pip install data-annotations
+```
+Or add it to a project with `uv`:
+```bash
+uv add data-annotations
+```
+Install CLI support when you want the `data-annotations` command:
+```bash
+pip install "data-annotations[cli]"
+uv add "data-annotations[cli]"
+```
+## Quick start
+Decorate a function that writes an artifact. When the function runs,
+`data-annotations` records provenance and writes the JSON sidecar.
+```python
+from pathlib import Path
+from data_annotations.annotations import record_file_annotation
+from data_annotations.description import FieldDefinition
+@record_file_annotation(
+    title="Participant Cohort",
+    summary="Participant-level cohort assignments.",
+    fields=[
+        FieldDefinition(
+            name="participant_id",
+            data_type="string",
+            summary="Stable participant identifier.",
+            required=True,
+            nullable=False,
+        ),
+    ],
+    primary_key=["participant_id"],
+    artifact_kind="dataset",
+    write_readme=True,
+)
+def write_participants(artifact_path: Path, input_path: Path) -> Path:
+    participant_ids = [
+        line.strip()
+        for line in input_path.read_text(encoding="utf-8").splitlines()[1:]
+        if line.strip()
+    ]
+    artifact_path.parent.mkdir(parents=True, exist_ok=True)
+    artifact_path.write_text(
+        "participant_id\n" + "\n".join(participant_ids) + "\n",
+        encoding="utf-8",
+    )
+    return artifact_path
+artifact_path = Path("outputs") / "participants.csv"
+write_participants(
+    artifact_path=artifact_path,
+    input_path=Path("data/raw/participants.csv"),
+)
+```
+This writes:
+```text
+outputs/participants.csv
+outputs/participants.csv.annotation.json
+outputs/participants.csv.README.md
+```
+## CLI
+The CLI supports retrospective annotation, provenance inspection, source
+recovery, and sanitized publish bundles.
+```bash
+data-annotations annotate file path/to/participants.csv --write-readme
+data-annotations annotate directory path/to/run-001 --recursive
+data-annotations provenance match path/to/participants.csv
+data-annotations provenance chain path/to/participants.csv
+data-annotations provenance checkout path/to/participants.csv
+data-annotations publish path/to/run-001 path/to/publish-bundle
+```
+## Development
+From a source checkout (assuming you have [Task installed](https://taskfile.dev/docs/installation)):
+```bash
+task install
+task lint
+task type-check
+task test
+```
+Build or preview the documentation site:
+```bash
+task docs-build
+task docs-serve
+```
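
The README states the sidecar naming rule in prose: files use `artifact.ext.annotation.json`, and directories use `data-annotations.json` at their root. A minimal sketch of that rule follows. `expected_sidecar` is a hypothetical helper written for illustration; it does not call the package and only computes paths.

```python
from pathlib import Path

# Names taken from the README's description of the sidecar layout.
FILE_SUFFIX = ".annotation.json"
DIRECTORY_SIDECAR = "data-annotations.json"


def expected_sidecar(artifact: Path, *, is_directory: bool) -> Path:
    """Return where the README says the JSON sidecar lives."""
    if is_directory:
        # Directories keep one annotation document at their root.
        return artifact / DIRECTORY_SIDECAR
    # Files get the suffix appended after the full name, extension included.
    return artifact.with_name(artifact.name + FILE_SUFFIX)


file_sidecar = expected_sidecar(Path("outputs/participants.csv"), is_directory=False)
dir_sidecar = expected_sidecar(Path("outputs/run-001"), is_directory=True)
print(file_sidecar)
print(dir_sidecar)
```

The file case reproduces the `outputs/participants.csv.annotation.json` path listed in the quick start output.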

{data_annotations-2.8.0 → data_annotations-2.8.1}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "data-annotations"
-version = "2.8.0"
+version = "2.8.1"
 description = "Annotate data artifacts with provenance and descriptions"
 readme = "README.md"
 authors = [
@@ -42,8 +42,10 @@ build-backend = "uv_build"
 [dependency-groups]
 dev = [
   "ipykernel>=7.2.0",
+  "mkdocstrings-python>=2.0.4",
   "prek>=0.3.9",
   "pytest>=9.0.3",
   "ruff>=0.15.10",
   "ty>=0.0.31",
+  "zensical>=0.0.45",
 ]

{data_annotations-2.8.0 → data_annotations-2.8.1}/src/data_annotations/annotations/answers.py RENAMED

@@ -125,21 +125,62 @@ _EXPLICIT_PROVENANCE_OVERRIDE_FIELDS = {
 class ProvenanceAnswers(BaseModel):
+    """Optional provenance overrides supplied through an answers payload."""
     model_config = ConfigDict(extra="forbid")
-    command: str | list[str] | None = None
-    script: str | None = None
-    script_repo_path: str | None = None
-    function: str | None = None
-    git_sha: str | None = None
-    git_branch: str | None = None
-    git_dirty: bool | None = None
-    git_remote_name: str | None = None
-    git_remote_url: str | None = None
-    git_tags: list[str] | None = None
-    git_describe: str | None = None
-    source_code: SourceCodeReference | None = None
-    infer_from_runtime: list[str] = Field(default_factory=list)
+    command: str | list[str] | None = Field(
+        default=None,
+        description="Command line to record instead of the captured runtime command.",
+    )
+    script: str | None = Field(
+        default=None,
+        description="Script path to record instead of the captured runtime script.",
+    )
+    script_repo_path: str | None = Field(
+        default=None,
+        description="Script path relative to the source repository root.",
+    )
+    function: str | None = Field(
+        default=None,
+        description="Qualified callable name to record in provenance.",
+    )
+    git_sha: str | None = Field(
+        default=None,
+        description="Git commit SHA to record in provenance.",
+    )
+    git_branch: str | None = Field(
+        default=None,
+        description="Git branch name to record in provenance.",
+    )
+    git_dirty: bool | None = Field(
+        default=None,
+        description="Whether the recorded Git worktree was dirty.",
+    )
+    git_remote_name: str | None = Field(
+        default=None,
+        description="Git remote name to record in provenance.",
+    )
+    git_remote_url: str | None = Field(
+        default=None,
+        description="Git remote URL to record in provenance.",
+    )
+    git_tags: list[str] | None = Field(
+        default=None,
+        description="Git tags to record for the captured revision.",
+    )
+    git_describe: str | None = Field(
+        default=None,
+        description="git describe output to record for the captured revision.",
+    )
+    source_code: SourceCodeReference | None = Field(
+        default=None,
+        description="Explicit source-code reference to record for recovery.",
+    )
+    infer_from_runtime: list[str] = Field(
+        default_factory=list,
+        description="Runtime fields that should remain inferred instead of overridden.",
+    )
     @field_validator("infer_from_runtime", mode="before")
     @classmethod
@@ -185,6 +226,8 @@ class ProvenanceAnswers(BaseModel):
         return self
     def command_tokens(self) -> list[str] | None:
+        """Return `command` as shell-like tokens."""
         if self.command is None:
             return None
         if isinstance(self.command, list):
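
The hunks around this point show only fragments of `command_tokens`: a `None` check, a list branch, and an `AnswersError` raised from a caught exception. A minimal sketch of that shape follows, assuming the string branch tokenizes with `shlex.split` (the diff does not show which tokenizer is used) and using a local `AnswersError` stand-in instead of the package's own exception.

```python
from __future__ import annotations

import shlex


class AnswersError(ValueError):
    """Local stand-in for the package's exception (assumption)."""


def command_tokens(command: str | list[str] | None) -> list[str] | None:
    """Return `command` as shell-like tokens, or None when unset."""
    if command is None:
        return None
    if isinstance(command, list):
        # Already tokenized: return a copy so callers cannot mutate the input.
        return list(command)
    try:
        # Assumption: POSIX shell quoting rules via shlex.
        return shlex.split(command)
    except ValueError as exc:
        raise AnswersError(f"invalid provenance.command: {exc}") from exc


tokens = command_tokens('python make_cohort.py --label "pilot run"')
print(tokens)
```

Quoted arguments stay together as one token, and an unterminated quote surfaces as `AnswersError` rather than a bare `ValueError`.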
@@ -195,6 +238,8 @@ class ProvenanceAnswers(BaseModel):
             raise AnswersError(f"invalid provenance.command: {exc}") from exc
     def runtime_inference_fields(self) -> set[str]:
+        """Expand runtime inference groups into concrete provenance fields."""
         fields: set[str] = set()
         for field in self.infer_from_runtime:
             fields.update(_RUNTIME_INFERENCE_GROUPS.get(field, {field}))
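
`runtime_inference_fields` expands each entry of `infer_from_runtime` through `_RUNTIME_INFERENCE_GROUPS`, falling back to the entry itself when it is not a group name. The contents of that mapping are not shown in this diff, so the `git` group below is a made-up example that only reuses field names from `ProvenanceAnswers`.

```python
from __future__ import annotations

# Hypothetical group table: the real `_RUNTIME_INFERENCE_GROUPS` is not shown in the diff.
RUNTIME_INFERENCE_GROUPS: dict[str, set[str]] = {
    "git": {"git_sha", "git_branch", "git_dirty"},
}


def runtime_inference_fields(infer_from_runtime: list[str]) -> set[str]:
    """Expand group names into concrete field names; pass other names through."""
    fields: set[str] = set()
    for field in infer_from_runtime:
        # Unknown names count as single fields, mirroring `.get(field, {field})`.
        fields.update(RUNTIME_INFERENCE_GROUPS.get(field, {field}))
    return fields


expanded = runtime_inference_fields(["git", "script"])
print(sorted(expanded))
```

Because the result is a set, listing a group together with one of its own members does not produce duplicates.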
@@ -202,47 +247,124 @@ class ProvenanceAnswers(BaseModel):
 class BaseAnswers(BaseModel):
+    """Common answers payload fields shared by file and directory modes."""
     model_config = ConfigDict(extra="forbid")
-    target: str | None = None
-    title: str | None = None
-    summary: str | None = None
-    inputs: list[str] = Field(default_factory=list)
-    params: dict[str, Any] = Field(default_factory=dict)
-    provenance: ProvenanceAnswers = Field(default_factory=ProvenanceAnswers)
+    target: str | None = Field(
+        default=None,
+        description="Artifact path supplied by the answers payload.",
+    )
+    title: str | None = Field(
+        default=None,
+        description="Display title for generated descriptions or README files.",
+    )
+    summary: str | None = Field(
+        default=None,
+        description="Short artifact summary for generated descriptions.",
+    )
+    inputs: list[str] = Field(
+        default_factory=list,
+        description="Input paths or URIs to record in provenance.",
+    )
+    params: dict[str, Any] = Field(
+        default_factory=dict,
+        description="Parameter values to record in provenance.",
+    )
+    provenance: ProvenanceAnswers = Field(
+        default_factory=ProvenanceAnswers,
+        description="Provenance overrides and runtime inference controls.",
+    )
 class FileAnswers(BaseAnswers):
-    kind: ArtifactKind = "other"
-    sha256: str | None = None
-    fields: list[FieldDefinition] = Field(default_factory=list)
-    primary_key: list[str] = Field(default_factory=list)
-    missing_value_codes: dict[str, str] = Field(default_factory=dict)
+    """Validated answers payload for annotating one existing file."""
+    kind: ArtifactKind = Field(
+        default="other",
+        description="High-level artifact category for the file.",
+    )
+    sha256: str | None = Field(
+        default=None,
+        description="Precomputed SHA-256 digest for the file.",
+    )
+    fields: list[FieldDefinition] = Field(
+        default_factory=list,
+        description="Field-level descriptions for the file.",
+    )
+    primary_key: list[str] = Field(
+        default_factory=list,
+        description="Field names that uniquely identify records in the file.",
+    )
+    missing_value_codes: dict[str, str] = Field(
+        default_factory=dict,
+        description="Mapping of missing-value markers to their meanings.",
+    )
 class DirectoryArtifactAnswers(BaseModel):
+    """Answers entry describing one artifact inside a directory."""
     model_config = ConfigDict(extra="forbid")
-    path: str
-    kind: ArtifactKind = "other"
-    title: str | None = None
-    summary: str | None = None
-    fields: list[FieldDefinition] = Field(default_factory=list)
-    primary_key: list[str] = Field(default_factory=list)
-    missing_value_codes: dict[str, str] = Field(default_factory=dict)
+    path: str = Field(description="Artifact path relative to the target directory.")
+    kind: ArtifactKind = Field(
+        default="other",
+        description="High-level artifact category.",
+    )
+    title: str | None = Field(
+        default=None,
+        description="Display title for the artifact.",
+    )
+    summary: str | None = Field(
+        default=None,
+        description="Short description of the artifact.",
+    )
+    fields: list[FieldDefinition] = Field(
+        default_factory=list,
+        description="Field-level descriptions for the artifact.",
+    )
+    primary_key: list[str] = Field(
+        default_factory=list,
+        description="Field names that uniquely identify records in the artifact.",
+    )
+    missing_value_codes: dict[str, str] = Field(
+        default_factory=dict,
+        description="Mapping of missing-value markers to their meanings.",
+    )
 class DirectoryArtifactGroupAnswers(BaseModel):
+    """Answers entry describing a group of artifacts inside a directory."""
     model_config = ConfigDict(extra="forbid")
-    title: str
-    summary: str | None = None
-    kind: ArtifactKind = "other"
-    paths: list[str]
-    selector: str | None = None
-    fields: list[FieldDefinition] = Field(default_factory=list)
-    primary_key: list[str] = Field(default_factory=list)
-    missing_value_codes: dict[str, str] = Field(default_factory=dict)
+    title: str = Field(description="Display title for the artifact group.")
+    summary: str | None = Field(
+        default=None,
+        description="Short description shared by group members.",
+    )
+    kind: ArtifactKind = Field(
+        default="other",
+        description="High-level category shared by group members.",
+    )
+    paths: list[str] = Field(description="Artifact paths included in the group.")
+    selector: str | None = Field(
+        default=None,
+        description="Pattern or rule used to select members of the group.",
+    )
+    fields: list[FieldDefinition] = Field(
+        default_factory=list,
+        description="Field-level descriptions shared by group members.",
+    )
+    primary_key: list[str] = Field(
+        default_factory=list,
+        description="Field names that uniquely identify records in each member.",
+    )
+    missing_value_codes: dict[str, str] = Field(
+        default_factory=dict,
+        description="Mapping of missing-value markers to their meanings.",
+    )
     @field_validator("paths")
     @classmethod
@@ -255,18 +377,39 @@ class DirectoryArtifactGroupAnswers(BaseModel):
 class ChildBundleAnswers(BaseModel):
+    """Answers entry for a nested annotated directory bundle."""
     model_config = ConfigDict(extra="forbid")
-    path: str
-    annotation_path: str
-    content_digest: str | None = None
+    path: str = Field(description="Path to the child bundle directory.")
+    annotation_path: str = Field(
+        description="Path to the child bundle annotation document.",
+    )
+    content_digest: str | None = Field(
+        default=None,
+        description="Expected content digest for the child bundle.",
+    )
 class DirectoryAnswers(BaseAnswers):
-    artifacts: list[DirectoryArtifactAnswers] = Field(default_factory=list)
-    artifact_groups: list[DirectoryArtifactGroupAnswers] = Field(default_factory=list)
-    child_bundles: list[ChildBundleAnswers] = Field(default_factory=list)
-    checksums: dict[str, str] = Field(default_factory=dict)
+    """Validated answers payload for annotating an existing directory."""
+    artifacts: list[DirectoryArtifactAnswers] = Field(
+        default_factory=list,
+        description="Individual artifacts to document inside the directory.",
+    )
+    artifact_groups: list[DirectoryArtifactGroupAnswers] = Field(
+        default_factory=list,
+        description="Groups of artifacts to document together.",
+    )
+    child_bundles: list[ChildBundleAnswers] = Field(
+        default_factory=list,
+        description="Nested annotated bundles to include in the directory subject.",
+    )
+    checksums: dict[str, str] = Field(
+        default_factory=dict,
+        description="Precomputed checksums keyed by artifact path.",
+    )
 FileAnswersInput: TypeAlias = str | Path | Mapping[str, Any] | FileAnswers
@@ -274,10 +417,34 @@ DirectoryAnswersInput: TypeAlias = str | Path | Mapping[str, Any] | DirectoryAns
 def load_file_answers(source: FileAnswersInput) -> FileAnswers:
+    """Load and validate answers for file annotation.
+    Args:
+        source: YAML path, mapping, or existing `FileAnswers` instance.
+    Returns:
+        Validated file answers.
+    Raises:
+        AnswersError: If the source cannot be loaded or validated.
+    """
     return _validate_answers(source, mode="file")
 def load_directory_answers(source: DirectoryAnswersInput) -> DirectoryAnswers:
+    """Load and validate answers for directory annotation.
+    Args:
+        source: YAML path, mapping, or existing `DirectoryAnswers` instance.
+    Returns:
+        Validated directory answers.
+    Raises:
+        AnswersError: If the source cannot be loaded or validated.
+    """
     return _validate_answers(source, mode="directory")
