data-annotations 2.1.2__tar.gz → 2.2.0__tar.gz
- {data_annotations-2.1.2 → data_annotations-2.2.0}/PKG-INFO +136 -27
- {data_annotations-2.1.2 → data_annotations-2.2.0}/README.md +135 -26
- {data_annotations-2.1.2 → data_annotations-2.2.0}/pyproject.toml +1 -1
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/_decorators.py +83 -19
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/annotations/decorators.py +14 -3
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/annotations/models.py +5 -3
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/annotations/writers.py +111 -17
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/annotate.py +185 -9
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/common.py +99 -13
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/prompts.py +92 -2
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/description/__init__.py +4 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/description/decorators.py +29 -2
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/description/models.py +42 -1
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/description/writers.py +58 -1
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/__init__.py +4 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/decorators.py +9 -3
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/git.py +10 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/models.py +10 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/recovery.py +142 -11
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/writers.py +130 -6
- {data_annotations-2.1.2 → data_annotations-2.2.0}/LICENSE +0 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/__init__.py +0 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/annotations/__init__.py +0 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli.py +0 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/__init__.py +0 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/provenance_commands.py +0 -0
- {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/runtime.py +0 -0
--- data_annotations-2.1.2/PKG-INFO
+++ data_annotations-2.2.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: data-annotations
-Version: 2.
+Version: 2.2.0
 Summary: Annotate generated data artifacts
 Keywords: annotations,data,metadata,provenance,reproducibility
 Author: Rodrigo C. G. Pena
@@ -29,7 +29,7 @@ Description-Content-Type: text/markdown
 
 # data-annotations
 
-A
+A Python package for attaching provenance and structured descriptions to the
 files and directories your workflows produce.
 
 It is designed for lightweight research and reproducibility pipelines where you want
@@ -37,11 +37,11 @@ generated datasets, tables, plots, or reports to carry enough context to explain
 where they came from and what they contain.
 
 The package captures common provenance automatically and writes plain JSON and
-Markdown artifacts that are easy to inspect or archive. The canonical on-disk
-
+Markdown artifacts that are easy to inspect or archive. The canonical on-disk
+format uses one JSON annotation document per artifact:
 
-- Files use `artifact.ext.
-- Directories
+- Files use `artifact.ext.annotation.json`
+- Directories carry `data-annotations.json` at their root
 
 Each annotation document stores four top-level sections:
 
@@ -50,6 +50,10 @@ Each annotation document stores four top-level sections:
 - `provenance`
 - `description`
 
+Here's the mental model: files get a visible sibling annotation, and
+directories carry one visible annotation at their root. Treat the annotation as
+part of the research output bundle.
+
 See the [changelog](CHANGELOG.md) for release history and upgrade-oriented notes.
 
 ## Installation
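Reviewer note: the naming rule introduced in the hunks above (file annotations as `artifact.ext.annotation.json` siblings, directory annotations as `data-annotations.json` at the directory root) can be sketched in a few lines. The helper name `expected_annotation_path` and both constants are hypothetical and exist only for illustration; the package's own path logic is not part of this diff excerpt and may differ.

```python
from pathlib import Path

# Assumed constants, taken from the naming rule quoted in the README text.
FILE_ANNOTATION_SUFFIX = ".annotation.json"
DIRECTORY_ANNOTATION_NAME = "data-annotations.json"


def expected_annotation_path(artifact: Path, is_directory: bool) -> Path:
    """Return where an annotation would live under the 2.2.0 naming rule.

    Hypothetical helper: it only mirrors the documented convention and does
    not call into the data-annotations package.
    """
    if is_directory:
        # Directories carry one annotation at their root.
        return artifact / DIRECTORY_ANNOTATION_NAME
    # Files get a sibling whose name is the full file name plus the suffix.
    return artifact.with_name(artifact.name + FILE_ANNOTATION_SUFFIX)


print(expected_annotation_path(Path("outputs/participants.csv"), is_directory=False))
print(expected_annotation_path(Path("outputs/run-001"), is_directory=True))
```

Appending to the full file name, rather than swapping the extension, keeps `participants.csv` and `participants.parquet` from colliding on one annotation file.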
@@ -95,12 +99,15 @@ Every annotation document includes provenance with:
 - Hostname and username
 - The script path and command-line arguments
 - The script path relative to the Git repo root when it can be determined
-- Git commit, branch, dirty state,
+- Git commit, branch, dirty state, canonical repository remote, exact tags, and
+  `git describe` output when available
 - The current `SLURM_JOB_ID` when available
 
 You can also attach your own parameters, input file paths, and function names.
 Local filesystem paths in provenance are stored as absolute paths. URI-style inputs
 such as `s3://...` or `https://...` are preserved as provided.
+Git tags and `git_describe` are human-friendly hints only; `git_sha` remains the
+source of truth for reproducibility, matching, and source checkout.
 
 ## Quick Start
 
@@ -111,7 +118,7 @@ provenance and emit sidecars automatically.
 
 For example, here is a complete file-level annotation workflow using the
 `record_file_annotation(...)` decorator. Once `write_participants` is called, it
-automatically generates sidecars `participants.csv.
+automatically generates sidecars `participants.csv.annotation.json` and `participants.csv.README.md`.
 The JSON sidecar will contain provenance and description metadata, and the Markdown sidecar
 will have a human-friendly rendering of the description provided in the decorator.
 
@@ -182,7 +189,7 @@ write_participants(
     split="validation",
 )
 
-print(f"{artifact_path}.
+print(f"{artifact_path}.annotation.json")
 print(f"{artifact_path}.README.md")
 ```
 
@@ -235,7 +242,12 @@ Accepted directory return items are:
 
 - `DocumentedArtifact` when you want per-artifact title, summary, fields,
   keys, or missing-value metadata.
+- `DocumentedArtifactGroup` for `record_directory_annotation(...)` and
+  `record_directory_description(...)` when many files share one title, summary,
+  kind, and optional schema metadata.
 - `ProducedFile` when you only need path, kind, and optional precomputed hash.
+- `ChildBundle` when an annotated child directory should be referenced as its
+  own independently shareable bundle.
 - `(path, kind)` tuples when path and artifact kind are enough.
 - plain path-like values when the artifact kind can default to `"other"`.
 
@@ -249,7 +261,11 @@ Here is another decorator pattern example with `record_directory_annotation(...)
 from pathlib import Path
 
 from data_annotations.annotations import record_directory_annotation
-from data_annotations.description import
+from data_annotations.description import (
+    DocumentedArtifact,
+    DocumentedArtifactGroup,
+    FieldDefinition,
+)
 from data_annotations.provenance import ProducedFile
 
 @record_directory_annotation(
@@ -294,13 +310,16 @@ def build_outputs(
         encoding="utf-8",
     )
 
-
-
-
-
-
-
-
+    plot_paths = []
+    for day in ["2024-01-01", "2024-01-02", "2024-01-03"]:
+        plot_path = output_dir / f"sma_{day}.png"
+        plot_path.write_bytes(
+            (
+                f"plot placeholder for the SMA variable on {day}, "
+                f"derived from {input_path.name}\n"
+            ).encode("utf-8")
+        )
+        plot_paths.append(plot_path)
 
     return [
         DocumentedArtifact(
@@ -321,7 +340,13 @@ def build_outputs(
             ],
         ),
         ProducedFile(path=str(report_path), kind="report"),
-        (
+        DocumentedArtifactGroup(
+            title="Daily SMA plots",
+            summary="Plots of the same variable on different days.",
+            kind="plot",
+            paths=[str(path) for path in plot_paths],
+            selector="sma_*.png",
+        ),
     ]
 
 
@@ -332,7 +357,7 @@ build_outputs(
     split="validation",
 )
 
-print(output_dir / "
+print(output_dir / "data-annotations.json")
 print(output_dir / "README.md")
 ```
 
@@ -368,16 +393,66 @@ Directory annotations store:
 
 - `subject.path`
 - `subject.produced_files[]`
+- `subject.child_bundles[]`
+- `subject.content_digest`
 - `provenance.*`
 - `description.title`
 - `description.summary`
+- `description.artifact_groups[]`
 - `description.artifacts[]`
 - `description.acquisition_context`
 - `description.generation_context`
 - `description.description_updated_at`
 
-
-
+Use `description.artifact_groups[]` when many files have the same meaning, and
+use `description.artifacts[]` only for file-specific notes, overrides, or schema.
+Groups are descriptive only. Integrity still lives in `subject.produced_files[]`,
+which tracks every concrete file by path, kind, and checksum.
+
+The `description` section intentionally excludes provenance linkage fields.
+Directory `produced_files[].path` values are stored relative to `subject.path`,
+which keeps verification stable when a complete output directory is copied or
+archived elsewhere. `subject.content_digest` is computed from sorted tracked file
+paths, file checksums, and referenced child bundle digests.
+
+## Artifact Groups
+
+Artifact groups are for homogeneous sets of files that researchers naturally
+understand as one output family: for example, 100 PNG plots of the same variable,
+one per acquisition day. A group stores the shared title, summary, kind, optional
+schema fields, and the concrete member paths. It can also store an informational
+`selector`, such as `plots/*.png`, to show how the group was chosen.
+
+Rules of thumb:
+
+- Use artifact groups when many files have the same meaning.
+- Use individual artifacts for file-specific notes, exceptions, or overrides.
+- It is OK for an individual artifact to also appear in a group.
+- Do not rely on groups for integrity. `subject.produced_files[]` remains the
+  complete checksum inventory.
+
+## Nested Directory Policy
+
+Annotate the smallest thing you would share as a unit. If a directory is one
+research output, give that directory one `data-annotations.json`, even when its
+tracked files live in nested subdirectories.
+
+Use recursive directory annotations for one bundle with nested files:
+
+```bash
+data-annotations annotate directory path/to/run-001 --recursive
+data-annotations annotate directory path/to/run-001 --max-depth 2
+```
+
+Use child bundle annotations when a subdirectory is independently meaningful,
+shareable, or reusable. In that case, annotate the child directory first, then
+annotate the parent. The parent records a compact `child_bundles[]` reference
+with the child path, child annotation path, and child content digest; it does not
+copy the child file inventory into the parent JSON.
+
+Post-hoc directory discovery follows the same rule. `--recursive` discovers
+nested files, but it stops at annotated child directories containing
+`data-annotations.json` and records them as child bundles.
 
 ## Provenance Decorators And Writers
 
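Reviewer note: the hunk above says `subject.content_digest` is computed from sorted tracked file paths, file checksums, and referenced child bundle digests, and that relative paths keep verification stable when a directory is copied. The sketch below illustrates that idea only. The function names, the SHA-256 choice, and the `file:`/`child:` line encoding are assumptions made for this example; the package's `directory_content_digest(...)` may use a different algorithm and byte layout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, Mapping, Optional


def file_sha256(path: Path) -> str:
    """Checksum of one file (SHA-256 is an assumption, not the package's spec)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sketch_content_digest(
    root: Path,
    tracked_paths: Iterable[str],
    child_digests: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical digest over sorted relative paths, checksums, and child digests."""
    hasher = hashlib.sha256()
    # Sorting makes the digest independent of discovery order.
    for rel in sorted(tracked_paths):
        hasher.update(f"file:{rel}:{file_sha256(root / rel)}\n".encode("utf-8"))
    # Child bundles contribute only their own digest, not their file inventory.
    for child, digest in sorted((child_digests or {}).items()):
        hasher.update(f"child:{child}:{digest}\n".encode("utf-8"))
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "run-001"
    (original / "plots").mkdir(parents=True)
    (original / "table.csv").write_text("id,value\n1,2\n", encoding="utf-8")
    (original / "plots" / "sma.png").write_bytes(b"placeholder")
    tracked = ["table.csv", "plots/sma.png"]  # stored relative to the bundle root

    digest_original = sketch_content_digest(original, tracked)

    # Copy the whole bundle elsewhere: relative paths and bytes are unchanged.
    copy = Path(tmp) / "archive" / "run-001"
    shutil.copytree(original, copy)
    digest_copy = sketch_content_digest(copy, tracked)

    # Editing a tracked file in the copy changes the digest.
    (copy / "table.csv").write_text("id,value\n1,3\n", encoding="utf-8")
    digest_changed = sketch_content_digest(copy, tracked)

print(digest_original == digest_copy)
print(digest_original == digest_changed)
```

Because only paths relative to the bundle root enter the hash, the copied directory verifies identically until one of its tracked files changes.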
@@ -412,7 +487,9 @@ write_report(
 
 Use `record_directory_manifest(...)` for directory outputs. Directory decorators
 accept `DocumentedArtifact`, `ProducedFile`, `(path, kind)`, and plain path-like
-return values.
+return values. Provenance-only APIs do not accept description groups; use
+unified annotation or description APIs when groups should appear in the JSON or
+README.
 
 If you want the direct writer approach instead, use `write_file_manifest(...)` and
 `write_directory_manifest(...)` (see `examples/`).
@@ -428,7 +505,9 @@ Key public description models:
 - `AllowedValue`
 - `FieldDefinition`
 - `DocumentedArtifact`
+- `DocumentedArtifactGroup`
 - `ArtifactDescription`
+- `ArtifactGroupDescription`
 - `FileDescription`
 - `DirectoryDescription`
 
@@ -461,7 +540,7 @@ from data_annotations.provenance import (
     checkout_manifest_source,
 )
 
-annotation_path = Path("outputs/participants.csv.
+annotation_path = Path("outputs/participants.csv.annotation.json")
 artifact_path = Path("downloads/participants.csv")
 
 if artifact_matches_manifest(artifact_path, annotation_path):
@@ -483,8 +562,8 @@ still attach provenance and description after the fact.
 Post-hoc descriptions can still be very useful, but the quality of post-hoc
 provenance depends on how exact the supplied answers are. In particular, fields
 such as the generating script, command, function, Git commit, repository path,
-inputs, and parameters are only as reliable as
-annotation.
+Git tags, `git describe` output, inputs, and parameters are only as reliable as
+the information entered during annotation.
 
 ## CLI Workflow
 
@@ -496,12 +575,29 @@ For post-hoc annotation:
 ```bash
 data-annotations annotate file path/to/participants.csv
 data-annotations annotate directory path/to/run-001
+data-annotations annotate directory path/to/run-001 --recursive
+data-annotations annotate directory path/to/run-001 --max-depth 2
+data-annotations annotate directory path/to/run-001 \
+  --recursive \
+  --group-selector "plots/*.png" \
+  --group-title "Daily SMA plots" \
+  --group-summary "Plots of the same variable on different days." \
+  --group-kind plot
 ```
 
-These commands prompt for missing details, write `*.
+These commands prompt for missing details, write `*.annotation.json` or `data-annotations.json`,
 and optionally derive README sidecars. Post-hoc records are marked with
 `capture_mode="post_hoc"`.
 
+When group selectors are provided, the CLI expands them to concrete member paths
+at annotation time. Grouped files are tracked in `subject.produced_files[]` but
+are skipped by the per-file prompt flow, so you do not have to answer the same
+questions for every matching file.
+
+For post-hoc provenance, use repeatable `--git-tag` and optional
+`--git-describe` when you know the original code state. These values are stored
+as human-readable hints; `--git-sha` remains the field used for recovery.
+
 For provenance inspection and source recovery:
 
 ```bash
@@ -509,7 +605,7 @@ data-annotations provenance match path/to/artifact
 data-annotations provenance checkout path/to/artifact
 ```
 
-Command `match` auto-discovers `*.
+Command `match` auto-discovers `*.annotation.json` for files and `data-annotations.json` for
 directories, prints a verification summary, and suggests the exact `checkout`
 command to run next when Git recovery metadata is available.
 
@@ -562,6 +658,17 @@ uv run data-annotations provenance checkout path/to/participants.csv
 - `annotate_file(...)`
 - `annotate_directory(...)`
 
+### Description Models
+
+- `AllowedValue`
+- `FieldDefinition`
+- `DocumentedArtifact`
+- `DocumentedArtifactGroup`
+- `ArtifactDescription`
+- `ArtifactGroupDescription`
+- `FileDescription`
+- `DirectoryDescription`
+
 ### Description Functions
 
 - `record_file_description(...)`
@@ -576,6 +683,7 @@ uv run data-annotations provenance checkout path/to/participants.csv
 ### Provenance Models
 
 - `ProducedFile`
+- `ChildBundle`
 - `BaseProvenance`
 - `FileManifest`
 - `DirectoryManifest`
@@ -587,6 +695,7 @@ uv run data-annotations provenance checkout path/to/participants.csv
 - `record_directory_manifest(...)`
 - `write_file_manifest(...)`
 - `write_directory_manifest(...)`
+- `directory_content_digest(...)`
 - `artifact_matches_manifest(...)`
 - `checkout_manifest_source(...)`
 
--- data_annotations-2.1.2/README.md
+++ data_annotations-2.2.0/README.md
@@ -1,6 +1,6 @@
 # data-annotations
 
-A
+A Python package for attaching provenance and structured descriptions to the
 files and directories your workflows produce.
 
 It is designed for lightweight research and reproducibility pipelines where you want
@@ -8,11 +8,11 @@ generated datasets, tables, plots, or reports to carry enough context to explain
 where they came from and what they contain.
 
 The package captures common provenance automatically and writes plain JSON and
-Markdown artifacts that are easy to inspect or archive. The canonical on-disk
-
+Markdown artifacts that are easy to inspect or archive. The canonical on-disk
+format uses one JSON annotation document per artifact:
 
-- Files use `artifact.ext.
-- Directories
+- Files use `artifact.ext.annotation.json`
+- Directories carry `data-annotations.json` at their root
 
 Each annotation document stores four top-level sections:
 
@@ -21,6 +21,10 @@ Each annotation document stores four top-level sections:
 - `provenance`
 - `description`
 
+Here's the mental model: files get a visible sibling annotation, and
+directories carry one visible annotation at their root. Treat the annotation as
+part of the research output bundle.
+
 See the [changelog](CHANGELOG.md) for release history and upgrade-oriented notes.
 
 ## Installation
@@ -66,12 +70,15 @@ Every annotation document includes provenance with:
 - Hostname and username
 - The script path and command-line arguments
 - The script path relative to the Git repo root when it can be determined
-- Git commit, branch, dirty state,
+- Git commit, branch, dirty state, canonical repository remote, exact tags, and
+  `git describe` output when available
 - The current `SLURM_JOB_ID` when available
 
 You can also attach your own parameters, input file paths, and function names.
 Local filesystem paths in provenance are stored as absolute paths. URI-style inputs
 such as `s3://...` or `https://...` are preserved as provided.
+Git tags and `git_describe` are human-friendly hints only; `git_sha` remains the
+source of truth for reproducibility, matching, and source checkout.
 
 ## Quick Start
 
@@ -82,7 +89,7 @@ provenance and emit sidecars automatically.
 
 For example, here is a complete file-level annotation workflow using the
 `record_file_annotation(...)` decorator. Once `write_participants` is called, it
-automatically generates sidecars `participants.csv.
+automatically generates sidecars `participants.csv.annotation.json` and `participants.csv.README.md`.
 The JSON sidecar will contain provenance and description metadata, and the Markdown sidecar
 will have a human-friendly rendering of the description provided in the decorator.
 
@@ -153,7 +160,7 @@ write_participants(
     split="validation",
 )
 
-print(f"{artifact_path}.
+print(f"{artifact_path}.annotation.json")
 print(f"{artifact_path}.README.md")
 ```
 
@@ -206,7 +213,12 @@ Accepted directory return items are:
 
 - `DocumentedArtifact` when you want per-artifact title, summary, fields,
   keys, or missing-value metadata.
+- `DocumentedArtifactGroup` for `record_directory_annotation(...)` and
+  `record_directory_description(...)` when many files share one title, summary,
+  kind, and optional schema metadata.
 - `ProducedFile` when you only need path, kind, and optional precomputed hash.
+- `ChildBundle` when an annotated child directory should be referenced as its
+  own independently shareable bundle.
 - `(path, kind)` tuples when path and artifact kind are enough.
 - plain path-like values when the artifact kind can default to `"other"`.
 
@@ -220,7 +232,11 @@ Here is another decorator pattern example with `record_directory_annotation(...)
 from pathlib import Path
 
 from data_annotations.annotations import record_directory_annotation
-from data_annotations.description import
+from data_annotations.description import (
+    DocumentedArtifact,
+    DocumentedArtifactGroup,
+    FieldDefinition,
+)
 from data_annotations.provenance import ProducedFile
 
 @record_directory_annotation(
@@ -265,13 +281,16 @@ def build_outputs(
         encoding="utf-8",
     )
 
-
-
-
-
-
-
-
+    plot_paths = []
+    for day in ["2024-01-01", "2024-01-02", "2024-01-03"]:
+        plot_path = output_dir / f"sma_{day}.png"
+        plot_path.write_bytes(
+            (
+                f"plot placeholder for the SMA variable on {day}, "
+                f"derived from {input_path.name}\n"
+            ).encode("utf-8")
+        )
+        plot_paths.append(plot_path)
 
     return [
         DocumentedArtifact(
@@ -292,7 +311,13 @@ def build_outputs(
             ],
         ),
         ProducedFile(path=str(report_path), kind="report"),
-        (
+        DocumentedArtifactGroup(
+            title="Daily SMA plots",
+            summary="Plots of the same variable on different days.",
+            kind="plot",
+            paths=[str(path) for path in plot_paths],
+            selector="sma_*.png",
+        ),
     ]
 
 
@@ -303,7 +328,7 @@ build_outputs(
     split="validation",
 )
 
-print(output_dir / "
+print(output_dir / "data-annotations.json")
 print(output_dir / "README.md")
 ```
 
@@ -339,16 +364,66 @@ Directory annotations store:
 
 - `subject.path`
 - `subject.produced_files[]`
+- `subject.child_bundles[]`
+- `subject.content_digest`
 - `provenance.*`
 - `description.title`
 - `description.summary`
+- `description.artifact_groups[]`
 - `description.artifacts[]`
 - `description.acquisition_context`
 - `description.generation_context`
 - `description.description_updated_at`
 
-
-
+Use `description.artifact_groups[]` when many files have the same meaning, and
+use `description.artifacts[]` only for file-specific notes, overrides, or schema.
+Groups are descriptive only. Integrity still lives in `subject.produced_files[]`,
+which tracks every concrete file by path, kind, and checksum.
+
+The `description` section intentionally excludes provenance linkage fields.
+Directory `produced_files[].path` values are stored relative to `subject.path`,
+which keeps verification stable when a complete output directory is copied or
+archived elsewhere. `subject.content_digest` is computed from sorted tracked file
+paths, file checksums, and referenced child bundle digests.
+
+## Artifact Groups
+
+Artifact groups are for homogeneous sets of files that researchers naturally
+understand as one output family: for example, 100 PNG plots of the same variable,
+one per acquisition day. A group stores the shared title, summary, kind, optional
+schema fields, and the concrete member paths. It can also store an informational
+`selector`, such as `plots/*.png`, to show how the group was chosen.
+
+Rules of thumb:
+
+- Use artifact groups when many files have the same meaning.
+- Use individual artifacts for file-specific notes, exceptions, or overrides.
+- It is OK for an individual artifact to also appear in a group.
+- Do not rely on groups for integrity. `subject.produced_files[]` remains the
+  complete checksum inventory.
+
+## Nested Directory Policy
+
+Annotate the smallest thing you would share as a unit. If a directory is one
+research output, give that directory one `data-annotations.json`, even when its
+tracked files live in nested subdirectories.
+
+Use recursive directory annotations for one bundle with nested files:
+
+```bash
+data-annotations annotate directory path/to/run-001 --recursive
+data-annotations annotate directory path/to/run-001 --max-depth 2
+```
+
+Use child bundle annotations when a subdirectory is independently meaningful,
+shareable, or reusable. In that case, annotate the child directory first, then
+annotate the parent. The parent records a compact `child_bundles[]` reference
+with the child path, child annotation path, and child content digest; it does not
+copy the child file inventory into the parent JSON.
+
+Post-hoc directory discovery follows the same rule. `--recursive` discovers
+nested files, but it stops at annotated child directories containing
+`data-annotations.json` and records them as child bundles.
 
 ## Provenance Decorators And Writers
 
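Reviewer note: the "Nested Directory Policy" text in the hunk above says recursive discovery collects nested files but stops at child directories that already contain `data-annotations.json`, recording them as child bundles. A minimal sketch of that traversal rule follows. The function name `discover_bundle` is hypothetical, and skipping the root's own annotation file is an assumption; the CLI's real discovery code (including `--max-depth` handling) is not shown in this excerpt.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

ANNOTATION_NAME = "data-annotations.json"  # directory annotation name from the README


def discover_bundle(root: Path) -> Tuple[List[str], List[str]]:
    """Hypothetical walk: return (tracked files, child bundles) relative to root."""
    files: List[str] = []
    child_bundles: List[str] = []

    def walk(directory: Path) -> None:
        for entry in sorted(directory.iterdir()):
            if entry.is_dir():
                if (entry / ANNOTATION_NAME).is_file():
                    # Annotated child: reference it, do not descend into it.
                    child_bundles.append(entry.relative_to(root).as_posix())
                else:
                    walk(entry)
            elif entry.name != ANNOTATION_NAME:
                # Assumption: the bundle's own annotation is not a tracked file.
                files.append(entry.relative_to(root).as_posix())

    walk(root)
    return sorted(files), sorted(child_bundles)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "run-001"
    (root / "plots").mkdir(parents=True)
    (root / "child").mkdir()
    (root / "report.txt").write_text("report", encoding="utf-8")
    (root / "plots" / "a.png").write_bytes(b"png")
    (root / "child" / "inner.csv").write_text("x\n", encoding="utf-8")
    (root / "child" / ANNOTATION_NAME).write_text("{}", encoding="utf-8")

    files, child_bundles = discover_bundle(root)

print(files)
print(child_bundles)
```

Here `child/inner.csv` never enters the parent's file list, which matches the statement that the parent does not copy the child file inventory.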
@@ -383,7 +458,9 @@ write_report(
 
 Use `record_directory_manifest(...)` for directory outputs. Directory decorators
 accept `DocumentedArtifact`, `ProducedFile`, `(path, kind)`, and plain path-like
-return values.
+return values. Provenance-only APIs do not accept description groups; use
+unified annotation or description APIs when groups should appear in the JSON or
+README.
 
 If you want the direct writer approach instead, use `write_file_manifest(...)` and
 `write_directory_manifest(...)` (see `examples/`).
@@ -399,7 +476,9 @@ Key public description models:
 - `AllowedValue`
 - `FieldDefinition`
 - `DocumentedArtifact`
+- `DocumentedArtifactGroup`
 - `ArtifactDescription`
+- `ArtifactGroupDescription`
 - `FileDescription`
 - `DirectoryDescription`
 
@@ -432,7 +511,7 @@ from data_annotations.provenance import (
     checkout_manifest_source,
 )
 
-annotation_path = Path("outputs/participants.csv.
+annotation_path = Path("outputs/participants.csv.annotation.json")
 artifact_path = Path("downloads/participants.csv")
 
 if artifact_matches_manifest(artifact_path, annotation_path):
@@ -454,8 +533,8 @@ still attach provenance and description after the fact.
 Post-hoc descriptions can still be very useful, but the quality of post-hoc
 provenance depends on how exact the supplied answers are. In particular, fields
 such as the generating script, command, function, Git commit, repository path,
-inputs, and parameters are only as reliable as
-annotation.
+Git tags, `git describe` output, inputs, and parameters are only as reliable as
+the information entered during annotation.
 
 ## CLI Workflow
 
@@ -467,12 +546,29 @@ For post-hoc annotation:
 ```bash
 data-annotations annotate file path/to/participants.csv
 data-annotations annotate directory path/to/run-001
+data-annotations annotate directory path/to/run-001 --recursive
+data-annotations annotate directory path/to/run-001 --max-depth 2
+data-annotations annotate directory path/to/run-001 \
+  --recursive \
+  --group-selector "plots/*.png" \
+  --group-title "Daily SMA plots" \
+  --group-summary "Plots of the same variable on different days." \
+  --group-kind plot
 ```
 
-These commands prompt for missing details, write `*.
+These commands prompt for missing details, write `*.annotation.json` or `data-annotations.json`,
 and optionally derive README sidecars. Post-hoc records are marked with
 `capture_mode="post_hoc"`.
 
+When group selectors are provided, the CLI expands them to concrete member paths
+at annotation time. Grouped files are tracked in `subject.produced_files[]` but
+are skipped by the per-file prompt flow, so you do not have to answer the same
+questions for every matching file.
+
+For post-hoc provenance, use repeatable `--git-tag` and optional
+`--git-describe` when you know the original code state. These values are stored
+as human-readable hints; `--git-sha` remains the field used for recovery.
+
 For provenance inspection and source recovery:
 
 ```bash
@@ -480,7 +576,7 @@ data-annotations provenance match path/to/artifact
 data-annotations provenance checkout path/to/artifact
 ```
 
-Command `match` auto-discovers `*.
+Command `match` auto-discovers `*.annotation.json` for files and `data-annotations.json` for
 directories, prints a verification summary, and suggests the exact `checkout`
 command to run next when Git recovery metadata is available.
 
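Reviewer note: the CLI text earlier in this README diff says group selectors are expanded to concrete member paths at annotation time, and that grouped files stay in `subject.produced_files[]` while being skipped by the per-file prompt flow. The sketch below shows one way that split could work. `expand_selector` and the variable names are hypothetical, and using `pathlib` glob semantics for the selector is an assumption; the CLI's actual matching rules are not visible in this excerpt.

```python
import tempfile
from pathlib import Path
from typing import List


def expand_selector(root: Path, selector: str) -> List[str]:
    """Hypothetical expansion of a selector such as 'plots/*.png' under root."""
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.glob(selector)
        if path.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "run-001"
    (root / "plots").mkdir(parents=True)
    (root / "plots" / "day1.png").write_bytes(b"png")
    (root / "plots" / "day2.png").write_bytes(b"png")
    (root / "plots" / "notes.txt").write_text("notes", encoding="utf-8")
    (root / "table.csv").write_text("id\n1\n", encoding="utf-8")

    # Every concrete file is tracked, grouped or not.
    tracked = sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )
    # Group members are frozen to concrete paths at annotation time.
    grouped = expand_selector(root, "plots/*.png")
    # Only files outside every group would need per-file questions.
    needs_prompt = [path for path in tracked if path not in set(grouped)]

print(tracked)
print(grouped)
print(needs_prompt)
```

Storing the expanded member list, with the selector kept only as an informational hint, means the group stays accurate even if new matching files appear later.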
@@ -533,6 +629,17 @@ uv run data-annotations provenance checkout path/to/participants.csv
 - `annotate_file(...)`
 - `annotate_directory(...)`
 
+### Description Models
+
+- `AllowedValue`
+- `FieldDefinition`
+- `DocumentedArtifact`
+- `DocumentedArtifactGroup`
+- `ArtifactDescription`
+- `ArtifactGroupDescription`
+- `FileDescription`
+- `DirectoryDescription`
+
 ### Description Functions
 
 - `record_file_description(...)`
@@ -547,6 +654,7 @@ uv run data-annotations provenance checkout path/to/participants.csv
 ### Provenance Models
 
 - `ProducedFile`
+- `ChildBundle`
 - `BaseProvenance`
 - `FileManifest`
 - `DirectoryManifest`
@@ -558,6 +666,7 @@ uv run data-annotations provenance checkout path/to/participants.csv
 - `record_directory_manifest(...)`
 - `write_file_manifest(...)`
 - `write_directory_manifest(...)`
+- `directory_content_digest(...)`
 - `artifact_matches_manifest(...)`
 - `checkout_manifest_source(...)`
 