data-annotations 2.1.2__tar.gz → 2.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (27)
  1. {data_annotations-2.1.2 → data_annotations-2.2.0}/PKG-INFO +136 -27
  2. {data_annotations-2.1.2 → data_annotations-2.2.0}/README.md +135 -26
  3. {data_annotations-2.1.2 → data_annotations-2.2.0}/pyproject.toml +1 -1
  4. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/_decorators.py +83 -19
  5. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/annotations/decorators.py +14 -3
  6. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/annotations/models.py +5 -3
  7. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/annotations/writers.py +111 -17
  8. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/annotate.py +185 -9
  9. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/common.py +99 -13
  10. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/prompts.py +92 -2
  11. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/description/__init__.py +4 -0
  12. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/description/decorators.py +29 -2
  13. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/description/models.py +42 -1
  14. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/description/writers.py +58 -1
  15. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/__init__.py +4 -0
  16. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/decorators.py +9 -3
  17. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/git.py +10 -0
  18. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/models.py +10 -0
  19. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/recovery.py +142 -11
  20. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/writers.py +130 -6
  21. {data_annotations-2.1.2 → data_annotations-2.2.0}/LICENSE +0 -0
  22. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/__init__.py +0 -0
  23. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/annotations/__init__.py +0 -0
  24. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli.py +0 -0
  25. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/__init__.py +0 -0
  26. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/cli_app/provenance_commands.py +0 -0
  27. {data_annotations-2.1.2 → data_annotations-2.2.0}/src/data_annotations/provenance/runtime.py +0 -0
--- data_annotations-2.1.2/PKG-INFO
+++ data_annotations-2.2.0/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: data-annotations
- Version: 2.1.2
+ Version: 2.2.0
  Summary: Annotate generated data artifacts
  Keywords: annotations,data,metadata,provenance,reproducibility
  Author: Rodrigo C. G. Pena
@@ -29,7 +29,7 @@ Description-Content-Type: text/markdown
 
  # data-annotations
 
- A small Python package for attaching provenance and structured descriptions to the
+ A Python package for attaching provenance and structured descriptions to the
  files and directories your workflows produce.
 
  It is designed for lightweight research and reproducibility pipelines where you want
@@ -37,11 +37,11 @@ generated datasets, tables, plots, or reports to carry enough context to explain
  where they came from and what they contain.
 
  The package captures common provenance automatically and writes plain JSON and
- Markdown artifacts that are easy to inspect or archive. The canonical on-disk format
- is now a single annotation document:
+ Markdown artifacts that are easy to inspect or archive. The canonical on-disk
+ format uses one JSON annotation document per artifact:
 
- - Files use `artifact.ext.meta.json`
- - Directories use `manifest.json`
+ - Files use `artifact.ext.annotation.json`
+ - Directories carry `data-annotations.json` at their root
 
  Each annotation document stores four top-level sections:
 
@@ -50,6 +50,10 @@ Each annotation document stores four top-level sections:
  - `provenance`
  - `description`
 
+ Here's the mental model: files get a visible sibling annotation, and
+ directories carry one visible annotation at their root. Treat the annotation as
+ part of the research output bundle.
+
  See the [changelog](CHANGELOG.md) for release history and upgrade-oriented notes.
 
  ## Installation
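
Reviewer note on the hunks above: the renamed layout (a sibling `*.annotation.json` for files, a root-level `data-annotations.json` for directories) can be sketched as a small path helper. `annotation_path_for` is a hypothetical name used only for illustration and is not taken from the package; the two file-name patterns are the only parts that come from the README text.

```python
from pathlib import Path

# Both names are quoted from the 2.2.0 README text in this diff.
DIRECTORY_ANNOTATION_NAME = "data-annotations.json"
FILE_ANNOTATION_SUFFIX = ".annotation.json"


def annotation_path_for(artifact: Path, *, is_directory: bool) -> Path:
    """Return where the 2.2.0 layout would place the JSON annotation.

    Hypothetical helper: the package may expose this differently or not at all.
    """
    if is_directory:
        # Directories carry one annotation at their root.
        return artifact / DIRECTORY_ANNOTATION_NAME
    # Files get a visible sibling whose name appends to the full file name.
    return artifact.with_name(artifact.name + FILE_ANNOTATION_SUFFIX)


print(annotation_path_for(Path("outputs/participants.csv"), is_directory=False))
print(annotation_path_for(Path("outputs/run-001"), is_directory=True))
```

Appending to the full file name (rather than replacing the extension) matches the `f"{artifact_path}.annotation.json"` pattern used in the README quick start.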
@@ -95,12 +99,15 @@ Every annotation document includes provenance with:
  - Hostname and username
  - The script path and command-line arguments
  - The script path relative to the Git repo root when it can be determined
- - Git commit, branch, dirty state, and canonical repository remote when available
+ - Git commit, branch, dirty state, canonical repository remote, exact tags, and
+ `git describe` output when available
  - The current `SLURM_JOB_ID` when available
 
  You can also attach your own parameters, input file paths, and function names.
  Local filesystem paths in provenance are stored as absolute paths. URI-style inputs
  such as `s3://...` or `https://...` are preserved as provided.
+ Git tags and `git_describe` are human-friendly hints only; `git_sha` remains the
+ source of truth for reproducibility, matching, and source checkout.
 
  ## Quick Start
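
Reviewer note on the hunk above: one plausible way to gather the new tag and `git describe` hints next to the commit SHA is sketched below. This is an assumption about the approach, not the package's implementation: `collect_git_hints` and the `git_tags` key are hypothetical, and the exact Git options the package passes are not shown in this diff. Only `git_sha` and `git_describe` are field names that appear in the README.

```python
import subprocess
from pathlib import Path
from typing import Optional


def _git(repo: Path, *args: str) -> Optional[str]:
    """Run a git command in `repo`; return stripped stdout, or None on failure."""
    try:
        result = subprocess.run(
            ["git", *args],
            cwd=repo,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is missing, or `repo` is not inside a repository.
        return None
    return result.stdout.strip()


def collect_git_hints(repo: Path) -> dict:
    """Collect the commit SHA plus human-friendly hints (hypothetical helper)."""
    tags = _git(repo, "tag", "--points-at", "HEAD")
    return {
        # Source of truth for matching and checkout, per the README.
        "git_sha": _git(repo, "rev-parse", "HEAD"),
        # Hints only; the "git_tags" key name is an assumption.
        "git_tags": tags.splitlines() if tags else [],
        "git_describe": _git(repo, "describe", "--tags", "--always", "--dirty"),
    }


hints = collect_git_hints(Path("."))
print(hints)
```

Every value degrades to `None` or an empty list outside a repository, which mirrors the "when available" wording in the README.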
@@ -111,7 +118,7 @@ provenance and emit sidecars automatically.
 
  For example, here is a complete file-level annotation workflow using the
  `record_file_annotation(...)` decorator. Once `write_participants` is called, it
- automatically generates sidecars `participants.csv.meta.json` and `participants.csv.README.md`.
+ automatically generates sidecars `participants.csv.annotation.json` and `participants.csv.README.md`.
  The JSON sidecar will contain provenance and description metadata, and the Markdown sidecar
  will have a human-friendly rendering of the description provided in the decorator.
 
@@ -182,7 +189,7 @@ write_participants(
      split="validation",
  )
 
- print(f"{artifact_path}.meta.json")
+ print(f"{artifact_path}.annotation.json")
  print(f"{artifact_path}.README.md")
  ```
 
@@ -235,7 +242,12 @@ Accepted directory return items are:
 
  - `DocumentedArtifact` when you want per-artifact title, summary, fields,
  keys, or missing-value metadata.
+ - `DocumentedArtifactGroup` for `record_directory_annotation(...)` and
+ `record_directory_description(...)` when many files share one title, summary,
+ kind, and optional schema metadata.
  - `ProducedFile` when you only need path, kind, and optional precomputed hash.
+ - `ChildBundle` when an annotated child directory should be referenced as its
+ own independently shareable bundle.
  - `(path, kind)` tuples when path and artifact kind are enough.
  - plain path-like values when the artifact kind can default to `"other"`.
 
@@ -249,7 +261,11 @@ Here is another decorator pattern example with `record_directory_annotation(...)
  from pathlib import Path
 
  from data_annotations.annotations import record_directory_annotation
- from data_annotations.description import DocumentedArtifact, FieldDefinition
+ from data_annotations.description import (
+     DocumentedArtifact,
+     DocumentedArtifactGroup,
+     FieldDefinition,
+ )
  from data_annotations.provenance import ProducedFile
 
  @record_directory_annotation(
@@ -294,13 +310,16 @@ def build_outputs(
          encoding="utf-8",
      )
 
-     plot_path = output_dir / "roc.png"
-     plot_path.write_bytes(
-         (
-             f"plot placeholder derived from {input_path.name} "
-             f"({len(participant_ids)} participants)\n"
-         ).encode("utf-8")
-     )
+     plot_paths = []
+     for day in ["2024-01-01", "2024-01-02", "2024-01-03"]:
+         plot_path = output_dir / f"sma_{day}.png"
+         plot_path.write_bytes(
+             (
+                 f"plot placeholder for the SMA variable on {day}, "
+                 f"derived from {input_path.name}\n"
+             ).encode("utf-8")
+         )
+         plot_paths.append(plot_path)
 
      return [
          DocumentedArtifact(
@@ -321,7 +340,13 @@ def build_outputs(
              ],
          ),
          ProducedFile(path=str(report_path), kind="report"),
-         (plot_path, "plot"),
+         DocumentedArtifactGroup(
+             title="Daily SMA plots",
+             summary="Plots of the same variable on different days.",
+             kind="plot",
+             paths=[str(path) for path in plot_paths],
+             selector="sma_*.png",
+         ),
      ]
 
 
@@ -332,7 +357,7 @@ build_outputs(
      split="validation",
  )
 
- print(output_dir / "manifest.json")
+ print(output_dir / "data-annotations.json")
  print(output_dir / "README.md")
  ```
 
@@ -368,16 +393,66 @@ Directory annotations store:
 
  - `subject.path`
  - `subject.produced_files[]`
+ - `subject.child_bundles[]`
+ - `subject.content_digest`
  - `provenance.*`
  - `description.title`
  - `description.summary`
+ - `description.artifact_groups[]`
  - `description.artifacts[]`
  - `description.acquisition_context`
  - `description.generation_context`
  - `description.description_updated_at`
 
- The `description` section intentionally excludes provenance linkage fields and
- file kinds for directory artifacts. Kinds live in `subject.produced_files`.
+ Use `description.artifact_groups[]` when many files have the same meaning, and
+ use `description.artifacts[]` only for file-specific notes, overrides, or schema.
+ Groups are descriptive only. Integrity still lives in `subject.produced_files[]`,
+ which tracks every concrete file by path, kind, and checksum.
+
+ The `description` section intentionally excludes provenance linkage fields.
+ Directory `produced_files[].path` values are stored relative to `subject.path`,
+ which keeps verification stable when a complete output directory is copied or
+ archived elsewhere. `subject.content_digest` is computed from sorted tracked file
+ paths, file checksums, and referenced child bundle digests.
+
+ ## Artifact Groups
+
+ Artifact groups are for homogeneous sets of files that researchers naturally
+ understand as one output family: for example, 100 PNG plots of the same variable,
+ one per acquisition day. A group stores the shared title, summary, kind, optional
+ schema fields, and the concrete member paths. It can also store an informational
+ `selector`, such as `plots/*.png`, to show how the group was chosen.
+
+ Rules of thumb:
+
+ - Use artifact groups when many files have the same meaning.
+ - Use individual artifacts for file-specific notes, exceptions, or overrides.
+ - It is OK for an individual artifact to also appear in a group.
+ - Do not rely on groups for integrity. `subject.produced_files[]` remains the
+ complete checksum inventory.
+
+ ## Nested Directory Policy
+
+ Annotate the smallest thing you would share as a unit. If a directory is one
+ research output, give that directory one `data-annotations.json`, even when its
+ tracked files live in nested subdirectories.
+
+ Use recursive directory annotations for one bundle with nested files:
+
+ ```bash
+ data-annotations annotate directory path/to/run-001 --recursive
+ data-annotations annotate directory path/to/run-001 --max-depth 2
+ ```
+
+ Use child bundle annotations when a subdirectory is independently meaningful,
+ shareable, or reusable. In that case, annotate the child directory first, then
+ annotate the parent. The parent records a compact `child_bundles[]` reference
+ with the child path, child annotation path, and child content digest; it does not
+ copy the child file inventory into the parent JSON.
+
+ Post-hoc directory discovery follows the same rule. `--recursive` discovers
+ nested files, but it stops at annotated child directories containing
+ `data-annotations.json` and records them as child bundles.
 
  ## Provenance Decorators And Writers
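
Reviewer note on the hunk above: the README says `subject.content_digest` is computed from sorted tracked file paths, file checksums, and referenced child bundle digests. A minimal sketch of a digest with those inputs follows. The hash algorithm (SHA-256), the byte layout, and the function name `sketch_content_digest` are all assumptions; the package's `directory_content_digest(...)` may serialize its inputs differently and produce different values.

```python
import hashlib
from typing import Mapping


def sketch_content_digest(
    file_checksums: Mapping[str, str],
    child_digests: Mapping[str, str],
) -> str:
    """Hash sorted (relative path, checksum) pairs and child bundle digests.

    Illustrative only: the real on-disk digest format is not shown in the diff.
    """
    hasher = hashlib.sha256()
    # Sorting makes the digest independent of discovery order.
    for rel_path in sorted(file_checksums):
        hasher.update(f"file\0{rel_path}\0{file_checksums[rel_path]}\n".encode("utf-8"))
    # Child bundles contribute only their own digest, not their file inventory.
    for child_path in sorted(child_digests):
        hasher.update(f"child\0{child_path}\0{child_digests[child_path]}\n".encode("utf-8"))
    return hasher.hexdigest()


digest_a = sketch_content_digest(
    {"plots/a.png": "aaa", "metrics.csv": "bbb"}, {"child": "ccc"}
)
digest_b = sketch_content_digest(
    {"metrics.csv": "bbb", "plots/a.png": "aaa"}, {"child": "ccc"}
)
digest_changed = sketch_content_digest(
    {"metrics.csv": "bbb", "plots/a.png": "zzz"}, {"child": "ccc"}
)
print(digest_a == digest_b, digest_a == digest_changed)
```

Because only relative paths and checksums feed the hash, copying the whole directory elsewhere leaves the digest unchanged, which is the stability property the README describes.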
@@ -412,7 +487,9 @@ write_report(
 
  Use `record_directory_manifest(...)` for directory outputs. Directory decorators
  accept `DocumentedArtifact`, `ProducedFile`, `(path, kind)`, and plain path-like
- return values.
+ return values. Provenance-only APIs do not accept description groups; use
+ unified annotation or description APIs when groups should appear in the JSON or
+ README.
 
  If you want the direct writer approach instead, use `write_file_manifest(...)` and
  `write_directory_manifest(...)` (see `examples/`).
@@ -428,7 +505,9 @@ Key public description models:
  - `AllowedValue`
  - `FieldDefinition`
  - `DocumentedArtifact`
+ - `DocumentedArtifactGroup`
  - `ArtifactDescription`
+ - `ArtifactGroupDescription`
  - `FileDescription`
  - `DirectoryDescription`
 
@@ -461,7 +540,7 @@ from data_annotations.provenance import (
      checkout_manifest_source,
  )
 
- annotation_path = Path("outputs/participants.csv.meta.json")
+ annotation_path = Path("outputs/participants.csv.annotation.json")
  artifact_path = Path("downloads/participants.csv")
 
  if artifact_matches_manifest(artifact_path, annotation_path):
@@ -483,8 +562,8 @@ still attach provenance and description after the fact.
  Post-hoc descriptions can still be very useful, but the quality of post-hoc
  provenance depends on how exact the supplied answers are. In particular, fields
  such as the generating script, command, function, Git commit, repository path,
- inputs, and parameters are only as reliable as the information entered during
- annotation.
+ Git tags, `git describe` output, inputs, and parameters are only as reliable as
+ the information entered during annotation.
 
  ## CLI Workflow
 
@@ -496,12 +575,29 @@ For post-hoc annotation:
  ```bash
  data-annotations annotate file path/to/participants.csv
  data-annotations annotate directory path/to/run-001
+ data-annotations annotate directory path/to/run-001 --recursive
+ data-annotations annotate directory path/to/run-001 --max-depth 2
+ data-annotations annotate directory path/to/run-001 \
+ --recursive \
+ --group-selector "plots/*.png" \
+ --group-title "Daily SMA plots" \
+ --group-summary "Plots of the same variable on different days." \
+ --group-kind plot
  ```
 
- These commands prompt for missing details, write `*.meta.json` or `manifest.json`,
+ These commands prompt for missing details, write `*.annotation.json` or `data-annotations.json`,
  and optionally derive README sidecars. Post-hoc records are marked with
  `capture_mode="post_hoc"`.
 
+ When group selectors are provided, the CLI expands them to concrete member paths
+ at annotation time. Grouped files are tracked in `subject.produced_files[]` but
+ are skipped by the per-file prompt flow, so you do not have to answer the same
+ questions for every matching file.
+
+ For post-hoc provenance, use repeatable `--git-tag` and optional
+ `--git-describe` when you know the original code state. These values are stored
+ as human-readable hints; `--git-sha` remains the field used for recovery.
+
  For provenance inspection and source recovery:
 
  ```bash
@@ -509,7 +605,7 @@ data-annotations provenance match path/to/artifact
  data-annotations provenance checkout path/to/artifact
  ```
 
- Command `match` auto-discovers `*.meta.json` for files and `manifest.json` for
+ Command `match` auto-discovers `*.annotation.json` for files and `data-annotations.json` for
  directories, prints a verification summary, and suggests the exact `checkout`
  command to run next when Git recovery metadata is available.
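
Reviewer note on the CLI hunk above: the README states that group selectors are expanded to concrete member paths at annotation time and that grouped files skip the per-file prompts. A rough sketch of that behaviour using `pathlib` globbing is shown below. Both function names are hypothetical, and the glob dialect the CLI actually supports is an assumption.

```python
import tempfile
from pathlib import Path
from typing import List


def expand_group_selector(root: Path, selector: str) -> List[str]:
    """Expand a glob selector to sorted POSIX paths relative to `root`."""
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.glob(selector)
        if path.is_file()
    )


def files_needing_prompts(all_files: List[str], grouped: List[str]) -> List[str]:
    """Grouped files stay in the inventory but are not prompted for one by one."""
    grouped_set = set(grouped)
    return [name for name in all_files if name not in grouped_set]


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "plots").mkdir()
    for name in ("plots/sma_1.png", "plots/sma_2.png", "report.txt"):
        (run_dir / name).write_bytes(b"placeholder")

    members = expand_group_selector(run_dir, "plots/*.png")
    inventory = ["plots/sma_1.png", "plots/sma_2.png", "report.txt"]
    to_prompt = files_needing_prompts(inventory, members)

print(members)
print(to_prompt)
```

Storing the expanded member list rather than only the selector means a later reader does not need to re-run the glob against a directory that may have changed.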
@@ -562,6 +658,17 @@ uv run data-annotations provenance checkout path/to/participants.csv
  - `annotate_file(...)`
  - `annotate_directory(...)`
 
+ ### Description Models
+
+ - `AllowedValue`
+ - `FieldDefinition`
+ - `DocumentedArtifact`
+ - `DocumentedArtifactGroup`
+ - `ArtifactDescription`
+ - `ArtifactGroupDescription`
+ - `FileDescription`
+ - `DirectoryDescription`
+
  ### Description Functions
 
  - `record_file_description(...)`
@@ -576,6 +683,7 @@ uv run data-annotations provenance checkout path/to/participants.csv
  ### Provenance Models
 
  - `ProducedFile`
+ - `ChildBundle`
  - `BaseProvenance`
  - `FileManifest`
  - `DirectoryManifest`
@@ -587,6 +695,7 @@ uv run data-annotations provenance checkout path/to/participants.csv
  - `record_directory_manifest(...)`
  - `write_file_manifest(...)`
  - `write_directory_manifest(...)`
+ - `directory_content_digest(...)`
  - `artifact_matches_manifest(...)`
  - `checkout_manifest_source(...)`
 
--- data_annotations-2.1.2/README.md
+++ data_annotations-2.2.0/README.md
@@ -1,6 +1,6 @@
  # data-annotations
 
- A small Python package for attaching provenance and structured descriptions to the
+ A Python package for attaching provenance and structured descriptions to the
  files and directories your workflows produce.
 
  It is designed for lightweight research and reproducibility pipelines where you want
@@ -8,11 +8,11 @@ generated datasets, tables, plots, or reports to carry enough context to explain
  where they came from and what they contain.
 
  The package captures common provenance automatically and writes plain JSON and
- Markdown artifacts that are easy to inspect or archive. The canonical on-disk format
- is now a single annotation document:
+ Markdown artifacts that are easy to inspect or archive. The canonical on-disk
+ format uses one JSON annotation document per artifact:
 
- - Files use `artifact.ext.meta.json`
- - Directories use `manifest.json`
+ - Files use `artifact.ext.annotation.json`
+ - Directories carry `data-annotations.json` at their root
 
  Each annotation document stores four top-level sections:
 
@@ -21,6 +21,10 @@ Each annotation document stores four top-level sections:
  - `provenance`
  - `description`
 
+ Here's the mental model: files get a visible sibling annotation, and
+ directories carry one visible annotation at their root. Treat the annotation as
+ part of the research output bundle.
+
  See the [changelog](CHANGELOG.md) for release history and upgrade-oriented notes.
 
  ## Installation
@@ -66,12 +70,15 @@ Every annotation document includes provenance with:
  - Hostname and username
  - The script path and command-line arguments
  - The script path relative to the Git repo root when it can be determined
- - Git commit, branch, dirty state, and canonical repository remote when available
+ - Git commit, branch, dirty state, canonical repository remote, exact tags, and
+ `git describe` output when available
  - The current `SLURM_JOB_ID` when available
 
  You can also attach your own parameters, input file paths, and function names.
  Local filesystem paths in provenance are stored as absolute paths. URI-style inputs
  such as `s3://...` or `https://...` are preserved as provided.
+ Git tags and `git_describe` are human-friendly hints only; `git_sha` remains the
+ source of truth for reproducibility, matching, and source checkout.
 
  ## Quick Start
 
@@ -82,7 +89,7 @@ provenance and emit sidecars automatically.
 
  For example, here is a complete file-level annotation workflow using the
  `record_file_annotation(...)` decorator. Once `write_participants` is called, it
- automatically generates sidecars `participants.csv.meta.json` and `participants.csv.README.md`.
+ automatically generates sidecars `participants.csv.annotation.json` and `participants.csv.README.md`.
  The JSON sidecar will contain provenance and description metadata, and the Markdown sidecar
  will have a human-friendly rendering of the description provided in the decorator.
 
@@ -153,7 +160,7 @@ write_participants(
      split="validation",
  )
 
- print(f"{artifact_path}.meta.json")
+ print(f"{artifact_path}.annotation.json")
  print(f"{artifact_path}.README.md")
  ```
 
@@ -206,7 +213,12 @@ Accepted directory return items are:
 
  - `DocumentedArtifact` when you want per-artifact title, summary, fields,
  keys, or missing-value metadata.
+ - `DocumentedArtifactGroup` for `record_directory_annotation(...)` and
+ `record_directory_description(...)` when many files share one title, summary,
+ kind, and optional schema metadata.
  - `ProducedFile` when you only need path, kind, and optional precomputed hash.
+ - `ChildBundle` when an annotated child directory should be referenced as its
+ own independently shareable bundle.
  - `(path, kind)` tuples when path and artifact kind are enough.
  - plain path-like values when the artifact kind can default to `"other"`.
 
@@ -220,7 +232,11 @@ Here is another decorator pattern example with `record_directory_annotation(...)
  from pathlib import Path
 
  from data_annotations.annotations import record_directory_annotation
- from data_annotations.description import DocumentedArtifact, FieldDefinition
+ from data_annotations.description import (
+     DocumentedArtifact,
+     DocumentedArtifactGroup,
+     FieldDefinition,
+ )
  from data_annotations.provenance import ProducedFile
 
  @record_directory_annotation(
@@ -265,13 +281,16 @@ def build_outputs(
          encoding="utf-8",
      )
 
-     plot_path = output_dir / "roc.png"
-     plot_path.write_bytes(
-         (
-             f"plot placeholder derived from {input_path.name} "
-             f"({len(participant_ids)} participants)\n"
-         ).encode("utf-8")
-     )
+     plot_paths = []
+     for day in ["2024-01-01", "2024-01-02", "2024-01-03"]:
+         plot_path = output_dir / f"sma_{day}.png"
+         plot_path.write_bytes(
+             (
+                 f"plot placeholder for the SMA variable on {day}, "
+                 f"derived from {input_path.name}\n"
+             ).encode("utf-8")
+         )
+         plot_paths.append(plot_path)
 
      return [
          DocumentedArtifact(
@@ -292,7 +311,13 @@ def build_outputs(
              ],
          ),
          ProducedFile(path=str(report_path), kind="report"),
-         (plot_path, "plot"),
+         DocumentedArtifactGroup(
+             title="Daily SMA plots",
+             summary="Plots of the same variable on different days.",
+             kind="plot",
+             paths=[str(path) for path in plot_paths],
+             selector="sma_*.png",
+         ),
      ]
 
 
@@ -303,7 +328,7 @@ build_outputs(
      split="validation",
  )
 
- print(output_dir / "manifest.json")
+ print(output_dir / "data-annotations.json")
  print(output_dir / "README.md")
  ```
 
@@ -339,16 +364,66 @@ Directory annotations store:
 
  - `subject.path`
  - `subject.produced_files[]`
+ - `subject.child_bundles[]`
+ - `subject.content_digest`
  - `provenance.*`
  - `description.title`
  - `description.summary`
+ - `description.artifact_groups[]`
  - `description.artifacts[]`
  - `description.acquisition_context`
  - `description.generation_context`
  - `description.description_updated_at`
 
- The `description` section intentionally excludes provenance linkage fields and
- file kinds for directory artifacts. Kinds live in `subject.produced_files`.
+ Use `description.artifact_groups[]` when many files have the same meaning, and
+ use `description.artifacts[]` only for file-specific notes, overrides, or schema.
+ Groups are descriptive only. Integrity still lives in `subject.produced_files[]`,
+ which tracks every concrete file by path, kind, and checksum.
+
+ The `description` section intentionally excludes provenance linkage fields.
+ Directory `produced_files[].path` values are stored relative to `subject.path`,
+ which keeps verification stable when a complete output directory is copied or
+ archived elsewhere. `subject.content_digest` is computed from sorted tracked file
+ paths, file checksums, and referenced child bundle digests.
+
+ ## Artifact Groups
+
+ Artifact groups are for homogeneous sets of files that researchers naturally
+ understand as one output family: for example, 100 PNG plots of the same variable,
+ one per acquisition day. A group stores the shared title, summary, kind, optional
+ schema fields, and the concrete member paths. It can also store an informational
+ `selector`, such as `plots/*.png`, to show how the group was chosen.
+
+ Rules of thumb:
+
+ - Use artifact groups when many files have the same meaning.
+ - Use individual artifacts for file-specific notes, exceptions, or overrides.
+ - It is OK for an individual artifact to also appear in a group.
+ - Do not rely on groups for integrity. `subject.produced_files[]` remains the
+ complete checksum inventory.
+
+ ## Nested Directory Policy
+
+ Annotate the smallest thing you would share as a unit. If a directory is one
+ research output, give that directory one `data-annotations.json`, even when its
+ tracked files live in nested subdirectories.
+
+ Use recursive directory annotations for one bundle with nested files:
+
+ ```bash
+ data-annotations annotate directory path/to/run-001 --recursive
+ data-annotations annotate directory path/to/run-001 --max-depth 2
+ ```
+
+ Use child bundle annotations when a subdirectory is independently meaningful,
+ shareable, or reusable. In that case, annotate the child directory first, then
+ annotate the parent. The parent records a compact `child_bundles[]` reference
+ with the child path, child annotation path, and child content digest; it does not
+ copy the child file inventory into the parent JSON.
+
+ Post-hoc directory discovery follows the same rule. `--recursive` discovers
+ nested files, but it stops at annotated child directories containing
+ `data-annotations.json` and records them as child bundles.
 
  ## Provenance Decorators And Writers
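
Reviewer note on the Nested Directory Policy added above: the rule that recursive discovery stops at annotated child directories can be sketched with `os.walk` pruning. This is an illustration under assumptions, not the package's discovery code: `discover_bundle` is a hypothetical name, and whether the real implementation also skips README sidecars or honours `--max-depth` in this way is not shown in the diff.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Tuple

ANNOTATION_NAME = "data-annotations.json"  # directory annotation name from the README


def discover_bundle(root: Path) -> Tuple[List[str], List[str]]:
    """Return (tracked files, child bundles) as POSIX paths relative to `root`."""
    files: List[str] = []
    child_bundles: List[str] = []
    for current, dirnames, filenames in os.walk(root):
        here = Path(current)
        kept = []
        for name in sorted(dirnames):
            if (here / name / ANNOTATION_NAME).is_file():
                # Annotated subdirectory: record a reference, do not descend.
                child_bundles.append((here / name).relative_to(root).as_posix())
            else:
                kept.append(name)
        dirnames[:] = kept  # in-place edit prunes the walk
        for name in filenames:
            rel = (here / name).relative_to(root).as_posix()
            if rel != ANNOTATION_NAME:  # skip the parent's own annotation
                files.append(rel)
    return sorted(files), sorted(child_bundles)


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "plots").mkdir()
    (run_dir / "child").mkdir()
    (run_dir / "metrics.csv").write_text("a,b\n", encoding="utf-8")
    (run_dir / "plots" / "a.png").write_bytes(b"placeholder")
    (run_dir / "child" / ANNOTATION_NAME).write_text("{}", encoding="utf-8")
    (run_dir / "child" / "inner.csv").write_text("x\n", encoding="utf-8")

    tracked, children = discover_bundle(run_dir)

print(tracked)
print(children)
```

The child's `inner.csv` never reaches the parent inventory, which matches the statement that the parent does not copy the child file inventory into its own JSON.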
@@ -383,7 +458,9 @@ write_report(
 
  Use `record_directory_manifest(...)` for directory outputs. Directory decorators
  accept `DocumentedArtifact`, `ProducedFile`, `(path, kind)`, and plain path-like
- return values.
+ return values. Provenance-only APIs do not accept description groups; use
+ unified annotation or description APIs when groups should appear in the JSON or
+ README.
 
  If you want the direct writer approach instead, use `write_file_manifest(...)` and
  `write_directory_manifest(...)` (see `examples/`).
@@ -399,7 +476,9 @@ Key public description models:
  - `AllowedValue`
  - `FieldDefinition`
  - `DocumentedArtifact`
+ - `DocumentedArtifactGroup`
  - `ArtifactDescription`
+ - `ArtifactGroupDescription`
  - `FileDescription`
  - `DirectoryDescription`
 
@@ -432,7 +511,7 @@ from data_annotations.provenance import (
      checkout_manifest_source,
  )
 
- annotation_path = Path("outputs/participants.csv.meta.json")
+ annotation_path = Path("outputs/participants.csv.annotation.json")
  artifact_path = Path("downloads/participants.csv")
 
  if artifact_matches_manifest(artifact_path, annotation_path):
@@ -454,8 +533,8 @@ still attach provenance and description after the fact.
  Post-hoc descriptions can still be very useful, but the quality of post-hoc
  provenance depends on how exact the supplied answers are. In particular, fields
  such as the generating script, command, function, Git commit, repository path,
- inputs, and parameters are only as reliable as the information entered during
- annotation.
+ Git tags, `git describe` output, inputs, and parameters are only as reliable as
+ the information entered during annotation.
 
  ## CLI Workflow
 
@@ -467,12 +546,29 @@ For post-hoc annotation:
  ```bash
  data-annotations annotate file path/to/participants.csv
  data-annotations annotate directory path/to/run-001
+ data-annotations annotate directory path/to/run-001 --recursive
+ data-annotations annotate directory path/to/run-001 --max-depth 2
+ data-annotations annotate directory path/to/run-001 \
+ --recursive \
+ --group-selector "plots/*.png" \
+ --group-title "Daily SMA plots" \
+ --group-summary "Plots of the same variable on different days." \
+ --group-kind plot
  ```
 
- These commands prompt for missing details, write `*.meta.json` or `manifest.json`,
+ These commands prompt for missing details, write `*.annotation.json` or `data-annotations.json`,
  and optionally derive README sidecars. Post-hoc records are marked with
  `capture_mode="post_hoc"`.
 
+ When group selectors are provided, the CLI expands them to concrete member paths
+ at annotation time. Grouped files are tracked in `subject.produced_files[]` but
+ are skipped by the per-file prompt flow, so you do not have to answer the same
+ questions for every matching file.
+
+ For post-hoc provenance, use repeatable `--git-tag` and optional
+ `--git-describe` when you know the original code state. These values are stored
+ as human-readable hints; `--git-sha` remains the field used for recovery.
+
  For provenance inspection and source recovery:
 
  ```bash
@@ -480,7 +576,7 @@ data-annotations provenance match path/to/artifact
  data-annotations provenance checkout path/to/artifact
  ```
 
- Command `match` auto-discovers `*.meta.json` for files and `manifest.json` for
+ Command `match` auto-discovers `*.annotation.json` for files and `data-annotations.json` for
  directories, prints a verification summary, and suggests the exact `checkout`
  command to run next when Git recovery metadata is available.
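
Reviewer note on the `match` description above: because directory `produced_files[].path` values are stored relative to `subject.path`, a copied output directory can in principle be verified in its new location. The sketch below illustrates that idea only. `mismatched_files` is a hypothetical helper, and the `sha256` key and the hash algorithm are assumptions; the diff does not show the real checksum field name or what `data-annotations provenance match` actually checks.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path
from typing import List

ANNOTATION_NAME = "data-annotations.json"  # directory annotation name from the README


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mismatched_files(directory: Path) -> List[str]:
    """Return relative paths whose content no longer matches the annotation."""
    doc = json.loads((directory / ANNOTATION_NAME).read_text(encoding="utf-8"))
    bad = []
    for entry in doc["subject"]["produced_files"]:
        # Paths are resolved against the directory being checked, not the
        # absolute location recorded when the annotation was written.
        target = directory / entry["path"]
        if not target.is_file() or sha256_of(target) != entry["sha256"]:
            bad.append(entry["path"])
    return bad


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "run-001"
    (original / "plots").mkdir(parents=True)
    (original / "plots" / "a.png").write_bytes(b"placeholder")
    doc = {
        "subject": {
            "path": str(original),
            "produced_files": [
                {
                    "path": "plots/a.png",
                    "kind": "plot",
                    "sha256": sha256_of(original / "plots" / "a.png"),  # assumed key
                }
            ],
        }
    }
    (original / ANNOTATION_NAME).write_text(json.dumps(doc), encoding="utf-8")

    copied = Path(tmp) / "archive" / "run-001"
    shutil.copytree(original, copied)
    after_copy = mismatched_files(copied)

    (copied / "plots" / "a.png").write_bytes(b"tampered")
    after_tamper = mismatched_files(copied)

print(after_copy, after_tamper)
```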
@@ -533,6 +629,17 @@ uv run data-annotations provenance checkout path/to/participants.csv
  - `annotate_file(...)`
  - `annotate_directory(...)`
 
+ ### Description Models
+
+ - `AllowedValue`
+ - `FieldDefinition`
+ - `DocumentedArtifact`
+ - `DocumentedArtifactGroup`
+ - `ArtifactDescription`
+ - `ArtifactGroupDescription`
+ - `FileDescription`
+ - `DirectoryDescription`
+
  ### Description Functions
 
  - `record_file_description(...)`
@@ -547,6 +654,7 @@ uv run data-annotations provenance checkout path/to/participants.csv
  ### Provenance Models
 
  - `ProducedFile`
+ - `ChildBundle`
  - `BaseProvenance`
  - `FileManifest`
  - `DirectoryManifest`
@@ -558,6 +666,7 @@ uv run data-annotations provenance checkout path/to/participants.csv
  - `record_directory_manifest(...)`
  - `write_file_manifest(...)`
  - `write_directory_manifest(...)`
+ - `directory_content_digest(...)`
  - `artifact_matches_manifest(...)`
  - `checkout_manifest_source(...)`
 
--- data_annotations-2.1.2/pyproject.toml
+++ data_annotations-2.2.0/pyproject.toml
@@ -1,6 +1,6 @@
  [project]
  name = "data-annotations"
- version = "2.1.2"
+ version = "2.2.0"
  description = "Annotate generated data artifacts"
  readme = "README.md"
  authors = [