PyPI - fairscape-wizard - Versions diffs - 0.2.0__tar.gz - Mend

fairscape-wizard 0.2.0__tar.gz


Files changed (76)

fairscape_wizard-0.2.0/.claude/skills/agentic-rescore/SKILL.md ADDED

@@ -0,0 +1,158 @@
+---
+name: agentic-rescore
+description: Phase 4 of the remote-source wizard. Score the 28 AI-Ready rubrics agentically — dump deterministic evidence via `python -m fairscape_wizard.rubric_eval extract-evidence`, then have Claude (this skill) read each rubric YAML + evidence.json and emit a RubricScore JSON per rubric. Never invokes grade.py.
+---
+# Agentic rescore — Phase 4
+The grader (`rubrics/ai-ready/grade.py`) normally drives a separate LLM via `pydantic-ai`. Here, **Claude is the grader** — we reuse the deterministic evidence extractors from `extract.py` as a library, then score each rubric inline. Output matches `grade.py`'s file layout exactly so downstream tooling can consume either.
+## What to tell the user before any commands run
+Before invoking the evidence dump or asking about subset selection, give them one paragraph of context so the rest of the phase isn't opaque:
+> *"This is the **AI-Ready scoring** phase. The 28 rubrics live in `rubrics/ai-ready/<id>-<slug>.yaml` and cover seven criteria: FAIRness (`0.x`), Provenance (`1.x`), Characterization (`2.x`), Pre-model Explainability (`3.x`), Ethics (`4.x`), Sustainability (`5.x`), Computability (`6.x`). Each rubric has three possible scores: 0 (Absent), 1 (Partial), or 2 (Substantive), with rules that say literally what evidence justifies each level. Max total is 56 (2 × 28).*
+>
+> *The scoring is two-step. First a deterministic Python pass (`python -m fairscape_wizard.rubric_eval extract-evidence`) walks the crate and dumps the relevant facts per rubric — identifiers, license, schemas, format coverage, etc. — into a `grading/` folder. No LLM involved; just structured reading. Then I fan the rubrics out to parallel subagents — one per rubric, all dispatched in a single message — and each subagent sees **only** its rubric YAML and its evidence JSON, nothing else. It writes its `score.json` and returns. After all rubrics are scored, a small Python aggregator computes the total and a per-criterion breakdown.*
+>
+> *The isolation matters for reproducibility: the score for any rubric is determined by the fixed prompt + that rubric + its evidence, not by anything I've seen earlier in this conversation (the paper, prior decisions, your phrasing). Anyone can re-run the same subagent prompt against the same `evidence.json` and reproduce the verdict. The evidence dump is itself reproducible and inspectable — `grading/<id>/evidence.json` is a self-contained audit trail. Each score comes with a written rationale and a `gaps` list that tells you what would raise it."*
+## Preconditions
+- `.fairscape-remote-state.json` exists and `state.crate_path` points at a valid `ro-crate-metadata.json`.
+- Earlier phases don't have to be fully done — grading runs against whatever state the crate is in. But warn the user if `phase` is still `imported` ("you can grade now, but the score will be lower without phase 3").
+## 0. Ask: full sweep, or one criterion?
+Twenty-eight rubrics is many conversation turns. Offer up front:
+- **"All 28"** — full sweep.
+- **"One criterion (0–6)"** — narrow to a single criterion. The user picks one of:
+  - 0 FAIRness, 1 Provenance, 2 Characterization, 3 Pre-model Explainability,
+  - 4 Ethics, 5 Sustainability, 6 Computability.
+If the user picks a subset, dispatch only the matching rubrics. The aggregated score still computes correctly because `_aggregate` only sees the rubrics that have a `score.json`.
+## 1. Dump evidence (deterministic, one shot)
+```
+Bash python -m fairscape_wizard.rubric_eval extract-evidence "<state.crate_path>" "<state.crate_dir>/grading/"
+```
+This writes:
+- `<crate_dir>/grading/summary.json` — `root_summary`, `stats`, list of rubric ids.
+- `<crate_dir>/grading/<id>-<slug>/rubric.yaml` — copied from `rubrics/ai-ready/`.
+- `<crate_dir>/grading/<id>-<slug>/evidence.json` — extractor output.
+Tell the user one line: `"dumped evidence for 28 rubrics → <grading>/"`. Don't read the dump into context — read per rubric in the next step.
+## 2. Score each rubric (parallel subagents, one fan-out)
+**Do not score rubrics in the main conversation.** Dispatch one `Agent(subagent_type=general-purpose)` per rubric, all in a single message so they run in parallel. This is non-negotiable for two reasons:
+1. **Reproducibility.** Each subagent runs in a fresh context with only the fixed prompt + the rubric YAML + the evidence JSON, so the scoring inputs are identical on every run. Anyone with the same `evidence.json` and the prompt template below can reproduce the verdict.
+2. **No context contamination.** The main conversation has read the paper, heard the user's framing, made earlier decisions. None of that may bias the score. Subagents see *only* the two file paths — they cannot read the crate, the PDF, the state file, or prior turns.
+### Which rubrics to dispatch
+The full sweep is all 28 rubric folders under `<grading>/`. If the user picked a single criterion in step 0, filter to that prefix (`0.*`, `1.*`, etc.).
+Before dispatching, skip any rubric whose `score.json` already exists — that's how resume works.
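The selection rule above (criterion prefix plus skip-if-scored) can be sketched in a few lines. This is a minimal illustration, not part of `rubric_eval`; the function name `pending_rubrics` is hypothetical, and it assumes each rubric folder is named `<id>-<slug>` with ids of the form `0.a`.

```python
from pathlib import Path

def pending_rubrics(grading_dir, criterion=None):
    """List rubric folders that still need a score.json.

    criterion: optional single digit "0".."6"; None means the full sweep.
    """
    pending = []
    for folder in sorted(Path(grading_dir).iterdir()):
        if not folder.is_dir():
            continue  # skip summary.json and other top-level files
        if criterion is not None and not folder.name.startswith(f"{criterion}."):
            continue  # outside the criterion the user picked
        if (folder / "score.json").exists():
            continue  # already scored; this is how resume works
        pending.append(folder.name)
    return pending
```

Calling `pending_rubrics(grading_dir, "0")` would return only the unscored FAIRness folders, which is the list to dispatch.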
+### The subagent prompt (use verbatim, fill only the three paths)
+This prompt is part of the reproducibility contract. Do not customize it per rubric. Do not add crate facts, summaries, or commentary. The only variables are the three absolute paths.
+```
+You are scoring one AI-Ready rubric for an RO-Crate. Read only the two files at the given paths and write one output file. Do not read or fetch anything else.
+RUBRIC_YAML: <abs path to grading/<id>-<slug>/rubric.yaml>
+EVIDENCE_JSON: <abs path to grading/<id>-<slug>/evidence.json>
+OUTPUT: <abs path to grading/<id>-<slug>/score.json>
+Procedure:
+1. Read RUBRIC_YAML. Its `scoring` block lists three rules — one for score 0, one for 1, one for 2. Restate each rule to yourself literally before deciding.
+2. Read EVIDENCE_JSON. Treat it as the complete factual basis. Do not assume any field that is not present. Do not invent @ids or strings.
+3. Pick the single rule whose conditions match the evidence — no averaging, no halves. If the rule for score 0 says "Absent" and the required field is missing, choose 0.
+4. Compose a JSON object exactly matching the rubric's `output_schema`:
+   {
+     "score": 0 | 1 | 2,
+     "rationale": "1-3 sentences. Cite the rule that applied and the specific evidence fields that decided it.",
+     "evidence": ["...direct @id refs or short string fragments that appear verbatim in EVIDENCE_JSON..."],
+     "gaps": ["...specific missing things that would raise the score; empty list if score is 2..."]
+   }
+5. Write that JSON to OUTPUT (pretty-printed, trailing newline).
+6. Reply with one line: "<rubric id> → <score>".
+Tone: neutral and audit-friendly. Cite verbatim ("identifier: doi:10.18130/V3/KCBTMS — both rule-2 conditions are met"). Avoid value judgments ("great", "weakly characterized") and avoid the word "unvalidated" — peer-reviewed work is not unvalidated.
+Pitfalls:
+- Treating a missing field as score 1 when the rule says "Absent". Read the 0 rule literally.
+- Citing the rubric YAML in `evidence` instead of EVIDENCE_JSON. The `evidence` list is the crate's evidence, not the rubric's text.
+- Skipping `gaps` when score < 2. Always populate them — they are the actionable feedback.
+```
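A quick sanity check on each returned `score.json` can catch the pitfalls listed in the prompt before aggregation. The sketch below is an assumption-laden illustration: `check_score` is a hypothetical helper, it checks only the four keys shown in the prompt (the real `output_schema` lives in each rubric YAML and may be stricter), and the verbatim test is a plain substring match against the raw evidence text, so strings that JSON escapes differently would be flagged.

```python
def check_score(score, evidence_text):
    """Return a list of problems with a score object; empty means it passed.

    score: the parsed score.json dict.
    evidence_text: the raw text of evidence.json, used for verbatim checks.
    """
    problems = []
    if score.get("score") not in (0, 1, 2):
        problems.append("score must be 0, 1, or 2")
    rationale = score.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        problems.append("rationale must be a non-empty string")
    for item in score.get("evidence", []):
        # Evidence entries should appear verbatim in EVIDENCE_JSON.
        if not isinstance(item, str) or item not in evidence_text:
            problems.append(f"evidence not found verbatim: {item!r}")
    if score.get("score") in (0, 1) and not score.get("gaps"):
        problems.append("gaps must be populated when score < 2")
    return problems
```

A non-empty result is a reason to delete that `score.json` and re-dispatch the rubric with the same prompt and paths.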
+### After the fan-out returns
+1. `ls <grading>/*/score.json` to verify every dispatched rubric wrote its file. Re-dispatch any that are missing (same prompt, same paths).
+2. Print one consolidated status block to the user — the score lines the subagents returned, one per line.
+3. Update state once: set `state.grading.dir = "<crate_dir>/grading"`, set `state.grading.completed_rubrics` to the list of rubrics with `score.json` on disk, persist atomically.
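"Persist atomically" is not spelled out here. One common way to do it, shown as a sketch under that assumption, is to write a temp file next to the state file and rename it over the original. The function name `write_state_atomically` is hypothetical.

```python
import json
import os
import tempfile

def write_state_atomically(state, state_path):
    """Write state as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(state_path))
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
            f.write("\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, state_path)  # atomic rename over the old file
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

If the process dies mid-write, the old state file is still intact and only a stray `.tmp` file is left behind.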
+## 3. Aggregate
+After the fan-out (full or filtered) has written every `score.json`:
+```
+Bash python -m fairscape_wizard.rubric_eval aggregate "<state.crate_dir>/grading/"
+```
+This writes `<grading>/aggregated_score.json` matching `grade.py`'s shape: `total_score`, `max_score`, `percentage`, `counts`, and `criteria` grouped by `id[0]`.
+Report the rollup to the user — one paragraph:
+```
+Scored 28/28 rubrics. Total: 42 / 56 (75.0%).
+  FAIRness:                   6/8  (2 substantive, 2 partial)
+  Provenance:                 5/8
+  Characterization:           7/10
+  Pre-model Explainability:   3/6
+  Ethics:                     6/8
+  Sustainability:             8/8
+  Computability:              7/8
+Top gaps: ...   (pull 3 from the worst-scoring rubrics' `gaps`)
+Full per-rubric output: <crate_dir>/grading/
+```
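The rollup arithmetic is simple enough to check by hand. The sketch below is not the `rubric_eval aggregate` implementation; it only illustrates the described shape (`total_score`, `max_score`, `percentage`, `counts`, and `criteria` grouped by `id[0]`), and the inner `score`/`max` keys per criterion are assumptions.

```python
def aggregate(scores):
    """scores: dict mapping rubric id (e.g. "0.a") to its score (0, 1, or 2)."""
    labels = {0: "absent", 1: "partial", 2: "substantive"}
    counts = {"substantive": 0, "partial": 0, "absent": 0}
    criteria = {}
    for rubric_id, score in scores.items():
        counts[labels[score]] += 1
        # Group by the first character of the id, i.e. the criterion digit.
        group = criteria.setdefault(rubric_id[0], {"score": 0, "max": 0})
        group["score"] += score
        group["max"] += 2  # each rubric tops out at 2
    total = sum(scores.values())
    max_score = 2 * len(scores)
    percentage = round(100 * total / max_score, 1) if max_score else 0.0
    return {"total_score": total, "max_score": max_score,
            "percentage": percentage, "counts": counts, "criteria": criteria}
```

Because the maximum is derived from the number of scored rubrics, a single-criterion run reports against its own maximum instead of 56.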
+## 4. State write
+```json
+{
+  ...,
+  "grading": {
+    "dir": "<crate_dir>/grading",
+    "completed_rubrics": ["0.a", "0.b", ...],
+    "aggregated_score_path": "<crate_dir>/grading/aggregated_score.json",
+    "summary": {"total": 42, "max": 56, "percentage": 75.0,
+                "counts": {"substantive": 17, "partial": 8, "absent": 3, "error": 0}}
+  },
+  "phase": "graded",
+  "history": [..., {"ts": "...", "skill": "agentic-rescore",
+                    "summary": "scored 28 rubrics: 42/56 (75.0%)"}]
+}
+```
+If the user picked a subset, leave `phase` at `rai_done` (or whatever it was) and set `state.grading.completed_rubrics` to the subset only. Set `phase` to `graded` only when *all* 28 are done.
+## Resume behavior
+On invocation, check `state.grading.completed_rubrics` AND the on-disk presence of `<grading>/<id>-<slug>/score.json`. The disk is the source of truth — a `score.json` that exists counts as done. Only dispatch subagents for rubrics with no `score.json`. If the user wants to rescore one, delete its `score.json` (and remove its id from `completed_rubrics`) before re-invoking.
+## Don't
+- **Don't score rubrics in the main conversation.** Always fan out to subagents. The main context has read the paper, heard the user's framing, and made earlier decisions — using it to score taints the verdict and breaks reproducibility.
+- **Don't customize the subagent prompt per rubric.** The only variables are the three file paths. No crate facts, no summaries, no "for context, the dataset is…" preface — the subagent must score from rubric + evidence alone.
+- **Don't pass crate paths, the PDF, or state to subagents.** They get the rubric YAML path, the evidence JSON path, and the output path. Nothing else.
+- Don't shell out to `grade.py`. The whole point of this phase is to use Claude directly — `rubric_eval.py` is the only Python helper this skill calls.
+- Don't write a `score.json` that doesn't match the rubric YAML's `output_schema`. If `additionalProperties: false`, no extra keys.
+- Don't invent evidence. If `evidence.json` is sparse, score Absent or Partial with a `gaps` list explaining what's missing — that's the honest signal.
+- Don't run the aggregator before all selected rubrics have `score.json` files written.

fairscape_wizard-0.2.0/.claude/skills/build-local-manifest/SKILL.md ADDED

@@ -0,0 +1,178 @@
+---
+name: build-local-manifest
+description: Build a manifest (manifest.csv + crate.json) from a local project folder. Walks the folder's file inventory (produced by scan-project-folder), computes md5+sha256 for singleton files via streaming hash, skips hashing for bulk-group members (≥10 same-extension siblings in one directory) when over the project-total threshold, composes per-file descriptions, sets contentUrl to `file:///<relpath>`, and writes the sidecar with crate-level metadata collected via the form path. Output is the input to `fairscape-cli import manifest`. Does NOT call the importer.
+---
+# Build a manifest from a local folder
+You're the **local-source mirror** of `build-manifest`. Same output contract — `manifest.csv` + `crate.json` in a folder — but the files exist on disk, so hashes get computed locally instead of pulled from a published index.
+```
+<project_root>/
+  manifest.csv     # one row per file (name, description, contentUrl=file:///<relpath>, [md5, sha256, size_bytes, ...])
+  crate.json       # sidecar with title, authors, license, keywords, publication_date, [doi]
+```
+Caller (`remote-import` or the unified wizard) runs `fairscape-cli import manifest manifest.csv --output-dir <project_root>` next. **You do not.**
+The manifest format is the same as `wizards/manifest-import-design/DESIGN.md`. The HPRC test case at `wizards/manifest-import-design/hprc-subset/` is the canonical CSV/JSON shape — mirror it.
+## When you're invoked
+Expected caller is **`remote-import`** when the user picked the local-folder branch in the unified wizard (`source.kind = "local"`). You can also be invoked directly when a user has already collected crate-level metadata and just wants the manifest produced for a folder.
+## Inputs
+The orchestrator must supply (either via state or directly):
+- **`project_root`** — absolute path to the folder. This will also be the crate output directory.
+- **`state.scan`** — the output of `scan-project-folder` (file inventory grouped by category, with `group_key` annotation per file marking bulk groups). If state isn't already populated, run `scan-project-folder` first.
+- **`state.crate_metadata`** — the form output from `extract-crate-metadata` (form path): `name`, `description`, `authors[]`, `keywords[]`, `license`, `publication_date`, `doi` (optional).
+If any are missing, stop and ask the caller to populate them. Don't interview the user yourself — the form path lives in `extract-crate-metadata`, not here.
+## Procedure
+### 1. Resolve files in scope
+From `state.scan.files_by_category`, flatten every category (`scripts`, `data`, `docs`, `pdfs`, `other`) into a single list. Drop `state.scan.existing_crate` if set (the existing `ro-crate-metadata.json` mustn't appear in the new manifest).
+Annotate each file as either:
+- **`singleton`** — `group_key` is null/absent, OR the group has fewer than 10 members.
+- **`bulk`** — `group_key` is set and the group has ≥ 10 members.
+The threshold matches `scan-project-folder`. It's deliberately set at 10 (not 3 or 4) because small clusters of same-extension files in a directory are usually distinct work (four sibling Python scripts, three CSVs with different contents) rather than a templated bulk pattern; treating them as bulk would hide their individual identities in the crate.
+Also determine the **entity type** for each file (default: extension-based auto-detection, which matches what the manifest connector does):
+- **`software`** — files whose extension is one of: `.py .r .sh .bash .ipynb .jl .m .exe .java .cpp .js .jsx .ts .tsx .css`. These become `Software` entities in the crate via `GenerateSoftware` → `fairscape_models.software.Software`.
+- **`dataset`** — everything else. These become `Dataset` entities via `GenerateDataset` → `fairscape_models.dataset.Dataset`.
+The manifest CSV carries the type explicitly in a `type` column (see step 4) so the import is deterministic and the user can override edge cases (e.g. a `.py` file that's actually data).
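Both annotations in this step reduce to a group-size count and an extension lookup. A minimal sketch, assuming each scan entry is a dict with a `relpath` and an optional `group_key` (the dict layout and the `kind` key are illustrative, not the actual `state.scan` schema):

```python
from collections import Counter
from pathlib import PurePosixPath

SOFTWARE_EXTENSIONS = {
    ".py", ".r", ".sh", ".bash", ".ipynb", ".jl", ".m", ".exe",
    ".java", ".cpp", ".js", ".jsx", ".ts", ".tsx", ".css",
}
BULK_THRESHOLD = 10  # groups with fewer members are treated as singletons

def annotate(files):
    """files: list of dicts with "relpath" and an optional "group_key"."""
    group_sizes = Counter(f["group_key"] for f in files if f.get("group_key"))
    annotated = []
    for f in files:
        key = f.get("group_key")
        is_bulk = bool(key) and group_sizes[key] >= BULK_THRESHOLD
        # Lowercasing the suffix is an assumption; the extension list is lowercase.
        suffix = PurePosixPath(f["relpath"]).suffix.lower()
        annotated.append({
            **f,
            "kind": "bulk" if is_bulk else "singleton",
            "type": "software" if suffix in SOFTWARE_EXTENSIONS else "dataset",
        })
    return annotated
```

Three sibling `.py` files sharing a `group_key` stay singletons and come out as `software`; ten sibling `.tif` files become a bulk group of `dataset` rows.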
+### 2. Decide whether to hash everything
+Compute totals over the files-in-scope: `TOTAL_FILES = count`, `TOTAL_BYTES = sum(size_bytes)`.
+**Threshold rule:** if `TOTAL_FILES < 1000` AND `TOTAL_BYTES < 1 GiB` (1,073,741,824 bytes), **hash every file**, bulk groups included. Otherwise hash singletons only and leave bulk-group hashes blank.
+The reasoning: small projects can afford the I/O, and bulk-group rubric scoring (3.c Verifiable) benefits from complete coverage. Large projects (sequencing runs, microscopy archives) would block on the streaming hash for too long.
+Tell the user the decision before you start, framed by the actual numbers:
+- **Under threshold:** *"Hashing all `<N>` files (`<H>` total; under the 1000-files / 1 GiB threshold). Roughly `<T>` at ~100 MB/s."*
+- **Over threshold:** *"Hashing `<M>` distinct files (`<H>` total); skipping `<K>` bulk-group files because the project exceeds the 1000-files / 1 GiB threshold (e.g. `<example_group>` with `<k>` files). You can run `/hash-coverage` later if you want hashes on those too."*
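The threshold rule is a two-condition test where both limits are strict. As a small sketch (the function name `hash_everything` is hypothetical):

```python
GIB = 1024 ** 3  # 1,073,741,824 bytes

def hash_everything(sizes_in_bytes):
    """True when the project is small enough to hash bulk groups too."""
    total_files = len(sizes_in_bytes)
    total_bytes = sum(sizes_in_bytes)
    # Both limits are strict: exactly 1000 files or exactly 1 GiB is over threshold.
    return total_files < 1000 and total_bytes < GIB
```

Pass the `size_bytes` values of every in-scope file. A `False` result means singletons only, with bulk-group hashes left blank.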
+### 3. Hash the in-scope files
+For each file marked for hashing (everything when under threshold; singletons only when over), compute md5 + sha256 in **one pass via streaming hash**:
+```python
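+import hashlib
+
+def hash_file(path):
+    """Return (md5, sha256) hex digests, reading the file once in 1 MiB chunks."""
+    md5 = hashlib.md5()
+    sha256 = hashlib.sha256()
+    with open(path, "rb") as f:
+        while chunk := f.read(1024 * 1024):  # 1 MiB per read
+            # Feed the same chunk to both digests so the file is read only once.
+            md5.update(chunk)
+            sha256.update(chunk)
+    return md5.hexdigest(), sha256.hexdigest()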
+```
+Show a one-line progress update every 10 files or so (`"hashed 30/120..."`). Don't spam.
+If a file is unreadable (permission denied, broken symlink), don't crash. Leave both hashes blank for that row, add a note to the run log, keep going.
+### 4. Build the per-file rows
+For each file (singleton or bulk):
+| Column | Value |
+|---|---|
+| `name` | basename (e.g. `raw_counts.csv`). |
+| `type` | `software` or `dataset` per Step 1's detection. The manifest connector reads this column and routes accordingly; if blank it auto-detects from extension, so explicit values are not strictly required but make the manifest self-documenting. |
+| `description` | template: `"<basename> from <parent-relpath>"`. If the file's parent is the project root, use `"<basename> in project root"`. **Pad to a minimum of 10 characters** — the Dataset and Software models reject shorter descriptions. The user can edit later in the CSV before importing. |
+| `contentUrl` | `file:///<relpath>` where `<relpath>` is the path relative to `project_root` with forward slashes. |
+| `format` | the file's extension token (e.g. `csv`, `parquet`, `tif`, `py`). Bare extension, no leading dot. Don't invent MIME types. |
+| `md5` | from the hashing pass, or blank if hashing was skipped for this file. |
+| `sha256` | from the hashing pass, or blank if hashing was skipped for this file. |
+| `size_bytes` | from `state.scan` (or `os.stat()` if scan is stale). |
+| `datePublished` | leave blank — the connector falls back to the sidecar's `publication_date`. For software rows the connector automatically remaps this to `dateModified` on the way to `GenerateSoftware`. |
+| `version` | leave blank — connector defaults to `"1.0"`. |
+| `keywords` | leave blank unless the user supplied per-file keywords in state. |
+| `group` | the `group_key` from `state.scan` for bulk files; blank for singletons. Reserved for future schema inference; safe to write. |
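The `description` and `contentUrl` templates from the table can be sketched as two small helpers. The helper names are hypothetical, and the padding character is an assumption: the table only says to pad to 10 characters, not how.

```python
from pathlib import PurePosixPath

def describe(relpath, min_length=10):
    """Templated description for a path relative to the project root."""
    path = PurePosixPath(relpath)
    parent = path.parent.as_posix()
    if parent == ".":
        text = f"{path.name} in project root"
    else:
        text = f"{path.name} from {parent}"
    # Assumption: pad with trailing dots; the text does not say how to pad.
    return text.ljust(min_length, ".")

def content_url(relpath):
    """file:/// URL relative to the project root, with forward slashes."""
    return "file:///" + relpath.replace("\\", "/").lstrip("/")
```

Only very short names in short directories (such as `d/a`) ever need the padding; the project-root template is already longer than 10 characters.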
+### 5. Write the sidecar
+Compose `crate.json` from `state.crate_metadata`. Fields:
+```json
+{
+  "name": "<from form>",
+  "description": "<from form, augmented if hashes were skipped>",
+  "authors": ["<from form>"],
+  "license": "<from form, URL form preferred>",
+  "keywords": ["<from form>"],
+  "publication_date": "<from form, ISO date>",
+  "doi": "<from form, optional>",
+  "associated_publication": "<form's doi as URL, or omit>",
+  "repository_name": "Local project",
+  "project_id": "<slugged version of name>",
+  "version": "1.0"
+}
+```
+**If any files had hashes skipped** (the over-threshold case), append a note to `description`:
+> *"Note: hashes were skipped for `<M>` bulk-group files (e.g. `<example_group_key>`) because the project exceeds the 1000-files / 1 GiB hashing threshold; run `/hash-coverage` from inside the crate to fill them in."*
+This keeps the gap honest and pointer-rich.
+### 6. Write the files
+- `Write` to `<project_root>/manifest.csv` (CSV with header row, UTF-8 no BOM).
+- `Write` to `<project_root>/crate.json` (JSON pretty-printed, 2-space indent).
+### 7. Validate locally
+Don't shell out to the importer. Programmatic checks only:
+- Open `manifest.csv` with `csv.DictReader`; confirm required columns (`name`, `description`, `contentUrl`) are present.
+- Confirm row count matches the number of in-scope files from scan.
+- Confirm `crate.json` has `name`, `description`, `authors[]` populated.
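These three checks fit in one short function. A minimal sketch, with `validate` as a hypothetical name:

```python
import csv
import json

REQUIRED_COLUMNS = {"name", "description", "contentUrl"}

def validate(manifest_path, sidecar_path, expected_rows):
    """Return a list of problems; an empty list means both files passed."""
    problems = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        rows = list(reader)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    with open(sidecar_path, encoding="utf-8") as f:
        sidecar = json.load(f)
    for field in ("name", "description", "authors"):
        if not sidecar.get(field):  # catches missing keys, "" and []
            problems.append(f"crate.json field is empty: {field}")
    return problems
```

Pass the in-scope file count from the scan as `expected_rows`; any returned problem should be fixed before the state write.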
+### 8. State write
+Update `.fairscape-state.json` (the unified state file once Step 2 lands; until then `.fairscape-remote-state.json`):
+```json
+{
+  "phase": "manifest_built",
+  "source": {
+    "kind": "local",
+    "project_root": "<abs>"
+  },
+  "history": [
+    {"ts": "...", "skill": "build-local-manifest",
+     "summary": "wrote <N>-row manifest + sidecar at <project_root> (<M> hashed, <K> bulk skipped)"}
+  ]
+}
+```
+### 9. Report
+> *"Manifest ready at `<project_root>/manifest.csv`. `<N>` rows — `<D>` Datasets and `<S>` Software entries (detected by extension; you can override the `type` column before importing). Hashes set on `<M>` files; left blank on `<K>` files (`<reason: bulk-group skip / unreadable>`). The importer step is next — `fairscape-cli import manifest <project_root>/manifest.csv --output-dir <project_root>` — but that's the caller's job, not mine."*
+## What you must NOT do
+- **Don't call `fairscape-cli import manifest`.** That's the caller. Your output is the manifest + sidecar.
+- **Don't interview the user.** All metadata comes from `state.crate_metadata` (form path of `extract-crate-metadata`). If state is missing fields, stop and tell the caller.
+- **Don't hash bulk-group files when over the 1000-files / 1 GiB threshold.** The whole point of the threshold is to avoid 10,000-file hash storms on sequencing/imaging archives. Bulk = blank hashes + a documented gap. Under threshold, hash everything — the gap closes.
+- **Don't follow symlinks outside the project root.** Use `os.path.realpath` and skip anything that escapes.
+- **Don't compute hashes for files > 10 GB without warning the user first.** Streaming is correct but uninterruptible mid-file; if a huge file exists, surface it and let the user say keep / skip / cancel before starting.
+- **Don't put absolute paths in `contentUrl`.** Always relative-to-project-root: `file:///data/raw.csv`, never `file:///Users/justin/.../data/raw.csv`.
+- **Don't change file formats heuristically.** `csv.gz` is `csv-gz` or similar; `nii.gz` is `nii-gz`. Compound extensions stay compound. The importer's downstream consumers may care.
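For the symlink rule above, one way to detect an escape is to resolve both paths and compare their common prefix. A sketch, assuming POSIX-style paths (`os.path.commonpath` raises `ValueError` on Windows when the two paths sit on different drives); the function name `escapes_root` is hypothetical:

```python
import os

def escapes_root(path, project_root):
    """True if path resolves (through symlinks) to somewhere outside project_root."""
    root = os.path.realpath(project_root)
    target = os.path.realpath(path)
    # If the deepest shared directory is not the root itself, the target is outside.
    return os.path.commonpath([root, target]) != root
```

Files for which this returns `True` would be skipped and noted in the run log, not hashed or listed.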
+## Reference
+- `wizards/manifest-import-design/hprc-subset/manifest.csv` — canonical column layout. Same columns; different `contentUrl` scheme (`https://` vs `file:///`).
+- `wizards/manifest-import-design/hprc-subset/crate.json` — canonical sidecar shape.
+- `fairscape-cli/src/fairscape_cli/models/dataset.py` lines 56-63 — the `GenerateDataset` path. Will *also* auto-compute md5 from a `file:///` contentUrl during import if the manifest left it blank, but **we don't rely on that** — we put hashes in the manifest so they're durable and visible.

fairscape_wizard-0.2.0/.claude/skills/build-manifest/SKILL.md ADDED

@@ -0,0 +1,195 @@
+---
+name: build-manifest
+description: Build a generic-import manifest (manifest.csv + crate.json) for a published dataset that doesn't have a dedicated repository connector (i.e. is not Dataverse, PhysioNet, or Figshare). Researches the dataset's published file inventory, rewrites cloud URIs to anonymously-fetchable HTTPS, pulls real hashes/sizes where the source publishes them, leaves them blank where it doesn't, composes per-file descriptions, and writes a sidecar with crate-level metadata from the paper. Output is the input to `fairscape-cli import manifest`. Does NOT call the importer.
+---
+# Build a generic-import manifest
+You produce **two files in a fresh folder**:
+```
+<workdir>/<slug>/
+  manifest.csv     # one row per file (name, description, contentUrl, [md5, sha256, size_bytes, ...])
+  crate.json       # sidecar with title, authors, doi, license, publication date, keywords
+```
+That folder is the input to `fairscape-cli import manifest <manifest.csv> --output-dir <crate_dir>`. **Do not run the importer here** — your job ends when the two files are on disk and validated.
+The manifest format is documented in `wizards/manifest-import-design/DESIGN.md`. A working reference example lives at `wizards/manifest-import-design/hprc-subset/` (HPRC year-1 phased assemblies, 6 files). Read both before drafting.
+## When you're invoked
+The expected caller is **`remote-import`** (Phase 1 of `fairscape-remote-rocrate-wizard`), which routes here automatically when the user's paste isn't a Dataverse DOI, PhysioNet URL, or Figshare article. You can also be invoked directly by a user who already knows they want a manifest for an arbitrary source.
+Either way, your output (`manifest.csv` + `crate.json`) is the input to `fairscape-cli import manifest`. The caller runs that — not you.
+Cues that you're the right skill:
+- The data is in an **AWS / GCP open-data bucket** (`s3://`, `gs://`).
+- The data is on **NCBI GenBank / SRA / GEO**, **ProteomeXchange**, **cellxgene**, **OpenNeuro**, **Mendeley Data**, or any portal without a dedicated `fairscape-cli import <kind>` subcommand.
+- The user has a **paper + a vague "data is at this URL"** but no fetcher will work.
+- The user has a **list of files on hand** (TSV/CSV/spreadsheet) and just wants those wrapped as a crate.
+If the data is on Dataverse, PhysioNet, or Figshare, **stop and tell the user** to use `fairscape-cli import <kind>` directly — those importers do this for free.
+## What the orchestrator (or user) passes you
+A free-form brief is fine — accept any of:
+- A paper DOI (`10.1038/s41586-023-05896-x`) or URL.
+- A repository URL (`https://github.com/human-pangenomics/HPP_Year1_Assemblies`).
+- A "data is at <URL>" pointer with no further structure.
+- A paper PDF on disk.
+- A pre-existing inventory file (TSV/CSV/JSON) the user already has.
+- A working directory where the output should land (default: `./<slug>/` under pwd).
+If too little is given, ask one focused question to fill the biggest gap. Don't interrogate — three rounds max.
+## Phase A — Identify the inventory source
+Before writing anything, **find the file inventory**. In order of preference:
+1. **Pre-existing inventory the user already has.** If they hand you a TSV/CSV/JSON file listing the data files, use that — don't re-derive.
+2. **A "release manifest" or "index" file published with the dataset.** Look on the dataset's GitHub repo (`assembly_index/`, `manifests/`, `releases/`), the paper's supplementary data, or a sibling `files.tsv` next to the data.
+3. **A listing API.** S3 buckets without `ListBucket` denial expose `?list-type=2`; GitHub releases expose `/releases/<id>` JSON; NCBI gives FTP listings.
+4. **The paper's Data Availability section.** Sometimes the manifest is literally a table in supplementary materials.
+5. **As a last resort**, walk the user through what files they want included (one-by-one). Only do this when nothing else turned up — it's expensive and error-prone.
+`WebFetch` and `WebSearch` are your friends here. Don't guess inventory URLs — verify by fetching first.
+Once you have the inventory in hand, tell the user what you found:
+> *"Found the file inventory at `<url>`. It lists `<N>` files across `<K>` groups (e.g. samples / haplotypes / runs). Columns published: `<col list>`. Notably, the source publishes `<sha256 | md5 | nothing>` for hashes and `<does | doesn't>` include sizes."*
+## Phase B — Scope
+Ask the user whether to include everything or a subset. Phrase the trade-off:
+> *"The inventory has `<N>` files totaling roughly `<size>`. The crate will reference all of them by URL — no data downloads — but each becomes a Dataset entry in `@graph`, which can get noisy for very large inventories. Want all `<N>`, or a subset (and if so, which axis to filter on — first K samples, specific cohort, single haplotype)?"*
+For datasets with more than a few hundred files, default to suggesting a representative subset for the first crate, and tell the user they can re-run with the full inventory once they've eyeballed the result.
+## Phase C — Per-file row build
+For each file in scope, populate the row:
+### `name`
+The basename of the file (`HG00438.paternal.f1_assembly_v2_genbank.fa.gz`). Not the full URL.
+### `description`
+A short, templated description that reads naturally. Most inventories give you 2–3 axis values (sample × haplotype, run × condition, donor × organ). Compose like:
+> *"Phased diploid assembly (paternal haplotype) for HPRC sample HG00438; year-1 v2 GenBank release."*
+If the inventory has a per-file `description` column, use it verbatim. Don't make up details that aren't in the source — vague but truthful beats specific but invented.
+### `contentUrl` — rewriting cloud URIs
+The crate's downstream consumers (datasheet, validator) currently assume `http(s)://`. Rewrite the inventory's URI:
+| Inventory has | Rewrite to |
+|---|---|
+| `s3://<bucket>/<key>` (public AWS Open Data) | `https://<bucket>.s3.amazonaws.com/<key>` |
+| `gs://<bucket>/<key>` (public GCS) | `https://storage.googleapis.com/<bucket>/<key>` |
+| `ftp://<host>/<path>` | Leave as-is OR switch to `https://<host>/<path>` if the host serves both. |
+| `<github>/blob/<sha>/<path>` | `<github>/raw/<sha>/<path>` |
+For private/auth'd buckets — stop. The manifest path is for openly-accessible data. Tell the user we don't support tokens here in v1.
+**Verify one URL is reachable** before continuing — pick a representative file and `curl -sI` it. If you get HTTP 200, the rest of the bucket is probably fine. If you get 403/404, the URL pattern is wrong; debug.
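The rewrite table maps directly onto a small function. This is a sketch, not the connector's code: `rewrite_uri` is a hypothetical name, keys are passed through without URL-encoding, and FTP URIs are left as-is because switching them to HTTPS depends on the host.

```python
import re

def rewrite_uri(uri):
    """Rewrite a public cloud URI to an anonymously fetchable HTTPS URL."""
    if uri.startswith("s3://"):
        bucket, _, key = uri[len("s3://"):].partition("/")
        return f"https://{bucket}.s3.amazonaws.com/{key}"
    if uri.startswith("gs://"):
        bucket, _, key = uri[len("gs://"):].partition("/")
        return f"https://storage.googleapis.com/{bucket}/{key}"
    # GitHub "blob" pages are HTML; the "raw" path serves the file bytes.
    match = re.match(r"(https://github\.com/[^/]+/[^/]+)/blob/(.+)", uri)
    if match:
        return f"{match.group(1)}/raw/{match.group(2)}"
    return uri  # ftp:// and http(s):// pass through unchanged
```

The reachability check still applies afterwards: a rewritten URL that returns 403/404 means the pattern is wrong for that bucket.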
+### `md5` / `sha256`
+Pull from the inventory if published. **Never download to compute.** If only one is published, fill that column and leave the other blank — fairscape's `Dataset` model accepts either. If neither is published, leave both blank; the user can run `/hash-coverage` later for any files they're willing to download locally.
+When hashes are blank, **note it in the sidecar** (see Phase D) so the gap is documented rather than silent.
+### `size_bytes`
+Prefer the inventory's published byte count. If absent, `curl -sI <url> | grep -i ^content-length:` per file. For inventories > 50 files, batch the HEAD requests:
+```bash
+for url in <urls>; do
+  size=$(curl -sI "$url" | awk 'tolower($1) == "content-length:" {gsub(/\r/, "", $2); print $2; exit}')
+  echo "$url,$size"
+done
+```
+The manifest's `_human_size()` helper (in `manifest_connector.py`) formats `size_bytes` into the human-readable `contentSize` string ("833.8 MB") on the way to the crate, so the CSV only carries the integer.
+### `format`
+If the inventory publishes a MIME type, use it. Otherwise leave blank — the connector falls back to the filename extension. Don't invent MIME types you're not sure about (`application/x-bgzip` is wrong for `.fa.gz`; it's just `fasta-gz` if anything).
+### `keywords` (optional, pipe-separated)
+Per-file keywords that meaningfully narrow the file beyond what the crate-level `keywords` already cover. Skip if the file is generic.
+### `group` (optional)
+Free-form label for files that share a structure (e.g. `"paternal-haplotype-fasta"`, `"per-sample-vcf"`). Reserved for future schema-inference grouping — write it now, it'll light up later for free.
+### `type` (optional)
+`dataset` (default) or `software`. The manifest connector reads this column and routes the row to either `GenerateDataset` → `fairscape_models.dataset.Dataset` or `GenerateSoftware` → `fairscape_models.software.Software`. If blank, the connector auto-detects from extension: `.py .r .sh .bash .ipynb .jl .m .exe .java .cpp .js .jsx .ts .tsx .css` → `software`; everything else → `dataset`.
+For most remote published datasets the source's file inventory is purely data, so the column is rarely needed. Set it explicitly when the source ships scripts you want to surface as Software (e.g. analysis code published alongside the data).
+## Phase D — Sidecar (crate.json)
+Pull from the paper (DOI lookup, PDF read) and the dataset's GitHub repo:
+| Field | Required? | Notes |
+|---|---|---|
+| `name` | yes | Paper title or dataset title. Add "(subset)" if scope is partial. |
+| `description` | yes | One paragraph. Pull from paper abstract or repo README. |
+| `authors` | yes | Full author list from the paper. Order matters. Don't truncate. |
+| `license` | strongly recommended | URL form (`https://creativecommons.org/...`). Look at the data repo, not the paper's CC-BY — those are often different. |
+| `keywords` | recommended | 4–8 short terms grounded in the domain. |
+| `publication_date` | recommended | ISO date of the paper. |
+| `doi` | recommended | Bare DOI (`10.1038/...`), not the URL form. |
+| `associated_publication` | recommended | URL form of the DOI. |
+| `repository_name` | recommended | Free text — "HPRC (AWS Open Data)", "NCBI GenBank", "Zenodo (record 12345)". Not a URL. |
+| `project_id` | recommended | A short slug or accession the source uses. |
+| `version` | optional | Release version, defaults to "1.0". |
+| `url` | optional | A canonical landing URL for the dataset (the GitHub repo, the portal page, etc.). |
+If hashes were blank for some rows in Phase C, **add a free-text note** to `description`: *"Note: file hashes are not published by the source repository; `/hash-coverage` can populate them later for files downloaded locally."* This keeps the gap documented inside the crate itself.
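One way to apply that rule mechanically is sketched below. The helper name `document_missing_hashes` is hypothetical, and it assumes manifest rows are dicts with `sha256` and `md5` keys and that "blank" means a row carries neither hash.

```python
HASH_NOTE = (
    "Note: file hashes are not published by the source repository; "
    "`/hash-coverage` can populate them later for files downloaded locally."
)


def document_missing_hashes(sidecar: dict, rows: list) -> dict:
    """Return a sidecar whose description carries the note when any row lacks a hash."""
    missing = any(not (row.get("sha256") or row.get("md5")) for row in rows)
    description = sidecar.get("description", "")
    if missing and HASH_NOTE not in description:
        # Copy rather than mutate, and never append the note twice.
        return {**sidecar, "description": f"{description} {HASH_NOTE}".strip()}
    return sidecar


rows = [{"sha256": "abc123", "md5": ""}, {"sha256": "", "md5": ""}]
print(document_missing_hashes({"description": "Example crate."}, rows)["description"])
```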
+## Phase E — Write + spot-check
+1. `Write` `manifest.csv` and `crate.json` to `<workdir>/<slug>/`.
+2. **Validate locally — don't shell out to the importer.** Programmatic checks:
+   - `csv.DictReader` opens cleanly.
+   - Required columns present: `name`, `description`, `contentUrl`.
+   - At least one row.
+   - First row's `contentUrl` returns HTTP 200 on HEAD.
+   - Sidecar has `name`, `description`, `authors` (a list).
+3. **Spot-check one URL by sampling.** `curl -r 0-1023 <one_url>` — pulls 1 KB. Confirms not only reachability but that the bytes are real (not a 200-but-empty redirect page).
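The offline checks in step 2 could look roughly like the sketch below. `validate_outputs` is a hypothetical helper, not part of `fairscape-cli` or the wizard; the HTTP HEAD check on the first `contentUrl` is left out on purpose so the sketch runs without network access.

```python
import csv
import json
from pathlib import Path

REQUIRED_COLUMNS = {"name", "description", "contentUrl"}
REQUIRED_SIDECAR_KEYS = ("name", "description", "authors")


def validate_outputs(manifest_path, sidecar_path) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # csv.DictReader must open cleanly and expose the required columns.
    with open(manifest_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - columns
    if missing:
        problems.append(f"manifest is missing columns: {sorted(missing)}")
    if not rows:
        problems.append("manifest has no rows")

    # The sidecar needs name, description, and authors (as a list).
    sidecar = json.loads(Path(sidecar_path).read_text(encoding="utf-8"))
    for key in REQUIRED_SIDECAR_KEYS:
        if not sidecar.get(key):
            problems.append(f"sidecar is missing '{key}'")
    if sidecar.get("authors") and not isinstance(sidecar["authors"], list):
        problems.append("sidecar 'authors' must be a list")

    return problems
```

Returning a list of problems instead of raising on the first one lets the report show every gap in a single pass.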
+## Phase F — Report
+Print a short summary the user can act on:
+> *"Manifest ready at `<workdir>/<slug>/`. `<N>` files referenced; total declared size `<H>` (computed from `size_bytes`). Hashes: `<K>` rows have sha256, `<L>` have md5, `<M>` have neither. Run `fairscape-cli import manifest <workdir>/<slug>/manifest.csv --output-dir <crate_dir>` to build the crate. The remote-source wizard's phases 2–6 (schema infer, AI-Ready enrich, provenance, grade, improve) work against the resulting crate unchanged."*
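The counts in that summary can be derived from the manifest rows alone. The sketch below assumes rows are dicts as produced by `csv.DictReader`, with `size_bytes`, `sha256`, and `md5` columns; `summarize_manifest` is a hypothetical name.

```python
def summarize_manifest(rows: list) -> dict:
    """Compute the N / total size / hash-coverage figures for the report."""

    def filled(row, key):
        # Treat missing keys, None, and whitespace-only cells as blank.
        return bool((row.get(key) or "").strip())

    return {
        "files": len(rows),
        "total_bytes": sum(int(r["size_bytes"]) for r in rows if filled(r, "size_bytes")),
        "with_sha256": sum(1 for r in rows if filled(r, "sha256")),
        "with_md5": sum(1 for r in rows if filled(r, "md5")),
        "with_neither": sum(
            1 for r in rows if not filled(r, "sha256") and not filled(r, "md5")
        ),
    }


rows = [
    {"size_bytes": "100", "sha256": "abc", "md5": ""},
    {"size_bytes": "", "sha256": "", "md5": ""},
    {"size_bytes": "50", "sha256": "", "md5": "d41d"},
]
print(summarize_manifest(rows))
```

Rows with a blank `size_bytes` are skipped in the total, so the declared size is a lower bound whenever some sizes are unknown.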
+## What you must NOT do
+- **Don't call `fairscape-cli import manifest`.** Your output is the manifest + sidecar. The caller decides when to build.
+- **Don't download data files to compute hashes.** HEAD requests and the 1 KB ranged spot-check from Phase E are fine; a full GET is not. If hashes aren't published, leave the columns blank and document the gap.
+- **Don't invent metadata.** Per-file descriptions can be templated, but the *facts* in them (sample IDs, conditions, file types) must come from the inventory, not your imagination. If you don't know, say "data file" and move on.
+- **Don't include private/auth'd URLs.** If a row would require a token, the row doesn't belong in the manifest — it belongs in a Dataset with a `contentUrl: "Embargoed"` placeholder, added later via `register-embargoed-dataset`.
+- **Don't try to be smart about cloud URI rewrites you haven't verified.** If a bucket isn't documented as public, HEAD it to confirm `200 OK` *before* writing the URL into the manifest.
+- **Don't write `MEMORY.md`-style "this is a great dataset" commentary into the sidecar.** Stick to factual fields.
+## Reference: the HPRC test case
+`wizards/manifest-import-design/hprc-subset/` is the canonical worked example. When in doubt, mirror its shape:
+- 6 rows (3 samples × 2 haplotypes).
+- `contentUrl`s rewritten from `s3://human-pangenomics/...` to `https://human-pangenomics.s3.amazonaws.com/...`.
+- `sha256` pulled verbatim from `Year1_assemblies_v2_genbank.index`; `md5` blank because HPRC doesn't publish md5.
+- `size_bytes` from HTTP HEAD against each public URL.
+- `crate.json` has 119 authors copied from the Liao 2023 Nature paper, CC0 license matching HPRC's open-data terms, DOI `10.1038/s41586-023-05896-x`, publication date `2023-05-10`.
+That manifest produced a crate that validates as `ROCrate v1.2` and has every per-file Dataset carrying `contentUrl`, `sha256`, and `contentSize`.
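The `s3://` to `https://` rewrite used in this example follows the virtual-hosted-style pattern shown above. A minimal sketch, assuming a hypothetical helper name `s3_to_https` and an illustrative object key (not a real HPRC path):

```python
from urllib.parse import quote


def s3_to_https(uri: str) -> str:
    """Rewrite s3://<bucket>/<key> to https://<bucket>.s3.amazonaws.com/<key>."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"expected s3://<bucket>/<key>, got {uri!r}")
    # Percent-encode the key but keep '/' as the path separator.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


# The key below is a made-up placeholder, not an actual object in the bucket.
print(s3_to_https("s3://human-pangenomics/example/sample.fa.gz"))
```

The rewrite only produces a candidate URL. Per the rules above, HEAD the result and confirm `200 OK` before writing it into the manifest, since not every bucket is public or served from this hostname form.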

fairscape_wizard-0.2.0/.claude/skills/checkpoint/SKILL.md ADDED

@@ -0,0 +1,39 @@
+---
+name: checkpoint
+description: Summarize the current wizard state for the user and confirm whether to continue. Invoked at session resume, between major phases, and on user request.
+---
+# Checkpoint
+Read `.fairscape-wizard-state.json` and produce a short status summary. Ask the user whether to continue, correct something, or stop.
+## Procedure
+1. `Read` the state file. If it doesn't exist, say so and offer to start the wizard fresh.
+2. Compose a summary using counts, not full records:
+   ```
+   Crate: <state.crate_metadata.name or "(no name yet)">
+     <one-line description if present>
+   Captured so far:
+     - 3 single inputs
+     - 1 folder of inputs (raw images, 847 files)
+     - 2 scripts
+     - 1 step
+     - 1 single output
+     - 0 folders of outputs
+   Branches: imaging (open, head: segmented_masks), clinical (complete), demographics (merged into imaging)
+   Last activity: 2026-05-04T14:23 — added step "Segment images"
+   ```
+   Pull the "last activity" line from the most recent `state.history` entry. Omit the "Branches:" line if `state.branches` is absent or empty (single-pipeline case).
+3. Ask the user: "Continue from here, or correct/remove anything first?"
+4. Common follow-ups:
+   - "Show me what's in inputs" → list `state.datasets` where `is_raw_input` and `state.bulk_groups` where `is_raw_input`, by `name`/`user_label`.
+   - "Show me the steps" → render each `state.computations` entry as `<inputs> → <software> → <outputs>`.
+   - "Remove X" → delete the matching entry, append a `removed` entry to `history`. If the removed entity is referenced by a computation, warn before removing or unwire it.
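The first two follow-ups above can be sketched as small read-only helpers over the loaded state. The state file's schema is not shown here, so the shapes are assumptions: `datasets`, `bulk_groups`, and `computations` are taken to be lists of dicts, and each computation is assumed to carry `inputs`, `software`, and `outputs` as lists of names. The function names are hypothetical.

```python
def list_raw_inputs(state: dict) -> list:
    """Names of raw inputs: single datasets by `name`, folders by `user_label`."""
    names = []
    for entry in state.get("datasets", []) + state.get("bulk_groups", []):
        if entry.get("is_raw_input"):
            names.append(entry.get("name") or entry.get("user_label") or "(unnamed)")
    return names


def render_steps(state: dict) -> list:
    """Render each computation as '<inputs> → <software> → <outputs>'."""
    lines = []
    for step in state.get("computations", []):
        parts = [
            ", ".join(step.get(key, [])) or "(none)"
            for key in ("inputs", "software", "outputs")
        ]
        lines.append(" → ".join(parts))
    return lines


state = {
    "datasets": [{"name": "clinical.csv", "is_raw_input": True}],
    "bulk_groups": [{"user_label": "raw images", "is_raw_input": True}],
    "computations": [
        {"inputs": ["raw images"], "software": ["segment.py"], "outputs": ["segmented_masks"]}
    ],
}
print(list_raw_inputs(state))
print(render_steps(state))
```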
+## Don't
+- Don't dump full JSON unless explicitly asked.
+- Don't propose new entities — that's the registration skills' job.