lumina-wiki 0.7.0 → 0.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -5,6 +5,37 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
5
5
 
6
6
  ## [Unreleased]
7
7
 
8
+ ## [0.8.0] - 2026-05-03
9
+
10
+ ### Added
11
+ - Schema: `raw_paths` field (array, optional) on `sources` — relative paths to permanent raw artifacts backing the source page (`raw/sources/*`, `raw/notes/*`, `raw/download/<resource>/*`, `raw/discovered/<topic>/*.json`). Replaces the implicit "URL is the anchor" semantic with an explicit pointer set; `/lumi-verify` Stage A (planned for v1.0) will read this directly instead of re-deriving it from heuristics.
12
+ - `raw/download/<resource>/` — permanent agent-writable zone for auto-fetched full-text artifacts, partitioned by source (`arxiv`, `doi`, `s2`, `web`). Distinct from `raw/tmp/` (transient) and `raw/sources/` (human-curated).
13
+ - `_lumina/tools/fetch_pdf.py` — CLI tool: download URL to `raw/download/<resource>/<filename>`, idempotent (skip on existing, `--force` to overwrite). Resource detection from URL pattern (arxiv/doi/s2/web). Atomic write (tempfile + fsync + rename). Used by `/lumi-ingest` Mode B.
14
+ - Lint check L12: warning when `raw_paths` entries point to a missing file, escape the project root, or live in `raw/tmp/*` (transient — should be moved to `raw/sources/` or `raw/download/`).
15
+ - `/lumi-ingest` Mode B: input may be a URL, arxiv ID, DOI, or paper title from discover shortlist. Skill resolves to URL, calls `fetch_pdf.py`, ingests from the resulting `raw/download/` path. Mode A (local file path) unchanged.
16
+
17
+ ### Changed
18
+ - Source frontmatter field `url: <string>` renamed to `urls: <array>` for symmetry with `raw_paths: array`. Multiple URLs supported per source (arxiv abs, DOI, repo, slides). Lint type validation expects `urls` to be an array; legacy `url` string entries flagged as unknown field. Migration handled by `/lumi-migrate-legacy` (detects and rewrites `url: <str>` → `urls: [<str>]`).
19
+ - Provenance semantics reframed to be raw-centric (enum unchanged, 3 values):
20
+ - `replayable` now requires `raw_paths` non-empty with at least one entry resolving to disk (URL is no longer a precondition — file-only sources qualify).
21
+ - `partial` requires at least one entry in `urls` and no resolvable `raw_paths`.
22
+ - `missing` unchanged.
23
+ Rubric updated in `/lumi-ingest`, `/lumi-research-discover`, `/lumi-migrate-legacy`.
24
+ - `/lumi-migrate-legacy` rubric: tier 1 reads ingest checkpoint (`_lumina/_state/ingest-<slug>.json`) for authoritative `source_path`; tier 2 falls back to slug-prefix and URL-derived-ID heuristics across `raw/sources/`, `raw/notes/`, `raw/download/**`, `raw/discovered/**`. Pages whose checkpoint points into `raw/tmp/*` are flagged for the user to relocate before backfill — skill does not auto-move human files.
25
+ - Manifest schema: `MANIFEST_SCHEMA_VERSION` 2 → 3. Migration is metadata-only (no manifest field shape change); workspace schema additions (`raw_paths`, `raw/download/`) are additive and backward-compatible — old wikis continue to lint clean (L12 warnings advisory only).
26
+ - `/lumi-migrate-legacy`: raised the work-list confirmation gate from 10 to 30 entries. Real wikis commonly have dozens of entries, and the original threshold made every migration a multi-turn chore. Lists ≤30 now proceed after the plan is reported; lists >30 still pause for explicit confirmation, since a large batch usually signals a long-dormant wiki or major schema bump worth spot-checking.
27
+
28
+ ### Fixed
29
+ - `/lumi-migrate-legacy`: Step 1.2 and Step 4.1 now use `lint.mjs --summary` for counts and write `--json` to `/tmp/lumi-lint.json` before projecting findings. Avoids the Bash-tool ~30KB stdout cap which truncated full `--json` mid-string on wikis with many findings, breaking inline `JSON.parse`.
30
+
31
+ ### Migration
32
+ - Existing source pages without `raw_paths`: no immediate action required. Lint stays green (`raw_paths` is optional, L12 only fires when present-but-broken).
33
+ - To backfill `raw_paths` on legacy entries, run `/lumi-migrate-legacy` after upgrading. The skill reads ingest checkpoints and applies the new tier-1/tier-2 rubric.
34
+ - If you have wiki sources currently pointing at `raw/tmp/arxiv-ingest/` or similar transient locations (a known artefact of pre-v0.8 agent improvisation): move those PDFs to `raw/download/arxiv/` (matching arxiv ID) or `raw/sources/` (custom-named), then re-run `/lumi-migrate-legacy`. Lint L12 will identify the affected pages.
35
+ - Custom tooling reading manifest: bump expected `schemaVersion` to 3 (or accept 2|3 transitionally — the manifest shape is unchanged).
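For custom tooling, a minimal version-gate sketch (assumptions: the manifest lives at `_lumina/manifest.json` and exposes the top-level `schemaVersion` field named above; `check_manifest_version` is a hypothetical helper, not part of the package):

```python
# Hypothetical helper for custom tooling: accept schemaVersion 2 or 3 during
# the transition, reject anything else. Manifest path per the workspace layout.
import json
from pathlib import Path

def check_manifest_version(project_root: str) -> int:
    manifest = json.loads(Path(project_root, "_lumina", "manifest.json").read_text())
    version = manifest.get("schemaVersion")
    if version not in (2, 3):
        raise RuntimeError(f"unsupported manifest schemaVersion: {version!r}")
    return version
```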
36
+
37
+ ## [0.7.0] - 2026-05-03
38
+
8
39
  ### Added
9
40
  - `/lumi-migrate-legacy` core skill — LLM-driven backfill of provenance/confidence
10
41
  - `CHANGELOG.md` shipped to `_lumina/CHANGELOG.md` for skill consumption
package/package.json CHANGED
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "$schema": "https://json.schemastore.org/package.json",
3
3
  "name": "lumina-wiki",
4
- "version": "0.7.0",
4
+ "version": "0.8.0",
5
5
  "description": "Domain-agnostic, multi-IDE wiki scaffolder — Karpathy's LLM-Wiki vision, cross-platform and pack-based.",
6
6
  "keywords": [
7
7
  "llm-wiki",
@@ -51,6 +51,7 @@
51
51
  "src/tools/init_discovery.py",
52
52
  "src/tools/prepare_source.py",
53
53
  "src/tools/fetch_arxiv.py",
54
+ "src/tools/fetch_pdf.py",
54
55
  "src/tools/fetch_wikipedia.py",
55
56
  "src/tools/fetch_s2.py",
56
57
  "src/tools/fetch_deepxiv.py",
@@ -110,7 +110,7 @@ const CORE_WIKI_DIRS = [
110
110
  const RESEARCH_WIKI_DIRS = ['wiki/foundations', 'wiki/topics'];
111
111
  const READING_WIKI_DIRS = ['wiki/chapters', 'wiki/characters', 'wiki/themes', 'wiki/plot'];
112
112
 
113
- const CORE_RAW_DIRS = ['raw/sources', 'raw/notes', 'raw/assets', 'raw/tmp'];
113
+ const CORE_RAW_DIRS = ['raw/sources', 'raw/notes', 'raw/assets', 'raw/tmp', 'raw/download'];
114
114
  const RESEARCH_RAW_DIRS = ['raw/discovered'];
115
115
 
116
116
  const LUMINA_DIRS = [
@@ -878,7 +878,7 @@ function getSkillDefs(packs) {
878
878
 
879
879
  async function copyTools(projectRoot, { research }) {
880
880
  const destDir = join(projectRoot, '_lumina', 'tools');
881
- const coreTools = ['extract_pdf.py'];
881
+ const coreTools = ['extract_pdf.py', 'fetch_pdf.py'];
882
882
  const researchTools = [
883
883
  '_env.py', 'discover.py', 'init_discovery.py', 'prepare_source.py',
884
884
  'fetch_arxiv.py', 'fetch_wikipedia.py', 'fetch_s2.py', 'fetch_deepxiv.py',
@@ -19,7 +19,7 @@ import { atomicWrite, ensureDir } from './fs.js';
19
19
  // Constants
20
20
  // ---------------------------------------------------------------------------
21
21
 
22
- export const MANIFEST_SCHEMA_VERSION = 2;
22
+ export const MANIFEST_SCHEMA_VERSION = 3;
23
23
 
24
24
  export const SKILLS_CSV_HEADER = 'canonical_id,display_name,pack,source,relative_path,target_link_path,version';
25
25
  export const FILES_CSV_HEADER = 'relative_path,sha256,source_pack,installed_version';
@@ -293,6 +293,11 @@ export async function writeFilesManifest(projectRoot, rows) {
293
293
  */
294
294
  const MIGRATIONS = {
295
295
  '1->2': (m) => ({ ...m, legacyMigrationNeeded: true }),
296
+ // 2->3 (v0.8): workspace schema additions — raw_paths field, raw/download/ dir,
297
+ // lint L12, source frontmatter url (string) -> urls (array). All additive /
298
+ // backward-compatible at the manifest level. Wiki content migration is handled
299
+ // by /lumi-migrate-legacy, not by the installer. No manifest shape change.
300
+ '2->3': (m) => ({ ...m }),
296
301
  };
297
302
 
298
303
  /**
@@ -67,7 +67,7 @@ const INDEX_MARKER_OPEN = '<!-- lumina:index -->';
67
67
  const INDEX_MARKER_CLOSE = '<!-- /lumina:index -->';
68
68
 
69
69
  /** All check IDs in run order. */
70
- const ALL_CHECK_IDS = ['L01', 'L02', 'L03', 'L04', 'L05', 'L06', 'L07', 'L08', 'L09', 'L10', 'L11'];
70
+ const ALL_CHECK_IDS = ['L01', 'L02', 'L03', 'L04', 'L05', 'L06', 'L07', 'L08', 'L09', 'L10', 'L11', 'L12'];
71
71
 
72
72
  /** Kebab-case pattern: lowercase letters, digits, hyphens; no leading/trailing hyphen. */
73
73
  const KEBAB_RE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
@@ -667,6 +667,62 @@ function checkL10(foundationEntries) {
667
667
  return findings;
668
668
  }
669
669
 
670
+ /**
671
+ * L12: `raw_paths` entries on a `sources` page point to a missing file, live in
672
+ * `raw/tmp/*` (transient location — canonical sources should not live there), or escape the project root.
673
+ * Severity: warning. Not auto-fixable. Catches drift when the user moves or
674
+ * renames a backing file, and flags the common mistake of pinning a wiki page
675
+ * to a temp-zone artifact that may be cleaned at any time.
676
+ *
677
+ * @param {string} wikiRelPath
678
+ * @param {Record<string,unknown>} fm
679
+ * @param {string} projectRoot Absolute path; used to resolve raw_paths entries.
680
+ * @returns {Promise<Finding[]>}
681
+ */
682
+ async function checkL12(wikiRelPath, fm, projectRoot) {
683
+ const type = entityTypeForPath(wikiRelPath);
684
+ if (type !== 'sources') return [];
685
+
686
+ const rawPaths = fm.raw_paths;
687
+ if (!Array.isArray(rawPaths) || rawPaths.length === 0) return [];
688
+
689
+ const findings = [];
690
+ for (const entry of rawPaths) {
691
+ if (typeof entry !== 'string' || entry === '') continue;
692
+
693
+ // Reject paths inside raw/tmp/ — transient zone, not for canonical sources.
694
+ if (entry.startsWith('raw/tmp/') || entry.startsWith('./raw/tmp/')) {
695
+ findings.push(finding(
696
+ 'L12-raw-paths-transient', 'warning', false,
697
+ wikiRelPath, null,
698
+ `raw_paths entry "${entry}" lives in raw/tmp/ — transient. Move the file to raw/sources/ (human) or raw/download/<resource>/ (agent) and update raw_paths.`
699
+ ));
700
+ continue;
701
+ }
702
+
703
+ // Verify file exists on disk (relative to project root).
704
+ const abs = resolve(projectRoot, entry);
705
+ if (!abs.startsWith(resolve(projectRoot))) {
706
+ findings.push(finding(
707
+ 'L12-raw-paths-unsafe', 'warning', false,
708
+ wikiRelPath, null,
709
+ `raw_paths entry "${entry}" escapes the project root`
710
+ ));
711
+ continue;
712
+ }
713
+ try {
714
+ await access(abs, fsConstants.F_OK);
715
+ } catch {
716
+ findings.push(finding(
717
+ 'L12-raw-paths-missing', 'warning', false,
718
+ wikiRelPath, null,
719
+ `raw_paths entry "${entry}" does not exist on disk`
720
+ ));
721
+ }
722
+ }
723
+ return findings;
724
+ }
725
+
670
726
  /**
671
727
  * L11: `confidence` field missing on a `sources` or `concepts` entity.
672
728
  * Severity: warning. Not auto-fixable. Sets an explicit trust signal that
@@ -943,6 +999,7 @@ async function runLint(projectRoot, opts) {
943
999
  allFindings.push(...checkL04(wikiRelPath, outboundMap.get(wikiRelPath) || new Set(), inboundSet));
944
1000
  allFindings.push(...checkL05(wikiRelPath, content, knownSlugs));
945
1001
  allFindings.push(...checkL11(wikiRelPath, fm));
1002
+ allFindings.push(...await checkL12(wikiRelPath, fm, projectRoot));
946
1003
  }
947
1004
 
948
1005
  allFindings.push(...checkL06(edges, new Set(edgeSet)));
@@ -1243,7 +1300,7 @@ export {
1243
1300
  isExempt,
1244
1301
  entityTypeForPath,
1245
1302
  checkL01, checkL02, checkL03, checkL04, checkL05,
1246
- checkL06, checkL07, checkL08, checkL09, checkL10, checkL11,
1303
+ checkL06, checkL07, checkL08, checkL09, checkL10, checkL11, checkL12,
1247
1304
  fixL01, fixL03, fixL06, fixL07, fixL09,
1248
1305
  runLint,
1249
1306
  reportSummary,
@@ -118,6 +118,7 @@ export const RAW_DIRS = {
118
118
  notes: 'core',
119
119
  assets: 'core',
120
120
  tmp: 'core',
121
+ download: 'core',
121
122
 
122
123
  // research pack
123
124
  discovered: 'research',
@@ -257,7 +258,8 @@ export const REQUIRED_FRONTMATTER = {
257
258
  { key: 'authors', type: 'array', required: true },
258
259
  { key: 'year', type: 'number', required: true },
259
260
  { key: 'importance', type: 'enum', required: true, values: [1, 2, 3, 4, 5] },
260
- { key: 'url', type: 'string', required: false },
261
+ { key: 'urls', type: 'array', required: false },
262
+ { key: 'raw_paths', type: 'array', required: false },
261
263
  { key: 'provenance', type: 'enum', required: true, values: ['replayable', 'partial', 'missing'] },
262
264
  { key: 'confidence', type: 'enum', required: false, values: ['high', 'medium', 'low', 'unverified'] },
263
265
  ],
@@ -8,8 +8,13 @@ description: >
8
8
  into the wiki", "create a wiki page for", or drops a filename from raw/sources/.
9
9
  Also fires for: "I added a PDF to raw/sources/", "add this paper to the wiki",
10
10
  "parse this article", "what should I do with raw/sources/X?", or any request
11
- to bring a new source into the wiki graph. This is the most-used skill — when
12
- in doubt about whether something is an ingest vs an edit, ask the user.
11
+ to bring a new source into the wiki graph.
12
+ Also accepts Mode B input: a paper title, arxiv ID, or URL, without a local
13
+ file path. Examples: "ingest paper 2604.03501v2", "ingest arxiv:2604.03501",
14
+ "ingest https://arxiv.org/abs/2604.03501". The skill fetches the PDF
15
+ automatically in that case.
16
+ This is the most-used skill — when in doubt about whether something is an
17
+ ingest vs an edit, ask the user.
13
18
  allowed-tools:
14
19
  - Bash
15
20
  - Read
@@ -34,6 +39,7 @@ depends on bidirectional-link discipline.
34
39
 
35
40
  Key workspace paths:
36
41
  - `raw/sources/` — immutable user-provided sources; you read but never modify
42
+ - `raw/download/<resource>/` — agent-writable, permanent; auto-fetched PDFs land here (resource = `arxiv | doi | s2 | web`)
37
43
  - `wiki/sources/` — one page per ingested source (you write this)
38
44
  - `wiki/concepts/`, `wiki/people/` — you create or update stubs here
39
45
  - `wiki/index.md` — updated on every ingest
@@ -79,6 +85,43 @@ If a checkpoint exists and `phase` is not `"done"`, ask the user whether to resu
79
85
  or restart. Resuming skips completed phases. Restarting deletes the checkpoint and
80
86
  starts from Phase 1.
81
87
 
88
+ ### Phase 0.5 — Resolve input
89
+
90
+ Three input modes:
91
+
92
+ **Mode A — Local file path** (e.g. `raw/sources/foo.pdf`, `raw/notes/bar.md`)
93
+
94
+ Use directly as `source_path`. Proceed to Phase 1 to detect type from this file.
95
+
96
+ **Mode B — URL or identifier** (arxiv ID like `2604.03501v2`, arxiv URL, DOI, S2 paper ID, generic URL)
97
+
98
+ ```bash
99
+ python3 _lumina/tools/fetch_pdf.py "<url-or-id>"
100
+ ```
101
+
102
+ - For bare arxiv ID: pass `https://arxiv.org/abs/<id>`
103
+ - For DOI: pass `https://doi.org/<doi>`
104
+
105
+ On exit 0: read JSON output. Use the returned `path` as `source_path`. Write the input URL as
106
+ the first entry of `urls` on the source page; additional URLs (DOI, repo, slides, etc.) can be
107
+ appended after. Proceed to Phase 1.
108
+
109
+ On exit 2 (HTML response): the URL likely points to a paywall or non-PDF page.
110
+ Report to user and ask for a direct PDF URL or manual download. Do not proceed
111
+ with ingest until a valid file path is available.
112
+
113
+ On exit 3 (network error): retry once. If the second attempt also fails, surface
114
+ the error to the user with the exact message from the tool output.
115
+
116
+ **Mode C — Title only** (e.g. from a `/lumi-research-discover` shortlist)
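For orchestration outside the skill, a sketch of the same step driven programmatically (hedged: `fetch_for_ingest` is a hypothetical wrapper that simply mirrors the exit-code contract and JSON `path` field documented above; the skill itself issues the Bash command shown):

```python
# Sketch only: wraps the fetch_pdf.py CLI and maps its documented exit codes
# (0 ok, 2 user error such as an HTML/paywall response, 3 transient network
# error) onto exceptions.
import json
import subprocess

def fetch_for_ingest(url: str) -> str:
    # Expand bare arxiv IDs / DOIs into full URLs first, as noted above.
    proc = subprocess.run(
        ["python3", "_lumina/tools/fetch_pdf.py", url],
        capture_output=True, text=True,
    )
    if proc.returncode == 0:
        return json.loads(proc.stdout)["path"]  # use as source_path for Phase 1
    if proc.returncode == 2:
        raise ValueError(proc.stderr.strip())   # ask the user for a direct PDF URL
    raise RuntimeError(proc.stderr.strip())     # exit 3: retry once, then surface
```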
117
+
118
+ ```bash
119
+ node _lumina/scripts/wiki.mjs checkpoint-read research-discover shortlist
120
+ ```
121
+
122
+ Match the title to a shortlist entry and extract the URL from that entry.
123
+ Fall through to Mode B with that URL.
124
+
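A sketch of the title-matching step (hedged: the field names `title` and `url` are assumed from the discover shortlist description, not a fixed schema, and `difflib` fuzzy matching stands in for the judgment the agent actually applies):

```python
# Sketch only: pick the shortlist entry whose title best matches the user's
# phrase, then fall through to Mode B with its URL.
import difflib

def url_from_shortlist(user_title: str, shortlist: list[dict]) -> str | None:
    titles = [entry.get("title", "") for entry in shortlist]
    best = difflib.get_close_matches(user_title, titles, n=1, cutoff=0.6)
    if not best:
        return None  # no confident match: ask the user instead of guessing
    return next(e.get("url") for e in shortlist if e.get("title") == best[0])
```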
82
125
  ### Phase 1 — Detect type
83
126
 
84
127
  Read the file header (first ~200 lines). Classify as one of:
@@ -115,7 +158,7 @@ drafting the page.
115
158
  Draft `wiki/sources/<slug>.md` using the Source
116
159
  template from `_lumina/schema/page-templates.md` (open it when in doubt about
117
160
  required fields). Required frontmatter fields: `id`, `title`, `type`, `created`,
118
- `updated`, `authors`, `year`, `importance` (1-5), `url` (optional).
161
+ `updated`, `authors`, `year`, `importance` (1-5), `urls` (optional, array).
119
162
 
120
163
  Required body sections: `## Summary` (2-4 sentences), `## Key Claims` (bulleted
121
164
  with confidence level), `## Concepts` (all `[[concept-slug]]` links), `## People`
@@ -129,20 +172,36 @@ verification (Stage A/B/C of `/lumi-verify`, planned for v1.0). An explicit deci
129
172
  is more useful than a silently-defaulted value because verification needs to know
130
173
  whether it can re-check the material end-to-end.
131
174
 
132
- Provenance rubric — pick the one that matches what you actually did:
133
- - `replayable` — you fetched the URL and saved the raw snapshot under `raw/`. The
134
- source can be re-verified end-to-end against the original.
135
- - `partial` — you kept only a summary; no raw text snapshot was saved. Drift
136
- detection works against the URL, but grounding cannot be re-checked.
137
- - `missing` — manual entry: no URL, no raw snapshot. Verification has nothing to
138
- grip on.
175
+ Provenance rubric — raw-centric; pick the one that fits:
176
+ - `replayable` — `raw_paths` is non-empty AND every entry resolves to an existing file.
177
+ Source can be re-grounded end-to-end against these files.
178
+ - `partial` — has at least one entry in `urls` but `raw_paths` is empty or every listed entry is missing.
179
+ Drift detection works against the URL, but grounding cannot be re-checked.
180
+ - `missing` — no `urls`, no `raw_paths`. Manual entry; verification has nothing to grip on.
181
+
182
+ A source can have multiple URLs (arxiv abs + DOI + repo + slides). List the most authoritative
183
+ first; the agent uses `urls[0]` when a single canonical URL is needed (e.g., for `fetch_pdf.py` Mode B).
184
+
185
+ Set `raw_paths` to the list of permanent raw artifacts backing this page:
186
+ - The primary file passed to ingest (`raw/sources/*`, `raw/notes/*`, or
187
+ `raw/download/<resource>/*` from Mode B).
188
+ - Any matching metadata JSON in `raw/discovered/<topic>/<id>.json` (research pack).
189
+ Match by paper ID (arxiv ID, DOI) extracted from the source's URL or filename.
190
+
191
+ Do NOT include `raw/tmp/*` entries — that zone is transient (lint L12 warns).
192
+ Do NOT include files outside `raw/`.
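As a compact restatement of the rubric above, a sketch (not shipped with the skill; assumes `urls` and `raw_paths` are already parsed out of frontmatter and that `raw_paths` entries are relative to the project root):

```python
# Sketch of the provenance rubric: replayable needs every raw_paths entry on
# disk, partial needs at least one URL, missing has neither.
from pathlib import Path

def classify_provenance(urls: list[str], raw_paths: list[str], project_root: str) -> str:
    resolvable = [p for p in raw_paths if (Path(project_root) / p).is_file()]
    if raw_paths and len(resolvable) == len(raw_paths):
        return "replayable"  # can be re-grounded end-to-end
    if urls:
        return "partial"     # drift check possible, grounding is not
    return "missing"         # nothing for verification to grip on
```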
139
193
 
140
194
  Also set `confidence:` (optional but encouraged). Use `high | medium | low | unverified`.
141
195
  Default to `unverified` for fresh ingests; bump up only after you have cross-checked
142
196
  the claims or the user has confirmed them.
143
197
 
144
- Example frontmatter with both fields:
198
+ Example frontmatter:
145
199
  ```yaml
200
+ urls:
201
+ - https://arxiv.org/abs/2604.03501v2
202
+ raw_paths:
203
+ - raw/download/arxiv/2604.03501v2.pdf
204
+ - raw/discovered/ai-economics/2604.03501v2.json
146
205
  provenance: replayable
147
206
  confidence: unverified
148
207
  ```
@@ -303,6 +362,7 @@ Link added to `## Concepts` in `wiki/sources/rlhf-overview.md`:
303
362
  - Keep a checkpoint after every phase — an interrupted ingest must be resumable.
304
363
  - If the source is too large to fully read, read in sections and checkpoint between them.
305
364
  - `raw/tmp/` accepts additions only; never overwrite a file there.
365
+ - `raw_paths` must list permanent artifacts only. Reject `raw/tmp/*` entries.
306
366
 
307
367
  ## Definition of Done
308
368
 
@@ -56,13 +56,38 @@ Proceed to the lint check regardless.
56
56
 
57
57
  **Step 1.2 — Run lint (read-only pass).**
58
58
 
59
+ First, get aggregate counts (tiny output, always safe):
60
+
59
61
  ```bash
60
- node _lumina/scripts/lint.mjs --json
62
+ node _lumina/scripts/lint.mjs --summary
63
+ ```
64
+
65
+ If `errors === 0` and `by_check.L11` is `0` or absent, skip to the clean-exit
66
+ branch below. Otherwise, you need the per-entry findings.
67
+
68
+ **Important — do NOT pipe `--json` straight into a heredoc.** On a large wiki
69
+ the full findings JSON can exceed the shell tool's ~30KB stdout buffer and get
70
+ truncated mid-string, breaking JSON.parse. Instead, write it to a temp file
71
+ and read filtered slices:
72
+
73
+ ```bash
74
+ node _lumina/scripts/lint.mjs --json > /tmp/lumi-lint.json
75
+ node -e "
76
+ const j=JSON.parse(require('fs').readFileSync('/tmp/lumi-lint.json','utf8'));
77
+ const want=new Set(['L01-frontmatter-required','L11-confidence-missing']);
78
+ const hits=j.findings.filter(f=>want.has(f.id))
79
+ .map(f=>({id:f.id,file:f.file,message:f.message}));
80
+ console.log(JSON.stringify(hits,null,2));
81
+ "
61
82
  ```
62
83
 
63
- Parse the JSON output. Collect:
64
- - All `L01-frontmatter-required` findings (severity: error) these are entries
65
- with missing required fields.
84
+ The projected output (id + file + message only) is bounded and parseable. If
85
+ even that exceeds the buffer (very large wikis), read `/tmp/lumi-lint.json` with
86
+ the Read tool instead — Read paginates, Bash stdout does not.
87
+
88
+ Collect:
89
+ - All `L01-frontmatter-required` findings (severity: error) — entries with
90
+ missing required fields.
66
91
  - All `L11-confidence-missing` findings (severity: warning) — entries missing
67
92
  the optional-but-recommended `confidence` field.
68
93
 
@@ -107,9 +132,21 @@ Field: confidence (optional, sources + concepts)
107
132
  - concepts/softmax-temperature
108
133
  ```
109
134
 
110
- Report this plan to the user before proceeding. Ask for confirmation if the
111
- work list is large (more than 10 entries). For 10 or fewer, proceed without
112
- asking.
135
+ Always report this plan to the user before proceeding. For work lists of **30
136
+ or fewer entries**, continue without waiting for confirmation; small batches
137
+ are routine and the operation is safe to re-run. For **more than 30 entries**,
138
+ stop and ask the user to confirm before any writes. A large batch usually
139
+ means a long-dormant wiki or a major schema bump, and the user should have a
140
+ chance to spot-check the inference table before bulk changes land.
141
+
142
+ The safety net beneath this threshold:
143
+
144
+ - `set-meta` is atomic and idempotent — rerunning with a corrected value is
145
+ a single command, no rollback needed.
146
+ - The inference rubric falls back to `unverified` when evidence is ambiguous,
147
+ so wrong values err toward "honest about uncertainty," not overconfidence.
148
+ - Phase 4 re-runs lint and surfaces any remaining issues before clearing the
149
+ manifest flag.
113
150
 
114
151
  ### Phase 2 — Plan
115
152
 
@@ -127,13 +164,7 @@ existing fields (url, authors, year, type, etc.).
127
164
 
128
165
  **For `sources` entries also check:**
129
166
 
130
- 1. Whether a raw snapshot exists:
131
- ```bash
132
- ls raw/sources/<slug>* 2>/dev/null || echo "no snapshot"
133
- ls raw/discovered/<slug>* 2>/dev/null || echo "no snapshot"
134
- ```
135
-
136
- 2. Inbound citation/edge count (how many other entries link to this one):
167
+ 1. Inbound citation/edge count (how many other entries link to this one):
137
168
  ```bash
138
169
  grep -c '"target":"sources/<slug>"' wiki/graph/edges.jsonl 2>/dev/null || echo 0
139
170
  grep -c '"target":"sources/<slug>"' wiki/graph/citations.jsonl 2>/dev/null || echo 0
@@ -141,16 +172,42 @@ existing fields (url, authors, year, type, etc.).
141
172
 
142
173
  **Inference rubrics — apply these to decide values:**
143
174
 
144
- #### provenance (required on `sources`)
175
+ #### provenance + raw_paths (required on `sources`)
176
+
177
+ Use the following inference order. Stop at the first tier that yields a result.
178
+
179
+ **Tier 1 (authoritative): read the ingest checkpoint.**
180
+
181
+ ```bash
182
+ node _lumina/scripts/wiki.mjs checkpoint-read ingest <slug>
183
+ ```
184
+
185
+ If a checkpoint exists with a `source_path` field:
186
+ - If `source_path` is under `raw/tmp/*`: do NOT write `raw_paths`. Tell the user:
187
+ "`<slug>` was ingested from a transient location (`<source_path>`). Move the
188
+ file to `raw/sources/` or `raw/download/<resource>/` and re-run
189
+ `/lumi-migrate-legacy` to backfill `raw_paths` properly."
190
+ Set `provenance` to `partial` (if `urls` is non-empty) or `missing` (no `urls`).
191
+ - Otherwise: set `raw_paths` to `[source_path]` and `provenance` to `replayable`.
192
+ Skip Tiers 2 and 3.
193
+
194
+ **Tier 2 (heuristic): scan raw/ for matching files.**
195
+
196
+ - Slug-prefix match: `raw/sources/<slug>*`, `raw/notes/<slug>*`, or
197
+ `raw/download/<resource>/<slug>*`
198
+ - URL-derived ID match: extract arxiv ID, DOI, or URL basename from the page's
199
+ `urls` array (or legacy `url` string for backward compat); scan `raw/sources/`,
200
+ `raw/notes/`, `raw/download/**` for filenames containing that ID.
201
+ - Research-pack flow: also scan `raw/discovered/<topic>/<id>.json` for a JSON
202
+ whose `id` or `url` matches any entry in the page's `urls` array.
203
+
204
+ All non-`raw/tmp/` matches go into `raw_paths`. Set `provenance` to `replayable`
205
+ if any match was found.
145
206
 
146
- Pick the one that matches what you can verify:
207
+ **Tier 3 (fallback to the `urls` heuristic): no checkpoint, no file match.**
147
208
 
148
- - `replayable` A `url` field is present AND a raw snapshot exists under
149
- `raw/sources/` or `raw/discovered/`. The source can be re-verified end-to-end.
150
- - `partial` — A `url` field is present but no raw snapshot was saved. Drift
151
- detection works against the URL, but the full text cannot be re-grounded.
152
- - `missing` — No `url` field and no raw snapshot. Manual entry; verification
153
- has nothing to grip on.
209
+ - Has at least one entry in `urls` and no raw match → `partial` (leave `raw_paths` unset or `[]`)
210
+ - Neither → `missing`
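A sketch of the Tier 2 scan described above (hedged: the glob patterns follow the tier as written, the arxiv-ID extraction reuses the pattern from `fetch_pdf.py`, and a real run should prefer the Tier 1 checkpoint whenever it exists):

```python
# Sketch of Tier 2: slug-prefix and URL-derived-ID matches across the permanent
# raw/ zones. raw/tmp/ is never searched, so transient files cannot match.
import re
from pathlib import Path

ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/([0-9]{4}\.[0-9]{4,5}(?:v\d+)?)")

def tier2_raw_paths(project_root: str, slug: str, urls: list[str]) -> list[str]:
    root = Path(project_root)
    hits: list[Path] = []
    for pattern in (f"raw/sources/{slug}*", f"raw/notes/{slug}*",
                    f"raw/download/*/{slug}*"):
        hits += root.glob(pattern)
    for url in urls:
        m = ARXIV_RE.search(url)
        if m:
            hits += root.glob(f"raw/download/arxiv/{m.group(1)}*")
            hits += root.glob(f"raw/discovered/*/{m.group(1)}*.json")
    return sorted({p.relative_to(root).as_posix() for p in hits})
```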
154
211
 
155
212
  #### confidence (optional-but-recommended on `sources` and `concepts`)
156
213
 
@@ -178,11 +235,13 @@ After the read phase, produce an inference table:
178
235
 
179
236
  ```
180
237
  sources/attention-is-all-you-need:
181
- provenance: replayable (url present, raw/sources/attention-is-all-you-need.pdf found)
238
+ raw_paths: ["raw/sources/attention-is-all-you-need.pdf"] (Tier 1: checkpoint source_path)
239
+ provenance: replayable (raw_paths non-empty, file exists)
182
240
  confidence: high (7 inbound citations)
183
241
 
184
242
  sources/lora-2021:
185
- provenance: partial (url present, no raw snapshot)
243
+ raw_paths: [] (Tier 3: url present, no file match)
244
+ provenance: partial (url present, no resolvable raw_paths)
186
245
  confidence: unverified (0 inbound edges, no cross-checks)
187
246
 
188
247
  concepts/softmax-temperature:
@@ -197,8 +256,15 @@ For each entry in the inference table, set each missing field:
197
256
  node _lumina/scripts/wiki.mjs set-meta <slug> <key> "<value>"
198
257
  ```
199
258
 
259
+ For `raw_paths` (an array field), pass a JSON array with `--json-value`:
260
+
261
+ ```bash
262
+ node _lumina/scripts/wiki.mjs set-meta sources/<slug> raw_paths '["raw/sources/foo.pdf"]' --json-value
263
+ ```
264
+
200
265
  Examples:
201
266
  ```bash
267
+ node _lumina/scripts/wiki.mjs set-meta sources/attention-is-all-you-need raw_paths '["raw/sources/attention-is-all-you-need.pdf"]' --json-value
202
268
  node _lumina/scripts/wiki.mjs set-meta sources/attention-is-all-you-need provenance replayable
203
269
  node _lumina/scripts/wiki.mjs set-meta sources/attention-is-all-you-need confidence high
204
270
  node _lumina/scripts/wiki.mjs set-meta sources/lora-2021 provenance partial
@@ -209,6 +275,42 @@ node _lumina/scripts/wiki.mjs set-meta concepts/softmax-temperature confidence m
209
275
  `set-meta` is atomic (temp + fsync + rename) and idempotent — calling it twice
210
276
  with the same value is a no-op. It is safe to re-run this phase.
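For reference, the write pattern named here as a minimal sketch (illustrative only; `set-meta` implements this inside `wiki.mjs`):

```python
# Temp file + fsync + rename: readers see either the old file or the complete
# new file, never a partial write.
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise
```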
211
277
 
278
+ **Schema-shape upgrade — `url` → `urls` (v0.9+):**
279
+
280
+ For every source page that has a top-level `url:` key (singular string) in frontmatter,
281
+ rewrite it as `urls:` (array) and remove the old key. Preserve placement — keep `urls`
282
+ where `url` was.
283
+
284
+ ```bash
285
+ # Detect source pages that still have legacy url: (singular)
286
+ node _lumina/scripts/wiki.mjs list-entities | node -e "
287
+ const lines=require('fs').readFileSync('/dev/stdin','utf8').trim().split('\n');
288
+ const ents=lines.map(l=>{ try{return JSON.parse(l);}catch{return null;} }).filter(Boolean);
289
+ ents.filter(e=>e.type==='sources').forEach(e=>console.log(e.slug));
290
+ " | while read slug; do
291
+ node _lumina/scripts/wiki.mjs read-meta "$slug" | node -e "
292
+ const m=JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
293
+ if(m.url && !m.urls) console.log(process.argv[1]);
294
+ " "$slug"
295
+ done
296
+ ```
297
+
298
+ For each slug with a legacy `url:` field:
299
+
300
+ ```bash
301
+ # Step 1 — read current url value
302
+ URL=$(node _lumina/scripts/wiki.mjs read-meta sources/<slug> | node -e "process.stdout.write(JSON.parse(require('fs').readFileSync('/dev/stdin','utf8')).url)")
303
+
304
+ # Step 2 — write urls array
305
+ node _lumina/scripts/wiki.mjs set-meta sources/<slug> urls "[\"$URL\"]" --json-value
306
+
307
+ # Step 3 — remove legacy url key (requires set-meta --remove; fallback below)
308
+ node _lumina/scripts/wiki.mjs set-meta sources/<slug> url --remove
309
+ ```
310
+
311
+ If `set-meta --remove` is not supported by the installed wiki.mjs version, use `Edit` to
312
+ remove the `url:` line directly after confirming `urls:` was written successfully.
313
+
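If `--remove` is unavailable, a text-level alternative sketch (assumptions: the frontmatter uses the simple single-line `url: <value>` form from this pack's templates; the `set-meta` route above is preferred when it works, because it goes through the engine):

```python
# Sketch only: rewrite `url: <value>` to a `urls:` list in place, keeping the
# key's position in the frontmatter block. Skips pages already migrated.
import re
from pathlib import Path

def url_to_urls(page: Path) -> bool:
    text = page.read_text(encoding="utf-8")
    head, sep, body = text.partition("\n---")  # head = frontmatter before closing ---
    if "urls:" in head or not re.search(r"^url:\s*\S", head, re.M):
        return False  # already migrated, or no legacy url key
    new_head = re.sub(
        r"^url:\s*(\S.*)$",
        lambda m: "urls:\n  - " + m.group(1).strip().strip('"').strip("'"),
        head, count=1, flags=re.M,
    )
    page.write_text(new_head + sep + body, encoding="utf-8")
    return True
```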
212
314
  After backfilling all entries, proceed immediately to Phase 4.
213
315
 
214
316
  ### Phase 4 — Verify
@@ -216,12 +318,31 @@ After backfilling all entries, proceed immediately to Phase 4.
216
318
  **Step 4.1 — Re-run lint.**
217
319
 
218
320
  ```bash
219
- node _lumina/scripts/lint.mjs --json
321
+ node _lumina/scripts/lint.mjs --summary
220
322
  ```
221
323
 
222
- Confirm that all L01 errors from Phase 1 are resolved. L11 warnings for
324
+ Confirm `errors === 0`. If you need to inspect remaining findings, re-run with
325
+ `--json > /tmp/lumi-lint.json` and project as in Step 1.2 — never parse full
326
+ `--json` from inline stdout on a large wiki. L11 warnings for
223
327
  entries you set `confidence` on should also be gone.
224
328
 
329
+ Check for L12 warnings explicitly and surface them to the user:
330
+
331
+ ```bash
332
+ node -e "
333
+ const j=JSON.parse(require('fs').readFileSync('/tmp/lumi-lint.json','utf8'));
334
+ const l12=j.findings.filter(f=>f.id.startsWith('L12-'))
335
+ .map(f=>({file:f.file,message:f.message}));
336
+ if(l12.length) console.log('L12 raw_paths drift:\n'+JSON.stringify(l12,null,2));
337
+ else console.log('No L12 warnings.');
338
+ "
339
+ ```
340
+
341
+ L12 warnings mean one or more `raw_paths` entries point to files that do not
342
+ exist or are under `raw/tmp/`. Treat these as follow-up action items for the
343
+ user — the migration is not blocked, but the `raw_paths` value is inaccurate
344
+ until the referenced file is located or the entry is corrected.
345
+
225
346
  If any L01 errors remain:
226
347
  - Read the finding message — it names the exact field still missing.
227
348
  - Return to Phase 2 and infer a value for that field.
@@ -290,7 +411,7 @@ node _lumina/scripts/lint.mjs --json
290
411
 
291
412
  # Phase 2 — for each source:
292
413
  node _lumina/scripts/wiki.mjs read-meta sources/attention-is-all-you-need
293
- # → { url: "https://arxiv.org/abs/1706.03762", ... }
414
+ # → { urls: ["https://arxiv.org/abs/1706.03762"], ... }
294
415
  ls raw/sources/attention-is-all-you-need*
295
416
  # → raw/sources/attention-is-all-you-need.pdf (found)
296
417
  # → infer: provenance = replayable
@@ -69,12 +69,21 @@ python3 _lumina/tools/discover.py --help
69
69
  8. Present a checkpointed shortlist with title, authors/year, URL or identifier,
70
70
  `_score`, rationale, duplicate status, and recommended next action.
71
71
 
72
- For each candidate, include a suggested `provenance` value based on what you
73
- actually fetched. This helps the user (or `/lumi-ingest`) decide immediately
74
- rather than guessing later downstream verification depends on it:
75
- - `replayable` — URL fetched and raw snapshot saved to `raw/discovered/`.
76
- - `partial` — only a summary or abstract was retrieved; no full-text snapshot.
77
- - `missing` no URL available; metadata only (e.g. a manually entered title).
72
+ Discover writes JSON metadata to `raw/discovered/<topic>/<id>.json`. It does
73
+ NOT fetch PDFs; full-text download happens at ingest time via `/lumi-ingest`
74
+ Mode B, which calls `fetch_pdf.py` and places the PDF at
75
+ `raw/download/<resource>/<id>.<ext>`.
76
+
77
+ For each candidate, include a suggested `provenance` value (advisory; the
78
+ actual value is set by `/lumi-ingest` once the PDF is fetched). This helps
79
+ the user plan which sources are immediately accessible:
80
+ - `replayable` — abstract + full text both fetchable; `/lumi-ingest` will
81
+ download the PDF to `raw/download/` and resolve `raw_paths` at ingest time.
82
+ - `partial` — only abstract or metadata available (closed-access paper); no
83
+ full-text PDF reachable. `/lumi-ingest` will set `raw_paths` from the
84
+ metadata JSON only.
85
+ - `missing` — no URL; metadata only (e.g. a manually entered title). Nothing
86
+ to fetch; ingest will result in `provenance: missing`.
78
87
 
79
88
  9. Ask the user which candidates should be ingested. Do not create source pages
80
89
  or graph edges in this skill.
@@ -88,6 +97,8 @@ python3 _lumina/tools/discover.py --help
88
97
  `init_discovery.py`.
89
98
  - Do not include any non-FR35 workflows such as ideation, LaTeX writing, or
90
99
  orchestrator mode.
100
+ - Do not download PDFs. Discover writes metadata JSON to `raw/discovered/` only.
101
+ PDF fetching is `/lumi-ingest`'s job (Mode B, via `_lumina/tools/fetch_pdf.py`).
91
102
 
92
103
  ## Definition of Done
93
104
 
@@ -44,12 +44,16 @@ Keep this mental map in immediate context:
44
44
  - `raw/sources/` — `.pdf`, `.tex`, `.html`, `.md`, transcripts, anything ingested
45
45
  - `raw/notes/` — user's own markdown notes
46
46
  - `raw/assets/` — images and binary attachments
47
- - `raw/tmp/` — sidecar files generated by skills (additions only)
47
+ - `raw/tmp/` — sidecar files generated by skills (transient; do not store canonical sources here)
48
+ - `raw/download/<resource>/` — full-text artifacts auto-fetched by skills, partitioned by source
49
+ (e.g. `raw/download/arxiv/2604.03501v2.pdf`, `raw/download/doi/<doi>.pdf`).
50
+ Permanent agent-writable zone — keep separate from `raw/sources/` (human-curated).
48
51
  {{#if pack_research}}
49
- - `raw/discovered/`sources fetched by research pack tools (additions only, pack: research)
52
+ - `raw/discovered/<topic>/` — metadata JSON candidates from research-pack discovery
53
+ (additions only, pack: research). Holds `<paper-id>.json`; full-text PDFs go in `raw/download/`.
50
54
  {{/if}}
51
55
 
52
- **Rule:** never modify or delete an existing file under `raw/`. Files added by the user are authoritative and immutable to the agent. New files may only be *added*, only by a skill that documents this behavior, and only into `raw/tmp/`{{#if pack_research}} or `raw/discovered/`{{/if}}. Every other path under `raw/` is read-only.
56
+ **Rule:** never modify or delete an existing file under `raw/`. Files added by the user are authoritative and immutable to the agent. New files may only be *added*, only by a skill that documents this behavior, and only into `raw/tmp/`, `raw/download/`{{#if pack_research}}, or `raw/discovered/`{{/if}}. Every other path under `raw/` is read-only.
53
57
 
54
58
  ### `.agents/` is the skill source of truth
55
59
 
@@ -60,7 +64,7 @@ Keep this mental map in immediate context:
60
64
  - `_lumina/config/lumina.config.yaml` — workspace config; editable
61
65
  - `_lumina/schema/` — deeper reference docs; open when this file points you there
62
66
  - `_lumina/scripts/` — Node engine (`wiki.mjs`, `lint.mjs`, `reset.mjs`, `schemas.mjs`)
63
- - `_lumina/tools/` — Python tools (always: `extract_pdf.py`, `requirements.txt`{{#if pack_research}}; research pack adds `_env.py`, `prepare_source.py`, `init_discovery.py`, `discover.py`, and fetcher tools{{/if}})
67
+ - `_lumina/tools/` — Python tools (always: `extract_pdf.py`, `fetch_pdf.py`, `requirements.txt`{{#if pack_research}}; research pack adds `_env.py`, `prepare_source.py`, `init_discovery.py`, `discover.py`, and fetcher tools{{/if}})
64
68
  - `_lumina/_state/` — installer/skill checkpoint state; gitignored
65
69
  - `_lumina/manifest.json` — installer state; never edit by hand
66
70
 
@@ -190,6 +194,7 @@ Adds `/lumi-reading-chapter-ingest` (file a chapter, update characters/themes/pl
190
194
  - **`_lumina/scripts/wiki.mjs`** — wiki engine (frontmatter, graph mutation, slug, log).
191
195
  - **`_lumina/scripts/reset.mjs`** — scoped destructive reset.
192
196
  - **`_lumina/tools/extract_pdf.py`** — PDF text extractor (pypdf-based); used by `/lumi-ingest` and `/lumi-reading-chapter-ingest` when the host IDE cannot read PDFs natively.
197
+ - **`_lumina/tools/fetch_pdf.py`** — URL → `raw/download/<resource>/` PDF downloader (streaming, atomic, idempotent); used by `/lumi-ingest` Mode B when the input is a URL or paper identifier.
193
198
  - **`_lumina/tools/requirements.txt`** — Python dependencies for bundled tools. Run `pip install -r _lumina/tools/requirements.txt` when a tool reports a missing package.
194
199
  {{#if pack_research}}- **`_lumina/tools/_env.py`** — shared `.env` loader for research tools.
195
200
  - **`_lumina/tools/prepare_source.py`** — normalizes local source files into tool-readable JSON.
@@ -0,0 +1,416 @@
1
+ """
2
+ fetch_pdf.py — Download a PDF from a URL into the workspace landing zone.
3
+
4
+ CLI:
5
+ python fetch_pdf.py <url> [--project-root PATH] [--filename NAME] [--force]
6
+
7
+ Output (stdout, single JSON object on success):
8
+ {
9
+ "url": "<input url>",
10
+ "resolved_url": "<final url after redirects/normalization>",
11
+ "resource": "arxiv|doi|s2|web",
12
+ "id": "<extracted id>",
13
+ "path": "raw/download/arxiv/2604.03501v2.pdf",
14
+ "size_bytes": 12345,
15
+ "sha256": "<hex>",
16
+ "skipped": false
17
+ }
18
+
19
+ Errors emitted to stderr as JSON; exit codes:
20
+ 0 success (or skipped due to existing file)
21
+ 2 user error (empty url, malformed url, path traversal, HTML response)
22
+ 3 transient error (network failure, HTTP 5xx, timeout)
23
+
24
+ No API key required. All network calls use requests.Session().
25
+ Landing zone: raw/download/<resource>/<filename>
26
+ resource = arxiv | doi | s2 | web
27
+ """
28
+
29
+ from __future__ import annotations
30
+
31
+ import argparse
32
+ import hashlib
33
+ import json
34
+ import os
35
+ import re
36
+ import sys
37
+ import tempfile
38
+ from pathlib import Path
39
+ from typing import Any
40
+ from urllib.parse import urlparse
41
+
42
+ import requests
43
+
44
+ # ---------------------------------------------------------------------------
45
+ # Constants
46
+ # ---------------------------------------------------------------------------
47
+
48
+ USER_AGENT = "lumina-wiki/0.1 (research-pack; pdf fetcher)"
49
+ REQUEST_TIMEOUT = 60
50
+ MIN_PDF_SIZE = 100 # bytes — smaller responses are likely error pages
51
+ CHUNK_SIZE = 65536 # 64 KB chunks for streaming download
52
+
53
+ # Windows-illegal characters in filenames
54
+ _WIN_ILLEGAL_RE = re.compile(r'[<>:"/\\|?*]')
55
+
56
+ # Resource detection patterns — compiled once at module level
57
+ _ARXIV_ABS_RE = re.compile(
58
+ r"arxiv\.org/abs/([0-9]{4}\.[0-9]{4,5}(?:v\d+)?)", re.IGNORECASE
59
+ )
60
+ _ARXIV_PDF_RE = re.compile(
61
+ r"arxiv\.org/pdf/([0-9]{4}\.[0-9]{4,5}(?:v\d+)?)(?:\.pdf)?$", re.IGNORECASE
62
+ )
63
+ _DOI_RE = re.compile(r"(?:dx\.)?doi\.org/(.+)", re.IGNORECASE)
64
+ _S2_RE = re.compile(r"semanticscholar\.org/paper/([^/?#]+)", re.IGNORECASE)
65
+
66
+
67
+ # ---------------------------------------------------------------------------
68
+ # Helpers
69
+ # ---------------------------------------------------------------------------
70
+
71
+ def _err_json(msg: str, code: int) -> None:
72
+ """Print a JSON error to stderr."""
73
+ print(json.dumps({"error": msg, "code": code}), file=sys.stderr)
74
+
75
+
76
+ def _sha16_url(url: str) -> str:
77
+ """First 16 hex chars of SHA256 of URL — used as web resource ID."""
78
+ return hashlib.sha256(url.encode()).hexdigest()[:16]
79
+
80
+
81
+ def _sanitize_filename(name: str) -> str:
82
+ """Remove Windows-illegal characters from a filename."""
83
+ return _WIN_ILLEGAL_RE.sub("_", name)
84
+
85
+
86
+ def _safe_path(base: Path, rel: str, label: str) -> Path:
87
+ """Resolve rel under base; reject '..', absolute, or escaping paths."""
88
+ rel_path = Path(rel)
89
+ if rel_path.is_absolute():
90
+ _err_json(f"{label} must be a relative path, got: {rel}", 2)
91
+ sys.exit(2)
92
+ if ".." in rel_path.parts:
93
+ _err_json(f"{label} contains '..': {rel}", 2)
94
+ sys.exit(2)
95
+ resolved = (base / rel_path).resolve()
96
+ try:
97
+ resolved.relative_to(base.resolve())
98
+ except ValueError:
99
+ _err_json(f"{label} escapes base directory: {rel}", 2)
100
+ sys.exit(2)
101
+ return resolved
102
+
103
+
104
+ # ---------------------------------------------------------------------------
105
+ # Resource detection
106
+ # ---------------------------------------------------------------------------
107
+
108
+ def detect_resource(url: str) -> tuple[str, str, str]:
109
+ """Detect resource type and ID from URL.
110
+
111
+ Returns:
112
+ (resource, id, resolved_pdf_url)
113
+
114
+ resource is one of: arxiv, doi, s2, web
115
+ """
116
+ url = url.strip()
117
+
118
+ # arxiv abs
119
+ m = _ARXIV_ABS_RE.search(url)
120
+ if m:
121
+ arxiv_id = m.group(1)
122
+ pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
123
+ return "arxiv", arxiv_id, pdf_url
124
+
125
+ # arxiv pdf
126
+ m = _ARXIV_PDF_RE.search(url)
127
+ if m:
128
+ arxiv_id = m.group(1)
129
+ pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
130
+ return "arxiv", arxiv_id, pdf_url
131
+
132
+ # DOI
133
+ m = _DOI_RE.search(url)
134
+ if m:
135
+ doi_raw = m.group(1).rstrip("/")
136
+ doi_id = doi_raw.replace("/", "-")
137
+ return "doi", doi_id, url
138
+
139
+ # Semantic Scholar
140
+ m = _S2_RE.search(url)
141
+ if m:
142
+ s2_id = m.group(1)
143
+ return "s2", s2_id, url
144
+
145
+ # Web fallback
146
+ sha16 = _sha16_url(url)
147
+ return "web", sha16, url
148
+
149
+
150
+ def _derive_filename(resource: str, id_: str, content_type: str = "") -> str:
151
+ """Derive a default filename from resource/id.
152
+
153
+ For 'web', probes content_type for extension; defaults to .pdf.
154
+ """
155
+ if resource in ("arxiv", "doi", "s2"):
156
+ return _sanitize_filename(id_) + ".pdf"
157
+ # web
158
+ ext = ".pdf"
159
+ if content_type:
160
+ ct = content_type.lower().split(";")[0].strip()
161
+ if "octet-stream" in ct or "pdf" in ct:
162
+ ext = ".pdf"
163
+ # If it's something else we still default to .pdf per spec
164
+ return _sanitize_filename(id_) + ext
165
+
166
+
167
+ # ---------------------------------------------------------------------------
168
+ # Session factory
169
+ # ---------------------------------------------------------------------------
170
+
171
+ def _make_session() -> requests.Session:
172
+ session = requests.Session()
173
+ session.headers.update({"User-Agent": USER_AGENT})
174
+ return session
175
+
176
+
177
+ # ---------------------------------------------------------------------------
178
+ # Core download function
179
+ # ---------------------------------------------------------------------------
180
+
181
+ def fetch_pdf(
182
+ url: str,
183
+ project_root: Path,
184
+ filename: str | None = None,
185
+ force: bool = False,
186
+ session: requests.Session | None = None,
187
+ ) -> dict[str, Any]:
188
+ """Download a PDF from url into raw/download/<resource>/<filename>.
189
+
190
+ Args:
191
+ url: The source URL (arxiv abs/pdf, doi, s2, or generic web URL).
192
+ project_root: Absolute path to the project root.
193
+ filename: Override output filename (sanitized). If None, derived from resource/id.
194
+ force: If True, overwrite existing file. If False, skip if exists.
195
+ session: Optional requests.Session for connection reuse.
196
+
197
+ Returns:
198
+ Result dict (see module docstring).
199
+
200
+ Raises:
201
+ ValueError: on user errors (empty url, content-type mismatch, path traversal).
202
+ RuntimeError: on transient errors (network, HTTP 5xx, timeout).
203
+ requests.RequestException: on low-level network failure (caller re-raises).
204
+ """
205
+ url = url.strip()
206
+ if not url:
207
+ raise ValueError("url must not be empty")
208
+
209
+ parsed = urlparse(url)
210
+ if not parsed.scheme or not parsed.netloc:
211
+ raise ValueError(f"malformed url (no scheme or host): {url!r}")
212
+
213
+ resource, res_id, resolved_url = detect_resource(url)
214
+
215
+ sess = session or _make_session()
216
+
217
+ if filename is not None:
218
+ out_filename = _sanitize_filename(filename)
219
+ if not out_filename:
220
+ raise ValueError(f"--filename becomes empty after sanitization: {filename!r}")
221
+ else:
222
+ out_filename = None
223
+
224
+ rel_dir = f"raw/download/{resource}"
225
+ out_dir = _safe_path(project_root, rel_dir, "output directory")
226
+
227
+ if out_filename is None:
228
+ out_filename = _derive_filename(resource, res_id)
229
+
230
+ if "/" in out_filename or "\\" in out_filename or ".." in out_filename:
231
+ raise ValueError(f"filename contains path separators or '..': {out_filename!r}")
232
+
233
+ out_path = out_dir / out_filename
234
+
235
+ # Idempotency: skip if exists and not --force
236
+ if out_path.exists() and not force:
237
+ return {
238
+ "url": url,
239
+ "resolved_url": resolved_url,
240
+ "resource": resource,
241
+ "id": res_id,
242
+ "path": str(out_path.relative_to(project_root)),
243
+ "size_bytes": out_path.stat().st_size,
244
+ "sha256": _sha256_file(out_path),
245
+ "skipped": True,
246
+ "reason": "exists",
247
+ }
248
+
249
+ # Streaming download
250
+ resp = sess.get(resolved_url, timeout=REQUEST_TIMEOUT, allow_redirects=True, stream=True)
251
+
252
+ if resp.status_code >= 500:
253
+ raise RuntimeError(f"HTTP {resp.status_code} from server")
254
+ if resp.status_code == 404:
255
+ raise ValueError(f"HTTP 404: resource not found at {resolved_url}")
256
+ if resp.status_code >= 400:
257
+ raise ValueError(f"HTTP {resp.status_code} from server")
258
+ resp.raise_for_status()
259
+
260
+ content_type = resp.headers.get("Content-Type", "")
261
+
262
+ # For 'web' resource, refine the filename extension from content-type
263
+ if resource == "web" and filename is None:
264
+ out_filename = _derive_filename(resource, res_id, content_type)
265
+ out_path = out_dir / out_filename
266
+
267
+ ct_lower = content_type.lower().split(";")[0].strip()
268
+ url_ends_pdf = resolved_url.lower().endswith(".pdf")
269
+ is_pdf = ct_lower.startswith("application/pdf") or url_ends_pdf
270
+
271
+ if not is_pdf and ct_lower.startswith("text/html"):
272
+ raise ValueError(
273
+ f"expected PDF but server returned HTML (Content-Type: {content_type}); "
274
+ f"URL may be a landing page rather than a direct PDF link"
275
+ )
276
+
277
+ # Atomic write: temp + streaming + fsync + rename; SHA256 computed during download
278
+ out_dir.mkdir(parents=True, exist_ok=True)
279
+ fd, tmp_path_str = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
280
+ hasher = hashlib.sha256()
281
+ size = 0
282
+ try:
283
+ with os.fdopen(fd, "wb") as f:
284
+ for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
285
+ if chunk:
286
+ f.write(chunk)
287
+ hasher.update(chunk)
288
+ size += len(chunk)
289
+ f.flush()
290
+ os.fsync(f.fileno())
291
+ except Exception:
292
+ try:
293
+ os.unlink(tmp_path_str)
294
+ except OSError:
295
+ pass
296
+ raise
297
+
298
+ if size < MIN_PDF_SIZE:
299
+ try:
300
+ os.unlink(tmp_path_str)
301
+ except OSError:
302
+ pass
303
+ raise ValueError(
304
+ f"downloaded content is too small ({size} bytes < {MIN_PDF_SIZE}); "
305
+ f"likely an error page rather than a real PDF"
306
+ )
307
+
308
+ os.replace(tmp_path_str, out_path)
309
+
310
+ return {
311
+ "url": url,
312
+ "resolved_url": resp.url,
313
+ "resource": resource,
314
+ "id": res_id,
315
+ "path": str(out_path.relative_to(project_root)),
316
+ "size_bytes": size,
317
+ "sha256": hasher.hexdigest(),
318
+ "skipped": False,
319
+ }
320
+
321
+
322
+ def _sha256_file(path: Path) -> str:
323
+ """Compute SHA256 of an existing file."""
324
+ h = hashlib.sha256()
325
+ with path.open("rb") as f:
326
+ for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
327
+ h.update(chunk)
328
+ return h.hexdigest()
329
+
330
+
331
+ # ---------------------------------------------------------------------------
332
+ # CLI
333
+ # ---------------------------------------------------------------------------
334
+
335
+ def main(argv: list[str] | None = None) -> None:
336
+ parser = argparse.ArgumentParser(
337
+ prog="fetch_pdf.py",
338
+ description=(
339
+ "Download a PDF from a URL into raw/download/<resource>/<filename>. "
340
+ "Detects arxiv, DOI, Semantic Scholar, and generic web URLs."
341
+ ),
342
+ )
343
+ parser.add_argument("url", help="URL of the PDF to download.")
344
+ parser.add_argument(
345
+ "--project-root", default=None,
346
+ help="Project root directory (default: current directory).",
347
+ )
348
+ parser.add_argument(
349
+ "--filename", default=None,
350
+ help="Override output filename (default: derived from resource/id).",
351
+ )
352
+ parser.add_argument(
353
+ "--force", action="store_true",
354
+ help="Re-download and overwrite if file already exists.",
355
+ )
356
+
357
+ args = parser.parse_args(argv)
358
+
359
+ if not args.url or not args.url.strip():
360
+ _err_json("url must not be empty", 2)
361
+ sys.exit(2)
362
+
363
+ project_root = (
364
+ Path(args.project_root).resolve()
365
+ if args.project_root
366
+ else Path.cwd().resolve()
367
+ )
368
+
369
+ if args.filename is not None:
370
+ fn = args.filename
371
+ if "/" in fn or "\\" in fn or ".." in fn:
372
+ _err_json(
373
+ f"--filename must be a plain filename (no path separators or '..'): {fn!r}",
374
+ 2,
375
+ )
376
+ sys.exit(2)
377
+
378
+ session = _make_session()
379
+
380
+ try:
381
+ result = fetch_pdf(
382
+ url=args.url,
383
+ project_root=project_root,
384
+ filename=args.filename,
385
+ force=args.force,
386
+ session=session,
387
+ )
388
+ print(json.dumps(result, ensure_ascii=False, indent=2))
389
+ sys.exit(0)
390
+
391
+ except ValueError as exc:
392
+ _err_json(str(exc), 2)
393
+ sys.exit(2)
394
+ except requests.exceptions.ConnectionError as exc:
395
+ _err_json(f"Network error: {exc}", 3)
396
+ sys.exit(3)
397
+ except requests.exceptions.Timeout:
398
+ _err_json("Request timed out while downloading PDF.", 3)
399
+ sys.exit(3)
400
+ except requests.exceptions.HTTPError as exc:
401
+ code = exc.response.status_code if exc.response is not None else "unknown"
402
+ _err_json(f"HTTP error {code} while downloading PDF.", 3)
403
+ sys.exit(3)
404
+ except RuntimeError as exc:
405
+ _err_json(str(exc), 3)
406
+ sys.exit(3)
407
+ except OSError as exc:
408
+ _err_json(f"I/O error: {exc}", 3)
409
+ sys.exit(3)
410
+ except Exception as exc: # noqa: BLE001
411
+ _err_json(f"Internal error: {exc}", 3)
412
+ sys.exit(3)
413
+
414
+
415
+ if __name__ == "__main__":
416
+ main()