lumina-wiki 0.7.0 → 0.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -5,6 +5,37 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
5
5
 
6
6
  ## [Unreleased]
7
7
 
8
+ ## [0.8.0] - 2026-05-03
9
+
10
+ ### Added
11
+ - Schema: `raw_paths` field (array, optional) on `sources` — relative paths to permanent raw artifacts backing the source page (`raw/sources/*`, `raw/notes/*`, `raw/download/<resource>/*`, `raw/discovered/<topic>/*.json`). Replaces the implicit "URL is the anchor" semantic with an explicit pointer set; `/lumi-verify` Stage A (planned for v1.0) will read this directly instead of re-deriving it from heuristics.
12
+ - `raw/download/<resource>/` — permanent agent-writable zone for auto-fetched full-text artifacts, partitioned by source (`arxiv`, `doi`, `s2`, `web`). Distinct from `raw/tmp/` (transient) and `raw/sources/` (human-curated).
13
+ - `_lumina/tools/fetch_pdf.py` — CLI tool: download URL to `raw/download/<resource>/<filename>`, idempotent (skip on existing, `--force` to overwrite). Resource detection from URL pattern (arxiv/doi/s2/web). Atomic write (tempfile + fsync + rename). Used by `/lumi-ingest` Mode B.
14
+ - Lint check L12: warning when `raw_paths` entries point to a missing file, escape the project root, or live in `raw/tmp/*` (transient — should be moved to `raw/sources/` or `raw/download/`).
15
+ - `/lumi-ingest` Mode B: input may be a URL, arxiv ID, DOI, or paper title from discover shortlist. Skill resolves to URL, calls `fetch_pdf.py`, ingests from the resulting `raw/download/` path. Mode A (local file path) unchanged.
16
+
17
+ ### Changed
18
+ - Source frontmatter field `url: <string>` renamed to `urls: <array>` for symmetry with `raw_paths: array`. Multiple URLs supported per source (arxiv abs, DOI, repo, slides). Lint type validation expects `urls` to be an array; legacy `url` string entries flagged as unknown field. Migration handled by `/lumi-migrate-legacy` (detects and rewrites `url: <str>` → `urls: [<str>]`).
19
+ - Provenance semantics reframed to be raw-centric (enum unchanged, 3 values):
20
+ - `replayable` now requires `raw_paths` non-empty with at least one entry resolving to disk (URL is no longer a precondition — file-only sources qualify).
21
+ - `partial` requires at least one entry in `urls` and no resolvable `raw_paths`.
22
+ - `missing` unchanged.
23
+ Rubric updated in `/lumi-ingest`, `/lumi-research-discover`, `/lumi-migrate-legacy`.
24
+ - `/lumi-migrate-legacy` rubric: tier 1 reads ingest checkpoint (`_lumina/_state/ingest-<slug>.json`) for authoritative `source_path`; tier 2 falls back to slug-prefix and URL-derived-ID heuristics across `raw/sources/`, `raw/notes/`, `raw/download/**`, `raw/discovered/**`. Pages whose checkpoint points into `raw/tmp/*` are flagged for the user to relocate before backfill — skill does not auto-move human files.
25
+ - Manifest schema: `MANIFEST_SCHEMA_VERSION` 2 → 3. Migration is metadata-only (no manifest field shape change); workspace schema additions (`raw_paths`, `raw/download/`) are additive and backward-compatible — old wikis continue to lint clean (L12 warnings advisory only).
26
+ - `/lumi-migrate-legacy`: raised the work-list confirmation gate from 10 to 30 entries. Real wikis commonly have dozens of entries, and the original threshold made every migration a multi-turn chore. Lists ≤30 now proceed after the plan is reported; lists >30 still pause for explicit confirmation, since a large batch usually signals a long-dormant wiki or major schema bump worth spot-checking.
27
+
28
+ ### Fixed
29
+ - `/lumi-migrate-legacy`: Step 1.2 and Step 4.1 now use `lint.mjs --summary` for counts and write `--json` to `/tmp/lumi-lint.json` before projecting findings. Avoids the Bash-tool ~30KB stdout cap which truncated full `--json` mid-string on wikis with many findings, breaking inline `JSON.parse`.
30
+
31
+ ### Migration
32
+ - Existing source pages without `raw_paths`: no immediate action required. Lint stays green (`raw_paths` is optional, L12 only fires when present-but-broken).
33
+ - To backfill `raw_paths` on legacy entries, run `/lumi-migrate-legacy` after upgrading. The skill reads ingest checkpoints and applies the new tier-1/tier-2 rubric.
34
+ - If you have wiki sources currently pointing at `raw/tmp/arxiv-ingest/` or similar transient locations (a known artefact of pre-v0.8 agent improvisation): move those PDFs to `raw/download/arxiv/` (matching arxiv ID) or `raw/sources/` (custom-named), then re-run `/lumi-migrate-legacy`. Lint L12 will identify the affected pages.
35
+ - Custom tooling reading manifest: bump expected `schemaVersion` to 3 (or accept 2|3 transitionally — the manifest shape is unchanged).
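For custom tooling, a minimal version-gate sketch (assumptions: the manifest lives at `_lumina/manifest.json` and exposes the top-level `schemaVersion` field named above; `check_manifest_version` is a hypothetical helper, not part of the package):

```python
# Hypothetical helper for custom tooling: accept schemaVersion 2 or 3 during
# the transition, reject anything else. Manifest path per the workspace layout.
import json
from pathlib import Path

def check_manifest_version(project_root: str) -> int:
    manifest = json.loads(Path(project_root, "_lumina", "manifest.json").read_text())
    version = manifest.get("schemaVersion")
    if version not in (2, 3):
        raise RuntimeError(f"unsupported manifest schemaVersion: {version!r}")
    return version
```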
36
+
37
+ ## [0.7.0] - 2026-05-03
38
+
8
39
  ### Added
9
40
  - `/lumi-migrate-legacy` core skill — LLM-driven backfill of provenance/confidence
10
41
  - `CHANGELOG.md` shipped to `_lumina/CHANGELOG.md` for skill consumption
package/package.json CHANGED
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "$schema": "https://json.schemastore.org/package.json",
3
3
  "name": "lumina-wiki",
4
- "version": "0.7.0",
4
+ "version": "0.8.0",
5
5
  "description": "Domain-agnostic, multi-IDE wiki scaffolder — Karpathy's LLM-Wiki vision, cross-platform and pack-based.",
6
6
  "keywords": [
7
7
  "llm-wiki",
@@ -51,6 +51,7 @@
51
51
  "src/tools/init_discovery.py",
52
52
  "src/tools/prepare_source.py",
53
53
  "src/tools/fetch_arxiv.py",
54
+ "src/tools/fetch_pdf.py",
54
55
  "src/tools/fetch_wikipedia.py",
55
56
  "src/tools/fetch_s2.py",
56
57
  "src/tools/fetch_deepxiv.py",
@@ -110,7 +110,7 @@ const CORE_WIKI_DIRS = [
110
110
  const RESEARCH_WIKI_DIRS = ['wiki/foundations', 'wiki/topics'];
111
111
  const READING_WIKI_DIRS = ['wiki/chapters', 'wiki/characters', 'wiki/themes', 'wiki/plot'];
112
112
 
113
- const CORE_RAW_DIRS = ['raw/sources', 'raw/notes', 'raw/assets', 'raw/tmp'];
113
+ const CORE_RAW_DIRS = ['raw/sources', 'raw/notes', 'raw/assets', 'raw/tmp', 'raw/download'];
114
114
  const RESEARCH_RAW_DIRS = ['raw/discovered'];
115
115
 
116
116
  const LUMINA_DIRS = [
@@ -878,7 +878,7 @@ function getSkillDefs(packs) {
878
878
 
879
879
  async function copyTools(projectRoot, { research }) {
880
880
  const destDir = join(projectRoot, '_lumina', 'tools');
881
- const coreTools = ['extract_pdf.py'];
881
+ const coreTools = ['extract_pdf.py', 'fetch_pdf.py'];
882
882
  const researchTools = [
883
883
  '_env.py', 'discover.py', 'init_discovery.py', 'prepare_source.py',
884
884
  'fetch_arxiv.py', 'fetch_wikipedia.py', 'fetch_s2.py', 'fetch_deepxiv.py',
@@ -19,7 +19,7 @@ import { atomicWrite, ensureDir } from './fs.js';
19
19
  // Constants
20
20
  // ---------------------------------------------------------------------------
21
21
 
22
- export const MANIFEST_SCHEMA_VERSION = 2;
22
+ export const MANIFEST_SCHEMA_VERSION = 3;
23
23
 
24
24
  export const SKILLS_CSV_HEADER = 'canonical_id,display_name,pack,source,relative_path,target_link_path,version';
25
25
  export const FILES_CSV_HEADER = 'relative_path,sha256,source_pack,installed_version';
@@ -293,6 +293,11 @@ export async function writeFilesManifest(projectRoot, rows) {
293
293
  */
294
294
  const MIGRATIONS = {
295
295
  '1->2': (m) => ({ ...m, legacyMigrationNeeded: true }),
296
+ // 2->3 (v0.8): workspace schema additions — raw_paths field, raw/download/ dir,
297
+ // lint L12, source frontmatter url (string) -> urls (array). All additive /
298
+ // backward-compatible at the manifest level. Wiki content migration is handled
299
+ // by /lumi-migrate-legacy, not by the installer. No manifest shape change.
300
+ '2->3': (m) => ({ ...m }),
296
301
  };
297
302
 
298
303
  /**
@@ -67,7 +67,7 @@ const INDEX_MARKER_OPEN = '<!-- lumina:index -->';
67
67
  const INDEX_MARKER_CLOSE = '<!-- /lumina:index -->';
68
68
 
69
69
  /** All check IDs in run order. */
70
- const ALL_CHECK_IDS = ['L01', 'L02', 'L03', 'L04', 'L05', 'L06', 'L07', 'L08', 'L09', 'L10', 'L11'];
70
+ const ALL_CHECK_IDS = ['L01', 'L02', 'L03', 'L04', 'L05', 'L06', 'L07', 'L08', 'L09', 'L10', 'L11', 'L12'];
71
71
 
72
72
  /** Kebab-case pattern: lowercase letters, digits, hyphens; no leading/trailing hyphen. */
73
73
  const KEBAB_RE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
@@ -667,6 +667,62 @@ function checkL10(foundationEntries) {
667
667
  return findings;
668
668
  }
669
669
 
670
+ /**
671
+ * L12: `raw_paths` entries on a `sources` page point to a missing file, live in
672
+ * `raw/tmp/*` (transient location — canonical sources should not live there), or escape the project root.
673
+ * Severity: warning. Not auto-fixable. Catches drift when the user moves or
674
+ * renames a backing file, and flags the common mistake of pinning a wiki page
675
+ * to a temp-zone artifact that may be cleaned at any time.
676
+ *
677
+ * @param {string} wikiRelPath
678
+ * @param {Record<string,unknown>} fm
679
+ * @param {string} projectRoot Absolute path; used to resolve raw_paths entries.
680
+ * @returns {Promise<Finding[]>}
681
+ */
682
+ async function checkL12(wikiRelPath, fm, projectRoot) {
683
+ const type = entityTypeForPath(wikiRelPath);
684
+ if (type !== 'sources') return [];
685
+
686
+ const rawPaths = fm.raw_paths;
687
+ if (!Array.isArray(rawPaths) || rawPaths.length === 0) return [];
688
+
689
+ const findings = [];
690
+ for (const entry of rawPaths) {
691
+ if (typeof entry !== 'string' || entry === '') continue;
692
+
693
+ // Reject paths inside raw/tmp/ — transient zone, not for canonical sources.
694
+ if (entry.startsWith('raw/tmp/') || entry.startsWith('./raw/tmp/')) {
695
+ findings.push(finding(
696
+ 'L12-raw-paths-transient', 'warning', false,
697
+ wikiRelPath, null,
698
+ `raw_paths entry "${entry}" lives in raw/tmp/ — transient. Move the file to raw/sources/ (human) or raw/download/<resource>/ (agent) and update raw_paths.`
699
+ ));
700
+ continue;
701
+ }
702
+
703
+ // Verify file exists on disk (relative to project root).
704
+ const abs = resolve(projectRoot, entry);
705
+ if (!abs.startsWith(resolve(projectRoot))) {
706
+ findings.push(finding(
707
+ 'L12-raw-paths-unsafe', 'warning', false,
708
+ wikiRelPath, null,
709
+ `raw_paths entry "${entry}" escapes the project root`
710
+ ));
711
+ continue;
712
+ }
713
+ try {
714
+ await access(abs, fsConstants.F_OK);
715
+ } catch {
716
+ findings.push(finding(
717
+ 'L12-raw-paths-missing', 'warning', false,
718
+ wikiRelPath, null,
719
+ `raw_paths entry "${entry}" does not exist on disk`
720
+ ));
721
+ }
722
+ }
723
+ return findings;
724
+ }
725
+
670
726
  /**
671
727
  * L11: `confidence` field missing on a `sources` or `concepts` entity.
672
728
  * Severity: warning. Not auto-fixable. Sets an explicit trust signal that
@@ -943,6 +999,7 @@ async function runLint(projectRoot, opts) {
943
999
  allFindings.push(...checkL04(wikiRelPath, outboundMap.get(wikiRelPath) || new Set(), inboundSet));
944
1000
  allFindings.push(...checkL05(wikiRelPath, content, knownSlugs));
945
1001
  allFindings.push(...checkL11(wikiRelPath, fm));
1002
+ allFindings.push(...await checkL12(wikiRelPath, fm, projectRoot));
946
1003
  }
947
1004
 
948
1005
  allFindings.push(...checkL06(edges, new Set(edgeSet)));
@@ -1243,7 +1300,7 @@ export {
1243
1300
  isExempt,
1244
1301
  entityTypeForPath,
1245
1302
  checkL01, checkL02, checkL03, checkL04, checkL05,
1246
- checkL06, checkL07, checkL08, checkL09, checkL10, checkL11,
1303
+ checkL06, checkL07, checkL08, checkL09, checkL10, checkL11, checkL12,
1247
1304
  fixL01, fixL03, fixL06, fixL07, fixL09,
1248
1305
  runLint,
1249
1306
  reportSummary,
@@ -118,6 +118,7 @@ export const RAW_DIRS = {
118
118
  notes: 'core',
119
119
  assets: 'core',
120
120
  tmp: 'core',
121
+ download: 'core',
121
122
 
122
123
  // research pack
123
124
  discovered: 'research',
@@ -257,7 +258,8 @@ export const REQUIRED_FRONTMATTER = {
257
258
  { key: 'authors', type: 'array', required: true },
258
259
  { key: 'year', type: 'number', required: true },
259
260
  { key: 'importance', type: 'enum', required: true, values: [1, 2, 3, 4, 5] },
260
- { key: 'url', type: 'string', required: false },
261
+ { key: 'urls', type: 'array', required: false },
262
+ { key: 'raw_paths', type: 'array', required: false },
261
263
  { key: 'provenance', type: 'enum', required: true, values: ['replayable', 'partial', 'missing'] },
262
264
  { key: 'confidence', type: 'enum', required: false, values: ['high', 'medium', 'low', 'unverified'] },
263
265
  ],
@@ -8,8 +8,13 @@ description: >
8
8
  into the wiki", "create a wiki page for", or drops a filename from raw/sources/.
9
9
  Also fires for: "I added a PDF to raw/sources/", "add this paper to the wiki",
10
10
  "parse this article", "what should I do with raw/sources/X?", or any request
11
- to bring a new source into the wiki graph. This is the most-used skill — when
12
- in doubt about whether something is an ingest vs an edit, ask the user.
11
+ to bring a new source into the wiki graph.
12
+ Also accepts Mode B input: a paper title, arxiv ID, or URL, without a local
13
+ file path. Examples: "ingest paper 2604.03501v2", "ingest arxiv:2604.03501",
14
+ "ingest https://arxiv.org/abs/2604.03501". The skill fetches the PDF
15
+ automatically in that case.
16
+ This is the most-used skill — when in doubt about whether something is an
17
+ ingest vs an edit, ask the user.
13
18
  allowed-tools:
14
19
  - Bash
15
20
  - Read
@@ -34,6 +39,7 @@ depends on bidirectional-link discipline.
34
39
 
35
40
  Key workspace paths:
36
41
  - `raw/sources/` — immutable user-provided sources; you read but never modify
42
+ - `raw/download/<resource>/` — agent-writable, permanent; auto-fetched PDFs land here (resource = `arxiv | doi | s2 | web`)
37
43
  - `wiki/sources/` — one page per ingested source (you write this)
38
44
  - `wiki/concepts/`, `wiki/people/` — you create or update stubs here
39
45
  - `wiki/index.md` — updated on every ingest
@@ -79,6 +85,43 @@ If a checkpoint exists and `phase` is not `"done"`, ask the user whether to resu
79
85
  or restart. Resuming skips completed phases. Restarting deletes the checkpoint and
80
86
  starts from Phase 1.
81
87
 
88
+ ### Phase 0.5 — Resolve input
89
+
90
+ Three input modes:
91
+
92
+ **Mode A — Local file path** (e.g. `raw/sources/foo.pdf`, `raw/notes/bar.md`)
93
+
94
+ Use directly as `source_path`. Proceed to Phase 1 to detect type from this file.
95
+
96
+ **Mode B — URL or identifier** (arxiv ID like `2604.03501v2`, arxiv URL, DOI, S2 paper ID, generic URL)
97
+
98
+ ```bash
99
+ python3 _lumina/tools/fetch_pdf.py "<url-or-id>"
100
+ ```
101
+
102
+ - For bare arxiv ID: pass `https://arxiv.org/abs/<id>`
103
+ - For DOI: pass `https://doi.org/<doi>`
104
+
105
+ On exit 0: read JSON output. Use the returned `path` as `source_path`. Write the input URL as
106
+ the first entry of `urls` on the source page; additional URLs (DOI, repo, slides, etc.) can be
107
+ appended after. Proceed to Phase 1.
108
+
109
+ On exit 2 (HTML response): the URL likely points to a paywall or non-PDF page.
110
+ Report to user and ask for a direct PDF URL or manual download. Do not proceed
111
+ with ingest until a valid file path is available.
112
+
113
+ On exit 3 (network error): retry once. If the second attempt also fails, surface
114
+ the error to the user with the exact message from the tool output.
115
+
116
+ **Mode C — Title only** (e.g. from a `/lumi-research-discover` shortlist)
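For orchestration outside the skill, a sketch of the same step driven programmatically (hedged: `fetch_for_ingest` is a hypothetical wrapper that simply mirrors the exit-code contract and JSON `path` field documented above; the skill itself issues the Bash command shown):

```python
# Sketch only: wraps the fetch_pdf.py CLI and maps its documented exit codes
# (0 ok, 2 user error such as an HTML/paywall response, 3 transient network
# error) onto exceptions.
import json
import subprocess

def fetch_for_ingest(url: str) -> str:
    # Expand bare arxiv IDs / DOIs into full URLs first, as noted above.
    proc = subprocess.run(
        ["python3", "_lumina/tools/fetch_pdf.py", url],
        capture_output=True, text=True,
    )
    if proc.returncode == 0:
        return json.loads(proc.stdout)["path"]  # use as source_path for Phase 1
    if proc.returncode == 2:
        raise ValueError(proc.stderr.strip())   # ask the user for a direct PDF URL
    raise RuntimeError(proc.stderr.strip())     # exit 3: retry once, then surface
```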
117
+
118
+ ```bash
119
+ node _lumina/scripts/wiki.mjs checkpoint-read research-discover shortlist
120
+ ```
121
+
122
+ Match the title to a shortlist entry and extract the URL from that entry.
123
+ Fall through to Mode B with that URL.
124
+
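A sketch of the title-matching step (hedged: the field names `title` and `url` are assumed from the discover shortlist description, not a fixed schema, and `difflib` fuzzy matching stands in for the judgment the agent actually applies):

```python
# Sketch only: pick the shortlist entry whose title best matches the user's
# phrase, then fall through to Mode B with its URL.
import difflib

def url_from_shortlist(user_title: str, shortlist: list[dict]) -> str | None:
    titles = [entry.get("title", "") for entry in shortlist]
    best = difflib.get_close_matches(user_title, titles, n=1, cutoff=0.6)
    if not best:
        return None  # no confident match: ask the user instead of guessing
    return next(e.get("url") for e in shortlist if e.get("title") == best[0])
```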
82
125
  ### Phase 1 — Detect type
83
126
 
84
127
  Read the file header (first ~200 lines). Classify as one of:
@@ -115,7 +158,7 @@ drafting the page.
115
158
  Draft `wiki/sources/<slug>.md` using the Source
116
159
  template from `_lumina/schema/page-templates.md` (open it when in doubt about
117
160
  required fields). Required frontmatter fields: `id`, `title`, `type`, `created`,
118
- `updated`, `authors`, `year`, `importance` (1-5), `url` (optional).
161
+ `updated`, `authors`, `year`, `importance` (1-5), `urls` (optional, array).
119
162
 
120
163
  Required body sections: `## Summary` (2-4 sentences), `## Key Claims` (bulleted
121
164
  with confidence level), `## Concepts` (all `[[concept-slug]]` links), `## People`
@@ -129,20 +172,36 @@ verification (Stage A/B/C of `/lumi-verify`, planned for v1.0). An explicit deci
129
172
  is more useful than a silently-defaulted value because verification needs to know
130
173
  whether it can re-check the material end-to-end.
131
174
 
132
- Provenance rubric — pick the one that matches what you actually did:
133
- - `replayable` — you fetched the URL and saved the raw snapshot under `raw/`. The
134
- source can be re-verified end-to-end against the original.
135
- - `partial` — you kept only a summary; no raw text snapshot was saved. Drift
136
- detection works against the URL, but grounding cannot be re-checked.
137
- - `missing` — manual entry: no URL, no raw snapshot. Verification has nothing to
138
- grip on.
175
+ Provenance rubric — raw-centric; pick the one that fits:
176
+ - `replayable` — `raw_paths` is non-empty AND every entry resolves to an existing file.
177
+ Source can be re-grounded end-to-end against these files.
178
+ - `partial` — has at least one entry in `urls` but `raw_paths` is empty or every listed entry is missing.
179
+ Drift detection works against the URL, but grounding cannot be re-checked.
180
+ - `missing` — no `urls`, no `raw_paths`. Manual entry; verification has nothing to grip on.
181
+
182
+ A source can have multiple URLs (arxiv abs + DOI + repo + slides). List the most authoritative
183
+ first; the agent uses `urls[0]` when a single canonical URL is needed (e.g., for `fetch_pdf.py` Mode B).
184
+
185
+ Set `raw_paths` to the list of permanent raw artifacts backing this page:
186
+ - The primary file passed to ingest (`raw/sources/*`, `raw/notes/*`, or
187
+ `raw/download/<resource>/*` from Mode B).
188
+ - Any matching metadata JSON in `raw/discovered/<topic>/<id>.json` (research pack).
189
+ Match by paper ID (arxiv ID, DOI) extracted from the source's URL or filename.
190
+
191
+ Do NOT include `raw/tmp/*` entries — that zone is transient (lint L12 warns).
192
+ Do NOT include files outside `raw/`.
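As a compact restatement of the rubric above, a sketch (not shipped with the skill; assumes `urls` and `raw_paths` are already parsed out of frontmatter and that `raw_paths` entries are relative to the project root):

```python
# Sketch of the provenance rubric: replayable needs every raw_paths entry on
# disk, partial needs at least one URL, missing has neither.
from pathlib import Path

def classify_provenance(urls: list[str], raw_paths: list[str], project_root: str) -> str:
    resolvable = [p for p in raw_paths if (Path(project_root) / p).is_file()]
    if raw_paths and len(resolvable) == len(raw_paths):
        return "replayable"  # can be re-grounded end-to-end
    if urls:
        return "partial"     # drift check possible, grounding is not
    return "missing"         # nothing for verification to grip on
```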
139
193
 
140
194
  Also set `confidence:` (optional but encouraged). Use `high | medium | low | unverified`.
141
195
  Default to `unverified` for fresh ingests; bump up only after you have cross-checked
142
196
  the claims or the user has confirmed them.
143
197
 
144
- Example frontmatter with both fields:
198
+ Example frontmatter:
145
199
  ```yaml
200
+ urls:
201
+ - https://arxiv.org/abs/2604.03501v2
202
+ raw_paths:
203
+ - raw/download/arxiv/2604.03501v2.pdf
204
+ - raw/discovered/ai-economics/2604.03501v2.json
146
205
  provenance: replayable
147
206
  confidence: unverified
148
207
  ```
@@ -303,6 +362,7 @@ Link added to `## Concepts` in `wiki/sources/rlhf-overview.md`:
303
362
  - Keep a checkpoint after every phase — an interrupted ingest must be resumable.
304
363
  - If the source is too large to fully read, read in sections and checkpoint between them.
305
364
  - `raw/tmp/` accepts additions only; never overwrite a file there.
365
+ - `raw_paths` must list permanent artifacts only. Reject `raw/tmp/*` entries.
306
366
 
307
367
  ## Definition of Done
308
368
 
@@ -56,13 +56,38 @@ Proceed to the lint check regardless.
56
56
 
57
57
  **Step 1.2 — Run lint (read-only pass).**
58
58
 
59
+ First, get aggregate counts (tiny output, always safe):
60
+
59
61
  ```bash
60
- node _lumina/scripts/lint.mjs --json
62
+ node _lumina/scripts/lint.mjs --summary
63
+ ```
64
+
65
+ If `errors === 0` and `by_check.L11` is `0` or absent, skip to the clean-exit
66
+ branch below. Otherwise, you need the per-entry findings.
67
+
68
+ **Important — do NOT pipe `--json` straight into a heredoc.** On a large wiki
69
+ the full findings JSON can exceed the shell tool's ~30KB stdout buffer and get
70
+ truncated mid-string, breaking JSON.parse. Instead, write it to a temp file
71
+ and read filtered slices:
72
+
73
+ ```bash
74
+ node _lumina/scripts/lint.mjs --json > /tmp/lumi-lint.json
75
+ node -e "
76
+ const j=JSON.parse(require('fs').readFileSync('/tmp/lumi-lint.json','utf8'));
77
+ const want=new Set(['L01-frontmatter-required','L11-confidence-missing']);
78
+ const hits=j.findings.filter(f=>want.has(f.id))
79
+ .map(f=>({id:f.id,file:f.file,message:f.message}));
80
+ console.log(JSON.stringify(hits,null,2));
81
+ "
61
82
  ```
62
83
 
63
- Parse the JSON output. Collect:
64
- - All `L01-frontmatter-required` findings (severity: error) these are entries
65
- with missing required fields.
84
+ The projected output (id + file + message only) is bounded and parseable. If
85
+ even that exceeds the buffer (very large wikis), read `/tmp/lumi-lint.json` with
86
+ the Read tool instead — Read paginates, Bash stdout does not.
87
+
88
+ Collect:
89
+ - All `L01-frontmatter-required` findings (severity: error) — entries with
90
+ missing required fields.
66
91
  - All `L11-confidence-missing` findings (severity: warning) — entries missing
67
92
  the optional-but-recommended `confidence` field.
68
93
 
@@ -107,9 +132,21 @@ Field: confidence (optional, sources + concepts)
107
132
  - concepts/softmax-temperature
108
133
  ```
109
134
 
110
- Report this plan to the user before proceeding. Ask for confirmation if the
111
- work list is large (more than 10 entries). For 10 or fewer, proceed without
112
- asking.
135
+ Always report this plan to the user before proceeding. For work lists of **30
136
+ or fewer entries**, continue without waiting for confirmation; small batches
137
+ are routine and the operation is safe to re-run. For **more than 30 entries**,
138
+ stop and ask the user to confirm before any writes. A large batch usually
139
+ means a long-dormant wiki or a major schema bump, and the user should have a
140
+ chance to spot-check the inference table before bulk changes land.
141
+
142
+ The safety net beneath this threshold:
143
+
144
+ - `set-meta` is atomic and idempotent — rerunning with a corrected value is
145
+ a single command, no rollback needed.
146
+ - The inference rubric falls back to `unverified` when evidence is ambiguous,
147
+ so wrong values err toward "honest about uncertainty," not overconfidence.
148
+ - Phase 4 re-runs lint and surfaces any remaining issues before clearing the
149
+ manifest flag.
113
150
 
114
151
  ### Phase 2 — Plan
115
152
 
@@ -127,13 +164,7 @@ existing fields (url, authors, year, type, etc.).
127
164
 
128
165
  **For `sources` entries also check:**
129
166
 
130
- 1. Whether a raw snapshot exists:
131
- ```bash
132
- ls raw/sources/<slug>* 2>/dev/null || echo "no snapshot"
133
- ls raw/discovered/<slug>* 2>/dev/null || echo "no snapshot"
134
- ```
135
-
136
- 2. Inbound citation/edge count (how many other entries link to this one):
167
+ 1. Inbound citation/edge count (how many other entries link to this one):
137
168
  ```bash
138
169
  grep -c '"target":"sources/<slug>"' wiki/graph/edges.jsonl 2>/dev/null || echo 0
139
170
  grep -c '"target":"sources/<slug>"' wiki/graph/citations.jsonl 2>/dev/null || echo 0
@@ -141,16 +172,42 @@ existing fields (url, authors, year, type, etc.).
141
172
 
142
173
  **Inference rubrics — apply these to decide values:**
143
174
 
144
- #### provenance (required on `sources`)
175
+ #### provenance + raw_paths (required on `sources`)
176
+
177
+ Use the following inference order. Stop at the first tier that yields a result.
178
+
179
+ **Tier 1 (authoritative): read the ingest checkpoint.**
180
+
181
+ ```bash
182
+ node _lumina/scripts/wiki.mjs checkpoint-read ingest <slug>
183
+ ```
184
+
185
+ If a checkpoint exists with a `source_path` field:
186
+ - If `source_path` is under `raw/tmp/*`: do NOT write `raw_paths`. Tell the user:
187
+ "`<slug>` was ingested from a transient location (`<source_path>`). Move the
188
+ file to `raw/sources/` or `raw/download/<resource>/` and re-run
189
+ `/lumi-migrate-legacy` to backfill `raw_paths` properly."
190
+ Set `provenance` to `partial` (if `urls` is non-empty) or `missing` (no `urls`).
191
+ - Otherwise: set `raw_paths` to `[source_path]` and `provenance` to `replayable`.
192
+ Skip Tiers 2 and 3.
193
+
194
+ **Tier 2 (heuristic): scan raw/ for matching files.**
195
+
196
+ - Slug-prefix match: `raw/sources/<slug>*`, `raw/notes/<slug>*`, or
197
+ `raw/download/<resource>/<slug>*`
198
+ - URL-derived ID match: extract arxiv ID, DOI, or URL basename from the page's
199
+ `urls` array (or legacy `url` string for backward compat); scan `raw/sources/`,
200
+ `raw/notes/`, `raw/download/**` for filenames containing that ID.
201
+ - Research-pack flow: also scan `raw/discovered/<topic>/<id>.json` for a JSON
202
+ whose `id` or `url` matches any entry in the page's `urls` array.
203
+
204
+ All non-`raw/tmp/` matches go into `raw_paths`. Set `provenance` to `replayable`
205
+ if any match was found.
145
206
 
146
- Pick the one that matches what you can verify:
207
+ **Tier 3 (fallback to the `urls` heuristic): no checkpoint, no file match.**
147
208
 
148
- - `replayable` A `url` field is present AND a raw snapshot exists under
149
- `raw/sources/` or `raw/discovered/`. The source can be re-verified end-to-end.
150
- - `partial` — A `url` field is present but no raw snapshot was saved. Drift
151
- detection works against the URL, but the full text cannot be re-grounded.
152
- - `missing` — No `url` field and no raw snapshot. Manual entry; verification
153
- has nothing to grip on.
209
+ - Has at least one entry in `urls` and no raw match → `partial` (leave `raw_paths` unset or `[]`)
210
+ - Neither → `missing`
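A sketch of the Tier 2 scan described above (hedged: the glob patterns follow the tier as written, the arxiv-ID extraction reuses the pattern from `fetch_pdf.py`, and a real run should prefer the Tier 1 checkpoint whenever it exists):

```python
# Sketch of Tier 2: slug-prefix and URL-derived-ID matches across the permanent
# raw/ zones. raw/tmp/ is never searched, so transient files cannot match.
import re
from pathlib import Path

ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/([0-9]{4}\.[0-9]{4,5}(?:v\d+)?)")

def tier2_raw_paths(project_root: str, slug: str, urls: list[str]) -> list[str]:
    root = Path(project_root)
    hits: list[Path] = []
    for pattern in (f"raw/sources/{slug}*", f"raw/notes/{slug}*",
                    f"raw/download/*/{slug}*"):
        hits += root.glob(pattern)
    for url in urls:
        m = ARXIV_RE.search(url)
        if m:
            hits += root.glob(f"raw/download/arxiv/{m.group(1)}*")
            hits += root.glob(f"raw/discovered/*/{m.group(1)}*.json")
    return sorted({p.relative_to(root).as_posix() for p in hits})
```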
154
211
 
155
212
  #### confidence (optional-but-recommended on `sources` and `concepts`)
156
213
 
@@ -178,11 +235,13 @@ After the read phase, produce an inference table:
178
235
 
179
236
  ```
180
237
  sources/attention-is-all-you-need:
181
- provenance: replayable (url present, raw/sources/attention-is-all-you-need.pdf found)
238
+ raw_paths: ["raw/sources/attention-is-all-you-need.pdf"] (Tier 1: checkpoint source_path)
239
+ provenance: replayable (raw_paths non-empty, file exists)
182
240
  confidence: high (7 inbound citations)
183
241
 
184
242
  sources/lora-2021:
185
- provenance: partial (url present, no raw snapshot)
243
+ raw_paths: [] (Tier 3: url present, no file match)
244
+ provenance: partial (url present, no resolvable raw_paths)
186
245
  confidence: unverified (0 inbound edges, no cross-checks)
187
246
 
188
247
  concepts/softmax-temperature:
@@ -197,8 +256,15 @@ For each entry in the inference table, set each missing field:
197
256
  node _lumina/scripts/wiki.mjs set-meta <slug> <key> "<value>"
198
257
  ```
199
258
 
259
+ For `raw_paths` (an array field), pass a JSON array with `--json-value`:
260
+
261
+ ```bash
262
+ node _lumina/scripts/wiki.mjs set-meta sources/<slug> raw_paths '["raw/sources/foo.pdf"]' --json-value
263
+ ```
264
+
200
265
  Examples:
201
266
  ```bash
267
+ node _lumina/scripts/wiki.mjs set-meta sources/attention-is-all-you-need raw_paths '["raw/sources/attention-is-all-you-need.pdf"]' --json-value
202
268
  node _lumina/scripts/wiki.mjs set-meta sources/attention-is-all-you-need provenance replayable
203
269
  node _lumina/scripts/wiki.mjs set-meta sources/attention-is-all-you-need confidence high
204
270
  node _lumina/scripts/wiki.mjs set-meta sources/lora-2021 provenance partial
@@ -209,6 +275,42 @@ node _lumina/scripts/wiki.mjs set-meta concepts/softmax-temperature confidence m
209
275
  `set-meta` is atomic (temp + fsync + rename) and idempotent — calling it twice
210
276
  with the same value is a no-op. It is safe to re-run this phase.
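For reference, the write pattern named here as a minimal sketch (illustrative only; `set-meta` implements this inside `wiki.mjs`):

```python
# Temp file + fsync + rename: readers see either the old file or the complete
# new file, never a partial write.
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise
```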
211
277
 
278
+ **Schema-shape upgrade — `url` → `urls` (v0.9+):**
279
+
280
+ For every source page that has a top-level `url:` key (singular string) in frontmatter,
281
+ rewrite it as `urls:` (array) and remove the old key. Preserve placement — keep `urls`
282
+ where `url` was.
283
+
284
+ ```bash
285
+ # Detect source pages that still have legacy url: (singular)
286
+ node _lumina/scripts/wiki.mjs list-entities | node -e "
287
+ const lines=require('fs').readFileSync('/dev/stdin','utf8').trim().split('\n');
288
+ const ents=lines.map(l=>{ try{return JSON.parse(l);}catch{return null;} }).filter(Boolean);
289
+ ents.filter(e=>e.type==='sources').forEach(e=>console.log(e.slug));
290
+ " | while read slug; do
291
+ node _lumina/scripts/wiki.mjs read-meta "$slug" | node -e "
292
+ const m=JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
293
+ if(m.url && !m.urls) console.log(process.argv[1]);
294
+ " "$slug"
295
+ done
296
+ ```
297
+
298
+ For each slug with a legacy `url:` field:
299
+
300
+ ```bash
301
+ # Step 1 — read current url value
302
+ URL=$(node _lumina/scripts/wiki.mjs read-meta sources/<slug> | node -e "process.stdout.write(JSON.parse(require('fs').readFileSync('/dev/stdin','utf8')).url)")
303
+
304
+ # Step 2 — write urls array
305
+ node _lumina/scripts/wiki.mjs set-meta sources/<slug> urls "[\"$URL\"]" --json-value
306
+
307
+ # Step 3 — remove legacy url key (requires set-meta --remove; fallback below)
308
+ node _lumina/scripts/wiki.mjs set-meta sources/<slug> url --remove
309
+ ```
310
+
311
+ If `set-meta --remove` is not supported by the installed wiki.mjs version, use `Edit` to
312
+ remove the `url:` line directly after confirming `urls:` was written successfully.
313
+
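If `--remove` is unavailable, a text-level alternative sketch (assumptions: the frontmatter uses the simple single-line `url: <value>` form from this pack's templates; the `set-meta` route above is preferred when it works, because it goes through the engine):

```python
# Sketch only: rewrite `url: <value>` to a `urls:` list in place, keeping the
# key's position in the frontmatter block. Skips pages already migrated.
import re
from pathlib import Path

def url_to_urls(page: Path) -> bool:
    text = page.read_text(encoding="utf-8")
    head, sep, body = text.partition("\n---")  # head = frontmatter before closing ---
    if "urls:" in head or not re.search(r"^url:\s*\S", head, re.M):
        return False  # already migrated, or no legacy url key
    new_head = re.sub(
        r"^url:\s*(\S.*)$",
        lambda m: "urls:\n  - " + m.group(1).strip().strip('"').strip("'"),
        head, count=1, flags=re.M,
    )
    page.write_text(new_head + sep + body, encoding="utf-8")
    return True
```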
212
314
  After backfilling all entries, proceed immediately to Phase 4.
213
315
 
214
316
  ### Phase 4 — Verify
@@ -216,12 +318,31 @@ After backfilling all entries, proceed immediately to Phase 4.
216
318
  **Step 4.1 — Re-run lint.**
217
319
 
218
320
  ```bash
219
- node _lumina/scripts/lint.mjs --json
321
+ node _lumina/scripts/lint.mjs --summary
220
322
  ```
221
323
 
222
- Confirm that all L01 errors from Phase 1 are resolved. L11 warnings for
324
+ Confirm `errors === 0`. If you need to inspect remaining findings, re-run with
325
+ `--json > /tmp/lumi-lint.json` and project as in Step 1.2 — never parse full
326
+ `--json` from inline stdout on a large wiki. L11 warnings for
223
327
  entries you set `confidence` on should also be gone.
224
328
 
329
+ Check for L12 warnings explicitly and surface them to the user:
330
+
331
+ ```bash
332
+ node -e "
333
+ const j=JSON.parse(require('fs').readFileSync('/tmp/lumi-lint.json','utf8'));
334
+ const l12=j.findings.filter(f=>f.id.startsWith('L12-'))
335
+ .map(f=>({file:f.file,message:f.message}));
336
+ if(l12.length) console.log('L12 raw_paths drift:\n'+JSON.stringify(l12,null,2));
337
+ else console.log('No L12 warnings.');
338
+ "
339
+ ```
340
+
341
+ L12 warnings mean one or more `raw_paths` entries point to files that do not
342
+ exist or are under `raw/tmp/`. Treat these as follow-up action items for the
343
+ user — the migration is not blocked, but the `raw_paths` value is inaccurate
344
+ until the referenced file is located or the entry is corrected.
345
+
225
346
  If any L01 errors remain:
226
347
  - Read the finding message — it names the exact field still missing.
227
348
  - Return to Phase 2 and infer a value for that field.
@@ -290,7 +411,7 @@ node _lumina/scripts/lint.mjs --json
290
411
 
291
412
  # Phase 2 — for each source:
292
413
  node _lumina/scripts/wiki.mjs read-meta sources/attention-is-all-you-need
293
- # → { url: "https://arxiv.org/abs/1706.03762", ... }
414
+ # → { urls: ["https://arxiv.org/abs/1706.03762"], ... }
294
415
  ls raw/sources/attention-is-all-you-need*
295
416
  # → raw/sources/attention-is-all-you-need.pdf (found)
296
417
  # → infer: provenance = replayable
@@ -69,12 +69,21 @@ python3 _lumina/tools/discover.py --help
69
69
  8. Present a checkpointed shortlist with title, authors/year, URL or identifier,
70
70
  `_score`, rationale, duplicate status, and recommended next action.
71
71
 
72
- For each candidate, include a suggested `provenance` value based on what you
73
- actually fetched. This helps the user (or `/lumi-ingest`) decide immediately
74
- rather than guessing later downstream verification depends on it:
75
- - `replayable` — URL fetched and raw snapshot saved to `raw/discovered/`.
76
- - `partial` — only a summary or abstract was retrieved; no full-text snapshot.
77
- - `missing` no URL available; metadata only (e.g. a manually entered title).
72
+ Discover writes JSON metadata to `raw/discovered/<topic>/<id>.json`. It does
73
+ NOT fetch PDFs; full-text download happens at ingest time via `/lumi-ingest`
74
+ Mode B, which calls `fetch_pdf.py` and places the PDF at
75
+ `raw/download/<resource>/<id>.<ext>`.
76
+
77
+ For each candidate, include a suggested `provenance` value (advisory; the
78
+ actual value is set by `/lumi-ingest` once the PDF is fetched). This helps
79
+ the user plan which sources are immediately accessible:
80
+ - `replayable` — abstract + full text both fetchable; `/lumi-ingest` will
81
+ download the PDF to `raw/download/` and resolve `raw_paths` at ingest time.
82
+ - `partial` — only abstract or metadata available (closed-access paper); no
83
+ full-text PDF reachable. `/lumi-ingest` will set `raw_paths` from the
84
+ metadata JSON only.
85
+ - `missing` — no URL; metadata only (e.g. a manually entered title). Nothing
86
+ to fetch; ingest will result in `provenance: missing`.
78
87
 
79
88
  9. Ask the user which candidates should be ingested. Do not create source pages
80
89
  or graph edges in this skill.
@@ -88,6 +97,8 @@ python3 _lumina/tools/discover.py --help
88
97
  `init_discovery.py`.
89
98
  - Do not include any non-FR35 workflows such as ideation, LaTeX writing, or
90
99
  orchestrator mode.
100
+ - Do not download PDFs. Discover writes metadata JSON to `raw/discovered/` only.
101
+ PDF fetching is `/lumi-ingest`'s job (Mode B, via `_lumina/tools/fetch_pdf.py`).
91
102
 
92
103
  ## Definition of Done
93
104
 
@@ -44,12 +44,16 @@ Keep this mental map in immediate context:
44
44
  - `raw/sources/` — `.pdf`, `.tex`, `.html`, `.md`, transcripts, anything ingested
45
45
  - `raw/notes/` — user's own markdown notes
46
46
  - `raw/assets/` — images and binary attachments
47
- - `raw/tmp/` — sidecar files generated by skills (additions only)
47
+ - `raw/tmp/` — sidecar files generated by skills (transient; do not store canonical sources here)
48
+ - `raw/download/<resource>/` — full-text artifacts auto-fetched by skills, partitioned by source
49
+ (e.g. `raw/download/arxiv/2604.03501v2.pdf`, `raw/download/doi/<doi>.pdf`).
50
+ Permanent agent-writable zone — keep separate from `raw/sources/` (human-curated).
48
51
  {{#if pack_research}}
49
- - `raw/discovered/`sources fetched by research pack tools (additions only, pack: research)
52
+ - `raw/discovered/<topic>/` — metadata JSON candidates from research-pack discovery
53
+ (additions only, pack: research). Holds `<paper-id>.json`; full-text PDFs go in `raw/download/`.
50
54
  {{/if}}
51
55
 
52
- **Rule:** never modify or delete an existing file under `raw/`. Files added by the user are authoritative and immutable to the agent. New files may only be *added*, only by a skill that documents this behavior, and only into `raw/tmp/`{{#if pack_research}} or `raw/discovered/`{{/if}}. Every other path under `raw/` is read-only.
56
+ **Rule:** never modify or delete an existing file under `raw/`. Files added by the user are authoritative and immutable to the agent. New files may only be *added*, only by a skill that documents this behavior, and only into `raw/tmp/`, `raw/download/`{{#if pack_research}}, or `raw/discovered/`{{/if}}. Every other path under `raw/` is read-only.
53
57
 
54
58
  ### `.agents/` is the skill source of truth
55
59
 
@@ -60,7 +64,7 @@ Keep this mental map in immediate context:
60
64
  - `_lumina/config/lumina.config.yaml` — workspace config; editable
61
65
  - `_lumina/schema/` — deeper reference docs; open when this file points you there
62
66
  - `_lumina/scripts/` — Node engine (`wiki.mjs`, `lint.mjs`, `reset.mjs`, `schemas.mjs`)
63
- - `_lumina/tools/` — Python tools (always: `extract_pdf.py`, `requirements.txt`{{#if pack_research}}; research pack adds `_env.py`, `prepare_source.py`, `init_discovery.py`, `discover.py`, and fetcher tools{{/if}})
67
+ - `_lumina/tools/` — Python tools (always: `extract_pdf.py`, `fetch_pdf.py`, `requirements.txt`{{#if pack_research}}; research pack adds `_env.py`, `prepare_source.py`, `init_discovery.py`, `discover.py`, and fetcher tools{{/if}})
64
68
  - `_lumina/_state/` — installer/skill checkpoint state; gitignored
65
69
  - `_lumina/manifest.json` — installer state; never edit by hand
66
70
 
@@ -190,6 +194,7 @@ Adds `/lumi-reading-chapter-ingest` (file a chapter, update characters/themes/pl
190
194
  - **`_lumina/scripts/wiki.mjs`** — wiki engine (frontmatter, graph mutation, slug, log).
191
195
  - **`_lumina/scripts/reset.mjs`** — scoped destructive reset.
192
196
  - **`_lumina/tools/extract_pdf.py`** — PDF text extractor (pypdf-based); used by `/lumi-ingest` and `/lumi-reading-chapter-ingest` when the host IDE cannot read PDFs natively.
197
+ - **`_lumina/tools/fetch_pdf.py`** — URL → `raw/download/<resource>/` PDF downloader (streaming, atomic, idempotent); used by `/lumi-ingest` Mode B when the input is a URL or paper identifier.
193
198
  - **`_lumina/tools/requirements.txt`** — Python dependencies for bundled tools. Run `pip install -r _lumina/tools/requirements.txt` when a tool reports a missing package.
194
199
  {{#if pack_research}}- **`_lumina/tools/_env.py`** — shared `.env` loader for research tools.
195
200
  - **`_lumina/tools/prepare_source.py`** — normalizes local source files into tool-readable JSON.
@@ -0,0 +1,416 @@
1
+ """
2
+ fetch_pdf.py — Download a PDF from a URL into the workspace landing zone.
3
+
4
+ CLI:
5
+ python fetch_pdf.py <url> [--project-root PATH] [--filename NAME] [--force]
6
+
7
+ Output (stdout, single JSON object on success):
8
+ {
9
+ "url": "<input url>",
10
+ "resolved_url": "<final url after redirects/normalization>",
11
+ "resource": "arxiv|doi|s2|web",
12
+ "id": "<extracted id>",
13
+ "path": "raw/download/arxiv/2604.03501v2.pdf",
14
+ "size_bytes": 12345,
15
+ "sha256": "<hex>",
16
+ "skipped": false
17
+ }
18
+
19
+ Errors emitted to stderr as JSON; exit codes:
20
+ 0 success (or skipped due to existing file)
21
+ 2 user error (empty url, malformed url, path traversal, HTML response)
22
+ 3 transient error (network failure, HTTP 5xx, timeout)
23
+
24
+ No API key required. All network calls use requests.Session().
25
+ Landing zone: raw/download/<resource>/<filename>
26
+ resource = arxiv | doi | s2 | web
27
+ """
28
+
29
+ from __future__ import annotations
30
+
31
+ import argparse
32
+ import hashlib
33
+ import json
34
+ import os
35
+ import re
36
+ import sys
37
+ import tempfile
38
+ from pathlib import Path
39
+ from typing import Any
40
+ from urllib.parse import urlparse
41
+
42
+ import requests
43
+
44
+ # ---------------------------------------------------------------------------
45
+ # Constants
46
+ # ---------------------------------------------------------------------------
47
+
48
+ USER_AGENT = "lumina-wiki/0.1 (research-pack; pdf fetcher)"
49
+ REQUEST_TIMEOUT = 60
50
+ MIN_PDF_SIZE = 100 # bytes — smaller responses are likely error pages
51
+ CHUNK_SIZE = 65536 # 64 KB chunks for streaming download
52
+
53
+ # Windows-illegal characters in filenames
54
+ _WIN_ILLEGAL_RE = re.compile(r'[<>:"/\\|?*]')
55
+
56
+ # Resource detection patterns — compiled once at module level
57
+ _ARXIV_ABS_RE = re.compile(
58
+ r"arxiv\.org/abs/([0-9]{4}\.[0-9]{4,5}(?:v\d+)?)", re.IGNORECASE
59
+ )
60
+ _ARXIV_PDF_RE = re.compile(
61
+ r"arxiv\.org/pdf/([0-9]{4}\.[0-9]{4,5}(?:v\d+)?)(?:\.pdf)?$", re.IGNORECASE
62
+ )
63
+ _DOI_RE = re.compile(r"(?:dx\.)?doi\.org/(.+)", re.IGNORECASE)
64
+ _S2_RE = re.compile(r"semanticscholar\.org/paper/([^/?#]+)", re.IGNORECASE)
65
+
66
+
67
+ # ---------------------------------------------------------------------------
68
+ # Helpers
69
+ # ---------------------------------------------------------------------------
70
+
71
+ def _err_json(msg: str, code: int) -> None:
72
+ """Print a JSON error to stderr."""
73
+ print(json.dumps({"error": msg, "code": code}), file=sys.stderr)
74
+
75
+
76
+ def _sha16_url(url: str) -> str:
77
+ """First 16 hex chars of SHA256 of URL — used as web resource ID."""
78
+ return hashlib.sha256(url.encode()).hexdigest()[:16]
79
+
80
+
81
+ def _sanitize_filename(name: str) -> str:
82
+ """Remove Windows-illegal characters from a filename."""
83
+ return _WIN_ILLEGAL_RE.sub("_", name)
84
+
85
+
86
+ def _safe_path(base: Path, rel: str, label: str) -> Path:
87
+ """Resolve rel under base; reject '..', absolute, or escaping paths."""
88
+ rel_path = Path(rel)
89
+ if rel_path.is_absolute():
90
+ _err_json(f"{label} must be a relative path, got: {rel}", 2)
91
+ sys.exit(2)
92
+ if ".." in rel_path.parts:
93
+ _err_json(f"{label} contains '..': {rel}", 2)
94
+ sys.exit(2)
95
+ resolved = (base / rel_path).resolve()
96
+ try:
97
+ resolved.relative_to(base.resolve())
98
+ except ValueError:
99
+ _err_json(f"{label} escapes base directory: {rel}", 2)
100
+ sys.exit(2)
101
+ return resolved
102
+
103
+
104
+ # ---------------------------------------------------------------------------
105
+ # Resource detection
106
+ # ---------------------------------------------------------------------------
107
+
108
+ def detect_resource(url: str) -> tuple[str, str, str]:
109
+ """Detect resource type and ID from URL.
110
+
111
+ Returns:
112
+ (resource, id, resolved_pdf_url)
113
+
114
+ resource is one of: arxiv, doi, s2, web
115
+ """
116
+ url = url.strip()
117
+
118
+ # arxiv abs
119
+ m = _ARXIV_ABS_RE.search(url)
120
+ if m:
121
+ arxiv_id = m.group(1)
122
+ pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
123
+ return "arxiv", arxiv_id, pdf_url
124
+
125
+ # arxiv pdf
126
+ m = _ARXIV_PDF_RE.search(url)
127
+ if m:
128
+ arxiv_id = m.group(1)
129
+ pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
130
+ return "arxiv", arxiv_id, pdf_url
131
+
132
+ # DOI
133
+ m = _DOI_RE.search(url)
134
+ if m:
135
+ doi_raw = m.group(1).rstrip("/")
136
+ doi_id = doi_raw.replace("/", "-")
137
+ return "doi", doi_id, url
138
+
139
+ # Semantic Scholar
140
+ m = _S2_RE.search(url)
141
+ if m:
142
+ s2_id = m.group(1)
143
+ return "s2", s2_id, url
144
+
145
+ # Web fallback
146
+ sha16 = _sha16_url(url)
147
+ return "web", sha16, url
148
+
149
+
150
+ def _derive_filename(resource: str, id_: str, content_type: str = "") -> str:
151
+ """Derive a default filename from resource/id.
152
+
153
+ For 'web', probes content_type for extension; defaults to .pdf.
154
+ """
155
+ if resource in ("arxiv", "doi", "s2"):
156
+ return _sanitize_filename(id_) + ".pdf"
157
+ # web
158
+ ext = ".pdf"
159
+ if content_type:
160
+ ct = content_type.lower().split(";")[0].strip()
161
+ if "octet-stream" in ct or "pdf" in ct:
162
+ ext = ".pdf"
163
+ # If it's something else we still default to .pdf per spec
164
+ return _sanitize_filename(id_) + ext
165
+
166
+
167
+ # ---------------------------------------------------------------------------
168
+ # Session factory
169
+ # ---------------------------------------------------------------------------
170
+
171
+ def _make_session() -> requests.Session:
172
+ session = requests.Session()
173
+ session.headers.update({"User-Agent": USER_AGENT})
174
+ return session
175
+
176
+
177
+ # ---------------------------------------------------------------------------
178
+ # Core download function
179
+ # ---------------------------------------------------------------------------
180
+
181
+ def fetch_pdf(
182
+ url: str,
183
+ project_root: Path,
184
+ filename: str | None = None,
185
+ force: bool = False,
186
+ session: requests.Session | None = None,
187
+ ) -> dict[str, Any]:
188
+ """Download a PDF from url into raw/download/<resource>/<filename>.
189
+
190
+ Args:
191
+ url: The source URL (arxiv abs/pdf, doi, s2, or generic web URL).
192
+ project_root: Absolute path to the project root.
193
+ filename: Override output filename (sanitized). If None, derived from resource/id.
194
+ force: If True, overwrite existing file. If False, skip if exists.
195
+ session: Optional requests.Session for connection reuse.
196
+
197
+ Returns:
198
+ Result dict (see module docstring).
199
+
200
+ Raises:
201
+ ValueError: on user errors (empty url, content-type mismatch, path traversal).
202
+ RuntimeError: on transient errors (network, HTTP 5xx, timeout).
203
+ requests.RequestException: on low-level network failure (caller re-raises).
204
+ """
205
+ url = url.strip()
206
+ if not url:
207
+ raise ValueError("url must not be empty")
208
+
209
+ parsed = urlparse(url)
210
+ if not parsed.scheme or not parsed.netloc:
211
+ raise ValueError(f"malformed url (no scheme or host): {url!r}")
212
+
213
+ resource, res_id, resolved_url = detect_resource(url)
214
+
215
+ sess = session or _make_session()
216
+
217
+ if filename is not None:
218
+ out_filename = _sanitize_filename(filename)
219
+ if not out_filename:
220
+ raise ValueError(f"--filename becomes empty after sanitization: {filename!r}")
221
+ else:
222
+ out_filename = None
223
+
224
+ rel_dir = f"raw/download/{resource}"
225
+ out_dir = _safe_path(project_root, rel_dir, "output directory")
226
+
227
+ if out_filename is None:
228
+ out_filename = _derive_filename(resource, res_id)
229
+
230
+ if "/" in out_filename or "\\" in out_filename or ".." in out_filename:
231
+ raise ValueError(f"filename contains path separators or '..': {out_filename!r}")
232
+
233
+ out_path = out_dir / out_filename
234
+
235
+ # Idempotency: skip if exists and not --force
236
+ if out_path.exists() and not force:
237
+ return {
238
+ "url": url,
239
+ "resolved_url": resolved_url,
240
+ "resource": resource,
241
+ "id": res_id,
242
+ "path": str(out_path.relative_to(project_root)),
243
+ "size_bytes": out_path.stat().st_size,
244
+ "sha256": _sha256_file(out_path),
245
+ "skipped": True,
246
+ "reason": "exists",
247
+ }
248
+
249
+ # Streaming download
250
+ resp = sess.get(resolved_url, timeout=REQUEST_TIMEOUT, allow_redirects=True, stream=True)
251
+
252
+ if resp.status_code >= 500:
253
+ raise RuntimeError(f"HTTP {resp.status_code} from server")
254
+ if resp.status_code == 404:
255
+ raise ValueError(f"HTTP 404: resource not found at {resolved_url}")
256
+ if resp.status_code >= 400:
257
+ raise ValueError(f"HTTP {resp.status_code} from server")
258
+ resp.raise_for_status()
259
+
260
+ content_type = resp.headers.get("Content-Type", "")
261
+
262
+ # For 'web' resource, refine the filename extension from content-type
263
+ if resource == "web" and filename is None:
264
+ out_filename = _derive_filename(resource, res_id, content_type)
265
+ out_path = out_dir / out_filename
266
+
267
+ ct_lower = content_type.lower().split(";")[0].strip()
268
+ url_ends_pdf = resolved_url.lower().endswith(".pdf")
269
+ is_pdf = ct_lower.startswith("application/pdf") or url_ends_pdf
270
+
271
+ if not is_pdf and ct_lower.startswith("text/html"):
272
+ raise ValueError(
273
+ f"expected PDF but server returned HTML (Content-Type: {content_type}); "
274
+ f"URL may be a landing page rather than a direct PDF link"
275
+ )
276
+
277
+ # Atomic write: temp + streaming + fsync + rename; SHA256 computed during download
278
+ out_dir.mkdir(parents=True, exist_ok=True)
279
+ fd, tmp_path_str = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
280
+ hasher = hashlib.sha256()
281
+ size = 0
282
+ try:
283
+ with os.fdopen(fd, "wb") as f:
284
+ for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
285
+ if chunk:
286
+ f.write(chunk)
287
+ hasher.update(chunk)
288
+ size += len(chunk)
289
+ f.flush()
290
+ os.fsync(f.fileno())
291
+ except Exception:
292
+ try:
293
+ os.unlink(tmp_path_str)
294
+ except OSError:
295
+ pass
296
+ raise
297
+
298
+ if size < MIN_PDF_SIZE:
299
+ try:
300
+ os.unlink(tmp_path_str)
301
+ except OSError:
302
+ pass
303
+ raise ValueError(
304
+ f"downloaded content is too small ({size} bytes < {MIN_PDF_SIZE}); "
305
+ f"likely an error page rather than a real PDF"
306
+ )
307
+
308
+ os.replace(tmp_path_str, out_path)
309
+
310
+ return {
311
+ "url": url,
312
+ "resolved_url": resp.url,
313
+ "resource": resource,
314
+ "id": res_id,
315
+ "path": str(out_path.relative_to(project_root)),
316
+ "size_bytes": size,
317
+ "sha256": hasher.hexdigest(),
318
+ "skipped": False,
319
+ }
320
+
321
+
322
+ def _sha256_file(path: Path) -> str:
323
+ """Compute SHA256 of an existing file."""
324
+ h = hashlib.sha256()
325
+ with path.open("rb") as f:
326
+ for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
327
+ h.update(chunk)
328
+ return h.hexdigest()
329
+
330
+
331
+ # ---------------------------------------------------------------------------
332
+ # CLI
333
+ # ---------------------------------------------------------------------------
334
+
335
+ def main(argv: list[str] | None = None) -> None:
336
+ parser = argparse.ArgumentParser(
337
+ prog="fetch_pdf.py",
338
+ description=(
339
+ "Download a PDF from a URL into raw/download/<resource>/<filename>. "
340
+ "Detects arxiv, DOI, Semantic Scholar, and generic web URLs."
341
+ ),
342
+ )
343
+ parser.add_argument("url", help="URL of the PDF to download.")
344
+ parser.add_argument(
345
+ "--project-root", default=None,
346
+ help="Project root directory (default: current directory).",
347
+ )
348
+ parser.add_argument(
349
+ "--filename", default=None,
350
+ help="Override output filename (default: derived from resource/id).",
351
+ )
352
+ parser.add_argument(
353
+ "--force", action="store_true",
354
+ help="Re-download and overwrite if file already exists.",
355
+ )
356
+
357
+ args = parser.parse_args(argv)
358
+
359
+ if not args.url or not args.url.strip():
360
+ _err_json("url must not be empty", 2)
361
+ sys.exit(2)
362
+
363
+ project_root = (
364
+ Path(args.project_root).resolve()
365
+ if args.project_root
366
+ else Path.cwd().resolve()
367
+ )
368
+
369
+ if args.filename is not None:
370
+ fn = args.filename
371
+ if "/" in fn or "\\" in fn or ".." in fn:
372
+ _err_json(
373
+ f"--filename must be a plain filename (no path separators or '..'): {fn!r}",
374
+ 2,
375
+ )
376
+ sys.exit(2)
377
+
378
+ session = _make_session()
379
+
380
+ try:
381
+ result = fetch_pdf(
382
+ url=args.url,
383
+ project_root=project_root,
384
+ filename=args.filename,
385
+ force=args.force,
386
+ session=session,
387
+ )
388
+ print(json.dumps(result, ensure_ascii=False, indent=2))
389
+ sys.exit(0)
390
+
391
+ except ValueError as exc:
392
+ _err_json(str(exc), 2)
393
+ sys.exit(2)
394
+ except requests.exceptions.ConnectionError as exc:
395
+ _err_json(f"Network error: {exc}", 3)
396
+ sys.exit(3)
397
+ except requests.exceptions.Timeout:
398
+ _err_json("Request timed out while downloading PDF.", 3)
399
+ sys.exit(3)
400
+ except requests.exceptions.HTTPError as exc:
401
+ code = exc.response.status_code if exc.response is not None else "unknown"
402
+ _err_json(f"HTTP error {code} while downloading PDF.", 3)
403
+ sys.exit(3)
404
+ except RuntimeError as exc:
405
+ _err_json(str(exc), 3)
406
+ sys.exit(3)
407
+ except OSError as exc:
408
+ _err_json(f"I/O error: {exc}", 3)
409
+ sys.exit(3)
410
+ except Exception as exc: # noqa: BLE001
411
+ _err_json(f"Internal error: {exc}", 3)
412
+ sys.exit(3)
413
+
414
+
415
+ if __name__ == "__main__":
416
+ main()