lumina-wiki 0.7.0 → 0.8.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +31 -0
- package/package.json +2 -1
- package/src/installer/commands.js +2 -2
- package/src/installer/manifest.js +6 -1
- package/src/scripts/lint.mjs +59 -2
- package/src/scripts/schemas.mjs +3 -1
- package/src/skills/core/ingest/SKILL.md +71 -11
- package/src/skills/core/migrate-legacy/SKILL.md +148 -27
- package/src/skills/packs/research/discover/SKILL.md +17 -6
- package/src/templates/README.md +9 -4
- package/src/tools/fetch_pdf.py +416 -0
package/CHANGELOG.md
CHANGED
@@ -5,6 +5,37 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 ## [Unreleased]
 
+## [0.8.0] - 2026-05-03
+
+### Added
+- Schema: `raw_paths` field (array, optional) on `sources` — relative paths to permanent raw artifacts backing the source page (`raw/sources/*`, `raw/notes/*`, `raw/download/<resource>/*`, `raw/discovered/<topic>/*.json`). Replaces the implicit "URL is the anchor" semantic with an explicit pointer set; verify Stage A (planned v1.0) reads this directly instead of re-deriving from heuristics.
+- `raw/download/<resource>/` — permanent agent-writable zone for auto-fetched full-text artifacts, partitioned by source (`arxiv`, `doi`, `s2`, `web`). Distinct from `raw/tmp/` (transient) and `raw/sources/` (human-curated).
+- `_lumina/tools/fetch_pdf.py` — CLI tool: download URL to `raw/download/<resource>/<filename>`, idempotent (skip on existing, `--force` to overwrite). Resource detection from URL pattern (arxiv/doi/s2/web). Atomic write (tempfile + fsync + rename). Used by `/lumi-ingest` Mode B.
+- Lint check L12: warning when `raw_paths` entries point to a missing file, escape the project root, or live in `raw/tmp/*` (transient — should be moved to `raw/sources/` or `raw/download/`).
+- `/lumi-ingest` Mode B: input may be a URL, arxiv ID, DOI, or paper title from the discover shortlist. The skill resolves it to a URL, calls `fetch_pdf.py`, and ingests from the resulting `raw/download/` path. Mode A (local file path) unchanged.
+
+### Changed
+- Source frontmatter field `url: <string>` renamed to `urls: <array>` for symmetry with `raw_paths: array`. Multiple URLs supported per source (arxiv abs, DOI, repo, slides). Lint type validation expects `urls` to be an array; legacy `url` string entries are flagged as an unknown field. Migration handled by `/lumi-migrate-legacy` (detects and rewrites `url: <str>` → `urls: [<str>]`).
+- Provenance semantics reframed raw-centric (enum unchanged, 3 values):
+  - `replayable` now requires `raw_paths` non-empty with at least one entry resolving to disk (a URL is no longer a precondition — file-only sources qualify).
+  - `partial` requires a URL present and no resolvable `raw_paths`.
+  - `missing` unchanged.
+  Rubric updated in `/lumi-ingest`, `/lumi-research-discover`, `/lumi-migrate-legacy`.
+- `/lumi-migrate-legacy` rubric: tier 1 reads the ingest checkpoint (`_lumina/_state/ingest-<slug>.json`) for the authoritative `source_path`; tier 2 falls back to slug-prefix and URL-derived-ID heuristics across `raw/sources/`, `raw/notes/`, `raw/download/**`, `raw/discovered/**`. Pages whose checkpoint points into `raw/tmp/*` are flagged for the user to relocate before backfill — the skill does not auto-move human files.
+- Manifest schema: `MANIFEST_SCHEMA_VERSION` 2 → 3. Migration is metadata-only (no manifest field shape change); workspace schema additions (`raw_paths`, `raw/download/`) are additive and backward-compatible — old wikis continue to lint clean (L12 warnings advisory only).
+- `/lumi-migrate-legacy`: raised the work-list confirmation gate from 10 to 30 entries. Real wikis commonly have dozens of entries, and the original threshold made every migration a multi-turn chore. Lists ≤30 now proceed after the plan is reported; lists >30 still pause for explicit confirmation, since a large batch usually signals a long-dormant wiki or a major schema bump worth spot-checking.
+
+### Fixed
+- `/lumi-migrate-legacy`: Step 1.2 and Step 4.1 now use `lint.mjs --summary` for counts and write `--json` to `/tmp/lumi-lint.json` before projecting findings. Avoids the Bash-tool ~30KB stdout cap, which truncated full `--json` mid-string on wikis with many findings and broke inline `JSON.parse`.
+
+### Migration
+- Existing source pages without `raw_paths`: no immediate action required. Lint stays green (`raw_paths` is optional; L12 only fires when present-but-broken).
+- To backfill `raw_paths` on legacy entries, run `/lumi-migrate-legacy` after upgrading. The skill reads ingest checkpoints and applies the new tier-1/tier-2 rubric.
+- If you have wiki sources currently pointing at `raw/tmp/arxiv-ingest/` or similar transient locations (a known artefact of pre-v0.8 agent improvisation): move those PDFs to `raw/download/arxiv/` (matching arxiv ID) or `raw/sources/` (custom-named), then re-run `/lumi-migrate-legacy`. Lint L12 will identify the affected pages.
+- Custom tooling reading the manifest: bump expected `schemaVersion` to 3 (or accept 2|3 transitionally — the manifest shape is unchanged).
+
+## [0.7.0] - 2026-05-03
+
 ### Added
 
 - `/lumi-migrate-legacy` core skill — LLM-driven backfill of provenance/confidence
 - `CHANGELOG.md` shipped to `_lumina/CHANGELOG.md` for skill consumption
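For orientation, a minimal usage sketch of the `fetch_pdf.py` tool described under Added above — the arxiv ID and the resulting path are illustrative; the JSON fields are the ones the tool's own docstring documents:

```bash
# Download into the permanent agent-writable zone; safe to re-run (idempotent).
python3 _lumina/tools/fetch_pdf.py "https://arxiv.org/abs/2604.03501v2"
# → stdout is a single JSON object, e.g.:
# { "resource": "arxiv", "id": "2604.03501v2",
#   "path": "raw/download/arxiv/2604.03501v2.pdf",
#   "sha256": "<hex>", "skipped": false, ... }
```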
package/package.json
CHANGED
@@ -1,7 +1,7 @@
 {
   "$schema": "https://json.schemastore.org/package.json",
   "name": "lumina-wiki",
-  "version": "0.7.0",
+  "version": "0.8.0",
   "description": "Domain-agnostic, multi-IDE wiki scaffolder — Karpathy's LLM-Wiki vision, cross-platform and pack-based.",
   "keywords": [
     "llm-wiki",
@@ -51,6 +51,7 @@
     "src/tools/init_discovery.py",
     "src/tools/prepare_source.py",
     "src/tools/fetch_arxiv.py",
+    "src/tools/fetch_pdf.py",
     "src/tools/fetch_wikipedia.py",
     "src/tools/fetch_s2.py",
     "src/tools/fetch_deepxiv.py",
package/src/installer/commands.js
CHANGED
@@ -110,7 +110,7 @@ const CORE_WIKI_DIRS = [
 const RESEARCH_WIKI_DIRS = ['wiki/foundations', 'wiki/topics'];
 const READING_WIKI_DIRS = ['wiki/chapters', 'wiki/characters', 'wiki/themes', 'wiki/plot'];
 
-const CORE_RAW_DIRS = ['raw/sources', 'raw/notes', 'raw/assets', 'raw/tmp'];
+const CORE_RAW_DIRS = ['raw/sources', 'raw/notes', 'raw/assets', 'raw/tmp', 'raw/download'];
 const RESEARCH_RAW_DIRS = ['raw/discovered'];
 
 const LUMINA_DIRS = [
@@ -878,7 +878,7 @@ function getSkillDefs(packs) {
 
 async function copyTools(projectRoot, { research }) {
   const destDir = join(projectRoot, '_lumina', 'tools');
-  const coreTools = ['extract_pdf.py'];
+  const coreTools = ['extract_pdf.py', 'fetch_pdf.py'];
   const researchTools = [
     '_env.py', 'discover.py', 'init_discovery.py', 'prepare_source.py',
     'fetch_arxiv.py', 'fetch_wikipedia.py', 'fetch_s2.py', 'fetch_deepxiv.py',
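A quick post-upgrade sanity check implied by the widened `CORE_RAW_DIRS` above — a sketch; `raw/discovered/` exists only when the research pack is installed:

```bash
# The installer should now scaffold raw/download/ alongside the other core raw/ zones.
ls -d raw/sources raw/notes raw/assets raw/tmp raw/download raw/discovered 2>/dev/null
```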
package/src/installer/manifest.js
CHANGED
@@ -19,7 +19,7 @@ import { atomicWrite, ensureDir } from './fs.js';
 // Constants
 // ---------------------------------------------------------------------------
 
-export const MANIFEST_SCHEMA_VERSION = 2;
+export const MANIFEST_SCHEMA_VERSION = 3;
 
 export const SKILLS_CSV_HEADER = 'canonical_id,display_name,pack,source,relative_path,target_link_path,version';
 export const FILES_CSV_HEADER = 'relative_path,sha256,source_pack,installed_version';
@@ -293,6 +293,11 @@ export async function writeFilesManifest(projectRoot, rows) {
  */
 const MIGRATIONS = {
   '1->2': (m) => ({ ...m, legacyMigrationNeeded: true }),
+  // 2->3 (v0.8): workspace schema additions — raw_paths field, raw/download/ dir,
+  // lint L12, source frontmatter url (string) -> urls (array). All additive /
+  // backward-compatible at the manifest level. Wiki content migration is handled
+  // by /lumi-migrate-legacy, not by the installer. No manifest shape change.
+  '2->3': (m) => ({ ...m }),
 };
 
 /**
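For custom tooling that reads the installer manifest, a sketch of the transitional 2|3 acceptance the changelog recommends; it assumes the manifest lives at `_lumina/manifest.json` with a top-level `schemaVersion` field, as the workspace README describes:

```bash
node -e "
const m = JSON.parse(require('fs').readFileSync('_lumina/manifest.json', 'utf8'));
// Accept 2 or 3 transitionally — the 2->3 migration changes no manifest fields.
if (![2, 3].includes(m.schemaVersion)) {
  console.error('unsupported manifest schemaVersion: ' + m.schemaVersion);
  process.exit(1);
}
console.log('manifest schemaVersion ' + m.schemaVersion + ' — ok');
"
```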
package/src/scripts/lint.mjs
CHANGED
@@ -67,7 +67,7 @@ const INDEX_MARKER_OPEN = '<!-- lumina:index -->';
 const INDEX_MARKER_CLOSE = '<!-- /lumina:index -->';
 
 /** All check IDs in run order. */
-const ALL_CHECK_IDS = ['L01', 'L02', 'L03', 'L04', 'L05', 'L06', 'L07', 'L08', 'L09', 'L10', 'L11'];
+const ALL_CHECK_IDS = ['L01', 'L02', 'L03', 'L04', 'L05', 'L06', 'L07', 'L08', 'L09', 'L10', 'L11', 'L12'];
 
 /** Kebab-case pattern: lowercase letters, digits, hyphens; no leading/trailing hyphen. */
 const KEBAB_RE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
@@ -667,6 +667,62 @@ function checkL10(foundationEntries) {
   return findings;
 }
 
+/**
+ * L12: `raw_paths` entries on a `sources` page point to a missing file, or to
+ * `raw/tmp/*` (transient location — canonical sources should not live there).
+ * Severity: warning. Not auto-fixable. Catches drift when the user moves or
+ * renames a backing file, and flags the common mistake of pinning a wiki page
+ * to a temp-zone artifact that may be cleaned at any time.
+ *
+ * @param {string} wikiRelPath
+ * @param {Record<string,unknown>} fm
+ * @param {string} projectRoot Absolute path; used to resolve raw_paths entries.
+ * @returns {Promise<Finding[]>}
+ */
+async function checkL12(wikiRelPath, fm, projectRoot) {
+  const type = entityTypeForPath(wikiRelPath);
+  if (type !== 'sources') return [];
+
+  const rawPaths = fm.raw_paths;
+  if (!Array.isArray(rawPaths) || rawPaths.length === 0) return [];
+
+  const findings = [];
+  for (const entry of rawPaths) {
+    if (typeof entry !== 'string' || entry === '') continue;
+
+    // Reject paths inside raw/tmp/ — transient zone, not for canonical sources.
+    if (entry.startsWith('raw/tmp/') || entry.startsWith('./raw/tmp/')) {
+      findings.push(finding(
+        'L12-raw-paths-transient', 'warning', false,
+        wikiRelPath, null,
+        `raw_paths entry "${entry}" lives in raw/tmp/ — transient. Move the file to raw/sources/ (human) or raw/download/<resource>/ (agent) and update raw_paths.`
+      ));
+      continue;
+    }
+
+    // Verify file exists on disk (relative to project root).
+    const abs = resolve(projectRoot, entry);
+    if (!abs.startsWith(resolve(projectRoot))) {
+      findings.push(finding(
+        'L12-raw-paths-unsafe', 'warning', false,
+        wikiRelPath, null,
+        `raw_paths entry "${entry}" escapes the project root`
+      ));
+      continue;
+    }
+    try {
+      await access(abs, fsConstants.F_OK);
+    } catch {
+      findings.push(finding(
+        'L12-raw-paths-missing', 'warning', false,
+        wikiRelPath, null,
+        `raw_paths entry "${entry}" does not exist on disk`
+      ));
+    }
+  }
+  return findings;
+}
+
 /**
  * L11: `confidence` field missing on a `sources` or `concepts` entity.
  * Severity: warning. Not auto-fixable. Sets an explicit trust signal that
@@ -943,6 +999,7 @@ async function runLint(projectRoot, opts) {
     allFindings.push(...checkL04(wikiRelPath, outboundMap.get(wikiRelPath) || new Set(), inboundSet));
     allFindings.push(...checkL05(wikiRelPath, content, knownSlugs));
     allFindings.push(...checkL11(wikiRelPath, fm));
+    allFindings.push(...await checkL12(wikiRelPath, fm, projectRoot));
   }
 
   allFindings.push(...checkL06(edges, new Set(edgeSet)));
@@ -1243,7 +1300,7 @@ export {
   isExempt,
   entityTypeForPath,
   checkL01, checkL02, checkL03, checkL04, checkL05,
-  checkL06, checkL07, checkL08, checkL09, checkL10, checkL11,
+  checkL06, checkL07, checkL08, checkL09, checkL10, checkL11, checkL12,
   fixL01, fixL03, fixL06, fixL07, fixL09,
   runLint,
   reportSummary,
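A sketch of surfacing only the new L12 findings from the CLI, using the same write-to-temp-file pattern the migrate skill adopts; the three finding IDs are the ones `checkL12` emits above:

```bash
node _lumina/scripts/lint.mjs --json > /tmp/lumi-lint.json
node -e "
const j = JSON.parse(require('fs').readFileSync('/tmp/lumi-lint.json', 'utf8'));
// L12-raw-paths-transient | L12-raw-paths-unsafe | L12-raw-paths-missing
const hits = j.findings.filter(f => f.id.startsWith('L12-'))
  .map(f => ({ id: f.id, file: f.file, message: f.message }));
console.log(JSON.stringify(hits, null, 2));
"
```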
package/src/scripts/schemas.mjs
CHANGED
@@ -118,6 +118,7 @@ export const RAW_DIRS = {
   notes: 'core',
   assets: 'core',
   tmp: 'core',
+  download: 'core',
 
   // research pack
   discovered: 'research',
@@ -257,7 +258,8 @@ export const REQUIRED_FRONTMATTER = {
     { key: 'authors', type: 'array', required: true },
     { key: 'year', type: 'number', required: true },
     { key: 'importance', type: 'enum', required: true, values: [1, 2, 3, 4, 5] },
-    { key: 'url', type: 'string', required: false },
+    { key: 'urls', type: 'array', required: false },
+    { key: 'raw_paths', type: 'array', required: false },
     { key: 'provenance', type: 'enum', required: true, values: ['replayable', 'partial', 'missing'] },
     { key: 'confidence', type: 'enum', required: false, values: ['high', 'medium', 'low', 'unverified'] },
   ],
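A spot-check sketch for the two new array-typed fields on a single source page, reusing the `read-meta` pipe that `/lumi-migrate-legacy` relies on (the slug is a placeholder):

```bash
node _lumina/scripts/wiki.mjs read-meta sources/<slug> | node -e "
const m = JSON.parse(require('fs').readFileSync('/dev/stdin', 'utf8'));
console.log('urls is array:     ', Array.isArray(m.urls));
console.log('raw_paths is array:', Array.isArray(m.raw_paths));
"
```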
package/src/skills/core/ingest/SKILL.md
CHANGED
@@ -8,8 +8,13 @@ description: >
   into the wiki", "create a wiki page for", or drops a filename from raw/sources/.
   Also fires for: "I added a PDF to raw/sources/", "add this paper to the wiki",
   "parse this article", "what should I do with raw/sources/X?", or any request
-  to bring a new source into the wiki graph.
-
+  to bring a new source into the wiki graph.
+  Also accepts Mode B input — a paper title, arxiv ID, or URL, without a local
+  file path. Examples: "ingest paper 2604.03501v2", "ingest arxiv:2604.03501",
+  "ingest https://arxiv.org/abs/2604.03501". The skill fetches the PDF
+  automatically in that case.
+  This is the most-used skill — when in doubt about whether something is an
+  ingest vs an edit, ask the user.
 allowed-tools:
   - Bash
   - Read
@@ -34,6 +39,7 @@ depends on bidirectional-link discipline.
 
 Key workspace paths:
 - `raw/sources/` — immutable user-provided sources; you read but never modify
+- `raw/download/<resource>/` — agent-writable, permanent; auto-fetched PDFs land here (resource = `arxiv | doi | s2 | web`)
 - `wiki/sources/` — one page per ingested source (you write this)
 - `wiki/concepts/`, `wiki/people/` — you create or update stubs here
 - `wiki/index.md` — updated on every ingest
@@ -79,6 +85,43 @@ If a checkpoint exists and `phase` is not `"done"`, ask the user whether to resu
 or restart. Resuming skips completed phases. Restarting deletes the checkpoint and
 starts from Phase 1.
 
+### Phase 0.5 — Resolve input
+
+Three input modes:
+
+**Mode A — Local file path** (e.g. `raw/sources/foo.pdf`, `raw/notes/bar.md`)
+
+Use directly as `source_path`. Proceed to Phase 1 to detect type from this file.
+
+**Mode B — URL or identifier** (arxiv ID like `2604.03501v2`, arxiv URL, DOI, S2 paper ID, generic URL)
+
+```bash
+python3 _lumina/tools/fetch_pdf.py "<url-or-id>"
+```
+
+- For a bare arxiv ID: pass `https://arxiv.org/abs/<id>`
+- For a DOI: pass `https://doi.org/<doi>`
+
+On exit 0: read the JSON output. Use the returned `path` as `source_path`. Write the input URL as
+the first entry of `urls` on the source page; additional URLs (DOI, repo, slides, etc.) can be
+appended after. Proceed to Phase 1.
+
+On exit 2 (HTML response): the URL likely points to a paywall or non-PDF page.
+Report to the user and ask for a direct PDF URL or a manual download. Do not proceed
+with ingest until a valid file path is available.
+
+On exit 3 (network error): retry once. If the second attempt also fails, surface
+the error to the user with the exact message from the tool output.
+
+**Mode C — Title only** (e.g. from a `/lumi-research-discover` shortlist)
+
+```bash
+node _lumina/scripts/wiki.mjs checkpoint-read research-discover shortlist
+```
+
+Match the title to a shortlist entry and extract the URL from that entry.
+Fall through to Mode B with that URL.
+
 ### Phase 1 — Detect type
 
 Read the file header (first ~200 lines). Classify as one of:
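A sketch of the Mode B hand-off end to end — the arxiv ID and the temp file are illustrative; `path` and `skipped` are fields documented by `fetch_pdf.py`:

```bash
python3 _lumina/tools/fetch_pdf.py "https://arxiv.org/abs/2604.03501v2" > /tmp/lumi-fetch.json
node -e "
const r = JSON.parse(require('fs').readFileSync('/tmp/lumi-fetch.json', 'utf8'));
// r.path (e.g. raw/download/arxiv/2604.03501v2.pdf) becomes source_path for Phase 1;
// the input URL becomes urls[0] on the drafted source page.
console.log(r.skipped ? 'already present:' : 'downloaded:', r.path);
"
```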
@@ -115,7 +158,7 @@ drafting the page.
 Draft `wiki/sources/<slug>.md` using the Source
 template from `_lumina/schema/page-templates.md` (open it when in doubt about
 required fields). Required frontmatter fields: `id`, `title`, `type`, `created`,
-`updated`, `authors`, `year`, `importance` (1-5), `
+`updated`, `authors`, `year`, `importance` (1-5), `urls` (optional, array).
 
 Required body sections: `## Summary` (2-4 sentences), `## Key Claims` (bulleted
 with confidence level), `## Concepts` (all `[[concept-slug]]` links), `## People`
@@ -129,20 +172,36 @@ verification (Stage A/B/C of `/lumi-verify`, planned for v1.0). An explicit deci
 is more useful than a silently-defaulted value because verification needs to know
 whether it can re-check the material end-to-end.
 
-Provenance rubric — pick the one that
-- `replayable` —
-
-- `partial` —
-  detection works against the URL, but grounding cannot be re-checked.
-- `missing` —
-
+Provenance rubric — raw-centric; pick the one that fits:
+- `replayable` — `raw_paths` is non-empty AND every entry resolves to an existing file.
+  Source can be re-grounded end-to-end against these files.
+- `partial` — has at least one entry in `urls` but `raw_paths` is empty or every listed entry is missing.
+  Drift detection works against the URL, but grounding cannot be re-checked.
+- `missing` — no `urls`, no `raw_paths`. Manual entry; verification has nothing to grip on.
+
+A source can have multiple URLs (arxiv abs + DOI + repo + slides). List the most authoritative
+first; the agent uses `urls[0]` when a single canonical URL is needed (e.g., for `fetch_pdf.py` Mode B).
+
+Set `raw_paths` to the list of permanent raw artifacts backing this page:
+- The primary file passed to ingest (`raw/sources/*`, `raw/notes/*`, or
+  `raw/download/<resource>/*` from Mode B).
+- Any matching metadata JSON in `raw/discovered/<topic>/<id>.json` (research pack).
+  Match by paper ID (arxiv ID, DOI) extracted from the source's URL or filename.
+
+Do NOT include `raw/tmp/*` entries — that zone is transient (lint L12 warns).
+Do NOT include files outside `raw/`.
 
 Also set `confidence:` (optional but encouraged). Use `high | medium | low | unverified`.
 Default to `unverified` for fresh ingests; bump up only after you have cross-checked
 the claims or the user has confirmed them.
 
-Example frontmatter
+Example frontmatter:
 ```yaml
+urls:
+  - https://arxiv.org/abs/2604.03501v2
+raw_paths:
+  - raw/download/arxiv/2604.03501v2.pdf
+  - raw/discovered/ai-economics/2604.03501v2.json
 provenance: replayable
 confidence: unverified
 ```
@@ -303,6 +362,7 @@ Link added to `## Concepts` in `wiki/sources/rlhf-overview.md`:
 - Keep a checkpoint after every phase — an interrupted ingest must be resumable.
 - If the source is too large to fully read, read in sections and checkpoint between them.
 - `raw/tmp/` accepts additions only; never overwrite a file there.
+- `raw_paths` must list permanent artifacts only. Reject `raw/tmp/*` entries.
 
 ## Definition of Done
 
package/src/skills/core/migrate-legacy/SKILL.md
CHANGED
@@ -56,13 +56,38 @@ Proceed to the lint check regardless.
 
 **Step 1.2 — Run lint (read-only pass).**
 
+First, get aggregate counts (tiny output, always safe):
+
 ```bash
-node _lumina/scripts/lint.mjs --
+node _lumina/scripts/lint.mjs --summary
+```
+
+If `errors === 0` and `by_check.L11` is `0` or absent, skip to the clean-exit
+branch below. Otherwise, you need the per-entry findings.
+
+**Important — do NOT pipe `--json` straight into a heredoc.** On a large wiki
+the full findings JSON can exceed the shell tool's ~30KB stdout buffer and get
+truncated mid-string, breaking JSON.parse. Instead, write it to a temp file
+and read filtered slices:
+
+```bash
+node _lumina/scripts/lint.mjs --json > /tmp/lumi-lint.json
+node -e "
+const j=JSON.parse(require('fs').readFileSync('/tmp/lumi-lint.json','utf8'));
+const want=new Set(['L01-frontmatter-required','L11-confidence-missing']);
+const hits=j.findings.filter(f=>want.has(f.id))
+  .map(f=>({id:f.id,file:f.file,message:f.message}));
+console.log(JSON.stringify(hits,null,2));
+"
 ```
 
-
-
-
+The projected output (id + file + message only) is bounded and parseable. If
+even that exceeds the buffer (very large wikis), read `/tmp/lumi-lint.json` with
+the Read tool instead — Read paginates, Bash stdout does not.
+
+Collect:
+- All `L01-frontmatter-required` findings (severity: error) — entries with
+  missing required fields.
 - All `L11-confidence-missing` findings (severity: warning) — entries missing
   the optional-but-recommended `confidence` field.
 
@@ -107,9 +132,21 @@ Field: confidence (optional, sources + concepts)
   - concepts/softmax-temperature
 ```
 
-
-
-
+Always report this plan to the user before proceeding. For work lists of **30
+or fewer entries**, continue without waiting for confirmation — small batches
+are routine and the operation is safe to re-run. For **more than 30 entries**,
+stop and ask the user to confirm before any writes. A large batch usually
+means a long-dormant wiki or a major schema bump, and the user should have a
+chance to spot-check the inference table before bulk changes land.
+
+The safety net beneath this threshold:
+
+- `set-meta` is atomic and idempotent — rerunning with a corrected value is
+  a single command, no rollback needed.
+- The inference rubric falls back to `unverified` when evidence is ambiguous,
+  so wrong values err toward "honest about uncertainty," not overconfidence.
+- Phase 4 re-runs lint and surfaces any remaining issues before clearing the
+  manifest flag.
 
 ### Phase 2 — Plan
 
@@ -127,13 +164,7 @@ existing fields (url, authors, year, type, etc.).
 
 **For `sources` entries also check:**
 
-1.
-```bash
-ls raw/sources/<slug>* 2>/dev/null || echo "no snapshot"
-ls raw/discovered/<slug>* 2>/dev/null || echo "no snapshot"
-```
-
-2. Inbound citation/edge count (how many other entries link to this one):
+1. Inbound citation/edge count (how many other entries link to this one):
 ```bash
 grep -c '"target":"sources/<slug>"' wiki/graph/edges.jsonl 2>/dev/null || echo 0
 grep -c '"target":"sources/<slug>"' wiki/graph/citations.jsonl 2>/dev/null || echo 0
@@ -141,16 +172,42 @@ existing fields (url, authors, year, type, etc.).
 
 **Inference rubrics — apply these to decide values:**
 
-#### provenance (required on `sources`)
+#### provenance + raw_paths (required on `sources`)
+
+Use the following inference order. Stop at the first tier that yields a result.
+
+**Tier 1 (authoritative): read the ingest checkpoint.**
+
+```bash
+node _lumina/scripts/wiki.mjs checkpoint-read ingest <slug>
+```
+
+If a checkpoint exists with a `source_path` field:
+- If `source_path` is under `raw/tmp/*`: do NOT write `raw_paths`. Tell the user:
+  "`<slug>` was ingested from a transient location (`<source_path>`). Move the
+  file to `raw/sources/` or `raw/download/<resource>/` and re-run
+  `/lumi-migrate-legacy` to backfill `raw_paths` properly."
+  Set `provenance` to `partial` (if `urls` is non-empty) or `missing` (no `urls`).
+- Otherwise: set `raw_paths` to `[source_path]` and `provenance` to `replayable`.
+  Skip Tiers 2 and 3.
+
+**Tier 2 (heuristic): scan raw/ for matching files.**
+
+- Slug-prefix match: `raw/sources/<slug>*`, `raw/notes/<slug>*`, or
+  `raw/download/<resource>/<slug>*`
+- URL-derived ID match: extract arxiv ID, DOI, or URL basename from the page's
+  `urls` array (or legacy `url` string for backward compat); scan `raw/sources/`,
+  `raw/notes/`, `raw/download/**` for filenames containing that ID.
+- Research-pack flow: also scan `raw/discovered/<topic>/<id>.json` for a JSON
+  whose `id` or `url` matches any entry in the page's `urls` array.
+
+All non-`raw/tmp/` matches go into `raw_paths`. Set `provenance` to `replayable`
+if any match was found.
 
-
+**Tier 3 (fall back to urls heuristic): no checkpoint, no file match.**
 
--
-
-- `partial` — A `url` field is present but no raw snapshot was saved. Drift
-  detection works against the URL, but the full text cannot be re-grounded.
-- `missing` — No `url` field and no raw snapshot. Manual entry; verification
-  has nothing to grip on.
+- Has at least one entry in `urls`, no raw match → `partial` (leave `raw_paths` unset or `[]`)
+- Neither → `missing`
 
 #### confidence (optional-but-recommended on `sources` and `concepts`)
 
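A shell sketch of the Tier 2 scan described above — the slug and arxiv ID are illustrative (they reuse the worked example later in this skill), and any `raw/tmp/` hits must be excluded:

```bash
slug="attention-is-all-you-need"   # illustrative
id="1706.03762"                    # ID extracted from the page's urls[0], illustrative
# Slug-prefix match across permanent zones:
ls raw/sources/${slug}* raw/notes/${slug}* raw/download/*/${slug}* 2>/dev/null
# URL-derived-ID match on filenames:
ls raw/sources/*${id}* raw/notes/*${id}* raw/download/*/*${id}* 2>/dev/null
# Research pack: metadata JSON whose id/url mentions the ID:
grep -rl "${id}" raw/discovered/ 2>/dev/null
```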
@@ -178,11 +235,13 @@ After the read phase, produce an inference table:
 
 ```
 sources/attention-is-all-you-need:
-
+  raw_paths: ["raw/sources/attention-is-all-you-need.pdf"] (Tier 1: checkpoint source_path)
+  provenance: replayable (raw_paths non-empty, file exists)
   confidence: high (7 inbound citations)
 
 sources/lora-2021:
-
+  raw_paths: [] (Tier 3: url present, no file match)
+  provenance: partial (url present, no resolvable raw_paths)
   confidence: unverified (0 inbound edges, no cross-checks)
 
 concepts/softmax-temperature:
@@ -197,8 +256,15 @@ For each entry in the inference table, set each missing field:
 node _lumina/scripts/wiki.mjs set-meta <slug> <key> "<value>"
 ```
 
+For `raw_paths` (an array field), pass a JSON array with `--json-value`:
+
+```bash
+node _lumina/scripts/wiki.mjs set-meta sources/<slug> raw_paths '["raw/sources/foo.pdf"]' --json-value
+```
+
 Examples:
 ```bash
+node _lumina/scripts/wiki.mjs set-meta sources/attention-is-all-you-need raw_paths '["raw/sources/attention-is-all-you-need.pdf"]' --json-value
 node _lumina/scripts/wiki.mjs set-meta sources/attention-is-all-you-need provenance replayable
 node _lumina/scripts/wiki.mjs set-meta sources/attention-is-all-you-need confidence high
 node _lumina/scripts/wiki.mjs set-meta sources/lora-2021 provenance partial
@@ -209,6 +275,42 @@ node _lumina/scripts/wiki.mjs set-meta concepts/softmax-temperature confidence m
 `set-meta` is atomic (temp + fsync + rename) and idempotent — calling it twice
 with the same value is a no-op. It is safe to re-run this phase.
 
+**Schema-shape upgrade — `url` → `urls` (v0.9+):**
+
+For every source page that has a top-level `url:` key (singular string) in frontmatter,
+rewrite it as `urls:` (array) and remove the old key. Preserve placement — keep `urls`
+where `url` was.
+
+```bash
+# Detect source pages that still have legacy url: (singular)
+node _lumina/scripts/wiki.mjs list-entities | node -e "
+const lines=require('fs').readFileSync('/dev/stdin','utf8').trim().split('\n');
+const ents=lines.map(l=>{ try{return JSON.parse(l);}catch{return null;} }).filter(Boolean);
+ents.filter(e=>e.type==='sources').forEach(e=>console.log(e.slug));
+" | while read slug; do
+  node _lumina/scripts/wiki.mjs read-meta "$slug" | node -e "
+const m=JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
+if(m.url && !m.urls) console.log(process.argv[1]);
+" "$slug"
+done
+```
+
+For each slug with a legacy `url:` field:
+
+```bash
+# Step 1 — read current url value
+URL=$(node _lumina/scripts/wiki.mjs read-meta sources/<slug> | node -e "process.stdout.write(JSON.parse(require('fs').readFileSync('/dev/stdin','utf8')).url)")
+
+# Step 2 — write urls array
+node _lumina/scripts/wiki.mjs set-meta sources/<slug> urls "[\"$URL\"]" --json-value
+
+# Step 3 — remove the legacy url key
+node _lumina/scripts/wiki.mjs set-meta sources/<slug> url --remove
+```
+
+If `set-meta --remove` is not supported by the installed wiki.mjs version, use `Edit` to
+remove the `url:` line directly after confirming `urls:` was written successfully.
+
 After backfilling all entries, proceed immediately to Phase 4.
 
 ### Phase 4 — Verify
@@ -216,12 +318,31 @@ After backfilling all entries, proceed immediately to Phase 4.
 **Step 4.1 — Re-run lint.**
 
 ```bash
-node _lumina/scripts/lint.mjs --
+node _lumina/scripts/lint.mjs --summary
 ```
 
-Confirm
+Confirm `errors === 0`. If you need to inspect remaining findings, re-run with
+`--json > /tmp/lumi-lint.json` and project as in Step 1.2 — never parse full
+`--json` from inline stdout on a large wiki. L11 warnings for
 entries you set `confidence` on should also be gone.
 
+Check for L12 warnings explicitly and surface them to the user:
+
+```bash
+node -e "
+const j=JSON.parse(require('fs').readFileSync('/tmp/lumi-lint.json','utf8'));
+const l12=j.findings.filter(f=>f.id.startsWith('L12-'))
+  .map(f=>({file:f.file,message:f.message}));
+if(l12.length) console.log('L12 raw_paths drift:\n'+JSON.stringify(l12,null,2));
+else console.log('No L12 warnings.');
+"
+```
+
+L12 warnings mean one or more `raw_paths` entries point to files that do not
+exist or are under `raw/tmp/`. Treat these as follow-up action items for the
+user — the migration is not blocked, but the `raw_paths` value is inaccurate
+until the referenced file is located or the entry is corrected.
+
 If any L01 errors remain:
 - Read the finding message — it names the exact field still missing.
 - Return to Phase 2 and infer a value for that field.
@@ -290,7 +411,7 @@ node _lumina/scripts/lint.mjs --json
 
 # Phase 2 — for each source:
 node _lumina/scripts/wiki.mjs read-meta sources/attention-is-all-you-need
-# → {
+# → { urls: ["https://arxiv.org/abs/1706.03762"], ... }
 ls raw/sources/attention-is-all-you-need*
 # → raw/sources/attention-is-all-you-need.pdf (found)
 # → infer: provenance = replayable
package/src/skills/packs/research/discover/SKILL.md
CHANGED
@@ -69,12 +69,21 @@ python3 _lumina/tools/discover.py --help
 8. Present a checkpointed shortlist with title, authors/year, URL or identifier,
    `_score`, rationale, duplicate status, and recommended next action.
 
-
-
-
-
-
-
+   Discover writes JSON metadata to `raw/discovered/<topic>/<id>.json`. It does
+   NOT fetch PDFs — full-text download happens at ingest time via `/lumi-ingest`
+   Mode B, which calls `fetch_pdf.py` and places the PDF at
+   `raw/download/<resource>/<id>.<ext>`.
+
+   For each candidate, include a suggested `provenance` value (advisory — the
+   actual value is set by `/lumi-ingest` once the PDF is fetched). This helps
+   the user plan which sources are immediately accessible:
+   - `replayable` — abstract + full text both fetchable; `/lumi-ingest` will
+     download the PDF to `raw/download/` and resolve `raw_paths` at ingest time.
+   - `partial` — only abstract or metadata available (closed-access paper); no
+     full-text PDF reachable. `/lumi-ingest` will set `raw_paths` from the
+     metadata JSON only.
+   - `missing` — no URL; metadata only (e.g. a manually entered title). Nothing
+     to fetch; ingest will result in `provenance: missing`.
 
 9. Ask the user which candidates should be ingested. Do not create source pages
    or graph edges in this skill.
@@ -88,6 +97,8 @@ python3 _lumina/tools/discover.py --help
   `init_discovery.py`.
 - Do not include any non-FR35 workflows such as ideation, LaTeX writing, or
   orchestrator mode.
+- Do not download PDFs. Discover writes metadata JSON to `raw/discovered/` only.
+  PDF fetching is `/lumi-ingest`'s job (Mode B, via `_lumina/tools/fetch_pdf.py`).
 
 ## Definition of Done
 
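A sketch of the hand-off this rule implies — discover leaves metadata and a shortlist checkpoint; fetching happens later in `/lumi-ingest` Mode B. The exact shape of shortlist entries is not shown here, so the final URL is a placeholder:

```bash
# Inspect the shortlist discover produced (read-only):
node _lumina/scripts/wiki.mjs checkpoint-read research-discover shortlist
# Later, at ingest time, Mode B downloads the chosen candidate's full text:
python3 _lumina/tools/fetch_pdf.py "<url-from-chosen-shortlist-entry>"
```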
package/src/templates/README.md
CHANGED
|
@@ -44,12 +44,16 @@ Keep this mental map in immediate context:
 - `raw/sources/` — `.pdf`, `.tex`, `.html`, `.md`, transcripts, anything ingested
 - `raw/notes/` — user's own markdown notes
 - `raw/assets/` — images and binary attachments
-- `raw/tmp/` — sidecar files generated by skills (
+- `raw/tmp/` — sidecar files generated by skills (transient; do not store canonical sources here)
+- `raw/download/<resource>/` — full-text artifacts auto-fetched by skills, partitioned by source
+  (e.g. `raw/download/arxiv/2604.03501v2.pdf`, `raw/download/doi/<doi>.pdf`).
+  Permanent agent-writable zone — keep separate from `raw/sources/` (human-curated).
 {{#if pack_research}}
-- `raw/discovered
+- `raw/discovered/<topic>/` — metadata JSON candidates from research-pack discovery
+  (additions only, pack: research). Holds `<paper-id>.json`; full-text PDFs go in `raw/download/`.
 {{/if}}
 
-**Rule:** never modify or delete an existing file under `raw/`. Files added by the user are authoritative and immutable to the agent. New files may only be *added*, only by a skill that documents this behavior, and only into `raw/tmp/`{{#if pack_research}} or `raw/discovered/`{{/if}}. Every other path under `raw/` is read-only.
+**Rule:** never modify or delete an existing file under `raw/`. Files added by the user are authoritative and immutable to the agent. New files may only be *added*, only by a skill that documents this behavior, and only into `raw/tmp/`, `raw/download/`{{#if pack_research}}, or `raw/discovered/`{{/if}}. Every other path under `raw/` is read-only.
 
 ### `.agents/` is the skill source of truth
 
@@ -60,7 +64,7 @@ Keep this mental map in immediate context:
 - `_lumina/config/lumina.config.yaml` — workspace config; editable
 - `_lumina/schema/` — deeper reference docs; open when this file points you there
 - `_lumina/scripts/` — Node engine (`wiki.mjs`, `lint.mjs`, `reset.mjs`, `schemas.mjs`)
-- `_lumina/tools/` — Python tools (always: `extract_pdf.py`, `requirements.txt`{{#if pack_research}}; research pack adds `_env.py`, `prepare_source.py`, `init_discovery.py`, `discover.py`, and fetcher tools{{/if}})
+- `_lumina/tools/` — Python tools (always: `extract_pdf.py`, `fetch_pdf.py`, `requirements.txt`{{#if pack_research}}; research pack adds `_env.py`, `prepare_source.py`, `init_discovery.py`, `discover.py`, and fetcher tools{{/if}})
 - `_lumina/_state/` — installer/skill checkpoint state; gitignored
 - `_lumina/manifest.json` — installer state; never edit by hand
 
@@ -190,6 +194,7 @@ Adds `/lumi-reading-chapter-ingest` (file a chapter, update characters/themes/pl
 - **`_lumina/scripts/wiki.mjs`** — wiki engine (frontmatter, graph mutation, slug, log).
 - **`_lumina/scripts/reset.mjs`** — scoped destructive reset.
 - **`_lumina/tools/extract_pdf.py`** — PDF text extractor (pypdf-based); used by `/lumi-ingest` and `/lumi-reading-chapter-ingest` when the host IDE cannot read PDFs natively.
+- **`_lumina/tools/fetch_pdf.py`** — URL → `raw/download/<resource>/` PDF downloader (streaming, atomic, idempotent); used by `/lumi-ingest` Mode B when the input is a URL or paper identifier.
 - **`_lumina/tools/requirements.txt`** — Python dependencies for bundled tools. Run `pip install -r _lumina/tools/requirements.txt` when a tool reports a missing package.
 {{#if pack_research}}- **`_lumina/tools/_env.py`** — shared `.env` loader for research tools.
 - **`_lumina/tools/prepare_source.py`** — normalizes local source files into tool-readable JSON.
package/src/tools/fetch_pdf.py
ADDED
@@ -0,0 +1,416 @@
"""
fetch_pdf.py — Download a PDF from a URL into the workspace landing zone.

CLI:
    python fetch_pdf.py <url> [--project-root PATH] [--filename NAME] [--force]

Output (stdout, single JSON object on success):
    {
      "url": "<input url>",
      "resolved_url": "<final url after redirects/normalization>",
      "resource": "arxiv|doi|s2|web",
      "id": "<extracted id>",
      "path": "raw/download/arxiv/2604.03501v2.pdf",
      "size_bytes": 12345,
      "sha256": "<hex>",
      "skipped": false
    }

Errors emitted to stderr as JSON; exit codes:
    0 success (or skipped due to existing file)
    2 user error (empty url, malformed url, path traversal, HTML response)
    3 transient error (network failure, HTTP 5xx, timeout)

No API key required. All network calls use requests.Session().
Landing zone: raw/download/<resource>/<filename>
    resource = arxiv | doi | s2 | web
"""

from __future__ import annotations

import argparse
import hashlib
import json
import os
import re
import sys
import tempfile
from pathlib import Path
from typing import Any
from urllib.parse import urlparse

import requests

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------

USER_AGENT = "lumina-wiki/0.1 (research-pack; pdf fetcher)"
REQUEST_TIMEOUT = 60
MIN_PDF_SIZE = 100  # bytes — smaller responses are likely error pages
CHUNK_SIZE = 65536  # 64 KB chunks for streaming download

# Windows-illegal characters in filenames
_WIN_ILLEGAL_RE = re.compile(r'[<>:"/\\|?*]')

# Resource detection patterns — compiled once at module level
_ARXIV_ABS_RE = re.compile(
    r"arxiv\.org/abs/([0-9]{4}\.[0-9]{4,5}(?:v\d+)?)", re.IGNORECASE
)
_ARXIV_PDF_RE = re.compile(
    r"arxiv\.org/pdf/([0-9]{4}\.[0-9]{4,5}(?:v\d+)?)(?:\.pdf)?$", re.IGNORECASE
)
_DOI_RE = re.compile(r"(?:dx\.)?doi\.org/(.+)", re.IGNORECASE)
_S2_RE = re.compile(r"semanticscholar\.org/paper/([^/?#]+)", re.IGNORECASE)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _err_json(msg: str, code: int) -> None:
    """Print a JSON error to stderr."""
    print(json.dumps({"error": msg, "code": code}), file=sys.stderr)


def _sha16_url(url: str) -> str:
    """First 16 hex chars of SHA256 of URL — used as web resource ID."""
    return hashlib.sha256(url.encode()).hexdigest()[:16]


def _sanitize_filename(name: str) -> str:
    """Remove Windows-illegal characters from a filename."""
    return _WIN_ILLEGAL_RE.sub("_", name)


def _safe_path(base: Path, rel: str, label: str) -> Path:
    """Resolve rel under base; reject '..', absolute, or escaping paths."""
    rel_path = Path(rel)
    if rel_path.is_absolute():
        _err_json(f"{label} must be a relative path, got: {rel}", 2)
        sys.exit(2)
    if ".." in rel_path.parts:
        _err_json(f"{label} contains '..': {rel}", 2)
        sys.exit(2)
    resolved = (base / rel_path).resolve()
    try:
        resolved.relative_to(base.resolve())
    except ValueError:
        _err_json(f"{label} escapes base directory: {rel}", 2)
        sys.exit(2)
    return resolved


# ---------------------------------------------------------------------------
# Resource detection
# ---------------------------------------------------------------------------

def detect_resource(url: str) -> tuple[str, str, str]:
    """Detect resource type and ID from URL.

    Returns:
        (resource, id, resolved_pdf_url)

    resource is one of: arxiv, doi, s2, web
    """
    url = url.strip()

    # arxiv abs
    m = _ARXIV_ABS_RE.search(url)
    if m:
        arxiv_id = m.group(1)
        pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
        return "arxiv", arxiv_id, pdf_url

    # arxiv pdf
    m = _ARXIV_PDF_RE.search(url)
    if m:
        arxiv_id = m.group(1)
        pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
        return "arxiv", arxiv_id, pdf_url

    # DOI
    m = _DOI_RE.search(url)
    if m:
        doi_raw = m.group(1).rstrip("/")
        doi_id = doi_raw.replace("/", "-")
        return "doi", doi_id, url

    # Semantic Scholar
    m = _S2_RE.search(url)
    if m:
        s2_id = m.group(1)
        return "s2", s2_id, url

    # Web fallback
    sha16 = _sha16_url(url)
    return "web", sha16, url


def _derive_filename(resource: str, id_: str, content_type: str = "") -> str:
    """Derive a default filename from resource/id.

    For 'web', probes content_type for extension; defaults to .pdf.
    """
    if resource in ("arxiv", "doi", "s2"):
        return _sanitize_filename(id_) + ".pdf"
    # web
    ext = ".pdf"
    if content_type:
        ct = content_type.lower().split(";")[0].strip()
        if "octet-stream" in ct or "pdf" in ct:
            ext = ".pdf"
        # If it's something else we still default to .pdf per spec
    return _sanitize_filename(id_) + ext


# ---------------------------------------------------------------------------
# Session factory
# ---------------------------------------------------------------------------

def _make_session() -> requests.Session:
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    return session


# ---------------------------------------------------------------------------
# Core download function
# ---------------------------------------------------------------------------

def fetch_pdf(
    url: str,
    project_root: Path,
    filename: str | None = None,
    force: bool = False,
    session: requests.Session | None = None,
) -> dict[str, Any]:
    """Download a PDF from url into raw/download/<resource>/<filename>.

    Args:
        url: The source URL (arxiv abs/pdf, doi, s2, or generic web URL).
        project_root: Absolute path to the project root.
        filename: Override output filename (sanitized). If None, derived from resource/id.
        force: If True, overwrite existing file. If False, skip if exists.
        session: Optional requests.Session for connection reuse.

    Returns:
        Result dict (see module docstring).

    Raises:
        ValueError: on user errors (empty url, content-type mismatch, path traversal).
        RuntimeError: on transient errors (network, HTTP 5xx, timeout).
        requests.RequestException: on low-level network failure (caller re-raises).
    """
    url = url.strip()
    if not url:
        raise ValueError("url must not be empty")

    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"malformed url (no scheme or host): {url!r}")

    resource, res_id, resolved_url = detect_resource(url)

    sess = session or _make_session()

    if filename is not None:
        out_filename = _sanitize_filename(filename)
        if not out_filename:
            raise ValueError(f"--filename becomes empty after sanitization: {filename!r}")
    else:
        out_filename = None

    rel_dir = f"raw/download/{resource}"
    out_dir = _safe_path(project_root, rel_dir, "output directory")

    if out_filename is None:
        out_filename = _derive_filename(resource, res_id)

    if "/" in out_filename or "\\" in out_filename or ".." in out_filename:
        raise ValueError(f"filename contains path separators or '..': {out_filename!r}")

    out_path = out_dir / out_filename

    # Idempotency: skip if exists and not --force
    if out_path.exists() and not force:
        return {
            "url": url,
            "resolved_url": resolved_url,
            "resource": resource,
            "id": res_id,
            "path": str(out_path.relative_to(project_root)),
            "size_bytes": out_path.stat().st_size,
            "sha256": _sha256_file(out_path),
            "skipped": True,
            "reason": "exists",
        }

    # Streaming download
    resp = sess.get(resolved_url, timeout=REQUEST_TIMEOUT, allow_redirects=True, stream=True)

    if resp.status_code >= 500:
        raise RuntimeError(f"HTTP {resp.status_code} from server")
    if resp.status_code == 404:
        raise ValueError(f"HTTP 404: resource not found at {resolved_url}")
    if resp.status_code >= 400:
        raise ValueError(f"HTTP {resp.status_code} from server")
    resp.raise_for_status()

    content_type = resp.headers.get("Content-Type", "")

    # For 'web' resource, refine the filename extension from content-type
    if resource == "web" and filename is None:
        out_filename = _derive_filename(resource, res_id, content_type)
        out_path = out_dir / out_filename

    ct_lower = content_type.lower().split(";")[0].strip()
    url_ends_pdf = resolved_url.lower().endswith(".pdf")
    is_pdf = ct_lower.startswith("application/pdf") or url_ends_pdf

    if not is_pdf and ct_lower.startswith("text/html"):
        raise ValueError(
            f"expected PDF but server returned HTML (Content-Type: {content_type}); "
            f"URL may be a landing page rather than a direct PDF link"
        )

    # Atomic write: temp + streaming + fsync + rename; SHA256 computed during download
    out_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_path_str = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    hasher = hashlib.sha256()
    size = 0
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                if chunk:
                    f.write(chunk)
                    hasher.update(chunk)
                    size += len(chunk)
            f.flush()
            os.fsync(f.fileno())
    except Exception:
        try:
            os.unlink(tmp_path_str)
        except OSError:
            pass
        raise

    if size < MIN_PDF_SIZE:
        try:
            os.unlink(tmp_path_str)
        except OSError:
            pass
        raise ValueError(
            f"downloaded content is too small ({size} bytes < {MIN_PDF_SIZE}); "
            f"likely an error page rather than a real PDF"
        )

    os.replace(tmp_path_str, out_path)

    return {
        "url": url,
        "resolved_url": resp.url,
        "resource": resource,
        "id": res_id,
        "path": str(out_path.relative_to(project_root)),
        "size_bytes": size,
        "sha256": hasher.hexdigest(),
        "skipped": False,
    }


def _sha256_file(path: Path) -> str:
    """Compute SHA256 of an existing file."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def main(argv: list[str] | None = None) -> None:
    parser = argparse.ArgumentParser(
        prog="fetch_pdf.py",
        description=(
            "Download a PDF from a URL into raw/download/<resource>/<filename>. "
            "Detects arxiv, DOI, Semantic Scholar, and generic web URLs."
        ),
    )
    parser.add_argument("url", help="URL of the PDF to download.")
    parser.add_argument(
        "--project-root", default=None,
        help="Project root directory (default: current directory).",
    )
    parser.add_argument(
        "--filename", default=None,
        help="Override output filename (default: derived from resource/id).",
    )
    parser.add_argument(
        "--force", action="store_true",
        help="Re-download and overwrite if file already exists.",
    )

    args = parser.parse_args(argv)

    if not args.url or not args.url.strip():
        _err_json("url must not be empty", 2)
        sys.exit(2)

    project_root = (
        Path(args.project_root).resolve()
        if args.project_root
        else Path.cwd().resolve()
    )

    if args.filename is not None:
        fn = args.filename
        if "/" in fn or "\\" in fn or ".." in fn:
            _err_json(
                f"--filename must be a plain filename (no path separators or '..'): {fn!r}",
                2,
            )
            sys.exit(2)

    session = _make_session()

    try:
        result = fetch_pdf(
            url=args.url,
            project_root=project_root,
            filename=args.filename,
            force=args.force,
            session=session,
        )
        print(json.dumps(result, ensure_ascii=False, indent=2))
        sys.exit(0)

    except ValueError as exc:
        _err_json(str(exc), 2)
        sys.exit(2)
    except requests.exceptions.ConnectionError as exc:
        _err_json(f"Network error: {exc}", 3)
        sys.exit(3)
    except requests.exceptions.Timeout:
        _err_json("Request timed out while downloading PDF.", 3)
        sys.exit(3)
    except requests.exceptions.HTTPError as exc:
        code = exc.response.status_code if exc.response is not None else "unknown"
        _err_json(f"HTTP error {code} while downloading PDF.", 3)
        sys.exit(3)
    except RuntimeError as exc:
        _err_json(str(exc), 3)
        sys.exit(3)
    except OSError as exc:
        _err_json(f"I/O error: {exc}", 3)
        sys.exit(3)
    except Exception as exc:  # noqa: BLE001
        _err_json(f"Internal error: {exc}", 3)
        sys.exit(3)


if __name__ == "__main__":
    main()