lumina-wiki 1.2.0 → 1.3.0
- package/CHANGELOG.md +31 -0
- package/README.md +6 -1
- package/README.vi.md +6 -1
- package/README.zh.md +6 -1
- package/bin/lumina.js +51 -9
- package/package.json +3 -3
- package/src/installer/prompts.js +9 -9
- package/src/tools/prepare_source.py +154 -5
- package/src/tools/requirements.txt +8 -0
package/CHANGELOG.md
CHANGED
@@ -3,6 +3,37 @@
 All notable changes to Lumina-Wiki are documented here.
 Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+## [1.3.0] - 2026-05-09
+
+### Added — Local text-document ingestion (research pack)
+
+- `prepare_source.py` (research pack tool) now supports `.docx`, `.rtf`, and
+  `.epub` in addition to the existing PDF / TeX / HTML / Markdown formats.
+- Hardened against zip-bomb (raw size cap + decompressed total cap) and XXE
+  / XML billion-laughs (`defusedxml.defuse_stdlib()`) for ZIP-of-XML formats
+  (`.docx`, `.epub`).
+- DRM-protected EPUB detection: explicit error with hint instead of an
+  opaque parse crash. Lumina does not strip DRM.
+
+### Requirements
+
+- The new format support requires the **research pack**:
+  `lumina install --packs core,research`. After install run
+  `pip install -r _lumina/tools/requirements.txt` to fetch
+  `python-docx`, `striprtf`, `ebooklib`, `beautifulsoup4`, and `defusedxml`.
+- Missing libs raise an actionable `ValueError` (CLI exit 2) with the
+  `pip install …` hint — no silent empty-text writes.
+
+### Known Limitations
+
+- `.docx`: shapes, text boxes, headers/footers, table cells not extracted.
+- `.rtf`: table layout and embedded images discarded.
+- `.epub`: images, CSS, footnotes, and cross-references discarded; chapter-
+  level segmentation is **not** emitted in v1 — it will land alongside
+  `/lumi-chapter-ingest` EPUB support in a future release.
+- `.odt`, image (`.png`, `.jpg`) and scanned-PDF ingestion remain out of
+  scope. See the roadmap entry "Vision/OCR ingestion" for the follow-up.
+
 ## [1.2.0] - 2026-05-07
 
 ### Added
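The changelog entry on missing libraries (an actionable `ValueError`, surfaced as CLI exit 2) can be sketched as below. This is a minimal illustration, not the code shipped in `prepare_source.py`: the helper names `require_module` and `run` are hypothetical, and the way the real `main()` turns a `ValueError` into exit code 2 is assumed rather than shown in this diff.

```python
import importlib
import sys


def require_module(module_name: str, pip_name: str):
    """Import an optional dependency, or raise ValueError with an install hint.

    Hypothetical helper: the diff inlines this pattern in each extractor.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ValueError(
            f"{pip_name} required. Install: pip install {pip_name}. "
            f"(Underlying: {exc})"
        ) from exc


def run(module_name: str, pip_name: str) -> int:
    """Hypothetical CLI wrapper: print the hint to stderr and return exit code 2."""
    try:
        require_module(module_name, pip_name)
    except ValueError as err:
        print(f"[error] {err}", file=sys.stderr)
        return 2  # assumed mapping, following "CLI exit 2" in the changelog
    return 0
```

Raising instead of returning an empty string is what prevents the "silent empty-text writes" the changelog mentions: the caller cannot mistake a missing dependency for an empty document.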
package/README.md
CHANGED
@@ -199,7 +199,8 @@ Lumina-Wiki is evolving rapidly. Here is our user-facing roadmap:
 **Near-term (Stability & New Ingestion)**
 - [ ] **`/lumi-help` Skill:** A smart assistant to help you learn and use Lumina-Wiki instantly.
 - [x] **Multilingual setup:** Choose English, Vietnamese, or Chinese as your primary language during install. *(shipped in v1.2)*
-- [
+- [x] **Native DOCX, RTF & EPUB ingestion:** Pull Word, Rich Text, and EPUB books straight into your wiki via the research pack. *(shipped in v1.x)*
+- [ ] **Image OCR & Scanned PDFs:** Ingest screenshots and scanned PDFs into your wiki.
 - [ ] **Advanced Paper Ranking:** See influence scores and quality signals for your research papers.
 - [x] **Improved CI/CD:** Native support for Bun and Node 22 environments. *(shipped in v1.2)*
 
@@ -221,6 +222,10 @@ Lumina-Wiki is evolving rapidly. Here is our user-facing roadmap:
 
 ## 7. Contributing & License
 
+### CLI Contract
+
+CI scripts and integrations should reference [`docs/cli-contract.md`](./docs/cli-contract.md) for the v1.x stable flag list and exit code mapping. Anything not listed there is internal and may change without notice.
+
 ### Local Development (for contributors)
 
 If you want to contribute to the `lumina-wiki` installer:
package/README.vi.md
CHANGED
@@ -199,7 +199,8 @@ Lumina-Wiki đang phát triển nhanh chóng. Dưới đây là lộ trình hư
 **Sắp tới (Ổn định & Mở rộng nạp tài liệu)**
 - [ ] **Kỹ năng `/lumi-help`:** Trợ lý thông minh giúp bạn học và sử dụng Lumina-Wiki tức thì.
 - [x] **Cài đặt đa ngôn ngữ:** Chọn Tiếng Anh, Tiếng Việt hoặc Tiếng Trung làm ngôn ngữ chính khi cài đặt. *(đã phát hành trong v1.2)*
-- [
+- [x] **Nạp DOCX, RTF & EPUB native:** Đưa thẳng file Word, Rich Text và sách EPUB vào wiki qua research pack. *(đã phát hành trong v1.x)*
+- [ ] **OCR ảnh & PDF scan:** Nạp ảnh chụp màn hình và PDF dạng scan vào wiki.
 - [ ] **Xếp hạng bài báo nâng cao:** Xem điểm số ảnh hưởng và tín hiệu chất lượng cho các nghiên cứu của bạn.
 - [x] **Cải thiện CI/CD:** Hỗ trợ chính thức cho môi trường Bun và Node 22. *(đã phát hành trong v1.2)*
 
@@ -221,6 +222,10 @@ Lumina-Wiki đang phát triển nhanh chóng. Dưới đây là lộ trình hư
 
 ## 7. Đóng góp & Giấy phép
 
+### Hợp đồng CLI
+
+Script CI và tích hợp nên tham chiếu [`docs/cli-contract.md`](./docs/cli-contract.md) để biết danh sách cờ ổn định và mapping exit code cho v1.x. Bất cứ thứ gì không liệt kê trong đó đều là nội bộ và có thể đổi mà không báo trước.
+
 ### Phát triển cục bộ (dành cho người đóng góp)
 
 Nếu bạn muốn đóng góp cho trình cài đặt `lumina-wiki`:
package/README.zh.md
CHANGED
@@ -200,7 +200,8 @@ Lumina-Wiki 正在快速演进。这是我们的用户路线图:
 **近期计划(稳定性与新导入支持)**
 - [ ] **`/lumi-help` 技能:** 智能助手,帮您即时学习和使用 Lumina-Wiki。
 - [x] **多语言安装:** 安装时可选英文、越南文或中文作为主语言。*(v1.2 已发布)*
-- [
+- [x] **原生 DOCX、RTF 与 EPUB 导入:** 通过 research pack 将 Word、Rich Text 与 EPUB 电子书直接导入维基。*(v1.x 已发布)*
+- [ ] **图片 OCR 与扫描 PDF:** 将截图与扫描版 PDF 导入维基。
 - [ ] **高级论文排名:** 查看研究论文的影响力评分和质量信号。
 - [x] **改进的 CI/CD:** 正式支持 Bun 和 Node 22 环境。*(v1.2 已发布)*
 
@@ -223,6 +224,10 @@ Lumina-Wiki 正在快速演进。这是我们的用户路线图:
 
 ## 7. 贡献与许可
 
+### CLI 契约
+
+CI 脚本和集成应参考 [`docs/cli-contract.md`](./docs/cli-contract.md) 了解 v1.x 稳定标志列表和退出码映射。未在其中列出的任何内容均为内部,可能在不另行通知的情况下更改。
+
 ### 本地开发(贡献者)
 
 如果您想为 `lumina-wiki` 安装器做贡献:
package/bin/lumina.js
CHANGED
@@ -12,14 +12,15 @@
  *
  * Flags (all commands):
  * --directory <path> — installation directory (defaults to current directory)
- * --cwd <path> —
+ * --cwd <path> — [deprecated] alias for --directory; removed in v2.0
  * --yes, -y — accept all defaults (non-interactive / CI)
  * --no-update — skip npm registry version check
  * --re-link — recompute symlink/junction/copy strategy
  * --packs <list> — comma-separated pack list for non-interactive install
  * --ide-targets <list> — comma-separated IDE target list for non-interactive install
  *
- * Exit codes: 0 success, 1 user error, 2 filesystem
+ * Exit codes: 0 success, 1 user error, 2 filesystem/safety, 3 internal/network,
+ * 4 user cancelled (Ctrl-C in interactive prompt or declined confirm)
  */
 
 import { createRequire } from 'node:module';
@@ -76,6 +77,39 @@ if (handledVersion) process.exit(0);
 const { Command, Option } = await import('commander');
 const program = new Command();
 
+// Exit code contract (see docs/planning-artifacts/audits/cli-contract-audit.md
+// and `--help` text below). Caught errors map as follows:
+// - RangeError (from safePath) → 2 (path safety)
+// - err.code in {EACCES, EPERM} → 2 (filesystem perms)
+// - err.code === 2 / err.code === 3 → preserved
+// - other string fs codes (E*) → 3 (internal/io: ENOENT, EBUSY, EIO,
+// EROFS, ENOSPC, ENOTDIR, …)
+// - everything else → 1 (user error)
+// ---------------------------------------------------------------------------
+function exitCodeFor(err, defaultCode = 1) {
+  if (err instanceof RangeError) return 2;
+  if (err.code === 'EACCES' || err.code === 'EPERM') return 2;
+  if (err.code === 2) return 2;
+  if (err.code === 3) return 3;
+  if (typeof err.code === 'string' && err.code.startsWith('E')) return 3;
+  return defaultCode;
+}
+
+// ---------------------------------------------------------------------------
+// Deprecation warnings — emitted to stderr once per invocation.
+// Source of truth: docs/cli-contract.md.
+// ---------------------------------------------------------------------------
+let _cwdWarned = false;
+function warnDeprecatedCwdIfUsed(cmdOpts, globalOpts) {
+  if (_cwdWarned) return;
+  if (cmdOpts.cwd != null || globalOpts.cwd != null) {
+    process.stderr.write(
+      '[deprecated] --cwd is deprecated and will be removed in v2.0. Use --directory instead.\n'
+    );
+    _cwdWarned = true;
+  }
+}
+
 program
   .name('lumina')
   .description('Lumina Wiki — domain-agnostic, multi-IDE wiki scaffolder')
@@ -84,8 +118,9 @@ program
 Exit codes:
   0 success
   1 user error (bad flag, missing prereq)
-  2 filesystem
-  3
+  2 filesystem / safety (permission denied, path outside cwd, unknown pack slug)
+  3 internal / network (atomicWrite failure, 5xx, upgrade incompatibility, lint catastrophic)
+  4 user cancelled (Ctrl-C in interactive prompt or declined confirm)
 
 Flags applicable to all commands:
   --directory <path> installation directory (defaults to current directory)
@@ -117,6 +152,13 @@ program
   .option('--no-update', 'skip npm registry version check')
   .option('--re-link', 'recompute symlink strategy from current platform capabilities');
 
+// Single source of truth for --cwd deprecation: fires once before any
+// subcommand action regardless of whether --cwd was passed globally or
+// per-command. New subcommands inherit this for free.
+program.hook('preAction', (_thisCommand, actionCommand) => {
+  warnDeprecatedCwdIfUsed(actionCommand.opts(), program.opts());
+});
+
 // ---------------------------------------------------------------------------
 // --version / -v — print immediately then do async update check
 // ---------------------------------------------------------------------------
@@ -169,11 +211,9 @@ program
     } catch (err) {
       // Top-level catch: locale may not be resolved yet (pre-loadLocale path).
       // Error strings kept as EN literals — machine-readable, intentionally exempt.
-      const isPermError = err.code === 'EACCES' || err.code === 'EPERM';
-      const isRangeError = err instanceof RangeError;
       console.error(`[error] ${err.message}`);
       if (process.env.DEBUG) console.error(err.stack);
-      process.exit(
+      process.exit(exitCodeFor(err));
     }
   });
 
@@ -200,7 +240,7 @@ program
     } catch (err) {
       console.error(`[error] ${err.message}`);
       if (process.env.DEBUG) console.error(err.stack);
-      process.exit(
+      process.exit(exitCodeFor(err));
     }
   });
 
@@ -235,7 +275,9 @@ discover
     } catch (err) {
       console.error(`[error] ${err.message}`);
       if (process.env.DEBUG) console.error(err.stack);
-
+      // Unhandled exceptions from discover-runner are by definition not user
+      // errors (main() handles those), so default unknown → 3 (internal).
+      process.exit(exitCodeFor(err, 3));
     }
   });
 
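For CI scripts that consume the exit-code contract above, a small lookup table may be enough. The sketch below copies the five meanings from the `--help` text in this diff. The helper names `describe_exit` and `run_lumina` are hypothetical, and `run_lumina` assumes a `lumina` executable is on `PATH`.

```python
import subprocess

# Meanings copied from the `--help` text in this diff (v1.3.0).
EXIT_CODES = {
    0: "success",
    1: "user error",
    2: "filesystem / safety",
    3: "internal / network",
    4: "user cancelled",
}


def describe_exit(code: int) -> str:
    """Translate a lumina exit code into its documented meaning."""
    return EXIT_CODES.get(code, f"unknown exit code {code}")


def run_lumina(args: list[str]) -> tuple[int, str]:
    """Hypothetical wrapper: run `lumina` and return (exit code, meaning).

    Assumes the `lumina` binary is installed and on PATH.
    """
    result = subprocess.run(["lumina", *args], check=False)
    return result.returncode, describe_exit(result.returncode)
```

A CI job could call `run_lumina(["install", "--yes", "--packs", "core,research"])` and branch on the returned code, for example treating 4 (user cancelled) differently from 3 (internal / network).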
package/package.json
CHANGED
@@ -1,7 +1,7 @@
 {
   "$schema": "https://json.schemastore.org/package.json",
   "name": "lumina-wiki",
-  "version": "1.
+  "version": "1.3.0",
   "description": "Domain-agnostic, multi-IDE wiki scaffolder — Karpathy's LLM-Wiki vision, cross-platform and pack-based.",
   "keywords": [
     "llm-wiki",
@@ -83,9 +83,9 @@
   "devDependencies": {},
   "scripts": {
     "test": "npm run test:installer",
-    "test:installer": "node --test src/installer/commands.test.js src/installer/fs.test.js src/installer/locales.test.js src/installer/manifest.test.js src/installer/prompts.test.js src/installer/readme-templates.test.js src/installer/template-engine.test.js src/installer/update-check.test.js",
+    "test:installer": "node --test bin/lumina.flags.test.js bin/lumina.deprecations.test.js bin/lumina.cancel.test.js src/installer/commands.test.js src/installer/fs.test.js src/installer/locales.test.js src/installer/manifest.test.js src/installer/prompts.test.js src/installer/readme-templates.test.js src/installer/template-engine.test.js src/installer/update-check.test.js",
     "test:scripts": "node --test src/scripts/lint.test.mjs src/scripts/reset.test.mjs src/scripts/wiki.test.mjs src/scripts/discover-runner.test.mjs src/scripts/external-ids.test.mjs src/scripts/parse-ids.test.mjs src/scripts/merge-ids.test.mjs src/scripts/build-source.test.mjs src/scripts/wiki-yaml-object.test.mjs",
-    "test:python": "
+    "test:python": "node scripts/run-pytest.mjs",
     "test:all": "npm run test:installer && npm run test:scripts && npm run test:python",
     "test:fs": "node --test src/installer/fs.test.js",
     "test:manifest": "node --test src/installer/manifest.test.js",
package/src/installer/prompts.js
CHANGED
@@ -143,7 +143,7 @@ export function buildPromptList(existingManifest, defaultLocale = 'en') {
 /**
  * Run the five interactive install prompts.
  * Returns default answers immediately when `acceptDefaults` is true (--yes mode).
- * Calls process.exit(
+ * Calls process.exit(4) if the user cancels (Ctrl-C) or declines a confirm prompt.
  *
  * @param {object} [opts]
  * @param {boolean} [opts.acceptDefaults=false] - Skip prompts; return defaults.
@@ -174,7 +174,7 @@ export async function runInstallPrompts({ acceptDefaults = false, cwd = process.
   if (isCancel(localeRaw)) {
     // t may be EN or may not be loaded yet — use cancel string from t if available
     cancel(t ? t('prompt.cancelled') : 'Installation cancelled.');
-    process.exit(
+    process.exit(4);
   }
   const locale = localeRaw;
   const langDefault = LOCALE_LANGUAGE_NAME[locale] ?? 'English';
@@ -192,7 +192,7 @@ export async function runInstallPrompts({ acceptDefaults = false, cwd = process.
     });
     if (isCancel(proceed) || !proceed) {
       cancel(t ? t('prompt.cancelled') : 'Installation cancelled.');
-      process.exit(
+      process.exit(4);
     }
   }
 
@@ -203,7 +203,7 @@ export async function runInstallPrompts({ acceptDefaults = false, cwd = process.
     placeholder: cwdAbs,
     defaultValue: cwdAbs,
   });
-  if (isCancel(directoryRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(
+  if (isCancel(directoryRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(4); }
   const directory = expandUserPath(directoryRaw, cwdAbs);
   const projectName = defaultProjectName(directory);
 
@@ -212,7 +212,7 @@ export async function runInstallPrompts({ acceptDefaults = false, cwd = process.
     message: t ? t('prompt.purpose.message') : 'Research purpose (optional — describe what this wiki is for)',
     placeholder: t ? t('prompt.purpose.placeholder') : 'e.g. Track flash-attention variants for a survey',
   });
-  if (isCancel(researchPurposeRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(
+  if (isCancel(researchPurposeRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(4); }
   const researchPurpose = researchPurposeRaw || '';
 
   // ── Prompt 3: IDE targets ────────────────────────────────────────────────
@@ -230,7 +230,7 @@ export async function runInstallPrompts({ acceptDefaults = false, cwd = process.
     initialValues: ['claude_code'],
     required: false,
   });
-  if (isCancel(ideTargetsRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(
+  if (isCancel(ideTargetsRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(4); }
   const ideTargets = Array.isArray(ideTargetsRaw) && ideTargetsRaw.length > 0
     ? ideTargetsRaw
     : ['claude_code'];
@@ -244,7 +244,7 @@ export async function runInstallPrompts({ acceptDefaults = false, cwd = process.
     ],
     required: false,
   });
-  if (isCancel(packsRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(
+  if (isCancel(packsRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(4); }
   const selectedPacks = Array.isArray(packsRaw) ? packsRaw : [];
   const packs = ['core', ...selectedPacks.filter(p => p !== 'core')];
 
@@ -254,7 +254,7 @@ export async function runInstallPrompts({ acceptDefaults = false, cwd = process.
     placeholder: langDefault,
     defaultValue: langDefault,
   });
-  if (isCancel(communicationLangRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(
+  if (isCancel(communicationLangRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(4); }
   const communicationLang = communicationLangRaw || langDefault;
 
   const documentOutputLangRaw = await text({
@@ -262,7 +262,7 @@ export async function runInstallPrompts({ acceptDefaults = false, cwd = process.
     placeholder: langDefault,
     defaultValue: langDefault,
   });
-  if (isCancel(documentOutputLangRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(
+  if (isCancel(documentOutputLangRaw)) { cancel(t ? t('prompt.cancelled') : 'Installation cancelled.'); process.exit(4); }
   const documentOutputLang = documentOutputLangRaw || langDefault;
 
   return {
package/src/tools/prepare_source.py
CHANGED
@@ -1,8 +1,8 @@
 """
 prepare_source.py — Normalize a local file into an ingest-ready package.
 
-Accepts one local file (PDF, .tex, .html, .md
-output package at raw/tmp/<slug>/ containing:
+Accepts one local file (PDF, .tex, .html, .md, .txt, .docx, .rtf, .epub)
+and produces a deterministic output package at raw/tmp/<slug>/ containing:
   source.<ext> — original file (hard-link or copy)
   meta.json — extracted metadata (title, type, sha256, ext, slug, size)
   text.txt — extracted plain text
@@ -61,10 +61,19 @@ from typing import Any
 # Constants
 # ---------------------------------------------------------------------------
 
-SUPPORTED_EXTENSIONS = {".pdf", ".tex", ".html", ".htm", ".md", ".txt"}
+SUPPORTED_EXTENSIONS = {".pdf", ".tex", ".html", ".htm", ".md", ".txt", ".docx", ".rtf", ".epub"}
 # Slug is the first 16 hex chars of the file's SHA256 — enough uniqueness.
 SLUG_LENGTH = 16
 
+# Zip-bomb defense thresholds for ZIP-of-XML formats (.docx, .epub).
+# Raw caps reject oversized files outright; decompressed caps reject ratio
+# attacks. Pre-flight only — does not stream the archive.
+MAX_DOCX_BYTES = 50_000_000  # 50 MB raw .docx
+MAX_DOCX_EXTRACTED_BYTES = 200_000_000  # 200 MB total uncompressed
+MAX_EPUB_BYTES = 100_000_000  # 100 MB raw .epub (long novels + images)
+MAX_EPUB_EXTRACTED_BYTES = 500_000_000  # 500 MB total uncompressed
+EPUB_SIZE_HINT_BYTES = 1_000_000  # extracted-text threshold for stderr note
+
 
 # ---------------------------------------------------------------------------
 # Helpers
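The two-cap idea in the constants comment (a raw size cap plus a cap on the summed uncompressed sizes) can be tried on an in-memory archive. This is an illustrative sketch, not the shipped function: `declared_sizes` and `passes_preflight` are hypothetical names, and the thresholds passed in are arbitrary. Note that `ZipInfo.file_size` is the size declared in the archive's own metadata, which is why the diff calls the check pre-flight only.

```python
import io
import zipfile


def declared_sizes(data: bytes) -> tuple[int, int]:
    """Return (raw archive size, sum of declared uncompressed member sizes)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        total = sum(info.file_size for info in zf.infolist())
    return len(data), total


def passes_preflight(data: bytes, max_bytes: int, max_extracted: int) -> bool:
    """Hypothetical check: both the raw cap and the decompressed cap must hold."""
    raw, total = declared_sizes(data)
    return raw <= max_bytes and total <= max_extracted


# Build a small but highly compressible archive: 1 MB of zero bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("payload.bin", b"\0" * 1_000_000)
archive = buf.getvalue()

raw, total = declared_sizes(archive)
print(f"raw={raw} bytes, declared uncompressed={total} bytes")
```

The raw size alone would not reveal the expansion ratio, which is why a second cap on the declared total is needed to reject ratio attacks.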
@@ -243,6 +252,136 @@ def _extract_html_text(path: Path) -> str:
     return extractor.get_text()
 
 
+def _check_zip_safety(path: Path, max_bytes: int, max_extracted: int) -> None:
+    """Pre-flight zip-bomb defense: cap raw size + sum of uncompressed sizes."""
+    import zipfile
+
+    raw_size = path.stat().st_size
+    if raw_size > max_bytes:
+        raise ValueError(
+            f"File too large for safe extraction: {raw_size} bytes "
+            f"> {max_bytes}. Refusing to ingest."
+        )
+    try:
+        with zipfile.ZipFile(path) as zf:
+            total = sum(info.file_size for info in zf.infolist())
+            if total > max_extracted:
+                raise ValueError(
+                    f"Decompressed size {total} bytes > {max_extracted}. "
+                    f"Suspected zip-bomb; refusing to ingest."
+                )
+    except zipfile.BadZipFile as exc:
+        raise ValueError(f"Not a valid zip-based document: {exc}") from exc
+
+
+def _extract_docx_text(path: Path) -> str:
+    """Extract text from .docx. Body paragraphs only; shapes/textboxes excluded."""
+    try:
+        import defusedxml  # type: ignore[import-untyped]
+        defusedxml.defuse_stdlib()
+        from docx import Document  # type: ignore[import-untyped]
+    except ImportError as exc:
+        raise ValueError(
+            "python-docx and defusedxml required for .docx ingestion. "
+            "Install: pip install python-docx defusedxml. "
+            f"(Underlying: {exc})"
+        ) from exc
+
+    _check_zip_safety(path, MAX_DOCX_BYTES, MAX_DOCX_EXTRACTED_BYTES)
+
+    try:
+        doc = Document(str(path))
+    except Exception as exc:  # noqa: BLE001 — python-docx raises wide; surface as ValueError
+        raise ValueError(f"Cannot parse .docx (corrupt or DRM): {exc}") from exc
+
+    return "\n".join(p.text for p in doc.paragraphs)
+
+
+def _extract_rtf_text(path: Path) -> str:
+    """Extract plain text from .rtf using striprtf."""
+    try:
+        from striprtf.striprtf import rtf_to_text  # type: ignore[import-untyped]
+    except ImportError as exc:
+        raise ValueError(
+            "striprtf required for .rtf ingestion. "
+            "Install: pip install striprtf. "
+            f"(Underlying: {exc})"
+        ) from exc
+    try:
+        src = path.read_text(encoding="utf-8", errors="replace")
+    except OSError as exc:
+        raise ValueError(f"Cannot read .rtf file: {exc}") from exc
+    try:
+        return rtf_to_text(src)
+    except Exception as exc:  # noqa: BLE001 — match docx/epub: surface as ValueError exit 2
+        raise ValueError(f"Cannot parse .rtf: {exc}") from exc
+
+
+def _epub_is_drm_protected(path: Path) -> bool:
+    """Detect DRM by presence of META-INF/encryption.xml in the archive."""
+    import zipfile
+
+    try:
+        with zipfile.ZipFile(path) as zf:
+            return "META-INF/encryption.xml" in zf.namelist()
+    except zipfile.BadZipFile:
+        return False
+
+
+def _extract_epub_text(path: Path) -> str:
+    """Extract flat text from .epub by walking the spine. v1 = flat text only."""
+    try:
+        import warnings
+        import defusedxml  # type: ignore[import-untyped]
+        defusedxml.defuse_stdlib()
+        import ebooklib  # type: ignore[import-untyped]
+        from ebooklib import epub  # type: ignore[import-untyped]
+        from bs4 import BeautifulSoup  # type: ignore[import-untyped]
+    except ImportError as exc:
+        raise ValueError(
+            "ebooklib, beautifulsoup4, and defusedxml required for .epub ingestion. "
+            "Install: pip install ebooklib beautifulsoup4 defusedxml. "
+            f"(Underlying: {exc})"
+        ) from exc
+
+    _check_zip_safety(path, MAX_EPUB_BYTES, MAX_EPUB_EXTRACTED_BYTES)
+
+    if _epub_is_drm_protected(path):
+        raise ValueError(
+            "EPUB is DRM-protected (META-INF/encryption.xml present). "
+            "DRM removal is the user's responsibility; Lumina does not strip DRM."
+        )
+
+    try:
+        book = epub.read_epub(str(path))
+    except Exception as exc:  # noqa: BLE001 — ebooklib raises wide on bad XML
+        raise ValueError(f"Cannot parse .epub (corrupt or unsupported): {exc}") from exc
+
+    parts: list[str] = []
+    with warnings.catch_warnings():
+        warnings.filterwarnings("ignore", module="bs4")
+        for spine_entry in book.spine:
+            try:
+                doc = book.get_item_with_id(spine_entry[0])
+            except Exception:  # noqa: BLE001
+                continue
+            if doc is None or doc.get_type() != ebooklib.ITEM_DOCUMENT:
+                continue
+            soup = BeautifulSoup(doc.get_content(), "html.parser")
+            text = soup.get_text(separator="\n").strip()
+            if text:
+                parts.append(text)
+
+    full = "\n\n".join(parts)
+    size_bytes = len(full.encode("utf-8"))
+    if size_bytes > EPUB_SIZE_HINT_BYTES:
+        _err(
+            f"Note: extracted EPUB text {size_bytes:,} bytes (> 1 MB). "
+            "Future: /lumi-chapter-ingest may help once EPUB support lands."
+        )
+    return full
+
+
 def _extract_text(path: Path) -> str:
     """Dispatch text extraction by file extension."""
     ext = path.suffix.lower()
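The DRM check in this hunk is a plain membership test on the archive's name list. The sketch below reproduces that test on in-memory archives so it can be exercised without a real EPUB. `has_encryption_manifest` and `make_archive` are hypothetical names, and the archives are empty placeholders, not valid books.

```python
import io
import zipfile


def has_encryption_manifest(data: bytes) -> bool:
    """Return True if the archive lists META-INF/encryption.xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return "META-INF/encryption.xml" in zf.namelist()
    except zipfile.BadZipFile:
        # Same fallback as the diff: a non-zip input is not reported as DRM.
        return False


def make_archive(names: list[str]) -> bytes:
    """Hypothetical test helper: build a zip with empty members of the given names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()


plain = make_archive(["mimetype", "META-INF/container.xml"])
protected = make_archive(
    ["mimetype", "META-INF/container.xml", "META-INF/encryption.xml"]
)
```

One caveat worth keeping in mind: EPUBs that only obfuscate embedded fonts also carry a `META-INF/encryption.xml`, so a presence-only test may reject some books that have no DRM on their text.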
@@ -252,6 +391,12 @@ def _extract_text(path: Path) -> str:
         return _extract_tex_text(path)
     if ext in (".html", ".htm"):
         return _extract_html_text(path)
+    if ext == ".docx":
+        return _extract_docx_text(path)
+    if ext == ".rtf":
+        return _extract_rtf_text(path)
+    if ext == ".epub":
+        return _extract_epub_text(path)
     # .md, .txt, and other text files — read as UTF-8
     try:
         return path.read_text(encoding="utf-8", errors="replace")
@@ -366,6 +511,9 @@ def _guess_type(ext: str) -> str:
         ".htm": "webpage",
         ".md": "markdown",
         ".txt": "text",
+        ".docx": "docx",
+        ".rtf": "rtf",
+        ".epub": "epub",
     }.get(ext, "unknown")
 
 
@@ -377,8 +525,9 @@ def main(argv: list[str] | None = None) -> None:
     parser = argparse.ArgumentParser(
         prog="prepare_source.py",
         description=(
-            "Normalize a local file (PDF, .tex, .html, .md
-            "package under raw/tmp/<slug>/.
+            "Normalize a local file (PDF, .tex, .html, .md, .txt, .docx, .rtf, "
+            ".epub) into an ingest-ready package under raw/tmp/<slug>/. "
+            "Deterministic: same input -> same output."
         ),
     )
     parser.add_argument("file", help="Path to the source file to prepare.")
package/src/tools/requirements.txt
CHANGED
@@ -13,6 +13,14 @@ pypdf>=4.0.0
 # fetch_arxiv.py, fetch_s2.py, fetch_wikipedia.py, fetch_deepxiv.py, discover.py
 requests>=2.31.0
 
+# ─── Local text-document ingestion (research pack) ──────────────────────────
+# prepare_source.py extractors for .docx/.rtf/.epub
+python-docx>=1.1.0
+striprtf>=0.0.26
+ebooklib>=0.18
+beautifulsoup4>=4.12.0
+defusedxml>=0.7.1
+
 # ─── Development ─────────────────────────────────────────────────────────────
 pytest>=7.0.0
 pytest-cov>=4.0.0
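Because the five new requirements install under import names that differ from their pip names, a quick check can report which ones are absent before ingestion is attempted. This is a sketch under assumptions: the helper `missing_packages` is hypothetical, and the import names are the ones used by the extractors in this diff (`docx`, `striprtf`, `ebooklib`, `bs4`, `defusedxml`).

```python
import importlib.util

# pip distribution name -> top-level import name, as imported in prepare_source.py.
RESEARCH_PACK_MODULES = {
    "python-docx": "docx",
    "striprtf": "striprtf",
    "ebooklib": "ebooklib",
    "beautifulsoup4": "bs4",
    "defusedxml": "defusedxml",
}


def missing_packages(modules: dict[str, str] = RESEARCH_PACK_MODULES) -> list[str]:
    """Return the pip names whose import module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in modules.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing: " + ", ".join(missing))
        print("Install: pip install " + " ".join(missing))
```

`importlib.util.find_spec` only locates the module without importing it, so the check has no side effects such as the `defusedxml.defuse_stdlib()` call made by the extractors.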