regen.mde 0.2.2 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (37)
  1. package/README.md +409 -295
  2. package/bin/build-corpus-editor.js +5 -3
  3. package/bin/postinstall.js +259 -187
  4. package/bin/regen-mdeditor-install.js +1 -1
  5. package/bin/regen-mdeditor-uninstall.js +1 -1
  6. package/desktop/BuildCorpusEditor/BuildCorpusBridge.cs +493 -270
  7. package/desktop/BuildCorpusEditor/EditorForm.cs +853 -540
  8. package/desktop/BuildCorpusEditor/Program.cs +85 -81
  9. package/dist/release/regen-mde-0.3.0-win-x64-setup.exe +0 -0
  10. package/dist/release/{regen.mde-0.2.2-win-x64.zip → regen-mde-0.3.0-win-x64.zip} +0 -0
  11. package/dist/release/regen-mde-0.7.0-win-x64-setup.exe +0 -0
  12. package/dist/release/regen-mde-0.7.0-win-x64.zip +0 -0
  13. package/dist/windows-editor/BuildCorpusEditor.dll +0 -0
  14. package/dist/windows-editor/BuildCorpusEditor.exe +0 -0
  15. package/dist/windows-editor/BuildCorpusEditor.pdb +0 -0
  16. package/dist/windows-editor/wwwroot/assets/index-C_VxJk4k.js +375 -0
  17. package/dist/windows-editor/wwwroot/assets/index-Wt9zSjIw.css +1 -0
  18. package/dist/windows-editor/wwwroot/index.html +3 -3
  19. package/editor-web/index.html +1 -1
  20. package/editor-web/src/main.jsx +1044 -399
  21. package/editor-web/src/styles.css +846 -602
  22. package/installer/install-regen-mde.ps1 +49 -10
  23. package/installer/regen-mde.nsi +16 -16
  24. package/package.json +90 -86
  25. package/pyproject.toml +35 -33
  26. package/requirements.txt +6 -4
  27. package/scripts/package-windows-editor.ps1 +8 -8
  28. package/scripts/release-dual.mjs +105 -0
  29. package/scripts/run-editor-implementation-plane.ps1 +29 -6
  30. package/src/build_corpus/docx_exporter.py +1055 -798
  31. package/src/build_corpus/equations.py +80 -0
  32. package/src/build_corpus/exporter.py +1488 -1195
  33. package/src/build_corpus/frontmatter.py +302 -0
  34. package/src/build_corpus/ppt_exporter.py +543 -532
  35. package/dist/release/regen.mde-0.2.2-win-x64-setup.exe +0 -0
  36. package/dist/windows-editor/wwwroot/assets/index-DjJ6xmhy.js +0 -326
  37. package/dist/windows-editor/wwwroot/assets/index-_dwMNNsm.css +0 -1
package/README.md CHANGED
@@ -1,295 +1,409 @@
1
- # regen.mde
2
-
3
- regen.mde is the Windows editor and conversion suite for Markdown, Word, and PowerPoint files. Its `build-corpus` CLI converts `.docx`, `.pptx`, and `.ppt` files to Markdown while preserving the pieces that usually break in generic converters:
4
-
5
- - Word OMML equations as KaTeX-readable TeX
6
- - embedded images as local assets, base64 data URIs, or S3/R2-hosted URLs
7
- - Markdown tables for simple Word tables
8
- - HTML table fallback for complex tables
9
- - headings, lists, links, bold, italic, inline code, and code-style paragraphs
10
- - PowerPoint slide extraction with slide title detection, table mapping, and repetitive footer suppression
11
-
12
- ## Install
13
-
14
- Python is the native runtime:
15
-
16
- ```powershell
17
- pip install build-corpus
18
- ```
19
-
20
- The npm package ships the Windows installer plus the conversion CLI:
21
-
22
- ```powershell
23
- npm pack regen.mde
24
- ```
25
-
26
- Extract the package and run `dist\release\regen.mde-<version>-win-x64-setup.exe` for a normal Windows install. The installer creates Start Menu entries for `regen.mde` and `Uninstall regen.mde`, registers right-click Explorer verbs for `.docx` and `.md`, and removes those entries during uninstall.
27
-
28
- The legacy global npm command path is still supported for automation:
29
-
30
- ```powershell
31
- npm install -g regen.mde
32
- ```
33
-
34
- On Windows, the installer and supported automation paths add right-click Explorer menus for `.docx` and `.md` files under `Life AI`:
35
-
36
- - `Life AI -> Open in regen.mde`
37
- - opens `.md` directly and opens `.docx` by converting it into editable Markdown first
38
- - `Life AI -> Convert to Markdown`
39
- - runs `build-corpus "%1" --out-same-dir`
40
- - writes `.md`, `assets`, and reports beside the source document
41
- - `Life AI -> Convert to Word`
42
- - runs `build-corpus "%1" --to word --out-same-dir`
43
- - writes `.docx` and export report beside the source document
44
-
45
- Set `BUILD_CORPUS_SKIP_WINDOWS_MENU=1` before a global npm install if you do not want the Explorer menu.
46
- Set `BUILD_CORPUS_SKIP_EDITOR=1` before a global npm install if you want the CLI conversion verbs but not the editor open verbs.
47
-
48
- To remove the Windows Explorer menus without uninstalling the package:
49
-
50
- ```powershell
51
- build-corpus --uninstall-windows
52
- ```
53
-
54
- If you uninstall the global npm package, `build-corpus` now removes those Explorer menu entries automatically during uninstall.
55
-
56
- For a project-local install, use `npx`:
57
-
58
- ```powershell
59
- npm install regen.mde
60
- npx build-corpus --help
61
- ```
62
-
63
- On Windows, if `build-corpus` launches a Python executable and fails with `ModuleNotFoundError`, a stale pip install is shadowing the npm command. Remove it with:
64
-
65
- ```powershell
66
- py -3 -m pip uninstall build-corpus
67
- ```
68
-
69
- For S3/R2 image upload support:
70
-
71
- ```powershell
72
- pip install "build-corpus[s3]"
73
- ```
74
-
75
- ## Basic Usage
76
-
77
- ```powershell
78
- build-corpus input.docx --out out
79
- build-corpus deck.pptx --out out
80
- build-corpus input.md --to word --out out
81
- build-corpus input.md --to word --word-template C:\path\custom.dotx --out out
82
- regen.mde input.md
83
- regen.mdeditor input.md
84
- regen-mdeditor input.md
85
- build-corpus editor input.md
86
- build-corpus editor input.docx
87
- ```
88
-
89
- ## regen.mde
90
-
91
- regen.mde is a Windows WebView2 desktop app bundled with the package. It uses the same local Build Corpus conversion engine as the CLI:
92
-
93
- - Markdown opens directly.
94
- - Word and PowerPoint files open by converting into Markdown.
95
- - Save writes Markdown.
96
- - Save As writes a new Markdown file.
97
- - Export DOCX writes Word output through the Markdown-to-Word route.
98
-
99
- Build the Windows executable locally:
100
-
101
- ```powershell
102
- npm run editor:windows
103
- ```
104
-
105
- The executable is written to:
106
-
107
- ```text
108
- dist\windows-editor\BuildCorpusEditor.exe
109
- ```
110
-
111
- Convert every `.docx` in a folder:
112
-
113
- ```powershell
114
- build-corpus ./word-files --out ./markdown
115
- ```
116
-
117
- Convert every supported file type in a folder (`.docx`, `.pptx`, `.ppt`):
118
-
119
- ```powershell
120
- build-corpus ./source-files --out ./markdown
121
- ```
122
-
123
- Write Markdown beside each source document:
124
-
125
- ```powershell
126
- build-corpus ./word-files --out-same-dir
127
- ```
128
-
129
- ## Image Modes
130
-
131
- Local asset files, the default:
132
-
133
- ```powershell
134
- build-corpus input.docx --images assets
135
- ```
136
-
137
- Single-file Markdown with base64 image data URIs:
138
-
139
- ```powershell
140
- build-corpus input.docx --images base64
141
- ```
142
-
143
- Upload images to S3-compatible storage and write public URLs:
144
-
145
- ```powershell
146
- build-corpus input.docx --images s3 --config examples\build-corpus.config.example.json
147
- ```
148
-
149
- Cloudflare R2 uses the same `s3` mode. Set `endpoint_url` to:
150
-
151
- ```text
152
- https://ACCOUNT_ID.r2.cloudflarestorage.com
153
- ```
154
-
155
- ## Config
156
-
157
- Copy `examples/build-corpus.config.example.json` and edit it for your environment.
158
-
159
- ```json
160
- {
161
- "conversion": {
162
- "equations": "tex",
163
- "images": "s3"
164
- },
165
- "output": {
166
- "out": "out",
167
- "out_same_dir": false
168
- },
169
- "s3": {
170
- "bucket": "build-corpus-assets",
171
- "public_base_url": "https://assets.example.com",
172
- "prefix": "knowledge-base",
173
- "endpoint_url": "https://ACCOUNT_ID.r2.cloudflarestorage.com",
174
- "region_name": "auto",
175
- "access_key_id": "%R2_ACCESS_KEY_ID%",
176
- "secret_access_key": "%R2_SECRET_ACCESS_KEY%"
177
- }
178
- }
179
- ```
180
-
181
- Build Corpus expands environment variables in JSON string values, so credentials do not need to be committed.
182
-
183
- ### Output Placement
184
-
185
- There are two output modes.
186
-
187
- Write all converted Markdown into one output tree:
188
-
189
- ```json
190
- {
191
- "output": {
192
- "out": "./markdown",
193
- "out_same_dir": false
194
- }
195
- }
196
- ```
197
-
198
- Write each `.md`, asset folder, and report beside the source `.docx`:
199
-
200
- ```json
201
- {
202
- "output": {
203
- "out_same_dir": true
204
- }
205
- }
206
- ```
207
-
208
- The same-dir mode is equivalent to:
209
-
210
- ```powershell
211
- build-corpus ./word-files --out-same-dir
212
- ```
213
-
214
- ## Markdown to Word Templates
215
-
216
- Markdown -> Word conversion uses this template precedence:
217
-
218
- 1. `--word-template <path>`
219
- 2. `word.template` in the JSON config
220
- 3. the bundled installed package template
221
- 4. built-in fallback styles if no template can be found
222
-
223
- Template files are treated as style sources. Build Corpus creates a fresh output document body, then applies the template's Word styles, numbering, theme, fonts, and settings. It does not reuse the template body content as the exported document.
224
-
225
- ## Equations
226
-
227
- The default equation mode is parseable TeX:
228
-
229
- ```powershell
230
- build-corpus input.docx --equations tex
231
- ```
232
-
233
- Equation images are only for visual debugging:
234
-
235
- ```powershell
236
- build-corpus input.docx --equations image
237
- ```
238
-
239
- ## PowerPoint Notes
240
-
241
- - `.pptx` is processed directly.
242
- - `.ppt` is converted to `.pptx` first using LibreOffice (`soffice --headless --convert-to pptx`).
243
- - Repeated boilerplate blocks that appear on most slides are removed from the emitted Markdown.
244
- - Slide images are exported from the original package binaries (`ppt/media/*`), not screen-captured display rasters.
245
- - Markdown output uses size-aware HTML image tags (`<img ... width= height=>`) based on OOXML display extents (`a:xfrm/a:ext`).
246
- - The export report includes `low_dpi_images` to flag images whose effective on-slide DPI is under 150.
247
-
248
- ## Validation
249
-
250
- The package includes a KaTeX validator for emitted Markdown math:
251
-
252
- ```powershell
253
- build-corpus-katex out
254
- ```
255
-
256
- ## Repeatable Test Wrappers
257
-
258
- Run a single known DOCX through conversion plus validators:
259
-
260
- ```powershell
261
- .\scripts\run-smoke.ps1 -Docx ".\fixtures\sample.docx" -Out ".tmp\smoke" -Images assets
262
- ```
263
-
264
- Run a whole folder corpus:
265
-
266
- ```powershell
267
- .\scripts\run-corpus.ps1 -Source ".\fixtures\wordtest" -Out ".tmp\wordtest" -Images base64
268
- ```
269
-
270
- Build a public online DOCX corpus for regression testing:
271
-
272
- ```powershell
273
- python .\tools\collect_online_docx_corpus.py --out ".tmp\online-docx\source-docx" --target 50
274
- .\scripts\run-corpus.ps1 -Source ".tmp\online-docx\source-docx" -Out ".tmp\online-docx\markdown"
275
- ```
276
-
277
- Build a public online PPTX corpus and compare input/output extraction:
278
-
279
- ```powershell
280
- python .\tools\collect_online_pptx_corpus.py --out ".tmp\online-pptx\source-pptx" --target 20
281
- .\scripts\run-corpus.ps1 -Source ".tmp\online-pptx\source-pptx" -Out ".tmp\online-pptx\markdown"
282
- python .\tools\compare_pptx_inputs_outputs.py --manifest ".tmp\online-pptx\source-pptx\online-pptx-manifest.json" --out ".tmp\online-pptx\markdown" --report ".tmp\online-pptx\markdown\pptx-io-compare.json"
283
- ```
284
-
285
- ## Failed Documents
286
-
287
- If a document does not convert correctly, open an issue with:
288
-
289
- - the `.docx` file if it is safe to share
290
- - the generated `.md`
291
- - the `export-report.json`
292
- - the command and config used
293
- - a screenshot of the expected Word output if layout is the issue
294
-
295
- For confidential files, strip or replace sensitive content before sharing. The useful part is the broken DOCX structure, not the private text.
1
+ # regen-mde
2
+
3
+ regen-mde is the Windows editor and conversion suite for Markdown, Word, and PowerPoint files. Its `build-corpus` CLI converts `.docx`, `.pptx`, and `.ppt` files to Markdown while preserving the pieces that usually break in generic converters:
4
+
5
+ - Word OMML equations as KaTeX-readable TeX
6
+ - embedded images as local assets, base64 data URIs, or S3/R2-hosted URLs
7
+ - Markdown tables for simple Word tables
8
+ - HTML table fallback for complex tables
9
+ - headings, lists, links, bold, italic, inline code, and code-style paragraphs
10
+ - PowerPoint slide extraction with slide title detection, table mapping, and repetitive footer suppression
11
+
12
+ ## Install
13
+
14
+ `build-corpus` is a **dual package**: the same version ships to both PyPI (Python-native)
15
+ and npm (Node wrapper). Pick the channel that fits your OS.
16
+
17
+ | OS | Recommended | Command | What you get |
18
+ |----|-------------|---------|--------------|
19
+ | **Ubuntu / Debian / Linux** | PyPI via pipx | `pipx install build-corpus` | native `build-corpus` CLI, no Node, isolated from system Python (PEP 668-safe) |
20
+ | **macOS** | PyPI via pipx | `pipx install build-corpus` | native `build-corpus` CLI, no Node |
21
+ | **Windows** | npm (full kit) | `npm install -g regen-mde` | `build-corpus` CLI **plus** the `regen-mde` editor and Explorer right-click menus |
22
+
23
+ ### Ubuntu / Debian / Linux
24
+
25
+ The CLI is pure Python; the editor is Windows-only, so on Linux you install just the converter.
26
+
27
+ ```bash
28
+ # prerequisites
29
+ sudo apt update
30
+ sudo apt install -y python3 python3-pip pipx
31
+ sudo apt install -y libreoffice # optional — only needed to convert legacy .ppt
32
+
33
+ # install the CLI (isolated venv, survives Ubuntu's externally-managed Python)
34
+ pipx install build-corpus
35
+ build-corpus --help
36
+ ```
37
+
38
+ The npm package (`regen-mde`) also installs on Linux (`npm install -g regen-mde`); it shells
39
+ out to `python3` for conversion. On Ubuntu 24.04+ the postinstall installs the Python dependencies
40
+ into your user site; if that is blocked it prints the `pipx install build-corpus` fallback and
41
+ still completes. The `regen-mde` editor is not built on Linux.
42
+
43
+ ### Windows
44
+
45
+ Python is the native runtime:
46
+
47
+ ```powershell
48
+ pip install build-corpus
49
+ ```
50
+
51
+ The npm package ships the Windows installer plus the conversion CLI:
52
+
53
+ ```powershell
54
+ npm pack regen-mde
55
+ ```
56
+
57
+ Extract the package and run `dist\release\regen-mde-<version>-win-x64-setup.exe` for a normal Windows install. The installer creates Start Menu entries for `regen-mde` and `Uninstall regen-mde`, registers right-click Explorer verbs for `.docx` and `.md`, and removes those entries during uninstall.
58
+
59
+ The legacy global npm command path is still supported for automation:
60
+
61
+ ```powershell
62
+ npm install -g regen-mde
63
+ ```
64
+
65
+ On Windows, the installer and supported automation paths add right-click Explorer menus for `.docx`, `.pptx`, `.ppt`, `.md`, and folders:
66
+
67
+ - `Life AI -> Open in regen-mde`
68
+ - opens `.md` directly and opens `.docx` by converting it into editable Markdown first
69
+ - `Life AI -> Convert to Markdown`
70
+ - runs `build-corpus "%1" --out-same-dir` for `.docx`, `.pptx`, and `.ppt`
71
+ - writes `.md`, `assets`, and reports beside the source document
72
+ - `Life AI -> Convert to Word`
73
+ - runs `build-corpus "%1" --to word --out-same-dir`
74
+ - writes `.docx` and export report beside the source document
75
+ - `Life AI -> Inline Markdown Images`
76
+ - runs `build-corpus "%1" --inline-images`
77
+ - writes `<name>.inline.md` with local or HTTP image references embedded as data URIs
78
+ - folder `Convert Documents to Markdown`
79
+ - runs `build-corpus "%V" --out-same-dir`
80
+ - converts all `.docx`, `.pptx`, and `.ppt` files in the selected folder tree
81
+
82
+ The installer also registers `.md` under Explorer's New menu so you can create a blank Markdown document directly from `New`.
83
+
84
+ Set `BUILD_CORPUS_SKIP_WINDOWS_MENU=1` before a global npm install if you do not want the Explorer menu.
85
+ Set `BUILD_CORPUS_SKIP_EDITOR=1` before a global npm install if you want the CLI conversion verbs but not the editor open verbs.
86
+
87
+ To remove the Windows Explorer menus without uninstalling the package:
88
+
89
+ ```powershell
90
+ build-corpus --uninstall-windows
91
+ ```
92
+
93
+ If you uninstall the global npm package, `build-corpus` now removes those Explorer menu entries automatically during uninstall.
94
+
95
+ For a project-local install, use `npx`:
96
+
97
+ ```powershell
98
+ npm install regen-mde
99
+ npx build-corpus --help
100
+ ```
101
+
102
+ On Windows, if `build-corpus` launches a Python executable and fails with `ModuleNotFoundError`, a stale pip install is shadowing the npm command. Remove it with:
103
+
104
+ ```powershell
105
+ py -3 -m pip uninstall build-corpus
106
+ ```
107
+
108
+ For S3/R2 image upload support:
109
+
110
+ ```powershell
111
+ pip install "build-corpus[s3]"
112
+ ```
113
+
114
+ ## Basic Usage
115
+
116
+ ```powershell
117
+ build-corpus input.docx --out out
118
+ build-corpus deck.pptx --out out
119
+ build-corpus input.md --to word --out out
120
+ build-corpus input.md --to word --word-template C:\path\custom.dotx --out out
121
+ regen-mde input.md
122
+ regen-mdeditor input.md
123
+ regen-mdeditor input.md
124
+ build-corpus editor input.md
125
+ build-corpus editor input.docx
126
+ ```
127
+
128
+ ## regen-mde
129
+
130
+ regen-mde is a Windows WebView2 desktop app bundled with the package. It uses the same local Build Corpus conversion engine as the CLI:
131
+
132
+ - Markdown opens directly.
133
+ - Word and PowerPoint files open by converting into Markdown.
134
+ - Save writes Markdown.
135
+ - Save As writes a new Markdown file.
136
+ - Export DOCX writes Word output through the Markdown-to-Word route.
137
+
138
+ Build the Windows executable locally:
139
+
140
+ ```powershell
141
+ npm run editor:windows
142
+ ```
143
+
144
+ The executable is written to:
145
+
146
+ ```text
147
+ dist\windows-editor\BuildCorpusEditor.exe
148
+ ```
149
+
150
+ Convert every `.docx` in a folder:
151
+
152
+ ```powershell
153
+ build-corpus ./word-files --out ./markdown
154
+ ```
155
+
156
+ Convert every supported file type in a folder (`.docx`, `.pptx`, `.ppt`):
157
+
158
+ ```powershell
159
+ build-corpus ./source-files --out ./markdown
160
+ ```
161
+
162
+ Convert specific selected files or folders from automation:
163
+
164
+ ```powershell
165
+ build-corpus .\a.docx .\deck.pptx .\folder --out-same-dir
166
+ ```
167
+
168
+ Move successfully processed source `.docx`, `.pptx`, and `.ppt` files into `sources` beside each file:
169
+
170
+ ```powershell
171
+ build-corpus ./source-files --out-same-dir --move-sources
172
+ ```
173
+
174
+ Write Markdown beside each source document:
175
+
176
+ ```powershell
177
+ build-corpus ./word-files --out-same-dir
178
+ ```
179
+
180
+ ## Image Modes
181
+
182
+ Local asset files, the default:
183
+
184
+ ```powershell
185
+ build-corpus input.docx --images assets
186
+ ```
187
+
188
+ Single-file Markdown with base64 image data URIs:
189
+
190
+ ```powershell
191
+ build-corpus input.docx --images base64
192
+ ```
193
+
194
+ Re-merge an existing Markdown file that references local or HTTP-hosted images into a single Markdown file with inline image data:
195
+
196
+ ```powershell
197
+ build-corpus input.md --inline-images
198
+ ```
199
+
200
+ Upload images to S3-compatible storage and write public URLs:
201
+
202
+ ```powershell
203
+ build-corpus input.docx --images s3 --config examples\build-corpus.config.example.json
204
+ ```
205
+
206
+ Cloudflare R2 uses the same `s3` mode. Set `endpoint_url` to:
207
+
208
+ ```text
209
+ https://ACCOUNT_ID.r2.cloudflarestorage.com
210
+ ```
211
+
212
+ ## Config
213
+
214
+ Copy `examples/build-corpus.config.example.json` and edit it for your environment.
215
+
216
+ ```json
217
+ {
218
+ "conversion": {
219
+ "equations": "tex",
220
+ "images": "s3"
221
+ },
222
+ "output": {
223
+ "out": "out",
224
+ "out_same_dir": false
225
+ },
226
+ "s3": {
227
+ "bucket": "build-corpus-assets",
228
+ "public_base_url": "https://assets.example.com",
229
+ "prefix": "knowledge-base",
230
+ "endpoint_url": "https://ACCOUNT_ID.r2.cloudflarestorage.com",
231
+ "region_name": "auto",
232
+ "access_key_id": "%R2_ACCESS_KEY_ID%",
233
+ "secret_access_key": "%R2_SECRET_ACCESS_KEY%"
234
+ }
235
+ }
236
+ ```
237
+
238
+ Build Corpus expands environment variables in JSON string values, so credentials do not need to be committed.
239
+
240
+ ### Output Placement
241
+
242
+ There are two output modes.
243
+
244
+ Write all converted Markdown into one output tree:
245
+
246
+ ```json
247
+ {
248
+ "output": {
249
+ "out": "./markdown",
250
+ "out_same_dir": false
251
+ }
252
+ }
253
+ ```
254
+
255
+ Write each `.md`, asset folder, and report beside the source `.docx`:
256
+
257
+ ```json
258
+ {
259
+ "output": {
260
+ "out_same_dir": true
261
+ }
262
+ }
263
+ ```
264
+
265
+ The same-dir mode is equivalent to:
266
+
267
+ ```powershell
268
+ build-corpus ./word-files --out-same-dir
269
+ ```
270
+
271
+ ## Markdown to Word Templates
272
+
273
+ Markdown -> Word conversion uses this template precedence:
274
+
275
+ 1. `--word-template <path>`
276
+ 2. `word.template` in the JSON config
277
+ 3. the bundled installed package template
278
+ 4. built-in fallback styles if no template can be found
279
+
280
+ Template files are treated as style sources. Build Corpus creates a fresh output document body, then applies the template's Word styles, numbering, theme, fonts, and settings. It does not reuse the template body content as the exported document.
281
+
282
+ ## Equations
283
+
284
+ Equation handling is real in **both** directions:
285
+
286
+ **DOCX → Markdown** — Word OMML equations are converted to KaTeX-readable TeX
287
+ (via `omml2latex`). The default mode is parseable TeX:
288
+
289
+ ```powershell
290
+ build-corpus input.docx --equations tex
291
+ ```
292
+
293
+ Equation images are only for visual debugging:
294
+
295
+ ```powershell
296
+ build-corpus input.docx --equations image
297
+ ```
298
+
299
+ **Markdown → Word** — inline `$...$` and display `$$...$$` LaTeX are converted to
300
+ **native Office Math (OMML)** that Word renders as real equations — not raw text
301
+ in a math font. The pipeline is `latex2mathml` → `mathml2omml`, so commands like
302
+ `\sum`, `\int`, `\frac`, `\Delta`, `\rightarrow`, and `\leq` render correctly:
303
+
304
+ ```powershell
305
+ build-corpus notes.md --to word --out out
306
+ ```
307
+
308
+ If a fragment cannot be parsed as LaTeX, it falls back to the literal text in
309
+ Cambria Math and is flagged in the export report's `warnings`. Fence display
310
+ equations with `$$` on their own lines and no blank lines inside the fence.
311
+
312
+ ## Fidelity report (md → word)
313
+
314
+ Every md→word export writes `export-report.json` (and a `build-corpus-batch-report.json`
315
+ across a batch) so you can confirm nothing was silently dropped or altered. Beyond the
316
+ raw output `stats`, the report carries:
317
+
318
+ - **`fidelity_ok`** — top-level ship gate. `true` only when every reconciliation row
319
+ matches (and zero equations fell back). The batch summary prints `all_fidelity_ok`
320
+ plus the list of `fidelity_failures`.
321
+ - **`reconciliation`** — input vs output per element type:
322
+ ```json
323
+ "reconciliation": {
324
+ "tables": { "in": 1, "out": 1, "ok": true },
325
+ "equations": { "in": 3, "out_omml": 2, "fell_back": 1, "ok": false },
326
+ "images": { "in": 2, "out": 0, "failed": 2, "ok": false },
327
+ "code_blocks": { "in": 0, "out": 0, "ok": true },
328
+ "headings": { "in": 1, "out": 1, "ok": true },
329
+ "links": { "in": 1, "out": 1, "ok": true }
330
+ }
331
+ ```
332
+ - **`issues`** — one entry per problem with the source line: `{ "type", "line", "source"|"target", "reason" }`.
333
+ - **`text_fixups`** — markdown escapes the engine resolved on your content, e.g.
334
+ `{ "total": 2, "currency_unescaped": 2 }`. Escaped currency like `\$252.3B` is kept as
335
+ literal text (`$252.3B`), never mistaken for inline math.
336
+ - A one-line **stdout digest** for a quick CLI glance:
337
+ ```
338
+ [OK] tables 1/1 [!!] equations 2/3 (1 fell back) [!!] images 0/2 (2 failed) … -> fidelity_ok=false
339
+ ```
340
+
341
+ Image failures carry a specific `reason` so you know how to react:
342
+
343
+ | reason | meaning | fix |
344
+ |--------|---------|-----|
345
+ | `missing-file` | target path not found | correct the path |
346
+ | `unsupported-on-platform` | EMF/WMF that needs metafile→PNG conversion | install LibreOffice / run on Windows |
347
+ | `unsupported-format` | `.html`/`.jsx`/`.svg` etc. — cannot be embedded | pre-render to PNG via a render pipeline |
348
+ | `skipped-remote` | `http(s)`/`data:` target | localize the asset first |
349
+
350
+ build-corpus does **not** rasterize HTML/JSX — that belongs to a separate render
351
+ step (e.g. a headless-browser screenshot). It flags them and moves on.
352
+
353
+ ## PowerPoint Notes
354
+
355
+ - `.pptx` is processed directly.
356
+ - `.ppt` is converted to `.pptx` first using LibreOffice (`soffice --headless --convert-to pptx`).
357
+ - Repeated boilerplate blocks that appear on most slides are removed from the emitted Markdown.
358
+ - Slide images are exported from the original package binaries (`ppt/media/*`), not screen-captured display rasters.
359
+ - Markdown output uses size-aware HTML image tags (`<img ... width= height=>`) based on OOXML display extents (`a:xfrm/a:ext`).
360
+ - The export report includes `low_dpi_images` to flag images whose effective on-slide DPI is under 150.
361
+
362
+ ## Validation
363
+
364
+ The package includes a KaTeX validator for emitted Markdown math:
365
+
366
+ ```powershell
367
+ build-corpus-katex out
368
+ ```
369
+
370
+ ## Repeatable Test Wrappers
371
+
372
+ Run a single known DOCX through conversion plus validators:
373
+
374
+ ```powershell
375
+ .\scripts\run-smoke.ps1 -Docx ".\fixtures\sample.docx" -Out ".tmp\smoke" -Images assets
376
+ ```
377
+
378
+ Run a whole folder corpus:
379
+
380
+ ```powershell
381
+ .\scripts\run-corpus.ps1 -Source ".\fixtures\wordtest" -Out ".tmp\wordtest" -Images base64
382
+ ```
383
+
384
+ Build a public online DOCX corpus for regression testing:
385
+
386
+ ```powershell
387
+ python .\tools\collect_online_docx_corpus.py --out ".tmp\online-docx\source-docx" --target 50
388
+ .\scripts\run-corpus.ps1 -Source ".tmp\online-docx\source-docx" -Out ".tmp\online-docx\markdown"
389
+ ```
390
+
391
+ Build a public online PPTX corpus and compare input/output extraction:
392
+
393
+ ```powershell
394
+ python .\tools\collect_online_pptx_corpus.py --out ".tmp\online-pptx\source-pptx" --target 20
395
+ .\scripts\run-corpus.ps1 -Source ".tmp\online-pptx\source-pptx" -Out ".tmp\online-pptx\markdown"
396
+ python .\tools\compare_pptx_inputs_outputs.py --manifest ".tmp\online-pptx\source-pptx\online-pptx-manifest.json" --out ".tmp\online-pptx\markdown" --report ".tmp\online-pptx\markdown\pptx-io-compare.json"
397
+ ```
398
+
399
+ ## Failed Documents
400
+
401
+ If a document does not convert correctly, open an issue with:
402
+
403
+ - the `.docx` file if it is safe to share
404
+ - the generated `.md`
405
+ - the `export-report.json`
406
+ - the command and config used
407
+ - a screenshot of the expected Word output if layout is the issue
408
+
409
+ For confidential files, strip or replace sensitive content before sharing. The useful part is the broken DOCX structure, not the private text.
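
Reviewer note on the new "Fidelity report (md → word)" README section: the `fidelity_ok` flag and the per-element `reconciliation` rows it describes lend themselves to a simple CI gate. The following is a minimal sketch, not part of the package. It assumes the report is plain JSON with the field names shown in the README (`fidelity_ok`, `reconciliation`, and a per-row `ok`). The helper names `failed_rows`, `passes_gate`, and `load_report`, and any path passed to `load_report`, are hypothetical.

```python
import json
from pathlib import Path


def failed_rows(report: dict) -> list[str]:
    """Return the element types whose reconciliation row is not ok."""
    rows = report.get("reconciliation", {})
    return [name for name, row in rows.items() if not row.get("ok", False)]


def passes_gate(report: dict) -> bool:
    """Treat a missing fidelity_ok as a failure rather than a pass."""
    return report.get("fidelity_ok") is True


def load_report(path: str) -> dict:
    # The path is an assumption: point it at wherever your export wrote
    # export-report.json.
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Sample shaped like the README's reconciliation example (trimmed to three rows).
sample = {
    "fidelity_ok": False,
    "reconciliation": {
        "tables": {"in": 1, "out": 1, "ok": True},
        "equations": {"in": 3, "out_omml": 2, "fell_back": 1, "ok": False},
        "images": {"in": 2, "out": 0, "failed": 2, "ok": False},
    },
}

print(passes_gate(sample), failed_rows(sample))
```

In a pipeline, one might call `load_report(...)` on the file written by a `--to word` export and exit non-zero when `passes_gate` returns `False`, printing `failed_rows` so the failing element types are visible in the log.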