regen.mde 0.2.2 → 0.7.0
- package/README.md +409 -295
- package/bin/build-corpus-editor.js +5 -3
- package/bin/postinstall.js +259 -187
- package/bin/regen-mdeditor-install.js +1 -1
- package/bin/regen-mdeditor-uninstall.js +1 -1
- package/desktop/BuildCorpusEditor/BuildCorpusBridge.cs +493 -270
- package/desktop/BuildCorpusEditor/EditorForm.cs +853 -540
- package/desktop/BuildCorpusEditor/Program.cs +85 -81
- package/dist/release/regen-mde-0.3.0-win-x64-setup.exe +0 -0
- package/dist/release/{regen.mde-0.2.2-win-x64.zip → regen-mde-0.3.0-win-x64.zip} +0 -0
- package/dist/release/regen-mde-0.7.0-win-x64-setup.exe +0 -0
- package/dist/release/regen-mde-0.7.0-win-x64.zip +0 -0
- package/dist/windows-editor/BuildCorpusEditor.dll +0 -0
- package/dist/windows-editor/BuildCorpusEditor.exe +0 -0
- package/dist/windows-editor/BuildCorpusEditor.pdb +0 -0
- package/dist/windows-editor/wwwroot/assets/index-C_VxJk4k.js +375 -0
- package/dist/windows-editor/wwwroot/assets/index-Wt9zSjIw.css +1 -0
- package/dist/windows-editor/wwwroot/index.html +3 -3
- package/editor-web/index.html +1 -1
- package/editor-web/src/main.jsx +1044 -399
- package/editor-web/src/styles.css +846 -602
- package/installer/install-regen-mde.ps1 +49 -10
- package/installer/regen-mde.nsi +16 -16
- package/package.json +90 -86
- package/pyproject.toml +35 -33
- package/requirements.txt +6 -4
- package/scripts/package-windows-editor.ps1 +8 -8
- package/scripts/release-dual.mjs +105 -0
- package/scripts/run-editor-implementation-plane.ps1 +29 -6
- package/src/build_corpus/docx_exporter.py +1055 -798
- package/src/build_corpus/equations.py +80 -0
- package/src/build_corpus/exporter.py +1488 -1195
- package/src/build_corpus/frontmatter.py +302 -0
- package/src/build_corpus/ppt_exporter.py +543 -532
- package/dist/release/regen.mde-0.2.2-win-x64-setup.exe +0 -0
- package/dist/windows-editor/wwwroot/assets/index-DjJ6xmhy.js +0 -326
- package/dist/windows-editor/wwwroot/assets/index-_dwMNNsm.css +0 -1
package/README.md
CHANGED
|
@@ -1,295 +1,409 @@
|
|
|
1
|
-
# regen
|
|
2
|
-
|
|
3
|
-
regen
|
|
4
|
-
|
|
5
|
-
- Word OMML equations as KaTeX-readable TeX
|
|
6
|
-
- embedded images as local assets, base64 data URIs, or S3/R2-hosted URLs
|
|
7
|
-
- Markdown tables for simple Word tables
|
|
8
|
-
- HTML table fallback for complex tables
|
|
9
|
-
- headings, lists, links, bold, italic, inline code, and code-style paragraphs
|
|
10
|
-
- PowerPoint slide extraction with slide title detection, table mapping, and repetitive footer suppression
|
|
11
|
-
|
|
12
|
-
## Install
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
- `
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
npm
|
|
60
|
-
|
|
61
|
-
```
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
build-corpus
|
|
80
|
-
|
|
81
|
-
|
|
82
|
-
|
|
83
|
-
|
|
84
|
-
|
|
85
|
-
|
|
86
|
-
|
|
87
|
-
|
|
88
|
-
|
|
89
|
-
|
|
90
|
-
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
-
|
|
94
|
-
|
|
95
|
-
-
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
npm
|
|
103
|
-
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
|
|
108
|
-
|
|
109
|
-
|
|
110
|
-
|
|
111
|
-
|
|
112
|
-
|
|
113
|
-
|
|
114
|
-
|
|
115
|
-
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
|
|
119
|
-
|
|
120
|
-
build-corpus
|
|
121
|
-
|
|
122
|
-
|
|
123
|
-
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
|
|
134
|
-
|
|
135
|
-
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
141
|
-
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
|
|
160
|
-
|
|
161
|
-
|
|
162
|
-
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
|
|
173
|
-
|
|
174
|
-
|
|
175
|
-
|
|
176
|
-
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
|
|
185
|
-
|
|
186
|
-
|
|
187
|
-
|
|
188
|
-
|
|
189
|
-
|
|
190
|
-
|
|
191
|
-
|
|
192
|
-
|
|
193
|
-
|
|
194
|
-
|
|
195
|
-
|
|
196
|
-
```
|
|
197
|
-
|
|
198
|
-
|
|
199
|
-
|
|
200
|
-
|
|
201
|
-
|
|
202
|
-
|
|
203
|
-
|
|
204
|
-
|
|
205
|
-
|
|
206
|
-
|
|
207
|
-
|
|
208
|
-
|
|
209
|
-
|
|
210
|
-
```
|
|
211
|
-
|
|
212
|
-
|
|
213
|
-
|
|
214
|
-
|
|
215
|
-
|
|
216
|
-
|
|
217
|
-
|
|
218
|
-
|
|
219
|
-
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
|
|
223
|
-
|
|
224
|
-
|
|
225
|
-
|
|
226
|
-
|
|
227
|
-
|
|
228
|
-
|
|
229
|
-
|
|
230
|
-
|
|
231
|
-
|
|
232
|
-
|
|
233
|
-
|
|
234
|
-
|
|
235
|
-
|
|
236
|
-
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
|
|
240
|
-
|
|
241
|
-
|
|
242
|
-
|
|
243
|
-
|
|
244
|
-
|
|
245
|
-
|
|
246
|
-
|
|
247
|
-
|
|
248
|
-
|
|
249
|
-
|
|
250
|
-
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
|
|
254
|
-
|
|
255
|
-
|
|
256
|
-
|
|
257
|
-
|
|
258
|
-
|
|
259
|
-
|
|
260
|
-
|
|
261
|
-
|
|
262
|
-
|
|
263
|
-
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
|
|
267
|
-
|
|
268
|
-
|
|
269
|
-
|
|
270
|
-
|
|
271
|
-
|
|
272
|
-
|
|
273
|
-
|
|
274
|
-
|
|
275
|
-
|
|
276
|
-
|
|
277
|
-
|
|
278
|
-
|
|
279
|
-
|
|
280
|
-
|
|
281
|
-
|
|
282
|
-
|
|
283
|
-
|
|
284
|
-
|
|
285
|
-
|
|
286
|
-
|
|
287
|
-
|
|
288
|
-
|
|
289
|
-
|
|
290
|
-
-
|
|
291
|
-
|
|
292
|
-
|
|
293
|
-
|
|
294
|
-
|
|
295
|
-
|
|
1
|
+
# regen-mde
|
|
2
|
+
|
|
3
|
+
regen-mde is the Windows editor and conversion suite for Markdown, Word, and PowerPoint files. Its `build-corpus` CLI converts `.docx`, `.pptx`, and `.ppt` files to Markdown while preserving the pieces that usually break in generic converters:
|
|
4
|
+
|
|
5
|
+
- Word OMML equations as KaTeX-readable TeX
|
|
6
|
+
- embedded images as local assets, base64 data URIs, or S3/R2-hosted URLs
|
|
7
|
+
- Markdown tables for simple Word tables
|
|
8
|
+
- HTML table fallback for complex tables
|
|
9
|
+
- headings, lists, links, bold, italic, inline code, and code-style paragraphs
|
|
10
|
+
- PowerPoint slide extraction with slide title detection, table mapping, and repetitive footer suppression
|
|
11
|
+
|
|
12
|
+
## Install
|
|
13
|
+
|
|
14
|
+
`build-corpus` is a **dual package**: the same version ships to both PyPI (Python-native)
|
|
15
|
+
and npm (Node wrapper). Pick the channel that fits your OS.
|
|
16
|
+
|
|
17
|
+
| OS | Recommended | Command | What you get |
|
|
18
|
+
|----|-------------|---------|--------------|
|
|
19
|
+
| **Ubuntu / Debian / Linux** | PyPI via pipx | `pipx install build-corpus` | native `build-corpus` CLI, no Node, isolated from system Python (PEP 668-safe) |
|
|
20
|
+
| **macOS** | PyPI via pipx | `pipx install build-corpus` | native `build-corpus` CLI, no Node |
|
|
21
|
+
| **Windows** | npm (full kit) | `npm install -g regen-mde` | `build-corpus` CLI **plus** the `regen-mde` editor and Explorer right-click menus |
|
|
22
|
+
|
|
23
|
+
### Ubuntu / Debian / Linux
|
|
24
|
+
|
|
25
|
+
The CLI is pure Python; the editor is Windows-only, so on Linux you install just the converter.
|
|
26
|
+
|
|
27
|
+
```bash
|
|
28
|
+
# prerequisites
|
|
29
|
+
sudo apt update
|
|
30
|
+
sudo apt install -y python3 python3-pip pipx
|
|
31
|
+
sudo apt install -y libreoffice # optional — only needed to convert legacy .ppt
|
|
32
|
+
|
|
33
|
+
# install the CLI (isolated venv, survives Ubuntu's externally-managed Python)
|
|
34
|
+
pipx install build-corpus
|
|
35
|
+
build-corpus --help
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
The npm package (`regen-mde`) also installs on Linux (`npm install -g regen-mde`) — it shells
|
|
39
|
+
out to `python3` for conversion. On Ubuntu 24.04+ the postinstall installs the Python dependencies
|
|
40
|
+
into your user site; if that is blocked it prints the `pipx install build-corpus` fallback and
|
|
41
|
+
still completes. The `regen-mde` editor is not built on Linux.
|
|
42
|
+
|
|
43
|
+
### Windows
|
|
44
|
+
|
|
45
|
+
Python is the native runtime:
|
|
46
|
+
|
|
47
|
+
```powershell
|
|
48
|
+
pip install build-corpus
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
The npm package ships the Windows installer plus the conversion CLI:
|
|
52
|
+
|
|
53
|
+
```powershell
|
|
54
|
+
npm pack regen-mde
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
Extract the package and run `dist\release\regen-mde-<version>-win-x64-setup.exe` for a normal Windows install. The installer creates Start Menu entries for `regen-mde` and `Uninstall regen-mde`, registers right-click Explorer verbs for `.docx` and `.md`, and removes those entries during uninstall.
|
|
58
|
+
|
|
59
|
+
The legacy global npm command path is still supported for automation:
|
|
60
|
+
|
|
61
|
+
```powershell
|
|
62
|
+
npm install -g regen-mde
|
|
63
|
+
```
|
|
64
|
+
|
|
65
|
+
On Windows, the installer and supported automation paths add right-click Explorer menus for `.docx`, `.pptx`, `.ppt`, `.md`, and folders:
|
|
66
|
+
|
|
67
|
+
- `Life AI -> Open in regen-mde`
|
|
68
|
+
- opens `.md` directly and opens `.docx` by converting it into editable Markdown first
|
|
69
|
+
- `Life AI -> Convert to Markdown`
|
|
70
|
+
- runs `build-corpus "%1" --out-same-dir` for `.docx`, `.pptx`, and `.ppt`
|
|
71
|
+
- writes `.md`, `assets`, and reports beside the source document
|
|
72
|
+
- `Life AI -> Convert to Word`
|
|
73
|
+
- runs `build-corpus "%1" --to word --out-same-dir`
|
|
74
|
+
- writes `.docx` and export report beside the source document
|
|
75
|
+
- `Life AI -> Inline Markdown Images`
|
|
76
|
+
- runs `build-corpus "%1" --inline-images`
|
|
77
|
+
- writes `<name>.inline.md` with local or HTTP image references embedded as data URIs
|
|
78
|
+
- folder `Convert Documents to Markdown`
|
|
79
|
+
- runs `build-corpus "%V" --out-same-dir`
|
|
80
|
+
- converts all `.docx`, `.pptx`, and `.ppt` files in the selected folder tree
|
|
81
|
+
|
|
82
|
+
The installer also registers `.md` under Explorer's New menu so you can create a blank Markdown document directly from `New`.
|
|
83
|
+
|
|
84
|
+
Set `BUILD_CORPUS_SKIP_WINDOWS_MENU=1` before a global npm install if you do not want the Explorer menu.
|
|
85
|
+
Set `BUILD_CORPUS_SKIP_EDITOR=1` before a global npm install if you want the CLI conversion verbs but not the editor open verbs.
|
|
86
|
+
|
|
87
|
+
To remove the Windows Explorer menus without uninstalling the package:
|
|
88
|
+
|
|
89
|
+
```powershell
|
|
90
|
+
build-corpus --uninstall-windows
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
Uninstalling the global npm package now removes those Explorer menu entries automatically.
|
|
94
|
+
|
|
95
|
+
For a project-local install, use `npx`:
|
|
96
|
+
|
|
97
|
+
```powershell
|
|
98
|
+
npm install regen-mde
|
|
99
|
+
npx build-corpus --help
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
On Windows, if `build-corpus` launches a Python executable and fails with `ModuleNotFoundError`, a stale pip install is shadowing the npm command. Remove it with:
|
|
103
|
+
|
|
104
|
+
```powershell
|
|
105
|
+
py -3 -m pip uninstall build-corpus
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
For S3/R2 image upload support:
|
|
109
|
+
|
|
110
|
+
```powershell
|
|
111
|
+
pip install "build-corpus[s3]"
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
## Basic Usage
|
|
115
|
+
|
|
116
|
+
```powershell
|
|
117
|
+
build-corpus input.docx --out out
|
|
118
|
+
build-corpus deck.pptx --out out
|
|
119
|
+
build-corpus input.md --to word --out out
|
|
120
|
+
build-corpus input.md --to word --word-template C:\path\custom.dotx --out out
|
|
121
|
+
regen-mde input.md
|
|
122
|
+
regen-mdeditor input.md
|
|
124
|
+
build-corpus editor input.md
|
|
125
|
+
build-corpus editor input.docx
|
|
126
|
+
```
|
|
127
|
+
|
|
128
|
+
## regen-mde
|
|
129
|
+
|
|
130
|
+
regen-mde is a Windows WebView2 desktop app bundled with the package. It uses the same local Build Corpus conversion engine as the CLI:
|
|
131
|
+
|
|
132
|
+
- Markdown opens directly.
|
|
133
|
+
- Word and PowerPoint files open by converting into Markdown.
|
|
134
|
+
- Save writes Markdown.
|
|
135
|
+
- Save As writes a new Markdown file.
|
|
136
|
+
- Export DOCX writes Word output through the Markdown-to-Word route.
|
|
137
|
+
|
|
138
|
+
Build the Windows executable locally:
|
|
139
|
+
|
|
140
|
+
```powershell
|
|
141
|
+
npm run editor:windows
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
The executable is written to:
|
|
145
|
+
|
|
146
|
+
```text
|
|
147
|
+
dist\windows-editor\BuildCorpusEditor.exe
|
|
148
|
+
```
|
|
149
|
+
|
|
150
|
+
Convert every `.docx` in a folder:
|
|
151
|
+
|
|
152
|
+
```powershell
|
|
153
|
+
build-corpus ./word-files --out ./markdown
|
|
154
|
+
```
|
|
155
|
+
|
|
156
|
+
Convert every supported file type in a folder (`.docx`, `.pptx`, `.ppt`):
|
|
157
|
+
|
|
158
|
+
```powershell
|
|
159
|
+
build-corpus ./source-files --out ./markdown
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
Convert specific selected files or folders from automation:
|
|
163
|
+
|
|
164
|
+
```powershell
|
|
165
|
+
build-corpus .\a.docx .\deck.pptx .\folder --out-same-dir
|
|
166
|
+
```
|
|
167
|
+
|
|
168
|
+
Move successfully processed source `.docx`, `.pptx`, and `.ppt` files into a `sources` folder beside each file:
|
|
169
|
+
|
|
170
|
+
```powershell
|
|
171
|
+
build-corpus ./source-files --out-same-dir --move-sources
|
|
172
|
+
```
|
|
173
|
+
|
|
174
|
+
Write Markdown beside each source document:
|
|
175
|
+
|
|
176
|
+
```powershell
|
|
177
|
+
build-corpus ./word-files --out-same-dir
|
|
178
|
+
```
|
|
179
|
+
|
|
180
|
+
## Image Modes
|
|
181
|
+
|
|
182
|
+
Local asset files, the default:
|
|
183
|
+
|
|
184
|
+
```powershell
|
|
185
|
+
build-corpus input.docx --images assets
|
|
186
|
+
```
|
|
187
|
+
|
|
188
|
+
Single-file Markdown with base64 image data URIs:
|
|
189
|
+
|
|
190
|
+
```powershell
|
|
191
|
+
build-corpus input.docx --images base64
|
|
192
|
+
```
|
|
193
|
+
|
|
194
|
+
Re-merge an existing Markdown file that references local or HTTP-hosted images into a single Markdown file with inline image data:
|
|
195
|
+
|
|
196
|
+
```powershell
|
|
197
|
+
build-corpus input.md --inline-images
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
Upload images to S3-compatible storage and write public URLs:
|
|
201
|
+
|
|
202
|
+
```powershell
|
|
203
|
+
build-corpus input.docx --images s3 --config examples\build-corpus.config.example.json
|
|
204
|
+
```
|
|
205
|
+
|
|
206
|
+
Cloudflare R2 uses the same `s3` mode. Set `endpoint_url` to:
|
|
207
|
+
|
|
208
|
+
```text
|
|
209
|
+
https://ACCOUNT_ID.r2.cloudflarestorage.com
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
## Config
|
|
213
|
+
|
|
214
|
+
Copy `examples/build-corpus.config.example.json` and edit it for your environment.
|
|
215
|
+
|
|
216
|
+
```json
|
|
217
|
+
{
|
|
218
|
+
"conversion": {
|
|
219
|
+
"equations": "tex",
|
|
220
|
+
"images": "s3"
|
|
221
|
+
},
|
|
222
|
+
"output": {
|
|
223
|
+
"out": "out",
|
|
224
|
+
"out_same_dir": false
|
|
225
|
+
},
|
|
226
|
+
"s3": {
|
|
227
|
+
"bucket": "build-corpus-assets",
|
|
228
|
+
"public_base_url": "https://assets.example.com",
|
|
229
|
+
"prefix": "knowledge-base",
|
|
230
|
+
"endpoint_url": "https://ACCOUNT_ID.r2.cloudflarestorage.com",
|
|
231
|
+
"region_name": "auto",
|
|
232
|
+
"access_key_id": "%R2_ACCESS_KEY_ID%",
|
|
233
|
+
"secret_access_key": "%R2_SECRET_ACCESS_KEY%"
|
|
234
|
+
}
|
|
235
|
+
}
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
Build Corpus expands environment variables in JSON string values, so credentials do not need to be committed.
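
As a rough illustration of the expansion described above, the following Python sketch replaces `%NAME%` placeholders in every JSON string value. The `expand_env` helper and its regular expression are assumptions made for illustration, not the package's API; the real implementation may accept other placeholder styles or treat missing variables differently.

```python
import json
import os
import re

# Hypothetical helper, not part of build-corpus: matches %NAME% placeholders.
_PLACEHOLDER = re.compile(r"%([A-Za-z_][A-Za-z0-9_]*)%")


def expand_env(value, env=None):
    """Recursively expand %NAME% placeholders in JSON string values."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # Unknown placeholders are left untouched so a missing variable stays visible.
        return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # numbers, booleans, and null pass through unchanged


raw = '{"s3": {"access_key_id": "%R2_ACCESS_KEY_ID%", "region_name": "auto"}}'
config = expand_env(json.loads(raw), env={"R2_ACCESS_KEY_ID": "example-key"})
print(config["s3"]["access_key_id"])
```

Passing `env` explicitly keeps the sketch testable; leaving it out reads from the process environment.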
|
|
239
|
+
|
|
240
|
+
### Output Placement
|
|
241
|
+
|
|
242
|
+
There are two output modes.
|
|
243
|
+
|
|
244
|
+
Write all converted Markdown into one output tree:
|
|
245
|
+
|
|
246
|
+
```json
|
|
247
|
+
{
|
|
248
|
+
"output": {
|
|
249
|
+
"out": "./markdown",
|
|
250
|
+
"out_same_dir": false
|
|
251
|
+
}
|
|
252
|
+
}
|
|
253
|
+
```
|
|
254
|
+
|
|
255
|
+
Write each `.md`, asset folder, and report beside the source `.docx`:
|
|
256
|
+
|
|
257
|
+
```json
|
|
258
|
+
{
|
|
259
|
+
"output": {
|
|
260
|
+
"out_same_dir": true
|
|
261
|
+
}
|
|
262
|
+
}
|
|
263
|
+
```
|
|
264
|
+
|
|
265
|
+
The same-dir mode is equivalent to:
|
|
266
|
+
|
|
267
|
+
```powershell
|
|
268
|
+
build-corpus ./word-files --out-same-dir
|
|
269
|
+
```
|
|
270
|
+
|
|
271
|
+
## Markdown to Word Templates
|
|
272
|
+
|
|
273
|
+
Markdown -> Word conversion uses this template precedence:
|
|
274
|
+
|
|
275
|
+
1. `--word-template <path>`
|
|
276
|
+
2. `word.template` in the JSON config
|
|
277
|
+
3. the bundled installed package template
|
|
278
|
+
4. built-in fallback styles if no template can be found
|
|
279
|
+
|
|
280
|
+
Template files are treated as style sources. Build Corpus creates a fresh output document body, then applies the template's Word styles, numbering, theme, fonts, and settings. It does not reuse the template body content as the exported document.
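
The precedence list above can be sketched in Python as a first-match lookup. The function name `resolve_word_template`, its parameters, and the choice to skip a candidate that does not exist on disk are all assumptions for illustration; the CLI may instead report an error when an explicit `--word-template` path is missing.

```python
from pathlib import Path


def resolve_word_template(cli_template, config, bundled_template):
    """Return the first template file that exists, or None for built-in styles.

    Hypothetical sketch of the documented precedence, not the package's code.
    """
    candidates = [
        cli_template,                                # 1. --word-template <path>
        (config.get("word") or {}).get("template"),  # 2. word.template in the JSON config
        bundled_template,                            # 3. bundled installed package template
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None  # 4. caller falls back to built-in styles
```

Returning `None` rather than raising leaves the fallback decision to the caller, which mirrors the fourth step of the list.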
|
|
281
|
+
|
|
282
|
+
## Equations
|
|
283
|
+
|
|
284
|
+
Equation conversion works natively in **both** directions:
|
|
285
|
+
|
|
286
|
+
**DOCX → Markdown** — Word OMML equations are converted to KaTeX-readable TeX
|
|
287
|
+
(via `omml2latex`). The default mode is parseable TeX:
|
|
288
|
+
|
|
289
|
+
```powershell
|
|
290
|
+
build-corpus input.docx --equations tex
|
|
291
|
+
```
|
|
292
|
+
|
|
293
|
+
Equation images are only for visual debugging:
|
|
294
|
+
|
|
295
|
+
```powershell
|
|
296
|
+
build-corpus input.docx --equations image
|
|
297
|
+
```
|
|
298
|
+
|
|
299
|
+
**Markdown → Word** — inline `$...$` and display `$$...$$` LaTeX are converted to
|
|
300
|
+
**native Office Math (OMML)** that Word renders as real equations — not raw text
|
|
301
|
+
in a math font. The pipeline is `latex2mathml` → `mathml2omml`, so commands like
|
|
302
|
+
`\sum`, `\int`, `\frac`, `\Delta`, `\rightarrow`, and `\leq` render correctly:
|
|
303
|
+
|
|
304
|
+
```powershell
|
|
305
|
+
build-corpus notes.md --to word --out out
|
|
306
|
+
```
|
|
307
|
+
|
|
308
|
+
If a fragment cannot be parsed as LaTeX, it falls back to the literal text in
|
|
309
|
+
Cambria Math and is flagged in the export report's `warnings`. Fence display
|
|
310
|
+
equations with `$$` on their own lines and no blank lines inside the fence.
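
To check a Markdown file against the fencing rule above before exporting, a small pre-flight script can help. This is a minimal sketch, not part of build-corpus: the function name is hypothetical, and it assumes every line consisting only of `$$` opens or closes a display equation (it does not skip fenced code blocks).

```python
def find_blank_lines_in_display_math(markdown):
    """Return 1-based line numbers of blank lines inside $$ ... $$ fences."""
    problems = []
    inside = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.strip() == "$$":
            inside = not inside  # a lone $$ toggles the fence state
        elif inside and not line.strip():
            problems.append(number)
    return problems


sample = "Intro\n$$\n\\frac{a}{b}\n\n\\leq c\n$$\n"
print(find_blank_lines_in_display_math(sample))  # prints [4]
```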
|
|
311
|
+
|
|
312
|
+
## Fidelity Report (md → word)
|
|
313
|
+
|
|
314
|
+
Every md→word export writes `export-report.json` (and a `build-corpus-batch-report.json`
|
|
315
|
+
across a batch) so you can confirm nothing was silently dropped or altered. Beyond the
|
|
316
|
+
raw output `stats`, the report carries:
|
|
317
|
+
|
|
318
|
+
- **`fidelity_ok`** — top-level ship gate. `true` only when every reconciliation row
|
|
319
|
+
matches (and zero equations fell back). The batch summary prints `all_fidelity_ok`
|
|
320
|
+
plus the list of `fidelity_failures`.
|
|
321
|
+
- **`reconciliation`** — input vs output per element type:
|
|
322
|
+
```json
|
|
323
|
+
"reconciliation": {
|
|
324
|
+
"tables": { "in": 1, "out": 1, "ok": true },
|
|
325
|
+
"equations": { "in": 3, "out_omml": 2, "fell_back": 1, "ok": false },
|
|
326
|
+
"images": { "in": 2, "out": 0, "failed": 2, "ok": false },
|
|
327
|
+
"code_blocks": { "in": 0, "out": 0, "ok": true },
|
|
328
|
+
"headings": { "in": 1, "out": 1, "ok": true },
|
|
329
|
+
"links": { "in": 1, "out": 1, "ok": true }
|
|
330
|
+
}
|
|
331
|
+
```
|
|
332
|
+
- **`issues`** — one entry per problem with the source line: `{ "type", "line", "source"|"target", "reason" }`.
|
|
333
|
+
- **`text_fixups`** — markdown escapes the engine resolved on your content, e.g.
|
|
334
|
+
`{ "total": 2, "currency_unescaped": 2 }`. Escaped currency like `\$252.3B` is kept as
|
|
335
|
+
literal text (`$252.3B`), never mistaken for inline math.
|
|
336
|
+
- A one-line **stdout digest** for a quick CLI glance:
|
|
337
|
+
```
|
|
338
|
+
[OK] tables 1/1 [!!] equations 2/3 (1 fell back) [!!] images 0/2 (2 failed) … -> fidelity_ok=false
|
|
339
|
+
```
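
The report fields listed above lend themselves to a simple automated gate. The following Python sketch uses only the documented keys (`fidelity_ok`, `reconciliation`, and each row's `ok`); the helper names and the sample report are illustrative assumptions, and a real report contains more fields than shown.

```python
import json


def failed_rows(report):
    """Return the names of reconciliation rows whose "ok" flag is not true."""
    rows = report.get("reconciliation", {})
    return sorted(name for name, row in rows.items() if not row.get("ok", False))


def is_shippable(report):
    """Require the top-level gate and cross-check it against the rows."""
    return bool(report.get("fidelity_ok")) and not failed_rows(report)


# Sample fragment built from the field names documented above.
sample = json.loads("""
{
  "fidelity_ok": false,
  "reconciliation": {
    "tables":    { "in": 1, "out": 1, "ok": true },
    "equations": { "in": 3, "out_omml": 2, "fell_back": 1, "ok": false },
    "images":    { "in": 2, "out": 0, "failed": 2, "ok": false }
  }
}
""")

print(failed_rows(sample))   # prints ['equations', 'images']
print(is_shippable(sample))  # prints False
```

In a CI job, the same functions could be applied to the parsed `export-report.json` and the job failed whenever `is_shippable` returns `False`.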
|
|
340
|
+
|
|
341
|
+
Image failures carry a specific `reason` so you know how to react:
|
|
342
|
+
|
|
343
|
+
| reason | meaning | fix |
|
|
344
|
+
|--------|---------|-----|
|
|
345
|
+
| `missing-file` | target path not found | correct the path |
|
|
346
|
+
| `unsupported-on-platform` | EMF/WMF that needs metafile→PNG conversion | install LibreOffice / run on Windows |
|
|
347
|
+
| `unsupported-format` | `.html`/`.jsx`/`.svg` etc. — cannot be embedded | pre-render to PNG via a render pipeline |
|
|
348
|
+
| `skipped-remote` | `http(s)`/`data:` target | localize the asset first |
|
|
349
|
+
|
|
350
|
+
build-corpus does **not** rasterize HTML/JSX — that belongs to a separate render
|
|
351
|
+
step (e.g. a headless-browser screenshot). It flags them and moves on.
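
For triage, the reasons in the table above can be tallied from the report's `issues` entries. This Python sketch is an assumption-laden illustration: it relies only on the documented `reason` key, the `FIXES` mapping copies the table's "fix" column, and the function name is hypothetical.

```python
from collections import Counter

# Fix hints copied from the table above.
FIXES = {
    "missing-file": "correct the path",
    "unsupported-on-platform": "install LibreOffice / run on Windows",
    "unsupported-format": "pre-render to PNG via a render pipeline",
    "skipped-remote": "localize the asset first",
}


def summarize_image_issues(issues):
    """Return (reason, count, fix) tuples for the documented image reasons."""
    counts = Counter(
        issue["reason"] for issue in issues if issue.get("reason") in FIXES
    )
    return [(reason, count, FIXES[reason]) for reason, count in sorted(counts.items())]


issues = [
    {"line": 12, "target": "chart.svg", "reason": "unsupported-format"},
    {"line": 30, "target": "missing.png", "reason": "missing-file"},
    {"line": 41, "target": "gone.png", "reason": "missing-file"},
]
for reason, count, fix in summarize_image_issues(issues):
    print(f"{reason}: {count} ({fix})")
```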
|
|
352
|
+
|
|
353
|
+
## PowerPoint Notes
|
|
354
|
+
|
|
355
|
+
- `.pptx` is processed directly.
|
|
356
|
+
- `.ppt` is converted to `.pptx` first using LibreOffice (`soffice --headless --convert-to pptx`).
|
|
357
|
+
- Repeated boilerplate blocks that appear on most slides are removed from the emitted Markdown.
|
|
358
|
+
- Slide images are exported from the original package binaries (`ppt/media/*`), not screen-captured display rasters.
|
|
359
|
+
- Markdown output uses size-aware HTML image tags (`<img ... width= height=>`) based on OOXML display extents (`a:xfrm/a:ext`).
|
|
360
|
+
- The export report includes `low_dpi_images` to flag images whose effective on-slide DPI is under 150.
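
The effective-DPI check mentioned in the last bullet can be approximated as follows. OOXML extents are expressed in EMU, with 914,400 EMU per inch; taking the smaller of the horizontal and vertical densities is an assumption of this sketch, as are the function names, so the exporter's own calculation may differ.

```python
EMU_PER_INCH = 914_400   # OOXML English Metric Units per inch
LOW_DPI_THRESHOLD = 150  # threshold stated in the list above


def effective_dpi(pixel_width, pixel_height, ext_cx_emu, ext_cy_emu):
    """Pixel density of an image at its on-slide display size (a:ext cx/cy)."""
    dpi_x = pixel_width / (ext_cx_emu / EMU_PER_INCH)
    dpi_y = pixel_height / (ext_cy_emu / EMU_PER_INCH)
    return min(dpi_x, dpi_y)  # assumption: the weaker axis decides


def is_low_dpi(pixel_width, pixel_height, ext_cx_emu, ext_cy_emu):
    """True when the effective DPI is under the 150 DPI threshold."""
    return effective_dpi(pixel_width, pixel_height, ext_cx_emu, ext_cy_emu) < LOW_DPI_THRESHOLD


# A 300x200 px image shown at 4 in x 2 in (3,657,600 x 1,828,800 EMU).
print(effective_dpi(300, 200, 3_657_600, 1_828_800))  # prints 75.0
print(is_low_dpi(300, 200, 3_657_600, 1_828_800))     # prints True
```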
|
|
361
|
+
|
|
362
|
+
## Validation
|
|
363
|
+
|
|
364
|
+
The package includes a KaTeX validator for emitted Markdown math:
|
|
365
|
+
|
|
366
|
+
```powershell
|
|
367
|
+
build-corpus-katex out
|
|
368
|
+
```
|
|
369
|
+
|
|
370
|
+
## Repeatable Test Wrappers
|
|
371
|
+
|
|
372
|
+
Run a single known DOCX through conversion plus validators:
|
|
373
|
+
|
|
374
|
+
```powershell
|
|
375
|
+
.\scripts\run-smoke.ps1 -Docx ".\fixtures\sample.docx" -Out ".tmp\smoke" -Images assets
|
|
376
|
+
```
|
|
377
|
+
|
|
378
|
+
Run a whole folder corpus:
|
|
379
|
+
|
|
380
|
+
```powershell
|
|
381
|
+
.\scripts\run-corpus.ps1 -Source ".\fixtures\wordtest" -Out ".tmp\wordtest" -Images base64
|
|
382
|
+
```
|
|
383
|
+
|
|
384
|
+
Build a public online DOCX corpus for regression testing:
|
|
385
|
+
|
|
386
|
+
```powershell
|
|
387
|
+
python .\tools\collect_online_docx_corpus.py --out ".tmp\online-docx\source-docx" --target 50
|
|
388
|
+
.\scripts\run-corpus.ps1 -Source ".tmp\online-docx\source-docx" -Out ".tmp\online-docx\markdown"
|
|
389
|
+
```
|
|
390
|
+
|
|
391
|
+
Build a public online PPTX corpus and compare input/output extraction:
|
|
392
|
+
|
|
393
|
+
```powershell
|
|
394
|
+
python .\tools\collect_online_pptx_corpus.py --out ".tmp\online-pptx\source-pptx" --target 20
|
|
395
|
+
.\scripts\run-corpus.ps1 -Source ".tmp\online-pptx\source-pptx" -Out ".tmp\online-pptx\markdown"
|
|
396
|
+
python .\tools\compare_pptx_inputs_outputs.py --manifest ".tmp\online-pptx\source-pptx\online-pptx-manifest.json" --out ".tmp\online-pptx\markdown" --report ".tmp\online-pptx\markdown\pptx-io-compare.json"
|
|
397
|
+
```
|
|
398
|
+
|
|
399
|
+
## Failed Documents
|
|
400
|
+
|
|
401
|
+
If a document does not convert correctly, open an issue with:
|
|
402
|
+
|
|
403
|
+
- the `.docx` file if it is safe to share
|
|
404
|
+
- the generated `.md`
|
|
405
|
+
- the `export-report.json`
|
|
406
|
+
- the command and config used
|
|
407
|
+
- a screenshot of the expected Word output if layout is the issue
|
|
408
|
+
|
|
409
|
+
For confidential files, strip or replace sensitive content before sharing. The useful part is the broken DOCX structure, not the private text.
|