mineru-layout-viewer 0.1.0

package/README.md ADDED
@@ -0,0 +1,301 @@
1
+ # MinerU Layout Viewer
2
+
3
+ Visualize [MinerU](https://github.com/opendatalab/MinerU) `layout.json` / `middle.json` output — side-by-side PDF + Markdown with bidirectional click-to-navigate.
4
+
5
+ [English](#english) | [中文](#chinese)
6
+
7
+ ---
8
+
9
+ <a name="english"></a>
10
+ ## English
11
+
12
+ ### Demo
13
+
14
+ Drop a MinerU export `.zip` (or a PDF plus its `layout.json`) onto the page, or point the element at the files with attributes:
15
+
16
+ ```html
17
+ <mineru-layout-viewer
18
+ pdf="document.pdf"
19
+ layout="layout.json"
20
+ markdown="full.md">
21
+ </mineru-layout-viewer>
22
+ ```
23
+
24
+ ### Features
25
+
26
+ - **Dual-pane view** — PDF pages on the left, Markdown text on the right
27
+ - **Bidirectional navigation** — click a Markdown line → scrolls to the corresponding PDF block and highlights it; click a PDF overlay → scrolls to the matching Markdown line
28
+ - **Multi-format support** — automatically detects `layout.json`, `middle.json`, and `content_list.json`
29
+ - **Nested block handling** — resolves list items, table cells, and other nested blocks to their leaf coordinates
30
+ - **Framework-agnostic** — built as a Web Component, works with React, Vue, or plain HTML
31
+ - **Zip support** — drop a MinerU output `.zip` directly, auto-extracts PDF + layout + markdown
32
+
33
+ ### Installation
34
+
35
+ ```bash
36
+ npm install mineru-layout-viewer
37
+ ```
38
+
39
+ #### Via CDN
40
+
41
+ ```html
42
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.min.js"></script>
43
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/jszip/3.10.1/jszip.min.js"></script>
44
+ <script src="https://unpkg.com/mineru-layout-viewer/dist/mineru-layout-viewer.iife.js"></script>
45
+ ```
46
+
47
+ Set the PDF.js worker:
48
+
49
+ ```html
50
+ <script>
51
+ pdfjsLib.GlobalWorkerOptions.workerSrc =
52
+ 'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.worker.min.js'
53
+ </script>
54
+ ```
55
+
56
+ ### Usage
57
+
58
+ #### HTML / Web Component
59
+
60
+ ```html
61
+ <!-- Attribute-based -->
62
+ <mineru-layout-viewer
63
+ pdf="./document.pdf"
64
+ layout="./layout.json"
65
+ markdown="./full.md">
66
+ </mineru-layout-viewer>
67
+
68
+ <!-- Programmatic API (zip); type="module" allows top-level await -->
69
+ <script type="module">
70
+ const viewer = document.querySelector('mineru-layout-viewer')
71
+ await viewer.loadZip(zipBlob) // from MinerU export .zip
72
+ </script>
73
+
74
+ <!-- Programmatic API (JSON); module scope avoids redeclaring `viewer` -->
75
+ <script type="module">
76
+ const viewer = document.querySelector('mineru-layout-viewer')
77
+ viewer.loadLayoutFromJson(jsonData)
78
+ viewer.loadMarkdown(markdownText)
79
+ </script>
80
+ ```
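
One way to obtain the `zipBlob` used above is a file input. The sketch below is illustrative only: `pickZipFile` and the `#zip-input` element are hypothetical names, not part of the package. The only call taken from this README is `loadZip(blob)`; a `File` is a `Blob`, so it can be passed directly.

```javascript
// Hypothetical helper: return the first entry whose name ends in ".zip"
// (case-insensitive) from a FileList-like collection, or null if none.
function pickZipFile(files) {
  for (const file of Array.from(files)) {
    if (/\.zip$/i.test(file.name)) return file
  }
  return null
}

// Browser-only wiring; skipped when no DOM is available (e.g. in Node).
if (typeof document !== 'undefined') {
  const input = document.querySelector('#zip-input') // assumed <input type="file">
  const viewer = document.querySelector('mineru-layout-viewer')
  if (input && viewer) {
    input.addEventListener('change', async () => {
      const zip = pickZipFile(input.files)
      if (zip) await viewer.loadZip(zip) // File extends Blob
    })
  }
}
```

Keeping the file selection in a small pure function makes it easy to check without a browser.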
81
+
82
+ #### JavaScript / ESM
83
+
84
+ ```js
85
+ import {
86
+ parseBlocks,
87
+ matchMarkdownToPdf,
88
+ MineruLayoutViewer,
89
+ } from 'mineru-layout-viewer'
90
+
91
+ // Parse a layout.json string into blocks
92
+ const blocks = parseBlocks(layoutJsonStr)
93
+
94
+ // Match markdown paragraphs to PDF blocks
95
+ const sections = matchMarkdownToPdf(markdown, blocks)
96
+
97
+ // Each section: { text, page, bbox: [x0,y0,x1,y1] | null }
98
+ ```
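
Once `sections` is available, a common next step is to bucket it by page. The following is a minimal sketch that relies only on the `{ text, page, bbox }` shape shown above; `groupSectionsByPage` is a hypothetical helper, and treating `bbox: null` as "no PDF match" is an assumption, since the README does not say when `bbox` is null.

```javascript
// Group MdSection-shaped objects ({ text, page, bbox }) by 1-based page number.
// Sections with bbox === null are collected separately (assumed unmatched).
function groupSectionsByPage(sections) {
  const byPage = new Map()
  const unmatched = []
  for (const section of sections) {
    if (section.bbox === null) {
      unmatched.push(section)
      continue
    }
    if (!byPage.has(section.page)) byPage.set(section.page, [])
    byPage.get(section.page).push(section)
  }
  return { byPage, unmatched }
}

// Sample data standing in for the result of matchMarkdownToPdf().
const sampleSections = [
  { text: '# Title', page: 1, bbox: [72, 80, 520, 110] },
  { text: 'First paragraph', page: 1, bbox: [72, 130, 520, 200] },
  { text: '![](img.png)', page: 1, bbox: null },
  { text: 'Second page text', page: 2, bbox: [72, 80, 520, 160] },
]
const grouped = groupSectionsByPage(sampleSections)
console.log([...grouped.byPage.keys()], grouped.unmatched.length)
```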
99
+
100
+ ### API
101
+
102
+ #### `parseBlocks(jsonStr: string): PdfBlock[]`
103
+
104
+ Parses the contents of a MinerU JSON file (`layout.json`, `middle.json`, or `content_list.json`) into an array of leaf-level blocks.
105
+
106
+ ```ts
107
+ interface PdfBlock {
108
+ page_idx: number // 0-based page index
109
+ bbox: [number, number, number, number] // [x0, y0, x1, y1] — top-left origin
110
+ text?: string // extracted span content
111
+ type?: string // block type: "text", "title", "list", etc.
112
+ }
113
+ ```
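
To illustrate what "leaf-level" means here, the sketch below flattens a nested block into `PdfBlock`-shaped leaves. It is not the package's implementation. The input shape is an assumption: child blocks in a `blocks` array and leaf text under `lines[].spans[].content`. `flattenToLeaves` is a hypothetical name.

```javascript
// Assumed input shape (not confirmed by the README): a block may carry child
// blocks in `blocks`, and leaf text in `lines[].spans[].content`.
function flattenToLeaves(block, pageIdx, out = []) {
  const children = Array.isArray(block.blocks) ? block.blocks : []
  if (children.length > 0) {
    // Container block: recurse and emit only its descendants.
    for (const child of children) flattenToLeaves(child, pageIdx, out)
    return out
  }
  // Leaf block: concatenate span contents into one text string.
  const text = (block.lines ?? [])
    .flatMap((line) => line.spans ?? [])
    .map((span) => span.content ?? '')
    .join('')
  out.push({ page_idx: pageIdx, bbox: block.bbox, text, type: block.type })
  return out
}

// A list block with two nested items (made-up coordinates).
const listBlock = {
  type: 'list',
  bbox: [50, 100, 500, 180],
  blocks: [
    { type: 'text', bbox: [50, 100, 500, 130], lines: [{ spans: [{ content: 'First item' }] }] },
    { type: 'text', bbox: [50, 140, 500, 180], lines: [{ spans: [{ content: 'Second item' }] }] },
  ],
}
const leaves = flattenToLeaves(listBlock, 0)
console.log(leaves.map((leaf) => leaf.text))
```

Each list item ends up with its own `bbox`, so a click can highlight one item instead of the whole list.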
114
+
115
+ #### `matchMarkdownToPdf(markdown: string, blocks: PdfBlock[]): MdSection[]`
116
+
117
+ Splits the Markdown into lines and matches each line to a PDF block by longest common subsequence (LCS) similarity.
118
+
119
+ ```ts
120
+ interface MdSection {
121
+ text: string
122
+ page: number // 1-based page number
123
+ bbox: [number, number, number, number] | null
124
+ }
125
+ ```
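
The README does not spell out how the LCS score is computed, so the following is a generic sketch of LCS-based matching rather than the package's algorithm. The normalization (LCS length over the longer string), the `0.6` threshold, and the names `lcsLength`, `lcsSimilarity`, and `matchLine` are all assumptions made for illustration. It also shows the index shift between `PdfBlock.page_idx` (0-based) and `MdSection.page` (1-based).

```javascript
// Length of the longest common subsequence of two strings (classic DP,
// O(a.length * b.length) time, two rolling rows of memory).
function lcsLength(a, b) {
  let prev = new Array(b.length + 1).fill(0)
  for (let i = 1; i <= a.length; i++) {
    const curr = new Array(b.length + 1).fill(0)
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1])
    }
    prev = curr
  }
  return prev[b.length]
}

// Assumed normalization: LCS length divided by the longer string's length.
function lcsSimilarity(a, b) {
  const longer = Math.max(a.length, b.length)
  return longer === 0 ? 0 : lcsLength(a, b) / longer
}

// Pick the best-scoring block for one Markdown line. Below the (assumed)
// threshold the result has bbox: null; page: 0 is a placeholder of this sketch.
function matchLine(line, blocks, threshold = 0.6) {
  let best = null
  let bestScore = 0
  for (const block of blocks) {
    const score = lcsSimilarity(line, block.text ?? '')
    if (score > bestScore) {
      best = block
      bestScore = score
    }
  }
  if (best === null || bestScore < threshold) {
    return { text: line, page: 0, bbox: null }
  }
  // PdfBlock.page_idx is 0-based, MdSection.page is 1-based.
  return { text: line, page: best.page_idx + 1, bbox: best.bbox }
}
```

A subsequence score tolerates small differences between the Markdown and the extracted PDF text, such as a dropped section number or extra whitespace.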
126
+
127
+ #### `<mineru-layout-viewer>` Attributes
128
+
129
+ | Attribute | Description |
130
+ |------------|------------------------------------------|
131
+ | `pdf` | URL to the PDF file |
132
+ | `layout` | URL to `layout.json` / `middle.json` |
133
+ | `markdown` | URL to `full.md` (optional) |
134
+
135
+ #### `<mineru-layout-viewer>` Methods
136
+
137
+ | Method | Description |
138
+ |-----------------------------------------|---------------------------------------|
139
+ | `loadZip(blob: Blob): Promise<void>` | Load from a MinerU export .zip |
140
+ | `loadLayoutFromJson(data: object\|string)`| Load layout JSON directly |
141
+ | `loadMarkdown(text: string)` | Load markdown text directly |
142
+
143
+ ### Supported JSON Formats
144
+
145
+ | Format              | Structure                          | Coordinates |
146
+ |---------------------|------------------------------------|-----------|
147
+ | `layout.json` | `{ pdf_info: [{ preproc_blocks }] }` | top-left |
148
+ | `middle.json` | `{ pdf_info: [{ preproc_blocks }] }` | top-left |
149
+ | `content_list.json` | `[{ page_idx, bbox, text }]` | 0–1000 |
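
The structures in the table above suggest a simple shape check for telling the formats apart, sketched below. This is not the package's detection logic. `detectFormat` and `scaleBbox` are hypothetical helpers, and reading "0–1000" as "each coordinate is a fraction of the page size scaled to 0..1000" is an assumption.

```javascript
// Guess the MinerU format from the parsed JSON's top-level shape, following
// the table above. layout.json and middle.json share one structure, so they
// cannot be told apart by shape alone.
function detectFormat(data) {
  if (Array.isArray(data)) return 'content_list'
  if (data !== null && typeof data === 'object' && Array.isArray(data.pdf_info)) {
    return 'layout_or_middle'
  }
  return 'unknown'
}

// Assumption: x values scale by page width and y values by page height,
// both from a 0..1000 range.
function scaleBbox([x0, y0, x1, y1], pageWidth, pageHeight) {
  return [
    (x0 * pageWidth) / 1000,
    (y0 * pageHeight) / 1000,
    (x1 * pageWidth) / 1000,
    (y1 * pageHeight) / 1000,
  ]
}

console.log(detectFormat([{ page_idx: 0, bbox: [100, 250, 500, 750], text: 'Hi' }]))
console.log(scaleBbox([100, 250, 500, 750], 600, 800))
```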
150
+
151
+ ### License
152
+
153
+ MIT
154
+
155
+ ---
156
+
157
+ <a name="chinese"></a>
158
+ ## 中文 (Chinese)
159
+
160
+ ### Demo
161
+
162
+ Drop a MinerU export `.zip` (or a PDF plus `layout.json`) onto the page:
163
+
164
+ ```html
165
+ <mineru-layout-viewer
166
+ pdf="document.pdf"
167
+ layout="layout.json"
168
+ markdown="full.md">
169
+ </mineru-layout-viewer>
170
+ ```
171
+
172
+ ### Features
173
+
174
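+ - **Dual-pane view**: PDF pages on the left, Markdown text on the right
175
+ - **Bidirectional navigation**: click a Markdown line to scroll to the corresponding PDF block and highlight it; click a PDF overlay block to scroll to the matching Markdown line
176
+ - **Multi-format support**: automatically detects `layout.json`, `middle.json`, and `content_list.json`
177
+ - **Nested block resolution**: resolves list items, table cells, and other nested blocks to their leaf-node coordinates
178
+ - **Framework-agnostic**: built as a Web Component; works with React, Vue, or plain HTML
179
+ - **Direct zip drop**: drop a MinerU output `.zip` directly; the PDF, layout, and Markdown are extracted automatically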
180
+
181
+ ### Installation
182
+
183
+ ```bash
184
+ npm install mineru-layout-viewer
185
+ ```
186
+
187
+ #### Via CDN
188
+
189
+ ```html
190
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.min.js"></script>
191
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/jszip/3.10.1/jszip.min.js"></script>
192
+ <script src="https://unpkg.com/mineru-layout-viewer/dist/mineru-layout-viewer.iife.js"></script>
193
+ ```
194
+
195
+ Set the PDF.js worker:
196
+
197
+ ```html
198
+ <script>
199
+ pdfjsLib.GlobalWorkerOptions.workerSrc =
200
+ 'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.worker.min.js'
201
+ </script>
202
+ ```
203
+
204
+ ### Usage
205
+
206
+ #### HTML / Web Component
207
+
208
+ ```html
209
+ <!-- Attribute-based -->
210
+ <mineru-layout-viewer
211
+ pdf="./document.pdf"
212
+ layout="./layout.json"
213
+ markdown="./full.md">
214
+ </mineru-layout-viewer>
215
+
216
+ <!-- Programmatic API (load from zip); type="module" allows top-level await -->
217
+ <script type="module">
218
+ const viewer = document.querySelector('mineru-layout-viewer')
219
+ await viewer.loadZip(zipBlob) // load from a MinerU export .zip
220
+ </script>
221
+
222
+ <!-- Programmatic API (pass JSON directly); module scope avoids redeclaring `viewer` -->
223
+ <script type="module">
224
+ const viewer = document.querySelector('mineru-layout-viewer')
225
+ viewer.loadLayoutFromJson(jsonData)
226
+ viewer.loadMarkdown(markdownText)
227
+ </script>
228
+ ```
229
+
230
+ #### JavaScript / ESM
231
+
232
+ ```js
233
+ import {
234
+ parseBlocks,
235
+ matchMarkdownToPdf,
236
+ MineruLayoutViewer,
237
+ } from 'mineru-layout-viewer'
238
+
239
+ // Parse a layout.json string into an array of blocks
240
+ const blocks = parseBlocks(layoutJsonStr)
241
+
242
+ // Match Markdown paragraphs to PDF blocks
243
+ const sections = matchMarkdownToPdf(markdown, blocks)
244
+
245
+ // Each section: { text, page, bbox: [x0,y0,x1,y1] | null }
246
+ ```
247
+
248
+ ### API
249
+
250
+ #### `parseBlocks(jsonStr: string): PdfBlock[]`
251
+
252
+ Parses the contents of a MinerU JSON file (`layout.json`, `middle.json`, or `content_list.json`) into an array of leaf-level blocks.
253
+
254
+ ```ts
255
+ interface PdfBlock {
256
+ page_idx: number // 0-based page index
257
+ bbox: [number, number, number, number] // [x0, y0, x1, y1], top-left origin
258
+ text?: string // extracted span text
259
+ type?: string // block type: "text", "title", "list", etc.
260
+ }
261
+ ```
262
+
263
+ #### `matchMarkdownToPdf(markdown: string, blocks: PdfBlock[]): MdSection[]`
264
+
265
+ Splits the Markdown into lines and matches each line to a PDF block by longest common subsequence (LCS) similarity.
266
+
267
+ ```ts
268
+ interface MdSection {
269
+ text: string
270
+ page: number // 1-based page number
271
+ bbox: [number, number, number, number] | null
272
+ }
273
+ ```
274
+
275
+ #### `<mineru-layout-viewer>` Attributes
276
+
277
+ | Attribute  | Description                              |
278
+ |------------|------------------------------------------|
279
+ | `pdf`      | URL of the PDF file                      |
280
+ | `layout`   | URL of `layout.json` / `middle.json`     |
281
+ | `markdown` | URL of `full.md` (optional)              |
282
+
283
+ #### `<mineru-layout-viewer>` Methods
284
+
285
+ | Method                                  | Description                           |
286
+ |-----------------------------------------|---------------------------------------|
287
+ | `loadZip(blob: Blob): Promise<void>`    | Load from a MinerU export .zip        |
288
+ | `loadLayoutFromJson(data: object\|string)`| Load layout JSON directly           |
289
+ | `loadMarkdown(text: string)`            | Load Markdown text directly           |
290
+
291
+ ### Supported JSON Formats
292
+
293
+ | Format              | Structure                          | Coordinate system |
294
+ |---------------------|------------------------------------|-------------------|
295
+ | `layout.json`       | `{ pdf_info: [{ preproc_blocks }] }` | top-left        |
296
+ | `middle.json`       | `{ pdf_info: [{ preproc_blocks }] }` | top-left        |
297
+ | `content_list.json` | `[{ page_idx, bbox, text }]`       | 0–1000            |
298
+
299
+ ### License
300
+
301
+ MIT