mineru-layout-viewer 0.1.0
- package/README.md +301 -0
- package/dist/index.mjs +444 -0
- package/dist/index.mjs.map +7 -0
- package/dist/mineru-layout-viewer.iife.js +2864 -0
- package/dist/mineru-layout-viewer.iife.js.map +7 -0
- package/package.json +53 -0
package/README.md
ADDED
# MinerU Layout Viewer

Visualize [MinerU](https://github.com/opendatalab/MinerU) `layout.json` / `middle.json` output as a side-by-side PDF and Markdown view with bidirectional click-to-navigate.

---

<a name="english"></a>
## English

### Demo

Drop a MinerU export `.zip` (or a PDF plus `layout.json`) onto the page, or point the element at the files with attributes:

```html
<mineru-layout-viewer
  pdf="document.pdf"
  layout="layout.json"
  markdown="full.md">
</mineru-layout-viewer>
```

### Features

- **Dual-pane view**: PDF pages on the left, Markdown text on the right
- **Bidirectional navigation**: clicking a Markdown line scrolls to the corresponding PDF block and highlights it; clicking a PDF overlay scrolls to the matching Markdown line
- **Multi-format support**: automatically detects `layout.json`, `middle.json`, and `content_list.json`
- **Nested block handling**: resolves list items, table cells, and other nested blocks to their leaf coordinates
- **Framework-agnostic**: built as a Web Component; works with React, Vue, or plain HTML
- **Zip support**: drop a MinerU output `.zip` directly; the PDF, layout, and Markdown are extracted automatically

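The format detection mentioned in the feature list can be pictured as a shape check on the parsed JSON. The sketch below is only an illustration: `detectMineruFormat` is a hypothetical helper, and the shapes it checks are taken from the "Supported JSON Formats" table in this README, not from the viewer's actual source.

```javascript
// Hypothetical helper: guess which MinerU export a parsed JSON value came from.
// The viewer's real detection logic is not shown in this README.
function detectMineruFormat(data) {
  // content_list.json is a flat array of { page_idx, bbox, text } entries.
  if (Array.isArray(data)) {
    return 'content_list'
  }
  // layout.json and middle.json share the { pdf_info: [...] } envelope,
  // so they cannot be told apart by shape alone.
  if (data !== null && typeof data === 'object' && Array.isArray(data.pdf_info)) {
    return 'pdf_info'
  }
  return 'unknown'
}
```

Returning a single `'pdf_info'` label for both `layout.json` and `middle.json` is a deliberate simplification, since the table lists the same structure for both.
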
### Installation

```bash
npm install mineru-layout-viewer
```

#### Via CDN

```html
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jszip/3.10.1/jszip.min.js"></script>
<script src="https://unpkg.com/mineru-layout-viewer/dist/mineru-layout-viewer.iife.js"></script>
```

Set the PDF.js worker:

```html
<script>
  pdfjsLib.GlobalWorkerOptions.workerSrc =
    'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.worker.min.js'
</script>
```

### Usage

#### HTML / Web Component

```html
<!-- Attribute-based -->
<mineru-layout-viewer
  pdf="./document.pdf"
  layout="./layout.json"
  markdown="./full.md">
</mineru-layout-viewer>

<!-- Programmatic API (zip) -->
<!-- type="module" allows top-level await and gives each script its own scope -->
<script type="module">
  const viewer = document.querySelector('mineru-layout-viewer')
  // zipBlob: a Blob or File holding a MinerU export .zip
  await viewer.loadZip(zipBlob)
</script>

<!-- Programmatic API (JSON) -->
<script type="module">
  const viewer = document.querySelector('mineru-layout-viewer')
  viewer.loadLayoutFromJson(jsonData) // parsed object or JSON string
  viewer.loadMarkdown(markdownText)
</script>
```

#### JavaScript / ESM

```js
import {
  parseBlocks,
  matchMarkdownToPdf,
  MineruLayoutViewer,
} from 'mineru-layout-viewer'

// Parse a layout.json string into blocks
const blocks = parseBlocks(layoutJsonStr)

// Match markdown lines to PDF blocks
const sections = matchMarkdownToPdf(markdown, blocks)

// Each section: { text, page, bbox: [x0,y0,x1,y1] | null }
```

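Once sections are available, a common next step is to bucket them by page so that each PDF page can draw its own overlays. The sketch below assumes only the `{ text, page, bbox }` section shape described above; `groupSectionsByPage` is a hypothetical helper, not part of the package.

```javascript
// Hypothetical helper: group matched sections by their 1-based page number.
// Sections with a null bbox have no PDF block to highlight, so they are skipped.
function groupSectionsByPage(sections) {
  const byPage = new Map()
  for (const section of sections) {
    if (section.bbox === null) continue
    if (!byPage.has(section.page)) byPage.set(section.page, [])
    byPage.get(section.page).push(section)
  }
  return byPage
}
```

A `Map` keeps the pages in first-seen order, which usually matches document order when the Markdown is processed top to bottom.
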
### API

#### `parseBlocks(jsonStr: string): PdfBlock[]`

Parses the contents of a MinerU JSON file (`layout.json`, `middle.json`, or `content_list.json`) into an array of leaf-level blocks.

```ts
interface PdfBlock {
  page_idx: number                        // 0-based page index
  bbox: [number, number, number, number]  // [x0, y0, x1, y1], top-left origin
  text?: string                           // extracted span content
  type?: string                           // block type: "text", "title", "list", etc.
}
```

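To make "leaf-level" concrete, here is a minimal sketch of flattening one nested block into `PdfBlock`-shaped leaves. It is not the package's implementation: the helper name `collectLeafBlocks` is hypothetical, and the input field names (`blocks` for children, `lines`, `spans`, `content` for text) are assumptions about the MinerU JSON layout.

```javascript
// Sketch only: `blocks`, `lines`, `spans`, and `content` are assumed field
// names, not a contract documented in this README.
function collectLeafBlocks(block, pageIdx, out = []) {
  const children = Array.isArray(block.blocks) ? block.blocks : []
  if (children.length > 0) {
    // Container (e.g. a list or table): recurse into children instead of emitting it.
    for (const child of children) {
      collectLeafBlocks(child, pageIdx, out)
    }
    return out
  }
  // Leaf: join the span contents into a single text string.
  const text = (block.lines ?? [])
    .flatMap((line) => line.spans ?? [])
    .map((span) => span.content ?? '')
    .join('')
  out.push({ page_idx: pageIdx, bbox: block.bbox, text, type: block.type })
  return out
}
```

Emitting only leaves means a click on a list item highlights that item's box and not the whole list.
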
#### `matchMarkdownToPdf(markdown: string, blocks: PdfBlock[]): MdSection[]`

Splits the Markdown text into lines and matches each line to a PDF block using LCS (longest common subsequence) similarity.

```ts
interface MdSection {
  text: string
  page: number                                   // 1-based page number
  bbox: [number, number, number, number] | null
}
```

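For readers unfamiliar with LCS-based matching, the sketch below shows one way such a score could be computed. The normalization (LCS length divided by the longer string's length) and the helper names are assumptions for illustration; the README does not specify how the package scores or thresholds matches.

```javascript
// Length of the longest common subsequence of two strings
// (classic dynamic programming, keeping only two rows of the table).
function lcsLength(a, b) {
  let prev = new Array(b.length + 1).fill(0)
  for (let i = 1; i <= a.length; i++) {
    const curr = [0]
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1])
    }
    prev = curr
  }
  return prev[b.length]
}

// Hypothetical score in [0, 1]: LCS length over the longer string's length.
function lcsSimilarity(a, b) {
  const longest = Math.max(a.length, b.length)
  return longest === 0 ? 0 : lcsLength(a, b) / longest
}

// Pick the block whose text is most similar to the given Markdown line.
// Returns null when no block shares any character with the line.
function bestBlockFor(line, blocks) {
  let best = null
  let bestScore = 0
  for (const block of blocks) {
    const score = lcsSimilarity(line, block.text ?? '')
    if (score > bestScore) {
      best = block
      bestScore = score
    }
  }
  return best
}
```

LCS tolerates small insertions and deletions, which suits Markdown lines that differ from the extracted PDF text only by formatting characters.
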
#### `<mineru-layout-viewer>` Attributes

| Attribute  | Description                          |
|------------|--------------------------------------|
| `pdf`      | URL to the PDF file                  |
| `layout`   | URL to `layout.json` / `middle.json` |
| `markdown` | URL to `full.md` (optional)          |

#### `<mineru-layout-viewer>` Methods

| Method                                     | Description                    |
|--------------------------------------------|--------------------------------|
| `loadZip(blob: Blob): Promise<void>`       | Load from a MinerU export .zip |
| `loadLayoutFromJson(data: object\|string)` | Load layout JSON directly      |
| `loadMarkdown(text: string)`               | Load markdown text directly    |

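The methods above can be combined into a small drop handler. The sketch below is hypothetical glue code, not part of the package: `loadDroppedFiles` is an invented name, `viewer` is assumed to be a `<mineru-layout-viewer>` element, and `files` is assumed to be the `FileList` from a drop event. It handles only the zip and JSON cases because the table lists no method for loading a PDF directly.

```javascript
// Hypothetical helper: route dropped files to the viewer's documented methods.
// A File is a Blob, so it can be passed straight to loadZip().
async function loadDroppedFiles(viewer, files) {
  const list = Array.from(files)
  const zip = list.find((f) => f.name.toLowerCase().endsWith('.zip'))
  if (zip) {
    await viewer.loadZip(zip)
    return 'zip'
  }
  const layout = list.find((f) => f.name.toLowerCase().endsWith('.json'))
  if (layout) {
    // Blob.text() resolves to the file contents as a string,
    // which loadLayoutFromJson accepts per the table above.
    viewer.loadLayoutFromJson(await layout.text())
    return 'json'
  }
  return 'none'
}
```

In a page, this would typically be called from a `drop` event listener with `event.dataTransfer.files`, after calling `event.preventDefault()` in both the `dragover` and `drop` handlers.
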
### Supported JSON Formats

| Format              | Structure                            | Coordinate system |
|---------------------|--------------------------------------|-------------------|
| `layout.json`       | `{ pdf_info: [{ preproc_blocks }] }` | top-left origin   |
| `middle.json`       | `{ pdf_info: [{ preproc_blocks }] }` | top-left origin   |
| `content_list.json` | `[{ page_idx, bbox, text }]`         | 0–1000 range      |

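Because `content_list.json` uses a 0–1000 range while the other formats use page coordinates, boxes from it need rescaling before they can be drawn over a rendered page. The sketch below assumes that 0–1000 spans the full page width on the x axis and the full page height on the y axis; `scaleNormalizedBbox` is a hypothetical helper, and the package may handle this differently.

```javascript
// Assumption: content_list.json boxes are normalized so that 0..1000 covers
// the full page width (x) and the full page height (y).
function scaleNormalizedBbox(bbox, pageWidth, pageHeight) {
  const [x0, y0, x1, y1] = bbox
  return [
    (x0 / 1000) * pageWidth,
    (y0 / 1000) * pageHeight,
    (x1 / 1000) * pageWidth,
    (y1 / 1000) * pageHeight,
  ]
}
```

For example, on a page rendered at 600 by 800 units, the box `[0, 500, 1000, 1000]` covers the bottom half of the page.
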
### License

MIT