@editneo/pdf 0.1.0 → 0.1.1
- package/README.md +129 -0
- package/package.json +1 -1
package/README.md
ADDED
@@ -0,0 +1,129 @@
# @editneo/pdf

Client-side PDF-to-blocks extraction for EditNeo. This package reads a PDF file in the browser, analyzes its text content and layout, and converts it into an array of `NeoBlock` objects that the editor can render and edit.

All processing happens locally; no files are uploaded to any server.

## Installation

```bash
npm install @editneo/pdf @editneo/core
```

This package depends on [pdfjs-dist](https://github.com/mozilla/pdf.js) for PDF parsing.

## Usage

```typescript
import { extractBlocksFromPdf } from "@editneo/pdf";

// Example drop handler: get the PDF as an ArrayBuffer
// (a file input or fetch response works the same way)
async function handleDrop(event: DragEvent) {
  event.preventDefault();
  const file = event.dataTransfer?.files[0];
  if (!file) return [];

  const buffer = await file.arrayBuffer();
  const blocks = await extractBlocksFromPdf(buffer);
  // blocks: NeoBlock[]
  return blocks;
}
```

Each returned block has a type inferred from the PDF content (`paragraph`, `heading-1`, `heading-2`, or `image`) and contains the extracted text as spans.
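
As a rough illustration of consuming the result, the sketch below builds a simple outline from the heading blocks. The `BlockLike` interface is a local stand-in that assumes only the `type` and `content` fields described in this README, not the real `NeoBlock` type from `@editneo/core`, and `buildOutline` is a hypothetical helper rather than part of this package.

```typescript
// Local stand-in for NeoBlock (assumption: only the fields used here).
interface BlockLike {
  type: string;
  content: { text: string }[];
}

// Hypothetical helper: collect heading text, indenting heading-2 entries
// by two spaces so they sit visually under heading-1 entries.
function buildOutline(blocks: BlockLike[]): string[] {
  return blocks
    .filter((b) => b.type === "heading-1" || b.type === "heading-2")
    .map((b) => {
      const text = b.content.map((span) => span.text).join("");
      return b.type === "heading-1" ? text : `  ${text}`;
    });
}
```

The same filter-and-map pattern should work for any per-type processing, such as counting images or collecting paragraph text.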

### With the React drop zone

The `PDFDropZone` component from `@editneo/react` calls this function automatically when a user drops a PDF onto the editor. You only need this package directly if you want to handle PDF extraction yourself.

```tsx
import { PDFDropZone } from "@editneo/react";

<PDFDropZone>{/* editor content */}</PDFDropZone>;
```

## How Extraction Works

The extractor processes each page of the PDF in order and applies heuristics to convert raw PDF content into structured blocks.

### Text extraction

PDF files have no concept of "paragraphs" or "headings"; they store positioned text fragments with font metadata. The extractor groups these fragments into blocks using two signals:

1. **Vertical position:** When the vertical gap between consecutive text items exceeds 5 PDF units, the extractor treats them as separate blocks.

2. **Font size:** The extractor calculates the most common font size on each page (the "mode"). Text items significantly larger than the mode are classified as headings:
   - Greater than 2x the mode font size: `heading-1`
   - Greater than 1.5x the mode font size: `heading-2`
   - Everything else: `paragraph`
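
The two signals above can be sketched as follows. This is a minimal illustration, not the package's actual source: the `TextItem` shape, the function names, and the choice to measure the gap as the absolute difference in `y` are all assumptions. Only the thresholds (5 units, 2x, 1.5x) come from the description above.

```typescript
// Minimal text item: a string with a font size and a y-coordinate.
// (Assumption: the real extractor derives these from pdfjs-dist text items.)
interface TextItem {
  text: string;
  fontSize: number;
  y: number;
}

// Most common font size on the page (the "mode"); on a tie, the size
// that first reaches the highest count wins.
function modeFontSize(items: TextItem[]): number {
  const counts = new Map<number, number>();
  let mode = 0;
  let best = 0;
  for (const { fontSize } of items) {
    const n = (counts.get(fontSize) ?? 0) + 1;
    counts.set(fontSize, n);
    if (n > best) {
      best = n;
      mode = fontSize;
    }
  }
  return mode;
}

// Thresholds from the list above: >2x the mode is heading-1, >1.5x is heading-2.
function classify(
  fontSize: number,
  mode: number
): "heading-1" | "heading-2" | "paragraph" {
  if (fontSize > 2 * mode) return "heading-1";
  if (fontSize > 1.5 * mode) return "heading-2";
  return "paragraph";
}

// Start a new group whenever the vertical gap to the previous item exceeds maxGap.
function groupByVerticalGap(items: TextItem[], maxGap = 5): TextItem[][] {
  const groups: TextItem[][] = [];
  for (const item of items) {
    const current = groups[groups.length - 1];
    const prev = current ? current[current.length - 1] : undefined;
    if (current && prev && Math.abs(prev.y - item.y) <= maxGap) {
      current.push(item);
    } else {
      groups.push([item]);
    }
  }
  return groups;
}
```

Computing the mode per page rather than per document keeps a single large-print page from skewing the heading thresholds for every other page.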

### Image detection

The extractor scans each page's operator list for `paintImageXObject` operations, which indicate rendered images. When one is found, an `image` block is added to the output.

Note: Full image data extraction (decoding the pixel data from the PDF) requires additional processing of the operator list's argument arrays. The current implementation adds placeholder image blocks; proper image extraction can be implemented by processing the `page.getOperatorList()` results in more detail.
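
A minimal sketch of the operator-list scan described above, assuming the shape returned by pdfjs-dist's `page.getOperatorList()` (parallel `fnArray` and `argsArray` arrays). The operator code is passed in as a parameter so the example stays self-contained; with pdfjs-dist it would be `OPS.paintImageXObject`. `countImageOps` is a hypothetical helper, not part of this package.

```typescript
// Assumed shape of a pdfjs-dist operator list: parallel arrays of
// operator codes and their arguments.
interface OperatorListLike {
  fnArray: number[];
  argsArray: unknown[];
}

// Hypothetical helper: count how many times the given image-paint
// operator code appears in a page's operator list.
function countImageOps(ops: OperatorListLike, paintImageOp: number): number {
  let count = 0;
  for (const fn of ops.fnArray) {
    if (fn === paintImageOp) count++;
  }
  return count;
}
```

With pdfjs-dist this might be called as `countImageOps(await page.getOperatorList(), OPS.paintImageXObject)`. Whether the extractor emits one placeholder block per operation or one per page is not stated here, so treat the count only as an indicator that images are present.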

### Limitations

This is a heuristic-based extractor designed for common document layouts. It does not handle the following cases well:

- **Multi-column layouts:** text from different columns may be merged or ordered incorrectly
- **Tables:** table content is extracted as plain text, not as a structured table block
- **Footnotes and headers/footers:** these are extracted as regular text blocks
- **Scanned PDFs:** if the PDF contains only images (no text layer), the extractor produces only image blocks. OCR is not performed.
- **Complex font detection:** bold and italic detection based on font name analysis is not yet implemented; all text is returned as plain spans
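
Because a scanned PDF yields only image blocks, a caller can detect that case from the output and, for example, warn the user that no text was found. The sketch below is a hypothetical check written against a local `TypedBlock` stand-in, not an API of this package.

```typescript
// Local stand-in: only the block's type matters for this check.
interface TypedBlock {
  type: string;
}

// Hypothetical check: a non-empty result made up entirely of image blocks
// suggests the PDF had no text layer (for example, a scan).
function looksScanned(blocks: TypedBlock[]): boolean {
  return blocks.length > 0 && blocks.every((b) => b.type === "image");
}
```

An empty result is deliberately not reported as scanned, since it could equally come from a blank document.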

## API Reference

### `extractBlocksFromPdf(data)`

| Parameter | Type          | Description           |
| --------- | ------------- | --------------------- |
| `data`    | `ArrayBuffer` | The raw PDF file data |

**Returns:** `Promise<NeoBlock[]>`

An array of `NeoBlock` objects in reading order. Each block has:

- A generated UUID as its `id`
- A `type` inferred from the content (`paragraph`, `heading-1`, `heading-2`, or `image`)
- A `content` array with a single `Span` containing the extracted text
- `createdAt` and `updatedAt` timestamps set to the extraction time
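
To make the field list above concrete, here is a sketch of a block with just those fields. `ExtractedBlock` and `makeBlock` are hypothetical names; the real `NeoBlock` type lives in `@editneo/core` and may carry additional fields. The id is supplied by the caller (in a browser it could come from `crypto.randomUUID()`), which keeps the example free of environment-specific calls.

```typescript
// Block types this README says the extractor can produce.
type ExtractedType = "paragraph" | "heading-1" | "heading-2" | "image";

// Shape inferred from the field list above (assumption: not the full NeoBlock).
interface ExtractedBlock {
  id: string;
  type: ExtractedType;
  content: { text: string }[];
  createdAt: number;
  updatedAt: number;
}

// Hypothetical factory: one span holding the text, and both timestamps
// set to the same extraction time.
function makeBlock(
  id: string,
  type: ExtractedType,
  text: string,
  now: number = Date.now()
): ExtractedBlock {
  return { id, type, content: [{ text }], createdAt: now, updatedAt: now };
}
```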

### Example output

Given a PDF with a title, some body text, and an image, the function might return:

```typescript
[
  {
    id: "a1b2c3...",
    type: "heading-1",
    content: [{ text: "Introduction to EditNeo" }],
    props: {},
    children: [],
    parentId: null,
    createdAt: 1707840000000,
    updatedAt: 1707840000000,
  },
  {
    id: "d4e5f6...",
    type: "paragraph",
    content: [{ text: "EditNeo is a modular block-based editor..." }],
    props: {},
    children: [],
    parentId: null,
    createdAt: 1707840000000,
    updatedAt: 1707840000000,
  },
  {
    id: "g7h8i9...",
    type: "image",
    content: [{ text: "" }],
    props: { src: "placeholder-image-url" },
    children: [],
    parentId: null,
    createdAt: 1707840000000,
    updatedAt: 1707840000000,
  },
];
```
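
One way to sanity-check output like the above is to flatten it to Markdown. The sketch below is a hypothetical helper, not part of this package; `OutputBlock` is a local stand-in that assumes only the `type`, `content`, and `props.src` fields shown in the example.

```typescript
// Local stand-in for the fields of the example output used here.
interface OutputBlock {
  type: string;
  content: { text: string }[];
  props: { src?: string };
}

// Hypothetical helper: render each block as a Markdown line and
// separate blocks with a blank line.
function blocksToMarkdown(blocks: OutputBlock[]): string {
  return blocks
    .map((block) => {
      const text = block.content.map((span) => span.text).join("");
      switch (block.type) {
        case "heading-1":
          return `# ${text}`;
        case "heading-2":
          return `## ${text}`;
        case "image":
          return `![](${block.props.src ?? ""})`;
        default:
          return text;
      }
    })
    .join("\n\n");
}
```

Unknown block types fall through to plain text, so the helper keeps working if the extractor later adds new types.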

## License

MIT