@editneo/pdf 0.1.0 → 0.1.2
- package/README.md +150 -0
- package/dist/worker.d.ts +7 -0
- package/dist/worker.d.ts.map +1 -1
- package/dist/worker.js +225 -52
- package/dist/worker.js.map +1 -1
- package/package.json +9 -1
package/README.md
ADDED
|
@@ -0,0 +1,150 @@
|
|
|
1
|
+
# @editneo/pdf
|
|
2
|
+
|
|
3
|
+
Client-side PDF-to-blocks extraction for EditNeo. This package reads a PDF file in the browser, analyzes its text content and layout, and converts it into an array of `NeoBlock` objects that the editor can render and edit.
|
|
4
|
+
|
|
5
|
+
All processing happens locally — no files are uploaded to any server.
|
|
6
|
+
|
|
7
|
+
## Installation
|
|
8
|
+
|
|
9
|
+
```bash
|
|
10
|
+
npm install @editneo/pdf @editneo/core
|
|
11
|
+
```
|
|
12
|
+
|
|
13
|
+
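This package depends on [pdfjs-dist](https://github.com/nicolo-ribaudo/pdfjs-dist) for PDF parsing. Unless `GlobalWorkerOptions.workerSrc` is already set, the pdf.js worker script is loaded from the cdnjs CDN at the version matching the installed `pdfjs-dist`.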
|
|
14
|
+
|
|
15
|
+
## Usage
|
|
16
|
+
|
|
17
|
+
```typescript
|
|
18
|
+
import { extractBlocksFromPdf } from "@editneo/pdf";
|
|
19
|
+
|
|
20
|
+
// Get the PDF as an ArrayBuffer (from a file input, drag-and-drop, fetch, etc.)
|
|
21
|
+
const file = event.dataTransfer.files[0];
|
|
22
|
+
const buffer = await file.arrayBuffer();
|
|
23
|
+
|
|
24
|
+
const blocks = await extractBlocksFromPdf(buffer);
|
|
25
|
+
// blocks: NeoBlock[]
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
Each returned block has a type inferred from the PDF content (paragraphs, headings, images) and contains the extracted text as spans.
|
|
29
|
+
|
|
30
|
+
### With the React drop zone
|
|
31
|
+
|
|
32
|
+
The `PDFDropZone` component from `@editneo/react` calls this function automatically when a user drops a PDF onto the editor:
|
|
33
|
+
|
|
34
|
+
```tsx
|
|
35
|
+
import { PDFDropZone } from "@editneo/react";
|
|
36
|
+
|
|
37
|
+
<PDFDropZone>{/* editor content */}</PDFDropZone>;
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
### Error handling
|
|
41
|
+
|
|
42
|
+
The function provides clear error messages for common failures:
|
|
43
|
+
|
|
44
|
+
```typescript
|
|
45
|
+
try {
|
|
46
|
+
const blocks = await extractBlocksFromPdf(buffer);
|
|
47
|
+
} catch (err) {
|
|
48
|
+
// Possible errors:
|
|
49
|
+
// "[EditNeo PDF] Cannot extract blocks: empty or missing PDF data"
|
|
50
|
+
// "[EditNeo PDF] This PDF is password-protected. Please provide an unlocked file."
|
|
51
|
+
// "[EditNeo PDF] Failed to load PDF: <reason>"
|
|
52
|
+
console.error(err instanceof Error ? err.message : err);
|
|
53
|
+
}
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
Per-page errors are caught individually: a page that fails to extract is logged with `console.warn` and skipped, so one corrupted page won't prevent the rest from extracting.
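
To branch on the failure type in application code, one option is to inspect the message text. The sketch below is illustrative only: `describePdfError` and the `PdfFailure` names are hypothetical helpers, not part of this package, and matching on message strings assumes the wording listed above stays stable.

```typescript
// Hypothetical categories for the three documented failure messages.
type PdfFailure = "empty" | "password" | "load" | "unknown";

// Narrow an unknown thrown value and map it to a category by message text.
function describePdfError(err: unknown): { kind: PdfFailure; message: string } {
  const message = err instanceof Error ? err.message : String(err);
  // Anything without the package prefix is not one of the documented errors.
  if (message.indexOf("[EditNeo PDF]") !== 0) return { kind: "unknown", message };
  if (message.indexOf("empty or missing PDF data") !== -1) return { kind: "empty", message };
  if (message.indexOf("password-protected") !== -1) return { kind: "password", message };
  if (message.indexOf("Failed to load PDF") !== -1) return { kind: "load", message };
  return { kind: "unknown", message };
}
```

A caller could pass the caught value from the `catch` block above and show a different prompt for `"password"` than for `"load"`.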
|
|
57
|
+
|
|
58
|
+
## How Extraction Works
|
|
59
|
+
|
|
60
|
+
The extraction processes each page of the PDF in order and applies heuristics to convert raw PDF content into structured blocks.
|
|
61
|
+
|
|
62
|
+
### Text extraction
|
|
63
|
+
|
|
64
|
+
PDF files don't have a concept of "paragraphs" or "headings" — they store positioned text fragments with font metadata. The extractor groups these fragments into blocks using vertical position gaps (>5 PDF units between lines starts a new block).
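
The grouping rule can be sketched in isolation. This is a simplified illustration, not the package's code: the `Fragment` shape and `groupByVerticalGap` are hypothetical, and the vertical position is assumed to have been read from each text item beforehand.

```typescript
// Hypothetical minimal shape of a positioned text fragment.
interface Fragment {
  str: string;
  y: number; // vertical position in PDF units
}

// A vertical jump of more than `gap` units between consecutive fragments
// closes the current block and starts a new one.
function groupByVerticalGap(fragments: Fragment[], gap = 5): string[] {
  const blocks: string[] = [];
  let current = "";
  let lastY: number | null = null;

  for (const fragment of fragments) {
    if (lastY !== null && Math.abs(fragment.y - lastY) > gap) {
      if (current.trim()) blocks.push(current.trim());
      current = "";
    }
    current += fragment.str + " ";
    lastY = fragment.y;
  }
  // Flush whatever is left after the last fragment.
  if (current.trim()) blocks.push(current.trim());
  return blocks;
}

const grouped = groupByVerticalGap([
  { str: "Title", y: 700 },
  { str: "First", y: 680 },
  { str: "line", y: 680 },
  { str: "Second paragraph", y: 660 },
]);
// grouped is ["Title", "First line", "Second paragraph"]
```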
|
|
65
|
+
|
|
66
|
+
### Heading detection
|
|
67
|
+
|
|
68
|
+
Headings are classified using multiple signals:
|
|
69
|
+
|
|
70
|
+
- **Font size ratio** — text significantly larger than the page's most common font size
|
|
71
|
+
- \>1.8× → `heading-1`
|
|
72
|
+
- \>1.4× → `heading-2`
|
|
73
|
+
- \>1.15× (if bold or all-caps) → `heading-3`
|
|
74
|
+
- **Bold font name** — fonts with "Bold", "Heavy", or "Black" in the name
|
|
75
|
+
- **Line length** — only lines shorter than 120 characters are considered headings
|
|
76
|
+
- **All-caps** — all-uppercase text counts like bold for the >1.15× rule; separately, a bold line under 60 characters at normal size → `heading-3`
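
The signals above can be combined as in the following sketch. It mirrors the listed thresholds but uses a simplified, hypothetical signature (`classify` takes the bold flag directly instead of looking it up from font metadata), so treat it as an illustration of the rules rather than the package's internal API.

```typescript
type BlockType = "heading-1" | "heading-2" | "heading-3" | "paragraph";

// Most common floored font height on a page; defaults to 12 when there are no items.
function modeFontSize(heights: number[]): number {
  const counts: Record<string, number> = {};
  for (const h of heights) {
    const key = String(Math.floor(h));
    counts[key] = (counts[key] || 0) + 1;
  }
  let mode = 12;
  let max = 0;
  for (const size in counts) {
    if (counts[size] > max) {
      max = counts[size];
      mode = Number(size);
    }
  }
  return mode;
}

// Apply the size-ratio, line-length, bold, and all-caps rules in order.
function classify(text: string, height: number, mode: number, bold: boolean): BlockType {
  const trimmed = text.trim();
  const ratio = height / mode;
  const isShort = trimmed.length < 120;
  const isAllCaps =
    trimmed.length > 2 && trimmed === trimmed.toUpperCase() && /[A-Z]/.test(trimmed);

  if (ratio > 1.8 && isShort) return "heading-1";
  if (ratio > 1.4 && isShort) return "heading-2";
  if (ratio > 1.15 && isShort && (bold || isAllCaps)) return "heading-3";
  // Bold short line at normal size.
  if (bold && trimmed.length < 60) return "heading-3";
  return "paragraph";
}

const mode = modeFontSize([12, 12.4, 12, 24, 12]); // 12
const title = classify("Introduction", 24, mode, false); // "heading-1"
const caps = classify("SUMMARY", 14, mode, false); // "heading-3"
```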
|
|
77
|
+
|
|
78
|
+
### Image extraction
|
|
79
|
+
|
|
80
|
+
The extractor scans each page's operator list for `paintImageXObject` operations. When one is found, the raw pixel data is read and converted to a PNG data URI via canvas. Images that are 50 px or smaller in either dimension are treated as decorative and filtered out, and images whose pixel data is neither RGB nor RGBA (for example grayscale) are skipped. In SSR environments where `document` isn't available, a placeholder is used instead.
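
Canvas image buffers store four bytes per pixel (RGBA), so three-byte RGB data has to be expanded before it can be drawn. The helper below is a standalone sketch of that step; `rgbToRgba` is a hypothetical name and the function is not exported by this package.

```typescript
// Expand tightly packed RGB bytes (3 per pixel) into RGBA (4 per pixel)
// with full opacity, the layout a canvas ImageData buffer uses.
function rgbToRgba(src: Uint8Array, width: number, height: number): Uint8ClampedArray {
  if (src.length !== width * height * 3) {
    throw new Error("expected 3 bytes per pixel");
  }
  const dst = new Uint8ClampedArray(width * height * 4);
  for (let j = 0, k = 0; j < dst.length; j += 4, k += 3) {
    dst[j] = src[k]; // red
    dst[j + 1] = src[k + 1]; // green
    dst[j + 2] = src[k + 2]; // blue
    dst[j + 3] = 255; // alpha: fully opaque
  }
  return dst;
}

// Two pixels: one red, one blue.
const rgba = rgbToRgba(new Uint8Array([255, 0, 0, 0, 0, 255]), 2, 1);
// rgba holds [255, 0, 0, 255, 0, 0, 255, 255]
```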
|
|
81
|
+
|
|
82
|
+
### Limitations
|
|
83
|
+
|
|
84
|
+
- **Multi-column layouts** — text from different columns may be merged or ordered incorrectly
|
|
85
|
+
- **Tables** — table content is extracted as plain text, not as a structured table block
|
|
86
|
+
- **Footnotes and headers/footers** — these are extracted as regular text blocks
|
|
87
|
+
- **Scanned PDFs** — if the PDF contains only images (no text layer), the extractor produces only image blocks. OCR is not performed.
|
|
88
|
+
|
|
89
|
+
## API Reference
|
|
90
|
+
|
|
91
|
+
### `extractBlocksFromPdf(data)`
|
|
92
|
+
|
|
93
|
+
| Parameter | Type | Description |
|
|
94
|
+
| --------- | ------------- | --------------------- |
|
|
95
|
+
| `data` | `ArrayBuffer` | The raw PDF file data |
|
|
96
|
+
|
|
97
|
+
**Returns:** `Promise<NeoBlock[]>`
|
|
98
|
+
|
|
99
|
+
**Throws:** `Error` with descriptive message on failure (empty data, password-protected, corrupt file).
|
|
100
|
+
|
|
101
|
+
The returned array contains `NeoBlock` objects in page order, with each page's text blocks followed by that page's image blocks. Each block has:
|
|
102
|
+
|
|
103
|
+
- A generated UUID as its `id`
|
|
104
|
+
- A `type` inferred from the content (`paragraph`, `heading-1`, `heading-2`, `heading-3`, or `image`)
|
|
105
|
+
- A `content` array with a single `Span` containing the extracted text
|
|
106
|
+
- Image blocks include `props.src` (data URI), `props.width`, and `props.height`
|
|
107
|
+
- `createdAt` and `updatedAt` timestamps set to the extraction time
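
As a rough illustration of consuming the result, the sketch below builds a heading outline and collects image sources. `BlockLike` is a hypothetical, trimmed-down stand-in for `NeoBlock` limited to the fields listed above; real code would import the type from `@editneo/core` instead.

```typescript
// Hypothetical stand-in for NeoBlock, restricted to the documented fields.
interface BlockLike {
  id: string;
  type: string;
  content: { text: string }[];
  props: { src?: string; width?: number; height?: number };
}

// Text of every heading block, in the order the blocks were returned.
function toOutline(blocks: BlockLike[]): string[] {
  return blocks
    .filter((b) => b.type.indexOf("heading-") === 0)
    .map((b) => b.content.map((span) => span.text).join(""));
}

// The `src` of every image block that has one.
function imageSources(blocks: BlockLike[]): string[] {
  const sources: string[] = [];
  for (const b of blocks) {
    if (b.type === "image" && b.props.src) sources.push(b.props.src);
  }
  return sources;
}

const sampleBlocks: BlockLike[] = [
  { id: "a1", type: "heading-1", content: [{ text: "Introduction to EditNeo" }], props: {} },
  { id: "d4", type: "paragraph", content: [{ text: "EditNeo is a modular block-based editor..." }], props: {} },
  { id: "g7", type: "image", content: [{ text: "" }], props: { src: "data:image/png;base64,...", width: 640, height: 480 } },
];
const outline = toOutline(sampleBlocks); // ["Introduction to EditNeo"]
```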
|
|
108
|
+
|
|
109
|
+
### Example output
|
|
110
|
+
|
|
111
|
+
Given a PDF with a title, some body text, and an image, the function might return:
|
|
112
|
+
|
|
113
|
+
```typescript
|
|
114
|
+
[
|
|
115
|
+
{
|
|
116
|
+
id: "a1b2c3...",
|
|
117
|
+
type: "heading-1",
|
|
118
|
+
content: [{ text: "Introduction to EditNeo" }],
|
|
119
|
+
props: {},
|
|
120
|
+
children: [],
|
|
121
|
+
parentId: null,
|
|
122
|
+
createdAt: 1707840000000,
|
|
123
|
+
updatedAt: 1707840000000,
|
|
124
|
+
},
|
|
125
|
+
{
|
|
126
|
+
id: "d4e5f6...",
|
|
127
|
+
type: "paragraph",
|
|
128
|
+
content: [{ text: "EditNeo is a modular block-based editor..." }],
|
|
129
|
+
props: {},
|
|
130
|
+
children: [],
|
|
131
|
+
parentId: null,
|
|
132
|
+
createdAt: 1707840000000,
|
|
133
|
+
updatedAt: 1707840000000,
|
|
134
|
+
},
|
|
135
|
+
{
|
|
136
|
+
id: "g7h8i9...",
|
|
137
|
+
type: "image",
|
|
138
|
+
content: [{ text: "" }],
|
|
139
|
+
props: { src: "data:image/png;base64,...", width: 640, height: 480 },
|
|
140
|
+
children: [],
|
|
141
|
+
parentId: null,
|
|
142
|
+
createdAt: 1707840000000,
|
|
143
|
+
updatedAt: 1707840000000,
|
|
144
|
+
},
|
|
145
|
+
];
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
## License
|
|
149
|
+
|
|
150
|
+
MIT
|
package/dist/worker.d.ts
CHANGED
|
@@ -1,3 +1,10 @@
|
|
|
1
1
|
import { NeoBlock } from '@editneo/core';
|
|
2
|
+
/**
|
|
3
|
+
* Extract structured NeoBlocks from a PDF file.
|
|
4
|
+
*
|
|
5
|
+
* @param data - An ArrayBuffer of the PDF file
|
|
6
|
+
* @returns An array of NeoBlock objects
|
|
7
|
+
* @throws Error with descriptive message on failure
|
|
8
|
+
*/
|
|
2
9
|
export declare function extractBlocksFromPdf(data: ArrayBuffer): Promise<NeoBlock[]>;
|
|
3
10
|
//# sourceMappingURL=worker.d.ts.map
|
package/dist/worker.d.ts.map
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"worker.d.ts","sourceRoot":"","sources":["../src/worker.ts"],"names":[],"mappings":"AACA,OAAO,EAAE,QAAQ,EAAa,MAAM,eAAe,CAAC;
|
|
1
|
+
{"version":3,"file":"worker.d.ts","sourceRoot":"","sources":["../src/worker.ts"],"names":[],"mappings":"AACA,OAAO,EAAE,QAAQ,EAAa,MAAM,eAAe,CAAC;AA0KpD;;;;;;GAMG;AACH,wBAAsB,oBAAoB,CAAC,IAAI,EAAE,WAAW,GAAG,OAAO,CAAC,QAAQ,EAAE,CAAC,CA8FjF"}
|
package/dist/worker.js
CHANGED
|
@@ -1,67 +1,240 @@
|
|
|
1
1
|
import * as pdfjs from 'pdfjs-dist';
|
|
2
2
|
import { v4 as uuid } from 'uuid';
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
// Simplistic mode font size calculation for the page
|
|
15
|
-
const fontSizes = {};
|
|
16
|
-
for (const item of textContent.items) {
|
|
17
|
-
if ('height' in item) {
|
|
18
|
-
const height = Math.floor(item.height);
|
|
19
|
-
fontSizes[height] = (fontSizes[height] || 0) + 1;
|
|
20
|
-
}
|
|
3
|
+
/**
|
|
4
|
+
* Configure the pdf.js worker automatically.
|
|
5
|
+
* Uses the CDN-hosted worker matching the installed pdfjs-dist version.
|
|
6
|
+
* Falls back to disabling the worker if configuration fails.
|
|
7
|
+
*/
|
|
8
|
+
function ensureWorkerConfigured() {
|
|
9
|
+
if (!pdfjs.GlobalWorkerOptions.workerSrc) {
|
|
10
|
+
try {
|
|
11
|
+
const version = pdfjs.version;
|
|
12
|
+
pdfjs.GlobalWorkerOptions.workerSrc =
|
|
13
|
+
`https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${version}/pdf.worker.min.js`;
|
|
21
14
|
}
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
15
|
+
catch {
|
|
16
|
+
// Fallback: disable worker (runs on main thread, slower but works)
|
|
17
|
+
pdfjs.disableWorker = true;
|
|
18
|
+
}
|
|
19
|
+
}
|
|
20
|
+
}
|
|
21
|
+
/**
|
|
22
|
+
* (#39) Improved heading detection using multiple heuristics:
|
|
23
|
+
* - Font size relative to the page mode
|
|
24
|
+
* - Short line length (headings must be < 120 chars)
|
|
25
|
+
* - Bold font name patterns
|
|
26
|
+
* - All-caps detection
|
|
27
|
+
*/
|
|
28
|
+
function classifyBlockType(item, text, modeFontSize, fontStyles) {
|
|
29
|
+
const ratio = item.height / modeFontSize;
|
|
30
|
+
const trimmed = text.trim();
|
|
31
|
+
const isShortLine = trimmed.length < 120;
|
|
32
|
+
const fontInfo = item.fontName ? fontStyles.get(item.fontName) : undefined;
|
|
33
|
+
const isBoldFont = (fontInfo === null || fontInfo === void 0 ? void 0 : fontInfo.bold) || false;
|
|
34
|
+
const isAllCaps = trimmed.length > 2 && trimmed === trimmed.toUpperCase() && /[A-Z]/.test(trimmed);
|
|
35
|
+
// Large font → heading
|
|
36
|
+
if (ratio > 1.8 && isShortLine)
|
|
37
|
+
return 'heading-1';
|
|
38
|
+
if (ratio > 1.4 && isShortLine)
|
|
39
|
+
return 'heading-2';
|
|
40
|
+
if (ratio > 1.15 && isShortLine && (isBoldFont || isAllCaps))
|
|
41
|
+
return 'heading-3';
|
|
42
|
+
// Bold short line at normal size → heading-3
|
|
43
|
+
if (isBoldFont && isShortLine && trimmed.length < 60)
|
|
44
|
+
return 'heading-3';
|
|
45
|
+
return 'paragraph';
|
|
46
|
+
}
|
|
47
|
+
/**
|
|
48
|
+
* Extract font style information from the page's common objects.
|
|
49
|
+
* Detects bold fonts by checking the font name for common bold patterns.
|
|
50
|
+
*/
|
|
51
|
+
function extractFontStyles(page) {
|
|
52
|
+
var _a;
|
|
53
|
+
const styles = new Map();
|
|
54
|
+
try {
|
|
55
|
+
const commonObjs = page.commonObjs;
|
|
56
|
+
if (commonObjs && commonObjs._objs) {
|
|
57
|
+
for (const [key, entry] of Object.entries(commonObjs._objs)) {
|
|
58
|
+
if ((_a = entry === null || entry === void 0 ? void 0 : entry.data) === null || _a === void 0 ? void 0 : _a.name) {
|
|
59
|
+
const name = entry.data.name.toLowerCase();
|
|
60
|
+
styles.set(key, {
|
|
61
|
+
bold: name.includes('bold') || name.includes('heavy') || name.includes('black'),
|
|
62
|
+
});
|
|
63
|
+
}
|
|
28
64
|
}
|
|
29
65
|
}
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
66
|
+
}
|
|
67
|
+
catch {
|
|
68
|
+
// Font extraction is best-effort
|
|
69
|
+
}
|
|
70
|
+
return styles;
|
|
71
|
+
}
|
|
72
|
+
// ── Image extraction (#37) ────────────────────────────────────────────
|
|
73
|
+
/**
|
|
74
|
+
* (#37) Extract images from a PDF page's operator list.
|
|
75
|
+
* Converts image data to a data URI when possible.
|
|
76
|
+
*/
|
|
77
|
+
async function extractImages(page, opList) {
|
|
78
|
+
var _a;
|
|
79
|
+
const blocks = [];
|
|
80
|
+
const seenImages = new Set();
|
|
81
|
+
for (let i = 0; i < opList.fnArray.length; i++) {
|
|
82
|
+
if (opList.fnArray[i] === pdfjs.OPS.paintImageXObject) {
|
|
83
|
+
const imgName = (_a = opList.argsArray[i]) === null || _a === void 0 ? void 0 : _a[0];
|
|
84
|
+
if (!imgName || seenImages.has(imgName))
|
|
37
85
|
continue;
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
86
|
+
seenImages.add(imgName);
|
|
87
|
+
try {
|
|
88
|
+
const imgData = await new Promise((resolve, reject) => {
|
|
89
|
+
page.objs.get(imgName, (data) => {
|
|
90
|
+
if (data)
|
|
91
|
+
resolve(data);
|
|
92
|
+
else
|
|
93
|
+
reject(new Error('No image data'));
|
|
94
|
+
});
|
|
95
|
+
});
|
|
96
|
+
if (imgData && imgData.data && imgData.width && imgData.height) {
|
|
97
|
+
// Convert raw pixel data to a data URI via canvas
|
|
98
|
+
const canvas = typeof document !== 'undefined' ? document.createElement('canvas') : null;
|
|
99
|
+
if (canvas) {
|
|
100
|
+
canvas.width = imgData.width;
|
|
101
|
+
canvas.height = imgData.height;
|
|
102
|
+
const ctx = canvas.getContext('2d');
|
|
103
|
+
if (ctx) {
|
|
104
|
+
const imageData = ctx.createImageData(imgData.width, imgData.height);
|
|
105
|
+
// pdfjs image data can be RGB (3 bytes) or RGBA (4 bytes) per pixel
|
|
106
|
+
const src = imgData.data;
|
|
107
|
+
const dst = imageData.data;
|
|
108
|
+
const bytesPerPixel = src.length / (imgData.width * imgData.height);
|
|
109
|
+
if (bytesPerPixel === 4) {
|
|
110
|
+
dst.set(src);
|
|
111
|
+
}
|
|
112
|
+
else if (bytesPerPixel === 3) {
|
|
113
|
+
for (let j = 0, k = 0; j < dst.length; j += 4, k += 3) {
|
|
114
|
+
dst[j] = src[k];
|
|
115
|
+
dst[j + 1] = src[k + 1];
|
|
116
|
+
dst[j + 2] = src[k + 2];
|
|
117
|
+
dst[j + 3] = 255;
|
|
118
|
+
}
|
|
119
|
+
}
|
|
120
|
+
else {
|
|
121
|
+
// Grayscale or unsupported — skip
|
|
122
|
+
continue;
|
|
123
|
+
}
|
|
124
|
+
ctx.putImageData(imageData, 0, 0);
|
|
125
|
+
const dataUrl = canvas.toDataURL('image/png');
|
|
126
|
+
// Only include images of meaningful size (skip tiny decorative images)
|
|
127
|
+
if (imgData.width > 50 && imgData.height > 50) {
|
|
128
|
+
blocks.push(createBlock('image', '', {
|
|
129
|
+
src: dataUrl,
|
|
130
|
+
width: imgData.width,
|
|
131
|
+
height: imgData.height,
|
|
132
|
+
}));
|
|
133
|
+
}
|
|
134
|
+
}
|
|
135
|
+
}
|
|
136
|
+
else {
|
|
137
|
+
// No DOM (SSR) — insert placeholder
|
|
138
|
+
blocks.push(createBlock('image', '', {
|
|
139
|
+
src: 'placeholder-image-url',
|
|
140
|
+
width: imgData.width,
|
|
141
|
+
height: imgData.height,
|
|
142
|
+
}));
|
|
143
|
+
}
|
|
42
144
|
}
|
|
43
|
-
currentBlockText = '';
|
|
44
|
-
// Reset type based on new line height
|
|
45
|
-
if (item.height > modeFontSize * 2)
|
|
46
|
-
currentBlockType = 'heading-1';
|
|
47
|
-
else if (item.height > modeFontSize * 1.5)
|
|
48
|
-
currentBlockType = 'heading-2';
|
|
49
|
-
else
|
|
50
|
-
currentBlockType = 'paragraph';
|
|
51
145
|
}
|
|
52
|
-
|
|
53
|
-
|
|
146
|
+
catch {
|
|
147
|
+
// Image extraction failed for this image — skip
|
|
148
|
+
}
|
|
149
|
+
}
|
|
150
|
+
}
|
|
151
|
+
return blocks;
|
|
152
|
+
}
|
|
153
|
+
// ── Main extraction ───────────────────────────────────────────────────
|
|
154
|
+
/**
|
|
155
|
+
* Extract structured NeoBlocks from a PDF file.
|
|
156
|
+
*
|
|
157
|
+
* @param data - An ArrayBuffer of the PDF file
|
|
158
|
+
* @returns An array of NeoBlock objects
|
|
159
|
+
* @throws Error with descriptive message on failure
|
|
160
|
+
*/
|
|
161
|
+
export async function extractBlocksFromPdf(data) {
|
|
162
|
+
var _a;
|
|
163
|
+
if (!data || data.byteLength === 0) {
|
|
164
|
+
throw new Error('[EditNeo PDF] Cannot extract blocks: empty or missing PDF data');
|
|
165
|
+
}
|
|
166
|
+
ensureWorkerConfigured();
|
|
167
|
+
let pdf;
|
|
168
|
+
try {
|
|
169
|
+
const loadingTask = pdfjs.getDocument(data);
|
|
170
|
+
pdf = await loadingTask.promise;
|
|
171
|
+
}
|
|
172
|
+
catch (err) {
|
|
173
|
+
if ((_a = err === null || err === void 0 ? void 0 : err.message) === null || _a === void 0 ? void 0 : _a.includes('password')) {
|
|
174
|
+
throw new Error('[EditNeo PDF] This PDF is password-protected. Please provide an unlocked file.');
|
|
54
175
|
}
|
|
55
|
-
|
|
56
|
-
|
|
176
|
+
throw new Error(`[EditNeo PDF] Failed to load PDF: ${(err === null || err === void 0 ? void 0 : err.message) || 'Unknown error'}`);
|
|
177
|
+
}
|
|
178
|
+
const blocks = [];
|
|
179
|
+
for (let i = 1; i <= pdf.numPages; i++) {
|
|
180
|
+
try {
|
|
181
|
+
const page = await pdf.getPage(i);
|
|
182
|
+
const textContent = await page.getTextContent();
|
|
183
|
+
const opList = await page.getOperatorList();
|
|
184
|
+
// (#39) Extract font style info for heading heuristics
|
|
185
|
+
const fontStyles = extractFontStyles(page);
|
|
186
|
+
// Mode font-size calculation
|
|
187
|
+
const fontSizes = {};
|
|
188
|
+
for (const item of textContent.items) {
|
|
189
|
+
if ('height' in item) {
|
|
190
|
+
const height = Math.floor(item.height);
|
|
191
|
+
fontSizes[height] = (fontSizes[height] || 0) + 1;
|
|
192
|
+
}
|
|
193
|
+
}
|
|
194
|
+
let modeFontSize = 12;
|
|
195
|
+
let maxCount = 0;
|
|
196
|
+
for (const size in fontSizes) {
|
|
197
|
+
if (fontSizes[size] > maxCount) {
|
|
198
|
+
maxCount = fontSizes[size];
|
|
199
|
+
modeFontSize = Number(size);
|
|
200
|
+
}
|
|
201
|
+
}
|
|
202
|
+
// Text extraction with improved line grouping
|
|
203
|
+
let currentBlockText = '';
|
|
204
|
+
let currentBlockType = 'paragraph';
|
|
205
|
+
let currentItem = null;
|
|
206
|
+
let lastY = -1;
|
|
207
|
+
for (const item of textContent.items) {
|
|
208
|
+
if (!('str' in item))
|
|
209
|
+
continue;
|
|
210
|
+
if (lastY !== -1 && Math.abs(item.transform[5] - lastY) > 5) {
|
|
211
|
+
if (currentBlockText.trim()) {
|
|
212
|
+
blocks.push(createBlock(currentBlockType, currentBlockText));
|
|
213
|
+
}
|
|
214
|
+
currentBlockText = '';
|
|
215
|
+
// (#39) Use improved heading heuristics
|
|
216
|
+
currentBlockType = classifyBlockType(item, item.str, modeFontSize, fontStyles);
|
|
217
|
+
currentItem = item;
|
|
218
|
+
}
|
|
219
|
+
currentBlockText += item.str + ' ';
|
|
220
|
+
lastY = item.transform[5];
|
|
221
|
+
if (!currentItem)
|
|
222
|
+
currentItem = item;
|
|
223
|
+
}
|
|
224
|
+
if (currentBlockText.trim()) {
|
|
225
|
+
blocks.push(createBlock(currentBlockType, currentBlockText));
|
|
226
|
+
}
|
|
227
|
+
// (#37) Extract actual images
|
|
228
|
+
const imageBlocks = await extractImages(page, opList);
|
|
229
|
+
blocks.push(...imageBlocks);
|
|
57
230
|
}
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
// For this task, we will just add a placeholder if we detect image ops
|
|
61
|
-
if (opList.fnArray.includes(pdfjs.OPS.paintImageXObject)) {
|
|
62
|
-
blocks.push(createBlock('image', '', { src: 'placeholder-image-url' }));
|
|
231
|
+
catch (pageErr) {
|
|
232
|
+
console.warn(`[EditNeo PDF] Failed to extract page ${i}: ${pageErr === null || pageErr === void 0 ? void 0 : pageErr.message}`);
|
|
63
233
|
}
|
|
64
234
|
}
|
|
235
|
+
if (blocks.length === 0) {
|
|
236
|
+
console.warn('[EditNeo PDF] No content extracted from PDF');
|
|
237
|
+
}
|
|
65
238
|
return blocks;
|
|
66
239
|
}
|
|
67
240
|
function createBlock(type, text, props = {}) {
|
package/dist/worker.js.map
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"worker.js","sourceRoot":"","sources":["../src/worker.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,KAAK,MAAM,YAAY,CAAC;AAEpC,OAAO,EAAE,EAAE,IAAI,IAAI,EAAE,MAAM,MAAM,CAAC;AAElC,
|
|
1
|
+
{"version":3,"file":"worker.js","sourceRoot":"","sources":["../src/worker.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,KAAK,MAAM,YAAY,CAAC;AAEpC,OAAO,EAAE,EAAE,IAAI,IAAI,EAAE,MAAM,MAAM,CAAC;AAElC;;;;GAIG;AACH,SAAS,sBAAsB;IAC7B,IAAI,CAAC,KAAK,CAAC,mBAAmB,CAAC,SAAS,EAAE,CAAC;QACzC,IAAI,CAAC;YACH,MAAM,OAAO,GAAG,KAAK,CAAC,OAAO,CAAC;YAC9B,KAAK,CAAC,mBAAmB,CAAC,SAAS;gBACjC,iDAAiD,OAAO,oBAAoB,CAAC;QACjF,CAAC;QAAC,MAAM,CAAC;YACP,mEAAmE;YAClE,KAAa,CAAC,aAAa,GAAG,IAAI,CAAC;QACtC,CAAC;IACH,CAAC;AACH,CAAC;AAWD;;;;;;GAMG;AACH,SAAS,iBAAiB,CACxB,IAAc,EACd,IAAY,EACZ,YAAoB,EACpB,UAA0C;IAE1C,MAAM,KAAK,GAAG,IAAI,CAAC,MAAM,GAAG,YAAY,CAAC;IACzC,MAAM,OAAO,GAAG,IAAI,CAAC,IAAI,EAAE,CAAC;IAC5B,MAAM,WAAW,GAAG,OAAO,CAAC,MAAM,GAAG,GAAG,CAAC;IACzC,MAAM,QAAQ,GAAG,IAAI,CAAC,QAAQ,CAAC,CAAC,CAAC,UAAU,CAAC,GAAG,CAAC,IAAI,CAAC,QAAQ,CAAC,CAAC,CAAC,CAAC,SAAS,CAAC;IAC3E,MAAM,UAAU,GAAG,CAAA,QAAQ,aAAR,QAAQ,uBAAR,QAAQ,CAAE,IAAI,KAAI,KAAK,CAAC;IAC3C,MAAM,SAAS,GAAG,OAAO,CAAC,MAAM,GAAG,CAAC,IAAI,OAAO,KAAK,OAAO,CAAC,WAAW,EAAE,IAAI,OAAO,CAAC,IAAI,CAAC,OAAO,CAAC,CAAC;IAEnG,uBAAuB;IACvB,IAAI,KAAK,GAAG,GAAG,IAAI,WAAW;QAAE,OAAO,WAAW,CAAC;IACnD,IAAI,KAAK,GAAG,GAAG,IAAI,WAAW;QAAE,OAAO,WAAW,CAAC;IACnD,IAAI,KAAK,GAAG,IAAI,IAAI,WAAW,IAAI,CAAC,UAAU,IAAI,SAAS,CAAC;QAAE,OAAO,WAAW,CAAC;IAEjF,6CAA6C;IAC7C,IAAI,UAAU,IAAI,WAAW,IAAI,OAAO,CAAC,MAAM,GAAG,EAAE;QAAE,OAAO,WAAW,CAAC;IAEzE,OAAO,WAAW,CAAC;AACrB,CAAC;AAED;;;GAGG;AACH,SAAS,iBAAiB,CAAC,IAAS;;IAClC,MAAM,MAAM,GAAG,IAAI,GAAG,EAA6B,CAAC;IACpD,IAAI,CAAC;QACH,MAAM,UAAU,GAAG,IAAI,CAAC,UAAU,CAAC;QACnC,IAAI,UAAU,IAAI,UAAU,CAAC,KAAK,EAAE,CAAC;YACnC,KAAK,MAAM,CAAC,GAAG,EAAE,KAAK,CAAC,IAAI,MAAM,CAAC,OAAO,CAAC,UAAU,CAAC,KAA4B,CAAC,EAAE,CAAC;gBACnF,IAAI,MAAA,KAAK,aAAL,KAAK,uBAAL,KAAK,CAAE,IAAI,0CAAE,IAAI,EAAE,CAAC;oBACtB,MAAM,IAAI,GAAI,KAAK,CAAC,IAAI,CAAC,IAAe,CAAC,WAAW,EAAE,CAAC;oBACvD,MAAM,CAAC,GAAG,CAAC,GAAG,EAAE;wBACd,IAAI,EAAE,IAAI,CAAC,QAAQ,CAAC,MAAM,CAAC,IAAI,IAAI,CAAC,QAAQ,CAAC,OAAO,CAAC,IAAI,IAAI,CAAC,QAAQ,CAAC,OAAO,CAAC;qBAChF,CAAC,CAAC;gBACL,CAAC;YACH,CAAC;QACH,CAAC;IACH,CAAC;
IAAC,MAAM,CAAC;QACP,iCAAiC;IACnC,CAAC;IACD,OAAO,MAAM,CAAC;AAChB,CAAC;AAED,yEAAyE;AAEzE;;;GAGG;AACH,KAAK,UAAU,aAAa,CAAC,IAAS,EAAE,MAAW;;IACjD,MAAM,MAAM,GAAe,EAAE,CAAC;IAC9B,MAAM,UAAU,GAAG,IAAI,GAAG,EAAU,CAAC;IAErC,KAAK,IAAI,CAAC,GAAG,CAAC,EAAE,CAAC,GAAG,MAAM,CAAC,OAAO,CAAC,MAAM,EAAE,CAAC,EAAE,EAAE,CAAC;QAC/C,IAAI,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,KAAK,KAAK,CAAC,GAAG,CAAC,iBAAiB,EAAE,CAAC;YACtD,MAAM,OAAO,GAAG,MAAA,MAAM,CAAC,SAAS,CAAC,CAAC,CAAC,0CAAG,CAAC,CAAC,CAAC;YACzC,IAAI,CAAC,OAAO,IAAI,UAAU,CAAC,GAAG,CAAC,OAAO,CAAC;gBAAE,SAAS;YAClD,UAAU,CAAC,GAAG,CAAC,OAAO,CAAC,CAAC;YAExB,IAAI,CAAC;gBACH,MAAM,OAAO,GAAG,MAAM,IAAI,OAAO,CAAM,CAAC,OAAO,EAAE,MAAM,EAAE,EAAE;oBACzD,IAAI,CAAC,IAAI,CAAC,GAAG,CAAC,OAAO,EAAE,CAAC,IAAS,EAAE,EAAE;wBACnC,IAAI,IAAI;4BAAE,OAAO,CAAC,IAAI,CAAC,CAAC;;4BACnB,MAAM,CAAC,IAAI,KAAK,CAAC,eAAe,CAAC,CAAC,CAAC;oBAC1C,CAAC,CAAC,CAAC;gBACL,CAAC,CAAC,CAAC;gBAEH,IAAI,OAAO,IAAI,OAAO,CAAC,IAAI,IAAI,OAAO,CAAC,KAAK,IAAI,OAAO,CAAC,MAAM,EAAE,CAAC;oBAC/D,kDAAkD;oBAClD,MAAM,MAAM,GAAG,OAAO,QAAQ,KAAK,WAAW,CAAC,CAAC,CAAC,QAAQ,CAAC,aAAa,CAAC,QAAQ,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC;oBACzF,IAAI,MAAM,EAAE,CAAC;wBACX,MAAM,CAAC,KAAK,GAAG,OAAO,CAAC,KAAK,CAAC;wBAC7B,MAAM,CAAC,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC;wBAC/B,MAAM,GAAG,GAAG,MAAM,CAAC,UAAU,CAAC,IAAI,CAAC,CAAC;wBACpC,IAAI,GAAG,EAAE,CAAC;4BACR,MAAM,SAAS,GAAG,GAAG,CAAC,eAAe,CAAC,OAAO,CAAC,KAAK,EAAE,OAAO,CAAC,MAAM,CAAC,CAAC;4BAErE,oEAAoE;4BACpE,MAAM,GAAG,GAAG,OAAO,CAAC,IAAI,CAAC;4BACzB,MAAM,GAAG,GAAG,SAAS,CAAC,IAAI,CAAC;4BAC3B,MAAM,aAAa,GAAG,GAAG,CAAC,MAAM,GAAG,CAAC,OAAO,CAAC,KAAK,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC;4BAEpE,IAAI,aAAa,KAAK,CAAC,EAAE,CAAC;gCACxB,GAAG,CAAC,GAAG,CAAC,GAAG,CAAC,CAAC;4BACf,CAAC;iCAAM,IAAI,aAAa,KAAK,CAAC,EAAE,CAAC;gCAC/B,KAAK,IAAI,CAAC,GAAG,CAAC,EAAE,CAAC,GAAG,CAAC,EAAE,CAAC,GAAG,GAAG,CAAC,MAAM,EAAE,CAAC,IAAI,CAAC,EAAE,CAAC,IAAI,CAAC,EAAE,CAAC;oCACtD,GAAG,CAAC,CAAC,CAAC,GAAG,GAAG,CAAC,CAAC,CAAC,CAAC;oCAChB,GAAG,CAAC,CAAC,GAAG,CAAC,CAAC,GAAG,GAAG,CAAC,CAAC,GAAG,CAAC,CAAC,CAAC;oCACxB,GAAG,CAAC,CAAC,GAAG,CAAC,CAAC,GAAG,GAAG,
CAAC,CAAC,GAAG,CAAC,CAAC,CAAC;oCACxB,GAAG,CAAC,CAAC,GAAG,CAAC,CAAC,GAAG,GAAG,CAAC;gCACnB,CAAC;4BACH,CAAC;iCAAM,CAAC;gCACN,kCAAkC;gCAClC,SAAS;4BACX,CAAC;4BAED,GAAG,CAAC,YAAY,CAAC,SAAS,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC;4BAClC,MAAM,OAAO,GAAG,MAAM,CAAC,SAAS,CAAC,WAAW,CAAC,CAAC;4BAE9C,uEAAuE;4BACvE,IAAI,OAAO,CAAC,KAAK,GAAG,EAAE,IAAI,OAAO,CAAC,MAAM,GAAG,EAAE,EAAE,CAAC;gCAC9C,MAAM,CAAC,IAAI,CAAC,WAAW,CAAC,OAAO,EAAE,EAAE,EAAE;oCACnC,GAAG,EAAE,OAAO;oCACZ,KAAK,EAAE,OAAO,CAAC,KAAK;oCACpB,MAAM,EAAE,OAAO,CAAC,MAAM;iCACvB,CAAC,CAAC,CAAC;4BACN,CAAC;wBACH,CAAC;oBACH,CAAC;yBAAM,CAAC;wBACN,oCAAoC;wBACpC,MAAM,CAAC,IAAI,CAAC,WAAW,CAAC,OAAO,EAAE,EAAE,EAAE;4BACnC,GAAG,EAAE,uBAAuB;4BAC5B,KAAK,EAAE,OAAO,CAAC,KAAK;4BACpB,MAAM,EAAE,OAAO,CAAC,MAAM;yBACvB,CAAC,CAAC,CAAC;oBACN,CAAC;gBACH,CAAC;YACH,CAAC;YAAC,MAAM,CAAC;gBACP,gDAAgD;YAClD,CAAC;QACH,CAAC;IACH,CAAC;IAED,OAAO,MAAM,CAAC;AAChB,CAAC;AAED,yEAAyE;AAEzE;;;;;;GAMG;AACH,MAAM,CAAC,KAAK,UAAU,oBAAoB,CAAC,IAAiB;;IAC1D,IAAI,CAAC,IAAI,IAAI,IAAI,CAAC,UAAU,KAAK,CAAC,EAAE,CAAC;QACnC,MAAM,IAAI,KAAK,CAAC,gEAAgE,CAAC,CAAC;IACpF,CAAC;IAED,sBAAsB,EAAE,CAAC;IAEzB,IAAI,GAAG,CAAC;IACR,IAAI,CAAC;QACH,MAAM,WAAW,GAAG,KAAK,CAAC,WAAW,CAAC,IAAI,CAAC,CAAC;QAC5C,GAAG,GAAG,MAAM,WAAW,CAAC,OAAO,CAAC;IAClC,CAAC;IAAC,OAAO,GAAQ,EAAE,CAAC;QAClB,IAAI,MAAA,GAAG,aAAH,GAAG,uBAAH,GAAG,CAAE,OAAO,0CAAE,QAAQ,CAAC,UAAU,CAAC,EAAE,CAAC;YACvC,MAAM,IAAI,KAAK,CAAC,gFAAgF,CAAC,CAAC;QACpG,CAAC;QACD,MAAM,IAAI,KAAK,CAAC,qCAAqC,CAAA,GAAG,aAAH,GAAG,uBAAH,GAAG,CAAE,OAAO,KAAI,eAAe,EAAE,CAAC,CAAC;IAC1F,CAAC;IAED,MAAM,MAAM,GAAe,EAAE,CAAC;IAE9B,KAAK,IAAI,CAAC,GAAG,CAAC,EAAE,CAAC,IAAI,GAAG,CAAC,QAAQ,EAAE,CAAC,EAAE,EAAE,CAAC;QACvC,IAAI,CAAC;YACH,MAAM,IAAI,GAAG,MAAM,GAAG,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC;YAClC,MAAM,WAAW,GAAG,MAAM,IAAI,CAAC,cAAc,EAAE,CAAC;YAChD,MAAM,MAAM,GAAG,MAAM,IAAI,CAAC,eAAe,EAAE,CAAC;YAE5C,uDAAuD;YACvD,MAAM,UAAU,GAAG,iBAAiB,CAAC,IAAI,CAAC,CAAC;YAE3C,6BAA6B;YAC7B,MAAM,SAAS,GAA2B,EAAE,CAAC;YAC7C,KAAK,MAAM,IAAI,IAAI,WAAW,CAAC,KAAK,EAAE,CAAC;gBACrC,IAAI,QAAQ,IAAI,IAAI,EAAE,CAAC;oBACrB,MAAM,M
AAM,GAAG,IAAI,CAAC,KAAK,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC;oBACvC,SAAS,CAAC,MAAM,CAAC,GAAG,CAAC,SAAS,CAAC,MAAM,CAAC,IAAI,CAAC,CAAC,GAAG,CAAC,CAAC;gBACnD,CAAC;YACH,CAAC;YAED,IAAI,YAAY,GAAG,EAAE,CAAC;YACtB,IAAI,QAAQ,GAAG,CAAC,CAAC;YACjB,KAAK,MAAM,IAAI,IAAI,SAAS,EAAE,CAAC;gBAC7B,IAAI,SAAS,CAAC,IAAI,CAAC,GAAG,QAAQ,EAAE,CAAC;oBAC/B,QAAQ,GAAG,SAAS,CAAC,IAAI,CAAC,CAAC;oBAC3B,YAAY,GAAG,MAAM,CAAC,IAAI,CAAC,CAAC;gBAC9B,CAAC;YACH,CAAC;YAED,8CAA8C;YAC9C,IAAI,gBAAgB,GAAG,EAAE,CAAC;YAC1B,IAAI,gBAAgB,GAAc,WAAW,CAAC;YAC9C,IAAI,WAAW,GAAQ,IAAI,CAAC;YAC5B,IAAI,KAAK,GAAG,CAAC,CAAC,CAAC;YAEf,KAAK,MAAM,IAAI,IAAI,WAAW,CAAC,KAAK,EAAE,CAAC;gBACrC,IAAI,CAAC,CAAC,KAAK,IAAI,IAAI,CAAC;oBAAE,SAAS;gBAE/B,IAAI,KAAK,KAAK,CAAC,CAAC,IAAI,IAAI,CAAC,GAAG,CAAC,IAAI,CAAC,SAAS,CAAC,CAAC,CAAC,GAAG,KAAK,CAAC,GAAG,CAAC,EAAE,CAAC;oBAC5D,IAAI,gBAAgB,CAAC,IAAI,EAAE,EAAE,CAAC;wBAC5B,MAAM,CAAC,IAAI,CAAC,WAAW,CAAC,gBAAgB,EAAE,gBAAgB,CAAC,CAAC,CAAC;oBAC/D,CAAC;oBACD,gBAAgB,GAAG,EAAE,CAAC;oBACtB,wCAAwC;oBACxC,gBAAgB,GAAG,iBAAiB,CAClC,IAAgB,EAChB,IAAI,CAAC,GAAG,EACR,YAAY,EACZ,UAAU,CACX,CAAC;oBACF,WAAW,GAAG,IAAI,CAAC;gBACrB,CAAC;gBAED,gBAAgB,IAAI,IAAI,CAAC,GAAG,GAAG,GAAG,CAAC;gBACnC,KAAK,GAAG,IAAI,CAAC,SAAS,CAAC,CAAC,CAAC,CAAC;gBAC1B,IAAI,CAAC,WAAW;oBAAE,WAAW,GAAG,IAAI,CAAC;YACvC,CAAC;YAED,IAAI,gBAAgB,CAAC,IAAI,EAAE,EAAE,CAAC;gBAC5B,MAAM,CAAC,IAAI,CAAC,WAAW,CAAC,gBAAgB,EAAE,gBAAgB,CAAC,CAAC,CAAC;YAC/D,CAAC;YAED,8BAA8B;YAC9B,MAAM,WAAW,GAAG,MAAM,aAAa,CAAC,IAAI,EAAE,MAAM,CAAC,CAAC;YACtD,MAAM,CAAC,IAAI,CAAC,GAAG,WAAW,CAAC,CAAC;QAE9B,CAAC;QAAC,OAAO,OAAY,EAAE,CAAC;YACtB,OAAO,CAAC,IAAI,CAAC,wCAAwC,CAAC,KAAK,OAAO,aAAP,OAAO,uBAAP,OAAO,CAAE,OAAO,EAAE,CAAC,CAAC;QACjF,CAAC;IACH,CAAC;IAED,IAAI,MAAM,CAAC,MAAM,KAAK,CAAC,EAAE,CAAC;QACxB,OAAO,CAAC,IAAI,CAAC,6CAA6C,CAAC,CAAC;IAC9D,CAAC;IAED,OAAO,MAAM,CAAC;AAChB,CAAC;AAED,SAAS,WAAW,CAAC,IAAe,EAAE,IAAY,EAAE,QAA6B,EAAE;IACjF,OAAO;QACL,EAAE,EAAE,IAAI,EAAE;QACV,IAAI;QACJ,OAAO,EAAE,CAAC,EAAE,IAAI,EAAE,IAAI,CAAC,IAAI,EAAE,EAAE,CAAC;QAChC,KAAK;QACL,QAAQ,EAAE,EAAE;QACZ,QAAQ,EAAE,IAAI;QACd,SAAS,EAAE,
IAAI,CAAC,GAAG,EAAE;QACrB,SAAS,EAAE,IAAI,CAAC,GAAG,EAAE;KACtB,CAAC;AACJ,CAAC"}
|
package/package.json
CHANGED
|
@@ -1,9 +1,17 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@editneo/pdf",
|
|
3
|
-
"version": "0.1.
|
|
3
|
+
"version": "0.1.2",
|
|
4
4
|
"description": "PDF transmutation engine for EditNeo — converts PDFs into editable blocks",
|
|
5
|
+
"type": "module",
|
|
5
6
|
"main": "./dist/worker.js",
|
|
6
7
|
"types": "./dist/worker.d.ts",
|
|
8
|
+
"exports": {
|
|
9
|
+
".": {
|
|
10
|
+
"types": "./dist/worker.d.ts",
|
|
11
|
+
"import": "./dist/worker.js",
|
|
12
|
+
"default": "./dist/worker.js"
|
|
13
|
+
}
|
|
14
|
+
},
|
|
7
15
|
"files": ["dist"],
|
|
8
16
|
"scripts": {
|
|
9
17
|
"build": "tsc",
|