n8n-nodes-docuparse 0.1.0
- package/README.md +92 -0
- package/dist/nodes/DocuParse/DocuParse.node.js +607 -0
- package/package.json +56 -0
package/README.md
ADDED
@@ -0,0 +1,92 @@
# n8n-nodes-liteparse-local

Full-featured document parsing node for n8n. No native dependencies; works in any Docker container.

## Features

### Operations

| Operation | Description |
|-----------|-------------|
| **Parse Document** | Extract text, tables, and structure from documents |
| **OCR Image** | Extract text from images using Tesseract OCR |
| **Extract Tables** | Extract tables as structured JSON |
| **Convert to Markdown** | Convert document to clean markdown |
| **Merge PDFs** | Combine multiple PDF files into one |
| **Split PDF** | Split PDF into separate pages |
| **Extract Entities** | Extract emails, phone numbers, URLs, dates, IP addresses, and currency amounts from text |

### Supported Formats

| Format | Extensions | Parse | Tables | Markdown |
|--------|-----------|-------|--------|----------|
| PDF | `.pdf` | ✅ | ✅ | ✅ |
| Word | `.docx`, `.doc` | ✅ | ❌ | ✅ |
| Excel | `.xlsx`, `.xls` | ✅ | ✅ | ❌ |
| PowerPoint | `.pptx`, `.ppt` | ✅ | ❌ | ❌ |
| HTML | `.html`, `.htm` | ✅ | ❌ | ✅ |
| CSV | `.csv` | ✅ | ❌ | ❌ |
| Images | `.png`, `.jpg`, `.tiff`, `.bmp`, `.webp` | ✅ OCR | ❌ | ❌ |
| Text | `.txt`, `.md`, `.json`, `.xml` | ✅ | ❌ | ❌ |
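
The table above can be read as a lookup from file extension to capabilities. The sketch below transcribes it into a plain object so a workflow could check support before calling the node. The names `FORMAT_SUPPORT` and `capabilitiesFor` are hypothetical and are not part of the package.

```javascript
// Capability matrix transcribed from the Supported Formats table.
// FORMAT_SUPPORT and capabilitiesFor are illustrative names, not package exports.
const FORMAT_SUPPORT = {
  pdf:        { extensions: ['.pdf'],                                    parse: true, tables: true,  markdown: true,  ocr: false },
  word:       { extensions: ['.docx', '.doc'],                           parse: true, tables: false, markdown: true,  ocr: false },
  excel:      { extensions: ['.xlsx', '.xls'],                           parse: true, tables: true,  markdown: false, ocr: false },
  powerpoint: { extensions: ['.pptx', '.ppt'],                           parse: true, tables: false, markdown: false, ocr: false },
  html:       { extensions: ['.html', '.htm'],                           parse: true, tables: false, markdown: true,  ocr: false },
  csv:        { extensions: ['.csv'],                                    parse: true, tables: false, markdown: false, ocr: false },
  images:     { extensions: ['.png', '.jpg', '.tiff', '.bmp', '.webp'],  parse: true, tables: false, markdown: false, ocr: true  },
  text:       { extensions: ['.txt', '.md', '.json', '.xml'],            parse: true, tables: false, markdown: false, ocr: false },
};

// Returns the format name and capability flags for a file name,
// or null when the extension is not listed in the table.
function capabilitiesFor(fileName) {
  const dot = fileName.lastIndexOf('.');
  if (dot === -1) return null;
  const ext = fileName.slice(dot).toLowerCase();
  for (const [format, info] of Object.entries(FORMAT_SUPPORT)) {
    if (info.extensions.includes(ext)) {
      const { extensions, ...caps } = info; // drop the extension list from the result
      return { format, ...caps };
    }
  }
  return null;
}
```

Extensions that the table does not list, such as `.jpeg`, return `null` here; whether the node itself accepts them is not stated in this README.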

### OCR Languages

English, Arabic, Chinese (Simplified/Traditional), French, German, Hindi, Japanese, Korean, Portuguese, Russian, Spanish, Urdu

### Entity Extraction

- Email addresses
- Phone numbers
- URLs
- Dates
- IP addresses
- Currency amounts
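
Entity extraction of this kind is commonly done with regular expressions. The following is a minimal sketch for three of the listed entity types, assuming simple patterns chosen for illustration. The patterns the package actually uses are not shown in this README, and `extractEntities` is a hypothetical name.

```javascript
// Illustrative patterns only; they are assumptions, not the package's own.
const PATTERNS = {
  emails: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  urls: /https?:\/\/[^\s<>"')]+/g,
  // IPv4 only: four dot-separated octets in the range 0-255.
  ipAddresses: /\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b/g,
};

// Returns an object with one de-duplicated array of matches per entity type.
function extractEntities(text) {
  const result = {};
  for (const [name, pattern] of Object.entries(PATTERNS)) {
    // With a global regex, String.prototype.match returns all matches or null.
    const matches = text.match(pattern) || [];
    result[name] = [...new Set(matches)];
  }
  return result;
}
```

Phone numbers, dates, and currency amounts vary by locale far more than the three types above, so they would need patterns tuned to the documents being processed.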

## Installation

In n8n, go to **Settings → Community Nodes** and install:

```
n8n-nodes-liteparse-local
```

## Usage Examples

### Parse PDF to Markdown

1. **Read Binary File** → point to PDF
2. **DocuParse** → Operation: Parse Document, Output Format: Markdown
3. Output: `json.text` contains the markdown

### OCR a Scanned Document

1. **Read Binary File** → point to image/PDF
2. **DocuParse** → Operation: OCR Image, Language: English
3. Output: `json.text` contains extracted text

### Extract Tables from Excel

1. **Read Binary File** → point to XLSX
2. **DocuParse** → Operation: Extract Tables
3. Output: `json.tables` contains array of tables
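
A downstream step often needs one item per table row. The README states only that `json.tables` is an array of tables, so the sketch below assumes each table is an array of rows, each row is an array of cells, and the first row holds the column headers. That shape and both function names are assumptions; inspect the node's real output before relying on it.

```javascript
// ASSUMPTION: table = [[header cells], [row cells], ...]. Not confirmed by the README.
function tableToRecords(table) {
  const [header = [], ...rows] = table;
  return rows.map((row) =>
    Object.fromEntries(
      // Cells missing from a short row become null.
      header.map((name, i) => [String(name), row[i] === undefined ? null : row[i]])
    )
  );
}

// Shaped like the body of an n8n Code node: an array of items in, an array of items out.
function tablesToItems(items) {
  return items.flatMap((item) =>
    (item.json.tables || []).flatMap((table, tableIndex) =>
      tableToRecords(table).map((record) => ({ json: { tableIndex, ...record } }))
    )
  );
}
```

Keeping `tableIndex` on each record lets later nodes tell apart rows that came from different sheets or tables.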

### Merge Multiple PDFs

1. **Read Binary File** → first PDF (field: `data`)
2. **Read Binary File** → second PDF (field: `data1`)
3. **DocuParse** → Operation: Merge PDFs, Additional Fields: `data1`
4. Output: merged PDF in `binary.merged_pdf`
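
Before merging, the inputs have to be gathered from the item's binary fields in the intended order. The sketch below shows one way to do that, assuming n8n's in-memory binary layout (`item.binary[field].data` holding a base64 string) and assuming that Additional Fields is a comma-separated list of field names. `collectPdfBuffers` is a hypothetical helper, not a function from the package.

```javascript
// ASSUMPTIONS: binary data is stored as base64 under item.binary[field].data,
// and additionalFields is a comma-separated string such as "data1, data2".
function collectPdfBuffers(item, primaryField, additionalFields) {
  const names = [
    primaryField,
    ...additionalFields.split(',').map((name) => name.trim()).filter(Boolean),
  ];
  return names.map((name) => {
    const entry = item.binary && item.binary[name];
    if (!entry) {
      throw new Error(`Binary field "${name}" not found on item`);
    }
    // Decode to raw bytes; the order of the result is the merge order.
    return Buffer.from(entry.data, 'base64');
  });
}
```

With `pdf-lib`, which this README lists for merge and split, the resulting buffers could then be combined by creating a document with `PDFDocument.create()`, loading each buffer with `PDFDocument.load(bytes)`, copying pages with `copyPages(source, source.getPageIndices())`, adding them with `addPage(page)`, and calling `save()`. Whether the node does it exactly this way is not shown here.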

## Dependencies (All Pure JavaScript)

- `pdfjs-dist`: Mozilla's PDF.js for PDF parsing
- `tesseract.js`: OCR via WebAssembly
- `mammoth`: DOCX parsing
- `xlsx`: Excel parsing
- `cheerio`: HTML parsing
- `csv-parse`: CSV parsing
- `pdf-lib`: PDF manipulation (merge/split)

## License

MIT