n8n-nodes-docuparse 0.1.0

package/README.md ADDED
@@ -0,0 +1,92 @@
+ # n8n-nodes-docuparse
+
+ Full-featured document parsing node for n8n. It has no native dependencies, so it works in any Docker container.
+
+ ## Features
+
+ ### Operations
+
+ | Operation | Description |
+ |-----------|-------------|
+ | **Parse Document** | Extract text, tables, and structure from documents |
+ | **OCR Image** | Extract text from images using Tesseract OCR |
+ | **Extract Tables** | Extract tables as structured JSON |
+ | **Convert to Markdown** | Convert a document to clean Markdown |
+ | **Merge PDFs** | Combine multiple PDF files into one |
+ | **Split PDF** | Split a PDF into separate pages |
+ | **Extract Entities** | Extract emails, phone numbers, URLs, dates, IP addresses, and currency amounts from text |
+
+ ### Supported Formats
+
+ | Format | Extensions | Parse | Tables | Markdown |
+ |--------|-----------|-------|--------|----------|
+ | PDF | `.pdf` | ✅ | ✅ | ✅ |
+ | Word | `.docx`, `.doc` | ✅ | ❌ | ✅ |
+ | Excel | `.xlsx`, `.xls` | ✅ | ✅ | ❌ |
+ | PowerPoint | `.pptx`, `.ppt` | ✅ | ❌ | ❌ |
+ | HTML | `.html`, `.htm` | ✅ | ❌ | ✅ |
+ | CSV | `.csv` | ✅ | ❌ | ❌ |
+ | Images | `.png`, `.jpg`, `.tiff`, `.bmp`, `.webp` | ✅ OCR | ❌ | ❌ |
+ | Text | `.txt`, `.md`, `.json`, `.xml` | ✅ | ❌ | ❌ |
+
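The support matrix above can be written as a lookup table, which is handy for rejecting unsupported files before they reach the node. This is only a sketch transcribed from the table: the names `FormatSupport`, `SUPPORT`, `register`, and `supportFor` are hypothetical and are not part of the package.

```typescript
// Capability flags per extension, transcribed from the Supported Formats table.
// All identifiers here are illustrative; the package does not export them.
type FormatSupport = { parse: boolean; tables: boolean; markdown: boolean };

const SUPPORT: Record<string, FormatSupport> = {};

function register(extensions: string[], support: FormatSupport): void {
  for (const ext of extensions) SUPPORT[ext] = support;
}

register([".pdf"], { parse: true, tables: true, markdown: true });
register([".docx", ".doc"], { parse: true, tables: false, markdown: true });
register([".xlsx", ".xls"], { parse: true, tables: true, markdown: false });
register([".pptx", ".ppt"], { parse: true, tables: false, markdown: false });
register([".html", ".htm"], { parse: true, tables: false, markdown: true });
register([".csv"], { parse: true, tables: false, markdown: false });
// Images are parsed through OCR rather than direct text extraction.
register([".png", ".jpg", ".tiff", ".bmp", ".webp"], { parse: true, tables: false, markdown: false });
register([".txt", ".md", ".json", ".xml"], { parse: true, tables: false, markdown: false });

// Look up support by file name; returns undefined for extensions not in the table.
function supportFor(fileName: string): FormatSupport | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  return SUPPORT[fileName.slice(dot).toLowerCase()];
}
```

For example, `supportFor("sheet.xlsx")?.tables` is `true`, while `supportFor("archive.zip")` is `undefined`.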
+ ### OCR Languages
+
+ English, Arabic, Chinese (Simplified/Traditional), French, German, Hindi, Japanese, Korean, Portuguese, Russian, Spanish, Urdu
+
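Tesseract identifies languages by short codes such as `eng` or `chi_sim`. The sketch below maps the language names listed above to their standard Tesseract codes. Whether the node exposes these codes directly, or accepts more than one language per run, is an assumption; `OCR_LANGUAGE_CODES` and `toTesseractLangs` are hypothetical names.

```typescript
// Standard Tesseract language codes for the languages listed above.
const OCR_LANGUAGE_CODES: Record<string, string> = {
  "English": "eng",
  "Arabic": "ara",
  "Chinese (Simplified)": "chi_sim",
  "Chinese (Traditional)": "chi_tra",
  "French": "fra",
  "German": "deu",
  "Hindi": "hin",
  "Japanese": "jpn",
  "Korean": "kor",
  "Portuguese": "por",
  "Russian": "rus",
  "Spanish": "spa",
  "Urdu": "urd",
};

// Tesseract accepts several languages joined with "+", for example "eng+fra".
function toTesseractLangs(names: string[]): string {
  return names
    .map((name) => {
      const code = OCR_LANGUAGE_CODES[name];
      if (code === undefined) throw new Error(`Unsupported OCR language: ${name}`);
      return code;
    })
    .join("+");
}
```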
+ ### Entity Extraction
+
+ - Email addresses
+ - Phone numbers
+ - URLs
+ - Dates
+ - IP addresses
+ - Currency amounts
+
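Entity extraction of this kind is commonly done with regular expressions. The following is a minimal sketch for three of the entity types above (emails, URLs, and IPv4 addresses). It is not the package's implementation: the patterns and the names `Entities`, `unique`, and `extractEntities` are assumptions chosen for illustration, and real-world patterns for phone numbers, dates, and currency tend to be locale-dependent.

```typescript
type Entities = { emails: string[]; urls: string[]; ipv4: string[] };

// Deliberately simple patterns; they favor readability over full RFC coverage.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const URL_RE = /https?:\/\/[^\s<>"')]+/g;
const IPV4 = /\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b/g;

// Deduplicate matches while preserving first-seen order.
function unique(matches: RegExpMatchArray | null): string[] {
  return matches ? Array.from(new Set(matches)) : [];
}

function extractEntities(text: string): Entities {
  return {
    emails: unique(text.match(EMAIL)),
    urls: unique(text.match(URL_RE)),
    ipv4: unique(text.match(IPV4)),
  };
}
```

Returning deduplicated arrays keeps the output compact when the same address appears many times in a document, for example in repeated page footers.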
+ ## Installation
+
+ In n8n, go to **Settings → Community Nodes** and install:
+
+ ```
+ n8n-nodes-docuparse
+ ```
+
+ ## Usage Examples
+
+ ### Parse PDF to Markdown
+
+ 1. **Read Binary File** → point to PDF
+ 2. **DocuParse** → Operation: Parse Document, Output Format: Markdown
+ 3. Output: `json.text` contains the markdown
+
+ ### OCR a Scanned Document
+
+ 1. **Read Binary File** → point to image/PDF
+ 2. **DocuParse** → Operation: OCR Image, Language: English
+ 3. Output: `json.text` contains extracted text
+
+ ### Extract Tables from Excel
+
+ 1. **Read Binary File** → point to XLSX
+ 2. **DocuParse** → Operation: Extract Tables
+ 3. Output: `json.tables` contains array of tables
+
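A downstream step often needs each extracted table as a list of keyed records rather than raw rows. The README does not document the shape of the entries in `json.tables`, so the sketch below assumes each table is an array of rows of strings with the header in the first row. The `Table` type and `tableToRecords` function are hypothetical and should be adjusted to the node's real output.

```typescript
// Assumed shape: one table = rows of cell strings, header row first.
type Table = string[][];

// Turn a table into one record per data row, keyed by the header cells.
// Missing trailing cells are filled with an empty string.
function tableToRecords(table: Table): Record<string, string>[] {
  const [header, ...rows] = table;
  if (!header) return [];
  return rows.map((row) => {
    const record: Record<string, string> = {};
    header.forEach((name, i) => {
      record[name] = row[i] ?? "";
    });
    return record;
  });
}
```

In an n8n Code node, the same function could be applied to every entry of `json.tables` once the actual output shape has been confirmed.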
+ ### Merge Multiple PDFs
+
+ 1. **Read Binary File** → first PDF (field: `data`)
+ 2. **Read Binary File** → second PDF (field: `data1`)
+ 3. **DocuParse** → Operation: Merge PDFs, Additional Fields: `data1`
+ 4. Output: merged PDF in `binary.merged_pdf`
+
+ ## Dependencies (No Native Modules)
+
+ - `pdfjs-dist`: Mozilla's PDF.js for PDF parsing
+ - `tesseract.js`: OCR via WebAssembly
+ - `mammoth`: DOCX parsing
+ - `xlsx`: Excel parsing
+ - `cheerio`: HTML parsing
+ - `csv-parse`: CSV parsing
+ - `pdf-lib`: PDF manipulation (merge/split)
+
+ ## License
+
+ MIT