npm - @leolionart/n8n-nodes-pdf-extractor - Versions diffs - 1.0.0 - Mend

@leolionart/n8n-nodes-pdf-extractor 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

package/LICENSE +21 -0
package/README.md +127 -0
package/dist/nodes/PdfExtractor/PdfExtractor.node.d.ts +5 -0
package/dist/nodes/PdfExtractor/PdfExtractor.node.js +273 -0
package/dist/nodes/PdfExtractor/pdf-extractor.svg +11 -0
package/package.json +62 -0

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2024 NAAI Studio
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md ADDED

@@ -0,0 +1,127 @@
+# n8n-nodes-pdf-extractor
+This is an n8n community node that reliably extracts text from **password-protected PDFs** using the `qpdf` and `pdftotext` command-line tools.
+This node was created to solve the [known crashing issue](https://github.com/n8n-io/n8n/issues/23754) with the built-in "Extract from File" PDF node.
+[n8n](https://n8n.io/) is a [fair-code licensed](https://docs.n8n.io/reference/license/) workflow automation platform.
+## Features
+- ✅ **Extract text** from password-protected PDFs
+- ✅ **Decrypt PDFs** and return as binary for further processing
+- ✅ **No crashes** - uses battle-tested command-line tools instead of buggy JavaScript libraries
+- ✅ **Layout preservation** - maintains original text positioning
+- ✅ **Page range selection** - extract specific pages only
+- ✅ **Multiple encodings** - UTF-8, Latin1, ASCII7
+## Prerequisites
+Before using this node, you **must install** the required tools in your n8n container:
+```bash
+docker exec -u root n8n apk add --no-cache qpdf poppler-utils
+```
+For **persistent installation**, add this to your Docker Compose file:
+```yaml
+services:
+  n8n:
+    image: n8nio/n8n:latest
+    # ... other config
+    entrypoint: /bin/sh
+    command:
+      - -c
+      - |
+        apk add --no-cache qpdf poppler-utils
+        exec tini -- /docker-entrypoint.sh
+```
+## Installation
+### Via n8n UI (Recommended)
+1. Go to **Settings** → **Community Nodes**
+2. Click **Install**
+3. Enter: `@leolionart/n8n-nodes-pdf-extractor`
+4. Click **Install**
+### Via npm
+```bash
+cd ~/.n8n/nodes
+npm install @leolionart/n8n-nodes-pdf-extractor
+```
+## Operations
+### Extract Text
+Extracts text content from a PDF file.
+**Parameters:**
+- **Binary Property**: Name of the binary property containing the PDF (default: `data`)
+- **Password**: Password to decrypt the PDF (leave empty if not encrypted)
+**Options:**
+- **Layout Mode**: Maintain original text layout (default: true)
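+- **Page Range**: Extract a page range (e.g., "1-5") or start from a given page (e.g., "3"); comma-separated lists such as "1,3,5" are not parsed and fall back to all pages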
+- **Output Property**: JSON property name for extracted text (default: `text`)
+- **Encoding**: Text encoding (UTF-8, Latin1, ASCII7)
+### Decrypt Only
+Decrypts a password-protected PDF and returns it as a binary file for further processing.
+## Example Usage
+### Extract text from bank statement
+```
+[Gmail Trigger] → [PDF Extractor] → [AI/LLM] → [Google Sheets]
+```
+1. Gmail Trigger receives email with PDF attachment
+2. PDF Extractor extracts text with password
+3. AI extracts structured data
+4. Save to Google Sheets
+## Why This Node?
+The built-in n8n "Extract from File" node uses the `pdf-parse` JavaScript library, which:
+- ❌ Crashes n8n container with certain PDF encryption types
+- ❌ Causes "SIGILL" errors on Alpine Linux
+- ❌ Has memory issues with large PDFs
+This node uses:
+- ✅ **qpdf** - Industry-standard PDF manipulation tool
+- ✅ **pdftotext** (poppler-utils) - Robust text extraction from PDFs
+## Troubleshooting
+### "Required tools not found"
+Install the required tools:
+```bash
+docker exec -u root n8n apk add --no-cache qpdf poppler-utils
+```
+### "Invalid password for PDF file"
+Check that the password is correct. Some PDFs have both an owner password and a user password, so make sure you are supplying the right one.
+### Empty text output
+The PDF might be scanned/image-based. This node extracts text layers only. For scanned PDFs, use OCR tools.
+## Resources
+- [n8n community nodes documentation](https://docs.n8n.io/integrations/community-nodes/)
+- [GitHub Issue #23754 - PDF crash bug](https://github.com/n8n-io/n8n/issues/23754)
+## License
+[MIT](LICENSE)

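Review note on the Page Range option: `pdftotext` selects one contiguous range through its `-f` (first page) and `-l` (last page) flags. The sketch below shows one way to turn a page-range string into those flags while rejecting input it cannot represent, rather than ignoring it. The helper name `pageRangeToArgs` is hypothetical and not part of the package; a single number sets only `-f`, mirroring what the published code does.

```javascript
// Hypothetical helper, not part of the published package.
// Converts a page-range string into pdftotext arguments.
function pageRangeToArgs(pageRange) {
    const trimmed = String(pageRange ?? '').trim();
    if (trimmed === '') {
        return []; // empty input means "all pages"
    }
    const match = /^(\d+)(?:-(\d+))?$/.exec(trimmed);
    if (!match) {
        // e.g. "1,3,5": pdftotext cannot express a list of pages in one call
        throw new Error(`Unsupported page range "${trimmed}" (expected "N" or "N-M")`);
    }
    const first = Number(match[1]);
    if (first < 1) {
        throw new Error(`Page numbers start at 1, got "${trimmed}"`);
    }
    if (match[2] === undefined) {
        return ['-f', String(first)]; // from page N to the end
    }
    const last = Number(match[2]);
    if (last < first) {
        throw new Error(`Last page is before first page in "${trimmed}"`);
    }
    return ['-f', String(first), '-l', String(last)];
}
```

Returning an array of separate arguments also suits `child_process.execFile`, which takes its arguments as an array instead of a shell string.
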
package/dist/nodes/PdfExtractor/PdfExtractor.node.d.ts ADDED

@@ -0,0 +1,5 @@
+import { IExecuteFunctions, INodeExecutionData, INodeType, INodeTypeDescription } from 'n8n-workflow';
+export declare class PdfExtractor implements INodeType {
+    description: INodeTypeDescription;
+    execute(this: IExecuteFunctions): Promise<INodeExecutionData[][]>;
+}

package/dist/nodes/PdfExtractor/PdfExtractor.node.js ADDED

@@ -0,0 +1,273 @@
+"use strict";
+var __createBinding = (this && this.__createBinding) || (Object.create ? (function(o, m, k, k2) {
+    if (k2 === undefined) k2 = k;
+    var desc = Object.getOwnPropertyDescriptor(m, k);
+    if (!desc || ("get" in desc ? !m.__esModule : desc.writable || desc.configurable)) {
+      desc = { enumerable: true, get: function() { return m[k]; } };
+    }
+    Object.defineProperty(o, k2, desc);
+}) : (function(o, m, k, k2) {
+    if (k2 === undefined) k2 = k;
+    o[k2] = m[k];
+}));
+var __setModuleDefault = (this && this.__setModuleDefault) || (Object.create ? (function(o, v) {
+    Object.defineProperty(o, "default", { enumerable: true, value: v });
+}) : function(o, v) {
+    o["default"] = v;
+});
+var __importStar = (this && this.__importStar) || (function () {
+    var ownKeys = function(o) {
+        ownKeys = Object.getOwnPropertyNames || function (o) {
+            var ar = [];
+            for (var k in o) if (Object.prototype.hasOwnProperty.call(o, k)) ar[ar.length] = k;
+            return ar;
+        };
+        return ownKeys(o);
+    };
+    return function (mod) {
+        if (mod && mod.__esModule) return mod;
+        var result = {};
+        if (mod != null) for (var k = ownKeys(mod), i = 0; i < k.length; i++) if (k[i] !== "default") __createBinding(result, mod, k[i]);
+        __setModuleDefault(result, mod);
+        return result;
+    };
+})();
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.PdfExtractor = void 0;
+const n8n_workflow_1 = require("n8n-workflow");
+const child_process_1 = require("child_process");
+const util_1 = require("util");
+const fs = __importStar(require("fs"));
+const path = __importStar(require("path"));
+const os = __importStar(require("os"));
+const execAsync = (0, util_1.promisify)(child_process_1.exec);
+class PdfExtractor {
+    constructor() {
+        this.description = {
+            displayName: 'PDF Extractor',
+            name: 'pdfExtractor',
+            icon: 'file:pdf-extractor.svg',
+            group: ['transform'],
+            version: 1,
+            subtitle: '={{$parameter["operation"]}}',
+            description: 'Extract text from password-protected PDFs using qpdf and pdftotext. Requires qpdf and poppler-utils installed in the n8n container.',
+            defaults: {
+                name: 'PDF Extractor',
+            },
+            inputs: ['main'],
+            outputs: ['main'],
+            properties: [
+                {
+                    displayName: 'Operation',
+                    name: 'operation',
+                    type: 'options',
+                    noDataExpression: true,
+                    options: [
+                        {
+                            name: 'Extract Text',
+                            value: 'extractText',
+                            description: 'Extract text content from PDF',
+                            action: 'Extract text from PDF',
+                        },
+                        {
+                            name: 'Decrypt Only',
+                            value: 'decrypt',
+                            description: 'Decrypt PDF and return as binary',
+                            action: 'Decrypt PDF file',
+                        },
+                    ],
+                    default: 'extractText',
+                },
+                {
+                    displayName: 'Binary Property',
+                    name: 'binaryPropertyName',
+                    type: 'string',
+                    default: 'data',
+                    required: true,
+                    description: 'Name of the binary property containing the PDF file',
+                    placeholder: 'data',
+                },
+                {
+                    displayName: 'Password',
+                    name: 'password',
+                    type: 'string',
+                    typeOptions: {
+                        password: true,
+                    },
+                    default: '',
+                    description: 'Password to decrypt the PDF. Leave empty if the PDF is not encrypted.',
+                },
+                {
+                    displayName: 'Options',
+                    name: 'options',
+                    type: 'collection',
+                    placeholder: 'Add Option',
+                    default: {},
+                    options: [
+                        {
+                            displayName: 'Layout Mode',
+                            name: 'layout',
+                            type: 'boolean',
+                            default: true,
+                            description: 'Whether to maintain the original physical layout of the text',
+                        },
+                        {
+                            displayName: 'Page Range',
+                            name: 'pageRange',
+                            type: 'string',
+                            default: '',
+                            placeholder: '1-5',
+                            description: 'Extract a page range (e.g., "1-5") or start at a given page (e.g., "3"). Leave empty for all pages.',
+                        },
+                        {
+                            displayName: 'Output Property',
+                            name: 'outputProperty',
+                            type: 'string',
+                            default: 'text',
+                            description: 'Name of the JSON property to store extracted text',
+                        },
+                        {
+                            displayName: 'Encoding',
+                            name: 'encoding',
+                            type: 'options',
+                            options: [
+                                { name: 'UTF-8', value: 'UTF-8' },
+                                { name: 'Latin1', value: 'Latin1' },
+                                { name: 'ASCII7', value: 'ASCII7' },
+                            ],
+                            default: 'UTF-8',
+                            description: 'Text encoding for output',
+                        },
+                    ],
+                },
+            ],
+        };
+    }
+    async execute() {
+        const items = this.getInputData();
+        const returnData = [];
+        // Check if required tools are installed
+        try {
+            await execAsync('which qpdf');
+            await execAsync('which pdftotext');
+        }
+        catch {
+            throw new n8n_workflow_1.NodeOperationError(this.getNode(), 'Required tools not found. Please install qpdf and poppler-utils in your n8n container:\n' +
+                'docker exec -u root n8n apk add --no-cache qpdf poppler-utils');
+        }
+        for (let itemIndex = 0; itemIndex < items.length; itemIndex++) {
+            try {
+                const operation = this.getNodeParameter('operation', itemIndex);
+                const binaryPropertyName = this.getNodeParameter('binaryPropertyName', itemIndex);
+                const password = this.getNodeParameter('password', itemIndex);
+                const options = this.getNodeParameter('options', itemIndex, {});
+                // Validate binary data exists
+                const binaryData = this.helpers.assertBinaryData(itemIndex, binaryPropertyName);
+                const buffer = await this.helpers.getBinaryDataBuffer(itemIndex, binaryPropertyName);
+                // Create temp files with unique names
+                const tempDir = os.tmpdir();
+                const timestamp = Date.now();
+                const randomId = Math.random().toString(36).substring(7);
+                const inputPath = path.join(tempDir, `n8n_pdf_input_${timestamp}_${randomId}.pdf`);
+                const decryptedPath = path.join(tempDir, `n8n_pdf_decrypted_${timestamp}_${randomId}.pdf`);
+                // Write PDF to temp file
+                fs.writeFileSync(inputPath, buffer);
+                let pdfPath = inputPath;
+                try {
+                    // Decrypt if password provided
+                    if (password) {
+                        const qpdfArgs = ['--decrypt', `--password=${password}`, inputPath, decryptedPath]; // argument array: no shell, so no escaping needed
+                        try {
+                            await (0, util_1.promisify)(child_process_1.execFile)('qpdf', qpdfArgs);
+                            pdfPath = decryptedPath;
+                        }
+                        catch (error) {
+                            const errorMessage = error.message || String(error);
+                            if (errorMessage.includes('invalid password')) {
+                                throw new n8n_workflow_1.NodeOperationError(this.getNode(), 'Invalid password for PDF file', { itemIndex });
+                            }
+                            throw new n8n_workflow_1.NodeOperationError(this.getNode(), `Failed to decrypt PDF: ${errorMessage}`, { itemIndex });
+                        }
+                    }
+                    if (operation === 'extractText') {
+                        // Build pdftotext command
+                        const pdftotextArgs = [];
+                        if (options.layout !== false) {
+                            pdftotextArgs.push('-layout');
+                        }
+                        if (options.encoding) {
+                            pdftotextArgs.push('-enc', String(options.encoding));
+                        }
+                        if (options.pageRange) {
+                            const pageMatch = options.pageRange.match(/^(\d+)(?:-(\d+))?$/);
+                            if (pageMatch) {
+                                pdftotextArgs.push('-f', pageMatch[1]);
+                                if (pageMatch[2]) {
+                                    pdftotextArgs.push('-l', pageMatch[2]);
+                                }
+                            }
+                        }
+                        // execFile takes an argument array, so no option value is interpreted by a shell
+                        const { stdout, stderr } = await (0, util_1.promisify)(child_process_1.execFile)('pdftotext', [...pdftotextArgs, pdfPath, '-'], { maxBuffer: 50 * 1024 * 1024 });
+                        if (stderr && !stderr.includes('Syntax Warning')) {
+                            console.warn(`pdftotext warning: ${stderr}`);
+                        }
+                        const outputProperty = options.outputProperty || 'text';
+                        returnData.push({
+                            json: {
+                                [outputProperty]: stdout,
+                                fileName: binaryData.fileName,
+                                mimeType: binaryData.mimeType,
+                                fileSize: buffer.length,
+                                encrypted: !!password,
+                            },
+                            pairedItem: { item: itemIndex },
+                        });
+                    }
+                    else if (operation === 'decrypt') {
+                        // Read decrypted PDF and return as binary
+                        const decryptedBuffer = fs.readFileSync(pdfPath);
+                        const newBinaryData = await this.helpers.prepareBinaryData(decryptedBuffer, binaryData.fileName?.replace(/\.pdf$/i, '_decrypted.pdf') || 'decrypted.pdf', 'application/pdf');
+                        returnData.push({
+                            json: {
+                                fileName: binaryData.fileName,
+                                decrypted: true,
+                            },
+                            binary: {
+                                [binaryPropertyName]: newBinaryData,
+                            },
+                            pairedItem: { item: itemIndex },
+                        });
+                    }
+                }
+                finally {
+                    // Cleanup temp files
+                    try {
+                        if (fs.existsSync(inputPath))
+                            fs.unlinkSync(inputPath);
+                        if (fs.existsSync(decryptedPath))
+                            fs.unlinkSync(decryptedPath);
+                    }
+                    catch {
+                        // Ignore cleanup errors
+                    }
+                }
+            }
+            catch (error) {
+                if (this.continueOnFail()) {
+                    returnData.push({
+                        json: {
+                            error: error.message,
+                            success: false,
+                        },
+                        pairedItem: { item: itemIndex },
+                    });
+                    continue;
+                }
+                throw error;
+            }
+        }
+        return [returnData];
+    }
+}
+exports.PdfExtractor = PdfExtractor;
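
Review note on the decrypt step: the password is user-supplied, so it is safest to keep it out of any shell command string. A minimal sketch, assuming the call is made through `child_process.execFile` (which spawns the program directly, with no shell), is to build an argument array. The helper name `buildQpdfDecryptArgs` is hypothetical and not part of the package.

```javascript
// Hypothetical helper, not part of the published package.
// Builds the argv for `qpdf --decrypt`; each array element reaches qpdf
// verbatim, so quotes, `$`, or backticks in the password need no escaping.
function buildQpdfDecryptArgs(password, inputPath, outputPath) {
    return ['--decrypt', `--password=${password}`, inputPath, outputPath];
}

// Usage sketch (needs qpdf on PATH, so it is left commented out here):
// const { execFile } = require('child_process');
// const { promisify } = require('util');
// await promisify(execFile)('qpdf', buildQpdfDecryptArgs(pw, 'in.pdf', 'out.pdf'));
```

The password still appears in the process argument list while qpdf runs, so this addresses shell interpretation only.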

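Review note on temp files: the node writes the PDF and its decrypted copy into the shared temp directory under names derived from `Date.now()` and `Math.random()`. One alternative, sketched below under the assumption that a per-item working directory is acceptable, is `fs.mkdtempSync`, which creates a uniquely named directory that can be removed in one call. The helper name `createPdfWorkDir` and the file names inside it are hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper, not part of the published package.
// Creates a unique working directory and returns the file paths to use
// inside it, plus a cleanup function that removes the whole directory.
function createPdfWorkDir() {
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'n8n-pdf-'));
    return {
        inputPath: path.join(dir, 'input.pdf'),
        decryptedPath: path.join(dir, 'decrypted.pdf'),
        // force: true makes cleanup a no-op if the directory is already gone
        cleanup: () => fs.rmSync(dir, { recursive: true, force: true }),
    };
}
```

Calling `cleanup()` from a `finally` block would replace the two separate `existsSync`/`unlinkSync` checks.
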
package/dist/nodes/PdfExtractor/pdf-extractor.svg ADDED

@@ -0,0 +1,11 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 64 64" fill="none">
+  <rect width="64" height="64" rx="8" fill="#E53935"/>
+  <path d="M20 12h16l12 12v28a4 4 0 01-4 4H20a4 4 0 01-4-4V16a4 4 0 014-4z" fill="white"/>
+  <path d="M36 12v12h12" fill="none" stroke="#E53935" stroke-width="2"/>
+  <path d="M24 32h16M24 38h12M24 44h8" stroke="#E53935" stroke-width="2" stroke-linecap="round"/>
+  <circle cx="48" cy="48" r="12" fill="#4CAF50"/>
+  <path d="M44 48h8M48 44v8" stroke="white" stroke-width="2" stroke-linecap="round"/>
+  <rect x="42" y="26" width="8" height="10" rx="1" fill="#FFC107"/>
+  <circle cx="46" cy="36" r="2" fill="#795548"/>
+  <path d="M46 38v4" stroke="#795548" stroke-width="1.5"/>
+</svg>

package/package.json ADDED

@@ -0,0 +1,62 @@
+{
+  "name": "@leolionart/n8n-nodes-pdf-extractor",
+  "version": "1.0.0",
+  "description": "n8n community node to extract text from password-protected PDFs using qpdf and pdftotext",
+  "keywords": [
+    "n8n-community-node-package",
+    "n8n",
+    "pdf",
+    "extract",
+    "password",
+    "decrypt",
+    "pdftotext",
+    "qpdf"
+  ],
+  "license": "MIT",
+  "homepage": "https://github.com/pntai/n8n-nodes-pdf-extractor",
+  "author": {
+    "name": "NAAI Studio",
+    "email": "art.leolion@gmail.com"
+  },
+  "repository": {
+    "type": "git",
+    "url": "https://github.com/pntai/n8n-nodes-pdf-extractor.git"
+  },
+  "engines": {
+    "node": ">=18.0.0"
+  },
+  "main": "dist/index.js",
+  "types": "dist/index.d.ts",
+  "files": [
+    "dist"
+  ],
+  "scripts": {
+    "build": "tsc && gulp build:icons",
+    "dev": "tsc --watch",
+    "format": "prettier --write .",
+    "lint": "eslint .",
+    "lintfix": "eslint . --fix",
+    "prepublishOnly": "npm run build"
+  },
+  "n8n": {
+    "n8nNodesApiVersion": 1,
+    "nodes": [
+      "dist/nodes/PdfExtractor/PdfExtractor.node.js"
+    ],
+    "credentials": []
+  },
+  "devDependencies": {
+    "@types/node": "^20.10.0",
+    "@typescript-eslint/eslint-plugin": "^6.0.0",
+    "@typescript-eslint/parser": "^6.0.0",
+    "eslint": "^8.56.0",
+    "eslint-plugin-n8n-nodes-base": "^1.16.0",
+    "gulp": "^4.0.2",
+    "n8n-workflow": "*",
+    "prettier": "^3.1.0",
+    "typescript": "^5.3.0"
+  },
+  "peerDependencies": {
+    "n8n-workflow": "*"
+  }
+}