npm - @arela/uploader - Versions diffs - 0.2.0 → 0.2.2 - Mend

@arela/uploader 0.2.0 → 0.2.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

package/README.md +97 -7
package/commands.md +6 -0
package/package.json +1 -1
package/src/document-type-shared.js +22 -8
package/src/document-types/pedimento-simplificado.js +11 -29
package/src/file-detection.js +44 -29
package/src/index.js +821 -225

package/README.md CHANGED

@@ -2,6 +2,71 @@
 CLI tool to upload files and directories to Arela API or Supabase Storage with automatic file processing, detection, and organization.
+## 🚀 OPTIMIZED 4-PHASE WORKFLOW
+**New in v0.2.0**: The tool now supports an optimized 4-phase workflow designed for maximum performance when processing large file collections:
+### Phase 1: Filesystem Stats Collection 📊
+```bash
+arela --stats-only
+```
+- ⚡ **ULTRA FAST**: Only reads filesystem metadata (no file content)
+- 📈 **Bulk database operations**: Processes 1000+ files per batch
+- 🔄 **Upsert optimization**: Handles duplicates efficiently
+- 💾 **Minimal memory usage**: No file content loading
+### Phase 2: PDF Detection 🔍
+```bash
+arela --detect-pdfs
+```
+- 🎯 **Targeted processing**: Only processes PDF files from database
+- **Pedimento-simplificado detection**: Extracts RFC, pedimento numbers, and metadata
+- 🔄 **Batched processing**: Handles large datasets efficiently
+- 📊 **Progress tracking**: Real-time detection statistics
+### Phase 3: Path Propagation 📁
+```bash
+arela --propagate-arela-path
+```
+- 🎯 **Smart path copying**: Propagates arela_path from pedimento documents to related files
+- 📦 **Batch updates**: Processes files in groups for optimal database performance
+- 🔗 **Relationship mapping**: Links supporting documents to their pedimento
+### Phase 4: RFC-based Upload 🚀
+```bash
+arela --upload-by-rfc
+```
+- 🎯 **Targeted uploads**: Only uploads files for specified RFCs
+- 📋 **Supporting documents**: Includes all related files, not just pedimentos
+- 🏗️ **Structure preservation**: Maintains proper folder hierarchy
+### Combined Workflow 🎯
+```bash
+# Run all 4 phases in sequence (recommended)
+arela --run-all-phases
+# Or run phases individually for more control
+arela --stats-only           # Phase 1: Collect filesystem stats
+arela --detect-pdfs          # Phase 2: Detect pedimento documents
+arela --propagate-arela-path # Phase 3: Propagate paths to related files
+arela --upload-by-rfc        # Phase 4: Upload by RFC
+```
+### Performance Benefits
+**Before optimization** (single phase with detection):
+- 🐌 Read every file for detection
+- 💾 High memory usage
+- 🔄 Slow database operations
+- ❌ Process unsupported files
+**After optimization** (4-phase approach):
+- ⚡ **10x faster**: Phase 1 only reads filesystem metadata
+- 📊 **Bulk operations**: Database inserts up to 1000 records per batch
+- 🎯 **Targeted processing**: Phase 2 only processes PDFs needing detection
+- 💾 **Memory efficient**: No unnecessary file content loading
+- 🔄 **Optimized I/O**: Separates filesystem, database, and network operations
 ## Features
 - 📁 Upload entire directories or individual files
@@ -18,6 +83,7 @@ CLI tool to upload files and directories to Arela API or Supabase Storage with a
 - 🔧 **Performance optimizations with caching**
 - 📋 **Upload files by specific RFC values**
 - 🔍 **Propagate arela_path from pedimento documents to related files**
+- ⚡ **4-Phase optimized workflow for maximum performance**
 ## Installation
@@ -27,7 +93,22 @@ npm install -g @arela/uploader
 ## Usage
-### Basic Upload with Auto-Processing (API Mode)
+### 🚀 Optimized 4-Phase Workflow (Recommended)
+```bash
+# Run all phases automatically (most efficient)
+arela --run-all-phases --batch-size 20
+# Or run phases individually for fine-grained control
+arela --stats-only                    # Phase 1: Filesystem stats only
+arela --detect-pdfs --batch-size 10   # Phase 2: PDF detection
+arela --propagate-arela-path          # Phase 3: Path propagation
+arela --upload-by-rfc --batch-size 5  # Phase 4: RFC-based upload
+```
+### Traditional Single-Phase Upload (Legacy)
+#### Basic Upload with Auto-Processing (API Mode)
 ```bash
 arela --batch-size 10 -c 5
 ```
@@ -88,10 +169,21 @@ arela --client-path "/client/documents" --batch-size 10 -c 5
 ### Options
-- `-p, --prefix <prefix>`: Prefix path in bucket (default: "")
-- `-b, --bucket <bucket>`: Bucket name override
+#### Phase Control
+- `--stats-only`: **Phase 1** - Only collect filesystem stats (no file reading)
+- `--detect-pdfs`: **Phase 2** - Process PDF files for pedimento-simplificado detection
+- `--propagate-arela-path`: **Phase 3** - Propagate arela_path from pedimento records to related files
+- `--upload-by-rfc`: **Phase 4** - Upload files based on RFC values from UPLOAD_RFCS
+- `--run-all-phases`: **All Phases** - Run complete optimized workflow
+#### Performance & Configuration
 - `-c, --concurrency <number>`: Files per batch for processing (default: 10)
 - `--batch-size <number>`: API batch size (default: 10)
+- `--show-stats`: Show detailed processing statistics
+#### Upload Configuration
+- `-p, --prefix <prefix>`: Prefix path in bucket (default: "")
+- `-b, --bucket <bucket>`: Bucket name override
 - `--force-supabase`: Force direct Supabase upload (skip API)
 - `--no-auto-detect`: Disable automatic file detection (API mode only)
 - `--no-auto-organize`: Disable automatic file organization (API mode only)
@@ -99,11 +191,9 @@ arela --client-path "/client/documents" --batch-size 10 -c 5
 - `--folder-structure <structure>`: **Custom folder structure** (e.g., "2024/4023260" or "cliente1/pedimentos")
 - `--auto-detect-structure`: **Automatically detect year/pedimento from file paths**
 - `--client-path <path>`: Client path for metadata tracking
-- `--stats-only`: Only read file stats and insert to uploader table, skip file upload
+#### Legacy Options
 - `--no-detect`: Disable document type detection in stats-only mode
-- `--propagate-arela-path`: Propagate arela_path from pedimento_simplificado records to related files
-- `--upload-by-rfc`: Upload files to Arela API based on RFC values from UPLOAD_RFCS environment variable
-- `--show-stats`: Show detailed processing statistics
 - `-v, --version`: Display version number
 - `-h, --help`: Display help information
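
The README changes above describe batched processing (`--batch-size`, bulk inserts of up to 1000 records per batch) without showing what a batch loop looks like. The following is a minimal sketch of that idea, not the package's implementation: `chunk`, `processInBatches`, and `handleBatch` are hypothetical names introduced here for illustration, and the real logic in `src/index.js` may differ.

```javascript
// Hypothetical helper: split a list of records into fixed-size batches.
function chunk(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical driver: pass each batch to a caller-supplied async handler,
// one batch at a time, and return how many records were handled.
async function processInBatches(items, batchSize, handleBatch) {
  let processed = 0;
  for (const batch of chunk(items, batchSize)) {
    await handleBatch(batch); // e.g. one bulk upsert per batch (assumption)
    processed += batch.length;
  }
  return processed;
}
```

Processing one batch at a time keeps memory bounded by the batch size, which is consistent with the "minimal memory usage" goal the README states for Phase 1.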

package/commands.md ADDED

@@ -0,0 +1,6 @@
+node src/index.js --stats-only
+node src/index.js --detect-pdfs
+node src/index.js --propagate-arela-path
+node src/index.js --upload-by-rfc --folder-structure palco
+UPLOAD_RFCS="RFC1|RFC2" node src/index.js --upload-by-rfc --folder-structure target-folder
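
The last command above passes `UPLOAD_RFCS` as a pipe-separated string. How the CLI parses that value is not visible in this part of the diff, so the following is only a sketch of one plausible approach: `parseUploadRfcs` is a hypothetical helper, and the trimming and empty-entry handling are assumptions.

```javascript
// Hypothetical parser for a pipe-separated RFC list such as "RFC1|RFC2".
// Assumption: surrounding whitespace and empty entries are ignored.
function parseUploadRfcs(raw) {
  if (typeof raw !== 'string') {
    return []; // variable not set
  }
  return raw
    .split('|')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}
```

In a Node.js script this could be called as `parseUploadRfcs(process.env.UPLOAD_RFCS)`; an unset variable then yields an empty list rather than an error.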

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@arela/uploader",
-  "version": "0.2.0",
+  "version": "0.2.2",
   "description": "CLI to upload files/directories to Arela",
   "bin": {
     "arela": "./src/index.js"

package/src/document-type-shared.js CHANGED

@@ -1,3 +1,6 @@
+// Import all document type definitions
+import { pedimentoSimplificadoDefinition } from './document-types/pedimento-simplificado.js';
 // Document type definitions and extraction utilities
 // Ported from TypeScript to JavaScript for Node.js
@@ -10,7 +13,14 @@ export class FieldResult {
 }
 export class DocumentTypeDefinition {
-  constructor(type, extensions, match, extractors, extractNumPedimento, extractPedimentoYear) {
+  constructor(
+    type,
+    extensions,
+    match,
+    extractors,
+    extractNumPedimento,
+    extractPedimentoYear,
+  ) {
     this.type = type;
     this.extensions = extensions;
     this.match = match;
@@ -20,9 +30,6 @@ export class DocumentTypeDefinition {
   }
 }
-// Import all document type definitions
-import { pedimentoSimplificadoDefinition } from './document-types/pedimento-simplificado.js';
 // Registry of all document types
 const documentTypes = [
   pedimentoSimplificadoDefinition,
@@ -44,14 +51,17 @@ export function extractDocumentFields(source, fileExtension, filePath) {
   // Try to match against each document type
   for (const docType of documentTypes) {
     // Check if file extension matches
-    if (fileExtension && !docType.extensions.includes(fileExtension.toLowerCase())) {
+    if (
+      fileExtension &&
+      !docType.extensions.includes(fileExtension.toLowerCase())
+    ) {
       continue;
     }
     // Test if content matches this document type
     if (docType.match(source)) {
       console.log(`✅ Matched document type: ${docType.type}`);
       // Extract all fields
       const fields = [];
       for (const extractor of docType.extractors) {
@@ -68,8 +78,12 @@ export function extractDocumentFields(source, fileExtension, filePath) {
       }
       // Extract pedimento number and year
-      const pedimento = docType.extractNumPedimento ? docType.extractNumPedimento(source, fields) : null;
-      const year = docType.extractPedimentoYear ? docType.extractPedimentoYear(source, fields) : null;
+      const pedimento = docType.extractNumPedimento
+        ? docType.extractNumPedimento(source, fields)
+        : null;
+      const year = docType.extractPedimentoYear
+        ? docType.extractPedimentoYear(source, fields)
+        : null;
       return [docType.type, fields, pedimento, year];
     }
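
The changes to this file are formatting only, plus moving the import to the top of the module. For readers unfamiliar with the matching loop in `extractDocumentFields`, the flow can be sketched as below. This is a simplified stand-in, not the package code: `demoDefinition`, `matchDocument`, the `found` property name, and the no-match return value are all assumptions made for illustration.

```javascript
// Toy document type definition; the type name and patterns are illustrative only.
const demoDefinition = {
  type: 'demo_type',
  extensions: ['pdf'],
  match: (source) => source.includes('DEMO'),
  extractors: [
    {
      field: 'code',
      extract: (source) => {
        const m = source.match(/CODE:\s*(\w+)/);
        // Plain object standing in for FieldResult(name, found, value).
        return { name: 'code', found: !!m, value: m ? m[1] : null };
      },
    },
  ],
};

// Simplified version of the registry loop: skip types whose extensions do not
// match, test the content, then run every extractor of the first matching type.
function matchDocument(source, fileExtension, documentTypes) {
  for (const docType of documentTypes) {
    if (
      fileExtension &&
      !docType.extensions.includes(fileExtension.toLowerCase())
    ) {
      continue;
    }
    if (!docType.match(source)) {
      continue;
    }
    const fields = docType.extractors.map((e) => e.extract(source));
    const pedimento = docType.extractNumPedimento
      ? docType.extractNumPedimento(source, fields)
      : null;
    const year = docType.extractPedimentoYear
      ? docType.extractPedimentoYear(source, fields)
      : null;
    return [docType.type, fields, pedimento, year];
  }
  // Assumption: the shape returned when nothing matches is not shown in the diff.
  return [null, [], null, null];
}
```

Because the first matching definition wins, the order of entries in the registry array matters once more than one document type is registered.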

package/src/document-types/pedimento-simplificado.js CHANGED

@@ -33,7 +33,7 @@ export const pedimentoSimplificadoDefinition = {
         return new FieldResult(
           'numPedimento',
           !!match,
-          match ? match[0].replace(/\s/g, '') : null
+          match ? match[0].replace(/\s/g, '') : null,
         );
       },
     },
@@ -50,7 +50,7 @@ export const pedimentoSimplificadoDefinition = {
         return new FieldResult(
           'tipoOperacion',
           !!match,
-          match ? match[1] : null
+          match ? match[1] : null,
         );
       },
     },
@@ -67,7 +67,7 @@ export const pedimentoSimplificadoDefinition = {
         return new FieldResult(
           'clavePedimento',
           !!match,
-          match ? match[1] : null
+          match ? match[1] : null,
         );
       },
     },
@@ -83,7 +83,7 @@ export const pedimentoSimplificadoDefinition = {
         return new FieldResult(
           'aduanaEntradaSalida',
           !!match,
-          match ? match[1] : null
+          match ? match[1] : null,
         );
       },
     },
@@ -93,11 +93,7 @@ export const pedimentoSimplificadoDefinition = {
       field: 'rfc',
       extract: (source) => {
         const match = source.match(/\n\s*([A-Z0-9]{12,13})\s*\n/);
-        return new FieldResult(
-          'rfc',
-          !!match,
-          match ? match[1] : null
-        );
+        return new FieldResult('rfc', !!match, match ? match[1] : null);
       },
     },
@@ -112,9 +108,7 @@ export const pedimentoSimplificadoDefinition = {
           .filter((l) => l.length > 0);
         // 2) find the index of an RFC line (12–13 alnum chars)
-        const rfcIndex = lines.findIndex((l) =>
-          /^[A-Z0-9]{12,13}$/.test(l),
-        );
+        const rfcIndex = lines.findIndex((l) => /^[A-Z0-9]{12,13}$/.test(l));
         let code = null;
         // 3) if next line exists and is exactly 8 alnum chars, that's the code
@@ -122,11 +116,7 @@ export const pedimentoSimplificadoDefinition = {
           code = lines[rfcIndex + 1];
         }
-        return new FieldResult(
-          'codigoAceptacion',
-          code !== null,
-          code
-        );
+        return new FieldResult('codigoAceptacion', code !== null, code);
       },
     },
@@ -175,11 +165,7 @@ export const pedimentoSimplificadoDefinition = {
         if (!match) {
           match = source.match(/PRESENTACION:\s*(\d{2}\/\d{2}\/\d{4})/);
         }
-        return new FieldResult(
-          'paymentDate',
-          !!match,
-          match ? match[1] : null
-        );
+        return new FieldResult('paymentDate', !!match, match ? match[1] : null);
       },
     },
@@ -224,11 +210,7 @@ export const pedimentoSimplificadoDefinition = {
       extract: (source) => {
         // Look for the peso bruto value with decimal format
         const match = source.match(/(\d+\.\d+)\d{3}/);
-        return new FieldResult(
-          'pesoBruto',
-          !!match,
-          match ? match[1] : null
-        );
+        return new FieldResult('pesoBruto', !!match, match ? match[1] : null);
       },
     },
@@ -268,7 +250,7 @@ export const pedimentoSimplificadoDefinition = {
         return new FieldResult(
           'numeroOperacionBancaria',
           !!match,
-          match ? match[1] : null
+          match ? match[1] : null,
         );
       },
     },
@@ -281,7 +263,7 @@ export const pedimentoSimplificadoDefinition = {
         return new FieldResult(
           'numeroTransaccionSAT',
           !!match,
-          match ? match[1] : null
+          match ? match[1] : null,
         );
       },
     },
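
All hunks in this file are formatting changes (trailing commas and collapsed calls); the extraction logic is unchanged. To show what the `rfc` and `codigoAceptacion` extractors do, here is a standalone sketch using the two RFC patterns visible in the diff. The function names, the line-splitting step, and the 8-character pattern `/^[A-Z0-9]{8}$/` are assumptions, since those parts are not fully shown in the hunks.

```javascript
// RFC: a 12-13 character uppercase alphanumeric token alone on its line
// (pattern copied from the diff).
function extractRfc(source) {
  const match = source.match(/\n\s*([A-Z0-9]{12,13})\s*\n/);
  return match ? match[1] : null;
}

// Acceptance code: the line right after the RFC line, if it is exactly
// 8 alphanumeric characters (the exact test used by the package is assumed).
function extractCodigoAceptacion(source) {
  const lines = source
    .split('\n')
    .map((l) => l.trim())
    .filter((l) => l.length > 0);
  const rfcIndex = lines.findIndex((l) => /^[A-Z0-9]{12,13}$/.test(l));
  if (rfcIndex === -1) {
    return null;
  }
  const next = lines[rfcIndex + 1];
  return next !== undefined && /^[A-Z0-9]{8}$/.test(next) ? next : null;
}
```

Note that both patterns accept any 12 or 13 character uppercase alphanumeric line, so a document containing another such token before the RFC could produce a false match.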

package/src/file-detection.js CHANGED

@@ -1,6 +1,7 @@
 import fs from 'fs';
-import path from 'path';
 import { getTextExtractor } from 'office-text-extractor';
+import path from 'path';
 import { extractDocumentFields } from './document-type-shared.js';
 const extractor = getTextExtractor();
@@ -10,15 +11,20 @@ const extractor = getTextExtractor();
  * Format: RFC/Year/Patente/Aduana/Pedimento/
  * Example: PED781129JT6/2023/3429/07/3019796/
  */
-function composeArelaPath(detectedType, fields, detectedPedimentoYear, filePath) {
+function composeArelaPath(
+  detectedType,
+  fields,
+  detectedPedimentoYear,
+  filePath,
+) {
   if (detectedType !== 'pedimento_simplificado') {
     return null;
   }
-  const rfc = fields?.find(f => f.name === 'rfc')?.value;
-  const patente = fields?.find(f => f.name === 'patente')?.value;
-  const aduana = fields?.find(f => f.name === 'aduanaEntradaSalida')?.value;
-  const pedimento = fields?.find(f => f.name === 'numPedimento')?.value;
+  const rfc = fields?.find((f) => f.name === 'rfc')?.value;
+  const patente = fields?.find((f) => f.name === 'patente')?.value;
+  const aduana = fields?.find((f) => f.name === 'aduanaEntradaSalida')?.value;
+  const pedimento = fields?.find((f) => f.name === 'numPedimento')?.value;
   const year = detectedPedimentoYear;
   // All components are required for a valid arela_path
@@ -28,17 +34,17 @@ function composeArelaPath(detectedType, fields, detectedPedimentoYear, filePath)
       year: !!year,
       patente: !!patente,
       aduana: !!aduana,
-      pedimento: !!pedimento
+      pedimento: !!pedimento,
     });
     return null;
   }
   // Ensure aduana is padded to 2 digits if needed (07 instead of 7)
   const aduanaFormatted = aduana.toString().padStart(2, '0');
   // arela_path should be the folder structure only, without filename
   const arelaPath = `${rfc}/${year}/${patente}/${aduanaFormatted}/${pedimento}/`;
   console.log(`✅ Composed arela_path: ${arelaPath}`);
   return arelaPath;
 }
@@ -48,7 +54,6 @@ function composeArelaPath(detectedType, fields, detectedPedimentoYear, filePath)
  * Detects document types and extracts metadata from files
  */
 export class FileDetectionService {
   /**
    * Detect document type from a file
    * @param {string} filePath - Path to the file to analyze
@@ -56,13 +61,16 @@ export class FileDetectionService {
    */
   async detectFile(filePath) {
     try {
-      const fileExtension = path.extname(filePath).toLowerCase().replace('.', '');
+      const fileExtension = path
+        .extname(filePath)
+        .toLowerCase()
+        .replace('.', '');
       const fileName = path.basename(filePath);
       console.log(`🔍 Analyzing file: ${fileName} (${fileExtension})`);
       let text = '';
       // Extract text based on file type
       switch (fileExtension) {
         case 'pdf':
@@ -83,7 +91,7 @@ export class FileDetectionService {
             detectedPedimentoYear: null,
             arelaPath: null,
             text: '',
-            error: `Unsupported file type: ${fileExtension}`
+            error: `Unsupported file type: ${fileExtension}`,
           };
       }
@@ -96,16 +104,21 @@ export class FileDetectionService {
           detectedPedimentoYear: null,
           arelaPath: null,
           text: '',
-          error: 'No text could be extracted from file'
+          error: 'No text could be extracted from file',
         };
       }
       // Extract document fields and detect type
-      const [detectedType, fields, detectedPedimento, detectedPedimentoYear] =
+      const [detectedType, fields, detectedPedimento, detectedPedimentoYear] =
         extractDocumentFields(text, fileExtension, filePath);
       // Compose arela_path for pedimento_simplificado documents
-      const arelaPath = composeArelaPath(detectedType, fields, detectedPedimentoYear, filePath);
+      const arelaPath = composeArelaPath(
+        detectedType,
+        fields,
+        detectedPedimentoYear,
+        filePath,
+      );
       return {
         detectedType,
@@ -114,9 +127,8 @@ export class FileDetectionService {
         detectedPedimentoYear,
         arelaPath,
         text,
-        error: null
+        error: null,
       };
     } catch (error) {
       console.error(`❌ Error detecting file ${filePath}:`, error.message);
       return {
@@ -126,7 +138,7 @@ export class FileDetectionService {
         detectedPedimentoYear: null,
         arelaPath: null,
         text: '',
-        error: error.message
+        error: error.message,
       };
     }
   }
@@ -139,13 +151,16 @@ export class FileDetectionService {
   async extractTextFromPDF(filePath) {
     try {
       const buffer = fs.readFileSync(filePath);
-      const text = await extractor.extractText({
-        input: buffer,
-        type: 'file'
+      const text = await extractor.extractText({
+        input: buffer,
+        type: 'file',
       });
       return text;
     } catch (error) {
-      console.error(`Error extracting text from PDF ${filePath}:`, error.message);
+      console.error(
+        `Error extracting text from PDF ${filePath}:`,
+        error.message,
+      );
       throw new Error(`Failed to extract text from PDF: ${error.message}`);
     }
   }
@@ -157,15 +172,15 @@ export class FileDetectionService {
    */
   async detectFiles(filePaths) {
     const results = [];
     for (const filePath of filePaths) {
       const result = await this.detectFile(filePath);
       results.push({
         filePath,
-        ...result
+        ...result,
       });
     }
     return results;
   }
@@ -176,7 +191,7 @@ export class FileDetectionService {
    */
   isSupportedFileType(filePath) {
     const fileExtension = path.extname(filePath).toLowerCase().replace('.', '');
-    const supportedExtensions = ['pdf', 'txt', 'xml'];
+    const supportedExtensions = ['pdf'];
     return supportedExtensions.includes(fileExtension);
   }
@@ -186,7 +201,7 @@ export class FileDetectionService {
    * @returns {Array<string>} - Filtered array of supported file paths
    */
   filterSupportedFiles(filePaths) {
-    return filePaths.filter(filePath => this.isSupportedFileType(filePath));
+    return filePaths.filter((filePath) => this.isSupportedFileType(filePath));
   }
 }
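
Apart from formatting, the one behavioral change in this file is that `isSupportedFileType` now accepts only `pdf` (previously `pdf`, `txt`, and `xml`). The path composition in `composeArelaPath` is unchanged; a standalone sketch of it follows, based on the lines visible in the diff. The name `composeArelaPathSketch` and the exact form of the missing-component check are assumptions, and logging is omitted.

```javascript
// Sketch of the arela_path composition: RFC/Year/Patente/Aduana/Pedimento/
function composeArelaPathSketch(detectedType, fields, year) {
  if (detectedType !== 'pedimento_simplificado') {
    return null;
  }
  const valueOf = (name) => fields?.find((f) => f.name === name)?.value;
  const rfc = valueOf('rfc');
  const patente = valueOf('patente');
  const aduana = valueOf('aduanaEntradaSalida');
  const pedimento = valueOf('numPedimento');

  // All components are required (the exact check in the package is assumed).
  if (!rfc || !year || !patente || !aduana || !pedimento) {
    return null;
  }

  // Pad aduana to two digits, e.g. "7" becomes "07".
  const aduanaFormatted = aduana.toString().padStart(2, '0');
  return `${rfc}/${year}/${patente}/${aduanaFormatted}/${pedimento}/`;
}
```

With the example values from the source comment (RFC `PED781129JT6`, year `2023`, patente `3429`, aduana `7`, pedimento `3019796`), this produces `PED781129JT6/2023/3429/07/3019796/`.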