@arela/uploader 0.1.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,15 +1,89 @@
1
1
  # arela-uploader
2
2
 
3
- CLI tool to upload files and directories to Supabase Storage with automatic file renaming and sanitization.
3
+ CLI tool to upload files and directories to the Arela API or Supabase Storage with automatic file processing, detection, and organization.
4
+
5
+ ## 🚀 OPTIMIZED 4-PHASE WORKFLOW
6
+
7
+ **New in v0.2.0**: The tool now supports an optimized 4-phase workflow designed for maximum performance when processing large file collections:
8
+
9
+ ### Phase 1: Filesystem Stats Collection 📊
10
+ ```bash
11
+ arela --stats-only
12
+ ```
13
+ - ⚡ **ULTRA FAST**: Only reads filesystem metadata (no file content)
14
+ - 📈 **Bulk database operations**: Processes 1000+ files per batch
15
+ - 🔄 **Upsert optimization**: Handles duplicates efficiently (see the sketch below)
16
+ - 💾 **Minimal memory usage**: No file content loading
17
+
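+ A rough sketch of what Phase 1 does, assuming the `uploader` table described later in this README (column names other than `original_path` are illustrative):
+
+ ```js
+ // Sketch only: metadata collection plus batched upserts; not the exact implementation.
+ import { stat } from 'node:fs/promises';
+ import { globby } from 'globby';
+ import { createClient } from '@supabase/supabase-js';
+
+ const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_KEY);
+
+ async function collectStats(baseDir) {
+   const paths = await globby(`${baseDir}/**/*`);
+   const rows = await Promise.all(
+     paths.map(async (p) => {
+       const s = await stat(p); // filesystem metadata only; file content is never read
+       return { original_path: p, size_bytes: s.size, modified_at: s.mtime.toISOString() };
+     })
+   );
+   // Upsert in batches of 1000 so duplicates are handled in a single round trip
+   for (let i = 0; i < rows.length; i += 1000) {
+     const { error } = await supabase
+       .from('uploader')
+       .upsert(rows.slice(i, i + 1000), { onConflict: 'original_path' });
+     if (error) throw error;
+   }
+ }
+
+ collectStats(process.env.UPLOAD_BASE_PATH).catch(console.error);
+ ```
+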
18
+ ### Phase 2: PDF Detection 🔍
19
+ ```bash
20
+ arela --detect-pdfs
21
+ ```
22
+ - 🎯 **Targeted processing**: Only processes PDF files from database
23
+ - 📄 **Pedimento-simplificado detection**: Extracts RFC, pedimento numbers, and metadata (see the usage sketch below)
24
+ - 🔄 **Batched processing**: Handles large datasets efficiently
25
+ - 📊 **Progress tracking**: Real-time detection statistics
26
+
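+ For reference, the detection helper shipped in this release (shown at the end of this diff) can be called roughly like this; the import path and extension format are assumptions:
+
+ ```js
+ // Hedged usage sketch; adjust the import path to wherever the module lives in the package.
+ import { extractDocumentFields } from './document-types.js';
+
+ const text = '...text extracted from a PDF...';
+ const [type, fields, pedimento, year] = extractDocumentFields(text, 'pdf', '/docs/2023/3019796/file.pdf');
+ if (type) {
+   console.log(`Detected ${type}`, { pedimento, year, fields });
+ }
+ ```
+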
27
+ ### Phase 3: Path Propagation 📁
28
+ ```bash
29
+ arela --propagate-arela-path
30
+ ```
31
+ - 🎯 **Smart path copying**: Propagates arela_path from pedimento documents to related files
32
+ - 📦 **Batch updates**: Processes files in groups for optimal database performance
33
+ - 🔗 **Relationship mapping**: Links supporting documents to their pedimento
34
+
35
+ ### Phase 4: RFC-based Upload 🚀
36
+ ```bash
37
+ arela --upload-by-rfc
38
+ ```
39
+ - 🎯 **Targeted uploads**: Only uploads files for specified RFCs
40
+ - 📋 **Supporting documents**: Includes all related files, not just pedimentos
41
+ - 🏗️ **Structure preservation**: Maintains proper folder hierarchy
42
+
43
+ ### Combined Workflow 🎯
44
+ ```bash
45
+ # Run all 4 phases in sequence (recommended)
46
+ arela --run-all-phases
47
+
48
+ # Or run phases individually for more control
49
+ arela --stats-only # Phase 1: Collect filesystem stats
50
+ arela --detect-pdfs # Phase 2: Detect pedimento documents
51
+ arela --propagate-arela-path # Phase 3: Propagate paths to related files
52
+ arela --upload-by-rfc # Phase 4: Upload by RFC
53
+ ```
54
+
55
+ ### Performance Benefits
56
+
57
+ **Before optimization** (single phase with detection):
58
+ - 🐌 Read every file for detection
59
+ - 💾 High memory usage
60
+ - 🔄 Slow database operations
61
+ - ❌ Processed unsupported files unnecessarily
62
+
63
+ **After optimization** (4-phase approach):
64
+ - ⚡ **10x faster**: Phase 1 only reads filesystem metadata
65
+ - 📊 **Bulk operations**: Database inserts up to 1000 records per batch
66
+ - 🎯 **Targeted processing**: Phase 2 only processes PDFs needing detection
67
+ - 💾 **Memory efficient**: No unnecessary file content loading
68
+ - 🔄 **Optimized I/O**: Separates filesystem, database, and network operations
4
69
 
5
70
  ## Features
6
71
 
7
72
  - 📁 Upload entire directories or individual files
73
+ - 🤖 **Automatic file detection and organization** (API mode)
74
+ - 🗂️ **Smart year/pedimento auto-detection from file paths**
75
+ - 🏗️ **Custom folder structure support**
8
76
  - 🔄 Automatic file renaming to handle problematic characters
9
77
  - 📝 Comprehensive logging (local and remote)
10
78
  - ⚡ Retry mechanism for failed uploads
11
79
  - 🎯 Skip duplicate files automatically
12
80
  - 📊 Progress bars and detailed summaries
81
+ - 📂 **Preserve directory structure with auto-organization**
82
+ - 🚀 **Batch processing with configurable concurrency**
83
+ - 🔧 **Performance optimizations with caching**
84
+ - 📋 **Upload files by specific RFC values**
85
+ - 🔍 **Propagate arela_path from pedimento documents to related files**
86
+ - ⚡ **4-Phase optimized workflow for maximum performance**
13
87
 
14
88
  ## Installation
15
89
 
@@ -19,51 +93,298 @@ npm install -g @arela/uploader
19
93
 
20
94
  ## Usage
21
95
 
22
- ### Basic Upload
96
+ ### 🚀 Optimized 4-Phase Workflow (Recommended)
97
+
98
+ ```bash
99
+ # Run all phases automatically (most efficient)
100
+ arela --run-all-phases --batch-size 20
101
+
102
+ # Or run phases individually for fine-grained control
103
+ arela --stats-only # Phase 1: Filesystem stats only
104
+ arela --detect-pdfs --batch-size 10 # Phase 2: PDF detection
105
+ arela --propagate-arela-path # Phase 3: Path propagation
106
+ arela --upload-by-rfc --batch-size 5 # Phase 4: RFC-based upload
107
+ ```
108
+
109
+ ### Traditional Single-Phase Upload (Legacy)
110
+
111
+ #### Basic Upload with Auto-Processing (API Mode)
112
+ ```bash
113
+ arela --batch-size 10 -c 5
114
+ ```
115
+
116
+ #### Upload with Auto-Detection of Year/Pedimento
117
+ ```bash
118
+ arela --auto-detect-structure --batch-size 10 -c 5
119
+ ```
120
+
121
+ #### Upload with Custom Folder Structure
122
+ ```bash
123
+ arela --folder-structure "2024/4023260" --batch-size 10 -c 5
124
+ ```
125
+
126
+ #### Upload with Directory Structure Preservation
127
+ ```bash
128
+ arela --batch-size 10 -c 5 --preserve-structure
129
+ ```
130
+
131
+ #### Upload to Supabase Directly (Skip API)
23
132
  ```bash
24
- arela -p "my-folder"
133
+ arela --force-supabase -p "my-folder"
25
134
  ```
26
135
 
27
- ### Upload with File Renaming
28
- For files with accents, special characters, or problematic names:
136
+ ### Upload Files by Specific RFC Values
137
+ ```bash
138
+ # Upload all files associated with specific RFCs
139
+ arela --upload-by-rfc --batch-size 5
140
+
141
+ # Upload RFC files with custom folder prefix
142
+ arela --upload-by-rfc --folder-structure "palco" --batch-size 5
143
+
144
+ # Upload RFC files with nested folder structure
145
+ arela --upload-by-rfc --folder-structure "2024/client1/pedimentos" --batch-size 5
146
+ ```
29
147
 
148
+ ### Propagate Arela Path from Pedimentos to Related Files
30
149
  ```bash
31
- # Preview what files would be renamed (dry run)
32
- arela --rename-files --dry-run
150
+ # Copy arela_path from pedimento_simplificado records to related files
151
+ arela --propagate-arela-path
152
+ ```
33
153
 
34
- # Actually rename and upload files
35
- arela --rename-files -p "documents"
154
+ ### Stats-Only Mode (No Upload)
155
+ ```bash
156
+ # Only process file stats and insert to database, don't upload
157
+ arela --stats-only --folder-structure "2023/3019796"
158
+ ```
159
+
160
+ ### Upload with Performance Statistics
161
+ ```bash
162
+ arela --batch-size 10 -c 5 --show-stats
163
+ ```
164
+
165
+ ### Upload with Client Path Tracking
166
+ ```bash
167
+ arela --client-path "/client/documents" --batch-size 10 -c 5
36
168
  ```
37
169
 
38
170
  ### Options
39
171
 
172
+ #### Phase Control
173
+ - `--stats-only`: **Phase 1** - Only collect filesystem stats (no file reading)
174
+ - `--detect-pdfs`: **Phase 2** - Process PDF files for pedimento-simplificado detection
175
+ - `--propagate-arela-path`: **Phase 3** - Propagate arela_path from pedimento records to related files
176
+ - `--upload-by-rfc`: **Phase 4** - Upload files based on RFC values from UPLOAD_RFCS
177
+ - `--run-all-phases`: **All Phases** - Run complete optimized workflow
178
+
179
+ #### Performance & Configuration
180
+ - `-c, --concurrency <number>`: Number of files processed in parallel (default: 10)
181
+ - `--batch-size <number>`: Number of files sent per API batch (default: 10)
182
+ - `--show-stats`: Show detailed processing statistics
183
+
184
+ #### Upload Configuration
40
185
  - `-p, --prefix <prefix>`: Prefix path in bucket (default: "")
41
- - `-r, --rename-files`: Rename files with problematic characters before uploading
42
- - `--dry-run`: Show what files would be renamed without actually renaming them
43
- - `-h, --help`: Display help information
186
+ - `-b, --bucket <bucket>`: Bucket name override
187
+ - `--force-supabase`: Force direct Supabase upload (skip API)
188
+ - `--no-auto-detect`: Disable automatic file detection (API mode only)
189
+ - `--no-auto-organize`: Disable automatic file organization (API mode only)
190
+ - `--preserve-structure`: **Preserve original directory structure when using auto-organize**
191
+ - `--folder-structure <structure>`: **Custom folder structure** (e.g., "2024/4023260" or "cliente1/pedimentos")
192
+ - `--auto-detect-structure`: **Automatically detect year/pedimento from file paths**
193
+ - `--client-path <path>`: Client path for metadata tracking
194
+
195
+ #### Legacy Options
196
+ - `--no-detect`: Disable document type detection in stats-only mode
44
197
  - `-v, --version`: Display version number
198
+ - `-h, --help`: Display help information
45
199
 
46
200
  ## Environment Variables
47
201
 
48
202
  Create a `.env` file in your project root:
49
203
 
50
204
  ```env
205
+ # For API Mode (recommended)
206
+ ARELA_API_URL=http://localhost:3010
207
+ ARELA_API_TOKEN=your_api_token
208
+
209
+ # For Direct Supabase Mode (fallback)
51
210
  SUPABASE_URL=your_supabase_url
52
211
  SUPABASE_KEY=your_supabase_anon_key
53
212
  SUPABASE_BUCKET=your_bucket_name
213
+
214
+ # Required for both modes
54
215
  UPLOAD_BASE_PATH=/path/to/your/files
55
216
  UPLOAD_SOURCES=folder1|folder2|file.pdf
217
+
218
+ # RFC-based Upload Configuration
219
+ # Pipe-separated list of RFCs to upload files for
220
+ UPLOAD_RFCS=MMJ0810145N1|ABC1234567XY|DEF9876543ZZ
221
+ ```
222
+
223
+ **Environment Variable Details:**
224
+
225
+ - `ARELA_API_URL`: Base URL for the Arela API service
226
+ - `ARELA_API_TOKEN`: Authentication token for API access
227
+ - `SUPABASE_URL`: Your Supabase project URL
228
+ - `SUPABASE_KEY`: Supabase anonymous key for direct uploads
229
+ - `SUPABASE_BUCKET`: Target bucket name in Supabase Storage
230
+ - `UPLOAD_BASE_PATH`: Root directory containing files to upload
231
+ - `UPLOAD_SOURCES`: Pipe-separated list of folders/files to process
232
+ - `UPLOAD_RFCS`: Pipe-separated list of RFC values for targeted uploads
233
+
234
+ ## RFC-Based File Upload
235
+
236
+ The `--upload-by-rfc` feature allows you to upload files to the Arela API based on specific RFC values. This is useful when you want to upload only files associated with certain companies or entities.
237
+
238
+ ### How it works:
239
+
240
+ 1. **Configure RFCs**: Set the `UPLOAD_RFCS` environment variable with pipe-separated RFC values
241
+ 2. **Query Database**: The tool searches the Supabase database for files matching the specified RFCs
242
+ 3. **Include Supporting Documents**: Finds all files sharing the same `arela_path` as the RFC matches (not just the pedimento files)
243
+ 4. **Apply Folder Structure**: Optionally applies custom folder prefix using `--folder-structure`
244
+ 5. **Group and Upload**: Files are grouped by their final destination path and uploaded with proper structure
245
+
246
+ ### Folder Structure Options:
247
+
248
+ **Default Behavior** (no `--folder-structure`):
249
+ - Uses original `arela_path`: `CAD890407NK7/2023/3429/070/230734293000421/`
250
+
251
+ **With Custom Prefix** (`--folder-structure "palco"`):
252
+ - Results in: `palco/CAD890407NK7/2023/3429/070/230734293000421/`
253
+
254
+ **With Nested Prefix** (`--folder-structure "2024/client1/pedimentos"`):
255
+ - Results in: `2024/client1/pedimentos/CAD890407NK7/2023/3429/070/230734293000421/`
256
+
257
+ ### Prerequisites:
258
+
259
+ - Files must have been previously processed (have entries in the `uploader` table)
260
+ - Files must have `rfc` field populated (from document detection)
261
+ - Files must have `arela_path` populated (from pedimento processing)
262
+ - Original files must still exist at their `original_path` locations
263
+
264
+ ### Example:
265
+
266
+ ```bash
267
+ # Set RFCs in environment
268
+ export UPLOAD_RFCS="MMJ0810145N1|ABC1234567XY|DEF9876543ZZ"
269
+
270
+ # Upload files for these RFCs (original folder structure)
271
+ arela --upload-by-rfc --batch-size 5 --show-stats
272
+
273
+ # Upload with custom folder prefix
274
+ arela --upload-by-rfc --folder-structure "palco" --batch-size 10
275
+
276
+ # Upload with nested organization
277
+ arela --upload-by-rfc --folder-structure "2024/Q1/processed" --batch-size 15
278
+ ```
279
+
280
+ The tool will:
281
+ - Find all database records matching the specified RFCs
282
+ - Include ALL supporting documents that share the same `arela_path`
283
+ - Apply the optional folder structure prefix if specified
284
+ - Group files by their final destination folder structure
285
+ - Upload each group maintaining the correct Arela folder hierarchy
286
+ - Provide detailed progress and summary statistics
287
+ - Handle large datasets with automatic pagination (no 1000-file limit)
288
+
289
+ ## File Processing Modes
290
+
291
+ ### API Mode (Default)
292
+ When `ARELA_API_URL` and `ARELA_API_TOKEN` are configured:
293
+ - ✅ Automatic file detection and classification
294
+ - ✅ Intelligent file organization
295
+ - ✅ **Smart year/pedimento auto-detection from paths**
296
+ - ✅ **Custom folder structure support**
297
+ - ✅ Batch processing with progress tracking
298
+ - ✅ Advanced error handling and retry logic
299
+ - ✅ **Performance optimizations with file sanitization caching**
300
+
301
+ ### Auto-Detection Features
302
+ The tool can automatically detect year and pedimento numbers from file paths using multiple patterns:
303
+
304
+ **Pattern 1: Direct Structure**
305
+ ```
306
+ /path/to/2024/4023260/file.pdf
307
+ /path/to/pedimentos/2024/4023260/file.pdf
308
+ ```
309
+
310
+ **Pattern 2: Named Patterns**
311
+ ```
312
+ /path/to/docs/año2024/ped4023260/file.pdf
313
+ /path/to/files/year2024/pedimento4023260/file.pdf
314
+ ```
315
+
316
+ **Pattern 3: Loose Detection**
317
+ - Year: Any 4-digit number starting with "202" (2020-2029)
318
+ - Pedimento: Any 4-8 consecutive digits in the path (sketched below)
319
+
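+ A minimal sketch of the loose patterns above (the tool's actual expressions may differ):
+
+ ```js
+ // Hedged sketch of loose year/pedimento detection from a file path.
+ const YEAR_RE = /\b202\d\b/; // any 4-digit year in 2020-2029
+
+ function looseDetect(filePath) {
+   const year = filePath.match(YEAR_RE)?.[0] ?? null;
+   // first run of 4-8 digits that is not the detected year
+   const pedimento =
+     [...filePath.matchAll(/\d{4,8}/g)].map((m) => m[0]).find((d) => d !== year) ?? null;
+   return { year, pedimento };
+ }
+
+ // looseDetect('/path/to/2024/4023260/file.pdf') → { year: '2024', pedimento: '4023260' }
+ ```
+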
320
+ Use `--auto-detect-structure` to enable automatic detection:
321
+ ```bash
322
+ arela --auto-detect-structure --batch-size 10
323
+ ```
324
+
325
+ ### Custom Folder Structure
326
+ Specify a custom organization pattern:
327
+ ```bash
328
+ # Static structure
329
+ arela --folder-structure "2024/4023260" --batch-size 10
330
+
331
+ # Client-based structure
332
+ arela --folder-structure "cliente1/pedimentos" --batch-size 10
333
+ ```
334
+
335
+ ### Directory Structure Preservation
336
+ Use `--preserve-structure` to maintain your original folder structure even with auto-organization:
337
+
338
+ ```bash
339
+ # Without --preserve-structure
340
+ # Files organized by API: bucket/filename.pdf
341
+
342
+ # With --preserve-structure
343
+ # Files keep structure: bucket/2024/4023260/filename.pdf
344
+ arela --preserve-structure --batch-size 10
56
345
  ```
57
346
 
58
- ## File Renaming
347
+ ### Supabase Direct Mode (Fallback)
348
+ When API is unavailable or `--force-supabase` is used:
349
+ - ✅ Direct upload to Supabase Storage
350
+ - ✅ File sanitization and renaming
351
+ - ✅ Basic progress tracking
352
+ - ✅ **Optimized sanitization with pre-compiled regex patterns**
353
+ - ✅ **Performance caching for file name sanitization**
59
354
 
60
- The tool automatically handles problematic characters by:
355
+ ## Performance Features
61
356
 
62
- - Removing accents (á → a, ñ → n, etc.)
63
- - Replacing special characters with safe alternatives
64
- - Converting spaces to dashes
65
- - Removing or replacing symbols like `{}[]~^`|"<>?*:`
66
- - Handling Korean characters and other Unicode symbols
357
+ ### Database Pagination
358
+ - **No Upload Limits**: Handles datasets larger than 1000 files through automatic pagination
359
+ - **Efficient Querying**: Uses Supabase `.range()` method to fetch data in batches
360
+ - **Memory Optimization**: Processes large datasets without memory overflow
361
+
362
+ ### File Processing
363
+ - **Pre-compiled Regex**: Sanitization patterns are compiled once for optimal performance
364
+ - **Caching System**: File name sanitization results are cached to avoid re-processing
365
+ - **Batch Processing**: Configurable batch sizes for optimal upload throughput
366
+
367
+ ### RFC Upload Optimizations
368
+ - **Smart Querying**: Three-step query process to efficiently find related files
369
+ - **Supporting Document Inclusion**: Automatically includes all related documents, not just pedimentos
370
+ - **Path Concatenation**: Efficiently combines custom folder structures with arela_paths
371
+
372
+ ## File Sanitization
373
+
374
+ The tool automatically handles problematic characters using advanced sanitization:
375
+
376
+ **Character Replacements:**
377
+ - **Accents**: á→a, é→e, í→i, ó→o, ú→u, ñ→n, ç→c
378
+ - **Korean characters**: 멕→meok, 시→si, 코→ko, 용→yong, others→kr
379
+ - **Special symbols**: &→and, {}[]~^|"<>?*: →-
380
+ - **Email symbols**: @→(removed), spaces→-
381
+ - **Multiple dashes**: collapsed to single dash
382
+ - **Leading/trailing**: dashes and dots removed
383
+
384
+ **Performance Features:**
385
+ - Pre-compiled regex patterns for faster processing
386
+ - Sanitization result caching to avoid re-processing (see the sketch below)
387
+ - Unicode normalization (NFD) for consistent handling
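+
+ A simplified sketch of the cached, pre-compiled sanitization described above (the exact replacement rules and names are illustrative):
+
+ ```js
+ // Hedged sketch: pre-compiled patterns plus a result cache; Korean handling omitted here.
+ const ACCENTS = /[\u0300-\u036f]/g;   // combining marks left behind by NFD
+ const UNSAFE = /[{}\[\]~^|"<>?*:]/g;  // unsafe symbols
+
+ const cache = new Map();
+
+ function sanitizeFileName(name) {
+   if (cache.has(name)) return cache.get(name); // cached result, no re-processing
+   const clean = name
+     .normalize('NFD')               // split accents from base letters
+     .replace(ACCENTS, '')           // á → a, ñ → n, ...
+     .replace(/&/g, 'and')           // & → and
+     .replace(/@/g, '')              // drop email symbols
+     .replace(UNSAFE, '-')           // unsafe symbols → -
+     .replace(/\s+/g, '-')           // spaces → -
+     .replace(/-{2,}/g, '-')         // collapse multiple dashes
+     .replace(/^[-.]+|[-.]+$/g, ''); // trim leading/trailing dashes and dots
+   cache.set(name, clean);
+   return clean;
+ }
+
+ // sanitizeFileName('Document ^& symbols.pdf') → 'Document-and-symbols.pdf'
+ ```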
67
388
 
68
389
  ### Examples
69
390
 
@@ -73,12 +394,79 @@ The tool automatically handles problematic characters by:
73
394
  | `File{with}brackets.pdf` | `File-with-brackets.pdf` |
74
395
  | `Document ^& symbols.pdf` | `Document-and-symbols.pdf` |
75
396
  | `CI & PL-20221212(멕시코용).xls` | `CI-and-PL-20221212.xls` |
397
+ | `impresora@nereprint.com_file.xml` | `impresoranereprint.com_file.xml` |
398
+ | `07-3429-3000430 HC.pdf` | `07-3429-3000430-HC.pdf` |
399
+ | `FACTURA IN 3000430.pdf` | `FACTURA-IN-3000430.pdf` |
76
400
 
77
- ## Logging
401
+ ## Logging and Monitoring
78
402
 
79
- The tool maintains logs both locally (`upload.log`) and remotely in your Supabase database. Logs include:
403
+ The tool maintains comprehensive logs both locally and remotely:
80
404
 
81
- - Upload status (success/error/skipped)
82
- - File paths and sanitization changes
405
+ **Local Logging (`arela-upload.log`):**
406
+ - Upload status (SUCCESS/ERROR/SKIPPED/SANITIZED)
407
+ - File paths and sanitization changes
83
408
  - Error messages and timestamps
84
- - Rename operations
409
+ - Rename operations with before/after names
410
+ - Processing statistics and performance metrics
411
+
412
+ **Log Entry Examples:**
413
+ ```
414
+ [2025-09-04T01:17:00.141Z] SUCCESS: /Users/.../file.xml -> 2023/2003180/file.xml
415
+ [2025-09-04T01:17:00.822Z] SANITIZED: file name.pdf → file-name.pdf
416
+ [2025-09-04T01:17:00.856Z] SKIPPED: /Users/.../duplicate.pdf (already exists)
417
+ ```
418
+
419
+ **Remote Logging:**
420
+ - Integration with Supabase database for centralized logging
421
+ - Upload tracking and audit trails
422
+ - Error reporting and monitoring
423
+
424
+ ## Performance Optimizations
425
+
426
+ **Version 0.2.0 introduces several performance optimizations:**
427
+
428
+ - **Pre-compiled Regex Patterns**: Sanitization patterns are compiled once and reused
429
+ - **Sanitization Caching**: File name sanitization results are cached to avoid reprocessing
430
+ - **Batch Processing**: Configurable batch sizes for optimal API usage
431
+ - **Concurrent Processing**: Adjustable concurrency levels for file processing
432
+ - **Smart Skip Logic**: Efficiently skips already processed files using log analysis
433
+ - **Memory Optimization**: Large file outputs are truncated to prevent memory issues
434
+
435
+ ## Version History
436
+
437
+ **v0.2.0** - Latest Release
438
+ - ✨ Added smart year/pedimento auto-detection from file paths
439
+ - ✨ Custom folder structure support with `--folder-structure` option
440
+ - ✨ Client path tracking with `--client-path` option
441
+ - ✨ Performance optimizations with regex pre-compilation
442
+ - ✨ Sanitization result caching for improved speed
443
+ - ✨ Enhanced file sanitization with Korean character support
444
+ - ✨ Improved email character handling in file names
445
+ - ✨ Better error handling and logging
446
+ - 📝 Comprehensive logging with SANITIZED status
447
+ - 🔧 Memory optimization for large file processing
448
+
449
+ ## Troubleshooting
450
+
451
+ **Connection Issues:**
452
+ - Verify `ARELA_API_URL` and `ARELA_API_TOKEN` are correct
453
+ - Check network connectivity to the API endpoint
454
+ - The tool will automatically fall back to Supabase direct mode if the API is unavailable
455
+
456
+ **Performance Issues:**
457
+ - Adjust `--batch-size` for optimal API performance (default: 10)
458
+ - Modify `--concurrency` to control parallel processing (default: 10)
459
+ - Use `--show-stats` to monitor sanitization cache performance
460
+
461
+ **File Issues:**
462
+ - Check file permissions in `UPLOAD_BASE_PATH`
463
+ - Verify `UPLOAD_SOURCES` paths exist and are accessible
464
+ - Review `arela-upload.log` for detailed error information
465
+
466
+ ## Contributing
467
+
468
+ Contributions are welcome! Please feel free to submit a Pull Request.
469
+
470
+ ## License
471
+
472
+ ISC License - see LICENSE file for details.
File without changes
package/commands.md ADDED
@@ -0,0 +1,6 @@
1
+ node src/index.js --stats-only
2
+ node src/index.js --detect-pdfs
3
+ node src/index.js --propagate-arela-path
4
+ node src/index.js --upload-by-rfc --folder-structure palco
5
+
6
+ UPLOAD_RFCS="RFC1|RFC2" node src/index.js --upload-by-rfc --folder-structure target-folder
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@arela/uploader",
3
- "version": "0.1.0",
3
+ "version": "0.2.1",
4
4
  "description": "CLI to upload files/directories to Arela",
5
5
  "bin": {
6
6
  "arela": "./src/index.js"
@@ -28,15 +28,18 @@
28
28
  },
29
29
  "homepage": "https://github.com/inspiraCode/arela-uploader#readme",
30
30
  "dependencies": {
31
- "@supabase/supabase-js": "^2.49.4",
32
- "cli-progress": "^3.12.0",
33
- "commander": "^13.1.0",
34
- "dotenv": "^16.5.0",
35
- "globby": "^14.1.0",
36
- "mime-types": "^3.0.1"
31
+ "@supabase/supabase-js": "2.49.4",
32
+ "cli-progress": "3.12.0",
33
+ "commander": "13.1.0",
34
+ "dotenv": "16.5.0",
35
+ "form-data": "4.0.4",
36
+ "globby": "14.1.0",
37
+ "mime-types": "3.0.1",
38
+ "node-fetch": "3.3.2",
39
+ "office-text-extractor": "3.0.3"
37
40
  },
38
41
  "devDependencies": {
39
- "@trivago/prettier-plugin-sort-imports": "^5.2.2",
40
- "prettier": "^3.5.3"
42
+ "@trivago/prettier-plugin-sort-imports": "5.2.2",
43
+ "prettier": "3.5.3"
41
44
  }
42
45
  }
@@ -0,0 +1,80 @@
1
+ // Document type definitions and extraction utilities
2
+ // Ported from TypeScript to JavaScript for Node.js
3
+
4
+ export class FieldResult {
5
+ constructor(name, found, value) {
6
+ this.name = name;
7
+ this.found = found;
8
+ this.value = value;
9
+ }
10
+ }
11
+
12
+ export class DocumentTypeDefinition {
13
+ constructor(type, extensions, match, extractors, extractNumPedimento, extractPedimentoYear) {
14
+ this.type = type;
15
+ this.extensions = extensions;
16
+ this.match = match;
17
+ this.extractors = extractors;
18
+ this.extractNumPedimento = extractNumPedimento;
19
+ this.extractPedimentoYear = extractPedimentoYear;
20
+ }
21
+ }
22
+
23
+ // Import all document type definitions
24
+ import { pedimentoSimplificadoDefinition } from './document-types/pedimento-simplificado.js';
25
+
26
+ // Registry of all document types
27
+ const documentTypes = [
28
+ pedimentoSimplificadoDefinition,
29
+ // Add more document types here as needed
30
+ ];
31
+
32
+ /**
33
+ * Extract document fields from text content
34
+ * @param {string} source - The text content to analyze
35
+ * @param {string} fileExtension - File extension for context
36
+ * @param {string} filePath - File path for context
37
+ * @returns {[string|null, FieldResult[], string|null, number|null]} - [detectedType, fields, pedimento, year]
38
+ */
39
+ export function extractDocumentFields(source, fileExtension, filePath) {
40
+ if (!source || typeof source !== 'string') {
41
+ return [null, [], null, null];
42
+ }
43
+
44
+ // Try to match against each document type
45
+ for (const docType of documentTypes) {
46
+ // Check if file extension matches
47
+ if (fileExtension && !docType.extensions.includes(fileExtension.toLowerCase())) {
48
+ continue;
49
+ }
50
+
51
+ // Test if content matches this document type
52
+ if (docType.match(source)) {
53
+ console.log(`✅ Matched document type: ${docType.type}`);
54
+
55
+ // Extract all fields
56
+ const fields = [];
57
+ for (const extractor of docType.extractors) {
58
+ try {
59
+ const result = extractor.extract(source);
60
+ fields.push(result);
61
+ if (result.found) {
62
+ console.log(` - ${result.name}: ${result.value}`);
63
+ }
64
+ } catch (error) {
65
+ console.error(`Error extracting field ${extractor.field}:`, error);
66
+ fields.push(new FieldResult(extractor.field, false, null));
67
+ }
68
+ }
69
+
70
+ // Extract pedimento number and year
71
+ const pedimento = docType.extractNumPedimento ? docType.extractNumPedimento(source, fields) : null;
72
+ const year = docType.extractPedimentoYear ? docType.extractPedimentoYear(source, fields) : null;
73
+
74
+ return [docType.type, fields, pedimento, year];
75
+ }
76
+ }
77
+
78
+ console.log('❓ No document type matched');
79
+ return [null, [], null, null];
80
+ }