@arela/uploader 1.0.2 → 1.0.3

@@ -0,0 +1,139 @@
1
+ # Arela Scan Quick Reference
2
+
3
+ ## Setup
4
+
5
+ ### 1. Configure Backend
6
+
7
+ Add the `cli_registry` entity to TypeORM, then generate and run the migration:
8
+
9
+ ```bash
10
+ cd arela-api
11
+ npm run migration:generate -- -n CreateCliRegistry
12
+ npm run migration:run
13
+ ```
14
+
15
+ ### 2. Configure CLI
16
+
17
+ Set environment variables in `.env`:
18
+
19
+ ```bash
20
+ # Required
21
+ ARELA_COMPANY_SLUG=your_company
22
+ ARELA_SERVER_ID=server01
23
+ UPLOAD_BASE_PATH=/path/to/files
24
+ UPLOAD_SOURCES=2023|2024|2025
25
+
26
+ # Optional
27
+ ARELA_BASE_PATH_LABEL=data
28
+ SCAN_EXCLUDE_PATTERNS=.DS_Store,Thumbs.db,desktop.ini
29
+ SCAN_BATCH_SIZE=2000
30
+
31
+ # API Configuration
32
+ ARELA_API_URL=http://localhost:3010
33
+ ARELA_API_TOKEN=your-token
34
+ ```
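The variables above could be read and validated roughly as follows. This is a minimal sketch, not the actual loader in `src/config/config.js`: the function name `parseScanConfig` is hypothetical, and it assumes that `UPLOAD_SOURCES` is pipe-separated, that `SCAN_EXCLUDE_PATTERNS` is comma-separated, and that all four variables under `# Required` must be present.

```javascript
// Hypothetical helper; the real config loader may behave differently.
function parseScanConfig(env) {
  // Assumption: every variable listed under "# Required" is mandatory.
  const required = ['ARELA_COMPANY_SLUG', 'ARELA_SERVER_ID', 'UPLOAD_BASE_PATH', 'UPLOAD_SOURCES'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing configuration: ${missing.join(', ')}`);
  }

  // Split a delimited string into trimmed, non-empty entries.
  const splitList = (value, separator) =>
    (value ?? '').split(separator).map((item) => item.trim()).filter(Boolean);

  const batchSize = Number.parseInt(env.SCAN_BATCH_SIZE ?? '2000', 10);

  return {
    companySlug: env.ARELA_COMPANY_SLUG,
    serverId: env.ARELA_SERVER_ID,
    basePath: env.UPLOAD_BASE_PATH,
    // Assumption: "|" separates sources, as in the example value 2023|2024|2025.
    sources: splitList(env.UPLOAD_SOURCES, '|'),
    // Assumption: "," separates exclude patterns.
    excludePatterns: splitList(env.SCAN_EXCLUDE_PATTERNS, ','),
    // Fall back to the documented default of 2000 on invalid input.
    batchSize: Number.isInteger(batchSize) && batchSize > 0 ? batchSize : 2000,
  };
}

const config = parseScanConfig({
  ARELA_COMPANY_SLUG: 'your_company',
  ARELA_SERVER_ID: 'server01',
  UPLOAD_BASE_PATH: '/path/to/files',
  UPLOAD_SOURCES: '2023|2024|2025',
});
console.log(config.sources, config.batchSize);
```

Failing fast on missing variables keeps the "Missing configuration" error close to its cause instead of surfacing later as a failed API call.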
35
+
36
+ ## Commands
37
+
38
+ ### Scan Filesystem
39
+
40
+ ```bash
41
+ # Basic scan with throughput display
42
+ arela scan
43
+
44
+ # Scan with percentage progress (counts files first)
45
+ arela scan --count-first
46
+
47
+ # Scan to specific API target
48
+ arela scan --api cliente
49
+ ```
50
+
51
+ ### View Scan Instances
52
+
53
+ ```bash
54
+ # List all registered instances
55
+ curl -H "x-api-key: $TOKEN" \
56
+ http://localhost:3010/api/uploader/scan/instances
57
+
58
+ # Get stale instances (no scan > 90 days)
59
+ curl -H "x-api-key: $TOKEN" \
60
+ "http://localhost:3010/api/uploader/scan/stale-instances?days=90"
61
+ ```
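The same stale-instances request can be prepared from JavaScript. The sketch below only builds the URL and headers; the helper name `buildStaleInstancesRequest` is hypothetical, and only the endpoint path, the `days` parameter, and the `x-api-key` header are taken from the curl examples above.

```javascript
// Hypothetical helper: builds the request without sending it.
function buildStaleInstancesRequest(baseUrl, token, days = 90) {
  const url = new URL('/api/uploader/scan/stale-instances', baseUrl);
  url.searchParams.set('days', String(days));
  return {
    url: url.toString(),
    options: { method: 'GET', headers: { 'x-api-key': token } },
  };
}

const request = buildStaleInstancesRequest('http://localhost:3010', 'your-token');
console.log(request.url);
// To send it on Node 18+ (inside an async function):
//   const response = await fetch(request.url, request.options);
```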
62
+
63
+ ### Deactivate Instance
64
+
65
+ ```bash
66
+ curl -X PATCH \
67
+ -H "x-api-key: $TOKEN" \
68
+ -H "Content-Type: application/json" \
69
+ -d '{"tableName":"file_stats_company_server_path"}' \
70
+ http://localhost:3010/api/uploader/scan/deactivate
71
+ ```
72
+
73
+ ## Files Modified
74
+
75
+ ### Backend
76
+
77
+ - ✅ `src/uploader/entities/cli-registry.entity.ts` - New entity
78
+ - ✅ `src/uploader/services/file-stats-table-manager.service.ts` - New service
79
+ - ✅ `src/uploader/services/uploader.service.ts` - Added scan methods
80
+ - ✅ `src/uploader/controllers/uploader.controller.ts` - Added scan endpoints
81
+ - ✅ `src/uploader/uploader.module.ts` - Updated imports
82
+
83
+ ### CLI
84
+
85
+ - ✅ `src/commands/ScanCommand.js` - New command
86
+ - ✅ `src/services/ScanApiService.js` - New API service
87
+ - ✅ `src/config/config.js` - Added scan configuration
88
+ - ✅ `src/index.js` - Registered scan command
89
+ - ✅ `.env.template` - Added scan variables
90
+ - ✅ `docs/ARELA_SCAN_IMPLEMENTATION.md` - Documentation
91
+
92
+ ## Table Schema
93
+
94
+ Each scan instance creates a table:
95
+
96
+ ```
97
+ file_stats_<company>_<server>_<path>
98
+ ├── id (uuid)
99
+ ├── file_name (varchar)
100
+ ├── file_extension (varchar)
101
+ ├── directory_path (text)
102
+ ├── relative_path (text)
103
+ ├── absolute_path (text) [unique]
104
+ ├── size_bytes (bigint)
105
+ ├── modified_at (timestamp)
106
+ ├── scan_timestamp (timestamp)
107
+ └── created_at (timestamp)
108
+ ```
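As an illustration of how the `file_stats_<company>_<server>_<path>` name could be derived, here is a minimal sketch. The function names and the sanitization rule (lowercase, non-alphanumeric runs collapsed to `_`) are assumptions, not the backend's actual logic, and the `<path>` part is assumed to come from `ARELA_BASE_PATH_LABEL`.

```javascript
// Assumption: each part is lowercased and non-alphanumeric runs become "_".
function sanitizeIdentifierPart(value) {
  return String(value)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '_')
    .replace(/^_+|_+$/g, '');
}

// Hypothetical helper: joins the sanitized parts into a table name.
function buildTableName(companySlug, serverId, basePathLabel) {
  return ['file_stats', companySlug, serverId, basePathLabel]
    .map(sanitizeIdentifierPart)
    .join('_');
}

console.log(buildTableName('acme_corp', 'nas01', 'data'));
// Under this rule, different inputs can map to the same name, which is one
// way a table name collision could arise:
console.log(buildTableName('acme-corp', 'NAS01', 'data'));
```

Both calls produce `file_stats_acme_corp_nas01_data` under the assumed rule, which is why changing the server ID or base path label resolves a collision.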
109
+
110
+ ## Troubleshooting
111
+
112
+ ### Error: Missing configuration
113
+
114
+ **Cause**: Required env vars not set
115
+ **Fix**: Set `ARELA_COMPANY_SLUG` and `ARELA_SERVER_ID`
116
+
117
+ ### Error: Table name collision
118
+
119
+ **Cause**: Same identifiers used with different base path
120
+ **Fix**: Change server ID or base path label
121
+
122
+ ### Error: Cannot connect to API
123
+
124
+ **Cause**: Backend not running or wrong URL
125
+ **Fix**: Verify `ARELA_API_URL` and start backend
126
+
127
+ ## Next Steps
128
+
129
+ 1. ✅ Implement `arela scan` (DONE)
130
+ 2. ⏳ Implement `arela identify` (detect pedimentos from PDFs)
131
+ 3. ⏳ Implement `arela propagate` (propagate arela_path)
132
+ 4. ⏳ Implement `arela push` (upload files by RFC)
133
+
134
+ ## Performance Tips
135
+
136
+ - Increase `SCAN_BATCH_SIZE` for better throughput (default: 2000)
137
+ - Use `MAX_API_CONNECTIONS` to match backend replicas
138
+ - Run scans during off-peak hours for large datasets
139
+ - Use `--count-first` only when progress percentage is needed
@@ -0,0 +1,414 @@
1
+ # Detection Attempt Tracking
2
+
3
+ ## Overview
4
+
5
+ The `arela identify` command now tracks detection attempts per file. This avoids reprocessing files unnecessarily and records a categorized error for debugging when detection fails.
6
+
7
+ ## New Fields in file_stats_* Tables
8
+
9
+ ### Attempt Tracking
10
+
11
+ | Field | Type | Default | Description |
12
+ |-------|------|---------|-------------|
13
+ | `detection_attempts` | INTEGER | 0 | Number of times detection has been attempted |
14
+ | `max_detection_attempts` | INTEGER | 3 | Maximum attempts before giving up |
15
+ | `is_not_pedimento` | BOOLEAN | FALSE | True if definitely not a pedimento document |
16
+
17
+ ### Enhanced Error Reporting
18
+
19
+ The `detection_error` field now contains categorized, descriptive errors:
20
+
21
+ | Error Category | Example | Meaning |
22
+ |----------------|---------|---------|
23
+ | `FILE_NOT_FOUND` | File does not exist on filesystem | File was moved or deleted after scan |
24
+ | `FILE_TOO_LARGE` | File size 75.5MB exceeds 50MB limit | PDF too large to process |
25
+ | `NOT_PEDIMENTO` | Missing key markers: "FORMA SIMPLIFICADA DE PEDIMENTO" | Definitely not a pedimento |
26
+ | `INCOMPLETE_PEDIMENTO` | Missing fields: rfc, numPedimento | Partial match, needs matcher improvement |
27
+ | `PDF_PARSE_ERROR` | Failed to extract text from PDF | Corrupted or encrypted PDF |
28
+ | `TEXT_EXTRACTION_ERROR` | Cannot extract text | Unsupported PDF format |
29
+ | `TIMEOUT` | Detection timeout | Processing took too long |
30
+
31
+ ## Optimized Query Logic
32
+
33
+ ### Previous Behavior
34
+
35
+ ```sql
36
+ -- Retrieved ALL unprocessed PDFs every time
37
+ WHERE file_extension = 'pdf'
38
+ AND detection_attempted_at IS NULL
39
+ ```
40
+
41
+ **Problem**: The same files were retried on every run, even if they had failed repeatedly.
42
+
43
+ ### New Behavior
44
+
45
+ ```sql
46
+ -- Only retrieve PDFs that haven't reached max attempts
47
+ WHERE file_extension = 'pdf'
48
+ AND is_not_pedimento = FALSE
49
+ AND (detection_attempts < max_detection_attempts OR detection_attempts IS NULL)
50
+ ```
51
+
52
+ **Benefits**:
53
+ 1. **Skips non-pedimentos**: Files marked as `is_not_pedimento = TRUE` are never retrieved again
54
+ 2. **Respects attempt limit**: Files that reached `max_detection_attempts` are skipped
55
+ 3. **Faster queries**: Uses optimized composite index on `(file_extension, is_not_pedimento, detection_attempts, max_detection_attempts)`
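The new `WHERE` clause can be mirrored as an in-memory predicate, which is handy for unit tests or for reasoning about which rows are picked up. This is a sketch only; `isEligibleForDetection` is a hypothetical name, and the row shape simply reuses the column names above.

```javascript
// Mirrors: file_extension = 'pdf'
//          AND is_not_pedimento = FALSE
//          AND (detection_attempts < max_detection_attempts
//               OR detection_attempts IS NULL)
function isEligibleForDetection(row) {
  if (row.file_extension !== 'pdf') return false;
  // As in SQL, a NULL flag does not satisfy "= FALSE".
  if (row.is_not_pedimento !== false) return false;
  // Never-attempted rows (NULL attempts) are always eligible.
  if (row.detection_attempts === null || row.detection_attempts === undefined) return true;
  return row.detection_attempts < row.max_detection_attempts;
}

const rows = [
  { file_extension: 'pdf', is_not_pedimento: false, detection_attempts: 1, max_detection_attempts: 3 },
  { file_extension: 'pdf', is_not_pedimento: true, detection_attempts: 1, max_detection_attempts: 3 },
  { file_extension: 'pdf', is_not_pedimento: false, detection_attempts: 3, max_detection_attempts: 3 },
];
console.log(rows.filter(isEligibleForDetection).length);
```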
56
+
57
+ ## Detection Logic Flow
58
+
59
+ ```
60
+ ┌─────────────────────────┐
61
+ │  Fetch PDFs to detect   │
62
+ │  (respects attempts)    │
63
+ └───────────┬─────────────┘
64
+             │
65
+             ▼
66
+ ┌─────────────────────────┐
67
+ │  Check file exists      │
68
+ └───────────┬─────────────┘
69
+             │
70
+             ├── File not found ───────────┐
71
+             │                             │
72
+             ▼                             │
73
+ ┌─────────────────────────┐               │
74
+ │  Check file size        │               │
75
+ │  (max 50MB)             │               │
76
+ └───────────┬─────────────┘               │
77
+             │                             │
78
+             ├── Too large ────────────────┤
79
+             │                             │
80
+             ▼                             │
81
+ ┌─────────────────────────┐               │
82
+ │  Extract text & detect  │               │
83
+ └───────────┬─────────────┘               │
84
+             │                             │
85
+             ├── Pedimento found ──────────┤
86
+             │   (success!)                │
87
+             │                             │
88
+             ├── No pedimento markers ─────┤
89
+             │   (is_not_pedimento=TRUE)   │
90
+             │                             │
91
+             ├── Partial match ────────────┤
92
+             │   (missing fields)          │
93
+             │                             │
94
+             └── Error ────────────────────┤
95
+                                           │
96
+                                           ▼
97
+                               ┌──────────────────────┐
98
+                               │ Update database:     │
99
+                               │ - detection_attempts │
100
+                               │ - detection_error    │
101
+                               │ - is_not_pedimento   │
102
+                               └──────────────────────┘
103
+ ```
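The flow above can be sketched as a pure function that turns one file's outcome into the database update. Everything here is illustrative: `runDetection`, the `file` shape, and the `detect` callback (returning `{ hasMarkers, missingFields }` or throwing) are hypothetical, the 50MB limit is assumed to mean 50 × 1024 × 1024 bytes, and the `detected_type` value is a placeholder.

```javascript
const MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024; // assumption: "50MB" means 50 MiB

// Returns the fields to write back for one file. Every branch increments
// detection_attempts, as in the diagram.
function runDetection(row, file, detect) {
  const update = {
    detection_attempts: (row.detection_attempts ?? 0) + 1,
    detection_error: null,
    is_not_pedimento: false,
    detected_type: null,
  };

  if (!file.exists) {
    update.detection_error = 'FILE_NOT_FOUND: File does not exist on filesystem.';
    return update;
  }
  if (file.sizeBytes > MAX_FILE_SIZE_BYTES) {
    const sizeMb = (file.sizeBytes / (1024 * 1024)).toFixed(1);
    update.detection_error = `FILE_TOO_LARGE: File size ${sizeMb}MB exceeds 50MB limit.`;
    return update;
  }

  try {
    const result = detect(file); // hypothetical matcher callback
    if (!result.hasMarkers) {
      update.is_not_pedimento = true;
      update.detection_error = 'NOT_PEDIMENTO: File does not match pedimento-simplificado pattern.';
    } else if (result.missingFields.length > 0) {
      update.detection_error = `INCOMPLETE_PEDIMENTO: missing fields: ${result.missingFields.join(', ')}.`;
    } else {
      update.detected_type = 'pedimento-simplificado'; // placeholder value
    }
  } catch (err) {
    update.detection_error = `PDF_PARSE_ERROR: ${err.message}`;
  }
  return update;
}

const outcome = runDetection(
  { detection_attempts: 0 },
  { exists: true, sizeBytes: 1024 },
  () => ({ hasMarkers: true, missingFields: ['rfc', 'numPedimento'] }),
);
console.log(outcome.detection_error);
```

Keeping filesystem access and PDF parsing outside the function makes each branch of the diagram easy to test in isolation.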
104
+
105
+ ## Error Categories Explained
106
+
107
+ ### 1. **FILE_NOT_FOUND**
108
+
109
+ ```
110
+ FILE_NOT_FOUND: File does not exist on filesystem. May have been moved or deleted after scan.
111
+ ```
112
+
113
+ **Cause**: File was present during `arela scan` but no longer exists
114
+ **Action**: No retry is needed. Run `arela scan` again if the file is restored
115
+ **Attempt behavior**: Counts toward max attempts
116
+
117
+ ### 2. **FILE_TOO_LARGE**
118
+
119
+ ```
120
+ FILE_TOO_LARGE: File size 75.5MB exceeds 50MB limit.
121
+ ```
122
+
123
+ **Cause**: PDF exceeds 50MB processing limit
124
+ **Action**: Increase limit if needed, or process manually
125
+ **Attempt behavior**: Counts toward max attempts
126
+
127
+ ### 3. **NOT_PEDIMENTO**
128
+
129
+ ```
130
+ NOT_PEDIMENTO: File does not match pedimento-simplificado pattern. Missing key markers: "FORMA SIMPLIFICADA DE PEDIMENTO".
131
+ ```
132
+
133
+ **Cause**: File doesn't contain pedimento markers
134
+ **Action**: File is marked `is_not_pedimento = TRUE` and never retrieved again
135
+ **Attempt behavior**: Counts toward max attempts, but file is excluded from future queries
136
+
137
+ ### 4. **INCOMPLETE_PEDIMENTO**
138
+
139
+ ```
140
+ INCOMPLETE_PEDIMENTO: Detected as potential pedimento but missing fields: rfc, numPedimento. Matcher may need improvement.
141
+ ```
142
+
143
+ **Cause**: File has some pedimento characteristics but missing required fields
144
+ **Action**: Review matcher patterns in `pedimento-simplificado.js`
145
+ **Attempt behavior**: Will retry up to max attempts (may succeed if matcher improves)
146
+
147
+ ### 5. **PDF_PARSE_ERROR**
148
+
149
+ ```
150
+ PDF_PARSE_ERROR: Failed to extract text from PDF: Encrypted PDF
151
+ ```
152
+
153
+ **Cause**: Corrupted, encrypted, or malformed PDF
154
+ **Action**: Check file integrity, decrypt if password-protected
155
+ **Attempt behavior**: Will retry up to max attempts
156
+
157
+ ### 6. **TEXT_EXTRACTION_ERROR**
158
+
159
+ ```
160
+ TEXT_EXTRACTION_ERROR: Cannot extract text: Unsupported format
161
+ ```
162
+
163
+ **Cause**: PDF format not supported by text extractor
164
+ **Action**: Convert PDF to standard format
165
+ **Attempt behavior**: Will retry up to max attempts
166
+
167
+ ## Statistics Display
168
+
169
+ ### Initial Stats
170
+
171
+ ```
172
+ 📈 Detection Status:
173
+ Total PDFs: 1000
174
+ Detected: 850
175
+ Pending: 100
176
+ Not Pedimento: 40
177
+ Max Attempts Reached: 10
178
+ Errors: 25
179
+ ```
180
+
181
+ **Pending**: Files that can still be processed (haven't reached max attempts)
182
+ **Not Pedimento**: Files marked as definitely not pedimentos (skipped in future runs)
183
+ **Max Attempts Reached**: Files that exhausted retry attempts
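One way to read these buckets is as a classification of each row, checked in a fixed order. The sketch below is an assumption about how the counts could be derived, not the CLI's actual query; `detectionStatus` and `summarize` are hypothetical names, and a non-null `detected_type` is assumed to mean the file was detected.

```javascript
// Classifies one row into a single status bucket.
function detectionStatus(row) {
  if (row.detected_type) return 'detected';
  if (row.is_not_pedimento) return 'not_pedimento';
  if ((row.detection_attempts ?? 0) >= row.max_detection_attempts) return 'max_attempts_reached';
  return 'pending';
}

// Counts rows per bucket.
function summarize(rows) {
  const counts = { detected: 0, pending: 0, not_pedimento: 0, max_attempts_reached: 0 };
  for (const row of rows) {
    counts[detectionStatus(row)] += 1;
  }
  return counts;
}

console.log(summarize([
  { detected_type: 'pedimento-simplificado', is_not_pedimento: false, detection_attempts: 1, max_detection_attempts: 3 },
  { detected_type: null, is_not_pedimento: true, detection_attempts: 1, max_detection_attempts: 3 },
  { detected_type: null, is_not_pedimento: false, detection_attempts: 3, max_detection_attempts: 3 },
  { detected_type: null, is_not_pedimento: false, detection_attempts: null, max_detection_attempts: 3 },
]));
```

Because each row lands in exactly one bucket, the four counts add up to the total, as they do in the sample output (850 + 100 + 40 + 10 = 1000).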
184
+
185
+ ### Final Stats
186
+
187
+ ```
188
+ ✅ Identification Complete!
189
+
190
+ 📊 Results:
191
+ Processed: 100 files
192
+ Pedimentos Detected: 85
193
+ Errors: 5
194
+ Duration: 11.5s
195
+ Speed: 8.7 files/sec
196
+
197
+ 📈 Final Status:
198
+ Total PDFs: 1000
199
+ Detected: 935
200
+ Pending: 0
201
+ Not Pedimento: 50
202
+ Max Attempts Reached: 15
203
+ Errors: 30
204
+
205
+ ⚠️ 15 PDFs reached max detection attempts.
206
+ Run with increased max_detection_attempts if needed, or review matcher patterns.
207
+ ```
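The speed figure is the number of processed files divided by the duration in seconds. A minimal sketch of that arithmetic, with a hypothetical `filesPerSecond` helper:

```javascript
// Throughput in files per second; returns 0 for a non-positive duration.
function filesPerSecond(processedFiles, durationSeconds) {
  if (durationSeconds <= 0) return 0;
  return processedFiles / durationSeconds;
}

// 100 files in 11.5 seconds
console.log(filesPerSecond(100, 11.5).toFixed(1)); // "8.7"
```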
208
+
209
+ ## Debugging Failed Detections
210
+
211
+ ### Query Files by Error Category
212
+
213
+ ```sql
214
+ -- Find files with specific error type
215
+ SELECT relative_path, detection_error, detection_attempts
216
+ FROM cli.file_stats_acme_corp_nas01_data
217
+ WHERE detection_error LIKE 'INCOMPLETE_PEDIMENTO%'
218
+ ORDER BY detection_attempts DESC
219
+ LIMIT 50;
220
+ ```
221
+
222
+ ### Find Files Marked as Not Pedimento
223
+
224
+ ```sql
225
+ -- Review files marked as not pedimentos
226
+ SELECT relative_path, detection_error, size_bytes
227
+ FROM cli.file_stats_acme_corp_nas01_data
228
+ WHERE is_not_pedimento = TRUE
229
+ ORDER BY size_bytes DESC
230
+ LIMIT 50;
231
+ ```
232
+
233
+ ### Find Files That Reached Max Attempts
234
+
235
+ ```sql
236
+ -- Check files that exhausted retries
237
+ SELECT relative_path, detection_error, detection_attempts, max_detection_attempts
238
+ FROM cli.file_stats_acme_corp_nas01_data
239
+ WHERE detection_attempts >= max_detection_attempts
240
+ AND detected_type IS NULL
241
+ ORDER BY detection_attempts DESC
242
+ LIMIT 50;
243
+ ```
244
+
245
+ ### Reset Specific Files for Retry
246
+
247
+ ```sql
248
+ -- Reset detection attempts for specific files
249
+ UPDATE cli.file_stats_acme_corp_nas01_data
250
+ SET
251
+ detection_attempts = 0,
252
+ detection_attempted_at = NULL,
253
+ detection_error = NULL,
254
+ is_not_pedimento = FALSE
255
+ WHERE relative_path LIKE '2024/3019796/%';
256
+ ```
257
+
258
+ ### Increase Max Attempts for All Files
259
+
260
+ ```sql
261
+ -- Allow more retries for all files
262
+ UPDATE cli.file_stats_acme_corp_nas01_data
263
+ SET max_detection_attempts = 5
264
+ WHERE max_detection_attempts = 3;
265
+ ```
266
+
267
+ ## Performance Impact
268
+
269
+ ### Query Performance
270
+
271
+ **Before** (without attempt tracking):
272
+ ```sql
273
+ EXPLAIN ANALYZE
274
+ SELECT * FROM cli.file_stats_company_server_path
275
+ WHERE file_extension = 'pdf' AND detection_attempted_at IS NULL;
276
+
277
+ -- Seq Scan on file_stats_... (cost=0.00..1250.00 rows=5000)
278
+ -- Planning Time: 0.5ms
279
+ -- Execution Time: 45.2ms
280
+ ```
281
+
282
+ **After** (with optimized index):
283
+ ```sql
284
+ EXPLAIN ANALYZE
285
+ SELECT * FROM cli.file_stats_company_server_path
286
+ WHERE file_extension = 'pdf'
287
+ AND is_not_pedimento = FALSE
288
+ AND detection_attempts < max_detection_attempts;
289
+
290
+ -- Index Scan using idx_..._detection_pending (cost=0.42..85.50 rows=1000)
291
+ -- Planning Time: 0.3ms
292
+ -- Execution Time: 8.1ms
293
+ ```
294
+
295
+ **Improvement**: ~5.6x faster query execution (45.2ms vs. 8.1ms)
296
+
297
+ ### Storage Impact
298
+
299
+ **Additional storage per record**:
300
+ - `detection_attempts`: 4 bytes (INTEGER)
301
+ - `max_detection_attempts`: 4 bytes (INTEGER)
302
+ - `is_not_pedimento`: 1 byte (BOOLEAN)
303
+ - Total: **9 bytes per record**
304
+
305
+ For 1 million PDFs: ~9 MB additional storage (negligible)
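The estimate is a straight multiplication of the raw column widths. The sketch below reproduces it; it counts only the three column sizes listed above, so actual on-disk usage may differ because of alignment padding and per-row overhead. `additionalStorageMB` is a hypothetical helper.

```javascript
// Raw column widths in bytes: two INTEGER columns and one BOOLEAN.
const BYTES_PER_RECORD = 4 + 4 + 1;

// Additional storage in megabytes (1 MB = 1,000,000 bytes).
function additionalStorageMB(recordCount) {
  return (recordCount * BYTES_PER_RECORD) / 1_000_000;
}

console.log(additionalStorageMB(1_000_000)); // 9
```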
306
+
307
+ ### Index Size
308
+
309
+ **New partial indexes**:
310
+ - `idx_detection_pending`: ~50-100 KB per 10,000 pending files
311
+ - `idx_detection_errors`: ~20-50 KB per 10,000 errors
312
+
313
+ Total index overhead: **< 1 MB per 100,000 files**
314
+
315
+ ## Best Practices
316
+
317
+ ### 1. **Review Error Patterns**
318
+
319
+ Regularly check error distribution:
320
+
321
+ ```sql
322
+ SELECT
323
+ SPLIT_PART(detection_error, ':', 1) as error_category,
324
+ COUNT(*) as count
325
+ FROM cli.file_stats_company_server_path
326
+ WHERE detection_error IS NOT NULL
327
+ GROUP BY error_category
328
+ ORDER BY count DESC;
329
+ ```
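The same grouping can be done on rows already loaded in application code, for example when post-processing an export. This sketch assumes the category is everything before the first `:` in `detection_error`, matching the `SPLIT_PART` call above; `countErrorCategories` is a hypothetical name.

```javascript
// Groups rows by error category and sorts by count, highest first.
function countErrorCategories(rows) {
  const counts = new Map();
  for (const row of rows) {
    if (row.detection_error == null) continue; // skip rows without an error
    const category = row.detection_error.split(':', 1)[0];
    counts.set(category, (counts.get(category) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countErrorCategories([
  { detection_error: 'INCOMPLETE_PEDIMENTO: missing fields: rfc' },
  { detection_error: 'NOT_PEDIMENTO: Missing key markers' },
  { detection_error: 'INCOMPLETE_PEDIMENTO: missing fields: patente' },
  { detection_error: null },
]));
```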
330
+
331
+ ### 2. **Adjust Max Attempts**
332
+
333
+ If you are improving the matchers, raise the attempt limit for files that failed with `INCOMPLETE_PEDIMENTO` so they are retried:
334
+
335
+ ```sql
336
+ UPDATE cli.file_stats_company_server_path
337
+ SET max_detection_attempts = 5
338
+ WHERE detection_error LIKE 'INCOMPLETE_PEDIMENTO%'
339
+ AND detection_attempts >= max_detection_attempts;
340
+ ```
341
+
342
+ ### 3. **Review Not Pedimento Files**
343
+
344
+ Periodically verify files marked as not pedimentos:
345
+
346
+ ```sql
347
+ -- Sample random not-pedimento files for manual review
348
+ SELECT relative_path, size_bytes, detection_error
349
+ FROM cli.file_stats_company_server_path
350
+ WHERE is_not_pedimento = TRUE
351
+ ORDER BY RANDOM()
352
+ LIMIT 20;
353
+ ```
354
+
355
+ ### 4. **Clean Up Old Errors**
356
+
357
+ After fixing matcher bugs, reset affected files:
358
+
359
+ ```sql
360
+ -- Reset files with specific error after matcher improvement
361
+ UPDATE cli.file_stats_company_server_path
362
+ SET
363
+ detection_attempts = 0,
364
+ detection_error = NULL,
365
+ is_not_pedimento = FALSE
366
+ WHERE detection_error ILIKE 'INCOMPLETE_PEDIMENTO:%missing fields:%patente%'
367
+ AND detection_attempts >= max_detection_attempts;
368
+ ```
369
+
370
+ ## Matcher Improvement Workflow
371
+
372
+ When `INCOMPLETE_PEDIMENTO` errors indicate matcher issues:
373
+
374
+ 1. **Identify patterns**:
375
+ ```sql
376
+ SELECT detection_error, COUNT(*)
377
+ FROM cli.file_stats_company_server_path
378
+ WHERE detection_error LIKE 'INCOMPLETE_PEDIMENTO%'
379
+ GROUP BY detection_error;
380
+ ```
381
+
382
+ 2. **Sample affected files**:
383
+ ```sql
384
+ SELECT absolute_path, detection_error
385
+ FROM cli.file_stats_company_server_path
386
+ WHERE detection_error ILIKE 'INCOMPLETE_PEDIMENTO:%missing fields:%patente%'
387
+ LIMIT 5;
388
+ ```
389
+
390
+ 3. **Review PDFs manually** to understand patterns
391
+
392
+ 4. **Update matcher** in `src/document-types/pedimento-simplificado.js`
393
+
394
+ 5. **Reset affected files**:
395
+ ```sql
396
+ UPDATE cli.file_stats_company_server_path
397
+ SET detection_attempts = 0, detection_error = NULL
398
+ WHERE detection_error ILIKE 'INCOMPLETE_PEDIMENTO:%missing fields:%patente%';
399
+ ```
400
+
401
+ 6. **Re-run detection**:
402
+ ```bash
403
+ arela identify
404
+ ```
405
+
406
+ ## Conclusion
407
+
408
+ The attempt tracking system provides:
409
+ - **Better performance**: Avoids reprocessing files that cannot be detected
410
+ - **Better debugging**: Categorized errors show exactly what's wrong
411
+ - **Better visibility**: Statistics show what's pending vs. hopeless
412
+ - **Better efficiency**: Optimized indexes make the pending-files query roughly 5.6x faster in the example above
413
+
414
+ This enables iterative improvement of detection matchers while maintaining system performance.