@arela/uploader 1.0.1 → 1.0.3
- package/.env.template +70 -0
- package/docs/API_RETRY_MECHANISM.md +338 -0
- package/docs/ARELA_IDENTIFY_IMPLEMENTATION.md +489 -0
- package/docs/ARELA_IDENTIFY_QUICKREF.md +186 -0
- package/docs/ARELA_PROPAGATE_IMPLEMENTATION.md +581 -0
- package/docs/ARELA_PROPAGATE_QUICKREF.md +272 -0
- package/docs/ARELA_PUSH_IMPLEMENTATION.md +577 -0
- package/docs/ARELA_PUSH_QUICKREF.md +322 -0
- package/docs/ARELA_SCAN_IMPLEMENTATION.md +373 -0
- package/docs/ARELA_SCAN_QUICKREF.md +139 -0
- package/docs/DETECTION_ATTEMPT_TRACKING.md +414 -0
- package/docs/MIGRATION_UPLOADER_TO_FILE_STATS.md +1020 -0
- package/docs/MULTI_LEVEL_DIRECTORY_SCANNING.md +494 -0
- package/docs/STATS_COMMAND_SEQUENCE_DIAGRAM.md +287 -0
- package/docs/STATS_COMMAND_SIMPLE.md +93 -0
- package/package.json +4 -2
- package/src/commands/IdentifyCommand.js +486 -0
- package/src/commands/PropagateCommand.js +474 -0
- package/src/commands/PushCommand.js +473 -0
- package/src/commands/ScanCommand.js +516 -0
- package/src/config/config.js +177 -7
- package/src/file-detection.js +9 -10
- package/src/index.js +150 -0
- package/src/services/DatabaseService.js +2 -2
- package/src/services/ScanApiService.js +646 -0
- package/src/services/upload/ApiUploadService.js +12 -0
@@ -0,0 +1,489 @@
# Arela Identify Command Implementation

## Overview

The `arela identify` command is an optimized replacement for the legacy `detect --detect-pdfs` command. It identifies pedimento-simplificado documents from PDF files scanned by the `arela scan` command, extracting metadata and composing the target upload path (`arela_path`).

## Key Improvements Over Legacy Command

### 1. **Architecture**
- **Legacy**: Uses Supabase directly, queries `uploader` table
- **New**: Uses configured API, queries dynamic `file_stats_*` tables

### 2. **Resource Optimization**
- **Local Detection**: Runs detection on CLI host to leverage local CPU/memory
- **Batch Processing**: Fetches files in batches, processes in parallel
- **Connection Pooling**: Reuses HTTP connections for API efficiency

### 3. **Table Structure**
- **Legacy**: Single global `uploader` table
- **New**: Per-instance `file_stats_<company>_<server>_<path>` tables with optimized indexes
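
As a rough illustration of the per-instance naming scheme, the sketch below composes a table name from the company slug, server ID, and base path label. The helper names `sanitizeSegment` and `buildTableName` and the sanitization rule (lowercase, non-alphanumerics collapsed to `_`) are assumptions made for this example; the real logic lives in the scan command and may differ.

```javascript
// Hypothetical helper: composes a per-instance table name of the form
// file_stats_<company>_<server>_<path>. The sanitization rule is an assumption.
function sanitizeSegment(value) {
  return String(value)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '_') // collapse anything non-alphanumeric into "_"
    .replace(/^_+|_+$/g, '');    // trim leading/trailing underscores
}

function buildTableName(companySlug, serverId, basePathLabel) {
  const parts = [companySlug, serverId, basePathLabel].map(sanitizeSegment);
  if (parts.some((part) => part.length === 0)) {
    throw new Error('companySlug, serverId and basePathLabel must be non-empty');
  }
  return `file_stats_${parts.join('_')}`;
}

console.log(buildTableName('acme_corp', 'nas01', 'data'));
// file_stats_acme_corp_nas01_data
```

Deriving the name deterministically from configuration is what would let `scan` and `identify` agree on the same table without sharing state.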

### 4. **Detection Fields**
- `detected_type` - Document type (e.g., "pedimento_simplificado")
- `detected_pedimento` - Pedimento number
- `detected_pedimento_year` - Year extracted from pedimento
- `rfc` - RFC (tax ID) extracted from document
- `arela_path` - Computed upload path (RFC/Year/Patente/Aduana/Pedimento/)
- `detection_attempted_at` - Timestamp of detection attempt
- `detection_error` - Error message if detection failed

## Database Schema Updates

### New Columns in file_stats_* Tables

```sql
-- Detection fields
detected_type VARCHAR(100),
detected_pedimento VARCHAR(50),
detected_pedimento_year INTEGER,
rfc VARCHAR(20),
arela_path TEXT,
detection_attempted_at TIMESTAMP,
detection_error TEXT
```

### Optimized Indexes

```sql
-- Fast filtering for PDFs with/without detection
CREATE INDEX idx_<table>_ext_detected
  ON cli.<table>(file_extension, detected_type)
  WHERE detection_attempted_at IS NOT NULL;

-- Fast lookup by pedimento number
CREATE INDEX idx_<table>_pedimento
  ON cli.<table>(detected_pedimento)
  WHERE detected_pedimento IS NOT NULL;

-- Fast RFC-based queries (for push command)
CREATE INDEX idx_<table>_rfc
  ON cli.<table>(rfc)
  WHERE rfc IS NOT NULL;

-- Fast arela_path queries (for propagate command)
CREATE INDEX idx_<table>_arela_path
  ON cli.<table>(arela_path)
  WHERE arela_path IS NOT NULL;

-- Fast detection pending queries
CREATE INDEX idx_<table>_detection_pending
  ON cli.<table>(file_extension, detection_attempted_at)
  WHERE file_extension = 'pdf' AND detection_attempted_at IS NULL;
```

## Backend Implementation

### 1. FileStatsTableManagerService

**File**: `arela-api/src/uploader/services/file-stats-table-manager.service.ts`

**New Methods**:

```typescript
// Fetch PDFs that haven't been processed yet
async fetchPdfsForDetection(
  tableName: string,
  offset: number,
  limit: number
): Promise<any[]>

// Batch update detection results
async batchUpdateDetection(
  tableName: string,
  updates: DetectionUpdate[]
): Promise<{ updated: number; errors: number }>

// Get detection statistics
async getDetectionStats(
  tableName: string
): Promise<{ totalPdfs: number; detected: number; pending: number; errors: number }>
```

### 2. UploaderController Endpoints

**File**: `arela-api/src/uploader/controllers/uploader.controller.ts`

**New Endpoints**:

```
GET /api/uploader/scan/pdfs-for-detection?tableName=X&offset=0&limit=100
  → Fetch PDFs ready for detection

PATCH /api/uploader/scan/batch-update-detection?tableName=X
  → Update detection results for batch of files

GET /api/uploader/scan/detection-stats?tableName=X
  → Get detection statistics (total, detected, pending, errors)
```

## CLI Implementation

### 1. IdentifyCommand

**File**: `arela-uploader/src/commands/IdentifyCommand.js`

**Key Features**:
- Validates scan configuration (same config as scan command)
- Fetches PDF files from API in batches
- Detects files locally using `FileDetectionService`
- Processes files in parallel with configurable concurrency (default: 10)
- Batch updates results back to API
- Real-time progress bar with throughput metrics
- Detailed statistics on completion
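
The "parallel with configurable concurrency" feature can be pictured with a small worker-lane pattern. This is a minimal sketch, not the code in `IdentifyCommand.js`: the function name `mapWithConcurrency` and the `{ ok, value | error }` result shape are assumptions for illustration.

```javascript
// Runs `worker` over `items` with at most `concurrency` calls in flight.
// Results keep the input order; a failing item is captured, not rethrown.
async function mapWithConcurrency(items, concurrency, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;

  async function runLane() {
    while (nextIndex < items.length) {
      const index = nextIndex++; // claim the next item; JS is single-threaded, so no race
      try {
        results[index] = { ok: true, value: await worker(items[index], index) };
      } catch (error) {
        const message = error && error.message ? error.message : String(error);
        results[index] = { ok: false, error: message };
      }
    }
  }

  // Start up to `concurrency` lanes; each lane pulls items until none remain.
  const laneCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

Capturing per-item failures instead of rejecting the whole batch matches the idea that one unreadable PDF should be recorded as an error while the rest of the batch continues.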

**Workflow**:
```
1. Validate scan config → Ensure ARELA_COMPANY_SLUG, ARELA_SERVER_ID set
2. Generate table name → Same logic as scan command
3. Fetch detection stats → Show initial status
4. Loop until no more files:
   a. Fetch batch of PDFs (default: 100)
   b. Detect locally in parallel (concurrency: 10)
   c. Update results via API
   d. Update progress bar
5. Show final statistics
```
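
Step 4 of the workflow above can be sketched as a fetch, detect, update loop. Everything here is an assumption for illustration: `runIdentify`, the injected `api` and `detect` objects, the `id` and `detectedType` field names, and in particular the choice to always fetch from offset 0 (which only works if the endpoint returns PDFs that have no detection attempt yet). In-batch parallelism and the progress bar are omitted to keep the loop readable.

```javascript
// Sketch of the identify loop. `api` and `detect` are injected so the example
// stays self-contained; their shapes are assumptions, not the real services.
async function runIdentify({ api, detect, tableName, batchSize = 100 }) {
  const totals = { processed: 0, detected: 0, errors: 0 };

  for (;;) {
    // Assumption: the endpoint returns only PDFs without a detection attempt,
    // so updated rows drop out and the next page is again at offset 0.
    const files = await api.fetchPdfsForDetection(tableName, 0, batchSize);
    if (files.length === 0) break;

    const updates = [];
    for (const file of files) {
      try {
        const result = await detect(file);
        updates.push({ id: file.id, ...result });
        if (result.detectedType) totals.detected += 1;
      } catch (error) {
        // A failed detection is still recorded so the file is not retried forever.
        const message = error && error.message ? error.message : String(error);
        updates.push({ id: file.id, detectionError: message });
        totals.errors += 1;
      }
    }

    const { updated } = await api.batchUpdateDetection(tableName, updates);
    totals.processed += files.length;
    if (updated === 0) break; // safety stop: nothing was persisted, avoid an endless loop
  }
  return totals;
}
```

If the real endpoint instead pages over a fixed result set, the offset would need to advance by the batch size on each iteration.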

### 2. ScanApiService Updates

**File**: `arela-uploader/src/services/ScanApiService.js`

**New Methods**:

```javascript
async fetchPdfsForDetection(tableName, offset, limit)
  → GET /api/uploader/scan/pdfs-for-detection

async batchUpdateDetection(tableName, updates)
  → PATCH /api/uploader/scan/batch-update-detection

async getDetectionStats(tableName)
  → GET /api/uploader/scan/detection-stats
```
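
To make the method-to-endpoint mapping concrete, the sketch below builds the three request URLs with the standard `URL` and `URLSearchParams` classes. The helper name `buildDetectionUrls` and its return shape are hypothetical; the paths and query parameter names are the ones listed in this document.

```javascript
// Builds the request URLs for the three detection endpoints.
// `baseUrl` would typically come from ARELA_API_URL; the helper name is hypothetical.
function buildDetectionUrls(baseUrl, tableName, { offset = 0, limit = 100 } = {}) {
  const make = (path, params) => {
    const url = new URL(path, baseUrl);
    url.search = new URLSearchParams(params).toString(); // handles escaping
    return url.toString();
  };
  return {
    fetchPdfs: make('/api/uploader/scan/pdfs-for-detection', {
      tableName,
      offset: String(offset),
      limit: String(limit)
    }),
    batchUpdate: make('/api/uploader/scan/batch-update-detection', { tableName }),
    stats: make('/api/uploader/scan/detection-stats', { tableName })
  };
}

const urls = buildDetectionUrls('http://localhost:3010', 'file_stats_acme_corp_nas01_data');
console.log(urls.fetchPdfs);
```

Note that the paths are absolute, so any path prefix on `baseUrl` would be replaced rather than preserved.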

## Usage

### Basic Identify

```bash
# Identify pedimento-simplificado documents from scanned PDFs
arela identify
```

### With Different API Target

```bash
# Use specific API target
arela identify --api agencia
```

### With Custom Batch Size

```bash
# Process 200 files per batch (faster for large datasets)
arela identify --batch-size 200
```

### With Detailed Statistics

```bash
# Show detailed performance and memory statistics
arela identify --show-stats
```

## Configuration Requirements

Same configuration as `arela scan`:

```bash
# Required
ARELA_COMPANY_SLUG=your_company
ARELA_SERVER_ID=server01
UPLOAD_BASE_PATH=/path/to/files
UPLOAD_SOURCES=2023|2024|2025

# Optional
ARELA_BASE_PATH_LABEL=data
ARELA_API_URL=http://localhost:3010
ARELA_API_TOKEN=your-token
```
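
A validator for these variables might look like the following sketch. The function name `readScanConfig`, the returned field names, and the decision to check all four "Required" variables are assumptions; the error text mirrors the "Scan configuration errors" message shown under Error Handling.

```javascript
// Hypothetical validator for the variables listed above. Takes a plain object
// (for example process.env) so it can be exercised without touching the real environment.
function readScanConfig(env) {
  const required = ['ARELA_COMPANY_SLUG', 'ARELA_SERVER_ID', 'UPLOAD_BASE_PATH', 'UPLOAD_SOURCES'];
  const errors = required
    .filter((name) => !env[name] || String(env[name]).trim() === '')
    .map((name) => `${name} is required`);
  if (errors.length > 0) {
    throw new Error(`Scan configuration errors:\n  - ${errors.join('\n  - ')}`);
  }
  return {
    companySlug: env.ARELA_COMPANY_SLUG,
    serverId: env.ARELA_SERVER_ID,
    basePath: env.UPLOAD_BASE_PATH,
    // UPLOAD_SOURCES is a pipe-separated list, e.g. "2023|2024|2025"
    sources: env.UPLOAD_SOURCES.split('|').map((s) => s.trim()).filter(Boolean),
    basePathLabel: env.ARELA_BASE_PATH_LABEL || null,
    apiUrl: env.ARELA_API_URL || null
  };
}
```

Collecting every missing variable before throwing lets the user fix the whole `.env` file in one pass instead of one error at a time.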

## Performance Characteristics

### Concurrency

- **File Detection**: 10 parallel operations (configurable via code)
- **API Batching**: 100 files per API call (configurable via `--batch-size`)
- **HTTP Pooling**: Reuses connections for efficiency

### Resource Usage

- **Memory**: O(batch_size) - Only current batch in memory
- **CPU**: Leverages local CPU for PDF text extraction
- **Network**: Minimal - only sends detection results, not file contents

### Typical Performance

**Dataset**: 1,000 PDFs, average size 500KB

| Metric | Value |
|--------|-------|
| Total Time | 8-12 minutes |
| Throughput | 80-120 files/min |
| Memory Usage | ~200-300 MB |
| API Calls | ~10 (100 files per batch) |
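
The table's figures follow from simple arithmetic, sketched below. The helper `estimateRun` is hypothetical, and it counts only the batch fetch calls; update and statistics requests would come on top.

```javascript
// Back-of-the-envelope estimate: number of batch fetches and total minutes
// for a dataset, given a batch size and a throughput in files per minute.
function estimateRun(totalFiles, batchSize, filesPerMinute) {
  return {
    fetchCalls: Math.ceil(totalFiles / batchSize),
    minutes: totalFiles / filesPerMinute
  };
}

console.log(estimateRun(1000, 100, 80));  // { fetchCalls: 10, minutes: 12.5 }
console.log(estimateRun(1000, 100, 120).minutes.toFixed(1)); // 8.3
```

At 80 to 120 files/min, 1,000 PDFs take roughly 8.3 to 12.5 minutes, which lines up with the 8-12 minute range in the table.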

## Progress Display

### Default Mode

```
📄 Identifying |████████████████████░░░░░░░░| 67% | 670/1000 files | 85 files/sec
```

Shows:
- Progress bar
- Percentage complete
- Files processed / total files
- Real-time throughput (files/sec)

### Final Output

```
✅ Identification Complete!

📊 Results:
   Processed: 1000 files
   Pedimentos Detected: 850
   Errors: 15
   Duration: 11.5s
   Speed: 87 files/sec

📈 Final Status:
   Total PDFs: 1000
   Detected: 850
   Pending: 0
   Errors: 15
```

## Detection Logic

Detection reuses the existing `FileDetectionService` from the legacy codebase:

**File**: `arela-uploader/src/file-detection.js`

**Detection Process**:
1. Extract text from PDF using `office-text-extractor`
2. Apply regex patterns to identify document type
3. Extract fields: RFC, patente, aduana, pedimento number, year
4. Compose `arela_path`: `RFC/Year/Patente/Aduana/Pedimento/`

**Example arela_path**:
```
PED781129JT6/2023/3429/07/3019796/
```

Components:
- **RFC**: PED781129JT6 (tax ID)
- **Year**: 2023 (from pedimento)
- **Patente**: 3429 (customs broker license)
- **Aduana**: 07 (customs office, zero-padded)
- **Pedimento**: 3019796 (customs declaration number)
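
The composition step can be sketched as a small pure function. The name `composeArelaPath`, the input object shape, and the two-digit padding width for the aduana are assumptions drawn from the example above, not taken from `file-detection.js`.

```javascript
// Hypothetical helper: joins the extracted fields into
// RFC/Year/Patente/Aduana/Pedimento/ with a zero-padded aduana.
function composeArelaPath({ rfc, year, patente, aduana, pedimento }) {
  const fields = { rfc, year, patente, aduana, pedimento };
  for (const [name, value] of Object.entries(fields)) {
    if (value === undefined || value === null || String(value).trim() === '') {
      throw new Error(`Missing field for arela_path: ${name}`);
    }
  }
  const paddedAduana = String(aduana).padStart(2, '0'); // 7 -> "07"
  return [rfc, year, patente, paddedAduana, pedimento].map(String).join('/') + '/';
}

console.log(composeArelaPath({
  rfc: 'PED781129JT6',
  year: 2023,
  patente: '3429',
  aduana: 7,
  pedimento: '3019796'
}));
// PED781129JT6/2023/3429/07/3019796/
```

Refusing to build a path when any component is missing avoids writing a partial `arela_path` that later steps would treat as valid.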

## Error Handling

### Common Errors

**1. Configuration Missing**

```
Error: Scan configuration errors:
  - ARELA_COMPANY_SLUG is required
  - ARELA_SERVER_ID is required
```

**Solution**: Set environment variables in `.env`

**2. Table Not Found**

```
Error: Table 'file_stats_...' not found in CLI registry
```

**Solution**: Run `arela scan` first to create the table

**3. PDF Not Found**

Detection result will include:
```json
{
  "detectionError": "File not found on filesystem"
}
```

This can happen if:
- File was deleted after scan
- File path is incorrect
- File is on unmounted drive

**4. Detection Failed**

Detection result will include:
```json
{
  "detectionError": "Failed to extract text from PDF: ..."
}
```

Common causes:
- Corrupted PDF
- Encrypted/password-protected PDF
- Unsupported PDF format

## Monitoring Queries

### Check Detection Progress

```sql
SELECT
  COUNT(*) FILTER (WHERE file_extension = 'pdf') as total_pdfs,
  COUNT(*) FILTER (WHERE detected_type IS NOT NULL) as detected,
  COUNT(*) FILTER (WHERE file_extension = 'pdf' AND detection_attempted_at IS NULL) as pending,
  COUNT(*) FILTER (WHERE detection_error IS NOT NULL) as errors
FROM cli.file_stats_<company>_<server>_<path>;
```

### List Detected Pedimentos

```sql
SELECT
  detected_pedimento,
  detected_pedimento_year,
  rfc,
  arela_path,
  COUNT(*) as file_count
FROM cli.file_stats_<company>_<server>_<path>
WHERE detected_type = 'pedimento_simplificado'
GROUP BY detected_pedimento, detected_pedimento_year, rfc, arela_path
ORDER BY detected_pedimento_year DESC, detected_pedimento;
```

### Find Files with Errors

```sql
SELECT
  relative_path,
  file_name,
  detection_error,
  detection_attempted_at
FROM cli.file_stats_<company>_<server>_<path>
WHERE detection_error IS NOT NULL
ORDER BY detection_attempted_at DESC
LIMIT 50;
```

## Next Phase: arela propagate

After identification, use `arela propagate` to copy `arela_path` from pedimento PDFs to related files in the same directory.

**Example**:
```
/2023/3019796/
├── pedimento.pdf (detected, has arela_path)
├── invoice.pdf (related, needs arela_path)
└── packing_list.pdf (related, needs arela_path)
```

After propagation, all 3 files will have the same `arela_path`, enabling batch upload via `arela push`.
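
The propagation idea can be illustrated in memory with a short sketch. The function name `propagateArelaPath` and the row shape (`relativePath`, `arelaPath`) are assumptions for this example, and a directory holding more than one detected pedimento is not handled here (the first one found wins).

```javascript
// Sketch: copies arela_path from a detected pedimento to files in the same directory.
// Rows are plain objects; paths are assumed to use "/" separators.
function propagateArelaPath(rows) {
  const dirOf = (path) => path.slice(0, path.lastIndexOf('/') + 1);

  // First pass: remember one arela_path per directory.
  const pathByDir = new Map();
  for (const row of rows) {
    const dir = dirOf(row.relativePath);
    if (row.arelaPath && !pathByDir.has(dir)) pathByDir.set(dir, row.arelaPath);
  }

  // Second pass: fill in rows that have no arela_path yet.
  return rows.map((row) => {
    if (row.arelaPath) return row;
    const inherited = pathByDir.get(dirOf(row.relativePath));
    return { ...row, arelaPath: inherited === undefined ? null : inherited };
  });
}
```

Applied to the directory above, `invoice.pdf` and `packing_list.pdf` would inherit the path of `pedimento.pdf`, while files in directories without a detected pedimento stay at `null`.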

## Troubleshooting

### Slow Performance

**Symptoms**: < 50 files/sec throughput

**Solutions**:
1. Increase batch size: `--batch-size 200`
2. Check filesystem I/O (slow disk?)
3. Check API response time
4. Verify network latency

### High Memory Usage

**Symptoms**: Memory grows during execution

**Solutions**:
1. Reduce batch size: `--batch-size 50`
2. Check for memory leaks in `FileDetectionService`
3. Monitor with `--show-stats` flag

### Files Not Detected

**Symptoms**: Many files with `detection_error`

**Solutions**:
1. Check file permissions (readable?)
2. Verify PDF format (not corrupted?)
3. Check detection patterns in `document-type-shared.js`
4. Review error messages in the `detection_error` field

## Files Modified

### Backend

- ✅ `src/uploader/services/file-stats-table-manager.service.ts`
  - Updated schema with detection fields
  - Added detection indexes
  - Added `fetchPdfsForDetection()`
  - Added `batchUpdateDetection()`
  - Added `getDetectionStats()`

- ✅ `src/uploader/services/uploader.service.ts`
  - Added `fetchPdfsForDetection()`
  - Added `batchUpdateDetection()`
  - Added `getDetectionStats()`

- ✅ `src/uploader/controllers/uploader.controller.ts`
  - Added `GET /api/uploader/scan/pdfs-for-detection`
  - Added `PATCH /api/uploader/scan/batch-update-detection`
  - Added `GET /api/uploader/scan/detection-stats`

### CLI

- ✅ `src/commands/IdentifyCommand.js` - New command
- ✅ `src/services/ScanApiService.js` - Added detection methods
- ✅ `src/index.js` - Wired up identify command
- ✅ `docs/ARELA_IDENTIFY_IMPLEMENTATION.md` - Documentation

## Migration Path

### For New Installations

Use the optimized workflow:
```bash
arela scan        # Collect file stats
arela identify    # Detect pedimentos
arela propagate   # Propagate arela_path
arela push        # Upload by RFC
```

### For Existing Installations

Legacy commands still work:
```bash
arela stats --stats-only     # Still supported
arela detect --detect-pdfs   # Still supported
```

Migrate to the new commands gradually when ready.

## Conclusion

The `arela identify` command provides a **faster, more scalable, and resource-efficient** alternative to the legacy detection workflow. By leveraging local processing, batch operations, and optimized database indexes, it can handle large-scale document identification with minimal overhead.

**Next Steps**:
1. ✅ Phase 1: `arela scan` - COMPLETE
2. ✅ Phase 2: `arela identify` - COMPLETE
3. ⏳ Phase 3: `arela propagate` - Propagate arela_path to related files
4. ⏳ Phase 4: `arela push` - Upload files by RFC
@@ -0,0 +1,186 @@
# Arela Identify Quick Reference

## Command

```bash
arela identify [options]
```

## Options

| Option | Default | Description |
|--------|---------|-------------|
| `--api <target>` | `default` | API target: default\|agencia\|cliente |
| `-b, --batch-size <size>` | `100` | Files per batch |
| `--show-stats` | `false` | Show detailed statistics |

## Prerequisites

1. **Run `arela scan` first** - Identify requires scanned files
2. **Same configuration** - Use same env vars as scan command

## Required Environment Variables

```bash
ARELA_COMPANY_SLUG=your_company
ARELA_SERVER_ID=server01
UPLOAD_BASE_PATH=/path/to/files
UPLOAD_SOURCES=2023|2024|2025
```

## Examples

```bash
# Basic identification
arela identify

# Use specific API
arela identify --api agencia

# Faster for large datasets
arela identify --batch-size 200

# With detailed stats
arela identify --show-stats
```

## What It Does

1. Fetches unprocessed PDFs from `file_stats_*` table
2. Extracts text and detects document type locally
3. Identifies pedimento-simplificado documents
4. Extracts: RFC, patente, aduana, pedimento, year
5. Composes `arela_path`: `RFC/Year/Patente/Aduana/Pedimento/`
6. Updates results in database

## Output Example

```
🔍 Starting arela identify command
📊 Table: file_stats_acme_corp_nas01_data
🎯 API Target: default
📦 Batch Size: 100

📈 Detection Status:
   Total PDFs: 1000
   Detected: 0
   Pending: 1000
   Errors: 0

🚀 Processing 1000 pending PDFs...

📄 Identifying |████████████████████| 100% | 1000/1000 files | 87 files/sec

✅ Identification Complete!

📊 Results:
   Processed: 1000 files
   Pedimentos Detected: 850
   Errors: 15
   Duration: 11.5s
   Speed: 87 files/sec

📈 Final Status:
   Total PDFs: 1000
   Detected: 850
   Pending: 0
   Errors: 15
```

## Backend Endpoints Used

```
GET /api/uploader/scan/detection-stats?tableName=X
GET /api/uploader/scan/pdfs-for-detection?tableName=X&offset=0&limit=100
PATCH /api/uploader/scan/batch-update-detection?tableName=X
```

## Database Fields Updated

| Field | Type | Description |
|-------|------|-------------|
| `detected_type` | VARCHAR | Document type (pedimento_simplificado) |
| `detected_pedimento` | VARCHAR | Pedimento number |
| `detected_pedimento_year` | INTEGER | Year from pedimento |
| `rfc` | VARCHAR | Tax ID |
| `arela_path` | TEXT | Upload path (RFC/Year/...) |
| `detection_attempted_at` | TIMESTAMP | When detection ran |
| `detection_error` | TEXT | Error if detection failed |

## Performance Tips

- **Large datasets**: Increase batch size to 200-500
- **Slow detection**: Check PDF file sizes and complexity
- **API latency**: Use `--api` flag to select closer API
- **Memory usage**: Reduce batch size if memory usage is high

## Troubleshooting

| Error | Solution |
|-------|----------|
| "Scan configuration errors" | Set ARELA_COMPANY_SLUG and ARELA_SERVER_ID |
| "Table not found" | Run `arela scan` first |
| "File not found on filesystem" | File was deleted after scan |
| "Failed to extract text" | PDF is corrupted or encrypted |

## Next Steps

After identification:

```bash
# Propagate arela_path to related files
arela propagate

# Upload files by RFC
arela push
```

## Monitoring Queries

```sql
-- Check progress
SELECT
  COUNT(*) FILTER (WHERE file_extension = 'pdf') as total,
  COUNT(*) FILTER (WHERE detected_type IS NOT NULL) as detected,
  COUNT(*) FILTER (WHERE file_extension = 'pdf' AND detection_attempted_at IS NULL) as pending
FROM cli.file_stats_<company>_<server>_<path>;

-- View detected pedimentos
SELECT
  detected_pedimento,
  rfc,
  arela_path,
  COUNT(*) as files
FROM cli.file_stats_<company>_<server>_<path>
WHERE detected_type = 'pedimento_simplificado'
GROUP BY detected_pedimento, rfc, arela_path;

-- Check errors
SELECT relative_path, detection_error
FROM cli.file_stats_<company>_<server>_<path>
WHERE detection_error IS NOT NULL
LIMIT 20;
```

## Comparison: Legacy vs Optimized

| Feature | Legacy (detect --detect-pdfs) | New (identify) |
|---------|-------------------------------|----------------|
| Table | Global `uploader` | Dynamic `file_stats_*` |
| API | Supabase direct | Configured API |
| Detection | Mixed (client/server) | Client-side |
| Batching | Single query | Paginated batches |
| Progress | Percentage | Throughput |
| Indexes | Basic | Optimized for speed |

## Files Involved

### CLI
- `src/commands/IdentifyCommand.js` - Main command
- `src/services/ScanApiService.js` - API communication
- `src/file-detection.js` - Detection logic (reused)

### Backend
- `src/uploader/services/file-stats-table-manager.service.ts` - Table operations
- `src/uploader/services/uploader.service.ts` - Business logic
- `src/uploader/controllers/uploader.controller.ts` - REST endpoints