@arela/uploader 1.0.2 → 1.0.3

@@ -0,0 +1,494 @@
1
+ # Multi-Level Directory Scanning Implementation
2
+
3
+ ## Overview
4
+
5
+ The arela CLI can now create a separate database table for each directory at a configurable depth below the base path. Splitting a scan this way keeps individual tables smaller when a file system contains many subdirectories.
6
+
7
+ ## Configuration
8
+
9
+ ### Environment Variable
10
+
11
+ ```bash
12
+ # Directory depth level for creating separate tables (default: 0)
13
+ # 0 = single table for entire base path (original behavior)
14
+ # 1 = one table per first-level subdirectory
15
+ # 2 = one table per second-level subdirectory, etc.
16
+ SCAN_DIRECTORY_LEVEL=1
17
+ ```
18
+
19
+ Add this to your `.env` file.
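As a minimal sketch of how this setting could be read, the snippet below parses `SCAN_DIRECTORY_LEVEL` with the documented default of 0. The function name `read_scan_level` and the decision to reject negative or non-numeric values are illustrative assumptions, not the package's actual code.

```python
import os
from typing import Mapping, Optional


def read_scan_level(env: Optional[Mapping[str, str]] = None) -> int:
    """Return SCAN_DIRECTORY_LEVEL as a non-negative integer (default 0)."""
    env = os.environ if env is None else env
    # A missing or empty value falls back to the documented default of 0.
    raw = (env.get("SCAN_DIRECTORY_LEVEL") or "0").strip()
    try:
        level = int(raw)
    except ValueError:
        raise ValueError(
            f"SCAN_DIRECTORY_LEVEL must be an integer, got {raw!r}"
        ) from None
    if level < 0:
        # Assumption: negative depths are treated as a configuration error.
        raise ValueError("SCAN_DIRECTORY_LEVEL must be 0 or greater")
    return level
```

Calling `read_scan_level()` with no argument reads from the process environment; passing a dictionary makes the function easy to exercise in isolation.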
20
+
21
+ ## Example
22
+
23
+ Given this directory structure:
24
+
25
+ ```
26
+ /data
27
+ ├── adios
28
+ │ ├── otro
29
+ │ │ └── otro.txt
30
+ │ └── uno
31
+ │ └── prueba.txt
32
+ └── hola
33
+ ├── otro
34
+ │ └── otro.txt
35
+ └── uno
36
+ └── prueba.txt
37
+ ```
38
+
39
+ ### Level 0 (Default)
40
+
41
+ ```bash
42
+ SCAN_DIRECTORY_LEVEL=0
43
+ ```
44
+
45
+ Creates a single table:
46
+ - `file_stats_palco_local_sample`
47
+
48
+ ### Level 1
49
+
50
+ ```bash
51
+ SCAN_DIRECTORY_LEVEL=1
52
+ ```
53
+
54
+ Creates two tables:
55
+ - `file_stats_palco_local_sample_adios`
56
+ - `file_stats_palco_local_sample_hola`
57
+
58
+ ### Level 2
59
+
60
+ ```bash
61
+ SCAN_DIRECTORY_LEVEL=2
62
+ ```
63
+
64
+ Creates four tables:
65
+ - `file_stats_palco_local_sample_adios_otro`
66
+ - `file_stats_palco_local_sample_adios_uno`
67
+ - `file_stats_palco_local_sample_hola_otro`
68
+ - `file_stats_palco_local_sample_hola_uno`
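The mapping from level to table names in the three cases above can be sketched as follows. This is an illustration only: `tables_for_level` is a hypothetical helper, it derives names from relative file paths rather than from the real directory discovery, and skipping files that sit above the chosen depth is an assumption.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def tables_for_level(prefix: str, files: Iterable[str], level: int) -> List[str]:
    """Return the sorted table names implied by `files` at the given depth.

    `files` are paths relative to the base path, e.g. "adios/uno/prueba.txt".
    """
    names = set()
    for rel in files:
        # Directory components only, e.g. ("adios", "uno").
        dirs = PurePosixPath(rel).parent.parts
        if len(dirs) < level:
            # Assumption: a file shallower than the chosen level gets no table.
            continue
        names.add("_".join((prefix, *dirs[:level])))
    return sorted(names)


EXAMPLE_FILES = [
    "adios/otro/otro.txt",
    "adios/uno/prueba.txt",
    "hola/otro/otro.txt",
    "hola/uno/prueba.txt",
]
```

With the prefix `file_stats_palco_local_sample`, levels 0, 1, and 2 yield one, two, and four names respectively, matching the lists above.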
69
+
70
+ ## How It Works
71
+
72
+ ### Backend Changes
73
+
74
+ 1. **New Endpoint**: `GET /api/uploader/scan/instance-tables`
75
+ - Fetches all tables for a specific company/server/base combination
76
+ - Supports filtering by pattern matching
77
+
78
+ 2. **FileStatsTableManagerService**:
79
+ - Added `getInstanceTables()` method
80
+ - Returns all tables matching base pattern
81
+
82
+ ### CLI Changes
83
+
84
+ #### 1. ScanCommand
85
+
86
+ - Discovers directories at specified level
87
+ - Registers multiple instances (one per directory)
88
+ - Creates separate tables for each directory
89
+ - Streams files into appropriate tables
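The discovery step in the first bullet can be sketched with `pathlib`. The name `discover_dirs` and the breadth-first walk are assumptions made for illustration; the actual ScanCommand may discover directories differently (for example, it may treat symlinks or hidden directories in another way).

```python
import tempfile
from pathlib import Path
from typing import List


def discover_dirs(base: Path, level: int) -> List[Path]:
    """Return the directories exactly `level` steps below `base`.

    Level 0 returns [base], which corresponds to the single-table behavior.
    """
    current = [base]
    for _ in range(level):
        # Descend one level: keep only subdirectories, in sorted order.
        current = [
            child
            for parent in current
            for child in sorted(parent.iterdir())
            if child.is_dir()
        ]
    return current


def demo(level: int) -> List[str]:
    """Recreate the example tree in a temp dir and list it at `level`."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        for rel in ("adios/otro", "adios/uno", "hola/otro", "hola/uno"):
            (base / rel).mkdir(parents=True)
        found = discover_dirs(base, level)
        return [d.relative_to(base).as_posix() for d in found]
```

Each discovered directory would then be registered as its own scan instance, and files found beneath it would be written to that instance's table.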
90
+
91
+ **Output Example**:
92
+
93
+ ```
94
+ 🔍 Discovering directories...
95
+ 📁 Found 2 directories to scan
96
+
97
+ 📝 Registering scan instances...
98
+ ✓ adios: file_stats_palco_local_sample_adios (new)
99
+ ✓ hola: file_stats_palco_local_sample_hola (new)
100
+
101
+ 🚀 Starting file scan...
102
+
103
+ 📂 Scanning: adios
104
+ 📄 |████████████████████| 100% | 50/50 files | 25 files/sec
105
+
106
+ 📂 Scanning: hola
107
+ 📄 |████████████████████| 100% | 45/45 files | 23 files/sec
108
+
109
+ ✅ Scan completed successfully!
110
+
111
+ 📊 Scan Statistics:
112
+ Directories scanned: 2
113
+ Files scanned: 95
114
+ Files inserted: 95
115
+ Files skipped: 0 (excluded patterns)
116
+ Total size: 2.5 MB
117
+ Duration: 4.1s
118
+ Throughput: 23 files/sec
119
+
120
+ 📋 Tables created:
121
+ - file_stats_palco_local_sample_adios
122
+ - file_stats_palco_local_sample_hola
123
+ ```
124
+
125
+ #### 2. IdentifyCommand
126
+
127
+ - Automatically fetches all tables for the instance
128
+ - Processes each table sequentially
129
+ - Shows per-table and total statistics
130
+
131
+ **Output Example**:
132
+
133
+ ```
134
+ 🔍 Starting arela identify command
135
+ 📊 Fetching instance tables...
136
+ 📋 Found 2 tables to process
137
+ - file_stats_palco_local_sample_adios
138
+ - file_stats_palco_local_sample_hola
139
+
140
+ 🔍 Processing table: file_stats_palco_local_sample_adios
141
+ Total PDFs: 20
142
+ Detected: 0
143
+ Pending: 20
144
+ 🚀 Processing 20 pending PDFs...
145
+ 📄 |████████████████████| 100% | 20/20 files | 15 files/sec
146
+
147
+ 🔍 Processing table: file_stats_palco_local_sample_hola
148
+ Total PDFs: 18
149
+ Detected: 0
150
+ Pending: 18
151
+ 🚀 Processing 18 pending PDFs...
152
+ 📄 |████████████████████| 100% | 18/18 files | 14 files/sec
153
+
154
+ ✅ Identification Complete!
155
+
156
+ 📊 Total Results:
157
+ Tables Processed: 2
158
+ Files Processed: 38
159
+ Pedimentos Detected: 35
160
+ Errors: 3
161
+ Duration: 2.7s
162
+ Speed: 14 files/sec
163
+ ```
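The totals in this output are sums of the per-table results, and the speed is total files divided by duration. A small sketch of that aggregation follows; `TableResult` and `summarize` are hypothetical names, and the per-table detected and error counts are inferred from the propagate output (18 and 17 pedimento sources), since the sample above prints only the totals.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class TableResult:
    table: str
    processed: int
    detected: int
    errors: int


def summarize(results: Iterable[TableResult], duration_s: float) -> Dict[str, int]:
    """Combine per-table results into the totals printed at the end of a run."""
    rows = list(results)
    processed = sum(r.processed for r in rows)
    return {
        "tables": len(rows),
        "processed": processed,
        "detected": sum(r.detected for r in rows),
        "errors": sum(r.errors for r in rows),
        # Rounded throughput; guard against a zero duration.
        "files_per_sec": round(processed / duration_s) if duration_s > 0 else 0,
    }


# Per-table split inferred from the sample outputs, not stated directly.
SAMPLE = [
    TableResult("file_stats_palco_local_sample_adios", 20, 18, 2),
    TableResult("file_stats_palco_local_sample_hola", 18, 17, 1),
]
```

Running `summarize(SAMPLE, 2.7)` reproduces the totals shown: 2 tables, 38 files, 35 detected, 3 errors, and 14 files/sec.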
164
+
165
+ #### 3. PropagateCommand
166
+
167
+ - Fetches all tables for the instance
168
+ - Processes each table's directories
169
+ - Shows per-table and total statistics
170
+
171
+ **Output Example**:
172
+
173
+ ```
174
+ 🔄 Starting arela propagate command
175
+
176
+ 🎯 API Target: default
177
+ 📦 Batch Size: 50
178
+
179
+ 📋 Found 2 tables to process:
180
+ - file_stats_palco_local_sample_adios
181
+ - file_stats_palco_local_sample_hola
182
+
183
+ 🔄 Processing table: file_stats_palco_local_sample_adios
184
+
185
+ Total Files: 50
186
+ With arela_path: 20
187
+ Pedimento Sources: 18
188
+ 🚀 Found 30 files ready for propagation.
189
+
190
+ 📄 Propagating |████████████████████| 100% | 18/18 directories | 42 files/sec | 30 files updated
191
+
192
+ 📊 Results:
193
+ Pedimentos Processed: 18
194
+ Directories Processed: 18
195
+ Files Updated: 30
196
+ Errors: 0
197
+
198
+ 🔄 Processing table: file_stats_palco_local_sample_hola
199
+
200
+ Total Files: 45
201
+ With arela_path: 18
202
+ Pedimento Sources: 17
203
+ 🚀 Found 27 files ready for propagation.
204
+
205
+ 📄 Propagating |████████████████████| 100% | 17/17 directories | 39 files/sec | 27 files updated
206
+
207
+ 📊 Results:
208
+ Pedimentos Processed: 17
209
+ Directories Processed: 17
210
+ Files Updated: 27
211
+ Errors: 0
212
+
213
+ ✅ Propagation Complete!
214
+
215
+ 📊 Total Results:
216
+ Tables Processed: 2
217
+ Pedimentos Processed: 35
218
+ Files Updated: 57
219
+ Files Failed: 0
220
+ Directories Processed: 35
221
+ Duration: 1.5s
222
+ Speed: 38 files/sec
223
+ ```
224
+
225
+ #### 4. PushCommand
226
+
227
+ - Fetches all tables for the instance
228
+ - Processes files from each table
229
+ - Shows per-table and total statistics
230
+
231
+ **Output Example**:
232
+
233
+ ```
234
+ 🚀 Starting arela push command
235
+
236
+ 🎯 Scan API Target: default
237
+ 🎯 Upload API Target: default → http://localhost:3010
238
+ 📦 Fetch Batch Size: 100
239
+ 📤 Upload Batch Size: 10
240
+
241
+ 📊 Fetching instance tables...
242
+ 📋 Found 2 tables to process:
243
+ - file_stats_palco_local_sample_adios
244
+ - file_stats_palco_local_sample_hola
245
+
246
+ 🚀 Processing table: file_stats_palco_local_sample_adios
247
+
248
+ Table Status:
249
+ Total with arela_path: 50
250
+ Pending: 50
251
+ Uploaded: 0
252
+
253
+ 🚀 Uploading 50 pending files...
254
+
255
+ 📤 Uploading |████████████████████| 100% | 50/50 files | 8 files/sec | ✓ 48 ✗ 2
256
+
257
+ 📊 Table Results:
258
+ Files Processed: 50
259
+ Uploaded: 48
260
+ Errors: 2
261
+
262
+ 🚀 Processing table: file_stats_palco_local_sample_hola
263
+
264
+ Table Status:
265
+ Total with arela_path: 45
266
+ Pending: 45
267
+ Uploaded: 0
268
+
269
+ 🚀 Uploading 45 pending files...
270
+
271
+ 📤 Uploading |████████████████████| 100% | 45/45 files | 7 files/sec | ✓ 44 ✗ 1
272
+
273
+ 📊 Table Results:
274
+ Files Processed: 45
275
+ Uploaded: 44
276
+ Errors: 1
277
+
278
+ ✅ Push Complete!
279
+
280
+ 📊 Total Results:
281
+ Tables Processed: 2
282
+ Files Processed: 95
283
+ Uploaded: 92
284
+ Errors: 3
285
+ Duration: 12.1s
286
+ Speed: 8 files/sec
287
+ ```
288
+
289
+ ## Use Cases
290
+
291
+ ### 1. Large File Systems
292
+
293
+ Split large file systems into manageable tables:
294
+
295
+ ```bash
296
+ # Instead of one huge table with millions of files
297
+ SCAN_DIRECTORY_LEVEL=0 # ❌ file_stats_company_nas01_data (5M files)
298
+
299
+ # Create tables per department/year/project
300
+ SCAN_DIRECTORY_LEVEL=1 # ✅ file_stats_company_nas01_data_accounting
301
+ # ✅ file_stats_company_nas01_data_sales
302
+ # ✅ file_stats_company_nas01_data_legal
303
+ ```
304
+
305
+ ### 2. Year-Based Organization
306
+
307
+ Organize by year for time-based processing:
308
+
309
+ ```bash
310
+ UPLOAD_BASE_PATH=/archive
311
+ UPLOAD_SOURCES=2023|2024|2025
312
+ SCAN_DIRECTORY_LEVEL=1
313
+
314
+ # Creates:
315
+ # file_stats_company_nas01_archive_2023
316
+ # file_stats_company_nas01_archive_2024
317
+ # file_stats_company_nas01_archive_2025
318
+ ```
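To illustrate how the three table names follow from this configuration, here is a hedged sketch. It assumes that `UPLOAD_SOURCES` is a pipe-delimited list of subdirectories, that the base path label is the last component of `UPLOAD_BASE_PATH`, and that names follow the `file_stats_<company>_<server>_<label>_<directory>` pattern visible in the examples. The helper names are hypothetical.

```python
from typing import List


def parse_sources(raw: str) -> List[str]:
    """Split a pipe-delimited UPLOAD_SOURCES value, dropping empty entries."""
    return [part.strip() for part in raw.split("|") if part.strip()]


def table_names(
    company_slug: str, server_id: str, base_label: str, sources: List[str]
) -> List[str]:
    """Build one table name per source directory (pattern is an assumption)."""
    return [
        f"file_stats_{company_slug}_{server_id}_{base_label}_{source}"
        for source in sources
    ]


sources = parse_sources("2023|2024|2025")
names = table_names("company", "nas01", "archive", sources)
```

Here `names` holds the three `file_stats_company_nas01_archive_*` tables listed in the comment above.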
319
+
320
+ ### 3. Client-Based Organization
321
+
322
+ Separate tables per client for multi-tenant setups:
323
+
324
+ ```bash
325
+ UPLOAD_BASE_PATH=/clients
326
+ SCAN_DIRECTORY_LEVEL=1
327
+
328
+ # Creates:
329
+ # file_stats_agency_nas01_clients_acme
330
+ # file_stats_agency_nas01_clients_globex
331
+ # file_stats_agency_nas01_clients_initech
332
+ ```
333
+
334
+ ## Performance Considerations
335
+
336
+ ### Benefits
337
+
338
+ 1. **Smaller Tables**: Easier to query and maintain
339
+ 2. **Parallel Processing**: Can process multiple tables concurrently (future enhancement)
340
+ 3. **Targeted Operations**: Process only relevant tables (e.g., specific year or client)
341
+ 4. **Better Indexes**: Smaller tables = more efficient indexes
342
+
343
+ ### Trade-offs
344
+
345
+ 1. **More API Calls**: Sequential processing of multiple tables
346
+ 2. **Increased Complexity**: More tables to manage and monitor
347
+ 3. **Slower Initial Scan**: Directory discovery adds overhead
348
+
349
+ ## Best Practices
350
+
351
+ ### 1. Choose Appropriate Level
352
+
353
+ - **Level 0**: Small datasets (< 100K files)
354
+ - **Level 1**: Medium datasets (100K-1M files) or logical separation
355
+ - **Level 2+**: Very large datasets or deep hierarchies
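These thresholds can be written down as a rough rule of thumb. The function below is only a sketch of the size-based guidance: `suggest_level` is a hypothetical helper, and it cannot capture the "logical separation" or "deep hierarchies" criteria, which depend on how the data is organized.

```python
def suggest_level(file_count: int) -> int:
    """Suggest a SCAN_DIRECTORY_LEVEL from the file count alone."""
    if file_count < 0:
        raise ValueError("file_count must not be negative")
    if file_count < 100_000:
        return 0  # small dataset: a single table is enough
    if file_count <= 1_000_000:
        return 1  # medium dataset: one table per first-level directory
    return 2  # very large dataset: split at least two levels deep
```

Treat the result as a starting point; a 50K-file archive organized by client may still be better served by level 1.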
356
+
357
+ ### 2. Consistent Configuration
358
+
359
+ Always use the same configuration across commands:
360
+
361
+ ```bash
362
+ # Set once in .env
363
+ ARELA_COMPANY_SLUG=my_company
364
+ ARELA_SERVER_ID=nas01
365
+ UPLOAD_BASE_PATH=/data
366
+ SCAN_DIRECTORY_LEVEL=1
367
+
368
+ # Then run commands without changing these values
369
+ arela scan
370
+ arela identify
371
+ arela propagate
372
+ arela push
373
+ ```
374
+
375
+ ### 3. Monitor Table Sizes
376
+
377
+ Check table sizes periodically:
378
+
379
+ ```sql
380
+ SELECT
381
+ tablename,
382
+ pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
383
+ FROM pg_tables
384
+ WHERE schemaname = 'cli'
385
+ ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
386
+ ```
387
+
388
+ ## Troubleshooting
389
+
390
+ ### Problem: No tables found
391
+
392
+ ```
393
+ ❌ No tables found for this instance. Run "arela scan" first.
394
+ ```
395
+
396
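+ **Solution**: Run `arela scan` first, using the same configuration (company slug, server ID, base path, and `SCAN_DIRECTORY_LEVEL`) as the command that failed.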
397
+
398
+ ### Problem: Table name collision
399
+
400
+ If you change `SCAN_DIRECTORY_LEVEL` after the initial scan, the tables created at the earlier level may conflict with the new ones.
401
+
402
+ **Solution**: Use a different `ARELA_BASE_PATH_LABEL`, or deactivate the old tables:
403
+
404
+ ```bash
405
+ # Deactivate old table
406
+ curl -X PATCH http://localhost:3010/api/uploader/scan/deactivate \
407
+ -H "x-api-key: $TOKEN" \
408
+ -H "Content-Type: application/json" \
409
+ -d '{"tableName":"file_stats_old_table"}'
410
+
411
+ # Then run scan with new level
412
+ arela scan
413
+ ```
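The same deactivation call can be prepared from a script. The sketch below builds the request with the standard library and mirrors the `curl` command (method, path, `x-api-key` header, JSON body). `build_deactivate_request` is a hypothetical helper, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_deactivate_request(
    base_url: str, api_key: str, table_name: str
) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that deactivates a table."""
    body = json.dumps({"tableName": table_name}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/uploader/scan/deactivate",
        data=body,
        method="PATCH",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


request = build_deactivate_request(
    "http://localhost:3010", "example-token", "file_stats_old_table"
)
```

Passing `request` to `urllib.request.urlopen` would send it; the response format is not documented here, so handling it is left out.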
414
+
415
+ ### Problem: Duplicate files across tables
416
+
417
+ Files shouldn't appear in multiple tables if directory level is set correctly.
418
+
419
+ **Check**: Verify directory structure matches your level setting.
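One way to run this check is to compare the base paths of the active tables and look for one path nested inside another, which happens when tables from different levels coexist. This is a sketch under assumptions: `overlapping_tables` is a hypothetical helper, and the input is a mapping of table name to the `basePathFull` value that the instance-tables endpoint reports.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple


def overlapping_tables(base_paths: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (outer, inner) table pairs where inner's path lies under outer's."""
    pairs = []
    for outer, outer_path in base_paths.items():
        for inner, inner_path in base_paths.items():
            if outer == inner:
                continue
            # `parents` lists every ancestor directory of the inner path.
            if PurePosixPath(outer_path) in PurePosixPath(inner_path).parents:
                pairs.append((outer, inner))
    return pairs


TABLES = {
    "file_stats_my_company_nas01_data": "/data",
    "file_stats_my_company_nas01_data_adios": "/data/adios",
    "file_stats_my_company_nas01_data_hola": "/data/hola",
}
```

For `TABLES`, the level 0 table overlaps both level 1 tables, so files under `/data/adios` and `/data/hola` would be recorded twice until the level 0 table is deactivated.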
420
+
421
+ ## Migration Guide
422
+
423
+ ### From Level 0 to Level 1
424
+
425
+ 1. **Back up existing data** (optional):
426
+ ```sql
427
+ CREATE TABLE cli.file_stats_backup AS
428
+ SELECT * FROM cli.file_stats_old_table;
429
+ ```
430
+
431
+ 2. **Deactivate old table**:
432
+ ```bash
433
+ curl -X PATCH http://localhost:3010/api/uploader/scan/deactivate \
434
+ -H "x-api-key: $TOKEN" \
435
+ -H "Content-Type: application/json" \
+ -d '{"tableName":"file_stats_old_table"}'
436
+ ```
437
+
438
+ 3. **Update configuration**:
439
+ ```bash
440
+ SCAN_DIRECTORY_LEVEL=1
441
+ ```
442
+
443
+ 4. **Run full pipeline**:
444
+ ```bash
445
+ arela scan
446
+ arela identify
447
+ arela propagate
448
+ arela push
449
+ ```
450
+
451
+ ## API Reference
452
+
453
+ ### Get Instance Tables
454
+
455
+ ```
456
+ GET /api/uploader/scan/instance-tables?companySlug=X&serverId=Y&basePathLabel=Z
457
+ ```
458
+
459
+ **Response**:
460
+ ```json
461
+ [
462
+ {
463
+ "id": "uuid",
464
+ "companySlug": "my_company",
465
+ "serverId": "nas01",
466
+ "basePathLabel": "data_adios",
467
+ "tableName": "file_stats_my_company_nas01_data_adios",
468
+ "basePathFull": "/data/adios",
469
+ "lastScanAt": "2025-01-18T10:30:00Z",
470
+ "totalFiles": 50,
471
+ "totalSizeBytes": 1048576,
472
+ "status": "ACTIVE"
473
+ },
474
+ {
475
+ "id": "uuid",
476
+ "companySlug": "my_company",
477
+ "serverId": "nas01",
478
+ "basePathLabel": "data_hola",
479
+ "tableName": "file_stats_my_company_nas01_data_hola",
480
+ "basePathFull": "/data/hola",
481
+ "lastScanAt": "2025-01-18T10:31:00Z",
482
+ "totalFiles": 45,
483
+ "totalSizeBytes": 987654,
484
+ "status": "ACTIVE"
485
+ }
486
+ ]
487
+ ```
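A client for this endpoint could be sketched as below. The query parameter names and response fields come from the reference above; the helper names, the `InstanceTable` type, and the choice of `data` as the `basePathLabel` value that matches both tables are assumptions for illustration.

```python
import json
from dataclasses import dataclass
from typing import List
from urllib.parse import urlencode


@dataclass
class InstanceTable:
    table_name: str
    base_path_full: str
    total_files: int
    status: str


def instance_tables_url(
    base_url: str, company_slug: str, server_id: str, base_path_label: str
) -> str:
    """Build the GET URL with properly encoded query parameters."""
    query = urlencode(
        {
            "companySlug": company_slug,
            "serverId": server_id,
            "basePathLabel": base_path_label,
        }
    )
    return f"{base_url}/api/uploader/scan/instance-tables?{query}"


def parse_instance_tables(payload: str) -> List[InstanceTable]:
    """Convert the JSON array from the endpoint into typed records."""
    return [
        InstanceTable(
            table_name=row["tableName"],
            base_path_full=row["basePathFull"],
            total_files=row["totalFiles"],
            status=row["status"],
        )
        for row in json.loads(payload)
    ]
```

The identify, propagate, and push commands would then loop over the returned records, processing one `table_name` at a time.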
488
+
489
+ ## Future Enhancements
490
+
491
+ 1. **Parallel Processing**: Process multiple tables concurrently
492
+ 2. **Table Merging**: Combine tables when directory level changes
493
+ 3. **Selective Processing**: Process only specific tables via CLI flag
494
+ 4. **Table Statistics Dashboard**: View all tables and their stats in one place