npm - @arela/uploader - Versions diffs - 1.0.2 → 1.0.3 - Mend

@arela/uploader 1.0.2 → 1.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24) hide show

package/.env.template +70 -0
package/docs/API_RETRY_MECHANISM.md +338 -0
package/docs/ARELA_IDENTIFY_IMPLEMENTATION.md +489 -0
package/docs/ARELA_IDENTIFY_QUICKREF.md +186 -0
package/docs/ARELA_PROPAGATE_IMPLEMENTATION.md +581 -0
package/docs/ARELA_PROPAGATE_QUICKREF.md +272 -0
package/docs/ARELA_PUSH_IMPLEMENTATION.md +577 -0
package/docs/ARELA_PUSH_QUICKREF.md +322 -0
package/docs/ARELA_SCAN_IMPLEMENTATION.md +373 -0
package/docs/ARELA_SCAN_QUICKREF.md +139 -0
package/docs/DETECTION_ATTEMPT_TRACKING.md +414 -0
package/docs/MIGRATION_UPLOADER_TO_FILE_STATS.md +1020 -0
package/docs/MULTI_LEVEL_DIRECTORY_SCANNING.md +494 -0
package/docs/STATS_COMMAND_SEQUENCE_DIAGRAM.md +287 -0
package/docs/STATS_COMMAND_SIMPLE.md +93 -0
package/package.json +4 -2
package/src/commands/IdentifyCommand.js +486 -0
package/src/commands/PropagateCommand.js +474 -0
package/src/commands/PushCommand.js +473 -0
package/src/commands/ScanCommand.js +516 -0
package/src/config/config.js +177 -7
package/src/file-detection.js +9 -10
package/src/index.js +150 -0
package/src/services/ScanApiService.js +646 -0

package/docs/ARELA_SCAN_QUICKREF.md ADDED Viewed

@@ -0,0 +1,139 @@
+# Arela Scan Quick Reference
+## Setup
+### 1. Configure Backend
+Add `cli_registry` entity to TypeORM and run migration:
+```bash
+cd arela-api
+npm run migration:generate -- -n CreateCliRegistry
+npm run migration:run
+```
+### 2. Configure CLI
+Set environment variables in `.env`:
+```bash
+# Required
+ARELA_COMPANY_SLUG=your_company
+ARELA_SERVER_ID=server01
+UPLOAD_BASE_PATH=/path/to/files
+UPLOAD_SOURCES=2023|2024|2025
+# Optional
+ARELA_BASE_PATH_LABEL=data
+SCAN_EXCLUDE_PATTERNS=.DS_Store,Thumbs.db,desktop.ini
+SCAN_BATCH_SIZE=2000
+# API Configuration
+ARELA_API_URL=http://localhost:3010
+ARELA_API_TOKEN=your-token
+```
+## Commands
+### Scan Filesystem
+```bash
+# Basic scan with throughput display
+arela scan
+# Scan with percentage progress (counts files first)
+arela scan --count-first
+# Scan to specific API target
+arela scan --api cliente
+```
+### View Scan Instances
+```bash
+# List all registered instances
+curl -H "x-api-key: $TOKEN" \
+  http://localhost:3010/api/uploader/scan/instances
+# Get stale instances (no scan > 90 days)
+curl -H "x-api-key: $TOKEN" \
+  "http://localhost:3010/api/uploader/scan/stale-instances?days=90"
+```
+### Deactivate Instance
+```bash
+curl -X PATCH \
+  -H "x-api-key: $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"tableName":"file_stats_company_server_path"}' \
+  http://localhost:3010/api/uploader/scan/deactivate
+```
+## Files Modified
+### Backend
+- ✅ `src/uploader/entities/cli-registry.entity.ts` - New entity
+- ✅ `src/uploader/services/file-stats-table-manager.service.ts` - New service
+- ✅ `src/uploader/services/uploader.service.ts` - Added scan methods
+- ✅ `src/uploader/controllers/uploader.controller.ts` - Added scan endpoints
+- ✅ `src/uploader/uploader.module.ts` - Updated imports
+### CLI
+- ✅ `src/commands/ScanCommand.js` - New command
+- ✅ `src/services/ScanApiService.js` - New API service
+- ✅ `src/config/config.js` - Added scan configuration
+- ✅ `src/index.js` - Registered scan command
+- ✅ `.env.template` - Added scan variables
+- ✅ `docs/ARELA_SCAN_IMPLEMENTATION.md` - Documentation
+## Table Schema
+Each scan instance creates a table:
+```sql
+file_stats_<company>_<server>_<path>
+├── id (uuid)
+├── file_name (varchar)
+├── file_extension (varchar)
+├── directory_path (text)
+├── relative_path (text)
+├── absolute_path (text) [unique]
+├── size_bytes (bigint)
+├── modified_at (timestamp)
+├── scan_timestamp (timestamp)
+└── created_at (timestamp)
+```
+## Troubleshooting
+### Error: Missing configuration
+**Cause**: Required env vars not set
+**Fix**: Set `ARELA_COMPANY_SLUG` and `ARELA_SERVER_ID`
+### Error: Table name collision
+**Cause**: Same identifiers used with different base path
+**Fix**: Change server ID or base path label
+### Error: Cannot connect to API
+**Cause**: Backend not running or wrong URL
+**Fix**: Verify `ARELA_API_URL` and start backend
+## Next Steps
+1. ✅ Implement `arela scan` (DONE)
+2. ⏳ Implement `arela identify` (detect pedimentos from PDFs)
+3. ⏳ Implement `arela propagate` (propagate arela_path)
+4. ⏳ Implement `arela push` (upload files by RFC)
+## Performance Tips
+- Increase `SCAN_BATCH_SIZE` for better throughput (default: 2000)
+- Use `MAX_API_CONNECTIONS` to match backend replicas
+- Run scans during off-peak hours for large datasets
+- Use `--count-first` only when progress percentage is needed

package/docs/DETECTION_ATTEMPT_TRACKING.md ADDED Viewed

@@ -0,0 +1,414 @@
+# Detection Attempt Tracking
+## Overview
+The `arela identify` command now includes intelligent attempt tracking to avoid reprocessing files unnecessarily and provide better debugging information when detection fails.
+## New Fields in file_stats_* Tables
+### Attempt Tracking
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `detection_attempts` | INTEGER | 0 | Number of times detection has been attempted |
+| `max_detection_attempts` | INTEGER | 3 | Maximum attempts before giving up |
+| `is_not_pedimento` | BOOLEAN | FALSE | True if definitely not a pedimento document |
+### Enhanced Error Reporting
+The `detection_error` field now contains categorized, descriptive errors:
+| Error Category | Example | Meaning |
+|----------------|---------|---------|
+| `FILE_NOT_FOUND` | File does not exist on filesystem | File was deleted after scan |
+| `FILE_TOO_LARGE` | File size 75.5MB exceeds 50MB limit | PDF too large to process |
+| `NOT_PEDIMENTO` | Missing key markers: "FORMA SIMPLIFICADA DE PEDIMENTO" | Definitely not a pedimento |
+| `INCOMPLETE_PEDIMENTO` | Missing fields: rfc, numPedimento | Partial match, needs matcher improvement |
+| `PDF_PARSE_ERROR` | Failed to extract text from PDF | Corrupted or encrypted PDF |
+| `TEXT_EXTRACTION_ERROR` | Cannot extract text | Unsupported PDF format |
+| `TIMEOUT` | Detection timeout | Processing took too long |
+## Optimized Query Logic
+### Previous Behavior
+```sql
+-- Retrieved ALL unprocessed PDFs every time
+WHERE file_extension = 'pdf'
+  AND detection_attempted_at IS NULL
+```
+**Problem**: Would keep trying the same files even if they repeatedly failed.
+### New Behavior
+```sql
+-- Only retrieve PDFs that haven't reached max attempts
+WHERE file_extension = 'pdf'
+  AND is_not_pedimento = FALSE
+  AND (detection_attempts < max_detection_attempts OR detection_attempts IS NULL)
+```
+**Benefits**:
+1. **Skips non-pedimentos**: Files marked as `is_not_pedimento = TRUE` are never retrieved again
+2. **Respects attempt limit**: Files that reached `max_detection_attempts` are skipped
+3. **Faster queries**: Uses optimized composite index on `(file_extension, is_not_pedimento, detection_attempts, max_detection_attempts)`
+## Detection Logic Flow
+```
+┌─────────────────────────┐
+│   Fetch PDFs to detect  │
+│   (respects attempts)   │
+└───────────┬─────────────┘
+            │
+            ▼
+┌─────────────────────────┐
+│   Check file exists     │
+└───────────┬─────────────┘
+            │
+            ├── File not found ────────────┐
+            │                              │
+            ▼                              │
+┌─────────────────────────┐               │
+│  Check file size        │               │
+│  (max 50MB)             │               │
+└───────────┬─────────────┘               │
+            │                              │
+            ├── Too large ────────────────┤
+            │                              │
+            ▼                              │
+┌─────────────────────────┐               │
+│  Extract text & detect  │               │
+└───────────┬─────────────┘               │
+            │                              │
+            ├── Pedimento found ──────────┤
+            │   (success!)                 │
+            │                              │
+            ├── No pedimento markers ─────┤
+            │   (is_not_pedimento=TRUE)    │
+            │                              │
+            ├── Partial match ────────────┤
+            │   (missing fields)           │
+            │                              │
+            └── Error ────────────────────┤
+                                           │
+                                           ▼
+                                ┌──────────────────────┐
+                                │  Update database:    │
+                                │  - detection_attempts │
+                                │  - detection_error    │
+                                │  - is_not_pedimento   │
+                                └──────────────────────┘
+```
+## Error Categories Explained
+### 1. **FILE_NOT_FOUND**
+```
+FILE_NOT_FOUND: File does not exist on filesystem. May have been moved or deleted after scan.
+```
+**Cause**: File was present during `arela scan` but no longer exists
+**Action**: No retry needed. File should be rescanned if it returns
+**Attempt behavior**: Counts toward max attempts
+### 2. **FILE_TOO_LARGE**
+```
+FILE_TOO_LARGE: File size 75.5MB exceeds 50MB limit.
+```
+**Cause**: PDF exceeds 50MB processing limit
+**Action**: Increase limit if needed, or process manually
+**Attempt behavior**: Counts toward max attempts
+### 3. **NOT_PEDIMENTO**
+```
+NOT_PEDIMENTO: File does not match pedimento-simplificado pattern. Missing key markers: "FORMA SIMPLIFICADA DE PEDIMENTO".
+```
+**Cause**: File doesn't contain pedimento markers
+**Action**: File is marked `is_not_pedimento = TRUE` and never retrieved again
+**Attempt behavior**: Counts toward max attempts, but file is excluded from future queries
+### 4. **INCOMPLETE_PEDIMENTO**
+```
+INCOMPLETE_PEDIMENTO: Detected as potential pedimento but missing fields: rfc, numPedimento. Matcher may need improvement.
+```
+**Cause**: File has some pedimento characteristics but missing required fields
+**Action**: Review matcher patterns in `pedimento-simplificado.js`
+**Attempt behavior**: Will retry up to max attempts (may succeed if matcher improves)
+### 5. **PDF_PARSE_ERROR**
+```
+PDF_PARSE_ERROR: Failed to extract text from PDF: Encrypted PDF
+```
+**Cause**: Corrupted, encrypted, or malformed PDF
+**Action**: Check file integrity, decrypt if password-protected
+**Attempt behavior**: Will retry up to max attempts
+### 6. **TEXT_EXTRACTION_ERROR**
+```
+TEXT_EXTRACTION_ERROR: Cannot extract text: Unsupported format
+```
+**Cause**: PDF format not supported by text extractor
+**Action**: Convert PDF to standard format
+**Attempt behavior**: Will retry up to max attempts
+## Statistics Display
+### Initial Stats
+```
+📈 Detection Status:
+   Total PDFs: 1000
+   Detected: 850
+   Pending: 100
+   Not Pedimento: 40
+   Max Attempts Reached: 10
+   Errors: 25
+```
+**Pending**: Files that can still be processed (haven't reached max attempts)
+**Not Pedimento**: Files marked as definitely not pedimentos (skipped in future runs)
+**Max Attempts Reached**: Files that exhausted retry attempts
+### Final Stats
+```
+✅ Identification Complete!
+📊 Results:
+   Processed: 100 files
+   Pedimentos Detected: 85
+   Errors: 5
+   Duration: 11.5s
+   Speed: 87 files/sec
+📈 Final Status:
+   Total PDFs: 1000
+   Detected: 935
+   Pending: 0
+   Not Pedimento: 50
+   Max Attempts Reached: 15
+   Errors: 30
+⚠️  15 PDFs reached max detection attempts.
+   Run with increased max_detection_attempts if needed, or review matcher patterns.
+```
+## Debugging Failed Detections
+### Query Files by Error Category
+```sql
+-- Find files with specific error type
+SELECT relative_path, detection_error, detection_attempts
+FROM cli.file_stats_acme_corp_nas01_data
+WHERE detection_error LIKE 'INCOMPLETE_PEDIMENTO%'
+ORDER BY detection_attempts DESC
+LIMIT 50;
+```
+### Find Files Marked as Not Pedimento
+```sql
+-- Review files marked as not pedimentos
+SELECT relative_path, detection_error, size_bytes
+FROM cli.file_stats_acme_corp_nas01_data
+WHERE is_not_pedimento = TRUE
+ORDER BY size_bytes DESC
+LIMIT 50;
+```
+### Find Files That Reached Max Attempts
+```sql
+-- Check files that exhausted retries
+SELECT relative_path, detection_error, detection_attempts, max_detection_attempts
+FROM cli.file_stats_acme_corp_nas01_data
+WHERE detection_attempts >= max_detection_attempts
+  AND detected_type IS NULL
+ORDER BY detection_attempts DESC
+LIMIT 50;
+```
+### Reset Specific Files for Retry
+```sql
+-- Reset detection attempts for specific files
+UPDATE cli.file_stats_acme_corp_nas01_data
+SET
+  detection_attempts = 0,
+  detection_attempted_at = NULL,
+  detection_error = NULL,
+  is_not_pedimento = FALSE
+WHERE relative_path LIKE '2024/3019796/%';
+```
+### Increase Max Attempts for All Files
+```sql
+-- Allow more retries for all files
+UPDATE cli.file_stats_acme_corp_nas01_data
+SET max_detection_attempts = 5
+WHERE max_detection_attempts = 3;
+```
+## Performance Impact
+### Query Performance
+**Before** (without attempt tracking):
+```sql
+EXPLAIN ANALYZE
+SELECT * FROM cli.file_stats_company_server_path
+WHERE file_extension = 'pdf' AND detection_attempted_at IS NULL;
+-- Seq Scan on file_stats_... (cost=0.00..1250.00 rows=5000)
+-- Planning Time: 0.5ms
+-- Execution Time: 45.2ms
+```
+**After** (with optimized index):
+```sql
+EXPLAIN ANALYZE
+SELECT * FROM cli.file_stats_company_server_path
+WHERE file_extension = 'pdf'
+  AND is_not_pedimento = FALSE
+  AND detection_attempts < max_detection_attempts;
+-- Index Scan using idx_..._detection_pending (cost=0.42..85.50 rows=1000)
+-- Planning Time: 0.3ms
+-- Execution Time: 8.1ms
+```
+**Improvement**: ~5.5x faster query execution
+### Storage Impact
+**Additional storage per record**:
+- `detection_attempts`: 4 bytes (INTEGER)
+- `max_detection_attempts`: 4 bytes (INTEGER)
+- `is_not_pedimento`: 1 byte (BOOLEAN)
+- Total: **9 bytes per record**
+For 1 million PDFs: ~9 MB additional storage (negligible)
+### Index Size
+**New partial indexes**:
+- `idx_detection_pending`: ~50-100 KB per 10,000 pending files
+- `idx_detection_errors`: ~20-50 KB per 10,000 errors
+Total index overhead: **< 1 MB per 100,000 files**
+## Best Practices
+### 1. **Review Error Patterns**
+Regularly check error distribution:
+```sql
+SELECT
+  SPLIT_PART(detection_error, ':', 1) as error_category,
+  COUNT(*) as count
+FROM cli.file_stats_company_server_path
+WHERE detection_error IS NOT NULL
+GROUP BY error_category
+ORDER BY count DESC;
+```
+### 2. **Adjust Max Attempts**
+For files with `INCOMPLETE_PEDIMENTO` errors, increase attempts if improving matchers:
+```sql
+UPDATE cli.file_stats_company_server_path
+SET max_detection_attempts = 5
+WHERE detection_error LIKE 'INCOMPLETE_PEDIMENTO%'
+  AND detection_attempts >= max_detection_attempts;
+```
+### 3. **Review Not Pedimento Files**
+Periodically verify files marked as not pedimentos:
+```sql
+-- Sample random not-pedimento files for manual review
+SELECT relative_path, size_bytes, detection_error
+FROM cli.file_stats_company_server_path
+WHERE is_not_pedimento = TRUE
+ORDER BY RANDOM()
+LIMIT 20;
+```
+### 4. **Clean Up Old Errors**
+After fixing matcher bugs, reset affected files:
+```sql
+-- Reset files with specific error after matcher improvement
+UPDATE cli.file_stats_company_server_path
+SET
+  detection_attempts = 0,
+  detection_error = NULL,
+  is_not_pedimento = FALSE
+WHERE detection_error LIKE 'INCOMPLETE_PEDIMENTO: Missing fields: patente%'
+  AND detection_attempts >= max_detection_attempts;
+```
+## Matcher Improvement Workflow
+When `INCOMPLETE_PEDIMENTO` errors indicate matcher issues:
+1. **Identify patterns**:
+   ```sql
+   SELECT detection_error, COUNT(*)
+   FROM cli.file_stats_company_server_path
+   WHERE detection_error LIKE 'INCOMPLETE_PEDIMENTO%'
+   GROUP BY detection_error;
+   ```
+2. **Sample affected files**:
+   ```sql
+   SELECT absolute_path, detection_error
+   FROM cli.file_stats_company_server_path
+   WHERE detection_error LIKE 'INCOMPLETE_PEDIMENTO: Missing fields: patente%'
+   LIMIT 5;
+   ```
+3. **Review PDFs manually** to understand patterns
+4. **Update matcher** in `src/document-types/pedimento-simplificado.js`
+5. **Reset affected files**:
+   ```sql
+   UPDATE cli.file_stats_company_server_path
+   SET detection_attempts = 0, detection_error = NULL
+   WHERE detection_error LIKE 'INCOMPLETE_PEDIMENTO: Missing fields: patente%';
+   ```
+6. **Re-run detection**:
+   ```bash
+   arela identify
+   ```
+## Conclusion
+The attempt tracking system provides:
+- **Better performance**: Avoids reprocessing impossible files
+- **Better debugging**: Categorized errors show exactly what's wrong
+- **Better visibility**: Statistics show what's pending vs. hopeless
+- **Better efficiency**: Optimized indexes make queries 5x faster
+This enables iterative improvement of detection matchers while maintaining system performance.