@danielarndt0/cnpj-db-loader 2.2.0
- package/LICENSE +21 -0
- package/README.md +119 -0
- package/dist/cli.d.ts +1 -0
- package/dist/cli.js +10187 -0
- package/dist/cli.js.map +1 -0
- package/dist/index.d.ts +969 -0
- package/dist/index.js +8012 -0
- package/dist/index.js.map +1 -0
- package/docs/architecture.md +97 -0
- package/docs/cli.md +46 -0
- package/docs/commands.md +65 -0
- package/docs/federal-revenue.md +224 -0
- package/docs/quarantine.md +40 -0
- package/docs/sanitize.md +49 -0
- package/docs/usage.md +149 -0
- package/package.json +61 -0
|
@@ -0,0 +1,97 @@
# Architecture

## What matters in this version

The current CLI is centered on one practical job: move Receita Federal CNPJ data from downloaded archives into PostgreSQL safely.

## Main layers

| Folder           | Purpose                                                        |
| ---------------- | -------------------------------------------------------------- |
| `src/cli`        | Command registration and terminal output                       |
| `src/services`   | Real application behavior used by the CLI                      |
| `src/dictionary` | Dataset layout definitions derived from the Receita dictionary |
| `src/core`       | Shared errors, prompts, and utilities                          |
| `src/config`     | Local configuration helpers and paths                          |

## Import design

The import pipeline now uses:

- deterministic dataset order to respect foreign keys
- an exact preparatory scan that counts total source rows and planned batches before the first write
- streaming file reads to avoid loading the full dataset into RAM
- an optional sanitize step that removes known low-level byte issues before import starts
- COPY-based staged writes for the large datasets followed by staged-to-final materialization
- conflict-safe upserts for the smaller domain datasets
- `import_plans` and `import_plan_files` to persist exact import plans and avoid recounting the same source files on resume
- `import_checkpoints` to resume a failed load without clearing the whole database
- `import_materialization_checkpoints` to resume staged-to-final consolidation by dataset and chunk
- `import_quarantine` to store invalid rows and continue long-running imports
- a dedicated `quarantine` service to inspect quarantine rows without touching the import pipeline
- conservative load units to reduce memory pressure and prevent giant rollbacks
- compatibility with simplified final schemas that keep derived identifiers as regular columns when needed
- remote Federal Revenue WebDAV checks/downloads plus local manifest, retry, cleanup, status, and sync locking as an additive pre-pipeline service

## Import modules

The importer is now split into focused modules so future performance work can replace parts of the pipeline without rewriting the whole command:

- `planner`: selects datasets, collects source files, reuses or creates persisted import plans
- `source-reader`: streams validated files by byte offset for resume-safe reads
- `parser`: converts raw Receita lines into delimited field arrays
- `normalizer`: validates field counts and transforms parsed rows into database-ready records
- `staging-writer`: chooses the current write target and uses COPY for staged bulk loads
- `materializer`: consolidates staged datasets into the final relational schema with ordered upserts and resumable chunk checkpoints
  - the materializer now reconciles missing lookup/domain codes from staged datasets before final upserts so late foreign-key failures do not stop the consolidation flow on placeholder-compatible domains
  - materialization progress is now exposed explicitly to the CLI progress reporter and to JSONL heartbeat logs so long-running final upserts do not look stalled
- `finalizer`: centralizes performance tracking and import summary generation
- `checkpoint-manager`: owns checkpoint resume, persistence, and failed-file markers
- `quarantine-writer`: stores bad rows without stopping long imports
- `runner`: orchestrates the current import flow while keeping the service entry point small
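
The `parser`, `normalizer`, and `quarantine-writer` roles listed above can be pictured with a few pure functions. This is a minimal sketch, not the package's implementation: `parseLine`, `normalizeRow`, the `;` delimiter, the double-quote handling, and the result shape are all assumptions made for illustration.

```typescript
// Hypothetical sketch: names, delimiter, and quoting rules are assumptions,
// not the package's actual parser.
type NormalizeResult =
  | { ok: true; fields: string[] }
  | { ok: false; reason: string; raw: string };

// Split one raw line into fields, honoring double-quoted values.
function parseLine(line: string, delimiter = ";"): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (ch === '"') {
      if (inQuotes && line.charAt(i + 1) === '"') {
        current += '"'; // doubled quote inside a quoted field
        i++;
      } else {
        inQuotes = !inQuotes;
      }
    } else if (ch === delimiter && !inQuotes) {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Validate the field count. A mismatch is returned as data so the caller can
// route the row to quarantine instead of aborting a long import.
function normalizeRow(raw: string, expectedFields: number): NormalizeResult {
  const fields = parseLine(raw);
  if (fields.length !== expectedFields) {
    return {
      ok: false,
      reason: `expected ${expectedFields} fields, got ${fields.length}`,
      raw,
    };
  }
  return { ok: true, fields: fields.map((field) => field.trim()) };
}
```

Returning failures as values rather than throwing is one way to keep a single bad row from interrupting the stream; the real modules may signal errors differently.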

The project now also generates dedicated staging tables for large datasets. The CLI exposes both a one-shot command (`import`) and split commands (`import load`, `import materialize`). Staging cleanup is handled explicitly through `database cleanup staging`. The write path sends the heavy datasets to staging tables first with only light normalization, then consolidates them into a simplified final schema in dependency order while keeping the smaller catalog datasets on the final schema directly. The final schema now stays closer to the Receita layout so the API can derive richer views later without forcing every first load to pay that cost inside PostgreSQL.

## Staging schema

The generated SQL schema supports lightweight `staging_*` tables for the large datasets that now move through the staged bulk-load flow before controlled final materialization.

These staging tables are intentionally:

- `UNLOGGED` for faster write-heavy workloads
- free of foreign keys and secondary indexes
- free of generated columns and upsert-only constraints
- shaped to mirror the validated dataset rows with minimal insert overhead
- equipped with `staging_id` so the materializer can checkpoint chunk progress safely
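
The keyset loop below sketches why an increasing `staging_id` makes chunked materialization resumable. It runs over an in-memory array rather than PostgreSQL. `StagedRow`, `Checkpoint`, and `materializeChunks` are hypothetical names, and the real materializer may track progress differently.

```typescript
// Hypothetical sketch: an in-memory stand-in for a staging table.
interface StagedRow {
  stagingId: number;
  payload: string;
}

interface Checkpoint {
  lastStagingId: number; // highest staging_id already materialized
}

// Process rows in ascending staging_id order, one chunk at a time, advancing
// the checkpoint only after a chunk has been handed to the writer.
function materializeChunks(
  staged: StagedRow[],
  checkpoint: Checkpoint,
  chunkSize: number,
  writeChunk: (rows: StagedRow[]) => void,
): Checkpoint {
  const ordered = [...staged].sort((a, b) => a.stagingId - b.stagingId);
  let current = checkpoint;
  for (;;) {
    // Similar in spirit to:
    //   WHERE staging_id > $last ORDER BY staging_id LIMIT $chunkSize
    const chunk = ordered
      .filter((row) => row.stagingId > current.lastStagingId)
      .slice(0, chunkSize);
    if (chunk.length === 0) return current;
    writeChunk(chunk);
    current = { lastStagingId: Math.max(...chunk.map((row) => row.stagingId)) };
  }
}
```

A rerun that starts from a saved checkpoint skips every row at or below `lastStagingId`, so an interrupted run repeats at most the chunk that was in flight.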

## Federal Revenue pre-pipeline

The Federal Revenue integration is intentionally kept as a pre-pipeline module. It lives under `src/services/federal-revenue` and is exposed by `src/cli/commands/register-federal-revenue.ts`. The module is responsible for:

- listing monthly `YYYY-MM` references from the public WebDAV share
- selecting the latest, current, or explicit monthly reference
- listing only `.zip` files inside the selected reference
- downloading files with `.part` temporary files, retry attempts, and skip-on-existing behavior
- writing a local reference manifest with downloaded, failed, partial, and missing file state
- exposing `status`, `retry`, and `clean` so automation can inspect and repair local references safely
- using a local sync lock so two full sync processes do not use the same reference folder at the same time
- handing the completed download folder to the existing extraction, validation, sanitization, and import services during `federal-revenue sync`

Redis, background workers, and schedulers are not part of this CLI module. Those concerns should remain outside the loader if an external runner/orchestrator is added later.

## Current execution flow

```text
federal-revenue check/download/status/retry/clean/sync -> inspect -> extract -> validate -> sanitize -> db/schema -> import
```

## Internal import flow

```text
planner -> source-reader -> parser -> normalizer -> staging-writer -> materializer -> finalizer
| |
+-> checkpoint-manager +-> quarantine-writer
+-> materialization-checkpoints
```

- Materialization now stores lightweight staging validation markers (row count and max staging id) in the materialization checkpoint table so reruns can verify the live staging state quickly and reuse lookup reconciliation when the staging snapshot is unchanged. The runtime validates that the required import tables already exist but no longer creates or alters them automatically.
package/docs/cli.md
ADDED
|
@@ -0,0 +1,46 @@
# CLI

## Public command surface

```bash
cnpj-db-loader federal-revenue check [reference] [--reference <yyyy-mm>] [--current]
cnpj-db-loader federal-revenue download [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--retries <number>] [--overwrite] [-f]
cnpj-db-loader federal-revenue status [reference] [--reference <yyyy-mm>] [--current] [--output <path>]
cnpj-db-loader federal-revenue retry [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--retries <number>] [--overwrite] [-f]
cnpj-db-loader federal-revenue clean [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--partials | --failed | --all] [-f]
cnpj-db-loader federal-revenue sync [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--extract-output <path>] [--sanitize-output <path>] [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [--force-lock] [-f]
cnpj-db-loader inspect <input>
cnpj-db-loader extract <input> [--output <path>]
cnpj-db-loader validate <input>
cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [-f]
cnpj-db-loader schema print [--profile <profile>]
cnpj-db-loader schema generate [--name <name>] [--output <path>] [--profile <profile>]
cnpj-db-loader database config set <url>
cnpj-db-loader database config show
cnpj-db-loader database config test [--db-url <url>]
cnpj-db-loader database config reset [--force]
cnpj-db-loader database cleanup staging [--db-url <url>] [--dataset <name>] [--validated-path <path>] [--force]
cnpj-db-loader database cleanup materialized [--db-url <url>] [--dataset <name>] [--force]
cnpj-db-loader database cleanup checkpoints [--db-url <url>] [--phase <phase>] [--dataset <name>] [--validated-path <path>] [--plan-id <id>] [--force]
cnpj-db-loader database cleanup plans [--db-url <url>] [--validated-path <path>] [--plan-id <id>] [--force]
cnpj-db-loader import <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
cnpj-db-loader import load <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--verbose-progress] [-f]
cnpj-db-loader import materialize <input> [--db-url <url>] [--dataset <name>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
cnpj-db-loader doctor [--input <path>] [--db-url <url>]
cnpj-db-loader quarantine stats [--dataset <name>] [--category <name>] [--stage <name>] [--retryable] [--terminal]
cnpj-db-loader quarantine list [--dataset <name>] [--category <name>] [--stage <name>] [--retryable] [--terminal] [--limit <number>] [--after-id <id>]
cnpj-db-loader quarantine show <id> [--db-url <url>]
```

## Design notes

- The public CLI stays intentionally small, but the import workflow now exposes split phases for automation.
- `import` runs the whole pipeline, while `import load` and `import materialize` keep staging and final consolidation independently runnable.
- Placeholder commands are not exposed.
- Positional arguments are preferred when they make commands easier to type.
- Destructive database maintenance actions ask for confirmation unless `--force` is provided.

- `federal-revenue` (alias `revenue`) is additive: it only automates the remote monthly CNPJ download phase and then reuses the existing extract, validate, sanitize, and import services.
- Federal Revenue downloads keep completed files by default and write incomplete transfers as `.part` files until the file is fully validated.
- `status`, `retry`, and `clean` use the local reference manifest so a future external runner can inspect and resume the workflow without duplicating loader rules.
- `sync` creates a local lock file to prevent two full sync operations from using the same reference folder at the same time.
package/docs/commands.md
ADDED
|
@@ -0,0 +1,65 @@
# Commands reference

| Command                         | Purpose                                                                                                                                                           |
| ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `federal-revenue check`         | Check the selected or latest Federal Revenue monthly CNPJ reference and list the remote ZIP files.                                                                |
| `federal-revenue download`      | Download the selected Federal Revenue monthly CNPJ ZIP files with retries, `.part` files, and skip-on-existing behavior.                                          |
| `federal-revenue status`        | Read the local Federal Revenue manifest and report downloaded, failed, partial, and missing files.                                                                |
| `federal-revenue retry`         | Retry incomplete Federal Revenue files without redownloading completed files.                                                                                     |
| `federal-revenue clean`         | Clean local Federal Revenue `.part` files, failed/partial files, or a whole reference folder.                                                                     |
| `federal-revenue sync`          | Download, extract, validate, sanitize, and import the selected monthly CNPJ reference using the existing loader pipeline with a local sync lock.                  |
| `inspect <input>`               | Detect whether the input is zipped, extracted, mixed, or empty.                                                                                                   |
| `extract <input>`               | Extract every ZIP archive found inside the input directory.                                                                                                       |
| `validate <input>`              | Validate an extracted dataset tree.                                                                                                                               |
| `sanitize <input>`              | Prepare a sanitized dataset tree before import.                                                                                                                   |
| `schema print`                  | Print a generated PostgreSQL schema profile (`full`, `final`, or `staging`) to stdout. The final profile is simplified for fast first-load materialization.       |
| `schema generate`               | Write a generated schema profile to the current working directory by default.                                                                                     |
| `database config set <url>`     | Persist the default PostgreSQL URL.                                                                                                                               |
| `database config show`          | Show the saved PostgreSQL URL.                                                                                                                                    |
| `database config test`          | Test the connection using the saved or overridden URL.                                                                                                            |
| `database config reset`         | Remove the saved PostgreSQL URL after confirmation.                                                                                                               |
| `database cleanup staging`      | Truncate staging tables and optionally clear linked materialization checkpoints for a validated path.                                                             |
| `database cleanup materialized` | Truncate simplified final relational tables populated by materialization in safe order for the current schema.                                                    |
| `database cleanup checkpoints`  | Clear load checkpoints, materialization checkpoints, or both without truncating staging or final tables.                                                          |
| `database cleanup plans`        | Delete saved import plans. Related plan files and materialization checkpoints are removed by database cascade.                                                    |
| `import <input>`                | Run the full pipeline: plan, load validated files into staging/direct final targets, materialize staged datasets into final tables, and finalize the import plan. |
| `import load <input>`           | Prepare the plan and run only the load phase. Heavy datasets stop in `staging_*`; domain datasets still upsert directly into the final schema.                    |
| `import materialize <input>`    | Resume from the saved import plan and materialize staged datasets into the final relational tables with resumable chunks.                                         |
| `doctor`                        | Run a quick environment diagnosis.                                                                                                                                |
| `quarantine stats`              | Show aggregate counts for the `import_quarantine` table.                                                                                                          |
| `quarantine list`               | List quarantined rows with optional filters.                                                                                                                      |
| `quarantine show`               | Show one quarantined row in detail.                                                                                                                               |

## Examples

```bash
cnpj-db-loader federal-revenue check
cnpj-db-loader federal-revenue check 2026-05
cnpj-db-loader federal-revenue download --output ./downloads --force
cnpj-db-loader federal-revenue sync --output ./downloads --db-url "postgresql://user:password@localhost:5432/cnpj" --force
cnpj-db-loader inspect ./downloads
cnpj-db-loader extract ./downloads
cnpj-db-loader validate ./downloads/extracted
cnpj-db-loader sanitize ./downloads/extracted
cnpj-db-loader schema generate --profile full --name receita-v2 --output ./artifacts/sql
cnpj-db-loader schema generate --profile staging
cnpj-db-loader schema print --profile final
cnpj-db-loader database config set "postgresql://user:password@localhost:5432/cnpj"
cnpj-db-loader database config test
cnpj-db-loader database cleanup staging --validated-path ./downloads/sanitized --force
cnpj-db-loader database cleanup materialized --dataset companies --force
cnpj-db-loader database cleanup checkpoints --phase materialization --validated-path ./downloads/sanitized --force
cnpj-db-loader database cleanup plans --validated-path ./downloads/sanitized --force
cnpj-db-loader import ./downloads/sanitized
cnpj-db-loader import ./downloads/sanitized --db-url "postgresql://user:password@localhost:5432/cnpj"
cnpj-db-loader import ./downloads/sanitized --dataset companies --load-batch-size 500
cnpj-db-loader import load ./downloads/sanitized --load-batch-size 20000
cnpj-db-loader import materialize ./downloads/sanitized --materialize-batch-size 50000
cnpj-db-loader database cleanup staging --validated-path ./downloads/sanitized
cnpj-db-loader import ./downloads/sanitized --force
cnpj-db-loader quarantine stats
cnpj-db-loader quarantine stats --dataset establishments --category invalid_utf8_sequence --retryable
cnpj-db-loader quarantine list --dataset establishments --limit 10
cnpj-db-loader quarantine list --terminal --after-id 500
cnpj-db-loader quarantine show 42
```
@@ -0,0 +1,224 @@
# Federal Revenue integration

The `federal-revenue` command group automates the remote monthly CNPJ dataset phase for the Brazilian Federal Revenue public share.

This feature is additive. It does not replace the stable local commands. The full `sync` command uses the same internal services already used by the manual flow:

```text
check/download -> extract -> validate -> sanitize -> import
```

The command also has the shorter alias `revenue`.

## Commands

```bash
cnpj-db-loader federal-revenue check
cnpj-db-loader federal-revenue download --output ./downloads --force
cnpj-db-loader federal-revenue status --output ./downloads
cnpj-db-loader federal-revenue retry --output ./downloads --force
cnpj-db-loader federal-revenue clean --output ./downloads --partials --force
cnpj-db-loader federal-revenue sync --output ./downloads --db-url "postgresql://user:password@localhost:5432/cnpj" --force
```

Alias examples:

```bash
cnpj-db-loader revenue check
cnpj-db-loader revenue status 2026-05 --output ./downloads
cnpj-db-loader revenue retry 2026-05 --output ./downloads --force
```

## Reference selection

By default, remote commands list the public share and select the latest available folder in the `YYYY-MM` format.

Use an explicit reference when you need a deterministic month:

```bash
cnpj-db-loader federal-revenue check --reference 2026-05
cnpj-db-loader federal-revenue check 2026-05
cnpj-db-loader federal-revenue download --reference 2026-05 --output ./downloads --force
cnpj-db-loader federal-revenue download 2026-05 --output ./downloads --force
```

Use the current calendar month when you want the command to fail if that month has not been published yet:

```bash
cnpj-db-loader federal-revenue check --current
```

If an explicit or current reference does not exist in the public share, the command fails before listing or downloading files and reports the latest available reference:

```text
VALIDATION_ERROR Federal Revenue reference not found: 2026-06. Latest available reference is 2026-05.
```

Running without `--reference`, without `[reference]`, and without `--current` keeps the default behavior of selecting the latest published reference.
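
The selection rules in this section can be summarized in one function. This is a hedged sketch, not the package's code: `selectReference`, `ReferenceOptions`, and the use of UTC for "current month" are assumptions. Only the error message format is taken from the example above.

```typescript
// Hypothetical sketch of reference selection; names are assumptions.
interface ReferenceOptions {
  reference?: string; // explicit YYYY-MM ([reference] or --reference)
  current?: boolean; // --current
}

const REFERENCE_PATTERN = /^\d{4}-(0[1-9]|1[0-2])$/;

function selectReference(
  available: string[],
  options: ReferenceOptions,
  now: Date,
): string {
  // YYYY-MM strings sort chronologically when compared as plain text.
  const published = available
    .filter((name) => REFERENCE_PATTERN.test(name))
    .sort();
  const latest = published[published.length - 1];
  if (latest === undefined) {
    throw new Error("No Federal Revenue references are published.");
  }

  let wanted: string | undefined = options.reference;
  if (wanted === undefined && options.current) {
    // Assumption: the calendar month is taken in UTC.
    const monthNumber = now.getUTCMonth() + 1;
    const month = monthNumber < 10 ? `0${monthNumber}` : `${monthNumber}`;
    wanted = `${now.getUTCFullYear()}-${month}`;
  }

  if (wanted === undefined) return latest; // default: latest published reference
  if (published.indexOf(wanted) === -1) {
    throw new Error(
      `Federal Revenue reference not found: ${wanted}. Latest available reference is ${latest}.`,
    );
  }
  return wanted;
}
```

Resolving the reference before any listing or download happens matches the documented behavior of failing early when the requested month is not published.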

## Download behavior

`download` creates a child directory named after the selected reference inside the configured output root.

For example:

```bash
cnpj-db-loader federal-revenue download --output ./downloads --reference 2026-05 --force
```

writes files to:

```text
./downloads/2026-05
```

The downloader:

- lists `.zip` files from the selected monthly reference
- skips a completed local file when its size matches the remote size
- writes incomplete transfers as `<file>.part`
- retries each failed file before marking it as failed
- validates local size against the remote WebDAV size when available
- writes a local manifest for status, retry, clean, and future runner automation
- redownloads a completed local file only when `--overwrite` is passed
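
The skip-on-existing and `.part` rules above can be sketched as two small helpers. `decideDownload`, `LocalState`, and `partialPath` are hypothetical names. How the real downloader treats a missing remote size is not documented here, so that branch is an explicit assumption.

```typescript
// Hypothetical sketch of the skip-on-existing rule; names are assumptions.
type DownloadDecision = "skip" | "download";

interface LocalState {
  completedSize?: number; // size of the finished local file, if one exists
}

function decideDownload(
  local: LocalState,
  remoteSize: number | undefined,
  overwrite: boolean,
): DownloadDecision {
  if (overwrite) return "download";
  if (local.completedSize === undefined) return "download";
  // Assumption: when the remote size is unavailable there is nothing to
  // compare against, so this sketch keeps the existing completed file.
  if (remoteSize === undefined) return "skip";
  return local.completedSize === remoteSize ? "skip" : "download";
}

// Incomplete transfers live next to the final name with a .part suffix.
function partialPath(finalPath: string): string {
  return `${finalPath}.part`;
}
```

Writing to a `.part` path and renaming only after the size check means a completed file name never points at a truncated download.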

## Local manifest

Each downloaded reference receives a local operational manifest:

```text
<output>/<reference>/.cnpj-db-loader/federal-revenue/manifest.json
```

The manifest tracks:

- selected reference
- remote base URL
- output path
- file names and paths
- remote size and local size
- local status: `downloaded`, `failed`, `partial`, or `missing`
- last command and last status
- error message when a file fails

This state is intentionally local and file-based. It does not require Redis or PostgreSQL.

## Status

Use `status` to inspect the local reference state without starting a new download:

```bash
cnpj-db-loader federal-revenue status 2026-05 --output ./downloads
```

The command exits with code `0` when the local reference is complete. It exits with code `1` when the manifest is missing or when at least one file is failed, partial, or missing.
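
For automation that reads the manifest directly, the exit-code rule can be expressed as a small predicate. The types below are an assumed shape inferred from the tracked fields listed earlier. The actual `manifest.json` keys are not documented in this file and may differ.

```typescript
// Hypothetical manifest shape; field names are assumptions, not the real schema.
type FileStatus = "downloaded" | "failed" | "partial" | "missing";

interface ManifestFile {
  name: string;
  remoteSize?: number;
  localSize?: number;
  status: FileStatus;
  error?: string;
}

interface ReferenceManifest {
  reference: string;
  files: ManifestFile[];
}

// Mirrors the documented rule: 0 only when a manifest exists and every
// tracked file is downloaded; 1 otherwise.
// Note: an empty file list yields 0 in this sketch.
function statusExitCode(manifest: ReferenceManifest | undefined): 0 | 1 {
  if (manifest === undefined) return 1;
  return manifest.files.every((file) => file.status === "downloaded") ? 0 : 1;
}
```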

## Retry

Use `retry` to download only incomplete files tracked by the manifest:

```bash
cnpj-db-loader federal-revenue retry 2026-05 --output ./downloads --force
```

Completed files are kept. Failed, partial, and missing files are retried according to `--retries`.
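
A bounded retry loop of this kind can be sketched as follows. The helper names are hypothetical, the loop is synchronous only to keep the example short, and it assumes `--retries` means the maximum number of attempts per file, which this document does not state explicitly.

```typescript
// Hypothetical sketch; treats `retries` as the maximum number of attempts
// per file (an assumption about the meaning of --retries).
function attemptWithRetries(
  attempt: (attemptNumber: number) => boolean,
  retries = 3,
): { ok: boolean; attempts: number } {
  for (let attemptNumber = 1; attemptNumber <= retries; attemptNumber++) {
    if (attempt(attemptNumber)) return { ok: true, attempts: attemptNumber };
  }
  return { ok: false, attempts: Math.max(retries, 0) };
}

// Only incomplete files are retried; files already marked as downloaded are kept.
function filesToRetry(statuses: Record<string, string>): string[] {
  return Object.keys(statuses).filter(
    (name) => statuses[name] !== "downloaded",
  );
}
```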

## Clean

Use `clean` for local maintenance:

```bash
cnpj-db-loader federal-revenue clean 2026-05 --output ./downloads --partials --force
cnpj-db-loader federal-revenue clean 2026-05 --output ./downloads --failed --force
cnpj-db-loader federal-revenue clean 2026-05 --output ./downloads --all --force
```

Cleanup modes:

| Mode         | Behavior                                                                              |
| ------------ | ------------------------------------------------------------------------------------- |
| `--partials` | Removes only `.part` files.                                                           |
| `--failed`   | Removes failed and partial files tracked by the manifest, then marks them as missing. |
| `--all`      | Removes the entire local reference folder, including ZIP files and manifest state.    |

Only one cleanup mode can be used at a time.
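
The one-mode-at-a-time rule could be enforced with a check like the one below. `resolveCleanMode` and `CleanFlags` are hypothetical names. What the CLI does when no mode flag is passed is not documented here, so the sketch simply returns `undefined` in that case.

```typescript
// Hypothetical sketch of mutually exclusive cleanup flags.
type CleanMode = "partials" | "failed" | "all";

interface CleanFlags {
  partials?: boolean;
  failed?: boolean;
  all?: boolean;
}

function resolveCleanMode(flags: CleanFlags): CleanMode | undefined {
  const candidates: CleanMode[] = ["partials", "failed", "all"];
  const selected = candidates.filter((mode) => flags[mode] === true);
  if (selected.length > 1) {
    const names = selected.map((mode) => `--${mode}`).join(", ");
    throw new Error(`Only one cleanup mode can be used at a time, got: ${names}`);
  }
  // Undefined when no flag was given; the real default is not documented.
  return selected[0];
}
```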

## Sync lock

`sync` creates a local lock file before running the full pipeline:

```text
<output>/<reference>/.cnpj-db-loader/federal-revenue/sync.lock
```

This prevents two sync processes from using the same reference folder at the same time.

If a previous process was interrupted and the lock is stale, use `--force-lock` only after confirming that no other sync is running:

```bash
cnpj-db-loader federal-revenue sync 2026-05 --output ./downloads --force-lock --force
```

## Full sync

`sync` runs the remote download and then the local loader pipeline:

```bash
cnpj-db-loader federal-revenue sync \
  --output ./downloads \
  --db-url "postgresql://user:password@localhost:5432/cnpj" \
  --load-batch-size 500 \
  --materialize-batch-size 50000 \
  --verbose-progress \
  --force
```

Custom output directories can be used when an automation needs fixed paths:

```bash
cnpj-db-loader federal-revenue sync \
  --reference 2026-05 \
  --output ./downloads \
  --extract-output ./work/2026-05/extracted \
  --sanitize-output ./work/2026-05/sanitized \
  --force
```

## Exit codes

| Command    | Exit code `0`                                        | Exit code `1`                                                            |
| ---------- | ---------------------------------------------------- | ------------------------------------------------------------------------ |
| `check`    | Reference and file list were resolved.               | Invalid reference, missing remote reference, or WebDAV error.            |
| `download` | Download completed without failed files.             | One or more files failed.                                                |
| `status`   | Local manifest exists and every file is downloaded.  | Manifest missing or at least one file is failed/partial/missing.         |
| `retry`    | Retry finished with no failed/partial/missing files. | At least one file is still failed, partial, or missing.                  |
| `clean`    | Cleanup completed.                                   | Invalid cleanup mode or invalid reference.                               |
| `sync`     | Full pipeline completed.                             | Download, extraction, validation, sanitization, import, or lock failure. |

## Options

| Option                                  | Applies to                                              | Purpose                                                             |
| --------------------------------------- | ------------------------------------------------------- | ------------------------------------------------------------------- |
| `[reference]` / `--reference <yyyy-mm>` | `check`, `download`, `status`, `retry`, `clean`, `sync` | Select a specific monthly reference.                                |
| `--current`                             | `check`, `download`, `status`, `retry`, `clean`, `sync` | Select the current calendar month.                                  |
| `--output <path>`                       | `download`, `status`, `retry`, `clean`, `sync`          | Download root directory. The reference folder is created inside it. |
| `--retries <number>`                    | `download`, `retry`, `sync`                             | Retry attempts per file. Defaults to 3.                             |
| `--overwrite`                           | `download`, `retry`, `sync`                             | Redownload files even when a completed local copy already exists.   |
| `--partials`                            | `clean`                                                 | Remove only `.part` files.                                          |
| `--failed`                              | `clean`                                                 | Remove failed and partial files tracked by the manifest.            |
| `--all`                                 | `clean`                                                 | Remove the entire local reference folder.                           |
| `--force-lock`                          | `sync`                                                  | Remove an existing sync lock before starting.                       |
| `--extract-output <path>`               | `sync`                                                  | Custom extraction output directory.                                 |
| `--sanitize-output <path>`              | `sync`                                                  | Custom sanitized output directory.                                  |
| `--db-url <url>`                        | `sync`                                                  | Override the saved PostgreSQL URL for the import phase.             |
| `--dataset <dataset>`                   | `sync`                                                  | Restrict the import phase to one dataset.                           |
| `--load-batch-size <size>`              | `sync`                                                  | Import load batch size.                                             |
| `--materialize-batch-size <size>`       | `sync`                                                  | Materialization chunk size.                                         |
| `--verbose-progress`                    | `sync`                                                  | Show detailed import progress.                                      |
| `--base-url <url>`                      | `check`, `download`, `retry`, `sync`                    | Override the WebDAV base URL.                                       |
| `--share-token <token>`                 | `check`, `download`, `retry`, `sync`                    | Override the public share token.                                    |
| `--force`                               | `download`, `retry`, `clean`, `sync`                    | Skip confirmation prompts.                                          |

## Notes

This module intentionally does not include Redis, workers, or schedulers. The CLI remains responsible for deterministic one-shot operations and local operational control. A future external runner can call these commands or import the public service functions and add queueing, job-level retries, scheduling, and notifications without making the core loader heavier.
@@ -0,0 +1,40 @@
# Quarantine

The `quarantine` service is a read-only CLI surface for inspecting rows written to the `import_quarantine` table during import.

## Commands

| Command                | Purpose                                                 |
| ---------------------- | ------------------------------------------------------- |
| `quarantine stats`     | Show totals and grouped counts for quarantine rows.     |
| `quarantine list`      | List quarantined rows with optional filters and paging. |
| `quarantine show <id>` | Show one quarantined row in detail.                     |

## Supported filters

| Option              | Commands                | Description                                            |
| ------------------- | ----------------------- | ------------------------------------------------------ |
| `--db-url <url>`    | `stats`, `list`, `show` | Override the persisted PostgreSQL URL.                 |
| `--dataset <name>`  | `stats`, `list`         | Filter rows by dataset name.                           |
| `--category <name>` | `stats`, `list`         | Filter rows by error category.                         |
| `--stage <name>`    | `stats`, `list`         | Filter rows by error stage.                            |
| `--retryable`       | `stats`, `list`         | Keep only rows marked as retryable.                    |
| `--terminal`        | `stats`, `list`         | Keep only rows marked as terminal.                     |
| `--limit <number>`  | `list`                  | Limit the number of returned rows. Defaults to `20`.   |
| `--after-id <id>`   | `list`                  | Return rows strictly after the provided quarantine id. |
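
Together, `--after-id` and `--limit` describe keyset-style paging. The sketch below shows the idea over an in-memory array. `QuarantineRow` and `listAfter` are hypothetical, and ascending id order is an assumption about how the real query sorts.

```typescript
// Hypothetical sketch of --after-id / --limit paging; not the real query.
interface QuarantineRow {
  id: number;
  dataset: string;
  retryable: boolean;
}

function listAfter(
  rows: QuarantineRow[],
  afterId = 0,
  limit = 20, // documented default for --limit
): QuarantineRow[] {
  return rows
    .filter((row) => row.id > afterId) // strictly after the provided id
    .sort((a, b) => a.id - b.id) // assumption: ascending id order
    .slice(0, limit);
}
```

A caller pages forward by passing the last id of one page as `afterId` for the next, which stays stable even if new rows are appended during a long import.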

## Examples

```bash
cnpj-db-loader quarantine stats
cnpj-db-loader quarantine stats --dataset establishments --category invalid_utf8_sequence --retryable
cnpj-db-loader quarantine list --dataset establishments --limit 10
cnpj-db-loader quarantine list --terminal --after-id 500
cnpj-db-loader quarantine show 42
```

## Notes

- `quarantine` is intentionally read-only. It does not retry or mutate quarantined rows.
- The service automatically ensures that the `import_quarantine` table and its newer columns exist before querying.
- A future replay/recovery command can reuse the same filters to target retryable or terminal rows.
package/docs/sanitize.md
ADDED
|
@@ -0,0 +1,49 @@
# Sanitize

## Purpose

`sanitize` prepares a clean dataset tree before PostgreSQL import.

It removes known low-level byte issues, especially `0x00` / NUL bytes, from validated dataset files and writes the result to a new output directory. The goal is to reduce slow fallback work during import so PostgreSQL receives cleaner files from the start.
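
Stripping NUL bytes is a byte-level filter, so it can be applied chunk by chunk to a stream without holding a whole file in memory. The function below is a minimal sketch of that single step. `stripNulBytes` is a hypothetical name, and the real command may handle other byte issues as well.

```typescript
// Hypothetical sketch: drop 0x00 bytes from one chunk of file data.
// Each byte is judged on its own, so no state is needed across chunk boundaries.
function stripNulBytes(chunk: Uint8Array): { data: Uint8Array; removed: number } {
  const data = chunk.filter((byte) => byte !== 0x00);
  return { data, removed: chunk.length - data.length };
}
```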

## Command

```bash
cnpj-db-loader sanitize <input>
```

## Options

| Option             | Description                                                               |
| ------------------ | ------------------------------------------------------------------------- |
| `--output <path>`  | Custom output directory for the sanitized dataset tree.                   |
| `--dataset <name>` | Sanitize only one dataset block, such as `establishments` or `companies`. |
| `-f, --force`      | Skip the confirmation prompt.                                             |

## Default output behavior

- when the validated path is `.../extracted`, the default sanitized output is `.../sanitized`
- otherwise the default output is `<validated-path>-sanitized`
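
The two default-output rules can be written as one path helper. `defaultSanitizedOutput` is a hypothetical name, and the trailing-separator handling is an assumption that goes beyond what is documented.

```typescript
// Hypothetical sketch of the default output rule.
function defaultSanitizedOutput(validatedPath: string): string {
  // Assumption: trailing separators are ignored before applying the rule.
  const trimmed = validatedPath.replace(/[\\/]+$/, "");
  // ".../extracted" becomes ".../sanitized" (sibling directory).
  const match = /^(.*[\\/])?extracted$/.exec(trimmed);
  if (match) return `${match[1] ?? ""}sanitized`;
  // Any other path gets a "-sanitized" suffix.
  return `${trimmed}-sanitized`;
}
```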

## Recommended flow

```bash
cnpj-db-loader inspect ./downloads
cnpj-db-loader extract ./downloads
cnpj-db-loader validate ./downloads/extracted
cnpj-db-loader sanitize ./downloads/extracted
cnpj-db-loader import ./downloads/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
```

## What it improves

- fewer UTF-8 / NUL-byte related insert failures
- less row-by-row fallback during import
- better import throughput for large datasets
- cleaner quarantine data because known low-level issues are removed earlier

## Notes

- `sanitize` does not replace validation; it assumes the dataset tree is already valid
- `import` still keeps quarantine and retry logic for unexpected issues that survive sanitization
- no database schema changes are required to use `sanitize`