@danielarndt0/cnpj-db-loader 2.2.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Daniel Arndt
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,119 @@
+ # CNPJ DB Loader
+
+ CNPJ DB Loader is a practical CLI for preparing Brazilian Federal Revenue CNPJ datasets for PostgreSQL.
+
+ ## Current scope
+
+ This version focuses on the real loading workflow:
+
+ - inspect a downloaded directory
+ - check, download, retry, clean, and inspect the latest Federal Revenue CNPJ monthly ZIP archives from the public share
+ - extract Receita Federal ZIP archives
+ - validate an extracted tree
+ - sanitize validated files before import to remove known low-level byte issues
+ - print or generate final, staging, or combined SQL schemas
+ - configure and test the default PostgreSQL URL
+ - import validated dataset files into PostgreSQL with:
+   - exact preparatory scanning for total rows and total batches before import starts
+   - persisted import plans reused on resume for the same validated input and batch size
+   - staged bulk loads for the large datasets through PostgreSQL COPY
+   - direct final-schema upserts for the smaller domain datasets
+   - checkpoint-based resume by file and byte offset
+   - row quarantine for invalid or constraint-breaking records without stopping the import
+ - quarantine inspection commands for analyzing rows stored in `import_quarantine`
+
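The "checkpoint-based resume by file and byte offset" item above can be pictured with a minimal TypeScript sketch. It is not taken from the package source: `Batch` and `readBatch` are hypothetical names, and the sketch assumes that a checkpoint stores the offset just past the last fully consumed line, so a resumed run neither re-reads nor splits a row.

```typescript
// Hypothetical batch shape: raw row bytes plus the offset to persist.
interface Batch {
  rows: Uint8Array[]; // line bytes without the trailing newline
  nextOffset: number; // byte offset to store in the checkpoint
}

// Read up to maxRows complete lines starting at a checkpointed byte offset.
// A trailing line without a newline is treated as incomplete and left for later.
function readBatch(data: Uint8Array, offset: number, maxRows: number): Batch {
  const rows: Uint8Array[] = [];
  let lineStart = offset;
  for (let i = offset; i < data.length && rows.length < maxRows; i++) {
    if (data[i] === 0x0a) {
      rows.push(data.subarray(lineStart, i));
      lineStart = i + 1;
    }
  }
  return { rows, nextOffset: lineStart };
}
```

Persisting `nextOffset` only after a batch commits means an interrupted run can restart at the same byte position for that file.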
+ ## Installation
+
+ ```bash
+ npm install
+ ```
+
+ During development:
+
+ ```bash
+ npm run cli -- --help
+ ```
+
+ ## Quick start
+
+ ```bash
+ cnpj-db-loader federal-revenue check
+ cnpj-db-loader federal-revenue download --output ./downloads
+ cnpj-db-loader federal-revenue status --output ./downloads
+ cnpj-db-loader inspect ./downloads/<reference>
+ cnpj-db-loader extract ./downloads/<reference>
+ cnpj-db-loader validate ./downloads/<reference>/extracted
+ cnpj-db-loader sanitize ./downloads/<reference>/extracted
+ cnpj-db-loader database config set "postgresql://user:password@localhost:5432/cnpj"
+ cnpj-db-loader schema generate --profile full
+ cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
+ ```
+
+ ## Stable commands
+
+ ```bash
+ cnpj-db-loader federal-revenue check [reference] [--reference <yyyy-mm>] [--current]
+ cnpj-db-loader federal-revenue download [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--retries <number>] [--overwrite] [-f]
+ cnpj-db-loader federal-revenue status [reference] [--reference <yyyy-mm>] [--current] [--output <path>]
+ cnpj-db-loader federal-revenue retry [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--retries <number>] [--overwrite] [-f]
+ cnpj-db-loader federal-revenue clean [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--partials | --failed | --all] [-f]
+ cnpj-db-loader federal-revenue sync [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--extract-output <path>] [--sanitize-output <path>] [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [--force-lock] [-f]
+ cnpj-db-loader inspect <input>
+ cnpj-db-loader extract <input> [--output <path>]
+ cnpj-db-loader validate <input>
+ cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [-f]
+ cnpj-db-loader schema print [--profile <profile>]
+ cnpj-db-loader schema generate [--name <name>] [--output <path>] [--profile <profile>]
+ cnpj-db-loader database config set <url>
+ cnpj-db-loader database config show
+ cnpj-db-loader database config test [--db-url <url>]
+ cnpj-db-loader database config reset [--force]
+ cnpj-db-loader database cleanup staging [--db-url <url>] [--dataset <name>] [--validated-path <path>] [--force]
+ cnpj-db-loader database cleanup materialized [--db-url <url>] [--dataset <name>] [--force]
+ cnpj-db-loader database cleanup checkpoints [--db-url <url>] [--phase <phase>] [--dataset <name>] [--validated-path <path>] [--plan-id <id>] [--force]
+ cnpj-db-loader database cleanup plans [--db-url <url>] [--validated-path <path>] [--plan-id <id>] [--force]
+ cnpj-db-loader import <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
+ cnpj-db-loader import load <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--verbose-progress] [-f]
+ cnpj-db-loader import materialize <input> [--db-url <url>] [--dataset <name>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
+ cnpj-db-loader doctor [--input <path>] [--db-url <url>]
+ cnpj-db-loader quarantine stats [--dataset <name>] [--category <name>] [--stage <name>] [--retryable] [--terminal]
+ cnpj-db-loader quarantine list [--dataset <name>] [--category <name>] [--stage <name>] [--retryable] [--terminal] [--limit <number>] [--after-id <id>]
+ cnpj-db-loader quarantine show <id> [--db-url <url>]
+ ```
+
+ ## Logs
+
+ JSON execution logs are written inside the user home directory at `~/.cnpjdbloader/logs`.
+
+ Every JSON and JSONL log entry now includes a structured envelope with fields such as `timestamp`, `level`, `severity`, `event`, and `kind`. Command success logs are written with `status: "success"`, command failures are written with `status: "failure"`, and incremental import progress events are classified with levels such as `debug`, `info`, `warning`, and `error`.
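
A minimal TypeScript sketch of how a consumer might read such a JSONL log and keep only warnings and errors. Only the field names come from the README text above; the value types, the `LogEnvelope` interface, and the helpers `parseJsonl` and `onlyProblems` are assumptions made for illustration.

```typescript
// Hypothetical envelope shape: field names are from the README, types are assumed.
interface LogEnvelope {
  timestamp: string;
  level: string; // e.g. "debug", "info", "warning", "error"
  severity: unknown;
  event: string;
  kind: string;
  status?: string; // "success" or "failure" on command logs
  [extra: string]: unknown;
}

// Parse JSONL text (one JSON object per line), skipping blank or malformed lines.
function parseJsonl(text: string): LogEnvelope[] {
  const entries: LogEnvelope[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const value: unknown = JSON.parse(trimmed);
      if (typeof value === "object" && value !== null && !Array.isArray(value)) {
        entries.push(value as LogEnvelope);
      }
    } catch {
      // A partially written last line is possible while an import is still running.
    }
  }
  return entries;
}

// Keep only the entries that usually need attention.
function onlyProblems(entries: LogEnvelope[]): LogEnvelope[] {
  return entries.filter((e) => e.level === "warning" || e.level === "error");
}
```

Tolerating a malformed final line is a deliberate choice here, since an incremental log may be read while it is still being appended to.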
+
+ For `import`, the CLI now also writes an incremental JSONL progress log with one event per committed batch, retry fallback, dataset metrics, file metrics, file failure, final completion summary, and top-level import failure when execution aborts early.
+
+ The final import summary now includes baseline timing and throughput metrics such as preparatory scan duration, execution duration, insert time, retry time, quarantine time, rows per second, and batches per minute.
+
+ The import internals are now split into dedicated modules such as planner, source reader, parser, normalizer, checkpoint manager, quarantine writer, staging writer, materializer, and finalizer so staged bulk-load and final materialization changes can be implemented without rewriting the whole import command.
+
+ The CLI now exposes a split workflow as well: `import` runs the full pipeline, `import load` stops after staging/direct writes, `import materialize` resumes from the saved plan and pushes staged rows into the final tables, and `database cleanup ...` exposes safe maintenance commands for staging tables, simplified final materialized tables, checkpoints, and saved plans.
+
+ Materialization progress is now checkpointed separately from file-load checkpoints, and the materializer works in resumable chunks controlled by `--materialize-batch-size`. During long final materialization steps, the CLI keeps the live progress output on a dedicated MATERIALIZING stage while reducing per-chunk checkpoint and JSONL write overhead so resumable chunks stay fast. The simplified final schema keeps raw secondary CNAE text in establishments and derives helper fields such as partner dedupe keys during materialization only when they are still stored physically in the target schema.
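
As a rough illustration of resumable chunks controlled by `--materialize-batch-size`, the sketch below splits the remaining staging rows into inclusive id ranges. Chunking by staging id, and the names `Chunk` and `planChunks`, are assumptions for illustration only; the package may partition the work differently.

```typescript
// Inclusive range of staging ids handled by one materialization chunk (hypothetical).
interface Chunk {
  fromId: number;
  toId: number;
}

// lastDoneId: highest staging id already materialized per the checkpoint (0 if none).
// maxId: highest staging id currently present. batchSize: rows per chunk.
function planChunks(lastDoneId: number, maxId: number, batchSize: number): Chunk[] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const chunks: Chunk[] = [];
  for (let fromId = lastDoneId + 1; fromId <= maxId; fromId += batchSize) {
    chunks.push({ fromId, toId: Math.min(fromId + batchSize - 1, maxId) });
  }
  return chunks;
}
```

With a checkpoint that records the last finished `toId`, a rerun only plans the chunks that remain.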
+
+ The Federal Revenue commands write the same structured command logs and keep the remote-download phase outside the import internals. Existing completed ZIP files are skipped by default, temporary `.part` files are used while downloads are still in progress, and each reference keeps a local manifest for `status`, `retry`, `clean`, and future runner automation.
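
The skip-by-default and `.part` behavior described above can be sketched as a small decision function in TypeScript. The names `DownloadPlan` and `planDownload`, the `<zip>.part` naming, and the choice to discard a stale partial file instead of resuming it are all assumptions for illustration, not the package's confirmed behavior.

```typescript
// Hypothetical decision record for one ZIP archive.
interface DownloadPlan {
  action: "skip" | "download";
  partPath: string; // temporary file written while the transfer is in progress
  discardStalePart: boolean; // true when a leftover partial file should be removed first
}

// existing: paths already present in the output directory.
// overwrite: mirrors the idea of the --overwrite flag.
function planDownload(
  zipPath: string,
  existing: ReadonlySet<string>,
  overwrite = false,
): DownloadPlan {
  const partPath = `${zipPath}.part`;
  if (existing.has(zipPath) && !overwrite) {
    // A completed archive is left untouched by default.
    return { action: "skip", partPath, discardStalePart: false };
  }
  return { action: "download", partPath, discardStalePart: existing.has(partPath) };
}
```

Writing to the partial path and renaming it to the final ZIP name only after the transfer completes keeps an interrupted download from being mistaken for a finished archive.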
+
+ The generated database schema now supports three profiles:
+
+ - `full`: final relational tables, import control tables, and staging tables
+ - `final`: only the final relational and control tables
+ - `staging`: only the lightweight staging tables used by the staged bulk-load flow
+
+ `import --verbose-progress` shows a fixed multi-line status block instead of spamming the terminal with a new line on every progress update.
+
+ ## Documentation
+
+ - [Usage](./docs/usage.md)
+ - [Architecture](./docs/architecture.md)
+ - [Commands](./docs/commands.md)
+ - [Quarantine](./docs/quarantine.md)
+ - [Sanitize](./docs/sanitize.md)
+ - [Federal Revenue](./docs/federal-revenue.md)
+
+ - Materialization now stores lightweight staging validation markers (row count and max staging id) in the materialization checkpoint table so reruns can verify the live staging state quickly and reuse lookup reconciliation when the staging snapshot is unchanged. The runtime validates that the required import tables already exist but no longer creates or alters them automatically.
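
The staging validation markers mentioned in the last item can be pictured as a simple comparison. This TypeScript sketch assumes the two markers named above (row count and max staging id) are the whole check; `StagingMarkers` and `canReuseReconciliation` are hypothetical names, not identifiers from the package.

```typescript
// Markers saved in the materialization checkpoint (field names are assumed).
interface StagingMarkers {
  rowCount: number;
  maxStagingId: number;
}

// Reuse earlier lookup reconciliation only when the live staging table
// still matches the snapshot recorded by the previous run.
function canReuseReconciliation(
  saved: StagingMarkers | undefined,
  live: StagingMarkers,
): boolean {
  return (
    saved !== undefined &&
    saved.rowCount === live.rowCount &&
    saved.maxStagingId === live.maxStagingId
  );
}
```

Comparing two cheap aggregates avoids rescanning the staging rows, at the cost of missing in-place edits that change neither value.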
package/dist/cli.d.ts ADDED
@@ -0,0 +1 @@
+ #!/usr/bin/env node