@danielarndt0/cnpj-db-loader 2.2.0
- package/LICENSE +21 -0
- package/README.md +119 -0
- package/dist/cli.d.ts +1 -0
- package/dist/cli.js +10187 -0
- package/dist/cli.js.map +1 -0
- package/dist/index.d.ts +969 -0
- package/dist/index.js +8012 -0
- package/dist/index.js.map +1 -0
- package/docs/architecture.md +97 -0
- package/docs/cli.md +46 -0
- package/docs/commands.md +65 -0
- package/docs/federal-revenue.md +224 -0
- package/docs/quarantine.md +40 -0
- package/docs/sanitize.md +49 -0
- package/docs/usage.md +149 -0
- package/package.json +61 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Daniel Arndt
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,119 @@
+# CNPJ DB Loader
+
+CNPJ DB Loader is a practical CLI for preparing Brazilian Federal Revenue CNPJ datasets for PostgreSQL.
+
+## Current scope
+
+This version focuses on the real loading workflow:
+
+- inspect a downloaded directory
+- check, download, retry, clean, and inspect the latest Federal Revenue CNPJ monthly ZIP archives from the public share
+- extract Receita Federal ZIP archives
+- validate an extracted tree
+- sanitize validated files before import to remove known low-level byte issues
+- print or generate final, staging, or combined SQL schemas
+- configure and test the default PostgreSQL URL
+- import validated dataset files into PostgreSQL with:
+- exact preparatory scanning for total rows and total batches before import starts
+- persisted import plans reused on resume for the same validated input and batch size
+- staged bulk loads for the large datasets through PostgreSQL COPY
+- direct final-schema upserts for the smaller domain datasets
+- checkpoint-based resume by file and byte offset
+- row quarantine for invalid or constraint-breaking records without stopping the import
+- quarantine inspection commands for analyzing rows stored in `import_quarantine`
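
The "checkpoint-based resume by file and byte offset" item in the list above can be pictured with a small sketch. It is not the package's code: the `FileCheckpoint` shape, the `readBatch` helper, and the single-byte decoding are all assumptions made for illustration, since this part of the diff does not show the real checkpoint schema.

```typescript
// Hypothetical checkpoint record: which file, and how many bytes are already committed.
interface FileCheckpoint {
  file: string;
  byteOffset: number;
}

interface Batch {
  rows: string[];
  nextOffset: number; // offset to persist once this batch is committed
}

// Read up to `batchSize` complete lines, starting at the checkpointed byte offset.
// A trailing line without a newline is left untouched for a later read.
function readBatch(data: Uint8Array, checkpoint: FileCheckpoint, batchSize: number): Batch {
  const rows: string[] = [];
  let offset = checkpoint.byteOffset;
  while (rows.length < batchSize && offset < data.length) {
    const newline = data.indexOf(0x0a, offset);
    if (newline === -1) break; // incomplete final line
    // Assumption: bytes are decoded one-to-one as single-byte characters.
    const line = Array.from(data.subarray(offset, newline), (b) => String.fromCharCode(b)).join("");
    rows.push(line);
    offset = newline + 1;
  }
  return { rows, nextOffset: offset };
}
```

Persisting `nextOffset` only after a batch is committed means an interrupted run can restart from the last committed byte rather than re-reading the file from the beginning.
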
+
+## Installation
+
+```bash
+npm install
+```
+
+During development:
+
+```bash
+npm run cli -- --help
+```
+
+## Quick start
+
+```bash
+cnpj-db-loader federal-revenue check
+cnpj-db-loader federal-revenue download --output ./downloads
+cnpj-db-loader federal-revenue status --output ./downloads
+cnpj-db-loader inspect ./downloads/<reference>
+cnpj-db-loader extract ./downloads/<reference>
+cnpj-db-loader validate ./downloads/<reference>/extracted
+cnpj-db-loader sanitize ./downloads/<reference>/extracted
+cnpj-db-loader database config set "postgresql://user:password@localhost:5432/cnpj"
+cnpj-db-loader schema generate --profile full
+cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
+```
+
+## Stable commands
+
+```bash
+cnpj-db-loader federal-revenue check [reference] [--reference <yyyy-mm>] [--current]
+cnpj-db-loader federal-revenue download [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--retries <number>] [--overwrite] [-f]
+cnpj-db-loader federal-revenue status [reference] [--reference <yyyy-mm>] [--current] [--output <path>]
+cnpj-db-loader federal-revenue retry [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--retries <number>] [--overwrite] [-f]
+cnpj-db-loader federal-revenue clean [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--partials | --failed | --all] [-f]
+cnpj-db-loader federal-revenue sync [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--extract-output <path>] [--sanitize-output <path>] [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [--force-lock] [-f]
+cnpj-db-loader inspect <input>
+cnpj-db-loader extract <input> [--output <path>]
+cnpj-db-loader validate <input>
+cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [-f]
+cnpj-db-loader schema print [--profile <profile>]
+cnpj-db-loader schema generate [--name <name>] [--output <path>] [--profile <profile>]
+cnpj-db-loader database config set <url>
+cnpj-db-loader database config show
+cnpj-db-loader database config test [--db-url <url>]
+cnpj-db-loader database config reset [--force]
+cnpj-db-loader database cleanup staging [--db-url <url>] [--dataset <name>] [--validated-path <path>] [--force]
+cnpj-db-loader database cleanup materialized [--db-url <url>] [--dataset <name>] [--force]
+cnpj-db-loader database cleanup checkpoints [--db-url <url>] [--phase <phase>] [--dataset <name>] [--validated-path <path>] [--plan-id <id>] [--force]
+cnpj-db-loader database cleanup plans [--db-url <url>] [--validated-path <path>] [--plan-id <id>] [--force]
+cnpj-db-loader import <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
+cnpj-db-loader import load <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--verbose-progress] [-f]
+cnpj-db-loader import materialize <input> [--db-url <url>] [--dataset <name>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
+cnpj-db-loader doctor [--input <path>] [--db-url <url>]
+cnpj-db-loader quarantine stats [--dataset <name>] [--category <name>] [--stage <name>] [--retryable] [--terminal]
+cnpj-db-loader quarantine list [--dataset <name>] [--category <name>] [--stage <name>] [--retryable] [--terminal] [--limit <number>] [--after-id <id>]
+cnpj-db-loader quarantine show <id> [--db-url <url>]
+```
+
+## Logs
+
+JSON execution logs are written inside the user home directory at `~/.cnpjdbloader/logs`.
+
+Every JSON and JSONL log entry now includes a structured envelope with fields such as `timestamp`, `level`, `severity`, `event`, and `kind`. Command success logs are written with `status: "success"`, command failures are written with `status: "failure"`, and incremental import progress events are classified with levels such as `debug`, `info`, `warning`, and `error`.
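
As a rough illustration of the log envelope described above, the sketch below types an entry and reads a JSONL log. The field names (`timestamp`, `level`, `severity`, `event`, `kind`, `status`) come from the README text; the value types, the `parseJsonl` and `countFailures` helpers, and the rule of skipping entries that lack an envelope field are assumptions, not the package's API.

```typescript
// Field names are taken from the README; every value type here is an assumption.
interface LogEnvelope {
  timestamp: string;
  level: string; // the README lists debug, info, warning, error
  severity: string | number;
  event: string;
  kind: string;
  status?: string; // "success" or "failure" on command logs
  [extra: string]: unknown;
}

const REQUIRED_FIELDS = ["timestamp", "level", "severity", "event", "kind"];

// Parse JSONL text (one JSON object per line), keeping only entries
// that carry every envelope field named above.
function parseJsonl(text: string): LogEnvelope[] {
  const entries: LogEnvelope[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    const parsed: unknown = JSON.parse(trimmed);
    if (typeof parsed !== "object" || parsed === null) continue;
    const record = parsed as Record<string, unknown>;
    if (REQUIRED_FIELDS.every((key) => key in record)) {
      entries.push(record as unknown as LogEnvelope);
    }
  }
  return entries;
}

// Count command logs that ended with status "failure".
function countFailures(entries: LogEnvelope[]): number {
  return entries.filter((entry) => entry.status === "failure").length;
}
```

A helper like this could sit behind a quick check over the files in `~/.cnpjdbloader/logs`, for example to see whether the last run recorded any failed command.
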
+
+For `import`, the CLI now also writes an incremental JSONL progress log with one event per committed batch, retry fallback, dataset metrics, file metrics, file failure, final completion summary, and top-level import failure when execution aborts early.
+
+The final import summary now includes baseline timing and throughput metrics such as preparatory scan duration, execution duration, insert time, retry time, quarantine time, rows per second, and batches per minute.
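
The two throughput figures named above follow from simple ratios. The sketch below shows one plausible way to derive them; the `ImportTimings` shape and the choice of execution duration as the denominator are assumptions, since the diff does not show how the package computes its summary.

```typescript
// Hypothetical inputs; the real summary fields are not shown in this diff.
interface ImportTimings {
  totalRows: number;
  totalBatches: number;
  executionMs: number; // assumed to exclude the preparatory scan
}

interface Throughput {
  rowsPerSecond: number;
  batchesPerMinute: number;
}

// Derive throughput from totals and elapsed execution time.
// A zero or negative duration yields zeros instead of Infinity or NaN.
function computeThroughput(timings: ImportTimings): Throughput {
  if (timings.executionMs <= 0) {
    return { rowsPerSecond: 0, batchesPerMinute: 0 };
  }
  const seconds = timings.executionMs / 1000;
  return {
    rowsPerSecond: timings.totalRows / seconds,
    batchesPerMinute: timings.totalBatches / (seconds / 60),
  };
}
```

With 600000 rows and 1200 batches committed in 60000 ms, this gives 10000 rows per second and 1200 batches per minute.
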
+
+The import internals are now split into dedicated modules such as planner, source reader, parser, normalizer, checkpoint manager, quarantine writer, staging writer, materializer, and finalizer so staged bulk-load and final materialization changes can be implemented without rewriting the whole import command.
+
+The CLI now exposes a split workflow as well: `import` runs the full pipeline, `import load` stops after staging/direct writes, `import materialize` resumes from the saved plan and pushes staged rows into the final tables, and `database cleanup ...` exposes safe maintenance commands for staging tables, simplified final materialized tables, checkpoints, and saved plans.
+
+Materialization progress is now checkpointed separately from file-load checkpoints, and the materializer works in resumable chunks controlled by `--materialize-batch-size`. During long final materialization steps, the CLI keeps the live progress output on a dedicated MATERIALIZING stage while reducing per-chunk checkpoint and JSONL write overhead so resumable chunks stay fast. The simplified final schema keeps raw secondary CNAE text in establishments and derives helper fields such as partner dedupe keys during materialization only when they are still stored physically in the target schema.
+
+The Federal Revenue commands write the same structured command logs and keep the remote-download phase outside the import internals. Existing completed ZIP files are skipped by default, temporary `.part` files are used while downloads are still in progress, and each reference keeps a local manifest for `status`, `retry`, `clean`, and future runner automation.
+
+The generated database schema now supports three profiles:
+
+- `full`: final relational tables, import control tables, and staging tables
+- `final`: only the final relational and control tables
+- `staging`: only the lightweight staging tables used by the staged bulk-load flow
+
+`import --verbose-progress` shows a fixed multi-line status block instead of spamming the terminal with a new line on every progress update.
+
+## Documentation
+
+- [Usage](./docs/usage.md)
+- [Architecture](./docs/architecture.md)
+- [Commands](./docs/commands.md)
+- [Quarantine](./docs/quarantine.md)
+- [Sanitize](./docs/sanitize.md)
+- [Federal Revenue](./docs/federal-revenue.md)
+
+- Materialization now stores lightweight staging validation markers (row count and max staging id) in the materialization checkpoint table so reruns can verify the live staging state quickly and reuse lookup reconciliation when the staging snapshot is unchanged. The runtime validates that the required import tables already exist but no longer creates or alters them automatically.
package/dist/cli.d.ts
ADDED
@@ -0,0 +1 @@
+#!/usr/bin/env node