@danielarndt0/cnpj-db-loader 2.3.1 → 2.4.0-beta.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +22 -2
- package/dist/cli.js +1544 -157
- package/dist/cli.js.map +1 -1
- package/dist/index.d.ts +134 -1
- package/dist/index.js +1174 -58
- package/dist/index.js.map +1 -1
- package/docs/architecture.md +9 -1
- package/docs/cli.md +1 -1
- package/docs/commands.md +23 -0
- package/docs/postgres-direct.md +138 -0
- package/docs/sanitize.md +52 -16
- package/docs/usage.md +14 -0
- package/package.json +3 -3
package/README.md
CHANGED
@@ -10,7 +10,7 @@ This version focuses on the real loading workflow:
 - check, download, retry, clean, and inspect the latest Federal Revenue CNPJ monthly ZIP archives from the public share
 - extract Receita Federal ZIP archives
 - validate an extracted tree
-- sanitize validated files before import
+- sanitize validated files into clean UTF-8 before import, removing NUL bytes, invalid bytes and problematic control characters
 - print or generate final, staging, or combined SQL schemas
 - configure and test the default PostgreSQL URL
 - import validated dataset files into PostgreSQL with:
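The reworded sanitize bullet in the hunk above names three cleanup targets: NUL bytes, invalid bytes and problematic control characters. The following is a minimal sketch of that kind of cleanup, not the package's actual implementation. The function name `sanitizeToUtf8`, its `sourceEncoding` parameter and the exact set of stripped control characters are assumptions made for illustration.

```typescript
// Hypothetical helper, written for illustration only; the real sanitizer
// inside cnpj-db-loader may work differently (for example, as a stream).
function sanitizeToUtf8(input: Uint8Array, sourceEncoding: string = "utf-8"): string {
  // With fatal: false, byte sequences that are invalid in the source
  // encoding decode to U+FFFD instead of throwing.
  const decoded = new TextDecoder(sourceEncoding, { fatal: false }).decode(input);

  return decoded
    // Drop replacement characters produced by invalid bytes.
    // Simplification: a genuine U+FFFD in the source is dropped as well.
    .replace(/\uFFFD/g, "")
    // Drop NUL and other C0 control characters plus DEL, but keep
    // tab (U+0009), line feed (U+000A) and carriage return (U+000D).
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "");
}

// Usage: "A", NUL, "B", an invalid 0xFF byte, ";", BEL, "C", line feed.
const dirty = new Uint8Array([0x41, 0x00, 0x42, 0xff, 0x3b, 0x07, 0x43, 0x0a]);
console.log(JSON.stringify(sanitizeToUtf8(dirty))); // "AB;C\n"
```

PostgreSQL rejects NUL bytes in text values, which is one reason a step like this has to run before any load. A production version would process large files in chunks rather than decoding a whole buffer at once.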
@@ -21,6 +21,7 @@ This version focuses on the real loading workflow:
 - direct final-schema upserts for the smaller domain datasets
 - checkpoint-based resume by file and byte offset
 - row quarantine for invalid or constraint-breaking records without stopping the import
+- generate a direct `psql` import script that loads sanitized Receita files without rewriting the full dataset into another CSV tree
 - quarantine inspection commands for analyzing rows stored in `import_quarantine`
 
 ## Installation
@@ -48,6 +49,10 @@ cnpj-db-loader sanitize ./downloads/<reference>/extracted
 cnpj-db-loader database config set "postgresql://user:password@localhost:5432/cnpj"
 cnpj-db-loader schema generate --profile full
 cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
+
+# Optional hybrid path for PostgreSQL direct loading
+cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --source-encoding UTF8 --force
+psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
 ```
 
 ## Stable commands
@@ -62,7 +67,7 @@ cnpj-db-loader federal-revenue sync [reference] [--reference <yyyy-mm>] [--curre
 cnpj-db-loader inspect <input>
 cnpj-db-loader extract <input> [--output <path>]
 cnpj-db-loader validate <input>
-cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [-f]
+cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [--source-encoding <encoding>] [-f]
 cnpj-db-loader schema print [--profile <profile>]
 cnpj-db-loader schema generate [--name <name>] [--output <path>] [--profile <profile>]
 cnpj-db-loader database config set <url>
@@ -73,6 +78,8 @@ cnpj-db-loader database cleanup staging [--db-url <url>] [--dataset <name>] [--v
 cnpj-db-loader database cleanup materialized [--db-url <url>] [--dataset <name>] [--force]
 cnpj-db-loader database cleanup checkpoints [--db-url <url>] [--phase <phase>] [--dataset <name>] [--validated-path <path>] [--plan-id <id>] [--force]
 cnpj-db-loader database cleanup plans [--db-url <url>] [--validated-path <path>] [--plan-id <id>] [--force]
+cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <name>] [--script-name <name>] [--source-encoding <encoding>] [-f]
+cnpj-db-loader postgres export-csv <input> [--output <path>] [--dataset <name>] [--script-name <name>] [-f]
 cnpj-db-loader import <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
 cnpj-db-loader import load <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--verbose-progress] [-f]
 cnpj-db-loader import materialize <input> [--db-url <url>] [--dataset <name>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
@@ -82,6 +89,18 @@ cnpj-db-loader quarantine list [--dataset <name>] [--category <name>] [--stage <
 cnpj-db-loader quarantine show <id> [--db-url <url>]
 ```
 
+## PostgreSQL direct import workflow
+
+For local benchmarks or controlled full loads, the CLI can now generate a direct `psql` import script after sanitization:
+
+```bash
+cnpj-db-loader sanitize ./downloads/<reference>/extracted
+cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --source-encoding UTF8 --force
+psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
+```
+
+This path keeps download, extraction, validation and robust UTF-8 sanitization inside the loader, then lets PostgreSQL load the sanitized Receita files directly through `\copy`, convert values into staging tables and materialize the final tables with set-based SQL. The standard `import` command remains the safest path when checkpoint resume and quarantine recovery are required.
+
 ## Logs
 
 JSON execution logs are written inside the user home directory at `~/.cnpjdbloader/logs`.
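The added README paragraph in the hunk above describes three phases: a `\copy` of each sanitized file, a conversion into staging tables, and a set-based materialization into the final tables. The sketch below shows roughly what a generator for one such script section might look like. Everything in it is hypothetical: the `DirectLoadStep` shape, the `raw_*` landing table, the table and column names, the `;` delimiter and the `NULLIF(TRIM(...))` conversion are assumptions for illustration, not the contents of the package's generated `import-postgres-direct.sql`.

```typescript
// Hypothetical description of one dataset file to load; none of these
// names are taken from the package.
interface DirectLoadStep {
  filePath: string;     // sanitized input file (POSIX-style path assumed)
  rawTable: string;     // text-only landing table filled by \copy
  stagingTable: string; // staging table holding converted values
  finalTable: string;   // final table, upserted from staging
  columns: string[];    // column names shared by all three tables
  keyColumn: string;    // conflict target for the upsert
}

// Quote a value as a single-quoted literal, doubling embedded quotes.
function quoteLiteral(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

function buildDirectLoadScript(step: DirectLoadStep, encoding: string = "UTF8"): string {
  const cols = step.columns.join(", ");
  // Example conversion: trim each value and turn empty strings into NULL.
  const converted = step.columns.map((c) => `NULLIF(TRIM(${c}), '')`).join(", ");
  const updates = step.columns
    .filter((c) => c !== step.keyColumn)
    .map((c) => `${c} = EXCLUDED.${c}`)
    .join(", ");
  const onConflict = updates.length > 0 ? `DO UPDATE SET ${updates}` : "DO NOTHING";

  return [
    // Stop at the first error instead of continuing with a partial load.
    "\\set ON_ERROR_STOP on",
    // psql requires the whole \copy meta-command on a single line.
    `\\copy ${step.rawTable} (${cols}) FROM ${quoteLiteral(step.filePath)} WITH (FORMAT csv, DELIMITER ';', QUOTE '"', ENCODING ${quoteLiteral(encoding)})`,
    `INSERT INTO ${step.stagingTable} (${cols}) SELECT ${converted} FROM ${step.rawTable};`,
    `INSERT INTO ${step.finalTable} (${cols}) SELECT ${cols} FROM ${step.stagingTable} ON CONFLICT (${step.keyColumn}) ${onConflict};`,
  ].join("\n") + "\n";
}

// Usage with made-up names:
console.log(
  buildDirectLoadScript({
    filePath: "./sanitized/cnaes.csv",
    rawTable: "raw_cnaes",
    stagingTable: "staging_cnaes",
    finalTable: "cnaes",
    columns: ["code", "description"],
    keyColumn: "code",
  }),
);
```

Because `\copy` reads the file on the client side, a script like this can run against a remote server without the files being visible to the PostgreSQL process. As the README notes, this path gives up the checkpoint resume and row quarantine that the standard `import` command provides.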
@@ -116,5 +135,6 @@ The generated database schema now supports three profiles:
 - [Quarantine](./docs/quarantine.md)
 - [Sanitize](./docs/sanitize.md)
 - [Federal Revenue](./docs/federal-revenue.md)
+- [PostgreSQL Direct Import](./docs/postgres-direct.md)
 
 - Materialization now stores lightweight staging validation markers (row count and max staging id) in the materialization checkpoint table so reruns can verify the live staging state quickly and reuse lookup reconciliation when the staging snapshot is unchanged. The runtime validates that the required import tables already exist but no longer creates or alters them automatically.