@danielarndt0/cnpj-db-loader 2.3.1 → 2.4.0-beta.1
- package/README.md +20 -0
- package/dist/cli.js +1157 -10
- package/dist/cli.js.map +1 -1
- package/dist/index.d.ts +121 -1
- package/dist/index.js +913 -3
- package/dist/index.js.map +1 -1
- package/docs/architecture.md +8 -0
- package/docs/commands.md +23 -0
- package/docs/postgres-direct.md +138 -0
- package/docs/releases/v2.4.0.md +40 -0
- package/docs/usage.md +14 -0
- package/package.json +1 -1
package/docs/architecture.md
CHANGED
@@ -95,3 +95,11 @@ planner -> source-reader -> parser -> normalizer -> staging-writer -> materializ
 ```
 
 - Materialization now stores lightweight staging validation markers (row count and max staging id) in the materialization checkpoint table so reruns can verify the live staging state quickly and reuse lookup reconciliation when the staging snapshot is unchanged. The runtime validates that the required import tables already exist but no longer creates or alters them automatically.
+
+## PostgreSQL direct import workflow
+
+The PostgreSQL direct import workflow is a hybrid execution path. It keeps file detection, Receita parsing, validation and sanitization in the loader, then exports normalized CSV files and a generated `psql` script for direct PostgreSQL loading.
+
+This keeps the parsing rules centralized in the TypeScript layouts while allowing PostgreSQL to run the heaviest bulk load and materialization work with set-based SQL.
+
+The generated script resets staging tables, loads CSV files with `\copy`, upserts domain and final tables, materializes `establishment_secondary_cnaes`, and refreshes planner statistics with `ANALYZE`.
package/docs/commands.md
CHANGED
@@ -22,6 +22,8 @@
 | `database cleanup materialized` | Truncate simplified final relational tables populated by materialization, including establishment secondary CNAEs when available, in safe order. |
 | `database cleanup checkpoints` | Clear load checkpoints, materialization checkpoints, or both without truncating staging or final tables. |
 | `database cleanup plans` | Delete saved import plans. Related plan files and materialization checkpoints are removed by database cascade. |
+| `postgres generate-script` | Generate a direct `psql` import script that loads sanitized Receita files without rewriting them into new CSV files. |
+| `postgres export-csv` | Convert sanitized Receita files into normalized PostgreSQL-ready CSV files and generate a direct `psql` import script for audit/debug workflows. |
 | `import <input>` | Run the full pipeline: plan, load validated files into staging/direct final targets, materialize staged datasets into final tables, and finalize the import plan. |
 | `import load <input>` | Prepare the plan and run only the load phase. Heavy datasets stop in `staging_*`; domain datasets still upsert directly into the final schema. |
 | `import materialize <input>` | Resume from the saved import plan and materialize staged datasets into the final relational tables with resumable chunks. |
@@ -50,6 +52,8 @@ cnpj-db-loader database cleanup staging --validated-path ./downloads/sanitized -
 cnpj-db-loader database cleanup materialized --dataset companies --force
 cnpj-db-loader database cleanup checkpoints --phase materialization --validated-path ./downloads/sanitized --force
 cnpj-db-loader database cleanup plans --validated-path ./downloads/sanitized --force
+cnpj-db-loader postgres generate-script ./downloads/sanitized --output ./downloads/postgres-direct --force
+psql "postgres://user:password@localhost:5432/cnpj" -f ./downloads/postgres-direct/import-postgres-direct.sql
 cnpj-db-loader import ./downloads/sanitized
 cnpj-db-loader import ./downloads/sanitized --db-url "postgresql://user:password@localhost:5432/cnpj"
 cnpj-db-loader import ./downloads/sanitized --dataset companies --load-batch-size 500
@@ -63,3 +67,22 @@ cnpj-db-loader quarantine list --dataset establishments --limit 10
 cnpj-db-loader quarantine list --terminal --after-id 500
 cnpj-db-loader quarantine show 42
 ```
+
+## PostgreSQL direct import helper
+
+```bash
+cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [--source-encoding <encoding>] [-f]
+cnpj-db-loader postgres export-csv <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [-f]
+```
+
+`postgres generate-script` is the recommended hybrid workflow. The loader performs extraction, validation and sanitization, then generates a `psql` script that loads the sanitized Receita files directly through `\copy`.
+
+`postgres export-csv` remains available when you explicitly want a normalized CSV output tree for audit/debug purposes.
+
+Options:
+
+- `--output <path>`: directory where manifest and SQL script are generated.
+- `--dataset <dataset>`: generate only one dataset block.
+- `--script-name <name>`: custom generated SQL script name.
+- `--source-encoding <encoding>`: source file encoding for `psql` copy operations. Defaults to `WIN1252`.
+- `-f, --force`: skip confirmation.
package/docs/postgres-direct.md
@@ -0,0 +1,138 @@
+# PostgreSQL direct import workflow
+
+The PostgreSQL direct import workflow is a hybrid path for environments where the standard resumable importer is too expensive for a full monthly load.
+
+It keeps the safe preparation steps inside CNPJ DB Loader and moves the heaviest database load/materialization work into a generated `psql` script.
+
+## Intended flow
+
+```bash
+cnpj-db-loader federal-revenue download --output ./downloads
+cnpj-db-loader extract ./downloads/<reference>
+cnpj-db-loader validate ./downloads/<reference>/extracted
+cnpj-db-loader sanitize ./downloads/<reference>/extracted
+cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
+```
+
+The loader remains responsible for:
+
+- Federal Revenue download and local manifest control
+- extraction
+- validation
+- sanitization
+- preserving the sanitized Receita files without rewriting the whole dataset
+- generating the final `psql` import script
+- optionally exporting PostgreSQL-ready CSV files through `postgres export-csv` when an audit/debug CSV tree is useful
+
+PostgreSQL is then responsible for:
+
+- `\copy` loading sanitized Receita files into temporary raw tables
+- SQL-side conversion of dates, numeric values and nullable fields
+- staging table population
+- set-based final table upserts
+- `establishment_secondary_cnaes` materialization
+- planner statistics refresh through `ANALYZE`
+
+## Why this exists
+
+The standard `import` command is safer and resumable, but it keeps more orchestration inside the Node.js process. That is useful for production safety, checkpoints, quarantine and incremental recovery.
+
+The direct PostgreSQL path is optimized for bulk loading after the input files have already been sanitized. It avoids per-batch Node.js database inserts and avoids rewriting the full dataset into a second CSV tree. Instead, `psql` streams the sanitized Receita files into temporary text tables and PostgreSQL performs the value conversion and materialization with set-based SQL.
+
+Use this when you want to benchmark or run a faster controlled load on a local machine.
+
+## Command
+
+```bash
+cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [--source-encoding <encoding>] [-f]
+```
+
+### Arguments
+
+| Argument | Description |
+| --------- | ---------------------------------------- |
+| `<input>` | Path to the sanitized dataset directory. |
+
+### Options
+
+| Option | Description |
+| ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `--output <path>` | Custom output directory for the generated SQL script and manifest. |
+| `--dataset <dataset>` | Generate a script only for one dataset block. Useful for debugging. |
+| `--script-name <name>` | Name of the generated SQL script. Defaults to `import-postgres-direct.sql`. |
+| `--source-encoding <encoding>` | Source file encoding used by `psql` while reading the sanitized Receita files. Defaults to `WIN1252`. Use `UTF8` only if the files are already UTF-8. |
+| `-f, --force` | Skip the confirmation prompt. |
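
The `--source-encoding` choice in the table above can be checked before the script is generated. The following is a minimal sketch and not part of cnpj-db-loader: `guess_source_encoding` is a hypothetical helper that uses `iconv` (assumed to be installed) to test whether a file decodes as UTF-8, and otherwise falls back to the documented `WIN1252` default. It does not prove that a non-UTF-8 file really is Windows-1252.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of cnpj-db-loader): suggest a value for
# --source-encoding by testing whether a file is valid UTF-8.
guess_source_encoding() {
  local file=$1
  # iconv exits non-zero when the input contains bytes that are not valid UTF-8
  # (it also fails when the file cannot be read, which yields the fallback).
  if iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
    echo "UTF8"
  else
    echo "WIN1252"   # documented default of `postgres generate-script`
  fi
}
```

The result could then be passed as `--source-encoding "$(guess_source_encoding <file>)"`. Pure ASCII files report `UTF8`, which is harmless because ASCII bytes mean the same thing in both encodings.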
+
+## Output structure
+
+The command creates a small PostgreSQL direct output directory:
+
+```text
+postgres-direct/
+  manifest.json
+  import-postgres-direct.sql
+```
+
+Unlike `postgres export-csv`, this command does not create a second tree of converted CSV files. The generated SQL script points directly to the sanitized Receita files.
+
+This is faster for large monthly loads because it avoids reading and writing the entire dataset again just to add headers or change delimiters.
+
+## Generated script behavior
+
+The generated `import-postgres-direct.sql` script:
+
+1. enables `ON_ERROR_STOP` for `psql`;
+2. starts a transaction;
+3. truncates the `staging_*` tables and restarts their identities;
+4. sets the configured client encoding for `psql` copy operations;
+5. loads domain datasets from sanitized Receita files into temporary raw text tables;
+6. upserts final domain tables;
+7. loads large datasets from sanitized Receita files into temporary raw text tables;
+8. converts values inside PostgreSQL and inserts them into `staging_companies`, `staging_establishments`, `staging_partners` and `staging_simples_options`;
+9. materializes final `companies`, `establishments`, `partners` and `simples_options` tables using set-based SQL;
+10. populates `establishment_secondary_cnaes` from `secondary_cnaes_raw`;
+11. runs `ANALYZE` on the main final tables;
+12. commits the transaction.
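
The step order above can be pictured with a heavily reduced skeleton. This is a sketch under stated assumptions, not the script that cnpj-db-loader actually emits: the function `write_skeleton`, the temporary table `raw_cnaes`, the target table `cnaes` with its `code` and `description` columns, the data file path and the `;` delimiter are all illustrative. Only `staging_companies` and the overall ordering come from the documentation.

```bash
#!/usr/bin/env bash
# Illustrative only: writes a miniature psql script that mirrors the step
# order described above. Table, column and file names are assumptions.
write_skeleton() {
  local out=$1 data_file=$2 encoding=${3:-WIN1252}
  # Unquoted heredoc: "\\" becomes a single backslash, ${...} is expanded.
  cat > "$out" <<SQL
\\set ON_ERROR_STOP on
BEGIN;
TRUNCATE staging_companies RESTART IDENTITY;
SET client_encoding = '${encoding}';
CREATE TEMP TABLE raw_cnaes (code text, description text) ON COMMIT DROP;
\\copy raw_cnaes FROM '${data_file}' WITH (FORMAT csv, DELIMITER ';')
INSERT INTO cnaes (code, description)
SELECT code, NULLIF(description, '') FROM raw_cnaes
ON CONFLICT (code) DO UPDATE SET description = EXCLUDED.description;
ANALYZE cnaes;
COMMIT;
SQL
}
```

Calling `write_skeleton ./skeleton.sql ./cnaes.csv` produces a file that could be run with `psql -f` against a database where the assumed tables exist and `cnaes.code` has a unique constraint. Because everything sits inside one transaction with `ON_ERROR_STOP` enabled, a failure at any step rolls back the whole load.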
+
+The script does not recreate the schema. Run the normal schema first:
+
+```bash
+cnpj-db-loader schema generate --profile full --output ./sql/schema.sql
+psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./sql/schema.sql
+```
+
+## Important notes
+
+The generated script is designed for full controlled loads and benchmarks. It is not a replacement for the standard resumable `import` command when you need checkpoint-based recovery, row quarantine or long-running incremental resume behavior.
+
+The generated script resets staging tables, but it does not truncate final tables. Final tables are updated through `ON CONFLICT` upserts.
+
+For a fully clean rebuild, reset the database or run the appropriate database cleanup command before executing the generated script.
+
+## Windows usage
+
+On Windows, the script uses `\copy`, not server-side `COPY`.
+
+This is intentional. With `\copy`, the `psql` client reads local files and streams them to PostgreSQL. This avoids common Windows service permission issues where the PostgreSQL service user cannot read files from your working directory.
+
+Example:
+
+```powershell
+psql "postgres://postgres:postgres@localhost:5432/cnpj" -f "D:/cnpj-data/2026-05/postgres-direct/import-postgres-direct.sql"
+```
+
+## Recommended comparison benchmark
+
+To compare the standard and hybrid paths:
+
+```bash
+# Standard path
+cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
+
+# Hybrid path
+cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
+```
+
+Compare total duration, disk usage, PostgreSQL CPU usage, WAL growth and final row counts.
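
One way to collect the duration part of that comparison is a small timing wrapper. This is a sketch, not a feature of the loader: `time_step` and the `BENCH_LOG` variable are hypothetical names. The helper runs any command and appends a tab-separated record (label, elapsed seconds, exit status) to a log file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command and record how long it took.
# Records are appended to $BENCH_LOG (default: ./benchmark.tsv).
time_step() {
  local label=$1; shift
  local start end status=0
  start=$(date +%s)
  "$@" || status=$?          # keep going so the failure is still recorded
  end=$(date +%s)
  # Record format: <label> TAB <elapsed seconds> TAB <exit status>
  printf '%s\t%s\t%s\n' "$label" "$((end - start))" "$status" \
    >> "${BENCH_LOG:-./benchmark.tsv}"
  return "$status"
}

# Example use with the commands shown above (paths are placeholders):
#   time_step standard cnpj-db-loader import "$SANITIZED_DIR" --load-batch-size 500
#   time_step hybrid   psql "$DATABASE_URL" -f "$SCRIPT_PATH"
```

Wall-clock seconds are coarse but sufficient for loads that run for minutes or hours. Row counts can be compared separately, for example with `psql -c "SELECT count(*) FROM companies"` after each run.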
package/docs/releases/v2.4.0.md
@@ -0,0 +1,40 @@
+# v2.4.0 — PostgreSQL Direct Import Workflow
+
+This release adds a hybrid PostgreSQL direct import workflow.
+
+The loader can now generate a ready-to-run `psql` script that loads sanitized Receita Federal files directly through `\copy`, converts values inside PostgreSQL and materializes the final schema using set-based SQL.
+
+The previous CSV export path remains available for audit/debug workflows, but the recommended fast path no longer rewrites the entire dataset into a second CSV tree.
+
+## Added
+
+- Added `postgres generate-script` command.
+- Added direct `psql` script generation from sanitized Receita files.
+- Added SQL-side raw temporary tables and value conversion for dates, numerics and nullable fields.
+- Kept `postgres export-csv` for optional PostgreSQL-ready CSV export with headers, UTF-8 output and normalized values.
+- Added generated `import-postgres-direct.sql` script.
+- Added generated `manifest.json` for exported files and row counts.
+- Added set-based SQL materialization for:
+  - `companies`
+  - `establishments`
+  - `establishment_secondary_cnaes`
+  - `partners`
+  - `simples_options`
+- Added domain table loading through temporary tables and final upserts.
+- Added documentation for the hybrid PostgreSQL workflow.
+
+## Purpose
+
+This workflow is designed for controlled bulk-load scenarios where the standard resumable importer is too slow for local full monthly loads.
+
+The recommended flow is:
+
+1. use the loader for download, extraction, validation and sanitization;
+2. generate a direct `psql` script from the sanitized files;
+3. run the generated `psql` script to load and materialize the database.
+
+## Notes
+
+The standard `import` command remains the safest option when checkpoint-based resume, row quarantine and detailed recovery behavior are required.
+
+The new PostgreSQL direct workflow is intended for faster controlled imports and benchmarking while keeping extraction, validation and sanitization inside the loader. Value conversion for this path happens inside PostgreSQL to avoid unnecessary full-dataset rewriting.
package/docs/usage.md
CHANGED
@@ -117,6 +117,20 @@ The final import summary also includes baseline metrics for preparatory scan tim
 
 The exact preparatory scan runs only when no saved import plan exists for the same validated source files and batch size. On resume, the importer reuses the saved plan and then reuses the checkpoint table to continue from the last committed byte offset instead of restarting the data load itself. Rows that fail after retries are written to `import_quarantine`, so a few bad rows do not stop the entire dataset. Running `sanitize` first reduces how often the importer has to fall back to those slower recovery paths.
 
+## Hybrid PostgreSQL direct import
+
+After sanitization, you can generate a direct `psql` script and let PostgreSQL load the sanitized Receita files without rewriting the full dataset into another CSV tree:
+
+```bash
+cnpj-db-loader sanitize ./downloads/<reference>/extracted
+cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
+```
+
+Use this flow when you want PostgreSQL to perform the heavy bulk load and set-based materialization directly. Use `postgres export-csv` only when you need an intermediate normalized CSV tree for audit/debug purposes. Use the standard `import` command when you need checkpoint-based resume and row quarantine recovery.
+
+See [PostgreSQL Direct Import](./postgres-direct.md) for details.
+
 ## Quarantine analysis
 
 Use the `quarantine` service after a long-running import when you want to inspect the rows that could not be inserted.