@danielarndt0/cnpj-db-loader 2.3.1 → 2.4.0-beta.1

@@ -95,3 +95,11 @@ planner -> source-reader -> parser -> normalizer -> staging-writer -> materializ
  ```
 
  - Materialization now stores lightweight staging validation markers (row count and max staging id) in the materialization checkpoint table so reruns can verify the live staging state quickly and reuse lookup reconciliation when the staging snapshot is unchanged. The runtime validates that the required import tables already exist but no longer creates or alters them automatically.
+
+ ## PostgreSQL direct import workflow
+
+ The PostgreSQL direct import workflow is a hybrid execution path. It keeps file detection, Receita parsing, validation and sanitization in the loader, then exports normalized CSV files and a generated `psql` script for direct PostgreSQL loading.
+
+ This keeps the parsing rules centralized in the TypeScript layouts while allowing PostgreSQL to run the heaviest bulk load and materialization work with set-based SQL.
+
+ The generated script resets staging tables, loads CSV files with `\copy`, upserts domain and final tables, materializes `establishment_secondary_cnaes`, and refreshes planner statistics with `ANALYZE`.
package/docs/commands.md CHANGED
@@ -22,6 +22,8 @@
  | `database cleanup materialized` | Truncate simplified final relational tables populated by materialization, including establishment secondary CNAEs when available, in safe order. |
  | `database cleanup checkpoints` | Clear load checkpoints, materialization checkpoints, or both without truncating staging or final tables. |
  | `database cleanup plans` | Delete saved import plans. Related plan files and materialization checkpoints are removed by database cascade. |
+ | `postgres generate-script` | Generate a direct `psql` import script that loads sanitized Receita files without rewriting them into new CSV files. |
+ | `postgres export-csv` | Convert sanitized Receita files into normalized PostgreSQL-ready CSV files and generate a direct `psql` import script for audit/debug workflows. |
  | `import <input>` | Run the full pipeline: plan, load validated files into staging/direct final targets, materialize staged datasets into final tables, and finalize the import plan. |
  | `import load <input>` | Prepare the plan and run only the load phase. Heavy datasets stop in `staging_*`; domain datasets still upsert directly into the final schema. |
  | `import materialize <input>` | Resume from the saved import plan and materialize staged datasets into the final relational tables with resumable chunks. |
@@ -50,6 +52,8 @@ cnpj-db-loader database cleanup staging --validated-path ./downloads/sanitized -
  cnpj-db-loader database cleanup materialized --dataset companies --force
  cnpj-db-loader database cleanup checkpoints --phase materialization --validated-path ./downloads/sanitized --force
  cnpj-db-loader database cleanup plans --validated-path ./downloads/sanitized --force
+ cnpj-db-loader postgres generate-script ./downloads/sanitized --output ./downloads/postgres-direct --force
+ psql "postgres://user:password@localhost:5432/cnpj" -f ./downloads/postgres-direct/import-postgres-direct.sql
  cnpj-db-loader import ./downloads/sanitized
  cnpj-db-loader import ./downloads/sanitized --db-url "postgresql://user:password@localhost:5432/cnpj"
  cnpj-db-loader import ./downloads/sanitized --dataset companies --load-batch-size 500
@@ -63,3 +67,22 @@ cnpj-db-loader quarantine list --dataset establishments --limit 10
  cnpj-db-loader quarantine list --terminal --after-id 500
  cnpj-db-loader quarantine show 42
  ```
+
+ ## PostgreSQL direct import helper
+
+ ```bash
+ cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [--source-encoding <encoding>] [-f]
+ cnpj-db-loader postgres export-csv <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [-f]
+ ```
+
+ `postgres generate-script` is the recommended hybrid workflow. The loader performs extraction, validation and sanitization, then generates a `psql` script that loads the sanitized Receita files directly through `\copy`.
+
+ `postgres export-csv` remains available when you explicitly want a normalized CSV output tree for audit/debug purposes.
+
+ Options:
+
+ - `--output <path>`: directory where manifest and SQL script are generated.
+ - `--dataset <dataset>`: generate only one dataset block.
+ - `--script-name <name>`: custom generated SQL script name.
+ - `--source-encoding <encoding>`: source file encoding for `psql` copy operations. Defaults to `WIN1252`.
+ - `-f, --force`: skip confirmation.
@@ -0,0 +1,138 @@
+ # PostgreSQL direct import workflow
+
+ The PostgreSQL direct import workflow is a hybrid path for environments where the standard resumable importer is too expensive for a full monthly load.
+
+ It keeps the safe preparation steps inside CNPJ DB Loader and moves the heaviest database load/materialization work into a generated `psql` script.
+
+ ## Intended flow
+
+ ```bash
+ cnpj-db-loader federal-revenue download --output ./downloads
+ cnpj-db-loader extract ./downloads/<reference>
+ cnpj-db-loader validate ./downloads/<reference>/extracted
+ cnpj-db-loader sanitize ./downloads/<reference>/extracted
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+ psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
+ ```
+
+ The loader remains responsible for:
+
+ - Federal Revenue download and local manifest control
+ - extraction
+ - validation
+ - sanitization
+ - preserving the sanitized Receita files without rewriting the whole dataset
+ - generating the final `psql` import script
+ - optionally exporting PostgreSQL-ready CSV files through `postgres export-csv` when an audit/debug CSV tree is useful
+
+ PostgreSQL is then responsible for:
+
+ - `\copy` loading sanitized Receita files into temporary raw tables
+ - SQL-side conversion of dates, numeric values and nullable fields
+ - staging table population
+ - set-based final table upserts
+ - `establishment_secondary_cnaes` materialization
+ - planner statistics refresh through `ANALYZE`
+
+ ## Why this exists
+
+ The standard `import` command is safer and resumable, but it keeps more orchestration inside the Node.js process. That is useful for production safety, checkpoints, quarantine and incremental recovery.
+
+ The direct PostgreSQL path is optimized for bulk loading after the input files have already been sanitized. It avoids per-batch Node.js database inserts and avoids rewriting the full dataset into a second CSV tree. Instead, `psql` streams the sanitized Receita files into temporary text tables and PostgreSQL performs the value conversion and materialization with set-based SQL.
+
+ Use this when you want to benchmark or run a faster controlled load on a local machine.
+
+ ## Command
+
+ ```bash
+ cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [--source-encoding <encoding>] [-f]
+ ```
+
+ ### Arguments
+
+ | Argument | Description |
+ | --------- | ---------------------------------------- |
+ | `<input>` | Path to the sanitized dataset directory. |
+
+ ### Options
+
+ | Option | Description |
+ | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+ | `--output <path>` | Custom output directory for the generated SQL script and manifest. |
+ | `--dataset <dataset>` | Generate a script only for one dataset block. Useful for debugging. |
+ | `--script-name <name>` | Name of the generated SQL script. Defaults to `import-postgres-direct.sql`. |
+ | `--source-encoding <encoding>` | Source file encoding used by `psql` while reading the sanitized Receita files. Defaults to `WIN1252`. Use `UTF8` only if the files are already UTF-8. |
+ | `-f, --force` | Skip the confirmation prompt. |
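
A minimal sketch, not taken from the package, of one way to choose a `--source-encoding` value before generating the script. It assumes `iconv` is installed; the helper names `is_utf8` and `suggest_source_encoding` are hypothetical. A file that fails strict UTF-8 decoding is not proven to be `WIN1252`; the helper simply falls back to the documented default.

```bash
# Hypothetical helper: succeeds only if the file decodes as strict UTF-8.
# iconv exits non-zero when it meets a byte sequence that is invalid in the source encoding.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: print the value to pass to --source-encoding for one file.
suggest_source_encoding() {
  if is_utf8 "$1"; then
    echo "UTF8"
  else
    # Not valid UTF-8: fall back to the documented default.
    echo "WIN1252"
  fi
}
```

Pure ASCII files pass the UTF-8 check and load identically under either encoding, so the check only matters for files that contain accented characters.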
+
+ ## Output structure
+
+ The command creates a small PostgreSQL direct output directory:
+
+ ```text
+ postgres-direct/
+   manifest.json
+   import-postgres-direct.sql
+ ```
+
+ Unlike `postgres export-csv`, this command does not create a second tree of converted CSV files. The generated SQL script points directly to the sanitized Receita files.
+
+ This is faster for large monthly loads because it avoids reading and writing the entire dataset again just to add headers or change delimiters.
+
+ ## Generated script behavior
+
+ The generated `import-postgres-direct.sql` script:
+
+ 1. enables `ON_ERROR_STOP` for `psql`;
+ 2. starts a transaction;
+ 3. truncates the `staging_*` tables and restarts their identities;
+ 4. sets the configured client encoding for `psql` copy operations;
+ 5. loads domain datasets from sanitized Receita files into temporary raw text tables;
+ 6. upserts final domain tables;
+ 7. loads large datasets from sanitized Receita files into temporary raw text tables;
+ 8. converts values inside PostgreSQL and inserts them into `staging_companies`, `staging_establishments`, `staging_partners` and `staging_simples_options`;
+ 9. materializes final `companies`, `establishments`, `partners` and `simples_options` tables using set-based SQL;
+ 10. populates `establishment_secondary_cnaes` from `secondary_cnaes_raw`;
+ 11. runs `ANALYZE` on the main final tables;
+ 12. commits the transaction.
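
To make the shape of those steps concrete, the sketch below writes a heavily simplified script covering steps 1 to 6, 11 and 12 for a single domain dataset. It is not the generated file. The function name `write_sketch_script`, the final table `cnaes`, its columns, the temporary table `raw_cnaes` and the `;`-delimited CSV options are all assumptions for illustration; only `staging_companies` and the `WIN1252` default come from the text above.

```bash
# Hypothetical sketch: write a simplified, single-dataset psql script.
#   $1 = path to one sanitized source file
#   $2 = path of the .sql file to write
# The heredoc is unquoted so that $1 is expanded into the \copy line.
write_sketch_script() {
  cat > "$2" <<SQL
\set ON_ERROR_STOP on
BEGIN;
-- Step 3: reset staging (one table shown).
TRUNCATE TABLE staging_companies RESTART IDENTITY;
-- Step 4: declare the encoding of the files that psql will stream.
SET client_encoding TO 'WIN1252';
-- Step 5: load raw text into a temporary table (assumed columns).
CREATE TEMP TABLE raw_cnaes (code text, description text) ON COMMIT DROP;
\copy raw_cnaes FROM '$1' WITH (FORMAT csv, DELIMITER ';')
-- Step 6: set-based upsert into the assumed final domain table.
INSERT INTO cnaes (code, description)
SELECT code, NULLIF(btrim(description), '')
FROM raw_cnaes
ON CONFLICT (code) DO UPDATE SET description = EXCLUDED.description;
-- Steps 11 and 12: refresh statistics, then commit.
ANALYZE cnaes;
COMMIT;
SQL
}
```

Because everything runs in one transaction with `ON_ERROR_STOP` enabled, a failure in any statement aborts the script and rolls back the whole load. The sketch does not escape single quotes in the file path, so it only suits paths without them.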
+
+ The script does not recreate the schema. Run the normal schema first:
+
+ ```bash
+ cnpj-db-loader schema generate --profile full --output ./sql/schema.sql
+ psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./sql/schema.sql
+ ```
+
+ ## Important notes
+
+ The generated script is designed for full controlled loads and benchmarks. It is not a replacement for the standard resumable `import` command when you need checkpoint-based recovery, row quarantine or long-running incremental resume behavior.
+
+ The generated script resets staging tables, but it does not truncate final tables. Final tables are updated through `ON CONFLICT` upserts.
+
+ For a fully clean rebuild, reset the database or run the appropriate database cleanup command before executing the generated script.
+
+ ## Windows usage
+
+ The script always uses `\copy`, not server-side `COPY`, which matters most on Windows.
+
+ This is intentional. With `\copy`, the `psql` client reads local files and streams them to PostgreSQL. This avoids common Windows service permission issues where the PostgreSQL service user cannot read files from your working directory.
+
+ Example:
+
+ ```powershell
+ psql "postgres://postgres:postgres@localhost:5432/cnpj" -f "D:/cnpj-data/2026-05/postgres-direct/import-postgres-direct.sql"
+ ```
+
+ ## Recommended comparison benchmark
+
+ To compare the standard and hybrid paths:
+
+ ```bash
+ # Standard path
+ cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
+
+ # Hybrid path
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+ psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
+ ```
+
+ Compare total duration, disk usage, PostgreSQL CPU usage, WAL growth and final row counts.
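
A minimal sketch, not taken from the package, of how the first and last of those metrics could be captured. The helper name `time_step` and the `BENCH_LOG` variable are hypothetical. It uses `date +%s`, so durations are rounded to whole seconds.

```bash
# Hypothetical helper: run a command and append "<label> <seconds> <exit-code>"
# to the file named by BENCH_LOG (default: bench.log).
time_step() {
  label=$1
  shift
  start=$(date +%s)
  status=0
  "$@" || status=$?   # keep going so a failed run is still logged
  end=$(date +%s)
  echo "$label $((end - start)) $status" >> "${BENCH_LOG:-bench.log}"
  return "$status"
}

# Example usage (commented out; <reference> is the placeholder used above):
# time_step standard cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000
# time_step hybrid psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
# psql "postgres://postgres:postgres@localhost:5432/cnpj" -Atc "SELECT count(*) FROM companies"
```

Running the row-count query after each path gives a quick check that both produced the same final table sizes. For a fair comparison, start each run from the same database state.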
@@ -0,0 +1,40 @@
+ # v2.4.0 — PostgreSQL Direct Import Workflow
+
+ This release adds a hybrid PostgreSQL direct import workflow.
+
+ The loader can now generate a ready-to-run `psql` script that loads sanitized Receita Federal files directly through `\copy`, converts values inside PostgreSQL and materializes the final schema using set-based SQL.
+
+ The previous CSV export path remains available for audit/debug workflows, but the recommended fast path no longer rewrites the entire dataset into a second CSV tree.
+
+ ## Added
+
+ - Added `postgres generate-script` command.
+ - Added direct `psql` script generation from sanitized Receita files.
+ - Added SQL-side raw temporary tables and value conversion for dates, numerics and nullable fields.
+ - Kept `postgres export-csv` for optional PostgreSQL-ready CSV export with headers, UTF-8 output and normalized values.
+ - Added generated `import-postgres-direct.sql` script.
+ - Added generated `manifest.json` for exported files and row counts.
+ - Added set-based SQL materialization for:
+   - `companies`
+   - `establishments`
+   - `establishment_secondary_cnaes`
+   - `partners`
+   - `simples_options`
+ - Added domain table loading through temporary tables and final upserts.
+ - Added documentation for the hybrid PostgreSQL workflow.
+
+ ## Purpose
+
+ This workflow is designed for controlled bulk-load scenarios where the standard resumable importer is too slow for local full monthly loads.
+
+ The recommended flow is:
+
+ 1. use the loader for download, extraction, validation and sanitization;
+ 2. generate a direct `psql` script from the sanitized files;
+ 3. run the generated `psql` script to load and materialize the database.
+
+ ## Notes
+
+ The standard `import` command remains the safest option when checkpoint-based resume, row quarantine and detailed recovery behavior are required.
+
+ The new PostgreSQL direct workflow is intended for faster controlled imports and benchmarking while keeping extraction, validation and sanitization inside the loader. Value conversion for this path happens inside PostgreSQL to avoid unnecessary full-dataset rewriting.
package/docs/usage.md CHANGED
@@ -117,6 +117,20 @@ The final import summary also includes baseline metrics for preparatory scan tim
 
  The exact preparatory scan runs only when no saved import plan exists for the same validated source files and batch size. On resume, the importer reuses the saved plan and then reuses the checkpoint table to continue from the last committed byte offset instead of restarting the data load itself. Rows that fail after retries are written to `import_quarantine`, so a few bad rows do not stop the entire dataset. Running `sanitize` first reduces how often the importer has to fall back to those slower recovery paths.
 
+ ## Hybrid PostgreSQL direct import
+
+ After sanitization, you can generate a direct `psql` script and let PostgreSQL load the sanitized Receita files without rewriting the full dataset into another CSV tree:
+
+ ```bash
+ cnpj-db-loader sanitize ./downloads/<reference>/extracted
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+ psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
+ ```
+
+ Use this flow when you want PostgreSQL to perform the heavy bulk load and set-based materialization directly. Use `postgres export-csv` only when you need an intermediate normalized CSV tree for audit/debug purposes. Use the standard `import` command when you need checkpoint-based resume and row quarantine recovery.
+
+ See [PostgreSQL Direct Import](./postgres-direct.md) for details.
+
  ## Quarantine analysis
 
  Use the `quarantine` service after a long-running import when you want to inspect the rows that could not be inserted.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@danielarndt0/cnpj-db-loader",
-   "version": "2.3.1",
+   "version": "2.4.0-beta.1",
    "publishConfig": {
      "access": "public"
    },