@danielarndt0/cnpj-db-loader 2.4.0-beta.1 → 2.4.0-beta.3

@@ -21,7 +21,7 @@ The import pipeline now uses:
21
21
  - deterministic dataset order to respect foreign keys
22
22
  - an exact preparatory scan that counts total source rows and planned batches before the first write
23
23
  - streaming file reads to avoid loading the full dataset into RAM
24
- - an optional sanitize step that removes known low-level byte issues before import starts
24
+ - an optional sanitize step that writes clean UTF-8 files and removes known low-level byte issues before import starts
25
25
  - COPY-based staged writes for the large datasets followed by staged-to-final materialization
26
26
  - conflict-safe upserts for the smaller domain datasets
27
27
  - `import_plans` and `import_plan_files` to persist exact import plans and avoid recounting the same source files on resume
package/docs/cli.md CHANGED
@@ -12,7 +12,7 @@ cnpj-db-loader federal-revenue sync [reference] [--reference <yyyy-mm>] [--curre
12
12
  cnpj-db-loader inspect <input>
13
13
  cnpj-db-loader extract <input> [--output <path>]
14
14
  cnpj-db-loader validate <input>
15
- cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [-f]
15
+ cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [--source-encoding <encoding>] [-f]
16
16
  cnpj-db-loader schema print [--profile <profile>]
17
17
  cnpj-db-loader schema generate [--name <name>] [--output <path>] [--profile <profile>]
18
18
  cnpj-db-loader database config set <url>
package/docs/commands.md CHANGED
@@ -71,7 +71,7 @@ cnpj-db-loader quarantine show 42
71
71
  ## PostgreSQL direct import helper
72
72
 
73
73
  ```bash
74
- cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [--source-encoding <encoding>] [-f]
74
+ cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [--source-encoding <encoding>] [--transaction-mode <mode>] [--include <items>] [--skip-indexes] [--skip-analyze] [-f]
75
75
  cnpj-db-loader postgres export-csv <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [-f]
76
76
  ```
77
77
 
@@ -84,5 +84,9 @@ Options:
84
84
  - `--output <path>`: directory where manifest and SQL script are generated.
85
85
  - `--dataset <dataset>`: generate only one dataset block.
86
86
  - `--script-name <name>`: custom generated SQL script name.
87
- - `--source-encoding <encoding>`: source file encoding for `psql` copy operations. Defaults to `WIN1252`.
87
+ - `--source-encoding <encoding>`: source file encoding for `psql` copy operations. Defaults to `UTF8`.
88
+ - `--transaction-mode <mode>`: generated transaction strategy: `single`, `phase` or `none`. Defaults to `single`.
89
+ - `--include <items>`: comma-separated generation targets such as `domains,companies,establishments,secondary-cnaes,analyze`.
90
+ - `--skip-indexes`: skip the generated indexes phase.
91
+ - `--skip-analyze`: skip the generated analyze phase.
88
92
  - `-f, --force`: skip confirmation.
@@ -2,7 +2,7 @@
2
2
 
3
3
  The PostgreSQL direct import workflow is a hybrid path for environments where the standard resumable importer is too expensive for a full monthly load.
4
4
 
5
- It keeps the safe preparation steps inside CNPJ DB Loader and moves the heaviest database load/materialization work into a generated `psql` script.
5
+ It keeps the safe preparation steps inside CNPJ DB Loader and moves the heaviest database load/materialization work into generated `psql` scripts.
6
6
 
7
7
  ## Intended flow
8
8
 
@@ -11,8 +11,8 @@ cnpj-db-loader federal-revenue download --output ./downloads
11
11
  cnpj-db-loader extract ./downloads/<reference>
12
12
  cnpj-db-loader validate ./downloads/<reference>/extracted
13
13
  cnpj-db-loader sanitize ./downloads/<reference>/extracted
14
- cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
15
- psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
14
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --source-encoding UTF8 --transaction-mode phase --force
15
+ psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
16
16
  ```
17
17
 
18
18
  The loader remains responsible for:
@@ -22,7 +22,7 @@ The loader remains responsible for:
22
22
  - validation
23
23
  - sanitization
24
24
  - preserving the sanitized Receita files without rewriting the whole dataset
25
- - generating the final `psql` import script
25
+ - generating the modular `psql` import scripts
26
26
  - optionally exporting PostgreSQL-ready CSV files through `postgres export-csv` when an audit/debug CSV tree is useful
27
27
 
28
28
  PostgreSQL is then responsible for:
@@ -34,18 +34,10 @@ PostgreSQL is then responsible for:
34
34
  - `establishment_secondary_cnaes` materialization
35
35
  - planner statistics refresh through `ANALYZE`
36
36
 
37
- ## Why this exists
38
-
39
- The standard `import` command is safer and resumable, but it keeps more orchestration inside the Node.js process. That is useful for production safety, checkpoints, quarantine and incremental recovery.
40
-
41
- The direct PostgreSQL path is optimized for bulk loading after the input files have already been sanitized. It avoids per-batch Node.js database inserts and avoids rewriting the full dataset into a second CSV tree. Instead, `psql` streams the sanitized Receita files into temporary text tables and PostgreSQL performs the value conversion and materialization with set-based SQL.
42
-
43
- Use this when you want to benchmark or run a faster controlled load on a local machine.
44
-
45
37
  ## Command
46
38
 
47
39
  ```bash
48
- cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [--source-encoding <encoding>] [-f]
40
+ cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <dataset>] [--script-name <name>] [--source-encoding <encoding>] [--transaction-mode <mode>] [--include <items>] [--skip-indexes] [--skip-analyze] [-f]
49
41
  ```
50
42
 
51
43
  ### Arguments
@@ -56,59 +48,261 @@ cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <da
56
48
 
57
49
  ### Options
58
50
 
59
- | Option | Description |
60
- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
61
- | `--output <path>` | Custom output directory for the generated SQL script and manifest. |
62
- | `--dataset <dataset>` | Generate a script only for one dataset block. Useful for debugging. |
63
- | `--script-name <name>` | Name of the generated SQL script. Defaults to `import-postgres-direct.sql`. |
64
- | `--source-encoding <encoding>` | Source file encoding used by `psql` while reading the sanitized Receita files. Defaults to `WIN1252`. Use `UTF8` only if the files are already UTF-8. |
65
- | `-f, --force` | Skip the confirmation prompt. |
51
+ | Option | Description |
52
+ | ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
53
+ | `--output <path>` | Custom output directory for the generated SQL scripts and manifest. |
54
+ | `--dataset <dataset>` | Generate scripts only for one dataset block. Useful for debugging. |
55
+ | `--script-name <name>` | Name of the generated orchestrator script. Defaults to `import-postgres-direct.sql`. |
56
+ | `--source-encoding <encoding>` | Source file encoding used by `psql` while reading the sanitized Receita files. Defaults to `UTF8` because the current `sanitize` command writes UTF-8 output. Use `WIN1252` or `LATIN1` only for legacy sanitized files generated by older versions. |
57
+ | `--transaction-mode <mode>` | Transaction strategy for generated scripts: `single`, `phase` or `none`. Defaults to `single`. |
58
+ | `--include <items>` | Comma-separated steps to include: `domains`, `companies`, `establishments`, `partners`, `simples`, `secondary-cnaes`, `indexes`, `analyze`. |
59
+ | `--skip-indexes` | Do not generate the `indexes.sql` step. |
60
+ | `--skip-analyze` | Do not generate the `analyze.sql` step. |
61
+ | `-f, --force` | Skip the confirmation prompt. |
62
+
63
+ ## Transaction modes
64
+
65
+ ### `single`
66
+
67
+ The orchestrator wraps all included steps in one transaction:
68
+
69
+ ```text
70
+ BEGIN
71
+ setup
72
+ load domains
73
+ load companies
74
+ load establishments
75
+ load partners
76
+ load simples
77
+ materialize
78
+ materialize secondary CNAEs
79
+ indexes
80
+ analyze
81
+ COMMIT
82
+ ```
83
+
84
+ This is the safest mode because a failure rolls back the whole run, but it is also the least convenient for very large imports because a late failure requires starting over.
85
+
86
+ ### `phase`
87
+
88
+ Each generated phase script wraps its own work in a transaction.
89
+
90
+ This is the recommended mode for long local runs because completed phases remain committed if a later phase fails.
91
+
92
+ ### `none`
93
+
94
+ No generated transaction wrapper is added.
95
+
96
+ This mode is useful for aggressive benchmark scenarios, but it can leave partial data if a command fails.
66
97
 
67
98
  ## Output structure
68
99
 
69
- The command creates a small PostgreSQL direct output directory:
100
+ The command now creates a modular PostgreSQL direct output directory:
70
101
 
71
102
  ```text
72
103
  postgres-direct/
73
104
  manifest.json
74
105
  import-postgres-direct.sql
106
+ setup.sql
107
+ load-domains.sql
108
+ load-companies.sql
109
+ load-establishments.sql
110
+ load-partners.sql
111
+ load-simples.sql
112
+ materialize.sql
113
+ materialize-secondary-cnaes.sql
114
+ indexes.sql
115
+ analyze.sql
116
+ ```
117
+
118
+ The `import-postgres-direct.sql` file is an orchestrator that runs the included phase scripts in the correct order with `\ir`.
119
+
120
+ You can execute the full flow:
121
+
122
+ ```bash
123
+ psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f ./postgres-direct/import-postgres-direct.sql
124
+ ```
125
+
126
+ Or execute individual phase scripts:
127
+
128
+ ```bash
129
+ psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f ./postgres-direct/load-domains.sql
130
+ psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f ./postgres-direct/load-establishments.sql
131
+ ```
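When only some phase scripts were generated (for example with `--include` or `--skip-indexes`), it can help to run whatever exists in the documented order and stop at the first failure. The following is a minimal sketch, not something the loader generates. `PHASE_SCRIPTS`, `list_phase_scripts`, `run_phases` and the `DATABASE_URL` environment variable are hypothetical names. It assumes the scripts were generated with `--transaction-mode phase`, because in `single` mode the transaction wrapper lives in the orchestrator rather than in each phase script.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not generated by cnpj-db-loader.
# Phase order copied from the output structure listed above.
PHASE_SCRIPTS="setup.sql load-domains.sql load-companies.sql load-establishments.sql load-partners.sql load-simples.sql materialize.sql materialize-secondary-cnaes.sql indexes.sql analyze.sql"

# Print, in phase order, the scripts that actually exist in the directory.
# Useful as a dry run before touching the database.
list_phase_scripts() {
  local dir="$1" name
  for name in $PHASE_SCRIPTS; do
    if [ -f "$dir/$name" ]; then
      echo "$dir/$name"
    fi
  done
}

# Run each existing phase script in order and stop at the first failure.
# DATABASE_URL is an assumed environment variable with the connection string.
run_phases() {
  local dir="$1" name script
  for name in $PHASE_SCRIPTS; do
    script="$dir/$name"
    [ -f "$script" ] || continue
    echo "running $script"
    # ON_ERROR_STOP is also enabled by the generated scripts; repeating it is harmless.
    psql -v ON_ERROR_STOP=1 -d "$DATABASE_URL" -f "$script" || {
      echo "failed at $script" >&2
      return 1
    }
  done
}

# Example (commented out so sourcing this file has no side effects):
# list_phase_scripts ./postgres-direct
# DATABASE_URL="postgres://postgres:postgres@localhost:5432/cnpj" run_phases ./postgres-direct
```

After a failure, phases that already committed stay in place, so the same wrapper could be pointed at a directory containing only the remaining scripts.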
132
+
133
+ ## Partial generation
134
+
135
+ Generate only domain scripts:
136
+
137
+ ```bash
138
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --include domains --transaction-mode phase --force
139
+ ```
140
+
141
+ Generate without indexes and analyze:
142
+
143
+ ```bash
144
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --skip-indexes --skip-analyze --force
75
145
  ```
76
146
 
77
- Unlike `postgres export-csv`, this command does not create a second tree of converted CSV files. The generated SQL script points directly to the sanitized Receita files.
147
+ Generate only establishments and secondary CNAEs:
78
148
 
79
- This is faster for large monthly loads because it avoids reading and writing the entire dataset again just to add headers or change delimiters.
149
+ ```bash
150
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --include establishments,secondary-cnaes,analyze --transaction-mode phase --force
151
+ ```
80
152
 
81
153
  ## Generated script behavior
82
154
 
83
- The generated `import-postgres-direct.sql` script:
155
+ The generated scripts:
84
156
 
85
- 1. enables `ON_ERROR_STOP` for `psql`;
86
- 2. starts a transaction;
87
- 3. truncates the `staging_*` tables and restarts their identities;
88
- 4. sets the configured client encoding for `psql` copy operations;
89
- 5. loads domain datasets from sanitized Receita files into temporary raw text tables;
90
- 6. upserts final domain tables;
91
- 7. loads large datasets from sanitized Receita files into temporary raw text tables;
92
- 8. converts values inside PostgreSQL and inserts them into `staging_companies`, `staging_establishments`, `staging_partners` and `staging_simples_options`;
93
- 9. materializes final `companies`, `establishments`, `partners` and `simples_options` tables using set-based SQL;
94
- 10. populates `establishment_secondary_cnaes` from `secondary_cnaes_raw`;
95
- 11. runs `ANALYZE` on the main final tables;
96
- 12. commits the transaction.
157
+ 1. enable `ON_ERROR_STOP` for `psql`;
158
+ 2. set the configured client encoding for `psql` copy operations;
159
+ 3. load domain datasets from sanitized Receita files into temporary raw text tables;
160
+ 4. upsert final domain tables;
161
+ 5. load large datasets from sanitized Receita files into temporary raw text tables;
162
+ 6. convert values inside PostgreSQL and insert them into `staging_companies`, `staging_establishments`, `staging_partners` and `staging_simples_options`;
163
+ 7. materialize final `companies`, `establishments`, `partners` and `simples_options` tables using set-based SQL;
164
+ 8. populate `establishment_secondary_cnaes` from `secondary_cnaes_raw`;
165
+ 9. optionally run an indexes phase;
166
+ 10. optionally run `ANALYZE` on the affected tables.
97
167
 
98
- The script does not recreate the schema. Run the normal schema first:
168
+ The scripts do not recreate the schema. Run the normal schema first:
99
169
 
100
170
  ```bash
101
171
  cnpj-db-loader schema generate --profile full --output ./sql/schema.sql
102
- psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./sql/schema.sql
172
+ psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f ./sql/schema.sql
173
+ ```
174
+
175
+ ## Monitoring PostgreSQL while the import runs
176
+
177
+ The hybrid mode intentionally keeps loader checkpoints lightweight. Use PostgreSQL native views to monitor heavy work.
178
+
179
+ ### Active queries
180
+
181
+ ```sql
182
+ SELECT
183
+ pid,
184
+ now() - query_start AS duration,
185
+ state,
186
+ wait_event_type,
187
+ wait_event,
188
+ left(query, 200) AS query
189
+ FROM pg_stat_activity
190
+ WHERE datname = current_database()
191
+ AND state <> 'idle'
192
+ ORDER BY query_start;
103
193
  ```
104
194
 
105
- ## Important notes
195
+ ### COPY progress
196
+
197
+ ```sql
198
+ SELECT
199
+ pid,
200
+ command,
201
+ type,
202
+ bytes_processed,
203
+ bytes_total,
204
+ tuples_processed,
205
+ CASE
206
+ WHEN bytes_total > 0
207
+ THEN round((bytes_processed::numeric / bytes_total::numeric) * 100, 2)
208
+ ELSE NULL
209
+ END AS percent
210
+ FROM pg_stat_progress_copy;
211
+ ```
212
+
213
+ With auto-refresh in `psql`:
214
+
215
+ ```sql
216
+ SELECT
217
+ pid,
218
+ command,
219
+ type,
220
+ bytes_processed,
221
+ bytes_total,
222
+ tuples_processed,
223
+ CASE
224
+ WHEN bytes_total > 0
225
+ THEN round((bytes_processed::numeric / bytes_total::numeric) * 100, 2)
226
+ ELSE NULL
227
+ END AS percent
228
+ FROM pg_stat_progress_copy;
229
+ \watch 5
230
+ ```
231
+
232
+ ### Locks
233
+
234
+ ```sql
235
+ SELECT
236
+ blocked.pid AS blocked_pid,
237
+ blocking.pid AS blocking_pid,
238
+ blocked_activity.query AS blocked_query,
239
+ blocking_activity.query AS blocking_query
240
+ FROM pg_catalog.pg_locks blocked
241
+ JOIN pg_catalog.pg_stat_activity blocked_activity
242
+ ON blocked_activity.pid = blocked.pid
243
+ JOIN pg_catalog.pg_locks blocking
244
+ ON blocking.locktype = blocked.locktype
245
+ AND blocking.database IS NOT DISTINCT FROM blocked.database
246
+ AND blocking.relation IS NOT DISTINCT FROM blocked.relation
247
+ AND blocking.page IS NOT DISTINCT FROM blocked.page
248
+ AND blocking.tuple IS NOT DISTINCT FROM blocked.tuple
249
+ AND blocking.virtualxid IS NOT DISTINCT FROM blocked.virtualxid
250
+ AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
251
+ AND blocking.classid IS NOT DISTINCT FROM blocked.classid
252
+ AND blocking.objid IS NOT DISTINCT FROM blocked.objid
253
+ AND blocking.objsubid IS NOT DISTINCT FROM blocked.objsubid
254
+ AND blocking.pid <> blocked.pid
255
+ JOIN pg_catalog.pg_stat_activity blocking_activity
256
+ ON blocking_activity.pid = blocking.pid
257
+ WHERE NOT blocked.granted;
258
+ ```
259
+
260
+ ### Main table sizes
261
+
262
+ ```sql
263
+ SELECT
264
+ relname AS table_name,
265
+ pg_size_pretty(pg_total_relation_size(relid)) AS total_size
266
+ FROM pg_catalog.pg_statio_user_tables
267
+ WHERE relname IN (
268
+ 'companies',
269
+ 'establishments',
270
+ 'partners',
271
+ 'simples_options',
272
+ 'establishment_secondary_cnaes'
273
+ )
274
+ ORDER BY pg_total_relation_size(relid) DESC;
275
+ ```
106
276
 
107
- The generated script is designed for full controlled loads and benchmarks. It is not a replacement for the standard resumable `import` command when you need checkpoint-based recovery, row quarantine or long-running incremental resume behavior.
277
+ ### Estimated rows by table
278
+
279
+ ```sql
280
+ SELECT
281
+ relname AS table_name,
282
+ n_live_tup AS estimated_rows,
283
+ n_dead_tup AS dead_rows,
284
+ last_analyze,
285
+ last_autoanalyze
286
+ FROM pg_stat_user_tables
287
+ ORDER BY n_live_tup DESC;
288
+ ```
289
+
290
+ ### PostgreSQL logs on Windows
108
291
 
109
- The generated script resets staging tables, but it does not truncate final tables. Final tables are updated through `ON CONFLICT` upserts.
292
+ ```powershell
293
+ Get-ChildItem "C:\Program Files\PostgreSQL\16\data\log" |
294
+ Sort-Object LastWriteTime -Descending |
295
+ Select-Object -First 1 |
296
+ Get-Content -Tail 120
297
+ ```
110
298
 
111
- For a fully clean rebuild, reset the database or run the appropriate database cleanup command before executing the generated script.
299
+ Event Viewer logs through PowerShell:
300
+
301
+ ```powershell
302
+ Get-EventLog -LogName Application -Newest 80 |
303
+ Where-Object { $_.Source -like "*postgres*" -or $_.Message -like "*PostgreSQL*" } |
304
+ Format-List TimeGenerated, Source, EntryType, Message
305
+ ```
112
306
 
113
307
  ## Windows usage
114
308
 
@@ -119,7 +313,7 @@ This is intentional. With `\copy`, the `psql` client reads local files and strea
119
313
  Example:
120
314
 
121
315
  ```powershell
122
- psql "postgres://postgres:postgres@localhost:5432/cnpj" -f "D:/cnpj-data/2026-05/postgres-direct/import-postgres-direct.sql"
316
+ psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f "D:/cnpj-data/2026-05/postgres-direct/import-postgres-direct.sql"
123
317
  ```
124
318
 
125
319
  ## Recommended comparison benchmark
@@ -131,8 +325,8 @@ To compare the standard and hybrid paths:
131
325
  cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
132
326
 
133
327
  # Hybrid path
134
- cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
135
- psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
328
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --source-encoding UTF8 --transaction-mode phase --force
329
+ psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
136
330
  ```
137
331
 
138
332
  Compare total duration, disk usage, PostgreSQL CPU usage, WAL growth and final row counts.
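One way to capture the total duration of each path is a small timing wrapper around the commands above. This is a minimal sketch, assuming a POSIX-like shell. `time_step` is a hypothetical helper and not part of the CLI, and it measures wall-clock seconds only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, then print its label,
# wall-clock duration in whole seconds, and exit status.
time_step() {
  local label="$1"
  shift
  local start end status
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "$label: $((end - start))s (exit $status)"
  return "$status"
}

# Example (commented out; DATABASE_URL is an assumed environment variable):
# time_step "hybrid load" psql -d "$DATABASE_URL" -f ./postgres-direct/import-postgres-direct.sql
```

Disk usage, WAL growth and row counts still need to be collected separately, for example with the monitoring queries from the PostgreSQL direct import documentation.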
@@ -0,0 +1,42 @@
1
+ # v2.4.0-beta.3 — Modular PostgreSQL Direct Import Scripts
2
+
3
+ This beta release improves the hybrid PostgreSQL direct import mode with modular SQL generation, transaction mode selection and PostgreSQL monitoring documentation.
4
+
5
+ ## Added
6
+
7
+ - Added `--transaction-mode single|phase|none` to `postgres generate-script`.
8
+ - Added modular SQL output for the PostgreSQL direct import workflow.
9
+ - Added `import-postgres-direct.sql` as an orchestrator script.
10
+ - Added individual phase scripts:
11
+ - `setup.sql`
12
+ - `load-domains.sql`
13
+ - `load-companies.sql`
14
+ - `load-establishments.sql`
15
+ - `load-partners.sql`
16
+ - `load-simples.sql`
17
+ - `materialize.sql`
18
+ - `materialize-secondary-cnaes.sql`
19
+ - `indexes.sql`
20
+ - `analyze.sql`
21
+ - Added manifest metadata for generated steps and dependencies.
22
+ - Added improved `\echo` progress messages in generated scripts.
23
+ - Added `--include <items>` for partial script generation.
24
+ - Added `--skip-indexes` and `--skip-analyze`.
25
+ - Added PostgreSQL monitoring documentation with SQL snippets for:
26
+ - `pg_stat_activity`
27
+ - `pg_stat_progress_copy`
28
+ - locks
29
+ - table sizes
30
+ - estimated table rows
31
+ - PostgreSQL logs on Windows
32
+
33
+ ## Changed
34
+
35
+ - The hybrid direct import path now generates multiple SQL files instead of only one large script.
36
+ - The default orchestrator remains `import-postgres-direct.sql`.
37
+ - The generated manifest now describes transaction mode, included targets, script files and steps.
38
+ - PostgreSQL direct import logs are clearer during long-running loads.
39
+
40
+ ## Notes
41
+
42
+ This is still a beta release. The goal is to validate the modular script layout and transaction modes with full Receita Federal datasets before stabilizing the hybrid path.
package/docs/sanitize.md CHANGED
@@ -4,7 +4,16 @@
4
4
 
5
5
  `sanitize` prepares a clean dataset tree before PostgreSQL import.
6
6
 
7
- It removes known low-level byte issues, especially `0x00` / NUL bytes, from validated dataset files and writes the result to a new output directory. The goal is to reduce slow fallback work during import so PostgreSQL receives cleaner files from the start.
7
+ The command now performs robust text sanitization for Receita Federal files:
8
+
9
+ - reads legacy Receita files using a configurable source encoding;
10
+ - writes sanitized output as clean UTF-8;
11
+ - removes NUL bytes;
12
+ - removes invalid bytes that cannot be safely decoded;
13
+ - removes problematic control characters while preserving line breaks;
14
+ - keeps the original dataset file names and directory structure.
15
+
16
+ This makes the sanitized tree safer for both the standard loader import flow and the hybrid PostgreSQL direct import flow.
8
17
 
9
18
  ## Command
10
19
 
@@ -14,18 +23,19 @@ cnpj-db-loader sanitize <input>
14
23
 
15
24
  ## Options
16
25
 
17
- | Option | Description |
18
- | ------------------ | ------------------------------------------------------------------------- |
19
- | `--output <path>` | Custom output directory for the sanitized dataset tree. |
20
- | `--dataset <name>` | Sanitize only one dataset block, such as `establishments` or `companies`. |
21
- | `-f, --force` | Skip the confirmation prompt. |
26
+ | Option | Description |
27
+ | ------------------------------ | --------------------------------------------------------------------------------------------------------------------- |
28
+ | `--output <path>` | Custom output directory for the sanitized dataset tree. |
29
+ | `--dataset <name>` | Sanitize only one dataset block, such as `establishments` or `companies`. |
30
+ | `--source-encoding <encoding>` | Source file encoding used while reading Receita files. Defaults to `WIN1252`. Supported: `WIN1252`, `LATIN1`, `UTF8`. |
31
+ | `-f, --force` | Skip the confirmation prompt. |
22
32
 
23
33
  ## Default output behavior
24
34
 
25
- - when the validated path is `.../extracted`, the default sanitized output is `.../sanitized`
26
- - otherwise the default output is `<validated-path>-sanitized`
35
+ - when the validated path is `.../extracted`, the default sanitized output is `.../sanitized`;
36
+ - otherwise the default output is `<validated-path>-sanitized`.
27
37
 
28
- ## Recommended flow
38
+ ## Recommended standard flow
29
39
 
30
40
  ```bash
31
41
  cnpj-db-loader inspect ./downloads
@@ -35,15 +45,41 @@ cnpj-db-loader sanitize ./downloads/extracted
35
45
  cnpj-db-loader import ./downloads/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
36
46
  ```
37
47
 
48
+ ## Recommended hybrid PostgreSQL flow
49
+
50
+ Because sanitized files are now written as UTF-8, the direct PostgreSQL script can use `UTF8` as the source encoding.
51
+
52
+ ```bash
53
+ cnpj-db-loader sanitize ./downloads/extracted --output ./downloads/sanitized --force
54
+ cnpj-db-loader postgres generate-script ./downloads/sanitized --output ./downloads/postgres-direct --source-encoding UTF8 --force
55
+ psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/postgres-direct/import-postgres-direct.sql
56
+ ```
57
+
38
58
  ## What it improves
39
59
 
40
- - fewer UTF-8 / NUL-byte related insert failures
41
- - less row-by-row fallback during import
42
- - better import throughput for large datasets
43
- - cleaner quarantine data because known low-level issues are removed earlier
60
+ - fewer encoding-related `COPY` failures;
61
+ - fewer UTF-8 / NUL-byte related insert failures;
62
+ - no invalid bytes in sanitized output;
63
+ - fewer problematic control characters in PostgreSQL input files;
64
+ - less row-by-row fallback during standard import;
65
+ - better throughput for large datasets;
66
+ - cleaner quarantine data because known low-level issues are removed earlier.
67
+
68
+ ## Encoding notes
69
+
70
+ The default source encoding is `WIN1252`, which matches the common legacy encoding used by Receita files.
71
+
72
+ If a source dataset still fails because of undefined Windows-1252 bytes, `LATIN1` can be used as a more permissive decoder:
73
+
74
+ ```bash
75
+ cnpj-db-loader sanitize ./downloads/extracted --source-encoding LATIN1 --output ./downloads/sanitized --force
76
+ ```
77
+
78
+ The output is still UTF-8 in both cases.
44
79
 
45
80
  ## Notes
46
81
 
47
- - `sanitize` does not replace validation; it assumes the dataset tree is already valid
48
- - `import` still keeps quarantine and retry logic for unexpected issues that survive sanitization
49
- - no database schema changes are required to use `sanitize`
82
+ - `sanitize` does not replace validation; it assumes the dataset tree is already valid.
83
+ - `sanitize` preserves file names and relative paths so existing import logic can keep detecting datasets by name.
84
+ - `import` still keeps quarantine and retry logic for unexpected issues that survive sanitization.
85
+ - No database schema changes are required to use `sanitize`.
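The UTF-8 and NUL-byte guarantees described in this file can be spot-checked on a sanitized file before running a long import. This is a minimal sketch that relies only on the standard `iconv`, `tr` and `wc` tools. `check_sanitized_file` is a hypothetical helper, not a loader command, and it does not check for other control characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of cnpj-db-loader.
# Returns 0 when the file is valid UTF-8 and contains no NUL bytes.
check_sanitized_file() {
  local file="$1"

  # iconv exits non-zero when the input is not valid UTF-8.
  if ! iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1; then
    echo "invalid UTF-8: $file" >&2
    return 1
  fi

  # Compare the byte count before and after stripping NUL bytes.
  local total stripped
  total=$(wc -c < "$file")
  stripped=$(tr -d '\000' < "$file" | wc -c)
  if [ "$total" -ne "$stripped" ]; then
    echo "NUL bytes found: $file" >&2
    return 1
  fi
}

# Example (commented out; the file name is a placeholder):
# check_sanitized_file ./downloads/sanitized/some-file
```

A file that fails the first check was probably produced by an older version or never sanitized, in which case the `WIN1252` or `LATIN1` source encoding may still apply to it.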
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@danielarndt0/cnpj-db-loader",
3
- "version": "2.4.0-beta.1",
3
+ "version": "2.4.0-beta.3",
4
4
  "publishConfig": {
5
5
  "access": "public"
6
6
  },
@@ -46,10 +46,10 @@
46
46
  "cli": "node --no-deprecation --import tsx src/cli.ts",
47
47
  "test": "vitest run",
48
48
  "lint": "eslint src",
49
+ "check": "npm run lint && npm run typecheck && npm run build",
49
50
  "format": "prettier . --write",
50
51
  "format:check": "prettier . --check",
51
- "typecheck": "tsc --noEmit",
52
- "prepublishOnly": "npm run lint && npm run typecheck && npm run build"
52
+ "typecheck": "tsc --noEmit"
53
53
  },
54
54
  "dependencies": {
55
55
  "commander": "^12.1.0",
@@ -1,40 +0,0 @@
1
- # v2.4.0 — PostgreSQL Direct Import Workflow
2
-
3
- This release adds a hybrid PostgreSQL direct import workflow.
4
-
5
- The loader can now generate a ready-to-run `psql` script that loads sanitized Receita Federal files directly through `\copy`, converts values inside PostgreSQL and materializes the final schema using set-based SQL.
6
-
7
- The previous CSV export path remains available for audit/debug workflows, but the recommended fast path no longer rewrites the entire dataset into a second CSV tree.
8
-
9
- ## Added
10
-
11
- - Added `postgres generate-script` command.
12
- - Added direct `psql` script generation from sanitized Receita files.
13
- - Added SQL-side raw temporary tables and value conversion for dates, numerics and nullable fields.
14
- - Kept `postgres export-csv` for optional PostgreSQL-ready CSV export with headers, UTF-8 output and normalized values.
15
- - Added generated `import-postgres-direct.sql` script.
16
- - Added generated `manifest.json` for exported files and row counts.
17
- - Added set-based SQL materialization for:
18
- - `companies`
19
- - `establishments`
20
- - `establishment_secondary_cnaes`
21
- - `partners`
22
- - `simples_options`
23
- - Added domain table loading through temporary tables and final upserts.
24
- - Added documentation for the hybrid PostgreSQL workflow.
25
-
26
- ## Purpose
27
-
28
- This workflow is designed for controlled bulk-load scenarios where the standard resumable importer is too slow for local full monthly loads.
29
-
30
- The recommended flow is:
31
-
32
- 1. use the loader for download, extraction, validation and sanitization;
33
- 2. generate a direct `psql` script from the sanitized files;
34
- 3. run the generated `psql` script to load and materialize the database.
35
-
36
- ## Notes
37
-
38
- The standard `import` command remains the safest option when checkpoint-based resume, row quarantine and detailed recovery behavior are required.
39
-
40
- The new PostgreSQL direct workflow is intended for faster controlled imports and benchmarking while keeping extraction, validation and sanitization inside the loader. Value conversion for this path happens inside PostgreSQL to avoid unnecessary full-dataset rewriting.