@danielarndt0/cnpj-db-loader 2.4.0-beta.1 → 2.4.0-beta.2

@@ -21,7 +21,7 @@ The import pipeline now uses:
  - deterministic dataset order to respect foreign keys
  - an exact preparatory scan that counts total source rows and planned batches before the first write
  - streaming file reads to avoid loading the full dataset into RAM
- - an optional sanitize step that removes known low-level byte issues before import starts
+ - an optional sanitize step that writes clean UTF-8 files and removes known low-level byte issues before import starts
  - COPY-based staged writes for the large datasets followed by staged-to-final materialization
  - conflict-safe upserts for the smaller domain datasets
  - `import_plans` and `import_plan_files` to persist exact import plans and avoid recounting the same source files on resume
package/docs/cli.md CHANGED
@@ -12,7 +12,7 @@ cnpj-db-loader federal-revenue sync [reference] [--reference <yyyy-mm>] [--curre
  cnpj-db-loader inspect <input>
  cnpj-db-loader extract <input> [--output <path>]
  cnpj-db-loader validate <input>
- cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [-f]
+ cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [--source-encoding <encoding>] [-f]
  cnpj-db-loader schema print [--profile <profile>]
  cnpj-db-loader schema generate [--name <name>] [--output <path>] [--profile <profile>]
  cnpj-db-loader database config set <url>
package/docs/commands.md CHANGED
@@ -84,5 +84,5 @@ Options:
  - `--output <path>`: directory where manifest and SQL script are generated.
  - `--dataset <dataset>`: generate only one dataset block.
  - `--script-name <name>`: custom generated SQL script name.
- - `--source-encoding <encoding>`: source file encoding for `psql` copy operations. Defaults to `WIN1252`.
+ - `--source-encoding <encoding>`: source file encoding for `psql` copy operations. Defaults to `UTF8`.
  - `-f, --force`: skip confirmation.
@@ -11,7 +11,7 @@ cnpj-db-loader federal-revenue download --output ./downloads
  cnpj-db-loader extract ./downloads/<reference>
  cnpj-db-loader validate ./downloads/<reference>/extracted
  cnpj-db-loader sanitize ./downloads/<reference>/extracted
- cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --source-encoding UTF8 --force
  psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
  ```
 
@@ -56,13 +56,13 @@ cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <da
 
  ### Options
 
- | Option | Description |
- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
- | `--output <path>` | Custom output directory for the generated SQL script and manifest. |
- | `--dataset <dataset>` | Generate a script only for one dataset block. Useful for debugging. |
- | `--script-name <name>` | Name of the generated SQL script. Defaults to `import-postgres-direct.sql`. |
- | `--source-encoding <encoding>` | Source file encoding used by `psql` while reading the sanitized Receita files. Defaults to `WIN1252`. Use `UTF8` only if the files are already UTF-8. |
- | `-f, --force` | Skip the confirmation prompt. |
+ | Option | Description |
+ | ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+ | `--output <path>` | Custom output directory for the generated SQL script and manifest. |
+ | `--dataset <dataset>` | Generate a script only for one dataset block. Useful for debugging. |
+ | `--script-name <name>` | Name of the generated SQL script. Defaults to `import-postgres-direct.sql`. |
+ | `--source-encoding <encoding>` | Source file encoding used by `psql` while reading the sanitized Receita files. Defaults to `UTF8` because the current `sanitize` command writes UTF-8 output. Use `WIN1252` or `LATIN1` only for legacy sanitized files generated by older versions. |
+ | `-f, --force` | Skip the confirmation prompt. |
 
  ## Output structure
 
@@ -74,7 +74,7 @@ postgres-direct/
  import-postgres-direct.sql
  ```
 
- Unlike `postgres export-csv`, this command does not create a second tree of converted CSV files. The generated SQL script points directly to the sanitized Receita files.
+ Unlike `postgres export-csv`, this command does not create a second tree of converted CSV files. The generated SQL script points directly to the sanitized Receita files. Current sanitized files are expected to be clean UTF-8 by default.
 
  This is faster for large monthly loads because it avoids reading and writing the entire dataset again just to add headers or change delimiters.
 
@@ -131,7 +131,7 @@ To compare the standard and hybrid paths:
  cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
 
  # Hybrid path
- cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --source-encoding UTF8 --force
  psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
  ```
 
package/docs/sanitize.md CHANGED
@@ -4,7 +4,16 @@
 
  `sanitize` prepares a clean dataset tree before PostgreSQL import.
 
- It removes known low-level byte issues, especially `0x00` / NUL bytes, from validated dataset files and writes the result to a new output directory. The goal is to reduce slow fallback work during import so PostgreSQL receives cleaner files from the start.
+ The command now performs robust text sanitization for Receita Federal files:
+
+ - reads legacy Receita files using a configurable source encoding;
+ - writes sanitized output as clean UTF-8;
+ - removes NUL bytes;
+ - removes invalid bytes that cannot be safely decoded;
+ - removes problematic control characters while preserving line breaks;
+ - keeps the original dataset file names and directory structure.
+
+ This makes the sanitized tree safer for both the standard loader import flow and the hybrid PostgreSQL direct import flow.
 
  ## Command
 
@@ -14,18 +23,19 @@ cnpj-db-loader sanitize <input>
 
  ## Options
 
- | Option | Description |
- | ------------------ | ------------------------------------------------------------------------- |
- | `--output <path>` | Custom output directory for the sanitized dataset tree. |
- | `--dataset <name>` | Sanitize only one dataset block, such as `establishments` or `companies`. |
- | `-f, --force` | Skip the confirmation prompt. |
+ | Option | Description |
+ | ------------------------------ | --------------------------------------------------------------------------------------------------------------------- |
+ | `--output <path>` | Custom output directory for the sanitized dataset tree. |
+ | `--dataset <name>` | Sanitize only one dataset block, such as `establishments` or `companies`. |
+ | `--source-encoding <encoding>` | Source file encoding used while reading Receita files. Defaults to `WIN1252`. Supported: `WIN1252`, `LATIN1`, `UTF8`. |
+ | `-f, --force` | Skip the confirmation prompt. |
 
  ## Default output behavior
 
- - when the validated path is `.../extracted`, the default sanitized output is `.../sanitized`
- - otherwise the default output is `<validated-path>-sanitized`
+ - when the validated path is `.../extracted`, the default sanitized output is `.../sanitized`;
+ - otherwise the default output is `<validated-path>-sanitized`.
 
- ## Recommended flow
+ ## Recommended standard flow
 
  ```bash
  cnpj-db-loader inspect ./downloads
@@ -35,15 +45,41 @@ cnpj-db-loader sanitize ./downloads/extracted
  cnpj-db-loader import ./downloads/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
  ```
 
+ ## Recommended hybrid PostgreSQL flow
+
+ Because sanitized files are now written as UTF-8, the direct PostgreSQL script can use `UTF8` as the source encoding.
+
+ ```bash
+ cnpj-db-loader sanitize ./downloads/extracted --output ./downloads/sanitized --force
+ cnpj-db-loader postgres generate-script ./downloads/sanitized --output ./downloads/postgres-direct --source-encoding UTF8 --force
+ psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/postgres-direct/import-postgres-direct.sql
+ ```
+
  ## What it improves
 
- - fewer UTF-8 / NUL-byte related insert failures
- - less row-by-row fallback during import
- - better import throughput for large datasets
- - cleaner quarantine data because known low-level issues are removed earlier
+ - fewer encoding-related `COPY` failures;
+ - fewer UTF-8 / NUL-byte related insert failures;
+ - no invalid bytes in sanitized output;
+ - fewer problematic control characters in PostgreSQL input files;
+ - less row-by-row fallback during standard import;
+ - better throughput for large datasets;
+ - cleaner quarantine data because known low-level issues are removed earlier.
+
+ ## Encoding notes
+
+ The default source encoding is `WIN1252`, which matches the common legacy encoding used by Receita files.
+
+ If a source dataset still fails because of undefined Windows-1252 bytes, `LATIN1` can be used as a more permissive decoder:
+
+ ```bash
+ cnpj-db-loader sanitize ./downloads/extracted --source-encoding LATIN1 --output ./downloads/sanitized --force
+ ```
+
+ The output is still UTF-8 in both cases.
 
  ## Notes
 
- - `sanitize` does not replace validation; it assumes the dataset tree is already valid
- - `import` still keeps quarantine and retry logic for unexpected issues that survive sanitization
- - no database schema changes are required to use `sanitize`
+ - `sanitize` does not replace validation; it assumes the dataset tree is already valid.
+ - `sanitize` preserves file names and relative paths so existing import logic can keep detecting datasets by name.
+ - `import` still keeps quarantine and retry logic for unexpected issues that survive sanitization.
+ - no database schema changes are required to use `sanitize`.
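
Reviewer note on the `sanitize.md` changes above: the new bullet list describes decoding legacy bytes, dropping NUL bytes and control characters, and keeping line breaks. The sketch below illustrates that idea for the `LATIN1` case only, where every byte maps to the code point with the same value. It is not the package's implementation: `sanitizeLatin1` is a hypothetical helper, and the choice of which control characters count as "problematic" (here: all C0/C1 controls and DEL, except LF and CR) is an assumption. Writing the returned string to disk as UTF-8 is left to the caller.

```typescript
// Hypothetical helper for illustration; not taken from cnpj-db-loader.
// Decodes bytes as LATIN1 (ISO-8859-1) and drops NUL and other control
// characters while preserving line breaks (LF and CR).
function sanitizeLatin1(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    const byte = bytes[i];
    const isLineBreak = byte === 0x0a || byte === 0x0d;
    // Assumption: "problematic" means C0 controls (0x00-0x1F),
    // DEL (0x7F) and C1 controls (0x80-0x9F).
    const isControl = byte < 0x20 || (byte >= 0x7f && byte <= 0x9f);
    if (isControl && !isLineBreak) {
      continue; // skip NUL bytes and other control characters
    }
    // In LATIN1 the byte value is the Unicode code point.
    out += String.fromCharCode(byte);
  }
  return out;
}

// Bytes for "S", 0xC3 (Ã in LATIN1), "O", a stray NUL, and a line feed.
const sample = new Uint8Array([0x53, 0xc3, 0x4f, 0x00, 0x0a]);
console.log(JSON.stringify(sanitizeLatin1(sample))); // prints "SÃO\n"
```

Because `LATIN1` defines a character for every byte value, this decoder never fails, which is consistent with the docs describing it as the more permissive fallback to `WIN1252`.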
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@danielarndt0/cnpj-db-loader",
- "version": "2.4.0-beta.1",
+ "version": "2.4.0-beta.2",
  "publishConfig": {
  "access": "public"
  },
@@ -46,10 +46,10 @@
  "cli": "node --no-deprecation --import tsx src/cli.ts",
  "test": "vitest run",
  "lint": "eslint src",
+ "check": "npm run lint && npm run typecheck && npm run build",
  "format": "prettier . --write",
  "format:check": "prettier . --check",
- "typecheck": "tsc --noEmit",
- "prepublishOnly": "npm run lint && npm run typecheck && npm run build"
+ "typecheck": "tsc --noEmit"
  },
  "dependencies": {
  "commander": "^12.1.0",
@@ -1,40 +0,0 @@
- # v2.4.0 — PostgreSQL Direct Import Workflow
-
- This release adds a hybrid PostgreSQL direct import workflow.
-
- The loader can now generate a ready-to-run `psql` script that loads sanitized Receita Federal files directly through `\copy`, converts values inside PostgreSQL and materializes the final schema using set-based SQL.
-
- The previous CSV export path remains available for audit/debug workflows, but the recommended fast path no longer rewrites the entire dataset into a second CSV tree.
-
- ## Added
-
- - Added `postgres generate-script` command.
- - Added direct `psql` script generation from sanitized Receita files.
- - Added SQL-side raw temporary tables and value conversion for dates, numerics and nullable fields.
- - Kept `postgres export-csv` for optional PostgreSQL-ready CSV export with headers, UTF-8 output and normalized values.
- - Added generated `import-postgres-direct.sql` script.
- - Added generated `manifest.json` for exported files and row counts.
- - Added set-based SQL materialization for:
- - `companies`
- - `establishments`
- - `establishment_secondary_cnaes`
- - `partners`
- - `simples_options`
- - Added domain table loading through temporary tables and final upserts.
- - Added documentation for the hybrid PostgreSQL workflow.
-
- ## Purpose
-
- This workflow is designed for controlled bulk-load scenarios where the standard resumable importer is too slow for local full monthly loads.
-
- The recommended flow is:
-
- 1. use the loader for download, extraction, validation and sanitization;
- 2. generate a direct `psql` script from the sanitized files;
- 3. run the generated `psql` script to load and materialize the database.
-
- ## Notes
-
- The standard `import` command remains the safest option when checkpoint-based resume, row quarantine and detailed recovery behavior are required.
-
- The new PostgreSQL direct workflow is intended for faster controlled imports and benchmarking while keeping extraction, validation and sanitization inside the loader. Value conversion for this path happens inside PostgreSQL to avoid unnecessary full-dataset rewriting.
- The new PostgreSQL direct workflow is intended for faster controlled imports and benchmarking while keeping extraction, validation and sanitization inside the loader. Value conversion for this path happens inside PostgreSQL to avoid unnecessary full-dataset rewriting.