npm - @danielarndt0/cnpj-db-loader - Versions diffs - 2.4.0-beta.1 → 2.4.0-beta.2 - Mend

@danielarndt0/cnpj-db-loader 2.4.0-beta.1 → 2.4.0-beta.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

package/docs/architecture.md CHANGED

@@ -21,7 +21,7 @@ The import pipeline now uses:
 - deterministic dataset order to respect foreign keys
 - an exact preparatory scan that counts total source rows and planned batches before the first write
 - streaming file reads to avoid loading the full dataset into RAM
-- an optional sanitize step that removes known low-level byte issues before import starts
+- an optional sanitize step that writes clean UTF-8 files and removes known low-level byte issues before import starts
 - COPY-based staged writes for the large datasets followed by staged-to-final materialization
 - conflict-safe upserts for the smaller domain datasets
 - `import_plans` and `import_plan_files` to persist exact import plans and avoid recounting the same source files on resume
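
As a rough illustration of the preparatory scan and streaming reads listed in this hunk, the sketch below counts rows in fixed-size chunks and derives the number of planned batches. It is not the package's implementation: `count_rows`, `planned_batches` and the chunk size are hypothetical names and values, and it assumes one row per line with no newlines inside fields.

```python
import math
from pathlib import Path


def count_rows(path: Path, chunk_size: int = 1 << 20) -> int:
    """Count rows by streaming fixed-size chunks instead of loading the file."""
    rows = 0
    last = b""
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            rows += chunk.count(b"\n")
            last = chunk[-1:]
    # A final row without a trailing newline still counts as a row.
    if last and last != b"\n":
        rows += 1
    return rows


def planned_batches(rows: int, batch_size: int) -> int:
    """Number of batches needed to write `rows` rows, `batch_size` at a time."""
    return math.ceil(rows / batch_size)
```

Persisting the two numbers per file is what lets a resumed run skip the recount, which is the role the hunk assigns to `import_plans` and `import_plan_files`.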

package/docs/cli.md CHANGED

@@ -12,7 +12,7 @@ cnpj-db-loader federal-revenue sync [reference] [--reference <yyyy-mm>] [--curre
 cnpj-db-loader inspect <input>
 cnpj-db-loader extract <input> [--output <path>]
 cnpj-db-loader validate <input>
-cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [-f]
+cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [--source-encoding <encoding>] [-f]
 cnpj-db-loader schema print [--profile <profile>]
 cnpj-db-loader schema generate [--name <name>] [--output <path>] [--profile <profile>]
 cnpj-db-loader database config set <url>

package/docs/commands.md CHANGED

@@ -84,5 +84,5 @@ Options:
 - `--output <path>`: directory where manifest and SQL script are generated.
 - `--dataset <dataset>`: generate only one dataset block.
 - `--script-name <name>`: custom generated SQL script name.
-- `--source-encoding <encoding>`: source file encoding for `psql` copy operations. Defaults to `WIN1252`.
+- `--source-encoding <encoding>`: source file encoding for `psql` copy operations. Defaults to `UTF8`.
 - `-f, --force`: skip confirmation.
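
Since the default flips from `WIN1252` to `UTF8`, sanitized trees produced by older versions may no longer match the default. One way to check a file before generating the script is a streaming UTF-8 validity test. The sketch below is a standalone helper, not something the package ships; `is_valid_utf8` is a hypothetical name.

```python
import codecs
from pathlib import Path


def is_valid_utf8(path: Path, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole file decodes as strict UTF-8."""
    # The incremental decoder handles multi-byte sequences that are
    # split across chunk boundaries.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        with path.open("rb") as handle:
            while chunk := handle.read(chunk_size):
                decoder.decode(chunk)
        # final=True reports a truncated sequence at end of file.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

If this returns `False` for files in an existing sanitized tree, the tree was probably written by an older version and would need an explicit `--source-encoding WIN1252` or `LATIN1`, or a fresh `sanitize` run.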

package/docs/postgres-direct.md CHANGED

@@ -11,7 +11,7 @@ cnpj-db-loader federal-revenue download --output ./downloads
 cnpj-db-loader extract ./downloads/<reference>
 cnpj-db-loader validate ./downloads/<reference>/extracted
 cnpj-db-loader sanitize ./downloads/<reference>/extracted
-cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --source-encoding UTF8 --force
 psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
 ```
@@ -56,13 +56,13 @@ cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <da
 ### Options
-| Option                         | Description                                                                                                                                           |
-| ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `--output <path>`              | Custom output directory for the generated SQL script and manifest.                                                                                    |
-| `--dataset <dataset>`          | Generate a script only for one dataset block. Useful for debugging.                                                                                   |
-| `--script-name <name>`         | Name of the generated SQL script. Defaults to `import-postgres-direct.sql`.                                                                           |
-| `--source-encoding <encoding>` | Source file encoding used by `psql` while reading the sanitized Receita files. Defaults to `WIN1252`. Use `UTF8` only if the files are already UTF-8. |
-| `-f, --force`                  | Skip the confirmation prompt.                                                                                                                         |
+| Option                         | Description                                                                                                                                                                                                                                          |
+| ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `--output <path>`              | Custom output directory for the generated SQL script and manifest.                                                                                                                                                                                   |
+| `--dataset <dataset>`          | Generate a script only for one dataset block. Useful for debugging.                                                                                                                                                                                  |
+| `--script-name <name>`         | Name of the generated SQL script. Defaults to `import-postgres-direct.sql`.                                                                                                                                                                          |
+| `--source-encoding <encoding>` | Source file encoding used by `psql` while reading the sanitized Receita files. Defaults to `UTF8` because the current `sanitize` command writes UTF-8 output. Use `WIN1252` or `LATIN1` only for legacy sanitized files generated by older versions. |
+| `-f, --force`                  | Skip the confirmation prompt.                                                                                                                                                                                                                        |
 ## Output structure
@@ -74,7 +74,7 @@ postgres-direct/
   import-postgres-direct.sql
 ```
-Unlike `postgres export-csv`, this command does not create a second tree of converted CSV files. The generated SQL script points directly to the sanitized Receita files.
+Unlike `postgres export-csv`, this command does not create a second tree of converted CSV files. The generated SQL script points directly to the sanitized Receita files. Current sanitized files are expected to be clean UTF-8 by default.
 This is faster for large monthly loads because it avoids reading and writing the entire dataset again just to add headers or change delimiters.
@@ -131,7 +131,7 @@ To compare the standard and hybrid paths:
 cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
 # Hybrid path
-cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --source-encoding UTF8 --force
 psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
 ```

package/docs/sanitize.md CHANGED

@@ -4,7 +4,16 @@
 `sanitize` prepares a clean dataset tree before PostgreSQL import.
-It removes known low-level byte issues, especially `0x00` / NUL bytes, from validated dataset files and writes the result to a new output directory. The goal is to reduce slow fallback work during import so PostgreSQL receives cleaner files from the start.
+The command now performs robust text sanitization for Receita Federal files:
+- reads legacy Receita files using a configurable source encoding;
+- writes sanitized output as clean UTF-8;
+- removes NUL bytes;
+- removes invalid bytes that cannot be safely decoded;
+- removes problematic control characters while preserving line breaks;
+- keeps the original dataset file names and directory structure.
+This makes the sanitized tree safer for both the standard loader import flow and the hybrid PostgreSQL direct import flow.
 ## Command
@@ -14,18 +23,19 @@ cnpj-db-loader sanitize <input>
 ## Options
-| Option             | Description                                                               |
-| ------------------ | ------------------------------------------------------------------------- |
-| `--output <path>`  | Custom output directory for the sanitized dataset tree.                   |
-| `--dataset <name>` | Sanitize only one dataset block, such as `establishments` or `companies`. |
-| `-f, --force`      | Skip the confirmation prompt.                                             |
+| Option                         | Description                                                                                                           |
+| ------------------------------ | --------------------------------------------------------------------------------------------------------------------- |
+| `--output <path>`              | Custom output directory for the sanitized dataset tree.                                                               |
+| `--dataset <name>`             | Sanitize only one dataset block, such as `establishments` or `companies`.                                             |
+| `--source-encoding <encoding>` | Source file encoding used while reading Receita files. Defaults to `WIN1252`. Supported: `WIN1252`, `LATIN1`, `UTF8`. |
+| `-f, --force`                  | Skip the confirmation prompt.                                                                                         |
 ## Default output behavior
-- when the validated path is `.../extracted`, the default sanitized output is `.../sanitized`
-- otherwise the default output is `<validated-path>-sanitized`
+- when the validated path is `.../extracted`, the default sanitized output is `.../sanitized`;
+- otherwise the default output is `<validated-path>-sanitized`.
-## Recommended flow
+## Recommended standard flow
 ```bash
 cnpj-db-loader inspect ./downloads
@@ -35,15 +45,41 @@ cnpj-db-loader sanitize ./downloads/extracted
 cnpj-db-loader import ./downloads/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
 ```
+## Recommended hybrid PostgreSQL flow
+Because sanitized files are now written as UTF-8, the direct PostgreSQL script can use `UTF8` as the source encoding.
+```bash
+cnpj-db-loader sanitize ./downloads/extracted --output ./downloads/sanitized --force
+cnpj-db-loader postgres generate-script ./downloads/sanitized --output ./downloads/postgres-direct --source-encoding UTF8 --force
+psql -d "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/postgres-direct/import-postgres-direct.sql
+```
 ## What it improves
-- fewer UTF-8 / NUL-byte related insert failures
-- less row-by-row fallback during import
-- better import throughput for large datasets
-- cleaner quarantine data because known low-level issues are removed earlier
+- fewer encoding-related `COPY` failures;
+- fewer UTF-8 / NUL-byte related insert failures;
+- no invalid bytes in sanitized output;
+- fewer problematic control characters in PostgreSQL input files;
+- less row-by-row fallback during standard import;
+- better throughput for large datasets;
+- cleaner quarantine data because known low-level issues are removed earlier.
+## Encoding notes
+The default source encoding is `WIN1252`, which matches the common legacy encoding used by Receita files.
+If a source dataset still fails because of undefined Windows-1252 bytes, `LATIN1` can be used as a more permissive decoder:
+```bash
+cnpj-db-loader sanitize ./downloads/extracted --source-encoding LATIN1 --output ./downloads/sanitized --force
+```
+The output is still UTF-8 in both cases.
 ## Notes
-- `sanitize` does not replace validation; it assumes the dataset tree is already valid
-- `import` still keeps quarantine and retry logic for unexpected issues that survive sanitization
-- no database schema changes are required to use `sanitize`
+- `sanitize` does not replace validation; it assumes the dataset tree is already valid.
+- `sanitize` preserves file names and relative paths so existing import logic can keep detecting datasets by name.
+- `import` still keeps quarantine and retry logic for unexpected issues that survive sanitization.
+- no database schema changes are required to use `sanitize`.
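
The sanitization steps described in this file can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the package's code: `sanitize_bytes` is a hypothetical name, the mapping of `WIN1252`, `LATIN1` and `UTF8` to Python codecs is an assumption, and so is the exact set of control characters treated as problematic (here: C0 controls other than tab, LF and CR, plus DEL).

```python
import re

# Assumed set of "problematic" control characters: NUL and the other C0
# controls except tab (\x09), LF (\x0a) and CR (\x0d), plus DEL (\x7f).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Assumed mapping from the CLI encoding names to Python codec names.
_CODECS = {"WIN1252": "cp1252", "LATIN1": "latin-1", "UTF8": "utf-8"}


def sanitize_bytes(raw: bytes, source_encoding: str = "WIN1252") -> bytes:
    """Decode legacy bytes, drop unsafe content, and re-encode as UTF-8."""
    codec = _CODECS[source_encoding]
    # errors="ignore" drops bytes the source codec cannot decode.
    text = raw.decode(codec, errors="ignore")
    # Remove NUL and other control characters; line breaks are preserved.
    return _CONTROL_CHARS.sub("", text).encode("utf-8")
```

The sketch also shows why `LATIN1` is the more permissive choice: Python's `cp1252` codec leaves five byte values undefined (`0x81`, `0x8D`, `0x8F`, `0x90`, `0x9D`), so they are dropped, whereas `latin-1` maps every byte to a code point and keeps them.

```python
sanitize_bytes(b"A\x00\xc7\xc3O\n")     # "AÇÃO\n" encoded as UTF-8
sanitize_bytes(b"a\x81b")               # b"ab"
sanitize_bytes(b"a\x81b", "LATIN1")     # b"a\xc2\x81b"
```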

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@danielarndt0/cnpj-db-loader",
-  "version": "2.4.0-beta.1",
+  "version": "2.4.0-beta.2",
   "publishConfig": {
     "access": "public"
   },
@@ -46,10 +46,10 @@
     "cli": "node --no-deprecation --import tsx src/cli.ts",
     "test": "vitest run",
     "lint": "eslint src",
+    "check": "npm run lint && npm run typecheck && npm run build",
     "format": "prettier . --write",
     "format:check": "prettier . --check",
-    "typecheck": "tsc --noEmit",
-    "prepublishOnly": "npm run lint && npm run typecheck && npm run build"
+    "typecheck": "tsc --noEmit"
   },
   "dependencies": {
     "commander": "^12.1.0",

package/docs/releases/v2.4.0.md DELETED

@@ -1,40 +0,0 @@
-# v2.4.0 — PostgreSQL Direct Import Workflow
-This release adds a hybrid PostgreSQL direct import workflow.
-The loader can now generate a ready-to-run `psql` script that loads sanitized Receita Federal files directly through `\copy`, converts values inside PostgreSQL and materializes the final schema using set-based SQL.
-The previous CSV export path remains available for audit/debug workflows, but the recommended fast path no longer rewrites the entire dataset into a second CSV tree.
-## Added
-- Added `postgres generate-script` command.
-- Added direct `psql` script generation from sanitized Receita files.
-- Added SQL-side raw temporary tables and value conversion for dates, numerics and nullable fields.
-- Kept `postgres export-csv` for optional PostgreSQL-ready CSV export with headers, UTF-8 output and normalized values.
-- Added generated `import-postgres-direct.sql` script.
-- Added generated `manifest.json` for exported files and row counts.
-- Added set-based SQL materialization for:
-  - `companies`
-  - `establishments`
-  - `establishment_secondary_cnaes`
-  - `partners`
-  - `simples_options`
-- Added domain table loading through temporary tables and final upserts.
-- Added documentation for the hybrid PostgreSQL workflow.
-## Purpose
-This workflow is designed for controlled bulk-load scenarios where the standard resumable importer is too slow for local full monthly loads.
-The recommended flow is:
-1. use the loader for download, extraction, validation and sanitization;
-2. generate a direct `psql` script from the sanitized files;
-3. run the generated `psql` script to load and materialize the database.
-## Notes
-The standard `import` command remains the safest option when checkpoint-based resume, row quarantine and detailed recovery behavior are required.
-The new PostgreSQL direct workflow is intended for faster controlled imports and benchmarking while keeping extraction, validation and sanitization inside the loader. Value conversion for this path happens inside PostgreSQL to avoid unnecessary full-dataset rewriting.