@danielarndt0/cnpj-db-loader 2.3.0 → 2.4.0-beta.1

package/README.md CHANGED
@@ -21,6 +21,7 @@ This version focuses on the real loading workflow:
  - direct final-schema upserts for the smaller domain datasets
  - checkpoint-based resume by file and byte offset
  - row quarantine for invalid or constraint-breaking records without stopping the import
+ - generate a direct `psql` import script that loads sanitized Receita files without rewriting the full dataset into another CSV tree
  - quarantine inspection commands for analyzing rows stored in `import_quarantine`
 
  ## Installation
@@ -48,6 +49,10 @@ cnpj-db-loader sanitize ./downloads/<reference>/extracted
  cnpj-db-loader database config set "postgresql://user:password@localhost:5432/cnpj"
  cnpj-db-loader schema generate --profile full
  cnpj-db-loader import ./downloads/<reference>/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
+
+ # Optional hybrid path for PostgreSQL direct loading
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+ psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
  ```
 
  ## Stable commands
@@ -73,6 +78,8 @@ cnpj-db-loader database cleanup staging [--db-url <url>] [--dataset <name>] [--v
  cnpj-db-loader database cleanup materialized [--db-url <url>] [--dataset <name>] [--force]
  cnpj-db-loader database cleanup checkpoints [--db-url <url>] [--phase <phase>] [--dataset <name>] [--validated-path <path>] [--plan-id <id>] [--force]
  cnpj-db-loader database cleanup plans [--db-url <url>] [--validated-path <path>] [--plan-id <id>] [--force]
+ cnpj-db-loader postgres generate-script <input> [--output <path>] [--dataset <name>] [--script-name <name>] [--source-encoding <encoding>] [-f]
+ cnpj-db-loader postgres export-csv <input> [--output <path>] [--dataset <name>] [--script-name <name>] [-f]
  cnpj-db-loader import <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
  cnpj-db-loader import load <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--verbose-progress] [-f]
  cnpj-db-loader import materialize <input> [--db-url <url>] [--dataset <name>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
@@ -82,6 +89,18 @@ cnpj-db-loader quarantine list [--dataset <name>] [--category <name>] [--stage <
  cnpj-db-loader quarantine show <id> [--db-url <url>]
  ```
 
+ ## PostgreSQL direct import workflow
+
+ For local benchmarks or controlled full loads, the CLI can now generate a direct `psql` import script after sanitization:
+
+ ```bash
+ cnpj-db-loader sanitize ./downloads/<reference>/extracted
+ cnpj-db-loader postgres generate-script ./downloads/<reference>/sanitized --output ./downloads/<reference>/postgres-direct --force
+ psql "postgres://postgres:postgres@localhost:5432/cnpj" -f ./downloads/<reference>/postgres-direct/import-postgres-direct.sql
+ ```
+
+ This path keeps download, extraction, validation and sanitization inside the loader, then lets PostgreSQL load the sanitized Receita files directly through `\copy`, convert values into staging tables and materialize the final tables with set-based SQL. The standard `import` command remains the safest path when checkpoint resume and quarantine recovery are required.
+
  ## Logs
 
  JSON execution logs are written inside the user home directory at `~/.cnpjdbloader/logs`.
@@ -116,5 +135,6 @@ The generated database schema now supports three profiles:
  - [Quarantine](./docs/quarantine.md)
  - [Sanitize](./docs/sanitize.md)
  - [Federal Revenue](./docs/federal-revenue.md)
+ - [PostgreSQL Direct Import](./docs/postgres-direct.md)
 
  - Materialization now stores lightweight staging validation markers (row count and max staging id) in the materialization checkpoint table so reruns can verify the live staging state quickly and reuse lookup reconciliation when the staging snapshot is unchanged. The runtime validates that the required import tables already exist but no longer creates or alters them automatically.
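
The three steps of the new "PostgreSQL direct import workflow" section could be chained in a small wrapper. The following is a minimal sketch, not part of the package: the `run` and `direct_import` function names and the `DRY_RUN` variable are hypothetical, and passing `-v ON_ERROR_STOP=1` to `psql` is an assumption added here (the README does not say whether the generated script already sets it). Only the `cnpj-db-loader` subcommands, flags and the script path shown in the diff are taken from the source.

```bash
#!/usr/bin/env bash
set -euo pipefail

# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# direct_import: hypothetical wrapper around the three documented steps.
#   $1 = reference directory (the "./downloads/<reference>" folder)
#   $2 = PostgreSQL connection URL
direct_import() {
  local ref_dir="$1" db_url="$2"
  local out_dir="$ref_dir/postgres-direct"

  # Step 1: sanitize the extracted Receita files.
  run cnpj-db-loader sanitize "$ref_dir/extracted"

  # Step 2: generate the direct psql import script from the sanitized files.
  run cnpj-db-loader postgres generate-script "$ref_dir/sanitized" --output "$out_dir" --force

  # Step 3: execute the generated script. ON_ERROR_STOP=1 is an assumption:
  # it makes psql abort at the first failing statement.
  run psql "$db_url" -v ON_ERROR_STOP=1 -f "$out_dir/import-postgres-direct.sql"
}
```

Calling `DRY_RUN=1 direct_import "./downloads/<reference>" "<db-url>"` prints the three commands without running them. Stopping at the first error is a deliberate choice in this sketch, because the README notes that checkpoint resume and quarantine recovery belong to the standard `import` command rather than to the direct path.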