@danielarndt0/cnpj-db-loader 2.2.0

package/docs/usage.md ADDED
@@ -0,0 +1,149 @@
1
+ # Usage
2
+
3
+ ## Recommended flow
4
+
5
+ ```bash
6
+ cnpj-db-loader inspect ./downloads
7
+ cnpj-db-loader extract ./downloads
8
+ cnpj-db-loader validate ./downloads/extracted
9
+ cnpj-db-loader sanitize ./downloads/extracted
10
+ cnpj-db-loader database config set "postgresql://user:password@localhost:5432/cnpj"
11
+ cnpj-db-loader schema generate --profile full
12
+ cnpj-db-loader import ./downloads/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
13
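+ # Alternatively, run the load and materialization phases separately instead of the full `import`:
+ cnpj-db-loader import load ./downloads/sanitized --load-batch-size 20000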
14
+ cnpj-db-loader import materialize ./downloads/sanitized --materialize-batch-size 50000
15
+ ```
16
+
17
+ ## Federal Revenue monthly download
18
+
19
+ The `federal-revenue` command group automates only the remote phase, that is, checking and downloading the monthly dataset. It does not replace the existing loader pipeline; `sync` reuses the same extract, validate, sanitize, and import services that the manual flow uses.
20
+
21
+ ```bash
22
+ cnpj-db-loader federal-revenue check
23
+ cnpj-db-loader federal-revenue download --output ./downloads --force
24
+ cnpj-db-loader federal-revenue status --output ./downloads
25
+ cnpj-db-loader federal-revenue retry --output ./downloads --force
26
+ cnpj-db-loader federal-revenue sync --output ./downloads --db-url "postgresql://user:password@localhost:5432/cnpj" --force
27
+ ```
28
+
29
+ By default, the latest published `YYYY-MM` folder is selected from the Federal Revenue public share. Use `--current` to target the current calendar month. To force a specific reference, pass `--reference 2026-05` or use the positional shorthand `federal-revenue check 2026-05`. Downloads are written to `<output>/<reference>`. Completed local files are skipped, in-progress transfers use `.part` files, and local state is stored in `<output>/<reference>/.cnpj-db-loader/federal-revenue/manifest.json`.
30
+
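The reference selection and local layout described above can be illustrated with a small TypeScript sketch. It is not the package's implementation: `latestReference` and `localLayout` are hypothetical helper names, and the `<file>.part` naming is an assumption (the text only says that in-progress transfers use `.part` files).

```typescript
import * as path from "node:path";

// Matches reference folders named YYYY-MM, for example "2026-05".
const REFERENCE_PATTERN = /^\d{4}-(0[1-9]|1[0-2])$/;

// Pick the most recent YYYY-MM name. Zero-padded names sort
// chronologically when compared as plain strings.
function latestReference(folderNames: string[]): string | undefined {
  const references = folderNames
    .filter((name) => REFERENCE_PATTERN.test(name))
    .sort();
  return references.length > 0 ? references[references.length - 1] : undefined;
}

// Build the documented local layout for one reference.
function localLayout(output: string, reference: string) {
  const referenceDir = path.join(output, reference);
  return {
    referenceDir,
    manifestPath: path.join(
      referenceDir,
      ".cnpj-db-loader",
      "federal-revenue",
      "manifest.json",
    ),
    // Assumption: a transfer in progress is stored next to its final name.
    partPath: (fileName: string) => path.join(referenceDir, `${fileName}.part`),
  };
}
```

Calling `localLayout("./downloads", "2026-05")` yields a manifest path under `downloads/2026-05/.cnpj-db-loader/federal-revenue/`, which matches the location given in the paragraph above.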
31
+ ## What each step does
32
+
33
+ | Step | Command | Purpose |
34
+ | ---- | -------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
35
+ | 0 | `federal-revenue check/download/status/retry/clean/sync` | Optionally check, download, inspect, retry, clean, or sync the latest monthly CNPJ files before the local processing flow |
36
+ | 1 | `inspect <input>` | Detect whether the folder contains ZIP archives, extracted content, or both |
37
+ | 2 | `extract <input>` | Extract every Receita ZIP archive into `./extracted` by default |
38
+ | 3 | `validate <input>` | Validate the extracted dataset tree and confirm that the required dataset blocks are present |
39
+ | 4 | `sanitize <input>` | Prepare a sanitized dataset tree by removing known low-level byte issues before import |
40
+ | 5 | `database config show` / `database config set <url>` | Review or configure the PostgreSQL connection |
41
+ | 6 | `schema generate --profile full` | Generate the combined SQL schema with final, control, and staging tables |
42
+ | 7 | `import <input>` | Run the full pipeline: staged/direct load, staged materialization, and final summary generation |
43
+ | 8 | `import load <input>` | Stop after the load phase when you want staging populated without immediately materializing it |
44
+ | 9 | `import materialize <input>` | Resume from the saved plan and materialize staged datasets into the final schema in chunks |
45
+
46
+ ## Schema profiles
47
+
48
+ Use the schema command profile that matches the database shape you want to prepare:
49
+
50
+ - `full`: final tables, import control tables, and staging tables
51
+ - `final`: only the final relational and control tables
52
+ - `staging`: only the lightweight `staging_*` tables used by the staged bulk-load steps before final materialization
53
+
54
+ Examples:
55
+
56
+ ```bash
57
+ cnpj-db-loader schema generate --profile full
58
+ cnpj-db-loader schema generate --profile final
59
+ cnpj-db-loader schema generate --profile staging
60
+ ```
61
+
62
+ ## Important behavior of import
63
+
64
+ `import` is designed to be safe for large datasets. The CLI now also exposes `import load`, `import materialize`, and `database cleanup ...` so the heavy phases and safe reset operations can be automated separately.
65
+
66
+ - it starts with an exact preparatory scan that counts source rows and planned batches when no saved plan exists
67
+ - it persists the import plan in the database and reuses it on resume when the validated source files and batch size match
68
+ - it reads files in streaming mode
69
+ - it loads the large datasets into lightweight staging tables through PostgreSQL COPY, applying only light normalization in the hot path; heavier work is deferred to the materialization stage, which runs in dependency order
70
+ - before each staged dataset is materialized into the final schema, the importer reconciles missing lookup/domain codes only when the current final schema still requires those lookup foreign keys
71
+ - once the file import phase ends, the terminal switches to a dedicated MATERIALIZING stage and the JSONL progress log emits heartbeat entries during long staged-to-final upserts
72
+ - it still upserts the smaller domain datasets directly into the final schema
73
+ - it commits per load unit instead of holding one giant transaction
74
+ - it stores file-load progress in `import_checkpoints`
75
+ - it stores materialization progress in `import_materialization_checkpoints`
76
+ - rows that still fail validation or database constraints are written to `import_quarantine` and skipped
77
+ - if a batch fails, rerunning the same command resumes from the last committed byte offset
78
+ - new import plans truncate the selected staging tables before loading, while resumed plans reuse staged rows that already match saved checkpoints before the final materialization pass runs again
79
+
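The checkpoint-and-resume behavior in the list above can be pictured with a minimal TypeScript sketch. It is an illustration under stated assumptions, not the importer's code: the `Checkpoint` shape is a hypothetical in-memory stand-in for a row in `import_checkpoints`, the file is held in a `Buffer` instead of being streamed, and `commit` stands in for the per-unit database transaction.

```typescript
// Hypothetical in-memory stand-in for one row of `import_checkpoints`.
interface Checkpoint {
  byteOffset: number;
  committedRows: number;
}

// Read up to `batchSize` complete lines starting at `offset`.
// Lines stay as raw bytes so the offset is exact for any text encoding.
function readBatch(data: Buffer, offset: number, batchSize: number) {
  const lines: Buffer[] = [];
  let cursor = offset;
  while (lines.length < batchSize && cursor < data.length) {
    const newline = data.indexOf(0x0a, cursor);
    const end = newline === -1 ? data.length : newline;
    lines.push(data.subarray(cursor, end));
    cursor = newline === -1 ? data.length : newline + 1;
  }
  return { lines, nextOffset: cursor };
}

// Process the data batch by batch. The checkpoint is advanced only after
// `commit` returns, so a failure leaves it at the last committed offset.
function resumeImport(
  data: Buffer,
  checkpoint: Checkpoint,
  batchSize: number,
  commit: (lines: Buffer[]) => void,
): void {
  while (checkpoint.byteOffset < data.length) {
    const { lines, nextOffset } = readBatch(data, checkpoint.byteOffset, batchSize);
    commit(lines);
    checkpoint.byteOffset = nextOffset;
    checkpoint.committedRows += lines.length;
  }
}
```

Because the offset moves only after a successful commit, calling `resumeImport` again with the same checkpoint after a failure continues from the first uncommitted line and does not repeat rows that were already committed.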
80
+ ## Recommended import settings
81
+
82
+ For large first loads, sanitize first and then start with:
83
+
84
+ ```bash
85
+ cnpj-db-loader sanitize ./downloads/extracted
86
+ cnpj-db-loader import ./downloads/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
87
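+ # Alternatively, run the load and materialization phases separately instead of the full `import`:
+ cnpj-db-loader import load ./downloads/sanitized --load-batch-size 20000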
88
+ cnpj-db-loader import materialize ./downloads/sanitized --materialize-batch-size 50000
89
+ ```
90
+
91
+ Increase `--load-batch-size` only after you confirm that your PostgreSQL instance and memory budget can handle larger COPY load units. Use `--materialize-batch-size` to control how many staged rows each materialization chunk processes before saving a materialization checkpoint. The saved import plan keeps the original load batch size used during planning/loading, so changing only `--materialize-batch-size` does not create a new plan; the UI now shows both values separately during resume/materialization runs.
92
+
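To get a feel for how the load batch size affects the number of planned batches, the arithmetic can be sketched in TypeScript. This is only an approximation under an assumption that is not confirmed by the text: each source file is split independently, so a batch never spans two files. `plannedBatches` is a hypothetical helper name.

```typescript
// Estimate the number of load batches for a set of source files.
// Assumption: batches do not span files, so each file rounds up on its own.
function plannedBatches(rowsPerFile: number[], loadBatchSize: number): number {
  if (!Number.isInteger(loadBatchSize) || loadBatchSize <= 0) {
    throw new RangeError("loadBatchSize must be a positive integer");
  }
  return rowsPerFile.reduce(
    (total, rows) => total + Math.ceil(rows / loadBatchSize),
    0,
  );
}
```

With two files of 1,200 and 500 rows, a batch size of 500 gives 4 batches, while a batch size of 20,000 gives 2. Fewer, larger batches mean fewer commits but larger COPY load units, which is why the batch size should grow only after the PostgreSQL limits are validated.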
93
+ ## PostgreSQL and Docker recommendations
94
+
95
+ For a machine with 32 GB RAM, start conservatively:
96
+
97
+ - `shared_buffers = 512MB` to `1GB`
98
+ - `work_mem = 8MB` to `16MB`
99
+ - `maintenance_work_mem = 256MB`
100
+ - make sure Docker Desktop is not over-allocating memory to the container
101
+
102
+ These are starting points, not absolute rules. The safest optimization is still keeping `--load-batch-size` modest until you validate your PostgreSQL limits.
103
+
104
+ ## Import progress visibility
105
+
106
+ `import` writes two kinds of logs inside `~/.cnpjdbloader/logs`:
107
+
108
+ - a final JSON summary log
109
+ - an incremental JSONL progress log for every committed batch, retry fallback, file metrics, dataset metrics, final completion summary, and top-level import failure when execution aborts early
110
+
111
+ Every JSON and JSONL log now carries a structured envelope with `timestamp`, `level`, `severity`, `event`, and `kind`. This makes it easier to separate informational events from warnings and errors in JSON viewers and JSONL extensions.
112
+
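As a sketch of how the envelope can be used outside a viewer, the following TypeScript filters JSONL entries by `level`. Only the five field names come from the text above. The value types and the concrete level strings such as `"warning"` and `"error"` are assumptions, and `parseJsonl` and `filterByLevel` are hypothetical helper names.

```typescript
// Envelope fields named in the documentation. Value types are assumptions.
interface LogEnvelope {
  timestamp: string;
  level: string;
  severity: string | number;
  event: string;
  kind: string;
  [extra: string]: unknown;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): LogEnvelope[] {
  const entries: LogEnvelope[] = [];
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    entries.push(JSON.parse(line) as LogEnvelope);
  }
  return entries;
}

// Keep only entries whose `level` matches one of the requested values.
// Assumption: levels are plain strings; comparison ignores case.
function filterByLevel(entries: LogEnvelope[], levels: string[]): LogEnvelope[] {
  const wanted = new Set(levels.map((level) => level.toLowerCase()));
  return entries.filter((entry) => wanted.has(String(entry.level).toLowerCase()));
}
```

A typical use would be reading a progress log from `~/.cnpjdbloader/logs`, passing its text to `parseJsonl`, and then calling `filterByLevel(entries, ["warning", "error"])`, after checking which level strings the log actually contains.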
113
+ Use `--verbose-progress` when you want a fixed multi-line status block with dataset, file, committed rows, total batches, and file progress while the import is running.
114
+
115
+ The final import summary also includes baseline metrics for preparatory scan time, execution time, insert time, retry time, quarantine time, materialization time, rows per second, and batches per minute.
116
+
117
+ The exact preparatory scan runs only when no saved import plan exists for the same validated source files and batch size. On resume, the importer reuses the saved plan and then uses the checkpoint table to continue from the last committed byte offset instead of restarting the data load. Rows that fail after retries are written to `import_quarantine`, so a few bad rows do not stop the entire dataset. Running `sanitize` first reduces how often the importer has to fall back to those slower recovery paths.
118
+
119
+ ## Quarantine analysis
120
+
121
+ Use the `quarantine` command group after a long-running import when you want to inspect the rows that could not be inserted.
122
+
123
+ ```bash
124
+ cnpj-db-loader quarantine stats
125
+ cnpj-db-loader quarantine list --dataset establishments --limit 20
126
+ cnpj-db-loader quarantine show 42
127
+ ```
128
+
129
+ `quarantine stats` is useful for understanding the scale of a problem by dataset, error category, or error stage.
130
+
131
+ `quarantine list` is useful for paging through rows with filters such as `--retryable`, `--terminal`, `--category`, and `--stage`.
132
+
133
+ `quarantine show` loads one quarantined row in detail, including the raw line and parsed payload when available.
134
+
135
+ ## Database maintenance commands
136
+
137
+ The `database` command family now separates connection configuration from destructive maintenance actions:
138
+
139
+ ```bash
140
+ cnpj-db-loader database config show
141
+ cnpj-db-loader database cleanup staging --validated-path ./downloads/sanitized --force
142
+ cnpj-db-loader database cleanup materialized --dataset companies --force
143
+ cnpj-db-loader database cleanup checkpoints --phase all --validated-path ./downloads/sanitized --force
144
+ cnpj-db-loader database cleanup plans --validated-path ./downloads/sanitized --force
145
+ ```
146
+
147
+ Use `--force` to skip confirmation prompts. Without it, cleanup commands always ask before changing the database.
148
+
149
+ - Materialization now stores lightweight staging validation markers (row count and max staging id) in the materialization checkpoint table so reruns can verify the live staging state quickly and reuse lookup reconciliation when the staging snapshot is unchanged. The runtime validates that the required import tables already exist but no longer creates or alters them automatically.
package/package.json ADDED
@@ -0,0 +1,61 @@
1
+ {
2
+ "name": "@danielarndt0/cnpj-db-loader",
3
+ "version": "2.2.0",
4
+ "publishConfig": {
5
+ "access": "public"
6
+ },
7
+ "description": "Practical CLI for preparing Brazilian Federal Revenue CNPJ open data for PostgreSQL.",
8
+ "author": "Daniel Arndt",
9
+ "license": "MIT",
10
+ "type": "module",
11
+ "main": "./dist/index.js",
12
+ "types": "./dist/index.d.ts",
13
+ "bin": {
14
+ "cnpj-db-loader": "./dist/cli.js",
15
+ "cdl": "./dist/cli.js"
16
+ },
17
+ "exports": {
18
+ ".": {
19
+ "types": "./dist/index.d.ts",
20
+ "import": "./dist/index.js"
21
+ }
22
+ },
23
+ "files": [
24
+ "dist",
25
+ "docs",
26
+ "README.md",
27
+ "LICENSE"
28
+ ],
29
+ "engines": {
30
+ "node": ">=20"
31
+ },
32
+ "scripts": {
33
+ "clean": "rimraf dist",
34
+ "build": "tsup",
35
+ "dev": "tsup --watch",
36
+ "cli": "node --no-deprecation --import tsx src/cli.ts",
37
+ "test": "vitest run",
38
+ "lint": "eslint src",
39
+ "format": "prettier . --write",
40
+ "format:check": "prettier . --check",
41
+ "typecheck": "tsc --noEmit",
42
+ "prepublishOnly": "npm run lint && npm run typecheck && npm run build"
43
+ },
44
+ "dependencies": {
45
+ "commander": "^12.1.0",
46
+ "extract-zip": "^2.0.1",
47
+ "pg": "^8.13.1"
48
+ },
49
+ "devDependencies": {
50
+ "@eslint/js": "^9.20.0",
51
+ "@types/node": "^24.0.0",
52
+ "@types/pg": "^8.11.10",
53
+ "eslint": "^10.1.0",
54
+ "prettier": "^3.5.0",
55
+ "tsup": "^8.4.0",
56
+ "tsx": "^4.21.0",
57
+ "typescript": "^5.8.0",
58
+ "typescript-eslint": "^8.24.0",
59
+ "vitest": "^3.0.0"
60
+ }
61
+ }