@danielarndt0/cnpj-db-loader 2.2.0

@@ -0,0 +1,97 @@
1
+ # Architecture
2
+
3
+ ## What matters in this version
4
+
5
+ The current CLI is centered on one practical job: move Receita Federal CNPJ data from downloaded archives into PostgreSQL safely.
6
+
7
+ ## Main layers
8
+
9
+ | Folder | Purpose |
10
+ | ---------------- | -------------------------------------------------------------- |
11
+ | `src/cli` | Command registration and terminal output |
12
+ | `src/services` | Real application behavior used by the CLI |
13
+ | `src/dictionary` | Dataset layout definitions derived from the Receita dictionary |
14
+ | `src/core` | Shared errors, prompts, and utilities |
15
+ | `src/config` | Local configuration helpers and paths |
16
+
17
+ ## Import design
18
+
19
+ The import pipeline now uses:
20
+
21
+ - deterministic dataset order to respect foreign keys
22
+ - an exact preparatory scan that counts total source rows and planned batches before the first write
23
+ - streaming file reads to avoid loading the full dataset into RAM
24
+ - an optional sanitize step that removes known low-level byte issues before import starts
25
+ - COPY-based staged writes for the large datasets followed by staged-to-final materialization
26
+ - conflict-safe upserts for the smaller domain datasets
27
+ - `import_plans` and `import_plan_files` to persist exact import plans and avoid recounting the same source files on resume
28
+ - `import_checkpoints` to resume a failed load without clearing the whole database
29
+ - `import_materialization_checkpoints` to resume staged-to-final consolidation by dataset and chunk
30
+ - `import_quarantine` to store invalid rows and continue long-running imports
31
+ - a dedicated `quarantine` service to inspect quarantine rows without touching the import pipeline
32
+ - conservative load units to reduce memory pressure and prevent giant rollbacks
33
+ - compatibility with simplified final schemas that keep derived identifiers as regular columns when needed
34
+ - remote Federal Revenue WebDAV checks/downloads plus local manifest, retry, cleanup, status, and sync locking as an additive pre-pipeline service
35
+
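The preparatory scan listed above reduces to simple arithmetic once the row counts are known. The following is a minimal sketch, not the package's planner: `planBatches`, `ScannedFile`, and the file names are hypothetical names chosen for illustration.

```typescript
// Hypothetical shape of one scanned source file; field names are assumptions.
interface ScannedFile {
  path: string;
  rows: number;
}

interface FilePlan extends ScannedFile {
  batches: number;
}

// Compute how many load batches each file needs for a given batch size.
function planBatches(files: ScannedFile[], batchSize: number): FilePlan[] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  return files.map((file) => ({
    ...file,
    batches: Math.ceil(file.rows / batchSize),
  }));
}

const plan = planBatches(
  [
    { path: "companies-0.csv", rows: 1250 },
    { path: "companies-1.csv", rows: 500 },
  ],
  500,
);
const totalRows = plan.reduce((sum, file) => sum + file.rows, 0);
const totalBatches = plan.reduce((sum, file) => sum + file.batches, 0);
console.log(totalRows, totalBatches); // 1750 4
```

Persisting a result like this is what would let a resumed run skip the recount, as the `import_plans` bullet describes.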
36
+ ## Import modules
37
+
38
+ The importer is now split into focused modules so future performance work can replace parts of the pipeline without rewriting the whole command:
39
+
40
+ - `planner`: selects datasets, collects source files, reuses or creates persisted import plans
41
+ - `source-reader`: streams validated files by byte offset for resume-safe reads
42
+ - `parser`: converts raw Receita lines into delimited field arrays
43
+ - `normalizer`: validates field counts and transforms parsed rows into database-ready records
44
+ - `staging-writer`: chooses the current write target and uses COPY for staged bulk loads
45
+ - `materializer`: consolidates staged datasets into the final relational schema with ordered upserts and resumable chunk checkpoints
46
+   - the materializer now reconciles missing lookup/domain codes from staged datasets before final upserts so late foreign-key failures do not stop the consolidation flow on placeholder-compatible domains
47
+   - materialization progress is now exposed explicitly to the CLI progress reporter and to JSONL heartbeat logs so long-running final upserts do not look stalled
48
+ - `finalizer`: centralizes performance tracking and import summary generation
49
+ - `checkpoint-manager`: owns checkpoint resume, persistence, and failed-file markers
50
+ - `quarantine-writer`: stores bad rows without stopping long imports
51
+ - `runner`: orchestrates the current import flow while keeping the service entry point small
52
+
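To make the `parser` and `normalizer` responsibilities above concrete, here is a minimal sketch. It assumes `;` separators and optional double-quoted fields; the function names, the result type, and the two-column layout are hypothetical and do not come from the package.

```typescript
// Split one raw line into fields. Assumes ';' separators and optional
// double-quoted fields where a doubled quote ("") is a literal quote.
function parseLine(line: string, delimiter = ";"): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

type Normalized =
  | { ok: true; record: Record<string, string | null> }
  | { ok: false; reason: string };

// Validate the field count, then map fields to column names (empty -> null).
function normalize(fields: string[], columns: string[]): Normalized {
  if (fields.length !== columns.length) {
    return {
      ok: false,
      reason: `expected ${columns.length} fields, got ${fields.length}`,
    };
  }
  const record: Record<string, string | null> = {};
  columns.forEach((column, index) => {
    const value = (fields[index] ?? "").trim();
    record[column] = value === "" ? null : value;
  });
  return { ok: true, record };
}

const columns = ["code", "description"]; // hypothetical two-column layout
const good = normalize(parseLine('"01";"Example; with delimiter"'), columns);
const bad = normalize(parseLine('"01"'), columns);
console.log(good, bad);
```

In a pipeline shaped like the one described here, a result with `ok: false` is the kind of row that would go to the quarantine writer instead of stopping the import.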
53
+ The project now also generates dedicated staging tables for large datasets. The CLI exposes both a one-shot command (`import`) and split commands (`import load`, `import materialize`). Staging cleanup is handled explicitly through `database cleanup staging`. The write path sends the heavy datasets to staging tables first with only light normalization, then consolidates them into a simplified final schema in dependency order while keeping the smaller catalog datasets on the final schema directly. The final schema now stays closer to the Receita layout so the API can derive richer views later without forcing every first load to pay that cost inside PostgreSQL.
54
+
55
+ ## Staging schema
56
+
57
+ The generated SQL schema supports lightweight `staging_*` tables for the large datasets that now move through the staged bulk-load flow before controlled final materialization.
58
+
59
+ These staging tables are intentionally:
60
+
61
+ - `UNLOGGED` for faster write-heavy workloads
62
+ - free of foreign keys and secondary indexes
63
+ - free of generated columns and upsert-only constraints
64
+ - shaped to mirror the validated dataset rows with minimal insert overhead
65
+ - equipped with `staging_id` so the materializer can checkpoint chunk progress safely
66
+
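The `staging_id` bullet above can be illustrated with a small chunk-window calculation. This is a sketch under the assumption that `staging_id` increases monotonically and that a checkpoint stores the last fully materialized id; `nextChunk` and `listChunks` are hypothetical helpers.

```typescript
interface ChunkRange {
  fromExclusive: number;
  toInclusive: number;
}

// Return the next staging_id window after the last checkpoint, or null when done.
function nextChunk(
  lastStagingId: number,
  maxStagingId: number,
  chunkSize: number,
): ChunkRange | null {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (lastStagingId >= maxStagingId) return null;
  return {
    fromExclusive: lastStagingId,
    toInclusive: Math.min(lastStagingId + chunkSize, maxStagingId),
  };
}

// Walk every remaining chunk, starting from a saved checkpoint.
function listChunks(
  checkpoint: number,
  maxStagingId: number,
  chunkSize: number,
): ChunkRange[] {
  const chunks: ChunkRange[] = [];
  let last = checkpoint;
  let chunk = nextChunk(last, maxStagingId, chunkSize);
  while (chunk !== null) {
    chunks.push(chunk);
    last = chunk.toInclusive; // this is the value a checkpoint would persist
    chunk = nextChunk(last, maxStagingId, chunkSize);
  }
  return chunks;
}

// Resume at id 100000 of 220000 with chunks of 50000 ids.
const chunks = listChunks(100000, 220000, 50000);
console.log(chunks.length); // 3
```

Because each window is defined only by id bounds, a rerun can restart at the last persisted `toInclusive` without rescanning earlier rows.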
67
+ ## Federal Revenue pre-pipeline
68
+
69
+ The Federal Revenue integration is intentionally kept as a pre-pipeline module. It lives under `src/services/federal-revenue` and is exposed by `src/cli/commands/register-federal-revenue.ts`. The module is responsible for:
70
+
71
+ - listing monthly `YYYY-MM` references from the public WebDAV share
72
+ - selecting the latest, current, or explicit monthly reference
73
+ - listing only `.zip` files inside the selected reference
74
+ - downloading files with `.part` temporary files, retry attempts, and skip-on-existing behavior
75
+ - writing a local reference manifest with downloaded, failed, partial, and missing file state
76
+ - exposing `status`, `retry`, and `clean` so automation can inspect and repair local references safely
77
+ - using a local sync lock so two full sync processes do not use the same reference folder at the same time
78
+ - handing the completed download folder to the existing extraction, validation, sanitization, and import services during `federal-revenue sync`
79
+
80
+ Redis, background workers, and schedulers are not part of this CLI module. Those concerns should remain outside the loader if an external runner/orchestrator is added later.
81
+
82
+ ## Current execution flow
83
+
84
+ ```text
85
+ federal-revenue check/download/status/retry/clean/sync -> inspect -> extract -> validate -> sanitize -> db/schema -> import
86
+ ```
87
+
88
+ ## Internal import flow
89
+
90
+ ```text
91
+ planner -> source-reader -> parser -> normalizer -> staging-writer -> materializer -> finalizer
92
+ | |
93
+ +-> checkpoint-manager +-> quarantine-writer
94
+ +-> materialization-checkpoints
95
+ ```
96
+
97
+ - Materialization stores lightweight staging validation markers (row count and maximum `staging_id`) in the materialization checkpoint table. Reruns use them to verify the live staging state quickly and to reuse lookup reconciliation when the staging snapshot is unchanged.
+ - The runtime validates that the required import tables already exist, but it no longer creates or alters them automatically.
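As an illustration of the staging validation markers, the comparison could look like the sketch below. The `StagingMarkers` shape and the `isSnapshotUnchanged` name are assumptions; only the two markers (row count and maximum staging id) come from the text.

```typescript
// Hypothetical shape of the markers saved with a materialization checkpoint.
interface StagingMarkers {
  rowCount: number;
  maxStagingId: number;
}

// Reuse earlier lookup reconciliation only if staging looks untouched.
function isSnapshotUnchanged(
  saved: StagingMarkers | null,
  live: StagingMarkers,
): boolean {
  return (
    saved !== null &&
    saved.rowCount === live.rowCount &&
    saved.maxStagingId === live.maxStagingId
  );
}

const saved: StagingMarkers = { rowCount: 1000000, maxStagingId: 1000000 };
const sameState = isSnapshotUnchanged(saved, { rowCount: 1000000, maxStagingId: 1000000 });
const grownState = isSnapshotUnchanged(saved, { rowCount: 1000500, maxStagingId: 1000500 });
const noCheckpoint = isSnapshotUnchanged(null, saved);
console.log(sameState, grownState, noCheckpoint); // true false false
```

In this sketch the two markers are a cheap check rather than a full comparison: they catch appended or truncated staging rows but would not notice rows edited in place.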
package/docs/cli.md ADDED
@@ -0,0 +1,46 @@
1
+ # CLI
2
+
3
+ ## Public command surface
4
+
5
+ ```bash
6
+ cnpj-db-loader federal-revenue check [reference] [--reference <yyyy-mm>] [--current]
7
+ cnpj-db-loader federal-revenue download [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--retries <number>] [--overwrite] [-f]
8
+ cnpj-db-loader federal-revenue status [reference] [--reference <yyyy-mm>] [--current] [--output <path>]
9
+ cnpj-db-loader federal-revenue retry [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--retries <number>] [--overwrite] [-f]
10
+ cnpj-db-loader federal-revenue clean [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--partials | --failed | --all] [-f]
11
+ cnpj-db-loader federal-revenue sync [reference] [--reference <yyyy-mm>] [--current] [--output <path>] [--extract-output <path>] [--sanitize-output <path>] [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [--force-lock] [-f]
12
+ cnpj-db-loader inspect <input>
13
+ cnpj-db-loader extract <input> [--output <path>]
14
+ cnpj-db-loader validate <input>
15
+ cnpj-db-loader sanitize <input> [--output <path>] [--dataset <name>] [-f]
16
+ cnpj-db-loader schema print [--profile <profile>]
17
+ cnpj-db-loader schema generate [--name <name>] [--output <path>] [--profile <profile>]
18
+ cnpj-db-loader database config set <url>
19
+ cnpj-db-loader database config show
20
+ cnpj-db-loader database config test [--db-url <url>]
21
+ cnpj-db-loader database config reset [--force]
22
+ cnpj-db-loader database cleanup staging [--db-url <url>] [--dataset <name>] [--validated-path <path>] [--force]
23
+ cnpj-db-loader database cleanup materialized [--db-url <url>] [--dataset <name>] [--force]
24
+ cnpj-db-loader database cleanup checkpoints [--db-url <url>] [--phase <phase>] [--dataset <name>] [--validated-path <path>] [--plan-id <id>] [--force]
25
+ cnpj-db-loader database cleanup plans [--db-url <url>] [--validated-path <path>] [--plan-id <id>] [--force]
26
+ cnpj-db-loader import <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
27
+ cnpj-db-loader import load <input> [--db-url <url>] [--dataset <name>] [--load-batch-size <size>] [--verbose-progress] [-f]
28
+ cnpj-db-loader import materialize <input> [--db-url <url>] [--dataset <name>] [--materialize-batch-size <size>] [--verbose-progress] [-f]
29
+ cnpj-db-loader doctor [--input <path>] [--db-url <url>]
30
+ cnpj-db-loader quarantine stats [--dataset <name>] [--category <name>] [--stage <name>] [--retryable] [--terminal]
31
+ cnpj-db-loader quarantine list [--dataset <name>] [--category <name>] [--stage <name>] [--retryable] [--terminal] [--limit <number>] [--after-id <id>]
32
+ cnpj-db-loader quarantine show <id> [--db-url <url>]
33
+ ```
34
+
35
+ ## Design notes
36
+
37
+ - The public CLI stays intentionally small, but the import workflow now exposes split phases for automation.
38
+ - `import` runs the whole pipeline, while `import load` and `import materialize` keep staging and final consolidation independently runnable.
39
+ - Placeholder commands are not exposed.
40
+ - Positional arguments are preferred when they make commands easier to type.
41
+ - Destructive database maintenance actions ask for confirmation unless `--force` is provided.
42
+
43
+ - `federal-revenue` (alias `revenue`) is additive: it only automates the remote monthly CNPJ download phase and then reuses the existing extract, validate, sanitize, and import services.
44
+ - Federal Revenue downloads keep completed files by default and write incomplete transfers as `.part` files until the file is fully validated.
45
+ - `status`, `retry`, and `clean` use the local reference manifest so a future external runner can inspect and resume the workflow without duplicating loader rules.
46
+ - `sync` creates a local lock file to prevent two full sync operations from using the same reference folder at the same time.
@@ -0,0 +1,65 @@
1
+ # Commands reference
2
+
3
+ | Command | Purpose |
4
+ | ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
5
+ | `federal-revenue check` | Check the selected or latest Federal Revenue monthly CNPJ reference and list the remote ZIP files. |
6
+ | `federal-revenue download` | Download the selected Federal Revenue monthly CNPJ ZIP files with retries, `.part` files, and skip-on-existing behavior. |
7
+ | `federal-revenue status` | Read the local Federal Revenue manifest and report downloaded, failed, partial, and missing files. |
8
+ | `federal-revenue retry` | Retry incomplete Federal Revenue files without redownloading completed files. |
9
+ | `federal-revenue clean` | Clean local Federal Revenue `.part` files, failed/partial files, or a whole reference folder. |
10
+ | `federal-revenue sync` | Download, extract, validate, sanitize, and import the selected monthly CNPJ reference using the existing loader pipeline with a local sync lock. |
11
+ | `inspect <input>` | Detect whether the input is zipped, extracted, mixed, or empty. |
12
+ | `extract <input>` | Extract every ZIP archive found inside the input directory. |
13
+ | `validate <input>` | Validate an extracted dataset tree. |
14
+ | `sanitize <input>` | Prepare a sanitized dataset tree before import. |
15
+ | `schema print` | Print a generated PostgreSQL schema profile (`full`, `final`, or `staging`) to stdout. The final profile is simplified for fast first-load materialization. |
16
+ | `schema generate` | Write a generated schema profile to the current working directory by default. |
17
+ | `database config set <url>` | Persist the default PostgreSQL URL. |
18
+ | `database config show` | Show the saved PostgreSQL URL. |
19
+ | `database config test` | Test the connection using the saved or overridden URL. |
20
+ | `database config reset` | Remove the saved PostgreSQL URL after confirmation. |
21
+ | `database cleanup staging` | Truncate staging tables and optionally clear linked materialization checkpoints for a validated path. |
22
+ | `database cleanup materialized` | Truncate simplified final relational tables populated by materialization in safe order for the current schema. |
23
+ | `database cleanup checkpoints` | Clear load checkpoints, materialization checkpoints, or both without truncating staging or final tables. |
24
+ | `database cleanup plans` | Delete saved import plans. Related plan files and materialization checkpoints are removed by database cascade. |
25
+ | `import <input>` | Run the full pipeline: plan, load validated files into staging/direct final targets, materialize staged datasets into final tables, and finalize the import plan. |
26
+ | `import load <input>` | Prepare the plan and run only the load phase. Heavy datasets stop in `staging_*`; domain datasets still upsert directly into the final schema. |
27
+ | `import materialize <input>` | Resume from the saved import plan and materialize staged datasets into the final relational tables with resumable chunks. |
28
+ | `doctor` | Run a quick environment diagnosis. |
29
+ | `quarantine stats` | Show aggregate counts for the `import_quarantine` table. |
30
+ | `quarantine list` | List quarantined rows with optional filters. |
31
+ | `quarantine show` | Show one quarantined row in detail. |
32
+
33
+ ## Examples
34
+
35
+ ```bash
36
+ cnpj-db-loader federal-revenue check
37
+ cnpj-db-loader federal-revenue check 2026-05
38
+ cnpj-db-loader federal-revenue download --output ./downloads --force
39
+ cnpj-db-loader federal-revenue sync --output ./downloads --db-url "postgresql://user:password@localhost:5432/cnpj" --force
40
+ cnpj-db-loader inspect ./downloads
41
+ cnpj-db-loader extract ./downloads
42
+ cnpj-db-loader validate ./downloads/extracted
43
+ cnpj-db-loader sanitize ./downloads/extracted
44
+ cnpj-db-loader schema generate --profile full --name receita-v2 --output ./artifacts/sql
45
+ cnpj-db-loader schema generate --profile staging
46
+ cnpj-db-loader schema print --profile final
47
+ cnpj-db-loader database config set "postgresql://user:password@localhost:5432/cnpj"
48
+ cnpj-db-loader database config test
49
+ cnpj-db-loader database cleanup staging --validated-path ./downloads/sanitized --force
50
+ cnpj-db-loader database cleanup materialized --dataset companies --force
51
+ cnpj-db-loader database cleanup checkpoints --phase materialization --validated-path ./downloads/sanitized --force
52
+ cnpj-db-loader database cleanup plans --validated-path ./downloads/sanitized --force
53
+ cnpj-db-loader import ./downloads/sanitized
54
+ cnpj-db-loader import ./downloads/sanitized --db-url "postgresql://user:password@localhost:5432/cnpj"
55
+ cnpj-db-loader import ./downloads/sanitized --dataset companies --load-batch-size 500
56
+ cnpj-db-loader import load ./downloads/sanitized --load-batch-size 20000
57
+ cnpj-db-loader import materialize ./downloads/sanitized --materialize-batch-size 50000
58
+ cnpj-db-loader database cleanup staging --validated-path ./downloads/sanitized
59
+ cnpj-db-loader import ./downloads/sanitized --force
60
+ cnpj-db-loader quarantine stats
61
+ cnpj-db-loader quarantine stats --dataset establishments --category invalid_utf8_sequence --retryable
62
+ cnpj-db-loader quarantine list --dataset establishments --limit 10
63
+ cnpj-db-loader quarantine list --terminal --after-id 500
64
+ cnpj-db-loader quarantine show 42
65
+ ```
@@ -0,0 +1,224 @@
1
+ # Federal Revenue integration
2
+
3
+ The `federal-revenue` command group automates the remote phase of the workflow: checking and downloading the monthly CNPJ datasets from the Brazilian Federal Revenue public share.
4
+
5
+ This feature is additive. It does not replace the stable local commands. The full `sync` command uses the same internal services already used by the manual flow:
6
+
7
+ ```text
8
+ check/download -> extract -> validate -> sanitize -> import
9
+ ```
10
+
11
+ The command also has the shorter alias `revenue`.
12
+
13
+ ## Commands
14
+
15
+ ```bash
16
+ cnpj-db-loader federal-revenue check
17
+ cnpj-db-loader federal-revenue download --output ./downloads --force
18
+ cnpj-db-loader federal-revenue status --output ./downloads
19
+ cnpj-db-loader federal-revenue retry --output ./downloads --force
20
+ cnpj-db-loader federal-revenue clean --output ./downloads --partials --force
21
+ cnpj-db-loader federal-revenue sync --output ./downloads --db-url "postgresql://user:password@localhost:5432/cnpj" --force
22
+ ```
23
+
24
+ Alias examples:
25
+
26
+ ```bash
27
+ cnpj-db-loader revenue check
28
+ cnpj-db-loader revenue status 2026-05 --output ./downloads
29
+ cnpj-db-loader revenue retry 2026-05 --output ./downloads --force
30
+ ```
31
+
32
+ ## Reference selection
33
+
34
+ By default, remote commands list the public share and select the latest available folder in the `YYYY-MM` format.
35
+
36
+ Use an explicit reference when you need a deterministic month:
37
+
38
+ ```bash
39
+ cnpj-db-loader federal-revenue check --reference 2026-05
40
+ cnpj-db-loader federal-revenue check 2026-05
41
+ cnpj-db-loader federal-revenue download --reference 2026-05 --output ./downloads --force
42
+ cnpj-db-loader federal-revenue download 2026-05 --output ./downloads --force
43
+ ```
44
+
45
+ Use the current calendar month when you want the command to fail if that month has not been published yet:
46
+
47
+ ```bash
48
+ cnpj-db-loader federal-revenue check --current
49
+ ```
50
+
51
+ If an explicit or current reference does not exist in the public share, the command fails before listing or downloading files and reports the latest available reference:
52
+
53
+ ```text
54
+ VALIDATION_ERROR Federal Revenue reference not found: 2026-06. Latest available reference is 2026-05.
55
+ ```
56
+
57
+ Running without `[reference]`, `--reference`, or `--current` keeps the default behavior of selecting the latest published reference.
58
+
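The selection rules in this section can be summarized in one function. This is a sketch only: `selectReference`, its options, and the use of UTC for the "current" month are assumptions, while the not-found message mirrors the example shown above.

```typescript
interface SelectOptions {
  reference?: string; // explicit YYYY-MM
  current?: boolean; // use the current calendar month
  now?: Date; // injectable clock, added here only to make the sketch testable
}

const REFERENCE_PATTERN = /^\d{4}-(0[1-9]|1[0-2])$/;

function selectReference(available: string[], options: SelectOptions = {}): string {
  const published = available.filter((name) => REFERENCE_PATTERN.test(name));
  if (published.length === 0) {
    throw new Error("No monthly references found in the public share.");
  }
  // YYYY-MM strings compare chronologically when compared as plain strings.
  const latest = published.reduce((a, b) => (a > b ? a : b));

  let wanted = options.reference;
  if (wanted === undefined && options.current === true) {
    const now = options.now ?? new Date();
    const month = now.getUTCMonth() + 1;
    wanted = `${now.getUTCFullYear()}-${month < 10 ? "0" : ""}${month}`;
  }
  if (wanted === undefined) return latest; // default: latest published folder

  if (!REFERENCE_PATTERN.test(wanted)) {
    throw new Error(`Invalid Federal Revenue reference: ${wanted}.`);
  }
  if (!published.includes(wanted)) {
    throw new Error(
      `Federal Revenue reference not found: ${wanted}. Latest available reference is ${latest}.`,
    );
  }
  return wanted;
}

const folders = ["2026-03", "2026-05", "2026-04", "tmp"];
console.log(selectReference(folders)); // 2026-05
console.log(selectReference(folders, { reference: "2026-04" })); // 2026-04
```

Failing before any listing or download, as described above, keeps a mistyped month from silently falling back to a different reference.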
59
+ ## Download behavior
60
+
61
+ `download` creates a child directory named with the selected reference inside the configured output root.
62
+
63
+ For example:
64
+
65
+ ```bash
66
+ cnpj-db-loader federal-revenue download --output ./downloads --reference 2026-05 --force
67
+ ```
68
+
69
+ writes files to:
70
+
71
+ ```text
72
+ ./downloads/2026-05
73
+ ```
74
+
75
+ The downloader:
76
+
77
+ - lists `.zip` files from the selected monthly reference
78
+ - skips a completed local file when its size matches the remote size
79
+ - writes incomplete transfers as `<file>.part`
80
+ - retries each failed file before marking it as failed
81
+ - validates local size against the remote WebDAV size when available
82
+ - writes a local manifest for status, retry, clean, and future runner automation
83
+ - redownloads a completed local file only when `--overwrite` is set
84
+
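The skip, overwrite, and `.part` rules listed above could be expressed as the following decision helpers. All names are hypothetical, and the behavior when the remote size is unknown is an assumption, since the text only says sizes are compared "when available".

```typescript
// Remote size may be missing when WebDAV does not report it.
interface RemoteFile {
  name: string;
  size: number | null;
}

type DownloadAction = "skip" | "download";

// Decide what to do with one remote file given the completed local copy, if any.
function decideAction(
  remote: RemoteFile,
  completedLocalSize: number | null,
  overwrite: boolean,
): DownloadAction {
  if (overwrite) return "download";
  if (completedLocalSize === null) return "download"; // nothing downloaded yet
  // Assumption: without a remote size there is nothing to compare, so keep the file.
  if (remote.size === null) return "skip";
  return completedLocalSize === remote.size ? "skip" : "download";
}

// Incomplete transfers are written next to the final file with a .part suffix.
function partPath(finalPath: string): string {
  return `${finalPath}.part`;
}

// A transfer is promoted from .part to its final name only after the size check.
function isTransferValid(remote: RemoteFile, bytesWritten: number): boolean {
  return remote.size === null || bytesWritten === remote.size;
}

const remote: RemoteFile = { name: "file-a.zip", size: 1024 };
console.log(decideAction(remote, 1024, false)); // skip
console.log(decideAction(remote, 512, false)); // download
console.log(partPath("./downloads/2026-05/file-a.zip"));
```

Writing to a `.part` path first means an interrupted transfer is never mistaken for a completed ZIP by later commands.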
85
+ ## Local manifest
86
+
87
+ Each downloaded reference receives a local operational manifest:
88
+
89
+ ```text
90
+ <output>/<reference>/.cnpj-db-loader/federal-revenue/manifest.json
91
+ ```
92
+
93
+ The manifest tracks:
94
+
95
+ - selected reference
96
+ - remote base URL
97
+ - output path
98
+ - file names and paths
99
+ - remote size and local size
100
+ - local status: `downloaded`, `failed`, `partial`, or `missing`
101
+ - last command and last status
102
+ - error message when a file fails
103
+
104
+ This state is intentionally local and file-based. It does not require Redis or PostgreSQL.
105
+
106
+ ## Status
107
+
108
+ Use `status` to inspect the local reference state without starting a new download:
109
+
110
+ ```bash
111
+ cnpj-db-loader federal-revenue status 2026-05 --output ./downloads
112
+ ```
113
+
114
+ The command exits with code `0` when the local reference is complete. It exits with code `1` when the manifest is missing or when at least one file is failed, partial, or missing.
115
+
116
+ ## Retry
117
+
118
+ Use `retry` to download only incomplete files tracked by the manifest:
119
+
120
+ ```bash
121
+ cnpj-db-loader federal-revenue retry 2026-05 --output ./downloads --force
122
+ ```
123
+
124
+ Completed files are kept. Failed, partial, and missing files are retried according to `--retries`.
125
+
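The `status` exit code rule and the `retry` selection rule above both read the same manifest state, so they can be sketched together. The manifest field names below are assumptions based on the "The manifest tracks" list, not the real file layout.

```typescript
type FileStatus = "downloaded" | "failed" | "partial" | "missing";

// Hypothetical manifest shape; the real manifest.json may use other keys.
interface ManifestFile {
  name: string;
  remoteSize: number | null;
  localSize: number | null;
  status: FileStatus;
  error?: string;
}

interface ReferenceManifest {
  reference: string;
  files: ManifestFile[];
}

// 0 when the local reference is complete, 1 when the manifest is missing
// or any file is failed, partial, or missing.
function statusExitCode(manifest: ReferenceManifest | null): 0 | 1 {
  if (manifest === null) return 1;
  return manifest.files.every((file) => file.status === "downloaded") ? 0 : 1;
}

// Only incomplete files are retried; completed files are kept.
function filesToRetry(manifest: ReferenceManifest): string[] {
  return manifest.files
    .filter((file) => file.status !== "downloaded")
    .map((file) => file.name);
}

const manifest: ReferenceManifest = {
  reference: "2026-05",
  files: [
    { name: "file-a.zip", remoteSize: 100, localSize: 100, status: "downloaded" },
    { name: "file-b.zip", remoteSize: 200, localSize: 50, status: "partial" },
    { name: "file-c.zip", remoteSize: 300, localSize: null, status: "failed", error: "timeout" },
  ],
};

console.log(statusExitCode(manifest)); // 1
console.log(filesToRetry(manifest)); // [ 'file-b.zip', 'file-c.zip' ]
```

Note that this sketch treats a manifest with an empty file list as complete, which is an assumption the text does not confirm.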
126
+ ## Clean
127
+
128
+ Use `clean` for local maintenance:
129
+
130
+ ```bash
131
+ cnpj-db-loader federal-revenue clean 2026-05 --output ./downloads --partials --force
132
+ cnpj-db-loader federal-revenue clean 2026-05 --output ./downloads --failed --force
133
+ cnpj-db-loader federal-revenue clean 2026-05 --output ./downloads --all --force
134
+ ```
135
+
136
+ Cleanup modes:
137
+
138
+ | Mode | Behavior |
139
+ | ------------ | ------------------------------------------------------------------------------------- |
140
+ | `--partials` | Removes only `.part` files. |
141
+ | `--failed` | Removes failed and partial files tracked by the manifest, then marks them as missing. |
142
+ | `--all` | Removes the entire local reference folder, including ZIP files and manifest state. |
143
+
144
+ Only one cleanup mode can be used at a time.
145
+
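The "only one cleanup mode" rule above amounts to a small flag check. In this sketch, `resolveCleanMode` is a hypothetical helper; it returns `null` when no mode flag is given, because the text does not state what the CLI does in that case.

```typescript
type CleanMode = "partials" | "failed" | "all";

interface CleanFlags {
  partials?: boolean;
  failed?: boolean;
  all?: boolean;
}

// Behavior summaries copied from the cleanup modes table.
const CLEAN_BEHAVIOR: Record<CleanMode, string> = {
  partials: "Removes only .part files.",
  failed: "Removes failed and partial files tracked by the manifest, then marks them as missing.",
  all: "Removes the entire local reference folder, including ZIP files and manifest state.",
};

// Reject combinations such as --failed --all; return null when no flag is set.
function resolveCleanMode(flags: CleanFlags): CleanMode | null {
  const modes: CleanMode[] = ["partials", "failed", "all"];
  const selected = modes.filter((mode) => flags[mode] === true);
  if (selected.length > 1) {
    const given = selected.map((mode) => `--${mode}`).join(", ");
    throw new Error(`Only one cleanup mode can be used at a time, got: ${given}`);
  }
  return selected[0] ?? null;
}

const mode = resolveCleanMode({ partials: true });
console.log(mode, mode === null ? "" : CLEAN_BEHAVIOR[mode]);
```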
146
+ ## Sync lock
147
+
148
+ `sync` creates a local lock file before running the full pipeline:
149
+
150
+ ```text
151
+ <output>/<reference>/.cnpj-db-loader/federal-revenue/sync.lock
152
+ ```
153
+
154
+ This prevents two sync processes from using the same reference folder at the same time.
155
+
156
+ If a previous process was interrupted and the lock is stale, use `--force-lock` only after confirming that no other sync is running:
157
+
158
+ ```bash
159
+ cnpj-db-loader federal-revenue sync 2026-05 --output ./downloads --force-lock --force
160
+ ```
161
+
162
+ ## Full sync
163
+
164
+ `sync` runs the remote download and then the local loader pipeline:
165
+
166
+ ```bash
167
+ cnpj-db-loader federal-revenue sync \
168
+ --output ./downloads \
169
+ --db-url "postgresql://user:password@localhost:5432/cnpj" \
170
+ --load-batch-size 500 \
171
+ --materialize-batch-size 50000 \
172
+ --verbose-progress \
173
+ --force
174
+ ```
175
+
176
+ Custom output directories can be used when automation needs fixed paths:
177
+
178
+ ```bash
179
+ cnpj-db-loader federal-revenue sync \
180
+ --reference 2026-05 \
181
+ --output ./downloads \
182
+ --extract-output ./work/2026-05/extracted \
183
+ --sanitize-output ./work/2026-05/sanitized \
184
+ --force
185
+ ```
186
+
187
+ ## Exit codes
188
+
189
+ | Command | Exit code `0` | Exit code `1` |
190
+ | ---------- | ---------------------------------------------------- | ------------------------------------------------------------------------ |
191
+ | `check` | Reference and file list were resolved. | Invalid reference, missing remote reference, or WebDAV error. |
192
+ | `download` | Download completed without failed files. | One or more files failed. |
193
+ | `status` | Local manifest exists and every file is downloaded. | Manifest missing or at least one file is failed/partial/missing. |
194
+ | `retry` | Retry finished with no failed/partial/missing files. | At least one file is still failed, partial, or missing. |
195
+ | `clean` | Cleanup completed. | Invalid cleanup mode or invalid reference. |
196
+ | `sync` | Full pipeline completed. | Download, extraction, validation, sanitization, import, or lock failure. |
197
+
198
+ ## Options
199
+
200
+ | Option | Applies to | Purpose |
201
+ | --------------------------------------- | ------------------------------------------------------- | ------------------------------------------------------------------- |
202
+ | `[reference]` / `--reference <yyyy-mm>` | `check`, `download`, `status`, `retry`, `clean`, `sync` | Select a specific monthly reference. |
203
+ | `--current` | `check`, `download`, `status`, `retry`, `clean`, `sync` | Select the current calendar month. |
204
+ | `--output <path>` | `download`, `status`, `retry`, `clean`, `sync` | Download root directory. The reference folder is created inside it. |
205
+ | `--retries <number>` | `download`, `retry`, `sync` | Retry attempts per file. Defaults to 3. |
206
+ | `--overwrite` | `download`, `retry`, `sync` | Redownload files even when a completed local copy already exists. |
207
+ | `--partials` | `clean` | Remove only `.part` files. |
208
+ | `--failed` | `clean` | Remove failed and partial files tracked by the manifest. |
209
+ | `--all` | `clean` | Remove the entire local reference folder. |
210
+ | `--force-lock` | `sync` | Remove an existing sync lock before starting. |
211
+ | `--extract-output <path>` | `sync` | Custom extraction output directory. |
212
+ | `--sanitize-output <path>` | `sync` | Custom sanitized output directory. |
213
+ | `--db-url <url>` | `sync` | Override the saved PostgreSQL URL for the import phase. |
214
+ | `--dataset <dataset>` | `sync` | Restrict the import phase to one dataset. |
215
+ | `--load-batch-size <size>` | `sync` | Import load batch size. |
216
+ | `--materialize-batch-size <size>` | `sync` | Materialization chunk size. |
217
+ | `--verbose-progress` | `sync` | Show detailed import progress. |
218
+ | `--base-url <url>` | `check`, `download`, `retry`, `sync` | Override the WebDAV base URL. |
219
+ | `--share-token <token>` | `check`, `download`, `retry`, `sync` | Override the public share token. |
220
+ | `--force` | `download`, `retry`, `clean`, `sync` | Skip confirmation prompts. |
221
+
222
+ ## Notes
223
+
224
+ This module intentionally does not include Redis, workers, or schedulers. The CLI remains responsible for deterministic one-shot operations and local operational control. A future external runner can call these commands or import the public service functions and add queueing, job-level retries, scheduling, and notifications without making the core loader heavier.
@@ -0,0 +1,40 @@
1
+ # Quarantine
2
+
3
+ The `quarantine` service is a read-only CLI surface for inspecting rows written to the `import_quarantine` table during import.
4
+
5
+ ## Commands
6
+
7
+ | Command | Purpose |
8
+ | ---------------------- | ------------------------------------------------------- |
9
+ | `quarantine stats` | Show totals and grouped counts for quarantine rows. |
10
+ | `quarantine list` | List quarantined rows with optional filters and paging. |
11
+ | `quarantine show <id>` | Show one quarantined row in detail. |
12
+
13
+ ## Supported filters
14
+
15
+ | Option | Commands | Description |
16
+ | ------------------- | ----------------------- | ------------------------------------------------------ |
17
+ | `--db-url <url>` | `stats`, `list`, `show` | Override the persisted PostgreSQL URL. |
18
+ | `--dataset <name>` | `stats`, `list` | Filter rows by dataset name. |
19
+ | `--category <name>` | `stats`, `list` | Filter rows by error category. |
20
+ | `--stage <name>` | `stats`, `list` | Filter rows by error stage. |
21
+ | `--retryable` | `stats`, `list` | Keep only rows marked as retryable. |
22
+ | `--terminal` | `stats`, `list` | Keep only rows marked as terminal. |
23
+ | `--limit <number>` | `list` | Limit the number of returned rows. Defaults to `20`. |
24
+ | `--after-id <id>` | `list` | Return rows strictly after the provided quarantine id. |
25
+
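The `--limit` and `--after-id` options in the table above describe id-based paging. The sketch below simulates it in memory to show how a caller would walk all pages; `listPage` and the row shape are hypothetical, and the real command queries PostgreSQL instead.

```typescript
// Minimal stand-in for a quarantine row; real rows carry more columns.
interface QuarantineRow {
  id: number;
  dataset: string;
}

// Return up to `limit` rows with id strictly greater than `afterId`.
function listPage(rows: QuarantineRow[], afterId = 0, limit = 20): QuarantineRow[] {
  return rows
    .filter((row) => row.id > afterId)
    .sort((a, b) => a.id - b.id)
    .slice(0, limit);
}

// 45 fake rows with ids 1..45.
const rows: QuarantineRow[] = Array.from({ length: 45 }, (_, index) => ({
  id: index + 1,
  dataset: "establishments",
}));

// Walk every page by feeding the last seen id back in as the next afterId.
const pageSizes: number[] = [];
let afterId = 0;
let page = listPage(rows, afterId);
while (page.length > 0) {
  pageSizes.push(page.length);
  afterId = page.reduce((max, row) => Math.max(max, row.id), afterId);
  page = listPage(rows, afterId);
}
console.log(pageSizes); // [ 20, 20, 5 ]
```

Paging by the last seen id instead of an offset keeps each request stable even if new rows are quarantined while a long import is still running.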
26
+ ## Examples
27
+
28
+ ```bash
29
+ cnpj-db-loader quarantine stats
30
+ cnpj-db-loader quarantine stats --dataset establishments --category invalid_utf8_sequence --retryable
31
+ cnpj-db-loader quarantine list --dataset establishments --limit 10
32
+ cnpj-db-loader quarantine list --terminal --after-id 500
33
+ cnpj-db-loader quarantine show 42
34
+ ```
35
+
36
+ ## Notes
37
+
38
+ - `quarantine` is intentionally read-only. It does not retry or mutate quarantined rows.
39
+ - The service automatically ensures that the `import_quarantine` table and its newer columns exist before querying.
40
+ - A future replay/recovery command can reuse the same filters to target retryable or terminal rows.
@@ -0,0 +1,49 @@
1
+ # Sanitize
2
+
3
+ ## Purpose
4
+
5
+ `sanitize` prepares a clean dataset tree before PostgreSQL import.
6
+
7
+ It removes known low-level byte issues, especially `0x00` / NUL bytes, from validated dataset files and writes the result to a new output directory. The goal is to reduce slow fallback work during import so PostgreSQL receives cleaner files from the start.
8
+
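As an illustration of the NUL-byte cleanup described above, the core transformation can be written as a per-chunk filter. This is a sketch, not the package's sanitizer: `stripNulBytes` is a hypothetical name, and the real step may handle other byte issues as well.

```typescript
// Remove 0x00 bytes from one chunk and report how many were dropped.
function stripNulBytes(chunk: Uint8Array): { clean: Uint8Array; removed: number } {
  const clean = chunk.filter((byte) => byte !== 0x00);
  return { clean, removed: chunk.length - clean.length };
}

// "A", NUL, "B", NUL, "C"
const input = Uint8Array.from([0x41, 0x00, 0x42, 0x00, 0x43]);
const result = stripNulBytes(input);
console.log(Array.from(result.clean), result.removed); // [ 65, 66, 67 ] 2
```

Because dropping a NUL byte never depends on neighboring bytes, the same function can be applied to each chunk of a streamed file without carrying state across chunk boundaries.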
9
+ ## Command
10
+
11
+ ```bash
12
+ cnpj-db-loader sanitize <input>
13
+ ```
14
+
15
+ ## Options
16
+
17
+ | Option | Description |
18
+ | ------------------ | ------------------------------------------------------------------------- |
19
+ | `--output <path>` | Custom output directory for the sanitized dataset tree. |
20
+ | `--dataset <name>` | Sanitize only one dataset block, such as `establishments` or `companies`. |
21
+ | `-f, --force` | Skip the confirmation prompt. |
22
+
23
+ ## Default output behavior
24
+
25
+ - when the validated path is `.../extracted`, the default sanitized output is `.../sanitized`
26
+ - otherwise the default output is `<validated-path>-sanitized`
27
+
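The two default-output rules above can be captured in a small path helper. This is a sketch: `defaultSanitizedOutput` is a hypothetical name, and stripping trailing separators before matching is an assumption.

```typescript
// Derive the default sanitized output directory from the validated path.
function defaultSanitizedOutput(validatedPath: string): string {
  // Assumption: trailing separators are ignored before matching.
  const trimmed = validatedPath.replace(/[\\/]+$/, "");
  // Rule 1: a path whose last segment is "extracted" maps to a sibling "sanitized".
  const match = /^(.*[\\/])?extracted$/.exec(trimmed);
  if (match) return `${match[1] ?? ""}sanitized`;
  // Rule 2: any other path gets a "-sanitized" suffix.
  return `${trimmed}-sanitized`;
}

console.log(defaultSanitizedOutput("./downloads/extracted")); // ./downloads/sanitized
console.log(defaultSanitizedOutput("./data/2026-05")); // ./data/2026-05-sanitized
```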
28
+ ## Recommended flow
29
+
30
+ ```bash
31
+ cnpj-db-loader inspect ./downloads
32
+ cnpj-db-loader extract ./downloads
33
+ cnpj-db-loader validate ./downloads/extracted
34
+ cnpj-db-loader sanitize ./downloads/extracted
35
+ cnpj-db-loader import ./downloads/sanitized --load-batch-size 500 --materialize-batch-size 50000 --verbose-progress
36
+ ```
37
+
38
+ ## What it improves
39
+
40
+ - fewer UTF-8 / NUL-byte related insert failures
41
+ - less row-by-row fallback during import
42
+ - better import throughput for large datasets
43
+ - cleaner quarantine data because known low-level issues are removed earlier
44
+
45
+ ## Notes
46
+
47
+ - `sanitize` does not replace validation; it assumes the dataset tree is already valid
48
+ - `import` still keeps quarantine and retry logic for unexpected issues that survive sanitization
49
+ - no database schema changes are required to use `sanitize`