PyPI - dumpling-cli - Versions diffs - 0.7.0a0__tar.gz → 0.7.0b0__tar.gz - Mend

dumpling-cli 0.7.0a0 (tar.gz) → 0.7.0b0 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (52)

{dumpling_cli-0.7.0a0 → dumpling_cli-0.7.0b0}/AGENTS.md RENAMED

@@ -28,6 +28,9 @@ src/
   filter.rs      — Row-filter predicate evaluation (eq/neq/like/regex/JSON-path/…)
   scan.rs        — Post-transform residual PII scanner (email/SSN/PAN/token regex)
   report.rs      — JSON report data structures and Reporter helper
+  compressed_input.rs — gzip/ZIP wrappers; streaming vs temp materialization
+  dump_input_resolve.rs — shared `--input` file resolution for anonymize + scaffold-config
+  dump_input_detect.rs — PGDMP / directory dumps / MSSQL sniff helpers
 docs/src/        — mdBook documentation source
 .github/         — CI/CD GitHub Actions workflows
 Cargo.toml       — Rust package manifest
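
The new `dump_input_detect.rs` entry mentions sniffing `PGDMP` files and directory dumps. The diff does not show that module, so the following is only a minimal sketch of how such detection could work using magic bytes. `InputKind`, `sniff_prefix`, and `sniff_path` are hypothetical names, not Dumpling's API, and the MSSQL sniffing mentioned in the hunk is left out because its rules are not described here.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Input kinds mentioned in the hunk above. The enum itself is hypothetical.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum InputKind {
    /// PostgreSQL custom-format archive: file begins with the ASCII bytes "PGDMP".
    PgCustom,
    /// PostgreSQL directory-format dump: a folder that contains `toc.dat`.
    PgDirectory,
    /// Gzip stream: magic bytes 0x1f 0x8b.
    Gzip,
    /// ZIP archive: local file header signature "PK\x03\x04".
    Zip,
    /// Anything else is treated as plain SQL text in this sketch.
    PlainSql,
}

/// Classify a byte prefix by its magic number.
pub fn sniff_prefix(prefix: &[u8]) -> InputKind {
    if prefix.starts_with(b"PGDMP") {
        InputKind::PgCustom
    } else if prefix.starts_with(&[0x1f, 0x8b]) {
        InputKind::Gzip
    } else if prefix.starts_with(b"PK\x03\x04") {
        InputKind::Zip
    } else {
        InputKind::PlainSql
    }
}

/// Classify a filesystem path: directories are checked for `toc.dat`,
/// regular files by their first few bytes.
pub fn sniff_path(path: &Path) -> io::Result<InputKind> {
    if path.is_dir() {
        return if path.join("toc.dat").is_file() {
            Ok(InputKind::PgDirectory)
        } else {
            Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "directory input without toc.dat",
            ))
        };
    }
    // Read at most 8 bytes; short files simply yield a short prefix.
    let mut prefix = Vec::with_capacity(8);
    File::open(path)?.take(8).read_to_end(&mut prefix)?;
    Ok(sniff_prefix(&prefix))
}
```

Keeping the prefix classifier separate from the filesystem access makes it easy to reuse on an already-decompressed stream, for example to look inside a gzip wrapper.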

{dumpling_cli-0.7.0a0 → dumpling_cli-0.7.0b0}/CHANGELOG.md RENAMED

@@ -7,9 +7,21 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht
 ## [Unreleased]
+## [0.7.0-beta] - 2026-05-07
+Second **0.7.x** prerelease toward stable **0.7.0**.
+### Added
+- **Gzip and ZIP inputs**: plain-SQL payloads inside **gzip** are decompressed **in-process** (streamed) when possible—no temporary file. Dumpling still materializes to the temp directory when required: **ZIP** archives (random-access central directory), **gzip wrapping `PGDMP`** or an inner **ZIP** (nested wrappers), or other cases where a filesystem path is needed for `pg_restore`. Temporary files are registered for removal when processing finishes. **`--in-place` is rejected** only when Dumpling had to write a **temporary** decompressed/extracted file (not when gzip plain-SQL streaming was used). Full multi-file ZIP packages (for example BACPAC) are still not supported as SQL input.
+### Changed
+- **CLI**: removed **`--dump-decode`**. PostgreSQL **custom-format** (`PGDMP`) and **directory-format** (`toc.dat`) inputs are auto-detected when `--format postgres` (default) and decoded with `pg_restore -f -`. Options renamed: **`--pg-restore-arg`** (repeatable, was `--dump-decode-arg`), **`--keep-original`** (was `--dump-decode-keep-input` / `--pg-restore-keep-input`). **`--keep-original` is incompatible with `--in-place`** (use `--output` or stdout). Optional `[pg_restore]` table in config for default path/args.
 ## [0.7.0-alpha] - 2026-05-04
-Pre-release toward **0.7.0** (stable **0.7.0** is not published yet; crates use the **0.7.0-alpha** prerelease identifier until then).
+First **0.7.x** prerelease toward stable **0.7.0** (superseded by **0.7.0-beta** for ongoing development builds).
 ### Removed
@@ -113,6 +125,8 @@ Pre-release toward **0.7.0** (stable **0.7.0** is not published yet; crates use
 - Configurable output scan severities and per-category thresholds via `[output_scan]`.
 - JSON report section for output scan findings including category, count, threshold, severity, and sample locations.
+[0.7.0-beta]: https://github.com/ababic/dumpling/compare/v0.7.0-alpha...v0.7.0-beta
+[0.7.0-alpha]: https://github.com/ababic/dumpling/compare/v0.6.0...v0.7.0-alpha
 [0.6.0]: https://github.com/ababic/dumpling/compare/v0.5.0...v0.6.0
 [0.5.0]: https://github.com/ababic/dumpling/compare/v0.4.3...v0.5.0
 [0.4.3]: https://github.com/ababic/dumpling/compare/v0.4.2...v0.4.3
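
The 0.7.0-beta "Added" entry above describes when a compressed input is streamed and when it is written to a temporary file, and ties the `--in-place` rejection to the second case. A rough sketch of that decision table follows. The type and function names are invented for illustration, and the real logic in `compressed_input.rs` may be organized differently.

```rust
/// What sits inside a gzip wrapper (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Payload {
    PlainSql,
    PgCustom, // payload begins with "PGDMP"
    Zip,      // nested ZIP archive
}

/// Outer wrapper of the `--input` file (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Wrapper {
    Unwrapped,
    Gzip(Payload),
    Zip,
}

/// How the input reaches the SQL pipeline.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Plan {
    ReadDirect,
    StreamGzip,      // decompress in-process, no temporary file
    MaterializeTemp, // extract to the temp directory, remove afterwards
}

/// Map a wrapper to a plan, following the changelog wording:
/// only gzip around plain SQL is streamed.
pub fn plan_for(wrapper: Wrapper) -> Plan {
    match wrapper {
        Wrapper::Unwrapped => Plan::ReadDirect,
        Wrapper::Gzip(Payload::PlainSql) => Plan::StreamGzip,
        // pg_restore needs a filesystem path, and ZIP needs random access.
        Wrapper::Gzip(Payload::PgCustom) | Wrapper::Gzip(Payload::Zip) => Plan::MaterializeTemp,
        Wrapper::Zip => Plan::MaterializeTemp,
    }
}

/// Reject `--in-place` only when a temporary file had to be written.
pub fn check_in_place(plan: Plan, in_place: bool) -> Result<(), String> {
    if in_place && plan == Plan::MaterializeTemp {
        Err("--in-place cannot be used with a temporary extracted input".to_string())
    } else {
        Ok(())
    }
}
```

This covers only the compression rule from the changelog entry. The README hunks later in this diff add a second rejection, for archives that must go through `pg_restore`.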

{dumpling_cli-0.7.0a0 → dumpling_cli-0.7.0b0}/Cargo.lock RENAMED

@@ -2,6 +2,12 @@
 # It is not intended for manual editing.
 version = 4
+[[package]]
+name = "adler2"
+version = "2.0.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa"
 [[package]]
 name = "aho-corasick"
 version = "1.1.4"
@@ -76,6 +82,15 @@ version = "1.0.102"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "7f202df86484c868dbad7eaa557ef785d5c66295e41b460ef922eca0723b842c"
+[[package]]
+name = "arbitrary"
+version = "1.4.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c3d036a3c4ab069c7b410a2ce876bd74808d2d0888a82667669f8e783a898bf1"
+dependencies = [
+ "derive_arbitrary",
+]
 [[package]]
 name = "autocfg"
 version = "1.5.0"
@@ -187,6 +202,21 @@ dependencies = [
  "libc",
 ]
+[[package]]
+name = "crc32fast"
+version = "1.5.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9481c1c90cbf2ac953f07c8d4a58aa3945c425b7185c9154d67a65e4230da511"
+dependencies = [
+ "cfg-if",
+]
+[[package]]
+name = "crossbeam-utils"
+version = "0.8.21"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d0a5c400df2834b80a4c3327b3aad3a4c4cd4de0629063962b03235697506a28"
 [[package]]
 name = "crypto-common"
 version = "0.1.7"
@@ -231,6 +261,17 @@ dependencies = [
  "syn",
 ]
+[[package]]
+name = "derive_arbitrary"
+version = "1.4.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1e567bd82dcff979e4b03460c307b3cdc9e96fde3d73bed1496d2bc75d9dd62a"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
 [[package]]
 name = "deunicode"
 version = "1.6.2"
@@ -248,6 +289,17 @@ dependencies = [
  "subtle",
 ]
+[[package]]
+name = "displaydoc"
+version = "0.2.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "97369cbbc041bc366949bc74d34658d6cda5621039731c6310521892a3a20ae0"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
 [[package]]
 name = "dummy"
 version = "0.11.0"
@@ -262,12 +314,13 @@ dependencies = [
 [[package]]
 name = "dumpling"
-version = "0.7.0-alpha"
+version = "0.7.0-beta"
 dependencies = [
  "anyhow",
  "chrono",
  "clap",
  "fake",
+ "flate2",
  "getrandom 0.2.17",
  "hmac",
  "lazy_static",
@@ -276,8 +329,9 @@ dependencies = [
  "serde",
  "serde_json",
  "sha2",
- "thiserror",
+ "thiserror 1.0.69",
  "toml",
+ "zip",
 ]
 [[package]]
@@ -310,6 +364,16 @@ version = "0.1.9"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "5baebc0774151f905a1a2cc41989300b1e6fbb29aff0ceffa1064fdd3088d582"
+[[package]]
+name = "flate2"
+version = "1.1.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "843fba2746e448b37e26a819579957415c8cef339bf08564fe8b7ddbd959573c"
+dependencies = [
+ "crc32fast",
+ "miniz_oxide",
+]
 [[package]]
 name = "fnv"
 version = "1.0.7"
@@ -456,6 +520,16 @@ version = "2.8.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79"
+[[package]]
+name = "miniz_oxide"
+version = "0.8.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1fa76a2c86f704bdb222d66965fb3d63269ce38518b83cb0575fca855ebb6316"
+dependencies = [
+ "adler2",
+ "simd-adler32",
+]
 [[package]]
 name = "num-traits"
 version = "0.2.19"
@@ -643,6 +717,12 @@ version = "1.3.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64"
+[[package]]
+name = "simd-adler32"
+version = "0.3.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "703d5c7ef118737c72f1af64ad2f6f8c5e1921f818cdcb97b8fe6fc69bf66214"
 [[package]]
 name = "strsim"
 version = "0.11.1"
@@ -672,7 +752,16 @@ version = "1.0.69"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "b6aaf5339b578ea85b50e080feb250a3e8ae8cfcdff9a461c9ec2904bc923f52"
 dependencies = [
- "thiserror-impl",
+ "thiserror-impl 1.0.69",
+]
+[[package]]
+name = "thiserror"
+version = "2.0.18"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "4288b5bcbc7920c07a1149a35cf9590a2aa808e0bc1eafaade0b80947865fbc4"
+dependencies = [
+ "thiserror-impl 2.0.18",
 ]
 [[package]]
@@ -686,6 +775,17 @@ dependencies = [
  "syn",
 ]
+[[package]]
+name = "thiserror-impl"
+version = "2.0.18"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ebc4ee7f67670e9b64d05fa4253e753e016c6c95ff35b89b7941d6b856dec1d5"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
 [[package]]
 name = "toml"
 version = "0.8.23"
@@ -914,8 +1014,37 @@ dependencies = [
  "syn",
 ]
+[[package]]
+name = "zip"
+version = "2.4.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "fabe6324e908f85a1c52063ce7aa26b68dcb7eb6dbc83a2d148403c9bc3eba50"
+dependencies = [
+ "arbitrary",
+ "crc32fast",
+ "crossbeam-utils",
+ "displaydoc",
+ "flate2",
+ "indexmap",
+ "memchr",
+ "thiserror 2.0.18",
+ "zopfli",
+]
 [[package]]
 name = "zmij"
 version = "1.0.21"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "b8848ee67ecc8aedbaf3e4122217aff892639231befc6a1b58d29fff4c2cabaa"
+[[package]]
+name = "zopfli"
+version = "0.8.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f05cd8797d63865425ff89b5c4a48804f35ba0ce8d125800027ad6017d2b5249"
+dependencies = [
+ "bumpalo",
+ "crc32fast",
+ "log",
+ "simd-adler32",
+]

{dumpling_cli-0.7.0a0 → dumpling_cli-0.7.0b0}/Cargo.toml RENAMED

@@ -1,7 +1,9 @@
 [package]
 name = "dumpling"
-version = "0.7.0-alpha"
+version = "0.7.0-beta"
 edition = "2021"
+license = "MIT"
+authors = ["Andy Babic"]
 readme = "README.md"
 [dependencies]
@@ -19,3 +21,5 @@ chrono = { version = "0.4" }
 lazy_static = "1"
 fake = { version = "4", features = ["derive"] }
 rand = "0.9"
+flate2 = "1"
+zip = { version = "2", default-features = false, features = ["deflate"] }

dumpling_cli-0.7.0b0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Andy Babic
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

{dumpling_cli-0.7.0a0 → dumpling_cli-0.7.0b0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: dumpling-cli
-Version: 0.7.0a0
+Version: 0.7.0b0
 Classifier: Development Status :: 4 - Beta
 Classifier: Environment :: Console
 Classifier: Intended Audience :: Developers
@@ -14,8 +14,11 @@ Classifier: Topic :: Database
 Classifier: Topic :: Security
 Classifier: Topic :: Software Development :: Libraries
 Classifier: Topic :: Utilities
+License-File: LICENSE
 Summary: Static anonymizer for plain SQL dumps (PostgreSQL, SQLite, SQL Server).
 Keywords: postgres,sqlite,mssql,sql,anonymization,cli,rust
+Author: Andy Babic
+License-Expression: MIT
 Requires-Python: >=3.8
 Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
@@ -101,12 +104,11 @@ dumpling --help
 Follow these steps once; you will have a working path from “raw dump” to “first sanitized output,” then you can deepen coverage using the rest of this README and the [documentation site](https://ababic.github.io/dumpling/).
-1. **Start from the example policy** — Copy [`.dumplingconf.example`](.dumplingconf.example) to `.dumplingconf` in your project root (or merge the same keys under `[tool.dumpling]` in `pyproject.toml`). Set environment variables for `salt` and any `${…}` references so Dumpling can resolve secrets at startup.
-2. **Name your tables and columns** — Open your dump next to the config. `CREATE TABLE`, `COPY … (…)` and `INSERT INTO … (…)` lines list the identifiers you need for `[rules."table"]` or `[rules."schema.table"]` (see [Configuration (TOML)](#configuration-toml) below). Trim the example rules down to the tables you care about first, then add columns and strategies as you go.
-3. **Run Dumpling** — `dumpling -i dump.sql -o sanitized.sql` (add `-c path` if the config is not in the default search path). Use `dumpling --check -i dump.sql` when you only want to know whether anything would change.
-4. **Tighten the policy** — Run `dumpling lint-policy` on your config. When you are ready for stricter gates, add `[sensitive_columns]` and use `--strict-coverage` / `--report` / `--scan-output` as described under [Usage](#usage).
-**Draft policy generation (planned)** — A future command will stream a dump and emit a **draft** starter TOML so you spend less time hunting table and column names and basic DDL hints (for example `varchar(N)` lengths). Output will be explicitly **draft**: always review and edit before production or compliance workflows; it is a time-saver, not a full policy.
+1. **Generate a draft policy (recommended)** — Run `dumpling scaffold-config -i dump.sql -o .dumplingconf` to emit a **beta** starter TOML with inferred `[rules]` from column names in `CREATE TABLE`, `INSERT`, and (PostgreSQL) `COPY` headers. Heuristics are **English-oriented**; treat the file as **draft only**—review every rule before production or compliance workflows. Add a global `salt` (for example `salt = "${DUMPLING_SALT}"`) and resolve `${…}` references before anonymizing. Optionally pass **`--infer-json-paths`** to sample up to **five rows per table** (reservoir) and suggest nested JSON keys as `column.path.to.leaf`; use **`--max-json-depth`** if you need a different walk depth (default 24). For PostgreSQL **custom-format** or **directory-format** archives, pass **`--input`** pointing at the archive with **`--format postgres`** (default); Dumpling auto-detects and runs **`pg_restore`** (optional **`--pg-restore-path`** / **`--pg-restore-arg`**). See `dumpling scaffold-config --help`.
+2. **Or start from the example policy** — Copy [`.dumplingconf.example`](.dumplingconf.example) to `.dumplingconf` (or merge under `[tool.dumpling]` in `pyproject.toml`) and edit `[rules]` by hand. Set environment variables for `salt` and any `${…}` references so Dumpling can resolve secrets at startup.
+3. **Align rules with your dump** — If you did not use `scaffold-config`, open the dump beside the config: `CREATE TABLE`, `COPY … (…)`, and `INSERT INTO … (…)` lines list identifiers for `[rules."table"]` or `[rules."schema.table"]` (see [Configuration (TOML)](#configuration-toml)). Trim rules to the tables you care about first, then extend columns and strategies as you go.
+4. **Run Dumpling** — `dumpling -i dump.sql -o sanitized.sql` (add `-c path` if the config is not in the default search path). Use `dumpling --check -i dump.sql` when you only want to know whether anything would change.
+5. **Tighten the policy** — Run `dumpling lint-policy` on your config. When you are ready for stricter gates, add `[sensitive_columns]` and use `--strict-coverage` / `--report` / `--scan-output` as described under [Usage](#usage).
 The same flow is spelled out in the docs: [Getting started](https://ababic.github.io/dumpling/getting-started.html).
@@ -131,6 +133,8 @@ dumpling --format sqlite -i data.db.sql -o out.sql  # process a SQLite .dump fil
 dumpling --format mssql  -i backup.sql -o out.sql   # process a SQL Server plain-SQL dump
 dumpling lint-policy                          # lint the anonymization policy config
 dumpling lint-policy --config .dumplingconf   # lint with explicit config path
+dumpling scaffold-config -i dump.sql -o .dumplingconf   # draft [rules] from column names (beta)
+dumpling scaffold-config -i dump.sql -o draft.toml --infer-json-paths   # include JSON path hints (beta)
 ```
 Configuration is loaded in this order:
@@ -493,12 +497,16 @@ Produced by `pg_dump --format=plain`. Handles:
 Binary, custom, and directory formats from `pg_dump` are not parsed directly — Dumpling’s SQL pipeline expects plain text. Use either:
 - **`pg_dump --format=plain`** when you control capture, or
-- **`dumpling --dump-decode`** with `--input` set to a **custom-format** (`.dump`) or **directory-format** folder: Dumpling runs `pg_restore -f -` and streams the resulting SQL (same as a manual `pg_restore` “script” output, no database required). Requires PostgreSQL client tools on `PATH` (`pg_restore`), or set `--pg-restore-path`. Use `--dump-decode-arg` to pass extra flags (e.g. `--no-owner --no-acl`). **By default** the archive is removed after a fully successful run; pass **`--dump-decode-keep-input`** to retain it. **`--check`** requires **`--dump-decode-keep-input`** so the archive still exists if changes would be detected.
+- **Auto-detected PostgreSQL archives** with `--format postgres` (default): if `--input` is a **custom-format** file (begins with `PGDMP`) or a **directory-format** dump (folder containing `toc.dat`), Dumpling runs **`pg_restore -f -`** and streams the resulting SQL (same as a manual `pg_restore` “script” output; no database required). Requires PostgreSQL client tools on **`PATH`** (`pg_restore`), or **`--pg-restore-path`**. Extra flags: **`--pg-restore-arg`** (repeatable), or defaults from **`[pg_restore]`** in `.dumplingconf` / `pyproject.toml` (CLI wins when set).
+**Compressed inputs:** **`.gz`** files whose payload is plain SQL are **decompressed in-process** (no temporary file). **ZIP** archives (and gzip wrapping `PGDMP` or an inner ZIP) are expanded under the system temp directory; those paths are removed when the run finishes. **`--in-place`** is rejected when Dumpling had to materialize a temp file for compression or when the input is a PostgreSQL archive path that must go through `pg_restore` (use **`--output`** or stdout instead).
+**Keeping archives:** **By default** the `--input` archive path (file or directory-format folder) is **removed** after a fully successful run. Pass **`--keep-original`** or set **`keep_original = true`** in config to retain it. **`--check`** against an archive requires an effective keep-original (CLI or config); **`--keep-original` cannot be combined with `--in-place`**.
 Example (e.g. after `heroku pg:backups:download`):
 ```bash
-dumpling --dump-decode -i latest.dump -c .dumplingconf -o anonymized.sql
+dumpling -i latest.dump -c .dumplingconf -o anonymized.sql
 ```
 ### SQLite (`--format sqlite`)
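
The PostgreSQL archive paragraph in the hunk above says Dumpling runs `pg_restore -f -` and streams the resulting SQL, with an optional binary path and repeatable extra arguments. As an illustration only, the command could be assembled with `std::process::Command` roughly like this. The helper name and the position of the extra arguments are assumptions, not taken from the package.

```rust
use std::path::Path;
use std::process::{Command, Stdio};

/// Build (but do not run) a `pg_restore -f - [extra args] <archive>` command.
/// `pg_restore` is either a bare name resolved via PATH or an explicit path,
/// mirroring `--pg-restore-path`; `extra_args` mirrors `--pg-restore-arg`.
pub fn pg_restore_command(pg_restore: &str, extra_args: &[String], archive: &Path) -> Command {
    let mut cmd = Command::new(pg_restore);
    cmd.arg("-f")
        .arg("-") // write the SQL script to stdout instead of a file
        .args(extra_args)
        .arg(archive)
        .stdin(Stdio::null())
        .stdout(Stdio::piped()); // the caller reads SQL from the child's stdout
    cmd
}
```

A caller would `spawn()` the command, wrap the child's stdout in a buffered reader, feed it to the anonymizer, and check the exit status once the stream ends.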

{dumpling_cli-0.7.0a0 → dumpling_cli-0.7.0b0}/README.md RENAMED

@@ -80,12 +80,11 @@ dumpling --help
 Follow these steps once; you will have a working path from “raw dump” to “first sanitized output,” then you can deepen coverage using the rest of this README and the [documentation site](https://ababic.github.io/dumpling/).
-1. **Start from the example policy** — Copy [`.dumplingconf.example`](.dumplingconf.example) to `.dumplingconf` in your project root (or merge the same keys under `[tool.dumpling]` in `pyproject.toml`). Set environment variables for `salt` and any `${…}` references so Dumpling can resolve secrets at startup.
-2. **Name your tables and columns** — Open your dump next to the config. `CREATE TABLE`, `COPY … (…)` and `INSERT INTO … (…)` lines list the identifiers you need for `[rules."table"]` or `[rules."schema.table"]` (see [Configuration (TOML)](#configuration-toml) below). Trim the example rules down to the tables you care about first, then add columns and strategies as you go.
-3. **Run Dumpling** — `dumpling -i dump.sql -o sanitized.sql` (add `-c path` if the config is not in the default search path). Use `dumpling --check -i dump.sql` when you only want to know whether anything would change.
-4. **Tighten the policy** — Run `dumpling lint-policy` on your config. When you are ready for stricter gates, add `[sensitive_columns]` and use `--strict-coverage` / `--report` / `--scan-output` as described under [Usage](#usage).
-**Draft policy generation (planned)** — A future command will stream a dump and emit a **draft** starter TOML so you spend less time hunting table and column names and basic DDL hints (for example `varchar(N)` lengths). Output will be explicitly **draft**: always review and edit before production or compliance workflows; it is a time-saver, not a full policy.
+1. **Generate a draft policy (recommended)** — Run `dumpling scaffold-config -i dump.sql -o .dumplingconf` to emit a **beta** starter TOML with inferred `[rules]` from column names in `CREATE TABLE`, `INSERT`, and (PostgreSQL) `COPY` headers. Heuristics are **English-oriented**; treat the file as **draft only**—review every rule before production or compliance workflows. Add a global `salt` (for example `salt = "${DUMPLING_SALT}"`) and resolve `${…}` references before anonymizing. Optionally pass **`--infer-json-paths`** to sample up to **five rows per table** (reservoir) and suggest nested JSON keys as `column.path.to.leaf`; use **`--max-json-depth`** if you need a different walk depth (default 24). For PostgreSQL **custom-format** or **directory-format** archives, pass **`--input`** pointing at the archive with **`--format postgres`** (default); Dumpling auto-detects and runs **`pg_restore`** (optional **`--pg-restore-path`** / **`--pg-restore-arg`**). See `dumpling scaffold-config --help`.
+2. **Or start from the example policy** — Copy [`.dumplingconf.example`](.dumplingconf.example) to `.dumplingconf` (or merge under `[tool.dumpling]` in `pyproject.toml`) and edit `[rules]` by hand. Set environment variables for `salt` and any `${…}` references so Dumpling can resolve secrets at startup.
+3. **Align rules with your dump** — If you did not use `scaffold-config`, open the dump beside the config: `CREATE TABLE`, `COPY … (…)`, and `INSERT INTO … (…)` lines list identifiers for `[rules."table"]` or `[rules."schema.table"]` (see [Configuration (TOML)](#configuration-toml)). Trim rules to the tables you care about first, then extend columns and strategies as you go.
+4. **Run Dumpling** — `dumpling -i dump.sql -o sanitized.sql` (add `-c path` if the config is not in the default search path). Use `dumpling --check -i dump.sql` when you only want to know whether anything would change.
+5. **Tighten the policy** — Run `dumpling lint-policy` on your config. When you are ready for stricter gates, add `[sensitive_columns]` and use `--strict-coverage` / `--report` / `--scan-output` as described under [Usage](#usage).
 The same flow is spelled out in the docs: [Getting started](https://ababic.github.io/dumpling/getting-started.html).
@@ -110,6 +109,8 @@ dumpling --format sqlite -i data.db.sql -o out.sql  # process a SQLite .dump fil
 dumpling --format mssql  -i backup.sql -o out.sql   # process a SQL Server plain-SQL dump
 dumpling lint-policy                          # lint the anonymization policy config
 dumpling lint-policy --config .dumplingconf   # lint with explicit config path
+dumpling scaffold-config -i dump.sql -o .dumplingconf   # draft [rules] from column names (beta)
+dumpling scaffold-config -i dump.sql -o draft.toml --infer-json-paths   # include JSON path hints (beta)
 ```
 Configuration is loaded in this order:
@@ -472,12 +473,16 @@ Produced by `pg_dump --format=plain`. Handles:
 Binary, custom, and directory formats from `pg_dump` are not parsed directly — Dumpling’s SQL pipeline expects plain text. Use either:
 - **`pg_dump --format=plain`** when you control capture, or
-- **`dumpling --dump-decode`** with `--input` set to a **custom-format** (`.dump`) or **directory-format** folder: Dumpling runs `pg_restore -f -` and streams the resulting SQL (same as a manual `pg_restore` “script” output, no database required). Requires PostgreSQL client tools on `PATH` (`pg_restore`), or set `--pg-restore-path`. Use `--dump-decode-arg` to pass extra flags (e.g. `--no-owner --no-acl`). **By default** the archive is removed after a fully successful run; pass **`--dump-decode-keep-input`** to retain it. **`--check`** requires **`--dump-decode-keep-input`** so the archive still exists if changes would be detected.
+- **Auto-detected PostgreSQL archives** with `--format postgres` (default): if `--input` is a **custom-format** file (begins with `PGDMP`) or a **directory-format** dump (folder containing `toc.dat`), Dumpling runs **`pg_restore -f -`** and streams the resulting SQL (same as a manual `pg_restore` “script” output; no database required). Requires PostgreSQL client tools on **`PATH`** (`pg_restore`), or **`--pg-restore-path`**. Extra flags: **`--pg-restore-arg`** (repeatable), or defaults from **`[pg_restore]`** in `.dumplingconf` / `pyproject.toml` (CLI wins when set).
+**Compressed inputs:** **`.gz`** files whose payload is plain SQL are **decompressed in-process** (no temporary file). **ZIP** archives (and gzip wrapping `PGDMP` or an inner ZIP) are expanded under the system temp directory; those paths are removed when the run finishes. **`--in-place`** is rejected when Dumpling had to materialize a temp file for compression or when the input is a PostgreSQL archive path that must go through `pg_restore` (use **`--output`** or stdout instead).
+**Keeping archives:** **By default** the `--input` archive path (file or directory-format folder) is **removed** after a fully successful run. Pass **`--keep-original`** or set **`keep_original = true`** in config to retain it. **`--check`** against an archive requires an effective keep-original (CLI or config); **`--keep-original` cannot be combined with `--in-place`**.
 Example (e.g. after `heroku pg:backups:download`):
 ```bash
-dumpling --dump-decode -i latest.dump -c .dumplingconf -o anonymized.sql
+dumpling -i latest.dump -c .dumplingconf -o anonymized.sql
 ```
 ### SQLite (`--format sqlite`)
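
The `--infer-json-paths` description in the README hunk above says `scaffold-config` keeps up to five sampled rows per table using a reservoir. The package source for that is not part of this diff, so the following is only a sketch of classic reservoir sampling (Algorithm R) over a stream of rows. `XorShift64` is a small stand-in generator so the example needs no crates. The dependency list does include `rand`, but whether the scaffolder uses it is not shown here.

```rust
/// Tiny xorshift64 generator; a stand-in so the sketch has no dependencies.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in `0..bound`; the small modulo bias is ignored in this sketch.
    pub fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Fixed-capacity sample of a stream whose length is not known in advance.
pub struct Reservoir<T> {
    cap: usize,
    seen: u64,
    items: Vec<T>,
}

impl<T> Reservoir<T> {
    pub fn new(cap: usize) -> Self {
        Reservoir { cap, seen: 0, items: Vec::with_capacity(cap) }
    }

    /// Offer one row. The first `cap` rows are always kept; row number n
    /// (1-based) then replaces a random slot with probability cap / n.
    pub fn offer(&mut self, item: T, rng: &mut XorShift64) {
        self.seen += 1;
        if self.items.len() < self.cap {
            self.items.push(item);
        } else {
            let j = rng.below(self.seen);
            if j < self.cap as u64 {
                self.items[j as usize] = item;
            }
        }
    }

    pub fn items(&self) -> &[T] {
        &self.items
    }

    pub fn seen(&self) -> u64 {
        self.seen
    }
}
```

One reservoir per table with `cap = 5` keeps memory bounded no matter how many rows a `COPY` or `INSERT` block carries, which fits a tool that streams the dump once.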

{dumpling_cli-0.7.0a0 → dumpling_cli-0.7.0b0}/docs/src/configuration.md RENAMED

@@ -6,7 +6,7 @@ Use `--format` to declare the SQL dialect of your input file:
 | Value | Description |
 |---|---|
-| `postgres` (default) | PostgreSQL `pg_dump` plain-text format. Supports `COPY … FROM stdin` blocks, `"double-quoted"` identifiers, `''`-escaped strings. Custom-format (`-Fc`) or directory dumps can be decoded on the fly with `dumpling --dump-decode` (wraps `pg_restore -f -`; requires client tools). By default the archive is deleted after success; use `--dump-decode-keep-input` to retain it. |
+| `postgres` (default) | PostgreSQL `pg_dump` plain-text format. Supports `COPY … FROM stdin` blocks, `"double-quoted"` identifiers, `''`-escaped strings. **Custom-format** (`PGDMP`) and **directory-format** (`toc.dat`) dumps are **auto-detected** and decoded with `pg_restore -f -` (requires client tools). **Gzip** — wrapped plain SQL is decompressed in-process; **ZIP** (or gzip wrapping `PGDMP`/nested ZIP) uses a temp file that is removed after the run. By default the archive is deleted after success; use **`--keep-original`** or **`keep_original`** in config to retain it. |
 | `sqlite` | SQLite `.dump` format. Adds `INSERT OR REPLACE INTO` / `INSERT OR IGNORE INTO` support. No COPY blocks. |
 | `mssql` | SQL Server / MSSQL plain SQL. Adds `[bracket]` identifier quoting, `N'…'` Unicode string literals, and `nvarchar(n)` / `nchar(n)` length extraction. No COPY blocks. |
@@ -17,26 +17,58 @@ dumpling --format sqlite -i data.db.sql -o anonymized.sql
 dumpling --format mssql  -i backup.sql  -o anonymized.sql
 ```
-### PostgreSQL custom-format archives (`--dump-decode`)
+### PostgreSQL archives and compressed inputs
-Heroku PGBackups and many pipelines ship **`pg_dump` custom format** (`-Fc`) or **directory-format** dumps to save bandwidth. Dumpling’s SQL engine still expects **plain text**; use **`--dump-decode`** so Dumpling runs **`pg_restore -f -`** (script to stdout, no database) and pipes the result through the same anonymizer as a normal plain-SQL file.
+Heroku PGBackups and many pipelines ship **`pg_dump` custom format** (`-Fc`), **directory-format** dumps, or **gzip**/**ZIP**-wrapped files. Dumpling’s SQL engine still expects **plain text** at the parser; anything else is normalized first.
-**Requirements:** PostgreSQL client tools on `PATH` (`pg_restore`), or set **`--pg-restore-path`**. Use **`--dump-decode-arg`** (repeatable) for extra `pg_restore` flags, e.g. `--dump-decode-arg=--no-owner --dump-decode-arg=--no-acl`.
+#### Custom-format and directory dumps (auto-detected)
-**Input deletion:** After a **fully successful** run, Dumpling **removes** the `--input` path (single file or directory-format folder) by default so only the anonymized output remains. Pass **`--dump-decode-keep-input`** to retain the archive.
+With **`--format postgres`** (default), Dumpling detects:
-**Check mode:** **`--check`** with **`--dump-decode`** requires **`--dump-decode-keep-input`**. Otherwise the default would delete the dump before you can iterate on config.
+- **Custom-format** files (magic `PGDMP` at the start of the file), and
+- **Directory-format** folders (a `toc.dat` beside table blobs),
-Example (e.g. after `heroku pg:backups:download`):
+then runs **`pg_restore -f -`** (script to stdout inside the process — no database) and pipes the result through the same anonymizer as a normal plain-SQL file. Detection from **`--input`** is automatic.
+**Requirements:** PostgreSQL client tools on **`PATH`** (`pg_restore`), or **`--pg-restore-path`**.
+**Extra `pg_restore` arguments:**
+- CLI: **`--pg-restore-arg`** (repeatable), e.g. `--pg-restore-arg=--no-owner --pg-restore-arg=--no-acl`
+- Config (optional): **`[pg_restore]`** — CLI overrides these when you pass path or args:
+```toml
+[pg_restore]
+path = "/usr/bin/pg_restore"
+args = ["--no-owner", "--no-acl"]
+```
+#### Gzip and ZIP wrappers
+- **Gzip (`.gz`)** whose decompressed payload is **plain SQL**: decompressed **in-process** (streamed); no temporary dump file.
+- **ZIP** containing a single dump file (or a single `.sql` when multiple files exist), **gzip wrapping `PGDMP`**, or **gzip wrapping an inner ZIP**: Dumpling writes under the system temp directory and **removes** those paths when the run completes (including after errors — cleanup runs on drop).
+**`--in-place`** is **rejected** when Dumpling had to **materialize** a temp file for compression **or** when the resolved input is a PostgreSQL archive decoded via **`pg_restore`** (use **`--output`** or stdout).
+#### Keeping inputs and `--check`
+After a **fully successful** run, Dumpling **removes** the `--input` archive path (single file or directory-format folder) **by default**. To keep it:
+- **`--keep-original`**, or
+- **`keep_original = true`** at the top level of `.dumplingconf` / `[tool.dumpling]` (merged with CLI; **`--keep-original` cannot be used with `--in-place`**).
+**`--check`** with a PostgreSQL archive requires an **effective** keep-original (CLI or config); otherwise the default deletion would remove the dump before you iterate on policy.
+Examples (e.g. after `heroku pg:backups:download`):
 ```bash
-dumpling --dump-decode -i latest.dump -c .dumplingconf -o anonymized.sql
+dumpling -i latest.dump -c .dumplingconf -o anonymized.sql
 ```
 Dry run while keeping the downloaded file:
 ```bash
-dumpling --dump-decode --dump-decode-keep-input --check -i latest.dump -c .dumplingconf
+dumpling --keep-original --check -i latest.dump -c .dumplingconf
 ```
 ---
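
The rewritten section above states several interacting rules: CLI values override `[pg_restore]` config, `keep_original` can come from either source, `--keep-original` conflicts with `--in-place`, and `--check` on an archive needs an effective keep-original. A hedged sketch of how those rules could be combined is shown below. All struct and function names are hypothetical. Treating the merge as a logical OR, and letting CLI args replace rather than extend config args, are assumptions. How a config-only `keep_original = true` interacts with `--in-place` is not stated in the docs and is not modeled.

```rust
/// Flags as parsed from the command line (hypothetical shape).
#[derive(Debug, Default, Clone)]
pub struct Cli {
    pub keep_original: bool,
    pub in_place: bool,
    pub check: bool,
    pub pg_restore_path: Option<String>,
    pub pg_restore_args: Vec<String>,
}

/// Values read from `.dumplingconf` / `[tool.dumpling]` (hypothetical shape).
#[derive(Debug, Default, Clone)]
pub struct FileConfig {
    pub keep_original: bool,
    pub pg_restore_path: Option<String>,
    pub pg_restore_args: Vec<String>,
}

/// Settings after merging both sources.
#[derive(Debug, PartialEq, Eq)]
pub struct Effective {
    pub keep_original: bool,
    pub pg_restore_path: String,
    pub pg_restore_args: Vec<String>,
}

pub fn resolve(cli: &Cli, cfg: &FileConfig, input_is_pg_archive: bool) -> Result<Effective, String> {
    if cli.keep_original && cli.in_place {
        return Err("--keep-original cannot be combined with --in-place".to_string());
    }
    if cli.in_place && input_is_pg_archive {
        return Err("--in-place is rejected for pg_restore-decoded archives; use --output or stdout".to_string());
    }
    // Assumption: "merged with CLI" means either source can enable it.
    let keep_original = cli.keep_original || cfg.keep_original;
    if cli.check && input_is_pg_archive && !keep_original {
        return Err("--check on an archive requires --keep-original or keep_original = true".to_string());
    }
    // CLI wins when set; otherwise fall back to config, then to PATH lookup.
    let pg_restore_path = cli
        .pg_restore_path
        .clone()
        .or_else(|| cfg.pg_restore_path.clone())
        .unwrap_or_else(|| "pg_restore".to_string());
    let pg_restore_args = if cli.pg_restore_args.is_empty() {
        cfg.pg_restore_args.clone()
    } else {
        cli.pg_restore_args.clone()
    };
    Ok(Effective { keep_original, pg_restore_path, pg_restore_args })
}
```

Validating everything up front, before any decoding starts, means a conflicting flag combination fails without touching the input archive.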

dumpling_cli-0.7.0b0/docs/src/getting-started.md ADDED

@@ -0,0 +1,63 @@
+# Getting started
+This page is the **shortest path** from zero to a first successful run. For strategy details, row filters, dump seals, and CI patterns, continue with the [configuration guide](configuration.md) and the repository `README.md`.
+## Prerequisites
+- Rust **stable** toolchain (`rustup` recommended). The repo includes `rust-toolchain.toml` (stable + `rustfmt` + `clippy`) so CI and local `cargo` stay aligned.
+- `cargo` on your `PATH`
+Optional: run **`./scripts/setup-dev.sh`** once from the repo root. It installs toolchain components, runs **`cargo fetch`**, and installs a pinned **mdBook** under `.tools/` so the local docs build matches the one CI uses.
+## Build
+```bash
+cargo build --release
+./target/release/dumpling --help
+```
+### Python / pip (`dumpling-cli`)
+```bash
+pip install dumpling-cli
+dumpling --help
+```
+## First anonymization
+1. **Generate a draft policy (recommended)** — From your project root (or anywhere you keep config):
+   ```bash
+   dumpling scaffold-config -i dump.sql -o .dumplingconf
+   ```
+   This **beta** subcommand streams the dump once and writes inferred `[rules]` from SQL column names (`CREATE TABLE`, `INSERT`, and PostgreSQL `COPY` column lists). Heuristics are **English-oriented**; output is **draft only**—review and edit every rule, add a top-level **`salt`** (for hashing) and any **`${…}`** secret placeholders before production use.
+   Useful flags:
+   - **`--infer-json-paths`** — Keep up to **five sampled rows per table** (reservoir) and suggest nested JSON rules as `column.path.leaf`.
+   - **`--max-json-depth`** — Cap JSON walking depth when using `--infer-json-paths` (default 24).
+   - **`--format`** — `postgres` (default), `sqlite`, or `mssql`.
+   - **`--pg-restore-path`** / **`--pg-restore-arg`** — Optional **`pg_restore`** binary and extra arguments when **`--input`** is a PostgreSQL custom-format or directory-format archive (auto-detected with **`--format postgres`**); see [PostgreSQL archives and compressed inputs](configuration.md#postgresql-archives-and-compressed-inputs).
+   Run `dumpling scaffold-config --help` for the full flag list.
+2. **Or start from the example policy** — Copy [`.dumplingconf.example`](https://github.com/ababic/dumpling/blob/main/.dumplingconf.example) to `.dumplingconf` (or merge under `[tool.dumpling]` in `pyproject.toml`) and author `[rules]` by hand. Set environment variables for `salt` and any `${…}` references.
+3. **Align rules with your dump (manual path only)** — If you skipped `scaffold-config`, use `CREATE TABLE`, `COPY … (…)`, and `INSERT INTO … (…)` lines to name `[rules."table"]` or `[rules."schema.table"]` keys. Trim to the tables you care about first.
+4. **Run Dumpling** — `dumpling -i dump.sql -o sanitized.sql` (add `-c path` if the config is not in the default search path). Use `dumpling --check -i dump.sql` when you only want to know whether anything would change.
+5. **Tighten the policy** — Run `dumpling lint-policy` on your config. When you are ready for stricter gates, add `[sensitive_columns]` and use `--strict-coverage`, `--report`, and `--scan-output` as described in the [configuration guide](configuration.md) and the repository `README.md`.
+## PostgreSQL custom-format archives
+If your input is a PostgreSQL **custom-format** file or **directory-format** folder (not plain SQL), use **`--format postgres`** (default): Dumpling **auto-detects** the archive and runs **`pg_restore -f -`** (needs `pg_restore` from PostgreSQL client tools). Gzip-wrapped plain SQL is streamed without a temp file; ZIP (or gzip wrapping `PGDMP`) uses a temp extract that is cleaned up afterward. See [PostgreSQL archives and compressed inputs](configuration.md#postgresql-archives-and-compressed-inputs) in the configuration guide.
+## Test locally (contributors)
+```bash
+cargo fmt --all -- --check
+cargo clippy --all-targets --all-features
+cargo test --all-targets --all-features
+```
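The `--infer-json-paths` description above (up to five reservoir-sampled rows per table, nested rules named `column.path.leaf`, depth capped by `--max-json-depth`) can be pictured with a short Python sketch. This is not Dumpling's implementation: `reservoir_sample` and `leaf_paths` are hypothetical helpers, the column name `profile` is made up, and skipping JSON arrays is an assumption of this example.

```python
import json
import random
from typing import Any, Iterable, Iterator, List


def reservoir_sample(rows: Iterable[Any], k: int, rng: random.Random) -> List[Any]:
    """Algorithm R: keep a uniform random sample of at most k items from a stream."""
    sample: List[Any] = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive bounds
            if j < k:
                sample[j] = row     # replace with probability k / (i + 1)
    return sample


def leaf_paths(value: Any, prefix: str, max_depth: int) -> Iterator[str]:
    """Yield dotted paths to scalar leaves, descending at most max_depth levels."""
    if isinstance(value, dict) and max_depth > 0:
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}", max_depth - 1)
    elif not isinstance(value, (dict, list)):
        yield prefix                # scalar leaf; arrays are skipped in this sketch


rows = [
    '{"user": {"email": "a@example.com", "name": "A"}}',
    '{"user": {"email": "b@example.com"}, "tags": ["x"]}',
]
sample = reservoir_sample(rows, k=5, rng=random.Random(0))
paths = sorted({p for r in sample for p in leaf_paths(json.loads(r), "profile", 24)})
print(paths)  # ['profile.user.email', 'profile.user.name']
```

A reservoir keeps memory bounded at `k` rows per table regardless of dump size, which fits a tool that streams the dump once.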

{dumpling_cli-0.7.0a0 → dumpling_cli-0.7.0b0}/docs/src/index.md RENAMED

@@ -1,8 +1,8 @@
 # Dumpling documentation
-Dumpling is a streaming anonymizer for plain SQL dumps. It supports PostgreSQL (`pg_dump` plain format), SQLite (`.dump`), and SQL Server / MSSQL (SSMS / mssql-scripter plain SQL output). For PostgreSQL **custom-format** archives (e.g. Heroku `pg:backups:download`), use **`--dump-decode`** so Dumpling invokes `pg_restore` and streams plain SQL—see [Dump format](configuration.html#postgresql-custom-format-archives---dump-decode) in the configuration guide.
+Dumpling is a streaming anonymizer for plain SQL dumps. It supports PostgreSQL (`pg_dump` plain format), SQLite (`.dump`), and SQL Server / MSSQL (SSMS / mssql-scripter plain SQL output). For PostgreSQL **custom-format** or **directory-format** archives (e.g. Heroku `pg:backups:download`), Dumpling **auto-detects** them when `--format postgres` is in effect (the default) and invokes `pg_restore -f -`; see [PostgreSQL archives and compressed inputs](configuration.html#postgresql-archives-and-compressed-inputs) in the configuration guide.
-**New here?** Start with [**Getting started**](getting-started.html): copy the example config, align rules with your dump, run Dumpling, then tighten with `lint-policy` and optional CI flags.
+**New here?** Start with [**Getting started**](getting-started.html): generate a **draft** policy with `scaffold-config`, review and add secrets, run Dumpling, then tighten with `lint-policy` and optional CI flags.
 This documentation covers the operating model for day-to-day use:

{dumpling_cli-0.7.0a0 → dumpling_cli-0.7.0b0}/pyproject.toml RENAMED

@@ -4,9 +4,12 @@ build-backend = "maturin"
 [project]
 name = "dumpling-cli"
-version = "0.7.0-alpha"
+version = "0.7.0-beta"
 description = "Static anonymizer for plain SQL dumps (PostgreSQL, SQLite, SQL Server)."
 readme = "README.md"
+license = "MIT"
+license-files = ["LICENSE"]
+authors = [{ name = "Andy Babic" }]
 requires-python = ">=3.8"
 keywords = ["postgres", "sqlite", "mssql", "sql", "anonymization", "cli", "rust"]
 classifiers = [
