lakediff 1.0.0__tar.gz

@@ -0,0 +1,61 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *.pyo
+ *.pyd
+ .Python
+ *.egg
+ *.egg-info/
+ dist/
+ build/
+ .eggs/
+ pip-wheel-metadata/
+ *.whl
+
+ # Virtual environments
+ .venv/
+ venv/
+ env/
+ ENV/
+
+ # Testing / coverage
+ .pytest_cache/
+ .coverage
+ coverage.xml
+ htmlcov/
+ .hypothesis/
+
+ # Type checking
+ .mypy_cache/
+ .ruff_cache/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # macOS
+ .DS_Store
+
+ # Example output (generated files, not committed)
+ examples/output/
+
+ # Docs build
+ docs/_build/
+ site/
+
+ # Test data and output files
+ *.pyc
+ taxi_data/
+ load_test_results/
+ lakediff_*.html
+ lakediff_*.json
+ lakediff_*.md
+
+ # Release process (internal, not for contributors)
+ RELEASE.md
+
+ # DuckDB temp files
+ *.duckdb
+ *.duckdb.wal
@@ -0,0 +1,68 @@
+ # Changelog
+
+ All notable changes to lakediff are documented here.
+ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
+ Versions follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ---
+
+ ## [1.0.0] — 2025-03-13
+
+ First stable release. Complete rewrite on DuckDB backend.
+
+ ### Added
+ - **DuckDB backend** — all diff operations run as SQL inside DuckDB. No data ever enters Python memory during schema or stats diff. Row diff uses a native FULL OUTER JOIN.
+ - **Any-scale row diff** — DuckDB spills to disk automatically. Files that previously caused OOM on the Polars backend now run cleanly.
+ - **Cloud storage** — S3, GCS, and Azure Blob Storage supported natively. Credentials read from environment variables. No extra code needed.
+ - **Delta Lake** — directory-based Delta tables supported via DuckDB delta extension, auto-installed on first use.
+ - **Avro** — `.avro` files supported via DuckDB avro extension.
+ - **Iceberg** — Iceberg tables supported via DuckDB iceberg extension.
+ - **HTTP / HTTPS Parquet** — any public Parquet URL works as a source or target path.
+ - **Cross-cloud diff** — source and target can be in different clouds or mixed local/cloud.
+ - **`--where` SQL filter** — applied as a DuckDB predicate before any diff or preview. Works on all formats.
+ - **`lakediff show`** — file preview command with `--schema`, `--count`, `--stats`, `--where`, `--columns`, `--tail`, `--rows` flags.
+ - **Rich spinner** — visible progress indicator while DuckDB works on large files or remote paths.
+ - **Low-cardinality key guard** — detects keys with fewer than 10 unique values and blocks the join with a clear explanation and column suggestions.
+ - **Key error messages** — when a key column is missing, lakediff shows which file it is missing from and suggests the closest matching column names.
+ - **Actionable cloud error messages** — access denied and credential errors are caught and converted into human-readable messages with the exact environment variables to set.
+ - **`lakediff formats`** — command listing all supported formats, extensions, and cloud URI schemes.
+ - **Sample materialisation** — when `--sample` or `--limit` is used, the sampled rows are now materialised as a real in-memory DuckDB table before stats diff runs. Previously each column aggregation re-scanned the full Parquet file through the sample subquery (O(columns × file_size)). Now it's one scan to materialise then fast in-memory queries (O(file_size + columns × sample_size)). Speedup: 14–28x on real data. Stats with `--sample 500k` on 35M rows: 312s → 11s.
+ - **`lakediff diff`** — alias for `lakediff compare`. Works identically, exists because everyone types it first.
+ - **`--ignore-columns`** — exclude audit/ETL columns that always differ and would pollute diff results.
+ - **`--limit N`** — use first N rows in file order (deterministic). Complements `--sample` which draws randomly.
+ - **`--order-by col [DESC]`** — sort rows in `lakediff show` without knowing the threshold.
+ - **`--freq col`** — top-10 value frequencies for a column with count, percentage, and bar chart.
+ - **Auto output filename** — `--output html` without `--out` auto-generates `lakediff_src_vs_tgt_timestamp.html`.
+ - **Pre-flight info** — `compare` and `diff` show source/target row counts and file sizes before running.
+ - **Summary line** — single-line diff summary printed after every run, easy to copy into Slack or PRs.
+ - **Overview mode** — `lakediff show` with no flags now shows schema + count + first 5 rows in one output.
+ - **Datetime stats** — timestamp columns now show min/max dates in `--stats` and drift detection for date shifts.
+ - **`lakediff.yaml` config file** — project-level config, auto-loaded from working directory. CLI flags always override.
+ - **CI exit codes** — exit 0 (no drift), exit 1 (error), exit 2 (drift alerts fired).
+ - **`--sample N`** — random row sampling via `USING SAMPLE N ROWS` in DuckDB, applied before stats and row diff.
+ - **Composite primary keys** — comma-separated or list form: `--key tenant_id,order_id,event_date`.
+ - **Rename detection** — uses Jaro-Winkler string similarity to detect likely column renames.
+ - **KL divergence** — distribution shift metric computed entirely in DuckDB SQL without pulling data into Python.
+ - **HTML report** — self-contained with Chart.js doughnut chart, dark theme, no external server needed.
+ - **Markdown report** — GitHub PR / dbt-ready, post directly as a PR comment.
+ - **JSON report** — full machine-readable diff result, suitable for downstream automation.
+ - **Full test suite** — 75+ tests covering all formats, all modes, edge cases, key errors, WHERE filters, CLI commands, and report output.
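The KL divergence entry in the list above names the metric but not its formula. As a rough illustration only, the following pure-Python sketch computes KL(source || target) over the value frequencies of one categorical column. It is not lakediff's implementation, which the changelog says runs as DuckDB SQL; the function name `kl_divergence`, the `epsilon` smoothing, and the choice of natural-log units are all assumptions made for this example.

```python
import math
from collections import Counter


def kl_divergence(source_values, target_values, epsilon=1e-9):
    """Return KL(source || target) in nats for two samples of categorical values.

    Hypothetical helper for illustration; smoothing with `epsilon` is an
    assumption that keeps categories missing from one side from dividing by zero.
    """
    src_counts = Counter(source_values)
    tgt_counts = Counter(target_values)
    src_total = sum(src_counts.values())
    tgt_total = sum(tgt_counts.values())
    if src_total == 0 or tgt_total == 0:
        raise ValueError("both samples must contain at least one value")

    # Compare the two distributions over every category seen on either side.
    categories = set(src_counts) | set(tgt_counts)
    k = len(categories)
    divergence = 0.0
    for category in categories:
        # Additive smoothing: each probability stays positive and each side sums to 1.
        p = (src_counts[category] + epsilon) / (src_total + epsilon * k)
        q = (tgt_counts[category] + epsilon) / (tgt_total + epsilon * k)
        divergence += p * math.log(p / q)
    return divergence


# A 90/10 split in the source against a 50/50 split in the target.
shift = kl_divergence(["a"] * 9 + ["b"], ["a"] * 5 + ["b"] * 5)
print(f"KL divergence: {shift:.4f} nats")
```

Identical samples give a divergence of zero, and the value grows as the target distribution moves away from the source. Numeric columns would need binning first, which this sketch leaves out.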
+
+ ### Removed
+ - Polars dependency — removed entirely. No Polars, PyArrow, NumPy, or SciPy in core dependencies.
+ - Per-format loaders (`CsvLoader`, `ParquetLoader`, etc.) — replaced by `connection.py` which uses DuckDB SQL for all formats.
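The idea of replacing per-format loader classes with SQL can be sketched as a single function that maps a path to a DuckDB table-function call. This is a guess at the general shape, not the contents of `connection.py`, which this diff does not show. The names `READERS` and `scan_expression` are hypothetical, and treating any suffix-less path as a Delta table directory is an assumption; Iceberg is left out because the changelog does not say how it is told apart from Delta. The DuckDB function names themselves (`read_csv`, `read_parquet`, `read_json`, `read_avro`, `delta_scan`) are real.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a DuckDB table function.
READERS = {
    ".csv": "read_csv",
    ".tsv": "read_csv",
    ".parquet": "read_parquet",
    ".json": "read_json",
    ".jsonl": "read_json",
    ".avro": "read_avro",
}


def scan_expression(path: str) -> str:
    """Build the FROM-clause expression DuckDB would use to scan `path`."""
    suffix = PurePosixPath(path).suffix.lower()
    # Double any single quotes so the path is a valid SQL string literal.
    escaped = path.replace("'", "''")
    if suffix in READERS:
        return f"{READERS[suffix]}('{escaped}')"
    if suffix == "":
        # Assumption: a path with no extension is a Delta table directory.
        return f"delta_scan('{escaped}')"
    raise ValueError(f"unsupported format: {suffix}")


print(f"SELECT count(*) FROM {scan_expression('s3://bucket/orders.parquet')}")
```

With this shape, every downstream query (schema, stats, row diff) only needs the returned expression, so adding a format means adding one dictionary entry rather than a new loader class.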
+
+ ---
+
+ ## [0.1.0] — 2024-01-01
+
+ Initial release. Polars-based backend.
+
+ ### Added
+ - CSV, TSV, Parquet, JSON, JSONL, Delta Lake loaders via Polars
+ - Schema diff, stats diff, row diff
+ - CLI, HTML, JSON, Markdown output
+ - Drift threshold alerting
+
+ [1.0.0]: https://github.com/lakediff/lakediff/releases/tag/v1.0.0
+ [0.1.0]: https://github.com/lakediff/lakediff/releases/tag/v0.1.0
lakediff-1.0.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Vinodh Mani
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.