lakediff 1.0.0__tar.gz
- lakediff-1.0.0/.gitignore +61 -0
- lakediff-1.0.0/CHANGELOG.md +68 -0
- lakediff-1.0.0/LICENSE +21 -0
- lakediff-1.0.0/PKG-INFO +846 -0
- lakediff-1.0.0/README.md +773 -0
- lakediff-1.0.0/examples/run_demo.py +133 -0
- lakediff-1.0.0/lakediff/__init__.py +5 -0
- lakediff-1.0.0/lakediff/cli.py +685 -0
- lakediff-1.0.0/lakediff/connection.py +360 -0
- lakediff-1.0.0/lakediff/core.py +295 -0
- lakediff-1.0.0/lakediff/differ/__init__.py +5 -0
- lakediff-1.0.0/lakediff/differ/row_differ.py +299 -0
- lakediff-1.0.0/lakediff/differ/schema_differ.py +107 -0
- lakediff-1.0.0/lakediff/differ/stats_differ.py +437 -0
- lakediff-1.0.0/lakediff/models.py +183 -0
- lakediff-1.0.0/lakediff/py.typed +0 -0
- lakediff-1.0.0/lakediff/reporters/__init__.py +6 -0
- lakediff-1.0.0/lakediff/reporters/cli_reporter.py +198 -0
- lakediff-1.0.0/lakediff/reporters/html_reporter.py +247 -0
- lakediff-1.0.0/lakediff/reporters/json_reporter.py +21 -0
- lakediff-1.0.0/lakediff/reporters/markdown_reporter.py +136 -0
- lakediff-1.0.0/pyproject.toml +118 -0
- lakediff-1.0.0/tests/conftest.py +5 -0
- lakediff-1.0.0/tests/test_lakediff.py +915 -0
lakediff-1.0.0/.gitignore
ADDED
@@ -0,0 +1,61 @@
+# Python
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+.Python
+*.egg
+*.egg-info/
+dist/
+build/
+.eggs/
+pip-wheel-metadata/
+*.whl
+
+# Virtual environments
+.venv/
+venv/
+env/
+ENV/
+
+# Testing / coverage
+.pytest_cache/
+.coverage
+coverage.xml
+htmlcov/
+.hypothesis/
+
+# Type checking
+.mypy_cache/
+.ruff_cache/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+
+# macOS
+.DS_Store
+
+# Example output (generated files, not committed)
+examples/output/
+
+# Docs build
+docs/_build/
+site/
+
+# Test data and output files
+*.pyc
+taxi_data/
+load_test_results/
+lakediff_*.html
+lakediff_*.json
+lakediff_*.md
+
+# Release process (internal, not for contributors)
+RELEASE.md
+
+# DuckDB temp files
+*.duckdb
+*.duckdb.wal
lakediff-1.0.0/CHANGELOG.md
ADDED
@@ -0,0 +1,68 @@
+# Changelog
+
+All notable changes to lakediff are documented here.
+Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
+Versions follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+---
+
+## [1.0.0] — 2025-03-13
+
+First stable release. Complete rewrite on DuckDB backend.
+
+### Added
+- **DuckDB backend** — all diff operations run as SQL inside DuckDB. No data ever enters Python memory during schema or stats diff. Row diff uses a native FULL OUTER JOIN.
+- **Any-scale row diff** — DuckDB spills to disk automatically. Files that previously caused OOM on the Polars backend now run cleanly.
+- **Cloud storage** — S3, GCS, and Azure Blob Storage supported natively. Credentials read from environment variables. No extra code needed.
+- **Delta Lake** — directory-based Delta tables supported via DuckDB delta extension, auto-installed on first use.
+- **Avro** — `.avro` files supported via DuckDB avro extension.
+- **Iceberg** — Iceberg tables supported via DuckDB iceberg extension.
+- **HTTP / HTTPS Parquet** — any public Parquet URL works as a source or target path.
+- **Cross-cloud diff** — source and target can be in different clouds or mixed local/cloud.
+- **`--where` SQL filter** — applied as a DuckDB predicate before any diff or preview. Works on all formats.
+- **`lakediff show`** — file preview command with `--schema`, `--count`, `--stats`, `--where`, `--columns`, `--tail`, `--rows` flags.
+- **Rich spinner** — visible progress indicator while DuckDB works on large files or remote paths.
+- **Low-cardinality key guard** — detects keys with fewer than 10 unique values and blocks the join with a clear explanation and column suggestions.
+- **Key error messages** — when a key column is missing, lakediff shows which file it is missing from and suggests the closest matching column names.
+- **Actionable cloud error messages** — access denied and credential errors are caught and converted into human-readable messages with the exact environment variables to set.
+- **`lakediff formats`** — command listing all supported formats, extensions, and cloud URI schemes.
+- **Sample materialisation** — when `--sample` or `--limit` is used, the sampled rows are now materialised as a real in-memory DuckDB table before stats diff runs. Previously each column aggregation re-scanned the full Parquet file through the sample subquery (O(columns × file_size)). Now it's one scan to materialise then fast in-memory queries (O(file_size + columns × sample_size)). Speedup: 14–28x on real data. Stats with `--sample 500k` on 35M rows: 312s → 11s.
+- **`lakediff diff`** — alias for `lakediff compare`. Works identically, exists because everyone types it first.
+- **`--ignore-columns`** — exclude audit/ETL columns that always differ and would pollute diff results.
+- **`--limit N`** — use first N rows in file order (deterministic). Complements `--sample` which draws randomly.
+- **`--order-by col [DESC]`** — sort rows in `lakediff show` without knowing the threshold.
+- **`--freq col`** — top-10 value frequencies for a column with count, percentage, and bar chart.
+- **Auto output filename** — `--output html` without `--out` auto-generates `lakediff_src_vs_tgt_timestamp.html`.
+- **Pre-flight info** — `compare` and `diff` show source/target row counts and file sizes before running.
+- **Summary line** — single-line diff summary printed after every run, easy to copy into Slack or PRs.
+- **Overview mode** — `lakediff show` with no flags now shows schema + count + first 5 rows in one output.
+- **Datetime stats** — timestamp columns now show min/max dates in `--stats` and drift detection for date shifts.
+- **`lakediff.yaml` config file** — project-level config, auto-loaded from working directory. CLI flags always override.
+- **CI exit codes** — exit 0 (no drift), exit 1 (error), exit 2 (drift alerts fired).
+- **`--sample N`** — random row sampling via `USING SAMPLE N ROWS` in DuckDB, applied before stats and row diff.
+- **Composite primary keys** — comma-separated or list form: `--key tenant_id,order_id,event_date`.
+- **Rename detection** — uses Jaro-Winkler string similarity to detect likely column renames.
+- **KL divergence** — distribution shift metric computed entirely in DuckDB SQL without pulling data into Python.
+- **HTML report** — self-contained with Chart.js doughnut chart, dark theme, no external server needed.
+- **Markdown report** — GitHub PR / dbt-ready, post directly as a PR comment.
+- **JSON report** — full machine-readable diff result, suitable for downstream automation.
+- **Full test suite** — 75+ tests covering all formats, all modes, edge cases, key errors, WHERE filters, CLI commands, and report output.
+
+### Removed
+- Polars dependency — removed entirely. No Polars, PyArrow, NumPy, or SciPy in core dependencies.
+- Per-format loaders (`CsvLoader`, `ParquetLoader`, etc.) — replaced by `connection.py` which uses DuckDB SQL for all formats.
+
+---
+
+## [0.1.0] — 2024-01-01
+
+Initial release. Polars-based backend.
+
+### Added
+- CSV, TSV, Parquet, JSON, JSONL, Delta Lake loaders via Polars
+- Schema diff, stats diff, row diff
+- CLI, HTML, JSON, Markdown output
+- Drift threshold alerting
+
+[1.0.0]: https://github.com/lakediff/lakediff/releases/tag/v1.0.0
+[0.1.0]: https://github.com/lakediff/lakediff/releases/tag/v0.1.0
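Note on the changelog's "KL divergence" entry: the changelog says the metric is computed in DuckDB SQL, and that SQL is not part of this hunk. As a rough illustration of what a distribution-shift score of this kind measures, here is a minimal plain-Python sketch for a categorical column. The function name `kl_divergence`, the `epsilon` smoothing constant, and the choice of natural logarithm are assumptions made for this example; they are not taken from lakediff's source and may differ from what the package does.

```python
import math
from collections import Counter


def kl_divergence(source, target, epsilon=1e-9):
    """Return KL(source || target) for two samples of categorical values.

    Hypothetical helper for illustration only; not lakediff's API.
    """
    if not source or not target:
        raise ValueError("both samples must be non-empty")

    src_counts = Counter(source)
    tgt_counts = Counter(target)
    # Compare over every category that appears on either side.
    categories = set(src_counts) | set(tgt_counts)

    divergence = 0.0
    for category in categories:
        # epsilon keeps the ratio finite when a category is absent on one
        # side (assumed smoothing scheme; Counter returns 0 for missing keys).
        p = src_counts[category] / len(source) + epsilon
        q = tgt_counts[category] / len(target) + epsilon
        divergence += p * math.log(p / q)
    return divergence


if __name__ == "__main__":
    before = ["card"] * 9 + ["cash"]
    after = ["card"] * 5 + ["cash"] * 5
    print(f"KL(before || after) = {kl_divergence(before, after):.4f}")
```

Identical distributions score 0 and the score grows as the target's value frequencies move away from the source's, which is the kind of behaviour a drift threshold can be set against.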
lakediff-1.0.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Vinodh Mani
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.