wherefore-0.1.0.tar.gz
- wherefore-0.1.0/LICENSE +202 -0
- wherefore-0.1.0/NOTICE +5 -0
- wherefore-0.1.0/PKG-INFO +447 -0
- wherefore-0.1.0/README.md +404 -0
- wherefore-0.1.0/pyproject.toml +83 -0
- wherefore-0.1.0/setup.cfg +4 -0
- wherefore-0.1.0/src/wherefore/__init__.py +0 -0
- wherefore-0.1.0/src/wherefore/cli.py +632 -0
- wherefore-0.1.0/src/wherefore/clustering/__init__.py +0 -0
- wherefore-0.1.0/src/wherefore/clustering/cluster_mismatches.py +344 -0
- wherefore-0.1.0/src/wherefore/clustering/signatures.py +446 -0
- wherefore-0.1.0/src/wherefore/comparison/__init__.py +0 -0
- wherefore-0.1.0/src/wherefore/comparison/diff_engine.py +229 -0
- wherefore-0.1.0/src/wherefore/comparison/diff_result.py +151 -0
- wherefore-0.1.0/src/wherefore/comparison/key_matching.py +149 -0
- wherefore-0.1.0/src/wherefore/comparison/loaders.py +370 -0
- wherefore-0.1.0/src/wherefore/reasoning/__init__.py +0 -0
- wherefore-0.1.0/src/wherefore/reasoning/explain.py +207 -0
- wherefore-0.1.0/src/wherefore/reasoning/prompts/cluster_explanation_v1.md +71 -0
- wherefore-0.1.0/src/wherefore/reasoning/providers/__init__.py +0 -0
- wherefore-0.1.0/src/wherefore/reasoning/providers/base.py +30 -0
- wherefore-0.1.0/src/wherefore/reasoning/providers/claude.py +81 -0
- wherefore-0.1.0/src/wherefore/reasoning/redaction.py +118 -0
- wherefore-0.1.0/src/wherefore/reasoning/report.py +25 -0
- wherefore-0.1.0/src/wherefore/synthetic/__init__.py +0 -0
- wherefore-0.1.0/src/wherefore/synthetic/base_dataset.py +259 -0
- wherefore-0.1.0/src/wherefore/synthetic/corruptors/__init__.py +0 -0
- wherefore-0.1.0/src/wherefore/synthetic/corruptors/dedup_failure.py +68 -0
- wherefore-0.1.0/src/wherefore/synthetic/corruptors/encoding_mismatch.py +84 -0
- wherefore-0.1.0/src/wherefore/synthetic/corruptors/enum_drift.py +70 -0
- wherefore-0.1.0/src/wherefore/synthetic/corruptors/float_precision.py +78 -0
- wherefore-0.1.0/src/wherefore/synthetic/corruptors/key_mismatch.py +100 -0
- wherefore-0.1.0/src/wherefore/synthetic/corruptors/null_type_coercion.py +84 -0
- wherefore-0.1.0/src/wherefore/synthetic/corruptors/timezone_shift.py +73 -0
- wherefore-0.1.0/src/wherefore/synthetic/corruptors/truncation.py +72 -0
- wherefore-0.1.0/src/wherefore/synthetic/ground_truth.py +83 -0
- wherefore-0.1.0/src/wherefore/taxonomy/__init__.py +0 -0
- wherefore-0.1.0/src/wherefore/taxonomy/patterns/dedup_failure.yaml +52 -0
- wherefore-0.1.0/src/wherefore/taxonomy/patterns/encoding_mismatch.yaml +45 -0
- wherefore-0.1.0/src/wherefore/taxonomy/patterns/enum_drift.yaml +48 -0
- wherefore-0.1.0/src/wherefore/taxonomy/patterns/float_precision.yaml +51 -0
- wherefore-0.1.0/src/wherefore/taxonomy/patterns/key_mismatch.yaml +59 -0
- wherefore-0.1.0/src/wherefore/taxonomy/patterns/null_type_coercion.yaml +52 -0
- wherefore-0.1.0/src/wherefore/taxonomy/patterns/timezone_shift.yaml +50 -0
- wherefore-0.1.0/src/wherefore/taxonomy/patterns/truncation.yaml +44 -0
- wherefore-0.1.0/src/wherefore/taxonomy/registry.py +180 -0
- wherefore-0.1.0/src/wherefore/taxonomy/schema.py +156 -0
- wherefore-0.1.0/src/wherefore.egg-info/PKG-INFO +447 -0
- wherefore-0.1.0/src/wherefore.egg-info/SOURCES.txt +51 -0
- wherefore-0.1.0/src/wherefore.egg-info/dependency_links.txt +1 -0
- wherefore-0.1.0/src/wherefore.egg-info/entry_points.txt +2 -0
- wherefore-0.1.0/src/wherefore.egg-info/requires.txt +18 -0
- wherefore-0.1.0/src/wherefore.egg-info/top_level.txt +1 -0
wherefore-0.1.0/LICENSE
ADDED
|
@@ -0,0 +1,202 @@
|
|
|
1
|
+
Apache License
|
|
2
|
+
Version 2.0, January 2004
|
|
3
|
+
http://www.apache.org/licenses/
|
|
4
|
+
|
|
5
|
+
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
|
|
6
|
+
|
|
7
|
+
1. Definitions.
|
|
8
|
+
|
|
9
|
+
"License" shall mean the terms and conditions for use, reproduction,
|
|
10
|
+
and distribution as defined by Sections 1 through 9 of this document.
|
|
11
|
+
|
|
12
|
+
"Licensor" shall mean the copyright owner or entity authorized by
|
|
13
|
+
the copyright owner that is granting the License.
|
|
14
|
+
|
|
15
|
+
"Legal Entity" shall mean the union of the acting entity and all
|
|
16
|
+
other entities that control, are controlled by, or are under common
|
|
17
|
+
control with that entity. For the purposes of this definition,
|
|
18
|
+
"control" means (i) the power, direct or indirect, to cause the
|
|
19
|
+
direction or management of such entity, whether by contract or
|
|
20
|
+
otherwise, or (ii) ownership of fifty percent (50%) or more of the
|
|
21
|
+
outstanding shares, or (iii) beneficial ownership of such entity.
|
|
22
|
+
|
|
23
|
+
"You" (or "Your") shall mean an individual or Legal Entity
|
|
24
|
+
exercising permissions granted by this License.
|
|
25
|
+
|
|
26
|
+
"Source" form shall mean the preferred form for making modifications,
|
|
27
|
+
including but not limited to software source code, documentation
|
|
28
|
+
source, and configuration files.
|
|
29
|
+
|
|
30
|
+
"Object" form shall mean any form resulting from mechanical
|
|
31
|
+
transformation or translation of a Source form, including but
|
|
32
|
+
not limited to compiled object code, generated documentation,
|
|
33
|
+
and conversions to other media types.
|
|
34
|
+
|
|
35
|
+
"Work" shall mean the work of authorship, whether in Source or
|
|
36
|
+
Object form, made available under the License, as indicated by a
|
|
37
|
+
copyright notice that is included in or attached to the work
|
|
38
|
+
(an example is provided in the Appendix below).
|
|
39
|
+
|
|
40
|
+
"Derivative Works" shall mean any work, whether in Source or Object
|
|
41
|
+
form, that is based on (or derived from) the Work and for which the
|
|
42
|
+
editorial revisions, annotations, elaborations, or other modifications
|
|
43
|
+
represent, as a whole, an original work of authorship. For the
|
|
44
|
+
purposes of this License, Derivative Works shall not include works
|
|
45
|
+
that remain separable from, or merely link (or bind by name) to the
|
|
46
|
+
interfaces of, the Work and Derivative Works thereof.
|
|
47
|
+
|
|
48
|
+
"Contribution" shall mean any work of authorship, including the
|
|
49
|
+
original version of the Work and any modifications or additions
|
|
50
|
+
to that Work or Derivative Works thereof, that is intentionally
|
|
51
|
+
submitted to Licensor for inclusion in the Work by the copyright
|
|
52
|
+
owner or by an individual or Legal Entity authorized to submit on
|
|
53
|
+
behalf of the copyright owner. For the purposes of this definition,
|
|
54
|
+
"submitted" means any form of electronic, verbal, or written
|
|
55
|
+
communication sent to the Licensor or its representatives,
|
|
56
|
+
including but not limited to communication on electronic mailing
|
|
57
|
+
lists, source code control systems, and issue tracking systems
|
|
58
|
+
that are managed by, or on behalf of, the Licensor for the purpose
|
|
59
|
+
of discussing and improving the Work, but excluding communication
|
|
60
|
+
that is conspicuously marked or otherwise designated in writing
|
|
61
|
+
by the copyright owner as "Not a Contribution."
|
|
62
|
+
|
|
63
|
+
"Contributor" shall mean Licensor and any individual or Legal Entity
|
|
64
|
+
on behalf of whom a Contribution has been received by Licensor and
|
|
65
|
+
subsequently incorporated within the Work.
|
|
66
|
+
|
|
67
|
+
2. Grant of Copyright License. Subject to the terms and conditions of
|
|
68
|
+
this License, each Contributor hereby grants to You a perpetual,
|
|
69
|
+
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
|
70
|
+
copyright license to reproduce, prepare Derivative Works of,
|
|
71
|
+
publicly display, publicly perform, sublicense, and distribute the
|
|
72
|
+
Work and such Derivative Works in Source or Object form.
|
|
73
|
+
|
|
74
|
+
3. Grant of Patent License. Subject to the terms and conditions of
|
|
75
|
+
this License, each Contributor hereby grants to You a perpetual,
|
|
76
|
+
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
|
77
|
+
(except as stated in this section) patent license to make, have
|
|
78
|
+
made, use, offer to sell, sell, import, and otherwise transfer the
|
|
79
|
+
Work, where such license applies only to those patent claims
|
|
80
|
+
licensable by such Contributor that are necessarily infringed by
|
|
81
|
+
their Contribution(s) alone or by combination of their
|
|
82
|
+
Contribution(s) with the Work to which such Contribution(s) was
|
|
83
|
+
submitted. If You institute patent litigation against any entity
|
|
84
|
+
(including a cross-claim or counterclaim in a lawsuit) alleging
|
|
85
|
+
that the Work or a Contribution incorporated within the Work
|
|
86
|
+
constitutes direct or contributory patent infringement, then any
|
|
87
|
+
patent licenses granted to You under this License for that Work
|
|
88
|
+
shall terminate as of the date such litigation is filed.
|
|
89
|
+
|
|
90
|
+
4. Redistribution. You may reproduce and distribute copies of the
|
|
91
|
+
Work or Derivative Works thereof in any medium, with or without
|
|
92
|
+
modifications, and in Source or Object form, provided that You
|
|
93
|
+
meet the following conditions:
|
|
94
|
+
|
|
95
|
+
(a) You must give any other recipients of the Work or
|
|
96
|
+
Derivative Works a copy of this License; and
|
|
97
|
+
|
|
98
|
+
(b) You must cause any modified files to carry prominent notices
|
|
99
|
+
stating that You changed the files; and
|
|
100
|
+
|
|
101
|
+
(c) You must retain, in the Source form of any Derivative Works
|
|
102
|
+
that You distribute, all copyright, patent, trademark, and
|
|
103
|
+
attribution notices from the Source form of the Work,
|
|
104
|
+
excluding those notices that do not pertain to any part of
|
|
105
|
+
the Derivative Works; and
|
|
106
|
+
|
|
107
|
+
(d) If the Work includes a "NOTICE" text file as part of its
|
|
108
|
+
distribution, then any Derivative Works that You distribute must
|
|
109
|
+
include a readable copy of the attribution notices contained
|
|
110
|
+
within such NOTICE file, excluding those notices that do not
|
|
111
|
+
pertain to any part of the Derivative Works, in at least one
|
|
112
|
+
of the following places: within a NOTICE text file distributed
|
|
113
|
+
as part of the Derivative Works; within the Source form or
|
|
114
|
+
documentation, if provided along with the Derivative Works; or,
|
|
115
|
+
within a display generated by the Derivative Works, if and
|
|
116
|
+
wherever such third-party notices normally appear. The contents
|
|
117
|
+
of the NOTICE file are for informational purposes only and
|
|
118
|
+
do not modify the License. You may add Your own attribution
|
|
119
|
+
notices within Derivative Works that You distribute, alongside
|
|
120
|
+
or as an addendum to the NOTICE text from the Work, provided
|
|
121
|
+
that such additional attribution notices cannot be construed
|
|
122
|
+
as modifying the License.
|
|
123
|
+
|
|
124
|
+
You may add Your own copyright statement to Your modifications and
|
|
125
|
+
may provide additional or different license terms and conditions
|
|
126
|
+
for use, reproduction, or distribution of Your modifications, or
|
|
127
|
+
for any such Derivative Works as a whole, provided Your use,
|
|
128
|
+
reproduction, and distribution of the Work otherwise complies with
|
|
129
|
+
the conditions stated in this License.
|
|
130
|
+
|
|
131
|
+
5. Submission of Contributions. Unless You explicitly state otherwise,
|
|
132
|
+
any Contribution intentionally submitted for inclusion in the Work
|
|
133
|
+
by You to the Licensor shall be under the terms and conditions of
|
|
134
|
+
this License, without any additional terms or conditions.
|
|
135
|
+
Notwithstanding the above, nothing herein shall supersede or modify
|
|
136
|
+
the terms of any separate license agreement you may have executed
|
|
137
|
+
with Licensor regarding such Contributions.
|
|
138
|
+
|
|
139
|
+
6. Trademarks. This License does not grant permission to use the trade
|
|
140
|
+
names, trademarks, service marks, or product names of the Licensor,
|
|
141
|
+
except as required for reasonable and customary use in describing
|
|
142
|
+
the origin of the Work and reproducing the content of the NOTICE file.
|
|
143
|
+
|
|
144
|
+
7. Disclaimer of Warranty. Unless required by applicable law or
|
|
145
|
+
agreed to in writing, Licensor provides the Work (and each
|
|
146
|
+
Contributor provides its Contributions) on an "AS IS" BASIS,
|
|
147
|
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
|
148
|
+
implied, including, without limitation, any warranties or conditions
|
|
149
|
+
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
|
150
|
+
PARTICULAR PURPOSE. You are solely responsible for determining the
|
|
151
|
+
appropriateness of using or redistributing the Work and assume any
|
|
152
|
+
risks associated with Your exercise of permissions under this License.
|
|
153
|
+
|
|
154
|
+
8. Limitation of Liability. In no event and under no legal theory,
|
|
155
|
+
whether in tort (including negligence), contract, or otherwise,
|
|
156
|
+
unless required by applicable law (such as deliberate and grossly
|
|
157
|
+
negligent acts) or agreed to in writing, shall any Contributor be
|
|
158
|
+
liable to You for damages, including any direct, indirect, special,
|
|
159
|
+
incidental, or consequential damages of any character arising as a
|
|
160
|
+
result of this License or out of the use or inability to use the
|
|
161
|
+
Work (including but not limited to damages for loss of goodwill,
|
|
162
|
+
work stoppage, computer failure or malfunction, or any and all
|
|
163
|
+
other commercial damages or losses), even if such Contributor
|
|
164
|
+
has been advised of the possibility of such damages.
|
|
165
|
+
|
|
166
|
+
9. Accepting Warranty or Additional Liability. While redistributing
|
|
167
|
+
the Work or Derivative Works thereof, You may choose to offer,
|
|
168
|
+
and charge a fee for, acceptance of support, warranty, indemnity,
|
|
169
|
+
or other liability obligations and/or rights consistent with this
|
|
170
|
+
License. However, in accepting such obligations, You may act only
|
|
171
|
+
on Your own behalf and on Your sole responsibility, not on behalf
|
|
172
|
+
of any other Contributor, and only if You agree to indemnify,
|
|
173
|
+
defend, and hold each Contributor harmless for any liability
|
|
174
|
+
incurred by, or claims asserted against, such Contributor by reason
|
|
175
|
+
of your accepting any such warranty or additional liability.
|
|
176
|
+
|
|
177
|
+
END OF TERMS AND CONDITIONS
|
|
178
|
+
|
|
179
|
+
APPENDIX: How to apply the Apache License to your work.
|
|
180
|
+
|
|
181
|
+
To apply the Apache License to your work, attach the following
|
|
182
|
+
boilerplate notice, with the fields enclosed by brackets "[]"
|
|
183
|
+
replaced with your own identifying information. (Don't include
|
|
184
|
+
the brackets!) The text should be enclosed in the appropriate
|
|
185
|
+
comment syntax for the file format. We also recommend that a
|
|
186
|
+
file or class name and description of purpose be included on the
|
|
187
|
+
same "printed page" as the copyright notice for easier
|
|
188
|
+
identification within third-party archives.
|
|
189
|
+
|
|
190
|
+
Copyright 2026 ArunMishra1
|
|
191
|
+
|
|
192
|
+
Licensed under the Apache License, Version 2.0 (the "License");
|
|
193
|
+
you may not use this file except in compliance with the License.
|
|
194
|
+
You may obtain a copy of the License at
|
|
195
|
+
|
|
196
|
+
http://www.apache.org/licenses/LICENSE-2.0
|
|
197
|
+
|
|
198
|
+
Unless required by applicable law or agreed to in writing, software
|
|
199
|
+
distributed under the License is distributed on an "AS IS" BASIS,
|
|
200
|
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
201
|
+
See the License for the specific language governing permissions and
|
|
202
|
+
limitations under the License.
|
wherefore-0.1.0/NOTICE
ADDED
wherefore-0.1.0/PKG-INFO
ADDED
|
@@ -0,0 +1,447 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: wherefore
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Explains WHY two datasets differ, in plain English, with eval-backed accuracy claims.
|
|
5
|
+
License: Apache-2.0
|
|
6
|
+
Project-URL: Homepage, https://github.com/tracelore/wherefore
|
|
7
|
+
Project-URL: Repository, https://github.com/tracelore/wherefore
|
|
8
|
+
Project-URL: Issues, https://github.com/tracelore/wherefore/issues
|
|
9
|
+
Keywords: data-diff,data-quality,data-migration,csv,datacompy,etl-testing
|
|
10
|
+
Classifier: License :: OSI Approved :: Apache Software License
|
|
11
|
+
Classifier: Development Status :: 3 - Alpha
|
|
12
|
+
Classifier: Programming Language :: Python :: 3
|
|
13
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
15
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
16
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
17
|
+
Classifier: Intended Audience :: Developers
|
|
18
|
+
Classifier: Topic :: Software Development :: Quality Assurance
|
|
19
|
+
Classifier: Topic :: Database
|
|
20
|
+
Classifier: Environment :: Console
|
|
21
|
+
Classifier: Operating System :: OS Independent
|
|
22
|
+
Requires-Python: >=3.10
|
|
23
|
+
Description-Content-Type: text/markdown
|
|
24
|
+
License-File: LICENSE
|
|
25
|
+
License-File: NOTICE
|
|
26
|
+
Requires-Dist: pandas>=2.2
|
|
27
|
+
Requires-Dist: numpy>=1.24
|
|
28
|
+
Requires-Dist: datacompy>=1.0
|
|
29
|
+
Requires-Dist: pydantic>=2.0
|
|
30
|
+
Requires-Dist: pyyaml>=6.0
|
|
31
|
+
Requires-Dist: typer>=0.12
|
|
32
|
+
Requires-Dist: anthropic>=0.110
|
|
33
|
+
Requires-Dist: rapidfuzz>=3.0
|
|
34
|
+
Requires-Dist: pyarrow>=14.0
|
|
35
|
+
Requires-Dist: openpyxl>=3.1
|
|
36
|
+
Provides-Extra: dev
|
|
37
|
+
Requires-Dist: pytest>=8.0; extra == "dev"
|
|
38
|
+
Requires-Dist: pytest-cov; extra == "dev"
|
|
39
|
+
Requires-Dist: moto[s3]>=5.2; extra == "dev"
|
|
40
|
+
Provides-Extra: s3
|
|
41
|
+
Requires-Dist: boto3>=1.34; extra == "s3"
|
|
42
|
+
Dynamic: license-file
|
|
43
|
+
|
|
44
|
+
# wherefore
|
|
45
|
+
|
|
46
|
+
**Tells you *why* two datasets differ — not just that they do.**
|
|
47
|
+
|
|
48
|
+
[](https://github.com/tracelore/wherefore/actions/workflows/ci.yml)
|
|
49
|
+
[](https://www.apache.org/licenses/LICENSE-2.0)
|
|
50
|
+
[](https://www.python.org/)
|
|
51
|
+
|
|
52
|
+
Data diffing tools tell you *that* 40 rows mismatched in `created_at`.
|
|
53
|
+
None of them tell you *why*. `wherefore` is the layer on top of a diff
|
|
54
|
+
that answers that question — in plain English, with real example rows
|
|
55
|
+
cited — and honestly says "I don't recognize this pattern" rather than
|
|
56
|
+
confidently guessing wrong.
|
|
57
|
+
|
|
58
|
+
**Free and key-free out of the box.** No account, no server, no API
|
|
59
|
+
key required for the diffing and pattern-matching — only the optional
|
|
60
|
+
AI narrative (`--explain`) talks to an external API, and that's off by
|
|
61
|
+
default.
|
|
62
|
+
|
|
63
|
+
```bash
|
|
64
|
+
pip install -e .
|
|
65
|
+
wherefore compare old_export.csv new_export.csv
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
```
|
|
69
|
+
$ wherefore compare old_export.csv new_export.csv --explain
|
|
70
|
+
Calling Claude for 1 cluster(s)...
|
|
71
|
+
Compared 5 source rows against 5 target rows.
|
|
72
|
+
Matched rows: 5
|
|
73
|
+
hire_date: 5 mismatches, matches 'timezone_shift' (confidence 1.00)
|
|
74
|
+
AI: Every affected row is shifted forward by exactly 5 hours,
|
|
75
|
+
consistent with a UTC-vs-local-time mismatch introduced during
|
|
76
|
+
the export. Likely cause: the source system's timestamps were
|
|
77
|
+
re-interpreted in the wrong timezone during migration.
|
|
78
|
+
|
|
79
|
+
Full report written to report.md
|
|
80
|
+
```
|
|
81
|
+
|
|
82
|
+
That's a real run, real output — not a mockup. Try it on your own
|
|
83
|
+
files in two minutes: see [Quickstart](#quickstart).
|
|
84
|
+
|
|
85
|
+
---
|
|
86
|
+
|
|
87
|
+
### Contents
|
|
88
|
+
|
|
89
|
+
[Quickstart](#quickstart) · [Why this exists](#why-this-exists) ·
|
|
90
|
+
[What's built](#whats-built) · [Architecture](#architecture) ·
|
|
91
|
+
[Evals](#evals--why-trust-the-explanations) · [All flags](#all-flags) ·
|
|
92
|
+
[Contributing](#contributing)
|
|
93
|
+
|
|
94
|
+
---
|
|
95
|
+
|
|
96
|
+
## Quickstart
|
|
97
|
+
|
|
98
|
+
```bash
|
|
99
|
+
git clone https://github.com/tracelore/wherefore.git
|
|
100
|
+
cd wherefore
|
|
101
|
+
./dev_setup.sh
|
|
102
|
+
```
|
|
103
|
+
|
|
104
|
+
This creates a `.venv/`, installs everything, and runs the test suite
|
|
105
|
+
(should show **316 passed**, no API key needed — the test suite uses a
|
|
106
|
+
fake AI provider, zero network calls). Safe to re-run.
|
|
107
|
+
|
|
108
|
+
Then, on any two files of yours:
|
|
109
|
+
|
|
110
|
+
```bash
|
|
111
|
+
wherefore compare old_export.csv new_export.csv --output report.md
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
Works the same with `.csv`, `.json`, `.parquet`, or `.xlsx`/`.xls` —
|
|
115
|
+
mix and match freely, format is auto-detected per file. No key column
|
|
116
|
+
needed; `wherefore` finds one. If it picks wrong, or you have many
|
|
117
|
+
table pairs to check at once (a real migration is dozens of tables,
|
|
118
|
+
not one), see [usage details](#usage) below.
|
|
119
|
+
|
|
120
|
+
**Don't have a Parquet or Excel file handy?** Make one from the CSVs
|
|
121
|
+
above in two lines, run inside the same activated `.venv` from
|
|
122
|
+
`dev_setup.sh` (so `pandas`/`pyarrow` are already available) — note
|
|
123
|
+
`parse_dates=[...]` on any datetime column, since Parquet and Excel
|
|
124
|
+
store dates natively and pandas needs to know which column is one
|
|
125
|
+
before writing (without it, the column round-trips as plain text and
|
|
126
|
+
date-based patterns like `timezone_shift` won't be detected —
|
|
127
|
+
confirmed by testing this exact gap):
|
|
128
|
+
|
|
129
|
+
```bash
|
|
130
|
+
python3 -c "
|
|
131
|
+
import pandas as pd
|
|
132
|
+
df = pd.read_csv('old_export.csv', parse_dates=['hire_date']) # name your actual date column
|
|
133
|
+
df.to_parquet('old_export.parquet', index=False)
|
|
134
|
+
df.to_excel('old_export.xlsx', index=False)
|
|
135
|
+
"
|
|
136
|
+
wherefore compare old_export.parquet new_export.parquet # or .xlsx
|
|
137
|
+
```
|
|
138
|
+
|
|
139
|
+
Want the AI explanation, not just the statistical match?
|
|
140
|
+
|
|
141
|
+
```bash
|
|
142
|
+
export ANTHROPIC_API_KEY="sk-ant-..."
|
|
143
|
+
wherefore compare old_export.csv new_export.csv --explain
|
|
144
|
+
```
|
|
145
|
+
|
|
146
|
+
Sensitive-looking values (emails, SSNs, card numbers, phone numbers)
|
|
147
|
+
are redacted before anything is sent — on by default, see
|
|
148
|
+
[Privacy & data handling](#privacy--data-handling).
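As a rough illustration of what pattern-based redaction of this kind can look like, here is a minimal sketch. It assumes simple regular expressions; `REDACTION_PATTERNS` and `redact` are hypothetical names for this example and are not taken from `redaction.py`, whose actual rules may differ.

```python
import re

# Hypothetical patterns, for illustration only. Real detection needs more care
# (international phone formats, Luhn checks on card numbers, and so on).
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(value: str) -> str:
    """Replace every sensitive-looking substring with a bracketed label."""
    for label, pattern in REDACTION_PATTERNS.items():
        value = pattern.sub(f"[{label}]", value)
    return value


if __name__ == "__main__":
    print(redact("contact a.b@example.com or 555-867-5309"))
```

Replacing a match with a label instead of deleting it keeps the shape of the value visible, so a reader of the narrative can still tell which kind of field was involved.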
|
|
149
|
+
|
|
150
|
+
<details>
|
|
151
|
+
<summary>Manual setup, if you'd rather not run the script</summary>
|
|
152
|
+
|
|
153
|
+
```bash
|
|
154
|
+
python3 -m venv .venv
|
|
155
|
+
source .venv/bin/activate # Windows: .venv\Scripts\activate
|
|
156
|
+
pip install --upgrade pip
|
|
157
|
+
pip install -e ".[dev]"
|
|
158
|
+
pytest tests/ -v
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
**Requires Python 3.10+.** Tested on 3.10–3.12. On very recent Python
|
|
162
|
+
(3.14+), pandas/numpy are compatible, but if `pip install` fails, a
|
|
163
|
+
smaller transitive dependency without 3.14 wheels yet is the likely
|
|
164
|
+
cause — try 3.11/3.12 if you hit this.
|
|
165
|
+
</details>
|
|
166
|
+
|
|
167
|
+
## Why this exists
|
|
168
|
+
|
|
169
|
+
Imagine two boxes of identical LEGO sets. Someone copied box A into
|
|
170
|
+
box B, but a few pieces are missing or the wrong color. Most tools
|
|
171
|
+
that check this say: *"12 pieces are different."* That's it.
|
|
172
|
+
|
|
173
|
+
`wherefore` looks at those 12 differences and says: *"These aren't
|
|
174
|
+
random — every one has the same color swapped the same way, consistent
|
|
175
|
+
with a colorblind sort. That's your root cause."* It explains the
|
|
176
|
+
pattern behind the differences, not just the differences themselves.
|
|
177
|
+
|
|
178
|
+
To know if it's actually doing this well — not just sounding plausible
|
|
179
|
+
— we build our own "messed-up" datasets on purpose, with a known,
|
|
180
|
+
labeled answer, and grade whether the tool finds it. That's the
|
|
181
|
+
[eval harness](#evals--why-trust-the-explanations), and it's
|
|
182
|
+
first-class here, not an afterthought.
|
|
183
|
+
|
|
184
|
+
This is not a thin prompt wrapper around an LLM. The AI sits behind a
|
|
185
|
+
deterministic clustering and statistical-signature step, and every
|
|
186
|
+
accuracy claim below is backed by that eval harness against labeled
|
|
187
|
+
ground truth — not vibes.
|
|
188
|
+
|
|
189
|
+
## What's built
|
|
190
|
+
|
|
191
|
+
🚧 Actively built in public. The full pipeline is real, end-to-end:
|
|
192
|
+
statistical detection, AI explanation, and a scored eval harness.
|
|
193
|
+
|
|
194
|
+
| | |
|---|---|
| **Formats** | CSV, JSON, Parquet, Excel — local or `s3://`, auto-detected, mix-and-match |
| **Modes** | One file pair (`compare`) or a whole directory (`compare-dir`) |
| **Taxonomy** | 8 failure patterns built & tested: `timezone_shift`, `truncation`, `enum_drift`, `null_type_coercion`, `float_precision`, `encoding_mismatch`, `dedup_failure`, `key_mismatch` |
| **AI layer** | Verified against the real Claude API twice — manually and via the scored eval harness — 100% match on a small (seven-fixture) sample |
| **Privacy** | Redacts emails/SSNs/cards/phones before any `--explain` call, on by default |
| **Tests** | 316 passing, including an S3 round-trip against a mocked backend and end-to-end runs against real generated files |
|
|
202
|
+
|
|
203
|
+
`dedup_failure` and `key_mismatch` are structurally different from the
|
|
204
|
+
other six — `dedup_failure` detects duplicated rows (re-inserted with
|
|
205
|
+
a new key, not the same key twice); `key_mismatch` detects a row whose
|
|
206
|
+
join key was reformatted (`EMP-1001` vs `EMP1001`) so it never matched
|
|
207
|
+
at all. Both show up as extra/missing rows rather than a column-level
|
|
208
|
+
mismatch, both have their own clustering path
|
|
209
|
+
(`detect_row_presence_patterns`), and both are verified by real,
|
|
210
|
+
dedicated tests — including a regression test confirming they don't
|
|
211
|
+
false-positive on each other's fixtures, and a regression test for a
|
|
212
|
+
real false positive caught while building `key_mismatch` (two
|
|
213
|
+
genuinely unrelated keys sharing a domain's ID prefix scored close
|
|
214
|
+
enough on a similarity heuristic to need a deterministic check
|
|
215
|
+
instead — see `TAXONOMY.md`). Neither is yet wired into the automated
|
|
216
|
+
eval harness above (that harness currently only scores column-mismatch
|
|
217
|
+
patterns) — tracked honestly as a gap, not hidden.
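To make the `key_mismatch` idea concrete, here is a minimal sketch of one possible deterministic check: two unmatched keys are paired only when they are identical after stripping separators and case. `normalize_key` and `pair_reformatted_keys` are hypothetical helpers written for this example; they are not the package's `detect_row_presence_patterns`, and the real rule may be stricter or looser.

```python
import re


def normalize_key(key: str) -> str:
    """Drop case and every non-alphanumeric character from a key."""
    return re.sub(r"[^0-9a-z]", "", key.lower())


def pair_reformatted_keys(missing: list[str], extra: list[str]) -> list[tuple[str, str]]:
    """Pair keys found only in the source with keys found only in the target.

    A pair is reported only when the normalized forms are exactly equal, so
    keys that merely share a prefix are never paired.
    """
    by_normalized = {normalize_key(k): k for k in extra}
    pairs = []
    for key in missing:
        match = by_normalized.get(normalize_key(key))
        if match is not None:
            pairs.append((key, match))
    return pairs


if __name__ == "__main__":
    # EMP-1002 and EMP1003 share a prefix but are unrelated, so they stay unpaired.
    print(pair_reformatted_keys(["EMP-1001", "EMP-1002"], ["EMP1001", "EMP1003"]))
```

An exact comparison after normalization cannot score two different IDs as "close", which is the failure mode a similarity threshold is exposed to.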
|
|
218
|
+
|
|
219
|
+
**Not built yet:** wiring `dedup_failure`/`key_mismatch` into the eval
|
|
220
|
+
harness, more fixture coverage at scale, and database connectivity
|
|
221
|
+
(Postgres, MySQL,
|
|
222
|
+
SQLite). File-based sources — local and `s3://` — and CSV/JSON/Parquet/
|
|
223
|
+
Excel are all supported today. See [`TAXONOMY.md`](https://github.com/tracelore/wherefore/blob/main/TAXONOMY.md) for
|
|
224
|
+
the current pattern list and what's planned next.
|
|
225
|
+
|
|
226
|
+
<details>
|
|
227
|
+
<summary>The harder bugs this surfaced, if you're curious</summary>
|
|
228
|
+
|
|
229
|
+
Building the 4th pattern (`null_type_coercion`) surfaced three real
|
|
230
|
+
bugs spanning the comparison engine, the file loaders, and the eval
|
|
231
|
+
harness itself. Building the 5th (`float_precision`) surfaced a
|
|
232
|
+
subtler one: a magnitude-based heuristic that looked right scored a
|
|
233
|
+
real false positive on an adversarial test case, fixed by checking the
|
|
234
|
+
underlying mechanism (an exact float32 round-trip) directly instead of
|
|
235
|
+
approximating its size. Full account, including how each was found and
|
|
236
|
+
fixed: [`TAXONOMY_TODO.md`](https://github.com/tracelore/wherefore/blob/main/TAXONOMY_TODO.md).
|
|
237
|
+
</details>
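The "check the mechanism, not its size" idea behind `float_precision` can be sketched with the standard library alone. This is an assumption about the general technique, not the package's code; `to_float32` and `is_float32_roundtrip` are hypothetical names.

```python
import struct


def to_float32(x: float) -> float:
    """Round-trip a Python float (float64) through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def is_float32_roundtrip(source: float, target: float) -> bool:
    """True only if target is exactly what source becomes after a float32 cast."""
    return source != target and to_float32(source) == target


if __name__ == "__main__":
    narrowed = to_float32(0.1)
    print(is_float32_roundtrip(0.1, narrowed))    # exact float32 round-trip
    print(is_float32_roundtrip(0.1, 0.1 + 1e-9))  # similar magnitude, different mechanism
```

A difference of similar magnitude that is not the exact float32 image of the source fails the check, which is what a threshold on the size of the error cannot guarantee.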
|
|
238
|
+
|
|
239
|
+
## Architecture
|
|
240
|
+
|
|
241
|
+
```
|
|
242
|
+
source file, target file (CSV, JSON, Parquet, or Excel)
|
|
243
|
+
│
|
|
244
|
+
▼
|
|
245
|
+
loaders + key matching (exact by default; --fuzzy-keys for reformatted keys)
|
|
246
|
+
│
|
|
247
|
+
▼
|
|
248
|
+
comparison engine (wraps datacompy; schema-aware diffing)
|
|
249
|
+
│
|
|
250
|
+
▼
|
|
251
|
+
deterministic clustering (groups mismatches; runs cheap statistical
|
|
252
|
+
│ signature checks — NO causal claims here)
|
|
253
|
+
▼
|
|
254
|
+
candidate pattern(s), confidence-scored
|
|
255
|
+
│
|
|
256
|
+
├─── default: stop here, statistics only (free, no API key)
|
|
257
|
+
│
|
|
258
|
+
▼ with --explain
|
|
259
|
+
AI reasoning layer (Claude; redacts sensitive patterns by default;
|
|
260
|
+
│ writes the causal narrative, cites real rows,
|
|
261
|
+
│ honestly flags "unrecognized" when nothing fits)
|
|
262
|
+
▼
|
|
263
|
+
Markdown report (statistics always; AI narrative alongside
|
|
264
|
+
the evidence, never instead of it)
|
|
265
|
+
```
|
|
266
|
+
|
|
267
|
+
**Failure patterns are data, not code.** Each one is a YAML file under
|
|
268
|
+
`src/wherefore/taxonomy/patterns/`, validated against a strict schema.
|
|
269
|
+
Adding a new pattern means writing a YAML file and a small corruptor
|
|
270
|
+
function — never touching clustering or reasoning code. See
|
|
271
|
+
[`CONTRIBUTING.md`](https://github.com/tracelore/wherefore/blob/main/CONTRIBUTING.md) for the contract.
|
|
272
|
+
|
|
273
|
+
**Clustering and reasoning are deliberately separated.** Clustering
|
|
274
|
+
only ever produces statistical observations ("these 12 rows differ by
|
|
275
|
+
exactly 5 hours"). Causal attribution ("this is a timezone bug") is
|
|
276
|
+
the AI's job, every time — if clustering started asserting causes, the
|
|
277
|
+
AI layer would become decorative and the evals would stop measuring
|
|
278
|
+
anything meaningful.
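A statistical observation of the "differ by exactly 5 hours" kind can be computed without asserting any cause. The sketch below assumes the mismatched values are already parsed into `datetime` pairs; `constant_offset` is a hypothetical helper, not a function from `signatures.py`.

```python
from datetime import datetime, timedelta


def constant_offset(pairs):
    """Return the shared (target - source) offset if every pair has the same one.

    Returns None when the offsets differ or when there are no pairs. The result
    is an observation only; it says nothing about why the shift happened.
    """
    offsets = {target - source for source, target in pairs}
    if len(offsets) == 1:
        return offsets.pop()
    return None


if __name__ == "__main__":
    shifted = [
        (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 14, 0)),
        (datetime(2024, 3, 1, 0, 30), datetime(2024, 3, 1, 5, 30)),
    ]
    print(constant_offset(shifted))
```

Whether a uniform offset means a timezone bug or something else is left to the reasoning step, which keeps the two layers separately testable.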
|
|
279
|
+
|
|
280
|
+
## Evals — why trust the explanations?
|
|
281
|
+
|
|
282
|
+
Because we control the ground truth. The synthetic data generator
|
|
283
|
+
creates clean datasets, then deliberately corrupts them using a known
|
|
284
|
+
failure pattern — recording exactly what it did in a committed
|
|
285
|
+
`ground_truth.json`. The eval harness runs the real pipeline against
|
|
286
|
+
these labeled fixtures and scores the result as precision/recall per
|
|
287
|
+
pattern, tracking "correctly said unrecognized" separately from
|
|
288
|
+
"confidently named the wrong pattern" — very different failure modes a
|
|
289
|
+
naive right/wrong scorer would conflate.
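The scoring idea described above can be sketched as follows. This is a simplified illustration, not the harness in `evals/`: it assumes one predicted label per case, and the `ABSTAIN` label, the outcome names, and `score` itself are hypothetical.

```python
from collections import Counter

ABSTAIN = "unrecognized"  # hypothetical label meaning "no known pattern fits"


def score(cases):
    """Score (true_pattern, predicted_pattern) pairs.

    Returns per-pattern precision/recall plus outcome counts that keep an
    honest abstain, a missed pattern, and a wrong pattern apart.
    """
    tp, fp, fn, outcomes = Counter(), Counter(), Counter(), Counter()
    for truth, predicted in cases:
        if predicted == truth:
            if truth == ABSTAIN:
                outcomes["honest_abstain"] += 1
            else:
                tp[truth] += 1
                outcomes["correct_match"] += 1
            continue
        if truth != ABSTAIN:
            fn[truth] += 1
        if predicted == ABSTAIN:
            outcomes["missed"] += 1
        else:
            fp[predicted] += 1
            outcomes["wrong_pattern"] += 1

    per_pattern = {}
    for pattern in sorted(set(tp) | set(fp) | set(fn)):
        predicted_n = tp[pattern] + fp[pattern]
        actual_n = tp[pattern] + fn[pattern]
        per_pattern[pattern] = {
            # None means the ratio is undefined (nothing predicted / nothing to find).
            "precision": tp[pattern] / predicted_n if predicted_n else None,
            "recall": tp[pattern] / actual_n if actual_n else None,
        }
    return per_pattern, outcomes
```

Counting `wrong_pattern` separately from `missed` is the point: a tool that abstains too often and a tool that guesses confidently would otherwise receive the same score.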
|
|
290
|
+
|
|
291
|
+
**Statistical mode, free, no API key, against all 7 fixtures:**
|
|
292
|
+
|
|
293
|
+
```
|
|
294
|
+
$ python3 -m evals.harness.run_eval
|
|
295
|
+
Total cases: 7
|
|
296
|
+
Overall accuracy (correct match + honest abstain): 100.00%
|
|
297
|
+
|
|
298
|
+
encoding_mismatch: precision=1.00 recall=1.00
|
|
299
|
+
enum_drift: precision=1.00 recall=1.00
|
|
300
|
+
float_precision: precision=1.00 recall=1.00
|
|
301
|
+
null_type_coercion: precision=1.00 recall=1.00
|
|
302
|
+
timezone_shift: precision=1.00 recall=1.00
|
|
303
|
+
truncation: precision=1.00 recall=1.00
|
|
304
|
+
```
|
|
305
|
+
|
|
306
|
+
**LLM mode** (`python3 -m evals.harness.run_eval --llm`, real API
|
|
307
|
+
calls, scores the AI's final answer instead of clustering's raw
|
|
308
|
+
statistics) — **also 100%**, including the one fixture designed to
|
|
309
|
+
test something the statistics alone can't: a cluster that legitimately
|
|
310
|
+
matches two patterns at once, where the AI correctly picked the right
|
|
311
|
+
one by reasoning about the actual values, not by defaulting to
|
|
312
|
+
whichever candidate came first.
|
|
313
|
+
|
|
314
|
+
Both are reproducible — clone the repo, run the commands yourself.
|
|
315
|
+
Seven fixtures prove the *mechanism* works end-to-end against the real
|
|
316
|
+
API; it doesn't prove either layer is bulletproof at scale. That's the
|
|
317
|
+
honest caveat, and expanding fixture coverage is the tracked next step
|
|
318
|
+
in [`TAXONOMY_TODO.md`](https://github.com/tracelore/wherefore/blob/main/TAXONOMY_TODO.md).
|
|
319
|
+
|
|
320
|
+
<details>
|
|
321
|
+
<summary>The multi-candidate case, if you want the detail</summary>
|
|
322
|
+
|
|
323
|
+
`null_type_coercion` and `enum_drift` can legitimately both match the
|
|
324
|
+
same cluster — a null consistently coerced to one sentinel string is,
|
|
325
|
+
statistically, also a "consistent value mapping." Clustering reports
|
|
326
|
+
both honestly rather than guessing which is "more right," since that's
|
|
327
|
+
a causal judgment that belongs to the AI layer, not clustering. The
|
|
328
|
+
eval harness scores this correctly too: a true pattern counts as found
|
|
329
|
+
if it appears anywhere among the reported candidates, not only if it's
|
|
330
|
+
listed first. Full story of how this was found and fixed:
|
|
331
|
+
[`TAXONOMY_TODO.md`](https://github.com/tracelore/wherefore/blob/main/TAXONOMY_TODO.md).
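
The scoring rule described above amounts to a membership check. This is an illustrative sketch only; `pattern_found` is a hypothetical name, not the harness's function.

```python
def pattern_found(true_pattern, candidates):
    """A true pattern counts as found if it appears anywhere in the candidates."""
    return true_pattern in candidates


candidates = ["enum_drift", "null_type_coercion"]
print(pattern_found("null_type_coercion", candidates))  # True
print(candidates[0] == "null_type_coercion")            # False: a first-only rule would miss it
```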
|
|
332
|
+
</details>
|
|
333
|
+
|
|
334
|
+
## Usage
|
|
335
|
+
|
|
336
|
+
### One file pair
|
|
337
|
+
|
|
338
|
+
```bash
|
|
339
|
+
wherefore compare old_export.csv new_export.csv --key employee_id
|
|
340
|
+
```
|
|
341
|
+
|
|
342
|
+
`--key` is optional. Omit it and `wherefore` searches for a column that
|
|
343
|
+
looks like an identifier (mostly-unique values, often named with "id"
|
|
344
|
+
or "key"). If the same record has a differently-formatted key on each
|
|
345
|
+
side (e.g. `EMP-1001` vs `EMP1001`, common after a migration), add
|
|
346
|
+
`--fuzzy-keys`.
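
A rough sketch of both ideas, under stated assumptions: the function names, the 0.95 uniqueness cutoff, and the normalization rule are invented for illustration and are not taken from the package's `key_matching.py`.

```python
import re


def guess_key_column(rows, min_unique_ratio=0.95):
    """Pick a mostly-unique column, preferring names containing 'id' or 'key'."""
    if not rows:
        return None
    candidates = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        unique_ratio = len(set(values)) / len(values)
        if unique_ratio >= min_unique_ratio:
            name_hint = bool(re.search(r"id|key", column, re.IGNORECASE))
            candidates.append((name_hint, unique_ratio, column))
    # Tuples sort by name hint first, then by uniqueness.
    return max(candidates)[2] if candidates else None


def normalize_key(value):
    """Drop case and punctuation so 'EMP-1001' and 'EMP1001' compare equal."""
    return re.sub(r"[^0-9a-z]", "", str(value).lower())


rows = [
    {"employee_id": "EMP-1001", "dept": "ops"},
    {"employee_id": "EMP-1002", "dept": "ops"},
    {"employee_id": "EMP-1003", "dept": "eng"},
]
print(guess_key_column(rows))                                  # employee_id
print(normalize_key("EMP-1001") == normalize_key("EMP1001"))   # True
```

Normalization this aggressive can merge keys that are genuinely different, which is one reason to keep fuzzy matching opt-in.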
|
|
347
|
+
|
|
348
|
+
Files can live in S3, not just on disk — mix and match freely:
|
|
349
|
+
|
|
350
|
+
```bash
|
|
351
|
+
pip install "wherefore[s3]" # boto3 is optional, only needed for s3:// paths
|
|
352
|
+
wherefore compare s3://old-bucket/accounts.csv s3://new-bucket/accounts.csv
|
|
353
|
+
```
|
|
354
|
+
|
|
355
|
+
Uses the standard AWS credential chain (env vars, `~/.aws/credentials`,
|
|
356
|
+
IAM role, `AWS_PROFILE`) — `wherefore` doesn't invent its own.
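
For orientation, this is roughly how a loader can treat `s3://` and local paths uniformly while keeping `boto3` optional. `split_s3_uri` and `read_bytes` are hypothetical helpers, not the package's loader. The `boto3` calls themselves (`client("s3")`, `get_object`) are real, and a client created without explicit credentials falls back to the standard AWS chain.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/file.csv' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_bytes(path):
    """Read a local file or an s3:// object into memory."""
    if path.startswith("s3://"):
        import boto3  # imported lazily so local-only use needs no extra install
        bucket, key = split_s3_uri(path)
        response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        return response["Body"].read()
    with open(path, "rb") as handle:
        return handle.read()


print(split_s3_uri("s3://old-bucket/accounts.csv"))
# prints: ('old-bucket', 'accounts.csv')
```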
|
|
357
|
+
|
|
358
|
+
### A whole migration, not one table
|
|
359
|
+
|
|
360
|
+
```bash
|
|
361
|
+
$ wherefore compare-dir old_exports new_exports --output-dir reports
|
|
362
|
+
Found 3 matching file pair(s). Comparing...
|
|
363
|
+
|
|
364
|
+
[DIFF] accounts.csv: 1 finding(s) (timezone_shift)
|
|
365
|
+
[DIFF] patients.csv: 1 finding(s) (truncation)
|
|
366
|
+
[OK] transactions.csv: no mismatches
|
|
367
|
+
|
|
368
|
+
Done: 3 compared, 0 skipped. Reports written to reports/
|
|
369
|
+
```
|
|
370
|
+
|
|
371
|
+
Files are matched by **identical filename** between the two
|
|
372
|
+
directories — no fuzzy matching at the file level, since guessing
|
|
373
|
+
wrong about *which two tables* you're comparing is worse than guessing
|
|
374
|
+
wrong about a row key. A pair that can't be compared (bad format, no
|
|
375
|
+
detectable key) is skipped and reported, not fatal to the rest of the
|
|
376
|
+
batch. Every `compare` flag works here too, applied to every pair.
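
The pairing and skip-on-failure behavior described above can be sketched as follows. `pair_files`, `compare_all`, and the `compare_one` callback are assumed names for illustration; the actual CLI code may differ.

```python
from pathlib import Path


def pair_files(source_dir, target_dir):
    """Pair files whose names are identical in both directories."""
    source = {p.name: p for p in Path(source_dir).iterdir() if p.is_file()}
    target = {p.name: p for p in Path(target_dir).iterdir() if p.is_file()}
    shared = sorted(source.keys() & target.keys())
    return [(source[name], target[name]) for name in shared]


def compare_all(pairs, compare_one):
    """Run compare_one on every pair; a failing pair is skipped, not fatal."""
    results, skipped = {}, {}
    for old, new in pairs:
        try:
            results[old.name] = compare_one(old, new)
        except Exception as exc:  # broad on purpose: one bad pair must not stop the batch
            skipped[old.name] = str(exc)
    return results, skipped
```

Files present on only one side simply never form a pair, so there is nothing to guess at.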
|
|
377
|
+
|
|
378
|
+
### What you get without an API key
|
|
379
|
+
|
|
380
|
+
Real diffing, real grouping, real pattern matching — and a confidence
|
|
381
|
+
score that's a genuine deterministic measurement (e.g. "every
|
|
382
|
+
mismatched value differs by exactly the same 5-hour delta"), not an AI
|
|
383
|
+
guess. If nothing in the taxonomy matches, `wherefore` says
|
|
384
|
+
`pattern unrecognized` rather than forcing one.
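
One plausible way to get a deterministic confidence of this kind is the share of mismatched rows that agree on the single most common change. This is an assumption about the general approach, not a description of how `wherefore` computes its score; only the 0.9 default threshold comes from the documented `--confidence-threshold` flag.

```python
from collections import Counter


def classify(observed_changes, pattern_name, threshold=0.9):
    """Confidence = share of mismatched rows showing the most common change."""
    if not observed_changes:
        return "pattern unrecognized", 0.0
    _, count = Counter(observed_changes).most_common(1)[0]
    confidence = count / len(observed_changes)
    label = pattern_name if confidence >= threshold else "pattern unrecognized"
    return label, confidence


print(classify(["+5h"] * 12, "timezone_shift"))
# prints: ('timezone_shift', 1.0)
print(classify(["+5h", "-2h", "+17m", "+3d"], "timezone_shift"))
# prints: ('pattern unrecognized', 0.25)
```

Running it twice on the same input gives the same number, which is the property that separates a measurement from a model's guess.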
|
|
385
|
+
|
|
386
|
+
### What `--explain` adds
|
|
387
|
+
|
|
388
|
+
The plain-English *why*, shown **alongside** the statistical evidence
|
|
389
|
+
it was reasoned from, not instead of it — so you can check the claim
|
|
390
|
+
yourself. In testing, the AI correctly identified a genuinely random,
|
|
391
|
+
non-matching corruption and proposed real alternative hypotheses (a
|
|
392
|
+
bad join, a mis-wired column) instead of inventing a pattern that
|
|
393
|
+
wasn't there.
|
|
394
|
+
|
|
395
|
+
## Privacy & data handling
|
|
396
|
+
|
|
397
|
+
`--explain` sends mismatched cell values to the Claude API. Before
|
|
398
|
+
that happens, values are checked against a redaction layer — emails,
|
|
399
|
+
SSNs, credit card numbers, US phone numbers — **on by default, no flag
|
|
400
|
+
needed.** Anything masked is called out in the output
|
|
401
|
+
(`Redacted before sending to Claude: email`). Disable with
|
|
402
|
+
`--no-redact` if you've already vetted your data.
|
|
403
|
+
|
|
404
|
+
Be precise about scope: this is pattern-based detection of
|
|
405
|
+
*structurally recognizable* sensitive data, not a general PII scanner
|
|
406
|
+
— it won't know that a name or a home address is sensitive. Full
|
|
407
|
+
detail, including a documented false-positive case found during
|
|
408
|
+
testing (long numeric IDs can resemble card numbers):
|
|
409
|
+
[`SECURITY.md`](https://github.com/tracelore/wherefore/blob/main/SECURITY.md).
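
A simplified sketch of pattern-based redaction, to show both what it catches and why the false positive happens. The regexes and the `redact` function are illustrative assumptions and are much cruder than a production rule set; they are not the contents of the package's `redaction.py`.

```python
import re

# Order matters: more specific shapes are masked before looser ones.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
    "us_phone": re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
    ),
}


def redact(value):
    """Mask recognizable sensitive substrings; return the text and what was masked."""
    masked = []
    for label, pattern in PATTERNS.items():
        value, hits = pattern.subn(f"[REDACTED:{label}]", value)
        if hits:
            masked.append(label)
    return value, masked


text, masked = redact("contact jane.doe@example.com")
print(text)  # contact [REDACTED:email]
print(f"Redacted before sending to Claude: {', '.join(masked)}")

# A 16-digit internal ID has the same shape as a card number, so it is masked too.
print(redact("9900112233445566"))
# prints: ('[REDACTED:credit_card]', ['credit_card'])
```

A name or street address has no fixed shape for a regex to latch onto, which is why this approach cannot stand in for a general PII scanner.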
|
|
410
|
+
|
|
411
|
+
## All flags
|
|
412
|
+
|
|
413
|
+
<details>
|
|
414
|
+
<summary>Expand</summary>
|
|
415
|
+
|
|
416
|
+
```bash
|
|
417
|
+
wherefore compare SOURCE TARGET [OPTIONS]
|
|
418
|
+
|
|
419
|
+
--key TEXT Join key column. Auto-detected if omitted.
|
|
420
|
+
--fuzzy-keys Allow approximate key matching (e.g. 'CUST-001' vs 'CUST001').
|
|
421
|
+
--output TEXT Where to write the report (default: report.md).
|
|
422
|
+
--confidence-threshold FLOAT Minimum confidence to count as a pattern match (default: 0.9).
|
|
423
|
+
--explain Generate plain-English AI explanations via the Claude API.
|
|
424
|
+
Requires ANTHROPIC_API_KEY. Makes real, billed API calls. Off by default.
|
|
425
|
+
--no-redact Disable automatic redaction of emails/SSNs/cards/phones before
|
|
426
|
+
sending data to Claude with --explain. Redaction is ON by default.
|
|
427
|
+
|
|
428
|
+
wherefore compare-dir SOURCE_DIR TARGET_DIR [OPTIONS]
|
|
429
|
+
|
|
430
|
+
--output-dir TEXT Directory for one report per pair (default: reports).
|
|
431
|
+
--key, --fuzzy-keys, --confidence-threshold, --explain, --no-redact Same as `compare`, applied to every pair.
|
|
432
|
+
```
|
|
433
|
+
</details>
|
|
434
|
+
|
|
435
|
+
## Contributing
|
|
436
|
+
|
|
437
|
+
Contributions are welcome, especially new taxonomy patterns. Start
|
|
438
|
+
with [`CONTRIBUTING.md`](https://github.com/tracelore/wherefore/blob/main/CONTRIBUTING.md) — the pattern contract, why
|
|
439
|
+
patterns are built corruptor-first rather than YAML-first, and the
|
|
440
|
+
design decisions worth knowing before you dig in.
|
|
441
|
+
|
|
442
|
+
Found a security issue? See [`SECURITY.md`](https://github.com/tracelore/wherefore/blob/main/SECURITY.md).
|
|
443
|
+
|
|
444
|
+
## License
|
|
445
|
+
|
|
446
|
+
Apache License 2.0 — see [`LICENSE`](https://github.com/tracelore/wherefore/blob/main/LICENSE). Contributions are
|
|
447
|
+
accepted under the same license (see `NOTICE` for attribution).
|