wherefore 0.1.0__tar.gz

Files changed (53)
  1. wherefore-0.1.0/LICENSE +202 -0
  2. wherefore-0.1.0/NOTICE +5 -0
  3. wherefore-0.1.0/PKG-INFO +447 -0
  4. wherefore-0.1.0/README.md +404 -0
  5. wherefore-0.1.0/pyproject.toml +83 -0
  6. wherefore-0.1.0/setup.cfg +4 -0
  7. wherefore-0.1.0/src/wherefore/__init__.py +0 -0
  8. wherefore-0.1.0/src/wherefore/cli.py +632 -0
  9. wherefore-0.1.0/src/wherefore/clustering/__init__.py +0 -0
  10. wherefore-0.1.0/src/wherefore/clustering/cluster_mismatches.py +344 -0
  11. wherefore-0.1.0/src/wherefore/clustering/signatures.py +446 -0
  12. wherefore-0.1.0/src/wherefore/comparison/__init__.py +0 -0
  13. wherefore-0.1.0/src/wherefore/comparison/diff_engine.py +229 -0
  14. wherefore-0.1.0/src/wherefore/comparison/diff_result.py +151 -0
  15. wherefore-0.1.0/src/wherefore/comparison/key_matching.py +149 -0
  16. wherefore-0.1.0/src/wherefore/comparison/loaders.py +370 -0
  17. wherefore-0.1.0/src/wherefore/reasoning/__init__.py +0 -0
  18. wherefore-0.1.0/src/wherefore/reasoning/explain.py +207 -0
  19. wherefore-0.1.0/src/wherefore/reasoning/prompts/cluster_explanation_v1.md +71 -0
  20. wherefore-0.1.0/src/wherefore/reasoning/providers/__init__.py +0 -0
  21. wherefore-0.1.0/src/wherefore/reasoning/providers/base.py +30 -0
  22. wherefore-0.1.0/src/wherefore/reasoning/providers/claude.py +81 -0
  23. wherefore-0.1.0/src/wherefore/reasoning/redaction.py +118 -0
  24. wherefore-0.1.0/src/wherefore/reasoning/report.py +25 -0
  25. wherefore-0.1.0/src/wherefore/synthetic/__init__.py +0 -0
  26. wherefore-0.1.0/src/wherefore/synthetic/base_dataset.py +259 -0
  27. wherefore-0.1.0/src/wherefore/synthetic/corruptors/__init__.py +0 -0
  28. wherefore-0.1.0/src/wherefore/synthetic/corruptors/dedup_failure.py +68 -0
  29. wherefore-0.1.0/src/wherefore/synthetic/corruptors/encoding_mismatch.py +84 -0
  30. wherefore-0.1.0/src/wherefore/synthetic/corruptors/enum_drift.py +70 -0
  31. wherefore-0.1.0/src/wherefore/synthetic/corruptors/float_precision.py +78 -0
  32. wherefore-0.1.0/src/wherefore/synthetic/corruptors/key_mismatch.py +100 -0
  33. wherefore-0.1.0/src/wherefore/synthetic/corruptors/null_type_coercion.py +84 -0
  34. wherefore-0.1.0/src/wherefore/synthetic/corruptors/timezone_shift.py +73 -0
  35. wherefore-0.1.0/src/wherefore/synthetic/corruptors/truncation.py +72 -0
  36. wherefore-0.1.0/src/wherefore/synthetic/ground_truth.py +83 -0
  37. wherefore-0.1.0/src/wherefore/taxonomy/__init__.py +0 -0
  38. wherefore-0.1.0/src/wherefore/taxonomy/patterns/dedup_failure.yaml +52 -0
  39. wherefore-0.1.0/src/wherefore/taxonomy/patterns/encoding_mismatch.yaml +45 -0
  40. wherefore-0.1.0/src/wherefore/taxonomy/patterns/enum_drift.yaml +48 -0
  41. wherefore-0.1.0/src/wherefore/taxonomy/patterns/float_precision.yaml +51 -0
  42. wherefore-0.1.0/src/wherefore/taxonomy/patterns/key_mismatch.yaml +59 -0
  43. wherefore-0.1.0/src/wherefore/taxonomy/patterns/null_type_coercion.yaml +52 -0
  44. wherefore-0.1.0/src/wherefore/taxonomy/patterns/timezone_shift.yaml +50 -0
  45. wherefore-0.1.0/src/wherefore/taxonomy/patterns/truncation.yaml +44 -0
  46. wherefore-0.1.0/src/wherefore/taxonomy/registry.py +180 -0
  47. wherefore-0.1.0/src/wherefore/taxonomy/schema.py +156 -0
  48. wherefore-0.1.0/src/wherefore.egg-info/PKG-INFO +447 -0
  49. wherefore-0.1.0/src/wherefore.egg-info/SOURCES.txt +51 -0
  50. wherefore-0.1.0/src/wherefore.egg-info/dependency_links.txt +1 -0
  51. wherefore-0.1.0/src/wherefore.egg-info/entry_points.txt +2 -0
  52. wherefore-0.1.0/src/wherefore.egg-info/requires.txt +18 -0
  53. wherefore-0.1.0/src/wherefore.egg-info/top_level.txt +1 -0
wherefore-0.1.0/LICENSE ADDED
@@ -0,0 +1,202 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the
44
+ purposes of this License, Derivative Works shall not include works
45
+ that remain separable from, or merely link (or bind by name) to the
46
+ interfaces of, the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including the
49
+ original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright
52
+ owner or by an individual or Legal Entity authorized to submit on
53
+ behalf of the copyright owner. For the purposes of this definition,
54
+ "submitted" means any form of electronic, verbal, or written
55
+ communication sent to the Licensor or its representatives,
56
+ including but not limited to communication on electronic mailing
57
+ lists, source code control systems, and issue tracking systems
58
+ that are managed by, or on behalf of, the Licensor for the purpose
59
+ of discussing and improving the Work, but excluding communication
60
+ that is conspicuously marked or otherwise designated in writing
61
+ by the copyright owner as "Not a Contribution."
62
+
63
+ "Contributor" shall mean Licensor and any individual or Legal Entity
64
+ on behalf of whom a Contribution has been received by Licensor and
65
+ subsequently incorporated within the Work.
66
+
67
+ 2. Grant of Copyright License. Subject to the terms and conditions of
68
+ this License, each Contributor hereby grants to You a perpetual,
69
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70
+ copyright license to reproduce, prepare Derivative Works of,
71
+ publicly display, publicly perform, sublicense, and distribute the
72
+ Work and such Derivative Works in Source or Object form.
73
+
74
+ 3. Grant of Patent License. Subject to the terms and conditions of
75
+ this License, each Contributor hereby grants to You a perpetual,
76
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77
+ (except as stated in this section) patent license to make, have
78
+ made, use, offer to sell, sell, import, and otherwise transfer the
79
+ Work, where such license applies only to those patent claims
80
+ licensable by such Contributor that are necessarily infringed by
81
+ their Contribution(s) alone or by combination of their
82
+ Contribution(s) with the Work to which such Contribution(s) was
83
+ submitted. If You institute patent litigation against any entity
84
+ (including a cross-claim or counterclaim in a lawsuit) alleging
85
+ that the Work or a Contribution incorporated within the Work
86
+ constitutes direct or contributory patent infringement, then any
87
+ patent licenses granted to You under this License for that Work
88
+ shall terminate as of the date such litigation is filed.
89
+
90
+ 4. Redistribution. You may reproduce and distribute copies of the
91
+ Work or Derivative Works thereof in any medium, with or without
92
+ modifications, and in Source or Object form, provided that You
93
+ meet the following conditions:
94
+
95
+ (a) You must give any other recipients of the Work or
96
+ Derivative Works a copy of this License; and
97
+
98
+ (b) You must cause any modified files to carry prominent notices
99
+ stating that You changed the files; and
100
+
101
+ (c) You must retain, in the Source form of any Derivative Works
102
+ that You distribute, all copyright, patent, trademark, and
103
+ attribution notices from the Source form of the Work,
104
+ excluding those notices that do not pertain to any part of
105
+ the Derivative Works; and
106
+
107
+ (d) If the Work includes a "NOTICE" text file as part of its
108
+ distribution, then any Derivative Works that You distribute must
109
+ include a readable copy of the attribution notices contained
110
+ within such NOTICE file, excluding those notices that do not
111
+ pertain to any part of the Derivative Works, in at least one
112
+ of the following places: within a NOTICE text file distributed
113
+ as part of the Derivative Works; within the Source form or
114
+ documentation, if provided along with the Derivative Works; or,
115
+ within a display generated by the Derivative Works, if and
116
+ wherever such third-party notices normally appear. The contents
117
+ of the NOTICE file are for informational purposes only and
118
+ do not modify the License. You may add Your own attribution
119
+ notices within Derivative Works that You distribute, alongside
120
+ or as an addendum to the NOTICE text from the Work, provided
121
+ that such additional attribution notices cannot be construed
122
+ as modifying the License.
123
+
124
+ You may add Your own copyright statement to Your modifications and
125
+ may provide additional or different license terms and conditions
126
+ for use, reproduction, or distribution of Your modifications, or
127
+ for any such Derivative Works as a whole, provided Your use,
128
+ reproduction, and distribution of the Work otherwise complies with
129
+ the conditions stated in this License.
130
+
131
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
132
+ any Contribution intentionally submitted for inclusion in the Work
133
+ by You to the Licensor shall be under the terms and conditions of
134
+ this License, without any additional terms or conditions.
135
+ Notwithstanding the above, nothing herein shall supersede or modify
136
+ the terms of any separate license agreement you may have executed
137
+ with Licensor regarding such Contributions.
138
+
139
+ 6. Trademarks. This License does not grant permission to use the trade
140
+ names, trademarks, service marks, or product names of the Licensor,
141
+ except as required for reasonable and customary use in describing
142
+ the origin of the Work and reproducing the content of the NOTICE file.
143
+
144
+ 7. Disclaimer of Warranty. Unless required by applicable law or
145
+ agreed to in writing, Licensor provides the Work (and each
146
+ Contributor provides its Contributions) on an "AS IS" BASIS,
147
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148
+ implied, including, without limitation, any warranties or conditions
149
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150
+ PARTICULAR PURPOSE. You are solely responsible for determining the
151
+ appropriateness of using or redistributing the Work and assume any
152
+ risks associated with Your exercise of permissions under this License.
153
+
154
+ 8. Limitation of Liability. In no event and under no legal theory,
155
+ whether in tort (including negligence), contract, or otherwise,
156
+ unless required by applicable law (such as deliberate and grossly
157
+ negligent acts) or agreed to in writing, shall any Contributor be
158
+ liable to You for damages, including any direct, indirect, special,
159
+ incidental, or consequential damages of any character arising as a
160
+ result of this License or out of the use or inability to use the
161
+ Work (including but not limited to damages for loss of goodwill,
162
+ work stoppage, computer failure or malfunction, or any and all
163
+ other commercial damages or losses), even if such Contributor
164
+ has been advised of the possibility of such damages.
165
+
166
+ 9. Accepting Warranty or Additional Liability. While redistributing
167
+ the Work or Derivative Works thereof, You may choose to offer,
168
+ and charge a fee for, acceptance of support, warranty, indemnity,
169
+ or other liability obligations and/or rights consistent with this
170
+ License. However, in accepting such obligations, You may act only
171
+ on Your own behalf and on Your sole responsibility, not on behalf
172
+ of any other Contributor, and only if You agree to indemnify,
173
+ defend, and hold each Contributor harmless for any liability
174
+ incurred by, or claims asserted against, such Contributor by reason
175
+ of your accepting any such warranty or additional liability.
176
+
177
+ END OF TERMS AND CONDITIONS
178
+
179
+ APPENDIX: How to apply the Apache License to your work.
180
+
181
+ To apply the Apache License to your work, attach the following
182
+ boilerplate notice, with the fields enclosed by brackets "[]"
183
+ replaced with your own identifying information. (Don't include
184
+ the brackets!) The text should be enclosed in the appropriate
185
+ comment syntax for the file format. We also recommend that a
186
+ file or class name and description of purpose be included on the
187
+ same "printed page" as the copyright notice for easier
188
+ identification within third-party archives.
189
+
190
+ Copyright 2026 ArunMishra1
191
+
192
+ Licensed under the Apache License, Version 2.0 (the "License");
193
+ you may not use this file except in compliance with the License.
194
+ You may obtain a copy of the License at
195
+
196
+ http://www.apache.org/licenses/LICENSE-2.0
197
+
198
+ Unless required by applicable law or agreed to in writing, software
199
+ distributed under the License is distributed on an "AS IS" BASIS,
200
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201
+ See the License for the specific language governing permissions and
202
+ limitations under the License.
wherefore-0.1.0/NOTICE ADDED
@@ -0,0 +1,5 @@
1
+ wherefore
2
+ Copyright 2026 ArunMishra1
3
+
4
+ This product includes software developed as part of the wherefore project
5
+ (https://github.com/ArunMishra1/wherefore).
wherefore-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,447 @@
1
+ Metadata-Version: 2.4
2
+ Name: wherefore
3
+ Version: 0.1.0
4
+ Summary: Explains WHY two datasets differ, in plain English, with eval-backed accuracy claims.
5
+ License: Apache-2.0
6
+ Project-URL: Homepage, https://github.com/tracelore/wherefore
7
+ Project-URL: Repository, https://github.com/tracelore/wherefore
8
+ Project-URL: Issues, https://github.com/tracelore/wherefore/issues
9
+ Keywords: data-diff,data-quality,data-migration,csv,datacompy,etl-testing
10
+ Classifier: License :: OSI Approved :: Apache Software License
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Programming Language :: Python :: 3
13
+ Classifier: Programming Language :: Python :: 3.10
14
+ Classifier: Programming Language :: Python :: 3.11
15
+ Classifier: Programming Language :: Python :: 3.12
16
+ Classifier: Programming Language :: Python :: 3.13
17
+ Classifier: Intended Audience :: Developers
18
+ Classifier: Topic :: Software Development :: Quality Assurance
19
+ Classifier: Topic :: Database
20
+ Classifier: Environment :: Console
21
+ Classifier: Operating System :: OS Independent
22
+ Requires-Python: >=3.10
23
+ Description-Content-Type: text/markdown
24
+ License-File: LICENSE
25
+ License-File: NOTICE
26
+ Requires-Dist: pandas>=2.2
27
+ Requires-Dist: numpy>=1.24
28
+ Requires-Dist: datacompy>=1.0
29
+ Requires-Dist: pydantic>=2.0
30
+ Requires-Dist: pyyaml>=6.0
31
+ Requires-Dist: typer>=0.12
32
+ Requires-Dist: anthropic>=0.110
33
+ Requires-Dist: rapidfuzz>=3.0
34
+ Requires-Dist: pyarrow>=14.0
35
+ Requires-Dist: openpyxl>=3.1
36
+ Provides-Extra: dev
37
+ Requires-Dist: pytest>=8.0; extra == "dev"
38
+ Requires-Dist: pytest-cov; extra == "dev"
39
+ Requires-Dist: moto[s3]>=5.2; extra == "dev"
40
+ Provides-Extra: s3
41
+ Requires-Dist: boto3>=1.34; extra == "s3"
42
+ Dynamic: license-file
43
+
44
+ # wherefore
45
+
46
+ **Tells you *why* two datasets differ — not just that they do.**
47
+
48
+ [![CI](https://github.com/tracelore/wherefore/actions/workflows/ci.yml/badge.svg)](https://github.com/tracelore/wherefore/actions/workflows/ci.yml)
49
+ [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
50
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
51
+
52
+ Data diffing tools tell you *that* 40 rows mismatched in `created_at`.
53
+ None of them tell you *why*. `wherefore` is the layer on top of a diff
54
+ that answers that question — in plain English, with real example rows
55
+ cited — and honestly says "I don't recognize this pattern" rather than
56
+ confidently guessing wrong.
57
+
58
+ **Free and key-free out of the box.** No account, no server, no API
59
+ key required for the diffing and pattern-matching — only the optional
60
+ AI narrative (`--explain`) talks to an external API, and that's off by
61
+ default.
62
+
63
+ ```bash
64
+ pip install -e .
65
+ wherefore compare old_export.csv new_export.csv
66
+ ```
67
+
68
+ ```
69
+ $ wherefore compare old_export.csv new_export.csv --explain
70
+ Calling Claude for 1 cluster(s)...
71
+ Compared 5 source rows against 5 target rows.
72
+ Matched rows: 5
73
+ hire_date: 5 mismatches, matches 'timezone_shift' (confidence 1.00)
74
+ AI: Every affected row is shifted forward by exactly 5 hours,
75
+ consistent with a UTC-vs-local-time mismatch introduced during
76
+ the export. Likely cause: the source system's timestamps were
77
+ re-interpreted in the wrong timezone during migration.
78
+
79
+ Full report written to report.md
80
+ ```
81
+
82
+ That's a real run, real output — not a mockup. Try it on your own
83
+ files in two minutes: see [Quickstart](#quickstart).
84
+
85
+ ---
86
+
87
+ ### Contents
88
+
89
+ [Quickstart](#quickstart) · [Why this exists](#why-this-exists) ·
90
+ [What's built](#whats-built) · [Architecture](#architecture) ·
91
+ [Evals](#evals--why-trust-the-explanations) · [All flags](#all-flags) ·
92
+ [Contributing](#contributing)
93
+
94
+ ---
95
+
96
+ ## Quickstart
97
+
98
+ ```bash
99
+ git clone https://github.com/tracelore/wherefore.git
100
+ cd wherefore
101
+ ./dev_setup.sh
102
+ ```
103
+
104
+ This creates a `.venv/`, installs everything, and runs the test suite
105
+ (should show **316 passed**, no API key needed — the test suite uses a
106
+ fake AI provider, zero network calls). Safe to re-run.
107
+
108
+ Then, on any two files of yours:
109
+
110
+ ```bash
111
+ wherefore compare old_export.csv new_export.csv --output report.md
112
+ ```
113
+
114
+ Works the same with `.csv`, `.json`, `.parquet`, or `.xlsx`/`.xls` —
115
+ mix and match freely, format is auto-detected per file. No key column
116
+ needed; `wherefore` finds one. If it picks wrong, or you have many
117
+ table pairs to check at once (a real migration is dozens of tables,
118
+ not one), see [usage details](#usage) below.
119
+
120
+ **Don't have a Parquet or Excel file handy?** Make one from the CSVs
121
+ above in two lines, run inside the same activated `.venv` from
122
+ `dev_setup.sh` (so `pandas`/`pyarrow` are already available) — note
123
+ `parse_dates=[...]` on any datetime column, since Parquet and Excel
124
+ store dates natively and pandas needs to know which column is one
125
+ before writing (without it, the column round-trips as plain text and
126
+ date-based patterns like `timezone_shift` won't be detected —
127
+ confirmed by testing this exact gap):
128
+
129
+ ```bash
130
+ python3 -c "
131
+ import pandas as pd
132
+ df = pd.read_csv('old_export.csv', parse_dates=['hire_date']) # name your actual date column
133
+ df.to_parquet('old_export.parquet', index=False)
134
+ df.to_excel('old_export.xlsx', index=False)
135
+ "
136
+ wherefore compare old_export.parquet new_export.parquet # or .xlsx
137
+ ```
138
+
139
+ Want the AI explanation, not just the statistical match?
140
+
141
+ ```bash
142
+ export ANTHROPIC_API_KEY="sk-ant-..."
143
+ wherefore compare old_export.csv new_export.csv --explain
144
+ ```
145
+
146
+ Sensitive-looking values (emails, SSNs, card numbers, phone numbers)
147
+ are redacted before anything is sent — on by default, see
148
+ [Privacy & data handling](#privacy--data-handling).
149
+
150
+ <details>
151
+ <summary>Manual setup, if you'd rather not run the script</summary>
152
+
153
+ ```bash
154
+ python3 -m venv .venv
155
+ source .venv/bin/activate # Windows: .venv\Scripts\activate
156
+ pip install --upgrade pip
157
+ pip install -e ".[dev]"
158
+ pytest tests/ -v
159
+ ```
160
+
161
+ **Requires Python 3.10+.** Tested on 3.10–3.12. On very recent Python
162
+ (3.14+), pandas/numpy are compatible, but if `pip install` fails, a
163
+ smaller transitive dependency without 3.14 wheels yet is the likely
164
+ cause — try 3.11/3.12 if you hit this.
165
+ </details>
166
+
167
+ ## Why this exists
168
+
169
+ Imagine two boxes of identical LEGO sets. Someone copied box A into
170
+ box B, but a few pieces are missing or the wrong color. Most tools
171
+ that check this say: *"12 pieces are different."* That's it.
172
+
173
+ `wherefore` looks at those 12 differences and says: *"These aren't
174
+ random — every one has the same color swapped the same way, consistent
175
+ with a colorblind sort. That's your root cause."* It explains the
176
+ pattern behind the differences, not just the differences themselves.
177
+
178
+ To know if it's actually doing this well — not just sounding plausible
179
+ — we build our own "messed-up" datasets on purpose, with a known,
180
+ labeled answer, and grade whether the tool finds it. That's the
181
+ [eval harness](#evals--why-trust-the-explanations), and it's
182
+ first-class here, not an afterthought.
183
+
184
+ This is not a thin prompt wrapper around an LLM. The AI sits behind a
185
+ deterministic clustering and statistical-signature step, and every
186
+ accuracy claim below is backed by that eval harness against labeled
187
+ ground truth — not vibes.
188
+
189
+ ## What's built
190
+
191
+ 🚧 Actively built in public. The full pipeline is real, end-to-end:
192
+ statistical detection, AI explanation, and a scored eval harness.
193
+
194
+ | | |
195
+ |---|---|
196
+ | **Formats** | CSV, JSON, Parquet, Excel — local or `s3://`, auto-detected, mix-and-match |
197
+ | **Modes** | One file pair (`compare`) or a whole directory (`compare-dir`) |
198
+ | **Taxonomy** | 8 failure patterns built & tested: `timezone_shift`, `truncation`, `enum_drift`, `null_type_coercion`, `float_precision`, `encoding_mismatch`, `dedup_failure`, `key_mismatch` |
199
+ | **AI layer** | Verified against the real Claude API twice — manually and via the scored eval harness — 100% match on a small (seven-fixture) sample |
200
+ | **Privacy** | Redacts emails/SSNs/cards/phones before any `--explain` call, on by default |
201
+ | **Tests** | 316 passing, including a real (mocked) S3 round-trip and end-to-end runs against real generated files |
202
+
203
+ `dedup_failure` and `key_mismatch` are structurally different from the
204
+ other six — `dedup_failure` detects duplicated rows (re-inserted with
205
+ a new key, not the same key twice); `key_mismatch` detects a row whose
206
+ join key was reformatted (`EMP-1001` vs `EMP1001`) so it never matched
207
+ at all. Both show up as extra/missing rows rather than a column-level
208
+ mismatch, both have their own clustering path
209
+ (`detect_row_presence_patterns`), and both are verified by real,
210
+ dedicated tests — including a regression test confirming they don't
211
+ false-positive on each other's fixtures, and a regression test for a
212
+ real false positive caught while building `key_mismatch` (two
213
+ genuinely unrelated keys sharing a domain's ID prefix scored close
214
+ enough on a similarity heuristic to need a deterministic check
215
+ instead — see `TAXONOMY.md`). Neither is yet wired into the automated
216
+ eval harness above (that harness currently only scores column-mismatch
217
+ patterns) — tracked honestly as a gap, not hidden.
218
+
219
+ **Not built yet:** wiring `dedup_failure`/`key_mismatch` into the eval
220
+ harness, more fixture coverage at scale, and database connectivity
221
+ (Postgres, MySQL,
222
+ SQLite). File-based sources — local and `s3://` — and CSV/JSON/Parquet/
223
+ Excel are all supported today. See [`TAXONOMY.md`](https://github.com/tracelore/wherefore/blob/main/TAXONOMY.md) for
224
+ the current pattern list and what's planned next.
225
+
226
+ <details>
227
+ <summary>The harder bugs this surfaced, if you're curious</summary>
228
+
229
+ Building the 4th pattern (`null_type_coercion`) surfaced three real
230
+ bugs spanning the comparison engine, the file loaders, and the eval
231
+ harness itself. Building the 5th (`float_precision`) surfaced a
232
+ subtler one: a magnitude-based heuristic that looked right scored a
233
+ real false positive on an adversarial test case, fixed by checking the
234
+ underlying mechanism (an exact float32 round-trip) directly instead of
235
+ approximating its size. Full account, including how each was found and
236
+ fixed: [`TAXONOMY_TODO.md`](https://github.com/tracelore/wherefore/blob/main/TAXONOMY_TODO.md).
237
+ </details>
238
+
239
+ ## Architecture
240
+
241
+ ```
242
+ source file, target file (CSV, JSON, Parquet, or Excel)
243
+
244
+
245
+ loaders + key matching (exact by default; --fuzzy-keys for reformatted keys)
246
+
247
+
248
+ comparison engine (wraps datacompy; schema-aware diffing)
249
+
250
+
251
+ deterministic clustering (groups mismatches; runs cheap statistical
252
+ │ signature checks — NO causal claims here)
253
+
254
+ candidate pattern(s), confidence-scored
255
+
256
+ ├─── default: stop here, statistics only (free, no API key)
257
+
258
+ ▼ with --explain
259
+ AI reasoning layer (Claude; redacts sensitive patterns by default;
260
+ │ writes the causal narrative, cites real rows,
261
+ │ honestly flags "unrecognized" when nothing fits)
262
+
263
+ Markdown report (statistics always; AI narrative alongside
264
+ the evidence, never instead of it)
265
+ ```
266
+
267
+ **Failure patterns are data, not code.** Each one is a YAML file under
268
+ `src/wherefore/taxonomy/patterns/`, validated against a strict schema.
269
+ Adding a new pattern means writing a YAML file and a small corruptor
270
+ function — never touching clustering or reasoning code. See
271
+ [`CONTRIBUTING.md`](https://github.com/tracelore/wherefore/blob/main/CONTRIBUTING.md) for the contract.
272
+
273
+ **Clustering and reasoning are deliberately separated.** Clustering
274
+ only ever produces statistical observations ("these 12 rows differ by
275
+ exactly 5 hours"). Causal attribution ("this is a timezone bug") is
276
+ the AI's job, every time — if clustering started asserting causes, the
277
+ AI layer would become decorative and the evals would stop measuring
278
+ anything meaningful.
279
+
280
+ ## Evals — why trust the explanations?
281
+
282
+ Because we control the ground truth. The synthetic data generator
283
+ creates clean datasets, then deliberately corrupts them using a known
284
+ failure pattern — recording exactly what it did in a committed
285
+ `ground_truth.json`. The eval harness runs the real pipeline against
286
+ these labeled fixtures and scores the result as precision/recall per
287
+ pattern, tracking "correctly said unrecognized" separately from
288
+ "confidently named the wrong pattern" — very different failure modes a
289
+ naive right/wrong scorer would conflate.
290
+
291
+ **Statistical mode, free, no API key, against all 7 fixtures:**
292
+
293
+ ```
294
+ $ python3 -m evals.harness.run_eval
295
+ Total cases: 7
296
+ Overall accuracy (correct match + honest abstain): 100.00%
297
+
298
+ encoding_mismatch: precision=1.00 recall=1.00
299
+ enum_drift: precision=1.00 recall=1.00
300
+ float_precision: precision=1.00 recall=1.00
301
+ null_type_coercion: precision=1.00 recall=1.00
302
+ timezone_shift: precision=1.00 recall=1.00
303
+ truncation: precision=1.00 recall=1.00
304
+ ```
305
+
306
+ **LLM mode** (`python3 -m evals.harness.run_eval --llm`, real API
307
+ calls, scores the AI's final answer instead of clustering's raw
308
+ statistics) — **also 100%**, including the one fixture designed to
309
+ test something the statistics alone can't: a cluster that legitimately
310
+ matches two patterns at once, where the AI correctly picked the right
311
+ one by reasoning about the actual values, not by defaulting to
312
+ whichever candidate came first.
313
+
314
+ Both are reproducible — clone the repo, run the commands yourself.
315
+ Seven fixtures proves the *mechanism* works end-to-end against the real
316
+ API; it doesn't prove either layer is bulletproof at scale. That's the
317
+ honest caveat, and expanding fixture coverage is the tracked next step
318
+ in [`TAXONOMY_TODO.md`](https://github.com/tracelore/wherefore/blob/main/TAXONOMY_TODO.md).
319
+
320
+ <details>
321
+ <summary>The multi-candidate case, if you want the detail</summary>
322
+
323
+ `null_type_coercion` and `enum_drift` can legitimately both match the
324
+ same cluster — a null consistently coerced to one sentinel string is,
325
+ statistically, also a "consistent value mapping." Clustering reports
326
+ both honestly rather than guessing which is "more right," since that's
327
+ a causal judgment that belongs to the AI layer, not clustering. The
328
+ eval harness scores this correctly too: a true pattern counts as found
329
+ if it appears anywhere among the reported candidates, not only if it's
330
+ listed first. Full story of how this was found and fixed:
331
+ [`TAXONOMY_TODO.md`](https://github.com/tracelore/wherefore/blob/main/TAXONOMY_TODO.md).
332
+ </details>
333
+
334
+ ## Usage
335
+
336
+ ### One file pair
337
+
338
+ ```bash
339
+ wherefore compare old_export.csv new_export.csv --key employee_id
340
+ ```
341
+
342
+ `--key` is optional — omit it and `wherefore` looks for a column that
343
+ looks like an identifier (mostly-unique values, often named with "id"
344
+ or "key"). If the same record has a differently-formatted key on each
345
+ side (e.g. `EMP-1001` vs `EMP1001`, common after a migration), add
346
+ `--fuzzy-keys`.
347
+
348
+ Files can live in S3, not just on disk — mix and match freely:
349
+
350
+ ```bash
351
+ pip install "wherefore[s3]" # boto3 is optional, only needed for s3:// paths
352
+ wherefore compare s3://old-bucket/accounts.csv s3://new-bucket/accounts.csv
+ ```
+
+ S3 access uses the standard AWS credential chain (env vars, `~/.aws/credentials`,
+ IAM role, `AWS_PROFILE`) — `wherefore` doesn't invent its own.
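
For readers curious how mixed local and `s3://` paths might be handled, here is a minimal sketch. `split_s3_uri` and `read_bytes` are hypothetical helpers, not `wherefore`'s loader; the only real API used is boto3's `client("s3").get_object`, which resolves credentials through the standard chain on its own.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/path/to/file.csv' into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid s3:// URI: {uri!r}")
    return parsed.netloc, key


def read_bytes(location: str) -> bytes:
    """Read a file from S3 or from local disk, depending on the prefix."""
    if location.startswith("s3://"):
        import boto3  # optional dependency, imported only for s3:// paths

        bucket, key = split_s3_uri(location)
        # No credentials are passed here: boto3 finds them via env vars,
        # ~/.aws/credentials, an IAM role, or AWS_PROFILE.
        response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        return response["Body"].read()
    return Path(location).read_bytes()
```

Importing `boto3` inside the S3 branch keeps it optional, so local-only comparisons work without the `s3` extra installed.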
+
+ ### A whole migration, not one table
+
+ ```bash
+ $ wherefore compare-dir old_exports new_exports --output-dir reports
+ Found 3 matching file pair(s). Comparing...
+
+ [DIFF] accounts.csv: 1 finding(s) (timezone_shift)
+ [DIFF] patients.csv: 1 finding(s) (truncation)
+ [OK] transactions.csv: no mismatches
+
+ Done: 3 compared, 0 skipped. Reports written to reports/
+ ```
+
+ Files are matched by **identical filename** between the two
+ directories — no fuzzy matching at the file level, since guessing
+ wrong about *which two tables* you're comparing is worse than guessing
+ wrong about a row key. A pair that can't be compared (bad format, no
+ detectable key) is skipped and reported, not fatal to the rest of the
+ batch. Every `compare` flag except `--output` works here too (`compare-dir` takes `--output-dir` instead), applied to every pair.
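
The pairing and skip-on-failure behaviour described above can be sketched as follows. `pair_files` and `compare_all` are hypothetical names, and using `ValueError` to signal an uncomparable pair is an assumption made for the example, not `wherefore`'s actual error type.

```python
from pathlib import Path


def pair_files(source_dir: str, target_dir: str):
    """Pair files whose names are identical in both directories.

    Returns (pairs, unmatched); unmatched lists names found on one side only.
    No fuzzy matching is attempted at the file level.
    """
    source = {p.name: p for p in Path(source_dir).iterdir() if p.is_file()}
    target = {p.name: p for p in Path(target_dir).iterdir() if p.is_file()}
    shared = sorted(source.keys() & target.keys())
    unmatched = sorted(source.keys() ^ target.keys())
    return [(source[name], target[name]) for name in shared], unmatched


def compare_all(pairs, compare_one):
    """Run compare_one on each pair; a failing pair is skipped, not fatal."""
    compared, skipped = [], []
    for old, new in pairs:
        try:
            compared.append((old.name, compare_one(old, new)))
        except ValueError as exc:  # e.g. bad format or no detectable key
            skipped.append((old.name, str(exc)))
    return compared, skipped
```

Catching the failure per pair is what lets one unreadable table be reported while the rest of the batch still completes.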
+
+ ### What you get without an API key
+
+ Real diffing, real grouping, real pattern matching — and a confidence
+ score that's a genuine deterministic measurement (e.g. "every
+ mismatched value differs by exactly the same 5-hour delta"), not an AI
+ guess. If nothing in the taxonomy matches, `wherefore` says
+ `pattern unrecognized` rather than forcing one.
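
To make "deterministic measurement" concrete, here is one way such a confidence score could be computed for the constant-delta example. This is a sketch under stated assumptions: `constant_delta_confidence` and `classify` are hypothetical, and the real taxonomy checks may be more involved. The 0.9 default mirrors the documented `--confidence-threshold` default.

```python
from collections import Counter
from datetime import datetime, timedelta


def constant_delta_confidence(pairs):
    """Return (most common delta, share of pairs showing exactly that delta)."""
    if not pairs:
        return None, 0.0
    deltas = Counter(new - old for old, new in pairs)
    delta, count = deltas.most_common(1)[0]
    return delta, count / len(pairs)


def classify(pairs, threshold=0.9):
    """Name the pattern only when the measured confidence clears the threshold."""
    _, confidence = constant_delta_confidence(pairs)
    if confidence >= threshold:
        return "timezone_shift", confidence
    return "pattern unrecognized", confidence


# Four mismatched timestamps, each exactly 5 hours later on the new side.
old = datetime(2024, 3, 1, 9, 0)
pairs = [(old + timedelta(days=d), old + timedelta(days=d, hours=5)) for d in range(4)]
print(classify(pairs))
```

The score is a plain count over the mismatched values, so running it twice on the same files gives the same number.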
+
+ ### What `--explain` adds
+
+ The plain-English *why*, shown **alongside** the statistical evidence
+ it was reasoned from, not instead of it — so you can check the claim
+ yourself. In testing, the AI correctly identified a genuinely random,
+ non-matching corruption and proposed real alternative hypotheses (a
+ bad join, a mis-wired column) instead of inventing a pattern that
+ wasn't there.
+
+ ## Privacy & data handling
+
+ `--explain` sends mismatched cell values to the Claude API. Before
+ that happens, values pass through a redaction layer that masks emails,
+ SSNs, credit card numbers, and US phone numbers. It is **on by default, no
+ flag needed.** Anything masked is called out in the output
+ (`Redacted before sending to Claude: email`). Disable with
+ `--no-redact` if you've already vetted your data.
+
+ To be precise about scope: this is pattern-based detection of
+ *structurally recognizable* sensitive data, not a general PII scanner
+ — it won't know that a name or a home address is sensitive. Full
+ detail, including a documented false-positive case found during
+ testing (long numeric IDs can resemble card numbers):
+ [`SECURITY.md`](https://github.com/tracelore/wherefore/blob/main/SECURITY.md).
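
A minimal sketch of pattern-based redaction, to show both what this kind of layer catches and why it can misfire. The regexes, the `[KIND]` mask tokens, and the `redact` function are all assumptions made for illustration; `wherefore`'s actual rules live in its own redaction module and may differ.

```python
import re

# Illustrative patterns only; the real rules may be stricter or looser.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "phone": re.compile(r"(?<!\d)(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"),
}


def redact(value: str) -> tuple[str, list[str]]:
    """Mask every match and report which kinds of data were masked."""
    masked_kinds = []
    for kind, pattern in PATTERNS.items():
        value, hits = pattern.subn(f"[{kind.upper()}]", value)
        if hits:
            masked_kinds.append(kind)
    return value, masked_kinds


text, kinds = redact("Contact ada@example.com or 555-123-4567")
print(text)
print("Redacted before sending to Claude:", ", ".join(kinds))

# False positive: a 16-digit internal ID is structurally identical to a card number.
print(redact("1234567890123456"))

# Out of scope: nothing structural marks a name or street address as sensitive.
print(redact("Ada Lovelace, 12 Example Street"))
```

The last two calls show both sides of the scope caveat: structure alone cannot tell an ID from a card number, and it cannot see sensitivity that has no structure.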
+
+ ## All flags
+
+ <details>
+ <summary>Expand</summary>
+
+ ```bash
+ wherefore compare SOURCE TARGET [OPTIONS]
+
+   --key TEXT                     Join key column. Auto-detected if omitted.
+   --fuzzy-keys                   Allow approximate key matching (e.g. 'CUST-001' vs 'CUST001').
+   --output TEXT                  Where to write the report (default: report.md).
+   --confidence-threshold FLOAT   Minimum confidence to count as a pattern match (default: 0.9).
+   --explain                      Generate plain-English AI explanations via the Claude API.
+                                  Requires ANTHROPIC_API_KEY. Makes real, billed API calls. Off by default.
+   --no-redact                    Disable automatic redaction of emails/SSNs/cards/phones before
+                                  sending data to Claude with --explain. Redaction is ON by default.
+
+ wherefore compare-dir SOURCE_DIR TARGET_DIR [OPTIONS]
+
+   --output-dir TEXT              Directory for one report per pair (default: reports).
+   --key, --fuzzy-keys, --confidence-threshold, --explain, --no-redact
+                                  Same as `compare`, applied to every pair.
+ ```
+ </details>
+
+ ## Contributing
+
+ Contributions are welcome, especially new taxonomy patterns. Start
+ with [`CONTRIBUTING.md`](https://github.com/tracelore/wherefore/blob/main/CONTRIBUTING.md), which covers the pattern contract, why
+ patterns are built corruptor-first rather than YAML-first, and the
+ design decisions worth knowing before you dig in.
+
+ Found a security issue? See [`SECURITY.md`](https://github.com/tracelore/wherefore/blob/main/SECURITY.md).
+
+ ## License
+
+ Apache License 2.0 — see [`LICENSE`](https://github.com/tracelore/wherefore/blob/main/LICENSE). Contributions are
+ accepted under the same license (see `NOTICE` for attribution).