pipedog 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
pipedog-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Jishn
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
pipedog-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,311 @@
+ Metadata-Version: 2.4
+ Name: pipedog
+ Version: 0.1.0
+ Summary: CLI tool for data quality checks and schema drift detection on CSV, Parquet, and JSON files
+ License: MIT
+ License-File: LICENSE
+ Keywords: data-quality,schema-drift,data-validation,cli,csv,parquet,pandas
+ Author: Jishn
+ Author-email: you@example.com
+ Requires-Python: >=3.9,<4.0
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Environment :: Console
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Programming Language :: Python :: 3.14
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Requires-Dist: duckdb (>=0.10.0,<0.11.0)
+ Requires-Dist: pandas (>=2.2.0,<3.0.0)
+ Requires-Dist: pyarrow (>=15.0.0,<16.0.0)
+ Requires-Dist: pydantic (>=2.6.0,<3.0.0)
+ Requires-Dist: rich (>=13.7.0,<14.0.0)
+ Requires-Dist: typer[all] (>=0.12.0,<0.13.0)
+ Project-URL: Homepage, https://github.com/JKK-Jishnu/pipedog
+ Project-URL: Repository, https://github.com/JKK-Jishnu/pipedog
+ Description-Content-Type: text/markdown
+
+ # Pipedog
+
+ An open source data quality and schema drift detection tool for analysts and data engineers. Point it at a CSV, Parquet, or JSON file and it will profile the data, auto-generate quality checks, and alert you the moment something changes.
+
+ ---
+
+ ## Why Pipedog?
+
+ Data pipelines break silently. A column gets renamed upstream, nulls creep into a field that was always clean, a price column suddenly contains strings. These issues reach production before anyone notices.
+
+ Pipedog solves this by:
+ - **Taking a snapshot** of your data's structure and statistics on day one.
+ - **Scanning every new file** against that snapshot and failing loudly when something drifts.
+ - **Explaining what went wrong** in plain English, not stack traces.
+
+ ---
+
+ ## Installation
+
+ ### With pip (quickest)
+
+ ```bash
+ pip install pipedog
+ ```
+
+ ### With Poetry (for development)
+
+ ```bash
+ git clone https://github.com/JKK-Jishnu/pipedog.git
+ cd pipedog
+ poetry install
+ ```
+
+ ### Dependencies
+
+ | Package  | Purpose                              |
+ |----------|--------------------------------------|
+ | typer    | CLI framework                        |
+ | rich     | Coloured terminal output             |
+ | pandas   | File reading (CSV, Parquet, JSON)    |
+ | pyarrow  | Parquet support for pandas           |
+ | duckdb   | SQL engine (reserved for future use) |
+ | pydantic | Schema validation and JSON I/O       |
+
+ ---
+
+ ## Quick Start
+
+ ```bash
+ # 1. Profile your file and save a baseline snapshot
+ pipedog init data/orders.csv
+
+ # 2. Tomorrow, when a new file arrives, scan it
+ pipedog scan data/orders_new.csv
+
+ # 3. Explore any file without saving anything
+ pipedog profile data/orders.csv
+ ```
+
+ ---
+
+ ## Commands
+
+ ### `pipedog init <file>`
+
+ Profiles the file and saves two files to `.pipedog/`:
+
+ - **`.pipedog/schema.json`** — column names, types, null stats, value ranges, timestamps.
+ - **`.pipedog/checks.json`** — auto-generated quality rules derived from the baseline.
+
+ ```bash
+ pipedog init sample_data/orders.csv
+ ```
+
+ **What gets auto-generated:**
+
+ | Rule        | When generated                                            | Severity |
+ |-------------|-----------------------------------------------------------|----------|
+ | `not_null`  | Column had zero nulls at init time                        | error    |
+ | `null_rate` | Column had some nulls; threshold = baseline null % + 10 points | warning  |
+ | `min_value` | Numeric column; locks in the observed min                 | error    |
+ | `max_value` | Numeric column; locks in the observed max                 | error    |
+ | `unique`    | Every value was distinct (looks like a key)               | error    |
+
+ Re-running `init` refreshes the baseline to the current file.
+
+ ---
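The rules in the table above could be derived roughly like this. This is a minimal sketch with a hypothetical `generate_checks` signature and the column-stat field names from the `schema.json` example later in this README; the real implementation in `pipedog/profiler.py` may differ.

```python
def generate_checks(columns, row_count):
    """Sketch: derive quality rules from baseline column stats.

    `columns` is a list of dicts with the fields shown in the
    schema.json example (null_count, null_pct, min_value, ...).
    """
    checks = []
    for col in columns:
        name = col["name"]
        if col["null_count"] == 0:
            checks.append({"rule": "not_null", "column": name, "severity": "error"})
        else:
            # Allow the observed null rate plus a 10-point cushion.
            checks.append({"rule": "null_rate", "column": name,
                           "threshold": col["null_pct"] + 10, "severity": "warning"})
        if col.get("min_value") is not None:
            checks.append({"rule": "min_value", "column": name,
                           "value": col["min_value"], "severity": "error"})
        if col.get("max_value") is not None:
            checks.append({"rule": "max_value", "column": name,
                           "value": col["max_value"], "severity": "error"})
        if col["unique_count"] == row_count:  # every value distinct: looks like a key
            checks.append({"rule": "unique", "column": name, "severity": "error"})
    return checks
```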
+
+ ### `pipedog scan <file>`
+
+ Compares the file against the baseline and runs all quality checks.
+
+ ```bash
+ pipedog scan sample_data/orders.csv
+ ```
+
+ **Exit codes:**
+ - `0` — all checks passed (warnings are allowed).
+ - `1` — one or more error-severity checks failed.
+
+ This makes `pipedog scan` CI/CD friendly — run it as a build step and it will fail the pipeline when data quality breaks.
+
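The exit-code rule above reduces to one condition: only failed checks with error severity fail the scan. A sketch with a hypothetical result shape (not the actual scanner code):

```python
def scan_exit_code(results):
    """Return 0 unless any failed check has severity 'error'.

    `results` is a list of dicts like
    {"passed": bool, "severity": "error" | "warning"} —
    a hypothetical shape for illustration only.
    """
    # Warnings never fail the scan; error-severity failures do.
    failed_errors = [r for r in results
                     if not r["passed"] and r["severity"] == "error"]
    return 1 if failed_errors else 0
```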
+ **What gets checked:**
+
+ 1. **Schema drift** — were columns added, removed, or changed type?
+ 2. **Quality checks** — do null rates, value ranges, and uniqueness still match the baseline?
+
+ **Output example (all passing):**
+ ```
+ +------------------------------- Pipedog Scan --------------------------------+
+ | ALL CHECKS PASSED                                                           |
+ | 10 rows · 7 columns · 17 passed · 0 warnings · 0 failed                     |
+ +-----------------------------------------------------------------------------+
+
+ Passed Checks
+   PASS  No nulls found in 'order_id'.
+   PASS  'price' maximum is 149.99, within baseline maximum of 149.99.
+   ...
+ ```
+
+ **Output example (failure):**
+ ```
+ +------------------------------- Pipedog Scan --------------------------------+
+ | CHECKS FAILED                                                               |
+ | 12 rows · 6 columns · 14 passed · 0 warnings · 2 failed                     |
+ +-----------------------------------------------------------------------------+
+
+ Schema Drift Detected
+   FAIL  Column 'status' existed in the baseline but is missing from the current file.
+
+ Failed Checks
+   FAIL  'order_id' has 2 null value(s) (16.67% of rows).
+ ```
+
+ ---
+
+ ### `pipedog profile <file>`
+
+ Shows a data summary without saving anything to disk. Useful for exploring a file before committing to a baseline.
+
+ ```bash
+ pipedog profile sample_data/orders.csv
+ ```
+
+ **Output includes:**
+ - Total row and column count.
+ - Per-column type, null count, null percentage, unique count.
+ - Min and max for numeric columns.
+ - Up to 3 sample values per column.
+
+ ---
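The per-column stats listed above are cheap to compute. A pure-Python sketch for a single column (the real profiler works on pandas DataFrames and also infers a dtype; `profile_column` is a hypothetical name):

```python
def profile_column(values, n_samples=3):
    """Sketch: compute the stats `pipedog profile` reports for one column.

    `values` is a plain list; None represents a null cell.
    """
    non_null = [v for v in values if v is not None]
    null_count = len(values) - len(non_null)
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) for v in non_null)
    return {
        "null_count": null_count,
        "null_pct": round(100 * null_count / len(values), 2) if values else 0.0,
        "unique_count": len(set(non_null)),
        "min_value": min(non_null) if is_numeric else None,
        "max_value": max(non_null) if is_numeric else None,
        "sample_values": non_null[:n_samples],  # up to 3 sample values
    }
```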
+
+ ## Supported File Types
+
+ | Extension        | Format  |
+ |------------------|---------|
+ | `.csv`           | CSV     |
+ | `.parquet` `.pq` | Parquet |
+ | `.json`          | JSON    |
+
+ File type is detected automatically from the extension.
+
+ ---
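The extension detection above amounts to a suffix lookup. A sketch with a hypothetical `detect_format` helper (the real loader then hands the path to the matching pandas reader):

```python
from pathlib import Path

# Suffix-to-format table matching the Supported File Types section.
FORMATS = {".csv": "csv", ".parquet": "parquet", ".pq": "parquet", ".json": "json"}

def detect_format(path):
    """Map a file path to one of the supported formats, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or path!r}")
```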
+
+ ## How It Works
+
+ ```
+ pipedog init orders.csv
+
+ ├─ load_file()           reads CSV/Parquet/JSON into a DataFrame
+ ├─ profile_dataframe()   computes stats for every column
+ ├─ generate_checks()     auto-generates quality rules from the stats
+ └─ save_snapshot()       writes .pipedog/schema.json + checks.json
+
+ pipedog scan orders_new.csv
+
+ ├─ load_file()           reads the new file
+ ├─ load_snapshot()       loads baseline from .pipedog/
+ ├─ profile_dataframe()   profiles the new file
+ ├─ detect_drift()        compares column structure
+ ├─ run_quality_checks()  evaluates every rule
+ └─ print_scan_results()  renders colour-coded report, returns exit code
+ ```
+
+ ---
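The `detect_drift()` step in the flow above boils down to comparing two column-name-to-type mappings. A sketch under assumed shapes (the real implementation compares richer schema objects, not bare dicts):

```python
def detect_drift(baseline, current):
    """Sketch: compare {column_name: dtype} mappings and report drift.

    Returns columns that were added, removed, or changed type —
    the three drift cases the scan reports.
    """
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    changed = sorted(c for c in set(baseline) & set(current)
                     if baseline[c] != current[c])
    return {"added": added, "removed": removed, "type_changed": changed}
```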
+
+ ## Project Structure
+
+ ```
+ pipedog/
+ ├── pyproject.toml       # Poetry config and PyPI metadata
+ ├── README.md            # This file
+ ├── sample_data/
+ │   └── orders.csv       # Example file to test with
+ └── pipedog/
+     ├── __init__.py      # Package version
+     ├── main.py          # CLI commands (init, scan, profile)
+     ├── schema.py        # Pydantic models (ColumnSchema, DataSchema, etc.)
+     ├── profiler.py      # File loading, type inference, statistical profiling
+     ├── scanner.py       # Drift detection and quality check evaluation
+     └── output.py        # Rich terminal output (tables, panels, colours)
+ ```
+
+ ---
+
+ ## Snapshot Files
+
+ After running `pipedog init`, a `.pipedog/` directory is created:
+
+ ```
+ .pipedog/
+ ├── schema.json      # baseline column statistics
+ └── checks.json      # auto-generated quality rules
+ ```
+
+ These files are plain JSON and human-readable. You can commit them to version control to track schema changes over time, or add `.pipedog/` to `.gitignore` to keep them local.
+
+ **Example `.pipedog/schema.json`:**
+ ```json
+ {
+   "file": "/data/orders.csv",
+   "row_count": 10,
+   "column_count": 7,
+   "columns": [
+     {
+       "name": "order_id",
+       "dtype": "integer",
+       "nullable": false,
+       "null_count": 0,
+       "null_pct": 0.0,
+       "unique_count": 10,
+       "sample_values": [1, 2, 3],
+       "min_value": 1.0,
+       "max_value": 10.0,
+       "mean_value": 5.5
+     }
+   ],
+   "captured_at": "2026-03-26T18:34:20.123456+00:00"
+ }
+ ```
+
+ ---
+
+ ## CI/CD Integration
+
+ Because `pipedog scan` exits with code `1` on failure, it drops straight into any CI pipeline:
+
+ **GitHub Actions:**
+ ```yaml
+ - name: Check data quality
+   run: pipedog scan data/daily_export.csv
+ ```
+
+ **Makefile:**
+ ```makefile
+ check:
+ 	pipedog scan data/daily_export.csv
+ ```
+
+ ---
+
+ ## Roadmap
+
+ - [ ] `pipedog diff` — side-by-side comparison of two snapshots
+ - [ ] Custom checks via `checks.json` (regex patterns, allowed value sets)
+ - [ ] JSON Lines (`.jsonl`) support
+ - [ ] `--output json` flag for machine-readable scan results
+ - [ ] Excel (`.xlsx`) support
+ - [ ] Slack / webhook notifications on failure
+
+ ---
+
+ ## License
+
+ MIT
+
pipedog-0.1.0/README.md ADDED
@@ -0,0 +1,275 @@
+ (Identical to the README embedded in PKG-INFO above.)
pipedog-0.1.0/pipedog/__init__.py ADDED
@@ -0,0 +1 @@
+ __version__ = "0.1.0"