pipedog 0.1.0__tar.gz
- pipedog-0.1.0/LICENSE +21 -0
- pipedog-0.1.0/PKG-INFO +311 -0
- pipedog-0.1.0/README.md +275 -0
- pipedog-0.1.0/pipedog/__init__.py +1 -0
- pipedog-0.1.0/pipedog/main.py +174 -0
- pipedog-0.1.0/pipedog/output.py +225 -0
- pipedog-0.1.0/pipedog/profiler.py +367 -0
- pipedog-0.1.0/pipedog/scanner.py +267 -0
- pipedog-0.1.0/pipedog/schema.py +149 -0
- pipedog-0.1.0/pyproject.toml +44 -0
pipedog-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Jishn

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
pipedog-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,311 @@
Metadata-Version: 2.4
Name: pipedog
Version: 0.1.0
Summary: CLI tool for data quality checks and schema drift detection on CSV, Parquet, and JSON files
License: MIT
License-File: LICENSE
Keywords: data-quality,schema-drift,data-validation,cli,csv,parquet,pandas
Author: Jishn
Author-email: you@example.com
Requires-Python: >=3.9,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: duckdb (>=0.10.0,<0.11.0)
Requires-Dist: pandas (>=2.2.0,<3.0.0)
Requires-Dist: pyarrow (>=15.0.0,<16.0.0)
Requires-Dist: pydantic (>=2.6.0,<3.0.0)
Requires-Dist: rich (>=13.7.0,<14.0.0)
Requires-Dist: typer[all] (>=0.12.0,<0.13.0)
Project-URL: Homepage, https://github.com/JKK-Jishnu/pipedog
Project-URL: Repository, https://github.com/JKK-Jishnu/pipedog
Description-Content-Type: text/markdown

# Pipedog

An open source data quality and schema drift detection tool for analysts and data engineers. Point it at a CSV, Parquet, or JSON file and it will profile the data, auto-generate quality checks, and alert you the moment something changes.

---

## Why Pipedog?

Data pipelines break silently. A column gets renamed upstream, nulls creep into a field that was always clean, a price column suddenly contains strings. These issues reach production before anyone notices.

Pipedog solves this by:
- **Taking a snapshot** of your data's structure and statistics on day one.
- **Scanning every new file** against that snapshot and failing loudly when something drifts.
- **Explaining what went wrong** in plain English, not stack traces.

---

## Installation

### With pip (quickest)

```bash
pip install pipedog
```

### With Poetry (for development)

```bash
git clone https://github.com/JKK-Jishnu/pipedog.git
cd pipedog
poetry install
```

### Dependencies

| Package  | Purpose                              |
|----------|--------------------------------------|
| typer    | CLI framework                        |
| rich     | Coloured terminal output             |
| pandas   | File reading (CSV, Parquet, JSON)    |
| pyarrow  | Parquet support for pandas           |
| duckdb   | SQL engine (reserved for future use) |
| pydantic | Schema validation and JSON I/O       |

---

## Quick Start

```bash
# 1. Profile your file and save a baseline snapshot
pipedog init data/orders.csv

# 2. Tomorrow, when a new file arrives, scan it
pipedog scan data/orders_new.csv

# 3. Explore any file without saving anything
pipedog profile data/orders.csv
```

---

## Commands

### `pipedog init <file>`

Profiles the file and saves two files to `.pipedog/`:

- **`.pipedog/schema.json`** — column names, types, null stats, value ranges, timestamps.
- **`.pipedog/checks.json`** — auto-generated quality rules derived from the baseline.

```
pipedog init sample_data/orders.csv
```

**What gets auto-generated:**

| Rule        | When generated                              | Severity |
|-------------|---------------------------------------------|----------|
| `not_null`  | Column had zero nulls at init time          | error    |
| `null_rate` | Column had some nulls; threshold = pct + 10 | warning  |
| `min_value` | Numeric column; locks in the observed min   | error    |
| `max_value` | Numeric column; locks in the observed max   | error    |
| `unique`    | Every value was distinct (looks like a key) | error    |

Re-running `init` refreshes the baseline to the current file.
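The rule table above can be sketched as a small function over per-column baseline stats. This is a minimal illustration of the documented behaviour, not pipedog's actual `generate_checks()`; the dict field names are assumptions.

```python
# Hypothetical sketch of the auto-generation rules in the table above.
# Field names ("null_count", "null_pct", ...) are assumed, not pipedog's
# real internal format.
def generate_checks(columns, row_count):
    checks = []
    for col in columns:
        name = col["name"]
        if col["null_count"] == 0:
            checks.append({"column": name, "rule": "not_null", "severity": "error"})
        else:
            # observed null rate plus a 10-point buffer, per the table
            checks.append({"column": name, "rule": "null_rate",
                           "threshold": col["null_pct"] + 10, "severity": "warning"})
        if col.get("min_value") is not None:  # numeric column
            checks.append({"column": name, "rule": "min_value",
                           "value": col["min_value"], "severity": "error"})
            checks.append({"column": name, "rule": "max_value",
                           "value": col["max_value"], "severity": "error"})
        if col["unique_count"] == row_count:  # every value distinct: looks like a key
            checks.append({"column": name, "rule": "unique", "severity": "error"})
    return checks
```

A fully-unique, null-free numeric column like `order_id` would get all four of `not_null`, `min_value`, `max_value`, and `unique` under this sketch.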
---

### `pipedog scan <file>`

Compares the file against the baseline and runs all quality checks.

```
pipedog scan sample_data/orders.csv
```

**Exit codes:**
- `0` — all checks passed (warnings are allowed).
- `1` — one or more error-severity checks failed.

This makes `pipedog scan` CI/CD friendly: add it to your build and it will fail the pipeline when data quality breaks.
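The exit-code contract above can be stated as a tiny function (an illustration of the documented semantics, not pipedog's code; the result-record shape is assumed):

```python
# Exit-code semantics from the list above: warning-severity failures
# never fail the scan; any failed error-severity check does.
# The {"passed": ..., "severity": ...} record shape is an assumption.
def scan_exit_code(results):
    for r in results:
        if not r["passed"] and r["severity"] == "error":
            return 1
    return 0
```

So a scan whose only failures are warnings still exits `0`, which is why warnings can surface in CI logs without blocking the pipeline.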
**What gets checked:**

1. **Schema drift** — were columns added, removed, or changed type?
2. **Quality checks** — do null rates, value ranges, and uniqueness still match the baseline?

**Output example (all passing):**
```
+------------------------------- Pipedog Scan --------------------------------+
| ALL CHECKS PASSED                                                           |
| 10 rows · 7 columns · 17 passed · 0 warnings · 0 failed                     |
+-----------------------------------------------------------------------------+

Passed Checks
PASS No nulls found in 'order_id'.
PASS 'price' maximum is 149.99, within baseline maximum of 149.99.
...
```

**Output example (failure):**
```
+------------------------------- Pipedog Scan --------------------------------+
| CHECKS FAILED                                                               |
| 12 rows · 6 columns · 14 passed · 0 warnings · 2 failed                     |
+-----------------------------------------------------------------------------+

Schema Drift Detected
FAIL Column 'status' existed in the baseline but is missing from the current file.

Failed Checks
FAIL 'order_id' has 2 null value(s) (16.67% of rows).
```

---

### `pipedog profile <file>`

Shows a data summary without saving anything to disk. Useful for exploring a file before committing to a baseline.

```
pipedog profile sample_data/orders.csv
```

**Output includes:**
- Total row and column count.
- Per-column type, null count, null percentage, unique count.
- Min and max for numeric columns.
- Up to 3 sample values per column.
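The per-column stats listed above are straightforward to compute. Here is a rough, stdlib-only sketch for a single column given as a Python list (with `None` for nulls); pipedog's real profiler works on pandas DataFrames, so treat this only as an illustration of the statistics involved.

```python
# Hypothetical per-column profile matching the "Output includes" list:
# null count/percentage, unique count, min/max for numeric data, and
# up to 3 sample values. Not pipedog's actual profile_dataframe().
def profile_column(values):
    non_null = [v for v in values if v is not None]
    null_count = len(values) - len(non_null)
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    return {
        "null_count": null_count,
        "null_pct": round(100 * null_count / len(values), 2) if values else 0.0,
        "unique_count": len(set(non_null)),
        "min_value": min(non_null) if is_numeric else None,
        "max_value": max(non_null) if is_numeric else None,
        "sample_values": non_null[:3],
    }
```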
---

## Supported File Types

| Extension        | Format  |
|------------------|---------|
| `.csv`           | CSV     |
| `.parquet` `.pq` | Parquet |
| `.json`          | JSON    |

File type is detected automatically from the extension.

---

## How It Works

```
pipedog init orders.csv
  │
  ├─ load_file() reads CSV/Parquet/JSON into a DataFrame
  ├─ profile_dataframe() computes stats for every column
  ├─ generate_checks() auto-generates quality rules from the stats
  └─ save_snapshot() writes .pipedog/schema.json + checks.json

pipedog scan orders_new.csv
  │
  ├─ load_file() reads the new file
  ├─ load_snapshot() loads baseline from .pipedog/
  ├─ profile_dataframe() profiles the new file
  ├─ detect_drift() compares column structure
  ├─ run_quality_checks() evaluates every rule
  └─ print_scan_results() renders colour-coded report, returns exit code
```

---

## Project Structure

```
pipedog/
├── pyproject.toml       # Poetry config and PyPI metadata
├── README.md            # This file
├── sample_data/
│   └── orders.csv       # Example file to test with
└── pipedog/
    ├── __init__.py      # Package version
    ├── main.py          # CLI commands (init, scan, profile)
    ├── schema.py        # Pydantic models (ColumnSchema, DataSchema, etc.)
    ├── profiler.py      # File loading, type inference, statistical profiling
    ├── scanner.py       # Drift detection and quality check evaluation
    └── output.py        # Rich terminal output (tables, panels, colours)
```

---

## Snapshot Files

After running `pipedog init`, a `.pipedog/` directory is created:

```
.pipedog/
├── schema.json   # baseline column statistics
└── checks.json   # auto-generated quality rules
```

These files are plain JSON and human-readable. You can commit them to version control to track schema changes over time, or add `.pipedog/` to `.gitignore` to keep them local.

**Example `.pipedog/schema.json`:**
```json
{
  "file": "/data/orders.csv",
  "row_count": 10,
  "column_count": 7,
  "columns": [
    {
      "name": "order_id",
      "dtype": "integer",
      "nullable": false,
      "null_count": 0,
      "null_pct": 0.0,
      "unique_count": 10,
      "sample_values": [1, 2, 3],
      "min_value": 1.0,
      "max_value": 10.0,
      "mean_value": 5.5
    }
  ],
  "captured_at": "2026-03-26T18:34:20.123456+00:00"
}
```
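Because the snapshot is plain JSON, you can also compare two snapshots yourself. The sketch below assumes only the `columns` / `name` / `dtype` shape shown in the example above; pipedog's own `detect_drift()` may compare more than names and types.

```python
# Column-level drift between two snapshots shaped like the schema.json
# example above. Compares only names and dtypes; a hedged sketch, not
# pipedog's detect_drift() implementation.
def diff_snapshots(baseline, current):
    base = {c["name"]: c["dtype"] for c in baseline["columns"]}
    cur = {c["name"]: c["dtype"] for c in current["columns"]}
    return {
        "removed": sorted(base.keys() - cur.keys()),
        "added": sorted(cur.keys() - base.keys()),
        "type_changed": sorted(n for n in base.keys() & cur.keys()
                               if base[n] != cur[n]),
    }
```

Load each side with `json.load()` from `.pipedog/schema.json` and a freshly profiled file, then inspect the three lists.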
---

## CI/CD Integration

Because `pipedog scan` exits with code `1` on failure, it drops straight into any CI pipeline:

**GitHub Actions:**
```yaml
- name: Check data quality
  run: pipedog scan data/daily_export.csv
```

**Makefile:**
```makefile
check:
	pipedog scan data/daily_export.csv
```

---

## Roadmap

- [ ] `pipedog diff` — side-by-side comparison of two snapshots
- [ ] Custom checks via `checks.json` (regex patterns, allowed value sets)
- [ ] JSON Lines (`.jsonl`) support
- [ ] `--output json` flag for machine-readable scan results
- [ ] Excel (`.xlsx`) support
- [ ] Slack / webhook notifications on failure

---

## License

MIT
pipedog-0.1.0/README.md ADDED
@@ -0,0 +1,275 @@
(contents identical to the Markdown long description embedded in PKG-INFO above)
pipedog-0.1.0/pipedog/__init__.py ADDED
@@ -0,0 +1 @@
__version__ = "0.1.0"