dfwatcher 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Abinesh N
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,453 @@
1
+ Metadata-Version: 2.4
2
+ Name: dfwatcher
3
+ Version: 0.1.0
4
+ Summary: Silent data watcher — tells you what your pipeline did to your data.
5
+ Author: Abineshabee
6
+ Author-email: abineshabee@gmail.com
7
+ License-Expression: MIT
8
+ Project-URL: Homepage, https://github.com/Abineshabee/watcher
9
+ Project-URL: Documentation, https://github.com/Abineshabee/watcher/blob/main/docs/usage.md
10
+ Project-URL: Bug Tracker, https://github.com/Abineshabee/watcher/issues
11
+ Keywords: dataframe,pipeline,data-quality,pandas,etl,debugging
12
+ Classifier: Development Status :: 3 - Alpha
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Science/Research
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: 3.13
20
+ Classifier: Topic :: Scientific/Engineering
21
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
22
+ Requires-Python: >=3.10
23
+ Description-Content-Type: text/markdown
24
+ License-File: LICENSE
25
+ Requires-Dist: pandas>=1.5
26
+ Provides-Extra: rich
27
+ Requires-Dist: rich>=13.0; extra == "rich"
28
+ Provides-Extra: memory
29
+ Requires-Dist: psutil>=5.9; extra == "memory"
30
+ Provides-Extra: full
31
+ Requires-Dist: rich>=13.0; extra == "full"
32
+ Requires-Dist: psutil>=5.9; extra == "full"
33
+ Provides-Extra: dev
34
+ Requires-Dist: pytest>=8.0; extra == "dev"
35
+ Requires-Dist: pytest-cov>=5.0; extra == "dev"
36
+ Requires-Dist: numpy>=1.24; extra == "dev"
37
+ Requires-Dist: rich>=13.0; extra == "dev"
38
+ Requires-Dist: psutil>=5.9; extra == "dev"
39
+ Dynamic: license-file
40
+
41
+
42
+ <p align="center">
43
+ <img src="assets/logo/watcher_logo_text_right.svg" width="500">
44
+ </p>
45
+
46
+ > **The silent data watcher.** Decorates your pipeline functions and tells you exactly what happened to your data — row counts, schema drift, null changes, memory usage, join explosions — automatically, with zero config.
47
+
48
+ [![CI](https://github.com/Abineshabee/watcher/actions/workflows/ci.yml/badge.svg)](https://github.com/Abineshabee/watcher/actions)
49
+ [![PyPI](https://img.shields.io/pypi/v/watcher)](https://pypi.org/project/watcher/)
50
+ [![GitHub release](https://img.shields.io/github/v/release/Abineshabee/watcher)](https://github.com/Abineshabee/watcher/releases)
51
+ [![Python](https://img.shields.io/pypi/pyversions/watcher)](https://pypi.org/project/watcher/)
52
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
53
+
54
+ ---
55
+
56
+ ## The problem
57
+
58
+ You run a data pipeline. The output looks wrong. Your only clue:
59
+
60
+ ```
61
+ Input: 1,000,000 rows
62
+ Output: 263,979 rows
63
+ ```
64
+
65
+ Which step dropped the rows? Was it a filter, a null drop, or a bad join? You have no idea without adding print statements everywhere and re-running the whole thing.
66
+
67
+ **watcher answers that — automatically.**
68
+
69
+ ---
70
+
71
+ ## Install
72
+
73
+ ```bash
74
+ pip install dfwatcher            # core only (pandas)
75
+ pip install "dfwatcher[rich]"    # + coloured terminal output
76
+ pip install "dfwatcher[full]"    # + Rich + psutil memory tracking
77
+ ```
78
+
79
+ ---
80
+
81
+ ## Quickstart
82
+
83
+ ```python
84
+ import pandas as pd
85
+ from watcher import watch, session
86
+
87
+ raw = pd.DataFrame({
88
+     "customer_id": [1, 2, 3, 4],
89
+     "status": ["active", "inactive", "active", None]
90
+ })
91
+
92
+ orders = pd.DataFrame({
93
+     "customer_id": [1, 3],
94
+     "amount": [250.0, 150.0]
95
+ })
96
+
97
+ @watch
98
+ def clean(df):
99
+     return df.dropna()
100
+
101
+ @watch
102
+ def merge_orders(df):
103
+     return df.merge(orders, on="customer_id", how="left")
104
+
105
+ @watch
106
+ def filter_active(df):
107
+     return df[df["status"] == "active"]
108
+
109
+ # Run the steps inside a session to see the watcher summary.
110
+ if __name__ == "__main__":
111
+     with session("nightly ETL") as s:
112
+         df = clean(raw)
113
+         df = merge_orders(df)
114
+         df = filter_active(df)
115
+
116
+ # =====================================
117
+ # For more examples: examples/
118
+ # For syntax and usage: docs/usage.md
119
+ # =====================================
120
+ ```
121
+
122
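+ **Output, printed automatically with no extra code (the figures are illustrative and come from a 1M-row run, not the four-row frames above):**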
123
+
124
+ ```
125
+ ──────────────────────── watcher · nightly ETL ─────────────────────────
126
+ clean() 1,000,000 → 964,203 ▼ -35,797 rows (-3.6%) 12.3 ms
127
+ nulls -35,797 status (35,797 → 0)
128
+
129
+ merge_orders() 964,203 → 1,069,104 ▲ +104,901 rows (+10.9%) ⚠ 41.1 ms
130
+ columns added : +tier
131
+ 💥 join explosion · duplication ratio 10.9%
132
+ key column top value repeat count
133
+ customer_id 9182 184
134
+ customer_id 3310 97
135
+
136
+ filter_active() 1,069,104 → 631,822 ▼ -437,282 rows (-40.9%) 18.7 ms
137
+
138
+ ╭──────────────── watcher · nightly ETL · summary ───────────────────╮
139
+ │ step rows in rows out Δ rows time (ms) │
140
+ │ clean 1,000,000 964,203 -35,797 12.3 │
141
+ │ merge_orders 964,203 1,069,104 +104,901 41.1 │
142
+ │ filter_active 1,069,104 631,822 -437,282 18.7 │
143
+ │ │
144
+ │ total 1,000,000 → 631,822 (-368,178 rows) 72.1 ms │
145
+ ╰────────────────────────────────────────────────────────────────────╯
146
+ ```
147
+
148
+ ---
149
+
150
+ ## Documentation
151
+
152
+ - [Usage Guide](docs/usage.md)
153
+ - [API Reference](docs/index.md)
154
+ - [Examples](examples/)
155
+
156
+ ---
157
+
158
+ For advanced pipeline patterns and debugging workflows, see the full documentation.
159
+ ## Features
160
+
161
+ ### Row tracking
162
+
163
+ Every decorated function shows rows before → after, the signed diff, percentage change, and elapsed time. Nothing is hidden, nothing needs configuring.
164
+
165
+ ```
166
+ drop_nulls() 1,000,000 → 921,330 ▼ -78,670 rows (-7.9%) 68.5 ms
167
+ ```
168
+
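As a rough illustration of what a row-tracking decorator records, here is a minimal stdlib-only sketch. It is not watcher's implementation: `track_rows` and `last_report` are hypothetical names, and a plain list stands in for a DataFrame so that `len()` gives the row count.

```python
import functools
import time

last_report = {}  # hypothetical: holds the figures of the most recent step


def track_rows(func):
    """Record row counts before and after `func` (illustrative only)."""
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        rows_in = len(data)
        start = time.perf_counter()
        result = func(data, *args, **kwargs)
        elapsed_s = time.perf_counter() - start
        rows_out = len(result)
        diff = rows_out - rows_in
        # Fractional change relative to the input; guard against empty input.
        pct = diff / rows_in if rows_in else 0.0
        last_report.update(
            func=func.__name__, rows_in=rows_in, rows_out=rows_out,
            diff=diff, pct=pct, elapsed_s=elapsed_s,
        )
        print(f"{func.__name__}()  {rows_in:,} -> {rows_out:,}  {diff:+,} rows ({pct:+.1%})")
        return result
    return wrapper


@track_rows
def drop_nulls(rows):
    return [r for r in rows if r is not None]


cleaned = drop_nulls([1, None, 3, None, 5])
```

The wrapper leaves the decorated function's return value untouched, so a pipeline behaves the same with or without the decorator.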
169
+ ---
170
+
171
+ ### Null-count deltas
172
+
173
+ Per-column null counts are compared before and after each step. The worst offenders are shown first.
174
+
175
+ ```
176
+ drop_nulls() 1,000,000 → 921,330 ▼ -78,670 rows (-7.9%)
177
+ nulls -2,477 status (2,477 → 0)
178
+ nulls -1,448 revenue (1,448 → 0)
179
+ ```
180
+
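The comparison can be pictured as two per-column null counts and a sort by the size of the change. The sketch below is an assumption about the general approach, not the library's code; `null_counts` and `null_deltas` are hypothetical helpers, and a dict of lists stands in for a DataFrame so the example runs without pandas.

```python
def null_counts(table):
    """Count None values per column in a dict-of-lists table."""
    return {col: sum(v is None for v in values) for col, values in table.items()}


def null_deltas(before, after):
    """Return (column, nulls_before, nulls_after), largest change first."""
    b, a = null_counts(before), null_counts(after)
    changed = [(col, b[col], a[col]) for col in b if col in a and b[col] != a[col]]
    return sorted(changed, key=lambda item: abs(item[2] - item[1]), reverse=True)


before = {
    "status": ["active", None, None, "inactive"],
    "revenue": [10.0, None, 5.0, 7.5],
}
# The same table after dropping every row that contains a null.
after = {
    "status": ["active", "inactive"],
    "revenue": [10.0, 7.5],
}
deltas = null_deltas(before, after)
```

Columns whose null count did not change are left out, which keeps the report short on wide tables.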
181
+ ---
182
+
183
+ ### Schema drift
184
+
185
+ Columns added or removed between steps are detected and reported immediately.
186
+
187
+ ```
188
+ add_revenue_band() 582,246 → 582,246 ● +0 rows
189
+ columns added : +revenue_band
190
+
191
+ drop_temp_columns() 582,246 → 582,246 ● +0 rows
192
+ columns removed : -created_at
193
+ ```
194
+
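Detecting added and removed columns amounts to comparing the column lists before and after a step. A minimal sketch, assuming only that column names are available as ordered lists (`schema_drift` is a hypothetical helper, not part of the documented API):

```python
def schema_drift(cols_before, cols_after):
    """Return (added, removed) column names, preserving their original order."""
    before_set, after_set = set(cols_before), set(cols_after)
    added = [c for c in cols_after if c not in before_set]
    removed = [c for c in cols_before if c not in after_set]
    return added, removed


added, removed = schema_drift(
    ["customer_id", "revenue", "created_at"],
    ["customer_id", "revenue", "revenue_band"],
)
```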
195
+ ---
196
+
197
+ ### Dtype change detection
198
+
199
+ If a step changes a column's dtype, whether by widening (`int32` → `int64`) or by falling back to a generic type (`int64` → `object`), watcher flags it.
200
+
201
+ ```
202
+ coerce_step() 10,000 → 10,000 ● +0 rows
203
+ dtype change : customer_id int64 → object
204
+ ```
205
+
206
+ ---
207
+
208
+ ### Join explosion detection
209
+
210
+ When a merge fans out unexpectedly, watcher tells you which key column caused it, which values are duplicated, and how many times — not just that rows were gained.
211
+
212
+ ```
213
+ merge_orders() 10,000 → 20,000 ▲ +10,000 rows (+100.0%) ⚠ 💥 join explosion
214
+ columns added : +tier
215
+ join explosion · duplication ratio 100.0%
216
+ key column top value repeat count
217
+ customer_id 72 30
218
+ customer_id 383 30
219
+ customer_id 1034 28
220
+ ```
221
+
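To see where a fan-out comes from, it helps to count how often each key appears on the right-hand side of a left join. The sketch below is stdlib-only and illustrative: `fan_out_report` is a hypothetical helper, and it assumes the duplication ratio is gained rows divided by input rows, which is consistent with the figures shown above.

```python
from collections import Counter


def fan_out_report(left_keys, right_keys, top=3):
    """Estimate left-join row growth caused by repeated keys on the right side."""
    right_counts = Counter(right_keys)
    # A left join emits one row per match, or a single row when nothing matches.
    rows_out = sum(max(right_counts[k], 1) for k in left_keys)
    rows_in = len(left_keys)
    ratio = (rows_out - rows_in) / rows_in if rows_in else 0.0
    # Keys that occur more than once on the right are the fan-out candidates.
    repeated = [(k, n) for k, n in right_counts.most_common() if n > 1]
    return rows_out, ratio, repeated[:top]


left = [1, 2, 3, 4]
right = [1, 1, 1, 3, 3, 5]
rows_out, ratio, offenders = fan_out_report(left, right)
```

When duplicate keys are never expected, pandas can also reject them up front: `DataFrame.merge` accepts a `validate` argument such as `"many_to_one"` that raises if the right-hand keys are not unique.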
222
+ ---
223
+
224
+ ### Threshold guards
225
+
226
+ Turn watcher into a data contract enforcer. Set soft warnings or hard stops on row gain or loss.
227
+
228
+ ```python
229
+ @watch(
230
+     warn_on_loss=0.05,    # ⚠ warn if > 5 % rows lost
231
+     raise_on_loss=0.20,   # ✗ raise if > 20 % rows lost
232
+     warn_on_gain=0.10,    # ⚠ warn if > 10 % rows gained
233
+     raise_on_gain=1.00,   # ✗ raise if rows more than double
234
+ )
235
+ def merge_orders(df):
236
+     return df.merge(orders, on="customer_id", how="left")
237
+ ```
238
+
239
+ Catching exceptions in CI:
240
+
241
+ ```python
242
+ from watcher.exceptions import ThresholdExceeded, WatcherWarning
243
+
244
+ try:
245
+     result = pipeline(df)
246
+ except ThresholdExceeded as exc:
247
+     logger.error("Data contract violated: %s", exc)
248
+     raise
249
+ ```
250
+
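The threshold semantics can be sketched as a comparison of the fractional row change against each limit. This is a hedged illustration, not watcher's code: `check_thresholds` and `DemoThresholdExceeded` are hypothetical names, and the strict `>` comparison is an assumption based on the comments in the decorator example.

```python
class DemoThresholdExceeded(Exception):
    """Stand-in for a hard-stop exception (hypothetical name)."""


def check_thresholds(rows_in, rows_out, warn_on_loss=None, raise_on_loss=None,
                     warn_on_gain=None, raise_on_gain=None):
    """Return the soft warnings that fired; raise when a hard limit is breached."""
    change = (rows_out - rows_in) / rows_in if rows_in else 0.0
    loss, gain = max(-change, 0.0), max(change, 0.0)

    # Hard limits are checked first so a breach always stops the pipeline.
    if raise_on_loss is not None and loss > raise_on_loss:
        raise DemoThresholdExceeded(f"lost {loss:.1%} of rows")
    if raise_on_gain is not None and gain > raise_on_gain:
        raise DemoThresholdExceeded(f"gained {gain:.1%} rows")

    fired = []
    if warn_on_loss is not None and loss > warn_on_loss:
        fired.append("loss")
    if warn_on_gain is not None and gain > warn_on_gain:
        fired.append("gain")
    return fired


soft = check_thresholds(1000, 900, warn_on_loss=0.05, raise_on_loss=0.20)

try:
    check_thresholds(1000, 700, warn_on_loss=0.05, raise_on_loss=0.20)
    hard_stop = False
except DemoThresholdExceeded:
    hard_stop = True
```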
251
+ ---
252
+
253
+ ### Memory tracking
254
+
255
+ ```python
256
+ @watch(track_memory="rss") # process RSS via psutil — captures NumPy/pandas C allocations
257
+ @watch(track_memory="peak") # Python-heap peak via tracemalloc — no psutil needed
258
+ @watch(track_memory="off") # disabled — zero overhead for production pipelines
259
+ @watch(track_memory=True) # alias for "rss"
260
+ @watch(track_memory=False) # alias for "off"
261
+ ```
262
+
263
+ Example output with RSS tracking on a 1M-row allocation:
264
+
265
+ ```
266
+ big_allocation() 1,000,000 → 1,000,000 ● +0 rows 56.2 ms mem +38.5 MB (rss)
267
+ columns added : +col1, +col2, +col3, +col4, +col5
268
+ ```
269
+
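For a sense of what a Python-heap peak measurement involves, the sketch below wraps a call with the standard-library `tracemalloc` module. It illustrates the general technique only; `peak_heap_mb` is a hypothetical helper, and how watcher itself drives `tracemalloc` is not shown in this README.

```python
import tracemalloc


def peak_heap_mb(func, *args, **kwargs):
    """Run `func` and return (result, peak traced Python-heap usage in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


result, peak_mb = peak_heap_mb(lambda n: list(range(n)), 200_000)
print(f"peak: {peak_mb:.1f} MB")
```

`tracemalloc` only sees allocations made through Python's allocator, which is why the README distinguishes this mode from process RSS.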
270
+ ---
271
+
272
+ ### Session grouping
273
+
274
+ Group multiple steps into one named pipeline run. Get a full summary table and a machine-readable dict for CI assertions.
275
+
276
+ ```python
277
+ with session("user churn model — daily run") as s:
278
+     df = clean(df)
279
+     df = merge(df)
280
+     df = score(df)
281
+
282
+ summary = s.summary()
283
+ assert summary["total_rows_out"] > 500_000, "Too many rows dropped!"
284
+ print(summary["total_elapsed_s"])
285
+ ```
286
+
287
+ `summary()` returns:
288
+
289
+ ```python
290
+ {
290
+     "name": "user churn model — daily run",
291
+     "steps": [
292
+         {"func": "clean", "rows_in": 1000000, "rows_out": 964203, "diff": -35797, ...},
293
+         {"func": "merge", ...},
294
+         {"func": "score", ...},
295
+     ],
296
+     "total_rows_in": 1000000,
297
+     "total_rows_out": 631822,
298
+     "total_elapsed_s": 0.072,
299
+     "total_memory_delta_mb": +38.5,
300
+ }
302
+ ```
303
+
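A dict of this shape lends itself to small CI checks. The sketch below works on a hand-written sample that uses only the keys shown above (`func`, `rows_in`, `rows_out`, `diff`, `total_rows_in`, `total_rows_out`); the values are made up for illustration, and `worst_loss` and `retained_fraction` are hypothetical helpers.

```python
# Hand-written sample in the documented shape; the numbers are invented.
summary = {
    "name": "example run",
    "steps": [
        {"func": "clean", "rows_in": 100, "rows_out": 90, "diff": -10},
        {"func": "merge", "rows_in": 90, "rows_out": 120, "diff": 30},
        {"func": "score", "rows_in": 120, "rows_out": 60, "diff": -60},
    ],
    "total_rows_in": 100,
    "total_rows_out": 60,
}


def worst_loss(summary):
    """Return the step that dropped the most rows, or None if no step lost any."""
    losing = [s for s in summary["steps"] if s["diff"] < 0]
    return min(losing, key=lambda s: s["diff"]) if losing else None


def retained_fraction(summary):
    """Fraction of the input rows that survived the whole run."""
    total_in = summary["total_rows_in"]
    return summary["total_rows_out"] / total_in if total_in else 0.0


culprit = worst_loss(summary)
kept = retained_fraction(summary)
```

Asserting on `kept` rather than on an absolute row count keeps the check meaningful when the input size changes from run to run.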
304
+ ---
305
+
306
+ ### Custom handlers
307
+
308
+ Swap or extend the output layer without touching your pipeline code. Every step fires `on_step()` on all registered handlers.
309
+
310
+ ```python
311
+ from watcher import register_handler, deregister_handler
312
+ from watcher.handlers import HandlerBase
313
+ from watcher.core import StepResult
314
+ import json
315
+
316
+ class JSONLogHandler(HandlerBase):
317
+     def __init__(self):
318
+         self.log = []
319
+
320
+     def on_step(self, step: StepResult):
321
+         self.log.append({
322
+             "step": step.func_name,
323
+             "rows_in": step.rows_in,
324
+             "rows_out": step.rows_out,
325
+             "diff": step.row_diff,
326
+             "ms": round(step.elapsed_s * 1000, 2),
327
+         })
328
+
329
+ handler = JSONLogHandler()
330
+ register_handler(handler)
331
+
332
+ # ... run your pipeline ...
333
+
334
+ deregister_handler(handler)
335
+ print(json.dumps(handler.log, indent=2))
336
+ ```
337
+
338
+ ---
339
+
340
+ ## API reference
341
+
342
+ ### `@watch`
343
+
344
+ ```python
345
+ @watch(
346
+     label: str | None = None,            # custom step name shown in output
347
+     warn_on_loss: float | None = None,   # soft warning threshold (0.0–1.0)
348
+     raise_on_loss: float | None = None,  # hard stop threshold (0.0–1.0)
349
+     warn_on_gain: float | None = None,   # soft warning on row gain
350
+     raise_on_gain: float | None = None,  # hard stop on row gain
351
+     track_memory: bool | str | MemoryMode = "rss",
352
+     verbose: bool = True,                # False = silent, step still tracked in session
353
+ )
354
+ ```
355
+
356
+ Can be used bare (`@watch`) or with arguments (`@watch(warn_on_loss=0.05)`).
357
+
358
+ ---
359
+
360
+ ### `session(name)`
361
+
362
+ Context manager. Groups `@watch` steps into one named pipeline run and prints a summary table on exit. Access `.summary()` on the session object for machine-readable results.
363
+
364
+ ---
365
+
366
+ ### `MemoryMode`
367
+
368
+ | Value | Meaning |
369
+ |---|---|
370
+ | `"rss"` / `True` | Process RSS via psutil — captures NumPy, pandas, Arrow C allocations |
371
+ | `"peak"` | Python-heap peak via tracemalloc — no extra dependencies |
372
+ | `"off"` / `False` | Disabled — zero overhead |
373
+
374
+ ---
375
+
376
+ ### `StepResult` attributes
377
+
378
+ | Attribute | Type | Description |
379
+ |---|---|---|
380
+ | `func_name` | `str` | Decorated function name (or `label`) |
381
+ | `rows_in` | `int` | Row count before the step |
382
+ | `rows_out` | `int` | Row count after the step |
383
+ | `row_diff` | `int` | Signed difference (`rows_out - rows_in`) |
384
+ | `row_diff_pct` | `float` | Fractional change relative to input |
385
+ | `lost_rows` | `bool` | True when rows were dropped |
386
+ | `gained_rows` | `bool` | True when rows were added |
387
+ | `is_join_explosion` | `bool` | True when a fan-out was detected |
388
+ | `elapsed_s` | `float` | Wall-clock time in seconds |
389
+ | `memory_delta_mb` | `float` | Memory change in MB |
390
+ | `memory_mode` | `MemoryMode` | Which memory strategy was used |
391
+ | `warned` | `bool` | True when a `warn_on_*` threshold fired |
392
+ | `stats` | `StepStats` | Full column-level stats (nulls, dtypes, schema drift) |
393
+
394
+ ---
395
+
396
+ ### Exceptions
397
+
398
+ | Exception | When |
399
+ |---|---|
400
+ | `ThresholdExceeded` | A `raise_on_*` threshold is breached — hard stop |
401
+ | `WatcherWarning` | A `warn_on_*` threshold is breached — soft, pipeline continues |
402
+ | `ConfigurationError` | Invalid decorator arguments at decoration time |
403
+ | `BackendError` | A backend adapter failed at runtime |
404
+
405
+ All exceptions inherit from `WatcherError` so you can catch the entire family with one clause.
406
+
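A shared base class is what makes the single-clause catch possible. The sketch below uses stand-in classes with a `Demo` prefix to show the pattern; they are hypothetical and only mirror the hierarchy described in the table, not the library's actual definitions.

```python
class DemoWatcherError(Exception):
    """Stand-in for the common base exception."""


class DemoThresholdExceeded(DemoWatcherError):
    """Stand-in for a hard-stop threshold breach."""


class DemoConfigurationError(DemoWatcherError):
    """Stand-in for invalid decorator arguments."""


def run(step):
    """Run a step and report which member of the family, if any, was raised."""
    try:
        step()
    except DemoWatcherError as exc:  # one clause covers every subclass
        return type(exc).__name__
    return "ok"


def bad_threshold():
    raise DemoThresholdExceeded("lost 30% of rows")


def bad_config():
    raise DemoConfigurationError("warn_on_loss must be between 0 and 1")


outcomes = [run(bad_threshold), run(bad_config), run(lambda: None)]
```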
407
+ ---
408
+
409
+ ### `HandlerBase`
410
+
411
+ | Method | Called when |
412
+ |---|---|
413
+ | `on_session_start(session)` | A `session()` block opens |
414
+ | `on_step(step)` | A decorated function completes |
415
+ | `on_session_end(session)` | A `session()` block closes |
416
+
417
+ ---
418
+
419
+ ## Examples
420
+
421
+ ```bash
422
+ python examples/basic_pipeline.py # 1M-row e-commerce ETL with session summary
423
+ python examples/threshold_demo.py # all four threshold modes demonstrated
424
+ ```
425
+
426
+ ---
427
+
428
+ ## Development
429
+
430
+ ```bash
431
+ git clone https://github.com/Abineshabee/watcher
432
+ cd watcher
433
+ pip install -e ".[dev]"
434
+ pytest tests/ -v --cov=watcher
435
+ ```
436
+
437
+ CI runs on Python 3.10–3.13 across Ubuntu, Windows, and macOS on every push.
438
+
439
+ ---
440
+
441
+ ## Roadmap
442
+
443
+ - Polars backend
444
+ - DuckDB backend
445
+ - Notebook / HTML renderer
446
+ - JSON handler for structured logging pipelines
447
+ - `watcher.config` — global defaults without decorator arguments
448
+
449
+ ---
450
+
451
+ ## License
452
+
453
+ MIT — see [LICENSE](LICENSE).