tracepipe 0.2.0__py3-none-any.whl → 0.3.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,508 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: tracepipe
3
- Version: 0.2.0
4
- Summary: Row-level data lineage tracking for pandas pipelines
5
- Project-URL: Homepage, https://github.com/tracepipe/tracepipe
6
- Project-URL: Documentation, https://github.com/tracepipe/tracepipe#readme
7
- Project-URL: Repository, https://github.com/tracepipe/tracepipe.git
8
- Project-URL: Issues, https://github.com/tracepipe/tracepipe/issues
9
- Author: Gauthier Piarrette
10
- License: MIT License
11
-
12
- Copyright (c) 2026 Gauthier Piarrette
13
-
14
- Permission is hereby granted, free of charge, to any person obtaining a copy
15
- of this software and associated documentation files (the "Software"), to deal
16
- in the Software without restriction, including without limitation the rights
17
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
18
- copies of the Software, and to permit persons to whom the Software is
19
- furnished to do so, subject to the following conditions:
20
-
21
- The above copyright notice and this permission notice shall be included in all
22
- copies or substantial portions of the Software.
23
-
24
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
25
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
26
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
27
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
28
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
29
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
30
- SOFTWARE.
31
- License-File: LICENSE
32
- Keywords: data-engineering,data-lineage,data-quality,debugging,observability,pandas
33
- Classifier: Development Status :: 4 - Beta
34
- Classifier: Intended Audience :: Developers
35
- Classifier: Intended Audience :: Science/Research
36
- Classifier: License :: OSI Approved :: MIT License
37
- Classifier: Operating System :: OS Independent
38
- Classifier: Programming Language :: Python :: 3
39
- Classifier: Programming Language :: Python :: 3.9
40
- Classifier: Programming Language :: Python :: 3.10
41
- Classifier: Programming Language :: Python :: 3.11
42
- Classifier: Programming Language :: Python :: 3.12
43
- Classifier: Topic :: Scientific/Engineering
44
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
45
- Requires-Python: >=3.9
46
- Requires-Dist: numpy>=1.20.0
47
- Requires-Dist: pandas>=1.5.0
48
- Provides-Extra: all
49
- Requires-Dist: psutil>=5.9.0; extra == 'all'
50
- Requires-Dist: pyarrow>=10.0.0; extra == 'all'
51
- Provides-Extra: arrow
52
- Requires-Dist: pyarrow>=10.0.0; extra == 'arrow'
53
- Provides-Extra: dev
54
- Requires-Dist: black>=23.0.0; extra == 'dev'
55
- Requires-Dist: pre-commit>=3.5.0; extra == 'dev'
56
- Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
57
- Requires-Dist: pytest>=7.0.0; extra == 'dev'
58
- Requires-Dist: ruff>=0.1.0; extra == 'dev'
59
- Requires-Dist: taskipy>=1.12.0; extra == 'dev'
60
- Provides-Extra: memory
61
- Requires-Dist: psutil>=5.9.0; extra == 'memory'
62
- Description-Content-Type: text/markdown
63
-
64
- # 🔍 TracePipe
65
-
66
- **Row-level data lineage tracking for pandas pipelines**
67
-
68
- TracePipe tracks every row, every change, every step in your data pipelines. Point at any row and instantly see its complete transformation history.
69
-
70
- ## Why TracePipe?
71
-
72
- Ever asked "Why did this row get dropped?" or "What happened to this user's data?" Traditional pipeline logging tells you *what operations ran*, but not *what happened to specific data points*.
73
-
74
- TracePipe gives you **row-level provenance**:
75
- - 🎯 **Track individual rows** through filters, transforms, and aggregations
76
- - 📊 **Cell-level diffs** - see exactly what changed (e.g., `age: NaN → 30`)
77
- - 🔗 **Aggregation lineage** - trace which source rows contributed to each group
78
- - 🚫 **Zero code changes** - just enable and your pipeline is tracked
79
-
80
- ## Installation
81
-
82
- ```bash
83
- pip install tracepipe
84
-
85
- # With optional dependencies
86
- pip install tracepipe[arrow] # For Parquet/Arrow export
87
- pip install tracepipe[all] # All optional dependencies
88
- ```
89
-
90
- ## Quick Start
91
-
92
- ```python
93
- import tracepipe
94
- import pandas as pd
95
-
96
- # Enable tracking
97
- tracepipe.enable()
98
- tracepipe.watch("age", "salary") # Track cell-level changes for these columns
99
-
100
- # Your normal pandas code
101
- df = pd.DataFrame({
102
- "name": ["Alice", "Bob", "Charlie"],
103
- "age": [28, None, 35],
104
- "salary": [75000, 65000, None]
105
- })
106
-
107
- # Data cleaning
108
- df = df.dropna()
109
- df["salary"] = df["salary"] * 1.1 # Give a raise
110
-
111
- # Query lineage
112
- dropped = tracepipe.dropped_rows()
113
- print(f"Dropped rows: {dropped}") # [1, 2] - Bob and Charlie
114
-
115
- row = tracepipe.explain(0) # Alice's journey
116
- print(row.history())
117
- # [{'operation': 'DataFrame.__setitem__[salary]', 'col': 'salary',
118
- # 'old_val': 75000.0, 'new_val': 82500.0, ...}]
119
-
120
- # Export
121
- tracepipe.save("lineage_report.html")
122
- tracepipe.disable()
123
- ```
124
-
125
- ## Features
126
-
127
- ### 🎯 Row-Level Tracking
128
-
129
- Every row gets a unique ID that persists through operations:
130
-
131
- ```python
132
- tracepipe.enable()
133
- df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
134
-
135
- # Filter some rows
136
- df = df[df["a"] > 2]
137
-
138
- # Which rows were dropped?
139
- dropped = tracepipe.dropped_rows()
140
- print(dropped) # [0, 1] - rows with a=1,2
141
-
142
- # What happened to a specific row?
143
- row = tracepipe.explain(2) # Row with a=3
144
- print(row.is_alive)  # True (is_alive is a property)
145
- ```
146
-
147
- ### 📊 Cell-Level Diffs
148
-
149
- Watch specific columns to track value changes:
150
-
151
- ```python
152
- tracepipe.enable()
153
- tracepipe.watch("age", "income")
154
-
155
- df = pd.DataFrame({"age": [25, None, 35], "income": [50000, 60000, None]})
156
- df["age"] = df["age"].fillna(30)
157
-
158
- # What changed for row 1?
159
- row = tracepipe.explain(1)
160
- history = row.cell_history("age")
161
- print(history)
162
- # [{'col': 'age', 'old_val': None, 'new_val': 30.0, 'change_type': 'MODIFIED'}]
163
- ```
164
-
165
- ### 🔗 Aggregation Lineage
166
-
167
- Trace back from aggregated results to source rows:
168
-
169
- ```python
170
- tracepipe.enable()
171
- df = pd.DataFrame({
172
- "department": ["Engineering", "Engineering", "Sales"],
173
- "salary": [80000, 90000, 70000]
174
- })
175
-
176
- summary = df.groupby("department").mean()
177
-
178
- # Which rows contributed to the Engineering average?
179
- group = tracepipe.explain_group("Engineering")
180
- print(group.row_ids) # [0, 1]
181
- print(group.row_count) # 2
182
- ```
183
-
184
- ### 📋 Pipeline Stages
185
-
186
- Organize tracking by logical stages:
187
-
188
- ```python
189
- with tracepipe.stage("cleaning"):
190
- df = df.dropna()
191
- df = df.fillna(0)
192
-
193
- with tracepipe.stage("feature_engineering"):
194
- df["new_feature"] = df["a"] * df["b"]
195
-
196
- # Steps are tagged with stage names
197
- steps = tracepipe.steps()
198
- for step in steps:
199
- print(f"{step['operation']} [{step['stage']}]")
200
- ```
201
-
202
- ### 📤 Export & Visualization
203
-
204
- ```python
205
- # HTML report
206
- tracepipe.save("report.html")
207
- tracepipe.save("report.html", row_id=42) # Specific row's journey
208
-
209
- # JSON export
210
- tracepipe.export_json("lineage.json")
211
-
212
- # Parquet export (requires pyarrow)
213
- tracepipe.export_arrow("lineage.parquet")
214
- ```
215
-
216
- ## API Reference
217
-
218
- ### Core Functions
219
-
220
- | Function | Description |
221
- |----------|-------------|
222
- | `enable(config=None)` | Enable lineage tracking |
223
- | `disable()` | Disable tracking and restore pandas |
224
- | `reset()` | Clear all tracking state |
225
- | `configure(**kwargs)` | Update configuration |
226
-
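- A minimal lifecycle sketch tying these together; the `strict_mode` keyword passed to `configure()` is an assumption, mirroring the `TracePipeConfig` fields shown under Configuration below:
-
- ```python
- import tracepipe
-
- tracepipe.enable()                      # start tracking
- tracepipe.configure(strict_mode=False)  # assumed: configure() accepts TracePipeConfig fields
-
- # ... run your pipeline ...
-
- tracepipe.reset()    # clear everything collected so far
- tracepipe.disable()  # restore the original pandas methods
- ```
-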
227
- ### Column Watching
228
-
229
- | Function | Description |
230
- |----------|-------------|
231
- | `watch(*columns)` | Track cell-level changes for columns |
232
- | `unwatch(*columns)` | Stop tracking columns |
233
- | `register(df)` | Manually register a DataFrame |
234
-
235
- ### Query Functions
236
-
237
- | Function | Description |
238
- |----------|-------------|
239
- | `explain(row_id)` | Get a row's complete lineage |
240
- | `explain_group(group_key)` | Get aggregation group membership |
241
- | `dropped_rows()` | List all dropped row IDs |
242
- | `dropped_rows_by_step()` | Count dropped rows per operation |
243
- | `steps()` | List all tracked operations |
244
- | `stats()` | Get tracking statistics |
245
-
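- For example, a short debugging session built from the query helpers above (output shapes are illustrative):
-
- ```python
- import pandas as pd
- import tracepipe
-
- tracepipe.enable()
- df = pd.DataFrame({"a": [1, None, 3, 3]})
- df = df.dropna()
- df = df.drop_duplicates()
-
- print(tracepipe.dropped_rows())          # IDs of every dropped row
- print(tracepipe.dropped_rows_by_step())  # drop counts keyed by operation
- print(tracepipe.steps())                 # all tracked operations, in order
- print(tracepipe.stats())                 # overall tracking statistics
- tracepipe.disable()
- ```
-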
246
- ### Export Functions
247
-
248
- | Function | Description |
249
- |----------|-------------|
250
- | `save(filepath)` | Save HTML report |
251
- | `export_json(filepath)` | Export to JSON |
252
- | `export_arrow(filepath)` | Export to Parquet |
253
-
254
- ### RowLineageResult Methods
255
-
256
- | Method | Description |
257
- |--------|-------------|
258
- | `.is_alive` | True if row wasn't dropped (property) |
259
- | `.dropped_at` | Operation that dropped the row (property) |
260
- | `.history()` | Full event history |
261
- | `.cell_history(col)` | Changes to specific column |
262
- | `.gaps` | Lineage completeness info |
263
-
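- A small sketch of inspecting one row's result object, continuing from the Quick Start example (the row ID is arbitrary; every attribute used comes from the table above):
-
- ```python
- row = tracepipe.explain(1)
-
- if row.is_alive:
-     print(row.cell_history("age"))  # changes to one watched column
- else:
-     print(row.dropped_at)           # the operation that removed the row
-
- print(row.gaps)  # how complete the recorded lineage is
- ```
-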
264
- ### GroupLineageResult Methods
265
-
266
- | Method | Description |
267
- |--------|-------------|
268
- | `.row_ids` | List of contributing row IDs |
269
- | `.row_count` | Number of rows in group |
270
- | `.group_column` | Column used for grouping |
271
- | `.aggregation_functions` | Functions applied |
272
-
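- And the equivalent for an aggregation group, reusing the `groupby("department")` example from the Features section:
-
- ```python
- group = tracepipe.explain_group("Engineering")
-
- print(group.group_column)           # column the data was grouped on
- print(group.aggregation_functions)  # functions applied to the group
- print(group.row_count)              # number of contributing rows
- for row_id in group.row_ids:        # drill back down to each source row
-     print(tracepipe.explain(row_id).is_alive)
- ```
-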
273
- ## Tracked Operations
274
-
275
- ### Pandas DataFrame
276
-
277
- **Filters** (track dropped rows):
278
- - `dropna`, `drop_duplicates`, `query`, `head`, `tail`, `sample`
279
- - `df[mask]` boolean indexing
280
- - `df.drop(index=...)`
281
-
282
- **Transforms** (track cell changes):
283
- - `fillna`, `replace`, `astype`
284
- - `df[col] = value` assignment
285
-
286
- **Aggregations** (track group membership):
287
- - `groupby().agg()`, `groupby().sum()`, `groupby().mean()`, etc.
288
-
289
- **Index Operations**:
290
- - `reset_index`, `set_index`, `sort_values`
291
-
292
- **Other**:
293
- - `copy`, `merge`, `join`, `pd.concat`
294
-
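- For instance, a quick sketch of two of the tracked filters above; the exact dropped IDs depend on the data, and here would be rows 0 and 4:
-
- ```python
- import pandas as pd
- import tracepipe
-
- tracepipe.enable()
- df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
-
- df = df.query("a > 1")    # filter: dropped rows are recorded
- df = df.drop(index=[4])   # filter: dropped rows are recorded
-
- print(tracepipe.dropped_rows())  # expected: [0, 4]
- ```
-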
295
- ## Configuration
296
-
297
- ```python
298
- from tracepipe import TracePipeConfig
299
-
300
- config = TracePipeConfig(
301
- max_diffs_in_memory=500_000, # Spill to disk above this
302
- max_diffs_per_step=100_000, # Mark as "mass update" above this
303
- max_group_membership_size=100_000, # Store count-only for large groups
304
- strict_mode=False, # Raise on instrumentation errors
305
- warn_on_duplicate_index=True, # Warn about ambiguous row identity
306
- )
307
-
308
- tracepipe.enable(config=config)
309
- ```
310
-
311
- Environment variables:
312
- - `TRACEPIPE_MAX_DIFFS` - Max diffs in memory
313
- - `TRACEPIPE_STRICT` - Enable strict mode (`1`)
314
- - `TRACEPIPE_AUTO_WATCH` - Auto-watch columns with nulls (`1`)
315
-
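- For example, these can be set in the shell before launching a pipeline (the script name is a placeholder):
-
- ```bash
- export TRACEPIPE_MAX_DIFFS=250000   # lower the in-memory diff limit
- export TRACEPIPE_STRICT=1           # raise on instrumentation errors
- export TRACEPIPE_AUTO_WATCH=1       # auto-watch columns containing nulls
- python run_pipeline.py              # placeholder for your own entry point
- ```
-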
316
- ## Extensibility
317
-
318
- TracePipe uses protocols for pluggable backends:
319
-
320
- ```python
321
- from tracepipe import LineageBackend, RowIdentityStrategy
322
-
323
- # Custom storage backend (e.g., SQLite)
324
- class SQLiteBackend:
325
- """Implements LineageBackend protocol."""
326
- ...
327
-
328
- # Custom engine support (e.g., Polars)
329
- class PolarsRowIdentity:
330
- """Implements RowIdentityStrategy protocol."""
331
- ...
332
-
333
- tracepipe.enable(backend=SQLiteBackend(), identity=PolarsRowIdentity())
334
- ```
335
-
336
- ## Limitations
337
-
338
- TracePipe v0.2.0 has some known limitations:
339
-
340
- | Limitation | Behavior | Notes |
341
- |------------|----------|--------|
342
- | `merge`/`concat` | Lineage reset (UNKNOWN completeness) | |
343
- | `apply`/`pipe` | Output tracked, internals unknown (PARTIAL) | Inherent |
344
- | Series methods | Not tracked (e.g., `df['col'].str.upper()`) | |
345
- | `loc`/`iloc` | Not tracked | |
346
- | Very large datasets | May spill to disk | Configure thresholds |
347
-
348
- **Tip**: For Series operations, the column assignment is tracked:
349
- ```python
350
- # The str.upper() isn't tracked, but the assignment is
351
- df['name'] = df['name'].str.upper()
352
- ```
353
-
354
- ## Performance & Benchmarks
355
-
356
- ### Key Insight: Overhead is ADDITIVE, not MULTIPLICATIVE
357
-
358
- TracePipe adds a time cost for row tracking and change detection on each tracked pandas operation. That overhead is **additive**: it scales with the rows and changes being tracked, not with how long the rest of your pipeline runs. For pipelines dominated by heavy computation (model training, I/O, complex aggregations), the tracking overhead becomes negligible.
359
-
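- A back-of-envelope model of that claim, using the numbers from the scenario table further below:
-
- ```python
- def slowdown(pipeline_seconds, overhead_seconds):
-     """Additive model: tracking cost is added on top, not multiplied in."""
-     return (pipeline_seconds + overhead_seconds) / pipeline_seconds
-
- print(slowdown(10, 5))     # quick cleaning job   -> 1.5x
- print(slowdown(3600, 15))  # 1-hour training run  -> ~1.004x
- ```
-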
360
- ### Benchmark Results
361
-
362
- **Test Configuration**: MacBook Pro M1, pandas 2.0+, 5M rows, 12 columns
363
-
364
- #### Operation-Level Overhead
365
-
366
- | Operation | Without TracePipe | With TracePipe | Overhead | Slowdown |
367
- |-----------|-------------------|----------------|----------|----------|
368
- | `drop_duplicates` (50K rows) | 45ms | 67ms | +22ms | 1.49x |
369
- | `dropna` (50K rows) | 38ms | 56ms | +18ms | 1.47x |
370
- | `fillna` (50K rows) | 52ms | 89ms | +37ms | 1.71x |
371
- | Boolean filter `[mask]` (5M rows) | 2.1s | 3.8s | +1.7s | 1.81x |
372
- | `drop_duplicates` (5M rows) | 3.2s | 5.9s | +2.7s | 1.84x |
373
-
374
- #### End-to-End Pipeline Performance
375
-
376
- **Small Dataset (50K rows)**:
377
- ```
378
- WITHOUT TracePipe: 0.89 seconds
379
- WITH TracePipe: 3.98 seconds (tracking 3 columns)
380
- Overhead: +3.09 seconds
381
- Slowdown: 4.47x
382
- ```
383
-
384
- **Large Dataset (5M rows)**:
385
- ```
386
- WITHOUT TracePipe: 6.25 seconds
387
- WITH TracePipe: 16.19 seconds (tracking 3 columns)
388
- Overhead: +9.94 seconds
389
- Slowdown: 2.59x
390
- ```
391
-
392
- #### Real-World Pipeline Scenarios
393
-
394
- The overhead is **fixed** regardless of pipeline duration:
395
-
396
- | Pipeline Type | Duration | TracePipe Overhead | Actual Slowdown |
397
- |--------------|----------|-------------------|-----------------|
398
- | Quick data cleaning | 10 seconds | +5 seconds | **1.5x** |
399
- | ETL pipeline | 5 minutes | +10 seconds | **1.03x** |
400
- | Feature engineering + model training | 1 hour | +15 seconds | **1.004x** (0.4%) |
401
- | Full ML workflow | 3 hours | +20 seconds | **< 1.002x** (< 0.2%) |
402
-
403
- **Why?** TracePipe only tracks data transformations. It does NOT slow down:
404
- - ❌ Model training (scikit-learn, PyTorch, etc.)
405
- - ❌ I/O operations (reading/writing files, databases)
406
- - ❌ Network calls (APIs, distributed computing)
407
- - ❌ Complex pandas aggregations (rolling windows, complex groupby)
408
-
409
- ### Memory Usage
410
-
411
- - **Columnar storage**: ~40 bytes per diff
412
- - **Example**: 1M cell changes = ~40 MB memory
413
- - **Automatic spillover**: Configurable threshold (default: 500K diffs)
414
- - **Mass update detection**: Skips cell diffs when threshold exceeded
415
-
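- A quick way to budget memory from the ~40 bytes/diff figure:
-
- ```python
- expected_changes = 2_000_000  # rough guess for your pipeline
- approx_mb = expected_changes * 40 / 1_000_000
- print(f"~{approx_mb:.0f} MB of lineage storage")  # ~80 MB
- ```
-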
416
- ### Production Recommendations
417
-
418
- ✅ **Safe for production** when:
419
- - Pipeline takes > 1 minute
420
- - You need debugging/audit capabilities
421
- - Memory can accommodate ~40 bytes per expected change
422
-
423
- ⚠️ **Consider overhead** when:
424
- - Pipeline is < 30 seconds
425
- - Running in tight loops (thousands of iterations)
426
- - Extremely memory-constrained environment
427
-
428
- ### Configuration for Performance
429
-
430
- ```python
431
- from tracepipe import TracePipeConfig
432
-
433
- config = TracePipeConfig(
434
- max_diffs_in_memory=500_000, # Reduce if memory-constrained
435
- max_diffs_per_step=100_000, # Mass updates skip cell diffs
436
- max_group_membership_size=100_000, # Large groups store count-only
437
- )
438
-
439
- tracepipe.enable(config=config)
440
-
441
- # Only watch columns you need to debug
442
- tracepipe.watch("age", "income") # Not all columns
443
- ```
444
-
445
- ### Running Benchmarks
446
-
447
- ```bash
448
- # Operation-level benchmarks
449
- python examples/benchmark_overhead.py
450
-
451
- # Scale tests
452
- python examples/demo_50k_scale_test.py
453
- python examples/demo_5m_stress_test.py
454
- ```
455
-
456
- ## Use Cases
457
-
458
- ### Debugging Data Quality Issues
459
- ```python
460
- # Which rows were dropped and why?
461
- for step, count in tracepipe.dropped_rows_by_step().items():
462
- print(f"{step}: {count} rows dropped")
463
- ```
464
-
465
- ### Compliance & Audit
466
- ```python
467
- # Export complete data lineage for audit
468
- tracepipe.export_json("audit_trail.json")
469
- ```
470
-
471
- ### Understanding Aggregations
472
- ```python
473
- # Which transactions contributed to this customer's total?
474
- group = tracepipe.explain_group("customer_123")
475
- for row_id in group.row_ids:
476
- print(tracepipe.explain(row_id).history())
477
- ```
478
-
479
- ## Development
480
-
481
- ```bash
482
- # Install for development
483
- pip install -e ".[dev]"
484
-
485
- # Run tests
486
- PYTHONPATH=. python -m pytest tests/ -v
487
-
488
- # Run demo
489
- PYTHONPATH=. python examples/demo_v2.py
490
- ```
491
-
492
- ## License
493
-
494
- MIT License - see LICENSE file for details.
495
-
496
- ## Changelog
497
-
498
- ### v0.2.0 (2026-01-28)
499
-
500
- **Major rewrite with row-level provenance:**
501
- - Row identity tracking through all operations
502
- - Cell-level diffs for watched columns
503
- - Aggregation group membership tracking
504
- - Thread-safe context (per-thread isolation)
505
- - Protocol-based extensibility
506
- - Memory-efficient columnar storage (SoA pattern)
507
- - Automatic spillover to disk
508
- - Safe instrumentation (never crashes user code)
@@ -1,19 +0,0 @@
1
- tracepipe/__init__.py,sha256=etBJ3loGRsMCeNueO7c-jmCexYeGjS6KHcytHjm4mxs,2389
2
- tracepipe/api.py,sha256=lh1iKL9Ivuf0NQVhPi6hn8pWi_dXC1bSahpMMgjLaJw,15817
3
- tracepipe/context.py,sha256=wPsxBR7M60G-mwj18YhJd6Ci1Tdj5cHIcvCptbhytQc,3205
4
- tracepipe/core.py,sha256=sRJ6d4w5Fa0nAch3LQwjITGCQMkeNpdIZxmQcsy-DdE,3294
5
- tracepipe/safety.py,sha256=-16OTg6iU_31vS3x2fiwgOTr1wFXB6SRSBDe6L0PfBE,5442
6
- tracepipe/instrumentation/__init__.py,sha256=pd0n6Z9m_V3gcBv097cXWFOZEzAP9sAq1jjQnNRrDZ8,222
7
- tracepipe/instrumentation/pandas_inst.py,sha256=WOfuuMC9NqMRZTQP-X7p0Cgsy4CbY_5QQJ6O4xzPpwk,32556
8
- tracepipe/storage/__init__.py,sha256=pGFMfbIgIi2kofVPwYDqe2HTYMYJoabiGjTq77pYi-g,348
9
- tracepipe/storage/base.py,sha256=dp-ra7qGfJKdKYxGE5JzphxpkUh9maT2QH_UZi1LZz8,4717
10
- tracepipe/storage/lineage_store.py,sha256=CqHdeHa9BNLPVNYPRRmTSoXAUO_JDkZi5kE7oQnDzwU,19403
11
- tracepipe/storage/row_identity.py,sha256=j_7B0mH0cz8YXnlRBDoejmFzkd742creAT4Rqtb9CJs,7654
12
- tracepipe/utils/__init__.py,sha256=CI_GXViCjdMbu1j6HuzZhoQZEW0sIB6WAve6j5pfOC0,182
13
- tracepipe/utils/value_capture.py,sha256=wGgegQmJnVHxHbwHSH9di7JAOBChzD3ERJrabZNiayk,4092
14
- tracepipe/visualization/__init__.py,sha256=M3s44ZTUNEToyghjhQW0FgbmWHKPr4Xc-7iNF6DpI_E,132
15
- tracepipe/visualization/html_export.py,sha256=f74qmg1PgXMYPuDZsNhRTEOi27CsAtKrC_jiI50cUmc,41833
16
- tracepipe-0.2.0.dist-info/METADATA,sha256=QHZiXA2HT1TI-j7iWHy84KMAbeyPG4Cx3hq5oZX0ZQU,15377
17
- tracepipe-0.2.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
18
- tracepipe-0.2.0.dist-info/licenses/LICENSE,sha256=HMOAFHBClL79POwWL-2_aDcx42DJAq7Ce-nwJPvMB9U,1075
19
- tracepipe-0.2.0.dist-info/RECORD,,