PyPI - tracepipe - Versions diffs - 0.4.1__tar.gz → 0.4.2__tar.gz - Mend

tracepipe 0.4.1.tar.gz → 0.4.2.tar.gz


Files changed (151)

{tracepipe-0.4.1 → tracepipe-0.4.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tracepipe
-Version: 0.4.1
+Version: 0.4.2
 Summary: Row-level data lineage tracking for pandas pipelines
 Project-URL: Homepage, https://github.com/tracepipe/tracepipe
 Project-URL: Documentation, https://tracepipe.github.io/tracepipe/

{tracepipe-0.4.1 → tracepipe-0.4.2}/docs/changelog.md RENAMED

@@ -5,6 +5,27 @@ All notable changes to TracePipe will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.4.2] - 2026-02-04
+### Fixed
+- **`CheckResult` change tracking**: Added `n_changes` and `changes_by_op` properties in debug mode to track value changes across pipeline steps
+- **`TraceResult` status fields**: Added `status`, `dropped_by`, and `dropped_at_step` properties for clearer dropped row analysis
+- **`DiffResult` completeness**: Added `cells_changed`, `changes_by_column`, `rows_unchanged`, and `changed_rows` for detailed snapshot comparison
+- **Ghost value API**: Implemented `dbg.get_ghost_values(row_id)` for retrieving last known values of dropped rows
+- **Merge provenance**: `trace.origin` and `trace.merge_origin` now properly populated for merged rows
+- **Documentation alignment**: All documented APIs now match actual implementation with comprehensive test coverage
+### Changed
+- **`tp.trace()` API enhancement**: Added `row_id=` parameter for explicit internal row ID tracking
+  - `row=` now strictly refers to DataFrame positional index
+  - `row_id=` refers to TracePipe's internal row identifier (stable across operations)
+  - Supports tracing dropped rows by ID: `tp.trace(df, row_id=42)`
+- **`tp.why()` API enhancement**: Added `row_id=` parameter matching `tp.trace()` signature
+### Added
+- Comprehensive test suite (`test_doc_api_alignment.py`) with 27 tests validating documented API features
+- Better error messages for out-of-bounds row access
 ## [0.4.1] - 2026-02-04
 ### Fixed

{tracepipe-0.4.1 → tracepipe-0.4.2}/docs/getting-started/quickstart.md RENAMED

@@ -43,7 +43,7 @@ Output:
 TracePipe Check: [OK] Pipeline healthy
   Mode: debug
-Retention: 2/4 (50.0%)
+Retention: 50%
 Dropped: 2 rows
   • DataFrame.dropna: 1
   • DataFrame.__getitem__[mask]: 1
@@ -52,6 +52,17 @@ Value changes: 2 cells
   • DataFrame.__setitem__[total]: 2
 ```
+The `CheckResult` object provides convenient properties:
+```python
+result.passed       # True/False
+result.retention    # 0.5 (row retention rate)
+result.n_dropped    # 2 (total dropped rows)
+result.drops_by_op  # {"DataFrame.dropna": 1, ...}
+result.n_changes    # 2 (cell changes, debug mode only)
+result.changes_by_op # {"DataFrame.__setitem__[total]": 2}
+```
 ## 4. Trace a Row's Journey
 ```python

{tracepipe-0.4.1 → tracepipe-0.4.2}/docs/guide/cell-provenance.md RENAMED

@@ -140,9 +140,17 @@ For dropped rows, you can still query their last known values:
 ```python
 dbg = tp.debug.inspect()
-# Get ghost values for a dropped row
+# Get ghost values for a specific dropped row
 dropped_rid = list(dbg.dropped_rows())[0]
 ghost = dbg.get_ghost_values(dropped_rid)
 print(f"Last known values: {ghost}")
+# {"age": 25, "salary": 50000}
+# Or get all ghost rows as a DataFrame
+ghost_df = dbg.ghost_rows()
+print(ghost_df)
+# DataFrame with __tp_row_id__, __tp_dropped_by__, and watched columns
 ```
+The `get_ghost_values(row_id)` method returns a dict mapping column names to
+their last known values, or `None` if the row wasn't found in ghost storage.

{tracepipe-0.4.1 → tracepipe-0.4.2}/docs/guide/row-tracing.md RENAMED

@@ -20,12 +20,15 @@ Output:
 Row 42 Journey:
   Status: [OK] Alive
-  Events: 3
-    [SURVIVED] DataFrame.dropna
+  Events: 1
     [MODIFIED] DataFrame.fillna: income
-    [SURVIVED] DataFrame.__getitem__[mask]
 ```
+!!! note "Event Recording"
+    TracePipe records MODIFIED events for cells that change in watched columns.
+    Rows that pass through operations unchanged are not recorded as separate events
+    (they are implicitly "survived"). Drop events are recorded for filtered rows.
 ## The TraceResult Object
 ```python
@@ -34,12 +37,17 @@ trace = tp.trace(df, row=0)
 # Access fields
 trace.row_id           # int: internal row ID
 trace.status           # str: "alive" or "dropped"
-trace.events           # list[TraceEvent]: all events
+trace.is_alive         # bool: True if row still exists
+trace.events           # list[dict]: all events for this row
 # For dropped rows
 trace.dropped_by       # str: operation that dropped the row
 trace.dropped_at_step  # int: step number
+# Provenance (v0.4+)
+trace.origin           # dict: {"type": "concat"|"merge", ...} or None
+trace.representative   # dict: for dedup-dropped rows, which row was kept
 # Export
 trace.to_dict()        # dict representation
 ```
@@ -74,10 +82,16 @@ tp.trace(df, where={"email": None})
 | Event Type | Description |
 |------------|-------------|
-| `SURVIVED` | Row passed through operation unchanged |
-| `MODIFIED` | One or more cells changed |
-| `DROPPED` | Row was removed |
-| `CREATED` | Row first appeared (e.g., from merge) |
+| `MODIFIED` | One or more cells changed in watched columns |
+| `DROPPED` | Row was removed by a filter operation |
+!!! note "Design Note"
+    TracePipe does not explicitly record "SURVIVED" events because they would
+    create excessive noise for most pipelines. Instead, rows that exist in the
+    final DataFrame are implicitly considered to have survived all operations.
+    If you need to know which operations a row passed through, check the
+    `steps` list via `tp.debug.inspect().steps`.
 ## Tracing Dropped Rows

{tracepipe-0.4.1 → tracepipe-0.4.2}/docs/guide/snapshots.md RENAMED

@@ -34,14 +34,17 @@ Output:
 ```
 Snapshot Diff:
-  Rows: 1000 → 847 (-153)
-  Columns: ['id', 'price', 'qty'] → ['id', 'price', 'qty'] (unchanged)
+  - 153 rows removed
+  ! 153 new drops
   Changes:
-    - 153 rows removed
-    - 847 cells modified in 'price'
+    - 847 cells modified
+      price: 847
 ```
+!!! tip "Enabling Cell-Level Diff"
+    To see cell-level changes, create snapshots with `include_values=True`.
 ## The Snapshot Object
 ```python
@@ -64,21 +67,38 @@ snapshot.data          # DataFrame copy (if captured)
 ```python
 diff = tp.diff(before, after)
-# Access fields
-diff.rows_added        # int: new rows
-diff.rows_removed      # int: removed rows
-diff.rows_unchanged    # int: unchanged rows
-diff.cells_changed     # int: modified cells
+# Row-level changes (always available)
+diff.rows_added        # set[int]: IDs of new rows
+diff.rows_removed      # set[int]: IDs of removed rows
+diff.new_drops         # set[int]: newly dropped row IDs
+diff.recovered_rows    # set[int]: rows that were dropped but now exist
 # Column changes
 diff.columns_added     # list[str]: new columns
 diff.columns_removed   # list[str]: removed columns
-# Detailed changes (if both snapshots have data)
-diff.changed_rows      # set[int]: IDs of changed rows
+# Cell-level changes (requires include_values=True on both snapshots)
+diff.cells_changed     # int: total modified cells
+diff.changed_rows      # set[int]: IDs of rows with value changes
 diff.changes_by_column # dict: {col: count}
+# Stats changes
+diff.stats_changes     # dict: {col: {metric: (old, new)}}
+diff.drops_delta       # dict: {operation: delta_count}
 ```
+!!! note "Cell-Level Diff Requirements"
+    To get `cells_changed` and `changes_by_column`, both snapshots must be
+    created with `include_values=True`:
+    ```python
+    before = tp.snapshot(df, include_values=True)
+    # ... operations ...
+    after = tp.snapshot(df, include_values=True)
+    diff = tp.diff(before, after)
+    print(f"{diff.cells_changed} cells modified")
+    ```
 ## Options
 ### Include Data
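
The `DiffResult` fields documented in the snapshots.md hunk above describe row-level changes as sets of row IDs and cell-level changes as counts per column. As a rough illustration of those semantics, here is a minimal plain-Python sketch. The sample IDs, the `before_vals`/`after_vals` dictionaries, and the comparison logic are assumptions made for illustration; this diff does not show how TracePipe computes these fields internally.

```python
# Hypothetical snapshots: row IDs present before and after some operations.
before_ids = {0, 1, 2, 3}
after_ids = {0, 2, 4}

# Row-level changes expressed as set differences.
rows_added = after_ids - before_ids
rows_removed = before_ids - after_ids
print(sorted(rows_added))    # [4]
print(sorted(rows_removed))  # [1, 3]

# Hypothetical captured values (the analogue of include_values=True).
before_vals = {0: {"price": 10.0}, 2: {"price": 5.0}}
after_vals = {0: {"price": 12.0}, 2: {"price": 5.0}}

# Cell-level changes are only meaningful for rows present in both snapshots.
changed_rows = set()
changes_by_column = {}
for rid in before_ids & after_ids:
    for col, old in before_vals[rid].items():
        if after_vals[rid].get(col) != old:
            changed_rows.add(rid)
            changes_by_column[col] = changes_by_column.get(col, 0) + 1

cells_changed = sum(changes_by_column.values())
print(cells_changed, changed_rows, changes_by_column)  # 1 {0} {'price': 1}
```

Without captured values on both sides, only the set-based part of this comparison is possible, which is consistent with the documented `include_values=True` requirement.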

{tracepipe-0.4.1 → tracepipe-0.4.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "tracepipe"
-version = "0.4.1"
+version = "0.4.2"
 description = "Row-level data lineage tracking for pandas pipelines"
 readme = "README.md"
 license = {file = "LICENSE"}

{tracepipe-0.4.1 → tracepipe-0.4.2}/tests/test_convenience_debug.py RENAMED

@@ -333,10 +333,14 @@ class TestTraceResult:
         tp.enable(mode="debug")
         df = pd.DataFrame({"a": [1, None, 3]})
         df = df.dropna()
-        result = tp.trace(df, row=1)
+        # Use row_id parameter to trace a dropped row by its internal ID
+        dbg = tp.debug.inspect()
+        dropped = dbg.dropped_rows()
+        assert len(dropped) >= 1
+        result = tp.trace(df, row_id=dropped[0])
         assert result.is_alive is False
         text = str(result)
-        assert "Dropped" in text or "X" in text
+        assert "Dropped" in text or "DROPPED" in text
     def test_trace_with_events(self):
         """TraceResult shows events when cell is modified."""
@@ -386,7 +390,11 @@ class TestTraceResult:
         tp.enable(mode="debug", watch=["a"])
         df = pd.DataFrame({"a": [1, 2, 3]})
         df = df.head(1)  # Drop rows 1 and 2
-        result = tp.trace(df, row=1)  # Trace dropped row
+        # Use row_id parameter to trace a dropped row
+        dbg = tp.debug.inspect()
+        dropped = dbg.dropped_rows()
+        assert len(dropped) >= 1
+        result = tp.trace(df, row_id=dropped[0])  # Trace dropped row by ID
         # Dropped row should have ghost values in debug mode
         text = result.to_text(verbose=True)
         assert result.is_alive is False

{tracepipe-0.4.1 → tracepipe-0.4.2}/tests/test_public_api.py RENAMED

@@ -100,8 +100,13 @@ class TestTrace:
         tp.enable(mode="debug")
         df = pd.DataFrame({"a": [1, None, 3]})
         df = df.dropna()
-        result = tp.trace(df, row=1)
+        # Use row_id parameter to trace dropped row
+        dbg = tp.debug.inspect()
+        dropped = dbg.dropped_rows()
+        assert len(dropped) >= 1
+        result = tp.trace(df, row_id=dropped[0])
         assert result is not None
+        assert result.is_alive is False
     def test_trace_with_where(self):
         """trace() with where clause."""

{tracepipe-0.4.1 → tracepipe-0.4.2}/tests/test_row_provenance.py RENAMED

@@ -569,10 +569,8 @@ class TestTraceResultOriginProperty:
         result = df1.merge(df2, on="key")
-        # Use the actual row_id from the result DataFrame
-        ctx = get_context()
-        result_rids = ctx.row_manager.get_ids_array(result)
-        trace = tp.trace(result, row=result_rids[0])
+        # Use row=0 to trace the first row in the result DataFrame
+        trace = tp.trace(result, row=0)
         # Should have merge origin
         assert trace.origin is not None

{tracepipe-0.4.1 → tracepipe-0.4.2}/tracepipe/__init__.py RENAMED

@@ -81,7 +81,7 @@ from .core import TracePipeConfig, TracePipeMode
 from .snapshot import DiffResult, Snapshot, diff, snapshot
 # === VERSION ===
-__version__ = "0.4.1"
+__version__ = "0.4.2"
 # === MINIMAL __all__ ===
 __all__ = [

{tracepipe-0.4.1 → tracepipe-0.4.2}/tracepipe/convenience.py RENAMED

@@ -60,6 +60,8 @@ class CheckResult:
         .retention    - Row retention rate (0.0-1.0)
         .n_dropped    - Total rows dropped
         .drops_by_op  - Drops broken down by operation
+        .n_changes    - Total cell-level changes (debug mode only)
+        .changes_by_op - Changes broken down by operation (debug mode only)
     """
     ok: bool
@@ -69,6 +71,9 @@ class CheckResult:
     mode: str
     # Internal: store drops_by_op so we don't need to recompute
     _drops_by_op: dict[str, int] = field(default_factory=dict)
+    # Internal: store cell change counts (debug mode only)
+    _n_changes: int = 0
+    _changes_by_op: dict[str, int] = field(default_factory=dict)
     # === CONVENIENCE PROPERTIES ===
@@ -97,6 +102,16 @@ class CheckResult:
         """Total pipeline steps recorded."""
         return self.facts.get("total_steps", 0)
+    @property
+    def n_changes(self) -> int:
+        """Total cell-level changes (debug mode only, 0 if not tracked)."""
+        return self._n_changes
+    @property
+    def changes_by_op(self) -> dict[str, int]:
+        """Cell changes broken down by operation (debug mode only)."""
+        return self._changes_by_op
     # === EXISTING PROPERTIES ===
     @property
@@ -127,6 +142,20 @@ class CheckResult:
         lines.append(f"TracePipe Check: {status}")
         lines.append(f"  Mode: {self.mode}")
+        # Always show key metrics in compact form
+        if self.retention is not None:
+            lines.append(f"\nRetention: {int(self.retention * 100)}%")
+        if self.n_dropped > 0:
+            lines.append(f"Dropped: {self.n_dropped} rows")
+            if self.drops_by_op:
+                for op, count in list(self.drops_by_op.items())[:5]:
+                    lines.append(f"  • {op}: {count}")
+        if self.n_changes > 0:
+            lines.append(f"\nValue changes: {self.n_changes} cells")
+            if self.changes_by_op:
+                for op, count in list(self.changes_by_op.items())[:5]:
+                    lines.append(f"  • {op}: {count}")
         if verbose and self.facts:
             lines.append("\n  Measured facts:")
             for k, v in self.facts.items():
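
A detail worth noting in the hunk above: the new retention line is formatted with `int(self.retention * 100)`, which truncates rather than rounds. The sketch below is plain Python and independent of TracePipe; both helper names are hypothetical. It shows how truncation can report one percentage point less than expected when the product falls just below a whole number in binary floating point.

```python
def retention_truncated(retention: float) -> str:
    """Format the way the hunk above does: truncate toward zero."""
    return f"{int(retention * 100)}%"


def retention_rounded(retention: float) -> str:
    """Hypothetical alternative: round to the nearest whole percent."""
    return f"{round(retention * 100)}%"


print(retention_truncated(0.5))   # 50%
print(retention_truncated(0.29))  # 28% (0.29 * 100 is slightly below 29 as a float)
print(retention_rounded(0.29))    # 29%
```

For the quickstart example (2 of 4 rows retained) both forms print `50%`, so the difference only shows up for less convenient ratios.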
@@ -158,6 +187,8 @@ class CheckResult:
             "n_dropped": self.n_dropped,
             "n_steps": self.n_steps,
             "drops_by_op": self.drops_by_op,
+            "n_changes": self.n_changes,
+            "changes_by_op": self.changes_by_op,
             "facts": self.facts,
             "suggestions": self.suggestions,
             "warnings": [
@@ -191,6 +222,7 @@ class TraceResult:
     Events are in CHRONOLOGICAL order (oldest->newest).
     Key attributes:
+        status: "alive" or "dropped" (string representation)
         origin: Where this row came from (concat, merge, or original)
         representative: If dropped by dedup, which row was kept instead
     """
@@ -207,6 +239,27 @@ class TraceResult:
     # v0.4+ provenance
     concat_origin: dict[str, Any] | None = None
     dedup_representative: dict[str, Any] | None = None
+    # Steps this row survived (for SURVIVED event generation)
+    _survived_steps: list[dict[str, Any]] = field(default_factory=list)
+    @property
+    def status(self) -> str:
+        """Row status as string: 'alive' or 'dropped'."""
+        return "alive" if self.is_alive else "dropped"
+    @property
+    def dropped_by(self) -> str | None:
+        """Operation that dropped this row, or None if alive."""
+        if self.dropped_at:
+            return self.dropped_at.get("operation")
+        return None
+    @property
+    def dropped_at_step(self) -> int | None:
+        """Step number where this row was dropped, or None if alive."""
+        if self.dropped_at:
+            return self.dropped_at.get("step_id")
+        return None
     @property
     def n_events(self) -> int:
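
The three properties added in the hunk above are all derived from two existing fields, `is_alive` and `dropped_at`. The following sketch reproduces that pattern on a stripped-down stand-in class. `MiniTrace` is a hypothetical name and omits every other `TraceResult` field; only the `"operation"` and `"step_id"` keys are taken from the diff.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class MiniTrace:
    """Hypothetical stand-in for TraceResult with only the fields used here."""

    row_id: int
    is_alive: bool
    dropped_at: Optional[Dict[str, Any]] = None

    @property
    def status(self) -> str:
        # String form of is_alive, as in the diff.
        return "alive" if self.is_alive else "dropped"

    @property
    def dropped_by(self) -> Optional[str]:
        # None for alive rows or when no drop record exists.
        return self.dropped_at.get("operation") if self.dropped_at else None

    @property
    def dropped_at_step(self) -> Optional[int]:
        return self.dropped_at.get("step_id") if self.dropped_at else None


alive = MiniTrace(row_id=0, is_alive=True)
dropped = MiniTrace(
    row_id=1,
    is_alive=False,
    dropped_at={"step_id": 1, "operation": "DataFrame.dropna"},
)
print(alive.status, alive.dropped_by)  # alive None
print(dropped.status, dropped.dropped_by, dropped.dropped_at_step)  # dropped DataFrame.dropna 1
```

Because the properties are computed on access, they cannot drift out of sync with `dropped_at`.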
@@ -258,8 +311,10 @@ class TraceResult:
         """Export to dictionary."""
         return {
             "row_id": self.row_id,
+            "status": self.status,
             "is_alive": self.is_alive,
             "dropped_at": self.dropped_at,
+            "dropped_by": self.dropped_at.get("operation") if self.dropped_at else None,
             "origin": self.origin,
             "representative": self.representative,
             "n_events": self.n_events,
@@ -280,10 +335,11 @@ class TraceResult:
         lines = [f"Row {self.row_id} Journey:"]
+        # Status line matches documentation format
         if self.is_alive:
             lines.append("  Status: [OK] Alive")
         else:
-            lines.append("  Status: [X] Dropped")
+            lines.append("  Status: [DROPPED]")
             if self.dropped_at:
                 lines.append(
                     f"    at step {self.dropped_at['step_id']}: {self.dropped_at['operation']}"
@@ -579,6 +635,21 @@ def check(
         if count > 1000:
             suggestions.append(f"'{op}' dropped {count} rows - review if intentional")
+    # === CELL CHANGES (debug mode only) ===
+    n_changes = 0
+    changes_by_op: dict[str, int] = {}
+    if ctx.config.mode == TracePipeMode.DEBUG:
+        # Count non-drop diffs (cell-level changes)
+        step_map = {s.step_id: s.operation for s in ctx.store.steps}
+        for i in range(len(ctx.store.diff_step_ids)):
+            col = ctx.store.diff_cols[i]
+            if col != "__row__":  # Skip drop events
+                n_changes += 1
+                step_id = ctx.store.diff_step_ids[i]
+                op = step_map.get(step_id, "unknown")
+                changes_by_op[op] = changes_by_op.get(op, 0) + 1
+        facts["n_changes"] = n_changes
     ok = len([w for w in warnings_list if w.severity == "fact"]) == 0
     return CheckResult(
@@ -588,6 +659,8 @@ def check(
         suggestions=suggestions,
         mode=ctx.config.mode.value,
         _drops_by_op=drops_by_op,
+        _n_changes=n_changes,
+        _changes_by_op=changes_by_op,
     )
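
The cell-change counting added to `check()` walks two parallel arrays (`diff_step_ids` and `diff_cols`), skips entries whose column is `"__row__"`, and tallies the rest by operation name. A minimal standalone sketch of the same tally is shown below. The sample data is hypothetical and stands in for `ctx.store`; using `collections.Counter` is a substitution for the manual dictionary updates in the diff.

```python
from collections import Counter

# Hypothetical sample data standing in for ctx.store's steps and parallel arrays.
steps = {1: "DataFrame.dropna", 2: "DataFrame.__setitem__[total]"}
diff_step_ids = [1, 2, 2]
diff_cols = ["__row__", "total", "total"]

changes_by_op = Counter(
    steps.get(step_id, "unknown")
    for step_id, col in zip(diff_step_ids, diff_cols)
    if col != "__row__"  # "__row__" marks a drop event, not a cell change
)
n_changes = sum(changes_by_op.values())

print(n_changes)            # 2
print(dict(changes_by_op))  # {'DataFrame.__setitem__[total]': 2}
```

With this sample data the result matches the quickstart output shown earlier in the diff: two value changes, both attributed to `DataFrame.__setitem__[total]`.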
@@ -595,6 +668,7 @@ def trace(
     df: pd.DataFrame,
     *,
     row: int | None = None,
+    row_id: int | None = None,
     where: dict[str, Any] | None = None,
     include_ghost: bool = True,
 ) -> TraceResult | list[TraceResult]:
@@ -603,7 +677,8 @@ def trace(
     Args:
         df: DataFrame to search in
-        row: Row ID (if known)
+        row: Row position (0-based index into current DataFrame)
+        row_id: Internal row ID (use for tracing dropped rows)
         where: Selector dict, e.g. {"customer_id": "C123"}
         include_ghost: Include last-known values for dropped rows
@@ -612,8 +687,14 @@ def trace(
         Use print(result) for pretty output, result.to_dict() for data.
     Examples:
-        result = tp.trace(df, row=5)
-        print(result)
+        # Trace by position in current DataFrame
+        result = tp.trace(df, row=0)  # First row
+        # Trace by internal row ID (for dropped rows)
+        dropped = tp.debug.inspect().dropped_rows()
+        result = tp.trace(df, row_id=dropped[0])
+        # Trace by business key
         tp.trace(df, where={"customer_id": "C123"})
     """
     ctx = get_context()
@@ -624,12 +705,30 @@ def trace(
         pass
     # Resolve row IDs
-    if row is not None:
-        row_ids = [row]
+    if row_id is not None:
+        # Direct row ID specified - use as-is
+        row_ids = [row_id]
+    elif row is not None:
+        # row= is a DataFrame index position (0-based), not a row ID
+        # Convert to actual row ID using the DataFrame's registered IDs
+        rids = ctx.row_manager.get_ids_array(df)
+        if rids is not None:
+            # Handle negative indexing
+            if row < 0:
+                row = len(rids) + row
+            if 0 <= row < len(rids):
+                row_ids = [int(rids[row])]
+            else:
+                raise ValueError(
+                    f"Row index {row} out of bounds for DataFrame with {len(rids)} rows"
+                )
+        else:
+            # DataFrame not tracked - use row as-is (legacy behavior)
+            row_ids = [row]
     elif where is not None:
         row_ids = _resolve_where(df, where, ctx)
     else:
-        raise ValueError("Must provide 'row' or 'where'")
+        raise ValueError("Must provide 'row', 'row_id', or 'where'")
     results = []
     for rid in row_ids:
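
The hunk above changes `row=` from "row ID" to "position in the current DataFrame", with support for negative positions and a bounds check. The sketch below isolates that mapping in a standalone function. `resolve_position` is a hypothetical helper, and a plain list stands in for the array returned by `ctx.row_manager.get_ids_array(df)`. One deliberate difference: the diff reassigns `row` before the bounds check, so its error message reports the adjusted value, whereas this sketch keeps the caller's original value.

```python
from typing import Optional, Sequence


def resolve_position(row: int, rids: Optional[Sequence[int]]) -> int:
    """Map a 0-based (or negative) position to an internal row ID."""
    if rids is None:
        # Untracked DataFrame: the diff falls back to using `row` as-is.
        return row
    n = len(rids)
    pos = row + n if row < 0 else row
    if not 0 <= pos < n:
        raise ValueError(f"Row index {row} out of bounds for DataFrame with {n} rows")
    return int(rids[pos])


# Suppose the row with internal ID 1 was dropped, leaving IDs 0 and 2.
surviving_ids = [0, 2]
print(resolve_position(0, surviving_ids))   # 0
print(resolve_position(1, surviving_ids))   # 2
print(resolve_position(-1, surviving_ids))  # 2
```

This also shows why the updated tests switch to `row_id=`: after a drop, position 1 refers to a surviving row, so a dropped row can no longer be reached through `row=`.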
@@ -644,6 +743,7 @@ def why(
     *,
     col: str,
     row: int | None = None,
+    row_id: int | None = None,
     where: dict[str, Any] | None = None,
 ) -> WhyResult | list[WhyResult]:
     """
@@ -652,7 +752,8 @@ def why(
     Args:
         df: DataFrame to search in
         col: Column name to trace
-        row: Row ID (if known)
+        row: Row position (0-based index into current DataFrame)
+        row_id: Internal row ID (use for cells in dropped rows)
         where: Selector dict, e.g. {"customer_id": "C123"}
     Returns:
@@ -660,7 +761,7 @@ def why(
         Use print(result) for pretty output, result.to_dict() for data.
     Examples:
-        result = tp.why(df, col="amount", row=5)
+        result = tp.why(df, col="amount", row=0)  # First row
         print(result)
         tp.why(df, col="email", where={"user_id": "U123"})
     """
@@ -676,12 +777,30 @@ def why(
         )
     # Resolve row IDs
-    if row is not None:
-        row_ids = [row]
+    if row_id is not None:
+        # Direct row ID specified - use as-is
+        row_ids = [row_id]
+    elif row is not None:
+        # row= is a DataFrame index position (0-based), not a row ID
+        # Convert to actual row ID using the DataFrame's registered IDs
+        rids = ctx.row_manager.get_ids_array(df)
+        if rids is not None:
+            # Handle negative indexing
+            if row < 0:
+                row = len(rids) + row
+            if 0 <= row < len(rids):
+                row_ids = [int(rids[row])]
+            else:
+                raise ValueError(
+                    f"Row index {row} out of bounds for DataFrame with {len(rids)} rows"
+                )
+        else:
+            # DataFrame not tracked - use row as-is (legacy behavior)
+            row_ids = [row]
     elif where is not None:
         row_ids = _resolve_where(df, where, ctx)
     else:
-        raise ValueError("Must provide 'row' or 'where'")
+        raise ValueError("Must provide 'row', 'row_id', or 'where'")
     results = []
     for rid in row_ids:

{tracepipe-0.4.1 → tracepipe-0.4.2}/tracepipe/debug.py RENAMED

@@ -179,6 +179,46 @@ class DebugInspector:
         ctx = get_context()
         return ctx.row_manager.get_ghost_rows(limit=limit)
+    def get_ghost_values(self, row_id: int) -> dict[str, Any] | None:
+        """
+        Get last-known values for a specific dropped row (DEBUG mode only).
+        Args:
+            row_id: The row ID to look up
+        Returns:
+            Dict mapping column names to their last known values,
+            or None if the row was not found in ghost storage.
+        Example:
+            dbg = tp.debug.inspect()
+            dropped_rid = list(dbg.dropped_rows())[0]
+            ghost = dbg.get_ghost_values(dropped_rid)
+            print(f"Last known values: {ghost}")
+        """
+        ctx = get_context()
+        ghost_df = ctx.row_manager.get_ghost_rows(limit=100000)
+        if ghost_df.empty or "__tp_row_id__" not in ghost_df.columns:
+            return None
+        row_match = ghost_df[ghost_df["__tp_row_id__"] == row_id]
+        if row_match.empty:
+            return None
+        # Convert to dict and remove internal columns
+        result = row_match.iloc[0].to_dict()
+        internal_cols = [
+            "__tp_row_id__",
+            "__tp_dropped_by__",
+            "__tp_dropped_step__",
+            "__tp_original_position__",
+        ]
+        for col in internal_cols:
+            result.pop(col, None)
+        return result
     def stats(self) -> dict:
         """Get comprehensive tracking statistics."""
         ctx = get_context()
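
`get_ghost_values` in the hunk above filters the ghost-row DataFrame by `__tp_row_id__`, takes the first match, and strips TracePipe's internal columns before returning a dict. The sketch below is a plain-Python analogue that uses a list of dicts instead of a DataFrame so it runs without pandas. The function name `ghost_values` and the sample record are hypothetical; the internal column names are copied from the diff.

```python
from typing import Any, Dict, List, Optional

# Internal bookkeeping columns listed in the diff.
INTERNAL_COLS = (
    "__tp_row_id__",
    "__tp_dropped_by__",
    "__tp_dropped_step__",
    "__tp_original_position__",
)


def ghost_values(ghost_rows: List[Dict[str, Any]], row_id: int) -> Optional[Dict[str, Any]]:
    """Return user-facing last-known values for row_id, or None if absent."""
    for record in ghost_rows:
        if record.get("__tp_row_id__") == row_id:
            return {k: v for k, v in record.items() if k not in INTERNAL_COLS}
    return None


rows = [
    {
        "__tp_row_id__": 1,
        "__tp_dropped_by__": "DataFrame.dropna",
        "age": 25,
        "salary": 50000,
    }
]
print(ghost_values(rows, 1))   # {'age': 25, 'salary': 50000}
print(ghost_values(rows, 99))  # None
```

Callers should check for `None` before indexing into the result, since a row ID that was never dropped (or is not held in ghost storage) has no entry.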
