PyPI - recursive-cleaner - Versions diffs - 0.7.0__tar.gz → 0.8.0__tar.gz - Mend



Files changed (94)

{recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/CLAUDE.md RENAMED

@@ -4,7 +4,9 @@
 | Version | Status | Date |
 |---------|--------|------|
-| v0.6.0 | **Implemented** | 2025-01-15 |
+| v0.8.0 | **Implemented** | 2025-01-19 |
+| v0.7.0 | Implemented | 2025-01-17 |
+| v0.6.0 | Implemented | 2025-01-15 |
 | v0.5.1 | Implemented | 2025-01-15 |
 | v0.5.0 | Implemented | 2025-01-15 |
 | v0.4.0 | Implemented | 2025-01-15 |
@@ -12,9 +14,11 @@
 | v0.2.0 | Implemented | 2025-01-14 |
 | v0.1.0 | Implemented | 2025-01-14 |
-**Current State**: v0.6.0 complete. 392 tests passing, 2,967 lines total.
+**Current State**: v0.8.0 complete. 465 tests passing.
 ### Version History
+- **v0.8.0**: Terminal UI with Rich dashboard, mission control aesthetic, transmission log
+- **v0.7.0**: Markitdown integration (20+ formats), Parquet support, LLM-generated parsers
 - **v0.6.0**: Latency metrics, import consolidation, cleaning report, dry-run mode
 - **v0.5.1**: Dangerous code detection (AST-based security)
 - **v0.5.0**: Two-pass optimization with LLM agency (consolidation, early termination)
@@ -69,6 +73,8 @@ cleaner = DataCleaner(
     # Observability (v0.6.0)
     report_path="cleaning_report.md",  # Generate markdown report (None to disable)
     dry_run=False,  # Set True to analyze without generating functions
+    # Terminal UI (v0.8.0)
+    tui=True,  # Enable Rich dashboard (requires pip install recursive-cleaner[tui])
 )
 cleaner.run()  # Outputs: cleaning_functions.py, cleaning_report.md
@@ -159,6 +165,7 @@ recursive_cleaner/
     report.py            # Markdown report generation (~120 lines) [v0.6.0]
     response.py          # XML/markdown parsing + agency dataclasses (~292 lines)
     schema.py            # Schema inference (~117 lines) [v0.2.0]
+    tui.py               # Rich terminal dashboard (~520 lines) [v0.8.0]
     types.py             # LLMBackend protocol (~11 lines)
     validation.py        # Runtime validation + safety checks (~200 lines)
     vendor/
@@ -187,6 +194,7 @@ tests/                   # 392 tests
     test_sampling.py     # Sampling strategy tests [v0.4.0]
     test_schema.py       # Schema inference tests
     test_text_mode.py    # Text mode tests [v0.3.0]
+    test_tui.py          # Terminal UI tests [v0.8.0]
     test_validation.py   # Runtime validation + safety tests
     test_vendor_chunker.py  # Vendored chunker tests [v0.3.0]

{recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: recursive-cleaner
-Version: 0.7.0
+Version: 0.8.0
 Summary: LLM-powered incremental data cleaning pipeline that processes massive datasets in chunks and generates Python cleaning functions
 Project-URL: Homepage, https://github.com/gaztrabisme/recursive-data-cleaner
 Project-URL: Repository, https://github.com/gaztrabisme/recursive-data-cleaner
@@ -32,6 +32,8 @@ Provides-Extra: mlx
 Requires-Dist: mlx-lm>=0.10.0; extra == 'mlx'
 Provides-Extra: parquet
 Requires-Dist: pyarrow>=14.0.0; extra == 'parquet'
+Provides-Extra: tui
+Requires-Dist: rich>=13.0; extra == 'tui'
 Description-Content-Type: text/markdown
 # Recursive Data Cleaner
@@ -40,7 +42,7 @@ LLM-powered incremental data cleaning for massive datasets. Process files in chu
 ## How It Works
-1. **Chunk** your data (JSONL, CSV, JSON, or text)
+1. **Chunk** your data (JSONL, CSV, JSON, Parquet, PDF, Word, Excel, XML, and more)
 2. **Analyze** each chunk with an LLM to identify issues
 3. **Generate** one cleaning function per issue
 4. **Validate** functions on holdout data before accepting
@@ -59,6 +61,21 @@ For Apple Silicon (MLX backend):
 pip install -e ".[mlx]"
 ```
+For document conversion (PDF, Word, Excel, HTML, etc.):
+```bash
+pip install -e ".[markitdown]"
+```
+For Parquet files:
+```bash
+pip install -e ".[parquet]"
+```
+For Terminal UI (Rich dashboard):
+```bash
+pip install -e ".[tui]"
+```
 ## Quick Start
 ```python
@@ -111,6 +128,18 @@ cleaner.run()  # Generates cleaning_functions.py
 - **Cleaning Reports**: Markdown summary with functions, timing, quality delta
 - **Dry-Run Mode**: Analyze data without generating functions
+### Format Expansion (v0.7.0)
+- **Markitdown Integration**: Convert 20+ formats (PDF, Word, Excel, PowerPoint, HTML, EPUB, etc.) to text
+- **Parquet Support**: Load parquet files as structured data via pyarrow
+- **LLM-Generated Parsers**: Auto-generate parsers for XML and unknown formats (`auto_parse=True`)
+### Terminal UI (v0.8.0)
+- **Mission Control Dashboard**: Rich-based live terminal UI with retro aesthetic
+- **Real-time Progress**: Animated progress bars, chunk/iteration counters
+- **Transmission Log**: Parsed LLM responses showing issues detected and functions being generated
+- **Token Estimation**: Track estimated input/output tokens across the run
+- **Graceful Fallback**: Works without Rich installed (falls back to callbacks)
 ## Configuration
 ```python
@@ -142,6 +171,12 @@ cleaner = DataCleaner(
     report_path="report.md",    # Markdown report output (None to disable)
     dry_run=False,              # Analyze without generating functions
+    # Format Expansion
+    auto_parse=False,           # LLM generates parser for unknown formats
+    # Terminal UI
+    tui=True,                   # Enable Rich dashboard (requires [tui] extra)
     # Progress & State
     on_progress=callback,       # Progress event callback
     state_file="state.json",    # Enable resume on interrupt
@@ -235,20 +270,22 @@ cleaner.run()
 ```
 recursive_cleaner/
-├── cleaner.py       # Main DataCleaner class (~580 lines)
-├── context.py       # Docstring registry with FIFO eviction
-├── dependencies.py  # Topological sort for function ordering
-├── metrics.py       # Quality metrics before/after
-├── optimizer.py     # Two-pass consolidation with LLM agency
-├── output.py        # Function file generation + import consolidation
-├── parsers.py       # Chunking for JSONL/CSV/JSON/text + sampling
-├── prompt.py        # LLM prompt templates
-├── report.py        # Markdown report generation
-├── response.py      # XML/markdown parsing + agency dataclasses
-├── schema.py        # Schema inference
-├── validation.py    # Runtime validation + holdout
+├── cleaner.py          # Main DataCleaner class
+├── context.py          # Docstring registry with FIFO eviction
+├── dependencies.py     # Topological sort for function ordering
+├── metrics.py          # Quality metrics before/after
+├── optimizer.py        # Two-pass consolidation with LLM agency
+├── output.py           # Function file generation + import consolidation
+├── parser_generator.py # LLM-generated parsers for unknown formats
+├── parsers.py          # Chunking for all formats + sampling
+├── prompt.py           # LLM prompt templates
+├── report.py           # Markdown report generation
+├── response.py         # XML/markdown parsing + agency dataclasses
+├── schema.py           # Schema inference
+├── tui.py              # Rich terminal dashboard
+├── validation.py       # Runtime validation + holdout
 └── vendor/
-    └── chunker.py   # Vendored sentence-aware chunker
+    └── chunker.py      # Vendored sentence-aware chunker
 ```
 ## Testing
@@ -257,7 +294,7 @@ recursive_cleaner/
 pytest tests/ -v
 ```
-392 tests covering all features. Test datasets in `test_cases/`:
+465 tests covering all features. Test datasets in `test_cases/`:
 - E-commerce product catalogs
 - Healthcare patient records
 - Financial transaction data
@@ -273,6 +310,8 @@ pytest tests/ -v
 | Version | Features |
 |---------|----------|
+| v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
+| v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
 | v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
 | v0.5.1 | Dangerous code detection (AST-based security) |
 | v0.5.0 | Two-pass optimization, early termination, LLM agency |

{recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/README.md RENAMED

@@ -4,7 +4,7 @@ LLM-powered incremental data cleaning for massive datasets. Process files in chu
 ## How It Works
-1. **Chunk** your data (JSONL, CSV, JSON, or text)
+1. **Chunk** your data (JSONL, CSV, JSON, Parquet, PDF, Word, Excel, XML, and more)
 2. **Analyze** each chunk with an LLM to identify issues
 3. **Generate** one cleaning function per issue
 4. **Validate** functions on holdout data before accepting
@@ -23,6 +23,21 @@ For Apple Silicon (MLX backend):
 pip install -e ".[mlx]"
 ```
+For document conversion (PDF, Word, Excel, HTML, etc.):
+```bash
+pip install -e ".[markitdown]"
+```
+For Parquet files:
+```bash
+pip install -e ".[parquet]"
+```
+For Terminal UI (Rich dashboard):
+```bash
+pip install -e ".[tui]"
+```
 ## Quick Start
 ```python
@@ -75,6 +90,18 @@ cleaner.run()  # Generates cleaning_functions.py
 - **Cleaning Reports**: Markdown summary with functions, timing, quality delta
 - **Dry-Run Mode**: Analyze data without generating functions
+### Format Expansion (v0.7.0)
+- **Markitdown Integration**: Convert 20+ formats (PDF, Word, Excel, PowerPoint, HTML, EPUB, etc.) to text
+- **Parquet Support**: Load parquet files as structured data via pyarrow
+- **LLM-Generated Parsers**: Auto-generate parsers for XML and unknown formats (`auto_parse=True`)
+### Terminal UI (v0.8.0)
+- **Mission Control Dashboard**: Rich-based live terminal UI with retro aesthetic
+- **Real-time Progress**: Animated progress bars, chunk/iteration counters
+- **Transmission Log**: Parsed LLM responses showing issues detected and functions being generated
+- **Token Estimation**: Track estimated input/output tokens across the run
+- **Graceful Fallback**: Works without Rich installed (falls back to callbacks)
 ## Configuration
 ```python
@@ -106,6 +133,12 @@ cleaner = DataCleaner(
     report_path="report.md",    # Markdown report output (None to disable)
     dry_run=False,              # Analyze without generating functions
+    # Format Expansion
+    auto_parse=False,           # LLM generates parser for unknown formats
+    # Terminal UI
+    tui=True,                   # Enable Rich dashboard (requires [tui] extra)
     # Progress & State
     on_progress=callback,       # Progress event callback
     state_file="state.json",    # Enable resume on interrupt
@@ -199,20 +232,22 @@ cleaner.run()
 ```
 recursive_cleaner/
-├── cleaner.py       # Main DataCleaner class (~580 lines)
-├── context.py       # Docstring registry with FIFO eviction
-├── dependencies.py  # Topological sort for function ordering
-├── metrics.py       # Quality metrics before/after
-├── optimizer.py     # Two-pass consolidation with LLM agency
-├── output.py        # Function file generation + import consolidation
-├── parsers.py       # Chunking for JSONL/CSV/JSON/text + sampling
-├── prompt.py        # LLM prompt templates
-├── report.py        # Markdown report generation
-├── response.py      # XML/markdown parsing + agency dataclasses
-├── schema.py        # Schema inference
-├── validation.py    # Runtime validation + holdout
+├── cleaner.py          # Main DataCleaner class
+├── context.py          # Docstring registry with FIFO eviction
+├── dependencies.py     # Topological sort for function ordering
+├── metrics.py          # Quality metrics before/after
+├── optimizer.py        # Two-pass consolidation with LLM agency
+├── output.py           # Function file generation + import consolidation
+├── parser_generator.py # LLM-generated parsers for unknown formats
+├── parsers.py          # Chunking for all formats + sampling
+├── prompt.py           # LLM prompt templates
+├── report.py           # Markdown report generation
+├── response.py         # XML/markdown parsing + agency dataclasses
+├── schema.py           # Schema inference
+├── tui.py              # Rich terminal dashboard
+├── validation.py       # Runtime validation + holdout
 └── vendor/
-    └── chunker.py   # Vendored sentence-aware chunker
+    └── chunker.py      # Vendored sentence-aware chunker
 ```
 ## Testing
@@ -221,7 +256,7 @@ recursive_cleaner/
 pytest tests/ -v
 ```
-392 tests covering all features. Test datasets in `test_cases/`:
+465 tests covering all features. Test datasets in `test_cases/`:
 - E-commerce product catalogs
 - Healthcare patient records
 - Financial transaction data
@@ -237,6 +272,8 @@ pytest tests/ -v
 | Version | Features |
 |---------|----------|
+| v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
+| v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
 | v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
 | v0.5.1 | Dangerous code detection (AST-based security) |
 | v0.5.0 | Two-pass optimization, early termination, LLM agency |

recursive_cleaner-0.8.0/demo_tui.py ADDED

@@ -0,0 +1,54 @@
+#!/usr/bin/env python3
+"""
+Demo script to showcase the Rich TUI with real MLX backend.
+Run with:
+    python demo_tui.py
+Requirements:
+    pip install recursive-cleaner[mlx,tui]
+"""
+from backends import MLXBackend
+from recursive_cleaner import DataCleaner
+# Use a smaller/faster model for demo (change to your preferred model)
+MODEL = "lmstudio-community/Qwen3-Next-80B-A3B-Instruct-MLX-4bit"
+print("=" * 60)
+print("  RECURSIVE DATA CLEANER - TUI DEMO")
+print("=" * 60)
+print(f"\nLoading model: {MODEL}")
+print("This may take a moment on first run...\n")
+llm = MLXBackend(
+    model_path=MODEL,
+    max_tokens=2048,
+    temperature=0.3,  # Lower for more consistent output
+    verbose=False,  # Disable token streaming to avoid interfering with TUI
+)
+cleaner = DataCleaner(
+    llm_backend=llm,
+    file_path="test_cases/ecommerce_products.jsonl",
+    chunk_size=5,  # Small chunks for demo
+    max_iterations=3,  # Limit iterations per chunk
+    instructions="""
+    E-commerce product data cleaning:
+    - Normalize prices to float (remove $ symbols)
+    - Fix category typos and normalize to Title Case
+    - Convert weights to kg as float
+    - Ensure stock_quantity is non-negative integer
+    """,
+    tui=True,  # Enable the Rich dashboard!
+    track_metrics=True,
+)
+print("\nStarting cleaner with TUI enabled...")
+print("Watch the dashboard below!\n")
+cleaner.run()
+print("\n" + "=" * 60)
+print("Demo complete! Check cleaning_functions.py for output.")
+print("=" * 60)

recursive_cleaner-0.8.0/docs/contracts/v080-api-contract.md ADDED

@@ -0,0 +1,62 @@
+# API Contract: Rich TUI (v0.8.0)
+## New Parameter
+```python
+DataCleaner(
+    ...,
+    tui: bool = False,  # Enable Rich terminal dashboard
+)
+```
+## Behavior Matrix
+| `tui` | Rich installed | Behavior |
+|-------|----------------|----------|
+| `False` | Any | Existing callback-based output (no change) |
+| `True` | Yes | Live dashboard replaces callback prints |
+| `True` | No | Warning logged, falls back to callbacks |
+## New Optional Dependency
+```toml
+[project.optional-dependencies]
+tui = ["rich>=13.0"]
+```
+```bash
+pip install recursive-cleaner[tui]
+```
+## TUI Module API
+### `recursive_cleaner/tui.py`
+```python
+# Check availability
+HAS_RICH: bool
+# Main renderer class
+class TUIRenderer:
+    def __init__(self, file_path: str, total_chunks: int, total_records: int)
+    def start(self) -> None
+    def stop(self) -> None
+    def update_chunk(self, chunk_index: int, iteration: int, max_iterations: int) -> None
+    def update_llm_status(self, status: str) -> None  # "calling" | "idle"
+    def add_function(self, name: str, docstring: str) -> None
+    def update_metrics(self, quality_delta: float, latency_last: float, latency_avg: float, latency_total: float, llm_calls: int) -> None
+    def show_complete(self, summary: dict) -> None
+```
+## Integration with DataCleaner
+When `tui=True` and Rich available:
+1. `on_progress` callback still fires (for logging, state tracking)
+2. TUI replaces console output, not callbacks
+3. TUI auto-stops on completion or error
+## No Breaking Changes
+- All existing parameters unchanged
+- All existing callbacks unchanged
+- `tui=False` (default) = identical to v0.7.0 behavior
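
Reviewer note: the behavior matrix in this contract can be read as a small decision function. The sketch below is not the package's code; `rich_available`, `resolve_output_mode`, and the logger name are hypothetical helpers introduced only to illustrate the three rows of the table, and the availability check relies on the standard library's `importlib.util.find_spec`.

```python
import importlib.util
import logging

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("tui_fallback_sketch")


def rich_available() -> bool:
    """Return True if the `rich` package can be located in this environment."""
    return importlib.util.find_spec("rich") is not None


def resolve_output_mode(tui: bool, has_rich: bool) -> str:
    """Map (tui flag, Rich installed) to an output mode, following the behavior matrix.

    Returns "dashboard" only when the flag is set and Rich is present;
    every other combination keeps the existing callback-based output.
    """
    if not tui:
        return "callbacks"  # Row 1: tui=False, no change in behavior
    if has_rich:
        return "dashboard"  # Row 2: tui=True and Rich installed
    # Row 3: tui=True but Rich missing, so warn and fall back
    logger.warning("tui=True but Rich is not installed; falling back to callbacks")
    return "callbacks"


if __name__ == "__main__":
    print(resolve_output_mode(tui=True, has_rich=rich_available()))
```

Taking `has_rich` as a parameter rather than probing inside the function keeps the fallback row testable on machines where Rich is installed.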

recursive_cleaner-0.8.0/docs/contracts/v080-data-schema.md ADDED

@@ -0,0 +1,90 @@
+# Data Schema: TUI Display State (v0.8.0)
+## Dashboard State
+```python
+@dataclass
+class TUIState:
+    # Header
+    file_path: str
+    total_records: int
+    version: str = "0.8.0"
+    # Progress
+    current_chunk: int = 0
+    total_chunks: int = 0
+    current_iteration: int = 0
+    max_iterations: int = 5
+    # LLM Status
+    llm_status: Literal["idle", "calling"] = "idle"
+    # Functions
+    functions: list[FunctionInfo] = field(default_factory=list)
+    # Metrics
+    quality_delta: float = 0.0  # Percentage improvement
+    latency_last_ms: float = 0.0
+    latency_avg_ms: float = 0.0
+    latency_total_ms: float = 0.0
+    llm_call_count: int = 0
+@dataclass
+class FunctionInfo:
+    name: str
+    docstring: str  # First 50 chars displayed
+```
+## Dashboard Layout Schema
+```
+┌─────────────────────────────────────────────────────────┐
+│  {file_path}                              v{version}    │  <- HEADER (size=3)
+├────────────────────┬────────────────────────────────────┤
+│  PROGRESS          │  FUNCTIONS ({len(functions)})      │  <- BODY
+│  [████░░░░░░] {%}  │  ├─ {functions[0].name}            │
+│  Chunk {cur}/{tot} │  ├─ {functions[1].name}            │
+│  Iter {i}/{max}    │  └─ {functions[2].name}            │
+│                    │      (+{n} more)                   │
+│  {spinner} {status}│  QUALITY: +{quality_delta}%        │
+├────────────────────┴────────────────────────────────────┤
+│  ⏱️ {latency_last}ms │ avg {latency_avg}ms │ {llm_calls} │  <- FOOTER (size=3)
+└─────────────────────────────────────────────────────────┘
+```
+## Color Scheme
+| Element | Color | Condition |
+|---------|-------|-----------|
+| Header title | cyan | Always |
+| Progress bar | yellow | In progress |
+| Progress bar | green | Chunk complete |
+| Spinner | yellow | LLM calling |
+| Function names | green | Always |
+| Quality delta | green | Positive |
+| Quality delta | red | Negative |
+| Latency | dim white | Always |
+## Spinner States
+| `llm_status` | Display |
+|--------------|---------|
+| `"calling"` | Animated spinner + "Calling LLM..." |
+| `"idle"` | Static checkmark or empty |
+## Completion Summary
+On `show_complete()`:
+```
+┌─────────────────────────────────────────────────────────┐
+│  ✓ COMPLETE                                             │
+├─────────────────────────────────────────────────────────┤
+│  Functions generated: {n}                               │
+│  Chunks processed: {total_chunks}                       │
+│  Quality improvement: +{quality_delta}%                 │
+│  Total time: {latency_total}ms ({llm_calls} LLM calls)  │
+│                                                         │
+│  Output: cleaning_functions.py                          │
+└─────────────────────────────────────────────────────────┘
+```
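
Reviewer note: as listed in the schema, `TUIState` refers to `FunctionInfo` before that class is defined, which may raise `NameError` if the snippet is pasted into a single module on interpreters that evaluate annotations eagerly. A minimal runnable arrangement, assuming Python 3.9+ and keeping the field names from the contract, might look like the following. The `progress_percent` helper is a hypothetical addition for the `{%}` slot in the layout and is not part of the contract.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class FunctionInfo:
    name: str
    docstring: str  # The contract says the first 50 chars are displayed


@dataclass
class TUIState:
    # Header
    file_path: str
    total_records: int
    version: str = "0.8.0"
    # Progress
    current_chunk: int = 0
    total_chunks: int = 0
    current_iteration: int = 0
    max_iterations: int = 5
    # LLM status
    llm_status: Literal["idle", "calling"] = "idle"
    # Functions (FunctionInfo is defined above, so the name resolves)
    functions: list[FunctionInfo] = field(default_factory=list)
    # Metrics
    quality_delta: float = 0.0
    latency_last_ms: float = 0.0
    latency_avg_ms: float = 0.0
    latency_total_ms: float = 0.0
    llm_call_count: int = 0


def progress_percent(state: TUIState) -> float:
    """Hypothetical helper: percentage of chunks processed, 0.0 when the total is unknown."""
    if state.total_chunks <= 0:
        return 0.0
    return 100.0 * state.current_chunk / state.total_chunks


if __name__ == "__main__":
    state = TUIState(file_path="data.jsonl", total_records=100, total_chunks=4)
    state.current_chunk = 1
    state.functions.append(FunctionInfo("normalize_prices", "Strip currency symbols."))
    print(progress_percent(state), len(state.functions))
```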

recursive_cleaner-0.8.0/docs/contracts/v080-success-criteria.md ADDED

@@ -0,0 +1,70 @@
+# Success Criteria: Rich TUI (v0.8.0)
+## Project-Level Success
+- [ ] `pip install recursive-cleaner[tui]` installs rich>=13.0
+- [ ] `DataCleaner(..., tui=True)` shows live dashboard
+- [ ] Dashboard displays all state from data schema contract
+- [ ] Falls back gracefully when Rich not installed
+- [ ] All 432 existing tests pass
+- [ ] Zero breaking changes to existing API
+## Phase 1: Core TUI Module
+**Deliverables:**
+- [ ] `recursive_cleaner/tui.py` with `TUIRenderer` class
+- [ ] `HAS_RICH` check with graceful import
+- [ ] Basic `start()` / `stop()` lifecycle
+- [ ] Static layout matching schema (header, body split, footer)
+**Success Criteria:**
+- [ ] `from recursive_cleaner.tui import TUIRenderer, HAS_RICH` works
+- [ ] `TUIRenderer` can be instantiated without Rich (no crash)
+- [ ] With Rich: `start()` shows layout, `stop()` exits cleanly
+- [ ] Layout has correct sections per data schema
+**Tests:**
+- [ ] test_tui_import_without_rich
+- [ ] test_tui_renderer_lifecycle
+- [ ] test_tui_layout_structure
+## Phase 2: Dynamic Updates
+**Deliverables:**
+- [ ] `update_chunk()` updates progress bar and counters
+- [ ] `update_llm_status()` shows/hides spinner
+- [ ] `add_function()` appends to function list
+- [ ] `update_metrics()` updates footer stats
+**Success Criteria:**
+- [ ] Progress bar fills based on chunk_index/total_chunks
+- [ ] Spinner animates when status="calling", stops when "idle"
+- [ ] Functions list grows, shows "+N more" when >5 functions
+- [ ] Metrics panel shows formatted latency and counts
+**Tests:**
+- [ ] test_progress_updates
+- [ ] test_spinner_states
+- [ ] test_function_list_display
+- [ ] test_metrics_display
+## Phase 3: Integration & Polish
+**Deliverables:**
+- [ ] `tui=True` parameter on DataCleaner
+- [ ] Integration: TUI updates from cleaner loop
+- [ ] `show_complete()` with summary panel
+- [ ] Fallback warning when Rich not installed
+- [ ] Color transitions (yellow→green on chunk complete)
+**Success Criteria:**
+- [ ] Full cleaner run with `tui=True` shows live dashboard
+- [ ] Completion shows summary with all stats
+- [ ] `tui=True` without Rich logs warning, uses callbacks
+- [ ] Chunk completion triggers green color flash
+**Tests:**
+- [ ] test_datacleaner_tui_integration
+- [ ] test_tui_fallback_warning
+- [ ] test_completion_summary
+- [ ] test_color_transitions
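
Reviewer note: two of the Phase 2 criteria (the bar filling by `chunk_index/total_chunks`, and the "+N more" overflow once there are more than 5 functions) are easy to pin down in plain text before any Rich rendering is involved. The sketch below is an illustration under assumptions, not the code in `tui.py`: `render_bar` and `format_function_list` are hypothetical names, the bar width of 10 mirrors the layout drawing, and whether `chunk_index` is zero-based in the real renderer is not stated in the contract.

```python
def render_bar(chunk_index: int, total_chunks: int, width: int = 10) -> str:
    """Render a text progress bar whose filled part is proportional to chunk_index/total_chunks."""
    if total_chunks <= 0:
        filled = 0  # Unknown total: show an empty bar instead of dividing by zero
    else:
        # Clamp to [0, total_chunks] and use integer math to avoid float rounding
        done = min(max(chunk_index, 0), total_chunks)
        filled = width * done // total_chunks
    return "[" + "█" * filled + "░" * (width - filled) + "]"


def format_function_list(names: list[str], limit: int = 5) -> list[str]:
    """Return display lines for up to `limit` function names plus a '(+N more)' line."""
    shown = names[:limit]
    lines = [f"├─ {name}" for name in shown[:-1]]
    if shown:
        lines.append(f"└─ {shown[-1]}")  # Last visible entry closes the tree
    extra = len(names) - len(shown)
    if extra > 0:
        lines.append(f"(+{extra} more)")
    return lines


if __name__ == "__main__":
    print(render_bar(4, 10))
    for line in format_function_list([f"fn_{i}" for i in range(7)]):
        print(line)
```

Keeping these as pure functions that return strings would let tests such as `test_progress_updates` and `test_function_list_display` assert on output without starting a live display.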
