PyPI - recursive-cleaner - Versions diffs - 0.8.0__tar.gz → 1.0.1__tar.gz - Mend

recursive-cleaner 0.8.0tar.gz → 1.0.1tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (114)

recursive_cleaner-1.0.1/AGENTS.md ADDED

@@ -0,0 +1 @@
+CLAUDE.md

{recursive_cleaner-0.8.0 → recursive_cleaner-1.0.1}/CLAUDE.md RENAMED

@@ -4,7 +4,10 @@
 | Version | Status | Date |
 |---------|--------|------|
-| v0.8.0 | **Implemented** | 2025-01-19 |
+| v1.0.1 | **Implemented** | 2025-02-05 |
+| v1.0.0 | Implemented | 2025-01-30 |
+| v0.9.0 | Implemented | 2025-01-19 |
+| v0.8.0 | Implemented | 2025-01-19 |
 | v0.7.0 | Implemented | 2025-01-17 |
 | v0.6.0 | Implemented | 2025-01-15 |
 | v0.5.1 | Implemented | 2025-01-15 |
@@ -14,9 +17,12 @@
 | v0.2.0 | Implemented | 2025-01-14 |
 | v0.1.0 | Implemented | 2025-01-14 |
-**Current State**: v0.8.0 complete. 465 tests passing.
+**Current State**: v1.0.1 complete. 555 tests passing.
 ### Version History
+- **v1.0.1**: Return type validation, prompt signature clarity, duplicate field detection
+- **v1.0.0**: Apply mode for applying cleaning functions to data, Excel support, TUI color enhancement
+- **v0.9.0**: CLI tool with MLX and OpenAI-compatible backends (LM Studio, Ollama)
 - **v0.8.0**: Terminal UI with Rich dashboard, mission control aesthetic, transmission log
 - **v0.7.0**: Markitdown integration (20+ formats), Parquet support, LLM-generated parsers
 - **v0.6.0**: Latency metrics, import consolidation, cleaning report, dry-run mode

{recursive_cleaner-0.8.0 → recursive_cleaner-1.0.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: recursive-cleaner
-Version: 0.8.0
+Version: 1.0.1
 Summary: LLM-powered incremental data cleaning pipeline that processes massive datasets in chunks and generates Python cleaning functions
 Project-URL: Homepage, https://github.com/gaztrabisme/recursive-data-cleaner
 Project-URL: Repository, https://github.com/gaztrabisme/recursive-data-cleaner
@@ -9,7 +9,7 @@ Author: Gary Tran
 License-Expression: MIT
 License-File: LICENSE
 Keywords: automation,data-cleaning,data-quality,etl,llm,machine-learning
-Classifier: Development Status :: 4 - Beta
+Classifier: Development Status :: 5 - Production/Stable
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
 Classifier: License :: OSI Approved :: MIT License
@@ -26,10 +26,15 @@ Requires-Dist: tenacity>=8.0
 Provides-Extra: dev
 Requires-Dist: pytest-cov>=4.0; extra == 'dev'
 Requires-Dist: pytest>=7.0; extra == 'dev'
+Provides-Extra: excel
+Requires-Dist: openpyxl>=3.0.0; extra == 'excel'
+Requires-Dist: xlrd>=2.0.0; extra == 'excel'
 Provides-Extra: markitdown
 Requires-Dist: markitdown>=0.1.0; extra == 'markitdown'
 Provides-Extra: mlx
 Requires-Dist: mlx-lm>=0.10.0; extra == 'mlx'
+Provides-Extra: openai
+Requires-Dist: openai>=1.0.0; extra == 'openai'
 Provides-Extra: parquet
 Requires-Dist: pyarrow>=14.0.0; extra == 'parquet'
 Provides-Extra: tui
@@ -140,6 +145,91 @@ cleaner.run()  # Generates cleaning_functions.py
 - **Token Estimation**: Track estimated input/output tokens across the run
 - **Graceful Fallback**: Works without Rich installed (falls back to callbacks)
+### CLI (v0.9.0)
+- **Command Line Interface**: Use without writing Python code
+- **Multiple Backends**: MLX (Apple Silicon) and OpenAI-compatible (OpenAI, LM Studio, Ollama)
+- **Four Commands**: `generate`, `analyze` (dry-run), `resume`, `apply`
+### Apply Mode (v1.0.0)
+- **Apply Cleaning Functions**: Apply generated functions to full datasets
+- **Data Formats**: JSONL, CSV, JSON, Parquet, Excel (.xlsx/.xls) output same format
+- **Text Formats**: PDF, Word, HTML, etc. output as Markdown
+- **Streaming**: Memory-efficient line-by-line processing for JSONL/CSV
+- **Colored TUI**: Enhanced transmission log with syntax-highlighted XML parsing
+## Command Line Interface
+After installation, the `recursive-cleaner` command is available:
+```bash
+# Generate cleaning functions with MLX (Apple Silicon)
+recursive-cleaner generate data.jsonl \
+  --provider mlx \
+  --model "lmstudio-community/Qwen3-80B-MLX-4bit" \
+  --instructions "Normalize phone numbers to E.164" \
+  --output cleaning_functions.py
+# Use OpenAI
+export OPENAI_API_KEY=your-key
+recursive-cleaner generate data.jsonl \
+  --provider openai \
+  --model gpt-4o \
+  --instructions "Fix date formats"
+# Use LM Studio or Ollama (OpenAI-compatible)
+recursive-cleaner generate data.jsonl \
+  --provider openai \
+  --model "qwen/qwen3-vl-30b" \
+  --base-url http://localhost:1234/v1 \
+  --instructions "Normalize prices"
+# Dry-run analysis
+recursive-cleaner analyze data.jsonl \
+  --provider openai \
+  --model gpt-4o \
+  --instructions @instructions.txt
+# Resume from checkpoint
+recursive-cleaner resume cleaning_state.json \
+  --provider mlx \
+  --model "model-path"
+# Apply cleaning functions to data
+recursive-cleaner apply data.jsonl \
+  --functions cleaning_functions.py \
+  --output cleaned_data.jsonl
+# Apply to Excel (outputs same format)
+recursive-cleaner apply sales.xlsx \
+  --functions cleaning_functions.py
+# Apply to PDF (outputs markdown)
+recursive-cleaner apply document.pdf \
+  --functions cleaning_functions.py \
+  --output cleaned.md
+```
+### CLI Options
+```
+recursive-cleaner generate <FILE> [OPTIONS]
+Required:
+  FILE                      Input data file
+  -p, --provider {mlx,openai}  LLM provider
+  -m, --model MODEL         Model name/path
+Optional:
+  -i, --instructions TEXT   Cleaning instructions (or @file.txt)
+  --base-url URL            API URL for OpenAI-compatible servers
+  --chunk-size N            Items per chunk (default: 50)
+  --max-iterations N        Max iterations per chunk (default: 5)
+  -o, --output PATH         Output file (default: cleaning_functions.py)
+  --tui                     Enable Rich dashboard
+  --optimize                Consolidate redundant functions
+  --track-metrics           Measure before/after quality
+```
 ## Configuration
 ```python
@@ -270,6 +360,7 @@ cleaner.run()
 ```
 recursive_cleaner/
+├── cli.py              # Command line interface
 ├── cleaner.py          # Main DataCleaner class
 ├── context.py          # Docstring registry with FIFO eviction
 ├── dependencies.py     # Topological sort for function ordering
@@ -286,6 +377,10 @@ recursive_cleaner/
 ├── validation.py       # Runtime validation + holdout
 └── vendor/
     └── chunker.py      # Vendored sentence-aware chunker
+backends/
+├── mlx_backend.py      # MLX-LM backend for Apple Silicon
+└── openai_backend.py   # OpenAI-compatible backend
 ```
 ## Testing
@@ -294,14 +389,14 @@ recursive_cleaner/
 pytest tests/ -v
 ```
-465 tests covering all features. Test datasets in `test_cases/`:
+555 tests covering all features. Test datasets in `test_cases/`:
 - E-commerce product catalogs
 - Healthcare patient records
 - Financial transaction data
 ## Philosophy
-- **Simplicity over extensibility**: ~3,000 lines that do one thing well
+- **Simplicity over extensibility**: ~5,000 lines that do one thing well
 - **stdlib over dependencies**: Only `tenacity` required
 - **Retry over recover**: On error, retry with error in prompt
 - **Wu wei**: Let the LLM make decisions about data it understands
@@ -310,6 +405,7 @@ pytest tests/ -v
 | Version | Features |
 |---------|----------|
+| v0.9.0 | CLI tool with MLX and OpenAI-compatible backends (LM Studio, Ollama) |
 | v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
 | v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
 | v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
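
The `-i, --instructions TEXT` option in the hunk above accepts either inline text or `@file.txt`, which suggests that a value starting with `@` is read from a file. The diff does not show how `cli.py` implements this, so the following is only a minimal sketch of the convention; `resolve_instructions` is a hypothetical helper name, not a function from the package.

```python
from __future__ import annotations

from pathlib import Path


def resolve_instructions(value: str | None) -> str:
    """Return inline instructions, or the contents of the file named after '@'.

    Hypothetical helper: the real cli.py may handle this differently.
    """
    if not value:
        # Instructions are listed as optional, so treat a missing value as empty.
        return ""
    if value.startswith("@"):
        # "@instructions.txt" -> read the instructions from that file.
        return Path(value[1:]).read_text(encoding="utf-8").strip()
    return value
```

Under this assumption, `--instructions "Fix date formats"` and `--instructions @instructions.txt` both end up as a plain string handed to the cleaner.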

{recursive_cleaner-0.8.0 → recursive_cleaner-1.0.1}/README.md RENAMED

@@ -102,6 +102,91 @@ cleaner.run()  # Generates cleaning_functions.py
 - **Token Estimation**: Track estimated input/output tokens across the run
 - **Graceful Fallback**: Works without Rich installed (falls back to callbacks)
+### CLI (v0.9.0)
+- **Command Line Interface**: Use without writing Python code
+- **Multiple Backends**: MLX (Apple Silicon) and OpenAI-compatible (OpenAI, LM Studio, Ollama)
+- **Four Commands**: `generate`, `analyze` (dry-run), `resume`, `apply`
+### Apply Mode (v1.0.0)
+- **Apply Cleaning Functions**: Apply generated functions to full datasets
+- **Data Formats**: JSONL, CSV, JSON, Parquet, Excel (.xlsx/.xls) output same format
+- **Text Formats**: PDF, Word, HTML, etc. output as Markdown
+- **Streaming**: Memory-efficient line-by-line processing for JSONL/CSV
+- **Colored TUI**: Enhanced transmission log with syntax-highlighted XML parsing
+## Command Line Interface
+After installation, the `recursive-cleaner` command is available:
+```bash
+# Generate cleaning functions with MLX (Apple Silicon)
+recursive-cleaner generate data.jsonl \
+  --provider mlx \
+  --model "lmstudio-community/Qwen3-80B-MLX-4bit" \
+  --instructions "Normalize phone numbers to E.164" \
+  --output cleaning_functions.py
+# Use OpenAI
+export OPENAI_API_KEY=your-key
+recursive-cleaner generate data.jsonl \
+  --provider openai \
+  --model gpt-4o \
+  --instructions "Fix date formats"
+# Use LM Studio or Ollama (OpenAI-compatible)
+recursive-cleaner generate data.jsonl \
+  --provider openai \
+  --model "qwen/qwen3-vl-30b" \
+  --base-url http://localhost:1234/v1 \
+  --instructions "Normalize prices"
+# Dry-run analysis
+recursive-cleaner analyze data.jsonl \
+  --provider openai \
+  --model gpt-4o \
+  --instructions @instructions.txt
+# Resume from checkpoint
+recursive-cleaner resume cleaning_state.json \
+  --provider mlx \
+  --model "model-path"
+# Apply cleaning functions to data
+recursive-cleaner apply data.jsonl \
+  --functions cleaning_functions.py \
+  --output cleaned_data.jsonl
+# Apply to Excel (outputs same format)
+recursive-cleaner apply sales.xlsx \
+  --functions cleaning_functions.py
+# Apply to PDF (outputs markdown)
+recursive-cleaner apply document.pdf \
+  --functions cleaning_functions.py \
+  --output cleaned.md
+```
+### CLI Options
+```
+recursive-cleaner generate <FILE> [OPTIONS]
+Required:
+  FILE                      Input data file
+  -p, --provider {mlx,openai}  LLM provider
+  -m, --model MODEL         Model name/path
+Optional:
+  -i, --instructions TEXT   Cleaning instructions (or @file.txt)
+  --base-url URL            API URL for OpenAI-compatible servers
+  --chunk-size N            Items per chunk (default: 50)
+  --max-iterations N        Max iterations per chunk (default: 5)
+  -o, --output PATH         Output file (default: cleaning_functions.py)
+  --tui                     Enable Rich dashboard
+  --optimize                Consolidate redundant functions
+  --track-metrics           Measure before/after quality
+```
 ## Configuration
 ```python
@@ -232,6 +317,7 @@ cleaner.run()
 ```
 recursive_cleaner/
+├── cli.py              # Command line interface
 ├── cleaner.py          # Main DataCleaner class
 ├── context.py          # Docstring registry with FIFO eviction
 ├── dependencies.py     # Topological sort for function ordering
@@ -248,6 +334,10 @@ recursive_cleaner/
 ├── validation.py       # Runtime validation + holdout
 └── vendor/
     └── chunker.py      # Vendored sentence-aware chunker
+backends/
+├── mlx_backend.py      # MLX-LM backend for Apple Silicon
+└── openai_backend.py   # OpenAI-compatible backend
 ```
 ## Testing
@@ -256,14 +346,14 @@ recursive_cleaner/
 pytest tests/ -v
 ```
-465 tests covering all features. Test datasets in `test_cases/`:
+555 tests covering all features. Test datasets in `test_cases/`:
 - E-commerce product catalogs
 - Healthcare patient records
 - Financial transaction data
 ## Philosophy
-- **Simplicity over extensibility**: ~3,000 lines that do one thing well
+- **Simplicity over extensibility**: ~5,000 lines that do one thing well
 - **stdlib over dependencies**: Only `tenacity` required
 - **Retry over recover**: On error, retry with error in prompt
 - **Wu wei**: Let the LLM make decisions about data it understands
@@ -272,6 +362,7 @@ pytest tests/ -v
 | Version | Features |
 |---------|----------|
+| v0.9.0 | CLI tool with MLX and OpenAI-compatible backends (LM Studio, Ollama) |
 | v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
 | v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
 | v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
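
The README's Philosophy section states "Retry over recover: On error, retry with error in prompt". The package's own loop is not part of this diff, so the sketch below only illustrates the pattern under stated assumptions: `generate_with_feedback`, `generate`, and `validate` are hypothetical names, and `validate` is assumed to raise an exception on bad output.

```python
from __future__ import annotations

from typing import Callable


def generate_with_feedback(
    generate: Callable[[str], str],
    validate: Callable[[str], object],
    prompt: str,
    max_iterations: int = 5,
) -> str:
    """Call `generate` until `validate` accepts the output.

    Hypothetical helper illustrating the pattern; not the package's own loop.
    """
    current_prompt = prompt
    for _ in range(max_iterations):
        output = generate(current_prompt)
        try:
            validate(output)
        except Exception as exc:
            # No recovery logic: append the error to the prompt and ask again.
            current_prompt = f"{prompt}\n\nPrevious attempt failed with: {exc}"
            continue
        return output
    raise RuntimeError(f"No valid output after {max_iterations} attempts")
```

For generated Python source, `ast.parse` could serve as a simple `validate`, since it raises `SyntaxError` on code that does not parse.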

recursive_cleaner-1.0.1/TODO.md ADDED

@@ -0,0 +1,119 @@
+# TODO - Recursive Data Cleaner Roadmap
+## Current Version: v0.9.0
+502 tests passing, ~3,400 lines. CLI complete.
+---
+## Completed Work
+| Version | Features |
+|---------|----------|
+| v0.1.0 | Core pipeline, chunking, docstring registry |
+| v0.2.0 | Runtime validation, schema inference, callbacks, incremental saves |
+| v0.3.0 | Text mode with sentence-aware chunking |
+| v0.4.0 | Holdout validation, dependency resolution, smart sampling, quality metrics |
+| v0.5.0 | Two-pass optimization, early termination, LLM agency |
+| v0.5.1 | Dangerous code detection (AST-based security) |
+| v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
+| v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
+| v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic |
+| v0.9.0 | CLI tool with MLX and OpenAI-compatible backends |
+---
+## Version Progression
+| Version | Theme |
+|---------|-------|
+| v0.1-0.2 | Core pipeline + validation |
+| v0.3-0.4 | Data quality assurance |
+| v0.5-0.6 | Optimization + observability |
+| v0.7-0.8 | Accessibility (formats + UI) |
+| v0.9-1.0 | Complete workflow |
+---
+## Roadmap to v1.0
+### v0.9.0 - CLI Tool ✅ COMPLETE
+CLI implemented with:
+- `recursive_cleaner/cli.py` - argparse CLI (346 lines)
+- `backends/openai_backend.py` - OpenAI-compatible backend (71 lines)
+- Commands: `generate`, `analyze`, `resume`
+- Backends: MLX, OpenAI, LM Studio, Ollama (via --base-url)
+### v1.0.0 - Apply Mode (~150 lines)
+The final step: actually cleaning the data, not just generating functions.
+```python
+cleaner = DataCleaner(...)
+cleaner.run()  # Generates cleaning_functions.py
+# NEW: Apply to full dataset
+cleaner.apply(output_path="cleaned_data.jsonl")
+```
+**Implementation:**
+- [ ] `DataCleaner.apply(output_path)` method
+- [ ] Stream-process file applying generated functions
+- [ ] Progress callbacks for large files
+- [ ] Validate output schema matches input
+- [ ] CLI integration: `recursive-cleaner apply`
+---
+## Patterns That Worked
+These patterns proved high-value with low implementation effort:
+1. **AST walking** - Dependency detection, dangerous code detection. ~50 lines each.
+2. **LLM agency** - Let model decide chunk cleanliness, saturation, consolidation. Elegant.
+3. **Retry with feedback** - On error, append error to prompt and retry. No complex recovery.
+4. **Holdout validation** - Test on unseen data before accepting. Catches edge cases.
+5. **Simple data structures** - List of dicts, JSON serialization. Easy to debug/resume.
+---
+## What We're Not Doing
+| Feature | Reason |
+|---------|--------|
+| Global deduplication | Adds complexity, breaks chunk-based philosophy |
+| Built-in LLM backends | Users bring their own, keeps us dependency-free |
+| Config files (YAML/TOML) | Python is already config, YAGNI |
+| Plugin system | No interfaces for things with one implementation |
+| Async multi-chunk | Complexity not justified; sequential is predictable |
+| Vector retrieval | Adds chromadb dependency; FIFO works for typical use |
+---
+## Line Count Budget
+| Component | Current | After v1.0 |
+|-----------|---------|------------|
+| Core library | ~3,000 | ~3,350 |
+| Tests | ~4,000 | ~4,400 |
+Staying under 3,500 lines for the library keeps us true to the philosophy.
+---
+## Philosophy Reminder
+From CLAUDE.md:
+- **Simplicity over extensibility** - Keep it lean
+- **stdlib over dependencies** - Only tenacity required
+- **Functions over classes** - Unless state genuinely helps
+- **Delete over abstract** - No interfaces for single implementations
+- **Retry over recover** - On error, retry with error in prompt
+- **Wu wei** - Let the LLM make decisions about data it understands
+---
+## Known Limitation
+**Stateful ops within chunks only** - Deduplication and aggregations don't work globally. This is architectural and accepted.
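
The "Known Limitation" at the end of TODO.md can be made concrete with a small example. This is a sketch with made-up data: `chunked` and `dedupe_by_id` are hypothetical helpers, not functions from the package, and the chunk size of 2 is chosen only to keep the example short.

```python
from __future__ import annotations


def chunked(items: list[dict], size: int) -> list[list[dict]]:
    """Split a list of records into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def dedupe_by_id(chunk: list[dict]) -> list[dict]:
    """Drop records whose 'id' was already seen, looking at this chunk only."""
    seen = set()
    kept = []
    for record in chunk:
        if record["id"] not in seen:
            seen.add(record["id"])
            kept.append(record)
    return kept


records = [{"id": 1}, {"id": 1}, {"id": 2}, {"id": 1}]

# Each chunk is cleaned in isolation, so the function never sees earlier chunks.
cleaned = [r for chunk in chunked(records, 2) for r in dedupe_by_id(chunk)]
```

The duplicate inside the first chunk is removed, but the record with `id` 1 in the second chunk survives because the function has no memory of the first chunk. That is the chunk-scoped behavior the TODO describes as architectural and accepted.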

{recursive_cleaner-0.8.0 → recursive_cleaner-1.0.1}/backends/__init__.py RENAMED

@@ -1,5 +1,6 @@
 """Backend implementations for Recursive Data Cleaner."""
 from .mlx_backend import MLXBackend
+from .openai_backend import OpenAIBackend
-__all__ = ["MLXBackend"]
+__all__ = ["MLXBackend", "OpenAIBackend"]

recursive_cleaner-1.0.1/backends/openai_backend.py ADDED

@@ -0,0 +1,71 @@
+"""OpenAI-compatible backend for Recursive Data Cleaner."""
+import os
+class OpenAIBackend:
+    """
+    OpenAI-compatible backend implementation.
+    Works with OpenAI API, LM Studio, Ollama, and other OpenAI-compatible servers.
+    Conforms to the LLMBackend protocol.
+    """
+    def __init__(
+        self,
+        model: str,
+        api_key: str | None = None,
+        base_url: str | None = None,
+        max_tokens: int = 4096,
+        temperature: float = 0.7,
+    ):
+        """
+        Initialize the OpenAI backend.
+        Args:
+            model: Model name (e.g., "gpt-4o", "gpt-3.5-turbo")
+            api_key: API key (defaults to OPENAI_API_KEY env var, or "not-needed" for local)
+            base_url: API base URL (defaults to OpenAI's API)
+            max_tokens: Maximum tokens to generate
+            temperature: Sampling temperature
+        """
+        try:
+            import openai
+        except ImportError:
+            raise ImportError(
+                "OpenAI SDK not installed. Install with: pip install openai"
+            )
+        self.model = model
+        self.max_tokens = max_tokens
+        self.temperature = temperature
+        # Resolve API key: explicit > env var > "not-needed" for local servers
+        if api_key is not None:
+            resolved_key = api_key
+        else:
+            resolved_key = os.environ.get("OPENAI_API_KEY", "not-needed")
+        # Create client
+        self._client = openai.OpenAI(
+            api_key=resolved_key,
+            base_url=base_url,
+        )
+    def generate(self, prompt: str) -> str:
+        """
+        Generate a response from the LLM.
+        Args:
+            prompt: The input prompt
+        Returns:
+            The generated text response
+        """
+        response = self._client.chat.completions.create(
+            model=self.model,
+            messages=[{"role": "user", "content": prompt}],
+            max_tokens=self.max_tokens,
+            temperature=self.temperature,
+        )
+        return response.choices[0].message.content or ""
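
Two details of the new backend are worth isolating: the API key precedence in `__init__` (explicit argument, then `OPENAI_API_KEY`, then the placeholder `"not-needed"` for local servers) and the single-method shape implied by "Conforms to the LLMBackend protocol". The protocol definition itself is not shown in this diff, so the `LLMBackend` class below is an assumption inferred from the `generate(prompt) -> str` method; `resolve_api_key` and `EchoBackend` are hypothetical names used only for illustration.

```python
from __future__ import annotations

import os
from typing import Protocol


class LLMBackend(Protocol):
    """Assumed shape of the protocol: a single generate(prompt) -> str method."""

    def generate(self, prompt: str) -> str: ...


def resolve_api_key(api_key: str | None = None) -> str:
    """Mirror the precedence in OpenAIBackend.__init__.

    Explicit argument > OPENAI_API_KEY env var > "not-needed" placeholder.
    """
    if api_key is not None:
        return api_key
    return os.environ.get("OPENAI_API_KEY", "not-needed")


class EchoBackend:
    """Network-free stand-in that satisfies the assumed protocol."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"
```

Because the check is `is not None`, an explicitly passed empty string is kept as the key and does not fall back to the environment variable. A stand-in such as `EchoBackend` can be handy for exercising code that expects a backend without installing the `openai` extra.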
