PyPI - recursive-cleaner - Versions diffs - 0.7.1__py3-none-any.whl → 1.0.0__py3-none-any.whl - Mend

recursive-cleaner 0.7.1__py3-none-any.whl → 1.0.0__py3-none-any.whl

Files changed (13)

backends/__init__.py +2 -1
backends/openai_backend.py +71 -0
recursive_cleaner/__init__.py +5 -0
recursive_cleaner/__main__.py +8 -0
recursive_cleaner/apply.py +483 -0
recursive_cleaner/cleaner.py +122 -29
recursive_cleaner/cli.py +395 -0
recursive_cleaner/tui.py +614 -0
{recursive_cleaner-0.7.1.dist-info → recursive_cleaner-1.0.0.dist-info}/METADATA +119 -4
{recursive_cleaner-0.7.1.dist-info → recursive_cleaner-1.0.0.dist-info}/RECORD +13 -7
recursive_cleaner-1.0.0.dist-info/entry_points.txt +2 -0
{recursive_cleaner-0.7.1.dist-info → recursive_cleaner-1.0.0.dist-info}/WHEEL +0 -0
{recursive_cleaner-0.7.1.dist-info → recursive_cleaner-1.0.0.dist-info}/licenses/LICENSE +0 -0

{recursive_cleaner-0.7.1.dist-info → recursive_cleaner-1.0.0.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: recursive-cleaner
-Version: 0.7.1
+Version: 1.0.0
 Summary: LLM-powered incremental data cleaning pipeline that processes massive datasets in chunks and generates Python cleaning functions
 Project-URL: Homepage, https://github.com/gaztrabisme/recursive-data-cleaner
 Project-URL: Repository, https://github.com/gaztrabisme/recursive-data-cleaner
@@ -9,7 +9,7 @@ Author: Gary Tran
 License-Expression: MIT
 License-File: LICENSE
 Keywords: automation,data-cleaning,data-quality,etl,llm,machine-learning
-Classifier: Development Status :: 4 - Beta
+Classifier: Development Status :: 5 - Production/Stable
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
 Classifier: License :: OSI Approved :: MIT License
@@ -26,12 +26,19 @@ Requires-Dist: tenacity>=8.0
 Provides-Extra: dev
 Requires-Dist: pytest-cov>=4.0; extra == 'dev'
 Requires-Dist: pytest>=7.0; extra == 'dev'
+Provides-Extra: excel
+Requires-Dist: openpyxl>=3.0.0; extra == 'excel'
+Requires-Dist: xlrd>=2.0.0; extra == 'excel'
 Provides-Extra: markitdown
 Requires-Dist: markitdown>=0.1.0; extra == 'markitdown'
 Provides-Extra: mlx
 Requires-Dist: mlx-lm>=0.10.0; extra == 'mlx'
+Provides-Extra: openai
+Requires-Dist: openai>=1.0.0; extra == 'openai'
 Provides-Extra: parquet
 Requires-Dist: pyarrow>=14.0.0; extra == 'parquet'
+Provides-Extra: tui
+Requires-Dist: rich>=13.0; extra == 'tui'
 Description-Content-Type: text/markdown
 # Recursive Data Cleaner
@@ -69,6 +76,11 @@ For Parquet files:
 pip install -e ".[parquet]"
 ```
+For Terminal UI (Rich dashboard):
+```bash
+pip install -e ".[tui]"
+```
 ## Quick Start
 ```python
@@ -126,6 +138,98 @@ cleaner.run()  # Generates cleaning_functions.py
 - **Parquet Support**: Load parquet files as structured data via pyarrow
 - **LLM-Generated Parsers**: Auto-generate parsers for XML and unknown formats (`auto_parse=True`)
+### Terminal UI (v0.8.0)
+- **Mission Control Dashboard**: Rich-based live terminal UI with retro aesthetic
+- **Real-time Progress**: Animated progress bars, chunk/iteration counters
+- **Transmission Log**: Parsed LLM responses showing issues detected and functions being generated
+- **Token Estimation**: Track estimated input/output tokens across the run
+- **Graceful Fallback**: Works without Rich installed (falls back to callbacks)
+### CLI (v0.9.0)
+- **Command Line Interface**: Use without writing Python code
+- **Multiple Backends**: MLX (Apple Silicon) and OpenAI-compatible (OpenAI, LM Studio, Ollama)
+- **Four Commands**: `generate`, `analyze` (dry-run), `resume`, `apply`
+### Apply Mode (v1.0.0)
+- **Apply Cleaning Functions**: Apply generated functions to full datasets
+- **Data Formats**: JSONL, CSV, JSON, Parquet, Excel (.xlsx/.xls) output same format
+- **Text Formats**: PDF, Word, HTML, etc. output as Markdown
+- **Streaming**: Memory-efficient line-by-line processing for JSONL/CSV
+- **Colored TUI**: Enhanced transmission log with syntax-highlighted XML parsing
+## Command Line Interface
+After installation, the `recursive-cleaner` command is available:
+```bash
+# Generate cleaning functions with MLX (Apple Silicon)
+recursive-cleaner generate data.jsonl \
+  --provider mlx \
+  --model "lmstudio-community/Qwen3-80B-MLX-4bit" \
+  --instructions "Normalize phone numbers to E.164" \
+  --output cleaning_functions.py
+# Use OpenAI
+export OPENAI_API_KEY=your-key
+recursive-cleaner generate data.jsonl \
+  --provider openai \
+  --model gpt-4o \
+  --instructions "Fix date formats"
+# Use LM Studio or Ollama (OpenAI-compatible)
+recursive-cleaner generate data.jsonl \
+  --provider openai \
+  --model "qwen/qwen3-vl-30b" \
+  --base-url http://localhost:1234/v1 \
+  --instructions "Normalize prices"
+# Dry-run analysis
+recursive-cleaner analyze data.jsonl \
+  --provider openai \
+  --model gpt-4o \
+  --instructions @instructions.txt
+# Resume from checkpoint
+recursive-cleaner resume cleaning_state.json \
+  --provider mlx \
+  --model "model-path"
+# Apply cleaning functions to data
+recursive-cleaner apply data.jsonl \
+  --functions cleaning_functions.py \
+  --output cleaned_data.jsonl
+# Apply to Excel (outputs same format)
+recursive-cleaner apply sales.xlsx \
+  --functions cleaning_functions.py
+# Apply to PDF (outputs markdown)
+recursive-cleaner apply document.pdf \
+  --functions cleaning_functions.py \
+  --output cleaned.md
+```
+### CLI Options
+```
+recursive-cleaner generate <FILE> [OPTIONS]
+Required:
+  FILE                      Input data file
+  -p, --provider {mlx,openai}  LLM provider
+  -m, --model MODEL         Model name/path
+Optional:
+  -i, --instructions TEXT   Cleaning instructions (or @file.txt)
+  --base-url URL            API URL for OpenAI-compatible servers
+  --chunk-size N            Items per chunk (default: 50)
+  --max-iterations N        Max iterations per chunk (default: 5)
+  -o, --output PATH         Output file (default: cleaning_functions.py)
+  --tui                     Enable Rich dashboard
+  --optimize                Consolidate redundant functions
+  --track-metrics           Measure before/after quality
+```
 ## Configuration
 ```python
@@ -160,6 +264,9 @@ cleaner = DataCleaner(
     # Format Expansion
     auto_parse=False,           # LLM generates parser for unknown formats
+    # Terminal UI
+    tui=True,                   # Enable Rich dashboard (requires [tui] extra)
     # Progress & State
     on_progress=callback,       # Progress event callback
     state_file="state.json",    # Enable resume on interrupt
@@ -253,6 +360,7 @@ cleaner.run()
 ```
 recursive_cleaner/
+├── cli.py              # Command line interface
 ├── cleaner.py          # Main DataCleaner class
 ├── context.py          # Docstring registry with FIFO eviction
 ├── dependencies.py     # Topological sort for function ordering
@@ -265,9 +373,14 @@ recursive_cleaner/
 ├── report.py           # Markdown report generation
 ├── response.py         # XML/markdown parsing + agency dataclasses
 ├── schema.py           # Schema inference
+├── tui.py              # Rich terminal dashboard
 ├── validation.py       # Runtime validation + holdout
 └── vendor/
     └── chunker.py      # Vendored sentence-aware chunker
+backends/
+├── mlx_backend.py      # MLX-LM backend for Apple Silicon
+└── openai_backend.py   # OpenAI-compatible backend
 ```
 ## Testing
@@ -276,14 +389,14 @@ recursive_cleaner/
 pytest tests/ -v
 ```
-432 tests covering all features. Test datasets in `test_cases/`:
+548 tests covering all features. Test datasets in `test_cases/`:
 - E-commerce product catalogs
 - Healthcare patient records
 - Financial transaction data
 ## Philosophy
-- **Simplicity over extensibility**: ~3,000 lines that do one thing well
+- **Simplicity over extensibility**: ~5,000 lines that do one thing well
 - **stdlib over dependencies**: Only `tenacity` required
 - **Retry over recover**: On error, retry with error in prompt
 - **Wu wei**: Let the LLM make decisions about data it understands
@@ -292,6 +405,8 @@ pytest tests/ -v
 | Version | Features |
 |---------|----------|
+| v0.9.0 | CLI tool with MLX and OpenAI-compatible backends (LM Studio, Ollama) |
+| v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
 | v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
 | v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
 | v0.5.1 | Dangerous code detection (AST-based security) |
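
Note on the new extras: the METADATA hunk above adds three optional dependency groups (`excel`, `openai`, `tui`). The sketch below groups `Requires-Dist` entries by their `extra` marker using only the standard library. The header excerpt is copied from the diff; the helper name `requirements_by_extra` and the regular expression are illustrative assumptions, not part of the package, and the marker handling is deliberately simplified (it only looks for `extra == '...'`).

```python
from collections import defaultdict
from email.parser import HeaderParser
import re

# Header excerpt copied from the 1.0.0 METADATA shown in the diff above.
METADATA_EXCERPT = """\
Metadata-Version: 2.4
Name: recursive-cleaner
Version: 1.0.0
Requires-Dist: tenacity>=8.0
Requires-Dist: openpyxl>=3.0.0; extra == 'excel'
Requires-Dist: xlrd>=2.0.0; extra == 'excel'
Requires-Dist: openai>=1.0.0; extra == 'openai'
Requires-Dist: rich>=13.0; extra == 'tui'
"""

# Simplified marker match (assumption): only handles `extra == "<name>"`.
EXTRA_MARKER = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def requirements_by_extra(metadata_text):
    """Group Requires-Dist entries by extra; the key '' holds unconditional ones."""
    headers = HeaderParser().parsestr(metadata_text)
    grouped = defaultdict(list)
    for entry in headers.get_all("Requires-Dist", []):
        requirement, _, marker = entry.partition(";")
        match = EXTRA_MARKER.search(marker)
        grouped[match.group(1) if match else ""].append(requirement.strip())
    return dict(grouped)


groups = requirements_by_extra(METADATA_EXCERPT)
for extra, requirements in groups.items():
    print(extra or "(core)", requirements)
```

For anything beyond a quick inspection, a full marker parser such as the one in the `packaging` library is the safer choice, since real markers can combine several conditions.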

{recursive_cleaner-0.7.1.dist-info → recursive_cleaner-1.0.0.dist-info}/RECORD RENAMED

@@ -1,7 +1,11 @@
-backends/__init__.py,sha256=FUgODeYSGBvT0-z6myVby6YeAHG0nEUgWLITBKobUew,121
+backends/__init__.py,sha256=vWcPASV0GGEAydzOSjdrknkSHoGbSs4edtuv9HIzBhI,180
 backends/mlx_backend.py,sha256=0U6IqmDHyk4vjKzytvEcQvSUBryQTgFtsNOcpwFNKk8,2945
-recursive_cleaner/__init__.py,sha256=bG83PcmkxAYMC17FmKuyMJUrMnuukp32JO3rlCLyB-Q,1698
-recursive_cleaner/cleaner.py,sha256=J2X5bnk2OsWJyOn4BNR-cj0sqeKCylznfs_WEyMGxG8,26280
+backends/openai_backend.py,sha256=vKWsXKltBv_tJDoQfQ_7KVMZDfomhFFN2vl1oZ1KGbQ,2057
+recursive_cleaner/__init__.py,sha256=xCFlkqmmBoa7ntUZQnRQxVMv9iLeOvmboDS_j2EHfZI,1862
+recursive_cleaner/__main__.py,sha256=WXmMaL_myHPsG_qXAhZDufD43Ydsd25RV2IPeW2Kg08,152
+recursive_cleaner/apply.py,sha256=hjeljhZNiOuwz9m09RYVLl_z_9tet7LwubH6cb_Wy6Y,13855
+recursive_cleaner/cleaner.py,sha256=kPOQ44hgiJzABiqdmjg2hqd7Ot9uxKUSOe8_jz0UBQc,29911
+recursive_cleaner/cli.py,sha256=Sk_qYKxSn1PiPmMLKkyj9VxsseHaSXmSlGazxfmkTFc,12807
 recursive_cleaner/context.py,sha256=avMXRDxLd7nd8CKWtvPHQy1MFhBKiA0aUVVJIlWoLZ4,824
 recursive_cleaner/dependencies.py,sha256=vlYeoGL517v3yUSWN0wYDuIs9OOuQwM_dCBADrlitW8,2080
 recursive_cleaner/errors.py,sha256=hwRJF8NSmWy_FZHCxcZDZxLQ0zqvo5dX8ImkB9mrOYc,433
@@ -14,11 +18,13 @@ recursive_cleaner/prompt.py,sha256=ep0eOXz_XbhH3HduJ76LvzVSftonhcv4GLEecIqd3lY,6
 recursive_cleaner/report.py,sha256=AWWneRjvl76ccLlExdkKJeY3GVFUG_LtmzVIJJT5cFI,4629
 recursive_cleaner/response.py,sha256=3w0mLnqEPdB4daMSF0mtTcG0PTP-utb1HFtKuYA1ljw,9064
 recursive_cleaner/schema.py,sha256=w2hcEdApR15KVI9SFWB3VfumMoHFwn1YJrktdfgPo8M,3925
+recursive_cleaner/tui.py,sha256=zuiFPtMh3K-sC1CWZoaoUmgZ3rESkl10gYcqMzpVqiM,22598
 recursive_cleaner/types.py,sha256=-GdCmsfHd3rfdfCi5c-RXqX4TyuCSHgA__3AF3bMhoQ,290
 recursive_cleaner/validation.py,sha256=-KAolhw3GQyhHwmh0clEj8xqPD5O-R2AO5rx7vubIME,6442
 recursive_cleaner/vendor/__init__.py,sha256=E87TjmjRzu8ty39nqThvBwM611yXlLKQZ6KGY_zp3Dk,117
 recursive_cleaner/vendor/chunker.py,sha256=pDDbfF6FoSmUji0-RG4MletPxJ-VybGw0yfnhh0aMSQ,6730
-recursive_cleaner-0.7.1.dist-info/METADATA,sha256=X5_HVPMIPUULKKIgDvqhN0ZRQQBcZ1lupGb9frLdCSI,10258
-recursive_cleaner-0.7.1.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
-recursive_cleaner-0.7.1.dist-info/licenses/LICENSE,sha256=P8hRMK-UqRbQDsVN9nr901wpZcqwXEHr28DXhBUheF0,1064
-recursive_cleaner-0.7.1.dist-info/RECORD,,
+recursive_cleaner-1.0.0.dist-info/METADATA,sha256=L86ATNd8JxmPp32HKaO6PPwkmq4sIE3Mdvgx3pmUulE,14285
+recursive_cleaner-1.0.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
+recursive_cleaner-1.0.0.dist-info/entry_points.txt,sha256=S5nbi0rnifpShxdXGExeZnd65UZfp8K7DNyuKPST6nk,65
+recursive_cleaner-1.0.0.dist-info/licenses/LICENSE,sha256=P8hRMK-UqRbQDsVN9nr901wpZcqwXEHr28DXhBUheF0,1064
+recursive_cleaner-1.0.0.dist-info/RECORD,,
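
Note on reading RECORD lines: each entry above has the form `path,sha256=<digest>,<size>`, where the digest is the SHA-256 of the file encoded as URL-safe base64 with the trailing `=` padding removed. The sketch below rebuilds such a line from file bytes and compares it with a recorded one. The function names `record_entry` and `matches_record` are hypothetical, and the example uses an empty file rather than any file from this wheel, since the wheel contents are not reproduced here.

```python
import base64
import hashlib


def record_entry(path, data):
    """Build a RECORD-style line: path, urlsafe-base64 SHA-256 without padding, size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


def matches_record(line, data):
    """Return True if `data` reproduces the hash and size recorded in `line`."""
    path, _hash_field, _size = line.rsplit(",", 2)
    return line == record_entry(path, data)


# Illustrative example with an empty file (hypothetical path).
empty_line = record_entry("pkg/__init__.py", b"")
print(empty_line)
print(matches_record(empty_line, b""))
print(matches_record(empty_line, b"changed"))
```

Applied to an unpacked wheel, the same check would read each listed file in binary mode and pass its bytes to `matches_record`; the final `RECORD,,` line carries no hash and has to be skipped.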

recursive_cleaner-1.0.0.dist-info/entry_points.txt ADDED

@@ -0,0 +1,2 @@
+[console_scripts]
+recursive-cleaner = recursive_cleaner.cli:main
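
Note on the new entry point: the `console_scripts` line above maps the `recursive-cleaner` command to the `main` callable in `recursive_cleaner.cli`. The sketch below shows roughly how such a `name = module:attr` specification can be read and resolved with the standard library. The helper names `parse_console_scripts` and `load_target` are hypothetical, and installers perform additional work (such as generating a launcher script) that is not shown.

```python
import configparser
import importlib

# Content of entry_points.txt as added in 1.0.0 (copied from the hunk above).
ENTRY_POINTS_TXT = """\
[console_scripts]
recursive-cleaner = recursive_cleaner.cli:main
"""


def parse_console_scripts(text):
    """Return {script_name: (module, attribute)} for the console_scripts group."""
    # Restrict delimiters to '=' so the ':' in 'module:attr' stays in the value.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep script names case-sensitive
    parser.read_string(text)
    scripts = {}
    for name, target in parser.items("console_scripts"):
        module, _, attr = target.partition(":")
        scripts[name] = (module.strip(), attr.strip())
    return scripts


def load_target(module, attr):
    """Import `module` and return `attr`; requires the package to be installed."""
    return getattr(importlib.import_module(module), attr)


scripts = parse_console_scripts(ENTRY_POINTS_TXT)
print(scripts)
```

With the 1.0.0 wheel installed, `load_target(*scripts["recursive-cleaner"])` would be expected to return the function that the `recursive-cleaner` command invokes.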

{recursive_cleaner-0.7.1.dist-info → recursive_cleaner-1.0.0.dist-info}/WHEEL RENAMED

File without changes

{recursive_cleaner-0.7.1.dist-info → recursive_cleaner-1.0.0.dist-info}/licenses/LICENSE RENAMED

File without changes
