recursive-cleaner 0.7.0__tar.gz → 0.8.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/CLAUDE.md +10 -2
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/PKG-INFO +55 -16
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/README.md +52 -15
- recursive_cleaner-0.8.0/demo_tui.py +54 -0
- recursive_cleaner-0.8.0/docs/contracts/v080-api-contract.md +62 -0
- recursive_cleaner-0.8.0/docs/contracts/v080-data-schema.md +90 -0
- recursive_cleaner-0.8.0/docs/contracts/v080-success-criteria.md +70 -0
- recursive_cleaner-0.8.0/docs/implementation-plan-v080.md +182 -0
- recursive_cleaner-0.8.0/docs/research/rich-tui-patterns.md +110 -0
- recursive_cleaner-0.8.0/docs/workflow-state.md +24 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/pyproject.toml +4 -1
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/__init__.py +3 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/cleaner.py +117 -26
- recursive_cleaner-0.8.0/recursive_cleaner/tui.py +595 -0
- recursive_cleaner-0.8.0/tests/test_tui.py +758 -0
- recursive_cleaner-0.7.0/docs/workflow-state.md +0 -26
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/.gitignore +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/LICENSE +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/TODO.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/backends/__init__.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/backends/mlx_backend.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/api-contract.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/data-schema.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/success-criteria.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/text-mode-contract.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/tier2-contract.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/tier4-contract.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/tier4-success-criteria.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/two-pass-contract.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/v070-success-criteria.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/handoffs/tier4-handoff.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan-tier4.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan-v03.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan-v04.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan-v05.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/langchain-analysis.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/langgraph-analysis.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/mlx-lm-guide.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/other-frameworks-analysis.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/refactor-assessment/data/dependency.json +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/refactor-assessment/data/stats.json +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/refactor-assessment/plan.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/refactor-assessment/report.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/research/chonkie-extraction.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/research/chonkie.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/research/markitdown.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/smolagents-analysis.md +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/context.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/dependencies.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/errors.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/metrics.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/optimizer.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/output.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/parser_generator.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/parsers.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/prompt.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/report.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/response.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/schema.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/types.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/validation.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/vendor/__init__.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/vendor/chunker.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/ecommerce_instructions.txt +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/ecommerce_products.jsonl +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/financial_instructions.txt +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/financial_transactions.jsonl +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/healthcare_instructions.txt +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/healthcare_patients.jsonl +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/run_ecommerce_test.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/run_financial_test.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/run_healthcare_test.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/__init__.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_callbacks.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_cleaner.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_context.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_dependencies.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_dry_run.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_holdout.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_incremental.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_integration.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_latency.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_metrics.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_optimizer.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_output.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_parser_generator.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_parsers.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_report.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_sampling.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_schema.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_text_mode.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_validation.py +0 -0
- {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_vendor_chunker.py +0 -0
@@ -4,7 +4,9 @@
 
 | Version | Status | Date |
 |---------|--------|------|
-| v0.
+| v0.8.0 | **Implemented** | 2025-01-19 |
+| v0.7.0 | Implemented | 2025-01-17 |
+| v0.6.0 | Implemented | 2025-01-15 |
 | v0.5.1 | Implemented | 2025-01-15 |
 | v0.5.0 | Implemented | 2025-01-15 |
 | v0.4.0 | Implemented | 2025-01-15 |
@@ -12,9 +14,11 @@
 | v0.2.0 | Implemented | 2025-01-14 |
 | v0.1.0 | Implemented | 2025-01-14 |
 
-**Current State**: v0.
+**Current State**: v0.8.0 complete. 465 tests passing.
 
 ### Version History
+- **v0.8.0**: Terminal UI with Rich dashboard, mission control aesthetic, transmission log
+- **v0.7.0**: Markitdown integration (20+ formats), Parquet support, LLM-generated parsers
 - **v0.6.0**: Latency metrics, import consolidation, cleaning report, dry-run mode
 - **v0.5.1**: Dangerous code detection (AST-based security)
 - **v0.5.0**: Two-pass optimization with LLM agency (consolidation, early termination)
@@ -69,6 +73,8 @@ cleaner = DataCleaner(
     # Observability (v0.6.0)
     report_path="cleaning_report.md",  # Generate markdown report (None to disable)
     dry_run=False,  # Set True to analyze without generating functions
+    # Terminal UI (v0.8.0)
+    tui=True,  # Enable Rich dashboard (requires pip install recursive-cleaner[tui])
 )
 
 cleaner.run()  # Outputs: cleaning_functions.py, cleaning_report.md
@@ -159,6 +165,7 @@ recursive_cleaner/
 report.py # Markdown report generation (~120 lines) [v0.6.0]
 response.py # XML/markdown parsing + agency dataclasses (~292 lines)
 schema.py # Schema inference (~117 lines) [v0.2.0]
+tui.py # Rich terminal dashboard (~520 lines) [v0.8.0]
 types.py # LLMBackend protocol (~11 lines)
 validation.py # Runtime validation + safety checks (~200 lines)
 vendor/
@@ -187,6 +194,7 @@ tests/ # 392 tests
 test_sampling.py # Sampling strategy tests [v0.4.0]
 test_schema.py # Schema inference tests
 test_text_mode.py # Text mode tests [v0.3.0]
+test_tui.py # Terminal UI tests [v0.8.0]
 test_validation.py # Runtime validation + safety tests
 test_vendor_chunker.py # Vendored chunker tests [v0.3.0]
 
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: recursive-cleaner
-Version: 0.
+Version: 0.8.0
 Summary: LLM-powered incremental data cleaning pipeline that processes massive datasets in chunks and generates Python cleaning functions
 Project-URL: Homepage, https://github.com/gaztrabisme/recursive-data-cleaner
 Project-URL: Repository, https://github.com/gaztrabisme/recursive-data-cleaner
@@ -32,6 +32,8 @@ Provides-Extra: mlx
 Requires-Dist: mlx-lm>=0.10.0; extra == 'mlx'
 Provides-Extra: parquet
 Requires-Dist: pyarrow>=14.0.0; extra == 'parquet'
+Provides-Extra: tui
+Requires-Dist: rich>=13.0; extra == 'tui'
 Description-Content-Type: text/markdown
 
 # Recursive Data Cleaner
@@ -40,7 +42,7 @@ LLM-powered incremental data cleaning for massive datasets. Process files in chu
 
 ## How It Works
 
-1. **Chunk** your data (JSONL, CSV, JSON,
+1. **Chunk** your data (JSONL, CSV, JSON, Parquet, PDF, Word, Excel, XML, and more)
 2. **Analyze** each chunk with an LLM to identify issues
 3. **Generate** one cleaning function per issue
 4. **Validate** functions on holdout data before accepting
@@ -59,6 +61,21 @@ For Apple Silicon (MLX backend):
 pip install -e ".[mlx]"
 ```
 
+For document conversion (PDF, Word, Excel, HTML, etc.):
+```bash
+pip install -e ".[markitdown]"
+```
+
+For Parquet files:
+```bash
+pip install -e ".[parquet]"
+```
+
+For Terminal UI (Rich dashboard):
+```bash
+pip install -e ".[tui]"
+```
+
 ## Quick Start
 
 ```python
@@ -111,6 +128,18 @@ cleaner.run() # Generates cleaning_functions.py
 - **Cleaning Reports**: Markdown summary with functions, timing, quality delta
 - **Dry-Run Mode**: Analyze data without generating functions
 
+### Format Expansion (v0.7.0)
+- **Markitdown Integration**: Convert 20+ formats (PDF, Word, Excel, PowerPoint, HTML, EPUB, etc.) to text
+- **Parquet Support**: Load parquet files as structured data via pyarrow
+- **LLM-Generated Parsers**: Auto-generate parsers for XML and unknown formats (`auto_parse=True`)
+
+### Terminal UI (v0.8.0)
+- **Mission Control Dashboard**: Rich-based live terminal UI with retro aesthetic
+- **Real-time Progress**: Animated progress bars, chunk/iteration counters
+- **Transmission Log**: Parsed LLM responses showing issues detected and functions being generated
+- **Token Estimation**: Track estimated input/output tokens across the run
+- **Graceful Fallback**: Works without Rich installed (falls back to callbacks)
+
 ## Configuration
 
 ```python
@@ -142,6 +171,12 @@ cleaner = DataCleaner(
     report_path="report.md",  # Markdown report output (None to disable)
     dry_run=False,  # Analyze without generating functions
 
+    # Format Expansion
+    auto_parse=False,  # LLM generates parser for unknown formats
+
+    # Terminal UI
+    tui=True,  # Enable Rich dashboard (requires [tui] extra)
+
     # Progress & State
     on_progress=callback,  # Progress event callback
     state_file="state.json",  # Enable resume on interrupt
@@ -235,20 +270,22 @@ cleaner.run()
 
 ```
 recursive_cleaner/
-├── cleaner.py
-├── context.py
-├── dependencies.py
-├── metrics.py
-├── optimizer.py
-├── output.py
-├──
-├──
-├──
-├──
-├──
-├──
+├── cleaner.py # Main DataCleaner class
+├── context.py # Docstring registry with FIFO eviction
+├── dependencies.py # Topological sort for function ordering
+├── metrics.py # Quality metrics before/after
+├── optimizer.py # Two-pass consolidation with LLM agency
+├── output.py # Function file generation + import consolidation
+├── parser_generator.py # LLM-generated parsers for unknown formats
+├── parsers.py # Chunking for all formats + sampling
+├── prompt.py # LLM prompt templates
+├── report.py # Markdown report generation
+├── response.py # XML/markdown parsing + agency dataclasses
+├── schema.py # Schema inference
+├── tui.py # Rich terminal dashboard
+├── validation.py # Runtime validation + holdout
 └── vendor/
-    └── chunker.py
+    └── chunker.py # Vendored sentence-aware chunker
 ```
 
 ## Testing
@@ -257,7 +294,7 @@ recursive_cleaner/
 pytest tests/ -v
 ```
 
-
+465 tests covering all features. Test datasets in `test_cases/`:
 - E-commerce product catalogs
 - Healthcare patient records
 - Financial transaction data
@@ -273,6 +310,8 @@ pytest tests/ -v
 
 | Version | Features |
 |---------|----------|
+| v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
+| v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
 | v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
 | v0.5.1 | Dangerous code detection (AST-based security) |
 | v0.5.0 | Two-pass optimization, early termination, LLM agency |
@@ -4,7 +4,7 @@ LLM-powered incremental data cleaning for massive datasets. Process files in chu
 
 ## How It Works
 
-1. **Chunk** your data (JSONL, CSV, JSON,
+1. **Chunk** your data (JSONL, CSV, JSON, Parquet, PDF, Word, Excel, XML, and more)
 2. **Analyze** each chunk with an LLM to identify issues
 3. **Generate** one cleaning function per issue
 4. **Validate** functions on holdout data before accepting
@@ -23,6 +23,21 @@ For Apple Silicon (MLX backend):
 pip install -e ".[mlx]"
 ```
 
+For document conversion (PDF, Word, Excel, HTML, etc.):
+```bash
+pip install -e ".[markitdown]"
+```
+
+For Parquet files:
+```bash
+pip install -e ".[parquet]"
+```
+
+For Terminal UI (Rich dashboard):
+```bash
+pip install -e ".[tui]"
+```
+
 ## Quick Start
 
 ```python
@@ -75,6 +90,18 @@ cleaner.run() # Generates cleaning_functions.py
 - **Cleaning Reports**: Markdown summary with functions, timing, quality delta
 - **Dry-Run Mode**: Analyze data without generating functions
 
+### Format Expansion (v0.7.0)
+- **Markitdown Integration**: Convert 20+ formats (PDF, Word, Excel, PowerPoint, HTML, EPUB, etc.) to text
+- **Parquet Support**: Load parquet files as structured data via pyarrow
+- **LLM-Generated Parsers**: Auto-generate parsers for XML and unknown formats (`auto_parse=True`)
+
+### Terminal UI (v0.8.0)
+- **Mission Control Dashboard**: Rich-based live terminal UI with retro aesthetic
+- **Real-time Progress**: Animated progress bars, chunk/iteration counters
+- **Transmission Log**: Parsed LLM responses showing issues detected and functions being generated
+- **Token Estimation**: Track estimated input/output tokens across the run
+- **Graceful Fallback**: Works without Rich installed (falls back to callbacks)
+
 ## Configuration
 
 ```python
@@ -106,6 +133,12 @@ cleaner = DataCleaner(
     report_path="report.md",  # Markdown report output (None to disable)
     dry_run=False,  # Analyze without generating functions
 
+    # Format Expansion
+    auto_parse=False,  # LLM generates parser for unknown formats
+
+    # Terminal UI
+    tui=True,  # Enable Rich dashboard (requires [tui] extra)
+
     # Progress & State
     on_progress=callback,  # Progress event callback
     state_file="state.json",  # Enable resume on interrupt
@@ -199,20 +232,22 @@ cleaner.run()
 
 ```
 recursive_cleaner/
-├── cleaner.py
-├── context.py
-├── dependencies.py
-├── metrics.py
-├── optimizer.py
-├── output.py
-├──
-├──
-├──
-├──
-├──
-├──
+├── cleaner.py # Main DataCleaner class
+├── context.py # Docstring registry with FIFO eviction
+├── dependencies.py # Topological sort for function ordering
+├── metrics.py # Quality metrics before/after
+├── optimizer.py # Two-pass consolidation with LLM agency
+├── output.py # Function file generation + import consolidation
+├── parser_generator.py # LLM-generated parsers for unknown formats
+├── parsers.py # Chunking for all formats + sampling
+├── prompt.py # LLM prompt templates
+├── report.py # Markdown report generation
+├── response.py # XML/markdown parsing + agency dataclasses
+├── schema.py # Schema inference
+├── tui.py # Rich terminal dashboard
+├── validation.py # Runtime validation + holdout
 └── vendor/
-    └── chunker.py
+    └── chunker.py # Vendored sentence-aware chunker
 ```
 
 ## Testing
@@ -221,7 +256,7 @@ recursive_cleaner/
 pytest tests/ -v
 ```
 
-
+465 tests covering all features. Test datasets in `test_cases/`:
 - E-commerce product catalogs
 - Healthcare patient records
 - Financial transaction data
@@ -237,6 +272,8 @@ pytest tests/ -v
 
 | Version | Features |
 |---------|----------|
+| v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
+| v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
 | v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
 | v0.5.1 | Dangerous code detection (AST-based security) |
 | v0.5.0 | Two-pass optimization, early termination, LLM agency |
@@ -0,0 +1,54 @@
+#!/usr/bin/env python3
+"""
+Demo script to showcase the Rich TUI with real MLX backend.
+
+Run with:
+    python demo_tui.py
+
+Requirements:
+    pip install recursive-cleaner[mlx,tui]
+"""
+
+from backends import MLXBackend
+from recursive_cleaner import DataCleaner
+
+# Use a smaller/faster model for demo (change to your preferred model)
+MODEL = "lmstudio-community/Qwen3-Next-80B-A3B-Instruct-MLX-4bit"
+
+print("=" * 60)
+print(" RECURSIVE DATA CLEANER - TUI DEMO")
+print("=" * 60)
+print(f"\nLoading model: {MODEL}")
+print("This may take a moment on first run...\n")
+
+llm = MLXBackend(
+    model_path=MODEL,
+    max_tokens=2048,
+    temperature=0.3,  # Lower for more consistent output
+    verbose=False,  # Disable token streaming to avoid interfering with TUI
+)
+
+cleaner = DataCleaner(
+    llm_backend=llm,
+    file_path="test_cases/ecommerce_products.jsonl",
+    chunk_size=5,  # Small chunks for demo
+    max_iterations=3,  # Limit iterations per chunk
+    instructions="""
+    E-commerce product data cleaning:
+    - Normalize prices to float (remove $ symbols)
+    - Fix category typos and normalize to Title Case
+    - Convert weights to kg as float
+    - Ensure stock_quantity is non-negative integer
+    """,
+    tui=True,  # Enable the Rich dashboard!
+    track_metrics=True,
+)
+
+print("\nStarting cleaner with TUI enabled...")
+print("Watch the dashboard below!\n")
+
+cleaner.run()
+
+print("\n" + "=" * 60)
+print("Demo complete! Check cleaning_functions.py for output.")
+print("=" * 60)
@@ -0,0 +1,62 @@
+# API Contract: Rich TUI (v0.8.0)
+
+## New Parameter
+
+```python
+DataCleaner(
+    ...,
+    tui: bool = False,  # Enable Rich terminal dashboard
+)
+```
+
+## Behavior Matrix
+
+| `tui` | Rich installed | Behavior |
+|-------|----------------|----------|
+| `False` | Any | Existing callback-based output (no change) |
+| `True` | Yes | Live dashboard replaces callback prints |
+| `True` | No | Warning logged, falls back to callbacks |
+
+## New Optional Dependency
+
+```toml
+[project.optional-dependencies]
+tui = ["rich>=13.0"]
+```
+
+```bash
+pip install recursive-cleaner[tui]
+```
+
+## TUI Module API
+
+### `recursive_cleaner/tui.py`
+
+```python
+# Check availability
+HAS_RICH: bool
+
+# Main renderer class
+class TUIRenderer:
+    def __init__(self, file_path: str, total_chunks: int, total_records: int)
+    def start(self) -> None
+    def stop(self) -> None
+    def update_chunk(self, chunk_index: int, iteration: int, max_iterations: int) -> None
+    def update_llm_status(self, status: str) -> None  # "calling" | "idle"
+    def add_function(self, name: str, docstring: str) -> None
+    def update_metrics(self, quality_delta: float, latency_last: float, latency_avg: float, latency_total: float, llm_calls: int) -> None
+    def show_complete(self, summary: dict) -> None
+```
+
+## Integration with DataCleaner
+
+When `tui=True` and Rich available:
+1. `on_progress` callback still fires (for logging, state tracking)
+2. TUI replaces console output, not callbacks
+3. TUI auto-stops on completion or error
+
+## No Breaking Changes
+
+- All existing parameters unchanged
+- All existing callbacks unchanged
+- `tui=False` (default) = identical to v0.7.0 behavior
|
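The behavior matrix in the API contract above can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the package's actual code: `resolve_tui` and the logger name are hypothetical, and probing for Rich with `importlib.util.find_spec` is one possible way to compute a `HAS_RICH` flag (the real `tui.py` may detect it differently).

```python
import importlib.util
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("tui_fallback_sketch")

# Assumption: availability is detected by probing for the `rich` package.
HAS_RICH: bool = importlib.util.find_spec("rich") is not None


def resolve_tui(tui: bool, has_rich: bool = HAS_RICH) -> bool:
    """Return True only when the live dashboard should actually be used.

    Mirrors the three rows of the behavior matrix:
    tui=False -> callbacks; tui=True with Rich -> dashboard;
    tui=True without Rich -> warning, then callbacks.
    """
    if not tui:
        return False
    if not has_rich:
        logger.warning("tui=True but Rich is not installed; falling back to callbacks")
        return False
    return True
```

Passing `has_rich` explicitly keeps the decision testable on machines with or without Rich installed.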
@@ -0,0 +1,90 @@
+# Data Schema: TUI Display State (v0.8.0)
+
+## Dashboard State
+
+```python
+@dataclass
+class TUIState:
+    # Header
+    file_path: str
+    total_records: int
+    version: str = "0.8.0"
+
+    # Progress
+    current_chunk: int = 0
+    total_chunks: int = 0
+    current_iteration: int = 0
+    max_iterations: int = 5
+
+    # LLM Status
+    llm_status: Literal["idle", "calling"] = "idle"
+
+    # Functions
+    functions: list[FunctionInfo] = field(default_factory=list)
+
+    # Metrics
+    quality_delta: float = 0.0  # Percentage improvement
+    latency_last_ms: float = 0.0
+    latency_avg_ms: float = 0.0
+    latency_total_ms: float = 0.0
+    llm_call_count: int = 0
+
+@dataclass
+class FunctionInfo:
+    name: str
+    docstring: str  # First 50 chars displayed
+```
+
+## Dashboard Layout Schema
+
+```
+┌─────────────────────────────────────────────────────────┐
+│ {file_path} v{version} │ <- HEADER (size=3)
+├────────────────────┬────────────────────────────────────┤
+│ PROGRESS │ FUNCTIONS ({len(functions)}) │ <- BODY
+│ [████░░░░░░] {%} │ ├─ {functions[0].name} │
+│ Chunk {cur}/{tot} │ ├─ {functions[1].name} │
+│ Iter {i}/{max} │ └─ {functions[2].name} │
+│ │ (+{n} more) │
+│ {spinner} {status}│ QUALITY: +{quality_delta}% │
+├────────────────────┴────────────────────────────────────┤
+│ ⏱️ {latency_last}ms │ avg {latency_avg}ms │ {llm_calls} │ <- FOOTER (size=3)
+└─────────────────────────────────────────────────────────┘
+```
+
+## Color Scheme
+
+| Element | Color | Condition |
+|---------|-------|-----------|
+| Header title | cyan | Always |
+| Progress bar | yellow | In progress |
+| Progress bar | green | Chunk complete |
+| Spinner | yellow | LLM calling |
+| Function names | green | Always |
+| Quality delta | green | Positive |
+| Quality delta | red | Negative |
+| Latency | dim white | Always |
+
+## Spinner States
+
+| `llm_status` | Display |
+|--------------|---------|
+| `"calling"` | Animated spinner + "Calling LLM..." |
+| `"idle"` | Static checkmark or empty |
+
+## Completion Summary
+
+On `show_complete()`:
+
+```
+┌─────────────────────────────────────────────────────────┐
+│ ✓ COMPLETE │
+├─────────────────────────────────────────────────────────┤
+│ Functions generated: {n} │
+│ Chunks processed: {total_chunks} │
+│ Quality improvement: +{quality_delta}% │
+│ Total time: {latency_total}ms ({llm_calls} LLM calls) │
+│ │
+│ Output: cleaning_functions.py │
+└─────────────────────────────────────────────────────────┘
+```
|
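The `[████░░░░░░] {%}` cell in the layout schema above is derived from `current_chunk` and `total_chunks`. A minimal sketch of that derivation in plain Python follows; `ProgressState` and `render_progress` are hypothetical names for illustration (a cut-down stand-in for `TUIState`), and the real dashboard presumably uses Rich's own progress widgets instead of hand-built strings.

```python
from dataclasses import dataclass


@dataclass
class ProgressState:
    """Hypothetical subset of TUIState: only the fields the bar needs."""
    current_chunk: int = 0
    total_chunks: int = 0


def render_progress(state: ProgressState, width: int = 10) -> str:
    """Render a text bar such as '[████░░░░░░] 40%' from chunk counters."""
    if state.total_chunks <= 0:
        fraction = 0.0  # Avoid division by zero before chunking is known.
    else:
        # Clamp so an out-of-range chunk index cannot overflow the bar.
        fraction = min(max(state.current_chunk / state.total_chunks, 0.0), 1.0)
    filled = int(fraction * width)
    bar = "█" * filled + "░" * (width - filled)
    return f"[{bar}] {fraction * 100:.0f}%"


print(render_progress(ProgressState(current_chunk=4, total_chunks=10)))
```

The zero-chunk guard matters because `total_chunks` defaults to `0` in the schema, so the bar may be drawn before the file has been chunked.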
@@ -0,0 +1,70 @@
+# Success Criteria: Rich TUI (v0.8.0)
+
+## Project-Level Success
+
+- [ ] `pip install recursive-cleaner[tui]` installs rich>=13.0
+- [ ] `DataCleaner(..., tui=True)` shows live dashboard
+- [ ] Dashboard displays all state from data schema contract
+- [ ] Falls back gracefully when Rich not installed
+- [ ] All 432 existing tests pass
+- [ ] Zero breaking changes to existing API
+
+## Phase 1: Core TUI Module
+
+**Deliverables:**
+- [ ] `recursive_cleaner/tui.py` with `TUIRenderer` class
+- [ ] `HAS_RICH` check with graceful import
+- [ ] Basic `start()` / `stop()` lifecycle
+- [ ] Static layout matching schema (header, body split, footer)
+
+**Success Criteria:**
+- [ ] `from recursive_cleaner.tui import TUIRenderer, HAS_RICH` works
+- [ ] `TUIRenderer` can be instantiated without Rich (no crash)
+- [ ] With Rich: `start()` shows layout, `stop()` exits cleanly
+- [ ] Layout has correct sections per data schema
+
+**Tests:**
+- [ ] test_tui_import_without_rich
+- [ ] test_tui_renderer_lifecycle
+- [ ] test_tui_layout_structure
+
+## Phase 2: Dynamic Updates
+
+**Deliverables:**
+- [ ] `update_chunk()` updates progress bar and counters
+- [ ] `update_llm_status()` shows/hides spinner
+- [ ] `add_function()` appends to function list
+- [ ] `update_metrics()` updates footer stats
+
+**Success Criteria:**
+- [ ] Progress bar fills based on chunk_index/total_chunks
+- [ ] Spinner animates when status="calling", stops when "idle"
+- [ ] Functions list grows, shows "+N more" when >5 functions
+- [ ] Metrics panel shows formatted latency and counts
+
+**Tests:**
+- [ ] test_progress_updates
+- [ ] test_spinner_states
+- [ ] test_function_list_display
+- [ ] test_metrics_display
+
+## Phase 3: Integration & Polish
+
+**Deliverables:**
+- [ ] `tui=True` parameter on DataCleaner
+- [ ] Integration: TUI updates from cleaner loop
+- [ ] `show_complete()` with summary panel
+- [ ] Fallback warning when Rich not installed
+- [ ] Color transitions (yellow→green on chunk complete)
+
+**Success Criteria:**
+- [ ] Full cleaner run with `tui=True` shows live dashboard
+- [ ] Completion shows summary with all stats
+- [ ] `tui=True` without Rich logs warning, uses callbacks
+- [ ] Chunk completion triggers green color flash
+
+**Tests:**
+- [ ] test_datacleaner_tui_integration
+- [ ] test_tui_fallback_warning
+- [ ] test_completion_summary
+- [ ] test_color_transitions
|
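The Phase 2 criterion that the functions list "shows "+N more" when >5 functions" can be illustrated with a short helper. This is a sketch under stated assumptions only: `format_function_list` is a hypothetical name, the tree glyphs are borrowed from the layout schema, and the cap of five visible entries is taken from the criterion as written, not from the shipped `tui.py`.

```python
def format_function_list(names: list[str], max_visible: int = 5) -> list[str]:
    """Return display lines for the FUNCTIONS panel, truncating long lists.

    Shows at most `max_visible` names as a tree, then a '(+N more)' line
    when additional functions exist.
    """
    visible = names[:max_visible]
    lines = []
    for i, name in enumerate(visible):
        # The last visible entry gets the closing branch glyph.
        branch = "└─" if i == len(visible) - 1 else "├─"
        lines.append(f"{branch} {name}")
    hidden = len(names) - len(visible)
    if hidden > 0:
        lines.append(f"(+{hidden} more)")
    return lines


for line in format_function_list([f"clean_field_{i}" for i in range(7)]):
    print(line)
```

Capping the visible entries keeps the panel height fixed, so a growing function count does not push the footer off screen.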