recursive-cleaner 0.7.0__tar.gz → 0.8.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (94)
  1. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/CLAUDE.md +10 -2
  2. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/PKG-INFO +55 -16
  3. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/README.md +52 -15
  4. recursive_cleaner-0.8.0/demo_tui.py +54 -0
  5. recursive_cleaner-0.8.0/docs/contracts/v080-api-contract.md +62 -0
  6. recursive_cleaner-0.8.0/docs/contracts/v080-data-schema.md +90 -0
  7. recursive_cleaner-0.8.0/docs/contracts/v080-success-criteria.md +70 -0
  8. recursive_cleaner-0.8.0/docs/implementation-plan-v080.md +182 -0
  9. recursive_cleaner-0.8.0/docs/research/rich-tui-patterns.md +110 -0
  10. recursive_cleaner-0.8.0/docs/workflow-state.md +24 -0
  11. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/pyproject.toml +4 -1
  12. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/__init__.py +3 -0
  13. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/cleaner.py +117 -26
  14. recursive_cleaner-0.8.0/recursive_cleaner/tui.py +595 -0
  15. recursive_cleaner-0.8.0/tests/test_tui.py +758 -0
  16. recursive_cleaner-0.7.0/docs/workflow-state.md +0 -26
  17. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/.gitignore +0 -0
  18. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/LICENSE +0 -0
  19. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/TODO.md +0 -0
  20. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/backends/__init__.py +0 -0
  21. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/backends/mlx_backend.py +0 -0
  22. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/api-contract.md +0 -0
  23. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/data-schema.md +0 -0
  24. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/success-criteria.md +0 -0
  25. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/text-mode-contract.md +0 -0
  26. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/tier2-contract.md +0 -0
  27. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/tier4-contract.md +0 -0
  28. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/tier4-success-criteria.md +0 -0
  29. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/two-pass-contract.md +0 -0
  30. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/contracts/v070-success-criteria.md +0 -0
  31. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/handoffs/tier4-handoff.md +0 -0
  32. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan-tier4.md +0 -0
  33. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan-v03.md +0 -0
  34. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan-v04.md +0 -0
  35. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan-v05.md +0 -0
  36. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/implementation-plan.md +0 -0
  37. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/langchain-analysis.md +0 -0
  38. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/langgraph-analysis.md +0 -0
  39. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/mlx-lm-guide.md +0 -0
  40. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/other-frameworks-analysis.md +0 -0
  41. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/refactor-assessment/data/dependency.json +0 -0
  42. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/refactor-assessment/data/stats.json +0 -0
  43. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/refactor-assessment/plan.md +0 -0
  44. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/refactor-assessment/report.md +0 -0
  45. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/research/chonkie-extraction.md +0 -0
  46. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/research/chonkie.md +0 -0
  47. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/research/markitdown.md +0 -0
  48. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/docs/smolagents-analysis.md +0 -0
  49. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/context.py +0 -0
  50. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/dependencies.py +0 -0
  51. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/errors.py +0 -0
  52. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/metrics.py +0 -0
  53. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/optimizer.py +0 -0
  54. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/output.py +0 -0
  55. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/parser_generator.py +0 -0
  56. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/parsers.py +0 -0
  57. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/prompt.py +0 -0
  58. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/report.py +0 -0
  59. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/response.py +0 -0
  60. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/schema.py +0 -0
  61. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/types.py +0 -0
  62. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/validation.py +0 -0
  63. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/vendor/__init__.py +0 -0
  64. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/recursive_cleaner/vendor/chunker.py +0 -0
  65. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/ecommerce_instructions.txt +0 -0
  66. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/ecommerce_products.jsonl +0 -0
  67. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/financial_instructions.txt +0 -0
  68. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/financial_transactions.jsonl +0 -0
  69. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/healthcare_instructions.txt +0 -0
  70. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/healthcare_patients.jsonl +0 -0
  71. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/run_ecommerce_test.py +0 -0
  72. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/run_financial_test.py +0 -0
  73. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/test_cases/run_healthcare_test.py +0 -0
  74. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/__init__.py +0 -0
  75. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_callbacks.py +0 -0
  76. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_cleaner.py +0 -0
  77. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_context.py +0 -0
  78. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_dependencies.py +0 -0
  79. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_dry_run.py +0 -0
  80. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_holdout.py +0 -0
  81. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_incremental.py +0 -0
  82. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_integration.py +0 -0
  83. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_latency.py +0 -0
  84. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_metrics.py +0 -0
  85. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_optimizer.py +0 -0
  86. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_output.py +0 -0
  87. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_parser_generator.py +0 -0
  88. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_parsers.py +0 -0
  89. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_report.py +0 -0
  90. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_sampling.py +0 -0
  91. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_schema.py +0 -0
  92. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_text_mode.py +0 -0
  93. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_validation.py +0 -0
  94. {recursive_cleaner-0.7.0 → recursive_cleaner-0.8.0}/tests/test_vendor_chunker.py +0 -0
@@ -4,7 +4,9 @@
4
4
 
5
5
  | Version | Status | Date |
6
6
  |---------|--------|------|
7
- | v0.6.0 | **Implemented** | 2025-01-15 |
7
+ | v0.8.0 | **Implemented** | 2025-01-19 |
8
+ | v0.7.0 | Implemented | 2025-01-17 |
9
+ | v0.6.0 | Implemented | 2025-01-15 |
8
10
  | v0.5.1 | Implemented | 2025-01-15 |
9
11
  | v0.5.0 | Implemented | 2025-01-15 |
10
12
  | v0.4.0 | Implemented | 2025-01-15 |
@@ -12,9 +14,11 @@
12
14
  | v0.2.0 | Implemented | 2025-01-14 |
13
15
  | v0.1.0 | Implemented | 2025-01-14 |
14
16
 
15
- **Current State**: v0.6.0 complete. 392 tests passing, 2,967 lines total.
17
+ **Current State**: v0.8.0 complete. 465 tests passing.
16
18
 
17
19
  ### Version History
20
+ - **v0.8.0**: Terminal UI with Rich dashboard, mission control aesthetic, transmission log
21
+ - **v0.7.0**: Markitdown integration (20+ formats), Parquet support, LLM-generated parsers
18
22
  - **v0.6.0**: Latency metrics, import consolidation, cleaning report, dry-run mode
19
23
  - **v0.5.1**: Dangerous code detection (AST-based security)
20
24
  - **v0.5.0**: Two-pass optimization with LLM agency (consolidation, early termination)
@@ -69,6 +73,8 @@ cleaner = DataCleaner(
69
73
  # Observability (v0.6.0)
70
74
  report_path="cleaning_report.md", # Generate markdown report (None to disable)
71
75
  dry_run=False, # Set True to analyze without generating functions
76
+ # Terminal UI (v0.8.0)
77
+ tui=True, # Enable Rich dashboard (requires pip install recursive-cleaner[tui])
72
78
  )
73
79
 
74
80
  cleaner.run() # Outputs: cleaning_functions.py, cleaning_report.md
@@ -159,6 +165,7 @@ recursive_cleaner/
159
165
  report.py # Markdown report generation (~120 lines) [v0.6.0]
160
166
  response.py # XML/markdown parsing + agency dataclasses (~292 lines)
161
167
  schema.py # Schema inference (~117 lines) [v0.2.0]
168
+ tui.py # Rich terminal dashboard (~520 lines) [v0.8.0]
162
169
  types.py # LLMBackend protocol (~11 lines)
163
170
  validation.py # Runtime validation + safety checks (~200 lines)
164
171
  vendor/
@@ -187,6 +194,7 @@ tests/ # 392 tests
187
194
  test_sampling.py # Sampling strategy tests [v0.4.0]
188
195
  test_schema.py # Schema inference tests
189
196
  test_text_mode.py # Text mode tests [v0.3.0]
197
+ test_tui.py # Terminal UI tests [v0.8.0]
190
198
  test_validation.py # Runtime validation + safety tests
191
199
  test_vendor_chunker.py # Vendored chunker tests [v0.3.0]
192
200
 
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: recursive-cleaner
3
- Version: 0.7.0
3
+ Version: 0.8.0
4
4
  Summary: LLM-powered incremental data cleaning pipeline that processes massive datasets in chunks and generates Python cleaning functions
5
5
  Project-URL: Homepage, https://github.com/gaztrabisme/recursive-data-cleaner
6
6
  Project-URL: Repository, https://github.com/gaztrabisme/recursive-data-cleaner
@@ -32,6 +32,8 @@ Provides-Extra: mlx
32
32
  Requires-Dist: mlx-lm>=0.10.0; extra == 'mlx'
33
33
  Provides-Extra: parquet
34
34
  Requires-Dist: pyarrow>=14.0.0; extra == 'parquet'
35
+ Provides-Extra: tui
36
+ Requires-Dist: rich>=13.0; extra == 'tui'
35
37
  Description-Content-Type: text/markdown
36
38
 
37
39
  # Recursive Data Cleaner
@@ -40,7 +42,7 @@ LLM-powered incremental data cleaning for massive datasets. Process files in chu
40
42
 
41
43
  ## How It Works
42
44
 
43
- 1. **Chunk** your data (JSONL, CSV, JSON, or text)
45
+ 1. **Chunk** your data (JSONL, CSV, JSON, Parquet, PDF, Word, Excel, XML, and more)
44
46
  2. **Analyze** each chunk with an LLM to identify issues
45
47
  3. **Generate** one cleaning function per issue
46
48
  4. **Validate** functions on holdout data before accepting
@@ -59,6 +61,21 @@ For Apple Silicon (MLX backend):
59
61
  pip install -e ".[mlx]"
60
62
  ```
61
63
 
64
+ For document conversion (PDF, Word, Excel, HTML, etc.):
65
+ ```bash
66
+ pip install -e ".[markitdown]"
67
+ ```
68
+
69
+ For Parquet files:
70
+ ```bash
71
+ pip install -e ".[parquet]"
72
+ ```
73
+
74
+ For Terminal UI (Rich dashboard):
75
+ ```bash
76
+ pip install -e ".[tui]"
77
+ ```
78
+
62
79
  ## Quick Start
63
80
 
64
81
  ```python
@@ -111,6 +128,18 @@ cleaner.run() # Generates cleaning_functions.py
111
128
  - **Cleaning Reports**: Markdown summary with functions, timing, quality delta
112
129
  - **Dry-Run Mode**: Analyze data without generating functions
113
130
 
131
+ ### Format Expansion (v0.7.0)
132
+ - **Markitdown Integration**: Convert 20+ formats (PDF, Word, Excel, PowerPoint, HTML, EPUB, etc.) to text
133
+ - **Parquet Support**: Load parquet files as structured data via pyarrow
134
+ - **LLM-Generated Parsers**: Auto-generate parsers for XML and unknown formats (`auto_parse=True`)
135
+
136
+ ### Terminal UI (v0.8.0)
137
+ - **Mission Control Dashboard**: Rich-based live terminal UI with retro aesthetic
138
+ - **Real-time Progress**: Animated progress bars, chunk/iteration counters
139
+ - **Transmission Log**: Parsed LLM responses showing issues detected and functions being generated
140
+ - **Token Estimation**: Track estimated input/output tokens across the run
141
+ - **Graceful Fallback**: Works without Rich installed (falls back to callbacks)
142
+
114
143
  ## Configuration
115
144
 
116
145
  ```python
@@ -142,6 +171,12 @@ cleaner = DataCleaner(
142
171
  report_path="report.md", # Markdown report output (None to disable)
143
172
  dry_run=False, # Analyze without generating functions
144
173
 
174
+ # Format Expansion
175
+ auto_parse=False, # LLM generates parser for unknown formats
176
+
177
+ # Terminal UI
178
+ tui=True, # Enable Rich dashboard (requires [tui] extra)
179
+
145
180
  # Progress & State
146
181
  on_progress=callback, # Progress event callback
147
182
  state_file="state.json", # Enable resume on interrupt
@@ -235,20 +270,22 @@ cleaner.run()
235
270
 
236
271
  ```
237
272
  recursive_cleaner/
238
- ├── cleaner.py # Main DataCleaner class (~580 lines)
239
- ├── context.py # Docstring registry with FIFO eviction
240
- ├── dependencies.py # Topological sort for function ordering
241
- ├── metrics.py # Quality metrics before/after
242
- ├── optimizer.py # Two-pass consolidation with LLM agency
243
- ├── output.py # Function file generation + import consolidation
244
- ├── parsers.py # Chunking for JSONL/CSV/JSON/text + sampling
245
- ├── prompt.py # LLM prompt templates
246
- ├── report.py # Markdown report generation
247
- ├── response.py # XML/markdown parsing + agency dataclasses
248
- ├── schema.py # Schema inference
249
- ├── validation.py # Runtime validation + holdout
273
+ ├── cleaner.py # Main DataCleaner class
274
+ ├── context.py # Docstring registry with FIFO eviction
275
+ ├── dependencies.py # Topological sort for function ordering
276
+ ├── metrics.py # Quality metrics before/after
277
+ ├── optimizer.py # Two-pass consolidation with LLM agency
278
+ ├── output.py # Function file generation + import consolidation
279
+ ├── parser_generator.py # LLM-generated parsers for unknown formats
280
+ ├── parsers.py # Chunking for all formats + sampling
281
+ ├── prompt.py # LLM prompt templates
282
+ ├── report.py # Markdown report generation
283
+ ├── response.py # XML/markdown parsing + agency dataclasses
284
+ ├── schema.py # Schema inference
285
+ ├── tui.py # Rich terminal dashboard
286
+ ├── validation.py # Runtime validation + holdout
250
287
  └── vendor/
251
- └── chunker.py # Vendored sentence-aware chunker
288
+ └── chunker.py # Vendored sentence-aware chunker
252
289
  ```
253
290
 
254
291
  ## Testing
@@ -257,7 +294,7 @@ recursive_cleaner/
257
294
  pytest tests/ -v
258
295
  ```
259
296
 
260
- 392 tests covering all features. Test datasets in `test_cases/`:
297
+ 465 tests covering all features. Test datasets in `test_cases/`:
261
298
  - E-commerce product catalogs
262
299
  - Healthcare patient records
263
300
  - Financial transaction data
@@ -273,6 +310,8 @@ pytest tests/ -v
273
310
 
274
311
  | Version | Features |
275
312
  |---------|----------|
313
+ | v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
314
+ | v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
276
315
  | v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
277
316
  | v0.5.1 | Dangerous code detection (AST-based security) |
278
317
  | v0.5.0 | Two-pass optimization, early termination, LLM agency |
@@ -4,7 +4,7 @@ LLM-powered incremental data cleaning for massive datasets. Process files in chu
4
4
 
5
5
  ## How It Works
6
6
 
7
- 1. **Chunk** your data (JSONL, CSV, JSON, or text)
7
+ 1. **Chunk** your data (JSONL, CSV, JSON, Parquet, PDF, Word, Excel, XML, and more)
8
8
  2. **Analyze** each chunk with an LLM to identify issues
9
9
  3. **Generate** one cleaning function per issue
10
10
  4. **Validate** functions on holdout data before accepting
@@ -23,6 +23,21 @@ For Apple Silicon (MLX backend):
23
23
  pip install -e ".[mlx]"
24
24
  ```
25
25
 
26
+ For document conversion (PDF, Word, Excel, HTML, etc.):
27
+ ```bash
28
+ pip install -e ".[markitdown]"
29
+ ```
30
+
31
+ For Parquet files:
32
+ ```bash
33
+ pip install -e ".[parquet]"
34
+ ```
35
+
36
+ For Terminal UI (Rich dashboard):
37
+ ```bash
38
+ pip install -e ".[tui]"
39
+ ```
40
+
26
41
  ## Quick Start
27
42
 
28
43
  ```python
@@ -75,6 +90,18 @@ cleaner.run() # Generates cleaning_functions.py
75
90
  - **Cleaning Reports**: Markdown summary with functions, timing, quality delta
76
91
  - **Dry-Run Mode**: Analyze data without generating functions
77
92
 
93
+ ### Format Expansion (v0.7.0)
94
+ - **Markitdown Integration**: Convert 20+ formats (PDF, Word, Excel, PowerPoint, HTML, EPUB, etc.) to text
95
+ - **Parquet Support**: Load parquet files as structured data via pyarrow
96
+ - **LLM-Generated Parsers**: Auto-generate parsers for XML and unknown formats (`auto_parse=True`)
97
+
98
+ ### Terminal UI (v0.8.0)
99
+ - **Mission Control Dashboard**: Rich-based live terminal UI with retro aesthetic
100
+ - **Real-time Progress**: Animated progress bars, chunk/iteration counters
101
+ - **Transmission Log**: Parsed LLM responses showing issues detected and functions being generated
102
+ - **Token Estimation**: Track estimated input/output tokens across the run
103
+ - **Graceful Fallback**: Works without Rich installed (falls back to callbacks)
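The Token Estimation bullet above does not say how tokens are counted, and the diff does not show the package's heuristic. As a rough illustration only, the sketch below assumes the common "about four characters per token" rule of thumb; `estimate_tokens` and `TokenTally` are hypothetical names, not part of `recursive_cleaner`.

```python
from dataclasses import dataclass


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: ceiling of len(text) / chars_per_token (assumed heuristic)."""
    return (len(text) + chars_per_token - 1) // chars_per_token


@dataclass
class TokenTally:
    """Running totals of estimated input/output tokens across a run (hypothetical)."""

    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, prompt: str, response: str) -> None:
        # Add the estimates for one LLM call to the running totals.
        self.input_tokens += estimate_tokens(prompt)
        self.output_tokens += estimate_tokens(response)
```

A tally like this would be updated once per LLM call. The numbers are estimates and should not be expected to match a real tokenizer.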
104
+
78
105
  ## Configuration
79
106
 
80
107
  ```python
@@ -106,6 +133,12 @@ cleaner = DataCleaner(
106
133
  report_path="report.md", # Markdown report output (None to disable)
107
134
  dry_run=False, # Analyze without generating functions
108
135
 
136
+ # Format Expansion
137
+ auto_parse=False, # LLM generates parser for unknown formats
138
+
139
+ # Terminal UI
140
+ tui=True, # Enable Rich dashboard (requires [tui] extra)
141
+
109
142
  # Progress & State
110
143
  on_progress=callback, # Progress event callback
111
144
  state_file="state.json", # Enable resume on interrupt
@@ -199,20 +232,22 @@ cleaner.run()
199
232
 
200
233
  ```
201
234
  recursive_cleaner/
202
- ├── cleaner.py # Main DataCleaner class (~580 lines)
203
- ├── context.py # Docstring registry with FIFO eviction
204
- ├── dependencies.py # Topological sort for function ordering
205
- ├── metrics.py # Quality metrics before/after
206
- ├── optimizer.py # Two-pass consolidation with LLM agency
207
- ├── output.py # Function file generation + import consolidation
208
- ├── parsers.py # Chunking for JSONL/CSV/JSON/text + sampling
209
- ├── prompt.py # LLM prompt templates
210
- ├── report.py # Markdown report generation
211
- ├── response.py # XML/markdown parsing + agency dataclasses
212
- ├── schema.py # Schema inference
213
- ├── validation.py # Runtime validation + holdout
235
+ ├── cleaner.py # Main DataCleaner class
236
+ ├── context.py # Docstring registry with FIFO eviction
237
+ ├── dependencies.py # Topological sort for function ordering
238
+ ├── metrics.py # Quality metrics before/after
239
+ ├── optimizer.py # Two-pass consolidation with LLM agency
240
+ ├── output.py # Function file generation + import consolidation
241
+ ├── parser_generator.py # LLM-generated parsers for unknown formats
242
+ ├── parsers.py # Chunking for all formats + sampling
243
+ ├── prompt.py # LLM prompt templates
244
+ ├── report.py # Markdown report generation
245
+ ├── response.py # XML/markdown parsing + agency dataclasses
246
+ ├── schema.py # Schema inference
247
+ ├── tui.py # Rich terminal dashboard
248
+ ├── validation.py # Runtime validation + holdout
214
249
  └── vendor/
215
- └── chunker.py # Vendored sentence-aware chunker
250
+ └── chunker.py # Vendored sentence-aware chunker
216
251
  ```
217
252
 
218
253
  ## Testing
@@ -221,7 +256,7 @@ recursive_cleaner/
221
256
  pytest tests/ -v
222
257
  ```
223
258
 
224
- 392 tests covering all features. Test datasets in `test_cases/`:
259
+ 465 tests covering all features. Test datasets in `test_cases/`:
225
260
  - E-commerce product catalogs
226
261
  - Healthcare patient records
227
262
  - Financial transaction data
@@ -237,6 +272,8 @@ pytest tests/ -v
237
272
 
238
273
  | Version | Features |
239
274
  |---------|----------|
275
+ | v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
276
+ | v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
240
277
  | v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
241
278
  | v0.5.1 | Dangerous code detection (AST-based security) |
242
279
  | v0.5.0 | Two-pass optimization, early termination, LLM agency |
@@ -0,0 +1,54 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Demo script to showcase the Rich TUI with real MLX backend.
4
+
5
+ Run with:
6
+ python demo_tui.py
7
+
8
+ Requirements:
9
+ pip install recursive-cleaner[mlx,tui]
10
+ """
11
+
12
+ from backends import MLXBackend
13
+ from recursive_cleaner import DataCleaner
14
+
15
+ # Use a smaller/faster model for demo (change to your preferred model)
16
+ MODEL = "lmstudio-community/Qwen3-Next-80B-A3B-Instruct-MLX-4bit"
17
+
18
+ print("=" * 60)
19
+ print(" RECURSIVE DATA CLEANER - TUI DEMO")
20
+ print("=" * 60)
21
+ print(f"\nLoading model: {MODEL}")
22
+ print("This may take a moment on first run...\n")
23
+
24
+ llm = MLXBackend(
25
+ model_path=MODEL,
26
+ max_tokens=2048,
27
+ temperature=0.3, # Lower for more consistent output
28
+ verbose=False, # Disable token streaming to avoid interfering with TUI
29
+ )
30
+
31
+ cleaner = DataCleaner(
32
+ llm_backend=llm,
33
+ file_path="test_cases/ecommerce_products.jsonl",
34
+ chunk_size=5, # Small chunks for demo
35
+ max_iterations=3, # Limit iterations per chunk
36
+ instructions="""
37
+ E-commerce product data cleaning:
38
+ - Normalize prices to float (remove $ symbols)
39
+ - Fix category typos and normalize to Title Case
40
+ - Convert weights to kg as float
41
+ - Ensure stock_quantity is non-negative integer
42
+ """,
43
+ tui=True, # Enable the Rich dashboard!
44
+ track_metrics=True,
45
+ )
46
+
47
+ print("\nStarting cleaner with TUI enabled...")
48
+ print("Watch the dashboard below!\n")
49
+
50
+ cleaner.run()
51
+
52
+ print("\n" + "=" * 60)
53
+ print("Demo complete! Check cleaning_functions.py for output.")
54
+ print("=" * 60)
@@ -0,0 +1,62 @@
1
+ # API Contract: Rich TUI (v0.8.0)
2
+
3
+ ## New Parameter
4
+
5
+ ```python
6
+ DataCleaner(
7
+ ...,
8
+ tui: bool = False, # Enable Rich terminal dashboard
9
+ )
10
+ ```
11
+
12
+ ## Behavior Matrix
13
+
14
+ | `tui` | Rich installed | Behavior |
15
+ |-------|----------------|----------|
16
+ | `False` | Any | Existing callback-based output (no change) |
17
+ | `True` | Yes | Live dashboard replaces callback prints |
18
+ | `True` | No | Warning logged, falls back to callbacks |
19
+
20
+ ## New Optional Dependency
21
+
22
+ ```toml
23
+ [project.optional-dependencies]
24
+ tui = ["rich>=13.0"]
25
+ ```
26
+
27
+ ```bash
28
+ pip install recursive-cleaner[tui]
29
+ ```
30
+
31
+ ## TUI Module API
32
+
33
+ ### `recursive_cleaner/tui.py`
34
+
35
+ ```python
36
+ # Check availability
37
+ HAS_RICH: bool
38
+
39
+ # Main renderer class
40
+ class TUIRenderer:
41
+ def __init__(self, file_path: str, total_chunks: int, total_records: int)
42
+ def start(self) -> None
43
+ def stop(self) -> None
44
+ def update_chunk(self, chunk_index: int, iteration: int, max_iterations: int) -> None
45
+ def update_llm_status(self, status: str) -> None # "calling" | "idle"
46
+ def add_function(self, name: str, docstring: str) -> None
47
+ def update_metrics(self, quality_delta: float, latency_last: float, latency_avg: float, latency_total: float, llm_calls: int) -> None
48
+ def show_complete(self, summary: dict) -> None
49
+ ```
50
+
51
+ ## Integration with DataCleaner
52
+
53
+ When `tui=True` and Rich available:
54
+ 1. `on_progress` callback still fires (for logging, state tracking)
55
+ 2. TUI replaces console output, not callbacks
56
+ 3. TUI auto-stops on completion or error
57
+
58
+ ## No Breaking Changes
59
+
60
+ - All existing parameters unchanged
61
+ - All existing callbacks unchanged
62
+ - `tui=False` (default) = identical to v0.7.0 behavior
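The behavior matrix in this contract can be read as a small decision function. The following is a minimal sketch, not the package's code: `resolve_tui`, its `has_rich` argument (standing in for the module-level `HAS_RICH` flag), and the logger name are all hypothetical.

```python
import logging

logger = logging.getLogger("recursive_cleaner.tui_sketch")  # hypothetical logger name


def resolve_tui(tui: bool, has_rich: bool) -> bool:
    """Return True only when the live dashboard should be used.

    Mirrors the three rows of the behavior matrix:
    - tui=False            -> callback-based output, regardless of Rich
    - tui=True, Rich found -> live dashboard
    - tui=True, no Rich    -> log a warning, fall back to callbacks
    """
    if not tui:
        return False
    if not has_rich:
        logger.warning("tui=True but Rich is not installed; falling back to callbacks")
        return False
    return True
```

Keeping the decision in one pure function makes each row of the matrix easy to test without importing Rich at all.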
@@ -0,0 +1,90 @@
1
+ # Data Schema: TUI Display State (v0.8.0)
2
+
3
+ ## Dashboard State
4
+
5
+ ```python
6
+ @dataclass
7
+ class TUIState:
8
+ # Header
9
+ file_path: str
10
+ total_records: int
11
+ version: str = "0.8.0"
12
+
13
+ # Progress
14
+ current_chunk: int = 0
15
+ total_chunks: int = 0
16
+ current_iteration: int = 0
17
+ max_iterations: int = 5
18
+
19
+ # LLM Status
20
+ llm_status: Literal["idle", "calling"] = "idle"
21
+
22
+ # Functions
23
+ functions: list[FunctionInfo] = field(default_factory=list)
24
+
25
+ # Metrics
26
+ quality_delta: float = 0.0 # Percentage improvement
27
+ latency_last_ms: float = 0.0
28
+ latency_avg_ms: float = 0.0
29
+ latency_total_ms: float = 0.0
30
+ llm_call_count: int = 0
31
+
32
+ @dataclass
33
+ class FunctionInfo:
34
+ name: str
35
+ docstring: str # First 50 chars displayed
36
+ ```
37
+
38
+ ## Dashboard Layout Schema
39
+
40
+ ```
41
+ ┌─────────────────────────────────────────────────────────┐
42
+ │ {file_path} v{version} │ <- HEADER (size=3)
43
+ ├────────────────────┬────────────────────────────────────┤
44
+ │ PROGRESS │ FUNCTIONS ({len(functions)}) │ <- BODY
45
+ │ [████░░░░░░] {%} │ ├─ {functions[0].name} │
46
+ │ Chunk {cur}/{tot} │ ├─ {functions[1].name} │
47
+ │ Iter {i}/{max} │ └─ {functions[2].name} │
48
+ │ │ (+{n} more) │
49
+ │ {spinner} {status}│ QUALITY: +{quality_delta}% │
50
+ ├────────────────────┴────────────────────────────────────┤
51
+ │ ⏱️ {latency_last}ms │ avg {latency_avg}ms │ {llm_calls} │ <- FOOTER (size=3)
52
+ └─────────────────────────────────────────────────────────┘
53
+ ```
54
+
55
+ ## Color Scheme
56
+
57
+ | Element | Color | Condition |
58
+ |---------|-------|-----------|
59
+ | Header title | cyan | Always |
60
+ | Progress bar | yellow | In progress |
61
+ | Progress bar | green | Chunk complete |
62
+ | Spinner | yellow | LLM calling |
63
+ | Function names | green | Always |
64
+ | Quality delta | green | Positive |
65
+ | Quality delta | red | Negative |
66
+ | Latency | dim white | Always |
67
+
68
+ ## Spinner States
69
+
70
+ | `llm_status` | Display |
71
+ |--------------|---------|
72
+ | `"calling"` | Animated spinner + "Calling LLM..." |
73
+ | `"idle"` | Static checkmark or empty |
74
+
75
+ ## Completion Summary
76
+
77
+ On `show_complete()`:
78
+
79
+ ```
80
+ ┌─────────────────────────────────────────────────────────┐
81
+ │ ✓ COMPLETE │
82
+ ├─────────────────────────────────────────────────────────┤
83
+ │ Functions generated: {n} │
84
+ │ Chunks processed: {total_chunks} │
85
+ │ Quality improvement: +{quality_delta}% │
86
+ │ Total time: {latency_total}ms ({llm_calls} LLM calls) │
87
+ │ │
88
+ │ Output: cleaning_functions.py │
89
+ └─────────────────────────────────────────────────────────┘
90
+ ```
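To make the schema concrete, here is a trimmed, runnable sketch of the state dataclasses together with one way the `[████░░░░░░] {%}` progress line could be derived from the chunk counters. `progress_bar` is a hypothetical helper and only a subset of the `TUIState` fields is reproduced. `FunctionInfo` is defined first here so that the `list[FunctionInfo]` annotation resolves when `TUIState` is created.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class FunctionInfo:
    name: str
    docstring: str


@dataclass
class TUIState:
    # Subset of the fields listed in the schema above.
    file_path: str
    total_records: int
    current_chunk: int = 0
    total_chunks: int = 0
    llm_status: Literal["idle", "calling"] = "idle"
    functions: list[FunctionInfo] = field(default_factory=list)


def progress_bar(state: TUIState, width: int = 10) -> str:
    """Render a text progress bar from the chunk counters (hypothetical helper)."""
    total = state.total_chunks
    if total <= 0:
        # No chunks known yet: show an empty bar rather than dividing by zero.
        filled, percent = 0, 0
    else:
        done = min(max(state.current_chunk, 0), total)  # clamp to [0, total]
        filled = width * done // total
        percent = 100 * done // total
    return f"[{'█' * filled}{'░' * (width - filled)}] {percent}%"
```

Integer arithmetic is used so the bar width and percentage never depend on floating-point rounding.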
@@ -0,0 +1,70 @@
1
+ # Success Criteria: Rich TUI (v0.8.0)
2
+
3
+ ## Project-Level Success
4
+
5
+ - [ ] `pip install recursive-cleaner[tui]` installs rich>=13.0
6
+ - [ ] `DataCleaner(..., tui=True)` shows live dashboard
7
+ - [ ] Dashboard displays all state from data schema contract
8
+ - [ ] Falls back gracefully when Rich not installed
9
+ - [ ] All 432 existing tests pass
10
+ - [ ] Zero breaking changes to existing API
11
+
12
+ ## Phase 1: Core TUI Module
13
+
14
+ **Deliverables:**
15
+ - [ ] `recursive_cleaner/tui.py` with `TUIRenderer` class
16
+ - [ ] `HAS_RICH` check with graceful import
17
+ - [ ] Basic `start()` / `stop()` lifecycle
18
+ - [ ] Static layout matching schema (header, body split, footer)
19
+
20
+ **Success Criteria:**
21
+ - [ ] `from recursive_cleaner.tui import TUIRenderer, HAS_RICH` works
22
+ - [ ] `TUIRenderer` can be instantiated without Rich (no crash)
23
+ - [ ] With Rich: `start()` shows layout, `stop()` exits cleanly
24
+ - [ ] Layout has correct sections per data schema
25
+
26
+ **Tests:**
27
+ - [ ] test_tui_import_without_rich
28
+ - [ ] test_tui_renderer_lifecycle
29
+ - [ ] test_tui_layout_structure
30
+
31
+ ## Phase 2: Dynamic Updates
32
+
33
+ **Deliverables:**
34
+ - [ ] `update_chunk()` updates progress bar and counters
35
+ - [ ] `update_llm_status()` shows/hides spinner
36
+ - [ ] `add_function()` appends to function list
37
+ - [ ] `update_metrics()` updates footer stats
38
+
39
+ **Success Criteria:**
40
+ - [ ] Progress bar fills based on chunk_index/total_chunks
41
+ - [ ] Spinner animates when status="calling", stops when "idle"
42
+ - [ ] Functions list grows, shows "+N more" when >5 functions
43
+ - [ ] Metrics panel shows formatted latency and counts
44
+
45
+ **Tests:**
46
+ - [ ] test_progress_updates
47
+ - [ ] test_spinner_states
48
+ - [ ] test_function_list_display
49
+ - [ ] test_metrics_display
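The "+N more" criterion for the functions list can be sketched as a pure formatting helper. This is an illustration under assumptions: `format_function_list` and `max_visible` are hypothetical names, and the default of 5 follows the ">5 functions" wording in the Phase 2 criteria.

```python
def format_function_list(names: list[str], max_visible: int = 5) -> list[str]:
    """Return display lines for the functions panel, truncating long lists."""
    visible = names[:max_visible]
    lines = []
    for i, name in enumerate(visible):
        # The last visible entry gets the closing branch glyph.
        branch = "└─" if i == len(visible) - 1 else "├─"
        lines.append(f"{branch} {name}")
    hidden = len(names) - len(visible)
    if hidden > 0:
        lines.append(f"(+{hidden} more)")
    return lines
```

Returning plain strings keeps the truncation rule testable without a terminal; a renderer could wrap each line in styling afterwards.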
50
+
51
+ ## Phase 3: Integration & Polish
52
+
53
+ **Deliverables:**
54
+ - [ ] `tui=True` parameter on DataCleaner
55
+ - [ ] Integration: TUI updates from cleaner loop
56
+ - [ ] `show_complete()` with summary panel
57
+ - [ ] Fallback warning when Rich not installed
58
+ - [ ] Color transitions (yellow→green on chunk complete)
59
+
60
+ **Success Criteria:**
61
+ - [ ] Full cleaner run with `tui=True` shows live dashboard
62
+ - [ ] Completion shows summary with all stats
63
+ - [ ] `tui=True` without Rich logs warning, uses callbacks
64
+ - [ ] Chunk completion triggers green color flash
65
+
66
+ **Tests:**
67
+ - [ ] test_datacleaner_tui_integration
68
+ - [ ] test_tui_fallback_warning
69
+ - [ ] test_completion_summary
70
+ - [ ] test_color_transitions
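Tests such as `test_tui_import_without_rich` need to simulate an environment where Rich is absent, even on a machine where it is installed. One standard-library way to do that is sketched below; `detect_rich` is a hypothetical stand-in for however the package computes `HAS_RICH`, and the test body is an assumption, not the contents of `tests/test_tui.py`.

```python
import importlib
import sys
from unittest import mock


def detect_rich() -> bool:
    """Return True if Rich can be imported (hypothetical stand-in for HAS_RICH)."""
    try:
        importlib.import_module("rich")
    except ImportError:
        return False
    return True


def test_detect_rich_when_missing() -> None:
    # A None entry in sys.modules makes the import raise ImportError,
    # which simulates an environment without Rich. patch.dict restores
    # the original entry when the block exits.
    with mock.patch.dict(sys.modules, {"rich": None}):
        assert detect_rich() is False
```

Because the patch is scoped to the `with` block, other tests in the same session still see the real Rich installation, if there is one.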