tput 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- tput-0.1.0/PKG-INFO +238 -0
- tput-0.1.0/README.md +224 -0
- tput-0.1.0/pyproject.toml +27 -0
- tput-0.1.0/setup.cfg +4 -0
- tput-0.1.0/tput/__init__.py +1 -0
- tput-0.1.0/tput/core.py +1742 -0
- tput-0.1.0/tput/report.py +817 -0
- tput-0.1.0/tput.egg-info/PKG-INFO +238 -0
- tput-0.1.0/tput.egg-info/SOURCES.txt +10 -0
- tput-0.1.0/tput.egg-info/dependency_links.txt +1 -0
- tput-0.1.0/tput.egg-info/requires.txt +7 -0
- tput-0.1.0/tput.egg-info/top_level.txt +1 -0
tput-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,238 @@
Metadata-Version: 2.4
Name: tput
Version: 0.1.0
Summary: Automated DataFrame audit library for ML pipelines
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: pandas>=1.5
Requires-Dist: scipy>=1.10
Requires-Dist: scikit-learn>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"

# tput — Automated DataFrame Audit for ML

`tput` is a Python library for fast, opinionated data auditing before ML pipelines. Pass it a DataFrame and get a structured report that surfaces everything that matters before you start modelling: missing values, outliers, skewness, type issues, categorical noise, correlations, multicollinearity, and more.

```python
from tput import quick_report

report = quick_report(df, target="SalePrice")
report.show()     # full column-by-column report
report.summary()  # condensed overview for large datasets
```

---

## Installation

```bash
# Not yet on PyPI — clone and install locally
git clone https://github.com/yourname/tput
cd tput
pip install -e .
```

Dependencies: `pandas>=1.5`, `scipy>=1.10`, `scikit-learn>=1.0`

---

## Quickstart

```python
import pandas as pd
from tput import quick_report

df = pd.read_csv("titanic.csv")

# Basic audit
report = quick_report(df)
report.show()

# With target column — unlocks classification/regression analysis
report = quick_report(df, target="Survived")
report.show()

# Condensed view — useful for wide datasets (50+ columns)
report.summary()

# Programmatic access
report.get("nan_analysis")
report.get("target_analysis")
report.warnings
```

---

## What it detects

### Data quality

| Step | What it catches |
|---|---|
| `overview` | Shape, dtypes |
| `visible_missing` | NaN counts and percentages per column |
| `duplicates` | Fully duplicated rows |
| `categorical_profile` | Cardinality, mode, top/bottom values, hidden missing (`""`, `" "`), case collisions (`"Paris"` vs `"paris"`) |
| `type_issues` | String columns that should be numeric or datetime |
| `row_analysis` | Rows with too many missing fields (default: ≥50%) — flagged for removal |
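The hidden-missing and case-collision checks can be reproduced with plain pandas. This is an illustrative sketch, not the library's implementation; the sample values are invented:

```python
import pandas as pd

s = pd.Series(["Paris", "paris", "London", "", " ", None])

# Hidden missing: empty or whitespace-only strings that NaN checks ignore
hidden_missing = s.str.strip().eq("").sum()

# Case collisions: distinct raw values that collapse when lowercased
raw = s.dropna()
raw = raw[raw.str.strip() != ""]
collisions = raw.nunique() - raw.str.lower().nunique()  # "Paris" vs "paris"
```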

### Statistical analysis

| Step | What it catches |
|---|---|
| `numeric_profile` | mean, std, Q1/Q3, min/max, mode |
| `skewness` | Asymmetry level (symmetric / moderate / high) + recommended transform (log1p, sqrt, reflect + log1p, yeo-johnson) |
| `outliers` | IQR on symmetric columns, MAD on skewed ones — auto-selected per column |
| `nan_analysis` | MCAR / MAR / MNAR classification + imputation strategy (mean, median, mode, knn_or_regression). High missing rate (>70%) flagged as potential semantic NaN (e.g. "no pool", "no garage") |
| `feature_quality` | quasi-constant columns, low-cardinality numerics, suspected ID columns |
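The skewness-level / transform pairing in the table can be sketched as follows. The 0.5 and 1.0 cutoffs and the exact level-to-transform mapping are assumptions for illustration, not tput's documented rules:

```python
import numpy as np
from scipy.stats import skew

def recommend_transform(x: np.ndarray) -> tuple:
    """Classify asymmetry and suggest a variance-stabilising transform.
    Thresholds (0.5 / 1.0) are common rules of thumb, assumed here."""
    g = skew(x, nan_policy="omit")
    if abs(g) < 0.5:
        return "symmetric", "none"
    if g > 0:  # right-skewed: compress the long right tail
        level = "moderate" if g < 1.0 else "high"
        return level, "sqrt" if level == "moderate" else "log1p"
    # left-skewed: reflect then log1p, or Yeo-Johnson (handles negatives)
    level = "moderate" if g > -1.0 else "high"
    return level, "reflect + log1p" if level == "high" else "yeo-johnson"

rng = np.random.default_rng(0)
level, transform = recommend_transform(rng.lognormal(size=1000))
```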

### Multicollinearity & associations

| Step | What it catches |
|---|---|
| `correlations` | Pearson pairs above threshold (default: 0.85) + Chi² / Cramér's V (bias-corrected) for categorical pairs |
| `vif` | Variance Inflation Factor per continuous column — flags multicollinearity for linear models |

### Target analysis (`target=` parameter)

When a target column is specified, `tput` automatically detects classification vs regression and runs:

**Classification:**
- Class balance and imbalance detection (minority class < 20% → warning)
- Feature correlation ranking (Cramér's V for categorical, point-biserial for numeric)
- Leakage detection (correlation > 0.95 with target)

**Regression:**
- Target skewness + recommended transform
- Outliers in the target (direct impact on loss function)
- Pearson correlation ranking for all numeric features
- Leakage detection
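For the numeric/regression case, the leakage check amounts to a correlation scan against the target. A minimal sketch (the helper name is invented; the 0.95 threshold is the one quoted above):

```python
import numpy as np
import pandas as pd

def leakage_candidates(df: pd.DataFrame, target: str, threshold: float = 0.95) -> list:
    """Flag numeric features whose |Pearson r| with the target exceeds the threshold."""
    corr = df.select_dtypes("number").corrwith(df[target]).drop(target)
    return corr[corr.abs() > threshold].index.tolist()

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": np.arange(100.0), "noise": rng.normal(size=100)})
df["y"] = 2 * df["x"] + 1  # y is a deterministic function of x: a leak
leaks = leakage_candidates(df, target="y")
```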

---

## Display modes

### `report.show()`

Column-by-column view — every column gets a block with all its properties stacked:

```
--- COLUMN: Age [float64] ---
nulls          : 177 (19.87%)
mean / median  : 29.70 / 28.0 | std: 14.53
skewness       : 0.389 (symmetric)
outliers       : 11 (1.54%) [method=IQR bounds=-6.69, 64.81]
nan_analysis   : MAR -> impute (knn_or_regression)
correlated_with: Pclass (r=0.173), Parch (r=-0.124)
target_corr    : point_biserial_r=0.077 with 'Survived'
```

### `report.summary()`

Condensed grouped view — designed for datasets with 50+ columns where `show()` would be overwhelming:

```
=== TPUT SUMMARY ===
Shape : 1460 rows x 81 columns

ISSUES DETECTED:
missing values  : 19 columns affected (6 proposed drop, 13 impute, 11 MAR)
skewness        : 34 high, 12 moderate
outliers        : 28 columns, 1842 values total
feature quality : 9 quasi-constant, 18 low_cardinality, 1 potential_id
vif (redundant) : 8 columns
correlations    : 1 numeric pair, 182 categorical pairs (showing top 10 in warnings)

Total warnings : 156 (use report.show() for full detail)
```

---

## Parameters

```python
quick_report(
    df,

    # Steps — toggle any on/off
    overview=True,
    visible_missing=True,
    duplicates=True,
    categorical_profile=True,
    type_issues=True,
    numeric_profile=True,
    skewness=True,
    outliers=True,
    nan_analysis=True,
    correlations=True,
    feature_quality=True,
    row_analysis=True,
    vif=True,

    # Target column — enables target_analysis
    target=None,                    # e.g. target="SalePrice" or target="Survived"

    # Thresholds
    nan_drop_threshold=0.30,        # drop column if missing rate > 30%
    correlation_threshold=0.85,     # Pearson |r| threshold for numeric pairs
    cramers_v_threshold=0.25,       # Cramér's V threshold for categorical pairs
    max_correlation_warnings=10,    # cap categorical warnings, show top N by V
    row_drop_threshold=0.50,        # flag rows with >= 50% NaN
    apply_row_filter=True,          # run outliers/nan/correlations on filtered df
    quasi_constant_threshold=0.95,  # flag column if dominant value > 95%
    low_cardinality_max_unique=10,  # numeric columns with <= N values → categorical
    vif_threshold=10.0,             # VIF threshold for multicollinearity warning

    # Display
    feature_display=True,           # True = column-by-column view, False = step-by-step
)
```

---

## Programmatic access

```python
# Access any step result directly
report.get("outliers")
report.get("nan_analysis")
report.get("skewness")
report.get("correlations")     # includes high_pairs and categorical_associations
report.get("target_analysis") # classification or regression breakdown

# All warnings as a list
report.warnings

# Drop rows flagged by row_analysis
drop_idx = report.get("row_analysis")["rows_to_drop_idx"]
df_clean = df.drop(drop_idx)

# Full categorical association list (bypasses max_correlation_warnings cap)
all_cat_pairs = report.get("correlations")["categorical_associations"]

# Feature ranking by correlation with target
feature_corr = report.get("target_analysis")["feature_correlations"]
```

---

## Design decisions

**IQR vs MAD** — outlier method is chosen per column based on skewness. IQR is appropriate for symmetric distributions; MAD (Median Absolute Deviation) is more robust on skewed data, where extreme values distort the mean and the IQR bounds.
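A minimal sketch of this per-column selection. The 0.5 skewness cutoff and the 3.5 robust z-score threshold are illustrative assumptions, not tput's exact constants:

```python
import numpy as np
from scipy.stats import skew

def outlier_mask(x: np.ndarray, skew_cutoff: float = 0.5) -> tuple:
    """IQR fences on roughly symmetric data, MAD-based robust z-scores otherwise."""
    if abs(skew(x)) < skew_cutoff:
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        return "IQR", (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    robust_z = 0.6745 * (x - med) / mad  # 0.6745 scales MAD to sigma under normality
    return "MAD", np.abs(robust_z) > 3.5

rng = np.random.default_rng(0)
method, _ = outlier_mask(rng.normal(size=1000))            # symmetric -> IQR
method_skewed, _ = outlier_mask(rng.lognormal(size=1000))  # right-skewed -> MAD
```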

**Bias-corrected Cramér's V** — categorical associations use the Bergsma (2013) bias correction, which removes the positive bias of the standard formula on small samples and low-cardinality columns. High-cardinality columns (>50% unique values) are excluded from Chi² computation as they produce artificially inflated V values.
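The corrected statistic can be computed from a contingency table like this — a sketch of the Bergsma correction, not tput's code:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_corrected(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V with the Bergsma (2013) small-sample bias correction."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    # Corrected phi^2, rows, and columns
    phi2 = max(0.0, chi2 / n - (r - 1) * (c - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    c_corr = c - (c - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2 / min(r_corr - 1, c_corr - 1)))

rng = np.random.default_rng(0)
a = pd.Series(rng.choice(list("AB"), size=500))
b = a.where(rng.random(500) < 0.9, "B")  # b copies a 90% of the time: strong association
v_strong = cramers_v_corrected(a, b)
v_none = cramers_v_corrected(a, pd.Series(rng.choice(list("XY"), size=500)))
```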

**VIF via sklearn** — implemented without `statsmodels` using `LinearRegression` from sklearn. Only continuous columns (nunique > 10) are included. The target column is automatically excluded when `target=` is specified.
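A sketch of the same idea, using VIF_j = 1 / (1 − R²_j) with each column regressed on all the others (tput caps the value at 10000 on singular matrices; here a plain `inf` stands in for that cap):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance Inflation Factor per column, via sklearn instead of statsmodels."""
    out = {}
    for col in df.columns:
        X = df.drop(columns=col)
        r2 = LinearRegression().fit(X, df[col]).score(X, df[col])
        out[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(out)

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
df["c"] = df["a"] + 0.1 * rng.normal(size=200)  # c nearly duplicates a
scores = vif(df)
```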

**Row filter** — `row_analysis` identifies sparse rows, then `outliers`, `nan_analysis`, `correlations`, and `vif` all run on the filtered DataFrame. This prevents sparse rows from distorting column-level statistics. The raw DataFrame is always used for the upstream steps (`overview`, `visible_missing`, `duplicates`, `categorical_profile`, `type_issues`, `numeric_profile`, `skewness`, `row_analysis` itself).
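The filter itself reduces to a row-wise NaN-ratio mask; a sketch with the default 50% threshold:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, np.nan, np.nan, 5.0],
    "c": [7.0, np.nan, 9.0, 6.0],
})

# Rows where >= 50% of the fields are missing (row_drop_threshold default)
sparse = df.isna().mean(axis=1) >= 0.50
df_filtered = df[~sparse]  # downstream steps would run on this frame
```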

**Semantic NaN detection** — columns with >70% missing rate that are proposed for dropping receive an additional note: the high missing rate may encode absence of a feature (e.g. `PoolQC` missing = no pool) rather than truly missing data. Replacing NaN with a `"None"` category is often preferable to dropping.
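In pandas that replacement is a one-liner; the column values below are invented for illustration:

```python
import pandas as pd

pool_qc = pd.Series(["Gd", None, None, None, None], name="PoolQC")

missing_rate = pool_qc.isna().mean()  # 0.8: above the 70% semantic-NaN threshold
if missing_rate > 0.70:
    # NaN here likely means "no pool": encode it as an explicit category
    pool_qc = pool_qc.fillna("None")
```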

**Target exclusion** — when `target=` is specified, the target column is automatically excluded from inter-feature correlation computation (both Pearson and Cramér's V). Feature-target relationships are computed separately in `target_analysis`.

---

## Known limitations

- **No domain knowledge.** A latitude of -78° or a weight of 1100 kg may be perfectly valid. All flagging is statistical.
- **Skewness on small samples.** Columns with <30 non-null values produce unreliable skewness estimates.
- **Outlier detection assumes unimodal distributions.** IQR and MAD both fail on bimodal distributions (e.g. a mixed-species weight column).
- **MNAR is a hypothesis.** The library flags it when missingness is high but unexplained — proving MNAR requires domain knowledge.
- **VIF assumes no perfect multicollinearity.** If the matrix is singular (e.g. `BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF = TotalBsmtSF` exactly), VIF is capped at 10000 to avoid division-by-zero. This is a signal, not a bug.
tput-0.1.0/README.md
ADDED
@@ -0,0 +1,224 @@
(content identical to the Markdown long description embedded in PKG-INFO above)
tput-0.1.0/pyproject.toml
ADDED
@@ -0,0 +1,27 @@
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "tput"
version = "0.1.0"
description = "Automated DataFrame audit library for ML pipelines"
readme = "README.md"
license = { text = "MIT" }
requires-python = ">=3.9"

dependencies = [
    "pandas>=1.5",
    "scipy>=1.10",
    "scikit-learn>=1.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "pytest-cov",
]

[tool.setuptools.packages.find]
where = ["."]
include = ["tput*"]
tput-0.1.0/tput/__init__.py
ADDED
@@ -0,0 +1 @@
from .core import quick_report