tput 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
tput-0.1.0/PKG-INFO ADDED
Metadata-Version: 2.4
Name: tput
Version: 0.1.0
Summary: Automated DataFrame audit library for ML pipelines
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: pandas>=1.5
Requires-Dist: scipy>=1.10
Requires-Dist: scikit-learn>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"

# tput — Automated DataFrame Audit for ML

`tput` is a Python library for fast, opinionated data auditing before ML pipelines. Pass it a DataFrame and get a structured report that surfaces everything that matters before you start modelling: missing values, outliers, skewness, type issues, categorical noise, correlations, multicollinearity, and more.

```python
from tput import quick_report

report = quick_report(df, target="SalePrice")
report.show()     # full column-by-column report
report.summary()  # condensed overview for large datasets
```

---

## Installation

```bash
# Not yet on PyPI — clone and install locally
git clone https://github.com/yourname/tput
cd tput
pip install -e .
```

Dependencies: `pandas>=1.5`, `scipy>=1.10`, `scikit-learn>=1.0`

---

## Quickstart

```python
import pandas as pd
from tput import quick_report

df = pd.read_csv("titanic.csv")

# Basic audit
report = quick_report(df)
report.show()

# With a target column — unlocks classification/regression analysis
report = quick_report(df, target="Survived")
report.show()

# Condensed view — useful for wide datasets (50+ columns)
report.summary()

# Programmatic access
report.get("nan_analysis")
report.get("target_analysis")
report.warnings
```

---

## What it detects

### Data quality
| Step | What it catches |
|---|---|
| `overview` | Shape, dtypes |
| `visible_missing` | NaN counts and percentages per column |
| `duplicates` | Fully duplicated rows |
| `categorical_profile` | Cardinality, mode, top/bottom values, hidden missing (`""`, `" "`), case collisions (`"Paris"` vs `"paris"`) |
| `type_issues` | String columns that should be numeric or datetime |
| `row_analysis` | Rows with too many missing fields (default: ≥50%) — flagged for removal |

### Statistical analysis
| Step | What it catches |
|---|---|
| `numeric_profile` | mean, std, Q1/Q3, min/max, mode |
| `skewness` | Asymmetry level (symmetric / moderate / high) + recommended transform (log1p, sqrt, reflect + log1p, yeo-johnson) |
| `outliers` | IQR on symmetric columns, MAD on skewed ones — auto-selected per column |
| `nan_analysis` | MCAR / MAR / MNAR classification + imputation strategy (mean, median, mode, knn_or_regression). High missing rate (>70%) flagged as potential semantic NaN (e.g. "no pool", "no garage") |
| `feature_quality` | quasi-constant columns, low-cardinality numerics, suspected ID columns |

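The transform recommendation in the `skewness` step can be pictured as a simple threshold rule. The sketch below is illustrative only: the function name `recommend_transform` and the 0.5 / 1.0 skewness cut-offs are assumptions, not the library's code; only the transform names come from the table above.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def recommend_transform(s: pd.Series, moderate: float = 0.5, high: float = 1.0) -> str:
    """Map a column's sample skewness to one of the documented transforms."""
    x = s.dropna().to_numpy(dtype=float)
    g = skew(x)
    if abs(g) < moderate:
        return "none (symmetric)"
    if g >= high:
        # log1p needs non-negative data; otherwise fall back to yeo-johnson
        return "log1p" if x.min() >= 0 else "yeo-johnson"
    if g <= -high:
        # left-skewed: reflect around the max, then log1p
        return "reflect + log1p"
    return "sqrt" if x.min() >= 0 else "yeo-johnson"
```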
### Multicollinearity & associations
| Step | What it catches |
|---|---|
| `correlations` | Pearson pairs above threshold (default: 0.85) + Chi² / Cramér's V (bias-corrected) for categorical pairs |
| `vif` | Variance Inflation Factor per continuous column — flags multicollinearity for linear models |

### Target analysis (`target=` parameter)
When a target column is specified, `tput` automatically detects classification vs regression and runs:

**Classification:**
- Class balance and imbalance detection (minority class < 20% → warning)
- Feature correlation ranking (Cramér's V for categorical, point-biserial for numeric)
- Leakage detection (correlation > 0.95 with target)

**Regression:**
- Target skewness + recommended transform
- Outliers in the target (direct impact on the loss function)
- Pearson correlation ranking for all numeric features
- Leakage detection

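The leakage check for a binary target can be sketched with `scipy.stats.pointbiserialr` and the 0.95 threshold stated above. This is a hypothetical helper (`flag_leaky_features` is not part of the library's API), shown only to make the rule concrete:

```python
import pandas as pd
from scipy.stats import pointbiserialr

def flag_leaky_features(df: pd.DataFrame, target: str, threshold: float = 0.95) -> list:
    """Flag numeric features whose |point-biserial r| with a binary
    target exceeds the leakage threshold."""
    leaky = []
    for col in df.select_dtypes(include="number").columns:
        if col == target:
            continue
        pair = df[[col, target]].dropna()
        r, _ = pointbiserialr(pair[target], pair[col])  # (binary, continuous)
        if abs(r) > threshold:
            leaky.append(col)
    return leaky
```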
---

## Display modes

### `report.show()`
Column-by-column view — every column gets a block with all its properties stacked:

```
--- COLUMN: Age [float64] ---
nulls          : 177 (19.87%)
mean / median  : 29.70 / 28.0 | std: 14.53
skewness       : 0.389 (symmetric)
outliers       : 11 (1.54%) [method=IQR bounds=-6.69, 64.81]
nan_analysis   : MAR -> impute (knn_or_regression)
correlated_with: Pclass (r=0.173), Parch (r=-0.124)
target_corr    : point_biserial_r=0.077 with 'Survived'
```

### `report.summary()`
Condensed grouped view — designed for datasets with 50+ columns where `show()` would be overwhelming:

```
=== TPUT SUMMARY ===
Shape : 1460 rows x 81 columns

ISSUES DETECTED:
missing values  : 19 columns affected (6 proposed drop, 13 impute, 11 MAR)
skewness        : 34 high, 12 moderate
outliers        : 28 columns, 1842 values total
feature quality : 9 quasi-constant, 18 low_cardinality, 1 potential_id
vif (redundant) : 8 columns
correlations    : 1 numeric pair, 182 categorical pairs (showing top 10 in warnings)

Total warnings : 156 (use report.show() for full detail)
```

---

## Parameters

```python
quick_report(
    df,

    # Steps — toggle any on/off
    overview=True,
    visible_missing=True,
    duplicates=True,
    categorical_profile=True,
    type_issues=True,
    numeric_profile=True,
    skewness=True,
    outliers=True,
    nan_analysis=True,
    correlations=True,
    feature_quality=True,
    row_analysis=True,
    vif=True,

    # Target column — enables target_analysis
    target=None,                    # e.g. target="SalePrice" or target="Survived"

    # Thresholds
    nan_drop_threshold=0.30,        # drop column if missing rate > 30%
    correlation_threshold=0.85,     # Pearson |r| threshold for numeric pairs
    cramers_v_threshold=0.25,       # Cramér's V threshold for categorical pairs
    max_correlation_warnings=10,    # cap categorical warnings, show top N by V
    row_drop_threshold=0.50,        # flag rows with >= 50% NaN
    apply_row_filter=True,          # run outliers/nan/correlations on filtered df
    quasi_constant_threshold=0.95,  # flag column if dominant value > 95%
    low_cardinality_max_unique=10,  # numeric columns with <= N values → categorical
    vif_threshold=10.0,             # VIF threshold for multicollinearity warning

    # Display
    feature_display=True,           # True = column-by-column view, False = step-by-step
)
```

---

## Programmatic access

```python
# Access any step result directly
report.get("outliers")
report.get("nan_analysis")
report.get("skewness")
report.get("correlations")     # includes high_pairs and categorical_associations
report.get("target_analysis")  # classification or regression breakdown

# All warnings as a list
report.warnings

# Drop rows flagged by row_analysis
drop_idx = report.get("row_analysis")["rows_to_drop_idx"]
df_clean = df.drop(drop_idx)

# Full categorical association list (bypasses the max_correlation_warnings cap)
all_cat_pairs = report.get("correlations")["categorical_associations"]

# Feature ranking by correlation with target
feature_corr = report.get("target_analysis")["feature_correlations"]
```

---

## Design decisions

**IQR vs MAD** — the outlier method is chosen per column based on skewness. IQR is appropriate for symmetric distributions; MAD (Median Absolute Deviation) is more robust on skewed data, where extreme values distort the mean and the IQR bounds.

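A minimal sketch of MAD-based flagging, assuming the common modified z-score convention (the 0.6745 constant and 3.5 cutoff are the usual Iglewicz-Hoaglin defaults, not necessarily the library's exact bounds):

```python
import numpy as np

def mad_outliers(x, cutoff: float = 3.5) -> np.ndarray:
    """Boolean mask of outliers via the modified z-score:
    0.6745 * (x - median) / MAD, flagging |z| > cutoff."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        # degenerate column with no spread: nothing to flag
        return np.zeros(x.shape, dtype=bool)
    z = 0.6745 * (x - med) / mad
    return np.abs(z) > cutoff
```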
**Bias-corrected Cramér's V** — categorical associations use the Bergsma (2013) bias correction, which removes the positive bias of the standard formula on small samples and low-cardinality columns. High-cardinality columns (>50% unique values) are excluded from the Chi² computation, as they produce artificially inflated V values.

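The bias-corrected statistic follows a standard recipe: compute φ² from the Chi² of the contingency table, then shrink φ² and the effective row/column counts before taking the square root. A sketch under that assumption (the library's exact implementation may differ):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_corrected(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series, with the
    Bergsma bias correction applied to phi² and the table dimensions."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    phi2 = chi2 / n
    # bias-corrected phi², row count, and column count
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2_corr / min(r_corr - 1, k_corr - 1)))
```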
**VIF via sklearn** — implemented with `LinearRegression` from sklearn rather than `statsmodels`. Only continuous columns (nunique > 10) are included. The target column is automatically excluded when `target=` is specified.

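The underlying identity is VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing column j on all the other columns. A self-contained sketch of that computation, reusing the 10000 cap mentioned under known limitations (the helper name and NaN-free assumption are mine, not the library's):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_scores(df: pd.DataFrame, cap: float = 10_000.0) -> dict:
    """VIF per column via sklearn: regress each column on the rest
    and apply VIF = 1 / (1 - R²). Assumes a numeric, NaN-free frame."""
    scores = {}
    for col in df.columns:
        X = df.drop(columns=col).to_numpy()
        y = df[col].to_numpy()
        r2 = LinearRegression().fit(X, y).score(X, y)
        # near-perfect fit means a singular design: cap instead of dividing by ~0
        scores[col] = cap if r2 >= 1 - 1 / cap else 1.0 / (1.0 - r2)
    return scores
```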
**Row filter** — `row_analysis` identifies sparse rows, then `outliers`, `nan_analysis`, `correlations`, and `vif` all run on the filtered DataFrame. This prevents sparse rows from distorting column-level statistics. The raw DataFrame is always used for the upstream steps (`overview`, `visible_missing`, `duplicates`, `categorical_profile`, `type_issues`, `numeric_profile`, `skewness`, and `row_analysis` itself).

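The filter itself amounts to a per-row missing fraction in pandas. A sketch using the documented 50% default (`filter_sparse_rows` is an illustrative name, not the library's API):

```python
import pandas as pd

def filter_sparse_rows(df: pd.DataFrame, row_drop_threshold: float = 0.50) -> pd.DataFrame:
    """Keep only rows whose fraction of missing fields is below the threshold."""
    missing_frac = df.isna().mean(axis=1)  # per-row fraction of NaN cells
    return df.loc[missing_frac < row_drop_threshold]
```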
**Semantic NaN detection** — columns with a >70% missing rate that are proposed for dropping receive an additional note: the high missing rate may encode the absence of a feature (e.g. `PoolQC` missing = no pool) rather than truly missing data. Replacing NaN with a `"None"` category is often preferable to dropping the column.

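Acting on that note is a one-liner in pandas, using the `PoolQC` example above:

```python
import pandas as pd

df = pd.DataFrame({"PoolQC": ["Ex", None, None, "Gd", None]})
# Treat missing PoolQC as "no pool" instead of dropping the column
df["PoolQC"] = df["PoolQC"].fillna("None")
```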
**Target exclusion** — when `target=` is specified, the target column is automatically excluded from inter-feature correlation computation (both Pearson and Cramér's V). Feature-target relationships are computed separately in `target_analysis`.

---

## Known limitations

- **No domain knowledge.** A latitude of -78° or a weight of 1100 kg may be perfectly valid. All flagging is statistical.
- **Skewness on small samples.** Columns with <30 non-null values produce unreliable skewness estimates.
- **Outlier detection assumes unimodal distributions.** IQR and MAD both fail on bimodal distributions (e.g. a mixed-species weight column).
- **MNAR is a hypothesis.** The library flags it when missingness is high but unexplained — proving MNAR requires domain knowledge.
- **VIF assumes no perfect multicollinearity.** If the matrix is singular (e.g. `BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF = TotalBsmtSF` exactly), VIF is capped at 10000 to avoid division by zero. This is a signal, not a bug.
tput-0.1.0/README.md ADDED
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "tput"
version = "0.1.0"
description = "Automated DataFrame audit library for ML pipelines"
readme = "README.md"
license = { text = "MIT" }
requires-python = ">=3.9"

dependencies = [
    "pandas>=1.5",
    "scipy>=1.10",
    "scikit-learn>=1.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "pytest-cov",
]

[tool.setuptools.packages.find]
where = ["."]
include = ["tput*"]
tput-0.1.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
from .core import quick_report