datadiagnose-1.0.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Nilotpal Dhar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,397 @@
Metadata-Version: 2.4
Name: datadiagnose
Version: 1.0.0
Summary: Dataset Auto-Diagnosis Python Library — find and fix data problems before model training.
Author-email: Nilotpal Dhar <nilotpaldhar@example.com>
Maintainer-email: Nilotpal Dhar <nilotpaldhar@example.com>
License: MIT
Project-URL: Homepage, https://github.com/nilotpaldhar2004/datadiagnose
Project-URL: Repository, https://github.com/nilotpaldhar2004/datadiagnose
Project-URL: Documentation, https://github.com/nilotpaldhar2004/datadiagnose/blob/main/README.md
Project-URL: Bug Tracker, https://github.com/nilotpaldhar2004/datadiagnose/issues
Project-URL: Changelog, https://github.com/nilotpaldhar2004/datadiagnose/blob/main/CHANGELOG.md
Keywords: data science,machine learning,dataset,data quality,data cleaning,eda,exploratory data analysis,missing values,outliers,data leakage,class imbalance,python,beginner
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Education
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Typing :: Typed
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Provides-Extra: data
Requires-Dist: pandas>=1.3; extra == "data"
Requires-Dist: numpy>=1.21; extra == "data"
Provides-Extra: all
Requires-Dist: pytest>=7.0; extra == "all"
Requires-Dist: pytest-cov>=4.0; extra == "all"
Requires-Dist: flake8>=6.0; extra == "all"
Requires-Dist: pandas>=1.3; extra == "all"
Requires-Dist: numpy>=1.21; extra == "all"
Dynamic: license-file

# DataDiagnose

**A Python library that looks at your dataset and tells you exactly what is wrong with it.**

[![Tests](https://github.com/nilotpaldhar/datadiagnose/actions/workflows/tests.yml/badge.svg)](https://github.com/nilotpaldhar/datadiagnose/actions)
[![Python](https://img.shields.io/badge/python-3.7%2B-blue)](https://www.python.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Version](https://img.shields.io/badge/version-1.0.0-orange)](https://github.com/nilotpaldhar/datadiagnose)
[![Dependencies](https://img.shields.io/badge/dependencies-none-brightgreen)](pyproject.toml)

---

## The Problem This Solves

Every beginner data scientist goes through the same painful experience. You download a dataset, you are excited to build your first model, you train it — and the results are terrible. 50% accuracy. Predictions that make no sense. Hours of confusion.

In most cases the model is not the problem. **The data is broken.** There are missing values pulling your model in the wrong direction. An outlier like an age of 900 years is destroying your statistics. Your target column has 95% of rows saying "no", so your model just learned to say "no" for everything.

Experienced data scientists know to check for these things before touching a model. They run a full diagnostic on the dataset first. **DataDiagnose automates that diagnostic in one function call.**

---

## What It Does

Give DataDiagnose any dataset and it returns:

- A **health score** from 0 to 100 showing how clean your data is
- A list of every **problem detected**, with severity (CRITICAL / HIGH / MEDIUM / LOW)
- A specific **fix suggestion** for every problem
- **Model recommendations** based on your data characteristics
- **Feature engineering hints** based on your column names

Eight problems are detected automatically:

| Problem | What It Means |
|---|---|
| 🕳️ Missing Values | Null, empty, or NaN values in any column |
| 🎯 Outliers | Extreme values detected by IQR and Z-score methods |
| 📐 Skewness | Lopsided distributions that hurt linear models |
| ⚖️ Class Imbalance | One class vastly outnumbering others in your target |
| 🚨 Data Leakage | Columns that secretly contain the answer |
| 🔁 Duplicate Rows | Identical rows that bias your model |
| 📊 Constant Columns | Columns with zero variation — zero information |
| 🃏 High Cardinality | ID-like columns with almost all unique values |
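
To make the table concrete, here is how two of the simplest of these checks could be sketched in plain Python. This is an illustration only; the function names and the 0.95 threshold are assumptions, not datadiagnose's actual implementation:

```python
# Illustrative sketches of two detectors (not datadiagnose's actual code).

def is_constant(column):
    """A column with at most one distinct non-null value carries no information."""
    distinct = {v for v in column if v is not None}
    return len(distinct) <= 1

def is_high_cardinality(column, threshold=0.95):
    """ID-like columns: nearly every value is unique."""
    return len(set(column)) / len(column) >= threshold

print(is_constant(["a", "a", None, "a"]))    # -> True
print(is_high_cardinality([1, 2, 3, 4, 5]))  # -> True
```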

---

## Installation

DataDiagnose has **zero external dependencies**. It uses only the Python standard library: no pandas, no numpy, no scikit-learn required.

```bash
# Once published on PyPI
pip install datadiagnose
```

For now, copy the `datadiagnose/` package folder into your project and import it directly.

---

## Quick Start

```python
from datadiagnose import diagnose

dataset = {
    "age": [25, 30, None, 22, 900, 28],
    "income": [50000, 60000, None, 48000, 52000, 61000],
    "city": ["KOL", "MUM", "KOL", "DEL", "KOL", "KOL"],
    "target": [1, 0, 1, 0, 1, 0],
}

report = diagnose(dataset, target_col="target")
print(report)
```

Output:

```
==============================================================
DATADIAGNOSE REPORT — DATASET
==============================================================
Rows    : 6
Columns : 4
Score   : 69/100 ⚠️ Needs Work
--------------------------------------------------------------
🔍 Issues Found (2)

1. 🟡 MEDIUM
   Missing Values in 'age'
   → 16.7% of values are missing.
   💡 Fix: Fill 'age' with median (numeric) or mode (categorical).

2. 🟡 MEDIUM
   Missing Values in 'income'
   → 16.7% of values are missing.
   💡 Fix: Fill 'income' with median (numeric) or mode (categorical).
...
```

---

## Works With Pandas Too

DataDiagnose is not a pandas replacement — it works alongside it. Convert your DataFrame in one line:

```python
import pandas as pd
from datadiagnose import diagnose

df = pd.read_csv("my_data.csv")
report = diagnose(df.to_dict(orient="list"), target_col="target")

print(f"Health score: {report.score}/100")
```

---

## Full API

### `diagnose(dataset, target_col=None, dataset_name="dataset")`

The main function. Runs all eight detectors and returns a `DiagnosisReport`.

```python
report = diagnose(dataset, target_col="label", dataset_name="Titanic")

report.score           # int — health score 0-100
report.issues          # list of Issue objects
report.suggestions     # list of fix strings
report.model_types     # list of recommended model names
report.column_reports  # dict of per-column statistics
```

### `quick_scan(dataset, target_col=None)`

One-liner that runs the diagnosis and immediately prints the report.

```python
quick_scan(dataset, target_col="label")
```

### `health_score(dataset, target_col=None)`

Returns only the integer score. Perfect for quality gates in automated pipelines.

```python
score = health_score(dataset, target_col="label")

if score < 70:
    raise ValueError(f"Data quality too low: {score}/100. Fix issues first.")
```

### `list_issues(dataset, target_col=None)`

Returns a concise list of `(severity, title)` tuples.

```python
for severity, title in list_issues(dataset, "label"):
    print(f"[{severity}] {title}")
```

### `get_suggestions(dataset, target_col=None)`

Returns only the actionable fix suggestions as strings.

```python
for tip in get_suggestions(dataset, "label"):
    print("-", tip)
```

### `column_summary(dataset, col_name, target_col=None)`

Deep-dives into one specific column.

```python
rep = column_summary(dataset, "age")
print(rep.details)
# {'type': 'numeric', 'mean': '27.4', 'std': '3.2', ...}
```

---

## Understanding the Health Score

Every dataset starts at 100. Each detected issue deducts points based on severity:

| Severity | Points Lost | Example |
|---|---|---|
| 🔴 CRITICAL | 25 | Data leakage, >60% missing values |
| 🟠 HIGH | 15 | >30% missing, severe class imbalance |
| 🟡 MEDIUM | 8 | Moderate skewness, some outliers |
| ⚪ LOW | 3 | A few minor outliers |

| Score | Status | What To Do |
|---|---|---|
| 80 – 100 | ✅ Healthy | Data is ready for modelling |
| 50 – 79 | ⚠️ Needs Work | Fix HIGH and CRITICAL issues first |
| 0 – 49 | ❌ Critical | Do not train models yet |
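
The deduction scheme above can be sketched in a few lines. This is illustrative only: the penalty values come from the table, but the real scoring logic in `core.py` may differ in its details:

```python
# Illustrative sketch of a severity-based deduction score
# (penalty values taken from the table above; not the library's actual code).
PENALTY = {"CRITICAL": 25, "HIGH": 15, "MEDIUM": 8, "LOW": 3}

def sketch_score(severities):
    """Start at 100, subtract a penalty per issue, and never go below 0."""
    return max(0, 100 - sum(PENALTY[s] for s in severities))

print(sketch_score(["MEDIUM", "MEDIUM"]))  # -> 84
print(sketch_score(["CRITICAL"] * 5))      # -> 0 (clamped)
```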

---

## Project Structure

```
datadiagnose/

├── datadiagnose/                ← The Python package
│   ├── __init__.py              ← Public API
│   ├── core.py                  ← Main diagnose() engine
│   ├── detectors.py             ← All 8 detector functions
│   ├── models.py                ← DiagnosisReport, Issue, ColumnReport classes
│   └── utils.py                 ← Pure math helpers (no dependencies)

├── tests/
│   ├── sample_data.py           ← 11 sample datasets with known problems
│   ├── test_detectors.py        ← 60 unit tests for each detector
│   └── test_core.py             ← 80 integration tests for the full API

├── examples/
│   ├── basic_usage.py           ← Start here — every function shown
│   ├── pandas_integration.py    ← How to use with pandas DataFrames
│   └── student_dataset_demo.py  ← Full workflow, step by step

├── docs/
│   └── DataDiagnose_Documentation.pdf

├── .github/workflows/tests.yml  ← Auto-run tests on every push
├── .gitignore
├── LICENSE
├── README.md
└── pyproject.toml
```

---

## Running the Tests

DataDiagnose has 140 tests covering every detector, every public function, and edge cases. Tests use only Python's built-in `unittest` — no pytest required (though pytest works too).

```bash
# Run all tests
python -m unittest discover -s tests -v

# Run just the detector tests
python -m unittest tests.test_detectors -v

# Run just the core API tests
python -m unittest tests.test_core -v
```

All 140 tests should pass, with output ending:

```
----------------------------------------------------------------------
Ran 140 tests in 0.08s

OK
```

---

## Running the Examples

```bash
# Simplest introduction — run this first
python examples/basic_usage.py

# How to use with pandas DataFrames
python examples/pandas_integration.py

# A full realistic data cleaning workflow
python examples/student_dataset_demo.py
```

---

## Why Zero Dependencies?

DataDiagnose uses only Python's built-in `math`, `statistics`, and `collections` modules. This was a deliberate decision:

1. **Works everywhere** — any Python 3.7+ environment, no pip install needed beyond the library itself
2. **No version conflicts** — adding numpy or pandas as dependencies would create compatibility issues for people who already have specific versions installed
3. **Educational** — every algorithm (IQR, Pearson correlation, skewness) is implemented from scratch in readable Python, so you can read the code and learn exactly how it works
4. **Lightweight** — the entire library is five Python files totalling around 1000 lines
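
As a taste of what "implemented from scratch" looks like, an IQR outlier check needs nothing beyond the `statistics` module. This is a simplified sketch, not the code in `detectors.py`; note that `statistics.quantiles` itself requires Python 3.8+:

```python
# Simplified IQR outlier check using only the standard library.
# Illustrative sketch only; statistics.quantiles needs Python 3.8+.
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([25, 30, 22, 900, 28, 27]))  # -> [900]
```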

---

## Design Decisions

**Why does it only detect problems and not fix them automatically?**

Automatically fixing data without human judgment is dangerous. Filling missing values with the wrong strategy can make your model *worse*. Whether you should drop a column or impute it, and what value to impute with, depends on domain knowledge — your understanding of what the data means. DataDiagnose gives you the information and a recommendation; the decision is yours.

**Why a dict-of-lists and not a DataFrame?**

Accepting a plain Python dict means the library works with no dependencies at all. If you have a pandas DataFrame, converting it takes one line: `df.to_dict(orient="list")`. Supporting DataFrames directly would require adding pandas as a dependency, which would defeat the zero-dependency design.

---

## How to Contribute

Contributions are welcome. Here are some ideas from the roadmap:

- HTML report export (generate a self-contained HTML file with charts)
- Correlation matrix analysis (detect multicollinearity between features)
- Direct pandas DataFrame support without conversion
- Web dashboard (Flask/FastAPI endpoint to upload a CSV and get a diagnosis)

To contribute:

1. Fork the repository on GitHub
2. Create a branch: `git checkout -b feature/my-new-detector`
3. Write your code and tests
4. Make sure all 140 existing tests still pass
5. Open a pull request with a clear description

---

## Changelog

### v1.0.0 — Initial Release

- Eight detectors: missing values, outliers, skewness, class imbalance, data leakage, duplicate rows, constant columns, high cardinality
- Feature engineering hints based on column name patterns
- Model recommendation engine
- Health score system (0–100)
- Full public API: `diagnose`, `quick_scan`, `health_score`, `list_issues`, `get_suggestions`, `column_summary`
- 140 unit and integration tests
- Zero external dependencies

---

## License

This project is licensed under the **MIT License** — see the [LICENSE](LICENSE) file for the full text.

In plain English: you can use this code for anything, including commercial projects, as long as you keep the copyright notice with my name in any copy you distribute.

Copyright (c) 2026 **Nilotpal Dhar**

---

## Author

**Nilotpal Dhar**

Built as a beginner Python project to learn how data science diagnostics work from first principles. Every algorithm in this library — IQR outlier detection, Pearson correlation, skewness calculation — is implemented from scratch in plain Python so that reading the code teaches you how the maths actually works.

If this library helped you, star the repository on GitHub. If you found a bug or have a feature idea, open an issue.