datadiagnose-1.0.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Nilotpal Dhar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,397 @@
Metadata-Version: 2.4
Name: datadiagnose
Version: 1.0.0
Summary: Dataset Auto-Diagnosis Python Library — find and fix data problems before model training.
Author-email: Nilotpal Dhar <nilotpaldhar@example.com>
Maintainer-email: Nilotpal Dhar <nilotpaldhar@example.com>
License: MIT
Project-URL: Homepage, https://github.com/nilotpaldhar2004/datadiagnose
Project-URL: Repository, https://github.com/nilotpaldhar2004/datadiagnose
Project-URL: Documentation, https://github.com/nilotpaldhar2004/datadiagnose/blob/main/README.md
Project-URL: Bug Tracker, https://github.com/nilotpaldhar2004/datadiagnose/issues
Project-URL: Changelog, https://github.com/nilotpaldhar2004/datadiagnose/blob/main/CHANGELOG.md
Keywords: data science,machine learning,dataset,data quality,data cleaning,eda,exploratory data analysis,missing values,outliers,data leakage,class imbalance,python,beginner
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Education
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Typing :: Typed
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Provides-Extra: data
Requires-Dist: pandas>=1.3; extra == "data"
Requires-Dist: numpy>=1.21; extra == "data"
Provides-Extra: all
Requires-Dist: pytest>=7.0; extra == "all"
Requires-Dist: pytest-cov>=4.0; extra == "all"
Requires-Dist: flake8>=6.0; extra == "all"
Requires-Dist: pandas>=1.3; extra == "all"
Requires-Dist: numpy>=1.21; extra == "all"
Dynamic: license-file

# DataDiagnose

**A Python library that looks at your dataset and tells you exactly what is wrong with it.**

[![Tests](https://github.com/nilotpaldhar/datadiagnose/actions/workflows/tests.yml/badge.svg)](https://github.com/nilotpaldhar/datadiagnose/actions)
[![Python](https://img.shields.io/badge/python-3.7%2B-blue)](https://www.python.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Version](https://img.shields.io/badge/version-1.0.0-orange)](https://github.com/nilotpaldhar/datadiagnose)
[![Dependencies](https://img.shields.io/badge/dependencies-none-brightgreen)](pyproject.toml)

---

## The Problem This Solves

Every beginner data scientist goes through the same painful experience. You download a dataset, you are excited to build your first model, you train it — and the results are terrible. 50% accuracy. Predictions that make no sense. Hours of confusion.

In most cases the model is not the problem. **The data is broken.** There are missing values pulling your model in the wrong direction. An outlier like an age of 900 years is destroying your statistics. Your target column has 95% of rows saying "no", so your model just learned to say "no" for everything.

Experienced data scientists know to check for these things before touching a model. They run a full diagnostic on the dataset first. **DataDiagnose automates that diagnostic in one function call.**

---

## What It Does

Give DataDiagnose any dataset and it returns:

- A **health score** from 0 to 100 showing how clean your data is
- A list of every **problem detected**, with severity (CRITICAL / HIGH / MEDIUM / LOW)
- A specific **fix suggestion** for every problem
- **Model recommendations** based on your data characteristics
- **Feature engineering hints** based on your column names

Eight problems are detected automatically:

| Problem | What It Means |
|---|---|
| 🕳️ Missing Values | Null, empty, or NaN values in any column |
| 🎯 Outliers | Extreme values detected by IQR and Z-score methods |
| 📐 Skewness | Lopsided distributions that hurt linear models |
| ⚖️ Class Imbalance | One class vastly outnumbering others in your target |
| 🚨 Data Leakage | Columns that secretly contain the answer |
| 🔁 Duplicate Rows | Identical rows that bias your model |
| 📊 Constant Columns | Columns with zero variation — zero information |
| 🃏 High Cardinality | ID-like columns with almost all unique values |
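
To make the table concrete, here is how two of the simplest of these checks could be sketched in plain Python. This is an illustration only; the function names and the 0.95 threshold are assumptions, not datadiagnose's actual implementation:

```python
# Illustrative sketches of two detectors (not datadiagnose's actual code).

def is_constant(column):
    """A column with at most one distinct non-null value carries no information."""
    distinct = {v for v in column if v is not None}
    return len(distinct) <= 1

def is_high_cardinality(column, threshold=0.95):
    """ID-like columns: nearly every value is unique."""
    return len(set(column)) / len(column) >= threshold

print(is_constant(["a", "a", None, "a"]))    # -> True
print(is_high_cardinality([1, 2, 3, 4, 5]))  # -> True
```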

---

## Installation

DataDiagnose has **zero external dependencies**. It uses only the Python standard library: no pandas, no numpy, no scikit-learn required.

```bash
# Once published on PyPI
pip install datadiagnose
```

For now, copy the `datadiagnose/` package folder into your project and import it directly.

---

## Quick Start

```python
from datadiagnose import diagnose

dataset = {
    "age": [25, 30, None, 22, 900, 28],
    "income": [50000, 60000, None, 48000, 52000, 61000],
    "city": ["KOL", "MUM", "KOL", "DEL", "KOL", "KOL"],
    "target": [1, 0, 1, 0, 1, 0],
}

report = diagnose(dataset, target_col="target")
print(report)
```

Output:

```
==============================================================
DATADIAGNOSE REPORT — DATASET
==============================================================
Rows    : 6
Columns : 4
Score   : 69/100 ⚠️ Needs Work
--------------------------------------------------------------
🔍 Issues Found (2)

1. 🟡 MEDIUM
   Missing Values in 'age'
   → 16.7% of values are missing.
   💡 Fix: Fill 'age' with median (numeric) or mode (categorical).

2. 🟡 MEDIUM
   Missing Values in 'income'
   → 16.7% of values are missing.
   💡 Fix: Fill 'income' with median (numeric) or mode (categorical).
...
```

---

## Works With Pandas Too

DataDiagnose is not a pandas replacement — it works alongside it. Convert your DataFrame in one line:

```python
import pandas as pd
from datadiagnose import diagnose

df = pd.read_csv("my_data.csv")
report = diagnose(df.to_dict(orient="list"), target_col="target")

print(f"Health score: {report.score}/100")
```

---

## Full API

### `diagnose(dataset, target_col=None, dataset_name="dataset")`

The main function. Runs all eight detectors and returns a `DiagnosisReport`.

```python
report = diagnose(dataset, target_col="label", dataset_name="Titanic")

report.score           # int — health score 0-100
report.issues          # list of Issue objects
report.suggestions     # list of fix strings
report.model_types     # list of recommended model names
report.column_reports  # dict of per-column statistics
```

### `quick_scan(dataset, target_col=None)`

One-liner that runs the diagnosis and immediately prints the report.

```python
quick_scan(dataset, target_col="label")
```

### `health_score(dataset, target_col=None)`

Returns only the integer score. Perfect for quality gates in automated pipelines.

```python
score = health_score(dataset, target_col="label")

if score < 70:
    raise ValueError(f"Data quality too low: {score}/100. Fix issues first.")
```

### `list_issues(dataset, target_col=None)`

Returns a concise list of `(severity, title)` tuples.

```python
for severity, title in list_issues(dataset, "label"):
    print(f"[{severity}] {title}")
```

### `get_suggestions(dataset, target_col=None)`

Returns only the actionable fix suggestions as strings.

```python
for tip in get_suggestions(dataset, "label"):
    print("-", tip)
```

### `column_summary(dataset, col_name, target_col=None)`

Deep-dives into one specific column.

```python
rep = column_summary(dataset, "age")
print(rep.details)
# {'type': 'numeric', 'mean': '27.4', 'std': '3.2', ...}
```

---

## Understanding the Health Score

Every dataset starts at 100. Each detected issue deducts points based on severity:

| Severity | Points Lost | Example |
|---|---|---|
| 🔴 CRITICAL | 25 | Data leakage, >60% missing values |
| 🟠 HIGH | 15 | >30% missing, severe class imbalance |
| 🟡 MEDIUM | 8 | Moderate skewness, some outliers |
| ⚪ LOW | 3 | A few minor outliers |

| Score | Status | What To Do |
|---|---|---|
| 80 – 100 | ✅ Healthy | Data is ready for modelling |
| 50 – 79 | ⚠️ Needs Work | Fix HIGH and CRITICAL issues first |
| 0 – 49 | ❌ Critical | Do not train models yet |
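
The deduction scheme above can be sketched in a few lines. This is illustrative only: the penalty values come from the table, but the real scoring logic in `core.py` may differ in its details:

```python
# Illustrative sketch of a severity-based deduction score
# (penalty values taken from the table above; not the library's actual code).
PENALTY = {"CRITICAL": 25, "HIGH": 15, "MEDIUM": 8, "LOW": 3}

def sketch_score(severities):
    """Start at 100, subtract a penalty per issue, and never go below 0."""
    return max(0, 100 - sum(PENALTY[s] for s in severities))

print(sketch_score(["MEDIUM", "MEDIUM"]))  # -> 84
print(sketch_score(["CRITICAL"] * 5))      # -> 0 (clamped)
```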

---

## Project Structure

```
datadiagnose/

├── datadiagnose/                ← The Python package
│   ├── __init__.py              ← Public API
│   ├── core.py                  ← Main diagnose() engine
│   ├── detectors.py             ← All 8 detector functions
│   ├── models.py                ← DiagnosisReport, Issue, ColumnReport classes
│   └── utils.py                 ← Pure math helpers (no dependencies)

├── tests/
│   ├── sample_data.py           ← 11 sample datasets with known problems
│   ├── test_detectors.py        ← 60 unit tests for each detector
│   └── test_core.py             ← 80 integration tests for the full API

├── examples/
│   ├── basic_usage.py           ← Start here — every function shown
│   ├── pandas_integration.py    ← How to use with pandas DataFrames
│   └── student_dataset_demo.py  ← Full workflow, step by step

├── docs/
│   └── DataDiagnose_Documentation.pdf

├── .github/workflows/tests.yml  ← Auto-run tests on every push
├── .gitignore
├── LICENSE
├── README.md
└── pyproject.toml
```

---

## Running the Tests

DataDiagnose has 140 tests covering every detector, every public function, and edge cases. Tests use only Python's built-in `unittest` — no pytest required (though pytest works too).

```bash
# Run all tests
python -m unittest discover -s tests -v

# Run just the detector tests
python -m unittest tests.test_detectors -v

# Run just the core API tests
python -m unittest tests.test_core -v
```

All 140 tests should pass, with output ending:

```
----------------------------------------------------------------------
Ran 140 tests in 0.08s

OK
```

---

## Running the Examples

```bash
# Simplest introduction — run this first
python examples/basic_usage.py

# How to use with pandas DataFrames
python examples/pandas_integration.py

# A full realistic data cleaning workflow
python examples/student_dataset_demo.py
```

---

## Why Zero Dependencies?

DataDiagnose uses only Python's built-in `math`, `statistics`, and `collections` modules. This was a deliberate decision:

1. **Works everywhere** — any Python 3.7+ environment, no pip install needed beyond the library itself
2. **No version conflicts** — adding numpy or pandas as dependencies would create compatibility issues for people who already have specific versions installed
3. **Educational** — every algorithm (IQR, Pearson correlation, skewness) is implemented from scratch in readable Python, so you can read the code and learn exactly how it works
4. **Lightweight** — the entire library is five Python files totalling around 1000 lines
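
As a taste of what "implemented from scratch" looks like, an IQR outlier check needs nothing beyond the `statistics` module. This is a simplified sketch, not the code in `detectors.py`; note that `statistics.quantiles` itself requires Python 3.8+:

```python
# Simplified IQR outlier check using only the standard library.
# Illustrative sketch only; statistics.quantiles needs Python 3.8+.
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([25, 30, 22, 900, 28, 27]))  # -> [900]
```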

---

## Design Decisions

**Why does it only detect problems and not fix them automatically?**

Automatically fixing data without human judgment is dangerous. Filling missing values with the wrong strategy can make your model *worse*. Whether you should drop a column or impute it, and what value to impute with, depends on domain knowledge — your understanding of what the data means. DataDiagnose gives you the information and a recommendation; the decision is yours.

**Why a dict-of-lists and not a DataFrame?**

Accepting a plain Python dict means the library works with no dependencies at all. If you have a pandas DataFrame, converting it takes one line: `df.to_dict(orient="list")`. Supporting DataFrames directly would require adding pandas as a dependency, which would defeat the zero-dependency design.

---

## How to Contribute

Contributions are welcome. Here are some ideas from the roadmap:

- HTML report export (generate a self-contained HTML file with charts)
- Correlation matrix analysis (detect multicollinearity between features)
- Direct pandas DataFrame support without conversion
- Web dashboard (Flask/FastAPI endpoint to upload a CSV and get a diagnosis)

To contribute:

1. Fork the repository on GitHub
2. Create a branch: `git checkout -b feature/my-new-detector`
3. Write your code and tests
4. Make sure all 140 existing tests still pass
5. Open a pull request with a clear description

---

## Changelog

### v1.0.0 — Initial Release

- Eight detectors: missing values, outliers, skewness, class imbalance, data leakage, duplicate rows, constant columns, high cardinality
- Feature engineering hints based on column name patterns
- Model recommendation engine
- Health score system (0–100)
- Full public API: `diagnose`, `quick_scan`, `health_score`, `list_issues`, `get_suggestions`, `column_summary`
- 140 unit and integration tests
- Zero external dependencies

---

## License

This project is licensed under the **MIT License** — see the [LICENSE](LICENSE) file for the full text.

In plain English: you can use this code for anything, including commercial projects, as long as you keep the copyright notice with my name in any copy you distribute.

Copyright (c) 2026 **Nilotpal Dhar**

---

## Author

**Nilotpal Dhar**

Built as a beginner Python project to learn how data science diagnostics work from first principles. Every algorithm in this library — IQR outlier detection, Pearson correlation, skewness calculation — is implemented from scratch in plain Python so that reading the code teaches you how the maths actually works.

If this library helped you, star the repository on GitHub. If you found a bug or have a feature idea, open an issue.