PyPI - clean-data-tools - Versions diffs - 0.1.0__py3-none-any.whl - Mend

clean-data-tools 0.1.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

clean_data_tools-0.1.0.dist-info/METADATA +334 -0
clean_data_tools-0.1.0.dist-info/RECORD +10 -0
clean_data_tools-0.1.0.dist-info/WHEEL +5 -0
clean_data_tools-0.1.0.dist-info/licenses/LICENSE +21 -0
clean_data_tools-0.1.0.dist-info/top_level.txt +1 -0
cleandata/__init__.py +16 -0
cleandata/cleaner.py +188 -0
cleandata/normalizer.py +121 -0
cleandata/outlier.py +147 -0
cleandata/utils.py +65 -0

clean_data_tools-0.1.0.dist-info/METADATA ADDED

@@ -0,0 +1,334 @@
+Metadata-Version: 2.4
+Name: clean-data-tools
+Version: 0.1.0
+Summary: Powerful tools for data cleaning and preprocessing
+Author-email: Hasan Bagheri <hasan111bagher@gmail.com>
+License: MIT
+Project-URL: Homepage, https://github.com/0hasanbagheri0/clean-data
+Project-URL: Repository, https://github.com/0hasanbagheri0/clean-data
+Project-URL: Issues, https://github.com/0hasanbagheri0/clean-data/issues
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Intended Audience :: Developers
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas>=1.0.0
+Requires-Dist: numpy>=1.18.0
+Requires-Dist: scipy>=1.4.0
+Dynamic: license-file
+# Clean-Data
+Powerful tools for data cleaning and preprocessing in Python
+[![PyPI version](https://badge.fury.io/py/clean-data.svg)](https://badge.fury.io/py/clean-data)
+[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+---
+## 🌐 English | [فارسی](#فارسی)
+---
+# English Documentation
+## 📁 Clean-Data
+**Clean-Data** is a powerful library for data cleaning and preprocessing in Python. It simplifies repetitive tasks like handling missing values, removing duplicates, detecting outliers, and normalizing data.
+---
+### ✨ Key Features
+- **Remove Duplicates**: Eliminate duplicate records easily
+- **Handle Missing Values**: Fill with mean, median, mode, or custom values
+- **Outlier Detection**: Using IQR and Z-Score methods
+- **Data Normalization**: Min-Max, Standardization, and Robust Scaling
+- **Auto Type Conversion**: Convert columns to appropriate types
+- **Quality Report**: Get detailed statistics about your data
+---
+### 📦 Installation
+```bash
+pip install clean-data-tools
+```
+---
+### 🚀 Quick Start
+```python
+import pandas as pd
+from cleandata import DataCleaner, OutlierDetector, Normalizer, get_data_quality_report
+# Load data
+df = pd.read_csv("data.csv")
+# Clean data
+cleaner = DataCleaner(df)
+cleaner.remove_duplicates()
+cleaner.fill_missing("mean")
+cleaner.strip_strings()
+# Detect and remove outliers
+detector = OutlierDetector(cleaner.get_data())
+outliers = detector.detect_iqr()
+df_clean = detector.remove_outliers()
+# Normalize
+normalizer = Normalizer(df_clean)
+df_scaled = normalizer.min_max_scale()
+# Quality report
+report = get_data_quality_report(df_clean)
+print(report)
+```
+---
+### 📚 API Reference
+#### DataCleaner Class
+| Method | Description |
+| :--- | :--- |
+| `remove_duplicates(subset, keep)` | Remove duplicate rows |
+| `fill_missing(method, columns)` | Fill missing values with the mean, median, mode, zero, or a custom value |
+| `remove_missing(threshold, axis)` | Remove rows/columns with too many missing values |
+| `convert_types(columns)` | Auto-convert column data types |
+| `strip_strings(columns)` | Remove leading and trailing whitespace from strings |
+| `rename_columns(mapping)` | Rename columns |
+| `filter_rows(condition)` | Filter rows based on a condition |
+| `reset()` | Revert to the original data |
+#### OutlierDetector Class
+| Method | Description |
+| :--- | :--- |
+| `detect_iqr(columns, multiplier)` | Detect outliers using the IQR method |
+| `detect_zscore(columns, threshold)` | Detect outliers using the Z-Score method |
+| `remove_outliers(columns, method, threshold)` | Remove rows that contain outliers |
+| `replace_outliers(columns, method, multiplier)` | Replace outliers with the mean, the median, or a custom value |
+#### Normalizer Class
+| Method | Description |
+| :--- | :--- |
+| `min_max_scale(columns, feature_range)` | Scale to a range (default 0-1) |
+| `standardize(columns)` | Standardize to mean=0, std=1 |
+| `robust_scale(columns)` | Scale using the median and IQR (robust to outliers) |
+| `log_transform(columns)` | Apply a log transformation |
+#### Utility Functions
+| Function | Description |
+| :--- | :--- |
+| `get_data_quality_report(df)` | Get a comprehensive data quality report |
+| `get_column_info(df, column)` | Get detailed info about a specific column |
+---
+### 🛠️ Requirements
+- Python 3.7 or higher
+- pandas>=1.0.0
+- numpy>=1.18.0
+- scipy>=1.4.0
+---
+### 🤝 Contributing
+We welcome contributions! Please:
+1. Fork the repository
+2. Create a new branch (`git checkout -b feature/amazing-feature`)
+3. Commit your changes (`git commit -m 'Add amazing feature'`)
+4. Push to the branch (`git push origin feature/amazing-feature`)
+5. Open a Pull Request
+---
+### 📄 License
+This project is licensed under the MIT License.
+---
+### 📧 Contact
+- Email: hasan111bagher@gmail.com
+- GitHub: 0hasanbagheri0
+---
+---
+فارسی
+---
+### 📁 Clean-Data
+Clean-Data یک کتابخانه قدرتمند برای تمیزکاری و پیش‌پردازش داده‌ها در پایتون است. این کتابخانه کارهای تکراری مانند مدیریت مقادیر خالی، حذف رکوردهای تکراری، تشخیص داده‌های پرت و نرمال‌سازی داده‌ها را ساده می‌کند.
+---
+### ✨ ویژگی‌های کلیدی
+- **حذف رکوردهای تکراری**: حذف آسان رکوردهای تکراری
+- **مدیریت مقادیر خالی**: پر کردن با میانگین، میانه، مد یا مقدار دلخواه
+- **تشخیص داده‌های پرت**: با روش‌های IQR و Z-Score
+- **نرمال‌سازی داده‌ها**: Min-Max، Standardization و Robust Scaling
+- **تبدیل خودکار نوع داده‌ها**: تبدیل ستون‌ها به نوع مناسب
+- **گزارش کیفیت**: دریافت آمار دقیق از داده‌ها
+---
+### 📦 نصب
+```bash
+pip install clean-data-tools
+```
+---
+### 🚀 شروع سریع
+```python
+import pandas as pd
+from cleandata import DataCleaner, OutlierDetector, Normalizer, get_data_quality_report
+# بارگذاری داده
+df = pd.read_csv("data.csv")
+# تمیزکاری
+cleaner = DataCleaner(df)
+cleaner.remove_duplicates()
+cleaner.fill_missing("mean")
+cleaner.strip_strings()
+# تشخیص و حذف داده‌های پرت
+detector = OutlierDetector(cleaner.get_data())
+outliers = detector.detect_iqr()
+df_clean = detector.remove_outliers()
+# نرمال‌سازی
+normalizer = Normalizer(df_clean)
+df_scaled = normalizer.min_max_scale()
+# گزارش کیفیت
+report = get_data_quality_report(df_clean)
+print(report)
+```
+---
+### 📚 راهنمای توابع
+#### کلاس DataCleaner
+| تابع | توضیح |
+| :--- | :--- |
+| `remove_duplicates(subset, keep)` | حذف سطرهای تکراری |
+| `fill_missing(method, columns)` | پر کردن مقادیر خالی با میانگین، میانه، مد یا مقدار دلخواه |
+| `remove_missing(threshold, axis)` | حذف سطرها/ستون‌هایی که مقادیر خالی زیادی دارند |
+| `convert_types(columns)` | تبدیل خودکار نوع ستون‌ها |
+| `strip_strings(columns)` | حذف فاصله‌های اضافی از رشته‌ها |
+| `rename_columns(mapping)` | تغییر نام ستون‌ها |
+| `filter_rows(condition)` | فیلتر کردن سطرها بر اساس شرط |
+| `reset()` | بازگشت به داده‌های اصلی |
+#### کلاس OutlierDetector
+| تابع | توضیح |
+| :--- | :--- |
+| `detect_iqr(columns, multiplier)` | تشخیص داده‌های پرت با روش IQR |
+| `detect_zscore(columns, threshold)` | تشخیص داده‌های پرت با روش Z-Score |
+| `remove_outliers(columns, method, threshold)` | حذف سطرهای حاوی داده‌های پرت |
+| `replace_outliers(columns, method, multiplier)` | جایگزینی داده‌های پرت با میانگین/میانه/مقدار دلخواه |
+#### کلاس Normalizer
+| تابع | توضیح |
+| :--- | :--- |
+| `min_max_scale(columns, feature_range)` | مقیاس‌سازی به بازه مشخص (پیش‌فرض ۰ تا ۱) |
+| `standardize(columns)` | استانداردسازی (میانگین صفر، انحراف معیار یک) |
+| `robust_scale(columns)` | مقیاس‌سازی مقاوم به داده‌های پرت (با میانه و IQR) |
+| `log_transform(columns)` | اعمال تبدیل لگاریتمی |
+#### توابع کمکی
+| تابع | توضیح |
+| :--- | :--- |
+| `get_data_quality_report(df)` | دریافت گزارش کامل کیفیت داده |
+| `get_column_info(df, column)` | دریافت اطلاعات دقیق یک ستون خاص |
+### 🛠️ نیازمندی‌ها
+- Python 3.7 یا بالاتر
+- pandas>=1.0.0
+- numpy>=1.18.0
+- scipy>=1.4.0
+### 🤝 مشارکت
+از مشارکت شما استقبال می‌کنیم! لطفاً:
+1. مخزن را Fork کنید
+2. یک شاخه جدید بسازید (`git checkout -b feature/amazing-feature`)
+3. تغییرات را Commit کنید (`git commit -m 'Add amazing feature'`)
+4. به شاخه خود Push کنید (`git push origin feature/amazing-feature`)
+5. یک Pull Request باز کنید
+### 📄 مجوز
+این پروژه تحت مجوز MIT منتشر شده است.
+### 📧 ارتباط با من
+- ایمیل: hasan111bagher@gmail.com
+- گیت‌هاب: 0hasanbagheri0
+✨ اگر این کتابخانه برای شما مفید بود، به آن یک ⭐ در گیت‌هاب بدهید!
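
Reviewer sketch: the README's first three quick-start steps (remove duplicates, fill missing values with the mean, strip strings) can be illustrated without installing the package. The function `clean_records` and the field names `name` and `age` are hypothetical and exist only for this illustration; the package itself operates on pandas DataFrames.

```python
from statistics import mean

def clean_records(records, numeric_key, text_key):
    """Deduplicate records, mean-fill one numeric field, strip one text field."""
    # 1. Drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))
    # 2. Replace missing numeric values (None) with the mean of the observed ones.
    observed = [r[numeric_key] for r in unique if r[numeric_key] is not None]
    fill = mean(observed) if observed else 0
    for r in unique:
        if r[numeric_key] is None:
            r[numeric_key] = fill
    # 3. Strip surrounding whitespace from the text field.
    for r in unique:
        if isinstance(r[text_key], str):
            r[text_key] = r[text_key].strip()
    return unique

rows = [
    {"name": " Ada ", "age": 30},
    {"name": " Ada ", "age": 30},   # exact duplicate
    {"name": "Bob", "age": None},   # missing value
    {"name": "Cy  ", "age": 50},
]
cleaned = clean_records(rows, numeric_key="age", text_key="name")
print(cleaned)
```

Duplicates are removed before the mean is computed, mirroring the order of the quick start, so repeated rows do not skew the fill value.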

clean_data_tools-0.1.0.dist-info/RECORD ADDED

@@ -0,0 +1,10 @@
+clean_data_tools-0.1.0.dist-info/licenses/LICENSE,sha256=rL-ShFgU5oeaikXXGM55veZaqeTbpEhE4lnDOdN81d0,1091
+cleandata/__init__.py,sha256=7t2d_oGM1EEOCRgg52F7OcPOAVkodNh_PszoVS5lgF8,388
+cleandata/cleaner.py,sha256=Viz7_Tt_e7lvAl0EDSPPIXjnlz5LwpJsHUweWKo980s,7249
+cleandata/normalizer.py,sha256=x_jIzb9vGnZmvxaENLd-P0LS_sbD7hbQYsjZLjt_qi4,4173
+cleandata/outlier.py,sha256=oYw6ng1LMm-lRoONPjv9LsenOOcPGS7oiq5HSCtXE6M,5079
+cleandata/utils.py,sha256=xKqxKp9hIzWNX6M4SExsxoMQ8TNLDsokHoPPm7L8frY,2266
+clean_data_tools-0.1.0.dist-info/METADATA,sha256=3GHJZwixBuI9HxKekQdgbMtwoz7M__ocSHQLCgwOelA,9898
+clean_data_tools-0.1.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
+clean_data_tools-0.1.0.dist-info/top_level.txt,sha256=Aeh0-TH-86FG7nBEMjYyK7ZzxgYRnHO_y27aZkEAITU,10
+clean_data_tools-0.1.0.dist-info/RECORD,,

clean_data_tools-0.1.0.dist-info/WHEEL ADDED

@@ -0,0 +1,5 @@
+Wheel-Version: 1.0
+Generator: setuptools (82.0.1)
+Root-Is-Purelib: true
+Tag: py3-none-any

clean_data_tools-0.1.0.dist-info/licenses/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 Hasan Bagheri
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

clean_data_tools-0.1.0.dist-info/top_level.txt ADDED

@@ -0,0 +1 @@
+cleandata

cleandata/__init__.py ADDED

@@ -0,0 +1,16 @@
+"""
+Clean-Data - tools for data cleaning and preprocessing
+"""
+from .cleaner import DataCleaner
+from .normalizer import Normalizer
+from .outlier import OutlierDetector
+from .utils import get_data_quality_report
+__version__ = "0.1.0"
+__all__ = [
+    "DataCleaner",
+    "Normalizer",
+    "OutlierDetector",
+    "get_data_quality_report"
+]

cleandata/cleaner.py ADDED

@@ -0,0 +1,188 @@
+"""
+Main class for data cleaning
+"""
+import pandas as pd
+import numpy as np
+from typing import Union, Optional, List, Any
+from pathlib import Path
+class DataCleaner:
+    """
+    Main class for data cleaning
+    Examples:
+        >>> cleaner = DataCleaner(df)
+        >>> cleaner.remove_duplicates()
+        >>> cleaner.fill_missing("mean")
+        >>> df_clean = cleaner.get_data()
+    """
+    def __init__(self, data: Union[pd.DataFrame, str, Path]):
+        """
+        Initialize the cleaner
+        Args:
+            data: a pandas DataFrame or the path to a CSV/Excel file
+        """
+        if isinstance(data, (str, Path)):
+            self.df = self._load_data(data)
+        elif isinstance(data, pd.DataFrame):
+            self.df = data.copy()
+        else:
+            raise TypeError("Input must be a pandas DataFrame or a file path")
+        self.original_df = self.df.copy()
+        self._changes_log = []
+    def _load_data(self, path: Union[str, Path]) -> pd.DataFrame:
+        """Load data from a CSV or Excel file"""
+        path = Path(path)
+        suffix = path.suffix.lower()  # also accept upper-case extensions such as .CSV
+        if suffix == '.csv':
+            return pd.read_csv(path)
+        elif suffix in ['.xlsx', '.xls']:
+            return pd.read_excel(path)
+        else:
+            raise ValueError("Unsupported file format; only CSV and Excel are supported")
+    def get_data(self) -> pd.DataFrame:
+        """Return the cleaned DataFrame"""
+        return self.df
+    def get_original(self) -> pd.DataFrame:
+        """Return the original DataFrame"""
+        return self.original_df
+    def get_changes_log(self) -> List[str]:
+        """Return the log of changes"""
+        return self._changes_log
+    def _log_change(self, message: str):
+        """Record a change in the log"""
+        self._changes_log.append(message)
+    def remove_duplicates(self, subset: Optional[List[str]] = None,
+                          keep: Union[str, bool] = 'first') -> 'DataCleaner':
+        """
+        Remove duplicate records
+        Args:
+            subset: list of columns to consider when identifying duplicates
+            keep: 'first', 'last', or False
+        """
+        before = len(self.df)
+        self.df = self.df.drop_duplicates(subset=subset, keep=keep)
+        after = len(self.df)
+        self._log_change(f"Removed {before - after} duplicate records")
+        return self
+    def fill_missing(self, method: Union[str, dict, int, float],
+                     columns: Optional[List[str]] = None) -> 'DataCleaner':
+        """
+        Fill missing values
+        Args:
+            method: 'mean', 'median', 'mode', 'zero', a custom value, or a dict mapping column names to fill values
+            columns: list of columns (all columns if None)
+        """
+        if columns is None:
+            columns = self.df.columns
+        for col in columns:
+            if col not in self.df.columns:
+                continue
+            missing_count = self.df[col].isna().sum()
+            if missing_count == 0:
+                continue
+            if isinstance(method, dict):
+                fill_value = method.get(col, 0)
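+            elif method in ('mean', 'median'):
+                # mean/median are only defined for numeric columns; skip the rest
+                if not pd.api.types.is_numeric_dtype(self.df[col]):
+                    continue
+                fill_value = self.df[col].mean() if method == 'mean' else self.df[col].median()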
+            elif method == 'mode':
+                fill_value = self.df[col].mode()[0] if not self.df[col].mode().empty else 0
+            elif method == 'zero':
+                fill_value = 0
+            else:
+                fill_value = method
+            self.df[col] = self.df[col].fillna(fill_value)
+            self._log_change(f"Filled {missing_count} missing values in column '{col}'")
+        return self
+    def remove_missing(self, threshold: float = 0.5,
+                       axis: int = 0) -> 'DataCleaner':
+        """
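+        Remove rows or columns that have too many missing values
+        Args:
+            threshold: minimum fraction of non-missing values required to keep a row or column (between 0 and 1)
+            axis: 0 for rows, 1 for columns
+        """
+        before = len(self.df) if axis == 0 else len(self.df.columns)
+        # dropna(thresh=k) keeps a row (axis=0) or column (axis=1) with at least k
+        # non-missing values, so k is a fraction of the row width or column length
+        size = len(self.df.columns) if axis == 0 else len(self.df)
+        self.df = self.df.dropna(thresh=int(threshold * size), axis=axis)
+        after = len(self.df) if axis == 0 else len(self.df.columns)
+        self._log_change(f"Removed {before - after} {'rows' if axis == 0 else 'columns'}")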
+        return self
+    def convert_types(self, columns: Optional[List[str]] = None) -> 'DataCleaner':
+        """
+        Automatically convert column types (numeric, datetime, string)
+        """
+        if columns is None:
+            columns = self.df.columns
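+        for col in columns:
+            if col not in self.df.columns:
+                continue
+            series = self.df[col]
+            # Leave columns that already have a numeric or datetime dtype untouched;
+            # pd.to_datetime would otherwise reinterpret numbers as epoch timestamps
+            if (pd.api.types.is_numeric_dtype(series)
+                    or pd.api.types.is_datetime64_any_dtype(series)):
+                continue
+            try:
+                # Try numeric conversion first
+                self.df[col] = pd.to_numeric(series)
+                continue
+            except (ValueError, TypeError):
+                pass
+            try:
+                # Fall back to datetime conversion
+                self.df[col] = pd.to_datetime(series)
+            except (ValueError, TypeError):
+                pass
+        self._log_change("Converted column types automatically")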
+        return self
+    def strip_strings(self, columns: Optional[List[str]] = None) -> 'DataCleaner':
+        """Remove leading and trailing whitespace from string columns"""
+        if columns is None:
+            columns = self.df.select_dtypes(include=['object']).columns
+        for col in columns:
+            if col in self.df.columns and self.df[col].dtype == 'object':
+                self.df[col] = self.df[col].str.strip()
+        self._log_change("Stripped surrounding whitespace from string columns")
+        return self
+    def rename_columns(self, mapping: dict) -> 'DataCleaner':
+        """Rename columns"""
+        self.df = self.df.rename(columns=mapping)
+        self._log_change(f"Renamed {len(mapping)} columns")
+        return self
+    def filter_rows(self, condition: Any) -> 'DataCleaner':
+        """Filter rows based on a condition"""
+        before = len(self.df)
+        self.df = self.df[condition]
+        after = len(self.df)
+        self._log_change(f"Filter: removed {before - after} rows")
+        return self
+    def reset(self) -> 'DataCleaner':
+        """Revert to the original data"""
+        self.df = self.original_df.copy()
+        self._changes_log = []
+        self._log_change("Reverted to the original data")
+        return self
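
Reviewer sketch: `remove_missing` forwards a `thresh` value to `DataFrame.dropna`, which keeps a row only when it has at least that many non-missing cells. The plain-Python function below illustrates that rule for rows, assuming the threshold is meant as a fraction of the columns in each row; `drop_sparse_rows` is a hypothetical name, not part of the package.

```python
def drop_sparse_rows(rows, threshold=0.5):
    """Keep rows whose number of non-missing cells is at least threshold * n_columns."""
    if not rows:
        return []
    n_columns = len(rows[0])
    # Truncate toward zero, as int(...) does in the method above.
    min_present = int(threshold * n_columns)
    return [
        row for row in rows
        if sum(cell is not None for cell in row) >= min_present
    ]

table = [
    [1, 2, 3, 4],           # 4 of 4 present
    [1, None, None, 4],     # 2 of 4 present
    [None, None, None, 4],  # 1 of 4 present
]
kept = drop_sparse_rows(table, threshold=0.5)
print(kept)
```

With four columns and a threshold of 0.5, a row needs at least two present values, so only the last row is dropped.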

cleandata/normalizer.py ADDED

@@ -0,0 +1,121 @@
+"""
+Data normalization functions
+"""
+import pandas as pd
+import numpy as np
+from typing import List, Optional, Union
+class Normalizer:
+    """
+    Data normalization class
+    Examples:
+        >>> normalizer = Normalizer(df)
+        >>> df_scaled = normalizer.min_max_scale()
+        >>> df_standard = normalizer.standardize()
+    """
+    def __init__(self, data: pd.DataFrame):
+        self.df = data.copy()
+        self._params = {}
+    def min_max_scale(self, columns: Optional[List[str]] = None,
+                       feature_range: tuple = (0, 1)) -> pd.DataFrame:
+        """
+        Min-Max normalization (scaling to a given range)
+        Args:
+            columns: list of columns (all numeric columns if None)
+            feature_range: target range (min, max)
+        """
+        if columns is None:
+            columns = self.df.select_dtypes(include=[np.number]).columns
+        result = self.df.copy()
+        for col in columns:
+            if col not in self.df.columns:
+                continue
+            min_val = self.df[col].min()
+            max_val = self.df[col].max()
+            if max_val - min_val == 0:
+                # Constant column: map every value to the lower bound of the target range
+                result[col] = feature_range[0]
+            else:
+                result[col] = (self.df[col] - min_val) / (max_val - min_val) * (feature_range[1] - feature_range[0]) + feature_range[0]
+            self._params[col] = {'min': min_val, 'max': max_val}
+        return result
+    def standardize(self, columns: Optional[List[str]] = None) -> pd.DataFrame:
+        """
+        Standardization (zero mean, unit standard deviation)
+        """
+        if columns is None:
+            columns = self.df.select_dtypes(include=[np.number]).columns
+        result = self.df.copy()
+        for col in columns:
+            if col not in self.df.columns:
+                continue
+            mean = self.df[col].mean()
+            std = self.df[col].std()
+            if std == 0:
+                result[col] = 0
+            else:
+                result[col] = (self.df[col] - mean) / std
+            self._params[col] = {'mean': mean, 'std': std}
+        return result
+    def robust_scale(self, columns: Optional[List[str]] = None) -> pd.DataFrame:
+        """
+        Outlier-robust scaling (using the median and IQR)
+        """
+        if columns is None:
+            columns = self.df.select_dtypes(include=[np.number]).columns
+        result = self.df.copy()
+        for col in columns:
+            if col not in self.df.columns:
+                continue
+            median = self.df[col].median()
+            q1 = self.df[col].quantile(0.25)
+            q3 = self.df[col].quantile(0.75)
+            iqr = q3 - q1
+            if iqr == 0:
+                result[col] = 0
+            else:
+                result[col] = (self.df[col] - median) / iqr
+            self._params[col] = {'median': median, 'iqr': iqr}
+        return result
+    def log_transform(self, columns: Optional[List[str]] = None) -> pd.DataFrame:
+        """
+        Log transform (for data with a skewed distribution)
+        """
+        if columns is None:
+            columns = self.df.select_dtypes(include=[np.number]).columns
+        result = self.df.copy()
+        for col in columns:
+            if col not in self.df.columns:
+                continue
+            # Shift the data first if any value is non-positive, so the logarithm is defined
+            if (self.df[col] <= 0).any():
+                shift = abs(self.df[col].min()) + 1
+                result[col] = np.log1p(self.df[col] + shift)
+            else:
+                result[col] = np.log(self.df[col])
+        return result
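
Reviewer sketch: the three scalers above reduce to short formulas, shown here in dependency-free Python rather than pandas. The function names are hypothetical. Two assumptions are worth flagging: `statistics.stdev` is the sample standard deviation (ddof=1), which matches the default of pandas `Series.std()`, and `quantiles(..., method="inclusive")` uses the same linear interpolation as the pandas `quantile` default. Mapping a constant input to the lower bound is a choice made for this sketch.

```python
from statistics import mean, median, quantiles, stdev

def min_max(values, lo=0.0, hi=1.0):
    """Rescale linearly so the minimum maps to lo and the maximum to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Constant input: no spread to rescale, so return the lower bound
        return [lo for _ in values]
    return [(v - vmin) / (vmax - vmin) * (hi - lo) + lo for v in values]

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [0.0 for _ in values] if sigma == 0 else [(v - mu) / sigma for v in values]

def robust(values):
    """Subtract the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    med = median(values)
    return [0.0 for _ in values] if iqr == 0 else [(v - med) / iqr for v in values]

data = [1, 2, 3, 4, 100]
print(min_max(data))
print(standardize(data))
print(robust(data))
```

On this sample the single large value compresses the min-max output of the other four points toward 0, while the robust version keeps them spread between -1 and 0.5, which is the motivation for scaling by the median and IQR.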

cleandata/outlier.py ADDED

@@ -0,0 +1,147 @@
+"""
+Outlier detection and removal
+"""
+import pandas as pd
+import numpy as np
+from typing import List, Optional, Union
+from scipy import stats
+class OutlierDetector:
+    """
+    Class for detecting and handling outliers
+    Examples:
+        >>> detector = OutlierDetector(df)
+        >>> outliers = detector.detect_iqr()
+        >>> df_clean = detector.remove_outliers()
+    """
+    def __init__(self, data: pd.DataFrame):
+        self.df = data.copy()
+        self.outliers_info = {}
+    def detect_iqr(self, columns: Optional[List[str]] = None,
+                   multiplier: float = 1.5) -> dict:
+        """
+        Detect outliers using the IQR method
+        Args:
+            columns: list of columns
+            multiplier: IQR multiplier (default 1.5)
+        """
+        if columns is None:
+            columns = self.df.select_dtypes(include=[np.number]).columns
+        outliers = {}
+        for col in columns:
+            if col not in self.df.columns:
+                continue
+            q1 = self.df[col].quantile(0.25)
+            q3 = self.df[col].quantile(0.75)
+            iqr = q3 - q1
+            lower_bound = q1 - multiplier * iqr
+            upper_bound = q3 + multiplier * iqr
+            mask = (self.df[col] < lower_bound) | (self.df[col] > upper_bound)
+            outliers[col] = {
+                'count': mask.sum(),
+                'indices': self.df.index[mask].tolist(),
+                'lower_bound': lower_bound,
+                'upper_bound': upper_bound
+            }
+        self.outliers_info = outliers
+        return outliers
+    def detect_zscore(self, columns: Optional[List[str]] = None,
+                      threshold: float = 3) -> dict:
+        """
+        Detect outliers using the Z-Score method
+        """
+        if columns is None:
+            columns = self.df.select_dtypes(include=[np.number]).columns
+        outliers = {}
+        for col in columns:
+            if col not in self.df.columns:
+                continue
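+            values = self.df[col].dropna()
+            # np.asarray keeps the mask a plain boolean array whether zscore returns an array or a Series
+            z_scores = np.abs(np.asarray(stats.zscore(values)))
+            mask = z_scores > threshold
+            outliers[col] = {
+                'count': int(mask.sum()),
+                # Index into the NaN-free values so positions line up with the mask
+                'indices': values.index[mask].tolist()
+            }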
+        self.outliers_info = outliers
+        return outliers
+    def remove_outliers(self, columns: Optional[List[str]] = None,
+                        method: str = 'iqr',
+                        threshold: float = 1.5) -> pd.DataFrame:
+        """
+        Remove records that contain outliers
+        """
+        result = self.df.copy()
+        if columns is None:
+            columns = self.df.select_dtypes(include=[np.number]).columns
+        mask = pd.Series([False] * len(result), index=result.index)
+        for col in columns:
+            if col not in self.df.columns:
+                continue
+            if method == 'iqr':
+                q1 = self.df[col].quantile(0.25)
+                q3 = self.df[col].quantile(0.75)
+                iqr = q3 - q1
+                lower = q1 - threshold * iqr
+                upper = q3 + threshold * iqr
+                mask |= (self.df[col] < lower) | (self.df[col] > upper)
+            elif method == 'zscore':
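+                # Score only the non-missing values, then map the result back by position
+                present = self.df[col].notna().to_numpy()
+                col_mask = np.zeros(len(self.df), dtype=bool)
+                col_mask[present] = np.abs(np.asarray(stats.zscore(self.df[col][present]))) > threshold
+                mask |= col_mask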
+        result = result[~mask]
+        return result
+    def replace_outliers(self, columns: Optional[List[str]] = None,
+                         method: str = 'median',
+                         multiplier: float = 1.5) -> pd.DataFrame:
+        """
+        Replace outliers with the mean, the median, or a custom value
+        """
+        result = self.df.copy()
+        if columns is None:
+            columns = self.df.select_dtypes(include=[np.number]).columns
+        for col in columns:
+            if col not in self.df.columns:
+                continue
+            q1 = self.df[col].quantile(0.25)
+            q3 = self.df[col].quantile(0.75)
+            iqr = q3 - q1
+            lower = q1 - multiplier * iqr
+            upper = q3 + multiplier * iqr
+            mask = (self.df[col] < lower) | (self.df[col] > upper)
+            if method == 'mean':
+                replacement = self.df[col].mean()
+            elif method == 'median':
+                replacement = self.df[col].median()
+            else:
+                replacement = method
+            result.loc[mask, col] = replacement
+        return result
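
Reviewer sketch: both detection rules in this file can be checked by hand on a small sample. The plain-Python functions below are hypothetical stand-ins, not the package API. They assume linear-interpolated quartiles (the pandas `quantile` default) and the population standard deviation (ddof=0), which is the default of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev, quantiles

def iqr_outliers(values, multiplier=1.5):
    """Return positions of values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Return positions whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # constant data has no outliers under this rule
    return [i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold]

data = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(data))
print(zscore_outliers(data, threshold=2.0))
print(zscore_outliers(data))
```

The IQR rule flags the value 95, and so does the z-score rule at a threshold of 2. At the default threshold of 3 nothing is flagged: with population standard deviation, no z-score in a sample of n points can exceed sqrt(n - 1), which is about 2.45 for seven points, so the z-score method needs larger samples to be useful.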

cleandata/utils.py ADDED

@@ -0,0 +1,65 @@
+"""
+Helper functions for data quality reporting
+"""
+import pandas as pd
+import numpy as np
+from typing import Dict, Any
+def get_data_quality_report(df: pd.DataFrame) -> Dict[str, Any]:
+    """
+    Build a full data quality report
+    Returns:
+        A dictionary with the number of rows and columns, missing values, duplicates, data types, and more
+    """
+    report = {
+        'shape': {
+            'rows': len(df),
+            'columns': len(df.columns)
+        },
+        'missing': {},
+        'duplicates': {
+            'count': df.duplicated().sum(),
+            'percentage': (df.duplicated().sum() / len(df)) * 100
+        },
+        'data_types': df.dtypes.to_dict(),
+        'statistics': {},
+        'memory_usage': df.memory_usage(deep=True).sum() / 1024  # KB
+    }
+    # Count missing values per column
+    for col in df.columns:
+        missing_count = df[col].isna().sum()
+        report['missing'][col] = {
+            'count': missing_count,
+            'percentage': (missing_count / len(df)) * 100
+        }
+    # Descriptive statistics for numeric columns
+    numeric_cols = df.select_dtypes(include=[np.number]).columns
+    for col in numeric_cols:
+        report['statistics'][col] = {
+            'min': df[col].min(),
+            'max': df[col].max(),
+            'mean': df[col].mean(),
+            'median': df[col].median(),
+            'std': df[col].std(),
+            'unique': df[col].nunique()
+        }
+    return report
+def get_column_info(df: pd.DataFrame, column: str) -> Dict[str, Any]:
+    """Return detailed information about a single column"""
+    if column not in df.columns:
+        raise ValueError(f"Column '{column}' does not exist in the DataFrame")
+    return {
+        'name': column,
+        'dtype': str(df[column].dtype),
+        'missing_count': df[column].isna().sum(),
+        'missing_percentage': (df[column].isna().sum() / len(df)) * 100,
+        'unique_values': df[column].nunique(),
+        'memory_usage': df[column].memory_usage(deep=True) / 1024  # KB
+    }
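
Reviewer sketch: the per-column missing-value summary in `get_data_quality_report` is a count plus a percentage of the row total. The plain-Python version below uses a hypothetical `missing_report` function over a dict of lists, with `None` standing in for a missing value. It also guards against an empty column, a case where the percentage would otherwise divide by zero.

```python
def missing_report(columns):
    """Map each column name to its missing-value count and percentage."""
    report = {}
    for name, values in columns.items():
        total = len(values)
        missing = sum(v is None for v in values)
        # Avoid dividing by zero when a column has no rows at all
        percentage = (missing / total) * 100 if total else 0.0
        report[name] = {"count": missing, "percentage": percentage}
    return report

report = missing_report({
    "age": [30, None, 50, None],
    "name": ["a", "b", "c", "d"],
    "empty": [],
})
print(report)
```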