autosweep-preprocessing 0.1.1__tar.gz

Files changed (12)

autosweep_preprocessing-0.1.1/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

autosweep_preprocessing-0.1.1/PKG-INFO ADDED

@@ -0,0 +1,188 @@
+Metadata-Version: 2.4
+Name: autosweep-preprocessing
+Version: 0.1.1
+Summary: Flexible tabular data preprocessing utility with a single AutoSweep API
+Author: Harsh Kakadiya
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/harsh-kakadiya1/Autosweep
+Project-URL: Repository, https://github.com/harsh-kakadiya1/Autosweep
+Project-URL: Issues, https://github.com/harsh-kakadiya1/Autosweep/issues
+Keywords: preprocessing,machine-learning,feature-engineering,data-cleaning
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Operating System :: OS Independent
+Classifier: Intended Audience :: Science/Research
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas>=1.5
+Requires-Dist: numpy>=1.23
+Requires-Dist: scikit-learn>=1.2
+Requires-Dist: openpyxl>=3.1
+Dynamic: license-file
+# autosweep-preprocessing
+A lightweight preprocessing library built around a single flexible API: `AutoSweep`.
+## Usage
+```python
+from autosweep_preprocessing import AutoSweep
+result = AutoSweep(
+    file_path="data.csv",
+    target_column="target",
+    encode_categorical="onehot",
+    remove_correlated=True,
+    structured_output=True,
+)
+X = result["X"]
+y = result["y"]
+info = result["info"]
+```
+## Function
+`AutoSweep` supports:
+- CSV/Excel loading
+- Missing value handling and imputation
+- Numeric scaling (`standard`, `minmax`, `robust`)
+- Categorical encoding (`onehot`, `ordinal`, `label`)
+- Optional datetime feature extraction
+- Optional outlier handling (`iqr`, `zscore`)
+- Optional correlation and low-variance filtering
+- Structured output for pipeline diagnostics
+## AutoSweep Arguments Guide
+### Required / Core
+- `file_path` (required)
+    - What it does: Path to input dataset (`.csv` or Excel file).
+    - Use case: Point to your raw training file before preprocessing.
+    - Example: `file_path="data/train.csv"`
+- `target_column` (default: `None`)
+    - What it does: Separates target variable from features and returns it as `y`.
+    - Use case: Set this when you want to train/evaluate models after preprocessing.
+    - Example: `target_column="price"`
+### Column cleaning
+- `drop_columns` (default: `None`)
+    - What it does: Drops specific columns by name.
+    - Use case: Remove IDs, leakage columns, or metadata fields.
+    - Example: `drop_columns=["id", "created_at"]`
+- `drop_threshold` (default: `1.0`)
+    - What it does: Drops columns whose missing-value fraction is greater than this threshold.
+    - Use case: Use `0.4`/`0.5` to remove heavily incomplete columns.
+    - Example: `drop_threshold=0.5`
+### Missing values
+- `impute_strategy_num` (default: `'mean'`)
+    - What it does: Numeric imputation strategy.
+    - Allowed: `'mean'`, `'median'`, `'most_frequent'`, `'constant'`, `'knn'`, `'mode'`.
+    - Use case: Use `'median'` for skewed numeric data, `'knn'` for richer local patterns.
+    - Example: `impute_strategy_num="median"`
+- `impute_strategy_cat` (default: `'most_frequent'`)
+    - What it does: Categorical imputation strategy.
+    - Allowed: any `SimpleImputer` categorical strategy (commonly `'most_frequent'`, `'constant'`).
+    - Use case: Use `'most_frequent'` for stable categories.
+    - Example: `impute_strategy_cat="most_frequent"`
+### Scaling and encoding
+- `scaler` (default: `'standard'`)
+    - What it does: Scales numeric features.
+    - Allowed: `'standard'`, `'minmax'`, `'robust'`, or any other value for passthrough.
+    - Use case: Use `'robust'` when outliers are present.
+    - Example: `scaler="robust"`
+- `encode_categorical` (default: `None`)
+    - What it does: Encodes categorical columns.
+    - Allowed: `None`, `'none'`, `'passthrough'`, `'onehot'`, `'ordinal'`, `'label'`.
+    - Use case: Use `'onehot'` for linear/tree models; `'label'` for compact numeric conversion.
+    - Example: `encode_categorical="onehot"`
+### Feature selection
+- `remove_low_variance` (default: `False`)
+    - What it does: Removes low-variance numeric features after preprocessing.
+    - Use case: Enable when many near-constant numeric features exist.
+    - Example: `remove_low_variance=True`
+- `variance_thresh` (default: `0.0`)
+    - What it does: Variance cutoff used by low-variance filtering.
+    - Use case: Increase (e.g., `0.01`) to remove weak/noisy features.
+    - Example: `variance_thresh=0.01`
+- `remove_correlated` (default: `False`)
+    - What it does: Drops highly correlated numeric features.
+    - Use case: Reduce multicollinearity and redundant columns.
+    - Example: `remove_correlated=True`
+- `corr_threshold` (default: `0.95`)
+    - What it does: Absolute correlation threshold for dropping features.
+    - Use case: Use `0.85-0.95` depending on how aggressively you want feature pruning.
+    - Example: `corr_threshold=0.9`
+### Outlier handling
+- `outlier_method` (default: `None`)
+    - What it does: Enables outlier detection.
+    - Allowed: `None`, `'iqr'`, `'zscore'` (also `'z-score'`, `'z_score'`).
+    - Use case: Use `'iqr'` for non-normal data; `'zscore'` for roughly normal distributions.
+    - Example: `outlier_method="iqr"`
+- `outlier_threshold` (default: `1.5`)
+    - What it does: Threshold used by outlier method.
+    - Use case: Increase to keep more rows, decrease to be stricter.
+    - Example: `outlier_threshold=3.0` (common for z-score)
+- `cap_outliers` (default: `False`)
+    - What it does: Caps outliers to bounds instead of dropping rows.
+    - Use case: Set `True` when you want to preserve dataset size.
+    - Example: `cap_outliers=True`
+### Datetime features
+- `extract_datetime` (default: `False`)
+    - What it does: Parses datetime-like columns and extracts year/month/day/weekday/hour.
+    - Use case: Enable when date fields carry predictive signal.
+    - Example: `extract_datetime=True`
+- `drop_datetime_original` (default: `False`)
+    - What it does: Drops original datetime columns after extraction.
+    - Use case: Keep only engineered datetime parts to simplify model input.
+    - Example: `drop_datetime_original=True`
+### Target encoding and output format
+- `target_encode` (default: `False`)
+    - What it does: Applies mean target encoding to categorical features.
+    - Use case: Helpful for high-cardinality categorical variables.
+    - Important: Requires `target_column`; avoid leakage by fitting only on training data in production workflows.
+    - Example: `target_encode=True`
+- `structured_output` (default: `True`)
+    - What it does: Controls return format.
+    - If `True`: returns `{ 'X', 'y', 'feature_names', 'info' }`.
+    - If `False`: returns tuple(s) (`X, y, feature_names` or `X, feature_names`).
+    - Use case: Keep `True` for debugging and pipeline introspection.
+- `verbose` (default: `True`)
+    - What it does: Prints detailed preprocessing diagnostics.
+    - Use case: Set `False` for cleaner logs in training pipelines.
+    - Example: `verbose=False`
+## Notes
+- If you use Excel input, keep `openpyxl` installed.
+- If `target_encode=True`, provide a valid `target_column`.
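
Reviewer note on `outlier_method="iqr"`, `outlier_threshold`, and `cap_outliers`: `core.py` is not shown in this excerpt, so the package's actual outlier logic cannot be confirmed from the diff. The following is only a minimal sketch of the textbook IQR fence rule that the guide appears to describe, assuming `outlier_threshold` acts as the fence multiplier. The helper names `iqr_bounds` and `handle_outliers` are hypothetical and rely only on the standard library.

```python
from statistics import quantiles


def iqr_bounds(values, threshold=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    Hypothetical helper; `threshold` plays the role the guide assigns to
    `outlier_threshold` (an assumption, not confirmed by the source).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


def handle_outliers(values, threshold=1.5, cap=False):
    """Drop values outside the fences, or clip them to the fences if cap=True."""
    lower, upper = iqr_bounds(values, threshold)
    if cap:
        # Capping keeps every row and pulls extreme values back to the fence.
        return [min(max(v, lower), upper) for v in values]
    # Dropping removes every row whose value lies outside the fences.
    return [v for v in values if lower <= v <= upper]


data = [1, 2, 3, 4, 100]
dropped = handle_outliers(data)            # the value 100 is removed
capped = handle_outliers(data, cap=True)   # the value 100 is clipped to the upper fence
```

Dropping shortens the data while capping keeps its length, which is consistent with the guide's advice to set `cap_outliers=True` when the dataset size should be preserved.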

autosweep_preprocessing-0.1.1/README.md ADDED

@@ -0,0 +1,164 @@
+# autosweep-preprocessing
+A lightweight preprocessing library built around a single flexible API: `AutoSweep`.
+## Usage
+```python
+from autosweep_preprocessing import AutoSweep
+result = AutoSweep(
+    file_path="data.csv",
+    target_column="target",
+    encode_categorical="onehot",
+    remove_correlated=True,
+    structured_output=True,
+)
+X = result["X"]
+y = result["y"]
+info = result["info"]
+```
+## Function
+`AutoSweep` supports:
+- CSV/Excel loading
+- Missing value handling and imputation
+- Numeric scaling (`standard`, `minmax`, `robust`)
+- Categorical encoding (`onehot`, `ordinal`, `label`)
+- Optional datetime feature extraction
+- Optional outlier handling (`iqr`, `zscore`)
+- Optional correlation and low-variance filtering
+- Structured output for pipeline diagnostics
+## AutoSweep Arguments Guide
+### Required / Core
+- `file_path` (required)
+    - What it does: Path to input dataset (`.csv` or Excel file).
+    - Use case: Point to your raw training file before preprocessing.
+    - Example: `file_path="data/train.csv"`
+- `target_column` (default: `None`)
+    - What it does: Separates target variable from features and returns it as `y`.
+    - Use case: Set this when you want to train/evaluate models after preprocessing.
+    - Example: `target_column="price"`
+### Column cleaning
+- `drop_columns` (default: `None`)
+    - What it does: Drops specific columns by name.
+    - Use case: Remove IDs, leakage columns, or metadata fields.
+    - Example: `drop_columns=["id", "created_at"]`
+- `drop_threshold` (default: `1.0`)
+    - What it does: Drops columns whose missing-value fraction is greater than this threshold.
+    - Use case: Use `0.4`/`0.5` to remove heavily incomplete columns.
+    - Example: `drop_threshold=0.5`
+### Missing values
+- `impute_strategy_num` (default: `'mean'`)
+    - What it does: Numeric imputation strategy.
+    - Allowed: `'mean'`, `'median'`, `'most_frequent'`, `'constant'`, `'knn'`, `'mode'`.
+    - Use case: Use `'median'` for skewed numeric data, `'knn'` for richer local patterns.
+    - Example: `impute_strategy_num="median"`
+- `impute_strategy_cat` (default: `'most_frequent'`)
+    - What it does: Categorical imputation strategy.
+    - Allowed: any `SimpleImputer` categorical strategy (commonly `'most_frequent'`, `'constant'`).
+    - Use case: Use `'most_frequent'` for stable categories.
+    - Example: `impute_strategy_cat="most_frequent"`
+### Scaling and encoding
+- `scaler` (default: `'standard'`)
+    - What it does: Scales numeric features.
+    - Allowed: `'standard'`, `'minmax'`, `'robust'`, or any other value for passthrough.
+    - Use case: Use `'robust'` when outliers are present.
+    - Example: `scaler="robust"`
+- `encode_categorical` (default: `None`)
+    - What it does: Encodes categorical columns.
+    - Allowed: `None`, `'none'`, `'passthrough'`, `'onehot'`, `'ordinal'`, `'label'`.
+    - Use case: Use `'onehot'` for linear/tree models; `'label'` for compact numeric conversion.
+    - Example: `encode_categorical="onehot"`
+### Feature selection
+- `remove_low_variance` (default: `False`)
+    - What it does: Removes low-variance numeric features after preprocessing.
+    - Use case: Enable when many near-constant numeric features exist.
+    - Example: `remove_low_variance=True`
+- `variance_thresh` (default: `0.0`)
+    - What it does: Variance cutoff used by low-variance filtering.
+    - Use case: Increase (e.g., `0.01`) to remove weak/noisy features.
+    - Example: `variance_thresh=0.01`
+- `remove_correlated` (default: `False`)
+    - What it does: Drops highly correlated numeric features.
+    - Use case: Reduce multicollinearity and redundant columns.
+    - Example: `remove_correlated=True`
+- `corr_threshold` (default: `0.95`)
+    - What it does: Absolute correlation threshold for dropping features.
+    - Use case: Use `0.85-0.95` depending on how aggressively you want feature pruning.
+    - Example: `corr_threshold=0.9`
+### Outlier handling
+- `outlier_method` (default: `None`)
+    - What it does: Enables outlier detection.
+    - Allowed: `None`, `'iqr'`, `'zscore'` (also `'z-score'`, `'z_score'`).
+    - Use case: Use `'iqr'` for non-normal data; `'zscore'` for roughly normal distributions.
+    - Example: `outlier_method="iqr"`
+- `outlier_threshold` (default: `1.5`)
+    - What it does: Threshold used by outlier method.
+    - Use case: Increase to keep more rows, decrease to be stricter.
+    - Example: `outlier_threshold=3.0` (common for z-score)
+- `cap_outliers` (default: `False`)
+    - What it does: Caps outliers to bounds instead of dropping rows.
+    - Use case: Set `True` when you want to preserve dataset size.
+    - Example: `cap_outliers=True`
+### Datetime features
+- `extract_datetime` (default: `False`)
+    - What it does: Parses datetime-like columns and extracts year/month/day/weekday/hour.
+    - Use case: Enable when date fields carry predictive signal.
+    - Example: `extract_datetime=True`
+- `drop_datetime_original` (default: `False`)
+    - What it does: Drops original datetime columns after extraction.
+    - Use case: Keep only engineered datetime parts to simplify model input.
+    - Example: `drop_datetime_original=True`
+### Target encoding and output format
+- `target_encode` (default: `False`)
+    - What it does: Applies mean target encoding to categorical features.
+    - Use case: Helpful for high-cardinality categorical variables.
+    - Important: Requires `target_column`; avoid leakage by fitting only on training data in production workflows.
+    - Example: `target_encode=True`
+- `structured_output` (default: `True`)
+    - What it does: Controls return format.
+    - If `True`: returns `{ 'X', 'y', 'feature_names', 'info' }`.
+    - If `False`: returns tuple(s) (`X, y, feature_names` or `X, feature_names`).
+    - Use case: Keep `True` for debugging and pipeline introspection.
+- `verbose` (default: `True`)
+    - What it does: Prints detailed preprocessing diagnostics.
+    - Use case: Set `False` for cleaner logs in training pipelines.
+    - Example: `verbose=False`
+## Notes
+- If you use Excel input, keep `openpyxl` installed.
+- If `target_encode=True`, provide a valid `target_column`.
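
Reviewer note on `target_encode`: the README warns about leakage but the diff excerpt does not show how the encoding is computed. As a rough sketch of mean target encoding fitted on training rows only, with unseen categories falling back to the global training mean (the fallback rule and the names `fit_target_means` and `apply_target_means` are assumptions for illustration, not the package's API):

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Learn per-category target means from TRAINING rows only.

    Hypothetical helper; returns (means_by_category, global_mean).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(targets) / len(targets)
    means = {category: sums[category] / counts[category] for category in sums}
    return means, global_mean


def apply_target_means(categories, means, global_mean):
    """Encode categories with the learned means; unseen ones get the global mean."""
    return [means.get(category, global_mean) for category in categories]


# Fit on the training split, then reuse the same mapping on held-out rows.
train_city = ["a", "a", "b"]
train_price = [1.0, 3.0, 10.0]
means, global_mean = fit_target_means(train_city, train_price)

test_city = ["a", "b", "c"]  # "c" never appeared during training
encoded = apply_target_means(test_city, means, global_mean)
```

Because the mapping is learned before the held-out rows are seen, their targets never influence their own encoded values, which is the leakage concern the README raises.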

autosweep_preprocessing-0.1.1/pyproject.toml ADDED

@@ -0,0 +1,39 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "autosweep-preprocessing"
+version = "0.1.1"
+description = "Flexible tabular data preprocessing utility with a single AutoSweep API"
+readme = "README.md"
+requires-python = ">=3.9"
+license = "MIT"
+authors = [
+  { name = "Harsh Kakadiya" }
+]
+keywords = ["preprocessing", "machine-learning", "feature-engineering", "data-cleaning"]
+classifiers = [
+  "Programming Language :: Python :: 3",
+  "Programming Language :: Python :: 3 :: Only",
+  "Operating System :: OS Independent",
+  "Intended Audience :: Science/Research",
+  "Topic :: Scientific/Engineering :: Artificial Intelligence"
+]
+dependencies = [
+  "pandas>=1.5",
+  "numpy>=1.23",
+  "scikit-learn>=1.2",
+  "openpyxl>=3.1"
+]
+
+[project.urls]
+Homepage = "https://github.com/harsh-kakadiya1/Autosweep"
+Repository = "https://github.com/harsh-kakadiya1/Autosweep"
+Issues = "https://github.com/harsh-kakadiya1/Autosweep/issues"
+
+[tool.setuptools]
+package-dir = {"" = "src"}
+
+[tool.setuptools.packages.find]
+where = ["src"]
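
Reviewer note on the `dependencies` list: to see whether an environment meets the declared minimums before installing, something like the sketch below may help. It uses the standard-library `importlib.metadata` and only handles the simple `name>=X.Y` form used in this `pyproject.toml`. The helper names are hypothetical, and the version comparison is deliberately crude (leading numeric components only), so treat it as a quick check, not a substitute for a real resolver.

```python
import re
from importlib import metadata

# Minimum versions copied from the `dependencies` list in pyproject.toml.
REQUIREMENTS = ["pandas>=1.5", "numpy>=1.23", "scikit-learn>=1.2", "openpyxl>=3.1"]


def parse_requirement(spec):
    """Split a simple 'name>=X.Y' requirement into (name, (X, Y))."""
    name, _, minimum = spec.partition(">=")
    return name.strip(), tuple(int(part) for part in minimum.strip().split("."))


def leading_ints(version, count):
    """Return the first `count` numeric components of a version string."""
    parts = []
    for piece in version.split(".")[:count]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def check_requirements(specs):
    """Map each package name to 'ok', 'missing', or 'too old (<version>)'."""
    report = {}
    for spec in specs:
        name, minimum = parse_requirement(spec)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        is_ok = leading_ints(installed, len(minimum)) >= minimum
        report[name] = "ok" if is_ok else f"too old ({installed})"
    return report


report = check_requirements(REQUIREMENTS)
for package, status in report.items():
    print(f"{package}: {status}")
```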

autosweep_preprocessing-0.1.1/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0
+

autosweep_preprocessing-0.1.1/src/autosweep_preprocessing/__init__.py ADDED

@@ -0,0 +1,3 @@
+from .core import AutoSweep
+
+__all__ = ["AutoSweep"]