PyPI - unslml - Versions diffs - 0.1.0__tar.gz - Mend

unslml 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

unslml-0.1.0/PKG-INFO +86 -0
unslml-0.1.0/README.md +73 -0
unslml-0.1.0/pyproject.toml +20 -0
unslml-0.1.0/setup.cfg +4 -0
unslml-0.1.0/setup.py +16 -0
unslml-0.1.0/unslml/__init__.py +1 -0
unslml-0.1.0/unslml/automl.py +144 -0
unslml-0.1.0/unslml.egg-info/PKG-INFO +86 -0
unslml-0.1.0/unslml.egg-info/SOURCES.txt +10 -0
unslml-0.1.0/unslml.egg-info/dependency_links.txt +1 -0
unslml-0.1.0/unslml.egg-info/requires.txt +4 -0
unslml-0.1.0/unslml.egg-info/top_level.txt +1 -0

unslml-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,86 @@
+Metadata-Version: 2.4
+Name: unslml
+Version: 0.1.0
+Summary: Simple AutoML library for classification and regression
+Author: Naveen
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: pandas
+Requires-Dist: numpy
+Requires-Dist: scikit-learn
+Requires-Dist: joblib
+Dynamic: requires-python
+# UNSLML
+A state-of-the-art, robust, and highly accurate **AutoML and Machine Learning Library** in Python.
+`unslml` automatically detects task types (classification or regression), performs stateful feature engineering, filters extreme outliers, conducts hyperparameter searches, and provides simple one-line model saving and loading.
+---
+## 🌟 Key Features
+* **Auto-Task Detection**: Automatically detects whether your target is a classification or regression task based on target column datatypes.
+* **Smart Numeric Text Parser**: Automatically extracts numerical values from string columns that represent measurements or values (e.g., `"1200 sqft"` -> `1200.0`, `"42 Lac"` -> `4,200,000.0`, `"1.40 Cr"` -> `14,000,000.0`).
+* **Robust Outlier Filtering**: Automatically identifies and filters extreme target outliers in regression (e.g. data entry typos) to prevent metric skew.
+* **Stateful Preprocessing**: Saves imputations and categorical mapping encodings during training to ensure identical transformation on test/prediction sets.
+* **Auto-Hyperparameter Tuning**: Performs grid search cross-validation across multiple standard estimators (Logistic/Linear Regression, Decision Trees, Random Forests, KNN).
+* **Smart Performance Scaling**: Sub-samples extremely large datasets during the parameter search phase to run in seconds rather than hours.
+* **Pipeline Serialization (Save & Load)**: Prompts you to save the entire pipeline state to a `.pkl` file at the end of training, which can be loaded back with a single line of code.
+---
+## 🚀 Installation
+Install the library directly from PyPI using pip:
+```bash
+pip install unslml
+```
+---
+## 💻 How to Use
+### 1. Training & Auto-Saving a Pipeline
+Create a script (e.g., `train.py`) to fit the model. The fitting process automatically runs preprocessing, tunes multiple models, reports evaluation scores, and prompts you to save the best model:
+```python
+from unslml import AutoML
+# Initialize AutoML pipeline
+ml = AutoML()
+# Fit model (auto-detects task type, handles preprocessing & fits best model)
+ml.fit(
+    file="house_prices.csv",
+    target="Price (in rupees)"
+)
+# Prompt: "Enter the file path to save the best model (default: best_model.pkl): "
+```
+### 2. Loading & Predicting on Unseen Data
+You can load the saved `.pkl` file (which contains the best model, categorical mappings, and median values) and predict on raw, unprocessed pandas DataFrames:
+```python
+import pandas as pd
+from unslml import AutoML
+# Load the entire trained pipeline
+ml_loaded = AutoML.load("best_model.pkl")
+# New raw sample data to predict
+new_houses = pd.DataFrame({
+    'location': ['location_name'],
+    'Bathroom': [2],
+    'Balcony': [1.0],
+    'facing': ['North'],
+    'Furnishing': ['Semi-Furnished'],
+    'Transaction': ['Resale']
+})
+# Make predictions directly (preprocessing is applied automatically)
+predictions = ml_loaded.predict(new_houses)
+print("Predicted Prices:", predictions)
+```
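
The README embedded above describes "stateful preprocessing": imputation values and categorical mappings are learned on the training data and reused at prediction time. The module that does this (`preprocessing/preprocess.py`) is not part of this diff, so the following is only a minimal sketch of the idea under stated assumptions. The function name `preprocess_sketch` and the layout of the `state` dictionary are hypothetical; only the `(frame, state)` return shape is modeled on how `automl.py` calls `preprocess_data`.

```python
import pandas as pd


def preprocess_sketch(df, state=None):
    """Hypothetical stand-in for a stateful preprocessor (not the package's code).

    With state=None the medians and category codes are learned from df.
    With a state supplied, the stored values are reused unchanged.
    """
    df = df.copy()
    if state is None:
        state = {"medians": {}, "categories": {}}
        for col in df.columns:
            if pd.api.types.is_numeric_dtype(df[col]):
                state["medians"][col] = df[col].median()
            else:
                values = sorted(df[col].dropna().unique())
                state["categories"][col] = {v: i for i, v in enumerate(values)}

    # Numeric columns: coerce to numbers, then fill gaps with the training median.
    for col, median in state["medians"].items():
        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(median)

    # Categorical columns: apply the training mapping; unseen or missing values become -1.
    for col, mapping in state["categories"].items():
        df[col] = df[col].map(mapping).fillna(-1).astype(int)

    return df, state


train = pd.DataFrame({"area": [100.0, None, 300.0], "facing": ["North", "East", None]})
train_out, state = preprocess_sketch(train)

# At prediction time the stored state is passed back in, so nothing is re-learned.
new = pd.DataFrame({"area": [None], "facing": ["West"]})
new_out, _ = preprocess_sketch(new, state)
```

Passing the learned state back in is what keeps the transformation identical between training and prediction; mapping unseen categories to `-1` is one possible choice, not necessarily what the package does.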

unslml-0.1.0/README.md ADDED

@@ -0,0 +1,73 @@
+# UNSLML
+A state-of-the-art, robust, and highly accurate **AutoML and Machine Learning Library** in Python.
+`unslml` automatically detects task types (classification or regression), performs stateful feature engineering, filters extreme outliers, conducts hyperparameter searches, and provides simple one-line model saving and loading.
+---
+## 🌟 Key Features
+* **Auto-Task Detection**: Automatically detects whether your target is a classification or regression task based on target column datatypes.
+* **Smart Numeric Text Parser**: Automatically extracts numerical values from string columns that represent measurements or values (e.g., `"1200 sqft"` -> `1200.0`, `"42 Lac"` -> `4,200,000.0`, `"1.40 Cr"` -> `14,000,000.0`).
+* **Robust Outlier Filtering**: Automatically identifies and filters extreme target outliers in regression (e.g. data entry typos) to prevent metric skew.
+* **Stateful Preprocessing**: Saves imputations and categorical mapping encodings during training to ensure identical transformation on test/prediction sets.
+* **Auto-Hyperparameter Tuning**: Performs grid search cross-validation across multiple standard estimators (Logistic/Linear Regression, Decision Trees, Random Forests, KNN).
+* **Smart Performance Scaling**: Sub-samples extremely large datasets during the parameter search phase to run in seconds rather than hours.
+* **Pipeline Serialization (Save & Load)**: Prompts you to save the entire pipeline state to a `.pkl` file at the end of training, which can be loaded back with a single line of code.
+---
+## 🚀 Installation
+Install the library directly from PyPI using pip:
+```bash
+pip install unslml
+```
+---
+## 💻 How to Use
+### 1. Training & Auto-Saving a Pipeline
+Create a script (e.g., `train.py`) to fit the model. The fitting process automatically runs preprocessing, tunes multiple models, reports evaluation scores, and prompts you to save the best model:
+```python
+from unslml import AutoML
+# Initialize AutoML pipeline
+ml = AutoML()
+# Fit model (auto-detects task type, handles preprocessing & fits best model)
+ml.fit(
+    file="house_prices.csv",
+    target="Price (in rupees)"
+)
+# Prompt: "Enter the file path to save the best model (default: best_model.pkl): "
+```
+### 2. Loading & Predicting on Unseen Data
+You can load the saved `.pkl` file (which contains the best model, categorical mappings, and median values) and predict on raw, unprocessed pandas DataFrames:
+```python
+import pandas as pd
+from unslml import AutoML
+# Load the entire trained pipeline
+ml_loaded = AutoML.load("best_model.pkl")
+# New raw sample data to predict
+new_houses = pd.DataFrame({
+    'location': ['location_name'],
+    'Bathroom': [2],
+    'Balcony': [1.0],
+    'facing': ['North'],
+    'Furnishing': ['Semi-Furnished'],
+    'Transaction': ['Resale']
+})
+# Make predictions directly (preprocessing is applied automatically)
+predictions = ml_loaded.predict(new_houses)
+print("Predicted Prices:", predictions)
+```
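
The "Smart Numeric Text Parser" feature listed in the README converts strings such as `"1200 sqft"`, `"42 Lac"`, and `"1.40 Cr"` into floats. The parser itself is not included in this diff, so the sketch below only illustrates one way the documented conversions could be produced. The names `parse_numeric_text` and `UNIT_MULTIPLIERS`, and the regular expression, are assumptions made for illustration.

```python
import re

# Assumed multipliers for Indian numbering units:
# 1 Lac (lakh) = 100,000 and 1 Cr (crore) = 10,000,000.
UNIT_MULTIPLIERS = {"lac": 1e5, "lakh": 1e5, "cr": 1e7, "crore": 1e7}

# A leading number (optionally signed, with thousands separators), then an optional unit word.
_NUMBER_WITH_UNIT = re.compile(r"^\s*([-+]?\d[\d,]*\.?\d*)\s*([A-Za-z.]*)")


def parse_numeric_text(text):
    """Hypothetical parser: return a float, or None if text has no leading number."""
    match = _NUMBER_WITH_UNIT.match(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower().rstrip(".")
    # Units without a known multiplier (e.g. "sqft") leave the number unchanged.
    return number * UNIT_MULTIPLIERS.get(unit, 1.0)


for raw in ["1200 sqft", "42 Lac", "1.40 Cr", "Semi-Furnished"]:
    print(raw, "->", parse_numeric_text(raw))
```

Returning `None` for strings without a leading number lets a caller decide whether a column is mostly numeric before converting it, similar in spirit to the 50% sampling check that `automl.py` performs with `is_genuine_numeric_string`.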

unslml-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,20 @@
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "unslml"
+version = "0.1.0"
+authors = [
+  { name="Naveen" }
+]
+description = "Simple AutoML library for classification and regression"
+readme = "README.md"
+requires-python = ">=3.9"
+dependencies = [
+    "pandas",
+    "numpy",
+    "scikit-learn",
+    "joblib"
+]

unslml-0.1.0/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

unslml-0.1.0/setup.py ADDED

@@ -0,0 +1,16 @@
+from setuptools import setup, find_packages
+setup(
+    name="unslml",
+    version="0.1.0",
+    packages=find_packages(),
+    install_requires=[
+        "pandas>=1.3.0",
+        "numpy>=1.20.0",
+        "scikit-learn>=1.0.0"
+    ],
+    author="Galen",
+    description="AutoML and Machine Learning Library",
+    python_requires=">=3.8",
+)

unslml-0.1.0/unslml/__init__.py ADDED

@@ -0,0 +1 @@
+from .automl import AutoML

unslml-0.1.0/unslml/automl.py ADDED

@@ -0,0 +1,144 @@
+from .data.loader import load_data
+from .preprocessing.preprocess import preprocess_data, is_genuine_numeric_string
+from .models.classifier import train_models
+from sklearn.model_selection import train_test_split
+import pandas as pd
+class AutoML:
+    def fit(self, file, target):
+        # Load and clean target column
+        self.df = load_data(file)
+        self.df = self.df.dropna(subset=[target])
+        self.target = target
+        self.X = self.df.drop(self.target, axis=1)
+        self.y = self.df[self.target]
+        # Detect task type: classification or regression
+        if pd.api.types.is_numeric_dtype(self.y):
+            if pd.api.types.is_float_dtype(self.y) or self.y.nunique() > 20:
+                self.task_type = "regression"
+            else:
+                self.task_type = "classification"
+        else:
+            self.task_type = "classification"
+        print(f"Detected task type: {self.task_type}")
+        # Drop extreme target outliers for regression
+        if self.task_type == "regression":
+            q_99 = self.y.quantile(0.99)
+            if self.y.max() > 10 * q_99:
+                print(f"Filtering out extreme target outliers above {10 * q_99}")
+                valid_idx = self.y[self.y <= 10 * q_99].index
+                self.df = self.df.loc[valid_idx]
+                self.X = self.X.loc[valid_idx]
+                self.y = self.y.loc[valid_idx]
+        # Identify high-cardinality columns before preprocessing (when they are still string types)
+        # but skip those that can be parsed as numeric.
+        self.cols_to_drop = []
+        for col in self.X.columns:
+            if pd.api.types.is_string_dtype(self.X[col]) and self.X[col].nunique() > 200:
+                non_nulls = self.X[col].dropna()
+                if len(non_nulls) > 0:
+                    sample = non_nulls.head(100)
+                    is_numeric_sample = sample.apply(is_genuine_numeric_string)
+                    if is_numeric_sample.sum() / len(sample) > 0.5:
+                        continue
+                self.cols_to_drop.append(col)
+        if self.cols_to_drop:
+            print(f"Dropping high-cardinality categorical columns: {self.cols_to_drop}")
+            self.X = self.X.drop(columns=self.cols_to_drop)
+        # Train/Test Split
+        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
+            self.X, self.y, test_size=0.2, random_state=42
+        )
+        # Stateful Preprocessing
+        self.X_train, self.encoders = preprocess_data(self.X_train.copy())
+        self.X_test, _ = preprocess_data(self.X_test.copy(), self.encoders)
+        # Performance optimization: sample down if dataset is very large
+        X_train_fit = self.X_train
+        y_train_fit = self.y_train
+        if len(self.X_train) > 10000:
+            print(f"Training set has {len(self.X_train)} samples. Sampling 10,000 for tuning...")
+            sample_idx = self.X_train.sample(n=10000, random_state=42).index
+            X_train_fit = self.X_train.loc[sample_idx]
+            y_train_fit = self.y_train.loc[sample_idx]
+        X_test_fit = self.X_test
+        y_test_fit = self.y_test
+        if len(self.X_test) > 2000:
+            print(f"Test set has {len(self.X_test)} samples. Sampling 2,000 for evaluation...")
+            sample_idx = self.X_test.sample(n=2000, random_state=42).index
+            X_test_fit = self.X_test.loc[sample_idx]
+            y_test_fit = self.y_test.loc[sample_idx]
+        # Fit models
+        self.results, self.best_models = train_models(
+            X_train_fit, X_test_fit, y_train_fit, y_test_fit, task_type=self.task_type
+        )
+        print("\nModel Results\n")
+        metric_name = "R2 Score" if self.task_type == "regression" else "Accuracy"
+        for name, score in self.results.items():
+            print(f"{name}: {score:.4f} ({metric_name})")
+        self.best_model = max(self.results, key=self.results.get)
+        print(f"\nBest Model: {self.best_model}")
+        # Prompt user to save the model
+        try:
+            save_path = input("\nEnter the file path to save the best model (default: best_model.pkl): ").strip()
+            if not save_path:
+                save_path = "best_model.pkl"
+        except (IOError, EOFError):
+            save_path = "best_model.pkl"
+        self.save(save_path)
+    def predict(self, X):
+        X_proc = X.drop(columns=self.cols_to_drop, errors='ignore')
+        X_proc, _ = preprocess_data(X_proc.copy(), self.encoders)
+        return self.best_models[self.best_model].predict(X_proc)
+    def save(self, filepath):
+        """
+        Save the entire AutoML pipeline (best model, encoders, and config) to a pickle file.
+        """
+        import pickle
+        pipeline_data = {
+            "model": self.best_models[self.best_model],
+            "encoders": self.encoders,
+            "task_type": self.task_type,
+            "cols_to_drop": self.cols_to_drop,
+            "target": self.target,
+            "best_model_name": self.best_model
+        }
+        with open(filepath, "wb") as f:
+            pickle.dump(pipeline_data, f)
+        print(f"Pipeline saved successfully to {filepath}")
+    @classmethod
+    def load(cls, filepath):
+        """
+        Load a saved AutoML pipeline from a pickle file.
+        """
+        import pickle
+        with open(filepath, "rb") as f:
+            data = pickle.load(f)
+        instance = cls()
+        instance.best_model = data["best_model_name"]
+        instance.best_models = {instance.best_model: data["model"]}
+        instance.encoders = data["encoders"]
+        instance.task_type = data["task_type"]
+        instance.cols_to_drop = data["cols_to_drop"]
+        instance.target = data["target"]
+        return instance
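
The task-detection and outlier-filtering rules inside `fit()` above can be read in isolation. The sketch below restates them as two standalone functions so the thresholds (20 distinct values, 10 times the 99th percentile) are easier to see and test. The function names and the keyword parameters are hypothetical; the logic is intended to mirror the diff, not to replace it.

```python
import pandas as pd


def detect_task_type(y, max_classes=20):
    """Mirror of the rule in fit(): float targets, or numeric targets with
    more than max_classes distinct values, are treated as regression."""
    if pd.api.types.is_numeric_dtype(y):
        if pd.api.types.is_float_dtype(y) or y.nunique() > max_classes:
            return "regression"
    return "classification"


def filter_extreme_targets(y, factor=10, quantile=0.99):
    """Mirror of the regression outlier rule in fit(): if the maximum exceeds
    factor * the given quantile, keep only targets at or below that cutoff."""
    cutoff = factor * y.quantile(quantile)
    if y.max() > cutoff:
        return y[y <= cutoff]
    return y


labels = pd.Series([0, 1, 0, 1])
prices = pd.Series([100.0] * 999 + [1e9])  # one data-entry style outlier

print(detect_task_type(labels))
print(detect_task_type(prices))
print(len(filter_extreme_targets(prices)))
```

One consequence of the rule as written is that an integer-coded target with more than 20 classes is treated as regression, and a float-typed label column (for example `0.0`/`1.0`) is treated as regression as well.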

unslml-0.1.0/unslml.egg-info/PKG-INFO ADDED

@@ -0,0 +1,86 @@
+Metadata-Version: 2.4
+Name: unslml
+Version: 0.1.0
+Summary: Simple AutoML library for classification and regression
+Author: Naveen
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: pandas
+Requires-Dist: numpy
+Requires-Dist: scikit-learn
+Requires-Dist: joblib
+Dynamic: requires-python
+# UNSLML
+A state-of-the-art, robust, and highly accurate **AutoML and Machine Learning Library** in Python.
+`unslml` automatically detects task types (classification or regression), performs stateful feature engineering, filters extreme outliers, conducts hyperparameter searches, and provides simple one-line model saving and loading.
+---
+## 🌟 Key Features
+* **Auto-Task Detection**: Automatically detects whether your target is a classification or regression task based on target column datatypes.
+* **Smart Numeric Text Parser**: Automatically extracts numerical values from string columns that represent measurements or values (e.g., `"1200 sqft"` -> `1200.0`, `"42 Lac"` -> `4,200,000.0`, `"1.40 Cr"` -> `14,000,000.0`).
+* **Robust Outlier Filtering**: Automatically identifies and filters extreme target outliers in regression (e.g. data entry typos) to prevent metric skew.
+* **Stateful Preprocessing**: Saves imputations and categorical mapping encodings during training to ensure identical transformation on test/prediction sets.
+* **Auto-Hyperparameter Tuning**: Performs grid search cross-validation across multiple standard estimators (Logistic/Linear Regression, Decision Trees, Random Forests, KNN).
+* **Smart Performance Scaling**: Sub-samples extremely large datasets during the parameter search phase to run in seconds rather than hours.
+* **Pipeline Serialization (Save & Load)**: Prompts you to save the entire pipeline state to a `.pkl` file at the end of training, which can be loaded back with a single line of code.
+---
+## 🚀 Installation
+Install the library directly from PyPI using pip:
+```bash
+pip install unslml
+```
+---
+## 💻 How to Use
+### 1. Training & Auto-Saving a Pipeline
+Create a script (e.g., `train.py`) to fit the model. The fitting process automatically runs preprocessing, tunes multiple models, reports evaluation scores, and prompts you to save the best model:
+```python
+from unslml import AutoML
+# Initialize AutoML pipeline
+ml = AutoML()
+# Fit model (auto-detects task type, handles preprocessing & fits best model)
+ml.fit(
+    file="house_prices.csv",
+    target="Price (in rupees)"
+)
+# Prompt: "Enter the file path to save the best model (default: best_model.pkl): "
+```
+### 2. Loading & Predicting on Unseen Data
+You can load the saved `.pkl` file (which contains the best model, categorical mappings, and median values) and predict on raw, unprocessed pandas DataFrames:
+```python
+import pandas as pd
+from unslml import AutoML
+# Load the entire trained pipeline
+ml_loaded = AutoML.load("best_model.pkl")
+# New raw sample data to predict
+new_houses = pd.DataFrame({
+    'location': ['location_name'],
+    'Bathroom': [2],
+    'Balcony': [1.0],
+    'facing': ['North'],
+    'Furnishing': ['Semi-Furnished'],
+    'Transaction': ['Resale']
+})
+# Make predictions directly (preprocessing is applied automatically)
+predictions = ml_loaded.predict(new_houses)
+print("Predicted Prices:", predictions)
+```

unslml-0.1.0/unslml.egg-info/SOURCES.txt ADDED

@@ -0,0 +1,10 @@
+README.md
+pyproject.toml
+setup.py
+unslml/__init__.py
+unslml/automl.py
+unslml.egg-info/PKG-INFO
+unslml.egg-info/SOURCES.txt
+unslml.egg-info/dependency_links.txt
+unslml.egg-info/requires.txt
+unslml.egg-info/top_level.txt

unslml-0.1.0/unslml.egg-info/dependency_links.txt ADDED

@@ -0,0 +1 @@
+

unslml-0.1.0/unslml.egg-info/requires.txt ADDED

@@ -0,0 +1,4 @@
+pandas
+numpy
+scikit-learn
+joblib

unslml-0.1.0/unslml.egg-info/top_level.txt ADDED

@@ -0,0 +1 @@
+unslml