unslml-0.1.0.tar.gz

unslml-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,86 @@
+ Metadata-Version: 2.4
+ Name: unslml
+ Version: 0.1.0
+ Summary: Simple AutoML library for classification and regression
+ Author: Naveen
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ Requires-Dist: pandas
+ Requires-Dist: numpy
+ Requires-Dist: scikit-learn
+ Requires-Dist: joblib
+ Dynamic: requires-python
+
+ # UNSLML
+
+ An **AutoML and Machine Learning Library** for Python.
+
+ `unslml` automatically detects task types (classification or regression), performs stateful feature engineering, filters extreme outliers, conducts hyperparameter searches, and provides simple one-line model saving and loading.
+
+ ---
+
+ ## 🌟 Key Features
+
+ * **Auto-Task Detection**: Automatically detects whether your target is a classification or regression task: non-numeric targets are treated as classification, and numeric targets as regression when they are floats or have more than 20 unique values.
+ * **Smart Numeric Text Parser**: Automatically extracts numerical values from string columns that represent measurements or values (e.g., `"1200 sqft"` -> `1200.0`, `"42 Lac"` -> `4,200,000.0`, `"1.40 Cr"` -> `14,000,000.0`).
+ * **Robust Outlier Filtering**: In regression, drops rows whose target exceeds 10 times the 99th percentile (e.g. data entry typos) to prevent metric skew.
+ * **Stateful Preprocessing**: Saves imputations and categorical mapping encodings during training to ensure identical transformation on test/prediction sets.
+ * **Auto-Hyperparameter Tuning**: Performs grid search cross-validation across multiple standard estimators (Logistic/Linear Regression, Decision Trees, Random Forests, KNN).
+ * **Smart Performance Scaling**: Sub-samples training sets larger than 10,000 rows down to 10,000, and test sets larger than 2,000 rows down to 2,000, before model search and evaluation to keep runtime short.
+ * **Pipeline Serialization (Save & Load)**: Prompts you to save the entire pipeline state to a `.pkl` file at the end of training, which can be loaded back with a single line of code.
+
+ ---
+
+ ## 🚀 Installation
+
+ Install the library directly from PyPI using pip:
+
+ ```bash
+ pip install unslml
+ ```
+
+ ---
+
+ ## 💻 How to Use
+
+ ### 1. Training & Auto-Saving a Pipeline
+ Create a script (e.g., `train.py`) to fit the model. The fitting process automatically runs preprocessing, tunes multiple models, reports evaluation scores, and prompts you to save the best model:
+
+ ```python
+ from unslml import AutoML
+
+ # Initialize AutoML pipeline
+ ml = AutoML()
+
+ # Fit model (auto-detects task type, handles preprocessing & fits best model)
+ ml.fit(
+     file="house_prices.csv",
+     target="Price (in rupees)"
+ )
+ # Prompt: "Enter the file path to save the best model (default: best_model.pkl): "
+ ```
+
+ ### 2. Loading & Predicting on Unseen Data
+ You can load the saved `.pkl` file (which contains the best model, categorical mappings, and median values) and predict on raw, unprocessed pandas DataFrames:
+
+ ```python
+ import pandas as pd
+ from unslml import AutoML
+
+ # Load the entire trained pipeline
+ ml_loaded = AutoML.load("best_model.pkl")
+
+ # New raw sample data to predict
+ new_houses = pd.DataFrame({
+     'location': ['location_name'],
+     'Bathroom': [2],
+     'Balcony': [1.0],
+     'facing': ['North'],
+     'Furnishing': ['Semi-Furnished'],
+     'Transaction': ['Resale']
+ })
+
+ # Make predictions directly (preprocessing is applied automatically)
+ predictions = ml_loaded.predict(new_houses)
+ print("Predicted Prices:", predictions)
+ ```
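The `"42 Lac"` and `"1.40 Cr"` conversions listed under Key Features are not backed by code in this diff: `automl.py` imports from `unslml.preprocessing`, but that module is not among the packaged sources. The following is therefore only a minimal sketch of how such a parser could work. The name `parse_numeric_text`, the `UNIT_MULTIPLIERS` table, and the regular expression are assumptions made for illustration, not the package's implementation.

```python
import re
from typing import Optional

# Hypothetical multiplier table for Indian numbering units (assumption:
# the package's real table is not shown in this diff).
UNIT_MULTIPLIERS = {"lac": 1e5, "lakh": 1e5, "cr": 1e7, "crore": 1e7}

# A leading number (optional sign, thousands separators, decimals)
# followed by an optional alphabetic unit.
_NUMBER_WITH_UNIT = re.compile(r"^\s*([-+]?\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]*)")


def parse_numeric_text(value: str) -> Optional[float]:
    """Return the numeric value encoded in `value`, or None if it has none."""
    match = _NUMBER_WITH_UNIT.match(value)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower()
    # Units without a known multiplier (e.g. "sqft") leave the number unchanged.
    return number * UNIT_MULTIPLIERS.get(unit, 1.0)


print(parse_numeric_text("1200 sqft"))  # 1200.0
print(parse_numeric_text("42 Lac"))     # 4200000.0
print(parse_numeric_text("1.40 Cr"))    # approximately 14,000,000
print(parse_numeric_text("North"))      # None
```

Returning `None` for strings without a leading number lets a caller decide whether a column is mostly numeric before converting it, which is the kind of check `is_genuine_numeric_string` appears to support in `automl.py`.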
unslml-0.1.0/README.md ADDED
@@ -0,0 +1,73 @@
+ # UNSLML
+
+ An **AutoML and Machine Learning Library** for Python.
+
+ `unslml` automatically detects task types (classification or regression), performs stateful feature engineering, filters extreme outliers, conducts hyperparameter searches, and provides simple one-line model saving and loading.
+
+ ---
+
+ ## 🌟 Key Features
+
+ * **Auto-Task Detection**: Automatically detects whether your target is a classification or regression task: non-numeric targets are treated as classification, and numeric targets as regression when they are floats or have more than 20 unique values.
+ * **Smart Numeric Text Parser**: Automatically extracts numerical values from string columns that represent measurements or values (e.g., `"1200 sqft"` -> `1200.0`, `"42 Lac"` -> `4,200,000.0`, `"1.40 Cr"` -> `14,000,000.0`).
+ * **Robust Outlier Filtering**: In regression, drops rows whose target exceeds 10 times the 99th percentile (e.g. data entry typos) to prevent metric skew.
+ * **Stateful Preprocessing**: Saves imputations and categorical mapping encodings during training to ensure identical transformation on test/prediction sets.
+ * **Auto-Hyperparameter Tuning**: Performs grid search cross-validation across multiple standard estimators (Logistic/Linear Regression, Decision Trees, Random Forests, KNN).
+ * **Smart Performance Scaling**: Sub-samples training sets larger than 10,000 rows down to 10,000, and test sets larger than 2,000 rows down to 2,000, before model search and evaluation to keep runtime short.
+ * **Pipeline Serialization (Save & Load)**: Prompts you to save the entire pipeline state to a `.pkl` file at the end of training, which can be loaded back with a single line of code.
+
+ ---
+
+ ## 🚀 Installation
+
+ Install the library directly from PyPI using pip:
+
+ ```bash
+ pip install unslml
+ ```
+
+ ---
+
+ ## 💻 How to Use
+
+ ### 1. Training & Auto-Saving a Pipeline
+ Create a script (e.g., `train.py`) to fit the model. The fitting process automatically runs preprocessing, tunes multiple models, reports evaluation scores, and prompts you to save the best model:
+
+ ```python
+ from unslml import AutoML
+
+ # Initialize AutoML pipeline
+ ml = AutoML()
+
+ # Fit model (auto-detects task type, handles preprocessing & fits best model)
+ ml.fit(
+     file="house_prices.csv",
+     target="Price (in rupees)"
+ )
+ # Prompt: "Enter the file path to save the best model (default: best_model.pkl): "
+ ```
+
+ ### 2. Loading & Predicting on Unseen Data
+ You can load the saved `.pkl` file (which contains the best model, categorical mappings, and median values) and predict on raw, unprocessed pandas DataFrames:
+
+ ```python
+ import pandas as pd
+ from unslml import AutoML
+
+ # Load the entire trained pipeline
+ ml_loaded = AutoML.load("best_model.pkl")
+
+ # New raw sample data to predict
+ new_houses = pd.DataFrame({
+     'location': ['location_name'],
+     'Bathroom': [2],
+     'Balcony': [1.0],
+     'facing': ['North'],
+     'Furnishing': ['Semi-Furnished'],
+     'Transaction': ['Resale']
+ })
+
+ # Make predictions directly (preprocessing is applied automatically)
+ predictions = ml_loaded.predict(new_houses)
+ print("Predicted Prices:", predictions)
+ ```
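The README's "Stateful Preprocessing" claim corresponds to the call pattern in `automl.py`, where `preprocess_data` is called once without state on the training split and then again with the returned `encoders` on the test split and at prediction time. Since `preprocess_data` itself is not in this diff, the sketch below only illustrates that fit-then-reuse pattern. The function name `preprocess_sketch`, the layout of the state dictionary, and the `-1` code for unseen categories are all assumptions.

```python
import pandas as pd


def preprocess_sketch(df, state=None):
    """Impute and encode `df`; learn the state when `state` is None."""
    fitting = state is None
    if fitting:
        state = {"medians": {}, "mappings": {}}
    out = df.copy()
    for col in out.columns:
        if fitting:
            # Learn per-column statistics from the training data only.
            if pd.api.types.is_numeric_dtype(out[col]):
                state["medians"][col] = float(out[col].median())
            else:
                categories = sorted(out[col].dropna().unique())
                state["mappings"][col] = {c: i for i, c in enumerate(categories)}
        # Apply the stored statistics (assumed behavior: unseen categories -> -1).
        if col in state["medians"]:
            out[col] = out[col].fillna(state["medians"][col])
        elif col in state["mappings"]:
            out[col] = out[col].map(state["mappings"][col]).fillna(-1).astype(int)
    return out, state


train = pd.DataFrame({"area": [100.0, None, 300.0],
                      "facing": ["North", "East", "North"]})
new = pd.DataFrame({"area": [float("nan")], "facing": ["West"]})

train_out, state = preprocess_sketch(train)    # learns median and mapping
new_out, _ = preprocess_sketch(new, state)     # reuses them unchanged
print(train_out)
print(new_out)
```

Because the median and the category codes come from the training frame, the missing `area` in `new` is filled with the training median rather than a statistic of the new data, which is the property the README describes.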
unslml-0.1.0/pyproject.toml ADDED
@@ -0,0 +1,20 @@
+ [build-system]
+ requires = ["setuptools>=61.0"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "unslml"
+ version = "0.1.0"
+ authors = [
+     { name="Naveen" }
+ ]
+ description = "Simple AutoML library for classification and regression"
+ readme = "README.md"
+ requires-python = ">=3.9"
+
+ dependencies = [
+     "pandas",
+     "numpy",
+     "scikit-learn",
+     "joblib"
+ ]
unslml-0.1.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
unslml-0.1.0/setup.py ADDED
@@ -0,0 +1,16 @@
+ from setuptools import setup, find_packages
+
+ setup(
+     name="unslml",
+     version="0.1.0",
+     packages=find_packages(),
+     install_requires=[
+         "pandas>=1.3.0",
+         "numpy>=1.20.0",
+         "scikit-learn>=1.0.0"
+     ],
+     author="Galen",
+     description="AutoML and Machine Learning Library",
+     python_requires=">=3.8",
+ )
+
unslml-0.1.0/unslml/__init__.py ADDED
@@ -0,0 +1 @@
+ from .automl import AutoML
unslml-0.1.0/unslml/automl.py ADDED
@@ -0,0 +1,144 @@
+ from .data.loader import load_data
+ from .preprocessing.preprocess import preprocess_data, is_genuine_numeric_string
+ from .models.classifier import train_models
+ from sklearn.model_selection import train_test_split
+ import pandas as pd
+
+ class AutoML:
+     def fit(self, file, target):
+         # Load and clean target column
+         self.df = load_data(file)
+         self.df = self.df.dropna(subset=[target])
+         self.target = target
+
+         self.X = self.df.drop(self.target, axis=1)
+         self.y = self.df[self.target]
+
+         # Detect task type: classification or regression
+         if pd.api.types.is_numeric_dtype(self.y):
+             if pd.api.types.is_float_dtype(self.y) or self.y.nunique() > 20:
+                 self.task_type = "regression"
+             else:
+                 self.task_type = "classification"
+         else:
+             self.task_type = "classification"
+
+         print(f"Detected task type: {self.task_type}")
+
+         # Drop extreme target outliers for regression
+         if self.task_type == "regression":
+             q_99 = self.y.quantile(0.99)
+             if self.y.max() > 10 * q_99:
+                 print(f"Filtering out extreme target outliers above {10 * q_99}")
+                 valid_idx = self.y[self.y <= 10 * q_99].index
+                 self.df = self.df.loc[valid_idx]
+                 self.X = self.X.loc[valid_idx]
+                 self.y = self.y.loc[valid_idx]
+
+
+         # Identify high-cardinality columns before preprocessing (when they are still string types)
+         # but skip those that can be parsed as numeric.
+         self.cols_to_drop = []
+         for col in self.X.columns:
+             if pd.api.types.is_string_dtype(self.X[col]) and self.X[col].nunique() > 200:
+                 non_nulls = self.X[col].dropna()
+                 if len(non_nulls) > 0:
+                     sample = non_nulls.head(100)
+                     is_numeric_sample = sample.apply(is_genuine_numeric_string)
+                     if is_numeric_sample.sum() / len(sample) > 0.5:
+                         continue
+                 self.cols_to_drop.append(col)
+
+         if self.cols_to_drop:
+             print(f"Dropping high-cardinality categorical columns: {self.cols_to_drop}")
+             self.X = self.X.drop(columns=self.cols_to_drop)
+
+
+         # Train/Test Split
+         self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
+             self.X, self.y, test_size=0.2, random_state=42
+         )
+
+         # Stateful Preprocessing
+         self.X_train, self.encoders = preprocess_data(self.X_train.copy())
+         self.X_test, _ = preprocess_data(self.X_test.copy(), self.encoders)
+
+         # Performance optimization: sample down if dataset is very large
+         X_train_fit = self.X_train
+         y_train_fit = self.y_train
+         if len(self.X_train) > 10000:
+             print(f"Training set has {len(self.X_train)} samples. Sampling 10,000 for tuning...")
+             sample_idx = self.X_train.sample(n=10000, random_state=42).index
+             X_train_fit = self.X_train.loc[sample_idx]
+             y_train_fit = self.y_train.loc[sample_idx]
+
+         X_test_fit = self.X_test
+         y_test_fit = self.y_test
+         if len(self.X_test) > 2000:
+             print(f"Test set has {len(self.X_test)} samples. Sampling 2,000 for evaluation...")
+             sample_idx = self.X_test.sample(n=2000, random_state=42).index
+             X_test_fit = self.X_test.loc[sample_idx]
+             y_test_fit = self.y_test.loc[sample_idx]
+
+         # Fit models
+         self.results, self.best_models = train_models(
+             X_train_fit, X_test_fit, y_train_fit, y_test_fit, task_type=self.task_type
+         )
+
+         print("\nModel Results\n")
+         metric_name = "R2 Score" if self.task_type == "regression" else "Accuracy"
+         for name, score in self.results.items():
+             print(f"{name}: {score:.4f} ({metric_name})")
+
+         self.best_model = max(self.results, key=self.results.get)
+         print(f"\nBest Model: {self.best_model}")
+
+         # Prompt user to save the model
+         try:
+             save_path = input("\nEnter the file path to save the best model (default: best_model.pkl): ").strip()
+             if not save_path:
+                 save_path = "best_model.pkl"
+         except (IOError, EOFError):
+             save_path = "best_model.pkl"
+
+         self.save(save_path)
+
+     def predict(self, X):
+         X_proc = X.drop(columns=self.cols_to_drop, errors='ignore')
+         X_proc, _ = preprocess_data(X_proc.copy(), self.encoders)
+         return self.best_models[self.best_model].predict(X_proc)
+
+     def save(self, filepath):
+         """
+         Save the entire AutoML pipeline (best model, encoders, and config) to a pickle file.
+         """
+         import pickle
+         pipeline_data = {
+             "model": self.best_models[self.best_model],
+             "encoders": self.encoders,
+             "task_type": self.task_type,
+             "cols_to_drop": self.cols_to_drop,
+             "target": self.target,
+             "best_model_name": self.best_model
+         }
+         with open(filepath, "wb") as f:
+             pickle.dump(pipeline_data, f)
+         print(f"Pipeline saved successfully to {filepath}")
+
+     @classmethod
+     def load(cls, filepath):
+         """
+         Load a saved AutoML pipeline from a pickle file.
+         """
+         import pickle
+         with open(filepath, "rb") as f:
+             data = pickle.load(f)
+
+         instance = cls()
+         instance.best_model = data["best_model_name"]
+         instance.best_models = {instance.best_model: data["model"]}
+         instance.encoders = data["encoders"]
+         instance.task_type = data["task_type"]
+         instance.cols_to_drop = data["cols_to_drop"]
+         instance.target = data["target"]
+         return instance
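The task-detection and outlier-filtering steps in `AutoML.fit` are embedded in one long method, which makes them hard to exercise on their own. As a reading aid, the sketch below restates those two steps as standalone functions using the same thresholds as the source (20 unique values, 10 times the 99th percentile). The function names `detect_task_type` and `filter_target_outliers` and their keyword parameters are hypothetical and do not exist in the package.

```python
import pandas as pd


def detect_task_type(y, max_classes=20):
    """Mirror the dtype and cardinality rule used in AutoML.fit."""
    if pd.api.types.is_numeric_dtype(y):
        if pd.api.types.is_float_dtype(y) or y.nunique() > max_classes:
            return "regression"
    # Non-numeric targets and low-cardinality integers are classification.
    return "classification"


def filter_target_outliers(X, y, factor=10.0):
    """Drop rows whose target exceeds `factor` times the 99th percentile."""
    cutoff = factor * y.quantile(0.99)
    if y.max() > cutoff:
        keep = y <= cutoff
        return X.loc[keep], y.loc[keep]
    return X, y


print(detect_task_type(pd.Series([0, 1, 0, 1])))     # classification
print(detect_task_type(pd.Series([1.5, 2.5])))       # regression
print(detect_task_type(pd.Series(["a", "b"])))       # classification

# 199 ordinary targets plus one value that looks like a data entry typo.
y = pd.Series([1.0] * 199 + [1e6])
X = pd.DataFrame({"a": range(200)})
X_kept, y_kept = filter_target_outliers(X, y)
print(len(y_kept))                                   # 199
```

One consequence of the rule worth noting: an integer target with more than 20 distinct values (for example, a label column with many classes stored as integer codes) is treated as regression.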
unslml-0.1.0/unslml.egg-info/PKG-INFO ADDED
@@ -0,0 +1,86 @@
+ Metadata-Version: 2.4
+ Name: unslml
+ Version: 0.1.0
+ Summary: Simple AutoML library for classification and regression
+ Author: Naveen
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ Requires-Dist: pandas
+ Requires-Dist: numpy
+ Requires-Dist: scikit-learn
+ Requires-Dist: joblib
+ Dynamic: requires-python
+
+ # UNSLML
+
+ An **AutoML and Machine Learning Library** for Python.
+
+ `unslml` automatically detects task types (classification or regression), performs stateful feature engineering, filters extreme outliers, conducts hyperparameter searches, and provides simple one-line model saving and loading.
+
+ ---
+
+ ## 🌟 Key Features
+
+ * **Auto-Task Detection**: Automatically detects whether your target is a classification or regression task: non-numeric targets are treated as classification, and numeric targets as regression when they are floats or have more than 20 unique values.
+ * **Smart Numeric Text Parser**: Automatically extracts numerical values from string columns that represent measurements or values (e.g., `"1200 sqft"` -> `1200.0`, `"42 Lac"` -> `4,200,000.0`, `"1.40 Cr"` -> `14,000,000.0`).
+ * **Robust Outlier Filtering**: In regression, drops rows whose target exceeds 10 times the 99th percentile (e.g. data entry typos) to prevent metric skew.
+ * **Stateful Preprocessing**: Saves imputations and categorical mapping encodings during training to ensure identical transformation on test/prediction sets.
+ * **Auto-Hyperparameter Tuning**: Performs grid search cross-validation across multiple standard estimators (Logistic/Linear Regression, Decision Trees, Random Forests, KNN).
+ * **Smart Performance Scaling**: Sub-samples training sets larger than 10,000 rows down to 10,000, and test sets larger than 2,000 rows down to 2,000, before model search and evaluation to keep runtime short.
+ * **Pipeline Serialization (Save & Load)**: Prompts you to save the entire pipeline state to a `.pkl` file at the end of training, which can be loaded back with a single line of code.
+
+ ---
+
+ ## 🚀 Installation
+
+ Install the library directly from PyPI using pip:
+
+ ```bash
+ pip install unslml
+ ```
+
+ ---
+
+ ## 💻 How to Use
+
+ ### 1. Training & Auto-Saving a Pipeline
+ Create a script (e.g., `train.py`) to fit the model. The fitting process automatically runs preprocessing, tunes multiple models, reports evaluation scores, and prompts you to save the best model:
+
+ ```python
+ from unslml import AutoML
+
+ # Initialize AutoML pipeline
+ ml = AutoML()
+
+ # Fit model (auto-detects task type, handles preprocessing & fits best model)
+ ml.fit(
+     file="house_prices.csv",
+     target="Price (in rupees)"
+ )
+ # Prompt: "Enter the file path to save the best model (default: best_model.pkl): "
+ ```
+
+ ### 2. Loading & Predicting on Unseen Data
+ You can load the saved `.pkl` file (which contains the best model, categorical mappings, and median values) and predict on raw, unprocessed pandas DataFrames:
+
+ ```python
+ import pandas as pd
+ from unslml import AutoML
+
+ # Load the entire trained pipeline
+ ml_loaded = AutoML.load("best_model.pkl")
+
+ # New raw sample data to predict
+ new_houses = pd.DataFrame({
+     'location': ['location_name'],
+     'Bathroom': [2],
+     'Balcony': [1.0],
+     'facing': ['North'],
+     'Furnishing': ['Semi-Furnished'],
+     'Transaction': ['Resale']
+ })
+
+ # Make predictions directly (preprocessing is applied automatically)
+ predictions = ml_loaded.predict(new_houses)
+ print("Predicted Prices:", predictions)
+ ```
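The grid search described under "Auto-Hyperparameter Tuning" lives in `train_models`, which `automl.py` imports from `unslml.models.classifier`; that module is not part of this diff, so the claim cannot be checked against source. The sketch below shows one way such a search could be written with scikit-learn's `GridSearchCV`, returning the same `(results, best_models)` pair of dictionaries that `AutoML.fit` consumes. The name `train_models_sketch`, the `SEARCH_SPACE` table, the parameter grids, and the synthetic data are all assumptions, and only the classification case is shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical search space: estimator plus a small parameter grid per model.
SEARCH_SPACE = {
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "DecisionTree": (DecisionTreeClassifier(random_state=42), {"max_depth": [3, 5, None]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
}


def train_models_sketch(X_train, X_test, y_train, y_test):
    """Tune each model with 3-fold grid search, then score it on the test split."""
    results, best_models = {}, {}
    for name, (estimator, grid) in SEARCH_SPACE.items():
        search = GridSearchCV(estimator, grid, cv=3)
        search.fit(X_train, y_train)
        # best_estimator_ is refit on the full training split by default.
        best_models[name] = search.best_estimator_
        results[name] = search.best_estimator_.score(X_test, y_test)
    return results, best_models


# Synthetic data stands in for a preprocessed feature matrix.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
results, best_models = train_models_sketch(X_train, X_test, y_train, y_test)
best_name = max(results, key=results.get)
print(best_name, results[best_name])
```

Selecting the winner with `max(results, key=results.get)` matches how `AutoML.fit` picks `self.best_model` from the returned scores.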
unslml-0.1.0/unslml.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,10 @@
+ README.md
+ pyproject.toml
+ setup.py
+ unslml/__init__.py
+ unslml/automl.py
+ unslml.egg-info/PKG-INFO
+ unslml.egg-info/SOURCES.txt
+ unslml.egg-info/dependency_links.txt
+ unslml.egg-info/requires.txt
+ unslml.egg-info/top_level.txt
unslml-0.1.0/unslml.egg-info/requires.txt ADDED
@@ -0,0 +1,4 @@
+ pandas
+ numpy
+ scikit-learn
+ joblib
unslml-0.1.0/unslml.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ unslml