PyPI - smallaxe - Versions diffs - 0.6.2__tar.gz → 0.6.4__tar.gz - Mend


Files changed (66)

smallaxe-0.6.4/Goals.md ADDED

@@ -0,0 +1,220 @@
+> Note for AI agents working on this repository:
+>
+> Your job is to move smallaxe toward the goals in this document while keeping the library readable, simple to use, and extensible. Prefer clear APIs, small focused abstractions, and implementation patterns that match the existing codebase.
+>
+> Make goal-related changes on a git branch named `goals`. If the branch does not exist, create it before editing.
+>
+> Always validate changes with tests. The Python environment is managed with UV and should be activated with:
+>
+> ```bash
+> source ~/Desktop/basic/bin/activate
+> ```
+>
+> If a new Python library is needed in the environment, install it with:
+>
+> ```bash
+> uv pip install <library-name>
+> ```
+>
+> For PySpark tests on this machine, use OpenJDK 11:
+>
+> ```bash
+> export JAVA_HOME=/opt/homebrew/opt/openjdk@11
+> export PATH="$JAVA_HOME/bin:$PATH"
+> ```
+>
+> Run relevant focused tests after each change, and run the full suite before considering work complete:
+>
+> ```bash
+> pytest -q
+> ```
+>
+> If you are unsure about the current behavior, API, or implementation details of a dependency or library, use the available DeepWiki MCP tools to inspect authoritative project documentation before making assumptions.
+# smallaxe Goals
+## Product Goal
+smallaxe should make common supervised modeling on PySpark DataFrames feel as simple as scikit-learn on pandas, while keeping execution distributed through Spark-native and Spark-compatible ML libraries.
+The first stable target is:
+- Binary classification.
+- Standard continuous regression across Random Forest, LightGBM, XGBoost, and CatBoost regressors.
+- A simple, consistent user API for preprocessing, training, evaluation, prediction, persistence, and pipeline composition.
+Longer-term expansion should add multiclass classification, multilabel classification, and specialized regression tasks such as quantile regression.
+## Current Baseline
+The current implementation already has useful foundations:
+- Global configuration, custom exceptions, sample datasets, metrics, preprocessing, pipeline, and training modules.
+- Random Forest regressors/classifiers backed by PySpark ML.
+- Optional XGBoost and LightGBM wrappers.
+- Imputer, Scaler, Encoder, and Pipeline classes.
+- Model metadata, validation scores, feature importance, and save/load support for individual models.
+- A substantial test suite. With `~/Desktop/basic`, PySpark 3.5.x, and OpenJDK 11, the current suite passes: 485 passed, 102 skipped. The skipped tests are optional XGBoost/LightGBM coverage when those libraries are not installed.
+## Missing For v1
+### 1. Align Public API With Actual Capabilities
+- Update README to describe only implemented APIs, or implement the advertised APIs before release.
+- Current README advertises `smallaxe.search.optimize`, `smallaxe.auto.AutomatedTraining`, visualization, and CatBoost, but those modules are empty or missing.
+- Decide whether the first regression API is called "regression" or "linear regression." Random Forest, XGBoost, LightGBM, and CatBoost are not linear models. If true linear regression is a first-class goal, add a Spark `LinearRegression` baseline separately.
+### 2. Finish The Four-Algorithm Training Surface
+- Add CatBoost regressor and binary classifier support, or remove CatBoost from public docs until implemented.
+- Add factory methods for LightGBM in `Regressors` and `Classifiers`; the classes exist, but the factories only expose Random Forest and XGBoost.
+- Make optional dependency handling explicit:
+  - `available_models()` should report installed and unavailable models with install hints.
+  - Factories should raise clear `DependencyError` messages when a requested optional model is missing.
+  - Tests should verify missing optional dependency behavior without being globally skipped.
+- Normalize model parameter names across algorithms where possible:
+  - User-facing: `n_estimators`, `max_depth`, `learning_rate`, `seed`.
+  - Internal adapters translate to Spark/XGBoost/LightGBM/CatBoost-specific names.
+### 3. Make Preprocessing Production-Ready
+- Split categorical and numeric preprocessing into predictable steps:
+  - Numeric imputation.
+  - Categorical imputation.
+  - Categorical encoding.
+  - Numeric scaling when useful.
+  - Feature vector assembly.
+- Add a fitted preprocessing schema artifact:
+  - Input columns.
+  - Output feature columns.
+  - Encoded category mappings.
+  - Unknown-category behavior.
+  - Null handling behavior.
+- Replace Python UDF extraction in Scaler/Encoder where practical with Spark SQL/vector functions for performance.
+- Ensure transform-time behavior is stable for unseen categories, missing columns, and changed schemas.
+- Avoid silently dropping rows during feature assembly. Current `VectorAssembler(handleInvalid="skip")` can change row counts during training or prediction.
+### 4. Harden Pipeline Semantics
+- Pipeline should own feature-column construction instead of passing all non-label columns to the model.
+- Pipelines should support both:
+  - Preprocessing-only `fit/transform`.
+  - End-to-end `fit/predict/evaluate/save/load` with a model step.
+- Add robust pipeline persistence for model pipelines, not only preprocessing pipelines.
+- Save/load must preserve:
+  - Preprocessing state.
+  - Model artifacts.
+  - Feature schema.
+  - Label column.
+  - Task type.
+  - Model params.
+  - Validation/evaluation metadata.
+- Add tests for saving and loading full pipelines with Random Forest first, then optional algorithm-specific tests.
+### 5. Evaluation API
+- Add a model-level `evaluate(df, label_col=None, metrics=None)` method.
+- Add a pipeline-level `evaluate(...)` method that preprocesses, predicts, and scores in one call.
+- For binary classification, support at least:
+  - Accuracy.
+  - Precision.
+  - Recall.
+  - F1.
+  - ROC AUC.
+  - PR AUC.
+  - Log loss.
+  - Confusion matrix.
+- For regression, support at least:
+  - RMSE.
+  - MAE.
+  - MSE.
+  - R2.
+  - MAPE.
+- Keep multiclass and multilabel metrics separate from binary metrics. The current binary precision/recall/F1 implementation should not be reused for multiclass without explicit averaging policy.
+### 6. Training And Validation
+- Move train/test split and k-fold logic into a dedicated validation module.
+- Add public split utilities for reuse and testing.
+- Make validation behavior explicit:
+  - `validation="none" | "train_test" | "kfold"`.
+  - `stratified=True` only for classification.
+  - Fixed seed behavior.
+  - Empty fold and tiny-class handling.
+- Add train/validation metrics and final model metadata in a consistent structure.
+- Add an option to cache training data during fitting, with documented tradeoffs.
+### 7. Model Persistence And Registry-Ready Artifacts
+- Define a stable artifact layout:
+  - `metadata.json`.
+  - `preprocessing/`.
+  - `model/`.
+  - `metrics.json`.
+  - `schema.json`.
+- Include a `smallaxe_version`, Spark version, algorithm name, task type, params, feature schema, and timestamp.
+- Provide `load_model(path)` and `load_pipeline(path)` convenience functions.
+- Ensure loaded models produce the same predictions as saved models on deterministic test data.
+- Design the artifact format so it can later plug into MLflow or a model registry.
+### 8. Automated Training
+- Implement `AutomatedTraining` after the four algorithm wrappers are stable.
+- It should:
+  - Train all available compatible algorithms.
+  - Skip missing optional dependencies with warnings and install hints.
+  - Return a comparison table as a Spark or pandas DataFrame.
+  - Select `best_model` by a user-specified metric.
+  - Persist the winning model or full comparison run.
+- Keep the first version constrained to binary classification and continuous regression.
+### 9. Hyperparameter Search
+- Implement `smallaxe.search.optimize`.
+- Start with a simple, predictable API:
+  - model instance.
+  - DataFrame.
+  - label column.
+  - search space.
+  - metric.
+  - validation strategy.
+  - max evaluations.
+- Preserve `best_params`, `best_score`, and trial history.
+- Make search optional and clearly dependency-gated if using Hyperopt.
+### 10. Documentation And Examples
+- Rewrite README around the actual v1 user journey:
+  - Install.
+  - Build a preprocessing pipeline.
+  - Train binary classifier.
+  - Train regressor.
+  - Evaluate.
+  - Save/load.
+  - Use optional algorithms.
+- Add examples for:
+  - Random Forest binary classification.
+  - XGBoost regression.
+  - LightGBM classification when dependency is installed.
+  - Full pipeline save/load.
+- Add a compatibility matrix for Python, Spark, Java, and optional algorithm packages.
+## v1 Acceptance Criteria
+- A new user can train, evaluate, save, load, and predict with Random Forest on a PySpark DataFrame in under 20 lines of code.
+- The same user-facing workflow works for XGBoost, LightGBM, and CatBoost when optional dependencies are installed.
+- Binary classification and continuous regression have clear metrics and stable output schemas.
+- A full preprocessing-plus-model pipeline can be saved and loaded with identical predictions on deterministic data.
+- Missing optional dependencies fail with actionable install instructions.
+- Documentation does not advertise unimplemented APIs.
+- CI runs core tests on supported Python/Spark versions and optional algorithm tests in separate dependency-enabled jobs.
+## Later Goals
+- Multiclass classification with explicit averaging options for metrics.
+- Multilabel classification.
+- Quantile regression and other specialized regression objectives.
+- Calibration and threshold tuning for binary classifiers.
+- Feature importance and model comparison visualizations.
+- MLflow integration for experiment tracking and model registry workflows.
+- Distributed hyperparameter tuning with Spark-aware execution.
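
Reviewer sketch (not part of the package diff): Goals.md asks for `available_models()` to report installed and unavailable models together with install hints. One way this might look is shown below. The registry shape, the example entries, and the return structure are all assumptions made for illustration; the package may organize this differently.

```python
import importlib.util
from typing import Any, Dict, Tuple

# Hypothetical registry shape: model name -> (top-level import name, install hint).
Registry = Dict[str, Tuple[str, str]]


def available_models(registry: Registry) -> Dict[str, Dict[str, Any]]:
    """Report, per model, whether its backend is importable and how to install it."""
    report: Dict[str, Dict[str, Any]] = {}
    for name, (module_name, hint) in registry.items():
        # find_spec returns None for a missing top-level module without importing it.
        installed = importlib.util.find_spec(module_name) is not None
        report[name] = {
            "installed": installed,
            # Only surface the hint when the user actually needs it.
            "install_hint": None if installed else hint,
        }
    return report


# Illustrative entries only: "json" stands in for an installed backend,
# and the second import name is deliberately one that does not exist.
EXAMPLE_REGISTRY: Registry = {
    "json_backed": ("json", "bundled with Python"),
    "missing_backed": ("smallaxe_example_missing_backend", "pip install example-backend"),
}

report = available_models(EXAMPLE_REGISTRY)
```

Taking the registry as an argument keeps the function testable without installing or uninstalling real optional packages, which lines up with the goal that missing-dependency tests should not be globally skipped.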

{smallaxe-0.6.2 → smallaxe-0.6.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: smallaxe
-Version: 0.6.2
+Version: 0.6.4
 Summary: A PySpark MLOps library for simplified model training and optimization
 Author: Henok Yemam
 License: MIT

{smallaxe-0.6.2 → smallaxe-0.6.4}/TODO.md RENAMED

@@ -2,14 +2,6 @@
 ### Phase 8: Training Module - External Algorithms (v0.7.0)
-#### Step 8.2: LightGBM
-- [ ] Create `smallaxe/training/lightgbm.py`
-- [ ] Implement `LightGBMRegressor` and `LightGBMClassifier`
-- [ ] Handle optional dependency
-- [ ] Create `tests/test_lightgbm.py`
-- [ ] Commit: "Add LightGBM support"
-- [ ] PR → main
 #### Step 8.3: CatBoost
 - [ ] Create `smallaxe/training/catboost.py`
 - [ ] Implement `CatBoostRegressor` and `CatBoostClassifier`
@@ -236,4 +228,4 @@
 | v0.10.0 | Optimization (hyperopt) |
 | v0.11.0 | AutomatedTraining |
 | v0.12.0 | Visualization |
-| v1.0.0 | Integration, README, PyPI publish |
+| v1.0.0 | Integration, README, PyPI publish |

{smallaxe-0.6.2 → smallaxe-0.6.4}/smallaxe/training/__init__.py RENAMED

@@ -27,3 +27,29 @@ try:
     __all__.extend(["XGBoostRegressor", "XGBoostClassifier"])
 except ImportError:
     pass
+# Import LightGBM classes if available (optional dependency)
+try:
+    from smallaxe.training.lightgbm import (
+        LightGBMClassifier as LightGBMClassifier,
+    )
+    from smallaxe.training.lightgbm import (
+        LightGBMRegressor as LightGBMRegressor,
+    )
+    __all__.extend(["LightGBMRegressor", "LightGBMClassifier"])
+except ImportError:
+    pass
+# Import CatBoost classes if available (optional dependency)
+try:
+    from smallaxe.training.catboost import (
+        CatBoostClassifier as CatBoostClassifier,
+    )
+    from smallaxe.training.catboost import (
+        CatBoostRegressor as CatBoostRegressor,
+    )
+    __all__.extend(["CatBoostRegressor", "CatBoostClassifier"])
+except ImportError:
+    pass
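
Reviewer sketch (not part of the package diff): the `try`/`except ImportError` guards above only skip an export when importing the wrapper module itself fails. As written in this diff, `smallaxe/training/catboost.py` catches the `catboost_spark` `ImportError` internally, so importing the wrapper would be expected to succeed even without the backend, with the failure surfacing at construction time instead. The self-contained example below illustrates that deferred-check pattern. `DependencyError` here is a stand-in that only mirrors the `package=` / `install_command=` keywords seen in the diff, and `LazyModel` and the backend name are hypothetical.

```python
import importlib
from types import ModuleType
from typing import Optional


class DependencyError(Exception):
    """Stand-in for smallaxe.exceptions.DependencyError (real class not shown here)."""

    def __init__(self, package: str, install_command: str) -> None:
        super().__init__(f"{package} is required. Install with: {install_command}")
        self.package = package
        self.install_command = install_command


def _load_backend(module_name: str) -> Optional[ModuleType]:
    """Return the backend module, or None when it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


class LazyModel:
    """Hypothetical wrapper: defining it is always safe, constructing it may fail."""

    BACKEND = "smallaxe_example_missing_backend"  # deliberately not installed

    def __init__(self) -> None:
        if _load_backend(self.BACKEND) is None:
            raise DependencyError(
                package=self.BACKEND,
                install_command="pip install example-backend",
            )


# The class exists even though the backend is missing ...
class_is_defined = isinstance(LazyModel, type)

# ... and the dependency problem only appears when a model is constructed.
try:
    LazyModel()
    construction_error = None
except DependencyError as exc:
    construction_error = exc
```

With this pattern, a name listed in `__all__` does not by itself indicate that the backend is usable; a separate availability check such as `is_catboost_available()` from the diff would be the thing to consult.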

smallaxe-0.6.4/smallaxe/training/catboost.py ADDED

@@ -0,0 +1,296 @@
+"""CatBoost models for regression and classification."""
+import shutil
+import tempfile
+from typing import Any, Dict, List, Optional
+from pyspark.sql import DataFrame
+from smallaxe.exceptions import DependencyError
+from smallaxe.training.base import BaseClassifier, BaseRegressor
+CATBOOST_AVAILABLE = False
+SparkCatBoostRegressor = None
+SparkCatBoostRegressionModel = None
+SparkCatBoostClassifier = None
+SparkCatBoostClassificationModel = None
+def _load_catboost_spark() -> bool:
+    """Load CatBoost Spark classes if Spark has made them importable."""
+    global CATBOOST_AVAILABLE
+    global SparkCatBoostRegressor
+    global SparkCatBoostRegressionModel
+    global SparkCatBoostClassifier
+    global SparkCatBoostClassificationModel
+    if CATBOOST_AVAILABLE:
+        return True
+    try:
+        from catboost_spark import (
+            CatBoostClassificationModel,
+            CatBoostClassifier,
+            CatBoostRegressionModel,
+            CatBoostRegressor,
+        )
+    except ImportError:
+        return False
+    SparkCatBoostRegressor = CatBoostRegressor
+    SparkCatBoostRegressionModel = CatBoostRegressionModel
+    SparkCatBoostClassifier = CatBoostClassifier
+    SparkCatBoostClassificationModel = CatBoostClassificationModel
+    CATBOOST_AVAILABLE = True
+    return True
+_load_catboost_spark()
+def _check_catboost_available() -> None:
+    """Check if CatBoost Spark support is available."""
+    if not _load_catboost_spark():
+        raise DependencyError(
+            package="catboost_spark",
+            install_command=(
+                "pip install smallaxe[catboost] and configure Spark with "
+                "ai.catboost:catboost-spark_3.5_2.12:1.2.10"
+            ),
+        )
+def is_catboost_available() -> bool:
+    """Return whether CatBoost Spark support is currently importable."""
+    return _load_catboost_spark()
+def catboost_install_hint() -> str:
+    """Return the install and Spark package hint for CatBoost support."""
+    return (
+        "pip install smallaxe[catboost] and configure Spark with "
+        "ai.catboost:catboost-spark_3.5_2.12:1.2.10"
+    )
+class CatBoostRegressor(BaseRegressor):
+    """CatBoost regressor for regression tasks.
+    This class wraps CatBoost for Spark's CatBoostRegressor to provide the
+    same smallaxe fit/predict/save/load interface as the other Spark-backed
+    regressors.
+    """
+    def __init__(self, task: str = "simple_regression") -> None:
+        """Initialize the CatBoost regressor."""
+        _check_catboost_available()
+        super().__init__(task)
+    @property
+    def params(self) -> Dict[str, str]:
+        """Get parameter descriptions."""
+        return {
+            "n_estimators": "Number of boosting iterations",
+            "max_depth": "Maximum tree depth",
+            "learning_rate": "Boosting learning rate",
+            "subsample": "Sample rate for bagging",
+            "l2_leaf_reg": "L2 regularization coefficient",
+            "random_strength": "Amount of randomness used when scoring splits",
+            "one_hot_max_size": "Maximum categorical cardinality for one-hot encoding",
+            "allow_writing_files": "Whether CatBoost may write training artifacts",
+            "train_dir": "Directory for CatBoost training artifacts",
+            "seed": "Random seed for reproducibility",
+        }
+    @property
+    def default_params(self) -> Dict[str, Any]:
+        """Get default parameter values."""
+        return {
+            "n_estimators": 100,
+            "max_depth": 6,
+            "learning_rate": 0.03,
+            "subsample": None,
+            "l2_leaf_reg": 3.0,
+            "random_strength": 1.0,
+            "one_hot_max_size": None,
+            "allow_writing_files": False,
+            "train_dir": None,
+            "seed": None,
+        }
+    def _catboost_params(
+        self,
+        label_col: Optional[str] = None,
+        train_dir: Optional[str] = None,
+    ) -> Dict[str, Any]:
+        """Translate smallaxe parameter names to CatBoost Spark parameter names."""
+        params = {
+            "iterations": self.get_param("n_estimators"),
+            "depth": self.get_param("max_depth"),
+            "learningRate": self.get_param("learning_rate"),
+            "l2LeafReg": self.get_param("l2_leaf_reg"),
+            "randomStrength": self.get_param("random_strength"),
+            "lossFunction": "RMSE",
+            "allowWritingFiles": self.get_param("allow_writing_files"),
+            "featuresCol": self.FEATURES_COL,
+            "predictionCol": self.PREDICTION_COL,
+        }
+        if label_col is not None:
+            params["labelCol"] = label_col
+        configured_train_dir = train_dir or self.get_param("train_dir")
+        if configured_train_dir is not None:
+            params["trainDir"] = configured_train_dir
+        optional_params = {
+            "subsample": self.get_param("subsample"),
+            "oneHotMaxSize": self.get_param("one_hot_max_size"),
+            "randomSeed": self.get_param("seed"),
+        }
+        params.update({name: value for name, value in optional_params.items() if value is not None})
+        return params
+    def _create_spark_estimator(self) -> Any:
+        """Create the underlying CatBoost Spark regressor."""
+        return SparkCatBoostRegressor(**self._catboost_params())
+    def _fit_spark_model(
+        self,
+        df: DataFrame,
+        label_col: str,
+        feature_cols: List[str],
+    ) -> Any:
+        """Fit the CatBoost Spark regressor."""
+        df_with_features = self._assemble_features(df, feature_cols)
+        temp_train_dir = None
+        if self.get_param("train_dir") is None:
+            temp_train_dir = tempfile.mkdtemp(prefix="smallaxe_catboost_")
+        estimator = SparkCatBoostRegressor(
+            **self._catboost_params(label_col, train_dir=temp_train_dir)
+        )
+        self._feature_cols = feature_cols
+        self._label_col = label_col
+        try:
+            self._spark_model = estimator.fit(df_with_features)
+        finally:
+            if temp_train_dir is not None:
+                shutil.rmtree(temp_train_dir, ignore_errors=True)
+        return self._spark_model
+    def _load_artifacts(self, path: str) -> None:
+        """Load the CatBoost Spark model from disk."""
+        self._load_spark_model(path, SparkCatBoostRegressionModel)
+class CatBoostClassifier(BaseClassifier):
+    """CatBoost classifier for binary and multiclass classification tasks."""
+    def __init__(self, task: str = "binary") -> None:
+        """Initialize the CatBoost classifier."""
+        _check_catboost_available()
+        super().__init__(task)
+    @property
+    def params(self) -> Dict[str, str]:
+        """Get parameter descriptions."""
+        return {
+            "n_estimators": "Number of boosting iterations",
+            "max_depth": "Maximum tree depth",
+            "learning_rate": "Boosting learning rate",
+            "subsample": "Sample rate for bagging",
+            "l2_leaf_reg": "L2 regularization coefficient",
+            "random_strength": "Amount of randomness used when scoring splits",
+            "one_hot_max_size": "Maximum categorical cardinality for one-hot encoding",
+            "scale_pos_weight": "Class 1 weight multiplier for binary classification",
+            "allow_writing_files": "Whether CatBoost may write training artifacts",
+            "train_dir": "Directory for CatBoost training artifacts",
+            "seed": "Random seed for reproducibility",
+        }
+    @property
+    def default_params(self) -> Dict[str, Any]:
+        """Get default parameter values."""
+        return {
+            "n_estimators": 100,
+            "max_depth": 6,
+            "learning_rate": 0.03,
+            "subsample": None,
+            "l2_leaf_reg": 3.0,
+            "random_strength": 1.0,
+            "one_hot_max_size": None,
+            "scale_pos_weight": None,
+            "allow_writing_files": False,
+            "train_dir": None,
+            "seed": None,
+        }
+    def _catboost_params(
+        self,
+        label_col: Optional[str] = None,
+        train_dir: Optional[str] = None,
+    ) -> Dict[str, Any]:
+        """Translate smallaxe parameter names to CatBoost Spark parameter names."""
+        loss_function = "Logloss" if self.task == "binary" else "MultiClass"
+        params = {
+            "iterations": self.get_param("n_estimators"),
+            "depth": self.get_param("max_depth"),
+            "learningRate": self.get_param("learning_rate"),
+            "l2LeafReg": self.get_param("l2_leaf_reg"),
+            "randomStrength": self.get_param("random_strength"),
+            "lossFunction": loss_function,
+            "allowWritingFiles": self.get_param("allow_writing_files"),
+            "featuresCol": self.FEATURES_COL,
+            "predictionCol": self.PREDICTION_COL,
+            "probabilityCol": self.PROBABILITY_COL,
+            "rawPredictionCol": self.RAW_PREDICTION_COL,
+        }
+        if label_col is not None:
+            params["labelCol"] = label_col
+        configured_train_dir = train_dir or self.get_param("train_dir")
+        if configured_train_dir is not None:
+            params["trainDir"] = configured_train_dir
+        optional_params = {
+            "subsample": self.get_param("subsample"),
+            "oneHotMaxSize": self.get_param("one_hot_max_size"),
+            "scalePosWeight": self.get_param("scale_pos_weight"),
+            "randomSeed": self.get_param("seed"),
+        }
+        params.update({name: value for name, value in optional_params.items() if value is not None})
+        return params
+    def _create_spark_estimator(self) -> Any:
+        """Create the underlying CatBoost Spark classifier."""
+        return SparkCatBoostClassifier(**self._catboost_params())
+    def _fit_spark_model(
+        self,
+        df: DataFrame,
+        label_col: str,
+        feature_cols: List[str],
+    ) -> Any:
+        """Fit the CatBoost Spark classifier."""
+        df_with_features = self._assemble_features(df, feature_cols)
+        temp_train_dir = None
+        if self.get_param("train_dir") is None:
+            temp_train_dir = tempfile.mkdtemp(prefix="smallaxe_catboost_")
+        estimator = SparkCatBoostClassifier(
+            **self._catboost_params(label_col, train_dir=temp_train_dir)
+        )
+        self._feature_cols = feature_cols
+        self._label_col = label_col
+        try:
+            self._spark_model = estimator.fit(df_with_features)
+        finally:
+            if temp_train_dir is not None:
+                shutil.rmtree(temp_train_dir, ignore_errors=True)
+        return self._spark_model
+    def _load_artifacts(self, path: str) -> None:
+        """Load the CatBoost Spark model from disk."""
+        self._load_spark_model(path, SparkCatBoostClassificationModel)
