PyPI - fklearn - Versions diffs - 2.2.0__tar.gz → 2.3.0__tar.gz - Mend

fklearn 2.2.0__tar.gz → 2.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (70)

fklearn-2.3.0/PKG-INFO ADDED

@@ -0,0 +1,68 @@
+Metadata-Version: 2.1
+Name: fklearn
+Version: 2.3.0
+Summary: Functional machine learning
+Home-page: https://github.com/nubank/fklearn
+Author: Nubank
+Classifier: Programming Language :: Python :: 3.6
+Classifier: Programming Language :: Python :: 3.7
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Requires-Python: >=3.6.2,<3.10
+Description-Content-Type: text/markdown
+Provides-Extra: test_deps
+Provides-Extra: lgbm
+Provides-Extra: xgboost
+Provides-Extra: catboost
+Provides-Extra: tools
+Provides-Extra: devel
+Provides-Extra: all_models
+Provides-Extra: all
+License-File: LICENSE
+# fklearn: Functional Machine Learning
+![PyPI](https://img.shields.io/pypi/v/fklearn.svg?style=flat-square)
+[![Documentation Status](https://readthedocs.org/projects/fklearn/badge/?version=latest)](https://fklearn.readthedocs.io/en/latest/?badge=latest)
+[![Gitter](https://badges.gitter.im/fklearn-python/community.svg)](https://gitter.im/fklearn-python/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
+![Tests](https://github.com/nubank/fklearn/actions/workflows/push.yaml/badge.svg?branch=master)
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+**fklearn** uses functional programming principles to make it easier to solve real problems with Machine Learning.
+The name is a reference to the widely known [scikit-learn](https://scikit-learn.org/stable/) library.
+**fklearn Principles**
+1. Validation should reflect real-life situations.
+2. Production models should match validated models.
+3. Models should be production-ready with few extra steps.
+4. Reproducibility and in-depth analysis of model results should be easy to achieve.
+[Documentation](https://fklearn.readthedocs.io/en/latest/) |
+[Getting Started](https://fklearn.readthedocs.io/en/latest/getting_started.html) |
+[API Docs](https://fklearn.readthedocs.io/en/latest/api/modules.html) |
+[Contributing](https://fklearn.readthedocs.io/en/latest/contributing.html) |
+## Installation
+To install via pip:
+```
+pip install fklearn
+```
+You can also install from the source:
+```sh
+git clone git@github.com:nubank/fklearn.git
+cd fklearn
+git checkout master
+pip install -e .
+```
+## License
+[Apache License 2.0](LICENSE)

{fklearn-2.2.0 → fklearn-2.3.0}/README.md RENAMED

@@ -3,8 +3,7 @@
 ![PyPI](https://img.shields.io/pypi/v/fklearn.svg?style=flat-square)
 [![Documentation Status](https://readthedocs.org/projects/fklearn/badge/?version=latest)](https://fklearn.readthedocs.io/en/latest/?badge=latest)
 [![Gitter](https://badges.gitter.im/fklearn-python/community.svg)](https://gitter.im/fklearn-python/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
-[![CircleCI](https://circleci.com/gh/nubank/fklearn.svg?style=svg)](https://circleci.com/gh/nubank/fklearn)
-[![codecov.io](https://codecov.io/github/nubank/fklearn/branch/master/graph/badge.svg)](https://codecov.io/github/nubank/fklearn)
+![Tests](https://github.com/nubank/fklearn/actions/workflows/push.yaml/badge.svg?branch=master)
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 **fklearn** uses functional programming principles to make it easier to solve real problems with Machine Learning.
@@ -21,7 +20,7 @@ The name is a reference to the widely known [scikit-learn](https://scikit-learn.
 [Documentation](https://fklearn.readthedocs.io/en/latest/) |
 [Getting Started](https://fklearn.readthedocs.io/en/latest/getting_started.html) |
-[API Docs](https://fklearn.readthedocs.io/en/latest/api.html) |
+[API Docs](https://fklearn.readthedocs.io/en/latest/api/modules.html) |
 [Contributing](https://fklearn.readthedocs.io/en/latest/contributing.html) |

{fklearn-2.2.0 → fklearn-2.3.0}/requirements.txt RENAMED

@@ -1,6 +1,6 @@
 joblib>=0.13.2,<2
 numpy>=1.16.4,<2
 pandas>=0.24.1,<2
-scikit-learn>=0.21.2,<0.24.0
+scikit-learn>=0.21.2,<0.25.0
 statsmodels>=0.9.0,<1
 toolz>=0.9.0,<1
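
The practical effect of raising the scikit-learn upper bound can be sketched with a simplified range check. This is only an illustration: `parse_version` and `satisfies` are hypothetical helpers that handle plain dotted releases, whereas real installers follow the full PEP 440 rules (pre-releases, wildcards and so on).

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a plain dotted release such as '0.24.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version < upper, mirroring a '>=lower,<upper' pin."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# scikit-learn 0.24.2 falls outside the 2.2.0 pin and inside the 2.3.0 pin.
old_pin_ok = satisfies("0.24.2", "0.21.2", "0.24.0")
new_pin_ok = satisfies("0.24.2", "0.21.2", "0.25.0")
print(old_pin_ok, new_pin_ok)
```

Under these simplifying assumptions, the 0.24.x series is the only range of releases that the new pin admits and the old one did not.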

{fklearn-2.2.0 → fklearn-2.3.0}/requirements_test.txt RENAMED

@@ -2,5 +2,6 @@ pytest>=4.2.1,<7
 pytest-cov>=2.6.1,<3
 pytest-xdist>=1.26.1,<3
 mypy>=0.670,<1
+coverage<5
 codecov>=2.0,<3
 hypothesis>=5.5.4,<7

{fklearn-2.2.0 → fklearn-2.3.0}/setup.py RENAMED

@@ -26,10 +26,15 @@ all_models_deps = lgbm_deps + xgboost_deps + catboost_deps
 all_deps = all_models_deps + tools_deps
 devel_deps = test_deps + all_deps
+with open("README.md", "r") as fh:
+    long_description = fh.read()
 setup(name=MODULE_NAME,
       description="Functional machine learning",
+      long_description=long_description,
+      long_description_content_type="text/markdown",
       url='https://github.com/nubank/{:s}'.format(REPO_NAME),
-      python_requires='>=3.6.2',
+      python_requires='>=3.6.2,<3.10',
       author="Nubank",
       package_dir={'': 'src'},
       packages=find_packages('src'),
@@ -46,5 +51,9 @@ setup(name=MODULE_NAME,
                       "all": all_deps},
       include_package_data=True,
       zip_safe=False,
-      classifiers=['Programming Language :: Python :: 3.6'])
+      classifiers=[
+          'Programming Language :: Python :: 3.6',
+          'Programming Language :: Python :: 3.7',
+          'Programming Language :: Python :: 3.8',
+          'Programming Language :: Python :: 3.9'
+          ])

{fklearn-2.2.0 → fklearn-2.3.0}/src/fklearn/causal/cate_learning/meta_learners.py RENAMED

@@ -1,6 +1,6 @@
 import copy
 import inspect
-from typing import Callable, List, Tuple
+from typing import Callable, Dict, List, Tuple
 import numpy as np
 import pandas as pd
@@ -194,25 +194,19 @@ def causal_s_classification_learner(
     Parameters
     ----------
     df : pd.DataFrame
         A Pandas' DataFrame with features and target columns.
         The model will be trained to predict the target column
         from the features.
     treatment_col: str
         The name of the column in `df` which contains the names of
         the treatments or control to which each data sample was subjected.
     control_name: str
         The name of the control group.
     prediction_column : str
         The name of the column with the predictions from the provided learner.
     learner: Callable
         A fklearn classification learner function.
     learner_transformers: list
         A list of fklearn transformer functions to be applied after the learner and before estimating the CATE.
         This parameter may be useful, for example, to estimate the CATE with calibrated classifiers.
@@ -266,3 +260,189 @@ def causal_s_classification_learner(
 causal_s_classification_learner.__doc__ += learner_return_docstring(
     "Causal S-Learner Classifier"
 )
+def _simulate_t_learner_treatment_effect(
+    df: pd.DataFrame,
+    learners: dict,
+    treatments: list,
+    control_name: str,
+    prediction_column: str,
+) -> pd.DataFrame:
+    control_fcn = learners[control_name]
+    control_conversion_probability = control_fcn(df)[prediction_column].values
+    scored_df = df.copy()
+    uplift_cols = []
+    for treatment_name in treatments:
+        treatment_fcn = learners[treatment_name]
+        treatment_conversion_probability = treatment_fcn(df)[prediction_column].values
+        scored_df[
+            f"treatment_{treatment_name}__{prediction_column}_on_treatment"
+        ] = treatment_conversion_probability
+        uplift_cols.append(f"treatment_{treatment_name}__uplift")
+        scored_df[uplift_cols[-1]] = (
+            treatment_conversion_probability - control_conversion_probability
+        )
+    scored_df["uplift"] = scored_df[uplift_cols].max(axis=1).values
+    scored_df["suggested_treatment"] = np.where(
+        scored_df["uplift"].values <= 0,
+        control_name,
+        scored_df[uplift_cols].idxmax(axis=1).values,
+    )
+    scored_df["suggested_treatment"] = (
+        scored_df["suggested_treatment"]
+        .apply(lambda x: x.replace("__uplift", ""))
+        .values
+    )
+    return scored_df
+def _get_model_fcn(
+    df: pd.DataFrame,
+    treatment_col: str,
+    treatment_name: str,
+    learner: Callable,
+) -> Tuple[Callable, dict, dict]:
+    """
+    Returns a function that predicts the target column from the features.
+    """
+    treatment_names = df[treatment_col].unique()
+    if treatment_name not in treatment_names:
+        raise MissingTreatmentError()
+    df = df.loc[df[treatment_col] == treatment_name].reset_index(drop=True).copy()
+    return learner(df)
+def _get_learners(
+    df: pd.DataFrame,
+    control_learner: Callable,
+    treatment_learner: Callable,
+    unique_treatments: List[str],
+    control_name: str,
+    treatment_col: str,
+) -> Tuple[Dict[str, Callable], Dict[str, dict]]:
+    learners: Dict[str, Callable] = {}
+    logs: Dict[str, dict] = {}
+    learner_fcn, _, learner_logs = _get_model_fcn(
+        df, treatment_col, control_name, control_learner
+    )
+    learners[control_name] = learner_fcn
+    logs[control_name] = learner_logs
+    for treatment_name in unique_treatments:
+        learner_fcn, _, learner_logs = _get_model_fcn(
+            df, treatment_col, treatment_name, treatment_learner
+        )
+        learners[treatment_name] = learner_fcn
+        logs[treatment_name] = learner_logs
+    return learners, logs
+@curry
+def causal_t_classification_learner(
+    df: pd.DataFrame,
+    treatment_col: str,
+    control_name: str,
+    prediction_column: str,
+    learner: LearnerFnType,
+    treatment_learner: LearnerFnType = None,
+    learner_transformers: List[LearnerFnType] = None,
+) -> LearnerReturnType:
+    """
+    Fits a Causal T-Learner classifier. The T-Learner is a meta-learner which learns the
+    Conditional Average Treatment Effect (CATE) through the use of one Machine Learning
+    model for each treatment and for the control group. Each model is fitted in a subset of
+    the data, according to the treatment: the CATE $\tau$ is defined as
+    $\tau(x_{i}) = M_{1}(X=x_{i}, T=1) - M_{0}(X=x_{i}, T=0)$, being $M_{1}$ a model fitted
+    with treatment data and $M_{0}$ a model fitted with control data. Notice that $M_{0}$
+    and $M_{1}$ are traditional Machine Learning models such as a LightGBM Classifier and
+    that $x_{i}$ is the feature set of sample $i$.
+    **References:**
+    [1] https://matheusfacure.github.io/python-causality-handbook/21-Meta-Learners.html
+    [2] https://causalml.readthedocs.io/en/latest/methodology.html
+    Parameters
+    ----------
+    df : pd.DataFrame
+        A Pandas' DataFrame with features and target columns.
+        The model will be trained to predict the target column
+        from the features.
+    treatment_col: str
+        The name of the column in `df` which contains the names of
+        the treatments and control to which each data sample was subjected.
+    control_name: str
+        The name of the control group.
+    prediction_column : str
+        The name of the column with the predictions from the provided learner.
+    learner: LearnerFnType
+        A fklearn classification learner function.
+    treatment_learner: LearnerFnType
+        An optional fklearn classification learner function.
+    learner_transformers: List[LearnerFnType]
+        A list of fklearn transformer functions to be applied after the learner and before estimating the CATE.
+        This parameter may be useful, for example, to estimate the CATE with calibrated classifiers.
+    """
+    control_learner = copy.deepcopy(learner)
+    if treatment_learner is None:
+        treatment_learner = copy.deepcopy(learner)
+    # pipeline
+    if learner_transformers is not None:
+        learner_transformers = copy.deepcopy(learner_transformers)
+        control_learner_pipe = build_pipeline(*[control_learner] + learner_transformers)
+        treatment_learner_pipe = build_pipeline(
+            *[treatment_learner] + learner_transformers
+        )
+    else:
+        control_learner_pipe = copy.deepcopy(control_learner)
+        treatment_learner_pipe = copy.deepcopy(treatment_learner)
+    # learners
+    unique_treatments = _get_unique_treatments(df, treatment_col, control_name)
+    learners, learners_logs = _get_learners(
+        df=df,
+        control_learner=control_learner_pipe,
+        treatment_learner=treatment_learner_pipe,
+        unique_treatments=unique_treatments,
+        control_name=control_name,
+        treatment_col=treatment_col,
+    )
+    def p(new_df: pd.DataFrame) -> pd.DataFrame:
+        return _simulate_t_learner_treatment_effect(
+            new_df,
+            learners,
+            unique_treatments,
+            control_name,
+            prediction_column,
+        )
+    p.__doc__ = learner_pred_fn_docstring("causal_t_classification_learner")
+    log = {"causal_t_classification_learner": {**learners_logs}}
+    return p, p(df), log
+causal_t_classification_learner.__doc__ += learner_return_docstring(
+    "Causal T-Learner Classifier"
+)
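
The T-learner idea described in the docstring above (one model per group, uplift as the difference between treatment and control predictions) can be sketched without fklearn. This is a minimal illustration, not the library's implementation: `fit_rate_model` is a hypothetical toy "learner" that predicts the conversion rate per value of a single feature, and `fit_t_learner` returns the bare treatment name, whereas the helper in the diff derives its label from the uplift column name.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]
Model = Callable[[Row], float]


def fit_rate_model(rows: List[Row], feature: str, target: str) -> Model:
    """Toy stand-in for a classifier: conversion rate per feature value."""
    counts: Dict[object, List[int]] = defaultdict(lambda: [0, 0])
    for row in rows:
        counts[row[feature]][0] += int(row[target])
        counts[row[feature]][1] += 1
    overall = sum(c[0] for c in counts.values()) / len(rows)
    rates = {value: hits / n for value, (hits, n) in counts.items()}
    # Unseen feature values fall back to the group's overall rate.
    return lambda row: rates.get(row[feature], overall)


def fit_t_learner(rows: List[Row], treatment_col: str, control_name: str,
                  feature: str, target: str) -> Callable[[Row], Dict[str, object]]:
    """Fit one model per group and return a function that scores uplift."""
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        groups[str(row[treatment_col])].append(row)
    if control_name not in groups:
        raise ValueError("Data does not contain the specified control.")
    treatments = sorted(name for name in groups if name != control_name)
    if not treatments:
        raise ValueError("Data contains no treatment group.")
    models = {name: fit_rate_model(group, feature, target)
              for name, group in groups.items()}

    def predict(row: Row) -> Dict[str, object]:
        baseline = models[control_name](row)
        uplifts = {name: models[name](row) - baseline for name in treatments}
        best = max(uplifts, key=lambda name: uplifts[name])
        # Fall back to the control when no treatment beats it.
        return {
            "uplift": uplifts[best],
            "suggested_treatment": control_name if uplifts[best] <= 0 else best,
        }

    return predict


rows = [
    {"group": "control", "segment": "young", "converted": 0},
    {"group": "control", "segment": "young", "converted": 1},
    {"group": "control", "segment": "old", "converted": 1},
    {"group": "control", "segment": "old", "converted": 1},
    {"group": "email", "segment": "young", "converted": 1},
    {"group": "email", "segment": "young", "converted": 1},
    {"group": "email", "segment": "old", "converted": 0},
    {"group": "email", "segment": "old", "converted": 1},
]
predict = fit_t_learner(rows, "group", "control", "segment", "converted")
print(predict({"segment": "young"}))
print(predict({"segment": "old"}))
```

As in the diff's helper, the reported uplift is the maximum over treatments even when it is non-positive; only the suggested treatment falls back to the control.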

fklearn-2.3.0/src/fklearn/exceptions/exceptions.py ADDED

@@ -0,0 +1,31 @@
+from typing import Any, Dict, List
+class MultipleTreatmentsError(Exception):
+    def __init__(
+        self,
+        msg: str = "Data contains multiple treatments.",
+        *args: List[Any],
+        **kwargs: Dict[str, Any]
+    ) -> None:
+        super().__init__(msg, *args, **kwargs)
+class MissingControlError(Exception):
+    def __init__(
+        self,
+        msg: str = "Data does not contain the specified control.",
+        *args: List[Any],
+        **kwargs: Dict[str, Any]
+    ) -> None:
+        super().__init__(msg, *args, **kwargs)
+class MissingTreatmentError(Exception):
+    def __init__(
+        self,
+        msg: str = "Data does not contain the specified treatment.",
+        *args: List[Any],
+        **kwargs: Dict[str, Any]
+    ) -> None:
+        super().__init__(msg, *args, **kwargs)
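
The default-message pattern used by these exception classes can be exercised with a small sketch. The class below is a local stand-in that mirrors the one added in the diff (without the keyword-argument forwarding), and `require_treatment` is a hypothetical helper introduced only for illustration.

```python
class MissingTreatmentError(Exception):
    """Local stand-in mirroring the default-message pattern in the diff."""

    def __init__(
        self,
        msg: str = "Data does not contain the specified treatment.",
        *args: object
    ) -> None:
        super().__init__(msg, *args)


def require_treatment(available: set, treatment_name: str) -> None:
    """Raise with the default message when the treatment is absent."""
    if treatment_name not in available:
        raise MissingTreatmentError()


try:
    require_treatment({"control", "email"}, "sms")
except MissingTreatmentError as err:
    message = str(err)

print(message)
```

Callers can still pass a custom message, for example `MissingTreatmentError("No rows for treatment 'sms'.")`, and the default is used only when none is given.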

fklearn-2.3.0/src/fklearn/resources/VERSION ADDED

@@ -0,0 +1 @@
+2.3.0

{fklearn-2.2.0 → fklearn-2.3.0}/src/fklearn/training/classification.py RENAMED

@@ -1,7 +1,9 @@
-from typing import List, Any
+from typing import List, Any, Optional, Callable, Tuple, Union
 import numpy as np
 import pandas as pd
+from lightgbm import Booster
+from pathlib import Path
 from toolz import curry, merge, assoc
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.linear_model import LogisticRegression
@@ -233,7 +235,7 @@ xgb_classification_learner.__doc__ += learner_return_docstring("XGboost Classifi
 @curry
 def _get_catboost_shap_values(df: pd.DataFrame, cbr: Any,
                               features: List, target: str,
-                              weights: List, cat_features: List) -> np.array:
+                              weights: List, cat_features: List) -> np.ndarray:
     """
     Auxiliar method to allow us to get shap values for Catboost multiclass models
@@ -446,7 +448,7 @@ def nlp_logistic_classification_learner(df: pd.DataFrame,
     """
     # set default params
-    default_vect_params = {"strip_accents": "unicode", "min_df": 20}
+    default_vect_params = {"strip_accents": "unicode", "min_df": 1}
     merged_vect_params = default_vect_params if not vectorizer_params else merge(default_vect_params, vectorizer_params)
     default_clf_params = {"C": 0.1, "multi_class": "ovr", "solver": "liblinear"}
@@ -501,10 +503,24 @@ def lgbm_classification_learner(df: pd.DataFrame,
                                 target: str,
                                 learning_rate: float = 0.1,
                                 num_estimators: int = 100,
-                                extra_params: LogType = None,
+                                extra_params: Optional[LogType] = None,
                                 prediction_column: str = "prediction",
-                                weight_column: str = None,
-                                encode_extra_cols: bool = True) -> LearnerReturnType:
+                                weight_column: Optional[str] = None,
+                                encode_extra_cols: bool = True,
+                                valid_sets: Optional[List[pd.DataFrame]] = None,
+                                valid_names: Optional[List[str]] = None,
+                                feval: Optional[Union[
+                                    Callable[[np.ndarray, pd.DataFrame], Tuple[str, float, bool]],
+                                    List[Callable[[np.ndarray, pd.DataFrame], Tuple[str, float, bool]]]]
+                                ] = None,
+                                init_model: Optional[Union[str, Path, Booster]] = None,
+                                feature_name: Union[List[str], str] = 'auto',
+                                categorical_feature: Union[List[str], List[int], str] = 'auto',
+                                keep_training_booster: bool = False,
+                                callbacks: Optional[List[Callable]] = None,
+                                dataset_init_score: Optional[Union[
+                                    List, List[List], np.ndarray, pd.Series, pd.DataFrame]
+                                ] = None) -> LearnerReturnType:
     """
     Fits an LGBM classifier to the dataset.
@@ -557,6 +573,46 @@ def lgbm_classification_learner(df: pd.DataFrame,
     encode_extra_cols : bool (default: True)
         If True, treats all columns in `df` with name pattern fklearn_feat__col==val` as feature columns.
+    valid_sets : list of pandas.DataFrame, optional (default=None)
+        A list of datasets to be used for early-stopping during training.
+    valid_names : list of strings, optional (default=None)
+        A list of dataset names matching the list of datasets provided through the ``valid_sets`` parameter.
+    feval : callable, list of callable, or None, optional (default=None)
+        Customized evaluation function. Each evaluation function should accept two parameters: preds, eval_data, and
+        return (eval_name, eval_result, is_higher_better) or list of such tuples.
+    init_model : str, pathlib.Path, Booster or None, optional (default=None)
+        Filename of LightGBM model or Booster instance used for continue training.
+    feature_name : list of str, or 'auto', optional (default="auto")
+        Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.
+    categorical_feature : list of str or int, or 'auto', optional (default="auto")
+        Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need
+        to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns
+        are used. All values in categorical features will be cast to int32 and thus should be less than int32 max value
+        (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero.
+        All negative values in categorical features will be treated as missing values. The output cannot be
+        monotonically constrained with respect to a categorical feature. Floating point numbers in categorical features
+        will be rounded towards 0.
+    keep_training_booster : bool, optional (default=False)
+        Whether the returned Booster will be used to keep training. If False, the returned value will be converted into
+        _InnerPredictor before returning. This means you won’t be able to use eval, eval_train or eval_valid methods of
+        the returned Booster. When your model is very large and cause the memory error, you can try to set this param to
+        True to avoid the model conversion performed during the internal call of model_to_string. You can still use
+        _InnerPredictor as init_model for future continue training.
+    callbacks : list of callable, or None, optional (default=None)
+        List of callback functions that are applied at each iteration. See Callbacks in LightGBM Python API for more
+        information.
+    dataset_init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for
+        multi-class task), or None, optional (default=None)
+        Init score for Dataset. It could be the prediction of the majority class or a prediction from any other model.
     """
     import lightgbm as lgbm
@@ -570,9 +626,12 @@ def lgbm_classification_learner(df: pd.DataFrame,
     features = features if not encode_extra_cols else expand_features_encoded(df, features)
     dtrain = lgbm.Dataset(df[features].values, label=df[target], feature_name=list(map(str, features)), weight=weights,
-                          silent=True)
+                          silent=True, init_score=dataset_init_score)
-    bst = lgbm.train(params, dtrain, num_estimators)
+    bst = lgbm.train(params=params, train_set=dtrain, num_boost_round=num_estimators, valid_sets=valid_sets,
+                     valid_names=valid_names, feval=feval, init_model=init_model, feature_name=feature_name,
+                     categorical_feature=categorical_feature, keep_training_booster=keep_training_booster,
+                     callbacks=callbacks)
     def p(new_df: pd.DataFrame, apply_shap: bool = False) -> pd.DataFrame:
         if params["objective"] == "multiclass":
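
The `feval` contract documented in the hunk above (accept `preds` and `eval_data`, return `(eval_name, eval_result, is_higher_better)`) can be sketched as follows. This is an illustration under stated assumptions: `LabelledData` is a hypothetical stand-in that exposes only `get_label()`, as LightGBM's `Dataset` does, and `preds` are assumed to arrive as probabilities, which in practice depends on the objective and the LightGBM version.

```python
from typing import List, Sequence, Tuple


class LabelledData:
    """Hypothetical stand-in for the evaluation data; only get_label() is used."""

    def __init__(self, labels: Sequence[int]) -> None:
        self._labels = list(labels)

    def get_label(self) -> List[int]:
        return self._labels


def error_rate(preds: Sequence[float], eval_data: LabelledData) -> Tuple[str, float, bool]:
    """Custom metric: share of rows misclassified at a 0.5 threshold."""
    labels = eval_data.get_label()
    wrong = sum(1 for p, y in zip(preds, labels) if (p >= 0.5) != bool(y))
    # The final False signals that lower values are better.
    return "error_rate", wrong / len(labels), False


result = error_rate([0.9, 0.2, 0.6, 0.4], LabelledData([1, 0, 0, 1]))
print(result)
```

A function of this shape is what the new `feval` parameter is documented to accept, either on its own or as one element of a list.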

{fklearn-2.2.0 → fklearn-2.3.0}/src/fklearn/training/transformation.py RENAMED

@@ -1027,7 +1027,7 @@ def missing_warner(df: pd.DataFrame, cols_list: List[str],
     cols_without_missing = df_selected.loc[:, df_selected.isna().sum(axis=0) == 0].columns.tolist()
     def p(dataset: pd.DataFrame) -> pd.DataFrame:
-        def detailed_assignment(df: pd.DataFrame, cols_to_check: List[str]) -> np.array:
+        def detailed_assignment(df: pd.DataFrame, cols_to_check: List[str]) -> np.ndarray:
             cols_with_missing = np.array([np.where(df[col].isna(), col, "") for col in cols_to_check]).T
             missing_by_row_list = np.array([list(filter(None, x)) for x in cols_with_missing]).reshape(-1, 1)
             if missing_by_row_list.size == 0:

{fklearn-2.2.0 → fklearn-2.3.0}/src/fklearn/training/unsupervised.py RENAMED

@@ -42,13 +42,17 @@ def isolation_forest_learner(df: pd.DataFrame,
         If True, treats all columns in `df` with name pattern fklearn_feat__col==val` as feature columns.
     """
-    default_params = {"n_jobs": -1, "random_state": 1729, "contamination": 0.1, "behaviour": "new"}
+    model = IsolationForest()
+    default_params: Dict[str, Any] = {"n_jobs": -1, "random_state": 1729, "contamination": 0.1}
+    # Remove this when we stop supporting scikit-learn<0.24 as this param is deprecated
+    if "behaviour" in model.get_params():
+        default_params["behaviour"] = "new"
     params = default_params if not params else merge(default_params, params)
+    model.set_params(**params)
     features = features if not encode_extra_cols else expand_features_encoded(df, features)
-    model = IsolationForest()
-    model.set_params(**params)
     model.fit(df[features].values)
     def p(new_df: pd.DataFrame) -> pd.DataFrame:
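
The compatibility check in this hunk (only set `behaviour` when the estimator still exposes it) can be sketched with stub estimators. `LegacyEstimator`, `ModernEstimator` and `build_default_params` are hypothetical names used for illustration; they stand in for `IsolationForest` on older and newer scikit-learn releases and are not part of either library.

```python
from typing import Any, Dict


class LegacyEstimator:
    """Stub for an estimator that still accepts the 'behaviour' parameter."""

    def get_params(self) -> Dict[str, Any]:
        return {"n_jobs": None, "behaviour": "deprecated"}


class ModernEstimator:
    """Stub for an estimator where 'behaviour' has been removed."""

    def get_params(self) -> Dict[str, Any]:
        return {"n_jobs": None}


def build_default_params(model: Any) -> Dict[str, Any]:
    """Add 'behaviour' only if the estimator reports it as a parameter."""
    params: Dict[str, Any] = {"n_jobs": -1, "random_state": 1729, "contamination": 0.1}
    if "behaviour" in model.get_params():
        params["behaviour"] = "new"
    return params


legacy_params = build_default_params(LegacyEstimator())
modern_params = build_default_params(ModernEstimator())
print(legacy_params)
print(modern_params)
```

Probing `get_params()` instead of comparing version strings keeps the check tied to what the installed estimator actually supports.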

fklearn-2.3.0/src/fklearn/validation/__init__.py ADDED

File without changes
