data-manipulation-utilities 0.2.0__py3-none-any.whl → 0.2.2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/METADATA +113 -6
- {data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/RECORD +15 -12
- dmu/ml/cv_classifier.py +2 -1
- dmu/ml/cv_predict.py +50 -2
- dmu/ml/train_mva.py +216 -26
- dmu/ml/utilities.py +8 -0
- dmu/pdataframe/utilities.py +36 -0
- dmu/plotting/matrix.py +157 -0
- dmu/plotting/utilities.py +33 -0
- dmu/rdataframe/utilities.py +1 -1
- dmu_data/ml/tests/train_mva.yaml +30 -15
- {data_manipulation_utilities-0.2.0.data → data_manipulation_utilities-0.2.2.data}/scripts/publish +0 -0
- {data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/WHEEL +0 -0
- {data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/entry_points.txt +0 -0
- {data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/top_level.txt +0 -0
{data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/METADATA
RENAMED
@@ -1,11 +1,11 @@
 Metadata-Version: 2.2
 Name: data_manipulation_utilities
-Version: 0.2.0
+Version: 0.2.2
 Description-Content-Type: text/markdown
 Requires-Dist: logzero
 Requires-Dist: PyYAML
 Requires-Dist: scipy
-Requires-Dist: awkward
+Requires-Dist: awkward
 Requires-Dist: tqdm
 Requires-Dist: joblib
 Requires-Dist: scikit-learn
@@ -423,9 +423,21 @@ obj.run()
 where the settings for the training go in a config dictionary, which when written to YAML looks like:
 
 ```yaml
+dataset:
+    # Before training, new features can be defined as below
+    define :
+        x : v + w
+        y : v - w
+    # If the key is found to be NaN, replace its value with the number provided
+    # This will be used in the training.
+    # Otherwise the entries with NaNs will be dropped
+    nan:
+        x : 0
+        y : 0
+        z : -999
 training :
     nfold    : 10
-    features : [
+    features : [x, y, z]
     hyper    :
       loss              : log_loss
      n_estimators       : 100
@@ -433,8 +445,25 @@ training :
       learning_rate     : 0.1
       min_samples_split : 2
     saving:
+        # The actual model names are model_001.pkl, model_002.pkl, etc, one for each fold
         path : 'tests/ml/train_mva/model.pkl'
 plotting:
+    roc  :
+        min : [0.0, 0.0] # Optional, controls where the ROC curve starts and ends
+        max : [1.2, 1.2] # By default it goes from 0 to 1 in both axes
+        # The section below is optional and will annotate the ROC curve with
+        # values for the score at different signal efficiencies
+        annotate:
+            sig_eff : [0.5, 0.6, 0.7, 0.8, 0.9] # Values of signal efficiency at which to show the scores
+            form    : '{:.2f}' # Use two decimals for scores
+            color   : 'green'  # Color for text and marker
+            xoff    : -15      # Offsets in X and Y
+            yoff    : -15
+            size    : 10       # Size of text
+    correlation: # Adds correlation matrix for training datasets
+        title      : 'Correlation matrix'
+        size       : [10, 10]
+        mask_value : 0 # Where correlation is zero, the bin will appear white
     val_dir : 'tests/ml/train_mva'
     features:
         saving:
@@ -468,7 +497,9 @@ When training on real data, several things might go wrong and the code will try
 will end up in different folds. The tool checks whether a model is evaluated for an entry that was used for training and raises an exception. Thus, repeated
 entries will be removed before training.
 
-- **NaNs**: Entries with NaNs will break the training with the scikit GradientBoostClassifier base class. Thus, we
+- **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
+    - Can use the `nan` section shown above to replace `NaN` values with something else
+    - For whatever remains we remove the entries from the training.
 
 ## Application
 
@@ -491,17 +522,56 @@ The picking process happens through the comparison of hashes between the samples
 The hashes of the training samples are stored in the pickled model itself; which therefore is a reimplementation of
 `GradientBoostClassifier`, here called `CVClassifier`.
 
-If a sample
+If a sample exists that was used in the training of _every_ model, no model can be chosen for the prediction and a
 `CVSameData` exception will be raised.
 
+During training, the configuration will be stored in the model. Therefore, variable definitions can be picked up for evaluation
+from that configuration and the user does not need to define extra columns.
+
 ### Caveats
 
 When evaluating the model with real data, problems might occur; we deal with them as follows:
 
 - **Repeated entries**: When there are repeated features in the dataset to be evaluated we assign the same probabilities, no filtering is used.
-- **NaNs**: Entries with NaNs will break the evaluation. These entries will be
+- **NaNs**: Entries with NaNs will break the evaluation. These entries will be:
+    - Replaced by other values before evaluation IF a replacement was specified during training. The training configuration is stored in the model
+      and can be accessed through:
+      ```python
+      model.cfg
+      ```
+    - For whatever entries that are still NaN, they will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
      saved as -1. I.e. entries with NaNs will have probabilities of -1.
 
+# Pandas dataframes
+
+## Utilities
+
+These are thin layers of code that take pandas dataframes and carry out specific tasks.
+
+### Dataframe to latex
+
+One can save a dataframe to latex with:
+
+```python
+import pandas as pnd
+import dmu.pdataframe.utilities as put
+
+d_data = {}
+d_data['a'] = [1,2,3]
+d_data['b'] = [4,5,6]
+df = pnd.DataFrame(d_data)
+
+d_format = {
+    'a' : '{:.0f}',
+    'b' : '{:.3f}'}
+
+put.df_to_tex(df,
+              './table.tex',
+              d_format = d_format,
+              caption  = 'some caption')
+```
+
 # Rdataframes
 
 These are utility functions meant to be used with ROOT dataframes.
@@ -653,6 +723,43 @@ axes:
       label : 'y'
 ```
 
+# Other plots
+
+## Matrices
+
+This can be done with `MatrixPlotter`, whose usage is illustrated below:
+
+```python
+import numpy
+import matplotlib.pyplot as plt
+
+from dmu.plotting.matrix import MatrixPlotter
+
+cfg = {
+        'labels'     : ['x', 'y', 'z'], # Used to label the matrix axes
+        'title'      : 'Some title',    # Optional, title of plot
+        'label_angle': 45,              # Labels will be rotated by 45 degrees
+        'upper'      : True,            # Useful in case this is a symmetric matrix
+        'zrange'     : [0, 10],         # Controls the z axis range
+        'size'       : [7, 7],          # Plot size
+        'format'     : '{:.3f}',        # Optional, if used will add numerical values to the contents, otherwise a color bar is used
+        'fontsize'   : 12,              # Font size associated to `format`
+        'mask_value' : 0,               # These values will appear white in the plot
+        }
+
+mat = [
+        [1, 2, 3],
+        [2, 0, 4],
+        [3, 4, numpy.nan]
+        ]
+
+mat = numpy.array(mat)
+
+obj = MatrixPlotter(mat=mat, cfg=cfg)
+obj.plot()
+plt.show()
+```
+
 # Manipulating ROOT files
 
 ## Getting trees from file
{data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/RECORD
RENAMED
@@ -1,16 +1,19 @@
-data_manipulation_utilities-0.2.
+data_manipulation_utilities-0.2.2.data/scripts/publish,sha256=-3K_Y2_4CfWCV50rPB8CRuhjxDu7xMGswinRwPovgLs,1976
 dmu/arrays/utilities.py,sha256=PKoYyybPptA2aU-V3KLnJXBudWxTXu4x1uGdIMQ49HY,1722
 dmu/generic/utilities.py,sha256=0Xnq9t35wuebAqKxbyAiMk1ISB7IcXK4cFH25MT1fgw,1741
 dmu/logging/log_store.py,sha256=umdvjNDuV3LdezbG26b0AiyTglbvkxST19CQu9QATbA,4184
-dmu/ml/cv_classifier.py,sha256=
-dmu/ml/cv_predict.py,sha256=
-dmu/ml/train_mva.py,sha256=
-dmu/ml/utilities.py,sha256=
+dmu/ml/cv_classifier.py,sha256=8Jwx6xMhJaRLktlRdq0tFl32v6t8i63KmpxrlnXlomU,3759
+dmu/ml/cv_predict.py,sha256=4G7F_1yOvnLftsDC6zUpdvkxuHXGkPemhj0RsYySYDM,6708
+dmu/ml/train_mva.py,sha256=SZ5cQHl7HBxn0c5Hh4HlN1aqMZaJUAlNmsfjnUSQrTg,16894
+dmu/ml/utilities.py,sha256=l348bufD95CuSYdIrHScQThIy2nKwGKXZn-FQg3CEwg,3930
+dmu/pdataframe/utilities.py,sha256=ypvLiFfJ82ga94qlW3t5dXnvEFwYOXnbtJb2zHwsbqk,987
+dmu/plotting/matrix.py,sha256=pXuUJn-LgOvrI9qGkZQw16BzLjOjeikYQ_ll2VIcIXU,4978
 dmu/plotting/plotter.py,sha256=ytMxtzHEY8ZFU0ZKEBE-ROjMszXl5kHTMnQnWe173nU,7208
 dmu/plotting/plotter_1d.py,sha256=g6H2xAgsL9a6vRkpbqHICb3qwV_qMiQPZxxw_oOSf9M,5115
 dmu/plotting/plotter_2d.py,sha256=J-gKnagoHGfJFU7HBrhDFpGYH5Rxy0_zF5l8eE_7ZHE,2944
+dmu/plotting/utilities.py,sha256=SI9dvtZq2gr-PXVz71KE4o0i09rZOKgqJKD1jzf6KXk,1167
 dmu/rdataframe/atr_mgr.py,sha256=FdhaQWVpsm4OOe1IRbm7rfrq8VenTNdORyI-lZ2Bs1M,2386
-dmu/rdataframe/utilities.py,sha256=
+dmu/rdataframe/utilities.py,sha256=MDY3u_y0s-ANvHAWRzGyeuuZUKoaqilfmb8mqlgfrVc,2771
 dmu/rfile/rfprinter.py,sha256=mp5jd-oCJAnuokbdmGyL9i6tK2lY72jEfROuBIZ_ums,3941
 dmu/rfile/utilities.py,sha256=XuYY7HuSBj46iSu3c60UYBHtI6KIPoJU_oofuhb-be0,945
 dmu/stats/fitter.py,sha256=vHNZ16U3apoQyeyM8evq-if49doF48sKB3q9wmA96Fw,18387
@@ -23,7 +26,7 @@ dmu/stats/zfit_plotter.py,sha256=Xs6kisNEmNQXhYRCcjowxO6xHuyAyrfyQIFhGAR61U4,197
 dmu/testing/utilities.py,sha256=WbMM4e9Cn3-B-12Vr64mB5qTKkV32joStlRkD-48lG0,3460
 dmu/text/transformer.py,sha256=4lrGknbAWRm0-rxbvgzOO-eR1-9bkYk61boJUEV3cQ0,6100
 dmu_data/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-dmu_data/ml/tests/train_mva.yaml,sha256=
+dmu_data/ml/tests/train_mva.yaml,sha256=k5H4Gu9Gj57B9iqabhcTQEFN674Cv_uJ2Xcumb02zF4,1279
 dmu_data/plotting/tests/2d.yaml,sha256=VApcAfJFbjNcjMCTBSRm2P37MQlGavMZv6msbZwLSgw,402
 dmu_data/plotting/tests/fig_size.yaml,sha256=7ROq49nwZ1A2EbPiySmu6n3G-Jq6YAOkc3d2X3YNZv0,294
 dmu_data/plotting/tests/high_stat.yaml,sha256=bLglBLCZK6ft0xMhQ5OltxE76cWsBMPMjO6GG0OkDr8,522
@@ -44,8 +47,8 @@ dmu_scripts/rfile/compare_root_files.py,sha256=T8lDnQxsRNMr37x1Y7YvWD8ySHrJOWZki
 dmu_scripts/rfile/print_trees.py,sha256=Ze4Ccl_iUldl4eVEDVnYBoe4amqBT1fSBR1zN5WSztk,941
 dmu_scripts/ssh/coned.py,sha256=lhilYNHWRCGxC-jtyJ3LQ4oUgWW33B2l1tYCcyHHsR0,4858
 dmu_scripts/text/transform_text.py,sha256=9akj1LB0HAyopOvkLjNOJiptZw5XoOQLe17SlcrGMD0,1456
-data_manipulation_utilities-0.2.
-data_manipulation_utilities-0.2.
-data_manipulation_utilities-0.2.
-data_manipulation_utilities-0.2.
-data_manipulation_utilities-0.2.
+data_manipulation_utilities-0.2.2.dist-info/METADATA,sha256=0QwhQmQML65qk2kaXf1znMZOVNuvaY3l35E7cXLRCZ8,27359
+data_manipulation_utilities-0.2.2.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
+data_manipulation_utilities-0.2.2.dist-info/entry_points.txt,sha256=1TIZDed651KuOH-DgaN5AoBdirKmrKE_oM1b6b7zTUU,270
+data_manipulation_utilities-0.2.2.dist-info/top_level.txt,sha256=n_x5J6uWtSqy9mRImKtdA2V2NJNyU8Kn3u8DTOKJix0,25
+data_manipulation_utilities-0.2.2.dist-info/RECORD,,
dmu/ml/cv_classifier.py
CHANGED
@@ -2,6 +2,7 @@
 Module holding cv_classifier class
 '''
 
+from typing import Union
 from sklearn.ensemble import GradientBoostingClassifier
 
 from dmu.logging.log_store import LogStore
@@ -22,7 +23,7 @@ class CVClassifier(GradientBoostingClassifier):
     '''
     # pylint: disable = too-many-ancestors, abstract-method
     # ----------------------------------
-    def __init__(self, cfg : dict
+    def __init__(self, cfg : Union[dict,None] = None):
         '''
         cfg (dict) : Dictionary with configuration, specially the hyperparameters set in the `hyper` field
         '''
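The change above makes the configuration argument optional. A minimal sketch of what that allows; the dictionary layout is only illustrative, since the exact keys `CVClassifier` reads from `cfg` (beyond the `hyper` field named in its docstring) are not visible in this diff:

```python
from dmu.ml.cv_classifier import CVClassifier

# TrainMva passes its full configuration dictionary, so the same is done here;
# which keys CVClassifier actually reads is an assumption
cfg   = {'training' : {'hyper' : {'loss' : 'log_loss', 'n_estimators' : 100}}}
model = CVClassifier(cfg=cfg)

# New in 0.2.2: cfg defaults to None, so a bare instantiation is now possible
model_default = CVClassifier()
```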
dmu/ml/cv_predict.py
CHANGED
@@ -32,11 +32,56 @@ class CVPredict:
         if rdf is None:
             raise ValueError('No ROOT dataframe passed')
 
-        self._l_model
-        self._rdf
+        self._l_model = models
+        self._rdf     = rdf
+        self._d_nan_rep : dict[str,str]
 
         self._arr_patch : numpy.ndarray
     # --------------------------------------------
+    def _initialize(self):
+        self._rdf       = self._define_columns(self._rdf)
+        self._d_nan_rep = self._get_nan_replacements()
+    # --------------------------------------------
+    def _define_columns(self, rdf : RDataFrame) -> RDataFrame:
+        cfg = self._l_model[0].cfg
+
+        if 'define' not in cfg['dataset']:
+            log.debug('No define section found in config, will not define extra columns')
+            return self._rdf
+
+        d_def = cfg['dataset']['define']
+        log.debug(60 * '-')
+        log.info('Defining columns in RDF before evaluating classifier')
+        log.debug(60 * '-')
+        for name, expr in d_def.items():
+            log.debug(f'{name:<20}{"<---":20}{expr:<100}')
+            rdf = rdf.Define(name, expr)
+
+        return rdf
+    # --------------------------------------------
+    def _get_nan_replacements(self) -> dict[str,str]:
+        cfg = self._l_model[0].cfg
+
+        if 'nan' not in cfg['dataset']:
+            log.debug('No define section found in config, will not define extra columns')
+            return {}
+
+        return cfg['dataset']['nan']
+    # --------------------------------------------
+    def _replace_nans(self, df : pnd.DataFrame) -> pnd.DataFrame:
+        if len(self._d_nan_rep) == 0:
+            log.debug('Not doing any NaN replacement')
+            return df
+
+        log.debug(60 * '-')
+        log.info('Doing NaN replacements')
+        log.debug(60 * '-')
+        for var, val in self._d_nan_rep.items():
+            log.debug(f'{var:<20}{"--->":20}{val:<20.3f}')
+            df[var] = df[var].fillna(val)
+
+        return df
+    # --------------------------------------------
     def _get_df(self):
         '''
         Will make ROOT rdf into dataframe and return it
@@ -45,6 +90,7 @@ class CVPredict:
         l_ft  = model.features
         d_data= self._rdf.AsNumpy(l_ft)
         df_ft = pnd.DataFrame(d_data)
+        df_ft = self._replace_nans(df_ft)
         df_ft = ut.patch_and_tag(df_ft)
 
         if 'patched_indices' in df_ft.attrs:
@@ -136,6 +182,8 @@ class CVPredict:
         '''
         Will return array of prediction probabilities for the signal category
         '''
+        self._initialize()
+
         df_ft = self._get_df()
         model = self._l_model[0]
 
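The new `_initialize`, `_define_columns` and `_replace_nans` steps are driven entirely by the configuration stored in the trained models. A hedged sketch of how this looks from the caller's side; the model file names follow the `model_001.pkl` convention documented in the METADATA above, and the name of the public prediction method (`predict` here) is an assumption, not something shown in this diff:

```python
import joblib
from ROOT import RDataFrame

from dmu.ml.cv_predict import CVPredict

# Load one model per fold; fold numbering starting at 001 is assumed
l_model = [ joblib.load(f'tests/ml/train_mva/model_{ifold:03}.pkl') for ifold in range(1, 11) ]
rdf     = RDataFrame('tree', 'data.root')   # hypothetical input sample

cvp = CVPredict(models=l_model, rdf=rdf)

# Columns listed under dataset/define in l_model[0].cfg are added to the RDF,
# values under dataset/nan are used to fill NaNs, and anything still NaN is
# patched with zeros and returned with probability -1
arr_prb = cvp.predict()   # assumed method name
```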
dmu/ml/train_mva.py
CHANGED
@@ -1,8 +1,10 @@
 '''
 Module with TrainMva class
 '''
+# pylint: disable = too-many-locals
+# pylint: disable = too-many-arguments, too-many-positional-arguments
+
 import os
-from typing import Union
 
 import joblib
 import pandas as pnd
@@ -14,12 +16,17 @@ from sklearn.model_selection import StratifiedKFold
 
 from ROOT import RDataFrame
 
-import dmu.ml.utilities
+import dmu.ml.utilities         as ut
+import dmu.pdataframe.utilities as put
+import dmu.plotting.utilities   as plu
+
 from dmu.ml.cv_classifier import CVClassifier as cls
 from dmu.plotting.plotter_1d import Plotter1D as Plotter
+from dmu.plotting.matrix import MatrixPlotter
 from dmu.logging.log_store import LogStore
 
-
+npa = numpy.ndarray
+log = LogStore.add_logger('dmu:ml:train_mva')
 # ---------------------------------------------
 class TrainMva:
     '''
@@ -43,15 +50,13 @@ class TrainMva:
 
         self._rdf_bkg = bkg
         self._rdf_sig = sig
-        self._cfg = cfg
-
-        self._l_model : cls
+        self._cfg     = cfg
 
         self._l_ft_name = self._cfg['training']['features']
 
         self._df_ft, self._l_lab = self._get_inputs()
     # ---------------------------------------------
-    def _get_inputs(self) -> tuple[pnd.DataFrame,
+    def _get_inputs(self) -> tuple[pnd.DataFrame, npa]:
         log.info('Getting signal')
         df_sig, arr_lab_sig = self._get_sample_inputs(self._rdf_sig, label = 1)
 
@@ -63,15 +68,28 @@ class TrainMva:
 
         return df, arr_lab
     # ---------------------------------------------
-    def
+    def _pre_process_nans(self, df : pnd.DataFrame) -> pnd.DataFrame:
+        if 'nan' not in self._cfg['dataset']:
+            log.debug('dataset/nan section not found, not pre-processing NaNs')
+            return df
+
+        d_name_val = self._cfg['dataset']['nan']
+        for name, val in d_name_val.items():
+            log.debug(f'{val:<20}{"<---":<10}{name:<100}')
+            df[name] = df[name].fillna(val)
+
+        return df
+    # ---------------------------------------------
+    def _get_sample_inputs(self, rdf : RDataFrame, label : int) -> tuple[pnd.DataFrame, npa]:
         d_ft = rdf.AsNumpy(self._l_ft_name)
         df   = pnd.DataFrame(d_ft)
+        df   = self._pre_process_nans(df)
         df   = ut.cleanup(df)
         l_lab= len(df) * [label]
 
         return df, numpy.array(l_lab)
     # ---------------------------------------------
-    def _get_model(self, arr_index :
+    def _get_model(self, arr_index : npa) -> cls:
         model = cls(cfg = self._cfg)
         df_ft = self._df_ft.iloc[arr_index]
         l_lab = self._l_lab[arr_index]
@@ -84,7 +102,6 @@ class TrainMva:
         return model
     # ---------------------------------------------
     def _get_models(self):
-        # pylint: disable = too-many-locals
         '''
         Will create models, train them and return them
         '''
@@ -105,15 +122,55 @@ class TrainMva:
             arr_sig_sig_tr, arr_sig_bkg_tr, arr_sig_all_tr, arr_lab_tr = self._get_scores(model, arr_itr, on_training_ok= True)
             arr_sig_sig_ts, arr_sig_bkg_ts, arr_sig_all_ts, arr_lab_ts = self._get_scores(model, arr_its, on_training_ok=False)
 
+            self._save_feature_importance(model, ifold)
+            self._plot_correlation(arr_itr, ifold)
             self._plot_scores(arr_sig_sig_tr, arr_sig_sig_ts, arr_sig_bkg_tr, arr_sig_bkg_ts, ifold)
-
             self._plot_roc(arr_lab_ts, arr_sig_all_ts, arr_lab_tr, arr_sig_all_tr, ifold)
 
             ifold+=1
 
         return l_model
     # ---------------------------------------------
-    def
+    def _labels_from_varnames(self, l_var_name : list[str]) -> list[str]:
+        try:
+            d_plot = self._cfg['plotting']['features']['plots']
+        except ValueError:
+            log.warning('Cannot find plotting/features/plots section in config, using dataframe names')
+            return l_var_name
+
+        l_label = []
+        for var_name in l_var_name:
+            if var_name not in d_plot:
+                log.warning(f'No plot found for: {var_name}')
+                l_label.append(var_name)
+                continue
+
+            d_setting = d_plot[var_name]
+            [xlab, _ ]= d_setting['labels']
+
+            l_label.append(xlab)
+
+        return l_label
+    # ---------------------------------------------
+    def _save_feature_importance(self, model : cls, ifold : int) -> None:
+        l_var_name = self._df_ft.columns.tolist()
+
+        d_data = {}
+        d_data['Variable'  ] = self._labels_from_varnames(l_var_name)
+        d_data['Importance'] = 100 * model.feature_importances_
+
+        val_dir = self._cfg['plotting']['val_dir']
+        val_dir = f'{val_dir}/fold_{ifold:03}'
+        os.makedirs(val_dir, exist_ok=True)
+
+        df = pnd.DataFrame(d_data)
+        df = df.sort_values(by='Importance', ascending=False)
+
+        table_path = f'{val_dir}/importance.tex'
+        d_form     = {'Variable' : '{}', 'Importance' : '{:.1f}'}
+        put.df_to_tex(df, table_path, d_format = d_form)
+    # ---------------------------------------------
+    def _get_scores(self, model : cls, arr_index : npa, on_training_ok : bool) -> tuple[npa, npa, npa, npa]:
         '''
         Returns a tuple of four arrays
 
@@ -136,7 +193,7 @@ class TrainMva:
 
         return arr_sig, arr_bkg, arr_all, arr_lab
     # ---------------------------------------------
-    def _split_scores(self, arr_prob :
+    def _split_scores(self, arr_prob : npa, arr_label : npa) -> tuple[npa, npa]:
         '''
         Will split the testing scores (predictions) based on the training scores
 
@@ -151,7 +208,7 @@ class TrainMva:
 
         return arr_sig, arr_bkg
     # ---------------------------------------------
-    def _save_model(self, model, ifold):
+    def _save_model(self, model : cls, ifold : int) -> None:
         '''
         Saves a model, associated to a specific fold
         '''
@@ -168,6 +225,53 @@ class TrainMva:
         log.info(f'Saving model to: {model_path}')
         joblib.dump(model, model_path)
     # ---------------------------------------------
+    def _get_correlation_cfg(self, df : pnd.DataFrame, ifold : int) -> dict:
+        l_var_name = df.columns.tolist()
+        l_label    = self._labels_from_varnames(l_var_name)
+        cfg = {
+                'labels'     : l_label,
+                'title'      : f'Fold {ifold}',
+                'label_angle': 45,
+                'upper'      : True,
+                'zrange'     : [-1, +1],
+                'size'       : [7, 7],
+                'format'     : '{:.3f}',
+                'fontsize'   : 12,
+                }
+
+        if 'correlation' not in self._cfg['plotting']:
+            log.info('Using default correlation plotting configuration')
+            return cfg
+
+        log.debug('Updating correlation plotting configuration')
+        custom = self._cfg['plotting']['correlation']
+        cfg.update(custom)
+
+        return cfg
+    # ---------------------------------------------
+    def _plot_correlation(self, arr_index : npa, ifold : int) -> None:
+        df_ft = self._df_ft.iloc[arr_index]
+        cfg   = self._get_correlation_cfg(df_ft, ifold)
+        cov   = df_ft.corr()
+        mat   = cov.to_numpy()
+
+        log.debug(f'Plotting correlation for {ifold} fold')
+
+        val_dir = self._cfg['plotting']['val_dir']
+        val_dir = f'{val_dir}/fold_{ifold:03}'
+        os.makedirs(val_dir, exist_ok=True)
+
+        obj = MatrixPlotter(mat=mat, cfg=cfg)
+        obj.plot()
+        plt.savefig(f'{val_dir}/covariance.png')
+        plt.close()
+    # ---------------------------------------------
+    def _get_nentries(self, arr_val : npa) -> str:
+        size = len(arr_val)
+        size = size / 1000.
+
+        return f'{size:.2f}K'
+    # ---------------------------------------------
     def _plot_scores(self, arr_sig_trn, arr_sig_tst, arr_bkg_trn, arr_bkg_tst, ifold):
         # pylint: disable = too-many-arguments, too-many-positional-arguments
         '''
@@ -183,11 +287,11 @@ class TrainMva:
         val_dir = f'{val_dir}/fold_{ifold:03}'
         os.makedirs(val_dir, exist_ok=True)
 
-        plt.hist(arr_sig_trn, alpha = 0.3, bins=50, range=(0,1), color='b', density=True, label='Signal Train')
-        plt.hist(arr_sig_tst, histtype='step', bins=50, range=(0,1), color='b', density=True, label='Signal Test')
+        plt.hist(arr_sig_trn, alpha = 0.3, bins=50, range=(0,1), color='b', density=True, label='Signal Train: ' + self._get_nentries(arr_sig_trn))
+        plt.hist(arr_sig_tst, histtype='step', bins=50, range=(0,1), color='b', density=True, label='Signal Test: ' + self._get_nentries(arr_sig_tst))
 
-        plt.hist(arr_bkg_trn, alpha = 0.3, bins=50, range=(0,1), color='r', density=True, label='Background Train')
-        plt.hist(arr_bkg_tst, histtype='step', bins=50, range=(0,1), color='r', density=True, label='Background Test')
+        plt.hist(arr_bkg_trn, alpha = 0.3, bins=50, range=(0,1), color='r', density=True, label='Background Train: '+ self._get_nentries(arr_bkg_trn))
+        plt.hist(arr_bkg_tst, histtype='step', bins=50, range=(0,1), color='r', density=True, label='Background Test: ' + self._get_nentries(arr_bkg_tst))
 
         plt.legend()
         plt.title(f'Fold: {ifold}')
@@ -197,16 +301,15 @@ class TrainMva:
         plt.close()
     # ---------------------------------------------
     def _plot_roc(self,
-                  l_lab_ts :
-                  l_prb_ts :
-                  l_lab_tr :
-                  l_prb_tr :
+                  l_lab_ts : npa,
+                  l_prb_ts : npa,
+                  l_lab_tr : npa,
+                  l_prb_tr : npa,
                   ifold : int):
         '''
         Takes the labels and the probabilities and plots ROC
         curve for given fold
         '''
-        # pylint: disable = too-many-arguments, too-many-positional-arguments
         log.debug(f'Plotting ROC curve for {ifold} fold')
 
         val_dir = self._cfg['plotting']['val_dir']
@@ -226,17 +329,70 @@ class TrainMva:
         if 'min' in self._cfg['plotting']['roc']:
             [min_x, min_y] = self._cfg['plotting']['roc']['min']
 
+        max_x = 1
+        max_y = 1
+        if 'max' in self._cfg['plotting']['roc']:
+            [max_x, max_y] = self._cfg['plotting']['roc']['max']
+
         plt.plot(xval_ts, yval_ts, color='b', label=f'Test: {area_ts:.3f}')
         plt.plot(xval_tr, yval_tr, color='r', label=f'Train: {area_tr:.3f}')
+        self._plot_probabilities(xval_ts, yval_ts, l_prb_ts, l_lab_ts)
+
         plt.xlabel('Signal efficiency')
-        plt.ylabel('Background
+        plt.ylabel('Background rejection')
         plt.title(f'Fold: {ifold}')
-        plt.xlim(min_x,
-        plt.ylim(min_y,
+        plt.xlim(min_x, max_x)
+        plt.ylim(min_y, max_y)
+        plt.grid()
         plt.legend()
         plt.savefig(f'{val_dir}/roc.png')
         plt.close()
     # ---------------------------------------------
+    def _plot_probabilities(self,
+                            arr_seff: npa,
+                            arr_brej: npa,
+                            arr_sprb: npa,
+                            arr_labl: npa) -> None:
+
+        roc_cfg = self._cfg['plotting']['roc']
+        if 'annotate' not in roc_cfg:
+            log.debug('Annotation section in the ROC curve config not found, skipping annotation')
+            return
+
+        l_sprb   = [ sprb for sprb, labl in zip(arr_sprb, arr_labl) if labl == 1 ]
+        arr_sprb = numpy.array(l_sprb)
+
+        plt_cfg = roc_cfg['annotate']
+        if 'sig_eff' not in plt_cfg:
+            l_seff_target = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
+        else:
+            l_seff_target = plt_cfg['sig_eff']
+            del plt_cfg['sig_eff']
+
+        arr_seff_target = numpy.array(l_seff_target)
+        arr_quantile    = 1 - arr_seff_target
+
+        l_score = numpy.quantile(arr_sprb, arr_quantile)
+        l_seff  = []
+        l_brej  = []
+
+        log.debug(60 * '-')
+        log.debug(f'{"SigEff":20}{"BkgRej":20}{"Score":20}')
+        log.debug(60 * '-')
+        for seff_target, score in zip(arr_seff_target, l_score):
+            arr_diff = numpy.abs(arr_seff - seff_target)
+            ind      = numpy.argmin(arr_diff)
+
+            seff = arr_seff[ind]
+            brej = arr_brej[ind]
+
+            log.debug(f'{seff:<20.3f}{brej:<20.3f}{score:<20.2f}')
+
+            l_seff.append(seff)
+            l_brej.append(brej)
+
+        plu.annotate(l_x=l_seff, l_y=l_brej, l_v=l_score, **plt_cfg)
+    # ---------------------------------------------
     def _plot_features(self):
         '''
         Will plot the features, based on the settings in the config
@@ -245,10 +401,44 @@ class TrainMva:
         ptr = Plotter(d_rdf = {'Signal' : self._rdf_sig, 'Background' : self._rdf_bkg}, cfg=d_cfg)
         ptr.run()
     # ---------------------------------------------
+    def _save_settings_to_tex(self) -> None:
+        self._save_nan_conversion()
+        self._save_hyperparameters_to_tex()
+    # ---------------------------------------------
+    def _save_nan_conversion(self) -> None:
+        if 'nan' not in self._cfg['dataset']:
+            log.debug('NaN section not found, not saving it')
+            return
+
+        d_nan = self._cfg['dataset']['nan']
+        l_var = list(d_nan)
+        l_lab = self._labels_from_varnames(l_var)
+        l_val = list(d_nan.values())
+
+        d_tex   = {'Variable' : l_lab, 'Replacement' : l_val}
+        df      = pnd.DataFrame(d_tex)
+        val_dir = self._cfg['plotting']['val_dir']
+        os.makedirs(val_dir, exist_ok=True)
+        put.df_to_tex(df, f'{val_dir}/nan_replacement.tex')
+    # ---------------------------------------------
+    def _save_hyperparameters_to_tex(self) -> None:
+        if 'hyper' not in self._cfg['training']:
+            raise ValueError('Cannot find hyper parameters in configuration')
+
+        d_hyper = self._cfg['training']['hyper']
+        d_form  = { f'\\verb|{key}|' : f'\\verb|{val}|' for key, val in d_hyper.items() }
+        d_latex = { 'Hyperparameter' : list(d_form.keys()), 'Value' : list(d_form.values())}
+
+        df      = pnd.DataFrame(d_latex)
+        val_dir = self._cfg['plotting']['val_dir']
+        os.makedirs(val_dir, exist_ok=True)
+        put.df_to_tex(df, f'{val_dir}/hyperparameters.tex')
+    # ---------------------------------------------
     def run(self):
         '''
         Will do the training
         '''
+        self._save_settings_to_tex()
         self._plot_features()
 
         l_mod = self._get_models()
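For orientation, a sketch of how the new `dataset` and `plotting` sections are consumed end to end. The keyword names `sig`, `bkg` and `cfg` mirror the attributes set in the constructor above, but their order and exact spelling in the public signature are assumptions:

```python
import yaml
from ROOT import RDataFrame

from dmu.ml.train_mva import TrainMva

# The test configuration shipped with the package; any config with the same layout works
with open('dmu_data/ml/tests/train_mva.yaml', encoding='utf-8') as ifile:
    cfg = yaml.safe_load(ifile)

rdf_sig = RDataFrame('tree', 'signal.root')       # hypothetical inputs
rdf_bkg = RDataFrame('tree', 'background.root')

obj = TrainMva(sig=rdf_sig, bkg=rdf_bkg, cfg=cfg)
obj.run()
# Per fold, run() now also writes importance.tex and covariance.png (the correlation matrix),
# plus nan_replacement.tex and hyperparameters.tex in the validation directory
```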
dmu/ml/utilities.py
CHANGED
@@ -51,6 +51,14 @@ def _remove_nans(df : pnd.DataFrame) -> pnd.DataFrame:
         log.debug('No NaNs found in dataframe')
         return df
 
+    sr_is_nan = df.isna().any()
+    l_na_name = sr_is_nan[sr_is_nan].index.tolist()
+
+    log.info('Found columns with NaNs')
+    for name in l_na_name:
+        nan_count = df[name].isna().sum()
+        log.info(f'{nan_count:<10}{name:<100}')
+
     ninit = len(df)
     df    = df.dropna()
     nfinl = len(df)
dmu/pdataframe/utilities.py
ADDED
@@ -0,0 +1,36 @@
+'''
+Module containing utilities for pandas dataframes
+'''
+import os
+import pandas as pnd
+
+from dmu.logging.log_store import LogStore
+
+log=LogStore.add_logger('dmu:pdataframe:utilities')
+
+# -------------------------------------
+def df_to_tex(df : pnd.DataFrame, path : str, hide_index : bool = True, d_format : dict[str,str]=None, caption : str =None) -> None:
+    '''
+    Saves pandas dataframe to latex
+
+    Parameters
+    -------------
+    d_format (dict) : Dictionary specifying the formatting of the table, e.g. `{'col1': '{}', 'col2': '{:.3f}', 'col3' : '{:.3f}'}`
+    '''
+
+    if path is not None:
+        dir_name = os.path.dirname(path)
+        os.makedirs(dir_name, exist_ok=True)
+
+    st = df.style
+    if hide_index:
+        st=st.hide(axis='index')
+
+    if d_format is not None:
+        st=st.format(formatter=d_format)
+
+    log.info(f'Saving to: {path}')
+    buf = st.to_latex(buf=path, caption=caption, hrules=True)
+
+    return buf
+# -------------------------------------
dmu/plotting/matrix.py
ADDED
@@ -0,0 +1,157 @@
+'''
+Module holding the MatrixPlotter class
+'''
+from typing import Annotated
+import numpy
+import numpy.typing as npt
+import matplotlib.pyplot as plt
+
+from dmu.logging.log_store import LogStore
+
+Array2D = Annotated[npt.NDArray[numpy.float64], '(n,n)']
+log     = LogStore.add_logger('dmu:plotting:matrix')
+#-------------------------------------------------------
+class MatrixPlotter:
+    '''
+    Class used to plot matrices
+    '''
+    # -----------------------------------------------
+    def __init__(self, mat : Array2D, cfg : dict):
+        self._mat = mat
+        self._cfg = cfg
+
+        self._size    : int
+        self._l_label : list[str]
+    # -----------------------------------------------
+    def _initialize(self) -> None:
+        self._check_matrix()
+        self._reformat_matrix()
+        self._set_labels()
+        self._mask_matrix()
+    # -----------------------------------------------
+    def _mask_matrix(self) -> None:
+        if 'mask_value' not in self._cfg:
+            return
+
+        mask_val = self._cfg['mask_value']
+        log.debug(f'Masking value: {mask_val}')
+
+        self._mat = numpy.ma.masked_where(self._mat == mask_val, self._mat)
+    # -----------------------------------------------
+    def _check_matrix(self) -> None:
+        a, b = self._mat.shape
+
+        if a != b:
+            raise ValueError(f'Matrix is not square, but with shape: {a}x{b}')
+
+        self._size = a
+    # -----------------------------------------------
+    def _set_labels(self) -> None:
+        if 'labels' not in self._cfg:
+            raise ValueError('Labels entry missing')
+
+        l_lab = self._cfg['labels']
+        nlab  = len(l_lab)
+
+        if nlab != self._size:
+            raise ValueError(f'Number of labels is not equal to its size: {nlab}!={self._size}')
+
+        self._l_label = l_lab
+    # -----------------------------------------------
+    def _reformat_matrix(self) -> None:
+        if 'upper' not in self._cfg:
+            log.debug('Drawing full matrix')
+            return
+
+        upper = self._cfg['upper']
+        if upper not in [True, False]:
+            raise ValueError(f'Invalid value for upper setting: {upper}')
+
+        if upper:
+            log.debug('Drawing upper matrix')
+            self._mat = numpy.triu(self._mat, 0)
+            return
+
+        if not upper:
+            log.debug('Drawing lower matrix')
+            self._mat = numpy.triu(self._mat, 0)
+            return
+    # -----------------------------------------------
+    def _set_axes(self, ax) -> None:
+        ax.set_xticks(numpy.arange(self._size))
+        ax.set_yticks(numpy.arange(self._size))
+
+        ax.set_xticklabels(self._l_label)
+        ax.set_yticklabels(self._l_label)
+
+        rotation = 45
+        if 'label_angle' in self._cfg:
+            rotation = self._cfg['label_angle']
+
+        plt.setp(ax.get_xticklabels(), rotation=rotation, ha="right", rotation_mode="anchor")
+    # -----------------------------------------------
+    def _draw_matrix(self) -> None:
+        fsize = None
+        if 'size' in self._cfg:
+            fsize = self._cfg['size']
+
+        if 'zrange' not in self._cfg:
+            raise ValueError('z range not found in configuration')
+
+        [zmin, zmax] = self._cfg['zrange']
+
+        fig, ax = plt.subplots() if fsize is None else plt.subplots(figsize=fsize)
+
+        palette = plt.cm.viridis
+        im      = ax.imshow(self._mat, cmap=palette, vmin=zmin, vmax=zmax)
+        self._set_axes(ax)
+
+        if 'format' in self._cfg:
+            self._add_text(ax)
+        else:
+            log.debug('Not adding values to matrix but bar')
+            fig.colorbar(im)
+
+        if 'title' not in self._cfg:
+            return
+
+        title = self._cfg['title']
+        ax.set_title(title)
+        fig.tight_layout()
+    # -----------------------------------------------
+    def _add_text(self, ax):
+        fontsize = 12
+        if 'fontsize' in self._cfg:
+            fontsize = self._cfg['fontsize']
+
+        form = self._cfg['format']
+        log.debug(f'Adding values with format {form}')
+
+        for i_x, _ in enumerate(self._l_label):
+            for i_y, _ in enumerate(self._l_label):
+                try:
+                    val = self._mat[i_y, i_x]
+                except:
+                    log.error(f'Cannot access ({i_x}, {i_y}) in:')
+                    print(self._mat)
+                    raise
+
+                if numpy.ma.is_masked(val):
+                    text = ''
+                else:
+                    text = form.format(val)
+
+                _ = ax.text(i_x, i_y, text, ha="center", va="center", fontsize=fontsize, color="k")
+    # -----------------------------------------------
+    def plot(self):
+        '''
+        Runs plotting, plot can be accessed through:
+
+        ```python
+        plt.show()
+        plt.savefig(...)
+        ```
+        '''
+        self._initialize()
+        self._draw_matrix()
+#-------------------------------------------------------
dmu/plotting/utilities.py
ADDED
@@ -0,0 +1,33 @@
+'''
+Module with plotting utilities
+'''
+# pylint: disable=too-many-positional-arguments, too-many-arguments
+
+import matplotlib.pyplot as plt
+
+# ---------------------------------------------------------------------------
+def annotate(
+        l_x   : list[float],
+        l_y   : list[float],
+        l_v   : list[float],
+        form  : str = '{}',
+        xoff  : int = 0,
+        yoff  : int =-20,
+        size  : int = 20,
+        color : str = 'black') -> None:
+    '''
+    Function used to annotate plots
+
+    l_x(y)  : List of x(y) coordinates for markers
+    l_v     : List of numerical values to annotate markers
+    form    : Formatting, e.g. {:.3f}
+    color   : String with color for markers and annotation, e.g. black
+    size    : Font size, default 20
+    x(y)off : Offset in x(y).
+    '''
+    for x, y, v in zip(l_x, l_y, l_v):
+        label = form.format(v)
+
+        plt.plot(x, y, marker='o', markersize= 5, markeredgecolor=color, markerfacecolor=color)
+        plt.annotate(label, (x,y), fontsize=size, textcoords="offset points", xytext=(xoff, yoff), color=color, ha='center')
+# ---------------------------------------------------------------------------
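Since the README does not show `annotate` being called directly, here is a small usage sketch; the coordinates and values are made up, while the keyword arguments match the signature above:

```python
import matplotlib.pyplot as plt

import dmu.plotting.utilities as plu

# Hypothetical working points: x/y positions of the markers and the value written next to each
l_x = [0.5, 0.7, 0.9]
l_y = [0.98, 0.93, 0.80]
l_v = [0.86, 0.71, 0.42]

plt.plot([0.0, 1.0], [1.0, 0.0], color='gray')   # some curve to annotate
plu.annotate(l_x=l_x, l_y=l_y, l_v=l_v, form='{:.2f}', color='green', xoff=-15, yoff=-15, size=10)
plt.savefig('annotated.png')
```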
dmu/rdataframe/utilities.py
CHANGED
dmu_data/ml/tests/train_mva.yaml
CHANGED
@@ -1,5 +1,8 @@
+dataset:
+    nan :
+        x : 0
 training :
-    nfold    : 3
+    nfold    : 3
     features : [x, y, z]
     rdm_stat : 1
     hyper    :
@@ -7,31 +10,43 @@ training :
       n_estimators  : 100
      max_depth      : 3
      learning_rate  : 0.1
-      min_samples_split : 2
+      min_samples_split : 2
     saving:
-        path : '
+        path : '/tmp/dmu/ml/tests/train_mva/model.pkl'
 plotting:
     roc  :
-        min : [0, 0]
-
+        min : [0.0, 0.0]
+        max : [1.2, 1.2]
+        annotate:
+          sig_eff : [0.5, 0.6, 0.7, 0.8, 0.9]
+          form : '{:.2f}'
+          color: 'green'
+          xoff : -15
+          yoff : -15
+          size : 10
+    correlation:
+        title      : 'Correlation matrix'
+        size       : [10, 10]
+        mask_value : 0
+    val_dir : '/tmp/dmu/ml/tests/train_mva'
     features:
         saving:
-            plt_dir : '
+            plt_dir : '/tmp/dmu/ml/tests/train_mva/features'
        plots:
-            w :
+            w :
                binning : [-4, 4, 100]
-                yscale  : 'linear'
+                yscale  : 'linear'
                labels  : ['w', '']
-            x :
+            x :
                binning : [-4, 4, 100]
-                yscale  : 'linear'
+                yscale  : 'linear'
                labels  : ['x', '']
-            y :
+            y :
                binning : [-4, 4, 100]
-                yscale  : 'linear'
+                yscale  : 'linear'
                labels  : ['y', '']
-            z :
+            z :
                binning : [-4, 4, 100]
-                yscale  : 'linear'
+                yscale  : 'linear'
                labels  : ['z', '']
-
+
{data_manipulation_utilities-0.2.0.data → data_manipulation_utilities-0.2.2.data}/scripts/publish
RENAMED
File without changes
{data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/WHEEL
RENAMED
File without changes
{data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/entry_points.txt
RENAMED
File without changes
{data_manipulation_utilities-0.2.0.dist-info → data_manipulation_utilities-0.2.2.dist-info}/top_level.txt
RENAMED
File without changes