data-manipulation-utilities 0.1.9__py3-none-any.whl → 0.2.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,11 +1,11 @@
1
1
  Metadata-Version: 2.2
2
2
  Name: data_manipulation_utilities
3
- Version: 0.1.9
3
+ Version: 0.2.1
4
4
  Description-Content-Type: text/markdown
5
5
  Requires-Dist: logzero
6
6
  Requires-Dist: PyYAML
7
7
  Requires-Dist: scipy
8
- Requires-Dist: awkward
8
+ Requires-Dist: awkward==2.4.6
9
9
  Requires-Dist: tqdm
10
10
  Requires-Dist: joblib
11
11
  Requires-Dist: scikit-learn
@@ -204,6 +204,33 @@ print_pdf(pdf,
204
204
 
205
205
  The `Fitter` class is a wrapper around zfit, used to make fitting easier.
206
206
 
207
+ ### Goodness of fits
208
+
209
+ Once a fit has been done, one can use `GofCalculator` to get a rough estimate of the fit quality.
210
+ This is done by:
211
+
212
+ - Binning the data and PDF.
213
+ - Calculating the reduced $\chi^2$.
214
+ - Using the $\chi^2$ and the number of degrees of freedom to get the p-value.
215
+
216
+ This class is used as shown below:
217
+
218
+ ```python
219
+ from dmu.stats.gof_calculator import GofCalculator
220
+
221
+ nll = _get_nll()
222
+ res = Data.minimizer.minimize(nll)
223
+
224
+ gcl = GofCalculator(nll, ndof=10)
225
+ gof = gcl.get_gof(kind='pvalue')
226
+ ```
227
+
228
+ where:
229
+
230
+ - `ndof` is the number of degrees of freedom used in the reduced $\chi^2$ calculation.
231
+ It is needed to know how many bins to use when building the histogram; the recommended value is 10.
232
+ - `kind` selects what is returned, either `pvalue` or `chi2/ndof`.
233
+
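For reference, the last step above (turning a $\chi^2$ value and a number of degrees of freedom into a p-value) amounts to evaluating the $\chi^2$ survival function. A minimal sketch with `scipy` is shown below; the numbers are made up and this is an illustration, not the `GofCalculator` internals:

```python
# Illustration only: convert a chi2 value and the number of degrees of
# freedom into a reduced chi2 and a p-value.
from scipy.stats import chi2 as chi2_dist

chi2_val = 12.3                          # hypothetical chi2 from comparing binned data and PDF
ndof     = 10                            # degrees of freedom, as in `ndof` above

red_chi2 = chi2_val / ndof               # reduced chi2
pvalue   = chi2_dist.sf(chi2_val, ndof)  # survival function, i.e. 1 - CDF

print(f'chi2/ndof = {red_chi2:.2f}, p-value = {pvalue:.3f}')
```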
207
234
  ### Simplest fit
208
235
 
209
236
  ```python
@@ -396,6 +423,14 @@ obj.run()
396
423
  where the settings for the training go in a config dictionary, which when written to YAML looks like:
397
424
 
398
425
  ```yaml
426
+ dataset:
427
+ # If the value of a given feature is NaN, replace it with the number provided;
428
+ # the replaced value will then be used in the training.
429
+ # Otherwise the entries with NaNs will be dropped (see the pandas sketch further below)
430
+ nan:
431
+ x : 0
432
+ y : 0
433
+ z : -999
399
434
  training :
400
435
  nfold : 10
401
436
  features : [w, x, y, z]
@@ -406,8 +441,25 @@ training :
406
441
  learning_rate : 0.1
407
442
  min_samples_split : 2
408
443
  saving:
444
+ # The actual model names are model_001.pkl, model_002.pkl, etc, one for each fold
409
445
  path : 'tests/ml/train_mva/model.pkl'
410
446
  plotting:
447
+ roc :
448
+ min : [0.0, 0.0] # Optional, sets the lower limits of the x and y axes
449
+ max : [1.2, 1.2] # Optional, sets the upper limits; by default both axes go from 0 to 1
450
+ # The section below is optional and will annotate the ROC curve with
451
+ # values for the score at different signal efficiencies
452
+ annotate:
453
+ sig_eff : [0.5, 0.6, 0.7, 0.8, 0.9] # Values of signal efficiency at which to show the scores
454
+ form : '{:.2f}' # Use two decimals for scores
455
+ color : 'green' # Color for text and marker
456
+ xoff : -15 # Offsets in X and Y
457
+ yoff : -15
458
+ size : 10 # Size of text
459
+ correlation: # Adds correlation matrix for training datasets
460
+ title : 'Correlation matrix'
461
+ size : [10, 10]
462
+ mask_value : 0 # Where correlation is zero, the bin will appear white
411
463
  val_dir : 'tests/ml/train_mva'
412
464
  features:
413
465
  saving:
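As a rough sketch of what the `dataset/nan` section above amounts to (illustrative code assuming a pandas dataframe holding the training features, not the package's implementation):

```python
# Features listed under dataset/nan are filled with the configured value;
# entries with NaNs in any other feature are dropped.
import numpy  as np
import pandas as pnd

d_nan = {'x' : 0, 'y' : 0, 'z' : -999}   # taken from the YAML above

df = pnd.DataFrame({'x' : [1.0, np.nan], 'y' : [np.nan, 2.0], 'z' : [3.0, np.nan]})
for name, val in d_nan.items():
    df[name] = df[name].fillna(val)      # replace NaNs in the listed features

df = df.dropna()                         # drop entries with any remaining NaNs
```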
@@ -475,6 +527,36 @@ When evaluating the model with real data, problems might occur, we deal with the
475
527
  - **NaNs**: Entries with NaNs will break the evaluation. These entries will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
476
528
  saved as -1, i.e. entries with NaNs will end up with a probability of -1, as sketched below.
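
The behaviour described above can be pictured with the sketch below, where a toy `GradientBoostingClassifier` stands in for the trained model; this is an illustration of the logic, not the library's code:

```python
# Patch NaNs with zeros, evaluate, then flag the patched entries with -1.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy classifier standing in for the trained model
arr_trn = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
arr_lab = np.array([0, 0, 1, 1])
model   = GradientBoostingClassifier().fit(arr_trn, arr_lab)

arr_feat = np.array([[1.0, 2.0], [np.nan, 3.0]])
has_nan  = np.isnan(arr_feat).any(axis=1)             # entries with at least one NaN

arr_patched = np.nan_to_num(arr_feat, nan=0.0)        # patch NaNs with zeros
arr_prob    = model.predict_proba(arr_patched)[:, 1]  # evaluate as usual
arr_prob[has_nan] = -1                                # flag patched entries
```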
477
529
 
530
+ # Pandas dataframes
531
+
532
+ ## Utilities
533
+
534
+ These are thin layers of code that take pandas dataframes and carry out specific tasks.
535
+
536
+ ### Dataframe to latex
537
+
538
+ One can save a dataframe to latex with:
539
+
540
+ ```python
541
+ import pandas as pnd
542
+ import dmu.pdataframe.utilities as put
543
+
544
+ d_data = {}
545
+ d_data['a'] = [1,2,3]
546
+ d_data['b'] = [4,5,6]
547
+ df = pnd.DataFrame(d_data)
548
+
549
+ d_format = {
550
+ 'a' : '{:.0f}',
551
+ 'b' : '{:.3f}'}
552
+
553
554
+ put.df_to_tex(df,
555
+ './table.tex',
556
+ d_format = d_format,
557
+ caption = 'some caption')
558
+ ```
559
+
478
560
  # Rdataframes
479
561
 
480
562
  These are utility functions meant to be used with ROOT dataframes.
@@ -626,6 +708,43 @@ axes:
626
708
  label : 'y'
627
709
  ```
628
710
 
711
+ # Other plots
712
+
713
+ ## Matrices
714
+
715
+ Matrices can be plotted with `MatrixPlotter`, whose usage is illustrated below:
716
+
717
+ ```python
718
+ import numpy
719
+ import matplotlib.pyplot as plt
720
+
721
+ from dmu.plotting.matrix import MatrixPlotter
722
+
723
+ cfg = {
724
+ 'labels' : ['x', 'y', 'z'], # Used to label the matrix axes
725
+ 'title' : 'Some title', # Optional, title of plot
726
+ 'label_angle': 45, # Labels will be rotated by 45 degrees
727
+ 'upper' : True, # Useful in case this is a symmetric matrix
728
+ 'zrange' : [0, 10], # Controls the z axis range
729
+ 'size' : [7, 7], # Plot size
730
+ 'format' : '{:.3f}', # Optional, if present the values are drawn in each cell, otherwise a color bar is shown
731
+ 'fontsize' : 12, # Font size associated to `format`
732
+ 'mask_value' : 0, # These values will appear white in the plot
733
+ }
734
+
735
+ mat = [
736
+ [1, 2, 3],
737
+ [2, 0, 4],
738
+ [3, 4, numpy.nan]
739
+ ]
740
+
741
+ mat = numpy.array(mat)
742
+
743
+ obj = MatrixPlotter(mat=mat, cfg=cfg)
744
+ obj.plot()
745
+ plt.show()
746
+ ```
747
+
629
748
  # Manipulating ROOT files
630
749
 
631
750
  ## Getting trees from file
@@ -1,14 +1,17 @@
1
- data_manipulation_utilities-0.1.9.data/scripts/publish,sha256=-3K_Y2_4CfWCV50rPB8CRuhjxDu7xMGswinRwPovgLs,1976
1
+ data_manipulation_utilities-0.2.1.data/scripts/publish,sha256=-3K_Y2_4CfWCV50rPB8CRuhjxDu7xMGswinRwPovgLs,1976
2
2
  dmu/arrays/utilities.py,sha256=PKoYyybPptA2aU-V3KLnJXBudWxTXu4x1uGdIMQ49HY,1722
3
3
  dmu/generic/utilities.py,sha256=0Xnq9t35wuebAqKxbyAiMk1ISB7IcXK4cFH25MT1fgw,1741
4
4
  dmu/logging/log_store.py,sha256=umdvjNDuV3LdezbG26b0AiyTglbvkxST19CQu9QATbA,4184
5
- dmu/ml/cv_classifier.py,sha256=n81m7i2M6Zq96AEd9EZGwXSrbG5m9jkS5RdeXvbsAXU,3712
6
- dmu/ml/cv_predict.py,sha256=Bqxu-f6qquKJokFljhCzL_kiGcjLJLQFhVBD130fsyw,4893
7
- dmu/ml/train_mva.py,sha256=d_n-A07DFweikz5nXap4OE_Mqx8VprFT7zbxmnQAbac,9638
8
- dmu/ml/utilities.py,sha256=Nue7O9zi1QXgjGRPH6wnSAW9jusMQ2ZOSDJzBqJKIi0,3687
5
+ dmu/ml/cv_classifier.py,sha256=8Jwx6xMhJaRLktlRdq0tFl32v6t8i63KmpxrlnXlomU,3759
6
+ dmu/ml/cv_predict.py,sha256=AhCsCnHWPWGIRVTdGS1NxA2m4yH7t2lV_OdALwQAcAE,4927
7
+ dmu/ml/train_mva.py,sha256=xJCJZKaly4Mml7Dy-TWQxpB-VNftL7EjQ79QKxROWx0,16475
8
+ dmu/ml/utilities.py,sha256=l348bufD95CuSYdIrHScQThIy2nKwGKXZn-FQg3CEwg,3930
9
+ dmu/pdataframe/utilities.py,sha256=ypvLiFfJ82ga94qlW3t5dXnvEFwYOXnbtJb2zHwsbqk,987
10
+ dmu/plotting/matrix.py,sha256=pXuUJn-LgOvrI9qGkZQw16BzLjOjeikYQ_ll2VIcIXU,4978
9
11
  dmu/plotting/plotter.py,sha256=ytMxtzHEY8ZFU0ZKEBE-ROjMszXl5kHTMnQnWe173nU,7208
10
- dmu/plotting/plotter_1d.py,sha256=O7rTgCBlpCko1RSpj2TzcUIfx9sKoz2jAgw73Pz7Ynk,4472
12
+ dmu/plotting/plotter_1d.py,sha256=g6H2xAgsL9a6vRkpbqHICb3qwV_qMiQPZxxw_oOSf9M,5115
11
13
  dmu/plotting/plotter_2d.py,sha256=J-gKnagoHGfJFU7HBrhDFpGYH5Rxy0_zF5l8eE_7ZHE,2944
14
+ dmu/plotting/utilities.py,sha256=SI9dvtZq2gr-PXVz71KE4o0i09rZOKgqJKD1jzf6KXk,1167
12
15
  dmu/rdataframe/atr_mgr.py,sha256=FdhaQWVpsm4OOe1IRbm7rfrq8VenTNdORyI-lZ2Bs1M,2386
13
16
  dmu/rdataframe/utilities.py,sha256=x8r379F2-vZPYzAdMFCn_V4Kx2Tx9t9pn_QHcZ1euew,2756
14
17
  dmu/rfile/rfprinter.py,sha256=mp5jd-oCJAnuokbdmGyL9i6tK2lY72jEfROuBIZ_ums,3941
@@ -23,12 +26,13 @@ dmu/stats/zfit_plotter.py,sha256=Xs6kisNEmNQXhYRCcjowxO6xHuyAyrfyQIFhGAR61U4,197
23
26
  dmu/testing/utilities.py,sha256=WbMM4e9Cn3-B-12Vr64mB5qTKkV32joStlRkD-48lG0,3460
24
27
  dmu/text/transformer.py,sha256=4lrGknbAWRm0-rxbvgzOO-eR1-9bkYk61boJUEV3cQ0,6100
25
28
  dmu_data/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
26
- dmu_data/ml/tests/train_mva.yaml,sha256=TCniCVpXMEFxZcHa8IIqollKA7ci4OkBnRznLEkXM9o,925
29
+ dmu_data/ml/tests/train_mva.yaml,sha256=k5H4Gu9Gj57B9iqabhcTQEFN674Cv_uJ2Xcumb02zF4,1279
27
30
  dmu_data/plotting/tests/2d.yaml,sha256=VApcAfJFbjNcjMCTBSRm2P37MQlGavMZv6msbZwLSgw,402
28
31
  dmu_data/plotting/tests/fig_size.yaml,sha256=7ROq49nwZ1A2EbPiySmu6n3G-Jq6YAOkc3d2X3YNZv0,294
29
32
  dmu_data/plotting/tests/high_stat.yaml,sha256=bLglBLCZK6ft0xMhQ5OltxE76cWsBMPMjO6GG0OkDr8,522
30
33
  dmu_data/plotting/tests/name.yaml,sha256=mkcPAVg8wBAmlSbSRQ1bcaMl4vOS6LXMtpqQeDrrtO4,312
31
34
  dmu_data/plotting/tests/no_bounds.yaml,sha256=8e1QdphBjz-suDr857DoeUC2DXiy6SE-gvkORJQYv80,257
35
+ dmu_data/plotting/tests/normalized.yaml,sha256=Y0eKtyV5pvlSxvqfsLjytYtv8xYF3HZ5WEdCJdeHGQI,193
32
36
  dmu_data/plotting/tests/simple.yaml,sha256=N_TvNBh_2dU0-VYgu_LMrtY0kV_hg2HxVuEoDlr1HX8,138
33
37
  dmu_data/plotting/tests/title.yaml,sha256=bawKp9aGpeRrHzv69BOCbFX8sq9bb3Es9tdsPTE7jIk,333
34
38
  dmu_data/plotting/tests/weights.yaml,sha256=RWQ1KxbCq-uO62WJ2AoY4h5Umc37zG35s-TpKnNMABI,312
@@ -43,8 +47,8 @@ dmu_scripts/rfile/compare_root_files.py,sha256=T8lDnQxsRNMr37x1Y7YvWD8ySHrJOWZki
43
47
  dmu_scripts/rfile/print_trees.py,sha256=Ze4Ccl_iUldl4eVEDVnYBoe4amqBT1fSBR1zN5WSztk,941
44
48
  dmu_scripts/ssh/coned.py,sha256=lhilYNHWRCGxC-jtyJ3LQ4oUgWW33B2l1tYCcyHHsR0,4858
45
49
  dmu_scripts/text/transform_text.py,sha256=9akj1LB0HAyopOvkLjNOJiptZw5XoOQLe17SlcrGMD0,1456
46
- data_manipulation_utilities-0.1.9.dist-info/METADATA,sha256=sxu2cZc14f4VfDD2J3MLGmW0jRHXJBpmDspXUt1D_0k,23046
47
- data_manipulation_utilities-0.1.9.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
48
- data_manipulation_utilities-0.1.9.dist-info/entry_points.txt,sha256=1TIZDed651KuOH-DgaN5AoBdirKmrKE_oM1b6b7zTUU,270
49
- data_manipulation_utilities-0.1.9.dist-info/top_level.txt,sha256=n_x5J6uWtSqy9mRImKtdA2V2NJNyU8Kn3u8DTOKJix0,25
50
- data_manipulation_utilities-0.1.9.dist-info/RECORD,,
50
+ data_manipulation_utilities-0.2.1.dist-info/METADATA,sha256=ojD6P0bBj9GFohtPd7ULl7sDW80bIVD6JZ-bnNpHYmc,26649
51
+ data_manipulation_utilities-0.2.1.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
52
+ data_manipulation_utilities-0.2.1.dist-info/entry_points.txt,sha256=1TIZDed651KuOH-DgaN5AoBdirKmrKE_oM1b6b7zTUU,270
53
+ data_manipulation_utilities-0.2.1.dist-info/top_level.txt,sha256=n_x5J6uWtSqy9mRImKtdA2V2NJNyU8Kn3u8DTOKJix0,25
54
+ data_manipulation_utilities-0.2.1.dist-info/RECORD,,
dmu/ml/cv_classifier.py CHANGED
@@ -2,6 +2,7 @@
2
2
  Module holding cv_classifier class
3
3
  '''
4
4
 
5
+ from typing import Union
5
6
  from sklearn.ensemble import GradientBoostingClassifier
6
7
 
7
8
  from dmu.logging.log_store import LogStore
@@ -22,7 +23,7 @@ class CVClassifier(GradientBoostingClassifier):
22
23
  '''
23
24
  # pylint: disable = too-many-ancestors, abstract-method
24
25
  # ----------------------------------
25
- def __init__(self, cfg : dict | None = None):
26
+ def __init__(self, cfg : Union[dict,None] = None):
26
27
  '''
27
28
  cfg (dict) : Dictionary with configuration, specially the hyperparameters set in the `hyper` field
28
29
  '''
dmu/ml/cv_predict.py CHANGED
@@ -10,8 +10,8 @@ import tqdm
10
10
  from ROOT import RDataFrame
11
11
 
12
12
  import dmu.ml.utilities as ut
13
- import dmu.ml.cv_classifier as CVClassifier
14
13
 
14
+ from dmu.ml.cv_classifier import CVClassifier
15
15
  from dmu.logging.log_store import LogStore
16
16
 
17
17
  log = LogStore.add_logger('dmu:ml:cv_predict')
@@ -147,6 +147,7 @@ class CVPredict:
147
147
  arr_prb = self._predict_with_overlap(df_ft)
148
148
 
149
149
  arr_prb = self._patch_probabilities(arr_prb)
150
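+ # keep only the signal-class (label 1) column of the probability array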
+ arr_prb = arr_prb.T[1]
150
151
 
151
152
  return arr_prb
152
153
  # ---------------------------------------
dmu/ml/train_mva.py CHANGED
@@ -1,8 +1,10 @@
1
1
  '''
2
2
  Module with TrainMva class
3
3
  '''
4
+ # pylint: disable = too-many-locals
5
+ # pylint: disable = too-many-arguments, too-many-positional-arguments
6
+
4
7
  import os
5
- from typing import Union
6
8
 
7
9
  import joblib
8
10
  import pandas as pnd
@@ -14,11 +16,16 @@ from sklearn.model_selection import StratifiedKFold
14
16
 
15
17
  from ROOT import RDataFrame
16
18
 
17
- import dmu.ml.utilities as ut
19
+ import dmu.ml.utilities as ut
20
+ import dmu.pdataframe.utilities as put
21
+ import dmu.plotting.utilities as plu
22
+
18
23
  from dmu.ml.cv_classifier import CVClassifier as cls
19
24
  from dmu.plotting.plotter_1d import Plotter1D as Plotter
25
+ from dmu.plotting.matrix import MatrixPlotter
20
26
  from dmu.logging.log_store import LogStore
21
27
 
28
+ npa = numpy.ndarray
22
29
  log = LogStore.add_logger('data_checks:train_mva')
23
30
  # ---------------------------------------------
24
31
  class TrainMva:
@@ -43,15 +50,13 @@ class TrainMva:
43
50
 
44
51
  self._rdf_bkg = bkg
45
52
  self._rdf_sig = sig
46
- self._cfg = cfg if cfg is not None else {}
47
-
48
- self._l_model : cls
53
+ self._cfg = cfg
49
54
 
50
55
  self._l_ft_name = self._cfg['training']['features']
51
56
 
52
57
  self._df_ft, self._l_lab = self._get_inputs()
53
58
  # ---------------------------------------------
54
- def _get_inputs(self) -> tuple[pnd.DataFrame, numpy.ndarray]:
59
+ def _get_inputs(self) -> tuple[pnd.DataFrame, npa]:
55
60
  log.info('Getting signal')
56
61
  df_sig, arr_lab_sig = self._get_sample_inputs(self._rdf_sig, label = 1)
57
62
 
@@ -63,15 +68,28 @@ class TrainMva:
63
68
 
64
69
  return df, arr_lab
65
70
  # ---------------------------------------------
66
- def _get_sample_inputs(self, rdf : RDataFrame, label : int) -> tuple[pnd.DataFrame, numpy.ndarray]:
71
+ def _pre_process_nans(self, df : pnd.DataFrame) -> pnd.DataFrame:
72
+ if 'nan' not in self._cfg['dataset']:
73
+ log.debug('dataset/nan section not found, not pre-processing NaNs')
74
+ return df
75
+
76
+ d_name_val = self._cfg['dataset']['nan']
77
+ for name, val in d_name_val.items():
78
+ log.debug(f'{val:<20}{"<---":<10}{name:<100}')
79
+ df[name] = df[name].fillna(val)
80
+
81
+ return df
82
+ # ---------------------------------------------
83
+ def _get_sample_inputs(self, rdf : RDataFrame, label : int) -> tuple[pnd.DataFrame, npa]:
67
84
  d_ft = rdf.AsNumpy(self._l_ft_name)
68
85
  df = pnd.DataFrame(d_ft)
86
+ df = self._pre_process_nans(df)
69
87
  df = ut.cleanup(df)
70
88
  l_lab= len(df) * [label]
71
89
 
72
90
  return df, numpy.array(l_lab)
73
91
  # ---------------------------------------------
74
- def _get_model(self, arr_index : numpy.ndarray) -> cls:
92
+ def _get_model(self, arr_index : npa) -> cls:
75
93
  model = cls(cfg = self._cfg)
76
94
  df_ft = self._df_ft.iloc[arr_index]
77
95
  l_lab = self._l_lab[arr_index]
@@ -84,7 +102,6 @@ class TrainMva:
84
102
  return model
85
103
  # ---------------------------------------------
86
104
  def _get_models(self):
87
- # pylint: disable = too-many-locals
88
105
  '''
89
106
  Will create models, train them and return them
90
107
  '''
@@ -105,15 +122,55 @@ class TrainMva:
105
122
  arr_sig_sig_tr, arr_sig_bkg_tr, arr_sig_all_tr, arr_lab_tr = self._get_scores(model, arr_itr, on_training_ok= True)
106
123
  arr_sig_sig_ts, arr_sig_bkg_ts, arr_sig_all_ts, arr_lab_ts = self._get_scores(model, arr_its, on_training_ok=False)
107
124
 
125
+ self._save_feature_importance(model, ifold)
126
+ self._plot_correlation(arr_itr, ifold)
108
127
  self._plot_scores(arr_sig_sig_tr, arr_sig_sig_ts, arr_sig_bkg_tr, arr_sig_bkg_ts, ifold)
109
-
110
128
  self._plot_roc(arr_lab_ts, arr_sig_all_ts, arr_lab_tr, arr_sig_all_tr, ifold)
111
129
 
112
130
  ifold+=1
113
131
 
114
132
  return l_model
115
133
  # ---------------------------------------------
116
- def _get_scores(self, model : cls, arr_index : numpy.ndarray, on_training_ok : bool) -> tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]:
134
+ def _labels_from_varnames(self, l_var_name : list[str]) -> list[str]:
135
+ try:
136
+ d_plot = self._cfg['plotting']['features']['plots']
137
+ except KeyError:
138
+ log.warning('Cannot find plotting/features/plots section in config, using dataframe names')
139
+ return l_var_name
140
+
141
+ l_label = []
142
+ for var_name in l_var_name:
143
+ if var_name not in d_plot:
144
+ log.warning(f'No plot found for: {var_name}')
145
+ l_label.append(var_name)
146
+ continue
147
+
148
+ d_setting = d_plot[var_name]
149
+ [xlab, _ ]= d_setting['labels']
150
+
151
+ l_label.append(xlab)
152
+
153
+ return l_label
154
+ # ---------------------------------------------
155
+ def _save_feature_importance(self, model : cls, ifold : int) -> None:
156
+ l_var_name = self._df_ft.columns.tolist()
157
+
158
+ d_data = {}
159
+ d_data['Variable' ] = self._labels_from_varnames(l_var_name)
160
+ d_data['Importance'] = 100 * model.feature_importances_
161
+
162
+ val_dir = self._cfg['plotting']['val_dir']
163
+ val_dir = f'{val_dir}/fold_{ifold:03}'
164
+ os.makedirs(val_dir, exist_ok=True)
165
+
166
+ df = pnd.DataFrame(d_data)
167
+ df = df.sort_values(by='Importance', ascending=False)
168
+
169
+ table_path = f'{val_dir}/importance.tex'
170
+ d_form = {'Variable' : '{}', 'Importance' : '{:.1f}'}
171
+ put.df_to_tex(df, table_path, d_format = d_form)
172
+ # ---------------------------------------------
173
+ def _get_scores(self, model : cls, arr_index : npa, on_training_ok : bool) -> tuple[npa, npa, npa, npa]:
117
174
  '''
118
175
  Returns a tuple of four arrays
119
176
 
@@ -136,7 +193,7 @@ class TrainMva:
136
193
 
137
194
  return arr_sig, arr_bkg, arr_all, arr_lab
138
195
  # ---------------------------------------------
139
- def _split_scores(self, arr_prob : numpy.ndarray, arr_label : numpy.ndarray) -> tuple[numpy.ndarray, numpy.ndarray]:
196
+ def _split_scores(self, arr_prob : npa, arr_label : npa) -> tuple[npa, npa]:
140
197
  '''
141
198
  Will split the testing scores (predictions) based on the training scores
142
199
 
@@ -151,7 +208,7 @@ class TrainMva:
151
208
 
152
209
  return arr_sig, arr_bkg
153
210
  # ---------------------------------------------
154
- def _save_model(self, model, ifold):
211
+ def _save_model(self, model : cls, ifold : int) -> None:
155
212
  '''
156
213
  Saves a model, associated to a specific fold
157
214
  '''
@@ -168,6 +225,53 @@ class TrainMva:
168
225
  log.info(f'Saving model to: {model_path}')
169
226
  joblib.dump(model, model_path)
170
227
  # ---------------------------------------------
228
+ def _get_correlation_cfg(self, df : pnd.DataFrame, ifold : int) -> dict:
229
+ l_var_name = df.columns.tolist()
230
+ l_label = self._labels_from_varnames(l_var_name)
231
+ cfg = {
232
+ 'labels' : l_label,
233
+ 'title' : f'Fold {ifold}',
234
+ 'label_angle': 45,
235
+ 'upper' : True,
236
+ 'zrange' : [-1, +1],
237
+ 'size' : [7, 7],
238
+ 'format' : '{:.3f}',
239
+ 'fontsize' : 12,
240
+ }
241
+
242
+ if 'correlation' not in self._cfg['plotting']:
243
+ log.info('Using default correlation plotting configuration')
244
+ return cfg
245
+
246
+ log.debug('Updating correlation plotting configuration')
247
+ custom = self._cfg['plotting']['correlation']
248
+ cfg.update(custom)
249
+
250
+ return cfg
251
+ # ---------------------------------------------
252
+ def _plot_correlation(self, arr_index : npa, ifold : int) -> None:
253
+ df_ft = self._df_ft.iloc[arr_index]
254
+ cfg = self._get_correlation_cfg(df_ft, ifold)
255
+ cov = df_ft.corr()
256
+ mat = cov.to_numpy()
257
+
258
+ log.debug(f'Plotting correlation for {ifold} fold')
259
+
260
+ val_dir = self._cfg['plotting']['val_dir']
261
+ val_dir = f'{val_dir}/fold_{ifold:03}'
262
+ os.makedirs(val_dir, exist_ok=True)
263
+
264
+ obj = MatrixPlotter(mat=mat, cfg=cfg)
265
+ obj.plot()
266
+ plt.savefig(f'{val_dir}/covariance.png')
267
+ plt.close()
268
+ # ---------------------------------------------
269
+ def _get_nentries(self, arr_val : npa) -> str:
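+ '''
+ Returns the array length in thousands, formatted as e.g. `12.35K`, used for legend labels
+ '''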
270
+ size = len(arr_val)
271
+ size = size / 1000.
272
+
273
+ return f'{size:.2f}K'
274
+ # ---------------------------------------------
171
275
  def _plot_scores(self, arr_sig_trn, arr_sig_tst, arr_bkg_trn, arr_bkg_tst, ifold):
172
276
  # pylint: disable = too-many-arguments, too-many-positional-arguments
173
277
  '''
@@ -183,11 +287,11 @@ class TrainMva:
183
287
  val_dir = f'{val_dir}/fold_{ifold:03}'
184
288
  os.makedirs(val_dir, exist_ok=True)
185
289
 
186
- plt.hist(arr_sig_trn, alpha = 0.3, bins=50, range=(0,1), color='b', density=True, label='Signal Train')
187
- plt.hist(arr_sig_tst, histtype='step', bins=50, range=(0,1), color='b', density=True, label='Signal Test')
290
+ plt.hist(arr_sig_trn, alpha = 0.3, bins=50, range=(0,1), color='b', density=True, label='Signal Train: ' + self._get_nentries(arr_sig_trn))
291
+ plt.hist(arr_sig_tst, histtype='step', bins=50, range=(0,1), color='b', density=True, label='Signal Test: ' + self._get_nentries(arr_sig_tst))
188
292
 
189
- plt.hist(arr_bkg_trn, alpha = 0.3, bins=50, range=(0,1), color='r', density=True, label='Background Train')
190
- plt.hist(arr_bkg_tst, histtype='step', bins=50, range=(0,1), color='r', density=True, label='Background Test')
293
+ plt.hist(arr_bkg_trn, alpha = 0.3, bins=50, range=(0,1), color='r', density=True, label='Background Train: '+ self._get_nentries(arr_bkg_trn))
294
+ plt.hist(arr_bkg_tst, histtype='step', bins=50, range=(0,1), color='r', density=True, label='Background Test: ' + self._get_nentries(arr_bkg_tst))
191
295
 
192
296
  plt.legend()
193
297
  plt.title(f'Fold: {ifold}')
@@ -197,16 +301,15 @@ class TrainMva:
197
301
  plt.close()
198
302
  # ---------------------------------------------
199
303
  def _plot_roc(self,
200
- l_lab_ts : numpy.ndarray,
201
- l_prb_ts : numpy.ndarray,
202
- l_lab_tr : numpy.ndarray,
203
- l_prb_tr : numpy.ndarray,
304
+ l_lab_ts : npa,
305
+ l_prb_ts : npa,
306
+ l_lab_tr : npa,
307
+ l_prb_tr : npa,
204
308
  ifold : int):
205
309
  '''
206
310
  Takes the labels and the probabilities and plots ROC
207
311
  curve for given fold
208
312
  '''
209
- # pylint: disable = too-many-arguments, too-many-positional-arguments
210
313
  log.debug(f'Plotting ROC curve for {ifold} fold')
211
314
 
212
315
  val_dir = self._cfg['plotting']['val_dir']
@@ -226,17 +329,59 @@ class TrainMva:
226
329
  if 'min' in self._cfg['plotting']['roc']:
227
330
  [min_x, min_y] = self._cfg['plotting']['roc']['min']
228
331
 
332
+ max_x = 1
333
+ max_y = 1
334
+ if 'max' in self._cfg['plotting']['roc']:
335
+ [max_x, max_y] = self._cfg['plotting']['roc']['max']
336
+
337
+ self._plot_probabilities(xval_ts, yval_ts, l_prb_ts)
338
+
229
339
  plt.plot(xval_ts, yval_ts, color='b', label=f'Test: {area_ts:.3f}')
230
340
  plt.plot(xval_tr, yval_tr, color='r', label=f'Train: {area_tr:.3f}')
231
341
  plt.xlabel('Signal efficiency')
232
- plt.ylabel('Background efficiency')
342
+ plt.ylabel('Background rejection')
233
343
  plt.title(f'Fold: {ifold}')
234
- plt.xlim(min_x, 1)
235
- plt.ylim(min_y, 1)
344
+ plt.xlim(min_x, max_x)
345
+ plt.ylim(min_y, max_y)
346
+ plt.grid()
236
347
  plt.legend()
237
348
  plt.savefig(f'{val_dir}/roc.png')
238
349
  plt.close()
239
350
  # ---------------------------------------------
351
+ def _plot_probabilities(self,
352
+ arr_seff: npa,
353
+ arr_brej: npa,
354
+ arr_sprb: npa) -> None:
355
+
356
+ roc_cfg = self._cfg['plotting']['roc']
357
+ if 'annotate' not in roc_cfg:
358
+ log.debug('Annotation section in the ROC curve config not found, skipping annotation')
359
+ return
360
+
361
+ plt_cfg = roc_cfg['annotate']
362
+ if 'sig_eff' not in plt_cfg:
363
+ l_seff_target = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
364
+ else:
365
+ l_seff_target = plt_cfg['sig_eff']
366
+ del plt_cfg['sig_eff']
367
+
368
+ arr_seff_target = numpy.array(l_seff_target)
369
+
370
+ l_score = numpy.quantile(arr_sprb, 1 - arr_seff_target)
371
+ l_seff = []
372
+ l_brej = []
373
+ for seff_target in l_seff_target:
374
+ arr_diff = numpy.abs(arr_seff - seff_target)
375
+ ind = numpy.argmin(arr_diff)
376
+
377
+ seff = arr_seff[ind]
378
+ brej = arr_brej[ind]
379
+
380
+ l_seff.append(seff)
381
+ l_brej.append(brej)
382
+
383
+ plu.annotate(l_x=l_seff, l_y=l_brej, l_v=l_score, **plt_cfg)
384
+ # ---------------------------------------------
240
385
  def _plot_features(self):
241
386
  '''
242
387
  Will plot the features, based on the settings in the config
@@ -245,10 +390,44 @@ class TrainMva:
245
390
  ptr = Plotter(d_rdf = {'Signal' : self._rdf_sig, 'Background' : self._rdf_bkg}, cfg=d_cfg)
246
391
  ptr.run()
247
392
  # ---------------------------------------------
393
+ def _save_settings_to_tex(self) -> None:
394
+ self._save_nan_conversion()
395
+ self._save_hyperparameters_to_tex()
396
+ # ---------------------------------------------
397
+ def _save_nan_conversion(self) -> None:
398
+ if 'nan' not in self._cfg['dataset']:
399
+ log.debug('NaN section not found, not saving it')
400
+ return
401
+
402
+ d_nan = self._cfg['dataset']['nan']
403
+ l_var = list(d_nan)
404
+ l_lab = self._labels_from_varnames(l_var)
405
+ l_val = list(d_nan.values())
406
+
407
+ d_tex = {'Variable' : l_lab, 'Replacement' : l_val}
408
+ df = pnd.DataFrame(d_tex)
409
+ val_dir = self._cfg['plotting']['val_dir']
410
+ os.makedirs(val_dir, exist_ok=True)
411
+ put.df_to_tex(df, f'{val_dir}/nan_replacement.tex')
412
+ # ---------------------------------------------
413
+ def _save_hyperparameters_to_tex(self) -> None:
414
+ if 'hyper' not in self._cfg['training']:
415
+ raise ValueError('Cannot find hyper parameters in configuration')
416
+
417
+ d_hyper = self._cfg['training']['hyper']
418
+ d_form = { f'\\verb|{key}|' : f'\\verb|{val}|' for key, val in d_hyper.items() }
419
+ d_latex = { 'Hyperparameter' : list(d_form.keys()), 'Value' : list(d_form.values())}
420
+
421
+ df = pnd.DataFrame(d_latex)
422
+ val_dir = self._cfg['plotting']['val_dir']
423
+ os.makedirs(val_dir, exist_ok=True)
424
+ put.df_to_tex(df, f'{val_dir}/hyperparameters.tex')
425
+ # ---------------------------------------------
248
426
  def run(self):
249
427
  '''
250
428
  Will do the training
251
429
  '''
430
+ self._save_settings_to_tex()
252
431
  self._plot_features()
253
432
 
254
433
  l_mod = self._get_models()
dmu/ml/utilities.py CHANGED
@@ -51,6 +51,14 @@ def _remove_nans(df : pnd.DataFrame) -> pnd.DataFrame:
51
51
  log.debug('No NaNs found in dataframe')
52
52
  return df
53
53
 
54
+ sr_is_nan = df.isna().any()
55
+ l_na_name = sr_is_nan[sr_is_nan].index.tolist()
56
+
57
+ log.info('Found columns with NaNs')
58
+ for name in l_na_name:
59
+ nan_count = df[name].isna().sum()
60
+ log.info(f'{nan_count:<10}{name:<100}')
61
+
54
62
  ninit = len(df)
55
63
  df = df.dropna()
56
64
  nfinl = len(df)
@@ -0,0 +1,36 @@
1
+ '''
2
+ Module containing utilities for pandas dataframes
3
+ '''
4
+ import os
5
+ import pandas as pnd
6
+
7
+ from dmu.logging.log_store import LogStore
8
+
9
+ log=LogStore.add_logger('dmu:pdataframe:utilities')
10
+
11
+ # -------------------------------------
12
+ def df_to_tex(df : pnd.DataFrame, path : str, hide_index : bool = True, d_format : dict[str,str]=None, caption : str =None) -> None:
13
+ '''
14
+ Saves pandas dataframe to latex
15
+
16
+ Parameters
17
+ -------------
18
+ d_format (dict) : Dictionary specifying the formatting of the table, e.g. `{'col1': '{}', 'col2': '{:.3f}', 'col3' : '{:.3f}'}`
19
+ '''
20
+
21
+ if path is not None:
22
+ dir_name = os.path.dirname(path)
23
+ os.makedirs(dir_name, exist_ok=True)
24
+
25
+ st = df.style
26
+ if hide_index:
27
+ st=st.hide(axis='index')
28
+
29
+ if d_format is not None:
30
+ st=st.format(formatter=d_format)
31
+
32
+ log.info(f'Saving to: {path}')
33
+ buf = st.to_latex(buf=path, caption=caption, hrules=True)
34
+
35
+ return buf
36
+ # -------------------------------------
dmu/plotting/matrix.py ADDED
@@ -0,0 +1,157 @@
1
+ '''
2
+ Module holding the MatrixPlotter class
3
+ '''
4
+ from typing import Annotated
5
+ import numpy
6
+ import numpy.typing as npt
7
+ import matplotlib.pyplot as plt
8
+
9
+ from dmu.logging.log_store import LogStore
10
+
11
+ Array2D = Annotated[npt.NDArray[numpy.float64], '(n,n)']
12
+ log = LogStore.add_logger('dmu:plotting:matrix')
13
+ #-------------------------------------------------------
14
+ class MatrixPlotter:
15
+ '''
16
+ Class used to plot matrices
17
+ '''
18
+ # -----------------------------------------------
19
+ def __init__(self, mat : Array2D, cfg : dict):
20
+ self._mat = mat
21
+ self._cfg = cfg
22
+
23
+ self._size : int
24
+ self._l_label : list[str]
25
+ # -----------------------------------------------
26
+ def _initialize(self) -> None:
27
+ self._check_matrix()
28
+ self._reformat_matrix()
29
+ self._set_labels()
30
+ self._mask_matrix()
31
+ # -----------------------------------------------
32
+ def _mask_matrix(self) -> None:
33
+ if 'mask_value' not in self._cfg:
34
+ return
35
+
36
+ mask_val = self._cfg['mask_value']
37
+ log.debug(f'Masking value: {mask_val}')
38
+
39
+ self._mat = numpy.ma.masked_where(self._mat == mask_val, self._mat)
40
+ # -----------------------------------------------
41
+ def _check_matrix(self) -> None:
42
+ a, b = self._mat.shape
43
+
44
+ if a != b:
45
+ raise ValueError(f'Matrix is not square, but with shape: {a}x{b}')
46
+
47
+ self._size = a
48
+ # -----------------------------------------------
49
+ def _set_labels(self) -> None:
50
+ if 'labels' not in self._cfg:
51
+ raise ValueError('Labels entry missing')
52
+
53
+ l_lab = self._cfg['labels']
54
+ nlab = len(l_lab)
55
+
56
+ if nlab != self._size:
57
+ raise ValueError(f'Number of labels is not equal to its size: {nlab}!={self._size}')
58
+
59
+ self._l_label = l_lab
60
+ # -----------------------------------------------
61
+ def _reformat_matrix(self) -> None:
62
+ if 'upper' not in self._cfg:
63
+ log.debug('Drawing full matrix')
64
+ return
65
+
66
+ upper = self._cfg['upper']
67
+ if upper not in [True, False]:
68
+ raise ValueError(f'Invalid value for upper setting: {upper}')
69
+
70
+ if upper:
71
+ log.debug('Drawing upper matrix')
72
+ self._mat = numpy.triu(self._mat, 0)
73
+ return
74
+
75
+ if not upper:
76
+ log.debug('Drawing lower matrix')
77
+ self._mat = numpy.tril(self._mat, 0)
78
+ return
79
+ # -----------------------------------------------
80
+ def _set_axes(self, ax) -> None:
81
+ ax.set_xticks(numpy.arange(self._size))
82
+ ax.set_yticks(numpy.arange(self._size))
83
+
84
+ ax.set_xticklabels(self._l_label)
85
+ ax.set_yticklabels(self._l_label)
86
+
87
+ rotation = 45
88
+ if 'label_angle' in self._cfg:
89
+ rotation = self._cfg['label_angle']
90
+
91
+ plt.setp(ax.get_xticklabels(), rotation=rotation, ha="right", rotation_mode="anchor")
92
+ # -----------------------------------------------
93
+ def _draw_matrix(self) -> None:
94
+ fsize = None
95
+ if 'size' in self._cfg:
96
+ fsize = self._cfg['size']
97
+
98
+ if 'zrange' not in self._cfg:
99
+ raise ValueError('z range not found in configuration')
100
+
101
+ [zmin, zmax] = self._cfg['zrange']
102
+
103
+ fig, ax = plt.subplots() if fsize is None else plt.subplots(figsize=fsize)
104
+
105
+ palette = plt.cm.viridis
106
+ im = ax.imshow(self._mat, cmap=palette, vmin=zmin, vmax=zmax)
107
+ self._set_axes(ax)
108
+
109
+ if 'format' in self._cfg:
110
+ self._add_text(ax)
111
+ else:
112
+ log.debug('Not adding values to matrix but bar')
113
+ fig.colorbar(im)
114
+
115
+ if 'title' not in self._cfg:
116
+ return
117
+
118
+ title = self._cfg['title']
119
+ ax.set_title(title)
120
+ fig.tight_layout()
121
+ # -----------------------------------------------
122
+ def _add_text(self, ax):
123
+ fontsize = 12
124
+ if 'fontsize' in self._cfg:
125
+ fontsize = self._cfg['fontsize']
126
+
127
+ form = self._cfg['format']
128
+ log.debug(f'Adding values with format {form}')
129
+
130
+ for i_x, _ in enumerate(self._l_label):
131
+ for i_y, _ in enumerate(self._l_label):
132
+ try:
133
+ val = self._mat[i_y, i_x]
134
+ except:
135
+ log.error(f'Cannot access ({i_x}, {i_y}) in:')
136
+ print(self._mat)
137
+ raise
138
+
139
+ if numpy.ma.is_masked(val):
140
+ text = ''
141
+ else:
142
+ text = form.format(val)
143
+
144
+ _ = ax.text(i_x, i_y, text, ha="center", va="center", fontsize=fontsize, color="k")
145
+ # -----------------------------------------------
146
+ def plot(self):
147
+ '''
148
+ Runs plotting, plot can be accessed through:
149
+
150
+ ```python
151
+ plt.show()
152
+ plt.savefig(...)
153
+ ```
154
+ '''
155
+ self._initialize()
156
+ self._draw_matrix()
157
+ #-------------------------------------------------------
@@ -2,7 +2,6 @@
2
2
  Module containing plotter class
3
3
  '''
4
4
 
5
- import hist
6
5
  from hist import Hist
7
6
 
8
7
  import numpy
@@ -79,6 +78,7 @@ class Plotter1D(Plotter):
79
78
  l_bc_all = []
80
79
  for name, arr_val in d_data.items():
81
80
  arr_wgt = d_wgt[name] if d_wgt is not None else numpy.ones_like(arr_val)
81
+ arr_wgt = self._normalize_weights(arr_wgt, var)
82
82
  hst = Hist.new.Reg(bins=bins, start=minx, stop=maxx, name='x', label=name).Weight()
83
83
  hst.fill(x=arr_val, weight=arr_wgt)
84
84
  hst.plot(label=name)
@@ -88,6 +88,23 @@ class Plotter1D(Plotter):
88
88
 
89
89
  return max_y
90
90
  # --------------------------------------------
91
+ def _normalize_weights(self, arr_wgt : numpy.ndarray, var : str) -> numpy.ndarray:
92
+ cfg_var = self._d_cfg['plots'][var]
93
+ if 'normalized' not in cfg_var:
94
+ log.debug(f'Not normalizing for variable: {var}')
95
+ return arr_wgt
96
+
97
+ if not cfg_var['normalized']:
98
+ log.debug(f'Not normalizing for variable: {var}')
99
+ return arr_wgt
100
+
101
+ log.debug(f'Normalizing for variable: {var}')
102
+ total = numpy.sum(arr_wgt)
103
+ arr_wgt = arr_wgt / total
104
+
105
+ return arr_wgt
106
+ # --------------------------------------------
107
+
91
108
  def _style_plot(self, var : str, max_y : float) -> None:
92
109
  d_cfg = self._d_cfg['plots'][var]
93
110
  yscale = d_cfg['yscale' ] if 'yscale' in d_cfg else 'linear'
@@ -0,0 +1,33 @@
1
+ '''
2
+ Module with plotting utilities
3
+ '''
4
+ # pylint: disable=too-many-positional-arguments, too-many-arguments
5
+
6
+ import matplotlib.pyplot as plt
7
+
8
+ # ---------------------------------------------------------------------------
9
+ def annotate(
10
+ l_x : list[float],
11
+ l_y : list[float],
12
+ l_v : list[float],
13
+ form : str = '{}',
14
+ xoff : int = 0,
15
+ yoff : int =-20,
16
+ size : int = 20,
17
+ color : str = 'black') -> None:
18
+ '''
19
+ Function used to annotate plots
20
+
21
+ l_x(y): List of x(y) coordinates for markers
22
+ l_v : List of numerical values to annotate markers
23
+ form : Formatting, e.g. {:.3f}
24
+ color : String with color for markers and annotation, e.g. black
25
+ size : Font size, default 20
26
+ x(y)off : Offset in x(y).
27
+ '''
28
+ for x, y, v in zip(l_x, l_y, l_v):
29
+ label = form.format(v)
30
+
31
+ plt.plot(x, y, marker='o', markersize= 5, markeredgecolor=color, markerfacecolor=color)
32
+ plt.annotate(label, (x,y), fontsize=size, textcoords="offset points", xytext=(xoff, yoff), color=color, ha='center')
33
+ # ---------------------------------------------------------------------------
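
For illustration, a minimal standalone use of `annotate` could look as follows; the coordinates and annotated values are made up:

```python
# Plot a few made-up points and annotate them with made-up scores
import matplotlib.pyplot as plt

import dmu.plotting.utilities as plu

l_x = [0.5, 0.7, 0.9]      # e.g. signal efficiencies
l_y = [0.95, 0.90, 0.70]   # e.g. background rejections
l_v = [0.82, 0.65, 0.41]   # e.g. classifier scores shown next to the markers

plt.plot(l_x, l_y, color='b')
plu.annotate(l_x=l_x, l_y=l_y, l_v=l_v, form='{:.2f}', color='green', xoff=-15, yoff=-15, size=10)
plt.savefig('annotated.png')
```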
@@ -1,5 +1,8 @@
1
+ dataset:
2
+ nan :
3
+ x : 0
1
4
  training :
2
- nfold : 3
5
+ nfold : 3
3
6
  features : [x, y, z]
4
7
  rdm_stat : 1
5
8
  hyper :
@@ -7,31 +10,43 @@ training :
7
10
  n_estimators : 100
8
11
  max_depth : 3
9
12
  learning_rate : 0.1
10
- min_samples_split : 2
13
+ min_samples_split : 2
11
14
  saving:
12
- path : 'tests/ml/train_mva/model.pkl'
15
+ path : '/tmp/dmu/ml/tests/train_mva/model.pkl'
13
16
  plotting:
14
17
  roc :
15
- min : [0, 0]
16
- val_dir : 'tests/ml/train_mva'
18
+ min : [0.0, 0.0]
19
+ max : [1.2, 1.2]
20
+ annotate:
21
+ sig_eff : [0.5, 0.6, 0.7, 0.8, 0.9]
22
+ form : '{:.2f}'
23
+ color: 'green'
24
+ xoff : -15
25
+ yoff : -15
26
+ size : 10
27
+ correlation:
28
+ title : 'Correlation matrix'
29
+ size : [10, 10]
30
+ mask_value : 0
31
+ val_dir : '/tmp/dmu/ml/tests/train_mva'
17
32
  features:
18
33
  saving:
19
- plt_dir : 'tests/ml/train_mva/features'
34
+ plt_dir : '/tmp/dmu/ml/tests/train_mva/features'
20
35
  plots:
21
- w :
36
+ w :
22
37
  binning : [-4, 4, 100]
23
- yscale : 'linear'
38
+ yscale : 'linear'
24
39
  labels : ['w', '']
25
- x :
40
+ x :
26
41
  binning : [-4, 4, 100]
27
- yscale : 'linear'
42
+ yscale : 'linear'
28
43
  labels : ['x', '']
29
- y :
44
+ y :
30
45
  binning : [-4, 4, 100]
31
- yscale : 'linear'
46
+ yscale : 'linear'
32
47
  labels : ['y', '']
33
- z :
48
+ z :
34
49
  binning : [-4, 4, 100]
35
- yscale : 'linear'
50
+ yscale : 'linear'
36
51
  labels : ['z', '']
37
-
52
+
@@ -0,0 +1,9 @@
1
+ saving:
2
+ plt_dir : tests/plotting/normalized
3
+ plots:
4
+ x :
5
+ normalized : true
6
+ binning : [-5.0, 8.0, 40]
7
+ y :
8
+ normalized : false
9
+ binning : [-5.0, 8.0, 40]
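
A sketch of how a configuration like the one above would presumably be used, following the `Plotter1D` usage that appears in `train_mva.py`; the toy input dataframe and the path to the YAML file are assumptions made for illustration:

```python
# Build a toy ROOT dataframe and plot it with the normalized.yaml settings
import yaml
from ROOT import RDataFrame

from dmu.plotting.plotter_1d import Plotter1D as Plotter

rdf = RDataFrame(1000)
rdf = rdf.Define('x', 'gRandom->Gaus(0, 1)')
rdf = rdf.Define('y', 'gRandom->Gaus(1, 2)')

with open('normalized.yaml', encoding='utf-8') as ifile:
    cfg = yaml.safe_load(ifile)

ptr = Plotter(d_rdf={'toy' : rdf}, cfg=cfg)
ptr.run()
```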