data-manipulation-utilities 0.2.1.tar.gz → 0.2.3.tar.gz

This diff shows the changes between two publicly released versions of the package, as they appear in their respective public registries. It is provided for informational purposes only.
Files changed (58)
  1. {data_manipulation_utilities-0.2.1/src/data_manipulation_utilities.egg-info → data_manipulation_utilities-0.2.3}/PKG-INFO +39 -6
  2. data_manipulation_utilities-0.2.1/PKG-INFO → data_manipulation_utilities-0.2.3/README.md +37 -24
  3. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/pyproject.toml +2 -2
  4. data_manipulation_utilities-0.2.1/README.md → data_manipulation_utilities-0.2.3/src/data_manipulation_utilities.egg-info/PKG-INFO +57 -4
  5. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/data_manipulation_utilities.egg-info/requires.txt +1 -1
  6. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/ml/cv_predict.py +50 -2
  7. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/ml/train_mva.py +17 -6
  8. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/rdataframe/utilities.py +28 -2
  9. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/setup.cfg +0 -0
  10. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/data_manipulation_utilities.egg-info/SOURCES.txt +0 -0
  11. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/data_manipulation_utilities.egg-info/dependency_links.txt +0 -0
  12. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/data_manipulation_utilities.egg-info/entry_points.txt +0 -0
  13. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/data_manipulation_utilities.egg-info/top_level.txt +0 -0
  14. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/arrays/utilities.py +0 -0
  15. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/generic/utilities.py +0 -0
  16. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/logging/log_store.py +0 -0
  17. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/ml/cv_classifier.py +0 -0
  18. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/ml/utilities.py +0 -0
  19. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/pdataframe/utilities.py +0 -0
  20. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/plotting/matrix.py +0 -0
  21. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/plotting/plotter.py +0 -0
  22. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/plotting/plotter_1d.py +0 -0
  23. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/plotting/plotter_2d.py +0 -0
  24. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/plotting/utilities.py +0 -0
  25. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/rdataframe/atr_mgr.py +0 -0
  26. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/rfile/rfprinter.py +0 -0
  27. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/rfile/utilities.py +0 -0
  28. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/stats/fitter.py +0 -0
  29. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/stats/function.py +0 -0
  30. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/stats/gof_calculator.py +0 -0
  31. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/stats/minimizers.py +0 -0
  32. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/stats/model_factory.py +0 -0
  33. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/stats/utilities.py +0 -0
  34. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/stats/zfit_plotter.py +0 -0
  35. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/testing/utilities.py +0 -0
  36. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/text/transformer.py +0 -0
  37. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/__init__.py +0 -0
  38. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/ml/tests/train_mva.yaml +0 -0
  39. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/plotting/tests/2d.yaml +0 -0
  40. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/plotting/tests/fig_size.yaml +0 -0
  41. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/plotting/tests/high_stat.yaml +0 -0
  42. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/plotting/tests/name.yaml +0 -0
  43. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/plotting/tests/no_bounds.yaml +0 -0
  44. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/plotting/tests/normalized.yaml +0 -0
  45. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/plotting/tests/simple.yaml +0 -0
  46. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/plotting/tests/title.yaml +0 -0
  47. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/plotting/tests/weights.yaml +0 -0
  48. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/text/transform.toml +0 -0
  49. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/text/transform.txt +0 -0
  50. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/text/transform_set.toml +0 -0
  51. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/text/transform_set.txt +0 -0
  52. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_data/text/transform_trf.txt +0 -0
  53. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_scripts/git/publish +0 -0
  54. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_scripts/physics/check_truth.py +0 -0
  55. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_scripts/rfile/compare_root_files.py +0 -0
  56. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_scripts/rfile/print_trees.py +0 -0
  57. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_scripts/ssh/coned.py +0 -0
  58. {data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu_scripts/text/transform_text.py +0 -0
{data_manipulation_utilities-0.2.1/src/data_manipulation_utilities.egg-info → data_manipulation_utilities-0.2.3}/PKG-INFO

@@ -1,11 +1,11 @@
  Metadata-Version: 2.2
  Name: data_manipulation_utilities
- Version: 0.2.1
+ Version: 0.2.3
  Description-Content-Type: text/markdown
  Requires-Dist: logzero
  Requires-Dist: PyYAML
  Requires-Dist: scipy
- Requires-Dist: awkward==2.4.6
+ Requires-Dist: awkward
  Requires-Dist: tqdm
  Requires-Dist: joblib
  Requires-Dist: scikit-learn
@@ -424,6 +424,10 @@ where the settings for the training go in a config dictionary, which when writte

  ```yaml
  dataset:
+     # Before training, new features can be defined as below
+     define :
+         x : v + w
+         y : v - w
      # If the key is found to be NaN, replace its value with the number provided
      # This will be used in the training.
      # Otherwise the entries with NaNs will be dropped
@@ -433,7 +437,7 @@ dataset:
          z : -999
  training :
      nfold : 10
-     features : [w, x, y, z]
+     features : [x, y, z]
      hyper :
          loss : log_loss
          n_estimators : 100
@@ -493,7 +497,9 @@ When training on real data, several things might go wrong and the code will try
  will end up in different folds. The tool checks for wether a model is evaluated for an entry that was used for training and raise an exception. Thus, repeated
  entries will be removed before training.

- - **NaNs**: Entries with NaNs will break the training with the scikit GradientBoostClassifier base class. Thus, we also remove them from the training.
+ - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
+   - Can use the `nan` section shown above to replace `NaN` values with something else
+   - For whatever remains we remove the entries from the training.

  ## Application

@@ -516,15 +522,24 @@ The picking process happens through the comparison of hashes between the samples
  The hashes of the training samples are stored in the pickled model itself; which therefore is a reimplementation of
  `GradientBoostClassifier`, here called `CVClassifier`.

- If a sample exist, that was used in the training of _every_ model, no model can be chosen for the prediction and an
+ If a sample exists, that was used in the training of _every_ model, no model can be chosen for the prediction and a
  `CVSameData` exception will be risen.

+ During training, the configuration will be stored in the model. Therefore, variable definitions can be picked up for evaluation
+ from that configuration and the user does not need to define extra columns.
+
  ### Caveats

  When evaluating the model with real data, problems might occur, we deal with them as follows:

  - **Repeated entries**: When there are repeated features in the dataset to be evaluated we assign the same probabilities, no filtering is used.
- - **NaNs**: Entries with NaNs will break the evaluation. These entries will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
+ - **NaNs**: Entries with NaNs will break the evaluation. These entries will be:
+   - Replaced by other values before evaluation IF a replacement was specified during training. The training configuration will be stored in the model
+     and can be accessed through:
+     ```python
+     model.cfg
+     ```
+   - For whatever entries that are still NaN, they will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
  saved as -1. I.e. entries with NaNs will have probabilities of -1.

  # Pandas dataframes
@@ -563,6 +578,24 @@ These are utility functions meant to be used with ROOT dataframes.

  ## Adding a column from a numpy array

+ ### With numba
+
+ For this do:
+
+ ```python
+ import dmu.rdataframe.utilities as ut
+
+ arr_val = numpy.array([10, 20, 30])
+ rdf = ut.add_column_with_numba(rdf, arr_val, 'values', identifier='some_name')
+ ```
+
+ where the identifier needs to be unique, every time the function is called.
+ This is the case, because the addition is done internally by declaring a numba function whose name
+ cannot be repeated as mentioned
+ [here](https://root-forum.cern.ch/t/ways-to-work-around-the-redefinition-of-compiled-functions-in-one-single-notebook-session/41442/1)
+
+ ### With awkward
+
  For this do:

  ```python
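The `define` block added to the training configuration above creates derived features before the classifier sees the data, and the same expressions are re-applied at evaluation time from the configuration stored in the model (see the `cv_predict.py` diff below, which loops over `cfg['dataset']['define']` and calls `rdf.Define(name, expr)`). The following is a minimal stand-in sketch of those semantics using plain pandas with toy values; the real code operates on a ROOT dataframe:

```python
import pandas as pd

# Toy stand-in for the dataset.define block: each entry maps a new
# feature name to an expression over existing columns, mirroring
# x : v + w and y : v - w from the README example above.
d_def = {'x': 'v + w', 'y': 'v - w'}

df = pd.DataFrame({'v': [1.0, 2.0, 3.0], 'w': [0.5, 1.5, 2.5]})
for name, expr in d_def.items():
    df[name] = df.eval(expr)    # the package uses rdf.Define(name, expr) instead

print(df[['x', 'y']])           # derived features listed under training.features
```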
data_manipulation_utilities-0.2.1/PKG-INFO → data_manipulation_utilities-0.2.3/README.md

@@ -1,23 +1,3 @@
- Metadata-Version: 2.2
- Name: data_manipulation_utilities
- Version: 0.2.1
- Description-Content-Type: text/markdown
- Requires-Dist: logzero
- Requires-Dist: PyYAML
- Requires-Dist: scipy
- Requires-Dist: awkward==2.4.6
- Requires-Dist: tqdm
- Requires-Dist: joblib
- Requires-Dist: scikit-learn
- Requires-Dist: toml
- Requires-Dist: numpy
- Requires-Dist: matplotlib
- Requires-Dist: mplhep
- Requires-Dist: hist[plot]
- Requires-Dist: pandas
- Provides-Extra: dev
- Requires-Dist: pytest; extra == "dev"
-
  # D(ata) M(anipulation) U(tilities)

  These are tools that can be used for different data analysis tasks.
@@ -424,6 +404,10 @@ where the settings for the training go in a config dictionary, which when writte

  ```yaml
  dataset:
+     # Before training, new features can be defined as below
+     define :
+         x : v + w
+         y : v - w
      # If the key is found to be NaN, replace its value with the number provided
      # This will be used in the training.
      # Otherwise the entries with NaNs will be dropped
@@ -433,7 +417,7 @@ dataset:
          z : -999
  training :
      nfold : 10
-     features : [w, x, y, z]
+     features : [x, y, z]
      hyper :
          loss : log_loss
          n_estimators : 100
@@ -493,7 +477,9 @@ When training on real data, several things might go wrong and the code will try
  will end up in different folds. The tool checks for wether a model is evaluated for an entry that was used for training and raise an exception. Thus, repeated
  entries will be removed before training.

- - **NaNs**: Entries with NaNs will break the training with the scikit GradientBoostClassifier base class. Thus, we also remove them from the training.
+ - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
+   - Can use the `nan` section shown above to replace `NaN` values with something else
+   - For whatever remains we remove the entries from the training.

  ## Application

@@ -516,15 +502,24 @@ The picking process happens through the comparison of hashes between the samples
  The hashes of the training samples are stored in the pickled model itself; which therefore is a reimplementation of
  `GradientBoostClassifier`, here called `CVClassifier`.

- If a sample exist, that was used in the training of _every_ model, no model can be chosen for the prediction and an
+ If a sample exists, that was used in the training of _every_ model, no model can be chosen for the prediction and a
  `CVSameData` exception will be risen.

+ During training, the configuration will be stored in the model. Therefore, variable definitions can be picked up for evaluation
+ from that configuration and the user does not need to define extra columns.
+
  ### Caveats

  When evaluating the model with real data, problems might occur, we deal with them as follows:

  - **Repeated entries**: When there are repeated features in the dataset to be evaluated we assign the same probabilities, no filtering is used.
- - **NaNs**: Entries with NaNs will break the evaluation. These entries will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
+ - **NaNs**: Entries with NaNs will break the evaluation. These entries will be:
+   - Replaced by other values before evaluation IF a replacement was specified during training. The training configuration will be stored in the model
+     and can be accessed through:
+     ```python
+     model.cfg
+     ```
+   - For whatever entries that are still NaN, they will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
  saved as -1. I.e. entries with NaNs will have probabilities of -1.

  # Pandas dataframes
@@ -563,6 +558,24 @@ These are utility functions meant to be used with ROOT dataframes.

  ## Adding a column from a numpy array

+ ### With numba
+
+ For this do:
+
+ ```python
+ import dmu.rdataframe.utilities as ut
+
+ arr_val = numpy.array([10, 20, 30])
+ rdf = ut.add_column_with_numba(rdf, arr_val, 'values', identifier='some_name')
+ ```
+
+ where the identifier needs to be unique, every time the function is called.
+ This is the case, because the addition is done internally by declaring a numba function whose name
+ cannot be repeated as mentioned
+ [here](https://root-forum.cern.ch/t/ways-to-work-around-the-redefinition-of-compiled-functions-in-one-single-notebook-session/41442/1)
+
+ ### With awkward
+
  For this do:

  ```python
{data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/pyproject.toml

@@ -1,12 +1,12 @@
  [project]
  name = 'data_manipulation_utilities'
- version = '0.2.1'
+ version = '0.2.3'
  readme = 'README.md'
  dependencies= [
      'logzero',
      'PyYAML',
      'scipy',
-     'awkward==2.4.6',
+     'awkward',
      'tqdm',
      'joblib',
      'scikit-learn',
data_manipulation_utilities-0.2.1/README.md → data_manipulation_utilities-0.2.3/src/data_manipulation_utilities.egg-info/PKG-INFO

@@ -1,3 +1,23 @@
+ Metadata-Version: 2.2
+ Name: data_manipulation_utilities
+ Version: 0.2.3
+ Description-Content-Type: text/markdown
+ Requires-Dist: logzero
+ Requires-Dist: PyYAML
+ Requires-Dist: scipy
+ Requires-Dist: awkward
+ Requires-Dist: tqdm
+ Requires-Dist: joblib
+ Requires-Dist: scikit-learn
+ Requires-Dist: toml
+ Requires-Dist: numpy
+ Requires-Dist: matplotlib
+ Requires-Dist: mplhep
+ Requires-Dist: hist[plot]
+ Requires-Dist: pandas
+ Provides-Extra: dev
+ Requires-Dist: pytest; extra == "dev"
+
  # D(ata) M(anipulation) U(tilities)

  These are tools that can be used for different data analysis tasks.
@@ -404,6 +424,10 @@ where the settings for the training go in a config dictionary, which when writte

  ```yaml
  dataset:
+     # Before training, new features can be defined as below
+     define :
+         x : v + w
+         y : v - w
      # If the key is found to be NaN, replace its value with the number provided
      # This will be used in the training.
      # Otherwise the entries with NaNs will be dropped
@@ -413,7 +437,7 @@ dataset:
          z : -999
  training :
      nfold : 10
-     features : [w, x, y, z]
+     features : [x, y, z]
      hyper :
          loss : log_loss
          n_estimators : 100
@@ -473,7 +497,9 @@ When training on real data, several things might go wrong and the code will try
  will end up in different folds. The tool checks for wether a model is evaluated for an entry that was used for training and raise an exception. Thus, repeated
  entries will be removed before training.

- - **NaNs**: Entries with NaNs will break the training with the scikit GradientBoostClassifier base class. Thus, we also remove them from the training.
+ - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
+   - Can use the `nan` section shown above to replace `NaN` values with something else
+   - For whatever remains we remove the entries from the training.

  ## Application

@@ -496,15 +522,24 @@ The picking process happens through the comparison of hashes between the samples
  The hashes of the training samples are stored in the pickled model itself; which therefore is a reimplementation of
  `GradientBoostClassifier`, here called `CVClassifier`.

- If a sample exist, that was used in the training of _every_ model, no model can be chosen for the prediction and an
+ If a sample exists, that was used in the training of _every_ model, no model can be chosen for the prediction and a
  `CVSameData` exception will be risen.

+ During training, the configuration will be stored in the model. Therefore, variable definitions can be picked up for evaluation
+ from that configuration and the user does not need to define extra columns.
+
  ### Caveats

  When evaluating the model with real data, problems might occur, we deal with them as follows:

  - **Repeated entries**: When there are repeated features in the dataset to be evaluated we assign the same probabilities, no filtering is used.
- - **NaNs**: Entries with NaNs will break the evaluation. These entries will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
+ - **NaNs**: Entries with NaNs will break the evaluation. These entries will be:
+   - Replaced by other values before evaluation IF a replacement was specified during training. The training configuration will be stored in the model
+     and can be accessed through:
+     ```python
+     model.cfg
+     ```
+   - For whatever entries that are still NaN, they will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
  saved as -1. I.e. entries with NaNs will have probabilities of -1.

  # Pandas dataframes
@@ -543,6 +578,24 @@ These are utility functions meant to be used with ROOT dataframes.

  ## Adding a column from a numpy array

+ ### With numba
+
+ For this do:
+
+ ```python
+ import dmu.rdataframe.utilities as ut
+
+ arr_val = numpy.array([10, 20, 30])
+ rdf = ut.add_column_with_numba(rdf, arr_val, 'values', identifier='some_name')
+ ```
+
+ where the identifier needs to be unique, every time the function is called.
+ This is the case, because the addition is done internally by declaring a numba function whose name
+ cannot be repeated as mentioned
+ [here](https://root-forum.cern.ch/t/ways-to-work-around-the-redefinition-of-compiled-functions-in-one-single-notebook-session/41442/1)
+
+ ### With awkward
+
  For this do:

  ```python
{data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/data_manipulation_utilities.egg-info/requires.txt

@@ -1,7 +1,7 @@
  logzero
  PyYAML
  scipy
- awkward==2.4.6
+ awkward
  tqdm
  joblib
  scikit-learn
{data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/ml/cv_predict.py

@@ -32,11 +32,56 @@ class CVPredict:
          if rdf is None:
              raise ValueError('No ROOT dataframe passed')

-         self._l_model = models
-         self._rdf = rdf
+         self._l_model   = models
+         self._rdf       = rdf
+         self._d_nan_rep : dict[str,str]

          self._arr_patch : numpy.ndarray
      # --------------------------------------------
+     def _initialize(self):
+         self._rdf = self._define_columns(self._rdf)
+         self._d_nan_rep = self._get_nan_replacements()
+     # --------------------------------------------
+     def _define_columns(self, rdf : RDataFrame) -> RDataFrame:
+         cfg = self._l_model[0].cfg
+
+         if 'define' not in cfg['dataset']:
+             log.debug('No define section found in config, will not define extra columns')
+             return self._rdf
+
+         d_def = cfg['dataset']['define']
+         log.debug(60 * '-')
+         log.info('Defining columns in RDF before evaluating classifier')
+         log.debug(60 * '-')
+         for name, expr in d_def.items():
+             log.debug(f'{name:<20}{"<---":20}{expr:<100}')
+             rdf = rdf.Define(name, expr)
+
+         return rdf
+     # --------------------------------------------
+     def _get_nan_replacements(self) -> dict[str,str]:
+         cfg = self._l_model[0].cfg
+
+         if 'nan' not in cfg['dataset']:
+             log.debug('No define section found in config, will not define extra columns')
+             return {}
+
+         return cfg['dataset']['nan']
+     # --------------------------------------------
+     def _replace_nans(self, df : pnd.DataFrame) -> pnd.DataFrame:
+         if len(self._d_nan_rep) == 0:
+             log.debug('Not doing any NaN replacement')
+             return df
+
+         log.debug(60 * '-')
+         log.info('Doing NaN replacements')
+         log.debug(60 * '-')
+         for var, val in self._d_nan_rep.items():
+             log.debug(f'{var:<20}{"--->":20}{val:<20.3f}')
+             df[var] = df[var].fillna(val)
+
+         return df
+     # --------------------------------------------
      def _get_df(self):
          '''
          Will make ROOT rdf into dataframe and return it
@@ -45,6 +90,7 @@ class CVPredict:
          l_ft = model.features
          d_data= self._rdf.AsNumpy(l_ft)
          df_ft = pnd.DataFrame(d_data)
+         df_ft = self._replace_nans(df_ft)
          df_ft = ut.patch_and_tag(df_ft)

          if 'patched_indices' in df_ft.attrs:
@@ -136,6 +182,8 @@ class CVPredict:
          '''
          Will return array of prediction probabilities for the signal category
          '''
+         self._initialize()
+
          df_ft = self._get_df()
          model = self._l_model[0]

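Taken together, the new hooks mean `_get_df` now applies the `nan` replacements stored in the model configuration before `patch_and_tag` zero-patches whatever is left. Below is a self-contained sketch of that ordering, with a toy replacement map mirroring the `z : -999` README example; `patch_and_tag` itself is not reproduced, only its observable effect:

```python
import numpy as np
import pandas as pd

# Step 1: configured replacements, as in CVPredict._replace_nans.
d_nan_rep = {'z': -999}                       # toy stand-in for cfg['dataset']['nan']
df = pd.DataFrame({'x': [1.0, np.nan], 'z': [np.nan, 2.0]})
for var, val in d_nan_rep.items():
    df[var] = df[var].fillna(val)

# Step 2: anything still NaN gets zero-patched; those row indices are
# remembered so their probabilities can be overwritten with -1 later.
arr_patch = np.where(df.isna().any(axis=1))[0]
df = df.fillna(0)

print(arr_patch)                              # -> [1]; row 1 still had a NaN in 'x'
```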
{data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/ml/train_mva.py

@@ -26,7 +26,7 @@ from dmu.plotting.matrix import MatrixPlotter
  from dmu.logging.log_store import LogStore

  npa = numpy.ndarray
- log = LogStore.add_logger('data_checks:train_mva')
+ log = LogStore.add_logger('dmu:ml:train_mva')
  # ---------------------------------------------
  class TrainMva:
      '''
@@ -334,10 +334,10 @@ class TrainMva:
          if 'max' in self._cfg['plotting']['roc']:
              [max_x, max_y] = self._cfg['plotting']['roc']['max']

-         self._plot_probabilities(xval_ts, yval_ts, l_prb_ts)
-
          plt.plot(xval_ts, yval_ts, color='b', label=f'Test: {area_ts:.3f}')
          plt.plot(xval_tr, yval_tr, color='r', label=f'Train: {area_tr:.3f}')
+         self._plot_probabilities(xval_ts, yval_ts, l_prb_ts, l_lab_ts)
+
          plt.xlabel('Signal efficiency')
          plt.ylabel('Background rejection')
          plt.title(f'Fold: {ifold}')
@@ -351,13 +351,17 @@ class TrainMva:
      def _plot_probabilities(self,
                              arr_seff: npa,
                              arr_brej: npa,
-                             arr_sprb: npa) -> None:
+                             arr_sprb: npa,
+                             arr_labl: npa) -> None:

          roc_cfg = self._cfg['plotting']['roc']
          if 'annotate' not in roc_cfg:
              log.debug('Annotation section in the ROC curve config not found, skipping annotation')
              return

+         l_sprb = [ sprb for sprb, labl in zip(arr_sprb, arr_labl) if labl == 1 ]
+         arr_sprb = numpy.array(l_sprb)
+
          plt_cfg = roc_cfg['annotate']
          if 'sig_eff' not in plt_cfg:
              l_seff_target = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
@@ -366,17 +370,24 @@ class TrainMva:
              del plt_cfg['sig_eff']

          arr_seff_target = numpy.array(l_seff_target)
+         arr_quantile = 1 - arr_seff_target

-         l_score = numpy.quantile(arr_sprb, 1 - arr_seff_target)
+         l_score = numpy.quantile(arr_sprb, arr_quantile)
          l_seff = []
          l_brej = []
-         for seff_target in l_seff_target:
+
+         log.debug(60 * '-')
+         log.debug(f'{"SigEff":20}{"BkgRej":20}{"Score":20}')
+         log.debug(60 * '-')
+         for seff_target, score in zip(arr_seff_target, l_score):
              arr_diff = numpy.abs(arr_seff - seff_target)
              ind = numpy.argmin(arr_diff)

              seff = arr_seff[ind]
              brej = arr_brej[ind]

+             log.debug(f'{seff:<20.3f}{brej:<20.3f}{score:<20.2f}')
+
              l_seff.append(seff)
              l_brej.append(brej)

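The reworked annotation logic now derives score working points from the signal entries only (the `labl == 1` filter added above): for a target signal efficiency `e`, the threshold is the `1 - e` quantile of the signal scores, so a cut at that score keeps a fraction `e` of the signal. A small self-contained numpy sketch, with toy scores and labels standing in for `l_prb_ts` and `l_lab_ts`:

```python
import numpy

# Toy classifier scores and truth labels.
rng = numpy.random.default_rng(seed=1)
arr_sprb = rng.uniform(size=1_000)
arr_labl = rng.integers(0, 2, size=1_000)

# Keep signal entries only, then take one quantile per efficiency
# target, as in _plot_probabilities.
arr_sig = arr_sprb[arr_labl == 1]
arr_seff_target = numpy.array([0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])
l_score = numpy.quantile(arr_sig, 1 - arr_seff_target)

for seff, score in zip(arr_seff_target, l_score):
    print(f'{seff:<10.2f}{score:<10.3f}')   # efficiency -> score cut
```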
{data_manipulation_utilities-0.2.1 → data_manipulation_utilities-0.2.3}/src/dmu/rdataframe/utilities.py

@@ -1,6 +1,7 @@
  '''
  Module containing utility functions to be used with ROOT dataframes
  '''
+ # pylint: disable=no-name-in-module

  import re
  from dataclasses import dataclass
@@ -10,7 +11,7 @@ import pandas as pnd
  import awkward as ak
  import numpy

- from ROOT import RDataFrame, RDF
+ from ROOT import RDataFrame, RDF, Numba

  from dmu.logging.log_store import LogStore

@@ -34,6 +35,8 @@ def add_column(rdf : RDataFrame, arr_val : Union[numpy.ndarray,None], name : str
      exclude_re : Regex with patter of column names that we won't pick
      '''

+     log.warning(f'Adding column {name} with awkward')
+
      d_opt = {} if d_opt is None else d_opt
      if arr_val is None:
          raise ValueError('Array of values not introduced')
@@ -66,12 +69,35 @@ def add_column(rdf : RDataFrame, arr_val : Union[numpy.ndarray,None], name : str
      if arr_val.dtype == 'object':
          arr_val = arr_val.astype(float)

-     d_data[name] = arr_val
+     d_data[name] = ak.from_numpy(arr_val)

      rdf = ak.to_rdataframe(d_data)

      return rdf
  # ---------------------------------------------------------------------
+ def add_column_with_numba(
+         rdf : RDataFrame,
+         arr_val : Union[numpy.ndarray,None],
+         name : str,
+         identifier : str) -> RDataFrame:
+     '''
+     Will take a dataframe, an array of numbers and a string
+     Will add the array as a colunm to the dataframe
+
+     The `identifier` argument is a string need in order to avoid collisions
+     when using Numba to define a function to get the value from.
+     '''
+     identifier=f'fun_{identifier}'
+
+     @Numba.Declare(['int'], 'float', name=identifier)
+     def get_value(index):
+         return arr_val[index]
+
+     log.debug(f'Adding column {name} with numba')
+     rdf = rdf.Define(name, f'Numba::{identifier}(rdfentry_)')
+
+     return rdf
+ # ---------------------------------------------------------------------
  def rdf_report_to_df(rep : RDF.RCutFlowReport) -> pnd.DataFrame:
      '''
      Takes the output of rdf.Report(), i.e. an RDataFrame cutflow report.
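As the README section added above notes, each call to `add_column_with_numba` declares a function named `fun_<identifier>` through `Numba.Declare`, and a declared name cannot be re-declared within a session, so the identifier must be fresh on every call. A usage sketch, assuming a working ROOT installation with numba support; the dataframe and arrays are toy values:

```python
import numpy
from ROOT import RDataFrame

import dmu.rdataframe.utilities as ut

# Toy dataframe with three entries and two arrays to attach as columns.
rdf = RDataFrame(3)
arr_a = numpy.array([10.0, 20.0, 30.0])
arr_b = numpy.array([0.1, 0.2, 0.3])

# A distinct identifier per call avoids redeclaring Numba::fun_<identifier>.
rdf = ut.add_column_with_numba(rdf, arr_a, 'val_a', identifier='add_val_a')
rdf = ut.add_column_with_numba(rdf, arr_b, 'val_b', identifier='add_val_b')
```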