data-manipulation-utilities 0.2.0__tar.gz → 0.2.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/PKG-INFO +113 -6
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/README.md +111 -4
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/pyproject.toml +2 -2
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/data_manipulation_utilities.egg-info/PKG-INFO +113 -6
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/data_manipulation_utilities.egg-info/SOURCES.txt +3 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/data_manipulation_utilities.egg-info/requires.txt +1 -1
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/ml/cv_classifier.py +2 -1
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/ml/cv_predict.py +50 -2
- data_manipulation_utilities-0.2.2/src/dmu/ml/train_mva.py +447 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/ml/utilities.py +8 -0
- data_manipulation_utilities-0.2.2/src/dmu/pdataframe/utilities.py +36 -0
- data_manipulation_utilities-0.2.2/src/dmu/plotting/matrix.py +157 -0
- data_manipulation_utilities-0.2.2/src/dmu/plotting/utilities.py +33 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/rdataframe/utilities.py +1 -1
- data_manipulation_utilities-0.2.2/src/dmu_data/ml/tests/train_mva.yaml +52 -0
- data_manipulation_utilities-0.2.0/src/dmu/ml/train_mva.py +0 -257
- data_manipulation_utilities-0.2.0/src/dmu_data/ml/tests/train_mva.yaml +0 -37
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/setup.cfg +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/data_manipulation_utilities.egg-info/dependency_links.txt +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/data_manipulation_utilities.egg-info/entry_points.txt +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/data_manipulation_utilities.egg-info/top_level.txt +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/arrays/utilities.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/generic/utilities.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/logging/log_store.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/plotting/plotter.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/plotting/plotter_1d.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/plotting/plotter_2d.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/rdataframe/atr_mgr.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/rfile/rfprinter.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/rfile/utilities.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/stats/fitter.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/stats/function.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/stats/gof_calculator.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/stats/minimizers.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/stats/model_factory.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/stats/utilities.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/stats/zfit_plotter.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/testing/utilities.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/text/transformer.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/__init__.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/plotting/tests/2d.yaml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/plotting/tests/fig_size.yaml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/plotting/tests/high_stat.yaml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/plotting/tests/name.yaml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/plotting/tests/no_bounds.yaml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/plotting/tests/normalized.yaml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/plotting/tests/simple.yaml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/plotting/tests/title.yaml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/plotting/tests/weights.yaml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/text/transform.toml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/text/transform.txt +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/text/transform_set.toml +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/text/transform_set.txt +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_data/text/transform_trf.txt +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_scripts/git/publish +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_scripts/physics/check_truth.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_scripts/rfile/compare_root_files.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_scripts/rfile/print_trees.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_scripts/ssh/coned.py +0 -0
- {data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu_scripts/text/transform_text.py +0 -0
{data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/PKG-INFO
RENAMED
@@ -1,11 +1,11 @@
 Metadata-Version: 2.2
 Name: data_manipulation_utilities
-Version: 0.2.0
+Version: 0.2.2
 Description-Content-Type: text/markdown
 Requires-Dist: logzero
 Requires-Dist: PyYAML
 Requires-Dist: scipy
-Requires-Dist: awkward
+Requires-Dist: awkward
 Requires-Dist: tqdm
 Requires-Dist: joblib
 Requires-Dist: scikit-learn
@@ -423,9 +423,21 @@ obj.run()
 where the settings for the training go in a config dictionary, which when written to YAML looks like:
 
 ```yaml
+dataset:
+    # Before training, new features can be defined as below
+    define :
+        x : v + w
+        y : v - w
+    # If the key is found to be NaN, replace its value with the number provided
+    # This will be used in the training.
+    # Otherwise the entries with NaNs will be dropped
+    nan:
+        x : 0
+        y : 0
+        z : -999
 training :
     nfold    : 10
-    features : [
+    features : [x, y, z]
     hyper    :
         loss              : log_loss
         n_estimators      : 100
@@ -433,8 +445,25 @@ training :
         learning_rate     : 0.1
         min_samples_split : 2
 saving:
+    # The actual model names are model_001.pkl, model_002.pkl, etc, one for each fold
     path : 'tests/ml/train_mva/model.pkl'
 plotting:
+    roc :
+        min : [0.0, 0.0] # Optional, controls where the ROC curve starts and ends
+        max : [1.2, 1.2] # By default it goes from 0 to 1 in both axes
+        # The section below is optional and will annotate the ROC curve with
+        # values for the score at different signal efficiencies
+        annotate:
+            sig_eff : [0.5, 0.6, 0.7, 0.8, 0.9] # Values of signal efficiency at which to show the scores
+            form    : '{:.2f}' # Use two decimals for scores
+            color   : 'green'  # Color for text and marker
+            xoff    : -15      # Offsets in X and Y
+            yoff    : -15
+            size    : 10       # Size of text
+    correlation: # Adds correlation matrix for training datasets
+        title      : 'Correlation matrix'
+        size       : [10, 10]
+        mask_value : 0 # Where correlation is zero, the bin will appear white
     val_dir : 'tests/ml/train_mva'
     features:
         saving:
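The `dataset` and `training` sections above drive the trainer added in `src/dmu/ml/train_mva.py`. A minimal sketch of how such a config might be loaded and used follows; the `TrainMva` class name and the `sig`/`bkg`/`cfg` keywords are assumptions inferred from the module name and the `obj.run()` context line, not a verbatim API.

```python
# Hypothetical driver for the YAML config above; names marked as assumptions.
import yaml
from ROOT import RDataFrame

from dmu.ml.train_mva import TrainMva  # module added in 0.2.2; class name assumed

with open('train_mva.yaml', encoding='utf-8') as ifile:
    cfg = yaml.safe_load(ifile)

rdf_sig = RDataFrame('tree', 'signal.root')      # signal sample
rdf_bkg = RDataFrame('tree', 'background.root')  # background sample

obj = TrainMva(sig=rdf_sig, bkg=rdf_bkg, cfg=cfg)  # keyword names assumed
obj.run()
```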
@@ -468,7 +497,9 @@ When training on real data, several things might go wrong and the code will try
 will end up in different folds. The tool checks whether a model is evaluated for an entry that was used for training and raises an exception. Thus, repeated
 entries will be removed before training.
 
-- **NaNs**: Entries with NaNs will break the training with the scikit GradientBoostClassifier base class. Thus, we
+- **NaNs**: Entries with NaNs will break the training with the scikit-learn `GradientBoostingClassifier` base class. Thus, we:
+    - Can use the `nan` section shown above to replace `NaN` values with something else
+    - For whatever remains, we remove the entries from the training.
 
 ## Application
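The two-step policy just described (replace NaNs in configured columns, drop entries that still contain NaNs) amounts to the following pandas sketch; it illustrates the behaviour only and is not the package's code.

```python
# Behavioural sketch of the training-time NaN policy described above.
import numpy
import pandas as pnd

df = pnd.DataFrame({'x': [1.0, numpy.nan], 'y': [numpy.nan, 2.0], 'z': [3.0, numpy.nan]})

d_nan = {'x': 0, 'y': 0}            # the dataset.nan section of the config
for var, val in d_nan.items():
    df[var] = df[var].fillna(val)   # replace NaNs only in configured columns

df = df.dropna()                    # whatever still has NaNs is dropped
print(df)                           # only the first row survives (z is NaN in the second)
```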
@@ -491,17 +522,55 @@ The picking process happens through the comparison of hashes between the samples
 The hashes of the training samples are stored in the pickled model itself; the model is therefore a reimplementation of
 `GradientBoostingClassifier`, here called `CVClassifier`.
 
-If a sample
+If a sample exists that was used in the training of _every_ model, no model can be chosen for the prediction and a
 `CVSameData` exception will be raised.
 
+During training, the configuration will be stored in the model. Therefore, variable definitions can be picked up for evaluation
+from that configuration and the user does not need to define extra columns.
+
 ### Caveats
 
 When evaluating the model with real data, problems might occur; we deal with them as follows:
 
 - **Repeated entries**: When there are repeated features in the dataset to be evaluated, we assign the same probabilities; no filtering is used.
-- **NaNs**: Entries with NaNs will break the evaluation. These entries will be
+- **NaNs**: Entries with NaNs will break the evaluation. These entries will be:
+    - Replaced by other values before evaluation IF a replacement was specified during training. The training configuration is stored in the model
+    and can be accessed through:
+    ```python
+    model.cfg
+    ```
+    - For whatever entries are still NaN, they will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
 saved as -1. I.e. entries with NaNs will have probabilities of -1.
 
+# Pandas dataframes
+
+## Utilities
+
+These are thin layers of code that take pandas dataframes and carry out specific tasks.
+
+### Dataframe to latex
+
+One can save a dataframe to latex with:
+
+```python
+import pandas as pnd
+import dmu.pdataframe.utilities as put
+
+d_data = {}
+d_data['a'] = [1,2,3]
+d_data['b'] = [4,5,6]
+df = pnd.DataFrame(d_data)
+
+d_format = {
+    'a' : '{:.0f}',
+    'b' : '{:.3f}'}
+
+put.df_to_tex(df,
+              './table.tex',
+              d_format = d_format,
+              caption  = 'some caption')
+```
+
 # Rdataframes
 
 These are utility functions meant to be used with ROOT dataframes.
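The hash comparison described in this section can be pictured with pandas' row hashing. The sketch below is only an illustration of the idea; the package stores the real hashes inside the pickled `CVClassifier` objects.

```python
# Toy illustration of hash-based fold picking; not the package's implementation.
import pandas as pnd
from pandas.util import hash_pandas_object

df_train = pnd.DataFrame({'x': [1, 2], 'y': [3, 4]})   # entries one model was trained on
df_eval  = pnd.DataFrame({'x': [2, 5], 'y': [4, 6]})   # entries to be evaluated

s_train = set(hash_pandas_object(df_train, index=False))

for hsh in hash_pandas_object(df_eval, index=False):
    if hsh in s_train:
        print(f"{hsh}: seen in training, evaluate with a different fold's model")
    else:
        print(f"{hsh}: unseen, safe to evaluate with this model")
```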
@@ -653,6 +723,43 @@ axes:
     label : 'y'
 ```
 
+# Other plots
+
+## Matrices
+
+This can be done with `MatrixPlotter`, whose usage is illustrated below:
+
+```python
+import numpy
+import matplotlib.pyplot as plt
+
+from dmu.plotting.matrix import MatrixPlotter
+
+cfg = {
+    'labels'     : ['x', 'y', 'z'], # Used to label the matrix axes
+    'title'      : 'Some title',    # Optional, title of plot
+    'label_angle': 45,              # Labels will be rotated by 45 degrees
+    'upper'      : True,            # Useful in case this is a symmetric matrix
+    'zrange'     : [0, 10],         # Controls the z axis range
+    'size'       : [7, 7],          # Plot size
+    'format'     : '{:.3f}',        # Optional, if used will add numerical values to the contents, otherwise a color bar is used
+    'fontsize'   : 12,              # Font size associated to `format`
+    'mask_value' : 0,               # These values will appear white in the plot
+}
+
+mat = [
+    [1, 2, 3],
+    [2, 0, 4],
+    [3, 4, numpy.nan]
+]
+
+mat = numpy.array(mat)
+
+obj = MatrixPlotter(mat=mat, cfg=cfg)
+obj.plot()
+plt.show()
+```
+
 # Manipulating ROOT files
 
 ## Getting trees from file
{data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/README.md
RENAMED
The README.md hunks are the same five shown above for PKG-INFO (the README is embedded in the package metadata as the long description); only the line offsets differ: @@ -403,9 +403,21 @@, @@ -413,8 +425,25 @@, @@ -448,7 +477,9 @@, @@ -471,17 +502,56 @@ and @@ -633,6 +703,43 @@.
{data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/data_manipulation_utilities.egg-info/PKG-INFO
RENAMED
This file is a setuptools copy of PKG-INFO, and its diff is identical to the PKG-INFO diff shown at the top, hunk for hunk.
{data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/data_manipulation_utilities.egg-info/SOURCES.txt
RENAMED
@@ -13,9 +13,12 @@ src/dmu/ml/cv_classifier.py
 src/dmu/ml/cv_predict.py
 src/dmu/ml/train_mva.py
 src/dmu/ml/utilities.py
+src/dmu/pdataframe/utilities.py
+src/dmu/plotting/matrix.py
 src/dmu/plotting/plotter.py
 src/dmu/plotting/plotter_1d.py
 src/dmu/plotting/plotter_2d.py
+src/dmu/plotting/utilities.py
 src/dmu/rdataframe/atr_mgr.py
 src/dmu/rdataframe/utilities.py
 src/dmu/rfile/rfprinter.py
{data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/ml/cv_classifier.py
RENAMED
@@ -2,6 +2,7 @@
 Module holding cv_classifier class
 '''
 
+from typing import Union
 from sklearn.ensemble import GradientBoostingClassifier
 
 from dmu.logging.log_store import LogStore
@@ -22,7 +23,7 @@ class CVClassifier(GradientBoostingClassifier):
     '''
     # pylint: disable = too-many-ancestors, abstract-method
     # ----------------------------------
-    def __init__(self, cfg : dict
+    def __init__(self, cfg : Union[dict,None] = None):
         '''
         cfg (dict) : Dictionary with configuration, especially the hyperparameters set in the `hyper` field
         '''
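A plausible reading of this signature change (an assumption; the diff itself gives no rationale): scikit-learn's `get_params`/`clone` protocol re-creates estimators from their `__init__` arguments, and its convention is that every argument has a default and is stored verbatim as an attribute. A toy subclass showing the pattern:

```python
# Sketch of the scikit-learn estimator convention the new default fits into.
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingClassifier

class Toy(GradientBoostingClassifier):
    def __init__(self, cfg=None):   # default keeps Toy() constructible
        self.cfg = cfg              # stored verbatim, so get_params() can read it back
        super().__init__()

clf = Toy(cfg={'some': 'setting'})
cpy = clone(clf)                    # rebuilds Toy from get_params(), i.e. Toy(cfg=...)
print(cpy.cfg)                      # {'some': 'setting'}
```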
{data_manipulation_utilities-0.2.0 → data_manipulation_utilities-0.2.2}/src/dmu/ml/cv_predict.py
RENAMED
@@ -32,11 +32,56 @@ class CVPredict:
         if rdf is None:
             raise ValueError('No ROOT dataframe passed')
 
-        self._l_model
-        self._rdf
+        self._l_model   = models
+        self._rdf       = rdf
+        self._d_nan_rep : dict[str,str]
 
         self._arr_patch : numpy.ndarray
     # --------------------------------------------
+    def _initialize(self):
+        self._rdf       = self._define_columns(self._rdf)
+        self._d_nan_rep = self._get_nan_replacements()
+    # --------------------------------------------
+    def _define_columns(self, rdf : RDataFrame) -> RDataFrame:
+        cfg = self._l_model[0].cfg
+
+        if 'define' not in cfg['dataset']:
+            log.debug('No define section found in config, will not define extra columns')
+            return self._rdf
+
+        d_def = cfg['dataset']['define']
+        log.debug(60 * '-')
+        log.info('Defining columns in RDF before evaluating classifier')
+        log.debug(60 * '-')
+        for name, expr in d_def.items():
+            log.debug(f'{name:<20}{"<---":20}{expr:<100}')
+            rdf = rdf.Define(name, expr)
+
+        return rdf
+    # --------------------------------------------
+    def _get_nan_replacements(self) -> dict[str,str]:
+        cfg = self._l_model[0].cfg
+
+        if 'nan' not in cfg['dataset']:
+            log.debug('No nan section found in config, will not replace NaNs')
+            return {}
+
+        return cfg['dataset']['nan']
+    # --------------------------------------------
+    def _replace_nans(self, df : pnd.DataFrame) -> pnd.DataFrame:
+        if len(self._d_nan_rep) == 0:
+            log.debug('Not doing any NaN replacement')
+            return df
+
+        log.debug(60 * '-')
+        log.info('Doing NaN replacements')
+        log.debug(60 * '-')
+        for var, val in self._d_nan_rep.items():
+            log.debug(f'{var:<20}{"--->":20}{val:<20.3f}')
+            df[var] = df[var].fillna(val)
+
+        return df
+    # --------------------------------------------
     def _get_df(self):
         '''
         Will make ROOT rdf into dataframe and return it
@@ -45,6 +90,7 @@ class CVPredict:
         l_ft  = model.features
         d_data= self._rdf.AsNumpy(l_ft)
         df_ft = pnd.DataFrame(d_data)
+        df_ft = self._replace_nans(df_ft)
         df_ft = ut.patch_and_tag(df_ft)
 
         if 'patched_indices' in df_ft.attrs:
@@ -136,6 +182,8 @@ class CVPredict:
         '''
         Will return array of prediction probabilities for the signal category
         '''
+        self._initialize()
+
         df_ft = self._get_df()
         model = self._l_model[0]
 
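Putting the new hooks together, evaluating a set of per-fold models might look like the sketch below. The `models`/`rdf` keywords follow the constructor assignments visible in the first hunk; loading with `joblib` and the `predict` method name are assumptions, since the diff shows only the method body.

```python
# Hypothetical end-to-end use of CVPredict with the 0.2.2 NaN/define handling.
import glob
import joblib
from ROOT import RDataFrame

from dmu.ml.cv_predict import CVPredict

# One pickled model per fold, model_001.pkl, model_002.pkl, ... (see saving.path above)
l_model = [joblib.load(path) for path in sorted(glob.glob('tests/ml/train_mva/model_*.pkl'))]
rdf     = RDataFrame('tree', 'data.root')

cvp     = CVPredict(models=l_model, rdf=rdf)
arr_prb = cvp.predict()  # entries whose NaNs were not replaced come back with probability -1
```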