data-manipulation-utilities 0.2.4__tar.gz → 0.2.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61) hide show
  1. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/PKG-INFO +43 -15
  2. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/README.md +42 -14
  3. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/pyproject.toml +1 -1
  4. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/data_manipulation_utilities.egg-info/PKG-INFO +43 -15
  5. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/data_manipulation_utilities.egg-info/SOURCES.txt +2 -0
  6. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/ml/cv_classifier.py +16 -2
  7. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/ml/cv_predict.py +5 -5
  8. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/ml/train_mva.py +18 -4
  9. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/ml/utilities.py +11 -5
  10. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/plotting/plotter.py +6 -2
  11. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/plotting/plotter_1d.py +22 -4
  12. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/plotting/plotter_2d.py +10 -9
  13. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/stats/model_factory.py +13 -7
  14. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/testing/utilities.py +36 -27
  15. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/ml/tests/train_mva.yaml +2 -5
  16. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/plotting/tests/2d.yaml +8 -4
  17. data_manipulation_utilities-0.2.5/src/dmu_data/plotting/tests/legend.yaml +12 -0
  18. data_manipulation_utilities-0.2.5/src/dmu_data/plotting/tests/stats.yaml +9 -0
  19. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/setup.cfg +0 -0
  20. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/data_manipulation_utilities.egg-info/dependency_links.txt +0 -0
  21. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/data_manipulation_utilities.egg-info/entry_points.txt +0 -0
  22. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/data_manipulation_utilities.egg-info/requires.txt +0 -0
  23. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/data_manipulation_utilities.egg-info/top_level.txt +0 -0
  24. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/arrays/utilities.py +0 -0
  25. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/generic/utilities.py +0 -0
  26. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/generic/version_management.py +0 -0
  27. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/logging/log_store.py +0 -0
  28. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/pdataframe/utilities.py +0 -0
  29. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/plotting/matrix.py +0 -0
  30. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/plotting/utilities.py +0 -0
  31. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/rdataframe/atr_mgr.py +0 -0
  32. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/rdataframe/utilities.py +0 -0
  33. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/rfile/rfprinter.py +0 -0
  34. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/rfile/utilities.py +0 -0
  35. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/stats/fitter.py +0 -0
  36. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/stats/function.py +0 -0
  37. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/stats/gof_calculator.py +0 -0
  38. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/stats/minimizers.py +0 -0
  39. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/stats/utilities.py +0 -0
  40. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/stats/zfit_plotter.py +0 -0
  41. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu/text/transformer.py +0 -0
  42. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/__init__.py +0 -0
  43. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/plotting/tests/fig_size.yaml +0 -0
  44. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/plotting/tests/high_stat.yaml +0 -0
  45. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/plotting/tests/name.yaml +0 -0
  46. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/plotting/tests/no_bounds.yaml +0 -0
  47. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/plotting/tests/normalized.yaml +0 -0
  48. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/plotting/tests/simple.yaml +0 -0
  49. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/plotting/tests/title.yaml +0 -0
  50. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/plotting/tests/weights.yaml +0 -0
  51. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/text/transform.toml +0 -0
  52. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/text/transform.txt +0 -0
  53. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/text/transform_set.toml +0 -0
  54. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/text/transform_set.txt +0 -0
  55. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_data/text/transform_trf.txt +0 -0
  56. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_scripts/git/publish +0 -0
  57. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_scripts/physics/check_truth.py +0 -0
  58. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_scripts/rfile/compare_root_files.py +0 -0
  59. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_scripts/rfile/print_trees.py +0 -0
  60. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_scripts/ssh/coned.py +0 -0
  61. {data_manipulation_utilities-0.2.4 → data_manipulation_utilities-0.2.5}/src/dmu_scripts/text/transform_text.py +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.2
2
2
  Name: data_manipulation_utilities
3
- Version: 0.2.4
3
+ Version: 0.2.5
4
4
  Description-Content-Type: text/markdown
5
5
  Requires-Dist: logzero
6
6
  Requires-Dist: PyYAML
@@ -26,7 +26,7 @@ These are tools that can be used for different data analysis tasks.
26
26
 
27
27
  ## Pushing
28
28
 
29
- From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
29
+ From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
30
30
  using a `pyproject.toml` file, run:
31
31
 
32
32
  ```bash
@@ -36,10 +36,10 @@ publish
36
36
  such that:
37
37
 
38
38
  1. The `pyproject.toml` file is checked and the version of the project is extracted.
39
- 1. If a tag named as the version exists move to the steps below.
39
+ 1. If a tag named as the version exists move to the steps below.
40
40
  1. If it does not, make a new tag with the name as the version
41
41
 
42
- Then, for each remote it pushes the tags and the commits.
42
+ Then, for each remote it pushes the tags and the commits.
43
43
 
44
44
  *Why?*
45
45
 
@@ -137,7 +137,17 @@ pdf = mod.get_pdf()
137
137
  ```
138
138
 
139
139
  where the model is a sum of three `CrystallBall` PDFs, one with a right tail and two with a left tail.
140
- The `mu` and `sg` parameters are shared.
140
+ The `mu` and `sg` parameters are shared. The elementary components that can be plugged are:
141
+
142
+ ```
143
+ exp: Exponential
144
+ pol1: Polynomial of degree 1
145
+ pol2: Polynomial of degree 2
146
+ cbr : CrystallBall with right tail
147
+ cbl : CrystallBall with left tail
148
+ gauss : Gaussian
149
+ dscb : Double sided CrystallBall
150
+ ```
141
151
 
142
152
  ### Printing PDFs
143
153
 
@@ -299,7 +309,7 @@ this will:
299
309
  - Try fitting at most 10 times
300
310
  - After each fit, calculate the goodness of fit (in this case the p-value)
301
311
  - Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
302
- - If the fit has not succeeded because of convergence, validity or goodness of fit issues,
312
+ - If the fit has not succeeded because of convergence, validity or goodness of fit issues,
303
313
  randomize the parameters and try again.
304
314
  - If the desired goodness of fit has not been achieved, pick the best result.
305
315
  - Return the `FitResult` object and set the PDF to the final fit result.
@@ -337,11 +347,11 @@ bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
337
347
  nbk = zfit.Parameter('nbk', 1000, 0, 10000)
338
348
  ebkg= bkg.create_extended(nbk, name='expo')
339
349
 
340
- # Add them
350
+ # Add them
341
351
  pdf = zfit.pdf.SumPDF([ebkg, esig])
342
352
  sam = pdf.create_sampler()
343
353
 
344
- # Plot them
354
+ # Plot them
345
355
  obj = ZFitPlotter(data=sam, model=pdf)
346
356
  d_leg = {'gauss': 'New Gauss'}
347
357
  obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
@@ -353,7 +363,7 @@ obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
353
363
  this class supports:
354
364
 
355
365
  - Handling title, legend, plots size.
356
- - Adding pulls.
366
+ - Adding pulls.
357
367
  - Stacking and overlaying of PDFs.
358
368
  - Blinding.
359
369
 
@@ -434,7 +444,7 @@ dataset:
434
444
  nan:
435
445
  x : 0
436
446
  y : 0
437
- z : -999
447
+ z : -999
438
448
  training :
439
449
  nfold : 10
440
450
  features : [x, y, z]
@@ -497,7 +507,7 @@ When training on real data, several things might go wrong and the code will try
497
507
  will end up in different folds. The tool checks for wether a model is evaluated for an entry that was used for training and raise an exception. Thus, repeated
498
508
  entries will be removed before training.
499
509
 
500
- - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
510
+ - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
501
511
  - Can use the `nan` section shown above to replace `NaN` values with something else
502
512
  - For whatever remains we remove the entries from the training.
503
513
 
@@ -674,6 +684,9 @@ ptr.run()
674
684
  where the config dictionary `cfg_dat` in YAML would look like:
675
685
 
676
686
  ```yaml
687
+ general:
688
+ # This will set the figure size
689
+ size : [20, 10]
677
690
  selection:
678
691
  #Will do at most 50K random entries. Will only happen if the dataset has more than 50K entries
679
692
  max_ran_entries : 50000
@@ -703,6 +716,16 @@ plots:
703
716
  yscale : 'linear'
704
717
  labels : ['x + y', 'Entries']
705
718
  normalized : true #This should normalize to the area
719
+ # Some vertical dashed lines are drawn by default
720
+ # If you see them, you can turn them off with this
721
+ style:
722
+ skip_lines : true
723
+ # This can pass arguments to legend making function `plt.legend()` in matplotlib
724
+ legend:
725
+ # The line below would place the legend outside the figure to avoid ovelaps with the histogram
726
+ bbox_to_anchor : [1.2, 1]
727
+ stats:
728
+ nentries : '{:.2e}' # This will add number of entries in legend box
706
729
  ```
707
730
 
708
731
  it's up to the user to build this dictionary and load it.
@@ -724,14 +747,19 @@ The config would look like:
724
747
  ```yaml
725
748
  saving:
726
749
  plt_dir : tests/plotting/2d
750
+ selection:
751
+ cuts:
752
+ xlow : x > -1.5
727
753
  general:
728
754
  size : [20, 10]
729
755
  plots_2d:
730
756
  # Column x and y
731
757
  # Name of column where weights are, null for not weights
732
758
  # Name of output plot, e.g. xy_x.png
733
- - [x, y, weights, 'xy_w']
734
- - [x, y, null, 'xy_r']
759
+ # Book signaling to use log scale for z axis
760
+ - [x, y, weights, 'xy_w', false]
761
+ - [x, y, null, 'xy_r', false]
762
+ - [x, y, null, 'xy_l', true]
735
763
  axes:
736
764
  x :
737
765
  binning : [-5.0, 8.0, 40]
@@ -823,7 +851,7 @@ Directory/Treename
823
851
  B_ENDVERTEX_CHI2DOF Double_t
824
852
  ```
825
853
 
826
- ## Comparing ROOT files
854
+ ## Comparing ROOT files
827
855
 
828
856
  Given two ROOT files the command below:
829
857
 
@@ -885,7 +913,7 @@ last_file = get_latest_file(dir_path = file_dir, wc='name_*.txt')
885
913
  # of directories in `dir_path`, e.g.:
886
914
 
887
915
  oversion=get_last_version(dir_path=dir_path, version_only=True) # This will return only the version, e.g. v3.2
888
- oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
916
+ oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
889
917
  ```
890
918
 
891
919
  The function above should work for numeric (e.g. `v1.2`) and non-numeric (e.g. `va`, `vb`) versions.
@@ -6,7 +6,7 @@ These are tools that can be used for different data analysis tasks.
6
6
 
7
7
  ## Pushing
8
8
 
9
- From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
9
+ From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
10
10
  using a `pyproject.toml` file, run:
11
11
 
12
12
  ```bash
@@ -16,10 +16,10 @@ publish
16
16
  such that:
17
17
 
18
18
  1. The `pyproject.toml` file is checked and the version of the project is extracted.
19
- 1. If a tag named as the version exists move to the steps below.
19
+ 1. If a tag named as the version exists move to the steps below.
20
20
  1. If it does not, make a new tag with the name as the version
21
21
 
22
- Then, for each remote it pushes the tags and the commits.
22
+ Then, for each remote it pushes the tags and the commits.
23
23
 
24
24
  *Why?*
25
25
 
@@ -117,7 +117,17 @@ pdf = mod.get_pdf()
117
117
  ```
118
118
 
119
119
  where the model is a sum of three `CrystallBall` PDFs, one with a right tail and two with a left tail.
120
- The `mu` and `sg` parameters are shared.
120
+ The `mu` and `sg` parameters are shared. The elementary components that can be plugged are:
121
+
122
+ ```
123
+ exp: Exponential
124
+ pol1: Polynomial of degree 1
125
+ pol2: Polynomial of degree 2
126
+ cbr : CrystallBall with right tail
127
+ cbl : CrystallBall with left tail
128
+ gauss : Gaussian
129
+ dscb : Double sided CrystallBall
130
+ ```
121
131
 
122
132
  ### Printing PDFs
123
133
 
@@ -279,7 +289,7 @@ this will:
279
289
  - Try fitting at most 10 times
280
290
  - After each fit, calculate the goodness of fit (in this case the p-value)
281
291
  - Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
282
- - If the fit has not succeeded because of convergence, validity or goodness of fit issues,
292
+ - If the fit has not succeeded because of convergence, validity or goodness of fit issues,
283
293
  randomize the parameters and try again.
284
294
  - If the desired goodness of fit has not been achieved, pick the best result.
285
295
  - Return the `FitResult` object and set the PDF to the final fit result.
@@ -317,11 +327,11 @@ bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
317
327
  nbk = zfit.Parameter('nbk', 1000, 0, 10000)
318
328
  ebkg= bkg.create_extended(nbk, name='expo')
319
329
 
320
- # Add them
330
+ # Add them
321
331
  pdf = zfit.pdf.SumPDF([ebkg, esig])
322
332
  sam = pdf.create_sampler()
323
333
 
324
- # Plot them
334
+ # Plot them
325
335
  obj = ZFitPlotter(data=sam, model=pdf)
326
336
  d_leg = {'gauss': 'New Gauss'}
327
337
  obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
@@ -333,7 +343,7 @@ obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
333
343
  this class supports:
334
344
 
335
345
  - Handling title, legend, plots size.
336
- - Adding pulls.
346
+ - Adding pulls.
337
347
  - Stacking and overlaying of PDFs.
338
348
  - Blinding.
339
349
 
@@ -414,7 +424,7 @@ dataset:
414
424
  nan:
415
425
  x : 0
416
426
  y : 0
417
- z : -999
427
+ z : -999
418
428
  training :
419
429
  nfold : 10
420
430
  features : [x, y, z]
@@ -477,7 +487,7 @@ When training on real data, several things might go wrong and the code will try
477
487
  will end up in different folds. The tool checks for wether a model is evaluated for an entry that was used for training and raise an exception. Thus, repeated
478
488
  entries will be removed before training.
479
489
 
480
- - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
490
+ - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
481
491
  - Can use the `nan` section shown above to replace `NaN` values with something else
482
492
  - For whatever remains we remove the entries from the training.
483
493
 
@@ -654,6 +664,9 @@ ptr.run()
654
664
  where the config dictionary `cfg_dat` in YAML would look like:
655
665
 
656
666
  ```yaml
667
+ general:
668
+ # This will set the figure size
669
+ size : [20, 10]
657
670
  selection:
658
671
  #Will do at most 50K random entries. Will only happen if the dataset has more than 50K entries
659
672
  max_ran_entries : 50000
@@ -683,6 +696,16 @@ plots:
683
696
  yscale : 'linear'
684
697
  labels : ['x + y', 'Entries']
685
698
  normalized : true #This should normalize to the area
699
+ # Some vertical dashed lines are drawn by default
700
+ # If you see them, you can turn them off with this
701
+ style:
702
+ skip_lines : true
703
+ # This can pass arguments to legend making function `plt.legend()` in matplotlib
704
+ legend:
705
+ # The line below would place the legend outside the figure to avoid ovelaps with the histogram
706
+ bbox_to_anchor : [1.2, 1]
707
+ stats:
708
+ nentries : '{:.2e}' # This will add number of entries in legend box
686
709
  ```
687
710
 
688
711
  it's up to the user to build this dictionary and load it.
@@ -704,14 +727,19 @@ The config would look like:
704
727
  ```yaml
705
728
  saving:
706
729
  plt_dir : tests/plotting/2d
730
+ selection:
731
+ cuts:
732
+ xlow : x > -1.5
707
733
  general:
708
734
  size : [20, 10]
709
735
  plots_2d:
710
736
  # Column x and y
711
737
  # Name of column where weights are, null for not weights
712
738
  # Name of output plot, e.g. xy_x.png
713
- - [x, y, weights, 'xy_w']
714
- - [x, y, null, 'xy_r']
739
+ # Book signaling to use log scale for z axis
740
+ - [x, y, weights, 'xy_w', false]
741
+ - [x, y, null, 'xy_r', false]
742
+ - [x, y, null, 'xy_l', true]
715
743
  axes:
716
744
  x :
717
745
  binning : [-5.0, 8.0, 40]
@@ -803,7 +831,7 @@ Directory/Treename
803
831
  B_ENDVERTEX_CHI2DOF Double_t
804
832
  ```
805
833
 
806
- ## Comparing ROOT files
834
+ ## Comparing ROOT files
807
835
 
808
836
  Given two ROOT files the command below:
809
837
 
@@ -865,7 +893,7 @@ last_file = get_latest_file(dir_path = file_dir, wc='name_*.txt')
865
893
  # of directories in `dir_path`, e.g.:
866
894
 
867
895
  oversion=get_last_version(dir_path=dir_path, version_only=True) # This will return only the version, e.g. v3.2
868
- oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
896
+ oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
869
897
  ```
870
898
 
871
899
  The function above should work for numeric (e.g. `v1.2`) and non-numeric (e.g. `va`, `vb`) versions.
@@ -1,6 +1,6 @@
1
1
  [project]
2
2
  name = 'data_manipulation_utilities'
3
- version = '0.2.4'
3
+ version = '0.2.5'
4
4
  readme = 'README.md'
5
5
  dependencies= [
6
6
  'logzero',
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.2
2
2
  Name: data_manipulation_utilities
3
- Version: 0.2.4
3
+ Version: 0.2.5
4
4
  Description-Content-Type: text/markdown
5
5
  Requires-Dist: logzero
6
6
  Requires-Dist: PyYAML
@@ -26,7 +26,7 @@ These are tools that can be used for different data analysis tasks.
26
26
 
27
27
  ## Pushing
28
28
 
29
- From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
29
+ From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
30
30
  using a `pyproject.toml` file, run:
31
31
 
32
32
  ```bash
@@ -36,10 +36,10 @@ publish
36
36
  such that:
37
37
 
38
38
  1. The `pyproject.toml` file is checked and the version of the project is extracted.
39
- 1. If a tag named as the version exists move to the steps below.
39
+ 1. If a tag named as the version exists move to the steps below.
40
40
  1. If it does not, make a new tag with the name as the version
41
41
 
42
- Then, for each remote it pushes the tags and the commits.
42
+ Then, for each remote it pushes the tags and the commits.
43
43
 
44
44
  *Why?*
45
45
 
@@ -137,7 +137,17 @@ pdf = mod.get_pdf()
137
137
  ```
138
138
 
139
139
  where the model is a sum of three `CrystallBall` PDFs, one with a right tail and two with a left tail.
140
- The `mu` and `sg` parameters are shared.
140
+ The `mu` and `sg` parameters are shared. The elementary components that can be plugged are:
141
+
142
+ ```
143
+ exp: Exponential
144
+ pol1: Polynomial of degree 1
145
+ pol2: Polynomial of degree 2
146
+ cbr : CrystallBall with right tail
147
+ cbl : CrystallBall with left tail
148
+ gauss : Gaussian
149
+ dscb : Double sided CrystallBall
150
+ ```
141
151
 
142
152
  ### Printing PDFs
143
153
 
@@ -299,7 +309,7 @@ this will:
299
309
  - Try fitting at most 10 times
300
310
  - After each fit, calculate the goodness of fit (in this case the p-value)
301
311
  - Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
302
- - If the fit has not succeeded because of convergence, validity or goodness of fit issues,
312
+ - If the fit has not succeeded because of convergence, validity or goodness of fit issues,
303
313
  randomize the parameters and try again.
304
314
  - If the desired goodness of fit has not been achieved, pick the best result.
305
315
  - Return the `FitResult` object and set the PDF to the final fit result.
@@ -337,11 +347,11 @@ bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
337
347
  nbk = zfit.Parameter('nbk', 1000, 0, 10000)
338
348
  ebkg= bkg.create_extended(nbk, name='expo')
339
349
 
340
- # Add them
350
+ # Add them
341
351
  pdf = zfit.pdf.SumPDF([ebkg, esig])
342
352
  sam = pdf.create_sampler()
343
353
 
344
- # Plot them
354
+ # Plot them
345
355
  obj = ZFitPlotter(data=sam, model=pdf)
346
356
  d_leg = {'gauss': 'New Gauss'}
347
357
  obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
@@ -353,7 +363,7 @@ obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
353
363
  this class supports:
354
364
 
355
365
  - Handling title, legend, plots size.
356
- - Adding pulls.
366
+ - Adding pulls.
357
367
  - Stacking and overlaying of PDFs.
358
368
  - Blinding.
359
369
 
@@ -434,7 +444,7 @@ dataset:
434
444
  nan:
435
445
  x : 0
436
446
  y : 0
437
- z : -999
447
+ z : -999
438
448
  training :
439
449
  nfold : 10
440
450
  features : [x, y, z]
@@ -497,7 +507,7 @@ When training on real data, several things might go wrong and the code will try
497
507
  will end up in different folds. The tool checks for wether a model is evaluated for an entry that was used for training and raise an exception. Thus, repeated
498
508
  entries will be removed before training.
499
509
 
500
- - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
510
+ - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
501
511
  - Can use the `nan` section shown above to replace `NaN` values with something else
502
512
  - For whatever remains we remove the entries from the training.
503
513
 
@@ -674,6 +684,9 @@ ptr.run()
674
684
  where the config dictionary `cfg_dat` in YAML would look like:
675
685
 
676
686
  ```yaml
687
+ general:
688
+ # This will set the figure size
689
+ size : [20, 10]
677
690
  selection:
678
691
  #Will do at most 50K random entries. Will only happen if the dataset has more than 50K entries
679
692
  max_ran_entries : 50000
@@ -703,6 +716,16 @@ plots:
703
716
  yscale : 'linear'
704
717
  labels : ['x + y', 'Entries']
705
718
  normalized : true #This should normalize to the area
719
+ # Some vertical dashed lines are drawn by default
720
+ # If you see them, you can turn them off with this
721
+ style:
722
+ skip_lines : true
723
+ # This can pass arguments to legend making function `plt.legend()` in matplotlib
724
+ legend:
725
+ # The line below would place the legend outside the figure to avoid ovelaps with the histogram
726
+ bbox_to_anchor : [1.2, 1]
727
+ stats:
728
+ nentries : '{:.2e}' # This will add number of entries in legend box
706
729
  ```
707
730
 
708
731
  it's up to the user to build this dictionary and load it.
@@ -724,14 +747,19 @@ The config would look like:
724
747
  ```yaml
725
748
  saving:
726
749
  plt_dir : tests/plotting/2d
750
+ selection:
751
+ cuts:
752
+ xlow : x > -1.5
727
753
  general:
728
754
  size : [20, 10]
729
755
  plots_2d:
730
756
  # Column x and y
731
757
  # Name of column where weights are, null for not weights
732
758
  # Name of output plot, e.g. xy_x.png
733
- - [x, y, weights, 'xy_w']
734
- - [x, y, null, 'xy_r']
759
+ # Book signaling to use log scale for z axis
760
+ - [x, y, weights, 'xy_w', false]
761
+ - [x, y, null, 'xy_r', false]
762
+ - [x, y, null, 'xy_l', true]
735
763
  axes:
736
764
  x :
737
765
  binning : [-5.0, 8.0, 40]
@@ -823,7 +851,7 @@ Directory/Treename
823
851
  B_ENDVERTEX_CHI2DOF Double_t
824
852
  ```
825
853
 
826
- ## Comparing ROOT files
854
+ ## Comparing ROOT files
827
855
 
828
856
  Given two ROOT files the command below:
829
857
 
@@ -885,7 +913,7 @@ last_file = get_latest_file(dir_path = file_dir, wc='name_*.txt')
885
913
  # of directories in `dir_path`, e.g.:
886
914
 
887
915
  oversion=get_last_version(dir_path=dir_path, version_only=True) # This will return only the version, e.g. v3.2
888
- oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
916
+ oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
889
917
  ```
890
918
 
891
919
  The function above should work for numeric (e.g. `v1.2`) and non-numeric (e.g. `va`, `vb`) versions.
@@ -38,10 +38,12 @@ src/dmu_data/ml/tests/train_mva.yaml
38
38
  src/dmu_data/plotting/tests/2d.yaml
39
39
  src/dmu_data/plotting/tests/fig_size.yaml
40
40
  src/dmu_data/plotting/tests/high_stat.yaml
41
+ src/dmu_data/plotting/tests/legend.yaml
41
42
  src/dmu_data/plotting/tests/name.yaml
42
43
  src/dmu_data/plotting/tests/no_bounds.yaml
43
44
  src/dmu_data/plotting/tests/normalized.yaml
44
45
  src/dmu_data/plotting/tests/simple.yaml
46
+ src/dmu_data/plotting/tests/stats.yaml
45
47
  src/dmu_data/plotting/tests/title.yaml
46
48
  src/dmu_data/plotting/tests/weights.yaml
47
49
  src/dmu_data/text/transform.toml
@@ -1,15 +1,15 @@
1
1
  '''
2
2
  Module holding cv_classifier class
3
3
  '''
4
-
4
+ import os
5
5
  from typing import Union
6
6
  from sklearn.ensemble import GradientBoostingClassifier
7
7
 
8
+ import yaml
8
9
  from dmu.logging.log_store import LogStore
9
10
  import dmu.ml.utilities as ut
10
11
 
11
12
  log = LogStore.add_logger('dmu:ml:CVClassifier')
12
-
13
13
  # ---------------------------------------
14
14
  class CVSameData(Exception):
15
15
  '''
@@ -61,6 +61,20 @@ class CVClassifier(GradientBoostingClassifier):
61
61
 
62
62
  return self._cfg
63
63
  # ----------------------------------
64
+ def save_cfg(self, path : str):
65
+ '''
66
+ Will save configuration used to train this classifier to YAML
67
+
68
+ path: Path to YAML file
69
+ '''
70
+ dir_name = os.path.dirname(path)
71
+ os.makedirs(dir_name, exist_ok=True)
72
+
73
+ with open(path, 'w', encoding='utf-8') as ofile:
74
+ yaml.safe_dump(self._cfg, ofile, indent=2)
75
+
76
+ log.info(f'Saved config to: {path}')
77
+ # ----------------------------------
64
78
  def __str__(self):
65
79
  nhash = len(self._s_hash)
66
80
 
@@ -73,11 +73,11 @@ class CVPredict:
73
73
  log.debug('Not doing any NaN replacement')
74
74
  return df
75
75
 
76
- log.debug(60 * '-')
76
+ log.info(60 * '-')
77
77
  log.info('Doing NaN replacements')
78
- log.debug(60 * '-')
78
+ log.info(60 * '-')
79
79
  for var, val in self._d_nan_rep.items():
80
- log.debug(f'{var:<20}{"--->":20}{val:<20.3f}')
80
+ log.info(f'{var:<20}{"--->":20}{val:<20.3f}')
81
81
  df[var] = df[var].fillna(val)
82
82
 
83
83
  return df
@@ -155,7 +155,7 @@ class CVPredict:
155
155
  ndif = len(s_dif_hash)
156
156
  ndat = len(s_dat_hash)
157
157
  nmod = len(s_mod_hash)
158
- log.debug(f'{ndif:<20}{"=":10}{ndat:<20}{"-":10}{nmod:<20}')
158
+ log.debug(f'{ndif:<10}{"=":5}{ndat:<10}{"-":5}{nmod:<10}')
159
159
 
160
160
  df_ft_group= df_ft.loc[df_ft.index.isin(s_dif_hash)]
161
161
 
@@ -173,7 +173,7 @@ class CVPredict:
173
173
  return arr_prb
174
174
 
175
175
  nentries = len(self._arr_patch)
176
- log.warning(f'Patching {nentries} probabilities')
176
+ log.warning(f'Patching {nentries} probabilities with -1')
177
177
  arr_prb[self._arr_patch] = -1
178
178
 
179
179
  return arr_prb
@@ -69,14 +69,20 @@ class TrainMva:
69
69
  return df, arr_lab
70
70
  # ---------------------------------------------
71
71
  def _pre_process_nans(self, df : pnd.DataFrame) -> pnd.DataFrame:
72
+ if 'dataset' not in self._cfg:
73
+ return df
74
+
72
75
  if 'nan' not in self._cfg['dataset']:
73
76
  log.debug('dataset/nan section not found, not pre-processing NaNs')
74
77
  return df
75
78
 
76
79
  d_name_val = self._cfg['dataset']['nan']
77
- for name, val in d_name_val.items():
78
- log.debug(f'{val:<20}{"<---":<10}{name:<100}')
79
- df[name] = df[name].fillna(val)
80
+ log.info(60 * '-')
81
+ log.info('Doing NaN replacements')
82
+ log.info(60 * '-')
83
+ for var, val in d_name_val.items():
84
+ log.info(f'{var:<20}{"--->":20}{val:<20.3f}')
85
+ df[var] = df[var].fillna(val)
80
86
 
81
87
  return df
82
88
  # ---------------------------------------------
@@ -406,6 +412,9 @@ class TrainMva:
406
412
  self._save_hyperparameters_to_tex()
407
413
  # ---------------------------------------------
408
414
  def _save_nan_conversion(self) -> None:
415
+ if 'dataset' not in self._cfg:
416
+ return
417
+
409
418
  if 'nan' not in self._cfg['dataset']:
410
419
  log.debug('NaN section not found, not saving it')
411
420
  return
@@ -434,13 +443,18 @@ class TrainMva:
434
443
  os.makedirs(val_dir, exist_ok=True)
435
444
  put.df_to_tex(df, f'{val_dir}/hyperparameters.tex')
436
445
  # ---------------------------------------------
437
- def run(self):
446
+ def run(self, skip_fit : bool = False) -> None:
438
447
  '''
439
448
  Will do the training
449
+
450
+ skip_fit: By default false, if True, it will only do the plots of features and save tables
440
451
  '''
441
452
  self._save_settings_to_tex()
442
453
  self._plot_features()
443
454
 
455
+ if skip_fit:
456
+ return
457
+
444
458
  l_mod = self._get_models()
445
459
  for ifold, mod in enumerate(l_mod):
446
460
  self._save_model(mod, ifold)
@@ -16,7 +16,7 @@ log = LogStore.add_logger('dmu:ml:utilities')
16
16
  # ---------------------------------------------
17
17
  def patch_and_tag(df : pnd.DataFrame, value : float = 0) -> pnd.DataFrame:
18
18
  '''
19
- Takes panda dataframe, replaces NaNs with value introduced, by default 0
19
+ Takes pandas dataframe, replaces NaNs with value introduced, by default 0
20
20
  Returns array of indices where the replacement happened
21
21
  '''
22
22
  l_nan = df.index[df.isna().any(axis=1)].tolist()
@@ -25,7 +25,13 @@ def patch_and_tag(df : pnd.DataFrame, value : float = 0) -> pnd.DataFrame:
25
25
  log.debug('No NaNs found')
26
26
  return df
27
27
 
28
- log.warning(f'Found {nnan} NaNs, patching them with {value}')
28
+ log.warning(f'Found {nnan} NaNs')
29
+
30
+ df_nan_frq = df.isna().sum()
31
+ df_nan_frq = df_nan_frq[df_nan_frq > 0]
32
+ print(df_nan_frq)
33
+
34
 + log.warning(f'Attaching array with {nnan} NaN indexes and removing NaNs from dataframe')
29
35
 
30
36
  df_pa = df.fillna(value)
31
37
 
@@ -57,7 +63,7 @@ def _remove_nans(df : pnd.DataFrame) -> pnd.DataFrame:
57
63
  log.info('Found columns with NaNs')
58
64
  for name in l_na_name:
59
65
  nan_count = df[name].isna().sum()
60
- log.info(f'{nan_count:<10}{name:<100}')
66
+ log.info(f'{nan_count:<10}{name}')
61
67
 
62
68
  ninit = len(df)
63
69
  df = df.dropna()
@@ -75,10 +81,10 @@ def _remove_repeated(df : pnd.DataFrame) -> pnd.DataFrame:
75
81
  nfinl = len(s_hash)
76
82
 
77
83
  if ninit == nfinl:
78
- log.debug('No cleaning needed for dataframe')
84
+ log.debug('No overlap between training and application found')
79
85
  return df
80
86
 
81
- log.warning(f'Repeated entries found, cleaning up: {ninit} -> {nfinl}')
87
+ log.warning(f'Overlap between training and application found, cleaning up: {ninit} -> {nfinl}')
82
88
 
83
89
  df['hash_index'] = l_hash
84
90
  df = df.set_index('hash_index', drop=True)
@@ -107,7 +107,7 @@ class Plotter:
107
107
 
108
108
  d_cut = self._d_cfg['selection']['cuts']
109
109
 
110
- log.info('Applying cuts')
110
+ log.debug('Applying cuts')
111
111
  for name, cut in d_cut.items():
112
112
  log.debug(f'{name:<50}{cut:<150}')
113
113
  rdf = rdf.Filter(cut, name)
@@ -212,7 +212,11 @@ class Plotter:
212
212
 
213
213
  var (str) : Name of variable, needed for plot name
214
214
  '''
215
- plt.legend()
215
+ d_leg = {}
216
+ if 'style' in self._d_cfg and 'legend' in self._d_cfg['style']:
217
+ d_leg = self._d_cfg['style']['legend']
218
+
219
+ plt.legend(**d_leg)
216
220
 
217
221
  plt_dir = self._d_cfg['saving']['plt_dir']
218
222
  os.makedirs(plt_dir, exist_ok=True)
@@ -77,17 +77,33 @@ class Plotter1D(Plotter):
77
77
 
78
78
  l_bc_all = []
79
79
  for name, arr_val in d_data.items():
80
+ label = self._label_from_name(name, arr_val)
80
81
  arr_wgt = d_wgt[name] if d_wgt is not None else numpy.ones_like(arr_val)
81
82
  arr_wgt = self._normalize_weights(arr_wgt, var)
82
- hst = Hist.new.Reg(bins=bins, start=minx, stop=maxx, name='x', label=name).Weight()
83
+ hst = Hist.new.Reg(bins=bins, start=minx, stop=maxx, name='x').Weight()
83
84
  hst.fill(x=arr_val, weight=arr_wgt)
84
- hst.plot(label=name)
85
+ hst.plot(label=label)
85
86
  l_bc_all += hst.values().tolist()
86
87
 
87
88
  max_y = max(l_bc_all)
88
89
 
89
90
  return max_y
90
91
  # --------------------------------------------
92
+ def _label_from_name(self, name : str, arr_val : numpy.ndarray) -> str:
93
+ if 'stats' not in self._d_cfg:
94
+ return name
95
+
96
+ d_stat = self._d_cfg['stats']
97
+ if 'nentries' not in d_stat:
98
+ return name
99
+
100
+ form = d_stat['nentries']
101
+
102
+ nentries = len(arr_val)
103
+ nentries = form.format(nentries)
104
+
105
+ return f'{name}{nentries}'
106
+ # --------------------------------------------
91
107
  def _normalize_weights(self, arr_wgt : numpy.ndarray, var : str) -> numpy.ndarray:
92
108
  cfg_var = self._d_cfg['plots'][var]
93
109
  if 'normalized' not in cfg_var:
@@ -104,7 +120,6 @@ class Plotter1D(Plotter):
104
120
 
105
121
  return arr_wgt
106
122
  # --------------------------------------------
107
-
108
123
  def _style_plot(self, var : str, max_y : float) -> None:
109
124
  d_cfg = self._d_cfg['plots'][var]
110
125
  yscale = d_cfg['yscale' ] if 'yscale' in d_cfg else 'linear'
@@ -124,12 +139,15 @@ class Plotter1D(Plotter):
124
139
  plt.legend()
125
140
  plt.title(title)
126
141
  # --------------------------------------------
127
- def _plot_lines(self, var : str):
142
+ def _plot_lines(self, var : str) -> None:
128
143
  '''
129
144
  Will plot vertical lines for some variables
130
145
 
131
146
  var (str) : name of variable
132
147
  '''
148
+ if 'style' in self._d_cfg and 'skip_lines' in self._d_cfg['style'] and self._d_cfg['style']['skip_lines']:
149
+ return
150
+
133
151
  if var in ['B_const_mass_M', 'B_M']:
134
152
  plt.axvline(x=5280, color='r', label=r'$B^+$' , linestyle=':')
135
153
  elif var == 'Jpsi_M':
@@ -10,6 +10,7 @@ import matplotlib.pyplot as plt
10
10
 
11
11
  from hist import Hist
12
12
  from ROOT import RDataFrame
13
+ from matplotlib.colors import LogNorm
13
14
  from dmu.logging.log_store import LogStore
14
15
  from dmu.plotting.plotter import Plotter
15
16
 
@@ -28,11 +29,8 @@ class Plotter2D(Plotter):
28
29
  cfg (dict): Dictionary with configuration, e.g. binning, ranges, etc
29
30
  '''
30
31
 
31
- if not isinstance(cfg, dict):
32
- raise ValueError('Config dictionary not passed')
33
-
34
- self._d_cfg : dict = cfg
35
- self._rdf : RDataFrame = super()._preprocess_rdf(rdf)
32
+ super().__init__({'single_rdf' : rdf}, cfg)
33
+ self._rdf : RDataFrame = self._d_rdf['single_rdf']
36
34
 
37
35
  self._wgt : numpy.ndarray
38
36
  # --------------------------------------------
@@ -61,7 +59,7 @@ class Plotter2D(Plotter):
61
59
 
62
60
  return arr_wgt
63
61
  # --------------------------------------------
64
- def _plot_vars(self, varx : str, vary : str, wgt_name : str) -> None:
62
+ def _plot_vars(self, varx : str, vary : str, wgt_name : str, use_log : bool) -> None:
65
63
  log.info(f'Plotting {varx} vs {vary} with weights {wgt_name}')
66
64
 
67
65
  ax_x = self._get_axis(varx)
@@ -72,7 +70,10 @@ class Plotter2D(Plotter):
72
70
  hst = Hist(ax_x, ax_y)
73
71
  hst.fill(arr_x, arr_y, weight=arr_w)
74
72
 
75
- mplhep.hist2dplot(hst)
73
+ if use_log:
74
+ mplhep.hist2dplot(hst, norm=LogNorm())
75
+ else:
76
+ mplhep.hist2dplot(hst)
76
77
  # --------------------------------------------
77
78
  def run(self):
78
79
  '''
@@ -80,8 +81,8 @@ class Plotter2D(Plotter):
80
81
  '''
81
82
 
82
83
  fig_size = self._get_fig_size()
83
- for [varx, vary, wgt_name, plot_name] in self._d_cfg['plots_2d']:
84
+ for [varx, vary, wgt_name, plot_name, use_log] in self._d_cfg['plots_2d']:
84
85
  plt.figure(plot_name, figsize=fig_size)
85
- self._plot_vars(varx, vary, wgt_name)
86
+ self._plot_vars(varx, vary, wgt_name, use_log)
86
87
  self._save_plot(plot_name)
87
88
  # --------------------------------------------
@@ -1,7 +1,7 @@
1
1
  '''
2
2
  Module storing ZModel class
3
3
  '''
4
- # pylint: disable=too-many-lines, import-error
4
+ # pylint: disable=too-many-lines, import-error, too-many-positional-arguments, too-many-arguments
5
5
 
6
6
  from typing import Callable, Union
7
7
 
@@ -69,12 +69,18 @@ class ModelFactory:
69
69
 
70
70
  self._d_par : dict[str,zpar] = {}
71
71
  #-----------------------------------------
72
+ def _fltname_from_name(self, name : str) -> str:
73
+ if name in ['mu', 'sg']:
74
+ return f'{name}_flt'
75
+
76
+ return name
77
+ #-----------------------------------------
72
78
  def _get_name(self, name : str, suffix : str) -> str:
73
79
  for can_be_shared in self._l_can_be_shared:
74
80
  if name.startswith(f'{can_be_shared}_') and can_be_shared in self._l_shr:
75
- return can_be_shared
81
+ return self._fltname_from_name(can_be_shared)
76
82
 
77
- return f'{name}{suffix}'
83
+ return self._fltname_from_name(f'{name}{suffix}')
78
84
  #-----------------------------------------
79
85
  def _get_parameter(self,
80
86
  name : str,
@@ -129,8 +135,8 @@ class ModelFactory:
129
135
  def _get_cbl(self, suffix : str = '') -> zpdf:
130
136
  mu = self._get_parameter('mu_cbl', suffix, 5300, 5250, 5350)
131
137
  sg = self._get_parameter('sg_cbl', suffix, 10, 2, 300)
132
- al = self._get_parameter('ac_cbl', suffix, 2, 1., 4.)
133
- nl = self._get_parameter('nc_cbl', suffix, 1, 0.5, 5.0)
138
+ al = self._get_parameter('ac_cbl', suffix, 2, 1., 14.)
139
+ nl = self._get_parameter('nc_cbl', suffix, 1, 0.5, 15.)
134
140
 
135
141
  pdf = zfit.pdf.CrystalBall(mu, sg, al, nl, self._obs)
136
142
 
@@ -151,8 +157,8 @@ class ModelFactory:
151
157
  sg = self._get_parameter('sg_dscb', suffix, 10, 2, 30)
152
158
  ar = self._get_parameter('ar_dscb', suffix, 1, 0, 5)
153
159
  al = self._get_parameter('al_dscb', suffix, 1, 0, 5)
154
- nr = self._get_parameter('nr_dscb', suffix, 2, 1, 5)
155
- nl = self._get_parameter('nl_dscb', suffix, 2, 0, 5)
160
+ nr = self._get_parameter('nr_dscb', suffix, 2, 1, 15)
161
+ nl = self._get_parameter('nl_dscb', suffix, 2, 0, 15)
156
162
 
157
163
  pdf = zfit.pdf.DoubleCB(mu, sg, al, nl, ar, nr, self._obs)
158
164
 
@@ -2,6 +2,7 @@
2
2
  Module containing utility functions needed by unit tests
3
3
  '''
4
4
  import os
5
+ import math
5
6
  from typing import Union
6
7
  from dataclasses import dataclass
7
8
  from importlib.resources import files
@@ -21,56 +22,64 @@ class Data:
21
22
  '''
22
23
  Class storing shared data
23
24
  '''
24
- nentries = 3000
25
25
  # -------------------------------
26
- def _double_data(d_data : dict) -> dict:
27
- df_1 = pnd.DataFrame(d_data)
28
- df_2 = pnd.DataFrame(d_data)
29
-
26
+ def _double_data(df_1 : pnd.DataFrame) -> pnd.DataFrame:
27
+ df_2 = df_1.copy()
30
28
  df = pnd.concat([df_1, df_2], axis=0)
31
29
 
32
- d_data = { name : df[name].to_numpy() for name in df.columns }
33
-
34
- return d_data
30
+ return df
35
31
  # -------------------------------
36
- def _add_nans(d_data : dict) -> dict:
37
- df_good = pnd.DataFrame(d_data)
38
- df_bad = pnd.DataFrame(d_data)
39
- df_bad[:] = numpy.nan
32
+ def _add_nans(df : pnd.DataFrame, columns : list[str]) -> pnd.DataFrame:
33
+ size = len(df) * 0.2
34
+ size = math.floor(size)
35
+
36
+ l_col = df.columns.tolist()
37
+ if columns is None:
38
+ l_col_index = range(len(l_col))
39
+ else:
40
+ l_col_index = [ l_col.index(column) for column in columns ]
40
41
 
41
- df = pnd.concat([df_good, df_bad])
42
- d_data = { name : df[name].to_numpy() for name in df.columns }
42
 + log.debug(f'Replacing randomly with {size} NaNs')
43
+ for _ in range(size):
44
+ irow = numpy.random.randint(0, df.shape[0]) # Random row index
45
+ icol = numpy.random.choice(l_col_index) # Random column index
43
46
 
44
- return d_data
47
+ df.iat[irow, icol] = numpy.nan
48
+
49
+ return df
45
50
  # -------------------------------
46
51
  def get_rdf(kind : Union[str,None] = None,
47
52
  repeated : bool = False,
48
- add_nans : bool = False):
53
+ nentries : int = 3_000,
54
+ add_nans : list[str] = None):
49
55
  '''
50
56
  Return ROOT dataframe with toy data
51
57
  '''
58
+
52
59
  d_data = {}
53
60
  if kind == 'sig':
54
- d_data['w'] = numpy.random.normal(0, 1, size=Data.nentries)
55
- d_data['x'] = numpy.random.normal(0, 1, size=Data.nentries)
56
- d_data['y'] = numpy.random.normal(0, 1, size=Data.nentries)
57
- d_data['z'] = numpy.random.normal(0, 1, size=Data.nentries)
61
+ d_data['w'] = numpy.random.normal(0, 1, size=nentries)
62
+ d_data['x'] = numpy.random.normal(0, 1, size=nentries)
63
+ d_data['y'] = numpy.random.normal(0, 1, size=nentries)
64
+ d_data['z'] = numpy.random.normal(0, 1, size=nentries)
58
65
  elif kind == 'bkg':
59
- d_data['w'] = numpy.random.normal(1, 1, size=Data.nentries)
60
- d_data['x'] = numpy.random.normal(1, 1, size=Data.nentries)
61
- d_data['y'] = numpy.random.normal(1, 1, size=Data.nentries)
62
- d_data['z'] = numpy.random.normal(1, 1, size=Data.nentries)
66
+ d_data['w'] = numpy.random.normal(1, 1, size=nentries)
67
+ d_data['x'] = numpy.random.normal(1, 1, size=nentries)
68
+ d_data['y'] = numpy.random.normal(1, 1, size=nentries)
69
+ d_data['z'] = numpy.random.normal(1, 1, size=nentries)
63
70
  else:
64
71
  log.error(f'Invalid kind: {kind}')
65
72
  raise ValueError
66
73
 
74
+ df = pnd.DataFrame(d_data)
75
+
67
76
  if repeated:
68
- d_data = _double_data(d_data)
77
+ df = _double_data(df)
69
78
 
70
79
  if add_nans:
71
- d_data = _add_nans(d_data)
80
+ df = _add_nans(df, columns=add_nans)
72
81
 
73
- rdf = RDF.FromNumpy(d_data)
82
+ rdf = RDF.FromPandas(df)
74
83
 
75
84
  return rdf
76
85
  # -------------------------------
@@ -1,6 +1,7 @@
1
1
  dataset:
2
2
  nan :
3
- x : 0
3
+ x : 1
4
+ y : 2
4
5
  training :
5
6
  nfold : 3
6
7
  features : [x, y, z]
@@ -33,10 +34,6 @@ plotting:
33
34
  saving:
34
35
  plt_dir : '/tmp/dmu/ml/tests/train_mva/features'
35
36
  plots:
36
- w :
37
- binning : [-4, 4, 100]
38
- yscale : 'linear'
39
- labels : ['w', '']
40
37
  x :
41
38
  binning : [-4, 4, 100]
42
39
  yscale : 'linear'
@@ -1,13 +1,17 @@
1
1
  saving:
2
- plt_dir : tests/plotting/2d_weighted
2
+ plt_dir : /tmp/dmu/tests/plotting/2d_weighted
3
+ selection:
4
+ cuts:
5
+ xlow : x > -1.5
3
6
  definitions:
4
7
  z : x + y
5
8
  general:
6
9
  size : [20, 10]
7
10
  plots_2d:
8
- - [x, y, weights, 'xy_w']
9
- - [x, y, null, 'xy_r']
10
- - [x, z, null, 'xz_r']
11
+ - [x, y, weights, 'xy_wgt', false]
12
+ - [x, y, null, 'xy_raw', false]
13
+ - [x, z, null, 'xz_raw', false]
14
+ - [x, z, null, 'xz_log', true]
11
15
  axes:
12
16
  x :
13
17
  binning : [-3.0, 3.0, 40]
@@ -0,0 +1,12 @@
1
+ saving:
2
+ plt_dir : tests/plotting/legend
3
+ general:
4
+ size : [20, 10]
5
+ plots:
6
+ x :
7
+ binning : [-5.0, 8.0, 40]
8
+ y :
9
+ binning : [-5.0, 8.0, 40]
10
+ style:
11
+ legend:
12
+ bbox_to_anchor : [1.2, 1]
@@ -0,0 +1,9 @@
1
+ saving:
2
+ plt_dir : tests/plotting/stats
3
+ plots:
4
+ x :
5
+ binning : [-5.0, 8.0, 40]
6
+ y :
7
+ binning : [-5.0, 8.0, 40]
8
+ stats:
9
+ nentries : '{:.2e}'