data-manipulation-utilities 0.2.4__py3-none-any.whl → 0.2.6__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.2
  Name: data_manipulation_utilities
- Version: 0.2.4
+ Version: 0.2.6
  Description-Content-Type: text/markdown
  Requires-Dist: logzero
  Requires-Dist: PyYAML
@@ -26,7 +26,7 @@ These are tools that can be used for different data analysis tasks.
 
  ## Pushing
 
- From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
+ From the root directory of a version-controlled project (i.e. a directory with the `.git` subdirectory)
  using a `pyproject.toml` file, run:
 
  ```bash
@@ -36,10 +36,10 @@ publish
  such that:
 
  1. The `pyproject.toml` file is checked and the version of the project is extracted.
- 1. If a tag named as the version exists move to the steps below.
+ 1. If a tag named after the version exists, move on to the steps below.
  1. If it does not, make a new tag with the name as the version
 
- Then, for each remote it pushes the tags and the commits.
+ Then, for each remote, it pushes the tags and the commits.
 
  *Why?*
 
@@ -137,7 +137,17 @@ pdf = mod.get_pdf()
  ```
 
  where the model is a sum of three `CrystalBall` PDFs, one with a right tail and two with a left tail.
- The `mu` and `sg` parameters are shared.
+ The `mu` and `sg` parameters are shared. The elementary components that can be plugged in are:
+
+ ```
+ exp  : Exponential
+ pol1 : Polynomial of degree 1
+ pol2 : Polynomial of degree 2
+ cbr  : CrystalBall with right tail
+ cbl  : CrystalBall with left tail
+ gauss: Gaussian
+ dscb : Double-sided CrystalBall
+ ```
 
  ### Printing PDFs
 
@@ -299,7 +309,7 @@ this will:
  - Try fitting at most 10 times
  - After each fit, calculate the goodness of fit (in this case the p-value)
  - Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
- - If the fit has not succeeded because of convergence, validity or goodness of fit issues,
+ - If the fit has not succeeded because of convergence, validity, or goodness-of-fit issues,
  randomize the parameters and try again.
  - If the desired goodness of fit has not been achieved, pick the best result.
  - Return the `FitResult` object and set the PDF to the final fit result.
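The retry-and-randomize strategy above can be sketched in plain Python; the `fit_once`, `goodness`, and `randomize` callables here are hypothetical stand-ins for the zfit minimization, the p-value computation, and the parameter randomization, not the package's actual API:

```python
def fit_with_retries(fit_once, goodness, randomize, ntries=10, pvalue_threshold=0.05):
    """Retry a fit up to `ntries` times, randomizing between attempts; keep the best result."""
    best = None
    for _ in range(ntries):
        result = fit_once()
        if best is None or goodness(result) > goodness(best):
            best = result            # remember the best result seen so far
        if goodness(result) > pvalue_threshold:
            break                    # goodness of fit is acceptable, stop early
        randomize()                  # perturb the starting parameters before retrying
    return best
```

With a sequence of fake p-values `[0.01, 0.2, 0.9]`, the loop stops at the second attempt, since `0.2` already exceeds the threshold.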
@@ -337,11 +347,11 @@ bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
  nbk = zfit.Parameter('nbk', 1000, 0, 10000)
  ebkg= bkg.create_extended(nbk, name='expo')
 
- # Add them
+ # Add them
  pdf = zfit.pdf.SumPDF([ebkg, esig])
  sam = pdf.create_sampler()
 
- # Plot them
+ # Plot them
  obj = ZFitPlotter(data=sam, model=pdf)
  d_leg = {'gauss': 'New Gauss'}
  obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
@@ -353,7 +363,7 @@ obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
  this class supports:
 
  - Handling title, legend, and plot size.
- - Adding pulls.
+ - Adding pulls.
  - Stacking and overlaying of PDFs.
  - Blinding.
 
@@ -417,7 +427,7 @@ rdf_bkg = _get_rdf(kind='bkg')
  cfg = _get_config()
 
  obj= TrainMva(sig=rdf_sig, bkg=rdf_bkg, cfg=cfg)
- obj.run()
+ obj.run(skip_fit=False) # False by default; if True, only plot the features and skip the training
  ```
 
  where the settings for the training go in a config dictionary, which when written to YAML looks like:
@@ -434,7 +444,7 @@ dataset:
    nan:
      x : 0
      y : 0
-     z : -999
+     z : -999
  training :
    nfold    : 10
    features : [x, y, z]
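The `nfold` setting drives a cross-validated training, one model per fold. A minimal sketch of the idea with scikit-learn, using synthetic data (this is an illustration, not the package's actual `TrainMva` internals):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic signal/background-like sample: two Gaussian blobs with 3 features
rng      = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
labels   = np.array([0] * 50 + [1] * 50)

models = []
skf = StratifiedKFold(n_splits=5)  # 'nfold' in the config
for train_index, _ in skf.split(features, labels):
    # train one classifier per fold, on that fold's training entries only
    model = GradientBoostingClassifier()
    model.fit(features[train_index], labels[train_index])
    models.append(model)
```

Each stored model is then evaluated only on entries it was not trained on, which is the point of the folding.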
@@ -497,7 +507,7 @@ When training on real data, several things might go wrong and the code will try
  will end up in different folds. The tool checks whether a model is evaluated for an entry that was used for training and raises an exception. Thus, repeated
  entries will be removed before training.
 
- - **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
+ - **NaNs**: Entries with NaNs will break the training with the scikit-learn `GradientBoostingClassifier` base class. Thus, we:
    - Can use the `nan` section shown above to replace `NaN` values with something else
    - For whatever remains, we remove the entries from the training.
@@ -539,7 +549,7 @@ When evaluating the model with real data, problems might occur, we deal with the
  ```python
  model.cfg
  ```
- - For whatever entries that are still NaN, they will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
+ - For features that are still NaN, the values will be _patched_ with zeros during evaluation. However, the returned probabilities will be
  saved as -1. I.e. entries with NaNs will have probabilities of -1.
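The patch-then-flag behavior can be sketched with numpy; `patch_and_tag` here mirrors the idea of the package's helper of the same name, but the exact signature is an assumption:

```python
import numpy as np

def patch_and_tag(features: np.ndarray, value: float = 0.0):
    """Replace NaNs with `value`; return the patched array and a mask of affected rows."""
    mask    = np.isnan(features).any(axis=1)          # rows that had at least one NaN
    patched = np.nan_to_num(features, nan=value)      # safe to feed to the model
    return patched, mask

features      = np.array([[1.0, 2.0], [np.nan, 3.0]])
patched, mask = patch_and_tag(features)
probabilities = np.array([0.7, 0.9])                  # hypothetical model output
probabilities[mask] = -1                              # flag entries that had NaNs
```

The second entry is evaluated on the patched zeros, but its probability is overwritten with `-1` so downstream code can recognize it.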
 
  # Pandas dataframes
@@ -674,6 +684,9 @@ ptr.run()
  where the config dictionary `cfg_dat` in YAML would look like:
 
  ```yaml
+ general:
+   # This will set the figure size
+   size : [20, 10]
  selection:
    # Will do at most 50K random entries. Will only happen if the dataset has more than 50K entries
    max_ran_entries : 50000
@@ -703,6 +716,16 @@ plots:
      yscale : 'linear'
      labels : ['x + y', 'Entries']
      normalized : true # This should normalize to the area
+ # Some vertical dashed lines are drawn by default
+ # If you see them, you can turn them off with this
+ style:
+   skip_lines : true
+   # This can pass arguments to the legend-making function `plt.legend()` in matplotlib
+   legend:
+     # The line below would place the legend outside the figure to avoid overlaps with the histogram
+     bbox_to_anchor : [1.2, 1]
+ stats:
+   nentries : '{:.2e}' # This will add the number of entries to the legend box
  ```
 
  It's up to the user to build this dictionary and load it.
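The `stats.nentries` option is a Python format string applied to the entry count and appended to each legend label; a minimal sketch of that label construction (`label_with_entries` is a hypothetical name, mirroring the package's internal `_label_from_name` helper):

```python
def label_with_entries(name: str, nentries: int, fmt: str = '{:.2e}') -> str:
    """Append the formatted entry count to a legend label, as `stats.nentries` does."""
    return f'{name}{fmt.format(nentries)}'
```

For example, a dataset named `data` with 50000 entries would be labeled `data5.00e+04` in the legend box.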
@@ -724,14 +747,19 @@ The config would look like:
  ```yaml
  saving:
    plt_dir : tests/plotting/2d
+ selection:
+   cuts:
+     xlow : x > -1.5
  general:
    size : [20, 10]
  plots_2d:
    # Column x and y
    # Name of column where weights are, null for no weights
    # Name of output plot, e.g. xy_x.png
-   - [x, y, weights, 'xy_w']
-   - [x, y, null, 'xy_r']
+   # Boolean flag signaling whether to use a log scale for the z axis
+   - [x, y, weights, 'xy_w', false]
+   - [x, y, null, 'xy_r', false]
+   - [x, y, null, 'xy_l', true]
  axes:
    x :
      binning : [-5.0, 8.0, 40]
@@ -823,7 +851,7 @@ Directory/Treename
  B_ENDVERTEX_CHI2DOF Double_t
  ```
 
- ## Comparing ROOT files
+ ## Comparing ROOT files
 
  Given two ROOT files, the command below:
 
@@ -885,7 +913,7 @@ last_file = get_latest_file(dir_path = file_dir, wc='name_*.txt')
  # of directories in `dir_path`, e.g.:
 
  oversion=get_last_version(dir_path=dir_path, version_only=True)  # This will return only the version, e.g. v3.2
- oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
+ oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return the full path, e.g. /a/b/c/v3.2
  ```
 
  The function above should work for numeric (e.g. `v1.2`) and non-numeric (e.g. `va`, `vb`) versions.
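One way such a helper can pick the latest version across numeric and non-numeric names is a natural sort; this is a sketch of the idea under that assumption, not the package's actual implementation:

```python
import re

def pick_last_version(names):
    """Pick the highest version from names like 'v1.2', 'v10.0', or 'vb' with a natural sort."""
    def key(name):
        # split into digit and non-digit runs so 'v10' sorts above 'v9';
        # tag each part so ints and strings never compare directly
        parts = re.findall(r'\d+|\D+', name)
        return [(0, int(p)) if p.isdigit() else (1, p) for p in parts]
    return max(names, key=key)
```

A plain lexicographic `max` would wrongly rank `v3.2` above `v10.0`; the natural-sort key fixes that while still handling `va`/`vb` alphabetically.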
@@ -1,17 +1,17 @@
- data_manipulation_utilities-0.2.4.data/scripts/publish,sha256=-3K_Y2_4CfWCV50rPB8CRuhjxDu7xMGswinRwPovgLs,1976
+ data_manipulation_utilities-0.2.6.data/scripts/publish,sha256=-3K_Y2_4CfWCV50rPB8CRuhjxDu7xMGswinRwPovgLs,1976
  dmu/arrays/utilities.py,sha256=PKoYyybPptA2aU-V3KLnJXBudWxTXu4x1uGdIMQ49HY,1722
  dmu/generic/utilities.py,sha256=0Xnq9t35wuebAqKxbyAiMk1ISB7IcXK4cFH25MT1fgw,1741
  dmu/generic/version_management.py,sha256=G_HjGY-hu8lotZuTdVAg0B8yD0AltE866q2vJxvTg1g,3749
  dmu/logging/log_store.py,sha256=umdvjNDuV3LdezbG26b0AiyTglbvkxST19CQu9QATbA,4184
- dmu/ml/cv_classifier.py,sha256=8Jwx6xMhJaRLktlRdq0tFl32v6t8i63KmpxrlnXlomU,3759
- dmu/ml/cv_predict.py,sha256=4G7F_1yOvnLftsDC6zUpdvkxuHXGkPemhj0RsYySYDM,6708
- dmu/ml/train_mva.py,sha256=SZ5cQHl7HBxn0c5Hh4HlN1aqMZaJUAlNmsfjnUSQrTg,16894
- dmu/ml/utilities.py,sha256=l348bufD95CuSYdIrHScQThIy2nKwGKXZn-FQg3CEwg,3930
+ dmu/ml/cv_classifier.py,sha256=ZbzEm_jW9yoTC7k_xBA7hFpc1bDNayiVR3tbaj1_ieE,4228
+ dmu/ml/cv_predict.py,sha256=4wwYL_jcUExDqLJVfClxEUWSd_QAx8yKHO3rX-mx4vw,6711
+ dmu/ml/train_mva.py,sha256=Tjtm_cXIiC5syaeUXsPAK4NKLbgkDdly17qbiOIT_Go,17608
+ dmu/ml/utilities.py,sha256=PK_61fW7gBV9aGZyez3PI8zAT7_Fc6IlQzDB7f8iBTM,4133
  dmu/pdataframe/utilities.py,sha256=ypvLiFfJ82ga94qlW3t5dXnvEFwYOXnbtJb2zHwsbqk,987
  dmu/plotting/matrix.py,sha256=pXuUJn-LgOvrI9qGkZQw16BzLjOjeikYQ_ll2VIcIXU,4978
- dmu/plotting/plotter.py,sha256=ytMxtzHEY8ZFU0ZKEBE-ROjMszXl5kHTMnQnWe173nU,7208
- dmu/plotting/plotter_1d.py,sha256=g6H2xAgsL9a6vRkpbqHICb3qwV_qMiQPZxxw_oOSf9M,5115
- dmu/plotting/plotter_2d.py,sha256=J-gKnagoHGfJFU7HBrhDFpGYH5Rxy0_zF5l8eE_7ZHE,2944
+ dmu/plotting/plotter.py,sha256=3WRbNOrFBWgI3iW5TbEgT4w_eF7-XUPs_32JL1AW3yY,7359
+ dmu/plotting/plotter_1d.py,sha256=2AnVxulyhKtwN-2Srhfm6fqdEREZNhcpJolBsJrWcsc,5745
+ dmu/plotting/plotter_2d.py,sha256=mZhp3D5I-JodOnFTEF1NqHtcLtuI-2WNpCQsrsoXNtw,3017
  dmu/plotting/utilities.py,sha256=SI9dvtZq2gr-PXVz71KE4o0i09rZOKgqJKD1jzf6KXk,1167
  dmu/rdataframe/atr_mgr.py,sha256=FdhaQWVpsm4OOe1IRbm7rfrq8VenTNdORyI-lZ2Bs1M,2386
  dmu/rdataframe/utilities.py,sha256=pNcQARMP7txMhy6k27UnDcYf0buNy5U2fshaJDl_h8o,3661
@@ -20,21 +20,23 @@ dmu/rfile/utilities.py,sha256=XuYY7HuSBj46iSu3c60UYBHtI6KIPoJU_oofuhb-be0,945
  dmu/stats/fitter.py,sha256=vHNZ16U3apoQyeyM8evq-if49doF48sKB3q9wmA96Fw,18387
  dmu/stats/function.py,sha256=yzi_Fvp_ASsFzbWFivIf-comquy21WoeY7is6dgY0Go,9491
  dmu/stats/gof_calculator.py,sha256=4EN6OhULcztFvsAZ00rxgohJemnjtDNB5o0IBcv6kbk,4657
- dmu/stats/minimizers.py,sha256=f9cilFY9Kp9UvbSIUsKBGFzOOg7EEWZJLPod-4k-LAQ,6216
- dmu/stats/model_factory.py,sha256=LyDOf0f9I5dNUTS0MXHtSivD8aAcTLIagvMPtoXtThk,7426
+ dmu/stats/minimizers.py,sha256=db9R2G0SOV-k0BKi6m4EyB_yp6AtZdP23_28B0315oo,7094
+ dmu/stats/model_factory.py,sha256=QobbhhMFUg61icB_P2grNFsftf_kl6gELjj1mkC9YSw,9115
  dmu/stats/utilities.py,sha256=LQy4kd3xSXqpApcWuYfZxkGQyjowaXv2Wr1c4Bj-4ys,4523
  dmu/stats/zfit_plotter.py,sha256=Xs6kisNEmNQXhYRCcjowxO6xHuyAyrfyQIFhGAR61U4,19719
- dmu/testing/utilities.py,sha256=WbMM4e9Cn3-B-12Vr64mB5qTKkV32joStlRkD-48lG0,3460
+ dmu/testing/utilities.py,sha256=moImLqGX9LAt5zJtE5j0gHHkUJ5kpbodryhiVswOsyM,3696
  dmu/text/transformer.py,sha256=4lrGknbAWRm0-rxbvgzOO-eR1-9bkYk61boJUEV3cQ0,6100
  dmu_data/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- dmu_data/ml/tests/train_mva.yaml,sha256=k5H4Gu9Gj57B9iqabhcTQEFN674Cv_uJ2Xcumb02zF4,1279
- dmu_data/plotting/tests/2d.yaml,sha256=VApcAfJFbjNcjMCTBSRm2P37MQlGavMZv6msbZwLSgw,402
+ dmu_data/ml/tests/train_mva.yaml,sha256=jtYwBY_VELCgXY24e7eQYEvKQLsPtbFXgXEeOkYunvY,1291
+ dmu_data/plotting/tests/2d.yaml,sha256=HSAtER-8CEqIGBY_jdcIdSVOHMfYPYhmgeZghTpVYh8,516
  dmu_data/plotting/tests/fig_size.yaml,sha256=7ROq49nwZ1A2EbPiySmu6n3G-Jq6YAOkc3d2X3YNZv0,294
  dmu_data/plotting/tests/high_stat.yaml,sha256=bLglBLCZK6ft0xMhQ5OltxE76cWsBMPMjO6GG0OkDr8,522
+ dmu_data/plotting/tests/legend.yaml,sha256=wGpj58ig-GOlqbWoN894zrCet2Fj9f5QtY0rig_UC-c,213
  dmu_data/plotting/tests/name.yaml,sha256=mkcPAVg8wBAmlSbSRQ1bcaMl4vOS6LXMtpqQeDrrtO4,312
  dmu_data/plotting/tests/no_bounds.yaml,sha256=8e1QdphBjz-suDr857DoeUC2DXiy6SE-gvkORJQYv80,257
  dmu_data/plotting/tests/normalized.yaml,sha256=Y0eKtyV5pvlSxvqfsLjytYtv8xYF3HZ5WEdCJdeHGQI,193
  dmu_data/plotting/tests/simple.yaml,sha256=N_TvNBh_2dU0-VYgu_LMrtY0kV_hg2HxVuEoDlr1HX8,138
+ dmu_data/plotting/tests/stats.yaml,sha256=fSZjoV-xPnukpCH2OAXsz_SNPjI113qzDg8Ln3spaaA,165
  dmu_data/plotting/tests/title.yaml,sha256=bawKp9aGpeRrHzv69BOCbFX8sq9bb3Es9tdsPTE7jIk,333
  dmu_data/plotting/tests/weights.yaml,sha256=RWQ1KxbCq-uO62WJ2AoY4h5Umc37zG35s-TpKnNMABI,312
  dmu_data/text/transform.toml,sha256=R-832BZalzHZ6c5gD6jtT_Hj8BCsM5vxa1v6oeiwaP4,94
@@ -48,8 +50,8 @@ dmu_scripts/rfile/compare_root_files.py,sha256=T8lDnQxsRNMr37x1Y7YvWD8ySHrJOWZki
  dmu_scripts/rfile/print_trees.py,sha256=Ze4Ccl_iUldl4eVEDVnYBoe4amqBT1fSBR1zN5WSztk,941
  dmu_scripts/ssh/coned.py,sha256=lhilYNHWRCGxC-jtyJ3LQ4oUgWW33B2l1tYCcyHHsR0,4858
  dmu_scripts/text/transform_text.py,sha256=9akj1LB0HAyopOvkLjNOJiptZw5XoOQLe17SlcrGMD0,1456
- data_manipulation_utilities-0.2.4.dist-info/METADATA,sha256=Gc-ZuL88YHEK3pOK1IfQmaN6rKCcVVqrFS2VlT70jyk,29229
- data_manipulation_utilities-0.2.4.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
- data_manipulation_utilities-0.2.4.dist-info/entry_points.txt,sha256=1TIZDed651KuOH-DgaN5AoBdirKmrKE_oM1b6b7zTUU,270
- data_manipulation_utilities-0.2.4.dist-info/top_level.txt,sha256=n_x5J6uWtSqy9mRImKtdA2V2NJNyU8Kn3u8DTOKJix0,25
- data_manipulation_utilities-0.2.4.dist-info/RECORD,,
+ data_manipulation_utilities-0.2.6.dist-info/METADATA,sha256=P9-pWYbzx2C-dntMHgb85WS64CVPRu8BeRPxvJOk3VE,30187
+ data_manipulation_utilities-0.2.6.dist-info/WHEEL,sha256=jB7zZ3N9hIM9adW7qlTAyycLYW9npaWKLRzaoVcLKcM,91
+ data_manipulation_utilities-0.2.6.dist-info/entry_points.txt,sha256=1TIZDed651KuOH-DgaN5AoBdirKmrKE_oM1b6b7zTUU,270
+ data_manipulation_utilities-0.2.6.dist-info/top_level.txt,sha256=n_x5J6uWtSqy9mRImKtdA2V2NJNyU8Kn3u8DTOKJix0,25
+ data_manipulation_utilities-0.2.6.dist-info/RECORD,,
@@ -1,5 +1,5 @@
  Wheel-Version: 1.0
- Generator: setuptools (75.8.0)
+ Generator: setuptools (75.8.2)
  Root-Is-Purelib: true
  Tag: py3-none-any
 
dmu/ml/cv_classifier.py CHANGED
@@ -1,15 +1,15 @@
  '''
  Module holding cv_classifier class
  '''
-
+ import os
  from typing import Union
  from sklearn.ensemble import GradientBoostingClassifier
 
+ import yaml
  from dmu.logging.log_store import LogStore
  import dmu.ml.utilities as ut
 
  log = LogStore.add_logger('dmu:ml:CVClassifier')
-
  # ---------------------------------------
  class CVSameData(Exception):
      '''
@@ -61,6 +61,20 @@ class CVClassifier(GradientBoostingClassifier):
 
      return self._cfg
  # ----------------------------------
+ def save_cfg(self, path : str):
+     '''
+     Will save configuration used to train this classifier to YAML
+
+     path: Path to YAML file
+     '''
+     dir_name = os.path.dirname(path)
+     os.makedirs(dir_name, exist_ok=True)
+
+     with open(path, 'w', encoding='utf-8') as ofile:
+         yaml.safe_dump(self._cfg, ofile, indent=2)
+
+     log.info(f'Saved config to: {path}')
+ # ----------------------------------
  def __str__(self):
      nhash = len(self._s_hash)
 
dmu/ml/cv_predict.py CHANGED
@@ -73,11 +73,11 @@ class CVPredict:
      log.debug('Not doing any NaN replacement')
      return df
 
- log.debug(60 * '-')
+ log.info(60 * '-')
  log.info('Doing NaN replacements')
- log.debug(60 * '-')
+ log.info(60 * '-')
  for var, val in self._d_nan_rep.items():
-     log.debug(f'{var:<20}{"--->":20}{val:<20.3f}')
+     log.info(f'{var:<20}{"--->":20}{val:<20.3f}')
      df[var] = df[var].fillna(val)
 
  return df
@@ -155,7 +155,7 @@ class CVPredict:
  ndif = len(s_dif_hash)
  ndat = len(s_dat_hash)
  nmod = len(s_mod_hash)
- log.debug(f'{ndif:<20}{"=":10}{ndat:<20}{"-":10}{nmod:<20}')
+ log.debug(f'{ndif:<10}{"=":5}{ndat:<10}{"-":5}{nmod:<10}')
 
  df_ft_group= df_ft.loc[df_ft.index.isin(s_dif_hash)]
 
@@ -173,7 +173,7 @@ class CVPredict:
      return arr_prb
 
  nentries = len(self._arr_patch)
- log.warning(f'Patching {nentries} probabilities')
+ log.warning(f'Patching {nentries} probabilities with -1')
  arr_prb[self._arr_patch] = -1
 
  return arr_prb
dmu/ml/train_mva.py CHANGED
@@ -1,7 +1,7 @@
  '''
  Module with TrainMva class
  '''
- # pylint: disable = too-many-locals
+ # pylint: disable = too-many-locals, no-name-in-module
  # pylint: disable = too-many-arguments, too-many-positional-arguments
 
  import os
@@ -14,7 +14,7 @@ import matplotlib.pyplot as plt
  from sklearn.metrics import roc_curve, auc
  from sklearn.model_selection import StratifiedKFold
 
- from ROOT import RDataFrame
+ from ROOT import RDataFrame, RDF
 
  import dmu.ml.utilities as ut
  import dmu.pdataframe.utilities as put
@@ -33,61 +33,71 @@ class TrainMva:
  Interface to scikit learn used to train classifier
  '''
  # ---------------------------------------------
- def __init__(self, bkg=None, sig=None, cfg=None):
+ def __init__(self, bkg : RDataFrame, sig : RDataFrame, cfg : dict):
      '''
      bkg (ROOT dataframe): Holds real data
      sig (ROOT dataframe): Holds simulation
      cfg (dict)          : Dictionary storing configuration for training
      '''
-     if bkg is None:
-         raise ValueError('Background dataframe is not a ROOT dataframe')
-
-     if sig is None:
-         raise ValueError('Signal dataframe is not a ROOT dataframe')
-
-     if not isinstance(cfg, dict):
-         raise ValueError('Config dictionary is not a dictionary')
+     self._cfg       = cfg
+     self._l_ft_name = self._cfg['training']['features']
 
-     self._rdf_bkg = bkg
-     self._rdf_sig = sig
-     self._cfg     = cfg
+     df_ft_sig, l_lab_sig = self._get_sample_inputs(rdf = sig, label = 1)
+     df_ft_bkg, l_lab_bkg = self._get_sample_inputs(rdf = bkg, label = 0)
 
-     self._l_ft_name = self._cfg['training']['features']
+     self._df_ft = pnd.concat([df_ft_sig, df_ft_bkg], axis=0)
+     self._l_lab = numpy.array(l_lab_sig + l_lab_bkg)
 
-     self._df_ft, self._l_lab = self._get_inputs()
+     self._rdf_bkg = self._get_rdf(rdf = bkg, df=df_ft_bkg)
+     self._rdf_sig = self._get_rdf(rdf = sig, df=df_ft_sig)
  # ---------------------------------------------
- def _get_inputs(self) -> tuple[pnd.DataFrame, npa]:
-     log.info('Getting signal')
-     df_sig, arr_lab_sig = self._get_sample_inputs(self._rdf_sig, label = 1)
+ def _get_rdf(self, rdf : RDataFrame, df : pnd.DataFrame) -> RDataFrame:
+     '''
+     Takes original ROOT dataframe and pre-processed features dataframe
+     Adds missing branches to latter and returns expanded ROOT dataframe
+     '''
 
-     log.info('Getting background')
-     df_bkg, arr_lab_bkg = self._get_sample_inputs(self._rdf_bkg, label = 0)
+     l_pnd_col = df.columns.tolist()
+     l_rdf_col = [ name.c_str() for name in rdf.GetColumnNames() ]
+     l_mis_col = [ col for col in l_rdf_col if col not in l_pnd_col ]
 
-     df      = pnd.concat([df_sig, df_bkg], axis=0)
-     arr_lab = numpy.concatenate([arr_lab_sig, arr_lab_bkg])
+     log.debug(f'Adding extra-nonfeature columns: {l_mis_col}')
 
-     return df, arr_lab
+     d_data = rdf.AsNumpy(l_mis_col)
+     df_ext = pnd.DataFrame(d_data)
+     df_all = pnd.concat([df, df_ext], axis=1)
+
+     return RDF.FromPandas(df_all)
  # ---------------------------------------------
  def _pre_process_nans(self, df : pnd.DataFrame) -> pnd.DataFrame:
+     if 'dataset' not in self._cfg:
+         return df
+
      if 'nan' not in self._cfg['dataset']:
          log.debug('dataset/nan section not found, not pre-processing NaNs')
          return df
 
      d_name_val = self._cfg['dataset']['nan']
-     for name, val in d_name_val.items():
-         log.debug(f'{val:<20}{"<---":<10}{name:<100}')
-         df[name] = df[name].fillna(val)
+     log.info(70 * '-')
+     log.info('Doing NaN replacements')
+     log.info(70 * '-')
+     for var, val in d_name_val.items():
+         nna = df[var].isna().sum()
+
+         log.info(f'{var:<20}{"--->":20}{val:<20.3f}{nna}')
+         df[var] = df[var].fillna(val)
+     log.info(70 * '-')
 
      return df
  # ---------------------------------------------
- def _get_sample_inputs(self, rdf : RDataFrame, label : int) -> tuple[pnd.DataFrame, npa]:
+ def _get_sample_inputs(self, rdf : RDataFrame, label : int) -> tuple[pnd.DataFrame, list[int]]:
      d_ft = rdf.AsNumpy(self._l_ft_name)
      df   = pnd.DataFrame(d_ft)
      df   = self._pre_process_nans(df)
      df   = ut.cleanup(df)
      l_lab= len(df) * [label]
 
-     return df, numpy.array(l_lab)
+     return df, l_lab
  # ---------------------------------------------
  def _get_model(self, arr_index : npa) -> cls:
      model = cls(cfg = self._cfg)
@@ -406,6 +416,9 @@ class TrainMva:
      self._save_hyperparameters_to_tex()
  # ---------------------------------------------
  def _save_nan_conversion(self) -> None:
+     if 'dataset' not in self._cfg:
+         return
+
      if 'nan' not in self._cfg['dataset']:
          log.debug('NaN section not found, not saving it')
          return
@@ -434,13 +447,18 @@ class TrainMva:
      os.makedirs(val_dir, exist_ok=True)
      put.df_to_tex(df, f'{val_dir}/hyperparameters.tex')
  # ---------------------------------------------
- def run(self):
+ def run(self, skip_fit : bool = False) -> None:
      '''
      Will do the training
+
+     skip_fit: False by default; if True, only plot the features and save tables, skipping the fit
      '''
      self._save_settings_to_tex()
      self._plot_features()
 
+     if skip_fit:
+         return
+
      l_mod = self._get_models()
      for ifold, mod in enumerate(l_mod):
          self._save_model(mod, ifold)
dmu/ml/utilities.py CHANGED
@@ -16,7 +16,7 @@ log = LogStore.add_logger('dmu:ml:utilities')
  # ---------------------------------------------
  def patch_and_tag(df : pnd.DataFrame, value : float = 0) -> pnd.DataFrame:
      '''
-     Takes panda dataframe, replaces NaNs with value introduced, by default 0
+     Takes pandas dataframe, replaces NaNs with value introduced, by default 0
      Returns array of indices where the replacement happened
      '''
      l_nan = df.index[df.isna().any(axis=1)].tolist()
@@ -25,7 +25,13 @@ def patch_and_tag(df : pnd.DataFrame, value : float = 0) -> pnd.DataFrame:
      log.debug('No NaNs found')
      return df
 
- log.warning(f'Found {nnan} NaNs, patching them with {value}')
+ log.warning(f'Found {nnan} NaNs')
+
+ df_nan_frq = df.isna().sum()
+ df_nan_frq = df_nan_frq[df_nan_frq > 0]
+ print(df_nan_frq)
+
+ log.warning(f'Attaching array with NaN {nnan} indexes and removing NaNs from dataframe')
 
  df_pa = df.fillna(value)
@@ -57,7 +63,7 @@ def _remove_nans(df : pnd.DataFrame) -> pnd.DataFrame:
  log.info('Found columns with NaNs')
  for name in l_na_name:
      nan_count = df[name].isna().sum()
-     log.info(f'{nan_count:<10}{name:<100}')
+     log.info(f'{nan_count:<10}{name}')
 
  ninit = len(df)
  df    = df.dropna()
@@ -75,10 +81,10 @@ def _remove_repeated(df : pnd.DataFrame) -> pnd.DataFrame:
  nfinl = len(s_hash)
 
  if ninit == nfinl:
-     log.debug('No cleaning needed for dataframe')
+     log.debug('No overlap between training and application found')
      return df
 
- log.warning(f'Repeated entries found, cleaning up: {ninit} -> {nfinl}')
+ log.warning(f'Overlap between training and application found, cleaning up: {ninit} -> {nfinl}')
 
  df['hash_index'] = l_hash
  df = df.set_index('hash_index', drop=True)
dmu/plotting/plotter.py CHANGED
@@ -107,7 +107,7 @@ class Plotter:
 
      d_cut = self._d_cfg['selection']['cuts']
 
-     log.info('Applying cuts')
+     log.debug('Applying cuts')
      for name, cut in d_cut.items():
          log.debug(f'{name:<50}{cut:<150}')
          rdf = rdf.Filter(cut, name)
@@ -212,7 +212,11 @@ class Plotter:
 
      var (str) : Name of variable, needed for plot name
      '''
-     plt.legend()
+     d_leg = {}
+     if 'style' in self._d_cfg and 'legend' in self._d_cfg['style']:
+         d_leg = self._d_cfg['style']['legend']
+
+     plt.legend(**d_leg)
 
      plt_dir = self._d_cfg['saving']['plt_dir']
      os.makedirs(plt_dir, exist_ok=True)
dmu/plotting/plotter_1d.py CHANGED
@@ -77,17 +77,33 @@ class Plotter1D(Plotter):
 
      l_bc_all = []
      for name, arr_val in d_data.items():
+         label   = self._label_from_name(name, arr_val)
          arr_wgt = d_wgt[name] if d_wgt is not None else numpy.ones_like(arr_val)
          arr_wgt = self._normalize_weights(arr_wgt, var)
-         hst = Hist.new.Reg(bins=bins, start=minx, stop=maxx, name='x', label=name).Weight()
+         hst = Hist.new.Reg(bins=bins, start=minx, stop=maxx, name='x').Weight()
          hst.fill(x=arr_val, weight=arr_wgt)
-         hst.plot(label=name)
+         hst.plot(label=label)
          l_bc_all += hst.values().tolist()
 
      max_y = max(l_bc_all)
 
      return max_y
  # --------------------------------------------
+ def _label_from_name(self, name : str, arr_val : numpy.ndarray) -> str:
+     if 'stats' not in self._d_cfg:
+         return name
+
+     d_stat = self._d_cfg['stats']
+     if 'nentries' not in d_stat:
+         return name
+
+     form = d_stat['nentries']
+
+     nentries = len(arr_val)
+     nentries = form.format(nentries)
+
+     return f'{name}{nentries}'
+ # --------------------------------------------
  def _normalize_weights(self, arr_wgt : numpy.ndarray, var : str) -> numpy.ndarray:
      cfg_var = self._d_cfg['plots'][var]
      if 'normalized' not in cfg_var:
@@ -104,7 +120,6 @@ class Plotter1D(Plotter):
 
      return arr_wgt
  # --------------------------------------------
-
  def _style_plot(self, var : str, max_y : float) -> None:
      d_cfg  = self._d_cfg['plots'][var]
      yscale = d_cfg['yscale' ] if 'yscale' in d_cfg else 'linear'
@@ -124,12 +139,15 @@ class Plotter1D(Plotter):
      plt.legend()
      plt.title(title)
  # --------------------------------------------
- def _plot_lines(self, var : str):
+ def _plot_lines(self, var : str) -> None:
      '''
      Will plot vertical lines for some variables
 
      var (str) : name of variable
      '''
+     if 'style' in self._d_cfg and 'skip_lines' in self._d_cfg['style'] and self._d_cfg['style']['skip_lines']:
+         return
+
      if var in ['B_const_mass_M', 'B_M']:
          plt.axvline(x=5280, color='r', label=r'$B^+$' , linestyle=':')
      elif var == 'Jpsi_M':
dmu/plotting/plotter_2d.py CHANGED
@@ -10,6 +10,7 @@ import matplotlib.pyplot as plt
 
  from hist import Hist
  from ROOT import RDataFrame
+ from matplotlib.colors import LogNorm
  from dmu.logging.log_store import LogStore
  from dmu.plotting.plotter import Plotter
 
@@ -28,11 +29,8 @@ class Plotter2D(Plotter):
      cfg (dict): Dictionary with configuration, e.g. binning, ranges, etc
      '''
 
-     if not isinstance(cfg, dict):
-         raise ValueError('Config dictionary not passed')
-
-     self._d_cfg : dict       = cfg
-     self._rdf   : RDataFrame = super()._preprocess_rdf(rdf)
+     super().__init__({'single_rdf' : rdf}, cfg)
+     self._rdf : RDataFrame = self._d_rdf['single_rdf']
 
      self._wgt : numpy.ndarray
  # --------------------------------------------
@@ -61,7 +59,7 @@ class Plotter2D(Plotter):
 
      return arr_wgt
  # --------------------------------------------
- def _plot_vars(self, varx : str, vary : str, wgt_name : str) -> None:
+ def _plot_vars(self, varx : str, vary : str, wgt_name : str, use_log : bool) -> None:
      log.info(f'Plotting {varx} vs {vary} with weights {wgt_name}')
 
      ax_x = self._get_axis(varx)
@@ -72,7 +70,10 @@ class Plotter2D(Plotter):
      hst = Hist(ax_x, ax_y)
      hst.fill(arr_x, arr_y, weight=arr_w)
 
-     mplhep.hist2dplot(hst)
+     if use_log:
+         mplhep.hist2dplot(hst, norm=LogNorm())
+     else:
+         mplhep.hist2dplot(hst)
  # --------------------------------------------
  def run(self):
      '''
@@ -80,8 +81,8 @@ class Plotter2D(Plotter):
      '''
 
      fig_size = self._get_fig_size()
-     for [varx, vary, wgt_name, plot_name] in self._d_cfg['plots_2d']:
+     for [varx, vary, wgt_name, plot_name, use_log] in self._d_cfg['plots_2d']:
          plt.figure(plot_name, figsize=fig_size)
-         self._plot_vars(varx, vary, wgt_name)
+         self._plot_vars(varx, vary, wgt_name, use_log)
          self._save_plot(plot_name)
  # --------------------------------------------
dmu/stats/minimizers.py CHANGED
@@ -1,12 +1,16 @@
  '''
  Module containing derived classes from ZFit minimizer
  '''
+ from typing import Union
  import numpy
 
  import zfit
+ import matplotlib.pyplot as plt
+
  from zfit.result import FitResult
  from zfit.core.basepdf import BasePDF as zpdf
  from zfit.minimizers.baseminimizer import FailMinimizeNaN
+ from dmu.stats.utilities import print_pdf
  from dmu.stats.gof_calculator import GofCalculator
  from dmu.logging.log_store import LogStore
 
@@ -29,6 +33,7 @@ class AnealingMinimizer(zfit.minimize.Minuit):
      self._chi2ndof = chi2ndof
 
      self._check_thresholds()
+     self._l_bad_fit_res : list[FitResult] = []
 
      super().__init__()
  # ------------------------
@@ -66,19 +71,24 @@ class AnealingMinimizer(zfit.minimize.Minuit):
      return is_good
  # ------------------------
  def _is_good_fit(self, res : FitResult) -> bool:
+     good_fit = True
+
      if not res.valid:
-         log.warning('Skipping invalid fit')
-         return False
+         log.debug('Skipping invalid fit')
+         good_fit = False
 
      if res.status != 0:
-         log.warning('Skipping fit with bad status')
-         return False
+         log.debug('Skipping fit with bad status')
+         good_fit = False
 
      if not res.converged:
-         log.warning('Skipping non-converging fit')
-         return False
+         log.debug('Skipping non-converging fit')
+         good_fit = False
 
-     return True
+     if not good_fit:
+         self._l_bad_fit_res.append(res)
+
+     return good_fit
  # ------------------------
  def _get_gof(self, nll) -> tuple[float, float]:
      log.debug('Checking GOF')
@@ -108,10 +118,11 @@ class AnealingMinimizer(zfit.minimize.Minuit):
          par.set_value(fval)
          log.debug(f'{par.name:<20}{ival:<15.3f}{"->":<10}{fval:<15.3f}{"in":<5}{par.lower:<15.3e}{par.upper:<15.3e}')
  # ------------------------
- def _pick_best_fit(self, d_chi2_res : dict) -> FitResult:
+ def _pick_best_fit(self, d_chi2_res : dict) -> Union[FitResult,None]:
      nres = len(d_chi2_res)
      if nres == 0:
-         raise ValueError('No fits found')
+         log.error('No fits found')
+         return None
 
      l_chi2_res= list(d_chi2_res.items())
      l_chi2_res.sort()
@@ -149,6 +160,15 @@ class AnealingMinimizer(zfit.minimize.Minuit):
149
160
 
150
161
  return l_model[0]
151
162
  # ------------------------
163
+ def _print_failed_fit_diagnostics(self, nll) -> None:
164
+ for res in self._l_bad_fit_res:
165
+ print(res)
166
+
167
+ arr_mass = nll.data[0].numpy()
168
+
169
+ plt.hist(arr_mass, bins=60)
170
+ plt.show()
171
+ # ------------------------
152
172
  def minimize(self, nll, **kwargs) -> FitResult:
153
173
  '''
154
174
  Will run minimization and return FitResult object
@@ -156,18 +176,20 @@ class AnealingMinimizer(zfit.minimize.Minuit):
156
176
 
157
177
  d_chi2_res : dict[float,FitResult] = {}
158
178
  for i_try in range(self._ntries):
159
- log.info(f'try {i_try:02}/{self._ntries:02}')
160
179
  try:
161
180
  res = super().minimize(nll, **kwargs)
162
181
  except (FailMinimizeNaN, ValueError, RuntimeError) as exc:
163
- log.warning(exc)
182
+ log.error(f'{i_try:02}/{self._ntries:02}{"Failed":>20}')
183
+ log.debug(exc)
164
184
  self._randomize_parameters(nll)
165
185
  continue
166
186
 
167
187
  if not self._is_good_fit(res):
188
+ log.warning(f'{i_try:02}/{self._ntries:02}{"Bad fit":>20}')
168
189
  continue
169
190
 
170
191
  chi2, pvl = self._get_gof(nll)
192
+ log.info(f'{i_try:02}/{self._ntries:02}{chi2:>20.3f}')
171
193
  d_chi2_res[chi2] = res
172
194
 
173
195
  if self._is_good_gof(chi2, pvl):
@@ -176,6 +198,13 @@ class AnealingMinimizer(zfit.minimize.Minuit):
176
198
  self._randomize_parameters(nll)
177
199
 
178
200
  res = self._pick_best_fit(d_chi2_res)
201
+ if res is None:
202
+ self._print_failed_fit_diagnostics(nll)
203
+ pdf = nll.model[0]
204
+ print_pdf(pdf)
205
+
206
+ raise ValueError('Fit failed')
207
+
179
208
  pdf = self._pdf_from_nll(nll)
180
209
  self._set_pdf_pars(res, pdf)
181
210
 
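The reworked `minimize` above retries the fit up to `_ntries` times, randomizing the starting parameters after each failed or poor attempt, keeping every good result keyed by its chi2, and raising only when no attempt succeeds. A minimal standalone sketch of that retry-and-keep-best pattern (plain Python, no zfit; `fit_once` and `randomize` are hypothetical stand-ins for the real minimizer calls):

```python
def anneal_minimize(fit_once, randomize, ntries=10, chi2_threshold=1.5):
    # Try the fit several times; after every failure or poor result,
    # randomize the starting point and retry, keeping each successful
    # result keyed by its goodness of fit.
    d_chi2_res = {}
    for _ in range(ntries):
        try:
            res, chi2 = fit_once()
        except ValueError:
            randomize()
            continue

        d_chi2_res[chi2] = res

        if chi2 < chi2_threshold:
            break                       # good enough, stop early

        randomize()

    if not d_chi2_res:
        raise ValueError('Fit failed')  # no attempt ever succeeded

    best_chi2 = min(d_chi2_res)         # smallest chi2 wins
    return d_chi2_res[best_chi2]

# Hypothetical fit that fails twice, then converges with chi2 = 1.0
calls = {'n': 0}
def fit_once():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ValueError('bad starting point')
    return 'fit-result', 1.0

best = anneal_minimize(fit_once, randomize=lambda: None)
```

The real class additionally stores the bad `FitResult` objects so that `_print_failed_fit_diagnostics` can dump them when everything fails.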
@@ -1,7 +1,7 @@
  '''
  Module storing ZModel class
  '''
- # pylint: disable=too-many-lines, import-error
+ # pylint: disable=too-many-lines, import-error, too-many-positional-arguments, too-many-arguments

  from typing import Callable, Union

@@ -37,7 +37,16 @@ class MethodRegistry:
          '''
          Will return method in charge of building PDF, for an input nickname
          '''
-         return cls._d_method.get(nickname, None)
+         method = cls._d_method.get(nickname, None)
+
+         if method is not None:
+             return method
+
+         log.warning('Available PDFs:')
+         for value in cls._d_method:
+             log.info(f'    {value}')
+
+         return method
  #-----------------------------------------
  class ModelFactory:
      '''
@@ -48,33 +57,56 @@ class ModelFactory:

          l_pdf = ['dscb', 'gauss']
          l_shr = ['mu']
-         mod   = ModelFactory(obs = obs, l_pdf = l_pdf, l_shared=l_shr)
+         mod   = ModelFactory(preffix = 'signal', obs = obs, l_pdf = l_pdf, l_shared = l_shr, l_float = [])
          pdf   = mod.get_pdf()
      ```

      where one can specify which parameters can be shared among the PDFs
      '''
      #-----------------------------------------
-     def __init__(self, obs : zobs, l_pdf : list[str], l_shared : list[str]):
+     def __init__(self,
+                  preffix  : str,
+                  obs      : zobs,
+                  l_pdf    : list[str],
+                  l_shared : list[str],
+                  l_float  : list[str]):
          '''
+         preffix: used to identify PDF, will be used to name every parameter
          obs: zfit observable
          l_pdf: List of PDF nicknames which are registered below
          l_shared: List of parameter names that are shared
+         l_float: List of parameter names to allow to float
          '''

+         self._preffix = preffix
          self._l_pdf   = l_pdf
          self._l_shr   = l_shared
-         self._l_can_be_shared = ['mu', 'sg']
+         self._l_flt   = l_float
          self._obs     = obs

          self._d_par : dict[str,zpar] = {}
      #-----------------------------------------
-     def _get_name(self, name : str, suffix : str) -> str:
-         for can_be_shared in self._l_can_be_shared:
-             if name.startswith(f'{can_be_shared}_') and can_be_shared in self._l_shr:
-                 return can_be_shared
+     def _split_name(self, name : str) -> tuple[str,str]:
+         l_part = name.split('_')
+         pname  = l_part[0]
+         xname  = '_'.join(l_part[1:])

-         return f'{name}{suffix}'
+         return pname, xname
+     #-----------------------------------------
+     def _get_parameter_name(self, name : str, suffix : str) -> str:
+         pname, xname = self._split_name(name)
+
+         log.debug(f'Using physical name: {pname}')
+
+         if pname in self._l_shr:
+             name = f'{pname}_{self._preffix}'
+         else:
+             name = f'{pname}_{xname}_{self._preffix}{suffix}'
+
+         if pname in self._l_flt:
+             return f'{name}_flt'
+
+         return name
      #-----------------------------------------
      def _get_parameter(self,
                         name   : str,
@@ -82,7 +114,10 @@ class ModelFactory:
                         val    : float,
                         low    : float,
                         high   : float) -> zpar:
-         name = self._get_name(name, suffix)
+
+         name = self._get_parameter_name(name, suffix)
+         log.debug(f'Assigning name: {name}')
+
          if name in self._d_par:
              return self._d_par[name]

@@ -94,15 +129,15 @@ class ModelFactory:
      #-----------------------------------------
      @MethodRegistry.register('exp')
      def _get_exponential(self, suffix : str = '') -> zpdf:
-         c   = self._get_parameter('c_exp', suffix, -0.005, -0.05, 0.00)
-         pdf = zfit.pdf.Exponential(c, self._obs)
+         c   = self._get_parameter('c_exp', suffix, -0.005, -0.20, 0.00)
+         pdf = zfit.pdf.Exponential(c, self._obs, name=f'exp{suffix}')

          return pdf
      #-----------------------------------------
      @MethodRegistry.register('pol1')
      def _get_pol1(self, suffix : str = '') -> zpdf:
          a   = self._get_parameter('a_pol1', suffix, -0.005, -0.95, 0.00)
-         pdf = zfit.pdf.Chebyshev(obs=self._obs, coeffs=[a])
+         pdf = zfit.pdf.Chebyshev(obs=self._obs, coeffs=[a], name=f'pol1{suffix}')

          return pdf
      #-----------------------------------------
@@ -110,51 +145,62 @@ class ModelFactory:
      def _get_pol2(self, suffix : str = '') -> zpdf:
          a   = self._get_parameter('a_pol2', suffix, -0.005, -0.95, 0.00)
          b   = self._get_parameter('b_pol2', suffix,  0.000, -0.95, 0.95)
-         pdf = zfit.pdf.Chebyshev(obs=self._obs, coeffs=[a, b])
+         pdf = zfit.pdf.Chebyshev(obs=self._obs, coeffs=[a, b], name=f'pol2{suffix}')

          return pdf
      #-----------------------------------------
      @MethodRegistry.register('cbr')
      def _get_cbr(self, suffix : str = '') -> zpdf:
-         mu  = self._get_parameter('mu_cbr', suffix, 5300, 5250, 5350)
+         mu  = self._get_parameter('mu_cbr', suffix, 5300, 5100, 5350)
          sg  = self._get_parameter('sg_cbr', suffix,   10,    2,  300)
-         ar  = self._get_parameter('ac_cbr', suffix,   -2,  -4., -1.)
-         nr  = self._get_parameter('nc_cbr', suffix,    1,  0.5, 5.0)
+         ar  = self._get_parameter('ac_cbr', suffix,   -2, -14., -0.1)
+         nr  = self._get_parameter('nc_cbr', suffix,    1,  0.5,  150)
+
+         pdf = zfit.pdf.CrystalBall(mu, sg, ar, nr, self._obs, name=f'cbr{suffix}')
+
+         return pdf
+     #-----------------------------------------
+     @MethodRegistry.register('suj')
+     def _get_suj(self, suffix : str = '') -> zpdf:
+         mu  = self._get_parameter('mu_suj', suffix, 5300, 4000, 6000)
+         sg  = self._get_parameter('sg_suj', suffix,   10,    2, 5000)
+         gm  = self._get_parameter('gm_suj', suffix,    1,  -10,   10)
+         dl  = self._get_parameter('dl_suj', suffix,    1,  0.1,   10)

-         pdf = zfit.pdf.CrystalBall(mu, sg, ar, nr, self._obs)
+         pdf = zfit.pdf.JohnsonSU(mu, sg, gm, dl, self._obs, name=f'suj{suffix}')

          return pdf
      #-----------------------------------------
      @MethodRegistry.register('cbl')
      def _get_cbl(self, suffix : str = '') -> zpdf:
-         mu  = self._get_parameter('mu_cbl', suffix, 5300, 5250, 5350)
+         mu  = self._get_parameter('mu_cbl', suffix, 5300, 5100, 5350)
          sg  = self._get_parameter('sg_cbl', suffix,   10,    2,  300)
-         al  = self._get_parameter('ac_cbl', suffix,    2,   1.,  4.)
-         nl  = self._get_parameter('nc_cbl', suffix,    1,  0.5, 5.0)
+         al  = self._get_parameter('ac_cbl', suffix,    2,  0.1, 14.)
+         nl  = self._get_parameter('nc_cbl', suffix,    1,  0.5, 150)

-         pdf = zfit.pdf.CrystalBall(mu, sg, al, nl, self._obs)
+         pdf = zfit.pdf.CrystalBall(mu, sg, al, nl, self._obs, name=f'cbl{suffix}')

          return pdf
      #-----------------------------------------
      @MethodRegistry.register('gauss')
      def _get_gauss(self, suffix : str = '') -> zpdf:
-         mu  = self._get_parameter('mu_gauss', suffix, 5300, 5250, 5350)
+         mu  = self._get_parameter('mu_gauss', suffix, 5300, 5100, 5350)
          sg  = self._get_parameter('sg_gauss', suffix,   10,    2,  300)

-         pdf = zfit.pdf.Gauss(mu, sg, self._obs)
+         pdf = zfit.pdf.Gauss(mu, sg, self._obs, name=f'gauss{suffix}')

          return pdf
      #-----------------------------------------
      @MethodRegistry.register('dscb')
      def _get_dscb(self, suffix : str = '') -> zpdf:
-         mu  = self._get_parameter('mu_dscb', suffix, 5300, 5250, 5400)
-         sg  = self._get_parameter('sg_dscb', suffix,   10,    2,   30)
+         mu  = self._get_parameter('mu_dscb', suffix, 4000, 4000, 5400)
+         sg  = self._get_parameter('sg_dscb', suffix,   10,    2,  500)
          ar  = self._get_parameter('ar_dscb', suffix,    1,    0,    5)
          al  = self._get_parameter('al_dscb', suffix,    1,    0,    5)
-         nr  = self._get_parameter('nr_dscb', suffix,    2,    1,    5)
-         nl  = self._get_parameter('nl_dscb', suffix,    2,    0,    5)
+         nr  = self._get_parameter('nr_dscb', suffix,    2,    1,  150)
+         nl  = self._get_parameter('nl_dscb', suffix,    2,    0,  150)

-         pdf = zfit.pdf.DoubleCB(mu, sg, al, nl, ar, nr, self._obs)
+         pdf = zfit.pdf.DoubleCB(mu, sg, al, nl, ar, nr, self._obs, name=f'dscb{suffix}')

          return pdf
      #-----------------------------------------
@@ -190,7 +236,7 @@ class ModelFactory:

          l_frc = [ zfit.param.Parameter(f'frc_{ifrc + 1}', 0.5, 0, 1) for ifrc in range(nfrc - 1) ]

-         pdf = zfit.pdf.SumPDF(l_pdf, fracs=l_frc)
+         pdf = zfit.pdf.SumPDF(l_pdf, name=self._preffix, fracs=l_frc)

          return pdf
      #-----------------------------------------
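The new `_get_parameter_name` logic above decides how fit parameters are named: shared parameters collapse to the physical name plus the prefix, per-component parameters keep their component tag and suffix, and floating parameters gain a `_flt` marker. A standalone sketch of that naming rule (a hypothetical free function mirroring the method; `'mu_cbr'` means physical name `mu` on component `cbr`):

```python
def get_parameter_name(name, preffix, suffix, l_shared, l_float):
    # 'name' is e.g. 'mu_cbr': physical name, underscore, component tag
    l_part = name.split('_')
    pname  = l_part[0]               # physical parameter name, e.g. 'mu'
    xname  = '_'.join(l_part[1:])    # component tag, e.g. 'cbr'

    if pname in l_shared:
        out = f'{pname}_{preffix}'   # one parameter shared by all components
    else:
        out = f'{pname}_{xname}_{preffix}{suffix}'

    if pname in l_float:
        out = f'{out}_flt'           # mark parameters allowed to float

    return out

# Shared 'mu' ignores the component tag and suffix entirely
shared_name = get_parameter_name('mu_cbr', 'signal', '_1', l_shared=['mu'], l_float=[])
# Non-shared, floating 'sg' keeps everything and gains the '_flt' marker
float_name  = get_parameter_name('sg_cbr', 'signal', '_1', l_shared=['mu'], l_float=['sg'])
```

Because the shared name is suffix-independent, `_get_parameter` finds it in its cache on the second lookup and every component ends up holding the same zfit parameter object.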
dmu/testing/utilities.py CHANGED
@@ -2,6 +2,7 @@
  '''
  Module containing utility functions needed by unit tests
  '''
  import os
+ import math
  from typing import Union
  from dataclasses import dataclass
  from importlib.resources import files
@@ -21,56 +22,64 @@ class Data:
      '''
      Class storing shared data
      '''
-     nentries = 3000
  # -------------------------------
- def _double_data(d_data : dict) -> dict:
-     df_1 = pnd.DataFrame(d_data)
-     df_2 = pnd.DataFrame(d_data)
-
+ def _double_data(df_1 : pnd.DataFrame) -> pnd.DataFrame:
+     df_2 = df_1.copy()
      df   = pnd.concat([df_1, df_2], axis=0)

-     d_data = { name : df[name].to_numpy() for name in df.columns }
-
-     return d_data
+     return df
  # -------------------------------
- def _add_nans(d_data : dict) -> dict:
-     df_good = pnd.DataFrame(d_data)
-     df_bad  = pnd.DataFrame(d_data)
-     df_bad[:] = numpy.nan
+ def _add_nans(df : pnd.DataFrame, columns : list[str]) -> pnd.DataFrame:
+     size = len(df) * 0.2
+     size = math.floor(size)
+
+     l_col = df.columns.tolist()
+     if columns is None:
+         l_col_index = range(len(l_col))
+     else:
+         l_col_index = [ l_col.index(column) for column in columns ]

-     df     = pnd.concat([df_good, df_bad])
-     d_data = { name : df[name].to_numpy() for name in df.columns }
+     log.debug(f'Replacing randomly with {size} NaNs')
+     for _ in range(size):
+         irow = numpy.random.randint(0, df.shape[0]) # Random row index
+         icol = numpy.random.choice(l_col_index)     # Random column index

-     return d_data
+         df.iat[irow, icol] = numpy.nan
+
+     return df
  # -------------------------------
  def get_rdf(kind     : Union[str,None] = None,
              repeated : bool            = False,
-             add_nans : bool            = False):
+             nentries : int             = 3_000,
+             add_nans : list[str]       = None):
      '''
      Return ROOT dataframe with toy data
      '''
+
      d_data = {}
      if kind == 'sig':
-         d_data['w'] = numpy.random.normal(0, 1, size=Data.nentries)
-         d_data['x'] = numpy.random.normal(0, 1, size=Data.nentries)
-         d_data['y'] = numpy.random.normal(0, 1, size=Data.nentries)
-         d_data['z'] = numpy.random.normal(0, 1, size=Data.nentries)
+         d_data['w'] = numpy.random.normal(0, 1, size=nentries)
+         d_data['x'] = numpy.random.normal(0, 1, size=nentries)
+         d_data['y'] = numpy.random.normal(0, 1, size=nentries)
+         d_data['z'] = numpy.random.normal(0, 1, size=nentries)
      elif kind == 'bkg':
-         d_data['w'] = numpy.random.normal(1, 1, size=Data.nentries)
-         d_data['x'] = numpy.random.normal(1, 1, size=Data.nentries)
-         d_data['y'] = numpy.random.normal(1, 1, size=Data.nentries)
-         d_data['z'] = numpy.random.normal(1, 1, size=Data.nentries)
+         d_data['w'] = numpy.random.normal(1, 1, size=nentries)
+         d_data['x'] = numpy.random.normal(1, 1, size=nentries)
+         d_data['y'] = numpy.random.normal(1, 1, size=nentries)
+         d_data['z'] = numpy.random.normal(1, 1, size=nentries)
      else:
          log.error(f'Invalid kind: {kind}')
          raise ValueError

+     df = pnd.DataFrame(d_data)
+
      if repeated:
-         d_data = _double_data(d_data)
+         df = _double_data(df)

      if add_nans:
-         d_data = _add_nans(d_data)
+         df = _add_nans(df, columns=add_nans)

-     rdf = RDF.FromNumpy(d_data)
+     rdf = RDF.FromPandas(df)

      return rdf
  # -------------------------------
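`_add_nans` now scatters NaNs over randomly chosen cells (about 20% of the row count), optionally restricted to given columns, instead of appending a fully-NaN copy of the dataframe. A self-contained sketch of the same cell-level injection, assuming pandas and numpy are available; the `frac` and `seed` parameters are additions here for illustration, not part of the package API:

```python
import math

import numpy
import pandas as pnd

def add_nans(df, columns=None, frac=0.2, seed=42):
    # Replace a number of randomly chosen cells with NaN; the number is
    # frac * len(df), and the same cell may be drawn more than once.
    rng  = numpy.random.default_rng(seed)
    size = math.floor(len(df) * frac)

    l_col = df.columns.tolist()
    if columns is None:
        l_col_index = list(range(len(l_col)))
    else:
        l_col_index = [ l_col.index(column) for column in columns ]

    for _ in range(size):
        irow = int(rng.integers(0, df.shape[0]))  # random row index
        icol = int(rng.choice(l_col_index))       # random column index

        df.iat[irow, icol] = numpy.nan

    return df

df = pnd.DataFrame({'x': numpy.zeros(100), 'y': numpy.zeros(100)})
df = add_nans(df, columns=['x'])   # only column 'x' receives NaNs
```

Because draws can repeat a cell, the resulting NaN count is at most, not exactly, `frac * len(df)`.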
@@ -1,6 +1,7 @@
  dataset:
    nan :
-     x : 0
+     x : -3
+     y : -3
    training :
      nfold    : 3
      features : [x, y, z]
@@ -49,4 +50,3 @@ plotting:
      binning : [-4, 4, 100]
      yscale  : 'linear'
      labels  : ['z', '']
-
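The `dataset.nan` section above maps feature names to replacement values (here NaNs in both `x` and `y` become -3). In pandas terms this presumably corresponds to a per-column `fillna` with a dict, sketched here under that assumption:

```python
import numpy
import pandas as pnd

# Per-column NaN replacements, mirroring the dataset.nan block above
d_nan = {'x': -3, 'y': -3}

df = pnd.DataFrame({
    'x': [1.0, numpy.nan],
    'y': [numpy.nan, 2.0],
    'z': [numpy.nan, 4.0],
})

# Only the columns listed in the mapping are filled; 'z' keeps its NaN
df = df.fillna(d_nan)
```

Filling with a sentinel like -3, well outside the feature binning, keeps the rows usable for training while making imputed values distinguishable.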
@@ -1,13 +1,17 @@
  saving:
-     plt_dir : tests/plotting/2d_weighted
+     plt_dir : /tmp/dmu/tests/plotting/2d_weighted
+ selection:
+     cuts:
+         xlow : x > -1.5
  definitions:
      z : x + y
  general:
      size : [20, 10]
  plots_2d:
-     - [x, y, weights, 'xy_w']
-     - [x, y, null,    'xy_r']
-     - [x, z, null,    'xz_r']
+     - [x, y, weights, 'xy_wgt', false]
+     - [x, y, null,    'xy_raw', false]
+     - [x, z, null,    'xz_raw', false]
+     - [x, z, null,    'xz_log', true]
  axes:
      x :
          binning : [-3.0, 3.0, 40]
@@ -0,0 +1,12 @@
+ saving:
+     plt_dir : tests/plotting/legend
+ general:
+     size : [20, 10]
+ plots:
+     x :
+         binning : [-5.0, 8.0, 40]
+     y :
+         binning : [-5.0, 8.0, 40]
+ style:
+     legend:
+         bbox_to_anchor : [1.2, 1]
@@ -0,0 +1,9 @@
+ saving:
+     plt_dir : tests/plotting/stats
+ plots:
+     x :
+         binning : [-5.0, 8.0, 40]
+     y :
+         binning : [-5.0, 8.0, 40]
+ stats:
+     nentries : '{:.2e}'
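The `stats` block in the new config above attaches an entry-count label to each plot, with `nentries` given as a Python format specification. `'{:.2e}'` is the standard format-spec mini-language for scientific notation with two digits after the decimal point:

```python
# '{:.2e}' renders a number in scientific notation with two decimals
nentries = 24_000
label    = '{:.2e}'.format(nentries)   # -> '2.40e+04'
```

Using a format spec in the config keeps the label style configurable without touching the plotting code.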