data-manipulation-utilities 0.2.3__py3-none-any.whl → 0.2.5__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: data_manipulation_utilities
-Version: 0.2.3
+Version: 0.2.5
 Description-Content-Type: text/markdown
 Requires-Dist: logzero
 Requires-Dist: PyYAML
@@ -26,7 +26,7 @@ These are tools that can be used for different data analysis tasks.
 
 ## Pushing
 
-From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
+From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
 using a `pyproject.toml` file, run:
 
 ```bash
@@ -36,10 +36,10 @@ publish
 such that:
 
 1. The `pyproject.toml` file is checked and the version of the project is extracted.
-1. If a tag named as the version exists move to the steps below.
+1. If a tag named as the version exists move to the steps below.
 1. If it does not, make a new tag with the name as the version
 
-Then, for each remote it pushes the tags and the commits.
+Then, for each remote it pushes the tags and the commits.
 
 *Why?*
 
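The version-extraction step described above can be sketched in Python; this is a simplified stand-in for what `publish` does, not the script itself, and the `pyproject.toml` fragment and tag list are made up:

```python
import re

# Minimal pyproject.toml fragment; the real script would read this from disk
pyproject = '''
[project]
name = "data_manipulation_utilities"
version = "0.2.5"
'''

# Extract the project version, from which the tag name is derived
mtch = re.search(r'^version\s*=\s*"([^"]+)"', pyproject, flags=re.MULTILINE)
version = mtch.group(1)

# If no tag named after the version exists, the tool would create it,
# then push tags and commits to each remote
tag_exists = version in {'0.2.3', '0.2.4'}  # stand-in for the output of `git tag`
```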
@@ -137,7 +137,17 @@ pdf = mod.get_pdf()
 ```
 
 where the model is a sum of three `CrystalBall` PDFs, one with a right tail and two with a left tail.
-The `mu` and `sg` parameters are shared.
+The `mu` and `sg` parameters are shared. The elementary components that can be plugged in are:
+
+```
+exp   : Exponential
+pol1  : Polynomial of degree 1
+pol2  : Polynomial of degree 2
+cbr   : CrystalBall with right tail
+cbl   : CrystalBall with left tail
+gauss : Gaussian
+dscb  : Double-sided CrystalBall
+```
 
 ### Printing PDFs
 
@@ -299,7 +309,7 @@ this will:
 - Try fitting at most 10 times
 - After each fit, calculate the goodness of fit (in this case the p-value)
 - Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
-- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
+- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
   randomize the parameters and try again.
 - If the desired goodness of fit has not been achieved, pick the best result.
 - Return the `FitResult` object and set the PDF to the final fit result.
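The retry strategy above can be sketched generically. This is a simplified stand-in, not the `Fitter` implementation: the real fitter also checks convergence and validity, and `fit_once` here is a toy callable returning random p-values:

```python
import random

def fit_with_retries(fit_once, max_tries=10, pval_threshold=0.05):
    """Keep the best p-value seen; stop early once the goodness of fit is acceptable."""
    best = None
    for _ in range(max_tries):
        pval, result = fit_once()
        if best is None or pval > best[0]:
            best = (pval, result)          # remember the best attempt so far
        if pval > pval_threshold:
            break                          # good enough, stop retrying
        # a real fitter would randomize the starting parameters here
    return best

# Toy "fit": random p-values stand in for real minimizations
random.seed(0)
pval, result = fit_with_retries(lambda: (random.random(), 'fit_result'))
```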
@@ -337,11 +347,11 @@ bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
 nbk = zfit.Parameter('nbk', 1000, 0, 10000)
 ebkg= bkg.create_extended(nbk, name='expo')
 
-# Add them
+# Add them
 pdf = zfit.pdf.SumPDF([ebkg, esig])
 sam = pdf.create_sampler()
 
-# Plot them
+# Plot them
 obj = ZFitPlotter(data=sam, model=pdf)
 d_leg = {'gauss': 'New Gauss'}
 obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
@@ -353,7 +363,7 @@ obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
 this class supports:
 
 - Handling title, legend, plots size.
-- Adding pulls.
+- Adding pulls.
 - Stacking and overlaying of PDFs.
 - Blinding.
 
@@ -434,7 +444,7 @@ dataset:
   nan:
     x : 0
     y : 0
-    z : -999
+    z : -999
   training :
     nfold    : 10
     features : [x, y, z]
@@ -497,7 +507,7 @@ When training on real data, several things might go wrong and the code will try
 will end up in different folds. The tool checks whether a model is evaluated for an entry that was used for training and raises an exception. Thus, repeated
 entries will be removed before training.
 
-- **NaNs**: Entries with NaNs will break the training with the scikit-learn `GradientBoostingClassifier` base class. Thus, we:
+- **NaNs**: Entries with NaNs will break the training with the scikit-learn `GradientBoostingClassifier` base class. Thus, we:
   - Can use the `nan` section shown above to replace `NaN` values with something else
   - For whatever remains, we remove the entries from the training.
 
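The two-step NaN policy described above (replace configured columns, drop whatever remains) can be sketched with pandas; the column names and replacement values below are illustrative, mirroring the `nan:` section of the config:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [np.nan, 2.0, 3.0],
                   'z': [1.0, 2.0, np.nan]})

# Step 1: replace NaNs in configured columns, as the `nan:` section would
for name, val in {'x': 0, 'y': 0, 'z': -999}.items():
    df[name] = df[name].fillna(val)

# Step 2: any entry still containing NaNs would be dropped before training
df = df.dropna()
```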
@@ -674,6 +684,9 @@ ptr.run()
 where the config dictionary `cfg_dat` in YAML would look like:
 
 ```yaml
+general:
+    # This will set the figure size
+    size : [20, 10]
 selection:
     # Will do at most 50K random entries. Will only happen if the dataset has more than 50K entries
     max_ran_entries : 50000
@@ -703,6 +716,16 @@ plots:
     yscale : 'linear'
     labels : ['x + y', 'Entries']
     normalized : true # This should normalize to the area
+# Some vertical dashed lines are drawn by default
+# If you see them, you can turn them off with this
+style:
+    skip_lines : true
+    # This passes arguments to the legend-making function `plt.legend()` in matplotlib
+    legend:
+        # The line below would place the legend outside the figure to avoid overlaps with the histogram
+        bbox_to_anchor : [1.2, 1]
+stats:
+    nentries : '{:.2e}' # This will add the number of entries to the legend box
 ```
 
 It's up to the user to build this dictionary and load it.
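The `nentries` value is a plain Python format string applied to the entry count and appended to each legend label, e.g.:

```python
# '{:.2e}' renders the number of entries in scientific notation
form = '{:.2e}'
label = 'signal' + form.format(3000)
```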
@@ -724,14 +747,19 @@ The config would look like:
 ```yaml
 saving:
     plt_dir : tests/plotting/2d
+selection:
+    cuts:
+        xlow : x > -1.5
 general:
     size : [20, 10]
 plots_2d:
     # Column x and y
     # Name of column where weights are, null for no weights
     # Name of output plot, e.g. xy_x.png
-    - [x, y, weights, 'xy_w']
-    - [x, y, null, 'xy_r']
+    # Bool signaling whether to use log scale for the z axis
+    - [x, y, weights, 'xy_w', false]
+    - [x, y, null, 'xy_r', false]
+    - [x, y, null, 'xy_l', true]
 axes:
     x :
         binning : [-5.0, 8.0, 40]
@@ -823,7 +851,7 @@ Directory/Treename
 B_ENDVERTEX_CHI2DOF Double_t
 ```
 
-## Comparing ROOT files
+## Comparing ROOT files
 
 Given two ROOT files, the command below:
 
@@ -856,6 +884,40 @@ Trees only in file_2.root:
 - Hlt2RD_BsToPhiMuMu_MVA/DecayTree
 ```
 
+# File system
+
+## Versions
+
+The utilities below allow the user to deal with versioned files and directories:
+
+```python
+from dmu.generic.version_management import get_last_version
+from dmu.generic.version_management import get_next_version
+from dmu.generic.version_management import get_latest_file
+
+# get_next_version will take a version and provide the next one, e.g.
+get_next_version('v1')           # -> 'v2'
+get_next_version('v1.1')         # -> 'v2.1'
+get_next_version('v10.1')        # -> 'v11.1'
+
+get_next_version('/a/b/c/v1')    # -> '/a/b/c/v2'
+get_next_version('/a/b/c/v1.1')  # -> '/a/b/c/v2.1'
+get_next_version('/a/b/c/v10.1') # -> '/a/b/c/v11.1'
+
+# `get_latest_file` will return the path to the file with the highest version
+# in the `dir_path` directory that matches a wildcard, e.g.:
+
+last_file = get_latest_file(dir_path = file_dir, wc='name_*.txt')
+
+# `get_last_version` will return the string with the latest version
+# of directories in `dir_path`, e.g.:
+
+oversion = get_last_version(dir_path=dir_path, version_only=True)  # Only the version, e.g. v3.2
+oversion = get_last_version(dir_path=dir_path, version_only=False) # Full path, e.g. /a/b/c/v3.2
+```
+
+The functions above should work for numeric (e.g. `v1.2`) and non-numeric (e.g. `va`, `vb`) versions.
+
 # Text manipulation
 
 ## Transformations
@@ -1,16 +1,17 @@
1
- data_manipulation_utilities-0.2.3.data/scripts/publish,sha256=-3K_Y2_4CfWCV50rPB8CRuhjxDu7xMGswinRwPovgLs,1976
1
+ data_manipulation_utilities-0.2.5.data/scripts/publish,sha256=-3K_Y2_4CfWCV50rPB8CRuhjxDu7xMGswinRwPovgLs,1976
2
2
  dmu/arrays/utilities.py,sha256=PKoYyybPptA2aU-V3KLnJXBudWxTXu4x1uGdIMQ49HY,1722
3
3
  dmu/generic/utilities.py,sha256=0Xnq9t35wuebAqKxbyAiMk1ISB7IcXK4cFH25MT1fgw,1741
4
+ dmu/generic/version_management.py,sha256=G_HjGY-hu8lotZuTdVAg0B8yD0AltE866q2vJxvTg1g,3749
4
5
  dmu/logging/log_store.py,sha256=umdvjNDuV3LdezbG26b0AiyTglbvkxST19CQu9QATbA,4184
5
- dmu/ml/cv_classifier.py,sha256=8Jwx6xMhJaRLktlRdq0tFl32v6t8i63KmpxrlnXlomU,3759
6
- dmu/ml/cv_predict.py,sha256=4G7F_1yOvnLftsDC6zUpdvkxuHXGkPemhj0RsYySYDM,6708
7
- dmu/ml/train_mva.py,sha256=SZ5cQHl7HBxn0c5Hh4HlN1aqMZaJUAlNmsfjnUSQrTg,16894
8
- dmu/ml/utilities.py,sha256=l348bufD95CuSYdIrHScQThIy2nKwGKXZn-FQg3CEwg,3930
6
+ dmu/ml/cv_classifier.py,sha256=ZbzEm_jW9yoTC7k_xBA7hFpc1bDNayiVR3tbaj1_ieE,4228
7
+ dmu/ml/cv_predict.py,sha256=4wwYL_jcUExDqLJVfClxEUWSd_QAx8yKHO3rX-mx4vw,6711
8
+ dmu/ml/train_mva.py,sha256=XzXE92PzyF3cjlx5yMhtp5h4t7wzisRAyO1fBArssvc,17282
9
+ dmu/ml/utilities.py,sha256=PK_61fW7gBV9aGZyez3PI8zAT7_Fc6IlQzDB7f8iBTM,4133
9
10
  dmu/pdataframe/utilities.py,sha256=ypvLiFfJ82ga94qlW3t5dXnvEFwYOXnbtJb2zHwsbqk,987
10
11
  dmu/plotting/matrix.py,sha256=pXuUJn-LgOvrI9qGkZQw16BzLjOjeikYQ_ll2VIcIXU,4978
11
- dmu/plotting/plotter.py,sha256=ytMxtzHEY8ZFU0ZKEBE-ROjMszXl5kHTMnQnWe173nU,7208
12
- dmu/plotting/plotter_1d.py,sha256=g6H2xAgsL9a6vRkpbqHICb3qwV_qMiQPZxxw_oOSf9M,5115
13
- dmu/plotting/plotter_2d.py,sha256=J-gKnagoHGfJFU7HBrhDFpGYH5Rxy0_zF5l8eE_7ZHE,2944
12
+ dmu/plotting/plotter.py,sha256=3WRbNOrFBWgI3iW5TbEgT4w_eF7-XUPs_32JL1AW3yY,7359
13
+ dmu/plotting/plotter_1d.py,sha256=2AnVxulyhKtwN-2Srhfm6fqdEREZNhcpJolBsJrWcsc,5745
14
+ dmu/plotting/plotter_2d.py,sha256=mZhp3D5I-JodOnFTEF1NqHtcLtuI-2WNpCQsrsoXNtw,3017
14
15
  dmu/plotting/utilities.py,sha256=SI9dvtZq2gr-PXVz71KE4o0i09rZOKgqJKD1jzf6KXk,1167
15
16
  dmu/rdataframe/atr_mgr.py,sha256=FdhaQWVpsm4OOe1IRbm7rfrq8VenTNdORyI-lZ2Bs1M,2386
16
17
  dmu/rdataframe/utilities.py,sha256=pNcQARMP7txMhy6k27UnDcYf0buNy5U2fshaJDl_h8o,3661
@@ -20,20 +21,22 @@ dmu/stats/fitter.py,sha256=vHNZ16U3apoQyeyM8evq-if49doF48sKB3q9wmA96Fw,18387
20
21
  dmu/stats/function.py,sha256=yzi_Fvp_ASsFzbWFivIf-comquy21WoeY7is6dgY0Go,9491
21
22
  dmu/stats/gof_calculator.py,sha256=4EN6OhULcztFvsAZ00rxgohJemnjtDNB5o0IBcv6kbk,4657
22
23
  dmu/stats/minimizers.py,sha256=f9cilFY9Kp9UvbSIUsKBGFzOOg7EEWZJLPod-4k-LAQ,6216
23
- dmu/stats/model_factory.py,sha256=LyDOf0f9I5dNUTS0MXHtSivD8aAcTLIagvMPtoXtThk,7426
24
+ dmu/stats/model_factory.py,sha256=ixWnhE8gPiOYW5pCb3eoVIaSvbUopEx4ldkZ3xL54Xg,7714
24
25
  dmu/stats/utilities.py,sha256=LQy4kd3xSXqpApcWuYfZxkGQyjowaXv2Wr1c4Bj-4ys,4523
25
26
  dmu/stats/zfit_plotter.py,sha256=Xs6kisNEmNQXhYRCcjowxO6xHuyAyrfyQIFhGAR61U4,19719
26
- dmu/testing/utilities.py,sha256=WbMM4e9Cn3-B-12Vr64mB5qTKkV32joStlRkD-48lG0,3460
27
+ dmu/testing/utilities.py,sha256=moImLqGX9LAt5zJtE5j0gHHkUJ5kpbodryhiVswOsyM,3696
27
28
  dmu/text/transformer.py,sha256=4lrGknbAWRm0-rxbvgzOO-eR1-9bkYk61boJUEV3cQ0,6100
28
29
  dmu_data/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
29
- dmu_data/ml/tests/train_mva.yaml,sha256=k5H4Gu9Gj57B9iqabhcTQEFN674Cv_uJ2Xcumb02zF4,1279
30
- dmu_data/plotting/tests/2d.yaml,sha256=VApcAfJFbjNcjMCTBSRm2P37MQlGavMZv6msbZwLSgw,402
30
+ dmu_data/ml/tests/train_mva.yaml,sha256=o0ZIe43qPC-KwLT9y1qfYYw2bbOLlJIKRkCMUnM5sBo,1177
31
+ dmu_data/plotting/tests/2d.yaml,sha256=HSAtER-8CEqIGBY_jdcIdSVOHMfYPYhmgeZghTpVYh8,516
31
32
  dmu_data/plotting/tests/fig_size.yaml,sha256=7ROq49nwZ1A2EbPiySmu6n3G-Jq6YAOkc3d2X3YNZv0,294
32
33
  dmu_data/plotting/tests/high_stat.yaml,sha256=bLglBLCZK6ft0xMhQ5OltxE76cWsBMPMjO6GG0OkDr8,522
34
+ dmu_data/plotting/tests/legend.yaml,sha256=wGpj58ig-GOlqbWoN894zrCet2Fj9f5QtY0rig_UC-c,213
33
35
  dmu_data/plotting/tests/name.yaml,sha256=mkcPAVg8wBAmlSbSRQ1bcaMl4vOS6LXMtpqQeDrrtO4,312
34
36
  dmu_data/plotting/tests/no_bounds.yaml,sha256=8e1QdphBjz-suDr857DoeUC2DXiy6SE-gvkORJQYv80,257
35
37
  dmu_data/plotting/tests/normalized.yaml,sha256=Y0eKtyV5pvlSxvqfsLjytYtv8xYF3HZ5WEdCJdeHGQI,193
36
38
  dmu_data/plotting/tests/simple.yaml,sha256=N_TvNBh_2dU0-VYgu_LMrtY0kV_hg2HxVuEoDlr1HX8,138
39
+ dmu_data/plotting/tests/stats.yaml,sha256=fSZjoV-xPnukpCH2OAXsz_SNPjI113qzDg8Ln3spaaA,165
37
40
  dmu_data/plotting/tests/title.yaml,sha256=bawKp9aGpeRrHzv69BOCbFX8sq9bb3Es9tdsPTE7jIk,333
38
41
  dmu_data/plotting/tests/weights.yaml,sha256=RWQ1KxbCq-uO62WJ2AoY4h5Umc37zG35s-TpKnNMABI,312
39
42
  dmu_data/text/transform.toml,sha256=R-832BZalzHZ6c5gD6jtT_Hj8BCsM5vxa1v6oeiwaP4,94
@@ -47,8 +50,8 @@ dmu_scripts/rfile/compare_root_files.py,sha256=T8lDnQxsRNMr37x1Y7YvWD8ySHrJOWZki
47
50
  dmu_scripts/rfile/print_trees.py,sha256=Ze4Ccl_iUldl4eVEDVnYBoe4amqBT1fSBR1zN5WSztk,941
48
51
  dmu_scripts/ssh/coned.py,sha256=lhilYNHWRCGxC-jtyJ3LQ4oUgWW33B2l1tYCcyHHsR0,4858
49
52
  dmu_scripts/text/transform_text.py,sha256=9akj1LB0HAyopOvkLjNOJiptZw5XoOQLe17SlcrGMD0,1456
50
- data_manipulation_utilities-0.2.3.dist-info/METADATA,sha256=STJ7vYfcSIM9dtMRzywGLwDzH1sUBE5DL9FqvskMcxo,27923
51
- data_manipulation_utilities-0.2.3.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
52
- data_manipulation_utilities-0.2.3.dist-info/entry_points.txt,sha256=1TIZDed651KuOH-DgaN5AoBdirKmrKE_oM1b6b7zTUU,270
53
- data_manipulation_utilities-0.2.3.dist-info/top_level.txt,sha256=n_x5J6uWtSqy9mRImKtdA2V2NJNyU8Kn3u8DTOKJix0,25
54
- data_manipulation_utilities-0.2.3.dist-info/RECORD,,
53
+ data_manipulation_utilities-0.2.5.dist-info/METADATA,sha256=d8rJbrtHEg_fOma5NA5qL4ox8bP4MaIV0mbyl6uRiJs,30104
54
+ data_manipulation_utilities-0.2.5.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
55
+ data_manipulation_utilities-0.2.5.dist-info/entry_points.txt,sha256=1TIZDed651KuOH-DgaN5AoBdirKmrKE_oM1b6b7zTUU,270
56
+ data_manipulation_utilities-0.2.5.dist-info/top_level.txt,sha256=n_x5J6uWtSqy9mRImKtdA2V2NJNyU8Kn3u8DTOKJix0,25
57
+ data_manipulation_utilities-0.2.5.dist-info/RECORD,,
@@ -0,0 +1,132 @@
+'''
+Module containing functions used to find latest, next version, etc of a path.
+'''
+
+import glob
+import os
+import re
+
+from dmu.logging.log_store import LogStore
+
+log = LogStore.add_logger('dmu:version_management')
+# ---------------------------------------
+def _get_numeric_version(version : str) -> int:
+    '''
+    Takes string with numbers at the end (padded or not)
+    Returns integer version of numbers
+    '''
+    # Skip these directories
+    if version in ['__pycache__']:
+        return -1
+
+    regex = r'[a-z]+(\d+)'
+    mtch  = re.match(regex, version)
+    if not mtch:
+        log.debug(f'Cannot extract numeric version from: {version}')
+        return -1
+
+    str_val = mtch.group(1)
+    val     = int(str_val)
+
+    return val
+# ---------------------------------------
+def get_last_version(dir_path : str, version_only : bool = True, main_only : bool = False):
+    '''Returns path or just version associated to latest version found in given path
+
+    Parameters
+    ---------------------
+    dir_path (str)     : Path to directory where versioned subdirectories exist
+    version_only (bool): Returns only vxxxx if True, otherwise, full path to directory
+    main_only (bool)   : Returns vX where X is a number. Otherwise it will return vx.y in case version has subversion
+    '''
+    l_obj = glob.glob(f'{dir_path}/*')
+
+    if len(l_obj) == 0:
+        log.error(f'Nothing found in {dir_path}')
+        raise ValueError
+
+    d_dir_org = { os.path.basename(obj).replace('.', '') : obj for obj in l_obj if os.path.isdir(obj) }
+    d_dir_num = { _get_numeric_version(name) : dir_path for name, dir_path in d_dir_org.items() }
+
+    c_dir = sorted(d_dir_num.items())
+
+    try:
+        _, path = c_dir[-1]
+    except:
+        log.error(f'Cannot find path in: {dir_path}')
+        raise
+
+    name = os.path.basename(path)
+    dirn = os.path.dirname(path)
+
+    if main_only and '.' in name:
+        ind  = name.index('.')
+        name = name[:ind]
+
+    if version_only:
+        return name
+
+    return f'{dirn}/{name}'
+# ---------------------------------------
+def get_latest_file(dir_path : str, wc : str) -> str:
+    '''Will find latest file in a given directory
+
+    Parameters
+    --------------------
+    dir_path (str): Directory where files are found
+    wc (str)      : Wildcard associated to files, e.g. file_*.txt
+
+    Returns
+    --------------------
+    Path to latest file, according to version
+    '''
+    l_path = glob.glob(f'{dir_path}/{wc}')
+    if len(l_path) == 0:
+        log.error(f'Cannot find files in: {dir_path}/{wc}')
+        raise ValueError
+
+    l_path.sort()
+
+    return l_path[-1]
+# ---------------------------------------
+def get_next_version(version : str) -> str:
+    '''Pick up string symbolizing version and return next version
+    Parameters
+    -------------------------
+    version (str) : Of the form vx.y or vx where x and y are integers. It can also be a full path
+
+    Returns
+    -------------------------
+    String equal to the argument, but with the main version augmented by 1, e.g. vx+1.y
+
+    Examples:
+    -------------------------
+
+    get_next_version('v1.1') = 'v2.1'
+    get_next_version('v1'  ) = 'v2'
+    '''
+    if '/' in version:
+        path    = version
+        dirname = os.path.dirname(path)
+        version = os.path.basename(path)
+    else:
+        dirname = None
+
+    rgx = r'v(\d+)(\.\d+)?'
+
+    mtch = re.match(rgx, version)
+    if not mtch:
+        log.error(f'Cannot match {version} with {rgx}')
+        raise ValueError
+
+    ver_org = mtch.group(1)
+    ver_nxt = int(ver_org) + 1
+    ver_nxt = str(ver_nxt)
+
+    version = version.replace(f'v{ver_org}', f'v{ver_nxt}')
+
+    if dirname is not None:
+        version = f'{dirname}/{version}'
+
+    return version
+# ---------------------------------------
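The core of `get_next_version` above (bump the major number, keep any subversion and directory prefix) can be exercised standalone; this is a simplified sketch of the same logic without the logging and module dependencies:

```python
import os
import re

def next_version(version: str) -> str:
    # Split off a directory prefix, if any, so only the basename is matched
    dirname = None
    if '/' in version:
        dirname = os.path.dirname(version)
        version = os.path.basename(version)

    # Same pattern as in the module: v<major> with an optional .<minor>
    mtch = re.match(r'v(\d+)(\.\d+)?', version)
    if not mtch:
        raise ValueError(f'Cannot parse version: {version}')

    major = int(mtch.group(1))
    version = version.replace(f'v{major}', f'v{major + 1}')

    return version if dirname is None else f'{dirname}/{version}'
```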
dmu/ml/cv_classifier.py CHANGED
@@ -1,15 +1,15 @@
 '''
 Module holding cv_classifier class
 '''
-
+import os
 from typing import Union
 from sklearn.ensemble import GradientBoostingClassifier
 
+import yaml
 from dmu.logging.log_store import LogStore
 import dmu.ml.utilities as ut
 
 log = LogStore.add_logger('dmu:ml:CVClassifier')
-
 # ---------------------------------------
 class CVSameData(Exception):
     '''
@@ -61,6 +61,20 @@ class CVClassifier(GradientBoostingClassifier):
 
         return self._cfg
     # ----------------------------------
+    def save_cfg(self, path : str):
+        '''
+        Will save configuration used to train this classifier to YAML
+
+        path: Path to YAML file
+        '''
+        dir_name = os.path.dirname(path)
+        os.makedirs(dir_name, exist_ok=True)
+
+        with open(path, 'w', encoding='utf-8') as ofile:
+            yaml.safe_dump(self._cfg, ofile, indent=2)
+
+        log.info(f'Saved config to: {path}')
+    # ----------------------------------
     def __str__(self):
         nhash = len(self._s_hash)
 
dmu/ml/cv_predict.py CHANGED
@@ -73,11 +73,11 @@ class CVPredict:
             log.debug('Not doing any NaN replacement')
             return df
 
-        log.debug(60 * '-')
+        log.info(60 * '-')
         log.info('Doing NaN replacements')
-        log.debug(60 * '-')
+        log.info(60 * '-')
         for var, val in self._d_nan_rep.items():
-            log.debug(f'{var:<20}{"--->":20}{val:<20.3f}')
+            log.info(f'{var:<20}{"--->":20}{val:<20.3f}')
             df[var] = df[var].fillna(val)
 
         return df
@@ -155,7 +155,7 @@ class CVPredict:
         ndif = len(s_dif_hash)
         ndat = len(s_dat_hash)
         nmod = len(s_mod_hash)
-        log.debug(f'{ndif:<20}{"=":10}{ndat:<20}{"-":10}{nmod:<20}')
+        log.debug(f'{ndif:<10}{"=":5}{ndat:<10}{"-":5}{nmod:<10}')
 
         df_ft_group = df_ft.loc[df_ft.index.isin(s_dif_hash)]
 
@@ -173,7 +173,7 @@ class CVPredict:
             return arr_prb
 
         nentries = len(self._arr_patch)
-        log.warning(f'Patching {nentries} probabilities')
+        log.warning(f'Patching {nentries} probabilities with -1')
         arr_prb[self._arr_patch] = -1
 
         return arr_prb
dmu/ml/train_mva.py CHANGED
@@ -69,14 +69,20 @@ class TrainMva:
         return df, arr_lab
     # ---------------------------------------------
     def _pre_process_nans(self, df : pnd.DataFrame) -> pnd.DataFrame:
+        if 'dataset' not in self._cfg:
+            return df
+
         if 'nan' not in self._cfg['dataset']:
             log.debug('dataset/nan section not found, not pre-processing NaNs')
             return df
 
         d_name_val = self._cfg['dataset']['nan']
-        for name, val in d_name_val.items():
-            log.debug(f'{val:<20}{"<---":<10}{name:<100}')
-            df[name] = df[name].fillna(val)
+        log.info(60 * '-')
+        log.info('Doing NaN replacements')
+        log.info(60 * '-')
+        for var, val in d_name_val.items():
+            log.info(f'{var:<20}{"--->":20}{val:<20.3f}')
+            df[var] = df[var].fillna(val)
 
         return df
     # ---------------------------------------------
@@ -406,6 +412,9 @@ class TrainMva:
         self._save_hyperparameters_to_tex()
     # ---------------------------------------------
     def _save_nan_conversion(self) -> None:
+        if 'dataset' not in self._cfg:
+            return
+
         if 'nan' not in self._cfg['dataset']:
             log.debug('NaN section not found, not saving it')
             return
@@ -434,13 +443,18 @@ class TrainMva:
         os.makedirs(val_dir, exist_ok=True)
         put.df_to_tex(df, f'{val_dir}/hyperparameters.tex')
     # ---------------------------------------------
-    def run(self):
+    def run(self, skip_fit : bool = False) -> None:
         '''
         Will do the training
+
+        skip_fit: By default False; if True, it will only do the plots of features and save tables
         '''
         self._save_settings_to_tex()
         self._plot_features()
 
+        if skip_fit:
+            return
+
         l_mod = self._get_models()
         for ifold, mod in enumerate(l_mod):
             self._save_model(mod, ifold)
dmu/ml/utilities.py CHANGED
@@ -16,7 +16,7 @@ log = LogStore.add_logger('dmu:ml:utilities')
 # ---------------------------------------------
 def patch_and_tag(df : pnd.DataFrame, value : float = 0) -> pnd.DataFrame:
     '''
-    Takes panda dataframe, replaces NaNs with value introduced, by default 0
+    Takes pandas dataframe, replaces NaNs with value introduced, by default 0
     Returns array of indices where the replacement happened
     '''
     l_nan = df.index[df.isna().any(axis=1)].tolist()
@@ -25,7 +25,13 @@ def patch_and_tag(df : pnd.DataFrame, value : float = 0) -> pnd.DataFrame:
         log.debug('No NaNs found')
         return df
 
-    log.warning(f'Found {nnan} NaNs, patching them with {value}')
+    log.warning(f'Found {nnan} NaNs')
+
+    df_nan_frq = df.isna().sum()
+    df_nan_frq = df_nan_frq[df_nan_frq > 0]
+    print(df_nan_frq)
+
+    log.warning(f'Attaching array with NaN {nnan} indexes and removing NaNs from dataframe')
 
     df_pa = df.fillna(value)
 
@@ -57,7 +63,7 @@ def _remove_nans(df : pnd.DataFrame) -> pnd.DataFrame:
     log.info('Found columns with NaNs')
     for name in l_na_name:
         nan_count = df[name].isna().sum()
-        log.info(f'{nan_count:<10}{name:<100}')
+        log.info(f'{nan_count:<10}{name}')
 
     ninit = len(df)
     df    = df.dropna()
@@ -75,10 +81,10 @@ def _remove_repeated(df : pnd.DataFrame) -> pnd.DataFrame:
     nfinl = len(s_hash)
 
     if ninit == nfinl:
-        log.debug('No cleaning needed for dataframe')
+        log.debug('No overlap between training and application found')
        return df
 
-    log.warning(f'Repeated entries found, cleaning up: {ninit} -> {nfinl}')
+    log.warning(f'Overlap between training and application found, cleaning up: {ninit} -> {nfinl}')
 
     df['hash_index'] = l_hash
     df = df.set_index('hash_index', drop=True)
dmu/plotting/plotter.py CHANGED
@@ -107,7 +107,7 @@ class Plotter:
 
         d_cut = self._d_cfg['selection']['cuts']
 
-        log.info('Applying cuts')
+        log.debug('Applying cuts')
         for name, cut in d_cut.items():
             log.debug(f'{name:<50}{cut:<150}')
             rdf = rdf.Filter(cut, name)
@@ -212,7 +212,11 @@ class Plotter:
 
         var (str) : Name of variable, needed for plot name
         '''
-        plt.legend()
+        d_leg = {}
+        if 'style' in self._d_cfg and 'legend' in self._d_cfg['style']:
+            d_leg = self._d_cfg['style']['legend']
+
+        plt.legend(**d_leg)
 
         plt_dir = self._d_cfg['saving']['plt_dir']
         os.makedirs(plt_dir, exist_ok=True)
dmu/plotting/plotter_1d.py CHANGED
@@ -77,17 +77,33 @@ class Plotter1D(Plotter):
 
         l_bc_all = []
         for name, arr_val in d_data.items():
+            label   = self._label_from_name(name, arr_val)
             arr_wgt = d_wgt[name] if d_wgt is not None else numpy.ones_like(arr_val)
             arr_wgt = self._normalize_weights(arr_wgt, var)
-            hst = Hist.new.Reg(bins=bins, start=minx, stop=maxx, name='x', label=name).Weight()
+            hst = Hist.new.Reg(bins=bins, start=minx, stop=maxx, name='x').Weight()
             hst.fill(x=arr_val, weight=arr_wgt)
-            hst.plot(label=name)
+            hst.plot(label=label)
             l_bc_all += hst.values().tolist()
 
         max_y = max(l_bc_all)
 
         return max_y
     # --------------------------------------------
+    def _label_from_name(self, name : str, arr_val : numpy.ndarray) -> str:
+        if 'stats' not in self._d_cfg:
+            return name
+
+        d_stat = self._d_cfg['stats']
+        if 'nentries' not in d_stat:
+            return name
+
+        form = d_stat['nentries']
+
+        nentries = len(arr_val)
+        nentries = form.format(nentries)
+
+        return f'{name}{nentries}'
+    # --------------------------------------------
     def _normalize_weights(self, arr_wgt : numpy.ndarray, var : str) -> numpy.ndarray:
         cfg_var = self._d_cfg['plots'][var]
         if 'normalized' not in cfg_var:
@@ -104,7 +120,6 @@ class Plotter1D(Plotter):
 
         return arr_wgt
     # --------------------------------------------
-
     def _style_plot(self, var : str, max_y : float) -> None:
         d_cfg  = self._d_cfg['plots'][var]
         yscale = d_cfg['yscale'] if 'yscale' in d_cfg else 'linear'
@@ -124,12 +139,15 @@ class Plotter1D(Plotter):
         plt.legend()
         plt.title(title)
     # --------------------------------------------
-    def _plot_lines(self, var : str):
+    def _plot_lines(self, var : str) -> None:
         '''
         Will plot vertical lines for some variables
 
         var (str) : name of variable
         '''
+        if 'style' in self._d_cfg and 'skip_lines' in self._d_cfg['style'] and self._d_cfg['style']['skip_lines']:
+            return
+
         if var in ['B_const_mass_M', 'B_M']:
             plt.axvline(x=5280, color='r', label=r'$B^+$', linestyle=':')
         elif var == 'Jpsi_M':
dmu/plotting/plotter_2d.py CHANGED
@@ -10,6 +10,7 @@ import matplotlib.pyplot as plt
 
 from hist import Hist
 from ROOT import RDataFrame
+from matplotlib.colors import LogNorm
 from dmu.logging.log_store import LogStore
 from dmu.plotting.plotter import Plotter
 
@@ -28,11 +29,8 @@ class Plotter2D(Plotter):
         cfg (dict): Dictionary with configuration, e.g. binning, ranges, etc
         '''
 
-        if not isinstance(cfg, dict):
-            raise ValueError('Config dictionary not passed')
-
-        self._d_cfg : dict       = cfg
-        self._rdf   : RDataFrame = super()._preprocess_rdf(rdf)
+        super().__init__({'single_rdf' : rdf}, cfg)
+        self._rdf : RDataFrame = self._d_rdf['single_rdf']
 
         self._wgt : numpy.ndarray
     # --------------------------------------------
@@ -61,7 +59,7 @@ class Plotter2D(Plotter):
 
         return arr_wgt
     # --------------------------------------------
-    def _plot_vars(self, varx : str, vary : str, wgt_name : str) -> None:
+    def _plot_vars(self, varx : str, vary : str, wgt_name : str, use_log : bool) -> None:
         log.info(f'Plotting {varx} vs {vary} with weights {wgt_name}')
 
         ax_x = self._get_axis(varx)
@@ -72,7 +70,10 @@ class Plotter2D(Plotter):
         hst = Hist(ax_x, ax_y)
         hst.fill(arr_x, arr_y, weight=arr_w)
 
-        mplhep.hist2dplot(hst)
+        if use_log:
+            mplhep.hist2dplot(hst, norm=LogNorm())
+        else:
+            mplhep.hist2dplot(hst)
     # --------------------------------------------
     def run(self):
         '''
@@ -80,8 +81,8 @@ class Plotter2D(Plotter):
         '''
 
         fig_size = self._get_fig_size()
-        for [varx, vary, wgt_name, plot_name] in self._d_cfg['plots_2d']:
+        for [varx, vary, wgt_name, plot_name, use_log] in self._d_cfg['plots_2d']:
             plt.figure(plot_name, figsize=fig_size)
-            self._plot_vars(varx, vary, wgt_name)
+            self._plot_vars(varx, vary, wgt_name, use_log)
             self._save_plot(plot_name)
 # --------------------------------------------
dmu/stats/model_factory.py CHANGED
@@ -1,7 +1,7 @@
 '''
 Module storing ZModel class
 '''
-# pylint: disable=too-many-lines, import-error
+# pylint: disable=too-many-lines, import-error, too-many-positional-arguments, too-many-arguments
 
 from typing import Callable, Union
 
@@ -69,12 +69,18 @@ class ModelFactory:
 
         self._d_par : dict[str,zpar] = {}
     #-----------------------------------------
+    def _fltname_from_name(self, name : str) -> str:
+        if name in ['mu', 'sg']:
+            return f'{name}_flt'
+
+        return name
+    #-----------------------------------------
     def _get_name(self, name : str, suffix : str) -> str:
         for can_be_shared in self._l_can_be_shared:
             if name.startswith(f'{can_be_shared}_') and can_be_shared in self._l_shr:
-                return can_be_shared
+                return self._fltname_from_name(can_be_shared)
 
-        return f'{name}{suffix}'
+        return self._fltname_from_name(f'{name}{suffix}')
     #-----------------------------------------
     def _get_parameter(self,
                        name : str,
@@ -129,8 +135,8 @@ class ModelFactory:
     def _get_cbl(self, suffix : str = '') -> zpdf:
         mu = self._get_parameter('mu_cbl', suffix, 5300, 5250, 5350)
         sg = self._get_parameter('sg_cbl', suffix, 10, 2, 300)
-        al = self._get_parameter('ac_cbl', suffix, 2, 1., 4.)
-        nl = self._get_parameter('nc_cbl', suffix, 1, 0.5, 5.0)
+        al = self._get_parameter('ac_cbl', suffix, 2, 1., 14.)
+        nl = self._get_parameter('nc_cbl', suffix, 1, 0.5, 15.)
 
         pdf = zfit.pdf.CrystalBall(mu, sg, al, nl, self._obs)
 
@@ -151,8 +157,8 @@ class ModelFactory:
         sg = self._get_parameter('sg_dscb', suffix, 10, 2, 30)
         ar = self._get_parameter('ar_dscb', suffix, 1, 0, 5)
         al = self._get_parameter('al_dscb', suffix, 1, 0, 5)
-        nr = self._get_parameter('nr_dscb', suffix, 2, 1, 5)
-        nl = self._get_parameter('nl_dscb', suffix, 2, 0, 5)
+        nr = self._get_parameter('nr_dscb', suffix, 2, 1, 15)
+        nl = self._get_parameter('nl_dscb', suffix, 2, 0, 15)
 
         pdf = zfit.pdf.DoubleCB(mu, sg, al, nl, ar, nr, self._obs)
 
dmu/testing/utilities.py CHANGED
@@ -2,6 +2,7 @@
 Module containing utility functions needed by unit tests
 '''
 import os
+import math
 from typing import Union
 from dataclasses import dataclass
 from importlib.resources import files
@@ -21,56 +22,64 @@ class Data:
     '''
     Class storing shared data
     '''
-    nentries = 3000
 # -------------------------------
-def _double_data(d_data : dict) -> dict:
-    df_1 = pnd.DataFrame(d_data)
-    df_2 = pnd.DataFrame(d_data)
-
+def _double_data(df_1 : pnd.DataFrame) -> pnd.DataFrame:
+    df_2 = df_1.copy()
     df = pnd.concat([df_1, df_2], axis=0)
 
-    d_data = { name : df[name].to_numpy() for name in df.columns }
-
-    return d_data
+    return df
 # -------------------------------
-def _add_nans(d_data : dict) -> dict:
-    df_good = pnd.DataFrame(d_data)
-    df_bad  = pnd.DataFrame(d_data)
-    df_bad[:] = numpy.nan
+def _add_nans(df : pnd.DataFrame, columns : list[str]) -> pnd.DataFrame:
+    size = len(df) * 0.2
+    size = math.floor(size)
+
+    l_col = df.columns.tolist()
+    if columns is None:
+        l_col_index = range(len(l_col))
+    else:
+        l_col_index = [ l_col.index(column) for column in columns ]
 
-    df     = pnd.concat([df_good, df_bad])
-    d_data = { name : df[name].to_numpy() for name in df.columns }
+    log.debug('Replacing randomly with {size} NaNs')
+    for _ in range(size):
+        irow = numpy.random.randint(0, df.shape[0]) # Random row index
+        icol = numpy.random.choice(l_col_index)     # Random column index
 
-    return d_data
+        df.iat[irow, icol] = numpy.nan
+
+    return df
 # -------------------------------
 def get_rdf(kind : Union[str,None] = None,
             repeated : bool = False,
-            add_nans : bool = False):
+            nentries : int = 3_000,
+            add_nans : list[str] = None):
     '''
     Return ROOT dataframe with toy data
     '''
+
     d_data = {}
     if kind == 'sig':
-        d_data['w'] = numpy.random.normal(0, 1, size=Data.nentries)
-        d_data['x'] = numpy.random.normal(0, 1, size=Data.nentries)
-        d_data['y'] = numpy.random.normal(0, 1, size=Data.nentries)
-        d_data['z'] = numpy.random.normal(0, 1, size=Data.nentries)
+        d_data['w'] = numpy.random.normal(0, 1, size=nentries)
+        d_data['x'] = numpy.random.normal(0, 1, size=nentries)
+        d_data['y'] = numpy.random.normal(0, 1, size=nentries)
+        d_data['z'] = numpy.random.normal(0, 1, size=nentries)
     elif kind == 'bkg':
-        d_data['w'] = numpy.random.normal(1, 1, size=Data.nentries)
-        d_data['x'] = numpy.random.normal(1, 1, size=Data.nentries)
-        d_data['y'] = numpy.random.normal(1, 1, size=Data.nentries)
-        d_data['z'] = numpy.random.normal(1, 1, size=Data.nentries)
+        d_data['w'] = numpy.random.normal(1, 1, size=nentries)
+        d_data['x'] = numpy.random.normal(1, 1, size=nentries)
+        d_data['y'] = numpy.random.normal(1, 1, size=nentries)
+        d_data['z'] = numpy.random.normal(1, 1, size=nentries)
     else:
        log.error(f'Invalid kind: {kind}')
        raise ValueError
 
+    df = pnd.DataFrame(d_data)
+
     if repeated:
-        d_data = _double_data(d_data)
+        df = _double_data(df)
 
     if add_nans:
-        d_data = _add_nans(d_data)
+        df = _add_nans(df, columns=add_nans)
 
-    rdf = RDF.FromNumpy(d_data)
+    rdf = RDF.FromPandas(df)
 
     return rdf
 # -------------------------------
@@ -1,6 +1,7 @@
 dataset:
     nan :
-        x : 0
+        x : 1
+        y : 2
 training :
     nfold    : 3
     features : [x, y, z]
@@ -33,10 +34,6 @@ plotting:
     saving:
         plt_dir : '/tmp/dmu/ml/tests/train_mva/features'
     plots:
-        w :
-            binning : [-4, 4, 100]
-            yscale  : 'linear'
-            labels  : ['w', '']
         x :
             binning : [-4, 4, 100]
             yscale  : 'linear'
1
1
  saving:
2
- plt_dir : tests/plotting/2d_weighted
2
+ plt_dir : /tmp/dmu/tests/plotting/2d_weighted
3
+ selection:
4
+ cuts:
5
+ xlow : x > -1.5
3
6
  definitions:
4
7
  z : x + y
5
8
  general:
6
9
  size : [20, 10]
7
10
  plots_2d:
8
- - [x, y, weights, 'xy_w']
9
- - [x, y, null, 'xy_r']
10
- - [x, z, null, 'xz_r']
11
+ - [x, y, weights, 'xy_wgt', false]
12
+ - [x, y, null, 'xy_raw', false]
13
+ - [x, z, null, 'xz_raw', false]
14
+ - [x, z, null, 'xz_log', true]
11
15
  axes:
12
16
  x :
13
17
  binning : [-3.0, 3.0, 40]
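Each `plots_2d` row gained a fifth element; judging by the new `xz_log` entry, it most likely toggles a logarithmic scale for that plot. A hypothetical reader for one row (the field names in the returned dict are illustrative, not the package's API):

```python
def parse_plot_2d(entry: list) -> dict:
    # Unpack [xvar, yvar, weight_column, plot_name, log_scale];
    # the meaning of the trailing flag is inferred from the 'xz_log' name.
    xvar, yvar, weight, name, log_scale = entry
    return {'x': xvar, 'y': yvar, 'weight': weight,
            'name': name, 'log_scale': log_scale}

cfg = parse_plot_2d(['x', 'z', None, 'xz_log', True])
```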
@@ -0,0 +1,12 @@
+saving:
+    plt_dir : tests/plotting/legend
+general:
+    size : [20, 10]
+plots:
+    x :
+        binning : [-5.0, 8.0, 40]
+    y :
+        binning : [-5.0, 8.0, 40]
+style:
+    legend:
+        bbox_to_anchor : [1.2, 1]
@@ -0,0 +1,9 @@
+saving:
+    plt_dir : tests/plotting/stats
+plots:
+    x :
+        binning : [-5.0, 8.0, 40]
+    y :
+        binning : [-5.0, 8.0, 40]
+stats:
+    nentries : '{:.2e}'
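The `stats` block in this new config appears to attach per-plot statistics; the `nentries` value is an ordinary Python format spec, so an entry count would render in scientific notation:

```python
# '{:.2e}' is a standard str.format spec: two decimals, exponent notation.
nentries = 123456
label = '{:.2e}'.format(nentries)  # '1.23e+05'
```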