PyPI - data-manipulation-utilities - Versions diffs - 0.1.6__tar.gz → 0.1.9__tar.gz - Mend

data-manipulation-utilities 0.1.6tar.gz → 0.1.9tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (54)

{data_manipulation_utilities-0.1.6 → data_manipulation_utilities-0.1.9}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.1
+Metadata-Version: 2.2
 Name: data_manipulation_utilities
-Version: 0.1.6
+Version: 0.1.9
 Description-Content-Type: text/markdown
 Requires-Dist: logzero
 Requires-Dist: PyYAML
@@ -41,7 +41,7 @@ such that:
 Then, for each remote it pushes the tags and the commits.
-*Why?*
+*Why?*
 1. Tags should be named as the project's version
 1. As soon as a new version is created, that version needs to be tagged.
@@ -121,6 +121,24 @@ samples:
 ## PDFs
+### Model building
+In order to do complex fits, one often needs PDFs with many parameters, which need to be added.
+In these PDFs certain parameters (e.g. $\mu$ or $\sigma$) need to be shared. This project provides
+`ModelFactory`, which can do this as shown below:
+```python
+from dmu.stats.model_factory import ModelFactory
+l_pdf = ['cbr'] + 2 * ['cbl']
+l_shr = ['mu', 'sg']
+mod   = ModelFactory(obs = Data.obs, l_pdf = l_pdf, l_shared=l_shr)
+pdf   = mod.get_pdf()
+```
+where the model is a sum of three `CrystallBall` PDFs, one with a right tail and two with a left tail.
+The `mu` and `sg` parameters are shared.
 ### Printing PDFs
 One can print a zfit PDF by doing:
@@ -231,6 +249,87 @@ likelihood :
     nbins : 100 #If specified, will do binned likelihood fit instead of unbinned
 ```
+## Minimizers
+These are alternative implementations of the minimizers in zfit meant to be used for special types of fits.
+### Anealing minimizer
+This minimizer is meant to be used for fits to models with many parameters, where multiple minima are expected in the
+likelihood. The minimizer use is illustrated in:
+```python
+from dmu.stats.minimizers  import AnealingMinimizer
+nll       = _get_nll()
+minimizer = AnealingMinimizer(ntries=10, pvalue=0.05)
+res       = minimizer.minimize(nll)
+```
+this will:
+- Take the `NLL` object.
+- Try fitting at most 10 times
+- After each fit, calculate the goodness of fit (in this case the p-value)
+- Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
+- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
+randomize the parameters and try again.
+- If the desired goodness of fit has not been achieved, pick the best result.
+- Return the `FitResult` object and set the PDF to the final fit result.
+The $\chi^2/Ndof$ can also be used as in:
+```python
+from dmu.stats.minimizers  import AnealingMinimizer
+nll       = _get_nll()
+minimizer = AnealingMinimizer(ntries=10, chi2ndof=1.00)
+res       = minimizer.minimize(nll)
+```
+## Fit plotting
+The class `ZFitPlotter` can be used to plot fits done with zfit. For a complete set of examples of how to use
+this class check the [tests](tests/stats/test_fit_plotter.py). A simple example of its usage is below:
+```python
+from dmu.stats.zfit_plotter import ZFitPlotter
+obs = zfit.Space('m', limits=(0, 10))
+# Create signal PDF
+mu  = zfit.Parameter("mu", 5.0,  0, 10)
+sg  = zfit.Parameter("sg", 0.5,  0,  5)
+sig = zfit.pdf.Gauss(obs=obs, mu=mu, sigma=sg)
+nsg = zfit.Parameter('nsg', 1000, 0, 10000)
+esig= sig.create_extended(nsg, name='gauss')
+# Create background PDF
+lm  = zfit.Parameter('lm', -0.1, -1, 0)
+bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
+nbk = zfit.Parameter('nbk', 1000, 0, 10000)
+ebkg= bkg.create_extended(nbk, name='expo')
+# Add them
+pdf = zfit.pdf.SumPDF([ebkg, esig])
+sam = pdf.create_sampler()
+# Plot them
+obj   = ZFitPlotter(data=sam, model=pdf)
+d_leg = {'gauss': 'New Gauss'}
+obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
+# add a line to pull hist
+obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
+```
+this class supports:
+- Handling title, legend, plots size.
+- Adding pulls.
+- Stacking and overlaying of PDFs.
+- Blinding.
 ## Arrays
 ### Scaling by non-integer
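
Note on the `AnealingMinimizer` hunk above: the retry behaviour listed in the added README text can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's implementation. `retry_fit`, `fit_once` and `randomize` are hypothetical names, and the fit is mocked as a callable returning a `(converged, p-value)` pair.

```python
from typing import Callable, Optional

def retry_fit(fit_once  : Callable[[], tuple[bool, float]],
              randomize : Callable[[], None],
              ntries    : int   = 10,
              pvalue    : float = 0.05) -> Optional[tuple[bool, float]]:
    '''
    Hypothetical helper: repeat a fit until the p-value exceeds the threshold
    or the tries are exhausted, then return the best converged attempt seen.
    '''
    best = None
    for _ in range(ntries):
        converged, pval = fit_once()
        if converged:
            # Remember the converged attempt with the highest p-value
            if best is None or pval > best[1]:
                best = (converged, pval)
            # Good enough, stop early
            if pval > pvalue:
                break
        # Failed or poor fit: move to a new starting point and try again
        randomize()
    return best

# Mocked fit whose p-values arrive in a fixed order, so the outcome is reproducible
l_pval = iter([0.001, 0.020, 0.300, 0.900])
calls  = []
result = retry_fit(fit_once  = lambda: (True, next(l_pval)),
                   randomize = lambda: calls.append(1),
                   ntries    = 10,
                   pvalue    = 0.05)
```

Here the loop stops at the third attempt, the first one above the threshold, after randomizing twice. If no attempt passes, the best converged attempt is returned instead.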

{data_manipulation_utilities-0.1.6 → data_manipulation_utilities-0.1.9}/README.md RENAMED

@@ -21,7 +21,7 @@ such that:
 Then, for each remote it pushes the tags and the commits.
-*Why?*
+*Why?*
 1. Tags should be named as the project's version
 1. As soon as a new version is created, that version needs to be tagged.
@@ -101,6 +101,24 @@ samples:
 ## PDFs
+### Model building
+In order to do complex fits, one often needs PDFs with many parameters, which need to be added.
+In these PDFs certain parameters (e.g. $\mu$ or $\sigma$) need to be shared. This project provides
+`ModelFactory`, which can do this as shown below:
+```python
+from dmu.stats.model_factory import ModelFactory
+l_pdf = ['cbr'] + 2 * ['cbl']
+l_shr = ['mu', 'sg']
+mod   = ModelFactory(obs = Data.obs, l_pdf = l_pdf, l_shared=l_shr)
+pdf   = mod.get_pdf()
+```
+where the model is a sum of three `CrystallBall` PDFs, one with a right tail and two with a left tail.
+The `mu` and `sg` parameters are shared.
 ### Printing PDFs
 One can print a zfit PDF by doing:
@@ -211,6 +229,87 @@ likelihood :
     nbins : 100 #If specified, will do binned likelihood fit instead of unbinned
 ```
+## Minimizers
+These are alternative implementations of the minimizers in zfit meant to be used for special types of fits.
+### Anealing minimizer
+This minimizer is meant to be used for fits to models with many parameters, where multiple minima are expected in the
+likelihood. The minimizer use is illustrated in:
+```python
+from dmu.stats.minimizers  import AnealingMinimizer
+nll       = _get_nll()
+minimizer = AnealingMinimizer(ntries=10, pvalue=0.05)
+res       = minimizer.minimize(nll)
+```
+this will:
+- Take the `NLL` object.
+- Try fitting at most 10 times
+- After each fit, calculate the goodness of fit (in this case the p-value)
+- Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
+- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
+randomize the parameters and try again.
+- If the desired goodness of fit has not been achieved, pick the best result.
+- Return the `FitResult` object and set the PDF to the final fit result.
+The $\chi^2/Ndof$ can also be used as in:
+```python
+from dmu.stats.minimizers  import AnealingMinimizer
+nll       = _get_nll()
+minimizer = AnealingMinimizer(ntries=10, chi2ndof=1.00)
+res       = minimizer.minimize(nll)
+```
+## Fit plotting
+The class `ZFitPlotter` can be used to plot fits done with zfit. For a complete set of examples of how to use
+this class check the [tests](tests/stats/test_fit_plotter.py). A simple example of its usage is below:
+```python
+from dmu.stats.zfit_plotter import ZFitPlotter
+obs = zfit.Space('m', limits=(0, 10))
+# Create signal PDF
+mu  = zfit.Parameter("mu", 5.0,  0, 10)
+sg  = zfit.Parameter("sg", 0.5,  0,  5)
+sig = zfit.pdf.Gauss(obs=obs, mu=mu, sigma=sg)
+nsg = zfit.Parameter('nsg', 1000, 0, 10000)
+esig= sig.create_extended(nsg, name='gauss')
+# Create background PDF
+lm  = zfit.Parameter('lm', -0.1, -1, 0)
+bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
+nbk = zfit.Parameter('nbk', 1000, 0, 10000)
+ebkg= bkg.create_extended(nbk, name='expo')
+# Add them
+pdf = zfit.pdf.SumPDF([ebkg, esig])
+sam = pdf.create_sampler()
+# Plot them
+obj   = ZFitPlotter(data=sam, model=pdf)
+d_leg = {'gauss': 'New Gauss'}
+obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
+# add a line to pull hist
+obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
+```
+this class supports:
+- Handling title, legend, plots size.
+- Adding pulls.
+- Stacking and overlaying of PDFs.
+- Blinding.
 ## Arrays
 ### Scaling by non-integer
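
Note on the `ModelFactory` hunk above: the idea of sharing `mu` and `sg` across components can be illustrated with parameter names alone. The sketch below is hypothetical and does not reflect how `ModelFactory` is implemented. The function name `parameter_names`, the per-component parameter set `('mu', 'sg', 'al', 'nr')` and the naming scheme are all assumptions made for illustration.

```python
def parameter_names(l_pdf    : list[str],
                    l_shared : list[str],
                    l_par    : tuple[str, ...] = ('mu', 'sg', 'al', 'nr')) -> list[list[str]]:
    '''
    Hypothetical helper: returns, for each component, the names of its parameters.
    Shared parameters get a single name, the others get one name per component.
    '''
    l_name = []
    for index, kind in enumerate(l_pdf):
        l_comp = []
        for par in l_par:
            # A shared parameter is reused by every component under the same name
            name = par if par in l_shared else f'{par}_{kind}_{index}'
            l_comp.append(name)
        l_name.append(l_comp)
    return l_name

l_pdf  = ['cbr'] + 2 * ['cbl']
l_name = parameter_names(l_pdf, l_shared=['mu', 'sg'])
unique = {name for l_comp in l_name for name in l_comp}
```

With three components and two shared parameters, the model in this sketch has eight distinct parameters rather than twelve.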

{data_manipulation_utilities-0.1.6 → data_manipulation_utilities-0.1.9}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name        = 'data_manipulation_utilities'
-version     = '0.1.6'
+version     = '0.1.9'
 readme      = 'README.md'
 dependencies= [
 'logzero',

{data_manipulation_utilities-0.1.6 → data_manipulation_utilities-0.1.9}/src/data_manipulation_utilities.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.1
+Metadata-Version: 2.2
 Name: data_manipulation_utilities
-Version: 0.1.6
+Version: 0.1.9
 Description-Content-Type: text/markdown
 Requires-Dist: logzero
 Requires-Dist: PyYAML
@@ -41,7 +41,7 @@ such that:
 Then, for each remote it pushes the tags and the commits.
-*Why?*
+*Why?*
 1. Tags should be named as the project's version
 1. As soon as a new version is created, that version needs to be tagged.
@@ -121,6 +121,24 @@ samples:
 ## PDFs
+### Model building
+In order to do complex fits, one often needs PDFs with many parameters, which need to be added.
+In these PDFs certain parameters (e.g. $\mu$ or $\sigma$) need to be shared. This project provides
+`ModelFactory`, which can do this as shown below:
+```python
+from dmu.stats.model_factory import ModelFactory
+l_pdf = ['cbr'] + 2 * ['cbl']
+l_shr = ['mu', 'sg']
+mod   = ModelFactory(obs = Data.obs, l_pdf = l_pdf, l_shared=l_shr)
+pdf   = mod.get_pdf()
+```
+where the model is a sum of three `CrystallBall` PDFs, one with a right tail and two with a left tail.
+The `mu` and `sg` parameters are shared.
 ### Printing PDFs
 One can print a zfit PDF by doing:
@@ -231,6 +249,87 @@ likelihood :
     nbins : 100 #If specified, will do binned likelihood fit instead of unbinned
 ```
+## Minimizers
+These are alternative implementations of the minimizers in zfit meant to be used for special types of fits.
+### Anealing minimizer
+This minimizer is meant to be used for fits to models with many parameters, where multiple minima are expected in the
+likelihood. The minimizer use is illustrated in:
+```python
+from dmu.stats.minimizers  import AnealingMinimizer
+nll       = _get_nll()
+minimizer = AnealingMinimizer(ntries=10, pvalue=0.05)
+res       = minimizer.minimize(nll)
+```
+this will:
+- Take the `NLL` object.
+- Try fitting at most 10 times
+- After each fit, calculate the goodness of fit (in this case the p-value)
+- Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
+- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
+randomize the parameters and try again.
+- If the desired goodness of fit has not been achieved, pick the best result.
+- Return the `FitResult` object and set the PDF to the final fit result.
+The $\chi^2/Ndof$ can also be used as in:
+```python
+from dmu.stats.minimizers  import AnealingMinimizer
+nll       = _get_nll()
+minimizer = AnealingMinimizer(ntries=10, chi2ndof=1.00)
+res       = minimizer.minimize(nll)
+```
+## Fit plotting
+The class `ZFitPlotter` can be used to plot fits done with zfit. For a complete set of examples of how to use
+this class check the [tests](tests/stats/test_fit_plotter.py). A simple example of its usage is below:
+```python
+from dmu.stats.zfit_plotter import ZFitPlotter
+obs = zfit.Space('m', limits=(0, 10))
+# Create signal PDF
+mu  = zfit.Parameter("mu", 5.0,  0, 10)
+sg  = zfit.Parameter("sg", 0.5,  0,  5)
+sig = zfit.pdf.Gauss(obs=obs, mu=mu, sigma=sg)
+nsg = zfit.Parameter('nsg', 1000, 0, 10000)
+esig= sig.create_extended(nsg, name='gauss')
+# Create background PDF
+lm  = zfit.Parameter('lm', -0.1, -1, 0)
+bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
+nbk = zfit.Parameter('nbk', 1000, 0, 10000)
+ebkg= bkg.create_extended(nbk, name='expo')
+# Add them
+pdf = zfit.pdf.SumPDF([ebkg, esig])
+sam = pdf.create_sampler()
+# Plot them
+obj   = ZFitPlotter(data=sam, model=pdf)
+d_leg = {'gauss': 'New Gauss'}
+obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
+# add a line to pull hist
+obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
+```
+this class supports:
+- Handling title, legend, plots size.
+- Adding pulls.
+- Stacking and overlaying of PDFs.
+- Blinding.
 ## Arrays
 ### Scaling by non-integer
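
Note on the `ZFitPlotter` hunk above: the added text lists pulls among the supported features, and the example draws a reference line at zero on the pull axis. A pull is commonly defined per bin as the difference between data and model divided by the uncertainty. The sketch below assumes Poisson errors on the data and a `data - model` sign convention. Neither is confirmed by the diff, and `calc_pulls` is a hypothetical name.

```python
import math

def calc_pulls(l_data : list[float], l_model : list[float]) -> list[float]:
    '''
    Hypothetical helper: per-bin pulls, assuming Poisson errors sqrt(data).
    Empty bins use an error of 1 to avoid dividing by zero.
    '''
    l_pull = []
    for data, model in zip(l_data, l_model):
        error = math.sqrt(data) if data > 0 else 1.0
        l_pull.append((data - model) / error)
    return l_pull

l_pull = calc_pulls(l_data=[100.0, 81.0, 0.0], l_model=[90.0, 90.0, 0.5])
```

For a model that describes the data well, the pulls scatter around zero with unit width, which is why a horizontal line at zero is a useful visual reference.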

{data_manipulation_utilities-0.1.6 → data_manipulation_utilities-0.1.9}/src/data_manipulation_utilities.egg-info/SOURCES.txt RENAMED

@@ -22,7 +22,11 @@ src/dmu/rfile/rfprinter.py
 src/dmu/rfile/utilities.py
 src/dmu/stats/fitter.py
 src/dmu/stats/function.py
+src/dmu/stats/gof_calculator.py
+src/dmu/stats/minimizers.py
+src/dmu/stats/model_factory.py
 src/dmu/stats/utilities.py
+src/dmu/stats/zfit_plotter.py
 src/dmu/testing/utilities.py
 src/dmu/text/transformer.py
 src/dmu_data/__init__.py

{data_manipulation_utilities-0.1.6 → data_manipulation_utilities-0.1.9}/src/dmu/plotting/plotter.py RENAMED

@@ -65,7 +65,7 @@ class Plotter:
         return minx, maxx
     #-------------------------------------
-    def _preprocess_rdf(self, rdf):
+    def _preprocess_rdf(self, rdf : RDataFrame) -> RDataFrame:
         '''
         rdf (RDataFrame): ROOT dataframe

{data_manipulation_utilities-0.1.6 → data_manipulation_utilities-0.1.9}/src/dmu/plotting/plotter_1d.py RENAMED

@@ -2,6 +2,9 @@
 Module containing plotter class
 '''
+import hist
+from hist import Hist
 import numpy
 import matplotlib.pyplot as plt
@@ -33,58 +36,75 @@ class Plotter1D(Plotter):
         return xname, yname
     #-------------------------------------
-    def _plot_var(self, var):
+    def _is_normalized(self, var : str) -> bool:
+        d_cfg     = self._d_cfg['plots'][var]
+        normalized=False
+        if 'normalized' in d_cfg:
+            normalized = d_cfg['normalized']
+        return normalized
+    #-------------------------------------
+    def _get_binning(self, var : str, d_data : dict[str, numpy.ndarray]) -> tuple[float, float, int]:
+        d_cfg  = self._d_cfg['plots'][var]
+        minx, maxx, bins = d_cfg['binning']
+        if maxx <= minx + 1e-5:
+            log.info(f'Bounds not set for {var}, will calculated them')
+            minx, maxx = self._find_bounds(d_data = d_data, qnt=minx)
+            log.info(f'Using bounds [{minx:.3e}, {maxx:.3e}]')
+        else:
+            log.debug(f'Using bounds [{minx:.3e}, {maxx:.3e}]')
+        return minx, maxx, bins
+    #-------------------------------------
+    def _plot_var(self, var : str) -> float:
         '''
         Will plot a variable from a dictionary of dataframes
         Parameters
         --------------------
         var   (str)  : name of column
+        Return
+        --------------------
+        Largest bin content among all bins and among all histograms plotted
         '''
         # pylint: disable=too-many-locals
-        d_cfg = self._d_cfg['plots'][var]
-        minx, maxx, bins = d_cfg['binning']
-        yscale           = d_cfg['yscale' ] if 'yscale' in d_cfg else 'linear'
-        xname, yname     = self._get_labels(var)
-        normalized=False
-        if 'normalized' in d_cfg:
-            normalized = d_cfg['normalized']
-        title = ''
-        if 'title'      in d_cfg:
-            title = d_cfg['title']
         d_data = {}
         for name, rdf in self._d_rdf.items():
             d_data[name] = rdf.AsNumpy([var])[var]
-        if maxx <= minx + 1e-5:
-            log.info(f'Bounds not set for {var}, will calculated them')
-            minx, maxx = self._find_bounds(d_data = d_data, qnt=minx)
-            log.info(f'Using bounds [{minx:.3e}, {maxx:.3e}]')
-        else:
-            log.debug(f'Using bounds [{minx:.3e}, {maxx:.3e}]')
+        minx, maxx, bins = self._get_binning(var, d_data)
+        d_wgt            = self._get_weights(var)
         l_bc_all = []
-        d_wgt    = self._get_weights(var)
         for name, arr_val in d_data.items():
-            arr_wgt    = d_wgt[name] if d_wgt is not None else None
-            self._print_weights(arr_wgt, var, name)
-            l_bc, _, _ = plt.hist(arr_val, weights=arr_wgt, bins=bins, range=(minx, maxx), density=normalized, histtype='step', label=name)
-            l_bc_all  += numpy.array(l_bc).tolist()
+            arr_wgt      = d_wgt[name] if d_wgt is not None else numpy.ones_like(arr_val)
+            hst          = Hist.new.Reg(bins=bins, start=minx, stop=maxx, name='x', label=name).Weight()
+            hst.fill(x=arr_val, weight=arr_wgt)
+            hst.plot(label=name)
+            l_bc_all    += hst.values().tolist()
-            plt.yscale(yscale)
-            plt.xlabel(xname)
-            plt.ylabel(yname)
+        max_y = max(l_bc_all)
+        return max_y
+    # --------------------------------------------
+    def _style_plot(self, var : str, max_y : float) -> None:
+        d_cfg  = self._d_cfg['plots'][var]
+        yscale = d_cfg['yscale' ] if 'yscale' in d_cfg else 'linear'
+        xname, yname = self._get_labels(var)
+        plt.xlabel(xname)
+        plt.ylabel(yname)
+        plt.yscale(yscale)
         if yscale == 'linear':
             plt.ylim(bottom=0)
-        max_y = max(l_bc_all)
+        title = ''
+        if 'title'      in d_cfg:
+            title = d_cfg['title']
         plt.ylim(top=1.2 * max_y)
+        plt.legend()
         plt.title(title)
     # --------------------------------------------
     def _plot_lines(self, var : str):
@@ -106,8 +126,10 @@ class Plotter1D(Plotter):
         fig_size = self._get_fig_size()
         for var in self._d_cfg['plots']:
             log.debug(f'Plotting: {var}')
             plt.figure(var, figsize=fig_size)
-            self._plot_var(var)
+            max_y = self._plot_var(var)
+            self._style_plot(var, max_y)
             self._plot_lines(var)
             self._save_plot(var)
 # --------------------------------------------
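
Note on the `_get_binning` hunk above: when the configured upper bound does not exceed the lower one, the bounds are derived from the data and the lower entry of `binning` is passed on as `qnt`. The sketch below reproduces that control flow in plain Python. The body of `find_bounds` is an assumption, since `_find_bounds` is not part of this diff: here it takes the `1 - qnt` and `qnt` quantiles of the pooled values by nearest rank.

```python
def find_bounds(d_data : dict[str, list[float]], qnt : float) -> tuple[float, float]:
    '''
    Hypothetical stand-in for `_find_bounds`: quantile-based range of the pooled values
    '''
    l_val = sorted(val for l_sample in d_data.values() for val in l_sample)
    last  = len(l_val) - 1
    i_min = round((1 - qnt) * last)
    i_max = round(qnt * last)
    return l_val[i_min], l_val[i_max]

def get_binning(binning : tuple[float, float, int],
                d_data  : dict[str, list[float]]) -> tuple[float, float, int]:
    minx, maxx, bins = binning
    # Same condition as in the diff: an empty or inverted range triggers the fallback
    if maxx <= minx + 1e-5:
        minx, maxx = find_bounds(d_data, qnt=minx)
    return minx, maxx, bins

d_data   = {'a' : [float(val) for val in range(0,   51)],
            'b' : [float(val) for val in range(51, 101)]}
explicit = get_binning((0.0, 10.0, 20), d_data)  # bounds taken from the config
derived  = get_binning((0.9,  0.9, 20), d_data)  # bounds derived from the data
```

In this sketch a config entry such as `(0.9, 0.9, 20)` means "use 20 bins over the central range of the data", while an entry with a proper range is returned unchanged.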

{data_manipulation_utilities-0.1.6 → data_manipulation_utilities-0.1.9}/src/dmu/plotting/plotter_2d.py RENAMED

@@ -31,8 +31,8 @@ class Plotter2D(Plotter):
         if not isinstance(cfg, dict):
             raise ValueError('Config dictionary not passed')
-        self._rdf   : RDataFrame = rdf
         self._d_cfg : dict       = cfg
+        self._rdf   : RDataFrame = super()._preprocess_rdf(rdf)
         self._wgt : numpy.ndarray
     # --------------------------------------------
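
Note on the `Plotter2D.__init__` hunk above: the configuration is now assigned before the dataframe is passed through `_preprocess_rdf`. The diff does not state why, but one plausible reason is that the preprocessing step reads `self._d_cfg`. The sketch below illustrates that ordering constraint with hypothetical classes (`Base`, `Good`, `Bad`) and a made-up `scale` setting. It is not the package's code.

```python
class Base:
    def _preprocess(self, l_val : list[float]) -> list[float]:
        # Reads the configuration, so self._d_cfg must exist before this is called
        scale = self._d_cfg['scale']
        return [scale * val for val in l_val]

class Good(Base):
    def __init__(self, l_val : list[float], cfg : dict):
        self._d_cfg = cfg                         # configuration first
        self._l_val = super()._preprocess(l_val)  # then the step that needs it

class Bad(Base):
    def __init__(self, l_val : list[float], cfg : dict):
        self._l_val = super()._preprocess(l_val)  # fails: self._d_cfg not set yet
        self._d_cfg = cfg

good = Good([1.0, 2.0], {'scale' : 2.0})

try:
    Bad([1.0, 2.0], {'scale' : 2.0})
    failed = False
except AttributeError:
    failed = True
```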

{data_manipulation_utilities-0.1.6 → data_manipulation_utilities-0.1.9}/src/dmu/stats/fitter.py RENAMED

@@ -4,6 +4,7 @@ Module holding zfitter class
 import pprint
 from typing                   import Union
+from functools                import lru_cache
 import numpy
 import zfit
@@ -100,8 +101,8 @@ class Fitter:
         return data
     #------------------------------
-    def _bin_pdf(self, nbins):
-        [[min_x]], [[max_x]] = self._pdf.space.limits
+    def _bin_pdf(self):
+        nbins, min_x, max_x = self._get_binning()
         _, arr_edg = numpy.histogram(self._data_np, bins = nbins, range=(min_x, max_x))
         size = arr_edg.size
@@ -117,23 +118,29 @@ class Fitter:
         return numpy.array(l_bc)
     #------------------------------
+    def _bin_data(self):
+        nbins, min_x, max_x = self._get_binning()
+        arr_data, _ = numpy.histogram(self._data_np, bins = nbins, range=(min_x, max_x))
+        arr_data    = arr_data.astype(float)
+        return arr_data
+    #------------------------------
+    @lru_cache(maxsize=10)
     def _get_binning(self):
         min_x = numpy.min(self._data_np)
         max_x = numpy.max(self._data_np)
         nbins = self._ndof + self._get_float_pars()
+        log.debug(f'Nbins: {nbins}')
+        log.debug(f'Range: [{min_x:.3f}, {max_x:.3f}]')
         return nbins, min_x, max_x
     #------------------------------
     def _calc_gof(self):
         log.debug('Calculating GOF')
-        nbins, min_x, max_x = self._get_binning()
-        log.debug(f'Nbins: {nbins}')
-        log.debug(f'Range: [{min_x:.3f}, {max_x:.3f}]')
-        arr_data, _ = numpy.histogram(self._data_np, bins = nbins, range=(min_x, max_x))
-        arr_data    = arr_data.astype(float)
-        arr_modl    = self._bin_pdf(nbins)
+        arr_data    = self._bin_data()
+        arr_modl    = self._bin_pdf()
         norm        = numpy.sum(arr_data) / numpy.sum(arr_modl)
         arr_modl    = norm * arr_modl
         arr_res     = arr_modl - arr_data
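
Note on the `@lru_cache` hunk above: decorating a method with `functools.lru_cache` makes `self` part of the cache key, so repeated calls on the same instance return the stored result without running the body again. A side effect is that the cached value is not refreshed if the instance's data changes afterwards. Whether that matters for `Fitter` depends on code outside this diff. The sketch below uses a hypothetical `Binner` class to show both effects.

```python
from functools import lru_cache

class Binner:
    '''
    Hypothetical class, used only to illustrate caching of a method result
    '''
    def __init__(self, l_data : list[float]):
        self._l_data = l_data
        self.ncalls  = 0

    @lru_cache(maxsize=10)
    def get_range(self) -> tuple[float, float]:
        # Counts how many times the body actually runs
        self.ncalls += 1
        return min(self._l_data), max(self._l_data)

obj    = Binner([3.0, 1.0, 2.0])
first  = obj.get_range()
second = obj.get_range()   # served from the cache, body not executed

obj._l_data.append(10.0)
stale  = obj.get_range()   # still the cached range, the new value is ignored
```

The cache also holds a reference to each instance it has seen, so instances stay alive until they are evicted or `get_range.cache_clear()` is called.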
