data-manipulation-utilities 0.2.5__tar.gz → 0.2.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (76)
  1. {data_manipulation_utilities-0.2.5/src/data_manipulation_utilities.egg-info → data_manipulation_utilities-0.2.7}/PKG-INFO +179 -10
  2. data_manipulation_utilities-0.2.5/PKG-INFO → data_manipulation_utilities-0.2.7/README.md +170 -26
  3. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/pyproject.toml +5 -3
  4. data_manipulation_utilities-0.2.5/README.md → data_manipulation_utilities-0.2.7/src/data_manipulation_utilities.egg-info/PKG-INFO +195 -6
  5. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/data_manipulation_utilities.egg-info/SOURCES.txt +12 -0
  6. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/data_manipulation_utilities.egg-info/requires.txt +9 -2
  7. data_manipulation_utilities-0.2.7/src/dmu/generic/hashing.py +44 -0
  8. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/generic/utilities.py +14 -1
  9. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/generic/version_management.py +3 -5
  10. data_manipulation_utilities-0.2.7/src/dmu/ml/cv_diagnostics.py +221 -0
  11. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/ml/train_mva.py +143 -46
  12. data_manipulation_utilities-0.2.7/src/dmu/pdataframe/utilities.py +69 -0
  13. data_manipulation_utilities-0.2.7/src/dmu/plotting/fwhm.py +64 -0
  14. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/plotting/plotter.py +2 -0
  15. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/plotting/plotter_1d.py +87 -6
  16. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/stats/fitter.py +1 -1
  17. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/stats/minimizers.py +40 -11
  18. data_manipulation_utilities-0.2.7/src/dmu/stats/model_factory.py +417 -0
  19. data_manipulation_utilities-0.2.7/src/dmu/stats/zfit_models.py +68 -0
  20. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/stats/zfit_plotter.py +29 -21
  21. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/testing/utilities.py +31 -4
  22. data_manipulation_utilities-0.2.7/src/dmu_data/ml/tests/diagnostics_from_file.yaml +13 -0
  23. data_manipulation_utilities-0.2.7/src/dmu_data/ml/tests/diagnostics_from_model.yaml +10 -0
  24. data_manipulation_utilities-0.2.7/src/dmu_data/ml/tests/diagnostics_multiple_methods.yaml +10 -0
  25. data_manipulation_utilities-0.2.7/src/dmu_data/ml/tests/diagnostics_overlay.yaml +33 -0
  26. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/ml/tests/train_mva.yaml +19 -10
  27. data_manipulation_utilities-0.2.7/src/dmu_data/ml/tests/train_mva_with_diagnostics.yaml +82 -0
  28. data_manipulation_utilities-0.2.7/src/dmu_data/plotting/tests/plug_fwhm.yaml +24 -0
  29. data_manipulation_utilities-0.2.7/src/dmu_data/plotting/tests/plug_stats.yaml +19 -0
  30. data_manipulation_utilities-0.2.7/src/dmu_data/plotting/tests/simple.yaml +9 -0
  31. data_manipulation_utilities-0.2.7/src/dmu_data/plotting/tests/styling.yaml +11 -0
  32. data_manipulation_utilities-0.2.5/src/dmu/pdataframe/utilities.py +0 -36
  33. data_manipulation_utilities-0.2.5/src/dmu/stats/model_factory.py +0 -213
  34. data_manipulation_utilities-0.2.5/src/dmu_data/plotting/tests/simple.yaml +0 -8
  35. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/setup.cfg +0 -0
  36. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/data_manipulation_utilities.egg-info/dependency_links.txt +0 -0
  37. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/data_manipulation_utilities.egg-info/entry_points.txt +0 -0
  38. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/data_manipulation_utilities.egg-info/top_level.txt +0 -0
  39. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/arrays/utilities.py +0 -0
  40. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/logging/log_store.py +0 -0
  41. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/ml/cv_classifier.py +0 -0
  42. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/ml/cv_predict.py +0 -0
  43. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/ml/utilities.py +0 -0
  44. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/plotting/matrix.py +0 -0
  45. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/plotting/plotter_2d.py +0 -0
  46. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/plotting/utilities.py +0 -0
  47. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/rdataframe/atr_mgr.py +0 -0
  48. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/rdataframe/utilities.py +0 -0
  49. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/rfile/rfprinter.py +0 -0
  50. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/rfile/utilities.py +0 -0
  51. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/stats/function.py +0 -0
  52. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/stats/gof_calculator.py +0 -0
  53. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/stats/utilities.py +0 -0
  54. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu/text/transformer.py +0 -0
  55. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/__init__.py +0 -0
  56. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/2d.yaml +0 -0
  57. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/fig_size.yaml +0 -0
  58. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/high_stat.yaml +0 -0
  59. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/legend.yaml +0 -0
  60. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/name.yaml +0 -0
  61. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/no_bounds.yaml +0 -0
  62. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/normalized.yaml +0 -0
  63. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/stats.yaml +0 -0
  64. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/title.yaml +0 -0
  65. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/plotting/tests/weights.yaml +0 -0
  66. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/text/transform.toml +0 -0
  67. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/text/transform.txt +0 -0
  68. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/text/transform_set.toml +0 -0
  69. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/text/transform_set.txt +0 -0
  70. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_data/text/transform_trf.txt +0 -0
  71. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_scripts/git/publish +0 -0
  72. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_scripts/physics/check_truth.py +0 -0
  73. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_scripts/rfile/compare_root_files.py +0 -0
  74. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_scripts/rfile/print_trees.py +0 -0
  75. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_scripts/ssh/coned.py +0 -0
  76. {data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/src/dmu_scripts/text/transform_text.py +0 -0
{data_manipulation_utilities-0.2.5/src/data_manipulation_utilities.egg-info → data_manipulation_utilities-0.2.7}/PKG-INFO

@@ -1,20 +1,25 @@
- Metadata-Version: 2.2
+ Metadata-Version: 2.4
  Name: data_manipulation_utilities
- Version: 0.2.5
+ Version: 0.2.7
  Description-Content-Type: text/markdown
  Requires-Dist: logzero
  Requires-Dist: PyYAML
  Requires-Dist: scipy
  Requires-Dist: awkward
  Requires-Dist: tqdm
- Requires-Dist: joblib
- Requires-Dist: scikit-learn
+ Requires-Dist: numpy
  Requires-Dist: toml
  Requires-Dist: numpy
  Requires-Dist: matplotlib
  Requires-Dist: mplhep
  Requires-Dist: hist[plot]
  Requires-Dist: pandas
+ Provides-Extra: fit
+ Requires-Dist: zfit; extra == "fit"
+ Requires-Dist: tensorflow==2.18.0; extra == "fit"
+ Provides-Extra: ml
+ Requires-Dist: scikit-learn; extra == "ml"
+ Requires-Dist: joblib; extra == "ml"
  Provides-Extra: dev
  Requires-Dist: pytest; extra == "dev"

@@ -51,6 +56,25 @@ Then, for each remote it pushes the tags and the commits.

  This section describes generic tools that could not be put in a specific category, but tend to be useful.

+ ## Hashing
+
+ The snippet below:
+
+ ```python
+ from dmu.generic import hashing
+
+ obj = [1, 'name', [1, 'sub', 'list'], {'x' : 1}]
+ val = hashing.hash_object(obj)
+ ```
+
+ will:
+
+ - Make the input object into a JSON string
+ - Encode it to utf-8
+ - Make a 64-character hash out of it
+
+ in two lines, thus keeping the user's code clean.
+
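For reference, the recipe listed above amounts to something like the following sketch (assuming a SHA-256 digest; the actual implementation of `hash_object` in `dmu.generic.hashing` may differ):

```python
# Sketch only: JSON-serialize, encode to utf-8, take a 64-character hex digest.
import json
import hashlib

def hash_object_sketch(obj) -> str:
    text = json.dumps(obj)                    # make the object into a JSON string
    data = text.encode('utf-8')               # encode it to utf-8
    return hashlib.sha256(data).hexdigest()   # 64-character hash

val = hash_object_sketch([1, 'name', [1, 'sub', 'list'], {'x': 1}])
assert len(val) == 64
```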
  ## Timer

  In order to benchmark functions do:
@@ -67,9 +91,9 @@ def fun():
  fun()
  ```

- ## JSON dumper
+ ## JSON dumper and loader

- The following lines will dump data (dictionaries, lists, etc) to a JSON file:
+ The following lines will dump data (dictionaries, lists, etc) to a JSON file and load it back:

  ```python
  import dmu.generic.utilities as gut
@@ -77,8 +101,11 @@ import dmu.generic.utilities as gut
  data = [1,2,3,4]

  gut.dump_json(data, '/tmp/list.json')
+ data = gut.load_json('/tmp/list.json')
  ```

+ and it's meant to allow the user to bypass all the boilerplate and keep their code brief.
+
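The boilerplate being bypassed is roughly the following (a sketch of the plain standard-library route, not the package's code):

```python
# What dump_json/load_json roughly replace: open a file, json.dump, json.load.
import json

data = [1, 2, 3, 4]

with open('/tmp/list.json', 'w', encoding='utf-8') as ofile:
    json.dump(data, ofile)

with open('/tmp/list.json', encoding='utf-8') as ifile:
    data = json.load(ifile)
```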
  # Physics

  ## Truth matching
@@ -132,7 +159,8 @@ from dmu.stats.model_factory import ModelFactory

  l_pdf = ['cbr'] + 2 * ['cbl']
  l_shr = ['mu', 'sg']
- mod = ModelFactory(obs = Data.obs, l_pdf = l_pdf, l_shared=l_shr)
+ d_fix = {'al_cbl' : 3, 'nr_cbr' : 1} # This is optional and will fix two parameters whose names start with the keys
+ mod = ModelFactory(obs = Data.obs, l_pdf = l_pdf, l_shared=l_shr, d_fix=d_fix)
  pdf = mod.get_pdf()
  ```

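A sketch of what fixing parameters by name prefix amounts to, assuming zfit-style parameters with `name`, `set_value` and `floating` attributes (illustrative only, not `ModelFactory`'s code):

```python
# Illustrative only: fix any parameter of the pdf built above whose name
# starts with one of the keys of d_fix, and set it to the given value.
d_fix = {'al_cbl': 3, 'nr_cbr': 1}

for par in pdf.get_params():
    for prefix, value in d_fix.items():
        if par.name.startswith(prefix):
            par.set_value(value)
            par.floating = False
```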
@@ -145,10 +173,40 @@ pol1: Polynomial of degree 1
  pol2: Polynomial of degree 2
  cbr : CrystallBall with right tail
  cbl : CrystallBall with left tail
- gauss : Gaussian
+ gauss : Gaussian
  dscb : Double sided CrystallBall
  ```

+ ### Model building with reparametrizations
+
+ In order to introduce reparametrizations for the means and the resolutions, such that:
+
+ $\mu\to\mu+\Delta\mu$
+ $\sigma\to\sigma\cdot s_{\sigma}$
+
+ where the reparametrized $\mu$ and $\sigma$ are constant, while the scale and resolution are floating, do:
+
+ ```python
+ import zfit
+ from dmu.stats.model_factory import ModelFactory
+
+ l_shr = ['mu', 'sg']
+ l_flt = []
+ d_rep = {'mu' : 'scale', 'sg' : 'reso'}
+ obs = zfit.Space('mass', limits=(5080, 5680))
+
+ mod = ModelFactory(
+ preffix = name,
+ obs = obs,
+ l_pdf = l_name,
+ d_rep = d_rep,
+ l_shared= l_shr,
+ l_float = l_flt)
+ pdf = mod.get_pdf()
+ ```
+
+ Here, the floating parameters **should not** be the same as the reparametrized ones.
+
  ### Printing PDFs

  One can print a zfit PDF by doing:
@@ -427,7 +485,7 @@ rdf_bkg = _get_rdf(kind='bkg')
  cfg = _get_config()

  obj= TrainMva(sig=rdf_sig, bkg=rdf_bkg, cfg=cfg)
- obj.run()
+ obj.run(skip_fit=False) # By default it is False; if True, it will only make plots of the features
  ```

  where the settings for the training go in a config dictionary, which when written to YAML looks like:
@@ -549,9 +607,61 @@ When evaluating the model with real data, problems might occur, we deal with the
  ```python
  model.cfg
  ```
- - For whatever entries that are still NaN, they will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
+ - Features that are still NaN will be _patched_ with zeros when evaluated. However, the returned probabilities will be
  saved as -1. I.e. entries with NaNs will have probabilities of -1.

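As an illustration of this NaN policy, a sketch with numpy and a scikit-learn-style classifier (not the package's exact code; the `model` and the column layout are assumptions):

```python
# Sketch: patch remaining NaN features with zeros for evaluation, then
# overwrite the probabilities of those entries with -1.
import numpy as np

def predict_with_nan_policy(model, features: np.ndarray) -> np.ndarray:
    has_nan = np.isnan(features).any(axis=1)      # rows that still contain NaNs
    patched = np.nan_to_num(features, nan=0.0)    # zero-patched copy for evaluation
    prob    = model.predict_proba(patched)[:, 1]  # signal probability
    prob[has_nan] = -1                            # flag NaN rows with -1
    return prob
```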
+ ## Diagnostics
+
+ To run diagnostics on the trained model do:
+
+ ```python
+ from dmu.ml.cv_diagnostics import CVDiagnostics
+
+ # Where l_model is the list of models and cfg is a dictionary with the config
+ cvd = CVDiagnostics(models=l_model, rdf=rdf, cfg=cfg)
+ cvd.run()
+ ```
+
+ The configuration can be loaded from a YAML file and would look like:
+
+ ```yaml
+ # Directory where plots will go
+ output : /tmp/tests/dmu/ml/cv_diagnostics/overlay
+ # Optional, will assume that the target is already in the input dataframe
+ # and will use it, instead of evaluating models
+ score_from_rdf : mva
+ correlations:
+ # Variables with respect to which the correlations with the features will be measured
+ target :
+ name : mass
+ overlay :
+ wp :
+ - 0.2
+ - 0.5
+ - 0.7
+ - 0.9
+ general:
+ size : [20, 10]
+ saving:
+ plt_dir : /tmp/tests/dmu/ml/cv_diagnostics/from_rdf
+ plots:
+ z :
+ binning : [1000, 4000, 30]
+ yscale : 'linear'
+ labels : ['mass', 'Entries']
+ normalized : true
+ styling :
+ linestyle: '-' # By default there is no line, just markers
+ methods:
+ - Pearson
+ - Kendall-$\tau$
+ figure:
+ title: Scores from file
+ size : [10, 8]
+ xlabelsize: 18 # Controls the size of the x axis labels. By default 30
+ rotate : 60 # Will rotate xlabels by 60 degrees
+ ```
+
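The correlation part of these diagnostics can be reproduced by hand with pandas, e.g. as in the sketch below (the column names `mva` and `mass` follow the config above and are otherwise assumptions):

```python
# Sketch: correlation of each feature with the target ('mass') for entries
# above a given working point in the MVA score.
import pandas as pd

def feature_target_correlations(df: pd.DataFrame, wp: float) -> pd.Series:
    sel      = df[df['mva'] > wp]                # apply the working point
    features = sel.drop(columns=['mva', 'mass'])
    # method can be 'pearson' or 'kendall', matching the methods list above
    return features.corrwith(sel['mass'], method='pearson')
```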
  # Pandas dataframes

  ## Utilities
@@ -582,6 +692,19 @@ put.df_to_tex(df,
  caption = 'some caption')
  ```

+ ### Dataframe to and from YAML
+
+ This extends the existing JSON functionality
+
+ ```python
+ import dmu.pdataframe.utilities as put
+
+ df_1 = _get_df()
+ put.to_yaml(df_1, yml_path)
+ df_2 = put.from_yaml(yml_path)
+ ```
+
+ and is meant to be less verbose than doing it through the YAML module.
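The more verbose route through the YAML module that this avoids would look roughly like the following sketch (file path and column names are made up):

```python
# Sketch of the plain-PyYAML round trip that to_yaml/from_yaml are meant to replace.
import yaml
import pandas as pd

df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

with open('/tmp/frame.yaml', 'w', encoding='utf-8') as ofile:
    yaml.safe_dump(df_1.to_dict(orient='list'), ofile)

with open('/tmp/frame.yaml', encoding='utf-8') as ifile:
    df_2 = pd.DataFrame(yaml.safe_load(ifile))
```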
  # Rdataframes

  These are utility functions meant to be used with ROOT dataframes.
@@ -707,6 +830,11 @@ plots:
  labels : ['x', 'Entries'] # Labels are optional, will use varname and Entries as labels if not present
  title : 'some title can be added for different variable plots'
  name : 'plot_of_x' # This will ensure that one gets plot_of_x.png as a result, if missing x.png would be saved
+ # Can add styling to specific plots, this should be the argument of
+ # hist.plot(...)
+ styling :
+ label : x
+ linestyle: '-'
  y :
  binning : [-5.0, 8.0, 40]
  yscale : 'linear'
@@ -730,6 +858,47 @@ stats:

  it's up to the user to build this dictionary and load it.

+ ### Plugins
+
+ Extra functionality can be `plugged` into the code by using the plugins section, as in:
+
+ #### FWHM
+ ```yaml
+ plugin:
+ fwhm:
+ # Can control each variable fit separately
+ x :
+ plot : true
+ obs : [-2, 4]
+ format : FWHM={:.3f}
+ add_std: True
+ y :
+ plot : true
+ obs : [-4, 8]
+ format : FWHM={:.3f}
+ add_std: True
+ ```
+
+ where this section will:
+
+ - Use a KDE to fit the distribution and plot it on top of the histogram
+ - Add the value of the Full Width at Half Maximum to the title of each distribution, with a specific formatting.
+
+ #### stats
+
+ ```yaml
+ plugin:
+ stats:
+ x :
+ mean : $\mu$={:.2f}
+ rms : $\sigma$={:.2f}
+ sum : $\Sigma$={:.0f}
+ ```
+
+ This can be used to print statistics (the mean, the RMS and the weighted sum of entries) for each distribution.
+
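A minimal sketch of the FWHM-from-KDE idea used by the `fwhm` plugin above, written with scipy's `gaussian_kde` (this is not necessarily how the plugin computes it):

```python
# Sketch: fit a KDE to the sample and take the width of the region where the
# density is at least half of its maximum as the FWHM.
import numpy as np
from scipy.stats import gaussian_kde

def fwhm_from_kde(sample: np.ndarray, obs=(-2, 4), npoints=1000) -> float:
    kde   = gaussian_kde(sample)
    x     = np.linspace(obs[0], obs[1], npoints)
    y     = kde(x)
    above = x[y >= 0.5 * y.max()]
    return float(above.max() - above.min())

val = fwhm_from_kde(np.random.normal(loc=1, scale=0.5, size=10_000))
print(f'FWHM={val:.3f}')   # same formatting as in the config above
```

Likewise, the quantities printed by the `stats` plugin can be computed with numpy as in the sketch below (per-entry weights are assumed):

```python
# Sketch of the printed statistics: weighted mean, weighted RMS and sum of weights.
import numpy as np

def weighted_stats(val: np.ndarray, wgt: np.ndarray):
    mean = np.average(val, weights=wgt)
    rms  = np.sqrt(np.average((val - mean) ** 2, weights=wgt))
    return mean, rms, wgt.sum()

mean, rms, total = weighted_stats(np.random.normal(size=1000), np.ones(1000))
print(f'mean={mean:.2f} rms={rms:.2f} sum={total:.0f}')
```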
  ## 2D plots

  For the 2D case it would look like:
data_manipulation_utilities-0.2.5/PKG-INFO → data_manipulation_utilities-0.2.7/README.md

@@ -1,23 +1,3 @@
- Metadata-Version: 2.2
- Name: data_manipulation_utilities
- Version: 0.2.5
- Description-Content-Type: text/markdown
- Requires-Dist: logzero
- Requires-Dist: PyYAML
- Requires-Dist: scipy
- Requires-Dist: awkward
- Requires-Dist: tqdm
- Requires-Dist: joblib
- Requires-Dist: scikit-learn
- Requires-Dist: toml
- Requires-Dist: numpy
- Requires-Dist: matplotlib
- Requires-Dist: mplhep
- Requires-Dist: hist[plot]
- Requires-Dist: pandas
- Provides-Extra: dev
- Requires-Dist: pytest; extra == "dev"
-
  # D(ata) M(anipulation) U(tilities)

  These are tools that can be used for different data analysis tasks.
@@ -51,6 +31,25 @@ Then, for each remote it pushes the tags and the commits.

  This section describes generic tools that could not be put in a specific category, but tend to be useful.

+ ## Hashing
+
+ The snippet below:
+
+ ```python
+ from dmu.generic import hashing
+
+ obj = [1, 'name', [1, 'sub', 'list'], {'x' : 1}]
+ val = hashing.hash_object(obj)
+ ```
+
+ will:
+
+ - Make the input object into a JSON string
+ - Encode it to utf-8
+ - Make a 64-character hash out of it
+
+ in two lines, thus keeping the user's code clean.
+
  ## Timer

  In order to benchmark functions do:
@@ -67,9 +66,9 @@ def fun():
  fun()
  ```

- ## JSON dumper
+ ## JSON dumper and loader

- The following lines will dump data (dictionaries, lists, etc) to a JSON file:
+ The following lines will dump data (dictionaries, lists, etc) to a JSON file and load it back:

  ```python
  import dmu.generic.utilities as gut
@@ -77,8 +76,11 @@ import dmu.generic.utilities as gut
  data = [1,2,3,4]

  gut.dump_json(data, '/tmp/list.json')
+ data = gut.load_json('/tmp/list.json')
  ```

+ and it's meant to allow the user to bypass all the boilerplate and keep their code brief.
+
  # Physics

  ## Truth matching
@@ -132,7 +134,8 @@ from dmu.stats.model_factory import ModelFactory

  l_pdf = ['cbr'] + 2 * ['cbl']
  l_shr = ['mu', 'sg']
- mod = ModelFactory(obs = Data.obs, l_pdf = l_pdf, l_shared=l_shr)
+ d_fix = {'al_cbl' : 3, 'nr_cbr' : 1} # This is optional and will fix two parameters whose names start with the keys
+ mod = ModelFactory(obs = Data.obs, l_pdf = l_pdf, l_shared=l_shr, d_fix=d_fix)
  pdf = mod.get_pdf()
  ```

@@ -145,10 +148,40 @@ pol1: Polynomial of degree 1
  pol2: Polynomial of degree 2
  cbr : CrystallBall with right tail
  cbl : CrystallBall with left tail
- gauss : Gaussian
+ gauss : Gaussian
  dscb : Double sided CrystallBall
  ```

+ ### Model building with reparametrizations
+
+ In order to introduce reparametrizations for the means and the resolutions, such that:
+
+ $\mu\to\mu+\Delta\mu$
+ $\sigma\to\sigma\cdot s_{\sigma}$
+
+ where the reparametrized $\mu$ and $\sigma$ are constant, while the scale and resolution are floating, do:
+
+ ```python
+ import zfit
+ from dmu.stats.model_factory import ModelFactory
+
+ l_shr = ['mu', 'sg']
+ l_flt = []
+ d_rep = {'mu' : 'scale', 'sg' : 'reso'}
+ obs = zfit.Space('mass', limits=(5080, 5680))
+
+ mod = ModelFactory(
+ preffix = name,
+ obs = obs,
+ l_pdf = l_name,
+ d_rep = d_rep,
+ l_shared= l_shr,
+ l_float = l_flt)
+ pdf = mod.get_pdf()
+ ```
+
+ Here, the floating parameters **should not** be the same as the reparametrized ones.
+
  ### Printing PDFs

  One can print a zfit PDF by doing:
@@ -427,7 +460,7 @@ rdf_bkg = _get_rdf(kind='bkg')
  cfg = _get_config()

  obj= TrainMva(sig=rdf_sig, bkg=rdf_bkg, cfg=cfg)
- obj.run()
+ obj.run(skip_fit=False) # By default it is False; if True, it will only make plots of the features
  ```

  where the settings for the training go in a config dictionary, which when written to YAML looks like:
@@ -549,9 +582,61 @@ When evaluating the model with real data, problems might occur, we deal with the
  ```python
  model.cfg
  ```
- - For whatever entries that are still NaN, they will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
+ - Features that are still NaN will be _patched_ with zeros when evaluated. However, the returned probabilities will be
  saved as -1. I.e. entries with NaNs will have probabilities of -1.

+ ## Diagnostics
+
+ To run diagnostics on the trained model do:
+
+ ```python
+ from dmu.ml.cv_diagnostics import CVDiagnostics
+
+ # Where l_model is the list of models and cfg is a dictionary with the config
+ cvd = CVDiagnostics(models=l_model, rdf=rdf, cfg=cfg)
+ cvd.run()
+ ```
+
+ The configuration can be loaded from a YAML file and would look like:
+
+ ```yaml
+ # Directory where plots will go
+ output : /tmp/tests/dmu/ml/cv_diagnostics/overlay
+ # Optional, will assume that the target is already in the input dataframe
+ # and will use it, instead of evaluating models
+ score_from_rdf : mva
+ correlations:
+ # Variables with respect to which the correlations with the features will be measured
+ target :
+ name : mass
+ overlay :
+ wp :
+ - 0.2
+ - 0.5
+ - 0.7
+ - 0.9
+ general:
+ size : [20, 10]
+ saving:
+ plt_dir : /tmp/tests/dmu/ml/cv_diagnostics/from_rdf
+ plots:
+ z :
+ binning : [1000, 4000, 30]
+ yscale : 'linear'
+ labels : ['mass', 'Entries']
+ normalized : true
+ styling :
+ linestyle: '-' # By default there is no line, just markers
+ methods:
+ - Pearson
+ - Kendall-$\tau$
+ figure:
+ title: Scores from file
+ size : [10, 8]
+ xlabelsize: 18 # Controls the size of the x axis labels. By default 30
+ rotate : 60 # Will rotate xlabels by 60 degrees
+ ```
+
  # Pandas dataframes

  ## Utilities
@@ -582,6 +667,19 @@ put.df_to_tex(df,
  caption = 'some caption')
  ```

+ ### Dataframe to and from YAML
+
+ This extends the existing JSON functionality
+
+ ```python
+ import dmu.pdataframe.utilities as put
+
+ df_1 = _get_df()
+ put.to_yaml(df_1, yml_path)
+ df_2 = put.from_yaml(yml_path)
+ ```
+
+ and is meant to be less verbose than doing it through the YAML module.
  # Rdataframes

  These are utility functions meant to be used with ROOT dataframes.
@@ -707,6 +805,11 @@ plots:
  labels : ['x', 'Entries'] # Labels are optional, will use varname and Entries as labels if not present
  title : 'some title can be added for different variable plots'
  name : 'plot_of_x' # This will ensure that one gets plot_of_x.png as a result, if missing x.png would be saved
+ # Can add styling to specific plots, this should be the argument of
+ # hist.plot(...)
+ styling :
+ label : x
+ linestyle: '-'
  y :
  binning : [-5.0, 8.0, 40]
  yscale : 'linear'
@@ -730,6 +833,47 @@ stats:

  it's up to the user to build this dictionary and load it.

+ ### Plugins
+
+ Extra functionality can be `plugged` into the code by using the plugins section, as in:
+
+ #### FWHM
+ ```yaml
+ plugin:
+ fwhm:
+ # Can control each variable fit separately
+ x :
+ plot : true
+ obs : [-2, 4]
+ format : FWHM={:.3f}
+ add_std: True
+ y :
+ plot : true
+ obs : [-4, 8]
+ format : FWHM={:.3f}
+ add_std: True
+ ```
+
+ where this section will:
+
+ - Use a KDE to fit the distribution and plot it on top of the histogram
+ - Add the value of the Full Width at Half Maximum to the title of each distribution, with a specific formatting.
+
+ #### stats
+
+ ```yaml
+ plugin:
+ stats:
+ x :
+ mean : $\mu$={:.2f}
+ rms : $\sigma$={:.2f}
+ sum : $\Sigma$={:.0f}
+ ```
+
+ This can be used to print statistics (the mean, the RMS and the weighted sum of entries) for each distribution.
+
  ## 2D plots

  For the 2D case it would look like:
{data_manipulation_utilities-0.2.5 → data_manipulation_utilities-0.2.7}/pyproject.toml

@@ -1,6 +1,6 @@
  [project]
  name = 'data_manipulation_utilities'
- version = '0.2.5'
+ version = '0.2.7'
  readme = 'README.md'
  dependencies= [
  'logzero',
@@ -8,8 +8,7 @@ dependencies= [
  'scipy',
  'awkward',
  'tqdm',
- 'joblib',
- 'scikit-learn',
+ 'numpy',
  'toml',
  'numpy',
  'matplotlib',
@@ -18,6 +17,9 @@ dependencies= [
  'pandas']

  [project.optional-dependencies]
+ # Use latest tensorflow allowed by zfit
+ fit = ['zfit','tensorflow==2.18.0']
+ ml = ['scikit-learn', 'joblib']
  dev = ['pytest']

  [tools.setuptools.packages.find]