PyPI - data-manipulation-utilities - Versions diffs - 0.2.3__tar.gz → 0.2.5__tar.gz - Mend

data-manipulation-utilities 0.2.3tar.gz → 0.2.5tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (61)

{data_manipulation_utilities-0.2.3 → data_manipulation_utilities-0.2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: data_manipulation_utilities
-Version: 0.2.3
+Version: 0.2.5
 Description-Content-Type: text/markdown
 Requires-Dist: logzero
 Requires-Dist: PyYAML
@@ -26,7 +26,7 @@ These are tools that can be used for different data analysis tasks.
 ## Pushing
-From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
+From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
 using a `pyproject.toml` file, run:
 ```bash
@@ -36,10 +36,10 @@ publish
 such that:
 1. The `pyproject.toml` file is checked and the version of the project is extracted.
-1. If a tag named as the version exists move to the steps below.
+1. If a tag named as the version exists move to the steps below.
 1. If it does not, make a new tag with the name as the version
-Then, for each remote it pushes the tags and the commits.
+Then, for each remote it pushes the tags and the commits.
 *Why?*
@@ -137,7 +137,17 @@ pdf   = mod.get_pdf()
 ```
 where the model is a sum of three `CrystallBall` PDFs, one with a right tail and two with a left tail.
-The `mu` and `sg` parameters are shared.
+The `mu` and `sg` parameters are shared. The elementary components that can be plugged are:
+```
+exp: Exponential
+pol1: Polynomial of degree 1
+pol2: Polynomial of degree 2
+cbr : CrystallBall with right tail
+cbl : CrystallBall with left tail
+gauss : Gaussian
+dscb : Double sided CrystallBall
+```
 ### Printing PDFs
@@ -299,7 +309,7 @@ this will:
 - Try fitting at most 10 times
 - After each fit, calculate the goodness of fit (in this case the p-value)
 - Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
-- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
+- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
 randomize the parameters and try again.
 - If the desired goodness of fit has not been achieved, pick the best result.
 - Return the `FitResult` object and set the PDF to the final fit result.
@@ -337,11 +347,11 @@ bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
 nbk = zfit.Parameter('nbk', 1000, 0, 10000)
 ebkg= bkg.create_extended(nbk, name='expo')
-# Add them
+# Add them
 pdf = zfit.pdf.SumPDF([ebkg, esig])
 sam = pdf.create_sampler()
-# Plot them
+# Plot them
 obj   = ZFitPlotter(data=sam, model=pdf)
 d_leg = {'gauss': 'New Gauss'}
 obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
@@ -353,7 +363,7 @@ obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
 this class supports:
 - Handling title, legend, plots size.
-- Adding pulls.
+- Adding pulls.
 - Stacking and overlaying of PDFs.
 - Blinding.
@@ -434,7 +444,7 @@ dataset:
     nan:
         x : 0
         y : 0
-        z : -999
+        z : -999
 training :
     nfold    : 10
     features : [x, y, z]
@@ -497,7 +507,7 @@ When training on real data, several things might go wrong and the code will try
 will end up in different folds. The tool checks whether a model is evaluated for an entry that was used for training and raises an exception. Thus, repeated
 entries will be removed before training.
-- **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
+- **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
     - Can use the `nan` section shown above to replace `NaN` values with something else
     - For whatever remains we remove the entries from the training.
@@ -674,6 +684,9 @@ ptr.run()
 where the config dictionary `cfg_dat` in YAML would look like:
 ```yaml
+general:
+    # This will set the figure size
+    size : [20, 10]
 selection:
     #Will do at most 50K random entries. Will only happen if the dataset has more than 50K entries
     max_ran_entries : 50000
@@ -703,6 +716,16 @@ plots:
         yscale     : 'linear'
         labels     : ['x + y', 'Entries']
         normalized : true #This should normalize to the area
+# Some vertical dashed lines are drawn by default
+# If you see them, you can turn them off with this
+style:
+  skip_lines : true
+  # This can pass arguments to legend making function `plt.legend()` in matplotlib
+  legend:
+    # The line below would place the legend outside the figure to avoid overlaps with the histogram
+    bbox_to_anchor : [1.2, 1]
+stats:
+  nentries : '{:.2e}' # This will add number of entries in legend box
 ```
 it's up to the user to build this dictionary and load it.
@@ -724,14 +747,19 @@ The config would look like:
 ```yaml
 saving:
     plt_dir : tests/plotting/2d
+selection:
+  cuts:
+    xlow : x > -1.5
 general:
     size : [20, 10]
 plots_2d:
     # Column x and y
     # Name of column where weights are, null for not weights
     # Name of output plot, e.g. xy_x.png
-    - [x, y, weights, 'xy_w']
-    - [x, y,    null, 'xy_r']
+    # Bool signaling to use log scale for z axis
+    - [x, y, weights, 'xy_w', false]
+    - [x, y,    null, 'xy_r', false]
+    - [x, y,    null, 'xy_l',  true]
 axes:
     x :
         binning : [-5.0, 8.0, 40]
@@ -823,7 +851,7 @@ Directory/Treename
     B_ENDVERTEX_CHI2DOF           Double_t
 ```
-## Comparing ROOT files
+## Comparing ROOT files
 Given two ROOT files the command below:
@@ -856,6 +884,40 @@ Trees only in file_2.root:
   - Hlt2RD_BsToPhiMuMu_MVA/DecayTree
 ```
+# File system
+## Versions
+The utilities below allow the user to deal with versioned files and directories
+```python
+from dmu.generic.version_management import get_last_version
+from dmu.generic.version_management import get_next_version
+from dmu.generic.version_management import get_latest_file
+# get_next_version will take a version and provide the next one, e.g.
+get_next_version('v1')           # -> 'v2'
+get_next_version('v1.1')         # -> 'v2.1'
+get_next_version('v10.1')        # -> 'v11.1'
+get_next_version('/a/b/c/v1')    # -> '/a/b/c/v2'
+get_next_version('/a/b/c/v1.1')  # -> '/a/b/c/v2.1'
+get_next_version('/a/b/c/v10.1') # -> '/a/b/c/v11.1'
+# `get_latest_file` will return the path to the file with the highest version
+# in the `dir_path` directory that matches a wildcard, e.g.:
+last_file = get_latest_file(dir_path = file_dir, wc='name_*.txt')
+# `get_last_version` will return the string with the latest version
+# of directories in `dir_path`, e.g.:
+oversion=get_last_version(dir_path=dir_path, version_only=True)  # This will return only the version, e.g. v3.2
+oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
+```
+The function above should work for numeric (e.g. `v1.2`) and non-numeric (e.g. `va`, `vb`) versions.
 # Text manipulation
 ## Transformations
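
The `get_next_version` examples added in the hunk above suggest that only the leading number after `v` is incremented, while any minor part and any directory prefix are kept. The sketch below reproduces that documented behaviour for numeric versions only. It is not the code in `dmu.generic.version_management`; the function name `next_version_sketch`, the regular expression and the error handling are assumptions made for illustration, and non-numeric versions such as `va` are deliberately not covered.

```python
import posixpath
import re


def next_version_sketch(version: str) -> str:
    """Hypothetical stand-in for get_next_version, numeric versions only."""
    # Split off an optional directory prefix, e.g. '/a/b/c' from '/a/b/c/v1.1'
    head, tail = posixpath.split(version)

    # Accept 'v<major>' or 'v<major>.<minor>'; anything else is rejected
    match = re.fullmatch(r'v(\d+)(\.\d+)?', tail)
    if match is None:
        raise ValueError(f'Cannot parse a numeric version from: {version}')

    major = int(match.group(1)) + 1   # only the leading number is bumped
    minor = match.group(2) or ''      # minor part, if any, is kept unchanged

    return posixpath.join(head, f'v{major}{minor}')


if __name__ == '__main__':
    for name in ['v1', 'v1.1', 'v10.1', '/a/b/c/v10.1']:
        print(name, '->', next_version_sketch(name))
```

`posixpath` is used instead of `os.path` so that the path examples from the README give the same result on every platform.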

{data_manipulation_utilities-0.2.3 → data_manipulation_utilities-0.2.5}/README.md RENAMED

@@ -6,7 +6,7 @@ These are tools that can be used for different data analysis tasks.
 ## Pushing
-From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
+From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
 using a `pyproject.toml` file, run:
 ```bash
@@ -16,10 +16,10 @@ publish
 such that:
 1. The `pyproject.toml` file is checked and the version of the project is extracted.
-1. If a tag named as the version exists move to the steps below.
+1. If a tag named as the version exists move to the steps below.
 1. If it does not, make a new tag with the name as the version
-Then, for each remote it pushes the tags and the commits.
+Then, for each remote it pushes the tags and the commits.
 *Why?*
@@ -117,7 +117,17 @@ pdf   = mod.get_pdf()
 ```
 where the model is a sum of three `CrystallBall` PDFs, one with a right tail and two with a left tail.
-The `mu` and `sg` parameters are shared.
+The `mu` and `sg` parameters are shared. The elementary components that can be plugged are:
+```
+exp: Exponential
+pol1: Polynomial of degree 1
+pol2: Polynomial of degree 2
+cbr : CrystallBall with right tail
+cbl : CrystallBall with left tail
+gauss : Gaussian
+dscb : Double sided CrystallBall
+```
 ### Printing PDFs
@@ -279,7 +289,7 @@ this will:
 - Try fitting at most 10 times
 - After each fit, calculate the goodness of fit (in this case the p-value)
 - Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
-- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
+- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
 randomize the parameters and try again.
 - If the desired goodness of fit has not been achieved, pick the best result.
 - Return the `FitResult` object and set the PDF to the final fit result.
@@ -317,11 +327,11 @@ bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
 nbk = zfit.Parameter('nbk', 1000, 0, 10000)
 ebkg= bkg.create_extended(nbk, name='expo')
-# Add them
+# Add them
 pdf = zfit.pdf.SumPDF([ebkg, esig])
 sam = pdf.create_sampler()
-# Plot them
+# Plot them
 obj   = ZFitPlotter(data=sam, model=pdf)
 d_leg = {'gauss': 'New Gauss'}
 obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
@@ -333,7 +343,7 @@ obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
 this class supports:
 - Handling title, legend, plots size.
-- Adding pulls.
+- Adding pulls.
 - Stacking and overlaying of PDFs.
 - Blinding.
@@ -414,7 +424,7 @@ dataset:
     nan:
         x : 0
         y : 0
-        z : -999
+        z : -999
 training :
     nfold    : 10
     features : [x, y, z]
@@ -477,7 +487,7 @@ When training on real data, several things might go wrong and the code will try
 will end up in different folds. The tool checks whether a model is evaluated for an entry that was used for training and raises an exception. Thus, repeated
 entries will be removed before training.
-- **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
+- **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
     - Can use the `nan` section shown above to replace `NaN` values with something else
     - For whatever remains we remove the entries from the training.
@@ -654,6 +664,9 @@ ptr.run()
 where the config dictionary `cfg_dat` in YAML would look like:
 ```yaml
+general:
+    # This will set the figure size
+    size : [20, 10]
 selection:
     #Will do at most 50K random entries. Will only happen if the dataset has more than 50K entries
     max_ran_entries : 50000
@@ -683,6 +696,16 @@ plots:
         yscale     : 'linear'
         labels     : ['x + y', 'Entries']
         normalized : true #This should normalize to the area
+# Some vertical dashed lines are drawn by default
+# If you see them, you can turn them off with this
+style:
+  skip_lines : true
+  # This can pass arguments to legend making function `plt.legend()` in matplotlib
+  legend:
+    # The line below would place the legend outside the figure to avoid overlaps with the histogram
+    bbox_to_anchor : [1.2, 1]
+stats:
+  nentries : '{:.2e}' # This will add number of entries in legend box
 ```
 it's up to the user to build this dictionary and load it.
@@ -704,14 +727,19 @@ The config would look like:
 ```yaml
 saving:
     plt_dir : tests/plotting/2d
+selection:
+  cuts:
+    xlow : x > -1.5
 general:
     size : [20, 10]
 plots_2d:
     # Column x and y
     # Name of column where weights are, null for not weights
     # Name of output plot, e.g. xy_x.png
-    - [x, y, weights, 'xy_w']
-    - [x, y,    null, 'xy_r']
+    # Bool signaling to use log scale for z axis
+    - [x, y, weights, 'xy_w', false]
+    - [x, y,    null, 'xy_r', false]
+    - [x, y,    null, 'xy_l',  true]
 axes:
     x :
         binning : [-5.0, 8.0, 40]
@@ -803,7 +831,7 @@ Directory/Treename
     B_ENDVERTEX_CHI2DOF           Double_t
 ```
-## Comparing ROOT files
+## Comparing ROOT files
 Given two ROOT files the command below:
@@ -836,6 +864,40 @@ Trees only in file_2.root:
   - Hlt2RD_BsToPhiMuMu_MVA/DecayTree
 ```
+# File system
+## Versions
+The utilities below allow the user to deal with versioned files and directories
+```python
+from dmu.generic.version_management import get_last_version
+from dmu.generic.version_management import get_next_version
+from dmu.generic.version_management import get_latest_file
+# get_next_version will take a version and provide the next one, e.g.
+get_next_version('v1')           # -> 'v2'
+get_next_version('v1.1')         # -> 'v2.1'
+get_next_version('v10.1')        # -> 'v11.1'
+get_next_version('/a/b/c/v1')    # -> '/a/b/c/v2'
+get_next_version('/a/b/c/v1.1')  # -> '/a/b/c/v2.1'
+get_next_version('/a/b/c/v10.1') # -> '/a/b/c/v11.1'
+# `get_latest_file` will return the path to the file with the highest version
+# in the `dir_path` directory that matches a wildcard, e.g.:
+last_file = get_latest_file(dir_path = file_dir, wc='name_*.txt')
+# `get_last_version` will return the string with the latest version
+# of directories in `dir_path`, e.g.:
+oversion=get_last_version(dir_path=dir_path, version_only=True)  # This will return only the version, e.g. v3.2
+oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
+```
+The function above should work for numeric (e.g. `v1.2`) and non-numeric (e.g. `va`, `vb`) versions.
 # Text manipulation
 ## Transformations

{data_manipulation_utilities-0.2.3 → data_manipulation_utilities-0.2.5}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name        = 'data_manipulation_utilities'
-version     = '0.2.3'
+version     = '0.2.5'
 readme      = 'README.md'
 dependencies= [
 'logzero',

{data_manipulation_utilities-0.2.3 → data_manipulation_utilities-0.2.5}/src/data_manipulation_utilities.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: data_manipulation_utilities
-Version: 0.2.3
+Version: 0.2.5
 Description-Content-Type: text/markdown
 Requires-Dist: logzero
 Requires-Dist: PyYAML
@@ -26,7 +26,7 @@ These are tools that can be used for different data analysis tasks.
 ## Pushing
-From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
+From the root directory of a version controlled project (i.e. a directory with the `.git` subdirectory)
 using a `pyproject.toml` file, run:
 ```bash
@@ -36,10 +36,10 @@ publish
 such that:
 1. The `pyproject.toml` file is checked and the version of the project is extracted.
-1. If a tag named as the version exists move to the steps below.
+1. If a tag named as the version exists move to the steps below.
 1. If it does not, make a new tag with the name as the version
-Then, for each remote it pushes the tags and the commits.
+Then, for each remote it pushes the tags and the commits.
 *Why?*
@@ -137,7 +137,17 @@ pdf   = mod.get_pdf()
 ```
 where the model is a sum of three `CrystallBall` PDFs, one with a right tail and two with a left tail.
-The `mu` and `sg` parameters are shared.
+The `mu` and `sg` parameters are shared. The elementary components that can be plugged are:
+```
+exp: Exponential
+pol1: Polynomial of degree 1
+pol2: Polynomial of degree 2
+cbr : CrystallBall with right tail
+cbl : CrystallBall with left tail
+gauss : Gaussian
+dscb : Double sided CrystallBall
+```
 ### Printing PDFs
@@ -299,7 +309,7 @@ this will:
 - Try fitting at most 10 times
 - After each fit, calculate the goodness of fit (in this case the p-value)
 - Stop when the number of tries has been exhausted or the p-value reached is higher than `0.05`
-- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
+- If the fit has not succeeded because of convergence, validity or goodness of fit issues,
 randomize the parameters and try again.
 - If the desired goodness of fit has not been achieved, pick the best result.
 - Return the `FitResult` object and set the PDF to the final fit result.
@@ -337,11 +347,11 @@ bkg = zfit.pdf.Exponential(obs=obs, lam=lm)
 nbk = zfit.Parameter('nbk', 1000, 0, 10000)
 ebkg= bkg.create_extended(nbk, name='expo')
-# Add them
+# Add them
 pdf = zfit.pdf.SumPDF([ebkg, esig])
 sam = pdf.create_sampler()
-# Plot them
+# Plot them
 obj   = ZFitPlotter(data=sam, model=pdf)
 d_leg = {'gauss': 'New Gauss'}
 obj.plot(nbins=50, d_leg=d_leg, stacked=True, plot_range=(0, 10), ext_text='Extra text here')
@@ -353,7 +363,7 @@ obj.axs[1].plot([0, 10], [0, 0], linestyle='--', color='black')
 this class supports:
 - Handling title, legend, plots size.
-- Adding pulls.
+- Adding pulls.
 - Stacking and overlaying of PDFs.
 - Blinding.
@@ -434,7 +444,7 @@ dataset:
     nan:
         x : 0
         y : 0
-        z : -999
+        z : -999
 training :
     nfold    : 10
     features : [x, y, z]
@@ -497,7 +507,7 @@ When training on real data, several things might go wrong and the code will try
 will end up in different folds. The tool checks whether a model is evaluated for an entry that was used for training and raises an exception. Thus, repeated
 entries will be removed before training.
-- **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
+- **NaNs**: Entries with NaNs will break the training with the scikit `GradientBoostClassifier` base class. Thus, we:
     - Can use the `nan` section shown above to replace `NaN` values with something else
     - For whatever remains we remove the entries from the training.
@@ -674,6 +684,9 @@ ptr.run()
 where the config dictionary `cfg_dat` in YAML would look like:
 ```yaml
+general:
+    # This will set the figure size
+    size : [20, 10]
 selection:
     #Will do at most 50K random entries. Will only happen if the dataset has more than 50K entries
     max_ran_entries : 50000
@@ -703,6 +716,16 @@ plots:
         yscale     : 'linear'
         labels     : ['x + y', 'Entries']
         normalized : true #This should normalize to the area
+# Some vertical dashed lines are drawn by default
+# If you see them, you can turn them off with this
+style:
+  skip_lines : true
+  # This can pass arguments to legend making function `plt.legend()` in matplotlib
+  legend:
+    # The line below would place the legend outside the figure to avoid overlaps with the histogram
+    bbox_to_anchor : [1.2, 1]
+stats:
+  nentries : '{:.2e}' # This will add number of entries in legend box
 ```
 it's up to the user to build this dictionary and load it.
@@ -724,14 +747,19 @@ The config would look like:
 ```yaml
 saving:
     plt_dir : tests/plotting/2d
+selection:
+  cuts:
+    xlow : x > -1.5
 general:
     size : [20, 10]
 plots_2d:
     # Column x and y
     # Name of column where weights are, null for not weights
     # Name of output plot, e.g. xy_x.png
-    - [x, y, weights, 'xy_w']
-    - [x, y,    null, 'xy_r']
+    # Bool signaling to use log scale for z axis
+    - [x, y, weights, 'xy_w', false]
+    - [x, y,    null, 'xy_r', false]
+    - [x, y,    null, 'xy_l',  true]
 axes:
     x :
         binning : [-5.0, 8.0, 40]
@@ -823,7 +851,7 @@ Directory/Treename
     B_ENDVERTEX_CHI2DOF           Double_t
 ```
-## Comparing ROOT files
+## Comparing ROOT files
 Given two ROOT files the command below:
@@ -856,6 +884,40 @@ Trees only in file_2.root:
   - Hlt2RD_BsToPhiMuMu_MVA/DecayTree
 ```
+# File system
+## Versions
+The utilities below allow the user to deal with versioned files and directories
+```python
+from dmu.generic.version_management import get_last_version
+from dmu.generic.version_management import get_next_version
+from dmu.generic.version_management import get_latest_file
+# get_next_version will take a version and provide the next one, e.g.
+get_next_version('v1')           # -> 'v2'
+get_next_version('v1.1')         # -> 'v2.1'
+get_next_version('v10.1')        # -> 'v11.1'
+get_next_version('/a/b/c/v1')    # -> '/a/b/c/v2'
+get_next_version('/a/b/c/v1.1')  # -> '/a/b/c/v2.1'
+get_next_version('/a/b/c/v10.1') # -> '/a/b/c/v11.1'
+# `get_latest_file` will return the path to the file with the highest version
+# in the `dir_path` directory that matches a wildcard, e.g.:
+last_file = get_latest_file(dir_path = file_dir, wc='name_*.txt')
+# `get_last_version` will return the string with the latest version
+# of directories in `dir_path`, e.g.:
+oversion=get_last_version(dir_path=dir_path, version_only=True)  # This will return only the version, e.g. v3.2
+oversion=get_last_version(dir_path=dir_path, version_only=False) # This will return full path, e.g. /a/b/c/v3.2
+```
+The function above should work for numeric (e.g. `v1.2`) and non-numeric (e.g. `va`, `vb`) versions.
 # Text manipulation
 ## Transformations

{data_manipulation_utilities-0.2.3 → data_manipulation_utilities-0.2.5}/src/data_manipulation_utilities.egg-info/SOURCES.txt RENAMED

@@ -8,6 +8,7 @@ src/data_manipulation_utilities.egg-info/requires.txt
 src/data_manipulation_utilities.egg-info/top_level.txt
 src/dmu/arrays/utilities.py
 src/dmu/generic/utilities.py
+src/dmu/generic/version_management.py
 src/dmu/logging/log_store.py
 src/dmu/ml/cv_classifier.py
 src/dmu/ml/cv_predict.py
@@ -37,10 +38,12 @@ src/dmu_data/ml/tests/train_mva.yaml
 src/dmu_data/plotting/tests/2d.yaml
 src/dmu_data/plotting/tests/fig_size.yaml
 src/dmu_data/plotting/tests/high_stat.yaml
+src/dmu_data/plotting/tests/legend.yaml
 src/dmu_data/plotting/tests/name.yaml
 src/dmu_data/plotting/tests/no_bounds.yaml
 src/dmu_data/plotting/tests/normalized.yaml
 src/dmu_data/plotting/tests/simple.yaml
+src/dmu_data/plotting/tests/stats.yaml
 src/dmu_data/plotting/tests/title.yaml
 src/dmu_data/plotting/tests/weights.yaml
 src/dmu_data/text/transform.toml
