data-manipulation-utilities 0.0.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data_manipulation_utilities-0.0.1.dist-info/METADATA +713 -0
- data_manipulation_utilities-0.0.1.dist-info/RECORD +45 -0
- data_manipulation_utilities-0.0.1.dist-info/WHEEL +5 -0
- data_manipulation_utilities-0.0.1.dist-info/entry_points.txt +6 -0
- data_manipulation_utilities-0.0.1.dist-info/top_level.txt +3 -0
- dmu/arrays/utilities.py +55 -0
- dmu/dataframe/dataframe.py +36 -0
- dmu/generic/utilities.py +69 -0
- dmu/logging/log_store.py +129 -0
- dmu/ml/cv_classifier.py +122 -0
- dmu/ml/cv_predict.py +152 -0
- dmu/ml/train_mva.py +257 -0
- dmu/ml/utilities.py +132 -0
- dmu/plotting/plotter.py +227 -0
- dmu/plotting/plotter_1d.py +113 -0
- dmu/plotting/plotter_2d.py +87 -0
- dmu/rdataframe/atr_mgr.py +79 -0
- dmu/rdataframe/utilities.py +72 -0
- dmu/rfile/rfprinter.py +91 -0
- dmu/rfile/utilities.py +34 -0
- dmu/stats/fitter.py +515 -0
- dmu/stats/function.py +314 -0
- dmu/stats/utilities.py +134 -0
- dmu/testing/utilities.py +119 -0
- dmu/text/transformer.py +182 -0
- dmu_data/__init__.py +0 -0
- dmu_data/ml/tests/train_mva.yaml +37 -0
- dmu_data/plotting/tests/2d.yaml +14 -0
- dmu_data/plotting/tests/fig_size.yaml +13 -0
- dmu_data/plotting/tests/high_stat.yaml +22 -0
- dmu_data/plotting/tests/name.yaml +14 -0
- dmu_data/plotting/tests/no_bounds.yaml +12 -0
- dmu_data/plotting/tests/simple.yaml +8 -0
- dmu_data/plotting/tests/title.yaml +14 -0
- dmu_data/plotting/tests/weights.yaml +13 -0
- dmu_data/text/transform.toml +4 -0
- dmu_data/text/transform.txt +6 -0
- dmu_data/text/transform_set.toml +8 -0
- dmu_data/text/transform_set.txt +6 -0
- dmu_data/text/transform_trf.txt +12 -0
- dmu_scripts/physics/check_truth.py +121 -0
- dmu_scripts/rfile/compare_root_files.py +299 -0
- dmu_scripts/rfile/print_trees.py +35 -0
- dmu_scripts/ssh/coned.py +168 -0
- dmu_scripts/text/transform_text.py +46 -0
@@ -0,0 +1,713 @@
Metadata-Version: 2.1
Name: data_manipulation_utilities
Version: 0.0.1
Description-Content-Type: text/markdown
Requires-Dist: zfit
Requires-Dist: PyYAML
Requires-Dist: scipy
Requires-Dist: awkward
Requires-Dist: tqdm
Requires-Dist: joblib
Requires-Dist: scikit-learn
Requires-Dist: toml
Requires-Dist: numpy
Requires-Dist: matplotlib
Requires-Dist: mplhep
Requires-Dist: hist[plot]
Requires-Dist: polars
Requires-Dist: pandas
Provides-Extra: dev
Requires-Dist: pytest ; extra == 'dev'

# D(ata) M(anipulation) U(tilities)

These are tools that can be used for different data analysis tasks.

# Generic

This section describes generic tools that do not fit into a specific category but tend to be useful.

## Timer

In order to benchmark functions, do:

```python
from time import sleep

import dmu.generic.utilities as gut

# The timer needs to be turned on; it is off by default
gut.TIMER_ON = True

@gut.timeit
def fun():
    sleep(3)

fun()
```

## JSON dumper

The following lines will dump data (dictionaries, lists, etc.) to a JSON file:

```python
import dmu.generic.utilities as gut

data = [1, 2, 3, 4]

gut.dump_json(data, '/tmp/list.json')
```

# Physics

## Truth matching

To compare the truth-matching efficiencies and the resulting distributions across several samples, run:

```bash
check_truth -c configuration.yaml
```

where the config file can look like:

```yaml
# ---------
max_entries : 1000
samples:
  # Below are the samples for which the methods will be compared
  sample_a:
    file_path : /path/to/root/files/*.root
    tree_path : TreeName
methods :
  # Below we specify the ways truth matching will be carried out
  bkg_cat : B_BKGCAT == 0 || B_BKGCAT == 10 || B_BKGCAT == 50
  true_id : TMath::Abs(B_TRUEID) == 521 && TMath::Abs(Jpsi_TRUEID) == 443 && TMath::Abs(Jpsi_MC_MOTHER_ID) == 521 && TMath::Abs(L1_TRUEID) == 11 && TMath::Abs(L2_TRUEID) == 11 && TMath::Abs(L1_MC_MOTHER_ID) == 443 && TMath::Abs(L2_MC_MOTHER_ID) == 443 && TMath::Abs(H_TRUEID) == 321 && TMath::Abs(H_MC_MOTHER_ID) == 521
plot:
  # Below are the options used by Plotter1D (see plotting documentation below)
  definitions:
    mass : B_nopv_const_mass_M[0]
  plots:
    mass :
      binning    : [5000, 6000, 40]
      yscale     : 'linear'
      labels     : ['$M_{DTF-noPV}(B^+)$', 'Entries']
      normalized : true
  saving:
    plt_dir : /path/to/directory/with/plots
```

# Math

## PDFs

### Printing PDFs

One can print a zfit PDF by doing:

```python
from dmu.stats.utilities import print_pdf

print_pdf(pdf)
```

This should produce an output that looks like:

```
PDF: SumPDF
OBS: <zfit Space obs=('m',), axes=(0,), limits=(array([[-10.]]), array([[10.]])), binned=False>
Name   Value       Low         High        Floating  Constraint
----------------------------------------------------------------
fr1    5.000e-01   0.000e+00   1.000e+00   1         none
fr2    5.000e-01   0.000e+00   1.000e+00   1         none
mu1    4.000e-01   -5.000e+00  5.000e+00   1         none
mu2    4.000e-01   -5.000e+00  5.000e+00   1         none
sg1    1.300e+00   0.000e+00   5.000e+00   1         none
sg2    1.300e+00   0.000e+00   5.000e+00   1         none
```

showing basic information on the observable, the parameter values and ranges, and whether each parameter floats or is Gaussian constrained.
One can add other options too:

```python
from dmu.stats.utilities import print_pdf

# Constraints, uncorrelated for now
d_const = {'mu1' : [0.0, 0.1], 'sg1' : [1.0, 0.1]}
#-----------------
# Simplest printing, to screen
print_pdf(pdf)

# Will not show certain parameters
print_pdf(pdf,
          blind    = ['sg.*', 'mu.*'])

# Will add constraints
print_pdf(pdf,
          d_const  = d_const,
          blind    = ['sg.*', 'mu.*'])
#-----------------
# Same as above, but will dump to a text file instead of the screen
#-----------------
print_pdf(pdf,
          txt_path = 'tests/stats/utilities/print_pdf/pdf.txt')

print_pdf(pdf,
          blind    = ['sg.*', 'mu.*'],
          txt_path = 'tests/stats/utilities/print_pdf/pdf_blind.txt')

print_pdf(pdf,
          d_const  = d_const,
          txt_path = 'tests/stats/utilities/print_pdf/pdf_const.txt')
```

## Fits

The `Fitter` class is a wrapper around zfit, used to make fitting easier.

### Simplest fit

```python
from dmu.stats.fitter import Fitter

obj = Fitter(pdf, dat)
res = obj.fit()
```
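
Here `pdf` is a zfit PDF and `dat` the data to fit. As a minimal sketch of how they might be built (a plain Gaussian and a toy dataset; none of this is part of the package):

```python
import numpy
import zfit

# Observable and a simple Gaussian model; any zfit PDF would do
obs = zfit.Space('m', limits=(-10, 10))
mu  = zfit.Parameter('mu', 0.4, -5.0, 5.0)
sg  = zfit.Parameter('sg', 1.3,  0.0, 5.0)
pdf = zfit.pdf.Gauss(obs=obs, mu=mu, sigma=sg)

# Toy data wrapped as a zfit dataset
arr = numpy.random.normal(loc=0.4, scale=1.3, size=10_000)
dat = zfit.Data.from_numpy(obs=obs, array=arr)
```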

### Customizations

In order to customize the way the fitting is done, one would pass a configuration dictionary to the `fit(cfg=config)`
function. This dictionary can be represented in YAML as:

```yaml
# The strategies below are mutually exclusive; only one should be used at a time
strategy :
  # This strategy will fit multiple times and retry the fit until either
  # ntries is exhausted or the pvalue threshold is reached.
  retry :
    ntries        : 4    # Number of tries
    pvalue_thresh : 0.05 # P-value threshold; if the fit is better than this, the loop ends
    ignore_status : true # Will pick invalid fits if this is true; otherwise only valid fits will be counted
  # This will fit smaller datasets and get the values of the shape parameters to allow
  # these shapes to float only around those values and within nsigma.
  # The fit can be carried out multiple times with larger and larger samples to tighten the parameters.
  steps :
    nsteps : [1e3, 1e4]     # Number of entries to use at each step
    nsigma : [5.0, 2.0]     # Number of sigmas for the range of each parameter, per step
    yields : ['ny1', 'ny2'] # In the fitting model, ny1 and ny2 are the names of the yield parameters; all the yields need to go in this list
# The lines below will split the range of the data, [0, 10], into two subranges, such that the NLL is built
# only in those ranges. The ranges need to be tuples.
ranges :
  - !!python/tuple [0, 3]
  - !!python/tuple [6, 9]
# The lines below will allow using constraints for each parameter, where the first element is the mean and the second
# the width of a Gaussian constraint. No correlations are implemented yet.
constraints :
  mu : [5.0, 1.0]
  sg : [1.0, 0.1]
# After each fit, the parameters specified below will be printed, for debugging purposes
print_pars : ['mu', 'sg']
likelihood :
  nbins : 100 # If specified, a binned likelihood fit will be done instead of an unbinned one
```
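
For instance, one could keep the configuration above in a file, `fit_config.yaml` (a name chosen here for illustration), and load it before fitting. Note that the `!!python/tuple` tags are rejected by `yaml.safe_load`, so a loader that understands Python tags is needed:

```python
import yaml

from dmu.stats.fitter import Fitter

with open('fit_config.yaml', encoding='utf-8') as ifile:
    cfg = yaml.load(ifile, Loader=yaml.UnsafeLoader)

obj = Fitter(pdf, dat)
res = obj.fit(cfg=cfg)
```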

## Arrays

### Scaling by non-integer

Given an array representing a distribution, the following lines will increase its size
by a factor `fscale`, where this number is a float, e.g. 3.4:

```python
from dmu.arrays.utilities import repeat_arr

arr_val = repeat_arr(arr_val = arr_inp, ftimes = fscale)
```

in such a way that the output array will be `fscale` times larger than the input one, but will keep the same distribution.
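
The implementation is not reproduced here, but one way to picture what such a function does (a sketch only, with a hypothetical name) is:

```python
import numpy

def repeat_arr_sketch(arr_val, ftimes):
    # Tile the whole array floor(ftimes) times
    nfull = int(ftimes)

    # Fill the fractional remainder with a random subset, so the output
    # is ftimes larger but follows the same distribution
    nextra    = int((ftimes - nfull) * arr_val.size)
    arr_extra = numpy.random.choice(arr_val, size=nextra, replace=False)

    return numpy.concatenate([numpy.tile(arr_val, nfull), arr_extra])
```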

## Functions

The project contains the `Function` class, which can be used to:

- Store `(x, y)` coordinates.
- Evaluate the function by interpolating.
- Store the function as a JSON file.
- Load the function from the JSON file.

It can be used as:

```python
import numpy
from dmu.stats.function import Function

x = numpy.linspace(0, 5, num=10)
y = numpy.sin(x)

path = './function.json'

# By default the interpolation is 'cubic'; this uses scipy's interp1d,
# refer to that documentation for more information.
fun = Function(x=x, y=y, kind='cubic')
fun.save(path = path)

fun = Function.load(path)

xval = numpy.linspace(0, 5, num=100)
yval = fun(xval)
```

# Machine learning

## Classification

To train models to classify data between signal and background, starting from ROOT dataframes, do:

```python
from dmu.ml.train_mva import TrainMva

rdf_sig = _get_rdf(kind='sig')
rdf_bkg = _get_rdf(kind='bkg')
cfg     = _get_config()

obj = TrainMva(sig=rdf_sig, bkg=rdf_bkg, cfg=cfg)
obj.run()
```

where the settings for the training go in a config dictionary, which, when written to YAML, looks like:

```yaml
training :
  nfold    : 10
  features : [w, x, y, z]
  hyper    :
    loss              : log_loss
    n_estimators      : 100
    max_depth         : 3
    learning_rate     : 0.1
    min_samples_split : 2
saving:
  path : 'tests/ml/train_mva/model.pkl'
plotting:
  val_dir : 'tests/ml/train_mva'
  features:
    saving:
      plt_dir : 'tests/ml/train_mva/features'
    plots:
      w :
        binning : [-4, 4, 100]
        yscale  : 'linear'
        labels  : ['w', '']
      x :
        binning : [-4, 4, 100]
        yscale  : 'linear'
        labels  : ['x', '']
      y :
        binning : [-4, 4, 100]
        yscale  : 'linear'
        labels  : ['y', '']
      z :
        binning : [-4, 4, 100]
        yscale  : 'linear'
        labels  : ['z', '']
```

`TrainMva` is just a wrapper around `scikit-learn` that enables cross-validation (which explains the `nfold` setting).

### Caveats

When training on real data, several things might go wrong, and the code will try to deal with them in the following ways:

- **Repeated entries**: Entire rows of features might appear multiple times. When doing cross-validation, this might mean that two identical entries
end up in different folds. The tool checks whether a model is evaluated on an entry that was used for its training and raises an exception. Thus, repeated
entries will be removed before training.

- **NaNs**: Entries with NaNs will break the training with the scikit-learn `GradientBoostingClassifier` base class. Thus, we also remove them from the training. Both cleanups are sketched below.
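
Purely as an illustration of those two cleanups (this is not the package's code), on a feature matrix they amount to:

```python
import numpy

def clean_features_sketch(arr_feat):
    # Drop rows containing NaNs
    arr_feat = arr_feat[~numpy.isnan(arr_feat).any(axis=1)]

    # Drop exact duplicates, so no entry can appear in two folds
    return numpy.unique(arr_feat, axis=0)
```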

## Application

Given the models already trained, one can use them with:

```python
from dmu.ml.cv_predict import CVPredict

# Build predictor with list of models and ROOT dataframe with data
cvp = CVPredict(models=l_model, rdf=rdf)

# This will return an array of probabilities
arr_prb = cvp.predict()
```

If the entries in the input dataframe were used for the training of some of the models, the model that was not trained on them
will be _automatically_ picked for the prediction of that specific sample.

The picking process happens through the comparison of hashes between the samples in `rdf` and the training samples.
The hashes of the training samples are stored in the pickled model itself; for this purpose the model is a reimplementation of
`GradientBoostingClassifier`, here called `CVClassifier`.

If a sample exists that was used in the training of _every_ model, no model can be chosen for the prediction and a
`CVSameData` exception will be raised.
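
A minimal sketch of the idea behind the picking (the attribute and helper names below are hypothetical; the real hashing scheme lives inside `CVClassifier`):

```python
import hashlib

def hash_entry(row):
    # Hash the feature values of one entry
    text = ','.join(f'{val:.10e}' for val in row)

    return hashlib.sha256(text.encode()).hexdigest()

def pick_model(row, l_model):
    # Pick a model that was NOT trained on this entry;
    # if every model saw it, this is the CVSameData situation
    hsh = hash_entry(row)
    for model in l_model:
        if hsh not in model.hashes:
            return model

    raise RuntimeError('Entry was used in the training of every model')
```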

### Caveats

When evaluating the model with real data, problems might occur; we deal with them as follows:

- **Repeated entries**: When there are repeated features in the dataset to be evaluated, we assign the same probabilities; no filtering is used.
- **NaNs**: Entries with NaNs will break the evaluation. These entries will be _patched_ with zeros and evaluated. However, before returning, their probabilities will be
saved as -1, i.e. entries with NaNs will have probabilities of -1, as sketched below.
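
Illustratively (again, not the package's code), the NaN handling corresponds to:

```python
import numpy

def predict_with_nans_sketch(model, arr_feat):
    # Remember which rows contain NaNs, patch them with zeros so the
    # model can evaluate, then overwrite their probabilities with -1
    mask_nan = numpy.isnan(arr_feat).any(axis=1)
    arr_ptch = numpy.nan_to_num(arr_feat, nan=0.0)

    arr_prb           = model.predict_proba(arr_ptch)[:, 1]
    arr_prb[mask_nan] = -1

    return arr_prb
```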

# Rdataframes

These are utility functions meant to be used with ROOT dataframes.

## Adding a column from a numpy array

For this do:

```python
import numpy

import dmu.rdataframe.utilities as ut

arr_val = numpy.array([10, 20, 30])
rdf     = ut.add_column(rdf, arr_val, 'values')
```

The `add_column` function will check for:

1. The presence of a column with the same name
2. The same size for the array and the existing dataframe

and will return a dataframe with the added column.

## Attaching attributes

**Use case**: When performing operations on dataframes, like `Filter`, `Range`, etc., a new instance of the dataframe
will be created. One might want to attach attributes to the dataframe, like the name of the file or the tree;
those attributes would be dropped with each new instance. In order to deal with this one can do:

```python
from dmu.rdataframe.atr_mgr import AtrMgr

# Pick up the attributes
obj = AtrMgr(rdf)

# Do things to the dataframe
rdf = rdf.Filter('x > 0')
rdf = rdf.Define('a', 'b')

# Put back the attributes
rdf = obj.add_atr(rdf)
```

The attributes can also be saved to JSON with:

```python
obj = AtrMgr(rdf)
...
obj.to_json('/path/to/file.json')
```

# Dataframes

Polars is very fast; however, its interface is not simple. Therefore this project provides a derived class
called `DataFrame`, which implements a more user-friendly interface. It can be used as:

```python
from dmu.dataframe.dataframe import DataFrame

df = DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})

# Defining a new column
df = df.define('c', 'a + b')
```

The remaining functionality is identical to `polars`.
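
To give an idea of what such a wrapper can look like (a sketch under the assumption that expressions only combine existing columns; this is not the project's actual code):

```python
import polars as pl

def define(df, name, expr):
    # Evaluate a string expression such as 'a + b' in terms of the
    # existing columns and attach the result as a new column
    d_col = {col : pl.col(col) for col in df.columns}

    # eval is used here only to keep the sketch short
    return df.with_columns(eval(expr, d_col).alias(name))
```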

# Logging

The `LogStore` class is an interface to the `logging` module. It is aimed at making it easier to include
a good enough logging tool. It can be used as:

```python
from dmu.logging.log_store import LogStore

# This line is optional; the default backend is logging, but logzero is also supported
LogStore.backend = 'logging'

log = LogStore.add_logger('msg')
# 10 corresponds to logging.DEBUG
LogStore.set_level('msg', 10)

log.debug('debug')
log.info('info')
log.warning('warning')
log.error('error')
log.critical('critical')
```

# Plotting from ROOT dataframes

## 1D plots

Given a set of ROOT dataframes and a configuration dictionary, one can plot distributions with:

```python
from dmu.plotting.plotter_1d import Plotter1D as Plotter

ptr = Plotter(d_rdf=d_rdf, cfg=cfg_dat)
ptr.run()
```

where the config dictionary `cfg_dat`, in YAML, would look like:

```yaml
selection:
  # Will take at most 50K random entries; this only happens if the dataset has more than 50K entries
  max_ran_entries : 50000
  cuts:
    # Will only use entries with z > 0
    z : 'z > 0'
saving:
  # Will save plots to this directory
  plt_dir : tests/plotting/high_stat
definitions:
  # Will define extra variables
  z : 'x + y'
# Settings to make histograms for different variables
plots:
  x :
    binning : [0.98, 0.98, 40] # If the bounds agree, the tool will calculate them as the 2% and 98% quantiles
    yscale  : 'linear' # Optional; if not passed, will be linear, can be log
    labels  : ['x', 'Entries'] # Labels are optional; will use the variable name and Entries if not present
    title   : 'some title can be added for different variable plots'
    name    : 'plot_of_x' # This ensures one gets plot_of_x.png as a result; if missing, x.png would be saved
  y :
    binning : [-5.0, 8.0, 40]
    yscale  : 'linear'
    labels  : ['y', 'Entries']
  z :
    binning : [-5.0, 8.0, 40]
    yscale  : 'linear'
    labels  : ['x + y', 'Entries']
    normalized : true # This will normalize the histogram to the area
```

It is up to the user to build this dictionary and load it, e.g. as shown below.
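
For example, with the configuration above saved as `plots.yaml` (a name chosen here for illustration):

```python
import yaml

from dmu.plotting.plotter_1d import Plotter1D as Plotter

with open('plots.yaml', encoding='utf-8') as ifile:
    cfg_dat = yaml.safe_load(ifile)

ptr = Plotter(d_rdf=d_rdf, cfg=cfg_dat)
ptr.run()
```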

## 2D plots

For the 2D case it would look like:

```python
from dmu.plotting.plotter_2d import Plotter2D as Plotter

ptr = Plotter(rdf=rdf, cfg=cfg_dat)
ptr.run()
```

where one passes a single dataframe instead of a dictionary, given that overlaying 2D plots is not possible.
The config would look like:

```yaml
saving:
  plt_dir : tests/plotting/2d
general:
  size : [20, 10]
plots_2d:
  # Each entry holds: the x and y columns,
  # the name of the column holding the weights (null for no weights),
  # and the name of the output plot, e.g. xy_w.png
  - [x, y, weights, 'xy_w']
  - [x, y, null, 'xy_r']
axes:
  x :
    binning : [-5.0, 8.0, 40]
    label   : 'x'
  y :
    binning : [-5.0, 8.0, 40]
    label   : 'y'
```

# Manipulating ROOT files

## Getting trees from file

The lines below will return a dictionary with the trees found in the handle to a ROOT file:

```python
from ROOT import TFile

import dmu.rfile.utilities as rfut

ifile  = TFile('/path/to/root/file.root')

d_tree = rfut.get_trees_from_file(ifile)
```

## Printing contents

The following lines will create a `file.txt` with the contents of `file.root`; the text file will be placed in the same location as the
ROOT file:

```python
from dmu.rfile.rfprinter import RFPrinter

obj = RFPrinter(path='/path/to/file.root')
obj.save()
```

## Printing from the command line

This is mostly needed from the command line and can be done with:

```bash
print_trees -p /path/to/file.root
```

which would produce a `/path/to/file.txt` file with the contents, which would look like:

```
Directory/Treename
    B_CHI2                 Double_t
    B_CHI2DOF              Double_t
    B_DIRA_OWNPV           Float_t
    B_ENDVERTEX_CHI2       Double_t
    B_ENDVERTEX_CHI2DOF    Double_t
```

## Comparing ROOT files

Given two ROOT files, the command below:

```bash
compare_root_files -f file_1.root file_2.root
```

will check if:

1. The files have the same trees. If not, it will print which trees are in the first file but not in the second,
and vice versa.
1. The trees have the same branches. The same checks as above will be carried out here.
1. The branches of the corresponding trees have the same values.

the output will also go to a `summary.yaml` file that will look like:

```yaml
'Branches that differ for tree: Hlt2RD_BToMuE/DecayTree':
- L2_BREMHYPOENERGY
- L2_ECALPIDMU
- L1_IS_NOT_H
'Branches that differ for tree: Hlt2RD_LbToLMuMu_LL/DecayTree':
- P_CaloNeutralHcal2EcalEnergyRatio
- P_BREMENERGY
- Pi_IS_NOT_H
- P_BREMPIDE
Trees only in file_1.root: []
Trees only in file_2.root:
- Hlt2RD_BuToKpEE_MVA_misid/DecayTree
- Hlt2RD_BsToPhiMuMu_MVA/DecayTree
```

# Text manipulation

## Transformations

Run:

```bash
transform_text -i ./transform.txt -c ./transform.toml
```

to apply the transformations defined in `transform.toml` to `transform.txt`.

The tool can be imported from another file like:

```python
from dmu.text.transformer import transformer as txt_trf

# data here is e.g. an argparse namespace holding the input, config and output paths
trf = txt_trf(txt_path=data.txt, cfg_path=data.cfg)
trf.save_as(out_path=data.out)
```

Currently the supported transformations are:

### append

This will append a set of lines after a given line; the config lines could look like:

```toml
[settings]
as_substring=true
format      ='--> {} <--'

[append]
'primes are'=['2', '3', '5']
'days are'=['Monday', 'Tuesday', 'Wednesday']
```

`as_substring` is a flag that allows matches if the line in the text file merely contains the key in the config,
e.g.:

```
the
first
primes are:
and
the first
days are:
```

`format` will format the lines to be inserted, e.g.:

```
the
first
primes are:
--> 2 <--
--> 3 <--
--> 5 <--
and
the first
days are:
--> Monday <--
--> Tuesday <--
--> Wednesday <--
```

## coned

Utility used to edit the list of SSH connections; it has the following behavior:

```bash
# Prints all connections
coned -p

# Adds a task name to a given server
coned -a server_name server_index task

# Removes a task name from a given server
coned -d server_name server_index task
```

the list of servers with tasks and machines is specified in a YAML file that can look like:

```yaml
ihep:
  '001' :
  - checks
  - extractor
  - dsmanager
  - classifier
  '002' :
  - checks
  - hqm2
  - dotfiles
  - data_checks
  '003' :
  - setup
  - ntupling
  - preselection
  '004' :
  - scripts
  - tools
  - dmu
  - ap
lxplus:
  '984' :
  - ap
```

and should be placed in `$HOME/.config/dmu/ssh/servers.yaml`.