data-manipulation-utilities 0.1.9__py3-none-any.whl → 0.2.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,11 +1,11 @@
1
1
  Metadata-Version: 2.2
2
2
  Name: data_manipulation_utilities
3
- Version: 0.1.9
3
+ Version: 0.2.1
4
4
  Description-Content-Type: text/markdown
5
5
  Requires-Dist: logzero
6
6
  Requires-Dist: PyYAML
7
7
  Requires-Dist: scipy
8
- Requires-Dist: awkward
8
+ Requires-Dist: awkward==2.4.6
9
9
  Requires-Dist: tqdm
10
10
  Requires-Dist: joblib
11
11
  Requires-Dist: scikit-learn
@@ -204,6 +204,33 @@ print_pdf(pdf,
204
204
 
205
205
  The `Fitter` class is a wrapper around zfit, used to make fitting easier.
206
206
 
207
+ ### Goodness of fits
208
+
209
+ Once a fit has been done, one can use `GofCalculator` to get a rough estimate of the fit quality.
210
+ This is done by:
211
+
212
+ - Binning the data and PDF.
213
+ - Calculating the reduced $\chi^2$.
214
+ - Using the $\chi^2$ and the number of degrees of freedom to get the p-value.
215
+
216
+ This class is used as shown below:
217
+
218
+ ```python
219
+ from dmu.stats.gof_calculator import GofCalculator
220
+
221
+ nll = _get_nll()
222
+ res = Data.minimizer.minimize(nll)
223
+
224
+ gcl = GofCalculator(nll, ndof=10)
225
+ gof = gcl.get_gof(kind='pvalue')
226
+ ```
227
+
228
+ where:
229
+
230
+ - `ndof` is the number of degrees of freedom used in the reduced $\chi^2$ calculation.
231
+ It is needed to know how many bins to use when building the histogram; the recommended value is 10.
232
+ - `kind` selects what is returned, either `pvalue` or `chi2/ndof`.
233
+
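For reference, the last step above (turning a $\chi^2$ value and a number of degrees of freedom into a p-value) amounts to evaluating the $\chi^2$ survival function. A minimal sketch with `scipy` is shown below; the numbers are made up and this is an illustration, not the `GofCalculator` internals:

```python
# Illustration only: convert a chi2 value and the number of degrees of
# freedom into a reduced chi2 and a p-value.
from scipy.stats import chi2 as chi2_dist

chi2_val = 12.3                          # hypothetical chi2 from comparing binned data and PDF
ndof     = 10                            # degrees of freedom, as in `ndof` above

red_chi2 = chi2_val / ndof               # reduced chi2
pvalue   = chi2_dist.sf(chi2_val, ndof)  # survival function, i.e. 1 - CDF

print(f'chi2/ndof = {red_chi2:.2f}, p-value = {pvalue:.3f}')
```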
207
234
  ### Simplest fit
208
235
 
209
236
  ```python
@@ -396,6 +423,14 @@ obj.run()
396
423
  where the settings for the training go in a config dictionary, which when written to YAML looks like:
397
424
 
398
425
  ```yaml
426
+ dataset:
427
+ # If the value of a given feature is NaN, replace it with the number provided;
428
+ # the replaced value will then be used in the training.
429
+ # Otherwise the entries with NaNs will be dropped (see the pandas sketch further below)
430
+ nan:
431
+ x : 0
432
+ y : 0
433
+ z : -999
399
434
  training :
400
435
  nfold : 10
401
436
  features : [w, x, y, z]
@@ -406,8 +441,25 @@ training :
406
441
  learning_rate : 0.1
407
442
  min_samples_split : 2
408
443
  saving:
444
+ # The actual model names are model_001.pkl, model_002.pkl, etc, one for each fold
409
445
  path : 'tests/ml/train_mva/model.pkl'
410
446
  plotting:
447
+ roc :
448
+ min : [0.0, 0.0] # Optional, sets the lower limits of the x and y axes
449
+ max : [1.2, 1.2] # Optional, sets the upper limits; by default both axes go from 0 to 1
450
+ # The section below is optional and will annotate the ROC curve with
451
+ # values for the score at different signal efficiencies
452
+ annotate:
453
+ sig_eff : [0.5, 0.6, 0.7, 0.8, 0.9] # Values of signal efficiency at which to show the scores
454
+ form : '{:.2f}' # Use two decimals for scores
455
+ color : 'green' # Color for text and marker
456
+ xoff : -15 # Offsets in X and Y
457
+ yoff : -15
458
+ size : 10 # Size of text
459
+ correlation: # Adds correlation matrix for training datasets
460
+ title : 'Correlation matrix'
461
+ size : [10, 10]
462
+ mask_value : 0 # Where correlation is zero, the bin will appear white
411
463
  val_dir : 'tests/ml/train_mva'
412
464
  features:
413
465
  saving:
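As a rough sketch of what the `dataset/nan` section above amounts to (illustrative code assuming a pandas dataframe holding the training features, not the package's implementation):

```python
# Features listed under dataset/nan are filled with the configured value;
# entries with NaNs in any other feature are dropped.
import numpy  as np
import pandas as pnd

d_nan = {'x' : 0, 'y' : 0, 'z' : -999}   # taken from the YAML above

df = pnd.DataFrame({'x' : [1.0, np.nan], 'y' : [np.nan, 2.0], 'z' : [3.0, np.nan]})
for name, val in d_nan.items():
    df[name] = df[name].fillna(val)      # replace NaNs in the listed features

df = df.dropna()                         # drop entries with any remaining NaNs
```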
@@ -475,6 +527,36 @@ When evaluating the model with real data, problems might occur, we deal with the
475
527
  - **NaNs**: Entries with NaNs will break the evaluation. These entries will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
476
528
  saved as -1, i.e. entries with NaNs will end up with a probability of -1, as sketched below.
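
The behaviour described above can be pictured with the sketch below, where a toy `GradientBoostingClassifier` stands in for the trained model; this is an illustration of the logic, not the library's code:

```python
# Patch NaNs with zeros, evaluate, then flag the patched entries with -1.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy classifier standing in for the trained model
arr_trn = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
arr_lab = np.array([0, 0, 1, 1])
model   = GradientBoostingClassifier().fit(arr_trn, arr_lab)

arr_feat = np.array([[1.0, 2.0], [np.nan, 3.0]])
has_nan  = np.isnan(arr_feat).any(axis=1)             # entries with at least one NaN

arr_patched = np.nan_to_num(arr_feat, nan=0.0)        # patch NaNs with zeros
arr_prob    = model.predict_proba(arr_patched)[:, 1]  # evaluate as usual
arr_prob[has_nan] = -1                                # flag patched entries
```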
477
529
 
530
+ # Pandas dataframes
531
+
532
+ ## Utilities
533
+
534
+ These are thin layers of code that take pandas dataframes and carry out specific tasks.
535
+
536
+ ### Dataframe to latex
537
+
538
+ One can save a dataframe to latex with:
539
+
540
+ ```python
541
+ import pandas as pnd
542
+ import dmu.pdataframe.utilities as put
543
+
544
+ d_data = {}
545
+ d_data['a'] = [1,2,3]
546
+ d_data['b'] = [4,5,6]
547
+ df = pnd.DataFrame(d_data)
548
+
549
+ d_format = {
550
+ 'a' : '{:.0f}',
551
+ 'b' : '{:.3f}'}
552
+
553
554
+ put.df_to_tex(df,
555
+ './table.tex',
556
+ d_format = d_format,
557
+ caption = 'some caption')
558
+ ```
559
+
478
560
  # Rdataframes
479
561
 
480
562
  These are utility functions meant to be used with ROOT dataframes.
@@ -626,6 +708,43 @@ axes:
626
708
  label : 'y'
627
709
  ```
628
710
 
711
+ # Other plots
712
+
713
+ ## Matrices
714
+
715
+ Matrices can be plotted with `MatrixPlotter`, whose usage is illustrated below:
716
+
717
+ ```python
718
+ import numpy
719
+ import matplotlib.pyplot as plt
720
+
721
+ from dmu.plotting.matrix import MatrixPlotter
722
+
723
+ cfg = {
724
+ 'labels' : ['x', 'y', 'z'], # Used to label the matrix axes
725
+ 'title' : 'Some title', # Optional, title of plot
726
+ 'label_angle': 45, # Labels will be rotated by 45 degrees
727
+ 'upper' : True, # Useful in case this is a symmetric matrix
728
+ 'zrange' : [0, 10], # Controls the z axis range
729
+ 'size' : [7, 7], # Plot size
730
+ 'format' : '{:.3f}', # Optional, if present the values are drawn in each cell, otherwise a color bar is shown
731
+ 'fontsize' : 12, # Font size associated to `format`
732
+ 'mask_value' : 0, # These values will appear white in the plot
733
+ }
734
+
735
+ mat = [
736
+ [1, 2, 3],
737
+ [2, 0, 4],
738
+ [3, 4, numpy.nan]
739
+ ]
740
+
741
+ mat = numpy.array(mat)
742
+
743
+ obj = MatrixPlotter(mat=mat, cfg=cfg)
744
+ obj.plot()
745
+ plt.show()
746
+ ```
747
+
629
748
  # Manipulating ROOT files
630
749
 
631
750
  ## Getting trees from file
@@ -1,14 +1,17 @@
1
- data_manipulation_utilities-0.1.9.data/scripts/publish,sha256=-3K_Y2_4CfWCV50rPB8CRuhjxDu7xMGswinRwPovgLs,1976
1
+ data_manipulation_utilities-0.2.1.data/scripts/publish,sha256=-3K_Y2_4CfWCV50rPB8CRuhjxDu7xMGswinRwPovgLs,1976
2
2
  dmu/arrays/utilities.py,sha256=PKoYyybPptA2aU-V3KLnJXBudWxTXu4x1uGdIMQ49HY,1722
3
3
  dmu/generic/utilities.py,sha256=0Xnq9t35wuebAqKxbyAiMk1ISB7IcXK4cFH25MT1fgw,1741
4
4
  dmu/logging/log_store.py,sha256=umdvjNDuV3LdezbG26b0AiyTglbvkxST19CQu9QATbA,4184
5
- dmu/ml/cv_classifier.py,sha256=n81m7i2M6Zq96AEd9EZGwXSrbG5m9jkS5RdeXvbsAXU,3712
6
- dmu/ml/cv_predict.py,sha256=Bqxu-f6qquKJokFljhCzL_kiGcjLJLQFhVBD130fsyw,4893
7
- dmu/ml/train_mva.py,sha256=d_n-A07DFweikz5nXap4OE_Mqx8VprFT7zbxmnQAbac,9638
8
- dmu/ml/utilities.py,sha256=Nue7O9zi1QXgjGRPH6wnSAW9jusMQ2ZOSDJzBqJKIi0,3687
5
+ dmu/ml/cv_classifier.py,sha256=8Jwx6xMhJaRLktlRdq0tFl32v6t8i63KmpxrlnXlomU,3759
6
+ dmu/ml/cv_predict.py,sha256=AhCsCnHWPWGIRVTdGS1NxA2m4yH7t2lV_OdALwQAcAE,4927
7
+ dmu/ml/train_mva.py,sha256=xJCJZKaly4Mml7Dy-TWQxpB-VNftL7EjQ79QKxROWx0,16475
8
+ dmu/ml/utilities.py,sha256=l348bufD95CuSYdIrHScQThIy2nKwGKXZn-FQg3CEwg,3930
9
+ dmu/pdataframe/utilities.py,sha256=ypvLiFfJ82ga94qlW3t5dXnvEFwYOXnbtJb2zHwsbqk,987
10
+ dmu/plotting/matrix.py,sha256=pXuUJn-LgOvrI9qGkZQw16BzLjOjeikYQ_ll2VIcIXU,4978
9
11
  dmu/plotting/plotter.py,sha256=ytMxtzHEY8ZFU0ZKEBE-ROjMszXl5kHTMnQnWe173nU,7208
10
- dmu/plotting/plotter_1d.py,sha256=O7rTgCBlpCko1RSpj2TzcUIfx9sKoz2jAgw73Pz7Ynk,4472
12
+ dmu/plotting/plotter_1d.py,sha256=g6H2xAgsL9a6vRkpbqHICb3qwV_qMiQPZxxw_oOSf9M,5115
11
13
  dmu/plotting/plotter_2d.py,sha256=J-gKnagoHGfJFU7HBrhDFpGYH5Rxy0_zF5l8eE_7ZHE,2944
14
+ dmu/plotting/utilities.py,sha256=SI9dvtZq2gr-PXVz71KE4o0i09rZOKgqJKD1jzf6KXk,1167
12
15
  dmu/rdataframe/atr_mgr.py,sha256=FdhaQWVpsm4OOe1IRbm7rfrq8VenTNdORyI-lZ2Bs1M,2386
13
16
  dmu/rdataframe/utilities.py,sha256=x8r379F2-vZPYzAdMFCn_V4Kx2Tx9t9pn_QHcZ1euew,2756
14
17
  dmu/rfile/rfprinter.py,sha256=mp5jd-oCJAnuokbdmGyL9i6tK2lY72jEfROuBIZ_ums,3941
@@ -23,12 +26,13 @@ dmu/stats/zfit_plotter.py,sha256=Xs6kisNEmNQXhYRCcjowxO6xHuyAyrfyQIFhGAR61U4,197
23
26
  dmu/testing/utilities.py,sha256=WbMM4e9Cn3-B-12Vr64mB5qTKkV32joStlRkD-48lG0,3460
24
27
  dmu/text/transformer.py,sha256=4lrGknbAWRm0-rxbvgzOO-eR1-9bkYk61boJUEV3cQ0,6100
25
28
  dmu_data/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
26
- dmu_data/ml/tests/train_mva.yaml,sha256=TCniCVpXMEFxZcHa8IIqollKA7ci4OkBnRznLEkXM9o,925
29
+ dmu_data/ml/tests/train_mva.yaml,sha256=k5H4Gu9Gj57B9iqabhcTQEFN674Cv_uJ2Xcumb02zF4,1279
27
30
  dmu_data/plotting/tests/2d.yaml,sha256=VApcAfJFbjNcjMCTBSRm2P37MQlGavMZv6msbZwLSgw,402
28
31
  dmu_data/plotting/tests/fig_size.yaml,sha256=7ROq49nwZ1A2EbPiySmu6n3G-Jq6YAOkc3d2X3YNZv0,294
29
32
  dmu_data/plotting/tests/high_stat.yaml,sha256=bLglBLCZK6ft0xMhQ5OltxE76cWsBMPMjO6GG0OkDr8,522
30
33
  dmu_data/plotting/tests/name.yaml,sha256=mkcPAVg8wBAmlSbSRQ1bcaMl4vOS6LXMtpqQeDrrtO4,312
31
34
  dmu_data/plotting/tests/no_bounds.yaml,sha256=8e1QdphBjz-suDr857DoeUC2DXiy6SE-gvkORJQYv80,257
35
+ dmu_data/plotting/tests/normalized.yaml,sha256=Y0eKtyV5pvlSxvqfsLjytYtv8xYF3HZ5WEdCJdeHGQI,193
32
36
  dmu_data/plotting/tests/simple.yaml,sha256=N_TvNBh_2dU0-VYgu_LMrtY0kV_hg2HxVuEoDlr1HX8,138
33
37
  dmu_data/plotting/tests/title.yaml,sha256=bawKp9aGpeRrHzv69BOCbFX8sq9bb3Es9tdsPTE7jIk,333
34
38
  dmu_data/plotting/tests/weights.yaml,sha256=RWQ1KxbCq-uO62WJ2AoY4h5Umc37zG35s-TpKnNMABI,312
@@ -43,8 +47,8 @@ dmu_scripts/rfile/compare_root_files.py,sha256=T8lDnQxsRNMr37x1Y7YvWD8ySHrJOWZki
43
47
  dmu_scripts/rfile/print_trees.py,sha256=Ze4Ccl_iUldl4eVEDVnYBoe4amqBT1fSBR1zN5WSztk,941
44
48
  dmu_scripts/ssh/coned.py,sha256=lhilYNHWRCGxC-jtyJ3LQ4oUgWW33B2l1tYCcyHHsR0,4858
45
49
  dmu_scripts/text/transform_text.py,sha256=9akj1LB0HAyopOvkLjNOJiptZw5XoOQLe17SlcrGMD0,1456
46
- data_manipulation_utilities-0.1.9.dist-info/METADATA,sha256=sxu2cZc14f4VfDD2J3MLGmW0jRHXJBpmDspXUt1D_0k,23046
47
- data_manipulation_utilities-0.1.9.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
48
- data_manipulation_utilities-0.1.9.dist-info/entry_points.txt,sha256=1TIZDed651KuOH-DgaN5AoBdirKmrKE_oM1b6b7zTUU,270
49
- data_manipulation_utilities-0.1.9.dist-info/top_level.txt,sha256=n_x5J6uWtSqy9mRImKtdA2V2NJNyU8Kn3u8DTOKJix0,25
50
- data_manipulation_utilities-0.1.9.dist-info/RECORD,,
50
+ data_manipulation_utilities-0.2.1.dist-info/METADATA,sha256=ojD6P0bBj9GFohtPd7ULl7sDW80bIVD6JZ-bnNpHYmc,26649
51
+ data_manipulation_utilities-0.2.1.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
52
+ data_manipulation_utilities-0.2.1.dist-info/entry_points.txt,sha256=1TIZDed651KuOH-DgaN5AoBdirKmrKE_oM1b6b7zTUU,270
53
+ data_manipulation_utilities-0.2.1.dist-info/top_level.txt,sha256=n_x5J6uWtSqy9mRImKtdA2V2NJNyU8Kn3u8DTOKJix0,25
54
+ data_manipulation_utilities-0.2.1.dist-info/RECORD,,
dmu/ml/cv_classifier.py CHANGED
@@ -2,6 +2,7 @@
2
2
  Module holding cv_classifier class
3
3
  '''
4
4
 
5
+ from typing import Union
5
6
  from sklearn.ensemble import GradientBoostingClassifier
6
7
 
7
8
  from dmu.logging.log_store import LogStore
@@ -22,7 +23,7 @@ class CVClassifier(GradientBoostingClassifier):
22
23
  '''
23
24
  # pylint: disable = too-many-ancestors, abstract-method
24
25
  # ----------------------------------
25
- def __init__(self, cfg : dict | None = None):
26
+ def __init__(self, cfg : Union[dict,None] = None):
26
27
  '''
27
28
  cfg (dict) : Dictionary with configuration, specially the hyperparameters set in the `hyper` field
28
29
  '''
dmu/ml/cv_predict.py CHANGED
@@ -10,8 +10,8 @@ import tqdm
10
10
  from ROOT import RDataFrame
11
11
 
12
12
  import dmu.ml.utilities as ut
13
- import dmu.ml.cv_classifier as CVClassifier
14
13
 
14
+ from dmu.ml.cv_classifier import CVClassifier
15
15
  from dmu.logging.log_store import LogStore
16
16
 
17
17
  log = LogStore.add_logger('dmu:ml:cv_predict')
@@ -147,6 +147,7 @@ class CVPredict:
147
147
  arr_prb = self._predict_with_overlap(df_ft)
148
148
 
149
149
  arr_prb = self._patch_probabilities(arr_prb)
150
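+ # keep only the signal-class (label 1) column of the probability array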
+ arr_prb = arr_prb.T[1]
150
151
 
151
152
  return arr_prb
152
153
  # ---------------------------------------
dmu/ml/train_mva.py CHANGED
@@ -1,8 +1,10 @@
1
1
  '''
2
2
  Module with TrainMva class
3
3
  '''
4
+ # pylint: disable = too-many-locals
5
+ # pylint: disable = too-many-arguments, too-many-positional-arguments
6
+
4
7
  import os
5
- from typing import Union
6
8
 
7
9
  import joblib
8
10
  import pandas as pnd
@@ -14,11 +16,16 @@ from sklearn.model_selection import StratifiedKFold
14
16
 
15
17
  from ROOT import RDataFrame
16
18
 
17
- import dmu.ml.utilities as ut
19
+ import dmu.ml.utilities as ut
20
+ import dmu.pdataframe.utilities as put
21
+ import dmu.plotting.utilities as plu
22
+
18
23
  from dmu.ml.cv_classifier import CVClassifier as cls
19
24
  from dmu.plotting.plotter_1d import Plotter1D as Plotter
25
+ from dmu.plotting.matrix import MatrixPlotter
20
26
  from dmu.logging.log_store import LogStore
21
27
 
28
+ npa = numpy.ndarray
22
29
  log = LogStore.add_logger('data_checks:train_mva')
23
30
  # ---------------------------------------------
24
31
  class TrainMva:
@@ -43,15 +50,13 @@ class TrainMva:
43
50
 
44
51
  self._rdf_bkg = bkg
45
52
  self._rdf_sig = sig
46
- self._cfg = cfg if cfg is not None else {}
47
-
48
- self._l_model : cls
53
+ self._cfg = cfg
49
54
 
50
55
  self._l_ft_name = self._cfg['training']['features']
51
56
 
52
57
  self._df_ft, self._l_lab = self._get_inputs()
53
58
  # ---------------------------------------------
54
- def _get_inputs(self) -> tuple[pnd.DataFrame, numpy.ndarray]:
59
+ def _get_inputs(self) -> tuple[pnd.DataFrame, npa]:
55
60
  log.info('Getting signal')
56
61
  df_sig, arr_lab_sig = self._get_sample_inputs(self._rdf_sig, label = 1)
57
62
 
@@ -63,15 +68,28 @@ class TrainMva:
63
68
 
64
69
  return df, arr_lab
65
70
  # ---------------------------------------------
66
- def _get_sample_inputs(self, rdf : RDataFrame, label : int) -> tuple[pnd.DataFrame, numpy.ndarray]:
71
+ def _pre_process_nans(self, df : pnd.DataFrame) -> pnd.DataFrame:
72
+ if 'nan' not in self._cfg['dataset']:
73
+ log.debug('dataset/nan section not found, not pre-processing NaNs')
74
+ return df
75
+
76
+ d_name_val = self._cfg['dataset']['nan']
77
+ for name, val in d_name_val.items():
78
+ log.debug(f'{val:<20}{"<---":<10}{name:<100}')
79
+ df[name] = df[name].fillna(val)
80
+
81
+ return df
82
+ # ---------------------------------------------
83
+ def _get_sample_inputs(self, rdf : RDataFrame, label : int) -> tuple[pnd.DataFrame, npa]:
67
84
  d_ft = rdf.AsNumpy(self._l_ft_name)
68
85
  df = pnd.DataFrame(d_ft)
86
+ df = self._pre_process_nans(df)
69
87
  df = ut.cleanup(df)
70
88
  l_lab= len(df) * [label]
71
89
 
72
90
  return df, numpy.array(l_lab)
73
91
  # ---------------------------------------------
74
- def _get_model(self, arr_index : numpy.ndarray) -> cls:
92
+ def _get_model(self, arr_index : npa) -> cls:
75
93
  model = cls(cfg = self._cfg)
76
94
  df_ft = self._df_ft.iloc[arr_index]
77
95
  l_lab = self._l_lab[arr_index]
@@ -84,7 +102,6 @@ class TrainMva:
84
102
  return model
85
103
  # ---------------------------------------------
86
104
  def _get_models(self):
87
- # pylint: disable = too-many-locals
88
105
  '''
89
106
  Will create models, train them and return them
90
107
  '''
@@ -105,15 +122,55 @@ class TrainMva:
105
122
  arr_sig_sig_tr, arr_sig_bkg_tr, arr_sig_all_tr, arr_lab_tr = self._get_scores(model, arr_itr, on_training_ok= True)
106
123
  arr_sig_sig_ts, arr_sig_bkg_ts, arr_sig_all_ts, arr_lab_ts = self._get_scores(model, arr_its, on_training_ok=False)
107
124
 
125
+ self._save_feature_importance(model, ifold)
126
+ self._plot_correlation(arr_itr, ifold)
108
127
  self._plot_scores(arr_sig_sig_tr, arr_sig_sig_ts, arr_sig_bkg_tr, arr_sig_bkg_ts, ifold)
109
-
110
128
  self._plot_roc(arr_lab_ts, arr_sig_all_ts, arr_lab_tr, arr_sig_all_tr, ifold)
111
129
 
112
130
  ifold+=1
113
131
 
114
132
  return l_model
115
133
  # ---------------------------------------------
116
- def _get_scores(self, model : cls, arr_index : numpy.ndarray, on_training_ok : bool) -> tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]:
134
+ def _labels_from_varnames(self, l_var_name : list[str]) -> list[str]:
135
+ try:
136
+ d_plot = self._cfg['plotting']['features']['plots']
137
+ except KeyError:
138
+ log.warning('Cannot find plotting/features/plots section in config, using dataframe names')
139
+ return l_var_name
140
+
141
+ l_label = []
142
+ for var_name in l_var_name:
143
+ if var_name not in d_plot:
144
+ log.warning(f'No plot found for: {var_name}')
145
+ l_label.append(var_name)
146
+ continue
147
+
148
+ d_setting = d_plot[var_name]
149
+ [xlab, _ ]= d_setting['labels']
150
+
151
+ l_label.append(xlab)
152
+
153
+ return l_label
154
+ # ---------------------------------------------
155
+ def _save_feature_importance(self, model : cls, ifold : int) -> None:
156
+ l_var_name = self._df_ft.columns.tolist()
157
+
158
+ d_data = {}
159
+ d_data['Variable' ] = self._labels_from_varnames(l_var_name)
160
+ d_data['Importance'] = 100 * model.feature_importances_
161
+
162
+ val_dir = self._cfg['plotting']['val_dir']
163
+ val_dir = f'{val_dir}/fold_{ifold:03}'
164
+ os.makedirs(val_dir, exist_ok=True)
165
+
166
+ df = pnd.DataFrame(d_data)
167
+ df = df.sort_values(by='Importance', ascending=False)
168
+
169
+ table_path = f'{val_dir}/importance.tex'
170
+ d_form = {'Variable' : '{}', 'Importance' : '{:.1f}'}
171
+ put.df_to_tex(df, table_path, d_format = d_form)
172
+ # ---------------------------------------------
173
+ def _get_scores(self, model : cls, arr_index : npa, on_training_ok : bool) -> tuple[npa, npa, npa, npa]:
117
174
  '''
118
175
  Returns a tuple of four arrays
119
176
 
@@ -136,7 +193,7 @@ class TrainMva:
136
193
 
137
194
  return arr_sig, arr_bkg, arr_all, arr_lab
138
195
  # ---------------------------------------------
139
- def _split_scores(self, arr_prob : numpy.ndarray, arr_label : numpy.ndarray) -> tuple[numpy.ndarray, numpy.ndarray]:
196
+ def _split_scores(self, arr_prob : npa, arr_label : npa) -> tuple[npa, npa]:
140
197
  '''
141
198
  Will split the testing scores (predictions) based on the training scores
142
199
 
@@ -151,7 +208,7 @@ class TrainMva:
151
208
 
152
209
  return arr_sig, arr_bkg
153
210
  # ---------------------------------------------
154
- def _save_model(self, model, ifold):
211
+ def _save_model(self, model : cls, ifold : int) -> None:
155
212
  '''
156
213
  Saves a model, associated to a specific fold
157
214
  '''
@@ -168,6 +225,53 @@ class TrainMva:
168
225
  log.info(f'Saving model to: {model_path}')
169
226
  joblib.dump(model, model_path)
170
227
  # ---------------------------------------------
228
+ def _get_correlation_cfg(self, df : pnd.DataFrame, ifold : int) -> dict:
229
+ l_var_name = df.columns.tolist()
230
+ l_label = self._labels_from_varnames(l_var_name)
231
+ cfg = {
232
+ 'labels' : l_label,
233
+ 'title' : f'Fold {ifold}',
234
+ 'label_angle': 45,
235
+ 'upper' : True,
236
+ 'zrange' : [-1, +1],
237
+ 'size' : [7, 7],
238
+ 'format' : '{:.3f}',
239
+ 'fontsize' : 12,
240
+ }
241
+
242
+ if 'correlation' not in self._cfg['plotting']:
243
+ log.info('Using default correlation plotting configuration')
244
+ return cfg
245
+
246
+ log.debug('Updating correlation plotting configuration')
247
+ custom = self._cfg['plotting']['correlation']
248
+ cfg.update(custom)
249
+
250
+ return cfg
251
+ # ---------------------------------------------
252
+ def _plot_correlation(self, arr_index : npa, ifold : int) -> None:
253
+ df_ft = self._df_ft.iloc[arr_index]
254
+ cfg = self._get_correlation_cfg(df_ft, ifold)
255
+ cov = df_ft.corr()
256
+ mat = cov.to_numpy()
257
+
258
+ log.debug(f'Plotting correlation for {ifold} fold')
259
+
260
+ val_dir = self._cfg['plotting']['val_dir']
261
+ val_dir = f'{val_dir}/fold_{ifold:03}'
262
+ os.makedirs(val_dir, exist_ok=True)
263
+
264
+ obj = MatrixPlotter(mat=mat, cfg=cfg)
265
+ obj.plot()
266
+ plt.savefig(f'{val_dir}/covariance.png')
267
+ plt.close()
268
+ # ---------------------------------------------
269
+ def _get_nentries(self, arr_val : npa) -> str:
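+ '''
+ Returns the array length in thousands, formatted as e.g. `12.35K`, used for legend labels
+ '''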
270
+ size = len(arr_val)
271
+ size = size / 1000.
272
+
273
+ return f'{size:.2f}K'
274
+ # ---------------------------------------------
171
275
  def _plot_scores(self, arr_sig_trn, arr_sig_tst, arr_bkg_trn, arr_bkg_tst, ifold):
172
276
  # pylint: disable = too-many-arguments, too-many-positional-arguments
173
277
  '''
@@ -183,11 +287,11 @@ class TrainMva:
183
287
  val_dir = f'{val_dir}/fold_{ifold:03}'
184
288
  os.makedirs(val_dir, exist_ok=True)
185
289
 
186
- plt.hist(arr_sig_trn, alpha = 0.3, bins=50, range=(0,1), color='b', density=True, label='Signal Train')
187
- plt.hist(arr_sig_tst, histtype='step', bins=50, range=(0,1), color='b', density=True, label='Signal Test')
290
+ plt.hist(arr_sig_trn, alpha = 0.3, bins=50, range=(0,1), color='b', density=True, label='Signal Train: ' + self._get_nentries(arr_sig_trn))
291
+ plt.hist(arr_sig_tst, histtype='step', bins=50, range=(0,1), color='b', density=True, label='Signal Test: ' + self._get_nentries(arr_sig_tst))
188
292
 
189
- plt.hist(arr_bkg_trn, alpha = 0.3, bins=50, range=(0,1), color='r', density=True, label='Background Train')
190
- plt.hist(arr_bkg_tst, histtype='step', bins=50, range=(0,1), color='r', density=True, label='Background Test')
293
+ plt.hist(arr_bkg_trn, alpha = 0.3, bins=50, range=(0,1), color='r', density=True, label='Background Train: '+ self._get_nentries(arr_bkg_trn))
294
+ plt.hist(arr_bkg_tst, histtype='step', bins=50, range=(0,1), color='r', density=True, label='Background Test: ' + self._get_nentries(arr_bkg_tst))
191
295
 
192
296
  plt.legend()
193
297
  plt.title(f'Fold: {ifold}')
@@ -197,16 +301,15 @@ class TrainMva:
197
301
  plt.close()
198
302
  # ---------------------------------------------
199
303
  def _plot_roc(self,
200
- l_lab_ts : numpy.ndarray,
201
- l_prb_ts : numpy.ndarray,
202
- l_lab_tr : numpy.ndarray,
203
- l_prb_tr : numpy.ndarray,
304
+ l_lab_ts : npa,
305
+ l_prb_ts : npa,
306
+ l_lab_tr : npa,
307
+ l_prb_tr : npa,
204
308
  ifold : int):
205
309
  '''
206
310
  Takes the labels and the probabilities and plots ROC
207
311
  curve for given fold
208
312
  '''
209
- # pylint: disable = too-many-arguments, too-many-positional-arguments
210
313
  log.debug(f'Plotting ROC curve for {ifold} fold')
211
314
 
212
315
  val_dir = self._cfg['plotting']['val_dir']
@@ -226,17 +329,59 @@ class TrainMva:
226
329
  if 'min' in self._cfg['plotting']['roc']:
227
330
  [min_x, min_y] = self._cfg['plotting']['roc']['min']
228
331
 
332
+ max_x = 1
333
+ max_y = 1
334
+ if 'max' in self._cfg['plotting']['roc']:
335
+ [max_x, max_y] = self._cfg['plotting']['roc']['max']
336
+
337
+ self._plot_probabilities(xval_ts, yval_ts, l_prb_ts)
338
+
229
339
  plt.plot(xval_ts, yval_ts, color='b', label=f'Test: {area_ts:.3f}')
230
340
  plt.plot(xval_tr, yval_tr, color='r', label=f'Train: {area_tr:.3f}')
231
341
  plt.xlabel('Signal efficiency')
232
- plt.ylabel('Background efficiency')
342
+ plt.ylabel('Background rejection')
233
343
  plt.title(f'Fold: {ifold}')
234
- plt.xlim(min_x, 1)
235
- plt.ylim(min_y, 1)
344
+ plt.xlim(min_x, max_x)
345
+ plt.ylim(min_y, max_y)
346
+ plt.grid()
236
347
  plt.legend()
237
348
  plt.savefig(f'{val_dir}/roc.png')
238
349
  plt.close()
239
350
  # ---------------------------------------------
351
+ def _plot_probabilities(self,
352
+ arr_seff: npa,
353
+ arr_brej: npa,
354
+ arr_sprb: npa) -> None:
355
+
356
+ roc_cfg = self._cfg['plotting']['roc']
357
+ if 'annotate' not in roc_cfg:
358
+ log.debug('Annotation section in the ROC curve config not found, skipping annotation')
359
+ return
360
+
361
+ plt_cfg = roc_cfg['annotate']
362
+ if 'sig_eff' not in plt_cfg:
363
+ l_seff_target = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
364
+ else:
365
+ l_seff_target = plt_cfg['sig_eff']
366
+ del plt_cfg['sig_eff']
367
+
368
+ arr_seff_target = numpy.array(l_seff_target)
369
+
370
+ l_score = numpy.quantile(arr_sprb, 1 - arr_seff_target)
371
+ l_seff = []
372
+ l_brej = []
373
+ for seff_target in l_seff_target:
374
+ arr_diff = numpy.abs(arr_seff - seff_target)
375
+ ind = numpy.argmin(arr_diff)
376
+
377
+ seff = arr_seff[ind]
378
+ brej = arr_brej[ind]
379
+
380
+ l_seff.append(seff)
381
+ l_brej.append(brej)
382
+
383
+ plu.annotate(l_x=l_seff, l_y=l_brej, l_v=l_score, **plt_cfg)
384
+ # ---------------------------------------------
240
385
  def _plot_features(self):
241
386
  '''
242
387
  Will plot the features, based on the settings in the config
@@ -245,10 +390,44 @@ class TrainMva:
245
390
  ptr = Plotter(d_rdf = {'Signal' : self._rdf_sig, 'Background' : self._rdf_bkg}, cfg=d_cfg)
246
391
  ptr.run()
247
392
  # ---------------------------------------------
393
+ def _save_settings_to_tex(self) -> None:
394
+ self._save_nan_conversion()
395
+ self._save_hyperparameters_to_tex()
396
+ # ---------------------------------------------
397
+ def _save_nan_conversion(self) -> None:
398
+ if 'nan' not in self._cfg['dataset']:
399
+ log.debug('NaN section not found, not saving it')
400
+ return
401
+
402
+ d_nan = self._cfg['dataset']['nan']
403
+ l_var = list(d_nan)
404
+ l_lab = self._labels_from_varnames(l_var)
405
+ l_val = list(d_nan.values())
406
+
407
+ d_tex = {'Variable' : l_lab, 'Replacement' : l_val}
408
+ df = pnd.DataFrame(d_tex)
409
+ val_dir = self._cfg['plotting']['val_dir']
410
+ os.makedirs(val_dir, exist_ok=True)
411
+ put.df_to_tex(df, f'{val_dir}/nan_replacement.tex')
412
+ # ---------------------------------------------
413
+ def _save_hyperparameters_to_tex(self) -> None:
414
+ if 'hyper' not in self._cfg['training']:
415
+ raise ValueError('Cannot find hyper parameters in configuration')
416
+
417
+ d_hyper = self._cfg['training']['hyper']
418
+ d_form = { f'\\verb|{key}|' : f'\\verb|{val}|' for key, val in d_hyper.items() }
419
+ d_latex = { 'Hyperparameter' : list(d_form.keys()), 'Value' : list(d_form.values())}
420
+
421
+ df = pnd.DataFrame(d_latex)
422
+ val_dir = self._cfg['plotting']['val_dir']
423
+ os.makedirs(val_dir, exist_ok=True)
424
+ put.df_to_tex(df, f'{val_dir}/hyperparameters.tex')
425
+ # ---------------------------------------------
248
426
  def run(self):
249
427
  '''
250
428
  Will do the training
251
429
  '''
430
+ self._save_settings_to_tex()
252
431
  self._plot_features()
253
432
 
254
433
  l_mod = self._get_models()
dmu/ml/utilities.py CHANGED
@@ -51,6 +51,14 @@ def _remove_nans(df : pnd.DataFrame) -> pnd.DataFrame:
51
51
  log.debug('No NaNs found in dataframe')
52
52
  return df
53
53
 
54
+ sr_is_nan = df.isna().any()
55
+ l_na_name = sr_is_nan[sr_is_nan].index.tolist()
56
+
57
+ log.info('Found columns with NaNs')
58
+ for name in l_na_name:
59
+ nan_count = df[name].isna().sum()
60
+ log.info(f'{nan_count:<10}{name:<100}')
61
+
54
62
  ninit = len(df)
55
63
  df = df.dropna()
56
64
  nfinl = len(df)
@@ -0,0 +1,36 @@
1
+ '''
2
+ Module containing utilities for pandas dataframes
3
+ '''
4
+ import os
5
+ import pandas as pnd
6
+
7
+ from dmu.logging.log_store import LogStore
8
+
9
+ log=LogStore.add_logger('dmu:pdataframe:utilities')
10
+
11
+ # -------------------------------------
12
+ def df_to_tex(df : pnd.DataFrame, path : str, hide_index : bool = True, d_format : dict[str,str]=None, caption : str =None) -> None:
13
+ '''
14
+ Saves pandas dataframe to latex
15
+
16
+ Parameters
17
+ -------------
18
+ d_format (dict) : Dictionary specifying the formatting of the table, e.g. `{'col1': '{}', 'col2': '{:.3f}', 'col3' : '{:.3f}'}`
19
+ '''
20
+
21
+ if path is not None:
22
+ dir_name = os.path.dirname(path)
23
+ os.makedirs(dir_name, exist_ok=True)
24
+
25
+ st = df.style
26
+ if hide_index:
27
+ st=st.hide(axis='index')
28
+
29
+ if d_format is not None:
30
+ st=st.format(formatter=d_format)
31
+
32
+ log.info(f'Saving to: {path}')
33
+ buf = st.to_latex(buf=path, caption=caption, hrules=True)
34
+
35
+ return buf
36
+ # -------------------------------------
dmu/plotting/matrix.py ADDED
@@ -0,0 +1,157 @@
1
+ '''
2
+ Module holding the MatrixPlotter class
3
+ '''
4
+ from typing import Annotated
5
+ import numpy
6
+ import numpy.typing as npt
7
+ import matplotlib.pyplot as plt
8
+
9
+ from dmu.logging.log_store import LogStore
10
+
11
+ Array2D = Annotated[npt.NDArray[numpy.float64], '(n,n)']
12
+ log = LogStore.add_logger('dmu:plotting:matrix')
13
+ #-------------------------------------------------------
14
+ class MatrixPlotter:
15
+ '''
16
+ Class used to plot matrices
17
+ '''
18
+ # -----------------------------------------------
19
+ def __init__(self, mat : Array2D, cfg : dict):
20
+ self._mat = mat
21
+ self._cfg = cfg
22
+
23
+ self._size : int
24
+ self._l_label : list[str]
25
+ # -----------------------------------------------
26
+ def _initialize(self) -> None:
27
+ self._check_matrix()
28
+ self._reformat_matrix()
29
+ self._set_labels()
30
+ self._mask_matrix()
31
+ # -----------------------------------------------
32
+ def _mask_matrix(self) -> None:
33
+ if 'mask_value' not in self._cfg:
34
+ return
35
+
36
+ mask_val = self._cfg['mask_value']
37
+ log.debug(f'Masking value: {mask_val}')
38
+
39
+ self._mat = numpy.ma.masked_where(self._mat == mask_val, self._mat)
40
+ # -----------------------------------------------
41
+ def _check_matrix(self) -> None:
42
+ a, b = self._mat.shape
43
+
44
+ if a != b:
45
+ raise ValueError(f'Matrix is not square, but with shape: {a}x{b}')
46
+
47
+ self._size = a
48
+ # -----------------------------------------------
49
+ def _set_labels(self) -> None:
50
+ if 'labels' not in self._cfg:
51
+ raise ValueError('Labels entry missing')
52
+
53
+ l_lab = self._cfg['labels']
54
+ nlab = len(l_lab)
55
+
56
+ if nlab != self._size:
57
+ raise ValueError(f'Number of labels is not equal to its size: {nlab}!={self._size}')
58
+
59
+ self._l_label = l_lab
60
+ # -----------------------------------------------
61
+ def _reformat_matrix(self) -> None:
62
+ if 'upper' not in self._cfg:
63
+ log.debug('Drawing full matrix')
64
+ return
65
+
66
+ upper = self._cfg['upper']
67
+ if upper not in [True, False]:
68
+ raise ValueError(f'Invalid value for upper setting: {upper}')
69
+
70
+ if upper:
71
+ log.debug('Drawing upper matrix')
72
+ self._mat = numpy.triu(self._mat, 0)
73
+ return
74
+
75
+ if not upper:
76
+ log.debug('Drawing lower matrix')
77
+ self._mat = numpy.tril(self._mat, 0)
78
+ return
79
+ # -----------------------------------------------
80
+ def _set_axes(self, ax) -> None:
81
+ ax.set_xticks(numpy.arange(self._size))
82
+ ax.set_yticks(numpy.arange(self._size))
83
+
84
+ ax.set_xticklabels(self._l_label)
85
+ ax.set_yticklabels(self._l_label)
86
+
87
+ rotation = 45
88
+ if 'label_angle' in self._cfg:
89
+ rotation = self._cfg['label_angle']
90
+
91
+ plt.setp(ax.get_xticklabels(), rotation=rotation, ha="right", rotation_mode="anchor")
92
+ # -----------------------------------------------
93
+ def _draw_matrix(self) -> None:
94
+ fsize = None
95
+ if 'size' in self._cfg:
96
+ fsize = self._cfg['size']
97
+
98
+ if 'zrange' not in self._cfg:
99
+ raise ValueError('z range not found in configuration')
100
+
101
+ [zmin, zmax] = self._cfg['zrange']
102
+
103
+ fig, ax = plt.subplots() if fsize is None else plt.subplots(figsize=fsize)
104
+
105
+ palette = plt.cm.viridis
106
+ im = ax.imshow(self._mat, cmap=palette, vmin=zmin, vmax=zmax)
107
+ self._set_axes(ax)
108
+
109
+ if 'format' in self._cfg:
110
+ self._add_text(ax)
111
+ else:
112
+ log.debug('Not adding values to matrix but bar')
113
+ fig.colorbar(im)
114
+
115
+ if 'title' not in self._cfg:
116
+ return
117
+
118
+ title = self._cfg['title']
119
+ ax.set_title(title)
120
+ fig.tight_layout()
121
+ # -----------------------------------------------
122
+ def _add_text(self, ax):
123
+ fontsize = 12
124
+ if 'fontsize' in self._cfg:
125
+ fontsize = self._cfg['fontsize']
126
+
127
+ form = self._cfg['format']
128
+ log.debug(f'Adding values with format {form}')
129
+
130
+ for i_x, _ in enumerate(self._l_label):
131
+ for i_y, _ in enumerate(self._l_label):
132
+ try:
133
+ val = self._mat[i_y, i_x]
134
+ except:
135
+ log.error(f'Cannot access ({i_x}, {i_y}) in:')
136
+ print(self._mat)
137
+ raise
138
+
139
+ if numpy.ma.is_masked(val):
140
+ text = ''
141
+ else:
142
+ text = form.format(val)
143
+
144
+ _ = ax.text(i_x, i_y, text, ha="center", va="center", fontsize=fontsize, color="k")
145
+ # -----------------------------------------------
146
+ def plot(self):
147
+ '''
148
+ Runs plotting, plot can be accessed through:
149
+
150
+ ```python
151
+ plt.show()
152
+ plt.savefig(...)
153
+ ```
154
+ '''
155
+ self._initialize()
156
+ self._draw_matrix()
157
+ #-------------------------------------------------------
@@ -2,7 +2,6 @@
2
2
  Module containing plotter class
3
3
  '''
4
4
 
5
- import hist
6
5
  from hist import Hist
7
6
 
8
7
  import numpy
@@ -79,6 +78,7 @@ class Plotter1D(Plotter):
79
78
  l_bc_all = []
80
79
  for name, arr_val in d_data.items():
81
80
  arr_wgt = d_wgt[name] if d_wgt is not None else numpy.ones_like(arr_val)
81
+ arr_wgt = self._normalize_weights(arr_wgt, var)
82
82
  hst = Hist.new.Reg(bins=bins, start=minx, stop=maxx, name='x', label=name).Weight()
83
83
  hst.fill(x=arr_val, weight=arr_wgt)
84
84
  hst.plot(label=name)
@@ -88,6 +88,23 @@ class Plotter1D(Plotter):
88
88
 
89
89
  return max_y
90
90
  # --------------------------------------------
91
+ def _normalize_weights(self, arr_wgt : numpy.ndarray, var : str) -> numpy.ndarray:
92
+ cfg_var = self._d_cfg['plots'][var]
93
+ if 'normalized' not in cfg_var:
94
+ log.debug(f'Not normalizing for variable: {var}')
95
+ return arr_wgt
96
+
97
+ if not cfg_var['normalized']:
98
+ log.debug(f'Not normalizing for variable: {var}')
99
+ return arr_wgt
100
+
101
+ log.debug(f'Normalizing for variable: {var}')
102
+ total = numpy.sum(arr_wgt)
103
+ arr_wgt = arr_wgt / total
104
+
105
+ return arr_wgt
106
+ # --------------------------------------------
107
+
91
108
  def _style_plot(self, var : str, max_y : float) -> None:
92
109
  d_cfg = self._d_cfg['plots'][var]
93
110
  yscale = d_cfg['yscale' ] if 'yscale' in d_cfg else 'linear'
@@ -0,0 +1,33 @@
1
+ '''
2
+ Module with plotting utilities
3
+ '''
4
+ # pylint: disable=too-many-positional-arguments, too-many-arguments
5
+
6
+ import matplotlib.pyplot as plt
7
+
8
+ # ---------------------------------------------------------------------------
9
+ def annotate(
10
+ l_x : list[float],
11
+ l_y : list[float],
12
+ l_v : list[float],
13
+ form : str = '{}',
14
+ xoff : int = 0,
15
+ yoff : int =-20,
16
+ size : int = 20,
17
+ color : str = 'black') -> None:
18
+ '''
19
+ Function used to annotate plots
20
+
21
+ l_x(y): List of x(y) coordinates for markers
22
+ l_v : List of numerical values to annotate markers
23
+ form : Formatting, e.g. {:.3f}
24
+ color : String with color for markers and annotation, e.g. black
25
+ size : Font size, default 20
26
+ x(y)off : Offset in x(y).
27
+ '''
28
+ for x, y, v in zip(l_x, l_y, l_v):
29
+ label = form.format(v)
30
+
31
+ plt.plot(x, y, marker='o', markersize= 5, markeredgecolor=color, markerfacecolor=color)
32
+ plt.annotate(label, (x,y), fontsize=size, textcoords="offset points", xytext=(xoff, yoff), color=color, ha='center')
33
+ # ---------------------------------------------------------------------------
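
For illustration, a minimal standalone use of `annotate` could look as follows; the coordinates and annotated values are made up:

```python
# Plot a few made-up points and annotate them with made-up scores
import matplotlib.pyplot as plt

import dmu.plotting.utilities as plu

l_x = [0.5, 0.7, 0.9]      # e.g. signal efficiencies
l_y = [0.95, 0.90, 0.70]   # e.g. background rejections
l_v = [0.82, 0.65, 0.41]   # e.g. classifier scores shown next to the markers

plt.plot(l_x, l_y, color='b')
plu.annotate(l_x=l_x, l_y=l_y, l_v=l_v, form='{:.2f}', color='green', xoff=-15, yoff=-15, size=10)
plt.savefig('annotated.png')
```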
@@ -1,5 +1,8 @@
1
+ dataset:
2
+ nan :
3
+ x : 0
1
4
  training :
2
- nfold : 3
5
+ nfold : 3
3
6
  features : [x, y, z]
4
7
  rdm_stat : 1
5
8
  hyper :
@@ -7,31 +10,43 @@ training :
7
10
  n_estimators : 100
8
11
  max_depth : 3
9
12
  learning_rate : 0.1
10
- min_samples_split : 2
13
+ min_samples_split : 2
11
14
  saving:
12
- path : 'tests/ml/train_mva/model.pkl'
15
+ path : '/tmp/dmu/ml/tests/train_mva/model.pkl'
13
16
  plotting:
14
17
  roc :
15
- min : [0, 0]
16
- val_dir : 'tests/ml/train_mva'
18
+ min : [0.0, 0.0]
19
+ max : [1.2, 1.2]
20
+ annotate:
21
+ sig_eff : [0.5, 0.6, 0.7, 0.8, 0.9]
22
+ form : '{:.2f}'
23
+ color: 'green'
24
+ xoff : -15
25
+ yoff : -15
26
+ size : 10
27
+ correlation:
28
+ title : 'Correlation matrix'
29
+ size : [10, 10]
30
+ mask_value : 0
31
+ val_dir : '/tmp/dmu/ml/tests/train_mva'
17
32
  features:
18
33
  saving:
19
- plt_dir : 'tests/ml/train_mva/features'
34
+ plt_dir : '/tmp/dmu/ml/tests/train_mva/features'
20
35
  plots:
21
- w :
36
+ w :
22
37
  binning : [-4, 4, 100]
23
- yscale : 'linear'
38
+ yscale : 'linear'
24
39
  labels : ['w', '']
25
- x :
40
+ x :
26
41
  binning : [-4, 4, 100]
27
- yscale : 'linear'
42
+ yscale : 'linear'
28
43
  labels : ['x', '']
29
- y :
44
+ y :
30
45
  binning : [-4, 4, 100]
31
- yscale : 'linear'
46
+ yscale : 'linear'
32
47
  labels : ['y', '']
33
- z :
48
+ z :
34
49
  binning : [-4, 4, 100]
35
- yscale : 'linear'
50
+ yscale : 'linear'
36
51
  labels : ['z', '']
37
-
52
+
@@ -0,0 +1,9 @@
1
+ saving:
2
+ plt_dir : tests/plotting/normalized
3
+ plots:
4
+ x :
5
+ normalized : true
6
+ binning : [-5.0, 8.0, 40]
7
+ y :
8
+ normalized : false
9
+ binning : [-5.0, 8.0, 40]
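
A sketch of how a configuration like the one above would presumably be used, following the `Plotter1D` usage that appears in `train_mva.py`; the toy input dataframe and the path to the YAML file are assumptions made for illustration:

```python
# Build a toy ROOT dataframe and plot it with the normalized.yaml settings
import yaml
from ROOT import RDataFrame

from dmu.plotting.plotter_1d import Plotter1D as Plotter

rdf = RDataFrame(1000)
rdf = rdf.Define('x', 'gRandom->Gaus(0, 1)')
rdf = rdf.Define('y', 'gRandom->Gaus(1, 2)')

with open('normalized.yaml', encoding='utf-8') as ifile:
    cfg = yaml.safe_load(ifile)

ptr = Plotter(d_rdf={'toy' : rdf}, cfg=cfg)
ptr.run()
```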