data-manipulation-utilities 0.0.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. data_manipulation_utilities-0.0.1.dist-info/METADATA +713 -0
  2. data_manipulation_utilities-0.0.1.dist-info/RECORD +45 -0
  3. data_manipulation_utilities-0.0.1.dist-info/WHEEL +5 -0
  4. data_manipulation_utilities-0.0.1.dist-info/entry_points.txt +6 -0
  5. data_manipulation_utilities-0.0.1.dist-info/top_level.txt +3 -0
  6. dmu/arrays/utilities.py +55 -0
  7. dmu/dataframe/dataframe.py +36 -0
  8. dmu/generic/utilities.py +69 -0
  9. dmu/logging/log_store.py +129 -0
  10. dmu/ml/cv_classifier.py +122 -0
  11. dmu/ml/cv_predict.py +152 -0
  12. dmu/ml/train_mva.py +257 -0
  13. dmu/ml/utilities.py +132 -0
  14. dmu/plotting/plotter.py +227 -0
  15. dmu/plotting/plotter_1d.py +113 -0
  16. dmu/plotting/plotter_2d.py +87 -0
  17. dmu/rdataframe/atr_mgr.py +79 -0
  18. dmu/rdataframe/utilities.py +72 -0
  19. dmu/rfile/rfprinter.py +91 -0
  20. dmu/rfile/utilities.py +34 -0
  21. dmu/stats/fitter.py +515 -0
  22. dmu/stats/function.py +314 -0
  23. dmu/stats/utilities.py +134 -0
  24. dmu/testing/utilities.py +119 -0
  25. dmu/text/transformer.py +182 -0
  26. dmu_data/__init__.py +0 -0
  27. dmu_data/ml/tests/train_mva.yaml +37 -0
  28. dmu_data/plotting/tests/2d.yaml +14 -0
  29. dmu_data/plotting/tests/fig_size.yaml +13 -0
  30. dmu_data/plotting/tests/high_stat.yaml +22 -0
  31. dmu_data/plotting/tests/name.yaml +14 -0
  32. dmu_data/plotting/tests/no_bounds.yaml +12 -0
  33. dmu_data/plotting/tests/simple.yaml +8 -0
  34. dmu_data/plotting/tests/title.yaml +14 -0
  35. dmu_data/plotting/tests/weights.yaml +13 -0
  36. dmu_data/text/transform.toml +4 -0
  37. dmu_data/text/transform.txt +6 -0
  38. dmu_data/text/transform_set.toml +8 -0
  39. dmu_data/text/transform_set.txt +6 -0
  40. dmu_data/text/transform_trf.txt +12 -0
  41. dmu_scripts/physics/check_truth.py +121 -0
  42. dmu_scripts/rfile/compare_root_files.py +299 -0
  43. dmu_scripts/rfile/print_trees.py +35 -0
  44. dmu_scripts/ssh/coned.py +168 -0
  45. dmu_scripts/text/transform_text.py +46 -0
@@ -0,0 +1,713 @@
Metadata-Version: 2.1
Name: data_manipulation_utilities
Version: 0.0.1
Description-Content-Type: text/markdown
Requires-Dist: zfit
Requires-Dist: PyYAML
Requires-Dist: scipy
Requires-Dist: awkward
Requires-Dist: tqdm
Requires-Dist: joblib
Requires-Dist: scikit-learn
Requires-Dist: toml
Requires-Dist: numpy
Requires-Dist: matplotlib
Requires-Dist: mplhep
Requires-Dist: hist[plot]
Requires-Dist: polars
Requires-Dist: pandas
Provides-Extra: dev
Requires-Dist: pytest ; extra == 'dev'

# D(ata) M(anipulation) U(tilities)

These are tools that can be used for different data analysis tasks.

# Generic

This section describes generic tools that could not be put in a specific category, but tend to be useful.

## Timer

In order to benchmark functions do:

```python
from time import sleep

import dmu.generic.utilities as gut

# Needs to be turned on, it's off by default
gut.TIMER_ON=True

@gut.timeit
def fun():
    sleep(3)

fun()
```

## JSON dumper

The following lines will dump data (dictionaries, lists, etc.) to a JSON file:

```python
import dmu.generic.utilities as gut

data = [1,2,3,4]

gut.dump_json(data, '/tmp/list.json')
```

# Physics

## Truth matching

In order to compare the truth-matching efficiencies and distributions after truth matching is applied in several samples, run:

```bash
check_truth -c configuration.yaml
```

where the config file can look like:

```yaml
# ---------
max_entries : 1000
samples:
    # Below are the samples for which the methods will be compared
    sample_a:
        file_path : /path/to/root/files/*.root
        tree_path : TreeName
methods :
    #Below we specify the ways truth matching will be carried out
    bkg_cat : B_BKGCAT == 0 || B_BKGCAT == 10 || B_BKGCAT == 50
    true_id : TMath::Abs(B_TRUEID) == 521 && TMath::Abs(Jpsi_TRUEID) == 443 && TMath::Abs(Jpsi_MC_MOTHER_ID) == 521 && TMath::Abs(L1_TRUEID) == 11 && TMath::Abs(L2_TRUEID) == 11 && TMath::Abs(L1_MC_MOTHER_ID) == 443 && TMath::Abs(L2_MC_MOTHER_ID) == 443 && TMath::Abs(H_TRUEID) == 321 && TMath::Abs(H_MC_MOTHER_ID) == 521
plot:
    # Below are the options used by Plotter1D (see plotting documentation below)
    definitions:
        mass : B_nopv_const_mass_M[0]
    plots:
        mass :
            binning    : [5000, 6000, 40]
            yscale     : 'linear'
            labels     : ['$M_{DTF-noPV}(B^+)$', 'Entries']
            normalized : true
    saving:
        plt_dir : /path/to/directory/with/plots
```

# Math

## PDFs

### Printing PDFs

One can print a zfit PDF by doing:

```python
from dmu.stats.utilities import print_pdf

print_pdf(pdf)
```

This should produce an output that looks like:

```
PDF: SumPDF
OBS: <zfit Space obs=('m',), axes=(0,), limits=(array([[-10.]]), array([[10.]])), binned=False>
Name  Value       Low         High       Floating  Constraint
--------------------------------------------------------------
fr1   5.000e-01    0.000e+00  1.000e+00  1         none
fr2   5.000e-01    0.000e+00  1.000e+00  1         none
mu1   4.000e-01   -5.000e+00  5.000e+00  1         none
mu2   4.000e-01   -5.000e+00  5.000e+00  1         none
sg1   1.300e+00    0.000e+00  5.000e+00  1         none
sg2   1.300e+00    0.000e+00  5.000e+00  1         none
```

showing basic information on the observable and, for each parameter, its value and range, whether it is floating and whether it is Gaussian constrained.
One can add other options too:

```python
from dmu.stats.utilities import print_pdf

# Constraints, uncorrelated for now
d_const = {'mu1' : [0.0, 0.1], 'sg1' : [1.0, 0.1]}
#-----------------
# Simplest printing to screen
print_pdf(pdf)

# Will not show certain parameters
print_pdf(pdf,
          blind    = ['sg.*', 'mu.*'])

# Will add constraints
print_pdf(pdf,
          d_const  = d_const,
          blind    = ['sg.*', 'mu.*'])
#-----------------
# Same as above, but will dump to a text file instead of the screen
#-----------------
print_pdf(pdf,
          txt_path = 'tests/stats/utilities/print_pdf/pdf.txt')

print_pdf(pdf,
          blind    = ['sg.*', 'mu.*'],
          txt_path = 'tests/stats/utilities/print_pdf/pdf_blind.txt')

print_pdf(pdf,
          d_const  = d_const,
          txt_path = 'tests/stats/utilities/print_pdf/pdf_const.txt')
```

## Fits

The `Fitter` class is a wrapper around zfit, used to make fitting easier.

### Simplest fit

```python
from dmu.stats.fitter import Fitter

obj = Fitter(pdf, dat)
res = obj.fit()
```

### Customizations
In order to customize the way the fitting is done, one would pass a configuration dictionary to the `fit(cfg=config)`
function. This dictionary can be represented in YAML as:

```yaml
# The strategies below are exclusive, only one should be used at a time
strategy :
    # This strategy will fit multiple times and retry the fit until either
    # ntries is exhausted or the pvalue is reached.
    retry :
        ntries        : 4    #Number of tries
        pvalue_thresh : 0.05 #Pvalue threshold, if the fit is better than this, the loop ends
        ignore_status : true #Will pick invalid fits if this is true, otherwise only valid fits will be counted
    # This strategy will fit smaller datasets and get the values of the shape parameters to allow
    # these shapes to float only around those values and within nsigma.
    # The fit can be carried out multiple times with larger and larger samples to tighten the parameters.
    steps :
        nsteps : [1e3, 1e4]     #Number of entries to use
        nsigma : [5.0, 2.0]     #Number of sigmas for the range of the parameter, for each step
        yields : ['ny1', 'ny2'] #In the fitting model, ny1 and ny2 are the names of the yield parameters; all the yields need to go in this list
# The lines below will split the range of the data [0-10] into two subranges, such that the NLL is built
# only in those ranges. The ranges need to be tuples
ranges :
    - !!python/tuple [0, 3]
    - !!python/tuple [6, 9]
#The lines below will allow using constraints for each parameter, where the first element is the mean and the second
#the width of a Gaussian constraint. No correlations are implemented, yet.
constraints :
    mu : [5.0, 1.0]
    sg : [1.0, 0.1]
#After each fit, the parameters specified below will be printed, for debugging purposes
print_pars : ['mu', 'sg']
likelihood :
    nbins : 100 #If specified, will do a binned likelihood fit instead of an unbinned one
```
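
Such a dictionary can also be built directly in Python; the sketch below mirrors parts of the YAML above (`pdf` and `dat` are the objects from the simplest-fit example, and the parameter names `mu` and `sg` are the hypothetical ones used above):

```python
from dmu.stats.fitter import Fitter

# Sketch: only the retry strategy, the constraints and the parameter printing are configured
cfg = {
    'strategy'    : {'retry' : {'ntries' : 4, 'pvalue_thresh' : 0.05, 'ignore_status' : True}},
    'constraints' : {'mu' : [5.0, 1.0], 'sg' : [1.0, 0.1]},
    'print_pars'  : ['mu', 'sg'],
}

obj = Fitter(pdf, dat)
res = obj.fit(cfg=cfg)
```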

## Arrays

### Scaling by non-integer

Given an array representing a distribution, the following lines will increase its size
by `fscale`, where this number is a float, e.g. 3.4:

```python
from dmu.arrays.utilities import repeat_arr

arr_val = repeat_arr(arr_val = arr_inp, ftimes = fscale)
```

The output array will be `fscale` times larger than the input one, but will keep the same distribution.
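
The idea behind a non-integer repetition can be pictured with the sketch below; this is only an illustration, not necessarily how `repeat_arr` is implemented:

```python
import numpy

def repeat_sketch(arr_inp : numpy.ndarray, ftimes : float) -> numpy.ndarray:
    '''Tile the array floor(ftimes) times and draw the remaining
    fraction of entries at random, which preserves the distribution'''
    nfull     = int(ftimes)
    nextra    = int(round((ftimes - nfull) * arr_inp.size))
    arr_extra = numpy.random.choice(arr_inp, size=nextra, replace=False)

    return numpy.concatenate([numpy.tile(arr_inp, nfull), arr_extra])

arr_out = repeat_sketch(numpy.random.normal(0, 1, size=1000), ftimes=3.4)
assert arr_out.size == 3400
```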

## Functions

The project contains the `Function` class that can be used to:

- Store `(x,y)` coordinates.
- Evaluate the function by interpolating.
- Save the function to a JSON file.
- Load the function from the JSON file.

It can be used as:

```python
import numpy
from dmu.stats.function import Function

x = numpy.linspace(0, 5, num=10)
y = numpy.sin(x)

path = './function.json'

# By default the interpolation is 'cubic'; this uses scipy's interp1d,
# refer to that documentation for more information.
fun = Function(x=x, y=y, kind='cubic')
fun.save(path = path)

fun = Function.load(path)

xval = numpy.linspace(0, 5, num=100)
yval = fun(xval)
```

# Machine learning

## Classification

To train models to classify data between signal and background, starting from ROOT dataframes, do:

```python
from dmu.ml.train_mva import TrainMva

rdf_sig = _get_rdf(kind='sig')
rdf_bkg = _get_rdf(kind='bkg')
cfg     = _get_config()

obj = TrainMva(sig=rdf_sig, bkg=rdf_bkg, cfg=cfg)
obj.run()
```

where the settings for the training go in a config dictionary, which when written to YAML looks like:

```yaml
training :
    nfold    : 10
    features : [w, x, y, z]
    hyper    :
        loss              : log_loss
        n_estimators      : 100
        max_depth         : 3
        learning_rate     : 0.1
        min_samples_split : 2
saving:
    path : 'tests/ml/train_mva/model.pkl'
plotting:
    val_dir : 'tests/ml/train_mva'
    features:
        saving:
            plt_dir : 'tests/ml/train_mva/features'
        plots:
            w :
                binning : [-4, 4, 100]
                yscale  : 'linear'
                labels  : ['w', '']
            x :
                binning : [-4, 4, 100]
                yscale  : 'linear'
                labels  : ['x', '']
            y :
                binning : [-4, 4, 100]
                yscale  : 'linear'
                labels  : ['y', '']
            z :
                binning : [-4, 4, 100]
                yscale  : 'linear'
                labels  : ['z', '']
```

`TrainMva` is just a wrapper around `scikit-learn` that enables cross-validation (which is what the `nfold` setting refers to).

### Caveats

When training on real data, several things might go wrong and the code will try to deal with them in the following ways:

- **Repeated entries**: Entire rows of features might appear multiple times. When doing cross-validation, this might mean that two identical entries
end up in different folds. The tool checks whether a model is evaluated on an entry that was used for training and raises an exception. Thus, repeated
entries are removed before training.

- **NaNs**: Entries with NaNs will break the training with the scikit-learn `GradientBoostingClassifier` base class. Thus, we also remove them from the training, as sketched after this list.
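
The cleaning described above can be sketched with pandas; this is illustrative only and assumes the features sit in a pandas DataFrame:

```python
import pandas as pnd

def clean_features(df : pnd.DataFrame) -> pnd.DataFrame:
    # Drop rows with NaNs, which would break the training
    df = df.dropna()
    # Drop exact duplicates, so that identical entries cannot
    # end up in different folds
    df = df.drop_duplicates()

    return df
```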

## Application

Given the models already trained, one can use them with:

```python
from dmu.ml.cv_predict import CVPredict

#Build predictor with list of models and ROOT dataframe with data
cvp = CVPredict(models=l_model, rdf=rdf)

#This will return an array of probabilities
arr_prb = cvp.predict()
```

If the entries in the input dataframe were used for the training of some of the models, the model that was not trained on them
will be _automatically_ picked for the prediction of that specific sample.

The picking process happens through the comparison of hashes between the samples in `rdf` and the training samples.
The hashes of the training samples are stored in the pickled model itself; for this purpose the model is a reimplementation of
`GradientBoostingClassifier`, here called `CVClassifier`.

If a sample exists that was used in the training of _every_ model, no model can be chosen for the prediction and a
`CVSameData` exception will be raised.
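
The hashing can be pictured with the sketch below; it illustrates the idea of one hash per entry, not the library's exact implementation:

```python
import hashlib
import numpy

def hash_entry(row : numpy.ndarray) -> str:
    '''Build a hash from the feature values of one entry'''
    sval = ','.join(str(val) for val in row)

    return hashlib.sha256(sval.encode()).hexdigest()
```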

### Caveats

When evaluating the model with real data, problems might occur; we deal with them as follows:

- **Repeated entries**: When there are repeated features in the dataset to be evaluated, we assign the same probabilities; no filtering is used.
- **NaNs**: Entries with NaNs will break the evaluation. These entries will be _patched_ with zeros and evaluated. However, before returning, the probabilities will be
set to -1, i.e. entries with NaNs will have probabilities of -1, as sketched after this list.
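
The NaN handling described above can be sketched as follows; this assumes a scikit-learn-like model and a feature array, and is only an illustration:

```python
import numpy

def predict_with_nans(model, arr_feat : numpy.ndarray) -> numpy.ndarray:
    # Remember which entries have NaNs and patch them with zeros
    mask_nan = numpy.isnan(arr_feat).any(axis=1)
    arr_ptch = numpy.nan_to_num(arr_feat, nan=0.0)

    # Evaluate, then flag the patched entries with -1
    arr_prb           = model.predict_proba(arr_ptch)[:, 1]
    arr_prb[mask_nan] = -1

    return arr_prb
```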

# Rdataframes

These are utility functions meant to be used with ROOT dataframes.

## Adding a column from a numpy array

For this do:

```python
import numpy

import dmu.rdataframe.utilities as ut

arr_val = numpy.array([10, 20, 30])
rdf     = ut.add_column(rdf, arr_val, 'values')
```

the `add_column` function will check for:

1. Presence of a column with the same name
2. Same size for array and existing dataframe

and return a dataframe with the added column.

## Attaching attributes

**Use case**: when performing operations on dataframes, like `Filter`, `Range`, etc., a new instance of the dataframe
will be created. One might want to attach attributes to the dataframe, like the name of the file or the tree, etc.
Those attributes would thus be dropped. In order to deal with this one can do:

```python
from dmu.rdataframe.atr_mgr import AtrMgr

# Pick up the attributes
obj = AtrMgr(rdf)

# Do things to the dataframe
rdf = rdf.Filter('x > 0')
rdf = rdf.Define('a', 'b')

# Put back the attributes
rdf = obj.add_atr(rdf)
```

The attributes can also be saved to JSON with:

```python
obj = AtrMgr(rdf)
...
obj.to_json('/path/to/file.json')
```

# Dataframes

Polars is very fast; however, its interface is not simple. Therefore this project has a derived class
called `DataFrame`, which implements a more user-friendly interface. It can be used as:

```python
from dmu.dataframe.dataframe import DataFrame

df = DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})

# Defining a new column
df = df.define('c', 'a + b')
```

The remaining functionality is identical to `polars`.
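
A minimal sketch of how such a `define` can be built on top of polars is shown below; this is illustrative, not the actual implementation:

```python
import polars as pl

def define(df : pl.DataFrame, name : str, expr : str) -> pl.DataFrame:
    # pl.sql_expr parses a string expression over existing columns
    return df.with_columns(pl.sql_expr(expr).alias(name))

df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df = define(df, 'c', 'a + b')
```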

# Logging

The `LogStore` class is an interface to the `logging` module. It is aimed at making it easier to include
a good enough logging tool. It can be used as:

```python
from dmu.logging.log_store import LogStore

# This line is optional; the default backend is logging, but logzero is also supported
LogStore.backend = 'logging'

log = LogStore.add_logger('msg')
LogStore.set_level('msg', 10)  # 10 corresponds to logging.DEBUG

log.debug('debug')
log.info('info')
log.warning('warning')
log.error('error')
log.critical('critical')
```

# Plotting from ROOT dataframes

## 1D plots

Given a set of ROOT dataframes and a configuration dictionary, one can plot distributions with:

```python
from dmu.plotting.plotter_1d import Plotter1D as Plotter

ptr = Plotter(d_rdf=d_rdf, cfg=cfg_dat)
ptr.run()
```

where the config dictionary `cfg_dat` in YAML would look like:

```yaml
selection:
    #Will take at most 50K random entries. Will only happen if the dataset has more than 50K entries
    max_ran_entries : 50000
    cuts:
        #Will only use entries with z > 0
        z : 'z > 0'
saving:
    #Will save plots to this directory
    plt_dir : tests/plotting/high_stat
definitions:
    #Will define extra variables
    z : 'x + y'
#Settings to make histograms for different variables
plots:
    x :
        binning : [0.98, 0.98, 40] # If the bounds agree, the tool will calculate them as the 2% and 98% quantiles
        yscale  : 'linear'         # Optional; if not passed, will do linear, can be log
        labels  : ['x', 'Entries'] # Labels are optional; will use the variable name and 'Entries' if not present
        title   : 'some title can be added for different variable plots'
        name    : 'plot_of_x'      # This will ensure that one gets plot_of_x.png as a result; if missing, x.png would be saved
    y :
        binning : [-5.0, 8.0, 40]
        yscale  : 'linear'
        labels  : ['y', 'Entries']
    z :
        binning    : [-5.0, 8.0, 40]
        yscale     : 'linear'
        labels     : ['x + y', 'Entries']
        normalized : true #This will normalize the histogram to the area
```

It is up to the user to build this dictionary and load it; one way is sketched below.
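
Assuming the dictionary above was saved to a hypothetical `plotting.yaml`, with `d_rdf` the dictionary of dataframes from the example above:

```python
import yaml

from dmu.plotting.plotter_1d import Plotter1D as Plotter

# Load the plotting configuration from YAML
with open('plotting.yaml', encoding='utf-8') as ifile:
    cfg_dat = yaml.safe_load(ifile)

ptr = Plotter(d_rdf=d_rdf, cfg=cfg_dat)
ptr.run()
```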

## 2D plots

For the 2D case it would look like:

```python
from dmu.plotting.plotter_2d import Plotter2D as Plotter

ptr = Plotter(rdf=rdf, cfg=cfg_dat)
ptr.run()
```

where one would pass only one dataframe instead of a dictionary, given that overlaying 2D plots is not possible.
The config would look like:

```yaml
saving:
    plt_dir : tests/plotting/2d
general:
    size : [20, 10]
plots_2d:
    # Each entry holds: the x and y columns,
    # the name of the column holding the weights (null for no weights),
    # and the name of the output plot, e.g. xy_w -> xy_w.png
    - [x, y, weights, 'xy_w']
    - [x, y, null, 'xy_r']
axes:
    x :
        binning : [-5.0, 8.0, 40]
        label   : 'x'
    y :
        binning : [-5.0, 8.0, 40]
        label   : 'y'
```

# Manipulating ROOT files

## Getting trees from file

The lines below will return a dictionary with the trees found in the handle to a ROOT file:

```python
from ROOT import TFile

import dmu.rfile.utilities as rfut

ifile  = TFile("/path/to/root/file.root")

d_tree = rfut.get_trees_from_file(ifile)
```
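
Assuming the dictionary is keyed by tree name, one can then do, for instance:

```python
# Print the number of entries in each tree
for name, tree in d_tree.items():
    print(f'{name}: {tree.GetEntries()} entries')
```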

## Printing contents

The following lines will create a `file.txt` with the contents of `file.root`; the text file will be placed in the same location as the
ROOT file.

```python
from dmu.rfile.rfprinter import RFPrinter

obj = RFPrinter(path='/path/to/file.root')
obj.save()
```

## Printing from the command line

This is mostly needed from the command line and can be done with:

```bash
print_trees -p /path/to/file.root
```

which would produce a `/path/to/file.txt` file whose contents would look like:

```
Directory/Treename
    B_CHI2              Double_t
    B_CHI2DOF           Double_t
    B_DIRA_OWNPV        Float_t
    B_ENDVERTEX_CHI2    Double_t
    B_ENDVERTEX_CHI2DOF Double_t
```

## Comparing ROOT files

Given two ROOT files, the command below:

```bash
compare_root_files -f file_1.root file_2.root
```

will check if:

1. The files have the same trees. If not, it will print which trees are in the first file but not in the second
and vice versa.
2. The trees have the same branches. The same checks as above will be carried out here.
3. The branches of the corresponding trees have the same values.

The output will also go to a `summary.yaml` file that will look like:

```yaml
'Branches that differ for tree: Hlt2RD_BToMuE/DecayTree':
- L2_BREMHYPOENERGY
- L2_ECALPIDMU
- L1_IS_NOT_H
'Branches that differ for tree: Hlt2RD_LbToLMuMu_LL/DecayTree':
- P_CaloNeutralHcal2EcalEnergyRatio
- P_BREMENERGY
- Pi_IS_NOT_H
- P_BREMPIDE
Trees only in file_1.root: []
Trees only in file_2.root:
- Hlt2RD_BuToKpEE_MVA_misid/DecayTree
- Hlt2RD_BsToPhiMuMu_MVA/DecayTree
```

# Text manipulation

## Transformations

Run:

```bash
transform_text -i ./transform.txt -c ./transform.toml
```

to apply a transformation to `transform.txt` following the transformations in `transform.toml`.

The tool can be imported from another file like:

```python
from dmu.text.transformer import transformer as txt_trf

trf = txt_trf(txt_path='./transform.txt', cfg_path='./transform.toml')
trf.save_as(out_path='./transformed.txt')
```

Currently the supported transformations are:

### append

This transformation will append a set of lines after a given line; the config lines could look like:

```toml
[settings]
as_substring=true
format      ='--> {} <--'

[append]
'primes are'=['2', '3', '5']
'days are'=['Monday', 'Tuesday', 'Wednesday']
```

`as_substring` is a flag that will allow matches even if a line in the text file only contains the key in the config,
e.g.:

```
the
first
primes are:
and
the first
days are:
```

`format` will format the lines to be inserted, e.g.:

```
the
first
primes are:
--> 2 <--
--> 3 <--
--> 5 <--
and
the first
days are:
--> Monday <--
--> Tuesday <--
--> Wednesday <--
```

## coned

Utility used to edit the list of SSH connections; it behaves as follows:

```bash
#Prints all connections
coned -p

#Adds a task name to a given server
coned -a server_name server_index task

#Removes a task name from a given server
coned -d server_name server_index task
```

The list of servers with tasks and machines is specified in a YAML file that can look like:

```yaml
ihep:
    '001' :
        - checks
        - extractor
        - dsmanager
        - classifier
    '002' :
        - checks
        - hqm2
        - dotfiles
        - data_checks
    '003' :
        - setup
        - ntupling
        - preselection
    '004' :
        - scripts
        - tools
        - dmu
        - ap
lxplus:
    '984' :
        - ap
```

and should be placed in `$HOME/.config/dmu/ssh/servers.yaml`.