pycompound 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl

@@ -8,32 +8,6 @@ import matplotlib.pyplot as plt
8
8
 
9
9
 
10
10
  def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_ID1=None, spectrum_ID2=None, similarity_measure='cosine', weights={'Cosine':0.25,'Shannon':0.25,'Renyi':0.25,'Tsallis':0.25}, spectrum_preprocessing_order='FCNMWL', high_quality_reference_library=False, mz_min=0, mz_max=9999999, int_min=0, int_max=9999999, window_size_centroiding=0.5, window_size_matching=0.5, noise_threshold=0.0, wf_mz=0.0, wf_intensity=1.0, LET_threshold=0.0, entropy_dimension=1.1, y_axis_transformation='normalized', output_path=None, return_plot=False):
11
- '''
12
- plots two spectra against each other before and after preprocessing transformations for high-resolution mass spectrometry data
13
-
14
- --query_data: mgf, mzML, or csv file of query mass spectrum/spectra to be identified. If csv file, each row should correspond to a mass spectrum, the left-most column should contain an identifier, and each of the other columns should correspond to a single mass/charge ratio. Mandatory argument.
15
- --reference_data: mgf, mzML, or csv file of the reference mass spectra. If csv file, each row should correspond to a mass spectrum, the left-most column should contain in identifier (i.e. the CAS registry number or the compound name), and the remaining column should correspond to a single mass/charge ratio. Mandatory argument.
16
- --spectrum_ID1: ID of one spectrum to be plotted. Default is first spectrum in the query library. Optional argument.
17
- --spectrum_ID2: ID of another spectrum to be plotted. Default is first spectrum in the reference library. Optional argument.
18
- --similarity_measure: cosine, shannon, renyi, tsallis, mixture, jaccard, dice, 3w_jaccard, sokal_sneath, binary_cosine, mountford, mcconnaughey, driver_kroeber, simpson, braun_banquet, fager_mcgowan, kulczynski, intersection, hamming, hellinger. Default: cosine.
19
- --weights: dict of weights to give to each non-binary similarity measure (i.e. cosine, shannon, renyi, and tsallis) when the mixture similarity measure is specified. Default: 0.25 for each of the four non-binary similarity measures.
20
- --spectrum_preprocessing_order: The spectrum preprocessing transformations and the order in which they are to be applied. Note that these transformations are applied prior to computing similarity scores. Format must be a string with 2-6 characters chosen from C, F, M, N, L, W representing centroiding, filtering based on mass/charge and intensity values, matching, noise removal, low-entropy trannsformation, and weight-factor-transformation, respectively. For example, if \'WCM\' is passed, then each spectrum will undergo a weight factor transformation, then centroiding, and then matching. Note that if an argument is passed, then \'M\' must be contained in the argument, since matching is a required preprocessing step in spectral library matching of HRMS data. Furthermore, \'C\' must be performed before matching since centroiding can change the number of ion fragments in a given spectrum. Default: FCNMWL')
21
- --high_quality_reference_library: True/False flag indicating whether the reference library is considered to be of high quality. If True, then the spectrum preprocessing transformations of filtering and noise removal are performed only on the query spectrum/spectra. If False, all spectrum preprocessing transformations specified will be applied to both the query and reference spectra. Default: False')
22
- --mz_min: Remove all peaks with mass/charge value less than mz_min in each spectrum. Default: 0
23
- --mz_max: Remove all peaks with mass/charge value greater than mz_max in each spectrum. Default: 9999999
24
- --int_min: Remove all peaks with intensity value less than int_min in each spectrum. Default: 0
25
- --int_max: Remove all peaks with intensity value greater than int_max in each spectrum. Default: 9999999
26
- --window_size_centroiding: Window size parameter used in centroiding a given spectrum. Default: 0.5
27
- --window_size_matching: Window size parameter used in matching a query spectrum and a reference library spectrum. Default: 0.5
28
- --noise_threshold: Ion fragments (i.e. points in a given mass spectrum) with intensity less than max(intensities)*noise_threshold are removed. Default: 0.0
29
- --wf_mz: Mass/charge weight factor parameter. Default: 0.0
30
- --wf_intensity: Intensity weight factor parameter. Default: 0.0
31
- --LET_threshold: Low-entropy transformation threshold parameter. Spectra with Shannon entropy less than LET_threshold are transformed according to intensitiesNew=intensitiesOriginal^{(1+S)/(1+LET_threshold)}. Default: 0.0
32
- --entropy_dimension: Entropy dimension parameter. Must have positive value other than 1. When the entropy dimension is 1, then Renyi and Tsallis entropy are equivalent to Shannon entropy. Therefore, this parameter only applies to the renyi and tsallis similarity measures. This parameter will be ignored if similarity measure cosine or shannon is chosen. Default: 1.1
33
- --y_axis_transformation: transformation to apply to y-axis (i.e. intensity axis) of plots. Options: \'normalized\', \'none\', \'log10\', and \'sqrt\'. Default: normalized.')
34
- --output_path: path to output PDF file containing the plots of the spectra before and after preprocessing transformations. If no argument is passed, then the plots will be saved to the PDF ./spectrum1_{spectrum_ID1}_spectrum2_{spectrum_ID2}_plot.pdf in the current working directory.
35
- '''
36
-
37
11
  if query_data is None:
38
12
  print('\nError: No argument passed to the mandatory query_data. Please pass the path to the CSV file of the query data.')
39
13
  sys.exit()
@@ -41,12 +15,12 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
41
15
  extension = query_data.rsplit('.',1)
42
16
  extension = extension[(len(extension)-1)]
43
17
  if extension == 'mgf' or extension == 'MGF' or extension == 'mzML' or extension == 'mzml' or extension == 'MZML' or extension == 'cdf' or extension == 'CDF':
44
- output_path_tmp = query_data[:-3] + 'csv'
18
+ output_path_tmp = query_data[:-3] + 'txt'
45
19
  build_library_from_raw_data(input_path=query_data, output_path=output_path_tmp, is_reference=True)
46
- df_query = pd.read_csv(output_path_tmp)
47
- if extension == 'csv' or extension == 'CSV':
48
- df_query = pd.read_csv(query_data)
49
- unique_query_ids = df_query.iloc[:,0].unique().tolist()
20
+ df_query = pd.read_csv(output_path_tmp, sep='\t')
21
+ if extension == 'txt' or extension == 'TXT':
22
+ df_query = pd.read_csv(query_data, sep='\t')
23
+ unique_query_ids = df_query['id'].unique().tolist()
50
24
  unique_query_ids = [str(tmp) for tmp in unique_query_ids]
51
25
 
52
26
  if reference_data is None:
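The temporary path in this hunk is built by slicing three characters off the input path. That fits three-letter extensions such as mgf and cdf, but for mzML it would leave a stray character (sample.mzML would become sample.mtxt). Below is a minimal sketch of an extension-agnostic alternative using only the standard library. The helper name `derive_txt_path` is hypothetical and not part of pycompound.

```python
import os

# Raw formats named in the diff, lower-cased for case-insensitive comparison.
RAW_EXTENSIONS = {'mgf', 'mzml', 'cdf'}

def derive_txt_path(input_path):
    """Return (extension, txt_path) for an input path (hypothetical helper).

    Raw-data inputs get a sibling .txt path; anything else is returned unchanged.
    """
    root, ext = os.path.splitext(input_path)
    ext = ext.lstrip('.').lower()
    if ext in RAW_EXTENSIONS:
        return ext, root + '.txt'
    return ext, input_path
```

Splitting on the extension rather than on a fixed character count also lets the mgf/MGF/mzML/mzml/MZML comparisons collapse into a single lower-cased membership test.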
@@ -56,25 +30,25 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
56
30
  extension = reference_data.rsplit('.',1)
57
31
  extension = extension[(len(extension)-1)]
58
32
  if extension == 'mgf' or extension == 'MGF' or extension == 'mzML' or extension == 'mzml' or extension == 'MZML' or extension == 'cdf' or extension == 'CDF':
59
- output_path_tmp = reference_data[:-3] + 'csv'
33
+ output_path_tmp = reference_data[:-3] + 'txt'
60
34
  build_library_from_raw_data(input_path=reference_data, output_path=output_path_tmp, is_reference=True)
61
- df_reference = pd.read_csv(output_path_tmp)
62
- if extension == 'csv' or extension == 'CSV':
63
- df_reference = pd.read_csv(reference_data)
64
- unique_reference_ids = df_reference.iloc[:,0].unique().tolist()
35
+ df_reference = pd.read_csv(output_path_tmp, sep='\t')
36
+ if extension == 'txt' or extension == 'TXT':
37
+ df_reference = pd.read_csv(reference_data, sep='\t')
38
+ unique_reference_ids = df_reference['id'].unique().tolist()
65
39
  unique_reference_ids = [str(tmp) for tmp in unique_reference_ids]
66
40
 
67
41
 
68
42
  if spectrum_ID1 is not None:
69
43
  spectrum_ID1 = str(spectrum_ID1)
70
44
  else:
71
- spectrum_ID1 = str(df_query.iloc[0,0])
45
+ spectrum_ID1 = str(df_query['id'].iloc[0])
72
46
  print('No argument passed to spectrum_ID1; using the first spectrum in query_data.')
73
47
 
74
48
  if spectrum_ID2 is not None:
75
49
  spectrum_ID2 = str(spectrum_ID2)
76
50
  else:
77
- spectrum_ID2 = str(df_reference.iloc[0,0])
51
+ spectrum_ID2 = str(df_reference['id'].iloc[0])
78
52
  print('No argument passed to spectrum_ID2; using the first spectrum in reference_data.')
79
53
 
80
54
  if spectrum_preprocessing_order is not None:
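The hunks above move from positional access (`iloc[:,0]`) to named columns read from a tab-separated file. Below is a small sketch of the layout this appears to assume: one row per peak, with columns `id`, `mz_ratio`, and `intensity`. The column names are taken from the diff; the sample values are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Made-up tab-separated library: one row per peak.
tsv = "id\tmz_ratio\tintensity\nA\t100.0\t5.0\nA\t150.5\t20.0\nB\t99.9\t7.0\n"
df = pd.read_csv(io.StringIO(tsv), sep='\t')

# IDs are compared as strings, as in the diff.
unique_ids = [str(i) for i in df['id'].unique().tolist()]
first_id = str(df['id'].iloc[0])

# Select the rows of one spectrum, then build an (n_peaks, 2) array of m/z and intensity.
idxs = np.where(df['id'].astype(str) == first_id)[0]
spec = np.asarray(
    pd.concat([df['mz_ratio'].iloc[idxs], df['intensity'].iloc[idxs]], axis=1).reset_index(drop=True)
)
```

Named columns keep the selection correct even if the column order in the file changes, which positional indexing would not.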
@@ -157,17 +131,17 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
157
131
  if spectrum_ID1 in unique_query_ids and spectrum_ID2 in unique_query_ids:
158
132
  query_idx = unique_query_ids.index(spectrum_ID1)
159
133
  reference_idx = unique_query_ids.index(spectrum_ID2)
160
- q_idxs_tmp = np.where(df_query.iloc[:,0].astype(str) == unique_query_ids[query_idx])[0]
161
- r_idxs_tmp = np.where(df_query.iloc[:,0].astype(str) == unique_query_ids[reference_idx])[0]
162
- q_spec = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp,1], df_query.iloc[q_idxs_tmp,2]], axis=1).reset_index(drop=True))
163
- r_spec = np.asarray(pd.concat([df_query.iloc[r_idxs_tmp,1], df_query.iloc[r_idxs_tmp,2]], axis=1).reset_index(drop=True))
134
+ q_idxs_tmp = np.where(df_query['id'].astype(str) == unique_query_ids[query_idx])[0]
135
+ r_idxs_tmp = np.where(df_query['id'].astype(str) == unique_query_ids[reference_idx])[0]
136
+ q_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
137
+ r_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[r_idxs_tmp], df_query['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
164
138
  elif spectrum_ID1 in unique_reference_ids and spectrum_ID2 in unique_reference_ids:
165
139
  query_idx = unique_reference_ids.index(spectrum_ID1)
166
140
  reference_idx = unique_reference_ids.index(spectrum_ID2)
167
- q_idxs_tmp = np.where(df_reference.iloc[:,0].astype(str) == unique_reference_ids[query_idx])[0]
168
- r_idxs_tmp = np.where(df_reference.iloc[:,0].astype(str) == unique_reference_ids[reference_idx])[0]
169
- q_spec = np.asarray(pd.concat([df_reference.iloc[q_idxs_tmp,1], df_reference.iloc[q_idxs_tmp,2]], axis=1).reset_index(drop=True))
170
- r_spec = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp,1], df_reference.iloc[r_idxs_tmp,2]], axis=1).reset_index(drop=True))
141
+ q_idxs_tmp = np.where(df_reference['id'].astype(str) == unique_reference_ids[query_idx])[0]
142
+ r_idxs_tmp = np.where(df_reference['id'].astype(str) == unique_reference_ids[reference_idx])[0]
143
+ q_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[q_idxs_tmp], df_reference['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
144
+ r_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
171
145
  else:
172
146
  if spectrum_ID1 in unique_reference_ids and spectrum_ID2 in unique_query_ids:
173
147
  spec_tmp = spectrum_ID1
@@ -175,10 +149,10 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
175
149
  spectrum_ID2 = spec_tmp
176
150
  query_idx = unique_query_ids.index(spectrum_ID1)
177
151
  reference_idx = unique_reference_ids.index(spectrum_ID2)
178
- q_idxs_tmp = np.where(df_query.iloc[:,0].astype(str) == unique_query_ids[query_idx])[0]
179
- r_idxs_tmp = np.where(df_reference.iloc[:,0].astype(str) == unique_reference_ids[reference_idx])[0]
180
- q_spec = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp,1], df_query.iloc[q_idxs_tmp,2]], axis=1).reset_index(drop=True))
181
- r_spec = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp,1], df_reference.iloc[r_idxs_tmp,2]], axis=1).reset_index(drop=True))
152
+ q_idxs_tmp = np.where(df_query['id'].astype(str) == unique_query_ids[query_idx])[0]
153
+ r_idxs_tmp = np.where(df_reference['id'].astype(str) == unique_reference_ids[reference_idx])[0]
154
+ q_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
155
+ r_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
182
156
 
183
157
 
184
158
  q_spec_pre_trans = q_spec.copy()
@@ -293,9 +267,6 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
293
267
  plt.yticks([])
294
268
 
295
269
 
296
- print('\n\n\n')
297
- print(high_quality_reference_library)
298
- print('\n\n\n')
299
270
  plt.subplots_adjust(top=0.8, hspace=0.92, bottom=0.3)
300
271
  plt.figlegend(loc = 'upper center')
301
272
  fig.text(0.05, 0.18, f'Similarity Measure: {similarity_measure.capitalize()}', fontsize=7)
@@ -321,28 +292,6 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
321
292
 
322
293
 
323
294
  def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_ID1=None, spectrum_ID2=None, similarity_measure='cosine', weights={'Cosine':0.25,'Shannon':0.25,'Renyi':0.25,'Tsallis':0.25}, spectrum_preprocessing_order='FNLW', high_quality_reference_library=False, mz_min=0, mz_max=9999999, int_min=0, int_max=9999999, noise_threshold=0.0, wf_mz=0.0, wf_intensity=1.0, LET_threshold=0.0, entropy_dimension=1.1, y_axis_transformation='normalized', output_path=None, return_plot=False):
324
- '''
325
- plots two spectra against each other before and after preprocessing transformations for high-resolution mass spectrometry data
326
-
327
- --query_data: cdf or csv file of query mass spectrum/spectra to be identified. If csv file, each row should correspond to a mass spectrum, the left-most column should contain an identifier, and each of the other columns should correspond to a single mass/charge ratio. Mandatory argument.
328
- --reference_data: cdf of csv file of the reference mass spectra. If csv file, each row should correspond to a mass spectrum, the left-most column should contain in identifier (i.e. the CAS registry number or the compound name), and the remaining column should correspond to a single mass/charge ratio. Mandatory argument.
329
- --similarity_measure: cosine, shannon, renyi, tsallis, mixture, jaccard, dice, 3w_jaccard, sokal_sneath, binary_cosine, mountford, mcconnaughey, driver_kroeber, simpson, braun_banquet, fager_mcgowan, kulczynski, intersection, hamming, hellinger. Default: cosine.
330
- --weights: dict of weights to give to each non-binary similarity measure (i.e. cosine, shannon, renyi, and tsallis) when the mixture similarity measure is specified. Default: 0.25 for each of the four non-binary similarity measures.
331
- --spectrum_preprocessing_order: The spectrum preprocessing transformations and the order in which they are to be applied. Note that these transformations are applied prior to computing similarity scores. Format must be a string with 2-4 characters chosen from F, N, L, W representing filtering based on mass/charge and intensity values, noise removal, low-entropy trannsformation, and weight-factor-transformation, respectively. For example, if \'WN\' is passed, then each spectrum will undergo a weight factor transformation and then noise removal. Default: FNLW')
332
- --high_quality_reference_library: True/False flag indicating whether the reference library is considered to be of high quality. If True, then the spectrum preprocessing transformations of filtering and noise removal are performed only on the query spectrum/spectra. If False, all spectrum preprocessing transformations specified will be applied to both the query and reference spectra. Default: False')
333
- --mz_min: Remove all peaks with mass/charge value less than mz_min in each spectrum. Default: 0
334
- --mz_max: Remove all peaks with mass/charge value greater than mz_max in each spectrum. Default: 9999999
335
- --int_min: Remove all peaks with intensity value less than int_min in each spectrum. Default: 0
336
- --int_max: Remove all peaks with intensity value greater than int_max in each spectrum. Default: 9999999
337
- --noise_threshold: Ion fragments (i.e. points in a given mass spectrum) with intensity less than max(intensities)*noise_threshold are removed. Default: 0.0
338
- --wf_mz: Mass/charge weight factor parameter. Default: 0.0
339
- --wf_intensity: Intensity weight factor parameter. Default: 0.0
340
- --LET_threshold: Low-entropy transformation threshold parameter. Spectra with Shannon entropy less than LET_threshold are transformed according to intensitiesNew=intensitiesOriginal^{(1+S)/(1+LET_threshold)}. Default: 0.0
341
- --entropy_dimension: Entropy dimension parameter. Must have positive value other than 1. When the entropy dimension is 1, then Renyi and Tsallis entropy are equivalent to Shannon entropy. Therefore, this parameter only applies to the renyi and tsallis similarity measures. This parameter will be ignored if similarity measure cosine or shannon is chosen. Default: 1.1
342
- --y_axis_transformation: transformation to apply to y-axis (i.e. intensity axis) of plots. Options: \'normalized\', \'none\', \'log10\', and \'sqrt\'. Default: normalized.')
343
- --output_path: path to output PDF file containing the plots of the spectra before and after preprocessing transformations. If no argument is passed, then the plots will be saved to the PDF ./spectrum1_{spectrum_ID1}_spectrum2_{spectrum_ID2}_plot.pdf in the current working directory.
344
- '''
345
-
346
295
  if query_data is None:
347
296
  print('\nError: No argument passed to the mandatory query_data. Please pass the path to the CSV file of the query data.')
348
297
  sys.exit()
@@ -350,12 +299,12 @@ def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_I
350
299
  extension = query_data.rsplit('.',1)
351
300
  extension = extension[(len(extension)-1)]
352
301
  if extension == 'mgf' or extension == 'MGF' or extension == 'mzML' or extension == 'mzml' or extension == 'MZML' or extension == 'cdf' or extension == 'CDF':
353
- output_path_tmp = query_data[:-3] + 'csv'
302
+ output_path_tmp = query_data[:-3] + 'txt'
354
303
  build_library_from_raw_data(input_path=query_data, output_path=output_path_tmp, is_reference=False)
355
- df_query = pd.read_csv(output_path_tmp)
356
- if extension == 'csv' or extension == 'CSV':
357
- df_query = pd.read_csv(query_data)
358
- unique_query_ids = df_query.iloc[:,0].unique()
304
+ df_query = pd.read_csv(output_path_tmp, sep='\t')
305
+ if extension == 'txt' or extension == 'TXT':
306
+ df_query = pd.read_csv(query_data, sep='\t')
307
+ unique_query_ids = df_query['id'].unique()
359
308
 
360
309
  if reference_data is None:
361
310
  print('\nError: No argument passed to the mandatory reference_data. Please pass the path to the CSV file of the reference data.')
@@ -364,24 +313,24 @@ def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_I
364
313
  extension = reference_data.rsplit('.',1)
365
314
  extension = extension[(len(extension)-1)]
366
315
  if extension == 'mgf' or extension == 'MGF' or extension == 'mzML' or extension == 'mzml' or extension == 'MZML' or extension == 'cdf' or extension == 'CDF':
367
- output_path_tmp = reference_data[:-3] + 'csv'
316
+ output_path_tmp = reference_data[:-3] + 'txt'
368
317
  build_library_from_raw_data(input_path=reference_data, output_path=output_path_tmp, is_reference=True)
369
- df_reference = pd.read_csv(output_path_tmp)
370
- if extension == 'csv' or extension == 'CSV':
371
- df_reference = pd.read_csv(reference_data)
372
- unique_reference_ids = df_reference.iloc[:,0].unique()
318
+ df_reference = pd.read_csv(output_path_tmp, sep='\t')
319
+ if extension == 'txt' or extension == 'TXT':
320
+ df_reference = pd.read_csv(reference_data, sep='\t')
321
+ unique_reference_ids = df_reference['id'].unique()
373
322
 
374
323
 
375
324
  if spectrum_ID1 is not None:
376
325
  spectrum_ID1 = str(spectrum_ID1)
377
326
  else:
378
- spectrum_ID1 = str(df_query.iloc[0,0])
327
+ spectrum_ID1 = str(df_query['id'].iloc[0])
379
328
  print('No argument passed to spectrum_ID1; using the first spectrum in query_data.')
380
329
 
381
330
  if spectrum_ID2 is not None:
382
331
  spectrum_ID2 = str(spectrum_ID2)
383
332
  else:
384
- spectrum_ID2 = str(df_reference.iloc[0,0])
333
+ spectrum_ID2 = str(df_reference['id'].iloc[0])
385
334
  print('No argument passed to spectrum_ID2; using the first spectrum in reference_data.')
386
335
 
387
336
  if spectrum_preprocessing_order is not None:
@@ -446,12 +395,12 @@ def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_I
446
395
  print(f'Warning: plots will be saved to the PDF ./spectrum1_{spectrum_ID1}_spectrum2_{spectrum_ID2}_plot.pdf in the current working directory.')
447
396
  output_path = f'{Path.cwd()}/spectrum1_{spectrum_ID1}_spectrum2_{spectrum_ID2}.pdf'
448
397
 
449
- min_mz = np.min([np.min(df_query.iloc[:,1]), np.min(df_reference.iloc[:,1])])
450
- max_mz = np.max([np.max(df_query.iloc[:,1]), np.max(df_reference.iloc[:,1])])
398
+ min_mz = np.min([np.min(df_query['mz_ratio'].tolist()), np.min(df_reference['mz_ratio'].tolist())])
399
+ max_mz = np.max([np.max(df_query['mz_ratio'].tolist()), np.max(df_reference['mz_ratio'].tolist())])
451
400
  mzs = np.linspace(min_mz,max_mz,(max_mz-min_mz+1))
452
401
 
453
- unique_query_ids = df_query.iloc[:,0].unique().tolist()
454
- unique_reference_ids = df_reference.iloc[:,0].unique().tolist()
402
+ unique_query_ids = df_query['id'].unique().tolist()
403
+ unique_reference_ids = df_reference['id'].unique().tolist()
455
404
  unique_query_ids = [str(ID) for ID in unique_query_ids]
456
405
  unique_reference_ids = [str(ID) for ID in unique_reference_ids]
457
406
  common_IDs = np.intersect1d([str(ID) for ID in unique_query_ids], [str(ID) for ID in unique_reference_ids])
@@ -459,35 +408,48 @@ def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_I
459
408
  print(f'Warning: the query and reference library have overlapping IDs: {common_IDs}')
460
409
 
461
410
  if spectrum_ID1 in unique_query_ids and spectrum_ID2 in unique_query_ids:
462
- q_idxs_tmp = np.where(df_query.iloc[:,0].astype(str) == spectrum_ID1)[0]
463
- r_idxs_tmp = np.where(df_query.iloc[:,0].astype(str) == spectrum_ID2)[0]
464
- q_spec = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp,1], df_query.iloc[q_idxs_tmp,2]], axis=1).reset_index(drop=True))
465
- r_spec = np.asarray(pd.concat([df_query.iloc[r_idxs_tmp,1], df_query.iloc[r_idxs_tmp,2]], axis=1).reset_index(drop=True))
411
+ q_idxs_tmp = np.where(df_query['id'].astype(str) == spectrum_ID1)[0]
412
+ r_idxs_tmp = np.where(df_query['id'].astype(str) == spectrum_ID2)[0]
413
+ q_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
414
+ r_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[r_idxs_tmp], df_query['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
466
415
  elif spectrum_ID1 in unique_reference_ids and spectrum_ID2 in unique_reference_ids:
467
- q_idxs_tmp = np.where(df_reference.iloc[:,0].astype(str) == spectrum_ID1)[0]
468
- r_idxs_tmp = np.where(df_reference.iloc[:,0].astype(str) == spectrum_ID2)[0]
469
- q_spec = np.asarray(pd.concat([df_reference.iloc[q_idxs_tmp,1], df_reference.iloc[q_idxs_tmp,2]], axis=1).reset_index(drop=True))
470
- r_spec = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp,1], df_reference.iloc[r_idxs_tmp,2]], axis=1).reset_index(drop=True))
416
+ q_idxs_tmp = np.where(df_reference['id'].astype(str) == spectrum_ID1)[0]
417
+ r_idxs_tmp = np.where(df_reference['id'].astype(str) == spectrum_ID2)[0]
418
+ q_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[q_idxs_tmp], df_reference['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
419
+ r_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
471
420
  else:
472
421
  if spectrum_ID1 in unique_reference_ids and spectrum_ID2 in unique_query_ids:
473
422
  spec_tmp = spectrum_ID1
474
423
  spectrum_ID1 = spectrum_ID2
475
424
  spectrum_ID2 = spec_tmp
476
- q_idxs_tmp = np.where(df_query.iloc[:,0].astype(str) == spectrum_ID1)[0]
477
- r_idxs_tmp = np.where(df_reference.iloc[:,0].astype(str) == spectrum_ID2)[0]
478
- q_spec = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp,1], df_query.iloc[q_idxs_tmp,2]], axis=1).reset_index(drop=True))
479
- r_spec = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp,1], df_reference.iloc[r_idxs_tmp,2]], axis=1).reset_index(drop=True))
425
+ q_idxs_tmp = np.where(df_query['id'].astype(str) == spectrum_ID1)[0]
426
+ r_idxs_tmp = np.where(df_reference['id'].astype(str) == spectrum_ID2)[0]
427
+ q_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
428
+ r_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
480
429
 
481
430
  q_spec = convert_spec(q_spec,mzs)
482
431
  r_spec = convert_spec(r_spec,mzs)
483
432
 
484
- int_min_tmp_q = min(q_spec[q_spec[:,1].nonzero(),1][0])
485
- int_min_tmp_r = min(r_spec[r_spec[:,1].nonzero(),1][0])
486
- int_max_tmp_q = max(q_spec[q_spec[:,1].nonzero(),1][0])
487
- int_max_tmp_r = max(r_spec[r_spec[:,1].nonzero(),1][0])
488
- int_min_tmp = int(min([int_min_tmp_q,int_min_tmp_r]))
489
- int_max_tmp = int(max([int_max_tmp_q,int_max_tmp_r]))
490
-
433
+ nz_q = q_spec[:, 1] != 0
434
+ nz_r = r_spec[:, 1] != 0
435
+
436
+ if np.any(nz_q):
437
+ int_min_tmp_q = q_spec[nz_q, 1].min()
438
+ int_max_tmp_q = q_spec[nz_q, 1].max()
439
+ else:
440
+ int_min_tmp_q = 0.0
441
+ int_max_tmp_q = 0.0
442
+
443
+ if np.any(nz_r):
444
+ int_min_tmp_r = r_spec[nz_r, 1].min()
445
+ int_max_tmp_r = r_spec[nz_r, 1].max()
446
+ else:
447
+ int_min_tmp_r = 0.0
448
+ int_max_tmp_r = 0.0
449
+
450
+ int_min_tmp = int(min(int_min_tmp_q, int_min_tmp_r))
451
+ int_max_tmp = int(max(int_max_tmp_q, int_max_tmp_r))
452
+
491
453
  fig, axes = plt.subplots(nrows=2, ncols=1)
492
454
 
493
455
  plt.subplot(2,1,1)
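The replacement lines in this hunk guard against a spectrum with no nonzero intensities, a case in which taking the minimum of an empty selection would likely fail. Below is a minimal sketch of the same idea as a standalone function. The name `nonzero_intensity_range` is hypothetical, and the 0.0 fallback mirrors the diff.

```python
import numpy as np

def nonzero_intensity_range(spec):
    """Return (min, max) of the nonzero intensities in column 1 of spec.

    spec is an (n_peaks, 2) array of m/z and intensity values.
    Falls back to (0.0, 0.0) when every intensity is zero.
    """
    nz = spec[:, 1] != 0  # boolean mask of peaks with nonzero intensity
    if np.any(nz):
        return float(spec[nz, 1].min()), float(spec[nz, 1].max())
    return 0.0, 0.0
```

Applying it to the query and reference spectra and combining the results with `min` and `max` gives the same overall bounds as the new lines.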
@@ -24,15 +24,15 @@ def objective_function_HRMS(X, ctx):
24
24
  acc = get_acc_HRMS(
25
25
  ctx["df_query"],
26
26
  ctx["df_reference"],
27
- ctx["precursor_ion_mz_tolerance"],
28
- ctx["ionization_mode"], ctx["adduct"],
27
+ ctx["precursor_ion_mz_tolerance"], ctx["ionization_mode"], ctx["adduct"],
29
28
  ctx["similarity_measure"], ctx["weights"], ctx["spectrum_preprocessing_order"],
30
29
  ctx["mz_min"], ctx["mz_max"], ctx["int_min"], ctx["int_max"],
31
30
  p["window_size_centroiding"], p["window_size_matching"], p["noise_threshold"],
32
31
  p["wf_mz"], p["wf_int"], p["LET_threshold"],
33
32
  p["entropy_dimension"],
34
33
  ctx["high_quality_reference_library"],
35
- verbose=False
34
+ verbose=False,
35
+ exact_match_required=ctx["exact_match_required"]
36
36
  )
37
37
  print(f"\nparams({ctx['optimize_params']}) = {np.array(X)}\naccuracy: {acc*100}%")
38
38
  return 1.0 - acc
@@ -46,7 +46,8 @@ def objective_function_NRMS(X, ctx):
46
46
  ctx["mz_min"], ctx["mz_max"], ctx["int_min"], ctx["int_max"],
47
47
  p["noise_threshold"], p["wf_mz"], p["wf_int"], p["LET_threshold"], p["entropy_dimension"],
48
48
  ctx["high_quality_reference_library"],
49
- verbose=False
49
+ verbose=False,
50
+ exact_match_required=ctx["exact_match_required"]
50
51
  )
51
52
  print(f"\nparams({ctx['optimize_params']}) = {np.array(X)}\naccuracy: {acc*100}%")
52
53
  return 1.0 - acc
@@ -54,7 +55,7 @@ def objective_function_NRMS(X, ctx):
54
55
 
55
56
 
56
57
 
57
- def tune_params_DE(query_data=None, reference_data=None, chromatography_platform='HRMS', precursor_ion_mz_tolerance=None, ionization_mode=None, adduct=None, similarity_measure='cosine', weights=None, spectrum_preprocessing_order='CNMWL', mz_min=0, mz_max=999999999, int_min=0, int_max=999999999, high_quality_reference_library=False, optimize_params=["window_size_centroiding","window_size_matching","noise_threshold","wf_mz","wf_int","LET_threshold","entropy_dimension"], param_bounds={"window_size_centroiding":(0.0,0.5),"window_size_matching":(0.0,0.5),"noise_threshold":(0.0,0.25),"wf_mz":(0.0,5.0),"wf_int":(0.0,5.0),"LET_threshold":(0.0,5.0),"entropy_dimension":(1.0,3.0)}, default_params={"window_size_centroiding": 0.5, "window_size_matching":0.5, "noise_threshold":0.10, "wf_mz":0.0, "wf_int":1.0, "LET_threshold":0.0, "entropy_dimension":1.1}, maxiters=3, de_workers=1):
58
+ def tune_params_DE(query_data=None, reference_data=None, chromatography_platform='HRMS', precursor_ion_mz_tolerance=None, ionization_mode=None, adduct=None, similarity_measure='cosine', weights=None, spectrum_preprocessing_order='CNMWL', mz_min=0, mz_max=999999999, int_min=0, int_max=999999999, high_quality_reference_library=False, optimize_params=["window_size_centroiding","window_size_matching","noise_threshold","wf_mz","wf_int","LET_threshold","entropy_dimension"], param_bounds={"window_size_centroiding":(0.0,0.5),"window_size_matching":(0.0,0.5),"noise_threshold":(0.0,0.25),"wf_mz":(0.0,5.0),"wf_int":(0.0,5.0),"LET_threshold":(0.0,5.0),"entropy_dimension":(1.0,3.0)}, default_params={"window_size_centroiding": 0.5, "window_size_matching":0.5, "noise_threshold":0.10, "wf_mz":0.0, "wf_int":1.0, "LET_threshold":0.0, "entropy_dimension":1.1}, maxiters=3, de_workers=1, exact_match_required=False):
58
59
 
59
60
  if query_data is None:
60
61
  print('\nError: No argument passed to the mandatory query_data. Please pass the path to the TXT file of the query data.')
@@ -107,6 +108,7 @@ def tune_params_DE(query_data=None, reference_data=None, chromatography_platform
107
108
  high_quality_reference_library=high_quality_reference_library,
108
109
  default_params=default_params,
109
110
  optimize_params=optimize_params,
111
+ exact_match_required=exact_match_required
110
112
  )
111
113
 
112
114
  bounds = [param_bounds[p] for p in optimize_params]
@@ -137,14 +139,7 @@ default_HRMS_grid = {'similarity_measure':['cosine'], 'weight':[{'Cosine':0.25,'
137
139
  default_NRMS_grid = {'similarity_measure':['cosine'], 'weight':[{'Cosine':0.25,'Shannon':0.25,'Renyi':0.25,'Tsallis':0.25}], 'spectrum_preprocessing_order':['FCNMWL'], 'mz_min':[0], 'mz_max':[9999999], 'int_min':[0], 'int_max':[99999999], 'noise_threshold':[0.0], 'wf_mz':[0.0], 'wf_int':[1.0], 'LET_threshold':[0.0], 'entropy_dimension':[1.1], 'high_quality_reference_library':[False]}
138
140
 
139
141
 
140
- def _eval_one_HRMS(df_query, df_reference,
141
- precursor_ion_mz_tolerance_tmp, ionization_mode_tmp, adduct_tmp,
142
- similarity_measure_tmp, weight,
143
- spectrum_preprocessing_order_tmp, mz_min_tmp, mz_max_tmp,
144
- int_min_tmp, int_max_tmp, noise_threshold_tmp,
145
- window_size_centroiding_tmp, window_size_matching_tmp,
146
- wf_mz_tmp, wf_int_tmp, LET_threshold_tmp,
147
- entropy_dimension_tmp, high_quality_reference_library_tmp):
142
+ def _eval_one_HRMS(df_query, df_reference, precursor_ion_mz_tolerance_tmp, ionization_mode_tmp, adduct_tmp, similarity_measure_tmp, weight, spectrum_preprocessing_order_tmp, mz_min_tmp, mz_max_tmp, int_min_tmp, int_max_tmp, noise_threshold_tmp, window_size_centroiding_tmp, window_size_matching_tmp, wf_mz_tmp, wf_int_tmp, LET_threshold_tmp, entropy_dimension_tmp, high_quality_reference_library_tmp, exact_match_required_tmp):
148
143
 
149
144
  acc = get_acc_HRMS(
150
145
  df_query=df_query, df_reference=df_reference,
@@ -161,7 +156,8 @@ def _eval_one_HRMS(df_query, df_reference,
161
156
  LET_threshold=LET_threshold_tmp,
162
157
  entropy_dimension=entropy_dimension_tmp,
163
158
  high_quality_reference_library=high_quality_reference_library_tmp,
164
- verbose=False
159
+ verbose=False,
160
+ exact_match_required=exact_match_required_tmp
165
161
  )
166
162
 
167
163
  return (
@@ -173,12 +169,7 @@ def _eval_one_HRMS(df_query, df_reference,
173
169
  )
174
170
 
175
171
 
176
- def _eval_one_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids,
177
- similarity_measure_tmp, weight,
178
- spectrum_preprocessing_order_tmp, mz_min_tmp, mz_max_tmp,
179
- int_min_tmp, int_max_tmp, noise_threshold_tmp,
180
- wf_mz_tmp, wf_int_tmp, LET_threshold_tmp,
181
- entropy_dimension_tmp, high_quality_reference_library_tmp):
172
+ def _eval_one_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids, similarity_measure_tmp, weight, spectrum_preprocessing_order_tmp, mz_min_tmp, mz_max_tmp, int_min_tmp, int_max_tmp, noise_threshold_tmp, wf_mz_tmp, wf_int_tmp, LET_threshold_tmp, entropy_dimension_tmp, high_quality_reference_library_tmp, exact_match_required_tmp):
182
173
 
183
174
  acc = get_acc_NRMS(
184
175
  df_query=df_query, df_reference=df_reference,
@@ -192,7 +183,8 @@ def _eval_one_NRMS(df_query, df_reference, unique_query_ids, unique_reference_id
192
183
  LET_threshold=LET_threshold_tmp,
193
184
  entropy_dimension=entropy_dimension_tmp,
194
185
  high_quality_reference_library=high_quality_reference_library_tmp,
195
- verbose=False
186
+ verbose=False,
187
+ exact_match_required=exact_match_required_tmp
196
188
  )
197
189
 
198
190
  return (
@@ -203,7 +195,7 @@ def _eval_one_NRMS(df_query, df_reference, unique_query_ids, unique_reference_id
203
195
 
204
196
 
205
197
 
206
- def tune_params_on_HRMS_data_grid(query_data=None, reference_data=None, precursor_ion_mz_tolerance=None, ionization_mode=None, adduct=None, grid=None, output_path=None, return_output=False):
198
+ def tune_params_on_HRMS_data_grid(query_data=None, reference_data=None, precursor_ion_mz_tolerance=None, ionization_mode=None, adduct=None, grid=None, output_path=None, return_output=False, exact_match_required=False):
207
199
  grid = {**default_HRMS_grid, **(grid or {})}
208
200
  for key, value in grid.items():
209
201
  globals()[key] = value
@@ -252,7 +244,9 @@ def tune_params_on_HRMS_data_grid(query_data=None, reference_data=None, precurso
252
244
 
253
245
  param_grid = product(similarity_measure, weight, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, noise_threshold,
254
246
  window_size_centroiding, window_size_matching, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library)
255
- results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_HRMS)(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct, *params) for params in param_grid)
247
+ #results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_HRMS)(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct, (*params for params in param_grid), exact_match_required))
248
+ results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_HRMS)(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct, *params, exact_match_required) for params in param_grid)
249
+
256
250
 
257
251
  df_out = pd.DataFrame(results, columns=[
258
252
  'ACC','SIMILARITY.MEASURE','WEIGHT','SPECTRUM.PROCESSING.ORDER', 'MZ.MIN','MZ.MAX','INT.MIN','INT.MAX','NOISE.THRESHOLD',
@@ -276,7 +270,7 @@ def tune_params_on_HRMS_data_grid(query_data=None, reference_data=None, precurso
276
270
 
277
271
 
278
272
 
279
- def tune_params_on_NRMS_data_grid(query_data=None, reference_data=None, grid=None, output_path=None, return_output=False):
273
+ def tune_params_on_NRMS_data_grid(query_data=None, reference_data=None, grid=None, output_path=None, return_output=False, exact_match_required=False):
280
274
  grid = {**default_NRMS_grid, **(grid or {})}
281
275
  for key, value in grid.items():
282
276
  globals()[key] = value
@@ -319,7 +313,8 @@ def tune_params_on_NRMS_data_grid(query_data=None, reference_data=None, grid=Non
319
313
 
320
314
  param_grid = product(similarity_measure, weight, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max,
321
315
  noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library)
322
- results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_NRMS)(df_query, df_reference, unique_query_ids, unique_reference_ids, *params) for params in param_grid)
316
+ #results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_NRMS)(df_query, df_reference, unique_query_ids, unique_reference_ids, *params) for params in param_grid, exact_match_required)
317
+ results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_NRMS)(df_query, df_reference, unique_query_ids, unique_reference_ids, *params, exact_match_required) for params in param_grid)
323
318
 
324
319
  df_out = pd.DataFrame(results, columns=['ACC','SIMILARITY.MEASURE','WEIGHT','SPECTRUM.PROCESSING.ORDER', 'MZ.MIN','MZ.MAX','INT.MIN','INT.MAX',
325
320
  'NOISE.THRESHOLD','WF.MZ','WF.INT','LET.THRESHOLD','ENTROPY.DIMENSION', 'HIGH.QUALITY.REFERENCE.LIBRARY'])
@@ -340,7 +335,7 @@ def tune_params_on_NRMS_data_grid(query_data=None, reference_data=None, grid=Non
340
335
 
341
336
 
342
337
 
343
- def get_acc_HRMS(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct, similarity_measure, weights, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, window_size_centroiding, window_size_matching, noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library, verbose=True):
338
+ def get_acc_HRMS(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct, similarity_measure, weights, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, window_size_centroiding, window_size_matching, noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library, verbose=True, exact_match_required=False):
344
339
 
345
340
  n_top_matches_to_save = 1
346
341
  unique_reference_ids = df_reference['id'].dropna().astype(str).unique().tolist()
@@ -443,36 +438,40 @@ def get_acc_HRMS(df_query, df_reference, precursor_ion_mz_tolerance, ionization_
443
438
  top_idx = df_scores.values.argmax(axis=1)
444
439
  top_scores = df_scores.values[np.arange(df_scores.shape[0]), top_idx]
445
440
  top_ids = [df_scores.columns[i] for i in top_idx]
446
-
447
441
  df_tmp = pd.DataFrame({'TRUE.ID': df_scores.index.to_list(), 'PREDICTED.ID': top_ids, 'SCORE': top_scores})
448
- if verbose:
449
- print(df_tmp)
450
-
451
- acc = (df_tmp['TRUE.ID'] == df_tmp['PREDICTED.ID']).mean()
442
+ #if verbose:
443
+ # print(df_tmp)
444
+ if exact_match_required:
445
+ acc = (df_tmp['TRUE.ID'] == df_tmp['PREDICTED.ID']).mean()
446
+ else:
447
+ true_lower = df_tmp['TRUE.ID'].str.lower()
448
+ pred_lower = df_tmp['PREDICTED.ID'].str.lower()
449
+ matches = [t in p for t, p in zip(true_lower, pred_lower)]
450
+ acc = sum(matches) / len(matches)
452
451
  return acc
453
452
 
454
453
 
455
- def get_acc_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids, similarity_measure, weights, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library, verbose=True):
454
+ def get_acc_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids, similarity_measure, weights, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library, verbose=True, exact_match_required=False):
456
455
 
457
456
  n_top_matches_to_save = 1
458
457
 
459
- min_mz = int(np.min([np.min(df_query.iloc[:,1]), np.min(df_reference.iloc[:,1])]))
460
- max_mz = int(np.max([np.max(df_query.iloc[:,1]), np.max(df_reference.iloc[:,1])]))
458
+ min_mz = int(np.min([np.min(df_query['mz_ratio']), np.min(df_reference['mz_ratio'])]))
459
+ max_mz = int(np.max([np.max(df_query['mz_ratio']), np.max(df_reference['mz_ratio'])]))
461
460
  mzs = np.linspace(min_mz,max_mz,(max_mz-min_mz+1))
462
461
 
463
462
  all_similarity_scores = []
464
463
  for query_idx in range(0,len(unique_query_ids)):
465
- q_idxs_tmp = np.where(df_query.iloc[:,0] == unique_query_ids[query_idx])[0]
466
- q_spec_tmp = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp,1], df_query.iloc[q_idxs_tmp,2]], axis=1).reset_index(drop=True))
464
+ q_idxs_tmp = np.where(df_query['id'] == unique_query_ids[query_idx])[0]
465
+ q_spec_tmp = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
467
466
  q_spec_tmp = convert_spec(q_spec_tmp,mzs)
468
467
 
469
468
  similarity_scores = []
470
469
  for ref_idx in range(0,len(unique_reference_ids)):
471
470
  q_spec = q_spec_tmp
472
- if verbose is True and ref_idx % 1000 == 0:
473
- print(f'Query spectrum #{query_idx} has had its similarity with {ref_idx} reference library spectra computed')
474
- r_idxs_tmp = np.where(df_reference.iloc[:,0] == unique_reference_ids[ref_idx])[0]
475
- r_spec_tmp = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp,1], df_reference.iloc[r_idxs_tmp,2]], axis=1).reset_index(drop=True))
471
+ #if verbose is True and ref_idx % 1000 == 0:
472
+ # print(f'Query spectrum #{query_idx} has had its similarity with {ref_idx} reference library spectra computed')
473
+ r_idxs_tmp = np.where(df_reference['id'] == unique_reference_ids[ref_idx])[0]
474
+ r_spec_tmp = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
476
475
  r_spec = convert_spec(r_spec_tmp,mzs)
477
476
 
478
477
  for transformation in spectrum_preprocessing_order:
@@ -533,7 +532,15 @@ def get_acc_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids,
533
532
  scores = np.array(scores)
534
533
  out = np.c_[unique_query_ids,preds,scores]
535
534
  df_tmp = pd.DataFrame(out, columns=['TRUE.ID','PREDICTED.ID','SCORE'])
536
- acc = (df_tmp['TRUE.ID']==df_tmp['PREDICTED.ID']).mean()
535
+ #if verbose:
536
+ # print(df_tmp)
537
+ if exact_match_required:
538
+ acc = (df_tmp['TRUE.ID'] == df_tmp['PREDICTED.ID']).mean()
539
+ else:
540
+ true_lower = df_tmp['TRUE.ID'].str.lower()
541
+ pred_lower = df_tmp['PREDICTED.ID'].str.lower()
542
+ matches = [t in p for t, p in zip(true_lower, pred_lower)]
543
+ acc = sum(matches) / len(matches)
537
544
  return acc
538
545
 
539
546
 
@@ -571,8 +578,6 @@ def run_spec_lib_matching_on_HRMS_data(query_data=None, reference_data=None, pre
571
578
  if 'adduct' in df_reference.columns.tolist() and adduct != 'N/A' and adduct != None:
572
579
  df_reference = df_reference.loc[df_reference['adduct']==adduct]
573
580
 
574
- print(df_reference.loc[df_reference['id']=='Hectochlorin M+H'])
575
-
576
581
  if spectrum_preprocessing_order is not None:
577
582
  spectrum_preprocessing_order = list(spectrum_preprocessing_order)
578
583
  else:
@@ -806,7 +811,7 @@ def run_spec_lib_matching_on_NRMS_data(query_data=None, reference_data=None, lik
806
811
  df_query = pd.read_csv(output_path_tmp, sep='\t')
807
812
  if extension == 'txt' or extension == 'TXT':
808
813
  df_query = pd.read_csv(query_data, sep='\t')
809
- unique_query_ids = df_query.iloc[:,0].unique()
814
+ unique_query_ids = df_query['id'].unique()
810
815
 
811
816
  if reference_data is None:
812
817
  print('\nError: No argument passed to the mandatory reference_data. Please pass the path to the CSV file of the reference data.')
@@ -814,14 +819,14 @@ def run_spec_lib_matching_on_NRMS_data(query_data=None, reference_data=None, lik
814
819
  else:
815
820
  if isinstance(reference_data,str):
816
821
  df_reference = get_reference_df(reference_data,likely_reference_ids)
817
- unique_reference_ids = df_reference.iloc[:,0].unique()
822
+ unique_reference_ids = df_reference['id'].unique()
818
823
  else:
819
824
  dfs = []
820
825
  unique_reference_ids = []
821
826
  for f in reference_data:
822
827
  tmp = get_reference_df(f,likely_reference_ids)
823
828
  dfs.append(tmp)
824
- unique_reference_ids.extend(tmp.iloc[:,0].unique())
829
+ unique_reference_ids.extend(tmp['id'].unique())
825
830
  df_reference = pd.concat(dfs, axis=0, ignore_index=True)
826
831
 
827
832
 
@@ -897,23 +902,23 @@ def run_spec_lib_matching_on_NRMS_data(query_data=None, reference_data=None, lik
897
902
 
898
903
 
899
904
 
900
- min_mz = int(np.min([np.min(df_query.iloc[:,1]), np.min(df_reference.iloc[:,1])]))
901
- max_mz = int(np.max([np.max(df_query.iloc[:,1]), np.max(df_reference.iloc[:,1])]))
905
+ min_mz = int(np.min([np.min(df_query['mz_ratio']), np.min(df_reference['mz_ratio'])]))
906
+ max_mz = int(np.max([np.max(df_query['mz_ratio']), np.max(df_reference['mz_ratio'])]))
902
907
  mzs = np.linspace(min_mz,max_mz,(max_mz-min_mz+1))
903
908
 
904
909
  all_similarity_scores = []
905
910
  for query_idx in range(0,len(unique_query_ids)):
906
- q_idxs_tmp = np.where(df_query.iloc[:,0] == unique_query_ids[query_idx])[0]
907
- q_spec_tmp = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp,1], df_query.iloc[q_idxs_tmp,2]], axis=1).reset_index(drop=True))
911
+ q_idxs_tmp = np.where(df_query['id'] == unique_query_ids[query_idx])[0]
912
+ q_spec_tmp = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
908
913
  q_spec_tmp = convert_spec(q_spec_tmp,mzs)
909
914
 
910
915
  similarity_scores = []
911
916
  for ref_idx in range(0,len(unique_reference_ids)):
912
- if verbose is True and ref_idx % 1000 == 0:
913
- print(f'Query spectrum #{query_idx} has had its similarity with {ref_idx} reference library spectra computed')
917
+ #if verbose is True and ref_idx % 1000 == 0:
918
+ # print(f'Query spectrum #{query_idx} has had its similarity with {ref_idx} reference library spectra computed')
914
919
  q_spec = q_spec_tmp
915
- r_idxs_tmp = np.where(df_reference.iloc[:,0] == unique_reference_ids[ref_idx])[0]
916
- r_spec_tmp = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp,1], df_reference.iloc[r_idxs_tmp,2]], axis=1).reset_index(drop=True))
920
+ r_idxs_tmp = np.where(df_reference['id'] == unique_reference_ids[ref_idx])[0]
921
+ r_spec_tmp = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
917
922
  r_spec = convert_spec(r_spec_tmp,mzs)
918
923
 
919
924
  for transformation in spectrum_preprocessing_order:
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: pycompound
3
- Version: 0.1.7
3
+ Version: 0.1.9
4
4
  Summary: Python package to perform compound identification in mass spectrometry via spectral library matching.
5
5
  Author-email: Hunter Dlugas <fy7392@wayne.edu>
6
6
  License-Expression: MIT
@@ -24,4 +24,5 @@ Requires-Dist: joblib==1.5.2
24
24
  Dynamic: license-file
25
25
 
26
26
  # PyCompound
27
- A Python-based tool for spectral library matching, PyCompound is available as a Python package with a command-line interface (CLI) available and as a GUI application build with Python/Shiny. It performs spectral library matching to identify chemical compounds, offering a range of spectrum preprocessing transformations and similarity measures, including Cosine, three entropy-based similarity measures, and a plethora of binary similarity measures. PyCompound also includes functionality to tune parameters commonly used in a compound identification workflow given a query library of spectra with known ID. PyCompound supports both high-resolution mass spectrometry (HRMS) data (e.g., LC-MS/MS) and nominal-resolution mass spectrometry (NRMS) data (e.g., GC-MS). For the full documentation, see the GitHub repository https://github.com/hdlugas/pycompound.
27
+
28
+ PyCompound is a Python-based tool for spectral library matching, available as a Python package (pycompound) with a command-line interface (CLI) and as a GUI application built with Python/Shiny. It performs spectral library matching to identify chemical compounds, offering a range of spectrum preprocessing transformations and similarity measures, including Cosine, three entropy-based similarity measures, and a plethora of binary similarity measures. PyCompound also includes functionality to tune parameters commonly used in a compound identification workflow given a query library of spectra with known IDs. PyCompound supports both high-resolution mass spectrometry (HRMS) data (e.g., LC-MS/MS) and nominal-resolution mass spectrometry (NRMS) data (e.g., GC-MS). For the full documentation, see the GitHub repository https://github.com/hdlugas/pycompound.
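
The `exact_match_required` flag added to the tuning functions in this release changes how accuracy is scored: when it is false, a prediction appears to count as correct if the lower-cased true ID is a substring of the lower-cased predicted ID. The following is a minimal standalone sketch of that scoring rule, not the package's own code; the helper name `accuracy` and the example IDs are made up for illustration.

```python
import pandas as pd


def accuracy(df: pd.DataFrame, exact_match_required: bool = False) -> float:
    """Fraction of rows whose predicted ID matches the true ID.

    Hypothetical helper mirroring the scoring rule shown in the diff.
    """
    if exact_match_required:
        # Strict mode: IDs must be identical, including case.
        return float((df["TRUE.ID"] == df["PREDICTED.ID"]).mean())
    # Relaxed mode: case-insensitive substring test, so a true ID
    # "Hectochlorin" matches a predicted ID "Hectochlorin M+H".
    true_lower = df["TRUE.ID"].str.lower()
    pred_lower = df["PREDICTED.ID"].str.lower()
    matches = [t in p for t, p in zip(true_lower, pred_lower)]
    return sum(matches) / len(matches)


# Illustrative data only; these rows are not taken from any real library.
df = pd.DataFrame({
    "TRUE.ID": ["Hectochlorin", "Caffeine", "Aspirin"],
    "PREDICTED.ID": ["Hectochlorin M+H", "caffeine", "Ibuprofen"],
})

strict = accuracy(df, exact_match_required=True)
relaxed = accuracy(df, exact_match_required=False)
print(strict, relaxed)
```

Note that the relaxed rule can also accept unintended matches when one compound name is contained in another, so strict matching may be the safer choice when IDs are already normalized.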