pycompound 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pycompound/plot_spectra.py +72 -110
- pycompound/spec_lib_matching.py +59 -54
- {pycompound-0.1.7.dist-info → pycompound-0.1.9.dist-info}/METADATA +3 -2
- {pycompound-0.1.7.dist-info → pycompound-0.1.9.dist-info}/RECORD +7 -8
- {pycompound-0.1.7.dist-info → pycompound-0.1.9.dist-info}/top_level.txt +0 -1
- app.py +0 -3871
- {pycompound-0.1.7.dist-info → pycompound-0.1.9.dist-info}/WHEEL +0 -0
- {pycompound-0.1.7.dist-info → pycompound-0.1.9.dist-info}/licenses/LICENSE +0 -0
pycompound/plot_spectra.py
CHANGED
@@ -8,32 +8,6 @@ import matplotlib.pyplot as plt


def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_ID1=None, spectrum_ID2=None, similarity_measure='cosine', weights={'Cosine':0.25,'Shannon':0.25,'Renyi':0.25,'Tsallis':0.25}, spectrum_preprocessing_order='FCNMWL', high_quality_reference_library=False, mz_min=0, mz_max=9999999, int_min=0, int_max=9999999, window_size_centroiding=0.5, window_size_matching=0.5, noise_threshold=0.0, wf_mz=0.0, wf_intensity=1.0, LET_threshold=0.0, entropy_dimension=1.1, y_axis_transformation='normalized', output_path=None, return_plot=False):
- '''
- plots two spectra against each other before and after preprocessing transformations for high-resolution mass spectrometry data
-
- --query_data: mgf, mzML, or csv file of query mass spectrum/spectra to be identified. If csv file, each row should correspond to a mass spectrum, the left-most column should contain an identifier, and each of the other columns should correspond to a single mass/charge ratio. Mandatory argument.
- --reference_data: mgf, mzML, or csv file of the reference mass spectra. If csv file, each row should correspond to a mass spectrum, the left-most column should contain in identifier (i.e. the CAS registry number or the compound name), and the remaining column should correspond to a single mass/charge ratio. Mandatory argument.
- --spectrum_ID1: ID of one spectrum to be plotted. Default is first spectrum in the query library. Optional argument.
- --spectrum_ID2: ID of another spectrum to be plotted. Default is first spectrum in the reference library. Optional argument.
- --similarity_measure: cosine, shannon, renyi, tsallis, mixture, jaccard, dice, 3w_jaccard, sokal_sneath, binary_cosine, mountford, mcconnaughey, driver_kroeber, simpson, braun_banquet, fager_mcgowan, kulczynski, intersection, hamming, hellinger. Default: cosine.
- --weights: dict of weights to give to each non-binary similarity measure (i.e. cosine, shannon, renyi, and tsallis) when the mixture similarity measure is specified. Default: 0.25 for each of the four non-binary similarity measures.
- --spectrum_preprocessing_order: The spectrum preprocessing transformations and the order in which they are to be applied. Note that these transformations are applied prior to computing similarity scores. Format must be a string with 2-6 characters chosen from C, F, M, N, L, W representing centroiding, filtering based on mass/charge and intensity values, matching, noise removal, low-entropy trannsformation, and weight-factor-transformation, respectively. For example, if \'WCM\' is passed, then each spectrum will undergo a weight factor transformation, then centroiding, and then matching. Note that if an argument is passed, then \'M\' must be contained in the argument, since matching is a required preprocessing step in spectral library matching of HRMS data. Furthermore, \'C\' must be performed before matching since centroiding can change the number of ion fragments in a given spectrum. Default: FCNMWL')
- --high_quality_reference_library: True/False flag indicating whether the reference library is considered to be of high quality. If True, then the spectrum preprocessing transformations of filtering and noise removal are performed only on the query spectrum/spectra. If False, all spectrum preprocessing transformations specified will be applied to both the query and reference spectra. Default: False')
- --mz_min: Remove all peaks with mass/charge value less than mz_min in each spectrum. Default: 0
- --mz_max: Remove all peaks with mass/charge value greater than mz_max in each spectrum. Default: 9999999
- --int_min: Remove all peaks with intensity value less than int_min in each spectrum. Default: 0
- --int_max: Remove all peaks with intensity value greater than int_max in each spectrum. Default: 9999999
- --window_size_centroiding: Window size parameter used in centroiding a given spectrum. Default: 0.5
- --window_size_matching: Window size parameter used in matching a query spectrum and a reference library spectrum. Default: 0.5
- --noise_threshold: Ion fragments (i.e. points in a given mass spectrum) with intensity less than max(intensities)*noise_threshold are removed. Default: 0.0
- --wf_mz: Mass/charge weight factor parameter. Default: 0.0
- --wf_intensity: Intensity weight factor parameter. Default: 0.0
- --LET_threshold: Low-entropy transformation threshold parameter. Spectra with Shannon entropy less than LET_threshold are transformed according to intensitiesNew=intensitiesOriginal^{(1+S)/(1+LET_threshold)}. Default: 0.0
- --entropy_dimension: Entropy dimension parameter. Must have positive value other than 1. When the entropy dimension is 1, then Renyi and Tsallis entropy are equivalent to Shannon entropy. Therefore, this parameter only applies to the renyi and tsallis similarity measures. This parameter will be ignored if similarity measure cosine or shannon is chosen. Default: 1.1
- --y_axis_transformation: transformation to apply to y-axis (i.e. intensity axis) of plots. Options: \'normalized\', \'none\', \'log10\', and \'sqrt\'. Default: normalized.')
- --output_path: path to output PDF file containing the plots of the spectra before and after preprocessing transformations. If no argument is passed, then the plots will be saved to the PDF ./spectrum1_{spectrum_ID1}_spectrum2_{spectrum_ID2}_plot.pdf in the current working directory.
- '''
-
if query_data is None:
print('\nError: No argument passed to the mandatory query_data. Please pass the path to the CSV file of the query data.')
sys.exit()

@@ -41,12 +15,12 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
extension = query_data.rsplit('.',1)
extension = extension[(len(extension)-1)]
if extension == 'mgf' or extension == 'MGF' or extension == 'mzML' or extension == 'mzml' or extension == 'MZML' or extension == 'cdf' or extension == 'CDF':
- output_path_tmp = query_data[:-3] + '
+ output_path_tmp = query_data[:-3] + 'txt'
build_library_from_raw_data(input_path=query_data, output_path=output_path_tmp, is_reference=True)
- df_query = pd.read_csv(output_path_tmp)
- if extension == '
- df_query = pd.read_csv(query_data)
- unique_query_ids = df_query
+ df_query = pd.read_csv(output_path_tmp, sep='\t')
+ if extension == 'txt' or extension == 'TXT':
+ df_query = pd.read_csv(query_data, sep='\t')
+ unique_query_ids = df_query['id'].unique().tolist()
unique_query_ids = [str(tmp) for tmp in unique_query_ids]

if reference_data is None:

@@ -56,25 +30,25 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
extension = reference_data.rsplit('.',1)
extension = extension[(len(extension)-1)]
if extension == 'mgf' or extension == 'MGF' or extension == 'mzML' or extension == 'mzml' or extension == 'MZML' or extension == 'cdf' or extension == 'CDF':
- output_path_tmp = reference_data[:-3] + '
+ output_path_tmp = reference_data[:-3] + 'txt'
build_library_from_raw_data(input_path=reference_data, output_path=output_path_tmp, is_reference=True)
- df_reference = pd.read_csv(output_path_tmp)
- if extension == '
- df_reference = pd.read_csv(reference_data)
- unique_reference_ids = df_reference
+ df_reference = pd.read_csv(output_path_tmp, sep='\t')
+ if extension == 'txt' or extension == 'TXT':
+ df_reference = pd.read_csv(reference_data, sep='\t')
+ unique_reference_ids = df_reference['id'].unique().tolist()
unique_reference_ids = [str(tmp) for tmp in unique_reference_ids]


if spectrum_ID1 is not None:
spectrum_ID1 = str(spectrum_ID1)
else:
- spectrum_ID1 = str(df_query.iloc[0
+ spectrum_ID1 = str(df_query['id'].iloc[0])
print('No argument passed to spectrum_ID1; using the first spectrum in query_data.')

if spectrum_ID2 is not None:
spectrum_ID2 = str(spectrum_ID2)
else:
- spectrum_ID2 = str(df_reference.iloc[0
+ spectrum_ID2 = str(df_reference['id'].iloc[0])
print('No argument passed to spectrum_ID2; using the first spectrum in reference_data.')

if spectrum_preprocessing_order is not None:

@@ -157,17 +131,17 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
if spectrum_ID1 in unique_query_ids and spectrum_ID2 in unique_query_ids:
query_idx = unique_query_ids.index(spectrum_ID1)
reference_idx = unique_query_ids.index(spectrum_ID2)
- q_idxs_tmp = np.where(df_query
- r_idxs_tmp = np.where(df_query
- q_spec = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp
- r_spec = np.asarray(pd.concat([df_query.iloc[r_idxs_tmp
+ q_idxs_tmp = np.where(df_query['id'].astype(str) == unique_query_ids[query_idx])[0]
+ r_idxs_tmp = np.where(df_query['id'].astype(str) == unique_query_ids[reference_idx])[0]
+ q_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
+ r_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[r_idxs_tmp], df_query['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
elif spectrum_ID1 in unique_reference_ids and spectrum_ID2 in unique_reference_ids:
query_idx = unique_reference_ids.index(spectrum_ID1)
reference_idx = unique_reference_ids.index(spectrum_ID2)
- q_idxs_tmp = np.where(df_reference
- r_idxs_tmp = np.where(df_reference
- q_spec = np.asarray(pd.concat([df_reference.iloc[q_idxs_tmp
- r_spec = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp
+ q_idxs_tmp = np.where(df_reference['id'].astype(str) == unique_reference_ids[query_idx])[0]
+ r_idxs_tmp = np.where(df_reference['id'].astype(str) == unique_reference_ids[reference_idx])[0]
+ q_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[q_idxs_tmp], df_reference['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
+ r_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
else:
if spectrum_ID1 in unique_reference_ids and spectrum_ID2 in unique_query_ids:
spec_tmp = spectrum_ID1

@@ -175,10 +149,10 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
spectrum_ID2 = spec_tmp
query_idx = unique_query_ids.index(spectrum_ID1)
reference_idx = unique_reference_ids.index(spectrum_ID2)
- q_idxs_tmp = np.where(df_query
- r_idxs_tmp = np.where(df_reference
- q_spec = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp
- r_spec = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp
+ q_idxs_tmp = np.where(df_query['id'].astype(str) == unique_query_ids[query_idx])[0]
+ r_idxs_tmp = np.where(df_reference['id'].astype(str) == unique_reference_ids[reference_idx])[0]
+ q_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
+ r_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))


q_spec_pre_trans = q_spec.copy()

@@ -293,9 +267,6 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I
plt.yticks([])


- print('\n\n\n')
- print(high_quality_reference_library)
- print('\n\n\n')
plt.subplots_adjust(top=0.8, hspace=0.92, bottom=0.3)
plt.figlegend(loc = 'upper center')
fig.text(0.05, 0.18, f'Similarity Measure: {similarity_measure.capitalize()}', fontsize=7)

@@ -321,28 +292,6 @@ def generate_plots_on_HRMS_data(query_data=None, reference_data=None, spectrum_I


def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_ID1=None, spectrum_ID2=None, similarity_measure='cosine', weights={'Cosine':0.25,'Shannon':0.25,'Renyi':0.25,'Tsallis':0.25}, spectrum_preprocessing_order='FNLW', high_quality_reference_library=False, mz_min=0, mz_max=9999999, int_min=0, int_max=9999999, noise_threshold=0.0, wf_mz=0.0, wf_intensity=1.0, LET_threshold=0.0, entropy_dimension=1.1, y_axis_transformation='normalized', output_path=None, return_plot=False):
- '''
- plots two spectra against each other before and after preprocessing transformations for high-resolution mass spectrometry data
-
- --query_data: cdf or csv file of query mass spectrum/spectra to be identified. If csv file, each row should correspond to a mass spectrum, the left-most column should contain an identifier, and each of the other columns should correspond to a single mass/charge ratio. Mandatory argument.
- --reference_data: cdf of csv file of the reference mass spectra. If csv file, each row should correspond to a mass spectrum, the left-most column should contain in identifier (i.e. the CAS registry number or the compound name), and the remaining column should correspond to a single mass/charge ratio. Mandatory argument.
- --similarity_measure: cosine, shannon, renyi, tsallis, mixture, jaccard, dice, 3w_jaccard, sokal_sneath, binary_cosine, mountford, mcconnaughey, driver_kroeber, simpson, braun_banquet, fager_mcgowan, kulczynski, intersection, hamming, hellinger. Default: cosine.
- --weights: dict of weights to give to each non-binary similarity measure (i.e. cosine, shannon, renyi, and tsallis) when the mixture similarity measure is specified. Default: 0.25 for each of the four non-binary similarity measures.
- --spectrum_preprocessing_order: The spectrum preprocessing transformations and the order in which they are to be applied. Note that these transformations are applied prior to computing similarity scores. Format must be a string with 2-4 characters chosen from F, N, L, W representing filtering based on mass/charge and intensity values, noise removal, low-entropy trannsformation, and weight-factor-transformation, respectively. For example, if \'WN\' is passed, then each spectrum will undergo a weight factor transformation and then noise removal. Default: FNLW')
- --high_quality_reference_library: True/False flag indicating whether the reference library is considered to be of high quality. If True, then the spectrum preprocessing transformations of filtering and noise removal are performed only on the query spectrum/spectra. If False, all spectrum preprocessing transformations specified will be applied to both the query and reference spectra. Default: False')
- --mz_min: Remove all peaks with mass/charge value less than mz_min in each spectrum. Default: 0
- --mz_max: Remove all peaks with mass/charge value greater than mz_max in each spectrum. Default: 9999999
- --int_min: Remove all peaks with intensity value less than int_min in each spectrum. Default: 0
- --int_max: Remove all peaks with intensity value greater than int_max in each spectrum. Default: 9999999
- --noise_threshold: Ion fragments (i.e. points in a given mass spectrum) with intensity less than max(intensities)*noise_threshold are removed. Default: 0.0
- --wf_mz: Mass/charge weight factor parameter. Default: 0.0
- --wf_intensity: Intensity weight factor parameter. Default: 0.0
- --LET_threshold: Low-entropy transformation threshold parameter. Spectra with Shannon entropy less than LET_threshold are transformed according to intensitiesNew=intensitiesOriginal^{(1+S)/(1+LET_threshold)}. Default: 0.0
- --entropy_dimension: Entropy dimension parameter. Must have positive value other than 1. When the entropy dimension is 1, then Renyi and Tsallis entropy are equivalent to Shannon entropy. Therefore, this parameter only applies to the renyi and tsallis similarity measures. This parameter will be ignored if similarity measure cosine or shannon is chosen. Default: 1.1
- --y_axis_transformation: transformation to apply to y-axis (i.e. intensity axis) of plots. Options: \'normalized\', \'none\', \'log10\', and \'sqrt\'. Default: normalized.')
- --output_path: path to output PDF file containing the plots of the spectra before and after preprocessing transformations. If no argument is passed, then the plots will be saved to the PDF ./spectrum1_{spectrum_ID1}_spectrum2_{spectrum_ID2}_plot.pdf in the current working directory.
- '''
-
if query_data is None:
print('\nError: No argument passed to the mandatory query_data. Please pass the path to the CSV file of the query data.')
sys.exit()

@@ -350,12 +299,12 @@ def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_I
extension = query_data.rsplit('.',1)
extension = extension[(len(extension)-1)]
if extension == 'mgf' or extension == 'MGF' or extension == 'mzML' or extension == 'mzml' or extension == 'MZML' or extension == 'cdf' or extension == 'CDF':
- output_path_tmp = query_data[:-3] + '
+ output_path_tmp = query_data[:-3] + 'txt'
build_library_from_raw_data(input_path=query_data, output_path=output_path_tmp, is_reference=False)
- df_query = pd.read_csv(output_path_tmp)
- if extension == '
- df_query = pd.read_csv(query_data)
- unique_query_ids = df_query
+ df_query = pd.read_csv(output_path_tmp, sep='\t')
+ if extension == 'txt' or extension == 'TXT':
+ df_query = pd.read_csv(query_data, sep='\t')
+ unique_query_ids = df_query['id'].unique()

if reference_data is None:
print('\nError: No argument passed to the mandatory reference_data. Please pass the path to the CSV file of the reference data.')

@@ -364,24 +313,24 @@ def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_I
extension = reference_data.rsplit('.',1)
extension = extension[(len(extension)-1)]
if extension == 'mgf' or extension == 'MGF' or extension == 'mzML' or extension == 'mzml' or extension == 'MZML' or extension == 'cdf' or extension == 'CDF':
- output_path_tmp = reference_data[:-3] + '
+ output_path_tmp = reference_data[:-3] + 'txt'
build_library_from_raw_data(input_path=reference_data, output_path=output_path_tmp, is_reference=True)
- df_reference = pd.read_csv(output_path_tmp)
- if extension == '
- df_reference = pd.read_csv(reference_data)
- unique_reference_ids = df_reference
+ df_reference = pd.read_csv(output_path_tmp, sep='\t')
+ if extension == 'txt' or extension == 'TXT':
+ df_reference = pd.read_csv(reference_data, sep='\t')
+ unique_reference_ids = df_reference['id'].unique()


if spectrum_ID1 is not None:
spectrum_ID1 = str(spectrum_ID1)
else:
- spectrum_ID1 = str(df_query.iloc[0
+ spectrum_ID1 = str(df_query['id'].iloc[0])
print('No argument passed to spectrum_ID1; using the first spectrum in query_data.')

if spectrum_ID2 is not None:
spectrum_ID2 = str(spectrum_ID2)
else:
- spectrum_ID2 = str(df_reference.iloc[0
+ spectrum_ID2 = str(df_reference['id'].iloc[0])
print('No argument passed to spectrum_ID2; using the first spectrum in reference_data.')

if spectrum_preprocessing_order is not None:

@@ -446,12 +395,12 @@ def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_I
print(f'Warning: plots will be saved to the PDF ./spectrum1_{spectrum_ID1}_spectrum2_{spectrum_ID2}_plot.pdf in the current working directory.')
output_path = f'{Path.cwd()}/spectrum1_{spectrum_ID1}_spectrum2_{spectrum_ID2}.pdf'

- min_mz = np.min([np.min(df_query
- max_mz = np.max([np.max(df_query
+ min_mz = np.min([np.min(df_query['mz_ratio'].tolist()), np.min(df_reference['mz_ratio'].tolist())])
+ max_mz = np.max([np.max(df_query['mz_ratio'].tolist()), np.max(df_reference['mz_ratio'].tolist())])
mzs = np.linspace(min_mz,max_mz,(max_mz-min_mz+1))

- unique_query_ids = df_query
- unique_reference_ids = df_reference
+ unique_query_ids = df_query['id'].unique().tolist()
+ unique_reference_ids = df_reference['id'].unique().tolist()
unique_query_ids = [str(ID) for ID in unique_query_ids]
unique_reference_ids = [str(ID) for ID in unique_reference_ids]
common_IDs = np.intersect1d([str(ID) for ID in unique_query_ids], [str(ID) for ID in unique_reference_ids])

@@ -459,35 +408,48 @@ def generate_plots_on_NRMS_data(query_data=None, reference_data=None, spectrum_I
print(f'Warning: the query and reference library have overlapping IDs: {common_IDs}')

if spectrum_ID1 in unique_query_ids and spectrum_ID2 in unique_query_ids:
- q_idxs_tmp = np.where(df_query
- r_idxs_tmp = np.where(df_query
- q_spec = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp
- r_spec = np.asarray(pd.concat([df_query.iloc[r_idxs_tmp
+ q_idxs_tmp = np.where(df_query['id'].astype(str) == spectrum_ID1)[0]
+ r_idxs_tmp = np.where(df_query['id'].astype(str) == spectrum_ID2)[0]
+ q_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
+ r_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[r_idxs_tmp], df_query['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
elif spectrum_ID1 in unique_reference_ids and spectrum_ID2 in unique_reference_ids:
- q_idxs_tmp = np.where(df_reference
- r_idxs_tmp = np.where(df_reference
- q_spec = np.asarray(pd.concat([df_reference.iloc[q_idxs_tmp
- r_spec = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp
+ q_idxs_tmp = np.where(df_reference['id'].astype(str) == spectrum_ID1)[0]
+ r_idxs_tmp = np.where(df_reference['id'].astype(str) == spectrum_ID2)[0]
+ q_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[q_idxs_tmp], df_reference['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
+ r_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
else:
if spectrum_ID1 in unique_reference_ids and spectrum_ID2 in unique_query_ids:
spec_tmp = spectrum_ID1
spectrum_ID1 = spectrum_ID2
spectrum_ID2 = spec_tmp
- q_idxs_tmp = np.where(df_query
- r_idxs_tmp = np.where(df_reference
- q_spec = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp
- r_spec = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp
+ q_idxs_tmp = np.where(df_query['id'].astype(str) == spectrum_ID1)[0]
+ r_idxs_tmp = np.where(df_reference['id'].astype(str) == spectrum_ID2)[0]
+ q_spec = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
+ r_spec = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))

q_spec = convert_spec(q_spec,mzs)
r_spec = convert_spec(r_spec,mzs)

-
-
-
-
-
-
-
+ nz_q = q_spec[:, 1] != 0
+ nz_r = r_spec[:, 1] != 0
+
+ if np.any(nz_q):
+ int_min_tmp_q = q_spec[nz_q, 1].min()
+ int_max_tmp_q = q_spec[nz_q, 1].max()
+ else:
+ int_min_tmp_q = 0.0
+ int_max_tmp_q = 0.0
+
+ if np.any(nz_r):
+ int_min_tmp_r = r_spec[nz_r, 1].min()
+ int_max_tmp_r = r_spec[nz_r, 1].max()
+ else:
+ int_min_tmp_r = 0.0
+ int_max_tmp_r = 0.0
+
+ int_min_tmp = int(min(int_min_tmp_q, int_min_tmp_r))
+ int_max_tmp = int(max(int_max_tmp_q, int_max_tmp_r))
+
fig, axes = plt.subplots(nrows=2, ncols=1)

plt.subplot(2,1,1)
pycompound/spec_lib_matching.py
CHANGED
@@ -24,15 +24,15 @@ def objective_function_HRMS(X, ctx):
acc = get_acc_HRMS(
ctx["df_query"],
ctx["df_reference"],
- ctx["precursor_ion_mz_tolerance"],
- ctx["ionization_mode"], ctx["adduct"],
+ ctx["precursor_ion_mz_tolerance"], ctx["ionization_mode"], ctx["adduct"],
ctx["similarity_measure"], ctx["weights"], ctx["spectrum_preprocessing_order"],
ctx["mz_min"], ctx["mz_max"], ctx["int_min"], ctx["int_max"],
p["window_size_centroiding"], p["window_size_matching"], p["noise_threshold"],
p["wf_mz"], p["wf_int"], p["LET_threshold"],
p["entropy_dimension"],
ctx["high_quality_reference_library"],
- verbose=False
+ verbose=False,
+ exact_match_required=ctx["exact_match_required"]
)
print(f"\nparams({ctx['optimize_params']}) = {np.array(X)}\naccuracy: {acc*100}%")
return 1.0 - acc

@@ -46,7 +46,8 @@ def objective_function_NRMS(X, ctx):
ctx["mz_min"], ctx["mz_max"], ctx["int_min"], ctx["int_max"],
p["noise_threshold"], p["wf_mz"], p["wf_int"], p["LET_threshold"], p["entropy_dimension"],
ctx["high_quality_reference_library"],
- verbose=False
+ verbose=False,
+ exact_match_required=ctx["exact_match_required"]
)
print(f"\nparams({ctx['optimize_params']}) = {np.array(X)}\naccuracy: {acc*100}%")
return 1.0 - acc

@@ -54,7 +55,7 @@ def objective_function_NRMS(X, ctx):



- def tune_params_DE(query_data=None, reference_data=None, chromatography_platform='HRMS', precursor_ion_mz_tolerance=None, ionization_mode=None, adduct=None, similarity_measure='cosine', weights=None, spectrum_preprocessing_order='CNMWL', mz_min=0, mz_max=999999999, int_min=0, int_max=999999999, high_quality_reference_library=False, optimize_params=["window_size_centroiding","window_size_matching","noise_threshold","wf_mz","wf_int","LET_threshold","entropy_dimension"], param_bounds={"window_size_centroiding":(0.0,0.5),"window_size_matching":(0.0,0.5),"noise_threshold":(0.0,0.25),"wf_mz":(0.0,5.0),"wf_int":(0.0,5.0),"LET_threshold":(0.0,5.0),"entropy_dimension":(1.0,3.0)}, default_params={"window_size_centroiding": 0.5, "window_size_matching":0.5, "noise_threshold":0.10, "wf_mz":0.0, "wf_int":1.0, "LET_threshold":0.0, "entropy_dimension":1.1}, maxiters=3, de_workers=1):
+ def tune_params_DE(query_data=None, reference_data=None, chromatography_platform='HRMS', precursor_ion_mz_tolerance=None, ionization_mode=None, adduct=None, similarity_measure='cosine', weights=None, spectrum_preprocessing_order='CNMWL', mz_min=0, mz_max=999999999, int_min=0, int_max=999999999, high_quality_reference_library=False, optimize_params=["window_size_centroiding","window_size_matching","noise_threshold","wf_mz","wf_int","LET_threshold","entropy_dimension"], param_bounds={"window_size_centroiding":(0.0,0.5),"window_size_matching":(0.0,0.5),"noise_threshold":(0.0,0.25),"wf_mz":(0.0,5.0),"wf_int":(0.0,5.0),"LET_threshold":(0.0,5.0),"entropy_dimension":(1.0,3.0)}, default_params={"window_size_centroiding": 0.5, "window_size_matching":0.5, "noise_threshold":0.10, "wf_mz":0.0, "wf_int":1.0, "LET_threshold":0.0, "entropy_dimension":1.1}, maxiters=3, de_workers=1, exact_match_required=False):

if query_data is None:
print('\nError: No argument passed to the mandatory query_data. Please pass the path to the TXT file of the query data.')

@@ -107,6 +108,7 @@ def tune_params_DE(query_data=None, reference_data=None, chromatography_platform
high_quality_reference_library=high_quality_reference_library,
default_params=default_params,
optimize_params=optimize_params,
+ exact_match_required=exact_match_required
)

bounds = [param_bounds[p] for p in optimize_params]

@@ -137,14 +139,7 @@ default_HRMS_grid = {'similarity_measure':['cosine'], 'weight':[{'Cosine':0.25,'
default_NRMS_grid = {'similarity_measure':['cosine'], 'weight':[{'Cosine':0.25,'Shannon':0.25,'Renyi':0.25,'Tsallis':0.25}], 'spectrum_preprocessing_order':['FCNMWL'], 'mz_min':[0], 'mz_max':[9999999], 'int_min':[0], 'int_max':[99999999], 'noise_threshold':[0.0], 'wf_mz':[0.0], 'wf_int':[1.0], 'LET_threshold':[0.0], 'entropy_dimension':[1.1], 'high_quality_reference_library':[False]}


- def _eval_one_HRMS(df_query, df_reference,
- precursor_ion_mz_tolerance_tmp, ionization_mode_tmp, adduct_tmp,
- similarity_measure_tmp, weight,
- spectrum_preprocessing_order_tmp, mz_min_tmp, mz_max_tmp,
- int_min_tmp, int_max_tmp, noise_threshold_tmp,
- window_size_centroiding_tmp, window_size_matching_tmp,
- wf_mz_tmp, wf_int_tmp, LET_threshold_tmp,
- entropy_dimension_tmp, high_quality_reference_library_tmp):
+ def _eval_one_HRMS(df_query, df_reference, precursor_ion_mz_tolerance_tmp, ionization_mode_tmp, adduct_tmp, similarity_measure_tmp, weight, spectrum_preprocessing_order_tmp, mz_min_tmp, mz_max_tmp, int_min_tmp, int_max_tmp, noise_threshold_tmp, window_size_centroiding_tmp, window_size_matching_tmp, wf_mz_tmp, wf_int_tmp, LET_threshold_tmp, entropy_dimension_tmp, high_quality_reference_library_tmp, exact_match_required_tmp):

acc = get_acc_HRMS(
df_query=df_query, df_reference=df_reference,

@@ -161,7 +156,8 @@ def _eval_one_HRMS(df_query, df_reference,
LET_threshold=LET_threshold_tmp,
entropy_dimension=entropy_dimension_tmp,
high_quality_reference_library=high_quality_reference_library_tmp,
- verbose=False
+ verbose=False,
+ exact_match_required=exact_match_required_tmp
)

return (

@@ -173,12 +169,7 @@ def _eval_one_HRMS(df_query, df_reference,
)


- def _eval_one_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids,
- similarity_measure_tmp, weight,
- spectrum_preprocessing_order_tmp, mz_min_tmp, mz_max_tmp,
- int_min_tmp, int_max_tmp, noise_threshold_tmp,
- wf_mz_tmp, wf_int_tmp, LET_threshold_tmp,
- entropy_dimension_tmp, high_quality_reference_library_tmp):
+ def _eval_one_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids, similarity_measure_tmp, weight, spectrum_preprocessing_order_tmp, mz_min_tmp, mz_max_tmp, int_min_tmp, int_max_tmp, noise_threshold_tmp, wf_mz_tmp, wf_int_tmp, LET_threshold_tmp, entropy_dimension_tmp, high_quality_reference_library_tmp, exact_match_required):

acc = get_acc_NRMS(
df_query=df_query, df_reference=df_reference,

@@ -192,7 +183,8 @@ def _eval_one_NRMS(df_query, df_reference, unique_query_ids, unique_reference_id
LET_threshold=LET_threshold_tmp,
entropy_dimension=entropy_dimension_tmp,
high_quality_reference_library=high_quality_reference_library_tmp,
- verbose=False
+ verbose=False,
+ exact_match_required=exact_match_required_tmp
)

return (

@@ -203,7 +195,7 @@ def _eval_one_NRMS(df_query, df_reference, unique_query_ids, unique_reference_id



- def tune_params_on_HRMS_data_grid(query_data=None, reference_data=None, precursor_ion_mz_tolerance=None, ionization_mode=None, adduct=None, grid=None, output_path=None, return_output=False):
+ def tune_params_on_HRMS_data_grid(query_data=None, reference_data=None, precursor_ion_mz_tolerance=None, ionization_mode=None, adduct=None, grid=None, output_path=None, return_output=False, exact_match_required=False):
grid = {**default_HRMS_grid, **(grid or {})}
for key, value in grid.items():
globals()[key] = value

@@ -252,7 +244,9 @@ def tune_params_on_HRMS_data_grid(query_data=None, reference_data=None, precurso

param_grid = product(similarity_measure, weight, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, noise_threshold,
window_size_centroiding, window_size_matching, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library)
- results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_HRMS)(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct,
+ #results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_HRMS)(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct, (*params for params in param_grid), exact_match_required))
+ results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_HRMS)(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct, *params, exact_match_required) for params in param_grid)
+

df_out = pd.DataFrame(results, columns=[
'ACC','SIMILARITY.MEASURE','WEIGHT','SPECTRUM.PROCESSING.ORDER', 'MZ.MIN','MZ.MAX','INT.MIN','INT.MAX','NOISE.THRESHOLD',

@@ -276,7 +270,7 @@ def tune_params_on_HRMS_data_grid(query_data=None, reference_data=None, precurso



- def tune_params_on_NRMS_data_grid(query_data=None, reference_data=None, grid=None, output_path=None, return_output=False):
+ def tune_params_on_NRMS_data_grid(query_data=None, reference_data=None, grid=None, output_path=None, return_output=False, exact_match_required=False):
grid = {**default_NRMS_grid, **(grid or {})}
for key, value in grid.items():
globals()[key] = value

@@ -319,7 +313,8 @@ def tune_params_on_NRMS_data_grid(query_data=None, reference_data=None, grid=Non

param_grid = product(similarity_measure, weight, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max,
noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library)
- results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_NRMS)(df_query, df_reference, unique_query_ids, unique_reference_ids, *params) for params in param_grid)
+ #results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_NRMS)(df_query, df_reference, unique_query_ids, unique_reference_ids, *params) for params in param_grid, exact_match_required)
+ results = Parallel(n_jobs=-1, verbose=10)(delayed(_eval_one_NRMS)(df_query, df_reference, unique_query_ids, unique_reference_ids, *params, exact_match_required) for params in param_grid)

df_out = pd.DataFrame(results, columns=['ACC','SIMILARITY.MEASURE','WEIGHT','SPECTRUM.PROCESSING.ORDER', 'MZ.MIN','MZ.MAX','INT.MIN','INT.MAX',
'NOISE.THRESHOLD','WF.MZ','WF.INT','LET.THRESHOLD','ENTROPY.DIMENSION', 'HIGH.QUALITY.REFERENCE.LIBRARY'])

@@ -340,7 +335,7 @@ def tune_params_on_NRMS_data_grid(query_data=None, reference_data=None, grid=Non



- def get_acc_HRMS(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct, similarity_measure, weights, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, window_size_centroiding, window_size_matching, noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library, verbose=True):
+ def get_acc_HRMS(df_query, df_reference, precursor_ion_mz_tolerance, ionization_mode, adduct, similarity_measure, weights, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, window_size_centroiding, window_size_matching, noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library, verbose=True, exact_match_required=False):

n_top_matches_to_save = 1
unique_reference_ids = df_reference['id'].dropna().astype(str).unique().tolist()

@@ -443,36 +438,40 @@ def get_acc_HRMS(df_query, df_reference, precursor_ion_mz_tolerance, ionization_
top_idx = df_scores.values.argmax(axis=1)
top_scores = df_scores.values[np.arange(df_scores.shape[0]), top_idx]
top_ids = [df_scores.columns[i] for i in top_idx]
-
df_tmp = pd.DataFrame({'TRUE.ID': df_scores.index.to_list(), 'PREDICTED.ID': top_ids, 'SCORE': top_scores})
- if verbose:
-
-
-
+ #if verbose:
+ # print(df_tmp)
+ if exact_match_required == True:
+ acc = (df_tmp['TRUE.ID'] == df_tmp['PREDICTED.ID']).mean()
+ else:
+ true_lower = df_tmp['TRUE.ID'].str.lower()
+ pred_lower = df_tmp['PREDICTED.ID'].str.lower()
+ matches = [t in p for t, p in zip(true_lower, pred_lower)]
+ acc = sum(matches) / len(matches)
return acc


- def get_acc_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids, similarity_measure, weights, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library, verbose=True):
+ def get_acc_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids, similarity_measure, weights, spectrum_preprocessing_order, mz_min, mz_max, int_min, int_max, noise_threshold, wf_mz, wf_int, LET_threshold, entropy_dimension, high_quality_reference_library, verbose=True, exact_match_required=False):

n_top_matches_to_save = 1

- min_mz = int(np.min([np.min(df_query
- max_mz = int(np.max([np.max(df_query
+ min_mz = int(np.min([np.min(df_query['mz_ratio']), np.min(df_reference['mz_ratio'])]))
+ max_mz = int(np.max([np.max(df_query['mz_ratio']), np.max(df_reference['mz_ratio'])]))
mzs = np.linspace(min_mz,max_mz,(max_mz-min_mz+1))

all_similarity_scores = []
for query_idx in range(0,len(unique_query_ids)):
- q_idxs_tmp = np.where(df_query
- q_spec_tmp = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp
+ q_idxs_tmp = np.where(df_query['id'] == unique_query_ids[query_idx])[0]
+ q_spec_tmp = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
q_spec_tmp = convert_spec(q_spec_tmp,mzs)

similarity_scores = []
for ref_idx in range(0,len(unique_reference_ids)):
q_spec = q_spec_tmp
- if verbose is True and ref_idx % 1000 == 0:
-
- r_idxs_tmp = np.where(df_reference
- r_spec_tmp = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp
+ #if verbose is True and ref_idx % 1000 == 0:
+ # print(f'Query spectrum #{query_idx} has had its similarity with {ref_idx} reference library spectra computed')
+ r_idxs_tmp = np.where(df_reference['id'] == unique_reference_ids[ref_idx])[0]
+ r_spec_tmp = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
r_spec = convert_spec(r_spec_tmp,mzs)

for transformation in spectrum_preprocessing_order:

@@ -533,7 +532,15 @@ def get_acc_NRMS(df_query, df_reference, unique_query_ids, unique_reference_ids,
scores = np.array(scores)
out = np.c_[unique_query_ids,preds,scores]
df_tmp = pd.DataFrame(out, columns=['TRUE.ID','PREDICTED.ID','SCORE'])
-
+ #if verbose:
+ # print(df_tmp)
+ if exact_match_required == True:
+ acc = (df_tmp['TRUE.ID'] == df_tmp['PREDICTED.ID']).mean()
+ else:
+ true_lower = df_tmp['TRUE.ID'].str.lower()
+ pred_lower = df_tmp['PREDICTED.ID'].str.lower()
+ matches = [t in p for t, p in zip(true_lower, pred_lower)]
+ acc = sum(matches) / len(matches)
return acc


@@ -571,8 +578,6 @@ def run_spec_lib_matching_on_HRMS_data(query_data=None, reference_data=None, pre
if 'adduct' in df_reference.columns.tolist() and adduct != 'N/A' and adduct != None:
df_reference = df_reference.loc[df_reference['adduct']==adduct]

- print(df_reference.loc[df_reference['id']=='Hectochlorin M+H'])
-
if spectrum_preprocessing_order is not None:
spectrum_preprocessing_order = list(spectrum_preprocessing_order)
else:

@@ -806,7 +811,7 @@ def run_spec_lib_matching_on_NRMS_data(query_data=None, reference_data=None, lik
df_query = pd.read_csv(output_path_tmp, sep='\t')
if extension == 'txt' or extension == 'TXT':
df_query = pd.read_csv(query_data, sep='\t')
- unique_query_ids = df_query
+ unique_query_ids = df_query['id'].unique()

if reference_data is None:
print('\nError: No argument passed to the mandatory reference_data. Please pass the path to the CSV file of the reference data.')

@@ -814,14 +819,14 @@ def run_spec_lib_matching_on_NRMS_data(query_data=None, reference_data=None, lik
else:
if isinstance(reference_data,str):
df_reference = get_reference_df(reference_data,likely_reference_ids)
- unique_reference_ids = df_reference
+ unique_reference_ids = df_reference['id'].unique()
else:
dfs = []
unique_reference_ids = []
for f in reference_data:
tmp = get_reference_df(f,likely_reference_ids)
dfs.append(tmp)
- unique_reference_ids.extend(tmp
+ unique_reference_ids.extend(tmp['id'].unique())
df_reference = pd.concat(dfs, axis=0, ignore_index=True)


@@ -897,23 +902,23 @@ def run_spec_lib_matching_on_NRMS_data(query_data=None, reference_data=None, lik



- min_mz = int(np.min([np.min(df_query
- max_mz = int(np.max([np.max(df_query
+ min_mz = int(np.min([np.min(df_query['mz_ratio']), np.min(df_reference['mz_ratio'])]))
+ max_mz = int(np.max([np.max(df_query['mz_ratio']), np.max(df_reference['mz_ratio'])]))
mzs = np.linspace(min_mz,max_mz,(max_mz-min_mz+1))

all_similarity_scores = []
for query_idx in range(0,len(unique_query_ids)):
- q_idxs_tmp = np.where(df_query
- q_spec_tmp = np.asarray(pd.concat([df_query.iloc[q_idxs_tmp
+ q_idxs_tmp = np.where(df_query['id'] == unique_query_ids[query_idx])[0]
+ q_spec_tmp = np.asarray(pd.concat([df_query['mz_ratio'].iloc[q_idxs_tmp], df_query['intensity'].iloc[q_idxs_tmp]], axis=1).reset_index(drop=True))
q_spec_tmp = convert_spec(q_spec_tmp,mzs)

similarity_scores = []
for ref_idx in range(0,len(unique_reference_ids)):
- if verbose is True and ref_idx % 1000 == 0:
-
+ #if verbose is True and ref_idx % 1000 == 0:
+ # print(f'Query spectrum #{query_idx} has had its similarity with {ref_idx} reference library spectra computed')
q_spec = q_spec_tmp
- r_idxs_tmp = np.where(df_reference
- r_spec_tmp = np.asarray(pd.concat([df_reference.iloc[r_idxs_tmp
+ r_idxs_tmp = np.where(df_reference['id'] == unique_reference_ids[ref_idx])[0]
+ r_spec_tmp = np.asarray(pd.concat([df_reference['mz_ratio'].iloc[r_idxs_tmp], df_reference['intensity'].iloc[r_idxs_tmp]], axis=1).reset_index(drop=True))
r_spec = convert_spec(r_spec_tmp,mzs)

for transformation in spectrum_preprocessing_order:
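
Note: the central behavioral change in spec_lib_matching.py is the new exact_match_required flag. The sketch below (not part of the diff; the toy compound IDs are invented) restates the two accuracy definitions added to get_acc_HRMS and get_acc_NRMS: an exact ID comparison versus a case-insensitive check that the true ID occurs as a substring of the predicted ID.

    import pandas as pd

    # Toy predictions table with the same columns as df_tmp in the hunks above.
    df_tmp = pd.DataFrame({
        'TRUE.ID': ['Caffeine', 'Glucose'],
        'PREDICTED.ID': ['Caffeine M+H', 'Fructose'],
    })

    # exact_match_required=True: the predicted ID must equal the true ID exactly.
    acc_exact = (df_tmp['TRUE.ID'] == df_tmp['PREDICTED.ID']).mean()  # 0.0

    # exact_match_required=False: count a hit when the true ID appears
    # (case-insensitively) inside the predicted ID, e.g. 'Caffeine M+H'.
    true_lower = df_tmp['TRUE.ID'].str.lower()
    pred_lower = df_tmp['PREDICTED.ID'].str.lower()
    matches = [t in p for t, p in zip(true_lower, pred_lower)]
    acc_substring = sum(matches) / len(matches)  # 0.5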
{pycompound-0.1.7.dist-info → pycompound-0.1.9.dist-info}/METADATA
CHANGED

@@ -1,6 +1,6 @@
Metadata-Version: 2.4
Name: pycompound
- Version: 0.1.7
+ Version: 0.1.9
Summary: Python package to perform compound identification in mass spectrometry via spectral library matching.
Author-email: Hunter Dlugas <fy7392@wayne.edu>
License-Expression: MIT

@@ -24,4 +24,5 @@ Requires-Dist: joblib==1.5.2
Dynamic: license-file

# PyCompound
-
+
+ A Python-based tool for spectral library matching, PyCompound is available as a Python package (pycompound) with a command-line interface (CLI) available and as a GUI application build with Python/Shiny. It performs spectral library matching to identify chemical compounds, offering a range of spectrum preprocessing transformations and similarity measures, including Cosine, three entropy-based similarity measures, and a plethora of binary similarity measures. PyCompound also includes functionality to tune parameters commonly used in a compound identification workflow given a query library of spectra with known ID. PyCompound supports both high-resolution mass spectrometry (HRMS) data (e.g., LC-MS/MS) and nominal-resolution mass spectrometry (NRMS) data (e.g., GC-MS). For the full documentation, see the GitHub repository https://github.com/hdlugas/pycompound.