PyPI - pycompound - Versions diffs - 0.1.6__tar.gz → 0.1.8__tar.gz - Mend

pycompound 0.1.6tar.gz → 0.1.8tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (32) hide show

pycompound-0.1.8/PKG-INFO ADDED Viewed

@@ -0,0 +1,824 @@
+Metadata-Version: 2.4
+Name: pycompound
+Version: 0.1.8
+Summary: Python package to perform compound identification in mass spectrometry via spectral library matching.
+Author-email: Hunter Dlugas <fy7392@wayne.edu>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/hdlugas/pycompound
+Project-URL: Issues, https://github.com/hdlugas/pycompound/issues
+Classifier: Programming Language :: Python :: 3
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: matplotlib==3.8.4
+Requires-Dist: numpy==1.26.4
+Requires-Dist: pandas==2.2.2
+Requires-Dist: scipy==1.13.1
+Requires-Dist: pyteomics==4.7.2
+Requires-Dist: netCDF4==1.6.5
+Requires-Dist: lxml>=5.1.0
+Requires-Dist: orjson==3.11.0
+Requires-Dist: shiny==1.4.0
+Requires-Dist: joblib==1.5.2
+Dynamic: license-file
+# PyCompound
+A Python-based tool for spectral library matching, PyCompound is available as a Python package with a command-line interface (CLI) available and as a GUI application build with Python/Shiny. It performs spectral library matching to identify chemical compounds, offering a range of spectrum preprocessing transformations and similarity measures, including Cosine and three entropy-based similarity measures. PyCompound supports both high-resolution mass spectrometry (HRMS) data (e.g., LC-MS/MS) and nominal-resolution mass spectrometry (NRMS) data (e.g., GC-MS).
+## Table of Contents
+- [1. Install dependencies](#create-conda-env)
+- [2. Functionality](#functionality)
+   - [2.1 Spectrum Preprocessing Transformations](#spec-preprocessing-transformations)
+   - [2.2 Similarity Measures](#similarity-measures)
+- [3. Usage](#usage)
+   - [3.1 Parameter descriptions](#param_descriptions)
+   - [3.2 Obtain LC-MS/MS or GC-MS library from MGF, mzML, or cdf file](#process-data)
+   - [3.3 Run spectral library matching](#run-spec-lib-matching)
+   - [3.4 Tune parameters](#tuning)
+   - [3.5 Plot a query spectrum against a reference spectrum before and after spectrum preprocessing transformations](#plotting)
+   - [3.6 Shiny application](#shiny)
+- [4. Bugs/Questions?](#bugs-questions)
+<a name="create-conda-env"></a>
+## 1. Install dependencies
+PyCompound requires the Python dependencies Matplotlib, NumPy, Pandas, SciPy, Pyteomics, and netCDF4. Specifically, this software was validated with python=3.12.4, matplotlib=3.8.4, numpy=1.26.4, pandas=2.2.2, scipy=1.13.1, pyteomics=4.7.2, netCDF4=1.6.5, lxml=5.1.0, joblib=1.5.2, and shiny=1.4.0, although it may work with other versions of these tools. A user may consider creating a conda environment (see [https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) for guidance on getting started with conda if you are unfamiliar). For a system with conda installed, one can create the environment pycompound_env, activate it, and install the necessary dependencies with:
+```
+conda create -n pycompound_env python=3.12
+conda activate pycompound_env
+pip install pycompound==0.1.7
+```
+<a name="functionality"></a>
+## 2. Functionality
+<a name="spec-preprocessing-transformations"></a>
+## 2.1 Spectrum Preprocessing Transformations
+The following spectrum preprocessing transformations are offered:
+-   Filtering: Given user-defined parameters (mz_min,mz_max),
+    (int_min,int_max) and spectrum $I$ with m/z values
+    $(m_{1},m_{2},...,m_{n})$ and intensities $(x_{1},x_{2},...,x_{n})$,
+    the transformed spectrum $I^{\star}$ consists of the peaks
+    $(m_{i},x_{i})$ in $I$ such that mz_min $\leq m_{i}\leq$ mz_max and
+    int_min $\leq x_{i}\leq$ int_max.
+-   Weight Factor Transformation: Given a pair of user-defined weight
+    factor parameters $(\text{a,b})$ and spectrum $I$ with m/z values
+    $(m_{1},m_{2},...,m_{n})$ and intensities $(x_{1},x_{2},...,x_{n})$,
+    the transformed spectrum $I^{\star}$ has the same m/z values as $I$
+    and has intensities given by
+    $I^{\star}:=(m_{1}^{\text{a}}\cdot x_{1}^{\text{b}},m_{2}^{\text{a}}\cdot x_{2}^{\text{b}},...,m_{n}^{\text{a}}\cdot x_{n}^{\text{b}})$.
+-   Low-Entropy Transformation: Given a user-defined low-entropy
+    threshold parameter $T$ and spectrum $I$ with intensities
+    $(x_{1},x_{2},...,x_{n})$, $\sum_{i=1}^nx_i = 1$, and Shannon
+    entropy $H_{Shannon}(I)=-\sum_{i=1}^{n}x_{i}\cdot ln(x_{i})$, the
+    transformed spectrum intensities
+    $I^{\star}=(x_{1}^{\star},x_{2}^{\star},...,x_{n}^{\star})$ are such
+    that, for all $i\in\{1,2,...,n\}$, $x_{i}^{\star}=x_{i}$ if
+    $H_{Shannon}(I)\geq T$ and
+    $x_{i}^{\star}=x_{i}^{\frac{1+H_{Shannon}(I)}{1+T}}$ if
+    $H_{Shannon}(I)<T$.
+-   Centroiding (only applicable to HRMS data): Given a user-defined
+    window-size parameter $w_{centroiding}$ and a spectrum $I$ with m/z
+    values $(m_{1},m_{2},...,m_{n})$ and intensities
+    $(x_{1},x_{2},...,x_{n})$, the transformed spectrum $I^{\star}$
+    merges adjacent peaks $(m_{i},x_{i}),(m_{i+1},x_{i+1})$ into the
+    peak
+    $(\frac{m_{i}\cdot x_{i}+m_{i+1}\cdot x_{i+1}}{x_{i}+x_{i+1}},x_{i}+x_{i+1})$
+    if $|m_{i}-m_{i+1}|< w_{centroiding}$ for
+    $i\in\{1,2,...,n-1\}$. This centroiding procedure generalizes to
+    more than two peaks whose m/z values are within a distance
+    $w_{centroiding}$ of each other.
+-   Noise Removal: Given a user-defined noise removal parameter $r$ and
+    a spectrum $I$ with intensities $(x_{1},x_{2},...,x_{n})$, noise
+    removal removes peaks from $I$ with
+    $x_{j}< r\cdot\text{max}(\{x_{1},x_{2},...,x_{n}\})$ for
+    $j\in\{1,2,...,n\}$.
+-   Matching (only applicable to HRMS data): Given a user-defined
+    window-size parameter $w_{matching}$ and two spectra $I$, $J$ with
+    m/z ratios $(a_{1},a_{2},...,a_{n}), (b_{1},b_{2},...,b_{m})$ and
+    intensities $(x_{1},x_{2},...,x_{n}), (y_{1},y_{2},...,y_{m})$,
+    respectively, of which we would like to measure the similarity
+    between, the matching procedure outputs two spectra
+    $I^{\star},J^{\star}$ containing the same number of peaks with
+    $I^{\star}$ and $J^{\star}$ having intensities and
+    identical m/z ratios. Specifically, for a given peak $(a_{i},x_{i})$
+    of $I$, if there are no peaks $(b_{j},y_{j})$ in $J$ with
+    $|a_{i}-b_{j}|< w_{matching}$, then the peak $(a_{i},x_{i})$
+    remains in $I^{\star}$ and the peak $(a_{i},0)$ is included in
+    $J^{\star}$. If there is at least one peak $(b_{j},y_{j})$ with
+    $|a_{i}-b_{j}|< w_{matching}$, then the peak $(a_{i},x_{i})$
+    remains in $I^{\star}$ and the peak
+    $(a_{i},\sum_{j\text{ such that }|a_{i}-b_{j}|< w_{matching}}b_{j})$
+    is included in $J^{\star}$. This procedure is applied when
+    transposing the roles of $I$ and $J$ as well.
+<a name="similarity-measures"></a>
+## 2.2 Similarity Measures
+Given a pair of processed spectra intensities
+$I=(a_{1},a_{2},...,a_{n}), J=(b_{1},b_{2},...,b_{n})\in\mathbb{R}^{n}$
+with $0\leq a_{i},b_{i}\leq 1$ for all $i\in\{1,2,...,n\}$ and
+$\sum_{i=1}^{n}a_{i}=\sum_{i=1}^{n}b_{i}=1$, PyCompound provides
+functionality for computing the following similarity measures:
+-   Cosine Similarity Measure:
+```math
+S_{Cosine}(I,J)=\frac{I\circ J}{|I|_{2}\cdot |J|_{2}}
+```
+where multiplication in the numerator refers to the dot product $I\circ J=a_{1}b_{1}+a_{2}b_{2}+...+a_{n}b_{n}$ of $I$ and $J$ and multiplication in the denominator refers to multiplication of the $L^{2}$-norms of $I$ and $J$, $\vert I\vert_{2}=\sqrt{a_{1}^{2}+a_{2}^{2}+...+a_{n}^{2}}, \vert J\vert_{2}=\sqrt{b_{1}^{2}+b_{2}^{2}+...+b_{n}^{2}}$.
+-   Shannon Entropy Similarity Measure:
+```math
+S_{Shannon}(I,J) = 1-\frac{2\cdot H_{Shannon}\left(\frac{I+J}{2}\right) - H_{Shannon}(I)-H_{Shannon}(J)}{ln(4)},
+```
+```math
+H_{Shannon}(I)=-\sum_{i=1}^{n}a_{i}\cdot ln(a_{i})
+```
+-    Tsallis Entropy Similarity Measure:
+```math
+S_{Tsallis}(I,J,q)=1-\frac{2\times H_{Tsallis}(I/2+J/2,q)-H_{Tsallis}(I,q)-H_{Tsallis}(J,q)}{N_{Tsallis}(I,J,q)},
+```
+```math
+N_{Tsallis}(I,J,q):=\frac{\sum_{i=1}^{n}\left(2\left(\frac{a_{i}}{2}\right)^{q}+2\left(\frac{b_{i}}{2}\right)^{q}-a_{i}^{q}-b_{i}^{q}\right)}{1-q},
+```
+```math
+H_{Tsallis}(I,q)=\frac{\left(\sum_{i=1}^{n}a_{i}^{q}\right)-1}{1-q},
+```
+```math
+q\neq 1, \ q>0
+```
+-   Rényi Entropy Similarity Measure:
+```math
+S_{Renyi}(I,J,q)=1-\frac{2\times H_{Renyi}(I/2+J/2,q)-H_{Renyi}(I,q)-H_{Renyi}(J,q)}{N_{Renyi}(I,J,q)},
+```
+```math
+N_{Renyi}(I,J,q):=\left(\frac{1}{1-q}\right)\left(2\times ln\left(\sum_{i}(a_{i}/2)^{q}+\sum_{j}(b_{j}/2)^{q}\right)-ln(\sum_{i}a_{i}^{q})-ln(\sum_{i}b_{i}^{q})\right),
+```
+```math
+H_{Renyi}(I,q)=\frac{1}{1-q}ln(\sum_{i=1}^{n}a_{i}^{q}),
+```
+```math
+q\neq 1, \ q>0
+```
+Additionally, the plethora of binary similarity measures considered in https://doi.org/10.3390/metabo12080694 are available along with a mixture similarity measure that is a weighted sum of the four non-binary similarity measures (i.e. Cosine, Shannon Entropy, Renyi, and Tsallis).
+<a name="usage"></a>
+## 3. Usage
+PyCompound has three main capabilities:
+1. Plotting a query spectrum vs. a reference spectrum before and after preprocessing transformations.
+2. Running spectral library matching to identify compounds based on their mass spectrometry data
+3. Tuning parameters to maximize accuracy given a query dataset with known compuond IDs (e.g. from targeted metabolomics experiments).
+These tasks are implemented separately for the cases of (i) NRMS and (ii) HRMS data due to the different spectrum preprocessing transformations stemming from a different format in the mass to charge (m/z) ratios in NRMS vs HRMS data. Example scripts which implement these tasks can be found in the pycompound/tests directory.
+<a name="param_descriptions"></a>
+### 3.1 Parameter descriptions
+For the function build_library_from_raw_data:
+```
+--input_path: Path to input file (must be either mgf, mzMZ, msp, cdf, or json file). Mandatory argument.
+--output_path: Path to output text file. Default: current working directory.
+--is_reference: Boolean flag indicating whether IDs of spectra should be written to output. Only pass True if building a library with known compound IDs. Only applicable to MGF files. Options: \'True\', \'False\'. Optional argument. Default: False.
+```
+Common parameters:
+```
+--query_data (mandatory argument):
+  * HRMS case: mgf, mzML, msp, json, or txt file of query mass spectrum/spectra to be identified. If txt file, must have at least 3 columns with each row corresponding to a single ion fragment of a mass spectrum, one 'id' column containing an identifier, one 'mz_ratio' column corresponding to the mass to charge (m/z) ratios, and one 'intensity' column containing the intensities. For example, if spectrum A has 3 ion fragments, then there would be three rows in this text file corresponding to spectrum A. Optional columns for the text file are 'precursor_ion_mz', 'ionization_mode', and 'adduct'.
+  * NRMS case: cdf or txt file of query mass spectrum/spectra to be identified. If txt file, same format as in HRMS case is required.
+--reference_data (mandatory argument): Same format text file as query_data except of reference library spectra. We recommend using the reference libraries from our Zenodo database ([https://zenodo.org/records/12786324](https://zenodo.org/records/12786324); stored on Zenodo due to file size limitations on GitHub).
+--precursor_ion_mz_tolerance (only applicable to HRMS): positive float representing a window size around each query spectrum's precursor ion mass:charge ratio in which candidate reference spectra must lie to be considered in compound identification. Default: None.
+--ionization_mode (only applicable to HRMS): Positive, Negative, or None. Default: None.
+--adduct (only applicable to HRMS): Options: H, NH3, NH4, Na, K, N/A. Default: N/A.
+--likely_reference_IDs: text file with one column containing the IDs of a subset of all compounds in the reference_data to be used in spectral library matching. Each ID in this file must be an ID in the reference library. Default: None (i.e. default is to use entire reference library)
+--similarity_measure: cosine, shannon, renyi, tsallis, mixture, jaccard, dice, 3w_jaccard, sokal_sneath, binary_cosine, mountford, mcconnaughey, driver_kroeber, simpson, braun_banquet, fager_mcgowan, kulczynski, intersection, hamming, hellinger. Default: cosine.
+--weights: dict of weights to give to each non-binary similarity measure (i.e. cosine, shannon, renyi, and tsallis) when the mixture similarity measure is specified. Default: 0.25 for each of the four non-binary similarity measures.
+--spectrum_preprocessing_order: The spectrum preprocessing transformations and the order in which they are to be applied. Note that these transformations are applied prior to computing similarity scores. Format must be a string with 2-6 characters chosen from C, F, M, N, L, W representing centroiding, filtering based on mass/charge and intensity values, matching, noise removal, low-entropy trannsformation, and weight-factor-transformation, respectively. For example, if \'WCM\' is passed, then each spectrum will undergo a weight factor transformation, then centroiding, and then matching. Note that if an argument is passed, then \'M\' must be contained in the argument, since matching is a required preprocessing step in spectral library matching of HRMS data. Furthermore, \'C\' must be performed before matching since centroiding can change the number of ion fragments in a given spectrum. Note that C and M are not applicable to NRMS data. Default: FCNMWL for HRMS and FNLW for NRMS.')
+--high_quality_reference_library: True/False flag indicating whether the reference library is considered to be of high quality. If True, then the spectrum preprocessing transformations of filtering and noise removal are performed only on the query spectrum/spectra. If False, all spectrum preprocessing transformations specified will be applied to both the query and reference spectra. Default: False')
+--mz_min: Remove all peaks with mass/charge value less than mz_min in each spectrum. Default: 0
+--mz_max: Remove all peaks with mass/charge value greater than mz_max in each spectrum. Default: 9999999
+--int_min: Remove all peaks with intensity value less than int_min in each spectrum. Default: 0
+--int_max: Remove all peaks with intensity value greater than int_max in each spectrum. Default: 9999999
+--window_size_centroiding (only for HRMS): Window size parameter used in centroiding a given spectrum. Default: 0.5
+--window_size_matching (only for HRMS): Window size parameter used in matching a query spectrum and a reference library spectrum. Default: 0.5
+--noise_threshold: Ion fragments (i.e. points in a given mass spectrum) with intensity less than max(intensities)*noise_threshold are removed. Default: 0.0
+--wf_mz: Mass/charge weight factor parameter. Default: 0.0
+--wf_intensity: Intensity weight factor parameter. Default: 0.0
+--LET_threshold: Low-entropy transformation threshold parameter. Spectra with Shannon entropy less than LET_threshold are transformed according to intensitiesNew=intensitiesOriginal^{(1+S)/(1+LET_threshold)}. Default: 0.0
+--entropy_dimension: Entropy dimension parameter. Must have positive value other than 1. When the entropy dimension is 1, then Renyi and Tsallis entropy are equivalent to Shannon entropy. Therefore, this parameter only applies to the renyi and tsallis similarity measures. This parameter will be ignored if similarity measure cosine or shannon is chosen. Default: 1.1
+```
+Parameters specific to run_spec_lib_matching_on_HRMS_data and run_spec_lib_matching_on_NRMS_data:
+```
+--n_top_matches_to_save: The number of top matches to report. For example, if n_top_matches_to_save=5, then for each query spectrum, the five reference spectra with the largest similarity with the given query spectrum will be reported. Default: 1
+--print_id_results: Flag that prints identification results if True. Default: False
+--output_identification: Output text file containing the most-similar reference spectra for each query spectrum along with the corresponding similarity scores. Default is to save identification output in current working directory with filename 'output_identification.txt'.
+--output_similarity_scores: Output text file containing similarity scores between all query spectrum/spectra and all reference spectra. Each row corresponds to a query spectrum, the left-most column contains the query spectrum/spectra identifier, and the remaining column contain the similarity scores with respect to all reference library spectra. If no argument passed, then this text file is written to the current working directory with filename output_all_similarity_scores.txt.
+```
+Parameters specific to tune_params_on_HRMS_data_grid and tune_params_on_NRMS_data_grid:
+```
+`` grid: dict object such as {'similarity_measure':['cosine','shannon'], 'spectrum_preprocessing_order':['FCNMWL'], 'mz_min':[0], 'mz_max':[9999999], 'int_min':[0], 'int_max':[99999999], 'window_size_centroiding':[0.5], 'window_size_matching':[0.5], 'noise_threshold':[0.0,0.1], 'wf_mz':[0.0], 'wf_int':[1.0], 'LET_threshold':[0.0], 'entropy_dimension':[1.1], 'high_quality_reference_library':[False]} with all possible combinations of parameters being utilized.
+--output_path: path to output text file containing the accuracies for each possible combination of parameters. If no argument is passed, then the plots will be saved to ./tuning_param_output.txt in the current working directory.
+```
+Parameters specific to tune_params_DE:
+```
+-- optimize_params: list of continuous parameters (i.e. window_size_centroiding, window_size_matching, noise_threshold, wf_mz, wf_int, LET_threshold; window_size parameters only applicable to HRMS data) to optimize via differential evolution.
+-- param_bounds: dict with keys being the parameters to optimize and values being a tuple of length 2 of the lower and upper bounds of acceptable parameter values.
+-- maxiters: maximum number of iterations of differential evolution.
+-- de_workers: number of CPUs to utilize.
+```
+Parameters specific to generate_plots_on_HRMS_data and generate_plots_on_NRMS_data:
+```
+--spectrum_ID1: ID of one spectrum to be plotted. Default is first spectrum in the query library. Optional argument.
+--spectrum_ID2: ID of another spectrum to be plotted. Default is first spectrum in the reference library. Optional argument.
+--y_axis_transformation: transformation to apply to y-axis (i.e. intensity axis) of plots. Options: 'normalized', 'none', 'log10', and 'sqrt'. Default: 'normalized.')
+--output_path: path to output PDF file containing the plots of the spectra before and after preprocessing transformations. If no argument is passed, then the plots will be saved to the PDF ./spectrum1_{spectrum_ID1}_spectrum2_{spectrum_ID2}_plot.pdf in the current working directory.
+```
+<a name="process-data"></a>
+### 3.2 Obtain LC-MS/MS or GC-MS library from MGF, mzML, cdf, msp, or json file
+To obtain a text file of LC-MS/MS spectra in the format necessary for spectral library matching from raw data in the form of an mgf, mzML, msp, json, or cdf file inside Python, one can run:
+```
+from pycompound.build_library import build_library_from_raw_data
+build_library_from_raw_data(input_path='path_to_input_file', output_path='path_to_output_file', is_reference=False)
+```
+Since the other functionality provided by pycompound is capable of being directly run on mgf, mzML, msp, json, and cdf files, you may not need to directly build a library yourself. Some example mgf and json files one can use to build an LC-MS/MS library can be found from the Global Natural Products Social Molecular Networking (GNPS) databases here: [https://external.gnps2.org/gnpslibrary](https://external.gnps2.org/gnpslibrary). Some example mzML files one can use to build an LC-MS/MS library can be found in this repository: [https://github.com/HUPO-PSI/mzML](https://github.com/HUPO-PSI/mzML). Some example MSP files can be found here: [https://mona.fiehnlab.ucdavis.edu/downloads](https://mona.fiehnlab.ucdavis.edu/downloads). The mgf, mzML, msp, and json files provided in this repository are trimmed versions of files found in these referenced repositories. The script tests/test_build_libraries.py demonstrates this usage.
+Full LC-MS/MS and GC-MS reference libraries are available at the Zenodo database ([https://zenodo.org/records/12786324](https://zenodo.org/records/12786324)).
+<a name="run-spec-lib-matching"></a>
+### 3.3 Run spectral library matching
+The files tests/test_spec_lib_matching.py, tests/test_spec_lib_matching_CLI, and tests/example_code_for_python_use.py demonstrate how some of the spectrum preprocessing functionality and similarity measures can be implemented either directly in Python or in the CLI wrapper. The two main functions - one for HRMS data and one for NRMS data - can be implemented as shown below inside Python:
+```
+from pycompound.spec_lib_matching import run_spec_lib_matching_on_HRMS_data
+from pycompound.spec_lib_matching import run_spec_lib_matching_on_NRMS_data
+run_spec_lib_matching_on_HRMS_data(
+        query_data='path_to_query_library',
+        reference_data='path_to_reference_library',
+        likely_reference_IDs=None,
+        similarity_measure='cosine',
+        spectrum_preprocessing_order='FCNMWL',
+        high_quality_reference_library=False,
+        mz_min=0,
+        mz_max=9999999,
+        int_min=0,
+        int_max=9999999,
+        window_size_centroiding=0.5,
+        window_size_matching=0.5,
+        noise_threshold=0.0,
+        wf_mz=0.0,
+        wf_intensity=1.0,
+        LET_threshold=0.0,
+        entropy_dimension=1.1,
+        n_top_matches_to_save=1,
+        print_id_results=False,
+        output_identification=None,
+        output_similarity_scores=None)
+run_spec_lib_matching_on_NRMS_data(
+        query_data='path_to_query_library',
+        reference_data='path_to_reference_library',
+        likely_reference_IDs=None,
+        similarity_measure='cosine',
+        spectrum_preprocessing_order='FNLW',
+        high_quality_reference_library=False,
+        mz_min=0,
+        mz_max=9999999,
+        int_min=0,
+        int_max=9999999,
+        noise_threshold=0.0,
+        wf_mz=0.0,
+        wf_intensity=1.0,
+        LET_threshold=0.0,
+        entropy_dimension=1.1,
+        n_top_matches_to_save=1,
+        print_id_results=False,
+        output_identification=None,
+        output_similarity_scores=None)
+```
+To use the CLI version, one can run the following from the terminal:
+```
+python spec_lib_matching_CLI.py \
+        --query_data ${PWD}/../tests/data/lcms_query_library.txt \
+        --reference_data ${PWD}/../tests/data/full_GNPS_reference_library.txt \
+        --chromatography_platform HRMS \
+        --likely_reference_IDs None \
+        --similarity_measure cosine \
+        --spectrum_preprocessing_order FCNMWL \
+        --high_quality_reference_library False \
+        --mz_min 0 \
+        --mz_max 9999999 \
+        --int_min 0 \
+        --int_max 9999999 \
+        --window_size_centroiding 0.5 \
+        --window_size_matching 0.5 \
+        --noise_threshold 0.0 \
+        --wf_mz 0.0 \
+        --wf_intensity 1.0 \
+        --LET_threshold 0.0 \
+        --entropy_dimension 1.1 \
+        --n_top_matches_to_save 1 \
+        --print_id_results False \
+        --output_identification ${PWD}/../tests/output_identification_HRMS.txt \
+        --output_similarity_scores ${PWD}/../tests/output_similarity_scores_HRMS.txt
+python spec_lib_matching_CLI.py \
+        --query_data ${PWD}/../tests/data/lcms_query_library.txt \
+        --reference_data ${PWD}/../tests/data/full_GNPS_reference_library.txt \
+        --chromatography_platform NRMS \
+        --likely_reference_IDs None \
+        --similarity_measure cosine \
+        --spectrum_preprocessing_order FCNMWL \
+        --high_quality_reference_library False \
+        --mz_min 0 \
+        --mz_max 9999999 \
+        --int_min 0 \
+        --int_max 9999999 \
+        --noise_threshold 0.0 \
+        --wf_mz 0.0 \
+        --wf_intensity 1.0 \
+        --LET_threshold 0.0 \
+        --entropy_dimension 1.1 \
+        --n_top_matches_to_save 1 \
+        --print_id_results False \
+        --output_identification ${PWD}/../tests/output_identification_NRMS.txt \
+        --output_similarity_scores ${PWD}/../tests/output_similarity_scores_NRMS.txt
+```
+For a user who may wish to incorporate our transformations and similarity measures directly in their python code similar to the example script tests/example_code_for_python_use.py, the available transformations and similarity measures are:
+```
+# Weight factor transformation
+wf_transform(spec_mzs, spec_ints, wf_mz, wf_int)
+"""
+Perform weight factor transformation on a spectrum
+Args:
+   spec_mzs: 1d numpy array representing mass/charge values
+   spec_ints: 1d numpy array representing intensity values
+   wf_mz: float
+   wf_int: float
+Returns:
+   np.ndarray: 1d numpy array of weight-factor-transformed spectrum intensities
+"""
+# Low-entropy transformation
+LE_transform(intensity, thresh, normalization_method)
+"""
+Transforms spectrum's intensities if the Shannon entropy of the intensities is below some threshold
+Args:
+   intensity: 1d numpy array
+   thresh: nonnegative float
+   normalization_method: either 'standard' or 'softmax'
+Returns:
+   np.ndarray: 1d numpy array of transformed intensities
+"""
+# Filter HR-MS such as LC-MS/MS spectrum
+filter_spec_lcms(spec, mz_min, mz_max, int_min, int_max, is_matched)
+"""
+Filter an MS/MS spectrum based on m/z and intensity values
+Args:
+   spec: N x 2 numpy array with first column being m/z and second column being intensity
+   mz_min: minimum m/z value
+   mz_max: maximum m/z value
+   int_min: minimum intensity value
+   int_max: maximum intensity value
+   is_matched: flag to indicate whether the given spectrum has already been matched to another spectrum
+Returns:
+   np.ndarray: N x 2 numpy array with intensity of 0 put anywhere outside of the m/z and/or intensity bounds
+"""
+# Filter NR-MS such as GC-MS spectrum
+filter_spec_gcms(spec, mz_min, mz_max, int_min, int_max)
+"""
+Filter an MS spectrum based on m/z and intensity values
+Args:
+   spec: N x 2 numpy array with first column being m/z and second column being intensity
+   mz_min: minimum m/z value
+   mz_max: maximum m/z value
+   int_min: minimum intensity value
+   int_max: maximum intensity value
+Returns:
+   np.ndarray: N x 2 numpy array with intensity of 0 put anywhere outside of the m/z and/or intensity bounds
+"""
+# Remove low-intensity noise
+remove_noise(spec, nr)
+"""
+Remove low-intensity ion fragments
+Args:
+   spec: N x 2 numpy array with first column being m/z and second column being intensity
+   nr: noise removal parameter; ion fragments with intensity less than max(intensity)*nr have intensity set to 0
+Returns:
+   np.ndarray: N x 2 numpy array
+"""
+# Centroid spectrum by merging close m/z peaks
+centroid_spectrum(spec, window_size)
+"""
+Centroid a spectrum by merging ion fragments that are 'close' with respect to m/z value
+Args:
+   spec: N x 2 numpy array with the first column being mass/charge and the second column being intensity
+   window_size: window-size parameter
+Returns:
+   np.ndarray: M x 2 numpy array with M <= N due to peaks being merged
+"""
+# Match peaks between two spectra
+match_peaks_in_spectra(spec_a, spec_b, window_size)
+"""
+Align two spectra so that we obtain a list of intensity values from each spectrum of the same length
+Args:
+   spec_a: N x 2 numpy array with the first column being mass/charge and the second column being intensity
+   spec_b: M x 2 numpy array with the first column being mass/charge and the second column being intensity
+   window_size: window-size parameter
+Returns:
+   np.ndarray: K x 3 numpy array with first column being mass/charge, second column being matched intensities of spec_a, and third column being matched intensities of spec_b
+"""
+# Assign 0 to the intensities without m/z values
+convert_spec(spec, mzs)
+"""
+Set intensity values to 0 where m/z values are missing
+Args:
+   spec: N x 2 dimensional numpy array
+   mzs: length M list of entire span of mass/charge values considering both the query and reference libraries
+Returns:
+   np.ndarray: M x 2 dimensional numpy array
+"""
+# Cosine similarity
+S_cos(ints_a, ints_b)
+"""
+Cosine similarity measure
+Args:
+   ints_a: 1d numpy array of intensities of a spectrum
+   ints_b: 1d numpy array of intensities of a spectrum
+Returns:
+   float: float between 0 and 1 indicating the similarity of the two spectra
+"""
+# Shnnon entropy similarity
+S_shannon(ints_a, ints_b)
+"""
+Shannon entropy similarity measure
+Args:
+   ints_a: 1d numpy array of intensities of a spectrum
+   ints_b: 1d numpy array of intensities of a spectrum
+Returns:
+   float: float between 0 and 1 indicating the similarity of the two spectra
+"""
+# Renyi entropy similarity
+S_renyi(ints_a, ints_b, q)
+"""
+Renyi entropy similarity measure
+Args:
+   ints_a: 1d numpy array of intensities of a spectrum
+   ints_b: 1d numpy array of intensities of a spectrum
+   q: positive float representing 'entropy dimension'
+Returns:
+   float: float between 0 and 1 indicating the similarity of the two spectra
+"""
+# Tsallis entropy similarity
+S_tsallis(ints_a, ints_b, q)
+"""
+Tsallis entropy similarity measure
+Args:
+   ints_a: 1d numpy array of intensities of a spectrum
+   ints_b: 1d numpy array of intensities of a spectrum
+   q: positive float representing 'entropy dimension'
+Returns:
+   float: float between 0 and 1 indicating the similarity of the two spectra
+"""
+```
+<a name="tuning"></a>
+### 3.4 Tune parameters
+Note that in order to tune parameters such as noise_threshold, LET_threshold etc., one must have a query library with compounds whose ground truth ID is known (e.g. from targeted metabolomics experiments). PyCompound offers two different methods of tuning parameters: one being an exhaustive grid search of pre-specified values, and the other being an optimization approach using differential evolution to optimize continuous parameters with respect to accuracy. The usage of the functions to tune parameters within Python is:
+```
+from pycompound.spec_lib_matching import tune_params_on_HRMS_data_grid
+from pycompound.spec_lib_matching import tune_params_on_NRMS_data_grid
+from pycompound.spec_lib_matching import tune_params_DE
+from pathlib import Path
+tune_params_on_HRMS_data_grid(
+    query_data=f'{Path.cwd()}/tests/data/lcms_query_library_tuning.txt',
+    reference_data=f'{Path.cwd()}/tests/data/full_GNPS_reference_library.txt',
+    precursor_ion_mz_tolerance=0.5,
+    ionization_mode='Positive',
+    adduct='H',
+    grid={'similarity_measure':['cosine'], 'spectrum_preprocessing_order':['FCNMWL'], 'mz_min':[0], 'mz_max':[9999999], 'int_min':[0], 'int_max':[99999999], 'window_size_centroiding':[0.5], 'window_size_matching':[0.1,0.5], 'noise_threshold':[0.0], 'wf_mz':[0.0], 'wf_int':[1.0], 'LET_threshold':[0.0], 'entropy_dimension':[1.1], 'high_quality_reference_library':[False]},
+    output_path=f'{Path.cwd()}/tuning_param_output_HRMS.txt'
+)
+tune_params_on_NRMS_data_grid(
+    query_data=f'{Path.cwd()}/tests/data/gcms_query_library_tuning.txt',
+    reference_data=f'{Path.cwd()}/tests/data/gcms_reference_library.txt',
+    grid={'similarity_measure':['cosine','shannon'], 'spectrum_preprocessing_order':['FNLW'], 'mz_min':[0], 'mz_max':[9999999], 'int_min':[0], 'int_max':[99999999], 'noise_threshold':[0.0,0.1], 'wf_mz':[0.0], 'wf_int':[1.0], 'LET_threshold':[0.0,3.0], 'entropy_dimension':[1.1], 'high_quality_reference_library':[False]},
+    output_path=f'{Path.cwd()}/tuning_param_output_NRMS.txt'
+)
+tune_params_DE(
+    query_data=f'{Path.cwd()}/tests/data/lcms_query_library_tuning.txt',
+    reference_data=f'{Path.cwd()}/tests/data/full_GNPS_reference_library.txt',
+    precursor_ion_mz_tolerance=0.1,
+    ionization_mode='Positive',
+    adduct='H',
+    chromatography_platform='HRMS',
+    similarity_measure='shannon',
+    optimize_params=["wf_mz","wf_int"],
+    param_bounds={"wf_mz":(0.0,5.0),"wf_int":(0.0,5.0)},
+    default_params={"window_size_centroiding": 0.5, "window_size_matching":0.5, "noise_threshold":0.10, "wf_mz":0.0, "wf_int":1.0, "LET_threshold":0.0, "entropy_dimension":1.1},
+    maxiters=2,
+    de_workers=-1
+)
+tune_params_DE(
+    query_data=f'{Path.cwd()}/tests/data/gcms_query_library_tuning.txt',
+    reference_data=f'{Path.cwd()}/tests/data/gcms_reference_library.txt',
+    chromatography_platform='NRMS',
+    similarity_measure='renyi',
+    optimize_params=["wf_mz","wf_int","LET_threshold","entropy_dimension"],
+    param_bounds={"wf_mz":(0.0,5.0),"wf_int":(0.0,5.0),"LET_threshold":(0,5),"entropy_dimension":(1.01,3)},
+    default_params={"noise_threshold":0.10, "wf_mz":0.0, "wf_int":1.0, "LET_threshold":0.0, "entropy_dimension":1.1},
+    de_workers=-1
+)
+```
+The CLI version can be run with:
+```
+python ../src/tuning_CLI_grid.py \
+  --query_data ${PWD}/tests/data/lcms_query_library_tuning.txt \
+  --reference_data ${PWD}/tests/data/full_GNPS_reference_library.txt \
+  --precursor_ion_mz_tolerance 0.1 \
+  --ionization_mode Positive \
+  --adduct H \
+  --chromatography_platform HRMS \
+  --similarity_measure cosine \
+  --spectrum_preprocessing_order FCNMWL \
+  --high_quality_reference_library False \
+  --mz_min 0 \
+  --mz_max 9999999 \
+  --int_min 0 \
+  --int_max 9999999 \
+  --window_size_centroiding 0.5 \
+  --window_size_matching 0.1,0.5 \
+  --noise_threshold 0.0 \
+  --wf_mz 2,3 \
+  --wf_intensity 1.0 \
+  --LET_threshold 0.0 \
+  --entropy_dimension 1.1 \
+  --output_path ${PWD}/output_tuning_HRMS_grid.txt \
+python ../src/pycompound/tuning_CLI_grid.py \
+  --query_data ${PWD}/tests/data/gcms_query_library_tuning.txt \
+  --reference_data ${PWD}/tests/data/gcms_reference_library.txt \
+  --chromatography_platform NRMS \
+  --similarity_measure cosine,shannon \
+  --spectrum_preprocessing_order FCNMWL \
+  --high_quality_reference_library False \
+  --mz_min 0 \
+  --mz_max 9999999 \
+  --int_min 0 \
+  --int_max 9999999 \
+  --noise_threshold 0.0,0.1 \
+  --wf_mz 0 \
+  --wf_intensity 1.0 \
+  --LET_threshold 0.0 \
+  --entropy_dimension 1.1 \
+  --output_path ${PWD}/output_tuning_NRMS_grid.txt \
+python ../src/pycompound/tuning_CLI_DE.py \
+  --chromatography_platform HRMS \
+  --query_data ${PWD}/data/lcms_query_library_tuning.txt \
+  --reference_data ${PWD}/data/full_GNPS_reference_library.txt \
+  --precursor_ion_mz_tolerance 0.1 \
+  --ionization_mode Positive \
+  --adduct H \
+  --similarity_measure cosine \
+  --opt window_size_centroiding noise_threshold wf_mz \
+  --bound window_size_centroiding=0.0:0.4 \
+  --bound noise_threshold=0.0:0.20 \
+  --bound wf_mz=0.0:5.0 \
+  --maxiter 3 \
+  --seed 1 \
+  --workers 5
+python ../src/pycompound/tuning_CLI_DE.py \
+  --query_data ${PWD}/tests/data/gcms_query_library_tuning.txt \
+  --reference_data ${PWD}/tests/data/gcms_reference_library.txt \
+  --chromatography_platform NRMS \
+  --similarity_measure cosine \
+  --opt noise_threshold wf_mz \
+  --bound noise_threshold=0.0:0.20 \
+  --bound wf_mz=0.0:5.0 \
+  --maxiter 3 \
+  --seed 1 \
+  --workers 4
+```
+<a name="plotting"></a>
+### 3.5 Plot a query spectrum against a reference spectrum before and after spectrum preprocessing transformations
+These functions plot a query spectrum against a reference spectrum, both before and after preprocessing. They support HRMS and NRMS data and can be used directly within Python with usage:
+```
+from pycompound.plot_spectra import generate_plots_on_HRMS_data
+from pycompound.plot_spectra import generate_plots_on_NRMS_data
+generate_plots_on_HRMS_data(
+        query_data='path_to_query_library',
+        reference_data='path_to_reference_data',
+        spectrum_ID1=None,
+        spectrum_ID2=None,
+        similarity_measure='cosine',
+        spectrum_preprocessing_order='FCNMWL',
+        high_quality_reference_library=False,
+        mz_min=0,
+        mz_max=9999999,
+        int_min=0,
+        int_max=9999999,
+        window_size_centroiding=0.5,
+        window_size_matching=0.5,
+        noise_threshold=0.0,
+        wf_mz=0.0,
+        wf_intensity=1.0,
+        LET_threshold=0.0,
+        entropy_dimension=1.1,
+        y_axis_transformation='normalized',
+        output_path=None
+)
+generate_plots_on_NRMS_data(
+        query_data='path_to_query_library',
+        reference_data='path_to_reference_data',
+        spectrum_ID1=None,
+        spectrum_ID2=None,
+        similarity_measure='cosine',
+        spectrum_preprocessing_order='FNLW',
+        high_quality_reference_library=False,
+        mz_min=0,
+        mz_max=9999999,
+        int_min=0,
+        int_max=9999999,
+        noise_threshold=0.0,
+        wf_mz=0.0,
+        wf_intensity=1.0,
+        LET_threshold=0.0,
+        entropy_dimension=1.1,
+        y_axis_transformation='normalized',
+        output_path=None
+)
+```
+To use the command line version, one can run the following from the terminal:
+```
+python plot_spectra_CLI.py \
+  --query_data ${PWD}/tests/data/lcms_query_library.txt \
+  --reference_data ${PWD}/tests/data/full_GNPS_reference_library.txt \
+  --spectrum_ID1 463514 \
+  --spectrum_ID2 112312 \
+  --chromatography_platform HRMS \
+  --similarity_measure cosine \
+  --spectrum_preprocessing_order FCNMWL \
+  --high_quality_reference_library False \
+  --mz_min 0 \
+  --mz_max 9999999 \
+  --int_min 0 \
+  --int_max 9999999 \
+  --window_size_centroiding 0.5 \
+  --window_size_matching 0.5 \
+  --noise_threshold 0.0 \
+  --wf_mz 0.0 \
+  --wf_intensity 1.0 \
+  --LET_threshold 0.0 \
+  --entropy_dimension 1.1 \
+  --output_path ${PWD}/output_plotting_HRMS.pdf \
+python plot_spectra_CLI.py \
+  --query_data ${PWD}/data/gcms_query_library.txt \
+  --reference_data ${PWD}/data/gcms_reference_library.txt \
+  --spectrum_ID1 463514 \
+  --spectrum_ID2 112312 \
+  --chromatography_platform NRMS \
+  --similarity_measure tsallis \
+  --spectrum_preprocessing_order FCNMWL \
+  --high_quality_reference_library False \
+  --mz_min 0 \
+  --mz_max 9999999 \
+  --int_min 0 \
+  --int_max 9999999 \
+  --noise_threshold 0.0 \
+  --wf_mz 0.0 \
+  --wf_intensity 1.0 \
+  --LET_threshold 0.0 \
+  --entropy_dimension 1.1 \
+  --output_path ${PWD}/output_plotting_NRMS.pdf \
+```
+An example of such a generated plot is seen below.
+<br />
+![image](https://github.com/user-attachments/assets/de22a402-1329-4bb3-a664-9423159264c8)
+<br />
+This plot compares two MS/MS spectra: Spectrum ID 1 (unknown, in blue) and Spectrum ID 2 (Hectochlorin M+H, in red). The top panel displays the untransformed spectra, while the bottom panel shows the transformed spectra following preprocessing steps. The footnote details are as follows:
+-   Filtering: Given user-defined parameters (mz_min,mz_max),
+-   Similarity Measure: Cosine -- The similarity measure used is cosine correlation.
+-   Similarity Score: 0.9946 -- The cosine similarity score between the two transformed spectra.
+-   Spectrum Preprocessing Order: FCNMWL -- The sequence of preprocessing steps applied: Filtering (F), Centroiding (C), Noise removal (N), Matching (M), Weight factor transformation (W), and Low-entropy transformation (L).
+-   High Quality Reference Library: False -- Both query and reference spectra underwent the same preprocessing transformations.
+-   Window Size (Centroiding): 0.5 -- A 0.5 Da window was used for centroiding peaks.
+-   Window Size (Matching): 0.5 -- Peaks were aligned using a 0.5 Da m/z tolerance window.
+-   Raw-Scale M/Z Range: [217.7, 628.8] -- The maximum and minimum of m/z values of peaks with non-zero intensities.
+-   Raw-Scale Intensity Range: [3885.0, 5549140] -- The maximum and minimum of absolute non-zero intensity values of the raw spectra before normalization.
+-   Noise Threshold: 0.0 -- No noise threshold was applied.
+-   Weight Factors (m/z, intensity): (0.0, 1.0) -- Non-zero intensities were transformed using weights of 0.0 for m/z and 1.0 for intensity.
+-   Low-Entropy Threshold: 0.0 -- No low-entropy transformation was applied.
+<a name="shiny"></a>
+### 3.6 Shiny application
+PyCompound is also available as a Shiny application. The Shiny application offers the same functionality as the Python package and its CLI interface. Simply run the Python script src/pycompound_shiny.py with a command such as <shiny run --launch-browser pycompound_shiny.py> to launch the Shiny application. Alternatively, one can you the publicly available web version at [https://0199ee0c-c2ce-4fdc-5ade-623633df1622.share.connect.posit.cloud/](https://0199ee0c-c2ce-4fdc-5ade-623633df1622.share.connect.posit.cloud/). If you plan to perform some heavy computations such as parameter tuning on large datasets, we recommend either using the Python package, its CLI wrapper, or running the Shiny app on your local machine to take advantage of multithreading (which isn't offered on the POSIT-hosted Shiny app).
+<a name="bugs-questions"></a>
+## 4. Bugs/Questions?
+If you notice any bugs in this software or have any questions, please create a new issue in this repository.

pycompound 0.1.6__tar.gz → 0.1.8__tar.gz

pycompound 0.1.6tar.gz → 0.1.8tar.gz