PySAR 2.5.0__py3-none-any.whl

@@ -0,0 +1,740 @@
1
+ Metadata-Version: 2.4
2
+ Name: PySAR
3
+ Version: 2.5.0
4
+ Summary: Analysing Sequence Activity Relationships (SARs) of protein sequences and their mutants using Machine Learning.
5
+ Author-email: AJ McKenna <amckenna41@qub.ac.uk>
6
+ Maintainer-email: AJ McKenna <amckenna41@qub.ac.uk>
7
+ License: MIT
8
+ Project-URL: Homepage, https://github.com/amckenna41/pySAR
9
+ Project-URL: Download, https://github.com/amckenna41/pySAR/archive/refs/heads/main.zip
10
+ Keywords: bioinformatics,protein engineering,drug discovery,python,pypi,machine learning,directed evolution,sequence activity relationships,SAR,aaindex,protpy,protein descriptors
11
+ Classifier: Development Status :: 5 - Production/Stable
12
+ Classifier: Environment :: Console
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Science/Research
15
+ Classifier: Intended Audience :: Healthcare Industry
16
+ Classifier: Intended Audience :: Information Technology
17
+ Classifier: License :: OSI Approved :: MIT License
18
+ Classifier: Natural Language :: English
19
+ Classifier: Programming Language :: Python :: 3.8
20
+ Classifier: Programming Language :: Python :: 3.9
21
+ Classifier: Programming Language :: Python :: 3.10
22
+ Classifier: Programming Language :: Python :: 3.11
23
+ Classifier: Programming Language :: Python :: 3.12
24
+ Classifier: Programming Language :: Python :: 3.13
25
+ Classifier: Programming Language :: Python :: Implementation :: PyPy
26
+ Classifier: Programming Language :: Python :: 3 :: Only
27
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
28
+ Classifier: Topic :: Scientific/Engineering :: Mathematics
29
+ Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
30
+ Requires-Python: >=3.8
31
+ Description-Content-Type: text/markdown
32
+ License-File: LICENSE
33
+ Requires-Dist: numpy>=1.21
34
+ Requires-Dist: pandas>=1.3
35
+ Requires-Dist: scipy>=1.7
36
+ Requires-Dist: delayed>=0.11
37
+ Requires-Dist: scikit-learn>=1.0
38
+ Requires-Dist: matplotlib>=3.4
39
+ Requires-Dist: seaborn>=0.11
40
+ Requires-Dist: tqdm>=4.60
41
+ Requires-Dist: aaindex>=1.2.0
42
+ Requires-Dist: protpy>=1.3.0
43
+ Provides-Extra: test
44
+ Requires-Dist: pytest; extra == "test"
45
+ Requires-Dist: pytest-cov; extra == "test"
46
+ Requires-Dist: pytest-flake8; extra == "test"
47
+ Requires-Dist: pytest-timeout; extra == "test"
48
+ Provides-Extra: docs
49
+ Requires-Dist: sphinx; extra == "docs"
50
+ Dynamic: license-file
51
+
52
+ <p align="center">
53
+ <img src="https://raw.githubusercontent.com/amckenna41/pySAR/master/images/pySAR.png" alt="pySARLogo" height="300" width="400"/>
54
+ </p>
55
+
56
+ # pySAR - Python Sequence Activity Relationship #
57
+ [![PyPI](https://img.shields.io/pypi/v/pySAR)](https://pypi.org/project/pySAR/)
58
+ [![pytest](https://github.com/amckenna41/pySAR/workflows/Building%20and%20Testing/badge.svg)](https://github.com/amckenna41/pySAR/actions?query=workflowBuilding%20and%20Testing)
59
+ [![Platforms](https://img.shields.io/badge/platforms-linux%2C%20macOS%2C%20Windows-green)](https://pypi.org/project/pySAR/)
60
+ [![PythonV](https://img.shields.io/pypi/pyversions/pySAR?logo=2)](https://pypi.org/project/pySAR/)
61
+ [![Documentation Status](https://readthedocs.org/projects/pysar/badge/?version=latest)](https://pysar.readthedocs.io/en/latest/?badge=latest)
62
+ [![License: MIT](https://img.shields.io/badge/License-MIT-red.svg)](https://opensource.org/licenses/MIT)
63
+ [![Issues](https://img.shields.io/github/issues/amckenna41/pySAR)](https://github.com/amckenna41/pySAR/issues)
64
+ [![codecov](https://codecov.io/gh/amckenna41/pySAR/branch/master/graph/badge.svg?token=4PQDVGKGYN)](https://codecov.io/gh/amckenna41/pySAR)
65
+ <!-- [![Commits](https://img.shields.io/github/commit-activity/w/amckenna41/pySAR)](https://github.com/amckenna41/pySAR) -->
66
+ <!-- [![Size](https://img.shields.io/github/repo-size/amckenna41/pySAR)](https://github.com/amckenna41/pySAR) -->
67
+ <!-- [![Build](https://img.shields.io/github/workflow/status/amckenna41/pySAR/Deploy%20to%20PyPI%20%F0%9F%93%A6)](https://github.com/amckenna41/pySAR/actions) -->
68
+ <!-- [![Build Status](https://travis-ci.com/amckenna41/pySAR.svg?branch=main)](https://travis-ci.com/amckenna41/pySAR) -->
69
+ <!-- [![DOI](https://zenodo.org/badge/344290370.svg)](https://zenodo.org/badge/latestdoi/344290370) -->
70
+ <!-- [![Documentation Status](https://readthedocs.org/projects/ansicolortags/badge/?version=latest)](http://ansicolortags.readthedocs.io/?badge=latest) -->
71
+
72
+ `pySAR` is a Python library for analysing Sequence Activity Relationships (SARs)/Sequence Function Relationships (SFRs) of protein sequences.
73
+
74
+ * 📖 The published research article is available [here][article].
75
+ * 🌍 A front-end app for `pySAR` is available [here][frontend] (coming soon).
76
+ * 💻 A quick Colab notebook demo of `pySAR` is available [here][demo].
77
+ * 📰 A **Medium** article that dives deeper into SARs and the `pySAR` software itself is available [here][medium].
78
+
79
+ Table of Contents
80
+ =================
81
+ * [Introduction](#introduction)
82
+ * [Background](#background)
83
+ * [Requirements](#requirements)
84
+ * [Installation](#installation)
85
+ * [Documentation](#documentation)
86
+ * [Usage](#usage)
87
+ * [Directories](#directories)
88
+ * [Issues](#issues)
89
+ * [Contact](#contact)
90
+ * [License](#license)
91
+ * [References](#references)
92
+
93
+
94
+ Research Article
95
+ ================
96
+ The research article accompanying this software is titled [Machine Learning Based Predictive Model for the Analysis of Sequence Activity Relationships Using Protein Spectra and Protein Descriptors][article] and was published in the Journal of Biomedical Informatics [[1]](#references).
97
+
98
+ How to cite
99
+ ===========
100
+ > Mckenna, A., & Dubey, S. (2022). Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. Journal of Biomedical Informatics, 128(104016), 104016. https://doi.org/10.1016/j.jbi.2022.104016
101
+
102
+ Introduction
103
+ ============
104
+ `pySAR` is a Python library for analysing **Sequence Activity Relationships (SARs)/Sequence Function Relationships (SFRs)** of protein sequences. `pySAR` lets you numerically encode a dataset of protein sequences using a wide range of methodologies and features, supporting **400,000+ different encoding strategies**. The software uses physicochemical and biochemical features from the Amino Acid Index (AAI) database [[2]](#references) via the custom-built [`aaindex`][aaindex] package, and can calculate a range of structural, physicochemical and biochemical protein descriptors via the custom-built [`protpy`][protpy] package.
105
+
106
+ After finding the optimal technique and feature set with which to numerically encode your dataset of sequences, `pySAR` can be used to build a **predictive regression ML model**, where the training data are the encoded protein sequences and the training labels are the in vitro experimentally determined activity values for each protein sequence. This model maps a set of protein sequences to the sought-after activity value and can then predict the activity/fitness value of new unseen sequences. The use-case for the software is within the fields of **Protein Engineering**, **Directed Evolution** and/or **Drug Discovery**, where a user has a set of in vitro experimentally determined activity/fitness values for a library of **mutant protein sequences** and wants to computationally predict the sought activity value for a selection of unseen mutated sequences, with the aim of finding the sequence that minimises/maximises the activity value. <br>
107
+
108
+ In the published [research][article], the sought activity/fitness characteristic is the **thermostability** of proteins from a recombination library designed from parental cytochrome P450s. This thermostability is measured using the T50 metric (the temperature at which 50% of a protein is irreversibly denatured after 10 minutes of incubation, ranging from 39.2 to 64.4 degrees C), which we want to maximise [[1]](#references).
109
+
110
+ Two additional <strong>custom-built</strong> software packages were created alongside `pySAR`: [`aaindex`][aaindex] and [`protpy`][protpy]. The `aaindex` package is used for parsing the Amino Acid Index, a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids [[2]](#references). `protpy` is used for calculating a series of physicochemical, biochemical and structural protein descriptors. Both packages are integrated into `pySAR` but can also be used individually for their respective purposes.
111
+
112
+ Background
113
+ ==========
114
+ Accurately establishing the connection between a protein sequence and its function remains a focal point within the fields of proteomics, protein engineering and drug discovery. There has been a continued drive to build accurate and reliable predictive models via Machine Learning (ML) that allow for the virtual screening of many protein mutant sequences, measuring the relationship between sequence and 'fitness' or 'activity', commonly known as a Sequence-Activity-Relationship (SAR) or Sequence-Function-Relationship (SFR). Due to the cost and impracticality of experimentally measuring these activity/fitness values for large libraries of mutant sequences, it is of great benefit to accelerate and automate this process computationally.
115
+
116
+ An important preliminary stage in the building of these predictive models is the numerical encoding of the chosen protein sequences, as sequences and their constituent amino acids cannot be directly passed into ML models. `pySAR` primarily focuses on encoding strategies involving the Amino Acid Index database and a variety of sequence-derived physicochemical and biochemical descriptors. Taking into account the various combinations of features and descriptors, `pySAR` supports **400,000+ different encoding strategies**.
117
+
118
+ [Directed Evolution (DE)][directed_evolution] is a prominent real-world use-case: a methodology for protein engineering that mimics natural selection, optimising a protein through iterative rounds of mutagenesis, selection, and amplification. `pySAR` can support such workflows by computationally predicting which mutant sequences are most likely to yield the desired activity value, reducing the burden of wet-lab experimentation.
119
+
120
+ Requirements
121
+ ============
122
+ * [python][python] >= 3.8
123
+ * [aaindex][aaindex] >= 1.2.0
124
+ * [protpy][protpy] >= 1.3.0
125
+ * [numpy][numpy] >= 1.21
126
+ * [pandas][pandas] >= 1.3
127
+ * [scikit-learn][sklearn] >= 1.0
128
+ * [scipy][scipy] >= 1.7
129
+ * [delayed][delayed] >= 0.11
130
+ * [tqdm][tqdm] >= 4.60
131
+ * [matplotlib][matplotlib] >= 3.4
132
+ * [seaborn][seaborn] >= 0.11
133
+
134
+ Installation
135
+ ============
136
+ Install the latest version of `pySAR` via [PyPi][PyPi] using pip:
137
+
138
+ ```bash
139
+ pip3 install pysar --upgrade
140
+ ```
141
+
142
+ Installation from source:
143
+ ```bash
144
+ git clone -b master https://github.com/amckenna41/pySAR.git
145
+ cd pySAR
146
+ pip3 install .
147
+ ```
148
+
149
+ Documentation
150
+ =============
151
+ Full documentation for `pySAR` is available on [Read the Docs](https://pysar.readthedocs.io/en/latest/).
152
+
153
+ Usage
154
+ =====
155
+ ### Config File
156
+ `pySAR` works mainly via JSON configuration files. Many parameters of the functionalities in `pySAR` are customisable, including the metaparameters of some of the available protein descriptors, all Digital Signal Processing (DSP) parameters in the `pyDSP` module, the type of regression model to use and parameters specific to the dataset. A description of each parameter is available in the [CONFIG.md][config] file.
157
+
158
+ These config files offer a more straightforward way of making changes to the `pySAR` pipeline. The names of **all** the parameters listed in the example config files must remain unchanged; only the value of each parameter should be changed, and any parameters not being used can be set to <em>null</em>. Additionally, you can pass individual parameter names and values to the `PySAR` and `Encoding` classes via **kwargs** when numerically encoding the protein sequences. An example of the config file used in my research project ([thermostability.json](https://github.com/amckenna41/pySAR/blob/master/config/thermostability.json)), with most of the available parameters, can be seen below and in [CONFIG.md][config].
159
+
160
+ ```json
161
+ {
162
+ "dataset":
163
+ {
164
+ "dataset": "thermostability.txt",
165
+ "sequence_col": "sequence",
166
+ "activity": "T50"
167
+ },
168
+ "model":
169
+ {
170
+ "algorithm": "plsregression",
171
+ "parameters": "",
172
+ "test_split": 0.2
173
+ },
174
+ "descriptors":
175
+ {
176
+ "descriptors_csv": "descriptors_thermostability.csv",
177
+ "moreaubroto_autocorrelation":
178
+ {
179
+ "lag":30,
180
+ "properties":["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
181
+ "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
182
+ "normalize": 1
183
+ },
184
+ ...
185
+ },
186
+ "pyDSP":
187
+ {
188
+ "use_dsp": 1,
189
+ "spectrum": "power",
190
+ "window": {
191
+ "type": "hamming",
192
+ ...
193
+ },
194
+ "filter": {
195
+ "type": null,
196
+ ...
197
+ }
198
+ }
199
+ }
200
+ ```
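As a rough illustration of the layout above, the sketch below parses a trimmed-down config with Python's standard `json` module and applies keyword-style overrides to one section. The helper `override_section` is hypothetical and is not part of `pySAR`; it only mirrors the rule that parameter names stay fixed while their values change.

```python
import json

# A trimmed-down config in the same layout as the example above.
CONFIG_TEXT = """
{
  "dataset": {"dataset": "thermostability.txt", "sequence_col": "sequence", "activity": "T50"},
  "model": {"algorithm": "plsregression", "parameters": "", "test_split": 0.2},
  "pyDSP": {"use_dsp": 1, "spectrum": "power",
            "window": {"type": "hamming"}, "filter": {"type": null}}
}
"""


def override_section(config, section, **overrides):
    """Return a copy of config with values in one section replaced.

    Hypothetical helper (not a pySAR function): unknown parameter names are
    rejected, since only the values of existing parameters should change.
    """
    unknown = set(overrides) - set(config[section])
    if unknown:
        raise KeyError(f"unknown parameter(s) in '{section}': {sorted(unknown)}")
    updated = dict(config)  # shallow copy so the original config is untouched
    updated[section] = {**config[section], **overrides}
    return updated


config = json.loads(CONFIG_TEXT)
rf_config = override_section(config, "model", algorithm="randomforest", test_split=0.3)

print(rf_config["model"])
print(config["pyDSP"]["filter"]["type"])  # JSON null is parsed as Python None
```

Rejecting unknown keys is a design choice of this sketch; how `pySAR` itself treats unexpected parameter names is not shown here.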
201
+
202
+ ### Examples
203
+
204
+ <details><summary><b>Encoding protein sequences using all 566 AAIndex indices:</b></summary><br>
205
+ Encode the protein sequences in the dataset using all <b>566 indices</b> in the <b>AAI1 database</b>. Each sequence encoded via an index in the AAI can be passed through an additional step in which its protein spectrum is generated via an <b>FFT</b>. pySAR supports generation of the power, imaginary, real or absolute spectra, as well as other DSP functionalities including windowing and filter functions.
206
+
207
+ In the example below, the encoded sequences will be used to generate an **imaginary protein spectrum** with a **blackman** window function applied. This will then be used as feature data to build a predictive regression ML model for predicting the sought activity value (**thermostability**) of unseen protein sequences. The `Encoding` class also takes the **JSON config file** as input, which holds all the required parameter values. The output results show the calculated metric values for each index in the AAI when measuring predicted vs observed activity values for the unseen test sequences.<br>
208
+
209
+ ```python
210
+ from pySAR.encoding import Encoding
211
+
212
+ '''thermostability.json
213
+ {
214
+ "dataset":
215
+ {
216
+ "dataset": "thermostability.txt",
217
+ "activity": "T50"
218
+ ...
219
+ }
220
+ "model":
221
+ {
222
+ "algorithm": "randomforest",
223
+ ...
224
+ }
225
+ "pyDSP":
226
+ {
227
+ "use_dsp": 1,
228
+ "spectrum": "imaginary",
229
+ "window": {
230
+ "type": "blackman"
231
+ }
232
+ }
233
+ }
234
+ '''
235
+ #create instance of Encoding class, using RF algorithm with its default params
236
+ encoding = Encoding(config_file='thermostability.json')
237
+
238
+ #encode sequences using all indices in the AAI if input parameter "aai_indices" is empty/None
239
+ aai_encoding = encoding.aai_encoding()
240
+
241
+ ```
242
+ Output results showing AAI index and its category as well as all the associated metric values for each predictive model. From the results below we can determine that the **CHOP780206** index in the AAI has the highest predictability (**R2 score**) for our chosen dataset (thermostability) and this generated model can be used for predicting the thermostability of new unseen sequences:
243
+
244
+ | | Index | Category | R2 | RMSE | MSE | RPD | MAE | Explained Variance |
245
+ |---:|:-----------|:-----------|---------:|--------:|--------:|--------:|--------:|---------------------:|
246
+ | 0 | CHOP780206 | secondary_struct | 0.62737 | 3.85619 | 14.8702 | 1.63818 | 3.16755 | 0.713467 |
247
+ | 1 | QIAN880131 | secondary_struct | 0.626689 | 3.90576 | 15.255 | 1.63668 | 3.09849 | 0.631582 |
248
+ | 2 | QIAN880118 | secondary_struct | 0.625156 | 3.99581 | 15.9665 | 1.63333 | 3.32038 | 0.625897 |
249
+ | 3 | PRAM900104 | secondary_struct | 0.615866 | 3.90389 | 15.2403 | 1.61346 | 3.24906 | 0.617799 |
250
+ | .. | .......... | .......... | ........ | ....... | ....... | ....... | ....... | ........... |
251
+ </details>
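The encode-then-transform step described in this example can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the `pyDSP` implementation: the index values in `TOY_INDEX` are made up for illustration (they are not a real AAIndex record), the transform is a naive DFT rather than an optimised FFT, and the window is applied before the transform, which is the conventional order and may differ from what `pyDSP` does internally.

```python
import cmath
import math

# Hypothetical per-residue values for illustration only; a real run would
# take these from an AAIndex record such as CHOP780206.
TOY_INDEX = {"A": 0.5, "C": -1.0, "D": 1.5, "E": 2.0, "G": 0.0}


def encode(sequence, index):
    """Map each amino acid letter to its numerical index value."""
    return [index[aa] for aa in sequence]


def hamming(n):
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def spectra(sequence, index):
    """Encode a sequence, apply a window, and return the four spectra."""
    encoded = encode(sequence, index)
    windowed = [x * w for x, w in zip(encoded, hamming(len(encoded)))]
    coeffs = dft(windowed)
    return {
        "real": [c.real for c in coeffs],
        "imaginary": [c.imag for c in coeffs],
        "absolute": [abs(c) for c in coeffs],
        "power": [abs(c) ** 2 for c in coeffs],
    }


result = spectra("ACDEGA", TOY_INDEX)
print([round(p, 4) for p in result["power"]])
```

Each spectrum has one value per sequence position, so any of the four can serve as a fixed-length feature vector for sequences of equal length.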
252
+
253
+ <details><summary><b>Encoding using list of 4 AAI indices, with no DSP functionalities:</b></summary><br>
254
+ This method follows a similar procedure to the previous example, except <b>4 indices</b> from the AAI are passed to the function explicitly, with the encoded sequence output being concatenated together and used as feature data to build the predictive <b>PLSRegression</b> model with its default parameters. Setting the config parameter <em>use_dsp</em> to 0 tells the function not to generate the protein spectra or apply any additional DSP processing to the sequences.<br>
255
+
256
+ ```python
257
+ from pySAR.encoding import Encoding
258
+
259
+ '''thermostability.json
260
+ {
261
+ "dataset":
262
+ {
263
+ "dataset": "thermostability.txt",
264
+ "activity": "T50"
265
+ ...
266
+ }
267
+ "model":
268
+ {
269
+ "algorithm": "plsreg",
270
+ "parameters": null
271
+ }
272
+ "pyDSP":
273
+ {
274
+ "use_dsp": 0,
275
+ ...
276
+ }
277
+ }
278
+ '''
279
+ #create instance of Encoding class, using PLS algorithm with its default params
280
+ encoding = Encoding(config_file='thermostability.json')
281
+
282
+ #encode sequences using 4 indices specified by user, use_dsp = False
283
+ aai_encoding = encoding.aai_encoding(aai_indices=["PONP800102","RICJ880102","ROBB760107","KARS160113"])
284
+
285
+ ```
286
+ Output DataFrame showing the 4 predictive models built using the PLS algorithm, with the 4 indices from the AAI. From the results below we can determine that the **PONP800102** index in the AAI has the highest predictability (<b>R2 score</b>) for our chosen dataset (<b>thermostability</b>) and this generated model can be used for predicting the thermostability of unseen sequences:
287
+
288
+ | | Index | Category | R2 | RMSE | MSE | RPD | MAE | Explained Variance |
289
+ |---:|:-----------|:------------|---------:|--------:|---------:|--------:|--------:|---------------------:|
290
+ | 0 | PONP800102 | hydrophobic | 0.74726 | 3.0817 | 9.49688 | 1.98913 | 2.63742 | 0.751032 |
291
+ | 1 | ROBB760107 | secondary_struct | 0.666527 | 3.19801 | 10.2273 | 1.73169 | 2.50305 | 0.668255 |
292
+ | 2 | RICJ880102 | secondary_struct | 0.568067 | 3.83976 | 14.7438 | 1.52157 | 3.01342 | 0.568274 |
293
+ | 3 | KARS160113 | meta | 0.544129 | 4.04266 | 16.3431 | 1.48108 | 3.26047 | 0.544693 |
294
+
295
+ </details>
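The metric columns in the tables above follow standard regression definitions, sketched below in plain Python. The observed and predicted values are made up for illustration. Defining RPD as the standard deviation of the observed values divided by the RMSE is an assumption about the convention used, though it is consistent with the tables, where RPD is approximately 1/sqrt(1 - R2) (for example 1/sqrt(1 - 0.74726) ≈ 1.989).

```python
import math


def regression_metrics(observed, predicted):
    """Compute common regression metrics from paired observed/predicted values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n

    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n

    # Total sum of squares of the observed values around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - (mse * n) / ss_tot

    # RPD: population standard deviation of the observed values over RMSE.
    rpd = math.sqrt(ss_tot / n) / rmse

    # Explained variance: 1 - Var(errors) / Var(observed).
    mean_err = sum(errors) / n
    var_err = sum((e - mean_err) ** 2 for e in errors) / n
    explained_variance = 1 - var_err / (ss_tot / n)

    return {
        "R2": r2,
        "RMSE": rmse,
        "MSE": mse,
        "RPD": rpd,
        "MAE": mae,
        "Explained Variance": explained_variance,
    }


# Made-up T50-like values, not taken from the thermostability dataset.
observed = [50.0, 55.0, 60.0, 45.0]
predicted = [52.0, 54.0, 57.0, 47.0]

metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

R2 and explained variance coincide here because the toy errors average to zero; they diverge when predictions are biased, which is why some rows in the tables show different values for the two.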
296
+
297
+ <details><summary><b>Encoding protein sequences using all available protein descriptors:</b></summary><br>
298
+ Calculate the protein descriptor values for a dataset of protein sequences from the 33 available descriptors in the <em>descriptors</em> module. Each descriptor is used as a <b>feature set</b> in building the predictive ML models used to predict the <b>activity</b> value of unseen sequences. By default, the function will look for the csv file pointed to by the <em>"descriptors_csv"</em> parameter in the config file, which contains the pre-calculated descriptor values for a dataset. If the file is not found, all descriptor values will be calculated for the dataset using the <em>descriptors</em> module and the custom-built <i>protpy</i> package.
299
+
300
+ ```python
301
+ from pySAR.encoding import Encoding
302
+
303
+ '''thermostability.json
304
+ {
305
+ "dataset":
306
+ {
307
+ "dataset": "thermostability.txt",
308
+ "activity": "T50"
309
+ ...
310
+ }
311
+ "model":
312
+ {
313
+ "algorithm": "adaboost",
314
+ "parameters": [{
315
+ "estimators": 100,
316
+ "learning_rate": 1.5
317
+ ...
318
+ }]},
319
+ "descriptors":
320
+ {
321
+ "descriptors_csv": "descriptors_thermostability.csv",
322
+ "moreaubroto_autocorrelation": {
323
+ "lag": 30,
324
+ "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
325
+ "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
326
+ "normalize": 1
327
+ },
328
+ ...
329
+ }
330
+ }
331
+ '''
332
+ #create instance of Encoding class using AdaBoost algorithm, using 100 estimators & a learning rate of 1.5
333
+ encoding = Encoding(config_file='thermostability.json')
334
+
335
+ #building predictive models using all available descriptors, calculating evaluation metrics values for
336
+ # models and storing into desc_results_df DataFrame
337
+ desc_results_df = encoding.descriptor_encoding()
338
+ ```
339
+ Output results showing the protein descriptor and its group as well as all the associated metric values for each predictive model. From the results below we can determine that the **CTD Distribution** descriptor has the highest predictability (<b>R2 score</b>) for our chosen dataset (thermostability) and this generated model can be used for predicting the thermostability of unseen sequences:
340
+
341
+ | | Descriptor | Group | R2 | RMSE | MSE | RPD | MAE | Explained Variance |
342
+ |---:|:------------------------|:----------------|---------:|--------:|--------:|--------:|--------:|---------------------:|
343
+ | 0 | ctd_d | CTD | 0.721885 | 3.26159 | 10.638 | 1.89621 | 2.60679 | 0.727389 |
344
+ | 1 | geary_autocorrelation | Autocorrelation | 0.648121 | 3.67418 | 13.4996 | 1.68579 | 2.82868 | 0.666745 |
345
+ | 2 | tripeptide_composition | Composition | 0.616577 | 3.3979 | 11.5457 | 1.61496 | 2.53736 | 0.675571 |
346
+ | 3 | amino_acid_composition | Composition | 0.612824 | 3.37447 | 11.3871 | 1.60711 | 2.79698 | 0.643864 |
347
+ | 4 | ...... | ...... | ...... | ...... | ...... | ...... | ...... | ...... |
348
+ </details>
349
+
350
+ <details><summary><b>Encoding using AAI + protein descriptors:</b></summary><br>
351
+ Encoding protein sequences in the dataset using <b>ALL 566 indices</b> in the AAI database combined with <b>ALL available protein descriptors</b>. All 566 indices can be used in concatenation with 1, 2 or 3 descriptors. At each iteration the encoded sequences generated from the indices from the AAI will be combined with the <b>feature set</b> generated from the dataset's descriptor values and used to build a predictive regression ML model that can be used for the accurate prediction of the sought <b>activity/fitness</b> value of unseen protein sequences. The output results will show the calculated metric values when measuring predicted vs observed activity values for the test sequences.<br>
352
+
353
+ ```python
354
+ from pySAR.encoding import Encoding
355
+
356
+ '''thermostability.json
357
+ {
358
+ "dataset":
359
+ {
360
+ "dataset": "thermostability.txt",
361
+ "activity": "T50"
362
+ ...
363
+ }
364
+ "model":
365
+ {
366
+ "algorithm": "randomforest",
367
+ "parameters":
368
+ {
369
+ "estimators": 100,
370
+ "learning_rate": 1.5,
371
+ ...
372
+ }
373
+ },
374
+ "descriptors":
375
+ {
376
+ "descriptors_csv": "descriptors_thermostability.csv",
377
+ "moreaubroto_autocorrelation": {
378
+ "lag": 30,
379
+ "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
380
+ "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
381
+ "normalize": 1
382
+ },
383
+ ...
384
+ },
385
+ "pyDSP":
386
+ {
387
+ "use_dsp": 0,
388
+ "spectrum": "power",
389
+ "window": ""
390
+ ...
391
+ }
392
+ }
393
+ '''
394
+ #create instance of Encoding class using RF algorithm, using 100 estimators with a learning rate of 1.5 - as listed in config
395
+ encoding = Encoding('thermostability.json')
396
+
397
+ #building predictive models using all available aa_indices + descriptors, calculating evaluation metric values for models and storing into aai_desc_results_df DataFrame
398
+ aai_desc_results_df = encoding.aai_descriptor_encoding()
399
+ ```
400
+
401
+ Output results showing AAI index and its category, the protein descriptor and its group as well as all output metric values for each predictive model. From the results below we can determine that the **ARGP820103** index in concatenation with the **Conjoint Triad** descriptor has the highest predictability (<b>R2 score</b>) for our chosen dataset (<b>thermostability</b>) and this generated model can be used for predicting the thermostability of unseen sequences:
402
+
403
+ | | Index | Category | Descriptor | Descriptor Group | R2 | RMSE |
404
+ |---:|:-----------|:------------|:---------------------------|:---------------------|---------:|--------:|
405
+ | 0 | ARGP820103 | composition | _conjoint_triad | Conjoint Triad | 0.72754 | 3.22135 |
406
+ | 1 | ARGP820101 | hydrophobic | _quasi_seq_order | Quasi-Sequence-Order | 0.722284 | 3.30995 |
407
+ | 2 | ARGP820101 | hydrophobic | _seq_order_coupling_number | Quasi-Sequence-Order | 0.722158 | 3.34926 |
408
+ | 3 | ANDN920101 | observable | _seq_order_coupling_number | Quasi-Sequence-Order | 0.70826 | 3.25232 |
409
+ | 4 | ..... | ..... | ..... | ..... | ..... | ..... |
410
+ </details>
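The pairing of each AAI index with groups of 1, 2 or 3 descriptors can be enumerated with `itertools`. The sketch below uses two indices and three descriptor names taken from the examples in this document; the helper `descriptor_groups` is hypothetical, and the iteration order `pySAR` uses internally is not assumed.

```python
from itertools import combinations, product

# Small illustrative subsets; a full run would use all indices and descriptors.
indices = ["CIDH920105", "BHAR880101"]
descriptors = ["amino_acid_composition", "ctd_d", "geary_autocorrelation"]


def descriptor_groups(names, max_size=3):
    """Yield every combination of 1 up to max_size descriptor names."""
    for size in range(1, max_size + 1):
        yield from combinations(names, size)


# Each strategy pairs one AAI index with one group of descriptors.
strategies = list(product(indices, descriptor_groups(descriptors)))

print(len(strategies))  # 14
print(strategies[0])
```

With 3 descriptors there are 3 + 3 + 1 = 7 groups, so 2 indices give 14 strategies; the count grows quickly as more indices and descriptors are included.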
411
+
412
+
413
+ <details><summary><b>Building predictive model from subset of AAI and protein descriptors:</b></summary><br>
414
+ The code below builds a <b>PLSRegression model</b> using the AAI index <b>CIDH920105</b> and the <b>amino acid composition</b> descriptor. The index is passed through a DSP pipeline and transformed into its <b>informational protein spectra</b> using the <b>power spectra</b>, with a <b>hamming window</b> function applied to the output of the FFT. The concatenated features from the AAI index and the descriptor are used as the feature data in building the PLS ML model. The model's <b>predictability</b> is then assessed by testing it on unseen test sequences. The output results show the calculated metric values when measuring <b>predicted vs observed activity values</b> for the test sequences.<br>
415
+
416
+ ```python
417
+ from pySAR.pySAR import PySAR
418
+
419
+ '''thermostability.json
420
+ {
421
+ "dataset":
422
+ {
423
+ "dataset": "thermostability.txt",
424
+ "activity": "T50"
425
+ ...
426
+ },
427
+ "model":
428
+ {
429
+ "algorithm": "plsregression",
430
+ "parameters": "",
431
+ ...
432
+ },
433
+ "descriptors":
434
+ {
435
+ "descriptors_csv": "descriptors_thermostability.csv",
436
+ "moreaubroto_autocorrelation": {
437
+ "lag": 30,
438
+ "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
439
+ "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
440
+ "normalize": 1
441
+ },
442
+ ...
443
+ },
444
+ "pyDSP":
445
+ {
446
+ "use_dsp": 1,
447
+ "spectrum": "power",
448
+ "window": "hamming",
449
+ ...
450
+ }
451
+ }
452
+ '''
453
+ #create instance of PySAR class, inputting path to configuration file
454
+ pySAR = PySAR(config_file="thermostability.json")
455
+
456
+ #encode protein sequences using both the CIDH920105 index + aa_composition descriptor
457
+ results_df = pySAR.encode_aai_descriptor(aai_indices="CIDH920105", descriptors="amino_acid_composition")
458
+ ```
459
+
460
+ Output results showing the AAI index and its category, the protein descriptor and its group, as well as the metric values for the generated predictive model. From the results below we can determine that the **CIDH920105** index in concatenation with the **Amino Acid Composition** descriptor has medium predictability (R2 score) but a high error rate (MSE/RMSE) for our chosen dataset (thermostability), so this feature set combination is not that effective for predicting the thermostability of unseen sequences:
461
+
462
+ ```python
463
+ ##########################################################################################
464
+ ###################################### Parameters ########################################
465
+
466
+ # AAI Indices: CIDH920105
467
+ # Descriptors: amino_acid_composition
468
+ # Configuration File: thermostability_config.json
469
+ # Dataset: thermostability.txt
470
+ # Number of Sequences/Sequence Length: 261 x 466
471
+ # Target Activity: T50
472
+ # Algorithm: PLSRegression
473
+ # Model Parameters: {'copy': True, 'max_iter': 500, 'n_components': 2, 'scale': True,
474
+ #'tol': 1e-06}
475
+ # Test Split: 0.2
476
+ # Feature Space: (261, 486)
477
+
478
+ ##########################################################################################
479
+ ######################################## Results #########################################
480
+
481
+ # R2: 0.6720111107323943
482
+ # RMSE: 3.7522525079464457
483
+ # MSE: 14.079398883390391
484
+ # MAE: 3.0713217158459805
485
+ # RPD 1.7461053136208489
486
+ # Explained Variance 0.6721157080699659
487
+
488
+ ##########################################################################################
489
+ ```
490
+ </details>
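The feature space of (261, 486) reported above is consistent with concatenating the 466 values of each encoded sequence with the 20 values of the amino acid composition descriptor (466 + 20 = 486). The sketch below shows that column-wise concatenation using placeholder zeros instead of real encodings, and assumes the spectrum keeps one value per sequence position.

```python
NUM_SEQUENCES = 261      # sequences in the thermostability dataset
SEQUENCE_LENGTH = 466    # values per AAI-encoded sequence
NUM_AAC_FEATURES = 20    # one composition value per canonical amino acid

# Placeholder feature matrices (zeros stand in for the real values).
aai_features = [[0.0] * SEQUENCE_LENGTH for _ in range(NUM_SEQUENCES)]
aac_features = [[0.0] * NUM_AAC_FEATURES for _ in range(NUM_SEQUENCES)]

# Concatenate the two feature sets row by row.
feature_space = [
    aai_row + desc_row for aai_row, desc_row in zip(aai_features, aac_features)
]

shape = (len(feature_space), len(feature_space[0]))
print(shape)  # (261, 486)
```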
491
+
492
+ <details><summary><b>Calculate individual descriptor values, e.g. Tripeptide Composition and Geary Autocorrelation:</b></summary><br>
493
+ The individual protein descriptor values for the dataset of protein sequences can be calculated using the custom-built <b>protpy</b> package via the <i>descriptors</i> module. The full list of descriptors can be seen via the function <i>all_descriptors_list()</i> as well as on the <b>protpy</b> repo homepage.
494
+
495
+ ```python
496
+ from pySAR.descriptors import Descriptors
497
+
498
+ #create instance of descriptors class
499
+ desc = Descriptors(config_file="thermostability.json")
500
+
501
+ #calculate tripeptide composition descriptor
502
+ tripeptide_composition = desc.get_tripeptide_composition()
503
+
504
+ #calculate geary autocorrelation descriptor
505
+ geary_autocorrelation = desc.get_geary_autocorrelation()
506
+ ```
507
+ </details>
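To give a sense of what a composition descriptor contains, the sketch below computes the simplest one, amino acid composition, as the percentage of each of the 20 canonical amino acids in a sequence. This follows the standard definition only; the output layout, rounding and column naming used by `protpy` are not reproduced, and the input is a toy sequence.

```python
from collections import Counter

CANONICAL_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Percentage of each canonical amino acid in the sequence."""
    counts = Counter(sequence)
    return {
        aa: 100.0 * counts[aa] / len(sequence) for aa in CANONICAL_AMINO_ACIDS
    }


composition = amino_acid_composition("AACD")  # toy sequence
print(composition["A"], composition["C"], composition["D"])  # 50.0 25.0 25.0
```

The result always has 20 values regardless of sequence length, which is what makes it usable as a fixed-size feature set.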
508
+
509
+ <details><summary><b>Calculate and export all protein descriptors:</b></summary><br>
510
+ Before evaluating the various available properties and features with which to encode a set of protein sequences, it is recommended that you pre-calculate all the available descriptors in one go and save them to a csv that <i>pySAR</i> will later import from. Output values are stored in the csv set by the <i>descriptors_csv</i> config parameter (the name of the exported csv can also be passed into the function via the <i>descriptors_export_filename</i> parameter). The output will be of shape N x M, where N is the number of protein sequences in the dataset and M is the total number of features calculated from all 33 descriptors, which varies depending on some descriptor-specific metaparameters. For example, using the thermostability dataset, the output will be 261 x 10572. <br>
511
+
512
+ ```python
513
+ '''thermostability.json
514
+ {
515
+ "dataset":
516
+ {
517
+ "dataset": "thermostability.txt",
518
+ "activity": "T50"
519
+ ...
520
+ },
521
+ "model":
522
+ {
523
+ ...
524
+ }
525
+ "descriptors":
526
+ {
527
+ "descriptors_csv": "descriptors_thermostability.csv",
528
+ "moreaubroto_autocorrelation": {
529
+ "lag": 30,
530
+ "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
531
+ "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
532
+ "normalize": 1
533
+ },
534
+ ...
535
+ },
536
+ "pyDSP":
537
+ {
538
+ ...
539
+ }
540
+ }
541
+ '''
542
+ #import descriptors class
543
+ from pySAR.descriptors import Descriptors
544
+
545
+ #create instance of descriptors class
546
+ desc = Descriptors(config_file="thermostability.json")
547
+
548
+ #export all descriptors to csv using parameters in config, export=True will export to csv
549
+ desc.get_all_descriptors(export=True, descriptors_export_filename="descriptors_thermostability.csv")
550
+ ```
551
+ </details>
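Once a descriptors csv has been exported, its N x M shape can be checked without any `pySAR`-specific calls. The sketch below uses only the Python standard library. The helper name `csv_shape`, the file name and the toy column names are hypothetical, and it assumes the csv has a single header row and no separate index column, which may not match the real export exactly.

```python
import csv
import os
import tempfile

def csv_shape(path):
    """Return (rows, columns) of a csv file, not counting the header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)            # first row holds the feature names
        n_rows = sum(1 for _ in reader)  # remaining rows: one per sequence
    return n_rows, len(header)

# toy stand-in for an exported descriptors csv (3 sequences x 4 features);
# the column names are made up for illustration
toy_csv = os.path.join(tempfile.mkdtemp(), "descriptors_toy.csv")
with open(toy_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["feat_1", "feat_2", "feat_3", "feat_4"])
    writer.writerows([[0.1, 0.2, 0.3, 0.4]] * 3)

shape = csv_shape(toy_csv)
print(shape)  # (3, 4)
```

Pointing `csv_shape` at the file named by `descriptors_csv` would give a quick sanity check that the number of rows matches the number of sequences in the dataset.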
552
+
553
+ <details><summary><b>Get record from AAIndex database:</b></summary><br>
554
+ A custom-built package called <b>aaindex</b> was created for this project to work with the data in the AAIndex databases, primarily the <b>aaindex1</b>. The package provides access to every record in the <b>aaindex1</b>. Each record is stored in json format and can be retrieved via its accession number or searched for via its name/description. Each record contains the following attributes: description, references, category, notes, correlation coefficients, pmid and values.<br>
555
+
556
+ ```python
557
+ from aaindex import aaindex1
558
+
559
+ record = aaindex1['CHOP780206'] #get full record
560
+ description = aaindex1['CHOP780206'].description #get record's description
561
+ refs = aaindex1['CHOP780206'].references #get record's references
562
+ category = aaindex1['CHOP780206'].category #get record's category
563
+ notes = aaindex1['CHOP780206'].notes #get record's notes
564
+ correlation_coefficients = aaindex1['CHOP780206'].correlation_coefficients #get record's correlation coefficients
565
+ pmid = aaindex1['CHOP780206'].pmid #get record's pmid
566
+ values = aaindex1['CHOP780206'].values #get amino acid values from record
567
+
568
+ num_record = aaindex1.num_records() #get total number of records
569
+ record_names = aaindex1.record_names() #get list of all record names
570
+ amino_acids = aaindex1.amino_acids() #get list of all canonical amino acids
571
+ records = aaindex1.search("hydrophobicity") #get all records with hydrophobicity in their title/description
572
+ ```
573
+ </details>
574
+
575
+ <details><summary><b>Parallel encoding across all AAI indices using n_jobs:</b></summary><br>
576
+ Setting <code>n_jobs</code> to a value greater than 1 distributes model-building across multiple CPU cores. Pass <code>n_jobs=-1</code> to use all available cores. This applies to all three encoding methods and can significantly reduce wall-clock time when evaluating hundreds of indices or descriptor combinations.<br>
577
+
578
+ ```python
579
+ from pySAR.encoding import Encoding, SortKey
580
+
581
+ encoding = Encoding(config_file='thermostability.json')
582
+
583
+ # build 566 AAI models in parallel using all CPU cores, sorted by RMSE
584
+ aai_results = encoding.aai_encoding(n_jobs=-1, sort_by=SortKey.RMSE)
585
+
586
+ # build all descriptor models in parallel using 4 workers
587
+ desc_results = encoding.descriptor_encoding(n_jobs=4)
588
+
589
+ # build AAI + descriptor models in parallel - can be many thousands of models
590
+ aai_desc_results = encoding.aai_descriptor_encoding(n_jobs=-1, max_models=1000)
591
+ ```
592
+ For reproducible parallel runs, pass `random_state` to seed the ML models:
593
+
594
+ ```python
595
+ aai_results = encoding.aai_encoding(n_jobs=-1, random_state=42)
596
+ ```
597
+ </details>
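The `n_jobs` convention described above (a positive worker count, or `-1` for all cores) can be illustrated with a generic standard-library sketch. This is not `pySAR`'s internal implementation: `resolve_n_jobs` and `evaluate_index` are hypothetical names, the "model" is a trivial stand-in, and other negative values are simply rejected here for brevity.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def resolve_n_jobs(n_jobs):
    """Map an n_jobs value to a worker count (-1 means all available cores)."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs

def evaluate_index(index_name):
    """Hypothetical stand-in for building and scoring one model."""
    return index_name, len(index_name)

index_names = ["CHOP780206", "CIDH920105", "BHAR880101"]

with ThreadPoolExecutor(max_workers=resolve_n_jobs(2)) as pool:
    # map preserves input order, so results line up with index_names
    results = list(pool.map(evaluate_index, index_names))

print(results)
```

A thread pool keeps the sketch short and portable; CPU-bound model fitting in Python would normally use a process-based pool so that the work actually runs on separate cores.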
598
+
599
+ <details><summary><b>Resuming a partially-completed encoding run:</b></summary><br>
600
+ Long encoding jobs (e.g. all 566 AAI indices or thousands of AAI+descriptor combinations) can be interrupted and resumed without re-running completed models. Enable checkpointing by passing <code>resume=True</code> and a path for the checkpoint file via <code>resume_file</code>. On the first run a checkpoint CSV is written after each batch; subsequent runs with the same file skip already-completed keys and append the new results.<br>
601
+
602
+ ```python
603
+ from pySAR.encoding import Encoding
604
+
605
+ encoding = Encoding(config_file='thermostability.json')
606
+
607
+ # first run - starts from scratch and saves progress to aai_checkpoint.csv after each index
608
+ aai_results = encoding.aai_encoding(
609
+ resume=True,
610
+ resume_file='aai_checkpoint.csv',
611
+ n_jobs=4
612
+ )
613
+
614
+ # later run (e.g. after an interruption) - skips completed indices automatically
615
+ aai_results = encoding.aai_encoding(
616
+ resume=True,
617
+ resume_file='aai_checkpoint.csv',
618
+ n_jobs=4
619
+ )
620
+ ```
621
+ The same pattern works for `descriptor_encoding` and `aai_descriptor_encoding`:
622
+
623
+ ```python
624
+ aai_desc_results = encoding.aai_descriptor_encoding(
625
+ resume=True,
626
+ resume_file='aai_desc_checkpoint.csv',
627
+ n_jobs=-1
628
+ )
629
+ ```
630
+ </details>
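The skip-and-append behaviour described above can be sketched generically with the standard library. This is an illustration of the checkpointing pattern only, not `pySAR`'s implementation: `run_with_checkpoint`, `toy_evaluate`, the `key`/`score` column names and the file name are all hypothetical.

```python
import csv
import os
import tempfile

def run_with_checkpoint(keys, evaluate, checkpoint_path):
    """Evaluate each key once, appending results to a csv checkpoint."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, newline="") as f:
            for row in csv.DictReader(f):
                done[row["key"]] = float(row["score"])

    # only write the header when starting a fresh (or empty) checkpoint
    write_header = (not os.path.exists(checkpoint_path)
                    or os.path.getsize(checkpoint_path) == 0)
    with open(checkpoint_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["key", "score"])
        if write_header:
            writer.writeheader()
        for key in keys:
            if key in done:
                continue  # finished in an earlier run
            score = evaluate(key)
            writer.writerow({"key": key, "score": score})
            f.flush()  # push progress to disk before the next key
            done[key] = score
    return done

calls = []  # records which keys were actually evaluated

def toy_evaluate(key):
    calls.append(key)
    return float(len(key))

checkpoint = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.csv")
run_with_checkpoint(["A1", "B22"], toy_evaluate, checkpoint)  # first run
results = run_with_checkpoint(["A1", "B22", "C333"], toy_evaluate, checkpoint)  # resumed run
print(calls)    # ['A1', 'B22', 'C333']
print(results)  # {'A1': 2.0, 'B22': 3.0, 'C333': 4.0}
```

Each key is evaluated exactly once across both runs, which is the property that makes resuming a long encoding job cheap.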
631
+
632
+ <details><summary><b>Using descriptor validation and utility methods:</b></summary><br>
633
+ The <code>Descriptors</code> class exposes several utility methods for validating inputs, inspecting descriptor metadata and managing internal state. These can be used independently of the main encoding workflow.<br>
634
+
635
+ ```python
636
+ from pySAR.descriptors import Descriptors, DescriptorType
637
+ from pySAR.descriptors import InvalidDescriptorError, InvalidSequenceError
638
+
639
+ desc = Descriptors(config_file='thermostability.json')
640
+
641
+ # validate a list of descriptor names - raises InvalidDescriptorError for unknown names
642
+ valid_descs = desc.validate_descriptors(['amino_acid_composition', 'dipeptide_composition'])
643
+
644
+ # validate sequences in the loaded dataset - raises InvalidSequenceError for non-canonical amino acids
645
+ desc.validate_sequences()
646
+
647
+ # retrieve metadata (feature count, group, and parameters) for a descriptor
648
+ info = desc.get_descriptor_info('amino_acid_composition')
649
+ print(info)
650
+ # {'name': 'amino_acid_composition', 'group': 'Composition', 'feature_count': 20, 'parameters': {}}
651
+
652
+ # get the number of features each descriptor produces under the current descriptor configuration
653
+ feature_counts = desc.descriptor_feature_count  # cached property; returns dict of {name: count}
654
+
655
+ # get the list of output column names for a specific descriptor (must be calculated first)
656
+ desc.get_amino_acid_composition() # calculate it first
657
+ cols = desc.get_descriptor_columns('amino_acid_composition')
658
+
659
+ # reset all descriptor DataFrames back to empty (useful before re-calculation workflows)
660
+ desc.reset_descriptors()
661
+
662
+ # clear the internal feature-count cache (e.g. after changing descriptor metaparameters)
663
+ desc.clear_cache()
664
+ ```
665
+
666
+ Use the `DescriptorType` enum to filter or identify descriptor families:
667
+
668
+ ```python
669
+ from pySAR.descriptors import DescriptorType
670
+
671
+ # enum members: COMPOSITION, AUTOCORRELATION, SEQUENCE_ORDER, PSEUDO_AA, CTD, CONJOINT_TRIAD
672
+ print(DescriptorType.COMPOSITION.value) # 'composition'
673
+ ```
674
+ </details>
675
+
676
+ Directories and Files
677
+ =====================
678
+ * `/config` - configuration files for the example datasets that `pySAR` has been tested with, as well as the thermostability.json config file that was used in the research. These config files should be used as a template for future datasets used with `pySAR`.
679
+ * `/data` - data files used in the research project including the thermostability dataset, config file and pre-calculated protein descriptors.
680
+ * `/docs` - Sphinx documentation source for `pySAR`, including `conf.py`, `index.rst`, `usage.rst`, `api.rst` and `contributing.rst`.
681
+ * `/example_datasets` - example datasets used for the building and testing of `pySAR`, including the thermostability dataset used in the research. The format of these datasets should be used as a template for future datasets used with `pySAR`.
682
+ * `/images` - all images used throughout the repo.
683
+ * `/pySAR` - source code for `pySAR` software.
684
+ * `/tests` - unit and integration tests for `pySAR`.
685
+ * `pyproject.toml` - package build metadata and dependency specification (PEP 517/518).
686
+ * `CONFIG.md` - example markdown file describing each of the available parameters in the config files.
687
+
688
+ Issues
689
+ ======
690
+ Any issues, errors or bugs can be raised via the [Issues](https://github.com/amckenna41/pySAR/issues) tab in the repository.
691
+
692
+ Contact
693
+ =======
694
+ If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the [Issues][Issues] tab. <br><br>
695
+ <!-- [![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/adam-mckenna-7a5b22151/) -->
696
+
697
+ License
698
+ =======
699
+ Distributed under the MIT License. See [`LICENSE`][license] for more details.
700
+
701
+ References
702
+ ==========
703
+ \[1\]: Mckenna, A., & Dubey, S. (2022). Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. Journal of Biomedical Informatics, 128(104016), 104016. https://doi.org/10.1016/j.jbi.2022.104016<br><br>
704
+ \[2\]: Kawashima, S. and Kanehisa, M., 2000. AAindex: amino acid index database. Nucleic acids research, 28(1), pp.374-374. DOI: 10.1093/nar/28.1.374 <br><br>
705
+ \[3\]: Fontaine NT, Cadet XF, Vetrivel I. Novel Descriptors and Digital Signal Processing- Based Method for Protein Sequence Activity Relationship Study. Int J Mol Sci. 2019 Nov 11;20(22):5640. doi: 10.3390/ijms20225640. PMID: 31718061; PMCID: PMC6888668. <br><br>
706
+ \[4\]: Cadet, F., Fontaine, N., Li, G. et al. A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes. Sci Rep 8, 16757 (2018).<br><br>
707
+ \[5\]: Lutz S. Beyond directed evolution--semi-rational protein engineering and design. Curr Opin Biotechnol. 2010 Dec;21(6):734-43. doi: 10.1016/j.copbio.2010.08.011. Epub 2010 Sep 24. PMID: 20869867; PMCID: PMC2982887. <br><br>
708
+ \[6\]: Yang, K.K., Wu, Z. & Arnold, F.H. Machine-learning-guided directed evolution for protein engineering. Nat Methods 16, 687–694 (2019). https://doi.org/10.1038/s41592-019-0496-6 <br><br>
709
+ \[7\]: Yuting Xu, Deeptak Verma, Robert P. Sheridan, Andy Liaw, Junshui Ma, Nicholas M. Marshall, John McIntosh, Edward C. Sherer, Vladimir Svetnik, and Jennifer M. Johnston. Deep Dive into Machine Learning Models for Protein Engineering.
710
+ Journal of Chemical Information and Modeling 2020 60 (6), 2773-2790
711
+ DOI: 10.1021/acs.jcim.0c00073 <br><br>
712
+ \[8\]: Medina-Ortiz, D., Contreras, S., Amado-Hinojosa, J., Torres-Almonacid, J., Asenjo, J. A., Navarrete, M., & Olivera-Nappa, Á. (2020). Combination of digital signal processing and assembled predictive models facilitates the rational design of proteins. ArXiv [Cs.CE]. <br>
713
+
714
+ <a href="https://www.buymeacoffee.com/amckenna41" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>
715
+
716
+ [Back to top](#TOP)
717
+
718
+ <!-- |Logo| image:: https://raw.githubusercontent.com/pySAR/pySAR/master/pySAR.png -->
719
+
720
+ [python]: https://www.python.org/downloads/release/python-360/
721
+ [aaindex]: https://github.com/amckenna41/aaindex
722
+ [protpy]: https://github.com/amckenna41/protpy
723
+ [numpy]: https://numpy.org/
724
+ [pandas]: https://pandas.pydata.org/
725
+ [sklearn]: https://scikit-learn.org/stable/
726
+ [scipy]: https://www.scipy.org/
727
+ [tqdm]: https://tqdm.github.io/
728
+ [seaborn]: https://seaborn.pydata.org/
729
+ [matplotlib]: https://matplotlib.org/
730
+ [delayed]: https://pypi.org/project/delayed/
731
+ [PyPi]: https://pypi.org/project/pysar/
732
+ [article]: https://www.sciencedirect.com/science/article/abs/pii/S1532046422000326
733
+ [pdf]: https://github.com/amckenna41/pySAR/blob/master/pySAR_research.pdf
734
+ [ppt]: https://github.com/amckenna41/pySAR/blob/master/pySAR_demo.key
735
+ [demo]: https://colab.research.google.com/drive/1hxtnf8i4q13fB1_2TpJFimS5qfZi9RAo?usp=sharing
736
+ [Issues]: https://github.com/amckenna41/pySAR/issues
737
+ [license]: https://github.com/amckenna41/pySAR/blob/master/LICENSE
738
+ [config]: https://github.com/amckenna41/pySAR/blob/master/CONFIG.md
739
+ [medium]: https://ajmckenna69.medium.com/pysar-a3de9f71733f
740
+ [directed_evolution]: https://en.wikipedia.org/wiki/Directed_evolution_(protein_engineering)