enzymetk 0.0.2__tar.gz → 0.0.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (59)
  1. {enzymetk-0.0.2 → enzymetk-0.0.7}/PKG-INFO +143 -10
  2. {enzymetk-0.0.2 → enzymetk-0.0.7}/README.md +132 -0
  3. enzymetk-0.0.7/enzymetk/__init__.py +122 -0
  4. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/annotateEC_CLEAN_step.py +2 -2
  5. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/annotateEC_CREEP_step.py +11 -3
  6. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/annotateEC_proteinfer_step.py +9 -0
  7. enzymetk-0.0.7/enzymetk/dock_boltz_step.py +73 -0
  8. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/dock_chai_step.py +27 -4
  9. enzymetk-0.0.7/enzymetk/dock_vina_step.py +117 -0
  10. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/embedchem_chemberta_step.py +0 -1
  11. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/embedchem_rxnfp_step.py +20 -2
  12. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/embedchem_unimol_step.py +40 -14
  13. enzymetk-0.0.7/enzymetk/embedprotein_esm3_step.py +74 -0
  14. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/embedprotein_esm_step.py +104 -9
  15. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/inpaint_ligandMPNN_step.py +3 -2
  16. enzymetk-0.0.7/enzymetk/main.py +251 -0
  17. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/predict_catalyticsite_step.py +22 -10
  18. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/sequence_search_blast.py +31 -13
  19. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/similarity_foldseek_step.py +3 -11
  20. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/similarity_mmseqs_step.py +2 -1
  21. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/similarity_reaction_step.py +24 -18
  22. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/similarity_substrate_step.py +21 -11
  23. enzymetk-0.0.7/enzymetk/step.py +134 -0
  24. enzymetk-0.0.7/enzymetk/structure_search_foldseek.py +88 -0
  25. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk.egg-info/PKG-INFO +143 -10
  26. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk.egg-info/SOURCES.txt +7 -7
  27. enzymetk-0.0.7/enzymetk.egg-info/requires.txt +10 -0
  28. {enzymetk-0.0.2 → enzymetk-0.0.7}/setup.py +10 -10
  29. enzymetk-0.0.7/tests/test_embedprotein_esm_step.py +363 -0
  30. enzymetk-0.0.7/tests/test_esm2.py +18 -0
  31. enzymetk-0.0.7/tests/test_foldseek.py +136 -0
  32. enzymetk-0.0.2/enzymetk/__init__.py +0 -56
  33. enzymetk-0.0.2/enzymetk/dock_vina_step.py +0 -63
  34. enzymetk-0.0.2/enzymetk/esm-extract.py +0 -140
  35. enzymetk-0.0.2/enzymetk/main.py +0 -37
  36. enzymetk-0.0.2/enzymetk/predict_activity_step.py +0 -0
  37. enzymetk-0.0.2/enzymetk/predict_catalyticsite_run.py +0 -47
  38. enzymetk-0.0.2/enzymetk/reducedim_pca_run.py +0 -67
  39. enzymetk-0.0.2/enzymetk/reducedim_vae_run.py +0 -67
  40. enzymetk-0.0.2/enzymetk/reducedim_vae_step.py +0 -12
  41. enzymetk-0.0.2/enzymetk/step.py +0 -60
  42. enzymetk-0.0.2/enzymetk.egg-info/requires.txt +0 -11
  43. {enzymetk-0.0.2 → enzymetk-0.0.7}/LICENSE +0 -0
  44. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/embedchem_rxnfp_run.py +0 -0
  45. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/embedchem_selformer_run.py +0 -0
  46. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/embedchem_selformer_step.py +0 -0
  47. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/filter_sequence_step.py +0 -0
  48. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/filter_structure_step.py +0 -0
  49. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/generate_msa_step.py +0 -0
  50. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/generate_oligopool_step.py +0 -0
  51. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/generate_tree_step.py +0 -0
  52. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/metagenomics_porechop_trim_reads_step.py +0 -0
  53. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/metagenomics_prokka_annotate_genes.py +0 -0
  54. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/pipeline.py +0 -0
  55. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk/save_step.py +0 -0
  56. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk.egg-info/dependency_links.txt +0 -0
  57. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk.egg-info/entry_points.txt +0 -0
  58. {enzymetk-0.0.2 → enzymetk-0.0.7}/enzymetk.egg-info/top_level.txt +0 -0
  59. {enzymetk-0.0.2 → enzymetk-0.0.7}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
- Metadata-Version: 2.2
+ Metadata-Version: 2.4
  Name: enzymetk
- Version: 0.0.2
+ Version: 0.0.7
  Home-page: https://github.com/arianemora/enzyme-tk/
  Author: Ariane Mora
  Author-email: ariane.n.mora@gmail.com
@@ -13,22 +13,22 @@ Classifier: Intended Audience :: Science/Research
  Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
  Classifier: Natural Language :: English
  Classifier: Operating System :: OS Independent
- Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
  Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
- Requires-Python: >=3.8
+ Requires-Python: >=3.10
  Description-Content-Type: text/markdown
  License-File: LICENSE
- Requires-Dist: fair-esm
  Requires-Dist: scikit-learn
  Requires-Dist: numpy
  Requires-Dist: seaborn
  Requires-Dist: sciutil
- Requires-Dist: pandas==2.1.4
+ Requires-Dist: tqdm
+ Requires-Dist: pandas
  Requires-Dist: biopython
- Requires-Dist: sentence_transformers
- Requires-Dist: pubchempy
- Requires-Dist: pyfaidx
- Requires-Dist: spacy
+ Requires-Dist: transformers
+ Requires-Dist: torch
+ Requires-Dist: huggingface_hub
  Dynamic: author
  Dynamic: author-email
  Dynamic: classifier
@@ -37,6 +37,7 @@ Dynamic: description-content-type
  Dynamic: home-page
  Dynamic: keywords
  Dynamic: license
+ Dynamic: license-file
  Dynamic: project-url
  Dynamic: requires-dist
  Dynamic: requires-python
@@ -45,14 +46,100 @@ Dynamic: requires-python
 
  Enzyme-tk is a collection of tools for enzyme engineering, set up as interoperable modules that act on dataframes. These modules are designed to be imported into pipelines for a specific function. For this reason, `steps`, as each module is called (e.g. finding similar proteins with `BLAST` would be considered a step), are designed to be as light as possible. An example of a pipeline is the [annotate-e](https://github.com/ArianeMora/annotate-e) pipeline, which annotates a fasta with an ensemble of methods (each designated as an Enzyme-tk step).
 
+
+ **If you have any issues installing, let me know - this has been tested only on Linux/Ubuntu. Please post an issue!**
+
  ## Installation
 
  ## Install base package to import modules
 
  ```bash
+ conda create --name enzymetk python==3.12 -y
  pip install enzymetk
+ # Install torch for your specific cuda version
+ pip install torch torchvision #--index-url https://download.pytorch.org/whl/cu130
+ ```
+ ## If you're on the bleeding edge and going to use older models (e.g. chemBERTa2), you may need to run
+ ```
+ pip uninstall transformers -y
+ pip install "transformers<5"
+ ```
+
+ ## For each module, run install the first time you use it
+ This will install into a venv where possible, and into a conda env where the tool does not work in a venv.
+ See the specific tools for info.
+ ```
+ bm = BLAST(id_col, seq_col, label_col)
+ bm.install() # by default will create a venv or, if needed, a conda env
+ ```
+ Note that if you want to use your own environment, you can install it externally and override the installed venv or conda env, e.g.
+ ```
+ bm = BLAST(id_col, seq_col, label_col)
+ bm.conda = 'blast_env' # an already installed env on your computer
+ bm.venv = None # so it knows to use conda, i.e. forces it not to use venv
  ```
 
+ ## Modules requiring conda
+
+ - CREEP [not tested again]
+ - CLEAN [not tested again]
+ - ProteInfer [not tested again]
+
+ ## Modules able to run in venv
+ - BLAST [cpu, tested with both, see notebook]
+ - ChemBERTA [cpu, colab]
+ - Boltz
+ - Chai: conda install -c conda-forge pdbfixer
+
+ - esm2/3 [cpu, see notebook]
+ - foldseek [tested and works]
+ - ligandmpnn
+ - mmseqs [can get working...]
+ - msa []
+ - reaction_similarity [good, cpu]
+ - rxnfp [needs a specific python version so not easy in colab]; hence install is via `enzymetk install rxnfp`, which requires conda
+ - substrate_similarity [good, cpu]
+ - tree
+ - unimol [good, cpu]
+
+ Docko git@github.com:ArianeMora/docko.git
+ ValueError: CCD component ALA not found!
+ boltz predict boltz.fasta --use_msa_server --cache ./mol
+
+ srun -p gpu --qos=normal --gres=gpu:1 --pty --mem=64G --time=000:30:00 bash
+
+ pipelines: reads --> poreChop --> Flye --> Prokka --> Squidly --> Foldseek --> Boltz --> Chai
+ pipelines: seqs --> BLAST --> Proteinfer --> Foldseek --> MMseqs --> ClustalOmega --> FastTree
+ pipelines: reactions --> rxnFP --> selformer --> uniMol --> chemBERTa2 --> RDkit reaction similarity
+
+
+ | Module | Name | Description | Colab ipynb|
+ |------------------------------|---------------|-----------------------------------------------------------------------------------|------------|
+ | Metagenomics | PoreChop | Used to filter adapters for nanopore sequences in metagenomics pipeline. | y |
+ | Metagenomics | Flye | Used to assemble the metagenomes. | ? |
+ | Metagenomics | Prokka | Annotation of genes within the genome. | ? |
+ | Function prediction | Proteinfer | Annotation of genes to function (GO or EC class) using ML. | 33 |
+ | Function prediction | CLEAN | Annotation of genes to EC class using ML. | 11 |
+ | Function prediction | CREEP | Annotation of genes to EC class using ML. | 13 |
+ | Function prediction | Func-e | Annotation of genes to reaction using ML. | This study. |
+ | Function prediction | Squidly | Annotation of catalytic residues using ML. | 36 |
+ | Embedding generation | ESM2 & 3 | Conversion of amino acid sequence to a numerical embedding using a PLM. | 46,47 |
+ | Embedding generation | RxnFP | Conversion of reaction smiles to a numerical embedding using a language model. | 48 |
+ | Embedding generation | Selformer | Conversion of reaction selfies to a numerical embedding using a language model. | 49 |
+ | Embedding generation | Uni-mol | Conversion of molecule smiles to a numerical embedding using a language model. | 50 |
+ | Embedding generation | ChemBERTa2 | Conversion of reaction smiles to a numerical embedding using a language model. | 51 |
+ | Docking | Chai | Diffusion based folding of a protein and ligand. | 42 |
+ | Docking | Boltz | Diffusion based folding of a protein and ligand. | 52 |
+ | Similarity | Diamond | Sequence similarity calculation using basic local alignment search. | 53 |
+ | Similarity | Foldseek | Fast structure similarity search. | 54 |
+ | Similarity | MMseqs | Fast sequence clustering. | 55 |
+ | Docking | StructureZyme | Alignment and calculation of structure metrics. | 56 |
+ | Oligo design | Oligopoolio | Calculation of oligo fragments for gene assembly. | This study. |
+ | Sequencing | LevSeq | Sequence verification of protein variants. | 34 |
+ | MSA generation | ClustalOmega | Creation of multiple sequence alignments (MSA). | 57 |
+ | Phylogenetic tree generation | FastTree | Creation of multiple phylogenetic trees. | 58 |
+
+
  ### Install only the specific requirements you need (recommended)
 
  For this, clone the repo and then install the requirements for the specific modules you use.
@@ -71,6 +158,7 @@ This is a work-in progress! e.g. some tools (e.g. proteInfer and CLEAN) require
 
  Here are some of the tools that have been implemented to be chained together as a pipeline:
 
+ [boltz2](https://github.com/jwohlwend/boltz)
  [mmseqs2](https://github.com/soedinglab/mmseqs2)
  [foldseek](https://github.com/steineggerlab/foldseek)
  [diamond](https://github.com/bbuchfink/diamond)
@@ -89,6 +177,7 @@ Here are some of the tools that have been implemented to be chained together as
  [fasttree](https://morgannprice.github.io/fasttree/)
  [Porechop](https://github.com/rrwick/Porechop)
  [prokka](https://github.com/tseemann/prokka)
+
  ## Things to note
 
  All the tools use the conda env of `enzymetk` by default.
@@ -120,6 +209,12 @@ The steps are the main building blocks of the pipeline. They are responsible for
 
  BLAST is a tool for searching a database of sequences for similar sequences. Here you can either pass a database that you have already created, or pass the sequences as part of your dataframe along with the label column (this needs to have two values: reference and query): reference refers to sequences that you want to search against, and query refers to sequences that you want to search for.
 
+ Note you can install it in two ways; with a conda env via the command line:
+
+ ```
+ enzymetk install_diamond
+ ```
+
  ```python
  id_col = 'Entry'
  seq_col = 'Sequence'
@@ -148,6 +243,34 @@ df = pd.DataFrame(rows, columns=[id_col, seq_col])
  print(df)
  df << (ActiveSitePred(id_col, seq_col, squidly_dir, num_threads) >> Save('tmp/squidly_as_pred.pkl'))
 
+ ```
+ ### Boltz2
+
+ Boltz2 is a model for predicting structures. Note you need docko installed, as Boltz is run via that package.
+
+ Below is an example using Boltz with 4 threads and a cofactor (an intermediate in this case). Just set it to None for a single-substrate version.
+ ```
+ import sys
+ from enzymetk.dock_boltz_step import Boltz
+ from enzymetk.save_step import Save
+ import pandas as pd
+ import os
+ os.environ['MKL_THREADING_LAYER'] = 'GNU'
+
+ output_dir = 'tmp/'
+ num_threads = 4
+ id_col = 'Entry'
+ seq_col = 'Sequence'
+ substrate_col = 'Substrate'
+ intermediate_col = 'Intermediate'
+
+ rows = [['P0DP23_boltz_8999', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]'],
+ ['P0DP24_boltz_p1', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]'],
+ ['P0DP23_boltz_p2', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]'],
+ ['P0DP24_boltz_p3', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]'],
+ ['P0DP24_boltz_p4', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]']]
+ df = pd.DataFrame(rows, columns=[id_col, seq_col, substrate_col, intermediate_col])
+ df << (Boltz(id_col, seq_col, substrate_col, intermediate_col, f'{output_dir}', num_threads) >> Save(f'{output_dir}test.pkl'))
  ```
 
  ### Chai
@@ -257,6 +380,16 @@ df << (CREEP(id_col, reaction_col, CREEP_cache_dir='/disk1/share/software/CREEP/
 
  EmbedESM is a tool for embedding a set of sequences using ESM2.
 
+ Either in your own conda env: `pip install fair-esm`, or you can run:
+
+ ```
+ id_col = 'Entry'
+ seq_col = 'Sequence'
+ label_col = 'ActiveSite'
+ esm = EmbedESM(id_col, seq_col, extraction_method='mean', tmp_dir='tmp', rep_num=36) # i.e. the representation layer you want, usually the last layer
+ esm.install() # And follow the instructions to activate the env
+ ```
+
  ```python
  from enzymetk.embedprotein_esm_step import EmbedESM
  from enzymetk.save_step import Save
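
The README/description content above introduces the `df << (Step(...) >> Save(...))` chaining pattern only in fragments, so a compact, self-contained sketch of a single-step pipeline is shown here. It assumes the `BLAST(id_col, seq_col, label_col)` signature and the `reference`/`query` label convention quoted in the diff; the IDs, sequences and output path are illustrative only.

```python
# Minimal sketch of the dataframe-chaining pattern described in the README above.
# Assumes the BLAST(id_col, seq_col, label_col) signature and the reference/query
# label convention quoted in the diff; IDs, sequences and paths are illustrative.
import pandas as pd

from enzymetk.save_step import Save
from enzymetk.sequence_search_blast import BLAST

id_col, seq_col, label_col = 'Entry', 'Sequence', 'Label'
df = pd.DataFrame(
    [['P0DP23', 'MALWMRLLPLLALLALWGPDPAAA', 'reference'],   # sequence to search against
     ['QUERY_1', 'MALWMRLLPLLALLALWGPDPAAA', 'query']],     # sequence to search for
    columns=[id_col, seq_col, label_col],
)

# Each step transforms the dataframe; Save writes the final result to disk.
df << (BLAST(id_col, seq_col, label_col) >> Save('tmp/blast_hits.pkl'))
```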
@@ -0,0 +1,122 @@
+ ###############################################################################
+ # #
+ # This program is free software: you can redistribute it and/or modify #
+ # it under the terms of the GNU General Public License as published by #
+ # the Free Software Foundation, either version 3 of the License, or #
+ # (at your option) any later version. #
+ # #
+ # This program is distributed in the hope that it will be useful, #
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of #
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
+ # GNU General Public License for more details. #
+ # #
+ # You should have received a copy of the GNU General Public License #
+ # along with this program. If not, see <http://www.gnu.org/licenses/>. #
+ # #
+ ###############################################################################
+
+ """
+ Author: Ariane Mora
+ Date: March 2025
+ """
+ __title__ = 'enzymetk'
+ __description__ = 'Toolkit for enzymes and what not'
+ __url__ = 'https://github.com/arianemora/enzyme-tk/'
+ __version__ = '0.0.7'
+ __author__ = 'Ariane Mora'
+ __author_email__ = 'ariane.n.mora@gmail.com'
+ __license__ = 'GPL3'
+
+
+ # Core classes
+ from enzymetk.step import Step, Pipeline
+ from enzymetk.save_step import Save
+
+ # EC Annotation
+ from enzymetk.annotateEC_CLEAN_step import CLEAN
+ from enzymetk.annotateEC_CREEP_step import CREEP
+ from enzymetk.annotateEC_proteinfer_step import ProteInfer
+
+ # Docking
+ from enzymetk.dock_boltz_step import Boltz
+ from enzymetk.dock_chai_step import Chai
+ from enzymetk.dock_vina_step import Vina
+
+ # Chemical Embeddings
+ from enzymetk.embedchem_chemberta_step import ChemBERT
+ from enzymetk.embedchem_rxnfp_step import RxnFP
+ from enzymetk.embedchem_selformer_step import SelFormer
+ from enzymetk.embedchem_unimol_step import UniMol
+
+ # Protein Embeddings
+ from enzymetk.embedprotein_esm_step import EmbedESM
+ from enzymetk.embedprotein_esm3_step import EmbedESM3
+
+ # Sequence Generation/Alignment
+ from enzymetk.generate_msa_step import ClustalOmega
+ from enzymetk.generate_tree_step import FastTree
+
+ # Protein Design
+ from enzymetk.inpaint_ligandMPNN_step import LigandMPNN
+
+ # Metagenomics
+ from enzymetk.metagenomics_porechop_trim_reads_step import PoreChop
+ from enzymetk.metagenomics_prokka_annotate_genes import Prokka
+
+ # Prediction
+ from enzymetk.predict_catalyticsite_step import ActiveSitePred
+
+ # Sequence Search
+ from enzymetk.sequence_search_blast import BLAST
+
+ # Similarity Search
+ from enzymetk.similarity_foldseek_step import FoldSeek
+ from enzymetk.similarity_mmseqs_step import MMseqs
+ from enzymetk.similarity_reaction_step import ReactionDist
+ from enzymetk.similarity_substrate_step import SubstrateDist
+
+ # Structure Search (aliased to avoid conflict with similarity_foldseek_step.FoldSeek)
+ from enzymetk.structure_search_foldseek import FoldSeek as StructureFoldSeek
+
+
+ __all__ = [
+ # Core
+ 'Step',
+ 'Pipeline',
+ 'Save',
+ # EC Annotation
+ 'CLEAN',
+ 'CREEP',
+ 'ProteInfer',
+ # Docking
+ 'Boltz',
+ 'Chai',
+ 'Vina',
+ # Chemical Embeddings
+ 'ChemBERT',
+ 'RxnFP',
+ 'SelFormer',
+ 'UniMol',
+ # Protein Embeddings
+ 'EmbedESM',
+ 'EmbedESM3',
+ # Sequence Generation/Alignment
+ 'ClustalOmega',
+ 'FastTree',
+ # Protein Design
+ 'LigandMPNN',
+ # Metagenomics
+ 'PoreChop',
+ 'Prokka',
+ # Prediction
+ 'ActiveSitePred',
+ # Sequence Search
+ 'BLAST',
+ # Similarity Search
+ 'FoldSeek',
+ 'MMseqs',
+ 'ReactionDist',
+ 'SubstrateDist',
+ # Structure Search
+ 'StructureFoldSeek',
+ ]
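
The new `__init__.py` above re-exports every step at the package root (with the structure-search FoldSeek aliased to `StructureFoldSeek`), so steps no longer have to be imported from their individual modules. A minimal sketch of the flattened import style, reusing the `EmbedESM` keyword arguments from the README example; the column names and output path are illustrative.

```python
# Sketch: with the 0.0.7 __init__.py, steps can be imported straight from the
# package root rather than from their individual modules. The EmbedESM keyword
# arguments mirror the README example; data and paths are illustrative.
import pandas as pd

from enzymetk import EmbedESM, Save

df = pd.DataFrame(
    [['P0DP23', 'MALWMRLLPLLALLALWGPDPAAA']],
    columns=['Entry', 'Sequence'],
)

esm = EmbedESM('Entry', 'Sequence', extraction_method='mean', tmp_dir='tmp', rep_num=36)
df << (esm >> Save('tmp/esm_embeddings.pkl'))
```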
@@ -116,7 +116,7 @@ class CLEAN(Step):
  print(output_filenames)
  for sub_df in output_filenames:
  df = pd.concat([df, sub_df])
- return df
+ return self.__filter_df(df)
  else:
- return self.__execute([df, tmp_dir])
+ return self.__filter_df(self.__execute([df, tmp_dir]))
  return df
@@ -5,9 +5,12 @@ import subprocess
  import logging
  import numpy as np
  import os
+ from enzymetk.step import run_script
+ from pathlib import Path
 
  logger = logging.getLogger(__name__)
  logger.setLevel(logging.INFO)
+ SCRIPT_DIR = Path(__file__).parent.resolve()
 
  """
  import os
@@ -38,9 +41,14 @@ class CREEP(Step):
  self.args_extract = args_extract
  self.args_retrieval = args_retrieval
 
- def __execute(self, df: pd.DataFrame, tmp_dir: str) -> pd.DataFrame:
- tmp_dir = '/disk1/ariane/vscode/degradeo/pipeline/tmp/'
- input_filename = f'{tmp_dir}/creepasjkdkajshdkja.csv'
+ def install(self, env_args=None):
+ # Try to automatically install CREEP conda env
+ run_script('install_CREEP.sh', verbose=True)
+ self.CREEP_dir = SCRIPT_DIR.parent.resolve() / 'conda_envs' / 'CREEP'
+ self.CREEP_cache_dir = f'{self.CREEP_dir}/data/'
+
+ def __execute(self, df: pd.DataFrame, tmp_dir: str):
+ input_filename = f'{tmp_dir}/input.csv'
  df.to_csv(input_filename, index=False)
  cmd = ['conda', 'run', '-n', self.env_name, 'python', f'{self.CREEP_dir}scripts/step_02_extract_CREEP.py', '--pretrained_folder',
  f'{self.CREEP_cache_dir}output/easy_split',
@@ -5,7 +5,10 @@ from multiprocessing.dummy import Pool as ThreadPool
  from tempfile import TemporaryDirectory
  import os
  import subprocess
+ from enzymetk.step import run_script
+ from pathlib import Path
 
+ SCRIPT_DIR = Path(__file__).parent.resolve()
 
  class ProteInfer(Step):
 
@@ -53,6 +56,12 @@ class ProteInfer(Step):
  self.ec3_filter = ec3_filter
  self.ec4_filter = ec4_filter
 
+ def install(self, env_args=None):
+ # Try to automatically install CREEP conda env
+ run_script('install_CREEP.sh', verbose=True)
+ self.CREEP_dir = SCRIPT_DIR.parent.resolve() / 'conda_envs' / 'CREEP'
+ self.CREEP_cache_dir = f'{self.CREEP_dir}/data/'
+
  def __execute(self, data: list) -> np.array:
  df, tmp_dir = data
  # Make sure in the directory of proteinfer
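
Both the CREEP and ProteInfer diffs above add an `install()` method that delegates to `run_script(...)` imported from `enzymetk.step`; that helper's implementation is not part of this diff. Purely as an illustration of the pattern (a hypothetical sketch, not the package's actual helper), such a function could shell out to an install script that ships alongside the module:

```python
# Hypothetical minimal run_script-style helper (the real enzymetk.step.run_script
# is not shown in this diff): locate a shell script shipped next to the module
# and run it, optionally streaming its output.
import subprocess
from pathlib import Path


def run_script(script_name: str, verbose: bool = False) -> int:
    """Run a bundled install script (e.g. install_CREEP.sh) and return its exit code."""
    script = Path(__file__).parent / script_name
    result = subprocess.run(
        ['bash', str(script)],
        capture_output=not verbose,  # stream output straight to the console when verbose
        text=True,
    )
    return result.returncode
```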
@@ -0,0 +1,73 @@
+ from enzymetk.step import Step
+ import pandas as pd
+ import logging
+ import numpy as np
+ from multiprocessing.dummy import Pool as ThreadPool
+
+
+ logger = logging.getLogger(__name__)
+ logger.setLevel(logging.INFO)
+
+ try:
+ from docko.boltz import run_boltz_affinity
+ except ImportError as e:
+ print("Boltz: Needs docko package. Install with: pip install docko.")
+
+
+ class Boltz(Step):
+
+ def __init__(self, id_col: str, seq_col: str, substrate_col: str, intermediate_col: str, output_dir: str,
+ num_threads: 1, env_name = None, args=None):
+ super().__init__()
+ self.id_col = id_col
+ self.seq_col = seq_col
+ self.substrate_col = substrate_col
+ self.intermediate_col = intermediate_col
+ self.output_dir = output_dir or None
+ self.num_threads = num_threads or 1
+ self.conda = env_name
+ self.env_name = env_name
+ self.args = args
+
+ def install(self, env_args=None):
+ # e.g. env args could be python=='3.1.1.
+ self.install_venv(env_args)
+ # Now the specific
+ try:
+ cmd = [f'{self.env_name}/bin/pip', 'install', 'docko']
+ self.run(cmd)
+ except Exception as e:
+ cmd = [f'{self.env_name}/bin/pip3', 'install', 'docko']
+ self.run(cmd)
+ self.run(cmd)
+ # Now set the venv to be the location:
+ self.venv = f'{self.env_name}/bin/python'
+
+ def __execute(self, df: pd.DataFrame) -> pd.DataFrame:
+ output_filenames = []
+
+ for run_id, seq, substrate, intermediate in df[[self.id_col, self.seq_col, self.substrate_col, self.intermediate_col]].values:
+ # Might have an issue if the things are not correctly installed in the same directory
+ if not isinstance(substrate, str):
+ substrate = ''
+ print(run_id, seq, substrate)
+ if self.args:
+ run_boltz_affinity(run_id, seq, substrate, self.output_dir, intermediate, self.args)
+ else:
+ run_boltz_affinity(run_id, seq, substrate, self.output_dir, intermediate)
+ output_filenames.append(f'{self.output_dir}/{run_id}/')
+ return output_filenames
+
+ def execute(self, df: pd.DataFrame) -> pd.DataFrame:
+
+ if self.output_dir:
+ if self.num_threads > 1:
+ pool = ThreadPool(self.num_threads)
+ df_list = np.array_split(df, self.num_threads)
+ results = pool.map(self.__execute, df_list)
+ else:
+ results = self.__execute(df)
+ df['output_dir'] = results
+ return df
+ else:
+ print('No output directory provided')
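
`Boltz.execute()` above fans work out by splitting the dataframe with `np.array_split` and mapping chunks over a thread pool. Below is a standalone sketch of that split-and-map pattern with a stand-in worker; flattening the per-chunk results before attaching them as a column is an assumption of this sketch, not a claim about the package's exact behaviour.

```python
# Standalone sketch of the split/ThreadPool pattern used by Boltz.execute above.
# The worker is a stand-in for the per-row docking call, and flattening the
# per-chunk results before attaching them as a column is an assumption of this
# sketch rather than a description of the package's exact code.
from multiprocessing.dummy import Pool as ThreadPool

import numpy as np
import pandas as pd


def process_chunk(chunk: pd.DataFrame) -> list:
    # Stand-in worker: return one output path per row in the chunk.
    return [f'tmp/{run_id}/' for run_id in chunk['Entry']]


df = pd.DataFrame({'Entry': ['A1', 'A2', 'A3', 'A4']})
num_threads = 2

with ThreadPool(num_threads) as pool:
    chunks = np.array_split(df, num_threads)        # split the dataframe into chunks
    per_chunk = pool.map(process_chunk, chunks)     # one result list per chunk

df['output_dir'] = [path for paths in per_chunk for path in paths]
print(df)
```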