clustrX 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
clustrx-1.0.0/PKG-INFO ADDED
@@ -0,0 +1,164 @@
+ Metadata-Version: 2.4
+ Name: clustrX
+ Version: 1.0.0
+ Summary: clustrX: Highly Robust and Sensitive Protein Clustering Using Similarity Networks and Leiden Community Detection
+ Author-email: Mario Benítez-Prián <mario.benitezprian@gmail.com>
+ License: MIT
+ Project-URL: Homepage, https://github.com/mario-benitez-prian/clustrX
+ Project-URL: Repository, https://github.com/mario-benitez-prian/clustrX
+ Project-URL: Bug Tracker, https://github.com/mario-benitez-prian/clustrX/issues
+ Keywords: bioinformatics,clustering,blast,hmmer,sequences,graphs,protein-families,similarity-search,sequence-clustering
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ Requires-Dist: polars>=0.19.0
+ Requires-Dist: igraph>=0.10.0
+ Requires-Dist: numpy>=1.20.0
+ Requires-Dist: psutil>=5.8.0
+ Provides-Extra: test
+ Requires-Dist: pytest>=7.0; extra == "test"
+
+ # clustrX: Highly Robust and Sensitive Protein Clustering
+
+ [![Version](https://img.shields.io/badge/version-1.0.0-blue.svg)](https://pypi.org/project/clustrX/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
+
+ **clustrX** is a high-performance framework designed to transform sequence similarity search results into biologically coherent protein families. By modeling homology as a weighted mathematical network and applying the **Leiden community detection algorithm**, `clustrX` provides a sensitive and robust solution for clustering sequences, especially in complex scenarios involving remote homology and short peptides.
+
+ ---
+
+ ## 🚀 Key Features
+
+ * **Leiden Community Detection**: Beyond simple links, `clustrX` identifies densely connected communities, ensuring high internal cohesion and preventing artificial family merging (e.g., due to domain bridges).
+ * **Agnostic Input**: Works with results from **BLAST**, **Diamond**, **MMseqs2**, and **HMMER**, or from other tools through the custom input option.
+ * **Dynamic Coverage Filter**: Our recommended approach for handling sequences of varying lengths and obtaining reliable, biologically sound results.
+ * **Ultra-Fast Performance**: Powered by `Polars` (Rust-based) for data processing and `igraph` (C-based) for network analysis.
+ * **Integrated Workflow**: From similarity hits to Multiple Sequence Alignments (MSAs) in a single command.
+
+ ---
+
+ ## 📦 Installation
+
+ You can install `clustrX` using two main methods. Note the difference in dependency management:
+
+ ### Option A: Via Conda (Recommended)
+ This is the easiest way, as it automatically installs all external dependencies, including **MAFFT** for alignments.
+ ```bash
+ conda install -c bioconda clustrx
+ ```
+
+ ### Option B: Via Pip (Using a Virtual Environment)
+ To avoid conflicts with other packages and ensure the `clustrx` command is correctly recognized by your system (avoiding PATH issues), we highly recommend using a virtual environment:
+
+ 1. **Create a new environment**:
+    ```bash
+    python -m venv clustrx_env
+    ```
+ 2. **Activate it**:
+    * **Windows**: `clustrx_env\Scripts\activate`
+    * **Linux/macOS**: `source clustrx_env/bin/activate`
+ 3. **Install**:
+    ```bash
+    pip install clustrX
+    ```
+
+ > [!TIP]
+ > **If the `clustrx` command is not recognized** after installation (common on Windows), it is likely because the installation directory is not in your system's PATH. You can either add it manually or use the following foolproof method:
+ > `python -m clustrx [arguments]`
+
+ *Note: If you use Pip, remember that you must **install MAFFT manually** on your system if you plan to use the `--mafft` option.*
+
+ ---
+
+ ## ⚙️ Input Formats & Requirements
+
+ `clustrX` is designed to be a post-processing layer. It requires two main inputs:
+ 1. **Similarity Hits**: A tabular file (BLAST-like or HMMER).
+ 2. **Sequences**: A FASTA file containing the sequences referenced in the hits.
+
+ ### Using BLAST
+ `clustrX` works natively with the default tabular output of BLAST (`-outfmt 6`).
+ ```bash
+ blastp -query sequences.fasta -db database -out hits.tsv -outfmt 6
+ ```
+
+ ### Using Diamond or MMseqs2
+ If you use these tools, you **must** ensure the output is in **BLAST tabular format (outfmt 6)**:
+
+ * **Diamond**:
+   ```bash
+   diamond blastp -q query.fasta -d db.dmnd -o hits.tsv --outfmt 6
+   ```
+ * **MMseqs2**:
+   ```bash
+   mmseqs easy-search query.fasta target.fasta hits.tsv tmp --format-mode 0
+   ```
+
+ ### Using HMMER
+ HMMER outputs require specific flags depending on the filtering level you need:
+
+ * **`domtblout` (Recommended)**: Use the `--domtblout` flag in `hmmsearch` or `phmmer`. This format provides alignment coordinates, which are **required** for using the **Dynamic Coverage** filter.
+   ```bash
+   hmmsearch --domtblout hits.domtblout profile.hmm database.fasta
+   ```
+ * **`tblout`**: Use the `--tblout` flag. Note that this format lacks coordinate information; therefore, **Dynamic Coverage cannot be applied** (only E-value and Bitscore filters will be used).
+   ```bash
+   hmmsearch --tblout hits.tblout profile.hmm database.fasta
+   ```
+
+ ---
+
+ ## 🧬 The Power of Dynamic Coverage
+
+ We strongly recommend using the **Dynamic Coverage** mode (`--coverage dynamic`) for most scientific applications. For more information, please read the paper.
+
+ Standard clustering methods often use fixed thresholds that fail to resolve relationships between sequences of very different sizes. Our dynamic filter uses a **hyperbolic decay function** (calibrated with a 50-residue scale factor) that:
+ 1. Increases stringency for **short peptides** (up to 0.8 coverage) to filter out statistical noise.
+ 2. Gradually relaxes for **larger proteins** (down to 0.4 coverage) to maximize sensitivity in detecting remote homology.
+
+ ---
+
+ ## 🛠️ Workflow & Usage
+
+ The `clustrX` pipeline follows a clear 3-step logic:
+
+ 1. **Filter**: Hits are filtered based on E-value, Bitscore, and (recommended) Dynamic Coverage.
+ 2. **Cluster**: A similarity network is built where edges are weighted by Bitscore, then partitioned using the Leiden algorithm.
+ 3. **Output**: Results are exported. **Note: FASTA generation and alignments are optional.**
+
+ ### Example: Recommended Scientific Run
+ ```bash
+ clustrx -i hits.tsv -f sequences.fasta --coverage dynamic --write-fasta --mafft --outdir results_full
+ ```
+ * `--write-fasta`: (Optional) Creates a FASTA file for each generated cluster.
+ * `--mafft`: (Optional) Automatically performs Multiple Sequence Alignment for each cluster.
+
+ ---
+
+ ## 💡 Use Cases
+
+ * **Protein Family Discovery**: Organizing large proteomes into evolutionarily related groups.
+ * **Short Peptide Classification**: Specifically tuned for the discovery of **Antimicrobial Peptides (AMPs)**, toxins, signaling peptides, and others.
+ * **Remote Homology Exploration**: Identifying relationships in the "twilight zone" (identity < 30%) where traditional greedy methods fragment families.
+ * **Domain-Aware Clustering**: Using HMMER `domtblout` inputs to cluster sequences based on specific functional domains.
+
+ ---
+
+ ## 📝 Citation
+ If you use **clustrX** in your research, please cite:
+ > Benítez-Prián, M. & San Mauro, D. (2026). clustrX: Highly Robust and Sensitive Protein Clustering Using Similarity Networks and Leiden Community Detection.
+
+ ## 👤 Authors
+ **Mario Benítez-Prián** & **Diego San Mauro**
+
+ Contact: [mario.benitezprian@gmail.com](mailto:mario.benitezprian@gmail.com) | [GitHub](https://github.com/mario-benitez-prian)
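Reviewer note: the "Filter" step in the README workflow can be illustrated with a minimal sketch that reads BLAST tabular output (`-outfmt 6`, whose default column order is shown below) and keeps hits passing E-value and bitscore thresholds. The helper name `filter_hits` and the sample rows are hypothetical and only for illustration; the package itself does this work in `read_hits` using Polars, whose internals are not shown in this diff.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(handle, max_evalue=None, min_bitscore=None):
    """Yield (query, target, bitscore) for hits passing the thresholds.

    Hypothetical helper for illustration; not part of the clustrX API.
    """
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST6_COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if max_evalue is not None and evalue > max_evalue:
            continue
        if min_bitscore is not None and bitscore < min_bitscore:
            continue
        yield hit["qseqid"], hit["sseqid"], bitscore


# Two made-up hits: one strong, one weak.
example = io.StringIO(
    "seqA\tseqB\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-50\t180.0\n"
    "seqA\tseqC\t30.0\t40\t28\t0\t1\t40\t5\t44\t0.5\t22.1\n"
)
kept = list(filter_hits(example, max_evalue=1e-5, min_bitscore=50.0))
print(kept)
```

Each surviving tuple corresponds to one weighted edge of the similarity network described in the "Cluster" step.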
@@ -0,0 +1,135 @@
+ # clustrX: Highly Robust and Sensitive Protein Clustering
+
+ [![Version](https://img.shields.io/badge/version-1.0.0-blue.svg)](https://pypi.org/project/clustrX/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
+
+ **clustrX** is a high-performance framework designed to transform sequence similarity search results into biologically coherent protein families. By modeling homology as a weighted mathematical network and applying the **Leiden community detection algorithm**, `clustrX` provides a sensitive and robust solution for clustering sequences, especially in complex scenarios involving remote homology and short peptides.
+
+ ---
+
+ ## 🚀 Key Features
+
+ * **Leiden Community Detection**: Beyond simple links, `clustrX` identifies densely connected communities, ensuring high internal cohesion and preventing artificial family merging (e.g., due to domain bridges).
+ * **Agnostic Input**: Works with results from **BLAST**, **Diamond**, **MMseqs2**, and **HMMER**, or from other tools through the custom input option.
+ * **Dynamic Coverage Filter**: Our recommended approach for handling sequences of varying lengths and obtaining reliable, biologically sound results.
+ * **Ultra-Fast Performance**: Powered by `Polars` (Rust-based) for data processing and `igraph` (C-based) for network analysis.
+ * **Integrated Workflow**: From similarity hits to Multiple Sequence Alignments (MSAs) in a single command.
+
+ ---
+
+ ## 📦 Installation
+
+ You can install `clustrX` using two main methods. Note the difference in dependency management:
+
+ ### Option A: Via Conda (Recommended)
+ This is the easiest way, as it automatically installs all external dependencies, including **MAFFT** for alignments.
+ ```bash
+ conda install -c bioconda clustrx
+ ```
+
+ ### Option B: Via Pip (Using a Virtual Environment)
+ To avoid conflicts with other packages and ensure the `clustrx` command is correctly recognized by your system (avoiding PATH issues), we highly recommend using a virtual environment:
+
+ 1. **Create a new environment**:
+    ```bash
+    python -m venv clustrx_env
+    ```
+ 2. **Activate it**:
+    * **Windows**: `clustrx_env\Scripts\activate`
+    * **Linux/macOS**: `source clustrx_env/bin/activate`
+ 3. **Install**:
+    ```bash
+    pip install clustrX
+    ```
+
+ > [!TIP]
+ > **If the `clustrx` command is not recognized** after installation (common on Windows), it is likely because the installation directory is not in your system's PATH. You can either add it manually or use the following foolproof method:
+ > `python -m clustrx [arguments]`
+
+ *Note: If you use Pip, remember that you must **install MAFFT manually** on your system if you plan to use the `--mafft` option.*
+
+ ---
+
+ ## ⚙️ Input Formats & Requirements
+
+ `clustrX` is designed to be a post-processing layer. It requires two main inputs:
+ 1. **Similarity Hits**: A tabular file (BLAST-like or HMMER).
+ 2. **Sequences**: A FASTA file containing the sequences referenced in the hits.
+
+ ### Using BLAST
+ `clustrX` works natively with the default tabular output of BLAST (`-outfmt 6`).
+ ```bash
+ blastp -query sequences.fasta -db database -out hits.tsv -outfmt 6
+ ```
+
+ ### Using Diamond or MMseqs2
+ If you use these tools, you **must** ensure the output is in **BLAST tabular format (outfmt 6)**:
+
+ * **Diamond**:
+   ```bash
+   diamond blastp -q query.fasta -d db.dmnd -o hits.tsv --outfmt 6
+   ```
+ * **MMseqs2**:
+   ```bash
+   mmseqs easy-search query.fasta target.fasta hits.tsv tmp --format-mode 0
+   ```
+
+ ### Using HMMER
+ HMMER outputs require specific flags depending on the filtering level you need:
+
+ * **`domtblout` (Recommended)**: Use the `--domtblout` flag in `hmmsearch` or `phmmer`. This format provides alignment coordinates, which are **required** for using the **Dynamic Coverage** filter.
+   ```bash
+   hmmsearch --domtblout hits.domtblout profile.hmm database.fasta
+   ```
+ * **`tblout`**: Use the `--tblout` flag. Note that this format lacks coordinate information; therefore, **Dynamic Coverage cannot be applied** (only E-value and Bitscore filters will be used).
+   ```bash
+   hmmsearch --tblout hits.tblout profile.hmm database.fasta
+   ```
+
+ ---
+
+ ## 🧬 The Power of Dynamic Coverage
+
+ We strongly recommend using the **Dynamic Coverage** mode (`--coverage dynamic`) for most scientific applications. For more information, please read the paper.
+
+ Standard clustering methods often use fixed thresholds that fail to resolve relationships between sequences of very different sizes. Our dynamic filter uses a **hyperbolic decay function** (calibrated with a 50-residue scale factor) that:
+ 1. Increases stringency for **short peptides** (up to 0.8 coverage) to filter out statistical noise.
+ 2. Gradually relaxes for **larger proteins** (down to 0.4 coverage) to maximize sensitivity in detecting remote homology.
+
+ ---
+
+ ## 🛠️ Workflow & Usage
+
+ The `clustrX` pipeline follows a clear 3-step logic:
+
+ 1. **Filter**: Hits are filtered based on E-value, Bitscore, and (recommended) Dynamic Coverage.
+ 2. **Cluster**: A similarity network is built where edges are weighted by Bitscore, then partitioned using the Leiden algorithm.
+ 3. **Output**: Results are exported. **Note: FASTA generation and alignments are optional.**
+
+ ### Example: Recommended Scientific Run
+ ```bash
+ clustrx -i hits.tsv -f sequences.fasta --coverage dynamic --write-fasta --mafft --outdir results_full
+ ```
+ * `--write-fasta`: (Optional) Creates a FASTA file for each generated cluster.
+ * `--mafft`: (Optional) Automatically performs Multiple Sequence Alignment for each cluster.
+
+ ---
+
+ ## 💡 Use Cases
+
+ * **Protein Family Discovery**: Organizing large proteomes into evolutionarily related groups.
+ * **Short Peptide Classification**: Specifically tuned for the discovery of **Antimicrobial Peptides (AMPs)**, toxins, signaling peptides, and others.
+ * **Remote Homology Exploration**: Identifying relationships in the "twilight zone" (identity < 30%) where traditional greedy methods fragment families.
+ * **Domain-Aware Clustering**: Using HMMER `domtblout` inputs to cluster sequences based on specific functional domains.
+
+ ---
+
+ ## 📝 Citation
+ If you use **clustrX** in your research, please cite:
+ > Benítez-Prián, M. & San Mauro, D. (2026). clustrX: Highly Robust and Sensitive Protein Clustering Using Similarity Networks and Leiden Community Detection.
+
+ ## 👤 Authors
+ **Mario Benítez-Prián** & **Diego San Mauro**
+
+ Contact: [mario.benitezprian@gmail.com](mailto:mario.benitezprian@gmail.com) | [GitHub](https://github.com/mario-benitez-prian)
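Reviewer note: the README describes the Dynamic Coverage filter only in words (a hyperbolic decay from 0.8 down to 0.4 with a 50-residue scale factor). The sketch below shows one function with those properties. The exact formula used by clustrX is not part of this diff, so the functional form, the choice of the longer sequence as reference length, and the names `dynamic_coverage_threshold` and `passes_dynamic_coverage` are all assumptions made for illustration.

```python
def dynamic_coverage_threshold(length, high=0.8, low=0.4, scale=50.0):
    """Assumed hyperbolic decay: `high` at length 0, approaching `low` for long sequences.

    ASSUMPTION: this form merely matches the README's verbal description;
    it is not taken from the clustrX source.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    return low + (high - low) / (1.0 + length / scale)


def passes_dynamic_coverage(aln_length, query_length, target_length):
    """Check alignment coverage of the longer sequence against the threshold.

    ASSUMPTION: coverage is measured on the longer of the two sequences,
    following the CLI help text for --coverage.
    """
    longest = max(query_length, target_length)
    if longest == 0:
        return False
    coverage = aln_length / longest
    return coverage >= dynamic_coverage_threshold(longest)


# The threshold tightens for short peptides and relaxes for long proteins.
for n in (10, 50, 200, 1000):
    print(n, round(dynamic_coverage_threshold(n), 3))
```

Under this assumed form, a 50-residue sequence sits exactly halfway between the two bounds (0.6), which is one natural reading of "scale factor".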
@@ -0,0 +1,19 @@
+ README.md
+ pyproject.toml
+ clustrX.egg-info/PKG-INFO
+ clustrX.egg-info/SOURCES.txt
+ clustrX.egg-info/dependency_links.txt
+ clustrX.egg-info/entry_points.txt
+ clustrX.egg-info/requires.txt
+ clustrX.egg-info/top_level.txt
+ clustrx/__init__.py
+ clustrx/__main__.py
+ clustrx/cli.py
+ clustrx/clustrx.py
+ tests/test_clustrx_logic.py
+ tests/test_custom_format.py
+ tests/test_filtering.py
+ tests/test_hmmer_format.py
+ tests/test_integration.py
+ tests/test_scientific_rigor.py
+ tests/test_validation.py
@@ -0,0 +1,2 @@
+ [console_scripts]
+ clustrx = clustrx.cli:main
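Reviewer note: the `console_scripts` entry above maps the `clustrx` command to the object named by the `module:attribute` string `clustrx.cli:main`. As a rough sketch of how such a string is resolved, the example below imports the module part and walks the attribute path. The helper `resolve_entry_point` is hypothetical, and a standard-library target (`json:dumps`) is used so the example runs without clustrX installed.

```python
from importlib import import_module


def resolve_entry_point(spec):
    """Resolve a 'package.module:attr.path' string to the object it names.

    Hypothetical helper for illustration; installers generate an
    equivalent launcher script for each console_scripts entry.
    """
    module_name, _, attr_path = spec.partition(":")
    obj = import_module(module_name)
    for attr in attr_path.split("."):
        if attr:
            obj = getattr(obj, attr)
    return obj


# With clustrX installed, resolve_entry_point("clustrx.cli:main") would
# return the CLI's main() function. Here a stdlib function is used instead.
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```

This is also why `python -m clustrx` works as a fallback: `clustrx/__main__.py` imports and calls the same `main` function.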
@@ -0,0 +1,7 @@
+ polars>=0.19.0
+ igraph>=0.10.0
+ numpy>=1.20.0
+ psutil>=5.8.0
+
+ [test]
+ pytest>=7.0
@@ -0,0 +1 @@
+ clustrx
File without changes
@@ -0,0 +1,4 @@
+ from .cli import main
+
+ if __name__ == "__main__":
+     main()
@@ -0,0 +1,140 @@
1
+ import argparse
2
+ import polars as pl
3
+ from pathlib import Path
4
+ from .clustrx import read_hits, get_fasta_info, build_clusters, write_clusters
5
+
6
+ def main():
7
+ """
8
+ Main entry point for the clustRX CLI.
9
+
10
+ This function handles:
11
+ 1. Argument parsing (BLAST/HMMER formats, filters, clustering parameters).
12
+ 2. Sequence discovery and indexing from FASTA headers (Selective I/O).
13
+ 3. Edge processing and graph-based clustering (Leiden algorithm).
14
+ 4. Output generation (FASTA files and optional MSAs).
15
+ """
16
+ parser = argparse.ArgumentParser(
17
+ description=(
18
+ "clustRX: Robust and Sensitive Sequence Clustering using Similarity Networks and Leiden Community Detection\n\n"
19
+ "A decoupled clustering framework supporting BLAST, DIAMOND, HMMER, and MMseqs2 outputs.\n"
20
+ "Features Adaptive Dynamic Coverage filtering and high-speed parsing via Polars & igraph.\n\n"
21
+ "Authors: Mario Benítez-Prián and Diego San Mauro | Please cite clustrX if used in your research.\n"
22
+ ),
23
+ formatter_class=argparse.RawDescriptionHelpFormatter
24
+ )
25
+ parser.add_argument("-i", "--input", required=True, help="Input hits file (BLAST tabular or HMMER tblout)")
26
+ parser.add_argument("-o", "--outdir", default="clustrx_output", help="Output directory")
27
+ parser.add_argument("-c", "--coverage", help="Alignment coverage filter (0.0-1.0 or 'dynamic'). Percentage of the longest sequence covered by the alignment.")
28
+ parser.add_argument("-fmt", "--format", choices=["blast", "hmmer", "domhmmer", "tblhmmer", "mmseqs", "custom"], help="Input format (optional, auto-detected if not provided)")
29
+ parser.add_argument("-f", "--fasta", required=True, help="FASTA file with all sequences")
30
+ parser.add_argument("-min", "--min-cluster-size", type=int, default=2, help="Minimum cluster size to output (default=2).")
31
+ parser.add_argument("-e", "--evalue", type=float, help="E-value threshold for filtering hits (default: no filter)")
32
+ parser.add_argument("-b", "--bitscore", type=float, help="Bitscore threshold for filtering hits (default: no filter)")
33
+ parser.add_argument("-pi", "--pidentity", type=float, help="Minimum percentage identity (0-100) (default: no filter)")
34
+ parser.add_argument("--id-override", type=float, default=90.0, help="Identity threshold to override coverage filter (default: 90.0).")
35
+ parser.add_argument("--seed", type=int, help="Seed for reproducibility (default: None)")
36
+ parser.add_argument("--resolution", type=float, default=1.0, help="Resolution parameter for Leiden algorithm (default: 1.0)")
37
+ parser.add_argument("--mafft", action="store_true", help="Automatically run MAFFT on generated clusters (requires MAFFT installed)")
38
+ parser.add_argument("--write-fasta", action="store_true", help="Generate FASTA files and alignments for clusters (default: False).")
39
+
40
+ # Custom format groups
41
+ custom_group = parser.add_argument_group('Custom Format Options', 'Specify 0-indexed column numbers for custom tabular inputs (column 1 in file is column 0).')
42
+ custom_group.add_argument("--col-query", type=int, help="Column index for query ID")
43
+ custom_group.add_argument("--col-target", type=int, help="Column index for target ID")
44
+ custom_group.add_argument("--col-bitscore", type=int, help="Column index for alignment score (bitscore)")
45
+ custom_group.add_argument("--col-evalue", type=int, help="Column index for E-value")
46
+ custom_group.add_argument("--col-pident", type=int, help="Column index for percentage identity")
47
+ custom_group.add_argument("--col-length", type=int, help="Column index for alignment length (used in dynamic coverage). Query and target lengths are extracted from the fasta file.")
48
+ args = parser.parse_args()
49
+
50
+ if args.mafft:
51
+ import shutil
52
+ if not shutil.which("mafft"):
53
+ print("Error: MAFFT is not installed or not found in system PATH. Please install it (e.g., 'sudo apt install mafft' or via Bioconda) to use the --mafft flag.")
54
+ return
55
+
56
+ # SELECTIVE I/O: Read only names and lengths to build the index.
57
+ # This is extremely memory-efficient and fast.
58
+ names, lengths = get_fasta_info(args.fasta)
59
+
60
+ # PERFORMANCE HACK: Use Polars Enum for O(1) string matching.
61
+ # This maps sequence names to internal IDs for fast hit lookups.
62
+ name_enum = pl.Enum(names)
63
+
64
+ # Create the ID mapping and lengths DataFrames for Polars
65
+ mapping_df = pl.DataFrame({
66
+ "name": names,
67
+ "id": list(range(len(names)))
68
+ }).with_columns([
69
+ pl.col("name").cast(name_enum),
70
+ pl.col("id").cast(pl.Int32)
71
+ ])
72
+
73
+ lengths_df = pl.DataFrame({
74
+ 'name': names,
75
+ 'len': [lengths.get(n, 0) for n in names]
76
+ }).with_columns([
77
+ pl.col('name').cast(name_enum),
78
+ pl.col('len').cast(pl.Int32)
79
+ ])
80
+
81
+ # Process coverage arg
82
+ cov_val = None
83
+ if args.coverage:
84
+ if args.coverage.lower() == 'dynamic':
85
+ cov_val = 'dynamic'
86
+ else:
87
+ try:
88
+ cov_val = float(args.coverage)
89
+ except ValueError:
90
+ print(f"Error: coverage must be a float or 'dynamic'. Got '{args.coverage}'")
91
+ return
92
+
93
+ # Process custom cols
94
+ custom_cols = {}
95
+ if args.col_query is not None: custom_cols['q'] = args.col_query
96
+ if args.col_target is not None: custom_cols['t'] = args.col_target
97
+ if args.col_bitscore is not None: custom_cols['bitscore'] = args.col_bitscore
98
+ if args.col_evalue is not None: custom_cols['evalue'] = args.col_evalue
99
+ if args.col_pident is not None: custom_cols['pident'] = args.col_pident
100
+ if args.col_length is not None: custom_cols['length'] = args.col_length
101
+
102
+ input_format = args.format
103
+ if custom_cols and input_format is None:
104
+ input_format = 'custom'
105
+
106
+ # Call read_hits with mapping and lengths
107
+ u_v, weights, v_list = read_hits(
108
+ args.input, format=input_format, evalue=args.evalue, bitscore=args.bitscore,
109
+ pident=args.pidentity, coverage=cov_val,
110
+ mapping_df=mapping_df, lengths_df=lengths_df,
111
+ pident_override=args.id_override,
112
+ custom_cols=custom_cols if input_format == 'custom' else None
113
+ )
114
+
115
+ # Build clusters
116
+ components = build_clusters(
117
+ (u_v, weights, v_list),
118
+ min_size=args.min_cluster_size,
119
+ seed=args.seed,
120
+ resolution=args.resolution
121
+ )
122
+
123
+ # OUTPUT PHASE: Write clusters to disk.
124
+ # Only load the FULL sequences (ATGCs) here if requested or needed for MAFFT.
125
+ if args.write_fasta or args.mafft:
126
+ from .clustrx import read_fasta
127
+ sequences = read_fasta(args.fasta)
128
+ else:
129
+ sequences = None
130
+
131
+ out_clusters = Path(args.outdir) / "clusters"
132
+ out_fastas = Path(args.outdir) / "fasta_files"
133
+ out_alignments = Path(args.outdir) / "alignments" if args.mafft else None
134
+
135
+ write_clusters(components, sequences, out_clusters, out_fastas, out_alignments)
136
+
137
+ print(f"Done. {len(components)} clusters written to {args.outdir}/")
138
+
139
+ if __name__ == "__main__":
140
+ main()
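Reviewer note: `main()` parses `--coverage` by hand after `parse_args()` and returns silently (exit status 0) on a bad value. One alternative, sketched here and not what the package does, is to move the check into an argparse `type=` callable so that invalid input produces a usage error and a non-zero exit status. The function name `coverage_arg` and the 0.0 to 1.0 range check are assumptions added for illustration; the released code does not validate the range.

```python
import argparse


def coverage_arg(value):
    """Parse --coverage as the literal 'dynamic' or a float in [0.0, 1.0].

    Hypothetical alternative to the inline parsing in main(); the range
    check is an added assumption based on the help text.
    """
    if value.lower() == "dynamic":
        return "dynamic"
    try:
        cov = float(value)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"coverage must be a float or 'dynamic', got {value!r}"
        )
    if not 0.0 <= cov <= 1.0:
        raise argparse.ArgumentTypeError(
            f"coverage must be between 0.0 and 1.0, got {cov}"
        )
    return cov


parser = argparse.ArgumentParser(prog="coverage-demo")
parser.add_argument("-c", "--coverage", type=coverage_arg)

print(parser.parse_args(["-c", "Dynamic"]).coverage)
print(parser.parse_args(["-c", "0.5"]).coverage)
```

With this design, argparse itself reports the error and exits with status 2, so shell pipelines can detect the failure.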