yamcot 1.0.0__cp314-cp314-macosx_11_0_arm64.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- yamcot/__init__.py +46 -0
- yamcot/_core/__init__.py +17 -0
- yamcot/_core/_core.cpython-314-darwin.so +0 -0
- yamcot/_core/bindings.cpp +28 -0
- yamcot/_core/core_functions.h +29 -0
- yamcot/_core/fasta_to_plain.h +182 -0
- yamcot/_core/mco_prc.cpp +1476 -0
- yamcot/_core/pfm_to_pwm.h +130 -0
- yamcot/cli.py +621 -0
- yamcot/comparison.py +1066 -0
- yamcot/execute.py +97 -0
- yamcot/functions.py +787 -0
- yamcot/io.py +522 -0
- yamcot/models.py +1161 -0
- yamcot/pipeline.py +402 -0
- yamcot/ragged.py +126 -0
- yamcot-1.0.0.dist-info/METADATA +433 -0
- yamcot-1.0.0.dist-info/RECORD +21 -0
- yamcot-1.0.0.dist-info/WHEEL +6 -0
- yamcot-1.0.0.dist-info/entry_points.txt +3 -0
- yamcot-1.0.0.dist-info/licenses/LICENSE +21 -0
|
@@ -0,0 +1,433 @@
|
|
|
1
|
+
Metadata-Version: 2.2
|
|
2
|
+
Name: yamcot
|
|
3
|
+
Version: 1.0.0
|
|
4
|
+
Summary: Universal motif comparison tool
|
|
5
|
+
Author-Email: Anton Tsukanov <tsukanov@bionet.nsc.ru>
|
|
6
|
+
License: MIT
|
|
7
|
+
Classifier: Development Status :: 4 - Beta
|
|
8
|
+
Classifier: Intended Audience :: Science/Research
|
|
9
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
10
|
+
Classifier: Operating System :: OS Independent
|
|
11
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
12
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
13
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
15
|
+
Classifier: Programming Language :: Python :: 3.14
|
|
16
|
+
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
|
|
17
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
18
|
+
Project-URL: Homepage, https://github.com/ubercomrade/yamcot
|
|
19
|
+
Project-URL: Repository, https://github.com/ubercomrade/yamcot
|
|
20
|
+
Project-URL: Documentation, https://github.com/ubercomrade/yamcot#readme
|
|
21
|
+
Requires-Python: >=3.10
|
|
22
|
+
Requires-Dist: numpy<2.4,>=2.0
|
|
23
|
+
Requires-Dist: numba>=0.62.0
|
|
24
|
+
Requires-Dist: scipy>=1.14.1
|
|
25
|
+
Requires-Dist: pandas>=2.2.3
|
|
26
|
+
Requires-Dist: joblib>=1.5.3
|
|
27
|
+
Description-Content-Type: text/markdown
|
|
28
|
+
|
|
29
|
+
# YAMCOT
|
|
30
|
+
|
|
31
|
+
YAMCOT is Another Motif COmparison Tool (YAMCOT) designed to support comparisons across different motif model types.
|
|
32
|
+
|
|
33
|
+
## Introduction
|
|
34
|
+
|
|
35
|
+
Transcription factors (TFs) serve as fundamental regulators of gene expression levels. These proteins modulate the activity of the RNA polymerase complex by binding to specific DNA sequences located within regulatory regions, such as promoters and enhancers [^1]. The specific DNA segment recognized by a TF is termed a transcription factor binding site (TFBS). TFBSs for a given TF are typically similar but not identical; therefore, they are described using *motifs* that capture the variability of the recognized sequences [^2]. A variety of high-throughput experimental methods, including ChIP-seq, HT-SELEX, and DAP-seq, are currently used to identify TFBS motifs [^3][^4][^5]. While motifs are most frequently represented as Position Weight Matrices (PWMs), a standard supported by widely used *de novo* motif discovery tools like MEME [^6], STREME [^7], and HOMER [^8], the field has increasingly adopted alternative models to capture complex nucleotide dependencies. These include diverse variants of Markov models (BaMMs, InMoDe, DIMONT, etc.) [^9][^10][^11][^12][^13][^14], which account for higher-order dependencies that PWMs ignore, as well as models based on locally positioned dinucleotides (SiteGA) [^15][^16] and deep learning architectures (DeepBind, DeeperBind, DeepGRN, etc.) [^17][^18][^19][^20][^21].
|
|
36
|
+
|
|
37
|
+
The identification of a motif is only the first step; establishing its biological context requires robust comparison methods. Comparing motifs is essential for determining whether a newly discovered pattern represents a novel specificity or a variation of a known factor, for clustering redundant motifs identified across different experiments, and for inferring functional relationships between TFs based on binding similarity. Several established tools address this need, including Tomtom [^22], STAMP [^23], MACRO-APE [^24] and MoSBAT [^25]. These tools utilize various metrics and algorithms to quantify similarity, ranging from column-wise matrix correlations to Jaccard index-based comparisons of recognized site sets. However, a significant limitation of the current software ecosystem is its heavy reliance on matrix-based representations (PFMs or PWMs). This constraint makes it challenging to directly compare alternative models, such as Markov models or dinucleotide models, without converting them into simpler matrix formats, a process that often results in information loss.
|
|
38
|
+
|
|
39
|
+
To address these limitations, we introduce YAMCOT, a comprehensive framework designed to facilitate the comparison of diverse motif models beyond standard frequency matrices. YAMCOT implements four distinct modes of comparison to accommodate various analytical needs. The first and most universal mode involves the direct comparison of TFBS recognition profiles generated by different motifs, conceptually similar to affinity-based approaches [^25]. This allows similarity to be assessed from the functional output of the models (the scores assigned to sequences) rather than from their internal parameters. The second mode uses the same underlying approach but lets the user explicitly define the model architecture; currently, YAMCOT supports three model types: PWM, BaMM, and SiteGA, with an extensible architecture designed to accommodate future model types. The third mode incorporates MoTaLi (see details in). Finally, the fourth mode provides Tomtom-like functionality for scenarios where models can be represented as N-dimensional matrices. In this mode, if the models are in compatible matrix formats, they are compared using standard metrics such as the Pearson Correlation Coefficient (PCC), Euclidean Distance (ED), and cosine similarity. Crucially, if the models are of heterogeneous types (e.g., comparing a BaMM to a PWM), YAMCOT scans sequences to generate recognition profiles, which are then used to reconstruct compatible Position Frequency Matrices for comparison, so that even fundamentally different model types can be quantitatively evaluated within a single framework.
|
|
40
|
+
|
|
41
|
+
### Methodology
|
|
42
|
+
|
|
43
|
+
#### Similarity Metrics
|
|
44
|
+
|
|
45
|
+
YAMCOT implements several metrics to quantify the resemblance between motif importance profiles or matrix columns.
|
|
46
|
+
|
|
47
|
+
**Continuous Jaccard (CJ)**
|
|
48
|
+
The Continuous Jaccard index extends the classical Jaccard similarity to continuous-valued vectors $v_1, v_2$. It is defined as the ratio of the sum of element-wise intersections to the sum of element-wise unions:
|
|
49
|
+
$$\text{CJ}(v_1, v_2) = \frac{\sum_i \min(v_1^i, v_2^i)}{\sum_i \max(v_1^i, v_2^i)}$$
|
|
50
|
+
This metric is equivalent to averaging the binary Jaccard index across all possible thresholds, providing a threshold-independent measure of profile similarity.
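
As a minimal sketch of the CJ formula above, the following plain-Python function computes the ratio of summed minima to summed maxima. The function name, the example profiles, and the zero-union convention are illustrative assumptions rather than YAMCOT's own code, and the profiles are assumed to be non-negative.

```python
def continuous_jaccard(v1, v2):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    if len(v1) != len(v2):
        raise ValueError("profiles must have the same length")
    intersection = sum(min(a, b) for a, b in zip(v1, v2))
    union = sum(max(a, b) for a, b in zip(v1, v2))
    # Assumption: two all-zero profiles have an empty union; report 0.0.
    return intersection / union if union > 0 else 0.0


# Hypothetical non-negative score profiles of equal length.
profile_a = [0.0, 0.5, 1.0, 0.25]
profile_b = [0.0, 0.25, 1.0, 0.5]

cj = continuous_jaccard(profile_a, profile_b)
print(cj)  # minima sum to 1.5, maxima sum to 2.0
```

Identical profiles give a value of 1, and profiles with no shared signal give 0.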
|
|
51
|
+
|
|
52
|
+
**Continuous Overlap (CO)**
|
|
53
|
+
The Continuous Overlap coefficient (or Szymkiewicz-Simpson coefficient) measures the subset relationship between two profiles, normalizing the intersection by the smaller of the two total affinities:
|
|
54
|
+
$$\text{CO}(v_1, v_2) = \frac{\sum_i \min(v_1^i, v_2^i)}{\min\left(\sum_i v_1^i, \sum_i v_2^i\right)}$$
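
The subset behaviour of CO can be illustrated with a short sketch. The function name and the example profiles are assumptions made for illustration; the code simply follows the formula above for non-negative profiles.

```python
def continuous_overlap(v1, v2):
    """Summed element-wise minima divided by the smaller profile total."""
    if len(v1) != len(v2):
        raise ValueError("profiles must have the same length")
    intersection = sum(min(a, b) for a, b in zip(v1, v2))
    smaller_total = min(sum(v1), sum(v2))
    # Assumption: if either profile is all zeros, report 0.0.
    return intersection / smaller_total if smaller_total > 0 else 0.0


# A narrow profile fully contained in a broad one (hypothetical values).
broad = [1.0, 1.0, 1.0, 1.0]
narrow = [0.0, 1.0, 0.5, 0.0]

co = continuous_overlap(broad, narrow)
print(co)
```

Here the narrow profile lies entirely under the broad one, so CO reaches its maximum even though the two profiles are far from identical. This is the case in which CO and CJ differ most.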
|
|
55
|
+
|
|
56
|
+
**Pearson Correlation Coefficient (PCC)**
|
|
57
|
+
For linear correlation between profiles or motif columns, the PCC is calculated as:
|
|
58
|
+
$$\text{PCC}(v_1, v_2) = \frac{\sum_i (v_1^i - \bar{v}_1)(v_2^i - \bar{v}_2)}{\sqrt{\sum_i (v_1^i - \bar{v}_1)^2 \sum_i (v_2^i - \bar{v}_2)^2}}$$
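
A direct transcription of the PCC formula into plain Python might look like the sketch below. The function name and the handling of constant vectors (returning 0.0) are assumptions, not a description of YAMCOT's implementation.

```python
import math


def pearson(v1, v2):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(v1)
    if n != len(v2) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean1 = sum(v1) / n
    mean2 = sum(v2) / n
    covariance = sum((a - mean1) * (b - mean2) for a, b in zip(v1, v2))
    var1 = sum((a - mean1) ** 2 for a in v1)
    var2 = sum((b - mean2) ** 2 for b in v2)
    denominator = math.sqrt(var1 * var2)
    # Assumption: a constant vector has no defined correlation; return 0.0.
    return covariance / denominator if denominator > 0 else 0.0


print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Unlike CJ and CO, the PCC is insensitive to scale and offset, so it also works for vectors that contain negative values.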
|
|
59
|
+
|
|
60
|
+
#### Null Hypothesis and Surrogate Generation
|
|
61
|
+
|
|
62
|
+
To estimate the statistical significance (p-values) of observed similarity scores, YAMCOT employs a **Surrogate Null Model**. Unlike simple permutations that destroy local dependencies, our tool generates synthetic "surrogate" profiles that preserve the marginal properties and biological plausibility (smoothness) of the original data.
|
|
63
|
+
|
|
64
|
+
1. **Convolutional Distortion**: For profile-based surrogates, the following distortion steps are applied:
    * **Kernel Selection**: A base kernel (smooth, edge, or double-peak) is selected to represent typical profile features.
    * **Controlled Perturbation**: Noise and gradient bias are added to introduce variation while maintaining structural integrity.
    * **Smoothing**: Convolution ensures the surrogate remains biologically realistic.
    * **Convex Combination**: The final surrogate is a blend of the identity kernel and the distorted kernel, controlled by a user-defined distortion parameter.

2. **Permutation**: For matrix-based comparisons (`tomtom-like`), the tool performs random column-wise permutations.
|
|
71
|
+
|
|
72
|
+
This methodology ensures that the null distribution reflects realistic background similarity.
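
The convolutional distortion steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not YAMCOT's implementation: the kernel shapes, the noise and gradient magnitudes, the normalization, and all function names are hypothetical. Only the overall recipe (base kernel, perturbation, convex blend with the identity kernel, convolution) follows the description above.

```python
import math
import random


def bump(i, mu, sigma):
    """Unnormalized Gaussian bump centred at mu."""
    return math.exp(-((i - mu) ** 2) / (2 * sigma ** 2))


def base_kernel(kind, size):
    """Illustrative base shapes; the real kernel definitions are not known here."""
    centre = (size - 1) / 2
    width = size / 4
    if kind == "smooth":
        return [bump(i, centre, width) for i in range(size)]
    if kind == "edge":
        return [float((i > centre) - (i < centre)) for i in range(size)]
    if kind == "double-peak":
        return [bump(i, centre - width, width / 2) + bump(i, centre + width, width / 2)
                for i in range(size)]
    raise ValueError(f"unknown kernel kind: {kind}")


def convolve_same(signal, kernel):
    """Discrete convolution with zero padding; output has the input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + half - k
            if 0 <= j < len(signal):
                total += weight * signal[j]
        out.append(total)
    return out


def make_surrogate(profile, distortion, rng, kernel_size=5):
    """Blend the identity kernel with a perturbed kernel, then convolve."""
    kind = rng.choice(["smooth", "edge", "double-peak"])
    kernel = base_kernel(kind, kernel_size)
    # Controlled perturbation: Gaussian noise plus a linear gradient bias
    # (both magnitudes are arbitrary choices for this sketch).
    centre = (kernel_size - 1) / 2
    slope = rng.uniform(-0.1, 0.1)
    kernel = [w + rng.gauss(0.0, 0.05) + slope * (i - centre)
              for i, w in enumerate(kernel)]
    scale = sum(abs(w) for w in kernel) or 1.0
    kernel = [w / scale for w in kernel]
    # Convex combination: distortion=0 keeps the profile, 1 uses the kernel only.
    identity = [1.0 if i == kernel_size // 2 else 0.0 for i in range(kernel_size)]
    mixed = [(1 - distortion) * e + distortion * w for e, w in zip(identity, kernel)]
    return convolve_same(profile, mixed)


rng = random.Random(7)
profile = [0.0, 0.2, 1.0, 0.4, 0.1, 0.0]
surrogate = make_surrogate(profile, distortion=0.4, rng=rng)
```

With `distortion=0.0` the blended kernel reduces to the identity and the surrogate equals the input profile. Larger values move the surrogate further from the original while keeping it smooth.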
|
|
73
|
+
|
|
74
|
+
## Installation
|
|
75
|
+
|
|
76
|
+
YAMCOT requires **Python 3.10 or higher**.
|
|
77
|
+
|
|
78
|
+
### From PyPI (Recommended)
|
|
79
|
+
|
|
80
|
+
The easiest way to install YAMCOT is via `pip` or `uv`. This will automatically download and install all required dependencies.
|
|
81
|
+
|
|
82
|
+
```bash
# Using uv (Fastest)
uv pip install yamcot

# Using pip
pip install yamcot
```
|
|
89
|
+
|
|
90
|
+
### From Source
|
|
91
|
+
|
|
92
|
+
If you want to contribute to development or build the latest version from the repository, you will need a C++ compiler with **C++17 support** (e.g., GCC, Clang, or MSVC).
|
|
93
|
+
|
|
94
|
+
```bash
# Clone the repository
git clone https://github.com/ubercomrade/yamcot.git
cd yamcot

# Install in editable mode
pip install -e .
```
|
|
102
|
+
|
|
103
|
+
### Dependencies
|
|
104
|
+
|
|
105
|
+
When installing via `pip`, the following dependencies are resolved automatically:
|
|
106
|
+
|
|
107
|
+
* `numpy` (>= 2.0, < 2.4)
* `numba` (>= 0.62.0)
* `scipy` (>= 1.14.1)
* `pandas` (>= 2.2.3)
* `joblib` (>= 1.5.3)
|
|
112
|
+
|
|
113
|
+
### Build Requirements (Source only)
|
|
114
|
+
|
|
115
|
+
To build the C++ extension from source, the following tools are used:
|
|
116
|
+
|
|
117
|
+
* `scikit-build-core` (>= 0.10)
* `nanobind` (>= 2.0)
|
|
119
|
+
|
|
120
|
+
## CLI Reference
|
|
121
|
+
|
|
122
|
+
The `yamcot` tool provides four main operation modes.
|
|
123
|
+
|
|
124
|
+
### `profile` mode
|
|
125
|
+
|
|
126
|
+
Compare motifs based on pre-calculated score profiles.
|
|
127
|
+
|
|
128
|
+
**Input**: Text files with numerical scores (comma, tab, or space-separated).
|
|
129
|
+
**Example Data**: [`examples/scores_1.fasta`](examples/scores_1.fasta)
|
|
130
|
+
|
|
131
|
+
```bash
# in the `examples` directory
yamcot profile scores_1.fasta scores_2.fasta \
    --metric cj \
    --perm 1000 \
    --distortion 0.5 \
    --search-range 10
```
|
|
139
|
+
|
|
140
|
+
**All parameters for `profile` mode**:
|
|
141
|
+
|
|
142
|
+
| Argument | Value | Comment |
| :--- | :--- | :--- |
| `profile1` | Path | Path to the first profile file (FASTA-like format). |
| `profile2` | Path | Path to the second profile file (FASTA-like format). |
| `--metric` | `cj`, `co`, `corr` | Similarity metric: Continuous Jaccard, Continuous Overlap, or Pearson Correlation (default: `cj`). |
| `--perm` | Integer | Number of permutations for p-value calculation (default: 0). |
| `--distortion` | Float | Distortion level (0.0-1.0) for surrogate generation (default: 0.4). |
| `--search-range` | Integer | Maximum offset range to explore when aligning profiles (default: 10). |
| `--min-kernel-size` | Integer | Minimum kernel size for surrogate convolution (default: 3). |
| `--max-kernel-size` | Integer | Maximum kernel size for surrogate convolution (default: 11). |
| `--seed` | Integer | Global random seed for reproducibility. |
| `--jobs` | Integer | Number of parallel jobs (-1 uses all cores) (default: -1). |
| `-v`, `--verbose` | Flag | Enable verbose logging. |
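
One way to picture `--search-range` is as a bounded search over relative shifts between the two profiles, keeping the shift that maximizes the chosen metric. The sketch below assumes that only the overlapping part of the profiles is compared at each shift. How YAMCOT actually treats the non-overlapping ends is not stated here, and `best_offset` is a hypothetical name.

```python
def continuous_jaccard(v1, v2):
    """Continuous Jaccard for equal-length non-negative profiles."""
    intersection = sum(min(a, b) for a, b in zip(v1, v2))
    union = sum(max(a, b) for a, b in zip(v1, v2))
    return intersection / union if union > 0 else 0.0


def best_offset(p1, p2, search_range):
    """Return (best_score, offset) over offsets in [-search_range, search_range]."""
    best = (0.0, 0)
    for offset in range(-search_range, search_range + 1):
        # A positive offset shifts p2 to the right relative to p1.
        if offset >= 0:
            a, b = p1[offset:], p2
        else:
            a, b = p1, p2[-offset:]
        n = min(len(a), len(b))
        if n == 0:
            continue  # no overlap left at this offset
        score = continuous_jaccard(a[:n], b[:n])
        if score > best[0]:
            best = (score, offset)
    return best


# The same peak, displaced by two positions (hypothetical profiles).
p1 = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
p2 = [1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

result = best_offset(p1, p2, search_range=3)
print(result)
```

Very large offsets leave only a short overlap, which can inflate the score by chance. Bounding the search range limits that effect.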
|
|
155
|
+
|
|
156
|
+
### `motif` mode
|
|
157
|
+
|
|
158
|
+
Compare motifs by scanning sequences with models and comparing the resulting profiles.
|
|
159
|
+
|
|
160
|
+
**Input**: Motif model files (PWM: `.meme`, `.pfm`; BaMM: `.ihbcp` + `.hbcp`; SiteGA: `.mat`).
|
|
161
|
+
**Example Models**: [`examples/foxa2.meme`](examples/foxa2.meme), [`examples/gata4.meme`](examples/gata4.meme)
|
|
162
|
+
|
|
163
|
+
```bash
# in the `examples` directory
yamcot motif foxa2.meme gata4.meme \
    --model1-type pwm \
    --model2-type pwm \
    --fasta foreground.fa \
    --metric co \
    --perm 1000 \
    --distortion 0.3
```
|
|
173
|
+
|
|
174
|
+
**All parameters for `motif` mode**:
|
|
175
|
+
|
|
176
|
+
| Argument | Value | Comment |
| :--- | :--- | :--- |
| `model1` | Path | Path to the first motif model file. |
| `model2` | Path | Path to the second motif model file. |
| `--model1-type` | `pwm`, `bamm`, `sitega` | Format of the first model (Required). |
| `--model2-type` | `pwm`, `bamm`, `sitega` | Format of the second model (Required). |
| `--fasta` | Path | FASTA file with target sequences. If omitted, random sequences are generated. |
| `--promoters` | Path | FASTA file with promoter sequences for threshold calculation. |
| `--num-sequences` | Integer | Number of random sequences to generate (default: 1000). |
| `--seq-length` | Integer | Length of random sequences (default: 200). |
| `--metric` | `cj`, `co`, `corr` | Similarity metric (default: `cj`). |
| `--perm` | Integer | Number of permutations (default: 0). |
| `--distortion` | Float | Distortion level (default: 0.4). |
| `--search-range` | Integer | Maximum alignment offset (default: 10). |
| `--seed` | Integer | Global random seed. |
| `--jobs` | Integer | Number of parallel jobs (default: -1). |
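
When `--fasta` is omitted, the table above says random sequences are generated according to `--num-sequences`, `--seq-length`, and `--seed`. The sketch below shows one simple way to produce such a set using the integer encoding from the library examples (A=0, C=1, G=2, T=3). Uniform nucleotide frequencies and the function name are assumptions; the background composition YAMCOT actually uses is not stated here.

```python
import random


def random_sequences(num_sequences, seq_length, seed=None):
    """Generate integer-encoded sequences with uniform base frequencies."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    return [
        [rng.randrange(4) for _ in range(seq_length)]
        for _ in range(num_sequences)
    ]


def decode(encoded):
    """Turn an integer-encoded sequence back into a DNA string."""
    return "".join("ACGT"[code] for code in encoded)


sequences = random_sequences(num_sequences=1000, seq_length=200, seed=42)
print(len(sequences), len(sequences[0]))
print(decode(sequences[0][:20]))
```

Fixing the seed makes the generated set, and therefore any comparison computed on it, repeatable across runs.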
|
|
192
|
+
|
|
193
|
+
### `motali` mode
|
|
194
|
+
|
|
195
|
+
Compare motifs by calculating Precision-Recall Curve (PRC) AUC derived from scanning sequences.
|
|
196
|
+
|
|
197
|
+
**Example Models**: [`examples/sitega_gata2.mat`](examples/sitega_gata2.mat), [`examples/gata2.meme`](examples/gata2.meme)
|
|
198
|
+
|
|
199
|
+
```bash
# in the `examples` directory
yamcot motali sitega_gata2.mat gata2.meme \
    --model1-type sitega \
    --model2-type pwm \
    --fasta foreground.fa \
    --promoters background.fa \
    --num-sequences 5000 \
    --seq-length 150
```
|
|
209
|
+
|
|
210
|
+
**All parameters for `motali` mode**:
|
|
211
|
+
|
|
212
|
+
| Argument | Value | Comment |
| :--- | :--- | :--- |
| `model1` | Path | Path to the first motif model file. |
| `model2` | Path | Path to the second motif model file. |
| `--model1-type` | `pwm`, `sitega` | Format of the first model (Required). |
| `--model2-type` | `pwm`, `sitega` | Format of the second model (Required). |
| `--fasta` | Path | FASTA file with target sequences. |
| `--promoters` | Path | FASTA file with promoter sequences (Required for thresholds). |
| `--num-sequences` | Integer | Number of random sequences (default: 10000). |
| `--seq-length` | Integer | Length of random sequences (default: 200). |
| `--tmp-dir` | Path | Directory for temporary files (default: `/tmp`). |
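
For readers unfamiliar with the PRC AUC that `motali` mode reports, the sketch below computes the area under a precision-recall curve from scored items with binary labels, using the trapezoidal rule. It is a generic illustration: how MoTaLi derives its positives and negatives from the two models is not described in this README, so the `scores` and `labels` inputs and the function name are hypothetical.

```python
def pr_auc(scores, labels):
    """Area under the precision-recall curve (trapezoidal rule).

    Items are ranked by descending score and added one at a time;
    tied scores are not grouped in this simplified version.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    prev_recall, prev_precision = 0.0, 1.0  # conventional starting point
    area = 0.0
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        recall = true_pos / total_positives
        precision = true_pos / (true_pos + false_pos)
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Hypothetical scores: both positives are ranked above both negatives.
perfect = pr_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(perfect)
```

A ranking that places every positive above every negative yields an area of 1, and the area shrinks as negatives intrude near the top of the ranking.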
|
|
223
|
+
|
|
224
|
+
### `tomtom-like` mode
|
|
225
|
+
|
|
226
|
+
Compare motifs by direct N-dimensional matrix comparison (column-wise).
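
A column-wise comparison can be sketched as applying a metric to each pair of aligned columns and aggregating the results. The example below assumes equal-length matrices stored as lists of columns, averages over columns, and ignores alignment offsets. Those choices and the function names are illustrative assumptions, not a description of YAMCOT's internals.

```python
import math


def cosine(c1, c2):
    """Cosine similarity between two columns."""
    dot = sum(a * b for a, b in zip(c1, c2))
    norm = math.sqrt(sum(a * a for a in c1)) * math.sqrt(sum(b * b for b in c2))
    return dot / norm if norm > 0 else 0.0


def euclidean(c1, c2):
    """Euclidean distance between two columns (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def columnwise_score(m1, m2, metric):
    """Average a column metric over aligned columns of two matrices."""
    if len(m1) != len(m2):
        raise ValueError("matrices must have the same number of columns")
    return sum(metric(c1, c2) for c1, c2 in zip(m1, m2)) / len(m1)


# Each inner list is one motif position with frequencies for A, C, G, T.
motif_a = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1], [0.25, 0.25, 0.25, 0.25]]
motif_b = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]

print(columnwise_score(motif_a, motif_b, cosine))
print(columnwise_score(motif_a, motif_b, euclidean))
```

Note that cosine similarity and PCC grow with similarity, whereas Euclidean distance shrinks, so the direction of "better" depends on the metric.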
|
|
227
|
+
|
|
228
|
+
**Example Models**: [`examples/pif4.pfm`](examples/pif4.pfm), [`examples/pif4.meme`](examples/pif4.meme)
|
|
229
|
+
|
|
230
|
+
```bash
# in the `examples` directory
yamcot tomtom-like pif4.pfm pif4.meme \
    --model1-type pwm \
    --model2-type pwm \
    --metric cosine \
    --permutations 1000 \
    --pfm-mode \
    --num-sequences 10000 \
    --seq-length 100
```
|
|
241
|
+
|
|
242
|
+
**All parameters for `tomtom-like` mode**:
|
|
243
|
+
|
|
244
|
+
| Argument | Value | Comment |
| :--- | :--- | :--- |
| `model1` | Path | Path to the first motif model file. |
| `model2` | Path | Path to the second motif model file. |
| `--model1-type` | `pwm`, `bamm`, `sitega` | Format of the first model (Required). |
| `--model2-type` | `pwm`, `bamm`, `sitega` | Format of the second model (Required). |
| `--metric` | `pcc`, `ed`, `cosine` | Column-wise metric: Pearson Correlation, Euclidean Distance, or Cosine Similarity (default: `pcc`). |
| `--permutations` | Integer | Number of Monte Carlo permutations for p-value (default: 0). |
| `--permute-rows` | Flag | Shuffle values within columns during permutation. |
| `--pfm-mode` | Flag | Derive PFM by scanning sequences (useful for comparing different model types). |
| `--num-sequences` | Integer | Sequences for PFM mode (default: 20000). |
| `--seq-length` | Integer | Sequence length for PFM mode (default: 100). |
| `--seed` | Integer | Global random seed. |
| `--jobs` | Integer | Number of parallel jobs (default: -1). |
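
The Monte Carlo p-value controlled by `--permutations` can be illustrated with a column-permutation null. The sketch below shuffles the column order of one matrix, rescores it against the other, and counts how often the null score reaches the observed one. The `+1` correction in the p-value is a common convention and an assumption here, as are the function names and the use of mean cosine similarity as the score.

```python
import math
import random


def mean_cosine(m1, m2):
    """Mean cosine similarity over aligned columns."""
    total = 0.0
    for c1, c2 in zip(m1, m2):
        dot = sum(a * b for a, b in zip(c1, c2))
        norm = math.sqrt(sum(a * a for a in c1)) * math.sqrt(sum(b * b for b in c2))
        total += dot / norm if norm > 0 else 0.0
    return total / len(m1)


def permutation_p_value(m1, m2, n_permutations, seed=None):
    """Return (observed score, empirical p-value) under column shuffling."""
    rng = random.Random(seed)
    observed = mean_cosine(m1, m2)
    hits = 0
    for _ in range(n_permutations):
        shuffled = m2[:]       # copy so the caller's matrix is untouched
        rng.shuffle(shuffled)  # permute the column order
        if mean_cosine(m1, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)


motif = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

observed, p_value = permutation_p_value(motif, motif, n_permutations=200, seed=1)
print(observed, p_value)
```

With `n_permutations` set to 0, this estimate is uninformative; the smallest reportable p-value is `1 / (n_permutations + 1)`.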
|
|
258
|
+
|
|
259
|
+
## Library Usage
|
|
260
|
+
|
|
261
|
+
YAMCOT is designed as an extensible framework. You can implement your own motif models by inheriting from the base abstractions provided in [`yamcot/`](yamcot/).
|
|
262
|
+
|
|
263
|
+
### Implementing a Custom Model
|
|
264
|
+
|
|
265
|
+
To create a new model, inherit from the [`MotifModel`](yamcot/models.py) class and implement the required methods. Below is a simplified example of a dinucleotide-based motif model implemented in pure Python/NumPy.
|
|
266
|
+
|
|
267
|
+
```python
import numpy as np
from yamcot.models import MotifModel, RaggedData
from yamcot.pipeline import Pipeline
from yamcot.ragged import ragged_from_list


class SimpleDinucleotideMotif(MotifModel):
    """Example of a custom motif model with dinucleotide dependencies."""

    def __init__(self, matrix, name, length):
        # matrix shape: (16, length - 1) representing 16 possible dinucleotides
        super().__init__(matrix=matrix, name=name, length=length)

    def scan(self, sequences: RaggedData, strand=None) -> RaggedData:
        """
        Scan sequences with the custom model.
        Returns RaggedData containing scores for each position.
        """
        all_scores = []
        for i in range(sequences.num_sequences):
            seq = sequences.get_slice(i)
            # Sequences shorter than the motif yield an empty score array
            if len(seq) < self.length:
                all_scores.append(np.array([], dtype=np.float32))
                continue

            n_pos = len(seq) - self.length + 1
            scores = np.zeros(n_pos, dtype=np.float32)

            # Simple sliding window scoring logic
            for j in range(n_pos):
                subseq = seq[j : j + self.length]
                pos_score = 0.0
                for k in range(self.length - 1):
                    # Calculate dinucleotide index (0-15) for ACGT
                    # Assuming 0=A, 1=C, 2=G, 3=T
                    if subseq[k] < 4 and subseq[k + 1] < 4:
                        dinucl_idx = int(subseq[k] * 4 + subseq[k + 1])
                        pos_score += self.matrix[dinucl_idx, k]
                scores[j] = pos_score
            all_scores.append(scores)

        return ragged_from_list(all_scores, dtype=np.float32)

    @classmethod
    def from_file(cls, path: str, **kwargs) -> "SimpleDinucleotideMotif":
        """Load the model from a file."""
        # Implementation of your file parsing logic
        matrix = np.load(path)
        name = kwargs.get('name', 'custom_motif')
        return cls(matrix, name, length=matrix.shape[1] + 1)

    @property
    def model_type(self) -> str:
        """Unique identifier for the model type."""
        return 'dinucleotide'

    def write(self, path: str):
        """Save the model to a file."""
        np.save(path, self.matrix)


# Register the subclass to enable factory methods and CLI support
MotifModel.register_subclass('dinucleotide', SimpleDinucleotideMotif)
```
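
The indexing used inside `scan` maps each pair of adjacent bases to a row of the 16-row matrix via `first * 4 + second`. The dependency-free sketch below reproduces that arithmetic on plain strings so it can be checked by hand. The helper names and the toy matrix are illustrative and not part of `yamcot`.

```python
ALPHABET = "ACGT"  # same order as the 0=A, 1=C, 2=G, 3=T encoding


def dinucleotide_index(first, second):
    """Row index (0-15) of a dinucleotide in a 16-row weight matrix."""
    return ALPHABET.index(first) * 4 + ALPHABET.index(second)


def score_window(window, matrix):
    """Sum the weights of all adjacent base pairs in one window.

    matrix[d][k] is the weight of dinucleotide d starting at position k.
    """
    return sum(
        matrix[dinucleotide_index(window[k], window[k + 1])][k]
        for k in range(len(window) - 1)
    )


# Toy model for a motif of length 3: 16 dinucleotides x 2 positions.
toy_matrix = [[0.0, 0.0] for _ in range(16)]
toy_matrix[dinucleotide_index("A", "C")][0] = 1.0  # "AC" at position 0
toy_matrix[dinucleotide_index("C", "G")][1] = 2.0  # "CG" at position 1

print(dinucleotide_index("A", "C"), dinucleotide_index("T", "T"))
print(score_window("ACG", toy_matrix))
```

A motif of length `L` therefore needs `L - 1` matrix columns, which is why `from_file` derives the length as `matrix.shape[1] + 1`.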
|
|
330
|
+
|
|
331
|
+
### Key Methods to Override
|
|
332
|
+
|
|
333
|
+
To ensure compatibility with the internal comparison [`Pipeline`](yamcot/pipeline.py), you must override the following methods:
|
|
334
|
+
|
|
335
|
+
| Method | Description |
| :--- | :--- |
| `scan(sequences, strand)` | **Required.** Performs motif scanning on a set of sequences. Must accept `RaggedData` and return `RaggedData` containing position-wise scores. |
| `from_file(path, **kwargs)` | **Required.** Class method to initialize the model from a file path. Enables the use of `MotifModel.create_from_file(path, 'type')`. |
| `model_type` | **Required.** Property returning a unique string identifier for the model class. |
| `write(path)` | **Required.** Method to serialize the model to its native format. |
|
|
341
|
+
|
|
342
|
+
### Example: Running a Comparison
|
|
343
|
+
|
|
344
|
+
```python
import numpy as np
from yamcot.pipeline import Pipeline
from yamcot.ragged import ragged_from_list

# SimpleDinucleotideMotif is the class defined in the previous example.

# 1. Prepare sequences and models
# Encode sequence: A=0, C=1, G=2, T=3 (20 bases, so the length-10 motif fits)
seq_list = [np.array([0, 1, 2, 3] * 5, dtype=np.int8)]
sequences = ragged_from_list(seq_list, dtype=np.int8)

# Initialize custom models
# Matrix for 16 dinucleotides across 9 positions (motif length 10)
m1 = SimpleDinucleotideMotif(np.random.rand(16, 9), "Motif_A", 10)
m2 = SimpleDinucleotideMotif(np.random.rand(16, 9), "Motif_B", 10)

# 2. Execute comparison using the Pipeline
pipeline = Pipeline()
result = pipeline.execute_motif_comparison(
    model1=m1,
    model2=m2,
    sequences=sequences,
    promoters=sequences,  # Used for threshold (FPR) calculation
    comparison_type='motif',
    metric='cj',
    n_permutations=1000
)

print(f"Similarity (CJ): {result['similarity']:.4f}")
print(f"P-value: {result['p_value']:.4e}")
```
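
The example above builds its input from integers directly. When starting from DNA strings, a small encoder along the following lines can produce the same A=0, C=1, G=2, T=3 codes. `encode_dna` is a hypothetical helper, not part of `yamcot`, and mapping every other character to 4 is an assumption chosen to match the `< 4` check in the `scan` example.

```python
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
UNKNOWN = 4  # assumed code for N and other ambiguous characters


def encode_dna(sequence):
    """Map a DNA string to integer codes, case-insensitively."""
    return [ENCODING.get(base, UNKNOWN) for base in sequence.upper()]


encoded = encode_dna("acgtN")
print(encoded)
```

The resulting list can be wrapped with `np.array(encoded, dtype=np.int8)` and passed to `ragged_from_list`, as in the example above.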
|
|
370
|
+
|
|
371
|
+
### Examples
|
|
372
|
+
|
|
373
|
+
The [`examples/`](examples/) directory contains sample data and a [script](examples/run.sh) that demonstrates the tool's capabilities from the CLI.
|
|
374
|
+
|
|
375
|
+
To run a basic comparison:
|
|
376
|
+
|
|
377
|
+
```bash
yamcot motif examples/foxa2.meme examples/gata2.meme \
    --model1-type pwm --model2-type pwm \
    --fasta examples/foreground.fa --metric cj --perm 100
```
|
|
382
|
+
|
|
383
|
+
## Bibliography
|
|
384
|
+
|
|
385
|
+
[^1]: Lambert, S. A., Jolma, A., Campitelli, L. F., Das, P. K., Yin, Y., Albu, M., ... & Weirauch, M. T. (2018). The human transcription factors. _Cell_, _172_(4), 650-665.
|
|
386
|
+
|
|
387
|
+
[^2]: Wasserman, W. W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. _Nature Reviews Genetics_, _5_(4), 276-287.
|
|
388
|
+
|
|
389
|
+
[^3]: Park, P. J. (2009). ChIP–seq: advantages and challenges of a maturing technology. _Nature reviews genetics_, _10_(10), 669-680.
|
|
390
|
+
|
|
391
|
+
[^4]: Jolma, A., Kivioja, T., Toivonen, J., Cheng, L., Wei, G., Enge, M., Taipale, M., Vaquerizas, J. M., Yan, J., Sillanpää, M. J., Bonke, M., Palin, K., Talukder, S., Hughes, T. R., Luscombe, N. M., Ukkonen, E., & Taipale, J. (2010). Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. _Genome research_, _20_(6), 861–873. https://doi.org/10.1101/gr.100552.109
|
|
392
|
+
|
|
393
|
+
[^5]: O'Malley, R. C., Huang, S. C., Song, L., Lewsey, M. G., Bartlett, A., Nery, J. R., Galli, M., Gallavotti, A., & Ecker, J. R. (2016). Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape. _Cell_, _165_(5), 1280–1292. https://doi.org/10.1016/j.cell.2016.04.038
|
|
394
|
+
|
|
395
|
+
[^6]: Bailey, T. L., & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. _Proceedings. International Conference on Intelligent Systems for Molecular Biology_, _2_, 28–36.
|
|
396
|
+
|
|
397
|
+
[^7]: Bailey T. L. (2021). STREME: accurate and versatile sequence motif discovery. _Bioinformatics (Oxford, England)_, _37_(18), 2834–2840. https://doi.org/10.1093/bioinformatics/btab203
|
|
398
|
+
|
|
399
|
+
[^8]: Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C., Singh, H., & Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. _Molecular cell_, _38_(4), 576–589. https://doi.org/10.1016/j.molcel.2010.05.004
|
|
400
|
+
|
|
401
|
+
[^9]: Grau, J., Posch, S., Grosse, I., & Keilwagen, J. (2013). A general approach for discriminative de novo motif discovery from high-throughput data. _Nucleic acids research_, _41_(21), e197. https://doi.org/10.1093/nar/gkt831
|
|
402
|
+
|
|
403
|
+
[^10]: Eggeling, R., Grosse, I., & Grau, J. (2017). InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites. _Bioinformatics (Oxford, England)_, _33_(4), 580–582. https://doi.org/10.1093/bioinformatics/btw689
|
|
404
|
+
|
|
405
|
+
[^11]: Siebert, M., & Söding, J. (2016). Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences. _Nucleic acids research_, _44_(13), 6055–6069. https://doi.org/10.1093/nar/gkw521
|
|
406
|
+
|
|
407
|
+
[^12]: Ge, W., Meier, M., Roth, C., & Söding, J. (2021). Bayesian Markov models improve the prediction of binding motifs beyond first order. _NAR genomics and bioinformatics_, _3_(2), lqab026. https://doi.org/10.1093/nargab/lqab026
|
|
408
|
+
|
|
409
|
+
[^13]: Toivonen, J., Das, P. K., Taipale, J., & Ukkonen, E. (2020). MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs. _Bioinformatics (Oxford, England)_, _36_(9), 2690–2696. https://doi.org/10.1093/bioinformatics/btaa045
|
|
410
|
+
|
|
411
|
+
[^14]: Mathelier, A., & Wasserman, W. W. (2013). The next generation of transcription factor binding site prediction. _PLoS computational biology_, _9_(9), e1003214. https://doi.org/10.1371/journal.pcbi.1003214
|
|
412
|
+
|
|
413
|
+
[^15]: Levitsky, V. G., Ignatieva, E. V., Ananko, E. A., Turnaev, I. I., Merkulova, T. I., Kolchanov, N. A., & Hodgman, T. C. (2007). Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions. _BMC bioinformatics_, _8_, 481. https://doi.org/10.1186/1471-2105-8-481
|
|
414
|
+
|
|
415
|
+
[^16]: Tsukanov, A. V., Mironova, V. V., & Levitsky, V. G. (2022). Motif models proposing independent and interdependent impacts of nucleotides are related to high and low affinity transcription factor binding sites in Arabidopsis. _Frontiers in plant science_, _13_, 938545. https://doi.org/10.3389/fpls.2022.938545
|
|
416
|
+
|
|
417
|
+
[^17]: Alipanahi, B., Delong, A., Weirauch, M. T., & Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. _Nature biotechnology_, _33_(8), 831–838. https://doi.org/10.1038/nbt.3300
|
|
418
|
+
|
|
419
|
+
[^18]: Hassanzadeh, H. R., & Wang, M. D. (2016). DeeperBind: Enhancing Prediction of Sequence Specificities of DNA Binding Proteins. _Proceedings. IEEE International Conference on Bioinformatics and Biomedicine_, _2016_, 178–183. https://doi.org/10.1109/bibm.2016.7822515
|
|
420
|
+
|
|
421
|
+
[^19]: Chen, C., Hou, J., Shi, X., Yang, H., Birchler, J. A., & Cheng, J. (2021). DeepGRN: prediction of transcription factor binding site across cell-types using attention-based deep neural networks. _BMC bioinformatics_, _22_(1), 38. https://doi.org/10.1186/s12859-020-03952-1
|
|
422
|
+
|
|
423
|
+
[^20]: Wang, K., Zeng, X., Zhou, J., Liu, F., Luan, X., & Wang, X. (2024). BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. _Briefings in bioinformatics_, _25_(3), bbae195. https://doi.org/10.1093/bib/bbae195
|
|
424
|
+
|
|
425
|
+
[^21]: Jing Zhang, F., Zhang, S. W., & Zhang, S. (2022). Prediction of Transcription Factor Binding Sites With an Attention Augmented Convolutional Neural Network. _IEEE/ACM transactions on computational biology and bioinformatics_, _19_(6), 3614–3623. https://doi.org/10.1109/TCBB.2021.3126623
|
|
426
|
+
|
|
427
|
+
[^22]: Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L., & Noble, W. S. (2007). Quantifying similarity between motifs. _Genome biology_, _8_(2), R24. https://doi.org/10.1186/gb-2007-8-2-r24
|
|
428
|
+
|
|
429
|
+
[^23]: Mahony, S., & Benos, P. V. (2007). STAMP: a web tool for exploring DNA-binding motif similarities. _Nucleic acids research_, _35_(Web Server issue), W253–W258. https://doi.org/10.1093/nar/gkm272
|
|
430
|
+
|
|
431
|
+
[^24]: Vorontsov, I. E., Kulakovskiy, I. V., & Makeev, V. J. (2013). Jaccard index based similarity measure to compare transcription factor binding site models. _Algorithms for molecular biology : AMB_, _8_(1), 23. https://doi.org/10.1186/1748-7188-8-23
|
|
432
|
+
|
|
433
|
+
[^25]: Lambert, S. A., Albu, M., Hughes, T. R., & Najafabadi, H. S. (2016). Motif comparison based on similarity of binding affinity profiles. _Bioinformatics (Oxford, England)_, _32_(22), 3504–3506. https://doi.org/10.1093/bioinformatics/btw489
|
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
yamcot/functions.py,sha256=8FJ8lQyhRaKt3KmE1xcaTkXSnq_zW2EwWtRARm-yKz4,25385
|
|
2
|
+
yamcot/models.py,sha256=4yclmGUk59GfrqOJ3z-KQHYPLeavOhpgxK8DuPTVeLs,39828
|
|
3
|
+
yamcot/execute.py,sha256=hhjAZmot1GvGl7okmMJNx6HFZFDFDTQ2iP4Kog1CkNo,2619
|
|
4
|
+
yamcot/io.py,sha256=oCabdXvUIXa6oi8iU9aW-fFramXKRQAkVsHw6BLsZqg,18696
|
|
5
|
+
yamcot/__init__.py,sha256=b69xn8E9ITUOxWY6L_oTBXpSVabiAxCzHM2gm-49PKY,1752
|
|
6
|
+
yamcot/ragged.py,sha256=DMyG-WIotFROZn9x0_MYLFpo2E_U6iQqupV2GLBYNuQ,3426
|
|
7
|
+
yamcot/comparison.py,sha256=TbdtnnLR89awk1sGgdK1eGpJdLEDl9VtCEDCXJSQBNA,40095
|
|
8
|
+
yamcot/cli.py,sha256=F7N3BDP3htkwkuv44hGK6mwIWU7RMdpgI_uTXJWgVIA,22184
|
|
9
|
+
yamcot/pipeline.py,sha256=h1BSr_Wx8K72KyAexuliz6WOzWZ-Avvm6qF4rH4NNdA,15483
|
|
10
|
+
yamcot/_core/pfm_to_pwm.h,sha256=amX7UqeIluZDtttDNTmcLNZ93JCGnIBRuCute6dqW-Q,2405
|
|
11
|
+
yamcot/_core/_core.cpython-314-darwin.so,sha256=54g4pyMGi7PxxlWz3vlIWhmotbvdhvA3p4T6WmOw8sk,110120
|
|
12
|
+
yamcot/_core/__init__.py,sha256=DeaVE538feEZSU8RumnbRjGAfB_5hz_0reslBK2x0YI,581
|
|
13
|
+
yamcot/_core/fasta_to_plain.h,sha256=4fkg65CMEq65xNBOuiIIHiEZAY_WekY2-4JaqOlBrV8,3341
|
|
14
|
+
yamcot/_core/mco_prc.cpp,sha256=PTVpvkJcrT8zMNI5_4eTi59CZIBNc9xfUPYgcx-erRA,42833
|
|
15
|
+
yamcot/_core/bindings.cpp,sha256=14jALD3NR7XWqg7M1fE_0oP4kg9ZU8IsD8zyrs5v4Mo,782
|
|
16
|
+
yamcot/_core/core_functions.h,sha256=4RzTutV1bOANNWirPB4b1wenIXW0u4VrMVp9iYcj1zw,598
|
|
17
|
+
yamcot-1.0.0.dist-info/RECORD,,
|
|
18
|
+
yamcot-1.0.0.dist-info/WHEEL,sha256=1bSoEhplGNQCM6mySRtVr9yWoZ8316lga5ImQCm3ICg,141
|
|
19
|
+
yamcot-1.0.0.dist-info/entry_points.txt,sha256=o2DLtC7_k2XDhzvqZ79fyjzGGqL0AOnjx5YTLlzYHKc,48
|
|
20
|
+
yamcot-1.0.0.dist-info/METADATA,sha256=lI2X0KghL_nGUbV8bI-ehKGW6w8KoG3QZMYxAd9aAKU,25919
|
|
21
|
+
yamcot-1.0.0.dist-info/licenses/LICENSE,sha256=tklX98EEQMtDKvF0JvPcsvb8C83RYFPdV3i420W1iBs,1070
|
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2025 Anton Tsukanov
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|