yamcot 1.0.0__cp314-cp314-win_amd64.whl

@@ -0,0 +1,433 @@
1
+ Metadata-Version: 2.2
2
+ Name: yamcot
3
+ Version: 1.0.0
4
+ Summary: Universal motif comparison tool
5
+ Author-Email: Anton Tsukanov <tsukanov@bionet.nsc.ru>
6
+ License: MIT
7
+ Classifier: Development Status :: 4 - Beta
8
+ Classifier: Intended Audience :: Science/Research
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Classifier: Programming Language :: Python :: 3.10
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Programming Language :: Python :: 3.13
15
+ Classifier: Programming Language :: Python :: 3.14
16
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
17
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
18
+ Project-URL: Homepage, https://github.com/ubercomrade/yamcot
19
+ Project-URL: Repository, https://github.com/ubercomrade/yamcot
20
+ Project-URL: Documentation, https://github.com/ubercomrade/yamcot#readme
21
+ Requires-Python: >=3.10
22
+ Requires-Dist: numpy<2.4,>=2.0
23
+ Requires-Dist: numba>=0.62.0
24
+ Requires-Dist: scipy>=1.14.1
25
+ Requires-Dist: pandas>=2.2.3
26
+ Requires-Dist: joblib>=1.5.3
27
+ Description-Content-Type: text/markdown
28
+
29
+ # YAMCOT
30
+
31
+ YAMCOT is Another Motif COmparison Tool (YAMCOT) designed to support comparisons across different motif model types.
32
+
33
+ ## Introduction
34
+
35
+ Transcription factors (TFs) serve as fundamental regulators of gene expression levels. These proteins modulate the activity of the RNA polymerase complex by binding to specific DNA sequences located within regulatory regions, such as promoters and enhancers [^1]. The specific DNA segment recognized by a TF is termed a transcription factor binding site (TFBS). TFBSs for a given TF are typically similar but not identical; therefore, they are described using *motifs* that capture the variability of the recognized sequences [^2]. A variety of high-throughput experimental methods, including ChIP-seq, HT-SELEX, and DAP-seq, are currently used to identify TFBS motifs [^3][^4][^5]. While motifs are most frequently represented as Position Weight Matrices (PWMs), a standard supported by widely used *de novo* motif discovery tools like MEME[^6], STREME[^7], and HOMER[^8], the field has increasingly adopted alternative models to capture complex nucleotide dependencies. These include diverse variants of Markov Models (BaMMs, InMoDe, DIMONT, etc.)[^9][^10][^11][^12][^13][^14], which account for higher-order dependencies that PWMs ignore, as well as models based on locally positioned dinucleotides (SiteGA) [^15][^16] and deep learning architectures (DeepBind, DeeperBind, DeepGRN, etc.) [^17][^18][^19][^20][^21].
36
+
37
+ The identification of a motif is only the first step; establishing its biological context requires robust comparison methods. Comparing motifs is essential for determining whether a newly discovered pattern represents a novel specificity or a variation of a known factor, for clustering redundant motifs identified across different experiments, and for inferring functional relationships between TFs based on binding similarity. Several established tools address this need, including Tomtom [^22], STAMP [^23], MACRO-APE [^24] and MoSBAT [^25]. These tools utilize various metrics and algorithms to quantify similarity, ranging from column-wise matrix correlations to Jaccard index-based comparisons of recognized site sets. However, a significant limitation of the current software ecosystem is its heavy reliance on matrix-based representations (PFMs or PWMs). This constraint makes it challenging to directly compare alternative models, such as Markov models or dinucleotide models, without converting them into simpler matrix formats, a process that often results in information loss.
38
+
39
+ To address these limitations, we introduce YAMCOT, a comprehensive framework designed to facilitate the comparison of diverse motif models beyond standard frequency matrices. YAMCOT implements four distinct modes of comparison to accommodate various analytical needs. The first and most universal mode involves the direct comparison of TFBS recognition profiles generated by different motifs, conceptually similar to affinity-based approaches [^25]. This allows for the assessment of similarity based on the functional output of the models (the scores assigned to sequences) rather than their internal parameters. The second mode leverages the same underlying approach but allows the user to explicitly define the model architecture; currently, YAMCOT supports three specific model types: PWM, BaMM, and SiteGA, with an extensible architecture designed to accommodate future model types. The third mode incorporates MoTaLi (see details in). Finally, the fourth mode provides a Tomtom-like functionality for scenarios where models can be represented as N-dimensional matrices. In this mode, if the models are in compatible matrix formats, they are compared using standard metrics such as Pearson Correlation Coefficient (PCC), Euclidean Distance (ED), and Cosine similarity. Crucially, if the models are of heterogeneous types (e.g., comparing a BaMM to a PWM), YAMCOT employs a strategy of scanning sequences to generate recognition profiles, which are then used to reconstruct compatible Position Frequency Matrices for comparison, ensuring that even fundamentally different model types can be quantitatively evaluated within a single framework.
40
+
41
+ ### Methodology
42
+
43
+ #### Similarity Metrics
44
+
45
+ YAMCOT implements several metrics to quantify the resemblance between motif importance profiles or matrix columns.
46
+
47
+ **Continuous Jaccard (CJ)**
48
+ The Continuous Jaccard index extends the classical Jaccard similarity to continuous-valued vectors $v_1, v_2$. It is defined as the ratio of the sum of element-wise intersections to the sum of element-wise unions:
49
+ $$\text{CJ}(v_1, v_2) = \frac{\sum_i \min(v_1^i, v_2^i)}{\sum_i \max(v_1^i, v_2^i)}$$
50
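+ This metric aggregates set overlap over all thresholds: for non-negative profiles, the numerator and denominator equal the intersection and union sizes of the thresholded site sets, integrated over every possible threshold. The result is a threshold-independent measure of profile similarity.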
51
+
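As a minimal sketch of the CJ formula above (not YAMCOT's own implementation), the ratio can be written with NumPy. The function name `continuous_jaccard` and the choice to return 0.0 for two all-zero profiles are assumptions made for illustration.

```python
import numpy as np

def continuous_jaccard(v1, v2):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    if v1.shape != v2.shape:
        raise ValueError("profiles must have the same shape")
    if np.any(v1 < 0) or np.any(v2 < 0):
        raise ValueError("profiles must be non-negative")
    denom = np.maximum(v1, v2).sum()
    # Assumption: two all-zero profiles are treated as having no similarity.
    return float(np.minimum(v1, v2).sum() / denom) if denom > 0 else 0.0

print(continuous_jaccard([1.0, 2.0], [2.0, 1.0]))
```

Identical profiles score 1 and profiles with disjoint support score 0, which matches the behaviour of the classical Jaccard index on sets.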
52
+ **Continuous Overlap (CO)**
53
+ The Continuous Overlap coefficient (or Szymkiewicz-Simpson coefficient) measures the subset relationship between two profiles, normalizing the intersection by the smaller of the two total affinities:
54
+ $$\text{CO}(v_1, v_2) = \frac{\sum_i \min(v_1^i, v_2^i)}{\min\left(\sum_i v_1^i, \sum_i v_2^i\right)}$$
55
+
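The CO formula can be sketched the same way. This is an illustration only; the name `continuous_overlap` and the zero-denominator convention are assumptions, not part of YAMCOT's documented API.

```python
import numpy as np

def continuous_overlap(v1, v2):
    """Sum of element-wise minima divided by the smaller profile total."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    if v1.shape != v2.shape:
        raise ValueError("profiles must have the same shape")
    denom = min(v1.sum(), v2.sum())
    # Assumption: if either profile sums to zero, report no overlap.
    return float(np.minimum(v1, v2).sum() / denom) if denom > 0 else 0.0

# A profile fully contained in a larger one reaches the maximum value.
print(continuous_overlap([1.0, 1.0, 0.0], [2.0, 2.0, 2.0]))
```

Unlike CJ, the coefficient reaches 1 whenever one profile lies entirely under the other, so it rewards containment rather than equality.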
56
+ **Pearson Correlation Coefficient (PCC)**
57
+ For linear correlation between profiles or motif columns, the PCC is calculated as:
58
+ $$\text{PCC}(v_1, v_2) = \frac{\sum_i (v_1^i - \bar{v}_1)(v_2^i - \bar{v}_2)}{\sqrt{\sum_i (v_1^i - \bar{v}_1)^2 \sum_i (v_2^i - \bar{v}_2)^2}}$$
59
+
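For completeness, a short sketch of the PCC formula written term by term. The helper name `pearson` is hypothetical, and returning 0.0 for a constant profile is an assumption of this sketch.

```python
import numpy as np

def pearson(v1, v2):
    """Pearson correlation computed directly from the definition."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    d1 = v1 - v1.mean()
    d2 = v2 - v2.mean()
    denom = np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum())
    # Assumption: a constant profile (zero variance) yields 0.0 instead of NaN.
    return float((d1 * d2).sum() / denom) if denom > 0 else 0.0

print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

For non-constant inputs the result agrees with `np.corrcoef(v1, v2)[0, 1]`.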
60
+ #### Null Hypothesis and Surrogate Generation
61
+
62
+ To estimate the statistical significance (p-values) of observed similarity scores, YAMCOT employs a **Surrogate Null Model**. Unlike simple permutations that destroy local dependencies, our tool generates synthetic "surrogate" profiles that preserve the marginal properties and biological plausibility (smoothness) of the original data.
63
+
64
+ 1. **Convolutional Distortion**: For profile-based surrogates, a distortion procedure with the following components is applied:
65
+ * **Kernel Selection**: A base kernel (smooth, edge, or double-peak) is selected to represent typical profile features.
66
+ * **Controlled Perturbation**: Noise and gradient bias are added to introduce variation while maintaining structural integrity.
67
+ * **Smoothing**: Convolution ensures the surrogate remains biologically realistic.
68
+ * **Convex Combination**: The final surrogate is a blend of the identity kernel and the distorted kernel, controlled by a user-defined distortion parameter.
69
+
70
+ 2. **Permutation**: For matrix-based comparisons (`tomtom-like`), the tool performs random column-wise permutations.
71
+
72
+ This methodology ensures that the null distribution reflects realistic background similarity.
73
+
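The following is a rough sketch of the convex-combination idea and of a permutation-style p-value, under stated assumptions. The base kernel (a Hann window), the noise scale of 0.1, the helper names `make_surrogate` and `empirical_p_value`, and the add-one p-value estimator are all illustrative choices; YAMCOT's actual kernels and estimator may differ.

```python
import numpy as np

def make_surrogate(profile, distortion, kernel_size=5, rng=None):
    """Blend an identity kernel with a noisy smooth kernel, then convolve."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the kernel has a centre")
    rng = np.random.default_rng(rng)
    profile = np.asarray(profile, dtype=float)
    # Identity kernel: convolving with it returns the profile unchanged.
    identity = np.zeros(kernel_size)
    identity[kernel_size // 2] = 1.0
    # Assumed base kernel: a Hann bump, perturbed with Gaussian noise.
    base = np.hanning(kernel_size + 2)[1:-1]
    noisy = np.clip(base + 0.1 * rng.standard_normal(kernel_size), 0.0, None)
    if noisy.sum() == 0:
        noisy = base
    noisy = noisy / noisy.sum()
    # Convex combination controlled by the distortion level in [0, 1].
    kernel = (1.0 - distortion) * identity + distortion * noisy
    return np.convolve(profile, kernel, mode="same")

def empirical_p_value(observed, null_scores):
    """Add-one estimate of P(null >= observed); an assumed estimator."""
    null_scores = np.asarray(null_scores, dtype=float)
    return float((1 + np.sum(null_scores >= observed)) / (1 + null_scores.size))

profile = np.array([0.0, 1.0, 4.0, 2.0, 0.0, 3.0, 1.0])
print(make_surrogate(profile, distortion=0.4, rng=0))
```

With `distortion=0.0` the surrogate equals the input, and larger values move it toward a smoothed, perturbed copy, which is the intuition behind the user-defined distortion parameter.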
74
+ ## Installation
75
+
76
+ YAMCOT requires **Python 3.10 or higher**.
77
+
78
+ ### From PyPI (Recommended)
79
+
80
+ The easiest way to install YAMCOT is via `pip` or `uv`. This will automatically download and install all required dependencies.
81
+
82
+ ```bash
83
+ # Using uv (Fastest)
84
+ uv pip install yamcot
85
+
86
+ # Using pip
87
+ pip install yamcot
88
+ ```
89
+
90
+ ### From Source
91
+
92
+ If you want to contribute to development or build the latest version from the repository, you will need a C++ compiler with **C++17 support** (e.g., GCC, Clang, or MSVC).
93
+
94
+ ```bash
95
+ # Clone the repository
96
+ git clone https://github.com/ubercomrade/yamcot.git
97
+ cd yamcot
98
+
99
+ # Install in editable mode
100
+ pip install -e .
101
+ ```
102
+
103
+ ### Dependencies
104
+
105
+ When installing via `pip`, the following dependencies are resolved automatically:
106
+
107
+ * `numpy` (>= 2.0, < 2.4)
108
+ * `numba` (>= 0.62.0)
109
+ * `scipy` (>= 1.14.1)
110
+ * `pandas` (>= 2.2.3)
111
+ * `joblib` (>= 1.5.3)
112
+
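To see which versions of these dependencies are present in the current environment, a small standard-library check can help. This is a convenience sketch, not part of YAMCOT; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ["numpy", "numba", "scipy", "pandas", "joblib"]

def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Compare the printed versions against the ranges listed above before reporting installation problems.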
113
+ ### Build Requirements (Source only)
114
+
115
+ To build the C++ extension from source, the following tools are used:
116
+
117
+ * `scikit-build-core` (>= 0.10)
118
+ * `nanobind` (>= 2.0)
119
+
120
+ ## CLI Reference
121
+
122
+ The `yamcot` tool provides four main operation modes.
123
+
124
+ ### `profile` mode
125
+
126
+ Compare motifs based on pre-calculated score profiles.
127
+
128
+ **Input**: Text files with numerical scores (comma, tab, or space-separated).
129
+ **Example Data**: [`examples/scores_1.fasta`](examples/scores_1.fasta)
130
+
131
+ ```bash
132
+ # in the `examples` directory
133
+ yamcot profile scores_1.fasta scores_2.fasta \
134
+ --metric cj \
135
+ --perm 1000 \
136
+ --distortion 0.5 \
137
+ --search-range 10
138
+ ```
139
+
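The exact layout of the profile files is not spelled out here, so the reader below is only a sketch under an assumed layout: each record starts with a `>` header line and is followed by one or more lines of scores separated by commas, tabs, or spaces. The function name `parse_score_profiles` is hypothetical.

```python
import re
import numpy as np

def parse_score_profiles(text):
    """Parse '>name' headers followed by delimiter-separated numeric scores."""
    profiles = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            profiles[name] = []
        elif name is not None:
            # Split on commas and any whitespace (spaces or tabs).
            tokens = [tok for tok in re.split(r"[,\s]+", line) if tok]
            profiles[name].extend(float(tok) for tok in tokens)
    return {key: np.array(vals, dtype=np.float32) for key, vals in profiles.items()}

example = ">seq1\n0.1,0.2 0.3\n>seq2\n1\t2\n"
print(parse_score_profiles(example))
```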
140
+ **All parameters for `profile` mode**:
141
+
142
+ | Flag | Value | Comment |
143
+ | :--- | :--- | :--- |
144
+ | `profile1` | Path | Path to the first profile file (FASTA-like format). |
145
+ | `profile2` | Path | Path to the second profile file (FASTA-like format). |
146
+ | `--metric` | `cj`, `co`, `corr` | Similarity metric: Continuous Jaccard, Continuous Overlap, or Pearson Correlation (default: `cj`). |
147
+ | `--perm` | Integer | Number of permutations for p-value calculation (default: 0). |
148
+ | `--distortion` | Float | Distortion level (0.0-1.0) for surrogate generation (default: 0.4). |
149
+ | `--search-range` | Integer | Maximum offset range to explore when aligning profiles (default: 10). |
150
+ | `--min-kernel-size` | Integer | Minimum kernel size for surrogate convolution (default: 3). |
151
+ | `--max-kernel-size` | Integer | Maximum kernel size for surrogate convolution (default: 11). |
152
+ | `--seed` | Integer | Global random seed for reproducibility. |
153
+ | `--jobs` | Integer | Number of parallel jobs (-1 uses all cores) (default: -1). |
154
+ | `-v`, `--verbose` | Flag | Enable verbose logging. |
155
+
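One way to picture what `--search-range` controls is a scan over relative offsets that keeps the best-scoring alignment. The sketch below assumes that behaviour and uses Continuous Jaccard on the overlapping region; the function `best_offset` and its tie-breaking rule are illustrative, not YAMCOT's implementation.

```python
import numpy as np

def continuous_jaccard(a, b):
    denom = np.maximum(a, b).sum()
    return float(np.minimum(a, b).sum() / denom) if denom > 0 else 0.0

def best_offset(p1, p2, search_range=10, metric=continuous_jaccard):
    """Try every shift in [-search_range, search_range]; keep the best score."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    best_score, best_shift = float("-inf"), 0
    for shift in range(-search_range, search_range + 1):
        # Positive shift trims the start of p1, negative shift trims p2.
        a = p1[shift:] if shift >= 0 else p1
        b = p2 if shift >= 0 else p2[-shift:]
        n = min(a.size, b.size)
        if n == 0:
            continue
        score = metric(a[:n], b[:n])
        if score > best_score:
            best_score, best_shift = score, shift
    return best_score, best_shift

print(best_offset([0, 0, 1, 2, 3, 0], [1, 2, 3, 0, 0, 0], search_range=3))
```

A wider range lets shifted but otherwise similar profiles be matched, at the cost of more comparisons per surrogate.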
156
+ ### `motif` mode
157
+
158
+ Compare motifs by scanning sequences with models and comparing the resulting profiles.
159
+
160
+ **Input**: Motif model files (PWM: `.meme`, `.pfm`; BaMM: `.ihbcp` + `.hbcp`; SiteGA: `.mat`).
161
+ **Example Models**: [`examples/foxa2.meme`](examples/foxa2.meme), [`examples/gata4.meme`](examples/gata4.meme)
162
+
163
+ ```bash
164
+ # in the `examples` directory
165
+ yamcot motif foxa2.meme gata4.meme \
166
+ --model1-type pwm \
167
+ --model2-type pwm \
168
+ --fasta foreground.fa \
169
+ --metric co \
170
+ --perm 1000 \
171
+ --distortion 0.3
172
+ ```
173
+
174
+ **All parameters for `motif` mode**:
175
+
176
+ | Flag | Value | Comment |
177
+ | :--- | :--- | :--- |
178
+ | `model1` | Path | Path to the first motif model file. |
179
+ | `model2` | Path | Path to the second motif model file. |
180
+ | `--model1-type` | `pwm`, `bamm`, `sitega` | Format of the first model (Required). |
181
+ | `--model2-type` | `pwm`, `bamm`, `sitega` | Format of the second model (Required). |
182
+ | `--fasta` | Path | FASTA file with target sequences. If omitted, random sequences are generated. |
183
+ | `--promoters` | Path | FASTA file with promoter sequences for threshold calculation. |
184
+ | `--num-sequences` | Integer | Number of random sequences to generate (default: 1000). |
185
+ | `--seq-length` | Integer | Length of random sequences (default: 200). |
186
+ | `--metric` | `cj`, `co`, `corr` | Similarity metric (default: `cj`). |
187
+ | `--perm` | Integer | Number of permutations (default: 0). |
188
+ | `--distortion` | Float | Distortion level (default: 0.4). |
189
+ | `--search-range` | Integer | Maximum alignment offset (default: 10). |
190
+ | `--seed` | Integer | Global random seed. |
191
+ | `--jobs` | Integer | Number of parallel jobs (default: -1). |
192
+
193
+ ### `motali` mode
194
+
195
+ Compare motifs by calculating Precision-Recall Curve (PRC) AUC derived from scanning sequences.
196
+
197
+ **Example Models**: [`examples/sitega_gata2.mat`](examples/sitega_gata2.mat), [`examples/gata2.meme`](examples/gata2.meme)
198
+
199
+ ```bash
200
+ # in the `examples` directory
201
+ yamcot motali sitega_gata2.mat gata2.meme \
202
+ --model1-type sitega \
203
+ --model2-type pwm \
204
+ --fasta foreground.fa \
205
+ --promoters background.fa \
206
+ --num-sequences 5000 \
207
+ --seq-length 150
208
+ ```
209
+
210
+ **All parameters for `motali` mode**:
211
+
212
+ | Flag | Value | Comment |
213
+ | :--- | :--- | :--- |
214
+ | `model1` | Path | Path to the first motif model file. |
215
+ | `model2` | Path | Path to the second motif model file. |
216
+ | `--model1-type` | `pwm`, `sitega` | Format of the first model (Required). |
217
+ | `--model2-type` | `pwm`, `sitega` | Format of the second model (Required). |
218
+ | `--fasta` | Path | FASTA file with target sequences. |
219
+ | `--promoters` | Path | FASTA file with promoter sequences (Required for thresholds). |
220
+ | `--num-sequences` | Integer | Number of random sequences (default: 10000). |
221
+ | `--seq-length` | Integer | Length of random sequences (default: 200). |
222
+ | `--tmp-dir` | Path | Directory for temporary files (default: `/tmp`). |
223
+
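How MoTaLi defines positives and negatives is not described in this section, so the snippet below only illustrates the generic arithmetic of a precision-recall AUC from two sets of scores. The function name `pr_auc`, the trapezoidal integration, and the starting point (recall 0, precision 1) are assumptions of this sketch; tied scores are not treated specially.

```python
import numpy as np

def pr_auc(pos_scores, neg_scores):
    """Area under the precision-recall curve via the trapezoidal rule."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    if pos.size == 0:
        raise ValueError("at least one positive score is required")
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(pos.size), np.zeros(neg.size)])
    # Rank all scores from highest to lowest and sweep the threshold down.
    order = np.argsort(-scores, kind="stable")
    labels = labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(1.0 - labels)
    precision = np.concatenate([[1.0], tp / (tp + fp)])
    recall = np.concatenate([[0.0], tp / pos.size])
    return float(np.sum(np.diff(recall) * (precision[1:] + precision[:-1]) / 2.0))

print(pr_auc([0.9, 0.8], [0.1, 0.2]))
```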
224
+ ### `tomtom-like` mode
225
+
226
+ Compare motifs by direct N-dimensional matrix comparison (column-wise).
227
+
228
+ **Example Models**: [`examples/pif4.pfm`](examples/pif4.pfm), [`examples/pif4.meme`](examples/pif4.meme)
229
+
230
+ ```bash
231
+ # in the `examples` directory
232
+ yamcot tomtom-like pif4.pfm pif4.meme \
233
+ --model1-type pwm \
234
+ --model2-type pwm \
235
+ --metric cosine \
236
+ --permutations 1000 \
237
+ --pfm-mode \
238
+ --num-sequences 10000 \
239
+ --seq-length 100
240
+ ```
241
+
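The idea behind `--pfm-mode`, deriving a frequency matrix from scanned sequences, can be sketched as follows. This illustration assumes the best-scoring site per sequence is collected on the forward strand only and that nucleotides are encoded as A=0, C=1, G=2, T=3; the function `pfm_from_best_hits` is hypothetical and may not match how YAMCOT selects sites.

```python
import numpy as np

def pfm_from_best_hits(sequences, scores, motif_length):
    """Count nucleotides at the top-scoring site of each sequence."""
    counts = np.zeros((4, motif_length), dtype=float)
    for seq, sc in zip(sequences, scores):
        if len(sc) == 0:
            continue
        start = int(np.argmax(sc))
        site = np.asarray(seq[start:start + motif_length])
        for pos, base in enumerate(site):
            if 0 <= base < 4:  # skip ambiguous symbols such as N
                counts[base, pos] += 1
    totals = counts.sum(axis=0, keepdims=True)
    # Assumption: columns with no observations fall back to uniform 0.25.
    return np.divide(counts, totals, out=np.full_like(counts, 0.25), where=totals > 0)

seqs = [np.array([0, 1, 2, 3, 0]), np.array([3, 0, 1, 2, 3])]
profile_scores = [np.array([5.0, 1.0, 0.0]), np.array([0.0, 9.0, 1.0])]
print(pfm_from_best_hits(seqs, profile_scores, motif_length=3))
```

Because the matrix is built from what each model recognizes rather than from its parameters, two models of different types end up in the same 4 x L representation.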
242
+ **All parameters for `tomtom-like` mode**:
243
+
244
+ | Flag | Value | Comment |
245
+ | :--- | :--- | :--- |
246
+ | `model1` | Path | Path to the first motif model file. |
247
+ | `model2` | Path | Path to the second motif model file. |
248
+ | `--model1-type` | `pwm`, `bamm`, `sitega` | Format of the first model (Required). |
249
+ | `--model2-type` | `pwm`, `bamm`, `sitega` | Format of the second model (Required). |
250
+ | `--metric` | `pcc`, `ed`, `cosine` | Column-wise metric: Pearson Correlation, Euclidean Distance, or Cosine Similarity (default: `pcc`). |
251
+ | `--permutations` | Integer | Number of Monte Carlo permutations for p-value (default: 0). |
252
+ | `--permute-rows` | Flag | Shuffle values within columns during permutation. |
253
+ | `--pfm-mode` | Flag | Derive PFM by scanning sequences (useful for comparing different model types). |
254
+ | `--num-sequences` | Integer | Sequences for PFM mode (default: 20000). |
255
+ | `--seq-length` | Integer | Sequence length for PFM mode (default: 100). |
256
+ | `--seed` | Integer | Global random seed. |
257
+ | `--jobs` | Integer | Number of parallel jobs (default: -1). |
258
+
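A minimal sketch of column-wise comparison and of the column permutation used for the null model, assuming both matrices are already aligned to the same shape (rows are nucleotides, columns are positions). The names `column_similarity` and `permute_columns` are hypothetical, and negating the Euclidean distance so that larger means more similar is a choice of this sketch.

```python
import numpy as np

def column_similarity(m1, m2, metric="pcc"):
    """Per-column PCC, cosine similarity, or negated Euclidean distance."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    if m1.shape != m2.shape:
        raise ValueError("matrices must be aligned to the same shape")
    if metric == "ed":
        return -np.sqrt(((m1 - m2) ** 2).sum(axis=0))
    if metric == "pcc":
        # PCC is cosine similarity of mean-centred columns.
        m1 = m1 - m1.mean(axis=0)
        m2 = m2 - m2.mean(axis=0)
    elif metric != "cosine":
        raise ValueError(f"unknown metric: {metric}")
    num = (m1 * m2).sum(axis=0)
    denom = np.sqrt((m1 ** 2).sum(axis=0) * (m2 ** 2).sum(axis=0))
    return np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)

def permute_columns(matrix, rng):
    """Shuffle column order, keeping each column's content intact."""
    matrix = np.asarray(matrix)
    return matrix[:, rng.permutation(matrix.shape[1])]

pfm = np.array([[0.7, 0.1], [0.1, 0.7], [0.1, 0.1], [0.1, 0.1]])
rng = np.random.default_rng(0)
print(column_similarity(pfm, permute_columns(pfm, rng), metric="cosine"))
```

Repeating the permutation many times and recomputing the score gives a null distribution against which the observed score can be ranked.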
259
+ ## Library Usage
260
+
261
+ YAMCOT is designed as an extensible framework. You can implement your own motif models by inheriting from the base abstractions provided in [`yamcot/`](yamcot/).
262
+
263
+ ### Implementing a Custom Model
264
+
265
+ To create a new model, inherit from the [`MotifModel`](yamcot/models.py) class and implement the required methods. Below is a simplified example of a dinucleotide-based motif model implemented in pure Python/NumPy.
266
+
267
+ ```python
268
+ import numpy as np
269
+ from yamcot.models import MotifModel, RaggedData
270
+ from yamcot.pipeline import Pipeline
271
+ from yamcot.ragged import ragged_from_list
272
+
273
+ class SimpleDinucleotideMotif(MotifModel):
274
+     """Example of a custom motif model with dinucleotide dependencies."""
275
+
276
+     def __init__(self, matrix, name, length):
277
+         # matrix shape: (16, length - 1) representing 16 possible dinucleotides
278
+         super().__init__(matrix=matrix, name=name, length=length)
279
+
280
+     def scan(self, sequences: RaggedData, strand=None) -> RaggedData:
281
+         """
282
+         Scan sequences with the custom model.
283
+         Returns RaggedData containing scores for each position.
284
+         """
285
+         all_scores = []
286
+         for i in range(sequences.num_sequences):
287
+             seq = sequences.get_slice(i)
288
+             if len(seq) < self.length:
289
+                 all_scores.append(np.array([], dtype=np.float32))
290
+                 continue
291
+
292
+             n_pos = len(seq) - self.length + 1
293
+             scores = np.zeros(n_pos, dtype=np.float32)
294
+
295
+             # Simple sliding window scoring logic
296
+             for j in range(n_pos):
297
+                 subseq = seq[j : j + self.length]
298
+                 pos_score = 0.0
299
+                 for k in range(self.length - 1):
300
+                     # Calculate dinucleotide index (0-15) for ACGT
301
+                     # Assuming 0=A, 1=C, 2=G, 3=T
302
+                     if subseq[k] < 4 and subseq[k+1] < 4:
303
+                         dinucl_idx = int(subseq[k] * 4 + subseq[k+1])
304
+                         pos_score += self.matrix[dinucl_idx, k]
305
+                 scores[j] = pos_score
306
+             all_scores.append(scores)
307
+
308
+         return ragged_from_list(all_scores, dtype=np.float32)
309
+
310
+     @classmethod
311
+     def from_file(cls, path: str, **kwargs) -> "SimpleDinucleotideMotif":
312
+         """Load the model from a file."""
313
+         # Implementation of your file parsing logic
314
+         matrix = np.load(path)
315
+         name = kwargs.get('name', 'custom_motif')
316
+         return cls(matrix, name, length=matrix.shape[1] + 1)
317
+
318
+     @property
319
+     def model_type(self) -> str:
320
+         """Unique identifier for the model type."""
321
+         return 'dinucleotide'
322
+
323
+     def write(self, path: str):
324
+         """Save the model to a file."""
325
+         np.save(path, self.matrix)
326
+
327
+ # Register the subclass to enable factory methods and CLI support
328
+ MotifModel.register_subclass('dinucleotide', SimpleDinucleotideMotif)
329
+ ```
330
+
331
+ ### Key Methods to Override
332
+
333
+ To ensure compatibility with the internal comparison [`Pipeline`](yamcot/pipeline.py), you must override the following methods:
334
+
335
+ | Method | Description |
336
+ | :--- | :--- |
337
+ | `scan(sequences, strand)` | **Required.** Performs motif scanning on a set of sequences. Must accept `RaggedData` and return `RaggedData` containing position-wise scores. |
338
+ | `from_file(path, **kwargs)` | **Required.** Class method to initialize the model from a file path. Enables the use of `MotifModel.create_from_file(path, 'type')`. |
339
+ | `model_type` | **Required.** Property returning a unique string identifier for the model class. |
340
+ | `write(path)` | **Required.** Method to serialize the model to its native format. |
341
+
342
+ ### Example: Running a Comparison
343
+
344
+ ```python
345
+ # 1. Prepare sequences and models
346
+ # Encode sequence: A=0, C=1, G=2, T=3 (20 nt, longer than the 10-nt motifs)
347
+ seq_list = [np.array([0, 1, 2, 3] * 5, dtype=np.int8)]
348
+ sequences = ragged_from_list(seq_list, dtype=np.int8)
349
+
350
+ # Initialize custom models
351
+ # Matrix for 16 dinucleotides across 9 positions (motif length 10)
352
+ m1 = SimpleDinucleotideMotif(np.random.rand(16, 9), "Motif_A", 10)
353
+ m2 = SimpleDinucleotideMotif(np.random.rand(16, 9), "Motif_B", 10)
354
+
355
+ # 2. Execute comparison using the Pipeline
356
+ pipeline = Pipeline()
357
+ result = pipeline.execute_motif_comparison(
358
+     model1=m1,
359
+     model2=m2,
360
+     sequences=sequences,
361
+     promoters=sequences,  # Used for threshold (FPR) calculation
362
+     comparison_type='motif',
363
+     metric='cj',
364
+     n_permutations=1000
365
+ )
366
+
367
+ print(f"Similarity (CJ): {result['similarity']:.4f}")
368
+ print(f"P-value: {result['p_value']:.4e}")
369
+ ```
370
+
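Real sequences usually arrive as strings, so a small encoder is needed before calling `ragged_from_list`. The helper below is a sketch that follows the A=0, C=1, G=2, T=3 convention from the example; mapping every other symbol to 4 is an assumption chosen to match the `< 4` check in the custom model, and the name `encode_dna` is hypothetical.

```python
import numpy as np

def encode_dna(seq):
    """Encode an ASCII DNA string as int8 codes (A=0, C=1, G=2, T=3, other=4)."""
    lookup = np.full(256, 4, dtype=np.int8)
    for code, base in enumerate("ACGT"):
        lookup[ord(base)] = code
        lookup[ord(base.lower())] = code
    # View the ASCII bytes as integers and translate them in one step.
    return lookup[np.frombuffer(seq.encode("ascii"), dtype=np.uint8)]

print(encode_dna("ACGTNacgt"))
```

The resulting arrays can be collected in a list and passed to `ragged_from_list(seq_list, dtype=np.int8)` as in the example above.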
371
+ ### Examples
372
+
373
+ The [`examples/`](examples/) directory contains sample data and a [script](examples/run.sh) that demonstrates the tool's capabilities from the CLI.
374
+
375
+ To run a basic comparison:
376
+
377
+ ```bash
378
+ yamcot motif examples/foxa2.meme examples/gata2.meme \
379
+ --model1-type pwm --model2-type pwm \
380
+ --fasta examples/foreground.fa --metric cj --perm 100
381
+ ```
382
+
383
+ ## Bibliography
384
+
385
+ [^1]: Lambert, S. A., Jolma, A., Campitelli, L. F., Das, P. K., Yin, Y., Albu, M., ... & Weirauch, M. T. (2018). The human transcription factors. _Cell_, _172_(4), 650-665.
386
+
387
+ [^2]: Wasserman, W. W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. _Nature Reviews Genetics_, _5_(4), 276-287.
388
+
389
+ [^3]: Park, P. J. (2009). ChIP–seq: advantages and challenges of a maturing technology. _Nature reviews genetics_, _10_(10), 669-680.
390
+
391
+ [^4]: Jolma, A., Kivioja, T., Toivonen, J., Cheng, L., Wei, G., Enge, M., Taipale, M., Vaquerizas, J. M., Yan, J., Sillanpää, M. J., Bonke, M., Palin, K., Talukder, S., Hughes, T. R., Luscombe, N. M., Ukkonen, E., & Taipale, J. (2010). Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. _Genome research_, _20_(6), 861–873. https://doi.org/10.1101/gr.100552.109
392
+
393
+ [^5]: O'Malley, R. C., Huang, S. C., Song, L., Lewsey, M. G., Bartlett, A., Nery, J. R., Galli, M., Gallavotti, A., & Ecker, J. R. (2016). Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape. _Cell_, _165_(5), 1280–1292. https://doi.org/10.1016/j.cell.2016.04.038
394
+
395
+ [^6]: Bailey, T. L., & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. _Proceedings. International Conference on Intelligent Systems for Molecular Biology_, _2_, 28–36.
396
+
397
+ [^7]: Bailey T. L. (2021). STREME: accurate and versatile sequence motif discovery. _Bioinformatics (Oxford, England)_, _37_(18), 2834–2840. https://doi.org/10.1093/bioinformatics/btab203
398
+
399
+ [^8]: Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C., Singh, H., & Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. _Molecular cell_, _38_(4), 576–589. https://doi.org/10.1016/j.molcel.2010.05.004
400
+
401
+ [^9]: Grau, J., Posch, S., Grosse, I., & Keilwagen, J. (2013). A general approach for discriminative de novo motif discovery from high-throughput data. _Nucleic acids research_, _41_(21), e197. https://doi.org/10.1093/nar/gkt831
402
+
403
+ [^10]: Eggeling, R., Grosse, I., & Grau, J. (2017). InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites. _Bioinformatics (Oxford, England)_, _33_(4), 580–582. https://doi.org/10.1093/bioinformatics/btw689
404
+
405
+ [^11]: Siebert, M., & Söding, J. (2016). Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences. _Nucleic acids research_, _44_(13), 6055–6069. https://doi.org/10.1093/nar/gkw521
406
+
407
+ [^12]: Ge, W., Meier, M., Roth, C., & Söding, J. (2021). Bayesian Markov models improve the prediction of binding motifs beyond first order. _NAR genomics and bioinformatics_, _3_(2), lqab026. https://doi.org/10.1093/nargab/lqab026
408
+
409
+ [^13]: Toivonen, J., Das, P. K., Taipale, J., & Ukkonen, E. (2020). MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs. _Bioinformatics (Oxford, England)_, _36_(9), 2690–2696. https://doi.org/10.1093/bioinformatics/btaa045
410
+
411
+ [^14]: Mathelier, A., & Wasserman, W. W. (2013). The next generation of transcription factor binding site prediction. _PLoS computational biology_, _9_(9), e1003214. https://doi.org/10.1371/journal.pcbi.1003214
412
+
413
+ [^15]: Levitsky, V. G., Ignatieva, E. V., Ananko, E. A., Turnaev, I. I., Merkulova, T. I., Kolchanov, N. A., & Hodgman, T. C. (2007). Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions. _BMC bioinformatics_, _8_, 481. https://doi.org/10.1186/1471-2105-8-481
414
+
415
+ [^16]: Tsukanov, A. V., Mironova, V. V., & Levitsky, V. G. (2022). Motif models proposing independent and interdependent impacts of nucleotides are related to high and low affinity transcription factor binding sites in Arabidopsis. _Frontiers in plant science_, _13_, 938545. https://doi.org/10.3389/fpls.2022.938545
416
+
417
+ [^17]: Alipanahi, B., Delong, A., Weirauch, M. T., & Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. _Nature biotechnology_, _33_(8), 831–838. https://doi.org/10.1038/nbt.3300
418
+
419
+ [^18]: Hassanzadeh, H. R., & Wang, M. D. (2016). DeeperBind: Enhancing Prediction of Sequence Specificities of DNA Binding Proteins. _Proceedings. IEEE International Conference on Bioinformatics and Biomedicine_, _2016_, 178–183. https://doi.org/10.1109/bibm.2016.7822515
420
+
421
+ [^19]: Chen, C., Hou, J., Shi, X., Yang, H., Birchler, J. A., & Cheng, J. (2021). DeepGRN: prediction of transcription factor binding site across cell-types using attention-based deep neural networks. _BMC bioinformatics_, _22_(1), 38. https://doi.org/10.1186/s12859-020-03952-1
422
+
423
+ [^20]: Wang, K., Zeng, X., Zhou, J., Liu, F., Luan, X., & Wang, X. (2024). BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. _Briefings in bioinformatics_, _25_(3), bbae195. https://doi.org/10.1093/bib/bbae195
424
+
425
+ [^21]: Jing Zhang, F., Zhang, S. W., & Zhang, S. (2022). Prediction of Transcription Factor Binding Sites With an Attention Augmented Convolutional Neural Network. _IEEE/ACM transactions on computational biology and bioinformatics_, _19_(6), 3614–3623. https://doi.org/10.1109/TCBB.2021.3126623
426
+
427
+ [^22]: Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L., & Noble, W. S. (2007). Quantifying similarity between motifs. _Genome biology_, _8_(2), R24. https://doi.org/10.1186/gb-2007-8-2-r24
428
+
429
+ [^23]: Mahony, S., & Benos, P. V. (2007). STAMP: a web tool for exploring DNA-binding motif similarities. _Nucleic acids research_, _35_(Web Server issue), W253–W258. https://doi.org/10.1093/nar/gkm272
430
+
431
+ [^24]: Vorontsov, I. E., Kulakovskiy, I. V., & Makeev, V. J. (2013). Jaccard index based similarity measure to compare transcription factor binding site models. _Algorithms for molecular biology : AMB_, _8_(1), 23. https://doi.org/10.1186/1748-7188-8-23
432
+
433
+ [^25]: Lambert, S. A., Albu, M., Hughes, T. R., & Najafabadi, H. S. (2016). Motif comparison based on similarity of binding affinity profiles. _Bioinformatics (Oxford, England)_, _32_(22), 3504–3506. https://doi.org/10.1093/bioinformatics/btw489
@@ -0,0 +1,21 @@
1
+ yamcot/__init__.py,sha256=TmzwEcRytsoehj1RdWg_97sTbToxF0DZ3iU8OLtkB68,1798
2
+ yamcot/_core/__init__.py,sha256=ivT4zUSABZskPPvISFsNS7UKXSgNnDla-CSn9jTbmU4,598
3
+ yamcot/_core/_core.cp314-win_amd64.pyd,sha256=DVJkuMi47W1cyUQ2GxzkEETCKvc_0JdBBilR5eiFxHs,86528
4
+ yamcot/_core/bindings.cpp,sha256=f3w9sw5PT57cL22PEZn7wZnqMOwPAhcS08RpmUlPFOU,809
5
+ yamcot/_core/core_functions.h,sha256=O0K_YzG8_LLDPwlwkltSlzwsK_ubmoSAf_j5MWu7I8g,626
6
+ yamcot/_core/fasta_to_plain.h,sha256=_4tNONTcVXM2QA_Kf4LaBPMvd6zWzeOLjJopLHJZhJE,3522
7
+ yamcot/_core/mco_prc.cpp,sha256=PTVpvkJcrT8zMNI5_4eTi59CZIBNc9xfUPYgcx-erRA,42833
8
+ yamcot/_core/pfm_to_pwm.h,sha256=_EW-UmC3G60kQMNEFpPr2XzxppACX25u-GvHgMzMTFY,2535
9
+ yamcot/cli.py,sha256=Ve-Q0nnyMOf1U76LUus-coMb3KG2DOUHTmfmDQ4KULU,22805
10
+ yamcot/comparison.py,sha256=bgGAuv36TR-SmYsAJ0fLIkQkM3ku82lVr-xBG7el0r0,41161
11
+ yamcot/execute.py,sha256=B6_KwgEmRFIqP5nk8jBfHX60uuxdNoJmmrfwE7pbc9I,2716
12
+ yamcot/functions.py,sha256=ckRp3y7Ab_Rx03PBY-hYnVFs7vPuBmSJfSVtyEIRuo0,26172
13
+ yamcot/io.py,sha256=ne57h0YA7QWpVgZYTf1p8uBWanqjX-vPZVXzfLOT0II,19218
14
+ yamcot/models.py,sha256=vAfxiz-ehWKccMLDKiMYlQz7ujWwN344nyHlKaE5FXo,40989
15
+ yamcot/pipeline.py,sha256=oFsPWA7zoLlz56GYxoP7VkZY0A9UWQYOYP18OkKLMak,15885
16
+ yamcot/ragged.py,sha256=c0csUMmiYXYc2OQnPybkg4g-TXwHXWW-K1fkYwylacg,3552
17
+ yamcot-1.0.0.dist-info/METADATA,sha256=lI2X0KghL_nGUbV8bI-ehKGW6w8KoG3QZMYxAd9aAKU,25919
18
+ yamcot-1.0.0.dist-info/WHEEL,sha256=gWMs92Yhbl9pSGNRFWCXG1mfeuNl7HhxcJG5aLu4nQc,106
19
+ yamcot-1.0.0.dist-info/entry_points.txt,sha256=o2DLtC7_k2XDhzvqZ79fyjzGGqL0AOnjx5YTLlzYHKc,48
20
+ yamcot-1.0.0.dist-info/licenses/LICENSE,sha256=jkQiLNG7Rsv6Clx4AarWaoJE3RSF_gKjWAzUqZD_YP4,1090
21
+ yamcot-1.0.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: scikit-build-core 0.11.6
3
+ Root-Is-Purelib: false
4
+ Tag: cp314-cp314-win_amd64
5
+
@@ -0,0 +1,3 @@
1
+ [console_scripts]
2
+ yamcot = yamcot.cli:main_cli
3
+
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Anton Tsukanov
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.