tetss2 0.1.0__tar.gz
- tetss2-0.1.0/MANIFEST.in +2 -0
- tetss2-0.1.0/PKG-INFO +375 -0
- tetss2-0.1.0/README.md +364 -0
- tetss2-0.1.0/pyproject.toml +31 -0
- tetss2-0.1.0/setup.cfg +4 -0
- tetss2-0.1.0/src/tetss2/__init__.py +3 -0
- tetss2-0.1.0/src/tetss2/assets/best_model.pth +0 -0
- tetss2-0.1.0/src/tetss2/cli.py +108 -0
- tetss2-0.1.0/src/tetss2/model.py +27 -0
- tetss2-0.1.0/src/tetss2/predictor.py +139 -0
- tetss2-0.1.0/src/tetss2.egg-info/PKG-INFO +375 -0
- tetss2-0.1.0/src/tetss2.egg-info/SOURCES.txt +14 -0
- tetss2-0.1.0/src/tetss2.egg-info/dependency_links.txt +1 -0
- tetss2-0.1.0/src/tetss2.egg-info/entry_points.txt +2 -0
- tetss2-0.1.0/src/tetss2.egg-info/requires.txt +3 -0
- tetss2-0.1.0/src/tetss2.egg-info/top_level.txt +1 -0
tetss2-0.1.0/MANIFEST.in
ADDED
tetss2-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,375 @@
Metadata-Version: 2.4
Name: tetss2
Version: 0.1.0
Summary: TETSS2.0: a PyTorch model for predicting TE-TSS activity from DNA sequences.
Author-email: Moriyaa Cui <2311459@tongji.edu.cn>
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.21
Requires-Dist: pandas>=1.3
Requires-Dist: torch>=1.10

# TETSS2.0

**TETSS2.0** is a deep learning model for predicting TE-derived transcription start site (TE-TSS) activity from DNA sequences.

The Python package name and command-line tool name are **`tetss2`**. The model name shown in documents, figures, and the website is **TETSS2.0**.

---

## Overview

TETSS2.0 is a PyTorch-based convolutional neural network classifier designed to predict whether an input DNA sequence is associated with TE-TSS activity.

The model takes a DNA sequence as input and returns:

* a prediction probability
* a binary prediction label
* the classification threshold used for prediction

By default, TETSS2.0 uses a threshold of `0.5`.

---

## Input requirement

TETSS2.0 expects DNA sequences of exactly **201 bp**.

Allowed bases:

```text
A, C, G, T, N
```

Notes:

* Input sequences are automatically converted to uppercase.
* `N` is allowed but is encoded as an all-zero position in the one-hot representation.
* Sequences shorter or longer than 201 bp are rejected by default.
* The option `--no-length-check` is available only for debugging and is not recommended for normal prediction.
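
The input rules above can be sketched in a few lines of NumPy. This is an illustration only, not the package's own encoder: the function name `one_hot_encode`, the channel-first `(4, 201)` layout, and the A, C, G, T channel order are assumptions made for the example.

```python
import numpy as np

EXPECTED_LENGTH = 201  # sequence length stated in this README
BASE_TO_CHANNEL = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed channel order


def one_hot_encode(sequence, check_length=True):
    """Encode a DNA string as a (4, L) float32 array; N becomes an all-zero column."""
    sequence = sequence.upper()  # inputs are upper-cased before encoding
    if check_length and len(sequence) != EXPECTED_LENGTH:
        raise ValueError(f"expected {EXPECTED_LENGTH} bp, got {len(sequence)} bp")
    encoded = np.zeros((4, len(sequence)), dtype=np.float32)
    for position, base in enumerate(sequence):
        if base == "N":
            continue  # leave the column at zero
        if base not in BASE_TO_CHANNEL:
            raise ValueError(f"unsupported base {base!r} at position {position}")
        encoded[BASE_TO_CHANNEL[base], position] = 1.0
    return encoded


# Lower-case input with one N, padded to 201 bases.
example = one_hot_encode("acgtn" + "A" * 196)
print(example.shape)  # (4, 201)
print(example[:, 4])  # the column for N stays all zero
```

Passing `check_length=False` in this sketch plays the role that `--no-length-check` is described as playing on the command line.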

---

## Output

For each input sequence, TETSS2.0 outputs:

| Column                   | Description                              |
| ------------------------ | ---------------------------------------- |
| `tetss2_sequence_length` | Length of the input sequence             |
| `tetss2_probability`     | Predicted probability score              |
| `tetss2_prediction`      | Binary prediction result, `0` or `1`     |
| `tetss2_threshold`       | Threshold used for binary classification |

---

## Installation

### Option 1: Install from a local source directory

If you have downloaded or cloned this package locally, enter the package directory and install it with:

```bash
cd tetss_rampage_package
pip install -e .
```

After installation, check whether the command-line tool is available:

```bash
tetss2 --help
```

### Option 2: Recommended conda environment

We recommend creating a clean conda environment before installation:

```bash
conda create -n tetss2 python=3.9
conda activate tetss2
conda install numpy pandas scikit-learn
pip install torch==1.10.2

pip install -e .
```

A future public release may support:

```bash
pip install tetss2
```

---

## Command-line usage

### 1. Predict a single sequence

```bash
tetss2 predict --sequence ACGTACGTACGTACGT
```

For normal use, the input sequence should be exactly 201 bp:

```bash
tetss2 predict --sequence YOUR_201BP_DNA_SEQUENCE
```

Example output:

```json
{
  "model": "TETSS2.0",
  "sequence": "ACGTACGTACGTACGT",
  "sequence_length": 16,
  "probability": 0.35490313172340393,
  "prediction": 0,
  "threshold": 0.5
}
```

Note: the example above uses a short sequence only to demonstrate the command format. Since sequences other than 201 bp are rejected by default, reproducing this 16 bp example presumably requires the `--no-length-check` debugging option. For biological prediction, please use a 201 bp sequence.

---

### 2. Batch prediction from a TSV file

Prepare an input file containing a `sequence` column.

Example input file: `input.tsv`

```tsv
sample_id	sequence
sample1	ACGT...
sample2	TTTT...
sample3	GCGC...
```
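
One way to produce such a file with real tab separators is Python's standard `csv` module. The sketch below is illustrative: the placeholder sequences (201-base homopolymers) are invented for the example, and only the `sample_id` and `sequence` column names come from the example file above.

```python
import csv

# Placeholder records; real inputs would be 201 bp genomic sequences.
records = [
    {"sample_id": "sample1", "sequence": "A" * 201},
    {"sample_id": "sample2", "sequence": "T" * 201},
]

with open("input.tsv", "w", newline="") as handle:
    writer = csv.DictWriter(
        handle,
        fieldnames=["sample_id", "sequence"],
        delimiter="\t",       # real tab characters between columns
        lineterminator="\n",  # plain newline at the end of each row
    )
    writer.writeheader()
    writer.writerows(records)

# Read the header back to confirm that it splits on tabs.
with open("input.tsv") as handle:
    header = handle.readline().rstrip("\n")
print(header.split("\t"))  # ['sample_id', 'sequence']
```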

Run batch prediction:

```bash
tetss2 predict-file \
  --input input.tsv \
  --output tetss2_predictions.tsv
```

The output file will contain the original columns plus TETSS2.0 prediction results.

Example output:

```tsv
sample_id	sequence	tetss2_sequence_length	tetss2_probability	tetss2_prediction	tetss2_threshold
sample1	ACGT...	201	0.3549	0	0.5
sample2	TTTT...	201	0.8123	1	0.5
sample3	GCGC...	201	0.4471	0	0.5
```
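
The output table can be post-processed with any TSV reader. As a hedged illustration, the sketch below parses the three example rows shown above (inlined as a string rather than read from `tetss2_predictions.tsv`) with Python's standard `csv` module and keeps the rows predicted as positive. The helper name `positive_rows` is made up for this example.

```python
import csv
import io

# The example output above, written with explicit tab separators.
EXAMPLE_OUTPUT = (
    "sample_id\tsequence\ttetss2_sequence_length\ttetss2_probability\t"
    "tetss2_prediction\ttetss2_threshold\n"
    "sample1\tACGT...\t201\t0.3549\t0\t0.5\n"
    "sample2\tTTTT...\t201\t0.8123\t1\t0.5\n"
    "sample3\tGCGC...\t201\t0.4471\t0\t0.5\n"
)


def positive_rows(handle):
    """Return the rows whose tetss2_prediction column equals 1."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if int(row["tetss2_prediction"]) == 1]


hits = positive_rows(io.StringIO(EXAMPLE_OUTPUT))
for row in hits:
    print(row["sample_id"], float(row["tetss2_probability"]))  # sample2 0.8123
```

For a real run, `open("tetss2_predictions.tsv", newline="")` would replace the `io.StringIO` object.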

---

### 3. Batch prediction from a CSV file

If your input file is comma-separated, use `--sep ","`:

```bash
tetss2 predict-file \
  --input input.csv \
  --output tetss2_predictions.csv \
  --sep ","
```

---

### 4. Use a custom sequence column name

If the sequence column is not named `sequence`, specify it with `--sequence-column`.

For example, if the input file contains a column named `dna`:

```bash
tetss2 predict-file \
  --input input.tsv \
  --output tetss2_predictions.tsv \
  --sequence-column dna
```

---

## Python API usage

TETSS2.0 can also be used directly in Python.

```python
from tetss2 import TETSS2Predictor

predictor = TETSS2Predictor()

result = predictor.predict("YOUR_201BP_DNA_SEQUENCE")
print(result)
```

Example output:

```python
{
    "model": "TETSS2.0",
    "sequence": "YOUR_201BP_DNA_SEQUENCE",
    "sequence_length": 201,
    "probability": 0.73,
    "prediction": 1,
    "threshold": 0.5,
}
```
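
Scoring several sequences from Python can be sketched as a loop over the `predict` method shown above. To keep the example runnable without the package or its weights, it uses a stand-in class, `StubPredictor`, that returns dictionaries shaped like the example output and a dummy score; both `StubPredictor` and `predict_many` are hypothetical names, and with the package installed a `TETSS2Predictor()` instance would take the stub's place.

```python
def predict_many(predictor, sequences):
    """Call predictor.predict on each sequence and collect the result dicts."""
    return [predictor.predict(sequence) for sequence in sequences]


class StubPredictor:
    """Stand-in returning dictionaries shaped like the example output above."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def predict(self, sequence):
        # Dummy score (fraction of G bases); NOT the TETSS2.0 model.
        probability = sequence.upper().count("G") / max(len(sequence), 1)
        return {
            "model": "stub",
            "sequence": sequence,
            "sequence_length": len(sequence),
            "probability": probability,
            "prediction": int(probability >= self.threshold),
            "threshold": self.threshold,
        }


results = predict_many(StubPredictor(), ["G" * 201, "A" * 201])
print([result["prediction"] for result in results])  # [1, 0]
```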

---

## Model architecture

TETSS2.0 uses a one-dimensional convolutional neural network for DNA sequence classification.

The model contains:

* one-hot encoding of DNA sequences
* multiple 1D convolutional layers
* batch normalization
* ReLU activation
* max pooling
* adaptive max pooling
* fully connected classification layers

The model outputs a single logit, which is converted to a probability using the sigmoid function.
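
The final step described above can be written out explicitly. This sketch uses plain Python rather than PyTorch, and the function names are illustrative. Whether a probability exactly equal to the threshold counts as positive is not stated here, so the `>=` comparison is an assumption.

```python
import math


def sigmoid(logit):
    """Logistic function, written to avoid overflow for large negative logits."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def classify(logit, threshold=0.5):
    """Convert a raw logit into (probability, binary label)."""
    probability = sigmoid(logit)
    return probability, int(probability >= threshold)


print(classify(0.0))   # (0.5, 1)
print(classify(-0.6))  # probability below 0.5, label 0
```

With the default threshold of `0.5`, the label is `1` exactly when the logit is non-negative, since the sigmoid of zero is `0.5`.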

---

## Model files

The package includes the trained model weight file:

```text
best_model.pth
```

The original training output directory also contains:

```text
best_model.pth
run_config.json
train_history.tsv
final_val_metrics.json
split_summary.json
train_split.tsv
val_split.tsv
```

These files can be used to document model configuration, validation performance, and data splitting information.

---

## Validation performance

Please fill in the following values using `final_val_metrics.json`:

| Metric          | Value |
| --------------- | ----: |
| AUROC           |  TODO |
| AUPRC           |  TODO |
| Accuracy        |  TODO |
| Precision       |  TODO |
| Recall          |  TODO |
| True negatives  |  TODO |
| False positives |  TODO |
| False negatives |  TODO |
| True positives  |  TODO |

To view the metrics file:

```bash
cat final_val_metrics.json
```

---

## Troubleshooting

### MKL threading error

If you see an error similar to:

```text
mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1
```

try running:

```bash
export MKL_THREADING_LAYER=GNU
```

Then run the prediction command again.
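
When the model is called from a Python script or notebook rather than from a shell, the same variable can be set from Python. A minimal sketch, assuming it runs at the very top of the script: the variable generally needs to be in place before NumPy or PyTorch is imported for the first time.

```python
import os

# Set this before the first import of numpy or torch so MKL sees it at load time.
os.environ["MKL_THREADING_LAYER"] = "GNU"

print(os.environ["MKL_THREADING_LAYER"])  # GNU
```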

---

### Cannot find sequence column

If you see an error like:

```text
Cannot find sequence column 'sequence'
```

check whether your input file contains a real tab or comma separator.

For a TSV file, you can check tabs with:

```bash
cat -A input.tsv
```

A real tab will appear as:

```text
^I
```

A correct TSV file should look like:

```text
sample_id^Isequence$
sample1^IACGT...$
```
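
The same check can be scripted. The helper below is a hypothetical sketch and not part of `tetss2`: it reads only the header line and returns the column names obtained by splitting on the chosen separator, which makes a space-separated header easy to spot.

```python
import os
import tempfile


def header_columns(path, sep="\t"):
    """Return the header fields of a delimited text file."""
    with open(path, newline="") as handle:
        header = handle.readline().rstrip("\r\n")
    return header.split(sep)


# Demonstration on two temporary files: one with a real tab, one with a space.
with tempfile.TemporaryDirectory() as workdir:
    good = os.path.join(workdir, "good.tsv")
    bad = os.path.join(workdir, "bad.tsv")
    with open(good, "w") as handle:
        handle.write("sample_id\tsequence\nsample1\tACGT\n")
    with open(bad, "w") as handle:
        handle.write("sample_id sequence\nsample1 ACGT\n")
    good_columns = header_columns(good)
    bad_columns = header_columns(bad)

print("sequence" in good_columns)  # True
print("sequence" in bad_columns)   # False: the header was read as one column
```

For a comma-separated file, `header_columns(path, sep=",")` performs the equivalent check.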

---

## Citation

If you use TETSS2.0 in your research, please cite:

@software{tetss2,
  title={TETSS2: Deep Learning Model for TE-TSS Prediction},
  year={2026}
}

---

## Contact

For questions or issues, please contact the developer or open an issue in the project repository.

---

## License

License information will be added later.

## Repository

https://github.com/MoriyaaCui/TETSS2.git

## Live Demo

A Gradio demo is available:

```bash
python demo/app.py
```
tetss2-0.1.0/README.md
ADDED