tetss2 0.1.0__tar.gz

@@ -0,0 +1,2 @@
1
+ recursive-include src/tetss2/assets *.pth
2
+ include README.md
tetss2-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,375 @@
1
+ Metadata-Version: 2.4
2
+ Name: tetss2
3
+ Version: 0.1.0
4
+ Summary: TETSS2.0: a PyTorch model for predicting TE-TSS activity from DNA sequences.
5
+ Author-email: Moriyaa Cui <2311459@tongji.edu.cn>
6
+ Requires-Python: >=3.8
7
+ Description-Content-Type: text/markdown
8
+ Requires-Dist: numpy>=1.21
9
+ Requires-Dist: pandas>=1.3
10
+ Requires-Dist: torch>=1.10
11
+
12
+ # TETSS2.0
13
+
14
+ **TETSS2.0** is a deep learning model for predicting TE-derived transcription start site (TE-TSS) activity from DNA sequences.
15
+
16
+ The Python package name and command-line tool name are **`tetss2`**. The model name shown in documents, figures, and the website is **TETSS2.0**.
17
+
18
+ ---
19
+
20
+ ## Overview
21
+
22
+ TETSS2.0 is a PyTorch-based convolutional neural network classifier designed to predict whether an input DNA sequence is associated with TE-TSS activity.
23
+
24
+ The model takes a DNA sequence as input and returns:
25
+
26
+ * a prediction probability
27
+ * a binary prediction label
28
+ * the classification threshold used for prediction
29
+
30
+ By default, TETSS2.0 uses a threshold of `0.5`.
31
+
32
+ ---
33
+
34
+ ## Input requirements
35
+
36
+ TETSS2.0 expects DNA sequences of exactly **201 bp**.
37
+
38
+ Allowed bases:
39
+
40
+ ```text
41
+ A, C, G, T, N
42
+ ```
43
+
44
+ Notes:
45
+
46
+ * Input sequences are automatically converted to uppercase.
47
+ * `N` is allowed but is encoded as an all-zero position in the one-hot representation.
48
+ * Sequences shorter or longer than 201 bp are rejected by default.
49
+ * The option `--no-length-check` is available only for debugging and is not recommended for normal prediction.
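The input rules listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's actual implementation: the function name `one_hot_encode`, the A/C/G/T channel order, and the error messages are hypothetical.

```python
from typing import List

EXPECTED_LENGTH = 201   # input length stated in this README
BASE_ORDER = "ACGT"     # assumed channel order, for illustration only


def one_hot_encode(sequence: str, check_length: bool = True) -> List[List[int]]:
    """Return one 4-element row per base; N becomes an all-zero row."""
    seq = sequence.upper()  # inputs are converted to uppercase
    if check_length and len(seq) != EXPECTED_LENGTH:
        raise ValueError(f"expected {EXPECTED_LENGTH} bp, got {len(seq)} bp")
    rows = []
    for base in seq:
        if base == "N":
            rows.append([0, 0, 0, 0])  # unknown base: all-zero position
        elif base in BASE_ORDER:
            rows.append([1 if base == b else 0 for b in BASE_ORDER])
        else:
            raise ValueError(f"unsupported base: {base!r}")
    return rows


# Length check disabled here only so that a short demo string can be shown.
print(one_hot_encode("acgtn", check_length=False))
```

With the length check left on, any sequence that is not exactly 201 bp raises an error in this sketch, mirroring the default behavior described above.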
50
+
51
+ ---
52
+
53
+ ## Output
54
+
55
+ For each input sequence, TETSS2.0 outputs:
56
+
57
+ | Column | Description |
58
+ | ------------------------ | ---------------------------------------- |
59
+ | `tetss2_sequence_length` | Length of the input sequence |
60
+ | `tetss2_probability` | Predicted probability score |
61
+ | `tetss2_prediction` | Binary prediction result, `0` or `1` |
62
+ | `tetss2_threshold` | Threshold used for binary classification |
63
+
64
+ ---
65
+
66
+ ## Installation
67
+
68
+ ### Option 1: Install from a local source directory
69
+
70
+ If you have downloaded or cloned this package locally, enter the package directory and install it with:
71
+
72
+ ```bash
73
+ cd tetss_rampage_package
74
+ pip install -e .
75
+ ```
76
+
77
+ After installation, check whether the command-line tool is available:
78
+
79
+ ```bash
80
+ tetss2 --help
81
+ ```
82
+
83
+ ### Option 2: Recommended conda environment
84
+
85
+ We recommend creating a clean conda environment before installation:
86
+
87
+ ```bash
88
+
89
+ conda create -n tetss2 python=3.9
90
+ conda activate tetss2
91
+ conda install numpy pandas scikit-learn
92
+ pip install torch==1.10.2
93
+
94
+ pip install -e .
95
+
96
+ ```
97
+
98
+ A future public release may support:
99
+
100
+ ```bash
101
+ pip install tetss2
102
+ ```
103
+
104
+ ---
105
+
106
+ ## Command-line usage
107
+
108
+ ### 1. Predict a single sequence
109
+
110
+ ```bash
111
+ tetss2 predict --sequence ACGTACGTACGTACGT
112
+ ```
113
+
114
+ For normal use, the input sequence should be exactly 201 bp:
115
+
116
+ ```bash
117
+ tetss2 predict --sequence YOUR_201BP_DNA_SEQUENCE
118
+ ```
119
+
120
+ Example output:
121
+
122
+ ```json
123
+ {
124
+ "model": "TETSS2.0",
125
+ "sequence": "ACGTACGTACGTACGT",
126
+ "sequence_length": 16,
127
+ "probability": 0.35490313172340393,
128
+ "prediction": 0,
129
+ "threshold": 0.5
130
+ }
131
+ ```
132
+
133
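+ Note: the example above uses a 16 bp sequence only to demonstrate the command format. Because sequences that are not exactly 201 bp are rejected by default, such a short input would normally require the `--no-length-check` debugging option. For biological prediction, please use a 201 bp sequence.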
134
+
135
+ ---
136
+
137
+ ### 2. Batch prediction from a TSV file
138
+
139
+ Prepare an input file containing a `sequence` column.
140
+
141
+ Example input file: `input.tsv`
142
+
143
+ ```tsv
144
+ sample_id sequence
145
+ sample1 ACGT...
146
+ sample2 TTTT...
147
+ sample3 GCGC...
148
+ ```
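An input file like the one above can be produced with Python's standard `csv` module, which guarantees real tab separators. This is a minimal sketch: the helper name `write_input_tsv` and the placeholder sequences are assumptions made for illustration, and only the `sequence` column name comes from this README.

```python
import csv
from typing import Dict, List


def write_input_tsv(path: str, rows: List[Dict[str, str]]) -> None:
    """Write sample_id/sequence rows as a tab-separated file with a header."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=["sample_id", "sequence"],
            delimiter="\t",        # real tab characters between columns
            lineterminator="\n",   # plain Unix line endings
        )
        writer.writeheader()
        writer.writerows(rows)


# Placeholder 201 bp sequences; replace them with real DNA sequences.
rows = [
    {"sample_id": "sample1", "sequence": "A" * 201},
    {"sample_id": "sample2", "sequence": "T" * 201},
]
write_input_tsv("input.tsv", rows)
```

The resulting `input.tsv` can then be passed to `tetss2 predict-file --input input.tsv`.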
149
+
150
+ Run batch prediction:
151
+
152
+ ```bash
153
+ tetss2 predict-file \
154
+ --input input.tsv \
155
+ --output tetss2_predictions.tsv
156
+ ```
157
+
158
+ The output file will contain the original columns plus TETSS2.0 prediction results.
159
+
160
+ Example output:
161
+
162
+ ```tsv
163
+ sample_id sequence tetss2_sequence_length tetss2_probability tetss2_prediction tetss2_threshold
164
+ sample1 ACGT... 201 0.3549 0 0.5
165
+ sample2 TTTT... 201 0.8123 1 0.5
166
+ sample3 GCGC... 201 0.4471 0 0.5
167
+ ```
168
+
169
+ ---
170
+
171
+ ### 3. Batch prediction from a CSV file
172
+
173
+ If your input file is comma-separated, use `--sep ","`:
174
+
175
+ ```bash
176
+ tetss2 predict-file \
177
+ --input input.csv \
178
+ --output tetss2_predictions.csv \
179
+ --sep ","
180
+ ```
181
+
182
+ ---
183
+
184
+ ### 4. Use a custom sequence column name
185
+
186
+ If the sequence column is not named `sequence`, specify it with `--sequence-column`.
187
+
188
+ For example, if the input file contains a column named `dna`:
189
+
190
+ ```bash
191
+ tetss2 predict-file \
192
+ --input input.tsv \
193
+ --output tetss2_predictions.tsv \
194
+ --sequence-column dna
195
+ ```
196
+
197
+ ---
198
+
199
+ ## Python API usage
200
+
201
+ TETSS2.0 can also be used directly in Python.
202
+
203
+ ```python
204
+ from tetss2 import TETSS2Predictor
205
+
206
+ predictor = TETSS2Predictor()
207
+
208
+ result = predictor.predict("YOUR_201BP_DNA_SEQUENCE")
209
+ print(result)
210
+ ```
211
+
212
+ Example output:
213
+
214
+ ```python
215
+ {
216
+ "model": "TETSS2.0",
217
+ "sequence": "YOUR_201BP_DNA_SEQUENCE",
218
+ "sequence_length": 201,
219
+ "probability": 0.73,
220
+ "prediction": 1,
221
+ "threshold": 0.5,
222
+ }
223
+ ```
224
+
225
+ ---
226
+
227
+ ## Model architecture
228
+
229
+ TETSS2.0 uses a one-dimensional convolutional neural network for DNA sequence classification.
230
+
231
+ The model contains:
232
+
233
+ * one-hot encoding of DNA sequences
234
+ * multiple 1D convolutional layers
235
+ * batch normalization
236
+ * ReLU activation
237
+ * max pooling
238
+ * adaptive max pooling
239
+ * fully connected classification layers
240
+
241
+ The model outputs a single logit, which is converted to a probability using the sigmoid function.
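The last step, turning a logit into a probability and a binary label, can be sketched without PyTorch. The function name `logit_to_prediction` is hypothetical, and the tie rule (a probability equal to the threshold counts as `1`) is an assumption that this README does not specify.

```python
import math


def logit_to_prediction(logit: float, threshold: float = 0.5) -> dict:
    """Convert a raw logit into a probability and a 0/1 label."""
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if logit >= 0:
        probability = 1.0 / (1.0 + math.exp(-logit))
    else:
        exp_logit = math.exp(logit)
        probability = exp_logit / (1.0 + exp_logit)
    return {
        "probability": probability,
        "prediction": int(probability >= threshold),  # assumed tie rule (>=)
        "threshold": threshold,
    }


print(logit_to_prediction(-0.6))  # probability below 0.5, so prediction is 0
print(logit_to_prediction(2.0))   # probability above 0.5, so prediction is 1
```

Since the sigmoid of 0 is exactly 0.5, a threshold of `0.5` on the probability is equivalent to asking whether the logit is non-negative.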
242
+
243
+ ---
244
+
245
+ ## Model files
246
+
247
+ The package includes the trained model weight file:
248
+
249
+ ```text
250
+ best_model.pth
251
+ ```
252
+
253
+ The original training output directory also contains:
254
+
255
+ ```text
256
+ best_model.pth
257
+ run_config.json
258
+ train_history.tsv
259
+ final_val_metrics.json
260
+ split_summary.json
261
+ train_split.tsv
262
+ val_split.tsv
263
+ ```
264
+
265
+ These files can be used to document model configuration, validation performance, and data splitting information.
266
+
267
+ ---
268
+
269
+ ## Validation performance
270
+
271
+ Please fill in the following values using `final_val_metrics.json`:
272
+
273
+ | Metric | Value |
274
+ | --------------- | ----: |
275
+ | AUROC | TODO |
276
+ | AUPRC | TODO |
277
+ | Accuracy | TODO |
278
+ | Precision | TODO |
279
+ | Recall | TODO |
280
+ | True negatives | TODO |
281
+ | False positives | TODO |
282
+ | False negatives | TODO |
283
+ | True positives | TODO |
284
+
285
+ To view the metrics file:
286
+
287
+ ```bash
288
+ cat final_val_metrics.json
289
+ ```
290
+
291
+ ---
292
+
293
+ ## Troubleshooting
294
+
295
+ ### MKL threading error
296
+
297
+ If you see an error similar to:
298
+
299
+ ```text
300
+ mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1
301
+ ```
302
+
303
+ try running:
304
+
305
+ ```bash
306
+ export MKL_THREADING_LAYER=GNU
307
+ ```
308
+
309
+ Then run the prediction command again.
310
+
311
+ ---
312
+
313
+ ### Cannot find sequence column
314
+
315
+ If you see an error like:
316
+
317
+ ```text
318
+ Cannot find sequence column 'sequence'
319
+ ```
320
+
321
+ check that the columns in your input file are separated by real tab characters (or by commas, if you pass `--sep ","`) rather than by spaces.
322
+
323
+ For a TSV file, you can check tabs with:
324
+
325
+ ```bash
326
+ cat -A input.tsv
327
+ ```
328
+
329
+ A real tab will appear as:
330
+
331
+ ```text
332
+ ^I
333
+ ```
334
+
335
+ A correct TSV file should look like:
336
+
337
+ ```text
338
+ sample_id^Isequence$
339
+ sample1^IACGT...$
340
+ ```
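The same check can be done in Python by parsing only the header row. This is a hedged sketch rather than the package's own logic: `find_column` is a hypothetical helper, and the two demo files are created in a temporary directory purely for illustration.

```python
import csv
import os
import tempfile


def find_column(path: str, column: str = "sequence", sep: str = "\t") -> int:
    """Return the index of `column` in the header row, or raise ValueError."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter=sep), [])
    if column not in header:
        raise ValueError(f"Cannot find column {column!r}; header parsed as {header}")
    return header.index(column)


# Demo files: one with real tabs, one where spaces were used by mistake.
demo_dir = tempfile.mkdtemp()
tab_file = os.path.join(demo_dir, "tabs.tsv")
space_file = os.path.join(demo_dir, "spaces.tsv")
with open(tab_file, "w") as handle:
    handle.write("sample_id\tsequence\nsample1\tACGT\n")
with open(space_file, "w") as handle:
    handle.write("sample_id    sequence\nsample1    ACGT\n")

print(find_column(tab_file))  # header splits into two columns
try:
    find_column(space_file)
except ValueError as error:
    print(error)              # header stays a single unsplit column
```

When spaces are used instead of tabs, the whole header line is parsed as one column name, which is why the `sequence` column cannot be found.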
341
+
342
+ ---
343
+
344
+ ## Citation
345
+
346
+ If you use TETSS2.0 in your research, please cite:
347
+
348
+ @software{tetss2,
349
+ title={TETSS2: Deep Learning Model for TE-TSS Prediction},
350
+ year={2026}
351
+ }
352
+
353
+ ---
354
+
355
+ ## Contact
356
+
357
+ For questions or issues, please contact the developer or open an issue in the project repository.
358
+
359
+ ---
360
+
361
+ ## License
362
+
363
+ License information will be added later.
364
+
365
+ ## Repository
366
+
367
+ https://github.com/MoriyaaCui/TETSS2.git
368
+
369
+
370
+ ## Live Demo
371
+
372
+ A Gradio demo is available:
373
+
374
+ ```bash
375
+ python demo/app.py
tetss2-0.1.0/README.md ADDED
@@ -0,0 +1,364 @@
1
+ # TETSS2.0
2
+
3
+ **TETSS2.0** is a deep learning model for predicting TE-derived transcription start site (TE-TSS) activity from DNA sequences.
4
+
5
+ The Python package name and command-line tool name are **`tetss2`**. The model name shown in documents, figures, and the website is **TETSS2.0**.
6
+
7
+ ---
8
+
9
+ ## Overview
10
+
11
+ TETSS2.0 is a PyTorch-based convolutional neural network classifier designed to predict whether an input DNA sequence is associated with TE-TSS activity.
12
+
13
+ The model takes a DNA sequence as input and returns:
14
+
15
+ * a prediction probability
16
+ * a binary prediction label
17
+ * the classification threshold used for prediction
18
+
19
+ By default, TETSS2.0 uses a threshold of `0.5`.
20
+
21
+ ---
22
+
23
+ ## Input requirements
24
+
25
+ TETSS2.0 expects DNA sequences of exactly **201 bp**.
26
+
27
+ Allowed bases:
28
+
29
+ ```text
30
+ A, C, G, T, N
31
+ ```
32
+
33
+ Notes:
34
+
35
+ * Input sequences are automatically converted to uppercase.
36
+ * `N` is allowed but is encoded as an all-zero position in the one-hot representation.
37
+ * Sequences shorter or longer than 201 bp are rejected by default.
38
+ * The option `--no-length-check` is available only for debugging and is not recommended for normal prediction.
39
+
40
+ ---
41
+
42
+ ## Output
43
+
44
+ For each input sequence, TETSS2.0 outputs:
45
+
46
+ | Column | Description |
47
+ | ------------------------ | ---------------------------------------- |
48
+ | `tetss2_sequence_length` | Length of the input sequence |
49
+ | `tetss2_probability` | Predicted probability score |
50
+ | `tetss2_prediction` | Binary prediction result, `0` or `1` |
51
+ | `tetss2_threshold` | Threshold used for binary classification |
52
+
53
+ ---
54
+
55
+ ## Installation
56
+
57
+ ### Option 1: Install from a local source directory
58
+
59
+ If you have downloaded or cloned this package locally, enter the package directory and install it with:
60
+
61
+ ```bash
62
+ cd tetss_rampage_package
63
+ pip install -e .
64
+ ```
65
+
66
+ After installation, check whether the command-line tool is available:
67
+
68
+ ```bash
69
+ tetss2 --help
70
+ ```
71
+
72
+ ### Option 2: Recommended conda environment
73
+
74
+ We recommend creating a clean conda environment before installation:
75
+
76
+ ```bash
77
+
78
+ conda create -n tetss2 python=3.9
79
+ conda activate tetss2
80
+ conda install numpy pandas scikit-learn
81
+ pip install torch==1.10.2
82
+
83
+ pip install -e .
84
+
85
+ ```
86
+
87
+ A future public release may support:
88
+
89
+ ```bash
90
+ pip install tetss2
91
+ ```
92
+
93
+ ---
94
+
95
+ ## Command-line usage
96
+
97
+ ### 1. Predict a single sequence
98
+
99
+ ```bash
100
+ tetss2 predict --sequence ACGTACGTACGTACGT
101
+ ```
102
+
103
+ For normal use, the input sequence should be exactly 201 bp:
104
+
105
+ ```bash
106
+ tetss2 predict --sequence YOUR_201BP_DNA_SEQUENCE
107
+ ```
108
+
109
+ Example output:
110
+
111
+ ```json
112
+ {
113
+ "model": "TETSS2.0",
114
+ "sequence": "ACGTACGTACGTACGT",
115
+ "sequence_length": 16,
116
+ "probability": 0.35490313172340393,
117
+ "prediction": 0,
118
+ "threshold": 0.5
119
+ }
120
+ ```
121
+
122
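+ Note: the example above uses a 16 bp sequence only to demonstrate the command format. Because sequences that are not exactly 201 bp are rejected by default, such a short input would normally require the `--no-length-check` debugging option. For biological prediction, please use a 201 bp sequence.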
123
+
124
+ ---
125
+
126
+ ### 2. Batch prediction from a TSV file
127
+
128
+ Prepare an input file containing a `sequence` column.
129
+
130
+ Example input file: `input.tsv`
131
+
132
+ ```tsv
133
+ sample_id sequence
134
+ sample1 ACGT...
135
+ sample2 TTTT...
136
+ sample3 GCGC...
137
+ ```
138
+
139
+ Run batch prediction:
140
+
141
+ ```bash
142
+ tetss2 predict-file \
143
+ --input input.tsv \
144
+ --output tetss2_predictions.tsv
145
+ ```
146
+
147
+ The output file will contain the original columns plus TETSS2.0 prediction results.
148
+
149
+ Example output:
150
+
151
+ ```tsv
152
+ sample_id sequence tetss2_sequence_length tetss2_probability tetss2_prediction tetss2_threshold
153
+ sample1 ACGT... 201 0.3549 0 0.5
154
+ sample2 TTTT... 201 0.8123 1 0.5
155
+ sample3 GCGC... 201 0.4471 0 0.5
156
+ ```
157
+
158
+ ---
159
+
160
+ ### 3. Batch prediction from a CSV file
161
+
162
+ If your input file is comma-separated, use `--sep ","`:
163
+
164
+ ```bash
165
+ tetss2 predict-file \
166
+ --input input.csv \
167
+ --output tetss2_predictions.csv \
168
+ --sep ","
169
+ ```
170
+
171
+ ---
172
+
173
+ ### 4. Use a custom sequence column name
174
+
175
+ If the sequence column is not named `sequence`, specify it with `--sequence-column`.
176
+
177
+ For example, if the input file contains a column named `dna`:
178
+
179
+ ```bash
180
+ tetss2 predict-file \
181
+ --input input.tsv \
182
+ --output tetss2_predictions.tsv \
183
+ --sequence-column dna
184
+ ```
185
+
186
+ ---
187
+
188
+ ## Python API usage
189
+
190
+ TETSS2.0 can also be used directly in Python.
191
+
192
+ ```python
193
+ from tetss2 import TETSS2Predictor
194
+
195
+ predictor = TETSS2Predictor()
196
+
197
+ result = predictor.predict("YOUR_201BP_DNA_SEQUENCE")
198
+ print(result)
199
+ ```
200
+
201
+ Example output:
202
+
203
+ ```python
204
+ {
205
+ "model": "TETSS2.0",
206
+ "sequence": "YOUR_201BP_DNA_SEQUENCE",
207
+ "sequence_length": 201,
208
+ "probability": 0.73,
209
+ "prediction": 1,
210
+ "threshold": 0.5,
211
+ }
212
+ ```
213
+
214
+ ---
215
+
216
+ ## Model architecture
217
+
218
+ TETSS2.0 uses a one-dimensional convolutional neural network for DNA sequence classification.
219
+
220
+ The model contains:
221
+
222
+ * one-hot encoding of DNA sequences
223
+ * multiple 1D convolutional layers
224
+ * batch normalization
225
+ * ReLU activation
226
+ * max pooling
227
+ * adaptive max pooling
228
+ * fully connected classification layers
229
+
230
+ The model outputs a single logit, which is converted to a probability using the sigmoid function.
231
+
232
+ ---
233
+
234
+ ## Model files
235
+
236
+ The package includes the trained model weight file:
237
+
238
+ ```text
239
+ best_model.pth
240
+ ```
241
+
242
+ The original training output directory also contains:
243
+
244
+ ```text
245
+ best_model.pth
246
+ run_config.json
247
+ train_history.tsv
248
+ final_val_metrics.json
249
+ split_summary.json
250
+ train_split.tsv
251
+ val_split.tsv
252
+ ```
253
+
254
+ These files can be used to document model configuration, validation performance, and data splitting information.
255
+
256
+ ---
257
+
258
+ ## Validation performance
259
+
260
+ Please fill in the following values using `final_val_metrics.json`:
261
+
262
+ | Metric | Value |
263
+ | --------------- | ----: |
264
+ | AUROC | TODO |
265
+ | AUPRC | TODO |
266
+ | Accuracy | TODO |
267
+ | Precision | TODO |
268
+ | Recall | TODO |
269
+ | True negatives | TODO |
270
+ | False positives | TODO |
271
+ | False negatives | TODO |
272
+ | True positives | TODO |
273
+
274
+ To view the metrics file:
275
+
276
+ ```bash
277
+ cat final_val_metrics.json
278
+ ```
279
+
280
+ ---
281
+
282
+ ## Troubleshooting
283
+
284
+ ### MKL threading error
285
+
286
+ If you see an error similar to:
287
+
288
+ ```text
289
+ mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1
290
+ ```
291
+
292
+ try running:
293
+
294
+ ```bash
295
+ export MKL_THREADING_LAYER=GNU
296
+ ```
297
+
298
+ Then run the prediction command again.
299
+
300
+ ---
301
+
302
+ ### Cannot find sequence column
303
+
304
+ If you see an error like:
305
+
306
+ ```text
307
+ Cannot find sequence column 'sequence'
308
+ ```
309
+
310
+ check that the columns in your input file are separated by real tab characters (or by commas, if you pass `--sep ","`) rather than by spaces.
311
+
312
+ For a TSV file, you can check tabs with:
313
+
314
+ ```bash
315
+ cat -A input.tsv
316
+ ```
317
+
318
+ A real tab will appear as:
319
+
320
+ ```text
321
+ ^I
322
+ ```
323
+
324
+ A correct TSV file should look like:
325
+
326
+ ```text
327
+ sample_id^Isequence$
328
+ sample1^IACGT...$
329
+ ```
330
+
331
+ ---
332
+
333
+ ## Citation
334
+
335
+ If you use TETSS2.0 in your research, please cite:
336
+
337
+ @software{tetss2,
338
+ title={TETSS2: Deep Learning Model for TE-TSS Prediction},
339
+ year={2026}
340
+ }
341
+
342
+ ---
343
+
344
+ ## Contact
345
+
346
+ For questions or issues, please contact the developer or open an issue in the project repository.
347
+
348
+ ---
349
+
350
+ ## License
351
+
352
+ License information will be added later.
353
+
354
+ ## Repository
355
+
356
+ https://github.com/MoriyaaCui/TETSS2.git
357
+
358
+
359
+ ## Live Demo
360
+
361
+ A Gradio demo is available:
362
+
363
+ ```bash
364
+ python demo/app.py