darkprofiler 0.2.2__tar.gz → 0.2.4__tar.gz
- {darkprofiler-0.2.2/src/darkprofiler.egg-info → darkprofiler-0.2.4}/PKG-INFO +6 -11
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/README.md +5 -10
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/pyproject.toml +1 -1
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler/__init__.py +1 -1
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler/cli.py +1 -1
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler/run.py +164 -56
- {darkprofiler-0.2.2 → darkprofiler-0.2.4/src/darkprofiler.egg-info}/PKG-INFO +6 -11
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/LICENSE.txt +0 -0
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/setup.cfg +0 -0
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler.egg-info/SOURCES.txt +0 -0
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler.egg-info/dependency_links.txt +0 -0
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler.egg-info/entry_points.txt +0 -0
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler.egg-info/requires.txt +0 -0
- {darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler.egg-info/top_level.txt +0 -0
{darkprofiler-0.2.2/src/darkprofiler.egg-info → darkprofiler-0.2.4}/PKG-INFO
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.1
|
|
2
2
|
Name: darkprofiler
|
|
3
|
-
Version: 0.2.
|
|
3
|
+
Version: 0.2.4
|
|
4
4
|
Summary: DarkProfiler: Alignment and Classification of Peptides from Reference-Independent De Novo Peptide Sequencing Experiments.
|
|
5
5
|
Author-email: Hanjun Lee <hanjun@alum.mit.edu>
|
|
6
6
|
License: MIT
|
|
@@ -30,7 +30,6 @@ DarkProfiler takes peptide sequences (e.g., from reference‑independent de novo
|
|
|
30
30
|
- **Alternative splicing**
|
|
31
31
|
- **Neoantigens (SNV‑derived mutanome)**
|
|
32
32
|
- **Alternative reading frame peptides**
|
|
33
|
-
- **Amino acid mismatch**
|
|
34
33
|
- **Unknown / unaligned**
|
|
35
34
|
|
|
36
35
|
DarkProfiler is intended to be the *post‑processing / annotation* step after de novo peptide calling or customized peptide discovery in proteomics or immunopeptidomics experiments.
|
|
@@ -206,8 +205,8 @@ Requirements and recommendations:
|
|
|
206
205
|
- Empty sequences are silently ignored.
|
|
207
206
|
- There is no hard limit on peptide length, but short peptides may match many locations and very long peptides may be rare.
|
|
208
207
|
|
|
209
|
-
A peptide sequence is assigned to **at most one output category
|
|
210
|
-
(canonical → alternative splicing → neoantigen → alternative reading frame →
|
|
208
|
+
A peptide sequence is assigned to **at most one output category** within a given Hamming distance, corresponding to the first category that matches in the pipeline
|
|
209
|
+
(canonical → alternative splicing → neoantigen → alternative reading frame → unknown).
|
|
211
210
|
|
|
212
211
|
### VCF with SNVs (optional)
|
|
213
212
|
|
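The first-match-wins rule described in the hunk above can be sketched as follows. This is an illustrative sketch only, not DarkProfiler's implementation: the names `hamming`, `matches`, and `classify_first_match`, the toy sequences, and the brute-force window scan are all assumptions made for the example (the package itself searches through persistent indices).

```python
from typing import Dict, List

# Pipeline order taken from the README text; "unknown" is the fallback.
CATEGORY_ORDER = [
    "canonical",
    "alternativeSplicing",
    "neoantigen",
    "alternativeReadingFrame",
]


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def matches(peptide: str, proteins: List[str], k: int) -> bool:
    """True if the peptide aligns to any protein window with <= k mismatches."""
    n = len(peptide)
    for protein in proteins:
        for i in range(len(protein) - n + 1):
            if hamming(peptide, protein[i:i + n]) <= k:
                return True
    return False


def classify_first_match(peptide: str, databases: Dict[str, List[str]], k: int = 0) -> str:
    """Return the first category (in pipeline order) whose database matches."""
    for category in CATEGORY_ORDER:
        if matches(peptide, databases.get(category, []), k):
            return category
    return "unknown"


# Hypothetical toy databases, one protein each.
toy = {
    "canonical": ["MKTAYIAKQR"],
    "alternativeSplicing": ["MKTAYLLKQR"],
    "neoantigen": ["MKTAYIAKQW"],
}

print(classify_first_match("TAYIAK", toy, k=0))  # canonical
print(classify_first_match("TAYLLK", toy, k=0))  # alternativeSplicing
print(classify_first_match("TAYLLK", toy, k=2))  # canonical
```

The last two calls show why the category depends on the chosen Hamming distance: the same peptide falls through to a later category at `k=0` but is captured by the earlier canonical database once two mismatches are tolerated.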
|
@@ -244,7 +243,6 @@ The database contains translated and derived proteomes as FASTA files:
|
|
|
244
243
|
- `mutanome.fa`
|
|
245
244
|
- `mutatedCanonicalTranscriptome.fa`
|
|
246
245
|
- `mutatedAlternativeTranslatome.fa`
|
|
247
|
-
- `mutatedAlternativeORFeome.fa`
|
|
248
246
|
|
|
249
247
|
DarkProfiler also creates **persistent fast indices** under the same database directory to accelerate peptide search with Hamming distance:
|
|
250
248
|
for example:
|
|
@@ -349,9 +347,8 @@ classify_peptides(
|
|
|
349
347
|
5. Build alternative splicing proteome (CDS must start with `ATG`) and classify peptides
|
|
350
348
|
6. Apply SNVs, build mutanome (CDS must start with `ATG`) and classify peptides
|
|
351
349
|
7. Build alternative ORFs (3 frames) and classify peptides
|
|
352
|
-
8.
|
|
353
|
-
9.
|
|
354
|
-
10. Finalize
|
|
350
|
+
8. Write unaligned peptides and summary plots
|
|
351
|
+
9. Finalize
|
|
355
352
|
|
|
356
353
|
### Category definitions
|
|
357
354
|
|
|
@@ -361,7 +358,7 @@ classify_peptides(
|
|
|
361
358
|
- **ORF region labels**
|
|
362
359
|
For alternative ORF hits, DarkProfiler labels the peptide start as:
|
|
363
360
|
- `uORF` (upstream of CDS start)
|
|
364
|
-
- `intORF` (inside annotated CDS span)
|
|
361
|
+
- `intORF` (out-of-frame peptides from inside annotated CDS span)
|
|
365
362
|
- `dORF` (downstream of CDS end)
|
|
366
363
|
- `lncRNA` (no CDS annotation)
|
|
367
364
|
|
|
@@ -379,7 +376,6 @@ Each category is represented by a separate FASTA file in `output_dir`:
|
|
|
379
376
|
- `alternativeSplicing.fa`
|
|
380
377
|
- `neoantigen.fa`
|
|
381
378
|
- `alternativeReadingFrame.fa`
|
|
382
|
-
- `aminoAcidMismatch.fa`
|
|
383
379
|
- `unknown.fa`
|
|
384
380
|
|
|
385
381
|
For classification FASTAs (all except `unknown.fa`), each record uses:
|
|
@@ -415,7 +411,6 @@ canonical 123
|
|
|
415
411
|
alternativeSplicing 45
|
|
416
412
|
neoantigen 7
|
|
417
413
|
alternativeReadingFrame 32
|
|
418
|
-
aminoAcidMismatch 10
|
|
419
414
|
unknown 83
|
|
420
415
|
```
|
|
421
416
|
|
{darkprofiler-0.2.2 → darkprofiler-0.2.4}/README.md
|
@@ -12,7 +12,6 @@ DarkProfiler takes peptide sequences (e.g., from reference‑independent de novo
|
|
|
12
12
|
- **Alternative splicing**
|
|
13
13
|
- **Neoantigens (SNV‑derived mutanome)**
|
|
14
14
|
- **Alternative reading frame peptides**
|
|
15
|
-
- **Amino acid mismatch**
|
|
16
15
|
- **Unknown / unaligned**
|
|
17
16
|
|
|
18
17
|
DarkProfiler is intended to be the *post‑processing / annotation* step after de novo peptide calling or customized peptide discovery in proteomics or immunopeptidomics experiments.
|
|
@@ -188,8 +187,8 @@ Requirements and recommendations:
|
|
|
188
187
|
- Empty sequences are silently ignored.
|
|
189
188
|
- There is no hard limit on peptide length, but short peptides may match many locations and very long peptides may be rare.
|
|
190
189
|
|
|
191
|
-
A peptide sequence is assigned to **at most one output category
|
|
192
|
-
(canonical → alternative splicing → neoantigen → alternative reading frame →
|
|
190
|
+
A peptide sequence is assigned to **at most one output category** within a given Hamming distance, corresponding to the first category that matches in the pipeline
|
|
191
|
+
(canonical → alternative splicing → neoantigen → alternative reading frame → unknown).
|
|
193
192
|
|
|
194
193
|
### VCF with SNVs (optional)
|
|
195
194
|
|
|
@@ -226,7 +225,6 @@ The database contains translated and derived proteomes as FASTA files:
|
|
|
226
225
|
- `mutanome.fa`
|
|
227
226
|
- `mutatedCanonicalTranscriptome.fa`
|
|
228
227
|
- `mutatedAlternativeTranslatome.fa`
|
|
229
|
-
- `mutatedAlternativeORFeome.fa`
|
|
230
228
|
|
|
231
229
|
DarkProfiler also creates **persistent fast indices** under the same database directory to accelerate peptide search with Hamming distance:
|
|
232
230
|
for example:
|
|
@@ -331,9 +329,8 @@ classify_peptides(
|
|
|
331
329
|
5. Build alternative splicing proteome (CDS must start with `ATG`) and classify peptides
|
|
332
330
|
6. Apply SNVs, build mutanome (CDS must start with `ATG`) and classify peptides
|
|
333
331
|
7. Build alternative ORFs (3 frames) and classify peptides
|
|
334
|
-
8.
|
|
335
|
-
9.
|
|
336
|
-
10. Finalize
|
|
332
|
+
8. Write unaligned peptides and summary plots
|
|
333
|
+
9. Finalize
|
|
337
334
|
|
|
338
335
|
### Category definitions
|
|
339
336
|
|
|
@@ -343,7 +340,7 @@ classify_peptides(
|
|
|
343
340
|
- **ORF region labels**
|
|
344
341
|
For alternative ORF hits, DarkProfiler labels the peptide start as:
|
|
345
342
|
- `uORF` (upstream of CDS start)
|
|
346
|
-
- `intORF` (inside annotated CDS span)
|
|
343
|
+
- `intORF` (out-of-frame peptides from inside annotated CDS span)
|
|
347
344
|
- `dORF` (downstream of CDS end)
|
|
348
345
|
- `lncRNA` (no CDS annotation)
|
|
349
346
|
|
|
@@ -361,7 +358,6 @@ Each category is represented by a separate FASTA file in `output_dir`:
|
|
|
361
358
|
- `alternativeSplicing.fa`
|
|
362
359
|
- `neoantigen.fa`
|
|
363
360
|
- `alternativeReadingFrame.fa`
|
|
364
|
-
- `aminoAcidMismatch.fa`
|
|
365
361
|
- `unknown.fa`
|
|
366
362
|
|
|
367
363
|
For classification FASTAs (all except `unknown.fa`), each record uses:
|
|
@@ -397,7 +393,6 @@ canonical 123
|
|
|
397
393
|
alternativeSplicing 45
|
|
398
394
|
neoantigen 7
|
|
399
395
|
alternativeReadingFrame 32
|
|
400
|
-
aminoAcidMismatch 10
|
|
401
396
|
unknown 83
|
|
402
397
|
```
|
|
403
398
|
|
{darkprofiler-0.2.2 → darkprofiler-0.2.4}/pyproject.toml
|
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|
|
4
4
|
|
|
5
5
|
[project]
|
|
6
6
|
name = "darkprofiler"
|
|
7
|
-
version = "0.2.
|
|
7
|
+
version = "0.2.4"
|
|
8
8
|
description = "DarkProfiler: Alignment and Classification of Peptides from Reference-Independent De Novo Peptide Sequencing Experiments."
|
|
9
9
|
readme = "README.md"
|
|
10
10
|
requires-python = ">=3.7"
|
{darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler/cli.py
|
@@ -121,7 +121,7 @@ def build_parser() -> argparse.ArgumentParser:
|
|
|
121
121
|
"Optional path to existing database directory containing "
|
|
122
122
|
"canonicalProteome.fa, alternativeSplicing.fa, mutanome.fa, "
|
|
123
123
|
"mutatedCanonicalTranscriptome.fa, mutatedAlternativeTranslatome.fa, "
|
|
124
|
-
"
|
|
124
|
+
"and other index files"
|
|
125
125
|
),
|
|
126
126
|
)
|
|
127
127
|
p_run.add_argument(
|
{darkprofiler-0.2.2 → darkprofiler-0.2.4}/src/darkprofiler/run.py
|
@@ -406,11 +406,6 @@ def classify_peptides(reference,
|
|
|
406
406
|
and reuses them on subsequent runs (no re-indexing needed).
|
|
407
407
|
- The query side uses the index for k in {0,1,2} and peptide length in [index_L_min,index_L_max].
|
|
408
408
|
|
|
409
|
-
Other requirements already implemented in prior version:
|
|
410
|
-
- Filter transcripts whose CDS doesn't start with ATG
|
|
411
|
-
- Report transcript nt coordinate and uORF/intORF/dORF/lncRNA/CDS
|
|
412
|
-
- Rename aminoAcidMisincorporation -> aminoAcidMismatch
|
|
413
|
-
"""
|
|
414
409
|
steps = [
|
|
415
410
|
"Filter VCF to exome",
|
|
416
411
|
"Setup and load transcriptome/CDS/knownCanonical",
|
|
@@ -419,7 +414,6 @@ def classify_peptides(reference,
|
|
|
419
414
|
"Generate alternative splicing proteome and classify peptides",
|
|
420
415
|
"Apply SNVs, generate mutanome and classify neoantigens",
|
|
421
416
|
"Generate alternative ORFs and classify peptides",
|
|
422
|
-
"Identify amino acid mismatches",
|
|
423
417
|
"Write unaligned peptides and pie chart",
|
|
424
418
|
"Finalize",
|
|
425
419
|
]
|
|
@@ -481,7 +475,6 @@ def classify_peptides(reference,
|
|
|
481
475
|
|
|
482
476
|
required_db_files = [
|
|
483
477
|
"alternativeSplicing.fa",
|
|
484
|
-
"mutatedAlternativeORFeome.fa",
|
|
485
478
|
"canonicalProteome.fa",
|
|
486
479
|
"mutatedAlternativeTranslatome.fa",
|
|
487
480
|
"mutanome.fa",
|
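Since `mutatedAlternativeORFeome.fa` is dropped from `required_db_files` in this hunk, a check for a reusable database directory might look like the sketch below. The helper name `missing_db_files` is hypothetical, and the list contains only the entries visible in this hunk; the full list in `run.py` may be longer.

```python
import os
import tempfile

# Only the entries visible in the hunk above (assumption: the real list may have more).
REQUIRED_DB_FILES_VISIBLE = [
    "alternativeSplicing.fa",
    "canonicalProteome.fa",
    "mutatedAlternativeTranslatome.fa",
    "mutanome.fa",
]


def missing_db_files(database_dir, required=REQUIRED_DB_FILES_VISIBLE):
    """Return the required file names that are absent from database_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(database_dir, name))
    ]


# Demonstration on a throwaway directory holding two of the four files.
with tempfile.TemporaryDirectory() as db:
    for name in ("canonicalProteome.fa", "mutanome.fa"):
        open(os.path.join(db, name), "w").close()
    print(missing_db_files(db))
```

Returning the list of missing names, rather than a bare boolean, makes it easy to report exactly which files force a rebuild.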
|
@@ -1006,31 +999,169 @@ def classify_peptides(reference,
|
|
|
1006
999
|
|
|
1007
1000
|
report_step_done() # 5
|
|
1008
1001
|
|
|
1009
|
-
#
|
|
1010
|
-
# NOTE: keep your original SNV application as-is (omitted here for brevity),
|
|
1011
|
-
# but ensure translation uses translate_cds_with_map (ATG filter) and then build index.
|
|
1012
|
-
#
|
|
1013
|
-
# If you already have your SNV block, set:
|
|
1014
|
-
# mutated_canonical_transcripts = load_transcriptome(mutatedCanonicalTranscriptome.fa)
|
|
1015
|
-
# mutanome_proteins, mutanome_aa2nt = translate_cds_with_map(mutated_canonical_transcripts, cds_map)
|
|
1016
|
-
# write mutanome.fa
|
|
1017
|
-
#
|
|
1002
|
+
# ---------- build mutanome.fa + mutatedCanonicalTranscriptome.fa ----------
|
|
1018
1003
|
mutated_canonical_tx_fa = os.path.join(database_dir, "mutatedCanonicalTranscriptome.fa")
|
|
1019
1004
|
mutanome_fa = os.path.join(database_dir, "mutanome.fa")
|
|
1020
1005
|
mutanome_idx_dir = os.path.join(database_dir, "mutanome.idx")
|
|
1021
1006
|
|
|
1022
|
-
|
|
1023
|
-
|
|
1024
|
-
|
|
1025
|
-
|
|
1026
|
-
|
|
1007
|
+
def _parse_gff_attributes(attr_str):
|
|
1008
|
+
attrs = {}
|
|
1009
|
+
for item in attr_str.strip().split(";"):
|
|
1010
|
+
item = item.strip()
|
|
1011
|
+
if not item:
|
|
1012
|
+
continue
|
|
1013
|
+
if "=" in item:
|
|
1014
|
+
k, v = item.split("=", 1)
|
|
1015
|
+
attrs[k] = v.strip().strip('"')
|
|
1016
|
+
else:
|
|
1017
|
+
parts = item.split()
|
|
1018
|
+
if len(parts) >= 2:
|
|
1019
|
+
attrs[parts[0]] = parts[1].strip('"')
|
|
1020
|
+
return attrs
|
|
1021
|
+
|
|
1022
|
+
# Build transcript exon model from GFF (exon features only)
|
|
1023
|
+
transcript_exons = defaultdict(list) # tx_id -> list[(start,end)] 1-based closed
|
|
1024
|
+
transcript_strand = {} # tx_id -> '+'/'-'
|
|
1025
|
+
transcript_chrom = {} # tx_id -> chrom (normalized)
|
|
1026
|
+
|
|
1027
|
+
with open(gff_path) as fh:
|
|
1028
|
+
for line in fh:
|
|
1029
|
+
if not line.strip() or line.startswith("#"):
|
|
1030
|
+
continue
|
|
1031
|
+
fields = line.rstrip("\n").split("\t")
|
|
1032
|
+
if len(fields) < 9:
|
|
1033
|
+
continue
|
|
1034
|
+
chrom, source, feature, start, end, score, strand, frame, attrs_str = fields
|
|
1035
|
+
if feature.lower() != "exon":
|
|
1036
|
+
continue
|
|
1037
|
+
try:
|
|
1038
|
+
start_i = int(start)
|
|
1039
|
+
end_i = int(end)
|
|
1040
|
+
except ValueError:
|
|
1041
|
+
continue
|
|
1042
|
+
attrs = _parse_gff_attributes(attrs_str)
|
|
1043
|
+
tx_id = attrs.get("transcript_id") or attrs.get("transcriptId")
|
|
1044
|
+
if tx_id is None:
|
|
1045
|
+
continue
|
|
1046
|
+
tx_id = normalize_gff_tx_id(tx_id)
|
|
1047
|
+
transcript_exons[tx_id].append((start_i, end_i))
|
|
1048
|
+
transcript_strand[tx_id] = strand
|
|
1049
|
+
transcript_chrom[tx_id] = normalize_chrom(chrom)
|
|
1050
|
+
|
|
1051
|
+
# Sort exons ascending by genomic start
|
|
1052
|
+
for tx_id in list(transcript_exons.keys()):
|
|
1053
|
+
transcript_exons[tx_id].sort(key=lambda x: x[0])
|
|
1054
|
+
|
|
1055
|
+
# Exon caches for coordinate mapping
|
|
1056
|
+
exon_order_cache = {} # tx_id -> (exons_sorted, exons_desc)
|
|
1057
|
+
for tx_id, exons_sorted in transcript_exons.items():
|
|
1058
|
+
exon_order_cache[tx_id] = (exons_sorted, list(reversed(exons_sorted)))
|
|
1059
|
+
|
|
1060
|
+
# Index SNVs by chromosome for speed
|
|
1061
|
+
snvs_by_chrom = defaultdict(list) # chrom -> list[(pos, ref, alt)]
|
|
1062
|
+
for chrom, pos, ref, alt in snvs:
|
|
1063
|
+
snvs_by_chrom[chrom].append((int(pos), ref, alt))
|
|
1064
|
+
for chrom in snvs_by_chrom:
|
|
1065
|
+
snvs_by_chrom[chrom].sort(key=lambda x: x[0])
|
|
1066
|
+
|
|
1067
|
+
# Simple base complement
|
|
1068
|
+
_complement = {"A":"T","T":"A","C":"G","G":"C","a":"t","t":"a","c":"g","g":"c"}
|
|
1069
|
+
def _complement_base(b):
|
|
1070
|
+
return _complement.get(b, b)
|
|
1071
|
+
|
|
1072
|
+
def _apply_snvs_to_one_transcript(tx_id):
|
|
1073
|
+
"""
|
|
1074
|
+
Returns: (tx_id, mutated_seq_string) OR (tx_id, None) if transcript not present.
|
|
1075
|
+
If no SNVs (or no mapping), returns original transcript sequence.
|
|
1076
|
+
"""
|
|
1077
|
+
if tx_id not in transcriptome:
|
|
1078
|
+
return tx_id, None
|
|
1079
|
+
|
|
1080
|
+
seq = transcriptome[tx_id]
|
|
1081
|
+
seq_list = list(seq)
|
|
1082
|
+
|
|
1083
|
+
chrom = transcript_chrom.get(tx_id)
|
|
1084
|
+
if chrom is None or chrom not in snvs_by_chrom:
|
|
1085
|
+
return tx_id, seq
|
|
1086
|
+
|
|
1087
|
+
if tx_id not in exon_order_cache:
|
|
1088
|
+
return tx_id, seq
|
|
1089
|
+
|
|
1090
|
+
exons_sorted, exons_desc = exon_order_cache[tx_id]
|
|
1091
|
+
strand = transcript_strand.get(tx_id, "+")
|
|
1092
|
+
|
|
1093
|
+
for pos, ref, alt in snvs_by_chrom[chrom]:
|
|
1094
|
+
if strand == "+":
|
|
1095
|
+
offset = 0
|
|
1096
|
+
within = False
|
|
1097
|
+
for s, e in exons_sorted:
|
|
1098
|
+
if pos < s:
|
|
1099
|
+
break
|
|
1100
|
+
if pos > e:
|
|
1101
|
+
offset += (e - s + 1)
|
|
1102
|
+
else:
|
|
1103
|
+
offset += (pos - s)
|
|
1104
|
+
within = True
|
|
1105
|
+
break
|
|
1106
|
+
if not within:
|
|
1107
|
+
continue
|
|
1108
|
+
tx_index = offset
|
|
1109
|
+
expected_ref = ref.upper()
|
|
1110
|
+
alt_base = alt.upper()
|
|
1111
|
+
else:
|
|
1112
|
+
offset = 0
|
|
1113
|
+
within = False
|
|
1114
|
+
for s, e in exons_desc:
|
|
1115
|
+
if pos > e:
|
|
1116
|
+
continue
|
|
1117
|
+
if pos < s:
|
|
1118
|
+
offset += (e - s + 1)
|
|
1119
|
+
else:
|
|
1120
|
+
offset += (e - pos)
|
|
1121
|
+
within = True
|
|
1122
|
+
break
|
|
1123
|
+
if not within:
|
|
1124
|
+
continue
|
|
1125
|
+
tx_index = offset
|
|
1126
|
+
expected_ref = _complement_base(ref.upper())
|
|
1127
|
+
alt_base = _complement_base(alt.upper())
|
|
1128
|
+
|
|
1129
|
+
if 0 <= tx_index < len(seq_list):
|
|
1130
|
+
if expected_ref and seq_list[tx_index].upper() != expected_ref:
|
|
1131
|
+
continue
|
|
1132
|
+
seq_list[tx_index] = alt_base
|
|
1133
|
+
|
|
1134
|
+
return tx_id, "".join(seq_list)
|
|
1135
|
+
|
|
1136
|
+
# Build mutated canonical transcriptome
|
|
1137
|
+
if build_database:
|
|
1138
|
+
canonical_tx_list = [tx for tx in canonical_tx_ids if tx in transcriptome]
|
|
1139
|
+
|
|
1140
|
+
mutated_canonical_tx_dict = {}
|
|
1141
|
+
if num_threads and num_threads > 1:
|
|
1142
|
+
with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as ex:
|
|
1143
|
+
for tx_id, mutseq in ex.map(_apply_snvs_to_one_transcript, canonical_tx_list, chunksize=50):
|
|
1144
|
+
if mutseq is not None:
|
|
1145
|
+
mutated_canonical_tx_dict[tx_id] = mutseq
|
|
1146
|
+
else:
|
|
1147
|
+
for tx_id in canonical_tx_list:
|
|
1148
|
+
tx_id2, mutseq = _apply_snvs_to_one_transcript(tx_id)
|
|
1149
|
+
if mutseq is not None:
|
|
1150
|
+
mutated_canonical_tx_dict[tx_id2] = mutseq
|
|
1151
|
+
|
|
1152
|
+
with open(mutated_canonical_tx_fa, "w") as out:
|
|
1153
|
+
for tx_id, nt_seq in mutated_canonical_tx_dict.items():
|
|
1154
|
+
out.write(f">{tx_id}\n{nt_seq}\n")
|
|
1155
|
+
|
|
1156
|
+
mutated_canonical_transcripts = mutated_canonical_tx_dict
|
|
1027
1157
|
else:
|
|
1028
|
-
|
|
1029
|
-
mutated_canonical_transcripts = {} # placeholder
|
|
1030
|
-
# ---------- END: your existing SNV block should fill these ----------
|
|
1158
|
+
mutated_canonical_transcripts = load_transcriptome(mutated_canonical_tx_fa)
|
|
1031
1159
|
|
|
1032
|
-
# Translate mutanome
|
|
1033
|
-
mutanome_proteins, mutanome_aa2nt = translate_cds_with_map(
|
|
1160
|
+
# Translate mutanome (CDS only, ATG filtered)
|
|
1161
|
+
mutanome_proteins, mutanome_aa2nt = translate_cds_with_map(
|
|
1162
|
+
mutated_canonical_transcripts,
|
|
1163
|
+
cds_map
|
|
1164
|
+
)
|
|
1034
1165
|
|
|
1035
1166
|
if build_database:
|
|
1036
1167
|
with open(mutanome_fa, "w") as out:
|
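The coordinate mapping inside `_apply_snvs_to_one_transcript` can be illustrated in isolation. The sketch below follows the same logic as the added code (1-based closed exon intervals, exons walked in ascending order on `+` and descending order on `-`, bases complemented on `-`), but the function names `genomic_to_tx_index` and `apply_snv` and the toy sequences are assumptions for this example, not names from the package.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def genomic_to_tx_index(pos, exons_sorted, strand):
    """Map a 1-based genomic position to a 0-based transcript index.

    exons_sorted holds (start, end) pairs, 1-based and closed, sorted by start.
    Returns None when the position is not inside any exon.
    """
    offset = 0
    if strand == "+":
        for s, e in exons_sorted:
            if pos < s:
                return None  # intron or upstream of the transcript
            if pos > e:
                offset += e - s + 1  # skip the whole exon
            else:
                return offset + (pos - s)
        return None
    # Minus strand: transcript position 0 is the highest genomic coordinate.
    for s, e in reversed(exons_sorted):
        if pos > e:
            return None  # intron or beyond the transcript
        if pos < s:
            offset += e - s + 1
        else:
            return offset + (e - pos)
    return None


def apply_snv(seq, exons_sorted, strand, pos, ref, alt):
    """Apply one SNV to a transcript sequence; return seq unchanged if it does not apply."""
    idx = genomic_to_tx_index(pos, exons_sorted, strand)
    if idx is None or not (0 <= idx < len(seq)):
        return seq
    ref, alt = ref.upper(), alt.upper()
    if strand == "-":
        ref, alt = COMPLEMENT.get(ref, ref), COMPLEMENT.get(alt, alt)
    if ref and seq[idx].upper() != ref:
        return seq  # reference base disagrees with the transcript: skip
    return seq[:idx] + alt + seq[idx + 1:]


# Toy transcript with two 5-nt exons; the genomic bases are ACGTA and CCGGT.
exons = [(100, 104), (200, 204)]
plus_tx = "ACGTACCGGT"   # exons concatenated in genomic order
minus_tx = "ACCGGTACGT"  # reverse complement of plus_tx

print(genomic_to_tx_index(201, exons, "+"))           # 6
print(genomic_to_tx_index(201, exons, "-"))           # 3
print(apply_snv(plus_tx, exons, "+", 201, "C", "T"))   # ACGTACTGGT
print(apply_snv(minus_tx, exons, "-", 201, "C", "T"))  # ACCAGTACGT
```

Checking the reference base before substituting is a cheap guard against mismatched genome builds or chromosome naming, at the cost of silently skipping such SNVs.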
|
@@ -1063,8 +1194,7 @@ def classify_peptides(reference,
|
|
|
1063
1194
|
|
|
1064
1195
|
# -------------------------- alternative ORFs (3 frames) ----------------------
|
|
1065
1196
|
alt_orf_translatome_fa = os.path.join(database_dir, "mutatedAlternativeTranslatome.fa")
|
|
1066
|
-
|
|
1067
|
-
alt_orf_idx_dir = os.path.join(database_dir, "mutatedAlternativeORFeome.idx")
|
|
1197
|
+
alt_orf_idx_dir = os.path.join(database_dir, "mutatedAlternativeTranslatome.idx")
|
|
1068
1198
|
|
|
1069
1199
|
if build_database:
|
|
1070
1200
|
alt_orf_records = {}
|
|
@@ -1082,14 +1212,11 @@ def classify_peptides(reference,
|
|
|
1082
1212
|
with open(alt_orf_translatome_fa, "w") as out:
|
|
1083
1213
|
for rid, aa_seq in alt_orf_records.items():
|
|
1084
1214
|
out.write(f">{rid}\n{aa_seq}\n")
|
|
1085
|
-
with open(alt_orf_orfeome_fa, "w") as out:
|
|
1086
|
-
for rid, aa_seq in alt_orf_records.items():
|
|
1087
|
-
out.write(f">{rid}\n{aa_seq}\n")
|
|
1088
1215
|
|
|
1089
|
-
# Build index for
|
|
1216
|
+
# Build index for alt ORF translatome (no aa2nt needed)
|
|
1090
1217
|
if build_fast_index:
|
|
1091
1218
|
build_proteome_index(
|
|
1092
|
-
|
|
1219
|
+
alt_orf_translatome_fa,
|
|
1093
1220
|
alt_orf_idx_dir,
|
|
1094
1221
|
L_min=index_L_min,
|
|
1095
1222
|
L_max=index_L_max,
|
|
@@ -1106,13 +1233,13 @@ def classify_peptides(reference,
|
|
|
1106
1233
|
frame = int(frame_str)
|
|
1107
1234
|
except ValueError:
|
|
1108
1235
|
return None, None, None, None
|
|
1109
|
-
|
|
1236
|
+
|
|
1110
1237
|
nt0 = frame + aa_pos * 3
|
|
1111
1238
|
if tx_id not in transcriptome:
|
|
1112
1239
|
return None, None, None, None
|
|
1113
1240
|
if nt0 < 0 or nt0 >= len(transcriptome[tx_id]):
|
|
1114
1241
|
return None, None, None, None
|
|
1115
|
-
|
|
1242
|
+
|
|
1116
1243
|
if tx_id not in cds_bounds:
|
|
1117
1244
|
region = "lncRNA"
|
|
1118
1245
|
else:
|
|
@@ -1122,7 +1249,6 @@ def classify_peptides(reference,
|
|
|
1122
1249
|
elif nt0 >= cds_end:
|
|
1123
1250
|
region = "dORF"
|
|
1124
1251
|
else:
|
|
1125
|
-
# inside CDS bounds: decide by frame
|
|
1126
1252
|
if ((nt0 - cds_start) % 3) == 0:
|
|
1127
1253
|
region = "CDS"
|
|
1128
1254
|
else:
|
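The region labelling in the two hunks above can be summarized with a small sketch. Only the `lncRNA`, `dORF`, and in-frame `CDS` branches are visible in the diff; the `uORF` condition (`nt0 < cds_start`) and the `intORF` fallback are assumptions inferred from the README definitions, and the function names `peptide_start_nt` and `orf_region` are hypothetical.

```python
def peptide_start_nt(frame, aa_pos):
    """0-based transcript nucleotide where the peptide starts (as in the diff)."""
    return frame + aa_pos * 3


def orf_region(nt0, cds_bounds=None):
    """Label a peptide start position relative to the annotated CDS.

    cds_bounds is (cds_start, cds_end) with an exclusive end, or None when
    the transcript has no CDS annotation.
    """
    if cds_bounds is None:
        return "lncRNA"
    cds_start, cds_end = cds_bounds
    if nt0 < cds_start:          # assumed branch, not shown in the hunk
        return "uORF"
    if nt0 >= cds_end:
        return "dORF"
    if (nt0 - cds_start) % 3 == 0:
        return "CDS"             # in frame with the annotated start codon
    return "intORF"              # assumed: out of frame inside the CDS span


cds = (30, 90)
print(orf_region(peptide_start_nt(0, 5), cds))   # uORF
print(orf_region(peptide_start_nt(0, 10), cds))  # CDS
print(orf_region(peptide_start_nt(1, 10), cds))  # intORF
print(orf_region(peptide_start_nt(0, 30), cds))  # dORF
print(orf_region(peptide_start_nt(2, 4)))        # lncRNA
```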
|
@@ -1144,21 +1270,6 @@ def classify_peptides(reference,
|
|
|
1144
1270
|
|
|
1145
1271
|
report_step_done() # 7
|
|
1146
1272
|
|
|
1147
|
-
# -------------------------- amino acid mismatch ------------------------------
|
|
1148
|
-
mismatch_hit_records, peptides_remaining = classify_with_index(
|
|
1149
|
-
peptides_remaining,
|
|
1150
|
-
alt_orf_orfeome_fa,
|
|
1151
|
-
alt_orf_idx_dir,
|
|
1152
|
-
coord_resolver=coord_resolver_altorf,
|
|
1153
|
-
step_label="amino acid mismatch search",
|
|
1154
|
-
k=hamming_distance,
|
|
1155
|
-
require_aa2nt=False
|
|
1156
|
-
)
|
|
1157
|
-
if mismatch_hit_records:
|
|
1158
|
-
write_peptide_hits_fasta(mismatch_hit_records, os.path.join(output_dir, "aminoAcidMismatch.fa"))
|
|
1159
|
-
|
|
1160
|
-
report_step_done() # 8
|
|
1161
|
-
|
|
1162
1273
|
# -------------------------- unknown -----------------------------------------
|
|
1163
1274
|
if peptides_remaining:
|
|
1164
1275
|
with open(os.path.join(output_dir, "unknown.fa"), "w") as out:
|
|
@@ -1171,7 +1282,6 @@ def classify_peptides(reference,
|
|
|
1171
1282
|
counts["alternativeSplicing"] = len(alt_splicing_hit_records)
|
|
1172
1283
|
counts["neoantigen"] = len(neoantigen_hit_records)
|
|
1173
1284
|
counts["alternativeReadingFrame"] = len(alternative_orf_hit_records)
|
|
1174
|
-
counts["aminoAcidMismatch"] = len(mismatch_hit_records)
|
|
1175
1285
|
counts["unknown"] = len(peptides_remaining)
|
|
1176
1286
|
|
|
1177
1287
|
tsv_path = os.path.join(output_dir, "pieChart.tsv")
|
|
@@ -1185,7 +1295,6 @@ def classify_peptides(reference,
|
|
|
1185
1295
|
"alternativeSplicing",
|
|
1186
1296
|
"neoantigen",
|
|
1187
1297
|
"alternativeReadingFrame",
|
|
1188
|
-
"aminoAcidMismatch",
|
|
1189
1298
|
"unknown",
|
|
1190
1299
|
]
|
|
1191
1300
|
legend_labels = [
|
|
@@ -1193,10 +1302,9 @@ def classify_peptides(reference,
|
|
|
1193
1302
|
"alternative splicing",
|
|
1194
1303
|
"neoantigen",
|
|
1195
1304
|
"alternative reading frame",
|
|
1196
|
-
"amino acid mismatch",
|
|
1197
1305
|
"unknown",
|
|
1198
1306
|
]
|
|
1199
|
-
colors = ["#263b81", "#0578a6", "#64cdf6", "#d71f26", "#
|
|
1307
|
+
colors = ["#263b81", "#0578a6", "#64cdf6", "#d71f26", "#e5e5e5"]
|
|
1200
1308
|
sizes_all = [counts[k] for k in category_keys]
|
|
1201
1309
|
nonzero_indices = [i for i, s in enumerate(sizes_all) if s > 0]
|
|
1202
1310
|
|
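Removing `aminoAcidMismatch` means `category_keys`, `legend_labels`, and `colors` must all shrink together, because the pie chart indexes them in parallel. The sketch below shows that filtering step under stated assumptions: the first entry of each list (`canonical`) is not visible in the hunk and is assumed, the counts are the README example values, and the helper name `nonzero_slices` is hypothetical.

```python
# First entry of each list is assumed; the rest is visible in the hunk.
category_keys = [
    "canonical",
    "alternativeSplicing",
    "neoantigen",
    "alternativeReadingFrame",
    "unknown",
]
legend_labels = [
    "canonical",
    "alternative splicing",
    "neoantigen",
    "alternative reading frame",
    "unknown",
]
colors = ["#263b81", "#0578a6", "#64cdf6", "#d71f26", "#e5e5e5"]


def nonzero_slices(counts):
    """Return (label, size, color) for every category with a positive count."""
    if not (len(category_keys) == len(legend_labels) == len(colors)):
        raise ValueError("category_keys, legend_labels and colors must align")
    sizes_all = [counts[k] for k in category_keys]
    nonzero_indices = [i for i, s in enumerate(sizes_all) if s > 0]
    return [(legend_labels[i], sizes_all[i], colors[i]) for i in nonzero_indices]


example_counts = {
    "canonical": 123,
    "alternativeSplicing": 45,
    "neoantigen": 0,  # set to zero here to show the filtering
    "alternativeReadingFrame": 32,
    "unknown": 83,
}
for label, size, color in nonzero_slices(example_counts):
    print(label, size, color)
```

The explicit length check is a design choice for the sketch: it fails loudly if one list is edited without the others.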
|
@@ -1230,6 +1338,6 @@ def classify_peptides(reference,
|
|
|
1230
1338
|
fig.savefig(pie_path, format="pdf", dpi=1200, bbox_inches='tight')
|
|
1231
1339
|
plt.close(fig)
|
|
1232
1340
|
|
|
1341
|
+
report_step_done() # 8
|
|
1233
1342
|
report_step_done() # 9
|
|
1234
|
-
report_step_done() # 10
|
|
1235
1343
|
|
{darkprofiler-0.2.2 → darkprofiler-0.2.4/src/darkprofiler.egg-info}/PKG-INFO
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.1
|
|
2
2
|
Name: darkprofiler
|
|
3
|
-
Version: 0.2.
|
|
3
|
+
Version: 0.2.4
|
|
4
4
|
Summary: DarkProfiler: Alignment and Classification of Peptides from Reference-Independent De Novo Peptide Sequencing Experiments.
|
|
5
5
|
Author-email: Hanjun Lee <hanjun@alum.mit.edu>
|
|
6
6
|
License: MIT
|
|
@@ -30,7 +30,6 @@ DarkProfiler takes peptide sequences (e.g., from reference‑independent de novo
|
|
|
30
30
|
- **Alternative splicing**
|
|
31
31
|
- **Neoantigens (SNV‑derived mutanome)**
|
|
32
32
|
- **Alternative reading frame peptides**
|
|
33
|
-
- **Amino acid mismatch**
|
|
34
33
|
- **Unknown / unaligned**
|
|
35
34
|
|
|
36
35
|
DarkProfiler is intended to be the *post‑processing / annotation* step after de novo peptide calling or customized peptide discovery in proteomics or immunopeptidomics experiments.
|
|
@@ -206,8 +205,8 @@ Requirements and recommendations:
|
|
|
206
205
|
- Empty sequences are silently ignored.
|
|
207
206
|
- There is no hard limit on peptide length, but short peptides may match many locations and very long peptides may be rare.
|
|
208
207
|
|
|
209
|
-
A peptide sequence is assigned to **at most one output category
|
|
210
|
-
(canonical → alternative splicing → neoantigen → alternative reading frame →
|
|
208
|
+
A peptide sequence is assigned to **at most one output category** within a given Hamming distance, corresponding to the first category that matches in the pipeline
|
|
209
|
+
(canonical → alternative splicing → neoantigen → alternative reading frame → unknown).
|
|
211
210
|
|
|
212
211
|
### VCF with SNVs (optional)
|
|
213
212
|
|
|
@@ -244,7 +243,6 @@ The database contains translated and derived proteomes as FASTA files:
|
|
|
244
243
|
- `mutanome.fa`
|
|
245
244
|
- `mutatedCanonicalTranscriptome.fa`
|
|
246
245
|
- `mutatedAlternativeTranslatome.fa`
|
|
247
|
-
- `mutatedAlternativeORFeome.fa`
|
|
248
246
|
|
|
249
247
|
DarkProfiler also creates **persistent fast indices** under the same database directory to accelerate peptide search with Hamming distance:
|
|
250
248
|
for example:
|
|
@@ -349,9 +347,8 @@ classify_peptides(
|
|
|
349
347
|
5. Build alternative splicing proteome (CDS must start with `ATG`) and classify peptides
|
|
350
348
|
6. Apply SNVs, build mutanome (CDS must start with `ATG`) and classify peptides
|
|
351
349
|
7. Build alternative ORFs (3 frames) and classify peptides
|
|
352
|
-
8.
|
|
353
|
-
9.
|
|
354
|
-
10. Finalize
|
|
350
|
+
8. Write unaligned peptides and summary plots
|
|
351
|
+
9. Finalize
|
|
355
352
|
|
|
356
353
|
### Category definitions
|
|
357
354
|
|
|
@@ -361,7 +358,7 @@ classify_peptides(
|
|
|
361
358
|
- **ORF region labels**
|
|
362
359
|
For alternative ORF hits, DarkProfiler labels the peptide start as:
|
|
363
360
|
- `uORF` (upstream of CDS start)
|
|
364
|
-
- `intORF` (inside annotated CDS span)
|
|
361
|
+
- `intORF` (out-of-frame peptides from inside annotated CDS span)
|
|
365
362
|
- `dORF` (downstream of CDS end)
|
|
366
363
|
- `lncRNA` (no CDS annotation)
|
|
367
364
|
|
|
@@ -379,7 +376,6 @@ Each category is represented by a separate FASTA file in `output_dir`:
|
|
|
379
376
|
- `alternativeSplicing.fa`
|
|
380
377
|
- `neoantigen.fa`
|
|
381
378
|
- `alternativeReadingFrame.fa`
|
|
382
|
-
- `aminoAcidMismatch.fa`
|
|
383
379
|
- `unknown.fa`
|
|
384
380
|
|
|
385
381
|
For classification FASTAs (all except `unknown.fa`), each record uses:
|
|
@@ -415,7 +411,6 @@ canonical 123
|
|
|
415
411
|
alternativeSplicing 45
|
|
416
412
|
neoantigen 7
|
|
417
413
|
alternativeReadingFrame 32
|
|
418
|
-
aminoAcidMismatch 10
|
|
419
414
|
unknown 83
|
|
420
415
|
```
|
|
421
416
|
|
|
Files without changes: LICENSE.txt, setup.cfg, src/darkprofiler.egg-info/SOURCES.txt, src/darkprofiler.egg-info/dependency_links.txt, src/darkprofiler.egg-info/entry_points.txt, src/darkprofiler.egg-info/requires.txt, src/darkprofiler.egg-info/top_level.txt