darkprofiler 0.2.2__tar.gz → 0.2.4__tar.gz

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: darkprofiler
-Version: 0.2.2
+Version: 0.2.4
 Summary: DarkProfiler: Alignment and Classification of Peptides from Reference-Independent De Novo Peptide Sequencing Experiments.
 Author-email: Hanjun Lee <hanjun@alum.mit.edu>
 License: MIT
@@ -30,7 +30,6 @@ DarkProfiler takes peptide sequences (e.g., from reference‑independent de novo
 - **Alternative splicing**
 - **Neoantigens (SNV‑derived mutanome)**
 - **Alternative reading frame peptides**
-- **Amino acid mismatch**
 - **Unknown / unaligned**
 
 DarkProfiler is intended to be the *post‑processing / annotation* step after de novo peptide calling or customized peptide discovery in proteomics or immunopeptidomics experiments.
@@ -206,8 +205,8 @@ Requirements and recommendations:
 - Empty sequences are silently ignored.
 - There is no hard limit on peptide length, but short peptides may match many locations and very long peptides may be rare.
 
-A peptide sequence is assigned to **at most one output category**, corresponding to the first category that matches in the pipeline
-(canonical → alternative splicing → neoantigen → alternative reading frame → amino acid mismatch → unknown).
+A peptide sequence is assigned to **at most one output category** within a given Hamming distance, corresponding to the first category that matches in the pipeline
+(canonical → alternative splicing → neoantigen → alternative reading frame → unknown).
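
The first-match rule in the changed README paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not DarkProfiler's implementation: `assign_category`, `hamming`, and `toy_refs` are hypothetical names, and each category is modeled as a short list of whole reference peptides instead of an indexed proteome searched for substrings.

```python
from typing import Dict, List

# Category order as stated in the README text; everything else is illustrative.
PIPELINE_ORDER = [
    "canonical",
    "alternativeSplicing",
    "neoantigen",
    "alternativeReadingFrame",
]


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def assign_category(peptide: str, references: Dict[str, List[str]], k: int = 0) -> str:
    """Return the first category, in pipeline order, with a hit within distance k."""
    for category in PIPELINE_ORDER:
        for ref in references.get(category, []):
            if len(ref) == len(peptide) and hamming(peptide, ref) <= k:
                return category
    return "unknown"


# Hypothetical toy data: one reference peptide per populated category.
toy_refs = {
    "canonical": ["PEPTIDEK"],
    "neoantigen": ["PEPTIDER"],
}

print(assign_category("PEPTIDER", toy_refs, k=0))  # neoantigen
print(assign_category("PEPTIDER", toy_refs, k=1))  # canonical
print(assign_category("AAAAAAAA", toy_refs, k=1))  # unknown
```

The second call shows why the Hamming distance matters for the assignment: with one mismatch allowed, the same peptide is claimed by an earlier category in the pipeline.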
 
 ### VCF with SNVs (optional)
 
@@ -244,7 +243,6 @@ The database contains translated and derived proteomes as FASTA files:
 - `mutanome.fa`
 - `mutatedCanonicalTranscriptome.fa`
 - `mutatedAlternativeTranslatome.fa`
-- `mutatedAlternativeORFeome.fa`
 
 DarkProfiler also creates **persistent fast indices** under the same database directory to accelerate peptide search with Hamming distance:
 for example:
@@ -349,9 +347,8 @@ classify_peptides(
 5. Build alternative splicing proteome (CDS must start with `ATG`) and classify peptides
 6. Apply SNVs, build mutanome (CDS must start with `ATG`) and classify peptides
 7. Build alternative ORFs (3 frames) and classify peptides
-8. Identify amino acid mismatch using Hamming distance (`k` in 0–2)
-9. Write unaligned peptides and summary plots
-10. Finalize
+8. Write unaligned peptides and summary plots
+9. Finalize
 
 ### Category definitions
 
@@ -361,7 +358,7 @@ classify_peptides(
 - **ORF region labels**
   For alternative ORF hits, DarkProfiler labels the peptide start as:
   - `uORF` (upstream of CDS start)
-  - `intORF` (inside annotated CDS span)
+  - `intORF` (out-of-frame peptides from inside annotated CDS span)
   - `dORF` (downstream of CDS end)
   - `lncRNA` (no CDS annotation)
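
The region labels listed above follow from where the peptide's first codon falls relative to the annotated CDS. The sketch below mirrors the branch structure visible in the `run.py` hunks of this diff (`nt0 = frame + aa_pos * 3`, then `lncRNA` / `dORF` / `CDS` / `intORF`). The function name `label_orf_region`, the `uORF` comparison, and the 0-based half-open CDS span are assumptions made for illustration.

```python
from typing import Dict, Tuple


def label_orf_region(tx_id: str, frame: int, aa_pos: int,
                     cds_bounds: Dict[str, Tuple[int, int]]) -> str:
    """Label a peptide start by its position relative to the annotated CDS.

    Assumes 0-based transcript coordinates and a half-open CDS span
    (cds_start inclusive, cds_end exclusive).
    """
    nt0 = frame + aa_pos * 3  # nucleotide offset of the peptide's first codon
    if tx_id not in cds_bounds:
        return "lncRNA"  # transcript has no CDS annotation
    cds_start, cds_end = cds_bounds[tx_id]
    if nt0 < cds_start:
        return "uORF"
    if nt0 >= cds_end:
        return "dORF"
    # Inside the CDS span: in-frame starts are ordinary CDS, the rest are intORF.
    return "CDS" if (nt0 - cds_start) % 3 == 0 else "intORF"


# Hypothetical transcript with a CDS covering nucleotides 30..89.
bounds = {"tx1": (30, 90)}
print(label_orf_region("tx1", 0, 2, bounds))   # uORF
print(label_orf_region("tx1", 0, 10, bounds))  # CDS
print(label_orf_region("tx1", 1, 10, bounds))  # intORF
print(label_orf_region("tx1", 0, 30, bounds))  # dORF
print(label_orf_region("tx2", 0, 0, bounds))   # lncRNA
```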
 
@@ -379,7 +376,6 @@ Each category is represented by a separate FASTA file in `output_dir`:
 - `alternativeSplicing.fa`
 - `neoantigen.fa`
 - `alternativeReadingFrame.fa`
-- `aminoAcidMismatch.fa`
 - `unknown.fa`
 
 For classification FASTAs (all except `unknown.fa`), each record uses:
@@ -415,7 +411,6 @@ canonical 123
 alternativeSplicing 45
 neoantigen 7
 alternativeReadingFrame 32
-aminoAcidMismatch 10
 unknown 83
 ```
 
@@ -12,7 +12,6 @@ DarkProfiler takes peptide sequences (e.g., from reference‑independent de novo
 - **Alternative splicing**
 - **Neoantigens (SNV‑derived mutanome)**
 - **Alternative reading frame peptides**
-- **Amino acid mismatch**
 - **Unknown / unaligned**
 
 DarkProfiler is intended to be the *post‑processing / annotation* step after de novo peptide calling or customized peptide discovery in proteomics or immunopeptidomics experiments.
@@ -188,8 +187,8 @@ Requirements and recommendations:
 - Empty sequences are silently ignored.
 - There is no hard limit on peptide length, but short peptides may match many locations and very long peptides may be rare.
 
-A peptide sequence is assigned to **at most one output category**, corresponding to the first category that matches in the pipeline
-(canonical → alternative splicing → neoantigen → alternative reading frame → amino acid mismatch → unknown).
+A peptide sequence is assigned to **at most one output category** within a given Hamming distance, corresponding to the first category that matches in the pipeline
+(canonical → alternative splicing → neoantigen → alternative reading frame → unknown).
 
 ### VCF with SNVs (optional)
 
@@ -226,7 +225,6 @@ The database contains translated and derived proteomes as FASTA files:
 - `mutanome.fa`
 - `mutatedCanonicalTranscriptome.fa`
 - `mutatedAlternativeTranslatome.fa`
-- `mutatedAlternativeORFeome.fa`
 
 DarkProfiler also creates **persistent fast indices** under the same database directory to accelerate peptide search with Hamming distance:
 for example:
@@ -331,9 +329,8 @@ classify_peptides(
 5. Build alternative splicing proteome (CDS must start with `ATG`) and classify peptides
 6. Apply SNVs, build mutanome (CDS must start with `ATG`) and classify peptides
 7. Build alternative ORFs (3 frames) and classify peptides
-8. Identify amino acid mismatch using Hamming distance (`k` in 0–2)
-9. Write unaligned peptides and summary plots
-10. Finalize
+8. Write unaligned peptides and summary plots
+9. Finalize
 
 ### Category definitions
 
@@ -343,7 +340,7 @@ classify_peptides(
 - **ORF region labels**
   For alternative ORF hits, DarkProfiler labels the peptide start as:
   - `uORF` (upstream of CDS start)
-  - `intORF` (inside annotated CDS span)
+  - `intORF` (out-of-frame peptides from inside annotated CDS span)
   - `dORF` (downstream of CDS end)
   - `lncRNA` (no CDS annotation)
 
@@ -361,7 +358,6 @@ Each category is represented by a separate FASTA file in `output_dir`:
 - `alternativeSplicing.fa`
 - `neoantigen.fa`
 - `alternativeReadingFrame.fa`
-- `aminoAcidMismatch.fa`
 - `unknown.fa`
 
 For classification FASTAs (all except `unknown.fa`), each record uses:
@@ -397,7 +393,6 @@ canonical 123
 alternativeSplicing 45
 neoantigen 7
 alternativeReadingFrame 32
-aminoAcidMismatch 10
 unknown 83
 ```
 
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "darkprofiler"
-version = "0.2.2"
+version = "0.2.4"
 description = "DarkProfiler: Alignment and Classification of Peptides from Reference-Independent De Novo Peptide Sequencing Experiments."
 readme = "README.md"
 requires-python = ">=3.7"
@@ -2,5 +2,5 @@ from .run import classify_peptides
 
 __all__ = ["classify_peptides"]
 
-__version__ = "0.2.2"
+__version__ = "0.2.4"
 
@@ -121,7 +121,7 @@ def build_parser() -> argparse.ArgumentParser:
             "Optional path to existing database directory containing "
             "canonicalProteome.fa, alternativeSplicing.fa, mutanome.fa, "
             "mutatedCanonicalTranscriptome.fa, mutatedAlternativeTranslatome.fa, "
-            "mutatedAlternativeORFeome.fa."
+            "and other index files"
         ),
     )
     p_run.add_argument(
@@ -406,11 +406,6 @@ def classify_peptides(reference,
     and reuses them on subsequent runs (no re-indexing needed).
     - The query side uses the index for k in {0,1,2} and peptide length in [index_L_min,index_L_max].
 
-    Other requirements already implemented in prior version:
-    - Filter transcripts whose CDS doesn't start with ATG
-    - Report transcript nt coordinate and uORF/intORF/dORF/lncRNA/CDS
-    - Rename aminoAcidMisincorporation -> aminoAcidMismatch
-    """
     steps = [
         "Filter VCF to exome",
         "Setup and load transcriptome/CDS/knownCanonical",
@@ -419,7 +414,6 @@ def classify_peptides(reference,
         "Generate alternative splicing proteome and classify peptides",
         "Apply SNVs, generate mutanome and classify neoantigens",
         "Generate alternative ORFs and classify peptides",
-        "Identify amino acid mismatches",
         "Write unaligned peptides and pie chart",
         "Finalize",
     ]
@@ -481,7 +475,6 @@ def classify_peptides(reference,
 
     required_db_files = [
         "alternativeSplicing.fa",
-        "mutatedAlternativeORFeome.fa",
         "canonicalProteome.fa",
         "mutatedAlternativeTranslatome.fa",
         "mutanome.fa",
@@ -1006,31 +999,169 @@ def classify_peptides(reference,
 
     report_step_done() # 5
 
-    # ----------------------------- mutated antigens (mutanome) -------------------
-    # NOTE: keep your original SNV application as-is (omitted here for brevity),
-    # but ensure translation uses translate_cds_with_map (ATG filter) and then build index.
-    #
-    # If you already have your SNV block, set:
-    # mutated_canonical_transcripts = load_transcriptome(mutatedCanonicalTranscriptome.fa)
-    # mutanome_proteins, mutanome_aa2nt = translate_cds_with_map(mutated_canonical_transcripts, cds_map)
-    # write mutanome.fa
-    #
+    # ---------- build mutanome.fa + mutatedCanonicalTranscriptome.fa ----------
     mutated_canonical_tx_fa = os.path.join(database_dir, "mutatedCanonicalTranscriptome.fa")
     mutanome_fa = os.path.join(database_dir, "mutanome.fa")
     mutanome_idx_dir = os.path.join(database_dir, "mutanome.idx")
 
-    # ---------- BEGIN: your existing SNV block should fill these ----------
-    # For runs reusing DB, mutated_canonical_tx_fa and mutanome_fa should already exist.
-    mutated_canonical_transcripts = {}
-    if os.path.exists(mutated_canonical_tx_fa):
-        mutated_canonical_transcripts = load_transcriptome(mutated_canonical_tx_fa)
+    def _parse_gff_attributes(attr_str):
+        attrs = {}
+        for item in attr_str.strip().split(";"):
+            item = item.strip()
+            if not item:
+                continue
+            if "=" in item:
+                k, v = item.split("=", 1)
+                attrs[k] = v.strip().strip('"')
+            else:
+                parts = item.split()
+                if len(parts) >= 2:
+                    attrs[parts[0]] = parts[1].strip('"')
+        return attrs
+
+    # Build transcript exon model from GFF (exon features only)
+    transcript_exons = defaultdict(list) # tx_id -> list[(start,end)] 1-based closed
+    transcript_strand = {} # tx_id -> '+'/'-'
+    transcript_chrom = {} # tx_id -> chrom (normalized)
+
+    with open(gff_path) as fh:
+        for line in fh:
+            if not line.strip() or line.startswith("#"):
+                continue
+            fields = line.rstrip("\n").split("\t")
+            if len(fields) < 9:
+                continue
+            chrom, source, feature, start, end, score, strand, frame, attrs_str = fields
+            if feature.lower() != "exon":
+                continue
+            try:
+                start_i = int(start)
+                end_i = int(end)
+            except ValueError:
+                continue
+            attrs = _parse_gff_attributes(attrs_str)
+            tx_id = attrs.get("transcript_id") or attrs.get("transcriptId")
+            if tx_id is None:
+                continue
+            tx_id = normalize_gff_tx_id(tx_id)
+            transcript_exons[tx_id].append((start_i, end_i))
+            transcript_strand[tx_id] = strand
+            transcript_chrom[tx_id] = normalize_chrom(chrom)
+
+    # Sort exons ascending by genomic start
+    for tx_id in list(transcript_exons.keys()):
+        transcript_exons[tx_id].sort(key=lambda x: x[0])
+
+    # Exon caches for coordinate mapping
+    exon_order_cache = {} # tx_id -> (exons_sorted, exons_desc)
+    for tx_id, exons_sorted in transcript_exons.items():
+        exon_order_cache[tx_id] = (exons_sorted, list(reversed(exons_sorted)))
+
+    # Index SNVs by chromosome for speed
+    snvs_by_chrom = defaultdict(list) # chrom -> list[(pos, ref, alt)]
+    for chrom, pos, ref, alt in snvs:
+        snvs_by_chrom[chrom].append((int(pos), ref, alt))
+    for chrom in snvs_by_chrom:
+        snvs_by_chrom[chrom].sort(key=lambda x: x[0])
+
+    # Simple base complement
+    _complement = {"A":"T","T":"A","C":"G","G":"C","a":"t","t":"a","c":"g","g":"c"}
+    def _complement_base(b):
+        return _complement.get(b, b)
+
+    def _apply_snvs_to_one_transcript(tx_id):
+        """
+        Returns: (tx_id, mutated_seq_string) OR (tx_id, None) if transcript not present.
+        If no SNVs (or no mapping), returns original transcript sequence.
+        """
+        if tx_id not in transcriptome:
+            return tx_id, None
+
+        seq = transcriptome[tx_id]
+        seq_list = list(seq)
+
+        chrom = transcript_chrom.get(tx_id)
+        if chrom is None or chrom not in snvs_by_chrom:
+            return tx_id, seq
+
+        if tx_id not in exon_order_cache:
+            return tx_id, seq
+
+        exons_sorted, exons_desc = exon_order_cache[tx_id]
+        strand = transcript_strand.get(tx_id, "+")
+
+        for pos, ref, alt in snvs_by_chrom[chrom]:
+            if strand == "+":
+                offset = 0
+                within = False
+                for s, e in exons_sorted:
+                    if pos < s:
+                        break
+                    if pos > e:
+                        offset += (e - s + 1)
+                    else:
+                        offset += (pos - s)
+                        within = True
+                        break
+                if not within:
+                    continue
+                tx_index = offset
+                expected_ref = ref.upper()
+                alt_base = alt.upper()
+            else:
+                offset = 0
+                within = False
+                for s, e in exons_desc:
+                    if pos > e:
+                        continue
+                    if pos < s:
+                        offset += (e - s + 1)
+                    else:
+                        offset += (e - pos)
+                        within = True
+                        break
+                if not within:
+                    continue
+                tx_index = offset
+                expected_ref = _complement_base(ref.upper())
+                alt_base = _complement_base(alt.upper())
+
+            if 0 <= tx_index < len(seq_list):
+                if expected_ref and seq_list[tx_index].upper() != expected_ref:
+                    continue
+                seq_list[tx_index] = alt_base
+
+        return tx_id, "".join(seq_list)
+
+    # Build mutated canonical transcriptome
+    if build_database:
+        canonical_tx_list = [tx for tx in canonical_tx_ids if tx in transcriptome]
+
+        mutated_canonical_tx_dict = {}
+        if num_threads and num_threads > 1:
+            with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as ex:
+                for tx_id, mutseq in ex.map(_apply_snvs_to_one_transcript, canonical_tx_list, chunksize=50):
+                    if mutseq is not None:
+                        mutated_canonical_tx_dict[tx_id] = mutseq
+        else:
+            for tx_id in canonical_tx_list:
+                tx_id2, mutseq = _apply_snvs_to_one_transcript(tx_id)
+                if mutseq is not None:
+                    mutated_canonical_tx_dict[tx_id2] = mutseq
+
+        with open(mutated_canonical_tx_fa, "w") as out:
+            for tx_id, nt_seq in mutated_canonical_tx_dict.items():
+                out.write(f">{tx_id}\n{nt_seq}\n")
+
+        mutated_canonical_transcripts = mutated_canonical_tx_dict
     else:
-        # If you keep your original code, it will write this file when build_database=True.
-        mutated_canonical_transcripts = {} # placeholder
-    # ---------- END: your existing SNV block should fill these ----------
+        mutated_canonical_transcripts = load_transcriptome(mutated_canonical_tx_fa)
 
-    # Translate mutanome with ATG filter + mapping (even if file was precomputed)
-    mutanome_proteins, mutanome_aa2nt = translate_cds_with_map(mutated_canonical_transcripts, cds_map)
+    # Translate mutanome (CDS only, ATG filtered)
+    mutanome_proteins, mutanome_aa2nt = translate_cds_with_map(
+        mutated_canonical_transcripts,
+        cds_map
+    )
 
     if build_database:
         with open(mutanome_fa, "w") as out:
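
The exon walk added in `_apply_snvs_to_one_transcript` converts a 1-based genomic SNV position into a 0-based index in the spliced transcript, reading exons in ascending order on the `+` strand and in descending order on the `-` strand. A compact standalone sketch of that mapping is shown here. The helper name `genomic_to_transcript_index` and the example exons are hypothetical, and the sketch assumes non-overlapping exons given as 1-based closed intervals.

```python
from typing import List, Optional, Tuple


def genomic_to_transcript_index(pos: int, exons: List[Tuple[int, int]],
                                strand: str) -> Optional[int]:
    """Map a 1-based genomic position to a 0-based spliced-transcript index.

    Returns None when pos falls outside every exon (intron or intergenic).
    """
    minus = strand == "-"
    # Transcript order: ascending genomic start on '+', descending on '-'.
    ordered = sorted(exons, reverse=minus)
    offset = 0
    for start, end in ordered:
        if start <= pos <= end:
            # Distance from the exon's transcript-side 5' edge.
            return offset + (end - pos if minus else pos - start)
        offset += end - start + 1  # whole exon lies 5' of pos in transcript order
    return None


# Hypothetical two-exon transcript: 10 nt + 10 nt.
exons = [(100, 109), (200, 209)]
print(genomic_to_transcript_index(205, exons, "+"))  # 15
print(genomic_to_transcript_index(150, exons, "+"))  # None (intronic)
print(genomic_to_transcript_index(209, exons, "-"))  # 0
print(genomic_to_transcript_index(100, exons, "-"))  # 19
```

On the `-` strand the diff also complements both the reference and alternate base before comparing and substituting, which is sufficient for single-nucleotide variants because the reverse complement of a single base is just its complement.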
@@ -1063,8 +1194,7 @@ def classify_peptides(reference,
 
     # -------------------------- alternative ORFs (3 frames) ----------------------
     alt_orf_translatome_fa = os.path.join(database_dir, "mutatedAlternativeTranslatome.fa")
-    alt_orf_orfeome_fa = os.path.join(database_dir, "mutatedAlternativeORFeome.fa")
-    alt_orf_idx_dir = os.path.join(database_dir, "mutatedAlternativeORFeome.idx")
+    alt_orf_idx_dir = os.path.join(database_dir, "mutatedAlternativeTranslatome.idx")
 
     if build_database:
         alt_orf_records = {}
@@ -1082,14 +1212,11 @@ def classify_peptides(reference,
         with open(alt_orf_translatome_fa, "w") as out:
             for rid, aa_seq in alt_orf_records.items():
                 out.write(f">{rid}\n{aa_seq}\n")
-        with open(alt_orf_orfeome_fa, "w") as out:
-            for rid, aa_seq in alt_orf_records.items():
-                out.write(f">{rid}\n{aa_seq}\n")
 
-        # Build index for ORFeome (no aa2nt needed)
+        # Build index for alt ORF translatome (no aa2nt needed)
         if build_fast_index:
             build_proteome_index(
-                alt_orf_orfeome_fa,
+                alt_orf_translatome_fa,
                 alt_orf_idx_dir,
                 L_min=index_L_min,
                 L_max=index_L_max,
@@ -1106,13 +1233,13 @@ def classify_peptides(reference,
             frame = int(frame_str)
         except ValueError:
             return None, None, None, None
-
+
         nt0 = frame + aa_pos * 3
         if tx_id not in transcriptome:
             return None, None, None, None
         if nt0 < 0 or nt0 >= len(transcriptome[tx_id]):
             return None, None, None, None
-
+
         if tx_id not in cds_bounds:
             region = "lncRNA"
         else:
@@ -1122,7 +1249,6 @@ def classify_peptides(reference,
             elif nt0 >= cds_end:
                 region = "dORF"
             else:
-                # inside CDS bounds: decide by frame
                 if ((nt0 - cds_start) % 3) == 0:
                     region = "CDS"
                 else:
@@ -1144,21 +1270,6 @@ def classify_peptides(reference,
 
     report_step_done() # 7
 
-    # -------------------------- amino acid mismatch ------------------------------
-    mismatch_hit_records, peptides_remaining = classify_with_index(
-        peptides_remaining,
-        alt_orf_orfeome_fa,
-        alt_orf_idx_dir,
-        coord_resolver=coord_resolver_altorf,
-        step_label="amino acid mismatch search",
-        k=hamming_distance,
-        require_aa2nt=False
-    )
-    if mismatch_hit_records:
-        write_peptide_hits_fasta(mismatch_hit_records, os.path.join(output_dir, "aminoAcidMismatch.fa"))
-
-    report_step_done() # 8
-
     # -------------------------- unknown -----------------------------------------
     if peptides_remaining:
         with open(os.path.join(output_dir, "unknown.fa"), "w") as out:
@@ -1171,7 +1282,6 @@ def classify_peptides(reference,
     counts["alternativeSplicing"] = len(alt_splicing_hit_records)
     counts["neoantigen"] = len(neoantigen_hit_records)
     counts["alternativeReadingFrame"] = len(alternative_orf_hit_records)
-    counts["aminoAcidMismatch"] = len(mismatch_hit_records)
     counts["unknown"] = len(peptides_remaining)
 
     tsv_path = os.path.join(output_dir, "pieChart.tsv")
@@ -1185,7 +1295,6 @@ def classify_peptides(reference,
         "alternativeSplicing",
         "neoantigen",
         "alternativeReadingFrame",
-        "aminoAcidMismatch",
         "unknown",
     ]
     legend_labels = [
@@ -1193,10 +1302,9 @@ def classify_peptides(reference,
         "alternative splicing",
         "neoantigen",
         "alternative reading frame",
-        "amino acid mismatch",
         "unknown",
     ]
-    colors = ["#263b81", "#0578a6", "#64cdf6", "#d71f26", "#f493a9", "#e5e5e5"]
+    colors = ["#263b81", "#0578a6", "#64cdf6", "#d71f26", "#e5e5e5"]
     sizes_all = [counts[k] for k in category_keys]
     nonzero_indices = [i for i, s in enumerate(sizes_all) if s > 0]
 
@@ -1230,6 +1338,6 @@ def classify_peptides(reference,
     fig.savefig(pie_path, format="pdf", dpi=1200, bbox_inches='tight')
     plt.close(fig)
 
+    report_step_done() # 8
     report_step_done() # 9
-    report_step_done() # 10
 
File without changes
File without changes