XspecT 0.5.4__tar.gz → 0.6.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (129) hide show
  1. {xspect-0.5.4 → xspect-0.6.0}/PKG-INFO +4 -1
  2. {xspect-0.5.4 → xspect-0.6.0}/docs/benchmark.md +5 -5
  3. {xspect-0.5.4 → xspect-0.6.0}/docs/cli.md +2 -0
  4. {xspect-0.5.4 → xspect-0.6.0}/pyproject.toml +5 -2
  5. {xspect-0.5.4 → xspect-0.6.0}/scripts/benchmark/main.nf +128 -53
  6. {xspect-0.5.4 → xspect-0.6.0}/src/XspecT.egg-info/PKG-INFO +4 -1
  7. {xspect-0.5.4 → xspect-0.6.0}/src/XspecT.egg-info/SOURCES.txt +5 -0
  8. {xspect-0.5.4 → xspect-0.6.0}/src/XspecT.egg-info/requires.txt +3 -0
  9. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/classify.py +8 -1
  10. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/definitions.py +19 -0
  11. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/main.py +64 -3
  12. xspect-0.6.0/src/xspect/misclassification_detection/mapping.py +168 -0
  13. xspect-0.6.0/src/xspect/misclassification_detection/point_pattern_analysis.py +102 -0
  14. xspect-0.6.0/src/xspect/misclassification_detection/simulate_reads.py +55 -0
  15. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/models/probabilistic_filter_model.py +122 -4
  16. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/models/probabilistic_filter_svm_model.py +7 -7
  17. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/models/result.py +2 -0
  18. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/ncbi.py +82 -7
  19. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/train.py +21 -4
  20. xspect-0.6.0/tests/__init__.py +0 -0
  21. {xspect-0.5.4 → xspect-0.6.0}/tests/test_cli.py +3 -1
  22. xspect-0.6.0/tests/test_misclassification_detection.py +92 -0
  23. {xspect-0.5.4 → xspect-0.6.0}/tests/test_ncbi.py +53 -3
  24. {xspect-0.5.4 → xspect-0.6.0}/.github/workflows/black.yml +0 -0
  25. {xspect-0.5.4 → xspect-0.6.0}/.github/workflows/docs.yml +0 -0
  26. {xspect-0.5.4 → xspect-0.6.0}/.github/workflows/pylint.yml +0 -0
  27. {xspect-0.5.4 → xspect-0.6.0}/.github/workflows/pypi.yml +0 -0
  28. {xspect-0.5.4 → xspect-0.6.0}/.github/workflows/test.yml +0 -0
  29. {xspect-0.5.4 → xspect-0.6.0}/.gitignore +0 -0
  30. {xspect-0.5.4 → xspect-0.6.0}/LICENSE +0 -0
  31. {xspect-0.5.4 → xspect-0.6.0}/README.md +0 -0
  32. {xspect-0.5.4 → xspect-0.6.0}/docs/contributing.md +0 -0
  33. {xspect-0.5.4 → xspect-0.6.0}/docs/index.md +0 -0
  34. {xspect-0.5.4 → xspect-0.6.0}/docs/quickstart.md +0 -0
  35. {xspect-0.5.4 → xspect-0.6.0}/docs/understanding.md +0 -0
  36. {xspect-0.5.4 → xspect-0.6.0}/docs/web.md +0 -0
  37. {xspect-0.5.4 → xspect-0.6.0}/mkdocs.yml +0 -0
  38. {xspect-0.5.4 → xspect-0.6.0}/scripts/benchmark/classify/main.nf +0 -0
  39. {xspect-0.5.4 → xspect-0.6.0}/scripts/benchmark/environment.yml +0 -0
  40. {xspect-0.5.4 → xspect-0.6.0}/scripts/benchmark/nextflow.config +0 -0
  41. {xspect-0.5.4 → xspect-0.6.0}/scripts/benchmark-data/download_data.slurm +0 -0
  42. {xspect-0.5.4 → xspect-0.6.0}/setup.cfg +0 -0
  43. {xspect-0.5.4 → xspect-0.6.0}/src/XspecT.egg-info/dependency_links.txt +0 -0
  44. {xspect-0.5.4 → xspect-0.6.0}/src/XspecT.egg-info/entry_points.txt +0 -0
  45. {xspect-0.5.4 → xspect-0.6.0}/src/XspecT.egg-info/top_level.txt +0 -0
  46. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/__init__.py +0 -0
  47. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/download_models.py +0 -0
  48. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/file_io.py +0 -0
  49. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/filter_sequences.py +0 -0
  50. {xspect-0.5.4/src/xspect/mlst_feature → xspect-0.6.0/src/xspect/misclassification_detection}/__init__.py +0 -0
  51. {xspect-0.5.4/src/xspect/models → xspect-0.6.0/src/xspect/mlst_feature}/__init__.py +0 -0
  52. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/mlst_feature/mlst_helper.py +0 -0
  53. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/mlst_feature/pub_mlst_handler.py +0 -0
  54. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/model_management.py +0 -0
  55. {xspect-0.5.4/tests → xspect-0.6.0/src/xspect/models}/__init__.py +0 -0
  56. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/models/probabilistic_filter_mlst_model.py +0 -0
  57. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/models/probabilistic_single_filter_model.py +0 -0
  58. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/web.py +0 -0
  59. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/.gitignore +0 -0
  60. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/README.md +0 -0
  61. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/components.json +0 -0
  62. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/dist/assets/index-Ceo58xui.css +0 -0
  63. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/dist/assets/index-Dt_UlbgE.js +0 -0
  64. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/dist/index.html +0 -0
  65. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/dist/vite.svg +0 -0
  66. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/eslint.config.js +0 -0
  67. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/index.html +0 -0
  68. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/package-lock.json +0 -0
  69. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/package.json +0 -0
  70. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/pnpm-lock.yaml +0 -0
  71. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/public/vite.svg +0 -0
  72. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/App.tsx +0 -0
  73. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/api.tsx +0 -0
  74. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/assets/react.svg +0 -0
  75. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/classification-form.tsx +0 -0
  76. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/classify.tsx +0 -0
  77. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/data-table.tsx +0 -0
  78. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/dropdown-checkboxes.tsx +0 -0
  79. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/dropdown-slider.tsx +0 -0
  80. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/filter-form.tsx +0 -0
  81. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/filter.tsx +0 -0
  82. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/filtering-result.tsx +0 -0
  83. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/header.tsx +0 -0
  84. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/landing.tsx +0 -0
  85. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/models-details.tsx +0 -0
  86. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/models.tsx +0 -0
  87. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/result-chart.tsx +0 -0
  88. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/result.tsx +0 -0
  89. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/spinner.tsx +0 -0
  90. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/accordion.tsx +0 -0
  91. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/button.tsx +0 -0
  92. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/card.tsx +0 -0
  93. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/chart.tsx +0 -0
  94. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/command.tsx +0 -0
  95. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/dialog.tsx +0 -0
  96. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/dropdown-menu.tsx +0 -0
  97. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/file-upload.tsx +0 -0
  98. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/form.tsx +0 -0
  99. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/input.tsx +0 -0
  100. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/label.tsx +0 -0
  101. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/navigation-menu.tsx +0 -0
  102. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/popover.tsx +0 -0
  103. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/select.tsx +0 -0
  104. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/separator.tsx +0 -0
  105. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/slider.tsx +0 -0
  106. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/switch.tsx +0 -0
  107. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/table.tsx +0 -0
  108. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/components/ui/tabs.tsx +0 -0
  109. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/index.css +0 -0
  110. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/lib/utils.ts +0 -0
  111. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/main.tsx +0 -0
  112. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/types.tsx +0 -0
  113. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/utils.tsx +0 -0
  114. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/src/vite-env.d.ts +0 -0
  115. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/tsconfig.app.json +0 -0
  116. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/tsconfig.json +0 -0
  117. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/tsconfig.node.json +0 -0
  118. {xspect-0.5.4 → xspect-0.6.0}/src/xspect/xspect-web/vite.config.ts +0 -0
  119. {xspect-0.5.4 → xspect-0.6.0}/tests/conftest.py +0 -0
  120. {xspect-0.5.4 → xspect-0.6.0}/tests/test_file_io.py +0 -0
  121. {xspect-0.5.4 → xspect-0.6.0}/tests/test_model_management.py +0 -0
  122. {xspect-0.5.4 → xspect-0.6.0}/tests/test_model_result.py +0 -0
  123. {xspect-0.5.4 → xspect-0.6.0}/tests/test_probabilisitc_filter_mlst_model.py +0 -0
  124. {xspect-0.5.4 → xspect-0.6.0}/tests/test_probabilistic_filter_model.py +0 -0
  125. {xspect-0.5.4 → xspect-0.6.0}/tests/test_probabilistic_filter_svm_model.py +0 -0
  126. {xspect-0.5.4 → xspect-0.6.0}/tests/test_probabilistic_single_filter_model.py +0 -0
  127. {xspect-0.5.4 → xspect-0.6.0}/tests/test_pub_mlst_handler.py +0 -0
  128. {xspect-0.5.4 → xspect-0.6.0}/tests/test_train.py +0 -0
  129. {xspect-0.5.4 → xspect-0.6.0}/tests/test_web.py +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: XspecT
3
- Version: 0.5.4
3
+ Version: 0.6.0
4
4
  Summary: Tool to monitor and characterize pathogens using Bloom filters.
5
5
  License: MIT License
6
6
 
@@ -45,6 +45,9 @@ Requires-Dist: xxhash
45
45
  Requires-Dist: fastapi
46
46
  Requires-Dist: uvicorn
47
47
  Requires-Dist: python-multipart
48
+ Requires-Dist: mappy
49
+ Requires-Dist: pysam
50
+ Requires-Dist: numpy
48
51
  Provides-Extra: docs
49
52
  Requires-Dist: mkdocs-material; extra == "docs"
50
53
  Requires-Dist: mkdocs-include-markdown-plugin; extra == "docs"
@@ -6,12 +6,12 @@ The benchmark was performed by first download all available Acinetobacter genome
6
6
 
7
7
  ## Benchmark Results
8
8
 
9
- The benchmark results show that XspecT achieves high classification accuracy, with an overall accuracy of 99.94% for whole genomes and 87.11% for simulated reads.
9
+ The benchmark results show that XspecT achieves high classification accuracy, with an overall accuracy of nearly 100% for whole genomes and 82% for simulated reads. However, the low macro-average F1 score (0.41) for the read dataset highlights a substantial class imbalance.
10
10
 
11
- | Category | Total | Matches | Mismatches | Match Rate | Mismatch Rate |
12
- |-------------------|----------|----------|------------|------------|---------------|
13
- | Assemblies | 44,905 | 44,879 | 26 | 99.94% | 0.06% |
14
- | Simulated reads | 9,000,000| 7,839,877| 1,160,123 | 87.11% | 12.89% |
11
+ | Dataset | Total Samples | Matches | Mismatches | Match Rate | Mismatch Rate | Accuracy | Macro Avg F1 | Weighted Avg F1 |
12
+ |-----------|--------------:|----------:|-----------:|-----------:|--------------:|---------:|-------------:|----------------:|
13
+ | Assembly | 44,905 | 44,879 | 26 | 99.94% | 0.06% | ≈1.00 | 0.95 | 1.00 |
14
+ | Reads | 9,200,000 | 7,526,902 | 1,673,098 | 81.81% | 18.19% | 0.82 | 0.41 | 0.87 |
15
15
 
16
16
  ## Running the benchmark yourself
17
17
 
@@ -43,6 +43,8 @@ To train a model with NCBI data, run the following command:
43
43
  xspect models train ncbi
44
44
  ```
45
45
 
46
+ By default, XspecT filters out NCBI accessions that do not meet minimum N50 thresholds, have an inconclusive taxonomy check status, or are deemed atypical by NCBI. Furthermore, species with "Candidatus" and "sp." in their species names are filtered out. To disable this filtering behavior, use the respective flag (see `xspect models train ncbi --help`).
47
+
46
48
  If you would like to train models with manually curated data from a directory, you can use:
47
49
 
48
50
  ```bash
@@ -1,6 +1,6 @@
1
1
  [project]
2
2
  name = "XspecT"
3
- version = "0.5.4"
3
+ version = "0.6.0"
4
4
  description = "Tool to monitor and characterize pathogens using Bloom filters."
5
5
  readme = {file = "README.md", content-type = "text/markdown"}
6
6
  license = {file = "LICENSE"}
@@ -18,7 +18,10 @@ dependencies = [
18
18
  "xxhash",
19
19
  "fastapi",
20
20
  "uvicorn",
21
- "python-multipart"
21
+ "python-multipart",
22
+ "mappy",
23
+ "pysam",
24
+ "numpy"
22
25
  ]
23
26
  classifiers = [
24
27
  "Intended Audience :: Developers",
@@ -127,8 +127,8 @@ process createAssemblyTable {
127
127
  }
128
128
 
129
129
  process summarizeClassifications {
130
- conda "jq"
131
- cpus 2
130
+ conda "conda-forge::pandas"
131
+ cpus 4
132
132
  memory '16 GB'
133
133
  publishDir "results"
134
134
 
@@ -141,24 +141,38 @@ process summarizeClassifications {
141
141
 
142
142
  script:
143
143
  """
144
- cp ${assemblies} classifications.tsv
144
+ #!/usr/bin/env python
145
+ import pandas as pd
146
+ import json
147
+ import os
148
+
149
+ df = pd.read_csv('${assemblies}', sep='\\t')
150
+ df['Prediction'] = 'unknown'
151
+
152
+ classifications = '${classifications}'.split()
145
153
 
146
- awk 'BEGIN {FS=OFS="\t"}
147
- NR==1 {print \$0, "Prediction"}
148
- NR>1 {print \$0, "unknown"}' classifications.tsv > temp_classifications.tsv
149
- mv temp_classifications.tsv classifications.tsv
154
+ with open(classifications[0]) as f:
155
+ data = json.load(f)
156
+ keys = data["scores"]["total"]
157
+ for key in keys:
158
+ df[str(key)] = pd.NA
150
159
 
151
- for json_file in ${classifications}; do
152
- basename=\$(basename \$json_file .json)
153
- accession=\$(echo \$basename | cut -d'_' -f1-2)
154
- prediction=\$(jq '.["prediction"]' \$json_file | tr -d '"')
160
+ for json_file in classifications:
161
+ basename = os.path.basename(json_file).replace('.json', '')
162
+ accession = '_'.join(basename.split('_')[:2])
155
163
 
156
- awk -v acc="\$accession" -v pred="\$prediction" 'BEGIN {FS=OFS="\t"}
157
- NR==1 {print}
158
- NR>1 && \$1 ~ acc {\$NF=pred; print}
159
- NR>1 && \$1 !~ acc {print}' classifications.tsv > temp_classifications.tsv
160
- mv temp_classifications.tsv classifications.tsv
161
- done
164
+ with open(json_file, 'r') as f:
165
+ data = json.load(f)
166
+ prediction = data.get('prediction', 'unknown')
167
+
168
+ mask = df['Assembly Accession'].str.contains(accession, na=False)
169
+ df.loc[mask, 'Prediction'] = prediction
170
+
171
+ scores = data.get('scores', {}).get('total', {})
172
+ for species_id, score in scores.items():
173
+ df.loc[mask, str(species_id)] = score
174
+
175
+ df.to_csv('classifications.tsv', sep='\\t', index=False)
162
176
  """
163
177
  }
164
178
 
@@ -188,7 +202,10 @@ process selectForReadGen {
188
202
  for id, accession in species_model["training_accessions"].items():
189
203
  training_accessions.extend(accession)
190
204
 
191
- assemblies = assemblies[assemblies['Assembly Level'] == 'Complete Genome']
205
+ assemblies = assemblies[
206
+ (assemblies['Assembly Level'] == 'Complete Genome') |
207
+ (assemblies['Assembly Level'] == 'Chromosome')
208
+ ]
192
209
  assemblies = assemblies[~assemblies['Assembly Accession'].isin(training_accessions)]
193
210
 
194
211
  # use up to three assemblies for each species
@@ -238,8 +255,8 @@ process generateReads {
238
255
  }
239
256
 
240
257
  process summarizeReadClassifications {
241
- conda "conda-forge::jq"
242
- cpus 2
258
+ conda "conda-forge::pandas"
259
+ cpus 4
243
260
  memory '16 GB'
244
261
  publishDir "results"
245
262
 
@@ -252,29 +269,55 @@ process summarizeReadClassifications {
252
269
 
253
270
  script:
254
271
  """
255
- echo -e "Assembly Accession\tRead\tPrediction\tSpecies ID" > read_classifications.tsv
272
+ #!/usr/bin/env python
273
+ import pandas as pd
274
+ import json
275
+ import os
256
276
 
257
- for json_file in ${read_classifications}; do
258
- basename=\$(basename \$json_file .json)
259
- accession=\$(echo \$basename | cut -d'_' -f1-2)
277
+ df_assemblies = pd.read_csv('${read_assemblies}', sep='\\t')
278
+
279
+ # Create a mapping of accession to species ID
280
+ accession_to_species = dict(zip(df_assemblies['Assembly Accession'], df_assemblies['Species ID']))
281
+
282
+ results = []
283
+
284
+ classifications = '${read_classifications}'.split()
285
+ for json_file in classifications:
286
+ basename = os.path.basename(json_file).replace('.json', '')
287
+ accession = '_'.join(basename.split('_')[:2])
260
288
 
261
- # Get species ID from assemblies table
262
- species_id=\$(awk -F'\t' -v acc="\$accession" '\$1 == acc {print \$6}' ${read_assemblies})
289
+ species_id = accession_to_species.get(accession, 'unknown')
263
290
 
264
- # Extract predictions from JSON and append to TSV
265
- jq -r --arg acc "\$accession" --arg species "\$species_id" '
266
- .scores
267
- | to_entries[]
268
- | select(.key != "total")
269
- | "\\(.key)\\t\\(.value | to_entries | max_by(.value) | .key)"
270
- | "\\(\$acc)\\t" + . + "\\t\\(\$species)"
271
- ' "\$json_file" >> read_classifications.tsv
272
- done
291
+ with open(json_file, 'r') as f:
292
+ data = json.load(f)
293
+ scores = data.get('scores', {})
294
+
295
+ for read_name, read_scores in scores.items():
296
+ if read_name != 'total':
297
+ if read_scores:
298
+ max_score = max(read_scores.values())
299
+ max_species = [species for species, score in read_scores.items() if score == max_score]
300
+ prediction = max_species[0] if len(max_species) == 1 else "ambiguous"
301
+
302
+ result = {
303
+ 'Assembly Accession': accession,
304
+ 'Read': read_name,
305
+ 'Prediction': prediction,
306
+ 'Species ID': species_id
307
+ }
308
+
309
+ for species, score in read_scores.items():
310
+ result[species] = score
311
+
312
+ results.append(result)
313
+
314
+ df_results = pd.DataFrame(results)
315
+ df_results.to_csv('read_classifications.tsv', sep='\\t', index=False)
273
316
  """
274
317
  }
275
318
 
276
319
  process calculateStats {
277
- conda "conda-forge::pandas"
320
+ conda "conda-forge::pandas conda-forge::scikit-learn"
278
321
  cpus 2
279
322
  memory '16 GB'
280
323
  publishDir "results"
@@ -290,33 +333,65 @@ process calculateStats {
290
333
  """
291
334
  #!/usr/bin/env python
292
335
  import pandas as pd
336
+ from sklearn.metrics import classification_report
293
337
 
338
+ # --- Assembly ---
294
339
  df_assembly = pd.read_csv('${assembly_classifications}', sep='\\t')
295
340
  df_assembly['Species ID'] = df_assembly['Species ID'].astype(str)
296
341
  df_assembly['Prediction'] = df_assembly['Prediction'].astype(str)
297
- assembly_matches = df_assembly.loc[df_assembly['Species ID'] == df_assembly['Prediction']]
298
- assembly_mismatches = df_assembly.loc[df_assembly['Species ID'] != df_assembly['Prediction']]
299
342
 
343
+ y_true_asm = df_assembly['Species ID']
344
+ y_pred_asm = df_assembly['Prediction']
345
+
346
+ asm_matches = (y_true_asm == y_pred_asm).sum()
347
+ asm_total = len(df_assembly)
348
+
349
+ asm_labels = sorted(set(y_true_asm.unique()).union(set(y_pred_asm.unique())))
350
+ asm_report = classification_report(
351
+ y_true_asm,
352
+ y_pred_asm,
353
+ labels=asm_labels,
354
+ zero_division=0
355
+ )
356
+
357
+ # --- Reads ---
300
358
  df_read = pd.read_csv('${read_classifications}', sep='\\t')
301
359
  df_read['Species ID'] = df_read['Species ID'].astype(str)
302
360
  df_read['Prediction'] = df_read['Prediction'].astype(str)
303
- read_matches = df_read.loc[df_read['Species ID'] == df_read['Prediction']]
304
- read_mismatches = df_read.loc[df_read['Species ID'] != df_read['Prediction']]
305
361
 
362
+ y_true_read = df_read['Species ID']
363
+ y_pred_read = df_read['Prediction']
364
+
365
+ read_matches = (y_true_read == y_pred_read).sum()
366
+ read_total = len(df_read)
367
+
368
+ read_labels = sorted(set(y_true_read.unique()).union(set(y_pred_read.unique())))
369
+ read_report = classification_report(
370
+ y_true_read,
371
+ y_pred_read,
372
+ labels=read_labels,
373
+ zero_division=0
374
+ )
375
+
376
+ # --- Output ---
306
377
  with open('stats.txt', 'w') as f:
307
- f.write(f"Assembly Total: {len(df_assembly)}\\n")
308
- f.write(f"Assembly Matches: {len(assembly_matches)}\\n")
309
- f.write(f"Assembly Mismatches: {len(assembly_mismatches)}\\n")
310
- f.write(f"Assembly Match Rate: {len(assembly_matches) / len(df_assembly) * 100:.2f}%\\n")
311
- f.write(f"Assembly Mismatch Rate: {len(assembly_mismatches) / len(df_assembly) * 100:.2f}%\\n")
312
-
313
- f.write("\\n")
314
-
315
- f.write(f"Read Total: {len(df_read)}\\n")
316
- f.write(f"Read Matches: {len(read_matches)}\\n")
317
- f.write(f"Read Mismatches: {len(read_mismatches)}\\n")
318
- f.write(f"Read Match Rate: {len(read_matches) / len(df_read) * 100:.2f}%\\n")
319
- f.write(f"Read Mismatch Rate: {len(read_mismatches) / len(df_read) * 100:.2f}%\\n")
378
+ f.write("=== Assembly ===\\n")
379
+ f.write(f"Total: {asm_total}\\n")
380
+ f.write(f"Matches: {asm_matches}\\n")
381
+ f.write(f"Mismatches: {asm_total - asm_matches}\\n")
382
+ f.write(f"Match Rate: {asm_matches / asm_total * 100:.2f}%\\n")
383
+ f.write(f"Mismatch Rate: {(asm_total - asm_matches) / asm_total * 100:.2f}%\\n\\n")
384
+ f.write("Classification report (per class):\\n")
385
+ f.write(asm_report + "\\n")
386
+
387
+ f.write("=== Reads ===\\n")
388
+ f.write(f"Total: {read_total}\\n")
389
+ f.write(f"Matches: {read_matches}\\n")
390
+ f.write(f"Mismatches: {read_total - read_matches}\\n")
391
+ f.write(f"Match Rate: {read_matches / read_total * 100:.2f}%\\n")
392
+ f.write(f"Mismatch Rate: {(read_total - read_matches) / read_total * 100:.2f}%\\n\\n")
393
+ f.write("Classification report (per class):\\n")
394
+ f.write(read_report + "\\n")
320
395
  """
321
396
  }
322
397
 
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: XspecT
3
- Version: 0.5.4
3
+ Version: 0.6.0
4
4
  Summary: Tool to monitor and characterize pathogens using Bloom filters.
5
5
  License: MIT License
6
6
 
@@ -45,6 +45,9 @@ Requires-Dist: xxhash
45
45
  Requires-Dist: fastapi
46
46
  Requires-Dist: uvicorn
47
47
  Requires-Dist: python-multipart
48
+ Requires-Dist: mappy
49
+ Requires-Dist: pysam
50
+ Requires-Dist: numpy
48
51
  Provides-Extra: docs
49
52
  Requires-Dist: mkdocs-material; extra == "docs"
50
53
  Requires-Dist: mkdocs-include-markdown-plugin; extra == "docs"
@@ -37,6 +37,10 @@ src/xspect/model_management.py
37
37
  src/xspect/ncbi.py
38
38
  src/xspect/train.py
39
39
  src/xspect/web.py
40
+ src/xspect/misclassification_detection/__init__.py
41
+ src/xspect/misclassification_detection/mapping.py
42
+ src/xspect/misclassification_detection/point_pattern_analysis.py
43
+ src/xspect/misclassification_detection/simulate_reads.py
40
44
  src/xspect/mlst_feature/__init__.py
41
45
  src/xspect/mlst_feature/mlst_helper.py
42
46
  src/xspect/mlst_feature/pub_mlst_handler.py
@@ -110,6 +114,7 @@ tests/__init__.py
110
114
  tests/conftest.py
111
115
  tests/test_cli.py
112
116
  tests/test_file_io.py
117
+ tests/test_misclassification_detection.py
113
118
  tests/test_model_management.py
114
119
  tests/test_model_result.py
115
120
  tests/test_ncbi.py
@@ -11,6 +11,9 @@ xxhash
11
11
  fastapi
12
12
  uvicorn
13
13
  python-multipart
14
+ mappy
15
+ pysam
16
+ numpy
14
17
 
15
18
  [docs]
16
19
  mkdocs-material
@@ -46,6 +46,7 @@ def classify_species(
46
46
  output_path: Path,
47
47
  step: int = 1,
48
48
  display_name: bool = False,
49
+ validation: bool = False,
49
50
  ):
50
51
  """
51
52
  Classify the species of sequences.
@@ -59,6 +60,7 @@ def classify_species(
59
60
  output_path (Path): The path to the output file where results will be saved.
60
61
  step (int): The amount of kmers to be skipped.
61
62
  display_name (bool): Includes a display name for each tax_ID.
63
+ validation (bool): Sorts out misclassified reads.
62
64
  """
63
65
  ProbabilisticFilterSVMModel = import_module(
64
66
  "xspect.models.probabilistic_filter_svm_model"
@@ -69,7 +71,12 @@ def classify_species(
69
71
  input_paths, get_output_path = prepare_input_output_paths(input_path)
70
72
 
71
73
  for idx, current_path in enumerate(input_paths):
72
- result = model.predict(current_path, step=step, display_name=display_name)
74
+ result = model.predict(
75
+ current_path,
76
+ step=step,
77
+ display_name=display_name,
78
+ validation=validation,
79
+ )
73
80
  result.input_source = current_path.name
74
81
  cls_path = get_output_path(idx, output_path)
75
82
  result.save(cls_path)
@@ -89,3 +89,22 @@ def get_xspect_mlst_path() -> Path:
89
89
  mlst_path = get_xspect_root_path() / "mlst"
90
90
  mlst_path.mkdir(exist_ok=True, parents=True)
91
91
  return mlst_path
92
+
93
+
94
+ def get_xspect_misclassification_path() -> Path:
95
+ """
96
+ Notes:
97
+ Developed by Oemer Cetin as part of a Bsc thesis at Goethe University Frankfurt am Main (2025).
98
+ (An Integration of Alignment-Free and Alignment-Based Approaches for Bacterial Taxon Assignment)
99
+
100
+ Return the path to the XspecT Misclassification directory.
101
+
102
+ Returns the path to the XspecT Misclassification directory, which is located within the XspecT data
103
+ directory. If the directory does not exist, it creates the directory.
104
+
105
+ Returns:
106
+ Path: The path to the XspecT Misclassification directory.
107
+ """
108
+ misclassification_path = get_xspect_root_path() / "misclassification"
109
+ misclassification_path.mkdir(exist_ok=True, parents=True)
110
+ return misclassification_path
@@ -87,13 +87,62 @@ def train():
87
87
  help="Email of the author.",
88
88
  default=None,
89
89
  )
90
- def train_ncbi(model_genus, svm_steps, author, author_email):
90
+ @click.option(
91
+ "--min-n50",
92
+ type=int,
93
+ help="Minimum contig N50 to filter the accessions (default: 10000).",
94
+ default=10000,
95
+ )
96
+ @click.option(
97
+ "--include-atypical/--exclude-atypical",
98
+ help="Include or exclude atypical accessions (default: exclude).",
99
+ default=False,
100
+ )
101
+ @click.option(
102
+ "--allow-inconclusive",
103
+ is_flag=True,
104
+ help="Allow the use of accessions with inconclusive taxonomy check status for training.",
105
+ default=False,
106
+ )
107
+ @click.option(
108
+ "--allow-candidatus",
109
+ is_flag=True,
110
+ help="Allow the use of Candidatus species for training.",
111
+ default=False,
112
+ )
113
+ @click.option(
114
+ "--allow-sp",
115
+ is_flag=True,
116
+ help="Allow the use of species with 'sp.' in their names for training.",
117
+ default=False,
118
+ )
119
+ def train_ncbi(
120
+ model_genus,
121
+ svm_steps,
122
+ author,
123
+ author_email,
124
+ min_n50,
125
+ include_atypical,
126
+ allow_inconclusive,
127
+ allow_candidatus,
128
+ allow_sp,
129
+ ):
91
130
  """Train a species and a genus model based on NCBI data."""
92
131
  click.echo(f"Training {model_genus} species and genus metagenome model.")
93
132
  try:
94
133
  train_from_ncbi = import_module("xspect.train").train_from_ncbi
95
134
 
96
- train_from_ncbi(model_genus, svm_steps, author, author_email)
135
+ train_from_ncbi(
136
+ model_genus,
137
+ svm_steps,
138
+ author,
139
+ author_email,
140
+ min_n50=min_n50,
141
+ exclude_atypical=not include_atypical,
142
+ allow_inconclusive=allow_inconclusive,
143
+ allow_candidatus=allow_candidatus,
144
+ allow_sp=allow_sp,
145
+ )
97
146
  except ValueError as e:
98
147
  click.echo(f"Error: {e}")
99
148
  return
@@ -287,8 +336,19 @@ def classify_genus(model_genus, input_path, output_path, sparse_sampling_step):
287
336
  help="Includes the display names next to taxonomy-IDs.",
288
337
  is_flag=True,
289
338
  )
339
+ @click.option(
340
+ "-v",
341
+ "--validation",
342
+ help="Detects misclassification for small reads or contigs.",
343
+ is_flag=True,
344
+ )
290
345
  def classify_species(
291
- model_genus, input_path, output_path, sparse_sampling_step, display_names
346
+ model_genus,
347
+ input_path,
348
+ output_path,
349
+ sparse_sampling_step,
350
+ display_names,
351
+ validation,
292
352
  ):
293
353
  """Classify samples using a species model."""
294
354
  click.echo("Classifying...")
@@ -300,6 +360,7 @@ def classify_species(
300
360
  Path(output_path),
301
361
  sparse_sampling_step,
302
362
  display_names,
363
+ validation,
303
364
  )
304
365
 
305
366
 
@@ -0,0 +1,168 @@
1
+ """
2
+ Mapping handler for the alignment-based misclassification detection.
3
+
4
+ Notes:
5
+ Developed by Oemer Cetin as part of a Bsc thesis at Goethe University Frankfurt am Main (2025).
6
+ (An Integration of Alignment-Free and Alignment-Based Approaches for Bacterial Taxon Assignment)
7
+ """
8
+
9
+ import mappy, pysam, os, csv
10
+ from Bio import SeqIO
11
+ from xspect.definitions import fasta_endings
12
+
13
+ __author__ = "Cetin, Oemer"
14
+
15
+
16
+ class MappingHandler:
17
+ """Handler class for all mapping related procedures."""
18
+
19
+ def __init__(self, ref_genome_path: str, reads_path: str) -> None:
20
+ """
21
+ Initialise the mapping handler.
22
+
23
+ This method sets up the paths to the reference genome and query sequences.
24
+ Additionally, the paths to the output formats (SAM, BAM and TSV) are generated.
25
+
26
+ Args:
27
+ ref_genome_path (str): The path to the reference genome.
28
+ reads_path (str): The path to the query sequences.
29
+ """
30
+ if not os.path.isfile(ref_genome_path):
31
+ raise ValueError("The path to the reference genome does not exist.")
32
+
33
+ if not os.path.isfile(reads_path):
34
+ raise ValueError("The path to the reads does not exist.")
35
+
36
+ if not ref_genome_path.endswith(tuple(fasta_endings)) and reads_path.endswith(
37
+ tuple(fasta_endings)
38
+ ):
39
+ raise ValueError("The files must be FASTA-files!")
40
+
41
+ stem = reads_path.rsplit(".", 1)[0] + "_mapped"
42
+ self.ref_genome_path = ref_genome_path
43
+ self.reads_path = reads_path
44
+ self.sam = stem + ".sam"
45
+ self.bam = stem + ".sorted.bam"
46
+ self.tsv = stem + ".start_coordinates.tsv"
47
+
48
+ def map_reads_onto_reference(self) -> None:
49
+ """
50
+ A method that maps reads against the respective reference genome.
51
+
52
+ This function creates a SAM file via Mappy and converts it into a BAM file.
53
+ """
54
+ # create header (entry = sequences of the reference genome)
55
+ ref_seq = [
56
+ {"SN": rec.id, "LN": len(rec.seq)}
57
+ for rec in SeqIO.parse(self.ref_genome_path, "fasta")
58
+ ]
59
+ header = {"HD": {"VN": "1.0"}, "SQ": ref_seq}
60
+ target_id = {sequence["SN"]: number for number, sequence in enumerate(ref_seq)}
61
+
62
+ reads = list(SeqIO.parse(self.reads_path, "fasta"))
63
+ if not reads:
64
+ raise ValueError("Reads file is empty.")
65
+
66
+ read_length = len(reads[0].seq)
67
+ preset = "map-ont" if read_length > 150 else "sr"
68
+ # create SAM-file
69
+ aln = mappy.Aligner(self.ref_genome_path, preset=preset)
70
+ with pysam.AlignmentFile(self.sam, "w", header=header) as out:
71
+ for read in reads:
72
+ read_seq = str(read.seq)
73
+ for hit in aln.map(read_seq):
74
+ if hit.cigar_str is None:
75
+ continue
76
+ # add soft-clips so CIGAR length == len(read_seq) IMPORTANT!!
77
+ leftS = hit.q_st
78
+ rightS = len(read_seq) - hit.q_en
79
+ cigar = (
80
+ (f"{leftS}S" if leftS > 0 else "")
81
+ + hit.cigar_str
82
+ + (f"{rightS}S" if rightS > 0 else "")
83
+ )
84
+
85
+ mapped_region = pysam.AlignedSegment()
86
+ mapped_region.query_name = read.id
87
+ mapped_region.query_sequence = read_seq
88
+ mapped_region.flag = 16 if hit.strand == -1 else 0
89
+ mapped_region.reference_id = target_id[hit.ctg]
90
+ mapped_region.reference_start = hit.r_st
91
+ mapped_region.mapping_quality = (
92
+ hit.mapq or 255
93
+ ) # 0-60 (255 means unavailable)
94
+ mapped_region.cigarstring = cigar
95
+ out.write(mapped_region)
96
+ break # keep only primary
97
+
98
+ # create BAM-file
99
+ pysam.sort("-o", self.bam, self.sam)
100
+ pysam.index(self.bam)
101
+
102
+ def get_total_genome_length(self) -> int:
103
+ """
104
+ Get the genome length from a BAM-file.
105
+
106
+ This function opens a BAM-file and extracts the genome length information.
107
+
108
+ Returns:
109
+ int: The genome length.
110
+ """
111
+ with pysam.AlignmentFile(self.bam, "rb") as bam:
112
+ return sum(bam.lengths)
113
+
114
+ def extract_starting_coordinates(self) -> None:
115
+ """
116
+ Extract starting coordinates of mapped regions from a BAM-file.
117
+
118
+ This function scans through a BAM-file and creates a TSV-file.
119
+ The information that is extracted is the starting coordinate for each mapped read.
120
+ """
121
+ # create tsv-file with all start positions
122
+ with open(self.tsv, "w") as tsv:
123
+ tsv.write("reference_genome\tread\tmapped_starting_coordinate\n")
124
+ try:
125
+ with pysam.AlignmentFile(self.bam, "rb") as bam:
126
+ entry = {
127
+ i: seq["SN"] for i, seq in enumerate(bam.header.to_dict()["SQ"])
128
+ }
129
+ seen = set()
130
+ for ref_seq in bam.references:
131
+ for hit in bam.fetch(ref_seq):
132
+ if (
133
+ hit.is_unmapped
134
+ or hit.is_secondary
135
+ or hit.is_supplementary
136
+ ):
137
+ continue
138
+ key = (hit.reference_id, hit.reference_start)
139
+ if key in seen:
140
+ continue
141
+ seen.add(key)
142
+ tsv.write(
143
+ f"{entry[hit.reference_id]}\t{hit.query_name}\t{hit.reference_start}\n"
144
+ )
145
+ except ValueError:
146
+ tsv.write("dummy_reference\tdummy_read\t1000\n")
147
+
148
+ def get_start_coordinates(self) -> list[int]:
149
+ """
150
+ Get the coordinates of a TSV-file.
151
+
152
+ This function opens a TSV-file and saves all starting coordinates in a list.
153
+
154
+ Returns:
155
+ list[int]: The list containing all starting coordinates.
156
+
157
+ Raises:
158
+ ValueError: If no column with starting coordinates is found.
159
+ """
160
+ coordinates = []
161
+ with open(self.tsv, "r", newline="") as f:
162
+ reader = csv.DictReader(f, delimiter="\t")
163
+ for row in reader:
164
+ val = row.get("mapped_starting_coordinate")
165
+ if val is None:
166
+ raise ValueError("Column with starting coordinates not found.")
167
+ coordinates.append(int(val))
168
+ return coordinates