PyPI - levseq - Version diff - 1.4.2 (tar.gz) → 1.5 (tar.gz) - Mend

levseq 1.4.2 (tar.gz) → 1.5 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (43)

{levseq-1.4.2/levseq.egg-info → levseq-1.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.1
+Metadata-Version: 2.4
 Name: levseq
-Version: 1.4.2
+Version: 1.5
 Home-page: https://github.com/fhalab/levseq/
 Author: Yueming Long, Ariane Mora, Francesca-Zhoufan Li, Emre Gursoy
 Author-email: ylong@caltech.edu
@@ -44,6 +44,18 @@ Requires-Dist: scikit-learn
 Requires-Dist: statsmodels
 Requires-Dist: tqdm
 Requires-Dist: biopandas
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: keywords
+Dynamic: license
+Dynamic: license-file
+Dynamic: project-url
+Dynamic: requires-dist
+Dynamic: requires-python
 # Variant Sequencing with Nanopore (LevSeq)
@@ -52,8 +64,35 @@ LevSeq provides a streamlined pipeline for sequencing and analyzing genetic vari
 ![Figure 1: LevSeq Workflow](manuscript/figures/LevSeq_Figure-1.jpeg)
 Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore technology. This diagram illustrates the key steps in the process, from sample preparation to data analysis and visualization.
+## <span style="color: orange;">**Important: Barcode Improvements and LevSeq 2.0 Development**</span>
+**We have identified and resolved demultiplexing challenges in the original barcode set.** Version 1.4 introduced alignment-aware variant calling to address these issues and significantly improve accuracy.
+**We are actively developing LevSeq 2.0** in collaboration with DTU and AITHYRA to fundamentally redesign the barcode system. The updated approach includes:
+- **Enhanced barcode design**: New barcodes will be strain-aware and sequence-aware, generated using an advanced barcode design tool
+- **Reversed workflow architecture**: LevSeq 2.0 will perform alignment first, then demultiplexing (rather than the current demultiplexing-first approach), resolving issues with forward and reverse read handling
+- **Improved accuracy**: These changes will provide more robust demultiplexing and variant calling across diverse experimental conditions
+**Please reach out to us at ylong@caltech.edu if you are planning to order barcoded primers now**
+## Notes
+LevSeq was designed for epPCR and SSM experiments, however, we are currently extending it to work for other enzyme engineering designs as well, the current features are under development:
+1. Insertion handling (see version 4.1.3) - thanks to  Brian Zhong for his contributions to this section!
+2. Gene calling (handling different genes, use the `--oligopool` flag)
+If you notice any issues with new features or have adapted the LevSeq code for your own use cases, we would love community contributions! Please submit either an issue, or a pull request and we will aim to incorperate the changes.
+Performance update: demultiplexing now runs in parallel batches of 8 plates and input FASTQs are staged once per run, improving throughput on multi-core systems.
 ## Quick Start
+Note the current stable version is: `1.5`, the latest version is `1.5`.
+For stable releases these are made available via docker and pip. For latest versions, please clone the repo and install locally (see *Local development or install of latest version* below).
 ### Docker Installation (Recommended)
 1. Install Docker: [https://docs.docker.com/engine/install/](https://docs.docker.com/engine/install/)
@@ -183,6 +222,16 @@ For the wet lab protocol:
 - **Advanced Usage**: See the [manuscript notebook](https://github.com/fhalab/LevSeq/blob/main/manuscript/notebooks/epPCR_10plates.ipynb)
 - **Troubleshooting**: See our [computational protocols wiki](https://github.com/fhalab/LevSeq/wiki/Computational-protocols)
+### Local development or install of latest version
+```
+conda create --name levseq python=3.10
+git clone git@github.com:fhalab/LevSeq.git
+cd LevSeq
+python setup.py sdist bdist_wheel
+pip install dist/levseq-1.4.3.tar.gz
+```
 ## Citing LevSeq
 If you find LevSeq useful, please cite our paper:

{levseq-1.4.2 → levseq-1.5}/README.md RENAMED

@@ -5,8 +5,35 @@ LevSeq provides a streamlined pipeline for sequencing and analyzing genetic vari
 ![Figure 1: LevSeq Workflow](manuscript/figures/LevSeq_Figure-1.jpeg)
 Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore technology. This diagram illustrates the key steps in the process, from sample preparation to data analysis and visualization.
+## <span style="color: orange;">**Important: Barcode Improvements and LevSeq 2.0 Development**</span>
+**We have identified and resolved demultiplexing challenges in the original barcode set.** Version 1.4 introduced alignment-aware variant calling to address these issues and significantly improve accuracy.
+**We are actively developing LevSeq 2.0** in collaboration with DTU and AITHYRA to fundamentally redesign the barcode system. The updated approach includes:
+- **Enhanced barcode design**: New barcodes will be strain-aware and sequence-aware, generated using an advanced barcode design tool
+- **Reversed workflow architecture**: LevSeq 2.0 will perform alignment first, then demultiplexing (rather than the current demultiplexing-first approach), resolving issues with forward and reverse read handling
+- **Improved accuracy**: These changes will provide more robust demultiplexing and variant calling across diverse experimental conditions
+**Please reach out to us at ylong@caltech.edu if you are planning to order barcoded primers now**
+## Notes
+LevSeq was designed for epPCR and SSM experiments, however, we are currently extending it to work for other enzyme engineering designs as well, the current features are under development:
+1. Insertion handling (see version 4.1.3) - thanks to  Brian Zhong for his contributions to this section!
+2. Gene calling (handling different genes, use the `--oligopool` flag)
+If you notice any issues with new features or have adapted the LevSeq code for your own use cases, we would love community contributions! Please submit either an issue, or a pull request and we will aim to incorperate the changes.
+Performance update: demultiplexing now runs in parallel batches of 8 plates and input FASTQs are staged once per run, improving throughput on multi-core systems.
 ## Quick Start
+Note the current stable version is: `1.5`, the latest version is `1.5`.
+For stable releases these are made available via docker and pip. For latest versions, please clone the repo and install locally (see *Local development or install of latest version* below).
 ### Docker Installation (Recommended)
 1. Install Docker: [https://docs.docker.com/engine/install/](https://docs.docker.com/engine/install/)
@@ -136,6 +163,16 @@ For the wet lab protocol:
 - **Advanced Usage**: See the [manuscript notebook](https://github.com/fhalab/LevSeq/blob/main/manuscript/notebooks/epPCR_10plates.ipynb)
 - **Troubleshooting**: See our [computational protocols wiki](https://github.com/fhalab/LevSeq/wiki/Computational-protocols)
+### Local development or install of latest version
+```
+conda create --name levseq python=3.10
+git clone git@github.com:fhalab/LevSeq.git
+cd LevSeq
+python setup.py sdist bdist_wheel
+pip install dist/levseq-1.4.3.tar.gz
+```
 ## Citing LevSeq
 If you find LevSeq useful, please cite our paper:
@@ -152,4 +189,4 @@ If you find LevSeq useful, please cite our paper:
 ## Contact
-Leave a feature request in the issues or reach us via [email](mailto:levseqdb@gmail.com).
+Leave a feature request in the issues or reach us via [email](mailto:levseqdb@gmail.com).

{levseq-1.4.2 → levseq-1.5}/levseq/__init__.py RENAMED

@@ -18,7 +18,7 @@
 __title__ = 'levseq'
 __description__ = 'LevSeq nanopore sequencing'
 __url__ = 'https://github.com/fhalab/levseq/'
-__version__ = '1.4.2'
+__version__ = '1.5'
 __author__ = 'Yueming Long, Ariane Mora, Francesca-Zhoufan Li, Emre Gursoy'
 __author_email__ = 'ylong@caltech.edu'
 __license__ = 'GPL3'

levseq-1.5/levseq/filter_orientation.py ADDED

@@ -0,0 +1,221 @@
+from Bio import SeqIO
+from Bio.Seq import Seq
+from pathlib import Path
+import gzip
+import logging
+import math
+import shutil
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from tqdm import tqdm
+_VALID_BASES = set("ACGT")
+def build_kmer_set(seq, kmer_size):
+    """Build a set of k-mers from the parent sequence."""
+    if kmer_size <= 0:
+        return set()
+    seq = seq.upper()
+    if len(seq) < kmer_size:
+        return set()
+    kmers = set()
+    for i in range(len(seq) - kmer_size + 1):
+        kmer = seq[i:i + kmer_size]
+        if set(kmer) <= _VALID_BASES:
+            kmers.add(kmer)
+    return kmers
+def sample_kmer_positions(seq_len, kmer_size, samples, skip_front, skip_back):
+    start_min = max(skip_front, 0)
+    start_max = seq_len - max(skip_back, 0) - kmer_size
+    if start_max < start_min or samples <= 0:
+        return []
+    span = start_max - start_min
+    if span == 0:
+        return [start_min]
+    if samples == 1:
+        return [start_min + span // 2]
+    step = span / (samples - 1)
+    positions = [int(round(start_min + i * step)) for i in range(samples)]
+    seen = set()
+    uniq_positions = []
+    for pos in positions:
+        if pos < start_min or pos > start_max:
+            continue
+        if pos in seen:
+            continue
+        seen.add(pos)
+        uniq_positions.append(pos)
+    return uniq_positions
+def count_kmer_hits(seq, positions, kmer_size, parent_kmers, rev_kmers):
+    forward_hits = 0
+    reverse_hits = 0
+    sampled = 0
+    for pos in positions:
+        kmer = seq[pos:pos + kmer_size]
+        if len(kmer) != kmer_size:
+            continue
+        if set(kmer) <= _VALID_BASES:
+            sampled += 1
+            if kmer in parent_kmers:
+                forward_hits += 1
+            if kmer in rev_kmers:
+                reverse_hits += 1
+    return forward_hits, reverse_hits, sampled
+def iter_fastq_records(handle):
+    while True:
+        header = handle.readline()
+        if not header:
+            break
+        seq = handle.readline()
+        plus = handle.readline()
+        qual = handle.readline()
+        if not seq or not plus or not qual:
+            break
+        yield header, seq, plus, qual
+def filter_single_file(args):
+    """
+    Filter a single fastq file. Used for parallel processing.
+    Args:
+        args: tuple containing (input_file, parent_kmers, rev_kmers, kmer_size, samples,
+                                skip_front, skip_back, min_delta, min_ratio)
+    Returns:
+        tuple: (file_path, total_reads, kept_reads, temp_file)
+    """
+    (input_file, parent_kmers, rev_kmers, kmer_size, samples,
+     skip_front, skip_back, min_delta, min_ratio) = args
+    total_reads = 0
+    kept_count = 0
+    is_forward = "forward" in str(input_file).lower()
+    input_path = Path(input_file)
+    temp_file = input_path.parent / f"temp_{input_path.name}"
+    position_cache = {}
+    open_fn = gzip.open if input_path.suffix == ".gz" else open
+    with open_fn(input_path, "rt") as input_handle, open(temp_file, "w") as output_handle:
+        for header, seq_line, plus, qual in iter_fastq_records(input_handle):
+            total_reads += 1
+            seq = seq_line.strip().upper()
+            seq_len = len(seq)
+            if seq_len not in position_cache:
+                position_cache[seq_len] = sample_kmer_positions(
+                    seq_len, kmer_size, samples, skip_front, skip_back
+                )
+            positions = position_cache[seq_len]
+            forward_hits, reverse_hits, sampled = count_kmer_hits(
+                seq, positions, kmer_size, parent_kmers, rev_kmers
+            )
+            if sampled == 0:
+                continue
+            required_delta = max(min_delta, int(math.ceil(min_ratio * sampled)))
+            if required_delta > sampled:
+                required_delta = sampled
+            # If it's in forward file (plate barcode was rev comp)
+            # Then read should align to reverse complement parent sequence
+            if is_forward and (reverse_hits - forward_hits) >= required_delta:
+                output_handle.write(header)
+                output_handle.write(seq_line)
+                output_handle.write(plus)
+                output_handle.write(qual)
+                kept_count += 1
+            # If it's in reverse file (plate barcode was forward)
+            # Then read was already reverse complemented by demultiplexer
+            # So it should align to forward parent sequence
+            elif not is_forward and (forward_hits - reverse_hits) >= required_delta:
+                output_handle.write(header)
+                output_handle.write(seq_line)
+                output_handle.write(plus)
+                output_handle.write(qual)
+                kept_count += 1
+    return str(input_file), total_reads, kept_count, str(temp_file)
+def filter_demultiplexed_folder(
+    experiment_folder,
+    parent_sequence,
+    num_threads=8,
+    kmer_size=6,
+    samples=40,
+    skip_front=100,
+    skip_back=0,
+    min_delta=4,
+    min_ratio=0.1,
+):
+    """
+    Filter demultiplexed files using a k-mer orientation heuristic.
+    Args:
+        experiment_folder (str): Path to experiment folder containing RBC/FBC structure
+        parent_sequence (str): Parent sequence for alignment checking
+        num_threads (int): Number of threads to use
+        kmer_size (int): Length of k-mer used for orientation checks
+        samples (int): Number of k-mers sampled per read
+        skip_front (int): Bases to skip from the front of the read
+        skip_back (int): Bases to skip from the end of the read
+        min_delta (int): Minimum hit difference to keep a read
+        min_ratio (float): Minimum hit difference as a ratio of sampled k-mers
+    """
+    exp_path = Path(experiment_folder)
+    filtered_counts = {}
+    # Prepare parent sequences once
+    parent_seq_obj = Seq(parent_sequence)
+    parent_seq = str(parent_seq_obj).upper()
+    parent_rev_comp = str(parent_seq_obj.reverse_complement()).upper()
+    parent_kmers = build_kmer_set(parent_seq, kmer_size)
+    rev_kmers = build_kmer_set(parent_rev_comp, kmer_size)
+    # Collect all fastq files
+    fastq_files = []
+    for rbc_dir in exp_path.glob("RB*"):
+        if not rbc_dir.is_dir():
+            continue
+        for fbc_dir in rbc_dir.glob("NB*"):
+            if not fbc_dir.is_dir():
+                continue
+            fastq_files.extend(list(fbc_dir.glob("*.fastq")))
+    if not fastq_files:
+        logging.warning(f"No fastq files found in {experiment_folder}")
+        return filtered_counts
+    # Prepare arguments for parallel processing
+    file_args = [
+        (f, parent_kmers, rev_kmers, kmer_size, samples, skip_front, skip_back, min_delta, min_ratio)
+        for f in fastq_files
+    ]
+    # Process files in parallel with progress bar
+    with ThreadPoolExecutor(max_workers=num_threads) as executor:
+        futures = [executor.submit(filter_single_file, args) for args in file_args]
+        with tqdm(total=len(fastq_files), desc="Filtering files") as pbar:
+            for future in as_completed(futures):
+                try:
+                    file_path, total, kept, temp_file = future.result()
+                    shutil.move(temp_file, file_path)
+                    filtered_counts[file_path] = {
+                        'total': total,
+                        'kept': kept,
+                        'filtered': total - kept
+                    }
+                    logging.info(f"Processed {file_path}: {kept}/{total} reads kept")
+                    pbar.update(1)
+                except Exception as e:
+                    logging.error(f"Error processing file {file_path}: {str(e)}")
+                    pbar.update(1)
+    return filtered_counts
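
The orientation heuristic in the file above can be illustrated with a much smaller standalone sketch. This is not the package code: it scans every k-mer of a read instead of sampling evenly spaced positions, it skips the `skip_front`/`skip_back` trimming, and the names `reverse_complement`, `kmer_set`, `orientation_votes`, and `required_delta` as well as the `parent` sequence are made up for illustration. Only the vote itself (hits against the parent versus hits against its reverse complement) and the threshold rule (`max(min_delta, ceil(min_ratio * sampled))`, capped at `sampled`) are taken from the diff.

```python
import math

# Translation table for complementing A/C/G/T (hypothetical helper, not from levseq).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an A/C/G/T string (input is upper-cased first)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmer_set(seq, k):
    """Return every k-mer of seq as a set."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orientation_votes(read, parent, k=6):
    """Count read k-mers found in the parent and in its reverse complement.

    Simplification: every position is scanned, whereas the package samples
    a fixed number of evenly spaced positions per read.
    Returns (forward_hits, reverse_hits, sampled).
    """
    forward_kmers = kmer_set(parent, k)
    reverse_kmers = kmer_set(reverse_complement(parent), k)
    read = read.upper()
    forward_hits = reverse_hits = sampled = 0
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        sampled += 1
        forward_hits += kmer in forward_kmers
        reverse_hits += kmer in reverse_kmers
    return forward_hits, reverse_hits, sampled


def required_delta(sampled, min_delta=4, min_ratio=0.1):
    """Minimum hit difference needed to keep a read, capped at sampled."""
    return min(sampled, max(min_delta, math.ceil(min_ratio * sampled)))


# Illustrative parent sequence (an assumption, not taken from the package).
parent = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGG"

fwd, rev, n = orientation_votes(parent, parent)
keep_as_forward_match = (fwd - rev) >= required_delta(n)
print(fwd, rev, n, keep_as_forward_match)
```

A read that matches the parent strand collects more forward hits than reverse hits, and the reverse complement of that read does the opposite. In the package, which of the two differences is tested depends on whether the file path contains `forward`.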

{levseq-1.4.2 → levseq-1.5}/levseq/run_levseq.py RENAMED

@@ -62,6 +62,7 @@ import numpy as np
 import tqdm
 import panel as pn
 import holoviews as hv
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from importlib import resources
 from holoviews.streams import Tap
@@ -485,6 +486,11 @@ def process_ref_csv_oligopool(cl_args, tqdm_fn=tqdm.tqdm):
     result_folder = create_result_folder(cl_args)
     variant_csv_path = os.path.join(result_folder, "variants.csv")
     variant_df = pd.DataFrame(columns=["barcode_plate", "name", "refseq", "variant"])
+    output_dir = Path(result_folder) / f"{cl_args['name']}_fastq"
+    output_dir.mkdir(parents=True, exist_ok=True)
+    if not cl_args["skip_demultiplexing"]:
+        cat_fastq_files(cl_args.get("path"), output_dir)
     # First get the different barcode plates (these will be unique)
     barcode_plates = ref_df["barcode_plate"].unique()
@@ -496,10 +502,7 @@ def process_ref_csv_oligopool(cl_args, tqdm_fn=tqdm.tqdm):
             name_folder = os.path.join(result_folder, f'RB{barcode_plate}')
             os.makedirs(name_folder, exist_ok=True)
             barcode_path = filter_bc(cl_args, name_folder, i)
-            output_dir = Path(result_folder) / f"{cl_args['name']}_fastq"
-            output_dir.mkdir(parents=True, exist_ok=True)
-            file_to_fastq = cat_fastq_files(cl_args.get("path"), output_dir)
             try:
                 demux_fastq(output_dir, name_folder, barcode_path)
             except Exception as e:
@@ -543,62 +546,134 @@ def process_ref_csv(cl_args, tqdm_fn=tqdm.tqdm):
     variant_csv_path = os.path.join(result_folder, "variants.csv")
     variant_df = pd.DataFrame(columns=["barcode_plate", "name", "refseq", "variant"])
-    for i, row in tqdm_fn(ref_df.iterrows(), total=len(ref_df), desc="Processing Samples"):
+    output_dir = Path(result_folder) / f"{cl_args['name']}_fastq"
+    output_dir.mkdir(parents=True, exist_ok=True)
+    if not cl_args["skip_demultiplexing"]:
+        cat_fastq_files(cl_args.get("path"), output_dir)
+    samples = []
+    for i, row in ref_df.iterrows():
         barcode_plate = row["barcode_plate"]
         name = row["name"]
         refseq = row["refseq"].upper()
         name_folder = os.path.join(result_folder, name)
         os.makedirs(name_folder, exist_ok=True)
         temp_fasta_path = os.path.join(name_folder, f"temp_{name}.fasta")
         if not os.path.exists(temp_fasta_path):
             with open(temp_fasta_path, "w") as f:
                 f.write(f">{name}\n{refseq}\n")
         else:
             logging.info(f"Fasta file for {name} already exists. Skipping write.")
-        barcode_path = filter_bc(cl_args, name_folder, i)
-        output_dir = Path(result_folder) / f"{cl_args['name']}_fastq"
-        output_dir.mkdir(parents=True, exist_ok=True)
-        if not cl_args["skip_demultiplexing"]:
-            file_to_fastq = cat_fastq_files(cl_args.get("path"), output_dir)
+        barcode_path = filter_bc(cl_args, name_folder, i)
+        samples.append({
+            "barcode_plate": barcode_plate,
+            "name": name,
+            "refseq": refseq,
+            "name_folder": name_folder,
+            "temp_fasta_path": temp_fasta_path,
+            "barcode_path": barcode_path,
+            "demux_ok": True,
+        })
+    def _demux_only(sample):
+        name = sample["name"]
+        name_folder = sample["name_folder"]
+        barcode_path = sample["barcode_path"]
+        try:
+            demux_fastq(output_dir, name_folder, barcode_path)
+            return True
+        except Exception:
+            logging.error(
+                "An error occurred during demultiplexing for sample {}. Skipping this sample.".format(name),
+                exc_info=True,
+            )
+            return False
+    if not cl_args["skip_demultiplexing"]:
+        batch_size = 8
+        if samples:
+            pbar = tqdm_fn(total=len(samples), desc="Demultiplex plates")
             try:
-                demux_fastq(output_dir, name_folder, barcode_path)
+                for i in range(0, len(samples), batch_size):
+                    batch = samples[i:i + batch_size]
+                    max_workers = len(batch)
+                    if max_workers <= 1:
+                        sample = batch[0]
+                        sample["demux_ok"] = _demux_only(sample)
+                        pbar.update(1)
+                        continue
+                    with ThreadPoolExecutor(max_workers=max_workers) as executor:
+                        futures = {executor.submit(_demux_only, sample): sample for sample in batch}
+                        for future in as_completed(futures):
+                            sample = futures[future]
+                            sample["demux_ok"] = future.result()
+                            pbar.update(1)
+            finally:
+                pbar.close()
+    else:
+        for sample in samples:
+            sample["demux_ok"] = True
-                # Add filtering step here with multithreading
+    if not cl_args["skip_demultiplexing"]:
+        for sample in tqdm_fn(samples, total=len(samples), desc="Filter plates"):
+            if not sample["demux_ok"]:
+                continue
+            name = sample["name"]
+            refseq = sample["refseq"]
+            name_folder = sample["name_folder"]
+            try:
                 filtered_counts = filter_demultiplexed_folder(
-                        name_folder,
-                        refseq,
-                        num_threads=10
+                    name_folder,
+                    refseq,
+                    num_threads=10,
                 )
                 logging.info(f"Orientation filtering completed for {name}")
                 total_reads = sum(counts['total'] for counts in filtered_counts.values())
                 kept_reads = sum(counts['kept'] for counts in filtered_counts.values())
-                logging.info(f"Total filtering results: {kept_reads}/{total_reads} reads kept ({kept_reads/total_reads*100:.2f}%)")
+                if total_reads:
+                    logging.info(
+                        "Total filtering results: %d/%d reads kept (%.2f%%)",
+                        kept_reads,
+                        total_reads,
+                        kept_reads / total_reads * 100,
+                    )
                 for file, counts in filtered_counts.items():
                     logging.info(f"{file}: {counts['kept']}/{counts['total']} reads kept")
+            except Exception:
+                logging.error(
+                    "An error occurred during filtering for sample {}. Skipping this sample.".format(name),
+                    exc_info=True,
+                )
+                sample["demux_ok"] = False
+                continue
-            except Exception as e:
-                logging.error("An error occurred during demultiplexing/filtering for sample {}. Skipping this sample.".format(name), exc_info=True)
+    if not cl_args["skip_variantcalling"]:
+        for sample in tqdm_fn(samples, total=len(samples), desc="Calling variants"):
+            if not sample["demux_ok"] and not cl_args["skip_demultiplexing"]:
                 continue
-        if not cl_args["skip_variantcalling"]:
             try:
                 threshold = cl_args.get("threshold") if cl_args.get("threshold") is not None else 0.5
                 variant_result = call_variant(
-                    f"{name}", name_folder, temp_fasta_path, barcode_path, threshold=threshold
+                    f"{sample['name']}",
+                    sample["name_folder"],
+                    sample["temp_fasta_path"],
+                    sample["barcode_path"],
+                    threshold=threshold,
                 )
-                variant_result["barcode_plate"] = barcode_plate
-                variant_result["name"] = name
-                variant_result["refseq"] = refseq
+                variant_result["barcode_plate"] = sample["barcode_plate"]
+                variant_result["name"] = sample["name"]
+                variant_result["refseq"] = sample["refseq"]
                 variant_df = pd.concat([variant_df, variant_result])
             except Exception as e:
-                logging.error("An error occurred during variant calling for sample {}. Skipping this sample.".format(name), exc_info=True)
+                logging.error(
+                    "An error occurred during variant calling for sample {}. Skipping this sample.".format(sample["name"]),
+                    exc_info=True,
+                )
                 continue
     variant_df.to_csv(variant_csv_path, index=False)
@@ -676,4 +751,3 @@ def run_LevSeq(cl_args, tqdm_fn=tqdm.tqdm):
 # This modification saves the results at each critical stage, ensuring that even in the case of failure,
 # the user has access to intermediate results and does not lose all the progress.
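
The batched demultiplexing introduced in `process_ref_csv` follows a common pattern: split the work into fixed-size batches, run each batch on its own thread pool, and record a success flag per item so that later stages can skip failures. The sketch below shows that pattern in isolation, under stated assumptions. `run_in_batches` and `demo_worker` are hypothetical names, and the sketch catches exceptions when reading `future.result()`, whereas the package's `_demux_only` catches them inside the worker and returns `False`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_batches(items, worker, batch_size=8):
    """Run worker(item) for every item, batch_size items at a time.

    Returns a dict mapping each item to the worker's return value,
    or to False if the worker raised an exception.
    """
    results = {}
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # One thread per item in the batch, as in the diff above.
        with ThreadPoolExecutor(max_workers=len(batch)) as executor:
            futures = {executor.submit(worker, item): item for item in batch}
            for future in as_completed(futures):
                item = futures[future]
                try:
                    results[item] = future.result()
                except Exception:
                    # A failed item is flagged and the rest of the batch continues.
                    results[item] = False
    return results


def demo_worker(plate):
    """Stand-in for a demultiplexing call; plate 7 fails on purpose."""
    if plate == 7:
        raise ValueError("simulated demultiplexing failure")
    return True


status = run_in_batches(list(range(20)), demo_worker)
failed = sorted(plate for plate, ok in status.items() if not ok)
print(failed)
```

Keeping the flag per item is what lets the later filtering and variant-calling loops in the diff skip only the samples whose demultiplexing failed.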

{levseq-1.4.2 → levseq-1.5}/levseq/utils.py RENAMED

@@ -205,7 +205,7 @@ def calculate_mutation_significance_across_well(seq_df):
         seq_df.at[i, 'p(g)'] = p_g
         seq_df.at[i, 'p(c)'] = p_c
         seq_df.at[i, 'p(n)'] = p_n
-        seq_df.at[i, 'p(i)'] = p_n
+        seq_df.at[i, 'p(i)'] = p_i
         seq_df.at[i, 'p_value'] = p_value
         seq_df.at[i, 'percent_most_freq_mutation'] = val
         seq_df.at[i, 'most_frequent'] = actual_seq
@@ -324,6 +324,8 @@ def get_reads_for_well(parent_name, bam_file_path: str, ref_str: str, msa_path=N
                               'C', 'p(c)', 'N', 'p(n)', 'I', 'p(i)', 'Warnings']
             return calculate_mutation_significance_across_well(seq_df), alignment_count
     return None, 0
 def make_row_from_read_pileup_across_well(well_df, ref_str, label, insert_map):
     """
     Given a pileup of reads, we want to get some summary information about that sequence
@@ -349,12 +351,12 @@ def make_row_from_read_pileup_across_well(well_df, ref_str, label, insert_map):
                 warning = f'WARNING: INSERT.'
             rows.append([label, col, ref_seq, actual_seq, freq_non_ref, total_other, total_reads, 1.0, 0.0,
                          len(vc[vc == 'A']), 1.0, len(vc[vc == 'T']), 1.0, len(vc[vc == 'G']), 1.0,
-                         len(vc[vc == 'C']), 1.0, len(vc[vc == '-']), 1.0, len(insert_map.get(col)),
+                         len(vc[vc == 'C']), 1.0, len(vc[vc == '-']), 1.0, len(vc[vc == 'I']),
                          1.0, warning])
-        if ref_seq != '-':
+        elif ref_seq != '-':
             rows.append([label, col, ref_seq, actual_seq, freq_non_ref, total_other, total_reads, 1.0, 0.0,
                          len(vc[vc == 'A']), 1.0, len(vc[vc == 'T']), 1.0, len(vc[vc == 'G']), 1.0,
-                         len(vc[vc == 'C']), 1.0, len(vc[vc == '-']), 1.0, 0,
+                         len(vc[vc == 'C']), 1.0, len(vc[vc == '-']), 1.0, len(vc[vc == 'I']),
                          1.0, warning])
     return rows
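
The `if` → `elif` change in `make_row_from_read_pileup_across_well` matters whenever the first branch and the `ref_seq != '-'` test can both be true for the same column. The condition of the first branch is not visible in this hunk, so the sketch below uses a hypothetical `is_insert` flag as a stand-in. It only illustrates the general effect: two independent `if` statements can append two rows for one column, while `if`/`elif` appends at most one.

```python
def rows_with_if(is_insert, ref_base):
    """Two independent checks: both branches may fire for one column."""
    rows = []
    if is_insert:  # hypothetical stand-in for the first branch's condition
        rows.append(("insert", ref_base))
    if ref_base != "-":
        rows.append(("regular", ref_base))
    return rows


def rows_with_elif(is_insert, ref_base):
    """Chained checks: at most one row is appended per column."""
    rows = []
    if is_insert:  # hypothetical stand-in for the first branch's condition
        rows.append(("insert", ref_base))
    elif ref_base != "-":
        rows.append(("regular", ref_base))
    return rows


# Same column, both conditions true: compare the number of rows produced.
print(len(rows_with_if(True, "A")), len(rows_with_elif(True, "A")))
```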

{levseq-1.4.2 → levseq-1.5}/levseq/variantcaller.py RENAMED

@@ -18,6 +18,7 @@ import pandas as pd
 import logging
 from levseq.utils import *
 import subprocess
+import shutil
 import os
 from collections import defaultdict
 import glob
@@ -41,9 +42,7 @@ The variant caller starts from demultiplexed fastq files.
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.WARNING)  # Set default level for this module
-# Use the logger in this file
-logger.warning("This is a warning message.")
-logger.info("This won't show unless logging is configured to INFO elsewhere.")
+# Use the logger in this file.
 class VariantCaller:
     """
@@ -123,47 +122,108 @@ class VariantCaller:
     def _align_sequences(self, output_dir, filename, scores=[4, 2, 10], alignment_name="alignment_minimap"):
         try:
             all_fastq = os.path.join(output_dir, '*.fastq')
-            fastq_list = glob.glob(all_fastq)
-            fastq_files = all_fastq # os.path.join(output_dir, f"demultiplexed_{filename}.fastq")
-            if not all_fastq:
+            fastq_list = sorted(glob.glob(all_fastq))
+            if not fastq_list:
                 logger.error("No FASTQ files found in the specified output directory.")
                 return
-            # Combining fastq files into one if there are more than 1
-            if len(fastq_list) > 1:
-                with open(fastq_files, 'w') as outfile:
-                    for fastq in fastq_list:
-                        with open(fastq, 'r') as infile:
-                            outfile.write(infile.read())
-            else:
-                fastq_files = fastq_list[0]
+            sam_path = os.path.join(output_dir, f"{alignment_name}.sam")
             # Alignment using minimap2
-            minimap_cmd = f"minimap2 -ax map-ont -A {scores[0]} -B {scores[1]} -O {scores[2]},24 '{self.template_fasta}' '{fastq_files}' > '{output_dir}/{alignment_name}.sam'"
-            subprocess.run(minimap_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+            minimap_cmd = [
+                "minimap2", "-ax", "map-ont",
+                "-A", str(scores[0]),
+                "-B", str(scores[1]),
+                "-O", f"{scores[2]},24",
+                str(self.template_fasta),
+                *fastq_list,
+            ]
+            with open(sam_path, "w") as sam_handle:
+                minimap_result = subprocess.run(
+                    minimap_cmd,
+                    stdout=sam_handle,
+                    stderr=subprocess.PIPE,
+                    text=True,
+                )
+            if minimap_result.returncode != 0:
+                logger.error(
+                    "minimap2 failed for %s: %s",
+                    filename,
+                    minimap_result.stderr.strip(),
+                )
+                return
             # print(minimap_cmd)
-            # Convert SAM to BAM and sort
-            view_cmd = f"samtools view -bS '{output_dir}/{alignment_name}.sam' > '{output_dir}/{alignment_name}.bam'"
-            subprocess.run(view_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
-            # print(view_cmd)
+            # Convert SAM to BAM
+            unsorted_bam = os.path.join(output_dir, f"{alignment_name}.unsorted.bam")
+            with open(unsorted_bam, "wb") as bam_handle:
+                view_result = subprocess.run(
+                    ["samtools", "view", "-bS", sam_path],
+                    stdout=bam_handle,
+                    stderr=subprocess.PIPE,
+                )
+            if view_result.returncode != 0:
+                logger.error(
+                    "samtools view failed for %s: %s",
+                    filename,
+                    view_result.stderr.decode().strip(),
+                )
+                return
-            sort_cmd = f"samtools sort '{output_dir}/{alignment_name}.bam' -o '{output_dir}/{alignment_name}.bam'"
-            subprocess.run(sort_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
-            # print(sort_cmd)
+            # Sort BAM (support both modern and legacy samtools syntax)
+            sorted_bam = os.path.join(output_dir, f"{alignment_name}.bam")
+            sort_result = subprocess.run(
+                ["samtools", "sort", "-o", sorted_bam, unsorted_bam],
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.PIPE,
+            )
+            if sort_result.returncode != 0 or not os.path.exists(sorted_bam):
+                legacy_prefix = os.path.join(output_dir, f"{alignment_name}.sorted")
+                legacy_result = subprocess.run(
+                    ["samtools", "sort", unsorted_bam, legacy_prefix],
+                    stdout=subprocess.DEVNULL,
+                    stderr=subprocess.PIPE,
+                )
+                if legacy_result.returncode != 0:
+                    logger.error(
+                        "samtools sort failed for %s: %s",
+                        filename,
+                        legacy_result.stderr.decode().strip(),
+                    )
+                    return
+                legacy_bam = f"{legacy_prefix}.bam"
+                if not os.path.exists(legacy_bam):
+                    logger.error("samtools sort did not produce %s", legacy_bam)
+                    return
+                shutil.move(legacy_bam, sorted_bam)
             # Index the BAM file
-            index_cmd = f"samtools index '{output_dir}/{alignment_name}.bam'"
-            subprocess.run(index_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+            if not os.path.exists(sorted_bam):
+                logger.error("samtools sort did not produce %s", sorted_bam)
+                return
+            index_cmd = ["samtools", "index", sorted_bam]
+            index_result = subprocess.run(
+                index_cmd,
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.PIPE,
+            )
+            if index_result.returncode != 0:
+                logger.error(
+                    "samtools index failed for %s: %s",
+                    filename,
+                    index_result.stderr.decode().strip(),
+                )
+                return
             # Cleanup SAM file to save space
-            os.remove(f"{output_dir}/{alignment_name}.sam")
+            os.remove(sam_path)
+            os.remove(unsorted_bam)
         except Exception as e:
             logger.error(f"Error during alignment for {filename}: {e}")
     def _run_variant_thread(self, args):
         barcode_ids, threshold, min_depth, output_dir = args
-        # Overall progress bar for all barcodes in this thread
-        with tqdm(barcode_ids, desc="Processing barcodes", leave=False) as pbar:
+        logger.info("Variant calling: processing %d barcodes", len(barcode_ids))
+        # Overall progress bar for all barcodes in this thread (disabled to reduce console spam)
+        with tqdm(barcode_ids, desc="Processing barcodes", leave=False, disable=True) as pbar:
             for barcode_id in pbar:
                 try:
                     row = self.variant_dict.get(barcode_id)
@@ -171,9 +231,18 @@ class VariantCaller:
                     # Check if alignment file exists, if not, align sequences
                     if not os.path.exists(bam_file):
-                       logger.info(f"Aligning sequences for {row['Path']}")
-                    self._align_sequences(row["Path"], row['Barcodes'],
-                                              alignment_name=f'{self.alignment_name}_{barcode_id}')
+                        logger.info(f"Aligning sequences for {row['Path']}")
+                        self._align_sequences(
+                            row["Path"],
+                            row['Barcodes'],
+                            alignment_name=f'{self.alignment_name}_{barcode_id}',
+                        )
+                    elif not os.path.exists(f"{bam_file}.bai"):
+                        subprocess.run(
+                            ["samtools", "index", bam_file],
+                            stdout=subprocess.DEVNULL,
+                            stderr=subprocess.DEVNULL,
+                        )
                     # Placeholder function calls to demonstrate workflow
                     well_df, alignment_count = get_reads_for_well(self.experiment_name, bam_file,
@@ -184,7 +253,12 @@ class VariantCaller:
                                 continue
                         self.variant_dict[barcode_id]['Alignment Count'] = alignment_count
                         well_df.to_csv(f"{row['Path']}/seq_{barcode_id}.csv", index=False)
-                        label, freq, combined_p_value, mixed_well, avg_error_rate = get_variant_label_for_well(well_df, threshold)
+                        # Suppress noisy numerical warnings from downstream stats on sparse wells.
+                        with warnings.catch_warnings():
+                            warnings.filterwarnings("ignore", category=RuntimeWarning)
+                            label, freq, combined_p_value, mixed_well, avg_error_rate = get_variant_label_for_well(
+                                well_df, threshold
+                            )
                         self.variant_dict[barcode_id]['Variant'] = label
                         self.variant_dict[barcode_id]['Mixed Well'] = mixed_well
                         self.variant_dict[barcode_id]['Average mutation frequency'] = freq
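
Note on the hunks above: the new `_align_sequences` body repeats the same "run, check return code, log stderr, bail out" pattern for each samtools step, plus a second attempt for the legacy `samtools sort` syntax. A minimal sketch of how that pattern could be condensed is shown below. The helper names `run_step` and `run_with_fallback` are hypothetical and do not exist in levseq; the usage lines call the Python interpreter as a stand-in so the sketch runs without samtools installed.

```python
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def run_step(cmd, label):
    """Run one external command (hypothetical helper, not part of levseq).

    Returns True on success. On a non-zero exit code the captured stderr
    is logged and False is returned, so the caller can stop early.
    """
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    if result.returncode != 0:
        logger.error(
            "%s failed: %s", label, result.stderr.decode(errors="replace").strip()
        )
        return False
    return True


def run_with_fallback(primary_cmd, fallback_cmd, label):
    """Try the primary command first; only run the fallback if it fails."""
    if run_step(primary_cmd, label):
        return True
    return run_step(fallback_cmd, f"{label} (fallback)")


# Usage with stand-in commands: the first exits non-zero, the second succeeds.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing = [sys.executable, "-c", "pass"]
sorted_ok = run_with_fallback(failing, passing, "sort")
```

Passing the command as a list, as the 1.5 code does, avoids `shell=True` and the manual quoting of paths that the 1.4.2 f-string commands relied on.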

{levseq-1.4.2 → levseq-1.5}/levseq/visualization.py RENAMED

@@ -67,13 +67,36 @@ import seaborn as sns
 from levseq.utils import *
-output_notebook()
+def _in_notebook():
+    try:
+        from IPython import get_ipython
+    except Exception:
+        return False
+    ip = get_ipython()
+    if ip is None:
+        return False
+    return ip.__class__.__name__ == "ZMQInteractiveShell"
+def _should_init_notebook():
+    if os.environ.get("LEVSEQ_DISABLE_NOTEBOOK_INIT") == "1":
+        return False
+    if os.environ.get("LEVSEQ_FORCE_NOTEBOOK_INIT") == "1":
+        return True
+    return _in_notebook()
-pn.extension()
-pn.config.comms = "vscode"
+def init_notebook_env():
+    # Avoid notebook UI side-effects during plain imports/tests.
+    output_notebook()
+    pn.extension()
+    pn.config.comms = "vscode"
+    hv.extension("bokeh")
+    hv.renderer("bokeh").webgl = True
-hv.extension("bokeh")
-hv.renderer("bokeh").webgl = True
+if _should_init_notebook():
+    init_notebook_env()
 # warnings.filterwarnings("ignore")
 #warnings.filterwarnings("ignore", category=Warning)
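
Note on the hunk above: notebook setup now runs at import time only if `_should_init_notebook()` returns true, and the two environment variables are checked before shell detection. The precedence can be sketched as a standalone function. This is an illustration only: the name `should_init` and its parameters are hypothetical, with the environment mapping and the detection result passed in so the logic can be exercised without IPython.

```python
def should_init(env, in_notebook):
    """Mirror the precedence of the gating shown in the diff (hypothetical helper).

    env:         mapping of environment variables, e.g. os.environ
    in_notebook: result of the shell detection (True inside a Jupyter kernel)
    """
    # The disable flag is checked first, so it wins even if force is also set.
    if env.get("LEVSEQ_DISABLE_NOTEBOOK_INIT") == "1":
        return False
    if env.get("LEVSEQ_FORCE_NOTEBOOK_INIT") == "1":
        return True
    # With neither flag set, the decision falls back to shell detection.
    return in_notebook


both_flags = {"LEVSEQ_DISABLE_NOTEBOOK_INIT": "1", "LEVSEQ_FORCE_NOTEBOOK_INIT": "1"}
decision = should_init(both_flags, in_notebook=True)  # disable takes priority
```

Because the check runs when the module is imported, the variables have to be set before `levseq.visualization` is first imported for them to take effect.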
@@ -392,25 +415,33 @@ def generate_platemaps(
         out=np.zeros_like(max_combo_df["Alignment Count"], dtype=float),
         where=max_combo_df["Alignment Count"] != 0,
     )
-    # Set the center
-    center = np.log(10)
+    max_combo_df["logseqdepth"] = max_combo_df["logseqdepth"].fillna(0.0)
+    min_val = max_combo_df["logseqdepth"].min()
+    max_val = max_combo_df["logseqdepth"].max()
+    if not np.isfinite(min_val) or not np.isfinite(max_val) or min_val == max_val:
+        # Avoid invalid colormap centers when data has no range.
+        color_levels = [min_val - 0.1, min_val, min_val + 0.1]
+    else:
+        # Set the center
+        center = np.log(10)
-    add_min = False
-    if max_combo_df["logseqdepth"].min() >= center:
-        add_min = True
+        add_min = False
+        if min_val >= center:
+            add_min = True
-    # Adjust if it is greater than max of data (avoids ValueError)
-    if max_combo_df["logseqdepth"].max() <= center:
-        # Adjust the center
-        center = max_combo_df["logseqdepth"].median()
+        # Adjust if it is greater than max of data (avoids ValueError)
+        if max_val <= center:
+            # Adjust the center
+            center = max_combo_df["logseqdepth"].median()
-    # center colormap
-    if not add_min:
-        color_levels = ns.viz._center_colormap(max_combo_df["logseqdepth"], center)
-    else:
-        color_levels = ns.viz._center_colormap(
-            list(max_combo_df["logseqdepth"]) + [np.log(1)], center
-        )
+        # center colormap
+        if not add_min:
+            color_levels = ns.viz._center_colormap(max_combo_df["logseqdepth"], center)
+        else:
+            color_levels = ns.viz._center_colormap(
+                list(max_combo_df["logseqdepth"]) + [np.log(1)], center
+            )
     # dictionary for storing plots
     hm_dict = {}
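
Note on the hunk above: when the log-depth column has no range, the new code builds three levels around `min_val` instead of calling the centered colormap. If `min_val` itself is not finite (for example `NaN` from an empty frame), `min_val - 0.1` and `min_val + 0.1` are also not finite. A minimal pure-Python sketch of the same guard with a finite base follows. The function name `degenerate_levels`, the `pad` parameter, and the choice of `0.0` as the substitute base are assumptions for illustration, not levseq behaviour.

```python
import math


def degenerate_levels(min_val, max_val, pad=0.1):
    """Return three increasing levels when the data has no usable range.

    Returns None when the range is finite and non-empty, meaning the caller
    should take the normal centered-colormap path instead.
    """
    has_range = (
        math.isfinite(min_val) and math.isfinite(max_val) and min_val != max_val
    )
    if has_range:
        return None
    # Assumption: substitute 0.0 when the minimum is NaN or infinite,
    # so the returned levels are always finite and strictly increasing.
    base = min_val if math.isfinite(min_val) else 0.0
    return [base - pad, base, base + pad]


levels_for_constant_plate = degenerate_levels(0.0, 0.0)
levels_for_empty_plate = degenerate_levels(float("nan"), float("nan"))
```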
@@ -1198,4 +1229,4 @@ def plot_seaborn_heatmap(platemap, platemap_labels, label: str, result_folder):
     ax = pc.axes
     plt.yticks(rotation=0)
     plt.setp(ax.get_yticklabels(), ha="center")
-    plt.savefig(os.path.join(result_folder, f'platemap_{label}.svg'))
+    plt.savefig(os.path.join(result_folder, f'platemap_{label}.svg'))

{levseq-1.4.2 → levseq-1.5/levseq.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.1
+Metadata-Version: 2.4
 Name: levseq
-Version: 1.4.2
+Version: 1.5
 Home-page: https://github.com/fhalab/levseq/
 Author: Yueming Long, Ariane Mora, Francesca-Zhoufan Li, Emre Gursoy
 Author-email: ylong@caltech.edu
@@ -44,6 +44,18 @@ Requires-Dist: scikit-learn
 Requires-Dist: statsmodels
 Requires-Dist: tqdm
 Requires-Dist: biopandas
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: keywords
+Dynamic: license
+Dynamic: license-file
+Dynamic: project-url
+Dynamic: requires-dist
+Dynamic: requires-python
 # Variant Sequencing with Nanopore (LevSeq)
@@ -52,8 +64,35 @@ LevSeq provides a streamlined pipeline for sequencing and analyzing genetic vari
 ![Figure 1: LevSeq Workflow](manuscript/figures/LevSeq_Figure-1.jpeg)
 Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore technology. This diagram illustrates the key steps in the process, from sample preparation to data analysis and visualization.
+## <span style="color: orange;">**Important: Barcode Improvements and LevSeq 2.0 Development**</span>
+**We have identified and resolved demultiplexing challenges in the original barcode set.** Version 1.4 introduced alignment-aware variant calling to address these issues and significantly improve accuracy.
+**We are actively developing LevSeq 2.0** in collaboration with DTU and AITHYRA to fundamentally redesign the barcode system. The updated approach includes:
+- **Enhanced barcode design**: New barcodes will be strain-aware and sequence-aware, generated using an advanced barcode design tool
+- **Reversed workflow architecture**: LevSeq 2.0 will perform alignment first, then demultiplexing (rather than the current demultiplexing-first approach), resolving issues with forward and reverse read handling
+- **Improved accuracy**: These changes will provide more robust demultiplexing and variant calling across diverse experimental conditions
+**Please reach out to us at ylong@caltech.edu if you are planning to order barcoded primers now**
+## Notes
+LevSeq was designed for epPCR and SSM experiments, however, we are currently extending it to work for other enzyme engineering designs as well, the current features are under development:
+1. Insertion handling (see version 4.1.3) - thanks to  Brian Zhong for his contributions to this section!
+2. Gene calling (handling different genes, use the `--oligopool` flag)
+If you notice any issues with new features or have adapted the LevSeq code for your own use cases, we would love community contributions! Please submit either an issue, or a pull request and we will aim to incorperate the changes.
+Performance update: demultiplexing now runs in parallel batches of 8 plates and input FASTQs are staged once per run, improving throughput on multi-core systems.
 ## Quick Start
+Note the current stable version is: `1.5`, the latest version is `1.5`.
+For stable releases these are made available via docker and pip. For latest versions, please clone the repo and install locally (see *Local development or install of latest version* below).
 ### Docker Installation (Recommended)
 1. Install Docker: [https://docs.docker.com/engine/install/](https://docs.docker.com/engine/install/)
@@ -183,6 +222,16 @@ For the wet lab protocol:
 - **Advanced Usage**: See the [manuscript notebook](https://github.com/fhalab/LevSeq/blob/main/manuscript/notebooks/epPCR_10plates.ipynb)
 - **Troubleshooting**: See our [computational protocols wiki](https://github.com/fhalab/LevSeq/wiki/Computational-protocols)
+### Local development or install of latest version
+```
+conda create --name levseq python=3.10
+git clone git@github.com:fhalab/LevSeq.git
+cd LevSeq
+python setup.py sdist bdist_wheel
+pip install dist/levseq-1.4.3.tar.gz
+```
 ## Citing LevSeq
 If you find LevSeq useful, please cite our paper:

{levseq-1.4.2 → levseq-1.5}/tests/test_variant_calling.py RENAMED

@@ -283,7 +283,7 @@ class TestVariantCalling(TestClass):
     def test_calling_variant_with_insert(self):
         u.dp(["Testing calling variants using SSM with error"])
+        # ToDo: Update this with new calling need a new test for this
         parent_sequence = "ATGAGT"
         mutated_sequence = 'ATGAGT' # Not actually mutated
         parent_name = 'parent'

levseq-1.4.2/levseq/filter_orientation.py DELETED

@@ -1,115 +0,0 @@
-from Bio import SeqIO
-from Bio.Seq import Seq
-import os
-from pathlib import Path
-import logging
-from Bio.Align import PairwiseAligner
-import shutil
-from concurrent.futures import ThreadPoolExecutor, as_completed
-from tqdm import tqdm
-def calculate_alignment_score(seq1, seq2):
-    """Calculate alignment score between two sequences using PairwiseAligner."""
-    aligner = PairwiseAligner()
-    aligner.mode = 'global'
-    alignment = aligner.align(seq1, seq2)[0]
-    return alignment.score / max(len(seq1), len(seq2))
-def filter_single_file(args):
-    """
-    Filter a single fastq file. Used for parallel processing.
-    Args:
-        args: tuple containing (input_file, parent_seq, parent_rev_comp)
-    Returns:
-        tuple: (file_path, total_reads, kept_reads, filtered_records)
-    """
-    input_file, parent_seq, parent_rev_comp = args
-    kept_reads = []
-    total_reads = 0
-    kept_count = 0
-    is_forward = "forward" in str(input_file).lower()
-    for record in SeqIO.parse(input_file, "fastq"):
-        total_reads += 1
-        seq = str(record.seq)
-        forward_score = calculate_alignment_score(seq, str(parent_seq))
-        reverse_score = calculate_alignment_score(seq, str(parent_rev_comp))
-        # If it's in forward file (plate barcode was rev comp)
-        # Then read should align to reverse complement parent sequence
-        if is_forward and reverse_score > forward_score:
-            kept_reads.append(record)
-            kept_count += 1
-        # If it's in reverse file (plate barcode was forward)
-        # Then read was already reverse complemented by demultiplexer
-        # So it should align to forward parent sequence
-        elif not is_forward and forward_score > reverse_score:
-            kept_reads.append(record)
-            kept_count += 1
-    return str(input_file), total_reads, kept_count, kept_reads
-def filter_demultiplexed_folder(experiment_folder, parent_sequence, num_threads=8):
-    """
-    Filter demultiplexed files using multiple threads.
-    Args:
-        experiment_folder (str): Path to experiment folder containing RBC/FBC structure
-        parent_sequence (str): Parent sequence for alignment checking
-        num_threads (int): Number of threads to use
-    """
-    exp_path = Path(experiment_folder)
-    filtered_counts = {}
-    # Prepare parent sequences once
-    parent_seq = Seq(parent_sequence)
-    parent_rev_comp = parent_seq.reverse_complement()
-    # Collect all fastq files
-    fastq_files = []
-    for rbc_dir in exp_path.glob("RB*"):
-        if not rbc_dir.is_dir():
-            continue
-        for fbc_dir in rbc_dir.glob("NB*"):
-            if not fbc_dir.is_dir():
-                continue
-            fastq_files.extend(list(fbc_dir.glob("*.fastq")))
-    if not fastq_files:
-        logging.warning(f"No fastq files found in {experiment_folder}")
-        return filtered_counts
-    # Prepare arguments for parallel processing
-    file_args = [(f, parent_seq, parent_rev_comp) for f in fastq_files]
-    # Process files in parallel with progress bar
-    with ThreadPoolExecutor(max_workers=num_threads) as executor:
-        futures = [executor.submit(filter_single_file, args) for args in file_args]
-        with tqdm(total=len(fastq_files), desc="Filtering files") as pbar:
-            for future in as_completed(futures):
-                try:
-                    file_path, total, kept, filtered_records = future.result()
-                    # Write filtered reads
-                    temp_file = Path(file_path).parent / f"temp_{Path(file_path).name}"
-                    SeqIO.write(filtered_records, temp_file, "fastq")
-                    shutil.move(str(temp_file), file_path)
-                    filtered_counts[file_path] = {
-                        'total': total,
-                        'kept': kept,
-                        'filtered': total - kept
-                    }
-                    logging.info(f"Processed {file_path}: {kept}/{total} reads kept")
-                    pbar.update(1)
-                except Exception as e:
-                    logging.error(f"Error processing file {file_path}: {str(e)}")
-                    pbar.update(1)
-    return filtered_counts