PyPI - levseq - Versions diffs - 1.3.2__tar.gz → 1.4.0__tar.gz - Mend

levseq 1.3.2.tar.gz → 1.4.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (42)

{levseq-1.3.2/levseq.egg-info → levseq-1.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: levseq
-Version: 1.3.2
+Version: 1.4.0
 Home-page: https://github.com/fhalab/levseq/
 Author: Yueming Long, Ariane Mora, Francesca-Zhoufan Li, Emre Gursoy
 Author-email: ylong@caltech.edu
@@ -52,9 +52,13 @@ In directed evolution, sequencing every variant enhances data insight and create
 ![Figure 1: LevSeq Workflow](manuscript/figures/LevSeq_Figure-1.jpeg)
 Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore technology. This diagram illustrates the key steps in the process, from sample preparation to data analysis and visualization.
+## Website
+A beta website is available [here](https://levseqdb.streamlit.app/) you just load directly your output from LevSeq and your LCMS results and get visualisations and per plate normalizations.
+## Data
 - Data to reproduce the results and to test are available on zenodo [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13694463.svg)](https://doi.org/10.5281/zenodo.13694463)
-- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://levseqdb.streamlit.app/) and code to host locally [here](https://github.com/fhalab/LevSeq_db)
+- A dockerized website and database for labs to locally host and visualize their data:  website is available [here](https://levseqdb.streamlit.app/) and code to host locally [here](https://github.com/fhalab/LevSeq_db)
 ## Setup
@@ -87,6 +91,10 @@ conda create --name levseq python=3.12 -y
 conda activate levseq
 ```
+```
+pip install levseq
+```
 #### Dependencies
 1. Samtools: https://www.htslib.org/download/
@@ -100,6 +108,11 @@ conda install -c bioconda -c conda-forge samtools
 ```
 conda install -c bioconda -c conda-forge minimap2
 ```
+3. gcc 13 and 14 on Mac M1 through M4 chips
+```
+brew install gcc@13
+brew install gcc@14
+```
 ### Docker Installation (Recommended for full pipeline)
 For installing the whole pipeline, you'll need to use the docker image. For this, install docker as required for your
 operating system (https://docs.docker.com/engine/install/).

{levseq-1.3.2 → levseq-1.4.0}/README.md RENAMED

@@ -5,9 +5,13 @@ In directed evolution, sequencing every variant enhances data insight and create
 ![Figure 1: LevSeq Workflow](manuscript/figures/LevSeq_Figure-1.jpeg)
 Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore technology. This diagram illustrates the key steps in the process, from sample preparation to data analysis and visualization.
+## Website
+A beta website is available [here](https://levseqdb.streamlit.app/) you just load directly your output from LevSeq and your LCMS results and get visualisations and per plate normalizations.
+## Data
 - Data to reproduce the results and to test are available on zenodo [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13694463.svg)](https://doi.org/10.5281/zenodo.13694463)
-- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://levseqdb.streamlit.app/) and code to host locally [here](https://github.com/fhalab/LevSeq_db)
+- A dockerized website and database for labs to locally host and visualize their data:  website is available [here](https://levseqdb.streamlit.app/) and code to host locally [here](https://github.com/fhalab/LevSeq_db)
 ## Setup
@@ -40,6 +44,10 @@ conda create --name levseq python=3.12 -y
 conda activate levseq
 ```
+```
+pip install levseq
+```
 #### Dependencies
 1. Samtools: https://www.htslib.org/download/
@@ -53,6 +61,11 @@ conda install -c bioconda -c conda-forge samtools
 ```
 conda install -c bioconda -c conda-forge minimap2
 ```
+3. gcc 13 and 14 on Mac M1 through M4 chips
+```
+brew install gcc@13
+brew install gcc@14
+```
 ### Docker Installation (Recommended for full pipeline)
 For installing the whole pipeline, you'll need to use the docker image. For this, install docker as required for your
 operating system (https://docs.docker.com/engine/install/).
@@ -138,4 +151,4 @@ If you have found LevSeq useful, please cite our [paper](https://pubs.acs.org/do
 #### Contact
-Leave a feature request in the issues or reach us via [email](mailto:levseqdb@gmail.com).
+Leave a feature request in the issues or reach us via [email](mailto:levseqdb@gmail.com).
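The README hunks above list samtools and minimap2 as external dependencies. As a minimal sketch, assuming both tools are expected on `PATH`, a pre-flight check can be written with the standard library alone. The helper name `missing_tools` is hypothetical and not part of levseq.

```python
import shutil


def missing_tools(tools=("samtools", "minimap2")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("samtools and minimap2 found on PATH")
```

Running a check like this before the pipeline starts may give a clearer message than a failed shell command later on.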

{levseq-1.3.2 → levseq-1.4.0}/levseq/__init__.py RENAMED

@@ -18,7 +18,7 @@
 __title__ = 'levseq'
 __description__ = 'LevSeq nanopore sequencing'
 __url__ = 'https://github.com/fhalab/levseq/'
-__version__ = '1.3.2'
+__version__ = '1.4.0'
 __author__ = 'Yueming Long, Ariane Mora, Francesca-Zhoufan Li, Emre Gursoy'
 __author_email__ = 'ylong@caltech.edu'
 __license__ = 'GPL3'
@@ -31,4 +31,4 @@ from levseq.cmd import *
 from levseq.utils import *
 from levseq.simulation import *
 from levseq.user import *
+from levseq.filter_orientation import *

{levseq-1.3.2 → levseq-1.4.0}/levseq/barcoding/demultiplex-arm64 RENAMED

Binary file

levseq-1.4.0/levseq/barcoding/demultiplex-x86 ADDED

Binary file

levseq-1.4.0/levseq/filter_orientation.py ADDED

@@ -0,0 +1,115 @@
+from Bio import SeqIO
+from Bio.Seq import Seq
+import os
+from pathlib import Path
+import logging
+from Bio.Align import PairwiseAligner
+import shutil
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from tqdm import tqdm
+def calculate_alignment_score(seq1, seq2):
+    """Calculate alignment score between two sequences using PairwiseAligner."""
+    aligner = PairwiseAligner()
+    aligner.mode = 'global'
+    alignment = aligner.align(seq1, seq2)[0]
+    return alignment.score / max(len(seq1), len(seq2))
+def filter_single_file(args):
+    """
+    Filter a single fastq file. Used for parallel processing.
+    Args:
+        args: tuple containing (input_file, parent_seq, parent_rev_comp)
+    Returns:
+        tuple: (file_path, total_reads, kept_reads, filtered_records)
+    """
+    input_file, parent_seq, parent_rev_comp = args
+    kept_reads = []
+    total_reads = 0
+    kept_count = 0
+    is_forward = "forward" in str(input_file).lower()
+    for record in SeqIO.parse(input_file, "fastq"):
+        total_reads += 1
+        seq = str(record.seq)
+        forward_score = calculate_alignment_score(seq, str(parent_seq))
+        reverse_score = calculate_alignment_score(seq, str(parent_rev_comp))
+        # If it's in forward file (plate barcode was rev comp)
+        # Then read should align to reverse complement parent sequence
+        if is_forward and reverse_score > forward_score:
+            kept_reads.append(record)
+            kept_count += 1
+        # If it's in reverse file (plate barcode was forward)
+        # Then read was already reverse complemented by demultiplexer
+        # So it should align to forward parent sequence
+        elif not is_forward and forward_score > reverse_score:
+            kept_reads.append(record)
+            kept_count += 1
+    return str(input_file), total_reads, kept_count, kept_reads
+def filter_demultiplexed_folder(experiment_folder, parent_sequence, num_threads=8):
+    """
+    Filter demultiplexed files using multiple threads.
+    Args:
+        experiment_folder (str): Path to experiment folder containing RBC/FBC structure
+        parent_sequence (str): Parent sequence for alignment checking
+        num_threads (int): Number of threads to use
+    """
+    exp_path = Path(experiment_folder)
+    filtered_counts = {}
+    # Prepare parent sequences once
+    parent_seq = Seq(parent_sequence)
+    parent_rev_comp = parent_seq.reverse_complement()
+    # Collect all fastq files
+    fastq_files = []
+    for rbc_dir in exp_path.glob("RB*"):
+        if not rbc_dir.is_dir():
+            continue
+        for fbc_dir in rbc_dir.glob("NB*"):
+            if not fbc_dir.is_dir():
+                continue
+            fastq_files.extend(list(fbc_dir.glob("*.fastq")))
+    if not fastq_files:
+        logging.warning(f"No fastq files found in {experiment_folder}")
+        return filtered_counts
+    # Prepare arguments for parallel processing
+    file_args = [(f, parent_seq, parent_rev_comp) for f in fastq_files]
+    # Process files in parallel with progress bar
+    with ThreadPoolExecutor(max_workers=num_threads) as executor:
+        futures = [executor.submit(filter_single_file, args) for args in file_args]
+        with tqdm(total=len(fastq_files), desc="Filtering files") as pbar:
+            for future in as_completed(futures):
+                try:
+                    file_path, total, kept, filtered_records = future.result()
+                    # Write filtered reads
+                    temp_file = Path(file_path).parent / f"temp_{Path(file_path).name}"
+                    SeqIO.write(filtered_records, temp_file, "fastq")
+                    shutil.move(str(temp_file), file_path)
+                    filtered_counts[file_path] = {
+                        'total': total,
+                        'kept': kept,
+                        'filtered': total - kept
+                    }
+                    logging.info(f"Processed {file_path}: {kept}/{total} reads kept")
+                    pbar.update(1)
+                except Exception as e:
+                    logging.error(f"Error processing file {file_path}: {str(e)}")
+                    pbar.update(1)
+    return filtered_counts
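Review note on the `filter_demultiplexed_folder` hunk above: the `except` branch logs `file_path`, which is only assigned once `future.result()` has returned. If a worker fails, the handler would raise `UnboundLocalError` on the first completed future, or report the path of an earlier file. A minimal sketch of one way around this, using only the standard library, keeps a future-to-path mapping. `count_chars` is a hypothetical stand-in for `filter_single_file`, and `process_all` is not levseq code.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed


def count_chars(path):
    """Hypothetical stand-in for filter_single_file: fails on 'bad' inputs."""
    if "bad" in path:
        raise ValueError(f"cannot parse {path}")
    return len(path)


def process_all(paths, worker=count_chars, num_threads=4):
    """Run `worker` on every path; return (results, failures) keyed by path."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # Remember which input each future belongs to, so the path is
        # available even when future.result() raises.
        future_to_path = {executor.submit(worker, p): p for p in paths}
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                logging.error("Error processing file %s: %s", path, exc)
                failures[path] = str(exc)
    return results, failures
```

With this shape the error message always names the file that actually failed, and the caller can inspect `failures` afterwards.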

{levseq-1.3.2 → levseq-1.4.0}/levseq/interface.py RENAMED

@@ -63,6 +63,9 @@ def build_cli_parser():
     optional_args_group.add_argument("--skip_variantcalling",
                                      action="store_true",
                                      help="Skip the variant calling step, default is false")
+    optional_args_group.add_argument("--oligopool",
+                                     action="store_true",
+                                     help="Whether this experiment came from an oligopool, default is false.")
     optional_args_group.add_argument("--show_msa",
                                      default=False,
                                      help="Skip showing msa")
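The new `--oligopool` option uses `action="store_true"`, so it is `False` unless the flag is present. The neighbouring `--show_msa` option has `default=False` but no action or type, so any value supplied on the command line is stored as a string. A minimal sketch with a stand-alone parser (an illustration, not levseq's `build_cli_parser`) shows the difference:

```python
import argparse


def build_demo_parser():
    """Stand-alone parser mirroring two options from the hunk above."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--oligopool", action="store_true",
                        help="Whether this experiment came from an oligopool")
    # No action or type: any supplied value is stored as a string.
    parser.add_argument("--show_msa", default=False, help="Skip showing msa")
    return parser


demo_parser = build_demo_parser()
print(vars(demo_parser.parse_args([])))
print(vars(demo_parser.parse_args(["--oligopool", "--show_msa", "False"])))
```

In the second call `show_msa` is the string `"False"`, which is truthy, whereas `oligopool` is a real boolean.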

{levseq-1.3.2 → levseq-1.4.0}/levseq/run_levseq.py RENAMED

@@ -17,7 +17,7 @@
 # Import MinION objects
 from levseq import *
+from levseq.filter_orientation import filter_demultiplexed_folder
 # Import external packages
 import logging
 from pathlib import Path
@@ -221,13 +221,14 @@ def demux_fastq(file_to_fastq, result_folder, barcode_path):
         executable_path = package_root / "levseq" / "barcoding" / executable_name
     if not executable_path.exists():
         raise FileNotFoundError(f"Executable not found: {executable_path}")
-    seq_min = 200
+    seq_min = 200
     seq_max = 10000
     prompt = f"{executable_path} -f {file_to_fastq} -d {result_folder} -b {barcode_path} -w 100 -r 100 -m {seq_min} -x {seq_max}"
     subprocess.run(prompt, shell=True, check=True)
 # Variant calling using VariantCaller class
-def call_variant(experiment_name, experiment_folder, template_fasta, filtered_barcodes):
+def call_variant(experiment_name, experiment_folder, template_fasta, filtered_barcodes, threshold=0.5, oligopool=False):
     try:
         vc = VariantCaller(
             experiment_name,
@@ -236,8 +237,9 @@ def call_variant(experiment_name, experiment_folder, template_fasta, filtered_ba
             filtered_barcodes,
             padding_start=0,
             padding_end=0,
+            oligopool=oligopool
         )
-        variant_df = vc.get_variant_df(threshold=0.5, min_depth=5)
+        variant_df = vc.get_variant_df(threshold=threshold, min_depth=5)
         logging.info("Variant calling to create consensus reads successful")
         return variant_df
     except Exception as e:
@@ -441,6 +443,63 @@ def save_csv(df, outputdir, name):
     file_path = os.path.join(outputdir, "Results", name + ".csv")
     df.to_csv(file_path)
+# Function to process the reference CSV and generate variants
+def process_ref_csv_oligopool(cl_args, tqdm_fn=tqdm.tqdm):
+    ref_df = pd.read_csv(cl_args["summary"])
+    result_folder = create_result_folder(cl_args)
+    variant_csv_path = os.path.join(result_folder, "variants.csv")
+    variant_df = pd.DataFrame(columns=["barcode_plate", "name", "refseq", "variant"])
+    # First get the different barcode plates (these will be unique)
+    barcode_plates = ref_df["barcode_plate"].unique()
+    ref_df["barcode_index"] = [i for i in range(len(ref_df))]
+    barcode_to_index = dict(zip(ref_df.barcode_plate, ref_df.barcode_index))
+    for barcode_plate in barcode_plates:
+        if not cl_args["skip_demultiplexing"]:
+            i = barcode_to_index[barcode_plate]
+            name_folder = os.path.join(result_folder, f'RB{barcode_plate}')
+            os.makedirs(name_folder, exist_ok=True)
+            barcode_path = filter_bc(cl_args, name_folder, i)
+            output_dir = Path(result_folder) / f"{cl_args['name']}_fastq"
+            output_dir.mkdir(parents=True, exist_ok=True)
+            file_to_fastq = cat_fastq_files(cl_args.get("path"), output_dir)
+            try:
+                demux_fastq(output_dir, name_folder, barcode_path)
+            except Exception as e:
+                logging.error("An error occurred during demultiplexing for sample {}. Skipping this sample.".format(barcode_plate), exc_info=True)
+                continue
+            # Check this - need to see if the code works... ToDo: Ariane
+    # Now they are all demultiplexed, we can call variants
+    if not cl_args["skip_variantcalling"]:
+        for i, row in tqdm_fn(ref_df.iterrows(), total=len(ref_df), desc="Processing Samples"):
+            barcode_plate = row["barcode_plate"]
+            name = row["name"]
+            refseq = row["refseq"].upper()
+            # Get the name folder and barcode path
+            temp_fasta_path = os.path.join(result_folder, f"temp_{name}.fasta")
+            if not os.path.exists(temp_fasta_path):
+                with open(temp_fasta_path, "w") as f:
+                    f.write(f">{name}\n{refseq}\n")
+            else:
+                logging.info(f"Fasta file for {name} already exists. Skipping write.")
+            try:
+                filtered_barcodes = filter_bc(cl_args, result_folder, i)
+                variant_result = call_variant(f"{name}", result_folder, temp_fasta_path, filtered_barcodes,
+                                              oligopool=True)
+                variant_result["barcode_plate"] = barcode_plate
+                variant_result["name"] = name
+                variant_result["refseq"] = refseq
+                variant_df = pd.concat([variant_df, variant_result])
+            except Exception as e:
+                logging.error("An error occurred during variant calling for sample {}. Skipping this sample.".format(name), exc_info=True)
+                continue
+        variant_df.to_csv(variant_csv_path, index=False)
+    # visualize it as well
+    return variant_df, ref_df
 # Function to process the reference CSV and generate variants
 def process_ref_csv(cl_args, tqdm_fn=tqdm.tqdm):
     ref_df = pd.read_csv(cl_args["summary"])
@@ -472,14 +531,30 @@ def process_ref_csv(cl_args, tqdm_fn=tqdm.tqdm):
             file_to_fastq = cat_fastq_files(cl_args.get("path"), output_dir)
             try:
                 demux_fastq(output_dir, name_folder, barcode_path)
+                # Add filtering step here with multithreading
+                filtered_counts = filter_demultiplexed_folder(
+                        name_folder,
+                        refseq,
+                        num_threads=10
+                )
+                logging.info(f"Orientation filtering completed for {name}")
+                total_reads = sum(counts['total'] for counts in filtered_counts.values())
+                kept_reads = sum(counts['kept'] for counts in filtered_counts.values())
+                logging.info(f"Total filtering results: {kept_reads}/{total_reads} reads kept ({kept_reads/total_reads*100:.2f}%)")
+                for file, counts in filtered_counts.items():
+                    logging.info(f"{file}: {counts['kept']}/{counts['total']} reads kept")
             except Exception as e:
-                logging.error("An error occurred during demultiplexing for sample {}. Skipping this sample.".format(name), exc_info=True)
+                logging.error("An error occurred during demultiplexing/filtering for sample {}. Skipping this sample.".format(name), exc_info=True)
                 continue
         if not cl_args["skip_variantcalling"]:
             try:
+                threshold = cl_args.get("threshold") if cl_args.get("threshold") is not None else 0.5
                 variant_result = call_variant(
-                    f"{name}", name_folder, temp_fasta_path, barcode_path
+                    f"{name}", name_folder, temp_fasta_path, barcode_path, threshold=threshold
                 )
                 variant_result["barcode_plate"] = barcode_plate
                 variant_result["name"] = name
@@ -493,6 +568,7 @@ def process_ref_csv(cl_args, tqdm_fn=tqdm.tqdm):
     variant_df.to_csv(variant_csv_path, index=False)
     return variant_df, ref_df
 # Main function to run LevSeq and ensure saving of intermediate results if an error occurs
 def run_LevSeq(cl_args, tqdm_fn=tqdm.tqdm):
     result_folder = create_result_folder(cl_args)
@@ -504,9 +580,12 @@ def run_LevSeq(cl_args, tqdm_fn=tqdm.tqdm):
     logging.info("Logging configured. Starting program.")
     variant_df = pd.DataFrame(columns=["barcode_plate", "name", "refseq", "variant"])
     try:
-        variant_df, ref_df = process_ref_csv(cl_args, tqdm_fn)
+        if cl_args["oligopool"]:
+            variant_df, ref_df = process_ref_csv_oligopool(cl_args, tqdm_fn)
+        else:
+            variant_df, ref_df = process_ref_csv(cl_args, tqdm_fn)
         ref_df_path = os.path.join(ref_folder, cl_args["name"]+".csv")
         ref_df.to_csv(ref_df_path, index=False)
@@ -529,6 +608,8 @@ def run_LevSeq(cl_args, tqdm_fn=tqdm.tqdm):
         df_variants, df_vis = create_df_v(variant_df)
         processed_csv = os.path.join(result_folder, "visualization_partial.csv")
         df_vis.to_csv(processed_csv, index=False)
+        if cl_args["oligopool"]:
+            make_oligopool_plates(df_vis, result_folder=result_folder, save_files=True)
     except Exception as e:
         processed_csv = os.path.join(result_folder, "visualization_partial.csv")
         if 'df_vis' in locals():
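Review note on the logging added to `process_ref_csv`: the summary line computes `kept_reads/total_reads*100`. When `filter_demultiplexed_folder` returns an empty dict (its no-fastq branch), `total_reads` is 0 and the division raises `ZeroDivisionError`, which the surrounding `except` then reports as a demultiplexing/filtering error and skips the sample. A minimal sketch of a guarded summary follows; `summarise_counts` is a hypothetical helper, not part of levseq.

```python
def summarise_counts(filtered_counts):
    """Aggregate per-file {'total', 'kept', 'filtered'} dicts into one summary."""
    total = sum(c["total"] for c in filtered_counts.values())
    kept = sum(c["kept"] for c in filtered_counts.values())
    # Guard the percentage so an empty result does not divide by zero.
    percent = (kept / total * 100) if total else 0.0
    return {"total": total, "kept": kept, "percent_kept": percent}
```

The input shape matches the dict built in `filter_demultiplexed_folder`, keyed by file path.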

{levseq-1.3.2 → levseq-1.4.0}/levseq/utils.py RENAMED

@@ -59,12 +59,13 @@ def translate(seq):
         'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L',
         'TAC': 'Y', 'TAT': 'Y', 'TAA': '*', 'TAG': '*',
         'TGC': 'C', 'TGT': 'C', 'TGA': '*', 'TGG': 'W',
+        'GTS': "X"
     }
     protein = ""
     if len(seq) % 3 == 0:
         for i in range(0, len(seq), 3):
             codon = seq[i:i + 3]
-            protein += table[codon]
+            protein += table.get(codon, 'X')
     return protein
@@ -290,8 +291,7 @@ def get_reads_for_well(parent_name, bam_file_path: str, ref_str: str, msa_path=N
     insert_map = defaultdict(list)
     for read in bam.fetch(until_eof=True):
         # Ensure we have at least 75% coverage
-        if read.query_sequence is not None and len(read.query_sequence) > 0.75 * len(
-                ref_str) and read.cigartuples is not None:
+        if read.query_sequence is not None and read.cigartuples is not None: # and len(read.query_sequence) > 0.75 * len(ref_str) and read.cigartuples is not None:
             seq, ref, qual, ins = alignment_from_cigar(read.cigartuples, read.query_sequence, ref_str,
                                                        read.query_qualities)
             # Make it totally align
@@ -313,16 +313,17 @@ def get_reads_for_well(parent_name, bam_file_path: str, ref_str: str, msa_path=N
     # Do this for all wells
     seq_df = make_well_df_from_reads(seqs, read_ids, read_quals)
     alignment_count = len(seq_df.values)
-    rows_all = make_row_from_read_pileup_across_well(seq_df, ref_str, parent_name, insert_map)
-    bam.close()
-    if len(rows_all) > 2:  # Check if we have anything to return
-        seq_df = pd.DataFrame(rows_all)
-        seq_df.columns = ['gene_name', 'position', 'ref', 'most_frequent', 'freq_non_ref', 'total_other',
-                          'total_reads', 'p_value', 'percent_most_freq_mutation', 'A', 'p(a)', 'T', 'p(t)', 'G', 'p(g)',
-                          'C', 'p(c)', 'N', 'p(n)', 'I', 'p(i)', 'Warnings']
-        return calculate_mutation_significance_across_well(seq_df), alignment_count
+    if alignment_count > 0:
+        rows_all = make_row_from_read_pileup_across_well(seq_df, ref_str, parent_name, insert_map)
+        bam.close()
+        if len(rows_all) > 2:  # Check if we have anything to return
+            seq_df = pd.DataFrame(rows_all)
+            seq_df.columns = ['gene_name', 'position', 'ref', 'most_frequent', 'freq_non_ref', 'total_other',
+                              'total_reads', 'p_value', 'percent_most_freq_mutation', 'A', 'p(a)', 'T', 'p(t)', 'G', 'p(g)',
+                              'C', 'p(c)', 'N', 'p(n)', 'I', 'p(i)', 'Warnings']
+            return calculate_mutation_significance_across_well(seq_df), alignment_count
+    return None, 0
 def make_row_from_read_pileup_across_well(well_df, ref_str, label, insert_map):
     """
     Given a pileup of reads, we want to get some summary information about that sequence
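The `translate` hunk above replaces `table[codon]` with `table.get(codon, 'X')`, so a codon missing from the table now yields `X` instead of raising `KeyError`. A minimal sketch of that behaviour, using a deliberately tiny codon table (an assumption for illustration; the table in `levseq/utils.py` is larger):

```python
# Deliberately tiny codon table for illustration only.
DEMO_TABLE = {"ATG": "M", "GTT": "V", "TGG": "W", "TAA": "*"}


def translate_demo(seq, table=DEMO_TABLE):
    """Translate `seq` codon by codon, mapping unknown codons to 'X'."""
    protein = ""
    # As in the hunk above, a sequence whose length is not a multiple of 3
    # is not translated and yields an empty string.
    if len(seq) % 3 == 0:
        for i in range(0, len(seq), 3):
            protein += table.get(seq[i:i + 3], "X")
    return protein
```

For example, `translate_demo("ATGGTSTGG")` returns `"MXW"`, because the degenerate codon `GTS` is absent from the demo table.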

{levseq-1.3.2 → levseq-1.4.0}/levseq/variantcaller.py RENAMED

@@ -51,11 +51,13 @@ class VariantCaller:
     """
-    def __init__(self, experiment_name, experiment_folder: Path, template_fasta: Path, barcode_path: Path, padding_start: int = 0, padding_end: int = 0) -> None:
+    def __init__(self, experiment_name, experiment_folder: Path, template_fasta: Path, barcode_path: Path,
+                 padding_start: int = 0, padding_end: int = 0, oligopool=True) -> None:
         self.barcode_path = barcode_path
         self.experiment_name = experiment_name
         self.experiment_folder = experiment_folder
         self.padding_start = padding_start
+        self.oligopool = oligopool
         self.padding_end = padding_end
         self.template_fasta = template_fasta
         self.alignment_name = 'alignment_minimap'
@@ -90,9 +92,15 @@ class VariantCaller:
                 renamed_ids.append(f'{plate}_{well}')
                 plates.append(experiment_name)
                 wells.append(well)
-                self.variant_dict[f'{plate}_{well}'] = {'Plate': experiment_name, 'Well': well,
-                                                        'Barcodes': f'{reverse_barcode}_{forward_barcode}',
-                                                        'Path': os.path.join(self.experiment_folder, f'{reverse_barcode}/{forward_barcode}')}
+                if self.oligopool:
+                    self.variant_dict[f'{plate}_{well}'] = {'Plate': experiment_name, 'Well': well,
+                                                            'Barcodes': f'{reverse_barcode}_{forward_barcode}',
+                                                            'Path': os.path.join(self.experiment_folder,
+                                                                                 f'{reverse_barcode}/{reverse_barcode}/{forward_barcode}')}
+                else:
+                    self.variant_dict[f'{plate}_{well}'] = {'Plate': experiment_name, 'Well': well,
+                                                            'Barcodes': f'{reverse_barcode}_{forward_barcode}',
+                                                            'Path': os.path.join(self.experiment_folder, f'{reverse_barcode}/{forward_barcode}')}
         df = pd.DataFrame()
         df['Plate'] = plates
         df['Well'] = wells
@@ -100,14 +108,6 @@ class VariantCaller:
         df['ID'] = renamed_ids
         return df
-    @staticmethod
-    def load_reference(reference_path):
-        # The reference enables multiple parents to be used for different
-        # WARNING: this assumes all the parents are the same
-        ref_seq = str(SeqIO.read(template_fasta,'fasta').seq)
-        barcode_to_plate_name = experiment_name
-        return 'Parent', ref_seq, barcode_to_plate_name
     @staticmethod
     def _barcode_to_well(barcode):
         match = re.search(r'\d+', barcode)
@@ -124,28 +124,32 @@ class VariantCaller:
         try:
             all_fastq = os.path.join(output_dir, '*.fastq')
             fastq_list = glob.glob(all_fastq)
-            fastq_files = os.path.join(output_dir, f"demultiplexed_{filename}.fastq")
+            fastq_files = all_fastq # os.path.join(output_dir, f"demultiplexed_{filename}.fastq")
-            if not fastq_list:
+            if not all_fastq:
                 logger.error("No FASTQ files found in the specified output directory.")
                 return
-            # Combining fastq files into one
-            with open(fastq_files, 'w') as outfile:
-                for fastq in fastq_list:
-                    with open(fastq, 'r') as infile:
-                        outfile.write(infile.read())
+            # Combining fastq files into one if there are more than 1
+            if len(fastq_list) > 1:
+                with open(fastq_files, 'w') as outfile:
+                    for fastq in fastq_list:
+                        with open(fastq, 'r') as infile:
+                            outfile.write(infile.read())
+            else:
+                fastq_files = fastq_list[0]
             # Alignment using minimap2
             minimap_cmd = f"minimap2 -ax map-ont -A {scores[0]} -B {scores[1]} -O {scores[2]},24 '{self.template_fasta}' '{fastq_files}' > '{output_dir}/{alignment_name}.sam'"
             subprocess.run(minimap_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+            print(minimap_cmd)
             # Convert SAM to BAM and sort
             view_cmd = f"samtools view -bS '{output_dir}/{alignment_name}.sam' > '{output_dir}/{alignment_name}.bam'"
             subprocess.run(view_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+            print(view_cmd)
             sort_cmd = f"samtools sort '{output_dir}/{alignment_name}.bam' -o '{output_dir}/{alignment_name}.bam'"
             subprocess.run(sort_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+            print(sort_cmd)
             # Index the BAM file
             index_cmd = f"samtools index '{output_dir}/{alignment_name}.bam'"
@@ -163,18 +167,22 @@ class VariantCaller:
             for barcode_id in pbar:
                 try:
                     row = self.variant_dict.get(barcode_id)
-                    bam_file = os.path.join(row["Path"], f'{self.alignment_name}.bam')
+                    bam_file = os.path.join(row["Path"], f'{self.alignment_name}_{barcode_id}.bam')
                     # Check if alignment file exists, if not, align sequences
                     if not os.path.exists(bam_file):
-                        logger.info(f"Aligning sequences for {row['Path']}")
-                        self._align_sequences(row["Path"], row['Barcodes'])
+                       logger.info(f"Aligning sequences for {row['Path']}")
+                    self._align_sequences(row["Path"], row['Barcodes'],
+                                              alignment_name=f'{self.alignment_name}_{barcode_id}')
                     # Placeholder function calls to demonstrate workflow
                     well_df, alignment_count = get_reads_for_well(self.experiment_name, bam_file,
-                                                                  self.ref_str, f'{row["Path"]}/msa.fa')
-                    self.variant_dict[barcode_id]['Alignment Count'] = alignment_count
+                                                                  self.ref_str, f'{row["Path"]}/{self.alignment_name}_{barcode_id}.fa')
                     if well_df is not None:
+                        if self.oligopool:
+                            if len(well_df.values) < 10:
+                                continue
+                        self.variant_dict[barcode_id]['Alignment Count'] = alignment_count
                         well_df.to_csv(f"{row['Path']}/seq_{barcode_id}.csv", index=False)
                         label, freq, combined_p_value, mixed_well, avg_error_rate = get_variant_label_for_well(well_df, threshold)
                         self.variant_dict[barcode_id]['Variant'] = label
@@ -187,7 +195,7 @@ class VariantCaller:
                 finally:
                     pbar.update(1)
-    def get_variant_df(self, threshold: float = 0.5, min_depth: int = 5, output_dir='', num_threads=10):
+    def get_variant_df(self, threshold: float = 0.5, min_depth: int = 5, output_dir='', num_threads=20):
         """
         Get Variant Data Frame for all samples in the experiment
@@ -202,26 +210,34 @@ class VariantCaller:
         data = []
         num = int(len(self.variant_df) / num_threads)
         self.variant_df.reset_index(inplace=True)
-        for i in range(0, len(self.variant_df), num):
-            end_i = i + num if i + num < len(self.variant_df) else len(self.variant_df)
-            sub_df = self.variant_df.iloc[i: end_i]['ID'].values
-            sub_data = [sub_df, threshold, min_depth, output_dir]
-            data.append(sub_data)
+        if num_threads > 1:
+            for i in range(0, len(self.variant_df), num):
+                end_i = i + num if i + num < len(self.variant_df) else len(self.variant_df)
+                sub_df = self.variant_df.iloc[i: end_i]['ID'].values
+                sub_data = [sub_df, threshold, min_depth, output_dir]
+                data.append(sub_data)
-        # Thread it
-        pool.map(self._run_variant_thread, data)
+            # Thread it
+            pool.map(self._run_variant_thread, data)
+        else:
+            self._run_variant_thread([self.variant_df, threshold, min_depth, output_dir])
         self.variant_df['Variant'] = [self.variant_dict[b_id].get('Variant') for b_id in self.variant_df['ID'].values]
-        self.variant_df['Mixed Well'] = [self.variant_dict[b_id].get('Mixed Well') for b_id in self.variant_df['ID'].values]
-        self.variant_df['Average mutation frequency'] = [self.variant_dict[b_id].get('Average mutation frequency') for b_id in self.variant_df['ID'].values]
+        self.variant_df['Mixed Well'] = [self.variant_dict[b_id].get('Mixed Well') for b_id in
+                                         self.variant_df['ID'].values]
+        self.variant_df['Average mutation frequency'] = [self.variant_dict[b_id].get('Average mutation frequency') for
+                                                         b_id in self.variant_df['ID'].values]
         self.variant_df['P value'] = [self.variant_dict[b_id].get('P value') for b_id in self.variant_df['ID'].values]
-        self.variant_df['Alignment Count'] = [self.variant_dict[b_id].get('Alignment Count') for b_id in self.variant_df['ID'].values]
-        self.variant_df['Average error rate'] = [self.variant_dict[b_id].get('Average error rate') for b_id in self.variant_df['ID'].values]
+        self.variant_df['Alignment Count'] = [self.variant_dict[b_id].get('Alignment Count') for b_id in
+                                              self.variant_df['ID'].values]
+        self.variant_df['Average error rate'] = [self.variant_dict[b_id].get('Average error rate') for b_id in
+                                                 self.variant_df['ID'].values]
         # Adjust p-values using bonferroni make it simple
-        self.variant_df['P adj. value'] = len(self.variant_df) * self.variant_df["P value"].values
-        self.variant_df['P adj. value'] = [1 if x > 1 else x for x in self.variant_df["P adj. value"].values]
+        self.variant_df['P adj. value'] = [len(self.variant_df) * p if p else None for p in self.variant_df["P value"].values]
+        self.variant_df['P adj. value'] = [1 if x and x > 1 else x for x in self.variant_df["P adj. value"].values]
+        if self.oligopool:
+            # Filter this so we don't get all the junk
+            self.variant_df = self.variant_df[self.variant_df['Alignment Count'] > 2]
         return self.variant_df
     def _get_alignment_count(self, sample_folder_path: Path):
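
The hunk above changes the Bonferroni adjustment so that wells without a p-value no longer break the multiplication. A minimal standalone sketch of that idea follows. The function name `bonferroni_adjust` is hypothetical and not part of levseq, and the handling of an exact zero is an assumption: the diff's truthiness check (`if p`) turns a p-value of 0 into `None`, whereas this sketch keeps it as 0.

```python
import math


def bonferroni_adjust(p_values):
    """Bonferroni-adjust a list of p-values, leaving missing entries as None.

    Hypothetical helper for illustration only; not the levseq implementation.
    """
    # As in the diff, the number of tests is the full length of the input,
    # including entries whose p-value is missing.
    n_tests = len(p_values)
    adjusted = []
    for p in p_values:
        if p is None or (isinstance(p, float) and math.isnan(p)):
            # Missing p-value: nothing to adjust.
            adjusted.append(None)
        else:
            # Multiply by the number of tests and cap at 1.
            adjusted.append(min(1.0, n_tests * p))
    return adjusted


print(bonferroni_adjust([0.01, 0.5, None, float("nan")]))
```

Capping with `min` in the same pass avoids the second list comprehension used in the diff, at the cost of treating NaN and `None` identically.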

{levseq-1.3.2 → levseq-1.4.0}/levseq/visualization.py RENAMED

@@ -63,6 +63,7 @@ from bokeh.events import Tap
 from bokeh.io import save, show, output_file, output_notebook
 import panel as pn
+import seaborn as sns
 from levseq.utils import *
@@ -1147,3 +1148,54 @@ def plot_sequence_alignment(
         toolbar_location=None,
         sizing_mode=sizing_mode,
     )
+def make_oligopool_plates(vis_df, result_folder, save_files=False):
+    """ Simple heatmaps saved as SVGs for oligopool plates."""
+    parents = vis_df[vis_df['amino_acid_substitutions'] == '#PARENT#']
+    top_well_df = parents.sort_values(by='Alignment Count', ascending=False)
+    top_well_df = top_well_df.drop_duplicates('name', keep='first')
+    # This is one of the things that they will want returned
+    if save_files:
+        top_well_df.to_csv(os.path.join(result_folder, 'best_aligned_parents.csv'), index=False)
+    # Now for each plate we make a heatmap
+    vis_df['amino_acid_substitutions'] = [n if x == '#PARENT#' else x for n, x in
+                                      vis_df[['name', 'amino_acid_substitutions']].values]
+    plates = set(vis_df['barcode_plate'].values)
+    # Drop mixed well plates
+    for plate in plates:
+        df = vis_df[vis_df['barcode_plate'] == plate]
+        df = df.sort_values(by='Alignment Count', ascending=False)
+        # Keep only one of the variants per well (i.e. the dominant one)
+        df = df.drop_duplicates('Well')
+        # Reshape into a well format
+        df['Column'] = [int(i[1:]) for i in df['Well'].values]
+        df['Row'] = [i[0] for i in df['Well'].values]
+        df.sort_values(by=['Column', 'Row'], inplace=True, ascending=[False, True])
+        # Load the example flights dataset and convert to long-form
+        platemap = (
+            df
+            .pivot(index="Row", columns="Column", values="Alignment Count")
+        )
+        platemap_labels = (
+            df
+            .pivot(index="Row", columns="Column", values="amino_acid_substitutions")
+        )
+        plot_seaborn_heatmap(platemap, platemap_labels,f'{plate}', result_folder)
+def plot_seaborn_heatmap(platemap, platemap_labels, label: str, result_folder):
+    """ Plot the seaborn platemap using the data"""
+    platemap = platemap.fillna(0)
+    sns.set_theme()
+    f, ax = plt.subplots(figsize=(16, 8))
+    plt.rcParams['svg.fonttype'] = 'none'  # Ensure text is saved as text
+    row_labels = [str(s) for s in list(platemap.index)]
+    col_labels = [str(s) for s in list(platemap.columns)]
+    data = platemap.values
+    pc = sns.heatmap(data, cmap='Reds', annot=platemap_labels.values, xticklabels=col_labels, yticklabels=row_labels,
+                     fmt='', linewidths=.1)
+    ax = pc.axes
+    plt.yticks(rotation=0)
+    plt.setp(ax.get_yticklabels(), ha="center")
+    plt.savefig(os.path.join(result_folder, f'platemap_{label}.svg'))
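
The new `make_oligopool_plates` function keeps the variant with the highest alignment count in each well, splits the well ID into a row letter and a column number, and pivots the result into a plate-shaped grid with empty wells filled with 0. The sketch below reproduces that reshaping step without pandas or seaborn so it can be run anywhere. The name `build_platemap` and the `(well, alignment_count, label)` record layout are assumptions made for illustration; well IDs are assumed to be one letter followed by a number, as in the diff.

```python
def build_platemap(records):
    """Reshape per-well records into plate-shaped grids.

    records: iterable of (well, alignment_count, label), e.g. ("A1", 10, "V1").
    Returns (row_labels, column_labels, counts_grid, labels_grid).
    Hypothetical helper for illustration only; not part of levseq.
    """
    best = {}
    for well, count, label in records:
        # Split "A12" into row "A" and column 12.
        key = (well[0], int(well[1:]))
        # Keep only the dominant variant per well (highest alignment count).
        if key not in best or count > best[key][0]:
            best[key] = (count, label)

    rows = sorted({row for row, _ in best})
    cols = sorted({col for _, col in best})
    # Wells with no record get a count of 0 and an empty label.
    counts = [[best.get((r, c), (0, ""))[0] for c in cols] for r in rows]
    labels = [[best.get((r, c), (0, ""))[1] for c in cols] for r in rows]
    return rows, cols, counts, labels


rows, cols, counts, labels = build_platemap(
    [("A1", 10, "V1"), ("A1", 3, "V2"), ("B2", 5, "V3")]
)
print(rows, cols, counts, labels)
```

The two grids correspond to the `platemap` and `platemap_labels` pivots in the diff, which are then passed to the heatmap as the colour values and the annotations respectively.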

{levseq-1.3.2 → levseq-1.4.0/levseq.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: levseq
-Version: 1.3.2
+Version: 1.4.0
 Home-page: https://github.com/fhalab/levseq/
 Author: Yueming Long, Ariane Mora, Francesca-Zhoufan Li, Emre Gursoy
 Author-email: ylong@caltech.edu
@@ -52,9 +52,13 @@ In directed evolution, sequencing every variant enhances data insight and create
 ![Figure 1: LevSeq Workflow](manuscript/figures/LevSeq_Figure-1.jpeg)
 Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore technology. This diagram illustrates the key steps in the process, from sample preparation to data analysis and visualization.
+## Website
+A beta website is available [here](https://levseqdb.streamlit.app/) you just load directly your output from LevSeq and your LCMS results and get visualisations and per plate normalizations.
+## Data
 - Data to reproduce the results and to test are available on zenodo [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13694463.svg)](https://doi.org/10.5281/zenodo.13694463)
-- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://levseqdb.streamlit.app/) and code to host locally [here](https://github.com/fhalab/LevSeq_db)
+- A dockerized website and database for labs to locally host and visualize their data:  website is available [here](https://levseqdb.streamlit.app/) and code to host locally [here](https://github.com/fhalab/LevSeq_db)
 ## Setup
@@ -87,6 +91,10 @@ conda create --name levseq python=3.12 -y
 conda activate levseq
 ```
+```
+pip install levseq
+```
 #### Dependencies
 1. Samtools: https://www.htslib.org/download/
@@ -100,6 +108,11 @@ conda install -c bioconda -c conda-forge samtools
 ```
 conda install -c bioconda -c conda-forge minimap2
 ```
+3. gcc 13 and 14 on Mac M1 through M4 chips
+```
+brew install gcc@13
+brew install gcc@14
+```
 ### Docker Installation (Recommended for full pipeline)
 For installing the whole pipeline, you'll need to use the docker image. For this, install docker as required for your
 operating system (https://docs.docker.com/engine/install/).

{levseq-1.3.2 → levseq-1.4.0}/levseq.egg-info/SOURCES.txt RENAMED

@@ -7,6 +7,7 @@ levseq/__init__.py
 levseq/basecaller.py
 levseq/cmd.py
 levseq/coordinates.py
+levseq/filter_orientation.py
 levseq/globals.py
 levseq/interface.py
 levseq/parser.py

{levseq-1.3.2 → levseq-1.4.0}/tests/test_deploy.py RENAMED

@@ -21,9 +21,11 @@ import unittest
 import matplotlib.pyplot as plt
 from levseq import *
 from levseq.run_levseq import process_ref_csv
 u = SciUtil()
 import math
 class TestClass(unittest.TestCase):
     @classmethod
@@ -45,47 +47,54 @@ class TestClass(unittest.TestCase):
     def teardown_class(self):
         shutil.rmtree(self.tmp_dir)
 class TestDeploy(TestClass):
     def test_deploy(self):
         cmd_list = [
             'docker',  # Needs to be installed as vina.
             'run',
             '--rm',
             '-v',
-            f'{os.getcwd()}:/levseq_results',
+            f'{os.getcwd()}/test_data/laragen_run:/levseq_results',
             'levseq',
             'test_deploy',
-            'test_data/laragen_run/levseq-1.2.7/',
-            'test_data/laragen_run/20241116-LevSeq-Review-Validation-levseq_ref.csv'
+            'levseq_results/levseq-1.2.7/',
+            'levseq_results/20241116-LevSeq-Review-Validation-levseq_ref.csv'
         ]
+        print(' '.join(cmd_list))
         # ToDo: add in scoring function for ad4
-        cmd_return = subprocess.run(cmd_list, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
-        print(cmd_return.stdout, cmd_return)
+        # cmd_return = subprocess.run(cmd_list, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+        # print(cmd_return.stdout, cmd_return)
     def test_variant_calling(self):
         # Take as input the demultiplexed fastq files and the reference csv file
-        cl_args = {'skip_demultiplexing': True, 'skip_variantcalling': False}
+        cl_args = {'skip_demultiplexing': True, 'skip_variantcalling': False, 'threshold': 0.5}
         cl_args["name"] = 'test_deploy'
         cl_args['path'] = 'test_data/laragen_run/levseq-1.2.7/'
         cl_args["summary"] = 'test_data/laragen_run/20241116-LevSeq-Review-Validation-levseq_ref.csv'
         variant_df, ref_df = process_ref_csv(cl_args)
+        variant_df.to_csv('laragen_test_run.csv')
         # Now we want to check all the variants are the same as in the original case:
         checked_variants_df = pd.read_csv('test_data/laragen_run/levseq-1.2.7/variants_gold_standard.csv')
         checked_variants = checked_variants_df['Variant'].values
-        checked_sig = checked_variants_df['P adj. value'].values
+        checked_sig = checked_variants_df['Average mutation frequency'].values
+        checked_alignments = checked_variants_df['Alignment Count'].values
         i = 0
-        for variant, pval in variant_df[['Variant', 'P adj. value']].values:
+        for variant, freq, alignment_count, pval in variant_df[['Variant', 'Average mutation frequency',
+                                                                'Alignment Count', 'P adj. value']].values:
             print(variant, checked_variants[i])
             if checked_variants[i]:
                 if variant:
                     assert variant == checked_variants[i]
-            # if pval < 0.05:
-            #     assert checked_sig[i] < 0.05
-            # elif math.isnan(pval):
-            #     assert math.isnan(checked_sig[i])
-            # else:
-            #     assert checked_sig[i] >= 0.05
-            print(pval, checked_sig[i])
+                    assert alignment_count == checked_alignments[i]
+                    if freq != checked_sig[i]:
+                        print(freq, checked_sig[i])
             i += 1
+# docker run --rm -v /Users/arianemora/Documents/code/LevSeq/data/degradeo/20250121-JR-IM-HS:/levseq_results levseq 20250121-JR-IM-HS_oligopool levseq_results/ levseq_results/ref_seq_oligopools_single.csv --skip_variantcalling
+# levseq oligpool_20250121-JR-IM-HS /Users/arianemora/Documents/code/LevSeq/data/degradeo/20250121-JR-IM-HS/  /Users/arianemora/Documents/code/LevSeq/data/degradeo/20250121-JR-IM-HS/ref_seq_oligopools_all.csv --skip_variantcalling
+# levseq results results/  /Users/arianemora/Documents/code/LevSeq/data/degradeo/20250121-JR-IM-HS/ref_seq_oligopools_all.csv --skip_demultiplexing --oligopool
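
The `test_deploy` change mounts the data directory itself at `/levseq_results` and then refers to the run folder and reference CSV by paths under that mount, rather than by host paths. A small sketch of assembling and printing such a command is shown below. The helper name `build_levseq_docker_cmd` and its parameters are hypothetical; the argument order (image, run name, data path, reference CSV) is copied from the command list in the diff, and the sketch only builds the command without running docker.

```python
import shlex


def build_levseq_docker_cmd(host_dir, run_name, data_subdir, ref_csv):
    """Build a `docker run` argument list that mounts host_dir at /levseq_results.

    Hypothetical helper for illustration only; it mirrors the layout of the
    command list in test_deploy and does not execute anything.
    """
    mount = f"{host_dir}:/levseq_results"
    return [
        "docker", "run", "--rm",
        "-v", mount,
        "levseq",
        run_name,
        # Paths are given relative to the mount point, as in the diff.
        f"levseq_results/{data_subdir}",
        f"levseq_results/{ref_csv}",
    ]


cmd = build_levseq_docker_cmd("/data/run", "test_deploy", "levseq-1.2.7/", "ref.csv")
# shlex.join quotes arguments where needed, so the output can be pasted into a shell.
print(shlex.join(cmd))
```

Printing the command with `shlex.join` rather than `' '.join` keeps it copy-pasteable even if the host directory contains spaces.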

{levseq-1.3.2 → levseq-1.4.0}/tests/test_opligopools.py RENAMED

@@ -58,36 +58,10 @@ class TestClass(unittest.TestCase):
     def teardown_class(self):
         shutil.rmtree(self.tmp_dir)
-    def test_making_pools(self):
-        u.dp(["Testing SSM"])
-        cl_args = {'skip_demultiplexing': True, 'skip_variantcalling': False}
-        cl_args["name"] = 'oligopools'
-        cl_args['path'] = '/Users/arianemora/Documents/projects/LevSeq/oligopools/'
-        cl_args["summary"] = '/Users/arianemora/Documents/projects/LevSeq/oligopools/oligopool_seqs.csv'
-        variant_df = process_ref_csv(cl_args)
-        # Check if variants.csv already exist
-        variant_csv_path = os.path.join('oligopools', "variants.csv")
-        if os.path.exists(variant_csv_path):
-            variant_df = pd.read_csv(variant_csv_path)
-            df_variants, df_vis = create_df_v(variant_df)
-        # Clean up and prepare dataframe for visualization
-        else:
-            df_variants, df_vis = create_df_v(variant_df)
-    def test_pools(self):
-        u.dp(["Testing SSM"])
-        cl_args = {'skip_demultiplexing': True, 'skip_variantcalling': False}
-        cl_args["name"] = 'oligopools'
-        cl_args['path'] = '/Users/arianemora/Documents/projects/LevSeq/oligopools/'
-        cl_args["summary"] = '/Users/arianemora/Documents/projects/LevSeq/oligopools/oligopool_seqs.csv'
-        variant_df = process_ref_csv(cl_args)
-        # Check if variants.csv already exist
-        variant_csv_path = os.path.join('oligopools', "variants.csv")
-        if os.path.exists(variant_csv_path):
-            variant_df = pd.read_csv(variant_csv_path)
-            df_variants, df_vis = create_df_v(variant_df)
-        # Clean up and prepare dataframe for visualization
-        else:
-            df_variants, df_vis = create_df_v(variant_df)
+    def test_demultipluxing_pools(self):
+        # Take as input the demultiplexed fastq files and the reference csv file
+        cl_args = {'skip_demultiplexing': False, 'skip_variantcalling': False, 'threshold': 0.5, 'oligopool': True, 'show_msa': False}
+        cl_args["name"] = 'oligotest_21032025'
+        cl_args['path'] = 'test_oligopool_2103/'
+        cl_args["summary"] = 'test_oligopool_2103/test_oligopool_2103.csv'
+        run_LevSeq(cl_args)