PyPI - levseq - Versions diffs - 1.2.1__tar.gz → 1.2.5__tar.gz - Mend

levseq 1.2.1.tar.gz → 1.2.5.tar.gz


Files changed (38)

{levseq-1.2.1/levseq.egg-info → levseq-1.2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: levseq
-Version: 1.2.1
+Version: 1.2.5
 Home-page: https://github.com/fhalab/levseq/
 Author: Yueming Long, Emreay Gursoy, Ariane Mora, Francesca-Zhoufan Li
 Author-email: ylong@caltech.edu
@@ -43,6 +43,7 @@ Requires-Dist: seaborn
 Requires-Dist: scikit-learn
 Requires-Dist: statsmodels
 Requires-Dist: tqdm
+Requires-Dist: biopandas
 # Variant Sequencing with Nanopore
@@ -53,8 +54,7 @@ Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore tech
 - Data to reproduce the results and to test are available on zenodo [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13694463.svg)](https://doi.org/10.5281/zenodo.13694463)
-- A dockerized website and database for labs to locally host and visualize their data: website is available at: https://levseq.caltech.edu/ and code to host locally at: https://github.com/fhalab/LevSeq_VDB/
+- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://github.com/ArianeMora/LevSeq_vis/) and code to host locally at: https://github.com/fhalab/LevSeq_VDB/
 ## Setup
@@ -65,29 +65,19 @@ For setting up the experimental side of LevSeq we suggest the following preparat
 ## How to Use LevSeq
-The wet lab part is detailed in the method section of the paper.
+The wet lab part is detailed in the method section of the paper or via the [wiki](https://github.com/fhalab/LevSeq/wiki/Experimental-protocols).
 Once samples are prepared, the multiplexed sample is used for sequencing, and the sequencing data is stored in the `../data` folder as per the typical Nanopore flow (refer to Nanopore documentation for this).
 After sequencing, you can identify variants, demultiplex, and combine with your variant function here! For simple applications, we recommend using the notebook `example/Example.ipynb`.
-### Steps of LevSeq:
-1. **Basecalling**: This step converts Nanopore's FAST5 files to sequences. For basecalling, we use Nanopore's basecaller, Medaka, which can run in parallel with sequencing (recommended) or afterward.
-2. **Demultiplexing**: After sequencing, the reads, stored as bulk FASTQ files, are sorted. During demultiplexing, each read is assigned to its correct plate/well combination and stored as a FASTQ file.
-3. **Variant Calling**: For each sample, the consensus sequence is compared to the reference sequence. A variant is called if it differs from the reference sequence. The success of variant calling depends on the number of reads sequenced and their quality.
+### Installation
-### Installation:
+We aimed to make LevSeq as simple to use as possible, this means you should be able to run it all using pip (note you need `samtools`
+and `minimap2` installed on your path. However, if you have issues we recommend using the Docker instance!
+(the pip version doesn't work well with mac M3 but docker does.)
-We aimed to make LevSeq as simple to use as possible, this means you should be able to run it all using pip. However, if you have issues we recomend using the Docker instance!
-We recommend using command line interface(Terminal) and a conda environment for installation:
-```
-git clone https://github.com/fhalab/LevSeq.git
-```
+We recommend using terminal and a conda environment for installation:
 ```
 conda create --name levseq python=3.10 -y
@@ -97,11 +87,6 @@ conda create --name levseq python=3.10 -y
 conda activate levseq
 ```
-From the LevSeq folder, install the package using pip:
-```
-pip install levseq
-```
 #### Dependencies
 1. Samtools: https://www.htslib.org/download/
@@ -111,30 +96,26 @@ conda install -c bioconda -c conda-forge samtools
 or for mac users you can use: `brew install samtools`
 2. Minimap2: https://github.com/lh3/minimap2
 ```
 conda install -c bioconda -c conda-forge minimap2
 ```
-or for mac users you can use: `brew install minimap2`
-Once dependencies are all installed, you can run LevSeq using command line.
-3. GCC version 13 and 14 are both needed
-For Mac M chip users: installation via homebrew
-```
-brew install gcc@14
-brew install gcc@13
-```
-For Linux users: installation via conda
-```
-conda install conda-forge::gcc=14
-conda install conda-forge::gcc=13
-```
+### Docker Installation (Recommended for full pipeline)
+For installing the whole pipeline, you'll need to use the docker image. For this, install docker as required for your
+operating system (https://docs.docker.com/engine/install/).
 ### Usage
-#### Command Line Interface
-LevSeq can be run using the command line interface. Here's the basic structure of the command:
+#### Run via pip
 ```
 levseq <name of the run you can make this whatever> <location to data folder> <location of reference csv file>
 ```
+#### Run via docker
+```
+docker run --rm -v "$(pwd):/levseq_results" levseq <name> <location to data folder> <location of reference csv file>
+```
+See the [manuscrtipt notebook](https://github.com/fhalab/LevSeq/blob/main/manuscript/notebooks/epPCR_10plates.ipynb) for an example.
 #### Required Arguments
 1. Name of the experiment, this will be the name of the output folder
 2. Location of basecalled fastq files, this is the direct output from using the MinKnow software for sequencing
@@ -149,37 +130,10 @@ levseq <name of the run you can make this whatever> <location to data folder> <l
 --show\_msa Showing multiple sequence alignment for each well
-### Docker Installation (Recommended for full pipeline)
-For installing the whole pipeline, you'll need to use the docker image. For this, install docker as required for your
-operating system (https://docs.docker.com/engine/install/).
-To build the docker image run (within the main folder that contains the `Dockerfile`). Note building does **not** work
-on Mac M3 chip, please use a ubuntu machine to build the docker image!
-```
-docker build -t levseq .
-```
-This gives us the access to the lebSeq command line interface via:
-```
-docker run levseq
-```
-Note! The docker image should work with linux, and mac, however, different mac architectures may have issues (owing to the different M1/M3 processers.)
-Basically the -v connects a folder on your computer with the output from the minION sequencer with the docker image that will take these results and then perform
-demultiplexing and variant calling.
-docker run -v /disk1/ariane/vscode/LevSeq/manuscript/Data/20241116-YL-LevSeq-parlqep400-1-2-P25-28:/levseq_results/ levseq docker-test levseq_results/ levseq_results/LevSeq-T1.csv
-```
- docker run -v /Users/XXXX/Documents/LevSeq/data:/levseq_results/ levseq 20240502 levseq_results/20240502/ levseq_results/20240502-YL-ParLQ-ep2.csv
-```
-In this command: `/Users/XXXX/Documents/LevSeq/data` is a folder on your computer, which contains a subfolder `20240502`
+Great you should be all done!
-### Issues and Troubleshooting
+For more details or trouble shooting please look at our [computational_protocols](https://github.com/fhalab/LevSeq/wiki/Computational-protocols).
-If you have any issues, please check the LevSeq\_error.log find in the output direectory and report the issue. If the problem persists, please open an issue on the GitHub repository with the error details.
+#### Citing
-If you solve something code wise, submit a pull request! We would love community input.
+If you have found LevSeq useful, please cite out [paper](https://doi.org/10.1101/2024.09.04.611255).

{levseq-1.2.1 → levseq-1.2.5}/README.md RENAMED

@@ -7,8 +7,7 @@ Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore tech
 - Data to reproduce the results and to test are available on zenodo [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13694463.svg)](https://doi.org/10.5281/zenodo.13694463)
-- A dockerized website and database for labs to locally host and visualize their data: website is available at: https://levseq.caltech.edu/ and code to host locally at: https://github.com/fhalab/LevSeq_VDB/
+- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://github.com/ArianeMora/LevSeq_vis/) and code to host locally at: https://github.com/fhalab/LevSeq_VDB/
 ## Setup
@@ -19,29 +18,19 @@ For setting up the experimental side of LevSeq we suggest the following preparat
 ## How to Use LevSeq
-The wet lab part is detailed in the method section of the paper.
+The wet lab part is detailed in the method section of the paper or via the [wiki](https://github.com/fhalab/LevSeq/wiki/Experimental-protocols).
 Once samples are prepared, the multiplexed sample is used for sequencing, and the sequencing data is stored in the `../data` folder as per the typical Nanopore flow (refer to Nanopore documentation for this).
 After sequencing, you can identify variants, demultiplex, and combine with your variant function here! For simple applications, we recommend using the notebook `example/Example.ipynb`.
-### Steps of LevSeq:
-1. **Basecalling**: This step converts Nanopore's FAST5 files to sequences. For basecalling, we use Nanopore's basecaller, Medaka, which can run in parallel with sequencing (recommended) or afterward.
-2. **Demultiplexing**: After sequencing, the reads, stored as bulk FASTQ files, are sorted. During demultiplexing, each read is assigned to its correct plate/well combination and stored as a FASTQ file.
-3. **Variant Calling**: For each sample, the consensus sequence is compared to the reference sequence. A variant is called if it differs from the reference sequence. The success of variant calling depends on the number of reads sequenced and their quality.
+### Installation
-### Installation:
+We aimed to make LevSeq as simple to use as possible, this means you should be able to run it all using pip (note you need `samtools`
+and `minimap2` installed on your path. However, if you have issues we recommend using the Docker instance!
+(the pip version doesn't work well with mac M3 but docker does.)
-We aimed to make LevSeq as simple to use as possible, this means you should be able to run it all using pip. However, if you have issues we recomend using the Docker instance!
-We recommend using command line interface(Terminal) and a conda environment for installation:
-```
-git clone https://github.com/fhalab/LevSeq.git
-```
+We recommend using terminal and a conda environment for installation:
 ```
 conda create --name levseq python=3.10 -y
@@ -51,11 +40,6 @@ conda create --name levseq python=3.10 -y
 conda activate levseq
 ```
-From the LevSeq folder, install the package using pip:
-```
-pip install levseq
-```
 #### Dependencies
 1. Samtools: https://www.htslib.org/download/
@@ -65,30 +49,26 @@ conda install -c bioconda -c conda-forge samtools
 or for mac users you can use: `brew install samtools`
 2. Minimap2: https://github.com/lh3/minimap2
 ```
 conda install -c bioconda -c conda-forge minimap2
 ```
-or for mac users you can use: `brew install minimap2`
-Once dependencies are all installed, you can run LevSeq using command line.
-3. GCC version 13 and 14 are both needed
-For Mac M chip users: installation via homebrew
-```
-brew install gcc@14
-brew install gcc@13
-```
-For Linux users: installation via conda
-```
-conda install conda-forge::gcc=14
-conda install conda-forge::gcc=13
-```
+### Docker Installation (Recommended for full pipeline)
+For installing the whole pipeline, you'll need to use the docker image. For this, install docker as required for your
+operating system (https://docs.docker.com/engine/install/).
 ### Usage
-#### Command Line Interface
-LevSeq can be run using the command line interface. Here's the basic structure of the command:
+#### Run via pip
 ```
 levseq <name of the run you can make this whatever> <location to data folder> <location of reference csv file>
 ```
+#### Run via docker
+```
+docker run --rm -v "$(pwd):/levseq_results" levseq <name> <location to data folder> <location of reference csv file>
+```
+See the [manuscrtipt notebook](https://github.com/fhalab/LevSeq/blob/main/manuscript/notebooks/epPCR_10plates.ipynb) for an example.
 #### Required Arguments
 1. Name of the experiment, this will be the name of the output folder
 2. Location of basecalled fastq files, this is the direct output from using the MinKnow software for sequencing
@@ -103,37 +83,10 @@ levseq <name of the run you can make this whatever> <location to data folder> <l
 --show\_msa Showing multiple sequence alignment for each well
-### Docker Installation (Recommended for full pipeline)
-For installing the whole pipeline, you'll need to use the docker image. For this, install docker as required for your
-operating system (https://docs.docker.com/engine/install/).
-To build the docker image run (within the main folder that contains the `Dockerfile`). Note building does **not** work
-on Mac M3 chip, please use a ubuntu machine to build the docker image!
-```
-docker build -t levseq .
-```
-This gives us the access to the lebSeq command line interface via:
-```
-docker run levseq
-```
-Note! The docker image should work with linux, and mac, however, different mac architectures may have issues (owing to the different M1/M3 processers.)
-Basically the -v connects a folder on your computer with the output from the minION sequencer with the docker image that will take these results and then perform
-demultiplexing and variant calling.
-docker run -v /disk1/ariane/vscode/LevSeq/manuscript/Data/20241116-YL-LevSeq-parlqep400-1-2-P25-28:/levseq_results/ levseq docker-test levseq_results/ levseq_results/LevSeq-T1.csv
-```
- docker run -v /Users/XXXX/Documents/LevSeq/data:/levseq_results/ levseq 20240502 levseq_results/20240502/ levseq_results/20240502-YL-ParLQ-ep2.csv
-```
-In this command: `/Users/XXXX/Documents/LevSeq/data` is a folder on your computer, which contains a subfolder `20240502`
+Great you should be all done!
-### Issues and Troubleshooting
+For more details or trouble shooting please look at our [computational_protocols](https://github.com/fhalab/LevSeq/wiki/Computational-protocols).
-If you have any issues, please check the LevSeq\_error.log find in the output direectory and report the issue. If the problem persists, please open an issue on the GitHub repository with the error details.
+#### Citing
-If you solve something code wise, submit a pull request! We would love community input.
+If you have found LevSeq useful, please cite out [paper](https://doi.org/10.1101/2024.09.04.611255).
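
The updated installation notes in this README say the pip route needs `samtools` and `minimap2` on the PATH. A minimal pre-flight check, assuming only the Python standard library, could look like the sketch below; `missing_tools` is a hypothetical helper and is not part of levseq.

```python
import shutil


def missing_tools(tools=("samtools", "minimap2")):
    """Return the names in `tools` that cannot be found on the PATH.

    Hypothetical helper for illustration; levseq does not ship this function.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("samtools and minimap2 are both on the PATH")
```

If either tool is reported missing, the conda commands listed under Dependencies are the install route the README itself suggests.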

{levseq-1.2.1 → levseq-1.2.5}/levseq/__init__.py RENAMED

@@ -18,7 +18,7 @@
 __title__ = 'levseq'
 __description__ = 'LevSeq nanopore sequencing'
 __url__ = 'https://github.com/fhalab/levseq/'
-__version__ = '1.2.1'
+__version__ = '1.2.5'
 __author__ = 'Yueming Long, Emreay Gursoy, Ariane Mora, Francesca-Zhoufan Li'
 __author_email__ = 'ylong@caltech.edu'
 __license__ = 'GPL3'

{levseq-1.2.1 → levseq-1.2.5}/levseq/run_levseq.py RENAMED

@@ -243,6 +243,19 @@ def call_variant(experiment_name, experiment_folder, template_fasta, filtered_ba
     except Exception as e:
         logging.error("Variant calling failed", exc_info=True)
         raise
+def assign_alignment_probability(row):
+    if row["Variant"] == "#PARENT#":
+        if row["Alignment Count"] > 20:
+            return 1
+        elif 10 <= row["Alignment Count"] <= 20:
+            return (row["Alignment Count"] - 10) / 10  # Ranges from 0 to 1 linearly
+        else:
+            return 0
+    else:
+        return row["Average mutation frequency"]
 # Full version of create_df_v function
 def create_df_v(variants_df):
     # Make copy of dataframe
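
Reviewer note: the hunk above moves `assign_alignment_probability` to module level without changing its body. The sketch below copies that body and applies it to a small made-up DataFrame (the rows are illustrative, not from the package) to show the ramp: parent wells with fewer than 10 reads or exactly 10 reads score 0, the score rises linearly to 1 at 20 reads, and non-parent wells keep their average mutation frequency.

```python
import pandas as pd


def assign_alignment_probability(row):
    # Body copied from the hunk above.
    if row["Variant"] == "#PARENT#":
        if row["Alignment Count"] > 20:
            return 1
        elif 10 <= row["Alignment Count"] <= 20:
            return (row["Alignment Count"] - 10) / 10  # Ranges from 0 to 1 linearly
        else:
            return 0
    else:
        return row["Average mutation frequency"]


# Illustrative rows only: three parent wells and one variant well.
demo = pd.DataFrame({
    "Variant": ["#PARENT#", "#PARENT#", "#PARENT#", "A12G"],
    "Alignment Count": [25, 15, 5, 40],
    "Average mutation frequency": [float("nan"), float("nan"), float("nan"), 0.93],
})
demo["Alignment Probability"] = demo.apply(assign_alignment_probability, axis=1)
print(demo[["Variant", "Alignment Count", "Alignment Probability"]])
```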
@@ -258,28 +271,19 @@ def create_df_v(variants_df):
     # Translate nc_variant to aa_variant
     df_variants_["aa_variant"] = df_variants_["nc_variant"].apply(
-    lambda x: x if x in ["Deletion", "#N.A.#"] else translate(x)
+        lambda x: x if x in ["Deletion", "#N.A.#", 'Insertion'] else translate(x)
     )
     # Fill in 'Deletion' in 'aa_variant' column
     df_variants_.loc[
         df_variants_["nc_variant"] == "Deletion", "aa_variant"
     ] = "Deletion"
+    df_variants_.loc[
+        df_variants_["nc_variant"] == "Insertion", "aa_variant"
+    ] = "Insertion"
     # Compare aa_variant with translated refseq and generate Substitutions column
     df_variants_["Substitutions"] = df_variants_.apply(get_mutations, axis=1)
     # Adding sequence quality to Alignment Probability before filling in empty values
-    def assign_alignment_probability(row):
-        if row["Variant"] == "#PARENT#":
-            if row["Alignment Count"] > 20:
-                return 1
-            elif 10 <= row["Alignment Count"] <= 20:
-                return (row["Alignment Count"] - 10) / 10  # Ranges from 0 to 1 linearly
-            else:
-                return 0
-        else:
-            return row["Average mutation frequency"]
     df_variants_["Alignment Probability"] = df_variants_.apply(assign_alignment_probability, axis=1)
     df_variants_["Alignment Probability"] = df_variants_["Alignment Probability"].fillna(0.0)
     df_variants_["Alignment Count"] = df_variants_["Alignment Count"].fillna(0.0)
@@ -291,7 +295,9 @@ def create_df_v(variants_df):
         elif df_variants_["nc_variant"].iloc[i] == "#N.A.#":
             df_variants_.Substitutions.iat[i] = "#N.A.#"
+    # Low read counts override low mutations
+    df_variants_["Substitutions"] = ["#LOW#" if a < 10 and a > 0 else s for a, s in df_variants_[["Alignment Count", "Substitutions"]].values]
     # Add row and columns
     Well = df_variants_["Well"].tolist()
     row = []
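
Reviewer note: the new `#LOW#` override in the hunk above relabels any well with between 1 and 9 reads. The sketch below runs the same list comprehension on a made-up four-row frame. Note that this override uses `a < 10`, whereas the `get_mutations` change later in this diff uses `alignment_count <= 10`, so a well with exactly 10 reads would only be flagged by `get_mutations`; the third row assumes that has already happened.

```python
import pandas as pd

# Illustrative data: counts of 0, 4, 10 and 30 reads.
demo = pd.DataFrame({
    "Alignment Count": [0, 4, 10, 30],
    "Substitutions": ["#N.A.#", "A12G", "#LOW#", "#PARENT#"],
})

# Same expression as the added line, written with a chained comparison.
demo["Substitutions"] = [
    "#LOW#" if 0 < a < 10 else s
    for a, s in demo[["Alignment Count", "Substitutions"]].values
]
print(demo)
```

Wells with zero reads keep their `#N.A.#` label because the condition requires a count above 0.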
@@ -317,29 +323,27 @@ def create_df_v(variants_df):
     )
     # Rename columns as per the request
     df_variants_.rename(columns={
-	"Variant": "nucleotide_mutation",
-	"Substitutions": "amino-acid_substitutions",
-	"nc_variant": "nt_sequence",
-	"aa_variant": "aa_sequence"
-	}, inplace=True)
+        "Variant": "nucleotide_mutation",
+        "Substitutions": "amino-acid_substitutions",
+        "nc_variant": "nt_sequence",
+        "aa_variant": "aa_sequence"
+	    },inplace=True)
 	# Select the desired columns in the desired order
-    restructured_df = df_variants_[
-            [
-                    "barcode_plate",
-                    "Plate",
-                    "Well",
-                    "Alignment Count",
-                    "nucleotide_mutation",
-                    "amino-acid_substitutions",
-                    "Alignment Probability",
-                    "Average mutation frequency",
-                    "P value",
-                    "P adj. value",
-                    "nt_sequence",
-                    "aa_sequence",
-            ]
+    restructured_df = df_variants_[[
+            "barcode_plate",
+            "Plate",
+            "Well",
+            "Alignment Count",
+            "nucleotide_mutation",
+            "amino-acid_substitutions",
+            "Alignment Probability",
+            "Average mutation frequency",
+            "P value",
+            "P adj. value",
+            "nt_sequence",
+            "aa_sequence",
+        ]
     ]
     return restructured_df, df_variants_
@@ -354,16 +358,21 @@ def create_nc_variant(variant, refseq):
         return refseq
     elif "DEL" in variant:
         return "Deletion"
+    elif variant == '+':
+        return "Insertion"
     else:
         mutations = variant.split("_")
         nc_variant = list(refseq)
         for mutation in mutations:
-            if len(mutation) >= 2:
+            try:
                 position = int(re.findall(r"\d+", mutation)[0]) - 1
                 original = mutation[0]
                 new = mutation[-1]
-            if position < len(nc_variant) and nc_variant[position] == original:
-                nc_variant[position] = new
+                if position < len(nc_variant) and nc_variant[position] == original:
+                    nc_variant[position] = new
+            except:
+                print('WARNING! UNABLE TO PROCESS THIS')
+                print(mutation)
         return "".join(nc_variant)
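
Reviewer note: the final branch of `create_nc_variant` applies labels such as `A1G_C3T` to the reference sequence. The sketch below isolates that branch under a hypothetical name, `apply_substitutions`, and catches `IndexError` and `ValueError` explicitly instead of the bare `except:` used in the hunk; that narrowing is my substitution, not the package's behaviour.

```python
import re


def apply_substitutions(variant: str, refseq: str) -> str:
    """Apply underscore-separated labels like 'A1G_C3T' to refseq.

    Hypothetical stand-alone version of the else-branch shown in the hunk.
    """
    nc_variant = list(refseq)
    for mutation in variant.split("_"):
        try:
            # Labels are 1-based; convert to a 0-based index.
            position = int(re.findall(r"\d+", mutation)[0]) - 1
            original, new = mutation[0], mutation[-1]
            # Only substitute when the reference base matches the label.
            if position < len(nc_variant) and nc_variant[position] == original:
                nc_variant[position] = new
        except (IndexError, ValueError):
            print(f"WARNING: unable to process mutation label {mutation!r}")
    return "".join(nc_variant)


print(apply_substitutions("A1G_C3T", "ACCT"))
```

A label whose original base does not match the reference is silently skipped, which mirrors the guard in the hunk.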
@@ -377,7 +386,9 @@ def get_mutations(row):
         # Check if alignment_count is zero and return "#N.A.#" if true
         if alignment_count == 0:
             return "#N.A.#"
+        if alignment_count <= 10:
+            return "#LOW#"
         refseq = row["refseq"]
         if not is_valid_dna_sequence(refseq):
@@ -395,10 +406,7 @@ def get_mutations(row):
                     if refseq_aa[i] != variant_aa[i]:
                         mutations.append(f"{refseq_aa[i]}{i+1}{variant_aa[i]}")
                 if not mutations:
-                    if alignment_count < 15:
-                        return "#N.A.#"
-                    else:
-                        return "#PARENT#"
+                    return "#PARENT#"
             else:
                 return "LEN"
         return "_".join(mutations) if mutations else ""
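
Reviewer note: with the `alignment_count < 15` check removed, a variant whose translation equals the reference is now always labelled `#PARENT#` (low-read wells are caught earlier by the `#LOW#` check). A minimal sketch of the labelling convention visible in this hunk, assuming that `LEN` is returned when the two translations differ in length (the guarding condition is not shown in the diff); `label_substitutions` is a hypothetical name.

```python
def label_substitutions(refseq_aa: str, variant_aa: str) -> str:
    """Label amino-acid differences as <ref><1-based position><variant>.

    Hypothetical helper; the length check is an assumption about the
    condition that leads to the 'LEN' return in get_mutations.
    """
    if len(refseq_aa) != len(variant_aa):
        return "LEN"
    mutations = [
        f"{ref}{i + 1}{var}"
        for i, (ref, var) in enumerate(zip(refseq_aa, variant_aa))
        if ref != var
    ]
    return "_".join(mutations) if mutations else "#PARENT#"


print(label_substitutions("MKT", "MRT"))
print(label_substitutions("MKT", "MKT"))
```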
@@ -433,10 +441,7 @@ def process_ref_csv(cl_args, tqdm_fn=tqdm.tqdm):
     result_folder = create_result_folder(cl_args)
     variant_csv_path = os.path.join(result_folder, "variants.csv")
-    if os.path.exists(variant_csv_path):
-        variant_df = pd.read_csv(variant_csv_path)
-    else:
-        variant_df = pd.DataFrame(columns=["barcode_plate", "name", "refseq", "variant"])
+    variant_df = pd.DataFrame(columns=["barcode_plate", "name", "refseq", "variant"])
     for i, row in tqdm_fn(ref_df.iterrows(), total=len(ref_df), desc="Processing Samples"):
         barcode_plate = row["barcode_plate"]
@@ -456,9 +461,9 @@ def process_ref_csv(cl_args, tqdm_fn=tqdm.tqdm):
         barcode_path = filter_bc(cl_args, name_folder, i)
         output_dir = Path(result_folder) / "basecalled_reads"
         output_dir.mkdir(parents=True, exist_ok=True)
-        file_to_fastq = cat_fastq_files(cl_args.get("path"), output_dir)
         if not cl_args["skip_demultiplexing"]:
+            file_to_fastq = cat_fastq_files(cl_args.get("path"), output_dir)
             try:
                 demux_fastq(output_dir, name_folder, barcode_path)
             except Exception as e:

{levseq-1.2.1 → levseq-1.2.5}/levseq/seqfit.py RENAMED

@@ -115,7 +115,7 @@ def calculate_mutation_combinations(stats_df):
 def normalise_calculate_stats(processed_plate_df, value_columns, normalise='standard', stats_method='mannwhitneyu',
-                              parent_label='#PARENT#'):
+                              parent_label='#PARENT#', normalise_method='median'):
     parent = parent_label
     # if nomrliase normalize with standard normalisation
     normalised_value_columns = []
@@ -125,7 +125,11 @@ def normalise_calculate_stats(processed_plate_df, value_columns, normalise='stan
             for value_column in value_columns:
                 sub_df = processed_plate_df[processed_plate_df['Plate'] == plate]
                 parent_values = sub_df[sub_df['amino-acid_substitutions'] == parent][value_column].values
-                parent_mean = np.mean(parent_values)
+                # By default use the median
+                if normalise_method == 'median':
+                    parent_mean = np.median(parent_values)
+                else:
+                    parent_mean = np.mean(parent_values)
                 parent_sd = np.std(parent_values)
                 # For each plate we normalise to the parent of that plate
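
Reviewer note: the new `normalise_method` parameter defaults to `'median'`, so the value stored in `parent_mean` is now a median unless a caller passes something else. The sketch below shows why that matters when one parent well is an outlier; `parent_centre` is a hypothetical helper and the values are made up.

```python
import numpy as np


def parent_centre(parent_values, normalise_method="median"):
    """Return the parent centre the same way the hunk selects it."""
    if normalise_method == "median":
        return np.median(parent_values)
    return np.mean(parent_values)


# Four typical parent wells and one outlier (illustrative numbers).
parent_values = np.array([1.0, 1.1, 0.9, 1.0, 5.0])
print("median:", parent_centre(parent_values, "median"))
print("mean:  ", parent_centre(parent_values, "mean"))
```

Note that `parent_sd` is still computed with `np.std` in both branches, so only the centre changes with this option.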
@@ -148,7 +152,10 @@ def normalise_calculate_stats(processed_plate_df, value_columns, normalise='stan
         if mutation != parent:
             for value_column in normalised_value_columns:
                 parent_values = list(processed_plate_df[processed_plate_df['amino-acid_substitutions'] == parent][value_column].values)
-                parent_mean = np.mean(parent_values)
+                if normalise_method == 'median':
+                    parent_mean = np.median(parent_values)
+                else:
+                    parent_mean = np.mean(parent_values)
                 parent_sd = np.std(parent_values)
                 vals = list(grp[value_column].values)
@@ -254,8 +261,12 @@ def work_up_lcms(
     --------
     plate: ns.Plate object (DataFrame-like)
     """
-    # Read in the data
-    df = pd.read_csv(file, header=[1])
+    if isinstance(file, str):
+        # Read in the data
+        df = pd.read_csv(file, header=[1])
+    else:
+        # Change to handling both
+        df = file
     # Convert nans to 0
     df = df.fillna(0)
     # Only grab the Sample Acq Order No.s that have a numeric value
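
Reviewer note: `work_up_lcms` (and `gen_seqfitvis` further down) now accept either a CSV path or an already loaded DataFrame. A minimal sketch of that dispatch, using a hypothetical helper `load_table`; only `str` paths are read from disk, so a `pathlib.Path` would be passed through unchanged, exactly as in the hunk.

```python
import os
import tempfile

import pandas as pd


def load_table(source, **read_csv_kwargs):
    """Read a CSV when given a str path, otherwise return the object as-is.

    Hypothetical helper mirroring the isinstance check added in the hunk.
    """
    if isinstance(source, str):
        return pd.read_csv(source, **read_csv_kwargs)
    return source


# Usage: both call styles give an equivalent DataFrame.
original = pd.DataFrame({"Well": ["A1", "A2"], "pdt": [1.5, 2.5]})
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "plate.csv")
    original.to_csv(csv_path, index=False)
    from_path = load_table(csv_path)
from_frame = load_table(original)
```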
@@ -304,6 +315,72 @@ def work_up_lcms(
     return plate
+def process_files(results_df, plate_df, plate: str, product: list) -> pd.DataFrame:
+    """
+    Process and combine a single plate file
+    Args:
+    - product : str
+        The name of the product to be analyzed. ie pdt
+    - plate : str, ie 'HMC0225_HMC0226.csv'
+        The name of the input CSV file containing the plate data.
+    Returns:
+    - pd.DataFrame
+        A pandas DataFrame containing the processed data.
+    - str
+        The path of the output CSV file containing the processed data.
+    """
+    filtered_df = results_df[["Plate", "Well", "amino-acid_substitutions", "nt_sequence", "aa_sequence"]]
+    filtered_df = filtered_df[(filtered_df["amino-acid_substitutions"] != "#N.A.#")].dropna()
+    # Extract the unique entries of Plate
+    unique_plates = filtered_df["Plate"].unique()
+    # Create an empty list to store the processed plate data
+    processed_data = []
+    # Iterate over unique Plates and search for corresponding CSV files in the current directory
+    plate_object = work_up_lcms(plate_df, product)
+    # Extract attributes from plate_object as needed for downstream processes
+    if hasattr(plate_object, "df"):
+        # Assuming plate_object has a dataframe-like attribute 'df' that we can work with
+        plate_df = plate_object.df
+        plate_df["Plate"] = plate  # Add the plate identifier for reference
+        # Merge filtered_df with plate_df to retain amino-acid_substitutionss and nt_sequence columns
+        merged_df = pd.merge(
+            plate_df, filtered_df, on=["Plate", "Well"], how="left"
+        )
+        columns_order = (
+                ["Plate", "Well", "Row", "Column", "amino-acid_substitutions"]
+                + product
+                + ["nt_sequence", "aa_sequence"]
+        )
+        merged_df = merged_df[columns_order]
+        processed_data.append(merged_df)
+    # Concatenate all dataframes if available
+    if processed_data:
+        processed_df = pd.concat(processed_data, ignore_index=True)
+    else:
+        processed_df = pd.DataFrame(
+            columns=["Plate", "Well", "Row", "Column", "amino-acid_substitutions"]
+                    + product
+                    + ["nt_sequence", "aa_sequence"]
+        )
+    # Ensure all entries in 'Mutations' are treated as strings
+    processed_df["amino-acid_substitutions"] = processed_df["amino-acid_substitutions"].astype(str)
+    # Remove any rows with empty values
+    processed_df = processed_df.dropna()
+    # Return the processed DataFrame for downstream processes
+    return processed_df
 # Function to process the plate files
 def process_plate_files(product: str, input_csv: str) -> pd.DataFrame:
@@ -1146,8 +1223,10 @@ def gen_seqfitvis(
 ):
     # normalized per plate to parent
-    df = pd.read_csv(seqfit_path)
+    if isinstance(seqfit_path, str):
+        df = pd.read_csv(seqfit_path)
+    else:
+        df = seqfit_path
     # ignore deletion meaning "Mutations" == "-"
     df = df[df["amino-acid_substitutions"] != "-"].copy()
     # count number of sites mutated and append mutation details

{levseq-1.2.1 → levseq-1.2.5}/levseq/utils.py RENAMED

@@ -170,12 +170,12 @@ def make_well_df_from_reads(seqs, read_ids, read_quals):
     seq_df = pd.DataFrame([list(s) for s in seqs])  # Convert each string to a list so that we get positions nicely
     # Also add in the read_ids and sort by the quality to only take the highest quality one
     seq_df['read_id'] = read_ids
-    #seq_df['read_qual'] = read_quals
     seq_df['seqs'] = seqs
-    #seq_df = seq_df.sort_values(by='read_qual', ascending=False)
+    seq_df['read_qual'] = [0 if isinstance(r, str) else r for r in read_quals]
+    seq_df = seq_df.sort_values(by='read_qual', ascending=False)
     # Should now be sorted by the highest quality
     seq_df = seq_df.drop_duplicates(subset=['read_id'], keep='first')
-    return seq_df.drop(columns=['read_id', 'seqs'])
+    return seq_df.drop(columns=['read_id', 'seqs', 'read_qual'])
 def calculate_mutation_significance_across_well(seq_df):
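
Reviewer note: `make_well_df_from_reads` now re-enables the quality sort, so when a read ID appears more than once the highest-quality copy is kept. The sketch below replays the added lines on three made-up reads, assuming per-read qualities are numeric (string qualities fall back to 0, as in the hunk).

```python
import pandas as pd

# Illustrative input: read "r1" appears twice with different qualities,
# and "r2" carries a string quality that is coerced to 0.
seqs = ["ACGT", "ACGA", "ACGT"]
read_ids = ["r1", "r1", "r2"]
read_quals = [12.0, 30.0, "?"]

seq_df = pd.DataFrame([list(s) for s in seqs])  # one column per position
seq_df["read_id"] = read_ids
seq_df["seqs"] = seqs
seq_df["read_qual"] = [0 if isinstance(r, str) else r for r in read_quals]
seq_df = seq_df.sort_values(by="read_qual", ascending=False)
# After sorting, keep="first" retains the highest-quality copy of each read.
seq_df = seq_df.drop_duplicates(subset=["read_id"], keep="first")
result = seq_df.drop(columns=["read_id", "seqs", "read_qual"])
print(result)
```

Because duplicates are dropped here, the `alignment_count` introduced later in this file (`len(seq_df.values)`) counts unique reads rather than `len(seqs)`.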
@@ -273,7 +273,7 @@ def alignment_from_cigar(cigar: str, alignment: str, ref: str, query_qualities:
             ref_pos += op_len
     return new_seq, ref_seq, qual, inserts
-def get_reads_for_well(parent_name, bam_file_path: str, ref_str: str, min_coverage=5, msa_path=None):
+def get_reads_for_well(parent_name, bam_file_path: str, ref_str: str, msa_path=None):
     """
     Rows are the reads, columns are the columns in the reference. Insertions are ignored.
     """
@@ -298,22 +298,19 @@ def get_reads_for_well(parent_name, bam_file_path: str, ref_str: str, min_covera
             for i, insert in ins.items():
                 insert_map[i].append(insert)
             read_ids.append(f'{read.query_name}')
-            read_quals.append(read.qual)
+            read_quals.append(qual)
     # Check if we want to write a MSA
-    # if msa_path is not None:
-    #     print("Writing MSA, ", len(seqs))
-    #     if len(seqs) > 30:
-    #         seqs = random.sample(seqs, 30)
-    #     with open(msa_path, 'w+') as fout:
-    #         # Write the reference first
-    #         fout.write(f'>{parent_name}\n{ref_str}\n')
-    #         for i, seq in enumerate(seqs):
-    #             fout.write(f'>{read_ids[i]}\n{"".join(seq)}\n')
-    #     # Align using clustal for debugging if you need the adapter! Here you would change above to use a different version
-    #     os.system(f'clustal-omega --force -i "{msa_path}" -o "{msa_path.replace(".fa", "_msa.fa")}"')
+    if msa_path is not None:
+        with open(msa_path, 'w+') as fout:
+            # Write the reference first
+            fout.write(f'>{parent_name}\n{ref_str}\n')
+            for i, seq in enumerate(seqs):
+                fout.write(f'>{read_ids[i]}\n{str(seq)}\n')
     # Do this for all wells
     seq_df = make_well_df_from_reads(seqs, read_ids, read_quals)
+    alignment_count = len(seq_df.values)
     rows_all = make_row_from_read_pileup_across_well(seq_df, ref_str, parent_name, insert_map)
     bam.close()
@@ -322,7 +319,7 @@ def get_reads_for_well(parent_name, bam_file_path: str, ref_str: str, min_covera
         seq_df.columns = ['gene_name', 'position', 'ref', 'most_frequent', 'freq_non_ref', 'total_other',
                           'total_reads', 'p_value', 'percent_most_freq_mutation', 'A', 'p(a)', 'T', 'p(t)', 'G', 'p(g)',
                           'C', 'p(c)', 'N', 'p(n)', 'I', 'p(i)', 'Warnings']
-        return calculate_mutation_significance_across_well(seq_df), len(seqs)
+        return calculate_mutation_significance_across_well(seq_df), alignment_count
 def make_row_from_read_pileup_across_well(well_df, ref_str, label, insert_map):
     """
@@ -339,13 +336,12 @@ def make_row_from_read_pileup_across_well(well_df, ref_str, label, insert_map):
         # Dummy values that will be filled in later once we calculate the background error rate
         warning = ''
-        if total_reads < 15:
-            warning = (f'WARNING: you had: {total_reads}, we recommend looking at the BAM file or using a '
-                       f'second sequencing method on this well.')
+        if total_reads < 20:
+            warning = f'WARNING: you had: {total_reads}, we recommend looking at the BAM file or using a second sequencing method on this well.'
         # Check if there was an insert
-        if insert_map.get(col) and len(insert_map[col]) > total_reads/2:  # i.e. at least half have the insert
+        if insert_map.get(col) and len(insert_map[col][0]) > total_reads/2:  # i.e. at least half have the insert
             if warning:
-                warning += 'INSERT ALERT DUPLICATE entry'
+                warning += '\nINSERT'
             else:
                 warning = f'WARNING: INSERT.'
             rows.append([label, col, ref_seq, actual_seq, freq_non_ref, total_other, total_reads, 1.0, 0.0,
@@ -468,12 +464,19 @@ def get_variant_label_for_well(seq_df, threshold):
     # Filter based on significance to determine whether there is a
     non_refs = seq_df[seq_df['freq_non_ref'] > threshold].sort_values(by='position')
     mixed_well = False
-    if len(non_refs) > 0:
+    # Have section for inserts to check if they are > 50% of the reads
+    if seq_df['p(i) adj.'].min() < 0.05 and seq_df['I'].max() > len(seq_df) / 2:
+        label = '+'
+        probability = np.mean([1 - x for x in non_refs['freq_non_ref'].values])
+        combined_p_value = float("nan")
+    elif len(non_refs) > 0:
         positions = non_refs['position'].values
         refs = non_refs['ref'].values
         label = [f'{refs[i]}{positions[i] + 1}{actual}' for i, actual in enumerate(non_refs['most_frequent'].values)]
         # Check if it is a mixed well i.e. there were multiple with significant greater than 0.05
         padj_vals = non_refs[['p(a) adj.', 'p(t) adj.', 'p(g) adj.', 'p(c) adj.', 'p(n) adj.', 'p(i) adj.']].values
         for p in padj_vals:
             c_sig = 0
             for padj in p:
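
Review note on the new `+` branch in `get_variant_label_for_well`: the sketch below isolates the two conditions from the hunk above so they can be tried on a toy table. It is not the package's code. The helper name `label_insert_dominant` and the toy data are hypothetical; only the column names `I` and `p(i) adj.` and the two comparisons come from the diff. Note that the second comparison uses `len(seq_df)`, which is the number of rows in the table (apparently one per position), not the number of reads.

```python
import pandas as pd


def label_insert_dominant(seq_df: pd.DataFrame):
    """Hypothetical stand-alone copy of the two conditions guarding the '+' label."""
    # Condition 1: at least one position has a significant adjusted insert p-value
    has_significant_insert = seq_df['p(i) adj.'].min() < 0.05
    # Condition 2: the largest insert count exceeds half the number of rows in the table
    insert_exceeds_half_rows = seq_df['I'].max() > len(seq_df) / 2
    if has_significant_insert and insert_exceeds_half_rows:
        return '+'
    return None


# Toy pileup table (made-up values): 4 positions, 30 insert observations at position 1
toy = pd.DataFrame({
    'position': [0, 1, 2, 3],
    'I': [0, 30, 0, 0],
    'p(i) adj.': [1.0, 0.001, 1.0, 1.0],
})

insert_label = label_insert_dominant(toy)
no_insert_label = label_insert_dominant(toy.assign(I=0))
```

With this toy input the first call takes the insert branch and the second does not, because zeroing the `I` column makes the count comparison fail even though the p-value condition still holds.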

{levseq-1.2.1 → levseq-1.2.5}/levseq/variantcaller.py RENAMED

@@ -169,9 +169,9 @@ class VariantCaller:
                         self._align_sequences(row["Path"], row['Barcodes'])
                     # Placeholder function calls to demonstrate workflow
-                    well_df, alignment_count = get_reads_for_well(self.experiment_name, bam_file, self.ref_str)
+                    well_df, alignment_count = get_reads_for_well(self.experiment_name, bam_file,
+                                                                  self.ref_str, f'{row["Path"]}/msa.fa')
                     self.variant_dict[barcode_id]['Alignment Count'] = alignment_count
                     if well_df is not None:
                         well_df.to_csv(f"{row['Path']}/seq_{barcode_id}.csv", index=False)
                         label, freq, combined_p_value, mixed_well, avg_error_rate = get_variant_label_for_well(well_df, threshold)

{levseq-1.2.1 → levseq-1.2.5/levseq.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: levseq
-Version: 1.2.1
+Version: 1.2.5
 Home-page: https://github.com/fhalab/levseq/
 Author: Yueming Long, Emreay Gursoy, Ariane Mora, Francesca-Zhoufan Li
 Author-email: ylong@caltech.edu
@@ -43,6 +43,7 @@ Requires-Dist: seaborn
 Requires-Dist: scikit-learn
 Requires-Dist: statsmodels
 Requires-Dist: tqdm
+Requires-Dist: biopandas
 # Variant Sequencing with Nanopore
@@ -53,8 +54,7 @@ Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore tech
 - Data to reproduce the results and to test are available on zenodo [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13694463.svg)](https://doi.org/10.5281/zenodo.13694463)
-- A dockerized website and database for labs to locally host and visualize their data: website is available at: https://levseq.caltech.edu/ and code to host locally at: https://github.com/fhalab/LevSeq_VDB/
+- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://github.com/ArianeMora/LevSeq_vis/) and code to host locally at: https://github.com/fhalab/LevSeq_VDB/
 ## Setup
@@ -65,29 +65,19 @@ For setting up the experimental side of LevSeq we suggest the following preparat
 ## How to Use LevSeq
-The wet lab part is detailed in the method section of the paper.
+The wet lab part is detailed in the method section of the paper or via the [wiki](https://github.com/fhalab/LevSeq/wiki/Experimental-protocols).
 Once samples are prepared, the multiplexed sample is used for sequencing, and the sequencing data is stored in the `../data` folder as per the typical Nanopore flow (refer to Nanopore documentation for this).
 After sequencing, you can identify variants, demultiplex, and combine with your variant function here! For simple applications, we recommend using the notebook `example/Example.ipynb`.
-### Steps of LevSeq:
-1. **Basecalling**: This step converts Nanopore's FAST5 files to sequences. For basecalling, we use Nanopore's basecaller, Medaka, which can run in parallel with sequencing (recommended) or afterward.
-2. **Demultiplexing**: After sequencing, the reads, stored as bulk FASTQ files, are sorted. During demultiplexing, each read is assigned to its correct plate/well combination and stored as a FASTQ file.
-3. **Variant Calling**: For each sample, the consensus sequence is compared to the reference sequence. A variant is called if it differs from the reference sequence. The success of variant calling depends on the number of reads sequenced and their quality.
+### Installation
-### Installation:
+We aimed to make LevSeq as simple to use as possible, this means you should be able to run it all using pip (note you need `samtools`
+and `minimap2` installed on your path. However, if you have issues we recommend using the Docker instance!
+(the pip version doesn't work well with mac M3 but docker does.)
-We aimed to make LevSeq as simple to use as possible, this means you should be able to run it all using pip. However, if you have issues we recomend using the Docker instance!
-We recommend using command line interface(Terminal) and a conda environment for installation:
-```
-git clone https://github.com/fhalab/LevSeq.git
-```
+We recommend using terminal and a conda environment for installation:
 ```
 conda create --name levseq python=3.10 -y
@@ -97,11 +87,6 @@ conda create --name levseq python=3.10 -y
 conda activate levseq
 ```
-From the LevSeq folder, install the package using pip:
-```
-pip install levseq
-```
 #### Dependencies
 1. Samtools: https://www.htslib.org/download/
@@ -111,30 +96,26 @@ conda install -c bioconda -c conda-forge samtools
 or for mac users you can use: `brew install samtools`
 2. Minimap2: https://github.com/lh3/minimap2
 ```
 conda install -c bioconda -c conda-forge minimap2
 ```
-or for mac users you can use: `brew install minimap2`
-Once dependencies are all installed, you can run LevSeq using command line.
-3. GCC version 13 and 14 are both needed
-For Mac M chip users: installation via homebrew
-```
-brew install gcc@14
-brew install gcc@13
-```
-For Linux users: installation via conda
-```
-conda install conda-forge::gcc=14
-conda install conda-forge::gcc=13
-```
+### Docker Installation (Recommended for full pipeline)
+For installing the whole pipeline, you'll need to use the docker image. For this, install docker as required for your
+operating system (https://docs.docker.com/engine/install/).
 ### Usage
-#### Command Line Interface
-LevSeq can be run using the command line interface. Here's the basic structure of the command:
+#### Run via pip
 ```
 levseq <name of the run you can make this whatever> <location to data folder> <location of reference csv file>
 ```
+#### Run via docker
+```
+docker run --rm -v "$(pwd):/levseq_results" levseq <name> <location to data folder> <location of reference csv file>
+```
+See the [manuscrtipt notebook](https://github.com/fhalab/LevSeq/blob/main/manuscript/notebooks/epPCR_10plates.ipynb) for an example.
 #### Required Arguments
 1. Name of the experiment, this will be the name of the output folder
 2. Location of basecalled fastq files, this is the direct output from using the MinKnow software for sequencing
@@ -149,37 +130,10 @@ levseq <name of the run you can make this whatever> <location to data folder> <l
 --show\_msa Showing multiple sequence alignment for each well
-### Docker Installation (Recommended for full pipeline)
-For installing the whole pipeline, you'll need to use the docker image. For this, install docker as required for your
-operating system (https://docs.docker.com/engine/install/).
-To build the docker image run (within the main folder that contains the `Dockerfile`). Note building does **not** work
-on Mac M3 chip, please use a ubuntu machine to build the docker image!
-```
-docker build -t levseq .
-```
-This gives us the access to the lebSeq command line interface via:
-```
-docker run levseq
-```
-Note! The docker image should work with linux, and mac, however, different mac architectures may have issues (owing to the different M1/M3 processers.)
-Basically the -v connects a folder on your computer with the output from the minION sequencer with the docker image that will take these results and then perform
-demultiplexing and variant calling.
-docker run -v /disk1/ariane/vscode/LevSeq/manuscript/Data/20241116-YL-LevSeq-parlqep400-1-2-P25-28:/levseq_results/ levseq docker-test levseq_results/ levseq_results/LevSeq-T1.csv
-```
- docker run -v /Users/XXXX/Documents/LevSeq/data:/levseq_results/ levseq 20240502 levseq_results/20240502/ levseq_results/20240502-YL-ParLQ-ep2.csv
-```
-In this command: `/Users/XXXX/Documents/LevSeq/data` is a folder on your computer, which contains a subfolder `20240502`
+Great you should be all done!
-### Issues and Troubleshooting
+For more details or trouble shooting please look at our [computational_protocols](https://github.com/fhalab/LevSeq/wiki/Computational-protocols).
-If you have any issues, please check the LevSeq\_error.log find in the output direectory and report the issue. If the problem persists, please open an issue on the GitHub repository with the error details.
+#### Citing
-If you solve something code wise, submit a pull request! We would love community input.
+If you have found LevSeq useful, please cite out [paper](https://doi.org/10.1101/2024.09.04.611255).
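
Review note on the updated installation text: the 1.2.5 README states that `samtools` and `minimap2` must be on your path for the pip route. A minimal pre-flight check, assuming only the Python standard library, could look like the sketch below. The function name `missing_tools` is hypothetical and is not part of LevSeq.

```python
import shutil


def missing_tools(tools=("samtools", "minimap2")):
    """Return the names of required executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print(f"Missing from PATH: {', '.join(missing)}")
    else:
        print("samtools and minimap2 found on PATH")
```

Running this before `levseq` gives a clearer failure than an error partway through alignment. If either tool is reported missing, the conda commands in the Dependencies section of the README apply.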

{levseq-1.2.1 → levseq-1.2.5}/levseq.egg-info/requires.txt RENAMED

@@ -20,3 +20,4 @@ seaborn
 scikit-learn
 statsmodels
 tqdm
+biopandas

{levseq-1.2.1 → levseq-1.2.5}/setup.py RENAMED

@@ -112,7 +112,8 @@ setup(name='levseq',
                         'seaborn',
                         'scikit-learn',
                         'statsmodels',
-                        'tqdm'],
+                        'tqdm',
+                        'biopandas'],
       python_requires='>=3.8',
       data_files=[("", ["LICENSE"])]
       )

{levseq-1.2.1 → levseq-1.2.5}/tests/test_variant_calling.py RENAMED

@@ -263,14 +263,14 @@ class TestVariantCalling(TestClass):
             quals.append(100)  # Dummy don't need
         well_df = make_well_df_from_reads(reads, read_ids, quals)
-        rows_all = make_row_from_read_pileup_across_well(well_df, parent_sequence, parent_name)
+        rows_all = make_row_from_read_pileup_across_well(well_df, parent_sequence, parent_name, defaultdict(list))
         well_df = pd.DataFrame(rows_all)
         well_df.columns = ['gene_name', 'position', 'ref', 'most_frequent', 'freq_non_ref', 'total_other',
                            'total_reads', 'p_value', 'percent_most_freq_mutation', 'A', 'p(a)', 'T', 'p(t)', 'G',
                            'p(g)',
-                           'C', 'p(c)', 'N', 'p(n)']
+                           'C', 'p(c)', 'N', 'p(n)', 'I', 'p(i)', 'Warning']
         well_df = calculate_mutation_significance_across_well(well_df)
-        label, frequency, combined_p_value, mixed_well = get_variant_label_for_well(well_df, 0.5)
+        label, frequency, combined_p_value, mixed_well, mean_mutation_rate = get_variant_label_for_well(well_df, 0.5)
         # This should be mutated at 100% - the rate of our sequencing errror
         u.dp([f"Input parent: {parent_sequence}", f"Variant: {mutant}"])
         u.dp(["label", label, f"frequency", frequency, f"combined_p_value", combined_p_value, "mixed_well", mixed_well])
@@ -280,6 +280,38 @@ class TestVariantCalling(TestClass):
         assert combined_p_value < 0.05
         assert mixed_well is False
+    def test_calling_variant_with_insert(self):
+        u.dp(["Testing calling variants using SSM with error"])
+        parent_sequence = "ATGAGT"
+        mutated_sequence = 'ATGAGT' # Not actually mutated
+        parent_name = 'parent'
+        reads = []
+        read_ids = []
+        quals = []
+        insert_map = defaultdict(list)
+        for i in range(0, 30):
+            read_ids.append(f'read_{i}')
+            reads.append(mutated_sequence)
+            insert_map[1].append('C')  # Making them all have an insert at C
+            quals.append(100)  # Dummy don't need
+        well_df = make_well_df_from_reads(reads, read_ids, quals)
+        rows_all = make_row_from_read_pileup_across_well(well_df, parent_sequence, parent_name, insert_map)
+        well_df = pd.DataFrame(rows_all)
+        well_df.columns = ['gene_name', 'position', 'ref', 'most_frequent', 'freq_non_ref', 'total_other',
+                           'total_reads', 'p_value', 'percent_most_freq_mutation', 'A', 'p(a)', 'T', 'p(t)', 'G',
+                           'p(g)',
+                           'C', 'p(c)', 'N', 'p(n)', 'I', 'p(i)', 'Warning']
+        print(well_df['I'].describe())
+        well_df = calculate_mutation_significance_across_well(well_df)
+        label, frequency, combined_p_value, mixed_well, mean_mutation_rate = get_variant_label_for_well(well_df, 0.5)
+        # This should be mutated at 100% - the rate of our sequencing errror
+        u.dp([f"Input parent: {parent_sequence}"])
+        u.dp(["label", label, f"frequency", frequency, f"combined_p_value", combined_p_value, "mixed_well", mixed_well])
+        assert label == '+'
     def test_mixed_wells(self):
         # Test whether we're able to call mixed well populations
         u.dp(["Testing ePCR with error"])
@@ -303,9 +335,19 @@ class TestVariantCalling(TestClass):
     def test_variant_calling_main(self):
         cl_args = {'skip_demultiplexing': True, 'skip_variantcalling': False}
         cl_args["name"] = 'Laragen_Validation'
-        cl_args["output"] = '../manuscript/Data/'
-        cl_args['path'] = '../manuscript/Data/20241116-YL-LevSeq-parlqep400-1-2-P25-28/no_sample_id/'
-        cl_args["summary"] = '../manuscript/Data/20241116-YL-LevSeq-parlqep400-1-2-P25-28/LevSeq-T1.csv'
+        cl_args["output"] = '/Users/arianemora/Documents/code/MinION/manuscript/Data/'
+        cl_args['path'] = '/Users/arianemora/Documents/code/MinION/manuscript/Data/20241116-YL-LevSeq-parlqep400-1-2-P25-28/no_sample_id/'
+        cl_args["summary"] = '/Users/arianemora/Documents/code/MinION/manuscript/Data/20241116-YL-LevSeq-parlqep400-1-2-P25-28/LevSeq-T1.csv'
+        variant_df = process_ref_csv(cl_args)
+        variant_df.to_csv('variant_NEW.csv', index=False)
+        print(variant_df.head())
+    def test_variant_calling_for_LevSeq(self):
+        cl_args = {'skip_demultiplexing': True, 'skip_variantcalling': False}
+        cl_args["name"] = 'parLQ_20240502'
+        cl_args["output"] = '/Users/arianemora/Documents/code/LevSeq/data/epPCR/epPCR_main_manuscript/ParLQ-ep2/'
+        cl_args['path'] = '/Users/arianemora/Documents/code/LevSeq/data/epPCR/epPCR_main_manuscript/ParLQ-ep2/'
+        cl_args["summary"] = '/Users/arianemora/Documents/code/LevSeq/data/epPCR/epPCR_main_manuscript/ParLQ-ep2/20240502-YL-ParLQ-ep2.csv'
         variant_df = process_ref_csv(cl_args)
         variant_df.to_csv('variant_NEW.csv', index=False)
         print(variant_df.head())
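
Review note on the test changes: `test_variant_calling_main` and `test_variant_calling_for_LevSeq` now use absolute paths under `/Users/arianemora/...`, so they will presumably only run on the author's machine. One possible way to keep them portable is to read the data root from an environment variable, as in the sketch below. The names `build_cl_args` and `LEVSEQ_TEST_DATA` are hypothetical; the dictionary keys and the fallback path are taken from the diff above.

```python
import os
from pathlib import Path


def build_cl_args(data_root, run_folder, summary_csv, name):
    """Assemble a cl_args dict like the one in the tests from a configurable data root."""
    root = Path(data_root)
    return {
        'skip_demultiplexing': True,
        'skip_variantcalling': False,
        'name': name,
        'output': f'{root.as_posix()}/',
        'path': f'{(root / run_folder).as_posix()}/no_sample_id/',
        'summary': (root / run_folder / summary_csv).as_posix(),
    }


# LEVSEQ_TEST_DATA is a hypothetical variable name; the fallback is the relative path used in 1.2.1
data_root = os.environ.get('LEVSEQ_TEST_DATA', '../manuscript/Data')
cl_args = build_cl_args(
    data_root,
    run_folder='20241116-YL-LevSeq-parlqep400-1-2-P25-28',
    summary_csv='LevSeq-T1.csv',
    name='Laragen_Validation',
)
```

The resulting `cl_args` can be passed to `process_ref_csv` exactly as in the existing tests. A test could also skip itself when the resolved `path` does not exist.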