PyPI - levseq - Versions diffs - 1.2.5__tar.gz → 1.2.7__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (39)

{levseq-1.2.5/levseq.egg-info → levseq-1.2.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: levseq
-Version: 1.2.5
+Version: 1.2.7
 Home-page: https://github.com/fhalab/levseq/
 Author: Yueming Long, Emreay Gursoy, Ariane Mora, Francesca-Zhoufan Li
 Author-email: ylong@caltech.edu
@@ -54,7 +54,7 @@ Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore tech
 - Data to reproduce the results and to test are available on zenodo [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13694463.svg)](https://doi.org/10.5281/zenodo.13694463)
-- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://github.com/ArianeMora/LevSeq_vis/) and code to host locally at: https://github.com/fhalab/LevSeq_VDB/
+- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://levseqdb.streamlit.app/) and code to host locally [here](https://github.com/fhalab/LevSeq_db)
 ## Setup
@@ -80,7 +80,7 @@ and `minimap2` installed on your path. However, if you have issues we recommend
 We recommend using terminal and a conda environment for installation:
 ```
-conda create --name levseq python=3.10 -y
+conda create --name levseq python=3.12 -y
 ```
 ```
@@ -93,7 +93,7 @@ conda activate levseq
 ```
 conda install -c bioconda -c conda-forge samtools
 ```
-or for mac users you can use: `brew install samtools`
 2. Minimap2: https://github.com/lh3/minimap2
@@ -110,11 +110,46 @@ operating system (https://docs.docker.com/engine/install/).
 ```
 levseq <name of the run you can make this whatever> <location to data folder> <location of reference csv file>
 ```
 #### Run via docker
+If using linux system
+```
+docker pull yueminglong/levseq:levseq-1.2.5-x86
 ```
-docker run --rm -v "$(pwd):/levseq_results" levseq <name> <location to data folder> <location of reference csv file>
+If using Mac M chips (image tested on M1, M3, and M4)
+```
+docker pull yueminglong/levseq:levseq-1.2.5-arm64
+```
 ```
+docker run --rm -v "$(pwd):/levseq_results" yueminglong/levseq:levseq-1.2.5-<architecture> <name> <location to data folder> <location of reference csv file>
+```
+Explanation:
+--rm: Automatically removes the container after the command finishes.
+-v "$(pwd):/levseq\_results": Mounts the current directory ($(pwd)) to /levseq\_results inside the container, ensuring the results are saved to your current directory.
+yueminglong/levseq:levseq-1.2.5-\<architecture\>: Specifies the Docker image to run. Replace \<architecture\> with the appropriate platform (e.g., x86).
+\<name\>: The name or identifier for the analysis.
+\<location to data folder\>: Path to the folder containing input data.
+\<location of reference csv file\>: Path to the reference .csv file.
+Important Notes:
+If the current directory is mounted to the container (via -v "$(pwd):/levseq\_results"), the basecalled result in FASTQ format and the ref.csv file must be located in the current directory.
+If these files are not present in the current directory, they will not be processed by the tool.
+Output:
+The results of the analysis will be saved to your current working directory.
 See the [manuscrtipt notebook](https://github.com/fhalab/LevSeq/blob/main/manuscript/notebooks/epPCR_10plates.ipynb) for an example.
+*Note: if using docker, the html and csv final output will be saved in the directory that you are running from instead of in the Platemaps or Results subfolder.
 #### Required Arguments
 1. Name of the experiment, this will be the name of the output folder
@@ -137,3 +172,7 @@ For more details or trouble shooting please look at our [computational_protocols
 #### Citing
 If you have found LevSeq useful, please cite out [paper](https://doi.org/10.1101/2024.09.04.611255).
+#### Contact
+Leave a feature request in the issues or reach us via [email](mailto:levseqdb@gmail.com).

{levseq-1.2.5 → levseq-1.2.7}/README.md RENAMED

@@ -7,7 +7,7 @@ Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore tech
 - Data to reproduce the results and to test are available on zenodo [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13694463.svg)](https://doi.org/10.5281/zenodo.13694463)
-- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://github.com/ArianeMora/LevSeq_vis/) and code to host locally at: https://github.com/fhalab/LevSeq_VDB/
+- A dockerized website and database for labs to locally host and visualize their data: website is available [here](https://levseqdb.streamlit.app/) and code to host locally [here](https://github.com/fhalab/LevSeq_db)
 ## Setup
@@ -33,7 +33,7 @@ and `minimap2` installed on your path. However, if you have issues we recommend
 We recommend using terminal and a conda environment for installation:
 ```
-conda create --name levseq python=3.10 -y
+conda create --name levseq python=3.12 -y
 ```
 ```
@@ -46,7 +46,7 @@ conda activate levseq
 ```
 conda install -c bioconda -c conda-forge samtools
 ```
-or for mac users you can use: `brew install samtools`
 2. Minimap2: https://github.com/lh3/minimap2
@@ -63,11 +63,46 @@ operating system (https://docs.docker.com/engine/install/).
 ```
 levseq <name of the run you can make this whatever> <location to data folder> <location of reference csv file>
 ```
 #### Run via docker
+If using linux system
+```
+docker pull yueminglong/levseq:levseq-1.2.5-x86
 ```
-docker run --rm -v "$(pwd):/levseq_results" levseq <name> <location to data folder> <location of reference csv file>
+If using Mac M chips (image tested on M1, M3, and M4)
+```
+docker pull yueminglong/levseq:levseq-1.2.5-arm64
+```
 ```
+docker run --rm -v "$(pwd):/levseq_results" yueminglong/levseq:levseq-1.2.5-<architecture> <name> <location to data folder> <location of reference csv file>
+```
+Explanation:
+--rm: Automatically removes the container after the command finishes.
+-v "$(pwd):/levseq\_results": Mounts the current directory ($(pwd)) to /levseq\_results inside the container, ensuring the results are saved to your current directory.
+yueminglong/levseq:levseq-1.2.5-\<architecture\>: Specifies the Docker image to run. Replace \<architecture\> with the appropriate platform (e.g., x86).
+\<name\>: The name or identifier for the analysis.
+\<location to data folder\>: Path to the folder containing input data.
+\<location of reference csv file\>: Path to the reference .csv file.
+Important Notes:
+If the current directory is mounted to the container (via -v "$(pwd):/levseq\_results"), the basecalled result in FASTQ format and the ref.csv file must be located in the current directory.
+If these files are not present in the current directory, they will not be processed by the tool.
+Output:
+The results of the analysis will be saved to your current working directory.
 See the [manuscrtipt notebook](https://github.com/fhalab/LevSeq/blob/main/manuscript/notebooks/epPCR_10plates.ipynb) for an example.
+*Note: if using docker, the html and csv final output will be saved in the directory that you are running from instead of in the Platemaps or Results subfolder.
 #### Required Arguments
 1. Name of the experiment, this will be the name of the output folder
@@ -89,4 +124,8 @@ For more details or trouble shooting please look at our [computational_protocols
 #### Citing
-If you have found LevSeq useful, please cite out [paper](https://doi.org/10.1101/2024.09.04.611255).
+If you have found LevSeq useful, please cite out [paper](https://doi.org/10.1101/2024.09.04.611255).
+#### Contact
+Leave a feature request in the issues or reach us via [email](mailto:levseqdb@gmail.com).
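
Note on the Docker instructions added above: the README asks the reader to substitute `<architecture>` by hand. A minimal sketch of how that substitution could be scripted follows. The image tags and the `/levseq_results` mount point come from the README. The helper name `build_docker_command`, the mapping from `platform.machine()` values to tags, and the example paths are assumptions made for illustration and are not part of levseq. The command is only assembled here, not executed.

```python
import os
import platform

IMAGE_REPO = "yueminglong/levseq"

# Tags as listed in the README. Mapping machine names onto them is an
# assumption of this sketch (for example, treating "aarch64" like "arm64").
TAGS = {
    "x86_64": "levseq-1.2.5-x86",
    "amd64": "levseq-1.2.5-x86",
    "arm64": "levseq-1.2.5-arm64",
    "aarch64": "levseq-1.2.5-arm64",
}


def build_docker_command(name, data_folder, ref_csv, machine=None, cwd=None):
    """Assemble the `docker run` argument list described in the README."""
    machine = (machine or platform.machine()).lower()
    if machine not in TAGS:
        raise ValueError(f"No known levseq image tag for machine type {machine!r}")
    cwd = cwd or os.getcwd()
    return [
        "docker", "run", "--rm",
        "-v", f"{cwd}:/levseq_results",   # mount the working directory
        f"{IMAGE_REPO}:{TAGS[machine]}",  # image chosen for this machine
        name, data_folder, ref_csv,
    ]


# Hypothetical run name and paths, for illustration only.
command = build_docker_command(
    "my_run", "data/", "ref.csv", machine="arm64", cwd="/home/user/run1"
)
print(" ".join(command))
```

Because the current directory is what gets mounted, the data folder and reference CSV passed here would need to live under that directory, as the README's "Important Notes" state.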

{levseq-1.2.5 → levseq-1.2.7}/levseq/__init__.py RENAMED

@@ -18,7 +18,7 @@
 __title__ = 'levseq'
 __description__ = 'LevSeq nanopore sequencing'
 __url__ = 'https://github.com/fhalab/levseq/'
-__version__ = '1.2.5'
+__version__ = '1.2.7'
 __author__ = 'Yueming Long, Emreay Gursoy, Ariane Mora, Francesca-Zhoufan Li'
 __author_email__ = 'ylong@caltech.edu'
 __license__ = 'GPL3'

levseq-1.2.7/levseq/coordinates.py ADDED

@@ -0,0 +1,76 @@
+import esm
+import torch
+import pandas as pd
+from sklearn.decomposition import PCA
+import os
+import argparse
+def preprocess_sequence(sequence):
+    """
+    Preprocesses the amino acid sequence by removing everything after the first '*' (stop codon).
+    """
+    if '*' in sequence:
+        sequence = sequence.split('*')[0]  # Take everything before the first '*'
+    return sequence
+def process_file(input_file, output_file=None):
+    # Load the dataset
+    data = pd.read_csv(input_file)
+    # Remove the "Unnamed: 0" column if it exists
+    if 'Unnamed: 0' in data.columns:
+        data = data.drop(columns=['Unnamed: 0'])
+    # Create the ID column as the combination of `Plate` and `Well`
+    data['ID'] = data['Plate'] + '-' + data['Well']
+    data = data[['ID'] + [col for col in data.columns if col != 'ID']]  # Reorder to make ID the first column
+    # Filter valid sequences from the `aa_sequence` column
+    valid_sequences = data['aa_sequence'].dropna()
+    valid_sequences = valid_sequences[~valid_sequences.str.contains('#N.A.#|Deletion')]
+    # Preprocess sequences to handle stop codons
+    valid_sequences = valid_sequences.apply(preprocess_sequence)
+    # Load the ESM-2 model
+    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
+    batch_converter = alphabet.get_batch_converter()
+    # Prepare sequences for embedding
+    sequences = valid_sequences.tolist()
+    sequence_names = [f"Sequence {i}" for i in range(len(sequences))]
+    batch_labels, batch_strs, batch_tokens = batch_converter(list(zip(sequence_names, sequences)))
+    # Extract embeddings
+    with torch.no_grad():
+        results = model(batch_tokens, repr_layers=[33])
+        embeddings = results["representations"][33]  # Use the top (last) layer representations
+    # Average embeddings across residues for sequence-level representation
+    sequence_embeddings = embeddings.mean(1).numpy()
+    # Dimensionality Reduction using PCA
+    pca = PCA(n_components=2)
+    xy_coordinates = pca.fit_transform(sequence_embeddings)
+    # Add x, y coordinates back to the dataframe
+    xy_df = pd.DataFrame(xy_coordinates, columns=['x_coordinate', 'y_coordinate'], index=valid_sequences.index)
+    data = pd.concat([data, xy_df], axis=1)
+    # Determine output file location
+    if output_file is None:
+        input_name, input_ext = os.path.splitext(input_file)
+        output_file = f"{input_name}_xy{input_ext}"
+    # Save the updated dataframe to a file
+    data.to_csv(output_file, index=False)
+    print(f"Processed data with x, y coordinates saved to: {output_file}")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Generate x, y coordinates for amino acid sequences")
+    parser.add_argument('input_file', type=str, help="Path to the input CSV file")
+    parser.add_argument('--output_file', type=str, default=None, help="Path to save the output CSV file (optional)")
+    args = parser.parse_args()
+    process_file(args.input_file, args.output_file)
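
Note on the sequence filtering in `coordinates.py`: the sketch below restates the filter-then-truncate step using only the standard library, so it can be tried without `esm`, `torch`, or `pandas`. The pattern string and `preprocess_sequence` logic are taken from the file above. The helper `select_valid` and the sample rows are hypothetical. The sketch assumes the pattern is applied as a regular-expression search, which is the pandas default for `str.contains`.

```python
import re

# Pattern copied from coordinates.py, applied here as a regex search.
FILTER_PATTERN = re.compile(r"#N.A.#|Deletion")


def preprocess_sequence(sequence):
    """Keep only the residues before the first '*' (stop codon)."""
    return sequence.split("*")[0]


def select_valid(sequences):
    """Return {row_index: truncated_sequence} for rows that pass the filter."""
    valid = {}
    for idx, seq in enumerate(sequences):
        if seq is None:                 # mirrors dropna()
            continue
        if FILTER_PATTERN.search(seq):  # mirrors ~str.contains(...)
            continue
        valid[idx] = preprocess_sequence(seq)
    return valid


# Hypothetical aa_sequence values, including the sentinels seen in this diff.
rows = ["MKT*AA", None, "#N.A.#", "Deletion", "#DEL#", "MKV"]
valid = select_valid(rows)
print(valid)
```

In this sketch the `#DEL#` row passes the filter, because the pattern names `Deletion` and not the `#DEL#` / `#INS#` sentinels that `run_levseq.py` emits as of this release. Whether that matters in practice depends on which CSV is fed to `coordinates.py`, which this diff does not show.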

{levseq-1.2.5 → levseq-1.2.7}/levseq/run_levseq.py RENAMED

@@ -275,11 +275,11 @@ def create_df_v(variants_df):
     )
     # Fill in 'Deletion' in 'aa_variant' column
     df_variants_.loc[
-        df_variants_["nc_variant"] == "Deletion", "aa_variant"
-    ] = "Deletion"
+        df_variants_["nc_variant"] == "#DEL#", "aa_variant"
+    ] = "#DEL#"
     df_variants_.loc[
-        df_variants_["nc_variant"] == "Insertion", "aa_variant"
-    ] = "Insertion"
+        df_variants_["nc_variant"] == "#INS#", "aa_variant"
+    ] = "#INS#"
     # Compare aa_variant with translated refseq and generate Substitutions column
     df_variants_["Substitutions"] = df_variants_.apply(get_mutations, axis=1)
@@ -291,7 +291,7 @@ def create_df_v(variants_df):
     # Fill in Deletion into Substitutions Column, keep #N.A.# unchanged
     for i in df_variants_.index:
         if df_variants_["nc_variant"].iloc[i] == "Deletion":
-            df_variants_.Substitutions.iat[i] = df_variants_.Substitutions.iat[i].replace("", "-")
+            df_variants_.Substitutions.iat[i] = df_variants_.Substitutions.iat[i].replace("", "#DEL#")
         elif df_variants_["nc_variant"].iloc[i] == "#N.A.#":
             df_variants_.Substitutions.iat[i] = "#N.A.#"
@@ -321,30 +321,36 @@ def create_df_v(variants_df):
     df_variants_["Plate"] = df_variants_["Plate"].apply(
         lambda x: f"0{x}" if len(x) == 1 else x
     )
-    # Rename columns as per the request
+    # First rename columns as before
     df_variants_.rename(columns={
         "Variant": "nucleotide_mutation",
-        "Substitutions": "amino-acid_substitutions",
+        "Substitutions": "amino_acid_substitutions",
         "nc_variant": "nt_sequence",
         "aa_variant": "aa_sequence"
-	    },inplace=True)
-	# Select the desired columns in the desired order
-    restructured_df = df_variants_[[
-            "barcode_plate",
-            "Plate",
-            "Well",
-            "Alignment Count",
-            "nucleotide_mutation",
-            "amino-acid_substitutions",
-            "Alignment Probability",
-            "Average mutation frequency",
-            "P value",
-            "P adj. value",
-            "nt_sequence",
-            "aa_sequence",
-        ]
-    ]
+        }, inplace=True)
+    # Create a copy for restructuring to avoid affecting the original
+    restructured_df = df_variants_.copy()
+    restructured_df.columns = restructured_df.columns.str.lower().str.replace('[\s-]', '_', regex=True)
+    # Fix the specific column name
+    restructured_df.columns = restructured_df.columns.str.replace('p_adj._value', 'p_adj_value')
+    # Select the desired columns in the desired order
+    restructured_df = restructured_df[[
+        "barcode_plate",
+        "plate",
+        "well",
+        "alignment_count",
+        "nucleotide_mutation",
+        "amino_acid_substitutions",
+        "alignment_probability",
+        "average_mutation_frequency",
+        "p_value",
+        "p_adj_value",
+        "nt_sequence",
+        "aa_sequence"
+    ]]
     return restructured_df, df_variants_
@@ -357,9 +363,9 @@ def create_nc_variant(variant, refseq):
     elif variant == "#PARENT#":
         return refseq
     elif "DEL" in variant:
-        return "Deletion"
+        return "#DEL#"
     elif variant == '+':
-        return "Insertion"
+        return "#INS#"
     else:
         mutations = variant.split("_")
         nc_variant = list(refseq)
@@ -459,7 +465,7 @@ def process_ref_csv(cl_args, tqdm_fn=tqdm.tqdm):
             logging.info(f"Fasta file for {name} already exists. Skipping write.")
         barcode_path = filter_bc(cl_args, name_folder, i)
-        output_dir = Path(result_folder) / "basecalled_reads"
+        output_dir = Path(result_folder) / f"{cl_args['name']}_fastq"
         output_dir.mkdir(parents=True, exist_ok=True)
         if not cl_args["skip_demultiplexing"]:
@@ -485,17 +491,25 @@ def process_ref_csv(cl_args, tqdm_fn=tqdm.tqdm):
                 continue
     variant_df.to_csv(variant_csv_path, index=False)
-    return variant_df
+    return variant_df, ref_df
 # Main function to run LevSeq and ensure saving of intermediate results if an error occurs
 def run_LevSeq(cl_args, tqdm_fn=tqdm.tqdm):
     result_folder = create_result_folder(cl_args)
+    # Ref folder for saving ref csv file
+    ref_folder = os.path.join(result_folder, "ref")
+    os.makedirs(ref_folder, exist_ok=True)
     configure_logging(result_folder)
+    logging.info("Logging configured. Starting program.")
     variant_df = pd.DataFrame(columns=["barcode_plate", "name", "refseq", "variant"])
     try:
-        variant_df = process_ref_csv(cl_args, tqdm_fn)
+        variant_df, ref_df = process_ref_csv(cl_args, tqdm_fn)
+        ref_df_path = os.path.join(ref_folder, cl_args["name"]+".csv")
+        ref_df.to_csv(ref_df_path, index=False)
         if variant_df.empty:
             logging.warning("No data found during CSV processing. The CSV is empty.")
     except Exception as e:
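
Note on the column renaming in `create_df_v`: the new code lower-cases every column name, replaces whitespace and hyphens with underscores, and then patches `p_adj._value`. The sketch below reproduces those three steps on plain strings with the standard library, to show what the selected columns end up being called. The function `normalize_column` is a hypothetical stand-in for the pandas `.str` chain and assumes the second replacement is a literal one rather than a regex.

```python
import re


def normalize_column(name):
    """Lower-case, turn whitespace and hyphens into '_', then fix 'p_adj._value'."""
    lowered = name.lower()
    snake = re.sub(r"[\s-]", "_", lowered)
    return snake.replace("p_adj._value", "p_adj_value")  # literal replacement


# Column names as they appear in the 1.2.5 selection list.
old_columns = [
    "barcode_plate", "Plate", "Well", "Alignment Count",
    "nucleotide_mutation", "amino-acid_substitutions",
    "Alignment Probability", "Average mutation frequency",
    "P value", "P adj. value", "nt_sequence", "aa_sequence",
]
new_columns = [normalize_column(c) for c in old_columns]
for old, new in zip(old_columns, new_columns):
    print(f"{old!r} -> {new!r}")
```

Downstream code that read the 1.2.5 output by the old headers (for example `"Plate"` or `"P adj. value"`) would need the new names. Similarly, `process_ref_csv` now returns a `(variant_df, ref_df)` tuple, so any caller has to unpack two values.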
