levseq 1.1.0.tar.gz → 1.2.1.tar.gz (PyPI)

Files changed (38)

{levseq-1.1.0/levseq.egg-info → levseq-1.2.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: levseq
-Version: 1.1.0
+Version: 1.2.1
 Home-page: https://github.com/fhalab/levseq/
 Author: Yueming Long, Emreay Gursoy, Ariane Mora, Francesca-Zhoufan Li
 Author-email: ylong@caltech.edu
@@ -48,7 +48,7 @@ Requires-Dist: tqdm
 In directed evolution, sequencing every variant enhances data insight and creates datasets suitable for AI/ML methods. This method is presented as an extension of the original Every Variant Sequencer using Illumina technology. With this approach, sequence variants can be generated within a day at an extremely low cost.
-![Figure 1: LevSeq Workflow](manuscript/Figures/LevSeq_Figure-1.png)
+![Figure 1: LevSeq Workflow](manuscript/figures/LevSeq_Figure-1.png)
 Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore technology. This diagram illustrates the key steps in the process, from sample preparation to data analysis and visualization.

{levseq-1.1.0 → levseq-1.2.1}/README.md RENAMED

@@ -2,7 +2,7 @@
 In directed evolution, sequencing every variant enhances data insight and creates datasets suitable for AI/ML methods. This method is presented as an extension of the original Every Variant Sequencer using Illumina technology. With this approach, sequence variants can be generated within a day at an extremely low cost.
-![Figure 1: LevSeq Workflow](manuscript/Figures/LevSeq_Figure-1.png)
+![Figure 1: LevSeq Workflow](manuscript/figures/LevSeq_Figure-1.png)
 Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore technology. This diagram illustrates the key steps in the process, from sample preparation to data analysis and visualization.

{levseq-1.1.0 → levseq-1.2.1}/levseq/__init__.py RENAMED

@@ -18,7 +18,7 @@
 __title__ = 'levseq'
 __description__ = 'LevSeq nanopore sequencing'
 __url__ = 'https://github.com/fhalab/levseq/'
-__version__ = '1.1.0'
+__version__ = '1.2.1'
 __author__ = 'Yueming Long, Emreay Gursoy, Ariane Mora, Francesca-Zhoufan Li'
 __author_email__ = 'ylong@caltech.edu'
 __license__ = 'GPL3'

{levseq-1.1.0 → levseq-1.2.1}/levseq/run_levseq.py RENAMED

@@ -221,7 +221,7 @@ def demux_fastq(file_to_fastq, result_folder, barcode_path):
         executable_path = package_root / "levseq" / "barcoding" / executable_name
     if not executable_path.exists():
         raise FileNotFoundError(f"Executable not found: {executable_path}")
-    seq_min = 150
+    seq_min = 200
     seq_max = 10000
     prompt = f"{executable_path} -f {file_to_fastq} -d {result_folder} -b {barcode_path} -w 100 -r 100 -m {seq_min} -x {seq_max}"
     subprocess.run(prompt, shell=True, check=True)
@@ -258,7 +258,7 @@ def create_df_v(variants_df):
     # Translate nc_variant to aa_variant
     df_variants_["aa_variant"] = df_variants_["nc_variant"].apply(
-        lambda x: "Deletion" if x == "Deletion" else translate(x)
+    lambda x: x if x in ["Deletion", "#N.A.#"] else translate(x)
     )
     # Fill in 'Deletion' in 'aa_variant' column
     df_variants_.loc[
@@ -284,10 +284,13 @@ def create_df_v(variants_df):
     df_variants_["Alignment Probability"] = df_variants_["Alignment Probability"].fillna(0.0)
     df_variants_["Alignment Count"] = df_variants_["Alignment Count"].fillna(0.0)
-    # Fill in Deletion into Substitutions Column
+    # Fill in Deletion into Substitutions Column, keep #N.A.# unchanged
     for i in df_variants_.index:
         if df_variants_["nc_variant"].iloc[i] == "Deletion":
             df_variants_.Substitutions.iat[i] = df_variants_.Substitutions.iat[i].replace("", "-")
+        elif df_variants_["nc_variant"].iloc[i] == "#N.A.#":
+            df_variants_.Substitutions.iat[i] = "#N.A.#"
     # Add row and columns
     Well = df_variants_["Well"].tolist()
@@ -312,40 +315,41 @@ def create_df_v(variants_df):
     df_variants_["Plate"] = df_variants_["Plate"].apply(
         lambda x: f"0{x}" if len(x) == 1 else x
     )
-	# Rename columns as per the request
-	df_variants_.rename(columns={
-		"Variant": "nucleotide_mutation",
-		"Substitutions": "amino-acid_substitutions",
-		"nc_variant": "nt_sequence",
-		"aa_variant": "aa_sequence"
+    # Rename columns as per the request
+    df_variants_.rename(columns={
+	"Variant": "nucleotide_mutation",
+	"Substitutions": "amino-acid_substitutions",
+	"nc_variant": "nt_sequence",
+	"aa_variant": "aa_sequence"
 	}, inplace=True)
 	# Select the desired columns in the desired order
-	restructured_df = df_variants_[
-		[
-			"barcode_plate",
-			"Plate",
-			"Well",
-			"Alignment Count",
-            "nucleotide_mutation",
-            "amino-acid_substitutions",
-            "Alignment Probability",
-			"Average mutation frequency",
-			"P value",
-			"P adj. value",
-			"nt_sequence",
-			"aa_sequence",
-		]
-	]
-	return restructured_df, df_variants_
+    restructured_df = df_variants_[
+            [
+                    "barcode_plate",
+                    "Plate",
+                    "Well",
+                    "Alignment Count",
+                    "nucleotide_mutation",
+                    "amino-acid_substitutions",
+                    "Alignment Probability",
+                    "Average mutation frequency",
+                    "P value",
+                    "P adj. value",
+                    "nt_sequence",
+                    "aa_sequence",
+            ]
+    ]
+    return restructured_df, df_variants_
 # Helper functions for create_df_v
 def create_nc_variant(variant, refseq):
     if isinstance(variant, np.ndarray):
         variant = variant.tolist()
     if variant == "" or pd.isnull(variant):
-        return refseq
+        return "#N.A.#"  # Return #N.A.# if variant is empty or null
     elif variant == "#PARENT#":
         return refseq
     elif "DEL" in variant:
@@ -362,19 +366,25 @@ def create_nc_variant(variant, refseq):
                 nc_variant[position] = new
         return "".join(nc_variant)
 def is_valid_dna_sequence(sequence):
     return all(nucleotide in 'ATGC' for nucleotide in sequence) and len(sequence) % 3 == 0
 def get_mutations(row):
     try:
-        refseq = row["refseq"]
+        alignment_count = row["Alignment Count"]
+        # Check if alignment_count is zero and return "#N.A.#" if true
+        if alignment_count == 0:
+            return "#N.A.#"
+        refseq = row["refseq"]
         if not is_valid_dna_sequence(refseq):
             return "Invalid refseq provided, check template sequence. Only A, T, G, C and sequence dividable by 3 are accepted."
         refseq_aa = translate(refseq)
         variant_aa = row["aa_variant"]
-        alignment_count = row["Alignment Count"]
         if variant_aa == "Deletion":
             return ""
@@ -400,6 +410,7 @@ def get_mutations(row):
         )
         raise
 # Save plate maps and CSV
 def save_platemap_to_file(heatmaps, outputdir, name, show_msa):
     if not os.path.exists(os.path.join(outputdir, "Platemaps")):
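
Review note on the run_levseq.py hunks above: empty or null variant calls now map to the "#N.A.#" sentinel instead of the reference sequence, and the translation step passes sentinels through untouched. The following is a minimal sketch of that flow, not the package's code. The names `nc_variant_for`, `aa_variant_for`, and the tiny `CODON_TABLE` are hypothetical stand-ins, substitution handling is omitted, and the assumption that the "DEL" branch yields "Deletion" is inferred from the later `== "Deletion"` checks rather than shown in the diff.

```python
# Sentinel strings that must never be translated as DNA.
SENTINELS = ("Deletion", "#N.A.#")

# Hypothetical, deliberately tiny codon table; the real package uses its own translate().
CODON_TABLE = {"ATG": "M", "GCT": "A", "GCC": "A", "AAA": "K", "TAA": "*"}


def translate(nt_seq):
    """Translate complete codons only; unknown codons become 'X'."""
    usable = len(nt_seq) - len(nt_seq) % 3
    return "".join(CODON_TABLE.get(nt_seq[i:i + 3], "X") for i in range(0, usable, 3))


def nc_variant_for(variant, refseq):
    """Simplified stand-in for create_nc_variant (1.2.1 behaviour)."""
    if variant is None or variant == "":
        return "#N.A.#"      # 1.1.0 returned refseq here
    if variant == "#PARENT#":
        return refseq
    if "DEL" in variant:
        return "Deletion"    # assumption: inferred from downstream checks
    raise NotImplementedError("substitution handling is omitted in this sketch")


def aa_variant_for(nc_variant):
    """Mirror of the updated lambda: sentinels pass through, DNA is translated."""
    return nc_variant if nc_variant in SENTINELS else translate(nc_variant)


if __name__ == "__main__":
    for call in ("", "#PARENT#", "DEL"):
        print(repr(call), "->", aa_variant_for(nc_variant_for(call, "ATGGCT")))
```

The practical effect is that a well with no call is no longer reported as if it matched the parent sequence.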

{levseq-1.1.0 → levseq-1.2.1}/levseq/seqfit.py RENAMED

@@ -124,7 +124,7 @@ def normalise_calculate_stats(processed_plate_df, value_columns, normalise='stan
         for plate in set(processed_plate_df['Plate'].values):
             for value_column in value_columns:
                 sub_df = processed_plate_df[processed_plate_df['Plate'] == plate]
-                parent_values = sub_df[sub_df['Mutations'] == parent][value_column].values
+                parent_values = sub_df[sub_df['amino-acid_substitutions'] == parent][value_column].values
                 parent_mean = np.mean(parent_values)
                 parent_sd = np.std(parent_values)
@@ -140,14 +140,14 @@ def normalise_calculate_stats(processed_plate_df, value_columns, normalise='stan
     sd_cutoff = 1.5  # The number of standard deviations we want above the parent values
     # Now for all the other mutations we want to look if they are significant, first we'll look at combinations and then individually
-    grouped_by_mutations = processed_plate_df.groupby('Mutations')
+    grouped_by_mutations = processed_plate_df.groupby('amino-acid_substitutions')
     rows = []
     for mutation, grp in tqdm(grouped_by_mutations):
         # Get the values and then do a ranksum test
         if mutation != parent:
             for value_column in normalised_value_columns:
-                parent_values = list(processed_plate_df[processed_plate_df['Mutations'] == parent][value_column].values)
+                parent_values = list(processed_plate_df[processed_plate_df['amino-acid_substitutions'] == parent][value_column].values)
                 parent_mean = np.mean(parent_values)
                 parent_sd = np.std(parent_values)
@@ -164,7 +164,7 @@ def normalise_calculate_stats(processed_plate_df, value_columns, normalise='stan
                 rows.append(
                     [value_column, mutation, len(grp), mean_vals, std_vals, median_vals, mean_vals - parent_mean, sig,
                      U1, p])
-    stats_df = pd.DataFrame(rows, columns=['value_column', 'mutation', 'number of wells with mutation', 'mean', 'std',
+    stats_df = pd.DataFrame(rows, columns=['value_column', 'amino-acid_substitutions', 'number of wells with amino-acid substitutions', 'mean', 'std',
                                            'median', 'amount greater than parent mean',
                                            f'greater than > {sd_cutoff} parent', 'man whitney U stat', 'p-value'])
     return stats_df
@@ -275,10 +275,13 @@ def work_up_lcms(
         return series
     df["Sample Vial Number"] = fill_vial_number(df["Sample Vial Number"].copy())
+    # Drop empty ones!
+    df = df[df["Sample Vial Number"] != 0]
     # Remove unwanted wells
     df = df[df["Sample Name"] != drop_string]
     # Get wells
-    df.insert(0, "Well", df["Sample Vial Number"].apply(lambda x: x.split("-")[-1]))
+    df.insert(0, "Well", df["Sample Vial Number"].apply(lambda x: str(x).split("-")[-1]))
     # Rename
     df = df.rename({"Sample Name": "Plate"}, axis="columns")
     # Create minimal DataFrame
@@ -290,7 +293,7 @@ def work_up_lcms(
         index=["Well", "Plate"], columns="Compound Name", values="Area", aggfunc="max"
     ).reset_index()
     # Get rows and columns
-    df.insert(1, "Column", df["Well"].apply(lambda x: int(x[1:])))
+    df.insert(1, "Column", df["Well"].apply(lambda x: int(x[1:]) if x[1:].isdigit() else None))
     df.insert(1, "Row", df["Well"].apply(lambda x: x[0]))
     # Set values as floats
     cols = products + substrates if substrates is not None else products
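
Review note on the two work_up_lcms hunks above: the vial number is now coerced to a string before splitting, and the column index falls back to None when the text after the row letter is not numeric. A minimal sketch of that parsing follows. The helper name `parse_well` and the example vial format "P1-A12" are assumptions for illustration only; the diff does not show the real vial format.

```python
def parse_well(vial):
    """Split a vial identifier into (well, row, column).

    Mirrors the guarded expressions in the hunks above:
    str(x).split("-")[-1] for the well, well[0] for the row, and
    int(well[1:]) only when well[1:] consists of digits.
    """
    well = str(vial).split("-")[-1]   # str() avoids AttributeError on numeric input
    row = well[0]
    column = int(well[1:]) if well[1:].isdigit() else None
    return well, row, column


if __name__ == "__main__":
    print(parse_well("P1-A12"))   # well-formed identifier (format assumed)
    print(parse_well("P1-Hx"))    # non-numeric column part
    print(parse_well(7))          # numeric input no longer raises
```

Note that a None column can still propagate into later steps, so rows with an unparsable well may need filtering downstream.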
@@ -330,9 +333,10 @@ def process_plate_files(product: str, input_csv: str) -> pd.DataFrame:
     # Load the provided CSV file
     results_df = pd.read_csv(input_csv)
-    # Extract the required columns: Plate, Well, Mutations, and nc_variant, and remove rows with '#N.A.#' and NaN values
-    filtered_df = results_df[["Plate", "Well", "Mutations", "nc_variant", "aa_variant"]]
-    filtered_df = filtered_df[(filtered_df["Mutations"] != "#N.A.#")].dropna()
+    # Extract the required columns: Plate, Well, amino-acid_substitutionss, and nt_sequence, and remove rows with '#N.A.#' and NaN values
+    # barcode_plate	Plate	Well	Alignment Count	nucleotide_amino-acid_substitutions	amino-acid_substitutions	Alignment Probability	Average amino-acid_substitutions frequency	P value	P adj. value	nt_sequence	aa_sequence
+    filtered_df = results_df[["Plate", "Well", "amino-acid_substitutions", "nt_sequence", "aa_sequence"]]
+    filtered_df = filtered_df[(filtered_df["amino-acid_substitutions"] != "#N.A.#")].dropna()
     # Extract the unique entries of Plate
     unique_plates = filtered_df["Plate"].unique()
@@ -357,14 +361,14 @@ def process_plate_files(product: str, input_csv: str) -> pd.DataFrame:
                 plate_df = plate_object.df
                 plate_df["Plate"] = plate  # Add the plate identifier for reference
-                # Merge filtered_df with plate_df to retain Mutations and nc_variant columns
+                # Merge filtered_df with plate_df to retain amino-acid_substitutionss and nt_sequence columns
                 merged_df = pd.merge(
                     plate_df, filtered_df, on=["Plate", "Well"], how="left"
                 )
                 columns_order = (
-                    ["Plate", "Well", "Row", "Column", "Mutations"]
+                    ["Plate", "Well", "Row", "Column", "amino-acid_substitutions"]
                     + product
-                    + ["nc_variant", "aa_variant"]
+                    + ["nt_sequence", "aa_sequence"]
                 )
                 merged_df = merged_df[columns_order]
                 processed_data.append(merged_df)
@@ -374,13 +378,13 @@ def process_plate_files(product: str, input_csv: str) -> pd.DataFrame:
         processed_df = pd.concat(processed_data, ignore_index=True)
     else:
         processed_df = pd.DataFrame(
-            columns=["Plate", "Well", "Row", "Column", "Mutations"]
+            columns=["Plate", "Well", "Row", "Column", "amino-acid_substitutions"]
             + product
-            + ["nc_variant", "aa_variant"]
+            + ["nt_sequence", "aa_sequence"]
         )
     # Ensure all entries in 'Mutations' are treated as strings
-    processed_df["Mutations"] = processed_df["Mutations"].astype(str)
+    processed_df["amino-acid_substitutions"] = processed_df["amino-acid_substitutions"].astype(str)
     # Remove any rows with empty values
     processed_df = processed_df.dropna()
@@ -420,19 +424,19 @@ def match_plate2parent(df: pd.DataFrame, parent_dict: Optional[Dict] = None) ->
     if parent_dict is None:
-        # add aa_variant column if not present by translating from the nc_variant column
-        if "aa_variant" not in df.columns:
-            df["aa_variant"] = df["nc_variant"].apply(
-                Bio.sequence.Sequence(df["nc_variant"]).translate
+        # add aa_sequence column if not present by translating from the nt_sequence column
+        if "aa_sequence" not in df.columns:
+            df["aa_sequence"] = df["nt_sequence"].apply(
+                Bio.sequence.Sequence(df["nt_sequence"]).translate
             )
         # get all the parents from the df
-        parents = df[df["Mutations"] == "#PARENT#"].reset_index(drop=True).copy()
+        parents = df[df["amino-acid_substitutions"] == "#PARENT#"].reset_index(drop=True).copy()
-        # get the parent nc_variant
+        # get the parent nt_sequence
         parent_aas = (
-            df[df["Mutations"] == "#PARENT#"][["Mutations", "aa_variant"]]
-            .drop_duplicates()["aa_variant"]
+            df[df["amino-acid_substitutions"] == "#PARENT#"][["amino-acid_substitutions", "aa_sequence"]]
+            .drop_duplicates()["aa_sequence"]
             .tolist()
         )
@@ -440,7 +444,7 @@ def match_plate2parent(df: pd.DataFrame, parent_dict: Optional[Dict] = None) ->
     # get the plate names for each parent
     parent2plate = {
-        p_name: df[df["aa_variant"] == p_seq]["Plate"].unique().tolist()
+        p_name: df[df["aa_sequence"] == p_seq]["Plate"].unique().tolist()
         for p_name, p_seq in parent_dict.items()
     }
@@ -516,7 +520,7 @@ def norm2parent(plate_df: pd.DataFrame) -> pd.DataFrame:
     # get all the parents from the df
     parents = (
-        plate_df[plate_df["Mutations"] == "#PARENT#"].reset_index(drop=True).copy()
+        plate_df[plate_df["amino-acid_substitutions"] == "#PARENT#"].reset_index(drop=True).copy()
     )
     filtered_parents = (
         parents.drop(index=detect_outliers_iqr(parents["pdt"]))
@@ -610,7 +614,7 @@ def get_single_ssm_site_df(
     # get parents from those plates
     site_parent_df = (
         single_ssm_df[
-            (single_ssm_df["Mutations"] == "#PARENT#")
+            (single_ssm_df["amino-acid_substitutions"] == "#PARENT#")
             & (single_ssm_df["Plate"].isin(site_df["Plate"].unique()))
         ]
         .reset_index(drop=True)
@@ -1145,12 +1149,12 @@ def gen_seqfitvis(
     df = pd.read_csv(seqfit_path)
     # ignore deletion meaning "Mutations" == "-"
-    df = df[df["Mutations"] != "-"].copy()
+    df = df[df["amino-acid_substitutions"] != "-"].copy()
     # count number of sites mutated and append mutation details
     # df["num_sites"] = df['Mutations'].apply(lambda x: 0 if x == "#PARENT#" else len(x.split("_")))
     # Apply function to the column
-    df[["num_sites", "mut_dets"]] = df["Mutations"].apply(process_mutation)
+    df[["num_sites", "mut_dets"]] = df["amino-acid_substitutions"].apply(process_mutation)
     # apply the norm function to all plates
     df = df.groupby("Plate").apply(norm2parent).reset_index(drop=True).copy()
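
Review note on the seqfit.py changes above: every lookup of "Mutations", "nc_variant", and "aa_variant" now uses "amino-acid_substitutions", "nt_sequence", and "aa_sequence", so CSV files with the older headers would fail with a KeyError in 1.2.1. One possible workaround is to rewrite the header row before loading. The sketch below uses only the standard library; `LEGACY_TO_CURRENT`, `upgrade_header`, and `upgrade_csv_text` are hypothetical names and are not part of levseq.

```python
import csv
import io

# Old column name -> new column name, taken from the seqfit.py hunks above.
LEGACY_TO_CURRENT = {
    "Mutations": "amino-acid_substitutions",
    "nc_variant": "nt_sequence",
    "aa_variant": "aa_sequence",
}


def upgrade_header(columns):
    """Return the column list with legacy names replaced; others are unchanged."""
    return [LEGACY_TO_CURRENT.get(name, name) for name in columns]


def upgrade_csv_text(text):
    """Rewrite only the header row of CSV text; data rows are copied as-is."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return text
    rows[0] = upgrade_header(rows[0])
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    legacy = "Plate,Well,Mutations,nc_variant,aa_variant\n01,A1,#PARENT#,ATGGCT,MA\n"
    print(upgrade_csv_text(legacy))
```

Only the header is touched, so sentinel values such as "#PARENT#" and "#N.A.#" in the data rows are preserved.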

{levseq-1.1.0 → levseq-1.2.1}/levseq/visualization.py RENAMED

@@ -142,7 +142,7 @@ def _make_platemap(df, title, cmap=None):
             'Row': list('ABCDEFGH'),
             'Column': [str(i) for i in range(1, 13)],
             'logseqdepth': [0] * 96,
-            'Substitutions': [''] * 96,
+            'amino-acid_substitutions': [''] * 96,
             'Alignment Count': [0] * 96,
             'Alignment Probability': [0] * 96
         })
@@ -195,7 +195,7 @@ def _make_platemap(df, title, cmap=None):
     # add tooltips
     tooltips = [
-        ("Substitutions", "@Substitutions"),
+        ("amino-acid_substitutions", "@amino-acid_substitutions"),
         ("Alignment Count", "@Alignment Count"),
         ("Alignment Probability", "@Alignment Probability"),
     ]
@@ -211,7 +211,7 @@ def _make_platemap(df, title, cmap=None):
             kdims=["Column", "Row"],
             vdims=[
                 "logseqdepth",
-                "Substitutions",
+                "amino-acid_substitutions",
                 "Alignment Count",
                 "Alignment Probability",
             ],
@@ -300,7 +300,7 @@ def _make_platemap(df, title, cmap=None):
         return new_line_mutations
     _df = df.copy()
-    _df["Labels"] = _df["Substitutions"].apply(split_variant_labels)
+    _df["Labels"] = _df["amino-acid_substitutions"].apply(split_variant_labels)
     # Set the font size based on if #PARENT# is in a well and num of mutations
     max_num_mutations = _df["Labels"].apply(lambda x: len(x.split("\n"))).max()

{levseq-1.1.0 → levseq-1.2.1/levseq.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: levseq
-Version: 1.1.0
+Version: 1.2.1
 Home-page: https://github.com/fhalab/levseq/
 Author: Yueming Long, Emreay Gursoy, Ariane Mora, Francesca-Zhoufan Li
 Author-email: ylong@caltech.edu
@@ -48,7 +48,7 @@ Requires-Dist: tqdm
 In directed evolution, sequencing every variant enhances data insight and creates datasets suitable for AI/ML methods. This method is presented as an extension of the original Every Variant Sequencer using Illumina technology. With this approach, sequence variants can be generated within a day at an extremely low cost.
-![Figure 1: LevSeq Workflow](manuscript/Figures/LevSeq_Figure-1.png)
+![Figure 1: LevSeq Workflow](manuscript/figures/LevSeq_Figure-1.png)
 Figure 1: Overview of the LevSeq variant sequencing workflow using Nanopore technology. This diagram illustrates the key steps in the process, from sample preparation to data analysis and visualization.