PyamilySeq 1.1.0.tar.gz → 1.1.1.tar.gz

Files changed (27)

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: PyamilySeq
-Version: 1.1.0
+Version: 1.1.1
 Summary: PyamilySeq - A a tool to investigate sequence-based gene groups identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
 Home-page: https://github.com/NickJD/PyamilySeq
 Author: Nicholas Dimonaco
@@ -45,7 +45,7 @@ To update to the newest version add '-U' to end of the pip install command.
 ```commandline
 usage: PyamilySeq.py [-h] {Full,Partial} ...
-PyamilySeq v1.1.0: A tool for gene clustering and analysis.
+PyamilySeq v1.1.1: A tool for gene clustering and analysis.
 positional arguments:
   {Full,Partial}  Choose a mode: 'Full' or 'Partial'.
@@ -75,7 +75,7 @@ Escherichia_coli_110957|ENSB_TIZS9kbTvShDvyX	Escherichia_coli_110957|ENSB_TIZS9k
 ```
 ### Example output:
 ```
-Running PyamilySeq v1.1.0
+Running PyamilySeq v1.1.1
 Calculating Groups
 Number of Genomes: 10
 Gene Groups
@@ -220,7 +220,7 @@ Seq-Combiner -input_dir .../test_data/genomes -name_split_gff .gff3 -output_dir
 ```
 usage: Seq_Combiner.py [-h] -input_dir INPUT_DIR -input_type {separate,combined,fasta} [-name_split_gff NAME_SPLIT_GFF] [-name_split_fasta NAME_SPLIT_FASTA] -output_dir OUTPUT_DIR -output_name OUTPUT_FILE [-gene_ident GENE_IDENT] [-translate] [-v]
-PyamilySeq v1.1.0: Seq-Combiner - A tool to extract sequences from GFF/FASTA files and prepare them for PyamilySeq.
+PyamilySeq v1.1.1: Seq-Combiner - A tool to extract sequences from GFF/FASTA files and prepare them for PyamilySeq.
 options:
   -h, --help            show this help message and exit
@@ -263,7 +263,7 @@ usage: Group_Splitter.py [-h] -input_fasta INPUT_FASTA -sequence_type {AA,DNA}
                          [-M CLUSTERING_MEMORY] [-no_delete_temp_files]
                          [-verbose] [-v]
-PyamilySeq v1.1.0: Group-Splitter - A tool to split multi-copy gene groups
+PyamilySeq v1.1.1: Group-Splitter - A tool to split multi-copy gene groups
 identified by PyamilySeq.
 options:
@@ -316,7 +316,7 @@ Cluster-Summary -genome_num 10 -input_clstr .../test_data/species/E-coli/E-coli_
 usage: Cluster_Summary.py [-h] -input_clstr INPUT_CLSTR -output OUTPUT -genome_num GENOME_NUM
                           [-output_dir OUTPUT_DIR] [-verbose] [-v]
-PyamilySeq v1.1.0: Cluster-Summary - A tool to summarise CD-HIT clustering files.
+PyamilySeq v1.1.1: Cluster-Summary - A tool to summarise CD-HIT clustering files.
 options:
   -h, --help            show this help message and exit

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/README.md RENAMED

@@ -29,7 +29,7 @@ To update to the newest version add '-U' to end of the pip install command.
 ```commandline
 usage: PyamilySeq.py [-h] {Full,Partial} ...
-PyamilySeq v1.1.0: A tool for gene clustering and analysis.
+PyamilySeq v1.1.1: A tool for gene clustering and analysis.
 positional arguments:
   {Full,Partial}  Choose a mode: 'Full' or 'Partial'.
@@ -59,7 +59,7 @@ Escherichia_coli_110957|ENSB_TIZS9kbTvShDvyX	Escherichia_coli_110957|ENSB_TIZS9k
 ```
 ### Example output:
 ```
-Running PyamilySeq v1.1.0
+Running PyamilySeq v1.1.1
 Calculating Groups
 Number of Genomes: 10
 Gene Groups
@@ -204,7 +204,7 @@ Seq-Combiner -input_dir .../test_data/genomes -name_split_gff .gff3 -output_dir
 ```
 usage: Seq_Combiner.py [-h] -input_dir INPUT_DIR -input_type {separate,combined,fasta} [-name_split_gff NAME_SPLIT_GFF] [-name_split_fasta NAME_SPLIT_FASTA] -output_dir OUTPUT_DIR -output_name OUTPUT_FILE [-gene_ident GENE_IDENT] [-translate] [-v]
-PyamilySeq v1.1.0: Seq-Combiner - A tool to extract sequences from GFF/FASTA files and prepare them for PyamilySeq.
+PyamilySeq v1.1.1: Seq-Combiner - A tool to extract sequences from GFF/FASTA files and prepare them for PyamilySeq.
 options:
   -h, --help            show this help message and exit
@@ -247,7 +247,7 @@ usage: Group_Splitter.py [-h] -input_fasta INPUT_FASTA -sequence_type {AA,DNA}
                          [-M CLUSTERING_MEMORY] [-no_delete_temp_files]
                          [-verbose] [-v]
-PyamilySeq v1.1.0: Group-Splitter - A tool to split multi-copy gene groups
+PyamilySeq v1.1.1: Group-Splitter - A tool to split multi-copy gene groups
 identified by PyamilySeq.
 options:
@@ -300,7 +300,7 @@ Cluster-Summary -genome_num 10 -input_clstr .../test_data/species/E-coli/E-coli_
 usage: Cluster_Summary.py [-h] -input_clstr INPUT_CLSTR -output OUTPUT -genome_num GENOME_NUM
                           [-output_dir OUTPUT_DIR] [-verbose] [-v]
-PyamilySeq v1.1.0: Cluster-Summary - A tool to summarise CD-HIT clustering files.
+PyamilySeq v1.1.1: Cluster-Summary - A tool to summarise CD-HIT clustering files.
 options:
   -h, --help            show this help message and exit

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/setup.cfg RENAMED

@@ -1,6 +1,6 @@
 [metadata]
 name = PyamilySeq
-version = v1.1.0
+version = v1.1.1
 license_files = LICENSE
 author = Nicholas Dimonaco
 author_email = nicholas@dimonaco.co.uk

pyamilyseq-1.1.1/src/PyamilySeq/Cluster_Compare.py ADDED

@@ -0,0 +1,108 @@
+import argparse
+from collections import defaultdict
+def read_cd_hit_output(clstr_file):
+    """
+    Reads a CD-HIT .clstr file and extracts sequence clusters.
+    Returns a dictionary where keys are sequence headers and values are cluster IDs.
+    """
+    seq_to_cluster = {}  # Maps sequence header -> cluster ID
+    cluster_id = 0  # Generic ID for clusters (since CD-HIT names don't matter)
+    with open(clstr_file, 'r') as f:
+        for line in f:
+            line = line.strip()
+            if line.startswith(">Cluster"):
+                cluster_id += 1  # Increment cluster ID
+            elif line:
+                parts = line.split('\t')
+                if len(parts) > 1:
+                    seq_header = parts[1].split('>')[1].split('...')[0]  # Extract sequence header
+                    seq_to_cluster[seq_header] = cluster_id
+    return seq_to_cluster
+def compare_cd_hit_clusters(file1, file2, output_file):
+    """
+    Compares two CD-HIT .clstr files to check if clusters are the same.
+    Writes the results to a TSV file.
+    """
+    # Read both clustering files
+    clusters1 = read_cd_hit_output(file1)
+    clusters2 = read_cd_hit_output(file2)
+    # Reverse mappings: cluster ID -> list of sequences
+    grouped_clusters1 = defaultdict(set)
+    grouped_clusters2 = defaultdict(set)
+    for seq, cluster_id in clusters1.items():
+        grouped_clusters1[cluster_id].add(seq)
+    for seq, cluster_id in clusters2.items():
+        grouped_clusters2[cluster_id].add(seq)
+    # Initialize metrics counters
+    cluster_name_changes = 0
+    sequence_shifts = 0
+    only_in_file1 = defaultdict(list)
+    only_in_file2 = defaultdict(list)
+    cluster_mismatches = defaultdict(list)
+    # Prepare data for the TSV output
+    tsv_data = []
+    # Track changes
+    for seq, cluster_id in clusters1.items():
+        if seq not in clusters2:
+            only_in_file1[cluster_id].append(seq)
+            tsv_data.append([seq, cluster_id, "NA", "Only in file1"])
+        elif clusters2[seq] != cluster_id:
+            # Sequence shifts: sequence in different clusters between files
+            sequence_shifts += 1
+            cluster_mismatches[seq].append((cluster_id, clusters2[seq]))
+            tsv_data.append([seq, cluster_id, clusters2[seq], "Mismatch"])
+    for seq, cluster_id in clusters2.items():
+        if seq not in clusters1:
+            only_in_file2[cluster_id].append(seq)
+            tsv_data.append([seq, "NA", cluster_id, "Only in file2"])
+        elif clusters1[seq] != cluster_id:
+            # Sequence shifts: sequence in different clusters between files
+            sequence_shifts += 1
+            cluster_mismatches[seq].append((clusters1[seq], cluster_id))
+            tsv_data.append([seq, clusters1[seq], cluster_id, "Mismatch"])
+    # Track cluster name changes (same sequences in different clusters)
+    for cluster_id1, seqs1 in grouped_clusters1.items():
+        for cluster_id2, seqs2 in grouped_clusters2.items():
+            if seqs1 == seqs2 and cluster_id1 != cluster_id2:
+                cluster_name_changes += 1
+                for seq in seqs1:
+                    tsv_data.append([seq, cluster_id1, cluster_id2, "Cluster name change"])
+    # Print metrics
+    print("🔢 Clustering Comparison Metrics:")
+    print(f"Cluster name changes: {cluster_name_changes}")
+    print(f"Sequence shifts (sequences assigned to different clusters): {sequence_shifts}")
+    print(f"Sequences only in the first file: {len(only_in_file1)}")
+    print(f"Sequences only in the second file: {len(only_in_file2)}")
+    print()
+    # Write the results to a TSV file
+    with open(output_file, 'w') as out_file:
+        out_file.write("Sequence\tCluster ID (File 1)\tCluster ID (File 2)\tChange Type\n")
+        for row in tsv_data:
+            out_file.write("\t".join(map(str, row)) + "\n")
+    print(f"✅ Results have been written to {output_file}")
+def main():
+    parser = argparse.ArgumentParser(description="Compare two CD-HIT .clstr files to check for clustering consistency.")
+    parser.add_argument("-file1", required=True, help="First CD-HIT .clstr file")
+    parser.add_argument("-file2", required=True, help="Second CD-HIT .clstr file")
+    parser.add_argument("-output", required=True, help="Output file (TSV format)")
+    args = parser.parse_args()
+    compare_cd_hit_clusters(args.file1, args.file2, args.output)
+if __name__ == "__main__":
+    main()
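
The added script compares the numeric cluster IDs it assigns in order of appearance, so two runs that group sequences identically but list clusters in a different order would be reported as mismatches. A minimal sketch of one alternative, comparing clusters by membership instead of by ID, is shown below. The helper names `partition` and `compare_partitions` and the toy input are hypothetical and are not part of PyamilySeq.

```python
from collections import defaultdict


def partition(seq_to_cluster):
    """Turn a {sequence: cluster_id} mapping into a set of frozensets of members."""
    groups = defaultdict(set)
    for seq, cluster_id in seq_to_cluster.items():
        groups[cluster_id].add(seq)
    return {frozenset(members) for members in groups.values()}


def compare_partitions(clusters1, clusters2):
    """Return (shared, only_in_1, only_in_2) clusters, ignoring cluster IDs."""
    p1, p2 = partition(clusters1), partition(clusters2)
    return p1 & p2, p1 - p2, p2 - p1


# Hypothetical toy input: a/b are grouped the same way in both runs,
# while c and d are separate in run 1 and merged in run 2.
run1 = {"a": 1, "b": 1, "c": 2, "d": 3}
run2 = {"a": 7, "b": 7, "c": 8, "d": 8}

shared, only1, only2 = compare_partitions(run1, run2)
print(len(shared), len(only1), len(only2))
```

Because the comparison works on sets of members, renumbered clusters count as shared, and only genuine changes in grouping appear in the two difference sets.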

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/Cluster_Summary.py RENAMED

@@ -1,33 +1,32 @@
 import argparse
-from collections import OrderedDict
-from collections import defaultdict
+from collections import OrderedDict, defaultdict
 try:
     from .constants import *
     from .utils import *
-except (ModuleNotFoundError, ImportError, NameError, TypeError) as error:
+except (ModuleNotFoundError, ImportError, NameError, TypeError):
     from constants import *
     from utils import *
 def categorise_percentage(percent):
     """Categorise the percentage of genomes with multicopy genes."""
-    if 20 <= percent < 40:
-        return "20-40%"
-    elif 40 <= percent < 60:
-        return "40-60%"
-    elif 60 <= percent < 80:
-        return "60-80%"
-    elif 80 <= percent < 95:
-        return "80-95%"
-    elif 95 <= percent < 99:
-        return "95-99%"
-    elif 99 <= percent <= 100:
-        return "99-100%"
+    categories = {
+        (20, 40): "20-40%",
+        (40, 60): "40-60%",
+        (60, 80): "60-80%",
+        (80, 95): "80-95%",
+        (95, 99): "95-99%",
+        (99, 100): "99-100%"
+    }
+    for (low, high), label in categories.items():
+        if low <= percent < high:
+            return label
     return None
-# Read cd-hit .clstr file and extract information
 def read_cd_hit_output(clustering_output):
+    """Parse CD-HIT .cluster file and extract clustering information."""
     clusters = OrderedDict()
     with open(clustering_output, 'r') as f:
@@ -42,10 +41,8 @@ def read_cd_hit_output(clustering_output):
                 parts = line.split('\t')
                 if len(parts) > 1:
                     clustered_info = parts[1]
-                    length = clustered_info.split(',')[0]
-                    length = int(''.join(c for c in length if c.isdigit()))
-                    clustered_header = clustered_info.split('>')[1].split('...')[0]
-                    clustered_header = '>' + clustered_header
+                    length = int(''.join(c for c in clustered_info.split(',')[0] if c.isdigit()))
+                    clustered_header = '>' + clustered_info.split('>')[1].split('...')[0]
                     if 'at ' in clustered_info and '%' in clustered_info.split('at ')[-1]:
                         percent_identity = extract_identity(clustered_info)
@@ -63,12 +60,14 @@ def read_cd_hit_output(clustering_output):
     return clusters
-# Summarise the information for each cluster
-def summarise_clusters(options,clusters, output):
-    multicopy_groups = defaultdict(int)  # Counter for groups with multicopy genes
+def summarise_clusters(options, clusters, output):
+    """Generate a detailed cluster summary report."""
+    multicopy_groups = defaultdict(int)  # Counter for clusters with multicopy genes
     with open(output, 'w') as out_f:
-        out_f.write("Cluster_ID\tNum_Sequences\tAvg_Length\tLength_Range\tAvg_Identity\tIdentity_Range\n")
+        out_f.write(
+            "Cluster_ID\tNum_Sequences\tNum_Genomes\tAvg_Length\tLength_Range\tAvg_Identity\tIdentity_Range\tGenomes_With_Multiple_Genes\tMulticopy_Percentage\n"
+        )
         for cluster_id, seqs in clusters.items():
             num_seqs = len(seqs)
@@ -81,82 +80,78 @@ def summarise_clusters(options,clusters, output):
             avg_identity = sum(identities) / num_seqs if num_seqs > 0 else 0
             identity_range = f"{min(identities):.2f}-{max(identities):.2f}" if num_seqs > 0 else "N/A"
-            out_f.write(
-                f"{cluster_id}\t{num_seqs}\t{avg_length:.2f}\t{length_range}\t{avg_identity:.2f}\t{identity_range}\n")
-            # Count genomes with more than one gene
+            # Count genomes in cluster
             genome_to_gene_count = defaultdict(int)
             for seq in seqs:
-                genome = seq['header'].split('|')[0].replace('>','')
+                genome = seq['header'].split('|')[0].replace('>', '')
                 genome_to_gene_count[genome] += 1
+            num_genomes = len(genome_to_gene_count)
             num_genomes_with_multiple_genes = sum(1 for count in genome_to_gene_count.values() if count > 1)
+            multicopy_percentage = (num_genomes_with_multiple_genes / options.genome_num) * 100 if options.genome_num > 0 else 0
-            # Calculate the percentage of genomes with multicopy genes
-            multicopy_percentage = (num_genomes_with_multiple_genes / options.genome_num) * 100
+            # Categorize multicopy percentage
             category = categorise_percentage(multicopy_percentage)
             if category:
                 multicopy_groups[category] += 1
-        # Define the order of categories for printout
-        category_order = ["20-40%", "40-60%", "60-80%", "80-95%", "95-99%", "99-100%"]
+            # Write detailed output for each cluster
+            out_f.write(
+                f"{cluster_id}\t{num_seqs}\t{num_genomes}\t{avg_length:.2f}\t{length_range}\t{avg_identity:.2f}\t{identity_range}\t"
+                f"{num_genomes_with_multiple_genes}\t{multicopy_percentage:.2f}\n"
+            )
-        # Print the number of clusters with multicopy genes in each percentage range, in the correct order
+        # Define order for multicopy statistics output
+        category_order = ["20-40%", "40-60%", "60-80%", "80-95%", "95-99%", "99-100%"]
         for category in category_order:
-            print(f"Number of clusters with multicopy genes in {category} range: {multicopy_groups[category]}")
+            print(f"Clusters with multicopy genes in {category} range: {multicopy_groups[category]}")
-# Main function to parse arguments and run the analysis
 def main():
-    parser = argparse.ArgumentParser(description='PyamilySeq ' + PyamilySeq_Version + ': Cluster-Summary - A tool to summarise CD-HIT clustering files.')
-    ### Required Arguments
-    required = parser.add_argument_group('Required Parameters')
-    required.add_argument('-input_clstr', action="store", dest="input_clstr",
-                          help='Input CD-HIT .clstr file',
-                          required=True)
-    required.add_argument('-output', action="store", dest="output",
-                          help="Output TSV file to store cluster summaries - Will add '.tsv' if not provided by user",
-                          required=True)
-    required.add_argument('-genome_num', action='store', dest='genome_num', type=int,
-                          help='The total number of genomes must be provide',
-                          required=True)
-    #required.add_argument("-clustering_format", action="store", dest="clustering_format", choices=['CD-HIT','TSV','CSV'],
-    #                      help="Clustering format to use: CD-HIT or TSV (MMseqs2, BLAST, DIAMOND) / CSV edge-list file (Node1\tNode2).",
-    #                      required=True)
+    """Main function to parse arguments and process clustering files."""
+    parser = argparse.ArgumentParser(
+        description='PyamilySeq ' + PyamilySeq_Version + ': Cluster-Summary - A tool to summarise CD-HIT clustering files.')
+    # Required Arguments
+    required = parser.add_argument_group('Required Parameters')
+    required.add_argument('-input_cluster', action="store", dest="input_cluster", required=True,
+                          help='Input CD-HIT .cluster file')
+    required.add_argument('-output', action="store", dest="output", required=True,
+                          help="Output TSV file to store cluster summaries - Will add '.tsv' if not provided by user")
+    required.add_argument('-genome_num', action='store', dest='genome_num', type=int, required=True,
+                          help='Total number of genomes in dataset')
+    # Optional Arguments
     optional = parser.add_argument_group('Optional Arguments')
     optional.add_argument('-output_dir', action="store", dest="output_dir",
-                          help='Default: Same as input file',
-                          required=False)
+                          help='Default: Same as input file', required=False)
     misc = parser.add_argument_group("Misc Parameters")
     misc.add_argument("-verbose", action="store_true", dest="verbose",
-                      help="Print verbose output.",
-                      required=False)
+                      help="Print verbose output.", required=False)
     misc.add_argument("-v", "--version", action="version",
                       version=f"PyamilySeq: Group-Summary version {PyamilySeq_Version} - Exiting",
                       help="Print out version number and exit")
     options = parser.parse_args()
-    print("Running PyamilySeq " + PyamilySeq_Version+ ": Group-Summary ")
+    print("Running PyamilySeq " + PyamilySeq_Version + ": Group-Summary ")
-    ### File handling
-    options.input_clstr = fix_path(options.input_clstr)
+    # File handling
+    options.input_cluster = fix_path(options.input_cluster)
     if options.output_dir is None:
-        options.output_dir = os.path.dirname(os.path.abspath(options.input_clstr))
+        options.output_dir = os.path.dirname(os.path.abspath(options.input_cluster))
     output_path = os.path.abspath(options.output_dir)
     if not os.path.exists(output_path):
         os.makedirs(output_path)
     output_name = options.output
     if not output_name.endswith('.tsv'):
         output_name += '.tsv'
     output_file_path = os.path.join(output_path, output_name)
-    ###
-    clusters = read_cd_hit_output(options.input_clstr)
-    summarise_clusters(options,clusters, output_file_path)
+    # Process clusters and generate summary
+    clusters = read_cd_hit_output(options.input_cluster)
+    summarise_clusters(options, clusters, output_file_path)
 if __name__ == "__main__":
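
In the rewritten `categorise_percentage`, every bucket is tested with `low <= percent < high`, so as written a value of exactly 100 matches no bucket and returns `None`, whereas the 1.1.0 version used `99 <= percent <= 100` for the top range. The sketch below keeps the table-driven form but closes the upper edge of the last bucket. The names `BUCKETS` and `categorise_percentage_inclusive` are hypothetical and do not appear in the package.

```python
# Hypothetical variant, not part of PyamilySeq: (low, high, label) per bucket.
BUCKETS = [
    (20, 40, "20-40%"),
    (40, 60, "40-60%"),
    (60, 80, "60-80%"),
    (80, 95, "80-95%"),
    (95, 99, "95-99%"),
    (99, 100, "99-100%"),
]


def categorise_percentage_inclusive(percent):
    """Return the bucket label for percent; the top bucket includes 100."""
    last_index = len(BUCKETS) - 1
    for index, (low, high, label) in enumerate(BUCKETS):
        # Half-open ranges everywhere except the final bucket, which is closed.
        if low <= percent < high or (index == last_index and percent == high):
            return label
    return None


print(categorise_percentage_inclusive(100))
print(categorise_percentage_inclusive(19.9))
```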

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/PyamilySeq.py RENAMED

@@ -20,8 +20,8 @@ def run_cd_hit(options, input_file, clustering_output, clustering_mode):
         clustering_mode,
         '-i', input_file,
         '-o', clustering_output,
-        '-c', str(options.pident),
-        '-s', str(options.len_diff),
+        '-c', f"{float(options.pident):.2f}",
+        '-s', f"{float(options.len_diff):.2f}",
         '-T', str(options.threads),
         '-M', str(options.mem),
         '-d', "0",
@@ -62,10 +62,11 @@ def main():
                              help="Clustering mode: 'DNA' or 'AA'.")
     full_parser.add_argument("-gene_ident", default="CDS", required=False,
                              help="Gene identifiers to extract sequences (e.g., 'CDS, tRNA').")
-    full_parser.add_argument("-c", type=float, dest="pident", default=0.90, required=False,
+    full_parser.add_argument("-c", type=str, dest="pident", default="0.90", required=False,
                              help="Sequence identity threshold for clustering (default: 0.90) - CD-HIT parameter '-c'.")
-    full_parser.add_argument("-s", type=float, dest="len_diff", default=0.80, required=False,
+    full_parser.add_argument("-s", type=str, dest="len_diff", default="0.80", required=False,
                              help="Length difference threshold for clustering (default: 0.80) - CD-HIT parameter '-s'.")
     full_parser.add_argument("-fast_mode", action="store_true", required=False,
                              help="Enable fast mode for CD-HIT (not recommended) - CD-HIT parameter '-g'.")
@@ -98,7 +99,7 @@ def main():
         subparser.add_argument("-write_individual_groups", action="store_true", dest="write_individual_groups", required=False,
                                help="Output individual FASTA files for each group.")
         subparser.add_argument("-align", action="store_true", dest="align_core", required=False,
-                               help="Align and concatenate sequences for 'core' groups specified with '-w'.")
+                               help="Align and concatenate sequences for 'core' groups (those in 99-100% of genomes).")
         subparser.add_argument("-align_aa", action="store_true", required=False,
                                help="Align sequences as amino acids.")
         subparser.add_argument("-no_gpa", action="store_false", dest="gene_presence_absence_out", required=False,
@@ -187,13 +188,13 @@ def main():
             elif options.sequence_type == 'AA':
                 clustering_mode = 'cd-hit'
             if options.fast_mode == True:
-                options.fast_mode = 0
+                options.fast_mode = 1
                 if options.verbose == True:
                     print("Running CD-HIT in fast mode.")
             else:
-                options.fast_mode = 1
+                options.fast_mode = 0
                 if options.verbose == True:
-                    print("Running CD-HIT in slow mode.")
+                    print("Running CD-HIT in accurate mode.")
         else:
             exit("cd-hit is not installed. Please install cd-hit to proceed.")
@@ -239,10 +240,10 @@ def main():
             translate = False
             file_to_cluster = combined_out_file
         if options.input_type == 'separate':
-            read_separate_files(options.input_dir, options.name_split_gff, options.name_split_fasta, options.gene_ident, combined_out_file, translate)
+            read_separate_files(options.input_dir, options.name_split_gff, options.name_split_fasta, options.gene_ident, combined_out_file, translate, False)
             run_cd_hit(options, file_to_cluster, clustering_output, clustering_mode)
         elif options.input_type == 'combined':
-            read_combined_files(options.input_dir, options.name_split_gff, options.gene_ident, combined_out_file, translate)
+            read_combined_files(options.input_dir, options.name_split_gff, options.gene_ident, combined_out_file, translate, False)
             run_cd_hit(options, file_to_cluster, clustering_output, clustering_mode)
         elif options.input_type == 'fasta':
             combined_out_file = options.input_fasta
@@ -281,6 +282,8 @@ def main():
         clustering_options = clustering_options()
     elif options.run_mode == 'Partial':
+        if not os.path.exists(output_path):
+            os.makedirs(output_path)
         class clustering_options:
             def __init__(self):
                 self.run_mode = options.run_mode
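
The first two hunks for this file change `-c` and `-s` to be read as strings and then formatted to two decimal places before being handed to CD-HIT. A minimal sketch of that normalisation step follows. The helper name `format_threshold` and the 0 to 1 range check are assumptions made for illustration and are not taken from the package.

```python
def format_threshold(value):
    """Format a threshold given as str or float to two decimals, e.g. '0.9' -> '0.90'."""
    number = float(value)  # raises ValueError for non-numeric input
    # Assumed sanity check for this sketch only: thresholds are fractions.
    if not 0.0 <= number <= 1.0:
        raise ValueError(f"threshold out of range: {value!r}")
    return f"{number:.2f}"


print(format_threshold("0.9"))
print(format_threshold(0.8))
```

One consequence of the two-decimal format is that a threshold supplied with more than two decimal places is rounded before it reaches the clustering command.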

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/PyamilySeq_Species.py RENAMED

@@ -120,12 +120,14 @@ def calc_Second_only_core(cluster, Second_num, groups, cores):
 #@profile
 def calc_only_Second_only_core(cluster, Second_num, groups, cores): # only count the true storf onlies
-    groups_as_list = list(groups.values())
-    for idx in (idx for idx, (sec, fir) in enumerate(groups_as_list) if sec <= Second_num <= fir):
-        res = idx
-    family_group = list(groups)[res]
-    cores['only_Second_core_' + family_group].append(cluster)
+    try:
+        groups_as_list = list(groups.values())
+        for idx in (idx for idx, (sec, fir) in enumerate(groups_as_list) if sec <= Second_num <= fir):
+            res = idx
+        family_group = list(groups)[res]
+        cores['only_Second_core_' + family_group].append(cluster)
+    except UnboundLocalError:
+        sys.exit("Error in calc_only_Second_only_core")
 #@profile
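
The change above catches the `UnboundLocalError` that is raised when no range in `groups` contains `Second_num`, because `res` is then never assigned. A sketch of the same lookup with an explicit "no match" result is given below. The function name `find_group` and the example `groups` mapping are hypothetical, and the real structure of `groups` in PyamilySeq may differ.

```python
def find_group(groups, value):
    """Return the label of the last (low, high) range containing value, or None."""
    match = None
    for label, (low, high) in groups.items():
        if low <= value <= high:
            match = label  # keep the last match, as the loop in the diff does
    return match


# Hypothetical ranges keyed by label.
groups = {"99": (99, 100), "95": (95, 98), "15": (15, 94)}

print(find_group(groups, 96))
print(find_group(groups, 3))
```

Returning `None` lets the caller decide whether a missing range is an error, and an error message built at that point can include the offending value.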

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/Seq_Combiner.py RENAMED

@@ -65,17 +65,17 @@ def main():
     if not os.path.exists(output_path):
         os.makedirs(output_path)
-    output_file = options.output_file + '.fasta'
-    if os.path.exists(os.path.join(output_path, output_file)):
-        print(f"Output file {output_file} already exists in the output directory. Please delete or rename the file and try again.")
+    #output_file = options.output_file + '.fasta'
+    if os.path.exists(os.path.join(output_path, options.output_file)):
+        print(f"Output file {options.output_file} already exists in the output directory. Please delete or rename the file and try again.")
         exit(1)
-    combined_out_file = os.path.join(output_path, output_file )
+    combined_out_file = os.path.join(output_path, options.output_file )
     if options.input_type == 'separate':
-        read_separate_files(options.input_dir, options.name_split_gff, options.name_split_fasta, options.gene_ident, combined_out_file, options.translate)
+        read_separate_files(options.input_dir, options.name_split_gff, options.name_split_fasta, options.gene_ident, combined_out_file, options.translate, True)
     elif options.input_type == 'combined':
-        read_combined_files(options.input_dir, options.name_split_gff, options.gene_ident, combined_out_file, options.translate)
+        read_combined_files(options.input_dir, options.name_split_gff, options.gene_ident, combined_out_file, options.translate, True)
     elif options.input_type == 'fasta':
         read_fasta_files(options.input_dir, options.name_split_fasta, combined_out_file, options.translate)

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/clusterings.py RENAMED

@@ -279,8 +279,6 @@ def combined_clustering_CDHIT(options, taxa_dict, splitter):
     first = True
     for line in Second_in:
         if line.startswith('>'):
-            if '>Cluster 1997' in line:
-                print()
             if first == False:
                 cluster_size = len(Combined_clusters[cluster_id])
                 Combined_reps.update({rep: cluster_size})

pyamilyseq-1.1.1/src/PyamilySeq/constants.py ADDED

@@ -0,0 +1,2 @@
+PyamilySeq_Version = 'v1.1.1'
+

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/utils.py RENAMED

@@ -228,9 +228,15 @@ def run_mafft_on_sequences(options, sequences, output_file):
-def read_separate_files(input_dir, name_split_gff, name_split_fasta, gene_ident, combined_out, translate):
-    paired_files_found = None
-    with open(combined_out, 'w') as combined_out_file, open(combined_out.replace('_dna.fasta','_aa.fasta'), 'w') as combined_out_file_aa:
+def read_separate_files(input_dir, name_split_gff, name_split_fasta, gene_ident, combined_out, translate, run_as_combiner):
+    if run_as_combiner == True:
+        combined_out_file_aa = None
+    else:
+        combined_out_file_aa = combined_out.replace('_dna.fasta','_aa.fasta')
+    with open(combined_out, 'w') as combined_out_file, (open(combined_out_file_aa, 'w') if combined_out_file_aa else open(os.devnull, 'w')) as combined_out_file_aa:
+        paired_files_found = None
+    #with open(combined_out, 'w') as combined_out_file, open(combined_out.replace('_dna.fasta','_aa.fasta'), 'w') as combined_out_file_aa:
         gff_files = glob.glob(os.path.join(input_dir, '*' + name_split_gff))
         if not gff_files:
             sys.exit("Error: No GFF files found.")
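
The hunk above opens `os.devnull` as a stand-in when no amino-acid output file is wanted, and rebinds `combined_out_file_aa` from a path to a file object inside the `with` statement. One alternative for an optional second output file is `contextlib.ExitStack`, sketched below under stated assumptions: the function `write_sequences`, its parameters, and the example record are hypothetical and do not come from `utils.py`.

```python
import os
import tempfile
from contextlib import ExitStack


def write_sequences(dna_path, aa_path=None):
    """Write a DNA record, and an AA record only when aa_path is given."""
    with ExitStack() as stack:
        dna_out = stack.enter_context(open(dna_path, "w"))
        # The second file is opened only when requested; no placeholder file needed.
        aa_out = stack.enter_context(open(aa_path, "w")) if aa_path else None

        dna_out.write(">genome|gene1\nATGAAATAA\n")
        if aa_out is not None:
            aa_out.write(">genome|gene1\nMK*\n")
    return aa_out is not None


with tempfile.TemporaryDirectory() as tmp:
    dna_file = os.path.join(tmp, "combined_dna.fasta")
    aa_file = os.path.join(tmp, "combined_aa.fasta")
    wrote_aa = write_sequences(dna_file)  # combiner-style call: DNA only
    aa_exists_after_first = os.path.exists(aa_file)
    wrote_aa_second = write_sequences(dna_file, aa_file)
    aa_exists_after_second = os.path.exists(aa_file)

print(wrote_aa, aa_exists_after_first, wrote_aa_second, aa_exists_after_second)
```

With this shape no unused amino-acid file is created, so there is nothing to clean up afterwards when translation is switched off.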
@@ -299,23 +305,40 @@ def read_separate_files(input_dir, name_split_gff, name_split_fasta, gene_ident,
                             full_sequence = fasta_dict[contig][1]
                             seq = full_sequence[corrected_start:corrected_stop]
-                        if translate == True:
-                            seq_aa = translate_frame(seq)
-                            wrapped_sequence_aa = '\n'.join([seq_aa[i:i + 60] for i in range(0, len(seq_aa), 60)])
-                            combined_out_file_aa.write(f">{genome_name}|{seq_id}\n{wrapped_sequence_aa}\n")
-                        wrapped_sequence = '\n'.join([seq[i:i + 60] for i in range(0, len(seq), 60)])
-                        combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
+                        if run_as_combiner == True:
+                            if translate == True:
+                                seq_aa = translate_frame(seq)
+                                wrapped_sequence_aa = '\n'.join([seq_aa[i:i + 60] for i in range(0, len(seq_aa), 60)])
+                                combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence_aa}\n")
+                            else:
+                                wrapped_sequence = '\n'.join([seq[i:i + 60] for i in range(0, len(seq), 60)])
+                                combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
+                        else:
+                            if translate == True:
+                                seq_aa = translate_frame(seq)
+                                wrapped_sequence_aa = '\n'.join([seq_aa[i:i + 60] for i in range(0, len(seq_aa), 60)])
+                                combined_out_file_aa.write(f">{genome_name}|{seq_id}\n{wrapped_sequence_aa}\n")
+                            wrapped_sequence = '\n'.join([seq[i:i + 60] for i in range(0, len(seq), 60)])
+                            combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
     if not paired_files_found:
         sys.exit("Could not find matching GFF/FASTA files - Please check input directory and -name_split_gff and -name_split_fasta parameters.")
     if translate == False or translate == None:
         #Clean up unused file
-        if combined_out_file.name != combined_out_file_aa.name:
-            os.remove(combined_out_file_aa.name)
+        try: # Catches is combined_out_file_aa is None
+            if combined_out_file.name != combined_out_file_aa.name:
+                os.remove(combined_out_file_aa.name)
+        except AttributeError:
+            pass
-def read_combined_files(input_dir, name_split, gene_ident, combined_out, translate):
-    with open(combined_out, 'w') as combined_out_file, open(combined_out.replace('_dna.fasta','_aa.fasta'), 'w') as combined_out_file_aa:
+def read_combined_files(input_dir, name_split, gene_ident, combined_out, translate, run_as_combiner):
+    if run_as_combiner == True:
+        combined_out_file_aa = None
+    else:
+        combined_out_file_aa = combined_out.replace('_dna.fasta','_aa.fasta')
+    #with open(combined_out, 'w') as combined_out_file, open(combined_out_file_aa, 'w') if combined_out_file_aa else open(os.devnull, 'w'):
+    with open(combined_out, 'w') as combined_out_file, (open(combined_out_file_aa, 'w') if combined_out_file_aa else open(os.devnull, 'w')) as combined_out_file_aa:
         gff_files = glob.glob(os.path.join(input_dir, '*' + name_split))
         if not gff_files:
             sys.exit("Error: No GFF files found - check input directory and -name_split_gff parameter.")
@@ -355,7 +378,7 @@ def read_combined_files(input_dir, name_split, gene_ident, combined_out, transla
                 for contig, fasta in fasta_dict.items():
                     reverse_sequence = reverse_complement(fasta[0])
-                    fasta_dict[contig][1]=reverse_sequence
+                    fasta_dict[contig][1] = reverse_sequence
                 if fasta_dict and gff_features:
                     for contig, start, end, strand, feature, seq_id in gff_features:
@@ -369,22 +392,38 @@ def read_combined_files(input_dir, name_split, gene_ident, combined_out, transla
                                 full_sequence = fasta_dict[contig][1]
                                 seq = full_sequence[corrected_start:corrected_stop]
-                            if translate == True:
-                                seq_aa = translate_frame(seq)
-                                wrapped_sequence_aa = '\n'.join([seq_aa[i:i + 60] for i in range(0, len(seq_aa), 60)])
-                                combined_out_file_aa.write(f">{genome_name}|{seq_id}\n{wrapped_sequence_aa}\n")
-                            wrapped_sequence = '\n'.join([seq[i:i + 60] for i in range(0, len(seq), 60)])
-                            combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
+                            if run_as_combiner == True:
+                                if translate == True:
+                                    seq_aa = translate_frame(seq)
+                                    wrapped_sequence_aa = '\n'.join([seq_aa[i:i + 60] for i in range(0, len(seq_aa), 60)])
+                                    combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence_aa}\n")
+                                else:
+                                    wrapped_sequence = '\n'.join([seq[i:i + 60] for i in range(0, len(seq), 60)])
+                                    combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
+                            else:
+                                if translate == True:
+                                    seq_aa = translate_frame(seq)
+                                    wrapped_sequence_aa = '\n'.join([seq_aa[i:i + 60] for i in range(0, len(seq_aa), 60)])
+                                    combined_out_file_aa.write(f">{genome_name}|{seq_id}\n{wrapped_sequence_aa}\n")
+                                wrapped_sequence = '\n'.join([seq[i:i + 60] for i in range(0, len(seq), 60)])
+                                combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
     if translate == False or translate == None:
         #Clean up unused file
-        if combined_out_file.name != combined_out_file_aa.name:
-            os.remove(combined_out_file_aa.name)
+        try: # Catches is combined_out_file_aa is None
+            if combined_out_file.name != combined_out_file_aa.name:
+                os.remove(combined_out_file_aa.name)
+        except AttributeError:
+            pass
-def read_fasta_files(input_dir, name_split_fasta, combined_out, translate):
-    with open(combined_out, 'w') as combined_out_file, open(combined_out.replace('_dna.fasta','_aa.fasta'), 'w') as combined_out_file_aa:
+def read_fasta_files(input_dir, name_split_fasta, combined_out, translate, run_as_combiner):
+    if run_as_combiner == True:
+        combined_out_file_aa = None
+    else:
+        combined_out_file_aa = combined_out.replace('_dna.fasta','_aa.fasta')
+    with open(combined_out, 'w') as combined_out_file, (open(combined_out_file_aa, 'w') if combined_out_file_aa else open(os.devnull, 'w')) as combined_out_file_aa:
         fasta_files = glob.glob(os.path.join(input_dir, '*' + name_split_fasta))
         if not fasta_files:
             sys.exit("Error: No GFF files found.")
@@ -400,17 +439,30 @@ def read_fasta_files(input_dir, name_split_fasta, combined_out, translate):
                     else:
                         fasta_dict[current_seq] +=line.strip()
                 for seq_id, seq in fasta_dict.items():
-                    if translate == True:
-                        seq_aa = translate_frame(seq)
-                        wrapped_sequence_aa = '\n'.join([seq_aa[i:i + 60] for i in range(0, len(seq_aa), 60)])
-                        combined_out_file_aa.write(f">{genome_name}|{seq_id}\n{wrapped_sequence_aa}\n")
-                    wrapped_sequence = '\n'.join([seq[i:i + 60] for i in range(0, len(seq), 60)])
-                    combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
+                    if run_as_combiner == True:
+                        if translate == True:
+                            seq_aa = translate_frame(seq)
+                            wrapped_sequence_aa = '\n'.join([seq_aa[i:i + 60] for i in range(0, len(seq_aa), 60)])
+                            combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence_aa}\n")
+                        else:
+                            wrapped_sequence = '\n'.join([seq[i:i + 60] for i in range(0, len(seq), 60)])
+                            combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
+                    else:
+                        if translate == True:
+                            seq_aa = translate_frame(seq)
+                            wrapped_sequence_aa = '\n'.join([seq_aa[i:i + 60] for i in range(0, len(seq_aa), 60)])
+                            combined_out_file_aa.write(f">{genome_name}|{seq_id}\n{wrapped_sequence_aa}\n")
+                        wrapped_sequence = '\n'.join([seq[i:i + 60] for i in range(0, len(seq), 60)])
+                        combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
     if translate == False or translate == None:
         #Clean up unused file
-        if combined_out_file.name != combined_out_file_aa.name:
-            os.remove(combined_out_file_aa.name)
+        try: # Catches is combined_out_file_aa is None
+            if combined_out_file.name != combined_out_file_aa.name:
+                os.remove(combined_out_file_aa.name)
+        except AttributeError:
+            pass
 def write_groups_func(options, output_dir, key_order, cores, sequences,
                  pangenome_clusters_First_sequences_sorted, combined_pangenome_clusters_Second_sequences):
@@ -533,7 +585,8 @@ def write_groups_func(options, output_dir, key_order, cores, sequences,
 def perform_alignment(gene_path,group_directory, gene_file, options, concatenated_sequences, subgrouped):
     # Read sequences from the gene family file
     sequences = read_fasta(gene_path)
+    if len(sequences) == 1: # We can't align a single sequence
+        return concatenated_sequences
     # Select the longest sequence for each genome
     longest_sequences = select_longest_gene(sequences, subgrouped)
@@ -574,20 +627,18 @@ def process_gene_groups(options, group_directory, sub_group_directory, paralog_g
         # Iterate over each gene family file
         for gene_file in os.listdir(group_directory):
             if gene_file.endswith(affix) and not gene_file.startswith('combined_group_sequences'):
-                #print(gene_file)
                 current_group = int(gene_file.split('_')[3].split('.')[0])
                 gene_path = os.path.join(group_directory, gene_file)
-                # Check for matching group in paralog_groups
-                if sub_group_directory and paralog_groups and '>Group_'+str(current_group) in paralog_groups:
-                    for subgroup, size in enumerate(paralog_groups['>Group_' + str(current_group)]['sizes']):
-                        if size >= threshold_size:
-                            gene_path = os.path.join(sub_group_directory,f"Group_{current_group}_subgroup_{subgroup}{affix}")
-                            concatenated_sequences = perform_alignment(gene_path, group_directory, gene_file, options, concatenated_sequences, True)
-                else:
-                    concatenated_sequences = perform_alignment(gene_path, group_directory, gene_file, options, concatenated_sequences, False)
+                # Could add more catches here to work with First and Secondary groups - This ensures only core '99/100' are aligned
+                if 'First_core_99' in gene_file or 'First_core_100' in gene_file:
+                    # Check for matching group in paralog_groups
+                    if sub_group_directory and paralog_groups and '>Group_'+str(current_group) in paralog_groups:
+                        for subgroup, size in enumerate(paralog_groups['>Group_' + str(current_group)]['sizes']):
+                            if size >= threshold_size:
+                                gene_path = os.path.join(sub_group_directory,f"Group_{current_group}_subgroup_{subgroup}{affix}")
+                                concatenated_sequences = perform_alignment(gene_path, group_directory, gene_file, options, concatenated_sequences, True)
+                    else:
+                        concatenated_sequences = perform_alignment(gene_path, group_directory, gene_file, options, concatenated_sequences, False)
     # Write the concatenated sequences to the output file
     with open(output_file, 'w') as out:
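
Reviewer note on the hunks above: the three `read_*_files` functions repeat the same 60-column wrapping expression and open `os.devnull` as a placeholder when no amino-acid file is wanted, then remove the unused file afterwards. The sketch below shows one alternative shape for the non-combiner branch, in which the second handle is opened only when it is needed. The names `wrap_fasta`, `write_records`, and `translate_func` are hypothetical and are not part of PyamilySeq; this illustrates the pattern and is not a drop-in patch.

```python
from contextlib import ExitStack


def wrap_fasta(seq, width=60):
    """Split a sequence into newline-separated lines of at most `width` characters."""
    return '\n'.join(seq[i:i + width] for i in range(0, len(seq), width))


def write_records(records, dna_path, aa_path=None, translate_func=None):
    """Write (header, sequence) pairs as FASTA to dna_path.

    The amino-acid file is opened only when both aa_path and translate_func
    are supplied, so no placeholder handle or later clean-up is required.
    (Hypothetical helper for illustration only.)
    """
    with ExitStack() as stack:
        dna_out = stack.enter_context(open(dna_path, 'w'))
        aa_out = None
        if aa_path is not None and translate_func is not None:
            aa_out = stack.enter_context(open(aa_path, 'w'))
        for header, seq in records:
            dna_out.write(f">{header}\n{wrap_fasta(seq)}\n")
            if aa_out is not None:
                aa_out.write(f">{header}\n{wrap_fasta(translate_func(seq))}\n")
```

Because the optional file is never created unless it is requested, there is nothing to delete once the `with` block exits, and the wrapping logic lives in one place.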

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: PyamilySeq
-Version: 1.1.0
+Version: 1.1.1
 Summary: PyamilySeq - A a tool to investigate sequence-based gene groups identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
 Home-page: https://github.com/NickJD/PyamilySeq
 Author: Nicholas Dimonaco
@@ -45,7 +45,7 @@ To update to the newest version add '-U' to end of the pip install command.
 ```commandline
 usage: PyamilySeq.py [-h] {Full,Partial} ...
-PyamilySeq v1.1.0: A tool for gene clustering and analysis.
+PyamilySeq v1.1.1: A tool for gene clustering and analysis.
 positional arguments:
   {Full,Partial}  Choose a mode: 'Full' or 'Partial'.
@@ -75,7 +75,7 @@ Escherichia_coli_110957|ENSB_TIZS9kbTvShDvyX	Escherichia_coli_110957|ENSB_TIZS9k
 ```
 ### Example output:
 ```
-Running PyamilySeq v1.1.0
+Running PyamilySeq v1.1.1
 Calculating Groups
 Number of Genomes: 10
 Gene Groups
@@ -220,7 +220,7 @@ Seq-Combiner -input_dir .../test_data/genomes -name_split_gff .gff3 -output_dir
 ```
 usage: Seq_Combiner.py [-h] -input_dir INPUT_DIR -input_type {separate,combined,fasta} [-name_split_gff NAME_SPLIT_GFF] [-name_split_fasta NAME_SPLIT_FASTA] -output_dir OUTPUT_DIR -output_name OUTPUT_FILE [-gene_ident GENE_IDENT] [-translate] [-v]
-PyamilySeq v1.1.0: Seq-Combiner - A tool to extract sequences from GFF/FASTA files and prepare them for PyamilySeq.
+PyamilySeq v1.1.1: Seq-Combiner - A tool to extract sequences from GFF/FASTA files and prepare them for PyamilySeq.
 options:
   -h, --help            show this help message and exit
@@ -263,7 +263,7 @@ usage: Group_Splitter.py [-h] -input_fasta INPUT_FASTA -sequence_type {AA,DNA}
                          [-M CLUSTERING_MEMORY] [-no_delete_temp_files]
                          [-verbose] [-v]
-PyamilySeq v1.1.0: Group-Splitter - A tool to split multi-copy gene groups
+PyamilySeq v1.1.1: Group-Splitter - A tool to split multi-copy gene groups
 identified by PyamilySeq.
 options:
@@ -316,7 +316,7 @@ Cluster-Summary -genome_num 10 -input_clstr .../test_data/species/E-coli/E-coli_
 usage: Cluster_Summary.py [-h] -input_clstr INPUT_CLSTR -output OUTPUT -genome_num GENOME_NUM
                           [-output_dir OUTPUT_DIR] [-verbose] [-v]
-PyamilySeq v1.1.0: Cluster-Summary - A tool to summarise CD-HIT clustering files.
+PyamilySeq v1.1.1: Cluster-Summary - A tool to summarise CD-HIT clustering files.
 options:
   -h, --help            show this help message and exit

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq.egg-info/SOURCES.txt RENAMED

@@ -2,6 +2,7 @@ LICENSE
 README.md
 pyproject.toml
 setup.cfg
+src/PyamilySeq/Cluster_Compare.py
 src/PyamilySeq/Cluster_Summary.py
 src/PyamilySeq/Group_Extractor.py
 src/PyamilySeq/Group_Sizes.py

pyamilyseq-1.1.0/src/PyamilySeq/constants.py DELETED

@@ -1,2 +0,0 @@
-PyamilySeq_Version = 'v1.1.0'
-

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/LICENSE RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/pyproject.toml RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/Group_Extractor.py RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/Group_Sizes.py RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/Group_Splitter.py RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/PyamilySeq_Genus.py RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/Seq_Extractor.py RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/Seq_Finder.py RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq/__init__.py RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq.egg-info/dependency_links.txt RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq.egg-info/entry_points.txt RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq.egg-info/requires.txt RENAMED

File without changes

{pyamilyseq-1.1.0 → pyamilyseq-1.1.1}/src/PyamilySeq.egg-info/top_level.txt RENAMED

File without changes
