
PyamilySeq 0.3.0tar.gz → 0.4.0tar.gz


Files changed (20)

pyamilyseq-0.4.0/PKG-INFO ADDED

@@ -0,0 +1,92 @@
+Metadata-Version: 2.1
+Name: PyamilySeq
+Version: 0.4.0
+Summary: PyamilySeq - A tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
+Home-page: https://github.com/NickJD/PyamilySeq
+Author: Nicholas Dimonaco
+Author-email: nicholas@dimonaco.co.uk
+Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.6
+Description-Content-Type: text/markdown
+License-File: LICENSE
+# PyamilySeq - !BETA!
+**PyamilySeq** (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, BLAST, DIAMOND or MMseqs2.
+This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
+## Features
+- **End-to-End**: PyamilySeq can take a directory of GFF+FASTA files, run CD-HIT for clustering and process the results.
+- **Clustering**: Supports input from CD-HIT formatted files as well as CSV and TSV edge lists (-outfmt 6 from BLAST/DIAMOND).
+- **Reclustering**: Allows for the addition of new sequences post-initial clustering.
+- **Output**: Generates a Roary/Panaroo-formatted gene presence-absence CSV file for downstream analysis.
+  - Align representative sequences using MAFFT.
+  - Output concatenated aligned sequences for downstream analysis.
+  - Optionally output sequences of identified families.
+### Installation
+PyamilySeq requires Python 3.6 or higher. Install using pip:
+```bash
+pip install PyamilySeq
+```
+## Usage - Menu
+```
+usage: PyamilySeq.py [-h] -id INPUT_DIR -od OUTPUT_DIR -it {separate,combined} -ns NAME_SPLIT -pid PIDENT -ld LEN_DIFF -co CLUSTERING_OUT -ct {CD-HIT,BLAST,DIAMOND,MMseqs2} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG] [-groups CORE_GROUPS]
+                     [-gpa GENE_PRESENCE_ABSENCE_OUT]
+                     ...
+PyamilySeq v0.4.0: PyamilySeq Run Parameters.
+positional arguments:
+  pyamilyseq_args       Additional arguments for PyamilySeq.
+options:
+  -h, --help            show this help message and exit
+Required Arguments:
+  -id INPUT_DIR         Directory containing GFF/FASTA files.
+  -od OUTPUT_DIR        Directory for all output files.
+  -it {separate,combined}
+                        Type of input files: 'separate' for separate FASTA and GFF files, 'combined' for GFF files with embedded FASTA sequences.
+  -ns NAME_SPLIT        Character used to split the filename and extract the genome name.
+  -pid PIDENT           Pident threshold for CD-HIT clustering.
+  -ld LEN_DIFF          Length difference (-s) threshold for CD-HIT clustering.
+  -co CLUSTERING_OUT    Output file for initial clustering.
+  -ct {CD-HIT,BLAST,DIAMOND,MMseqs2}
+                        Clustering format for PyamilySeq.
+Output Parameters:
+  -w WRITE_FAMILIES     Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95") - Must provide FASTA file with -fasta
+  -con CON_CORE         Default - No output: Output aligned and concatenated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95") - Must provide FASTA file with -fasta
+  -fasta FASTA          FASTA file to use in conjunction with "-w" or "-con"
+Optional Arguments:
+  -rc RECLUSTERED       Clustering output file from secondary round of clustering
+  -st SEQUENCE_TAG      Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
+  -groups CORE_GROUPS   Default - ('99,95,15'): Gene family groups to use
+  -gpa GENE_PRESENCE_ABSENCE_OUT
+                        Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other downstream tools
+```
+### Example Run End-to-End - 'genomes' is a test-directory containing GFF files with ##FASTA at the bottom
+```bash
+PyamilySeq -id .../genomes -it combined -ns _combined.gff3 -pid 0.90 -ld 0.60 -co testing_cd-hit -ct CD-HIT -od .../testing
+```
+```
+Calculating Groups
+Gene Groups:
+first_core_99: 3103
+first_core_95: 0
+first_core_15: 3217
+first_core_0: 4808
+Total Number of Gene Groups (Including Singletons): 11128
+```
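
A note on `-ns`: in the added `PyamilySeq.py`, the genome name is taken as everything in the file's basename that precedes the `-ns` value. The snippet below is a minimal sketch of that one step. The helper name `genome_name_from_path` and the example path are hypothetical and are not part of the package.

```python
import os


def genome_name_from_path(path, name_split):
    """Return the part of the file's basename before the first `name_split`."""
    return os.path.basename(path).split(name_split)[0]


# Hypothetical input file, using the -ns value from the example run.
print(genome_name_from_path('/data/genomes/Ecoli_K12_combined.gff3', '_combined.gff3'))
# prints "Ecoli_K12"
```

Because the same value is also used as the glob suffix (`'*' + name_split`), it needs to match the end of every input filename.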

pyamilyseq-0.4.0/README.md ADDED

@@ -0,0 +1,77 @@
+# PyamilySeq - !BETA!
+**PyamilySeq** (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, BLAST, DIAMOND or MMseqs2.
+This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
+## Features
+- **End-to-End**: PyamilySeq can take a directory of GFF+FASTA files, run CD-HIT for clustering and process the results.
+- **Clustering**: Supports input from CD-HIT formatted files as well as CSV and TSV edge lists (-outfmt 6 from BLAST/DIAMOND).
+- **Reclustering**: Allows for the addition of new sequences post-initial clustering.
+- **Output**: Generates a Roary/Panaroo-formatted gene presence-absence CSV file for downstream analysis.
+  - Align representative sequences using MAFFT.
+  - Output concatenated aligned sequences for downstream analysis.
+  - Optionally output sequences of identified families.
+### Installation
+PyamilySeq requires Python 3.6 or higher. Install using pip:
+```bash
+pip install PyamilySeq
+```
+## Usage - Menu
+```
+usage: PyamilySeq.py [-h] -id INPUT_DIR -od OUTPUT_DIR -it {separate,combined} -ns NAME_SPLIT -pid PIDENT -ld LEN_DIFF -co CLUSTERING_OUT -ct {CD-HIT,BLAST,DIAMOND,MMseqs2} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG] [-groups CORE_GROUPS]
+                     [-gpa GENE_PRESENCE_ABSENCE_OUT]
+                     ...
+PyamilySeq v0.4.0: PyamilySeq Run Parameters.
+positional arguments:
+  pyamilyseq_args       Additional arguments for PyamilySeq.
+options:
+  -h, --help            show this help message and exit
+Required Arguments:
+  -id INPUT_DIR         Directory containing GFF/FASTA files.
+  -od OUTPUT_DIR        Directory for all output files.
+  -it {separate,combined}
+                        Type of input files: 'separate' for separate FASTA and GFF files, 'combined' for GFF files with embedded FASTA sequences.
+  -ns NAME_SPLIT        Character used to split the filename and extract the genome name.
+  -pid PIDENT           Pident threshold for CD-HIT clustering.
+  -ld LEN_DIFF          Length difference (-s) threshold for CD-HIT clustering.
+  -co CLUSTERING_OUT    Output file for initial clustering.
+  -ct {CD-HIT,BLAST,DIAMOND,MMseqs2}
+                        Clustering format for PyamilySeq.
+Output Parameters:
+  -w WRITE_FAMILIES     Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95") - Must provide FASTA file with -fasta
+  -con CON_CORE         Default - No output: Output aligned and concatenated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95") - Must provide FASTA file with -fasta
+  -fasta FASTA          FASTA file to use in conjunction with "-w" or "-con"
+Optional Arguments:
+  -rc RECLUSTERED       Clustering output file from secondary round of clustering
+  -st SEQUENCE_TAG      Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
+  -groups CORE_GROUPS   Default - ('99,95,15'): Gene family groups to use
+  -gpa GENE_PRESENCE_ABSENCE_OUT
+                        Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other downstream tools
+```
+### Example Run End-to-End - 'genomes' is a test-directory containing GFF files with ##FASTA at the bottom
+```bash
+PyamilySeq -id .../genomes -it combined -ns _combined.gff3 -pid 0.90 -ld 0.60 -co testing_cd-hit -ct CD-HIT -od .../testing
+```
+```
+Calculating Groups
+Gene Groups:
+first_core_99: 3103
+first_core_95: 0
+first_core_15: 3217
+first_core_0: 4808
+Total Number of Gene Groups (Including Singletons): 11128
+```
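
The group counts in the example output can be read as bins over the percentage of genomes in which a gene family appears. The sketch below shows one way such counts might be derived from a CD-HIT `.clstr` file. It assumes the `genome|gene` header convention written by `read_combined_files`, and it assumes each family is counted once, in the highest level it reaches. The binning rule and the function names `genomes_per_cluster` and `bin_clusters` are illustrative guesses, not the package's actual `cluster()` logic.

```python
import re
from collections import defaultdict

# Captures the sequence name in a .clstr member line,
# e.g. "0\t1500nt, >genomeA|gene1... *" -> "genomeA|gene1"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def genomes_per_cluster(clstr_lines):
    """Map cluster id -> set of genome names, assuming 'genome|gene' headers."""
    clusters = defaultdict(set)
    current = None
    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].add(match.group(1).split("|")[0])
    return clusters


def bin_clusters(clusters, n_genomes, levels=(99, 95, 15, 0)):
    """Count each cluster once, under the highest percentage level it reaches."""
    counts = {level: 0 for level in levels}
    for genomes in clusters.values():
        percent = 100.0 * len(genomes) / n_genomes
        for level in sorted(levels, reverse=True):
            if percent >= level:
                counts[level] += 1
                break
    return counts


# Hypothetical two-genome example in CD-HIT .clstr layout.
example = [
    ">Cluster 0",
    "0\t1500nt, >genomeA|gene1... *",
    "1\t1498nt, >genomeB|gene7... at +/99.80%",
    ">Cluster 1",
    "0\t900nt, >genomeA|gene2... *",
]
clusters = genomes_per_cluster(example)
print(bin_clusters(clusters, n_genomes=2))
# prints {99: 1, 95: 0, 15: 1, 0: 0}
```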

{pyamilyseq-0.3.0 → pyamilyseq-0.4.0}/setup.cfg RENAMED

@@ -1,6 +1,6 @@
 [metadata]
 name = PyamilySeq
-version = v0.3.0
+version = v0.4.0
 author = Nicholas Dimonaco
 author_email = nicholas@dimonaco.co.uk
 description = PyamilySeq - A tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
@@ -27,7 +27,7 @@ include = *
 [options.entry_points]
 console_scripts =
-	PyamilySeq = PyamilySeq.PyamilySeq_Species:main
+	PyamilySeq = PyamilySeq.PyamilySeq:main
 [egg_info]
 tag_build =
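
The `console_scripts` change above repoints the `PyamilySeq` command from `PyamilySeq_Species:main` to the new `PyamilySeq:main`. An entry of the form `name = module:function` is resolved, roughly, by importing the module and looking up the attribute. The sketch below is a simplified illustration of that lookup, not the launcher that pip generates. `resolve_entry_point` is a hypothetical helper, and a standard-library target is used so the snippet runs without the package installed.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'module:function' string to the callable it names."""
    module_name, _, attr = spec.partition(':')
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in target from the standard library; with the package installed,
# the 0.4.0 spec would be 'PyamilySeq.PyamilySeq:main'.
func = resolve_entry_point('os.path:basename')
print(func('/tmp/genomes/sample_combined.gff3'))
# prints "sample_combined.gff3"
```

One consequence is that any import error at the top of `PyamilySeq/PyamilySeq.py` surfaces as soon as the command is run.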

{pyamilyseq-0.3.0 → pyamilyseq-0.4.0}/src/PyamilySeq/Constants.py RENAMED

@@ -1,6 +1,6 @@
 import subprocess
-PyamilySeq_Version = 'v0.3.0'
+PyamilySeq_Version = 'v0.4.0'

pyamilyseq-0.4.0/src/PyamilySeq/PyamilySeq.py ADDED

@@ -0,0 +1,186 @@
+import argparse
+import collections
+import os
+import glob
+import subprocess
+# Use package-relative imports when installed; fall back to plain imports when run as a script.
+try:
+    from .PyamilySeq_Species import *
+    from .PyamilySeq_Species import cluster
+    from .Constants import *
+except (ModuleNotFoundError, ImportError, NameError, TypeError) as error:
+    from PyamilySeq_Species import *
+    from PyamilySeq_Species import cluster
+    from Constants import *
+def reverse_complement(seq):
+    complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}
+    return ''.join(complement[base] for base in reversed(seq))
+def read_separate_files(input_dir, name_split, combined_out):
+    with open(combined_out, 'w') as combined_out_file:
+        for fasta_file in glob.glob(os.path.join(input_dir, '*' + name_split)):
+            genome_name = os.path.basename(fasta_file).split(name_split)[0]
+            corresponding_gff_file = fasta_file.replace('.fasta', '.gff')
+            if not os.path.exists(corresponding_gff_file):
+                continue
+            cds_sequences = extract_cds_from_gff(fasta_file, corresponding_gff_file)
+            for gene_name, seq in cds_sequences:
+                header = f">{genome_name}_{gene_name}\n"
+                combined_out_file.write(header)
+                combined_out_file.write(seq + '\n')
+def read_combined_files(input_dir, name_split, combined_out):
+    with open(combined_out, 'w') as combined_out_file:
+        for gff_file in glob.glob(os.path.join(input_dir, '*' + name_split)):
+            genome_name = os.path.basename(gff_file).split(name_split)[0]
+            fasta_dict = collections.defaultdict(list)
+            gff_features = []
+            with open(gff_file, 'r') as file:
+                lines = file.readlines()
+                fasta_section = False
+                for line in lines:
+                    if line.startswith('##FASTA'):
+                        fasta_section = True
+                        continue
+                    if fasta_section:
+                        if line.startswith('>'):
+                            current_contig = line[1:].split()[0]
+                            fasta_dict[current_contig] = []
+                        else:
+                            fasta_dict[current_contig].append(line.strip())
+                    else:
+                        line_data = line.split('\t')
+                        if len(line_data) == 9:
+                            if line_data[2] == 'CDS':
+                                contig = line_data[0]
+                                feature = line_data[2]
+                                start, end = int(line_data[3]), int(line_data[4])
+                                seq_id = line_data[8].split('ID=')[1].split(';')[0]
+                                gff_features.append((contig, start, end, seq_id))
+                if fasta_dict and gff_features:
+                    for contig, start, end, seq_id in gff_features:
+                        if contig in fasta_dict:
+                            full_sequence = ''.join(fasta_dict[contig])
+                            cds_sequence = full_sequence[start - 1:end]
+                            wrapped_sequence = '\n'.join([cds_sequence[i:i + 60] for i in range(0, len(cds_sequence), 60)])
+                            combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
+def run_cd_hit(input_file, clustering_output, options):
+    cdhit_command = [
+        'cd-hit-est',
+        '-i', input_file,
+        '-o', clustering_output,
+        '-c', str(options.pident),
+        '-s', str(options.len_diff),
+        '-T', "20",
+        '-d', "0",
+        '-sc', "1",
+        '-sf', "1"
+    ]
+    # check=True raises CalledProcessError if cd-hit-est exits with a non-zero status.
+    subprocess.run(cdhit_command, check=True)
+def main():
+    parser = argparse.ArgumentParser(
+        description='PyamilySeq ' + PyamilySeq_Version + ': PyamilySeq Run Parameters.')
+    required = parser.add_argument_group('Required Arguments')
+    required.add_argument("-id", action="store", dest="input_dir",
+                          help="Directory containing GFF/FASTA files.",
+                          required=True)
+    required.add_argument("-od", action="store", dest="output_dir",
+                          help="Directory for all output files.",
+                          required=True)
+    required.add_argument("-it", action="store", dest="input_type", choices=['separate', 'combined'],
+                        help="Type of input files: 'separate' for separate FASTA and GFF files,"
+                             " 'combined' for GFF files with embedded FASTA sequences.",
+                          required=True)
+    required.add_argument("-ns", action="store", dest="name_split",
+                          help="Character used to split the filename and extract the genome name.",
+                          required=True)
+    required.add_argument("-pid", action="store", dest="pident", type=float,
+                          help="Pident threshold for CD-HIT clustering.",
+                          required=True)
+    required.add_argument("-ld", action="store", dest="len_diff", type=float,
+                          help="Length difference (-s) threshold for CD-HIT clustering.",
+                          required=True)
+    required.add_argument("-co", action="store", dest="clustering_out",
+                          help="Output file for initial clustering.",
+                          required=True)
+    required.add_argument("-ct", action="store", dest="clustering_type", choices=['CD-HIT', 'BLAST', 'DIAMOND', "MMseqs2"],
+                        help="Clustering format for PyamilySeq.",
+                          required=True)
+    output_args = parser.add_argument_group('Output Parameters')
+    output_args.add_argument('-w', action="store", dest='write_families', default=None,
+                          help='Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95")'
+                               ' - Must provide FASTA file with -fasta',
+                          required=False)
+    output_args.add_argument('-con', action="store", dest='con_core', default=None,
+                          help='Default - No output: Output aligned and concatenated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95")'
+                               ' - Must provide FASTA file with -fasta',
+                          required=False)
+    output_args.add_argument('-fasta', action='store', dest='fasta',
+                          help='FASTA file to use in conjunction with "-w" or "-con"',
+                          required=False)
+    optional = parser.add_argument_group('Optional Arguments')
+    optional.add_argument('-rc', action='store', dest='reclustered', help='Clustering output file from secondary round of clustering',
+                        required=False)
+    optional.add_argument('-st', action='store', dest='sequence_tag', help='Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences',
+                        required=False)
+    optional.add_argument('-groups', action="store", dest='core_groups', default="99,95,15",
+                        help='Default - (\'99,95,15\'): Gene family groups to use')
+    optional.add_argument('-gpa', action='store', dest='gene_presence_absence_out', help='Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other downstream tools',
+                        required=False)
+    parser.add_argument("pyamilyseq_args", nargs=argparse.REMAINDER, help="Additional arguments for PyamilySeq.")
+    options = parser.parse_args()
+    output_path = os.path.abspath(options.output_dir)
+    combined_out_file = os.path.join(output_path,"end_to_end_combined_sequences.fasta")
+    clustering_output = os.path.join(output_path,'clustering_'+options.clustering_type)
+    # Step 1: Read and rename sequences from files based on input type
+    if options.input_type == 'separate':
+        read_separate_files(options.input_dir, options.name_split, combined_out_file)
+    else:
+        read_combined_files(options.input_dir, options.name_split, combined_out_file)
+    # Step 2: Run CD-HIT on the renamed sequences
+    run_cd_hit(combined_out_file, clustering_output, options)
+    # Options object passed to cluster() in place of parsed command-line arguments.
+    class ClusteringOptions:
+        def __init__(self):
+            self.format = 'CD-HIT'
+            self.reclustered = options.reclustered
+            self.sequence_tag = 'StORF'
+            self.core_groups = '99,95,15,0'
+            self.clusters = clustering_output+'.clstr'
+            self.gene_presence_absence_out = options.gene_presence_absence_out
+            self.write_families = options.write_families
+            self.con_core = options.con_core
+    clustering_options = ClusteringOptions()
+    # Step 3: Run PyamilySeq with the CD-HIT output
+    cluster(clustering_options)
+    #run_pyamilyseq(options.clustering_out, options.clustering_type, combined_out_file, options.pyamilyseq_args)
+if __name__ == "__main__":
+    main()
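
Review note: `reverse_complement` is defined in the added file, but `read_combined_files` slices `full_sequence[start - 1:end]` without reading the GFF strand column (`line_data[6]`), so reverse-strand CDS features are written in forward orientation. Whether that is intended is not stated in the diff. A strand-aware variant might look like the sketch below, where `extract_cds` is a hypothetical helper and the contig is made-up data.

```python
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}


def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown symbols fall back to 'N'."""
    return ''.join(COMPLEMENT.get(base, 'N') for base in reversed(seq.upper()))


def extract_cds(contig_seq, start, end, strand):
    """Slice a CDS using 1-based inclusive GFF coordinates, honouring strand."""
    cds = contig_seq[start - 1:end]
    return reverse_complement(cds) if strand == '-' else cds


contig = "AAATGCCCTAGGG"  # made-up contig for illustration
print(extract_cds(contig, 3, 11, '+'))  # prints "ATGCCCTAG"
print(extract_cds(contig, 3, 11, '-'))  # prints "CTAGGGCAT"
```

Unlike the dictionary lookup in the diff, `COMPLEMENT.get(base, 'N')` does not raise `KeyError` on lowercase or ambiguous bases.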

{pyamilyseq-0.3.0 → pyamilyseq-0.4.0}/src/PyamilySeq/PyamilySeq_Species.py RENAMED

@@ -722,22 +722,25 @@ def cluster(options):
 def main():
-    parser = argparse.ArgumentParser(description='PyamilySeq ' + PyamilySeq_Version + ': PyamilySeq Run Parameters.')
+    parser = argparse.ArgumentParser(description='PyamilySeq-Species ' + PyamilySeq_Version + ': PyamilySeq-Species Run Parameters.')
     parser._action_groups.pop()
     required = parser.add_argument_group('Required Arguments')
     required.add_argument('-c', action='store', dest='clusters', help='Clustering output file from CD-HIT, TSV or CSV Edge List',
                         required=True)
     required.add_argument('-f', action='store', dest='format', choices=['CD-HIT', 'CSV', 'TSV'],
-                        help='Which format to use (CD-HIT or  Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))', required=True)
+                        help='Which format to use (CD-HIT or  Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))',
+                        required=True)
     output_args = parser.add_argument_group('Output Parameters')
     output_args.add_argument('-w', action="store", dest='write_families', default=None,
                           help='Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95"'
-                               ' - Must provide FASTA file with -fasta')
+                               ' - Must provide FASTA file with -fasta',
+                          required=False)
     output_args.add_argument('-con', action="store", dest='con_core', default=None,
                           help='Default - No output: Output aligned and concatenated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95")'
-                               ' - Must provide FASTA file with -fasta')
+                               ' - Must provide FASTA file with -fasta',
+                          required=False)
     output_args.add_argument('-fasta', action='store', dest='fasta',
                           help='FASTA file to use in conjunction with "-w" or "-con"',
                           required=False)
@@ -754,9 +757,11 @@ def main():
     misc = parser.add_argument_group('Misc')
     misc.add_argument('-verbose', action='store', dest='verbose', default=False, type=eval, choices=[True, False],
-                        help='Default - False: Print out runtime messages')
+                        help='Default - False: Print out runtime messages',
+                        required = False)
     misc.add_argument('-v', action='store_true', dest='version',
-                        help='Default - False: Print out version number and exit')
+                        help='Default - False: Print out version number and exit',
+                        required=False)
     options = parser.parse_args()
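
Review note: the `-verbose` option shown as context above uses `type=eval`, which makes argparse evaluate whatever string the user passes before `choices` is checked. A safer converter could look like the sketch below. `str_to_bool` is a hypothetical helper, not something in the package, and the accepted spellings are an assumption.

```python
import argparse


def str_to_bool(value):
    """Convert common true/false spellings to bool without using eval."""
    lowered = value.strip().lower()
    if lowered in ('true', 't', 'yes', '1'):
        return True
    if lowered in ('false', 'f', 'no', '0'):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument('-verbose', dest='verbose', default=False, type=str_to_bool,
                    help='Default - False: Print out runtime messages')
options = parser.parse_args(['-verbose', 'True'])
print(options.verbose)  # prints True
```

If the value is never needed, `action='store_true'` (as already used for `-v`) is simpler still.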

pyamilyseq-0.4.0/src/PyamilySeq.egg-info/PKG-INFO ADDED

@@ -0,0 +1,92 @@
+Metadata-Version: 2.1
+Name: PyamilySeq
+Version: 0.4.0
+Summary: PyamilySeq - A tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
+Home-page: https://github.com/NickJD/PyamilySeq
+Author: Nicholas Dimonaco
+Author-email: nicholas@dimonaco.co.uk
+Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.6
+Description-Content-Type: text/markdown
+License-File: LICENSE
+# PyamilySeq - !BETA!
+**PyamilySeq** (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, BLAST, DIAMOND or MMseqs2.
+This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
+## Features
+- **End-to-End**: PyamilySeq can take a directory of GFF+FASTA files, run CD-HIT for clustering and process the results.
+- **Clustering**: Supports input from CD-HIT formatted files as well as CSV and TSV edge lists (-outfmt 6 from BLAST/DIAMOND).
+- **Reclustering**: Allows for the addition of new sequences post-initial clustering.
+- **Output**: Generates a Roary/Panaroo-formatted gene presence-absence CSV file for downstream analysis.
+  - Align representative sequences using MAFFT.
+  - Output concatenated aligned sequences for downstream analysis.
+  - Optionally output sequences of identified families.
+### Installation
+PyamilySeq requires Python 3.6 or higher. Install using pip:
+```bash
+pip install PyamilySeq
+```
+## Usage - Menu
+```
+usage: PyamilySeq.py [-h] -id INPUT_DIR -od OUTPUT_DIR -it {separate,combined} -ns NAME_SPLIT -pid PIDENT -ld LEN_DIFF -co CLUSTERING_OUT -ct {CD-HIT,BLAST,DIAMOND,MMseqs2} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG] [-groups CORE_GROUPS]
+                     [-gpa GENE_PRESENCE_ABSENCE_OUT]
+                     ...
+PyamilySeq v0.4.0: PyamilySeq Run Parameters.
+positional arguments:
+  pyamilyseq_args       Additional arguments for PyamilySeq.
+options:
+  -h, --help            show this help message and exit
+Required Arguments:
+  -id INPUT_DIR         Directory containing GFF/FASTA files.
+  -od OUTPUT_DIR        Directory for all output files.
+  -it {separate,combined}
+                        Type of input files: 'separate' for separate FASTA and GFF files, 'combined' for GFF files with embedded FASTA sequences.
+  -ns NAME_SPLIT        Character used to split the filename and extract the genome name.
+  -pid PIDENT           Pident threshold for CD-HIT clustering.
+  -ld LEN_DIFF          Length difference (-s) threshold for CD-HIT clustering.
+  -co CLUSTERING_OUT    Output file for initial clustering.
+  -ct {CD-HIT,BLAST,DIAMOND,MMseqs2}
+                        Clustering format for PyamilySeq.
+Output Parameters:
+  -w WRITE_FAMILIES     Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95") - Must provide FASTA file with -fasta
+  -con CON_CORE         Default - No output: Output aligned and concatenated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95") - Must provide FASTA file with -fasta
+  -fasta FASTA          FASTA file to use in conjunction with "-w" or "-con"
+Optional Arguments:
+  -rc RECLUSTERED       Clustering output file from secondary round of clustering
+  -st SEQUENCE_TAG      Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
+  -groups CORE_GROUPS   Default - ('99,95,15'): Gene family groups to use
+  -gpa GENE_PRESENCE_ABSENCE_OUT
+                        Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other downstream tools
+```
+### Example Run End-to-End - 'genomes' is a test-directory containing GFF files with ##FASTA at the bottom
+```bash
+PyamilySeq -id .../genomes -it combined -ns _combined.gff3 -pid 0.90 -ld 0.60 -co testing_cd-hit -ct CD-HIT -od .../testing
+```
+```
+Calculating Groups
+Gene Groups:
+first_core_99: 3103
+first_core_95: 0
+first_core_15: 3217
+first_core_0: 4808
+Total Number of Gene Groups (Including Singletons): 11128
+```

{pyamilyseq-0.3.0 → pyamilyseq-0.4.0}/src/PyamilySeq.egg-info/SOURCES.txt RENAMED

@@ -4,6 +4,7 @@ pyproject.toml
 setup.cfg
 src/PyamilySeq/CD-Hit_StORF-Reporter_Cross-Genera_Builder.py
 src/PyamilySeq/Constants.py
+src/PyamilySeq/PyamilySeq.py
 src/PyamilySeq/PyamilySeq_Species.py
 src/PyamilySeq/__init__.py
 src/PyamilySeq/combine_FASTA_with_genome_IDs.py

pyamilyseq-0.4.0/src/PyamilySeq.egg-info/entry_points.txt ADDED

@@ -0,0 +1,2 @@
+[console_scripts]
+PyamilySeq = PyamilySeq.PyamilySeq:main

pyamilyseq-0.3.0/PKG-INFO DELETED

@@ -1,103 +0,0 @@
-Metadata-Version: 2.1
-Name: PyamilySeq
-Version: 0.3.0
-Summary: PyamilySeq - A a tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
-Home-page: https://github.com/NickJD/PyamilySeq
-Author: Nicholas Dimonaco
-Author-email: nicholas@dimonaco.co.uk
-Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
-Classifier: Programming Language :: Python :: 3
-Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
-Classifier: Operating System :: OS Independent
-Requires-Python: >=3.6
-Description-Content-Type: text/markdown
-License-File: LICENSE
-# PyamilySeq - !BETA!
-PyamilySeq (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, DIAMOND or MMseqs2.
-This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
-## Features
-- **Clustering**: Supports input from CD-HIT formatted files as well as TSV and CSV Edge List formats.
-- **Reclustering**: Allows for the addition of new sequences post-initial clustering.
-- **Output**: Generates a gene 'Roary' presence-absence CSV formatted file for downstream analysis.
-## Installation
-PyamilySeq requires Python 3.6 or higher. Install dependencies using pip:
-```bash
-pip install PyamilySeq
-```
-## Usage - Menu
-```
-usage: PyamilySeq_Species.py [-h] -c CLUSTERS -f {CD-HIT,CSV,TSV} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG]
-                             [-groups CORE_GROUPS] [-gpa GENE_PRESENCE_ABSENCE_OUT] [-verbose {True,False}] [-v]
-PyamilySeq v0.3.0: PyamilySeq Run Parameters.
-Required Arguments:
-  -c CLUSTERS           Clustering output file from CD-HIT, TSV or CSV Edge List
-  -f {CD-HIT,CSV,TSV}   Which format to use (CD-HIT or Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))
-Output Parameters:
-  -w WRITE_FAMILIES     Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95" - Must provide FASTA file
-                        with -fasta
-  -con CON_CORE         Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to
-                        output "-w 99,95" - Must provide FASTA file with -fasta
-  -fasta FASTA          FASTA file to use in conjunction with "-w" or "-con"
-Optional Arguments:
-  -rc RECLUSTERED       Clustering output file from secondary round of clustering
-  -st SEQUENCE_TAG      Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
-  -groups CORE_GROUPS   Default - ('99,95,15'): Gene family groups to use
-  -gpa GENE_PRESENCE_ABSENCE_OUT
-                        Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other
-                        downstream tools
-Misc:
-  -verbose {True,False}
-                        Default - False: Print out runtime messages
-  -v                    Default - False: Print out version number and exit
-```
-### Clustering Analysis
-To perform clustering analysis:
-```bash
-python pyamilyseq.py -c clusters_file -f format
-```
-Replace `clusters_file` with the path to your clustering output file and `format` with one of: `CD-HIT`, `CSV`, or `TSV`.
-### Reclustering
-To add new sequences and recluster:
-```bash
-PyamilySeq -c clusters_file -f format --reclustered reclustered_file
-```
-Replace `reclustered_file` with the path to the file containing additional sequences.
-## Output
-PyamilySeq generates various outputs, including:
-- **Gene Presence-Absence File**: This CSV file details the presence and absence of genes across genomes.
-- **FASTA Files for Each Gene Family**:
-## Gene Family Groups
-After analysis, PyamilySeq categorizes gene families into several groups:
-- **First Core**: Gene families present in all analysed genomes initially.
-- **Extended Core**: Gene families extended with additional sequences.
-- **Combined Core**: Gene families combined with both initial and additional sequences.
-- **Second Core**: Gene families identified only in the additional sequences.
-- **Only Second Core**: Gene families exclusively found in the additional sequences.

pyamilyseq-0.3.0/README.md DELETED

@@ -1,88 +0,0 @@
-# PyamilySeq - !BETA!
-PyamilySeq (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, DIAMOND or MMseqs2.
-This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
-## Features
-- **Clustering**: Supports input from CD-HIT formatted files as well as TSV and CSV Edge List formats.
-- **Reclustering**: Allows for the addition of new sequences post-initial clustering.
-- **Output**: Generates a Roary-formatted gene presence-absence CSV file for downstream analysis.
-## Installation
-PyamilySeq requires Python 3.6 or higher. Install it using pip:
-```bash
-pip install PyamilySeq
-```
-## Usage - Menu
-```
-usage: PyamilySeq_Species.py [-h] -c CLUSTERS -f {CD-HIT,CSV,TSV} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG]
-                             [-groups CORE_GROUPS] [-gpa GENE_PRESENCE_ABSENCE_OUT] [-verbose {True,False}] [-v]
-PyamilySeq v0.3.0: PyamilySeq Run Parameters.
-Required Arguments:
-  -c CLUSTERS           Clustering output file from CD-HIT, TSV or CSV Edge List
-  -f {CD-HIT,CSV,TSV}   Which format to use (CD-HIT or Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))
-Output Parameters:
-  -w WRITE_FAMILIES     Default - No output: Output sequences of identified families (provide levels at which to output, e.g. "-w 99,95") - Must provide FASTA file
-                        with -fasta
-  -con CON_CORE         Default - No output: Output aligned and concatenated sequences of identified families - used for MSA (provide levels at which to
-                        output, e.g. "-con 99,95") - Must provide FASTA file with -fasta
-  -fasta FASTA          FASTA file to use in conjunction with "-w" or "-con"
-Optional Arguments:
-  -rc RECLUSTERED       Clustering output file from secondary round of clustering
-  -st SEQUENCE_TAG      Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
-  -groups CORE_GROUPS   Default - ('99,95,15'): Gene family groups to use
-  -gpa GENE_PRESENCE_ABSENCE_OUT
-                        Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other
-                        downstream tools
-Misc:
-  -verbose {True,False}
-                        Default - False: Print out runtime messages
-  -v                    Default - False: Print out version number and exit
-```
-### Clustering Analysis
-To perform clustering analysis:
-```bash
-PyamilySeq -c clusters_file -f format
-```
-Replace `clusters_file` with the path to your clustering output file and `format` with one of: `CD-HIT`, `CSV`, or `TSV`.
-### Reclustering
-To add new sequences and recluster:
-```bash
-PyamilySeq -c clusters_file -f format -rc reclustered_file
-```
-Replace `reclustered_file` with the path to the file containing additional sequences.
-## Output
-PyamilySeq generates various outputs, including:
-- **Gene Presence-Absence File**: This CSV file details the presence and absence of genes across genomes.
-- **FASTA Files for Each Gene Family**:
-## Gene Family Groups
-After analysis, PyamilySeq categorizes gene families into several groups:
-- **First Core**: Gene families present in all analysed genomes initially.
-- **Extended Core**: Gene families extended with additional sequences.
-- **Combined Core**: Gene families combined with both initial and additional sequences.
-- **Second Core**: Gene families identified only in the additional sequences.
-- **Only Second Core**: Gene families exclusively found in the additional sequences.

pyamilyseq-0.3.0/src/PyamilySeq.egg-info/PKG-INFO DELETED Viewed

@@ -1,103 +0,0 @@
-Metadata-Version: 2.1
-Name: PyamilySeq
-Version: 0.3.0
-Summary: PyamilySeq - A tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
-Home-page: https://github.com/NickJD/PyamilySeq
-Author: Nicholas Dimonaco
-Author-email: nicholas@dimonaco.co.uk
-Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
-Classifier: Programming Language :: Python :: 3
-Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
-Classifier: Operating System :: OS Independent
-Requires-Python: >=3.6
-Description-Content-Type: text/markdown
-License-File: LICENSE
-# PyamilySeq - !BETA!
-PyamilySeq (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, DIAMOND or MMseqs2.
-This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
-## Features
-- **Clustering**: Supports input from CD-HIT formatted files as well as TSV and CSV Edge List formats.
-- **Reclustering**: Allows for the addition of new sequences post-initial clustering.
-- **Output**: Generates a Roary-formatted gene presence-absence CSV file for downstream analysis.
-## Installation
-PyamilySeq requires Python 3.6 or higher. Install it using pip:
-```bash
-pip install PyamilySeq
-```
-## Usage - Menu
-```
-usage: PyamilySeq_Species.py [-h] -c CLUSTERS -f {CD-HIT,CSV,TSV} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG]
-                             [-groups CORE_GROUPS] [-gpa GENE_PRESENCE_ABSENCE_OUT] [-verbose {True,False}] [-v]
-PyamilySeq v0.3.0: PyamilySeq Run Parameters.
-Required Arguments:
-  -c CLUSTERS           Clustering output file from CD-HIT, TSV or CSV Edge List
-  -f {CD-HIT,CSV,TSV}   Which format to use (CD-HIT or Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))
-Output Parameters:
-  -w WRITE_FAMILIES     Default - No output: Output sequences of identified families (provide levels at which to output, e.g. "-w 99,95") - Must provide FASTA file
-                        with -fasta
-  -con CON_CORE         Default - No output: Output aligned and concatenated sequences of identified families - used for MSA (provide levels at which to
-                        output, e.g. "-con 99,95") - Must provide FASTA file with -fasta
-  -fasta FASTA          FASTA file to use in conjunction with "-w" or "-con"
-Optional Arguments:
-  -rc RECLUSTERED       Clustering output file from secondary round of clustering
-  -st SEQUENCE_TAG      Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
-  -groups CORE_GROUPS   Default - ('99,95,15'): Gene family groups to use
-  -gpa GENE_PRESENCE_ABSENCE_OUT
-                        Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other
-                        downstream tools
-Misc:
-  -verbose {True,False}
-                        Default - False: Print out runtime messages
-  -v                    Default - False: Print out version number and exit
-```
-### Clustering Analysis
-To perform clustering analysis:
-```bash
-PyamilySeq -c clusters_file -f format
-```
-Replace `clusters_file` with the path to your clustering output file and `format` with one of: `CD-HIT`, `CSV`, or `TSV`.
-### Reclustering
-To add new sequences and recluster:
-```bash
-PyamilySeq -c clusters_file -f format -rc reclustered_file
-```
-Replace `reclustered_file` with the path to the file containing additional sequences.
-## Output
-PyamilySeq generates various outputs, including:
-- **Gene Presence-Absence File**: This CSV file details the presence and absence of genes across genomes.
-- **FASTA Files for Each Gene Family**:
-## Gene Family Groups
-After analysis, PyamilySeq categorizes gene families into several groups:
-- **First Core**: Gene families present in all analysed genomes initially.
-- **Extended Core**: Gene families extended with additional sequences.
-- **Combined Core**: Gene families combined with both initial and additional sequences.
-- **Second Core**: Gene families identified only in the additional sequences.
-- **Only Second Core**: Gene families exclusively found in the additional sequences.