PyamilySeq 0.3.0__tar.gz → 0.4.0__tar.gz

@@ -0,0 +1,92 @@
1
+ Metadata-Version: 2.1
2
+ Name: PyamilySeq
3
+ Version: 0.4.0
4
+ Summary: PyamilySeq - A a tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
5
+ Home-page: https://github.com/NickJD/PyamilySeq
6
+ Author: Nicholas Dimonaco
7
+ Author-email: nicholas@dimonaco.co.uk
8
+ Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
9
+ Classifier: Programming Language :: Python :: 3
10
+ Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
11
+ Classifier: Operating System :: OS Independent
12
+ Requires-Python: >=3.6
13
+ Description-Content-Type: text/markdown
14
+ License-File: LICENSE
15
+
16
+ # PyamilySeq - !BETA!
17
+ **PyamilySeq** (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, BLAST, DIAMOND or MMseqs2.
18
+ This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
19
+
20
+ ## Features
21
+ - **End-to-End**: PyamilySeq can take a directory of GFF+FASTA files, run CD-HIT for clustering and process the results.
22
+ - **Clustering**: Supports input from CD-HIT formatted files as well as CSV and TSV edge lists (-outfmt 6 from BLAST/DIAMOND).
23
+ - **Reclustering**: Allows for the addition of new sequences post-initial clustering.
24
+ - **Output**: Generates a gene 'Roary/Panaroo' formatted presence-absence CSV formatted file for downstream analysis.
25
+ - Align representative sequences using MAFFT.
26
+ - Output concatenated aligned sequences for downstream analysis.
27
+ - Optionally output sequences of identified families.
28
+
29
+
30
+ ### Installation
31
+ PyamilySeq requires Python 3.6 or higher. Install using pip:
32
+
33
+ ```bash
34
+ pip install PyamilySeq
35
+ ```
36
+
37
+ ## Usage - Menu
38
+ ```
39
+ usage: PyamilySeq.py [-h] -id INPUT_DIR -od OUTPUT_DIR -it {separate,combined} -ns NAME_SPLIT -pid PIDENT -ld LEN_DIFF -co CLUSTERING_OUT -ct {CD-HIT,BLAST,DIAMOND,MMseqs2} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG] [-groups CORE_GROUPS]
40
+ [-gpa GENE_PRESENCE_ABSENCE_OUT]
41
+ ...
42
+
43
+ PyamilySeq v0.4.0: PyamilySeq Run Parameters.
44
+
45
+ positional arguments:
46
+ pyamilyseq_args Additional arguments for PyamilySeq.
47
+
48
+ options:
49
+ -h, --help show this help message and exit
50
+
51
+ Required Arguments:
52
+ -id INPUT_DIR Directory containing GFF/FASTA files.
53
+ -od OUTPUT_DIR Directory for all output files.
54
+ -it {separate,combined}
55
+ Type of input files: 'separate' for separate FASTA and GFF files, 'combined' for GFF files with embedded FASTA sequences.
56
+ -ns NAME_SPLIT Character used to split the filename and extract the genome name.
57
+ -pid PIDENT Pident threshold for CD-HIT clustering.
58
+ -ld LEN_DIFF Length difference (-s) threshold for CD-HIT clustering.
59
+ -co CLUSTERING_OUT Output file for initial clustering.
60
+ -ct {CD-HIT,BLAST,DIAMOND,MMseqs2}
61
+ Clustering format for PyamilySeq.
62
+
63
+ Output Parameters:
64
+ -w WRITE_FAMILIES Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95" - Must provide FASTA file with -fasta
65
+ -con CON_CORE Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95" - Must provide FASTA file with -fasta
66
+ -fasta FASTA FASTA file to use in conjunction with "-w" or "-con"
67
+
68
+ Optional Arguments:
69
+ -rc RECLUSTERED Clustering output file from secondary round of clustering
70
+ -st SEQUENCE_TAG Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
71
+ -groups CORE_GROUPS Default - ('99,95,15'): Gene family groups to use
72
+ -gpa GENE_PRESENCE_ABSENCE_OUT
73
+ Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other downstream tools
74
+ ```
75
+
76
+ ### Example Run End-to-End - 'genomes' is a test-directory containing GFF files with ##FASTA at the bottom
77
+
78
+ ```bash
79
+ PyamilySeq -id .../genomes -it combined -ns _combined.gff3 -pid 0.90 -ld 0.60 -co testing_cd-hit -ct CD-HIT -od .../testing
80
+ ```
81
+
82
+ ```Calculating Groups
83
+ Calculating Groups
84
+ Gene Groups:
85
+ first_core_99: 3103
86
+ first_core_95: 0
87
+ first_core_15: 3217
88
+ first_core_0: 4808
89
+ Total Number of Gene Groups (Including Singletons): 11128
90
+ ```
91
+
92
+
@@ -0,0 +1,77 @@
1
+ # PyamilySeq - !BETA!
2
+ **PyamilySeq** (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, BLAST, DIAMOND or MMseqs2.
3
+ This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
4
+
5
+ ## Features
6
+ - **End-to-End**: PyamilySeq can take a directory of GFF+FASTA files, run CD-HIT for clustering and process the results.
7
+ - **Clustering**: Supports input from CD-HIT formatted files as well as CSV and TSV edge lists (-outfmt 6 from BLAST/DIAMOND).
8
+ - **Reclustering**: Allows for the addition of new sequences post-initial clustering.
9
+ - **Output**: Generates a gene 'Roary/Panaroo' formatted presence-absence CSV formatted file for downstream analysis.
10
+ - Align representative sequences using MAFFT.
11
+ - Output concatenated aligned sequences for downstream analysis.
12
+ - Optionally output sequences of identified families.
13
+
14
+
15
+ ### Installation
16
+ PyamilySeq requires Python 3.6 or higher. Install using pip:
17
+
18
+ ```bash
19
+ pip install PyamilySeq
20
+ ```
21
+
22
+ ## Usage - Menu
23
+ ```
24
+ usage: PyamilySeq.py [-h] -id INPUT_DIR -od OUTPUT_DIR -it {separate,combined} -ns NAME_SPLIT -pid PIDENT -ld LEN_DIFF -co CLUSTERING_OUT -ct {CD-HIT,BLAST,DIAMOND,MMseqs2} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG] [-groups CORE_GROUPS]
25
+ [-gpa GENE_PRESENCE_ABSENCE_OUT]
26
+ ...
27
+
28
+ PyamilySeq v0.4.0: PyamilySeq Run Parameters.
29
+
30
+ positional arguments:
31
+ pyamilyseq_args Additional arguments for PyamilySeq.
32
+
33
+ options:
34
+ -h, --help show this help message and exit
35
+
36
+ Required Arguments:
37
+ -id INPUT_DIR Directory containing GFF/FASTA files.
38
+ -od OUTPUT_DIR Directory for all output files.
39
+ -it {separate,combined}
40
+ Type of input files: 'separate' for separate FASTA and GFF files, 'combined' for GFF files with embedded FASTA sequences.
41
+ -ns NAME_SPLIT Character used to split the filename and extract the genome name.
42
+ -pid PIDENT Pident threshold for CD-HIT clustering.
43
+ -ld LEN_DIFF Length difference (-s) threshold for CD-HIT clustering.
44
+ -co CLUSTERING_OUT Output file for initial clustering.
45
+ -ct {CD-HIT,BLAST,DIAMOND,MMseqs2}
46
+ Clustering format for PyamilySeq.
47
+
48
+ Output Parameters:
49
+ -w WRITE_FAMILIES Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95" - Must provide FASTA file with -fasta
50
+ -con CON_CORE Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95" - Must provide FASTA file with -fasta
51
+ -fasta FASTA FASTA file to use in conjunction with "-w" or "-con"
52
+
53
+ Optional Arguments:
54
+ -rc RECLUSTERED Clustering output file from secondary round of clustering
55
+ -st SEQUENCE_TAG Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
56
+ -groups CORE_GROUPS Default - ('99,95,15'): Gene family groups to use
57
+ -gpa GENE_PRESENCE_ABSENCE_OUT
58
+ Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other downstream tools
59
+ ```
60
+
61
+ ### Example Run End-to-End - 'genomes' is a test-directory containing GFF files with ##FASTA at the bottom
62
+
63
+ ```bash
64
+ PyamilySeq -id .../genomes -it combined -ns _combined.gff3 -pid 0.90 -ld 0.60 -co testing_cd-hit -ct CD-HIT -od .../testing
65
+ ```
66
+
67
+ ```Calculating Groups
68
+ Calculating Groups
69
+ Gene Groups:
70
+ first_core_99: 3103
71
+ first_core_95: 0
72
+ first_core_15: 3217
73
+ first_core_0: 4808
74
+ Total Number of Gene Groups (Including Singletons): 11128
75
+ ```
76
+
77
+
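Note on the example output: the four group counts add up to the reported total (3103 + 0 + 3217 + 4808 = 11128), which suggests that each gene family is counted in exactly one group. A minimal sketch of that kind of binning follows. It assumes the `99,95,15` values are lower bounds on the percentage of genomes containing a family, with an implicit `0` bin catching the rest. The function name `bin_families` and this interpretation are illustrative assumptions, not PyamilySeq's actual implementation.

```python
def bin_families(family_genome_counts, total_genomes, thresholds=(99, 95, 15, 0)):
    """Assign each family to the highest percentage threshold it reaches.

    family_genome_counts: number of genomes in which each family occurs.
    thresholds: assumed lower bounds (percent of genomes), hypothetical semantics.
    """
    ordered = sorted(thresholds, reverse=True)
    bins = {t: 0 for t in ordered}
    for genomes_with_family in family_genome_counts:
        percent = 100.0 * genomes_with_family / total_genomes
        for t in ordered:
            if percent >= t:
                bins[t] += 1
                break  # each family lands in exactly one bin
    return bins


counts = bin_families([10, 10, 9, 5, 2, 1, 1], total_genomes=10)
for threshold, n in counts.items():
    print(f"first_core_{threshold}: {n}")
print("Total Number of Gene Groups:", sum(counts.values()))
```

Because every family stops at the first threshold it satisfies, the bins are disjoint and their counts always sum to the number of families, matching the pattern in the output above.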
@@ -1,6 +1,6 @@
1
1
  [metadata]
2
2
  name = PyamilySeq
3
- version = v0.3.0
3
+ version = v0.4.0
4
4
  author = Nicholas Dimonaco
5
5
  author_email = nicholas@dimonaco.co.uk
6
6
  description = PyamilySeq - A a tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
@@ -27,7 +27,7 @@ include = *
27
27
 
28
28
  [options.entry_points]
29
29
  console_scripts =
30
- PyamilySeq = PyamilySeq.PyamilySeq_Species:main
30
+ PyamilySeq = PyamilySeq.PyamilySeq:main
31
31
 
32
32
  [egg_info]
33
33
  tag_build =
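Note on the entry-point change: the command name `PyamilySeq` is unchanged, but it now resolves to `main` in the new `PyamilySeq.PyamilySeq` module rather than in `PyamilySeq.PyamilySeq_Species`. The sketch below only splits the two `console_scripts` strings from this hunk to make that comparison explicit. `split_entry_point` is a hypothetical helper written for illustration and is not part of the package or of setuptools.

```python
def split_entry_point(spec):
    """Split 'name = package.module:function' into (name, module, function)."""
    name, _, target = spec.partition("=")
    module, _, attr = target.strip().partition(":")
    return name.strip(), module, attr


old = split_entry_point("PyamilySeq = PyamilySeq.PyamilySeq_Species:main")
new = split_entry_point("PyamilySeq = PyamilySeq.PyamilySeq:main")

print("same command name:", old[0] == new[0])
print("target module:", old[1], "->", new[1])
```

In practice this likely means that invocations written for 0.3.0 (for example with `-c` and `-f`) would reach the new argument parser, which requires `-id`, `-od`, `-it` and the other end-to-end options instead.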
@@ -1,6 +1,6 @@
1
1
  import subprocess
2
2
 
3
- PyamilySeq_Version = 'v0.3.0'
3
+ PyamilySeq_Version = 'v0.4.0'
4
4
 
5
5
 
6
6
 
@@ -0,0 +1,186 @@
1
+ import argparse
2
+ import collections
3
+ import os
4
+ import glob
5
+ import subprocess
6
+ from PyamilySeq_Species import *
7
+
8
+
9
+ try:
10
+ from .PyamilySeq_Species import cluster
11
+ from .Constants import *
12
+ except (ModuleNotFoundError, ImportError, NameError, TypeError) as error:
13
+ from PyamilySeq_Species import cluster
14
+ from Constants import *
15
+
16
+ def reverse_complement(seq):
17
+ complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}
18
+ return ''.join(complement[base] for base in reversed(seq))
19
+
20
+
21
+ def read_separate_files(input_dir, name_split, combined_out):
22
+ with open(combined_out, 'w') as combined_out_file:
23
+ for fasta_file in glob.glob(os.path.join(input_dir, '*' + name_split)):
24
+ genome_name = os.path.basename(fasta_file).split(name_split)[0]
25
+ corresponding_gff_file = fasta_file.replace('.fasta', '.gff')
26
+ if not os.path.exists(corresponding_gff_file):
27
+ continue
28
+ cds_sequences = extract_cds_from_gff(fasta_file, corresponding_gff_file)
29
+ for gene_name, seq in cds_sequences:
30
+ header = f">{genome_name}_{gene_name}\n"
31
+ combined_out_file.write(header)
32
+ combined_out_file.write(seq + '\n')
33
+
34
+ def read_combined_files(input_dir, name_split, combined_out):
35
+ with open(combined_out, 'w') as combined_out_file:
36
+ for gff_file in glob.glob(os.path.join(input_dir, '*' + name_split)):
37
+ genome_name = os.path.basename(gff_file).split(name_split)[0]
38
+ fasta_dict = collections.defaultdict(str)
39
+ gff_features = []
40
+ with open(gff_file, 'r') as file:
41
+ lines = file.readlines()
42
+ fasta_section = False
43
+ for line in lines:
44
+ if line.startswith('##FASTA'):
45
+ fasta_section = True
46
+ continue
47
+ if fasta_section:
48
+ if line.startswith('>'):
49
+ current_contig = line[1:].split()[0]
50
+ fasta_dict[current_contig] = []
51
+ else:
52
+ fasta_dict[current_contig].append(line.strip())
53
+ else:
54
+ line_data = line.split('\t')
55
+ if len(line_data) == 9:
56
+ if line_data[2] == 'CDS':
57
+ contig = line_data[0]
58
+ feature = line_data[2]
59
+ start, end = int(line_data[3]), int(line_data[4])
60
+ seq_id = line_data[8].split('ID=')[1].split(';')[0]
61
+ gff_features.append((contig, start, end, seq_id))
62
+
63
+ if fasta_dict and gff_features:
64
+ for contig, start, end, seq_id in gff_features:
65
+ if contig in fasta_dict:
66
+ full_sequence = ''.join(fasta_dict[contig])
67
+ cds_sequence = full_sequence[start - 1:end]
68
+ wrapped_sequence = '\n'.join([cds_sequence[i:i + 60] for i in range(0, len(cds_sequence), 60)])
69
+ combined_out_file.write(f">{genome_name}|{seq_id}\n{wrapped_sequence}\n")
70
+
71
+
72
+ def run_cd_hit(input_file, clustering_output, options):
73
+ cdhit_command = [
74
+ 'cd-hit-est',
75
+ '-i', input_file,
76
+ '-o', clustering_output,
77
+ '-c', str(options.pident),
78
+ '-s', str(options.len_diff),
79
+ '-T', "20",
80
+ '-d', "0",
81
+ '-sc', "1",
82
+ '-sf', "1"
83
+ ]
84
+ subprocess.run(cdhit_command)
85
+
86
+
87
+
88
+
89
+
90
+
91
+
92
+
93
+ def main():
94
+ parser = argparse.ArgumentParser(
95
+ description='PyamilySeq ' + PyamilySeq_Version + ': PyamilySeq Run Parameters.')
96
+ required = parser.add_argument_group('Required Arguments')
97
+ required.add_argument("-id", action="store", dest="input_dir",
98
+ help="Directory containing GFF/FASTA files.",
99
+ required=True)
100
+ required.add_argument("-od", action="store", dest="output_dir",
101
+ help="Directory for all output files.",
102
+ required=True)
103
+ required.add_argument("-it", action="store", dest="input_type", choices=['separate', 'combined'],
104
+ help="Type of input files: 'separate' for separate FASTA and GFF files,"
105
+ " 'combined' for GFF files with embedded FASTA sequences.",
106
+ required=True)
107
+ required.add_argument("-ns", action="store", dest="name_split",
108
+ help="Character used to split the filename and extract the genome name.",
109
+ required=True)
110
+ required.add_argument("-pid", action="store", dest="pident", type=float,
111
+ help="Pident threshold for CD-HIT clustering.",
112
+ required=True)
113
+ required.add_argument("-ld", action="store", dest="len_diff", type=float,
114
+ help="Length difference (-s) threshold for CD-HIT clustering.",
115
+ required=True)
116
+ required.add_argument("-co", action="store", dest="clustering_out",
117
+ help="Output file for initial clustering.",
118
+ required=True)
119
+ required.add_argument("-ct", action="store", dest="clustering_type", choices=['CD-HIT', 'BLAST', 'DIAMOND', "MMseqs2"],
120
+ help="Clustering format for PyamilySeq.",
121
+ required=True)
122
+
123
+ output_args = parser.add_argument_group('Output Parameters')
124
+ output_args.add_argument('-w', action="store", dest='write_families', default=None,
125
+ help='Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95"'
126
+ ' - Must provide FASTA file with -fasta',
127
+ required=False)
128
+ output_args.add_argument('-con', action="store", dest='con_core', default=None,
129
+ help='Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95"'
130
+ ' - Must provide FASTA file with -fasta',
131
+ required=False)
132
+ output_args.add_argument('-fasta', action='store', dest='fasta',
133
+ help='FASTA file to use in conjunction with "-w" or "-con"',
134
+ required=False)
135
+
136
+ optional = parser.add_argument_group('Optional Arguments')
137
+ optional.add_argument('-rc', action='store', dest='reclustered', help='Clustering output file from secondary round of clustering',
138
+ required=False)
139
+ optional.add_argument('-st', action='store', dest='sequence_tag', help='Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences',
140
+ required=False)
141
+ optional.add_argument('-groups', action="store", dest='core_groups', default="99,95,15",
142
+ help='Default - (\'99,95,15\'): Gene family groups to use')
143
+ optional.add_argument('-gpa', action='store', dest='gene_presence_absence_out', help='Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other downstream tools',
144
+ required=False)
145
+
146
+ parser.add_argument("pyamilyseq_args", nargs=argparse.REMAINDER, help="Additional arguments for PyamilySeq.")
147
+ options = parser.parse_args()
148
+
149
+
150
+
151
+ output_path = os.path.abspath(options.output_dir)
152
+ combined_out_file = os.path.join(output_path,"end_to_end_combined_sequences.fasta")
153
+ clustering_output = os.path.join(output_path,'clustering_'+options.clustering_type)
154
+
155
+
156
+
157
+ # Step 1: Read and rename sequences from files based on input type
158
+ if options.input_type == 'separate':
159
+ read_separate_files(options.input_dir, options.name_split, combined_out_file)
160
+ else:
161
+ read_combined_files(options.input_dir, options.name_split, combined_out_file)
162
+
163
+ # Step 2: Run CD-HIT on the renamed sequences
164
+ run_cd_hit(combined_out_file, clustering_output, options)
165
+
166
+
167
+ class clustering_options:
168
+ def __init__(self):
169
+ self.format = 'CD-HIT'
170
+ self.reclustered = options.reclustered
171
+ self.sequence_tag = 'StORF'
172
+ self.core_groups = '99,95,15,0'
173
+ self.clusters = clustering_output+'.clstr'
174
+ self.gene_presence_absence_out = options.gene_presence_absence_out
175
+ self.write_families = options.write_families
176
+ self.con_core = options.con_core
177
+
178
+ clustering_options = clustering_options()
179
+
180
+ # Step 3: Run PyamilySeq with the CD-HIT output
181
+ cluster(clustering_options)
182
+ #run_pyamilyseq(options.clustering_out, options.clustering_type, combined_out_file, options.pyamilyseq_args)
183
+
184
+
185
+ if __name__ == "__main__":
186
+ main()
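Note on CDS extraction: `read_combined_files` slices `full_sequence[start - 1:end]` for every CDS and never reads the strand column, while `reverse_complement` is defined in this file but not called. The sketch below shows one way the two could fit together. It assumes that column 7 of each GFF line holds the strand and that minus-strand features should be reverse-complemented. The names `parse_combined_gff` and `extract_cds` are hypothetical and do not appear in the package.

```python
import collections

COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}


def reverse_complement(seq):
    # Same mapping as in the diff; unknown bases fall back to 'N' (an assumption).
    return ''.join(COMPLEMENT.get(base, 'N') for base in reversed(seq.upper()))


def parse_combined_gff(lines):
    """Return ({contig: sequence}, [CDS features]) from GFF lines with a ##FASTA section."""
    contigs = collections.defaultdict(list)
    features = []
    current = None
    in_fasta = False
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('##FASTA'):
            in_fasta = True
        elif in_fasta:
            if line.startswith('>'):
                current = line[1:].split()[0]
            elif current is not None:
                contigs[current].append(line.strip())
        else:
            cols = line.split('\t')
            if len(cols) == 9 and cols[2] == 'CDS' and 'ID=' in cols[8]:
                seq_id = cols[8].split('ID=')[1].split(';')[0]
                features.append((cols[0], int(cols[3]), int(cols[4]), cols[6], seq_id))
    sequences = {name: ''.join(parts) for name, parts in contigs.items()}
    return sequences, features


def extract_cds(sequences, features):
    """Yield (seq_id, sequence), reverse-complementing minus-strand features."""
    for contig, start, end, strand, seq_id in features:
        if contig not in sequences:
            continue
        cds = sequences[contig][start - 1:end]  # GFF coordinates are 1-based, inclusive
        if strand == '-':
            cds = reverse_complement(cds)
        yield seq_id, cds


example = [
    "##gff-version 3\n",
    "contig1\tsrc\tCDS\t1\t6\t.\t+\t0\tID=gene1;Name=a\n",
    "contig1\tsrc\tCDS\t7\t12\t.\t-\t0\tID=gene2\n",
    "##FASTA\n",
    ">contig1 description\n",
    "ATGAAA\n",
    "TTTCAT\n",
]
seqs, feats = parse_combined_gff(example)
result = dict(extract_cds(seqs, feats))
print(result)
```

In this toy input the second feature lies on the minus strand, so its slice `TTTCAT` is emitted as `ATGAAA`. Whether the released tool should behave this way is a design decision for the author; the sketch only illustrates the option.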
@@ -722,22 +722,25 @@ def cluster(options):
722
722
 
723
723
  def main():
724
724
 
725
- parser = argparse.ArgumentParser(description='PyamilySeq ' + PyamilySeq_Version + ': PyamilySeq Run Parameters.')
725
+ parser = argparse.ArgumentParser(description='PyamilySeq-Species ' + PyamilySeq_Version + ': PyamilySeq-Species Run Parameters.')
726
726
  parser._action_groups.pop()
727
727
 
728
728
  required = parser.add_argument_group('Required Arguments')
729
729
  required.add_argument('-c', action='store', dest='clusters', help='Clustering output file from CD-HIT, TSV or CSV Edge List',
730
730
  required=True)
731
731
  required.add_argument('-f', action='store', dest='format', choices=['CD-HIT', 'CSV', 'TSV'],
732
- help='Which format to use (CD-HIT or Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))', required=True)
732
+ help='Which format to use (CD-HIT or Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))',
733
+ required=True)
733
734
 
734
735
  output_args = parser.add_argument_group('Output Parameters')
735
736
  output_args.add_argument('-w', action="store", dest='write_families', default=None,
736
737
  help='Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95"'
737
- ' - Must provide FASTA file with -fasta')
738
+ ' - Must provide FASTA file with -fasta',
739
+ required=False)
738
740
  output_args.add_argument('-con', action="store", dest='con_core', default=None,
739
741
  help='Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95"'
740
- ' - Must provide FASTA file with -fasta')
742
+ ' - Must provide FASTA file with -fasta',
743
+ required=False)
741
744
  output_args.add_argument('-fasta', action='store', dest='fasta',
742
745
  help='FASTA file to use in conjunction with "-w" or "-con"',
743
746
  required=False)
@@ -754,9 +757,11 @@ def main():
754
757
 
755
758
  misc = parser.add_argument_group('Misc')
756
759
  misc.add_argument('-verbose', action='store', dest='verbose', default=False, type=eval, choices=[True, False],
757
- help='Default - False: Print out runtime messages')
760
+ help='Default - False: Print out runtime messages',
761
+ required = False)
758
762
  misc.add_argument('-v', action='store_true', dest='version',
759
- help='Default - False: Print out version number and exit')
763
+ help='Default - False: Print out version number and exit',
764
+ required=False)
760
765
 
761
766
 
762
767
  options = parser.parse_args()
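Note on `-verbose`: the option is declared with `type=eval`, so argparse passes the raw command-line text to `eval()` before checking it against `choices`. A small converter function avoids evaluating user input. The sketch below is one possible replacement; the helper name `str_to_bool` and the set of accepted spellings are assumptions made for illustration.

```python
import argparse


def str_to_bool(value):
    """Convert common true/false spellings to bool without calling eval()."""
    lowered = value.strip().lower()
    if lowered in ('true', 't', 'yes', '1'):
        return True
    if lowered in ('false', 'f', 'no', '0'):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument('-verbose', dest='verbose', default=False, type=str_to_bool,
                    help='Default - False: Print out runtime messages')

print(parser.parse_args(['-verbose', 'True']).verbose)
print(parser.parse_args([]).verbose)
```

Raising `argparse.ArgumentTypeError` lets argparse report an invalid value as a normal usage error rather than a traceback.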
@@ -0,0 +1,92 @@
1
+ Metadata-Version: 2.1
2
+ Name: PyamilySeq
3
+ Version: 0.4.0
4
+ Summary: PyamilySeq - A a tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
5
+ Home-page: https://github.com/NickJD/PyamilySeq
6
+ Author: Nicholas Dimonaco
7
+ Author-email: nicholas@dimonaco.co.uk
8
+ Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
9
+ Classifier: Programming Language :: Python :: 3
10
+ Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
11
+ Classifier: Operating System :: OS Independent
12
+ Requires-Python: >=3.6
13
+ Description-Content-Type: text/markdown
14
+ License-File: LICENSE
15
+
16
+ # PyamilySeq - !BETA!
17
+ **PyamilySeq** (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, BLAST, DIAMOND or MMseqs2.
18
+ This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
19
+
20
+ ## Features
21
+ - **End-to-End**: PyamilySeq can take a directory of GFF+FASTA files, run CD-HIT for clustering and process the results.
22
+ - **Clustering**: Supports input from CD-HIT formatted files as well as CSV and TSV edge lists (-outfmt 6 from BLAST/DIAMOND).
23
+ - **Reclustering**: Allows for the addition of new sequences post-initial clustering.
24
+ - **Output**: Generates a gene 'Roary/Panaroo' formatted presence-absence CSV formatted file for downstream analysis.
25
+ - Align representative sequences using MAFFT.
26
+ - Output concatenated aligned sequences for downstream analysis.
27
+ - Optionally output sequences of identified families.
28
+
29
+
30
+ ### Installation
31
+ PyamilySeq requires Python 3.6 or higher. Install using pip:
32
+
33
+ ```bash
34
+ pip install PyamilySeq
35
+ ```
36
+
37
+ ## Usage - Menu
38
+ ```
39
+ usage: PyamilySeq.py [-h] -id INPUT_DIR -od OUTPUT_DIR -it {separate,combined} -ns NAME_SPLIT -pid PIDENT -ld LEN_DIFF -co CLUSTERING_OUT -ct {CD-HIT,BLAST,DIAMOND,MMseqs2} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG] [-groups CORE_GROUPS]
40
+ [-gpa GENE_PRESENCE_ABSENCE_OUT]
41
+ ...
42
+
43
+ PyamilySeq v0.4.0: PyamilySeq Run Parameters.
44
+
45
+ positional arguments:
46
+ pyamilyseq_args Additional arguments for PyamilySeq.
47
+
48
+ options:
49
+ -h, --help show this help message and exit
50
+
51
+ Required Arguments:
52
+ -id INPUT_DIR Directory containing GFF/FASTA files.
53
+ -od OUTPUT_DIR Directory for all output files.
54
+ -it {separate,combined}
55
+ Type of input files: 'separate' for separate FASTA and GFF files, 'combined' for GFF files with embedded FASTA sequences.
56
+ -ns NAME_SPLIT Character used to split the filename and extract the genome name.
57
+ -pid PIDENT Pident threshold for CD-HIT clustering.
58
+ -ld LEN_DIFF Length difference (-s) threshold for CD-HIT clustering.
59
+ -co CLUSTERING_OUT Output file for initial clustering.
60
+ -ct {CD-HIT,BLAST,DIAMOND,MMseqs2}
61
+ Clustering format for PyamilySeq.
62
+
63
+ Output Parameters:
64
+ -w WRITE_FAMILIES Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95" - Must provide FASTA file with -fasta
65
+ -con CON_CORE Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95" - Must provide FASTA file with -fasta
66
+ -fasta FASTA FASTA file to use in conjunction with "-w" or "-con"
67
+
68
+ Optional Arguments:
69
+ -rc RECLUSTERED Clustering output file from secondary round of clustering
70
+ -st SEQUENCE_TAG Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
71
+ -groups CORE_GROUPS Default - ('99,95,15'): Gene family groups to use
72
+ -gpa GENE_PRESENCE_ABSENCE_OUT
73
+ Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other downstream tools
74
+ ```
75
+
76
+ ### Example Run End-to-End - 'genomes' is a test-directory containing GFF files with ##FASTA at the bottom
77
+
78
+ ```bash
79
+ PyamilySeq -id .../genomes -it combined -ns _combined.gff3 -pid 0.90 -ld 0.60 -co testing_cd-hit -ct CD-HIT -od .../testing
80
+ ```
81
+
82
+ ```Calculating Groups
83
+ Calculating Groups
84
+ Gene Groups:
85
+ first_core_99: 3103
86
+ first_core_95: 0
87
+ first_core_15: 3217
88
+ first_core_0: 4808
89
+ Total Number of Gene Groups (Including Singletons): 11128
90
+ ```
91
+
92
+
@@ -4,6 +4,7 @@ pyproject.toml
4
4
  setup.cfg
5
5
  src/PyamilySeq/CD-Hit_StORF-Reporter_Cross-Genera_Builder.py
6
6
  src/PyamilySeq/Constants.py
7
+ src/PyamilySeq/PyamilySeq.py
7
8
  src/PyamilySeq/PyamilySeq_Species.py
8
9
  src/PyamilySeq/__init__.py
9
10
  src/PyamilySeq/combine_FASTA_with_genome_IDs.py
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ PyamilySeq = PyamilySeq.PyamilySeq:main
pyamilyseq-0.3.0/PKG-INFO DELETED
@@ -1,103 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: PyamilySeq
3
- Version: 0.3.0
4
- Summary: PyamilySeq - A a tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
5
- Home-page: https://github.com/NickJD/PyamilySeq
6
- Author: Nicholas Dimonaco
7
- Author-email: nicholas@dimonaco.co.uk
8
- Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
9
- Classifier: Programming Language :: Python :: 3
10
- Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
11
- Classifier: Operating System :: OS Independent
12
- Requires-Python: >=3.6
13
- Description-Content-Type: text/markdown
14
- License-File: LICENSE
15
-
16
- # PyamilySeq - !BETA!
17
- PyamilySeq (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, DIAMOND or MMseqs2.
18
- This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
19
-
20
- ## Features
21
-
22
- - **Clustering**: Supports input from CD-HIT formatted files as well as TSV and CSV Edge List formats.
23
- - **Reclustering**: Allows for the addition of new sequences post-initial clustering.
24
- - **Output**: Generates a gene 'Roary' presence-absence CSV formatted file for downstream analysis.
25
-
26
- ## Installation
27
-
28
- PyamilySeq requires Python 3.6 or higher. Install dependencies using pip:
29
-
30
- ```bash
31
- pip install PyamilySeq
32
- ```
33
-
34
- ## Usage - Menu
35
- ```
36
- usage: PyamilySeq_Species.py [-h] -c CLUSTERS -f {CD-HIT,CSV,TSV} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG]
37
- [-groups CORE_GROUPS] [-gpa GENE_PRESENCE_ABSENCE_OUT] [-verbose {True,False}] [-v]
38
-
39
- PyamilySeq v0.3.0: PyamilySeq Run Parameters.
40
-
41
- Required Arguments:
42
- -c CLUSTERS Clustering output file from CD-HIT, TSV or CSV Edge List
43
- -f {CD-HIT,CSV,TSV} Which format to use (CD-HIT or Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))
44
-
45
- Output Parameters:
46
- -w WRITE_FAMILIES Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95" - Must provide FASTA file
47
- with -fasta
48
- -con CON_CORE Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to
49
- output "-w 99,95" - Must provide FASTA file with -fasta
50
- -fasta FASTA FASTA file to use in conjunction with "-w" or "-con"
51
-
52
- Optional Arguments:
53
- -rc RECLUSTERED Clustering output file from secondary round of clustering
54
- -st SEQUENCE_TAG Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
55
- -groups CORE_GROUPS Default - ('99,95,15'): Gene family groups to use
56
- -gpa GENE_PRESENCE_ABSENCE_OUT
57
- Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other
58
- downstream tools
59
-
60
- Misc:
61
- -verbose {True,False}
62
- Default - False: Print out runtime messages
63
- -v Default - False: Print out version number and exit
64
-
65
-
66
- ```
67
-
68
- ### Clustering Analysis
69
-
70
- To perform clustering analysis:
71
-
72
- ```bash
73
- python pyamilyseq.py -c clusters_file -f format
74
- ```
75
-
76
- Replace `clusters_file` with the path to your clustering output file and `format` with one of: `CD-HIT`, `CSV`, or `TSV`.
77
-
78
- ### Reclustering
79
-
80
- To add new sequences and recluster:
81
-
82
- ```bash
83
- PyamilySeq -c clusters_file -f format --reclustered reclustered_file
84
- ```
85
-
86
- Replace `reclustered_file` with the path to the file containing additional sequences.
87
-
88
- ## Output
89
-
90
- PyamilySeq generates various outputs, including:
91
-
92
- - **Gene Presence-Absence File**: This CSV file details the presence and absence of genes across genomes.
93
- - **FASTA Files for Each Gene Family**:
94
-
95
- ## Gene Family Groups
96
-
97
- After analysis, PyamilySeq categorizes gene families into several groups:
98
-
99
- - **First Core**: Gene families present in all analysed genomes initially.
100
- - **Extended Core**: Gene families extended with additional sequences.
101
- - **Combined Core**: Gene families combined with both initial and additional sequences.
102
- - **Second Core**: Gene families identified only in the additional sequences.
103
- - **Only Second Core**: Gene families exclusively found in the additional sequences.
@@ -1,88 +0,0 @@
1
- # PyamilySeq - !BETA!
2
- PyamilySeq (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, DIAMOND or MMseqs2.
3
- This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
4
-
5
- ## Features
6
-
7
- - **Clustering**: Supports input from CD-HIT formatted files as well as TSV and CSV Edge List formats.
8
- - **Reclustering**: Allows for the addition of new sequences post-initial clustering.
9
- - **Output**: Generates a gene 'Roary' presence-absence CSV formatted file for downstream analysis.
10
-
11
- ## Installation
12
-
13
- PyamilySeq requires Python 3.6 or higher. Install dependencies using pip:
14
-
15
- ```bash
16
- pip install PyamilySeq
17
- ```
18
-
19
- ## Usage - Menu
20
- ```
21
- usage: PyamilySeq_Species.py [-h] -c CLUSTERS -f {CD-HIT,CSV,TSV} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG]
22
- [-groups CORE_GROUPS] [-gpa GENE_PRESENCE_ABSENCE_OUT] [-verbose {True,False}] [-v]
23
-
24
- PyamilySeq v0.3.0: PyamilySeq Run Parameters.
25
-
26
- Required Arguments:
27
- -c CLUSTERS Clustering output file from CD-HIT, TSV or CSV Edge List
28
- -f {CD-HIT,CSV,TSV} Which format to use (CD-HIT or Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))
29
-
30
- Output Parameters:
31
- -w WRITE_FAMILIES Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95" - Must provide FASTA file
32
- with -fasta
33
- -con CON_CORE Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to
34
- output "-w 99,95" - Must provide FASTA file with -fasta
35
- -fasta FASTA FASTA file to use in conjunction with "-w" or "-con"
36
-
37
- Optional Arguments:
38
- -rc RECLUSTERED Clustering output file from secondary round of clustering
39
- -st SEQUENCE_TAG Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
40
- -groups CORE_GROUPS Default - ('99,95,15'): Gene family groups to use
41
- -gpa GENE_PRESENCE_ABSENCE_OUT
42
- Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other
43
- downstream tools
44
-
45
- Misc:
46
- -verbose {True,False}
47
- Default - False: Print out runtime messages
48
- -v Default - False: Print out version number and exit
49
-
50
-
51
- ```
52
-
53
- ### Clustering Analysis
54
-
55
- To perform clustering analysis:
56
-
57
- ```bash
58
- PyamilySeq -c clusters_file -f format
59
- ```
60
-
61
- Replace `clusters_file` with the path to your clustering output file and `format` with one of: `CD-HIT`, `CSV`, or `TSV`.
62
-
63
- ### Reclustering
64
-
65
- To add new sequences and recluster:
66
-
67
- ```bash
68
- PyamilySeq -c clusters_file -f format -rc reclustered_file
69
- ```
70
-
71
- Replace `reclustered_file` with the path to the clustering output file from the second round of clustering.
72
-
73
- ## Output
74
-
75
- PyamilySeq generates various outputs, including:
76
-
77
- - **Gene Presence-Absence File**: This CSV file details the presence and absence of genes across genomes.
78
- - **FASTA Files for Each Gene Family**: Sequences of the identified families, written when `-w` is used together with `-fasta`.
79
-
80
- ## Gene Family Groups
81
-
82
- After analysis, PyamilySeq categorizes gene families into several groups:
83
-
84
- - **First Core**: Gene families present in all analysed genomes initially.
85
- - **Extended Core**: Gene families extended with additional sequences.
86
- - **Combined Core**: Gene families combined with both initial and additional sequences.
87
- - **Second Core**: Gene families identified only in the additional sequences.
88
- - **Only Second Core**: Gene families exclusively found in the additional sequences.
@@ -1,103 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: PyamilySeq
3
- Version: 0.3.0
4
- Summary: PyamilySeq - A a tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
5
- Home-page: https://github.com/NickJD/PyamilySeq
6
- Author: Nicholas Dimonaco
7
- Author-email: nicholas@dimonaco.co.uk
8
- Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
9
- Classifier: Programming Language :: Python :: 3
10
- Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
11
- Classifier: Operating System :: OS Independent
12
- Requires-Python: >=3.6
13
- Description-Content-Type: text/markdown
14
- License-File: LICENSE
15
-
16
- # PyamilySeq - !BETA!
17
- PyamilySeq (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, DIAMOND or MMseqs2.
18
- This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
19
-
20
- ## Features
21
-
22
- - **Clustering**: Supports input from CD-HIT formatted files as well as TSV and CSV Edge List formats.
23
- - **Reclustering**: Allows for the addition of new sequences post-initial clustering.
24
- - **Output**: Generates a Roary-formatted gene presence-absence CSV file for downstream analysis.
25
-
26
- ## Installation
27
-
28
- PyamilySeq requires Python 3.6 or higher. Install it using pip:
29
-
30
- ```bash
31
- pip install PyamilySeq
32
- ```
33
-
34
- ## Usage - Menu
35
- ```
36
- usage: PyamilySeq_Species.py [-h] -c CLUSTERS -f {CD-HIT,CSV,TSV} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG]
37
- [-groups CORE_GROUPS] [-gpa GENE_PRESENCE_ABSENCE_OUT] [-verbose {True,False}] [-v]
38
-
39
- PyamilySeq v0.3.0: PyamilySeq Run Parameters.
40
-
41
- Required Arguments:
42
- -c CLUSTERS Clustering output file from CD-HIT, TSV or CSV Edge List
43
- -f {CD-HIT,CSV,TSV} Which format to use (CD-HIT or Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))
44
-
45
- Output Parameters:
46
- -w WRITE_FAMILIES Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95" - Must provide FASTA file
47
- with -fasta
48
- -con CON_CORE Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to
49
- output "-w 99,95" - Must provide FASTA file with -fasta
50
- -fasta FASTA FASTA file to use in conjunction with "-w" or "-con"
51
-
52
- Optional Arguments:
53
- -rc RECLUSTERED Clustering output file from secondary round of clustering
54
- -st SEQUENCE_TAG Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
55
- -groups CORE_GROUPS Default - ('99,95,15'): Gene family groups to use
56
- -gpa GENE_PRESENCE_ABSENCE_OUT
57
- Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other
58
- downstream tools
59
-
60
- Misc:
61
- -verbose {True,False}
62
- Default - False: Print out runtime messages
63
- -v Default - False: Print out version number and exit
64
-
65
-
66
- ```
67
-
68
- ### Clustering Analysis
69
-
70
- To perform clustering analysis:
71
-
72
- ```bash
73
- PyamilySeq -c clusters_file -f format
74
- ```
75
-
76
- Replace `clusters_file` with the path to your clustering output file and `format` with one of: `CD-HIT`, `CSV`, or `TSV`.
77
-
78
- ### Reclustering
79
-
80
- To add new sequences and recluster:
81
-
82
- ```bash
83
- PyamilySeq -c clusters_file -f format -rc reclustered_file
84
- ```
85
-
86
- Replace `reclustered_file` with the path to the clustering output file from the second round of clustering.
87
-
88
- ## Output
89
-
90
- PyamilySeq generates various outputs, including:
91
-
92
- - **Gene Presence-Absence File**: This CSV file details the presence and absence of genes across genomes.
93
- - **FASTA Files for Each Gene Family**: Sequences of the identified families, written when `-w` is used together with `-fasta`.
94
-
95
- ## Gene Family Groups
96
-
97
- After analysis, PyamilySeq categorizes gene families into several groups:
98
-
99
- - **First Core**: Gene families present in all analysed genomes initially.
100
- - **Extended Core**: Gene families extended with additional sequences.
101
- - **Combined Core**: Gene families combined with both initial and additional sequences.
102
- - **Second Core**: Gene families identified only in the additional sequences.
103
- - **Only Second Core**: Gene families exclusively found in the additional sequences.
@@ -1,2 +0,0 @@
1
- [console_scripts]
2
- PyamilySeq = PyamilySeq.PyamilySeq_Species:main
File without changes
File without changes