PyamilySeq 0.3.0__py3-none-any.whl → 0.5.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,49 +0,0 @@
-
- import argparse
- import gzip
- import glob
-
- def combine_files(files, split, glob_location, combined_out):
-     count = 0
-
-     for file in glob.glob(glob_location + '/' + files):
-         count += 1
-         try:
-             with gzip.open(file, 'rb') as genome:
-
-                 for line in genome:
-                     if line.startswith(b'#'):
-                         continue
-                     elif line.startswith(b'>'):
-                         genome_name = bytes(file.split(split)[0].split('/')[-1], 'utf-8')
-                         line = line.split(b' ')[0]
-                         line = line.replace(b'>', b'>' + genome_name + b'|')
-                         combined_out.write(line.decode('utf-8')+'\n')
-                     else:
-                         combined_out.write(line.decode('utf-8'))
-         except gzip.BadGzipFile:
-             with open(file, 'r') as genome:
-
-                 for line in genome:
-                     if line.startswith('#'):
-                         continue
-                     elif line.startswith('>'):
-                         genome_name = file.split(split)[0].split('/')[-1]
-                         line = line.replace('>', '>' + genome_name + '|')
-                         combined_out.write(line)
-                     else:
-                         combined_out.write(line)
-
- def main():
-     parser = argparse.ArgumentParser(description="Combine gzipped fasta files.")
-     parser.add_argument("files", help="File pattern to match within the specified directory.")
-     parser.add_argument("split", help="String used to split the file path and extract the genome name.")
-     parser.add_argument("glob_location", help="Directory location where the files are located.")
-     parser.add_argument("combined_out", help="Output file where the combined data will be written.")
-     args = parser.parse_args()
-
-     with open(args.combined_out, 'w') as combined_out:
-         combine_files(args.files, args.split, args.glob_location, combined_out)
-
- if __name__ == "__main__":
-     main()
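The removed script above tags every FASTA header with a genome name taken from the file path. As a rough illustration of that header rewrite only, here is a minimal sketch on plain strings. The helper names `genome_name_from_path` and `tag_header` are hypothetical and do not appear in the package. The sketch follows the gzip branch (keep only the first whitespace-delimited token of the header) and strips the trailing newline before re-adding one, which is a choice made here rather than something taken from the removed code.

```python
def genome_name_from_path(path: str, split: str) -> str:
    """Keep the text before `split`, then drop any leading directories."""
    return path.split(split)[0].split('/')[-1]


def tag_header(line: str, genome_name: str) -> str:
    """Rewrite '>id description' as '>genome|id' (hypothetical helper)."""
    # Keep only the first token of the header, without its line ending.
    first_token = line.rstrip('\n').split(' ')[0]
    # Insert 'genome|' directly after the leading '>'.
    return first_token.replace('>', '>' + genome_name + '|', 1) + '\n'


if __name__ == "__main__":
    name = genome_name_from_path("genomes/ecoli_K12.fna.gz", ".fna")
    print(tag_header(">gene1 hypothetical protein\n", name), end="")
```

Keeping the rewrite in a small pure function like this makes it possible to check the header format without touching the file system or gzip handling.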
@@ -1,103 +0,0 @@
- Metadata-Version: 2.1
- Name: PyamilySeq
- Version: 0.3.0
- Summary: PyamilySeq - A a tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
- Home-page: https://github.com/NickJD/PyamilySeq
- Author: Nicholas Dimonaco
- Author-email: nicholas@dimonaco.co.uk
- Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
- Classifier: Programming Language :: Python :: 3
- Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
- Classifier: Operating System :: OS Independent
- Requires-Python: >=3.6
- Description-Content-Type: text/markdown
- License-File: LICENSE
-
- # PyamilySeq - !BETA!
- PyamilySeq (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, DIAMOND or MMseqs2.
- This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
-
- ## Features
-
- - **Clustering**: Supports input from CD-HIT formatted files as well as TSV and CSV Edge List formats.
- - **Reclustering**: Allows for the addition of new sequences post-initial clustering.
- - **Output**: Generates a gene 'Roary' presence-absence CSV formatted file for downstream analysis.
-
- ## Installation
-
- PyamilySeq requires Python 3.6 or higher. Install dependencies using pip:
-
- ```bash
- pip install PyamilySeq
- ```
-
- ## Usage - Menu
- ```
- usage: PyamilySeq_Species.py [-h] -c CLUSTERS -f {CD-HIT,CSV,TSV} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG]
- [-groups CORE_GROUPS] [-gpa GENE_PRESENCE_ABSENCE_OUT] [-verbose {True,False}] [-v]
-
- PyamilySeq v0.3.0: PyamilySeq Run Parameters.
-
- Required Arguments:
- -c CLUSTERS Clustering output file from CD-HIT, TSV or CSV Edge List
- -f {CD-HIT,CSV,TSV} Which format to use (CD-HIT or Comma/Tab Separated Edge-List (such as MMseqs2 tsv output))
-
- Output Parameters:
- -w WRITE_FAMILIES Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95" - Must provide FASTA file
- with -fasta
- -con CON_CORE Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to
- output "-w 99,95" - Must provide FASTA file with -fasta
- -fasta FASTA FASTA file to use in conjunction with "-w" or "-con"
-
- Optional Arguments:
- -rc RECLUSTERED Clustering output file from secondary round of clustering
- -st SEQUENCE_TAG Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
- -groups CORE_GROUPS Default - ('99,95,15'): Gene family groups to use
- -gpa GENE_PRESENCE_ABSENCE_OUT
- Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other
- downstream tools
-
- Misc:
- -verbose {True,False}
- Default - False: Print out runtime messages
- -v Default - False: Print out version number and exit
-
-
- ```
-
- ### Clustering Analysis
-
- To perform clustering analysis:
-
- ```bash
- python pyamilyseq.py -c clusters_file -f format
- ```
-
- Replace `clusters_file` with the path to your clustering output file and `format` with one of: `CD-HIT`, `CSV`, or `TSV`.
-
- ### Reclustering
-
- To add new sequences and recluster:
-
- ```bash
- PyamilySeq -c clusters_file -f format --reclustered reclustered_file
- ```
-
- Replace `reclustered_file` with the path to the file containing additional sequences.
-
- ## Output
-
- PyamilySeq generates various outputs, including:
-
- - **Gene Presence-Absence File**: This CSV file details the presence and absence of genes across genomes.
- - **FASTA Files for Each Gene Family**:
-
- ## Gene Family Groups
-
- After analysis, PyamilySeq categorizes gene families into several groups:
-
- - **First Core**: Gene families present in all analysed genomes initially.
- - **Extended Core**: Gene families extended with additional sequences.
- - **Combined Core**: Gene families combined with both initial and additional sequences.
- - **Second Core**: Gene families identified only in the additional sequences.
- - **Only Second Core**: Gene families exclusively found in the additional sequences.
@@ -1,11 +0,0 @@
- PyamilySeq/CD-Hit_StORF-Reporter_Cross-Genera_Builder.py,sha256=UzQ5iOKCNfurxmj1pnkowF11YfWBO5vnBCKxQK6goB8,26538
- PyamilySeq/Constants.py,sha256=PdgSIux2jfv6QlAOxRIFgbsH95Xq6DMQcvZodGsk7tw,399
- PyamilySeq/PyamilySeq_Species.py,sha256=zLGfyTtxk4znoUevyjfb978pT3XNjWu44-8Seqnl7ec,38961
- PyamilySeq/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- PyamilySeq/combine_FASTA_with_genome_IDs.py,sha256=aMUVSk6jKnKX0g04RMM360QueZS83lRLqLLysBtQbLo,2009
- PyamilySeq-0.3.0.dist-info/LICENSE,sha256=OXLcl0T2SZ8Pmy2_dmlvKuetivmyPd5m1q-Gyd-zaYY,35149
- PyamilySeq-0.3.0.dist-info/METADATA,sha256=iJOcBDtkFBUZTFGQMBwdUuaZnKAO1I9Pc-YFgvvhySQ,4382
- PyamilySeq-0.3.0.dist-info/WHEEL,sha256=-oYQCr74JF3a37z2nRlQays_SX2MqOANoqVjBBAP2yE,91
- PyamilySeq-0.3.0.dist-info/entry_points.txt,sha256=zGtA2Ycf0LG3PR7zuuT0wjaAKLFxtyGgBc0O_W7E250,66
- PyamilySeq-0.3.0.dist-info/top_level.txt,sha256=J6JhugUQTq4rq96yibAlQu3o4KCM9WuYfqr3w1r119M,11
- PyamilySeq-0.3.0.dist-info/RECORD,,
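Each row in the RECORD listing above is `path,hash,size`, where the hash is the SHA-256 digest encoded as URL-safe base64 with the trailing `=` padding removed. The sketch below shows how such a value can be recomputed and how a row can be split. The function names `record_digest` and `parse_record_row` are hypothetical and exist only for this illustration. The zero-byte `PyamilySeq/__init__.py` row makes a convenient check, since its hash is simply the digest of empty input.

```python
import base64
import csv
import hashlib


def record_digest(data: bytes) -> str:
    """Return the RECORD-style hash string for `data` (hypothetical helper)."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64, with the '=' padding stripped as in wheel RECORD files.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b'=').decode('ascii')
    return 'sha256=' + encoded


def parse_record_row(line: str):
    """Split one RECORD row into (path, hash, size); size is None when empty."""
    path, digest, size = next(csv.reader([line]))
    return path, digest, int(size) if size else None


if __name__ == "__main__":
    row = "PyamilySeq/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
    path, digest, size = parse_record_row(row)
    # The empty file's digest should match the value listed in RECORD.
    print(path, digest == record_digest(b""), size)
```

The RECORD row for RECORD itself carries no hash or size, which is why the parser treats an empty size field as `None` instead of failing.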
@@ -1,2 +0,0 @@
- [console_scripts]
- PyamilySeq = PyamilySeq.PyamilySeq_Species:main
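The removed `entry_points.txt` content above maps the command name `PyamilySeq` to the function `main` in the module `PyamilySeq.PyamilySeq_Species`. The file uses INI syntax, so a minimal sketch of reading it with the standard library might look like the following. The helper name `parse_console_scripts` is hypothetical, and real installers use their own tooling for this step.

```python
import configparser


def parse_console_scripts(text: str) -> dict:
    """Map each console script name to a (module, attribute) pair."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=('=',))
    parser.optionxform = str  # keep command names case-sensitive
    parser.read_string(text)
    scripts = {}
    if parser.has_section('console_scripts'):
        for name, target in parser.items('console_scripts'):
            # Targets have the form 'package.module:callable'.
            module, _, attr = target.partition(':')
            scripts[name] = (module.strip(), attr.strip())
    return scripts


if __name__ == "__main__":
    content = "[console_scripts]\nPyamilySeq = PyamilySeq.PyamilySeq_Species:main\n"
    print(parse_console_scripts(content))
```

Setting `optionxform` to `str` matters here because `configparser` lowercases option names by default, which would turn the command into `pyamilyseq`.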