PyamilySeq 0.4.0__py3-none-any.whl → 0.5.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,49 +0,0 @@
-
- import argparse
- import gzip
- import glob
-
- def combine_files(files, split, glob_location, combined_out):
-     count = 0
-
-     for file in glob.glob(glob_location + '/' + files):
-         count += 1
-         try:
-             with gzip.open(file, 'rb') as genome:
-
-                 for line in genome:
-                     if line.startswith(b'#'):
-                         continue
-                     elif line.startswith(b'>'):
-                         genome_name = bytes(file.split(split)[0].split('/')[-1], 'utf-8')
-                         line = line.split(b' ')[0]
-                         line = line.replace(b'>', b'>' + genome_name + b'|')
-                         combined_out.write(line.decode('utf-8')+'\n')
-                     else:
-                         combined_out.write(line.decode('utf-8'))
-         except gzip.BadGzipFile:
-             with open(file, 'r') as genome:
-
-                 for line in genome:
-                     if line.startswith('#'):
-                         continue
-                     elif line.startswith('>'):
-                         genome_name = file.split(split)[0].split('/')[-1]
-                         line = line.replace('>', '>' + genome_name + '|')
-                         combined_out.write(line)
-                     else:
-                         combined_out.write(line)
-
- def main():
-     parser = argparse.ArgumentParser(description="Combine gzipped fasta files.")
-     parser.add_argument("files", help="File pattern to match within the specified directory.")
-     parser.add_argument("split", help="String used to split the file path and extract the genome name.")
-     parser.add_argument("glob_location", help="Directory location where the files are located.")
-     parser.add_argument("combined_out", help="Output file where the combined data will be written.")
-     args = parser.parse_args()
-
-     with open(args.combined_out, 'w') as combined_out:
-         combine_files(args.files, args.split, args.glob_location, combined_out)
-
- if __name__ == "__main__":
-     main()
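
The header rewrite performed by the removed `combine_files` can be looked at in isolation. The following is a minimal sketch, not the package's code: `tag_header` is a hypothetical helper name, it follows the gzip branch in truncating the header at the first space, and it only touches the leading `>` rather than calling `replace` on the whole line.

```python
def tag_header(line: str, file_path: str, split: str) -> str:
    """Prefix a FASTA header line with a genome name derived from file_path.

    Hypothetical helper for illustration; non-header lines pass through unchanged.
    """
    if not line.startswith('>'):
        return line
    # Genome name: text before `split`, with any leading directories removed.
    genome_name = file_path.split(split)[0].split('/')[-1]
    # Keep only the sequence ID (text up to the first space), as the gzip branch does.
    seq_id = line[1:].split(' ')[0].rstrip('\n')
    return '>' + genome_name + '|' + seq_id + '\n'


# Example with made-up file and sequence names.
print(tag_header('>gene_1 some description\n',
                 'genomes/E_coli_combined.gff3',
                 '_combined.gff3'), end='')
```

With the made-up inputs above, the header becomes `>E_coli|gene_1`, so every sequence in the combined file carries its genome of origin.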
@@ -1,92 +0,0 @@
- Metadata-Version: 2.1
- Name: PyamilySeq
- Version: 0.4.0
- Summary: PyamilySeq - A a tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
- Home-page: https://github.com/NickJD/PyamilySeq
- Author: Nicholas Dimonaco
- Author-email: nicholas@dimonaco.co.uk
- Project-URL: Bug Tracker, https://github.com/NickJD/PyamilySeq/issues
- Classifier: Programming Language :: Python :: 3
- Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
- Classifier: Operating System :: OS Independent
- Requires-Python: >=3.6
- Description-Content-Type: text/markdown
- License-File: LICENSE
-
- # PyamilySeq - !BETA!
- **PyamilySeq** (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, BLAST, DIAMOND or MMseqs2.
- This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
-
- ## Features
- - **End-to-End**: PyamilySeq can take a directory of GFF+FASTA files, run CD-HIT for clustering and process the results.
- - **Clustering**: Supports input from CD-HIT formatted files as well as CSV and TSV edge lists (-outfmt 6 from BLAST/DIAMOND).
- - **Reclustering**: Allows for the addition of new sequences post-initial clustering.
- - **Output**: Generates a gene 'Roary/Panaroo' formatted presence-absence CSV formatted file for downstream analysis.
- - Align representative sequences using MAFFT.
- - Output concatenated aligned sequences for downstream analysis.
- - Optionally output sequences of identified families.
-
-
- ### Installation
- PyamilySeq requires Python 3.6 or higher. Install using pip:
-
- ```bash
- pip install PyamilySeq
- ```
-
- ## Usage - Menu
- ```
- usage: PyamilySeq.py [-h] -id INPUT_DIR -od OUTPUT_DIR -it {separate,combined} -ns NAME_SPLIT -pid PIDENT -ld LEN_DIFF -co CLUSTERING_OUT -ct {CD-HIT,BLAST,DIAMOND,MMseqs2} [-w WRITE_FAMILIES] [-con CON_CORE] [-fasta FASTA] [-rc RECLUSTERED] [-st SEQUENCE_TAG] [-groups CORE_GROUPS]
- [-gpa GENE_PRESENCE_ABSENCE_OUT]
- ...
-
- PyamilySeq v0.4.0: PyamilySeq Run Parameters.
-
- positional arguments:
- pyamilyseq_args Additional arguments for PyamilySeq.
-
- options:
- -h, --help show this help message and exit
-
- Required Arguments:
- -id INPUT_DIR Directory containing GFF/FASTA files.
- -od OUTPUT_DIR Directory for all output files.
- -it {separate,combined}
- Type of input files: 'separate' for separate FASTA and GFF files, 'combined' for GFF files with embedded FASTA sequences.
- -ns NAME_SPLIT Character used to split the filename and extract the genome name.
- -pid PIDENT Pident threshold for CD-HIT clustering.
- -ld LEN_DIFF Length difference (-s) threshold for CD-HIT clustering.
- -co CLUSTERING_OUT Output file for initial clustering.
- -ct {CD-HIT,BLAST,DIAMOND,MMseqs2}
- Clustering format for PyamilySeq.
-
- Output Parameters:
- -w WRITE_FAMILIES Default - No output: Output sequences of identified families (provide levels at which to output "-w 99,95" - Must provide FASTA file with -fasta
- -con CON_CORE Default - No output: Output aligned and concatinated sequences of identified families - used for MSA (provide levels at which to output "-w 99,95" - Must provide FASTA file with -fasta
- -fasta FASTA FASTA file to use in conjunction with "-w" or "-con"
-
- Optional Arguments:
- -rc RECLUSTERED Clustering output file from secondary round of clustering
- -st SEQUENCE_TAG Default - "StORF": Unique identifier to be used to distinguish the second of two rounds of clustered sequences
- -groups CORE_GROUPS Default - ('99,95,15'): Gene family groups to use
- -gpa GENE_PRESENCE_ABSENCE_OUT
- Default - False: If selected, a Roary formatted gene_presence_absence.csv will be created - Required for Coinfinder and other downstream tools
- ```
-
- ### Example Run End-to-End - 'genomes' is a test-directory containing GFF files with ##FASTA at the bottom
-
- ```bash
- PyamilySeq -id .../genomes -it combined -ns _combined.gff3 -pid 0.90 -ld 0.60 -co testing_cd-hit -ct CD-HIT -od .../testing
- ```
-
- ```Calculating Groups
- Calculating Groups
- Gene Groups:
- first_core_99: 3103
- first_core_95: 0
- first_core_15: 3217
- first_core_0: 4808
- Total Number of Gene Groups (Including Singletons): 11128
- ```
-
-
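
The group counts in the example output above add up to the reported total (3103 + 0 + 3217 + 4808 = 11128), which suggests each gene family is counted in exactly one group. A minimal sketch of that kind of threshold binning follows. It is an assumption for illustration only: the rule "assign a family to the highest threshold its genome-presence percentage meets", the name `bin_families`, and the sample percentages are all hypothetical and are not taken from the PyamilySeq source.

```python
def bin_families(presence_pct, thresholds=(99, 95, 15, 0)):
    """Count families per threshold group (assumed binning rule, see lead-in).

    presence_pct: percentage of genomes in which each family is present.
    """
    ordered = sorted(thresholds, reverse=True)
    counts = {t: 0 for t in ordered}
    for pct in presence_pct:
        # A family falls into the first (highest) threshold it satisfies.
        for t in ordered:
            if pct >= t:
                counts[t] += 1
                break
    return counts


# Made-up presence percentages for six families.
example = [100.0, 100.0, 97.0, 50.0, 20.0, 5.0]
counts = bin_families(example)
for threshold, n in counts.items():
    print(f"first_core_{threshold}: {n}")
print("Total:", sum(counts.values()))
```

Because every family lands in exactly one bin, the per-group counts always sum to the number of families, matching the shape of the example output.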
@@ -1,12 +0,0 @@
- PyamilySeq/CD-Hit_StORF-Reporter_Cross-Genera_Builder.py,sha256=UzQ5iOKCNfurxmj1pnkowF11YfWBO5vnBCKxQK6goB8,26538
- PyamilySeq/Constants.py,sha256=971sO5fjptv27yRtg595ex8VuNURb2Nh4mFSdGx6HJ4,399
- PyamilySeq/PyamilySeq.py,sha256=Zy84pSBXY9EnMmk30SrfbQr9-SWYJ4rPHb9xbV3L9lU,8971
- PyamilySeq/PyamilySeq_Species.py,sha256=kTXeCgplHfCglii_g099zdt2iy0lc5wDX3k4HuSaIgo,39167
- PyamilySeq/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- PyamilySeq/combine_FASTA_with_genome_IDs.py,sha256=aMUVSk6jKnKX0g04RMM360QueZS83lRLqLLysBtQbLo,2009
- PyamilySeq-0.4.0.dist-info/LICENSE,sha256=OXLcl0T2SZ8Pmy2_dmlvKuetivmyPd5m1q-Gyd-zaYY,35149
- PyamilySeq-0.4.0.dist-info/METADATA,sha256=d0goQEGZZz_q6_sZUwoPr-h7FR-Ad7WmupIJuK8MTFc,4462
- PyamilySeq-0.4.0.dist-info/WHEEL,sha256=rWxmBtp7hEUqVLOnTaDOPpR-cZpCDkzhhcBce-Zyd5k,91
- PyamilySeq-0.4.0.dist-info/entry_points.txt,sha256=aEpNchWXaSR7_hGQqXYGtvXz14FgIcfFdXESpEhsvXg,58
- PyamilySeq-0.4.0.dist-info/top_level.txt,sha256=J6JhugUQTq4rq96yibAlQu3o4KCM9WuYfqr3w1r119M,11
- PyamilySeq-0.4.0.dist-info/RECORD,,
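
Each RECORD line above has the form `path,sha256=<digest>,<size in bytes>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed. A minimal sketch of how such an entry can be recomputed to check a file is shown here; `record_hash` and `record_entry` are hypothetical helper names, not part of any packaging library.

```python
import base64
import hashlib


def record_hash(data: bytes) -> str:
    """Return the RECORD-style hash string for a file's contents."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64, with '=' padding stripped as in wheel RECORD files.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def record_entry(path: str, data: bytes) -> str:
    """Build a full 'path,hash,size' RECORD line (hypothetical helper)."""
    return f"{path},{record_hash(data)},{len(data)}"


# The empty __init__.py (size 0) reproduces the entry listed in the RECORD above.
print(record_entry("PyamilySeq/__init__.py", b""))
```

The empty `__init__.py` makes a convenient check because its contents are known: hashing zero bytes yields the same digest that appears in the listing.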