cheripic 1.2.0 → 1.2.5
- checksums.yaml +4 -4
- data/ChangeLog.md +21 -0
- data/Gemfile +0 -1
- data/README.md +29 -27
- data/Rakefile +9 -5
- data/bin/cheripic +1 -0
- data/cheripic.gemspec +0 -2
- data/galaxy_cheripic_tool.xml +196 -0
- data/lib/cheripic.rb +1 -0
- data/lib/cheripic/cmd.rb +87 -42
- data/lib/cheripic/contig.rb +3 -1
- data/lib/cheripic/contig_pileups.rb +44 -40
- data/lib/cheripic/implementer.rb +24 -14
- data/lib/cheripic/options.rb +22 -0
- data/lib/cheripic/variants.rb +140 -9
- data/lib/cheripic/vcf.rb +83 -0
- data/lib/cheripic/version.rb +1 -1
- metadata +5 -36
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 824a8c68d3707ad02cf0d3b7d567191244d1a5a6
|
4
|
+
data.tar.gz: 6d2b3c7bef04aba06b5206968d1a0d69996a25b0
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 6ae5a85c30a0b1ea19f118409ddec95a6c7c3e11f00663e9769a5642770e90cb2ab5b0200f9d9eaa4ed8c6873492ac7f5f3acd568dc0cc14ddf8ccaac5012435
|
7
|
+
data.tar.gz: efe77b2ccafd0ad7ed4eeb47b497207cacf3dbaee058779b46af7a5ca34597991a3949dbeb0f3807bba69ea7dca0dd7e24fb58c040bf0065fef2ca1e4e3424fc
|
data/ChangeLog.md
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
### Change Log
|
2
|
+
|
3
|
+
All significant changes to this project at each release are documented in this file.
|
4
|
+
|
5
|
+
|
6
|
+
#### Future changes to include
|
7
|
+
|
8
|
+
1. option to take multiple background pileup files
|
9
|
+
2. replace output directory with output file name tag, since we only write to one file
|
10
|
+
3. option to take bam file or pileup file as inputs of bulks
|
11
|
+
|
12
|
+
#### [1.2.0] - 2016-08-11
|
13
|
+
|
14
|
+
1. fixed calculation of heterozygosity for background bulks
|
15
|
+
2. changed command line boolean option to be set using only true or false
|
16
|
+
3. included command line option to set length of sequence to retrieve on either side of each variant
|
17
|
+
|
18
|
+
|
19
|
+
#### [1.1.0] - 2016-07-26
|
20
|
+
|
21
|
+
first release of the binaries for Linux 64 bit and OSX 64bit
|
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -11,6 +11,7 @@ Computing Homozygosity Enriched Regions In genomes to Prioritize Identification
|
|
11
11
|
is a Ruby tool to pick causative mutations from bulk segregant sequencing.
|
12
12
|
|
13
13
|
Currently this gem is still in development and nearing complete working package.
|
14
|
+
The software currently works only with pileup input files; support for BAM and VCF files will be implemented in the future.
|
14
15
|
|
15
16
|
|
16
17
|
## Installation
|
@@ -20,7 +21,7 @@ Binaries are available for Linux 64bit and OSX.
|
|
20
21
|
The best way to use Cheripic is to download the appropriate binary archive,
|
21
22
|
unpack (`tar -xzf`) and add the unpacked directory to your `PATH`
|
22
23
|
|
23
|
-
Latest binaries are available to [download here](https://github.com/shyamrallapalli/cheripic/releases/
|
24
|
+
Latest binaries are available to [download here](https://github.com/shyamrallapalli/cheripic/releases/latest)
|
24
25
|
|
25
26
|
|
26
27
|
To install gem and use the gem in your development
|
@@ -44,7 +45,7 @@ Running `cheripic` without any input at command line interface shows following h
|
|
44
45
|
|
45
46
|
```
|
46
47
|
|
47
|
-
Cheripic v1.
|
48
|
+
Cheripic v1.2.0
|
48
49
|
Authors: Shyam Rallapalli and Dan MacLean
|
49
50
|
|
50
51
|
Description: Candidate mutation and closely linked marker selection for non reference genomes
|
@@ -59,30 +60,31 @@ USAGE:
|
|
59
60
|
cheripic <options>
|
60
61
|
|
61
62
|
OPTIONS:
|
62
|
-
-f, --assembly=<s>
|
63
|
-
-F, --input-format=<s>
|
64
|
-
-a, --mut-bulk=<s>
|
65
|
-
-b, --bg-bulk=<s>
|
66
|
-
--output=<s>
|
67
|
-
--loglevel=<s>
|
68
|
-
--hmes-adjust=<f>
|
69
|
-
--htlow=<f>
|
70
|
-
--hthigh=<f>
|
71
|
-
--mindepth=<i>
|
72
|
-
--min-non-ref-count=<i>
|
73
|
-
--min-indel-count-support=<i>
|
74
|
-
--
|
75
|
-
-q, --mapping-quality=<i>
|
76
|
-
-Q, --base-quality=<i>
|
77
|
-
--noise=<f>
|
78
|
-
--cross-type=<s>
|
79
|
-
--
|
80
|
-
--
|
81
|
-
--polyploidy
|
82
|
-
-p, --mut-parent=<s>
|
83
|
-
-r, --bg-parent=<s>
|
84
|
-
--bfr-adjust=<f>
|
85
|
-
--
|
63
|
+
-f, --assembly=<s> Assembly file in FASTA format
|
64
|
+
-F, --input-format=<s> bulk and parent alignment file format types - set either pileup or bam (default: pileup)
|
65
|
+
-a, --mut-bulk=<s> Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1
|
66
|
+
-b, --bg-bulk=<s> Pileup or sorted BAM file alignments from background/wildtype bulk 2
|
67
|
+
--output=<s> Directory to store results, will be created if not existing (default: cheripic_results)
|
68
|
+
--loglevel=<s> Choose any one of "info / warn / debug" level for logs generated (default: debug)
|
69
|
+
--hmes-adjust=<f> factor added to snp count of each contig to adjust for hme score calculations (default: 0.5)
|
70
|
+
--htlow=<f> lower level for categorizing heterozygosity (default: 0.2)
|
71
|
+
--hthigh=<f> high level for categorizing heterozygosity (default: 0.9)
|
72
|
+
--mindepth=<i> minimum read depth to conisder a position for variant calls (default: 6)
|
73
|
+
--min-non-ref-count=<i> minimum read depth supporting non reference base at each position (default: 3)
|
74
|
+
--min-indel-count-support=<i> minimum read depth supporting an indel at each position (default: 3)
|
75
|
+
--ambiguous-ref-bases including variant at completely ambiguous bases in the reference
|
76
|
+
-q, --mapping-quality=<i> minimum mapping quality of read covering the position (default: 20)
|
77
|
+
-Q, --base-quality=<i> minimum base quality of bases covering the position (default: 15)
|
78
|
+
--noise=<f> praportion of reads for a variant to conisder as noise (default: 0.1)
|
79
|
+
--cross-type=<s> type of cross used to generated mapping population - back or out (default: back)
|
80
|
+
--use-all-contigs option to select all contigs or only contigs containing variants for analysis
|
81
|
+
--include-low-hmes option to include or discard variants from contigs with low hme-score or bfr score to list in the final output
|
82
|
+
--polyploidy Set if the data input is from polyploids
|
83
|
+
-p, --mut-parent=<s> Pileup or sorted BAM file alignments from mutant/trait of interest parent (default: )
|
84
|
+
-r, --bg-parent=<s> Pileup or sorted BAM file alignments from background/wildtype parent (default: )
|
85
|
+
--bfr-adjust=<f> factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)
|
86
|
+
--sel-seq-len=<i> sequence length to print from either side of selected variants (default: 50)
|
87
|
+
--examples shows some example commands with explanation
|
86
88
|
|
87
89
|
```
|
88
90
|
|
@@ -98,7 +100,7 @@ EXAMPLE COMMANDS:
|
|
98
100
|
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true --output cheripic_results
|
99
101
|
3. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
100
102
|
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true
|
101
|
-
--
|
103
|
+
--use-all-contigs true --include-low-hmes true --output cheripic_results
|
102
104
|
|
103
105
|
```
|
104
106
|
|
data/Rakefile
CHANGED
@@ -23,7 +23,7 @@ TRAVELING_RUBY_VERSION = '20150210-2.1.5'
|
|
23
23
|
# http://d6r77u77i8pq3.cloudfront.net/releases/traveling-ruby-20150210-2.1.5-osx.tar.gz
|
24
24
|
|
25
25
|
desc 'Package your app'
|
26
|
-
task :package =>
|
26
|
+
task :package => %w(package:linux:x86_64 package:osx)
|
27
27
|
|
28
28
|
namespace :package do
|
29
29
|
|
@@ -71,8 +71,12 @@ def create_package(target)
|
|
71
71
|
sh "cp packaging/cheripic.gemspec Gemfile Gemfile.lock LICENSE.txt #{package_dir}/lib/app/"
|
72
72
|
sh "mkdir #{package_dir}/lib/app/.bundle"
|
73
73
|
sh "cp packaging/bundler-config #{package_dir}/lib/app/.bundle/config"
|
74
|
-
|
75
|
-
|
76
|
-
|
77
|
-
|
74
|
+
if target == 'linux-x86_64'
|
75
|
+
sh "cp -p packaging/linux-x86_64_samtools/external/* packaging/cheripic-#{VERSION}-linux-x86_64/lib/app/ruby/2.1.0/gems/bio-samtools-2.4.0/lib/bio/db/sam/external/"
|
76
|
+
end
|
77
|
+
unless ENV['DIR_ONLY']
|
78
|
+
Dir.chdir('packaging') do
|
79
|
+
sh "gtar -czf #{package_dest}.tar.gz #{package_dest}"
|
80
|
+
end
|
81
|
+
end
|
78
82
|
end
|
data/bin/cheripic
CHANGED
data/cheripic.gemspec
CHANGED
@@ -23,8 +23,6 @@ Gem::Specification.new do |spec|
|
|
23
23
|
spec.add_runtime_dependency 'trollop', '~> 2.1', '>= 2.1.2'
|
24
24
|
spec.add_runtime_dependency 'bio', '~> 1.5', '>= 1.5.0'
|
25
25
|
spec.add_dependency 'bio-samtools', '~> 2.4.0'
|
26
|
-
spec.add_dependency 'bio-gngm', '~> 0.2.1'
|
27
|
-
spec.add_runtime_dependency 'rinruby', '~> 2.0', '>= 2.0.3'
|
28
26
|
|
29
27
|
spec.add_development_dependency 'activesupport', '~> 4.2.6'
|
30
28
|
spec.add_development_dependency 'bundler', '~> 1.7.6'
|
data/galaxy_cheripic_tool.xml
ADDED
@@ -0,0 +1,196 @@
|
|
1
|
+
<tool id="cheripic" name="CHERIPIC" version="1.2.0">
|
2
|
+
|
3
|
+
<description>CHERIPIC</description>
|
4
|
+
|
5
|
+
<version_command>cheripic -v</version_command>
|
6
|
+
|
7
|
+
<command>
|
8
|
+
<![CDATA[
|
9
|
+
cheripic
|
10
|
+
--assembly $assembly
|
11
|
+
--mut-bulk $mut_bulk
|
12
|
+
--bg-bulk $bg_bulk
|
13
|
+
--output $output
|
14
|
+
--loglevel $loglevel
|
15
|
+
--hmes-adjust $hmes_adjust
|
16
|
+
--htlow $ht_low
|
17
|
+
--hthigh $ht_high
|
18
|
+
--mindepth $min_depth
|
19
|
+
--min-non-ref-count $min_non_ref_count
|
20
|
+
--min-indel-count-support $min_indel_count_support
|
21
|
+
--ambiguous-ref-bases $ambiguous_ref_bases
|
22
|
+
--mapping-quality $mapping_quality
|
23
|
+
--base-quality $base_quality
|
24
|
+
--noise $noise
|
25
|
+
--cross-type $cross_type
|
26
|
+
--use-all-contigs $use_all_contigs
|
27
|
+
--include-low-hmes $include_low_hmes
|
28
|
+
--polyploidy $polyploidy
|
29
|
+
--mut-parent $mut_parent
|
30
|
+
--bg-parent $bg_parent
|
31
|
+
--bfr-adjust $bfr_adjust
|
32
|
+
--sel-seq-len $sel_seq_len
|
33
|
+
]]>
|
34
|
+
</command>
|
35
|
+
|
36
|
+
<inputs>
|
37
|
+
<param name="assembly" type="data" format="fasta" label="Input Assembly file" help="Select Assembly fasta file" />
|
38
|
+
<param name="mut_bulk" type="data" format="pileup" label="mutant bulk pileup file" help="Select mutant bulk pileup file" />
|
39
|
+
<param name="bg_bulk" type="data" format="pileup" label="background bulk pileup file" min="1" multiple="true" help="Select background bulk pileup file" />
|
40
|
+
<param name="loglevel" type="select" optional="true" label="analysis log level" help="choose between info, warn and debug levels">
|
41
|
+
<option value="info" selected="true">info </option>
|
42
|
+
<option value="warn">warnings</option>
|
43
|
+
<option value="debug">debug</option>
|
44
|
+
</param>
|
45
|
+
<param name="hmes_adjust" size="4" type="float" optional="true" value="0.5" min="0.01" max="1.0"
|
46
|
+
label="hme score adjuster" help="factor added to snp count of each contig to adjust for hme score calculations" />
|
47
|
+
<param name="ht_low" size="4" type="float" optional="true" value="0.25" min="0.1" max="1.0"
|
48
|
+
label="heterozygosity low limit" help="lower limit to heterozygosity allele fraction" />
|
49
|
+
<param name="ht_high" size="4" type="float" optional="true" value="0.75" min="0.1" max="1.0"
|
50
|
+
label="heterozygosity high limit" help="upper limit to heterozygosity allele fraction" />
|
51
|
+
<param name="min_depth" size="4" type="integer" optional="true" value="6" min="1" max="8000"
|
52
|
+
label="minimum read coverage" help="minimum read depth to conisder a position for variant calls" />
|
53
|
+
<param name="min_non_ref_count" size="4" type="integer" optional="true" value="3" min="1" max="8000"
|
54
|
+
label="minimum alternate read coverage" help="minimum read depth supporting non reference base at each position" />
|
55
|
+
<param name="min_indel_count_support" size="4" type="integer" optional="true" value="3" min="1" max="8000"
|
56
|
+
label="minimum indel read coverage" help="minimum read depth supporting an indel at each position" />
|
57
|
+
<param name="ambiguous_ref_bases" type="boolean" optional="true" checked="false" label="ambiguous reference position"
|
58
|
+
help="including variant at completely ambiguous bases in the reference" truevalue="true" falsevalue="false" />
|
59
|
+
<param name="mapping_quality" size="4" type="integer" optional="true" value="20" min="0" max="255"
|
60
|
+
label="minimum mapping quality" help="minimum mapping quality of read covering the position" />
|
61
|
+
<param name="base_quality" size="4" type="integer" optional="true" value="15" min="0" max="40"
|
62
|
+
label="minimum base quality" help="minimum base quality of nucleotides covering the position" />
|
63
|
+
<param name="noise" size="4" type="float" optional="true" value="0.1" min="0" max="0.2"
|
64
|
+
label="read noise" help="proportion of reads supporting a variant, below which are consider as noise" />
|
65
|
+
<param name="cross_type" type="select" optional="true" label="cross type" help="type of cross used to generated mapping population - back or out" >
|
66
|
+
<option value="back" selected="true">back cross</option>
|
67
|
+
<option value="out">out cross</option>
|
68
|
+
</param>
|
69
|
+
|
70
|
+
<param name="use_all_contigs" type="boolean" optional="true" checked="false" label="use all contigs in analysis"
|
71
|
+
help="option to select all contigs or only contigs containing variants for analysis" truevalue="true" falsevalue="false" />
|
72
|
+
<param name="include_low_hmes" type="boolean" optional="true" checked="false" label="no hme or bfr score cut off"
|
73
|
+
help="option to include or discard variants from contigs with low hme-score or bfr score to list in the final output" truevalue="true" falsevalue="false" />
|
74
|
+
<param name="polyploidy" type="boolean" optional="true" checked="false" label="polyploid data"
|
75
|
+
help="Set if the input data is from polyploids" truevalue="true" falsevalue="false" />
|
76
|
+
<param name="mut-parent" type="data" optional="true" format="pileup" label="mutant parent pileup file" help="Select mutant parent pileup file" />
|
77
|
+
<param name="bg-parent" type="data" optional="true" format="pileup" label="background parent pileup file" help="Select background parent pileup file" />
|
78
|
+
|
79
|
+
<param name="bfr_adjust" size="4" type="float" optional="true" value="0.05" min="0.01" max="1.0"
|
80
|
+
label="bfr score adjuster" help="factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)" />
|
81
|
+
<param name="sel_seq_len" size="4" type="integer" optional="true" value="50" min="10" max="250"
|
82
|
+
label="selected variant seq length out" help="sequence length to print from either side of selected variants (default: 50)" />
|
83
|
+
|
84
|
+
<param name="output" type="text" size="30" value="cheripic_results" label="tag for output filename" help="write a tag to include with output filename" />
|
85
|
+
</inputs>
|
86
|
+
|
87
|
+
<outputs>
|
88
|
+
<data name="output_1" format="txt" file="${output}_selected_hme_variants.txt" />
|
89
|
+
<data name="output_2" format="txt" file="${output}_selected_bfr_variants.txt" />
|
90
|
+
</outputs>
|
91
|
+
|
92
|
+
<tests>
|
93
|
+
<test>
|
94
|
+
<param name="assembly" value="picked_fasta.fa" ftype="fasta" />
|
95
|
+
<param name="mut_bulk" value="mut_bulk.pileup" ftype="pileup" />
|
96
|
+
<param name="bg_bulk" value="wt_bulk.pileup" ftype="pileup" />
|
97
|
+
<output name="output" ftype="txt" file="selected_variants.out" />
|
98
|
+
</test>
|
99
|
+
</tests>
|
100
|
+
|
101
|
+
<help>
|
102
|
+
|
103
|
+
**Computing Homozygosity Enriched Regions In genomes to Prioritize Identification of Candidate variants (CHERIPIC)**
|
104
|
+
|
105
|
+
CHERIPIC is a ruby tool to pick causative mutation from bulk segregant sequencing
|
106
|
+
|
107
|
+
------
|
108
|
+
|
109
|
+
**What it does**
|
110
|
+
|
111
|
+
This tool uses ``cheripic`` to analyse bulk segregant sequencing data and identify the causative mutation
|
112
|
+
|
113
|
+
|
114
|
+
.. class:: infomark
|
115
|
+
|
116
|
+
Provides a list of SNPs that could be either closely linked markers or the causative mutation.
|
117
|
+
|
118
|
+
------
|
119
|
+
|
120
|
+
**Input formats**
|
121
|
+
|
122
|
+
assembly file should be a fasta file used for generating pileups from bulks
|
123
|
+
bulk alignment files should be pileup files
|
124
|
+
|
125
|
+
------
|
126
|
+
|
127
|
+
**Outputs**
|
128
|
+
|
129
|
+
The output is a text file, and has the following columns::
|
130
|
+
|
131
|
+
Column Description
|
132
|
+
----------------- --------------------------------------------------------
|
133
|
+
1 HME_Score Homozygosity Enrichment score
|
134
|
+
2 AlleleFreq Allele frequency
|
135
|
+
3 seq_id Contig/Scaffold id
|
136
|
+
4 position 1-based index of the position in contig
|
137
|
+
5 ref_base Reference nucleotide at the position
|
138
|
+
6 coverage read depth
|
139
|
+
7 bases read bases
|
140
|
+
8 base_quals read base qualities
|
141
|
+
9 sequence_left selected size of reference sequence on the left variant
|
142
|
+
10 Alt_seq Alternate allele at the position
|
143
|
+
11 sequence_right selected size of reference sequence on the right variant
|
144
|
+
|
145
|
+
------
|
146
|
+
|
147
|
+
**cheripic settings**
|
148
|
+
|
149
|
+
All of the options have a default value. You can change any of them. All of the options are implemented.
|
150
|
+
|
151
|
+
------
|
152
|
+
|
153
|
+
**cheripic parameter list**
|
154
|
+
|
155
|
+
OPTIONS:
|
156
|
+
-f, --assembly Assembly file in FASTA format
|
157
|
+
-F, --input-format bulk and parent alignment file format types - set either pileup or bam (default: pileup)
|
158
|
+
-a, --mut-bulk Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1
|
159
|
+
-b, --bg-bulk Pileup or sorted BAM file alignments from background/wildtype bulk 2
|
160
|
+
--output Directory to store results, will be created if not existing (default: cheripic_results)
|
161
|
+
--loglevel Choose any one of "info / warn / debug" level for logs generated (default: debug)
|
162
|
+
--hmes-adjust factor added to snp count of each contig to adjust for hme score calculations (default: 0.5)
|
163
|
+
--htlow lower level for categorizing heterozygosity (default: 0.2)
|
164
|
+
--hthigh high level for categorizing heterozygosity (default: 0.9)
|
165
|
+
--mindepth minimum read depth to conisder a position for variant calls (default: 6)
|
166
|
+
--min-non-ref-count minimum read depth supporting non reference base at each position (default: 3)
|
167
|
+
--min-indel-count-support minimum read depth supporting an indel at each position (default: 3)
|
168
|
+
--ambiguous-ref-bases including variant at completely ambiguous bases in the reference
|
169
|
+
-q, --mapping-quality minimum mapping quality of read covering the position (default: 20)
|
170
|
+
-Q, --base-quality minimum base quality of bases covering the position (default: 15)
|
171
|
+
--noise praportion of reads for a variant to conisder as noise (default: 0.1)
|
172
|
+
--cross-type type of cross used to generated mapping population - back or out (default: back)
|
173
|
+
--use-all-contigs option to select all contigs or only contigs containing variants for analysis
|
174
|
+
--include-low-hmes option to include or discard variants from contigs with low hme-score or bfr score to list in the final output
|
175
|
+
--polyploidy Set if the data input is from polyploids
|
176
|
+
-p, --mut-parent Pileup or sorted BAM file alignments from mutant/trait of interest parent (default: )
|
177
|
+
-r, --bg-parent Pileup or sorted BAM file alignments from background/wildtype parent (default: )
|
178
|
+
--bfr-adjust factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)
|
179
|
+
--sel-seq-len sequence length to print from either side of selected variants (default: 50)
|
180
|
+
|
181
|
+
------
|
182
|
+
|
183
|
+
.. class:: infomark
|
184
|
+
|
185
|
+
**Tool Author**
|
186
|
+
|
187
|
+
Shyam Rallapalli
|
188
|
+
|
189
|
+
|
190
|
+
</help>
|
191
|
+
|
192
|
+
<citations>
|
193
|
+
<citation type="doi">10.1093/bioinformatics/btg1080</citation>
|
194
|
+
</citations>
|
195
|
+
|
196
|
+
</tool>
|
data/lib/cheripic.rb
CHANGED
data/lib/cheripic/cmd.rb
CHANGED
@@ -52,10 +52,14 @@ module Cheripic
|
|
52
52
|
opt :mut_bulk, 'Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1',
|
53
53
|
:short => '-a',
|
54
54
|
:type => String
|
55
|
+
opt :mut_bulk_vcf, 'vcf file for variants from mutant/trait of interest bulk 1',
|
56
|
+
:type => String
|
55
57
|
opt :bg_bulk, 'Pileup or sorted BAM file alignments from background/wildtype bulk 2',
|
56
58
|
:short => '-b',
|
57
59
|
:type => String
|
58
|
-
opt :
|
60
|
+
opt :bg_bulk_vcf, 'vcf file for variants from background/wildtype bulk 2',
|
61
|
+
:type => String
|
62
|
+
opt :output, 'custom name tag to include in the output file name',
|
59
63
|
:default => 'cheripic_results'
|
60
64
|
opt :loglevel, 'Choose any one of "info / warn / debug" level for logs generated',
|
61
65
|
:default => 'debug'
|
@@ -68,9 +72,17 @@ module Cheripic
|
|
68
72
|
opt :hthigh, 'high level for categorizing heterozygosity',
|
69
73
|
:type => Float,
|
70
74
|
:default => 0.9
|
71
|
-
opt :mindepth, 'minimum read depth
|
75
|
+
opt :mindepth, 'minimum read depth at a position to consider for variant calls',
|
72
76
|
:type => Integer,
|
73
77
|
:default => 6
|
78
|
+
opt :max_d_multiple, "multiplication factor for average coverage to calculate maximum read coverage
|
79
|
+
if set zero no calculation will be made from bam file.\nsetting this value will override user set max depth",
|
80
|
+
:type => Integer,
|
81
|
+
:default => 5
|
82
|
+
opt :maxdepth, "maximum read depth at a position to consider for variant calls
|
83
|
+
if set to zero no user max depth will be used",
|
84
|
+
:type => Integer,
|
85
|
+
:default => 0
|
74
86
|
opt :min_non_ref_count, 'minimum read depth supporting non reference base at each position',
|
75
87
|
:type => Integer,
|
76
88
|
:default => 3
|
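The help text for `--max-d-multiple` and `--maxdepth` describes how the two options interact: a non-zero multiple of the average coverage takes precedence, otherwise a non-zero user value is used, and zero for both means no cap. A minimal sketch of that precedence follows. The helper name `effective_max_depth`, its keyword arguments and the rounding with `ceil` are assumptions made for illustration; the gem's own implementation is not shown in this hunk.

```ruby
# Hypothetical helper (not part of the gem): resolve the read-depth cap
# from the two options described in the help text.
# Returns nil when no upper limit applies.
def effective_max_depth(mean_depth, max_d_multiple: 5, maxdepth: 0)
  if max_d_multiple > 0 && mean_depth
    # a multiple of the average coverage overrides the user-set max depth
    (mean_depth * max_d_multiple).ceil
  elsif maxdepth > 0
    # fall back to the user-set maximum depth
    maxdepth
  end
end

p effective_max_depth(30.0)
p effective_max_depth(30.0, max_d_multiple: 0, maxdepth: 200)
p effective_max_depth(30.0, max_d_multiple: 0, maxdepth: 0)
```

Returning `nil` for "no cap" keeps the caller's check simple (`depth <= cap unless cap.nil?`), but that convention is a design choice of this sketch only.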
@@ -97,7 +109,8 @@ module Cheripic
|
|
97
109
|
opt :use_all_contigs, 'option to select all contigs or only contigs containing variants for analysis',
|
98
110
|
:type => FalseClass,
|
99
111
|
:default => false
|
100
|
-
opt :include_low_hmes, 'option to include or discard variants from contigs with
|
112
|
+
opt :include_low_hmes, 'option to include or discard variants from contigs with
|
113
|
+
low hme-score or bfr score to list in the final output',
|
101
114
|
:type => FalseClass,
|
102
115
|
:default => false
|
103
116
|
opt :polyploidy, 'Set if the data input is from polyploids',
|
@@ -111,6 +124,10 @@ module Cheripic
|
|
111
124
|
:short => '-r',
|
112
125
|
:type => String,
|
113
126
|
:default => ''
|
127
|
+
opt :repeats_file, 'repeat masker output file for the assembly ',
|
128
|
+
:short => '-R',
|
129
|
+
:type => String,
|
130
|
+
:default => ''
|
114
131
|
opt :bfr_adjust, 'factor added to hemi snp frequency of each parent to adjust for bfr calculations',
|
115
132
|
:type => Float,
|
116
133
|
:default => 0.05
|
@@ -133,8 +150,9 @@ module Cheripic
|
|
133
150
|
|
134
151
|
Inputs:
|
135
152
|
1. Needs a reference fasta file of asssembly use for variant analysis
|
136
|
-
2. Pileup files for mutant (phenotype of interest) bulks and background (wildtype phenotype) bulks
|
137
|
-
3. If
|
153
|
+
2. Pileup/Bam files for mutant (phenotype of interest) bulks and background (wildtype phenotype) bulks
|
154
|
+
3. If providing bam files, you have to include vcf files for the respective bulks
|
155
|
+
4. If polyploid species, include pileup/bam files from one or both parents
|
138
156
|
|
139
157
|
USAGE:
|
140
158
|
cheripic <options>
|
@@ -149,15 +167,19 @@ module Cheripic
|
|
149
167
|
def print_examples
|
150
168
|
msg = <<-EOS
|
151
169
|
|
152
|
-
|
170
|
+
Cheripic v#{Cheripic::VERSION.dup}
|
171
|
+
Authors: Shyam Rallapalli and Dan MacLean
|
172
|
+
|
173
|
+
EXAMPLE COMMANDS:
|
174
|
+
1. cheripic -f assembly.fa -a mutbulk.pileup -b bgbulk.pileup --output=cheripic_output
|
175
|
+
2. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
176
|
+
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true --output cheripic_results
|
177
|
+
3. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
178
|
+
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true
|
179
|
+
--no-only-frag-with-vars --no-filter-out-low-hmes --output cheripic_results
|
180
|
+
4. cheripic -h or cheripic --help
|
181
|
+
5. cheripic -v or cheripic --version
|
153
182
|
|
154
|
-
EXAMPLE COMMANDS:
|
155
|
-
1. cheripic -f assembly.fa -a mutbulk.pileup -b bgbulk.pileup --output=cheripic_output
|
156
|
-
2. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
157
|
-
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true --output cheripic_results
|
158
|
-
3. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
159
|
-
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true
|
160
|
-
--no-only-frag-with-vars --no-filter-out-low-hmes --output cheripic_results
|
161
183
|
EOS
|
162
184
|
puts msg.split("\n").map{ |line| line.lstrip }.join("\n")
|
163
185
|
exit(0)
|
@@ -165,44 +187,66 @@ module Cheripic
|
|
165
187
|
|
166
188
|
# calls other methods to check if command line inputs are valid
|
167
189
|
def check_arguments
|
168
|
-
|
190
|
+
check_output
|
169
191
|
check_log_level
|
170
|
-
|
192
|
+
check_input_types
|
171
193
|
end
|
172
194
|
|
173
|
-
#
|
174
|
-
|
175
|
-
|
176
|
-
|
177
|
-
|
178
|
-
|
179
|
-
|
180
|
-
|
181
|
-
|
195
|
+
# checks input files based on bulk file type
|
196
|
+
def check_input_types
|
197
|
+
inputfiles = {}
|
198
|
+
inputfiles[:required] = %i{assembly mut_bulk}
|
199
|
+
inputfiles[:optional] = %i{bg_bulk}
|
200
|
+
if @options[:input_format] == 'bam'
|
201
|
+
inputfiles[:required] << %i{mut_bulk_vcf}
|
202
|
+
inputfiles[:optional] << %i{bg_bulk_vcf}
|
203
|
+
end
|
182
204
|
if @options[:polyploidy]
|
183
|
-
inputfiles = %i{
|
184
|
-
else
|
185
|
-
inputfiles = %i{assembly mut_bulk bg_bulk}
|
205
|
+
inputfiles[:either] = %i{mut_parent bg_parent}
|
186
206
|
end
|
187
|
-
inputfiles
|
188
|
-
|
189
|
-
|
190
|
-
|
191
|
-
|
192
|
-
|
207
|
+
check_input_files(inputfiles)
|
208
|
+
end
|
209
|
+
|
210
|
+
# checks if input files are valid
|
211
|
+
def check_input_files(inputfiles)
|
212
|
+
check = 0
|
213
|
+
inputfiles.each_key do | type |
|
214
|
+
inputfiles[type].flatten!
|
215
|
+
inputfiles[type].each do | symbol |
|
216
|
+
if @options[symbol]
|
217
|
+
file = @options[symbol]
|
218
|
+
@options[symbol] = File.expand_path(file)
|
219
|
+
next if type == :optional
|
220
|
+
if type == :required and not File.exist?(file)
|
221
|
+
raise CheripicIOError.new "#{symbol} file, #{file} does not exist: "
|
222
|
+
elsif type == :either and File.exist?(file)
|
223
|
+
check = 1
|
224
|
+
end
|
225
|
+
elsif type == :required
|
226
|
+
raise CheripicArgError.new "Options #{inputfiles}, all must be specified. " +
|
227
|
+
'Try --help for further help.'
|
193
228
|
end
|
194
|
-
|
195
|
-
|
196
|
-
|
229
|
+
end
|
230
|
+
if type == :either and check == 0
|
231
|
+
raise CheripicArgError.new "One of the options #{inputfiles}, must be specified. " +
|
232
|
+
'Try --help for further help.'
|
197
233
|
end
|
198
234
|
end
|
199
235
|
end
|
200
236
|
|
201
|
-
# checks if output
|
202
|
-
def
|
203
|
-
if
|
204
|
-
raise CheripicArgError.new
|
205
|
-
'
|
237
|
+
# checks if files with output tag name already exists
|
238
|
+
def check_output
|
239
|
+
if (@options[:output].split('') & %w{# / : * ? ' < > | & $ ,}).any?
|
240
|
+
raise CheripicArgError.new 'please choose a name tag that contains ' +
|
241
|
+
'alphanumeric characters, hyphen(-) and underscore(_) only'
|
242
|
+
end
|
243
|
+
@options[:hmes_frags] = "#{@options[:output]}_selected_hme_variants.txt"
|
244
|
+
@options[:bfr_frags] = "#{@options[:output]}_selected_bfr_variants.txt"
|
245
|
+
[@options[:hmes_frags], @options[:bfr_frags]].each do | file |
|
246
|
+
if File.exist?(file)
|
247
|
+
raise CheripicArgError.new "'#{file}' file exists " +
|
248
|
+
'please choose a different name tag to be included in the output file name'
|
249
|
+
end
|
206
250
|
end
|
207
251
|
end
|
208
252
|
|
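The required / optional / either grouping introduced in `check_input_types` and `check_input_files` can be sketched in isolation. This is a simplified illustration, not the gem's code: `classify_inputs` is a hypothetical name, it collects messages instead of raising `CheripicIOError` or `CheripicArgError`, and `exist` is an injected predicate standing in for `File.exist?` so the logic can be exercised without real files.

```ruby
# Hypothetical helper (not part of the gem): report problems with input
# options grouped as :required or :either, mirroring the idea in the diff.
def classify_inputs(options, groups, exist: File.method(:exist?))
  problems = []

  # every required option must be given and must point at an existing file
  Array(groups[:required]).flatten.each do |key|
    path = options[key]
    if path.nil? || path.empty?
      problems << "#{key} must be specified"
    elsif !exist.call(path)
      problems << "#{key} file, #{path} does not exist"
    end
  end

  # at least one option of the :either group must point at an existing file
  either = Array(groups[:either]).flatten
  unless either.empty?
    found = either.any? { |key| options[key] && exist.call(options[key]) }
    problems << "one of #{either.join(', ')} must be specified" unless found
  end

  problems
end

groups  = { required: %i[assembly mut_bulk], either: %i[mut_parent bg_parent] }
opts    = { assembly: 'assembly.fa', mut_bulk: 'mutbulk.pileup', mut_parent: '' }
on_disk = ->(path) { %w[assembly.fa mutbulk.pileup].include?(path) }
p classify_inputs(opts, groups, exist: on_disk)
```

Optional inputs are left out of the sketch because the diff skips them after expanding their paths.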
@@ -220,7 +264,8 @@ module Cheripic
|
|
220
264
|
# A hash of trollop option names as keys and user or default
|
221
265
|
# setting as values is passed to Implementer object
|
222
266
|
def run
|
223
|
-
@options[:
|
267
|
+
@options[:hmes_frags] = File.expand_path @options[:hmes_frags]
|
268
|
+
@options[:bfr_frags] = File.expand_path @options[:bfr_frags]
|
224
269
|
analysis = Implementer.new(@options)
|
225
270
|
analysis.run
|
226
271
|
end
|
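In this file, `check_output` rejects a fixed list of characters, while its error message asks for alphanumeric characters, hyphen and underscore only. A stricter allow-list check that matches the message could look like the sketch below. The method names `valid_output_tag?` and `output_files` are hypothetical; only the two file-name suffixes are taken from the diff.

```ruby
# Hypothetical allow-list check (not part of the gem):
# accept letters, digits, hyphen and underscore only.
def valid_output_tag?(tag)
  tag.is_a?(String) && !(tag =~ /\A[A-Za-z0-9_-]+\z/).nil?
end

# Output file names derived from the tag, using the suffixes shown in the diff.
def output_files(tag)
  %w[hme bfr].map { |kind| "#{tag}_selected_#{kind}_variants.txt" }
end

%w[cheripic_results run-1 bad/tag].each do |tag|
  if valid_output_tag?(tag)
    puts "#{tag}: #{output_files(tag).join(', ')}"
  else
    puts "#{tag}: rejected"
  end
end
```

An allow-list also rejects characters the deny-list does not cover, such as spaces or backslashes.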
data/lib/cheripic/contig.rb
CHANGED
@@ -22,7 +22,7 @@ module Cheripic
|
|
22
22
|
# @return [Integer] length of contig in bases
|
23
23
|
class Contig
|
24
24
|
|
25
|
-
attr_accessor :hm_pos, :ht_pos, :hemi_pos
|
25
|
+
attr_accessor :hm_pos, :ht_pos, :hemi_pos, :mean_depth, :sd_depth
|
26
26
|
attr_reader :id, :length
|
27
27
|
|
28
28
|
# creates a Contig object using fasta entry
|
@@ -33,6 +33,8 @@ module Cheripic
|
|
33
33
|
@hm_pos = {}
|
34
34
|
@ht_pos = {}
|
35
35
|
@hemi_pos = {}
|
36
|
+
@mean_depth = nil
|
37
|
+
@sd_depth = nil
|
36
38
|
end
|
37
39
|
|
38
40
|
# Number of homozygous variants identified in the contig
|
data/lib/cheripic/contig_pileups.rb
CHANGED
@@ -32,7 +32,7 @@ module Cheripic
|
|
32
32
|
def_delegators :@mut_parent, :each, :each_key, :each_value, :length, :[], :store
|
33
33
|
def_delegators :@bg_parent, :each, :each_key, :each_value, :length, :[], :store
|
34
34
|
attr_accessor :id, :parent_hemi
|
35
|
-
attr_accessor :mut_bulk, :bg_bulk, :mut_parent, :bg_parent
|
35
|
+
attr_accessor :mut_bulk, :bg_bulk, :mut_parent, :bg_parent, :masked_regions
|
36
36
|
|
37
37
|
# creates a ContigPileup object using fasta entry id
|
38
38
|
# @param fasta [String] a contig id from fasta entry
|
@@ -43,16 +43,27 @@ module Cheripic
|
|
43
43
|
@mut_parent = {}
|
44
44
|
@bg_parent = {}
|
45
45
|
@parent_hemi = {}
|
46
|
+
@masked_regions = Hash.new { |h,k| h[k] = {} }
|
47
|
+
@hm_pos = {}
|
48
|
+
@ht_pos = {}
|
49
|
+
@hemi_pos = {}
|
46
50
|
end
|
47
51
|
|
48
52
|
# bulk pileups are compared and variant positions are selected
|
49
53
|
# @return [Array<Hash>] variant positions are stored in hashes
|
50
54
|
# for homozygous, heterozygous and hemi-variant positions
|
51
55
|
def bulks_compared
|
52
|
-
@hm_pos = {}
|
53
|
-
@ht_pos = {}
|
54
|
-
@hemi_pos = {}
|
55
56
|
@mut_bulk.each_key do | pos |
|
57
|
+
ignore = 0
|
58
|
+
unless @masked_regions.empty?
|
59
|
+
@masked_regions.each_key do | index |
|
60
|
+
if pos.between?(@masked_regions[index][:begin], @masked_regions[index][:end])
|
61
|
+
ignore = 1
|
62
|
+
logger.info "variant is in the masked region\t#{@mut_bulk[pos].to_s}"
|
63
|
+
end
|
64
|
+
end
|
65
|
+
end
|
66
|
+
next if ignore == 1
|
56
67
|
if Options.polyploidy and @parent_hemi.key?(pos)
|
57
68
|
bg_bases = ''
|
58
69
|
if @bg_bulk.key?(pos)
|
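The masked-region test added to `bulks_compared` can be expressed as a standalone predicate. The sketch below assumes the hash shape used in the diff (`index => { begin:, end: }`); the name `in_masked_region?` is hypothetical, and unlike the diff it stops at the first matching region and does not log.

```ruby
# Hypothetical predicate (not part of the gem): true when pos falls inside
# any masked interval, with both interval ends inclusive.
# masked_regions has the shape { index => { begin: Integer, end: Integer } }.
def in_masked_region?(masked_regions, pos)
  masked_regions.any? do |_index, region|
    pos.between?(region[:begin], region[:end])
  end
end

regions   = { 0 => { begin: 10, end: 20 }, 1 => { begin: 50, end: 60 } }
positions = [5, 10, 20, 35, 55]
kept = positions.reject { |pos| in_masked_region?(regions, pos) }
p kept
```

For contigs with many masked intervals, a sorted interval list with binary search would avoid scanning every region per position; the linear scan here follows the approach in the diff.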
@@ -74,27 +85,37 @@ module Cheripic
|
|
74
85
|
# @param pos [Integer] position in the contig
|
75
86
|
# stores variant type, position and allele fraction to either @hm_pos or @ht_pos hashes
|
76
87
|
def compare_pileup(pos)
|
77
|
-
|
78
|
-
|
79
|
-
return nil if base_hash.empty?
|
80
|
-
# we could ignore complex loci or
|
81
|
-
# take the variant type based on predominant base
|
82
|
-
if base_hash.length > 1
|
83
|
-
fraction = base_hash.values.max
|
84
|
-
mut_type = var_mode(fraction)
|
85
|
-
else
|
86
|
-
fraction = base_hash[base_hash.keys[0]]
|
87
|
-
mut_type = var_mode(fraction)
|
88
|
-
end
|
88
|
+
mut_type, fraction = var_mode_fraction(@mut_bulk[pos])
|
89
|
+
return nil if mut_type.nil?
|
89
90
|
if @bg_bulk.key?(pos)
|
90
|
-
bg_type =
|
91
|
+
bg_type = var_mode_fraction(@bg_bulk[pos])[0]
|
91
92
|
mut_type = compare_var_type(mut_type, bg_type)
|
92
93
|
end
|
93
|
-
unless mut_type
|
94
|
+
unless mut_type.nil?
|
94
95
|
categorise_pos(mut_type, pos, fraction)
|
95
96
|
end
|
96
97
|
end
|
97
98
|
|
99
|
+
|
100
|
+
# Method to extract var_mode and allele fraction from pileup information at a position in contig
|
101
|
+
#
|
102
|
+
# @param pileup_info [Pileup] pileup object
|
103
|
+
# @return [Symbol] variant mode from pileup position (:hom or :het) at the position
|
104
|
+
# @return [Float] allele fraction at the position
|
105
|
+
def var_mode_fraction(pileup_info)
|
106
|
+
base_frac_hash = pileup_info.var_base_frac
|
107
|
+
base_frac_hash.delete(:ref)
|
108
|
+
return [nil, nil] if base_frac_hash.empty?
|
109
|
+
# we could ignore complex loci or
|
110
|
+
# take the variant type based on predominant base
|
111
|
+
if base_frac_hash.length > 1
|
112
|
+
fraction = base_frac_hash.values.max
|
113
|
+
else
|
114
|
+
fraction = base_frac_hash[base_frac_hash.keys[0]]
|
115
|
+
end
|
116
|
+
[var_mode(fraction), fraction]
|
117
|
+
end
|
118
|
+
|
98
119
|
# Categorizes variant zygosity based on the allele fraction provided.
|
99
120
|
# Uses lower and upper limit set for heterozygosity in the options.
|
100
121
|
# @note consider increasing the range of heterozygosity limits for RNA-seq data
|
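The new `var_mode_fraction` method merges the logic that was previously split between `compare_pileup` and `bg_bulk_var`. The sketch below illustrates the idea on plain hashes. It rests on assumptions: the body of `var_mode` is not shown in this diff, so the thresholds (0.2 and 0.9, the `htlow` and `hthigh` defaults in options.rb) and the het/hom rule are taken from `Vcf.get_vars` elsewhere in this release, and the input hash stands in for whatever `Pileup#var_base_frac` returns.

```ruby
# Assumed thresholds, matching the htlow / hthigh defaults in options.rb
HT_LOW = 0.2
HT_HIGH = 0.9

# Assumed classification rule (the gem's var_mode body is not shown here)
def var_mode(fraction)
  if fraction.between?(HT_LOW, HT_HIGH)
    :het
  elsif fraction > HT_HIGH
    :hom
  end
end

# base_fractions: hash of base (or :ref) => fraction of reads
def var_mode_fraction(base_fractions)
  # reject returns a new hash, so the caller's hash is left untouched
  non_ref = base_fractions.reject { |base, _frac| base == :ref }
  return [nil, nil] if non_ref.empty?

  # the predominant non-reference base decides the variant type
  fraction = non_ref.values.max
  [var_mode(fraction), fraction]
end

hom_call = var_mode_fraction({ ref: 0.02, T: 0.98 })
het_call = var_mode_fraction({ ref: 0.5, A: 0.3, G: 0.2 })
ref_only = var_mode_fraction({ ref: 1.0 })
```

Two design points differ from the diff and are choices of this sketch: `values.max` already covers the single-allele case, so no `if`/`else` is needed, and `reject` avoids the in-place `delete(:ref)`.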
@@ -125,23 +146,6 @@ module Cheripic
|
|
125
146
|
end
|
126
147
|
end
|
127
148
|
|
128
|
-
# Method to extract var_mode from pileup information at a position in contig
|
129
|
-
#
|
130
|
-
# @param pos [Integer] position in the contig
|
131
|
-
# @return [Symbol] variant mode of the background bulk (:hom or :het) at the position
|
132
|
-
def bg_bulk_var(pos)
|
133
|
-
bg_base_hash = @bg_bulk[pos].var_base_frac
|
134
|
-
bg_base_hash.delete(:ref)
|
135
|
-
return nil if bg_base_hash.empty?
|
136
|
-
if bg_base_hash.length > 1
|
137
|
-
# taking only var mode
|
138
|
-
var_mode(bg_base_hash.values.max)
|
139
|
-
else
|
140
|
-
# taking only var mode
|
141
|
-
var_mode(bg_base_hash[bg_base_hash.keys[0]])
|
142
|
-
end
|
143
|
-
end
|
144
|
-
|
145
149
|
# method stores pos as key and allele fraction as value
|
146
150
|
# to @hm_pos or @ht_pos hash based on variant type
|
147
151
|
# @param var_type [Symbol] values are either :hom or :het
|
@@ -156,18 +160,18 @@ module Cheripic
|
|
156
160
|
end
|
157
161
|
|
158
162
|
# Compares parental pileups for the contig and identify position
|
159
|
-
# that indicate variants from
|
163
|
+
# that indicate variants from homeologues called hemi-snps
|
160
164
|
# and calculates bulk frequency ratio (bfr)
|
161
165
|
# @return [Hash] parent_hemi hash with position as key and bfr as value
|
162
166
|
def hemisnps_in_parent
|
163
167
|
# mark all the hemi snp based on both parents
|
164
|
-
|
168
|
+
@mut_parent.each_key do |pos|
|
165
169
|
mut_parent_frac = @mut_parent[pos].var_base_frac
|
166
|
-
if
|
170
|
+
if @bg_parent.key?(pos)
|
167
171
|
bg_parent_frac = @bg_parent[pos].var_base_frac
|
168
172
|
bfr = Bfr.get_bfr(mut_parent_frac, bg_parent_frac)
|
169
173
|
@parent_hemi[pos] = bfr
|
170
|
-
|
174
|
+
@bg_parent.delete(pos)
|
171
175
|
else
|
172
176
|
bfr = Bfr.get_bfr(mut_parent_frac)
|
173
177
|
@parent_hemi[pos] = bfr
|
@@ -175,7 +179,7 @@ module Cheripic
|
|
175
179
|
end
|
176
180
|
|
177
181
|
# now include all hemi snp unique to background parent
|
178
|
-
|
182
|
+
@bg_parent.each_key do |pos|
|
179
183
|
unless @parent_hemi.key?(pos)
|
180
184
|
bg_parent_frac = @bg_parent[pos].var_base_frac
|
181
185
|
bfr = Bfr.get_bfr(bg_parent_frac)
|
data/lib/cheripic/implementer.rb
CHANGED
@@ -25,15 +25,21 @@ module Cheripic
|
|
25
25
|
input_format
|
26
26
|
mut_bulk
|
27
27
|
bg_bulk
|
28
|
-
|
28
|
+
mut_bulk_vcf
|
29
|
+
bg_bulk_vcf
|
30
|
+
hmes_frags
|
31
|
+
bfr_frags
|
29
32
|
mut_parent
|
30
|
-
bg_parent
|
33
|
+
bg_parent
|
34
|
+
repeats_file}
|
31
35
|
@options = OpenStruct.new(inputs.select { |k| set1.include?(k) })
|
32
36
|
|
33
37
|
set2 = %i{hmes_adjust
|
34
38
|
htlow
|
35
39
|
hthigh
|
36
40
|
mindepth
|
41
|
+
maxdepth
|
42
|
+
max_d_multiple
|
37
43
|
min_non_ref_count
|
38
44
|
min_indel_count_support
|
39
45
|
ambiguous_ref_bases
|
@@ -44,10 +50,10 @@ module Cheripic
|
|
44
50
|
use_all_contigs
|
45
51
|
include_low_hmes
|
46
52
|
polyploidy
|
47
|
-
bfr_adjust
|
53
|
+
bfr_adjust
|
54
|
+
sel_seq_len}
|
48
55
|
settings = inputs.select { |k| set2.include?(k) }
|
49
56
|
Options.update(settings)
|
50
|
-
FileUtils.mkdir_p @options.output
|
51
57
|
@vars_extracted = false
|
52
58
|
@has_run = false
|
53
59
|
end
|
@@ -62,15 +68,21 @@ module Cheripic
|
|
62
68
|
|
63
69
|
# Extracted variants from bulk comparison are re-analysed
|
64
70
|
# and selected variants are written to a file
|
65
|
-
def process_variants
|
66
|
-
|
71
|
+
def process_variants(pos_type)
|
72
|
+
if pos_type == :hmes_frags
|
73
|
+
@variants.verify_bg_bulk_pileup
|
74
|
+
end
|
67
75
|
# print selected variants that could be potential markers or mutation
|
68
|
-
out_file = File.open(
|
69
|
-
out_file.puts "
|
76
|
+
out_file = File.open(@options[pos_type], 'w')
|
77
|
+
out_file.puts "Score\tAlleleFreq\tseq_id\tposition\tref_base\tcoverage\tbases\tbase_quals\tsequence_left\tAlt_seq\tsequence_right"
|
70
78
|
regions = Regions.new(@options.assembly)
|
71
|
-
@variants.
|
79
|
+
@variants.send(pos_type).each_key do | frag |
|
72
80
|
contig_obj = @variants.assembly[frag]
|
73
|
-
|
81
|
+
if pos_type == :hmes_frags
|
82
|
+
positions = contig_obj.hm_pos.keys
|
83
|
+
else
|
84
|
+
positions = contig_obj.hemi_pos.keys
|
85
|
+
end
|
74
86
|
positions.each do | pos |
|
75
87
|
pileup = @variants.pileups[frag].mut_bulk[pos]
|
76
88
|
seqs = regions.fetch_seq(frag,pos)
|
@@ -87,11 +99,9 @@ module Cheripic
|
|
87
99
|
unless @vars_extracted
|
88
100
|
self.extract_vars
|
89
101
|
end
|
102
|
+
self.process_variants(:hmes_frags)
|
90
103
|
if Options.polyploidy
|
91
|
-
self.process_variants
|
92
|
-
@variants.bfr_frags
|
93
|
-
else
|
94
|
-
self.process_variants
|
104
|
+
self.process_variants(:bfr_frags)
|
95
105
|
end
|
96
106
|
@has_run = true
|
97
107
|
end
|
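`process_variants` now takes a `pos_type` symbol that selects both the output file (`@options[pos_type]`) and the set of positions to report. A minimal sketch of that dispatch follows, with a `Struct` standing in for the gem's contig object. The file names and the `positions_for` helper are hypothetical.

```ruby
require 'ostruct'

# Hypothetical stand-in for a contig object holding the two position hashes
Contig = Struct.new(:hm_pos, :hemi_pos)

# Picks the positions to report for a given output type,
# following the if/else added in the diff
def positions_for(contig, pos_type)
  pos_type == :hmes_frags ? contig.hm_pos.keys : contig.hemi_pos.keys
end

# OpenStruct supports lookup by symbol, which is what @options[pos_type] relies on
options = OpenStruct.new(hmes_frags: 'hmes_variants.txt', bfr_frags: 'bfr_variants.txt')
contig = Contig.new({ 12 => 1.0, 40 => 0.95 }, { 7 => 3.2 })

report = %i[hmes_frags bfr_frags].map do |pos_type|
  [options[pos_type], positions_for(contig, pos_type)]
end
```

In the diff the `:bfr_frags` pass only runs when `Options.polyploidy` is set. The sketch runs both passes simply to show the mapping.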
data/lib/cheripic/options.rb
CHANGED
@@ -12,6 +12,8 @@ module Cheripic
|
|
12
12
|
:htlow => 0.2,
|
13
13
|
:hthigh => 0.9,
|
14
14
|
:mindepth => 6,
|
15
|
+
:maxdepth => 0,
|
16
|
+
:max_d_multiple => 5,
|
15
17
|
:min_non_ref_count => 3,
|
16
18
|
:min_indel_count_support => 3,
|
17
19
|
:ambiguous_ref_bases => false,
|
@@ -53,6 +55,26 @@ module Cheripic
|
|
53
55
|
@user_settings[:mindepth]
|
54
56
|
end
|
55
57
|
|
58
|
+
# Maximum read coverage at the variant position to be considered for analysis
|
59
|
+
# @return [Integer]
|
60
|
+
def self.maxdepth
|
61
|
+
@user_settings[:maxdepth]
|
62
|
+
end
|
63
|
+
|
64
|
+
# Setting maximum read coverage at the variant position to be considered for analysis
|
65
|
+
# @param value [Integer] provided integer value will be updated as maxdepth
|
66
|
+
# @return [Integer] updated maxdepth value
|
67
|
+
def self.maxdepth=(value)
|
68
|
+
@user_settings[:maxdepth] = value
|
69
|
+
end
|
70
|
+
|
71
|
+
# Multiplication factor to average coverage to calculate maximum read coverage
|
72
|
+
# at the variant position to be considered for analysis
|
73
|
+
# @return [Integer]
|
74
|
+
def self.max_d_multiple
|
75
|
+
@user_settings[:max_d_multiple]
|
76
|
+
end
|
77
|
+
|
56
78
|
# Minimum non reference count at the variant position to be considered for analysis
|
57
79
|
# @return [Integer]
|
58
80
|
def self.min_non_ref_count
|
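The accessors added here follow the existing pattern of reading from and writing to a module-level settings hash. The sketch below shows that pattern under assumptions: `SketchOptions` is a hypothetical module, only three keys are modelled, and `update` is assumed to merge user values over the defaults (the gem's `Options.update` body is not part of this diff).

```ruby
module SketchOptions
  # Defaults taken from the options.rb hunk above
  DEFAULTS = { mindepth: 6, maxdepth: 0, max_d_multiple: 5 }.freeze
  @user_settings = DEFAULTS.dup

  class << self
    # Assumed behaviour: user values override defaults
    def update(settings)
      @user_settings = DEFAULTS.merge(settings)
    end

    def maxdepth
      @user_settings[:maxdepth]
    end

    def maxdepth=(value)
      @user_settings[:maxdepth] = value
    end

    def max_d_multiple
      @user_settings[:max_d_multiple]
    end
  end
end

SketchOptions.update(max_d_multiple: 3)
mean_coverage = 12.5 # illustrative value

# Same arithmetic as set_max_depth in variants.rb
SketchOptions.maxdepth = SketchOptions.max_d_multiple * mean_coverage
derived = SketchOptions.maxdepth
```

Note that a value derived this way is a Float, while the doc comments in the diff describe `maxdepth` as an Integer. Rounding may be worth considering if an integer is required downstream.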
data/lib/cheripic/variants.rb
CHANGED
@@ -4,6 +4,36 @@ require 'forwardable'
|
|
4
4
|
|
5
5
|
module Cheripic
|
6
6
|
|
7
|
+
require 'bio-samtools'
|
8
|
+
require 'bio/db/sam'
|
9
|
+
require 'open3'
|
10
|
+
|
11
|
+
# An extension of Bio::DB::Sam object to modify depth method
|
12
|
+
class Bio::DB::Sam
|
13
|
+
|
14
|
+
# A method to retrieve depth information from bam object
|
15
|
+
# @param opts [Hash] a hash of following input options
|
16
|
+
# b [File] list of positions or regions in BED format
|
17
|
+
# l [INT] minQLen
|
18
|
+
# q [INT] base quality threshold
|
19
|
+
# Q [INT] mapping quality threshold
|
20
|
+
# r [chr:from-to] region
|
21
|
+
# @return [String] samtools depth output, each line reporting sequence_name, position and depth
|
22
|
+
def depth(opts={})
|
23
|
+
command = form_opt_string(self.samtools, 'depth', opts)
|
24
|
+
# capture returns string output, so careful not to give whole genome or big contigs for depth analysis
|
25
|
+
stdout, stderr, status = Open3.capture3(command)
|
26
|
+
unless status.success?
|
27
|
+
logger.error "resulted in exit code #{status.exitstatus} using #{command}"
|
28
|
+
logger.error "stderr output is: #{stderr}"
|
29
|
+
raise CheripicError
|
30
|
+
end
|
31
|
+
# return stdout
|
32
|
+
stdout
|
33
|
+
end
|
34
|
+
|
35
|
+
end
|
36
|
+
|
7
37
|
# Custom error handling for Variants class
|
8
38
|
class VariantsError < CheripicError; end
|
9
39
|
|
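The `depth` extension above shells out with `Open3.capture3` and raises when the command fails. A minimal sketch of that pattern follows. `run_command` and `CommandError` are hypothetical names, and the child process is the current Ruby interpreter rather than samtools, so the example runs without external tools.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical error class for this sketch
class CommandError < StandardError; end

# Runs a command given as an argument list and returns its stdout.
# Raises CommandError with the exit code and stderr on failure.
def run_command(*command)
  stdout, stderr, status = Open3.capture3(*command)
  unless status.success?
    raise CommandError,
          "exit code #{status.exitstatus} from #{command.join(' ')}: #{stderr.strip}"
  end
  stdout
end

# The child prints one tab-separated line shaped like samtools depth output
output = run_command(RbConfig.ruby, '-e', 'puts "chr1\t10\t25"')
fields = output.strip.split("\t")
```

Passing the command as separate arguments bypasses the shell, which sidesteps quoting problems with file paths. The diff builds a single command string instead. As the comment in the diff warns, `capture3` buffers the whole output in memory, so very large regions are best avoided.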
@@ -27,10 +57,10 @@ module Cheripic
|
|
27
57
|
include Enumerable
|
28
58
|
extend Forwardable
|
29
59
|
def_delegators :@assembly, :each, :each_key, :each_value, :size, :length, :[]
|
30
|
-
attr_reader :assembly, :pileups, :
|
60
|
+
attr_reader :assembly, :pileups, :pileups_analyzed
|
31
61
|
|
32
62
|
# creates a Variants object using user input files
|
33
|
-
# @param options [
|
63
|
+
# @param options [OpenStruct] a hash of required input files as keys and file paths as values
|
34
64
|
def initialize(options)
|
35
65
|
@params = options
|
36
66
|
@assembly = {}
|
@@ -50,25 +80,76 @@ module Cheripic
|
|
50
80
|
@pileups[contig.id] = ContigPileups.new(contig.id)
|
51
81
|
end
|
52
82
|
@pileups_analyzed = false
|
83
|
+
unless @params.repeats_file == ''
|
84
|
+
store_repeat_regions
|
85
|
+
end
|
86
|
+
end
|
87
|
+
|
88
|
+
# reads repeat masker output file and stores masked regions to ignore variants in those regions
|
89
|
+
def store_repeat_regions
|
90
|
+
File.foreach(@params.repeats_file) do |line|
|
91
|
+
line.strip!
|
92
|
+
next if line =~ /^SW/ or line =~ /^score/ or line == ''
|
93
|
+
info = line.split("\s")
|
94
|
+
pileups_obj = @pileups[info[4]]
|
95
|
+
index = pileups_obj.masked_regions.length
|
96
|
+
pileups_obj.masked_regions[index + 1][:begin] = info[5].to_i
|
97
|
+
pileups_obj.masked_regions[index + 1][:end] = info[6].to_i
|
98
|
+
end
|
53
99
|
end
|
54
100
|
|
55
101
|
# Reads and store pileup data for each of input bulk and parents pileup files
|
56
102
|
# And sets pileups_analyzed to true that pileups files are processed
|
57
103
|
def analyse_pileups
|
58
|
-
|
59
|
-
|
60
|
-
|
61
|
-
|
104
|
+
if @params.input_format == 'bam'
|
105
|
+
@vcf_hash = Vcf.filtering(@params.mut_bulk_vcf, @params.bg_bulk_vcf)
|
106
|
+
end
|
62
107
|
%i{mut_bulk bg_bulk mut_parent bg_parent}.each do | input |
|
63
108
|
infile = @params[input]
|
64
109
|
if infile != ''
|
65
|
-
|
110
|
+
logger.info "processing #{input} file"
|
111
|
+
if @params.input_format == 'pileup'
|
112
|
+
extract_pileup(infile, input)
|
113
|
+
else
|
114
|
+
extract_bam_pileup(infile, input)
|
115
|
+
end
|
66
116
|
end
|
67
117
|
end
|
68
118
|
|
69
119
|
@pileups_analyzed = true
|
70
120
|
end
|
71
121
|
|
122
|
+
# Bam object is read and each contig mean and std deviation of depth calculated
|
123
|
+
# @param bamobject [Bio::DB::Sam]
|
124
|
+
# Open3 capture returns string output, so careful not to give whole genome or big contigs for depth analysis
|
125
|
+
def set_max_depth(bamobject, bamfile)
|
126
|
+
logger.info "processing #{bamfile} file for depth"
|
127
|
+
all_depths = []
|
128
|
+
bq = Options.base_quality
|
129
|
+
mq = Options.mapping_quality
|
130
|
+
@assembly.each_key do | id |
|
131
|
+
contig_obj = @assembly[id]
|
132
|
+
len = contig_obj.length
|
133
|
+
data = bamobject.depth(:r => "#{id}", :Q => bq, :q => mq)
|
134
|
+
depths = []
|
135
|
+
data.split("\n").each do |line|
|
136
|
+
info = line.split("\t")
|
137
|
+
depths << info[2].to_i
|
138
|
+
end
|
139
|
+
variance = 0
|
140
|
+
mean_depth = depths.reduce(0, :+) / len.to_f
|
141
|
+
depths.each do |value|
|
142
|
+
variance += (value.to_f - mean_depth)**2
|
143
|
+
end
|
144
|
+
all_depths << mean_depth
|
145
|
+
contig_obj.sd_depth = Math.sqrt(variance)
|
146
|
+
contig_obj.mean_depth = mean_depth
|
147
|
+
end
|
148
|
+
# setting max depth as max_d_multiple times the average depth
|
149
|
+
mean_coverage = all_depths.reduce(0, :+) / @assembly.length.to_f
|
150
|
+
Options.maxdepth = Options.max_d_multiple * mean_coverage
|
151
|
+
end
|
152
|
+
|
72
153
|
# Input pileup file is read and positions are selected that pass the thresholds
|
73
154
|
# @param pileupfile [String] path to the pileup file to read
|
74
155
|
# @param sym [Symbol] Symbol of the pileup file used to write selected variants
|
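`set_max_depth` derives a per-contig mean depth, a spread measure, and a global depth cap. The sketch below works through the same arithmetic on plain arrays. It makes one assumption about intent: the diff takes the square root of the summed squared deviations without dividing by the number of positions, whereas this sketch divides by the contig length to give a population standard deviation. It also counts positions absent from the depth output as zero depth, consistent with the mean being divided by contig length. `depth_stats` is a hypothetical helper, and `Array#sum` needs Ruby 2.4 or later.

```ruby
# depths: depth values reported for a contig (positions may be missing)
# contig_length: total number of positions in the contig
def depth_stats(depths, contig_length)
  mean = depths.sum / contig_length.to_f
  # positions not reported are treated as zero depth
  missing = contig_length - depths.length
  sum_sq = depths.sum { |d| (d - mean)**2 } + missing * mean**2
  sd = Math.sqrt(sum_sq / contig_length)
  [mean, sd]
end

mean_a, sd_a = depth_stats([2, 4, 4, 4, 5, 5, 7, 9], 8)
mean_b, sd_b = depth_stats([10, 20], 4)

# Global cap: multiple of the average of the per-contig means
max_d_multiple = 5
contig_means = [mean_a, mean_b]
max_depth = max_d_multiple * (contig_means.sum / contig_means.length)
```

One detail may be worth double-checking against the diff: the doc comment on `depth` lists `q` as the base quality threshold and `Q` as the mapping quality threshold, while the call in `set_max_depth` passes `:Q => bq, :q => mq`.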
@@ -84,6 +165,54 @@ module Cheripic
|
|
84
165
|
end
|
85
166
|
end
|
86
167
|
|
168
|
+
# Input bamfile is read and selected positions pileups are stored
|
169
|
+
# @param bamfile [String] path to the bam file to read
|
170
|
+
# @param sym [Symbol] Symbol of the bam file used to write selected variants
|
171
|
+
# pileup information to respective ContigPileups object
|
172
|
+
def extract_bam_pileup(bamfile, sym)
|
173
|
+
bq = Options.base_quality
|
174
|
+
mq = Options.mapping_quality
|
175
|
+
bamobject = Bio::DB::Sam.new(:bam=>bamfile, :fasta=>@params.assembly)
|
176
|
+
bamobject.index unless bamobject.indexed?
|
177
|
+
|
178
|
+
# check if user has set max depth or set to zero to ignore
|
179
|
+
max_d = Options.maxdepth
|
180
|
+
# or calculate from bamfile
|
181
|
+
if Options.max_d_multiple > 0
|
182
|
+
set_max_depth(bamobject, bamfile)
|
183
|
+
max_d = Options.maxdepth
|
184
|
+
logger.info "max depth used for #{sym} file\t#{max_d}"
|
185
|
+
end
|
186
|
+
|
187
|
+
@vcf_hash.each_key do | id |
|
188
|
+
positions = @vcf_hash[id][:het].keys
|
189
|
+
positions << @vcf_hash[id][:hom].keys
|
190
|
+
positions.flatten!
|
191
|
+
next if positions.empty?
|
192
|
+
contig_obj = @pileups[id]
|
193
|
+
positions.each do | pos |
|
194
|
+
command = "#{bamobject.samtools} mpileup -r #{id}:#{pos}-#{pos} -Q #{bq} -q #{mq} -B -f #{@params.assembly} #{bamfile}"
|
195
|
+
stdout, stderr, status = Open3.capture3(command)
|
196
|
+
unless status.success?
|
197
|
+
logger.error "resulted in exit code #{status.exitstatus} using #{command}"
|
198
|
+
logger.error "stderr output is: #{stderr}"
|
199
|
+
raise CheripicError
|
200
|
+
end
|
201
|
+
stdout.chomp!
|
202
|
+
if stdout == '' or stdout.split("\t")[3].to_i == 0 or stdout =~ /^\t0/
|
203
|
+
logger.info "pileup data empty for\t#{id}\t#{pos}"
|
204
|
+
else
|
205
|
+
pileup = Pileup.new(stdout)
|
206
|
+
unless max_d == 0 or pileup.coverage <= max_d
|
207
|
+
logger.info "pileup coverage is higher than max\t#{pileup.to_s}"
|
208
|
+
next
|
209
|
+
end
|
210
|
+
contig_obj.send(sym).store(pos, pileup)
|
211
|
+
end
|
212
|
+
end
|
213
|
+
end
|
214
|
+
end
|
215
|
+
|
87
216
|
# Once pileup files are analysed and variants are extracted from each bulk;
|
88
217
|
# bulks are compared to identify and isolate variants for downstream analysis.
|
89
218
|
# If polyploidy is set to true and mut_parent and bg_parent bulks are provided
|
@@ -95,8 +224,10 @@ module Cheripic
|
|
95
224
|
@assembly.each_key do | id |
|
96
225
|
contig = @assembly[id]
|
97
226
|
# extract parental hemi snps for polyploids before bulks are compared
|
98
|
-
if
|
99
|
-
@
|
227
|
+
if Options.polyploidy
|
228
|
+
if @params.mut_parent != '' or @params.bg_parent != ''
|
229
|
+
@pileups[id].hemisnps_in_parent
|
230
|
+
end
|
100
231
|
end
|
101
232
|
contig.hm_pos, contig.ht_pos, contig.hemi_pos = @pileups[id].bulks_compared
|
102
233
|
end
|
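`store_repeat_regions` in the variants.rb diff above reads a RepeatMasker-style table, skips the two header lines, and takes the sequence id, begin and end from whitespace-separated columns 5 to 7. A minimal sketch of that parsing follows. `parse_repeat_lines` is a hypothetical helper, the sample rows are made-up illustrative data, and regions are collected in arrays rather than the index-keyed hashes the gem uses.

```ruby
# Returns { sequence_id => [{ begin: Integer, end: Integer }, ...] }
def parse_repeat_lines(lines)
  regions = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    line = line.strip
    # header lines start with "SW" or "score" once stripped
    next if line.empty? || line.start_with?('SW', 'score')

    fields = line.split # splits on any run of whitespace
    regions[fields[4]] << { begin: fields[5].to_i, end: fields[6].to_i }
  end
  regions
end

sample = [
  '   SW   perc perc perc  query     position in query    matching  repeat',
  'score   div. del. ins.  sequence  begin end   (left)   repeat    class/family',
  '',
  '  463    1.3  0.6  1.7  contig_1    100  200    (800) +  (TTAGGG)n Simple_repeat 1 101 (0) 1',
  '  250    5.0  0.0  0.0  contig_2     10   60    (940) C  L1        LINE/L1 (5) 300 250 2'
]

masked = parse_repeat_lines(sample)
```

In the diff, the sequence id is used directly to look up `@pileups`. If a repeats file names a sequence that is absent from the assembly, that lookup would return `nil`, so a guard may be worth adding there.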
data/lib/cheripic/vcf.rb
ADDED
@@ -0,0 +1,83 @@
|
|
1
|
+
# encoding: utf-8
|
2
|
+
|
3
|
+
module Cheripic
|
4
|
+
|
5
|
+
# Custom error handling for Vcf class
|
6
|
+
class VcfError < CheripicError; end
|
7
|
+
|
8
|
+
require 'bio-samtools'
|
9
|
+
|
10
|
+
class Vcf
|
11
|
+
|
12
|
+
def self.get_allele_freq(vcf_obj)
|
13
|
+
# check if the vcf is from samtools (has DP4 and AF1 fields in INFO)
|
14
|
+
if vcf_obj.info.key?('DP4')
|
15
|
+
freq = vcf_obj.info['DP4'].split(',')
|
16
|
+
depth = freq.inject { | sum, n | sum.to_f + n.to_f }
|
17
|
+
alt = freq[2].to_f + freq[3].to_f
|
18
|
+
allele_freq = alt / depth
|
19
|
+
# allele_freq = vcf_obj.non_ref_allele_freq
|
20
|
+
# check if the vcf is from VarScan (has RD, AD and FREQ fields in FORMAT)
|
21
|
+
elsif vcf_obj.samples['1'].key?('RD')
|
22
|
+
alt = vcf_obj.samples['1']['AD'].to_f
|
23
|
+
depth = vcf_obj.samples['1']['RD'].to_f + alt
|
24
|
+
allele_freq = alt / depth
|
25
|
+
# check if the vcf is from GATK (has AD and GT fields in FORMAT)
|
26
|
+
elsif vcf_obj.samples['1'].key?('AD') and vcf_obj.samples['1']['AD'].include?(',')
|
27
|
+
freq = vcf_obj.samples['1']['AD'].split(',')
|
28
|
+
allele_freq = freq[1].to_f / ( freq[0].to_f + freq[1].to_f )
|
29
|
+
# check if the vcf has AF field in INFO
|
30
|
+
elsif vcf_obj.info.key?('AF')
|
31
|
+
allele_freq = vcf_obj.info['AF'].to_f
|
32
|
+
else
|
33
|
+
raise VcfError.new 'not a supported vcf format (VarScan, GATK, Bcftools(Samtools), Vcf 4.0, 4.1 and 4.2)' +
|
34
|
+
" and check that it is one sample vcf\n"
|
35
|
+
end
|
36
|
+
allele_freq
|
37
|
+
end
|
38
|
+
|
39
|
+
|
40
|
+
##Input: vcf file
|
41
|
+
##Output: lists of hm and ht SNPS and hash of all fragments with variants
|
42
|
+
def self.get_vars(vcf_file)
|
43
|
+
ht_low = Options.htlow
|
44
|
+
ht_high = Options.hthigh
|
45
|
+
|
46
|
+
# hash of :het and :hom with frag ids and respective variant positions
|
47
|
+
var_pos = Hash.new{ |h,k| h[k] = Hash.new(&h.default_proc) }
|
48
|
+
File.foreach(vcf_file) do |line|
|
49
|
+
next if line =~ /^#/
|
50
|
+
v = Bio::DB::Vcf.new(line)
|
51
|
+
unless v.alt == '.'
|
52
|
+
allele_freq = get_allele_freq(v)
|
53
|
+
if allele_freq.between?(ht_low, ht_high)
|
54
|
+
var_pos[v.chrom][:het][v.pos] = allele_freq
|
55
|
+
elsif allele_freq > ht_high
|
56
|
+
var_pos[v.chrom][:hom][v.pos] = allele_freq
|
57
|
+
end
|
58
|
+
end
|
59
|
+
end
|
60
|
+
var_pos
|
61
|
+
end
|
62
|
+
|
63
|
+
def self.filtering(mutant_vcf, bgbulk_vcf)
|
64
|
+
var_pos_mut = get_vars(mutant_vcf)
|
65
|
+
return var_pos_mut if bgbulk_vcf == ''
|
66
|
+
var_pos_bg = get_vars(bgbulk_vcf)
|
67
|
+
|
68
|
+
# if both bulks have homozygous mutations at same positions then deleting them
|
69
|
+
var_pos_mut.each_key do | frag |
|
70
|
+
positions = var_pos_mut[frag][:hom].keys
|
71
|
+
pos_bg_bulk = var_pos_bg[frag][:hom].keys
|
72
|
+
positions.each do |pos|
|
73
|
+
if pos_bg_bulk.include?(pos)
|
74
|
+
var_pos_mut[frag][:hom].delete(pos)
|
75
|
+
end
|
76
|
+
end
|
77
|
+
end
|
78
|
+
var_pos_mut
|
79
|
+
end
|
80
|
+
|
81
|
+
end
|
82
|
+
|
83
|
+
end
|
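The samtools branch of `Vcf.get_allele_freq` and the classification in `Vcf.get_vars` can be worked through on plain values. In the sketch below, `dp4_allele_freq`, `zygosity` and `drop_shared_hom` are hypothetical helper names. The thresholds default to the `htlow` and `hthigh` values from options.rb, and the zero-depth guard is an addition of this sketch, not something the diff contains.

```ruby
# DP4 holds four comma-separated read counts:
# ref-forward, ref-reverse, alt-forward, alt-reverse
def dp4_allele_freq(dp4)
  counts = dp4.split(',').map(&:to_f)
  depth = counts.sum
  return nil if depth.zero? # guard added in this sketch

  (counts[2] + counts[3]) / depth
end

# Same rule as get_vars: within limits is :het, above the upper limit is :hom
def zygosity(freq, ht_low = 0.2, ht_high = 0.9)
  if freq.between?(ht_low, ht_high)
    :het
  elsif freq > ht_high
    :hom
  end
end

# Mirrors Vcf.filtering: drop homozygous positions shared with the background bulk
def drop_shared_hom(mut_hom, bg_hom)
  mut_hom.reject { |pos, _freq| bg_hom.key?(pos) }
end

het_freq = dp4_allele_freq('2,1,10,7')   # 17 alt reads out of 20
hom_freq = dp4_allele_freq('0,0,12,8')   # all reads support the alt allele
low_freq = dp4_allele_freq('18,20,1,1')  # 2 alt reads out of 40

unique_hom = drop_shared_hom({ 101 => 1.0, 250 => 0.95 }, { 250 => 0.97 })
```

Positions whose allele frequency falls below the lower limit get no category, so they are left out of both the `:het` and `:hom` sets, as in `get_vars`.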
data/lib/cheripic/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: cheripic
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.2.
|
4
|
+
version: 1.2.5
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Shyam Rallapalli
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2016-
|
11
|
+
date: 2016-10-17 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: yell
|
@@ -84,40 +84,6 @@ dependencies:
|
|
84
84
|
- - "~>"
|
85
85
|
- !ruby/object:Gem::Version
|
86
86
|
version: 2.4.0
|
87
|
-
- !ruby/object:Gem::Dependency
|
88
|
-
name: bio-gngm
|
89
|
-
requirement: !ruby/object:Gem::Requirement
|
90
|
-
requirements:
|
91
|
-
- - "~>"
|
92
|
-
- !ruby/object:Gem::Version
|
93
|
-
version: 0.2.1
|
94
|
-
type: :runtime
|
95
|
-
prerelease: false
|
96
|
-
version_requirements: !ruby/object:Gem::Requirement
|
97
|
-
requirements:
|
98
|
-
- - "~>"
|
99
|
-
- !ruby/object:Gem::Version
|
100
|
-
version: 0.2.1
|
101
|
-
- !ruby/object:Gem::Dependency
|
102
|
-
name: rinruby
|
103
|
-
requirement: !ruby/object:Gem::Requirement
|
104
|
-
requirements:
|
105
|
-
- - "~>"
|
106
|
-
- !ruby/object:Gem::Version
|
107
|
-
version: '2.0'
|
108
|
-
- - ">="
|
109
|
-
- !ruby/object:Gem::Version
|
110
|
-
version: 2.0.3
|
111
|
-
type: :runtime
|
112
|
-
prerelease: false
|
113
|
-
version_requirements: !ruby/object:Gem::Requirement
|
114
|
-
requirements:
|
115
|
-
- - "~>"
|
116
|
-
- !ruby/object:Gem::Version
|
117
|
-
version: '2.0'
|
118
|
-
- - ">="
|
119
|
-
- !ruby/object:Gem::Version
|
120
|
-
version: 2.0.3
|
121
87
|
- !ruby/object:Gem::Dependency
|
122
88
|
name: activesupport
|
123
89
|
requirement: !ruby/object:Gem::Requirement
|
@@ -259,6 +225,7 @@ files:
|
|
259
225
|
- ".gitignore"
|
260
226
|
- ".travis.yml"
|
261
227
|
- CODE_OF_CONDUCT.md
|
228
|
+
- ChangeLog.md
|
262
229
|
- Gemfile
|
263
230
|
- LICENSE.txt
|
264
231
|
- README.md
|
@@ -267,6 +234,7 @@ files:
|
|
267
234
|
- bin/console
|
268
235
|
- bin/setup
|
269
236
|
- cheripic.gemspec
|
237
|
+
- galaxy_cheripic_tool.xml
|
270
238
|
- lib/cheripic.rb
|
271
239
|
- lib/cheripic/bfr.rb
|
272
240
|
- lib/cheripic/cmd.rb
|
@@ -277,6 +245,7 @@ files:
|
|
277
245
|
- lib/cheripic/pileup.rb
|
278
246
|
- lib/cheripic/regions.rb
|
279
247
|
- lib/cheripic/variants.rb
|
248
|
+
- lib/cheripic/vcf.rb
|
280
249
|
- lib/cheripic/version.rb
|
281
250
|
homepage: https://github.com/shyamrallapalli/cheripic
|
282
251
|
licenses:
|