cheripic 1.2.0 → 1.2.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/ChangeLog.md +21 -0
- data/Gemfile +0 -1
- data/README.md +29 -27
- data/Rakefile +9 -5
- data/bin/cheripic +1 -0
- data/cheripic.gemspec +0 -2
- data/galaxy_cheripic_tool.xml +196 -0
- data/lib/cheripic.rb +1 -0
- data/lib/cheripic/cmd.rb +87 -42
- data/lib/cheripic/contig.rb +3 -1
- data/lib/cheripic/contig_pileups.rb +44 -40
- data/lib/cheripic/implementer.rb +24 -14
- data/lib/cheripic/options.rb +22 -0
- data/lib/cheripic/variants.rb +140 -9
- data/lib/cheripic/vcf.rb +83 -0
- data/lib/cheripic/version.rb +1 -1
- metadata +5 -36
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA1:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 824a8c68d3707ad02cf0d3b7d567191244d1a5a6
|
|
4
|
+
data.tar.gz: 6d2b3c7bef04aba06b5206968d1a0d69996a25b0
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 6ae5a85c30a0b1ea19f118409ddec95a6c7c3e11f00663e9769a5642770e90cb2ab5b0200f9d9eaa4ed8c6873492ac7f5f3acd568dc0cc14ddf8ccaac5012435
|
|
7
|
+
data.tar.gz: efe77b2ccafd0ad7ed4eeb47b497207cacf3dbaee058779b46af7a5ca34597991a3949dbeb0f3807bba69ea7dca0dd7e24fb58c040bf0065fef2ca1e4e3424fc
|
data/ChangeLog.md
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
### Change Log
|
|
2
|
+
|
|
3
|
+
All significant changes to this project at each release are documented in this file.
|
|
4
|
+
|
|
5
|
+
|
|
6
|
+
#### Future changes to include
|
|
7
|
+
|
|
8
|
+
1. option to take multiple background pileup files
|
|
9
|
+
2. replace output directory with output file name tag, since we only write to one file
|
|
10
|
+
3. option to take bam file or pileup file as inputs of bulks
|
|
11
|
+
|
|
12
|
+
#### [1.2.0] - 2016-08-11
|
|
13
|
+
|
|
14
|
+
1. fixed calculation of heterozygosity for background bulks
|
|
15
|
+
2. changed command line boolean option to be set using only true or false
|
|
16
|
+
3. included command line option to set length of sequence to retrieve on either side of each variant
|
|
17
|
+
|
|
18
|
+
|
|
19
|
+
#### [1.1.0] - 2016-07-26
|
|
20
|
+
|
|
21
|
+
first release of the binaries for Linux 64 bit and OSX 64bit
|
data/Gemfile
CHANGED
data/README.md
CHANGED
|
@@ -11,6 +11,7 @@ Computing Homozygosity Enriched Regions In genomes to Prioritize Identification
|
|
|
11
11
|
is a ruby tool to pick causative mutations from bulk segregant sequencing.
|
|
12
12
|
|
|
13
13
|
Currently this gem is still in development and nearing complete working package.
|
|
14
|
+
The software currently works only with pileup files as input; use of bam and vcf files will be implemented in future
|
|
14
15
|
|
|
15
16
|
|
|
16
17
|
## Installation
|
|
@@ -20,7 +21,7 @@ Binaries are available for Linux 64bit and OSX.
|
|
|
20
21
|
Best way to use Cheripic is to download appropriate binary archive
|
|
21
22
|
unpack (`tar -xzf`) and add the unpacked directory to your `PATH`
|
|
22
23
|
|
|
23
|
-
Latest binaries are available to [download here](https://github.com/shyamrallapalli/cheripic/releases/
|
|
24
|
+
Latest binaries are available to [download here](https://github.com/shyamrallapalli/cheripic/releases/latest)
|
|
24
25
|
|
|
25
26
|
|
|
26
27
|
To install gem and use the gem in your development
|
|
@@ -44,7 +45,7 @@ Running `cheripic` without any input at command line interface shows following h
|
|
|
44
45
|
|
|
45
46
|
```
|
|
46
47
|
|
|
47
|
-
Cheripic v1.
|
|
48
|
+
Cheripic v1.2.0
|
|
48
49
|
Authors: Shyam Rallapalli and Dan MacLean
|
|
49
50
|
|
|
50
51
|
Description: Candidate mutation and closely linked marker selection for non reference genomes
|
|
@@ -59,30 +60,31 @@ USAGE:
|
|
|
59
60
|
cheripic <options>
|
|
60
61
|
|
|
61
62
|
OPTIONS:
|
|
62
|
-
-f, --assembly=<s>
|
|
63
|
-
-F, --input-format=<s>
|
|
64
|
-
-a, --mut-bulk=<s>
|
|
65
|
-
-b, --bg-bulk=<s>
|
|
66
|
-
--output=<s>
|
|
67
|
-
--loglevel=<s>
|
|
68
|
-
--hmes-adjust=<f>
|
|
69
|
-
--htlow=<f>
|
|
70
|
-
--hthigh=<f>
|
|
71
|
-
--mindepth=<i>
|
|
72
|
-
--min-non-ref-count=<i>
|
|
73
|
-
--min-indel-count-support=<i>
|
|
74
|
-
--
|
|
75
|
-
-q, --mapping-quality=<i>
|
|
76
|
-
-Q, --base-quality=<i>
|
|
77
|
-
--noise=<f>
|
|
78
|
-
--cross-type=<s>
|
|
79
|
-
--
|
|
80
|
-
--
|
|
81
|
-
--polyploidy
|
|
82
|
-
-p, --mut-parent=<s>
|
|
83
|
-
-r, --bg-parent=<s>
|
|
84
|
-
--bfr-adjust=<f>
|
|
85
|
-
--
|
|
63
|
+
-f, --assembly=<s> Assembly file in FASTA format
|
|
64
|
+
-F, --input-format=<s> bulk and parent alignment file format types - set either pileup or bam (default: pileup)
|
|
65
|
+
-a, --mut-bulk=<s> Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1
|
|
66
|
+
-b, --bg-bulk=<s> Pileup or sorted BAM file alignments from background/wildtype bulk 2
|
|
67
|
+
--output=<s> Directory to store results, will be created if not existing (default: cheripic_results)
|
|
68
|
+
--loglevel=<s> Choose any one of "info / warn / debug" level for logs generated (default: debug)
|
|
69
|
+
--hmes-adjust=<f> factor added to snp count of each contig to adjust for hme score calculations (default: 0.5)
|
|
70
|
+
--htlow=<f> lower level for categorizing heterozygosity (default: 0.2)
|
|
71
|
+
--hthigh=<f> high level for categorizing heterozygosity (default: 0.9)
|
|
72
|
+
--mindepth=<i> minimum read depth to consider a position for variant calls (default: 6)
|
|
73
|
+
--min-non-ref-count=<i> minimum read depth supporting non reference base at each position (default: 3)
|
|
74
|
+
--min-indel-count-support=<i> minimum read depth supporting an indel at each position (default: 3)
|
|
75
|
+
--ambiguous-ref-bases including variant at completely ambiguous bases in the reference
|
|
76
|
+
-q, --mapping-quality=<i> minimum mapping quality of read covering the position (default: 20)
|
|
77
|
+
-Q, --base-quality=<i> minimum base quality of bases covering the position (default: 15)
|
|
78
|
+
--noise=<f> proportion of reads for a variant to consider as noise (default: 0.1)
|
|
79
|
+
--cross-type=<s> type of cross used to generate mapping population - back or out (default: back)
|
|
80
|
+
--use-all-contigs option to select all contigs or only contigs containing variants for analysis
|
|
81
|
+
--include-low-hmes option to include or discard variants from contigs with low hme-score or bfr score to list in the final output
|
|
82
|
+
--polyploidy Set if the data input is from polyploids
|
|
83
|
+
-p, --mut-parent=<s> Pileup or sorted BAM file alignments from mutant/trait of interest parent (default: )
|
|
84
|
+
-r, --bg-parent=<s> Pileup or sorted BAM file alignments from background/wildtype parent (default: )
|
|
85
|
+
--bfr-adjust=<f> factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)
|
|
86
|
+
--sel-seq-len=<i> sequence length to print from either side of selected variants (default: 50)
|
|
87
|
+
--examples shows some example commands with explanation
|
|
86
88
|
|
|
87
89
|
```
|
|
88
90
|
|
|
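The `--htlow` and `--hthigh` options listed in the help text above set the bounds used to categorise heterozygosity. As a rough illustration only, the sketch below shows one way such thresholds could map an allele fraction to a variant category. The helper name `classify_fraction`, the returned symbols and the handling of the boundary values are assumptions made for this example and are not taken from the cheripic source.

```ruby
# Hypothetical helper (not part of cheripic): classify an allele fraction
# using thresholds named after the --htlow and --hthigh options.
# Boundary handling (>= htlow, > hthigh) is an assumption.
def classify_fraction(fraction, htlow: 0.2, hthigh: 0.9)
  return :hom if fraction > hthigh   # nearly all reads support the variant
  return :het if fraction >= htlow   # mixed support, treated as heterozygous
  nil                                # too little support to call a variant
end

puts classify_fraction(0.95).inspect
puts classify_fraction(0.5).inspect
puts classify_fraction(0.1).inspect
```

Returning `nil` for low fractions keeps "no call" distinct from the two variant categories, so a caller can skip such positions with a simple nil check.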
@@ -98,7 +100,7 @@ EXAMPLE COMMANDS:
|
|
|
98
100
|
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true --output cheripic_results
|
|
99
101
|
3. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
|
100
102
|
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true
|
|
101
|
-
--
|
|
103
|
+
--use-all-contigs true --include-low-hmes true --output cheripic_results
|
|
102
104
|
|
|
103
105
|
```
|
|
104
106
|
|
data/Rakefile
CHANGED
|
@@ -23,7 +23,7 @@ TRAVELING_RUBY_VERSION = '20150210-2.1.5'
|
|
|
23
23
|
# http://d6r77u77i8pq3.cloudfront.net/releases/traveling-ruby-20150210-2.1.5-osx.tar.gz
|
|
24
24
|
|
|
25
25
|
desc 'Package your app'
|
|
26
|
-
task :package =>
|
|
26
|
+
task :package => %w(package:linux:x86_64 package:osx)
|
|
27
27
|
|
|
28
28
|
namespace :package do
|
|
29
29
|
|
|
@@ -71,8 +71,12 @@ def create_package(target)
|
|
|
71
71
|
sh "cp packaging/cheripic.gemspec Gemfile Gemfile.lock LICENSE.txt #{package_dir}/lib/app/"
|
|
72
72
|
sh "mkdir #{package_dir}/lib/app/.bundle"
|
|
73
73
|
sh "cp packaging/bundler-config #{package_dir}/lib/app/.bundle/config"
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
74
|
+
if target == 'linux-x86_64'
|
|
75
|
+
sh "cp -p packaging/linux-x86_64_samtools/external/* packaging/cheripic-#{VERSION}-linux-x86_64/lib/app/ruby/2.1.0/gems/bio-samtools-2.4.0/lib/bio/db/sam/external/"
|
|
76
|
+
end
|
|
77
|
+
unless ENV['DIR_ONLY']
|
|
78
|
+
Dir.chdir('packaging') do
|
|
79
|
+
sh "gtar -czf #{package_dest}.tar.gz #{package_dest}"
|
|
80
|
+
end
|
|
81
|
+
end
|
|
78
82
|
end
|
data/bin/cheripic
CHANGED
data/cheripic.gemspec
CHANGED
|
@@ -23,8 +23,6 @@ Gem::Specification.new do |spec|
|
|
|
23
23
|
spec.add_runtime_dependency 'trollop', '~> 2.1', '>= 2.1.2'
|
|
24
24
|
spec.add_runtime_dependency 'bio', '~> 1.5', '>= 1.5.0'
|
|
25
25
|
spec.add_dependency 'bio-samtools', '~> 2.4.0'
|
|
26
|
-
spec.add_dependency 'bio-gngm', '~> 0.2.1'
|
|
27
|
-
spec.add_runtime_dependency 'rinruby', '~> 2.0', '>= 2.0.3'
|
|
28
26
|
|
|
29
27
|
spec.add_development_dependency 'activesupport', '~> 4.2.6'
|
|
30
28
|
spec.add_development_dependency 'bundler', '~> 1.7.6'
|
data/galaxy_cheripic_tool.xml
ADDED
|
@@ -0,0 +1,196 @@
|
|
|
1
|
+
<tool id="cheripic" name="CHERIPIC" version="1.2.0">
|
|
2
|
+
|
|
3
|
+
<description>CHERIPIC</description>
|
|
4
|
+
|
|
5
|
+
<version_command>cheripic -v</version_command>
|
|
6
|
+
|
|
7
|
+
<command>
|
|
8
|
+
<![CDATA[
|
|
9
|
+
cheripic
|
|
10
|
+
--assembly $assembly
|
|
11
|
+
--mut-bulk $mut_bulk
|
|
12
|
+
--bg-bulk $bg_bulk
|
|
13
|
+
--output $output
|
|
14
|
+
--loglevel $loglevel
|
|
15
|
+
--hmes-adjust $hmes_adjust
|
|
16
|
+
--htlow $ht_low
|
|
17
|
+
--hthigh $ht_high
|
|
18
|
+
--mindepth $min_depth
|
|
19
|
+
--min-non-ref-count $min_non_ref_count
|
|
20
|
+
--min-indel-count-support $min_indel_count_support
|
|
21
|
+
--ambiguous-ref-bases $ambiguous_ref_bases
|
|
22
|
+
--mapping-quality $mapping_quality
|
|
23
|
+
--base-quality $base_quality
|
|
24
|
+
--noise $noise
|
|
25
|
+
--cross-type $cross_type
|
|
26
|
+
--use-all-contigs $use_all_contigs
|
|
27
|
+
--include-low-hmes $include_low_hmes
|
|
28
|
+
--polyploidy $polyploidy
|
|
29
|
+
--mut-parent $mut_parent
|
|
30
|
+
--bg-parent $bg_parent
|
|
31
|
+
--bfr-adjust $bfr_adjust
|
|
32
|
+
--sel-seq-len $sel_seq_len
|
|
33
|
+
]]>
|
|
34
|
+
</command>
|
|
35
|
+
|
|
36
|
+
<inputs>
|
|
37
|
+
<param name="assembly" type="data" format="fasta" label="Input Assembly file" help="Select Assembly fasta file" />
|
|
38
|
+
<param name="mut_bulk" type="data" format="pileup" label="mutant bulk pileup file" help="Select mutant bulk pileup file" />
|
|
39
|
+
<param name="bg_bulk" type="data" format="pileup" label="background bulk pileup file" min="1" multiple="true" help="Select background bulk pileup file" />
|
|
40
|
+
<param name="loglevel" type="select" optional="true" label="analysis log level" help="choose between info, warn and debug levels">
|
|
41
|
+
<option value="info" selected="true">info </option>
|
|
42
|
+
<option value="warn">warnings</option>
|
|
43
|
+
<option value="debug">debug</option>
|
|
44
|
+
</param>
|
|
45
|
+
<param name="hmes_adjust" size="4" type="float" optional="true" value="0.5" min="0.01" max="1.0"
|
|
46
|
+
label="hme score adjuster" help="factor added to snp count of each contig to adjust for hme score calculations" />
|
|
47
|
+
<param name="ht_low" size="4" type="float" optional="true" value="0.25" min="0.1" max="1.0"
|
|
48
|
+
label="heterozygosity low limit" help="lower limit to heterozygosity allele fraction" />
|
|
49
|
+
<param name="ht_high" size="4" type="float" optional="true" value="0.75" min="0.1" max="1.0"
|
|
50
|
+
label="heterozygosity high limit" help="upper limit to heterozygosity allele fraction" />
|
|
51
|
+
<param name="min_depth" size="4" type="integer" optional="true" value="6" min="1" max="8000"
|
|
52
|
+
label="minimum read coverage" help="minimum read depth to consider a position for variant calls" />
|
|
53
|
+
<param name="min_non_ref_count" size="4" type="integer" optional="true" value="3" min="1" max="8000"
|
|
54
|
+
label="minimum alternate read coverage" help="minimum read depth supporting non reference base at each position" />
|
|
55
|
+
<param name="min_indel_count_support" size="4" type="integer" optional="true" value="3" min="1" max="8000"
|
|
56
|
+
label="minimum indel read coverage" help="minimum read depth supporting an indel at each position" />
|
|
57
|
+
<param name="ambiguous_ref_bases" type="boolean" optional="true" checked="false" label="ambiguous reference position"
|
|
58
|
+
help="including variant at completely ambiguous bases in the reference" truevalue="true" falsevalue="false" />
|
|
59
|
+
<param name="mapping_quality" size="4" type="integer" optional="true" value="20" min="0" max="255"
|
|
60
|
+
label="minimum mapping quality" help="minimum mapping quality of read covering the position" />
|
|
61
|
+
<param name="base_quality" size="4" type="integer" optional="true" value="15" min="0" max="40"
|
|
62
|
+
label="minimum base quality" help="minimum base quality of nucleotides covering the position" />
|
|
63
|
+
<param name="noise" size="4" type="float" optional="true" value="0.1" min="0" max="0.2"
|
|
64
|
+
label="read noise" help="proportion of reads supporting a variant, below which they are considered noise" />
|
|
65
|
+
<param name="cross_type" type="select" optional="true" label="cross type" help="type of cross used to generated mapping population - back or out" >
|
|
66
|
+
<option value="back" selected="true">back cross</option>
|
|
67
|
+
<option value="out">out cross</option>
|
|
68
|
+
</param>
|
|
69
|
+
|
|
70
|
+
<param name="use_all_contigs" type="boolean" optional="true" checked="false" label="use all contigs in analysis"
|
|
71
|
+
help="option to select all contigs or only contigs containing variants for analysis" truevalue="true" falsevalue="false" />
|
|
72
|
+
<param name="include_low_hmes" type="boolean" optional="true" checked="false" label="no hme or bfr score cut off"
|
|
73
|
+
help="option to include or discard variants from contigs with low hme-score or bfr score to list in the final output" truevalue="true" falsevalue="false" />
|
|
74
|
+
<param name="polyploidy" type="boolean" optional="true" checked="false" label="polyploid data"
|
|
75
|
+
help="Set if the input data is from polyploids" truevalue="true" falsevalue="false" />
|
|
76
|
+
<param name="mut-parent" type="data" optional="true" format="pileup" label="mutant parent pileup file" help="Select mutant parent pileup file" />
|
|
77
|
+
<param name="bg-parent" type="data" optional="true" format="pileup" label="background parent pileup file" help="Select background parent pileup file" />
|
|
78
|
+
|
|
79
|
+
<param name="bfr_adjust" size="4" type="float" optional="true" value="0.05" min="0.01" max="1.0"
|
|
80
|
+
label="bfr score adjuster" help="factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)" />
|
|
81
|
+
<param name="sel_seq_len" size="4" type="integer" optional="true" value="50" min="10" max="250"
|
|
82
|
+
label="selected variant seq length out" help="sequence length to print from either side of selected variants (default: 50)" />
|
|
83
|
+
|
|
84
|
+
<param name="output" type="text" size="30" value="cheripic_results" label="tag for output filename" help="write a tag to include with output filename" />
|
|
85
|
+
</inputs>
|
|
86
|
+
|
|
87
|
+
<outputs>
|
|
88
|
+
<data name="output_1" format="txt" file="${output}_selected_hme_variants.txt" />
|
|
89
|
+
<data name="output_2" format="txt" file="${output}_selected_bfr_variants.txt" />
|
|
90
|
+
</outputs>
|
|
91
|
+
|
|
92
|
+
<tests>
|
|
93
|
+
<test>
|
|
94
|
+
<param name="assembly" value="picked_fasta.fa" ftype="fasta" />
|
|
95
|
+
<param name="mut_bulk" value="mut_bulk.pileup" ftype="pileup" />
|
|
96
|
+
<param name="bg_bulk" value="wt_bulk.pileup" ftype="pileup" />
|
|
97
|
+
<output name="output" ftype="txt" file="selected_variants.out" />
|
|
98
|
+
</test>
|
|
99
|
+
</tests>
|
|
100
|
+
|
|
101
|
+
<help>
|
|
102
|
+
|
|
103
|
+
**Computing Homozygosity Enriched Regions In genomes to Prioritize Identification of Candidate variants (CHERIPIC)**
|
|
104
|
+
|
|
105
|
+
CHERIPIC is a ruby tool to pick causative mutation from bulk segregant sequencing
|
|
106
|
+
|
|
107
|
+
------
|
|
108
|
+
|
|
109
|
+
**What it does**
|
|
110
|
+
|
|
111
|
+
This tool uses ``cheripic`` tool to analyse bulk segregant sequencing to identify causative mutation
|
|
112
|
+
|
|
113
|
+
|
|
114
|
+
.. class:: infomark
|
|
115
|
+
|
|
116
|
+
Provides a list of snps that could be either closely linked markers or the causative mutation.
|
|
117
|
+
|
|
118
|
+
------
|
|
119
|
+
|
|
120
|
+
**Input formats**
|
|
121
|
+
|
|
122
|
+
assembly file should be a fasta file used for generating pileups from bulks
|
|
123
|
+
bulk alignment files should be pileup files
|
|
124
|
+
|
|
125
|
+
------
|
|
126
|
+
|
|
127
|
+
**Outputs**
|
|
128
|
+
|
|
129
|
+
The output is a text file, and has the following columns::
|
|
130
|
+
|
|
131
|
+
Column Description
|
|
132
|
+
----------------- --------------------------------------------------------
|
|
133
|
+
1 HME_Score Homozygosity Enrichment score
|
|
134
|
+
2 AlleleFreq Allele frequency
|
|
135
|
+
3 seq_id Contig/Scaffold id
|
|
136
|
+
4 position 1-based index of the position in contig
|
|
137
|
+
5 ref_base Reference nucleotide at the position
|
|
138
|
+
6 coverage read depth
|
|
139
|
+
7 bases read bases
|
|
140
|
+
8 base_quals read base qualities
|
|
141
|
+
9 sequence_left selected size of reference sequence on the left variant
|
|
142
|
+
10 Alt_seq Alternate allele at the position
|
|
143
|
+
11 sequence_right selected size of reference sequence on the right variant
|
|
144
|
+
|
|
145
|
+
------
|
|
146
|
+
|
|
147
|
+
**cheripic settings**
|
|
148
|
+
|
|
149
|
+
All of the options have a default value. You can change any of them. All of the options are implemented.
|
|
150
|
+
|
|
151
|
+
------
|
|
152
|
+
|
|
153
|
+
**cheripic parameter list**
|
|
154
|
+
|
|
155
|
+
OPTIONS:
|
|
156
|
+
-f, --assembly Assembly file in FASTA format
|
|
157
|
+
-F, --input-format bulk and parent alignment file format types - set either pileup or bam (default: pileup)
|
|
158
|
+
-a, --mut-bulk Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1
|
|
159
|
+
-b, --bg-bulk Pileup or sorted BAM file alignments from background/wildtype bulk 2
|
|
160
|
+
--output Directory to store results, will be created if not existing (default: cheripic_results)
|
|
161
|
+
--loglevel Choose any one of "info / warn / debug" level for logs generated (default: debug)
|
|
162
|
+
--hmes-adjust factor added to snp count of each contig to adjust for hme score calculations (default: 0.5)
|
|
163
|
+
--htlow lower level for categorizing heterozygosity (default: 0.2)
|
|
164
|
+
--hthigh high level for categorizing heterozygosity (default: 0.9)
|
|
165
|
+
--mindepth minimum read depth to consider a position for variant calls (default: 6)
|
|
166
|
+
--min-non-ref-count minimum read depth supporting non reference base at each position (default: 3)
|
|
167
|
+
--min-indel-count-support minimum read depth supporting an indel at each position (default: 3)
|
|
168
|
+
--ambiguous-ref-bases including variant at completely ambiguous bases in the reference
|
|
169
|
+
-q, --mapping-quality minimum mapping quality of read covering the position (default: 20)
|
|
170
|
+
-Q, --base-quality minimum base quality of bases covering the position (default: 15)
|
|
171
|
+
--noise proportion of reads for a variant to consider as noise (default: 0.1)
|
|
172
|
+
--cross-type type of cross used to generate mapping population - back or out (default: back)
|
|
173
|
+
--use-all-contigs option to select all contigs or only contigs containing variants for analysis
|
|
174
|
+
--include-low-hmes option to include or discard variants from contigs with low hme-score or bfr score to list in the final output
|
|
175
|
+
--polyploidy Set if the data input is from polyploids
|
|
176
|
+
-p, --mut-parent Pileup or sorted BAM file alignments from mutant/trait of interest parent (default: )
|
|
177
|
+
-r, --bg-parent Pileup or sorted BAM file alignments from background/wildtype parent (default: )
|
|
178
|
+
--bfr-adjust factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)
|
|
179
|
+
--sel-seq-len sequence length to print from either side of selected variants (default: 50)
|
|
180
|
+
|
|
181
|
+
------
|
|
182
|
+
|
|
183
|
+
.. class:: infomark
|
|
184
|
+
|
|
185
|
+
**Tool Author**
|
|
186
|
+
|
|
187
|
+
Shyam Rallapalli
|
|
188
|
+
|
|
189
|
+
|
|
190
|
+
</help>
|
|
191
|
+
|
|
192
|
+
<citations>
|
|
193
|
+
<citation type="doi">10.1093/bioinformatics/btg1080</citation>
|
|
194
|
+
</citations>
|
|
195
|
+
|
|
196
|
+
</tool>
|
data/lib/cheripic.rb
CHANGED
data/lib/cheripic/cmd.rb
CHANGED
|
@@ -52,10 +52,14 @@ module Cheripic
|
|
|
52
52
|
opt :mut_bulk, 'Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1',
|
|
53
53
|
:short => '-a',
|
|
54
54
|
:type => String
|
|
55
|
+
opt :mut_bulk_vcf, 'vcf file for variants from mutant/trait of interest bulk 1',
|
|
56
|
+
:type => String
|
|
55
57
|
opt :bg_bulk, 'Pileup or sorted BAM file alignments from background/wildtype bulk 2',
|
|
56
58
|
:short => '-b',
|
|
57
59
|
:type => String
|
|
58
|
-
opt :
|
|
60
|
+
opt :bg_bulk_vcf, 'vcf file for variants from background/wildtype bulk 2',
|
|
61
|
+
:type => String
|
|
62
|
+
opt :output, 'custom name tag to include in the output file name',
|
|
59
63
|
:default => 'cheripic_results'
|
|
60
64
|
opt :loglevel, 'Choose any one of "info / warn / debug" level for logs generated',
|
|
61
65
|
:default => 'debug'
|
|
@@ -68,9 +72,17 @@ module Cheripic
|
|
|
68
72
|
opt :hthigh, 'high level for categorizing heterozygosity',
|
|
69
73
|
:type => Float,
|
|
70
74
|
:default => 0.9
|
|
71
|
-
opt :mindepth, 'minimum read depth
|
|
75
|
+
opt :mindepth, 'minimum read depth at a position to consider for variant calls',
|
|
72
76
|
:type => Integer,
|
|
73
77
|
:default => 6
|
|
78
|
+
opt :max_d_multiple, "multiplication factor for average coverage to calculate maximum read coverage
|
|
79
|
+
if set zero no calculation will be made from bam file.\nsetting this value will override user set max depth",
|
|
80
|
+
:type => Integer,
|
|
81
|
+
:default => 5
|
|
82
|
+
opt :maxdepth, "maximum read depth at a position to consider for variant calls
|
|
83
|
+
if set to zero no user max depth will be used",
|
|
84
|
+
:type => Integer,
|
|
85
|
+
:default => 0
|
|
74
86
|
opt :min_non_ref_count, 'minimum read depth supporting non reference base at each position',
|
|
75
87
|
:type => Integer,
|
|
76
88
|
:default => 3
|
|
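The help strings for `max_d_multiple` and `maxdepth` in the hunk above describe two ways of arriving at a maximum read depth: a multiple of the average coverage, which takes precedence when set, or a user-supplied value, with zero disabling either one. A minimal sketch of that precedence follows. The function name `effective_max_depth`, the `mean_coverage` argument and the rounding are assumptions for illustration; how cheripic actually computes average coverage is not shown in this diff.

```ruby
# Hypothetical helper (not part of cheripic): pick the depth ceiling implied
# by the max_d_multiple and maxdepth option descriptions.
# Returns nil when neither option yields a limit.
def effective_max_depth(mean_coverage, max_d_multiple: 5, maxdepth: 0)
  if max_d_multiple > 0 && mean_coverage
    # a non-zero multiple overrides any user-set maximum depth
    (mean_coverage * max_d_multiple).ceil
  elsif maxdepth > 0
    maxdepth
  end
end

puts effective_max_depth(30.0).inspect
puts effective_max_depth(30.0, max_d_multiple: 0, maxdepth: 200).inspect
puts effective_max_depth(30.0, max_d_multiple: 0, maxdepth: 0).inspect
```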
@@ -97,7 +109,8 @@ module Cheripic
|
|
|
97
109
|
opt :use_all_contigs, 'option to select all contigs or only contigs containing variants for analysis',
|
|
98
110
|
:type => FalseClass,
|
|
99
111
|
:default => false
|
|
100
|
-
opt :include_low_hmes, 'option to include or discard variants from contigs with
|
|
112
|
+
opt :include_low_hmes, 'option to include or discard variants from contigs with
|
|
113
|
+
low hme-score or bfr score to list in the final output',
|
|
101
114
|
:type => FalseClass,
|
|
102
115
|
:default => false
|
|
103
116
|
opt :polyploidy, 'Set if the data input is from polyploids',
|
|
@@ -111,6 +124,10 @@ module Cheripic
|
|
|
111
124
|
:short => '-r',
|
|
112
125
|
:type => String,
|
|
113
126
|
:default => ''
|
|
127
|
+
opt :repeats_file, 'repeat masker output file for the assembly ',
|
|
128
|
+
:short => '-R',
|
|
129
|
+
:type => String,
|
|
130
|
+
:default => ''
|
|
114
131
|
opt :bfr_adjust, 'factor added to hemi snp frequency of each parent to adjust for bfr calculations',
|
|
115
132
|
:type => Float,
|
|
116
133
|
:default => 0.05
|
|
@@ -133,8 +150,9 @@ module Cheripic
|
|
|
133
150
|
|
|
134
151
|
Inputs:
|
|
135
152
|
1. Needs a reference fasta file of assembly used for variant analysis
|
|
136
|
-
2. Pileup files for mutant (phenotype of interest) bulks and background (wildtype phenotype) bulks
|
|
137
|
-
3. If
|
|
153
|
+
2. Pileup/Bam files for mutant (phenotype of interest) bulks and background (wildtype phenotype) bulks
|
|
154
|
+
3. If providing bam files, you have to include vcf files for the respective bulks
|
|
155
|
+
4. If polyploid species, include pileup/bam files from one or both parents
|
|
138
156
|
|
|
139
157
|
USAGE:
|
|
140
158
|
cheripic <options>
|
|
@@ -149,15 +167,19 @@ module Cheripic
|
|
|
149
167
|
def print_examples
|
|
150
168
|
msg = <<-EOS
|
|
151
169
|
|
|
152
|
-
|
|
170
|
+
Cheripic v#{Cheripic::VERSION.dup}
|
|
171
|
+
Authors: Shyam Rallapalli and Dan MacLean
|
|
172
|
+
|
|
173
|
+
EXAMPLE COMMANDS:
|
|
174
|
+
1. cheripic -f assembly.fa -a mutbulk.pileup -b bgbulk.pileup --output=cheripic_output
|
|
175
|
+
2. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
|
176
|
+
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true --output cheripic_results
|
|
177
|
+
3. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
|
178
|
+
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true
|
|
179
|
+
--no-only-frag-with-vars --no-filter-out-low-hmes --output cheripic_results
|
|
180
|
+
4. cheripic -h or cheripic --help
|
|
181
|
+
5. cheripic -v or cheripic --version
|
|
153
182
|
|
|
154
|
-
EXAMPLE COMMANDS:
|
|
155
|
-
1. cheripic -f assembly.fa -a mutbulk.pileup -b bgbulk.pileup --output=cheripic_output
|
|
156
|
-
2. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
|
157
|
-
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true --output cheripic_results
|
|
158
|
-
3. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
|
|
159
|
-
--mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true
|
|
160
|
-
--no-only-frag-with-vars --no-filter-out-low-hmes --output cheripic_results
|
|
161
183
|
EOS
|
|
162
184
|
puts msg.split("\n").map{ |line| line.lstrip }.join("\n")
|
|
163
185
|
exit(0)
|
|
@@ -165,44 +187,66 @@ module Cheripic
|
|
|
165
187
|
|
|
166
188
|
# calls other methods to check if command line inputs are valid
|
|
167
189
|
def check_arguments
|
|
168
|
-
|
|
190
|
+
check_output
|
|
169
191
|
check_log_level
|
|
170
|
-
|
|
192
|
+
check_input_types
|
|
171
193
|
end
|
|
172
194
|
|
|
173
|
-
#
|
|
174
|
-
|
|
175
|
-
|
|
176
|
-
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
195
|
+
# checks input files based on bulk file type
|
|
196
|
+
def check_input_types
|
|
197
|
+
inputfiles = {}
|
|
198
|
+
inputfiles[:required] = %i{assembly mut_bulk}
|
|
199
|
+
inputfiles[:optional] = %i{bg_bulk}
|
|
200
|
+
if @options[:input_format] == 'bam'
|
|
201
|
+
inputfiles[:required] << %i{mut_bulk_vcf}
|
|
202
|
+
inputfiles[:optional] << %i{bg_bulk_vcf}
|
|
203
|
+
end
|
|
182
204
|
if @options[:polyploidy]
|
|
183
|
-
inputfiles = %i{
|
|
184
|
-
else
|
|
185
|
-
inputfiles = %i{assembly mut_bulk bg_bulk}
|
|
205
|
+
inputfiles[:either] = %i{mut_parent bg_parent}
|
|
186
206
|
end
|
|
187
|
-
inputfiles
|
|
188
|
-
|
|
189
|
-
|
|
190
|
-
|
|
191
|
-
|
|
192
|
-
|
|
207
|
+
check_input_files(inputfiles)
|
|
208
|
+
end
|
|
209
|
+
|
|
210
|
+
# checks if input files are valid
|
|
211
|
+
def check_input_files(inputfiles)
|
|
212
|
+
check = 0
|
|
213
|
+
inputfiles.each_key do | type |
|
|
214
|
+
inputfiles[type].flatten!
|
|
215
|
+
inputfiles[type].each do | symbol |
|
|
216
|
+
if @options[symbol]
|
|
217
|
+
file = @options[symbol]
|
|
218
|
+
@options[symbol] = File.expand_path(file)
|
|
219
|
+
next if type == :optional
|
|
220
|
+
if type == :required and not File.exist?(file)
|
|
221
|
+
raise CheripicIOError.new "#{symbol} file, #{file} does not exist: "
|
|
222
|
+
elsif type == :either and File.exist?(file)
|
|
223
|
+
check = 1
|
|
224
|
+
end
|
|
225
|
+
elsif type == :required
|
|
226
|
+
raise CheripicArgError.new "Options #{inputfiles}, all must be specified. " +
|
|
227
|
+
'Try --help for further help.'
|
|
193
228
|
end
|
|
194
|
-
|
|
195
|
-
|
|
196
|
-
|
|
229
|
+
end
|
|
230
|
+
if type == :either and check == 0
|
|
231
|
+
raise CheripicArgError.new "One of the options #{inputfiles}, must be specified. " +
|
|
232
|
+
'Try --help for further help.'
|
|
197
233
|
end
|
|
198
234
|
end
|
|
199
235
|
end
|
|
200
236
|
|
|
201
|
-
# checks if output
|
|
202
|
-
def
|
|
203
|
-
if
|
|
204
|
-
raise CheripicArgError.new
|
|
205
|
-
'
|
|
237
|
+
# checks if files with output tag name already exists
|
|
238
|
+
def check_output
|
|
239
|
+
if (@options[:output].split('') & %w{# / : * ? ' < > | & $ ,}).any?
|
|
240
|
+
raise CheripicArgError.new 'please choose a name tag that contains ' +
|
|
241
|
+
'alphanumeric characters, hyphen(-) and underscore(_) only'
|
|
242
|
+
end
|
|
243
|
+
@options[:hmes_frags] = "#{@options[:output]}_selected_hme_variants.txt"
|
|
244
|
+
@options[:bfr_frags] = "#{@options[:output]}_selected_bfr_variants.txt"
|
|
245
|
+
[@options[:hmes_frags], @options[:bfr_frags]].each do | file |
|
|
246
|
+
if File.exist?(file)
|
|
247
|
+
raise CheripicArgError.new "'#{file}' file exists " +
|
|
248
|
+
'please choose a different name tag to be included in the output file name'
|
|
249
|
+
end
|
|
206
250
|
end
|
|
207
251
|
end
|
|
208
252
|
|
|
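The new `check_input_types` and `check_input_files` methods above sort option keys into `:required`, `:optional` and `:either` groups, and rely on `flatten!` because `<<` appends the VCF keys as nested arrays. The sketch below isolates that grouping logic so it can be run without any files on disk. It is a simplification under stated assumptions: the helper `missing_inputs` is hypothetical, it collects messages instead of raising `CheripicArgError`, and it only checks that a value was supplied, not that the file exists.

```ruby
# Hypothetical helper (not part of cheripic): report which grouped inputs
# are missing. File existence checks are deliberately left out.
def missing_inputs(options, groups)
  problems = []

  # every key in the :required group must have a non-empty value;
  # flatten tolerates nested arrays produced by `<< %i{...}`
  groups.fetch(:required, []).flatten.each do |key|
    problems << "#{key} is required" if options[key].to_s.empty?
  end

  # at least one key in the :either group must have a non-empty value
  either = groups.fetch(:either, []).flatten
  unless either.empty? || either.any? { |key| !options[key].to_s.empty? }
    problems << "one of #{either.join(', ')} is required"
  end

  problems
end

groups = { required: %i{assembly mut_bulk}, optional: %i{bg_bulk} }
groups[:required] << %i{mut_bulk_vcf}          # nested array, as in the diff
groups[:either] = %i{mut_parent bg_parent}     # polyploid case
puts missing_inputs({ assembly: 'assembly.fa', mut_bulk: 'mut.bam' }, groups).inspect
```

Keys in the `:optional` group are never reported, which mirrors the `next if type == :optional` line in the diff.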
@@ -220,7 +264,8 @@ module Cheripic
|
|
|
220
264
|
# A hash of trollop option names as keys and user or default
|
|
221
265
|
# setting as values is passed to Implementer object
|
|
222
266
|
def run
|
|
223
|
-
@options[:
|
|
267
|
+
@options[:hmes_frags] = File.expand_path @options[:hmes_frags]
|
|
268
|
+
@options[:bfr_frags] = File.expand_path @options[:bfr_frags]
|
|
224
269
|
analysis = Implementer.new(@options)
|
|
225
270
|
analysis.run
|
|
226
271
|
end
|
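In this release `--output` becomes a name tag instead of a directory: `check_output` rejects tags containing certain characters and derives two file names from the tag, which `run` then expands to absolute paths. The following sketch reproduces the tag check and the file-name derivation in isolation. The function name `output_files_for` is hypothetical, and it raises `ArgumentError` in place of cheripic's own `CheripicArgError` so the example stays self-contained; the check for already existing files is omitted.

```ruby
# Hypothetical helper (not part of cheripic): validate an output name tag
# and return the two result file names built from it.
def output_files_for(tag)
  forbidden = %w{# / : * ? ' < > | & $ ,}   # same character list as in the diff
  if (tag.chars & forbidden).any?
    raise ArgumentError, 'please choose a name tag that contains ' \
                         'alphanumeric characters, hyphen(-) and underscore(_) only'
  end
  ["#{tag}_selected_hme_variants.txt", "#{tag}_selected_bfr_variants.txt"]
end

puts output_files_for('cheripic_results').inspect
```

Note that the check is a deny-list, so characters outside the listed set (a space, for example) still pass even though the error message suggests a stricter rule.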
data/lib/cheripic/contig.rb
CHANGED
|
@@ -22,7 +22,7 @@ module Cheripic
|
|
|
22
22
|
# @return [Integer] length of contig in bases
|
|
23
23
|
class Contig
|
|
24
24
|
|
|
25
|
-
attr_accessor :hm_pos, :ht_pos, :hemi_pos
|
|
25
|
+
attr_accessor :hm_pos, :ht_pos, :hemi_pos, :mean_depth, :sd_depth
|
|
26
26
|
attr_reader :id, :length
|
|
27
27
|
|
|
28
28
|
# creates a Contig object using fasta entry
|
|
@@ -33,6 +33,8 @@ module Cheripic
|
|
|
33
33
|
@hm_pos = {}
|
|
34
34
|
@ht_pos = {}
|
|
35
35
|
@hemi_pos = {}
|
|
36
|
+
@mean_depth = nil
|
|
37
|
+
@sd_depth = nil
|
|
36
38
|
end
|
|
37
39
|
|
|
38
40
|
# Number of homozygous variants identified in the contig
|
data/lib/cheripic/contig_pileups.rb
CHANGED
|
@@ -32,7 +32,7 @@ module Cheripic
|
|
|
32
32
|
def_delegators :@mut_parent, :each, :each_key, :each_value, :length, :[], :store
|
|
33
33
|
def_delegators :@bg_parent, :each, :each_key, :each_value, :length, :[], :store
|
|
34
34
|
attr_accessor :id, :parent_hemi
|
|
35
|
-
attr_accessor :mut_bulk, :bg_bulk, :mut_parent, :bg_parent
|
|
35
|
+
attr_accessor :mut_bulk, :bg_bulk, :mut_parent, :bg_parent, :masked_regions
|
|
36
36
|
|
|
37
37
|
# creates a ContigPileup object using fasta entry id
|
|
38
38
|
# @param fasta [String] a contig id from fasta entry
|
@@ -43,16 +43,27 @@ module Cheripic
       @mut_parent = {}
       @bg_parent = {}
       @parent_hemi = {}
+      @masked_regions = Hash.new { |h,k| h[k] = {} }
+      @hm_pos = {}
+      @ht_pos = {}
+      @hemi_pos = {}
     end
 
     # bulk pileups are compared and variant positions are selected
     # @return [Array<Hash>] variant positions are stored in hashes
     # for homozygous, heterozygous and hemi-variant positions
     def bulks_compared
-      @hm_pos = {}
-      @ht_pos = {}
-      @hemi_pos = {}
       @mut_bulk.each_key do | pos |
+        ignore = 0
+        unless @masked_regions.empty?
+          @masked_regions.each_key do | index |
+            if pos.between?(@masked_regions[index][:begin], @masked_regions[index][:end])
+              ignore = 1
+              logger.info "variant is in the masked region\t#{@mut_bulk[pos].to_s}"
+            end
+          end
+        end
+        next if ignore == 1
         if Options.polyploidy and @parent_hemi.key?(pos)
           bg_bases = ''
           if @bg_bulk.key?(pos)
@@ -74,27 +85,37 @@ module Cheripic
     # @param pos [Integer] position in the contig
     # stores variant type, position and allele fraction to either @hm_pos or @ht_pos hashes
     def compare_pileup(pos)
-
-
-      return nil if base_hash.empty?
-      # we could ignore complex loci or
-      # take the variant type based on predominant base
-      if base_hash.length > 1
-        fraction = base_hash.values.max
-        mut_type = var_mode(fraction)
-      else
-        fraction = base_hash[base_hash.keys[0]]
-        mut_type = var_mode(fraction)
-      end
+      mut_type, fraction = var_mode_fraction(@mut_bulk[pos])
+      return nil if mut_type.nil?
       if @bg_bulk.key?(pos)
-        bg_type =
+        bg_type = var_mode_fraction(@bg_bulk[pos])[0]
         mut_type = compare_var_type(mut_type, bg_type)
       end
-      unless mut_type
+      unless mut_type.nil?
         categorise_pos(mut_type, pos, fraction)
       end
     end
 
+
+    # Method to extract var_mode and allele fraction from pileup information at a position in contig
+    #
+    # @param pileup_info [Pileup] pileup object
+    # @return [Symbol] variant mode from pileup position (:hom or :het) at the position
+    # @return [Float] allele fraction at the position
+    def var_mode_fraction(pileup_info)
+      base_frac_hash = pileup_info.var_base_frac
+      base_frac_hash.delete(:ref)
+      return [nil, nil] if base_frac_hash.empty?
+      # we could ignore complex loci or
+      # take the variant type based on predominant base
+      if base_frac_hash.length > 1
+        fraction = base_frac_hash.values.max
+      else
+        fraction = base_frac_hash[base_frac_hash.keys[0]]
+      end
+      [var_mode(fraction), fraction]
+    end
+
     # Categorizes variant zygosity based on the allele fraction provided.
     # Uses lower and upper limit set for heterozygosity in the options.
     # @note consider increasing the range of heterozygosity limits for RNA-seq data
@@ -125,23 +146,6 @@ module Cheripic
       end
     end
 
-    # Method to extract var_mode from pileup information at a position in contig
-    #
-    # @param pos [Integer] position in the contig
-    # @return [Symbol] variant mode of the background bulk (:hom or :het) at the position
-    def bg_bulk_var(pos)
-      bg_base_hash = @bg_bulk[pos].var_base_frac
-      bg_base_hash.delete(:ref)
-      return nil if bg_base_hash.empty?
-      if bg_base_hash.length > 1
-        # taking only var mode
-        var_mode(bg_base_hash.values.max)
-      else
-        # taking only var mode
-        var_mode(bg_base_hash[bg_base_hash.keys[0]])
-      end
-    end
-
     # method stores pos as key and allele fraction as value
     # to @hm_pos or @ht_pos hash based on variant type
     # @param var_type [Symbol] values are either :hom or :het
@@ -156,18 +160,18 @@ module Cheripic
     end
 
     # Compares parental pileups for the contig and identify position
-    # that indicate variants from
+    # that indicate variants from homeologues called hemi-snps
     # and calculates bulk frequency ratio (bfr)
     # @return [Hash] parent_hemi hash with position as key and bfr as value
     def hemisnps_in_parent
       # mark all the hemi snp based on both parents
-
+      @mut_parent.each_key do |pos|
         mut_parent_frac = @mut_parent[pos].var_base_frac
-        if
+        if @bg_parent.key?(pos)
           bg_parent_frac = @bg_parent[pos].var_base_frac
           bfr = Bfr.get_bfr(mut_parent_frac, bg_parent_frac)
           @parent_hemi[pos] = bfr
-
+          @bg_parent.delete(pos)
         else
           bfr = Bfr.get_bfr(mut_parent_frac)
           @parent_hemi[pos] = bfr
@@ -175,7 +179,7 @@ module Cheripic
       end
 
       # now include all hemi snp unique to background parent
-
+      @bg_parent.each_key do |pos|
         unless @parent_hemi.key?(pos)
           bg_parent_frac = @bg_parent[pos].var_base_frac
           bfr = Bfr.get_bfr(bg_parent_frac)
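The masked-region check that this file's diff adds to `bulks_compared` can be looked at in isolation. The following is a minimal sketch rather than the gem's code: the `masked?` helper and the sample coordinates are hypothetical, and only the `{ index => { begin:, end: } }` layout of `masked_regions` is taken from the diff.

```ruby
# Hypothetical stand-in for ContigPileups#masked_regions:
# a hash of region index => { begin:, end: } (1-based, inclusive).
masked_regions = Hash.new { |h, k| h[k] = {} }
masked_regions[1][:begin] = 100
masked_regions[1][:end]   = 250
masked_regions[2][:begin] = 900
masked_regions[2][:end]   = 1200

# Hypothetical helper: true when pos falls inside any masked region.
def masked?(regions, pos)
  regions.each_value.any? { |r| pos.between?(r[:begin], r[:end]) }
end

# Sample variant positions; masked ones are dropped before comparison.
positions = [50, 100, 251, 1000]
kept = positions.reject { |pos| masked?(masked_regions, pos) }
puts kept.inspect
```

Using `any?` stops at the first matching region, whereas the loop in the diff visits every region and sets an `ignore` flag; both skip the same positions.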
data/lib/cheripic/implementer.rb
CHANGED
@@ -25,15 +25,21 @@ module Cheripic
                 input_format
                 mut_bulk
                 bg_bulk
-
+                mut_bulk_vcf
+                bg_bulk_vcf
+                hmes_frags
+                bfr_frags
                 mut_parent
-                bg_parent
+                bg_parent
+                repeats_file}
       @options = OpenStruct.new(inputs.select { |k| set1.include?(k) })
 
       set2 = %i{hmes_adjust
                 htlow
                 hthigh
                 mindepth
+                maxdepth
+                max_d_multiple
                 min_non_ref_count
                 min_indel_count_support
                 ambiguous_ref_bases
@@ -44,10 +50,10 @@ module Cheripic
                 use_all_contigs
                 include_low_hmes
                 polyploidy
-                bfr_adjust
+                bfr_adjust
+                sel_seq_len}
       settings = inputs.select { |k| set2.include?(k) }
       Options.update(settings)
-      FileUtils.mkdir_p @options.output
       @vars_extracted = false
       @has_run = false
     end
@@ -62,15 +68,21 @@ module Cheripic
 
     # Extracted variants from bulk comparison are re-analysed
     # and selected variants are written to a file
-    def process_variants
-
+    def process_variants(pos_type)
+      if pos_type == :hmes_frags
+        @variants.verify_bg_bulk_pileup
+      end
       # print selected variants that could be potential markers or mutation
-      out_file = File.open(
-      out_file.puts "
+      out_file = File.open(@options[pos_type], 'w')
+      out_file.puts "Score\tAlleleFreq\tseq_id\tposition\tref_base\tcoverage\tbases\tbase_quals\tsequence_left\tAlt_seq\tsequence_right"
       regions = Regions.new(@options.assembly)
-      @variants.
+      @variants.send(pos_type).each_key do | frag |
         contig_obj = @variants.assembly[frag]
-
+        if pos_type == :hmes_frags
+          positions = contig_obj.hm_pos.keys
+        else
+          positions = contig_obj.hemi_pos.keys
+        end
         positions.each do | pos |
           pileup = @variants.pileups[frag].mut_bulk[pos]
           seqs = regions.fetch_seq(frag,pos)
@@ -87,11 +99,9 @@ module Cheripic
       unless @vars_extracted
         self.extract_vars
       end
+      self.process_variants(:hmes_frags)
       if Options.polyploidy
-        self.process_variants
-        @variants.bfr_frags
-      else
-        self.process_variants
+        self.process_variants(:bfr_frags)
       end
       @has_run = true
     end
data/lib/cheripic/options.rb
CHANGED
@@ -12,6 +12,8 @@ module Cheripic
         :htlow => 0.2,
         :hthigh => 0.9,
         :mindepth => 6,
+        :maxdepth => 0,
+        :max_d_multiple => 5,
         :min_non_ref_count => 3,
         :min_indel_count_support => 3,
         :ambiguous_ref_bases => false,
@@ -53,6 +55,26 @@ module Cheripic
       @user_settings[:mindepth]
     end
 
+    # Maximum read coverage at the variant position to be considered for analysis
+    # @return [Integer]
+    def self.maxdepth
+      @user_settings[:maxdepth]
+    end
+
+    # Setting maximum read coverage at the variant position to be considered for analysis
+    # @param value [Integer] provided integer value will be updated as maxdepth
+    # @return [Integer] updated maxdepth value
+    def self.maxdepth=(value)
+      @user_settings[:maxdepth] = value
+    end
+
+    # Multiplication factor to average coverage to calculate maximum read coverage
+    # at the variant position to be considered for analysis
+    # @return [Integer]
+    def self.max_d_multiple
+      @user_settings[:max_d_multiple]
+    end
+
     # Minimum non reference count at the variant position to be considered for analysis
     # @return [Integer]
     def self.min_non_ref_count
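The two settings added in this file work as a pair: `max_d_multiple` scales an average coverage, and the result is written back through `maxdepth=`. Below is a minimal sketch of that pattern, assuming a stand-in module named `SketchOptions` and an invented mean coverage of 12.0; only the accessor shapes and the defaults (`0` and `5`) come from the diff.

```ruby
# Hypothetical stand-in for Cheripic::Options, reduced to the two new settings.
module SketchOptions
  @user_settings = { maxdepth: 0, max_d_multiple: 5 }

  # Reader for the maximum coverage cut-off (0 means "no cut-off" in the diff).
  def self.maxdepth
    @user_settings[:maxdepth]
  end

  # Writer used once the cut-off has been derived from the data.
  def self.maxdepth=(value)
    @user_settings[:maxdepth] = value
  end

  # Multiplier applied to the mean coverage.
  def self.max_d_multiple
    @user_settings[:max_d_multiple]
  end
end

mean_coverage = 12.0 # invented value, for illustration only
SketchOptions.maxdepth = SketchOptions.max_d_multiple * mean_coverage
puts SketchOptions.maxdepth
```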
data/lib/cheripic/variants.rb
CHANGED
@@ -4,6 +4,36 @@ require 'forwardable'
 
 module Cheripic
 
+  require 'bio-samtools'
+  require 'bio/db/sam'
+  require 'open3'
+
+  # An extension of Bio::DB::Sam object to modify depth method
+  class Bio::DB::Sam
+
+    # A method to retrieve depth information from bam object
+    # @param opts [Hash] a hash of following input options
+    # b [File] list of positions or regions in BED format
+    # l [INT] minQLen
+    # q [INT] base quality threshold
+    # Q [INT] mapping quality threshold
+    # r [chr:from-to] region
+    # @returns a block with each line reporting sequence_name, position and depth
+    def depth(opts={})
+      command = form_opt_string(self.samtools, 'depth', opts)
+      # capture returns string output, so careful not to give whole genome or big contigs for depth analysis
+      stdout, stderr, status = Open3.capture3(command)
+      unless status.success?
+        logger.error "resulted in exit code #{status.exitstatus} using #{command}"
+        logger.error "stderr output is: #{stderr}"
+        raise CheripicError
+      end
+      # return stdout
+      stdout
+    end
+
+  end
+
   # Custom error handling for Variants class
   class VariantsError < CheripicError; end
 
@@ -27,10 +57,10 @@ module Cheripic
     include Enumerable
     extend Forwardable
     def_delegators :@assembly, :each, :each_key, :each_value, :size, :length, :[]
-    attr_reader :assembly, :pileups, :
+    attr_reader :assembly, :pileups, :pileups_analyzed
 
     # creates a Variants object using user input files
-    # @param options [
+    # @param options [OpenStruct] a hash of required input files as keys and file paths as values
     def initialize(options)
       @params = options
       @assembly = {}
@@ -50,25 +80,76 @@ module Cheripic
         @pileups[contig.id] = ContigPileups.new(contig.id)
       end
       @pileups_analyzed = false
+      unless @params.repeats_file == ''
+        store_repeat_regions
+      end
+    end
+
+    # reads repeat masker output file and stores masked regions to ignore variants in thos regions
+    def store_repeat_regions
+      File.foreach(@params.repeats_file) do |line|
+        line.strip!
+        next if line =~ /^SW/ or line =~ /^score/ or line == ''
+        info = line.split("\s")
+        pileups_obj = @pileups[info[4]]
+        index = pileups_obj.masked_regions.length
+        pileups_obj.masked_regions[index + 1][:begin] = info[5].to_i
+        pileups_obj.masked_regions[index + 1][:end] = info[6].to_i
+      end
     end
 
     # Reads and store pileup data for each of input bulk and parents pileup files
     # And sets pileups_analyzed to true that pileups files are processed
     def analyse_pileups
-
-
-
-
+      if @params.input_format == 'bam'
+        @vcf_hash = Vcf.filtering(@params.mut_bulk_vcf, @params.bg_bulk_vcf)
+      end
       %i{mut_bulk bg_bulk mut_parent bg_parent}.each do | input |
         infile = @params[input]
         if infile != ''
-
+          logger.info "processing #{input} file"
+          if @params.input_format == 'pileup'
+            extract_pileup(infile, input)
+          else
+            extract_bam_pileup(infile, input)
+          end
         end
       end
 
       @pileups_analyzed = true
     end
 
+    # Bam object is read and each contig mean and std deviation of depth calculated
+    # @param bamobject [Bio::DB::Sam]
+    # Open3 capture returns string output, so careful not to give whole genome or big contigs for depth analysis
+    def set_max_depth(bamobject, bamfile)
+      logger.info "processing #{bamfile} file for depth"
+      all_depths = []
+      bq = Options.base_quality
+      mq = Options.mapping_quality
+      @assembly.each_key do | id |
+        contig_obj = @assembly[id]
+        len = contig_obj.length
+        data = bamobject.depth(:r => "#{id}", :Q => bq, :q => mq)
+        depths = []
+        data.split("\n").each do |line|
+          info = line.split("\t")
+          depths << info[2].to_i
+        end
+        variance = 0
+        mean_depth = depths.reduce(0, :+) / len.to_f
+        depths.each do |value|
+          variance += (value.to_f - mean_depth)**2
+        end
+        all_depths << mean_depth
+        contig_obj.sd_depth = Math.sqrt(variance)
+        contig_obj.mean_depth = mean_depth
+      end
+      # setting max depth as 3 times the average depth
+      mean_coverage = all_depths.reduce(0, :+) / @assembly.length.to_f
+      Options.maxdepth = Options.max_d_multiple * mean_coverage
+    end
+
     # Input pileup file is read and positions are selected that pass the thresholds
     # @param pileupfile [String] path to the pileup file to read
     # @param sym [Symbol] Symbol of the pileup file used to write selected variants
@@ -84,6 +165,54 @@ module Cheripic
       end
     end
 
+    # Input bamfile is read and selected positions pileups are stored
+    # @param bamfile [String] path to the bam file to read
+    # @param sym [Symbol] Symbol of the bam file used to write selected variants
+    # pileup information to respective ContigPileups object
+    def extract_bam_pileup(bamfile, sym)
+      bq = Options.base_quality
+      mq = Options.mapping_quality
+      bamobject = Bio::DB::Sam.new(:bam=>bamfile, :fasta=>@params.assembly)
+      bamobject.index unless bamobject.indexed?
+
+      # check if user has set max depth or set to zero to ignore
+      max_d = Options.maxdepth
+      # or calculate from bamfile
+      if Options.max_d_multiple > 0
+        set_max_depth(bamobject, bamfile)
+        max_d = Options.maxdepth
+        logger.info "max depth used for #{sym} file\t#{max_d}"
+      end
+
+      @vcf_hash.each_key do | id |
+        positions = @vcf_hash[id][:het].keys
+        positions << @vcf_hash[id][:hom].keys
+        positions.flatten!
+        next if positions.empty?
+        contig_obj = @pileups[id]
+        positions.each do | pos |
+          command = "#{bamobject.samtools} mpileup -r #{id}:#{pos}-#{pos} -Q #{bq} -q #{mq} -B -f #{@params.assembly} #{bamfile}"
+          stdout, stderr, status = Open3.capture3(command)
+          unless status.success?
+            logger.error "resulted in exit code #{status.exitstatus} using #{command}"
+            logger.error "stderr output is: #{stderr}"
+            raise CheripicError
+          end
+          stdout.chomp!
+          if stdout == '' or stdout.split("\t")[3].to_i == 0 or stdout =~ /^\t0/
+            logger.info "pileup data empty for\t#{id}\t#{pos}"
+          else
+            pileup = Pileup.new(stdout)
+            unless max_d == 0 or pileup.coverage <= max_d
+              logger.info "pileup coverage is higher than max\t#{pileup.to_s}"
+              next
+            end
+            contig_obj.send(sym).store(pos, pileup)
+          end
+        end
+      end
+    end
+
     # Once pileup files are analysed and variants are extracted from each bulk;
     # bulks are compared to identify and isolate variants for downstream analysis.
     # If polyploidy set to trye and mut_parent and bg_parent bulks are provided
@@ -95,8 +224,10 @@ module Cheripic
       @assembly.each_key do | id |
         contig = @assembly[id]
         # extract parental hemi snps for polyploids before bulks are compared
-        if
-          @
+        if Options.polyploidy
+          if @params.mut_parent != '' or @params.bg_parent != ''
+            @pileups[id].hemisnps_in_parent
+          end
         end
         contig.hm_pos, contig.ht_pos, contig.hemi_pos = @pileups[id].bulks_compared
       end
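The per-contig depth statistics computed in `set_max_depth` in this file's diff can be checked by hand on a small input. The sketch below is illustrative only: `depth_stats` is a hypothetical helper and the depth values are invented. Note one deliberate difference: the diff stores `Math.sqrt(variance)` where `variance` is the running sum of squared deviations, while this sketch divides that sum by the number of positions first. Treating the intended statistic as the population standard deviation is an assumption.

```ruby
# Hypothetical helper: mean and population standard deviation of
# per-position read depths for one contig.
def depth_stats(depths)
  n = depths.length.to_f
  mean = depths.reduce(0, :+) / n
  # Sum of squared deviations from the mean, averaged over all positions.
  variance = depths.reduce(0.0) { |acc, d| acc + (d - mean)**2 } / n
  [mean, Math.sqrt(variance)]
end

depths = [2, 4, 4, 4, 5, 5, 7, 9] # invented depths, one per position
mean, sd = depth_stats(depths)
puts "mean=#{mean} sd=#{sd}"
```

For these eight values the sum is 40, so the mean is 5.0; the squared deviations add up to 32, giving a variance of 4.0 and a standard deviation of 2.0.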
data/lib/cheripic/vcf.rb
ADDED
@@ -0,0 +1,83 @@
+# encoding: utf-8
+
+module Cheripic
+
+  # Custom error handling for Vcf class
+  class VcfError < CheripicError; end
+
+  require 'bio-samtools'
+
+  class Vcf
+
+    def self.get_allele_freq(vcf_obj)
+      # check if the vcf is from samtools (has DP4 and AF1 fields in INFO)
+      if vcf_obj.info.key?('DP4')
+        freq = vcf_obj.info['DP4'].split(',')
+        depth = freq.inject { | sum, n | sum.to_f + n.to_f }
+        alt = freq[2].to_f + freq[3].to_f
+        allele_freq = alt / depth
+        # allele_freq = vcf_obj.non_ref_allele_freq
+      # check if the vcf is from VarScan (has RD, AD and FREQ fields in FORMAT)
+      elsif vcf_obj.samples['1'].key?('RD')
+        alt = vcf_obj.samples['1']['AD'].to_f
+        depth = vcf_obj.samples['1']['RD'].to_f + alt
+        allele_freq = alt / depth
+      # check if the vcf is from GATK (has AD and GT fields in FORMAT)
+      elsif vcf_obj.samples['1'].key?('AD') and vcf_obj.samples['1']['AD'].include?(',')
+        freq = vcf_obj.samples['1']['AD'].split(',')
+        allele_freq = freq[1].to_f / ( freq[0].to_f + freq[1].to_f )
+      # check if the vcf has has AF fields in INFO
+      elsif vcf_obj.info.key?('AF')
+        allele_freq = vcf_obj.info['AF'].to_f
+      else
+        raise VcfError.new 'not a supported vcf format (VarScan, GATK, Bcftools(Samtools), Vcf 4.0, 4.1 and 4.2)' +
+          " and check that it is one sample vcf\n"
+      end
+      allele_freq
+    end
+
+
+    ##Input: vcf file
+    ##Ouput: lists of hm and ht SNPS and hash of all fragments with variants
+    def self.get_vars(vcf_file)
+      ht_low = Options.htlow
+      ht_high = Options.hthigh
+
+      # hash of :het and :hom with frag ids and respective variant positions
+      var_pos = Hash.new{ |h,k| h[k] = Hash.new(&h.default_proc) }
+      File.foreach(vcf_file) do |line|
+        next if line =~ /^#/
+        v = Bio::DB::Vcf.new(line)
+        unless v.alt == '.'
+          allele_freq = get_allele_freq(v)
+          if allele_freq.between?(ht_low, ht_high)
+            var_pos[v.chrom][:het][v.pos] = allele_freq
+          elsif allele_freq > ht_high
+            var_pos[v.chrom][:hom][v.pos] = allele_freq
+          end
+        end
+      end
+      var_pos
+    end
+
+    def self.filtering(mutant_vcf, bgbulk_vcf)
+      var_pos_mut = get_vars(mutant_vcf)
+      return var_pos_mut if bgbulk_vcf == ''
+      var_pos_bg = get_vars(bgbulk_vcf)
+
+      # if both bulks have homozygous mutations at same positions then deleting them
+      var_pos_mut.each_key do | frag |
+        positions = var_pos_mut[frag][:hom].keys
+        pos_bg_bulk = var_pos_bg[frag][:hom].keys
+        positions.each do |pos|
+          if pos_bg_bulk.include?(pos)
+            var_pos_mut[frag][:hom].delete(pos)
+          end
+        end
+      end
+      var_pos_mut
+    end
+
+  end
+
+end
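The samtools branch of `get_allele_freq` and the threshold test in `get_vars` can be tried without a VCF parser. The sketch below is illustrative only: `dp4_allele_freq` and `zygosity` are hypothetical helpers that take a raw `DP4` string, and the default thresholds 0.2 and 0.9 are the `htlow` and `hthigh` defaults shown in the options.rb diff. `DP4` lists high-quality ref-forward, ref-reverse, alt-forward and alt-reverse read counts.

```ruby
# Hypothetical helper: alternate allele fraction from a DP4 string
# ("ref_fwd,ref_rev,alt_fwd,alt_rev"). Returns nil when there are no reads.
def dp4_allele_freq(dp4)
  counts = dp4.split(',').map(&:to_f)
  total = counts.reduce(0.0, :+)
  return nil if total.zero?
  (counts[2] + counts[3]) / total
end

# Hypothetical helper mirroring the threshold test in get_vars:
# within [ht_low, ht_high] is heterozygous, above ht_high is homozygous,
# anything below ht_low is not reported.
def zygosity(freq, ht_low = 0.2, ht_high = 0.9)
  return :het if freq.between?(ht_low, ht_high)
  return :hom if freq > ht_high
  nil
end

het_freq = dp4_allele_freq('10,8,9,9')  # 18 alt reads out of 36
hom_freq = dp4_allele_freq('0,1,12,7')  # 19 alt reads out of 20
puts zygosity(het_freq).inspect
puts zygosity(hom_freq).inspect
```

Unlike the method in the diff, the sketch returns `nil` for an all-zero `DP4` instead of dividing by zero; that guard is an addition for the example, not something the diff contains.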
data/lib/cheripic/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: cheripic
 version: !ruby/object:Gem::Version
-  version: 1.2.
+  version: 1.2.5
 platform: ruby
 authors:
 - Shyam Rallapalli
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-
+date: 2016-10-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yell
@@ -84,40 +84,6 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: 2.4.0
-- !ruby/object:Gem::Dependency
-  name: bio-gngm
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 0.2.1
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 0.2.1
-- !ruby/object:Gem::Dependency
-  name: rinruby
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 2.0.3
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.0'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 2.0.3
 - !ruby/object:Gem::Dependency
   name: activesupport
   requirement: !ruby/object:Gem::Requirement
@@ -259,6 +225,7 @@ files:
 - ".gitignore"
 - ".travis.yml"
 - CODE_OF_CONDUCT.md
+- ChangeLog.md
 - Gemfile
 - LICENSE.txt
 - README.md
@@ -267,6 +234,7 @@ files:
 - bin/console
 - bin/setup
 - cheripic.gemspec
+- galaxy_cheripic_tool.xml
 - lib/cheripic.rb
 - lib/cheripic/bfr.rb
 - lib/cheripic/cmd.rb
@@ -277,6 +245,7 @@ files:
 - lib/cheripic/pileup.rb
 - lib/cheripic/regions.rb
 - lib/cheripic/variants.rb
+- lib/cheripic/vcf.rb
 - lib/cheripic/version.rb
 homepage: https://github.com/shyamrallapalli/cheripic
 licenses: