cheripic 1.2.0 → 1.2.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 958b4091f2c95903c3a43a13af7d75cbc7605813
4
- data.tar.gz: 18b91af8e68553f4d1700dae921beb7e420f11ac
3
+ metadata.gz: 824a8c68d3707ad02cf0d3b7d567191244d1a5a6
4
+ data.tar.gz: 6d2b3c7bef04aba06b5206968d1a0d69996a25b0
5
5
  SHA512:
6
- metadata.gz: 7290c13e270aae1a777767179168353c5c55a035bfd6e82025d000414112425e77f59ad7a6fd0c736d2d2775182e17f978ffa6b67153201e9d316458a6360db6
7
- data.tar.gz: 595be6e01fdc4e0d6185a86f79f207abb2dc6bf50a8763a67186339e639542294185656b204cde8d86cda2f3519f7ae3341443a978f9f019c51cfef294513694
6
+ metadata.gz: 6ae5a85c30a0b1ea19f118409ddec95a6c7c3e11f00663e9769a5642770e90cb2ab5b0200f9d9eaa4ed8c6873492ac7f5f3acd568dc0cc14ddf8ccaac5012435
7
+ data.tar.gz: efe77b2ccafd0ad7ed4eeb47b497207cacf3dbaee058779b46af7a5ca34597991a3949dbeb0f3807bba69ea7dca0dd7e24fb58c040bf0065fef2ca1e4e3424fc
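The digests above cover the two archives packed inside the `.gem` file. A minimal sketch of recomputing them locally, assuming the gem has already been unpacked (for example with `tar -xf`) so that `metadata.gz` and `data.tar.gz` sit in the current directory; the helper name `archive_digests` is hypothetical and not part of cheripic or RubyGems:

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper: returns the SHA1 and SHA512 hex digests of a file,
# keyed the same way as the two sections of checksums.yaml.
def archive_digests(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Print digests for whichever archives are present, for manual comparison
# with the '+' lines of the diff.
%w[metadata.gz data.tar.gz].each do |name|
  next unless File.exist?(name) # skip when the archive has not been unpacked here
  digests = archive_digests(name)
  puts "#{name} SHA1:   #{digests['SHA1']}"
  puts "#{name} SHA512: #{digests['SHA512']}"
end
```

If the printed values differ from the `+` lines, the local archives are not the ones this release was built from.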
data/ChangeLog.md ADDED
@@ -0,0 +1,21 @@
1
+ ### Change Log
2
+
3
+ All significant changes to this project at each release are documented in this file.
4
+
5
+
6
+ #### Future changes to include
7
+
8
+ 1. option to take multiple background pileup files
9
+ 2. replace output directory with output file name tag, since we only write to one file
10
+ 3. option to take bam file or pileup file as inputs of bulks
11
+
12
+ #### [1.2.0] - 2016-08-11
13
+
14
+ 1. fixed calculation of heterozygosity for background bulks
15
+ 2. changed command line boolean option to be set using only true or false
16
+ 3. included command line option to set length of sequence to retrieve on either side of each variant
17
+
18
+
19
+ #### [1.1.0] - 2016-07-26
20
+
21
+ first release of the binaries for Linux 64 bit and OSX 64bit
data/Gemfile CHANGED
@@ -1,5 +1,4 @@
1
1
  source 'https://rubygems.org'
2
- ruby '2.1.5'
3
2
 
4
3
  # Specify your gem's dependencies in cheripic.gemspec
5
4
  gemspec
data/README.md CHANGED
@@ -11,6 +11,7 @@ Computing Homozygosity Enriched Regions In genomes to Prioritize Identification
11
11
  is a ruby tool to pick causative mutations from bulk segregant sequencing.
12
12
 
13
13
  Currently this gem is still in development and nearing complete working package.
14
+ The software currently accepts only pileup files as input; support for bam and vcf files will be implemented in the future
14
15
 
15
16
 
16
17
  ## Installation
@@ -20,7 +21,7 @@ Binaries are available for Linux 64bit and OSX.
20
21
  Best way to use Cheripic is to download the appropriate binary archive,
21
22
  unpack (`tar -xzf`) and add the unpacked directory to your `PATH`
22
23
 
23
- Latest binaries are available to [download here](https://github.com/shyamrallapalli/cheripic/releases/tag/v1.1.0)
24
+ Latest binaries are available to [download here](https://github.com/shyamrallapalli/cheripic/releases/latest)
24
25
 
25
26
 
26
27
  To install gem and use the gem in your development
@@ -44,7 +45,7 @@ Running `cheripic` without any input at command line interface shows following h
44
45
 
45
46
  ```
46
47
 
47
- Cheripic v1.1.0
48
+ Cheripic v1.2.0
48
49
  Authors: Shyam Rallapalli and Dan MacLean
49
50
 
50
51
  Description: Candidate mutation and closely linked marker selection for non reference genomes
@@ -59,30 +60,31 @@ USAGE:
59
60
  cheripic <options>
60
61
 
61
62
  OPTIONS:
62
- -f, --assembly=<s> Assembly file in FASTA format
63
- -F, --input-format=<s> bulk and parent alignment file format types - set either pileup or bam (default: pileup)
64
- -a, --mut-bulk=<s> Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1
65
- -b, --bg-bulk=<s> Pileup or sorted BAM file alignments from background/wildtype bulk 2
66
- --output=<s> Directory to store results, will be created if not existing (default: cheripic_results)
67
- --loglevel=<s> Choose any one of "info / warn / debug" level for logs generated (default: debug)
68
- --hmes-adjust=<f> factor added to snp count of each contig to adjust for hme score calculations (default: 0.5)
69
- --htlow=<f> lower level for categorizing heterozygosity (default: 0.2)
70
- --hthigh=<f> high level for categorizing heterozygosity (default: 0.9)
71
- --mindepth=<i> minimum read depth to conisder a position for variant calls (default: 6)
72
- --min-non-ref-count=<i> minimum read depth supporting non reference base at each position (default: 3)
73
- --min-indel-count-support=<i> minimum read depth supporting an indel at each position (default: 3)
74
- --ignore-reference-n, --no-ignore-reference-n ignore variant calls at N (completely ambigous) bases in the reference (default: true)
75
- -q, --mapping-quality=<i> minimum mapping quality of read covering the position (default: 20)
76
- -Q, --base-quality=<i> minimum base quality of bases covering the position (default: 15)
77
- --noise=<f> praportion of reads for a variant to conisder as noise (default: 0.1)
78
- --cross-type=<s> type of cross used to generated mapping population - back or out (default: back)
79
- --only-frag-with-vars, --no-only-frag-with-vars select only contigs containing variants for analysis (default: true)
80
- --filter-out-low-hmes, --no-filter-out-low-hmes ignore variants from contigs with low hmescore or bfr to list in the final output (default: true)
81
- --polyploidy Set if the data input is from polyploids
82
- -p, --mut-parent=<s> Pileup or sorted BAM file alignments from mutant/trait of interest parent (default: )
83
- -r, --bg-parent=<s> Pileup or sorted BAM file alignments from background/wildtype parent (default: )
84
- --bfr-adjust=<f> factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)
85
- --examples shows some example commands with explanation
63
+ -f, --assembly=<s> Assembly file in FASTA format
64
+ -F, --input-format=<s> bulk and parent alignment file format types - set either pileup or bam (default: pileup)
65
+ -a, --mut-bulk=<s> Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1
66
+ -b, --bg-bulk=<s> Pileup or sorted BAM file alignments from background/wildtype bulk 2
67
+ --output=<s> Directory to store results, will be created if not existing (default: cheripic_results)
68
+ --loglevel=<s> Choose any one of "info / warn / debug" level for logs generated (default: debug)
69
+ --hmes-adjust=<f> factor added to snp count of each contig to adjust for hme score calculations (default: 0.5)
70
+ --htlow=<f> lower level for categorizing heterozygosity (default: 0.2)
71
+ --hthigh=<f> high level for categorizing heterozygosity (default: 0.9)
72
+ --mindepth=<i> minimum read depth to conisder a position for variant calls (default: 6)
73
+ --min-non-ref-count=<i> minimum read depth supporting non reference base at each position (default: 3)
74
+ --min-indel-count-support=<i> minimum read depth supporting an indel at each position (default: 3)
75
+ --ambiguous-ref-bases including variant at completely ambiguous bases in the reference
76
+ -q, --mapping-quality=<i> minimum mapping quality of read covering the position (default: 20)
77
+ -Q, --base-quality=<i> minimum base quality of bases covering the position (default: 15)
78
+ --noise=<f> praportion of reads for a variant to conisder as noise (default: 0.1)
79
+ --cross-type=<s> type of cross used to generated mapping population - back or out (default: back)
80
+ --use-all-contigs option to select all contigs or only contigs containing variants for analysis
81
+ --include-low-hmes option to include or discard variants from contigs with low hme-score or bfr score to list in the final output
82
+ --polyploidy Set if the data input is from polyploids
83
+ -p, --mut-parent=<s> Pileup or sorted BAM file alignments from mutant/trait of interest parent (default: )
84
+ -r, --bg-parent=<s> Pileup or sorted BAM file alignments from background/wildtype parent (default: )
85
+ --bfr-adjust=<f> factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)
86
+ --sel-seq-len=<i> sequence length to print from either side of selected variants (default: 50)
87
+ --examples shows some example commands with explanation
86
88
 
87
89
  ```
88
90
 
@@ -98,7 +100,7 @@ EXAMPLE COMMANDS:
98
100
  --mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true --output cheripic_results
99
101
  3. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
100
102
  --mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true
101
- --no-only-frag-with-vars --no-filter-out-low-hmes --output cheripic_results
103
+ --use-all-contigs true --include-low-hmes true --output cheripic_results
102
104
 
103
105
  ```
104
106
 
data/Rakefile CHANGED
@@ -23,7 +23,7 @@ TRAVELING_RUBY_VERSION = '20150210-2.1.5'
23
23
  # http://d6r77u77i8pq3.cloudfront.net/releases/traveling-ruby-20150210-2.1.5-osx.tar.gz
24
24
 
25
25
  desc 'Package your app'
26
- task :package => ['package:linux:x86_64', 'package:osx']
26
+ task :package => %w(package:linux:x86_64 package:osx)
27
27
 
28
28
  namespace :package do
29
29
 
@@ -71,8 +71,12 @@ def create_package(target)
71
71
  sh "cp packaging/cheripic.gemspec Gemfile Gemfile.lock LICENSE.txt #{package_dir}/lib/app/"
72
72
  sh "mkdir #{package_dir}/lib/app/.bundle"
73
73
  sh "cp packaging/bundler-config #{package_dir}/lib/app/.bundle/config"
74
- # if !ENV['DIR_ONLY']
75
- # sh "tar -czf #{package_dir}.tar.gz #{package_dir}"
76
- # sh "rm -rf #{package_dir}"
77
- # end
74
+ if target == 'linux-x86_64'
75
+ sh "cp -p packaging/linux-x86_64_samtools/external/* packaging/cheripic-#{VERSION}-linux-x86_64/lib/app/ruby/2.1.0/gems/bio-samtools-2.4.0/lib/bio/db/sam/external/"
76
+ end
77
+ unless ENV['DIR_ONLY']
78
+ Dir.chdir('packaging') do
79
+ sh "gtar -czf #{package_dest}.tar.gz #{package_dest}"
80
+ end
81
+ end
78
82
  end
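The first Rakefile hunk is purely stylistic: `%w(...)` is Ruby's literal for an array of whitespace-separated strings, so the task's prerequisites are unchanged. A short check, using only the two task names shown in the diff (the variable names are illustrative):

```ruby
# Prerequisite list as written before the change.
old_style = ['package:linux:x86_64', 'package:osx']

# Prerequisite list as written after the change.
new_style = %w(package:linux:x86_64 package:osx)

# Both literals build the same array of strings.
puts old_style == new_style
```

The second hunk is a behavioural change: it re-enables archive creation, which was previously commented out, and copies bundled samtools files into the Linux package.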
data/bin/cheripic CHANGED
@@ -1,4 +1,5 @@
1
1
  #!/usr/bin/env ruby
2
+ $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
2
3
  require 'cheripic'
3
4
 
4
5
  # rescue errors to get clean error messages through the logger
data/cheripic.gemspec CHANGED
@@ -23,8 +23,6 @@ Gem::Specification.new do |spec|
23
23
  spec.add_runtime_dependency 'trollop', '~> 2.1', '>= 2.1.2'
24
24
  spec.add_runtime_dependency 'bio', '~> 1.5', '>= 1.5.0'
25
25
  spec.add_dependency 'bio-samtools', '~> 2.4.0'
26
- spec.add_dependency 'bio-gngm', '~> 0.2.1'
27
- spec.add_runtime_dependency 'rinruby', '~> 2.0', '>= 2.0.3'
28
26
 
29
27
  spec.add_development_dependency 'activesupport', '~> 4.2.6'
30
28
  spec.add_development_dependency 'bundler', '~> 1.7.6'
@@ -0,0 +1,196 @@
1
+ <tool id="cheripic" name="CHERIPIC" version="1.2.0">
2
+
3
+ <description>CHERIPIC</description>
4
+
5
+ <version_command>cheripic -v</version_command>
6
+
7
+ <command>
8
+ <![CDATA[
9
+ cheripic
10
+ --assembly $assembly
11
+ --mut-bulk $mut_bulk
12
+ --bg-bulk $bg_bulk
13
+ --output $output
14
+ --loglevel $loglevel
15
+ --hmes-adjust $hmes_adjust
16
+ --htlow $ht_low
17
+ --hthigh $ht_high
18
+ --mindepth $min_depth
19
+ --min-non-ref-count $min_non_ref_count
20
+ --min-indel-count-support $min_indel_count_support
21
+ --ambiguous-ref-bases $ambiguous_ref_bases
22
+ --mapping-quality $mapping_quality
23
+ --base-quality $base_quality
24
+ --noise $noise
25
+ --cross-type $cross_type
26
+ --use-all-contigs $use_all_contigs
27
+ --include-low-hmes $include_low_hmes
28
+ --polyploidy $polyploidy
29
+ --mut-parent $mut_parent
30
+ --bg-parent $bg_parent
31
+ --bfr-adjust $bfr_adjust
32
+ --sel-seq-len $sel_seq_len
33
+ ]]>
34
+ </command>
35
+
36
+ <inputs>
37
+ <param name="assembly" type="data" format="fasta" label="Input Assembly file" help="Select Assembly fasta file" />
38
+ <param name="mut_bulk" type="data" format="pileup" label="mutant bulk pileup file" help="Select mutant bulk pileup file" />
39
+ <param name="bg_bulk" type="data" format="pileup" label="background bulk pileup file" min="1" multiple="true" help="Select background bulk pileup file" />
40
+ <param name="loglevel" type="select" optional="true" label="analysis log level" help="choose between info, warn and debug levels">
41
+ <option value="info" selected="true">info </option>
42
+ <option value="warn">warnings</option>
43
+ <option value="debug">debug</option>
44
+ </param>
45
+ <param name="hmes_adjust" size="4" type="float" optional="true" value="0.5" min="0.01" max="1.0"
46
+ label="hme score adjuster" help="factor added to snp count of each contig to adjust for hme score calculations" />
47
+ <param name="ht_low" size="4" type="float" optional="true" value="0.25" min="0.1" max="1.0"
48
+ label="heterozygosity low limit" help="lower limit to heterozygosity allele fraction" />
49
+ <param name="ht_high" size="4" type="float" optional="true" value="0.75" min="0.1" max="1.0"
50
+ label="heterozygosity high limit" help="upper limit to heterozygosity allele fraction" />
51
+ <param name="min_depth" size="4" type="integer" optional="true" value="6" min="1" max="8000"
52
+ label="minimum read coverage" help="minimum read depth to consider a position for variant calls" />
53
+ <param name="min_non_ref_count" size="4" type="integer" optional="true" value="3" min="1" max="8000"
54
+ label="minimum alternate read coverage" help="minimum read depth supporting non reference base at each position" />
55
+ <param name="min_indel_count_support" size="4" type="integer" optional="true" value="3" min="1" max="8000"
56
+ label="minimum indel read coverage" help="minimum read depth supporting an indel at each position" />
57
+ <param name="ambiguous_ref_bases" type="boolean" optional="true" checked="false" label="ambiguous reference position"
58
+ help="including variant at completely ambiguous bases in the reference" truevalue="true" falsevalue="false" />
59
+ <param name="mapping_quality" size="4" type="integer" optional="true" value="20" min="0" max="255"
60
+ label="minimum mapping quality" help="minimum mapping quality of read covering the position" />
61
+ <param name="base_quality" size="4" type="integer" optional="true" value="15" min="0" max="40"
62
+ label="minimum base quality" help="minimum base quality of nucleotides covering the position" />
63
+ <param name="noise" size="4" type="float" optional="true" value="0.1" min="0" max="0.2"
64
+ label="read noise" help="proportion of reads supporting a variant below which it is considered noise" />
65
+ <param name="cross_type" type="select" optional="true" label="cross type" help="type of cross used to generate mapping population - back or out" >
66
+ <option value="back" selected="true">back cross</option>
67
+ <option value="out">out cross</option>
68
+ </param>
69
+
70
+ <param name="use_all_contigs" type="boolean" optional="true" checked="false" label="use all contigs in analysis"
71
+ help="option to select all contigs or only contigs containing variants for analysis" truevalue="true" falsevalue="false" />
72
+ <param name="include_low_hmes" type="boolean" optional="true" checked="false" label="no hme or bfr score cut off"
73
+ help="option to include or discard variants from contigs with low hme-score or bfr score to list in the final output" truevalue="true" falsevalue="false" />
74
+ <param name="polyploidy" type="boolean" optional="true" checked="false" label="polyploid data"
75
+ help="Set if the input data is from polyploids" truevalue="true" falsevalue="false" />
76
+ <param name="mut_parent" type="data" optional="true" format="pileup" label="mutant parent pileup file" help="Select mutant parent pileup file" />
77
+ <param name="bg_parent" type="data" optional="true" format="pileup" label="background parent pileup file" help="Select background parent pileup file" />
78
+
79
+ <param name="bfr_adjust" size="4" type="float" optional="true" value="0.05" min="0.01" max="1.0"
80
+ label="bfr score adjuster" help="factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)" />
81
+ <param name="sel_seq_len" size="4" type="integer" optional="true" value="50" min="10" max="250"
82
+ label="selected variant seq length out" help="sequence length to print from either side of selected variants (default: 50)" />
83
+
84
+ <param name="output" type="text" size="30" value="cheripic_results" label="tag for output filename" help="write a tag to include with output filename" />
85
+ </inputs>
86
+
87
+ <outputs>
88
+ <data name="output_1" format="txt" file="${output}_selected_hme_variants.txt" />
89
+ <data name="output_2" format="txt" file="${output}_selected_bfr_variants.txt" />
90
+ </outputs>
91
+
92
+ <tests>
93
+ <test>
94
+ <param name="assembly" value="picked_fasta.fa" ftype="fasta" />
95
+ <param name="mut_bulk" value="mut_bulk.pileup" ftype="pileup" />
96
+ <param name="bg_bulk" value="wt_bulk.pileup" ftype="pileup" />
97
+ <output name="output" ftype="txt" file="selected_variants.out" />
98
+ </test>
99
+ </tests>
100
+
101
+ <help>
102
+
103
+ **Computing Homozygosity Enriched Regions In genomes to Prioritize Identification of Candidate variants (CHERIPIC)**
104
+
105
+ CHERIPIC is a ruby tool to pick causative mutation from bulk segregant sequencing
106
+
107
+ ------
108
+
109
+ **What it does**
110
+
111
+ This tool uses ``cheripic`` to analyse bulk segregant sequencing data and identify causative mutations
112
+
113
+
114
+ .. class:: infomark
115
+
116
+ Provides a list of snps that could be either closely linked markers or the causative mutation.
117
+
118
+ ------
119
+
120
+ **Input formats**
121
+
122
+ assembly file should be a fasta file used for generating pileups from bulks
123
+ bulk alignment files should be pileup files
124
+
125
+ ------
126
+
127
+ **Outputs**
128
+
129
+ The output is a text file, and has the following columns::
130
+
131
+ Column Description
132
+ ----------------- --------------------------------------------------------
133
+ 1 HME_Score Homozygosity Enrichment score
134
+ 2 AlleleFreq Allele frequency
135
+ 3 seq_id Contig/Scaffold id
136
+ 4 position 1-based index of the position in contig
137
+ 5 ref_base Reference nucleotide at the position
138
+ 6 coverage read depth
139
+ 7 bases read bases
140
+ 8 base_quals read base qualities
141
+ 9 sequence_left selected size of reference sequence to the left of the variant
142
+ 10 Alt_seq Alternate allele at the position
143
+ 11 sequence_right selected size of reference sequence to the right of the variant
144
+
145
+ ------
146
+
147
+ **cheripic settings**
148
+
149
+ All of the options have a default value. You can change any of them. All of the options are implemented.
150
+
151
+ ------
152
+
153
+ **cheripic parameter list**
154
+
155
+ OPTIONS:
156
+ -f, --assembly Assembly file in FASTA format
157
+ -F, --input-format bulk and parent alignment file format types - set either pileup or bam (default: pileup)
158
+ -a, --mut-bulk Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1
159
+ -b, --bg-bulk Pileup or sorted BAM file alignments from background/wildtype bulk 2
160
+ --output Directory to store results, will be created if not existing (default: cheripic_results)
161
+ --loglevel Choose any one of "info / warn / debug" level for logs generated (default: debug)
162
+ --hmes-adjust factor added to snp count of each contig to adjust for hme score calculations (default: 0.5)
163
+ --htlow lower level for categorizing heterozygosity (default: 0.2)
164
+ --hthigh high level for categorizing heterozygosity (default: 0.9)
165
+ --mindepth minimum read depth to conisder a position for variant calls (default: 6)
166
+ --min-non-ref-count minimum read depth supporting non reference base at each position (default: 3)
167
+ --min-indel-count-support minimum read depth supporting an indel at each position (default: 3)
168
+ --ambiguous-ref-bases including variant at completely ambiguous bases in the reference
169
+ -q, --mapping-quality minimum mapping quality of read covering the position (default: 20)
170
+ -Q, --base-quality minimum base quality of bases covering the position (default: 15)
171
+ --noise praportion of reads for a variant to conisder as noise (default: 0.1)
172
+ --cross-type type of cross used to generated mapping population - back or out (default: back)
173
+ --use-all-contigs option to select all contigs or only contigs containing variants for analysis
174
+ --include-low-hmes option to include or discard variants from contigs with low hme-score or bfr score to list in the final output
175
+ --polyploidy Set if the data input is from polyploids
176
+ -p, --mut-parent Pileup or sorted BAM file alignments from mutant/trait of interest parent (default: )
177
+ -r, --bg-parent Pileup or sorted BAM file alignments from background/wildtype parent (default: )
178
+ --bfr-adjust factor added to hemi snp frequency of each parent to adjust for bfr calculations (default: 0.05)
179
+ --sel-seq-len sequence length to print from either side of selected variants (default: 50)
180
+
181
+ ------
182
+
183
+ .. class:: infomark
184
+
185
+ **Tool Author**
186
+
187
+ Shyam Rallapalli
188
+
189
+
190
+ </help>
191
+
192
+ <citations>
193
+ <citation type="doi">10.1093/bioinformatics/btg1080</citation>
194
+ </citations>
195
+
196
+ </tool>
data/lib/cheripic.rb CHANGED
@@ -38,3 +38,4 @@ require 'cheripic/options'
38
38
  require 'cheripic/contig_pileups'
39
39
  require 'cheripic/bfr'
40
40
  require 'cheripic/regions'
41
+ require 'cheripic/vcf'
data/lib/cheripic/cmd.rb CHANGED
@@ -52,10 +52,14 @@ module Cheripic
52
52
  opt :mut_bulk, 'Pileup or sorted BAM file alignments from mutant/trait of interest bulk 1',
53
53
  :short => '-a',
54
54
  :type => String
55
+ opt :mut_bulk_vcf, 'vcf file for variants from mutant/trait of interest bulk 1',
56
+ :type => String
55
57
  opt :bg_bulk, 'Pileup or sorted BAM file alignments from background/wildtype bulk 2',
56
58
  :short => '-b',
57
59
  :type => String
58
- opt :output, 'Directory to store results, will be created if not existing',
60
+ opt :bg_bulk_vcf, 'vcf file for variants from background/wildtype bulk 2',
61
+ :type => String
62
+ opt :output, 'custom name tag to include in the output file name',
59
63
  :default => 'cheripic_results'
60
64
  opt :loglevel, 'Choose any one of "info / warn / debug" level for logs generated',
61
65
  :default => 'debug'
@@ -68,9 +72,17 @@ module Cheripic
68
72
  opt :hthigh, 'high level for categorizing heterozygosity',
69
73
  :type => Float,
70
74
  :default => 0.9
71
- opt :mindepth, 'minimum read depth to conisder a position for variant calls',
75
+ opt :mindepth, 'minimum read depth at a position to consider for variant calls',
72
76
  :type => Integer,
73
77
  :default => 6
78
+ opt :max_d_multiple, "multiplication factor for average coverage to calculate maximum read coverage
79
+ if set zero no calculation will be made from bam file.\nsetting this value will override user set max depth",
80
+ :type => Integer,
81
+ :default => 5
82
+ opt :maxdepth, "maximum read depth at a position to consider for variant calls
83
+ if set to zero no user max depth will be used",
84
+ :type => Integer,
85
+ :default => 0
74
86
  opt :min_non_ref_count, 'minimum read depth supporting non reference base at each position',
75
87
  :type => Integer,
76
88
  :default => 3
@@ -97,7 +109,8 @@ module Cheripic
97
109
  opt :use_all_contigs, 'option to select all contigs or only contigs containing variants for analysis',
98
110
  :type => FalseClass,
99
111
  :default => false
100
- opt :include_low_hmes, 'option to include or discard variants from contigs with low hme-score or bfr score to list in the final output',
112
+ opt :include_low_hmes, 'option to include or discard variants from contigs with
113
+ low hme-score or bfr score to list in the final output',
101
114
  :type => FalseClass,
102
115
  :default => false
103
116
  opt :polyploidy, 'Set if the data input is from polyploids',
@@ -111,6 +124,10 @@ module Cheripic
111
124
  :short => '-r',
112
125
  :type => String,
113
126
  :default => ''
127
+ opt :repeats_file, 'repeat masker output file for the assembly ',
128
+ :short => '-R',
129
+ :type => String,
130
+ :default => ''
114
131
  opt :bfr_adjust, 'factor added to hemi snp frequency of each parent to adjust for bfr calculations',
115
132
  :type => Float,
116
133
  :default => 0.05
@@ -133,8 +150,9 @@ module Cheripic
133
150
 
134
151
  Inputs:
135
152
  1. Needs a reference fasta file of the assembly used for variant analysis
136
- 2. Pileup files for mutant (phenotype of interest) bulks and background (wildtype phenotype) bulks
137
- 3. If polyploid species, include of pileup from one or both parents
153
+ 2. Pileup/Bam files for mutant (phenotype of interest) bulks and background (wildtype phenotype) bulks
154
+ 3. If providing bam files, you have to include vcf files for the respective bulks
155
+ 4. If polyploid species, include pileup/bam files from one or both parents
138
156
 
139
157
  USAGE:
140
158
  cheripic <options>
@@ -149,15 +167,19 @@ module Cheripic
149
167
  def print_examples
150
168
  msg = <<-EOS
151
169
 
152
- Cheripic v#{Cheripic::VERSION.dup}
170
+ Cheripic v#{Cheripic::VERSION.dup}
171
+ Authors: Shyam Rallapalli and Dan MacLean
172
+
173
+ EXAMPLE COMMANDS:
174
+ 1. cheripic -f assembly.fa -a mutbulk.pileup -b bgbulk.pileup --output=cheripic_output
175
+ 2. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
176
+ --mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true --output cheripic_results
177
+ 3. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
178
+ --mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true
179
+ --use-all-contigs true --include-low-hmes true --output cheripic_results
180
+ 4. cheripic -h or cheripic --help
181
+ 5. cheripic -v or cheripic --version
153
182
 
154
- EXAMPLE COMMANDS:
155
- 1. cheripic -f assembly.fa -a mutbulk.pileup -b bgbulk.pileup --output=cheripic_output
156
- 2. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
157
- --mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true --output cheripic_results
158
- 3. cheripic --assembly assembly.fa --mut-bulk mutbulk.pileup --bg-bulk bgbulk.pileup
159
- --mut-parent mutparent.pileup --bg-parent bgparent.pileup --polyploidy true
160
- --no-only-frag-with-vars --no-filter-out-low-hmes --output cheripic_results
161
183
  EOS
162
184
  puts msg.split("\n").map{ |line| line.lstrip }.join("\n")
163
185
  exit(0)
@@ -165,44 +187,66 @@ module Cheripic
165
187
 
166
188
  # calls other methods to check if command line inputs are valid
167
189
  def check_arguments
168
- check_output_dir
190
+ check_output
169
191
  check_log_level
170
- check_input_files
192
+ check_input_types
171
193
  end
172
194
 
173
- # TODO: check bulk input types and process associated files
174
- # def check_input_types
175
- # if @options[:input_format] == 'vcf'
176
- #
177
- # end
178
- # end
179
-
180
- # checks if input files are valid
181
- def check_input_files
195
+ # checks input files based on bulk file type
196
+ def check_input_types
197
+ inputfiles = {}
198
+ inputfiles[:required] = %i{assembly mut_bulk}
199
+ inputfiles[:optional] = %i{bg_bulk}
200
+ if @options[:input_format] == 'bam'
201
+ inputfiles[:required] << %i{mut_bulk_vcf}
202
+ inputfiles[:optional] << %i{bg_bulk_vcf}
203
+ end
182
204
  if @options[:polyploidy]
183
- inputfiles = %i{assembly mut_bulk bg_bulk mut_parent bg_parent}
184
- else
185
- inputfiles = %i{assembly mut_bulk bg_bulk}
205
+ inputfiles[:either] = %i{mut_parent bg_parent}
186
206
  end
187
- inputfiles.each do | symbol |
188
- if @options[symbol]
189
- file = @options[symbol]
190
- @options[symbol] = File.expand_path(file)
191
- unless File.exist?(file)
192
- raise CheripicIOError.new "#{symbol} file, #{file} does not exist: "
207
+ check_input_files(inputfiles)
208
+ end
209
+
210
+ # checks if input files are valid
211
+ def check_input_files(inputfiles)
212
+ check = 0
213
+ inputfiles.each_key do | type |
214
+ inputfiles[type].flatten!
215
+ inputfiles[type].each do | symbol |
216
+ if @options[symbol]
217
+ file = @options[symbol]
218
+ @options[symbol] = File.expand_path(file)
219
+ next if type == :optional
220
+ if type == :required and not File.exist?(file)
221
+ raise CheripicIOError.new "#{symbol} file, #{file} does not exist: "
222
+ elsif type == :either and File.exist?(file)
223
+ check = 1
224
+ end
225
+ elsif type == :required
226
+ raise CheripicArgError.new "Options #{inputfiles}, all must be specified. " +
227
+ 'Try --help for further help.'
193
228
  end
194
- else
195
- raise CheripicArgError.new "Options #{inputfiles}, all must be specified. " +
196
- 'Try --help for help.'
229
+ end
230
+ if type == :either and check == 0
231
+ raise CheripicArgError.new "One of the options #{inputfiles}, must be specified. " +
232
+ 'Try --help for further help.'
197
233
  end
198
234
  end
199
235
  end
200
236
 
201
- # checks if output directory already exists
202
- def check_output_dir
203
- if Dir.exist?(@options[:output])
204
- raise CheripicArgError.new "#{@options[:output]} directory exists" +
205
- 'please choose a different output directory name'
237
+ # checks if files with output tag name already exists
238
+ def check_output
239
+ if (@options[:output].split('') & %w{# / : * ? ' < > | & $ ,}).any?
240
+ raise CheripicArgError.new 'please choose a name tag that contains ' +
241
+ 'alphanumeric characters, hyphen(-) and underscore(_) only'
242
+ end
243
+ @options[:hmes_frags] = "#{@options[:output]}_selected_hme_variants.txt"
244
+ @options[:bfr_frags] = "#{@options[:output]}_selected_bfr_variants.txt"
245
+ [@options[:hmes_frags], @options[:bfr_frags]].each do | file |
246
+ if File.exist?(file)
247
+ raise CheripicArgError.new "'#{file}' file exists " +
248
+ 'please choose a different name tag to be included in the output file name'
249
+ end
206
250
  end
207
251
  end
208
252
 
@@ -220,7 +264,8 @@ module Cheripic
220
264
  # A hash of trollop option names as keys and user or default
221
265
  # setting as values is passed to Implementer object
222
266
  def run
223
- @options[:output] = File.expand_path @options[:output]
267
+ @options[:hmes_frags] = File.expand_path @options[:hmes_frags]
268
+ @options[:bfr_frags] = File.expand_path @options[:bfr_frags]
224
269
  analysis = Implementer.new(@options)
225
270
  analysis.run
226
271
  end
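The `check_output` change above rejects a fixed set of characters, while its error message asks for alphanumerics, hyphen and underscore only. The sketch below shows a stricter whitelist variant that enforces exactly what the message says. It is an illustrative alternative, not the gem's implementation; `valid_output_tag?` and `output_files_for` are hypothetical names, and only the two file-name suffixes are taken from the diff:

```ruby
# Hypothetical whitelist check: accept a tag only if it is non-empty and
# made up solely of ASCII letters, digits, hyphens and underscores.
def valid_output_tag?(tag)
  !!(tag =~ /\A[A-Za-z0-9_-]+\z/)
end

# Builds the two output file names with the suffixes used in check_output.
def output_files_for(tag)
  raise ArgumentError, "invalid name tag: #{tag.inspect}" unless valid_output_tag?(tag)
  %w[hme bfr].map { |kind| "#{tag}_selected_#{kind}_variants.txt" }
end

puts output_files_for('cheripic_results')
puts valid_output_tag?('results;old') # a semicolon is not in the blacklist above, but fails here
```

A whitelist is easier to keep in step with the error message than a blacklist, at the cost of refusing characters such as `.` that a blacklist would let through.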
@@ -22,7 +22,7 @@ module Cheripic
22
22
  # @return [Integer] length of contig in bases
23
23
  class Contig
24
24
 
25
- attr_accessor :hm_pos, :ht_pos, :hemi_pos
25
+ attr_accessor :hm_pos, :ht_pos, :hemi_pos, :mean_depth, :sd_depth
26
26
  attr_reader :id, :length
27
27
 
28
28
  # creates a Contig object using fasta entry
@@ -33,6 +33,8 @@ module Cheripic
33
33
  @hm_pos = {}
34
34
  @ht_pos = {}
35
35
  @hemi_pos = {}
36
+ @mean_depth = nil
37
+ @sd_depth = nil
36
38
  end
37
39
 
38
40
  # Number of homozygous variants identified in the contig
@@ -32,7 +32,7 @@ module Cheripic
32
32
  def_delegators :@mut_parent, :each, :each_key, :each_value, :length, :[], :store
33
33
  def_delegators :@bg_parent, :each, :each_key, :each_value, :length, :[], :store
34
34
  attr_accessor :id, :parent_hemi
35
- attr_accessor :mut_bulk, :bg_bulk, :mut_parent, :bg_parent
35
+ attr_accessor :mut_bulk, :bg_bulk, :mut_parent, :bg_parent, :masked_regions
36
36
 
37
37
  # creates a ContigPileup object using fasta entry id
38
38
  # @param fasta [String] a contig id from fasta entry
@@ -43,16 +43,27 @@ module Cheripic
43
43
  @mut_parent = {}
44
44
  @bg_parent = {}
45
45
  @parent_hemi = {}
46
+ @masked_regions = Hash.new { |h,k| h[k] = {} }
47
+ @hm_pos = {}
48
+ @ht_pos = {}
49
+ @hemi_pos = {}
46
50
  end
47
51
 
48
52
  # bulk pileups are compared and variant positions are selected
49
53
  # @return [Array<Hash>] variant positions are stored in hashes
50
54
  # for homozygous, heterozygous and hemi-variant positions
51
55
  def bulks_compared
52
- @hm_pos = {}
53
- @ht_pos = {}
54
- @hemi_pos = {}
55
56
  @mut_bulk.each_key do | pos |
57
+ ignore = 0
58
+ unless @masked_regions.empty?
59
+ @masked_regions.each_key do | index |
60
+ if pos.between?(@masked_regions[index][:begin], @masked_regions[index][:end])
61
+ ignore = 1
62
+ logger.info "variant is in the masked region\t#{@mut_bulk[pos].to_s}"
63
+ end
64
+ end
65
+ end
66
+ next if ignore == 1
56
67
  if Options.polyploidy and @parent_hemi.key?(pos)
57
68
  bg_bases = ''
58
69
  if @bg_bulk.key?(pos)
@@ -74,27 +85,37 @@ module Cheripic
74
85
  # @param pos [Integer] position in the contig
75
86
  # stores variant type, position and allele fraction to either @hm_pos or @ht_pos hashes
76
87
  def compare_pileup(pos)
77
- base_hash = @mut_bulk[pos].var_base_frac
78
- base_hash.delete(:ref)
79
- return nil if base_hash.empty?
80
- # we could ignore complex loci or
81
- # take the variant type based on predominant base
82
- if base_hash.length > 1
83
- fraction = base_hash.values.max
84
- mut_type = var_mode(fraction)
85
- else
86
- fraction = base_hash[base_hash.keys[0]]
87
- mut_type = var_mode(fraction)
88
- end
88
+ mut_type, fraction = var_mode_fraction(@mut_bulk[pos])
89
+ return nil if mut_type.nil?
89
90
  if @bg_bulk.key?(pos)
90
- bg_type = bg_bulk_var(pos)
91
+ bg_type = var_mode_fraction(@bg_bulk[pos])[0]
91
92
  mut_type = compare_var_type(mut_type, bg_type)
92
93
  end
93
- unless mut_type == nil
94
+ unless mut_type.nil?
94
95
  categorise_pos(mut_type, pos, fraction)
95
96
  end
96
97
  end
97
98
 
99
+
100
+ # Method to extract var_mode and allele fraction from pileup information at a position in contig
101
+ #
102
+ # @param pileup_info [Pileup] pileup object
103
+ # @return [Symbol] variant mode from pileup position (:hom or :het) at the position
104
+ # @return [Float] allele fraction at the position
105
+ def var_mode_fraction(pileup_info)
106
+ base_frac_hash = pileup_info.var_base_frac
107
+ base_frac_hash.delete(:ref)
108
+ return [nil, nil] if base_frac_hash.empty?
109
+ # we could ignore complex loci or
110
+ # take the variant type based on predominant base
111
+ if base_frac_hash.length > 1
112
+ fraction = base_frac_hash.values.max
113
+ else
114
+ fraction = base_frac_hash[base_frac_hash.keys[0]]
115
+ end
116
+ [var_mode(fraction), fraction]
117
+ end
118
+
98
119
  # Categorizes variant zygosity based on the allele fraction provided.
99
120
  # Uses lower and upper limit set for heterozygosity in the options.
100
121
  # @note consider increasing the range of heterozygosity limits for RNA-seq data
@@ -125,23 +146,6 @@ module Cheripic
125
146
  end
126
147
  end
127
148
 
128
- # Method to extract var_mode from pileup information at a position in contig
129
- #
130
- # @param pos [Integer] position in the contig
131
- # @return [Symbol] variant mode of the background bulk (:hom or :het) at the position
132
- def bg_bulk_var(pos)
133
- bg_base_hash = @bg_bulk[pos].var_base_frac
134
- bg_base_hash.delete(:ref)
135
- return nil if bg_base_hash.empty?
136
- if bg_base_hash.length > 1
137
- # taking only var mode
138
- var_mode(bg_base_hash.values.max)
139
- else
140
- # taking only var mode
141
- var_mode(bg_base_hash[bg_base_hash.keys[0]])
142
- end
143
- end
144
-
145
149
  # method stores pos as key and allele fraction as value
146
150
  # to @hm_pos or @ht_pos hash based on variant type
147
151
  # @param var_type [Symbol] values are either :hom or :het
@@ -156,18 +160,18 @@ module Cheripic
156
160
  end
157
161
 
158
162
  # Compares parental pileups for the contig and identify position
159
- # that indicate variants from homelogues called hemi-snps
163
+ # that indicate variants from homeologues called hemi-snps
160
164
  # and calculates bulk frequency ratio (bfr)
161
165
  # @return [Hash] parent_hemi hash with position as key and bfr as value
162
166
  def hemisnps_in_parent
163
167
  # mark all the hemi snp based on both parents
164
- self.mut_parent.each_key do |pos|
168
+ @mut_parent.each_key do |pos|
165
169
  mut_parent_frac = @mut_parent[pos].var_base_frac
166
- if self.bg_parent.key?(pos)
170
+ if @bg_parent.key?(pos)
167
171
  bg_parent_frac = @bg_parent[pos].var_base_frac
168
172
  bfr = Bfr.get_bfr(mut_parent_frac, bg_parent_frac)
169
173
  @parent_hemi[pos] = bfr
170
- self.bg_parent.delete(pos)
174
+ @bg_parent.delete(pos)
171
175
  else
172
176
  bfr = Bfr.get_bfr(mut_parent_frac)
173
177
  @parent_hemi[pos] = bfr
@@ -175,7 +179,7 @@ module Cheripic
175
179
  end
176
180
 
177
181
  # now include all hemi snp unique to background parent
178
- self.bg_parent.each_key do |pos|
182
+ @bg_parent.each_key do |pos|
179
183
  unless @parent_hemi.key?(pos)
180
184
  bg_parent_frac = @bg_parent[pos].var_base_frac
181
185
  bfr = Bfr.get_bfr(bg_parent_frac)
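
The `bulks_compared` change above skips any variant position that falls inside a repeat-masked interval. A minimal sketch of that check follows, assuming the same `index => {:begin, :end}` layout used for `masked_regions` in the diff. The helper name `masked?` and the sample coordinates are hypothetical and not part of the gem.

```ruby
# Hypothetical stand-in for ContigPileups#masked_regions:
# index => { :begin => start, :end => stop }
masked_regions = Hash.new { |h, k| h[k] = {} }
masked_regions[1][:begin] = 100
masked_regions[1][:end]   = 200
masked_regions[2][:begin] = 500
masked_regions[2][:end]   = 650

# True when pos lies inside any masked interval (inclusive at both ends).
# Helper name is illustrative only.
def masked?(regions, pos)
  regions.each_value.any? { |r| pos.between?(r[:begin], r[:end]) }
end

variant_positions = [50, 100, 150, 201, 650, 700]
# Keep only the positions that sit outside every masked interval
kept = variant_positions.reject { |pos| masked?(masked_regions, pos) }
```

Using `any?` stops at the first matching interval, whereas the loop in the diff visits every interval and sets a flag; the set of positions kept should be the same.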
@@ -25,15 +25,21 @@ module Cheripic
25
25
  input_format
26
26
  mut_bulk
27
27
  bg_bulk
28
- output
28
+ mut_bulk_vcf
29
+ bg_bulk_vcf
30
+ hmes_frags
31
+ bfr_frags
29
32
  mut_parent
30
- bg_parent}
33
+ bg_parent
34
+ repeats_file}
31
35
  @options = OpenStruct.new(inputs.select { |k| set1.include?(k) })
32
36
 
33
37
  set2 = %i{hmes_adjust
34
38
  htlow
35
39
  hthigh
36
40
  mindepth
41
+ maxdepth
42
+ max_d_multiple
37
43
  min_non_ref_count
38
44
  min_indel_count_support
39
45
  ambiguous_ref_bases
@@ -44,10 +50,10 @@ module Cheripic
44
50
  use_all_contigs
45
51
  include_low_hmes
46
52
  polyploidy
47
- bfr_adjust}
53
+ bfr_adjust
54
+ sel_seq_len}
48
55
  settings = inputs.select { |k| set2.include?(k) }
49
56
  Options.update(settings)
50
- FileUtils.mkdir_p @options.output
51
57
  @vars_extracted = false
52
58
  @has_run = false
53
59
  end
@@ -62,15 +68,21 @@ module Cheripic
62
68
 
63
69
  # Extracted variants from bulk comparison are re-analysed
64
70
  # and selected variants are written to a file
65
- def process_variants
66
- @variants.verify_bg_bulk_pileup
71
+ def process_variants(pos_type)
72
+ if pos_type == :hmes_frags
73
+ @variants.verify_bg_bulk_pileup
74
+ end
67
75
  # print selected variants that could be potential markers or mutation
68
- out_file = File.open("#{@options.output}/selected_variants.txt", 'w')
69
- out_file.puts "HME_Score\tAlleleFreq\tseq_id\tposition\tref_base\tcoverage\tbases\tbase_quals\tsequence_left\tAlt_seq\tsequence_right"
76
+ out_file = File.open(@options[pos_type], 'w')
77
+ out_file.puts "Score\tAlleleFreq\tseq_id\tposition\tref_base\tcoverage\tbases\tbase_quals\tsequence_left\tAlt_seq\tsequence_right"
70
78
  regions = Regions.new(@options.assembly)
71
- @variants.hmes_frags.each_key do | frag |
79
+ @variants.send(pos_type).each_key do | frag |
72
80
  contig_obj = @variants.assembly[frag]
73
- positions = contig_obj.hm_pos.keys
81
+ if pos_type == :hmes_frags
82
+ positions = contig_obj.hm_pos.keys
83
+ else
84
+ positions = contig_obj.hemi_pos.keys
85
+ end
74
86
  positions.each do | pos |
75
87
  pileup = @variants.pileups[frag].mut_bulk[pos]
76
88
  seqs = regions.fetch_seq(frag,pos)
@@ -87,11 +99,9 @@ module Cheripic
87
99
  unless @vars_extracted
88
100
  self.extract_vars
89
101
  end
102
+ self.process_variants(:hmes_frags)
90
103
  if Options.polyploidy
91
- self.process_variants
92
- @variants.bfr_frags
93
- else
94
- self.process_variants
104
+ self.process_variants(:bfr_frags)
95
105
  end
96
106
  @has_run = true
97
107
  end
@@ -12,6 +12,8 @@ module Cheripic
12
12
  :htlow => 0.2,
13
13
  :hthigh => 0.9,
14
14
  :mindepth => 6,
15
+ :maxdepth => 0,
16
+ :max_d_multiple => 5,
15
17
  :min_non_ref_count => 3,
16
18
  :min_indel_count_support => 3,
17
19
  :ambiguous_ref_bases => false,
@@ -53,6 +55,26 @@ module Cheripic
53
55
  @user_settings[:mindepth]
54
56
  end
55
57
 
58
+ # Maximum read coverage at the variant position to be considered for analysis
59
+ # @return [Integer]
60
+ def self.maxdepth
61
+ @user_settings[:maxdepth]
62
+ end
63
+
64
+ # Setting maximum read coverage at the variant position to be considered for analysis
65
+ # @param value [Integer] provided integer value will be updated as maxdepth
66
+ # @return [Integer] updated maxdepth value
67
+ def self.maxdepth=(value)
68
+ @user_settings[:maxdepth] = value
69
+ end
70
+
71
+ # Multiplication factor to average coverage to calculate maximum read coverage
72
+ # at the variant position to be considered for analysis
73
+ # @return [Integer]
74
+ def self.max_d_multiple
75
+ @user_settings[:max_d_multiple]
76
+ end
77
+
56
78
  # Minimum non reference count at the variant position to be considered for analysis
57
79
  # @return [Integer]
58
80
  def self.min_non_ref_count
@@ -4,6 +4,36 @@ require 'forwardable'
4
4
 
5
5
  module Cheripic
6
6
 
7
+ require 'bio-samtools'
8
+ require 'bio/db/sam'
9
+ require 'open3'
10
+
11
+ # An extension of Bio::DB::Sam object to modify depth method
12
+ class Bio::DB::Sam
13
+
14
+ # A method to retrieve depth information from bam object
15
+ # @param opts [Hash] a hash of following input options
16
+ # b [File] list of positions or regions in BED format
17
+ # l [INT] minQLen
18
+ # q [INT] base quality threshold
19
+ # Q [INT] mapping quality threshold
20
+ # r [chr:from-to] region
21
+ # @returns a block with each line reporting sequence_name, position and depth
22
+ def depth(opts={})
23
+ command = form_opt_string(self.samtools, 'depth', opts)
24
+ # capture returns string output, so careful not to give whole genome or big contigs for depth analysis
25
+ stdout, stderr, status = Open3.capture3(command)
26
+ unless status.success?
27
+ logger.error "resulted in exit code #{status.exitstatus} using #{command}"
28
+ logger.error "stderr output is: #{stderr}"
29
+ raise CheripicError
30
+ end
31
+ # return stdout
32
+ stdout
33
+ end
34
+
35
+ end
36
+
7
37
  # Custom error handling for Variants class
8
38
  class VariantsError < CheripicError; end
9
39
 
@@ -27,10 +57,10 @@ module Cheripic
27
57
  include Enumerable
28
58
  extend Forwardable
29
59
  def_delegators :@assembly, :each, :each_key, :each_value, :size, :length, :[]
30
- attr_reader :assembly, :pileups, :hmes_frags, :bfr_frags, :pileups_analyzed
60
+ attr_reader :assembly, :pileups, :pileups_analyzed
31
61
 
32
62
  # creates a Variants object using user input files
33
- # @param options [Hash] a hash of required input files as keys and file paths as values
63
+ # @param options [OpenStruct] a hash of required input files as keys and file paths as values
34
64
  def initialize(options)
35
65
  @params = options
36
66
  @assembly = {}
@@ -50,25 +80,76 @@ module Cheripic
50
80
  @pileups[contig.id] = ContigPileups.new(contig.id)
51
81
  end
52
82
  @pileups_analyzed = false
83
+ unless @params.repeats_file == ''
84
+ store_repeat_regions
85
+ end
86
+ end
87
+
88
+ # reads repeat masker output file and stores masked regions to ignore variants in those regions
89
+ def store_repeat_regions
90
+ File.foreach(@params.repeats_file) do |line|
91
+ line.strip!
92
+ next if line =~ /^SW/ or line =~ /^score/ or line == ''
93
+ info = line.split("\s")
94
+ pileups_obj = @pileups[info[4]]
95
+ index = pileups_obj.masked_regions.length
96
+ pileups_obj.masked_regions[index + 1][:begin] = info[5].to_i
97
+ pileups_obj.masked_regions[index + 1][:end] = info[6].to_i
98
+ end
53
99
  end
54
100
 
55
101
  # Reads and store pileup data for each of input bulk and parents pileup files
56
102
  # And sets pileups_analyzed to true once the pileup files are processed
57
103
  def analyse_pileups
58
- @bg_bulk = @params.bg_bulk
59
- @mut_parent = @params.mut_parent
60
- @bg_parent = @params.bg_parent
61
-
104
+ if @params.input_format == 'bam'
105
+ @vcf_hash = Vcf.filtering(@params.mut_bulk_vcf, @params.bg_bulk_vcf)
106
+ end
62
107
  %i{mut_bulk bg_bulk mut_parent bg_parent}.each do | input |
63
108
  infile = @params[input]
64
109
  if infile != ''
65
- extract_pileup(infile, input)
110
+ logger.info "processing #{input} file"
111
+ if @params.input_format == 'pileup'
112
+ extract_pileup(infile, input)
113
+ else
114
+ extract_bam_pileup(infile, input)
115
+ end
66
116
  end
67
117
  end
68
118
 
69
119
  @pileups_analyzed = true
70
120
  end
71
121
 
122
+ # Bam object is read and the mean and standard deviation of depth are calculated for each contig
123
+ # @param bamobject [Bio::DB::Sam]
124
+ # Open3 capture returns string output, so careful not to give whole genome or big contigs for depth analysis
125
+ def set_max_depth(bamobject, bamfile)
126
+ logger.info "processing #{bamfile} file for depth"
127
+ all_depths = []
128
+ bq = Options.base_quality
129
+ mq = Options.mapping_quality
130
+ @assembly.each_key do | id |
131
+ contig_obj = @assembly[id]
132
+ len = contig_obj.length
133
+ data = bamobject.depth(:r => "#{id}", :Q => bq, :q => mq)
134
+ depths = []
135
+ data.split("\n").each do |line|
136
+ info = line.split("\t")
137
+ depths << info[2].to_i
138
+ end
139
+ variance = 0
140
+ mean_depth = depths.reduce(0, :+) / len.to_f
141
+ depths.each do |value|
142
+ variance += (value.to_f - mean_depth)**2
143
+ end
144
+ all_depths << mean_depth
145
+ contig_obj.sd_depth = Math.sqrt(variance)
146
+ contig_obj.mean_depth = mean_depth
147
+ end
148
+ # setting max depth as max_d_multiple times the average depth
149
+ mean_coverage = all_depths.reduce(0, :+) / @assembly.length.to_f
150
+ Options.maxdepth = Options.max_d_multiple * mean_coverage
151
+ end
152
+
72
153
  # Input pileup file is read and positions are selected that pass the thresholds
73
154
  # @param pileupfile [String] path to the pileup file to read
74
155
  # @param sym [Symbol] Symbol of the pileup file used to write selected variants
@@ -84,6 +165,54 @@ module Cheripic
84
165
  end
85
166
  end
86
167
 
168
+ # Input bamfile is read and selected positions pileups are stored
169
+ # @param bamfile [String] path to the bam file to read
170
+ # @param sym [Symbol] Symbol of the bam file used to write selected variants
171
+ # pileup information to respective ContigPileups object
172
+ def extract_bam_pileup(bamfile, sym)
173
+ bq = Options.base_quality
174
+ mq = Options.mapping_quality
175
+ bamobject = Bio::DB::Sam.new(:bam=>bamfile, :fasta=>@params.assembly)
176
+ bamobject.index unless bamobject.indexed?
177
+
178
+ # check if user has set max depth or set to zero to ignore
179
+ max_d = Options.maxdepth
180
+ # or calculate from bamfile
181
+ if Options.max_d_multiple > 0
182
+ set_max_depth(bamobject, bamfile)
183
+ max_d = Options.maxdepth
184
+ logger.info "max depth used for #{sym} file\t#{max_d}"
185
+ end
186
+
187
+ @vcf_hash.each_key do | id |
188
+ positions = @vcf_hash[id][:het].keys
189
+ positions << @vcf_hash[id][:hom].keys
190
+ positions.flatten!
191
+ next if positions.empty?
192
+ contig_obj = @pileups[id]
193
+ positions.each do | pos |
194
+ command = "#{bamobject.samtools} mpileup -r #{id}:#{pos}-#{pos} -Q #{bq} -q #{mq} -B -f #{@params.assembly} #{bamfile}"
195
+ stdout, stderr, status = Open3.capture3(command)
196
+ unless status.success?
197
+ logger.error "resulted in exit code #{status.exitstatus} using #{command}"
198
+ logger.error "stderr output is: #{stderr}"
199
+ raise CheripicError
200
+ end
201
+ stdout.chomp!
202
+ if stdout == '' or stdout.split("\t")[3].to_i == 0 or stdout =~ /^\t0/
203
+ logger.info "pileup data empty for\t#{id}\t#{pos}"
204
+ else
205
+ pileup = Pileup.new(stdout)
206
+ unless max_d == 0 or pileup.coverage <= max_d
207
+ logger.info "pileup coverage is higher than max\t#{pileup.to_s}"
208
+ next
209
+ end
210
+ contig_obj.send(sym).store(pos, pileup)
211
+ end
212
+ end
213
+ end
214
+ end
215
+
87
216
  # Once pileup files are analysed and variants are extracted from each bulk;
88
217
  # bulks are compared to identify and isolate variants for downstream analysis.
89
218
  # If polyploidy is set to true and mut_parent and bg_parent bulks are provided
@@ -95,8 +224,10 @@ module Cheripic
95
224
  @assembly.each_key do | id |
96
225
  contig = @assembly[id]
97
226
  # extract parental hemi snps for polyploids before bulks are compared
98
- if @mut_parent != '' or @bg_parent != ''
99
- @pileups[id].hemisnps_in_parent
227
+ if Options.polyploidy
228
+ if @params.mut_parent != '' or @params.bg_parent != ''
229
+ @pileups[id].hemisnps_in_parent
230
+ end
100
231
  end
101
232
  contig.hm_pos, contig.ht_pos, contig.hemi_pos = @pileups[id].bulks_compared
102
233
  end
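
The `set_max_depth` method added above derives a per-contig mean depth and a depth cut-off from `samtools depth` output. The sketch below works through that arithmetic on made-up numbers. It assumes positions absent from the depth output have zero coverage, and it divides the summed squared deviations by the contig length before taking the square root (the usual population standard deviation). The diff stores the square root of the undivided sum, so treating the divided form as the intended statistic is an assumption. The helper name `depth_stats` is hypothetical.

```ruby
# Mean and population standard deviation of per-base depth for one contig.
# depths: depth values reported for covered positions
# length: contig length; positions missing from depths are treated as depth 0
def depth_stats(depths, length)
  mean = depths.reduce(0, :+) / length.to_f
  uncovered = length - depths.length
  # squared deviations of covered positions plus those of zero-depth positions
  sum_sq = depths.reduce(0.0) { |acc, d| acc + (d - mean)**2 } + uncovered * mean**2
  [mean, Math.sqrt(sum_sq / length)]
end

mean, sd = depth_stats([4, 6, 8, 6], 6)

# Depth cut-off as a multiple of mean coverage; 5 mirrors the
# :max_d_multiple default shown in the options diff.
max_d_multiple = 5
max_depth = max_d_multiple * mean
```

With a single contig the mean coverage equals that contig's mean depth. The diff averages the per-contig means across the whole assembly before multiplying.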
@@ -0,0 +1,83 @@
1
+ # encoding: utf-8
2
+
3
+ module Cheripic
4
+
5
+ # Custom error handling for Vcf class
6
+ class VcfError < CheripicError; end
7
+
8
+ require 'bio-samtools'
9
+
10
+ class Vcf
11
+
12
+ def self.get_allele_freq(vcf_obj)
13
+ # check if the vcf is from samtools (has DP4 and AF1 fields in INFO)
14
+ if vcf_obj.info.key?('DP4')
15
+ freq = vcf_obj.info['DP4'].split(',')
16
+ depth = freq.inject { | sum, n | sum.to_f + n.to_f }
17
+ alt = freq[2].to_f + freq[3].to_f
18
+ allele_freq = alt / depth
19
+ # allele_freq = vcf_obj.non_ref_allele_freq
20
+ # check if the vcf is from VarScan (has RD, AD and FREQ fields in FORMAT)
21
+ elsif vcf_obj.samples['1'].key?('RD')
22
+ alt = vcf_obj.samples['1']['AD'].to_f
23
+ depth = vcf_obj.samples['1']['RD'].to_f + alt
24
+ allele_freq = alt / depth
25
+ # check if the vcf is from GATK (has AD and GT fields in FORMAT)
26
+ elsif vcf_obj.samples['1'].key?('AD') and vcf_obj.samples['1']['AD'].include?(',')
27
+ freq = vcf_obj.samples['1']['AD'].split(',')
28
+ allele_freq = freq[1].to_f / ( freq[0].to_f + freq[1].to_f )
29
+ # check if the vcf has an AF field in INFO
30
+ elsif vcf_obj.info.key?('AF')
31
+ allele_freq = vcf_obj.info['AF'].to_f
32
+ else
33
+ raise VcfError.new 'not a supported vcf format (VarScan, GATK, Bcftools(Samtools), Vcf 4.0, 4.1 and 4.2)' +
34
+ " and check that it is one sample vcf\n"
35
+ end
36
+ allele_freq
37
+ end
38
+
39
+
40
+ ##Input: vcf file
41
+ ##Output: lists of hm and ht SNPS and hash of all fragments with variants
42
+ def self.get_vars(vcf_file)
43
+ ht_low = Options.htlow
44
+ ht_high = Options.hthigh
45
+
46
+ # hash of :het and :hom with frag ids and respective variant positions
47
+ var_pos = Hash.new{ |h,k| h[k] = Hash.new(&h.default_proc) }
48
+ File.foreach(vcf_file) do |line|
49
+ next if line =~ /^#/
50
+ v = Bio::DB::Vcf.new(line)
51
+ unless v.alt == '.'
52
+ allele_freq = get_allele_freq(v)
53
+ if allele_freq.between?(ht_low, ht_high)
54
+ var_pos[v.chrom][:het][v.pos] = allele_freq
55
+ elsif allele_freq > ht_high
56
+ var_pos[v.chrom][:hom][v.pos] = allele_freq
57
+ end
58
+ end
59
+ end
60
+ var_pos
61
+ end
62
+
63
+ def self.filtering(mutant_vcf, bgbulk_vcf)
64
+ var_pos_mut = get_vars(mutant_vcf)
65
+ return var_pos_mut if bgbulk_vcf == ''
66
+ var_pos_bg = get_vars(bgbulk_vcf)
67
+
68
+ # if both bulks have homozygous mutations at same positions then deleting them
69
+ var_pos_mut.each_key do | frag |
70
+ positions = var_pos_mut[frag][:hom].keys
71
+ pos_bg_bulk = var_pos_bg[frag][:hom].keys
72
+ positions.each do |pos|
73
+ if pos_bg_bulk.include?(pos)
74
+ var_pos_mut[frag][:hom].delete(pos)
75
+ end
76
+ end
77
+ end
78
+ var_pos_mut
79
+ end
80
+
81
+ end
82
+
83
+ end
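
For samtools-style records, `Vcf.get_allele_freq` above computes the alternate allele fraction from the `DP4` INFO field, and `Vcf.get_vars` then bins it as heterozygous or homozygous. The sketch below mirrors that arithmetic on plain strings without `Bio::DB::Vcf`. The helper names `dp4_allele_freq` and `zygosity` are hypothetical, and the 0.2 and 0.9 cut-offs are the `htlow` and `hthigh` defaults from the options diff.

```ruby
# Alternate allele fraction from a samtools-style DP4 string:
# "ref-forward,ref-reverse,alt-forward,alt-reverse"
def dp4_allele_freq(dp4)
  counts = dp4.split(',').map(&:to_f)
  depth = counts.reduce(0.0, :+)
  return nil if depth.zero? # no reads, so no fraction can be computed
  (counts[2] + counts[3]) / depth
end

# Bin an allele fraction using the heterozygosity limits:
# inside [ht_low, ht_high] is :het, above ht_high is :hom, otherwise nil
def zygosity(freq, ht_low = 0.2, ht_high = 0.9)
  return :het if freq.between?(ht_low, ht_high)
  return :hom if freq > ht_high
  nil
end

het_call = zygosity(dp4_allele_freq('2,3,7,8'))   # 15 of 20 reads carry the alt allele
hom_call = zygosity(dp4_allele_freq('0,1,10,9'))  # 19 of 20 reads carry the alt allele
low_call = zygosity(dp4_allele_freq('9,9,1,1'))   # 2 of 20 reads carry the alt allele
```

Fractions below the lower limit return `nil` here, which corresponds to the diff storing nothing for such positions.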
@@ -2,6 +2,6 @@ module Cheripic
2
2
 
3
3
  # Sets the semantic version number for this module.
4
4
  # Version number will be used in help messages and for generating gem.
5
- VERSION = '1.2.0'
5
+ VERSION = '1.2.5'
6
6
 
7
7
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: cheripic
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.2.0
4
+ version: 1.2.5
5
5
  platform: ruby
6
6
  authors:
7
7
  - Shyam Rallapalli
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2016-08-11 00:00:00.000000000 Z
11
+ date: 2016-10-17 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: yell
@@ -84,40 +84,6 @@ dependencies:
84
84
  - - "~>"
85
85
  - !ruby/object:Gem::Version
86
86
  version: 2.4.0
87
- - !ruby/object:Gem::Dependency
88
- name: bio-gngm
89
- requirement: !ruby/object:Gem::Requirement
90
- requirements:
91
- - - "~>"
92
- - !ruby/object:Gem::Version
93
- version: 0.2.1
94
- type: :runtime
95
- prerelease: false
96
- version_requirements: !ruby/object:Gem::Requirement
97
- requirements:
98
- - - "~>"
99
- - !ruby/object:Gem::Version
100
- version: 0.2.1
101
- - !ruby/object:Gem::Dependency
102
- name: rinruby
103
- requirement: !ruby/object:Gem::Requirement
104
- requirements:
105
- - - "~>"
106
- - !ruby/object:Gem::Version
107
- version: '2.0'
108
- - - ">="
109
- - !ruby/object:Gem::Version
110
- version: 2.0.3
111
- type: :runtime
112
- prerelease: false
113
- version_requirements: !ruby/object:Gem::Requirement
114
- requirements:
115
- - - "~>"
116
- - !ruby/object:Gem::Version
117
- version: '2.0'
118
- - - ">="
119
- - !ruby/object:Gem::Version
120
- version: 2.0.3
121
87
  - !ruby/object:Gem::Dependency
122
88
  name: activesupport
123
89
  requirement: !ruby/object:Gem::Requirement
@@ -259,6 +225,7 @@ files:
259
225
  - ".gitignore"
260
226
  - ".travis.yml"
261
227
  - CODE_OF_CONDUCT.md
228
+ - ChangeLog.md
262
229
  - Gemfile
263
230
  - LICENSE.txt
264
231
  - README.md
@@ -267,6 +234,7 @@ files:
267
234
  - bin/console
268
235
  - bin/setup
269
236
  - cheripic.gemspec
237
+ - galaxy_cheripic_tool.xml
270
238
  - lib/cheripic.rb
271
239
  - lib/cheripic/bfr.rb
272
240
  - lib/cheripic/cmd.rb
@@ -277,6 +245,7 @@ files:
277
245
  - lib/cheripic/pileup.rb
278
246
  - lib/cheripic/regions.rb
279
247
  - lib/cheripic/variants.rb
248
+ - lib/cheripic/vcf.rb
280
249
  - lib/cheripic/version.rb
281
250
  homepage: https://github.com/shyamrallapalli/cheripic
282
251
  licenses: