bio-locus 0.0.2 → 0.0.6
- checksums.yaml +4 -4
- data/Gemfile +8 -8
- data/README.md +88 -26
- data/Rakefile +13 -7
- data/VERSION +1 -1
- data/bin/bio-locus +39 -13
- data/lib/bio-locus.rb +1 -0
- data/lib/bio-locus/dbmapper.rb +106 -0
- data/lib/bio-locus/locus.rb +48 -19
- data/lib/bio-locus/match.rb +26 -10
- data/lib/bio-locus/store.rb +11 -5
- data/spec/bio-locus_spec.rb +34 -4
- metadata +38 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: eef39503225998bf18b16997dd0cea5926ef6db2
|
4
|
+
data.tar.gz: db40d63c48275bc3df6d5575931097fa30448dd5
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: fe9de938d89f49f69bb1aabcd1d870e04fba1ec4ee32191f9b2c5753a3d0b7a39bf2715ae93378007563f1865f98a7e860674dd08bed81efd4efcf6e33e3f698
|
7
|
+
data.tar.gz: ef30a47864ea175ac621a9c12cee0d60fcd572ca0ce10302203e2d41fe4e84358bc5534dc6d759367e1dae978593434a0dde0e1e06b099d3b13cd4340f3cde3d
|
data/Gemfile
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
source "http://rubygems.org"
|
2
|
-
# Add dependencies required to use your gem here.
|
3
|
-
# Example:
|
4
|
-
# gem "activesupport", ">= 2.3.5"
|
5
|
-
|
6
|
-
# Add dependencies to develop your gem here.
|
7
|
-
# Include everything needed to run rake, tests, features, etc.
|
8
2
|
group :development do
|
9
3
|
gem "cucumber"
|
10
4
|
gem "jeweler"
|
11
5
|
gem "bundler"
|
6
|
+
gem "rspec"
|
7
|
+
gem "tokyocabinet"
|
8
|
+
gem "localmemcache"
|
9
|
+
gem "moneta"
|
12
10
|
end
|
13
|
-
|
14
|
-
gem "
|
11
|
+
# The following are optional (Ruby serialize is the default)
|
12
|
+
# gem "tokyocabinet"
|
13
|
+
# gem "localmemcache"
|
14
|
+
# gem "moneta"
|
data/README.md
CHANGED
@@ -6,8 +6,22 @@ Bio-locus is a tool for fast querying of genome locations. Many file
|
|
6
6
|
formats in bioinformatics contain records that start with a chromosome
|
7
7
|
name and a position for a SNP, or a start-end position for indels.
|
8
8
|
|
9
|
-
This tool essentially allows your to store this
|
10
|
-
|
9
|
+
This tool essentially allows your to store this chr+pos or chr+pos+alt
|
10
|
+
information in a fast database.
|
11
|
+
|
12
|
+
Why would you use bio-locus?
|
13
|
+
|
14
|
+
1. Fast comparison of VCF files and other formats that use chr+pos
|
15
|
+
2. Fast comparison of VCF files and other formats that use chr+pos+alt
|
16
|
+
3. See what positions match an EVS or GoNL database
|
17
|
+
4. Compare locations from databases such as the TCGA and COSMIC
|
18
|
+
5. Comparison of overlap or difference
|
19
|
+
|
20
|
+
In principle any of the Moneta supported backends can be used,
|
21
|
+
including LocalMemCache, RubySerialize and TokyoCabinet. The default
|
22
|
+
is RubySerialize because it works out of the box.
|
23
|
+
|
24
|
+
Usage:
|
11
25
|
|
12
26
|
```sh
|
13
27
|
bio-locus --store < one.vcf
|
@@ -19,7 +33,7 @@ listed alt alleles. To find positions in another dataset which match
|
|
19
33
|
those in the database:
|
20
34
|
|
21
35
|
```sh
|
22
|
-
bio-locus --match < two.vcf
|
36
|
+
bio-locus --match < two.vcf > matched.vcf
|
23
37
|
```
|
24
38
|
|
25
39
|
The point is that this is a two-step process, first create the
|
@@ -29,40 +43,46 @@ with the --delete switch.
|
|
29
43
|
To match with alt use
|
30
44
|
|
31
45
|
```sh
|
32
|
-
bio-locus --match --
|
46
|
+
bio-locus --match --alt only < two.vcf > matched.vcf
|
33
47
|
```
|
34
48
|
|
35
|
-
|
49
|
+
So, with bio-locus you can
|
36
50
|
|
37
|
-
*
|
38
|
-
*
|
39
|
-
*
|
40
|
-
*
|
51
|
+
* reduce the size of large SNP databases before storage/querying
|
52
|
+
* gain performance
|
53
|
+
* filter on chr+pos (default)
|
54
|
+
* filter on chr+pos+field (where field can be a VCF ALT)
|
41
55
|
|
42
56
|
Use cases are
|
43
57
|
|
44
|
-
* To filter for annotated variants
|
58
|
+
* To filter for annotated variants (including INDELS)
|
45
59
|
* To remove common variants from a set
|
46
60
|
|
47
61
|
In short a more targeted approach allowing you to work with less data. This
|
48
|
-
tool is decently fast. For example, looking for 130 positions in 20 million
|
49
|
-
in GoNL takes 0.11s to store and 1.5 minutes to match on my laptop
|
62
|
+
tool is decently fast. For example, looking for 130 positions in 20 million
|
63
|
+
SNPs in GoNL takes 0.11s to store and 1.5 minutes to match on my laptop (using
|
64
|
+
localmemcache):
|
50
65
|
|
51
66
|
```sh
|
52
|
-
cat my_130_variants.vcf | ./bin/bio-locus --store
|
67
|
+
cat my_130_variants.vcf | ./bin/bio-locus --store --storage :localmemcache
|
53
68
|
Stored 130 positions out of 130 in locus.db
|
54
69
|
real 0m0.119s
|
55
70
|
user 0m0.108s
|
56
71
|
sys 0m0.012s
|
57
72
|
|
58
|
-
cat gonl.*.vcf |./bin/bio-locus --match
|
73
|
+
cat gonl.*.vcf |./bin/bio-locus --match --storage :localmemcache
|
59
74
|
Matched 3 out of 20736323 lines in locus.db!
|
60
75
|
real 1m34.577s
|
61
76
|
user 1m33.602s
|
62
77
|
sys 0m1.868s
|
63
78
|
```
|
64
79
|
|
65
|
-
Note: for the storage the
|
80
|
+
Note: for the storage here the
|
81
|
+
[moneta](https://github.com/minad/moneta) gem is used, currently with
|
82
|
+
localmemcache. The default mode for bio-locus is Ruby serialization,
|
83
|
+
and :tokyocabinet is also supported. The larger your data becomes, the
|
84
|
+
more likely it is that you need :tokyocabinet because the others are
|
85
|
+
more RAM oriented.
|
66
86
|
|
67
87
|
Note: the ALT field is split into components for matching, so A,C
|
68
88
|
becomes two chr+pos records, one for A and one for C.
|
@@ -82,18 +102,25 @@ of options available through
|
|
82
102
|
bio-locus --help
|
83
103
|
```
|
84
104
|
|
105
|
+
The most important one is the handling of ALT. Both with --store and
|
106
|
+
--match ALT (chr+pos+alt) can be matched in conjuction with POS
|
107
|
+
(chr+pos). When using --alt only, only ALT is matched. When using
|
108
|
+
--alt include, both ALT and POS are matched. When using --alt exclude,
|
109
|
+
only POS is matched.
|
110
|
+
|
111
|
+
|
85
112
|
### Deleting keys
|
86
113
|
|
87
|
-
To delete entries use
|
114
|
+
To delete entries from the database use
|
88
115
|
|
89
116
|
```sh
|
90
117
|
bio-locus --delete < two.vcf
|
91
118
|
```
|
92
119
|
|
93
|
-
To match with alt use
|
120
|
+
To delete those that match with alt use
|
94
121
|
|
95
122
|
```sh
|
96
|
-
bio-locus --delete --
|
123
|
+
bio-locus --delete --alt only < two.vcf
|
97
124
|
```
|
98
125
|
|
99
126
|
You may need to run both with and without alt, depending on your needs!
|
@@ -113,29 +140,64 @@ can be done with
|
|
113
140
|
bio-locus --store --eval-alt 'field[2].split(/\//)[1]'
|
114
141
|
```
|
115
142
|
|
143
|
+
Actually, if the --in-format is 'snv', this is exactly what is used.
|
144
|
+
|
116
145
|
### COSMIC
|
117
146
|
|
118
147
|
COSMIC is pretty large, so it can be useful to cut the database down to the
|
119
148
|
variants that you have. The locus information is combined
|
120
149
|
in the before last column as chr:start-end, e.g.,
|
121
|
-
19:58861911-58861911. This
|
150
|
+
19:58861911-58861911. This may work for COSMICv68
|
122
151
|
|
123
152
|
```sh
|
124
153
|
bio-locus -i --match --eval-chr='field[13] =~ /^([^:]+)/ ; $1' --eval-pos='field[13] =~ /:(\d+)-/ ; $1 ' < CosmicMutantExportIncFus_v68.tsv
|
125
154
|
```
|
126
155
|
|
156
|
+
You may also use the --in-format cosmic switch for supported COSMIC
|
157
|
+
versions.
|
158
|
+
|
127
159
|
Note the -i switch is needed to skip records that lack position
|
128
|
-
information.
|
160
|
+
information or are non-SNV.
|
161
|
+
|
162
|
+
## GoNL INDEL example
|
129
163
|
|
130
|
-
|
164
|
+
Here an example of filtering out all INDELs that also exist in a
|
165
|
+
different dataste, in this case
|
166
|
+
[GoNL](http://www.genoomvannederland.nl/) which provides a database of
|
167
|
+
population INDELs in VCF format. First we use
|
168
|
+
[bio-vcf](https://github.com/pjotrp/bioruby-vcf) to create a
|
169
|
+
subset of common INDELS:
|
131
170
|
|
132
|
-
```
|
133
|
-
|
171
|
+
```sh
|
172
|
+
cat gonl.*.snps_indels.r5.vcf |bio-vcf --filter 'r.info.set=="INDEL" and r.info.af>0.05' > gonl_indel0.05.vcf
|
173
|
+
```
|
174
|
+
|
175
|
+
Create a locus database from this VCF
|
176
|
+
|
177
|
+
```sh
|
178
|
+
bio-locus --store --db gonl_indel0.05.db --alt only < gonl_indel0.05.vcf
|
179
|
+
Stored 480639 positions out of 480639 in gonl_indel0.05.db (0 duplicate hits)
|
180
|
+
```
|
181
|
+
|
182
|
+
Next, we take our datafile and filter for INDELs that are
|
183
|
+
in the population set
|
184
|
+
|
185
|
+
```sh
|
186
|
+
bio-locus --match -v --db gonl_indel0.05.db --alt only < varscan2_indel_nfreq30_tfreq30.vcf > /dev/null
|
187
|
+
Matched 635 (unique 75) lines out of 1005 (header 18, unique 174) in gonl_indel0.05.db!
|
188
|
+
```
|
189
|
+
Which says that 75 INDELs were population matches. We have 635 hits
|
190
|
+
because there are multiple samples in this VCF.
|
191
|
+
|
192
|
+
This is not what we want in our file, so now we take our datafile and
|
193
|
+
filter for INDELs that are *not* in the population set
|
194
|
+
|
195
|
+
```sh
|
196
|
+
bio-locus --match -v --db gonl_indel0.05.db --alt only < varscan2_indel_nfreq30_tfreq30.vcf > unique_indels.vcf
|
197
|
+
Matched 370 (unique 99) lines out of 1005 (header 18, unique 174) in gonl_indel0.05.db!
|
134
198
|
```
|
199
|
+
So now we have 99 INDELs for this dataset which are not common INDELs.
|
135
200
|
|
136
|
-
The API doc is online. For more code examples see the test files in
|
137
|
-
the source tree.
|
138
|
-
|
139
201
|
## Project home page
|
140
202
|
|
141
203
|
Information on the source tree, documentation, examples, issues and
|
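Reviewer note on the README change: the COSMIC section describes loci written as chr:start-end (for example 19:58861911-58861911). The following is a minimal sketch of that parsing step, assuming only single-base records are wanted. The name `parse_cosmic_locus` is hypothetical and is not part of bio-locus.

```ruby
# Hypothetical helper, not part of bio-locus: split a COSMIC-style
# "chr:start-end" locus into chromosome and position strings.
def parse_cosmic_locus(locus)
  m = /\A([^:]+):(\d+)-(\d+)\z/.match(locus.to_s.strip)
  return nil unless m            # no usable position information
  return nil unless m[2] == m[3] # start != end, so not a single-base locus
  [m[1], m[2]]
end

p parse_cosmic_locus("19:58861911-58861911") # => ["19", "58861911"]
p parse_cosmic_locus("19:100-105")           # => nil
```

Returning nil for ranges and malformed fields corresponds to the records that the -i switch is described as skipping.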
data/Rakefile
CHANGED
@@ -25,21 +25,27 @@ Jeweler::Tasks.new do |gem|
|
|
25
25
|
end
|
26
26
|
Jeweler::RubygemsDotOrgTasks.new
|
27
27
|
|
28
|
-
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
28
|
+
require 'rspec/core'
|
29
|
+
require 'rspec/core/rake_task'
|
30
|
+
RSpec::Core::RakeTask.new(:spec) do |spec|
|
31
|
+
spec.pattern = FileList['spec/**/*_spec.rb']
|
32
|
+
end
|
33
33
|
|
34
34
|
# RSpec::Core::RakeTask.new(:rcov) do |spec|
|
35
35
|
# spec.pattern = 'spec/**/*_spec.rb'
|
36
36
|
# spec.rcov = true
|
37
37
|
# end
|
38
38
|
|
39
|
-
require 'cucumber/rake/task'
|
40
|
-
Cucumber::Rake::Task.new(:features)
|
39
|
+
# require 'cucumber/rake/task'
|
40
|
+
# Cucumber::Rake::Task.new(:features)
|
41
41
|
|
42
42
|
task :default => :spec
|
43
|
+
task :test => [:spec]
|
44
|
+
|
45
|
+
RSpec::Core::RakeTask.new(:rcov) do |spec|
|
46
|
+
spec.pattern = 'spec/**/*_spec.rb'
|
47
|
+
spec.rcov = true
|
48
|
+
end
|
43
49
|
|
44
50
|
require 'rdoc/task'
|
45
51
|
Rake::RDocTask.new do |rdoc|
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.0.
|
1
|
+
0.0.6
|
data/bin/bio-locus
CHANGED
@@ -16,13 +16,16 @@ end
|
|
16
16
|
require 'bio-locus'
|
17
17
|
require 'optparse'
|
18
18
|
|
19
|
-
options = {task: nil, db: 'locus.db', show_help: false, header: 1}
|
19
|
+
options = {task: nil, db: 'locus.db', show_help: false, header: 1, in_format: :vcf, alt: :include, storage: :serialize}
|
20
20
|
opts = OptionParser.new do |o|
|
21
21
|
o.banner = "Usage: #{File.basename($0)} [options] filename\ne.g. #{File.basename($0)} test.txt"
|
22
22
|
|
23
23
|
o.on("--store", 'Create or add to a cache file') do
|
24
24
|
options[:task] = :store
|
25
|
-
|
25
|
+
end
|
26
|
+
|
27
|
+
o.on("--storage [:serialize,:tokyocabinet,:localmemcache]", [:serialize,:tokyocabinet,:localmemcache], 'Persistent cache type (default :serialize)') do |t|
|
28
|
+
options[:storage] = t
|
26
29
|
end
|
27
30
|
|
28
31
|
o.on("--delete", 'Remove matches from a cache file') do
|
@@ -33,40 +36,60 @@ opts = OptionParser.new do |o|
|
|
33
36
|
options[:task] = :match
|
34
37
|
end
|
35
38
|
|
36
|
-
o.on("--include-alt", 'Include chr+pos+ALT VCF field to filter') do
|
37
|
-
|
38
|
-
end
|
39
|
+
# o.on("--include-alt", 'Include chr+pos+ALT VCF field to filter') do
|
40
|
+
# options[:include_alt] = true
|
41
|
+
# end
|
39
42
|
|
40
|
-
o.on(
|
41
|
-
|
43
|
+
o.on('--alt [include,exclude,only]', [:include,:exclude,:only],
|
44
|
+
'Include, exclude, only ALT (default include)') do |par|
|
45
|
+
options[:alt] = par.to_sym
|
42
46
|
end
|
43
47
|
|
48
|
+
# o.on("--only-alt", 'Only look for chr+pos+ALT field in filter') do
|
49
|
+
# options[:only_alt] = true
|
50
|
+
# end
|
51
|
+
|
52
|
+
# o.on("--exclude-alt", 'Override adding chr+pos+ALT field to store') do
|
53
|
+
# options[:exclude_alt] = true
|
54
|
+
# end
|
44
55
|
|
45
56
|
o.on("--db filename",String,"Use db file") do | fn |
|
46
57
|
options[:db] = fn
|
47
58
|
end
|
48
59
|
|
49
|
-
o.on(
|
60
|
+
o.on('--in-format [vcf,tab,cosmic,snv]', [:vcf,:tab,:cosmic,:snv], 'Input format (default vcf)') do |par|
|
61
|
+
options[:in_format] = par.to_sym
|
62
|
+
end
|
63
|
+
|
64
|
+
o.on("--eval-chr expr",String,"Evaluate record to retrieve chr name (default field[0])") do | expr |
|
50
65
|
options[:eval_chr] = expr
|
51
66
|
end
|
52
67
|
|
53
|
-
o.on("--eval-pos expr",String,"Evaluate record to retrieve position") do | expr |
|
68
|
+
o.on("--eval-pos expr",String,"Evaluate record to retrieve position (default field[1])") do | expr |
|
54
69
|
options[:eval_pos] = expr
|
55
70
|
end
|
56
71
|
|
57
|
-
o.on("--eval-alt expr",String,"Evaluate record to retrieve alt list") do | expr |
|
72
|
+
o.on("--eval-alt expr",String,"Evaluate record to retrieve alt list (default field[4])") do | expr |
|
58
73
|
options[:eval_alt] = expr
|
59
74
|
end
|
60
|
-
|
75
|
+
|
61
76
|
o.on("--header num", "Header lines (default 1)") do |l|
|
62
77
|
options[:header] = l.to_i
|
63
78
|
end
|
79
|
+
|
80
|
+
o.on("-v", "--invert-match", "Invert the sense of matching, to select non-matching lines") do
|
81
|
+
options[:invert_match] = true
|
82
|
+
end
|
64
83
|
|
84
|
+
o.on("--header num", "Header lines (default 1)") do |l|
|
85
|
+
options[:header] = l.to_i
|
86
|
+
end
|
87
|
+
|
65
88
|
o.on("-q", "--quiet", "Run quietly") do |q|
|
66
89
|
options[:quiet] = true
|
67
90
|
end
|
68
91
|
|
69
|
-
o.on("
|
92
|
+
o.on("--verbose", "Run verbosely") do |v|
|
70
93
|
options[:verbose] = true
|
71
94
|
end
|
72
95
|
|
@@ -78,6 +101,10 @@ opts = OptionParser.new do |o|
|
|
78
101
|
options[:ignore_errors] = true
|
79
102
|
end
|
80
103
|
|
104
|
+
o.on("--once", "Only one copy stored/matched") do |q|
|
105
|
+
options[:once] = true
|
106
|
+
end
|
107
|
+
|
81
108
|
|
82
109
|
o.separator ""
|
83
110
|
o.on_tail('-h', '--help', 'display this help and exit') do
|
@@ -107,7 +134,6 @@ end
|
|
107
134
|
case options[:task]
|
108
135
|
when :store then
|
109
136
|
require 'bio-locus/store'
|
110
|
-
options[:include_alt]=false if options[:exclude_alt]
|
111
137
|
BioLocus::Store.run(options)
|
112
138
|
when :match ,:delete then
|
113
139
|
require 'bio-locus/match'
|
data/lib/bio-locus.rb
CHANGED
data/lib/bio-locus/dbmapper.rb
ADDED
@@ -0,0 +1,106 @@
|
|
1
|
+
module BioLocus
|
2
|
+
|
3
|
+
class SerializeMapper
|
4
|
+
def initialize dbname
|
5
|
+
@dbname = dbname
|
6
|
+
@h = {}
|
7
|
+
if File.exist?(@dbname)
|
8
|
+
@h = Marshal.load(File.read(@dbname))
|
9
|
+
end
|
10
|
+
end
|
11
|
+
|
12
|
+
def [] key
|
13
|
+
@h[key]
|
14
|
+
end
|
15
|
+
|
16
|
+
def []= key, value
|
17
|
+
@h[key] = value
|
18
|
+
end
|
19
|
+
|
20
|
+
def close
|
21
|
+
File.open(@dbname, 'w') {|f| f.write(Marshal.dump(@h)) }
|
22
|
+
end
|
23
|
+
end
|
24
|
+
|
25
|
+
class MonetaMapper
|
26
|
+
def initialize storage, dbname
|
27
|
+
begin
|
28
|
+
require 'moneta'
|
29
|
+
rescue LoadError
|
30
|
+
$stderr.print "Error: Missing moneta. Install with command 'gem install moneta'\n"
|
31
|
+
exit 1
|
32
|
+
end
|
33
|
+
@store = Moneta.new(storage, file: dbname)
|
34
|
+
end
|
35
|
+
|
36
|
+
def [] key
|
37
|
+
@store[key]
|
38
|
+
end
|
39
|
+
|
40
|
+
def []= key, value
|
41
|
+
@store[key] = value
|
42
|
+
end
|
43
|
+
|
44
|
+
def close
|
45
|
+
@store.close
|
46
|
+
end
|
47
|
+
end
|
48
|
+
|
49
|
+
class TokyoCabinetMapper
|
50
|
+
def initialize dbname
|
51
|
+
begin
|
52
|
+
require 'tokyocabinet'
|
53
|
+
rescue LoadError
|
54
|
+
$stderr.print "Error: Missing tokyocabinet. Install with command 'gem install tokyocabinet'\n"
|
55
|
+
exit 1
|
56
|
+
end
|
57
|
+
@hdb = TokyoCabinet::HDB::new
|
58
|
+
if File.exist?(dbname)
|
59
|
+
if !@hdb.open(dbname, TokyoCabinet::HDB::OREADER)
|
60
|
+
ecode = @hdb.ecode
|
61
|
+
raise sprintf("open error: %s\n", @hdb.errmsg(ecode))
|
62
|
+
end
|
63
|
+
else
|
64
|
+
if !@hdb.open(dbname, TokyoCabinet::HDB::OWRITER | TokyoCabinet::HDB::OCREAT)
|
65
|
+
ecode = @hdb.ecode
|
66
|
+
raise sprintf("open error: %s\n", @hdb.errmsg(ecode))
|
67
|
+
end
|
68
|
+
end
|
69
|
+
end
|
70
|
+
|
71
|
+
def [] key
|
72
|
+
@hdb.get(key)
|
73
|
+
end
|
74
|
+
|
75
|
+
def []= key, value
|
76
|
+
if !@hdb.put(key,value)
|
77
|
+
ecode = @hdb.ecode
|
78
|
+
raise sprintf("put error: %s\n", @hdb.errmsg(ecode))
|
79
|
+
end
|
80
|
+
end
|
81
|
+
|
82
|
+
def close
|
83
|
+
if !@hdb.close
|
84
|
+
ecode = @hdb.ecode
|
85
|
+
raise sprintf("close error: %s\n", @hdb.errmsg(ecode))
|
86
|
+
end
|
87
|
+
end
|
88
|
+
end
|
89
|
+
|
90
|
+
module DbMapper
|
91
|
+
def DbMapper::factory options
|
92
|
+
dbname = options[:db]
|
93
|
+
if File.exist?(dbname)
|
94
|
+
$stderr.print "Database #{dbname} exists!\n"
|
95
|
+
end
|
96
|
+
case options[:storage]
|
97
|
+
when :tokyocabinet
|
98
|
+
TokyoCabinetMapper.new(dbname)
|
99
|
+
when :localmemcache
|
100
|
+
MonetaMapper.new(:LocalMemCache,dbname)
|
101
|
+
else
|
102
|
+
SerializeMapper.new(dbname)
|
103
|
+
end
|
104
|
+
end
|
105
|
+
end
|
106
|
+
end
|
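Reviewer note: the three mapper classes in the hunk above share one small interface (`[]`, `[]=` and `close`). The sketch below shows that interface round-tripping through a Marshal file. `TinyMarshalStore` and `roundtrip_demo` are hypothetical names used only for illustration, and the sketch reads and writes in binary mode, which is an assumption rather than what the diff does.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a Marshal-backed mapper; illustration only.
class TinyMarshalStore
  def initialize(dbname)
    @dbname = dbname
    # Load the existing hash if the file is there, else start empty.
    @h = File.exist?(dbname) ? Marshal.load(File.binread(dbname)) : {}
  end

  def [](key)
    @h[key]
  end

  def []=(key, value)
    @h[key] = value
  end

  # Nothing is persisted until close is called.
  def close
    File.binwrite(@dbname, Marshal.dump(@h))
  end
end

# Store one key, close, reopen, and look up a stored and a missing key.
def roundtrip_demo
  Dir.mktmpdir do |dir|
    fn = File.join(dir, 'demo.db')
    store = TinyMarshalStore.new(fn)
    store["1\t100"] = true
    store.close
    reopened = TinyMarshalStore.new(fn)
    [reopened["1\t100"], reopened["1\t200"]]
  end
end

p roundtrip_demo # => [true, nil]
```

Because the whole hash lives in memory until `close`, this style of store suits small databases, which is consistent with the README's remark that larger data may need :tokyocabinet.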
data/lib/bio-locus/locus.rb
CHANGED
@@ -1,41 +1,70 @@
|
|
1
1
|
|
2
2
|
module BioLocus
|
3
3
|
module Keys
|
4
|
+
@@in_list = {}
|
5
|
+
|
4
6
|
def Keys::each_key(line,options)
|
7
|
+
use_alt = (options[:alt] == :include or options[:alt] == :only)
|
8
|
+
use_pos = (options[:alt] == :include or options[:alt] == :exclude)
|
9
|
+
|
5
10
|
if line =~ /^[[:alnum:]]+/
|
6
11
|
fields = nil
|
7
|
-
# The default layout (VCF) may or may not work
|
12
|
+
# The default layout (VCF) may or may not work. Critically
|
13
|
+
# chr,pos and alt are expected in positions 0,1,4 respectively.
|
8
14
|
chr,pos,id,no_use,alt,rest = line.split(/\t/,6)[0..-1]
|
9
|
-
|
10
|
-
|
11
|
-
fields ||= line.split(/\t/)
|
12
|
-
field = fields
|
13
|
-
chr = eval(options[:eval_chr])
|
14
|
-
end
|
15
|
-
if options[:eval_pos]
|
16
|
-
fields ||= line.split(/\t/)
|
17
|
-
field = fields
|
18
|
-
pos = eval(options[:eval_pos])
|
19
|
-
end
|
20
|
-
if options[:eval_alt]
|
21
|
-
fields ||= line.split(/\t/)
|
15
|
+
if options[:in_format] or options[:eval_chr] or options[:eval_pos] or options[:eval_alt]
|
16
|
+
fields = line.split(/\t/)
|
22
17
|
field = fields
|
23
|
-
|
18
|
+
case options[:in_format]
|
19
|
+
when :tab then
|
20
|
+
# chr,pos,ref,alt
|
21
|
+
alt = field[3].strip.split(/,/)[0] if field[3]
|
22
|
+
when :snv then
|
23
|
+
alt = field[2].split(/\//)[1] if field[2]
|
24
|
+
when :cosmic then
|
25
|
+
# COSMIC tsv files, either in field 17 (COSMICv70)
|
26
|
+
locus_field = field[17]
|
27
|
+
locus_field = field[13] if locus_field !~ /:/
|
28
|
+
if field[15] !~ /delet/i and locus_field =~ /:/
|
29
|
+
chr = /^([^:]+)/.match(locus_field)[1]
|
30
|
+
a = /:(\d+)-(\d+)/.match(locus_field)
|
31
|
+
pos = a[1] if a[1]==a[2]
|
32
|
+
end
|
33
|
+
end
|
34
|
+
# Override parsing with
|
35
|
+
if options[:eval_chr]
|
36
|
+
chr = eval(options[:eval_chr])
|
37
|
+
end
|
38
|
+
if options[:eval_pos]
|
39
|
+
pos = eval(options[:eval_pos])
|
40
|
+
end
|
41
|
+
if options[:eval_alt]
|
42
|
+
alt = eval(options[:eval_alt])
|
43
|
+
end
|
24
44
|
end
|
25
|
-
p [chr,pos] if options[:debug]
|
45
|
+
# p [:debug,chr,pos,alt] if options[:debug]
|
26
46
|
|
27
47
|
# If we have a position emit it
|
28
48
|
if pos =~ /^\d+$/ and chr and chr != ''
|
29
|
-
alts =
|
30
|
-
|
49
|
+
alts = if use_pos
|
50
|
+
['']
|
51
|
+
else
|
52
|
+
[]
|
53
|
+
end
|
54
|
+
alts += alt.split(/,/) if use_alt and alt
|
31
55
|
alts.each do | nuc |
|
32
56
|
key = chr+"\t"+pos
|
33
57
|
key += "\t"+nuc if nuc != ''
|
58
|
+
if options[:once]
|
59
|
+
# check we haven't already sent this out in this run
|
60
|
+
return if @@in_list[key]
|
61
|
+
@@in_list[key] = true
|
62
|
+
end
|
34
63
|
yield key
|
35
64
|
end
|
36
65
|
else
|
37
66
|
if options[:ignore_errors]
|
38
|
-
$stderr.print "WARNING, skipping: ",line if not options[:quiet]
|
67
|
+
$stderr.print "WARNING, <#{chr}:#{pos}> skipping: ",line if not options[:quiet]
|
39
68
|
else
|
40
69
|
p line
|
41
70
|
p fields
|
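Reviewer note: the `use_pos` and `use_alt` flags in the hunk above decide which keys each record emits. The following sketch restates that logic as a pure function so the three --alt modes can be compared side by side. `locus_keys` is a hypothetical name, and the sketch leaves out the --once bookkeeping and the input-format handling.

```ruby
# Hypothetical helper mirroring the use_pos/use_alt logic; illustration only.
def locus_keys(chr, pos, alt, mode = :include)
  use_alt = [:include, :only].include?(mode)
  use_pos = [:include, :exclude].include?(mode)
  alts = use_pos ? [''] : []                  # '' stands for the bare chr+pos key
  alts += alt.split(/,/) if use_alt && alt    # one key per listed ALT allele
  alts.map { |nuc| nuc.empty? ? "#{chr}\t#{pos}" : "#{chr}\t#{pos}\t#{nuc}" }
end

p locus_keys('1', '100', 'A,C', :include) # => ["1\t100", "1\t100\tA", "1\t100\tC"]
p locus_keys('1', '100', 'A,C', :only)    # => ["1\t100\tA", "1\t100\tC"]
p locus_keys('1', '100', 'A,C', :exclude) # => ["1\t100"]
```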
data/lib/bio-locus/match.rb
CHANGED
@@ -1,42 +1,58 @@
|
|
1
1
|
module BioLocus
|
2
|
-
|
3
|
-
require 'moneta'
|
4
|
-
|
5
2
|
module Match
|
6
3
|
def Match.run(options)
|
7
4
|
do_delete = (options[:task] == :delete)
|
8
|
-
|
5
|
+
invert_match = options[:invert_match]
|
6
|
+
store = DbMapper.factory(options)
|
9
7
|
lines = 0
|
8
|
+
header_lines = 0
|
10
9
|
count = 0
|
11
10
|
in_header = true
|
12
|
-
|
11
|
+
uniq_match = {}
|
12
|
+
uniq_no_match = {}
|
13
13
|
STDIN.each_line do | line |
|
14
14
|
if in_header and line =~ /^#/
|
15
15
|
# Retain comments in header (for VCF)
|
16
16
|
print line
|
17
|
+
header_lines += 1
|
17
18
|
next
|
18
19
|
else
|
19
20
|
in_header = false
|
20
21
|
end
|
21
|
-
|
22
|
+
if line =~ /^#/
|
23
|
+
header_lines += 1
|
24
|
+
else
|
25
|
+
lines += 1
|
26
|
+
end
|
22
27
|
$stderr.print '.' if (lines % 1_000_000) == 0 if not options[:quiet]
|
23
28
|
Keys::each_key(line,options) do | key |
|
24
|
-
|
29
|
+
has_match = lambda {
|
30
|
+
if invert_match
|
31
|
+
not store[key]
|
32
|
+
else
|
33
|
+
store[key]
|
34
|
+
end
|
35
|
+
}
|
36
|
+
if has_match.call
|
37
|
+
# We have a match
|
38
|
+
$stderr.print "Matched <#{key}>\n" if options[:debug]
|
25
39
|
count += 1
|
26
40
|
if do_delete
|
27
41
|
store.delete(key)
|
28
42
|
else
|
29
43
|
print line
|
30
|
-
|
44
|
+
uniq_match[key] ||= true
|
31
45
|
end
|
46
|
+
else
|
47
|
+
uniq_no_match[key] ||= true
|
32
48
|
end
|
33
49
|
end
|
34
50
|
end
|
35
51
|
store.close
|
36
52
|
if do_delete
|
37
|
-
$stderr.print "\nDeleted #{count} keys
|
53
|
+
$stderr.print "\nDeleted #{count} keys in #{options[:db]} reading #{lines} lines !\n" if not options[:quiet]
|
38
54
|
else
|
39
|
-
$stderr.print "\nMatched #{count} (unique #{
|
55
|
+
$stderr.print "\nMatched #{count} (unique #{uniq_match.keys.size}) lines out of #{lines} (header #{header_lines}, unique #{uniq_no_match.keys.size+uniq_match.keys.size}) in #{options[:db]}!\n" if not options[:quiet]
|
40
56
|
end
|
41
57
|
end
|
42
58
|
end
|
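Reviewer note: the new -v/--invert-match switch flips the lookup result, much like grep -v. The sketch below shows the idea with a plain Hash standing in for the database and chr+pos keys only. `filter_lines` is a hypothetical name, not a bio-locus function.

```ruby
# Hypothetical helper: keep lines whose chr+pos key is in the store,
# or the lines whose key is absent when invert is true.
def filter_lines(lines, store, invert: false)
  lines.select do |line|
    chr, pos = line.chomp.split("\t", 3)
    hit = store.key?("#{chr}\t#{pos}")
    invert ? !hit : hit
  end
end

store = { "1\t100" => true }
lines = ["1\t100\trs1\n", "2\t5\trs2\n"]

p filter_lines(lines, store)               # => ["1\t100\trs1\n"]
p filter_lines(lines, store, invert: true) # => ["2\t5\trs2\n"]
```

The two calls partition the input, which matches the README example where 75 unique matches and 99 unique non-matches add up to 174 unique records.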
data/lib/bio-locus/store.rb
CHANGED
@@ -1,14 +1,20 @@
|
|
1
1
|
module BioLocus
|
2
2
|
|
3
|
-
require 'moneta'
|
4
|
-
|
5
3
|
module Store
|
6
4
|
def Store.run(options)
|
7
|
-
|
5
|
+
invert_match = options[:invert_match]
|
6
|
+
store = DbMapper.factory(options)
|
8
7
|
count = count_new = count_dup = 0
|
9
8
|
STDIN.each_line do | line |
|
10
9
|
Keys::each_key(line,options) do | key |
|
11
|
-
|
10
|
+
has_match = lambda {
|
11
|
+
if invert_match
|
12
|
+
not store[key]
|
13
|
+
else
|
14
|
+
store[key]
|
15
|
+
end
|
16
|
+
}
|
17
|
+
if not has_match.call
|
12
18
|
count_new += 1
|
13
19
|
store[key] = true
|
14
20
|
else
|
@@ -23,7 +29,7 @@ module BioLocus
|
|
23
29
|
end
|
24
30
|
end
|
25
31
|
store.close
|
26
|
-
$stderr.print "Stored #{count_new} positions out of #{count} in #{options[:db]} (#{count_dup} hits)\n" if !options[:quiet]
|
32
|
+
$stderr.print "Stored #{count_new} positions out of #{count} in #{options[:db]} (#{count_dup} duplicate hits)\n" if !options[:quiet]
|
27
33
|
end
|
28
34
|
end
|
29
35
|
end
|
data/spec/bio-locus_spec.rb
CHANGED
@@ -1,7 +1,37 @@
|
|
1
1
|
require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
|
2
2
|
|
3
|
-
describe "BioLocus" do
|
4
|
-
|
5
|
-
|
6
|
-
|
3
|
+
describe "BioLocus with Serialize" do
|
4
|
+
fn = 'biolocus_serialize.db'
|
5
|
+
store = BioLocus::DbMapper.factory({storage: :serialize, db: fn})
|
6
|
+
store['test'] = 'yes'
|
7
|
+
store['test2'] = 'no'
|
8
|
+
a = store['test']
|
9
|
+
store['test'].should == 'yes'
|
10
|
+
store['test2'].should == 'no'
|
11
|
+
store.close
|
12
|
+
File.unlink(fn)
|
13
|
+
end
|
14
|
+
|
15
|
+
describe "BioLocus with Moneta" do
|
16
|
+
fn = 'biolocus_moneta_localmemcache.db'
|
17
|
+
store = BioLocus::MonetaMapper.new(:LocalMemCache,fn)
|
18
|
+
store['test'] = 'yes'
|
19
|
+
store['test2'] = 'no'
|
20
|
+
a = store['test']
|
21
|
+
store['test'].should == 'yes'
|
22
|
+
store['test2'].should == 'no'
|
23
|
+
store.close
|
24
|
+
File.unlink(fn)
|
25
|
+
end
|
26
|
+
|
27
|
+
describe "BioLocus with TokyoCabinet" do
|
28
|
+
fn = 'biolocus_tokyocabinet.db'
|
29
|
+
store = BioLocus::TokyoCabinetMapper.new(fn)
|
30
|
+
store['test'] = 'yes'
|
31
|
+
store['test2'] = 'no'
|
32
|
+
a = store['test']
|
33
|
+
store['test'].should == 'yes'
|
34
|
+
store['test2'].should == 'no'
|
35
|
+
store.close
|
36
|
+
File.unlink(fn)
|
7
37
|
end
|
metadata
CHANGED
@@ -1,23 +1,23 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: bio-locus
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.6
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Pjotr Prins
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2014-
|
11
|
+
date: 2014-10-10 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
|
-
name:
|
14
|
+
name: cucumber
|
15
15
|
requirement: !ruby/object:Gem::Requirement
|
16
16
|
requirements:
|
17
17
|
- - ">="
|
18
18
|
- !ruby/object:Gem::Version
|
19
19
|
version: '0'
|
20
|
-
type: :
|
20
|
+
type: :development
|
21
21
|
prerelease: false
|
22
22
|
version_requirements: !ruby/object:Gem::Requirement
|
23
23
|
requirements:
|
@@ -25,13 +25,13 @@ dependencies:
|
|
25
25
|
- !ruby/object:Gem::Version
|
26
26
|
version: '0'
|
27
27
|
- !ruby/object:Gem::Dependency
|
28
|
-
name:
|
28
|
+
name: jeweler
|
29
29
|
requirement: !ruby/object:Gem::Requirement
|
30
30
|
requirements:
|
31
31
|
- - ">="
|
32
32
|
- !ruby/object:Gem::Version
|
33
33
|
version: '0'
|
34
|
-
type: :
|
34
|
+
type: :development
|
35
35
|
prerelease: false
|
36
36
|
version_requirements: !ruby/object:Gem::Requirement
|
37
37
|
requirements:
|
@@ -39,7 +39,7 @@ dependencies:
|
|
39
39
|
- !ruby/object:Gem::Version
|
40
40
|
version: '0'
|
41
41
|
- !ruby/object:Gem::Dependency
|
42
|
-
name:
|
42
|
+
name: bundler
|
43
43
|
requirement: !ruby/object:Gem::Requirement
|
44
44
|
requirements:
|
45
45
|
- - ">="
|
@@ -53,7 +53,7 @@ dependencies:
|
|
53
53
|
- !ruby/object:Gem::Version
|
54
54
|
version: '0'
|
55
55
|
- !ruby/object:Gem::Dependency
|
56
|
-
name:
|
56
|
+
name: rspec
|
57
57
|
requirement: !ruby/object:Gem::Requirement
|
58
58
|
requirements:
|
59
59
|
- - ">="
|
@@ -67,7 +67,35 @@ dependencies:
|
|
67
67
|
- !ruby/object:Gem::Version
|
68
68
|
version: '0'
|
69
69
|
- !ruby/object:Gem::Dependency
|
70
|
-
name:
|
70
|
+
name: tokyocabinet
|
71
|
+
requirement: !ruby/object:Gem::Requirement
|
72
|
+
requirements:
|
73
|
+
- - ">="
|
74
|
+
- !ruby/object:Gem::Version
|
75
|
+
version: '0'
|
76
|
+
type: :development
|
77
|
+
prerelease: false
|
78
|
+
version_requirements: !ruby/object:Gem::Requirement
|
79
|
+
requirements:
|
80
|
+
- - ">="
|
81
|
+
- !ruby/object:Gem::Version
|
82
|
+
version: '0'
|
83
|
+
- !ruby/object:Gem::Dependency
|
84
|
+
name: localmemcache
|
85
|
+
requirement: !ruby/object:Gem::Requirement
|
86
|
+
requirements:
|
87
|
+
- - ">="
|
88
|
+
- !ruby/object:Gem::Version
|
89
|
+
version: '0'
|
90
|
+
type: :development
|
91
|
+
prerelease: false
|
92
|
+
version_requirements: !ruby/object:Gem::Requirement
|
93
|
+
requirements:
|
94
|
+
- - ">="
|
95
|
+
- !ruby/object:Gem::Version
|
96
|
+
version: '0'
|
97
|
+
- !ruby/object:Gem::Dependency
|
98
|
+
name: moneta
|
71
99
|
requirement: !ruby/object:Gem::Requirement
|
72
100
|
requirements:
|
73
101
|
- - ">="
|
@@ -103,6 +131,7 @@ files:
|
|
103
131
|
- features/step_definitions/bio-locus_steps.rb
|
104
132
|
- features/support/env.rb
|
105
133
|
- lib/bio-locus.rb
|
134
|
+
- lib/bio-locus/dbmapper.rb
|
106
135
|
- lib/bio-locus/locus.rb
|
107
136
|
- lib/bio-locus/match.rb
|
108
137
|
- lib/bio-locus/store.rb
|