bio-locus 0.0.2 → 0.0.6
- checksums.yaml +4 -4
- data/Gemfile +8 -8
- data/README.md +88 -26
- data/Rakefile +13 -7
- data/VERSION +1 -1
- data/bin/bio-locus +39 -13
- data/lib/bio-locus.rb +1 -0
- data/lib/bio-locus/dbmapper.rb +106 -0
- data/lib/bio-locus/locus.rb +48 -19
- data/lib/bio-locus/match.rb +26 -10
- data/lib/bio-locus/store.rb +11 -5
- data/spec/bio-locus_spec.rb +34 -4
- metadata +38 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: eef39503225998bf18b16997dd0cea5926ef6db2
+  data.tar.gz: db40d63c48275bc3df6d5575931097fa30448dd5
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fe9de938d89f49f69bb1aabcd1d870e04fba1ec4ee32191f9b2c5753a3d0b7a39bf2715ae93378007563f1865f98a7e860674dd08bed81efd4efcf6e33e3f698
+  data.tar.gz: ef30a47864ea175ac621a9c12cee0d60fcd572ca0ce10302203e2d41fe4e84358bc5534dc6d759367e1dae978593434a0dde0e1e06b099d3b13cd4340f3cde3d
data/Gemfile
CHANGED
@@ -1,14 +1,14 @@
 source "http://rubygems.org"
-# Add dependencies required to use your gem here.
-# Example:
-# gem "activesupport", ">= 2.3.5"
-
-# Add dependencies to develop your gem here.
-# Include everything needed to run rake, tests, features, etc.
 group :development do
   gem "cucumber"
   gem "jeweler"
   gem "bundler"
+  gem "rspec"
+  gem "tokyocabinet"
+  gem "localmemcache"
+  gem "moneta"
 end
-
-gem "
+# The following are optional (Ruby serialize is the default)
+# gem "tokyocabinet"
+# gem "localmemcache"
+# gem "moneta"
data/README.md
CHANGED
@@ -6,8 +6,22 @@ Bio-locus is a tool for fast querying of genome locations. Many file
 formats in bioinformatics contain records that start with a chromosome
 name and a position for a SNP, or a start-end position for indels.
 
-This tool essentially allows your to store this
-
+This tool essentially allows your to store this chr+pos or chr+pos+alt
+information in a fast database.
+
+Why would you use bio-locus?
+
+1. Fast comparison of VCF files and other formats that use chr+pos
+2. Fast comparison of VCF files and other formats that use chr+pos+alt
+3. See what positions match an EVS or GoNL database
+4. Compare locations from databases such as the TCGA and COSMIC
+5. Comparison of overlap or difference
+
+In principle any of the Moneta supported backends can be used,
+including LocalMemCache, RubySerialize and TokyoCabinet. The default
+is RubySerialize because it works out of the box.
+
+Usage:
 
 ```sh
 bio-locus --store < one.vcf
@@ -19,7 +33,7 @@ listed alt alleles. To find positions in another dataset which match
 those in the database:
 
 ```sh
-bio-locus --match < two.vcf
+bio-locus --match < two.vcf > matched.vcf
 ```
 
 The point is that this is a two-step process, first create the
@@ -29,40 +43,46 @@ with the --delete switch.
 To match with alt use
 
 ```sh
-bio-locus --match --
+bio-locus --match --alt only < two.vcf > matched.vcf
 ```
 
-
+So, with bio-locus you can
 
-*
-*
-*
-*
+* reduce the size of large SNP databases before storage/querying
+* gain performance
+* filter on chr+pos (default)
+* filter on chr+pos+field (where field can be a VCF ALT)
 
 Use cases are
 
-* To filter for annotated variants
+* To filter for annotated variants (including INDELS)
 * To remove common variants from a set
 
 In short a more targeted approach allowing you to work with less data. This
-tool is decently fast. For example, looking for 130 positions in 20 million
-in GoNL takes 0.11s to store and 1.5 minutes to match on my laptop
+tool is decently fast. For example, looking for 130 positions in 20 million
+SNPs in GoNL takes 0.11s to store and 1.5 minutes to match on my laptop (using
+localmemcache):
 
 ```sh
-cat my_130_variants.vcf | ./bin/bio-locus --store
+cat my_130_variants.vcf | ./bin/bio-locus --store --storage :localmemcache
 Stored 130 positions out of 130 in locus.db
 real 0m0.119s
 user 0m0.108s
 sys 0m0.012s
 
-cat gonl.*.vcf |./bin/bio-locus --match
+cat gonl.*.vcf |./bin/bio-locus --match --storage :localmemcache
 Matched 3 out of 20736323 lines in locus.db!
 real 1m34.577s
 user 1m33.602s
 sys 0m1.868s
 ```
 
-Note: for the storage the
+Note: for the storage here the
+[moneta](https://github.com/minad/moneta) gem is used, currently with
+localmemcache. The default mode for bio-locus is Ruby serialization,
+and :tokyocabinet is also supported. The larger your data becomes, the
+more likely it is that you need :tokyocabinet because the others are
+more RAM oriented.
 
 Note: the ALT field is split into components for matching, so A,C
 becomes two chr+pos records, one for A and one for C.
@@ -82,18 +102,25 @@ of options available through
 bio-locus --help
 ```
 
+The most important one is the handling of ALT. Both with --store and
+--match ALT (chr+pos+alt) can be matched in conjuction with POS
+(chr+pos). When using --alt only, only ALT is matched. When using
+--alt include, both ALT and POS are matched. When using --alt exclude,
+only POS is matched.
+
+
 ### Deleting keys
 
-To delete entries use
+To delete entries from the database use
 
 ```sh
 bio-locus --delete < two.vcf
 ```
 
-To match with alt use
+To delete those that match with alt use
 
 ```sh
-bio-locus --delete --
+bio-locus --delete --alt only < two.vcf
 ```
 
 You may need to run both with and without alt, depending on your needs!
@@ -113,29 +140,64 @@ can be done with
 bio-locus --store --eval-alt 'field[2].split(/\//)[1]'
 ```
 
+Actually, if the --in-format is 'snv', this is exactly what is used.
+
 ### COSMIC
 
 COSMIC is pretty large, so it can be useful to cut the database down to the
 variants that you have. The locus information is combined
 in the before last column as chr:start-end, e.g.,
-19:58861911-58861911. This
+19:58861911-58861911. This may work for COSMICv68
 
 ```sh
 bio-locus -i --match --eval-chr='field[13] =~ /^([^:]+)/ ; $1' --eval-pos='field[13] =~ /:(\d+)-/ ; $1 ' < CosmicMutantExportIncFus_v68.tsv
 ```
 
+You may also use the --in-format cosmic switch for supported COSMIC
+versions.
+
 Note the -i switch is needed to skip records that lack position
-information.
+information or are non-SNV.
+
+## GoNL INDEL example
 
-
+Here an example of filtering out all INDELs that also exist in a
+different dataste, in this case
+[GoNL](http://www.genoomvannederland.nl/) which provides a database of
+population INDELs in VCF format. First we use
+[bio-vcf](https://github.com/pjotrp/bioruby-vcf) to create a
+subset of common INDELS:
 
-```
-
+```sh
+cat gonl.*.snps_indels.r5.vcf |bio-vcf --filter 'r.info.set=="INDEL" and r.info.af>0.05' > gonl_indel0.05.vcf
+```
+
+Create a locus database from this VCF
+
+```sh
+bio-locus --store --db gonl_indel0.05.db --alt only < gonl_indel0.05.vcf
+Stored 480639 positions out of 480639 in gonl_indel0.05.db (0 duplicate hits)
+```
+
+Next, we take our datafile and filter for INDELs that are
+in the population set
+
+```sh
+bio-locus --match -v --db gonl_indel0.05.db --alt only < varscan2_indel_nfreq30_tfreq30.vcf > /dev/null
+Matched 635 (unique 75) lines out of 1005 (header 18, unique 174) in gonl_indel0.05.db!
+```
+Which says that 75 INDELs were population matches. We have 635 hits
+because there are multiple samples in this VCF.
+
+This is not what we want in our file, so now we take our datafile and
+filter for INDELs that are *not* in the population set
+
+```sh
+bio-locus --match -v --db gonl_indel0.05.db --alt only < varscan2_indel_nfreq30_tfreq30.vcf > unique_indels.vcf
+Matched 370 (unique 99) lines out of 1005 (header 18, unique 174) in gonl_indel0.05.db!
 ```
+So now we have 99 INDELs for this dataset which are not common INDELs.
 
-The API doc is online. For more code examples see the test files in
-the source tree.
-
 ## Project home page
 
 Information on the source tree, documentation, examples, issues and
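The README text in this diff describes the COSMIC locus column as a single `chr:start-end` string such as `19:58861911-58861911`. A minimal Ruby sketch of splitting that string is given here, assuming only that format; the helper name `parse_locus` is hypothetical and is not part of bio-locus.

```ruby
# Hypothetical helper (not part of bio-locus): split a COSMIC-style
# "chr:start-end" locus string into chromosome, start and stop.
def parse_locus(locus)
  m = /\A([^:]+):(\d+)-(\d+)\z/.match(locus)
  return nil unless m # record carries no usable position information
  { chr: m[1], start: m[2].to_i, stop: m[3].to_i }
end

locus = parse_locus("19:58861911-58861911")
# A record whose start equals its stop covers a single base.
single_base = !locus.nil? && locus[:start] == locus[:stop]
```

Returning `nil` for unparseable input gives the caller a simple way to skip such records, which is roughly the role the README assigns to the `-i` switch.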
data/Rakefile
CHANGED
@@ -25,21 +25,27 @@ Jeweler::Tasks.new do |gem|
 end
 Jeweler::RubygemsDotOrgTasks.new
 
-
-
-
-
-
+require 'rspec/core'
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec) do |spec|
+  spec.pattern = FileList['spec/**/*_spec.rb']
+end
 
 # RSpec::Core::RakeTask.new(:rcov) do |spec|
 # spec.pattern = 'spec/**/*_spec.rb'
 # spec.rcov = true
 # end
 
-require 'cucumber/rake/task'
-Cucumber::Rake::Task.new(:features)
+# require 'cucumber/rake/task'
+# Cucumber::Rake::Task.new(:features)
 
 task :default => :spec
+task :test => [:spec]
+
+RSpec::Core::RakeTask.new(:rcov) do |spec|
+  spec.pattern = 'spec/**/*_spec.rb'
+  spec.rcov = true
+end
 
 require 'rdoc/task'
 Rake::RDocTask.new do |rdoc|
data/VERSION
CHANGED
@@ -1 +1 @@
-0.0.
+0.0.6
data/bin/bio-locus
CHANGED
@@ -16,13 +16,16 @@ end
 require 'bio-locus'
 require 'optparse'
 
-options = {task: nil, db: 'locus.db', show_help: false, header: 1}
+options = {task: nil, db: 'locus.db', show_help: false, header: 1, in_format: :vcf, alt: :include, storage: :serialize}
 opts = OptionParser.new do |o|
   o.banner = "Usage: #{File.basename($0)} [options] filename\ne.g. #{File.basename($0)} test.txt"
 
   o.on("--store", 'Create or add to a cache file') do
     options[:task] = :store
-
+  end
+
+  o.on("--storage [:serialize,:tokyocabinet,:localmemcache]", [:serialize,:tokyocabinet,:localmemcache], 'Persistent cache type (default :serialize)') do |t|
+    options[:storage] = t
   end
 
   o.on("--delete", 'Remove matches from a cache file') do
@@ -33,40 +36,60 @@ opts = OptionParser.new do |o|
     options[:task] = :match
   end
 
-  o.on("--include-alt", 'Include chr+pos+ALT VCF field to filter') do
-
-  end
+  # o.on("--include-alt", 'Include chr+pos+ALT VCF field to filter') do
+  # options[:include_alt] = true
+  # end
 
-  o.on(
-
+  o.on('--alt [include,exclude,only]', [:include,:exclude,:only],
+       'Include, exclude, only ALT (default include)') do |par|
+    options[:alt] = par.to_sym
   end
 
+  # o.on("--only-alt", 'Only look for chr+pos+ALT field in filter') do
+  # options[:only_alt] = true
+  # end
+
+  # o.on("--exclude-alt", 'Override adding chr+pos+ALT field to store') do
+  # options[:exclude_alt] = true
+  # end
 
   o.on("--db filename",String,"Use db file") do | fn |
     options[:db] = fn
   end
 
-  o.on(
+  o.on('--in-format [vcf,tab,cosmic,snv]', [:vcf,:tab,:cosmic,:snv], 'Input format (default vcf)') do |par|
+    options[:in_format] = par.to_sym
+  end
+
+  o.on("--eval-chr expr",String,"Evaluate record to retrieve chr name (default field[0])") do | expr |
     options[:eval_chr] = expr
   end
 
-  o.on("--eval-pos expr",String,"Evaluate record to retrieve position") do | expr |
+  o.on("--eval-pos expr",String,"Evaluate record to retrieve position (default field[1])") do | expr |
     options[:eval_pos] = expr
   end
 
-  o.on("--eval-alt expr",String,"Evaluate record to retrieve alt list") do | expr |
+  o.on("--eval-alt expr",String,"Evaluate record to retrieve alt list (default field[4])") do | expr |
     options[:eval_alt] = expr
   end
-
+
   o.on("--header num", "Header lines (default 1)") do |l|
     options[:header] = l.to_i
   end
+
+  o.on("-v", "--invert-match", "Invert the sense of matching, to select non-matching lines") do
+    options[:invert_match] = true
+  end
 
+  o.on("--header num", "Header lines (default 1)") do |l|
+    options[:header] = l.to_i
+  end
+
   o.on("-q", "--quiet", "Run quietly") do |q|
     options[:quiet] = true
   end
 
-  o.on("
+  o.on("--verbose", "Run verbosely") do |v|
     options[:verbose] = true
   end
 
@@ -78,6 +101,10 @@ opts = OptionParser.new do |o|
     options[:ignore_errors] = true
   end
 
+  o.on("--once", "Only one copy stored/matched") do |q|
+    options[:once] = true
+  end
+
 
   o.separator ""
   o.on_tail('-h', '--help', 'display this help and exit') do
@@ -107,7 +134,6 @@ end
 case options[:task]
 when :store then
   require 'bio-locus/store'
-  options[:include_alt]=false if options[:exclude_alt]
   BioLocus::Store.run(options)
 when :match ,:delete then
   require 'bio-locus/match'
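The new `--alt`, `--storage` and `--in-format` switches in the diff above pass an array of symbols to `OptionParser#on`, which restricts the accepted values and hands the block a symbol. A minimal sketch of that pattern follows, assuming a required argument instead of the optional `[...]` form used in the script; `parse_alt` is a hypothetical wrapper written only for this illustration.

```ruby
require 'optparse'

# Sketch only: parse an explicit argument list rather than ARGV.
# parse_alt is a hypothetical helper, not part of bio-locus.
def parse_alt(argv)
  options = { alt: :include }
  parser = OptionParser.new do |o|
    # The array restricts the accepted values; the block receives a Symbol.
    o.on('--alt MODE', [:include, :exclude, :only],
         'Include, exclude, only ALT (default include)') do |mode|
      options[:alt] = mode
    end
  end
  parser.parse(argv)
  options
end

parsed  = parse_alt(%w[--alt only])
default = parse_alt([])
```

A value outside the list raises an `OptionParser::ParseError` subclass, so a misspelled mode is rejected at parse time instead of silently falling through to the default.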
data/lib/bio-locus.rb
CHANGED
data/lib/bio-locus/dbmapper.rb
ADDED
@@ -0,0 +1,106 @@
+module BioLocus
+
+  class SerializeMapper
+    def initialize dbname
+      @dbname = dbname
+      @h = {}
+      if File.exist?(@dbname)
+        @h = Marshal.load(File.read(@dbname))
+      end
+    end
+
+    def [] key
+      @h[key]
+    end
+
+    def []= key, value
+      @h[key] = value
+    end
+
+    def close
+      File.open(@dbname, 'w') {|f| f.write(Marshal.dump(@h)) }
+    end
+  end
+
+  class MonetaMapper
+    def initialize storage, dbname
+      begin
+        require 'moneta'
+      rescue LoadError
+        $stderr.print "Error: Missing moneta. Install with command 'gem install moneta'\n"
+        exit 1
+      end
+      @store = Moneta.new(storage, file: dbname)
+    end
+
+    def [] key
+      @store[key]
+    end
+
+    def []= key, value
+      @store[key] = value
+    end
+
+    def close
+      @store.close
+    end
+  end
+
+  class TokyoCabinetMapper
+    def initialize dbname
+      begin
+        require 'tokyocabinet'
+      rescue LoadError
+        $stderr.print "Error: Missing tokyocabinet. Install with command 'gem install tokyocabinet'\n"
+        exit 1
+      end
+      @hdb = TokyoCabinet::HDB::new
+      if File.exist?(dbname)
+        if !@hdb.open(dbname, TokyoCabinet::HDB::OREADER)
+          ecode = @hdb.ecode
+          raise sprintf("open error: %s\n", @hdb.errmsg(ecode))
+        end
+      else
+        if !@hdb.open(dbname, TokyoCabinet::HDB::OWRITER | TokyoCabinet::HDB::OCREAT)
+          ecode = @hdb.ecode
+          raise sprintf("open error: %s\n", @hdb.errmsg(ecode))
+        end
+      end
+    end
+
+    def [] key
+      @hdb.get(key)
+    end
+
+    def []= key, value
+      if !@hdb.put(key,value)
+        ecode = @hdb.ecode
+        raise sprintf("put error: %s\n", @hdb.errmsg(ecode))
+      end
+    end
+
+    def close
+      if !@hdb.close
+        ecode = @hdb.ecode
+        raise sprintf("close error: %s\n", @hdb.errmsg(ecode))
+      end
+    end
+  end
+
+  module DbMapper
+    def DbMapper::factory options
+      dbname = options[:db]
+      if File.exist?(dbname)
+        $stderr.print "Database #{dbname} exists!\n"
+      end
+      case options[:storage]
+      when :tokyocabinet
+        TokyoCabinetMapper.new(dbname)
+      when :localmemcache
+        MonetaMapper.new(:LocalMemCache,dbname)
+      else
+        SerializeMapper.new(dbname)
+      end
+    end
+  end
+end
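The mapper classes in the diff above share a small interface: `[]`, `[]=` and `close`. The sketch below shows that interface on a Marshal-backed store, loosely following `SerializeMapper`. It is an illustration under stated assumptions, not the gem's code: the class name `MarshalStore` is hypothetical, binary file I/O is a choice made here, and the `delete` method is an addition, since none of the mappers shown define one.

```ruby
require 'tmpdir'

# Sketch of a Marshal-backed key/value store (hypothetical class,
# loosely modelled on the SerializeMapper in the diff).
class MarshalStore
  def initialize(dbname)
    @dbname = dbname
    # Binary reads and writes keep the Marshal byte stream intact.
    @h = File.exist?(dbname) ? Marshal.load(File.binread(dbname)) : {}
  end

  def [](key)
    @h[key]
  end

  def []=(key, value)
    @h[key] = value
  end

  # Assumption: added for illustration only.
  def delete(key)
    @h.delete(key)
  end

  # Nothing reaches disk until close is called.
  def close
    File.binwrite(@dbname, Marshal.dump(@h))
  end
end

reloaded = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'locus.db')
  store = MarshalStore.new(path)
  store["1\t12345"] = true     # chr+pos key
  store["1\t12345\tA"] = true  # chr+pos+alt key
  store.delete("1\t12345")
  store.close
  reloaded = MarshalStore.new(path) # a later run sees the saved keys
end
```

Because the whole hash is held in memory and written once on `close`, this design suits small key sets, which fits the README's remark that larger data may need a disk-oriented backend.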
data/lib/bio-locus/locus.rb
CHANGED
@@ -1,41 +1,70 @@
 
 module BioLocus
   module Keys
+    @@in_list = {}
+
     def Keys::each_key(line,options)
+      use_alt = (options[:alt] == :include or options[:alt] == :only)
+      use_pos = (options[:alt] == :include or options[:alt] == :exclude)
+
       if line =~ /^[[:alnum:]]+/
         fields = nil
-        # The default layout (VCF) may or may not work
+        # The default layout (VCF) may or may not work. Critically
+        # chr,pos and alt are expected in positions 0,1,4 respectively.
         chr,pos,id,no_use,alt,rest = line.split(/\t/,6)[0..-1]
-
-
-          fields ||= line.split(/\t/)
-          field = fields
-          chr = eval(options[:eval_chr])
-        end
-        if options[:eval_pos]
-          fields ||= line.split(/\t/)
-          field = fields
-          pos = eval(options[:eval_pos])
-        end
-        if options[:eval_alt]
-          fields ||= line.split(/\t/)
+        if options[:in_format] or options[:eval_chr] or options[:eval_pos] or options[:eval_alt]
+          fields = line.split(/\t/)
           field = fields
-
+          case options[:in_format]
+          when :tab then
+            # chr,pos,ref,alt
+            alt = field[3].strip.split(/,/)[0] if field[3]
+          when :snv then
+            alt = field[2].split(/\//)[1] if field[2]
+          when :cosmic then
+            # COSMIC tsv files, either in field 17 (COSMICv70)
+            locus_field = field[17]
+            locus_field = field[13] if locus_field !~ /:/
+            if field[15] !~ /delet/i and locus_field =~ /:/
+              chr = /^([^:]+)/.match(locus_field)[1]
+              a = /:(\d+)-(\d+)/.match(locus_field)
+              pos = a[1] if a[1]==a[2]
+            end
+          end
+          # Override parsing with
+          if options[:eval_chr]
+            chr = eval(options[:eval_chr])
+          end
+          if options[:eval_pos]
+            pos = eval(options[:eval_pos])
+          end
+          if options[:eval_alt]
+            alt = eval(options[:eval_alt])
+          end
         end
-        p [chr,pos] if options[:debug]
+        # p [:debug,chr,pos,alt] if options[:debug]
 
         # If we have a position emit it
         if pos =~ /^\d+$/ and chr and chr != ''
-          alts =
-
+          alts = if use_pos
+                   ['']
+                 else
+                   []
+                 end
+          alts += alt.split(/,/) if use_alt and alt
           alts.each do | nuc |
             key = chr+"\t"+pos
             key += "\t"+nuc if nuc != ''
+            if options[:once]
+              # check we haven't already sent this out in this run
+              return if @@in_list[key]
+              @@in_list[key] = true
+            end
             yield key
           end
         else
           if options[:ignore_errors]
-            $stderr.print "WARNING, skipping: ",line if not options[:quiet]
+            $stderr.print "WARNING, <#{chr}:#{pos}> skipping: ",line if not options[:quiet]
           else
             p line
             p fields
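The key layout in the `each_key` change above can be summarised as one tab-separated `chr+pos` key plus one `chr+pos+alt` key per comma-separated ALT allele, selected by the `--alt` mode. A minimal sketch of that selection follows, assuming plain string inputs; `locus_keys` is a hypothetical helper and leaves out the input-format parsing and the `--once` bookkeeping.

```ruby
# Hypothetical helper mirroring the key layout in the diff:
# "chr\tpos" for the position key, "chr\tpos\talt" per ALT allele.
def locus_keys(chr, pos, alt, mode = :include)
  use_alt = [:include, :only].include?(mode)
  use_pos = [:include, :exclude].include?(mode)
  keys = []
  keys << "#{chr}\t#{pos}" if use_pos
  keys += alt.split(',').map { |nuc| "#{chr}\t#{pos}\t#{nuc}" } if use_alt && alt
  keys
end

include_keys = locus_keys('1', '100', 'A,C', :include)
only_keys    = locus_keys('1', '100', 'A,C', :only)
exclude_keys = locus_keys('1', '100', 'A,C', :exclude)
```

With `:include` a record yields both key kinds, so a database stored this way can later be queried by position or by position plus allele.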
data/lib/bio-locus/match.rb
CHANGED
@@ -1,42 +1,58 @@
 module BioLocus
-
-  require 'moneta'
-
   module Match
     def Match.run(options)
       do_delete = (options[:task] == :delete)
-
+      invert_match = options[:invert_match]
+      store = DbMapper.factory(options)
       lines = 0
+      header_lines = 0
       count = 0
       in_header = true
-
+      uniq_match = {}
+      uniq_no_match = {}
       STDIN.each_line do | line |
         if in_header and line =~ /^#/
           # Retain comments in header (for VCF)
           print line
+          header_lines += 1
           next
         else
           in_header = false
         end
-
+        if line =~ /^#/
+          header_lines += 1
+        else
+          lines += 1
+        end
         $stderr.print '.' if (lines % 1_000_000) == 0 if not options[:quiet]
         Keys::each_key(line,options) do | key |
-
+          has_match = lambda {
+            if invert_match
+              not store[key]
+            else
+              store[key]
+            end
+          }
+          if has_match.call
+            # We have a match
+            $stderr.print "Matched <#{key}>\n" if options[:debug]
             count += 1
             if do_delete
               store.delete(key)
             else
               print line
-
+              uniq_match[key] ||= true
             end
+          else
+            uniq_no_match[key] ||= true
           end
         end
       end
       store.close
       if do_delete
-        $stderr.print "\nDeleted #{count} keys
+        $stderr.print "\nDeleted #{count} keys in #{options[:db]} reading #{lines} lines !\n" if not options[:quiet]
       else
-        $stderr.print "\nMatched #{count} (unique #{
+        $stderr.print "\nMatched #{count} (unique #{uniq_match.keys.size}) lines out of #{lines} (header #{header_lines}, unique #{uniq_no_match.keys.size+uniq_match.keys.size}) in #{options[:db]}!\n" if not options[:quiet]
       end
     end
   end
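The `has_match` lambda in the diff above flips the lookup when `-v` (`--invert-match`) is given, so the same loop selects either the keys found in the database or the keys missing from it. A minimal sketch of that idea follows, assuming a `Set` in place of the on-disk store; `filter_keys` is a hypothetical helper.

```ruby
require 'set'

# Hypothetical helper: keep the keys present in the store, or with
# invert set, the keys that are absent (comparable to grep -v).
def filter_keys(keys, store, invert: false)
  keys.select { |key| invert ? !store.include?(key) : store.include?(key) }
end

store = Set.new(["1\t100\tA", "2\t200\tT"])
keys  = ["1\t100\tA", "1\t100\tC", "2\t200\tT"]

matched   = filter_keys(keys, store)
unmatched = filter_keys(keys, store, invert: true)
```

The two result sets are complementary, which is consistent with the GoNL example in the README hunk: 75 unique matches and 99 unique non-matches add up to the 174 unique keys reported.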
data/lib/bio-locus/store.rb
CHANGED
@@ -1,14 +1,20 @@
 module BioLocus
 
-  require 'moneta'
-
   module Store
     def Store.run(options)
-
+      invert_match = options[:invert_match]
+      store = DbMapper.factory(options)
       count = count_new = count_dup = 0
       STDIN.each_line do | line |
         Keys::each_key(line,options) do | key |
-
+          has_match = lambda {
+            if invert_match
+              not store[key]
+            else
+              store[key]
+            end
+          }
+          if not has_match.call
             count_new += 1
             store[key] = true
           else
@@ -23,7 +29,7 @@ module BioLocus
         end
       end
       store.close
-      $stderr.print "Stored #{count_new} positions out of #{count} in #{options[:db]} (#{count_dup} hits)\n" if !options[:quiet]
+      $stderr.print "Stored #{count_new} positions out of #{count} in #{options[:db]} (#{count_dup} duplicate hits)\n" if !options[:quiet]
     end
   end
 end
data/spec/bio-locus_spec.rb
CHANGED
@@ -1,7 +1,37 @@
 require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
 
-describe "BioLocus" do
-
-
-
+describe "BioLocus with Serialize" do
+  fn = 'biolocus_serialize.db'
+  store = BioLocus::DbMapper.factory({storage: :serialize, db: fn})
+  store['test'] = 'yes'
+  store['test2'] = 'no'
+  a = store['test']
+  store['test'].should == 'yes'
+  store['test2'].should == 'no'
+  store.close
+  File.unlink(fn)
+end
+
+describe "BioLocus with Moneta" do
+  fn = 'biolocus_moneta_localmemcache.db'
+  store = BioLocus::MonetaMapper.new(:LocalMemCache,fn)
+  store['test'] = 'yes'
+  store['test2'] = 'no'
+  a = store['test']
+  store['test'].should == 'yes'
+  store['test2'].should == 'no'
+  store.close
+  File.unlink(fn)
+end
+
+describe "BioLocus with TokyoCabinet" do
+  fn = 'biolocus_tokyocabinet.db'
+  store = BioLocus::TokyoCabinetMapper.new(fn)
+  store['test'] = 'yes'
+  store['test2'] = 'no'
+  a = store['test']
+  store['test'].should == 'yes'
+  store['test2'].should == 'no'
+  store.close
+  File.unlink(fn)
 end
metadata
CHANGED
@@ -1,23 +1,23 @@
 --- !ruby/object:Gem::Specification
 name: bio-locus
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.6
 platform: ruby
 authors:
 - Pjotr Prins
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-
+date: 2014-10-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name:
+  name: cucumber
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-  type: :
+  type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
@@ -25,13 +25,13 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: jeweler
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-  type: :
+  type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
@@ -39,7 +39,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -53,7 +53,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: rspec
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -67,7 +67,35 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: tokyocabinet
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: localmemcache
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: moneta
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -103,6 +131,7 @@ files:
 - features/step_definitions/bio-locus_steps.rb
 - features/support/env.rb
 - lib/bio-locus.rb
+- lib/bio-locus/dbmapper.rb
 - lib/bio-locus/locus.rb
 - lib/bio-locus/match.rb
 - lib/bio-locus/store.rb