bio-vcf 0.8.1 → 0.8.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 90a933c33c683c1f0886a202fa5a9ee5ed2ad8ff
- data.tar.gz: 1f769a89fcb3e3b44e22864ddf729ea3ac040260
+ metadata.gz: 515319faec0710075f13a0265a4027130ec5f10a
+ data.tar.gz: aed2ff09861568291363ca21944567ad36987813
  SHA512:
- metadata.gz: 308d93ca1bcb142fa9cd4be63d929edb0ad92b7ac0da0d4f2d51f4b363de6ef0b87ac3d2688af8c36257317e203f69355fdc3348c6a330adb7d997af7ab6714d
- data.tar.gz: d7d328a13d90b209a6068f9d3f09d56e8a00262ccf7fba9d9d67ffe4935993b2ccbb2ddc2f9a4831dd8928258a9c5d468f050d9edfbb95737616f4bfaf184bb0
+ metadata.gz: 94ff3bfda4357fc187a89c9a55116ceefe15fc2b8fa28af45e92afcad452c8d2bd65e5eae17dd2c40b046f89288c640ba5d4b40b8efb711781caed766e48f518
+ data.tar.gz: 3d810db35d1ad862aad6f4ec81d695c6d7d74d46336d4e5563e925da267d04521387994d794ff7d8384cf10d8c94701e0e2af9380ddc0b4505e00edbbedb7c3e
@@ -3,6 +3,11 @@ rvm:
  # - 1.9.3 <- No longer working
  - 2.0.0
  - 2.1.0
+
+ branches:
+ only:
+ - master
+
  # - jruby-head
  # - jruby-19mode # JRuby in 1.9 mode
  # - 1.8.7
data/Gemfile CHANGED
@@ -7,9 +7,9 @@ source "http://rubygems.org"
  # Include everything needed to run rake, tests, features, etc.
  group :development do
  # gem "minitest"
- gem "rspec"
- gem "cucumber"
- gem "jeweler", "~> 2.0.1" # , "~> 1.8.4", :git => "https://github.com/technicalpickles/jeweler.git"
- gem "regressiontest", "~> 0.0.3"
+ gem "rspec", ">= 2.14.0"
+ gem "cucumber", ">= 1.3.11"
+ gem "jeweler", ">= 2.0.1" # , "~> 1.8.4", :git => "https://github.com/technicalpickles/jeweler.git"
+ gem "regressiontest", ">= 0.0.3"
  end

@@ -75,7 +75,7 @@ PLATFORMS
  ruby

  DEPENDENCIES
- cucumber
- jeweler (~> 2.0.1)
- regressiontest (~> 0.0.3)
- rspec
+ cucumber (>= 1.3.11)
+ jeweler (>= 2.0.1)
+ regressiontest (>= 0.0.3)
+ rspec (>= 2.14.0)
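
The dependency hunks above relax the pessimistic `~>` constraints to open-ended `>=` constraints. As a minimal sketch of what that changes in practice, the snippet below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to check a few version numbers against both forms. The version numbers tested are illustrative only and are not taken from the release.

```ruby
require 'rubygems'

# The old and new jeweler constraints from the diff above
old_req = Gem::Requirement.new('~> 2.0.1')
new_req = Gem::Requirement.new('>= 2.0.1')

# Illustrative candidate versions (not from the release itself)
%w[2.0.1 2.0.9 2.1.0 3.0.0].each do |v|
  version = Gem::Version.new(v)
  puts format('%-6s  ~> 2.0.1: %-5s  >= 2.0.1: %s',
              v,
              old_req.satisfied_by?(version),
              new_req.satisfied_by?(version))
end
```

The pessimistic form only admits 2.0.x patch releases, while the relaxed form also admits later minor and major versions.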
data/README.md CHANGED
@@ -5,7 +5,9 @@
5
5
  A new generation VCF parser. Bio-vcf is not only fast for genome-wide
6
6
  (WGS) data, it also comes with a really nice filtering, evaluation and
7
7
  rewrite language and it can output any type of textual data, including
8
- RDF and JSON. Why would you use bio-vcf over other parsers?
8
+ VCF header and contents in RDF and JSON.
9
+
10
+ So, why would you use bio-vcf over other parsers? Because
9
11
 
10
12
  1. Bio-vcf is fast and scales on multi-core computers
11
13
  2. Bio-vcf has an expressive filtering and evaluation language
@@ -16,14 +18,14 @@ RDF and JSON. Why would you use bio-vcf over other parsers?
16
18
  7. Bio-vcf allows for genotype processing
17
19
  8. Bio-vcf has support for set analysis
18
20
  9. Bio-vcf has sane error handling
19
- 10. Bio-vcf can output tabular data, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs using (erb) templates
21
+ 10. Bio-vcf can convert *any* VCF to *any* output, including tabular data, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs by using (erb) templates
20
22
 
21
23
  Bio-vcf has better performance than other tools
22
24
  because of lazy parsing, multi-threading, and useful combinations of
23
25
  (fancy) command line filtering. For example, on a 2 core machine
24
- bio-vcf is typically 50% faster than JVM based SnpSift. On an 8 core machine
25
- bio-vcf is at least 3x faster than SnpSift. Parsing a 1 Gb ESP
26
- VCF with 8 cores with bio-vcf takes
26
+ bio-vcf is typically 50% faster than JVM based SnpSift. Adding
27
+ cores, bio-vcf just does better. The more complicated the filters,
28
+ the larger the gain.
27
29
 
28
30
  ```sh
29
31
  time ./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp>0.3' < ESP6500SI_V2_SSA137.vcf > test1.vcf
@@ -52,8 +54,8 @@ a 16 core machine takes
52
54
  sys 0m5.039s
53
55
  ```
54
56
 
55
- which shows decent core utilisation (10x). We are running
56
- gzip compressed VCF files of 30+ Gb with similar performance gains.
57
+ which shows decent core utilisation (10x). Running
58
+ gzip compressed VCF files of 30+ Gb has similar performance gains.
57
59
 
58
60
  Use zcat to
59
61
  pipe such gzipped (vcf.gz) files into bio-vcf, e.g.
@@ -64,10 +66,10 @@ pipe such gzipped (vcf.gz) files into bio-vcf, e.g.
64
66
  --eval '[r.chrom,r.pos,r.pos+1]' > test.bed
65
67
  ```
66
68
 
67
- bio-vcf comes with a sensible parser definition language (it is 100%
68
- Ruby), as well as primitives for set analysis. Few
69
+ bio-vcf comes with a sensible parser definition language (interestingly it is 100%
70
+ Ruby), an embedded Ragel parser for INFO and FORMAT header definitions, as well as primitives for set analysis. Few
69
71
  assumptions are made about the actual contents of the VCF file (field
70
- names are resolved on the fly), so bio-vcf should practically work with
72
+ names are resolved on the fly), so bio-vcf should work with
71
73
  all VCF files.
72
74
 
73
75
  To fetch all entries where all samples have depth larger than 20 use
@@ -679,7 +681,7 @@ Also check out [bio-table](https://github.com/pjotrp/bioruby-table) to convert t
679
681
 
680
682
  ## Templates
681
683
 
682
- To have more output options blastxmlparser can use an [ERB
684
+ To have more output options bio-vcf can use an [ERB
683
685
  template](http://www.stuartellis.eu/articles/erb/) for every match. This is a
684
686
  very flexible option that can output textual formats such as JSON, YAML, HTML
685
687
  and RDF. Examples are provided in
@@ -785,6 +787,12 @@ can be
785
787
  ]
786
788
  ```
787
789
 
790
+ with
791
+
792
+ ```sh
793
+ bio-vcf --template template/vcf2json.erb < dbsnp.vcf
794
+ ```
795
+
788
796
  may generate something like
789
797
 
790
798
  ```Javascript
@@ -816,6 +824,19 @@ from the last BODY element. To make it valid JSON that needs to be
816
824
  removed. A future version may add a parameter to the BODY element or a
817
825
  global rewrite function for this purpose. YAML and RDF have no such issue.
818
826
 
827
+ ### Using full VCF header (meta) info
828
+
829
+ To get and put the full information from the header, simply use
830
+ vcf.meta.to_json. See ./template/vcf2json_full_header.erb for an
831
+ example. This meta information can also be used to output info fields
832
+ and sample values on the fly! For an example, see the template at
833
+ [./template/vcf2json_use_meta.erb](https://github.com/pjotrp/bioruby-vcf/tree/master/template/vcf2json_use_meta.erb)
834
+ and the generated output at
835
+ [./test/data/regression/vcf2json_use_meta.ref](https://github.com/pjotrp/bioruby-vcf/tree/master/test/data/regression/vcf2json_use_meta.ref).
836
+
837
+ This way, it is possible to write templates that can convert the content of
838
+ *any* VCF file without prior knowledge to JSON, RDF, etc.
839
+
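
The new README section describes serialising the header meta information with `to_json` inside an ERB template. A minimal sketch of that idea, using only Ruby's standard `erb` and `json` libraries, might look as follows. The `meta` hash here is a hand-written stand-in for what `vcf.meta` could return, and the template text is hypothetical; it is not a copy of `template/vcf2json_full_header.erb`.

```ruby
require 'erb'
require 'json'

# Hypothetical stand-in for the hash returned by vcf.meta
meta = {
  'fileformat' => 'VCFv4.1',
  'INFO'   => { 'PM' => { 'ID' => 'PM', 'Number' => '0', 'Type' => 'Flag' } },
  'FORMAT' => { 'DP' => { 'ID' => 'DP', 'Number' => '1', 'Type' => 'Integer' } }
}

# Hypothetical template: embed the whole header as one JSON value
template = ERB.new(<<~TEMPLATE)
  { "HEADER": <%= meta.to_json %> }
TEMPLATE

output = template.result(binding)
puts output
```

Because the whole hash is serialised in one step, the template needs no knowledge of which INFO or FORMAT fields a particular VCF file defines.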
819
840
  ## Statistics
820
841
 
821
842
  Simple statistics are available for REF>ALT changes:
data/VERSION CHANGED
@@ -1 +1 @@
- 0.8.1
+ 0.8.2
@@ -200,7 +200,7 @@ end

  include BioVcf

- # Parse the header section of a VCF file
+ # Parse the header section of a VCF file (chomping STDIN)
  def parse_header line, samples, options
  header = VcfHeader.new
  header.add(line)
@@ -374,22 +374,31 @@ begin
  end
  } # end output

- print template.header(binding) if template
  # ---- Main loop
  STDIN.each_line do | line |
  line_number += 1
  # ---- In this section header information is handled
+
+ # ---- Skip embedded headers down the line...
  next if header_output_completed and line =~ /^#/
- if line =~ /^##fileformat=/ or line =~ /^#CHR/
+
+ # ---- Parse the header lines (chomps from STDIN)
+ # and returns header info and the current line
+ if line =~ /^#/
  header,line = parse_header(line,samples,options)
  end
- next if line =~ /^##/ # empty file
- header_output_completed = true
- if not options[:efilter_samples] and options[:ifilter_samples]
- # Create exclude set as a complement of include set
- options[:efilter_samples] = header.column_names[9..-1].fill{|i|i.to_s}-options[:ifilter_samples]
+ # p [line_number,line]
+ # ---- After the header continue processing
+ if not header_output_completed
+ # one-time post-header processing
+ if not options[:efilter_samples] and options[:ifilter_samples]
+ # Create exclude set as a complement of include set
+ options[:efilter_samples] = header.column_names[9..-1].fill{|i|i.to_s}-options[:ifilter_samples]
+ end
+ print template.header(binding) if template
+ header_output_completed = true
  end
-
+
  # ---- In this section the VCF variant lines are parsed
  lines << line
  if NUM_THREADS == 1
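
The main-loop hunk above moves the template header output into a one-time block that runs after the VCF header has been read, and skips any header lines that appear later in the stream. The sketch below illustrates that control flow in isolation. It is a simplification under stated assumptions: `process_stream` is a hypothetical helper, it collects header lines inside the loop rather than delegating to `parse_header`, and it emits a placeholder string where the real code prints `template.header(binding)`.

```ruby
require 'stringio'

# Hypothetical helper: returns the output lines for a VCF-like stream.
def process_stream(io)
  out = []
  header_lines = []
  header_output_completed = false

  io.each_line do |line|
    line = line.chomp
    if line.start_with?('#')
      # Skip headers embedded further down the stream
      next if header_output_completed
      header_lines << line
      next
    end

    # One-time post-header processing, run on the first record line
    unless header_output_completed
      out << "HEADER (#{header_lines.size} lines)"
      header_output_completed = true
    end
    out << line
  end
  out
end

input = StringIO.new("##fileformat=VCFv4.1\n#CHROM\tPOS\nrec1\n##embedded\nrec2\n")
puts process_stream(input)
```

Delaying the header output until the first record means the full header is available to the template, which the new `vcf.meta` based templates rely on.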
@@ -2,16 +2,14 @@
2
2
  # DO NOT EDIT THIS FILE DIRECTLY
3
3
  # Instead, edit Jeweler::Tasks in Rakefile, and run 'rake gemspec'
4
4
  # -*- encoding: utf-8 -*-
5
- # stub: bio-vcf 0.8.1 ruby lib
6
5
 
7
6
  Gem::Specification.new do |s|
8
7
  s.name = "bio-vcf"
9
- s.version = "0.8.1"
8
+ s.version = "0.8.2"
10
9
 
11
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
12
- s.require_paths = ["lib"]
13
11
  s.authors = ["Pjotr Prins"]
14
- s.date = "2014-11-26"
12
+ s.date = "2014-12-28"
15
13
  s.description = "Smart lazy multi-threaded parser for VCF format with useful filtering and output rewriting"
16
14
  s.email = "pjotr.public01@thebird.nl"
17
15
  s.executables = ["bio-vcf"]
@@ -40,7 +38,9 @@ Gem::Specification.new do |s|
40
38
  "features/step_definitions/multisample.rb",
41
39
  "features/step_definitions/sfilter.rb",
42
40
  "features/step_definitions/somaticsniper.rb",
41
+ "features/step_definitions/vcf_header.rb",
43
42
  "features/support/env.rb",
43
+ "features/vcf_header.feature",
44
44
  "lib/bio-vcf.rb",
45
45
  "lib/bio-vcf/bedfilter.rb",
46
46
  "lib/bio-vcf/template.rb",
@@ -49,13 +49,19 @@ Gem::Specification.new do |s|
49
49
  "lib/bio-vcf/vcf.rb",
50
50
  "lib/bio-vcf/vcfgenotypefield.rb",
51
51
  "lib/bio-vcf/vcfheader.rb",
52
+ "lib/bio-vcf/vcfheader_line.rb",
52
53
  "lib/bio-vcf/vcfline.rb",
53
54
  "lib/bio-vcf/vcfrdf.rb",
54
55
  "lib/bio-vcf/vcfrecord.rb",
55
56
  "lib/bio-vcf/vcfsample.rb",
56
57
  "lib/bio-vcf/vcfstatistics.rb",
58
+ "ragel/gen_vcfheaderline_parser.rb",
59
+ "ragel/gen_vcfheaderline_parser.rl",
60
+ "ragel/generate.sh",
57
61
  "template/gatk_vcf2rdf.erb",
58
62
  "template/vcf2json.erb",
63
+ "template/vcf2json_full_header.erb",
64
+ "template/vcf2json_use_meta.erb",
59
65
  "template/vcf2rdf.erb",
60
66
  "template/vcf2rdf_header.erb",
61
67
  "test/data/input/dbsnp.vcf",
@@ -71,33 +77,35 @@ Gem::Specification.new do |s|
71
77
  "test/data/regression/thread4.ref",
72
78
  "test/data/regression/thread4_4.ref",
73
79
  "test/data/regression/thread4_4_failed_filter-stderr.ref",
80
+ "test/data/regression/vcf2json_full_header.ref",
74
81
  "test/performance/metrics.md"
75
82
  ]
76
83
  s.homepage = "http://github.com/pjotrp/bioruby-vcf"
77
84
  s.licenses = ["MIT"]
85
+ s.require_paths = ["lib"]
78
86
  s.required_ruby_version = Gem::Requirement.new(">= 2.0.0")
79
- s.rubygems_version = "2.2.2"
87
+ s.rubygems_version = "2.0.3"
80
88
  s.summary = "Fast multi-threaded VCF parser"
81
89
 
82
90
  if s.respond_to? :specification_version then
83
91
  s.specification_version = 4
84
92
 
85
93
  if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
86
- s.add_development_dependency(%q<rspec>, [">= 0"])
87
- s.add_development_dependency(%q<cucumber>, [">= 0"])
88
- s.add_development_dependency(%q<jeweler>, ["~> 2.0.1"])
89
- s.add_development_dependency(%q<regressiontest>, ["~> 0.0.3"])
94
+ s.add_development_dependency(%q<rspec>, [">= 2.14.0"])
95
+ s.add_development_dependency(%q<cucumber>, [">= 1.3.11"])
96
+ s.add_development_dependency(%q<jeweler>, [">= 2.0.1"])
97
+ s.add_development_dependency(%q<regressiontest>, [">= 0.0.3"])
90
98
  else
91
- s.add_dependency(%q<rspec>, [">= 0"])
92
- s.add_dependency(%q<cucumber>, [">= 0"])
93
- s.add_dependency(%q<jeweler>, ["~> 2.0.1"])
94
- s.add_dependency(%q<regressiontest>, ["~> 0.0.3"])
99
+ s.add_dependency(%q<rspec>, [">= 2.14.0"])
100
+ s.add_dependency(%q<cucumber>, [">= 1.3.11"])
101
+ s.add_dependency(%q<jeweler>, [">= 2.0.1"])
102
+ s.add_dependency(%q<regressiontest>, [">= 0.0.3"])
95
103
  end
96
104
  else
97
- s.add_dependency(%q<rspec>, [">= 0"])
98
- s.add_dependency(%q<cucumber>, [">= 0"])
99
- s.add_dependency(%q<jeweler>, ["~> 2.0.1"])
100
- s.add_dependency(%q<regressiontest>, ["~> 0.0.3"])
105
+ s.add_dependency(%q<rspec>, [">= 2.14.0"])
106
+ s.add_dependency(%q<cucumber>, [">= 1.3.11"])
107
+ s.add_dependency(%q<jeweler>, [">= 2.0.1"])
108
+ s.add_dependency(%q<regressiontest>, [">= 0.0.3"])
101
109
  end
102
110
  end
103
111
 
@@ -43,14 +43,24 @@ Feature: Command-line interface (CLI)
43
43
  When I execute "./bin/bio-vcf -i --sfilter 's.dp>10' --seval 's.dp'"
44
44
  Then I expect the named output to match the named output "sfilter_seval_s.dp"
45
45
 
46
-
47
46
  Scenario: Rewrite an info field
48
47
  Given I have input file(s) named "test/data/input/multisample.vcf"
49
48
  When I execute "./bin/bio-vcf --rewrite rec.info[\'sample\']=\'XXXXX\'"
50
49
  Then I expect the named output to match the named output "rewrite.info.sample"
51
50
 
51
+ Scenario: Test JSON output with header meta data
52
+ Given I have input file(s) named "test/data/input/multisample.vcf"
53
+ When I execute "./bin/bio-vcf --template template/vcf2json_full_header.erb"
54
+ Then I expect the named output to match the named output "vcf2json_full_header"
55
+
56
+ Scenario: Test JSON output with header meta data and query samples
57
+ Given I have input file(s) named "test/data/input/multisample.vcf"
58
+ When I execute "./bin/bio-vcf --template template/vcf2json_use_meta.erb"
59
+ Then I expect the named output to match the named output "vcf2json_use_meta"
60
+
52
61
  Scenario: Test deadlock on failed filter with threads
53
62
  Given I have input file(s) named "test/data/input/multisample.vcf"
54
63
  When I execute "./bin/bio-vcf --num-threads 4 --thread-lines 4 --filter 't.info.dp>2'"
55
64
  Then I expect an error and the named output to match the named output "thread4_4_failed_filter" in under 30 seconds
56
65
 
66
+
@@ -8,7 +8,7 @@ When /^I execute "(.*?)"$/ do |arg1|
8
8
  end
9
9
 
10
10
  Then(/^I expect the named output to match the named output "(.*?)"$/) do |arg1|
11
- RegressionTest::CliExec::exec(@cmd,arg1,ignore: '##BioVcf=').should be_true
11
+ RegressionTest::CliExec::exec(@cmd,arg1,ignore: '(##BioVcf|date|"version":)').should be_true
12
12
  end
13
13
 
14
14
  Then(/^I expect an error and the named output to match the named output "(.*?)" in under (\d+) seconds$/) do |arg1,arg2|
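
The step definition above widens the `ignore:` pattern so that lines containing a date or a `"version":` key no longer cause regression mismatches. How the regressiontest gem applies that pattern internally is not shown in this diff, so the following is only a sketch of the general idea: drop every line that matches the pattern before comparing two outputs. The `comparable_lines` helper and the sample strings are hypothetical.

```ruby
# Hypothetical helper: strip lines matching the ignore pattern
def comparable_lines(text, ignore)
  text.each_line.map(&:chomp).reject { |line| line =~ ignore }
end

# Same pattern string as in the step definition above
ignore = Regexp.new('(##BioVcf|date|"version":)')

# Illustrative outputs that differ only in volatile lines
reference = %({\n  "version": "0.8.1",\n  "chrom": "1"\n})
current   = %({\n  "version": "0.8.2",\n  "chrom": "1"\n})

puts comparable_lines(reference, ignore) == comparable_lines(current, ignore)
```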
@@ -0,0 +1,48 @@
1
+ Given(/^the VCF header lines$/) do |string|
2
+ header = VcfHeader.new
3
+ header.add string
4
+ @vcf = header
5
+ end
6
+
7
+ When(/^I parse the VCF header$/) do
8
+ end
9
+
10
+ Then(/^I expect vcf\.columns to be \[CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO','FORMAT','NORMAL','TUMOR'\]$/) do
11
+ expect(@vcf.column_names).to eq ['CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO','FORMAT','NORMAL','TUMOR']
12
+ end
13
+
14
+ Then(/^I expect vcf\.fileformat to be "(.*?)"$/) do |arg1|
15
+ expect(@vcf.fileformat).to eq arg1
16
+ end
17
+
18
+ Then(/^I expect vcf\.fileDate to be "(.*?)"$/) do |arg1|
19
+ expect(@vcf.fileDate).to eq arg1
20
+ end
21
+
22
+ Then(/^I expect vcf.field\['fileDate'\] to be "(.*?)"$/) do |arg1|
23
+ expect(@vcf.field['fileDate']).to eq arg1
24
+ end
25
+
26
+ Then(/^I expect vcf\.phasing to be "(.*?)"$/) do |arg1|
27
+ expect(@vcf.phasing).to eq arg1
28
+ end
29
+
30
+ Then(/^I expect vcf\.reference to be "(.*?)"$/) do |arg1|
31
+ expect(@vcf.reference).to eq arg1
32
+ end
33
+
34
+ Then(/^I expect vcf\.format\['(\w+)'\] to be (\{[^}]+\})/) do |arg1,arg2|
35
+ expect(@vcf.format[arg1].to_s).to eq arg2
36
+ end
37
+
38
+ Then(/^I expect vcf\.info\['(\w+)'\] to be (\{[^}]+\})/) do |arg1,arg2|
39
+ expect(@vcf.info[arg1].to_s).to eq arg2
40
+ end
41
+
42
+ Then(/^I expect vcf\.meta to contain all header meta information$/) do
43
+ m = @vcf.meta
44
+ expect(m['fileformat']).to eq "VCFv4.1"
45
+ expect(m['FORMAT']['DP']['Number']).to eq "1"
46
+ expect(m.size).to be 6
47
+ end
48
+
@@ -0,0 +1,35 @@
1
+ @meta
2
+ Feature: Parsing VCF meta information from the header
3
+
4
+ Take a header and parse that information as defined by the VCF standard.
5
+
6
+ Scenario: When parsing a header line
7
+
8
+ Given the VCF header lines
9
+ """
10
+ ##fileformat=VCFv4.1
11
+ ##fileDate=20140121
12
+ ##phasing=none
13
+ ##reference=file:///data/GENOMES/human_GATK_GRCh37/GRCh37_gatk.fasta
14
+ ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
15
+ ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
16
+ ##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
17
+ ##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant is Precious(Clinical,Pubmed Cited)">
18
+ #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
19
+ """
20
+ When I parse the VCF header
21
+ Then I expect vcf.columns to be [CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO','FORMAT','NORMAL','TUMOR']
22
+ And I expect vcf.fileformat to be "VCFv4.1"
23
+ And I expect vcf.fileDate to be "20140121"
24
+ And I expect vcf.field['fileDate'] to be "20140121"
25
+ And I expect vcf.phasing to be "none"
26
+ And I expect vcf.reference to be "file:///data/GENOMES/human_GATK_GRCh37/GRCh37_gatk.fasta"
27
+ And I expect vcf.format['GT'] to be {"ID"=>"GT", "Number"=>"1", "Type"=>"String", "Description"=>"Genotype"}
28
+ And I expect vcf.format['DP'] to be {"ID"=>"DP", "Number"=>"1", "Type"=>"Integer", "Description"=>"Total read depth"}
29
+ And I expect vcf.format['DP4'] to be {"ID"=>"DP4", "Number"=>"4", "Type"=>"Integer", "Description"=>"# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases"}
30
+ And I expect vcf.info['PM'] to be {"ID"=>"PM", "Number"=>"0", "Type"=>"Flag", "Description"=>"Variant is Precious(Clinical,Pubmed Cited)"}
31
+ And I expect vcf.meta to contain all header meta information
32
+
33
+ Scenario: When parsing the header of somatic_sniper.vcf
34
+
35
+ Do something
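
The feature above expects INFO and FORMAT header lines to be split into key/value hashes, including descriptions that contain commas inside quotes. The release uses a Ragel-generated parser for this. The sketch below is not that parser; it swaps in a plain regular expression to show why quoted values need special handling. `parse_header_kv` is a hypothetical name, and backslash escapes inside quotes are kept as-is rather than decoded.

```ruby
# Hypothetical regex-based stand-in for the Ragel key/value parser.
def parse_header_kv(line)
  # Take the part between '<' and '>' of e.g. ##INFO=<...>
  body = line[/\A##\w+=<(.*)>\s*\z/, 1]
  return nil unless body

  # A value is either a double-quoted string (may contain commas)
  # or a run of characters up to the next comma.
  pairs = body.scan(/(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)/)
  pairs.each_with_object({}) do |(key, value), result|
    result[key] = value.start_with?('"') ? value[1..-2] : value
  end
end

line = '##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant is Precious(Clinical,Pubmed Cited)">'
p parse_header_kv(line)
```

A naive `split(',')` would cut the description at the comma after `Clinical`, which is the case the quoted-value branch guards against.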
@@ -11,6 +11,7 @@
11
11
  require 'bio-vcf/utils'
12
12
  require 'bio-vcf/vcf'
13
13
  require 'bio-vcf/vcfsample'
14
+ require 'bio-vcf/vcfheader_line'
14
15
  require 'bio-vcf/vcfheader'
15
16
  require 'bio-vcf/vcfline'
16
17
  require 'bio-vcf/vcfgenotypefield'
@@ -1,3 +1,14 @@
1
+ # This module parses the VCF header. A header consists of lines
2
+ # containing fields. Most fields are of 'key=value' type and appear
3
+ # only once. These can be retrieved with the find_field method.
4
+ #
5
+ # INFO and FORMAT fields are special as they appear multiple times
6
+ # and contain multiple key values (identified by an ID field).
7
+ # To retrieve these call 'info' and 'format' functions respectively,
8
+ # which return a hash on the contained ID.
9
+ #
10
+ # For the INFO and FORMAT fields a Ragel parser is used, mostly to
11
+ # deal with embedded quoted fields.
1
12
 
2
13
  module BioVcf
3
14
 
@@ -13,21 +24,27 @@ module BioVcf
13
24
  end
14
25
  nil
15
26
  end
27
+
28
+ def VcfHeaderParser.parse_field(line)
29
+ BioVcf::VcfHeaderParser::RagelKeyValues.run_lexer(line, debug: false)
30
+ end
16
31
  end
17
32
 
18
33
  class VcfHeader
19
34
 
20
- attr_reader :lines
35
+ attr_reader :lines, :field
21
36
 
22
37
  def initialize
23
38
  @lines = []
39
+ @field = {}
24
40
  end
25
41
 
42
+ # Add a new field to the header
26
43
  def add line
27
- @lines << line.strip
44
+ @lines += line.split(/\n/)
28
45
  end
29
46
 
30
- # Add a key value list to the header
47
+ # Push a special key value list to the header
31
48
  def tag h
32
49
  h2 = h.dup
33
50
  [:show_help,:skip_header,:verbose,:quiet,:debug].each { |key| h2.delete(key) }
@@ -82,6 +99,73 @@ module BioVcf
82
99
  @sample_index = index
83
100
  index
84
101
  end
85
- end
86
102
 
103
+ # Look for a line in the header with the field name and return the
104
+ # value, otherwise return nil
105
+ def find_field name
106
+ return field[name] if field[name]
107
+ @lines.each do | line |
108
+ value = line.scan(/###{name}=(.*)/)
109
+ if value[0]
110
+ v = value[0][0]
111
+ field[name] = v
112
+ return v
113
+ end
114
+ end
115
+ nil
116
+ end
117
+
118
+ # Look for all the lines that match the field name and return
119
+ # a hash of hashes. An empty hash is returned when there are
120
+ # no matches.
121
+ def find_fields name
122
+ res = {}
123
+ @lines.each do | line |
124
+ value = line.scan(/###{name}=<(.*)>/)
125
+ if value[0]
126
+ str = value[0][0]
127
+ # p str
128
+ v = VcfHeaderParser.parse_field(line)
129
+ id = v['ID']
130
+ res[id] = v
131
+ end
132
+ end
133
+ # p res
134
+ res
135
+ end
136
+
137
+ def format
138
+ find_fields('FORMAT')
139
+ end
140
+
141
+ def info
142
+ find_fields('INFO')
143
+ end
144
+
145
+ def meta
146
+ res = { 'INFO' => {}, 'FORMAT' => {} }
147
+ @lines.each do | line |
148
+ value = line.scan(/##(.*?)=(.*)/)
149
+ if value[0]
150
+ k,v = value[0]
151
+ if k != 'FORMAT' and k != 'INFO'
152
+ # p [k,v]
153
+ res[k] = v
154
+ end
155
+ end
156
+ end
157
+ res['INFO'] = info
158
+ res['FORMAT'] = format
159
+ # p [:res, res]
160
+ res
161
+ end
162
+
163
+ def method_missing(m, *args, &block)
164
+ name = m.to_s
165
+ value = find_field(name)
166
+ return value if value
167
+ raise "Unknown VCF header query '#{name}'"
168
+ end
169
+
170
+ end
87
171
  end
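
The `find_field` and `method_missing` additions above let simple `##key=value` header fields be read as if they were methods, for example `header.fileformat`. The following self-contained sketch shows the pattern on a reduced class. `MiniHeader` is a hypothetical name, and it deviates from the diff in two labelled ways: it also defines `respond_to_missing?`, and it falls through to `super` (raising `NoMethodError`) instead of raising a plain string.

```ruby
# Hypothetical reduced version of the header lookup pattern.
class MiniHeader
  def initialize(text)
    @lines = text.split(/\n/)
    @field = {} # cache of fields already looked up
  end

  # Return the value of a '##name=value' line, or nil when absent
  def find_field(name)
    return @field[name] if @field.key?(name)
    @lines.each do |line|
      match = line.match(/\A###{Regexp.escape(name)}=(.*)/)
      return @field[name] = match[1] if match
    end
    nil
  end

  # Deviation from the diff: keep respond_to? consistent
  def respond_to_missing?(name, include_private = false)
    !find_field(name.to_s).nil? || super
  end

  # Unknown methods are treated as header field queries
  def method_missing(name, *args, &block)
    value = find_field(name.to_s)
    return value unless value.nil?
    super # deviation from the diff: raises NoMethodError
  end
end

header = MiniHeader.new("##fileformat=VCFv4.1\n##phasing=none\n#CHROM\tPOS")
puts header.fileformat
puts header.phasing
```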