bio-vcf 0.9.2 → 0.9.4
- checksums.yaml +5 -5
- data/.travis.yml +1 -21
- data/LICENSE.txt +1 -1
- data/README.md +107 -73
- data/RELEASE_NOTES.md +20 -0
- data/RELEASE_NOTES.md~ +11 -0
- data/VERSION +1 -1
- data/bin/bio-vcf +49 -30
- data/bio-vcf.gemspec +1 -1
- data/features/cli.feature +4 -1
- data/features/diff_count.feature +0 -1
- data/features/step_definitions/cli-feature.rb +13 -9
- data/features/step_definitions/diff_count.rb +1 -1
- data/features/step_definitions/somaticsniper.rb +1 -1
- data/lib/bio-vcf/pcows.rb +31 -25
- data/lib/bio-vcf/vcffile.rb +46 -0
- data/lib/bio-vcf/vcfgenotypefield.rb +20 -20
- data/lib/bio-vcf/vcfheader.rb +29 -0
- data/lib/bio-vcf/vcfrecord.rb +5 -3
- data/lib/bio-vcf/vcfsample.rb +3 -1
- data/test/data/input/empty.vcf +2 -0
- data/test/data/regression/empty-stderr.new +12 -0
- data/test/data/regression/empty.new +2 -0
- data/test/data/regression/empty.ref +2 -0
- data/test/data/regression/eval_once-stderr.new +2 -2
- data/test/data/regression/eval_r.info.dp-stderr.new +9 -7
- data/test/data/regression/ifilter_s.dp-stderr.new +9 -7
- data/test/data/regression/pass1-stderr.new +9 -7
- data/test/data/regression/r.info.dp-stderr.new +4 -8
- data/test/data/regression/r.info.dp.new +0 -33
- data/test/data/regression/rewrite.info.sample-stderr.new +9 -7
- data/test/data/regression/s.dp-stderr.new +9 -7
- data/test/data/regression/seval_s.dp-stderr.new +9 -7
- data/test/data/regression/sfilter_seval_s.dp-stderr.new +9 -7
- data/test/data/regression/thread4-stderr.new +9 -7
- data/test/data/regression/thread4_4-stderr.new +25 -44
- data/test/data/regression/thread4_4.new +0 -20
- data/test/data/regression/thread4_4_failed_filter-stderr.new +1 -1
- data/test/data/regression/thread4_4_failed_filter-stderr.ref +1 -1
- data/test/data/regression/vcf2json_full_header-stderr.new +9 -7
- data/test/data/regression/vcf2json_use_meta-stderr.new +9 -7
- metadata +11 -7
- data/features/#cli.feature# +0 -71
- data/features/filter.feature~ +0 -35
- data/test/stress/stress_test.sh~ +0 -8
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
-
-metadata.gz:
-data.tar.gz:
+SHA256:
+metadata.gz: f5d7a81871906abfffc93455b4d664d5755fe8d79312134eae94e84659506198
+data.tar.gz: 8029269859aedd53c613ea9bbb17f951972b062060b5a40c22bdbe65c6c3dfa7
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: ed231c3a918e5f9ab9cd8a618f3f25f0c39613ac934b496af334d77dabe64831ff08cfc722a467fc51ab8c583358ca21be769ba1d9654437d54e7d21b811ee2c
+data.tar.gz: df49786c4f4aa5e3a3659c678fb66aeb4b7dd4bb575aacf34cc468663c18fa893502699d238b5034c308ff51a4dc05e0fadf929b923c1d646f61c3f07fef26c7
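Note on the checksums.yaml change: the new revision lists SHA256 digests alongside the SHA512 ones. A minimal sketch of how such a digest could be checked, assuming the archive bytes are already in memory. The helper name `digest_matches?` and the sample data are illustrative only and are not part of RubyGems or bio-vcf.

```ruby
require "digest"

# Hypothetical helper: compare the SHA256 hex digest of some bytes
# against an expected value, such as one listed in checksums.yaml.
def digest_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.strip.downcase
end

# Illustrative data only; a real check would read the archive,
# e.g. File.binread("data.tar.gz"), and use the digest from checksums.yaml.
sample_bytes = "abc"
expected = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

puts digest_matches?(sample_bytes, expected)
```

The same pattern works for the SHA512 entries by swapping in `Digest::SHA512`.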
data/.travis.yml
CHANGED
@@ -1,23 +1,3 @@
-sudo: false # required for the new containers
-
 language: ruby
-rvm:
-# - 1.9.3 <- No longer working
-- 2.1.0
-- 2.2.3
-
-# install:
-# - gem install cucumber rspec regressiontest
-
-branches:
-only:
-- master
-
-# - jruby-head
-# - jruby-19mode # JRuby in 1.9 mode
-# - 1.8.7
-# - jruby-18mode # JRuby in 1.8 mode
-# - rbx-18mode
 
-
-# script: bundle exec rspec spec
+arch: arm64
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -1,23 +1,15 @@
 # bio-vcf
 
-[![Build Status](https://secure.travis-ci.org/
+[![Build Status](https://secure.travis-ci.org/vcflib/bio-vcf.png)](http://travis-ci.org/vcflib/bio-vcf)
 
-## Updates
-
-* Getting ready for a 1.0 release
-* 0.9.1 removed a rare threading bug and cleanup on error
-* Added support for soft filters (request by Brad Chapman)
-* The outputter now writes (properly) in parallel with the parser
-* bio-vcf turns any VCF into JSON with header information, and
-allows you to pipe that JSON directly into any JSON supporting
-language, including Python and Javascript!
 
 ## Bio-vcf
 
-Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf
-very fast for genome-wide (WGS) data, it also comes with a
-filtering, evaluation and rewrite language and it can
-of textual data, including VCF header and contents in
+Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf
+is not only very fast for genome-wide (WGS) data, it also comes with a
+really nice filtering, evaluation and rewrite language and it can
+output any type of textual data, including VCF header and contents in
+RDF and JSON.
 
 So, why would you use bio-vcf over other parsers? Because
 
@@ -79,18 +71,18 @@ BED format on a 16 core machine takes
 sys 0m5.039s
 ```
 
-which shows decent core utilisation (10x). Running
+which shows decent core utilisation (10x). Running
 gzip compressed VCF files of 30+ Gb has similar performance gains.
 
 To view some complex filters on an 80Gb SNP file check out a
-[GTEx exercise](https://github.com/
+[GTEx exercise](https://github.com/vcflib/bio-vcf/blob/master/doc/GTEx_reduce.md).
 
 Use zcat (or even better pigz which is multi-core itself) to pipe such
 gzipped (vcf.gz) files into bio-vcf, e.g.
 
 ```sh
 zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50'
---sfilter '!s.empty? and s.dp>20'
+--sfilter '!s.empty? and s.dp>20'
 --eval '[r.chrom,r.pos,r.pos+1]' > test.bed
 ```
 
@@ -124,7 +116,7 @@ Where 's.dp' is the shorter name for 'sample.dp'.
 
 It is also possible to specify sample names, or info fields:
 
-For example, to filter somatic data
+For example, to filter somatic data
 
 ```ruby
 bio-vcf --filter 'rec.info.dp>5 and rec.alt.size==1 and rec.tumor.bq[rec.alt]>30 and rec.tumor.mq>20' < file.vcf
@@ -252,7 +244,7 @@ The VCF format is commonly used for variant calling between NGS
 samples. The fast parser needs to carry some state, recorded for each
 file in VcfHeader, which contains the VCF file header. Individual
 lines (variant calls) first go through a raw parser returning an array
-of fields. Further (lazy) parsing is handled through VcfRecord.
+of fields. Further (lazy) parsing is handled through VcfRecord.
 
 At this point the filter is pretty generic with multi-sample support.
 If something is not working, check out the feature descriptions and
@@ -261,17 +253,16 @@ example of a VCF statement you need to work on.
 
 ## Installation
 
-
-a performance improvement. Bio-vcf will show the Ruby version when
-typing the command 'bio-vcf -h'.
+The bio-vcf has no other dependencies but Ruby.
 
-To
+To install bio-vcf with Ruby gems:
 
 ```sh
 gem install bio-vcf
 bio-vcf -h
 ```
 
+
 ## Command line interface (CLI)
 
 Get the version of the VCF file
@@ -295,6 +286,13 @@ Get the sample names
 NORMAL,TUMOR
 ```
 
+Alternatively use the command line switch for --names, e.g.
+
+```ruby
+bio-vcf --names < file.vcf
+NORMAL,TUMOR
+```
+
 Get information from the header (META)
 
 ```ruby
@@ -305,39 +303,39 @@ The 'fields' array contains unprocessed data (strings). Print first
 five raw fields
 
 ```ruby
-bio-vcf --eval 'fields[0..4]' < file.vcf
+bio-vcf --eval 'fields[0..4]' < file.vcf
 ```
 
 Add a filter to display the fields on chromosome 12
 
 ```ruby
-bio-vcf --filter 'fields[0]=="12"' --eval 'fields[0..4]' < file.vcf
+bio-vcf --filter 'fields[0]=="12"' --eval 'fields[0..4]' < file.vcf
 ```
 
 It gets better when we start using processed data, represented by an
 object named 'rec'. Position is a value, so we can filter a range
 
 ```ruby
-bio-vcf --filter 'rec.chrom=="12" and rec.pos>96_641_270 and rec.pos<96_641_276' < file.vcf
+bio-vcf --filter 'rec.chrom=="12" and rec.pos>96_641_270 and rec.pos<96_641_276' < file.vcf
 ```
 
 The shorter name for 'rec.chrom' is 'r.chrom', so you may write
 
 ```ruby
-bio-vcf --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
+bio-vcf --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
 ```
 
 To ignore and continue parsing on missing data use the
 --ignore-missing (-i) and or --quiet (-q) switches
 
 ```ruby
-bio-vcf -i --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
+bio-vcf -i --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
 ```
 
 Info fields are referenced by
 
 ```ruby
-bio-vcf --filter 'rec.info.dp>100 and rec.info.readposranksum<=0.815' < file.vcf
+bio-vcf --filter 'rec.info.dp>100 and rec.info.readposranksum<=0.815' < file.vcf
 ```
 
 (alternatively you can use the indexed rec.info['DP'] and list INFO fields with
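Note on the filter examples in this hunk: the `--filter` argument is a Ruby expression evaluated per record, with `rec` and its alias `r` in scope. A minimal sketch of that idea, assuming a plain `Struct` as a stand-in record and `Kernel#eval` as the evaluator. `Rec` and `filter_match?` are hypothetical names; this is not how bio-vcf itself is implemented.

```ruby
# Stand-in record type; bio-vcf's real record class is richer.
Rec = Struct.new(:chrom, :pos)

# Evaluate a filter string with `rec` and the shorter alias `r` in scope.
# Assumption: plain Kernel#eval is used here for illustration only.
def filter_match?(expression, rec)
  r = rec # alias made visible to the evaluated expression via binding
  eval(expression, binding)
end

filter = 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276'

records = [
  Rec.new("12", 96_641_273), # inside the range on chromosome 12
  Rec.new("12", 96_641_280), # same chromosome, outside the range
  Rec.new("1",  96_641_273)  # right position, wrong chromosome
]

selected = records.select { |rec| filter_match?(filter, rec) }
selected.each { |rec| puts [rec.chrom, rec.pos].join("\t") }
```

Because the filter is ordinary Ruby, underscores in numeric literals such as `96_641_270` are just digit separators.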
@@ -346,14 +344,14 @@ rec.info.fields).
 Subfields defined by rec.format:
 
 ```ruby
-bio-vcf --filter 'rec.tumor.ss != 2' < file.vcf
+bio-vcf --filter 'rec.tumor.ss != 2' < file.vcf
 ```
 
 Output
 
 ```ruby
-bio-vcf --filter 'rec.tumor.gq>30'
---eval '[rec.ref,rec.alt,rec.tumor.bcount,rec.tumor.gq,rec.normal.gq]'
+bio-vcf --filter 'rec.tumor.gq>30'
+--eval '[rec.ref,rec.alt,rec.tumor.bcount,rec.tumor.gq,rec.normal.gq]'
 < file.vcf
 ```
 
@@ -367,26 +365,26 @@ Show the count of the bases that were scored as somatic
 Actually, we have a convenience implementation for bcount, so this is the same
 
 ```ruby
-bio-vcf --eval 'rec.alt+"\t"+rec.tumor.bcount[rec.alt].to_s+"\t"+rec.tumor.gq.to_s'
+bio-vcf --eval 'rec.alt+"\t"+rec.tumor.bcount[rec.alt].to_s+"\t"+rec.tumor.gq.to_s'
 < file.vcf
 ```
 
 Filter on the somatic results that were scored at least 4 times
-
+
 ```ruby
-bio-vcf --filter 'rec.alt.size==1 and rec.tumor.bcount[rec.alt]>4' < test.vcf
+bio-vcf --filter 'rec.alt.size==1 and rec.tumor.bcount[rec.alt]>4' < test.vcf
 ```
 
 Similar for base quality scores
 
 ```ruby
-bio-vcf --filter 'rec.alt.size==1 and rec.tumor.amq[rec.alt]>30' < test.vcf
+bio-vcf --filter 'rec.alt.size==1 and rec.tumor.amq[rec.alt]>30' < test.vcf
 ```
 
 Filter out on sample values
 
 ```ruby
-bio-vcf --sfilter 's.dp>20' < test.vcf
+bio-vcf --sfilter 's.dp>20' < test.vcf
 ```
 
 To filter missing on samples:
@@ -468,17 +466,17 @@ Even shorter r is an alias for rec
 Note: special functions are not yet implemented! Look below
 for genotype processing which has indexing in 'gti'.
 
-Sometime you want to use a special function in a filter. For
-example percentage variant reads can be defined as [a,c,g,t]
-with frequencies against sample read depth (dp) as
-[0,0.03,0.47,0.50]. Filtering would with a special function,
+Sometime you want to use a special function in a filter. For
+example percentage variant reads can be defined as [a,c,g,t]
+with frequencies against sample read depth (dp) as
+[0,0.03,0.47,0.50]. Filtering would with a special function,
 which we named freq
 
 ```sh
 bio-vcf --sfilter "s.freq(2)>0.30" < file.vcf
 ```
 
-which is equal to
+which is equal to
 
 ```sh
 bio-vcf --sfilter "s.freq.g>0.30" < file.vcf
@@ -498,7 +496,7 @@ ref should always be identical across samples.
 
 ## DbSNP
 
-One clinical variant DbSNP example
+One clinical variant DbSNP example
 
 ```sh
 bio-vcf --eval '[rec.id,rec.chr,rec.pos,rec.alt,rec.info.sao,rec.info.CLNDBN]' < clinvar_20140303.vcf
@@ -523,16 +521,16 @@ renders
 
 bio-vcf allows for set analysis. With the complement filter, for
 example, samples are selected that evaluate to true, all others should
-evaluate to false. For this we create three filters, one for all
+evaluate to false. For this we create three filters, one for all
 samples that are included (the --ifilter or -if), for all samples that
 are excluded (the --efilter or -ef) and for any sample (the --sfilter
 or -sf). So i=include (OR filter), e=exclude and s=any sample (AND
-filter).
+filter).
 
 The equivalent of the union filter is by using the --sfilter, so
 
 ```sh
-bio-vcf --sfilter 's.dp>20'
+bio-vcf --sfilter 's.dp>20'
 ```
 
 Filters DP on all samples and is true if all samples match the
@@ -540,7 +538,7 @@ criterium (AND). To filter on a subset you can add a
 selector
 
 ```sh
-bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20'
+bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20'
 ```
 
 For set analysis there are the additional ifilter (include) and
@@ -560,7 +558,7 @@ values
 
 The equivalent of the complement filter is by specifying what samples
 to include, here with a regex and define filters on the included
-and excluded samples (the ones not in ifilter-samples) and the
+and excluded samples (the ones not in ifilter-samples) and the
 
 ```sh
 ./bin/bio-vcf -i --sfilter 's.dp>20' --ifilter-samples 2,4 --ifilter 's.gt==r.s1t1.gt'
@@ -581,7 +579,7 @@ To print out the GT's add --seval
 To set an additional filter on the excluded samples:
 
 ```sh
-bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gt==rec.s1t1.gt and s.gq>10' --seval s.gq --efilter 's.gq==99'
+bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gt==rec.s1t1.gt and s.gq>10' --seval s.gq --efilter 's.gq==99'
 ```
 
 Etc. etc. Any combination of sfilter, ifilter and efilter is possible.
@@ -594,15 +592,15 @@ In the near future it is also possible to select samples on a regex (here
 select all samples where the name starts with s3)
 
 ```sh
-bio-vcf --isample-regex '/^s3/' --ifilter 's.dp>20'
+bio-vcf --isample-regex '/^s3/' --ifilter 's.dp>20'
 ```
 
 ```sh
-bio-vcf --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt' --efilter 'gt!=s3t1.gt'
+bio-vcf --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt' --efilter 'gt!=s3t1.gt'
 --set-intersect include=true
-bio-vcf --include /s3.+/ --sample-regex /^t2/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt'
+bio-vcf --include /s3.+/ --sample-regex /^t2/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt'
 --set-catesian one in include=true, rest=false
-bio-vcf --unique-sample (any) --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt!="0/0"'
+bio-vcf --unique-sample (any) --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt!="0/0"'
 ```
 
 With the filter commands you can use --ignore-missing to skip errors.
@@ -625,7 +623,7 @@ results in a string value
 to access components of the genotype field we can use standard Ruby
 
 ```ruby
-bio-vcf --seval 's.gt.split(/\//)[0]'
+bio-vcf --seval 's.gt.split(/\//)[0]'
 1 10665 . . 0 0 . 0 0
 1 10694 . . 1 1 . . .
 1 12783 0 0 0 0 0 0 0
@@ -636,7 +634,7 @@ or special functions, such as 'gti' which gives the genotype as an
 indexed value array
 
 ```ruby
-bio-vcf --seval 's.gti[0]'
+bio-vcf --seval 's.gti[0]'
 1 10665 0 0 0 0
 1 10694 1 1
 1 12783 0 0 0 0 0 0 0
@@ -646,7 +644,7 @@ indexed value array
 and 'gts' as a nucleotide string array
 
 ```ruby
-bio-vcf --seval 's.gts'
+bio-vcf --seval 's.gts'
 1 10665 C C C C
 1 10694 G G
 1 12783 G G G G G G G
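Note on the genotype hunks: the README describes 'gti' as the genotype in index form and 'gts' as the matching nucleotide strings. A minimal sketch of that relationship in plain Ruby, assuming a GT string, one REF allele and a list of ALT alleles. `genotype_indices` and `genotype_nucleotides` are hypothetical helpers, and mapping a missing allele to `nil` is an assumption; the diff does not show how bio-vcf treats missing values.

```ruby
# Split a GT string such as "0/1" or "1|0" into allele indices.
# Assumption: a missing allele (".") is represented as nil.
def genotype_indices(gt)
  gt.split(%r{[/|]}).map { |allele| allele == "." ? nil : allele.to_i }
end

# Translate allele indices into nucleotides: index 0 is REF,
# index 1 and up select from the ALT list.
def genotype_nucleotides(gt, ref, alts)
  alleles = [ref] + alts
  genotype_indices(gt).map { |i| i.nil? ? nil : alleles[i] }
end

p genotype_indices("0/1")
p genotype_nucleotides("0/1", "C", ["T"])
p genotype_nucleotides("./.", "C", ["T"])
```

The first element of the index list corresponds to the `s.gt.split(/\//)[0]` expression shown in the README, converted to an integer.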
@@ -670,9 +668,9 @@ example signficance, use
 Now you can index other fields, e.g. GL
 
 ```ruby
-./bin/bio-vcf --seval '[(!s.empty? ? s.gl[s.gtindex]:-1)]'
+./bin/bio-vcf --seval '[(!s.empty? ? s.gl[s.gtindex]:-1)]'
 1 900057 1.0 1.0 0.994 1.0 1.0 -1 0.999 1.0 0.997 -1 0.994 0.989 -1 0.991 -1 0.972 0.992 1.0
-
+```
 
 shows a number of SNPs have been scored with high significance and a
 number are missing, here marked as -1.
@@ -741,6 +739,17 @@ To remove/select 3 samples:
 bio-vcf --samples 0,1,3 < mytest.vcf
 ```
 
+You can also select samples by name (as long as they do not contain
+spaces)
+
+
+```sh
+bio-vcf --names < mytest.vcf
+Original s1t1 s2t1 s3t1 s1t2 s2t2 s3t2
+bio-vcf --samples "Original,s1t1,s3t1" < mytest.vcf
+```
+
+
 Filter on a BED file and annotate the gene name in the resulting VCF
 
 ```sh
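Note on selecting samples by name: with the sample order printed by `--names`, the selection `"Original,s1t1,s3t1"` names the same columns as the index form `0,1,3` used in the earlier example. A minimal sketch of that name-to-index mapping, assuming a comma-separated selection string. `sample_indices` is a hypothetical helper, not a bio-vcf function.

```ruby
# Map a comma-separated list of sample names onto their column positions.
# Raises ArgumentError for a name that is not in the header.
def sample_indices(header_names, selection)
  selection.split(",").map do |name|
    header_names.index(name) or raise ArgumentError, "unknown sample: #{name}"
  end
end

# Sample names as printed by the --names example in the README.
names = %w[Original s1t1 s2t1 s3t1 s1t2 s2t2 s3t2]

p sample_indices(names, "Original,s1t1,s3t1")
```

Splitting on commas also suggests why names containing spaces are awkward on the command line: the whole selection must arrive as a single shell argument.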
@@ -791,7 +800,7 @@ To have more output options bio-vcf can use an [ERB
 template](http://www.stuartellis.eu/articles/erb/) for every match. This is a
 very flexible option that can output textual formats such as JSON, YAML, HTML
 and RDF. Examples are provided in
-[./templates](https://github.com/
+[./templates](https://github.com/vcflib/bio-vcf/templates/). A JSON
 template could be
 
 ```Javascript
@@ -805,7 +814,7 @@ template could be
 };
 ```
 
-To get JSON, run with something like (combining
+To get JSON, run with something like (combining
 with a filter)
 
 ```sh
@@ -831,11 +840,11 @@ Likewise for RDF output:
 bio-vcf --template template/vcf2rdf.erb --filter 'r.info.sao==1' < dbsnp.vcf
 ```
 
-renders the ERB template
+renders the ERB template
 
 ```ruby
 <%
-id = Turtle::mangle_identifier(['ch'+rec.chrom,rec.pos,rec.alt.join('')].join('_'))
+id = Turtle::mangle_identifier(['ch'+rec.chrom,rec.pos,rec.alt.join('')].join('_'))
 %>
 :<%= id %>
 :query_id "<%= id %>",
@@ -848,7 +857,7 @@ renders the ERB template
 db:vcf true .
 ```
 
-into
+into
 
 ```
 :ch13_33703698_A
@@ -936,9 +945,9 @@ To get and put the full information from the header, simple use
 vcf.meta.to_json. See ./template/vcf2json_full_header.erb for an
 example. This meta information can also be used to output info fields
 and sample values on the fly! For an example, see the template at
-[./template/vcf2json_use_meta.erb](https://github.com/
+[./template/vcf2json_use_meta.erb](https://github.com/vcflib/bio-vcf/tree/master/template/vcf2json_use_meta.erb)
 and the generated output at
-[./test/data/regression/vcf2json_use_meta.ref](https://github.com/
+[./test/data/regression/vcf2json_use_meta.ref](https://github.com/vcflib/bio-vcf/tree/master/test/data/regression/vcf2json_use_meta.ref).
 
 This way, it is possible to write templates that can convert the content of
 *any* VCF file without prior knowledge to JSON, RDF, etc.
@@ -955,7 +964,7 @@ Simple statistics are available for REF>ALT changes:
 G>A 59 45%
 C>T 30 23%
 A>G 5 4%
-C>G 5 4%
+C>G 5 4%
 C>A 5 4%
 G>T 4 3%
 T>C 4 3%
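Note on the statistics table: each row is a REF>ALT change, its count, and its share of all changes. A minimal sketch of how such a table could be tallied, assuming the input is a list of `[ref, alt]` pairs. `ref_alt_stats` is a hypothetical helper, and rounding to whole percentages is an assumption; the diff does not show how bio-vcf computes the column.

```ruby
# Count REF>ALT changes and report [change, count, percent] rows,
# most frequent first. Percentages are rounded to whole numbers.
def ref_alt_stats(changes)
  counts = Hash.new(0)
  changes.each { |ref, alt| counts["#{ref}>#{alt}"] += 1 }
  total = counts.values.sum
  counts.sort_by { |_, n| -n }.map do |change, n|
    [change, n, (100.0 * n / total).round]
  end
end

# Illustrative input only, not data from the README.
changes = [%w[G A], %w[G A], %w[G A], %w[C T]]

ref_alt_stats(changes).each do |change, n, pct|
  puts "#{change}\t#{n}\t#{pct}%"
end
```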
@@ -976,9 +985,9 @@ Simple statistics are available for REF>ALT changes:
 ## Other examples
 
 For more exercises and examples see
-[doc](https://github.com/
+[doc](https://github.com/vcflib/bio-vcf/tree/master/doc) directory
 and the the feature
-[section](https://github.com/
+[section](https://github.com/vcflib/bio-vcf/tree/master/features).
 
 ## API
 
@@ -1009,6 +1018,23 @@ what the command line interface uses (see ./bin/bio-vcf)
 end
 ```
 
+### VCFFile
+
+The class ```BioVcf::VCFfile``` wraps a file and provides an ```enum``` with the
+method each, that can be used as in iterator.
+
+```ruby
+vcf_file = "dbsnp.vcf"
+vcf = BioVcf::VCFfile.new(file:file, is_gz: false )
+it vcf.each
+puts it.peek
+
+vcf_file = "dbsnp.vcf.gz"
+vcf = BioVcf::VCFfile.new(file:file, is_gz: true )
+it vcf.each
+puts it.peek
+```
+
 ## Trouble shooting
 
 ### MRI supports threading
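Note on the new VCFFile section: the README snippet assigns `vcf_file` but passes `file`, and `it vcf.each` is presumably meant to be `it = vcf.each`. The pattern it relies on is an `each` method that returns an Enumerator when called without a block, which is what makes `peek` available. A minimal sketch of that pattern, using a hypothetical `LineSource` class over a `StringIO`; it is a stand-in and not the `BioVcf::VCFfile` API.

```ruby
require "stringio"

# Hypothetical stand-in for a file wrapper; not the BioVcf::VCFfile API.
class LineSource
  def initialize(io)
    @io = io
  end

  # Without a block, return an Enumerator so callers can use peek and next.
  def each
    return enum_for(:each) unless block_given?

    @io.each_line { |line| yield line.chomp }
  end
end

io = StringIO.new("##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t10665\n")
lines = LineSource.new(io).each

puts lines.peek # look at the first line without consuming it
puts lines.next # consume that same first line
puts lines.next # move on to the second line
```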
@@ -1037,7 +1063,7 @@ For more complex filters use lambda inside a conditional
 ```ruby
 ( fast_check ? lambda { slow_check }.call : false )
 ```
-
+
 where slow_check is the slow section of your query. As is shown
 earlier in this document. Don't forget the .call!
 
@@ -1056,6 +1082,15 @@ For larger files set the timeout to 600, or so. --timeout 600.
 
 Different values may show different core use on a machine.
 
+### Development
+
+To run the tests from source
+
+```sh
+bundle install --path vendor/bundle
+bundle exec rake
+```
+
 ### Debugging
 
 To debug output use '-v --num-threads=1' for generating useful
@@ -1073,12 +1108,12 @@ temporary directory may remain.
 Information on the source tree, documentation, examples, issues and
 how to contribute, see
 
-http://github.com/
+http://github.com/vcflib/bio-vcf
 
 ## Cite
 
 If you use this software, please cite one of
-
+
 * [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
 * [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
 
@@ -1088,5 +1123,4 @@ This Biogem is published at (http://biogems.info/index.html#bio-vcf)
 
 ## Copyright
 
-Copyright (c) 2014 Pjotr Prins. See LICENSE.txt for further details.
-
+Copyright (c) 2014-2020 Pjotr Prins. See LICENSE.txt for further details.