bio-vcf 0.9.2 → 0.9.4
- checksums.yaml +5 -5
- data/.travis.yml +1 -21
- data/LICENSE.txt +1 -1
- data/README.md +107 -73
- data/RELEASE_NOTES.md +20 -0
- data/RELEASE_NOTES.md~ +11 -0
- data/VERSION +1 -1
- data/bin/bio-vcf +49 -30
- data/bio-vcf.gemspec +1 -1
- data/features/cli.feature +4 -1
- data/features/diff_count.feature +0 -1
- data/features/step_definitions/cli-feature.rb +13 -9
- data/features/step_definitions/diff_count.rb +1 -1
- data/features/step_definitions/somaticsniper.rb +1 -1
- data/lib/bio-vcf/pcows.rb +31 -25
- data/lib/bio-vcf/vcffile.rb +46 -0
- data/lib/bio-vcf/vcfgenotypefield.rb +20 -20
- data/lib/bio-vcf/vcfheader.rb +29 -0
- data/lib/bio-vcf/vcfrecord.rb +5 -3
- data/lib/bio-vcf/vcfsample.rb +3 -1
- data/test/data/input/empty.vcf +2 -0
- data/test/data/regression/empty-stderr.new +12 -0
- data/test/data/regression/empty.new +2 -0
- data/test/data/regression/empty.ref +2 -0
- data/test/data/regression/eval_once-stderr.new +2 -2
- data/test/data/regression/eval_r.info.dp-stderr.new +9 -7
- data/test/data/regression/ifilter_s.dp-stderr.new +9 -7
- data/test/data/regression/pass1-stderr.new +9 -7
- data/test/data/regression/r.info.dp-stderr.new +4 -8
- data/test/data/regression/r.info.dp.new +0 -33
- data/test/data/regression/rewrite.info.sample-stderr.new +9 -7
- data/test/data/regression/s.dp-stderr.new +9 -7
- data/test/data/regression/seval_s.dp-stderr.new +9 -7
- data/test/data/regression/sfilter_seval_s.dp-stderr.new +9 -7
- data/test/data/regression/thread4-stderr.new +9 -7
- data/test/data/regression/thread4_4-stderr.new +25 -44
- data/test/data/regression/thread4_4.new +0 -20
- data/test/data/regression/thread4_4_failed_filter-stderr.new +1 -1
- data/test/data/regression/thread4_4_failed_filter-stderr.ref +1 -1
- data/test/data/regression/vcf2json_full_header-stderr.new +9 -7
- data/test/data/regression/vcf2json_use_meta-stderr.new +9 -7
- metadata +11 -7
- data/features/#cli.feature# +0 -71
- data/features/filter.feature~ +0 -35
- data/test/stress/stress_test.sh~ +0 -8
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
|
-
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: f5d7a81871906abfffc93455b4d664d5755fe8d79312134eae94e84659506198
|
4
|
+
data.tar.gz: 8029269859aedd53c613ea9bbb17f951972b062060b5a40c22bdbe65c6c3dfa7
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: ed231c3a918e5f9ab9cd8a618f3f25f0c39613ac934b496af334d77dabe64831ff08cfc722a467fc51ab8c583358ca21be769ba1d9654437d54e7d21b811ee2c
|
7
|
+
data.tar.gz: df49786c4f4aa5e3a3659c678fb66aeb4b7dd4bb575aacf34cc468663c18fa893502699d238b5034c308ff51a4dc05e0fadf929b923c1d646f61c3f07fef26c7
|
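The new checksums.yaml records SHA256 and SHA512 hex digests for metadata.gz and data.tar.gz. As a minimal sketch of how such a digest could be checked against downloaded bytes, the following uses only Ruby's standard `digest` library. The helper name `checksum_ok?` is made up for illustration and is not part of RubyGems or bio-vcf.

```ruby
require "digest"

# Hypothetical helper: compare the hex digest of `bytes` with an expected value
# such as one of the entries listed in checksums.yaml.
def checksum_ok?(bytes, expected_hex, algorithm: "SHA256")
  actual = case algorithm
           when "SHA256" then Digest::SHA256.hexdigest(bytes)
           when "SHA512" then Digest::SHA512.hexdigest(bytes)
           else raise ArgumentError, "unsupported algorithm: #{algorithm}"
           end
  actual == expected_hex
end
```

In practice `bytes` would come from something like `File.binread("data.tar.gz")`, and `expected_hex` from the matching line of checksums.yaml.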
data/.travis.yml
CHANGED
@@ -1,23 +1,3 @@
|
|
1
|
-
sudo: false # required for the new containers
|
2
|
-
|
3
1
|
language: ruby
|
4
|
-
rvm:
|
5
|
-
# - 1.9.3 <- No longer working
|
6
|
-
- 2.1.0
|
7
|
-
- 2.2.3
|
8
|
-
|
9
|
-
# install:
|
10
|
-
# - gem install cucumber rspec regressiontest
|
11
|
-
|
12
|
-
branches:
|
13
|
-
only:
|
14
|
-
- master
|
15
|
-
|
16
|
-
# - jruby-head
|
17
|
-
# - jruby-19mode # JRuby in 1.9 mode
|
18
|
-
# - 1.8.7
|
19
|
-
# - jruby-18mode # JRuby in 1.8 mode
|
20
|
-
# - rbx-18mode
|
21
2
|
|
22
|
-
|
23
|
-
# script: bundle exec rspec spec
|
3
|
+
arch: arm64
|
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -1,23 +1,15 @@
|
|
1
1
|
# bio-vcf
|
2
2
|
|
3
|
-
[](http://travis-ci.org/vcflib/bio-vcf)
|
4
4
|
|
5
|
-
## Updates
|
6
|
-
|
7
|
-
* Getting ready for a 1.0 release
|
8
|
-
* 0.9.1 removed a rare threading bug and cleanup on error
|
9
|
-
* Added support for soft filters (request by Brad Chapman)
|
10
|
-
* The outputter now writes (properly) in parallel with the parser
|
11
|
-
* bio-vcf turns any VCF into JSON with header information, and
|
12
|
-
allows you to pipe that JSON directly into any JSON supporting
|
13
|
-
language, including Python and Javascript!
|
14
5
|
|
15
6
|
## Bio-vcf
|
16
7
|
|
17
|
-
Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf
|
18
|
-
very fast for genome-wide (WGS) data, it also comes with a
|
19
|
-
filtering, evaluation and rewrite language and it can
|
20
|
-
of textual data, including VCF header and contents in
|
8
|
+
Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf
|
9
|
+
is not only very fast for genome-wide (WGS) data, it also comes with a
|
10
|
+
really nice filtering, evaluation and rewrite language and it can
|
11
|
+
output any type of textual data, including VCF header and contents in
|
12
|
+
RDF and JSON.
|
21
13
|
|
22
14
|
So, why would you use bio-vcf over other parsers? Because
|
23
15
|
|
@@ -79,18 +71,18 @@ BED format on a 16 core machine takes
|
|
79
71
|
sys 0m5.039s
|
80
72
|
```
|
81
73
|
|
82
|
-
which shows decent core utilisation (10x). Running
|
74
|
+
which shows decent core utilisation (10x). Running
|
83
75
|
gzip compressed VCF files of 30+ Gb has similar performance gains.
|
84
76
|
|
85
77
|
To view some complex filters on an 80Gb SNP file check out a
|
86
|
-
[GTEx exercise](https://github.com/
|
78
|
+
[GTEx exercise](https://github.com/vcflib/bio-vcf/blob/master/doc/GTEx_reduce.md).
|
87
79
|
|
88
80
|
Use zcat (or even better pigz which is multi-core itself) to pipe such
|
89
81
|
gzipped (vcf.gz) files into bio-vcf, e.g.
|
90
82
|
|
91
83
|
```sh
|
92
84
|
zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50'
|
93
|
-
--sfilter '!s.empty? and s.dp>20'
|
85
|
+
--sfilter '!s.empty? and s.dp>20'
|
94
86
|
--eval '[r.chrom,r.pos,r.pos+1]' > test.bed
|
95
87
|
```
|
96
88
|
|
@@ -124,7 +116,7 @@ Where 's.dp' is the shorter name for 'sample.dp'.
|
|
124
116
|
|
125
117
|
It is also possible to specify sample names, or info fields:
|
126
118
|
|
127
|
-
For example, to filter somatic data
|
119
|
+
For example, to filter somatic data
|
128
120
|
|
129
121
|
```ruby
|
130
122
|
bio-vcf --filter 'rec.info.dp>5 and rec.alt.size==1 and rec.tumor.bq[rec.alt]>30 and rec.tumor.mq>20' < file.vcf
|
@@ -252,7 +244,7 @@ The VCF format is commonly used for variant calling between NGS
|
|
252
244
|
samples. The fast parser needs to carry some state, recorded for each
|
253
245
|
file in VcfHeader, which contains the VCF file header. Individual
|
254
246
|
lines (variant calls) first go through a raw parser returning an array
|
255
|
-
of fields. Further (lazy) parsing is handled through VcfRecord.
|
247
|
+
of fields. Further (lazy) parsing is handled through VcfRecord.
|
256
248
|
|
257
249
|
At this point the filter is pretty generic with multi-sample support.
|
258
250
|
If something is not working, check out the feature descriptions and
|
@@ -261,17 +253,16 @@ example of a VCF statement you need to work on.
|
|
261
253
|
|
262
254
|
## Installation
|
263
255
|
|
264
|
-
|
265
|
-
a performance improvement. Bio-vcf will show the Ruby version when
|
266
|
-
typing the command 'bio-vcf -h'.
|
256
|
+
The bio-vcf has no other dependencies but Ruby.
|
267
257
|
|
268
|
-
To
|
258
|
+
To install bio-vcf with Ruby gems:
|
269
259
|
|
270
260
|
```sh
|
271
261
|
gem install bio-vcf
|
272
262
|
bio-vcf -h
|
273
263
|
```
|
274
264
|
|
265
|
+
|
275
266
|
## Command line interface (CLI)
|
276
267
|
|
277
268
|
Get the version of the VCF file
|
@@ -295,6 +286,13 @@ Get the sample names
|
|
295
286
|
NORMAL,TUMOR
|
296
287
|
```
|
297
288
|
|
289
|
+
Alternatively use the command line switch for --names, e.g.
|
290
|
+
|
291
|
+
```ruby
|
292
|
+
bio-vcf --names < file.vcf
|
293
|
+
NORMAL,TUMOR
|
294
|
+
```
|
295
|
+
|
298
296
|
Get information from the header (META)
|
299
297
|
|
300
298
|
```ruby
|
@@ -305,39 +303,39 @@ The 'fields' array contains unprocessed data (strings). Print first
|
|
305
303
|
five raw fields
|
306
304
|
|
307
305
|
```ruby
|
308
|
-
bio-vcf --eval 'fields[0..4]' < file.vcf
|
306
|
+
bio-vcf --eval 'fields[0..4]' < file.vcf
|
309
307
|
```
|
310
308
|
|
311
309
|
Add a filter to display the fields on chromosome 12
|
312
310
|
|
313
311
|
```ruby
|
314
|
-
bio-vcf --filter 'fields[0]=="12"' --eval 'fields[0..4]' < file.vcf
|
312
|
+
bio-vcf --filter 'fields[0]=="12"' --eval 'fields[0..4]' < file.vcf
|
315
313
|
```
|
316
314
|
|
317
315
|
It gets better when we start using processed data, represented by an
|
318
316
|
object named 'rec'. Position is a value, so we can filter a range
|
319
317
|
|
320
318
|
```ruby
|
321
|
-
bio-vcf --filter 'rec.chrom=="12" and rec.pos>96_641_270 and rec.pos<96_641_276' < file.vcf
|
319
|
+
bio-vcf --filter 'rec.chrom=="12" and rec.pos>96_641_270 and rec.pos<96_641_276' < file.vcf
|
322
320
|
```
|
323
321
|
|
324
322
|
The shorter name for 'rec.chrom' is 'r.chrom', so you may write
|
325
323
|
|
326
324
|
```ruby
|
327
|
-
bio-vcf --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
|
325
|
+
bio-vcf --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
|
328
326
|
```
|
329
327
|
|
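The filter strings above read as ordinary Ruby expressions: `96_641_270` is an integer literal with underscore separators, and `and` is Ruby's low-precedence boolean operator. The sketch below evaluates such a string against a stand-in record. It assumes the filter is plain Ruby evaluated with `rec` and `r` in scope, which the examples suggest but this excerpt does not confirm; `Rec` and `matches?` are hypothetical names, not bio-vcf's internals.

```ruby
# Hypothetical stand-in for a parsed VCF record; not bio-vcf's VcfRecord.
Rec = Struct.new(:chrom, :pos)

# Evaluate a filter string with `rec` and its short alias `r` in scope.
# Assumption: the filter is an ordinary Ruby expression.
def matches?(rec, filter)
  r = rec
  eval(filter, binding) ? true : false
end

# The range filter from the README example, kept as a plain string.
FILTER = 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276'
```

With this stand-in, `matches?(Rec.new("12", 96_641_273), FILTER)` is true, while a record on another chromosome or at the excluded boundary position 96_641_276 is rejected.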
330
328
|
To ignore and continue parsing on missing data use the
|
331
329
|
--ignore-missing (-i) and or --quiet (-q) switches
|
332
330
|
|
333
331
|
```ruby
|
334
|
-
bio-vcf -i --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
|
332
|
+
bio-vcf -i --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
|
335
333
|
```
|
336
334
|
|
337
335
|
Info fields are referenced by
|
338
336
|
|
339
337
|
```ruby
|
340
|
-
bio-vcf --filter 'rec.info.dp>100 and rec.info.readposranksum<=0.815' < file.vcf
|
338
|
+
bio-vcf --filter 'rec.info.dp>100 and rec.info.readposranksum<=0.815' < file.vcf
|
341
339
|
```
|
342
340
|
|
343
341
|
(alternatively you can use the indexed rec.info['DP'] and list INFO fields with
|
@@ -346,14 +344,14 @@ rec.info.fields).
|
|
346
344
|
Subfields defined by rec.format:
|
347
345
|
|
348
346
|
```ruby
|
349
|
-
bio-vcf --filter 'rec.tumor.ss != 2' < file.vcf
|
347
|
+
bio-vcf --filter 'rec.tumor.ss != 2' < file.vcf
|
350
348
|
```
|
351
349
|
|
352
350
|
Output
|
353
351
|
|
354
352
|
```ruby
|
355
|
-
bio-vcf --filter 'rec.tumor.gq>30'
|
356
|
-
--eval '[rec.ref,rec.alt,rec.tumor.bcount,rec.tumor.gq,rec.normal.gq]'
|
353
|
+
bio-vcf --filter 'rec.tumor.gq>30'
|
354
|
+
--eval '[rec.ref,rec.alt,rec.tumor.bcount,rec.tumor.gq,rec.normal.gq]'
|
357
355
|
< file.vcf
|
358
356
|
```
|
359
357
|
|
@@ -367,26 +365,26 @@ Show the count of the bases that were scored as somatic
|
|
367
365
|
Actually, we have a convenience implementation for bcount, so this is the same
|
368
366
|
|
369
367
|
```ruby
|
370
|
-
bio-vcf --eval 'rec.alt+"\t"+rec.tumor.bcount[rec.alt].to_s+"\t"+rec.tumor.gq.to_s'
|
368
|
+
bio-vcf --eval 'rec.alt+"\t"+rec.tumor.bcount[rec.alt].to_s+"\t"+rec.tumor.gq.to_s'
|
371
369
|
< file.vcf
|
372
370
|
```
|
373
371
|
|
374
372
|
Filter on the somatic results that were scored at least 4 times
|
375
|
-
|
373
|
+
|
376
374
|
```ruby
|
377
|
-
bio-vcf --filter 'rec.alt.size==1 and rec.tumor.bcount[rec.alt]>4' < test.vcf
|
375
|
+
bio-vcf --filter 'rec.alt.size==1 and rec.tumor.bcount[rec.alt]>4' < test.vcf
|
378
376
|
```
|
379
377
|
|
380
378
|
Similar for base quality scores
|
381
379
|
|
382
380
|
```ruby
|
383
|
-
bio-vcf --filter 'rec.alt.size==1 and rec.tumor.amq[rec.alt]>30' < test.vcf
|
381
|
+
bio-vcf --filter 'rec.alt.size==1 and rec.tumor.amq[rec.alt]>30' < test.vcf
|
384
382
|
```
|
385
383
|
|
386
384
|
Filter out on sample values
|
387
385
|
|
388
386
|
```ruby
|
389
|
-
bio-vcf --sfilter 's.dp>20' < test.vcf
|
387
|
+
bio-vcf --sfilter 's.dp>20' < test.vcf
|
390
388
|
```
|
391
389
|
|
392
390
|
To filter missing on samples:
|
@@ -468,17 +466,17 @@ Even shorter r is an alias for rec
|
|
468
466
|
Note: special functions are not yet implemented! Look below
|
469
467
|
for genotype processing which has indexing in 'gti'.
|
470
468
|
|
471
|
-
Sometime you want to use a special function in a filter. For
|
472
|
-
example percentage variant reads can be defined as [a,c,g,t]
|
473
|
-
with frequencies against sample read depth (dp) as
|
474
|
-
[0,0.03,0.47,0.50]. Filtering would with a special function,
|
469
|
+
Sometime you want to use a special function in a filter. For
|
470
|
+
example percentage variant reads can be defined as [a,c,g,t]
|
471
|
+
with frequencies against sample read depth (dp) as
|
472
|
+
[0,0.03,0.47,0.50]. Filtering would with a special function,
|
475
473
|
which we named freq
|
476
474
|
|
477
475
|
```sh
|
478
476
|
bio-vcf --sfilter "s.freq(2)>0.30" < file.vcf
|
479
477
|
```
|
480
478
|
|
481
|
-
which is equal to
|
479
|
+
which is equal to
|
482
480
|
|
483
481
|
```sh
|
484
482
|
bio-vcf --sfilter "s.freq.g>0.30" < file.vcf
|
@@ -498,7 +496,7 @@ ref should always be identical across samples.
|
|
498
496
|
|
499
497
|
## DbSNP
|
500
498
|
|
501
|
-
One clinical variant DbSNP example
|
499
|
+
One clinical variant DbSNP example
|
502
500
|
|
503
501
|
```sh
|
504
502
|
bio-vcf --eval '[rec.id,rec.chr,rec.pos,rec.alt,rec.info.sao,rec.info.CLNDBN]' < clinvar_20140303.vcf
|
@@ -523,16 +521,16 @@ renders
|
|
523
521
|
|
524
522
|
bio-vcf allows for set analysis. With the complement filter, for
|
525
523
|
example, samples are selected that evaluate to true, all others should
|
526
|
-
evaluate to false. For this we create three filters, one for all
|
524
|
+
evaluate to false. For this we create three filters, one for all
|
527
525
|
samples that are included (the --ifilter or -if), for all samples that
|
528
526
|
are excluded (the --efilter or -ef) and for any sample (the --sfilter
|
529
527
|
or -sf). So i=include (OR filter), e=exclude and s=any sample (AND
|
530
|
-
filter).
|
528
|
+
filter).
|
531
529
|
|
532
530
|
The equivalent of the union filter is by using the --sfilter, so
|
533
531
|
|
534
532
|
```sh
|
535
|
-
bio-vcf --sfilter 's.dp>20'
|
533
|
+
bio-vcf --sfilter 's.dp>20'
|
536
534
|
```
|
537
535
|
|
538
536
|
Filters DP on all samples and is true if all samples match the
|
@@ -540,7 +538,7 @@ criterium (AND). To filter on a subset you can add a
|
|
540
538
|
selector
|
541
539
|
|
542
540
|
```sh
|
543
|
-
bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20'
|
541
|
+
bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20'
|
544
542
|
```
|
545
543
|
|
546
544
|
For set analysis there are the additional ifilter (include) and
|
@@ -560,7 +558,7 @@ values
|
|
560
558
|
|
561
559
|
The equivalent of the complement filter is by specifying what samples
|
562
560
|
to include, here with a regex and define filters on the included
|
563
|
-
and excluded samples (the ones not in ifilter-samples) and the
|
561
|
+
and excluded samples (the ones not in ifilter-samples) and the
|
564
562
|
|
565
563
|
```sh
|
566
564
|
./bin/bio-vcf -i --sfilter 's.dp>20' --ifilter-samples 2,4 --ifilter 's.gt==r.s1t1.gt'
|
@@ -581,7 +579,7 @@ To print out the GT's add --seval
|
|
581
579
|
To set an additional filter on the excluded samples:
|
582
580
|
|
583
581
|
```sh
|
584
|
-
bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gt==rec.s1t1.gt and s.gq>10' --seval s.gq --efilter 's.gq==99'
|
582
|
+
bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gt==rec.s1t1.gt and s.gq>10' --seval s.gq --efilter 's.gq==99'
|
585
583
|
```
|
586
584
|
|
587
585
|
Etc. etc. Any combination of sfilter, ifilter and efilter is possible.
|
@@ -594,15 +592,15 @@ In the near future it is also possible to select samples on a regex (here
|
|
594
592
|
select all samples where the name starts with s3)
|
595
593
|
|
596
594
|
```sh
|
597
|
-
bio-vcf --isample-regex '/^s3/' --ifilter 's.dp>20'
|
595
|
+
bio-vcf --isample-regex '/^s3/' --ifilter 's.dp>20'
|
598
596
|
```
|
599
597
|
|
600
598
|
```sh
|
601
|
-
bio-vcf --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt' --efilter 'gt!=s3t1.gt'
|
599
|
+
bio-vcf --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt' --efilter 'gt!=s3t1.gt'
|
602
600
|
--set-intersect include=true
|
603
|
-
bio-vcf --include /s3.+/ --sample-regex /^t2/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt'
|
601
|
+
bio-vcf --include /s3.+/ --sample-regex /^t2/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt'
|
604
602
|
--set-catesian one in include=true, rest=false
|
605
|
-
bio-vcf --unique-sample (any) --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt!="0/0"'
|
603
|
+
bio-vcf --unique-sample (any) --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt!="0/0"'
|
606
604
|
```
|
607
605
|
|
608
606
|
With the filter commands you can use --ignore-missing to skip errors.
|
@@ -625,7 +623,7 @@ results in a string value
|
|
625
623
|
to access components of the genotype field we can use standard Ruby
|
626
624
|
|
627
625
|
```ruby
|
628
|
-
bio-vcf --seval 's.gt.split(/\//)[0]'
|
626
|
+
bio-vcf --seval 's.gt.split(/\//)[0]'
|
629
627
|
1 10665 . . 0 0 . 0 0
|
630
628
|
1 10694 . . 1 1 . . .
|
631
629
|
1 12783 0 0 0 0 0 0 0
|
@@ -636,7 +634,7 @@ or special functions, such as 'gti' which gives the genotype as an
|
|
636
634
|
indexed value array
|
637
635
|
|
638
636
|
```ruby
|
639
|
-
bio-vcf --seval 's.gti[0]'
|
637
|
+
bio-vcf --seval 's.gti[0]'
|
640
638
|
1 10665 0 0 0 0
|
641
639
|
1 10694 1 1
|
642
640
|
1 12783 0 0 0 0 0 0 0
|
@@ -646,7 +644,7 @@ indexed value array
|
|
646
644
|
and 'gts' as a nucleotide string array
|
647
645
|
|
648
646
|
```ruby
|
649
|
-
bio-vcf --seval 's.gts'
|
647
|
+
bio-vcf --seval 's.gts'
|
650
648
|
1 10665 C C C C
|
651
649
|
1 10694 G G
|
652
650
|
1 12783 G G G G G G G
|
@@ -670,9 +668,9 @@ example signficance, use
|
|
670
668
|
Now you can index other fields, e.g. GL
|
671
669
|
|
672
670
|
```ruby
|
673
|
-
./bin/bio-vcf --seval '[(!s.empty? ? s.gl[s.gtindex]:-1)]'
|
671
|
+
./bin/bio-vcf --seval '[(!s.empty? ? s.gl[s.gtindex]:-1)]'
|
674
672
|
1 900057 1.0 1.0 0.994 1.0 1.0 -1 0.999 1.0 0.997 -1 0.994 0.989 -1 0.991 -1 0.972 0.992 1.0
|
675
|
-
|
673
|
+
```
|
676
674
|
|
677
675
|
shows a number of SNPs have been scored with high significance and a
|
678
676
|
number are missing, here marked as -1.
|
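The guard in that `--seval` expression is a plain Ruby ternary: index into the GL values when the sample has data, otherwise return -1. A minimal sketch of the same pattern follows. `Sample` is a hypothetical stand-in, and treating `gtindex` as an integer position into `gl` is an assumption made for illustration rather than a statement about bio-vcf's sample class.

```ruby
# Hypothetical stand-in for one sample column; not bio-vcf's sample class.
# Assumption: gtindex is an integer position into the gl array.
Sample = Struct.new(:gl, :gtindex) do
  # Treat a sample without GL values as missing.
  def empty?
    gl.nil?
  end
end

# Same shape as the README expression: value at gtindex, or -1 when missing.
def gl_or_missing(s)
  (!s.empty? ? s.gl[s.gtindex] : -1)
end
```

A populated sample such as `Sample.new([0.0, 0.006, 0.994], 2)` yields its GL value at the chosen index, and `Sample.new(nil, nil)` yields the -1 marker seen in the output row.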
@@ -741,6 +739,17 @@ To remove/select 3 samples:
|
|
741
739
|
bio-vcf --samples 0,1,3 < mytest.vcf
|
742
740
|
```
|
743
741
|
|
742
|
+
You can also select samples by name (as long as they do not contain
|
743
|
+
spaces)
|
744
|
+
|
745
|
+
|
746
|
+
```sh
|
747
|
+
bio-vcf --names < mytest.vcf
|
748
|
+
Original s1t1 s2t1 s3t1 s1t2 s2t2 s3t2
|
749
|
+
bio-vcf --samples "Original,s1t1,s3t1" < mytest.vcf
|
750
|
+
```
|
751
|
+
|
752
|
+
|
744
753
|
Filter on a BED file and annotate the gene name in the resulting VCF
|
745
754
|
|
746
755
|
```sh
|
@@ -791,7 +800,7 @@ To have more output options bio-vcf can use an [ERB
|
|
791
800
|
template](http://www.stuartellis.eu/articles/erb/) for every match. This is a
|
792
801
|
very flexible option that can output textual formats such as JSON, YAML, HTML
|
793
802
|
and RDF. Examples are provided in
|
794
|
-
[./templates](https://github.com/
|
803
|
+
[./templates](https://github.com/vcflib/bio-vcf/templates/). A JSON
|
795
804
|
template could be
|
796
805
|
|
797
806
|
```Javascript
|
@@ -805,7 +814,7 @@ template could be
|
|
805
814
|
};
|
806
815
|
```
|
807
816
|
|
808
|
-
To get JSON, run with something like (combining
|
817
|
+
To get JSON, run with something like (combining
|
809
818
|
with a filter)
|
810
819
|
|
811
820
|
```sh
|
@@ -831,11 +840,11 @@ Likewise for RDF output:
|
|
831
840
|
bio-vcf --template template/vcf2rdf.erb --filter 'r.info.sao==1' < dbsnp.vcf
|
832
841
|
```
|
833
842
|
|
834
|
-
renders the ERB template
|
843
|
+
renders the ERB template
|
835
844
|
|
836
845
|
```ruby
|
837
846
|
<%
|
838
|
-
id = Turtle::mangle_identifier(['ch'+rec.chrom,rec.pos,rec.alt.join('')].join('_'))
|
847
|
+
id = Turtle::mangle_identifier(['ch'+rec.chrom,rec.pos,rec.alt.join('')].join('_'))
|
839
848
|
%>
|
840
849
|
:<%= id %>
|
841
850
|
:query_id "<%= id %>",
|
@@ -848,7 +857,7 @@ renders the ERB template
|
|
848
857
|
db:vcf true .
|
849
858
|
```
|
850
859
|
|
851
|
-
into
|
860
|
+
into
|
852
861
|
|
853
862
|
```
|
854
863
|
:ch13_33703698_A
|
@@ -936,9 +945,9 @@ To get and put the full information from the header, simple use
|
|
936
945
|
vcf.meta.to_json. See ./template/vcf2json_full_header.erb for an
|
937
946
|
example. This meta information can also be used to output info fields
|
938
947
|
and sample values on the fly! For an example, see the template at
|
939
|
-
[./template/vcf2json_use_meta.erb](https://github.com/
|
948
|
+
[./template/vcf2json_use_meta.erb](https://github.com/vcflib/bio-vcf/tree/master/template/vcf2json_use_meta.erb)
|
940
949
|
and the generated output at
|
941
|
-
[./test/data/regression/vcf2json_use_meta.ref](https://github.com/
|
950
|
+
[./test/data/regression/vcf2json_use_meta.ref](https://github.com/vcflib/bio-vcf/tree/master/test/data/regression/vcf2json_use_meta.ref).
|
942
951
|
|
943
952
|
This way, it is possible to write templates that can convert the content of
|
944
953
|
*any* VCF file without prior knowledge to JSON, RDF, etc.
|
@@ -955,7 +964,7 @@ Simple statistics are available for REF>ALT changes:
|
|
955
964
|
G>A 59 45%
|
956
965
|
C>T 30 23%
|
957
966
|
A>G 5 4%
|
958
|
-
C>G 5 4%
|
967
|
+
C>G 5 4%
|
959
968
|
C>A 5 4%
|
960
969
|
G>T 4 3%
|
961
970
|
T>C 4 3%
|
@@ -976,9 +985,9 @@ Simple statistics are available for REF>ALT changes:
|
|
976
985
|
## Other examples
|
977
986
|
|
978
987
|
For more exercises and examples see
|
979
|
-
[doc](https://github.com/
|
988
|
+
[doc](https://github.com/vcflib/bio-vcf/tree/master/doc) directory
|
980
989
|
and the the feature
|
981
|
-
[section](https://github.com/
|
990
|
+
[section](https://github.com/vcflib/bio-vcf/tree/master/features).
|
982
991
|
|
983
992
|
## API
|
984
993
|
|
@@ -1009,6 +1018,23 @@ what the command line interface uses (see ./bin/bio-vcf)
|
|
1009
1018
|
end
|
1010
1019
|
```
|
1011
1020
|
|
1021
|
+
### VCFFile
|
1022
|
+
|
1023
|
+
The class ```BioVcf::VCFfile``` wraps a file and provides an ```enum``` with the
|
1024
|
+
method each, that can be used as in iterator.
|
1025
|
+
|
1026
|
+
```ruby
|
1027
|
+
vcf_file = "dbsnp.vcf"
|
1028
|
+
vcf = BioVcf::VCFfile.new(file:file, is_gz: false )
|
1029
|
+
it vcf.each
|
1030
|
+
puts it.peek
|
1031
|
+
|
1032
|
+
vcf_file = "dbsnp.vcf.gz"
|
1033
|
+
vcf = BioVcf::VCFfile.new(file:file, is_gz: true )
|
1034
|
+
it vcf.each
|
1035
|
+
puts it.peek
|
1036
|
+
```
|
1037
|
+
|
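The snippet above reads as if `it = vcf.each` was intended, with `vcf_file` passed as the `file:` argument. The enumerator pattern it describes can be sketched without bio-vcf at all. `LineSource` below is a hypothetical stand-in, not the `BioVcf::VCFfile` implementation, and skipping `#` header lines is an assumption of this sketch only.

```ruby
require "stringio"

# Hypothetical stand-in for a file wrapper; not the BioVcf::VCFfile implementation.
class LineSource
  def initialize(io)
    @io = io
  end

  # Yield each record line, or return an Enumerator when no block is given.
  # Assumption of this sketch: lines starting with '#' are header lines and skipped.
  def each
    return enum_for(:each) unless block_given?

    @io.rewind
    @io.each_line do |line|
      next if line.start_with?("#")

      yield line.chomp
    end
  end
end

# Tiny in-memory example input so the sketch needs no file on disk.
SAMPLE_VCF = "##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t10665\n1\t10694\n"
```

Calling `each` without a block returns an `Enumerator`, so `records = LineSource.new(StringIO.new(SAMPLE_VCF)).each` followed by `records.peek` looks at the first record without consuming it, and `records.next` then advances.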
1012
1038
|
## Trouble shooting
|
1013
1039
|
|
1014
1040
|
### MRI supports threading
|
@@ -1037,7 +1063,7 @@ For more complex filters use lambda inside a conditional
|
|
1037
1063
|
```ruby
|
1038
1064
|
( fast_check ? lambda { slow_check }.call : false )
|
1039
1065
|
```
|
1040
|
-
|
1066
|
+
|
1041
1067
|
where slow_check is the slow section of your query. As is shown
|
1042
1068
|
earlier in this document. Don't forget the .call!
|
1043
1069
|
|
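The conditional form above only runs the expensive part when the cheap test passes. The sketch below shows that behaviour in plain Ruby; `slow_check`, `keep?`, and the thresholds are hypothetical placeholders, and the lambda wrapper is kept only to mirror the README's form.

```ruby
# Records which values reached the slow path, so the laziness is observable.
SLOW_CALLS = []

# Placeholder for an expensive test (hypothetical; any costly predicate fits here).
def slow_check(value)
  SLOW_CALLS << value
  (value % 7).zero?
end

# Cheap test first; the lambda body runs only when fast_check is true.
def keep?(value)
  fast_check = value > 100
  (fast_check ? lambda { slow_check(value) }.call : false)
end
```

Here `keep?(14)` returns false without ever calling `slow_check`, while `keep?(140)` passes the cheap test and then runs the slow one. Leaving off `.call` would return the lambda object itself, which is always truthy, so the filter would accept every record that passes the fast check.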
@@ -1056,6 +1082,15 @@ For larger files set the timeout to 600, or so. --timeout 600.
|
|
1056
1082
|
|
1057
1083
|
Different values may show different core use on a machine.
|
1058
1084
|
|
1085
|
+
### Development
|
1086
|
+
|
1087
|
+
To run the tests from source
|
1088
|
+
|
1089
|
+
```sh
|
1090
|
+
bundle install --path vendor/bundle
|
1091
|
+
bundle exec rake
|
1092
|
+
```
|
1093
|
+
|
1059
1094
|
### Debugging
|
1060
1095
|
|
1061
1096
|
To debug output use '-v --num-threads=1' for generating useful
|
@@ -1073,12 +1108,12 @@ temporary directory may remain.
|
|
1073
1108
|
Information on the source tree, documentation, examples, issues and
|
1074
1109
|
how to contribute, see
|
1075
1110
|
|
1076
|
-
http://github.com/
|
1111
|
+
http://github.com/vcflib/bio-vcf
|
1077
1112
|
|
1078
1113
|
## Cite
|
1079
1114
|
|
1080
1115
|
If you use this software, please cite one of
|
1081
|
-
|
1116
|
+
|
1082
1117
|
* [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
|
1083
1118
|
* [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
|
1084
1119
|
|
@@ -1088,5 +1123,4 @@ This Biogem is published at (http://biogems.info/index.html#bio-vcf)
|
|
1088
1123
|
|
1089
1124
|
## Copyright
|
1090
1125
|
|
1091
|
-
Copyright (c) 2014 Pjotr Prins. See LICENSE.txt for further details.
|
1092
|
-
|
1126
|
+
Copyright (c) 2014-2020 Pjotr Prins. See LICENSE.txt for further details.
|