bio-vcf 0.9.2 → 0.9.4

Files changed (45)
  1. checksums.yaml +5 -5
  2. data/.travis.yml +1 -21
  3. data/LICENSE.txt +1 -1
  4. data/README.md +107 -73
  5. data/RELEASE_NOTES.md +20 -0
  6. data/RELEASE_NOTES.md~ +11 -0
  7. data/VERSION +1 -1
  8. data/bin/bio-vcf +49 -30
  9. data/bio-vcf.gemspec +1 -1
  10. data/features/cli.feature +4 -1
  11. data/features/diff_count.feature +0 -1
  12. data/features/step_definitions/cli-feature.rb +13 -9
  13. data/features/step_definitions/diff_count.rb +1 -1
  14. data/features/step_definitions/somaticsniper.rb +1 -1
  15. data/lib/bio-vcf/pcows.rb +31 -25
  16. data/lib/bio-vcf/vcffile.rb +46 -0
  17. data/lib/bio-vcf/vcfgenotypefield.rb +20 -20
  18. data/lib/bio-vcf/vcfheader.rb +29 -0
  19. data/lib/bio-vcf/vcfrecord.rb +5 -3
  20. data/lib/bio-vcf/vcfsample.rb +3 -1
  21. data/test/data/input/empty.vcf +2 -0
  22. data/test/data/regression/empty-stderr.new +12 -0
  23. data/test/data/regression/empty.new +2 -0
  24. data/test/data/regression/empty.ref +2 -0
  25. data/test/data/regression/eval_once-stderr.new +2 -2
  26. data/test/data/regression/eval_r.info.dp-stderr.new +9 -7
  27. data/test/data/regression/ifilter_s.dp-stderr.new +9 -7
  28. data/test/data/regression/pass1-stderr.new +9 -7
  29. data/test/data/regression/r.info.dp-stderr.new +4 -8
  30. data/test/data/regression/r.info.dp.new +0 -33
  31. data/test/data/regression/rewrite.info.sample-stderr.new +9 -7
  32. data/test/data/regression/s.dp-stderr.new +9 -7
  33. data/test/data/regression/seval_s.dp-stderr.new +9 -7
  34. data/test/data/regression/sfilter_seval_s.dp-stderr.new +9 -7
  35. data/test/data/regression/thread4-stderr.new +9 -7
  36. data/test/data/regression/thread4_4-stderr.new +25 -44
  37. data/test/data/regression/thread4_4.new +0 -20
  38. data/test/data/regression/thread4_4_failed_filter-stderr.new +1 -1
  39. data/test/data/regression/thread4_4_failed_filter-stderr.ref +1 -1
  40. data/test/data/regression/vcf2json_full_header-stderr.new +9 -7
  41. data/test/data/regression/vcf2json_use_meta-stderr.new +9 -7
  42. metadata +11 -7
  43. data/features/#cli.feature# +0 -71
  44. data/features/filter.feature~ +0 -35
  45. data/test/stress/stress_test.sh~ +0 -8
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
- SHA1:
3
- metadata.gz: a09729e3548751923f4b3c5ef81c8c9d7402b6b2
4
- data.tar.gz: 4c525ad745c5486075e9a0f14fe5372a21c8f056
2
+ SHA256:
3
+ metadata.gz: f5d7a81871906abfffc93455b4d664d5755fe8d79312134eae94e84659506198
4
+ data.tar.gz: 8029269859aedd53c613ea9bbb17f951972b062060b5a40c22bdbe65c6c3dfa7
5
5
  SHA512:
6
- metadata.gz: 343083ee8c055f534a840c8f668cb35a0c33fbccabe2b580edf859747aff8c8069266168ac66631bc3bbd2c8f58691847796eb00cef7784c7ebf966ec85e1d4f
7
- data.tar.gz: f55292d0d744a496a5b39123285904a120f4c4ffae066dc33f244e09ae021618e94071cf8e587f56debfaf0c54233c3a5688887e21b21f5640bc8b271a3a00bb
6
+ metadata.gz: ed231c3a918e5f9ab9cd8a618f3f25f0c39613ac934b496af334d77dabe64831ff08cfc722a467fc51ab8c583358ca21be769ba1d9654437d54e7d21b811ee2c
7
+ data.tar.gz: df49786c4f4aa5e3a3659c678fb66aeb4b7dd4bb575aacf34cc468663c18fa893502699d238b5034c308ff51a4dc05e0fadf929b923c1d646f61c3f07fef26c7
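The checksums.yaml change above replaces the SHA1 entries with SHA256 and keeps SHA512. As a rough sketch of how such digests could be recomputed for comparison, assume `metadata.gz` and `data.tar.gz` have already been extracted from the `.gem` archive into one directory. The helper name `gem_checksums` and the directory layout are illustrative and not part of RubyGems.

```ruby
require "digest"
require "digest/sha2"

# Hypothetical helper: compute SHA256 and SHA512 hex digests for the two
# archives that checksums.yaml covers. `dir` is assumed to contain the
# files extracted from the .gem archive.
def gem_checksums(dir)
  files = %w[metadata.gz data.tar.gz]
  {
    "SHA256" => files.map { |f| [f, Digest::SHA256.file(File.join(dir, f)).hexdigest] }.to_h,
    "SHA512" => files.map { |f| [f, Digest::SHA512.file(File.join(dir, f)).hexdigest] }.to_h
  }
end
```

The returned hash has the same two-level shape as the YAML shown in the diff, so each value can be compared directly against the published entry.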
data/.travis.yml CHANGED
@@ -1,23 +1,3 @@
1
- sudo: false # required for the new containers
2
-
3
1
  language: ruby
4
- rvm:
5
- # - 1.9.3 <- No longer working
6
- - 2.1.0
7
- - 2.2.3
8
-
9
- # install:
10
- # - gem install cucumber rspec regressiontest
11
-
12
- branches:
13
- only:
14
- - master
15
-
16
- # - jruby-head
17
- # - jruby-19mode # JRuby in 1.9 mode
18
- # - 1.8.7
19
- # - jruby-18mode # JRuby in 1.8 mode
20
- # - rbx-18mode
21
2
 
22
- # uncomment this line if your project needs to run something other than `rake`:
23
- # script: bundle exec rspec spec
3
+ arch: arm64
data/LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
1
- Copyright (c) 2013 Pjotr Prins
1
+ Copyright (c) 2013-2020 Pjotr Prins <pjotr.public68@thebird.nl>
2
2
 
3
3
  Permission is hereby granted, free of charge, to any person obtaining
4
4
  a copy of this software and associated documentation files (the
data/README.md CHANGED
@@ -1,23 +1,15 @@
1
1
  # bio-vcf
2
2
 
3
- [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf)
3
+ [![Build Status](https://secure.travis-ci.org/vcflib/bio-vcf.png)](http://travis-ci.org/vcflib/bio-vcf)
4
4
 
5
- ## Updates
6
-
7
- * Getting ready for a 1.0 release
8
- * 0.9.1 removed a rare threading bug and cleanup on error
9
- * Added support for soft filters (request by Brad Chapman)
10
- * The outputter now writes (properly) in parallel with the parser
11
- * bio-vcf turns any VCF into JSON with header information, and
12
- allows you to pipe that JSON directly into any JSON supporting
13
- language, including Python and Javascript!
14
5
 
15
6
  ## Bio-vcf
16
7
 
17
- Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf is not only
18
- very fast for genome-wide (WGS) data, it also comes with a really nice
19
- filtering, evaluation and rewrite language and it can output any type
20
- of textual data, including VCF header and contents in RDF and JSON.
8
+ Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf
9
+ is not only very fast for genome-wide (WGS) data, it also comes with a
10
+ really nice filtering, evaluation and rewrite language and it can
11
+ output any type of textual data, including VCF header and contents in
12
+ RDF and JSON.
21
13
 
22
14
  So, why would you use bio-vcf over other parsers? Because
23
15
 
@@ -79,18 +71,18 @@ BED format on a 16 core machine takes
79
71
  sys 0m5.039s
80
72
  ```
81
73
 
82
- which shows decent core utilisation (10x). Running
74
+ which shows decent core utilisation (10x). Running
83
75
  gzip compressed VCF files of 30+ Gb has similar performance gains.
84
76
 
85
77
  To view some complex filters on an 80Gb SNP file check out a
86
- [GTEx exercise](https://github.com/pjotrp/bioruby-vcf/blob/master/doc/GTEx_reduce.md).
78
+ [GTEx exercise](https://github.com/vcflib/bio-vcf/blob/master/doc/GTEx_reduce.md).
87
79
 
88
80
  Use zcat (or even better pigz which is multi-core itself) to pipe such
89
81
  gzipped (vcf.gz) files into bio-vcf, e.g.
90
82
 
91
83
  ```sh
92
84
  zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50'
93
- --sfilter '!s.empty? and s.dp>20'
85
+ --sfilter '!s.empty? and s.dp>20'
94
86
  --eval '[r.chrom,r.pos,r.pos+1]' > test.bed
95
87
  ```
96
88
 
@@ -124,7 +116,7 @@ Where 's.dp' is the shorter name for 'sample.dp'.
124
116
 
125
117
  It is also possible to specify sample names, or info fields:
126
118
 
127
- For example, to filter somatic data
119
+ For example, to filter somatic data
128
120
 
129
121
  ```ruby
130
122
  bio-vcf --filter 'rec.info.dp>5 and rec.alt.size==1 and rec.tumor.bq[rec.alt]>30 and rec.tumor.mq>20' < file.vcf
@@ -252,7 +244,7 @@ The VCF format is commonly used for variant calling between NGS
252
244
  samples. The fast parser needs to carry some state, recorded for each
253
245
  file in VcfHeader, which contains the VCF file header. Individual
254
246
  lines (variant calls) first go through a raw parser returning an array
255
- of fields. Further (lazy) parsing is handled through VcfRecord.
247
+ of fields. Further (lazy) parsing is handled through VcfRecord.
256
248
 
257
249
  At this point the filter is pretty generic with multi-sample support.
258
250
  If something is not working, check out the feature descriptions and
@@ -261,17 +253,16 @@ example of a VCF statement you need to work on.
261
253
 
262
254
  ## Installation
263
255
 
264
- Note that you need Ruby 2.x or later. The 2.x Ruby series also give
265
- a performance improvement. Bio-vcf will show the Ruby version when
266
- typing the command 'bio-vcf -h'.
256
+ The bio-vcf has no other dependencies but Ruby.
267
257
 
268
- To intall bio-vcf with gem:
258
+ To install bio-vcf with Ruby gems:
269
259
 
270
260
  ```sh
271
261
  gem install bio-vcf
272
262
  bio-vcf -h
273
263
  ```
274
264
 
265
+
275
266
  ## Command line interface (CLI)
276
267
 
277
268
  Get the version of the VCF file
@@ -295,6 +286,13 @@ Get the sample names
295
286
  NORMAL,TUMOR
296
287
  ```
297
288
 
289
+ Alternatively use the command line switch for --names, e.g.
290
+
291
+ ```ruby
292
+ bio-vcf --names < file.vcf
293
+ NORMAL,TUMOR
294
+ ```
295
+
298
296
  Get information from the header (META)
299
297
 
300
298
  ```ruby
@@ -305,39 +303,39 @@ The 'fields' array contains unprocessed data (strings). Print first
305
303
  five raw fields
306
304
 
307
305
  ```ruby
308
- bio-vcf --eval 'fields[0..4]' < file.vcf
306
+ bio-vcf --eval 'fields[0..4]' < file.vcf
309
307
  ```
310
308
 
311
309
  Add a filter to display the fields on chromosome 12
312
310
 
313
311
  ```ruby
314
- bio-vcf --filter 'fields[0]=="12"' --eval 'fields[0..4]' < file.vcf
312
+ bio-vcf --filter 'fields[0]=="12"' --eval 'fields[0..4]' < file.vcf
315
313
  ```
316
314
 
317
315
  It gets better when we start using processed data, represented by an
318
316
  object named 'rec'. Position is a value, so we can filter a range
319
317
 
320
318
  ```ruby
321
- bio-vcf --filter 'rec.chrom=="12" and rec.pos>96_641_270 and rec.pos<96_641_276' < file.vcf
319
+ bio-vcf --filter 'rec.chrom=="12" and rec.pos>96_641_270 and rec.pos<96_641_276' < file.vcf
322
320
  ```
323
321
 
324
322
  The shorter name for 'rec.chrom' is 'r.chrom', so you may write
325
323
 
326
324
  ```ruby
327
- bio-vcf --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
325
+ bio-vcf --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
328
326
  ```
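To make the difference between the raw `fields` array and the processed `rec` values concrete, here is a minimal stand-alone sketch in plain Ruby. It does not use bio-vcf's own parser. The helper names `raw_fields` and `in_region?` and the sample record are invented for illustration.

```ruby
# Split one VCF data line into its tab-separated columns. Every value is
# still a String at this point, like the 'fields' array in the README.
def raw_fields(line)
  line.chomp.split("\t")
end

# Hypothetical stand-in for a filter such as
#   r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276
# CHROM stays a String; POS is converted to an Integer before comparing.
def in_region?(fields, chrom, from, to)
  pos = Integer(fields[1])
  fields[0] == chrom && pos > from && pos < to
end

line = "12\t96641273\t.\tG\tA\t50\tPASS\tDP=120\n" # invented example record
fields = raw_fields(line)
p fields[0..4]                                      # first five raw fields
p in_region?(fields, "12", 96_641_270, 96_641_276)  # range check on POS
```

The point of the sketch is only that a comparison like `pos > 96_641_270` needs a numeric value, which is why filtering on the processed record is more convenient than filtering on raw strings.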
329
327
 
330
328
  To ignore and continue parsing on missing data use the
331
329
  --ignore-missing (-i) and or --quiet (-q) switches
332
330
 
333
331
  ```ruby
334
- bio-vcf -i --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
332
+ bio-vcf -i --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
335
333
  ```
336
334
 
337
335
  Info fields are referenced by
338
336
 
339
337
  ```ruby
340
- bio-vcf --filter 'rec.info.dp>100 and rec.info.readposranksum<=0.815' < file.vcf
338
+ bio-vcf --filter 'rec.info.dp>100 and rec.info.readposranksum<=0.815' < file.vcf
341
339
  ```
342
340
 
343
341
  (alternatively you can use the indexed rec.info['DP'] and list INFO fields with
@@ -346,14 +344,14 @@ rec.info.fields).
346
344
  Subfields defined by rec.format:
347
345
 
348
346
  ```ruby
349
- bio-vcf --filter 'rec.tumor.ss != 2' < file.vcf
347
+ bio-vcf --filter 'rec.tumor.ss != 2' < file.vcf
350
348
  ```
351
349
 
352
350
  Output
353
351
 
354
352
  ```ruby
355
- bio-vcf --filter 'rec.tumor.gq>30'
356
- --eval '[rec.ref,rec.alt,rec.tumor.bcount,rec.tumor.gq,rec.normal.gq]'
353
+ bio-vcf --filter 'rec.tumor.gq>30'
354
+ --eval '[rec.ref,rec.alt,rec.tumor.bcount,rec.tumor.gq,rec.normal.gq]'
357
355
  < file.vcf
358
356
  ```
359
357
 
@@ -367,26 +365,26 @@ Show the count of the bases that were scored as somatic
367
365
  Actually, we have a convenience implementation for bcount, so this is the same
368
366
 
369
367
  ```ruby
370
- bio-vcf --eval 'rec.alt+"\t"+rec.tumor.bcount[rec.alt].to_s+"\t"+rec.tumor.gq.to_s'
368
+ bio-vcf --eval 'rec.alt+"\t"+rec.tumor.bcount[rec.alt].to_s+"\t"+rec.tumor.gq.to_s'
371
369
  < file.vcf
372
370
  ```
373
371
 
374
372
  Filter on the somatic results that were scored at least 4 times
375
-
373
+
376
374
  ```ruby
377
- bio-vcf --filter 'rec.alt.size==1 and rec.tumor.bcount[rec.alt]>4' < test.vcf
375
+ bio-vcf --filter 'rec.alt.size==1 and rec.tumor.bcount[rec.alt]>4' < test.vcf
378
376
  ```
379
377
 
380
378
  Similar for base quality scores
381
379
 
382
380
  ```ruby
383
- bio-vcf --filter 'rec.alt.size==1 and rec.tumor.amq[rec.alt]>30' < test.vcf
381
+ bio-vcf --filter 'rec.alt.size==1 and rec.tumor.amq[rec.alt]>30' < test.vcf
384
382
  ```
385
383
 
386
384
  Filter out on sample values
387
385
 
388
386
  ```ruby
389
- bio-vcf --sfilter 's.dp>20' < test.vcf
387
+ bio-vcf --sfilter 's.dp>20' < test.vcf
390
388
  ```
391
389
 
392
390
  To filter missing on samples:
@@ -468,17 +466,17 @@ Even shorter r is an alias for rec
468
466
  Note: special functions are not yet implemented! Look below
469
467
  for genotype processing which has indexing in 'gti'.
470
468
 
471
- Sometime you want to use a special function in a filter. For
472
- example percentage variant reads can be defined as [a,c,g,t]
473
- with frequencies against sample read depth (dp) as
474
- [0,0.03,0.47,0.50]. Filtering would with a special function,
469
+ Sometime you want to use a special function in a filter. For
470
+ example percentage variant reads can be defined as [a,c,g,t]
471
+ with frequencies against sample read depth (dp) as
472
+ [0,0.03,0.47,0.50]. Filtering would with a special function,
475
473
  which we named freq
476
474
 
477
475
  ```sh
478
476
  bio-vcf --sfilter "s.freq(2)>0.30" < file.vcf
479
477
  ```
480
478
 
481
- which is equal to
479
+ which is equal to
482
480
 
483
481
  ```sh
484
482
  bio-vcf --sfilter "s.freq.g>0.30" < file.vcf
@@ -498,7 +496,7 @@ ref should always be identical across samples.
498
496
 
499
497
  ## DbSNP
500
498
 
501
- One clinical variant DbSNP example
499
+ One clinical variant DbSNP example
502
500
 
503
501
  ```sh
504
502
  bio-vcf --eval '[rec.id,rec.chr,rec.pos,rec.alt,rec.info.sao,rec.info.CLNDBN]' < clinvar_20140303.vcf
@@ -523,16 +521,16 @@ renders
523
521
 
524
522
  bio-vcf allows for set analysis. With the complement filter, for
525
523
  example, samples are selected that evaluate to true, all others should
526
- evaluate to false. For this we create three filters, one for all
524
+ evaluate to false. For this we create three filters, one for all
527
525
  samples that are included (the --ifilter or -if), for all samples that
528
526
  are excluded (the --efilter or -ef) and for any sample (the --sfilter
529
527
  or -sf). So i=include (OR filter), e=exclude and s=any sample (AND
530
- filter).
528
+ filter).
531
529
 
532
530
  The equivalent of the union filter is by using the --sfilter, so
533
531
 
534
532
  ```sh
535
- bio-vcf --sfilter 's.dp>20'
533
+ bio-vcf --sfilter 's.dp>20'
536
534
  ```
537
535
 
538
536
  Filters DP on all samples and is true if all samples match the
@@ -540,7 +538,7 @@ criterium (AND). To filter on a subset you can add a
540
538
  selector
541
539
 
542
540
  ```sh
543
- bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20'
541
+ bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20'
544
542
  ```
545
543
 
546
544
  For set analysis there are the additional ifilter (include) and
@@ -560,7 +558,7 @@ values
560
558
 
561
559
  The equivalent of the complement filter is by specifying what samples
562
560
  to include, here with a regex and define filters on the included
563
- and excluded samples (the ones not in ifilter-samples) and the
561
+ and excluded samples (the ones not in ifilter-samples) and the
564
562
 
565
563
  ```sh
566
564
  ./bin/bio-vcf -i --sfilter 's.dp>20' --ifilter-samples 2,4 --ifilter 's.gt==r.s1t1.gt'
@@ -581,7 +579,7 @@ To print out the GT's add --seval
581
579
  To set an additional filter on the excluded samples:
582
580
 
583
581
  ```sh
584
- bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gt==rec.s1t1.gt and s.gq>10' --seval s.gq --efilter 's.gq==99'
582
+ bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gt==rec.s1t1.gt and s.gq>10' --seval s.gq --efilter 's.gq==99'
585
583
  ```
586
584
 
587
585
  Etc. etc. Any combination of sfilter, ifilter and efilter is possible.
@@ -594,15 +592,15 @@ In the near future it is also possible to select samples on a regex (here
594
592
  select all samples where the name starts with s3)
595
593
 
596
594
  ```sh
597
- bio-vcf --isample-regex '/^s3/' --ifilter 's.dp>20'
595
+ bio-vcf --isample-regex '/^s3/' --ifilter 's.dp>20'
598
596
  ```
599
597
 
600
598
  ```sh
601
- bio-vcf --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt' --efilter 'gt!=s3t1.gt'
599
+ bio-vcf --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt' --efilter 'gt!=s3t1.gt'
602
600
  --set-intersect include=true
603
- bio-vcf --include /s3.+/ --sample-regex /^t2/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt'
601
+ bio-vcf --include /s3.+/ --sample-regex /^t2/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt'
604
602
  --set-catesian one in include=true, rest=false
605
- bio-vcf --unique-sample (any) --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt!="0/0"'
603
+ bio-vcf --unique-sample (any) --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt!="0/0"'
606
604
  ```
607
605
 
608
606
  With the filter commands you can use --ignore-missing to skip errors.
@@ -625,7 +623,7 @@ results in a string value
625
623
  to access components of the genotype field we can use standard Ruby
626
624
 
627
625
  ```ruby
628
- bio-vcf --seval 's.gt.split(/\//)[0]'
626
+ bio-vcf --seval 's.gt.split(/\//)[0]'
629
627
  1 10665 . . 0 0 . 0 0
630
628
  1 10694 . . 1 1 . . .
631
629
  1 12783 0 0 0 0 0 0 0
@@ -636,7 +634,7 @@ or special functions, such as 'gti' which gives the genotype as an
636
634
  indexed value array
637
635
 
638
636
  ```ruby
639
- bio-vcf --seval 's.gti[0]'
637
+ bio-vcf --seval 's.gti[0]'
640
638
  1 10665 0 0 0 0
641
639
  1 10694 1 1
642
640
  1 12783 0 0 0 0 0 0 0
@@ -646,7 +644,7 @@ indexed value array
646
644
  and 'gts' as a nucleotide string array
647
645
 
648
646
  ```ruby
649
- bio-vcf --seval 's.gts'
647
+ bio-vcf --seval 's.gts'
650
648
  1 10665 C C C C
651
649
  1 10694 G G
652
650
  1 12783 G G G G G G G
@@ -670,9 +668,9 @@ example signficance, use
670
668
  Now you can index other fields, e.g. GL
671
669
 
672
670
  ```ruby
673
- ./bin/bio-vcf --seval '[(!s.empty? ? s.gl[s.gtindex]:-1)]'
671
+ ./bin/bio-vcf --seval '[(!s.empty? ? s.gl[s.gtindex]:-1)]'
674
672
  1 900057 1.0 1.0 0.994 1.0 1.0 -1 0.999 1.0 0.997 -1 0.994 0.989 -1 0.991 -1 0.972 0.992 1.0
675
- ```
673
+ ```
676
674
 
677
675
  shows a number of SNPs have been scored with high significance and a
678
676
  number are missing, here marked as -1.
@@ -741,6 +739,17 @@ To remove/select 3 samples:
741
739
  bio-vcf --samples 0,1,3 < mytest.vcf
742
740
  ```
743
741
 
742
+ You can also select samples by name (as long as they do not contain
743
+ spaces)
744
+
745
+
746
+ ```sh
747
+ bio-vcf --names < mytest.vcf
748
+ Original s1t1 s2t1 s3t1 s1t2 s2t2 s3t2
749
+ bio-vcf --samples "Original,s1t1,s3t1" < mytest.vcf
750
+ ```
751
+
752
+
744
753
  Filter on a BED file and annotate the gene name in the resulting VCF
745
754
 
746
755
  ```sh
@@ -791,7 +800,7 @@ To have more output options bio-vcf can use an [ERB
791
800
  template](http://www.stuartellis.eu/articles/erb/) for every match. This is a
792
801
  very flexible option that can output textual formats such as JSON, YAML, HTML
793
802
  and RDF. Examples are provided in
794
- [./templates](https://github.com/pjotrp/bioruby-vcf/templates/). A JSON
803
+ [./templates](https://github.com/vcflib/bio-vcf/templates/). A JSON
795
804
  template could be
796
805
 
797
806
  ```Javascript
@@ -805,7 +814,7 @@ template could be
805
814
  };
806
815
  ```
807
816
 
808
- To get JSON, run with something like (combining
817
+ To get JSON, run with something like (combining
809
818
  with a filter)
810
819
 
811
820
  ```sh
@@ -831,11 +840,11 @@ Likewise for RDF output:
831
840
  bio-vcf --template template/vcf2rdf.erb --filter 'r.info.sao==1' < dbsnp.vcf
832
841
  ```
833
842
 
834
- renders the ERB template
843
+ renders the ERB template
835
844
 
836
845
  ```ruby
837
846
  <%
838
- id = Turtle::mangle_identifier(['ch'+rec.chrom,rec.pos,rec.alt.join('')].join('_'))
847
+ id = Turtle::mangle_identifier(['ch'+rec.chrom,rec.pos,rec.alt.join('')].join('_'))
839
848
  %>
840
849
  :<%= id %>
841
850
  :query_id "<%= id %>",
@@ -848,7 +857,7 @@ renders the ERB template
848
857
  db:vcf true .
849
858
  ```
850
859
 
851
- into
860
+ into
852
861
 
853
862
  ```
854
863
  :ch13_33703698_A
@@ -936,9 +945,9 @@ To get and put the full information from the header, simple use
936
945
  vcf.meta.to_json. See ./template/vcf2json_full_header.erb for an
937
946
  example. This meta information can also be used to output info fields
938
947
  and sample values on the fly! For an example, see the template at
939
- [./template/vcf2json_use_meta.erb](https://github.com/pjotrp/bioruby-vcf/tree/master/template/vcf2json_use_meta.erb)
948
+ [./template/vcf2json_use_meta.erb](https://github.com/vcflib/bio-vcf/tree/master/template/vcf2json_use_meta.erb)
940
949
  and the generated output at
941
- [./test/data/regression/vcf2json_use_meta.ref](https://github.com/pjotrp/bioruby-vcf/tree/master/test/data/regression/vcf2json_use_meta.ref).
950
+ [./test/data/regression/vcf2json_use_meta.ref](https://github.com/vcflib/bio-vcf/tree/master/test/data/regression/vcf2json_use_meta.ref).
942
951
 
943
952
  This way, it is possible to write templates that can convert the content of
944
953
  *any* VCF file without prior knowledge to JSON, RDF, etc.
@@ -955,7 +964,7 @@ Simple statistics are available for REF>ALT changes:
955
964
  G>A 59 45%
956
965
  C>T 30 23%
957
966
  A>G 5 4%
958
- C>G 5 4%
967
+ C>G 5 4%
959
968
  C>A 5 4%
960
969
  G>T 4 3%
961
970
  T>C 4 3%
@@ -976,9 +985,9 @@ Simple statistics are available for REF>ALT changes:
976
985
  ## Other examples
977
986
 
978
987
  For more exercises and examples see
979
- [doc](https://github.com/pjotrp/bioruby-vcf/tree/master/doc) directory
988
+ [doc](https://github.com/vcflib/bio-vcf/tree/master/doc) directory
980
989
  and the the feature
981
- [section](https://github.com/pjotrp/bioruby-vcf/tree/master/features).
990
+ [section](https://github.com/vcflib/bio-vcf/tree/master/features).
982
991
 
983
992
  ## API
984
993
 
@@ -1009,6 +1018,23 @@ what the command line interface uses (see ./bin/bio-vcf)
1009
1018
  end
1010
1019
  ```
1011
1020
 
1021
+ ### VCFFile
1022
+
1023
+ The class ```BioVcf::VCFfile``` wraps a file and provides an ```enum``` with the
1024
+ method each, that can be used as in iterator.
1025
+
1026
+ ```ruby
1027
+ vcf_file = "dbsnp.vcf"
1028
+ vcf = BioVcf::VCFfile.new(file:file, is_gz: false )
1029
+ it vcf.each
1030
+ puts it.peek
1031
+
1032
+ vcf_file = "dbsnp.vcf.gz"
1033
+ vcf = BioVcf::VCFfile.new(file:file, is_gz: true )
1034
+ it vcf.each
1035
+ puts it.peek
1036
+ ```
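The added README snippet describes `BioVcf::VCFfile` as a wrapper that exposes an enumerator through `each`. The following stand-alone sketch shows that pattern in plain Ruby. It is not the gem's implementation: the class name `LineSource`, its keyword arguments, and the choice to skip `#` header lines are assumptions made for illustration.

```ruby
require "zlib"

# Hypothetical wrapper around a plain or gzip-compressed VCF file.
class LineSource
  include Enumerable

  def initialize(file:, is_gz: false)
    @file = file
    @is_gz = is_gz
  end

  # Yield each data line split on tabs; header lines (starting with '#')
  # are skipped. Without a block an Enumerator is returned, so callers
  # can use peek and next.
  def each
    return enum_for(:each) unless block_given?

    io = @is_gz ? Zlib::GzipReader.open(@file) : File.open(@file)
    begin
      io.each_line do |line|
        next if line.start_with?("#")
        yield line.chomp.split("\t")
      end
    ensure
      io.close
    end
  end
end
```

Calling `each` without a block and then `peek` on the result returns the first record without consuming it, which is the kind of use the README snippet points at.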
1037
+
1012
1038
  ## Trouble shooting
1013
1039
 
1014
1040
  ### MRI supports threading
@@ -1037,7 +1063,7 @@ For more complex filters use lambda inside a conditional
1037
1063
  ```ruby
1038
1064
  ( fast_check ? lambda { slow_check }.call : false )
1039
1065
  ```
1040
-
1066
+
1041
1067
  where slow_check is the slow section of your query. As is shown
1042
1068
  earlier in this document. Don't forget the .call!
1043
1069
 
@@ -1056,6 +1082,15 @@ For larger files set the timeout to 600, or so. --timeout 600.
1056
1082
 
1057
1083
  Different values may show different core use on a machine.
1058
1084
 
1085
+ ### Development
1086
+
1087
+ To run the tests from source
1088
+
1089
+ ```sh
1090
+ bundle install --path vendor/bundle
1091
+ bundle exec rake
1092
+ ```
1093
+
1059
1094
  ### Debugging
1060
1095
 
1061
1096
  To debug output use '-v --num-threads=1' for generating useful
@@ -1073,12 +1108,12 @@ temporary directory may remain.
1073
1108
  Information on the source tree, documentation, examples, issues and
1074
1109
  how to contribute, see
1075
1110
 
1076
- http://github.com/pjotrp/bioruby-vcf
1111
+ http://github.com/vcflib/bio-vcf
1077
1112
 
1078
1113
  ## Cite
1079
1114
 
1080
1115
  If you use this software, please cite one of
1081
-
1116
+
1082
1117
  * [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
1083
1118
  * [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
1084
1119
 
@@ -1088,5 +1123,4 @@ This Biogem is published at (http://biogems.info/index.html#bio-vcf)
1088
1123
 
1089
1124
  ## Copyright
1090
1125
 
1091
- Copyright (c) 2014 Pjotr Prins. See LICENSE.txt for further details.
1092
-
1126
+ Copyright (c) 2014-2020 Pjotr Prins. See LICENSE.txt for further details.