bio-vcf 0.0.2 → 0.0.3

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 014f3f4adef8a533501e115027fc8a920487f949
4
- data.tar.gz: 89e442176c21a0893c267dc324db403fd57e7577
3
+ metadata.gz: 1f08be0a8d7ad751ad4758156e5ce6ccbc518cc0
4
+ data.tar.gz: 741386a278d7c38340abf35cc08d1f1923636131
5
5
  SHA512:
6
- metadata.gz: 624f6cd5251384da85b824cd1d98e49be2efcb765d34b902b225ba8e8dc7ce7b6fa68d9a47b8bea5f7eb60b33e21182c7d941d7b0da70ce58228354021c596a0
7
- data.tar.gz: 7ffe437d72368db97987fe53c8aa93bebdcff68002975f86e02cf45ec69b48f76e060e76140e27242122b3c533021eae4073a19c0c0a09089edac62b02394682
6
+ metadata.gz: 04d20d248629cccebbd3d639c2a25bb5d33efd2163999f87912343f44442c1c3f19429d6008900973116354a522679e75d853e3e6f9428d54a6647a38ef5e7fe
7
+ data.tar.gz: 625b39c9172569d3e893721a6f943721b30032b9946cea03923524af75345793edb3654f1489ebb21e084d81970a288e0552e442c27857647d0982999c497487
data/Gemfile CHANGED
@@ -13,4 +13,5 @@ group :development do
13
13
  # gem "bundler", ">= 1.0.21"
14
14
  # gem "bio", ">= 1.4.2"
15
15
  # gem "rdoc", "~> 3.12"
16
+ gem "regressiontest"
16
17
  end
data/Gemfile.lock CHANGED
@@ -15,6 +15,8 @@ GEM
15
15
  multipart-post (>= 1.2, < 3)
16
16
  gherkin (2.12.2)
17
17
  multi_json (~> 1.3)
18
+ gherkin (2.12.2-java)
19
+ multi_json (~> 1.3)
18
20
  git (1.2.6)
19
21
  github_api (0.11.3)
20
22
  addressable (~> 2.3)
@@ -36,6 +38,7 @@ GEM
36
38
  rake
37
39
  rdoc
38
40
  json (1.8.1)
41
+ json (1.8.1-java)
39
42
  jwt (0.1.11)
40
43
  multi_json (>= 1.5)
41
44
  mini_portile (0.5.2)
@@ -45,6 +48,8 @@ GEM
45
48
  multipart-post (2.0.0)
46
49
  nokogiri (1.6.1)
47
50
  mini_portile (~> 0.5.0)
51
+ nokogiri (1.6.1-java)
52
+ mini_portile (~> 0.5.0)
48
53
  oauth2 (0.9.3)
49
54
  faraday (>= 0.8, < 0.10)
50
55
  jwt (~> 0.1.8)
@@ -55,6 +60,7 @@ GEM
55
60
  rake (10.1.1)
56
61
  rdoc (4.1.1)
57
62
  json (~> 1.4)
63
+ regressiontest (0.0.2)
58
64
  rspec (2.14.1)
59
65
  rspec-core (~> 2.14.0)
60
66
  rspec-expectations (~> 2.14.0)
@@ -65,9 +71,11 @@ GEM
65
71
  rspec-mocks (2.14.6)
66
72
 
67
73
  PLATFORMS
74
+ java
68
75
  ruby
69
76
 
70
77
  DEPENDENCIES
71
78
  cucumber
72
79
  jeweler
80
+ regressiontest
73
81
  rspec
data/README.md CHANGED
@@ -4,12 +4,102 @@
4
4
 
5
5
  Yet another VCF parser. This one may give better performance because
6
6
  of lazy parsing and useful combinations of (fancy) command line
7
- filtering. For example, to filter somatic data
7
+ filtering. bio-vcf comes with a sensible parser definition language,
8
+ as well as primitives for set analysis. Also few assumptions are made
9
+ about the actual contents of the VCF file (field names are resolved on
10
+ the fly).
11
+
12
+ To fetch all entries where all samples have depth larger than 20 use an sfilter
13
+
14
+ ```ruby
15
+ bio-vcf --sfilter 'sample.dp>20' < file.vcf
16
+ ```
17
+
18
+ To only filter on some samples number 0 and 3:
19
+
20
+ ```ruby
21
+ bio-vcf --sfilter-samples 0,3 --sfilter 's.dp>20' < file.vcf
22
+ ```
23
+
24
+ Where 's.dp' is the shorter name for 'sample.dp'.
25
+
26
+ It is also possible to specify sample names, or info fields:
27
+
28
+ For example, to filter somatic data
29
+
30
+ ```ruby
31
+ bio-vcf --filter 'rec.info.dp>5 and rec.alt.size==1 and rec.tumor.bq[rec.alt]>30 and rec.tumor.mq>20' < file.vcf
32
+ ```
33
+
34
+ To output specific fields in tabular (and HTML, XML or LaTeX) format
35
+ use the --eval switch, e.g.,
36
+
37
+ ```ruby
38
+ bio-vcf --eval 'rec.alt+"\t"+rec.info.dp+"\t"+rec.tumor.gq.to_s' < file.vcf
39
+ ```
40
+
41
+ In fact, if the result is an Array the output gets tab dilimited so
42
+ the nicer version is
43
+
44
+ ```ruby
45
+ bio-vcf --eval '[r.alt,r.info.dp,r.tumor.gq.to_s]' < file.vcf
46
+ ```
47
+
48
+ To output the DP values of every sample that has a depth larger than
49
+ 100:
50
+
51
+ ```ruby
52
+ bio-vcf -i --sfilter 's.dp>100' --seval 's.dp' < file.vcf
53
+
54
+ 1 10257 159 242 249 249 186 212 218
55
+ 1 10291 165 249 249 247 161 163 189
56
+ 1 10297 182 246 250 246 165 158 183
57
+ 1 10303 198 247 248 248 172 157 182
58
+ (etc.)
59
+ ```
60
+
61
+ Where -i ignores missing samples. Pick up sample allele depth
62
+
63
+ ```ruby
64
+ bio-vcf -i --seval 's.ad'
65
+ 1 10257 151,8 219,22 227,22 226,22 166,18 185,27 201,15
66
+ 1 10291 145,16 218,26 214,30 213,32 122,36 131,27 156,31
67
+ 1 10297 155,18 218,23 219,26 207,30 137,20 124,27 151,27
68
+ ```
69
+
70
+ And to output DP ang GQ values for tumor normal:
71
+
72
+ ```ruby
73
+ bio-vcf --filter 'r.normal.dp>=7 and r.tumor.dp>=5' --seval '[s.dp,s.gq]' < freebayes.vcf
74
+
75
+ 17 45235620 22 139.35 20 0
76
+ 17 45235635 20 137.224 14 41.5688
77
+ 17 45235653 18 146.509 12 146.509
78
+ 17 45247354 32 0 9 6.59312
79
+ 17 45247362 27 0 6 110.097
80
+
81
+ ```
82
+
83
+ To parse and output genotype
8
84
 
9
85
  ```ruby
10
- bio-vcf --filter 'rec.alt.size==1 and rec.tumor.bq[rec.alt]>30 and rec.tumor.mq>20' < file.vcf
86
+ bio-vcf -iq --sfilter 's.dp>=20 and s.gq>=20' --ifilter-sampler 's.gt!="0/0"' --seval s.gt < test/data/input/multisample.vcf
87
+ 1 10257 0/0 0/0 0/0 0/0 0/0 0/1 0/0
88
+ 1 10291 0/1 0/1 0/1 0/1 0/1 0/1 0/1
89
+ 1 10297 0/1 0/1 0/1 0/0 0/0 0/1 0/1
90
+ 1 12783 0/1 0/1 0/1 0/1 0/1 0/1 0/1
11
91
  ```
12
92
 
93
+ Most filter and eval commands can be used at the same time. Special set
94
+ commands exit for filtering and eval. When a set is defined, based on
95
+ the sample name, you can apply filters on the samples inside the set,
96
+ outside the set and over all samples. E.g.
97
+
98
+ Also note you can use
99
+ [bio-table](https://github.com/pjotrp/bioruby-table) to
100
+ filter/transform data further and convert to other formats, such as
101
+ RDF.
102
+
13
103
  The VCF format is commonly used for variant calling between NGS
14
104
  samples. The fast parser needs to carry some state, recorded for each
15
105
  file in VcfHeader, which contains the VCF file header. Individual
@@ -18,17 +108,18 @@ of fields. Further (lazy) parsing is handled through VcfRecord.
18
108
 
19
109
  At this point the filter is pretty generic with multi-sample support.
20
110
  If something is not working, check out the feature descriptions and
21
- the source code. It is not hard to add features. Otherwise, send me a short
111
+ the source code. It is not hard to add features. Otherwise, send a short
22
112
  example of a VCF statement you need to work on.
23
113
 
114
+ bio-vcf is fast. Parsing a 55K line DbSNP file (22Mb) takes 1.5 seconds on a
115
+ Macbook PRO running 64-bits Linux (Ruby 2.1.0).
116
+
24
117
  ## Installation
25
118
 
26
119
  ```sh
27
120
  gem install bio-vcf
28
121
  ```
29
122
 
30
- ## Quick start
31
-
32
123
  ## Command line interface (CLI)
33
124
 
34
125
  Get the version of the VCF file
@@ -56,13 +147,13 @@ The 'fields' array contains unprocessed data (strings). Print first
56
147
  five raw fields
57
148
 
58
149
  ```ruby
59
- bio-vcf --eval 'fields[0..4].join("\t")' < file.vcf
150
+ bio-vcf --eval 'fields[0..4]' < file.vcf
60
151
  ```
61
152
 
62
153
  Add a filter to display the fields on chromosome 12
63
154
 
64
155
  ```ruby
65
- bio-vcf --filter 'fields[0]=="12"' --eval 'fields[0..4].join("\t")' < file.vcf
156
+ bio-vcf --filter 'fields[0]=="12"' --eval 'fields[0..4]' < file.vcf
66
157
  ```
67
158
 
68
159
  It gets better when we start using processed data, represented by an
@@ -72,6 +163,19 @@ object named 'rec'. Position is a value, so we can filter a range
72
163
  bio-vcf --filter 'rec.chrom=="12" and rec.pos>96_641_270 and rec.pos<96_641_276' < file.vcf
73
164
  ```
74
165
 
166
+ The shorter name for 'rec.chrom' is 'r.chrom', so you may write
167
+
168
+ ```ruby
169
+ bio-vcf --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
170
+ ```
171
+
172
+ To ignore and continue parsing on missing data use the
173
+ --ignore-missing (-i) and or --quiet (-q) switches
174
+
175
+ ```ruby
176
+ bio-vcf -i --filter 'r.chrom=="12" and r.pos>96_641_270 and r.pos<96_641_276' < file.vcf
177
+ ```
178
+
75
179
  Info fields are referenced by
76
180
 
77
181
  ```ruby
@@ -118,26 +222,287 @@ Similar for base quality scores
118
222
  bio-vcf --filter 'rec.alt.size==1 and rec.tumor.amq[rec.alt]>30' < test.vcf
119
223
  ```
120
224
 
225
+ Filter out on sample values
226
+
227
+ ```ruby
228
+ bio-vcf --sfilter 's.dp>20' < test.vcf
229
+ ```
230
+
231
+ To filter missing on samples:
232
+
233
+ ```sh
234
+ bio-vcf --filter "rec.s3t2?" < file.vcf
235
+ ```
236
+
237
+ or for all
238
+
239
+ ```sh
240
+ bio-vcf --filter "rec.missing_samples?" < file.vcf
241
+ ```
242
+
243
+ Likewise you can check for record validity
244
+
245
+ ```sh
246
+ bio-vcf --filter "not rec.valid?" < file.vcf
247
+ ```
248
+
249
+ which, at this point, simply counts the number of fields.
250
+
121
251
  If your samples have other names you can fetch genotypes for that
122
252
  sample with
123
253
 
124
254
  ```sh
125
- bio-vcf --eval "rec.sample['BIOPSY17513D'].gt" < file.vcf
255
+ bio-vcf --eval "rec.sample['Original'].gt" < file.vcf
126
256
  ```
127
257
 
128
258
  Or read depth for another
129
259
 
130
260
  ```sh
131
- bio-vcf --eval "rec.sample['subclone46'].dp" < file.vcf
261
+ bio-vcf --eval "rec.sample['s3t2'].dp" < file.vcf
132
262
  ```
133
263
 
134
264
  Better even, you can access samples directly with
135
265
 
136
266
  ```sh
137
- bio-vcf --eval "rec.sample.biopsy17513d.gt" < file.vcf
138
- bio-vcf --eval "rec.sample.subclone46.dp" < file.vcf
267
+ bio-vcf --eval "rec.sample.original.gt" < file.vcf
268
+ bio-vcf --eval "rec.sample.s3t2.dp" < file.vcf
269
+ ```
270
+
271
+ And even better because of Ruby magic
272
+
273
+ ```sh
274
+ bio-vcf --eval "rec.original.gt" < file.vcf
275
+ bio-vcf --eval "rec.s3t2.dp" < file.vcf
276
+ ```
277
+
278
+ Note that only valid method names in lower case get picked up this
279
+ way. Also by convention normal is sample 1 and tumor is sample 2.
280
+
281
+ Even shorter r is an alias for rec (nyi)
282
+
283
+ ```sh
284
+ bio-vcf --eval "r.original.gt" < file.vcf
285
+ bio-vcf --eval "r.s3t2.dp" < file.vcf
286
+ ```
287
+
288
+ ## Special functions
289
+
290
+ Note: special functions are not yet implemented!
291
+
292
+ Sometime you want to use a special function in a filter. For
293
+ example percentage variant reads can be defined as [a,c,g,t]
294
+ with frequencies against sample read depth (dp) as
295
+ [0,0.03,0.47,0.50]. Filtering would with a special function,
296
+ which we named freq
297
+
298
+ ```sh
299
+ bio-vcf --sfilter "s.freq(2)>0.30" < file.vcf
300
+ ```
301
+
302
+ which is equal to
303
+
304
+ ```sh
305
+ bio-vcf --sfilter "s.freq.g>0.30" < file.vcf
306
+ ```
307
+
308
+ To check for ref or variant frequencies use more sugar
309
+
310
+ ```sh
311
+ bio-vcf --sfilter "s.freq.var>0.30 and s.freq.ref<0.10" < file.vcf
312
+ ```
313
+
314
+ For all includes var should be identical for set analysis except for
315
+ cartesian. So when --include is defined test for identical var and in
316
+ the case of cartesian one unique var, when tested.
317
+
318
+ ref should always be identical across samples.
319
+
320
+ ## DbSNP
321
+
322
+ One clinical variant DbSNP example
323
+
324
+ ```sh
325
+ bio-vcf --eval '[rec.id,rec.chr,rec.pos,rec.alt,rec.info.sao,rec.info.CLNDBN].join("\t")' < clinvar_20140303.vcf
326
+ ```
327
+
328
+ renders
329
+
330
+ ```
331
+ 1 1916905 rs267598254 A 3 Malignant_melanoma
332
+ 1 1916906 rs267598255 A 3 Malignant_melanoma
333
+ 1 1959075 rs121434580 C 1 Generalized_epilepsy_with_febrile_seizures_plus_type_5
334
+ 1 1959699 rs41307846 A 1 Generalized_epilepsy_with_febrile_seizures_plus_type_5|Epilepsy\x2c_juvenile_myoclonic_7|Epilepsy\x2c_idiopathic_generalized_10
335
+ 1 1961453 rs142619552 T 3 Malignant_melanoma
336
+ 1 2160299 rs387907304 G 0 Shprintzen-Goldberg_syndrome
337
+ 1 2160305 rs387907306 A T 0 Shprintzen-Goldberg_syndrome,Shprintzen-Goldberg_syndrome
338
+ 1 2160306 rs387907305 A T 0 Shprintzen-Goldberg_syndrome,Shprintzen-Goldberg_syndrome
339
+ 1 2160308 rs397514590 T 0 Shprintzen-Goldberg_syndrome
340
+ 1 2160309 rs397514589 A 0 Shprintzen-Goldberg_syndrome
341
+ ```
342
+
343
+ ## Set analysis
344
+
345
+ bio-vcf allows for set analysis. With the complement filter, for
346
+ example, samples are selected that evaluate to true, all others should
347
+ evaluate to false. For this we create three filters, one for all
348
+ samples that are included (the --ifilter or -if), for all samples that
349
+ are excluded (the --efilter or -ef) and for any sample (the --sfilter
350
+ or -sf). So i=include, e=exclude and s=any sample.
351
+
352
+ The equivalent of the union filter is by using the --sfilter, so
353
+
354
+ ```sh
355
+ bio-vcf --sfilter 's.dp>20'
356
+ ```
357
+
358
+ Filters DP on all samples. To filter on a subset you can add a
359
+ selector
360
+
361
+ ```sh
362
+ bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20'
363
+ ```
364
+
365
+ For set analysis there are the additional ifilter (include) and efilter (exclude). To filter
366
+ on samples 0,1,4 and output the gq values
367
+
368
+ ```sh
369
+ bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gq<10 or s.gq==99' --seval s.gq
370
+ 1 14907 99 99 99 99 99 99 99
371
+ 1 14930 99 99 99 99 99 99 99
372
+ 1 14933 1 99 99 39 99 99 99
373
+ 1 15190 99 99 91 99 99 99 99
374
+ 1 15211 99 99 99 99 99 99 99
375
+ ```
376
+
377
+ The equivalent of the complement filter is by specifying what samples
378
+ to include, here with a regex and define filters on the included
379
+ and excluded samples (the ones not in ifilter-samples) and the
380
+
381
+ ```sh
382
+ ./bin/bio-vcf -i --sfilter 's.dp>20' --ifilter-samples 2,4 --ifilter 's.gt==r.s1t1.gt'
383
+ ```
384
+
385
+ To print out the GT's add --seval
386
+
387
+ ```sh
388
+ bio-vcf -i --sfilter 's.dp>20' --ifilter-samples 2,4 --ifilter 's.gt==r.s1t1.gt' --seval 's.gt'
389
+ 1 14673 0/1 0/1 0/1 0/1 0/1 0/1 0/1
390
+ 1 14907 0/1 0/1 0/1 0/1 0/1 0/1 0/1
391
+ 1 14930 0/1 0/1 0/1 0/1 0/1 0/1 0/1
392
+ 1 15211 0/1 0/1 0/1 0/1 0/1 0/1 0/1
393
+ 1 15274 1/2 1/2 1/2 1/2 1/2 1/2 1/2
394
+ 1 16103 0/1 0/1 0/1 0/1 0/1 0/1 0/1
139
395
  ```
140
396
 
397
+ To set an additional filter on the excluded samples:
398
+
399
+ ```sh
400
+ bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gt==rec.s1t1.gt and s.gq>10' --seval s.gq --efilter 's.gq==99'
401
+ ```
402
+
403
+ Etc. etc. Any combination of sfilter, ifilter and efilter is possible.
404
+
405
+ The following are not yet implemented:
406
+
407
+ In the near future it is also possible to select samples on a regex (here
408
+ select all samples where the name starts with s3)
409
+
410
+ ```sh
411
+ bio-vcf --isample-regex '/^s3/' --ifilter 's.dp>20'
412
+ ```
413
+
414
+ ```sh
415
+ bio-vcf --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt' --efilter 'gt!=s3t1.gt'
416
+ --set-intersect include=true
417
+ bio-vcf --include /s3.+/ --sample-regex /^t2/ --sfilter 'dp>20' --ifilter 'gt==s3t1.gt'
418
+ --set-catesian one in include=true, rest=false
419
+ bio-vcf --unique-sample (any) --include /s3.+/ --sfilter 'dp>20' --ifilter 'gt!="0/0"'
420
+ ```
421
+
422
+ With the filter commands you can use --ignore-missing to skip errors.
423
+
424
+ ## Genotype processing
425
+
426
+ The sample GT field counts 0 as the reference and numbers >1 as
427
+ indexed ALT values. The field is simply built up using a slash or | as
428
+ a separator (e.g., 0/1, 0|2, ./. are valid values). The standard field
429
+ results in a string value
430
+
431
+ ```ruby
432
+ bio-vcf --seval s.gt
433
+ 1 10665 ./. ./. 0/1 0/1 ./. 0/0 0/0
434
+ 1 10694 ./. ./. 1/1 1/1 ./. ./. ./.
435
+ 1 12783 0/1 0/1 0/1 0/1 0/1 0/1 0/1
436
+ 1 15274 1/2 1/2 1/2 1/2 1/2 1/2 1/2
437
+ ```
438
+
439
+ to access components of the genotype field we can use standard Ruby
440
+
441
+ ```ruby
442
+ bio-vcf --seval 's.gt.split(/\//)[0]'
443
+ 1 10665 . . 0 0 . 0 0
444
+ 1 10694 . . 1 1 . . .
445
+ 1 12783 0 0 0 0 0 0 0
446
+ 1 15274 1 1 1 1 1 1 1
447
+ ```
448
+
449
+ or special functions, such as 'gti' which gives the genotype as an
450
+ indexed value array
451
+
452
+ ```ruby
453
+ bio-vcf --seval 's.gti[0]'
454
+ 1 10665 0 0 0 0
455
+ 1 10694 1 1
456
+ 1 12783 0 0 0 0 0 0 0
457
+ 1 15274 1 1 1 1 1 1 1
458
+ ```
459
+
460
+ and 'gts' as a nucleotide string array
461
+
462
+ ```ruby
463
+ bio-vcf --seval 's.gts[0]'
464
+ 1 10665 C C C C
465
+ 1 10694 G G
466
+ 1 12783 G G G G G G G
467
+ 1 15274 G G G G G G G
468
+ ```
469
+
470
+ These values can also be used in filters and output allele depth, for
471
+ example
472
+
473
+ ```ruby
474
+ bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gt=="0/0"' --seval 'rec.original.ad[s.gti[1]]'
475
+ 1 10257 151 151 151 151 151 8 151
476
+ 1 13302 26 10 10 10 10 10 10
477
+ 1 13757 47 47 4 47 47 4 47
478
+ ```
479
+
480
+ The following does not yet work (using the gti in a sample directly)
481
+
482
+ ```ruby
483
+ bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gti[0]==0' --seval 'rec.original.ad[s.gti[1]]'
484
+ ```
485
+
486
+ ## Modify VCF files
487
+
488
+ Add or modify the sample file name in the INFO fields:
489
+
490
+ ```sh
491
+ bio-vcf --rewrite 'rec.info["sample"]="mytest"' < mytest.vcf
492
+ ```
493
+
494
+ To remove/select 3 samples and create a new file:
495
+
496
+ ```sh
497
+ bio-vcf --samples 0,1,3 < mytest.vcf
498
+ ```
499
+
500
+ ## RDF output
501
+
502
+ Use [bio-table](https://github.com/pjotrp/bioruby-table) to convert tabular data to RDF.
503
+
504
+ ## Other examples
505
+
141
506
  For more examples see the feature [section](https://github.com/pjotrp/bioruby-vcf/tree/master/features).
142
507
 
143
508
  ## API