bio-vcf 0.9.4 → 0.9.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: f5d7a81871906abfffc93455b4d664d5755fe8d79312134eae94e84659506198
4
- data.tar.gz: 8029269859aedd53c613ea9bbb17f951972b062060b5a40c22bdbe65c6c3dfa7
3
+ metadata.gz: 814f6cb6c8bc237fd08ab4f22f2bfea514525fc6c2fd1b9081cd314bfa3c2fd2
4
+ data.tar.gz: 194aa006ac5c46c157360e37e98e42dbf88d121f9c64983792e8a7cfe43d0142
5
5
  SHA512:
6
- metadata.gz: ed231c3a918e5f9ab9cd8a618f3f25f0c39613ac934b496af334d77dabe64831ff08cfc722a467fc51ab8c583358ca21be769ba1d9654437d54e7d21b811ee2c
7
- data.tar.gz: df49786c4f4aa5e3a3659c678fb66aeb4b7dd4bb575aacf34cc468663c18fa893502699d238b5034c308ff51a4dc05e0fadf929b923c1d646f61c3f07fef26c7
6
+ metadata.gz: 4cad19fa108652d42aaaff95296176a21781813848a1dacae879cd44fcb324b9f38f763fc1e516ce41cb492433457c17d27e4aca0f64308758594fb68c0abadf
7
+ data.tar.gz: 5290a23fe85fe063b6fb8606c77cca3abd3df3af624ac9da2ac3348336bbfafa7e8b3d70d2c5c3ba57a55da75a052f1fa2c1d20212ba7b4e656ac7263f5f1af0
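The checksums above cover the two archives packed inside the `.gem` file (`metadata.gz` and `data.tar.gz`). As a minimal sketch, assuming the gem has already been unpacked so that `data.tar.gz` sits in the current directory, the new SHA256 value could be checked from Ruby roughly as follows. The helper name `sha256_matches?` and the constant are hypothetical and not part of bio-vcf or RubyGems.

```ruby
require 'digest'

# Hypothetical helper (not part of bio-vcf): true when the SHA256 hex
# digest of the file at `path` equals `expected`.
def sha256_matches?(path, expected)
  Digest::SHA256.file(path).hexdigest == expected.strip.downcase
end

# Value copied from the "+ data.tar.gz" SHA256 line of checksums.yaml above.
EXPECTED_DATA_SHA256 =
  '194aa006ac5c46c157360e37e98e42dbf88d121f9c64983792e8a7cfe43d0142'

# Assumption: data.tar.gz was extracted from the .gem into the current directory.
if File.exist?('data.tar.gz')
  ok = sha256_matches?('data.tar.gz', EXPECTED_DATA_SHA256)
  puts ok ? 'checksum ok' : 'checksum MISMATCH'
end
```

The same pattern should work for the SHA512 lines by swapping in `Digest::SHA512`.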
data/Gemfile CHANGED
@@ -1,13 +1,9 @@
1
1
  source "http://rubygems.org"
2
2
 
3
- # Add dependencies to develop your gem here.
4
- # Include everything needed to run rake, tests, features, etc.
5
3
  group :development do
6
- # gem "minitest"
7
4
  gem "rake"
8
5
  gem "rspec"
9
6
  gem "cucumber"
10
- gem "regressiontest", ">= 0.0.3"
11
7
  end
12
8
 
13
9
 
data/README.md CHANGED
@@ -1,10 +1,35 @@
1
1
  # bio-vcf
2
2
 
3
- [![Build Status](https://secure.travis-ci.org/vcflib/bio-vcf.png)](http://travis-ci.org/vcflib/bio-vcf)
3
+ [![Build Status](https://secure.travis-ci.org/vcflib/bio-vcf.png)](http://travis-ci.org/vcflib/bio-vcf) [![rubygem](https://img.shields.io/gem/v/bio-vcf.svg?style=flat)](http://rubygems.org/gems/bio-vcf "Install with Rubygems") [![AnacondaBadge](https://anaconda.org/bioconda/bio-vcf/badges/installer/conda.svg)](https://anaconda.org/bioconda/bio-vcf) [![DL](https://anaconda.org/bioconda/bio-vcf/badges/downloads.svg)](https://anaconda.org/bioconda/bio-vcf)
4
+ [![DebianBadge](https://badges.debian.net/badges/debian/testing/bio-vcf/version.svg)](https://packages.debian.org/testing/bio-vcf)
5
+
6
+ Quick index:
7
+
8
+ - [INSTALL](#Install)
9
+ - [Command line interface (CLI)](#command-line-interface-cli)
10
+ + [Set analysis](#set-analysis)
11
+ + [Genotype processing](#genotype-processing)
12
+ + [Sample counting](#sample-counting)
13
+ + [Filter with lambda](#reorder-filter-with-lambda)
14
+ + [Modify VCF files](#modify-vcf-files)
15
+ + [RDF output](#rdf-output)
16
+ - [Templates](#templates)
17
+ - [Metadata](#metadata)
18
+ - [Statistics](#statistics)
19
+ - [API](#api)
20
+ - [Cite](#cite)
4
21
 
5
22
 
6
23
  ## Bio-vcf
7
24
 
25
+ Bio-vcf provides a domain specific language (DSL) for processing the
26
+ VCF format. Record named fields can be queried with regular
27
+ expressions, e.g.
28
+
29
+ ```ruby
30
+ sample.dp>20 and rec.filter !~ /LowQD/ and rec.tumor.bcount[rec.alt]>4
31
+ ```
32
+
8
33
  Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf
9
34
  is not only very fast for genome-wide (WGS) data, it also comes with a
10
35
  really nice filtering, evaluation and rewrite language and it can
@@ -26,9 +51,62 @@ So, why would you use bio-vcf over other parsers? Because
26
51
  11. Bio-vcf can convert *any* VCF to *any* output, including tabular data, BED, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs by using (erb) templates
27
52
  12. Bio-vcf has soft filters
28
53
 
54
+ Some examples are documented for [reducing GTeX](doc/GTEx_reduce.md),
55
+ [comparing GATK](doc/GATK_comparison.md), [comparing
56
+ VCFs](doc/Compare_VCFs.md), JSON [loading Mongo
57
+ database](doc/Using_Mongo.md), and [generating RDF](doc/Using_RDF.md).
58
+
59
+ ## Options
60
+
61
+ In true Unix fashion files can be piped in or passed on the command
62
+ line:
63
+
64
+ bio-vcf --help
65
+
66
+ ```
67
+ bio-vcf (biogem with pcows) by Pjotr Prins 2015-2020
68
+
69
+ Usage: bio-vcf [options] filename
70
+ e.g. bio-vcf < test/data/input/somaticsniper.vcf
71
+ -i, --ignore-missing Ignore missing data
72
+ --filter cmd Evaluate filter on each record
73
+ --sfilter cmd Evaluate filter on each sample
74
+ --sfilter-samples list Filter on selected samples (e.g., 0,1
75
+ --ifilter, --if cmd Include filter
76
+ --ifilter-samples list Include set - implicitely defines exclude set
77
+ --efilter, --ef cmd Exclude filter
78
+ --efilter-samples list Exclude set - overrides exclude set
79
+ --add-filter name Set/add filter field to name
80
+ --bed bedfile Filter on BED elements
81
+ -e, --eval cmd Evaluate command on each record
82
+ --eval-once cmd Evaluate command once (usually for header info)
83
+ --seval cmd Evaluate command on each sample
84
+ --rewrite eval Rewrite INFO
85
+ --samples list Output selected samples
86
+ --rdf Generate Turtle RDF (also check out --template!)
87
+ --num-threads [num] Multi-core version (default ALL)
88
+ --thread-lines num Fork thread on num lines (default 40000)
89
+ --skip-header Do not output VCF header info
90
+ --set-header list Set a special tab delimited output header (#samples expands to sample names)
91
+ -t, --template erb Use ERB template for output
92
+ --add-header-tag Add bio-vcf status tag to header output
93
+ --timeout [num] Timeout waiting for thread to complete (default 180)
94
+ --names Output sample names
95
+ --statistics Output statistics
96
+ -q, --quiet Run quietly
97
+ -v, --verbose Run verbosely
98
+ --debug Show debug messages and keep intermediate output
99
+
100
+ --id name Identifier
101
+ --tags list Add tags
102
+ -h, --help display this help and exit
103
+ ```
104
+
105
+ ## Performance
106
+
29
107
  Bio-vcf has better performance than other tools because of lazy
30
108
  parsing, multi-threading, and useful combinations of (fancy) command
31
- line filtering (who says Ruby is slow?). Adding cores, bio-vcf just
109
+ line filtering. Adding cores, bio-vcf just
32
110
  does better. The more complicated the filters, the larger the
33
111
  gain. First a base line test to show IO performance
34
112
 
@@ -86,8 +164,7 @@ gzipped (vcf.gz) files into bio-vcf, e.g.
86
164
  --eval '[r.chrom,r.pos,r.pos+1]' > test.bed
87
165
  ```
88
166
 
89
- bio-vcf comes with a sensible parser definition language
90
- (interestingly it is 100% Ruby), an embedded Ragel parser for INFO and
167
+ bio-vcf comes with a sensible parser definition language, an embedded Ragel parser for INFO and
91
168
  FORMAT header definitions, as well as primitives for set analysis. Few
92
169
  assumptions are made about the actual contents of the VCF file (field
93
170
  names are resolved on the fly), so bio-vcf should work with all VCF
@@ -251,17 +328,58 @@ If something is not working, check out the feature descriptions and
251
328
  the source code. It is not hard to add features. Otherwise, send a short
252
329
  example of a VCF statement you need to work on.
253
330
 
254
- ## Installation
331
+ ## Install
332
+
333
+ Requirements:
334
+
335
+ * ruby
255
336
 
256
- The bio-vcf has no other dependencies but Ruby.
337
+ To install bio-vcf with Ruby gems, install Ruby first, e.g. on Debian
338
+ (as root)
257
339
 
258
- To install bio-vcf with Ruby gems:
340
+ ```sh
341
+ apt-get install ruby
342
+ ```
343
+
344
+ Installing ruby includes the `gem` command to install bio-vcf:
259
345
 
260
346
  ```sh
261
347
  gem install bio-vcf
348
+ export PATH=/usr/local/bin:$PATH
262
349
  bio-vcf -h
263
350
  ```
264
351
 
352
+ displays the help
353
+
354
+ ```
355
+ bio-vcf x.x (biogem Ruby with pcows) by Pjotr Prins 2015-2020
356
+ Usage: bio-vcf [options] filename
357
+ e.g. bio-vcf < test/data/input/somaticsniper.vcf
358
+ -i, --ignore-missing Ignore missing data
359
+ --filter cmd Evaluate filter on each record
360
+ (etc.)
361
+ ```
362
+
363
+ To install without root you may install a gem locally with
364
+
365
+ ```sh
366
+ gem install --install-dir ~/bio-vcf bio-vcf
367
+ ```
368
+
369
+ and run it with something like
370
+
371
+ ```sh
372
+ ~/bio-vcf/gems/bio-vcf-0.9.4/bin/bio-vcf -h
373
+ ```
374
+
375
+ Finally, it is possible to check out the git repository and simply
376
+ run the tool with
377
+
378
+ ```sh
379
+ git clone https://github.com/vcflib/bio-vcf.git
380
+ cd bio-vcf
381
+ ruby ./bin/bio-vcf -h
382
+ ```
265
383
 
266
384
  ## Command line interface (CLI)
267
385
 
@@ -1091,12 +1209,19 @@ bundle install --path vendor/bundle
1091
1209
  bundle exec rake
1092
1210
  ```
1093
1211
 
1212
+ Note: we develop in a GNU Guix environment, see the header of
1213
+ [guix.scm](guix.scm) which does not use bundler.
1214
+
1094
1215
  ### Debugging
1095
1216
 
1096
1217
  To debug output use '-v --num-threads=1' for generating useful
1097
1218
  output. Also do not use the -i switch (ignore errors) when there
1098
1219
  are problems.
1099
1220
 
1221
+ ### Could not find rake-10.4.2 in any of the sources
1222
+
1223
+ Remove Gemfile.lock before running other tools.
1224
+
1100
1225
  ### Tmpdir contains (old) bio-vcf directories
1101
1226
 
1102
1227
  Multi-threaded bio-vcf writes into a temporary directory during
data/RELEASE_NOTES.md CHANGED
@@ -1,8 +1,15 @@
1
- ## ChangeLog v0.9.4 (2020????)
1
+ ## ChangeLog v0.9.5 (20210118)
2
+
3
+ + Improved README and installation instructions
4
+ + Added guix.scm build and instructions (no need for bundler)
5
+ + Moved regressiontest into tree
6
+
7
+ ## ChangeLog v0.9.4 (20201222)
2
8
 
3
9
  This is an important maintenance release of bio-vcf:
4
10
 
5
- + Rename bioruby-vcf to bio-vcf and migrate project to [vcflib](https://github.com/vcflib/bio-vcf).
11
+ + Rename bioruby-vcf to bio-vcf and migrate project to [vcflib](https://github.com/vcflib/bio-vcf)
12
+ + Fixed tests to match recent Ruby updates
6
13
 
7
14
  ## Older release notes
8
15
 
data/Rakefile CHANGED
@@ -1,16 +1,21 @@
1
1
  # encoding: utf-8
2
2
 
3
- require 'rubygems'
3
+ # require 'rubygems'
4
4
  require 'rake'
5
+ # require 'cucumber/rake/task'
5
6
 
6
- require 'cucumber/rake/task'
7
- Cucumber::Rake::Task.new(:features) do |t|
7
+ # Cucumber::Rake::Task.new(:features) do |t|
8
8
  # t.cucumber_opts = "--bundler false"
9
+ # end
10
+
11
+ desc 'Run cucumber' # without bundler
12
+ task :features do
13
+ sh 'cucumber features'
9
14
  end
10
15
 
11
16
  task :default => :features
12
17
 
13
- task :test => [ :features ]
18
+ task :test => [ :features ]
14
19
 
15
20
  require 'rdoc/task'
16
21
  Rake::RDocTask.new do |rdoc|
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.9.4
1
+ 0.9.5
@@ -7,7 +7,6 @@ Gem::Specification.new do |s|
7
7
 
8
8
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
9
9
  s.authors = ["Pjotr Prins"]
10
- # s.date = "2015-12-28"
11
10
  s.description = "Smart lazy multi-threaded parser for VCF format with useful filtering and output rewriting (JSON, RDF etc.)"
12
11
  s.email = "pjotr.public01@thebird.nl"
13
12
  s.executables = ["bio-vcf"]
@@ -1,4 +1,3 @@
1
- # require 'mini/test'
2
1
 
3
2
  $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../../lib')
4
3
  require 'bio-vcf'
@@ -7,7 +6,7 @@ require 'rspec/expectations'
7
6
 
8
7
  # Add the regression module if in the path (it can also be a gem)
9
8
  rootdir = File.dirname(__FILE__) + '/../..'
10
- $LOAD_PATH.unshift(rootdir+'/lib',rootdir+'/../regressiontest/lib')
9
+ $LOAD_PATH.unshift(rootdir+'/lib/regressiontest',rootdir+'/../regressiontest/lib')
11
10
  require 'regressiontest'
12
11
 
13
12
  include BioVcf
data/lib/regressiontest.rb ADDED
@@ -0,0 +1,11 @@
1
+ # Please require your code below, respecting the naming conventions in the
2
+ # bioruby directory tree.
3
+ #
4
+ # For example, say you have a plugin named bio-plugin, the only uncommented
5
+ # line in this file would be
6
+ #
7
+ # require 'bio/bio-plugin/plugin'
8
+ #
9
+ # In this file only require other files. Avoid other source code.
10
+
11
+ require 'regressiontest/cli_exec'
data/lib/regressiontest/cli_exec.rb ADDED
@@ -0,0 +1,101 @@
1
+ require 'fileutils'
2
+ require 'timeout'
3
+ module RegressionTest
4
+
5
+ DEFAULT_TESTDIR = "test/data/regression"
6
+
7
+ # Regression test runner compares output in ./test/data/regression
8
+ # (by default). The convention is to keep a reference file named .ref
9
+ # and to write the new output to a matching .new file
10
+ #
11
+ # You can add an :ignore regex option which ignores lines in the
12
+ # comparson files matching a regex
13
+ #
14
+ # :timeout sets the time out for calling a system command
15
+ #
16
+ # :should_fail expects the system command to return a non-zero exit status
17
+ module CliExec
18
+ FilePair = Struct.new(:outfn,:reffn)
19
+
20
+ def CliExec::exec command, testname, options = {}
21
+ # ---- Find .ref file
22
+ fullname = DEFAULT_TESTDIR + "/" + testname
23
+ basefn = if File.exist?(testname+".ref") || File.exist?(testname+"-stderr.ref")
24
+ testname
25
+ elsif File.exist?(fullname + ".ref") || File.exist?(fullname+"-stderr.ref")
26
+ FileUtils.mkdir_p DEFAULT_TESTDIR
27
+ fullname
28
+ else
29
+ raise "Can not find reference file for #{testname} - expected #{fullname}.ref"
30
+ end
31
+ std_out = FilePair.new(basefn + ".new", basefn + ".ref")
32
+ std_err = FilePair.new(basefn + "-stderr.new", basefn + "-stderr.ref")
33
+ files = [std_out,std_err]
34
+ # ---- Create .new file
35
+ cmd = command + " > #{std_out.outfn} 2>#{std_err.outfn}"
36
+ $stderr.print cmd,"\n"
37
+ exec_ret = nil
38
+ if options[:timeout] && options[:timeout] > 0
39
+ Timeout.timeout(options[:timeout]) do
40
+ begin
41
+ exec_ret = Kernel.system(cmd)
42
+ rescue Timeout::Error
43
+ $stderr.print cmd, " failed to finish in under #{options[:timeout]}\n"
44
+ return false
45
+ end
46
+ end
47
+ else
48
+ exec_ret = Kernel.system(cmd)
49
+ end
50
+ expect_fail = (options[:should_fail] != nil)
51
+ if !expect_fail and !exec_ret # Kernel.system returns true only on a zero exit status
52
+ $stderr.print cmd," returned an error\n"
53
+ return false
54
+ end
55
+ if expect_fail and exec_ret
56
+ $stderr.print cmd," did not return an error\n"
57
+ return false
58
+ end
59
+ if options[:ignore]
60
+ regex = options[:ignore]
61
+ files.each do |f|
62
+ outfn = f.outfn
63
+ outfn1 = outfn + ".1"
64
+ FileUtils.mv(outfn,outfn1)
65
+ f1 = File.open(outfn1)
66
+ f2 = File.open(outfn,"w")
67
+ f1.each_line do | line |
68
+ f2.print(line) if line !~ /#{regex}/
69
+ end
70
+ f1.close
71
+ f2.close
72
+ FileUtils::rm(outfn1)
73
+ end
74
+ end
75
+ # ---- Compare the two files
76
+ files.each do |f|
77
+ next unless File.exist?(f.reffn)
78
+ return false unless compare_files(f.outfn,f.reffn,options[:ignore])
79
+ end
80
+ return true
81
+ end
82
+
83
+ def CliExec::compare_files fn1, fn2, ignore = nil
84
+ if not File.exist?(fn2)
85
+ FileUtils::cp(fn1,fn2)
86
+ true
87
+ else
88
+ cmd = "diff #{fn2} #{fn1}"
89
+ $stderr.print cmd+"\n"
90
+ return true if Kernel.system(cmd) == true
91
+ # Hmmm. We have a different result. We are going to try again
92
+ # because sometimes threads have not completed
93
+ sleep 0.25
94
+ return true if Kernel.system(cmd) == true
95
+ $stderr.print "If it is correct, execute \"cp #{fn1} #{fn2}\", and run again"
96
+ false
97
+ end
98
+ end
99
+ end
100
+
101
+ end
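The `:ignore` option described in the comments above drops lines matching a regex before the `.new` output is compared with the `.ref` file. The following is a minimal, self-contained sketch of that idea only. It does not call the `RegressionTest` module, it works on strings rather than files, and unlike `CliExec.exec` it filters both sides of the comparison. The method names `strip_ignored` and `same_output?` and the sample VCF lines are hypothetical.

```ruby
# Hypothetical stand-in for the :ignore step: keep only the lines that
# do NOT match the given regex source string.
def strip_ignored(text, ignore)
  return text if ignore.nil?
  text.each_line.reject { |line| line =~ /#{ignore}/ }.join
end

# Hypothetical comparison: true when output and reference are identical
# once the ignored lines have been removed from both.
def same_output?(new_text, ref_text, ignore = nil)
  strip_ignored(new_text, ignore) == strip_ignored(ref_text, ignore)
end

# Made-up sample data: the two outputs differ only in a date header line.
ref_text = "##fileformat=VCFv4.1\n##date=20201222\n1\t100\tA\tT\n"
new_text = "##fileformat=VCFv4.1\n##date=20210118\n1\t100\tA\tT\n"

puts same_output?(new_text, ref_text)            # prints false
puts same_output?(new_text, ref_text, '^##date') # prints true
```

Ignoring volatile lines such as dates or version banners keeps a regression test from failing on output that is expected to change between runs.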
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: bio-vcf
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.9.4
4
+ version: 0.9.5
5
5
  platform: ruby
6
6
  authors:
7
7
  - Pjotr Prins
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2020-12-22 00:00:00.000000000 Z
11
+ date: 2021-01-18 00:00:00.000000000 Z
12
12
  dependencies: []
13
13
  description: Smart lazy multi-threaded parser for VCF format with useful filtering
14
14
  and output rewriting (JSON, RDF etc.)
@@ -22,7 +22,6 @@ extra_rdoc_files:
22
22
  files:
23
23
  - ".travis.yml"
24
24
  - Gemfile
25
- - Gemfile.lock
26
25
  - LICENSE.txt
27
26
  - README.md
28
27
  - RELEASE_NOTES.md
@@ -63,6 +62,8 @@ files:
63
62
  - lib/bio-vcf/vcfrecord.rb
64
63
  - lib/bio-vcf/vcfsample.rb
65
64
  - lib/bio-vcf/vcfstatistics.rb
65
+ - lib/regressiontest.rb
66
+ - lib/regressiontest/cli_exec.rb
66
67
  - ragel/gen_vcfheaderline_parser.rl
67
68
  - ragel/generate.sh
68
69
  - template/gatk_vcf2rdf.erb
data/Gemfile.lock DELETED
@@ -1,44 +0,0 @@
1
- GEM
2
- remote: http://rubygems.org/
3
- specs:
4
- builder (3.2.2)
5
- cucumber (2.1.0)
6
- builder (>= 2.1.2)
7
- cucumber-core (~> 1.3.0)
8
- diff-lcs (>= 1.1.3)
9
- gherkin3 (~> 3.1.0)
10
- multi_json (>= 1.7.5, < 2.0)
11
- multi_test (>= 0.1.2)
12
- cucumber-core (1.3.0)
13
- gherkin3 (~> 3.1.0)
14
- diff-lcs (1.2.5)
15
- gherkin3 (3.1.1)
16
- multi_json (1.11.2)
17
- multi_test (0.1.2)
18
- rake (10.4.2)
19
- regressiontest (0.0.3)
20
- rspec (3.3.0)
21
- rspec-core (~> 3.3.0)
22
- rspec-expectations (~> 3.3.0)
23
- rspec-mocks (~> 3.3.0)
24
- rspec-core (3.3.2)
25
- rspec-support (~> 3.3.0)
26
- rspec-expectations (3.3.1)
27
- diff-lcs (>= 1.2.0, < 2.0)
28
- rspec-support (~> 3.3.0)
29
- rspec-mocks (3.3.2)
30
- diff-lcs (>= 1.2.0, < 2.0)
31
- rspec-support (~> 3.3.0)
32
- rspec-support (3.3.0)
33
-
34
- PLATFORMS
35
- ruby
36
-
37
- DEPENDENCIES
38
- cucumber
39
- rake
40
- regressiontest (>= 0.0.3)
41
- rspec
42
-
43
- BUNDLED WITH
44
- 1.10.6