rdf-vcf 0.1.2

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 0308e08c49b3ab4dafd99a2c4d7e1eb6f290f5d8
4
+ data.tar.gz: 62ab5fbfbb17000b1e77ae49347cc08a43466fbd
5
+ SHA512:
6
+ metadata.gz: 6c91ef56c3884e0308a5ecf9c816a531eabb2d79fb18b0ec0e6d64a1fafd2dbfb2088d83d39ba368b04ab8bbd091cb2ec4a41459e807cc6977d5b2c8fad155dc
7
+ data.tar.gz: 4b7ceefbdcf83a81f1aad75410c7062dd37acbecc3d6d2207faee57574908683024ed721db674b5fba882ab675f3acc8d1399564d48a7352f6b7b36b5b97920d
data/AUTHORS ADDED
@@ -0,0 +1,3 @@
1
+ * Arto Bendiken <arto@bendiken.net>
2
+ * Raoul J.P. Bonnal <ilpuccio.febo@gmail.com>
3
+ * Francesco Strozzi <francesco.strozzi@gmail.com>
data/CREDITS ADDED
File without changes
data/README ADDED
@@ -0,0 +1,79 @@
1
+ RDF::VCF
2
+ ========
3
+
4
+ This is an [RDF.rb](https://github.com/ruby-rdf/rdf) reader plugin for
5
+ Variant Call Format (VCF) files, widely used in bioinformatics.
6
+
7
+ This project grew out of [BioHackathon 2014](http://2014.biohackathon.org/)
8
+ [work](https://github.com/dbcls/bh14/wiki/On-The-Fly-RDF-converter) by
9
+ [Raoul J.P. Bonnal](https://github.com/helios) and [Francesco
10
+ Strozzi](https://github.com/fstrozzi), and was further developed during
11
+ [BioHackathon 2015](http://2015.biohackathon.org/).
12
+
13
+ Note: at present, the project requires JRuby 9.0 (or newer) because it
14
+ relies on a Java-based VCF parser. We hope to eventually replace it with the
15
+ pure-Ruby [Bio-vcf](https://github.com/pjotrp/bioruby-vcf).
16
+
17
+ Features
18
+ ========
19
+
20
+ * Implements an RDF.rb reader for VCF and BCF files, including
21
+ bgzipped files.
22
+ * Includes a CLI tool called `vcf2rdf` to transform VCF files into RDF.
23
+
24
+ Examples
25
+ ========
26
+
27
+ Reading VCF Files
28
+ -----------------
29
+
30
+ The gem can be used like any other RDF.rb reader plugin:
31
+
32
+ require 'rdf/vcf'
33
+
34
+ RDF::VCF::Reader.open('Homo_sapiens.vcf.gz') do |reader|
35
+ reader.each_statement do |statement|
36
+ p statement
37
+ end
38
+ end
39
+
40
+ Command-Line Interface (CLI)
41
+ ----------------------------
42
+
43
+ The gem includes a CLI tool called `vcf2rdf` which transforms VCF files into
44
+ RDF (currently, N-Quads):
45
+
46
+ vcf2rdf your_vcf_file.vcf.gz
47
+
48
+ Input files can be either plain text VCF or else compressed by `bgzip`, as
49
+ above.
50
+
51
+ Notes
52
+ -----
53
+
54
+ Please create the [`tabix`](http://www.htslib.org/doc/tabix.html) index by
55
+ yourself. In the future, this gem will create the index automatically.
56
+
57
+ Dependencies
58
+ ============
59
+
60
+ * [JRuby](http://jruby.org) (>= 9.0)
61
+ * [RDF.rb](https://github.com/ruby-rdf/rdf) (>= 1.1)
62
+
63
+ Mailing List
64
+ ============
65
+
66
+ * http://lists.w3.org/Archives/Public/public-rdf-ruby/
67
+
68
+ Authors
69
+ =======
70
+
71
+ * [Arto Bendiken](https://github.com/bendiken)
72
+ * [Raoul J.P. Bonnal](https://github.com/helios)
73
+ * [Francesco Strozzi](https://github.com/fstrozzi)
74
+
75
+ License
76
+ =======
77
+
78
+ This is free and unencumbered public domain software. For more information,
79
+ see <http://unlicense.org/> or the accompanying {file:UNLICENSE} file.
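The README's note about creating the `tabix` index by hand can be sketched in Ruby. This is a minimal sketch, not part of the gem: the helper names (`tabix_index_path`, `tabix_indexed?`, `tabix_command`, `ensure_tabix_index`) are hypothetical, and it assumes the index sits next to the VCF file as `<file>.tbi`, which is the path the reader in this gem constructs.

```ruby
# Hypothetical helpers, not part of rdf-vcf.

# Path of the tabix index expected next to a bgzipped VCF file.
def tabix_index_path(vcf_path)
  "#{vcf_path}.tbi"
end

# True if the index file already exists.
def tabix_indexed?(vcf_path)
  File.file?(tabix_index_path(vcf_path))
end

# Argument list for the tabix CLI; `-p vcf` selects the VCF preset.
def tabix_command(vcf_path)
  ['tabix', '-p', 'vcf', vcf_path.to_s]
end

# Creates the index only when it is missing. Kernel#system returns nil
# when the tabix executable cannot be found, so callers can detect that.
def ensure_tabix_index(vcf_path)
  return true if tabix_indexed?(vcf_path)
  system(*tabix_command(vcf_path))
end
```

Calling `ensure_tabix_index('your_vcf_file.vcf.gz')` before opening the file would leave an existing index untouched and otherwise shell out to `tabix`.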
@@ -0,0 +1,24 @@
1
+ This is free and unencumbered software released into the public domain.
2
+
3
+ Anyone is free to copy, modify, publish, use, compile, sell, or
4
+ distribute this software, either in source code form or as a compiled
5
+ binary, for any purpose, commercial or non-commercial, and by any
6
+ means.
7
+
8
+ In jurisdictions that recognize copyright laws, the author or authors
9
+ of this software dedicate any and all copyright interest in the
10
+ software to the public domain. We make this dedication for the benefit
11
+ of the public at large and to the detriment of our heirs and
12
+ successors. We intend this dedication to be an overt act of
13
+ relinquishment in perpetuity of all present and future rights to this
14
+ software under copyright law.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
19
+ IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
20
+ OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
21
+ ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
22
+ OTHER DEALINGS IN THE SOFTWARE.
23
+
24
+ For more information, please refer to <http://unlicense.org/>
data/VERSION ADDED
@@ -0,0 +1 @@
1
+ 0.1.2
@@ -0,0 +1,23 @@
1
+ #!/usr/bin/env jruby
2
+ $:.unshift(File.expand_path('../lib', __dir__))
3
+ require 'rdf/vcf'
4
+
5
+ if ARGV.empty? || ARGV.include?('--help') || ARGV.include?('-h')
6
+ abort "usage: #{$0} file.vcf[.gz]..."
7
+ end
8
+
9
+ ARGV.each do |pathname|
10
+ # TODO: support an --output=MIME option:
11
+ writer = RDF::Writer.for(:nquads).new($stdout)
12
+
13
+ RDF::VCF::Reader.open(pathname) do |reader|
14
+ reader.each_statement do |statement|
15
+ begin
16
+ writer << statement
17
+ rescue => error
18
+ # @see https://github.com/ruby-rdf/rdf/issues/210
19
+ warn error # skip statements that cannot be serialized
20
+ end
21
+ end
22
+ end
23
+ end
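The loop in `vcf2rdf` writes each statement and skips any that cannot be serialized. The same skip-and-warn pattern can be sketched with the standard library only. Here `serialize` and `write_all` are hypothetical stand-ins, not RDF.rb's writer: they emit a simplified N-Triples-style line and treat every object as a plain literal.

```ruby
require 'stringio'

# Hypothetical stand-in for an RDF writer (not RDF.rb's implementation).
# Renders one [subject, predicate, object] triple as a simplified
# N-Triples-style line; the object is always written as a plain literal.
def serialize(triple)
  subject, predicate, object = triple
  if [subject, predicate, object].any?(&:nil?)
    raise ArgumentError, "incomplete triple: #{triple.inspect}"
  end
  # Escape backslashes first, then quotes and newlines.
  escaped = object.to_s
                  .gsub('\\') { '\\\\' }
                  .gsub('"') { '\\"' }
                  .gsub("\n") { '\\n' }
  %(<#{subject}> <#{predicate}> "#{escaped}" .)
end

# Writes every triple that serializes cleanly and reports the rest on
# `err`, mirroring the begin/rescue/warn loop in the CLI script.
# Returns the number of triples written.
def write_all(triples, out: $stdout, err: $stderr)
  written = 0
  triples.each do |triple|
    begin
      out.puts serialize(triple)
      written += 1
    rescue => error
      err.puts error.message # skip triples that cannot be serialized
    end
  end
  written
end
```

Rescuing per statement keeps one malformed record from aborting the conversion of a large file, at the cost of silently thinner output if the warnings are ignored.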
@@ -0,0 +1 @@
1
+ require 'rdf/vcf'
@@ -0,0 +1,3 @@
1
+ require 'rdf/vcf/reader'
2
+ require 'rdf/vcf/record'
3
+ require 'rdf/vcf/version'
Binary file
@@ -0,0 +1,146 @@
1
+ require 'java' # requires JRuby
2
+ require 'rdf/vcf/jar/htsjdk-1.139.jar'
3
+ require 'rdf/vcf/jar/bzip2.jar'
4
+ require 'digest/md5'
5
+
6
+ require 'rdf/vcf/record'
7
+
8
+ module RDF; module VCF
9
+ ##
10
+ # VCF file reader.
11
+ #
12
+ # This is a user-friendly wrapper for the HTSJDK implementation.
13
+ #
14
+ # @see https://github.com/samtools/htsjdk
15
+ # @see https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/vcf/VCFFileReader.html
16
+ class Reader
17
+ java_import 'htsjdk.variant.vcf.VCFFileReader'
18
+
19
+ REF_BASE_URI = 'http://rdf.ebi.ac.uk/resource/ensembl/%s/chromosome:%s:%s'.freeze
20
+
21
+ def ref_base_uri #
22
+ @file_iri ||= RDF::URI(REF_BASE_URI % [reference_md5, file_date || "" , source || ""])
23
+ end
24
+
25
+
26
+ ##
27
+ # @param [#to_s] pathname
28
+ def self.open(pathname, &block)
29
+ reader = self.new(pathname)
30
+ block.call(reader)
31
+ ensure
32
+ reader.close if reader # release the underlying HTSJDK reader
33
+ end
34
+
35
+ ##
36
+ # @param [#to_s] pathname
37
+ def initialize(pathname)
38
+ pathname = pathname.to_s
39
+ @vcf_file = java.io.File.new(pathname)
40
+ @tbi_file = java.io.File.new("#{pathname}.tbi") rescue nil
41
+ @reader = VCFFileReader.new(@vcf_file, @tbi_file, false)
42
+ end
43
+
44
+ ##
45
+ # @return [Boolean]
46
+ def closed?
47
+ @reader.nil?
48
+ end
49
+
50
+ ##
51
+ # @return [void]
52
+ def close
53
+ @reader.close if @reader
54
+ ensure
55
+ @reader, @vcf_file, @tbi_file = nil, nil, nil
56
+ end
57
+
58
+ ##
59
+ # @yield [statement]
60
+ # @yieldparam [RDF::Statement] statement
61
+ # @yieldreturn [void]
62
+ # @return [void]
63
+ def each_statement(&block)
64
+ self.each_record do |record|
65
+ record.to_rdf.each(&block)
66
+ end
67
+ end
68
+
69
+ ##
70
+ # @yield [record]
71
+ # @yieldparam [Record] record
72
+ # @yieldreturn [void]
73
+ # @return [void]
74
+ def each_record(&block)
75
+ return unless @reader
76
+ @reader.iterator.each do |variant_context| # VariantContext
77
+ record = Record.new(variant_context, self)
78
+ block.call(record)
79
+ end
80
+ end
81
+
82
+ ##
83
+ # @param [String] chromosome
84
+ # @param [Integer] start_pos
85
+ # @param [Integer] end_pos
86
+ # @yield [record]
87
+ # @yieldparam [Record] record
88
+ # @yieldreturn [void]
89
+ # @return [void]
90
+ def find_records(chromosome: nil, start_pos: nil, end_pos: nil, &block)
91
+ return unless @reader
92
+ start_pos ||= 0
93
+ end_pos ||= java.lang.Integer::MAX_VALUE
94
+ @reader.query(chromosome, start_pos, end_pos).each do |variant_context| # VariantContext
95
+ record = Record.new(variant_context, self)
96
+ block.call(record)
97
+ end
98
+ end
99
+
100
+ ##
101
+ # @param [Integer] pos
102
+ # @return [Boolean]
103
+ def has_position?(pos)
104
+ true # TODO
105
+ end
106
+
107
+ ##
108
+ # @return [String]
109
+ def reference()
110
+ @reference ||= @reader.getFileHeader().getMetaDataLine("reference").getValue rescue nil
111
+ end
112
+
113
+ def reference_md5
114
+ ::Digest::MD5.hexdigest(reference || "")
115
+ end
116
+
117
+
118
+ ##
119
+ # @return [String]
120
+ def file_date
121
+ @file_date ||= @reader.getFileHeader().getMetaDataLine("fileDate").getValue rescue nil
122
+ end
123
+
124
+ ##
125
+ # @return [String]
126
+ def source
127
+ @source ||= @reader.getFileHeader().getMetaDataLine("source").getValue rescue nil
128
+ end
129
+
130
+
131
+
132
+ end # Reader
133
+ end; end # RDF::VCF
134
+
135
+ if $0 == __FILE__
136
+ RDF::VCF::Reader.open('Homo_sapiens.vcf.gz') do |file|
137
+ p file
138
+ file.find_records(chromosome: "Y") do |record|
139
+ #file.each_record do |record|
140
+ p record
141
+ #record.to_rdf.each do |statement|
142
+ # p statement
143
+ #end
144
+ end
145
+ end
146
+ end
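How the reader derives its reference base URI from the header fields can be shown without JRuby or HTSJDK. This is a minimal sketch: the URI template is copied from `REF_BASE_URI` above, while the standalone `ref_base_uri` function and the header values in the usage note are illustrative assumptions.

```ruby
require 'digest/md5'

# Template copied from RDF::VCF::Reader::REF_BASE_URI.
REF_BASE_URI = 'http://rdf.ebi.ac.uk/resource/ensembl/%s/chromosome:%s:%s'.freeze

# Standalone version of the reader's URI construction. The three arguments
# stand in for the "reference", "fileDate" and "source" header values;
# missing values fall back to an empty string, as in the reader.
def ref_base_uri(reference, file_date, source)
  reference_md5 = Digest::MD5.hexdigest(reference || '')
  REF_BASE_URI % [reference_md5, file_date || '', source || '']
end
```

For example, `ref_base_uri('GRCh38', '20150918', 'myProgram')` (hypothetical header values) yields a URI whose first path segment is the MD5 digest of the reference string. When all three headers are missing, the result still has the template's shape but no longer distinguishes one file from another.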
@@ -0,0 +1,168 @@
1
+ require 'java' # requires JRuby
2
+ require 'rdf/vcf/jar/htsjdk-1.139.jar'
3
+ require 'rdf/vcf/jar/bzip2.jar'
4
+
5
+ require 'digest/md5'
6
+ require 'rdf'
7
+
8
+ module RDF; module VCF
9
+ ##
10
+ # VCF file record.
11
+ #
12
+ # This is a user-friendly wrapper for the HTSJDK implementation.
13
+ #
14
+ # @see https://github.com/samtools/htsjdk
15
+ # @see https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/VariantContext.html
16
+ class Record
17
+ include RDF
18
+ java_import 'htsjdk.variant.variantcontext.VariantContext'
19
+
20
+ FALDO = RDF::Vocabulary.new("http://biohackathon.org/resource/faldo#")
21
+
22
+ VAR_BASE_URI = 'http://rdf.ebi.ac.uk/terms/ensemblvariation/%s'.freeze
23
+
24
+ ##
25
+ # @param [VariantContext] variant_context
26
+ def initialize(variant_context, reader)
27
+ @vcf = variant_context
28
+ @reader = reader
29
+ end
30
+
31
+ ##
32
+ # @return [String]
33
+ def id
34
+ @id ||= case (id = @vcf.getID)
35
+ when '.'
36
+ label = "%s:%s:%s-%s" % ['', @vcf.getChr, @vcf.getStart, @vcf.getEnd] # TODO: species
37
+ ::Digest::MD5.hexdigest(label)
38
+ else id
39
+ end
40
+ end
41
+
42
+ ##
43
+ # @return [String]
44
+ def uri
45
+ @uri ||= VAR_BASE_URI % self.id
46
+ end
47
+
48
+ ##
49
+ # @return [String]
50
+ def chromosome
51
+ @vcf.getChr
52
+ end
53
+
54
+ def start
55
+ @vcf.getStart
56
+ end
57
+
58
+ def stop
59
+ @vcf.getEnd
60
+ end
61
+
62
+ ##
63
+ # @return [Hash{String => String}]
64
+ def attributes
65
+ @vcf.getAttributes
66
+ end
67
+
68
+ ##
69
+ # @return [String]
70
+ def reference
71
+ @reader.reference
72
+ end
73
+
74
+ ##
75
+ # @return [String]
76
+ def file_date
77
+ @reader.file_date
78
+ end
79
+
80
+ ##
81
+ # @return [String]
82
+ def source
83
+ @reader.source
84
+ end
85
+
86
+ ##
87
+ # @return [URI]
88
+ def ref_base_uri
89
+ @reader.ref_base_uri
90
+ end
91
+
92
+
93
+ def get_alleles(&block)
94
+ @vcf.getAlleles.each do |allele|
95
+ block.call(allele)
96
+ end
97
+ end
98
+
99
+ def get_alternate_alleles(&block)
100
+ @vcf.getAlternateAlleles.each do |allele|
101
+ block.call(allele)
102
+ end
103
+ end
104
+ ##
105
+ # @return [String]
106
+ def get_reference_allele
107
+ @vcf.getReference.getBaseString
108
+ end
109
+
110
+ ##
111
+ # @return [String]
112
+ # def get_alternate_allele
113
+ # @vcf.getAlternateAlleles.first.getBaseString
114
+ # end
115
+
116
+ ##
117
+ # @return [Double]
118
+ def get_phred_scaled_qual
119
+ @vcf.getPhredScaledQual
120
+ end
121
+
122
+ ##
123
+ # @return [RDF::Graph]
124
+ def to_rdf
125
+ var_uri = RDF::URI(self.uri)
126
+ RDF::Graph.new do |graph|
127
+ graph << [var_uri, RDF::DC.identifier, self.id]
128
+ graph << [var_uri, RDF::RDFS.label, self.id]
129
+ graph << [self.ref_base_uri,DC.identifier,self.id]
130
+ faldoRegion = ref_base_uri+":#{self.start}-#{self.stop}:1"
131
+ graph << [var_uri,FALDO.location,faldoRegion]
132
+ graph << [faldoRegion,RDFS.label,"#{self.id}:#{self.start}-#{self.stop}:1"]
133
+ graph << [faldoRegion,RDF.type,FALDO.Region]
134
+ graph << [faldoRegion,FALDO.begin,self.ref_base_uri+":#{self.start}:1"]
135
+ graph << [faldoRegion,FALDO.end,self.ref_base_uri+":#{self.stop}:1"]
136
+ graph << [faldoRegion,FALDO.reference,self.ref_base_uri]
137
+ if self.start == self.stop
138
+ faldoExactPosition = self.ref_base_uri+":#{self.start}:1"
139
+ graph << [faldoExactPosition,RDF.type,FALDO.ExactPosition]
140
+ graph << [faldoExactPosition,RDF.type,FALDO.ForwardStrandPosition]
141
+ graph << [faldoExactPosition,FALDO.position,self.start]
142
+ graph << [faldoExactPosition,FALDO.reference,self.ref_base_uri]
143
+ end
144
+
145
+ # TODO check if there are multiple alleles and iterate over them
146
+ # TODO what happens if there is an insertion?
147
+ refAlleleURI = var_uri+"\##{get_reference_allele}"
148
+ graph << [var_uri,var_uri+":has_allele",refAlleleURI]
149
+ graph << [refAlleleURI,RDFS.label,"#{self.id} allele #{get_reference_allele}"]
150
+ graph << [refAlleleURI,RDF.type,var_uri+":reference_allele"]
151
+ self.get_alternate_alleles do |altAllele|
152
+ altAlleleURI = var_uri+"\##{altAllele.getBaseString}"
153
+ graph << [var_uri,var_uri+":has_allele",altAlleleURI]
154
+ graph << [altAlleleURI,RDFS.label,"#{self.id} allele #{altAllele.getBaseString}"]
155
+ graph << [altAlleleURI, RDF.type, var_uri+":ancestral_allele"]
156
+
157
+ end
158
+ graph << [var_uri, RDF::URI(VAR_BASE_URI % "vcf/quality"), RDF::Literal(get_phred_scaled_qual)]
159
+
160
+ @vcf.attributes.each do |k, v|
161
+ graph << [var_uri, RDF::URI( VAR_BASE_URI % "vcf/attribute\##{k}"), v] if v
162
+ end
163
+
164
+ end #new graph
165
+
166
+ end #to_rdf
167
+ end # Record
168
+ end; end # RDF::VCF
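The identifier and location logic in `Record` can be exercised in isolation. This is a minimal sketch using only the standard library: the format strings are copied from the class above, while the standalone functions `variant_id`, `variant_uri` and `faldo_region` are hypothetical names introduced for illustration.

```ruby
require 'digest/md5'

# Template copied from RDF::VCF::Record::VAR_BASE_URI.
VAR_BASE_URI = 'http://rdf.ebi.ac.uk/terms/ensemblvariation/%s'.freeze

# Returns the VCF ID column as-is, or, when the ID is missing ('.'),
# an MD5 digest of ":<chromosome>:<start>-<stop>" (the leading empty
# field is the species placeholder from the original TODO).
def variant_id(id, chromosome, start, stop)
  return id unless id == '.'
  Digest::MD5.hexdigest('%s:%s:%s-%s' % ['', chromosome, start, stop])
end

# URI of the variant, built from its identifier.
def variant_uri(id, chromosome, start, stop)
  VAR_BASE_URI % variant_id(id, chromosome, start, stop)
end

# URI of the FALDO region: base URI, start-stop range, forward strand (1).
def faldo_region(ref_base_uri, start, stop)
  "#{ref_base_uri}:#{start}-#{stop}:1"
end
```

Because the fallback digest depends only on chromosome and coordinates, two unnamed variants at the same position would share one URI until the species field is filled in.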
@@ -0,0 +1,19 @@
1
+ module RDF; module VCF
2
+ module VERSION
3
+ FILE = File.expand_path('../../../VERSION', __dir__).freeze
4
+ STRING = File.read(FILE).chomp.freeze
5
+ MAJOR, MINOR, TINY, EXTRA = STRING.split('.').map(&:to_i).freeze
6
+
7
+ ##
8
+ # @return [String]
9
+ def self.to_s() STRING end
10
+
11
+ ##
12
+ # @return [String]
13
+ def self.to_str() STRING end
14
+
15
+ ##
16
+ # @return [Array(Integer, Integer, Integer)]
17
+ def self.to_a() [MAJOR, MINOR, TINY].freeze end
18
+ end # VERSION
19
+ end; end # RDF::VCF
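The version constants above come from splitting the contents of the `VERSION` file on dots. A minimal sketch of that parsing, with a hypothetical `parse_version` helper that takes the string directly instead of reading the file:

```ruby
# Hypothetical helper mirroring the constant assignment in version.rb:
# splits "MAJOR.MINOR.TINY[.EXTRA]" and converts each part with to_i.
def parse_version(string)
  major, minor, tiny, extra = string.chomp.split('.').map(&:to_i)
  { major: major, minor: minor, tiny: tiny, extra: extra }
end
```

With a three-part string such as `"0.1.2"`, `extra` is `nil`. A non-numeric fourth part such as `"pre"` becomes `0`, because `String#to_i` returns 0 for text without leading digits.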
metadata ADDED
@@ -0,0 +1,119 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: rdf-vcf
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.2
5
+ platform: ruby
6
+ authors:
7
+ - Arto Bendiken
8
+ - Raoul J.P. Bonnal
9
+ - Francesco Strozzi
10
+ autorequire:
11
+ bindir: bin
12
+ cert_chain: []
13
+ date: 2015-09-18 00:00:00.000000000 Z
14
+ dependencies:
15
+ - !ruby/object:Gem::Dependency
16
+ requirement: !ruby/object:Gem::Requirement
17
+ requirements:
18
+ - - "~>"
19
+ - !ruby/object:Gem::Version
20
+ version: '1.1'
21
+ name: rdf
22
+ prerelease: false
23
+ type: :runtime
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ requirements:
26
+ - - "~>"
27
+ - !ruby/object:Gem::Version
28
+ version: '1.1'
29
+ - !ruby/object:Gem::Dependency
30
+ requirement: !ruby/object:Gem::Requirement
31
+ requirements:
32
+ - - "~>"
33
+ - !ruby/object:Gem::Version
34
+ version: '10'
35
+ name: rake
36
+ prerelease: false
37
+ type: :development
38
+ version_requirements: !ruby/object:Gem::Requirement
39
+ requirements:
40
+ - - "~>"
41
+ - !ruby/object:Gem::Version
42
+ version: '10'
43
+ - !ruby/object:Gem::Dependency
44
+ requirement: !ruby/object:Gem::Requirement
45
+ requirements:
46
+ - - "~>"
47
+ - !ruby/object:Gem::Version
48
+ version: '3'
49
+ name: rspec
50
+ prerelease: false
51
+ type: :development
52
+ version_requirements: !ruby/object:Gem::Requirement
53
+ requirements:
54
+ - - "~>"
55
+ - !ruby/object:Gem::Version
56
+ version: '3'
57
+ - !ruby/object:Gem::Dependency
58
+ requirement: !ruby/object:Gem::Requirement
59
+ requirements:
60
+ - - "~>"
61
+ - !ruby/object:Gem::Version
62
+ version: '0.8'
63
+ name: yard
64
+ prerelease: false
65
+ type: :development
66
+ version_requirements: !ruby/object:Gem::Requirement
67
+ requirements:
68
+ - - "~>"
69
+ - !ruby/object:Gem::Version
70
+ version: '0.8'
71
+ description: |2
72
+ This is an RDF.rb reader plugin for Variant Call Format (VCF) files,
73
+ widely used in bioinformatics.
74
+ email: public-rdf-ruby@w3.org
75
+ executables:
76
+ - vcf2rdf
77
+ extensions: []
78
+ extra_rdoc_files: []
79
+ files:
80
+ - AUTHORS
81
+ - CREDITS
82
+ - README
83
+ - UNLICENSE
84
+ - VERSION
85
+ - bin/vcf2rdf
86
+ - lib/rdf-vcf.rb
87
+ - lib/rdf/vcf.rb
88
+ - lib/rdf/vcf/jar/bzip2.jar
89
+ - lib/rdf/vcf/jar/htsjdk-1.139.jar
90
+ - lib/rdf/vcf/reader.rb
91
+ - lib/rdf/vcf/record.rb
92
+ - lib/rdf/vcf/version.rb
93
+ homepage: https://github.com/ruby-rdf/rdf-vcf
94
+ licenses:
95
+ - Public Domain
96
+ metadata: {}
97
+ post_install_message:
98
+ rdoc_options: []
99
+ require_paths:
100
+ - lib
101
+ required_ruby_version: !ruby/object:Gem::Requirement
102
+ requirements:
103
+ - - ">="
104
+ - !ruby/object:Gem::Version
105
+ version: '2.2'
106
+ required_rubygems_version: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - ">="
109
+ - !ruby/object:Gem::Version
110
+ version: '2.2'
111
+ requirements:
112
+ - JRuby (>= 9.0)
113
+ rubyforge_project:
114
+ rubygems_version: 2.4.8
115
+ signing_key:
116
+ specification_version: 4
117
+ summary: RDF.rb reader for Variant Call Format (VCF) files.
118
+ test_files: []
119
+ has_rdoc: false