rdf-vcf 0.1.2

checksums.yaml.gz ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 0308e08c49b3ab4dafd99a2c4d7e1eb6f290f5d8
4
+ data.tar.gz: 62ab5fbfbb17000b1e77ae49347cc08a43466fbd
5
+ SHA512:
6
+ metadata.gz: 6c91ef56c3884e0308a5ecf9c816a531eabb2d79fb18b0ec0e6d64a1fafd2dbfb2088d83d39ba368b04ab8bbd091cb2ec4a41459e807cc6977d5b2c8fad155dc
7
+ data.tar.gz: 4b7ceefbdcf83a81f1aad75410c7062dd37acbecc3d6d2207faee57574908683024ed721db674b5fba882ab675f3acc8d1399564d48a7352f6b7b36b5b97920d
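The digests listed above cover the `metadata.gz` and `data.tar.gz` members of the gem archive, so they can be checked against a downloaded copy. The following is a minimal sketch, assuming the member has already been extracted to disk; `checksum_matches?` is a hypothetical helper and not part of RubyGems or rdf-vcf.

```ruby
require 'digest/sha2'

# Hypothetical helper (an assumption for illustration): returns true when
# the SHA-512 hex digest of the file at `path` equals `expected_hex`,
# for example the data.tar.gz value from the checksum listing.
def checksum_matches?(path, expected_hex)
  actual = Digest::SHA512.file(path).hexdigest
  actual == expected_hex.strip.downcase
end
```

A call such as `checksum_matches?('data.tar.gz', expected)` would then compare the extracted archive against the SHA512 entry; the same approach works for the SHA1 entries with `Digest::SHA1`.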
data/AUTHORS ADDED
@@ -0,0 +1,3 @@
1
+ * Arto Bendiken <arto@bendiken.net>
2
+ * Raoul J.P. Bonnal <ilpuccio.febo@gmail.com>
3
+ * Francesco Strozzi <francesco.strozzi@gmail.com>
data/CREDITS ADDED
File without changes
data/README ADDED
@@ -0,0 +1,79 @@
1
+ RDF::VCF
2
+ ========
3
+
4
+ This is an [RDF.rb](https://github.com/ruby-rdf/rdf) reader plugin for
5
+ Variant Call Format (VCF) files, widely used in bioinformatics.
6
+
7
+ This project grew out of [BioHackathon 2014](http://2014.biohackathon.org/)
8
+ [work](https://github.com/dbcls/bh14/wiki/On-The-Fly-RDF-converter) by
9
+ [Raoul J.P. Bonnal](https://github.com/helios) and [Francesco
10
+ Strozzi](https://github.com/fstrozzi), and was further developed during
11
+ [BioHackathon 2015](http://2015.biohackathon.org/).
12
+
13
+ Note: at present, the project requires JRuby 9.0 (or newer) due to the
14
+ Java-based VCF parser. We hope to eventually switch to the pure-Ruby
15
+ [Bio-vcf](https://github.com/pjotrp/bioruby-vcf) parser instead.
16
+
17
+ Features
18
+ ========
19
+
20
+ * Implements an RDF.rb reader for VCF and BCF files, including
21
+ bgzipped files.
22
+ * Includes a CLI tool called `vcf2rdf` to transform VCF files into RDF.
23
+
24
+ Examples
25
+ ========
26
+
27
+ Reading VCF Files
28
+ -----------------
29
+
30
+ The gem can be used like any other RDF.rb reader plugin:
31
+
32
+ require 'rdf/vcf'
33
+
34
+ RDF::VCF::Reader.open('Homo_sapiens.vcf.gz') do |reader|
35
+ reader.each_statement do |statement|
36
+ p statement
37
+ end
38
+ end
39
+
40
+ Command-Line Interface (CLI)
41
+ ----------------------------
42
+
43
+ The gem includes a CLI tool called `vcf2rdf` which transforms VCF files into
44
+ RDF (currently, N-Triples):
45
+
46
+ vcf2rdf your_vcf_file.vcf.gz
47
+
48
+ Input files can be either plain-text VCF or compressed with `bgzip`, as
49
+ above.
50
+
51
+ Notes
52
+ -----
53
+
54
+ Please create the [`tabix`](http://www.htslib.org/doc/tabix.html) index by
55
+ yourself. In the future, this gem will create the index automatically.
56
+
57
+ Dependencies
58
+ ============
59
+
60
+ * [JRuby](http://jruby.org) (>= 9.0)
61
+ * [RDF.rb](https://github.com/ruby-rdf/rdf) (~> 1.1)
62
+
63
+ Mailing List
64
+ ============
65
+
66
+ * http://lists.w3.org/Archives/Public/public-rdf-ruby/
67
+
68
+ Authors
69
+ =======
70
+
71
+ * [Arto Bendiken](https://github.com/bendiken)
72
+ * [Raoul J.P. Bonnal](https://github.com/helios)
73
+ * [Francesco Strozzi](https://github.com/fstrozzi)
74
+
75
+ License
76
+ =======
77
+
78
+ This is free and unencumbered public domain software. For more information,
79
+ see <http://unlicense.org/> or the accompanying {file:UNLICENSE} file.
data/UNLICENSE ADDED
@@ -0,0 +1,24 @@
1
+ This is free and unencumbered software released into the public domain.
2
+
3
+ Anyone is free to copy, modify, publish, use, compile, sell, or
4
+ distribute this software, either in source code form or as a compiled
5
+ binary, for any purpose, commercial or non-commercial, and by any
6
+ means.
7
+
8
+ In jurisdictions that recognize copyright laws, the author or authors
9
+ of this software dedicate any and all copyright interest in the
10
+ software to the public domain. We make this dedication for the benefit
11
+ of the public at large and to the detriment of our heirs and
12
+ successors. We intend this dedication to be an overt act of
13
+ relinquishment in perpetuity of all present and future rights to this
14
+ software under copyright law.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
19
+ IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
20
+ OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
21
+ ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
22
+ OTHER DEALINGS IN THE SOFTWARE.
23
+
24
+ For more information, please refer to <http://unlicense.org/>
data/VERSION ADDED
@@ -0,0 +1 @@
1
+ 0.1.2
data/bin/vcf2rdf ADDED
@@ -0,0 +1,23 @@
1
+ #!/usr/bin/env jruby
2
+ $:.unshift(File.expand_path('../lib', __dir__))
3
+ require 'rdf/vcf'
4
+
5
+ if ARGV.empty? || ARGV.include?('--help') || ARGV.include?('-h')
6
+ abort "usage: #{$0} file.vcf[.gz]..."
7
+ end
8
+
9
+ ARGV.each do |pathname|
10
+ # TODO: support an --output=MIME option:
11
+ writer = RDF::Writer.for(:nquads).new($stdout)
12
+
13
+ RDF::VCF::Reader.open(pathname) do |reader|
14
+ reader.each_statement do |statement|
15
+ begin
16
+ writer << statement
17
+ rescue => error
18
+ # @see https://github.com/ruby-rdf/rdf/issues/210
19
+ warn error # skip statements that cannot be serialized
20
+ end
21
+ end
22
+ end
23
+ end
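The script above emits one serialized statement per line on standard output. As a rough illustration of that line format (this is not RDF.rb's implementation), the sketch below renders a triple as an N-Triples line, which is also a valid N-Quads line in the default graph. `IRI`, `nt_term`, and `nt_line` are hypothetical names, and only a handful of string escapes are handled.

```ruby
# Hypothetical wrapper used to tell IRIs apart from literal values.
IRI = Struct.new(:value)

# Characters that must be escaped inside a quoted literal (partial list).
ESCAPES = {
  "\\" => "\\\\", '"' => '\\"', "\n" => '\\n', "\r" => '\\r', "\t" => '\\t'
}.freeze

# Render a single term: IRIs in angle brackets, integers as typed
# literals, everything else as a plain quoted string.
def nt_term(term)
  case term
  when IRI
    "<#{term.value}>"
  when Integer
    %("#{term}"^^<http://www.w3.org/2001/XMLSchema#integer>)
  else
    %("#{term.to_s.gsub(/[\\"\n\r\t]/, ESCAPES)}")
  end
end

# Render a full statement as one line terminated by " ."
def nt_line(subject, predicate, object)
  "#{nt_term(subject)} #{nt_term(predicate)} #{nt_term(object)} ."
end
```

In the real tool this work is delegated to the writer returned by `RDF::Writer.for(:nquads)`, which also handles language tags, blank nodes, and the full escaping rules that this sketch leaves out.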
data/lib/rdf-vcf.rb ADDED
@@ -0,0 +1 @@
1
+ require 'rdf/vcf'
data/lib/rdf/vcf.rb ADDED
@@ -0,0 +1,3 @@
1
+ require 'rdf/vcf/reader'
2
+ require 'rdf/vcf/record'
3
+ require 'rdf/vcf/version'
Binary file
data/lib/rdf/vcf/reader.rb ADDED
@@ -0,0 +1,146 @@
1
+ require 'java' # requires JRuby
2
+ require 'rdf/vcf/jar/htsjdk-1.139.jar'
3
+ require 'rdf/vcf/jar/bzip2.jar'
4
+ require 'digest/md5'
5
+
6
+ require 'rdf/vcf/record'
7
+
8
+ module RDF; module VCF
9
+ ##
10
+ # VCF file reader.
11
+ #
12
+ # This is a user-friendly wrapper for the HTSJDK implementation.
13
+ #
14
+ # @see https://github.com/samtools/htsjdk
15
+ # @see https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/vcf/VCFFileReader.html
16
+ class Reader
17
+ java_import 'htsjdk.variant.vcf.VCFFileReader'
18
+
19
+ REF_BASE_URI = 'http://rdf.ebi.ac.uk/resource/ensembl/%s/chromosome:%s:%s'.freeze
20
+
21
+ def ref_base_uri
22
+ @file_iri ||= RDF::URI(REF_BASE_URI % [reference_md5, file_date || "", source || ""])
23
+ end
24
+
25
+
26
+ ##
27
+ # @param [#to_s] pathname
28
+ def self.open(pathname, &block)
29
+ reader = self.new(pathname)
30
+ block.call(reader)
31
+ ensure
32
+ reader.close if reader
33
+ end
34
+
35
+ ##
36
+ # @param [#to_s] pathname
37
+ def initialize(pathname)
38
+ pathname = pathname.to_s
39
+ @vcf_file = java.io.File.new(pathname)
40
+ @tbi_file = java.io.File.new("#{pathname}.tbi") rescue nil
41
+ @reader = VCFFileReader.new(@vcf_file, @tbi_file, false)
42
+ end
43
+
44
+ ##
45
+ # @return [Boolean]
46
+ def closed?
47
+ @reader.nil?
48
+ end
49
+
50
+ ##
51
+ # @return [void]
52
+ def close
53
+ @reader.close if @reader
54
+ ensure
55
+ @reader, @vcf_file, @tbi_file = nil, nil, nil
56
+ end
57
+
58
+ ##
59
+ # @yield [statement]
60
+ # @yieldparam [RDF::Statement] statement
61
+ # @yieldreturn [void]
62
+ # @return [void]
63
+ def each_statement(&block)
64
+ self.each_record do |record|
65
+ record.to_rdf.each(&block)
66
+ end
67
+ end
68
+
69
+ ##
70
+ # @yield [record]
71
+ # @yieldparam [Record] record
72
+ # @yieldreturn [void]
73
+ # @return [void]
74
+ def each_record(&block)
75
+ return unless @reader
76
+ @reader.iterator.each do |variant_context| # VariantContext
77
+ record = Record.new(variant_context, self)
78
+ block.call(record)
79
+ end
80
+ end
81
+
82
+ ##
83
+ # @param [String] chromosome
84
+ # @param [Integer] start_pos
85
+ # @param [Integer] end_pos
86
+ # @yield [record]
87
+ # @yieldparam [Record] record
88
+ # @yieldreturn [void]
89
+ # @return [void]
90
+ def find_records(chromosome: nil, start_pos: nil, end_pos: nil, &block)
91
+ return unless @reader
92
+ start_pos ||= 0
93
+ end_pos ||= java.lang.Integer::MAX_VALUE
94
+ @reader.query(chromosome, start_pos, end_pos).each do |variant_context| # VariantContext
95
+ record = Record.new(variant_context, self)
96
+ block.call(record)
97
+ end
98
+ end
99
+
100
+ ##
101
+ # @param [Integer] pos
102
+ # @return [Boolean]
103
+ def has_position?(pos)
104
+ true # TODO
105
+ end
106
+
107
+ ##
108
+ # @return [String]
109
+ def reference()
110
+ @reference ||= @reader.getFileHeader().getMetaDataLine("reference").getValue rescue nil
111
+ end
112
+
113
+ def reference_md5
114
+ ::Digest::MD5.hexdigest(reference || "")
115
+ end
116
+
117
+
118
+ ##
119
+ # @return [String]
120
+ def file_date
121
+ @file_date ||= @reader.getFileHeader().getMetaDataLine("fileDate").getValue rescue nil
122
+ end
123
+
124
+ ##
125
+ # @return [String]
126
+ def source
127
+ @source ||= @reader.getFileHeader().getMetaDataLine("source").getValue rescue nil
128
+ end
129
+
130
+
131
+
132
+ end # Reader
133
+ end; end # RDF::VCF
134
+
135
+ if $0 == __FILE__
136
+ RDF::VCF::Reader.open('Homo_sapiens.vcf.gz') do |file|
137
+ p file
138
+ file.find_records(chromosome: "Y") do |record|
139
+ #file.each_record do |record|
140
+ p record
141
+ #record.to_rdf.each do |statement|
142
+ # p statement
143
+ #end
144
+ end
145
+ end
146
+ end
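`Reader#ref_base_uri` fills the three slots of `REF_BASE_URI` with the MD5 of whatever `reference` returns, the file date, and the source, falling back to empty strings when a header line is missing. The sketch below reproduces only that string formatting in plain Ruby, without JRuby or HTSJDK; `build_ref_base_uri` is a hypothetical stand-in that takes the header values as arguments.

```ruby
require 'digest/md5'

REF_BASE_URI = 'http://rdf.ebi.ac.uk/resource/ensembl/%s/chromosome:%s:%s'.freeze

# Hypothetical stand-in for Reader#ref_base_uri. Missing header values
# (nil) are replaced with empty strings before formatting, as in the source.
def build_ref_base_uri(reference, file_date, source)
  reference_md5 = Digest::MD5.hexdigest(reference || "")
  REF_BASE_URI % [reference_md5, file_date || "", source || ""]
end
```

One consequence visible in the sketch: two files that both lack the `reference`, `fileDate`, and `source` header lines would share the same base URI, since the MD5 of the empty string is constant.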
data/lib/rdf/vcf/record.rb ADDED
@@ -0,0 +1,168 @@
1
+ require 'java' # requires JRuby
2
+ require 'rdf/vcf/jar/htsjdk-1.139.jar'
3
+ require 'rdf/vcf/jar/bzip2.jar'
4
+
5
+ require 'digest/md5'
6
+ require 'rdf'
7
+
8
+ module RDF; module VCF
9
+ ##
10
+ # VCF file record.
11
+ #
12
+ # This is a user-friendly wrapper for the HTSJDK implementation.
13
+ #
14
+ # @see https://github.com/samtools/htsjdk
15
+ # @see https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/VariantContext.html
16
+ class Record
17
+ include RDF
18
+ java_import 'htsjdk.variant.variantcontext.VariantContext'
19
+
20
+ FALDO = RDF::Vocabulary.new("http://biohackathon.org/resource/faldo#")
21
+
22
+ VAR_BASE_URI = 'http://rdf.ebi.ac.uk/terms/ensemblvariation/%s'.freeze
23
+
24
+ ##
25
+ # @param [VariantContext] variant_context
26
+ def initialize(variant_context, reader)
27
+ @vcf = variant_context
28
+ @reader = reader
29
+ end
30
+
31
+ ##
32
+ # @return [String]
33
+ def id
34
+ @id ||= case (id = @vcf.getID)
35
+ when '.'
36
+ label = "%s:%s:%s-%s" % ['', @vcf.getChr, @vcf.getStart, @vcf.getEnd] # TODO: species
37
+ ::Digest::MD5.hexdigest(label)
38
+ else id
39
+ end
40
+ end
41
+
42
+ ##
43
+ # @return [String]
44
+ def uri
45
+ @uri ||= VAR_BASE_URI % self.id
46
+ end
47
+
48
+ ##
49
+ # @return [String]
50
+ def chromosome
51
+ @vcf.getChr
52
+ end
53
+
54
+ def start
55
+ @vcf.getStart
56
+ end
57
+
58
+ def stop
59
+ @vcf.getEnd
60
+ end
61
+
62
+ ##
63
+ # @return [Hash{String => String}]
64
+ def attributes
65
+ @vcf.getAttributes
66
+ end
67
+
68
+ ##
69
+ # @return [String]
70
+ def reference
71
+ @reader.reference
72
+ end
73
+
74
+ ##
75
+ # @return [String]
76
+ def file_date
77
+ @reader.file_date
78
+ end
79
+
80
+ ##
81
+ # @return [String]
82
+ def source
83
+ @reader.source
84
+ end
85
+
86
+ ##
87
+ # @return [URI]
88
+ def ref_base_uri
89
+ @reader.ref_base_uri
90
+ end
91
+
92
+
93
+ def get_alleles(&block)
94
+ @vcf.getAlleles.each do |allele|
95
+ block.call(allele)
96
+ end
97
+ end
98
+
99
+ def get_alternate_alleles(&block)
100
+ @vcf.getAlternateAlleles.each do |allele|
101
+ block.call(allele)
102
+ end
103
+ end
104
+ ##
105
+ # @return [String]
106
+ def get_reference_allele
107
+ @vcf.getReference.getBaseString
108
+ end
109
+
110
+ ##
111
+ # @return [String]
112
+ # def get_alternate_allele
113
+ # @vcf.getAlternateAlleles.first.getBaseString
114
+ # end
115
+
116
+ ##
117
+ # @return [Double]
118
+ def get_phred_scaled_qual
119
+ @vcf.getPhredScaledQual
120
+ end
121
+
122
+ ##
123
+ # @return [RDF::Graph]
124
+ def to_rdf
125
+ var_uri = RDF::URI(self.uri)
126
+ RDF::Graph.new do |graph|
127
+ graph << [var_uri, RDF::DC.identifier, self.id]
128
+ graph << [var_uri, RDF::RDFS.label, self.id]
129
+ graph << [self.ref_base_uri,DC.identifier,self.id]
130
+ faldoRegion = ref_base_uri+":#{self.start}-#{self.stop}:1"
131
+ graph << [var_uri,FALDO.location,faldoRegion]
132
+ graph << [faldoRegion,RDFS.label,"#{self.id}:#{self.start}-#{self.stop}:1"]
133
+ graph << [faldoRegion,RDF.type,FALDO.Region]
134
+ graph << [faldoRegion,FALDO.begin,self.ref_base_uri+":#{self.start}:1"]
135
+ graph << [faldoRegion,FALDO.end,self.ref_base_uri+":#{self.stop}:1"]
136
+ graph << [faldoRegion,FALDO.reference,self.ref_base_uri]
137
+ if self.start == self.stop
138
+ faldoExactPosition = self.ref_base_uri+":#{self.start}:1"
139
+ graph << [faldoExactPosition,RDF.type,FALDO.ExactPosition]
140
+ graph << [faldoExactPosition,RDF.type,FALDO.ForwardStrandPosition]
141
+ graph << [faldoExactPosition,FALDO.position,self.start]
142
+ graph << [faldoExactPosition,FALDO.reference,self.ref_base_uri]
143
+ end
144
+
145
+ # TODO check if there are multiple alleles and iterate over them
146
+ # TODO what happens if there is an insertion?
147
+ refAlleleURI = var_uri+"\##{get_reference_allele}"
148
+ graph << [var_uri,var_uri+":has_allele",refAlleleURI]
149
+ graph << [refAlleleURI,RDFS.label,"#{self.id} allele #{get_reference_allele}"]
150
+ graph << [refAlleleURI,RDF.type,var_uri+":reference_allele"]
151
+ self.get_alternate_alleles do |altAllele|
152
+ altAlleleURI = var_uri+"\##{altAllele.getBaseString}"
153
+ graph << [var_uri,var_uri+":has_allele",altAlleleURI]
154
+ graph << [altAlleleURI,RDFS.label,"#{self.id} allele #{altAllele.getBaseString}"]
155
+ graph << [altAlleleURI, RDF.type, var_uri+":ancestral_allele"]
156
+
157
+ end
158
+ graph << [var_uri, RDF::URI(VAR_BASE_URI % "vcf/quality"), RDF::Literal(get_phred_scaled_qual)]
159
+
160
+ @vcf.attributes.each do |k, v|
161
+ graph << [var_uri, RDF::URI( VAR_BASE_URI % "vcf/attribute\##{k}"), v] if v
162
+ end
163
+
164
+ end #new graph
165
+
166
+ end #to_rdf
167
+ end # Record
168
+ end; end # RDF::VCF
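`Record#id` keeps the VCF ID column when it is set and otherwise derives an identifier from the variant's location, which `Record#uri` then substitutes into `VAR_BASE_URI`. The plain-Ruby sketch below mirrors that logic without HTSJDK; `variant_id` and `variant_uri` are hypothetical helpers that take the column values as arguments.

```ruby
require 'digest/md5'

VAR_BASE_URI = 'http://rdf.ebi.ac.uk/terms/ensemblvariation/%s'.freeze

# Hypothetical stand-in for Record#id: a real ID is returned unchanged;
# the VCF placeholder '.' is replaced by an MD5 of the location label.
def variant_id(vcf_id, chromosome, start, stop)
  return vcf_id unless vcf_id == '.'
  # The leading empty slot mirrors the species placeholder in the source.
  label = '%s:%s:%s-%s' % ['', chromosome, start, stop]
  Digest::MD5.hexdigest(label)
end

# Hypothetical stand-in for Record#uri.
def variant_uri(vcf_id, chromosome, start, stop)
  VAR_BASE_URI % variant_id(vcf_id, chromosome, start, stop)
end
```

Because the species slot is still empty in the label, unnamed variants at the same coordinates in different species would hash to the same identifier, which is presumably what the `TODO: species` comment in the source refers to.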
data/lib/rdf/vcf/version.rb ADDED
@@ -0,0 +1,19 @@
1
+ module RDF; module VCF
2
+ module VERSION
3
+ FILE = File.expand_path('../../../VERSION', __dir__).freeze
4
+ STRING = File.read(FILE).chomp.freeze
5
+ MAJOR, MINOR, TINY, EXTRA = STRING.split('.').map(&:to_i).freeze
6
+
7
+ ##
8
+ # @return [String]
9
+ def self.to_s() STRING end
10
+
11
+ ##
12
+ # @return [String]
13
+ def self.to_str() STRING end
14
+
15
+ ##
16
+ # @return [Array(Integer, Integer, Integer)]
17
+ def self.to_a() [MAJOR, MINOR, TINY].freeze end
18
+ end # VERSION
19
+ end; end # RDF::VCF
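The `VERSION` module splits the contents of the `VERSION` file on dots and converts each part with `to_i`. The sketch below shows what that destructuring yields; `parse_version` is a hypothetical helper that takes the string directly instead of reading the file.

```ruby
# Hypothetical helper mirroring the split/map step in RDF::VCF::VERSION.
def parse_version(string)
  string.chomp.split('.').map(&:to_i)
end

# With a three-part version the fourth variable is simply nil.
major, minor, tiny, extra = parse_version("0.1.2\n")
puts [major, minor, tiny, extra].inspect

# A non-numeric fourth part is coerced by to_i rather than preserved.
puts parse_version('0.1.2.pre').inspect
```

For `0.1.2` the `EXTRA` constant therefore ends up `nil`, and a suffix such as `pre` would become `0` because `String#to_i` returns 0 for non-numeric input.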
metadata ADDED
@@ -0,0 +1,119 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: rdf-vcf
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.2
5
+ platform: ruby
6
+ authors:
7
+ - Arto Bendiken
8
+ - Raoul J.P. Bonnal
9
+ - Francesco Strozzi
10
+ autorequire:
11
+ bindir: bin
12
+ cert_chain: []
13
+ date: 2015-09-18 00:00:00.000000000 Z
14
+ dependencies:
15
+ - !ruby/object:Gem::Dependency
16
+ requirement: !ruby/object:Gem::Requirement
17
+ requirements:
18
+ - - "~>"
19
+ - !ruby/object:Gem::Version
20
+ version: '1.1'
21
+ name: rdf
22
+ prerelease: false
23
+ type: :runtime
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ requirements:
26
+ - - "~>"
27
+ - !ruby/object:Gem::Version
28
+ version: '1.1'
29
+ - !ruby/object:Gem::Dependency
30
+ requirement: !ruby/object:Gem::Requirement
31
+ requirements:
32
+ - - "~>"
33
+ - !ruby/object:Gem::Version
34
+ version: '10'
35
+ name: rake
36
+ prerelease: false
37
+ type: :development
38
+ version_requirements: !ruby/object:Gem::Requirement
39
+ requirements:
40
+ - - "~>"
41
+ - !ruby/object:Gem::Version
42
+ version: '10'
43
+ - !ruby/object:Gem::Dependency
44
+ requirement: !ruby/object:Gem::Requirement
45
+ requirements:
46
+ - - "~>"
47
+ - !ruby/object:Gem::Version
48
+ version: '3'
49
+ name: rspec
50
+ prerelease: false
51
+ type: :development
52
+ version_requirements: !ruby/object:Gem::Requirement
53
+ requirements:
54
+ - - "~>"
55
+ - !ruby/object:Gem::Version
56
+ version: '3'
57
+ - !ruby/object:Gem::Dependency
58
+ requirement: !ruby/object:Gem::Requirement
59
+ requirements:
60
+ - - "~>"
61
+ - !ruby/object:Gem::Version
62
+ version: '0.8'
63
+ name: yard
64
+ prerelease: false
65
+ type: :development
66
+ version_requirements: !ruby/object:Gem::Requirement
67
+ requirements:
68
+ - - "~>"
69
+ - !ruby/object:Gem::Version
70
+ version: '0.8'
71
+ description: |2
72
+ This is an RDF.rb reader plugin for Variant Call Format (VCF) files,
73
+ widely used in bioinformatics.
74
+ email: public-rdf-ruby@w3.org
75
+ executables:
76
+ - vcf2rdf
77
+ extensions: []
78
+ extra_rdoc_files: []
79
+ files:
80
+ - AUTHORS
81
+ - CREDITS
82
+ - README
83
+ - UNLICENSE
84
+ - VERSION
85
+ - bin/vcf2rdf
86
+ - lib/rdf-vcf.rb
87
+ - lib/rdf/vcf.rb
88
+ - lib/rdf/vcf/jar/bzip2.jar
89
+ - lib/rdf/vcf/jar/htsjdk-1.139.jar
90
+ - lib/rdf/vcf/reader.rb
91
+ - lib/rdf/vcf/record.rb
92
+ - lib/rdf/vcf/version.rb
93
+ homepage: https://github.com/ruby-rdf/rdf-vcf
94
+ licenses:
95
+ - Public Domain
96
+ metadata: {}
97
+ post_install_message:
98
+ rdoc_options: []
99
+ require_paths:
100
+ - lib
101
+ required_ruby_version: !ruby/object:Gem::Requirement
102
+ requirements:
103
+ - - ">="
104
+ - !ruby/object:Gem::Version
105
+ version: '2.2'
106
+ required_rubygems_version: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - ">="
109
+ - !ruby/object:Gem::Version
110
+ version: '2.2'
111
+ requirements:
112
+ - JRuby (>= 9.0)
113
+ rubyforge_project:
114
+ rubygems_version: 2.4.8
115
+ signing_key:
116
+ specification_version: 4
117
+ summary: RDF.rb reader for Variant Call Format (VCF) files.
118
+ test_files: []
119
+ has_rdoc: false
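The gemspec pins its dependencies with the pessimistic operator, so `rdf ~> 1.1` accepts any 1.x release from 1.1 upward and rejects 2.0. A short sketch using the RubyGems classes that ship with Ruby shows how that constraint evaluates; the candidate version numbers are arbitrary examples.

```ruby
require 'rubygems'

# The runtime constraint on the rdf gem, as declared in the metadata above.
requirement = Gem::Requirement.new('~> 1.1')

# Arbitrary candidate versions, chosen only to probe the range boundaries.
%w[1.0 1.1 1.9.3 2.0].each do |candidate|
  accepted = requirement.satisfied_by?(Gem::Version.new(candidate))
  puts "rdf #{candidate}: #{accepted ? 'accepted' : 'rejected'}"
end
```

The same check applies to the development dependencies (`rake ~> 10`, `rspec ~> 3`, `yard ~> 0.8`), each of which allows updates only within the stated leading version.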