rdf-microdata 0.2.2 → 0.2.3

data/README CHANGED
@@ -6,13 +6,20 @@
6
6
  RDF::Microdata is a Microdata reader for Ruby using the [RDF.rb][RDF.rb] library suite.
7
7
 
8
8
  ## FEATURES
9
- RDF::Microdata parses [Microdata][] into statements or triples.
9
+ RDF::Microdata parses [Microdata][] into statements or triples using the rules defined in [Microdata RDF][].
10
10
 
11
11
  * Microdata parser.
12
- * Uses Nokogiri for parsing HTML
12
+ * Uses Nokogiri for parsing HTML/SVG if available; falls back to REXML otherwise (and for JRuby)
13
13
 
14
14
  Install with 'gem install rdf-microdata'
15
15
 
16
+ ### Living implementation
17
+ Microdata to RDF transformation is undergoing active development. This implementation attempts to be up-to-date
18
+ as of the time of release, and is being used in developing the [Microdata RDF][] specification.
19
+
20
+ ### Microdata Registry
21
+ The parser uses a built-in version of the [Microdata RDF][] registry.
22
+
16
23
  ## Usage
17
24
 
18
25
  ### Reading RDF data in the Microdata format
@@ -20,49 +27,14 @@ Install with 'gem install rdf-microdata'
20
27
  graph = RDF::Graph.load("etc/foaf.html", :format => :microdata)
21
28
 
22
29
  ## Note
23
- The Microdata editor has recently [dropped support for RDF
24
- conversion](http://html5.org/tools/web-apps-tracker?from=6426&to=6427), as a result, this gem is being used to
25
- investigate ways in which Microdata might have more satisfactory RDF generation.
26
-
27
- ### Generating RDF friendly URIs from terms
28
- If the `@itemprop` is included within an item having an `@itemtype`,
29
- the URI of the `@itemtype` will be used for generating a term URI. The type URI will be trimmed following
30
- the last '#' or '/' character, and the term will be appended to the resulting URI. This is in keeping
31
- with standard convention for defining properties and classes within an RDFS or OWL vocabulary.
32
-
33
- For example:
34
-
35
- <div itemscope itemtype="http://schema.org/Person">
36
- My name is <span itemprop="name">Gregg</span>
37
- </div>
38
-
39
- Without the `:rdf\_terms` option, this would create the following statements:
40
-
41
- @prefix md: <http://www.w3.org/1999/xhtml/microdata#> .
42
- @prefix schema: <http://schema.org/> .
43
- <> md:item [
44
- a schema:Person;
45
- <http://www.w3.org/1999/xhtml/microdata#http://schema.org/Person%23:name> "Gregg"
46
- ] .
47
-
48
- With the `:rdf\_terms` option, this becomes:
49
-
50
- @prefix md: <http://www.w3.org/1999/xhtml/microdata#> .
51
- @prefix schema: <http://schema.org/> .
52
- <> md:item [ a schema:Person; schema:name "Gregg" ] .
53
-
54
- ### Improve xsd:date, xsd:time, xsd:dateTime and xsd:duration generation from _time_ element
55
-
56
- Use the lexical form of the @datetime attribute of the _time_ element to determine the specific type
57
- of the generated literal.
58
-
59
- ### Remove implicit RDF triple generation
60
-
61
- html>head>title and anchor (_a_) elements no longer generate triples without @item* properties
62
-
30
+ This spec is based on the W3C HTML Data Task Force specification and does not support
31
+ GRDDL-type triple generation, such as for html>head>title and <a>.
32
+
63
33
  ## Dependencies
64
34
  * [RDF.rb](http://rubygems.org/gems/rdf) (>= 0.3.4)
65
- * [Nokogiri](http://rubygems.org/gems/nokogiri) (>= 1.3.3)
35
+ * [RDF::XSD](http://rubygems.org/gems/rdf-xsd) (>= 0.3.4)
36
+ * [HTMLEntities](https://rubygems.org/gems/htmlentities) (>= 4.3.0)
37
+ * Soft dependency on [Nokogiri](http://rubygems.org/gems/nokogiri) (>= 1.5.0)
66
38
 
67
39
  ## Documentation
68
40
  Full documentation available on [Rubydoc.info][Microdata doc]
@@ -71,6 +43,8 @@ Full documentation available on [Rubydoc.info][Microdata doc]
71
43
  * {RDF::Microdata::Format}
72
44
  Asserts :html format, text/html mime-type and .html file extension.
73
45
  * {RDF::Microdata::Reader}
46
+ * {RDF::Microdata::Reader::Nokogiri}
47
+ * {RDF::Microdata::Reader::REXML}
74
48
 
75
49
  ### Additional vocabularies
76
50
 
@@ -81,8 +55,9 @@ Full documentation available on [Rubydoc.info][Microdata doc]
81
55
  ## Resources
82
56
  * [RDF.rb][RDF.rb]
83
57
  * [Documentation](http://rdf.rubyforge.org/microdata)
84
- * [History](file:file.History.html)
58
+ * [History](file:History.md)
85
59
  * [Microdata][]
60
+ * [Microdata RDF][]
86
61
 
87
62
  ## Author
88
63
  * [Gregg Kellogg](http://github.com/gkellogg) - <http://kellogg-assoc.com/>
@@ -117,5 +92,6 @@ see <http://unlicense.org/> or the accompanying {file:UNLICENSE} file.
117
92
  [YARD]: http://yardoc.org/
118
93
  [YARD-GS]: http://rubydoc.info/docs/yard/file/docs/GettingStarted.md
119
94
  [PDD]: http://lists.w3.org/Archives/Public/public-rdf-ruby/2010May/0013.html
120
- [Microdata]: http://www.w3.org/TR/2011/WD-microdata-20110525/ "HTML Microdata"
95
+ [Microdata]: http://dev.w3.org/html5/md/Overview.html "HTML Microdata"
96
+ [Microdata RDF]: https://dvcs.w3.org/hg/htmldata/raw-file/default/microdata-rdf/index.html "Microdata to RDF"
121
97
  [Microdata doc]: http://rubydoc.info/github/gkellogg/rdf-microdata/frames
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.2.2
1
+ 0.2.3
data/etc/doap.html ADDED
@@ -0,0 +1,42 @@
1
+ <!DOCTYPE html>
2
+ <html itemscope itemid="http://rubygems.org/gems/rdf-microdata" itemtype="http://usefulinc.com/ns/doap#Project">
3
+ <head>
4
+ <title lang="en" itemprop="shortdesc">Microdata reader for Ruby.</title>
5
+ </head>
6
+ <body about="" typeof="Project">
7
+ <p>Project description for <span itemprop="name">RDF::Microdata</span>.</p>
8
+ <p lang="en" itemprop="description">
9
+ RDF::Microdata is a Microdata reader for Ruby using the RDF.rb library suite.
10
+ </p>
11
+ <dl>
12
+ <dt>Creator</dt><dd>
13
+ <a itemprop="http://purl.org/dc/terms/creator developer documenter maintainer http://xmlns.com/foaf/0.1/creator" href="http://greggkellogg.net/foaf#me"
14
+ >Gregg Kellogg</a>
15
+ </dd>
16
+ <dt>Created</dt><dd><time itemprop="created" datetime="2011-08-29"/></dd>
17
+ <dt>Blog</dt><dd><a href="http://greggkellogg.net/" itemprop="blog">http://greggkellogg.net/</a></dd>
18
+ <dt>Bug DB</dt><dd>
19
+ <a href="http://github.com/gkellogg/rdf-microdata/issues" itemprop="bug-database">
20
+ http://github.com/gkellogg/rdf-microdata/issues
21
+ </a>
22
+ </dd>
23
+ <dt>Category</dt><dd itemprop="category">
24
+ <a href="http://dbpedia.org/resource/Resource_Description_Framework">Resource Description Framework</a>
25
+ for
26
+ <a itemprop="programming-language" href="http://dbpedia.org/resource/Ruby_(programming_language)">Ruby</a>
27
+ </dd>
28
+ <dt>Download</dt><dd><a href="http://rubygems.org/gems/rdf-microdata" itemprop="download-page">
29
+ http://rubygems.org/gems/rdf-microdata
30
+ </a></dd>
31
+ <dt>Home Page</dt><dd><a href="http://github.com/gkellogg/rdf-microdata" itemprop="homepage">
32
+ http://github.com/gkellogg/rdf-microdata
33
+ </a></dd>
34
+ <dt>License</dt><dd>
35
+ <a href="http://creativecommons.org/licenses/publicdomain/" itemprop="license">Public Domain</a>
36
+ </dd>
37
+ <dt>Mailing List</dt><dd><a href="http://lists.w3.org/Archives/Public/public-rdf-ruby/" itemprop="mailing-list">
38
+ http://lists.w3.org/Archives/Public/public-rdf-ruby/
39
+ </a></dd>
40
+ </dl>
41
+ </body>
42
+ </html>
data/etc/registry.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "http://schema.org/": {
3
+ "propertyURI": "vocabulary",
4
+ "multipleValues": "unordered",
5
+ "properties": {
6
+ "blogPosts": {"multipleValues": "list"},
7
+ "breadcrumb": {"multipleValues": "list"},
8
+ "byArtist": {"multipleValues": "list"},
9
+ "creator": {"multipleValues": "list"},
10
+ "episodes": {"multipleValues": "list"},
11
+ "events": {"multipleValues": "list"},
12
+ "founders": {"multipleValues": "list"},
13
+ "itemListElement": {"multipleValues": "list"},
14
+ "musicGroupMember": {"multipleValues": "list"},
15
+ "performerIn": {"multipleValues": "list"},
16
+ "performers": {"multipleValues": "list"},
17
+ "producer": {"multipleValues": "list"},
18
+ "recipeInstructions": {"multipleValues": "list"},
19
+ "seasons": {"multipleValues": "list"},
20
+ "subEvents": {"multipleValues": "list"},
21
+ "tracks": {"multipleValues": "list"}
22
+ }
23
+ },
24
+ "http://microformats.org/profile/hcard": {
25
+ "propertyURI": "vocabulary",
26
+ "multipleValues": "unordered"
27
+ },
28
+ "http://microformats.org/profile/hcalendar#": {
29
+ "propertyURI": "vocabulary",
30
+ "multipleValues": "unordered",
31
+ "properties": {
32
+ "categories": {"multipleValues": "list"}
33
+ }
34
+ },
35
+ "http://n.whatwg.org/work": {
36
+ "propertyURI": "contextual",
37
+ "multipleValues": "list"
38
+ }
39
+ }
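Note on the registry format: the following is a minimal, stdlib-only sketch of how a reader might consult a file shaped like `etc/registry.json`. It embeds a trimmed copy of the JSON above, matches a type URI against the registered prefixes, and checks whether a property's values are marked as a `list`. The helper names `find_entry` and `list_valued?` are hypothetical and are not part of the gem.

```ruby
require 'json'

# Trimmed copy of the registry above (two prefixes only), for illustration.
REGISTRY_JSON = <<-JSON
{
  "http://schema.org/": {
    "propertyURI": "vocabulary",
    "multipleValues": "unordered",
    "properties": {
      "tracks": {"multipleValues": "list"}
    }
  },
  "http://n.whatwg.org/work": {
    "propertyURI": "contextual",
    "multipleValues": "list"
  }
}
JSON

# Hypothetical helper: return [prefix, entry] for the first registered
# prefix that the type URI starts with, or nil when nothing matches.
def find_entry(registry, type)
  registry.find { |prefix, _| type.start_with?(prefix) }
end

# Hypothetical helper: true when multiple values of `property` are marked
# as "list"; a per-property setting overrides the prefix-wide default.
def list_valued?(entry, property)
  override = entry.fetch("properties", {})[property]
  setting = override ? override["multipleValues"] : entry.fetch("multipleValues", "unordered")
  setting == "list"
end

registry = JSON.parse(REGISTRY_JSON)
prefix, entry = find_entry(registry, "http://schema.org/MusicPlaylist")
```

With this data, `tracks` on a schema.org type is list-valued while any unlisted schema.org property falls back to the prefix default of `unordered`.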
data/lib/rdf/microdata.rb CHANGED
@@ -30,5 +30,3 @@ module RDF
30
30
  def self.debug=(value); @debug = value; end
31
31
  end
32
32
  end
33
-
34
- require 'rdf/microdata/extensions'
@@ -1,24 +1,33 @@
1
- require 'nokogiri' # FIXME: Implement using different modules as in RDF::TriX
1
+ begin
2
+ raise LoadError, "not with java" if RUBY_PLATFORM == "java"
3
+ require 'nokogiri'
4
+ rescue LoadError => e
5
+ :rexml
6
+ end
7
+ require 'rdf/xsd'
8
+ require 'json'
2
9
 
3
10
  module RDF::Microdata
4
11
  ##
5
12
  # A Microdata parser in Ruby
6
13
  #
7
14
  # Based on processing rules, amended with the following:
8
- # * property generation from tokens now uses the associated @itemtype as the basis for generation
9
- # * implicit triples are not generated, only those with @item*
10
- # * @datetime values are scanned lexically to find appropriate datatype
11
15
  #
12
- # @see http://dev.w3.org/html5/md/
16
+ # @see https://dvcs.w3.org/hg/htmldata/raw-file/0d6b89f5befb/microdata-rdf/index.html
13
17
  # @author [Gregg Kellogg](http://kellogg-assoc.com/)
14
18
  class Reader < RDF::Reader
15
19
  format Format
16
- XHTML = "http://www.w3.org/1999/xhtml"
17
20
  URL_PROPERTY_ELEMENTS = %w(a area audio embed iframe img link object source track video)
21
+ DEFAULT_REGISTRY = File.expand_path(File.join(File.dirname(__FILE__), "..", "..", "..", "etc", "registry.json"))
18
22
 
19
23
  class CrawlFailure < StandardError #:nodoc:
20
24
  end
21
25
 
26
+ # Returns the HTML implementation module for this reader instance.
27
+ #
28
+ # @attr_reader [Module]
29
+ attr_reader :implementation
30
+
22
31
  ##
23
32
  # Returns the base URI determined by this reader.
24
33
  #
@@ -31,6 +40,124 @@ module RDF::Microdata
31
40
  @options[:base_uri]
32
41
  end
33
42
 
43
+ # Interface to registry
44
+ class Registry
45
+ ##
46
+ # Initialize the registry from parsed registry JSON (prefix URI => settings)
47
+ #
48
+ # @param [Hash] json
49
+ def self.load_registry(json)
50
+ @prefixes = {}
51
+ json.each do |prefix, elements|
52
+ propertyURI = elements.fetch("propertyURI", "vocabulary").to_sym
53
+ multipleValues = elements.fetch("multipleValues", "unordered").to_sym
54
+ properties = elements.fetch("properties", {})
55
+ @prefixes[prefix] = Registry.new(prefix, propertyURI, multipleValues, properties)
56
+ end
57
+ end
58
+
59
+ ##
60
+ # True if registry has already been loaded
61
+ def self.loaded?
62
+ @prefixes.is_a?(Hash)
63
+ end
64
+
65
+ ##
66
+ # Initialize registry for a particular prefix URI
67
+ #
68
+ # @param [RDF::URI] prefixURI
69
+ # @param [#to_sym] propertyURI (:vocabulary)
70
+ # @param [#to_sym] multipleValues (:unordered)
71
+ # @param [Hash] properties ({})
72
+ def initialize(prefixURI, propertyURI = :vocabulary, multipleValues = :unordered, properties = {})
73
+ @scheme = propertyURI.to_sym
74
+ @multipleValues = multipleValues.to_sym
75
+ @properties = properties
76
+ if @scheme == :vocabulary
77
+ @property_base = prefixURI.to_s
78
+ @property_base += '#' unless %w(/ #).include?(@property_base[-1]) # Append a '#' for fragment if necessary
79
+ else
80
+ @property_base = 'http://www.w3.org/ns/md?type='
81
+ end
82
+ end
83
+
84
+ ##
85
+ # Find a registry entry given a type URI
86
+ #
87
+ # @param [RDF::URI] type
88
+ # @return [Registry]
89
+ def self.find(type)
90
+ @prefixes.select do |key, value|
91
+ type.to_s.index(key) == 0
92
+ end.values.first
93
+ end
94
+
95
+ ##
96
+ # Generate a predicateURI given a `name`
97
+ #
98
+ # @param [#to_s] name
99
+ # @param [Hash{}] ec Evaluation Context
100
+ # @return [RDF::URI]
101
+ def predicateURI(name, ec)
102
+ u = RDF::URI(name)
103
+ return u if u.absolute?
104
+
105
+ n = frag_escape(name)
106
+ if ec[:current_type].nil?
107
+ u = RDF::URI(ec[:document_base].to_s)
108
+ u.fragment = frag_escape(name)
109
+ u
110
+ elsif @scheme == :vocabulary
111
+ # If scheme is vocabulary return the URI reference constructed by appending the fragment escaped value of name
112
+ # to current vocabulary, separated by a U+0023 NUMBER SIGN character (#) unless the current vocabulary ends
113
+ # with either a U+0023 NUMBER SIGN character (#) or SOLIDUS U+002F (/).
114
+ RDF::URI(@property_base + n)
115
+ else # @scheme == :contextual
116
+ if ec[:current_type].to_s.index(@property_base) == 0
117
+ # return the concatenation of s, a U+002E FULL STOP character (.) and the fragment-escaped value of name.
118
+ RDF::URI(@property_base + '.' + n)
119
+ else
120
+ # return the concatenation of http://www.w3.org/ns/md?type=, the fragment-escaped value of s,
121
+ # the string &prop=, and the fragment-escaped value of name
122
+ RDF::URI(@property_base + frag_escape(ec[:current_type]) + '?prop=' + n)
123
+ end
124
+ end
125
+ end
126
+
127
+
128
+ ##
129
+ # Turn a predicateURI into a simple token
130
+ # @param [RDF::URI] predicateURI
131
+ # @return [String]
132
+ def tokenize(predicateURI)
133
+ case @scheme
134
+ when :vocabulary
135
+ predicateURI.to_s.sub(@property_base, '')
136
+ when :contextual
137
+ predicateURI.to_s.split('?prop=').last.split('.').last
138
+ end
139
+ end
140
+
141
+ ##
142
+ # Determine if property should be serialized as a list or not
143
+ # @param [RDF::URI] predicateURI
144
+ # @return [Boolean]
145
+ def as_list(predicateURI)
146
+ tok = tokenize(predicateURI)
147
+ if @properties[tok].is_a?(Hash)
148
+ @properties[tok]["multipleValues"].to_sym == :list
149
+ else
150
+ @multipleValues == :list
151
+ end
152
+ end
153
+
154
+ ##
155
+ # Fragment escape a name
156
+ def frag_escape(name)
157
+ name.to_s.gsub(/["#%<>\[\\\]^{|}]/) {|c| '%' + c.unpack('H2' * c.bytesize).join('%').upcase}
158
+ end
159
+ end
160
+
34
161
  ##
35
162
  # Initializes the Microdata reader instance.
36
163
  #
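Note on the `vocabulary` scheme: the sketch below restates, in plain Ruby without RDF.rb, how the `Registry` class in the hunk above appears to build a predicate URI from a prefix and an `itemprop` name. It is an illustration under simplifying assumptions (strings instead of `RDF::URI`, a regular expression as a rough "is this already an absolute URI" check), and `vocabulary_predicate` is a hypothetical name.

```ruby
# Percent-encode the same character set that frag_escape targets in the diff.
# Every character in the set is ASCII, so one %XX escape per character suffices.
def frag_escape(name)
  name.to_s.gsub(/["#%<>\[\\\]^{|}]/) { |c| format('%%%02X', c.ord) }
end

# Hypothetical helper for the vocabulary scheme: append the escaped name to
# the prefix, inserting '#' unless the prefix already ends in '#' or '/'.
def vocabulary_predicate(prefix, name)
  # Simplified absolute-URI check: a scheme followed by a colon.
  return name if name =~ /\A[a-z][a-z0-9+.\-]*:/i
  base = prefix.end_with?('/', '#') ? prefix : prefix + '#'
  base + frag_escape(name)
end
```

For example, the `http://schema.org/` prefix with the name `name` would give `http://schema.org/name`, while a prefix that ends in neither `/` nor `#` gets a `#` separator first.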
@@ -38,6 +165,8 @@ module RDF::Microdata
38
165
  # the input stream to read
39
166
  # @param [Hash{Symbol => Object}] options
40
167
  # any additional options
168
+ # @option options [Symbol] :library (:nokogiri)
169
+ # One of :nokogiri or :rexml. If nil/unspecified uses :nokogiri if available, :rexml otherwise.
41
170
  # @option options [Encoding] :encoding (Encoding::UTF_8)
42
171
  # the encoding of the input stream (Ruby 1.9+)
43
172
  # @option options [Boolean] :validate (false)
@@ -48,6 +177,7 @@ module RDF::Microdata
48
177
  # whether to intern all parsed URIs
49
178
  # @option options [#to_s] :base_uri (nil)
50
179
  # the base URI to use when resolving relative URIs
180
+ # @option options [#to_s] :registry_uri (DEFAULT_REGISTRY)
51
181
  # @option options [Array] :debug
52
182
  # Array to place debug messages
53
183
  # @return [reader]
@@ -59,24 +189,43 @@ module RDF::Microdata
59
189
  super do
60
190
  @debug = options[:debug]
61
191
 
62
- @doc = case input
63
- when Nokogiri::HTML::Document, Nokogiri::XML::Document
64
- input
65
- else
66
- # Try to detect charset from input
67
- options[:encoding] ||= input.charset if input.respond_to?(:charset)
68
-
69
- # Otherwise, default is utf-8
70
- options[:encoding] ||= 'utf-8'
192
+ @library = case options[:library]
193
+ when nil
194
+ (defined?(::Nokogiri) && RUBY_PLATFORM != 'java') ? :nokogiri : :rexml
195
+ when :nokogiri, :rexml
196
+ options[:library]
197
+ else
198
+ raise ArgumentError.new("expected :rexml or :nokogiri, but got #{options[:library].inspect}")
199
+ end
71
200
 
72
- add_debug(nil, "base_uri: #{base_uri}")
73
- Nokogiri::HTML.parse(input, base_uri.to_s, options[:encoding])
201
+ require "rdf/microdata/reader/#{@library}"
202
+ @implementation = case @library
203
+ when :nokogiri then Nokogiri
204
+ when :rexml then REXML
74
205
  end
75
-
76
- errors = @doc.errors.reject {|e| e.to_s =~ /Tag (audio|source|track|video|time) invalid/}
206
+ self.extend(@implementation)
207
+
208
+ initialize_html(input, options) rescue raise RDF::ReaderError.new($!.message)
209
+
210
+ if (root.nil? && validate?)
211
+ raise RDF::ReaderError, "Empty Document"
212
+ end
213
+ errors = doc_errors.reject {|e| e.to_s =~ /Tag (audio|source|track|video|time) invalid/}
77
214
  raise RDF::ReaderError, "Syntax errors:\n#{errors}" if !errors.empty? && validate?
78
- raise RDF::ReaderError, "Empty document" if (@doc.nil? || @doc.root.nil?) && validate?
79
215
 
216
+ add_debug(@doc, "library = #{@library}")
217
+
218
+ # Load registry
219
+ unless Registry.loaded?
220
+ registry = options[:registry_uri] || DEFAULT_REGISTRY
221
+ begin
222
+ json = RDF::Util::File.open_file(registry) { |f| JSON.load(f) }
223
+ rescue JSON::ParserError => e
224
+ raise RDF::ReaderError, "Failed to parse registry: #{e.message}"
225
+ end
226
+ Registry.load_registry(json)
227
+ end
228
+
80
229
  if block_given?
81
230
  case block.arity
82
231
  when 0 then instance_eval(&block)
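Note on parser selection: the `:library` handling in the hunk above can be read as a small decision function. The sketch below isolates that logic so it can be exercised without Nokogiri installed; `select_library` is a hypothetical helper, and the availability flag and platform string are passed in explicitly as assumptions rather than detected.

```ruby
# Hypothetical helper mirroring the selection rules in the hunk above:
# an explicit :nokogiri or :rexml wins; otherwise prefer Nokogiri when it
# is loaded and the platform is not JRuby; anything else is an error.
def select_library(requested, nokogiri_loaded, platform = RUBY_PLATFORM)
  case requested
  when nil
    (nokogiri_loaded && platform != 'java') ? :nokogiri : :rexml
  when :nokogiri, :rexml
    requested
  else
    raise ArgumentError, "expected :rexml or :nokogiri, but got #{requested.inspect}"
  end
end
```

Passing the platform as a parameter is only a testing convenience here; the reader in the diff checks `RUBY_PLATFORM` and `defined?(::Nokogiri)` directly.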
@@ -121,19 +270,19 @@ module RDF::Microdata
121
270
  @bnode_cache[value.to_s] ||= RDF::Node.new(value)
122
271
  end
123
272
 
124
- # Figure out the document path, if it is a Nokogiri::XML::Element or Attribute
273
+ # Figure out the document path, if it is an Element or Attribute
125
274
  def node_path(node)
126
- "<#{base_uri}>" + case node
127
- when Nokogiri::XML::Node then node.display_path
128
- else node.to_s
129
- end
275
+ "<#{base_uri}>#{node.respond_to?(:display_path) ? node.display_path : node}"
130
276
  end
131
277
 
132
278
  # Add debug event to debug array, if specified
133
279
  #
134
- # @param [XML Node, any] node:: XML Node or string for showing context
280
+ # @param [Nokogiri::XML::Node, #to_s] node:: XML Node or string for showing context
135
281
  # @param [String] message::
136
- def add_debug(node, message)
282
+ # @yieldreturn [String] appended to message, to allow for lazy evaluation of message
283
+ def add_debug(node, message = "")
284
+ return unless ::RDF::Microdata.debug? || @debug
285
+ message = message + yield if block_given?
137
286
  puts "#{node_path(node)}: #{message}" if ::RDF::Microdata::debug?
138
287
  @debug << "#{node_path(node)}: #{message}" if @debug.is_a?(Array)
139
288
  end
@@ -153,107 +302,50 @@ module RDF::Microdata
153
302
  # @raise [ReaderError]:: Checks parameter types and raises if they are incorrect if parsing mode is _validate_.
154
303
  def add_triple(node, subject, predicate, object)
155
304
  statement = RDF::Statement.new(subject, predicate, object)
156
- add_debug(node, "statement: #{RDF::NTriples.serialize(statement)}")
305
+ add_debug(node) {"statement: #{RDF::NTriples.serialize(statement)}"}
157
306
  @callback.call(statement)
158
307
  end
159
308
 
160
309
  # Parsing a Microdata document (this is *not* the recursive method)
161
310
  def parse_whole_document(doc, base)
162
- base_el = doc.at_css('html>head>base')
163
- base = base_el.attribute('href').to_s.split('#').first if base_el
164
-
165
- add_debug(doc, "parse_whole_doc: options=#{@options.inspect}")
166
-
167
- if (base)
311
+ base = doc_base(base)
312
+ options[:base_uri] = if (base)
168
313
  # Strip any fragment from base
169
314
  base = base.to_s.split('#').first
170
- base = options[:base_uri] = uri(base)
171
- add_debug(base_el, "parse_whole_doc: base='#{base}'")
315
+ base = uri(base)
172
316
  else
173
317
  base = RDF::URI("")
174
318
  end
175
319
 
176
- # 2. For each a, area, and link element in the Document, run these substeps:
320
+ add_debug(nil) {"parse_whole_doc: base='#{base}'"}
321
+
322
+ ec = {
323
+ :memory => {},
324
+ :current_name => nil,
325
+ :current_type => nil,
326
+ :current_vocabulary => nil,
327
+ :document_base => base,
328
+ }
329
+ items = []
330
+ # 1) For each element that is also a top-level item run the following algorithm:
177
331
  #
178
- # * If the element does not have a rel attribute, then skip this element.
179
- # * If the element does not have an href attribute, then skip this element.
180
- # * If resolving the element's href attribute relative to the element is not successful,
181
- # then skip this element.
182
- doc.css('a, area, link').each do |el|
183
- rel, href = el.attribute('rel'), el.attribute('href')
184
- next unless rel && href
185
- href = uri(href, el.base || base)
186
- add_debug(el, "a: rel=#{rel.inspect}, href=#{href}")
187
-
188
- # Otherwise, split the value of the element's rel attribute on spaces, obtaining list of tokens.
189
- # Coalesce duplicate tokens in list of tokens.
190
- tokens = rel.to_s.split(/\s+/).map do |tok|
191
- # Convert each token in list of tokens that does not contain a U+003A COLON characters (:)
192
- # to ASCII lowercase.
193
- tok =~ /:/ ? tok : tok.downcase
194
- end.uniq
195
-
196
- # If list of tokens contains both the tokens alternate and stylesheet,
197
- # then remove them both and replace them with the single (uppercase) token
198
- # ALTERNATE-STYLESHEET.
199
- if tokens.include?('alternate') && tokens.include?('stylesheet')
200
- tokens = tokens - %w(alternate stylesheet)
201
- tokens << 'ALTERNATE-STYLESHEET'
202
- end
203
-
204
- tokens.each do |tok|
205
- tok_uri = RDF::URI(tok)
206
- if tok !~ /:/
207
- # For each token token in list of tokens that contains no U+003A COLON characters (:),
208
- # generate the following triple:
209
- add_triple(el, base, RDF::XHV[tok.gsub('#', '%23')], href)
210
- elsif tok_uri.absolute?
211
- # For each token token in list of tokens that is an absolute URL, generate the following triple:
212
- add_triple(el, base, tok_uri, href)
213
- end
214
- end
215
- end
216
-
217
- # 3. For each meta element in the Document that has a name attribute and a content attribute,
218
- doc.css('meta[name][content]').each do |el|
219
- name, content = el.attribute('name'), el.attribute('content')
220
- name = name.to_s
221
- name_uri = uri(name, el.base || base)
222
- add_debug(el, "meta: name=#{name.inspect}")
223
- if name !~ /:/
224
- # If the value of the name attribute contains no U+003A COLON characters (:),
225
- # generate the following triple:
226
- add_triple(el, base, RDF::XHV[name.downcase.gsub('#', '%23')], RDF::Literal(content, :language => el.language))
227
- elsif name_uri.absolute?
228
- # If the value of the name attribute contains no U+003A COLON characters (:),
229
- # generate the following triple:
230
- add_triple(el, base, name_uri, RDF::Literal(content, :language => el.language))
231
- end
232
- end
233
-
234
- # 4. For each blockquote and q element in the Document that has a cite attribute that resolves
235
- # successfully relative to the element, generate the following triple:
236
- doc.css('blockquote[cite], q[cite]').each do |el|
237
- object = uri(el.attribute('cite'), el.base || base)
238
- add_debug(el, "blockquote: cite=#{object}")
239
- add_triple(el, base, RDF::DC.source, object)
332
+ # 1) Generate the triples for an item item, using the evaluation context.
333
+ # Let result be the (URI reference or blank node) subject returned.
334
+ # 2) Append result to item list.
335
+ getItems.each do |el|
336
+ result = generate_triples(el, ec)
337
+ items << result
240
338
  end
339
+
340
+ # 2) Generate an RDF Collection list from
341
+ # the ordered list of values. Set value to the value returned from generate an RDF Collection.
342
+ value = generateRDFCollection(root, items)
241
343
 
242
- # 5. Let memory be a mapping of items to subjects, initially empty.
243
- # 6. For each element that is also a top-level microdata item, run the following steps:
244
- # * Generate the triples for the item. Pass a reference to memory as the item/subject list.
245
- # Let result be the subject returned.
246
- # * Generate the following triple:
247
- # subject the document's current address
248
- # predicate http://www.w3.org/1999/xhtml/microdata#item
249
- # object result
250
- memory = {}
251
- doc.css('[itemscope]').
252
- select {|el| !el.has_attribute?('itemprop')}.
253
- each do |el|
254
- object = generate_triples(el, memory)
255
- add_triple(el, base, RDF::MD.item, object)
256
- end
344
+ # 3) Generate the following triple:
345
+ # subject Document base
346
+ # predicate http://www.w3.org/1999/xhtml/microdata#item
347
+ # object value
348
+ add_triple(doc, base, RDF::MD.item, value) if value
257
349
 
258
350
  add_debug(doc, "parse_whole_doc: traversal complete")
259
351
  end
@@ -261,94 +353,119 @@ module RDF::Microdata
261
353
  ##
262
354
  # Generate triples for an item
263
355
  # @param [RDF::Resource] item
264
- # @param [Hash{Nokogiri::XML::Element} => RDF::Resource] memory
265
- # @param [Hash{Symbol => Object}] options
266
- # @option options [RDF::Resource] :fallback_type
267
- # @option options [RDF::Resource] :fallback_name
356
+ # @param [Hash{Symbol => Object}] ec
357
+ # @option ec [Hash{Nokogiri::XML::Element} => RDF::Resource] memory
358
+ # @option ec [RDF::Resource] :current_type
268
359
  # @return [RDF::Resource]
269
- def generate_triples(item, memory, options = {})
270
- fallback_type = options[:fallback_type]
271
- fallback_name = options[:fallback_name]
272
-
273
- # 1. If there is an entry for item in memory, then let subject be the subject of that entry.
360
+ def generate_triples(item, ec = {})
361
+ memory = ec[:memory]
362
+ # 1) If there is an entry for item in memory, then let subject be the subject of that entry.
274
363
  # Otherwise, if item has a global identifier and that global identifier is an absolute URL,
275
364
  # let subject be that global identifier. Otherwise, let subject be a new blank node.
276
- subject = if memory.include?(item)
277
- memory[item][:subject]
365
+ subject = if memory.include?(item.node)
366
+ memory[item.node][:subject]
278
367
  elsif item.has_attribute?('itemid')
279
- u = uri(item.attribute('itemid'), item.base || base_uri)
368
+ uri(item.attribute('itemid'), item.base || base_uri)
280
369
  end || RDF::Node.new
281
- memory[item] ||= {}
370
+ memory[item.node] ||= {}
282
371
 
283
- add_debug(item, "gentrips(2): subject=#{subject.inspect}")
372
+ add_debug(item) {"gentrips(2): subject=#{subject.inspect}, current_type: #{ec[:current_type]}"}
284
373
 
285
- # 2. Add a mapping from item to subject in memory, if there isn't one already.
286
- memory[item][:subject] ||= subject
374
+ # 2) Add a mapping from item to subject in memory, if there isn't one already.
375
+ memory[item.node][:subject] ||= subject
287
376
 
288
- # 3. If item has an item type and that item type is an absolute URL, let type be that item type.
289
- # Otherwise, let type be the empty string.
290
- rdf_type = type = uri(item.attribute('itemtype'))
291
- type = '' unless type.absolute?
377
+ # 3) For each type returned from element.itemType of the element defining the item.
378
+ type = nil
379
+ item.attribute('itemtype').to_s.split(' ').map{|n| uri(n)}.select(&:absolute?).each do |t|
380
+ # 3.1. If type is an absolute URL, generate the following triple:
381
+ type ||= t
382
+ add_triple(item, subject, RDF.type, t)
383
+ end
292
384
 
293
- if type != ''
294
- add_triple(item, subject, RDF.type, type)
295
- # 4.2. If type does not contain a U+0023 NUMBER SIGN character (#), then append a # to type.
296
- type += '#' unless type.to_s.include?('#')
297
- # 4.3. If type does not have a : after its #, append a : to type.
298
- type += ':' unless type.to_s.match(/\#:/)
299
- elsif fallback_type
300
- add_debug(item, "gentrips(5.2): fallback_type=#{fallback_type}, fallback_name=#{fallback_name}")
301
- rdf_type = type = fallback_type
302
- # 5.2. If type does not contain a U+0023 NUMBER SIGN character (#), then append a # to type.
303
- type += '#' unless type.to_s.include?('#')
304
- # 5.3. If type does not have a : after its #, append a : to type.
305
- type += ':' unless type.to_s.match(/\#:/)
306
- # 5.4. If the last character of type is not a :, %20 to type.
307
- type += '%20' unless type.to_s[-1,1] == ':'
308
- # 5.5. Append the fragment-escaped value of fallback name to type.
309
- type += fallback_name.to_s.gsub('#', '%23')
385
+ # 5) If type is not an absolute URL, set it to current type from the Evaluation Context if not empty.
386
+ type ||= ec[:current_type]
387
+ add_debug(item) {"gentrips(5): type=#{type.inspect}"}
388
+
389
+ # 6) If the registry contains a URI prefix that is a character for character match of type up to the length of the
390
+ # URI prefix, set vocab as that URI prefix
391
+ vocab = Registry.find(type)
392
+
393
+ # 7) Otherwise, if type is not empty, construct vocab by removing everything following the last
394
+ # SOLIDUS U+002F ("/") or NUMBER SIGN U+0023 ("#") from type.
395
+ vocab ||= begin
396
+ type_vocab = type.to_s.sub(/([\/\#])[^\/\#]*$/, '\1')
397
+ add_debug(item) {"gentrips(7): type_vocab=#{type_vocab.inspect}"}
398
+ Registry.new(type_vocab) # if type
310
399
  end
311
400
 
312
- add_debug(item, "gentrips(6): type=#{type.inspect}")
313
-
314
- # 6. For each element _element_ that has one or more property names and is one of the
401
+ # 8) Update evaluation context setting current vocabulary to vocab.
402
+ ec[:current_vocabulary] = vocab
403
+
404
+ # 9) Set property list to an empty mapping between properties and one or more ordered values as established below.
405
+ property_list = {}
406
+
407
+ # 10. For each element _element_ that has one or more property names and is one of the
  # properties of the item _item_, in the order those elements are given by the algorithm
  # that returns the properties of an item, run the following substep:
  props = item_properties(item)
-
- # 6.1. For each name name in element's property names, run the following substeps:
+ # 10.1. For each name name in element's property names, run the following substeps:
  props.each do |element|
- element.attribute('itemprop').to_s.split(' ').each do |name|
- add_debug(element, "gentrips(6.1): name=#{name.inspect}")
- # If type is the empty string and name is not an absolute URL, then abort these substeps.
- name_uri = RDF::URI(name)
- next if type == '' && !name_uri.absolute?
-
+ element.attribute('itemprop').to_s.split(' ').compact.each do |name|
+ add_debug(element) {"gentrips(10.1): name=#{name.inspect}, type=#{type}"}
+ # Let context be a copy of evaluation context with current type set to type and current vocabulary set to vocab.
+ ec_new = ec.merge({:current_type => type, :current_vocabulary => vocab})
+
+ predicate = vocab.predicateURI(name, ec_new)
+ ec_new[:current_name] = predicate
+ add_debug(element) {"gentrips(10.1.2): predicate=#{predicate}"}
+
+ # 10.1.3) Let value be the property value of element.
  value = property_value(element)
- add_debug(element, "gentrips(6.1.2) value=#{value.inspect}")
+ add_debug(element) {"gentrips(10.1.3) value=#{value.inspect}"}
 
+ # 10.1.4) If value is an item, then generate the triples for value using a copy of evaluation context with
+ # current type set to type. Replace value by the subject returned from those steps.
  if value.is_a?(Hash)
- value = generate_triples(element, memory, :fallback_type => type, :fallback_name => name)
+ value = generate_triples(element, ec_new)
+ add_debug(element) {"gentrips(10.1.4): value=#{value.inspect}"}
  end
-
- add_debug(element, "gentrips(6.1.3): value=#{value.inspect}")
 
- predicate = if name_uri.absolute?
- name_uri
- else
- # Use the URI of the type to create URIs for @itemprop terms
- add_debug(element, "gentrips: rdf_type=#{rdf_type}")
- predicate = RDF::URI(rdf_type.to_s.sub(/([\/\#])[^\/\#]*$/, '\1' + name))
- end
- add_debug(element, "gentrips(6.1.5): predicate=#{predicate}")
-
- add_triple(element, subject, predicate, value) if predicate
+ property_list[predicate] ||= []
+ property_list[predicate] << value
  end
  end
 
+ # 11) For each predicate in property list
+ property_list.each do |predicate, values|
+ generatePropertyValues(item, subject, predicate, values, ec)
+ end
+
  subject
  end
 
+ def generatePropertyValues(element, subject, predicate, values, ec)
+ registry = ec[:current_vocabulary]
+ if registry.as_list(predicate)
+ value = generateRDFCollection(element, values)
+ add_triple(element, subject, predicate, value)
+ else
+ values.each {|v| add_triple(element, subject, predicate, v)}
+ end
+ end
+
+ ##
+ # Called when values has more than one entry
+ # @param [Nokogiri::HTML::Element] element
+ # @param [Array<RDF::Value>] values
+ # @return [RDF::Node]
+ def generateRDFCollection(element, values)
+ list = RDF::List.new(nil, nil, values)
+ list.each_statement do |st|
+ add_triple(element, st.subject, st.predicate, st.object) unless st.object == RDF.List
+ end
+ list.subject
+ end
+
  ##
  # To find the properties of an item defined by the element root, the user agent must try
  # to crawl the properties of the element root, with an empty list as the value of memory:
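
Note on steps 10 and 11 above: values are first grouped by predicate and triples are emitted only afterwards, so that a predicate the registry marks as a list can be written as a single RDF collection. The following is a minimal sketch of that grouping in plain Ruby, not the gem's code. It assumes triples are three-element arrays and blank nodes are plain strings; `collection_triples`, `property_triples` and the `list_predicates` array are hypothetical stand-ins for `generateRDFCollection`, `generatePropertyValues` and the registry's `as_list` check.

```ruby
RDF_FIRST = "rdf:first"
RDF_REST  = "rdf:rest"
RDF_NIL   = "rdf:nil"

# Build the rdf:first/rdf:rest chain for values.
# Returns [head, triples]; an empty list is represented by rdf:nil alone.
def collection_triples(values, prefix)
  return [RDF_NIL, []] if values.empty?
  nodes = values.each_index.map { |i| "#{prefix}#{i}" }
  triples = []
  values.each_with_index do |value, i|
    triples << [nodes[i], RDF_FIRST, value]
    # The last node points at rdf:nil to terminate the list.
    triples << [nodes[i], RDF_REST, nodes[i + 1] || RDF_NIL]
  end
  [nodes.first, triples]
end

# Emit one collection for list predicates, one triple per value otherwise.
def property_triples(subject, property_list, list_predicates)
  property_list.flat_map do |predicate, values|
    if list_predicates.include?(predicate)
      head, chain = collection_triples(values, "_:#{predicate}")
      [[subject, predicate, head]] + chain
    else
      values.map { |value| [subject, predicate, value] }
    end
  end
end

# Group values by predicate first, as in step 10.
PROPERTY_LIST = {}
[["name", "Alice"], ["step", "mix"], ["step", "bake"]].each do |predicate, value|
  PROPERTY_LIST[predicate] ||= []
  PROPERTY_LIST[predicate] << value
end

# Then emit, as in step 11; "step" is treated as a list predicate here.
TRIPLES = property_triples("_:item", PROPERTY_LIST, ["step"])
TRIPLES.each { |t| puts t.join(" ") }
```

Grouping before emitting is what preserves document order inside the collection: the values for `step` arrive in the order the properties were crawled and are chained in that same order.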
@@ -378,13 +495,14 @@ module RDF::Microdata
  # @return [Array<Array<Nokogiri::XML::Element>, Integer>]
  # Resultant elements and error count
  def crawl_properties(root, memory)
+
  # 1. If root is in memory, then the algorithm fails; abort these steps.
  raise CrawlFailure, "crawl_props mem already has #{root.inspect}" if memory.include?(root)
 
  # 2. Collect all the elements in the item root; let results be the resulting
  # list of elements, and errors be the resulting count of errors.
  results, errors = elements_in_item(root)
- add_debug(root, "crawl_properties results=#{results.inspect}, errors=#{errors}")
+ add_debug(root) {"crawl_properties results=#{results.map {|e| node_path(e)}.inspect}, errors=#{errors}"}
 
  # 3. Remove any elements from results that do not have an itemprop attribute specified.
  results = results.select {|e| e.has_attribute?('itemprop')}
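
Note on the `add_debug` change in this hunk: the message moves from a string argument to a block. The implementation of `add_debug` is not shown here, so the following is only a sketch of what the block form makes possible, namely skipping the string interpolation (and calls such as `node_path`) when debugging is off. `DebugLog` and its `enabled` flag are hypothetical and are not part of the gem.

```ruby
# Hypothetical logger: the block is evaluated only when debugging is enabled.
class DebugLog
  attr_reader :messages

  def initialize(enabled)
    @enabled = enabled
    @messages = []
  end

  def add_debug(node)
    return unless @enabled
    @messages << "#{node}: #{yield}"
  end
end

# Record every time the (potentially expensive) message is actually built.
CALLS = []
expensive = lambda do
  CALLS << :built
  "results=#{%w(a b c).inspect}"
end

QUIET = DebugLog.new(false)
QUIET.add_debug("root") { expensive.call }    # block is never run

VERBOSE = DebugLog.new(true)
VERBOSE.add_debug("root") { expensive.call }  # block runs once

puts "message built #{CALLS.length} time(s)"
```

With a string argument the interpolation would run on every call, whether or not the message is ever stored.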
@@ -427,13 +545,13 @@ module RDF::Microdata
  # If root has an itemref attribute, split the value of that itemref attribute on spaces.
  # For each resulting token ID,
  root.attribute('itemref').to_s.split(' ').each do |id|
- add_debug(root, "elements_in_item itemref id #{id}")
+ add_debug(root) {"elements_in_item itemref id #{id}"}
  # if there is an element in the home subtree of root with the ID ID,
  # then add the first such element to pending.
- id_elem = @doc.at_css("##{id}")
+ id_elem = find_element_by_id(id)
  pending << id_elem if id_elem
  end
- add_debug(root, "elements_in_item pending #{pending.inspect}")
+ add_debug(root) {"elements_in_item pending #{pending.inspect}"}
 
  # Loop: Remove an element from pending and let current be that element.
  while current = pending.shift
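
Note on the `itemref` handling above: the attribute is split on whitespace and, for each token, the first element with that ID is queued. The sketch below reproduces that lookup in plain Ruby under stated assumptions: the document is modelled as an array of `Element` structs in document order, and this `find_element_by_id` takes the element list as an argument, which differs from the one-argument helper called in the diff (whose body is not shown in this hunk).

```ruby
# Hypothetical stand-in for parsed elements, listed in document order.
Element = Struct.new(:id, :name)

# Return the first element carrying the given ID, or nil if none exists.
def find_element_by_id(elements, id)
  elements.find { |e| e.id == id }
end

# Resolve an itemref attribute value into the list of referenced elements.
# Unknown IDs are skipped; a missing attribute (nil) yields an empty list.
def itemref_targets(itemref, elements)
  itemref.to_s.split(' ').each_with_object([]) do |id, pending|
    elem = find_element_by_id(elements, id)
    pending << elem if elem
  end
end

DOC = [Element.new("a", "div"), Element.new("b", "span"), Element.new("a", "p")]
PENDING = itemref_targets("a  missing b", DOC)
puts PENDING.map(&:name).inspect
```

`split(' ')` with a single-space string ignores leading whitespace and collapses runs of whitespace, so repeated spaces in the attribute do not produce empty tokens.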
@@ -457,37 +575,42 @@ module RDF::Microdata
  ##
  #
  def property_value(element)
- add_debug(element, "property_value(#{element.inspect}): base #{element.base.inspect}, base_uri: #{base_uri.inspect}")
- case
+ base = element.base || base_uri
+ add_debug(element) {"property_value(#{element.name}): base #{base.inspect}"}
+ value = case
  when element.has_attribute?('itemscope')
  {}
  when element.name == 'meta'
- element.attribute('content').to_s
+ RDF::Literal.new(element.attribute('content').to_s, :language => element.language)
+ when element.name == 'data'
+ RDF::Literal.new(element.attribute('value').to_s, :language => element.language)
  when %w(audio embed iframe img source track video).include?(element.name)
- uri(element.attribute('src'), element.base || base_uri)
+ uri(element.attribute('src'), base)
  when %w(a area link).include?(element.name)
- uri(element.attribute('href'), element.base || base_uri)
+ uri(element.attribute('href'), base)
  when %w(object).include?(element.name)
- uri(element.attribute('data'), element.base || base_uri)
- when %w(time).include?(element.name) && element.has_attribute?('datetime')
+ uri(element.attribute('data'), base)
+ when %w(time).include?(element.name)
  # Lexically scan value and assign appropriate type, otherwise, leave untyped
- v = element.attribute('datetime').to_s
- datatype = %w(Date Time DateTime).map {|t| RDF::Literal.const_get(t)}.detect do |dt|
+ v = (element.attribute('datetime') || element.text).to_s
+ datatype = %w(Date Time DateTime Duration).map {|t| RDF::Literal.const_get(t)}.detect do |dt|
  v.match(dt::GRAMMAR)
  end || RDF::Literal
- datatype.new(v)
+ datatype.new(v, :language => element.language)
  else
- RDF::Literal.new(element.text, :language => element.language)
+ RDF::Literal.new(element.inner_text, :language => element.language)
  end
+ add_debug(element) {" #{value.inspect}"}
+ value
  end
 
  # Fixme, what about xml:base relative to element?
  def uri(value, base = nil)
  value = if base
  base = uri(base) unless base.is_a?(RDF::URI)
- base.join(value)
+ base.join(value.to_s)
  else
- RDF::URI(value)
+ RDF::URI(value.to_s)
  end
  value.validate! if validate?
  value.canonicalize! if canonicalize?