rdf-microdata 0.2.2 → 0.2.3

data/README CHANGED
@@ -6,13 +6,20 @@
6
6
  RDF::Microdata is a Microdata reader for Ruby using the [RDF.rb][RDF.rb] library suite.
7
7
 
8
8
  ## FEATURES
9
- RDF::Microdata parses [Microdata][] into statements or triples.
9
+ RDF::Microdata parses [Microdata][] into statements or triples using the rules defined in [Microdata RDF][].
10
10
 
11
11
  * Microdata parser.
12
- * Uses Nokogiri for parsing HTML
12
+ * Uses Nokogiri for parsing HTML/SVG if available; falls back to REXML otherwise (and always on JRuby)
13
13
 
14
14
  Install with 'gem install rdf-microdata'
15
15
 
16
+ ### Living implementation
17
+ Microdata to RDF transformation is undergoing active development. This implementation attempts to be up-to-date
18
+ as of the time of release, and is being used in developing the [Microdata RDF][] specification.
19
+
20
+ ### Microdata Registry
21
+ The parser uses a built-in version of the [Microdata RDF][] registry.
22
+
16
23
  ## Usage
17
24
 
18
25
  ### Reading RDF data in the Microdata format
@@ -20,49 +27,14 @@ Install with 'gem install rdf-microdata'
20
27
  graph = RDF::Graph.load("etc/foaf.html", :format => :microdata)
21
28
 
22
29
  ## Note
23
- The Microdata editor has recently [dropped support for RDF
24
- conversion](http://html5.org/tools/web-apps-tracker?from=6426&to=6427), as a result, this gem is being used to
25
- investigate ways in which Microdata might have more satisfactory RDF generation.
26
-
27
- ### Generating RDF friendly URIs from terms
28
- If the `@itemprop` is included within an item having an `@itemtype`,
29
- the URI of the `@itemtype` will be used for generating a term URI. The type URI will be trimmed following
30
- the last '#' or '/' character, and the term will be appended to the resulting URI. This is in keeping
31
- with standard convention for defining properties and classes within an RDFS or OWL vocabulary.
32
-
33
- For example:
34
-
35
- <div itemscope itemtype="http://schema.org/Person">
36
- My name is <span itemprop="name">Gregg</span>
37
- </div>
38
-
39
- Without the `:rdf\_terms` option, this would create the following statements:
40
-
41
- @prefix md: <http://www.w3.org/1999/xhtml/microdata#> .
42
- @prefix schema: <http://schema.org/> .
43
- <> md:item [
44
- a schema:Person;
45
- <http://www.w3.org/1999/xhtml/microdata#http://schema.org/Person%23:name> "Gregg"
46
- ] .
47
-
48
- With the `:rdf\_terms` option, this becomes:
49
-
50
- @prefix md: <http://www.w3.org/1999/xhtml/microdata#> .
51
- @prefix schema: <http://schema.org/> .
52
- <> md:item [ a schema:Person; schema:name "Gregg" ] .
53
-
54
- ### Improve xsd:date, xsd:time, xsd:dateTime and xsd:duration generation from _time_ element
55
-
56
- Use the lexical form of the @datetime attribute of the _time_ element to determine the specific type
57
- of the generated literal.
58
-
59
- ### Remove implicit RDF triple generation
60
-
61
- html>head>title and anchor (_a_) elements no longer generate triples without @item* properties
62
-
30
+ This implementation is based on the W3C HTML Data Task Force specification and does not support
31
+ GRDDL-type triple generation, such as for html>head>title and <a>
32
+
63
33
  ## Dependencies
64
34
  * [RDF.rb](http://rubygems.org/gems/rdf) (>= 0.3.4)
65
- * [Nokogiri](http://rubygems.org/gems/nokogiri) (>= 1.3.3)
35
+ * [RDF::XSD](http://rubygems.org/gems/rdf-xsd) (>= 0.3.4)
36
+ * [HTMLEntities](https://rubygems.org/gems/htmlentities) ('>= 4.3.0')
37
+ * Soft dependency on [Nokogiri](http://rubygems.org/gems/nokogiri) (>= 1.5.0)
66
38
 
67
39
  ## Documentation
68
40
  Full documentation available on [Rubydoc.info][Microdata doc]
@@ -71,6 +43,8 @@ Full documentation available on [Rubydoc.info][Microdata doc]
71
43
  * {RDF::Microdata::Format}
72
44
  Asserts :html format, text/html mime-type and .html file extension.
73
45
  * {RDF::Microdata::Reader}
46
+ * {RDF::Microdata::Reader::Nokogiri}
47
+ * {RDF::Microdata::Reader::REXML}
74
48
 
75
49
  ### Additional vocabularies
76
50
 
@@ -81,8 +55,9 @@ Full documentation available on [Rubydoc.info][Microdata doc]
81
55
  ## Resources
82
56
  * [RDF.rb][RDF.rb]
83
57
  * [Documentation](http://rdf.rubyforge.org/microdata)
84
- * [History](file:file.History.html)
58
+ * [History](file:History.md)
85
59
  * [Microdata][]
60
+ * [Microdata RDF][]
86
61
 
87
62
  ## Author
88
63
  * [Gregg Kellogg](http://github.com/gkellogg) - <http://kellogg-assoc.com/>
@@ -117,5 +92,6 @@ see <http://unlicense.org/> or the accompanying {file:UNLICENSE} file.
117
92
  [YARD]: http://yardoc.org/
118
93
  [YARD-GS]: http://rubydoc.info/docs/yard/file/docs/GettingStarted.md
119
94
  [PDD]: http://lists.w3.org/Archives/Public/public-rdf-ruby/2010May/0013.html
120
- [Microdata]: http://www.w3.org/TR/2011/WD-microdata-20110525/ "HTML Microdata"
95
+ [Microdata]: http://dev.w3.org/html5/md/Overview.html "HTML Microdata"
96
+ [Microdata RDF]: https://dvcs.w3.org/hg/htmldata/raw-file/default/microdata-rdf/index.html "Microdata to RDF"
121
97
  [Microdata doc]: http://rubydoc.info/github/gkellogg/rdf-microdata/frames
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.2.2
1
+ 0.2.3
data/etc/doap.html ADDED
@@ -0,0 +1,42 @@
1
+ <!DOCTYPE html>
2
+ <html itemscope itemid="http://rubygems.org/gems/rdf-microdata" itemtype="http://usefulinc.com/ns/doap#Project">
3
+ <head>
4
+ <title lang="en" itemprop="shortdesc">Microdata reader for Ruby.</title>
5
+ </head>
6
+ <body about="" typeof="Project">
7
+ <p>Project description for <span itemprop="name">RDF::Microdata</span>.</p>
8
+ <p lang="en" itemprop="description">
9
+ RDF::Microdata is a Microdata reader for Ruby using the RDF.rb library suite.
10
+ </p>
11
+ <dl>
12
+ <dt>Creator</dt><dd>
13
+ <a itemprop="http://purl.org/dc/terms/creator developer documenter maintainer http://xmlns.com/foaf/0.1/creator" href="http://greggkellogg.net/foaf#me"
14
+ >Gregg Kellogg</a>
15
+ </dd>
16
+ <dt>Created</dt><dd><time itemprop="created" datetime="2011-08-29"/></dd>
17
+ <dt>Blog</dt><dd><a href="http://greggkellogg.net/" itemprop="blog">http://greggkellogg.net/</a></dd>
18
+ <dt>Bug DB</dt><dd>
19
+ <a href="http://github.com/gkellogg/rdf-microdata/issues" itemprop="bug-database">
20
+ http://github.com/gkellogg/rdf-microdata/issues
21
+ </a>
22
+ </dd>
23
+ <dt>Category</dt><dd itemprop="category">
24
+ <a href="http://dbpedia.org/resource/Resource_Description_Framework">Resource Description Framework</a>
25
+ for
26
+ <a itemprop="programming-language" href="http://dbpedia.org/resource/Ruby_(programming_language)">Ruby</a>
27
+ </dd>
28
+ <dt>Download</dt><dd><a href="http://rubygems.org/gems/rdf-microdata" itemprop="download-page">
29
+ http://rubygems.org/gems/rdf-microdata
30
+ </a></dd>
31
+ <dt>Home Page</dt><dd><a href="http://github.com/gkellogg/rdf-microdata" itemprop="homepage">
32
+ http://github.com/gkellogg/rdf-microdata
33
+ </a></dd>
34
+ <dt>License</dt><dd>
35
+ <a href="http://creativecommons.org/licenses/publicdomain/" itemprop="license">Public Domain</a>
36
+ </dd>
37
+ <dt>Mailing List</dt><dd><a href="http://lists.w3.org/Archives/Public/public-rdf-ruby/" itemprop="mailing-list">
38
+ http://lists.w3.org/Archives/Public/public-rdf-ruby/
39
+ </a></dd>
40
+ </dl>
41
+ </body>
42
+ </html>
data/etc/registry.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "http://schema.org/": {
3
+ "propertyURI": "vocabulary",
4
+ "multipleValues": "unordered",
5
+ "properties": {
6
+ "blogPosts": {"multipleValues": "list"},
7
+ "breadcrumb": {"multipleValues": "list"},
8
+ "byArtist": {"multipleValues": "list"},
9
+ "creator": {"multipleValues": "list"},
10
+ "episodes": {"multipleValues": "list"},
11
+ "events": {"multipleValues": "list"},
12
+ "founders": {"multipleValues": "list"},
13
+ "itemListElement": {"multipleValues": "list"},
14
+ "musicGroupMember": {"multipleValues": "list"},
15
+ "performerIn": {"multipleValues": "list"},
16
+ "performers": {"multipleValues": "list"},
17
+ "producer": {"multipleValues": "list"},
18
+ "recipeInstructions": {"multipleValues": "list"},
19
+ "seasons": {"multipleValues": "list"},
20
+ "subEvents": {"multipleValues": "list"},
21
+ "tracks": {"multipleValues": "list"}
22
+ }
23
+ },
24
+ "http://microformats.org/profile/hcard": {
25
+ "propertyURI": "vocabulary",
26
+ "multipleValues": "unordered"
27
+ },
28
+ "http://microformats.org/profile/hcalendar#": {
29
+ "propertyURI": "vocabulary",
30
+ "multipleValues": "unordered",
31
+ "properties": {
32
+ "categories": {"multipleValues": "list"}
33
+ }
34
+ },
35
+ "http://n.whatwg.org/work": {
36
+ "propertyURI": "contextual",
37
+ "multipleValues": "list"
38
+ }
39
+ }
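The registry above maps URI prefixes to a property-URI scheme and a default handling for multiple values, with optional per-property overrides. A minimal sketch of how such a file might be consulted follows. It uses only the standard `json` library; the helper names `find_entry` and `list_valued?` and the two-entry excerpt are assumptions for illustration, not part of the gem.

```ruby
require 'json'

# Hypothetical excerpt in the same shape as etc/registry.json.
REGISTRY_JSON = <<~JSON
  {
    "http://schema.org/": {
      "propertyURI": "vocabulary",
      "multipleValues": "unordered",
      "properties": {
        "tracks": {"multipleValues": "list"}
      }
    },
    "http://n.whatwg.org/work": {
      "propertyURI": "contextual",
      "multipleValues": "list"
    }
  }
JSON

REGISTRY = JSON.parse(REGISTRY_JSON)

# Return the [prefix, entry] pair whose prefix matches the start of
# type_uri character for character, or nil when nothing matches.
def find_entry(registry, type_uri)
  registry.find { |prefix, _entry| type_uri.to_s.start_with?(prefix) }
end

# Decide whether multiple values of a property become an ordered list.
# A per-property override wins over the vocabulary-wide default.
def list_valued?(entry, property_name)
  override = entry.fetch("properties", {})[property_name]
  mode = override ? override["multipleValues"] : entry.fetch("multipleValues", "unordered")
  mode == "list"
end

prefix, entry = find_entry(REGISTRY, "http://schema.org/MusicAlbum")
puts prefix                          # http://schema.org/
puts list_valued?(entry, "tracks")   # true  (per-property override)
puts list_valued?(entry, "name")     # false (vocabulary default)
```

A type URI that matches no prefix yields `nil`, in which case the reader in this diff derives a vocabulary from the type URI itself.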
data/lib/rdf/microdata.rb CHANGED
@@ -30,5 +30,3 @@ module RDF
30
30
  def self.debug=(value); @debug = value; end
31
31
  end
32
32
  end
33
-
34
- require 'rdf/microdata/extensions'
@@ -1,24 +1,33 @@
1
- require 'nokogiri' # FIXME: Implement using different modules as in RDF::TriX
1
+ begin
2
+ raise LoadError, "not with java" if RUBY_PLATFORM == "java"
3
+ require 'nokogiri'
4
+ rescue LoadError => e
5
+ :rexml
6
+ end
7
+ require 'rdf/xsd'
8
+ require 'json'
2
9
 
3
10
  module RDF::Microdata
4
11
  ##
5
12
  # A Microdata parser in Ruby
6
13
  #
7
14
  # Based on processing rules, amended with the following:
8
- # * property generation from tokens now uses the associated @itemtype as the basis for generation
9
- # * implicit triples are not generated, only those with @item*
10
- # * @datetime values are scanned lexically to find appropriate datatype
11
15
  #
12
- # @see http://dev.w3.org/html5/md/
16
+ # @see https://dvcs.w3.org/hg/htmldata/raw-file/0d6b89f5befb/microdata-rdf/index.html
13
17
  # @author [Gregg Kellogg](http://kellogg-assoc.com/)
14
18
  class Reader < RDF::Reader
15
19
  format Format
16
- XHTML = "http://www.w3.org/1999/xhtml"
17
20
  URL_PROPERTY_ELEMENTS = %w(a area audio embed iframe img link object source track video)
21
+ DEFAULT_REGISTRY = File.expand_path(File.join(File.dirname(__FILE__), "..", "..", "..", "etc", "registry.json"))
18
22
 
19
23
  class CrawlFailure < StandardError #:nodoc:
20
24
  end
21
25
 
26
+ # Returns the HTML implementation module for this reader instance.
27
+ #
28
+ # @attr_reader [Module]
29
+ attr_reader :implementation
30
+
22
31
  ##
23
32
  # Returns the base URI determined by this reader.
24
33
  #
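The `begin`/`rescue LoadError` block at the top of this file treats Nokogiri as a soft dependency. The same pattern can be sketched in isolation as below; `pick_backend` and its parameter names are hypothetical and only illustrate the control flow.

```ruby
# Try to load an optional library and report which backend is usable.
# `preferred` and `fallback` are illustrative names, not gem API.
def pick_backend(preferred, fallback)
  # Mirror the diff: never use the preferred backend on JRuby.
  raise LoadError, "not with java" if RUBY_PLATFORM == "java"
  require preferred
  preferred.to_sym
rescue LoadError
  # Either the platform check or the require failed; use the fallback.
  fallback.to_sym
end

puts pick_backend("surely_not_an_installed_library_xyz", "rexml")  # rexml
```

Raising `LoadError` deliberately on JRuby lets one `rescue` clause cover both the "library missing" and the "platform unsupported" cases.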
@@ -31,6 +40,124 @@ module RDF::Microdata
31
40
  @options[:base_uri]
32
41
  end
33
42
 
43
+ # Interface to registry
44
+ class Registry
45
+ ##
46
+ # Initialize the registry from a parsed JSON Hash (prefix URI => settings)
47
+ #
48
+ # @param [Hash] json
49
+ def self.load_registry(json)
50
+ @prefixes = {}
51
+ json.each do |prefix, elements|
52
+ propertyURI = elements.fetch("propertyURI", "vocabulary").to_sym
53
+ multipleValues = elements.fetch("multipleValues", "unordered").to_sym
54
+ properties = elements.fetch("properties", {})
55
+ @prefixes[prefix] = Registry.new(prefix, propertyURI, multipleValues, properties)
56
+ end
57
+ end
58
+
59
+ ##
60
+ # True if registry has already been loaded
61
+ def self.loaded?
62
+ @prefixes.is_a?(Hash)
63
+ end
64
+
65
+ ##
66
+ # Initialize registry for a particular prefix URI
67
+ #
68
+ # @param [RDF::URI] prefixURI
69
+ # @param [#to_sym] propertyURI (:vocabulary)
70
+ # @param [#to_sym] multipleValues (:unordered)
71
+ # @param [Hash] properties ({})
72
+ def initialize(prefixURI, propertyURI = :vocabulary, multipleValues = :unordered, properties = {})
73
+ @scheme = propertyURI.to_sym
74
+ @multipleValues = multipleValues.to_sym
75
+ @properties = properties
76
+ if @scheme == :vocabulary
77
+ @property_base = prefixURI.to_s
78
+ @property_base += '#' unless %w(/ #).include?(@property_base[-1]) # Append a '#' for fragment if necessary
79
+ else
80
+ @property_base = 'http://www.w3.org/ns/md?type='
81
+ end
82
+ end
83
+
84
+ ##
85
+ # Find a registry entry given a type URI
86
+ #
87
+ # @param [RDF::URI] type
88
+ # @return [Registry]
89
+ def self.find(type)
90
+ @prefixes.select do |key, value|
91
+ type.to_s.index(key) == 0
92
+ end.values.first
93
+ end
94
+
95
+ ##
96
+ # Generate a predicateURI given a `name`
97
+ #
98
+ # @param [#to_s] name
99
+ # @param [Hash{}] ec Evaluation Context
100
+ # @return [RDF::URI]
101
+ def predicateURI(name, ec)
102
+ u = RDF::URI(name)
103
+ return u if u.absolute?
104
+
105
+ n = frag_escape(name)
106
+ if ec[:current_type].nil?
107
+ u = RDF::URI(ec[:document_base].to_s)
108
+ u.fragment = frag_escape(name)
109
+ u
110
+ elsif @scheme == :vocabulary
111
+ # If scheme is vocabulary return the URI reference constructed by appending the fragment escaped value of name
112
+ # to current vocabulary, separated by a U+0023 NUMBER SIGN character (#) unless the current vocabulary ends
113
+ # with either a U+0023 NUMBER SIGN character (#) or SOLIDUS U+002F (/).
114
+ RDF::URI(@property_base + n)
115
+ else # @scheme == :contextual
116
+ if ec[:current_type].to_s.index(@property_base) == 0
117
+ # return the concatenation of s, a U+002E FULL STOP character (.) and the fragment-escaped value of name.
118
+ RDF::URI(@property_base + '.' + n)
119
+ else
120
+ # return the concatenation of http://www.w3.org/ns/md?type=, the fragment-escaped value of s,
121
+ # the string &prop=, and the fragment-escaped value of name
122
+ RDF::URI(@property_base + frag_escape(ec[:current_type]) + '?prop=' + n)
123
+ end
124
+ end
125
+ end
126
+
127
+
128
+ ##
129
+ # Turn a predicateURI into a simple token
130
+ # @param [RDF::URI] predicateURI
131
+ # @return [String]
132
+ def tokenize(predicateURI)
133
+ case @scheme
134
+ when :vocabulary
135
+ predicateURI.to_s.sub(@property_base, '')
136
+ when :contextual
137
+ predicateURI.to_s.split('?prop=').last.split('.').last
138
+ end
139
+ end
140
+
141
+ ##
142
+ # Determine if property should be serialized as a list or not
143
+ # @param [RDF::URI] predicateURI
144
+ # @return [Boolean]
145
+ def as_list(predicateURI)
146
+ tok = tokenize(predicateURI)
147
+ if @properties[tok].is_a?(Hash)
148
+ @properties[tok]["multipleValues"].to_sym == :list
149
+ else
150
+ @multipleValues == :list
151
+ end
152
+ end
153
+
154
+ ##
155
+ # Fragment escape a name
156
+ def frag_escape(name)
157
+ name.to_s.gsub(/["#%<>\[\\\]^{|}]/) {|c| '%' + c.unpack('H2' * c.bytesize).join('%').upcase}
158
+ end
159
+ end
160
+
34
161
  ##
35
162
  # Initializes the Microdata reader instance.
36
163
  #
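For the `:vocabulary` scheme, the `Registry` class added above builds a predicate by appending the fragment-escaped property name to a property base. A standalone sketch of that string handling follows. It operates on plain strings rather than `RDF::URI` objects, and `property_base` and `vocabulary_predicate` are hypothetical helper names.

```ruby
# Percent-encode the characters targeted by the fragment-escape step
# (same character class as Registry#frag_escape in the diff).
def frag_escape(name)
  name.to_s.gsub(/["#%<>\[\\\]^{|}]/) do |c|
    '%' + c.unpack('H2' * c.bytesize).join('%').upcase
  end
end

# Append '#' unless the prefix already ends with '/' or '#'.
def property_base(prefix)
  prefix.end_with?('/', '#') ? prefix : prefix + '#'
end

# Predicate for the :vocabulary scheme: property base + escaped name.
def vocabulary_predicate(prefix, name)
  property_base(prefix) + frag_escape(name)
end

puts vocabulary_predicate("http://schema.org/", "name")
# http://schema.org/name
puts vocabulary_predicate("http://microformats.org/profile/hcard", "fn")
# http://microformats.org/profile/hcard#fn
puts frag_escape("a#b")
# a%23b
```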
@@ -38,6 +165,8 @@ module RDF::Microdata
38
165
  # the input stream to read
39
166
  # @param [Hash{Symbol => Object}] options
40
167
  # any additional options
168
+ # @option options [Symbol] :library (:nokogiri)
169
+ # One of :nokogiri or :rexml. If nil/unspecified uses :nokogiri if available, :rexml otherwise.
41
170
  # @option options [Encoding] :encoding (Encoding::UTF_8)
42
171
  # the encoding of the input stream (Ruby 1.9+)
43
172
  # @option options [Boolean] :validate (false)
@@ -48,6 +177,7 @@ module RDF::Microdata
48
177
  # whether to intern all parsed URIs
49
178
  # @option options [#to_s] :base_uri (nil)
50
179
  # the base URI to use when resolving relative URIs
180
+ # @option options [#to_s] :registry_uri (DEFAULT_REGISTRY)
51
181
  # @option options [Array] :debug
52
182
  # Array to place debug messages
53
183
  # @return [reader]
@@ -59,24 +189,43 @@ module RDF::Microdata
59
189
  super do
60
190
  @debug = options[:debug]
61
191
 
62
- @doc = case input
63
- when Nokogiri::HTML::Document, Nokogiri::XML::Document
64
- input
65
- else
66
- # Try to detect charset from input
67
- options[:encoding] ||= input.charset if input.respond_to?(:charset)
68
-
69
- # Otherwise, default is utf-8
70
- options[:encoding] ||= 'utf-8'
192
+ @library = case options[:library]
193
+ when nil
194
+ (defined?(::Nokogiri) && RUBY_PLATFORM != 'java') ? :nokogiri : :rexml
195
+ when :nokogiri, :rexml
196
+ options[:library]
197
+ else
198
+ raise ArgumentError.new("expected :rexml or :nokogiri, but got #{options[:library].inspect}")
199
+ end
71
200
 
72
- add_debug(nil, "base_uri: #{base_uri}")
73
- Nokogiri::HTML.parse(input, base_uri.to_s, options[:encoding])
201
+ require "rdf/microdata/reader/#{@library}"
202
+ @implementation = case @library
203
+ when :nokogiri then Nokogiri
204
+ when :rexml then REXML
74
205
  end
75
-
76
- errors = @doc.errors.reject {|e| e.to_s =~ /Tag (audio|source|track|video|time) invalid/}
206
+ self.extend(@implementation)
207
+
208
+ initialize_html(input, options) rescue raise RDF::ReaderError.new($!.message)
209
+
210
+ if (root.nil? && validate?)
211
+ raise RDF::ReaderError, "Empty Document"
212
+ end
213
+ errors = doc_errors.reject {|e| e.to_s =~ /Tag (audio|source|track|video|time) invalid/}
77
214
  raise RDF::ReaderError, "Syntax errors:\n#{errors}" if !errors.empty? && validate?
78
- raise RDF::ReaderError, "Empty document" if (@doc.nil? || @doc.root.nil?) && validate?
79
215
 
216
+ add_debug(@doc, "library = #{@library}")
217
+
218
+ # Load registry
219
+ unless Registry.loaded?
220
+ registry = options[:registry_uri] || DEFAULT_REGISTRY
221
+ begin
222
+ json = RDF::Util::File.open_file(registry) { |f| JSON.load(f) }
223
+ rescue JSON::ParserError => e
224
+ raise RDF::ReaderError, "Failed to parse registry: #{e.message}"
225
+ end
226
+ Registry.load_registry(json)
227
+ end
228
+
80
229
  if block_given?
81
230
  case block.arity
82
231
  when 0 then instance_eval(&block)
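The `:library` handling in this hunk resolves an optional setting to a concrete parser backend and rejects unknown values. The decision can be sketched as a pure function as below; `select_library` is a hypothetical name, and the `nokogiri_available` keyword stands in for the `defined?(::Nokogiri)` check used in the diff.

```ruby
# Resolve a :library option to a backend symbol.
# nokogiri_available replaces defined?(::Nokogiri) so the sketch can be
# exercised without the gem installed; platform defaults to RUBY_PLATFORM.
def select_library(requested, nokogiri_available:, platform: RUBY_PLATFORM)
  case requested
  when nil
    # No explicit choice: prefer Nokogiri except on JRuby.
    (nokogiri_available && platform != 'java') ? :nokogiri : :rexml
  when :nokogiri, :rexml
    requested
  else
    raise ArgumentError, "expected :rexml or :nokogiri, but got #{requested.inspect}"
  end
end

puts select_library(nil, nokogiri_available: true, platform: "x86_64-linux")  # nokogiri
puts select_library(nil, nokogiri_available: true, platform: "java")          # rexml
puts select_library(:rexml, nokogiri_available: true)                         # rexml
```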
@@ -121,19 +270,19 @@ module RDF::Microdata
121
270
  @bnode_cache[value.to_s] ||= RDF::Node.new(value)
122
271
  end
123
272
 
124
- # Figure out the document path, if it is a Nokogiri::XML::Element or Attribute
273
+ # Figure out the document path, if it is an Element or Attribute
125
274
  def node_path(node)
126
- "<#{base_uri}>" + case node
127
- when Nokogiri::XML::Node then node.display_path
128
- else node.to_s
129
- end
275
+ "<#{base_uri}>#{node.respond_to?(:display_path) ? node.display_path : node}"
130
276
  end
131
277
 
132
278
  # Add debug event to debug array, if specified
133
279
  #
134
- # @param [XML Node, any] node:: XML Node or string for showing context
280
+ # @param [Nokogiri::XML::Node, #to_s] node:: XML Node or string for showing context
135
281
  # @param [String] message::
136
- def add_debug(node, message)
282
+ # @yieldreturn [String] appended to message, to allow for lazy evaluation of message
283
+ def add_debug(node, message = "")
284
+ return unless ::RDF::Microdata.debug? || @debug
285
+ message = message + yield if block_given?
137
286
  puts "#{node_path(node)}: #{message}" if ::RDF::Microdata::debug?
138
287
  @debug << "#{node_path(node)}: #{message}" if @debug.is_a?(Array)
139
288
  end
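The revised `add_debug` accepts a block so that potentially expensive message strings are built only when debugging is active. A minimal standalone sketch of that idea follows; the `DebugLog` class is hypothetical and stands in for the reader's `@debug` array and global debug flag.

```ruby
# Collects debug messages; blocks are evaluated only when enabled.
class DebugLog
  attr_reader :messages

  def initialize(enabled)
    @enabled = enabled
    @messages = []
  end

  # context: any object describing where the message came from.
  # The block, if given, is called only when logging is enabled, so
  # costly interpolation is skipped entirely otherwise.
  def add(context, message = "")
    return unless @enabled
    message += yield if block_given?
    @messages << "#{context}: #{message}"
  end
end

calls = 0
off = DebugLog.new(false)
off.add("node") { calls += 1; "expensive message" }
puts calls  # 0, the block was never run

on = DebugLog.new(true)
on.add("node") { calls += 1; "statement: <s> <p> <o> ." }
puts on.messages.first  # node: statement: <s> <p> <o> .
```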
@@ -153,107 +302,50 @@ module RDF::Microdata
153
302
  # @raise [ReaderError]:: Checks parameter types and raises if they are incorrect if parsing mode is _validate_.
154
303
  def add_triple(node, subject, predicate, object)
155
304
  statement = RDF::Statement.new(subject, predicate, object)
156
- add_debug(node, "statement: #{RDF::NTriples.serialize(statement)}")
305
+ add_debug(node) {"statement: #{RDF::NTriples.serialize(statement)}"}
157
306
  @callback.call(statement)
158
307
  end
159
308
 
160
309
  # Parsing a Microdata document (this is *not* the recursive method)
161
310
  def parse_whole_document(doc, base)
162
- base_el = doc.at_css('html>head>base')
163
- base = base_el.attribute('href').to_s.split('#').first if base_el
164
-
165
- add_debug(doc, "parse_whole_doc: options=#{@options.inspect}")
166
-
167
- if (base)
311
+ base = doc_base(base)
312
+ options[:base_uri] = if (base)
168
313
  # Strip any fragment from base
169
314
  base = base.to_s.split('#').first
170
- base = options[:base_uri] = uri(base)
171
- add_debug(base_el, "parse_whole_doc: base='#{base}'")
315
+ base = uri(base)
172
316
  else
173
317
  base = RDF::URI("")
174
318
  end
175
319
 
176
- # 2. For each a, area, and link element in the Document, run these substeps:
320
+ add_debug(nil) {"parse_whole_doc: base='#{base}'"}
321
+
322
+ ec = {
323
+ :memory => {},
324
+ :current_name => nil,
325
+ :current_type => nil,
326
+ :current_vocabulary => nil,
327
+ :document_base => base,
328
+ }
329
+ items = []
330
+ # 1) For each element that is also a top-level item run the following algorithm:
177
331
  #
178
- # * If the element does not have a rel attribute, then skip this element.
179
- # * If the element does not have an href attribute, then skip this element.
180
- # * If resolving the element's href attribute relative to the element is not successful,
181
- # then skip this element.
182
- doc.css('a, area, link').each do |el|
183
- rel, href = el.attribute('rel'), el.attribute('href')
184
- next unless rel && href
185
- href = uri(href, el.base || base)
186
- add_debug(el, "a: rel=#{rel.inspect}, href=#{href}")
187
-
188
- # Otherwise, split the value of the element's rel attribute on spaces, obtaining list of tokens.
189
- # Coalesce duplicate tokens in list of tokens.
190
- tokens = rel.to_s.split(/\s+/).map do |tok|
191
- # Convert each token in list of tokens that does not contain a U+003A COLON characters (:)
192
- # to ASCII lowercase.
193
- tok =~ /:/ ? tok : tok.downcase
194
- end.uniq
195
-
196
- # If list of tokens contains both the tokens alternate and stylesheet,
197
- # then remove them both and replace them with the single (uppercase) token
198
- # ALTERNATE-STYLESHEET.
199
- if tokens.include?('alternate') && tokens.include?('stylesheet')
200
- tokens = tokens - %w(alternate stylesheet)
201
- tokens << 'ALTERNATE-STYLESHEET'
202
- end
203
-
204
- tokens.each do |tok|
205
- tok_uri = RDF::URI(tok)
206
- if tok !~ /:/
207
- # For each token token in list of tokens that contains no U+003A COLON characters (:),
208
- # generate the following triple:
209
- add_triple(el, base, RDF::XHV[tok.gsub('#', '%23')], href)
210
- elsif tok_uri.absolute?
211
- # For each token token in list of tokens that is an absolute URL, generate the following triple:
212
- add_triple(el, base, tok_uri, href)
213
- end
214
- end
215
- end
216
-
217
- # 3. For each meta element in the Document that has a name attribute and a content attribute,
218
- doc.css('meta[name][content]').each do |el|
219
- name, content = el.attribute('name'), el.attribute('content')
220
- name = name.to_s
221
- name_uri = uri(name, el.base || base)
222
- add_debug(el, "meta: name=#{name.inspect}")
223
- if name !~ /:/
224
- # If the value of the name attribute contains no U+003A COLON characters (:),
225
- # generate the following triple:
226
- add_triple(el, base, RDF::XHV[name.downcase.gsub('#', '%23')], RDF::Literal(content, :language => el.language))
227
- elsif name_uri.absolute?
228
- # If the value of the name attribute contains no U+003A COLON characters (:),
229
- # generate the following triple:
230
- add_triple(el, base, name_uri, RDF::Literal(content, :language => el.language))
231
- end
232
- end
233
-
234
- # 4. For each blockquote and q element in the Document that has a cite attribute that resolves
235
- # successfully relative to the element, generate the following triple:
236
- doc.css('blockquote[cite], q[cite]').each do |el|
237
- object = uri(el.attribute('cite'), el.base || base)
238
- add_debug(el, "blockquote: cite=#{object}")
239
- add_triple(el, base, RDF::DC.source, object)
332
+ # 1) Generate the triples for an item item, using the evaluation context.
333
+ # Let result be the (URI reference or blank node) subject returned.
334
+ # 2) Append result to item list.
335
+ getItems.each do |el|
336
+ result = generate_triples(el, ec)
337
+ items << result
240
338
  end
339
+
340
+ # 2) Generate an RDF Collection list from
341
+ # the ordered list of values. Set value to the value returned from generate an RDF Collection.
342
+ value = generateRDFCollection(root, items)
241
343
 
242
- # 5. Let memory be a mapping of items to subjects, initially empty.
243
- # 6. For each element that is also a top-level microdata item, run the following steps:
244
- # * Generate the triples for the item. Pass a reference to memory as the item/subject list.
245
- # Let result be the subject returned.
246
- # * Generate the following triple:
247
- # subject the document's current address
248
- # predicate http://www.w3.org/1999/xhtml/microdata#item
249
- # object result
250
- memory = {}
251
- doc.css('[itemscope]').
252
- select {|el| !el.has_attribute?('itemprop')}.
253
- each do |el|
254
- object = generate_triples(el, memory)
255
- add_triple(el, base, RDF::MD.item, object)
256
- end
344
+ # 3) Generate the following triple:
345
+ # subject Document base
346
+ # predicate http://www.w3.org/1999/xhtml/microdata#item
347
+ # object value
348
+ add_triple(doc, base, RDF::MD.item, value) if value
257
349
 
258
350
  add_debug(doc, "parse_whole_doc: traversal complete")
259
351
  end
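`parse_whole_document` now collects the top-level items and links them from the document base through a single RDF Collection. The shape of the triples a collection expands to can be sketched without RDF.rb as below. Blank node labels such as `_:l0` and the helper name `collection_triples` are assumptions for illustration; the gem itself delegates this work to `RDF::List`, and the empty-list behavior shown here is this sketch's own choice.

```ruby
RDF_NS    = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_FIRST = RDF_NS + "first"
RDF_REST  = RDF_NS + "rest"
RDF_NIL   = RDF_NS + "nil"

# Expand an ordered array of values into rdf:first / rdf:rest triples.
# Returns [head, triples]; head is rdf:nil for an empty array.
def collection_triples(values)
  return [RDF_NIL, []] if values.empty?
  nodes = values.each_index.map { |i| "_:l#{i}" }
  triples = []
  values.each_with_index do |value, i|
    triples << [nodes[i], RDF_FIRST, value]
    # The last cell points at rdf:nil to terminate the list.
    triples << [nodes[i], RDF_REST, nodes[i + 1] || RDF_NIL]
  end
  [nodes.first, triples]
end

head, triples = collection_triples(["_:item1", "_:item2"])
puts head  # _:l0
triples.each { |s, p, o| puts "#{s} <#{p}> #{o}" }
```

The returned head is what would be used as the object of the `md:item` triple.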
@@ -261,94 +353,119 @@ module RDF::Microdata
261
353
  ##
262
354
  # Generate triples for an item
263
355
  # @param [RDF::Resource] item
264
- # @param [Hash{Nokogiri::XML::Element} => RDF::Resource] memory
265
- # @param [Hash{Symbol => Object}] options
266
- # @option options [RDF::Resource] :fallback_type
267
- # @option options [RDF::Resource] :fallback_name
356
+ # @param [Hash{Symbol => Object}] ec
357
+ # @option ec [Hash{Nokogiri::XML::Element} => RDF::Resource] memory
358
+ # @option ec [RDF::Resource] :current_type
268
359
  # @return [RDF::Resource]
269
- def generate_triples(item, memory, options = {})
270
- fallback_type = options[:fallback_type]
271
- fallback_name = options[:fallback_name]
272
-
273
- # 1. If there is an entry for item in memory, then let subject be the subject of that entry.
360
+ def generate_triples(item, ec = {})
361
+ memory = ec[:memory]
362
+ # 1) If there is an entry for item in memory, then let subject be the subject of that entry.
274
363
  # Otherwise, if item has a global identifier and that global identifier is an absolute URL,
275
364
  # let subject be that global identifier. Otherwise, let subject be a new blank node.
276
- subject = if memory.include?(item)
277
- memory[item][:subject]
365
+ subject = if memory.include?(item.node)
366
+ memory[item.node][:subject]
278
367
  elsif item.has_attribute?('itemid')
279
- u = uri(item.attribute('itemid'), item.base || base_uri)
368
+ uri(item.attribute('itemid'), item.base || base_uri)
280
369
  end || RDF::Node.new
281
- memory[item] ||= {}
370
+ memory[item.node] ||= {}
282
371
 
283
- add_debug(item, "gentrips(2): subject=#{subject.inspect}")
372
+ add_debug(item) {"gentrips(2): subject=#{subject.inspect}, current_type: #{ec[:current_type]}"}
284
373
 
285
- # 2. Add a mapping from item to subject in memory, if there isn't one already.
286
- memory[item][:subject] ||= subject
374
+ # 2) Add a mapping from item to subject in memory, if there isn't one already.
375
+ memory[item.node][:subject] ||= subject
287
376
 
288
- # 3. If item has an item type and that item type is an absolute URL, let type be that item type.
289
- # Otherwise, let type be the empty string.
290
- rdf_type = type = uri(item.attribute('itemtype'))
291
- type = '' unless type.absolute?
377
+ # 3) For each type returned from element.itemType of the element defining the item.
378
+ type = nil
379
+ item.attribute('itemtype').to_s.split(' ').map{|n| uri(n)}.select(&:absolute?).each do |t|
380
+ # 3.1. If type is an absolute URL, generate the following triple:
381
+ type ||= t
382
+ add_triple(item, subject, RDF.type, t)
383
+ end
292
384
 
293
- if type != ''
294
- add_triple(item, subject, RDF.type, type)
295
- # 4.2. If type does not contain a U+0023 NUMBER SIGN character (#), then append a # to type.
296
- type += '#' unless type.to_s.include?('#')
297
- # 4.3. If type does not have a : after its #, append a : to type.
298
- type += ':' unless type.to_s.match(/\#:/)
299
- elsif fallback_type
300
- add_debug(item, "gentrips(5.2): fallback_type=#{fallback_type}, fallback_name=#{fallback_name}")
301
- rdf_type = type = fallback_type
302
- # 5.2. If type does not contain a U+0023 NUMBER SIGN character (#), then append a # to type.
303
- type += '#' unless type.to_s.include?('#')
304
- # 5.3. If type does not have a : after its #, append a : to type.
305
- type += ':' unless type.to_s.match(/\#:/)
306
- # 5.4. If the last character of type is not a :, %20 to type.
307
- type += '%20' unless type.to_s[-1,1] == ':'
308
- # 5.5. Append the fragment-escaped value of fallback name to type.
309
- type += fallback_name.to_s.gsub('#', '%23')
385
+ # 5) If type is not an absolute URL, set it to current type from the Evaluation Context if not empty.
386
+ type ||= ec[:current_type]
387
+ add_debug(item) {"gentrips(5): type=#{type.inspect}"}
388
+
389
+ # 6) If the registry contains a URI prefix that is a character for character match of type up to the length of the
390
+ # URI prefix, set vocab as that URI prefix
391
+ vocab = Registry.find(type)
392
+
393
+ # 7) Otherwise, if type is not empty, construct vocab by removing everything following the last
394
+ # SOLIDUS U+002F ("/") or NUMBER SIGN U+0023 ("#") from type.
395
+ vocab ||= begin
396
+ type_vocab = type.to_s.sub(/([\/\#])[^\/\#]*$/, '\1')
397
+ add_debug(item) {"gentrips(7): type_vocab=#{type_vocab.inspect}"}
398
+ Registry.new(type_vocab) # if type
310
399
  end
311
400
 
312
- add_debug(item, "gentrips(6): type=#{type.inspect}")
313
-
314
- # 6. For each element _element_ that has one or more property names and is one of the
401
+ # 8) Update evaluation context setting current vocabulary to vocab.
402
+ ec[:current_vocabulary] = vocab
403
+
404
+ # 9) Set property list to an empty mapping between properties and one or more ordered values as established below.
405
+ property_list = {}
406
+
407
+ # 10. For each element _element_ that has one or more property names and is one of the
315
408
  # properties of the item _item_, in the order those elements are given by the algorithm
316
409
  # that returns the properties of an item, run the following substep:
317
410
  props = item_properties(item)
-
-      # 6.1. For each name name in element's property names, run the following substeps:
+      # 10.1. For each name name in element's property names, run the following substeps:
       props.each do |element|
-        element.attribute('itemprop').to_s.split(' ').each do |name|
-          add_debug(element, "gentrips(6.1): name=#{name.inspect}")
-          # If type is the empty string and name is not an absolute URL, then abort these substeps.
-          name_uri = RDF::URI(name)
-          next if type == '' && !name_uri.absolute?
-
+        element.attribute('itemprop').to_s.split(' ').compact.each do |name|
+          add_debug(element) {"gentrips(10.1): name=#{name.inspect}, type=#{type}"}
+          # Let context be a copy of evaluation context with current type set to type and current vocabulary set to vocab.
+          ec_new = ec.merge({:current_type => type, :current_vocabulary => vocab})
+
+          predicate = vocab.predicateURI(name, ec_new)
+          ec_new[:current_name] = predicate
+          add_debug(element) {"gentrips(10.1.2): predicate=#{predicate}"}
+
+          # 10.1.3) Let value be the property value of element.
           value = property_value(element)
-          add_debug(element, "gentrips(6.1.2) value=#{value.inspect}")
+          add_debug(element) {"gentrips(10.1.3) value=#{value.inspect}"}
 
+          # 10.1.4) If value is an item, then generate the triples for value using a copy of evaluation context with
+          # current type set to type. Replace value by the subject returned from those steps.
           if value.is_a?(Hash)
-            value = generate_triples(element, memory, :fallback_type => type, :fallback_name => name)
+            value = generate_triples(element, ec_new)
+            add_debug(element) {"gentrips(10.1.4): value=#{value.inspect}"}
           end
-
-          add_debug(element, "gentrips(6.1.3): value=#{value.inspect}")
 
-          predicate = if name_uri.absolute?
-            name_uri
-          else
-            # Use the URI of the type to create URIs for @itemprop terms
-            add_debug(element, "gentrips: rdf_type=#{rdf_type}")
-            predicate = RDF::URI(rdf_type.to_s.sub(/([\/\#])[^\/\#]*$/, '\1' + name))
-          end
-          add_debug(element, "gentrips(6.1.5): predicate=#{predicate}")
-
-          add_triple(element, subject, predicate, value) if predicate
+          property_list[predicate] ||= []
+          property_list[predicate] << value
         end
       end
 
+      # 11) For each predicate in property list
+      property_list.each do |predicate, values|
+        generatePropertyValues(item, subject, predicate, values, ec)
+      end
+
       subject
     end
 
+    def generatePropertyValues(element, subject, predicate, values, ec)
+      registry = ec[:current_vocabulary]
+      if registry.as_list(predicate)
+        value = generateRDFCollection(element, values)
+        add_triple(element, subject, predicate, value)
+      else
+        values.each {|v| add_triple(element, subject, predicate, v)}
+      end
+    end
+
+    ##
+    # Called when values has more than one entry
+    # @param [Nokogiri::HTML::Element] element
+    # @param [Array<RDF::Value>] values
+    # @return [RDF::Node]
+    def generateRDFCollection(element, values)
+      list = RDF::List.new(nil, nil, values)
+      list.each_statement do |st|
+        add_triple(element, st.subject, st.predicate, st.object) unless st.object == RDF.List
+      end
+      list.subject
+    end
+
     ##
     # To find the properties of an item defined by the element root, the user agent must try
     # to crawl the properties of the element root, with an empty list as the value of memory:
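The hunk above replaces immediate triple emission with a two-phase approach: values are first collected per predicate in a property list, and each predicate is then emitted either as one triple per value or as a single RDF collection. The following is a minimal plain-Ruby sketch of that idea, not the gem's code. It assumes triples are plain arrays and blank nodes are strings; `group_by_predicate`, `build_collection`, `emit_property_values`, and the `list_predicates` argument are hypothetical stand-ins for `property_list`, `generateRDFCollection`, `generatePropertyValues`, and the registry's `as_list` lookup.

```ruby
# Hypothetical stand-ins: triples are [subject, predicate, object] arrays and
# blank nodes are strings such as "_:l00". This sketch does not use RDF.rb.
RDF_FIRST = "rdf:first"
RDF_REST  = "rdf:rest"
RDF_NIL   = "rdf:nil"

# Collect [predicate, value] pairs into a property list keyed by predicate,
# keeping values in the order they were seen.
def group_by_predicate(pairs)
  pairs.each_with_object({}) do |(predicate, value), property_list|
    property_list[predicate] ||= []
    property_list[predicate] << value
  end
end

# Build the rdf:first / rdf:rest chain for values.
# Returns [head_node, triples]; an empty list is represented by rdf:nil.
def build_collection(values, label)
  return [RDF_NIL, []] if values.empty?
  nodes = values.each_index.map { |i| "_:#{label}#{i}" }
  triples = []
  values.each_with_index do |value, i|
    triples << [nodes[i], RDF_FIRST, value]
    triples << [nodes[i], RDF_REST, nodes[i + 1] || RDF_NIL]
  end
  [nodes.first, triples]
end

# Emit one collection for predicates flagged as lists, otherwise one triple
# per value.
def emit_property_values(subject, property_list, list_predicates)
  triples = []
  property_list.each do |predicate, values|
    if list_predicates.include?(predicate)
      head, list_triples = build_collection(values, "l#{triples.size}")
      triples.concat(list_triples)
      triples << [subject, predicate, head]
    else
      values.each { |v| triples << [subject, predicate, v] }
    end
  end
  triples
end

pairs = [["ex:name", "Alice"], ["ex:step", "mix"], ["ex:step", "bake"]]
property_list = group_by_predicate(pairs)
triples = emit_property_values("_:item", property_list, ["ex:step"])
```

Deferring emission until all values for a predicate are known is what makes the list form possible: a collection cannot be closed with `rdf:nil` until the last value has been seen.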
@@ -378,13 +495,14 @@ module RDF::Microdata
     # @return [Array<Array<Nokogiri::XML::Element>, Integer>]
     #   Resultant elements and error count
     def crawl_properties(root, memory)
+
       # 1. If root is in memory, then the algorithm fails; abort these steps.
       raise CrawlFailure, "crawl_props mem already has #{root.inspect}" if memory.include?(root)
 
       # 2. Collect all the elements in the item root; let results be the resulting
       # list of elements, and errors be the resulting count of errors.
       results, errors = elements_in_item(root)
-      add_debug(root, "crawl_properties results=#{results.inspect}, errors=#{errors}")
+      add_debug(root) {"crawl_properties results=#{results.map {|e| node_path(e)}.inspect}, errors=#{errors}"}
 
       # 3. Remove any elements from results that do not have an itemprop attribute specified.
       results = results.select {|e| e.has_attribute?('itemprop')}
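Throughout this diff, calls of the form `add_debug(node, "...")` become `add_debug(node) {"..."}`. The definition of `add_debug` is not shown here, so the motivation is an assumption: passing a block defers building the message, so interpolation such as `results.inspect` can be skipped entirely when debugging is off. A minimal sketch of that pattern, using a hypothetical `DebugLog` class rather than the gem's reader:

```ruby
# Hypothetical logger illustrating block-based (lazy) debug messages.
# It is not the gem's add_debug; it only shows why a block can be cheaper.
class DebugLog
  attr_reader :messages

  def initialize(enabled)
    @enabled = enabled
    @messages = []
  end

  # Accepts an eager message, a block, or both. The block is called only
  # when logging is enabled.
  def add_debug(node, message = "")
    return unless @enabled
    message += yield if block_given?
    @messages << "#{node}: #{message}"
  end
end

calls = 0
expensive = lambda { calls += 1; "results=[...]" }

quiet = DebugLog.new(false)
quiet.add_debug("div") { expensive.call }   # block is never evaluated

verbose = DebugLog.new(true)
verbose.add_debug("div") { expensive.call } # block is evaluated once
```

With the eager string form, the message would be built on both calls even though the first logger discards it.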
@@ -427,13 +545,13 @@ module RDF::Microdata
       # If root has an itemref attribute, split the value of that itemref attribute on spaces.
       # For each resulting token ID,
       root.attribute('itemref').to_s.split(' ').each do |id|
-        add_debug(root, "elements_in_item itemref id #{id}")
+        add_debug(root) {"elements_in_item itemref id #{id}"}
         # if there is an element in the home subtree of root with the ID ID,
         # then add the first such element to pending.
-        id_elem = @doc.at_css("##{id}")
+        id_elem = find_element_by_id(id)
         pending << id_elem if id_elem
       end
-      add_debug(root, "elements_in_item pending #{pending.inspect}")
+      add_debug(root) {"elements_in_item pending #{pending.inspect}"}
 
       # Loop: Remove an element from pending and let current be that element.
       while current = pending.shift
@@ -457,37 +575,42 @@ module RDF::Microdata
     ##
     #
     def property_value(element)
-      add_debug(element, "property_value(#{element.inspect}): base #{element.base.inspect}, base_uri: #{base_uri.inspect}")
-      case
+      base = element.base || base_uri
+      add_debug(element) {"property_value(#{element.name}): base #{base.inspect}"}
+      value = case
       when element.has_attribute?('itemscope')
         {}
       when element.name == 'meta'
-        element.attribute('content').to_s
+        RDF::Literal.new(element.attribute('content').to_s, :language => element.language)
+      when element.name == 'data'
+        RDF::Literal.new(element.attribute('value').to_s, :language => element.language)
       when %w(audio embed iframe img source track video).include?(element.name)
-        uri(element.attribute('src'), element.base || base_uri)
+        uri(element.attribute('src'), base)
       when %w(a area link).include?(element.name)
-        uri(element.attribute('href'), element.base || base_uri)
+        uri(element.attribute('href'), base)
       when %w(object).include?(element.name)
-        uri(element.attribute('data'), element.base || base_uri)
-      when %w(time).include?(element.name) && element.has_attribute?('datetime')
+        uri(element.attribute('data'), base)
+      when %w(time).include?(element.name)
         # Lexically scan value and assign appropriate type, otherwise, leave untyped
-        v = element.attribute('datetime').to_s
-        datatype = %w(Date Time DateTime).map {|t| RDF::Literal.const_get(t)}.detect do |dt|
+        v = (element.attribute('datetime') || element.text).to_s
+        datatype = %w(Date Time DateTime Duration).map {|t| RDF::Literal.const_get(t)}.detect do |dt|
           v.match(dt::GRAMMAR)
         end || RDF::Literal
-        datatype.new(v)
+        datatype.new(v, :language => element.language)
       else
-        RDF::Literal.new(element.text, :language => element.language)
+        RDF::Literal.new(element.inner_text, :language => element.language)
       end
+      add_debug(element) {" #{value.inspect}"}
+      value
     end
 
     # Fixme, what about xml:base relative to element?
     def uri(value, base = nil)
       value = if base
         base = uri(base) unless base.is_a?(RDF::URI)
-        base.join(value)
+        base.join(value.to_s)
       else
-        RDF::URI(value)
+        RDF::URI(value.to_s)
       end
       value.validate! if validate?
       value.canonicalize! if canonicalize?