rdf-microdata 0.2.2 → 0.2.3
- data/README +21 -45
- data/VERSION +1 -1
- data/etc/doap.html +42 -0
- data/etc/registry.json +39 -0
- data/lib/rdf/microdata.rb +0 -2
- data/lib/rdf/microdata/reader.rb +316 -193
- data/lib/rdf/microdata/reader/nokogiri.rb +232 -0
- data/lib/rdf/microdata/reader/rexml.rb +277 -0
- data/lib/rdf/microdata/vocab.rb +1 -1
- metadata +58 -21
- data/lib/rdf/microdata/extensions.rb +0 -34
data/README
CHANGED
@@ -6,13 +6,20 @@
|
|
6
6
|
RDF::Microdata is a Microdata reader for Ruby using the [RDF.rb][RDF.rb] library suite.
|
7
7
|
|
8
8
|
## FEATURES
|
9
|
-
RDF::Microdata parses [Microdata][] into statements or triples.
|
9
|
+
RDF::Microdata parses [Microdata][] into statements or triples using the rules defined in [Microdata RDF][].
|
10
10
|
|
11
11
|
* Microdata parser.
|
12
|
-
* Uses Nokogiri for parsing HTML
|
12
|
+
* Uses Nokogiri for parsing HTML/SVG if available; falls back to REXML otherwise (and on JRuby)
|
13
13
|
|
14
14
|
Install with 'gem install rdf-microdata'
|
15
15
|
|
16
|
+
### Living implementation
|
17
|
+
Microdata to RDF transformation is undergoing active development. This implementation attempts to be up-to-date
|
18
|
+
as of the time of release, and is being used in developing the [Microdata RDF][] specification
|
19
|
+
|
20
|
+
### Microdata Registry
|
21
|
+
The parser uses a built-in version of the [Microdata RDF][] registry.
|
22
|
+
|
16
23
|
## Usage
|
17
24
|
|
18
25
|
### Reading RDF data in the Microdata format
|
@@ -20,49 +27,14 @@ Install with 'gem install rdf-microdata'
|
|
20
27
|
graph = RDF::Graph.load("etc/foaf.html", :format => :microdata)
|
21
28
|
|
22
29
|
## Note
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
### Generating RDF friendly URIs from terms
|
28
|
-
If the `@itemprop` is included within an item having an `@itemtype`,
|
29
|
-
the URI of the `@itemtype` will be used for generating a term URI. The type URI will be trimmed following
|
30
|
-
the last '#' or '/' character, and the term will be appended to the resulting URI. This is in keeping
|
31
|
-
with standard convention for defining properties and classes within an RDFS or OWL vocabulary.
|
32
|
-
|
33
|
-
For example:
|
34
|
-
|
35
|
-
<div itemscope itemtype="http://schema.org/Person">
|
36
|
-
My name is <span itemprop="name">Gregg</span>
|
37
|
-
</div>
|
38
|
-
|
39
|
-
Without the `:rdf\_terms` option, this would create the following statements:
|
40
|
-
|
41
|
-
@prefix md: <http://www.w3.org/1999/xhtml/microdata#> .
|
42
|
-
@prefix schema: <http://schema.org/> .
|
43
|
-
<> md:item [
|
44
|
-
a schema:Person;
|
45
|
-
<http://www.w3.org/1999/xhtml/microdata#http://schema.org/Person%23:name> "Gregg"
|
46
|
-
] .
|
47
|
-
|
48
|
-
With the `:rdf\_terms` option, this becomes:
|
49
|
-
|
50
|
-
@prefix md: <http://www.w3.org/1999/xhtml/microdata#> .
|
51
|
-
@prefix schema: <http://schema.org/> .
|
52
|
-
<> md:item [ a schema:Person; schema:name "Gregg" ] .
|
53
|
-
|
54
|
-
### Improve xsd:date, xsd:time, xsd:dateTime and xsd:duration generation from _time_ element
|
55
|
-
|
56
|
-
Use the lexical form of the @datetime attribute of the _time_ element to determine the specific type
|
57
|
-
of the generated literal.
|
58
|
-
|
59
|
-
### Remove implicit RDF triple generation
|
60
|
-
|
61
|
-
html>head>title and anchor (_a_) elements no longer generate triples without @item* properties
|
62
|
-
|
30
|
+
This spec is based on the W3C HTML Data Task Force specification and does not support
|
31
|
+
GRDDL-type triple generation, such as for html>head>title and <a>
|
32
|
+
|
63
33
|
## Dependencies
|
64
34
|
* [RDF.rb](http://rubygems.org/gems/rdf) (>= 0.3.4)
|
65
|
-
* [
|
35
|
+
* [RDF::XSD](http://rubygems.org/gems/rdf-xsd) (>= 0.3.4)
|
36
|
+
* [HTMLEntities](https://rubygems.org/gems/htmlentities) (>= 4.3.0)
|
37
|
+
* Soft dependency on [Nokogiri](http://rubygems.org/gems/nokogiri) (>= 1.5.0)
|
66
38
|
|
67
39
|
## Documentation
|
68
40
|
Full documentation available on [Rubydoc.info][Microdata doc]
|
@@ -71,6 +43,8 @@ Full documentation available on [Rubydoc.info][Microdata doc]
|
|
71
43
|
* {RDF::Microdata::Format}
|
72
44
|
Asserts :html format, text/html mime-type and .html file extension.
|
73
45
|
* {RDF::Microdata::Reader}
|
46
|
+
* {RDF::Microdata::Reader::Nokogiri}
|
47
|
+
* {RDF::Microdata::Reader::REXML}
|
74
48
|
|
75
49
|
### Additional vocabularies
|
76
50
|
|
@@ -81,8 +55,9 @@ Full documentation available on [Rubydoc.info][Microdata doc]
|
|
81
55
|
## Resources
|
82
56
|
* [RDF.rb][RDF.rb]
|
83
57
|
* [Documentation](http://rdf.rubyforge.org/microdata)
|
84
|
-
* [History](file:
|
58
|
+
* [History](file:History.md)
|
85
59
|
* [Microdata][]
|
60
|
+
* [Microdata RDF][]
|
86
61
|
|
87
62
|
## Author
|
88
63
|
* [Gregg Kellogg](http://github.com/gkellogg) - <http://kellogg-assoc.com/>
|
@@ -117,5 +92,6 @@ see <http://unlicense.org/> or the accompanying {file:UNLICENSE} file.
|
|
117
92
|
[YARD]: http://yardoc.org/
|
118
93
|
[YARD-GS]: http://rubydoc.info/docs/yard/file/docs/GettingStarted.md
|
119
94
|
[PDD]: http://lists.w3.org/Archives/Public/public-rdf-ruby/2010May/0013.html
|
120
|
-
[Microdata]: http://
|
95
|
+
[Microdata]: http://dev.w3.org/html5/md/Overview.html "HTML Microdata"
|
96
|
+
[Microdata RDF]: https://dvcs.w3.org/hg/htmldata/raw-file/default/microdata-rdf/index.html "Microdata to RDF"
|
121
97
|
[Microdata doc]: http://rubydoc.info/github/gkellogg/rdf-microdata/frames
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.2.
|
1
|
+
0.2.3
|
data/etc/doap.html
ADDED
@@ -0,0 +1,42 @@
|
|
1
|
+
<!DOCTYPE html>
|
2
|
+
<html itemscope itemid="http://rubygems.org/gems/rdf-microdata" itemtype="http://usefulinc.com/ns/doap#Project">
|
3
|
+
<head>
|
4
|
+
<title lang="en" itemprop="shortdesc">Microdata reader for Ruby.</title>
|
5
|
+
</head>
|
6
|
+
<body about="" typeof="Project">
|
7
|
+
<p>Project description for <span itemprop="name">RDF::Microdata</span>.</p>
|
8
|
+
<p lang="en" itemprop="description">
|
9
|
+
RDF::Microdata is a Microdata reader for Ruby using the RDF.rb library suite.
|
10
|
+
</p>
|
11
|
+
<dl>
|
12
|
+
<dt>Creator</dt><dd>
|
13
|
+
<a itemprop="http://purl.org/dc/terms/creator developer documenter maintainer http://xmlns.com/foaf/0.1/creator" href="http://greggkellogg.net/foaf#me"
|
14
|
+
>Gregg Kellogg</a>
|
15
|
+
</dd>
|
16
|
+
<dt>Created</dt><dd><time itemprop="created" datetime="2011-08-29"/></dd>
|
17
|
+
<dt>Blog</dt><dd><a href="http://greggkellogg.net/" itemprop="blog">http://greggkellogg.net/</a></dd>
|
18
|
+
<dt>Bug DB</dt><dd>
|
19
|
+
<a href="http://github.com/gkellogg/rdf-microdata/issues" itemprop="bug-database">
|
20
|
+
http://github.com/gkellogg/rdf-microdata/issues
|
21
|
+
</a>
|
22
|
+
</dd>
|
23
|
+
<dt>Category</dt><dd itemprop="category">
|
24
|
+
<a href="http://dbpedia.org/resource/Resource_Description_Framework">Resource Description Framework</a>
|
25
|
+
for
|
26
|
+
<a itemprop="programming-language" href="http://dbpedia.org/resource/Ruby_(programming_language)">Ruby</a>
|
27
|
+
</dd>
|
28
|
+
<dt>Download</dt><dd><a href="http://rubygems.org/gems/rdf-microdata" itemprop="download-page">
|
29
|
+
http://rubygems.org/gems/rdf-microdata
|
30
|
+
</a></dd>
|
31
|
+
<dt>Home Page</dt><dd><a href="http://github.com/gkellogg/rdf-microdata" itemprop="homepage">
|
32
|
+
http://github.com/gkellogg/rdf-microdata
|
33
|
+
</a></dd>
|
34
|
+
<dt>License</dt><dd>
|
35
|
+
<a href="http://creativecommons.org/licenses/publicdomain/" itemprop="license">Public Domain</a>
|
36
|
+
</dd>
|
37
|
+
<dt>Mailing List</dt><dd><a href="http://lists.w3.org/Archives/Public/public-rdf-ruby/" itemprop="mailing-list">
|
38
|
+
http://lists.w3.org/Archives/Public/public-rdf-ruby/
|
39
|
+
</a></dd>
|
40
|
+
</dl>
|
41
|
+
</body>
|
42
|
+
</html>
|
data/etc/registry.json
ADDED
@@ -0,0 +1,39 @@
|
|
1
|
+
{
|
2
|
+
"http://schema.org/": {
|
3
|
+
"propertyURI": "vocabulary",
|
4
|
+
"multipleValues": "unordered",
|
5
|
+
"properties": {
|
6
|
+
"blogPosts": {"multipleValues": "list"},
|
7
|
+
"breadcrumb": {"multipleValues": "list"},
|
8
|
+
"byArtist": {"multipleValues": "list"},
|
9
|
+
"creator": {"multipleValues": "list"},
|
10
|
+
"episodes": {"multipleValues": "list"},
|
11
|
+
"events": {"multipleValues": "list"},
|
12
|
+
"founders": {"multipleValues": "list"},
|
13
|
+
"itemListElement": {"multipleValues": "list"},
|
14
|
+
"musicGroupMember": {"multipleValues": "list"},
|
15
|
+
"performerIn": {"multipleValues": "list"},
|
16
|
+
"performers": {"multipleValues": "list"},
|
17
|
+
"producer": {"multipleValues": "list"},
|
18
|
+
"recipeInstructions": {"multipleValues": "list"},
|
19
|
+
"seasons": {"multipleValues": "list"},
|
20
|
+
"subEvents": {"multipleValues": "list"},
|
21
|
+
"tracks": {"multipleValues": "list"}
|
22
|
+
}
|
23
|
+
},
|
24
|
+
"http://microformats.org/profile/hcard": {
|
25
|
+
"propertyURI": "vocabulary",
|
26
|
+
"multipleValues": "unordered"
|
27
|
+
},
|
28
|
+
"http://microformats.org/profile/hcalendar#": {
|
29
|
+
"propertyURI": "vocabulary",
|
30
|
+
"multipleValues": "unordered",
|
31
|
+
"properties": {
|
32
|
+
"categories": {"multipleValues": "list"}
|
33
|
+
}
|
34
|
+
},
|
35
|
+
"http://n.whatwg.org/work": {
|
36
|
+
"propertyURI": "contextual",
|
37
|
+
"multipleValues": "list"
|
38
|
+
}
|
39
|
+
}
|
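As a rough illustration of how a registry shaped like the file above could be consulted, the sketch below uses only Ruby's standard `json` library. The helper names `find_entry` and `list_valued?` are hypothetical and are not part of the gem; the inline JSON is a shortened copy of the registry, and the prefix match is an assumption modelled on the "character for character match" rule quoted in the reader comments.

```ruby
require 'json'

# Shortened copy of the registry data, inlined so the sketch is self-contained.
REGISTRY = JSON.parse(<<~JSON)
  {
    "http://schema.org/": {
      "propertyURI": "vocabulary",
      "multipleValues": "unordered",
      "properties": {
        "tracks": {"multipleValues": "list"}
      }
    },
    "http://n.whatwg.org/work": {
      "propertyURI": "contextual",
      "multipleValues": "list"
    }
  }
JSON

# Hypothetical helper: return the first registry entry whose key is a
# prefix of the given type URI, or nil when no prefix matches.
def find_entry(type)
  _prefix, entry = REGISTRY.find { |prefix, _| type.to_s.start_with?(prefix) }
  entry
end

# Hypothetical helper: should multiple values of `name` be emitted as a list?
# A per-property setting overrides the vocabulary-wide default.
def list_valued?(type, name)
  entry = find_entry(type)
  return false if entry.nil?
  override = entry.fetch("properties", {})[name]
  mode = override ? override["multipleValues"] : entry.fetch("multipleValues", "unordered")
  mode == "list"
end
```

Under these assumptions, `list_valued?("http://schema.org/MusicPlaylist", "tracks")` is true because of the per-property override, while other schema.org properties fall back to the vocabulary default of `unordered`.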
data/lib/rdf/microdata.rb
CHANGED
data/lib/rdf/microdata/reader.rb
CHANGED
@@ -1,24 +1,33 @@
|
|
1
|
-
|
1
|
+
begin
|
2
|
+
raise LoadError, "not with java" if RUBY_PLATFORM == "java"
|
3
|
+
require 'nokogiri'
|
4
|
+
rescue LoadError => e
|
5
|
+
:rexml
|
6
|
+
end
|
7
|
+
require 'rdf/xsd'
|
8
|
+
require 'json'
|
2
9
|
|
3
10
|
module RDF::Microdata
|
4
11
|
##
|
5
12
|
# A Microdata parser in Ruby
|
6
13
|
#
|
7
14
|
# Based on processing rules, amended with the following:
|
8
|
-
# * property generation from tokens now uses the associated @itemtype as the basis for generation
|
9
|
-
# * implicit triples are not generated, only those with @item*
|
10
|
-
# * @datetime values are scanned lexically to find appropriate datatype
|
11
15
|
#
|
12
|
-
# @see
|
16
|
+
# @see https://dvcs.w3.org/hg/htmldata/raw-file/0d6b89f5befb/microdata-rdf/index.html
|
13
17
|
# @author [Gregg Kellogg](http://kellogg-assoc.com/)
|
14
18
|
class Reader < RDF::Reader
|
15
19
|
format Format
|
16
|
-
XHTML = "http://www.w3.org/1999/xhtml"
|
17
20
|
URL_PROPERTY_ELEMENTS = %w(a area audio embed iframe img link object source track video)
|
21
|
+
DEFAULT_REGISTRY = File.expand_path(File.join(File.dirname(__FILE__), "..", "..", "..", "etc", "registry.json"))
|
18
22
|
|
19
23
|
class CrawlFailure < StandardError #:nodoc:
|
20
24
|
end
|
21
25
|
|
26
|
+
# Returns the HTML implementation module for this reader instance.
|
27
|
+
#
|
28
|
+
# @attr_reader [Module]
|
29
|
+
attr_reader :implementation
|
30
|
+
|
22
31
|
##
|
23
32
|
# Returns the base URI determined by this reader.
|
24
33
|
#
|
@@ -31,6 +40,124 @@ module RDF::Microdata
|
|
31
40
|
@options[:base_uri]
|
32
41
|
end
|
33
42
|
|
43
|
+
# Interface to registry
|
44
|
+
class Registry
|
45
|
+
##
|
46
|
+
# Initialize the registry from parsed JSON registry data
|
47
|
+
#
|
48
|
+
# @param [Hash] json
|
49
|
+
def self.load_registry(json)
|
50
|
+
@prefixes = {}
|
51
|
+
json.each do |prefix, elements|
|
52
|
+
propertyURI = elements.fetch("propertyURI", "vocabulary").to_sym
|
53
|
+
multipleValues = elements.fetch("multipleValues", "unordered").to_sym
|
54
|
+
properties = elements.fetch("properties", {})
|
55
|
+
@prefixes[prefix] = Registry.new(prefix, propertyURI, multipleValues, properties)
|
56
|
+
end
|
57
|
+
end
|
58
|
+
|
59
|
+
##
|
60
|
+
# True if registry has already been loaded
|
61
|
+
def self.loaded?
|
62
|
+
@prefixes.is_a?(Hash)
|
63
|
+
end
|
64
|
+
|
65
|
+
##
|
66
|
+
# Initialize registry for a particular prefix URI
|
67
|
+
#
|
68
|
+
# @param [RDF::URI] prefixURI
|
69
|
+
# @param [#to_sym] propertyURI (:vocabulary)
|
70
|
+
# @param [#to_sym] multipleValues (:unordered)
|
71
|
+
# @param [Hash] properties ({})
|
72
|
+
def initialize(prefixURI, propertyURI = :vocabulary, multipleValues = :unordered, properties = {})
|
73
|
+
@scheme = propertyURI.to_sym
|
74
|
+
@multipleValues = multipleValues.to_sym
|
75
|
+
@properties = properties
|
76
|
+
if @scheme == :vocabulary
|
77
|
+
@property_base = prefixURI.to_s
|
78
|
+
@property_base += '#' unless %w(/ #).include?(@property_base[-1]) # Append a '#' for fragment if necessary
|
79
|
+
else
|
80
|
+
@property_base = 'http://www.w3.org/ns/md?type='
|
81
|
+
end
|
82
|
+
end
|
83
|
+
|
84
|
+
##
|
85
|
+
# Find a registry entry given a type URI
|
86
|
+
#
|
87
|
+
# @param [RDF::URI] type
|
88
|
+
# @return [Registry]
|
89
|
+
def self.find(type)
|
90
|
+
@prefixes.select do |key, value|
|
91
|
+
type.to_s.index(key) == 0
|
92
|
+
end.values.first
|
93
|
+
end
|
94
|
+
|
95
|
+
##
|
96
|
+
# Generate a predicateURI given a `name`
|
97
|
+
#
|
98
|
+
# @param [#to_s] name
|
99
|
+
# @param [Hash{}] ec Evaluation Context
|
100
|
+
# @return [RDF::URI]
|
101
|
+
def predicateURI(name, ec)
|
102
|
+
u = RDF::URI(name)
|
103
|
+
return u if u.absolute?
|
104
|
+
|
105
|
+
n = frag_escape(name)
|
106
|
+
if ec[:current_type].nil?
|
107
|
+
u = RDF::URI(ec[:document_base].to_s)
|
108
|
+
u.fragment = frag_escape(name)
|
109
|
+
u
|
110
|
+
elsif @scheme == :vocabulary
|
111
|
+
# If scheme is vocabulary return the URI reference constructed by appending the fragment escaped value of name
|
112
|
+
# to current vocabulary, separated by a U+0023 NUMBER SIGN character (#) unless the current vocabulary ends
|
113
|
+
# with either a U+0023 NUMBER SIGN character (#) or SOLIDUS U+002F (/).
|
114
|
+
RDF::URI(@property_base + n)
|
115
|
+
else # @scheme == :contextual
|
116
|
+
if ec[:current_type].to_s.index(@property_base) == 0
|
117
|
+
# return the concatenation of s, a U+002E FULL STOP character (.) and the fragment-escaped value of name.
|
118
|
+
RDF::URI(@property_base + '.' + n)
|
119
|
+
else
|
120
|
+
# return the concatenation of http://www.w3.org/ns/md?type=, the fragment-escaped value of s,
|
121
|
+
# the string &prop=, and the fragment-escaped value of name
|
122
|
+
RDF::URI(@property_base + frag_escape(ec[:current_type]) + '?prop=' + n)
|
123
|
+
end
|
124
|
+
end
|
125
|
+
end
|
126
|
+
|
127
|
+
|
128
|
+
##
|
129
|
+
# Turn a predicateURI into a simple token
|
130
|
+
# @param [RDF::URI] predicateURI
|
131
|
+
# @return [String]
|
132
|
+
def tokenize(predicateURI)
|
133
|
+
case @scheme
|
134
|
+
when :vocabulary
|
135
|
+
predicateURI.to_s.sub(@property_base, '')
|
136
|
+
when :contextual
|
137
|
+
predicateURI.to_s.split('?prop=').last.split('.').last
|
138
|
+
end
|
139
|
+
end
|
140
|
+
|
141
|
+
##
|
142
|
+
# Determine if property should be serialized as a list or not
|
143
|
+
# @param [RDF::URI] predicateURI
|
144
|
+
# @return [Boolean]
|
145
|
+
def as_list(predicateURI)
|
146
|
+
tok = tokenize(predicateURI)
|
147
|
+
if @properties[tok].is_a?(Hash)
|
148
|
+
@properties[tok]["multipleValues"].to_sym == :list
|
149
|
+
else
|
150
|
+
@multipleValues == :list
|
151
|
+
end
|
152
|
+
end
|
153
|
+
|
154
|
+
##
|
155
|
+
# Fragment escape a name
|
156
|
+
def frag_escape(name)
|
157
|
+
name.to_s.gsub(/["#%<>\[\\\]^{|}]/) {|c| '%' + c.unpack('H2' * c.bytesize).join('%').upcase}
|
158
|
+
end
|
159
|
+
end
|
160
|
+
|
34
161
|
##
|
35
162
|
# Initializes the Microdata reader instance.
|
36
163
|
#
|
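The `vocabulary` scheme handled by the `Registry` class above can be sketched with plain strings. This is a simplified illustration, not the gem's code: `property_base` and `vocabulary_predicate` are hypothetical helper names, and the scheme-prefix regular expression is an assumed stand-in for the `RDF::URI#absolute?` check used in the diff. `frag_escape` copies the expression from the diff.

```ruby
# Percent-encode the characters that may not appear in a URI fragment,
# using the same expression as the diff above.
def frag_escape(name)
  name.to_s.gsub(/["#%<>\[\\\]^{|}]/) { |c| '%' + c.unpack('H2' * c.bytesize).join('%').upcase }
end

# Hypothetical helper: append '#' unless the prefix already ends in '/' or '#'.
def property_base(prefix)
  base = prefix.to_s
  base.end_with?('/', '#') ? base : base + '#'
end

# Hypothetical helper: build a predicate URI string for the vocabulary scheme.
# Names that already look like absolute URIs are returned unchanged
# (assumption: a leading "scheme:" is treated as absolute).
def vocabulary_predicate(prefix, name)
  return name.to_s if name.to_s =~ /\A[a-zA-Z][a-zA-Z0-9+.\-]*:/
  property_base(prefix) + frag_escape(name)
end

puts vocabulary_predicate("http://schema.org/", "name")
# prints http://schema.org/name
puts vocabulary_predicate("http://microformats.org/profile/hcard", "fn")
# prints http://microformats.org/profile/hcard#fn
```

The contextual scheme is left out of the sketch because it also depends on the current type held in the evaluation context.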
@@ -38,6 +165,8 @@ module RDF::Microdata
|
|
38
165
|
# the input stream to read
|
39
166
|
# @param [Hash{Symbol => Object}] options
|
40
167
|
# any additional options
|
168
|
+
# @option options [Symbol] :library (:nokogiri)
|
169
|
+
# One of :nokogiri or :rexml. If nil/unspecified uses :nokogiri if available, :rexml otherwise.
|
41
170
|
# @option options [Encoding] :encoding (Encoding::UTF_8)
|
42
171
|
# the encoding of the input stream (Ruby 1.9+)
|
43
172
|
# @option options [Boolean] :validate (false)
|
@@ -48,6 +177,7 @@ module RDF::Microdata
|
|
48
177
|
# whether to intern all parsed URIs
|
49
178
|
# @option options [#to_s] :base_uri (nil)
|
50
179
|
# the base URI to use when resolving relative URIs
|
180
|
+
# @option options [#to_s] :registry_uri (DEFAULT_REGISTRY)
|
51
181
|
# @option options [Array] :debug
|
52
182
|
# Array to place debug messages
|
53
183
|
# @return [reader]
|
@@ -59,24 +189,43 @@ module RDF::Microdata
|
|
59
189
|
super do
|
60
190
|
@debug = options[:debug]
|
61
191
|
|
62
|
-
@
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
options[:encoding] ||= 'utf-8'
|
192
|
+
@library = case options[:library]
|
193
|
+
when nil
|
194
|
+
(defined?(::Nokogiri) && RUBY_PLATFORM != 'java') ? :nokogiri : :rexml
|
195
|
+
when :nokogiri, :rexml
|
196
|
+
options[:library]
|
197
|
+
else
|
198
|
+
raise ArgumentError.new("expected :rexml or :nokogiri, but got #{options[:library].inspect}")
|
199
|
+
end
|
71
200
|
|
72
|
-
|
73
|
-
|
201
|
+
require "rdf/microdata/reader/#{@library}"
|
202
|
+
@implementation = case @library
|
203
|
+
when :nokogiri then Nokogiri
|
204
|
+
when :rexml then REXML
|
74
205
|
end
|
75
|
-
|
76
|
-
|
206
|
+
self.extend(@implementation)
|
207
|
+
|
208
|
+
initialize_html(input, options) rescue raise RDF::ReaderError.new($!.message)
|
209
|
+
|
210
|
+
if (root.nil? && validate?)
|
211
|
+
raise RDF::ReaderError, "Empty Document"
|
212
|
+
end
|
213
|
+
errors = doc_errors.reject {|e| e.to_s =~ /Tag (audio|source|track|video|time) invalid/}
|
77
214
|
raise RDF::ReaderError, "Syntax errors:\n#{errors}" if !errors.empty? && validate?
|
78
|
-
raise RDF::ReaderError, "Empty document" if (@doc.nil? || @doc.root.nil?) && validate?
|
79
215
|
|
216
|
+
add_debug(@doc, "library = #{@library}")
|
217
|
+
|
218
|
+
# Load registry
|
219
|
+
unless Registry.loaded?
|
220
|
+
registry = options[:registry_uri] || DEFAULT_REGISTRY
|
221
|
+
begin
|
222
|
+
json = RDF::Util::File.open_file(registry) { |f| JSON.load(f) }
|
223
|
+
rescue JSON::ParserError => e
|
224
|
+
raise RDF::ReaderError, "Failed to parse registry: #{e.message}"
|
225
|
+
end
|
226
|
+
Registry.load_registry(json)
|
227
|
+
end
|
228
|
+
|
80
229
|
if block_given?
|
81
230
|
case block.arity
|
82
231
|
when 0 then instance_eval(&block)
|
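The parser-library selection in the hunk above can be isolated as a small pure function. This is a sketch under stated assumptions: `choose_library` is a hypothetical name, and the `nokogiri_available:` and `platform:` keyword arguments stand in for the `defined?(::Nokogiri)` and `RUBY_PLATFORM` checks so the logic can be exercised without either dependency.

```ruby
# Hypothetical helper mirroring the `case options[:library]` logic:
# - nil        -> :nokogiri when available and not on JRuby, else :rexml
# - known name -> used as given
# - other      -> ArgumentError
def choose_library(requested, nokogiri_available:, platform: RUBY_PLATFORM)
  case requested
  when nil
    (nokogiri_available && platform != 'java') ? :nokogiri : :rexml
  when :nokogiri, :rexml
    requested
  else
    raise ArgumentError, "expected :rexml or :nokogiri, but got #{requested.inspect}"
  end
end

puts choose_library(nil, nokogiri_available: true, platform: "java")
# prints rexml
```

Keeping the decision separate from the `require` call makes the JRuby fallback easy to check in isolation.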
@@ -121,19 +270,19 @@ module RDF::Microdata
|
|
121
270
|
@bnode_cache[value.to_s] ||= RDF::Node.new(value)
|
122
271
|
end
|
123
272
|
|
124
|
-
# Figure out the document path, if it is
|
273
|
+
# Figure out the document path, if it is an Element or Attribute
|
125
274
|
def node_path(node)
|
126
|
-
"<#{base_uri}
|
127
|
-
when Nokogiri::XML::Node then node.display_path
|
128
|
-
else node.to_s
|
129
|
-
end
|
275
|
+
"<#{base_uri}>#{node.respond_to?(:display_path) ? node.display_path : node}"
|
130
276
|
end
|
131
277
|
|
132
278
|
# Add debug event to debug array, if specified
|
133
279
|
#
|
134
|
-
# @param [XML
|
280
|
+
# @param [Nokogiri::XML::Node, #to_s] node:: XML Node or string for showing context
|
135
281
|
# @param [String] message::
|
136
|
-
|
282
|
+
# @yieldreturn [String] appended to message, to allow for lazy evaluation of message
|
283
|
+
def add_debug(node, message = "")
|
284
|
+
return unless ::RDF::Microdata.debug? || @debug
|
285
|
+
message = message + yield if block_given?
|
137
286
|
puts "#{node_path(node)}: #{message}" if ::RDF::Microdata::debug?
|
138
287
|
@debug << "#{node_path(node)}: #{message}" if @debug.is_a?(Array)
|
139
288
|
end
|
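The block form of `add_debug` above defers building the message until debugging is known to be enabled. A minimal stand-alone sketch of that pattern follows; the `DebugLog` class and its method names are hypothetical and exist only for illustration.

```ruby
# Hypothetical logger illustrating lazily evaluated debug messages.
class DebugLog
  attr_reader :messages

  def initialize(enabled)
    @enabled = enabled
    @messages = []
  end

  # The block is only called when logging is enabled, so expensive
  # string construction (for example, serializing a statement) is
  # skipped entirely otherwise.
  def add(context, message = "")
    return unless @enabled
    message = message + yield if block_given?
    @messages << "#{context}: #{message}"
  end
end

log = DebugLog.new(false)
log.add("node") { raise "never evaluated while logging is disabled" }
```

This is why the calls in the diff were rewritten from `add_debug(node, "...")` to `add_debug(node) {"..."}`: the interpolated string is no longer built on every call.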
@@ -153,107 +302,50 @@ module RDF::Microdata
|
|
153
302
|
# @raise [ReaderError]:: Checks parameter types and raises if they are incorrect if parsing mode is _validate_.
|
154
303
|
def add_triple(node, subject, predicate, object)
|
155
304
|
statement = RDF::Statement.new(subject, predicate, object)
|
156
|
-
add_debug(node
|
305
|
+
add_debug(node) {"statement: #{RDF::NTriples.serialize(statement)}"}
|
157
306
|
@callback.call(statement)
|
158
307
|
end
|
159
308
|
|
160
309
|
# Parsing a Microdata document (this is *not* the recursive method)
|
161
310
|
def parse_whole_document(doc, base)
|
162
|
-
|
163
|
-
|
164
|
-
|
165
|
-
add_debug(doc, "parse_whole_doc: options=#{@options.inspect}")
|
166
|
-
|
167
|
-
if (base)
|
311
|
+
base = doc_base(base)
|
312
|
+
options[:base_uri] = if (base)
|
168
313
|
# Strip any fragment from base
|
169
314
|
base = base.to_s.split('#').first
|
170
|
-
base =
|
171
|
-
add_debug(base_el, "parse_whole_doc: base='#{base}'")
|
315
|
+
base = uri(base)
|
172
316
|
else
|
173
317
|
base = RDF::URI("")
|
174
318
|
end
|
175
319
|
|
176
|
-
|
320
|
+
add_debug(nil) {"parse_whole_doc: base='#{base}'"}
|
321
|
+
|
322
|
+
ec = {
|
323
|
+
:memory => {},
|
324
|
+
:current_name => nil,
|
325
|
+
:current_type => nil,
|
326
|
+
:current_vocabulary => nil,
|
327
|
+
:document_base => base,
|
328
|
+
}
|
329
|
+
items = []
|
330
|
+
# 1) For each element that is also a top-level item run the following algorithm:
|
177
331
|
#
|
178
|
-
#
|
179
|
-
#
|
180
|
-
#
|
181
|
-
|
182
|
-
|
183
|
-
|
184
|
-
next unless rel && href
|
185
|
-
href = uri(href, el.base || base)
|
186
|
-
add_debug(el, "a: rel=#{rel.inspect}, href=#{href}")
|
187
|
-
|
188
|
-
# Otherwise, split the value of the element's rel attribute on spaces, obtaining list of tokens.
|
189
|
-
# Coalesce duplicate tokens in list of tokens.
|
190
|
-
tokens = rel.to_s.split(/\s+/).map do |tok|
|
191
|
-
# Convert each token in list of tokens that does not contain a U+003A COLON characters (:)
|
192
|
-
# to ASCII lowercase.
|
193
|
-
tok =~ /:/ ? tok : tok.downcase
|
194
|
-
end.uniq
|
195
|
-
|
196
|
-
# If list of tokens contains both the tokens alternate and stylesheet,
|
197
|
-
# then remove them both and replace them with the single (uppercase) token
|
198
|
-
# ALTERNATE-STYLESHEET.
|
199
|
-
if tokens.include?('alternate') && tokens.include?('stylesheet')
|
200
|
-
tokens = tokens - %w(alternate stylesheet)
|
201
|
-
tokens << 'ALTERNATE-STYLESHEET'
|
202
|
-
end
|
203
|
-
|
204
|
-
tokens.each do |tok|
|
205
|
-
tok_uri = RDF::URI(tok)
|
206
|
-
if tok !~ /:/
|
207
|
-
# For each token token in list of tokens that contains no U+003A COLON characters (:),
|
208
|
-
# generate the following triple:
|
209
|
-
add_triple(el, base, RDF::XHV[tok.gsub('#', '%23')], href)
|
210
|
-
elsif tok_uri.absolute?
|
211
|
-
# For each token token in list of tokens that is an absolute URL, generate the following triple:
|
212
|
-
add_triple(el, base, tok_uri, href)
|
213
|
-
end
|
214
|
-
end
|
215
|
-
end
|
216
|
-
|
217
|
-
# 3. For each meta element in the Document that has a name attribute and a content attribute,
|
218
|
-
doc.css('meta[name][content]').each do |el|
|
219
|
-
name, content = el.attribute('name'), el.attribute('content')
|
220
|
-
name = name.to_s
|
221
|
-
name_uri = uri(name, el.base || base)
|
222
|
-
add_debug(el, "meta: name=#{name.inspect}")
|
223
|
-
if name !~ /:/
|
224
|
-
# If the value of the name attribute contains no U+003A COLON characters (:),
|
225
|
-
# generate the following triple:
|
226
|
-
add_triple(el, base, RDF::XHV[name.downcase.gsub('#', '%23')], RDF::Literal(content, :language => el.language))
|
227
|
-
elsif name_uri.absolute?
|
228
|
-
# If the value of the name attribute contains no U+003A COLON characters (:),
|
229
|
-
# generate the following triple:
|
230
|
-
add_triple(el, base, name_uri, RDF::Literal(content, :language => el.language))
|
231
|
-
end
|
232
|
-
end
|
233
|
-
|
234
|
-
# 4. For each blockquote and q element in the Document that has a cite attribute that resolves
|
235
|
-
# successfully relative to the element, generate the following triple:
|
236
|
-
doc.css('blockquote[cite], q[cite]').each do |el|
|
237
|
-
object = uri(el.attribute('cite'), el.base || base)
|
238
|
-
add_debug(el, "blockquote: cite=#{object}")
|
239
|
-
add_triple(el, base, RDF::DC.source, object)
|
332
|
+
# 1) Generate the triples for an item item, using the evaluation context.
|
333
|
+
# Let result be the (URI reference or blank node) subject returned.
|
334
|
+
# 2) Append result to item list.
|
335
|
+
getItems.each do |el|
|
336
|
+
result = generate_triples(el, ec)
|
337
|
+
items << result
|
240
338
|
end
|
339
|
+
|
340
|
+
# 2) Generate an RDF Collection list from
|
341
|
+
# the ordered list of values. Set value to the value returned from generate an RDF Collection.
|
342
|
+
value = generateRDFCollection(root, items)
|
241
343
|
|
242
|
-
#
|
243
|
-
#
|
244
|
-
#
|
245
|
-
#
|
246
|
-
|
247
|
-
# subject the document's current address
|
248
|
-
# predicate http://www.w3.org/1999/xhtml/microdata#item
|
249
|
-
# object result
|
250
|
-
memory = {}
|
251
|
-
doc.css('[itemscope]').
|
252
|
-
select {|el| !el.has_attribute?('itemprop')}.
|
253
|
-
each do |el|
|
254
|
-
object = generate_triples(el, memory)
|
255
|
-
add_triple(el, base, RDF::MD.item, object)
|
256
|
-
end
|
344
|
+
# 3) Generate the following triple:
|
345
|
+
# subject Document base
|
346
|
+
# predicate http://www.w3.org/1999/xhtml/microdata#item
|
347
|
+
# object value
|
348
|
+
add_triple(doc, base, RDF::MD.item, value) if value
|
257
349
|
|
258
350
|
add_debug(doc, "parse_whole_doc: traversal complete")
|
259
351
|
end
|
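The evaluation context built in `parse_whole_document` is an ordinary Hash, and later code derives per-item contexts from it with `Hash#merge`. The sketch below shows what that implies; `child_context` is a hypothetical helper and the URIs are placeholders.

```ruby
# Hypothetical helper: derive a child evaluation context.
# Hash#merge returns a new Hash, so overrides do not leak into the parent.
def child_context(ec, overrides)
  ec.merge(overrides)
end

ec = {
  :memory => {},
  :current_name => nil,
  :current_type => nil,
  :current_vocabulary => nil,
  :document_base => "http://example.org/doc"   # placeholder base URI
}

child = child_context(ec, :current_type => "http://schema.org/Person")

# The copy is shallow: parent and child share the same :memory Hash,
# so subjects recorded while processing a nested item stay visible.
child[:memory]["item-1"] = { :subject => "_:b0" }

puts ec[:current_type].inspect
# prints nil
puts ec[:memory].key?("item-1")
# prints true
```

Sharing `:memory` while copying the type and vocabulary is what lets nested items inherit a type without overwriting the enclosing item's context.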
@@ -261,94 +353,119 @@ module RDF::Microdata
|
|
261
353
|
##
|
262
354
|
# Generate triples for an item
|
263
355
|
# @param [RDF::Resource] item
|
264
|
-
# @param [Hash{
|
265
|
-
# @
|
266
|
-
# @option
|
267
|
-
# @option options [RDF::Resource] :fallback_name
|
356
|
+
# @param [Hash{Symbol => Object}] ec
|
357
|
+
# @option ec [Hash{Nokogiri::XML::Element} => RDF::Resource] memory
|
358
|
+
# @option ec [RDF::Resource] :current_type
|
268
359
|
# @return [RDF::Resource]
|
269
|
-
def generate_triples(item,
|
270
|
-
|
271
|
-
|
272
|
-
|
273
|
-
# 1. If there is an entry for item in memory, then let subject be the subject of that entry.
|
360
|
+
def generate_triples(item, ec = {})
|
361
|
+
memory = ec[:memory]
|
362
|
+
# 1) If there is an entry for item in memory, then let subject be the subject of that entry.
|
274
363
|
# Otherwise, if item has a global identifier and that global identifier is an absolute URL,
|
275
364
|
# let subject be that global identifier. Otherwise, let subject be a new blank node.
|
276
|
-
subject = if memory.include?(item)
|
277
|
-
memory[item][:subject]
|
365
|
+
subject = if memory.include?(item.node)
|
366
|
+
memory[item.node][:subject]
|
278
367
|
elsif item.has_attribute?('itemid')
|
279
|
-
|
368
|
+
uri(item.attribute('itemid'), item.base || base_uri)
|
280
369
|
end || RDF::Node.new
|
281
|
-
memory[item] ||= {}
|
370
|
+
memory[item.node] ||= {}
|
282
371
|
|
283
|
-
add_debug(item
|
372
|
+
add_debug(item) {"gentrips(2): subject=#{subject.inspect}, current_type: #{ec[:current_type]}"}
|
284
373
|
|
285
|
-
# 2
|
286
|
-
memory[item][:subject] ||= subject
|
374
|
+
# 2) Add a mapping from item to subject in memory, if there isn't one already.
|
375
|
+
memory[item.node][:subject] ||= subject
|
287
376
|
|
288
|
-
# 3
|
289
|
-
|
290
|
-
|
291
|
-
|
377
|
+
# 3) For each type returned from element.itemType of the element defining the item.
|
378
|
+
type = nil
|
379
|
+
item.attribute('itemtype').to_s.split(' ').map{|n| uri(n)}.select(&:absolute?).each do |t|
|
380
|
+
# 3.1. If type is an absolute URL, generate the following triple:
|
381
|
+
type ||= t
|
382
|
+
add_triple(item, subject, RDF.type, t)
|
383
|
+
end
|
292
384
|
|
293
|
-
|
294
|
-
|
295
|
-
|
296
|
-
|
297
|
-
|
298
|
-
|
299
|
-
|
300
|
-
|
301
|
-
|
302
|
-
|
303
|
-
|
304
|
-
|
305
|
-
|
306
|
-
|
307
|
-
type += '%20' unless type.to_s[-1,1] == ':'
|
308
|
-
# 5.5. Append the fragment-escaped value of fallback name to type.
|
309
|
-
type += fallback_name.to_s.gsub('#', '%23')
|
385
|
+
# 5) If type is not an absolute URL, set it to current type from the Evaluation Context if not empty.
|
386
|
+
type ||= ec[:current_type]
|
387
|
+
add_debug(item) {"gentrips(5): type=#{type.inspect}"}
|
388
|
+
|
389
|
+
# 6) If the registry contains a URI prefix that is a character for character match of type up to the length of the
|
390
|
+
# URI prefix, set vocab as that URI prefix
|
391
|
+
vocab = Registry.find(type)
|
392
|
+
|
393
|
+
# 7) Otherwise, if type is not empty, construct vocab by removing everything following the last
|
394
|
+
# SOLIDUS U+002F ("/") or NUMBER SIGN U+0023 ("#") from type.
|
395
|
+
vocab ||= begin
|
396
|
+
type_vocab = type.to_s.sub(/([\/\#])[^\/\#]*$/, '\1')
|
397
|
+
add_debug(item) {"gentrips(7): type_vocab=#{type_vocab.inspect}"}
|
398
|
+
Registry.new(type_vocab) # if type
|
310
399
|
end
|
311
400
|
|
312
|
-
|
313
|
-
|
314
|
-
|
401
|
+
# 8) Update evaluation context setting current vocabulary to vocab.
|
402
|
+
ec[:current_vocabulary] = vocab
|
403
|
+
|
404
|
+
# 9) Set property list to an empty mapping between properties and one or more ordered values as established below.
|
405
|
+
property_list = {}
|
406
|
+
|
407
|
+
# 10. For each element _element_ that has one or more property names and is one of the
|
315
408
|
# properties of the item _item_, in the order those elements are given by the algorithm
|
316
409
|
# that returns the properties of an item, run the following substep:
|
317
410
|
props = item_properties(item)
|
318
|
-
|
319
|
-
# 6.1. For each name name in element's property names, run the following substeps:
|
411
|
+
# 10.1. For each name name in element's property names, run the following substeps:
|
320
412
|
props.each do |element|
|
321
|
-
element.attribute('itemprop').to_s.split(' ').each do |name|
|
322
|
-
add_debug(element
|
323
|
-
#
|
324
|
-
|
325
|
-
|
326
|
-
|
413
|
+
element.attribute('itemprop').to_s.split(' ').compact.each do |name|
+          add_debug(element) {"gentrips(10.1): name=#{name.inspect}, type=#{type}"}
+          # Let context be a copy of evaluation context with current type set to type and current vocabulary set to vocab.
+          ec_new = ec.merge({:current_type => type, :current_vocabulary => vocab})
+
+          predicate = vocab.predicateURI(name, ec_new)
+          ec_new[:current_name] = predicate
+          add_debug(element) {"gentrips(10.1.2): predicate=#{predicate}"}
+
+          # 10.1.3) Let value be the property value of element.
           value = property_value(element)
-          add_debug(element
+          add_debug(element) {"gentrips(10.1.3) value=#{value.inspect}"}
 
+          # 10.1.4) If value is an item, then generate the triples for value using a copy of evaluation context with
+          # current type set to type. Replace value by the subject returned from those steps.
           if value.is_a?(Hash)
-            value = generate_triples(element,
+            value = generate_triples(element, ec_new)
+            add_debug(element) {"gentrips(10.1.4): value=#{value.inspect}"}
           end
-
-          add_debug(element, "gentrips(6.1.3): value=#{value.inspect}")
 
-          predicate
-
-          else
-            # Use the URI of the type to create URIs for @itemprop terms
-            add_debug(element, "gentrips: rdf_type=#{rdf_type}")
-            predicate = RDF::URI(rdf_type.to_s.sub(/([\/\#])[^\/\#]*$/, '\1' + name))
-          end
-          add_debug(element, "gentrips(6.1.5): predicate=#{predicate}")
-
-          add_triple(element, subject, predicate, value) if predicate
+          property_list[predicate] ||= []
+          property_list[predicate] << value
         end
       end
 
+      # 11) For each predicate in property list
+      property_list.each do |predicate, values|
+        generatePropertyValues(item, subject, predicate, values, ec)
+      end
+
       subject
     end
 
+    def generatePropertyValues(element, subject, predicate, values, ec)
+      registry = ec[:current_vocabulary]
+      if registry.as_list(predicate)
+        value = generateRDFCollection(element, values)
+        add_triple(element, subject, predicate, value)
+      else
+        values.each {|v| add_triple(element, subject, predicate, v)}
+      end
+    end
+
+    ##
+    # Called when values has more than one entry
+    # @param [Nokogiri::HTML::Element] element
+    # @param [Array<RDF::Value>] values
+    # @return [RDF::Node]
+    def generateRDFCollection(element, values)
+      list = RDF::List.new(nil, nil, values)
+      list.each_statement do |st|
+        add_triple(element, st.subject, st.predicate, st.object) unless st.object == RDF.List
+      end
+      list.subject
+    end
+
     ##
     # To find the properties of an item defined by the element root, the user agent must try
     # to crawl the properties of the element root, with an empty list as the value of memory:
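Note on the hunk above: steps 10 and 11 now group values by predicate first and emit triples afterwards, so that a predicate the registry marks as a list can be written as a single RDF collection. The following is a minimal sketch of that two-phase shape in plain Ruby, with no RDF.rb dependency. The names `collection_triples` and `emit_property_values`, the `as_list_predicates` array, the `_:...` node labels, and the array-based triples are all hypothetical stand-ins for illustration, not the gem's API.

```ruby
# Hypothetical stand-ins: triples are plain arrays, blank nodes are "_:..." strings.
RDF_FIRST = "rdf:first"
RDF_REST  = "rdf:rest"
RDF_NIL   = "rdf:nil"

# Build an rdf:first/rdf:rest chain for values; returns [head, triples].
# label keeps node names distinct between collections.
def collection_triples(values, label)
  return [RDF_NIL, []] if values.empty?
  nodes = values.each_index.map { |i| "_:#{label}#{i}" }
  triples = []
  values.each_with_index do |v, i|
    triples << [nodes[i], RDF_FIRST, v]
    triples << [nodes[i], RDF_REST, nodes[i + 1] || RDF_NIL]
  end
  [nodes.first, triples]
end

# Phase 2: emit either one collection or one triple per value.
def emit_property_values(subject, property_list, as_list_predicates)
  property_list.each_with_index.flat_map do |(predicate, values), n|
    if as_list_predicates.include?(predicate)
      head, triples = collection_triples(values, "l#{n}n")
      triples + [[subject, predicate, head]]
    else
      values.map { |v| [subject, predicate, v] }
    end
  end
end

# Phase 1: group values by predicate, preserving document order.
property_list = {}
[["ex:step", "mix"], ["ex:name", "Cake"], ["ex:step", "bake"]].each do |predicate, value|
  (property_list[predicate] ||= []) << value
end

triples = emit_property_values("_:item", property_list, ["ex:step"])
```

Grouping before emitting is what makes the list case possible: a collection needs every value for a predicate at once, which a triple-per-iteration loop cannot provide.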
@@ -378,13 +495,14 @@ module RDF::Microdata
     # @return [Array<Array<Nokogiri::XML::Element>, Integer>]
     # Resultant elements and error count
     def crawl_properties(root, memory)
+
       # 1. If root is in memory, then the algorithm fails; abort these steps.
       raise CrawlFailure, "crawl_props mem already has #{root.inspect}" if memory.include?(root)
 
       # 2. Collect all the elements in the item root; let results be the resulting
       # list of elements, and errors be the resulting count of errors.
       results, errors = elements_in_item(root)
-      add_debug(root
+      add_debug(root) {"crawl_properties results=#{results.map {|e| node_path(e)}.inspect}, errors=#{errors}"}
 
       # 3. Remove any elements from results that do not have an itemprop attribute specified.
       results = results.select {|e| e.has_attribute?('itemprop')}
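Note on the `add_debug` changes: throughout this release the debug message moves from a string argument to a block. The diff does not state why; one plausible reason, offered here as an assumption, is that a block is only evaluated when debugging is switched on, so costly interpolations such as `results.map {|e| node_path(e)}` are skipped otherwise. The sketch below illustrates the difference with a hypothetical `DebugLog` class, which is not the gem's `add_debug` implementation.

```ruby
# Hypothetical logger used only to contrast eager and lazy message building.
class DebugLog
  attr_reader :messages

  def initialize(enabled)
    @enabled = enabled
    @messages = []
  end

  # Eager form: the caller has already built the message before this runs.
  def add_eager(message)
    @messages << message if @enabled
  end

  # Lazy form: the block is only called when debugging is enabled.
  def add_lazy
    return unless @enabled
    @messages << yield
  end
end

calls = 0
expensive = lambda { calls += 1; "results=#{(1..3).map { |i| i * i }.inspect}" }

quiet = DebugLog.new(false)
quiet.add_eager(expensive.call)   # message is built, then discarded
quiet.add_lazy { expensive.call } # block never runs
calls_when_quiet = calls

loud = DebugLog.new(true)
loud.add_lazy { expensive.call }  # block runs and the message is stored
```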
@@ -427,13 +545,13 @@ module RDF::Microdata
       # If root has an itemref attribute, split the value of that itemref attribute on spaces.
       # For each resulting token ID,
       root.attribute('itemref').to_s.split(' ').each do |id|
-        add_debug(root
+        add_debug(root) {"elements_in_item itemref id #{id}"}
         # if there is an element in the home subtree of root with the ID ID,
         # then add the first such element to pending.
-        id_elem =
+        id_elem = find_element_by_id(id)
         pending << id_elem if id_elem
       end
-      add_debug(root
+      add_debug(root) {"elements_in_item pending #{pending.inspect}"}
 
       # Loop: Remove an element from pending and let current be that element.
       while current = pending.shift
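Note on the `itemref` handling in the two hunks above: the attribute is split on spaces, each token is looked up as an element ID, missing IDs are skipped, and `crawl_properties` aborts if it meets an element that is already in `memory`. A simplified sketch of those three ideas follows. It models elements as plain hashes and collapses the crawl into one recursive function, so `find_by_id`, `referenced_elements`, `crawl`, and the locally defined `CrawlFailure` are illustrative assumptions rather than the reader's actual methods.

```ruby
# Hypothetical document model: each element is a Hash whose :id and :itemref
# keys mirror the HTML attributes of the same names.
class CrawlFailure < StandardError; end

# Return the first element carrying the given ID, or nil.
def find_by_id(elements, id)
  elements.find { |e| e[:id] == id }
end

# Resolve itemref tokens to elements, skipping IDs that are not present.
def referenced_elements(root, elements)
  root[:itemref].to_s.split(' ').map { |id| find_by_id(elements, id) }.compact
end

# Follow itemref chains, failing if an element is reached twice on one path.
def crawl(root, elements, memory = [])
  raise CrawlFailure, "already visited #{root[:id]}" if memory.include?(root)
  referenced_elements(root, elements).flat_map do |e|
    [e[:id]] + crawl(e, elements, memory + [root])
  end
end

elements = [
  { id: "item", itemref: "a  b missing" },
  { id: "a", itemref: nil },
  { id: "b", itemref: "a" }
]
visited = crawl(elements.first, elements)

looping = [{ id: "x", itemref: "y" }, { id: "y", itemref: "x" }]
cycle_detected =
  begin
    crawl(looping.first, looping)
    false
  rescue CrawlFailure
    true
  end
```

Passing `memory + [root]` rather than mutating a shared array means the same element may be referenced from two separate branches without error; only a reference back to an ancestor counts as a cycle in this sketch.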
@@ -457,37 +575,42 @@ module RDF::Microdata
     ##
     #
     def property_value(element)
-
-
+      base = element.base || base_uri
+      add_debug(element) {"property_value(#{element.name}): base #{base.inspect}"}
+      value = case
       when element.has_attribute?('itemscope')
         {}
       when element.name == 'meta'
-        element.attribute('content').to_s
+        RDF::Literal.new(element.attribute('content').to_s, :language => element.language)
+      when element.name == 'data'
+        RDF::Literal.new(element.attribute('value').to_s, :language => element.language)
       when %w(audio embed iframe img source track video).include?(element.name)
-        uri(element.attribute('src'),
+        uri(element.attribute('src'), base)
       when %w(a area link).include?(element.name)
-        uri(element.attribute('href'),
+        uri(element.attribute('href'), base)
       when %w(object).include?(element.name)
-        uri(element.attribute('data'),
-      when %w(time).include?(element.name)
+        uri(element.attribute('data'), base)
+      when %w(time).include?(element.name)
         # Lexically scan value and assign appropriate type, otherwise, leave untyped
-        v = element.attribute('datetime').to_s
-        datatype = %w(Date Time DateTime).map {|t| RDF::Literal.const_get(t)}.detect do |dt|
+        v = (element.attribute('datetime') || element.text).to_s
+        datatype = %w(Date Time DateTime Duration).map {|t| RDF::Literal.const_get(t)}.detect do |dt|
           v.match(dt::GRAMMAR)
         end || RDF::Literal
-        datatype.new(v)
+        datatype.new(v, :language => element.language)
       else
-        RDF::Literal.new(element.
+        RDF::Literal.new(element.inner_text, :language => element.language)
       end
+      add_debug(element) {" #{value.inspect}"}
+      value
     end
 
     # Fixme, what about xml:base relative to element?
     def uri(value, base = nil)
       value = if base
         base = uri(base) unless base.is_a?(RDF::URI)
-        base.join(value)
+        base.join(value.to_s)
       else
-        RDF::URI(value)
+        RDF::URI(value.to_s)
       end
       value.validate! if validate?
       value.canonicalize! if canonicalize?
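Note on the `time` branch of `property_value` in the hunk above: the new code falls back to the element text when no `datetime` attribute is present, then picks the first literal class whose `GRAMMAR` matches, adding `Duration` to the candidates and leaving the value untyped if none match. The sketch below shows the same "first matching lexical form wins" idea in plain Ruby. The regular expressions are deliberately simplified assumptions and are looser than RDF.rb's `GRAMMAR` constants; `LEXICAL_FORMS`, `detect_datatype`, and `time_value` are hypothetical names.

```ruby
# Simplified, hypothetical patterns; RDF.rb's GRAMMAR constants are stricter.
LEXICAL_FORMS = {
  date:     /\A-?\d{4}-\d{2}-\d{2}(Z|[+-]\d{2}:\d{2})?\z/,
  time:     /\A\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?\z/,
  dateTime: /\A-?\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?\z/,
  duration: /\A-?P(?=\d|T\d)(\d+Y)?(\d+M)?(\d+D)?(T(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?\z/
}.freeze

# Return the name of the first lexical form that matches, or :plain if none do.
def detect_datatype(value)
  found = LEXICAL_FORMS.detect { |_name, pattern| value.match(pattern) }
  found ? found.first : :plain
end

# Prefer the datetime attribute; fall back to the element text when it is absent.
def time_value(datetime_attr, text)
  lexical = (datetime_attr || text).to_s
  [lexical, detect_datatype(lexical)]
end

from_attribute = time_value("2011-06-28T12:30:00Z", "lunchtime")
from_text      = time_value(nil, "P1DT2H")
untyped        = time_value(nil, "next Tuesday")
```

Because the candidates are tried in order, the patterns need to be anchored at both ends; otherwise a date pattern would also match the leading part of a dateTime value.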
|