rdf-microdata 0.2.2 → 0.2.3
- data/README +21 -45
- data/VERSION +1 -1
- data/etc/doap.html +42 -0
- data/etc/registry.json +39 -0
- data/lib/rdf/microdata.rb +0 -2
- data/lib/rdf/microdata/reader.rb +316 -193
- data/lib/rdf/microdata/reader/nokogiri.rb +232 -0
- data/lib/rdf/microdata/reader/rexml.rb +277 -0
- data/lib/rdf/microdata/vocab.rb +1 -1
- metadata +58 -21
- data/lib/rdf/microdata/extensions.rb +0 -34
data/README
CHANGED
@@ -6,13 +6,20 @@
|
|
6
6
|
RDF::Microdata is a Microdata reader for Ruby using the [RDF.rb][RDF.rb] library suite.
|
7
7
|
|
8
8
|
## FEATURES
|
9
|
-
RDF::Microdata parses [Microdata][] into statements or triples.
|
9
|
+
RDF::Microdata parses [Microdata][] into statements or triples using the rules defined in [Microdata RDF][].
|
10
10
|
|
11
11
|
* Microdata parser.
|
12
|
-
* Uses Nokogiri for parsing HTML
|
12
|
+
* If available, uses Nokogiri for parsing HTML/SVG; falls back to REXML otherwise (and on JRuby)
|
13
13
|
|
14
14
|
Install with 'gem install rdf-microdata'
|
15
15
|
|
16
|
+
### Living implementation
|
17
|
+
Microdata to RDF transformation is undergoing active development. This implementation attempts to be up-to-date
|
18
|
+
as of the time of release, and is being used in developing the [Microdata RDF][] specification.
|
19
|
+
|
20
|
+
### Microdata Registry
|
21
|
+
The parser uses a built-in version of the [Microdata RDF][] registry.
|
22
|
+
|
16
23
|
## Usage
|
17
24
|
|
18
25
|
### Reading RDF data in the Microdata format
|
@@ -20,49 +27,14 @@ Install with 'gem install rdf-microdata'
|
|
20
27
|
graph = RDF::Graph.load("etc/foaf.html", :format => :microdata)
|
21
28
|
|
22
29
|
## Note
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
### Generating RDF friendly URIs from terms
|
28
|
-
If the `@itemprop` is included within an item having an `@itemtype`,
|
29
|
-
the URI of the `@itemtype` will be used for generating a term URI. The type URI will be trimmed following
|
30
|
-
the last '#' or '/' character, and the term will be appended to the resulting URI. This is in keeping
|
31
|
-
with standard convention for defining properties and classes within an RDFS or OWL vocabulary.
|
32
|
-
|
33
|
-
For example:
|
34
|
-
|
35
|
-
<div itemscope itemtype="http://schema.org/Person">
|
36
|
-
My name is <span itemprop="name">Gregg</span>
|
37
|
-
</div>
|
38
|
-
|
39
|
-
Without the `:rdf\_terms` option, this would create the following statements:
|
40
|
-
|
41
|
-
@prefix md: <http://www.w3.org/1999/xhtml/microdata#> .
|
42
|
-
@prefix schema: <http://schema.org/> .
|
43
|
-
<> md:item [
|
44
|
-
a schema:Person;
|
45
|
-
<http://www.w3.org/1999/xhtml/microdata#http://schema.org/Person%23:name> "Gregg"
|
46
|
-
] .
|
47
|
-
|
48
|
-
With the `:rdf\_terms` option, this becomes:
|
49
|
-
|
50
|
-
@prefix md: <http://www.w3.org/1999/xhtml/microdata#> .
|
51
|
-
@prefix schema: <http://schema.org/> .
|
52
|
-
<> md:item [ a schema:Person; schema:name "Gregg" ] .
|
53
|
-
|
54
|
-
### Improve xsd:date, xsd:time, xsd:dateTime and xsd:duration generation from _time_ element
|
55
|
-
|
56
|
-
Use the lexical form of the @datetime attribute of the _time_ element to determine the specific type
|
57
|
-
of the generated literal.
|
58
|
-
|
59
|
-
### Remove implicit RDF triple generation
|
60
|
-
|
61
|
-
html>head>title and anchor (_a_) elements no longer generate triples without @item* properties
|
62
|
-
|
30
|
+
This spec is based on the W3C HTML Data Task Force specification and does not support
|
31
|
+
GRDDL-type triple generation, such as for html>head>title and <a>
|
32
|
+
|
63
33
|
## Dependencies
|
64
34
|
* [RDF.rb](http://rubygems.org/gems/rdf) (>= 0.3.4)
|
65
|
-
* [
|
35
|
+
* [RDF::XSD](http://rubygems.org/gems/rdf-xsd) (>= 0.3.4)
|
36
|
+
* [HTMLEntities](https://rubygems.org/gems/htmlentities) (>= 4.3.0)
|
37
|
+
* Soft dependency on [Nokogiri](http://rubygems.org/gems/nokogiri) (>= 1.5.0)
|
66
38
|
|
67
39
|
## Documentation
|
68
40
|
Full documentation available on [Rubydoc.info][Microdata doc]
|
@@ -71,6 +43,8 @@ Full documentation available on [Rubydoc.info][Microdata doc]
|
|
71
43
|
* {RDF::Microdata::Format}
|
72
44
|
Asserts :html format, text/html mime-type and .html file extension.
|
73
45
|
* {RDF::Microdata::Reader}
|
46
|
+
* {RDF::Microdata::Reader::Nokogiri}
|
47
|
+
* {RDF::Microdata::Reader::REXML}
|
74
48
|
|
75
49
|
### Additional vocabularies
|
76
50
|
|
@@ -81,8 +55,9 @@ Full documentation available on [Rubydoc.info][Microdata doc]
|
|
81
55
|
## Resources
|
82
56
|
* [RDF.rb][RDF.rb]
|
83
57
|
* [Documentation](http://rdf.rubyforge.org/microdata)
|
84
|
-
* [History](file:
|
58
|
+
* [History](file:History.md)
|
85
59
|
* [Microdata][]
|
60
|
+
* [Microdata RDF][]
|
86
61
|
|
87
62
|
## Author
|
88
63
|
* [Gregg Kellogg](http://github.com/gkellogg) - <http://kellogg-assoc.com/>
|
@@ -117,5 +92,6 @@ see <http://unlicense.org/> or the accompanying {file:UNLICENSE} file.
|
|
117
92
|
[YARD]: http://yardoc.org/
|
118
93
|
[YARD-GS]: http://rubydoc.info/docs/yard/file/docs/GettingStarted.md
|
119
94
|
[PDD]: http://lists.w3.org/Archives/Public/public-rdf-ruby/2010May/0013.html
|
120
|
-
[Microdata]: http://
|
95
|
+
[Microdata]: http://dev.w3.org/html5/md/Overview.html "HTML Microdata"
|
96
|
+
[Microdata RDF]: https://dvcs.w3.org/hg/htmldata/raw-file/default/microdata-rdf/index.html "Microdata to RDF"
|
121
97
|
[Microdata doc]: http://rubydoc.info/github/gkellogg/rdf-microdata/frames
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.2.
|
1
|
+
0.2.3
|
data/etc/doap.html
ADDED
@@ -0,0 +1,42 @@
|
|
1
|
+
<!DOCTYPE html>
|
2
|
+
<html itemscope itemid="http://rubygems.org/gems/rdf-microdata" itemtype="http://usefulinc.com/ns/doap#Project">
|
3
|
+
<head>
|
4
|
+
<title lang="en" itemprop="shortdesc">Microdata reader for Ruby.</title>
|
5
|
+
</head>
|
6
|
+
<body about="" typeof="Project">
|
7
|
+
<p>Project description for <span itemprop="name">RDF::Microdata</span>.</p>
|
8
|
+
<p lang="en" itemprop="description">
|
9
|
+
RDF::Microdata is a Microdata reader for Ruby using the RDF.rb library suite.
|
10
|
+
</p>
|
11
|
+
<dl>
|
12
|
+
<dt>Creator</dt><dd>
|
13
|
+
<a itemprop="http://purl.org/dc/terms/creator developer documenter maintainer http://xmlns.com/foaf/0.1/creator" href="http://greggkellogg.net/foaf#me"
|
14
|
+
>Gregg Kellogg</a>
|
15
|
+
</dd>
|
16
|
+
<dt>Created</dt><dd><time itemprop="created" datetime="2011-08-29"/></dd>
|
17
|
+
<dt>Blog</dt><dd><a href="http://greggkellogg.net/" itemprop="blog">http://greggkellogg.net/</a></dd>
|
18
|
+
<dt>Bug DB</dt><dd>
|
19
|
+
<a href="http://github.com/gkellogg/rdf-microdata/issues" itemprop="bug-database">
|
20
|
+
http://github.com/gkellogg/rdf-microdata/issues
|
21
|
+
</a>
|
22
|
+
</dd>
|
23
|
+
<dt>Category</dt><dd itemprop="category">
|
24
|
+
<a href="http://dbpedia.org/resource/Resource_Description_Framework">Resource Description Framework</a>
|
25
|
+
for
|
26
|
+
<a itemprop="programming-language" href="http://dbpedia.org/resource/Ruby_(programming_language)">Ruby</a>
|
27
|
+
</dd>
|
28
|
+
<dt>Download</dt><dd><a href="http://rubygems.org/gems/rdf-microdata" itemprop="download-page">
|
29
|
+
http://rubygems.org/gems/rdf-microdata
|
30
|
+
</a></dd>
|
31
|
+
<dt>Home Page</dt><dd><a href="http://github.com/gkellogg/rdf-microdata" itemprop="homepage">
|
32
|
+
http://github.com/gkellogg/rdf-microdata
|
33
|
+
</a></dd>
|
34
|
+
<dt>License</dt><dd>
|
35
|
+
<a href="http://creativecommons.org/licenses/publicdomain/" itemprop="license">Public Domain</a>
|
36
|
+
</dd>
|
37
|
+
<dt>Mailing List</dt><dd><a href="http://lists.w3.org/Archives/Public/public-rdf-ruby/" itemprop="mailing-list">
|
38
|
+
http://lists.w3.org/Archives/Public/public-rdf-ruby/
|
39
|
+
</a></dd>
|
40
|
+
</dl>
|
41
|
+
</body>
|
42
|
+
</html>
|
data/etc/registry.json
ADDED
@@ -0,0 +1,39 @@
|
|
1
|
+
{
|
2
|
+
"http://schema.org/": {
|
3
|
+
"propertyURI": "vocabulary",
|
4
|
+
"multipleValues": "unordered",
|
5
|
+
"properties": {
|
6
|
+
"blogPosts": {"multipleValues": "list"},
|
7
|
+
"breadcrumb": {"multipleValues": "list"},
|
8
|
+
"byArtist": {"multipleValues": "list"},
|
9
|
+
"creator": {"multipleValues": "list"},
|
10
|
+
"episodes": {"multipleValues": "list"},
|
11
|
+
"events": {"multipleValues": "list"},
|
12
|
+
"founders": {"multipleValues": "list"},
|
13
|
+
"itemListElement": {"multipleValues": "list"},
|
14
|
+
"musicGroupMember": {"multipleValues": "list"},
|
15
|
+
"performerIn": {"multipleValues": "list"},
|
16
|
+
"performers": {"multipleValues": "list"},
|
17
|
+
"producer": {"multipleValues": "list"},
|
18
|
+
"recipeInstructions": {"multipleValues": "list"},
|
19
|
+
"seasons": {"multipleValues": "list"},
|
20
|
+
"subEvents": {"multipleValues": "list"},
|
21
|
+
"tracks": {"multipleValues": "list"}
|
22
|
+
}
|
23
|
+
},
|
24
|
+
"http://microformats.org/profile/hcard": {
|
25
|
+
"propertyURI": "vocabulary",
|
26
|
+
"multipleValues": "unordered"
|
27
|
+
},
|
28
|
+
"http://microformats.org/profile/hcalendar#": {
|
29
|
+
"propertyURI": "vocabulary",
|
30
|
+
"multipleValues": "unordered",
|
31
|
+
"properties": {
|
32
|
+
"categories": {"multipleValues": "list"}
|
33
|
+
}
|
34
|
+
},
|
35
|
+
"http://n.whatwg.org/work": {
|
36
|
+
"propertyURI": "contextual",
|
37
|
+
"multipleValues": "list"
|
38
|
+
}
|
39
|
+
}
|
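The registry above maps URI prefixes to a property-URI scheme and a default `multipleValues` mode, with optional per-property overrides. The following is a minimal sketch of how such a structure might be consulted, assuming a trimmed-down inline copy of the JSON; the helper names `find_entry` and `list_property?` are hypothetical and are not part of the gem.

```ruby
require 'json'

# Hypothetical, trimmed-down registry with the same shape as etc/registry.json.
REGISTRY_JSON = <<~JSON
  {
    "http://schema.org/": {
      "propertyURI": "vocabulary",
      "multipleValues": "unordered",
      "properties": {
        "tracks": {"multipleValues": "list"}
      }
    },
    "http://n.whatwg.org/work": {
      "propertyURI": "contextual",
      "multipleValues": "list"
    }
  }
JSON

REGISTRY = JSON.parse(REGISTRY_JSON)

# Return the [prefix, entry] pair whose prefix starts the type URI, or nil.
def find_entry(registry, type)
  registry.find { |prefix, _entry| type.to_s.start_with?(prefix) }
end

# True if values of the named property should be emitted as an ordered list.
# A per-property override wins over the vocabulary-wide default.
def list_property?(entry, name)
  override = entry.fetch("properties", {})[name]
  mode = override ? override["multipleValues"] : entry.fetch("multipleValues", "unordered")
  mode == "list"
end
```

With this sketch, a type such as `http://schema.org/MusicAlbum` resolves to the `http://schema.org/` entry, where `tracks` is list-valued while other properties stay unordered.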
data/lib/rdf/microdata.rb
CHANGED
data/lib/rdf/microdata/reader.rb
CHANGED
@@ -1,24 +1,33 @@
|
|
1
|
-
|
1
|
+
begin
|
2
|
+
raise LoadError, "not with java" if RUBY_PLATFORM == "java"
|
3
|
+
require 'nokogiri'
|
4
|
+
rescue LoadError => e
|
5
|
+
:rexml
|
6
|
+
end
|
7
|
+
require 'rdf/xsd'
|
8
|
+
require 'json'
|
2
9
|
|
3
10
|
module RDF::Microdata
|
4
11
|
##
|
5
12
|
# A Microdata parser in Ruby
|
6
13
|
#
|
7
14
|
# Based on processing rules, amended with the following:
|
8
|
-
# * property generation from tokens now uses the associated @itemtype as the basis for generation
|
9
|
-
# * implicit triples are not generated, only those with @item*
|
10
|
-
# * @datetime values are scanned lexically to find appropriate datatype
|
11
15
|
#
|
12
|
-
# @see
|
16
|
+
# @see https://dvcs.w3.org/hg/htmldata/raw-file/0d6b89f5befb/microdata-rdf/index.html
|
13
17
|
# @author [Gregg Kellogg](http://kellogg-assoc.com/)
|
14
18
|
class Reader < RDF::Reader
|
15
19
|
format Format
|
16
|
-
XHTML = "http://www.w3.org/1999/xhtml"
|
17
20
|
URL_PROPERTY_ELEMENTS = %w(a area audio embed iframe img link object source track video)
|
21
|
+
DEFAULT_REGISTRY = File.expand_path(File.join(File.dirname(__FILE__), "..", "..", "..", "etc", "registry.json"))
|
18
22
|
|
19
23
|
class CrawlFailure < StandardError #:nodoc:
|
20
24
|
end
|
21
25
|
|
26
|
+
# Returns the HTML implementation module for this reader instance.
|
27
|
+
#
|
28
|
+
# @attr_reader [Module]
|
29
|
+
attr_reader :implementation
|
30
|
+
|
22
31
|
##
|
23
32
|
# Returns the base URI determined by this reader.
|
24
33
|
#
|
@@ -31,6 +40,124 @@ module RDF::Microdata
|
|
31
40
|
@options[:base_uri]
|
32
41
|
end
|
33
42
|
|
43
|
+
# Interface to registry
|
44
|
+
class Registry
|
45
|
+
##
|
46
|
+
# Initialize the registry from parsed registry JSON
|
47
|
+
#
|
48
|
+
# @param [Hash] json
|
49
|
+
def self.load_registry(json)
|
50
|
+
@prefixes = {}
|
51
|
+
json.each do |prefix, elements|
|
52
|
+
propertyURI = elements.fetch("propertyURI", "vocabulary").to_sym
|
53
|
+
multipleValues = elements.fetch("multipleValues", "unordered").to_sym
|
54
|
+
properties = elements.fetch("properties", {})
|
55
|
+
@prefixes[prefix] = Registry.new(prefix, propertyURI, multipleValues, properties)
|
56
|
+
end
|
57
|
+
end
|
58
|
+
|
59
|
+
##
|
60
|
+
# True if registry has already been loaded
|
61
|
+
def self.loaded?
|
62
|
+
@prefixes.is_a?(Hash)
|
63
|
+
end
|
64
|
+
|
65
|
+
##
|
66
|
+
# Initialize registry for a particular prefix URI
|
67
|
+
#
|
68
|
+
# @param [RDF::URI] prefixURI
|
69
|
+
# @param [#to_sym] propertyURI (:vocabulary)
|
70
|
+
# @param [#to_sym] multipleValues (:unordered)
|
71
|
+
# @param [Hash] properties ({})
|
72
|
+
def initialize(prefixURI, propertyURI = :vocabulary, multipleValues = :unordered, properties = {})
|
73
|
+
@scheme = propertyURI.to_sym
|
74
|
+
@multipleValues = multipleValues.to_sym
|
75
|
+
@properties = properties
|
76
|
+
if @scheme == :vocabulary
|
77
|
+
@property_base = prefixURI.to_s
|
78
|
+
@property_base += '#' unless %w(/ #).include?(@property_base[-1]) # Append a '#' for fragment if necessary
|
79
|
+
else
|
80
|
+
@property_base = 'http://www.w3.org/ns/md?type='
|
81
|
+
end
|
82
|
+
end
|
83
|
+
|
84
|
+
##
|
85
|
+
# Find a registry entry given a type URI
|
86
|
+
#
|
87
|
+
# @param [RDF::URI] type
|
88
|
+
# @return [Registry]
|
89
|
+
def self.find(type)
|
90
|
+
@prefixes.select do |key, value|
|
91
|
+
type.to_s.index(key) == 0
|
92
|
+
end.values.first
|
93
|
+
end
|
94
|
+
|
95
|
+
##
|
96
|
+
# Generate a predicateURI given a `name`
|
97
|
+
#
|
98
|
+
# @param [#to_s] name
|
99
|
+
# @param [Hash{}] ec Evaluation Context
|
100
|
+
# @return [RDF::URI]
|
101
|
+
def predicateURI(name, ec)
|
102
|
+
u = RDF::URI(name)
|
103
|
+
return u if u.absolute?
|
104
|
+
|
105
|
+
n = frag_escape(name)
|
106
|
+
if ec[:current_type].nil?
|
107
|
+
u = RDF::URI(ec[:document_base].to_s)
|
108
|
+
u.fragment = frag_escape(name)
|
109
|
+
u
|
110
|
+
elsif @scheme == :vocabulary
|
111
|
+
# If scheme is vocabulary return the URI reference constructed by appending the fragment escaped value of name
|
112
|
+
# to current vocabulary, separated by a U+0023 NUMBER SIGN character (#) unless the current vocabulary ends
|
113
|
+
# with either a U+0023 NUMBER SIGN character (#) or SOLIDUS U+002F (/).
|
114
|
+
RDF::URI(@property_base + n)
|
115
|
+
else # @scheme == :contextual
|
116
|
+
if ec[:current_type].to_s.index(@property_base) == 0
|
117
|
+
# return the concatenation of s, a U+002E FULL STOP character (.) and the fragment-escaped value of name.
|
118
|
+
RDF::URI(@property_base + '.' + n)
|
119
|
+
else
|
120
|
+
# return the concatenation of http://www.w3.org/ns/md?type=, the fragment-escaped value of s,
|
121
|
+
# the string &prop=, and the fragment-escaped value of name
|
122
|
+
RDF::URI(@property_base + frag_escape(ec[:current_type]) + '?prop=' + n)
|
123
|
+
end
|
124
|
+
end
|
125
|
+
end
|
126
|
+
|
127
|
+
|
128
|
+
##
|
129
|
+
# Turn a predicateURI into a simple token
|
130
|
+
# @param [RDF::URI] predicateURI
|
131
|
+
# @return [String]
|
132
|
+
def tokenize(predicateURI)
|
133
|
+
case @scheme
|
134
|
+
when :vocabulary
|
135
|
+
predicateURI.to_s.sub(@property_base, '')
|
136
|
+
when :contextual
|
137
|
+
predicateURI.to_s.split('?prop=').last.split('.').last
|
138
|
+
end
|
139
|
+
end
|
140
|
+
|
141
|
+
##
|
142
|
+
# Determine if property should be serialized as a list or not
|
143
|
+
# @param [RDF::URI] predicateURI
|
144
|
+
# @return [Boolean]
|
145
|
+
def as_list(predicateURI)
|
146
|
+
tok = tokenize(predicateURI)
|
147
|
+
if @properties[tok].is_a?(Hash)
|
148
|
+
@properties[tok]["multipleValues"].to_sym == :list
|
149
|
+
else
|
150
|
+
@multipleValues == :list
|
151
|
+
end
|
152
|
+
end
|
153
|
+
|
154
|
+
##
|
155
|
+
# Fragment escape a name
|
156
|
+
def frag_escape(name)
|
157
|
+
name.to_s.gsub(/["#%<>\[\\\]^{|}]/) {|c| '%' + c.unpack('H2' * c.bytesize).join('%').upcase}
|
158
|
+
end
|
159
|
+
end
|
160
|
+
|
34
161
|
##
|
35
162
|
# Initializes the Microdata reader instance.
|
36
163
|
#
|
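The `:vocabulary` branch of `predicateURI` and the `frag_escape` helper in the hunk above can be illustrated without RDF.rb. This is a rough sketch using plain strings only: `vocabulary_predicate` is a hypothetical name, and the absolute-URI check is a crude stand-in for `RDF::URI#absolute?`, not the gem's behaviour.

```ruby
# Percent-encode the characters matched by the same character class
# that frag_escape uses in the diff (ASCII-only, one byte each).
def frag_escape(name)
  name.to_s.gsub(/["#%<>\[\\\]^{|}]/) { |c| format('%%%02X', c.ord) }
end

# Build a predicate URI for the "vocabulary" scheme.
# Assumption: a leading "scheme:" is treated as marking an absolute URI.
def vocabulary_predicate(prefix, name)
  return name if name =~ /\A[a-z][a-z0-9+.\-]*:/i

  base = prefix.to_s
  # Append '#' unless the prefix already ends in '/' or '#'.
  base += '#' unless base.end_with?('/', '#')
  base + frag_escape(name)
end
```

For example, the prefix `http://schema.org/` with the name `name` gives `http://schema.org/name`, while a prefix without a trailing separator, such as `http://microformats.org/profile/hcard`, gets a `#` inserted before the name.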
@@ -38,6 +165,8 @@ module RDF::Microdata
|
|
38
165
|
# the input stream to read
|
39
166
|
# @param [Hash{Symbol => Object}] options
|
40
167
|
# any additional options
|
168
|
+
# @option options [Symbol] :library (:nokogiri)
|
169
|
+
# One of :nokogiri or :rexml. If nil/unspecified uses :nokogiri if available, :rexml otherwise.
|
41
170
|
# @option options [Encoding] :encoding (Encoding::UTF_8)
|
42
171
|
# the encoding of the input stream (Ruby 1.9+)
|
43
172
|
# @option options [Boolean] :validate (false)
|
@@ -48,6 +177,7 @@ module RDF::Microdata
|
|
48
177
|
# whether to intern all parsed URIs
|
49
178
|
# @option options [#to_s] :base_uri (nil)
|
50
179
|
# the base URI to use when resolving relative URIs
|
180
|
+
# @option options [#to_s] :registry_uri (DEFAULT_REGISTRY)
|
51
181
|
# @option options [Array] :debug
|
52
182
|
# Array to place debug messages
|
53
183
|
# @return [reader]
|
@@ -59,24 +189,43 @@ module RDF::Microdata
|
|
59
189
|
super do
|
60
190
|
@debug = options[:debug]
|
61
191
|
|
62
|
-
@
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
options[:encoding] ||= 'utf-8'
|
192
|
+
@library = case options[:library]
|
193
|
+
when nil
|
194
|
+
(defined?(::Nokogiri) && RUBY_PLATFORM != 'java') ? :nokogiri : :rexml
|
195
|
+
when :nokogiri, :rexml
|
196
|
+
options[:library]
|
197
|
+
else
|
198
|
+
raise ArgumentError.new("expected :rexml or :nokogiri, but got #{options[:library].inspect}")
|
199
|
+
end
|
71
200
|
|
72
|
-
|
73
|
-
|
201
|
+
require "rdf/microdata/reader/#{@library}"
|
202
|
+
@implementation = case @library
|
203
|
+
when :nokogiri then Nokogiri
|
204
|
+
when :rexml then REXML
|
74
205
|
end
|
75
|
-
|
76
|
-
|
206
|
+
self.extend(@implementation)
|
207
|
+
|
208
|
+
initialize_html(input, options) rescue raise RDF::ReaderError.new($!.message)
|
209
|
+
|
210
|
+
if (root.nil? && validate?)
|
211
|
+
raise RDF::ReaderError, "Empty Document"
|
212
|
+
end
|
213
|
+
errors = doc_errors.reject {|e| e.to_s =~ /Tag (audio|source|track|video|time) invalid/}
|
77
214
|
raise RDF::ReaderError, "Syntax errors:\n#{errors}" if !errors.empty? && validate?
|
78
|
-
raise RDF::ReaderError, "Empty document" if (@doc.nil? || @doc.root.nil?) && validate?
|
79
215
|
|
216
|
+
add_debug(@doc, "library = #{@library}")
|
217
|
+
|
218
|
+
# Load registry
|
219
|
+
unless Registry.loaded?
|
220
|
+
registry = options[:registry_uri] || DEFAULT_REGISTRY
|
221
|
+
begin
|
222
|
+
json = RDF::Util::File.open_file(registry) { |f| JSON.load(f) }
|
223
|
+
rescue JSON::ParserError => e
|
224
|
+
raise RDF::ReaderError, "Failed to parse registry: #{e.message}"
|
225
|
+
end
|
226
|
+
Registry.load_registry(json)
|
227
|
+
end
|
228
|
+
|
80
229
|
if block_given?
|
81
230
|
case block.arity
|
82
231
|
when 0 then instance_eval(&block)
|
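The library selection in the hunk above prefers Nokogiri when it is loaded and the platform is not JRuby, accepts an explicit `:nokogiri` or `:rexml`, and rejects anything else. A standalone sketch of that decision follows; `select_library` and its keyword arguments are hypothetical, introduced here so the logic can be exercised without either parser installed.

```ruby
# Decide which HTML parsing backend to use.
# requested:          nil, :nokogiri, or :rexml (mirrors options[:library])
# nokogiri_available: stands in for defined?(::Nokogiri)
# platform:           stands in for RUBY_PLATFORM
def select_library(requested, nokogiri_available:, platform: RUBY_PLATFORM)
  case requested
  when nil
    # No preference given: use Nokogiri only if loaded and not on JRuby.
    nokogiri_available && platform != 'java' ? :nokogiri : :rexml
  when :nokogiri, :rexml
    requested
  else
    raise ArgumentError, "expected :rexml or :nokogiri, but got #{requested.inspect}"
  end
end
```

Passing the platform and availability as arguments is a choice made for this sketch; the reader itself consults `RUBY_PLATFORM` and `defined?(::Nokogiri)` directly.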
@@ -121,19 +270,19 @@ module RDF::Microdata
|
|
121
270
|
@bnode_cache[value.to_s] ||= RDF::Node.new(value)
|
122
271
|
end
|
123
272
|
|
124
|
-
# Figure out the document path, if it is
|
273
|
+
# Figure out the document path, if it is an Element or Attribute
|
125
274
|
def node_path(node)
|
126
|
-
"<#{base_uri}
|
127
|
-
when Nokogiri::XML::Node then node.display_path
|
128
|
-
else node.to_s
|
129
|
-
end
|
275
|
+
"<#{base_uri}>#{node.respond_to?(:display_path) ? node.display_path : node}"
|
130
276
|
end
|
131
277
|
|
132
278
|
# Add debug event to debug array, if specified
|
133
279
|
#
|
134
|
-
# @param [XML
|
280
|
+
# @param [Nokogiri::XML::Node, #to_s] node:: XML Node or string for showing context
|
135
281
|
# @param [String] message::
|
136
|
-
|
282
|
+
# @yieldreturn [String] appended to message, to allow for lazy evaluation of message
|
283
|
+
def add_debug(node, message = "")
|
284
|
+
return unless ::RDF::Microdata.debug? || @debug
|
285
|
+
message = message + yield if block_given?
|
137
286
|
puts "#{node_path(node)}: #{message}" if ::RDF::Microdata::debug?
|
138
287
|
@debug << "#{node_path(node)}: #{message}" if @debug.is_a?(Array)
|
139
288
|
end
|
@@ -153,107 +302,50 @@ module RDF::Microdata
|
|
153
302
|
# @raise [ReaderError]:: Checks parameter types and raises if they are incorrect if parsing mode is _validate_.
|
154
303
|
def add_triple(node, subject, predicate, object)
|
155
304
|
statement = RDF::Statement.new(subject, predicate, object)
|
156
|
-
add_debug(node
|
305
|
+
add_debug(node) {"statement: #{RDF::NTriples.serialize(statement)}"}
|
157
306
|
@callback.call(statement)
|
158
307
|
end
|
159
308
|
|
160
309
|
# Parsing a Microdata document (this is *not* the recursive method)
|
161
310
|
def parse_whole_document(doc, base)
|
162
|
-
|
163
|
-
|
164
|
-
|
165
|
-
add_debug(doc, "parse_whole_doc: options=#{@options.inspect}")
|
166
|
-
|
167
|
-
if (base)
|
311
|
+
base = doc_base(base)
|
312
|
+
options[:base_uri] = if (base)
|
168
313
|
# Strip any fragment from base
|
169
314
|
base = base.to_s.split('#').first
|
170
|
-
base =
|
171
|
-
add_debug(base_el, "parse_whole_doc: base='#{base}'")
|
315
|
+
base = uri(base)
|
172
316
|
else
|
173
317
|
base = RDF::URI("")
|
174
318
|
end
|
175
319
|
|
176
|
-
|
320
|
+
add_debug(nil) {"parse_whole_doc: base='#{base}'"}
|
321
|
+
|
322
|
+
ec = {
|
323
|
+
:memory => {},
|
324
|
+
:current_name => nil,
|
325
|
+
:current_type => nil,
|
326
|
+
:current_vocabulary => nil,
|
327
|
+
:document_base => base,
|
328
|
+
}
|
329
|
+
items = []
|
330
|
+
# 1) For each element that is also a top-level item run the following algorithm:
|
177
331
|
#
|
178
|
-
#
|
179
|
-
#
|
180
|
-
#
|
181
|
-
|
182
|
-
|
183
|
-
|
184
|
-
next unless rel && href
|
185
|
-
href = uri(href, el.base || base)
|
186
|
-
add_debug(el, "a: rel=#{rel.inspect}, href=#{href}")
|
187
|
-
|
188
|
-
# Otherwise, split the value of the element's rel attribute on spaces, obtaining list of tokens.
|
189
|
-
# Coalesce duplicate tokens in list of tokens.
|
190
|
-
tokens = rel.to_s.split(/\s+/).map do |tok|
|
191
|
-
# Convert each token in list of tokens that does not contain a U+003A COLON characters (:)
|
192
|
-
# to ASCII lowercase.
|
193
|
-
tok =~ /:/ ? tok : tok.downcase
|
194
|
-
end.uniq
|
195
|
-
|
196
|
-
# If list of tokens contains both the tokens alternate and stylesheet,
|
197
|
-
# then remove them both and replace them with the single (uppercase) token
|
198
|
-
# ALTERNATE-STYLESHEET.
|
199
|
-
if tokens.include?('alternate') && tokens.include?('stylesheet')
|
200
|
-
tokens = tokens - %w(alternate stylesheet)
|
201
|
-
tokens << 'ALTERNATE-STYLESHEET'
|
202
|
-
end
|
203
|
-
|
204
|
-
tokens.each do |tok|
|
205
|
-
tok_uri = RDF::URI(tok)
|
206
|
-
if tok !~ /:/
|
207
|
-
# For each token token in list of tokens that contains no U+003A COLON characters (:),
|
208
|
-
# generate the following triple:
|
209
|
-
add_triple(el, base, RDF::XHV[tok.gsub('#', '%23')], href)
|
210
|
-
elsif tok_uri.absolute?
|
211
|
-
# For each token token in list of tokens that is an absolute URL, generate the following triple:
|
212
|
-
add_triple(el, base, tok_uri, href)
|
213
|
-
end
|
214
|
-
end
|
215
|
-
end
|
216
|
-
|
217
|
-
# 3. For each meta element in the Document that has a name attribute and a content attribute,
|
218
|
-
doc.css('meta[name][content]').each do |el|
|
219
|
-
name, content = el.attribute('name'), el.attribute('content')
|
220
|
-
name = name.to_s
|
221
|
-
name_uri = uri(name, el.base || base)
|
222
|
-
add_debug(el, "meta: name=#{name.inspect}")
|
223
|
-
if name !~ /:/
|
224
|
-
# If the value of the name attribute contains no U+003A COLON characters (:),
|
225
|
-
# generate the following triple:
|
226
|
-
add_triple(el, base, RDF::XHV[name.downcase.gsub('#', '%23')], RDF::Literal(content, :language => el.language))
|
227
|
-
elsif name_uri.absolute?
|
228
|
-
# If the value of the name attribute contains no U+003A COLON characters (:),
|
229
|
-
# generate the following triple:
|
230
|
-
add_triple(el, base, name_uri, RDF::Literal(content, :language => el.language))
|
231
|
-
end
|
232
|
-
end
|
233
|
-
|
234
|
-
# 4. For each blockquote and q element in the Document that has a cite attribute that resolves
|
235
|
-
# successfully relative to the element, generate the following triple:
|
236
|
-
doc.css('blockquote[cite], q[cite]').each do |el|
|
237
|
-
object = uri(el.attribute('cite'), el.base || base)
|
238
|
-
add_debug(el, "blockquote: cite=#{object}")
|
239
|
-
add_triple(el, base, RDF::DC.source, object)
|
332
|
+
# 1) Generate the triples for an item item, using the evaluation context.
|
333
|
+
# Let result be the (URI reference or blank node) subject returned.
|
334
|
+
# 2) Append result to item list.
|
335
|
+
getItems.each do |el|
|
336
|
+
result = generate_triples(el, ec)
|
337
|
+
items << result
|
240
338
|
end
|
339
|
+
|
340
|
+
# 2) Generate an RDF Collection list from
|
341
|
+
# the ordered list of values. Set value to the value returned from generate an RDF Collection.
|
342
|
+
value = generateRDFCollection(root, items)
|
241
343
|
|
242
|
-
#
|
243
|
-
#
|
244
|
-
#
|
245
|
-
#
|
246
|
-
|
247
|
-
# subject the document's current address
|
248
|
-
# predicate http://www.w3.org/1999/xhtml/microdata#item
|
249
|
-
# object result
|
250
|
-
memory = {}
|
251
|
-
doc.css('[itemscope]').
|
252
|
-
select {|el| !el.has_attribute?('itemprop')}.
|
253
|
-
each do |el|
|
254
|
-
object = generate_triples(el, memory)
|
255
|
-
add_triple(el, base, RDF::MD.item, object)
|
256
|
-
end
|
344
|
+
# 3) Generate the following triple:
|
345
|
+
# subject Document base
|
346
|
+
# predicate http://www.w3.org/1999/xhtml/microdata#item
|
347
|
+
# object value
|
348
|
+
add_triple(doc, base, RDF::MD.item, value) if value
|
257
349
|
|
258
350
|
add_debug(doc, "parse_whole_doc: traversal complete")
|
259
351
|
end
|
@@ -261,94 +353,119 @@ module RDF::Microdata
|
|
261
353
|
##
|
262
354
|
# Generate triples for an item
|
263
355
|
# @param [RDF::Resource] item
|
264
|
-
# @param [Hash{
|
265
|
-
# @
|
266
|
-
# @option
|
267
|
-
# @option options [RDF::Resource] :fallback_name
|
356
|
+
# @param [Hash{Symbol => Object}] ec
|
357
|
+
# @option ec [Hash{Nokogiri::XML::Element} => RDF::Resource] memory
|
358
|
+
# @option ec [RDF::Resource] :current_type
|
268
359
|
# @return [RDF::Resource]
|
269
|
-
def generate_triples(item,
|
270
|
-
|
271
|
-
|
272
|
-
|
273
|
-
# 1. If there is an entry for item in memory, then let subject be the subject of that entry.
|
360
|
+
def generate_triples(item, ec = {})
|
361
|
+
memory = ec[:memory]
|
362
|
+
# 1) If there is an entry for item in memory, then let subject be the subject of that entry.
|
274
363
|
# Otherwise, if item has a global identifier and that global identifier is an absolute URL,
|
275
364
|
# let subject be that global identifier. Otherwise, let subject be a new blank node.
|
276
|
-
subject = if memory.include?(item)
|
277
|
-
memory[item][:subject]
|
365
|
+
subject = if memory.include?(item.node)
|
366
|
+
memory[item.node][:subject]
|
278
367
|
elsif item.has_attribute?('itemid')
|
279
|
-
|
368
|
+
uri(item.attribute('itemid'), item.base || base_uri)
|
280
369
|
end || RDF::Node.new
|
281
|
-
memory[item] ||= {}
|
370
|
+
memory[item.node] ||= {}
|
282
371
|
|
283
|
-
add_debug(item
|
372
|
+
add_debug(item) {"gentrips(2): subject=#{subject.inspect}, current_type: #{ec[:current_type]}"}
|
284
373
|
|
285
|
-
# 2
|
286
|
-
memory[item][:subject] ||= subject
|
374
|
+
# 2) Add a mapping from item to subject in memory, if there isn't one already.
|
375
|
+
memory[item.node][:subject] ||= subject
|
287
376
|
|
288
|
-
# 3
|
289
|
-
|
290
|
-
|
291
|
-
|
377
|
+
# 3) For each type returned from element.itemType of the element defining the item.
|
378
|
+
type = nil
|
379
|
+
item.attribute('itemtype').to_s.split(' ').map{|n| uri(n)}.select(&:absolute?).each do |t|
|
380
|
+
# 3.1. If type is an absolute URL, generate the following triple:
|
381
|
+
type ||= t
|
382
|
+
add_triple(item, subject, RDF.type, t)
|
383
|
+
end
|
292
384
|
|
293
|
-
|
294
|
-
|
295
|
-
|
296
|
-
|
297
|
-
|
298
|
-
|
299
|
-
|
300
|
-
|
301
|
-
|
302
|
-
|
303
|
-
|
304
|
-
|
305
|
-
|
306
|
-
|
307
|
-
type += '%20' unless type.to_s[-1,1] == ':'
|
308
|
-
# 5.5. Append the fragment-escaped value of fallback name to type.
|
309
|
-
type += fallback_name.to_s.gsub('#', '%23')
|
385
|
+
# 5) If type is not an absolute URL, set it to current type from the Evaluation Context if not empty.
|
386
|
+
type ||= ec[:current_type]
|
387
|
+
add_debug(item) {"gentrips(5): type=#{type.inspect}"}
|
388
|
+
|
389
|
+
# 6) If the registry contains a URI prefix that is a character for character match of type up to the length of the
|
390
|
+
# URI prefix, set vocab as that URI prefix
|
391
|
+
vocab = Registry.find(type)
|
392
|
+
|
393
|
+
# 7) Otherwise, if type is not empty, construct vocab by removing everything following the last
|
394
|
+
# SOLIDUS U+002F ("/") or NUMBER SIGN U+0023 ("#") from type.
|
395
|
+
vocab ||= begin
|
396
|
+
type_vocab = type.to_s.sub(/([\/\#])[^\/\#]*$/, '\1')
|
397
|
+
add_debug(item) {"gentrips(7): type_vocab=#{type_vocab.inspect}"}
|
398
|
+
Registry.new(type_vocab) # if type
|
310
399
|
end
|
311
400
|
|
312
|
-
|
313
|
-
|
314
|
-
|
401
|
+
# 8) Update evaluation context setting current vocabulary to vocab.
|
402
|
+
ec[:current_vocabulary] = vocab
|
403
|
+
|
404
|
+
# 9) Set property list to an empty mapping between properties and one or more ordered values as established below.
|
405
|
+
property_list = {}
|
406
|
+
|
407
|
+
# 10. For each element _element_ that has one or more property names and is one of the
|
315
408
|
# properties of the item _item_, in the order those elements are given by the algorithm
|
316
409
|
# that returns the properties of an item, run the following substep:
|
317
410
|
props = item_properties(item)
|
318
|
-
|
319
|
-
# 6.1. For each name name in element's property names, run the following substeps:
|
411
|
+
# 10.1. For each name name in element's property names, run the following substeps:
|
320
412
|
props.each do |element|
|
321
|
-
element.attribute('itemprop').to_s.split(' ').each do |name|
|
322
|
-
add_debug(element
|
323
|
-
#
|
324
|
-
|
325
|
-
|
326
|
-
|
413
|
+
element.attribute('itemprop').to_s.split(' ').compact.each do |name|
|
414
|
+
add_debug(element) {"gentrips(10.1): name=#{name.inspect}, type=#{type}"}
|
415
|
+
# Let context be a copy of evaluation context with current type set to type and current vocabulary set to vocab.
|
416
|
+
ec_new = ec.merge({:current_type => type, :current_vocabulary => vocab})
|
417
|
+
|
418
|
+
predicate = vocab.predicateURI(name, ec_new)
|
419
|
+
ec_new[:current_name] = predicate
|
420
|
+
add_debug(element) {"gentrips(10.1.2): predicate=#{predicate}"}
|
421
|
+
|
422
|
+
# 10.1.3) Let value be the property value of element.
|
327
423
|
value = property_value(element)
|
328
|
-
add_debug(element
|
424
|
+
add_debug(element) {"gentrips(10.1.3) value=#{value.inspect}"}
|
329
425
|
|
426
|
+
# 10.1.4) If value is an item, then generate the triples for value using a copy of evaluation context with
|
427
|
+
# current type set to type. Replace value by the subject returned from those steps.
|
330
428
|
if value.is_a?(Hash)
|
331
|
-
value = generate_triples(element,
|
429
|
+
value = generate_triples(element, ec_new)
|
430
|
+
add_debug(element) {"gentrips(10.1.4): value=#{value.inspect}"}
|
332
431
|
end
|
333
|
-
|
334
|
-
add_debug(element, "gentrips(6.1.3): value=#{value.inspect}")
|
335
432
|
|
336
|
-
predicate
|
337
|
-
|
338
|
-
else
|
339
|
-
# Use the URI of the type to create URIs for @itemprop terms
|
340
|
-
add_debug(element, "gentrips: rdf_type=#{rdf_type}")
|
341
|
-
predicate = RDF::URI(rdf_type.to_s.sub(/([\/\#])[^\/\#]*$/, '\1' + name))
|
342
|
-
end
|
343
|
-
add_debug(element, "gentrips(6.1.5): predicate=#{predicate}")
|
344
|
-
|
345
|
-
add_triple(element, subject, predicate, value) if predicate
|
433
|
+
property_list[predicate] ||= []
|
434
|
+
property_list[predicate] << value
|
346
435
|
end
|
347
436
|
end
|
348
437
|
|
438
|
+
# 11) For each predicate in property list
|
439
|
+
property_list.each do |predicate, values|
|
440
|
+
generatePropertyValues(item, subject, predicate, values, ec)
|
441
|
+
end
|
442
|
+
|
349
443
|
subject
|
350
444
|
end
|
351
445
|
|
446
|
+
def generatePropertyValues(element, subject, predicate, values, ec)
|
447
|
+
registry = ec[:current_vocabulary]
|
448
|
+
if registry.as_list(predicate)
|
449
|
+
value = generateRDFCollection(element, values)
|
450
|
+
add_triple(element, subject, predicate, value)
|
451
|
+
else
|
452
|
+
values.each {|v| add_triple(element, subject, predicate, v)}
|
453
|
+
end
|
454
|
+
end
|
455
|
+
|
456
|
+
##
|
457
|
+
# Called when values has more than one entry
|
458
|
+
# @param [Nokogiri::HTML::Element] element
|
459
|
+
# @param [Array<RDF::Value>] values
|
460
|
+
# @return [RDF::Node]
|
461
|
+
def generateRDFCollection(element, values)
|
462
|
+
list = RDF::List.new(nil, nil, values)
|
463
|
+
list.each_statement do |st|
|
464
|
+
add_triple(element, st.subject, st.predicate, st.object) unless st.object == RDF.List
|
465
|
+
end
|
466
|
+
list.subject
|
467
|
+
end
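Note on the hunk above: values are now collected per predicate in `property_list` and emitted afterwards, either as a single RDF collection (when the registry's `as_list` returns true for the predicate) or as one triple per value. The following is a minimal sketch of that two-phase approach, not the gem's code. Plain strings and arrays stand in for RDF terms, and `build_property_list`, `emit_values`, and the `_:lN` blank node labels are hypothetical names introduced only for illustration.

```ruby
# Illustrative sketch only: plain arrays and strings stand in for RDF terms.
RDF_FIRST = "rdf:first"
RDF_REST  = "rdf:rest"
RDF_NIL   = "rdf:nil"

# Group (predicate, value) pairs by predicate, preserving document order.
def build_property_list(pairs)
  pairs.each_with_object({}) do |(predicate, value), list|
    (list[predicate] ||= []) << value
  end
end

# Emit triples for one predicate: a linked list when as_list is true,
# otherwise one triple per value.
def emit_values(subject, predicate, values, as_list)
  return values.map { |v| [subject, predicate, v] } unless as_list

  # Hypothetical blank node labels _:l0, _:l1, ... (unique only within one call).
  nodes = values.each_index.map { |i| "_:l#{i}" }
  triples = [[subject, predicate, nodes.first || RDF_NIL]]
  values.each_with_index do |v, i|
    triples << [nodes[i], RDF_FIRST, v]
    triples << [nodes[i], RDF_REST, nodes[i + 1] || RDF_NIL]
  end
  triples
end
```

Deferring emission until all properties have been seen is what makes the list form possible: a collection needs every value for the predicate before its `rdf:rest` links can be written. The sketch never produces `rdf:type rdf:List` statements, which is comparable to the diff filtering out statements whose object is `RDF.List`.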
|
468
|
+
|
352
469
|
##
|
353
470
|
# To find the properties of an item defined by the element root, the user agent must try
|
354
471
|
# to crawl the properties of the element root, with an empty list as the value of memory:
|
@@ -378,13 +495,14 @@ module RDF::Microdata
|
|
378
495
|
# @return [Array<Array<Nokogiri::XML::Element>, Integer>]
|
379
496
|
# Resultant elements and error count
|
380
497
|
def crawl_properties(root, memory)
|
498
|
+
|
381
499
|
# 1. If root is in memory, then the algorithm fails; abort these steps.
|
382
500
|
raise CrawlFailure, "crawl_props mem already has #{root.inspect}" if memory.include?(root)
|
383
501
|
|
384
502
|
# 2. Collect all the elements in the item root; let results be the resulting
|
385
503
|
# list of elements, and errors be the resulting count of errors.
|
386
504
|
results, errors = elements_in_item(root)
|
387
|
-
add_debug(root
|
505
|
+
add_debug(root) {"crawl_properties results=#{results.map {|e| node_path(e)}.inspect}, errors=#{errors}"}
|
388
506
|
|
389
507
|
# 3. Remove any elements from results that do not have an itemprop attribute specified.
|
390
508
|
results = results.select {|e| e.has_attribute?('itemprop')}
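Note on the `add_debug` changes in this diff: calls move from passing a string argument to passing a block. A likely motivation (an assumption, since the `add_debug` implementation is not shown here) is that the message, which may involve mapping over every result element, is then built only when debugging is active. A minimal sketch of that pattern, with `DebugLog` as a hypothetical stand-in class:

```ruby
# Hypothetical stand-in for a reader's debug facility; not the gem's implementation.
class DebugLog
  attr_reader :messages

  def initialize(enabled)
    @enabled = enabled
    @messages = []
  end

  # The block runs only when debugging is enabled, so costly string
  # interpolation is skipped entirely otherwise.
  def add_debug(node_name)
    return unless @enabled
    @messages << "#{node_name}: #{yield}"
  end
end
```

With the string form, the interpolation would be evaluated before the method is even entered, whether or not anything is logged.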
|
@@ -427,13 +545,13 @@ module RDF::Microdata
|
|
427
545
|
# If root has an itemref attribute, split the value of that itemref attribute on spaces.
|
428
546
|
# For each resulting token ID,
|
429
547
|
root.attribute('itemref').to_s.split(' ').each do |id|
|
430
|
-
add_debug(root
|
548
|
+
add_debug(root) {"elements_in_item itemref id #{id}"}
|
431
549
|
# if there is an element in the home subtree of root with the ID ID,
|
432
550
|
# then add the first such element to pending.
|
433
|
-
id_elem =
|
551
|
+
id_elem = find_element_by_id(id)
|
434
552
|
pending << id_elem if id_elem
|
435
553
|
end
|
436
|
-
add_debug(root
|
554
|
+
add_debug(root) {"elements_in_item pending #{pending.inspect}"}
|
437
555
|
|
438
556
|
# Loop: Remove an element from pending and let current be that element.
|
439
557
|
while current = pending.shift
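Note on the `itemref` handling above: the attribute value is split on whitespace and each token is looked up by ID, with unknown IDs silently skipped. A small sketch of just that step follows. `resolve_itemref` and the `ids_to_elements` hash are hypothetical; the hash stands in for the reader's `find_element_by_id`, whose implementation is not shown in this hunk.

```ruby
# Sketch: ids_to_elements is a hypothetical index standing in for find_element_by_id.
def resolve_itemref(itemref_value, ids_to_elements)
  pending = []
  # split(' ') ignores leading, trailing, and repeated whitespace;
  # to_s turns a missing attribute (nil) into an empty token list.
  itemref_value.to_s.split(' ').each do |id|
    elem = ids_to_elements[id]
    pending << elem if elem
  end
  pending
end
```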
|
@@ -457,37 +575,42 @@ module RDF::Microdata
|
|
457
575
|
##
|
458
576
|
#
|
459
577
|
def property_value(element)
|
460
|
-
|
461
|
-
|
578
|
+
base = element.base || base_uri
|
579
|
+
add_debug(element) {"property_value(#{element.name}): base #{base.inspect}"}
|
580
|
+
value = case
|
462
581
|
when element.has_attribute?('itemscope')
|
463
582
|
{}
|
464
583
|
when element.name == 'meta'
|
465
|
-
element.attribute('content').to_s
|
584
|
+
RDF::Literal.new(element.attribute('content').to_s, :language => element.language)
|
585
|
+
when element.name == 'data'
|
586
|
+
RDF::Literal.new(element.attribute('value').to_s, :language => element.language)
|
466
587
|
when %w(audio embed iframe img source track video).include?(element.name)
|
467
|
-
uri(element.attribute('src'),
|
588
|
+
uri(element.attribute('src'), base)
|
468
589
|
when %w(a area link).include?(element.name)
|
469
|
-
uri(element.attribute('href'),
|
590
|
+
uri(element.attribute('href'), base)
|
470
591
|
when %w(object).include?(element.name)
|
471
|
-
uri(element.attribute('data'),
|
472
|
-
when %w(time).include?(element.name)
|
592
|
+
uri(element.attribute('data'), base)
|
593
|
+
when %w(time).include?(element.name)
|
473
594
|
# Lexically scan value and assign appropriate type, otherwise, leave untyped
|
474
|
-
v = element.attribute('datetime').to_s
|
475
|
-
datatype = %w(Date Time DateTime).map {|t| RDF::Literal.const_get(t)}.detect do |dt|
|
595
|
+
v = (element.attribute('datetime') || element.text).to_s
|
596
|
+
datatype = %w(Date Time DateTime Duration).map {|t| RDF::Literal.const_get(t)}.detect do |dt|
|
476
597
|
v.match(dt::GRAMMAR)
|
477
598
|
end || RDF::Literal
|
478
|
-
datatype.new(v)
|
599
|
+
datatype.new(v, :language => element.language)
|
479
600
|
else
|
480
|
-
RDF::Literal.new(element.
|
601
|
+
RDF::Literal.new(element.inner_text, :language => element.language)
|
481
602
|
end
|
603
|
+
add_debug(element) {" #{value.inspect}"}
|
604
|
+
value
|
482
605
|
end
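Note on the `time` branch of `property_value`: the new code falls back to the element text when no `datetime` attribute is present, then picks the first literal class whose grammar matches, defaulting to a plain literal. The sketch below mirrors that selection logic only. The regular expressions are simplified assumptions written for this example and are much looser than the XSD lexical grammars; they are not the `GRAMMAR` constants from RDF.rb, and `classify_time_value` is a hypothetical name.

```ruby
# Simplified lexical patterns; assumptions for illustration only.
PATTERNS = {
  date:     /\A-?\d{4}-\d{2}-\d{2}(Z|[+-]\d{2}:\d{2})?\z/,
  time:     /\A\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?\z/,
  dateTime: /\A-?\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?\z/,
  duration: /\A-?P(?=.)(\d+Y)?(\d+M)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?\z/
}.freeze

# Returns [type, lexical_value]; type is :plain when no pattern matches.
def classify_time_value(datetime_attr, text)
  # Prefer the datetime attribute, otherwise use the element's text content.
  v = (datetime_attr || text).to_s
  type, _ = PATTERNS.detect { |_, re| v.match?(re) }
  [type || :plain, v]
end
```

For example, `classify_time_value(nil, "12:30:00")` yields the `:time` type from the text fallback, while a value such as `"yesterday"` stays `:plain`.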
|
483
606
|
|
484
607
|
# Fixme, what about xml:base relative to element?
|
485
608
|
def uri(value, base = nil)
|
486
609
|
value = if base
|
487
610
|
base = uri(base) unless base.is_a?(RDF::URI)
|
488
|
-
base.join(value)
|
611
|
+
base.join(value.to_s)
|
489
612
|
else
|
490
|
-
RDF::URI(value)
|
613
|
+
RDF::URI(value.to_s)
|
491
614
|
end
|
492
615
|
value.validate! if validate?
|
493
616
|
value.canonicalize! if canonicalize?
|