jm-calais 0.0.13
- data/CHANGELOG.markdown +63 -0
- data/Gemfile +3 -0
- data/MIT-LICENSE +20 -0
- data/README.markdown +55 -0
- data/Rakefile +36 -0
- data/lib/calais/client.rb +115 -0
- data/lib/calais/error.rb +3 -0
- data/lib/calais/response.rb +220 -0
- data/lib/calais/version.rb +3 -0
- data/lib/calais.rb +59 -0
- data/spec/calais/client_spec.rb +79 -0
- data/spec/calais/response_spec.rb +149 -0
- data/spec/fixtures/bicycles_australia.response.json +538 -0
- data/spec/fixtures/bicycles_australia.response.rdf +836 -0
- data/spec/fixtures/bicycles_australia.xml +18 -0
- data/spec/fixtures/calais.yml.sample +1 -0
- data/spec/fixtures/error.response.xml +1 -0
- data/spec/fixtures/slovenia_euro.xml +14 -0
- data/spec/fixtures/twitter_tweet_without_score.response.rdf +96 -0
- data/spec/helper.rb +16 -0
- metadata +98 -0
data/CHANGELOG.markdown
ADDED
@@ -0,0 +1,63 @@
+# Changes
+
+## 0.0.13
+
+* load path fix
+
+## 0.0.12
+
+* added relevances to Geographies
+* improved doc
+* removed jeweler dependency and simplified Rakefile
+* bumped rspec requirement
+
+## 0.0.11
+
+* simple fix for some rubies not liking DateTime.parse without including date
+* tests for SocialTags
+* typo fix: SocailTag != SocialTag
+
+## 0.0.10
+
+* community patch to expose SocialTags
+
+## 0.0.9
+
+* updates related to API changes
+* community patches to support bundler, support ruby 1.9
+
+## 0.0.8
+
+* community patches to use nokogiri
+
+## 0.0.7
+* verified 4.0 API
+* moved gem packaging to `jeweler` and documentation to `yard`
+
+## 0.0.6
+* fully implemented 3.1 API
+
+## 0.0.5
+* fixed error where classes weren't being required in the proper order on Ubuntu (reported by Jon Moses)
+* New things coming back from the API. Fixing in tests.
+
+## 0.0.4
+* changed dependency from `hpricot` to `libxml`
+* unicode fun
+* cleanup all around
+
+## 0.0.3
+* pluginized the library for Rails (thanks [pius](http://gitorious.org/projects/calais-au-rails))
+* added helper methods name entity types from a response
+
+## 0.0.2
+* cleanup in the specs
+* cleaner parsing
+* location of named entities
+* more data in relationships
+* moved Names and Relationships
+
+## 0.0.1
+* Access to OpenCalais's Enlighten action
+* Single method to process a document
+* Get relationships and names from a document
data/Gemfile
ADDED
data/MIT-LICENSE
ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2008 Abhay Kumar info@opensynapse.net
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+'Software'), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.markdown
ADDED
@@ -0,0 +1,55 @@
+# Calais #
+A Ruby interface to the [Open Calais Web Service](http://opencalais.com)
+
+## About this Fork ##
+Forked from https://github.com/abhay/calais version ~> 0.0.13
+to fix issues caused by the deprecation of iconv in Ruby > 1.9.3
+
+## Features ##
+* Accepts documents in text/plain, text/xml and text/html format.
+* Basic access to the Open Calais API's Enlighten action.
+* Output is RDF representation of input document.
+* Single function ability to extract names, entities and geographies from given text.
+
+## Synopsis ##
+
+This is a very basic wrapper to the Open Calais API. It uses the POST endpoint and currently supports the Enlighten action. Here's a simple call:
+
+    Calais.enlighten(
+      :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
+      :content_type => :raw,
+      :license_id => 'your license id'
+    )
+
+This is the easiest way to get the RDF-formatted response from the OpenCalais service.
+
+If you want to do something more fun like getting all sorts of fun information about a document, you can try this:
+
+    Calais.process_document(
+      :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
+      :content_type => :raw,
+      :license_id => 'your license id'
+    )
+
+This will return an object containing information extracted from the RDF response.
+
+## Requirements ##
+
+* [Ruby 1.8.5 or better](http://ruby-lang.org)
+* [nokogiri](http://nokogiri.rubyforge.org/nokogiri/), [libxml2](http://xmlsoft.org/), [libxslt](http://xmlsoft.org/xslt/)
+* [curb](http://curb.rubyforge.org/), [libcurl](http://curl.haxx.se/)
+* [json](http://json.rubyforge.org/)
+
+## Install ##
+
+You can install the Calais gem via Rubygems (`gem install calais`) or by building from source.
+
+## Authors ##
+
+* [Abhay Kumar](http://opensynapse.net)
+
+## Acknowledgements ##
+
+* [Paul Legato](http://www.economaton.com/): Help all around with the new response processor and implementation of the 3.1 API.
+* [Ryan Ong](http://www.ryanong.net/)
+* [Juan Antonio Chavez](https://github.com/TheNaoX): Geographies relevance
data/Rakefile
ADDED
@@ -0,0 +1,36 @@
+# -*- ruby -*-
+
+require 'rake'
+require 'rake/clean'
+
+require './lib/calais.rb'
+
+begin
+  require 'rspec/core/rake_task'
+
+  RSpec::Core::RakeTask.new(:spec)
+
+  task :default => :spec
+rescue LoadError
+  puts "RSpec, or one of its dependencies, is not available. Please install it."
+  exit(1)
+end
+
+begin
+  require 'yard'
+  require 'yard/rake/yardoc_task'
+
+  YARD::Rake::YardocTask.new do |t|
+    t.options = ["--verbose", "--markup=markdown", "--files=CHANGELOG.markdown,MIT-LICENSE"]
+  end
+
+  task :rdoc => :yardoc
+
+  CLOBBER.include 'doc'
+  CLOBBER.include '.yardoc'
+rescue LoadError
+  puts "Yard, or one of its dependencies is not available. Please install it."
+  exit(1)
+end
+
+# vim: syntax=Ruby
data/lib/calais/client.rb
ADDED
@@ -0,0 +1,115 @@
+module Calais
+  class Client
+    # base attributes of the call
+    attr_accessor :content
+    attr_accessor :license_id
+
+    # processing directives
+    attr_accessor :content_type, :output_format, :reltag_base_url, :calculate_relevance, :omit_outputting_original_text
+    attr_accessor :store_rdf, :metadata_enables, :metadata_discards
+
+    # user directives
+    attr_accessor :allow_distribution, :allow_search, :external_id, :submitter
+
+    attr_accessor :external_metadata
+
+    attr_accessor :use_beta
+
+    def initialize(options={}, &block)
+      options.each {|k,v| send("#{k}=", v)}
+      yield(self) if block_given?
+    end
+
+    def enlighten
+      post_args = {
+        "licenseID" => @license_id,
+        "content" => RUBY_VERSION.to_f < 1.9 ?
+          Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "#{@content} ").first[0..-2] :
+          "#{@content} ".encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')[0 .. -2],
+        "paramsXML" => params_xml
+      }
+
+      do_request(post_args)
+    end
+
+    def params_xml
+      check_params
+      document = Nokogiri::XML::Document.new
+
+      params_node = Nokogiri::XML::Node.new('c:params', document)
+      params_node['xmlns:c'] = 'http://s.opencalais.com/1/pred/'
+      params_node['xmlns:rdf'] = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
+
+      processing_node = Nokogiri::XML::Node.new('c:processingDirectives', document)
+      processing_node['c:contentType'] = AVAILABLE_CONTENT_TYPES[@content_type] if @content_type
+      processing_node['c:outputFormat'] = AVAILABLE_OUTPUT_FORMATS[@output_format] if @output_format
+      processing_node['c:calculateRelevanceScore'] = 'false' if @calculate_relevance == false
+      processing_node['c:reltagBaseURL'] = @reltag_base_url.to_s if @reltag_base_url
+
+      processing_node['c:enableMetadataType'] = @metadata_enables.join(',') unless @metadata_enables.empty?
+      processing_node['c:docRDFaccessible'] = @store_rdf if @store_rdf
+      processing_node['c:discardMetadata'] = @metadata_discards.join(';') unless @metadata_discards.empty?
+      processing_node['c:omitOutputtingOriginalText'] = 'true' if @omit_outputting_original_text
+
+      user_node = Nokogiri::XML::Node.new('c:userDirectives', document)
+      user_node['c:allowDistribution'] = @allow_distribution.to_s unless @allow_distribution.nil?
+      user_node['c:allowSearch'] = @allow_search.to_s unless @allow_search.nil?
+      user_node['c:externalID'] = @external_id.to_s if @external_id
+      user_node['c:submitter'] = @submitter.to_s if @submitter
+
+      params_node << processing_node
+      params_node << user_node
+
+      if @external_metadata
+        external_node = Nokogiri::XML::Node.new('c:externalMetadata', document)
+        external_node << @external_metadata
+        params_node << external_node
+      end
+
+      params_node.to_xml(:indent => 2)
+    end
+
+    def url
+      @url ||= URI.parse(calais_endpoint)
+    end
+
+    private
+    def check_params
+      raise 'missing content' if @content.nil? || @content.empty?
+
+      content_length = @content.length
+      raise 'content is too small' if content_length < MIN_CONTENT_SIZE
+      raise 'content is too large' if content_length > MAX_CONTENT_SIZE
+
+      raise 'missing license id' if @license_id.nil? || @license_id.empty?
+
+      raise 'unknown content type' unless AVAILABLE_CONTENT_TYPES.keys.include?(@content_type) if @content_type
+      raise 'unknown output format' unless AVAILABLE_OUTPUT_FORMATS.keys.include?(@output_format) if @output_format
+
+      %w[calculate_relevance store_rdf allow_distribution allow_search].each do |variable|
+        value = self.send(variable)
+        unless NilClass === value || TrueClass === value || FalseClass === value
+          raise "expected a boolean value for #{variable} but got #{value}"
+        end
+      end
+
+      @metadata_enables ||= []
+      unknown_enables = Set.new(@metadata_enables) - KNOWN_ENABLES
+      raise "unknown metadata enables: #{unknown_enables.to_a.inspect}" unless unknown_enables.empty?
+
+      @metadata_discards ||= []
+      unknown_discards = Set.new(@metadata_discards) - KNOWN_DISCARDS
+      raise "unknown metadata discards: #{unknown_discards.to_a.inspect}" unless unknown_discards.empty?
+    end
+
+    def do_request(post_fields)
+      @request ||= Net::HTTP::Post.new(url.path)
+      @request.set_form_data(post_fields)
+      Net::HTTP.new(url.host, url.port).start {|http| http.request(@request)}.body
+    end
+
+    def calais_endpoint
+      @use_beta ? BETA_REST_ENDPOINT : REST_ENDPOINT
+    end
+  end
+end
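For anyone reviewing what `enlighten` and `do_request` in the client hunk put on the wire, here is a minimal stdlib-only sketch that builds the form-encoded POST without sending it. Several things are assumptions for illustration: the endpoint URL is a placeholder (the real `REST_ENDPOINT` is defined outside this hunk), `sanitize_utf8` and `build_request` are hypothetical helper names, and `String#scrub` (Ruby 2.1+) stands in for the `encode`/`Iconv` branch in the diff rather than reproducing it.

```ruby
require 'net/http'
require 'uri'

# Placeholder endpoint (assumption); the gem's REST_ENDPOINT lives elsewhere.
ENDPOINT = URI.parse('http://example.invalid/enlighten/rest/')

# Hypothetical helper: drop byte sequences that are not valid UTF-8.
def sanitize_utf8(text)
  text.to_s.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# Hypothetical helper: build, but do not send, a form-encoded POST
# carrying the same three fields the client sends.
def build_request(license_id, content, params_xml)
  request = Net::HTTP::Post.new(ENDPOINT.path)
  request.set_form_data(
    'licenseID' => license_id,
    'content'   => sanitize_utf8(content),
    'paramsXML' => params_xml
  )
  request
end

# The \xFF byte is invalid UTF-8 and is removed before encoding.
request = build_request('your license id', "caf\xC3\xA9 \xFFtalk", '<c:params/>')
puts request.body
```

Building the request separately from sending it makes the encoding step easy to inspect in a spec without a license id or network access.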
data/lib/calais/error.rb
ADDED
data/lib/calais/response.rb
ADDED
@@ -0,0 +1,220 @@
+module Calais
+  class Response
+    MATCHERS = {
+      :docinfo => 'DocInfo',
+      :docinfometa => 'DocInfoMeta',
+      :defaultlangid => 'DefaultLangId',
+      :doccat => 'DocCat',
+      :entities => 'type/em/e',
+      :relations => 'type/em/r',
+      :geographies => 'type/er',
+      :instances => 'type/sys/InstanceInfo',
+      :relevances => 'type/sys/RelevanceInfo',
+    }
+
+    attr_accessor :submitter_code, :signature, :language, :submission_date, :request_id, :doc_title, :doc_date
+    attr_accessor :hashes, :entities, :relations, :geographies, :categories, :socialtags, :relevances
+
+    def initialize(rdf_string)
+      @raw_response = rdf_string
+
+      @hashes = []
+      @entities = []
+      @relations = []
+      @geographies = []
+      @relevances = {} # key = String hash, val = Float relevance
+      @categories = []
+      @socialtags = []
+
+      extract_data
+    end
+
+    class Entity
+      attr_accessor :calais_hash, :type, :attributes, :relevance, :instances
+    end
+
+    class Relation
+      attr_accessor :calais_hash, :type, :attributes, :instances
+    end
+
+    class Geography
+      attr_accessor :name, :calais_hash, :attributes, :relevance
+    end
+
+    class Category
+      attr_accessor :name, :score
+    end
+
+    class SocialTag
+      attr_accessor :name, :importance
+    end
+
+    class Instance
+      attr_accessor :prefix, :exact, :suffix, :offset, :length
+
+      # Makes a new Instance object from an appropriate Nokogiri::XML::Node.
+      def self.from_node(node)
+        instance = self.new
+        instance.prefix = node.xpath("c:prefix[1]").first.content
+        instance.exact = node.xpath("c:exact[1]").first.content
+        instance.suffix = node.xpath("c:suffix[1]").first.content
+        instance.offset = node.xpath("c:offset[1]").first.content.to_i
+        instance.length = node.xpath("c:length[1]").first.content.to_i
+
+        instance
+      end
+    end
+
+    class CalaisHash
+      attr_accessor :value
+
+      def self.find_or_create(hash, hashes)
+        if !selected = hashes.select {|h| h.value == hash }.first
+          selected = self.new
+          selected.value = hash
+          hashes << selected
+        end
+
+        selected
+      end
+    end
+
+    private
+    def extract_data
+      doc = Nokogiri::XML(@raw_response)
+
+      if doc.root.xpath("/Error[1]").first
+        raise Calais::Error, doc.root.xpath("/Error/Exception").first.content
+      end
+
+      doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfometa]}')]/..").each do |node|
+        @language = node['language']
+        @submission_date = DateTime.parse node['submissionDate']
+
+        attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
+
+        @signature = attributes.delete('signature')
+        @submitter_code = attributes.delete('submitterCode')
+
+        node.remove
+      end
+
+      doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfo]}')]/..").each do |node|
+        @request_id = node['calaisRequestID']
+
+        attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
+
+        @doc_title = attributes.delete('docTitle')
+        @doc_date = Date.parse(attributes.delete('docDate'))
+
+        node.remove
+      end
+
+      @socialtags = doc.root.xpath("rdf:Description/c:socialtag/..").map do |node|
+        tag = SocialTag.new
+        tag.name = node.xpath("c:name[1]").first.content
+        tag.importance = node.xpath("c:importance[1]").first.content.to_i
+
+        node.remove if node.xpath("c:categoryName[1]").first.nil?
+
+        tag
+      end
+
+      @categories = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:doccat]}')]/..").map do |node|
+        category = Category.new
+        category.name = node.xpath("c:categoryName[1]").first.content
+        score = node.xpath("c:score[1]").first
+        category.score = score.content.to_f unless score.nil?
+
+        node.remove
+        category
+      end
+
+      @relevances = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relevances]}')]/..").inject({}) do |acc, node|
+        subject_hash = node.xpath("c:subject[1]").first[:resource].split('/')[-1]
+        acc[subject_hash] = node.xpath("c:relevance[1]").first.content.to_f
+
+        node.remove
+        acc
+      end
+
+      @entities = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:entities]}')]/..").map do |node|
+        extracted_hash = node['about'].split('/')[-1] rescue nil
+
+        entity = Entity.new
+        entity.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
+        entity.type = extract_type(node)
+        entity.attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
+
+        entity.relevance = @relevances[extracted_hash]
+        entity.instances = extract_instances(doc, extracted_hash)
+
+        node.remove
+        entity
+      end
+
+      @relations = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relations]}')]/..").map do |node|
+        extracted_hash = node['about'].split('/')[-1] rescue nil
+
+        relation = Relation.new
+        relation.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
+        relation.type = extract_type(node)
+        relation.attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
+        relation.instances = extract_instances(doc, extracted_hash)
+
+        node.remove
+        relation
+      end
+
+      @geographies = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:geographies]}')]/..").map do |node|
+        attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
+
+        geography = Geography.new
+        geography.name = attributes.delete('name')
+        geography.calais_hash = attributes.delete('subject')
+        geography.attributes = attributes
+        geography.relevance = extract_relevance(geography.calais_hash.value)
+
+        node.remove
+        geography
+      end
+
+      doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:defaultlangid]}')]/..").each { |node| node.remove }
+      doc.root.xpath("./*").each { |node| node.remove }
+
+      return
+    end
+
+    def extract_instances(doc, hash)
+      doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:instances]}')]/..").select do |instance_node|
+        instance_node.xpath("c:subject[1]").first[:resource].split("/")[-1] == hash
+      end.map do |instance_node|
+        instance = Instance.from_node(instance_node)
+        instance_node.remove
+
+        instance
+      end
+    end
+
+    def extract_type(node)
+      node.xpath("*[name()='rdf:type']")[0]['resource'].split('/')[-1]
+    rescue
+      nil
+    end
+
+    def extract_attributes(nodes)
+      nodes.inject({}) do |hsh, node|
+        value = if node['resource']
+          extracted_hash = node['resource'].split('/')[-1] rescue nil
+          CalaisHash.find_or_create(extracted_hash, @hashes)
+        else
+          node.content
+        end
+        hsh.merge(node.name => value)
+      end
+    end
+    def extract_relevance(value)
+      return @relevances[value]
+    end
+  end
+end
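The response parser interns every Calais hash through `CalaisHash.find_or_create`, while `@relevances` stays keyed by the raw hash string. A small standalone sketch of that pattern follows. `InternedHash`, `pool`, and the hash strings and score are hypothetical stand-ins made up for illustration, and nothing here touches Nokogiri or a real RDF response.

```ruby
# Hypothetical stand-in for Calais::Response::CalaisHash.
class InternedHash
  attr_accessor :value

  # Return the existing wrapper for `value`, or create and remember a new one.
  def self.find_or_create(value, pool)
    found = pool.find { |h| h.value == value }
    return found if found

    created = new
    created.value = value
    pool << created
    created
  end
end

pool = []
relevances = { 'abc123' => 0.71 } # made-up hash string and score

first  = InternedHash.find_or_create('abc123', pool)
second = InternedHash.find_or_create('abc123', pool)
other  = InternedHash.find_or_create('def456', pool)

puts first.equal?(second)            # the same wrapper object is reused
puts pool.size                       # one wrapper per distinct hash string
puts relevances[first.value].inspect # lookup goes through the string value
puts relevances[other.value].inspect # nil when no relevance was recorded
```

Because the relevance table is keyed by string, a lookup has to go through `.value` on the wrapper, which matches how the geography branch calls `extract_relevance(geography.calais_hash.value)`.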