abhay-calais 0.0.7
- data/CHANGELOG.markdown +33 -0
- data/MIT-LICENSE +20 -0
- data/README.markdown +49 -0
- data/Rakefile +97 -0
- data/VERSION.yml +4 -0
- data/lib/calais.rb +56 -0
- data/lib/calais/client.rb +110 -0
- data/lib/calais/error.rb +3 -0
- data/lib/calais/response.rb +201 -0
- data/spec/calais/client_spec.rb +79 -0
- data/spec/calais/response_spec.rb +128 -0
- data/spec/helper.rb +12 -0
- metadata +92 -0
data/CHANGELOG.markdown
ADDED
@@ -0,0 +1,33 @@
# Changes

## 0.0.7
* verified 4.0 API
* moved gem packaging to `jeweler` and documentation to `yard`

## 0.0.6
* fully implemented 3.1 API

## 0.0.5
* fixed error where classes weren't being required in the proper order on Ubuntu (reported by Jon Moses)
* New things coming back from the API. Fixing in tests.

## 0.0.4
* changed dependency from `hpricot` to `libxml`
* unicode fun
* cleanup all around

## 0.0.3
* pluginized the library for Rails (thanks [pius](http://gitorious.org/projects/calais-au-rails))
* added helper methods name entity types from a response

## 0.0.2
* cleanup in the specs
* cleaner parsing
* location of named entities
* more data in relationships
* moved Names and Relationships

## 0.0.1
* Access to OpenCalais's Enlighten action
* Single method to process a document
* Get relationships and names from a document
data/MIT-LICENSE
ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2008 Abhay Kumar info@opensynapse.net

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
'Software'), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.markdown
ADDED
@@ -0,0 +1,49 @@
# Calais #
A Ruby interface to the [Open Calais Web Service](http://opencalais.com)

## Features ##
* Accepts documents in text/plain, text/xml and text/html format.
* Basic access to the Open Calais API's Enlighten action.
* Output is RDF representation of input document.
* Single function ability to extract names, entities and geographies from given text.

## Synopsis ##

This is a very basic wrapper to the Open Calais API. It uses the POST endpoint and currently supports the Enlighten action. Here's a simple call:

    Calais.enlighten(
      :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
      :content_type => :text,
      :license_id => 'your license id'
    )

This is the easiest way to get the RDF-formatted response from the OpenCalais service.

If you want to do something more fun like getting all sorts of fun information about a document, you can try this:

    Calais.process_document(
      :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
      :content_type => :text,
      :license_id => 'your license id'
    )

This will return an object containing information extracted from the RDF response.

## Requirements ##

* [Ruby 1.8.5 or better](http://ruby-lang.org)
* [libxml-ruby](http://libxml.rubyforge.org/), [libxml2](http://xmlsoft.org/)
* [curb](http://curb.rubyforge.org/), [libcurl](http://curl.haxx.se/)
* [json](http://json.rubyforge.org/)

## Install ##

You can install the Calais gem via Rubygems (`gem install calais`) or by building from source.

## Authors ##

* [Abhay Kumar](http://opensynapse.net)

## Acknowledgements ##

* [Paul Legato](http://www.economaton.com/): Help all around with the new response processor and implementation of the 3.1 API.
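The README says `Calais.process_document` returns an object holding the extracted information. As a rough sketch of how its entities might be consumed, the snippet below groups entity names by type. It does not call the service: `Entity` is a stand-in Struct that mirrors only the `type`, `attributes` and `relevance` accessors of `Calais::Response::Entity`, and `names_by_type` is a hypothetical helper, not part of the gem.

```ruby
# Stand-in for Calais::Response::Entity (assumption: only the accessors used here).
Entity = Struct.new(:type, :attributes, :relevance)

# Hypothetical helper: group entity names by type, most relevant entities first.
def names_by_type(entities)
  entities
    .sort_by { |e| -(e.relevance || 0.0) }
    .group_by(&:type)
    .transform_values { |group| group.map { |e| e.attributes['name'] } }
end

# Made-up sample data, shaped like the entities a response would expose.
entities = [
  Entity.new('Country', { 'name' => 'Australia' }, 0.7),
  Entity.new('City', { 'name' => 'Hobart' }, 0.2),
  Entity.new('Country', { 'name' => 'United Kingdom' }, 0.9)
]

names_by_type(entities).each { |type, names| puts "#{type}: #{names.join(', ')}" }
```

With a real response, `response.entities` would take the place of the hand-built array.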
data/Rakefile
ADDED
@@ -0,0 +1,97 @@
# -*- ruby -*-

require 'rake'
require 'rake/clean'

require './lib/calais.rb'

begin
  gem 'jeweler', '>= 1.0.1'
  require 'jeweler'

  Jeweler::Tasks.new do |s|
    s.name = 'calais'
    s.summary = 'A Ruby interface to the Calais Web Service'
    s.email = 'info@opensynapse.net'
    s.homepage = 'http://github.com/abhay/calais'
    s.description = 'A Ruby interface to the Calais Web Service'
    s.authors = ['Abhay Kumar']
    s.files = FileList["[A-Z]*", "{bin,generators,lib,test}/**/*"]
    s.rubyforge_project = 'calais'
    s.add_dependency 'libxml-ruby', '>= 0.5.4'
    s.add_dependency 'json', '>= 1.1.3'
    s.add_dependency 'curb', '>= 0.1.4'
  end
rescue LoadError
  puts "Jeweler, or one of its dependencies, is not available. Please install it."
  exit(1)
end

begin
  require 'spec/rake/spectask'

  desc "Run all specs"
  Spec::Rake::SpecTask.new do |t|
    t.spec_files = FileList["spec/**/*_spec.rb"].sort
    t.spec_opts = ["--options", "spec/spec.opts"]
  end

  desc "Run all specs and get coverage statistics"
  Spec::Rake::SpecTask.new('coverage') do |t|
    t.spec_opts = ["--options", "spec/spec.opts"]
    t.spec_files = FileList["spec/*_spec.rb"].sort
    t.rcov_opts = ["--exclude", "spec", "--exclude", "gems"]
    t.rcov = true
  end

  task :default => :spec
rescue LoadError
  puts "RSpec, or one of its dependencies, is not available. Please install it."
  exit(1)
end

begin
  require 'yard'
  require 'yard/rake/yardoc_task'

  YARD::Rake::YardocTask.new do |t|
    t.options = ["--verbose", "--markup=markdown", "--files=CHANGELOG.markdown,MIT-LICENSE"]
  end

  task :rdoc => :yardoc

  CLOBBER.include 'doc'
  CLOBBER.include '.yardoc'
rescue LoadError
  puts "Yard, or one of its dependencies is not available. Please install it."
  exit(1)
end

begin
  require 'rake/contrib/sshpublisher'
  namespace :rubyforge do

    desc "Release gem and RDoc documentation to RubyForge"
    task :release => ["rubyforge:release:gem", "rubyforge:release:docs"]

    namespace :release do
      desc "Publish RDoc to RubyForge."
      task :docs => [:yardoc] do
        config = YAML.load(
          File.read(File.expand_path('~/.rubyforge/user-config.yml'))
        )

        host = "#{config['username']}@rubyforge.org"
        remote_dir = "/var/www/gforge-projects/calais/"
        local_dir = 'doc'

        Rake::SshDirPublisher.new(host, remote_dir, local_dir).upload
      end
    end
  end
rescue LoadError
  puts "Rake SshDirPublisher is unavailable or your rubyforge environment is not configured."
  exit(1)
end

# vim: syntax=Ruby
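Each `begin ... rescue LoadError` block in the Rakefile guards an optional tool and calls `exit(1)` when it is missing, so one absent library makes every task unavailable. A minimal sketch of a softer variant follows; `optional_require` is a hypothetical helper, not something the Rakefile defines, and it only reports whether the library loaded so the caller can skip the dependent tasks.

```ruby
# Hypothetical helper: try to load an optional library and report whether it worked.
def optional_require(name)
  require name
  true
rescue LoadError
  warn "#{name} is not available; skipping the tasks that depend on it."
  false
end

# 'set' ships with Ruby, so this branch runs.
puts 'set loaded' if optional_require('set')

# A library that is certainly absent only produces a warning.
optional_require('surely_not_an_installed_library_xyz')
```

In a Rakefile, the task definitions for a tool would sit inside `if optional_require('yard') ... end` instead of after a hard exit.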
data/VERSION.yml
ADDED
data/lib/calais.rb
ADDED
@@ -0,0 +1,56 @@
require 'digest/sha1'
require 'net/http'
require 'cgi'
require 'iconv'
require 'set'

require 'rubygems'
require 'xml/libxml'
require 'json'
require 'curb'

$KCODE = "UTF8"
require 'jcode'

$:.unshift File.expand_path(File.dirname(__FILE__)) + '/calais'

require 'client'
require 'response'
require 'error'

module Calais
  REST_ENDPOINT = "http://api.opencalais.com/enlighten/rest/"
  BETA_REST_ENDPOINT = "http://beta.opencalais.com/enlighten/rest/"

  AVAILABLE_CONTENT_TYPES = {
    :xml => 'text/xml',
    :text => 'text/txt',
    :html => 'text/html',
    :raw => 'text/raw'
  }

  AVAILABLE_OUTPUT_FORMATS = {
    :rdf => 'xml/rdf',
    :simple => 'text/simple',
    :microformats => 'text/microformats',
    :json => 'application/json'
  }

  KNOWN_ENABLES = ['GenericRelations']
  KNOWN_DISCARDS = ['er/Company', 'er/Geo']

  MAX_RETRIES = 5
  HTTP_TIMEOUT = 60
  MIN_CONTENT_SIZE = 1
  MAX_CONTENT_SIZE = 100_000

  class << self
    def enlighten(*args, &block); Client.new(*args, &block).enlighten; end

    def process_document(*args, &block)
      client = Client.new(*args, &block)
      client.output_format = :rdf
      Response.new(client.enlighten)
    end
  end
end
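`lib/calais.rb` loads `iconv` and `jcode` and sets `$KCODE`, which ties it to Ruby 1.8: `jcode` was dropped in Ruby 1.9 and `iconv` left the standard library in Ruby 2.0. The gem uses Iconv with `UTF-8//IGNORE` to drop invalid bytes from the submitted content. A minimal sketch of the same idea on Ruby 2.1 or later, using `String#scrub`, could look like this; `sanitize_utf8` is a hypothetical name and this is not how the gem itself does it.

```ruby
# Hypothetical helper: drop bytes that are not valid UTF-8,
# similar in spirit to Iconv's 'UTF-8//IGNORE' target.
def sanitize_utf8(text)
  text.to_s.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# A binary string holding valid UTF-8 plus one invalid byte (0xFF).
dirty = "caf\xC3\xA9 \xFFbar".b

clean = sanitize_utf8(dirty)
puts clean
puts clean.valid_encoding?
```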
data/lib/calais/client.rb
ADDED
@@ -0,0 +1,110 @@
module Calais
  class Client
    # base attributes of the call
    attr_accessor :content
    attr_accessor :license_id

    # processing directives
    attr_accessor :content_type, :output_format, :reltag_base_url, :calculate_relevance, :omit_outputting_original_text
    attr_accessor :metadata_enables, :metadata_discards

    # user directives
    attr_accessor :allow_distribution, :allow_search, :external_id, :submitter

    attr_accessor :external_metadata

    attr_accessor :use_beta

    def initialize(options={}, &block)
      options.each {|k,v| send("#{k}=", v)}
      yield(self) if block_given?
    end

    def enlighten
      post_args = {
        "licenseID" => @license_id,
        "content" => Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "#{@content} ").first[0..-2],
        "paramsXML" => params_xml
      }

      @client ||= Curl::Easy.new
      @client.url = @use_beta ? BETA_REST_ENDPOINT : REST_ENDPOINT
      @client.timeout = HTTP_TIMEOUT

      post_fields = post_args.map {|k,v| Curl::PostField.content(k, v) }

      do_request(post_fields)
    end

    def params_xml
      check_params

      params_node = XML::Node.new('c:params')
      params_node['xmlns:c'] = 'http://s.opencalais.com/1/pred/'
      params_node['xmlns:rdf'] = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

      processing_node = XML::Node.new('c:processingDirectives')
      processing_node['c:contentType'] = AVAILABLE_CONTENT_TYPES[@content_type] if @content_type
      processing_node['c:outputFormat'] = AVAILABLE_OUTPUT_FORMATS[@output_format] if @output_format
      processing_node['c:reltagBaseURL'] = @reltag_base_url.to_s if @reltag_base_url

      processing_node['c:enableMetadataType'] = @metadata_enables.join(';') unless @metadata_enables.empty?
      processing_node['c:discardMetadata'] = @metadata_discards.join(';') unless @metadata_discards.empty?
      processing_node['c:omitOutputtingOriginalText'] = 'true' if @omit_outputting_original_text

      user_node = XML::Node.new('c:userDirectives')
      user_node['c:allowDistribution'] = @allow_distribution.to_s unless @allow_distribution.nil?
      user_node['c:allowSearch'] = @allow_search.to_s unless @allow_search.nil?
      user_node['c:externalID'] = @external_id.to_s if @external_id
      user_node['c:submitter'] = @submitter.to_s if @submitter

      params_node << processing_node
      params_node << user_node

      if @external_metadata
        external_node = XML::Node.new('c:externalMetadata')
        external_node << @external_metadata
        params_node << external_node
      end

      params_node.to_s
    end

    private
      def check_params
        raise 'missing content' if @content.nil? || @content.empty?

        content_length = @content.length
        raise 'content is too small' if content_length < MIN_CONTENT_SIZE
        raise 'content is too large' if content_length > MAX_CONTENT_SIZE

        raise 'missing license id' if @license_id.nil? || @license_id.empty?

        raise 'unknown content type' unless AVAILABLE_CONTENT_TYPES.keys.include?(@content_type) if @content_type
        raise 'unknown output format' unless AVAILABLE_OUTPUT_FORMATS.keys.include?(@output_format) if @output_format

        %w[calculate_relevance allow_distribution allow_search].each do |variable|
          value = self.send(variable)
          unless NilClass === value || TrueClass === value || FalseClass === value
            raise "expected a boolean value for #{variable} but got #{value}"
          end
        end

        @metadata_enables ||= []
        unknown_enables = Set.new(@metadata_enables) - KNOWN_ENABLES
        raise "unknown metadata enables: #{unknown_enables.to_a.inspect}" unless unknown_enables.empty?

        @metadata_discards ||= []
        unknown_discards = Set.new(@metadata_discards) - KNOWN_DISCARDS
        raise "unknown metadata discards: #{unknown_discards.to_a.inspect}" unless unknown_discards.empty?
      end

      def do_request(post_fields)
        unless @client.http_post(post_fields)
          raise 'unable to post to api endpoint'
        end

        @client.body_str
      end
  end
end
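`check_params` in the client does two kinds of validation: it requires the flag attributes to be `true`, `false` or `nil`, and it rejects metadata names that are not on a known list by taking a set difference. The sketch below isolates both checks as plain functions so they can be tried without libxml or curb. `boolean_or_nil?` and `unknown_values` are hypothetical names, not methods of the gem; `KNOWN_ENABLES` is copied from `lib/calais.rb`.

```ruby
require 'set'

KNOWN_ENABLES = ['GenericRelations'].freeze # copied from lib/calais.rb

# Hypothetical stand-alone version of the boolean check in check_params.
def boolean_or_nil?(value)
  value.nil? || value == true || value == false
end

# Hypothetical stand-alone version of the allow-list check:
# returns the entries of `given` that are not in `known`.
def unknown_values(given, known)
  (Set.new(given || []) - known).to_a
end

puts boolean_or_nil?('true').inspect
puts unknown_values(%w[GenericRelations Foo], KNOWN_ENABLES).inspect
```

Returning the unknown entries as an Array keeps the error message readable when it is interpolated with `inspect`.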
data/lib/calais/error.rb
ADDED
data/lib/calais/response.rb
ADDED
@@ -0,0 +1,201 @@
module Calais
  class Response
    MATCHERS = {
      :docinfo => 'DocInfo',
      :docinfometa => 'DocInfoMeta',
      :defaultlangid => 'DefaultLangId',
      :doccat => 'DocCat',
      :entities => 'type/em/e',
      :relations => 'type/em/r',
      :geographies => 'type/er',
      :instances => 'type/sys/InstanceInfo',
      :relevances => 'type/sys/RelevanceInfo',
    }

    attr_accessor :submitter_code, :signature, :language, :submission_date, :request_id, :doc_title, :doc_date
    attr_accessor :hashes, :entities, :relations, :geographies, :categories

    def initialize(rdf_string)
      @raw_response = rdf_string

      @hashes = []
      @entities = []
      @relations = []
      @geographies = []
      @relevances = {} # key = String hash, val = Float relevance
      @categories = []

      extract_data
    end

    class Entity
      attr_accessor :calais_hash, :type, :attributes, :relevance, :instances
    end

    class Relation
      attr_accessor :calais_hash, :type, :attributes, :instances
    end

    class Geography
      attr_accessor :name, :calais_hash, :attributes
    end

    class Category
      attr_accessor :name, :score
    end

    class Instance
      attr_accessor :prefix, :exact, :suffix, :offset, :length

      # Makes a new Instance object from an appropriate LibXML::XML::Node.
      def self.from_node(node)
        instance = self.new
        instance.prefix = node.find_first("c:prefix").content
        instance.exact = node.find_first("c:exact").content
        instance.suffix = node.find_first("c:suffix").content
        instance.offset = node.find_first("c:offset").content.to_i
        instance.length = node.find_first("c:length").content.to_i

        instance
      end
    end

    class CalaisHash
      attr_accessor :value

      def self.find_or_create(hash, hashes)
        if !selected = hashes.select {|h| h.value == hash }.first
          selected = self.new
          selected.value = hash
          hashes << selected
        end

        selected
      end
    end

    private
      def extract_data
        doc = XML::Parser.string(@raw_response).parse

        if doc.root.find("/Error").first
          raise Calais::Error, doc.root.find("/Error/Exception").first.content
        end

        doc.root.find("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfometa]}')]/..").each do |node|
          @language = node['language']
          @submission_date = DateTime.parse node['submissionDate']

          attributes = extract_attributes(node.find("*[contains(name(), 'c:')]"))

          @signature = attributes.delete('signature')
          @submitter_code = attributes.delete('submitterCode')

          node.remove!
        end

        doc.root.find("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfo]}')]/..").each do |node|
          @request_id = node['calaisRequestID']

          attributes = extract_attributes(node.find("*[contains(name(), 'c:')]"))

          @doc_title = attributes.delete('docTitle')
          @doc_date = Date.parse attributes.delete('docDate')

          node.remove!
        end

        @categories = doc.root.find("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:doccat]}')]/..").map do |node|
          category = Category.new
          category.name = node.find_first("c:categoryName").content
          score = node.find_first("c:score")
          category.score = score.content.to_f unless score.nil?

          node.remove!
          category
        end

        @relevances = doc.root.find("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relevances]}')]/..").inject({}) do |acc, node|
          subject_hash = node.find_first("c:subject")[:resource].split('/')[-1]
          acc[subject_hash] = node.find_first("c:relevance").content.to_f

          node.remove!
          acc
        end

        @entities = doc.root.find("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:entities]}')]/..").map do |node|
          extracted_hash = node['about'].split('/')[-1] rescue nil

          entity = Entity.new
          entity.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
          entity.type = extract_type(node)
          entity.attributes = extract_attributes(node.find("*[contains(name(), 'c:')]"))

          entity.relevance = @relevances[extracted_hash]
          entity.instances = extract_instances(doc, extracted_hash)

          node.remove!
          entity
        end

        @relations = doc.root.find("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relations]}')]/..").map do |node|
          extracted_hash = node['about'].split('/')[-1] rescue nil

          relation = Relation.new
          relation.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
          relation.type = extract_type(node)
          relation.attributes = extract_attributes(node.find("*[contains(name(), 'c:')]"))
          relation.instances = extract_instances(doc, extracted_hash)

          node.remove!
          relation
        end

        @geographies = doc.root.find("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:geographies]}')]/..").map do |node|
          attributes = extract_attributes(node.find("*[contains(name(), 'c:')]"))

          geography = Geography.new
          geography.name = attributes.delete('name')
          geography.calais_hash = attributes.delete('subject')
          geography.attributes = attributes

          node.remove!
          geography
        end

        doc.root.find("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:defaultlangid]}')]/..").each { |node| node.remove! }
        doc.root.find("./*").each { |node| node.remove! }

        return
      end

      def extract_instances(doc, hash)
        doc.root.find("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:instances]}')]/..").select do |instance_node|
          instance_node.find_first("c:subject")[:resource].split("/")[-1] == hash
        end.map do |instance_node|
          instance = Instance.from_node(instance_node)
          instance_node.remove!

          instance
        end
      end

      def extract_type(node)
        node.find("*[name()='rdf:type']")[0]['resource'].split('/')[-1]
      rescue
        nil
      end

      def extract_attributes(nodes)
        nodes.inject({}) do |hsh, node|
          value = if node['resource']
            extracted_hash = node['resource'].split('/')[-1] rescue nil
            CalaisHash.find_or_create(extracted_hash, @hashes)
          else
            node.content
          end
          hsh.merge(node.name => value)
        end
      end
  end
end
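`CalaisHash.find_or_create` keeps one object per hash value in a shared list, so an entity and any attribute that points at it end up holding the same `CalaisHash` instance. The sketch below shows that pattern on its own. `HashRef` is a hypothetical stand-in class, and it uses `find` with an early return instead of the original `select ... first` with assignment inside the condition; the behavior is intended to be the same.

```ruby
# Hypothetical stand-in for Calais::Response::CalaisHash.
class HashRef
  attr_accessor :value

  # Return the registered object for `value`, creating and registering it if needed.
  def self.find_or_create(value, registry)
    found = registry.find { |ref| ref.value == value }
    return found if found

    created = new
    created.value = value
    registry << created
    created
  end
end

registry = []
a = HashRef.find_or_create('8f3936d9', registry)
b = HashRef.find_or_create('8f3936d9', registry)

puts a.equal?(b)   # same object, not just an equal value
puts registry.size
```

The lookup is a linear scan, which is fine for the handful of hashes in one response; a Hash keyed by value would be the usual choice for larger registries.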
data/spec/calais/client_spec.rb
ADDED
@@ -0,0 +1,79 @@
require File.join(File.dirname(__FILE__), %w[.. helper])

describe Calais::Client, :new do
  it 'accepts arguments as a hash' do
    client = nil

    lambda { client = Calais::Client.new(:content => SAMPLE_DOCUMENT, :license_id => LICENSE_ID) }.should_not raise_error

    client.license_id.should == LICENSE_ID
    client.content.should == SAMPLE_DOCUMENT
  end

  it 'accepts arguments as a block' do
    client = nil

    lambda {
      client = Calais::Client.new do |c|
        c.content = SAMPLE_DOCUMENT
        c.license_id = LICENSE_ID
      end
    }.should_not raise_error

    client.license_id.should == LICENSE_ID
    client.content.should == SAMPLE_DOCUMENT
  end

  it 'should not accept unknown attributes' do
    lambda { Calais::Client.new(:monkey => 'monkey', :license_id => LICENSE_ID) }.should raise_error(NoMethodError)
  end
end

describe Calais::Client, :params_xml do
  it 'returns an xml encoded string' do
    client = Calais::Client.new(:content => SAMPLE_DOCUMENT, :license_id => LICENSE_ID)
    client.params_xml.should == %[<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n <c:processingDirectives/>\n <c:userDirectives/>\n</c:params>]

    client.content_type = :xml
    client.output_format = :json
    client.reltag_base_url = 'http://opencalais.com'
    client.calculate_relevance = true
    client.metadata_enables = Calais::KNOWN_ENABLES
    client.metadata_discards = Calais::KNOWN_DISCARDS
    client.allow_distribution = true
    client.allow_search = true
    client.external_id = Digest::SHA1.hexdigest(client.content)
    client.submitter = 'calais.rb'

    client.params_xml.should == %[<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n <c:processingDirectives c:contentType="text/xml" c:outputFormat="application/json" c:reltagBaseURL="http://opencalais.com" c:enableMetadataType="GenericRelations" c:discardMetadata="er/Company;er/Geo"/>\n <c:userDirectives c:allowDistribution="true" c:allowSearch="true" c:externalID="1a008b91e7d21962e132bc1d6cb252532116a606" c:submitter="calais.rb"/>\n</c:params>]
  end
end

describe Calais::Client, :enlighten do
  before do
    @client = Calais::Client.new do |c|
      c.content = SAMPLE_DOCUMENT
      c.license_id = LICENSE_ID
      c.content_type = :xml
      c.output_format = :json
      c.calculate_relevance = true
      c.metadata_enables = Calais::KNOWN_ENABLES
      c.allow_distribution = true
      c.allow_search = true
    end
  end

  it 'provides access to the enlighten command on the generic rest endpoint' do
    @client.should_receive(:do_request).with(anything).and_return(SAMPLE_RESPONSE)
    @client.enlighten
    @client.instance_variable_get(:@client).url.should == Calais::REST_ENDPOINT
  end

  it 'provides access to the enlighten command on the beta rest endpoint' do
    @client.use_beta = true

    @client.should_receive(:do_request).with(anything).and_return(SAMPLE_RESPONSE)
    @client.enlighten
    @client.instance_variable_get(:@client).url.should == Calais::BETA_REST_ENDPOINT
  end
end
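The first three specs describe how `Calais::Client.new` takes its settings: as a hash, as a block, and with a `NoMethodError` for unknown keys. All three behaviors follow from one idiom, sending `"#{key}="` for each option and then yielding `self`. The sketch below reproduces it with a stand-in class so it runs without the gem's native dependencies; `SketchClient` is hypothetical and carries only two of the real accessors.

```ruby
# Hypothetical stand-in for Calais::Client with two of its accessors.
class SketchClient
  attr_accessor :content, :license_id

  def initialize(options = {})
    # Each key is turned into a setter call; a key without a setter raises NoMethodError.
    options.each { |key, value| send("#{key}=", value) }
    yield(self) if block_given?
  end
end

from_hash  = SketchClient.new(content: 'some text', license_id: 'abc')
from_block = SketchClient.new { |c| c.content = 'some text' }

begin
  SketchClient.new(monkey: 'monkey')
rescue NoMethodError => e
  puts "rejected unknown option: #{e.name}"
end
```

Because the check is just "does a setter exist", adding a new option to the real client only needs a new `attr_accessor`.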
data/spec/calais/response_spec.rb
ADDED
@@ -0,0 +1,128 @@
require File.join(File.dirname(__FILE__), %w[.. helper])

describe Calais::Response, :new do
  it 'accepts an rdf string to generate the response object' do
    lambda { Calais::Response.new(SAMPLE_RESPONSE) }.should_not raise_error
  end
end

describe Calais::Response, :new do
  it "should return error message in runtime error" do
    lambda {
      @response = Calais::Response.new(RESPONSE_WITH_EXCEPTION)
    }.should raise_error(Calais::Error, "My Error Message")
  end
end

describe Calais::Response, :new do
  before :all do
    @response = Calais::Response.new(SAMPLE_RESPONSE)
  end

  it 'should extract document information' do
    @response.language.should == 'English'
    @response.submission_date.should be_a_kind_of(DateTime)
    @response.signature.should be_a_kind_of(String)
    @response.submitter_code.should be_a_kind_of(String)
    @response.request_id.should be_a_kind_of(String)
    @response.doc_title.should == 'Record number of bicycles sold in Australia in 2006'
    @response.doc_date.should be_a_kind_of(Date)
  end

  it 'should extract entities' do
    entities = @response.entities
    entities.map { |e| e.type }.sort.uniq.should == %w[City Continent Country IndustryTerm Organization Person Position ProvinceOrState]
  end

  it 'should extract relations' do
    relations = @response.relations
    relations.map { |e| e.type }.sort.uniq.should == %w[GenericRelations PersonAttributes PersonCareer Quotation]
  end

  it 'should extract geographies' do
    geographies = @response.geographies
    geographies.map { |e| e.name }.sort.uniq.should == %w[Australia Hobart,Tasmania,Australia Tasmania,Australia]
  end

  it 'should extract relevances' do
    @response.instance_variable_get(:@relevances).should be_a_kind_of(Hash)
  end

  it 'should assign a floating-point relevance to each entity' do
    @response.entities.each {|e| e.relevance.should be_a_kind_of(Float) }
  end

  it 'should find the correct document categories returned by OpenCalais' do
    @response.categories.map {|c| c.name }.sort.should == %w[Business_Finance Technology_Internet]
  end

  it 'should find the correct document category scores returned by OpenCalais' do
    @response.categories.map {|c| c.score.should be_a_kind_of(Float) }
  end

  it "should not raise an error if no score is given by OpenCalais" do
    lambda {Calais::Response.new(SAMPLE_RESPONSE_WITH_NO_SCORE)}.should_not raise_error
  end

  it "should not raise an error if no score is given by OpenCalais" do
    response = Calais::Response.new(SAMPLE_RESPONSE_WITH_NO_SCORE)
    response.categories.map {|c| c.score }.should == [nil]
  end

  it 'should find instances for each entity' do
    @response.entities.each {|e|
      e.instances.size.should > 0
    }
  end


  it 'should find instances for each relation' do
    @response.relations.each {|r|
      r.instances.size.should > 0
    }
  end

  it 'should find the correct instances for each entity' do
    ## This currently tests only for the "Australia" entity's
    ## instances. A more thorough test that tests for the instances
    ## of each of the many entities in the sample doc is desirable in
    ## the future.

    australia = @response.entities.select {|e| e.attributes["name"] == "Australia" }.first
    australia.instances.size.should == 3
    instances = australia.instances.sort{|a,b| a.offset <=> b.offset }

    instances[0].prefix.should == "number of bicycles sold in "
    instances[0].exact.should == "Australia"
    instances[0].suffix.should == " in 2006<\/title>\n<date>January 4,"
    instances[0].offset.should == 67
    instances[0].length.should == 9

    instances[1].prefix.should == "4, 2007<\/date>\n<body>\nBicycle sales in "
    instances[1].exact.should == "Australia"
    instances[1].suffix.should == " have recorded record sales of 1,273,781 units"
    instances[1].offset.should == 146
    instances[1].length.should == 9

    instances[2].prefix.should == " the traditional company car,\" he said.\n\n\"Some of "
    instances[2].exact.should == "Australia"
    instances[2].suffix.should == "'s biggest corporations now have bicycle fleets,"
    instances[2].offset.should == 952
    instances[2].length.should == 9
  end

  it 'should find the correct instances for each relation' do
    ## This currently tests only for one relation's instances. A more
    ## thorough test that tests for the instances of each of the many other
    ## relations in the sample doc is desirable in the future.

    rel = @response.relations.select {|e| e.calais_hash.value == "8f3936d9-cf6b-37fc-ae0d-a145959ae3b5" }.first
    rel.instances.size.should == 1

    rel.instances.first.prefix.should == " manufacturers.\n\nThe Cycling Promotion Fund (CPF) "
    rel.instances.first.exact.should == "spokesman Ian Christie said Australians were increasingly using bicycles as an alternative to cars."
    rel.instances.first.suffix.should == " Sales rose nine percent in 2006 while the car"
    rel.instances.first.offset.should == 425
    rel.instances.first.length.should == 99
  end
end
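The instance specs check `prefix`, `exact`, `suffix`, `offset` and `length` for each mention. A plausible reading, which is an assumption and not something this listing confirms, is that `offset` and `length` are character positions in the submitted document, so slicing the document with them gives back `exact`. The sketch below illustrates that reading with a made-up document; `Instance` is a stand-in Struct and `located?` is a hypothetical helper.

```ruby
# Stand-in for Calais::Response::Instance (same five attributes).
Instance = Struct.new(:prefix, :exact, :suffix, :offset, :length)

# Hypothetical helper: does the offset/length pair point at the exact text?
# Assumes offsets count characters from the start of the submitted document.
def located?(document, instance)
  document[instance.offset, instance.length] == instance.exact
end

# Made-up document; the offset is computed here rather than taken from the service.
document = 'Bicycle sales in Australia have recorded record sales.'
instance = Instance.new(
  'Bicycle sales in ', 'Australia', ' have recorded',
  document.index('Australia'), 'Australia'.length
)

puts located?(document, instance)
```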
data/spec/helper.rb
ADDED
@@ -0,0 +1,12 @@
require 'rubygems'
require 'spec'
require 'yaml'

require File.dirname(__FILE__) + '/../lib/calais'

FIXTURES_DIR = File.join File.dirname(__FILE__), %[fixtures]
SAMPLE_DOCUMENT = File.read(File.join(FIXTURES_DIR, %[bicycles_australia.xml]))
SAMPLE_RESPONSE = File.read(File.join(FIXTURES_DIR, %[bicycles_australia.response.rdf]))
SAMPLE_RESPONSE_WITH_NO_SCORE = File.read(File.join(FIXTURES_DIR, %[twitter_tweet_without_score.response.rdf]))
RESPONSE_WITH_EXCEPTION = File.read(File.join(FIXTURES_DIR, %[error.response.xml]))
LICENSE_ID = YAML.load(File.read(File.join(FIXTURES_DIR, %[calais.yml])))['key']
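The helper reads `LICENSE_ID` from `spec/fixtures/calais.yml` by loading the YAML and taking its `'key'` entry. The fixture itself is not part of this listing, so the sketch below assumes the simplest file that would satisfy that lookup, a single top-level `key:` mapping, and parses it from a string instead of from disk.

```ruby
require 'yaml'

# Assumed shape of spec/fixtures/calais.yml; the real file is not shown in this listing.
fixture = <<~CONFIG
  key: your-license-id
CONFIG

# Same lookup as the helper, using safe_load since only plain data is expected.
license_id = YAML.safe_load(fixture)['key']
puts license_id
```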
metadata
ADDED
@@ -0,0 +1,92 @@
--- !ruby/object:Gem::Specification
name: abhay-calais
version: !ruby/object:Gem::Version
  version: 0.0.7
platform: ruby
authors:
- Abhay Kumar
autorequire:
bindir: bin
cert_chain: []

date: 2009-06-08 00:00:00 -07:00
default_executable:
dependencies:
- !ruby/object:Gem::Dependency
  name: libxml-ruby
  type: :runtime
  version_requirement:
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: 0.5.4
    version:
- !ruby/object:Gem::Dependency
  name: json
  type: :runtime
  version_requirement:
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: 1.1.3
    version:
- !ruby/object:Gem::Dependency
  name: curb
  type: :runtime
  version_requirement:
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: 0.1.4
    version:
description: A Ruby interface to the Calais Web Service
email: info@opensynapse.net
executables: []

extensions: []

extra_rdoc_files:
- README.markdown
files:
- CHANGELOG.markdown
- MIT-LICENSE
- README.markdown
- Rakefile
- VERSION.yml
- lib/calais.rb
- lib/calais/client.rb
- lib/calais/error.rb
- lib/calais/response.rb
has_rdoc: true
homepage: http://github.com/abhay/calais
post_install_message:
rdoc_options:
- --charset=UTF-8
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: "0"
  version:
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: "0"
  version:
requirements: []

rubyforge_project: calais
rubygems_version: 1.2.0
signing_key:
specification_version: 2
summary: A Ruby interface to the Calais Web Service
test_files:
- spec/helper.rb
- spec/calais/response_spec.rb
- spec/calais/client_spec.rb
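The metadata declares each runtime dependency as a `>=` requirement on a minimum version. A small sketch of how RubyGems evaluates such a constraint, using `Gem::Requirement` and `Gem::Version` from RubyGems itself, follows. The constraint string is the `libxml-ruby` one from the metadata; the two candidate versions are arbitrary examples.

```ruby
require 'rubygems'

# libxml-ruby constraint taken from the gem metadata.
requirement = Gem::Requirement.new('>= 0.5.4')

# Candidate versions chosen only for illustration.
newer = Gem::Version.new('0.9.2')
older = Gem::Version.new('0.5.3')

puts requirement.satisfied_by?(newer)
puts requirement.satisfied_by?(older)
```

A bare `>=` constraint has no upper bound, so any later release, including one with breaking changes, would satisfy it.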