documentcloud-calais 0.0.7

data/CHANGELOG.markdown ADDED
@@ -0,0 +1,33 @@
1
+ # Changes
2
+
3
+ ## 0.0.7
4
+ * verified 4.0 API
5
+ * moved gem packaging to `jeweler` and documentation to `yard`
6
+
7
+ ## 0.0.6
8
+ * fully implemented 3.1 API
9
+
10
+ ## 0.0.5
11
+ * fixed error where classes weren't being required in the proper order on Ubuntu (reported by Jon Moses)
12
+ * New things coming back from the API. Fixing in tests.
13
+
14
+ ## 0.0.4
15
+ * changed dependency from `hpricot` to `libxml`
16
+ * unicode fun
17
+ * cleanup all around
18
+
19
+ ## 0.0.3
20
+ * pluginized the library for Rails (thanks [pius](http://gitorious.org/projects/calais-au-rails))
21
+ * added helper methods to name entity types from a response
22
+
23
+ ## 0.0.2
24
+ * cleanup in the specs
25
+ * cleaner parsing
26
+ * location of named entities
27
+ * more data in relationships
28
+ * moved Names and Relationships
29
+
30
+ ## 0.0.1
31
+ * Access to OpenCalais's Enlighten action
32
+ * Single method to process a document
33
+ * Get relationships and names from a document
data/MIT-LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2008 Abhay Kumar info@opensynapse.net
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ 'Software'), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
17
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
18
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
19
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
20
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.markdown ADDED
@@ -0,0 +1,49 @@
1
+ # Calais #
2
+ A Ruby interface to the [Open Calais Web Service](http://opencalais.com)
3
+
4
+ ## Features ##
5
+ * Accepts documents in text/plain, text/xml and text/html format.
6
+ * Basic access to the Open Calais API's Enlighten action.
7
+ * Output is RDF representation of input document.
8
+ * A single method call extracts names, entities and geographies from the given text.
9
+
10
+ ## Synopsis ##
11
+
12
+ This is a very basic wrapper to the Open Calais API. It uses the POST endpoint and currently supports the Enlighten action. Here's a simple call:
13
+
14
+ Calais.enlighten(
15
+ :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
16
+ :content_type => :text,
17
+ :license_id => 'your license id'
18
+ )
19
+
20
+ This is the easiest way to get the RDF-formatted response from the OpenCalais service.
21
+
22
+ To extract structured information such as entities, relations and geographies from a document, use `process_document`:
23
+
24
+ Calais.process_document(
25
+ :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
26
+ :content_type => :text,
27
+ :license_id => 'your license id'
28
+ )
29
+
30
+ This will return an object containing information extracted from the RDF response.
31
+
32
+ ## Requirements ##
33
+
34
+ * [Ruby 1.8.5 or better](http://ruby-lang.org)
35
+ * [nokogiri](http://nokogiri.rubyforge.org/nokogiri/), [libxml2](http://xmlsoft.org/)
36
+ * [curb](http://curb.rubyforge.org/), [libcurl](http://curl.haxx.se/)
37
+ * [json](http://json.rubyforge.org/)
38
+
39
+ ## Install ##
40
+
41
+ You can install the Calais gem via Rubygems (`gem install calais`) or by building from source.
42
+
43
+ ## Authors ##
44
+
45
+ * [Abhay Kumar](http://opensynapse.net)
46
+
47
+ ## Acknowledgements ##
48
+
49
+ * [Paul Legato](http://www.economaton.com/): Help all around with the new response processor and implementation of the 3.1 API.
data/Rakefile ADDED
@@ -0,0 +1,97 @@
1
+ # -*- ruby -*-
2
+
3
+ require 'rake'
4
+ require 'rake/clean'
5
+
6
+ require './lib/calais.rb'
7
+
8
+ begin
9
+ gem 'jeweler', '>= 1.0.1'
10
+ require 'jeweler'
11
+
12
+ Jeweler::Tasks.new do |s|
13
+ s.name = 'calais'
14
+ s.summary = 'A Ruby interface to the Calais Web Service'
15
+ s.email = 'info@opensynapse.net'
16
+ s.homepage = 'http://github.com/abhay/calais'
17
+ s.description = 'A Ruby interface to the Calais Web Service'
18
+ s.authors = ['Abhay Kumar']
19
+ s.files = FileList["[A-Z]*", "{bin,generators,lib,test}/**/*"]
20
+ s.rubyforge_project = 'calais'
21
+ s.add_dependency 'nokogiri', '>= 1.3.3'
22
+ s.add_dependency 'json', '>= 1.1.3'
23
+ s.add_dependency 'curb', '>= 0.1.4'
24
+ end
25
+ rescue LoadError
26
+ puts "Jeweler, or one of its dependencies, is not available. Please install it."
27
+ exit(1)
28
+ end
29
+
30
+ begin
31
+ require 'spec/rake/spectask'
32
+
33
+ desc "Run all specs"
34
+ Spec::Rake::SpecTask.new do |t|
35
+ t.spec_files = FileList["spec/**/*_spec.rb"].sort
36
+ t.spec_opts = ["--options", "spec/spec.opts"]
37
+ end
38
+
39
+ desc "Run all specs and get coverage statistics"
40
+ Spec::Rake::SpecTask.new('coverage') do |t|
41
+ t.spec_opts = ["--options", "spec/spec.opts"]
42
+ t.spec_files = FileList["spec/*_spec.rb"].sort
43
+ t.rcov_opts = ["--exclude", "spec", "--exclude", "gems"]
44
+ t.rcov = true
45
+ end
46
+
47
+ task :default => :spec
48
+ rescue LoadError
49
+ puts "RSpec, or one of its dependencies, is not available. Please install it."
50
+ exit(1)
51
+ end
52
+
53
+ begin
54
+ require 'yard'
55
+ require 'yard/rake/yardoc_task'
56
+
57
+ YARD::Rake::YardocTask.new do |t|
58
+ t.options = ["--verbose", "--markup=markdown", "--files=CHANGELOG.markdown,MIT-LICENSE"]
59
+ end
60
+
61
+ task :rdoc => :yardoc
62
+
63
+ CLOBBER.include 'doc'
64
+ CLOBBER.include '.yardoc'
65
+ rescue LoadError
66
+ puts "Yard, or one of its dependencies, is not available. Please install it."
67
+ exit(1)
68
+ end
69
+
70
+ begin
71
+ require 'rake/contrib/sshpublisher'
72
+ namespace :rubyforge do
73
+
74
+ desc "Release gem and RDoc documentation to RubyForge"
75
+ task :release => ["rubyforge:release:gem", "rubyforge:release:docs"]
76
+
77
+ namespace :release do
78
+ desc "Publish RDoc to RubyForge."
79
+ task :docs => [:yardoc] do
80
+ config = YAML.load(
81
+ File.read(File.expand_path('~/.rubyforge/user-config.yml'))
82
+ )
83
+
84
+ host = "#{config['username']}@rubyforge.org"
85
+ remote_dir = "/var/www/gforge-projects/calais/"
86
+ local_dir = 'doc'
87
+
88
+ Rake::SshDirPublisher.new(host, remote_dir, local_dir).upload
89
+ end
90
+ end
91
+ end
92
+ rescue LoadError
93
+ puts "Rake SshDirPublisher is unavailable or your rubyforge environment is not configured."
94
+ exit(1)
95
+ end
96
+
97
+ # vim: syntax=Ruby
data/VERSION.yml ADDED
@@ -0,0 +1,4 @@
1
+ ---
2
+ :patch: 7
3
+ :major: 0
4
+ :minor: 0
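The version file stores the three components under symbol keys. As a minimal sketch, assuming only Ruby's standard `yaml` library, this is one way such a file could be turned into a version string. The `version_string` helper and the `VERSION_TEXT` constant are hypothetical and are not part of the gem.

```ruby
require 'yaml'

# Contents of VERSION.yml as shown above (keys are YAML-encoded Ruby symbols).
VERSION_TEXT = <<~EOS
  ---
  :patch: 7
  :major: 0
  :minor: 0
EOS

# Hypothetical helper: joins major, minor and patch into "major.minor.patch".
def version_string(yaml_text)
  # Symbol has to be permitted explicitly, otherwise safe_load rejects the keys.
  data = YAML.safe_load(yaml_text, permitted_classes: [Symbol])
  [data[:major], data[:minor], data[:patch]].join('.')
end

puts version_string(VERSION_TEXT)
```

The key order in the file does not matter, because each component is looked up by name.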
data/lib/calais.rb ADDED
@@ -0,0 +1,56 @@
1
+ require 'digest/sha1'
2
+ require 'net/http'
3
+ require 'cgi'
4
+ require 'iconv'
5
+ require 'set'
6
+
7
+ require 'rubygems'
8
+ require 'nokogiri'
9
+ require 'json'
10
+ require 'curb'
11
+
12
+ $KCODE = "UTF8"
13
+ require 'jcode'
14
+
15
+ $:.unshift File.expand_path(File.dirname(__FILE__)) + '/calais'
16
+
17
+ require 'client'
18
+ require 'response'
19
+ require 'error'
20
+
21
+ module Calais
22
+ REST_ENDPOINT = "http://api.opencalais.com/enlighten/rest/"
23
+ BETA_REST_ENDPOINT = "http://beta.opencalais.com/enlighten/rest/"
24
+
25
+ AVAILABLE_CONTENT_TYPES = {
26
+ :xml => 'text/xml',
27
+ :text => 'text/txt',
28
+ :html => 'text/html',
29
+ :raw => 'text/raw'
30
+ }
31
+
32
+ AVAILABLE_OUTPUT_FORMATS = {
33
+ :rdf => 'xml/rdf',
34
+ :simple => 'text/simple',
35
+ :microformats => 'text/microformats',
36
+ :json => 'application/json'
37
+ }
38
+
39
+ KNOWN_ENABLES = ['GenericRelations']
40
+ KNOWN_DISCARDS = ['er/Company', 'er/Geo']
41
+
42
+ MAX_RETRIES = 5
43
+ HTTP_TIMEOUT = 60
44
+ MIN_CONTENT_SIZE = 1
45
+ MAX_CONTENT_SIZE = 100_000
46
+
47
+ class << self
48
+ def enlighten(*args, &block); Client.new(*args, &block).enlighten; end
49
+
50
+ def process_document(*args, &block)
51
+ client = Client.new(*args, &block)
52
+ client.output_format = :rdf
53
+ Response.new(client.enlighten)
54
+ end
55
+ end
56
+ end
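The requires at the top of this file target Ruby 1.8: `iconv` was removed from the standard library in Ruby 2.0, and `jcode` and `$KCODE` belong to the 1.8 encoding model. The client in `lib/calais/client.rb` uses `Iconv.iconv('UTF-8//IGNORE', ...)` to drop invalid bytes before posting content. As a hedged sketch (not part of this release), the same clean-up could be written for newer Rubies with `String#scrub`, available since Ruby 2.1. The `sanitize_utf8` name is hypothetical.

```ruby
# Hypothetical helper: returns a copy of +text+ that is valid UTF-8,
# dropping any byte sequences that cannot be decoded.
def sanitize_utf8(text)
  # dup avoids mutating the caller's string; force_encoding only relabels the bytes,
  # and scrub('') then removes whatever is invalid under that label.
  text.dup.force_encoding(Encoding::UTF_8).scrub('')
end

dirty = "caf\xC3\xA9 \xFF ok".b   # binary string containing one invalid byte (0xFF)
clean = sanitize_utf8(dirty)

puts clean
puts clean.valid_encoding?
```

Unlike the `Iconv` call, this needs no padding character appended to the input and no extra dependency.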
data/lib/calais/client.rb ADDED
@@ -0,0 +1,111 @@
1
+ module Calais
2
+ class Client
3
+ # base attributes of the call
4
+ attr_accessor :content
5
+ attr_accessor :license_id
6
+
7
+ # processing directives
8
+ attr_accessor :content_type, :output_format, :reltag_base_url, :calculate_relevance, :omit_outputting_original_text
9
+ attr_accessor :metadata_enables, :metadata_discards
10
+
11
+ # user directives
12
+ attr_accessor :allow_distribution, :allow_search, :external_id, :submitter
13
+
14
+ attr_accessor :external_metadata
15
+
16
+ attr_accessor :use_beta
17
+
18
+ def initialize(options={}, &block)
19
+ options.each {|k,v| send("#{k}=", v)}
20
+ yield(self) if block_given?
21
+ end
22
+
23
+ def enlighten
24
+ post_args = {
25
+ "licenseID" => @license_id,
26
+ "content" => Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "#{@content} ").first[0..-2],
27
+ "paramsXML" => params_xml
28
+ }
29
+
30
+ @client ||= Curl::Easy.new
31
+ @client.url = @use_beta ? BETA_REST_ENDPOINT : REST_ENDPOINT
32
+ @client.timeout = HTTP_TIMEOUT
33
+
34
+ post_fields = post_args.map {|k,v| Curl::PostField.content(k, v) }
35
+
36
+ do_request(post_fields)
37
+ end
38
+
39
+ def params_xml
40
+ check_params
41
+ document = Nokogiri::XML::Document.new
42
+
43
+ params_node = Nokogiri::XML::Node.new('c:params', document)
44
+ params_node['xmlns:c'] = 'http://s.opencalais.com/1/pred/'
45
+ params_node['xmlns:rdf'] = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
46
+
47
+ processing_node = Nokogiri::XML::Node.new('c:processingDirectives', document)
48
+ processing_node['c:contentType'] = AVAILABLE_CONTENT_TYPES[@content_type] if @content_type
49
+ processing_node['c:outputFormat'] = AVAILABLE_OUTPUT_FORMATS[@output_format] if @output_format
50
+ processing_node['c:reltagBaseURL'] = @reltag_base_url.to_s if @reltag_base_url
51
+
52
+ processing_node['c:enableMetadataType'] = @metadata_enables.join(';') unless @metadata_enables.empty?
53
+ processing_node['c:discardMetadata'] = @metadata_discards.join(';') unless @metadata_discards.empty?
54
+ processing_node['c:omitOutputtingOriginalText'] = 'true' if @omit_outputting_original_text
55
+
56
+ user_node = Nokogiri::XML::Node.new('c:userDirectives', document)
57
+ user_node['c:allowDistribution'] = @allow_distribution.to_s unless @allow_distribution.nil?
58
+ user_node['c:allowSearch'] = @allow_search.to_s unless @allow_search.nil?
59
+ user_node['c:externalID'] = @external_id.to_s if @external_id
60
+ user_node['c:submitter'] = @submitter.to_s if @submitter
61
+
62
+ params_node << processing_node
63
+ params_node << user_node
64
+
65
+ if @external_metadata
66
+ external_node = Nokogiri::XML::Node.new('c:externalMetadata', document)
67
+ external_node << @external_metadata
68
+ params_node << external_node
69
+ end
70
+
71
+ params_node.to_xml(:indent => 2)
72
+ end
73
+
74
+ private
75
+ def check_params
76
+ raise 'missing content' if @content.nil? || @content.empty?
77
+
78
+ content_length = @content.length
79
+ raise 'content is too small' if content_length < MIN_CONTENT_SIZE
80
+ raise 'content is too large' if content_length > MAX_CONTENT_SIZE
81
+
82
+ raise 'missing license id' if @license_id.nil? || @license_id.empty?
83
+
84
+ raise 'unknown content type' unless AVAILABLE_CONTENT_TYPES.keys.include?(@content_type) if @content_type
85
+ raise 'unknown output format' unless AVAILABLE_OUTPUT_FORMATS.keys.include?(@output_format) if @output_format
86
+
87
+ %w[calculate_relevance allow_distribution allow_search].each do |variable|
88
+ value = self.send(variable)
89
+ unless NilClass === value || TrueClass === value || FalseClass === value
90
+ raise "expected a boolean value for #{variable} but got #{value}"
91
+ end
92
+ end
93
+
94
+ @metadata_enables ||= []
95
+ unknown_enables = Set.new(@metadata_enables) - KNOWN_ENABLES
96
+ raise "unknown metadata enables: #{unknown_enables.to_a.inspect}" unless unknown_enables.empty?
97
+
98
+ @metadata_discards ||= []
99
+ unknown_discards = Set.new(@metadata_discards) - KNOWN_DISCARDS
100
+ raise "unknown metadata discards: #{unknown_discards.to_a.inspect}" unless unknown_discards.empty?
101
+ end
102
+
103
+ def do_request(post_fields)
104
+ unless @client.http_post(post_fields)
105
+ raise 'unable to post to api endpoint'
106
+ end
107
+
108
+ @client.body_str
109
+ end
110
+ end
111
+ end
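`check_params` rejects metadata directives the library does not know about by subtracting the known list from the requested set. The following standalone sketch shows that check using only Ruby's `set` library. The helper names `unknown_directives` and `check_discards!` are hypothetical, and formatting the message with `inspect` on an array is an assumption about the intended output.

```ruby
require 'set'

# Same values as Calais::KNOWN_DISCARDS in lib/calais.rb.
KNOWN_DISCARDS = ['er/Company', 'er/Geo'].freeze

# Hypothetical helper: returns the requested directives that are not in the known list.
def unknown_directives(requested, known)
  (Set.new(requested) - known).to_a
end

# Hypothetical helper: raises if any requested discard is unknown, returns true otherwise.
def check_discards!(requested)
  unknown = unknown_directives(requested || [], KNOWN_DISCARDS)
  raise ArgumentError, "unknown metadata discards: #{unknown.inspect}" unless unknown.empty?
  true
end

check_discards!(['er/Geo'])

begin
  check_discards!(['er/Geo', 'er/Planet'])
rescue ArgumentError => e
  puts e.message
end
```

Treating a `nil` list as empty mirrors the `||= []` defaults in `check_params`.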
data/lib/calais/error.rb ADDED
@@ -0,0 +1,3 @@
1
+ class Calais::Error < StandardError
2
+
3
+ end
data/lib/calais/response.rb ADDED
@@ -0,0 +1,201 @@
1
+ module Calais
2
+ class Response
3
+ MATCHERS = {
4
+ :docinfo => 'DocInfo',
5
+ :docinfometa => 'DocInfoMeta',
6
+ :defaultlangid => 'DefaultLangId',
7
+ :doccat => 'DocCat',
8
+ :entities => 'type/em/e',
9
+ :relations => 'type/em/r',
10
+ :geographies => 'type/er',
11
+ :instances => 'type/sys/InstanceInfo',
12
+ :relevances => 'type/sys/RelevanceInfo',
13
+ }
14
+
15
+ attr_accessor :submitter_code, :signature, :language, :submission_date, :request_id, :doc_title, :doc_date
16
+ attr_accessor :hashes, :entities, :relations, :geographies, :categories
17
+
18
+ def initialize(rdf_string)
19
+ @raw_response = rdf_string
20
+
21
+ @hashes = []
22
+ @entities = []
23
+ @relations = []
24
+ @geographies = []
25
+ @relevances = {} # key = String hash, val = Float relevance
26
+ @categories = []
27
+
28
+ extract_data
29
+ end
30
+
31
+ class Entity
32
+ attr_accessor :calais_hash, :type, :attributes, :relevance, :instances
33
+ end
34
+
35
+ class Relation
36
+ attr_accessor :calais_hash, :type, :attributes, :instances
37
+ end
38
+
39
+ class Geography
40
+ attr_accessor :name, :calais_hash, :attributes
41
+ end
42
+
43
+ class Category
44
+ attr_accessor :name, :score
45
+ end
46
+
47
+ class Instance
48
+ attr_accessor :prefix, :exact, :suffix, :offset, :length
49
+
50
+ # Makes a new Instance object from an appropriate Nokogiri::XML::Node.
51
+ def self.from_node(node)
52
+ instance = self.new
53
+ instance.prefix = node.xpath("c:prefix[1]").first.content
54
+ instance.exact = node.xpath("c:exact[1]").first.content
55
+ instance.suffix = node.xpath("c:suffix[1]").first.content
56
+ instance.offset = node.xpath("c:offset[1]").first.content.to_i
57
+ instance.length = node.xpath("c:length[1]").first.content.to_i
58
+
59
+ instance
60
+ end
61
+ end
62
+
63
+ class CalaisHash
64
+ attr_accessor :value
65
+
66
+ def self.find_or_create(hash, hashes)
67
+ if !selected = hashes.select {|h| h.value == hash }.first
68
+ selected = self.new
69
+ selected.value = hash
70
+ hashes << selected
71
+ end
72
+
73
+ selected
74
+ end
75
+ end
76
+
77
+ private
78
+ def extract_data
79
+ doc = Nokogiri::XML(@raw_response)
80
+
81
+ if doc.root.xpath("/Error[1]").first
82
+ raise Calais::Error, doc.root.xpath("/Error/Exception").first.content
83
+ end
84
+
85
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfometa]}')]/..").each do |node|
86
+ @language = node['language']
87
+ @submission_date = DateTime.parse node['submissionDate']
88
+
89
+ attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
90
+
91
+ @signature = attributes.delete('signature')
92
+ @submitter_code = attributes.delete('submitterCode')
93
+
94
+ node.remove
95
+ end
96
+
97
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfo]}')]/..").each do |node|
98
+ @request_id = node['calaisRequestID']
99
+
100
+ attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
101
+
102
+ @doc_title = attributes.delete('docTitle')
103
+ @doc_date = Date.parse(attributes.delete('docDate'))
104
+
105
+ node.remove
106
+ end
107
+
108
+ @categories = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:doccat]}')]/..").map do |node|
109
+ category = Category.new
110
+ category.name = node.xpath("c:categoryName[1]").first.content
111
+ score = node.xpath("c:score[1]").first
112
+ category.score = score.content.to_f unless score.nil?
113
+
114
+ node.remove
115
+ category
116
+ end
117
+
118
+ @relevances = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relevances]}')]/..").inject({}) do |acc, node|
119
+ subject_hash = node.xpath("c:subject[1]").first[:resource].split('/')[-1]
120
+ acc[subject_hash] = node.xpath("c:relevance[1]").first.content.to_f
121
+
122
+ node.remove
123
+ acc
124
+ end
125
+
126
+ @entities = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:entities]}')]/..").map do |node|
127
+ extracted_hash = node['about'].split('/')[-1] rescue nil
128
+
129
+ entity = Entity.new
130
+ entity.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
131
+ entity.type = extract_type(node)
132
+ entity.attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
133
+
134
+ entity.relevance = @relevances[extracted_hash]
135
+ entity.instances = extract_instances(doc, extracted_hash)
136
+
137
+ node.remove
138
+ entity
139
+ end
140
+
141
+ @relations = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relations]}')]/..").map do |node|
142
+ extracted_hash = node['about'].split('/')[-1] rescue nil
143
+
144
+ relation = Relation.new
145
+ relation.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
146
+ relation.type = extract_type(node)
147
+ relation.attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
148
+ relation.instances = extract_instances(doc, extracted_hash)
149
+
150
+ node.remove
151
+ relation
152
+ end
153
+
154
+ @geographies = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:geographies]}')]/..").map do |node|
155
+ attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
156
+
157
+ geography = Geography.new
158
+ geography.name = attributes.delete('name')
159
+ geography.calais_hash = attributes.delete('subject')
160
+ geography.attributes = attributes
161
+
162
+ node.remove
163
+ geography
164
+ end
165
+
166
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:defaultlangid]}')]/..").each { |node| node.remove }
167
+ doc.root.xpath("./*").each { |node| node.remove }
168
+
169
+ return
170
+ end
171
+
172
+ def extract_instances(doc, hash)
173
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:instances]}')]/..").select do |instance_node|
174
+ instance_node.xpath("c:subject[1]").first[:resource].split("/")[-1] == hash
175
+ end.map do |instance_node|
176
+ instance = Instance.from_node(instance_node)
177
+ instance_node.remove
178
+
179
+ instance
180
+ end
181
+ end
182
+
183
+ def extract_type(node)
184
+ node.xpath("*[name()='rdf:type']")[0]['resource'].split('/')[-1]
185
+ rescue
186
+ nil
187
+ end
188
+
189
+ def extract_attributes(nodes)
190
+ nodes.inject({}) do |hsh, node|
191
+ value = if node['resource']
192
+ extracted_hash = node['resource'].split('/')[-1] rescue nil
193
+ CalaisHash.find_or_create(extracted_hash, @hashes)
194
+ else
195
+ node.content
196
+ end
197
+ hsh.merge(node.name => value)
198
+ end
199
+ end
200
+ end
201
+ end
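`CalaisHash.find_or_create` keeps one object per hash value, so an entity and any attribute that refers to it share the same `CalaisHash` instance. A standalone sketch of the pattern follows. `HashRef` is a hypothetical stand-in for `CalaisHash`, and `Array#find` is used in place of `select { ... }.first`, which gives the same result.

```ruby
# Hypothetical stand-in for Calais::Response::CalaisHash.
class HashRef
  attr_accessor :value

  # Returns the existing entry with this value, or creates one and appends it to +registry+.
  def self.find_or_create(value, registry)
    existing = registry.find { |ref| ref.value == value }
    return existing if existing

    created = new
    created.value = value
    registry << created
    created
  end
end

registry = []
first  = HashRef.find_or_create('8f3936d9-cf6b-37fc-ae0d-a145959ae3b5', registry)
second = HashRef.find_or_create('8f3936d9-cf6b-37fc-ae0d-a145959ae3b5', registry)

puts first.equal?(second) # both calls return the same object
puts registry.size
```

The lookup is a linear scan of the registry, which is simple but grows with the number of distinct hashes in a response.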
data/spec/calais/client_spec.rb ADDED
@@ -0,0 +1,79 @@
1
+ require File.join(File.dirname(__FILE__), %w[.. helper])
2
+
3
+ describe Calais::Client, :new do
4
+ it 'accepts arguments as a hash' do
5
+ client = nil
6
+
7
+ lambda { client = Calais::Client.new(:content => SAMPLE_DOCUMENT, :license_id => LICENSE_ID) }.should_not raise_error
8
+
9
+ client.license_id.should == LICENSE_ID
10
+ client.content.should == SAMPLE_DOCUMENT
11
+ end
12
+
13
+ it 'accepts arguments as a block' do
14
+ client = nil
15
+
16
+ lambda {
17
+ client = Calais::Client.new do |c|
18
+ c.content = SAMPLE_DOCUMENT
19
+ c.license_id = LICENSE_ID
20
+ end
21
+ }.should_not raise_error
22
+
23
+ client.license_id.should == LICENSE_ID
24
+ client.content.should == SAMPLE_DOCUMENT
25
+ end
26
+
27
+ it 'should not accept unknown attributes' do
28
+ lambda { Calais::Client.new(:monkey => 'monkey', :license_id => LICENSE_ID) }.should raise_error(NoMethodError)
29
+ end
30
+ end
31
+
32
+ describe Calais::Client, :params_xml do
33
+ it 'returns an xml encoded string' do
34
+ client = Calais::Client.new(:content => SAMPLE_DOCUMENT, :license_id => LICENSE_ID)
35
+ client.params_xml.should == %[<c:params xmlns:c=\"http://s.opencalais.com/1/pred/\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"><c:processingDirectives/><c:userDirectives/></c:params>]
36
+
37
+ client.content_type = :xml
38
+ client.output_format = :json
39
+ client.reltag_base_url = 'http://opencalais.com'
40
+ client.calculate_relevance = true
41
+ client.metadata_enables = Calais::KNOWN_ENABLES
42
+ client.metadata_discards = Calais::KNOWN_DISCARDS
43
+ client.allow_distribution = true
44
+ client.allow_search = true
45
+ client.external_id = Digest::SHA1.hexdigest(client.content)
46
+ client.submitter = 'calais.rb'
47
+
48
+ client.params_xml.should == %[<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><c:processingDirectives c:contentType="text/xml" c:outputFormat="application/json" c:reltagBaseURL="http://opencalais.com" c:enableMetadataType="GenericRelations" c:discardMetadata="er/Company;er/Geo"/><c:userDirectives c:allowDistribution="true" c:allowSearch="true" c:externalID="1a008b91e7d21962e132bc1d6cb252532116a606" c:submitter="calais.rb"/></c:params>]
49
+ end
50
+ end
51
+
52
+ describe Calais::Client, :enlighten do
53
+ before do
54
+ @client = Calais::Client.new do |c|
55
+ c.content = SAMPLE_DOCUMENT
56
+ c.license_id = LICENSE_ID
57
+ c.content_type = :xml
58
+ c.output_format = :json
59
+ c.calculate_relevance = true
60
+ c.metadata_enables = Calais::KNOWN_ENABLES
61
+ c.allow_distribution = true
62
+ c.allow_search = true
63
+ end
64
+ end
65
+
66
+ it 'provides access to the enlighten command on the generic rest endpoint' do
67
+ @client.should_receive(:do_request).with(anything).and_return(SAMPLE_RESPONSE)
68
+ @client.enlighten
69
+ @client.instance_variable_get(:@client).url.should == Calais::REST_ENDPOINT
70
+ end
71
+
72
+ it 'provides access to the enlighten command on the beta rest endpoint' do
73
+ @client.use_beta = true
74
+
75
+ @client.should_receive(:do_request).with(anything).and_return(SAMPLE_RESPONSE)
76
+ @client.enlighten
77
+ @client.instance_variable_get(:@client).url.should == Calais::BETA_REST_ENDPOINT
78
+ end
79
+ end
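The client specs exercise two ways of configuring a client, an options hash and a block, and rely on unknown keys raising `NoMethodError`. All three behaviours follow from dispatching each option key to its setter. The sketch below is self-contained and uses a hypothetical `MiniClient` class rather than the gem's `Calais::Client`; it uses `public_send` where the original uses `send`.

```ruby
# Hypothetical stand-in mirroring the initializer pattern of Calais::Client.
class MiniClient
  attr_accessor :content, :license_id

  def initialize(options = {})
    # Each key is sent to its writer method, so an unknown key raises NoMethodError.
    options.each { |key, value| public_send("#{key}=", value) }
    yield(self) if block_given?
  end
end

from_hash = MiniClient.new(:content => 'some text', :license_id => 'abc')

from_block = MiniClient.new do |c|
  c.content = 'some text'
  c.license_id = 'abc'
end

begin
  MiniClient.new(:monkey => 'monkey')
rescue NoMethodError => e
  puts "rejected option: #{e.name}"
end
```

Because the block runs after the hash is applied, values set in the block override values passed as options.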
data/spec/calais/response_spec.rb ADDED
@@ -0,0 +1,128 @@
1
+ require File.join(File.dirname(__FILE__), %w[.. helper])
2
+
3
+ describe Calais::Response, :new do
4
+ it 'accepts an rdf string to generate the response object' do
5
+ lambda { Calais::Response.new(SAMPLE_RESPONSE) }.should_not raise_error
6
+ end
7
+ end
8
+
9
+ describe Calais::Response, :new do
10
+ it "should return error message in runtime error" do
11
+ lambda {
12
+ @response = Calais::Response.new(RESPONSE_WITH_EXCEPTION)
13
+ }.should raise_error(Calais::Error, "My Error Message")
14
+ end
15
+ end
16
+
17
+ describe Calais::Response, :new do
18
+ before :all do
19
+ @response = Calais::Response.new(SAMPLE_RESPONSE)
20
+ end
21
+
22
+ it 'should extract document information' do
23
+ @response.language.should == 'English'
24
+ @response.submission_date.should be_a_kind_of(DateTime)
25
+ @response.signature.should be_a_kind_of(String)
26
+ @response.submitter_code.should be_a_kind_of(String)
27
+ @response.request_id.should be_a_kind_of(String)
28
+ @response.doc_title.should == 'Record number of bicycles sold in Australia in 2006'
29
+ @response.doc_date.should be_a_kind_of(Date)
30
+ end
31
+
32
+ it 'should extract entities' do
33
+ entities = @response.entities
34
+ entities.map { |e| e.type }.sort.uniq.should == %w[City Continent Country IndustryTerm Organization Person Position ProvinceOrState]
35
+ end
36
+
37
+ it 'should extract relations' do
38
+ relations = @response.relations
39
+ relations.map { |e| e.type }.sort.uniq.should == %w[GenericRelations PersonAttributes PersonCareer Quotation]
40
+ end
41
+
42
+ it 'should extract geographies' do
43
+ geographies = @response.geographies
44
+ geographies.map { |e| e.name }.sort.uniq.should == %w[Australia Hobart,Tasmania,Australia Tasmania,Australia]
45
+ end
46
+
47
+ it 'should extract relevances' do
48
+ @response.instance_variable_get(:@relevances).should be_a_kind_of(Hash)
49
+ end
50
+
51
+ it 'should assign a floating-point relevance to each entity' do
52
+ @response.entities.each {|e| e.relevance.should be_a_kind_of(Float) }
53
+ end
54
+
55
+ it 'should find the correct document categories returned by OpenCalais' do
56
+ @response.categories.map {|c| c.name }.sort.should == %w[Business_Finance Technology_Internet]
57
+ end
58
+
59
+ it 'should find the correct document category scores returned by OpenCalais' do
60
+ @response.categories.map {|c| c.score.should be_a_kind_of(Float) }
61
+ end
62
+
63
+ it "should not raise an error if no score is given by OpenCalais" do
64
+ lambda {Calais::Response.new(SAMPLE_RESPONSE_WITH_NO_SCORE)}.should_not raise_error
65
+ end
66
+
67
+ it "should return a nil score if no score is given by OpenCalais" do
68
+ response = Calais::Response.new(SAMPLE_RESPONSE_WITH_NO_SCORE)
69
+ response.categories.map {|c| c.score }.should == [nil]
70
+ end
71
+
72
+ it 'should find instances for each entity' do
73
+ @response.entities.each {|e|
74
+ e.instances.size.should > 0
75
+ }
76
+ end
77
+
78
+
79
+ it 'should find instances for each relation' do
80
+ @response.relations.each {|r|
81
+ r.instances.size.should > 0
82
+ }
83
+ end
84
+
85
+ it 'should find the correct instances for each entity' do
86
+ ## This currently tests only for the "Australia" entity's
87
+ ## instances. A more thorough test that tests for the instances
88
+ ## of each of the many entities in the sample doc is desirable in
89
+ ## the future.
90
+
91
+ australia = @response.entities.select {|e| e.attributes["name"] == "Australia" }.first
92
+ australia.instances.size.should == 3
93
+ instances = australia.instances.sort{|a,b| a.offset <=> b.offset }
94
+
95
+ instances[0].prefix.should == "number of bicycles sold in "
96
+ instances[0].exact.should == "Australia"
97
+ instances[0].suffix.should == " in 2006<\/title>\n<date>January 4,"
98
+ instances[0].offset.should == 67
99
+ instances[0].length.should == 9
100
+
101
+ instances[1].prefix.should == "4, 2007<\/date>\n<body>\nBicycle sales in "
102
+ instances[1].exact.should == "Australia"
103
+ instances[1].suffix.should == " have recorded record sales of 1,273,781 units"
104
+ instances[1].offset.should == 146
105
+ instances[1].length.should == 9
106
+
107
+ instances[2].prefix.should == " the traditional company car,\" he said.\n\n\"Some of "
108
+ instances[2].exact.should == "Australia"
109
+ instances[2].suffix.should == "'s biggest corporations now have bicycle fleets,"
110
+ instances[2].offset.should == 952
111
+ instances[2].length.should == 9
112
+ end
113
+
114
+ it 'should find the correct instances for each relation' do
115
+ ## This currently tests only for one relation's instances. A more
116
+ ## thorough test that tests for the instances of each of the many other
117
+ ## relations in the sample doc is desirable in the future.
118
+
119
+ rel = @response.relations.select {|e| e.calais_hash.value == "8f3936d9-cf6b-37fc-ae0d-a145959ae3b5" }.first
120
+ rel.instances.size.should == 1
121
+
122
+ rel.instances.first.prefix.should == " manufacturers.\n\nThe Cycling Promotion Fund (CPF) "
123
+ rel.instances.first.exact.should == "spokesman Ian Christie said Australians were increasingly using bicycles as an alternative to cars."
124
+ rel.instances.first.suffix.should == " Sales rose nine percent in 2006 while the car"
125
+ rel.instances.first.offset.should == 425
126
+ rel.instances.first.length.should == 99
127
+ end
128
+ end
data/spec/helper.rb ADDED
@@ -0,0 +1,12 @@
1
+ require 'rubygems'
2
+ require 'spec'
3
+ require 'yaml'
4
+
5
+ require File.dirname(__FILE__) + '/../lib/calais'
6
+
7
+ FIXTURES_DIR = File.join File.dirname(__FILE__), %[fixtures]
8
+ SAMPLE_DOCUMENT = File.read(File.join(FIXTURES_DIR, %[bicycles_australia.xml]))
9
+ SAMPLE_RESPONSE = File.read(File.join(FIXTURES_DIR, %[bicycles_australia.response.rdf]))
10
+ SAMPLE_RESPONSE_WITH_NO_SCORE = File.read(File.join(FIXTURES_DIR, %[twitter_tweet_without_score.response.rdf]))
11
+ RESPONSE_WITH_EXCEPTION = File.read(File.join(FIXTURES_DIR, %[error.response.xml]))
12
+ LICENSE_ID = YAML.load(File.read(File.join(FIXTURES_DIR, %[calais.yml])))['key']
metadata ADDED
@@ -0,0 +1,93 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: documentcloud-calais
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.7
5
+ platform: ruby
6
+ authors:
7
+ - Abhay Kumar
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+
12
+ date: 2009-06-08 00:00:00 -07:00
13
+ default_executable:
14
+ dependencies:
15
+ - !ruby/object:Gem::Dependency
16
+ name: nokogiri
17
+ type: :runtime
18
+ version_requirement:
19
+ version_requirements: !ruby/object:Gem::Requirement
20
+ requirements:
21
+ - - ">="
22
+ - !ruby/object:Gem::Version
23
+ version: 1.3.3
24
+ version:
25
+ - !ruby/object:Gem::Dependency
26
+ name: json
27
+ type: :runtime
28
+ version_requirement:
29
+ version_requirements: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: 1.1.3
34
+ version:
35
+ - !ruby/object:Gem::Dependency
36
+ name: curb
37
+ type: :runtime
38
+ version_requirement:
39
+ version_requirements: !ruby/object:Gem::Requirement
40
+ requirements:
41
+ - - ">="
42
+ - !ruby/object:Gem::Version
43
+ version: 0.1.4
44
+ version:
45
+ description: A Ruby interface to the Calais Web Service
46
+ email: info@opensynapse.net
47
+ executables: []
48
+
49
+ extensions: []
50
+
51
+ extra_rdoc_files:
52
+ - README.markdown
53
+ files:
54
+ - CHANGELOG.markdown
55
+ - MIT-LICENSE
56
+ - README.markdown
57
+ - Rakefile
58
+ - VERSION.yml
59
+ - lib/calais.rb
60
+ - lib/calais/client.rb
61
+ - lib/calais/error.rb
62
+ - lib/calais/response.rb
63
+ has_rdoc: true
64
+ homepage: http://github.com/abhay/calais
65
+ licenses:
66
+ post_install_message:
67
+ rdoc_options:
68
+ - --charset=UTF-8
69
+ require_paths:
70
+ - lib
71
+ required_ruby_version: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - ">="
74
+ - !ruby/object:Gem::Version
75
+ version: "0"
76
+ version:
77
+ required_rubygems_version: !ruby/object:Gem::Requirement
78
+ requirements:
79
+ - - ">="
80
+ - !ruby/object:Gem::Version
81
+ version: "0"
82
+ version:
83
+ requirements: []
84
+
85
+ rubyforge_project: calais
86
+ rubygems_version: 1.3.5
87
+ signing_key:
88
+ specification_version: 2
89
+ summary: A Ruby interface to the Calais Web Service
90
+ test_files:
91
+ - spec/helper.rb
92
+ - spec/calais/response_spec.rb
93
+ - spec/calais/client_spec.rb