jm-calais 0.0.13

@@ -0,0 +1,63 @@
1
+ # Changes
2
+
3
+ ## 0.0.13
4
+
5
+ * load path fix
6
+
7
+ ## 0.0.12
8
+
9
+ * added relevances to Geographies
10
+ * improved doc
11
+ * removed jeweler dependency and simplified Rakefile
12
+ * bumped rspec requirement
13
+
14
+ ## 0.0.11
15
+
16
+ * simple fix for some rubies not liking DateTime.parse without including date
17
+ * tests for SocialTags
18
+ * typo fix: SocailTag != SocialTag
19
+
20
+ ## 0.0.10
21
+
22
+ * community patch to expose SocialTags
23
+
24
+ ## 0.0.9
25
+
26
+ * updates related to API changes
27
+ * community patches to support bundler, support ruby 1.9
28
+
29
+ ## 0.0.8
30
+
31
+ * community patches to use nokogiri
32
+
33
+ ## 0.0.7
34
+ * verified 4.0 API
35
+ * moved gem packaging to `jeweler` and documentation to `yard`
36
+
37
+ ## 0.0.6
38
+ * fully implemented 3.1 API
39
+
40
+ ## 0.0.5
41
+ * fixed error where classes weren't being required in the proper order on Ubuntu (reported by Jon Moses)
42
+ * New things coming back from the API. Fixing in tests.
43
+
44
+ ## 0.0.4
45
+ * changed dependency from `hpricot` to `libxml`
46
+ * unicode fun
47
+ * cleanup all around
48
+
49
+ ## 0.0.3
50
+ * pluginized the library for Rails (thanks [pius](http://gitorious.org/projects/calais-au-rails))
51
+ * added helper methods name entity types from a response
52
+
53
+ ## 0.0.2
54
+ * cleanup in the specs
55
+ * cleaner parsing
56
+ * location of named entities
57
+ * more data in relationships
58
+ * moved Names and Relationships
59
+
60
+ ## 0.0.1
61
+ * Access to OpenCalais's Enlighten action
62
+ * Single method to process a document
63
+ * Get relationships and names from a document
data/Gemfile ADDED
@@ -0,0 +1,3 @@
1
+ source :gemcutter
2
+
3
+ gemspec
data/MIT-LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2008 Abhay Kumar info@opensynapse.net
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ 'Software'), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
17
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
18
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
19
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
20
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.markdown ADDED
@@ -0,0 +1,55 @@
1
+ # Calais #
2
+ A Ruby interface to the [Open Calais Web Service](http://opencalais.com)
3
+
4
+ ## About this Fork ##
5
+ Forked from https://github.com/abhay/calais version ~> 0.0.13
6
+ to fix issues caused by the deprecation of iconv in Ruby > 1.9.3
7
+
8
+ ## Features ##
9
+ * Accepts documents in text/plain, text/xml and text/html format.
10
+ * Basic access to the Open Calais API's Enlighten action.
11
+ * Output is an RDF representation of the input document.
12
+ * A single function call extracts names, entities and geographies from the given text.
13
+
14
+ ## Synopsis ##
15
+
16
+ This is a very basic wrapper to the Open Calais API. It uses the POST endpoint and currently supports the Enlighten action. Here's a simple call:
17
+
18
+ Calais.enlighten(
19
+ :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
20
+ :content_type => :raw,
21
+ :license_id => 'your license id'
22
+ )
23
+
24
+ This is the easiest way to get the RDF-formatted response from the OpenCalais service.
25
+
26
+ If you want structured information about a document instead of the raw RDF, you can try this:
27
+
28
+ Calais.process_document(
29
+ :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
30
+ :content_type => :raw,
31
+ :license_id => 'your license id'
32
+ )
33
+
34
+ This will return an object containing information extracted from the RDF response.
35
+
36
+ ## Requirements ##
37
+
38
+ * [Ruby 1.8.5 or better](http://ruby-lang.org)
39
+ * [nokogiri](http://nokogiri.rubyforge.org/nokogiri/), [libxml2](http://xmlsoft.org/), [libxslt](http://xmlsoft.org/xslt/)
40
+ * [curb](http://curb.rubyforge.org/), [libcurl](http://curl.haxx.se/)
41
+ * [json](http://json.rubyforge.org/)
42
+
43
+ ## Install ##
44
+
45
+ You can install the Calais gem via Rubygems (`gem install calais`) or by building from source.
46
+
47
+ ## Authors ##
48
+
49
+ * [Abhay Kumar](http://opensynapse.net)
50
+
51
+ ## Acknowledgements ##
52
+
53
+ * [Paul Legato](http://www.economaton.com/): Help all around with the new response processor and implementation of the 3.1 API.
54
+ * [Ryan Ong](http://www.ryanong.net/)
55
+ * [Juan Antonio Chavez](https://github.com/TheNaoX): Geographies relevance
data/Rakefile ADDED
@@ -0,0 +1,36 @@
1
+ # -*- ruby -*-
2
+
3
+ require 'rake'
4
+ require 'rake/clean'
5
+
6
+ require './lib/calais.rb'
7
+
8
+ begin
9
+ require 'rspec/core/rake_task'
10
+
11
+ RSpec::Core::RakeTask.new(:spec)
12
+
13
+ task :default => :spec
14
+ rescue LoadError
15
+ puts "RSpec, or one of its dependencies, is not available. Please install it."
16
+ exit(1)
17
+ end
18
+
19
+ begin
20
+ require 'yard'
21
+ require 'yard/rake/yardoc_task'
22
+
23
+ YARD::Rake::YardocTask.new do |t|
24
+ t.options = ["--verbose", "--markup=markdown", "--files=CHANGELOG.markdown,MIT-LICENSE"]
25
+ end
26
+
27
+ task :rdoc => :yardoc
28
+
29
+ CLOBBER.include 'doc'
30
+ CLOBBER.include '.yardoc'
31
+ rescue LoadError
32
+ puts "Yard, or one of its dependencies, is not available. Please install it."
33
+ exit(1)
34
+ end
35
+
36
+ # vim: syntax=Ruby
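
The Rakefile above wraps each optional tool (RSpec, YARD) in a `begin ... rescue LoadError` block and exits when the library is missing. A minimal sketch of that pattern as a reusable helper might look like the following. Note that `optional_require` is a hypothetical name, not part of the gem, and that it returns `false` instead of calling `exit(1)` so the caller decides what to do.

```ruby
# Hypothetical helper (not part of the calais gem): try to load a library
# and report whether it is available instead of aborting the process.
def optional_require(library)
  require library
  true # `require` returns false when already loaded, so report success explicitly
rescue LoadError
  warn "#{library}, or one of its dependencies, is not available. Please install it."
  false
end

# Usage: define tasks only when the tool can be loaded.
if optional_require('set')
  puts 'set is available'
end
```

Returning a boolean keeps unrelated tasks usable when only one of the optional tools is missing.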
@@ -0,0 +1,115 @@
1
+ module Calais
2
+ class Client
3
+ # base attributes of the call
4
+ attr_accessor :content
5
+ attr_accessor :license_id
6
+
7
+ # processing directives
8
+ attr_accessor :content_type, :output_format, :reltag_base_url, :calculate_relevance, :omit_outputting_original_text
9
+ attr_accessor :store_rdf, :metadata_enables, :metadata_discards
10
+
11
+ # user directives
12
+ attr_accessor :allow_distribution, :allow_search, :external_id, :submitter
13
+
14
+ attr_accessor :external_metadata
15
+
16
+ attr_accessor :use_beta
17
+
18
+ def initialize(options={}, &block)
19
+ options.each {|k,v| send("#{k}=", v)}
20
+ yield(self) if block_given?
21
+ end
22
+
23
+ def enlighten
24
+ post_args = {
25
+ "licenseID" => @license_id,
26
+ "content" => RUBY_VERSION.to_f < 1.9 ?
27
+ Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "#{@content} ").first[0..-2] :
28
+ "#{@content} ".encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')[0 .. -2],
29
+ "paramsXML" => params_xml
30
+ }
31
+
32
+ do_request(post_args)
33
+ end
34
+
35
+ def params_xml
36
+ check_params
37
+ document = Nokogiri::XML::Document.new
38
+
39
+ params_node = Nokogiri::XML::Node.new('c:params', document)
40
+ params_node['xmlns:c'] = 'http://s.opencalais.com/1/pred/'
41
+ params_node['xmlns:rdf'] = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
42
+
43
+ processing_node = Nokogiri::XML::Node.new('c:processingDirectives', document)
44
+ processing_node['c:contentType'] = AVAILABLE_CONTENT_TYPES[@content_type] if @content_type
45
+ processing_node['c:outputFormat'] = AVAILABLE_OUTPUT_FORMATS[@output_format] if @output_format
46
+ processing_node['c:calculateRelevanceScore'] = 'false' if @calculate_relevance == false
47
+ processing_node['c:reltagBaseURL'] = @reltag_base_url.to_s if @reltag_base_url
48
+
49
+ processing_node['c:enableMetadataType'] = @metadata_enables.join(',') unless @metadata_enables.empty?
50
+ processing_node['c:docRDFaccessible'] = @store_rdf if @store_rdf
51
+ processing_node['c:discardMetadata'] = @metadata_discards.join(';') unless @metadata_discards.empty?
52
+ processing_node['c:omitOutputtingOriginalText'] = 'true' if @omit_outputting_original_text
53
+
54
+ user_node = Nokogiri::XML::Node.new('c:userDirectives', document)
55
+ user_node['c:allowDistribution'] = @allow_distribution.to_s unless @allow_distribution.nil?
56
+ user_node['c:allowSearch'] = @allow_search.to_s unless @allow_search.nil?
57
+ user_node['c:externalID'] = @external_id.to_s if @external_id
58
+ user_node['c:submitter'] = @submitter.to_s if @submitter
59
+
60
+ params_node << processing_node
61
+ params_node << user_node
62
+
63
+ if @external_metadata
64
+ external_node = Nokogiri::XML::Node.new('c:externalMetadata', document)
65
+ external_node << @external_metadata
66
+ params_node << external_node
67
+ end
68
+
69
+ params_node.to_xml(:indent => 2)
70
+ end
71
+
72
+ def url
73
+ @url ||= URI.parse(calais_endpoint)
74
+ end
75
+
76
+ private
77
+ def check_params
78
+ raise 'missing content' if @content.nil? || @content.empty?
79
+
80
+ content_length = @content.length
81
+ raise 'content is too small' if content_length < MIN_CONTENT_SIZE
82
+ raise 'content is too large' if content_length > MAX_CONTENT_SIZE
83
+
84
+ raise 'missing license id' if @license_id.nil? || @license_id.empty?
85
+
86
+ raise 'unknown content type' unless AVAILABLE_CONTENT_TYPES.keys.include?(@content_type) if @content_type
87
+ raise 'unknown output format' unless AVAILABLE_OUTPUT_FORMATS.keys.include?(@output_format) if @output_format
88
+
89
+ %w[calculate_relevance store_rdf allow_distribution allow_search].each do |variable|
90
+ value = self.send(variable)
91
+ unless NilClass === value || TrueClass === value || FalseClass === value
92
+ raise "expected a boolean value for #{variable} but got #{value}"
93
+ end
94
+ end
95
+
96
+ @metadata_enables ||= []
97
+ unknown_enables = Set.new(@metadata_enables) - KNOWN_ENABLES
98
+ raise "unknown metadata enables: #{unknown_enables.to_a.inspect}" unless unknown_enables.empty?
99
+
100
+ @metadata_discards ||= []
101
+ unknown_discards = Set.new(@metadata_discards) - KNOWN_DISCARDS
102
+ raise "unknown metadata discards: #{unknown_discards.to_a.inspect}" unless unknown_discards.empty?
103
+ end
104
+
105
+ def do_request(post_fields)
106
+ @request ||= Net::HTTP::Post.new(url.path)
107
+ @request.set_form_data(post_fields)
108
+ Net::HTTP.new(url.host, url.port).start {|http| http.request(@request)}.body
109
+ end
110
+
111
+ def calais_endpoint
112
+ @use_beta ? BETA_REST_ENDPOINT : REST_ENDPOINT
113
+ end
114
+ end
115
+ end
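
`Client#enlighten` above removes invalid UTF-8 from the content before posting it, using `Iconv` on Ruby older than 1.9 and `String#encode` otherwise. A minimal sketch of the same cleanup, assuming Ruby 2.1 or later where `String#scrub` is available, could look like this. `clean_utf8` is a hypothetical helper name and is not part of the gem.

```ruby
# Hypothetical helper (not part of the calais gem): drop byte sequences
# that are not valid UTF-8 so the content can be posted safely.
def clean_utf8(content)
  # Work on a copy tagged as UTF-8, then remove the invalid bytes.
  content.to_s.dup.force_encoding(Encoding::UTF_8).scrub('')
end

dirty = "caf\xC3\xA9 \xFF ok" # "\xFF" can never appear in valid UTF-8
clean = clean_utf8(dirty)
puts clean.valid_encoding?
```

Unlike the original expression, this sketch does not need the trailing-space workaround (`"#{@content} "` followed by `[0..-2]`), because `scrub` handles an invalid sequence at the end of the string.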
@@ -0,0 +1,3 @@
1
+ class Calais::Error < StandardError
2
+
3
+ end
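
`Calais::Response` raises `Calais::Error` when the service returns an error document, so callers can rescue that one class. A minimal sketch of doing so follows. The `Calais` module and `Error` class are redefined here only so the snippet runs on its own, and `parse_response` is a hypothetical stand-in for the real parser.

```ruby
# Stand-in definitions so the sketch is self-contained; the gem defines these itself.
module Calais
  class Error < StandardError; end
end

# Hypothetical stand-in for response parsing: raises when given an error message.
def parse_response(error_message = nil)
  raise Calais::Error, error_message if error_message
  :ok
end

begin
  parse_response('Invalid license key')
rescue Calais::Error => e
  puts "Calais request failed: #{e.message}"
end
```

Note that the compact form `class Calais::Error < StandardError` used in the gem only works once the `Calais` module has already been defined.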
@@ -0,0 +1,220 @@
1
+ module Calais
2
+ class Response
3
+ MATCHERS = {
4
+ :docinfo => 'DocInfo',
5
+ :docinfometa => 'DocInfoMeta',
6
+ :defaultlangid => 'DefaultLangId',
7
+ :doccat => 'DocCat',
8
+ :entities => 'type/em/e',
9
+ :relations => 'type/em/r',
10
+ :geographies => 'type/er',
11
+ :instances => 'type/sys/InstanceInfo',
12
+ :relevances => 'type/sys/RelevanceInfo',
13
+ }
14
+
15
+ attr_accessor :submitter_code, :signature, :language, :submission_date, :request_id, :doc_title, :doc_date
16
+ attr_accessor :hashes, :entities, :relations, :geographies, :categories, :socialtags, :relevances
17
+
18
+ def initialize(rdf_string)
19
+ @raw_response = rdf_string
20
+
21
+ @hashes = []
22
+ @entities = []
23
+ @relations = []
24
+ @geographies = []
25
+ @relevances = {} # key = String hash, val = Float relevance
26
+ @categories = []
27
+ @socialtags = []
28
+
29
+ extract_data
30
+ end
31
+
32
+ class Entity
33
+ attr_accessor :calais_hash, :type, :attributes, :relevance, :instances
34
+ end
35
+
36
+ class Relation
37
+ attr_accessor :calais_hash, :type, :attributes, :instances
38
+ end
39
+
40
+ class Geography
41
+ attr_accessor :name, :calais_hash, :attributes, :relevance
42
+ end
43
+
44
+ class Category
45
+ attr_accessor :name, :score
46
+ end
47
+
48
+ class SocialTag
49
+ attr_accessor :name, :importance
50
+ end
51
+
52
+ class Instance
53
+ attr_accessor :prefix, :exact, :suffix, :offset, :length
54
+
55
+ # Makes a new Instance object from an appropriate Nokogiri::XML::Node.
56
+ def self.from_node(node)
57
+ instance = self.new
58
+ instance.prefix = node.xpath("c:prefix[1]").first.content
59
+ instance.exact = node.xpath("c:exact[1]").first.content
60
+ instance.suffix = node.xpath("c:suffix[1]").first.content
61
+ instance.offset = node.xpath("c:offset[1]").first.content.to_i
62
+ instance.length = node.xpath("c:length[1]").first.content.to_i
63
+
64
+ instance
65
+ end
66
+ end
67
+
68
+ class CalaisHash
69
+ attr_accessor :value
70
+
71
+ def self.find_or_create(hash, hashes)
72
+ if !selected = hashes.select {|h| h.value == hash }.first
73
+ selected = self.new
74
+ selected.value = hash
75
+ hashes << selected
76
+ end
77
+
78
+ selected
79
+ end
80
+ end
81
+
82
+ private
83
+ def extract_data
84
+ doc = Nokogiri::XML(@raw_response)
85
+
86
+ if doc.root.xpath("/Error[1]").first
87
+ raise Calais::Error, doc.root.xpath("/Error/Exception").first.content
88
+ end
89
+
90
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfometa]}')]/..").each do |node|
91
+ @language = node['language']
92
+ @submission_date = DateTime.parse node['submissionDate']
93
+
94
+ attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
95
+
96
+ @signature = attributes.delete('signature')
97
+ @submitter_code = attributes.delete('submitterCode')
98
+
99
+ node.remove
100
+ end
101
+
102
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfo]}')]/..").each do |node|
103
+ @request_id = node['calaisRequestID']
104
+
105
+ attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
106
+
107
+ @doc_title = attributes.delete('docTitle')
108
+ @doc_date = Date.parse(attributes.delete('docDate'))
109
+
110
+ node.remove
111
+ end
112
+
113
+ @socialtags = doc.root.xpath("rdf:Description/c:socialtag/..").map do |node|
114
+ tag = SocialTag.new
115
+ tag.name = node.xpath("c:name[1]").first.content
116
+ tag.importance = node.xpath("c:importance[1]").first.content.to_i
117
+
118
+ node.remove if node.xpath("c:categoryName[1]").first.nil?
119
+
120
+ tag
121
+ end
122
+
123
+ @categories = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:doccat]}')]/..").map do |node|
124
+ category = Category.new
125
+ category.name = node.xpath("c:categoryName[1]").first.content
126
+ score = node.xpath("c:score[1]").first
127
+ category.score = score.content.to_f unless score.nil?
128
+
129
+ node.remove
130
+ category
131
+ end
132
+
133
+ @relevances = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relevances]}')]/..").inject({}) do |acc, node|
134
+ subject_hash = node.xpath("c:subject[1]").first[:resource].split('/')[-1]
135
+ acc[subject_hash] = node.xpath("c:relevance[1]").first.content.to_f
136
+
137
+ node.remove
138
+ acc
139
+ end
140
+
141
+ @entities = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:entities]}')]/..").map do |node|
142
+ extracted_hash = node['about'].split('/')[-1] rescue nil
143
+
144
+ entity = Entity.new
145
+ entity.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
146
+ entity.type = extract_type(node)
147
+ entity.attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
148
+
149
+ entity.relevance = @relevances[extracted_hash]
150
+ entity.instances = extract_instances(doc, extracted_hash)
151
+
152
+ node.remove
153
+ entity
154
+ end
155
+
156
+ @relations = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relations]}')]/..").map do |node|
157
+ extracted_hash = node['about'].split('/')[-1] rescue nil
158
+
159
+ relation = Relation.new
160
+ relation.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
161
+ relation.type = extract_type(node)
162
+ relation.attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
163
+ relation.instances = extract_instances(doc, extracted_hash)
164
+
165
+ node.remove
166
+ relation
167
+ end
168
+
169
+ @geographies = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:geographies]}')]/..").map do |node|
170
+ attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
171
+
172
+ geography = Geography.new
173
+ geography.name = attributes.delete('name')
174
+ geography.calais_hash = attributes.delete('subject')
175
+ geography.attributes = attributes
176
+ geography.relevance = extract_relevance(geography.calais_hash.value)
177
+
178
+ node.remove
179
+ geography
180
+ end
181
+
182
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:defaultlangid]}')]/..").each { |node| node.remove }
183
+ doc.root.xpath("./*").each { |node| node.remove }
184
+
185
+ return
186
+ end
187
+
188
+ def extract_instances(doc, hash)
189
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:instances]}')]/..").select do |instance_node|
190
+ instance_node.xpath("c:subject[1]").first[:resource].split("/")[-1] == hash
191
+ end.map do |instance_node|
192
+ instance = Instance.from_node(instance_node)
193
+ instance_node.remove
194
+
195
+ instance
196
+ end
197
+ end
198
+
199
+ def extract_type(node)
200
+ node.xpath("*[name()='rdf:type']")[0]['resource'].split('/')[-1]
201
+ rescue
202
+ nil
203
+ end
204
+
205
+ def extract_attributes(nodes)
206
+ nodes.inject({}) do |hsh, node|
207
+ value = if node['resource']
208
+ extracted_hash = node['resource'].split('/')[-1] rescue nil
209
+ CalaisHash.find_or_create(extracted_hash, @hashes)
210
+ else
211
+ node.content
212
+ end
213
+ hsh.merge(node.name => value)
214
+ end
215
+ end
216
+ def extract_relevance(value)
217
+ return @relevances[value]
218
+ end
219
+ end
220
+ end
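
`CalaisHash.find_or_create` above scans the whole `hashes` array on every call. A minimal sketch of the same find-or-create idea backed by a `Hash` index follows. `HashRegistry` and `HashValue` are hypothetical names used for illustration and are not part of the gem.

```ruby
# Hypothetical stand-in for Calais::Response::CalaisHash.
HashValue = Struct.new(:value)

# Hypothetical registry: returns the same object for a repeated value,
# using a Hash lookup instead of scanning an array.
class HashRegistry
  def initialize
    @index = {}
  end

  # Return the existing entry for `value`, or create and remember a new one.
  def find_or_create(value)
    @index[value] ||= HashValue.new(value)
  end

  # All entries created so far, in insertion order.
  def all
    @index.values
  end
end

registry = HashRegistry.new
first  = registry.find_or_create('abc123')
second = registry.find_or_create('abc123')
puts first.equal?(second)
puts registry.all.length
```

The trade-off is a small amount of extra memory for the index in exchange for constant-time lookups, which may matter for responses with many entities and relations.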