jm-calais 0.0.13

@@ -0,0 +1,63 @@
1
+ # Changes
2
+
3
+ ## 0.0.13
4
+
5
+ * load path fix
6
+
7
+ ## 0.0.12
8
+
9
+ * added relevances to Geographies
10
+ * improved doc
11
+ * removed jeweler dependency and simplified Rakefile
12
+ * bumped rspec requirement
13
+
14
+ ## 0.0.11
15
+
16
+ * simple fix for some rubies not liking DateTime.parse without including date
17
+ * tests for SocialTags
18
+ * typo fix: SocailTag != SocialTag
19
+
20
+ ## 0.0.10
21
+
22
+ * community patch to expose SocialTags
23
+
24
+ ## 0.0.9
25
+
26
+ * updates related to API changes
27
+ * community patches to support bundler, support ruby 1.9
28
+
29
+ ## 0.0.8
30
+
31
+ * community patches to use nokogiri
32
+
33
+ ## 0.0.7
34
+ * verified 4.0 API
35
+ * moved gem packaging to `jeweler` and documentation to `yard`
36
+
37
+ ## 0.0.6
38
+ * fully implemented 3.1 API
39
+
40
+ ## 0.0.5
41
+ * fixed error where classes weren't being required in the proper order on Ubuntu (reported by Jon Moses)
42
+ * New things coming back from the API. Fixing in tests.
43
+
44
+ ## 0.0.4
45
+ * changed dependency from `hpricot` to `libxml`
46
+ * unicode fun
47
+ * cleanup all around
48
+
49
+ ## 0.0.3
50
+ * pluginized the library for Rails (thanks [pius](http://gitorious.org/projects/calais-au-rails))
51
+ * added helper methods to name entity types from a response
52
+
53
+ ## 0.0.2
54
+ * cleanup in the specs
55
+ * cleaner parsing
56
+ * location of named entities
57
+ * more data in relationships
58
+ * moved Names and Relationships
59
+
60
+ ## 0.0.1
61
+ * Access to OpenCalais's Enlighten action
62
+ * Single method to process a document
63
+ * Get relationships and names from a document
data/Gemfile ADDED
@@ -0,0 +1,3 @@
1
+ source 'https://rubygems.org'
2
+
3
+ gemspec
data/MIT-LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2008 Abhay Kumar info@opensynapse.net
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ 'Software'), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
17
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
18
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
19
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
20
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.markdown ADDED
@@ -0,0 +1,55 @@
1
+ # Calais #
2
+ A Ruby interface to the [Open Calais Web Service](http://opencalais.com)
3
+
4
+ ## About this Fork ##
5
+ Forked from https://github.com/abhay/calais version ~> 0.0.13
6
+ to fix issues caused by the deprecation of iconv in Ruby > 1.9.3
7
+
8
+ ## Features ##
9
+ * Accepts documents in text/plain, text/xml and text/html format.
10
+ * Basic access to the Open Calais API's Enlighten action.
11
+ * Output is an RDF representation of the input document.
12
+ * A single function call extracts names, entities and geographies from the given text.
13
+
14
+ ## Synopsis ##
15
+
16
+ This is a very basic wrapper around the Open Calais API. It uses the POST endpoint and currently supports the Enlighten action. Here's a simple call:
17
+
18
+ Calais.enlighten(
19
+ :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
20
+ :content_type => :raw,
21
+ :license_id => 'your license id'
22
+ )
23
+
24
+ This is the easiest way to get the RDF-formatted response from the OpenCalais service.
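
To make the POST call above more concrete, here is a rough sketch of the kind of form-encoded request the wrapper assembles. The field names (`licenseID`, `content`, `paramsXML`) and the `c:params` namespace come from the client source; the endpoint URL is a placeholder assumed for illustration, and nothing is sent over the network.

```ruby
require 'net/http'
require 'uri'

# Placeholder endpoint, assumed for illustration only; the gem keeps the
# real URL in a REST_ENDPOINT constant that is not shown here.
endpoint = URI.parse('http://api.example.invalid/enlighten')

# Minimal paramsXML document; the real client builds a fuller one with Nokogiri.
params_xml = '<c:params xmlns:c="http://s.opencalais.com/1/pred/"/>'

request = Net::HTTP::Post.new(endpoint.path)
request.set_form_data(
  'licenseID' => 'your license id',
  'content'   => 'Some text to analyse.',
  'paramsXML' => params_xml
)

# Inspect the request that would be posted, without sending it.
puts request['Content-Type']
puts request.body
```

Printing the body shows that the three fields travel as ordinary URL-encoded form data, which is why the content has to be valid UTF-8 before it is posted.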
25
+
26
+ To get structured information about a document, such as its entities, relations and geographies, try this:
27
+
28
+ Calais.process_document(
29
+ :content => "The government of the United Kingdom has given corporations like fast food chain McDonald's the right to award high school qualifications to employees who complete a company training program.",
30
+ :content_type => :raw,
31
+ :license_id => 'your license id'
32
+ )
33
+
34
+ This will return an object containing information extracted from the RDF response.
35
+
36
+ ## Requirements ##
37
+
38
+ * [Ruby 1.8.5 or better](http://ruby-lang.org)
39
+ * [nokogiri](http://nokogiri.rubyforge.org/nokogiri/), [libxml2](http://xmlsoft.org/), [libxslt](http://xmlsoft.org/xslt/)
40
+ * [curb](http://curb.rubyforge.org/), [libcurl](http://curl.haxx.se/)
41
+ * [json](http://json.rubyforge.org/)
42
+
43
+ ## Install ##
44
+
45
+ You can install the Calais gem via RubyGems (`gem install calais`) or by building from source.
46
+
47
+ ## Authors ##
48
+
49
+ * [Abhay Kumar](http://opensynapse.net)
50
+
51
+ ## Acknowledgements ##
52
+
53
+ * [Paul Legato](http://www.economaton.com/): Help all around with the new response processor and implementation of the 3.1 API.
54
+ * [Ryan Ong](http://www.ryanong.net/)
55
+ * [Juan Antonio Chavez](https://github.com/TheNaoX): Geographies relevance
data/Rakefile ADDED
@@ -0,0 +1,36 @@
1
+ # -*- ruby -*-
2
+
3
+ require 'rake'
4
+ require 'rake/clean'
5
+
6
+ require './lib/calais.rb'
7
+
8
+ begin
9
+ require 'rspec/core/rake_task'
10
+
11
+ RSpec::Core::RakeTask.new(:spec)
12
+
13
+ task :default => :spec
14
+ rescue LoadError
15
+ puts "RSpec, or one of its dependencies, is not available. Please install it."
16
+ exit(1)
17
+ end
18
+
19
+ begin
20
+ require 'yard'
21
+ require 'yard/rake/yardoc_task'
22
+
23
+ YARD::Rake::YardocTask.new do |t|
24
+ t.options = ["--verbose", "--markup=markdown", "--files=CHANGELOG.markdown,MIT-LICENSE"]
25
+ end
26
+
27
+ task :rdoc => :yardoc
28
+
29
+ CLOBBER.include 'doc'
30
+ CLOBBER.include '.yardoc'
31
+ rescue LoadError
32
+ puts "Yard, or one of its dependencies, is not available. Please install it."
33
+ exit(1)
34
+ end
35
+
36
+ # vim: syntax=Ruby
@@ -0,0 +1,115 @@
1
+ module Calais
2
+ class Client
3
+ # base attributes of the call
4
+ attr_accessor :content
5
+ attr_accessor :license_id
6
+
7
+ # processing directives
8
+ attr_accessor :content_type, :output_format, :reltag_base_url, :calculate_relevance, :omit_outputting_original_text
9
+ attr_accessor :store_rdf, :metadata_enables, :metadata_discards
10
+
11
+ # user directives
12
+ attr_accessor :allow_distribution, :allow_search, :external_id, :submitter
13
+
14
+ attr_accessor :external_metadata
15
+
16
+ attr_accessor :use_beta
17
+
18
+ def initialize(options={}, &block)
19
+ options.each {|k,v| send("#{k}=", v)}
20
+ yield(self) if block_given?
21
+ end
22
+
23
+ def enlighten
24
+ post_args = {
25
+ "licenseID" => @license_id,
26
+ "content" => RUBY_VERSION.to_f < 1.9 ?
27
+ Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "#{@content} ").first[0..-2] :
28
+ "#{@content} ".encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')[0 .. -2],
29
+ "paramsXML" => params_xml
30
+ }
31
+
32
+ do_request(post_args)
33
+ end
34
+
35
+ def params_xml
36
+ check_params
37
+ document = Nokogiri::XML::Document.new
38
+
39
+ params_node = Nokogiri::XML::Node.new('c:params', document)
40
+ params_node['xmlns:c'] = 'http://s.opencalais.com/1/pred/'
41
+ params_node['xmlns:rdf'] = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
42
+
43
+ processing_node = Nokogiri::XML::Node.new('c:processingDirectives', document)
44
+ processing_node['c:contentType'] = AVAILABLE_CONTENT_TYPES[@content_type] if @content_type
45
+ processing_node['c:outputFormat'] = AVAILABLE_OUTPUT_FORMATS[@output_format] if @output_format
46
+ processing_node['c:calculateRelevanceScore'] = 'false' if @calculate_relevance == false
47
+ processing_node['c:reltagBaseURL'] = @reltag_base_url.to_s if @reltag_base_url
48
+
49
+ processing_node['c:enableMetadataType'] = @metadata_enables.join(',') unless @metadata_enables.empty?
50
+ processing_node['c:docRDFaccessible'] = @store_rdf if @store_rdf
51
+ processing_node['c:discardMetadata'] = @metadata_discards.join(';') unless @metadata_discards.empty?
52
+ processing_node['c:omitOutputtingOriginalText'] = 'true' if @omit_outputting_original_text
53
+
54
+ user_node = Nokogiri::XML::Node.new('c:userDirectives', document)
55
+ user_node['c:allowDistribution'] = @allow_distribution.to_s unless @allow_distribution.nil?
56
+ user_node['c:allowSearch'] = @allow_search.to_s unless @allow_search.nil?
57
+ user_node['c:externalID'] = @external_id.to_s if @external_id
58
+ user_node['c:submitter'] = @submitter.to_s if @submitter
59
+
60
+ params_node << processing_node
61
+ params_node << user_node
62
+
63
+ if @external_metadata
64
+ external_node = Nokogiri::XML::Node.new('c:externalMetadata', document)
65
+ external_node << @external_metadata
66
+ params_node << external_node
67
+ end
68
+
69
+ params_node.to_xml(:indent => 2)
70
+ end
71
+
72
+ def url
73
+ @url ||= URI.parse(calais_endpoint)
74
+ end
75
+
76
+ private
77
+ def check_params
78
+ raise 'missing content' if @content.nil? || @content.empty?
79
+
80
+ content_length = @content.length
81
+ raise 'content is too small' if content_length < MIN_CONTENT_SIZE
82
+ raise 'content is too large' if content_length > MAX_CONTENT_SIZE
83
+
84
+ raise 'missing license id' if @license_id.nil? || @license_id.empty?
85
+
86
+ raise 'unknown content type' if @content_type && !AVAILABLE_CONTENT_TYPES.key?(@content_type)
87
+ raise 'unknown output format' if @output_format && !AVAILABLE_OUTPUT_FORMATS.key?(@output_format)
88
+
89
+ %w[calculate_relevance store_rdf allow_distribution allow_search].each do |variable|
90
+ value = self.send(variable)
91
+ unless NilClass === value || TrueClass === value || FalseClass === value
92
+ raise "expected a boolean value for #{variable} but got #{value}"
93
+ end
94
+ end
95
+
96
+ @metadata_enables ||= []
97
+ unknown_enables = Set.new(@metadata_enables) - KNOWN_ENABLES
98
+ raise "unknown metadata enables: #{unknown_enables.to_a.inspect}" unless unknown_enables.empty?
99
+
100
+ @metadata_discards ||= []
101
+ unknown_discards = Set.new(@metadata_discards) - KNOWN_DISCARDS
102
+ raise "unknown metadata discards: #{unknown_discards.to_a.inspect}" unless unknown_discards.empty?
103
+ end
104
+
105
+ def do_request(post_fields)
106
+ @request ||= Net::HTTP::Post.new(url.path)
107
+ @request.set_form_data(post_fields)
108
+ Net::HTTP.new(url.host, url.port).start {|http| http.request(@request)}.body
109
+ end
110
+
111
+ def calais_endpoint
112
+ @use_beta ? BETA_REST_ENDPOINT : REST_ENDPOINT
113
+ end
114
+ end
115
+ end
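
The `enlighten` method above strips bytes that are not valid UTF-8 before posting, using `Iconv` on Ruby 1.8 and `String#encode` otherwise. As a hedged alternative for Ruby 2.1 and later, the same clean-up can be sketched with `String#scrub`. The helper name `sanitize_utf8` is hypothetical and not part of the gem.

```ruby
# Hypothetical helper, not part of the gem: drops byte sequences that are
# not valid UTF-8. Requires Ruby 2.1+ for String#scrub.
def sanitize_utf8(text)
  # dup so the caller's string is left untouched by force_encoding
  text.to_s.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# "\xC3\xA9" is a valid two-byte UTF-8 sequence; the lone "\xFF" is not.
dirty = "caf\xC3\xA9 \xFF ok"
clean = sanitize_utf8(dirty)

puts dirty.valid_encoding?
puts clean.valid_encoding?
```

Unlike the trailing-space trick in `enlighten`, `scrub` needs no padding character, but it is unavailable on the older Rubies this gem originally targeted.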
@@ -0,0 +1,3 @@
1
+ class Calais::Error < StandardError
2
+
3
+ end
@@ -0,0 +1,220 @@
1
+ module Calais
2
+ class Response
3
+ MATCHERS = {
4
+ :docinfo => 'DocInfo',
5
+ :docinfometa => 'DocInfoMeta',
6
+ :defaultlangid => 'DefaultLangId',
7
+ :doccat => 'DocCat',
8
+ :entities => 'type/em/e',
9
+ :relations => 'type/em/r',
10
+ :geographies => 'type/er',
11
+ :instances => 'type/sys/InstanceInfo',
12
+ :relevances => 'type/sys/RelevanceInfo',
13
+ }
14
+
15
+ attr_accessor :submitter_code, :signature, :language, :submission_date, :request_id, :doc_title, :doc_date
16
+ attr_accessor :hashes, :entities, :relations, :geographies, :categories, :socialtags, :relevances
17
+
18
+ def initialize(rdf_string)
19
+ @raw_response = rdf_string
20
+
21
+ @hashes = []
22
+ @entities = []
23
+ @relations = []
24
+ @geographies = []
25
+ @relevances = {} # key = String hash, val = Float relevance
26
+ @categories = []
27
+ @socialtags = []
28
+
29
+ extract_data
30
+ end
31
+
32
+ class Entity
33
+ attr_accessor :calais_hash, :type, :attributes, :relevance, :instances
34
+ end
35
+
36
+ class Relation
37
+ attr_accessor :calais_hash, :type, :attributes, :instances
38
+ end
39
+
40
+ class Geography
41
+ attr_accessor :name, :calais_hash, :attributes, :relevance
42
+ end
43
+
44
+ class Category
45
+ attr_accessor :name, :score
46
+ end
47
+
48
+ class SocialTag
49
+ attr_accessor :name, :importance
50
+ end
51
+
52
+ class Instance
53
+ attr_accessor :prefix, :exact, :suffix, :offset, :length
54
+
55
+ # Makes a new Instance object from an appropriate Nokogiri::XML::Node.
56
+ def self.from_node(node)
57
+ instance = self.new
58
+ instance.prefix = node.xpath("c:prefix[1]").first.content
59
+ instance.exact = node.xpath("c:exact[1]").first.content
60
+ instance.suffix = node.xpath("c:suffix[1]").first.content
61
+ instance.offset = node.xpath("c:offset[1]").first.content.to_i
62
+ instance.length = node.xpath("c:length[1]").first.content.to_i
63
+
64
+ instance
65
+ end
66
+ end
67
+
68
+ class CalaisHash
69
+ attr_accessor :value
70
+
71
+ def self.find_or_create(hash, hashes)
72
+ if !selected = hashes.select {|h| h.value == hash }.first
73
+ selected = self.new
74
+ selected.value = hash
75
+ hashes << selected
76
+ end
77
+
78
+ selected
79
+ end
80
+ end
81
+
82
+ private
83
+ def extract_data
84
+ doc = Nokogiri::XML(@raw_response)
85
+
86
+ if doc.root.xpath("/Error[1]").first
87
+ raise Calais::Error, doc.root.xpath("/Error/Exception").first.content
88
+ end
89
+
90
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfometa]}')]/..").each do |node|
91
+ @language = node['language']
92
+ @submission_date = DateTime.parse node['submissionDate']
93
+
94
+ attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
95
+
96
+ @signature = attributes.delete('signature')
97
+ @submitter_code = attributes.delete('submitterCode')
98
+
99
+ node.remove
100
+ end
101
+
102
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:docinfo]}')]/..").each do |node|
103
+ @request_id = node['calaisRequestID']
104
+
105
+ attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
106
+
107
+ @doc_title = attributes.delete('docTitle')
108
+ @doc_date = Date.parse(attributes.delete('docDate'))
109
+
110
+ node.remove
111
+ end
112
+
113
+ @socialtags = doc.root.xpath("rdf:Description/c:socialtag/..").map do |node|
114
+ tag = SocialTag.new
115
+ tag.name = node.xpath("c:name[1]").first.content
116
+ tag.importance = node.xpath("c:importance[1]").first.content.to_i
117
+
118
+ node.remove if node.xpath("c:categoryName[1]").first.nil?
119
+
120
+ tag
121
+ end
122
+
123
+ @categories = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:doccat]}')]/..").map do |node|
124
+ category = Category.new
125
+ category.name = node.xpath("c:categoryName[1]").first.content
126
+ score = node.xpath("c:score[1]").first
127
+ category.score = score.content.to_f unless score.nil?
128
+
129
+ node.remove
130
+ category
131
+ end
132
+
133
+ @relevances = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relevances]}')]/..").inject({}) do |acc, node|
134
+ subject_hash = node.xpath("c:subject[1]").first[:resource].split('/')[-1]
135
+ acc[subject_hash] = node.xpath("c:relevance[1]").first.content.to_f
136
+
137
+ node.remove
138
+ acc
139
+ end
140
+
141
+ @entities = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:entities]}')]/..").map do |node|
142
+ extracted_hash = node['about'].split('/')[-1] rescue nil
143
+
144
+ entity = Entity.new
145
+ entity.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
146
+ entity.type = extract_type(node)
147
+ entity.attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
148
+
149
+ entity.relevance = @relevances[extracted_hash]
150
+ entity.instances = extract_instances(doc, extracted_hash)
151
+
152
+ node.remove
153
+ entity
154
+ end
155
+
156
+ @relations = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:relations]}')]/..").map do |node|
157
+ extracted_hash = node['about'].split('/')[-1] rescue nil
158
+
159
+ relation = Relation.new
160
+ relation.calais_hash = CalaisHash.find_or_create(extracted_hash, @hashes)
161
+ relation.type = extract_type(node)
162
+ relation.attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
163
+ relation.instances = extract_instances(doc, extracted_hash)
164
+
165
+ node.remove
166
+ relation
167
+ end
168
+
169
+ @geographies = doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:geographies]}')]/..").map do |node|
170
+ attributes = extract_attributes(node.xpath("*[contains(name(), 'c:')]"))
171
+
172
+ geography = Geography.new
173
+ geography.name = attributes.delete('name')
174
+ geography.calais_hash = attributes.delete('subject')
175
+ geography.attributes = attributes
176
+ geography.relevance = extract_relevance(geography.calais_hash.value)
177
+
178
+ node.remove
179
+ geography
180
+ end
181
+
182
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:defaultlangid]}')]/..").each { |node| node.remove }
183
+ doc.root.xpath("./*").each { |node| node.remove }
184
+
185
+ return
186
+ end
187
+
188
+ def extract_instances(doc, hash)
189
+ doc.root.xpath("rdf:Description/rdf:type[contains(@rdf:resource, '#{MATCHERS[:instances]}')]/..").select do |instance_node|
190
+ instance_node.xpath("c:subject[1]").first[:resource].split("/")[-1] == hash
191
+ end.map do |instance_node|
192
+ instance = Instance.from_node(instance_node)
193
+ instance_node.remove
194
+
195
+ instance
196
+ end
197
+ end
198
+
199
+ def extract_type(node)
200
+ node.xpath("*[name()='rdf:type']")[0]['resource'].split('/')[-1]
201
+ rescue
202
+ nil
203
+ end
204
+
205
+ def extract_attributes(nodes)
206
+ nodes.inject({}) do |hsh, node|
207
+ value = if node['resource']
208
+ extracted_hash = node['resource'].split('/')[-1] rescue nil
209
+ CalaisHash.find_or_create(extracted_hash, @hashes)
210
+ else
211
+ node.content
212
+ end
213
+ hsh.merge(node.name => value)
214
+ end
215
+ end
216
+ def extract_relevance(value)
217
+ return @relevances[value]
218
+ end
219
+ end
220
+ end
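
`CalaisHash.find_or_create` above keeps one shared object per hash value, so entities, relations and attributes that point at the same resource end up referencing the same `CalaisHash`. A minimal standalone sketch of that lookup follows, written with `Enumerable#find` rather than the assignment-in-condition form. The class here is a reduced stand-in and the resource URI is made up for illustration.

```ruby
# Reduced stand-in for Calais::Response::CalaisHash, for illustration only.
class CalaisHash
  attr_accessor :value

  # Returns the registered object whose value equals `hash`,
  # or creates one, appends it to `hashes`, and returns it.
  def self.find_or_create(hash, hashes)
    existing = hashes.find { |h| h.value == hash }
    return existing if existing

    created = new
    created.value = hash
    hashes << created
    created
  end
end

hashes = []

# Made-up resource URI; as in the response parser, only the last path
# segment is used as the key.
about = 'http://example.invalid/entity/abc123'
key = about.split('/')[-1]

first  = CalaisHash.find_or_create(key, hashes)
second = CalaisHash.find_or_create(key, hashes)

puts first.equal?(second)
puts hashes.size
```

The lookup is a linear scan, which is fine for the handful of hashes in a typical response; a `Hash` keyed by value would be the obvious swap if that ever became a bottleneck.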