semantic_extraction 0.1.1 → 0.2.0

data/README.rdoc CHANGED
@@ -4,22 +4,26 @@ Extract meaningful information from unstructured text with Ruby.
 
 Using a variety of APIs (Yahoo Term Extractor and Alchemy are currently supported), semantic_extraction can automatically return a collection of keywords for an arbitrary block of text. If you use Alchemy, it can also return named entities.
 
+ A brief walkthrough:
+
+ $ require 'rubygems'
+ $ require 'semantic_extraction'
+ $ SemanticExtraction.alchemy_api_key = 'YOUR_API_KEY_HERE'
+ $ SemanticExtraction.find_keywords("http://chrisvannoy.com/2010/03/10/introducing_semantic_extraction/")
+ $ ["Knight News Challenge", "Yahoo Term Extractor", "obscure gem", "soon-to-be twitter employee", "handle serving gems", "API providers", "Alchemy API", "unstructured text", "earliest steps", "Rails 3-compatible version", "structured data", "early stage", "death threats", "github", "final aside", "awesome piece", "default choice", "HTTP communication", "Indianapolis Star", "Feel free"]
+
 == The APIs in use
 
- * [Yahoo Term Extractor](http://developer.yahoo.com/search/content/V1/termExtraction.html)
+ * {Yahoo Term Extractor}[http://developer.yahoo.com/search/content/V1/termExtraction.html]
 
- * [Alchemy API](http://www.alchemyapi.com/api/)
+ * {Alchemy API}[http://www.alchemyapi.com/api/]
 
 == Upcoming To-Dos
 
- * Add support for [OpenCalais](http://www.opencalais.com/documentation/opencalais-documentation)
+ * Add support for {OpenCalais}[http://www.opencalais.com/documentation/opencalais-documentation]
 
 * Flesh out the rest of the Alchemy API
 
- * Make it possible to dynamically pick with API to use (so its possible to use multiple APIs in the same app)
-
- * Make it less fugly.
-
 * Tests, tests and more tests.
 
 == Note on Patches/Pull Requests
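
The walkthrough above stops at find_keywords. Going by the 0.2.0 code further down in this diff, the rest of the public surface looks roughly like the sketch below; the URL and API key are placeholders, not working values.

  require 'rubygems'
  require 'semantic_extraction'

  SemanticExtraction.alchemy_api_key = 'YOUR_API_KEY_HERE'

  # Named entities come back as an array of OpenStructs (Alchemy only):
  SemanticExtraction.find_entities("http://example.com/an-article")

  # New in 0.2.0: strip a page or raw HTML down to its article text (Alchemy only):
  SemanticExtraction.extract_text("http://example.com/an-article")

  # Extractor names are now validated; anything outside valid_extractors raises:
  SemanticExtraction.preferred_extractor = "yahoo"        # supported
  SemanticExtraction.preferred_extractor = "opencalais"   # raises SemanticExtraction::NotSupportedExtractor
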
data/Rakefile CHANGED
@@ -10,7 +10,8 @@ begin
 gem.email = "chris@chrisvannoy.com"
 gem.homepage = "http://github.com/dummied/semantic_extraction"
 gem.authors = ["Chris Vannoy"]
- gem.add_development_dependency "thoughtbot-shoulda", ">= 0"
+ gem.add_development_dependency "shoulda", ">= 0"
+ gem.add_development_dependency "fakeweb"
 gem.add_dependency "ruby_tubesday"
 gem.add_dependency "nokogiri"
 gem.add_dependency "extlib"
data/VERSION CHANGED
@@ -1 +1 @@
- 0.1.1
+ 0.2.0
data/lib/semantic_extraction.rb CHANGED
@@ -2,10 +2,35 @@ require 'ruby_tubesday'
 require 'nokogiri'
 require 'extlib'
 require 'ostruct'
+ require 'active_support/core_ext/module/attribute_accessors'
 
 module SemanticExtraction
 
+
+ mattr_accessor :preferred_extractor
+ mattr_accessor :alchemy_api_key
+ mattr_accessor :yahoo_api_key
+ mattr_accessor :valid_extractors
+ mattr_accessor :requires_api_key
+
+ self.valid_extractors = ["yahoo", "alchemy"]
+ self.requires_api_key = ["yahoo", "alchemy"]
+
+ # By default, we assume you want to use Alchemy.
+ # To override, just set SemanticExtraction.preferred_extractor somewhere and define the appropriate api_key.
+ def self.preferred_extractor=(value)
+ if self.valid_extractors.include?(value)
+ @@preferred_extractor = value
+ else
+ raise NotSupportedExtractor
+ end
+ end
+
+ self.preferred_extractor = "alchemy" if self.preferred_extractor.blank?
+
+
 # Screw it. Hard-code time!
+ require 'semantic_extraction/utility_methods'
 require 'semantic_extraction/extractors/yahoo'
 require 'semantic_extraction/extractors/alchemy'
 
@@ -16,53 +41,30 @@ module SemanticExtraction
 # This will become more important when we start mapping out all of the other features in the Alchemy API
 class NotSupportedExtraction < StandardError; end
 
- # By default, we assume you want to use Alchemy.
- # To override, just set SemanticExtraction::PREFERRED_EXTRACTOR somewhere.
- def self.preferred_extractor
- defined?(PREFERRED_EXTRACTOR) ? PREFERRED_EXTRACTOR : "alchemy"
- end
+ # Thrown when you attempt to set the preferred extractor to an extractor we don't yet support.
+ class NotSupportedExtractor < StandardError; end
 
- HTTP = RubyTubesday.new
-
- # Will return an array of keywords gleaned from the text you pass in.
- # Both Yahoo and Alchemy will handle a block of text, but Alchemy can also handle a plain URL.
- def self.find_keywords(text)
- klass = SemanticExtraction.const_get(self.preferred_extractor.capitalize)
- if klass.respond_to?(:find_keywords) && defined?(self.preferred_extractor.upcase + "_API_KEY")
- return klass.find_keywords(text)
- elsif !klass.respond_to?(:find_keywords)
+ def self.find_generic(typer, args)
+ if self.is_valid?(typer)
+ return @@klass.send(typer.to_sym, args)
+ elsif !@@klass.respond_to?(typer.to_sym)
 raise NotSupportedExtraction
 else
 raise MissingApiKey
 end
 end
 
- # Will return an array of OpenStruct representing the named entities from the text.
- # At the moment, Alchemy is the only one to support this.
- # Down the road, we'll add in OpenCalais and others.
- def self.find_entities(text)
- klass = SemanticExtraction.const_get(self.preferred_extractor.capitalize)
- if klass.respond_to?(:find_entities) && defined?(self.preferred_extractor.upcase + "_API_Key")
- return klass.find_entities(text)
- elsif !klass.respond_to?(:find_entities)
- raise NotSupportedExtraction
- else
- raise MissingApiKey
- end
+ def self.method_missing(method, args)
+ find_generic(method.to_sym, args)
 end
 
- # Posts the url to the API.
- def self.post(url, target, calling_param, api_key, api_param="apikey".to_sym)
- HTTP.post(url, :params => {calling_param => target, api_param => api_key} )
- end
 
- # Checks to see if a string is a URL.
- # This is really dumb at the moment, and will likely be refactored in future releases.
- def self.is_url?(link)
- if link[0..3] == "http"
- return true
+ def self.is_valid?(method)
+ @@klass = SemanticExtraction.const_get(self.preferred_extractor.gsub(/\/(.?)/) { "::#{$1.upcase}" }.gsub(/(?:^|_)(.)/) { $1.upcase })
+ if self.requires_api_key.include? self.preferred_extractor
+ (@@klass.respond_to?(method) && defined?(self.send((preferred_extractor + "_api_key").to_sym)) && !(self.send((preferred_extractor + "_api_key").to_sym)).empty?) ? true : false
 else
- return false
+ @@klass.respond_to?(method)
 end
 end
 
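
A sketch of how a call now travels through the rewritten module, based on the hunk above (the API key and input text are placeholders):

  SemanticExtraction.alchemy_api_key = 'YOUR_API_KEY_HERE'
  SemanticExtraction.find_keywords("a block of unstructured text")
  # 1. find_keywords is no longer defined on the module, so method_missing fires and
  #    hands find_generic the method name plus the single argument.
  # 2. is_valid? camelizes preferred_extractor ("alchemy" -> Alchemy, "our_extractor" -> OurExtractor),
  #    stashes the constant in @@klass and, for extractors listed in requires_api_key,
  #    also checks that the matching *_api_key accessor is set and non-empty.
  # 3. The call is then forwarded with @@klass.send(:find_keywords, text).
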
data/lib/semantic_extraction/extractors/alchemy.rb CHANGED
@@ -1,41 +1,58 @@
 module SemanticExtraction
- class Alchemy
+ module Alchemy
+ include SemanticExtraction::UtilityMethods
 STARTER = "http://access.alchemyapi.com/calls/"
 
 def self.find_keywords(text)
- prefix = (SemanticExtraction.is_url?(text) ? "url" : "text")
+ prefix = is_url?(text) ? "url" : "text"
 endpoint = (prefix == "url" ? "URL" : "Text") + "GetKeywords"
 url = STARTER + prefix + "/" + endpoint
- raw = SemanticExtraction.post(url, text, prefix, SemanticExtraction::ALCHEMY_API_KEY)
+ raw = post(url, text, prefix)
+ output_keywords(raw)
+ end
+
+ def self.output_keywords(raw)
 h = Nokogiri::XML(raw)
- if (h/"keywords keyword")
- keywords = []
- (h/"keywords keyword").each do |p|
- keywords << p.text
- end
+ keywords = []
+ (h/"keywords keyword").each do |p|
+ keywords << p.text
 end
 return keywords
 end
 
 def self.find_entities(text)
- prefix = (SemanticExtraction.is_url?(text) ? "url" : "text")
+ prefix = is_url?(text) ? "url" : "text"
 endpoint = (prefix == "url" ? "URL" : "Text") + "GetRankedNamedEntities"
 url = STARTER + prefix + "/" + endpoint
- raw = SemanticExtraction.post(url, text, prefix, SemanticExtraction::ALCHEMY_API_KEY)
+ raw = post(url, text, prefix)
+ output_entities(raw)
+ end
+
+ def self.output_entities(raw)
 h = Nokogiri::XML(raw)
- if (h/"entities entity")
- entities = []
- (h/"entities entity").each do |p|
- hashie = Hash.from_xml(p.to_s)["entity"]
- typer = hashie.delete("type")
- if typer
- hashie["entity_type"] = typer
- end
- entities << OpenStruct.new(hashie)
+ entities = []
+ (h/"entities entity").each do |p|
+ hashie = Hash.from_xml(p.to_s)["entity"]
+ typer = hashie.delete("type")
+ if typer
+ hashie["entity_type"] = typer
 end
+ entities << OpenStruct.new(hashie)
 end
 return entities
 end
-
+
+ def self.extract_text(text)
+ prefix = is_url?(text) ? "url" : "html"
+ endpoint = (prefix == "url" ? "URL" : "HTML") + "GetText"
+ url = STARTER + prefix + "/" + endpoint
+ raw = post(url, text, prefix)
+ output_text(raw)
+ end
+
+ def self.output_text(raw)
+ h = Nokogiri::XML(raw)
+ return (h/"text").first.inner_html
+ end
 end
 end
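
For a URL input, find_keywords above ends up posting to http://access.alchemyapi.com/calls/url/URLGetKeywords with params { "url" => the_url, :apikey => SemanticExtraction.alchemy_api_key }. The parsing half can be exercised on its own; the XML below is a stand-in for an AlchemyAPI response, not a captured one.

  raw = "<results><keywords><keyword>semantic web</keyword><keyword>Ruby</keyword></keywords></results>"
  SemanticExtraction::Alchemy.output_keywords(raw)
  # => ["semantic web", "Ruby"]
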
data/lib/semantic_extraction/extractors/yahoo.rb CHANGED
@@ -1,16 +1,19 @@
 module SemanticExtraction
- class Yahoo
+ module Yahoo
+ include SemanticExtraction::UtilityMethods
 STARTER = "http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction"
-
+
 def self.find_keywords(text)
 prefix = 'context'
- raw = SemanticExtraction.post(STARTER, text, prefix, SemanticExtraction::YAHOO_API_KEY, :appid)
+ raw = SemanticExtraction.post(STARTER, text, prefix, :appid)
+ self.output_keywords(raw)
+ end
+
+ def self.output_keywords(raw)
 h = Nokogiri::XML(raw)
- if (h/"Result")
- keywords = []
- (h/"Result").each do |p|
- keywords << p.text
- end
+ keywords = []
+ (h/"Result").each do |p|
+ keywords << p.text
 end
 return keywords
 end
data/lib/semantic_extraction/utility_methods.rb ADDED
@@ -0,0 +1,26 @@
+ module SemanticExtraction
+ module UtilityMethods
+
+ def self.included(includer)
+ includer.module_eval do
+
+ # Posts the url to the API.
+ def self.post(url, target, calling_param, api_param="apikey".to_sym)
+ RubyTubesday.new.post(url, :params => {calling_param => target, api_param => (SemanticExtraction.send((SemanticExtraction.preferred_extractor + "_api_key").to_sym))} )
+ end
+
+ # Checks to see if a string is a URL.
+ # This is really dumb at the moment, and will likely be refactored in future releases.
+ def self.is_url?(link)
+ if link[0..3] == "http"
+ return true
+ else
+ return false
+ end
+ end
+ end
+ end
+
+
+ end
+ end
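
The included hook above goes through module_eval, so post and is_url? land as class methods on whatever includes UtilityMethods. A minimal sketch (the module name is made up for illustration); registering the extractor with valid_extractors, and with requires_api_key if it needs a key, is what the two test extractors further down demonstrate.

  module SemanticExtraction
    module MyExtractor
      include SemanticExtraction::UtilityMethods
    end
  end

  # The hook has defined the helpers directly on the includer:
  SemanticExtraction::MyExtractor.is_url?("http://www.indystar.com")  # => true
  SemanticExtraction::MyExtractor.is_url?("just a sentence of text")  # => false
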
data/semantic_extraction.gemspec CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
 s.name = %q{semantic_extraction}
- s.version = "0.1.1"
+ s.version = "0.2.0"
 
 s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
 s.authors = ["Chris Vannoy"]
- s.date = %q{2010-03-10}
+ s.date = %q{2010-07-06}
 s.description = %q{Using a variety of APIs (Yahoo term Extractor and Alchemy are currently supported), semantic_extraction can automatically return a collection of keywords for an arbitrary block of text. If using Alchemy, it can also return named entities.}
 s.email = %q{chris@chrisvannoy.com}
 s.extra_rdoc_files = [
@@ -26,17 +26,22 @@ Gem::Specification.new do |s|
 "lib/semantic_extraction.rb",
 "lib/semantic_extraction/extractors/alchemy.rb",
 "lib/semantic_extraction/extractors/yahoo.rb",
+ "lib/semantic_extraction/utility_methods.rb",
 "semantic_extraction.gemspec",
+ "test/api_extractor.rb",
 "test/helper.rb",
+ "test/our_extractor.rb",
 "test/test_semantic_extraction.rb"
 ]
 s.homepage = %q{http://github.com/dummied/semantic_extraction}
 s.rdoc_options = ["--charset=UTF-8"]
 s.require_paths = ["lib"]
- s.rubygems_version = %q{1.3.6}
+ s.rubygems_version = %q{1.3.7}
 s.summary = %q{Extract meaningful information from unstructured text with Ruby}
 s.test_files = [
- "test/helper.rb",
+ "test/api_extractor.rb",
+ "test/helper.rb",
+ "test/our_extractor.rb",
 "test/test_semantic_extraction.rb"
 ]
 
@@ -44,19 +49,22 @@ Gem::Specification.new do |s|
 current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
 s.specification_version = 3
 
- if Gem::Version.new(Gem::RubyGemsVersion) >= Gem::Version.new('1.2.0') then
- s.add_development_dependency(%q<thoughtbot-shoulda>, [">= 0"])
+ if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
+ s.add_development_dependency(%q<shoulda>, [">= 0"])
+ s.add_development_dependency(%q<fakeweb>, [">= 0"])
 s.add_runtime_dependency(%q<ruby_tubesday>, [">= 0"])
 s.add_runtime_dependency(%q<nokogiri>, [">= 0"])
 s.add_runtime_dependency(%q<extlib>, [">= 0"])
 else
- s.add_dependency(%q<thoughtbot-shoulda>, [">= 0"])
+ s.add_dependency(%q<shoulda>, [">= 0"])
+ s.add_dependency(%q<fakeweb>, [">= 0"])
 s.add_dependency(%q<ruby_tubesday>, [">= 0"])
 s.add_dependency(%q<nokogiri>, [">= 0"])
 s.add_dependency(%q<extlib>, [">= 0"])
 end
 else
- s.add_dependency(%q<thoughtbot-shoulda>, [">= 0"])
+ s.add_dependency(%q<shoulda>, [">= 0"])
+ s.add_dependency(%q<fakeweb>, [">= 0"])
 s.add_dependency(%q<ruby_tubesday>, [">= 0"])
 s.add_dependency(%q<nokogiri>, [">= 0"])
 s.add_dependency(%q<extlib>, [">= 0"])
data/test/api_extractor.rb ADDED
@@ -0,0 +1,17 @@
+ module SemanticExtraction
+ module ApiExtractor
+ include SemanticExtraction::UtilityMethods
+
+ SemanticExtraction.valid_extractors << "api_extractor"
+ SemanticExtraction.requires_api_key << "api_extractor"
+
+ SemanticExtraction.module_eval("mattr_accessor :api_extractor_api_key")
+
+ SemanticExtraction.api_extractor_api_key = "bogus"
+
+ def self.find_keywords(text)
+ return []
+ end
+
+ end
+ end
data/test/our_extractor.rb ADDED
@@ -0,0 +1,12 @@
+ module SemanticExtraction
+ module OurExtractor
+ include SemanticExtraction::UtilityMethods
+
+ SemanticExtraction.valid_extractors << "our_extractor"
+
+ def self.find_keywords(text)
+ return []
+ end
+
+ end
+ end
data/test/test_semantic_extraction.rb CHANGED
@@ -1,7 +1,30 @@
 require 'helper'
+ require 'our_extractor'
+ require 'api_extractor'
 
 class TestSemanticExtraction < Test::Unit::TestCase
- should "probably rename this file and start testing for real" do
- flunk "hey buddy, you should probably rename this file and start testing for real"
+ should "correctly identify a url in is_url?" do
+ assert_equal SemanticExtraction.is_url?("http://www.indystar.com"), true
+ assert_equal SemanticExtraction.is_url?("I am a cheeky monkey"), false
 end
+
+ should "throw error when trying to set an invalid extractor" do
+ begin
+ SemanticExtraction.preferred_extractor = "bullshit"
+ rescue StandardError => err
+ assert_equal err.class.to_s, "SemanticExtraction::NotSupportedExtractor"
+ end
+ end
+
+ should "be able to define new extractors without api keys" do
+ SemanticExtraction.preferred_extractor = "our_extractor"
+ assert_equal true, SemanticExtraction.is_valid?(:find_keywords)
+ end
+
+ should "be able to define new extractors with api keys" do
+ SemanticExtraction.preferred_extractor = "api_extractor"
+ assert_equal true, SemanticExtraction.requires_api_key.include?("api_extractor")
+ assert_equal true, SemanticExtraction.is_valid?(:find_keywords)
+ end
+
 end
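
Outside of the test suite, the same extension point looks roughly like this in application code; the module name and the toy keyword logic are illustrative, not part of the gem:

  require 'rubygems'
  require 'semantic_extraction'

  module SemanticExtraction
    module RegexExtractor
      include SemanticExtraction::UtilityMethods

      SemanticExtraction.valid_extractors << "regex_extractor"

      # A toy keyword finder that needs no external API.
      def self.find_keywords(text)
        text.scan(/[A-Z][a-z]+/).uniq
      end
    end
  end

  SemanticExtraction.preferred_extractor = "regex_extractor"
  SemanticExtraction.find_keywords("The Indianapolis Star and the Knight News Challenge")
  # => ["The", "Indianapolis", "Star", "Knight", "News", "Challenge"]
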
metadata CHANGED
@@ -1,12 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: semantic_extraction
 version: !ruby/object:Gem::Version
+ hash: 23
 prerelease: false
 segments:
 - 0
- - 1
- - 1
- version: 0.1.1
+ - 2
+ - 0
+ version: 0.2.0
 platform: ruby
 authors:
 - Chris Vannoy
@@ -14,57 +15,79 @@ autorequire:
 bindir: bin
 cert_chain: []
 
- date: 2010-03-10 00:00:00 -05:00
+ date: 2010-07-06 00:00:00 -04:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
- name: thoughtbot-shoulda
+ name: shoulda
 prerelease: false
 requirement: &id001 !ruby/object:Gem::Requirement
+ none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
+ hash: 3
 segments:
 - 0
 version: "0"
 type: :development
 version_requirements: *id001
 - !ruby/object:Gem::Dependency
- name: ruby_tubesday
+ name: fakeweb
 prerelease: false
 requirement: &id002 !ruby/object:Gem::Requirement
+ none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
+ hash: 3
 segments:
 - 0
 version: "0"
- type: :runtime
+ type: :development
 version_requirements: *id002
 - !ruby/object:Gem::Dependency
- name: nokogiri
+ name: ruby_tubesday
 prerelease: false
 requirement: &id003 !ruby/object:Gem::Requirement
+ none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
+ hash: 3
 segments:
 - 0
 version: "0"
 type: :runtime
 version_requirements: *id003
 - !ruby/object:Gem::Dependency
- name: extlib
+ name: nokogiri
 prerelease: false
 requirement: &id004 !ruby/object:Gem::Requirement
+ none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
+ hash: 3
 segments:
 - 0
 version: "0"
 type: :runtime
 version_requirements: *id004
+ - !ruby/object:Gem::Dependency
+ name: extlib
+ prerelease: false
+ requirement: &id005 !ruby/object:Gem::Requirement
+ none: false
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ hash: 3
+ segments:
+ - 0
+ version: "0"
+ type: :runtime
+ version_requirements: *id005
 description: Using a variety of APIs (Yahoo term Extractor and Alchemy are currently supported), semantic_extraction can automatically return a collection of keywords for an arbitrary block of text. If using Alchemy, it can also return named entities.
 email: chris@chrisvannoy.com
 executables: []
@@ -84,8 +107,11 @@ files:
 - lib/semantic_extraction.rb
 - lib/semantic_extraction/extractors/alchemy.rb
 - lib/semantic_extraction/extractors/yahoo.rb
+ - lib/semantic_extraction/utility_methods.rb
 - semantic_extraction.gemspec
+ - test/api_extractor.rb
 - test/helper.rb
+ - test/our_extractor.rb
 - test/test_semantic_extraction.rb
 has_rdoc: true
 homepage: http://github.com/dummied/semantic_extraction
@@ -97,26 +123,32 @@ rdoc_options:
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
+ none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
+ hash: 3
 segments:
 - 0
 version: "0"
 required_rubygems_version: !ruby/object:Gem::Requirement
+ none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
+ hash: 3
 segments:
 - 0
 version: "0"
 requirements: []
 
 rubyforge_project:
- rubygems_version: 1.3.6
+ rubygems_version: 1.3.7
 signing_key:
 specification_version: 3
 summary: Extract meaningful information from unstructured text with Ruby
 test_files:
+ - test/api_extractor.rb
 - test/helper.rb
+ - test/our_extractor.rb
 - test/test_semantic_extraction.rb