semantic_extraction 0.1.1 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -4,22 +4,26 @@ Extract meaningful information from unstructured text with Ruby.
 
  Using a variety of APIs (Yahoo Term Extractor and Alchemy are currently supported), semantic_extraction can automatically return a collection of keywords for an arbitrary block of text. If you use Alchemy, it can also return named entities.
 
+ A brief walkthrough:
+
+ $ require 'rubygems'
+ $ require 'semantic_extraction'
+ $ SemanticExtraction.alchemy_api_key = 'YOUR_API_KEY_HERE'
+ $ SemanticExtraction.find_keywords("http://chrisvannoy.com/2010/03/10/introducing_semantic_extraction/")
+ $ ["Knight News Challenge", "Yahoo Term Extractor", "obscure gem", "soon-to-be twitter employee", "handle serving gems", "API providers", "Alchemy API", "unstructured text", "earliest steps", "Rails 3-compatible version", "structured data", "early stage", "death threats", "github", "final aside", "awesome piece", "default choice", "HTTP communication", "Indianapolis Star", "Feel free"]
+
  == The APIs in use
 
- * [Yahoo Term Extractor](http://developer.yahoo.com/search/content/V1/termExtraction.html)
+ * {Yahoo Term Extractor}[http://developer.yahoo.com/search/content/V1/termExtraction.html]
 
- * [Alchemy API](http://www.alchemyapi.com/api/)
+ * {Alchemy API}[http://www.alchemyapi.com/api/]
 
  == Upcoming To-Dos
 
- * Add support for [OpenCalais](http://www.opencalais.com/documentation/opencalais-documentation)
+ * Add support for {OpenCalais}[http://www.opencalais.com/documentation/opencalais-documentation]
 
  * Flesh out the rest of the Alchemy API
 
- * Make it possible to dynamically pick with API to use (so its possible to use multiple APIs in the same app)
-
- * Make it less fugly.
-
  * Tests, tests and more tests.
 
  == Note on Patches/Pull Requests
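The README walkthrough above targets the default Alchemy extractor. For orientation, here is a minimal sketch of the same flow against the 0.2.0 configuration surface introduced below (the mattr_accessor-based accessors in lib/semantic_extraction.rb); the API key and the sample inputs are placeholders, not values from this release:

  require 'rubygems'
  require 'semantic_extraction'

  # Alchemy stays the default extractor; the key now lives in a mattr_accessor.
  SemanticExtraction.alchemy_api_key = 'YOUR_API_KEY_HERE'

  # Keywords from a block of text (Alchemy also accepts a plain URL).
  SemanticExtraction.find_keywords("The Indianapolis Star wrote about the Knight News Challenge.")

  # Named entities come back as an array of OpenStruct objects.
  SemanticExtraction.find_entities("http://chrisvannoy.com/")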
data/Rakefile CHANGED
@@ -10,7 +10,8 @@ begin
      gem.email = "chris@chrisvannoy.com"
      gem.homepage = "http://github.com/dummied/semantic_extraction"
      gem.authors = ["Chris Vannoy"]
-     gem.add_development_dependency "thoughtbot-shoulda", ">= 0"
+     gem.add_development_dependency "shoulda", ">= 0"
+     gem.add_development_dependency "fakeweb"
      gem.add_dependency "ruby_tubesday"
      gem.add_dependency "nokogiri"
      gem.add_dependency "extlib"
data/VERSION CHANGED
@@ -1 +1 @@
- 0.1.1
+ 0.2.0
@@ -2,10 +2,35 @@ require 'ruby_tubesday'
  require 'nokogiri'
  require 'extlib'
  require 'ostruct'
+ require 'active_support/core_ext/module/attribute_accessors'
 
  module SemanticExtraction
 
+
+   mattr_accessor :preferred_extractor
+   mattr_accessor :alchemy_api_key
+   mattr_accessor :yahoo_api_key
+   mattr_accessor :valid_extractors
+   mattr_accessor :requires_api_key
+
+   self.valid_extractors = ["yahoo", "alchemy"]
+   self.requires_api_key = ["yahoo", "alchemy"]
+
+   # By default, we assume you want to use Alchemy.
+   # To override, just set SemanticExtraction.preferred_extractor somewhere and define the appropriate api_key.
+   def self.preferred_extractor=(value)
+     if self.valid_extractors.include?(value)
+       @@preferred_extractor = value
+     else
+       raise NotSupportedExtractor
+     end
+   end
+
+   self.preferred_extractor = "alchemy" if self.preferred_extractor.blank?
+
+
    # Screw it. Hard-code time!
+   require 'semantic_extraction/utility_methods'
    require 'semantic_extraction/extractors/yahoo'
    require 'semantic_extraction/extractors/alchemy'
 
@@ -16,53 +41,30 @@ module SemanticExtraction
    # This will become more important when we start mapping out all of the other features in the Alchemy API
    class NotSupportedExtraction < StandardError; end
 
-   # By default, we assume you want to use Alchemy.
-   # To override, just set SemanticExtraction::PREFERRED_EXTRACTOR somewhere.
-   def self.preferred_extractor
-     defined?(PREFERRED_EXTRACTOR) ? PREFERRED_EXTRACTOR : "alchemy"
-   end
+   # Thrown when you attempt to set the preferred extractor to an extractor we don't yet support.
+   class NotSupportedExtractor < StandardError; end
 
-   HTTP = RubyTubesday.new
-
-   # Will return an array of keywords gleaned from the text you pass in.
-   # Both Yahoo and Alchemy will handle a block of text, but Alchemy can also handle a plain URL.
-   def self.find_keywords(text)
-     klass = SemanticExtraction.const_get(self.preferred_extractor.capitalize)
-     if klass.respond_to?(:find_keywords) && defined?(self.preferred_extractor.upcase + "_API_KEY")
-       return klass.find_keywords(text)
-     elsif !klass.respond_to?(:find_keywords)
+   def self.find_generic(typer, args)
+     if self.is_valid?(typer)
+       return @@klass.send(typer.to_sym, args)
+     elsif !@@klass.respond_to?(typer.to_sym)
        raise NotSupportedExtraction
      else
        raise MissingApiKey
      end
    end
 
-   # Will return an array of OpenStruct representing the named entities from the text.
-   # At the moment, Alchemy is the only one to support this.
-   # Down the road, we'll add in OpenCalais and others.
-   def self.find_entities(text)
-     klass = SemanticExtraction.const_get(self.preferred_extractor.capitalize)
-     if klass.respond_to?(:find_entities) && defined?(self.preferred_extractor.upcase + "_API_Key")
-       return klass.find_entities(text)
-     elsif !klass.respond_to?(:find_entities)
-       raise NotSupportedExtraction
-     else
-       raise MissingApiKey
-     end
+   def self.method_missing(method, args)
+     find_generic(method.to_sym, args)
    end
 
-   # Posts the url to the API.
-   def self.post(url, target, calling_param, api_key, api_param="apikey".to_sym)
-     HTTP.post(url, :params => {calling_param => target, api_param => api_key} )
-   end
 
-   # Checks to see if a string is a URL.
-   # This is really dumb at the moment, and will likely be refactored in future releases.
-   def self.is_url?(link)
-     if link[0..3] == "http"
-       return true
+   def self.is_valid?(method)
+     @@klass = SemanticExtraction.const_get(self.preferred_extractor.gsub(/\/(.?)/) { "::#{$1.upcase}" }.gsub(/(?:^|_)(.)/) { $1.upcase })
+     if self.requires_api_key.include? self.preferred_extractor
+       (@@klass.respond_to?(method) && defined?(self.send((preferred_extractor + "_api_key").to_sym)) && !(self.send((preferred_extractor + "_api_key").to_sym)).empty?) ? true : false
      else
-       return false
+       @@klass.respond_to?(method)
      end
    end
 
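The hunks above replace the hard-wired find_keywords/find_entities wrappers with a generic dispatcher: preferred_extractor= validates names against valid_extractors, is_valid? constantizes the chosen extractor and checks for its *_api_key when one is required, and method_missing funnels every call through find_generic. A short sketch of how that behaves from the caller's side (the key value is a placeholder):

  require 'semantic_extraction'

  # Unknown extractor names are rejected at assignment time.
  begin
    SemanticExtraction.preferred_extractor = "opencalais"
  rescue SemanticExtraction::NotSupportedExtractor
    # not (yet) in SemanticExtraction.valid_extractors
  end

  # Supported extractors can be switched at runtime; Yahoo needs its app id
  # set before any call, since "yahoo" is listed in requires_api_key.
  SemanticExtraction.preferred_extractor = "yahoo"
  SemanticExtraction.yahoo_api_key = 'YOUR_YAHOO_APP_ID'

  # method_missing turns this into find_generic(:find_keywords, text),
  # which dispatches to SemanticExtraction::Yahoo.
  SemanticExtraction.find_keywords("Ruby gems can extract keywords from text")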
@@ -1,41 +1,58 @@
  module SemanticExtraction
-   class Alchemy
+   module Alchemy
+     include SemanticExtraction::UtilityMethods
      STARTER = "http://access.alchemyapi.com/calls/"
 
      def self.find_keywords(text)
-       prefix = (SemanticExtraction.is_url?(text) ? "url" : "text")
+       prefix = is_url?(text) ? "url" : "text"
        endpoint = (prefix == "url" ? "URL" : "Text") + "GetKeywords"
        url = STARTER + prefix + "/" + endpoint
-       raw = SemanticExtraction.post(url, text, prefix, SemanticExtraction::ALCHEMY_API_KEY)
+       raw = post(url, text, prefix)
+       output_keywords(raw)
+     end
+
+     def self.output_keywords(raw)
        h = Nokogiri::XML(raw)
-       if (h/"keywords keyword")
-         keywords = []
-         (h/"keywords keyword").each do |p|
-           keywords << p.text
-         end
+       keywords = []
+       (h/"keywords keyword").each do |p|
+         keywords << p.text
        end
        return keywords
      end
 
      def self.find_entities(text)
-       prefix = (SemanticExtraction.is_url?(text) ? "url" : "text")
+       prefix = is_url?(text) ? "url" : "text"
        endpoint = (prefix == "url" ? "URL" : "Text") + "GetRankedNamedEntities"
        url = STARTER + prefix + "/" + endpoint
-       raw = SemanticExtraction.post(url, text, prefix, SemanticExtraction::ALCHEMY_API_KEY)
+       raw = post(url, text, prefix)
+       output_entities(raw)
+     end
+
+     def self.output_entities(raw)
        h = Nokogiri::XML(raw)
-       if (h/"entities entity")
-         entities = []
-         (h/"entities entity").each do |p|
-           hashie = Hash.from_xml(p.to_s)["entity"]
-           typer = hashie.delete("type")
-           if typer
-             hashie["entity_type"] = typer
-           end
-           entities << OpenStruct.new(hashie)
+       entities = []
+       (h/"entities entity").each do |p|
+         hashie = Hash.from_xml(p.to_s)["entity"]
+         typer = hashie.delete("type")
+         if typer
+           hashie["entity_type"] = typer
          end
+         entities << OpenStruct.new(hashie)
        end
        return entities
      end
-
+
+     def self.extract_text(text)
+       prefix = is_url?(text) ? "url" : "html"
+       endpoint = (prefix == "url" ? "URL" : "HTML") + "GetText"
+       url = STARTER + prefix + "/" + endpoint
+       raw = post(url, text, prefix)
+       output_text(raw)
+     end
+
+     def self.output_text(raw)
+       h = Nokogiri::XML(raw)
+       return (h/"text").first.inner_html
+     end
    end
  end
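The Alchemy extractor is now a module that mixes in the shared UtilityMethods helpers, separates HTTP fetching from response parsing (the output_* methods), and adds extract_text on top of Alchemy's URLGetText/HTMLGetText endpoints. A hedged usage sketch — the key and URL are placeholders:

  SemanticExtraction.preferred_extractor = "alchemy"
  SemanticExtraction.alchemy_api_key = 'YOUR_API_KEY_HERE'

  # New in 0.2.0: pull the readable text out of a page (URLGetText)
  # or out of a raw HTML string (HTMLGetText).
  body = SemanticExtraction.extract_text("http://chrisvannoy.com/")

  # The extracted text can then be fed back through the keyword call.
  SemanticExtraction.find_keywords(body)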
@@ -1,16 +1,19 @@
  module SemanticExtraction
-   class Yahoo
+   module Yahoo
+     include SemanticExtraction::UtilityMethods
      STARTER = "http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction"
-
+
      def self.find_keywords(text)
        prefix = 'context'
-       raw = SemanticExtraction.post(STARTER, text, prefix, SemanticExtraction::YAHOO_API_KEY, :appid)
+       raw = SemanticExtraction.post(STARTER, text, prefix, :appid)
+       self.output_keywords(raw)
+     end
+
+     def self.output_keywords(raw)
        h = Nokogiri::XML(raw)
-       if (h/"Result")
-         keywords = []
-         (h/"Result").each do |p|
-           keywords << p.text
-         end
+       keywords = []
+       (h/"Result").each do |p|
+         keywords << p.text
        end
        return keywords
      end
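The Yahoo extractor gets the same reshaping: it becomes a module, includes UtilityMethods, and splits the Nokogiri parsing into output_keywords. A small sketch of what that parsing does to a hand-rolled response body (the XML below is a made-up stand-in for the Term Extraction ResultSet, not a recorded API response):

  xml = <<-XML
  <ResultSet>
    <Result>knight news challenge</Result>
    <Result>term extraction</Result>
  </ResultSet>
  XML

  SemanticExtraction::Yahoo.output_keywords(xml)
  # => ["knight news challenge", "term extraction"]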
@@ -0,0 +1,26 @@
+ module SemanticExtraction
+   module UtilityMethods
+
+     def self.included(includer)
+       includer.module_eval do
+
+         # Posts the url to the API.
+         def self.post(url, target, calling_param, api_param="apikey".to_sym)
+           RubyTubesday.new.post(url, :params => {calling_param => target, api_param => (SemanticExtraction.send((SemanticExtraction.preferred_extractor + "_api_key").to_sym))} )
+         end
+
+         # Checks to see if a string is a URL.
+         # This is really dumb at the moment, and will likely be refactored in future releases.
+         def self.is_url?(link)
+           if link[0..3] == "http"
+             return true
+           else
+             return false
+           end
+         end
+       end
+     end
+
+
+   end
+ end
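The new lib/semantic_extraction/utility_methods.rb centralizes the helpers each extractor used to reach back into SemanticExtraction for: post sends the request through RubyTubesday and looks up the API key from the preferred extractor's *_api_key accessor, and is_url? keeps its naive http-prefix check. Together with the valid_extractors/requires_api_key registries, this is what lets extra extractors be plugged in, as the bundled test extractors do. A sketch of registering a hypothetical custom extractor (the WordCounter name and its behaviour are illustrative, modeled on test/our_extractor.rb):

  module SemanticExtraction
    # A key-less extractor, registered the same way test/our_extractor.rb is.
    module WordCounter
      include SemanticExtraction::UtilityMethods

      SemanticExtraction.valid_extractors << "word_counter"

      def self.find_keywords(text)
        text.split(/\W+/).uniq
      end
    end
  end

  SemanticExtraction.preferred_extractor = "word_counter"
  SemanticExtraction.find_keywords("extract keywords from unstructured text")
  # => ["extract", "keywords", "from", "unstructured", "text"]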
@@ -5,11 +5,11 @@
 
  Gem::Specification.new do |s|
    s.name = %q{semantic_extraction}
-   s.version = "0.1.1"
+   s.version = "0.2.0"
 
    s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
    s.authors = ["Chris Vannoy"]
-   s.date = %q{2010-03-10}
+   s.date = %q{2010-07-06}
    s.description = %q{Using a variety of APIs (Yahoo term Extractor and Alchemy are currently supported), semantic_extraction can automatically return a collection of keywords for an arbitrary block of text. If using Alchemy, it can also return named entities.}
    s.email = %q{chris@chrisvannoy.com}
    s.extra_rdoc_files = [
@@ -26,17 +26,22 @@ Gem::Specification.new do |s|
      "lib/semantic_extraction.rb",
      "lib/semantic_extraction/extractors/alchemy.rb",
      "lib/semantic_extraction/extractors/yahoo.rb",
+     "lib/semantic_extraction/utility_methods.rb",
      "semantic_extraction.gemspec",
+     "test/api_extractor.rb",
      "test/helper.rb",
+     "test/our_extractor.rb",
      "test/test_semantic_extraction.rb"
    ]
    s.homepage = %q{http://github.com/dummied/semantic_extraction}
    s.rdoc_options = ["--charset=UTF-8"]
    s.require_paths = ["lib"]
-   s.rubygems_version = %q{1.3.6}
+   s.rubygems_version = %q{1.3.7}
    s.summary = %q{Extract meaningful information from unstructured text with Ruby}
    s.test_files = [
-     "test/helper.rb",
+     "test/api_extractor.rb",
+     "test/helper.rb",
+     "test/our_extractor.rb",
      "test/test_semantic_extraction.rb"
    ]
 
@@ -44,19 +49,22 @@ Gem::Specification.new do |s|
      current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
      s.specification_version = 3
 
-     if Gem::Version.new(Gem::RubyGemsVersion) >= Gem::Version.new('1.2.0') then
-       s.add_development_dependency(%q<thoughtbot-shoulda>, [">= 0"])
+     if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
+       s.add_development_dependency(%q<shoulda>, [">= 0"])
+       s.add_development_dependency(%q<fakeweb>, [">= 0"])
        s.add_runtime_dependency(%q<ruby_tubesday>, [">= 0"])
        s.add_runtime_dependency(%q<nokogiri>, [">= 0"])
        s.add_runtime_dependency(%q<extlib>, [">= 0"])
      else
-       s.add_dependency(%q<thoughtbot-shoulda>, [">= 0"])
+       s.add_dependency(%q<shoulda>, [">= 0"])
+       s.add_dependency(%q<fakeweb>, [">= 0"])
        s.add_dependency(%q<ruby_tubesday>, [">= 0"])
        s.add_dependency(%q<nokogiri>, [">= 0"])
        s.add_dependency(%q<extlib>, [">= 0"])
      end
    else
-     s.add_dependency(%q<thoughtbot-shoulda>, [">= 0"])
+     s.add_dependency(%q<shoulda>, [">= 0"])
+     s.add_dependency(%q<fakeweb>, [">= 0"])
      s.add_dependency(%q<ruby_tubesday>, [">= 0"])
      s.add_dependency(%q<nokogiri>, [">= 0"])
      s.add_dependency(%q<extlib>, [">= 0"])
@@ -0,0 +1,17 @@
+ module SemanticExtraction
+   module ApiExtractor
+     include SemanticExtraction::UtilityMethods
+
+     SemanticExtraction.valid_extractors << "api_extractor"
+     SemanticExtraction.requires_api_key << "api_extractor"
+
+     SemanticExtraction.module_eval("mattr_accessor :api_extractor_api_key")
+
+     SemanticExtraction.api_extractor_api_key = "bogus"
+
+     def self.find_keywords(text)
+       return []
+     end
+
+   end
+ end
@@ -0,0 +1,12 @@
+ module SemanticExtraction
+   module OurExtractor
+     include SemanticExtraction::UtilityMethods
+
+     SemanticExtraction.valid_extractors << "our_extractor"
+
+     def self.find_keywords(text)
+       return []
+     end
+
+   end
+ end
@@ -1,7 +1,30 @@
  require 'helper'
+ require 'our_extractor'
+ require 'api_extractor'
 
  class TestSemanticExtraction < Test::Unit::TestCase
-   should "probably rename this file and start testing for real" do
-     flunk "hey buddy, you should probably rename this file and start testing for real"
+   should "correctly identify a url in is_url?" do
+     assert_equal SemanticExtraction.is_url?("http://www.indystar.com"), true
+     assert_equal SemanticExtraction.is_url?("I am a cheeky monkey"), false
    end
+
+   should "throw error when trying to set an invalid extractor" do
+     begin
+       SemanticExtraction.preferred_extractor = "bullshit"
+     rescue StandardError => err
+       assert_equal err.class.to_s, "SemanticExtraction::NotSupportedExtractor"
+     end
+   end
+
+   should "be able to define new extractors without api keys" do
+     SemanticExtraction.preferred_extractor = "our_extractor"
+     assert_equal true, SemanticExtraction.is_valid?(:find_keywords)
+   end
+
+   should "be able to define new extractors with api keys" do
+     SemanticExtraction.preferred_extractor = "api_extractor"
+     assert_equal true, SemanticExtraction.requires_api_key.include?("api_extractor")
+     assert_equal true, SemanticExtraction.is_valid?(:find_keywords)
+   end
+
  end
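The rewritten test file now covers URL detection, extractor validation, and the pluggable-extractor registry via the two stub extractors added under test/. fakeweb joins the development dependencies (see the Rakefile and gemspec changes) so tests exercising the real extractors can stub HTTP rather than hit Yahoo or Alchemy; the diff doesn't show such a test, but a sketch of what one could look like (the canned response and key are illustrative):

  require 'fakeweb'

  FakeWeb.allow_net_connect = false
  FakeWeb.register_uri(:post,
    "http://access.alchemyapi.com/calls/text/TextGetKeywords",
    :body => "<results><keywords><keyword>ruby</keyword></keywords></results>")

  SemanticExtraction.preferred_extractor = "alchemy"
  SemanticExtraction.alchemy_api_key = "test-key"
  SemanticExtraction.find_keywords("ruby")   # => ["ruby"]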
metadata CHANGED
@@ -1,12 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: semantic_extraction
  version: !ruby/object:Gem::Version
+   hash: 23
    prerelease: false
    segments:
    - 0
-   - 1
-   - 1
-   version: 0.1.1
+   - 2
+   - 0
+   version: 0.2.0
  platform: ruby
  authors:
  - Chris Vannoy
@@ -14,57 +15,79 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2010-03-10 00:00:00 -05:00
+ date: 2010-07-06 00:00:00 -04:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency
-   name: thoughtbot-shoulda
+   name: shoulda
    prerelease: false
    requirement: &id001 !ruby/object:Gem::Requirement
+     none: false
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
+         hash: 3
          segments:
          - 0
          version: "0"
    type: :development
    version_requirements: *id001
  - !ruby/object:Gem::Dependency
-   name: ruby_tubesday
+   name: fakeweb
    prerelease: false
    requirement: &id002 !ruby/object:Gem::Requirement
+     none: false
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
+         hash: 3
          segments:
          - 0
          version: "0"
-   type: :runtime
+   type: :development
    version_requirements: *id002
  - !ruby/object:Gem::Dependency
-   name: nokogiri
+   name: ruby_tubesday
    prerelease: false
    requirement: &id003 !ruby/object:Gem::Requirement
+     none: false
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
+         hash: 3
          segments:
          - 0
          version: "0"
    type: :runtime
    version_requirements: *id003
  - !ruby/object:Gem::Dependency
-   name: extlib
+   name: nokogiri
    prerelease: false
    requirement: &id004 !ruby/object:Gem::Requirement
+     none: false
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
+         hash: 3
          segments:
          - 0
          version: "0"
    type: :runtime
    version_requirements: *id004
+ - !ruby/object:Gem::Dependency
+   name: extlib
+   prerelease: false
+   requirement: &id005 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         hash: 3
+         segments:
+         - 0
+         version: "0"
+   type: :runtime
+   version_requirements: *id005
  description: Using a variety of APIs (Yahoo term Extractor and Alchemy are currently supported), semantic_extraction can automatically return a collection of keywords for an arbitrary block of text. If using Alchemy, it can also return named entities.
  email: chris@chrisvannoy.com
  executables: []
@@ -84,8 +107,11 @@ files:
  - lib/semantic_extraction.rb
  - lib/semantic_extraction/extractors/alchemy.rb
  - lib/semantic_extraction/extractors/yahoo.rb
+ - lib/semantic_extraction/utility_methods.rb
  - semantic_extraction.gemspec
+ - test/api_extractor.rb
  - test/helper.rb
+ - test/our_extractor.rb
  - test/test_semantic_extraction.rb
  has_rdoc: true
  homepage: http://github.com/dummied/semantic_extraction
@@ -97,26 +123,32 @@ rdoc_options:
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
+   none: false
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
+       hash: 3
        segments:
        - 0
        version: "0"
  required_rubygems_version: !ruby/object:Gem::Requirement
+   none: false
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
+       hash: 3
        segments:
        - 0
        version: "0"
  requirements: []
 
  rubyforge_project:
- rubygems_version: 1.3.6
+ rubygems_version: 1.3.7
  signing_key:
  specification_version: 3
  summary: Extract meaningful information from unstructured text with Ruby
  test_files:
+ - test/api_extractor.rb
  - test/helper.rb
+ - test/our_extractor.rb
  - test/test_semantic_extraction.rb