semaphore_classification 0.1.0
- data/LICENSE +20 -0
- data/README.rdoc +156 -0
- data/VERSION +1 -0
- data/lib/semaphore.rb +2 -0
- data/lib/semaphore_classification/client.rb +52 -0
- data/lib/semaphore_classification/connection.rb +94 -0
- data/lib/semaphore_classification.rb +19 -0
- metadata +122 -0
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2010 Gemini SBS, LLC

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc
ADDED
@@ -0,0 +1,156 @@
= Semaphore Classification

Ruby wrapper around the Semaphore Classification Server (CS) API.

== Usage

Before you can classify documents, you must set the path to your CS:

  Semaphore::Client.set_realm(<uri_to_classification_server>)

To classify documents:

  Semaphore::Client.classify(<uri_to_document>, [options])
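
As a rough illustration of how the trailing options hash behaves, the sketch below mirrors the pattern used in client.rb: a trailing Hash is popped off the argument list and merged over the defaults. The +DEFAULTS+ constant and the +build_request_options+ helper are hypothetical names used only for this example, and only a few of the gem's defaults are reproduced.

```ruby
# Hypothetical stand-in for the gem's default options (subset only).
DEFAULTS = {
  :article_mode         => :multi,
  :threshold            => 48,
  :clustering_threshold => 48,
  :debug                => false
}.freeze

# Pops a trailing Hash from args (if there is one) and merges it over DEFAULTS.
# The document URI is always taken from the first positional argument.
def build_request_options(document_uri, *args)
  options = args.last.is_a?(Hash) ? args.pop : {}
  DEFAULTS.merge(options).merge(:document_uri => document_uri)
end

opts = build_request_options("http://example.com/report.pdf", :threshold => 60)
puts opts[:threshold]     # the caller-supplied value wins
puts opts[:article_mode]  # an untouched default is kept
```

Any option that is not passed keeps its default, so a call with only a document URI is valid.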

=== Semaphore::Client.classify() Options

==== :document_uri (required)

This is the first argument to classify() rather than a key in the options hash. The URI may use any of the following protocols: FTP, FTPS, HTTP or HTTPS. For example: http://mybucket.s3.amazonaws.com/some_file.pdf
Supported document types include Microsoft Office files, Lotus files, OpenOffice files, PDFs, WordPerfect docs, HTML docs and most other common file formats.
The document type will be automatically identified by the CS.

==== :title (optional)

*Value:* String

The title in the request is used mainly for classification of documents held by a content management system.

*Default:* none

==== :alternate_body (optional)

*Value:* String

Text to classify instead if the CS fails to retrieve the document for some reason.

*Default:* none

==== :article_mode (optional)

*Value:* :single or :multi

:single will process the document in one large chunk. This means that evidence from all parts of the document is considered at the same time. Depending on the design of the rulenet, this may increase the chance of misclassifications. :single mode may also require large amounts of memory (or, if memory is restricted, large amounts of time) because of the size of the evidence tables that have to be evaluated.

:multi will attempt to split the document into "articles" so that the rules only consider evidence within an article. Clustering is then applied to calculate which categories are representative of the document as a whole rather than of a single article.

*Default:* :multi

==== :debug (optional)

*Value:* true or false

When true, the response includes the article(s) as well as the rule matches. Useful for troubleshooting, but results in large responses.

*Default:* false

==== :clustering_type (optional)

*Value:* [:all, :average, :average_scored_only, :common_scored_only, :common, :rms_scored_only, :rms, :none]

The clustering type specifies the calculation used to derive the document-level scores from the article scores. It only applies to multi-article classifications.

*Default:* :rms_scored_only

==== :clustering_threshold (optional)

*Value:* [0-100]

The clustering threshold is only used in multi-article mode. After the selected clustering algorithm has run, its result is checked against this threshold, and a score is promoted to document level only if it is >= this value.

*Default:* 48
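
To make the interaction between the clustering type and the clustering threshold more concrete, here is a minimal sketch. It assumes that the "scored only" variants ignore articles that produced no score for a category, while the other variants count such articles as zero. That reading is inferred from the option names and is not confirmed here, so the real CS calculation may differ. All helper names are hypothetical.

```ruby
# Root mean square of a list of scores (0.0 for an empty list).
def rms(scores)
  return 0.0 if scores.empty?
  Math.sqrt(scores.sum { |s| s * s } / scores.size.to_f)
end

# Arithmetic mean of a list of scores (0.0 for an empty list).
def average(scores)
  return 0.0 if scores.empty?
  scores.sum / scores.size.to_f
end

# article_scores holds one entry per article: a score (0-100), or nil when
# the article did not score the category at all.
def document_score(article_scores, type, threshold)
  scored = article_scores.compact              # assumed "scored only" input
  padded = article_scores.map { |s| s || 0 }   # assumed "including empty" input
  value = case type
          when :rms_scored_only     then rms(scored)
          when :rms                 then rms(padded)
          when :average_scored_only then average(scored)
          when :average             then average(padded)
          else raise ArgumentError, "clustering type not covered by this sketch: #{type}"
          end
  # Promote to document level only if the clustered value reaches the threshold.
  value >= threshold ? value.round : nil
end

scores = [60, nil, 80, nil]
puts document_score(scores, :rms_scored_only, 48).inspect
puts document_score(scores, :average, 48).inspect
```

Under these assumptions the same article scores can pass the threshold with one clustering type and be dropped with another, which is why the two options are best tuned together.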

==== :threshold (optional)

*Value:* [0-100]

The threshold is used to decide at what level of significance a category rule will fire.

The score (or significance, if you prefer) varies between 0 and 100. It is sometimes displayed as 0.00 - 1.00 instead, depending on whether it is used for integer calculations (0-100) or for statistical floating point operations (0.00 - 1.00), where a normalised value is generally the better choice.

*Default:* 48
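
The two score scales are related by a simple division. The sketch below shows the conversion and a threshold check. The helper names are hypothetical, and treating a score equal to the threshold as firing is an assumption borrowed from the ">=" rule stated for the clustering threshold.

```ruby
# Converts an integer score (0-100) to its normalised form (0.00-1.00).
def normalise_score(score)
  raise ArgumentError, "score must be between 0 and 100" unless (0..100).cover?(score)
  score / 100.0
end

# True when a rule with the given score would fire at the given threshold
# (assumes an inclusive comparison).
def fires?(score, threshold = 48)
  score >= threshold
end

puts normalise_score(75)
puts fires?(47)
```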

==== :language (optional)

*Value:* [:english, :english_marathon_stemmer, :english_morphological_stemmer, :english_morph_and_derivational_stemmer, :french, :italian, :german, :spanish, :dutch, :portuguese, :danish, :norwegian, :swedish, :arabic]

Note: for Standard Language processing, only English has multiple stemmers available. The other supported languages only have the Marathon stemmer.

*Default:* :english_marathon_stemmer

==== :generated_keys (optional)

*Value:* true or false

Using generated keys means that every rule has a unique key (which is simply the index of the rule in the rulenet).

*Default:* true

==== :min_avg_article_page_size (optional)

*Value:* Decimal

The minimum average article page size is only relevant in multi-article mode.

For documents that contain page information (i.e. not HTML and other continuous formats), the count of pages in the document is used to check whether automatic splitting has produced a sensible result. If the number of articles multiplied by this value is greater than the count of pages in the document, CS assumes that the splitting does not make sense for this document and reverts to a single article.

The idea is that this gives an easy-to-use, approximate measure for checking the split. A minimum average article page size of 1 means that, on average, we want an article to be at least a single page long. So if a document of 10 pages splits into 20 articles, we probably have a bad statistical split, and classifying it as a single article will give better results.

*Default:* 1.0
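
The check described above can be written as a one-line comparison. This is only a sketch of the rule as stated in this README; the method name is hypothetical and the CS may apply further conditions.

```ruby
# Returns true when the split should be abandoned in favour of a single
# article: articles * minimum average page size exceeds the page count.
def revert_to_single_article?(article_count, page_count, min_avg_article_page_size = 1.0)
  article_count * min_avg_article_page_size > page_count
end

# The example from the text: a 10-page document split into 20 articles.
puts revert_to_single_article?(20, 10)
# A more plausible split: 5 articles over 10 pages.
puts revert_to_single_article?(5, 10)
```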

==== :character_cutoff (optional)

*Value:* Fixnum

The character count cutoff is a mechanism for avoiding errors or lengthy classification times on large documents.

If the corpus of documents to be classified is likely to include junk (e.g. automatically generated log files from SQL servers), then a cutoff can make sense.

A value of 0 means no cutoff action is performed.

The cutoff defines the approximate size (in characters) after which CS will stop parsing the data. The value is treated as an approximation, so the various parsers implement this behaviour in different ways. For example, PDF documents are cut off at the next page boundary once the limit is reached, and Word documents are cut off after the next complete word.

Measuring in characters appears to be the most sensible unit here, since the cutoff is applied fairly early in the processing of a document. At this point CS may not yet know what language the document is in, so it cannot assume that the language uses space-separated words, let alone count sentences or paragraphs.

Other multi-document-type parsing systems (for example the simple parser included with the Google Search Appliance) often have a cutoff value (generally not configurable) defined in terms of file size. However, a Word document with many embedded pictures may have a very large physical size but a relatively small number of characters, so it could be classified perfectly well. Hence the choice of a cutoff defined in terms of the number of characters.

Generally this value is set high enough that any reasonable document (i.e. one produced by a person) will be fully considered, so a value of half a million is realistic. Automatically generated text files that cause lengthy classification times often have more characters than this, but the information is very rarely of any use to an end user.

*Default:* 500000
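
The "cut off after the next complete word" behaviour can be sketched as follows. This is not the CS implementation; it is a small illustration that assumes words are delimited by whitespace, and the method name is hypothetical.

```ruby
# Truncates text at roughly `cutoff` characters, extending to the end of the
# word in progress. A cutoff of 0 disables truncation.
def apply_character_cutoff(text, cutoff)
  return text if cutoff.zero? || text.length <= cutoff
  # Find the first whitespace character at or after the cutoff position.
  boundary = text.index(/\s/, cutoff)
  boundary ? text[0, boundary] : text
end

puts apply_character_cutoff("alpha beta gamma", 8)
puts apply_character_cutoff("alpha beta gamma", 0)
```

Because the cut is extended to a word boundary, the result may be slightly longer than the cutoff, which matches the "approximate size" wording above.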

==== :document_score_limit (optional)

*Value:* Fixnum

The document-level score limit is a mechanism for restricting the document-level classifications to the top-N results only.

This is generally not a good idea, since the confidence of a classification is meant to provide an absolute measure that is comparable across documents. That is, a higher confidence means that the document is more likely to be "about" or "mainly concerned with" the particular topic. So if document 1 classifies category "A" with a confidence of 0.75 whilst document 2 has a confidence of 0.6 for category "A", then document 1 should be returned before document 2 (if this is a search-style installation rather than conformance checking). If, whilst processing document 1, the classification of "A" is discarded because document 1 has too many higher-confidence classifications, then the results across the corpus are skewed and search will not be as accurate.

However, some systems have rather strict limits on the number of "tags" or the amount of "meta information" that a particular document is allowed to contain. When CS is integrated with one of these systems, it is probably better to allow CS to pick the top-N rather than having this code in the integration layer.

Currently the implementation is pretty simplistic (it returns N or fewer scores sorted by confidence), so it could easily be implemented in the integration layer. It is possible that further work could go here so that CS checks particular categories, or classes of categories, in a rulenet-specific manner, so that "important" classes of categorisations (though with a low confidence) are not excluded by large numbers of higher-confidence classifications in some less important class of rules.

*Default:* 0
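
Since the text notes that the top-N selection could equally live in the integration layer, here is a minimal sketch of what that might look like. The method name is hypothetical, scores are assumed to be numeric, and a limit of 0 is assumed to mean "no limit" (by analogy with :character_cutoff; the README does not state this).

```ruby
# Keeps the N highest-scoring classifications, sorted by descending score.
# A limit of 0 is assumed to mean "return everything".
def top_classifications(classifications, limit)
  sorted = classifications.sort_by { |c| -c[:score] }
  limit.zero? ? sorted : sorted.first(limit)
end

results = [
  { :term => "Finance", :score => 60 },
  { :term => "Legal",   :score => 90 },
  { :term => "Tax",     :score => 75 }
]
top_classifications(results, 2).each { |c| puts "#{c[:term]}: #{c[:score]}" }
```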

== Dependencies

* {nokogiri}[http://github.com/tenderlove/nokogiri] (HTML, XML, SAX and Reader parser)
* {rest-client}[http://github.com/rest-client/rest-client] (HTTP and REST client used to post requests to the CS)

== Copyright

Copyright (c) 2010 Gemini SBS. See LICENSE for details.

== Authors

* {Mauricio Gomes}[http://github.com/mgomes]
data/VERSION
ADDED
@@ -0,0 +1 @@
0.1.0
data/lib/semaphore.rb
ADDED
data/lib/semaphore_classification/client.rb
ADDED
@@ -0,0 +1,52 @@
module Semaphore

  class Client

    # Maps the :language option to the language codes sent to the CS.
    LANGUAGES = { :english => "en", :english_marathon_stemmer => "en1", :english_morphological_stemmer => "en2", :english_morph_and_derivational_stemmer => "en3",
      :french => "fr", :italian => "it", :german => "de", :spanish => "es", :dutch => "nl", :portuguese => "pt", :danish => "da", :norwegian => "no",
      :swedish => "sv", :arabic => "ar"
    }

    # Maps the :clustering_type option to the names sent to the CS.
    CLUSTERING_TYPES = { :all => "ALL", :average => "AVERAGE_INCLUDING_EMPTY", :average_scored_only => "AVERAGE", :common_scored_only => "COMMON",
      :common => "COMMON_INCLUDING_EMPTY", :rms_scored_only => "RMS", :rms => "RMS_INCLUDING_EMPTY", :none => "NONE"
    }

    @@default_options = { :title => "", :alternate_body => "", :debug => false, :clustering_type => CLUSTERING_TYPES[:rms_scored_only], :clustering_threshold => 48,
      :threshold => 48, :language => LANGUAGES[:english_marathon_stemmer], :generated_keys => true, :min_avg_article_page_size => 1.0,
      :character_cutoff => 500000, :document_score_limit => 0, :article_mode => :multi
    }
    @@connection = nil

    class << self

      # Sets the URI of the Classification Server (and an optional proxy).
      def set_realm(realm, proxy=nil)
        @@connection = Connection.new(realm, proxy)
      end

      # Classifies the document at document_uri; a trailing Hash overrides the defaults.
      def classify(document_uri, *args)
        options = extract_options!(args)
        options[:document_uri] = document_uri

        post @@default_options.merge(translate_options(options))
      end

      private

      def post(data)
        raise RealmNotSpecified if @@connection.nil?
        @@connection.post data
      end

      def extract_options!(args)
        args.last.is_a?(Hash) ? args.pop : {}
      end

      # Converts the documented symbols for :language and :clustering_type into
      # the strings the CS expects; unrecognised values are passed through as-is.
      def translate_options(options)
        translated = options.dup
        translated[:language] = LANGUAGES.fetch(options[:language], options[:language]) if options.key?(:language)
        translated[:clustering_type] = CLUSTERING_TYPES.fetch(options[:clustering_type], options[:clustering_type]) if options.key?(:clustering_type)
        translated
      end

    end

  end

end
data/lib/semaphore_classification/connection.rb
ADDED
@@ -0,0 +1,94 @@
module Semaphore

  class Connection

    attr_reader :realm

    def initialize(realm, proxy=nil)
      @realm = realm
      @proxy = proxy
    end

    def post(data)
      request :post, data
    end

    private

    def request(method, data)
      response = send_request method, construct_document(data)

      deconstruct_document(response)
    end

    def send_request(method, data)
      RestClient.proxy = @proxy unless @proxy.nil?

      begin
        response = RestClient.post @realm, :XML_INPUT => data, :multipart => true
      rescue => e
        # Errors that carry no HTTP response (e.g. a refused connection) are re-raised unchanged.
        raise unless e.respond_to?(:response) && e.response
        raise_errors(e.response)
      end

      response
    end

    # Builds the XML request document from the merged options.
    def construct_document(data)
      builder = Nokogiri::XML::Builder.new do |xml|
        xml.request(:op => data[:debug] ? "TEST" : "CLASSIFY") {
          xml.document {
            xml.title data[:title] unless data[:title].empty?
            xml.path data[:document_uri]
            xml.body data[:alternate_body] unless data[:alternate_body].empty?
            case data[:article_mode]
            when :multi
              xml.multiarticle
            when :single
              xml.singlearticle
            end
            xml.feedback if data[:debug]
            xml.use_generated_keys if data[:generated_keys]
            xml.clustering(:type => data[:clustering_type], :threshold => data[:clustering_threshold])
            xml.language data[:language]
            xml.threshold data[:threshold]
            xml.min_average_article_pagesize data[:min_avg_article_page_size]
            xml.char_count_cutoff data[:character_cutoff]
            xml.document_score_limit data[:document_score_limit]
          }
        }
      end

      builder.to_xml
    end

    # Extracts the "Generic" META nodes from the response as term/key/score hashes.
    def deconstruct_document(response)
      data = Array.new

      if !response.body.empty?
        begin
          doc = Nokogiri::XML.parse(response.body)
          doc.xpath('//META').each do |node|
            data << { :term => node['value'], :key => node['key'], :score => node['score'] } if node['name'] == "Generic"
          end
        rescue
          raise DecodeError, "content: <#{response.body}>"
        end
      end

      data.uniq
    end

    # Translates an HTTP error response into the matching Semaphore error.
    def raise_errors(response)
      case response.code
      when 500
        raise ServerError, "Semaphore Classification Server had an internal error. #{response.description}\n\n#{response.body}"
      when 502..503
        raise Unavailable, response.description
      else
        raise SemaphoreError, response.description
      end
    end

  end

end
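
The status-code mapping in raise_errors can be exercised on its own. The sketch below uses stand-in error classes so that it runs without the gem, and the +error_class_for+ helper is a hypothetical name that does not exist in the library. Because every error derives from SemaphoreError, a caller can rescue that one class to catch all of them.

```ruby
# Stand-ins for the gem's error hierarchy so the sketch is self-contained.
class SemaphoreError < StandardError; end
class ServerError < SemaphoreError; end
class Unavailable < SemaphoreError; end

# Hypothetical helper: returns the error class that the mapping in
# raise_errors would pick for a given HTTP status code.
def error_class_for(status_code)
  case status_code
  when 500      then ServerError
  when 502..503 then Unavailable
  else SemaphoreError
  end
end

begin
  raise error_class_for(503), "503 Service Unavailable"
rescue SemaphoreError => e
  # Rescuing the base class also catches the more specific subclasses.
  puts "#{e.class}: #{e.message}"
end
```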
data/lib/semaphore_classification.rb
ADDED
@@ -0,0 +1,19 @@
require 'uri'
require 'nokogiri'
require 'rest_client'

require 'semaphore_classification/connection'
require 'semaphore_classification/client'

module Semaphore
  VERSION = File.read(File.join(File.dirname(__FILE__), '..', 'VERSION'))

  class SemaphoreError < StandardError; end
  class InsufficientArgs < SemaphoreError; end
  class Unauthorized < SemaphoreError; end
  class NotFound < SemaphoreError; end
  class ServerError < SemaphoreError; end
  class Unavailable < SemaphoreError; end
  class DecodeError < SemaphoreError; end
  class RealmNotSpecified < SemaphoreError; end
end
metadata
ADDED
@@ -0,0 +1,122 @@
--- !ruby/object:Gem::Specification
name: semaphore_classification
version: !ruby/object:Gem::Version
  hash: 27
  prerelease: false
  segments:
  - 0
  - 1
  - 0
  version: 0.1.0
platform: ruby
authors:
- Mauricio Gomes
autorequire:
bindir: bin
cert_chain: []

date: 2010-08-27 00:00:00 -04:00
default_executable:
dependencies:
- !ruby/object:Gem::Dependency
  name: nokogiri
  prerelease: false
  requirement: &id001 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        hash: 113
        segments:
        - 1
        - 4
        - 3
        - 1
        version: 1.4.3.1
  type: :runtime
  version_requirements: *id001
- !ruby/object:Gem::Dependency
  name: rest-client
  prerelease: false
  requirement: &id002 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        hash: 15
        segments:
        - 1
        - 6
        - 0
        version: 1.6.0
  type: :runtime
  version_requirements: *id002
- !ruby/object:Gem::Dependency
  name: rspec
  prerelease: false
  requirement: &id003 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        hash: 13
        segments:
        - 1
        - 2
        - 9
        version: 1.2.9
  type: :development
  version_requirements: *id003
description: Ruby wrapper around the Semaphore Classification Server API.
email: mauricio@geminisbs.com
executables: []

extensions: []

extra_rdoc_files:
- LICENSE
- README.rdoc
files:
- LICENSE
- README.rdoc
- VERSION
- lib/semaphore.rb
- lib/semaphore_classification.rb
- lib/semaphore_classification/client.rb
- lib/semaphore_classification/connection.rb
has_rdoc: true
homepage: http://github.com/geminisbs/semaphore_classification
licenses: []

post_install_message:
rdoc_options:
- --charset=UTF-8
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      hash: 3
      segments:
      - 0
      version: "0"
required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      hash: 3
      segments:
      - 0
      version: "0"
requirements: []

rubyforge_project:
rubygems_version: 1.3.7
signing_key:
specification_version: 3
summary: Ruby wrapper around the Semaphore Classification Server
test_files: []
