semaphore_classification 0.1.0 → 0.1.1
- data/README.rdoc +44 -42
- data/VERSION +1 -1
- data/lib/semaphore_classification/client.rb +2 -2
- data/spec/semaphore_classification_spec.rb +7 -0
- data/spec/spec_helper.rb +9 -0
- metadata +8 -5
data/README.rdoc
CHANGED
@@ -10,95 +10,97 @@ Before you can classify documents, you must set the path to your CS:
 
 To classify documents:
 
-Semaphore::Client.classify(
+Semaphore::Client.classify([options])
+
+Mostly likely you will specify a :document_uri when classifying documents, but if you do not you will need to specify an :alternate_body
 
-
+== Semaphore::Client.classify() Options
 
-
+=== :document_uri (optional)
 
 This may use the following protocols (FTP, FTPS, HTTP, HTTPS). For example: http://mybucket.s3.amazonaws.com/some_file.pdf
 Supported document types include: Microsoft Office files, Lotus files, OpenOffice files, PDFs, WordPerfect docs, HTML docs and most other common file formats.
 The document type will be automatically identified by the CS.
 
-
+=== :title (optional)
 
-*Value
+*Value* String
 
 The title in the request is used mainly for classification of documents held by a content management system.
 
-*Default
+*Default* none
 
-
+=== :alternate_body (optional)
 
-*Value
+*Value* String
 
 This will be used to classify on if the document fails to be retrieved by the CS for some reason.
 
-*Default
+*Default* none
 
-
+=== :article_mode (optional)
 
-*Value
+*Value* :single or :multi
 
 :single will process the document in 1 large chunk. This will mean that evidence from all parts of the document are considered at the same time. Depending on the design of the rulenet this may increase the chance of mis-classifications. Singlearticle may also require large amounts of memory (or if this is restricted, large amounts of time) due to the size of evidence tables which have to be evaluated.
 
 :multi will attempt to split the document into "articles" so that the rules only consider evidence within an article and then clustering is applied to calculate which categories are representative for the document as a whole rather than simply for an article.
 
-*Default
+*Default* :multi
 
-
+=== :debug (optional)
 
-*Value
+*Value* true or false
 
 Will return the article(s) as well as rule matches in the response. Useful for troubleshooting, but results in large responses.
 
-*Default
+*Default* false
 
-
+=== :clustering_type (optional)
 
-*Value
+*Value* [:all, :average, :average_scored_only, :common_scored_only, :common, :rms_scored_only, :rms, :none]
 
 Clustering type specifies the type of calculation to use in deriving the document level scores from the article scores. This only applies to multiarticle style classifications.
 
-*Default
+*Default* :rms_scored_only
 
-
+=== :clustering_threshold (optional)
 
-*Value
+*Value* [0-100]
 
 The clustering threshold is only used in multiarticle mode. When the clustering algorithm is selected, the result is checked against this threshold and a score is only promoted to document level if it is >= this value.
 
-*Default
+*Default* 48
 
-
+=== :threshold (optional)
 
-*Value
+*Value* [0-100]
 
 The threshold is used to decide at what level of significance a category rule will fire.
 
 The score (or significance if you prefer) varies between 0 and 100 sometimes this is displayed as 0.00 - 1.00 depending on whether it is used for integer calculations (0-100) or for statistical floating point operations (0.00 - 1.00 ie a normalised value is generally better here).
 
-*Default
+*Default* 48
 
-
+=== :language (optional)
 
-*Value
+*Value* [:english, :english_marathon_stemmer, :english_morphological_stemmer, :english_morph_and_derivational_stemmer, :french, :italian, :german, :spanish, :dutch, :portuguese, :danish, :norwegian, :swedish, :arabic]
 
-
+_Note_ for Standard Language processing only English has multiple stemmers available - The other languages supported only have Marathon stemmer available.
 
-*Default
+*Default* :english_marathon_stemmer
 
-
+=== :generated_keys (optional)
 
-*Value
+*Value* true or false
 
 Using generated keys will mean that all rules will have a unique key (which is simply the index of the rule in the rulenet).
 
-*Default
+*Default* true
 
-
+=== :min_avg_article_page_size (optional)
 
-*Value
+*Value* Decimal
 
 The minimum average article page size is only relevant in multi article mode
 
@@ -106,11 +108,11 @@ For documents which contain page information (ie not html and other continuous f
 
 The idea is that this gives an easy to use approximate measure for checking splitting - ie a min average article page size of 1 means that on average we want 1 article to be bigger than a single page so if a document of 10 pages splits into 20 articles then we probably have a bad statistical split so classifying as a single article will give better results
 
-*Default
+*Default* 1.0
 
-
+=== :character_cutoff (optional)
 
-*Values
+*Values* FixNum
 
 The character count cutoff is a mechanism for avoiding errors or lengthy classification times on large documents.
 
@@ -126,11 +128,11 @@ Other multi document type parsing systems (for example the simple parser include
 
 Generally this value is set high enough that any reasonable document (ie one produced by a person) will be fully considered so a value of 1/2 a million is realistic - automatically generated text files which cause lengthy classification times often have more than this characters but the information is very rarely of any use to an end user.
 
-*Default
+*Default* 500000
 
-
+=== :document_score_limit (optional)
 
-*Value
+*Value* FixNum
 
 The document level score limit is a mechanism for restricting the document level classifications to the top-N results only.
 
@@ -140,12 +142,12 @@ However some systems have rather strict limits on the number of "tags" or "meta
 
 Currently the implementation is pretty simplistic (will return N or less scores sorted by the confidence) so could easily be implemented in the integration layer but it is possible that further work could go here so that CS could check particular categories or classes of categories in a specific rulenet defined manner so that "important" classes of categorisations (though with a low confidence) are not excluded by large numbers of higher confidence classifications in some less important class of rules.
 
-*Default
+*Default* 0
 
 == Dependencies
 
-* {nokogiri}[http://github.com/tenderlove/nokogiri]
-* {
+* {nokogiri}[http://github.com/tenderlove/nokogiri]
+* {rest-client}[http://github.com/archiloque/rest-client]
 
 == Copyright
 
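The rewritten README boils the API down to a single options hash passed to Semaphore::Client.classify. As a rough usage sketch only: the CS path/realm setup mentioned in the README's intro is not shown in this diff, so the commented setup line and the literal values below are assumptions; the option names, defaults and the S3 example URL are taken from the README text above.

  require "semaphore_classification"

  # Assumed setup step: the README requires the CS path/realm to be configured
  # before classifying; the exact setter is not visible in this diff.
  # Semaphore::Client.realm = "http://my-cs-host:5058"

  result = Semaphore::Client.classify(
    :document_uri   => "http://mybucket.s3.amazonaws.com/some_file.pdf",
    :title          => "Quarterly report",   # illustrative value
    :article_mode   => :multi,               # default; :single classifies in one chunk
    :threshold      => 48,                   # rule-firing significance, 0-100 (default 48)
    :debug          => false,                # true also returns articles and rule matches
    :alternate_body => "Fallback text used if the CS cannot fetch :document_uri"
  )

Any FTP, FTPS, HTTP or HTTPS location works for :document_uri; supplying :alternate_body as well gives the CS something to classify if the fetch fails.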
data/VERSION
CHANGED
@@ -1 +1 @@
-0.1.0
+0.1.1
data/lib/semaphore_classification/client.rb
CHANGED
@@ -23,9 +23,9 @@ module Semaphore
   @@connection = Connection.new(realm, proxy)
 end
 
-def classify(
+def classify(*args)
   options = extract_options!(args)
-  options[:
+  raise InsufficientArgs if options[:alternate_body].empty? && options[:document_uri].empty?
 
   result = post @@default_options.merge(options)
 end
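The behavioural change in client.rb is the guard: classify now takes *args, extracts the options hash, and raises InsufficientArgs when both :alternate_body and :document_uri are empty. A minimal standalone sketch of that check, assuming the two keys default to empty strings (presumably via @@default_options or extract_options!) and ignoring the gem's actual namespacing of InsufficientArgs:

  # Sketch only: InsufficientArgs and DEFAULTS here are stand-ins for
  # illustration; only the raise condition is taken from the diff above.
  class InsufficientArgs < StandardError; end

  DEFAULTS = { :document_uri => "", :alternate_body => "" }

  def check_classify_args(options = {})
    opts = DEFAULTS.merge(options)
    # New in 0.1.1: refuse to classify when there is nothing to classify
    raise InsufficientArgs if opts[:alternate_body].empty? && opts[:document_uri].empty?
    opts
  end

  check_classify_args(:document_uri => "http://example.com/doc.pdf")  # passes
  begin
    check_classify_args
  rescue InsufficientArgs
    puts "raised as expected when both values are empty"
  end

Note that the real guard calls .empty? on both values, which implies those options default to strings rather than nil somewhere upstream (hence the merge in this sketch).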
data/spec/semaphore_classification_spec.rb
ADDED
data/spec/spec_helper.rb
ADDED
metadata
CHANGED
@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: semaphore_classification
 version: !ruby/object:Gem::Version
-  hash:
+  hash: 25
   prerelease: false
   segments:
   - 0
   - 1
-  - 0
-  version: 0.1.0
+  - 1
+  version: 0.1.1
 platform: ruby
 authors:
 - Mauricio Gomes
@@ -84,6 +84,8 @@ files:
 - lib/semaphore_classification.rb
 - lib/semaphore_classification/client.rb
 - lib/semaphore_classification/connection.rb
+- spec/semaphore_classification_spec.rb
+- spec/spec_helper.rb
 has_rdoc: true
 homepage: http://github.com/geminisbs/semaphore_classification
 licenses: []
@@ -118,5 +120,6 @@ rubygems_version: 1.3.7
 signing_key:
 specification_version: 3
 summary: Ruby wrapper around the Semaphore Classification Server
-test_files:
-
+test_files:
+- spec/semaphore_classification_spec.rb
+- spec/spec_helper.rb
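For downstream users the metadata change amounts to the version bump (hash 25, segments 0/1/1) plus the specs now shipping as test_files. Picking up the fix is the usual Bundler step; the constraint below is only a suggestion, not something the gem itself prescribes:

  # Gemfile
  gem "semaphore_classification", "~> 0.1.1"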