textmood 0.0.1
- data/LICENSE +20 -0
- data/README.md +162 -0
- data/bin/textmood +108 -0
- data/lang/en_US.txt +18539 -0
- data/lang/no_NB.txt +9274 -0
- data/lang/symbols.txt +54 -0
- data/lib/textmood.rb +107 -0
- data/test/test.rb +51 -0
- metadata +55 -0
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
The MIT License (MIT)

Copyright (c) 2013 Stian Grytøyr

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,162 @@
## TextMood - Simple sentiment analyzer
*TextMood* is a simple sentiment analyzer, provided as a Ruby gem with a command-line
tool for simple interoperability with other processes. It takes text as input and
returns a sentiment score. A score above 0 is typically considered positive; a score
below 0 is considered negative.

The goal is to have a robust and simple tool that comes with baseline sentiment files
for many languages.

### Installation
The easiest way to get the latest stable version is to use gem:

    gem install textmood

If you’d like to get the bleeding-edge version:

    git clone https://github.com/stiang/textmood

### Usage
TextMood can be used as a Ruby library or as a standalone CLI tool.

#### Ruby library
You can use textmood in a Ruby program like this:
```ruby
require "textmood"

# The :lang parameter makes TextMood use one of the bundled language sentiment files
scorer = TextMood.new(lang: "en_US")
score = scorer.score_text("some text")
#=> '1.121'

# The :files parameter makes TextMood ignore the bundled sentiment files and use the
# specified files instead. You can specify as many files as you want.
scorer = TextMood.new(files: ["en_US-mod1.txt", "emoticons.txt"])

# TextMood will by default make one pass over the text, checking every word, but it
# supports doing several passes for any range of word N-grams. Both the start and end
# N-gram can be specified using the :start_ngram and :end_ngram options.
scorer = TextMood.new(lang: "en_US", debug: true, start_ngram: 2, end_ngram: 3)
score = scorer.score_text("some long text with many words")
#=> some long: 0.1
#=> long text: 0.1
#=> text with: -0.1
#=> with many: -0.1
#=> many words: -0.1
#=> some long text: -0.1
#=> long text with: 0.1
#=> text with many: 0.1
#=> with many words: 0.1
#=> '0.1'

# Using :normalize, you can make TextMood return a normalized value: 1 for positive,
# 0 for neutral and -1 for negative
scorer = TextMood.new(lang: "en_US", normalize: true)
score = scorer.score_text("some text")
#=> '1'

# :min_threshold and :max_threshold let you customize the way :normalize treats
# different values. The options below will make all scores below 1 negative,
# 1-2 will be neutral, and above 2 will be positive.
scorer = TextMood.new(lang: "en_US", normalize: true, min_threshold: 1, max_threshold: 2)
score = scorer.score_text("some text")
#=> '0'

# :debug prints out all tokens to stdout, along with their values (or 'nil' when the
# token was not found)
scorer = TextMood.new(lang: "en_US", debug: true)
score = scorer.score_text("some text")
#=> some: 0.1
#=> text: 0.1
#=> some text: -0.1
#=> '0.1'
```
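
The thresholding described for `:normalize` can be sketched in a few lines. This is an illustration only, not TextMood's actual implementation: the `normalize_score` helper is hypothetical, and its defaults of -0.5 and 0.5 are taken from the CLI help text for `--min-threshold` and `--max-threshold`.

```ruby
# Hypothetical helper, not part of the TextMood API: maps a raw score to
# -1, 0 or 1 using the threshold semantics described in the comments above.
def normalize_score(score, min_threshold: -0.5, max_threshold: 0.5)
  return -1 if score < min_threshold   # below the lower bound: negative
  return 1 if score > max_threshold    # above the upper bound: positive
  0                                    # anything in between: neutral
end

normalize_score(1.121)                                    # => 1
normalize_score(0.1)                                      # => 0
normalize_score(1.5, min_threshold: 1, max_threshold: 2)  # => 0
normalize_score(0.4, min_threshold: 1, max_threshold: 2)  # => -1
```

Scores that land exactly on a threshold are treated as neutral in this sketch, which matches the "lower than" and "higher than" wording of the help text.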

#### CLI tool
Or you can simply pass some UTF-8-encoded text to the CLI tool and get a score back, like so:
```bash
textmood -l en_US "<some text>"
-0.4375
```

The CLI tool has many useful options, mostly mirroring those of the library:
```
Usage: textmood [options] "<text>"

Returns a floating-point sentiment score of the provided text.
Above 0 is considered positive, below is considered negative.

MANDATORY options:
    -l, --language LANGUAGE          The IETF language tag for the provided text.
                                     Examples: en_US, no_NB

OR

    -f, --file PATH TO FILE          Use the specified sentiment file. May be used
                                     multiple times to load several files. No other
                                     files will be loaded if this option is used.

OPTIONAL options:
        --start-ngram INTEGER        The lowest word N-gram number to split the text into
                                     (default 1). Note that this only makes sense if the
                                     sentiment file has tokens of similar N-gram length

        --end-ngram INTEGER          The highest word N-gram number to split the text into
                                     (default 1). Note that this only makes sense if the
                                     sentiment file has tokens of similar N-gram length

    -n, --normalize                  Return 1 (positive), -1 (negative) or 0 (neutral)
                                     instead of the actual score. See also --min and --max.

        --min-threshold FLOAT        Scores lower than this are considered negative when
                                     using --normalize (default -0.5)

        --max-threshold FLOAT        Scores higher than this are considered positive when
                                     using --normalize (default 0.5)

    -s, --skip-symbols               Do not include symbols file (emoticons etc.).
                                     Only applies when using -l/--language.

    -d, --debug                      Prints out the score for each token in the provided text
                                     or 'nil' if the token was not found in the sentiment file

    -h, --help                       Show this message
```
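
To make the `--start-ngram` and `--end-ngram` options more concrete, here is a rough sketch of what splitting a text into word N-grams means. The `word_ngrams` helper is hypothetical and the whitespace tokenization is an assumption; TextMood's own tokenizer may well differ.

```ruby
# Hypothetical helper, not part of the TextMood API: returns every word
# N-gram of the text for N from start_ngram to end_ngram, shortest first.
def word_ngrams(text, start_ngram: 1, end_ngram: 1)
  words = text.split # assumption: tokens are separated by whitespace
  (start_ngram..end_ngram).flat_map do |n|
    # each_cons(n) yields every run of n consecutive words
    words.each_cons(n).map { |gram| gram.join(" ") }
  end
end

word_ngrams("some long text", start_ngram: 2, end_ngram: 3)
# => ["some long", "long text", "some long text"]
```

Each resulting N-gram would then be looked up as a token, which is why a range above 1 only helps when the sentiment file contains multi-word tokens.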

## Sentiment files
The included sentiment files reside in the *lang* directory. I hope to add many
more baseline sentiment files in the future.

Sentiment files should be named according to the IETF language tag, like *en_US*,
and contain one colon-separated line per token, like so:
```
1.0: epic
1.0: good
1.0: upright
0.958: fortunate
0.875: wonderfulness
0.875: wonderful
0.875: wide-eyed
0.875: wholesomeness
0.875: well-to-do
0.875: well-situated
0.6: well suited
```
The score is to the left of the first ':', and everything to the right is the
(potentially multi-word) token.
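
As a minimal sketch of that rule, the lines can be parsed by splitting on the first ':' only. The `parse_sentiment_lines` helper is hypothetical and is not how TextMood itself loads files; the sample lines, including the emoticon, are illustrative.

```ruby
# Hypothetical parser, not TextMood's own loader: splits each line on the
# first ':' only, so the token itself may contain spaces or further colons.
def parse_sentiment_lines(lines)
  lines.each_with_object({}) do |line, scores|
    score, token = line.split(":", 2)
    # Skip blank lines and lines without a token
    next if token.nil? || token.strip.empty?
    # Float() raises ArgumentError on a malformed score instead of guessing
    scores[token.strip] = Float(score.strip)
  end
end

parse_sentiment_lines(["1.0: epic", "0.6: well suited", "-0.5: :("])
# => {"epic"=>1.0, "well suited"=>0.6, ":("=>-0.5}
```

Passing a limit of 2 to `split` is what keeps everything after the first colon together as one token.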

## Contribute
Including baseline word/N-gram scores for many different languages is one
of the expressed goals of this project. If you are able to contribute scores
for a missing language or improve an existing one, it would be much appreciated!

The process is the usual:
* Fork
* Add/improve
* Pull request

## Credits
Loosely based on https://github.com/cmaclell/Basic-Tweet-Sentiment-Analyzer

## Author
Stian Grytøyr
data/bin/textmood
ADDED
@@ -0,0 +1,108 @@
#!/usr/bin/env ruby
#encoding: utf-8

if RUBY_VERSION < '1.9'
  $KCODE='u'
else
  Encoding.default_external = Encoding::UTF_8
  Encoding.default_internal = Encoding::UTF_8
end

$:.unshift File.join(File.dirname(__FILE__), *%w{ .. lib })

require "optparse"
require "textmood"

usage = "Usage: #{File.basename($0)} [options] \"<text>\""

def mini_usage(usage, notext = false)
  puts usage
  puts ""
  if notext
    puts "ERROR: Quoted text must be provided after the last option."
  else
    puts "ERROR: An IETF language tag must be provided using the -l/--language option."
  end
  puts ""
  puts "Use \"#{File.basename($0)} -h\" for full usage info."
  puts ""
  exit 20
end

if ARGV[0] != "-h" and ARGV[0] != "--help" and not (ARGV[0] and ARGV[1])
  mini_usage(usage)
end

options = {:files => []}
opts_parser = OptionParser.new do |opts|
  opts.banner = usage
  opts.separator ""
  opts.separator "Returns a floating-point sentiment score of the provided text."
  opts.separator "Above 0 is considered positive, below is considered negative."
  opts.separator ""
  opts.separator "MANDATORY options:"
  opts.on("-l", "--language LANGUAGE", "The IETF language tag for the provided text.",
                                       "Examples: en_US, no_NB") do |l|
    options[:lang] = l
  end
  opts.separator ""
  opts.separator " OR "
  opts.separator ""
  opts.on("-f", "--file PATH TO FILE", "Use the specified sentiment file. May be used",
                                       "multiple times to load several files. No other",
                                       "files will be loaded if this option is used.") do |f|
    options[:files] << f
  end
  opts.separator ""
  opts.separator "OPTIONAL options:"
  opts.on("--start-ngram INTEGER", "The lowest word N-gram number to split the text into",
                                   "(default 1). Note that this only makes sense if the",
                                   "sentiment file has tokens of similar N-gram length") do |start_ngram|
    options[:start_ngram] = start_ngram.to_i
  end
  opts.separator ""
  opts.on("--end-ngram INTEGER", "The highest word N-gram number to split the text into",
                                 "(default 1). Note that this only makes sense if the",
                                 "sentiment file has tokens of similar N-gram length") do |end_ngram|
    options[:end_ngram] = end_ngram.to_i
  end
  opts.separator ""
  opts.on("-n", "--normalize", "Return 1 (positive), -1 (negative) or 0 (neutral)",
                               "instead of the actual score. See also --min and --max.") do |n|
    options[:normalize] = true
  end
  opts.separator ""
  opts.on("--min-threshold FLOAT", "Scores lower than this are considered negative when",
                                   "using --normalize (default -0.5)") do |min|
    options[:min_threshold] = min.to_f
  end
  opts.separator ""
  opts.on("--max-threshold FLOAT", "Scores higher than this are considered positive when",
                                   "using --normalize (default 0.5)") do |max|
    options[:max_threshold] = max.to_f
  end
  opts.separator ""
  opts.on("-s", "--skip-symbols", "Do not include symbols file (emoticons etc.).",
                                  "Only applies when using -l/--language.") do |s|
    options[:include_symbols] = false
  end
  opts.separator ""
  opts.on("-d", "--debug", "Prints out the score for each token in the provided text",
                           "or 'nil' if the token was not found in the sentiment file") do |d|
    options[:debug] = true
  end
  opts.separator ""
  opts.on_tail("-h", "--help", "Show this message") do
    puts opts
    puts ""
    exit
  end
end
opts_parser.parse!

if ARGV[0]
  scorer = TextMood.new(options)
  puts scorer.score_text(ARGV[0])
else
  mini_usage(usage, true)
end