textmood 0.0.7 → 0.1.0
- data/README.md +47 -14
- data/bin/textmood +6 -0
- data/lang/da.txt +11277 -0
- data/lang/de.txt +12953 -0
- data/lang/{en_US.txt → en.txt} +8370 -8370
- data/lang/es.txt +14855 -0
- data/lang/fr.txt +12708 -0
- data/lang/no_NB.txt +9246 -9246
- data/lang/ru.txt +16494 -0
- data/lang/sv.txt +11106 -0
- data/lib/textmood.rb +22 -1
- metadata +8 -2
data/README.md
CHANGED
@@ -1,5 +1,5 @@
-## TextMood - Simple sentiment analyzer
-
+## TextMood - Simple, powerful sentiment analyzer
+TextMood is a simple and powerful sentiment analyzer, provided as a Ruby gem with
 a command-line tool for simple interoperability with other processes. It takes text
 as input and returns a sentiment score.
 
@@ -12,6 +12,25 @@ it into tokens of N words (N-grams) for each pass. By adding multi-word tokens t
 the sentiment file and using this feature, you can achieve much greater accuracy
 than with just single-word analysis.
 
+### Summary of features
+* Bundles baseline sentiment scores for many languages, making it easy to get started
+* CLI tool that makes it extremely simple to get sentiment scores for any text
+* Supports multiple passes for any range of N-grams
+* Has a flexible API that’s easy to use and understand
+
+### Bundled languages
+* English ("en") - decent quality, copied from cmaclell/Basic-Tweet-Sentiment-Analyzer
+* Russian ("ru") - low quality, raw Google Translate of the English file
+* Spanish ("es") - low quality, raw Google Translate of the English file
+* German ("de") - low quality, raw Google Translate of the English file
+* French ("fr") - low quality, raw Google Translate of the English file
+* Norwegian Bokmål ("no_NB") - low quality, slightly improved Google Translate of the English file
+* Swedish ("sv") - low quality, raw Google Translate of the English file
+* Danish ("da") - low quality, raw Google Translate of the English file
+
+Please see the Contribute section for more info on how to improve the quality of these
+files or add new ones.
+
 ### Installation
 The easiest way to get the latest stable version is to install the gem:
 
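The README text in the hunk above describes splitting the input into tokens of N words (N-grams) for each pass. As a rough illustration of that idea, here is a minimal sketch in plain Ruby. The helper name `word_ngrams` and its keyword arguments are assumptions made for this example; they only mirror the `:start_ngram` and `:end_ngram` options and are not taken from the gem's source.

```ruby
# Hypothetical helper, not part of the TextMood API: builds word N-grams
# for every N from start_ngram to end_ngram, one "pass" per N.
def word_ngrams(text, start_ngram: 1, end_ngram: 1)
  words = text.split
  (start_ngram..end_ngram).flat_map do |n|
    # each_cons(n) yields every run of n consecutive words
    words.each_cons(n).map { |run| run.join(" ") }
  end
end

p word_ngrams("some long text", start_ngram: 1, end_ngram: 2)
# prints ["some", "long", "text", "some long", "long text"]
```

Each resulting token could then be looked up in a sentiment file, which is why multi-word entries in that file only pay off when the N-gram range covers their length.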
@@ -21,6 +40,9 @@ If you’d like to get the bleeding-edge version:
 
 git clone https://github.com/stiang/textmood
 
+The *master* branch will normally be in sync with the gem, but there may be
+newer code in branches.
+
 ### Usage
 TextMood can be used as a Ruby library or as a standalone CLI tool.
 
@@ -30,7 +52,7 @@ You can use it in a Ruby program like this:
 require "textmood"
 
 # The :lang parameter makes TextMood use one of the bundled language sentiment files
-tm = TextMood.new(lang: "
+tm = TextMood.new(lang: "en")
 score = tm.analyze("some text")
 #=> '1.121'
 
@@ -38,16 +60,20 @@ score = tm.analyze("some text")
 # specified files instead. You can specify as many files as you want.
 tm = TextMood.new(files: ["en_US-mod1.txt", "emoticons.txt"])
 
+# Use :alias_file to make TextMood look up the file to use for the given language tag
+# in a JSON file containing a hash with {"language_tag": "path_to_file"} mappings
+tm = TextMood.new(lang: "zw", alias_file: "my-custom-languages.json")
+
 # :normalize_score will try to normalize the score to an integer between +/- 100,
 # based on how many tokens were scored, which can be useful when trying to compare
 # scores for texts of different length
-tm = TextMood.new(lang: "
+tm = TextMood.new(lang: "en", normalize_score: true)
 score = tm.analyze("some text")
 #=> '14'
 
 # :ternary_output will make TextMood return one of three fixed values:
 # 1 for positive, 0 for neutral and -1 for negative
-tm = TextMood.new(lang: "
+tm = TextMood.new(lang: "en", ternary_output: true)
 score = tm.analyze("some text")
 #=> '1'
 
@@ -55,7 +81,7 @@ score = tm.analyze("some text")
 # treats different values. The options below will make all scores below 10 negative,
 # 10-20 will be neutral, and above 20 will be positive. Note that these thresholds
 # are compared to the normalized score, if applicable.
-tm = TextMood.new(lang: "
+tm = TextMood.new(lang: "en",
 ternary_output: true,
 normalize_score: true,
 min_threshold: 10,
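The comments in the hunk above describe how two thresholds turn a score into a ternary value: below the minimum is negative, above the maximum is positive, anything in between is neutral. The following is a minimal sketch of that mapping only. The method name `ternary` and its keyword arguments are hypothetical and chosen for this example; how the gem itself implements the comparison is not shown in this diff.

```ruby
# Hypothetical helper, not taken from the gem: maps a score to -1, 0 or 1
# using a lower and an upper threshold, as the README comments describe.
def ternary(score, min_threshold:, max_threshold:)
  return -1 if score < min_threshold  # below the lower bound: negative
  return 1 if score > max_threshold   # above the upper bound: positive
  0                                   # anything in between: neutral
end

p ternary(14, min_threshold: 10, max_threshold: 20)
# prints 0
```

With the thresholds from the README example (10 and 20), a normalized score of 14 falls in the neutral band.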
@@ -66,7 +92,7 @@ score = tm.analyze("some text")
 # TextMood will by default make one pass over the text, checking every word, but it
 # supports doing several passes for any range of word N-grams. Both the start and end
 # N-gram can be specified using the :start_ngram and :end_ngram options
-tm = TextMood.new(lang: "
+tm = TextMood.new(lang: "en", debug: true, start_ngram: 2, end_ngram: 3)
 score = tm.analyze("some long text with many words")
 #(stdout): some long: 0.1
 #(stdout): long text: 0.1
@@ -81,7 +107,7 @@ score = tm.analyze("some long text with many words")
 
 # :debug prints out all tokens to stdout, along with their values (or 'nil' when the
 # token was not found)
-tm = TextMood.new(lang: "
+tm = TextMood.new(lang: "en", debug: true)
 score = tm.analyze("some text")
 #(stdout): some: 0.1
 #(stdout): text: 0.1
@@ -92,13 +118,13 @@ score = tm.analyze("some text")
 #### CLI tool
 You can also pass some UTF-8-encoded text to the CLI tool and get a score back, like so
 ```bash
-textmood -l
+textmood -l en "<some text>"
 -0.4375
 ```
 
 Alternatively, you can pipe some text to textmood on stdin:
 ```bash
-echo "<some text>" | textmood -l
+echo "<some text>" | textmood -l en
 -0.4375
 ```
 
@@ -114,7 +140,7 @@ Above 0 is considered positive, below is considered negative.
 
 MANDATORY options:
 -l, --language LANGUAGE The IETF language tag for the provided text.
-Examples:
+Examples: en, fr, no_NB, sv,
 
 OR
 
|
@@ -133,7 +159,7 @@ OPTIONAL options:
|
|
133
159
|
and --max-threshold.
|
134
160
|
|
135
161
|
-i, --min-threshold FLOAT Scores lower than this are considered negative when
|
136
|
-
using --ternary-output (default 0.5). Note that the
|
162
|
+
using --ternary-output (default -0.5). Note that the
|
137
163
|
threshold is compared to the normalized score, if applicable
|
138
164
|
|
139
165
|
-x, --max-threshold FLOAT Scores higher than this are considered positive when
|
@@ -161,8 +187,8 @@ OPTIONAL options:
 The included sentiment files reside in the *lang* directory. I hope to add many
 more baseline sentiment files in the future.
 
-Sentiment files should be named according to the IETF language tag, like *
-and contain one colon-separated line per token, like so:
+Sentiment files should be named according to the IETF language tag, like *en* or
+*no_NB*, and contain one colon-separated line per token, like so:
 ```
 1.0: epic
 1.0: good
@@ -175,10 +201,17 @@ and contain one colon-separated line per token, like so:
 0.875: well-to-do
 0.875: well-situated
 0.6: well suited
+-0.3: dishonest
+-0.5: tragedy
 ```
 The score, which must be between -1.0 and 1.0, is to the left of the first ':',
 and everything to the right is the (potentially multi-word) token.
 
+# TODO
+* Add more sentiment language files
+* Improve sentiment files, adding bigrams and trigrams
+* Improve test coverage
+
 ## Contribute
 Including baseline word/N-gram scores for many different languages is one
 of the expressed goals of this project. If you are able to contribute scores
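The README hunks above describe the sentiment file format: one line per token, with a score between -1.0 and 1.0 to the left of the first ':' and the (possibly multi-word) token to the right. A minimal sketch of reading one such line might look like the following. The method name `parse_sentiment_line` is hypothetical, and returning `nil` for malformed or out-of-range lines is an assumption of this example, not documented behavior of the gem.

```ruby
# Hypothetical parser for one sentiment file line ("score: token").
# Returns [token, score], or nil when the line does not fit the format.
def parse_sentiment_line(line)
  score_text, token = line.split(":", 2) # split on the first ':' only
  return nil if token.nil?               # no ':' on the line

  score = Float(score_text.strip, exception: false)
  return nil if score.nil? || !score.between?(-1.0, 1.0)

  [token.strip, score]
end

p parse_sentiment_line("0.6: well suited")
# prints ["well suited", 0.6]
```

Splitting with a limit of 2 keeps any further colons inside the token, which matches the "first ':'" rule stated in the README.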
data/bin/textmood
CHANGED
@@ -56,6 +56,12 @@ opts_parser = OptionParser.new do |opts|
 end
 opts.separator ""
 opts.separator "OPTIONAL options:"
+opts.on("-a", "--alias-file PATH TO FILE", "JSON file containing a hash that maps language codes to",
+"sentiment score files. This lets you use the convenience of",
+"language codes with custom sentiment score files.") do |a|
+options[:alias_file] = a.to_s
+end
+opts.separator ""
 opts.on("-n", "--normalize-score", "Tries to normalize the score to an integer between +/- 100",
 "according to the number of tokens that were scored, making",
 "it more feasible to compare scores for texts of different",
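The new `--alias-file` option is described as taking a JSON file containing a hash that maps language codes to sentiment score files. The sketch below shows what such a lookup could look like in isolation. It is not the gem's implementation: the method name `sentiment_file_for`, the example path, and the choice to raise on an unknown language are all assumptions, and the JSON is passed as a string here so the example stays self-contained.

```ruby
require "json"

# Hypothetical lookup, not the gem's actual code: resolves a language tag
# to a sentiment file path using a {"language_tag": "path_to_file"} hash.
def sentiment_file_for(lang, alias_json)
  aliases = JSON.parse(alias_json)
  aliases.fetch(lang) do
    raise ArgumentError, "no sentiment file mapped for #{lang.inspect}"
  end
end

# Example alias data; the path is made up for illustration.
alias_json = '{"zw": "custom/zw-sentiment.txt"}'
puts sentiment_file_for("zw", alias_json)
# prints custom/zw-sentiment.txt
```

In practice the JSON would be read from the path given to `--alias-file` (or `:alias_file` in the Ruby API) before the lookup.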