textmood 0.0.7 → 0.1.0

Files changed (12)
  1. data/README.md +47 -14
  2. data/bin/textmood +6 -0
  3. data/lang/da.txt +11277 -0
  4. data/lang/de.txt +12953 -0
  5. data/lang/{en_US.txt → en.txt} +8370 -8370
  6. data/lang/es.txt +14855 -0
  7. data/lang/fr.txt +12708 -0
  8. data/lang/no_NB.txt +9246 -9246
  9. data/lang/ru.txt +16494 -0
  10. data/lang/sv.txt +11106 -0
  11. data/lib/textmood.rb +22 -1
  12. metadata +8 -2
data/README.md CHANGED
@@ -1,5 +1,5 @@
- ## TextMood - Simple sentiment analyzer
- *TextMood* is a simple but powerful sentiment analyzer, provided as a Ruby gem with
+ ## TextMood - Simple, powerful sentiment analyzer
+ TextMood is a simple and powerful sentiment analyzer, provided as a Ruby gem with
  a command-line tool for simple interoperability with other processes. It takes text
  as input and returns a sentiment score.
 
@@ -12,6 +12,25 @@ it into tokens of N words (N-grams) for each pass. By adding multi-word tokens t
  the sentiment file and using this feature, you can achieve much greater accuracy
  than with just single-word analysis.
 
+ ### Summary of features
+ * Bundles baseline sentiment scores for many languages, making it easy to get started
+ * CLI tool that makes it extremely simple to get sentiment scores for any text
+ * Supports multiple passes for any range of N-grams
+ * Has a flexible API that’s easy to use and understand
+
+ ### Bundled languages
+ * English ("en") - decent quality, copied from cmaclell/Basic-Tweet-Sentiment-Analyzer
+ * Russian ("ru") - low quality, raw Google Translate of the English file
+ * Spanish ("es") - low quality, raw Google Translate of the English file
+ * German ("de") - low quality, raw Google Translate of the English file
+ * French ("fr") - low quality, raw Google Translate of the English file
+ * Norwegian Bokmål ("no_NB") - low quality, slightly improved Google Translate of the English file
+ * Swedish ("se") - low quality, raw Google Translate of the English file
+ * Danish ("da") - low quality, raw Google Translate of the English file
+
+ Please see the Contribute section for more info on how to improve the quality of these
+ files, or adding new ones.
+
  ### Installation
  The easiest way to get the latest stable version is to install the gem:
 
@@ -21,6 +40,9 @@ If you’d like to get the bleeding-edge version:
 
  git clone https://github.com/stiang/textmood
 
+ The *master* branch will normally be in sync with the gem, but there may be
+ newer code in branches.
+
  ### Usage
  TextMood can be used as a Ruby library or as a standalone CLI tool.
 
@@ -30,7 +52,7 @@ You can use it in a Ruby program like this:
  require "textmood"
 
  # The :lang parameter makes TextMood use one of the bundled language sentiment files
- tm = TextMood.new(lang: "en_US")
+ tm = TextMood.new(lang: "en")
  score = tm.analyze("some text")
  #=> '1.121'
 
@@ -38,16 +60,20 @@ score = tm.analyze("some text")
  # specified files instead. You can specify as many files as you want.
  tm = TextMood.new(files: ["en_US-mod1.txt", "emoticons.txt"])
 
+ # Use :alias_file to make TextMood look up the file to use for the given language tag
+ # in a JSON file containing a hash with {"language_tag": "path_to_file"} mappings
+ tm = TextMood.new(lang: "zw", alias_file: "my-custom-languages.json")
+
  # :normalize_score will try to normalize the score to an integer between +/- 100,
  # based on how many tokens were scored, which can be useful when trying to compare
  # scores for texts of different length
- tm = TextMood.new(lang: "en_US", normalize_score: true)
+ tm = TextMood.new(lang: "en", normalize_score: true)
  score = tm.analyze("some text")
  #=> '14'
 
  # :ternary_output will make TextMood return one of three fixed values:
  # 1 for positive, 0 for neutral and -1 for negative
- tm = TextMood.new(lang: "en_US", ternary_output: true)
+ tm = TextMood.new(lang: "en", ternary_output: true)
  score = tm.analyze("some text")
  #=> '1'
 
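The `:alias_file` option added in this hunk is described as a JSON hash of `{"language_tag": "path_to_file"}` mappings. As a rough illustration of that lookup (this is not TextMood's actual implementation, and `resolve_sentiment_file` is a hypothetical helper), the mapping could be resolved along these lines:

```ruby
require "json"

# Hypothetical helper for illustration only; TextMood's internal lookup may differ.
# Reads a JSON alias file and returns the sentiment file path mapped to +lang+.
def resolve_sentiment_file(lang, alias_file)
  aliases = JSON.parse(File.read(alias_file))
  unless aliases.is_a?(Hash)
    raise ArgumentError, "alias file must contain a JSON hash"
  end
  # Hash#fetch with a block lets us raise a descriptive error for unknown tags
  aliases.fetch(lang) do
    raise ArgumentError, "no sentiment file mapped to language tag #{lang.inspect}"
  end
end
```

Raising on an unknown tag, rather than returning `nil`, is a design choice of this sketch so that a typo in the language tag fails early.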
@@ -55,7 +81,7 @@ score = tm.analyze("some text")
  # treats different values. The options below will make all scores below 10 negative,
  # 10-20 will be neutral, and above 20 will be positive. Note that these thresholds
  # are compared to the normalized score, if applicable.
- tm = TextMood.new(lang: "en_US",
+ tm = TextMood.new(lang: "en",
  ternary_output: true,
  normalize_score: true,
  min_threshold: 10,
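The threshold rule in the README comment above (below the minimum is negative, above the maximum is positive, anything in between is neutral) can be sketched as a small function. The name `ternary_score` and its keyword arguments are assumptions made for this illustration, not part of TextMood's API:

```ruby
# Illustrative only: maps a numeric score to -1, 0 or 1 using two thresholds.
# The method name and keyword arguments are assumptions, not TextMood's API.
def ternary_score(score, min_threshold:, max_threshold:)
  return -1 if score < min_threshold  # strictly below the minimum: negative
  return 1 if score > max_threshold   # strictly above the maximum: positive
  0                                   # everything in between: neutral
end
```

With thresholds of 10 and 20, as in the README example, a score of exactly 10 or 20 falls in the neutral band under this sketch.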
@@ -66,7 +92,7 @@ score = tm.analyze("some text")
  # TextMood will by default make one pass over the text, checking every word, but it
  # supports doing several passes for any range of word N-grams. Both the start and end
  # N-gram can be specified using the :start_ngram and :end_ngram options
- tm = TextMood.new(lang: "en_US", debug: true, start_ngram: 2, end_ngram: 3)
+ tm = TextMood.new(lang: "en", debug: true, start_ngram: 2, end_ngram: 3)
  score = tm.analyze("some long text with many words")
  #(stdout): some long: 0.1
  #(stdout): long text: 0.1
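To make the multi-pass idea concrete, here is a minimal sketch of how word N-grams for a range such as 2..3 could be generated. It assumes plain whitespace tokenization, which may not match TextMood's own tokenizer, and `word_ngrams` is a hypothetical name:

```ruby
# Hypothetical helper: returns all word N-grams of +text+ for every N in
# start_ngram..end_ngram, one pass per N. Assumes whitespace tokenization.
def word_ngrams(text, start_ngram, end_ngram)
  words = text.split
  (start_ngram..end_ngram).flat_map do |n|
    # each_cons(n) yields every run of n consecutive words
    words.each_cons(n).map { |gram| gram.join(" ") }
  end
end

word_ngrams("some long text with many words", 2, 3).first(2)
#=> ["some long", "long text"]
```

The first two tokens match the start of the debug output shown in the hunk above.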
@@ -81,7 +107,7 @@ score = tm.analyze("some long text with many words")
 
  # :debug prints out all tokens to stdout, alongs with their values (or 'nil' when the
  # token was not found)
- tm = TextMood.new(lang: "en_US", debug: true)
+ tm = TextMood.new(lang: "en", debug: true)
  score = tm.analyze("some text")
  #(stdout): some: 0.1
  #(stdout): text: 0.1
@@ -92,13 +118,13 @@ score = tm.analyze("some text")
  #### CLI tool
  You can also pass some UTF-8-encoded text to the CLI tool and get a score back, like so
  ```bash
- textmood -l en_US "<some text>"
+ textmood -l en "<some text>"
  -0.4375
  ```
 
  Alternatively, you can pipe some text to textmood on stdin:
  ```bash
- echo "<some text>" | textmood -l en_US
+ echo "<some text>" | textmood -l en
  -0.4375
  ```
 
@@ -114,7 +140,7 @@ Above 0 is considered positive, below is considered negative.
 
  MANDATORY options:
  -l, --language LANGUAGE The IETF language tag for the provided text.
- Examples: en_US, no_NB
+ Examples: en, fr, no_NB, sv,
 
  OR
 
@@ -133,7 +159,7 @@ OPTIONAL options:
  and --max-threshold.
 
  -i, --min-threshold FLOAT Scores lower than this are considered negative when
- using --ternary-output (default 0.5). Note that the
+ using --ternary-output (default -0.5). Note that the
  threshold is compared to the normalized score, if applicable
 
  -x, --max-threshold FLOAT Scores higher than this are considered positive when
@@ -161,8 +187,8 @@ OPTIONAL options:
  The included sentiment files reside in the *lang* directory. I hope to add many
  more baseline sentiment files in the future.
 
- Sentiment files should be named according to the IETF language tag, like *en_US*,
- and contain one colon-separated line per token, like so:
+ Sentiment files should be named according to the IETF language tag, like *en* or
+ *no_NB*, and contain one colon-separated line per token, like so:
  ```
  1.0: epic
  1.0: good
@@ -175,10 +201,17 @@ and contain one colon-separated line per token, like so:
  0.875: well-to-do
  0.875: well-situated
  0.6: well suited
+ -0.3: dishonest
+ -0.5: tragedy
  ```
  The score, which must be between -1.0 and 1.0, is to the left of the first ':',
  and everything to the right is the (potentially multi-word) token.
 
+ # TODO
+ * Add more sentiment language files
+ * Improve sentiment files, adding bigrams and trigrams
+ * Improve test coverage
+
  ## Contribute
  Including baseline word/N-gram scores for many different languages is one
  of the expressed goals of this project. If you are able to contribute scores
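The sentiment file format described in the README hunk above (a score between -1.0 and 1.0 to the left of the first ':', and a possibly multi-word token to the right) could be parsed roughly as follows. This is a sketch under those stated rules only; `parse_sentiment_line` is a hypothetical helper and is not taken from TextMood's source:

```ruby
# Hypothetical parser for one sentiment-file line such as "0.6: well suited".
# Splits on the first ':' only, so the token itself may contain spaces or colons.
def parse_sentiment_line(line)
  score_text, token = line.split(":", 2)
  raise ArgumentError, "missing ':' separator in #{line.inspect}" if token.nil?

  score = Float(score_text.strip) # raises ArgumentError on non-numeric input
  unless score.between?(-1.0, 1.0)
    raise ArgumentError, "score #{score} is outside -1.0..1.0"
  end
  [token.strip, score]
end

parse_sentiment_line("0.6: well suited")
#=> ["well suited", 0.6]
```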
data/bin/textmood CHANGED
@@ -56,6 +56,12 @@ opts_parser = OptionParser.new do |opts|
  end
  opts.separator ""
  opts.separator "OPTIONAL options:"
+ opts.on("-a", "--alias-file PATH TO FILE", "JSON file containing a hash that maps language codes to",
+ "sentiment score files. This lets you use the convenience of",
+ "language codes with custom sentiment score files.") do |a|
+ options[:alias_file] = a.to_s
+ end
+ opts.separator ""
  opts.on("-n", "--normalize-score", "Tries to normalize the score to an integer between +/- 100",
  "according to the number of tokens that were scored, making",
  "it more feasible to compare scores for texts of different",