textmood 0.0.7 → 0.1.0

Files changed (12)
  1. data/README.md +47 -14
  2. data/bin/textmood +6 -0
  3. data/lang/da.txt +11277 -0
  4. data/lang/de.txt +12953 -0
  5. data/lang/{en_US.txt → en.txt} +8370 -8370
  6. data/lang/es.txt +14855 -0
  7. data/lang/fr.txt +12708 -0
  8. data/lang/no_NB.txt +9246 -9246
  9. data/lang/ru.txt +16494 -0
  10. data/lang/sv.txt +11106 -0
  11. data/lib/textmood.rb +22 -1
  12. metadata +8 -2
data/README.md CHANGED
@@ -1,5 +1,5 @@
- ## TextMood - Simple sentiment analyzer
- *TextMood* is a simple but powerful sentiment analyzer, provided as a Ruby gem with
+ ## TextMood - Simple, powerful sentiment analyzer
+ TextMood is a simple and powerful sentiment analyzer, provided as a Ruby gem with
  a command-line tool for simple interoperability with other processes. It takes text
  as input and returns a sentiment score.
 
@@ -12,6 +12,25 @@ it into tokens of N words (N-grams) for each pass. By adding multi-word tokens t
  the sentiment file and using this feature, you can achieve much greater accuracy
  than with just single-word analysis.
 
+ ### Summary of features
+ * Bundles baseline sentiment scores for many languages, making it easy to get started
+ * CLI tool that makes it extremely simple to get sentiment scores for any text
+ * Supports multiple passes for any range of N-grams
+ * Has a flexible API that’s easy to use and understand
+
+ ### Bundled languages
+ * English ("en") - decent quality, copied from cmaclell/Basic-Tweet-Sentiment-Analyzer
+ * Russian ("ru") - low quality, raw Google Translate of the English file
+ * Spanish ("es") - low quality, raw Google Translate of the English file
+ * German ("de") - low quality, raw Google Translate of the English file
+ * French ("fr") - low quality, raw Google Translate of the English file
+ * Norwegian Bokmål ("no_NB") - low quality, slightly improved Google Translate of the English file
+ * Swedish ("se") - low quality, raw Google Translate of the English file
+ * Danish ("da") - low quality, raw Google Translate of the English file
+
+ Please see the Contribute section for more info on how to improve the quality of these
+ files, or adding new ones.
+
  ### Installation
  The easiest way to get the latest stable version is to install the gem:
 
@@ -21,6 +40,9 @@ If you’d like to get the bleeding-edge version:
 
  git clone https://github.com/stiang/textmood
 
+ The *master* branch will normally be in sync with the gem, but there may be
+ newer code in branches.
+
  ### Usage
  TextMood can be used as a Ruby library or as a standalone CLI tool.
 
@@ -30,7 +52,7 @@ You can use it in a Ruby program like this:
  require "textmood"
 
  # The :lang parameter makes TextMood use one of the bundled language sentiment files
- tm = TextMood.new(lang: "en_US")
+ tm = TextMood.new(lang: "en")
  score = tm.analyze("some text")
  #=> '1.121'
 
@@ -38,16 +60,20 @@ score = tm.analyze("some text")
  # specified files instead. You can specify as many files as you want.
  tm = TextMood.new(files: ["en_US-mod1.txt", "emoticons.txt"])
 
+ # Use :alias_file to make TextMood look up the file to use for the given language tag
+ # in a JSON file containing a hash with {"language_tag": "path_to_file"} mappings
+ tm = TextMood.new(lang: "zw", alias_file: "my-custom-languages.json")
+
  # :normalize_score will try to normalize the score to an integer between +/- 100,
  # based on how many tokens were scored, which can be useful when trying to compare
  # scores for texts of different length
- tm = TextMood.new(lang: "en_US", normalize_score: true)
+ tm = TextMood.new(lang: "en", normalize_score: true)
  score = tm.analyze("some text")
  #=> '14'
 
  # :ternary_output will make TextMood return one of three fixed values:
  # 1 for positive, 0 for neutral and -1 for negative
- tm = TextMood.new(lang: "en_US", ternary_output: true)
+ tm = TextMood.new(lang: "en", ternary_output: true)
  score = tm.analyze("some text")
  #=> '1'
 
@@ -55,7 +81,7 @@ score = tm.analyze("some text")
  # treats different values. The options below will make all scores below 10 negative,
  # 10-20 will be neutral, and above 20 will be positive. Note that these thresholds
  # are compared to the normalized score, if applicable.
- tm = TextMood.new(lang: "en_US",
+ tm = TextMood.new(lang: "en",
  ternary_output: true,
  normalize_score: true,
  min_threshold: 10,
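Review note: the comment in this hunk says that scores below 10 become negative, 10-20 neutral, and above 20 positive. That mapping can be sketched as below. `ternary` is a hypothetical function, and treating the two threshold values themselves as neutral is an assumption drawn from the wording "lower than" and "higher than" in the CLI help, not from the gem's source.

```ruby
# Hypothetical sketch of ternary output (not TextMood's own code):
# -1 below min_threshold, 1 above max_threshold, 0 otherwise.
def ternary(score, min_threshold:, max_threshold:)
  return -1 if score < min_threshold
  return 1 if score > max_threshold
  0
end

[5, 10, 14, 20, 21].each do |score|
  puts "#{score} => #{ternary(score, min_threshold: 10, max_threshold: 20)}"
end
```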
@@ -66,7 +92,7 @@ score = tm.analyze("some text")
  # TextMood will by default make one pass over the text, checking every word, but it
  # supports doing several passes for any range of word N-grams. Both the start and end
  # N-gram can be specified using the :start_ngram and :end_ngram options
- tm = TextMood.new(lang: "en_US", debug: true, start_ngram: 2, end_ngram: 3)
+ tm = TextMood.new(lang: "en", debug: true, start_ngram: 2, end_ngram: 3)
  score = tm.analyze("some long text with many words")
  #(stdout): some long: 0.1
  #(stdout): long text: 0.1
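Review note: the N-gram passes described in this hunk can be illustrated in plain Ruby. This is a sketch that assumes simple whitespace tokenization; `word_ngrams` is a hypothetical helper, and TextMood's own tokenizer may split text differently.

```ruby
# Hypothetical helper: return every run of n consecutive words as one token.
def word_ngrams(text, n)
  text.split.each_cons(n).map { |words| words.join(" ") }
end

text = "some long text with many words"

# One pass per N-gram size, mirroring start_ngram: 2, end_ngram: 3.
(2..3).each do |n|
  word_ngrams(text, n).each { |token| puts token }
end
```

The first two tokens of the bigram pass are "some long" and "long text", which matches the debug output shown in the hunk.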
@@ -81,7 +107,7 @@ score = tm.analyze("some long text with many words")
 
  # :debug prints out all tokens to stdout, alongs with their values (or 'nil' when the
  # token was not found)
- tm = TextMood.new(lang: "en_US", debug: true)
+ tm = TextMood.new(lang: "en", debug: true)
  score = tm.analyze("some text")
  #(stdout): some: 0.1
  #(stdout): text: 0.1
@@ -92,13 +118,13 @@ score = tm.analyze("some text")
  #### CLI tool
  You can also pass some UTF-8-encoded text to the CLI tool and get a score back, like so
  ```bash
- textmood -l en_US "<some text>"
+ textmood -l en "<some text>"
  -0.4375
  ```
 
  Alternatively, you can pipe some text to textmood on stdin:
  ```bash
- echo "<some text>" | textmood -l en_US
+ echo "<some text>" | textmood -l en
  -0.4375
  ```
 
@@ -114,7 +140,7 @@ Above 0 is considered positive, below is considered negative.
 
  MANDATORY options:
  -l, --language LANGUAGE The IETF language tag for the provided text.
- Examples: en_US, no_NB
+ Examples: en, fr, no_NB, sv,
 
  OR
 
@@ -133,7 +159,7 @@ OPTIONAL options:
  and --max-threshold.
 
  -i, --min-threshold FLOAT Scores lower than this are considered negative when
- using --ternary-output (default 0.5). Note that the
+ using --ternary-output (default -0.5). Note that the
  threshold is compared to the normalized score, if applicable
 
  -x, --max-threshold FLOAT Scores higher than this are considered positive when
@@ -161,8 +187,8 @@ OPTIONAL options:
  The included sentiment files reside in the *lang* directory. I hope to add many
  more baseline sentiment files in the future.
 
- Sentiment files should be named according to the IETF language tag, like *en_US*,
- and contain one colon-separated line per token, like so:
+ Sentiment files should be named according to the IETF language tag, like *en* or
+ *no_NB*, and contain one colon-separated line per token, like so:
  ```
  1.0: epic
  1.0: good
@@ -175,10 +201,17 @@ and contain one colon-separated line per token, like so:
  0.875: well-to-do
  0.875: well-situated
  0.6: well suited
+ -0.3: dishonest
+ -0.5: tragedy
  ```
  The score, which must be between -1.0 and 1.0, is to the left of the first ':',
  and everything to the right is the (potentially multi-word) token.
 
+ # TODO
+ * Add more sentiment language files
+ * Improve sentiment files, adding bigrams and trigrams
+ * Improve test coverage
+
  ## Contribute
  Including baseline word/N-gram scores for many different languages is one
  of the expressed goals of this project. If you are able to contribute scores
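Review note: the README describes the sentiment file format as one line per token, with a score between -1.0 and 1.0 to the left of the first ':' and the token to the right. A reader for that format could be sketched as follows. `parse_sentiment_line` is a hypothetical helper, not the gem's loader, and skipping blank lines while rejecting out-of-range scores are assumptions made for this sketch.

```ruby
# Hypothetical parser for one sentiment file line (not TextMood's own code).
# Splits on the first ':' only, so the token may contain spaces or colons.
def parse_sentiment_line(line)
  score_text, token = line.split(":", 2)
  return nil if token.nil? # blank or malformed line

  score = Float(score_text.strip)
  raise ArgumentError, "score out of range: #{score}" unless score.between?(-1.0, 1.0)

  [token.strip, score]
end

lines = ["1.0: epic", "0.6: well suited", "-0.5: tragedy"]
scores = lines.map { |line| parse_sentiment_line(line) }.compact.to_h
p scores
```

Splitting with a limit of 2 keeps multi-word tokens such as "well suited" intact.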
data/bin/textmood CHANGED
@@ -56,6 +56,12 @@ opts_parser = OptionParser.new do |opts|
  end
  opts.separator ""
  opts.separator "OPTIONAL options:"
+ opts.on("-a", "--alias-file PATH TO FILE", "JSON file containing a hash that maps language codes to",
+ "sentiment score files. This lets you use the convenience of",
+ "language codes with custom sentiment score files.") do |a|
+ options[:alias_file] = a.to_s
+ end
+ opts.separator ""
  opts.on("-n", "--normalize-score", "Tries to normalize the score to an integer between +/- 100",
  "according to the number of tokens that were scored, making",
  "it more feasible to compare scores for texts of different",