RubyGems - textmood - Versions diffs - 0.0.7 → 0.1.0 - Mend

textmood 0.0.7 → 0.1.0


Files changed (12)

data/README.md +47 -14
data/bin/textmood +6 -0
data/lang/da.txt +11277 -0
data/lang/de.txt +12953 -0
data/lang/{en_US.txt → en.txt} +8370 -8370
data/lang/es.txt +14855 -0
data/lang/fr.txt +12708 -0
data/lang/no_NB.txt +9246 -9246
data/lang/ru.txt +16494 -0
data/lang/sv.txt +11106 -0
data/lib/textmood.rb +22 -1
metadata +8 -2

data/README.md CHANGED

@@ -1,5 +1,5 @@
-## TextMood - Simple sentiment analyzer
-*TextMood* is a simple but powerful sentiment analyzer, provided as a Ruby gem with
+## TextMood - Simple, powerful sentiment analyzer
+TextMood is a simple and powerful sentiment analyzer, provided as a Ruby gem with
 a command-line tool for simple interoperability with other processes. It takes text
 as input and returns a sentiment score.
@@ -12,6 +12,25 @@ it into tokens of N words (N-grams) for each pass. By adding multi-word tokens t
 the sentiment file and using this feature, you can achieve much greater accuracy
 than with just single-word analysis.
+### Summary of features
+* Bundles baseline sentiment scores for many languages, making it easy to get started
+* CLI tool that makes it extremely simple to get sentiment scores for any text
+* Supports multiple passes for any range of N-grams
+* Has a flexible API that’s easy to use and understand
+### Bundled languages
+* English ("en") - decent quality, copied from cmaclell/Basic-Tweet-Sentiment-Analyzer
+* Russian ("ru") - low quality, raw Google Translate of the English file
+* Spanish ("es") - low quality, raw Google Translate of the English file
+* German ("de") - low quality, raw Google Translate of the English file
+* French ("fr") - low quality, raw Google Translate of the English file
+* Norwegian Bokmål ("no_NB") - low quality, slightly improved Google Translate of the English file
+* Swedish ("se") - low quality, raw Google Translate of the English file
+* Danish ("da") - low quality, raw Google Translate of the English file
+Please see the Contribute section for more info on how to improve the quality of these
+files, or adding new ones.
 ### Installation
 The easiest way to get the latest stable version is to install the gem:
@@ -21,6 +40,9 @@ If you’d like to get the bleeding-edge version:
     git clone https://github.com/stiang/textmood
+The *master* branch will normally be in sync with the gem, but there may be
+newer code in branches.
 ### Usage
 TextMood can be used as a Ruby library or as a standalone CLI tool.
@@ -30,7 +52,7 @@ You can use it in a Ruby program like this:
 require "textmood"
 # The :lang parameter makes TextMood use one of the bundled language sentiment files
-tm = TextMood.new(lang: "en_US")
+tm = TextMood.new(lang: "en")
 score = tm.analyze("some text")
 #=> '1.121'
@@ -38,16 +60,20 @@ score = tm.analyze("some text")
 # specified files instead. You can specify as many files as you want.
 tm = TextMood.new(files: ["en_US-mod1.txt", "emoticons.txt"])
+# Use :alias_file to make TextMood look up the file to use for the given language tag
+# in a JSON file containing a hash with {"language_tag": "path_to_file"} mappings
+tm = TextMood.new(lang: "zw", alias_file: "my-custom-languages.json")
 # :normalize_score will try to normalize the score to an integer between +/- 100,
 # based on how many tokens were scored, which can be useful when trying to compare
 # scores for texts of different length
-tm = TextMood.new(lang: "en_US", normalize_score: true)
+tm = TextMood.new(lang: "en", normalize_score: true)
 score = tm.analyze("some text")
 #=> '14'
 # :ternary_output will make TextMood return one of three fixed values:
 # 1 for positive, 0 for neutral and -1 for negative
-tm = TextMood.new(lang: "en_US", ternary_output: true)
+tm = TextMood.new(lang: "en", ternary_output: true)
 score = tm.analyze("some text")
 #=> '1'
@@ -55,7 +81,7 @@ score = tm.analyze("some text")
 # treats different values. The options below will make all scores below 10 negative,
 # 10-20 will be neutral, and above 20 will be positive. Note that these thresholds
 # are compared to the normalized score, if applicable.
-tm = TextMood.new(lang: "en_US",
+tm = TextMood.new(lang: "en",
                   ternary_output: true,
                   normalize_score: true,
                   min_threshold: 10,
@@ -66,7 +92,7 @@ score = tm.analyze("some text")
 # TextMood will by default make one pass over the text, checking every word, but it
 # supports doing several passes for any range of word N-grams. Both the start and end
 # N-gram can be specified using the :start_ngram and :end_ngram options
-tm = TextMood.new(lang: "en_US", debug: true, start_ngram: 2, end_ngram: 3)
+tm = TextMood.new(lang: "en", debug: true, start_ngram: 2, end_ngram: 3)
 score = tm.analyze("some long text with many words")
 #(stdout): some long: 0.1
 #(stdout): long text: 0.1
@@ -81,7 +107,7 @@ score = tm.analyze("some long text with many words")
 # :debug prints out all tokens to stdout, alongs with their values (or 'nil' when the
 # token was not found)
-tm = TextMood.new(lang: "en_US", debug: true)
+tm = TextMood.new(lang: "en", debug: true)
 score = tm.analyze("some text")
 #(stdout): some: 0.1
 #(stdout): text: 0.1
@@ -92,13 +118,13 @@ score = tm.analyze("some text")
 #### CLI tool
 You can also pass some UTF-8-encoded text to the CLI tool and get a score back, like so
 ```bash
-textmood -l en_US "<some text>"
+textmood -l en "<some text>"
 -0.4375
 ```
 Alternatively, you can pipe some text to textmood on stdin:
 ```bash
-echo "<some text>" | textmood -l en_US
+echo "<some text>" | textmood -l en
 -0.4375
 ```
@@ -114,7 +140,7 @@ Above 0 is considered positive, below is considered negative.
 MANDATORY options:
     -l, --language LANGUAGE          The IETF language tag for the provided text.
-                                     Examples: en_US, no_NB
+                                     Examples: en, fr, no_NB, sv,
               OR
@@ -133,7 +159,7 @@ OPTIONAL options:
                                      and --max-threshold.
     -i, --min-threshold FLOAT        Scores lower than this are considered negative when
-                                     using --ternary-output (default 0.5). Note that the
+                                     using --ternary-output (default -0.5). Note that the
                                      threshold is compared to the normalized score, if applicable
     -x, --max-threshold FLOAT        Scores higher than this are considered positive when
@@ -161,8 +187,8 @@ OPTIONAL options:
 The included sentiment files reside in the *lang* directory. I hope to add many
 more baseline sentiment files in the future.
-Sentiment files should be named according to the IETF language tag, like *en_US*,
-and contain one colon-separated line per token, like so:
+Sentiment files should be named according to the IETF language tag, like *en* or
+*no_NB*, and contain one colon-separated line per token, like so:
 ```
 1.0: epic
 1.0: good
@@ -175,10 +201,17 @@ and contain one colon-separated line per token, like so:
 0.875: well-to-do
 0.875: well-situated
 0.6: well suited
+-0.3: dishonest
+-0.5: tragedy
 ```
 The score, which must be between -1.0 and 1.0, is to the left of the first ':',
 and everything to the right is the (potentially multi-word) token.
+# TODO
+* Add more sentiment language files
+* Improve sentiment files, adding bigrams and trigrams
+* Improve test coverage
 ## Contribute
 Including baseline word/N-gram scores for many different languages is one
 of the expressed goals of this project. If you are able to contribute scores
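
Note on the sentiment file format described in the README hunk above: each line holds a score between -1.0 and 1.0 to the left of the first ':' and a (potentially multi-word) token to the right. A minimal Ruby sketch of a parser for that format follows. The helper names `parse_sentiment_line` and `parse_sentiment_text` are hypothetical and are not taken from the gem; how textmood.rb actually loads these files is not shown in this diff, so treat this as an illustration of the format only.

```ruby
# Hypothetical helpers illustrating the "score: token" line format.
# These are NOT part of the TextMood gem.

# Parses one line such as "0.6: well suited".
# Returns [token, score], or nil when the line does not fit the format.
def parse_sentiment_line(line)
  # Split on the first ':' only, so a token may itself contain colons.
  score_text, token = line.split(":", 2)
  return nil if token.nil?

  score = Float(score_text.strip, exception: false)
  # The README states that scores must lie between -1.0 and 1.0.
  return nil if score.nil? || !score.between?(-1.0, 1.0)

  token = token.strip
  return nil if token.empty?

  [token, score]
end

# Builds a token => score hash from the full text of a sentiment file,
# silently skipping lines that do not parse.
def parse_sentiment_text(text)
  text.each_line.filter_map { |line| parse_sentiment_line(line) }.to_h
end
```

Splitting with a limit of 2 keeps multi-word tokens such as "well suited" intact, and skipping malformed lines is a design choice of this sketch rather than documented gem behavior.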

data/bin/textmood CHANGED

@@ -56,6 +56,12 @@ opts_parser = OptionParser.new do |opts|
   end
   opts.separator ""
   opts.separator "OPTIONAL options:"
+  opts.on("-a", "--alias-file PATH TO FILE", "JSON file containing a hash that maps language codes to",
+                                             "sentiment score files. This lets you use the convenience of",
+                                             "language codes with custom sentiment score files.") do |a|
+    options[:alias_file] = a.to_s
+  end
+  opts.separator ""
   opts.on("-n", "--normalize-score", "Tries to normalize the score to an integer between +/- 100",
                                      "according to the number of tokens that were scored, making",
                                      "it more feasible to compare scores for texts of different",
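
Note on the new --alias-file option: according to the option text and the README hunk, the alias file is a JSON file containing a hash of {"language_tag": "path_to_file"} mappings. The lookup it describes can be sketched as below. The helper name `resolve_sentiment_file` and its error handling are assumptions made for illustration; the gem's own lookup code in textmood.rb is not visible in this part of the diff.

```ruby
require "json"

# Hypothetical helper, NOT TextMood's actual implementation.
# Takes a language tag and the JSON text of an alias file, and returns
# the sentiment file path mapped to that tag.
def resolve_sentiment_file(lang, alias_json)
  aliases = JSON.parse(alias_json)
  unless aliases.is_a?(Hash)
    raise ArgumentError, "alias file must contain a JSON object"
  end

  # Raising on an unknown tag is an assumption of this sketch.
  aliases.fetch(lang) do
    raise ArgumentError, "no sentiment file mapped for language tag #{lang.inspect}"
  end
end
```

In practice the JSON text would come from something like `File.read("my-custom-languages.json")`, the file name used in the README example.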